<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0YB4EN/SN_web_lightmode.png?1677516598099" width="300" alt="cognitiveclass.ai logo"  />
</center>

# *Investigation of BTC/BUSD cryptocurrency using ADOSC, NATR, TRANGE indicators, and other cryptocurrencies.*

# Lab 3. Exploratory Data Analysis

Estimated time needed: **30** minutes

## Objectives

After completing this lab you will be able to:

*   Explore features or characteristics of your main cryptocurrency
*   Explore features and characteristics of your main cryptocurrency based on other cryptocurrencies


<h2>Table of Contents</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><u>Import Data from Module 2</u></li>
    <li><u>Analyzing Individual Feature Patterns using Visualization</u></li>
    <li><u>Descriptive Statistical Analysis</u></li>
    <li><u>Basics of Grouping</u></li>
    <li><u>Correlation and Causation</u></li>
    <li><u>ANOVA</u></li>
    <li><u>Durbin-Watson Test</u></li>
    <li><u>Granger Causality Test</u></li>
</ol>

</div>

<hr>


#### ***Dataset description***

The dataset used in this lab contains time-series data on various attributes related to Bitcoin (BTC) and other cryptocurrencies, aggregated at 1-minute intervals. The dataset index represents the time period for which the data is reported (1 minute).

<hr>

**Attributes:**

* ***General:***
    * `open` - the opening price of a **BTC** during a specific time period.
    * `high` - the highest price of a **BTC** during a specific time period.
    * `low` - the lowest price of a **BTC** during a specific time period.
    * `close` - the closing price of a **BTC** during a specific time period.
    * `rec_count` - the number of records or data points in the dataset for a given time period.
    * `volume` - the total amount of trading activity (buying and selling) for a **BTC** during a specific time period.
    * `avg_price` - the average price of a **BTC** during a specific time period.


* ***Indicators***
    * `ADOSC` - an indicator used in technical analysis to measure the momentum of buying and selling pressure for ***Bitcoin***.
    * `NATR` - an indicator used in technical analysis to measure the volatility of ***Bitcoin***.
    * `TRANGE` - an indicator used in technical analysis to measure the range of prices (from high to low) for ***Bitcoin*** during a specific time period.


* ***Other cryptocurrencies:***
    * `ape_avg_price` - the average price of ***APE*** during a specific time period.
    * `bnb_avg_price` - the average price of ***BNB*** during a specific time period.
    * `doge_avg_price` - the average price of ***DOGE coin*** during a specific time period.
    * `eth_avg_price` - the average price of ***Ethereum*** during a specific time period.
    * `xrp_avg_price` - the average price of ***XRP*** during a specific time period.
    * `matic_avg_price` - the average price of ***MATIC*** during a specific time period.

<hr>

*The indicators `ADOSC`, `NATR`, and `TRANGE` are used in technical analysis to provide insights into the momentum, volatility, and price ranges of financial instruments or assets. The other attributes represent the average prices of different cryptocurrencies during a specific time period.*
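To make two of these indicator definitions concrete, here is a minimal sketch of how `TRANGE` and `NATR` are commonly defined. It rests on assumptions: the candle values are invented for illustration, the period of 3 is arbitrary, and the average true range is computed with a simple rolling mean rather than the Wilder smoothing that libraries such as TA-Lib use, so the numbers will not match the dataset's columns exactly.

```python
import pandas as pd

# Hypothetical 1-minute candles, invented purely for illustration.
candles = pd.DataFrame({
    "high":  [101.0, 103.0, 102.5, 104.0, 105.0],
    "low":   [99.0, 100.5, 100.0, 101.5, 103.0],
    "close": [100.0, 102.0, 101.0, 103.5, 104.0],
})

prev_close = candles["close"].shift(1)

# True range: the largest of three candidate ranges for each candle.
# The first candle has no previous close; max() skips the resulting NaN
# values, so the first true range falls back to high - low.
true_range = pd.concat(
    [
        candles["high"] - candles["low"],
        (candles["high"] - prev_close).abs(),
        (candles["low"] - prev_close).abs(),
    ],
    axis=1,
).max(axis=1)

# Average true range as a simple rolling mean (an assumption of this sketch).
period = 3
atr = true_range.rolling(period).mean()

# Normalized ATR: ATR expressed as a percentage of the closing price.
natr = 100 * atr / candles["close"]

print(pd.DataFrame({"TRANGE": true_range, "NATR": natr}))
```

Because `NATR` is a percentage of the price, it lets you compare volatility across assets with very different price levels, which `TRANGE` alone does not.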

<hr>

### **What are the main characteristics that have the most impact on the average price of cryptocurrency?**


## 1. Import Data from Module 2


<h4>Setup</h4>


In [ ]:
!pip install --upgrade pandas
!pip install matplotlib
!pip install scipy
!pip install seaborn
!pip install mplfinance
!pip install statsmodels
!pip install xyzservices
!pip install jupyter_bokeh
!pip install astropy

Import libraries:


In [ ]:
import warnings
import itertools

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import mplfinance as fplt
import statsmodels.api as sm

from astropy.visualization import astropy_mpl_style  # Matplotlib style used for the heatmaps in Section 4
from scipy import stats
from statsmodels.stats.stattools import durbin_watson
from statsmodels.tsa.stattools import grangercausalitytests
from typing import List
%matplotlib inline 

Specify the link to the dataset that we obtained in the previous lab.

In [ ]:
data_path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0YB4EN/BTCBUSD_1min.csv'

Read the dataframe from the previous lab.

In [ ]:
df = pd.read_csv(data_path, index_col='ts')
# converting index to datetime
df.index = pd.to_datetime(df.index)
df.head()

Initialize the list of other currencies that we have in our dataframe.

In [ ]:
currencies = ['ape', 'bnb','doge', 'eth', 'xrp', 'matic']

We will use this list later on.

## 2. Analyzing Individual Feature Patterns Using Visualization


For visualization, we will use 'Matplotlib', 'Seaborn', 'mplfinance', but you may use other tools. 

<h4>How to choose the right visualization method?</h4>
<p>When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.</p>


In [ ]:
# list the data types for each column
print(df.dtypes)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #1:</b>

  <b>What is the data type of the column 'open'?</b>
    
</div>

In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df['open'].dtypes
```

</details>


For example, we can calculate the correlation between variables of type "int64" or "float64" using the method "corr" (for more information, refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html; correlation is explained later in this notebook, so do not worry if it is not clear at the moment):


In [ ]:
corr = df.corr()
corr

To improve the presentation of the correlation table, we can use <code>heatmap</code> from 'Seaborn'.

The diagonal elements are always one. We will study correlation, more precisely Pearson correlation, in depth later in the notebook.


In [ ]:
fig, ax = plt.subplots(figsize=(20,20))
sns.heatmap(corr, cmap='RdBu', vmin=-1, vmax=1, annot=True, ax=ax)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #2:</b>

<p>Find the correlation between the following columns: open, volume, rec_count, and avg_price.</p>
<p>Hint: if you would like to select those columns, use the following syntax: df[['open', 'volume', 'rec_count', 'avg_price']]</p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df[['open', 'volume', 'rec_count', 'avg_price']].corr()
```

</details>


Now let's calculate correlation only between our main currency and other cryptocurrencies.

In [ ]:
crypto_corr = df[['avg_price'] + [f'{name}_avg_price' for name in currencies]].corr()
crypto_corr

Now let's build a heatmap for the above correlation.

In [ ]:
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(crypto_corr, cmap='RdBu', vmin=-1, vmax=1, annot=True, ax=ax)

### Continuous Numerical Variables:

Continuous numerical variables are variables that may contain any value within some range. They can be of type `int64` or `float64`. A great way to visualize these variables is by using scatterplots with fitted lines.

In order to start understanding the (linear) relationship between an individual variable and the price, we can use `regplot` which plots the scatterplot plus the fitted regression line for the data.
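The fitted line that `regplot` draws is an ordinary least-squares line. As a rough sketch of what that fit amounts to, the example below recovers a slope and intercept with `np.polyfit`. The data are synthetic (a known line plus random noise), so the names `x`, `y` and the true coefficients 3 and 5 are assumptions made for illustration, not values from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a straight line y = 3x + 5 with Gaussian noise added.
x = np.linspace(0, 10, 200)
y = 3.0 * x + 5.0 + rng.normal(scale=1.0, size=x.size)

# Least-squares fit of a degree-1 polynomial, i.e. a straight line.
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation: how tightly the points cluster around a line.
r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")
```

The slope tells you how steep the relationship is, while the correlation tells you how well the points follow the line; a steep line with widely scattered points is still a poor predictor.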


Let's see several examples of different linear relationships:


#### Linear Relationships


Let's find the scatterplot of 'avg_price' and 'eth_avg_price'.


In [ ]:
# eth_avg_price plotted against avg_price, with a fitted regression line
sns.regplot(x='avg_price', y='eth_avg_price', data=df)
plt.ylim(0,)

<p>As the eth_avg_price goes up, the avg_price goes up as well: this indicates a positive direct correlation between these two variables. eth_avg_price seems like a pretty good predictor of avg_price since the dots fit the line really well.</p>


We can examine the correlation between 'avg_price' and 'eth_avg_price' and see that it's approximately 0.90.


In [ ]:
df[['avg_price', 'eth_avg_price']].corr()

matic_avg_price is a potential predictor of avg_price. Let's find the scatterplot of 'avg_price' and 'matic_avg_price'.


In [ ]:
sns.regplot(x='avg_price', y='matic_avg_price', data=df)

<p>As matic_avg_price goes up, the avg_price goes up: this indicates a positive relationship between these two variables. matic_avg_price could potentially be a predictor of avg_price.</p> 

<p>It is good to understand that sometimes we may have an inverse/negative relationship between these two variables. Example: matic_avg_price going up, the avg_price going down. Such relationship could be a potential predictor as well.</p>



We can examine the correlation between 'avg_price' and 'matic_avg_price' and see it's approximately 0.80.


In [ ]:
df[['avg_price', 'matic_avg_price']].corr()

Let's see if "ape_avg_price" is a predictor variable of "avg_price".


In [ ]:
# ape_avg_price as a potential predictor variable of avg_price
sns.regplot(x='ape_avg_price', y='avg_price', data=df)

<p>ape_avg_price does not seem like a good predictor of the avg_price at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore, it's not a reliable variable.</p>


We can examine the correlation between 'ape_avg_price' and 'avg_price' and see it's approximately -0.032.


In [ ]:
df[['avg_price', 'ape_avg_price']].corr()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question  3 a):</b>

<p>Find the correlation  between 'xrp_avg_price' and 'bnb_avg_price'.</p>
<p>Hint: if you would like to select those columns, use the following syntax: df[['xrp_avg_price', 'bnb_avg_price']].  </p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python

# The correlation is the off-diagonal element of the resulting table.

df[['xrp_avg_price', 'bnb_avg_price']].corr()

```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question  3 b):</b>

<p>Given the correlation results between 'xrp_avg_price' and 'bnb_avg_price', do you expect a linear relationship?</p> 
<p>Verify your results using the function "regplot()".</p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python

# If the correlation between 'xrp_avg_price' and 'bnb_avg_price' is weak, a linear regression will not work well. We can use "regplot" to check this visually.

#Code: 
sns.regplot(x='xrp_avg_price',  y='bnb_avg_price', data=df)

```

</details>


Now let's visualize values of our main cryptocurrency.

In [ ]:
df[['open', 'high', 'low', 'close']].plot()

As you can see, it is not convenient to draw any conclusions from the above visualization, so let's draw candlesticks using <code>fplt</code>.

We are going to use slice of data from our dataframe. Note: you can customize style of plot, just pick any value for style from <code>['binance', 'blueskies', 'brasil', 'charles', 'checkers', 'classic', 'default', 'ibd', 'kenan', 'mike', 'nightclouds', 'sas', 'starsandstripes', 'yahoo']</code>

In [ ]:
fplt.plot(
            df.iloc[:10, :],
            type='candle',
            style='charles',
            title='BTC',
            ylabel='Price (BUSD)'
        )

Now let's plot candlesticks with volume; expanding our slice from 10 to 20 values.

In [ ]:
fplt.plot(
            df.iloc[:20, :],
            type='candle',
            style='classic',
            title='BTC',
            ylabel='Price (BUSD)',
            volume=True,
            ylabel_lower='Volume',
        )

You can also use plotly or bqplot to display even more data, but we won't cover them here, because they are a topic for another course.

Great! Now we know how to display a lot of candlesticks in one plot and we can save the plot as image.

#### Categorical Variables

Let's create bins for all currencies inside our dataframe.


In [ ]:
group_names = ['low', 'medium-low', 'medium', 'medium-high', 'high']

Define a function that converts a column of values to a column of bins.

In [ ]:
def to_categorical(column: pd.Series, labels: List[str]) -> pd.Series:
    bins = np.linspace(min(column), max(column), len(labels) + 1)
    res = pd.cut(column, bins, labels=labels, include_lowest=True)
    return res
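
To see what this kind of equal-width binning produces, here is a small self-contained sketch of the same two steps (`np.linspace` for the bin edges, `pd.cut` for the assignment). The five prices and the three labels are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration.
prices = pd.Series([10.0, 12.0, 15.0, 18.0, 20.0])
labels = ["low", "medium", "high"]

# Equal-width bin edges from min to max: 3 labels need 4 edges.
edges = np.linspace(prices.min(), prices.max(), len(labels) + 1)

# include_lowest=True keeps the minimum value inside the first bin;
# without it, the left edge would be excluded and the minimum would be NaN.
binned = pd.cut(prices, edges, labels=labels, include_lowest=True)

print(edges)
print(binned.tolist())
```

Equal-width bins can end up with very different numbers of rows per bin. If balanced groups matter, `pd.qcut` (equal-count bins) is a possible alternative, though it is not what this lab uses.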

Create a 'category' column from the 'avg_price' of our main cryptocurrency.

In [ ]:
df['category'] = to_categorical(df['avg_price'], group_names)

Now let's use <code>unique()</code> command to see categories we have created.

In [ ]:
df['category'].unique()

Now let's do the same with the other cryptocurrencies.

In [ ]:
for name in currencies:
    category_column_name = f'{name}_category'
    df[category_column_name] = to_categorical(df[f'{name}_avg_price'], group_names)

In [ ]:
df.head()

## 3. Descriptive Statistical Analysis


<p>Let's first take a look at the variables by utilizing a description method.</p>

<p>The <b>describe</b> function automatically computes basic statistics for all continuous variables. We do not have NaN values in our dataframe, but it is worth mentioning that any NaN values are automatically skipped in these statistics.</p>

This will show:

<ul>
    <li>the count of that variable</li>
    <li>the mean</li>
    <li>the standard deviation (std)</li> 
    <li>the minimum value</li>
    <li>the quartiles (25%, 50% and 75%), from which the IQR (interquartile range) can be computed</li>
    <li>the maximum value</li>
</ul>


We can apply the method "describe" as follows:


In [ ]:
df.describe()

The default setting of "describe" skips variables of type category. We can apply the method "describe" on the variables of type 'category' as follows:


In [ ]:
df.describe(include=['category'])

#### Value Counts


<p>Value counts is a good way of understanding how many units of each characteristic/variable we have. We can apply the "value_counts" method on the column 'category'. Here we call it on a pandas Series, so we use one bracket <code>df['category']</code>, not two brackets <code>df[['category']]</code>, which would select a dataframe.</p>


In [ ]:
df['category'].value_counts()

We can convert the series to a dataframe as follows:


In [ ]:
category_counts = df['category'].value_counts().to_frame()

Now let's name the counts column of 'category_counts' 'category_value_counts' and name the index 'categories'.


In [ ]:
# assign the name directly: the default column name produced by value_counts() differs between pandas versions
category_counts.columns = ['category_value_counts']
category_counts.index.name = 'categories'
category_counts

Now let's add other currencies into category_counts dataframe:


In [ ]:
all_category_counts = category_counts.copy()

for name in currencies:
    category_column_name = f'{name}_category'
    curr_counts = df[category_column_name].value_counts().to_frame()
    # assign the name directly: the default column name differs between pandas versions
    curr_counts.columns = [f'{category_column_name}_value_counts']
    curr_counts.index.name = 'categories'
    # add new column to all_category_counts
    all_category_counts = all_category_counts.merge(curr_counts, on='categories')
    

In [ ]:
all_category_counts

## 4. Basics of Grouping


<p>The "groupby" method groups data by different categories. The data is grouped based on one or several variables, and analysis is performed on the individual groups.</p>

<p>For example, let's group by the variable "category".</p>


In [ ]:
df['category'].unique()

<p>If we want to know the average price within each category, we can group by 'category' and then average the values.</p>

<p>We can select the columns 'category', and 'avg_price', then assign it to the variable "df_group_one".</p>


In [ ]:
df_group_one = df[['avg_price','category']]

We can then calculate the average price for each of the different categories of data.


In [ ]:
# grouping results
df_group_one = df_group_one.groupby(['category'],as_index=False).mean()
df_group_one

<p>You can also group by multiple variables. For example, let's group by both 'category' and 'ape_category'. This groups the dataframe by the unique combination of 'category' and 'ape_category'. We can store the results in the variable 'grouped_test1'.</p>


In [ ]:
# grouping results
df_gptest = df[['avg_price', 'category','ape_category']]
grouped_test1 = df_gptest.groupby(['category', 'ape_category'],as_index=False).mean()
grouped_test1

<p>This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the columns and another along the rows. We can convert the dataframe to a pivot table using the method "pivot".</p>

<p>In this case, we will leave the 'category' variable as the rows of the table, and pivot 'ape_category' to become the columns of the table; the cells hold the mean avg_price:</p>
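
The group-then-pivot step can be sketched on a tiny table. The five rows below are invented for illustration, and passing `values='avg_price'` is a choice made for this sketch: it gives plain column labels, whereas calling `pivot` without `values` (as the lab does) produces two-level column labels.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "category":     ["low", "low", "high", "high", "high"],
    "ape_category": ["low", "high", "low", "high", "high"],
    "avg_price":    [100.0, 110.0, 200.0, 210.0, 230.0],
})

# One row per (category, ape_category) pair, holding the mean avg_price.
grouped = toy.groupby(["category", "ape_category"], as_index=False).mean()

# Rows: category. Columns: ape_category. Cells: mean avg_price.
pivoted = grouped.pivot(index="category", columns="ape_category", values="avg_price")

print(pivoted)
```

The two rows with `category == 'high'` and `ape_category == 'high'` collapse into a single cell containing their mean, which is exactly what makes the pivot table easier to read than the long grouped form.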


In [ ]:
grouped_pivot = grouped_test1.pivot(index='category',columns='ape_category')
grouped_pivot

<p>Often, we won't have data for some of the pivot cells. We can fill these missing cells with the value 0, but any other value could potentially be used as well. It should be mentioned that missing data is quite a complex subject and is an entire course on its own.</p>


In [ ]:
grouped_pivot = grouped_pivot.fillna(0) #fill missing values with 0
grouped_pivot

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question 4:</b>
    
<p>Use the "groupby" function to find the average price for each "category".</p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
# grouping results

<details><summary>Click here for the solution</summary>

```python
# grouping results
df_gptest2 = df[['category','avg_price']]
grouped_test_bodystyle = df_gptest2.groupby(['category'],as_index= False).mean()
grouped_test_bodystyle

```

</details>


Let's use a heat map to visualize how avg_price relates to 'category' and 'ape_category'.


In [ ]:
# switch to the astropy Matplotlib style (imported at the top of the notebook), with the grid turned off
astropy_mpl_style['axes.grid'] = False
plt.style.use(astropy_mpl_style)

In [ ]:
#use the grouped
plt.pcolor(grouped_pivot, cmap='RdBu')
plt.colorbar()
plt.show()

<p>The heatmap plots the target variable (avg_price) proportional to colour with respect to the variables 'category' and 'ape_category' on the vertical and horizontal axis, respectively. This allows us to visualize how the price is related to 'category' and 'ape_category'.</p>

<p>The default labels convey no useful information to us. Let's change that:</p>


In [ ]:
fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')

#label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index

#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

#rotate label if too long
plt.xticks(rotation=90)

fig.colorbar(im)
plt.show()

<p>Visualization is very important in data science, and Python visualization packages provide great freedom. We will go more in-depth in a separate Python visualizations course.</p>

<p>The main question we want to answer in this module is, "Which other cryptocurrencies have the greatest impact on the average price of our cryptocurrency?".</p>

<p>To get a better measure of the important characteristics, we look at the correlation of our currency with other cryptocurrencies. In other words: how is the average price dependent on other average price variables?</p>


## 5. Correlation and Causation


<p><b>Correlation</b>: a measure of the extent of interdependence between variables.</p>

<p><b>Causation</b>: the relationship between cause and effect between two variables.</p>

<p>It is important to know the difference between these two. Correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.</p>


<p><b>Pearson Correlation</b></p>
<p>The Pearson Correlation measures the linear dependence between two variables X and Y.</p>
<p>The resulting coefficient is a value between -1 and 1 inclusive, where:</p>
<ul>
    <li><b>1</b>: Perfect positive linear correlation.</li>
    <li><b>0</b>: No linear correlation; the two variables show no linear relationship (a nonlinear one may still exist).</li>
    <li><b>-1</b>: Perfect negative linear correlation.</li>
</ul>
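
As a worked illustration of the coefficient, the sketch below computes Pearson's r directly from its definition (the sum of products of deviations, divided by the square root of the product of the summed squared deviations) and compares it with NumPy's built-in result. The two short arrays are invented for illustration.

```python
import numpy as np

# Toy data, invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations from the respective means.
dx = x - x.mean()
dy = y - y.mean()

# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity from NumPy's correlation matrix (off-diagonal element).
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

Because both the numerator and the denominator scale with the units of the variables, r is unit-free, which is why prices of very different magnitude can be compared on the same -1 to 1 scale.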


<p>Pearson Correlation is the default method of the function "corr". Like before, we can calculate the Pearson Correlation of the 'int64' or 'float64' variables.</p>


In [ ]:
# numeric_only=True skips the 'category' columns we added in Section 2
df.corr(numeric_only=True)

Sometimes we would like to know the significance of the correlation estimate.


<b>P-value</b>

<p>What is this P-value? The P-value is the probability of observing a correlation at least as strong as the one in our sample if the true correlation between the two variables were zero. A small P-value therefore counts as evidence against "no correlation". Normally, we choose a significance level of 0.05: if the P-value falls below it, we call the correlation statistically significant.</p>

By convention:

<ul>
    <li>the p-value is $<$ 0.001: there is strong evidence that the correlation is significant.</li>
    <li>the p-value is $<$ 0.05: there is moderate evidence that the correlation is significant.</li>
    <li>the p-value is $<$ 0.1: there is weak evidence that the correlation is significant.</li>
    <li>the p-value is $>$ 0.1: there is no evidence that the correlation is significant.</li>
</ul>
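
One way to build intuition for the p-value is a permutation test: shuffle one variable many times, which destroys any real relationship, and count how often the shuffled data produce a correlation at least as strong as the observed one. The sketch below does this on synthetic data and compares the result with the analytic p-value from `scipy.stats.pearsonr`. The data, the seed and the number of permutations are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic data with a genuine linear relationship plus noise.
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# Observed correlation and its analytic two-sided p-value.
r_obs, p_analytic = stats.pearsonr(x, y)

# Permutation test: shuffling y simulates the "no correlation" scenario.
n_perm = 2000
perm_r = np.array(
    [np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(n_perm)]
)

# Share of shuffles at least as extreme as the observation
# (the +1 terms keep the estimate strictly above zero).
p_perm = (np.sum(np.abs(perm_r) >= abs(r_obs)) + 1) / (n_perm + 1)

print(f"r={r_obs:.3f}, analytic p={p_analytic:.2e}, permutation p={p_perm:.4f}")
```

Both p-values answer the same question: how surprising would this correlation be if the variables were unrelated? Note that with hundreds of thousands of 1-minute rows, even a tiny correlation can reach a very small p-value, so always read the p-value together with the coefficient.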


We can obtain this information using  "stats" module in the "scipy"  library.


### APE average price vs BTC average price


Let's calculate the  Pearson Correlation Coefficient and P-value of 'ape_avg_price' and 'avg_price'.


In [ ]:
pearson_coef, p_value = stats.pearsonr(df['ape_avg_price'], df['avg_price'])
print(f'The Pearson Correlation Coefficient is {pearson_coef:.5f} with a P-value of P = {p_value}')  

<h4>Conclusion:</h4>
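<p>If the printed p-value is $<$ 0.001, the correlation between ape_avg_price and avg_price is statistically significant. Statistical significance says nothing about strength, though: judge the strength of the linear relationship from the printed coefficient itself, where values near 0 indicate a weak linear relationship.</p>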


Similarly, let's calculate the Pearson Correlation Coefficient and P-value between our main cryptocurrency and the other cryptocurrencies.

In [ ]:
for name in currencies:
    pearson_coef, p_value = stats.pearsonr(df['avg_price'], df[f'{name}_avg_price'])
    print(f'The Pearson Correlation Coefficient between our main cryptocurrency and {name} is {pearson_coef:.6f} with a P-value of P = {p_value:.5f}') 

From the rules and conclusions represented above, you can easily draw your own conclusions for all values we calculated. 

## 6. ANOVA


### ANOVA: Analysis of Variance

The Analysis of Variance  (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:

**F-test score**: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.

**P-value**:  P-value tells how statistically significant our calculated score value is.

If our price variable is strongly correlated with the variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.
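
To show where the F-test score comes from, the sketch below computes it by hand as the ratio of between-group variance to within-group variance and checks the result against `scipy.stats.f_oneway`. The three small groups are invented for illustration.

```python
import numpy as np
from scipy import stats

# Three toy groups, invented for illustration.
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([5.0, 6.0, 7.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of values around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = (between-group variance) / (within-group variance)
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy, p_scipy)
```

A large F means the group means are far apart relative to the scatter inside each group, which is what a small p-value then confirms.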


### Category


<p>Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.</p>

In [ ]:
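# group by a single column name (not a one-element list) so that get_group('high') accepts a plain label
grouped_test2 = df_gptest[['avg_price', 'category']].groupby('category')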
grouped_test2.head(2)

We can obtain the values of a single group using the method "get_group".


In [ ]:
grouped_test2.get_group('high')['avg_price']

We can use the function 'f_oneway' in the module 'stats' to obtain the <b>F-test score</b> and <b>P-value</b>.


In [ ]:
# ANOVA across all five groups, in the order defined in group_names
f_val, p_val = stats.f_oneway(
    *(grouped_test2.get_group(name)['avg_price'] for name in group_names)
)

print("ANOVA results: F=", f_val, ", P =", p_val)

A large F-test score together with a P-value of almost 0 indicates that the group means differ significantly. This is expected here, because the groups were created by binning avg_price itself. But does this mean that every pair of the five groups differs this strongly?

Let's examine them separately.


#### medium-low and medium


In [ ]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('medium-low')['avg_price'], grouped_test2.get_group('medium')['avg_price'])  
 
print( "ANOVA results: F=", f_val, ", P =", p_val )

Let's examine the other groups.


#### medium and medium-high


In [ ]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('medium')['avg_price'], grouped_test2.get_group('medium-high')['avg_price'])  
   
print( "ANOVA results: F=", f_val, ", P =", p_val)   

Let's create a for loop and calculate ANOVA for every pair

In [ ]:
# make sure you created group_names in previous steps
names = group_names.copy()
# generating column and row names for our future dataframe
cols = [f"{name}_x" for name in names]
idxs = [f"{name}_y" for name in names]

anova_f_df = pd.DataFrame(columns=cols, index=idxs)
anova_p_df = pd.DataFrame(columns=cols, index=idxs)

for (curr1, curr2) in itertools.permutations(names, 2):
    f_val, p_val = stats.f_oneway(grouped_test2.get_group(curr1)['avg_price'], grouped_test2.get_group(curr2)['avg_price'])  
    anova_f_df.loc[f"{curr2}_y", f"{curr1}_x"] = f_val
    anova_p_df.loc[f"{curr2}_y", f"{curr1}_x"] = p_val

Now let's display f and p value dataframes

In [ ]:
anova_f_df

In [ ]:
anova_p_df

The resulting dataframes have NaN values: the diagonal elements are always going to be NaN (we are not supposed to calculate them). So let's replace them with a space character.

In [ ]:
np.fill_diagonal(anova_f_df.values, ' ')
np.fill_diagonal(anova_p_df.values, ' ')

Now let's display the F-value and p-value dataframes one more time.

In [ ]:
anova_f_df  

In [ ]:
anova_p_df  

## 7. Durbin-Watson Test

What is the Durbin-Watson test?

In regression analysis, the Durbin-Watson (DW) test checks the residuals for first-order autocorrelation (serial correlation), that is, for dependence between residuals at consecutive time points. The autocorrelation coefficient ρ varies from -1 (negative autocorrelation) to 1 (positive autocorrelation). The DW statistic itself ranges from 0 to 4: values near 2 indicate no autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.

The Durbin-Watson test analyzes the following hypotheses:

*   Null hypothesis (H0): residuals from the regression are not autocorrelated (autocorrelation coefficient, ρ = 0)
*   Alternative hypothesis (Ha): residuals from the regression are autocorrelated (autocorrelation coefficient, ρ > 0)
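
The statistic itself is simple: the sum of squared differences between consecutive residuals divided by the sum of squared residuals, which is approximately 2(1 - ρ). The sketch below computes it by hand for two synthetic series. The helper name `durbin_watson_manual` and both series are hypothetical, introduced only for illustration; the lab itself uses `durbin_watson` from statsmodels.

```python
import numpy as np

def durbin_watson_manual(residuals: np.ndarray) -> float:
    """Hypothetical helper: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    diffs = np.diff(residuals)
    return float((diffs ** 2).sum() / (residuals ** 2).sum())

rng = np.random.default_rng(1)

# Independent noise: no autocorrelation, so DW should be close to 2.
white_noise = rng.normal(size=5000)

# Random walk: each value is the previous one plus noise, which gives
# strong positive autocorrelation and a DW close to 0.
random_walk = np.cumsum(rng.normal(size=5000))

dw_noise = durbin_watson_manual(white_noise)
dw_walk = durbin_watson_manual(random_walk)

print(f"white noise: {dw_noise:.3f}, random walk: {dw_walk:.5f}")
```

Price levels sampled every minute behave much like the random walk here, which is why regressions on raw prices tend to give DW values near 0.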

We will use durbin_watson for the Durbin-Watson test and OLS to get the residuals, both from the "statsmodels" library.

Let's implement a function that creates regression models

In [ ]:
def get_reg(x: pd.DataFrame, y: pd.Series):
    # to get intercept
    X = sm.add_constant(x)
    # fit the regression model
    reg = sm.OLS(y, X).fit()
    return reg

In [ ]:
warnings.filterwarnings("ignore", category=FutureWarning)
X = df[["ape_avg_price"]] # independent
y = df["avg_price"] # dependent
reg = get_reg(X, y)

In [ ]:
durbin_watson(resids=np.array(reg.resid))

The Durbin-Watson test statistic is 0.0014, which is very close to 0 and far below 2, the value expected when there is no autocorrelation. This suggests the presence of strong positive autocorrelation in the residuals of our regression model.

Now that we know how to calculate the Durbin-Watson statistic, we can evaluate it between the main currency and the other currencies.

In [ ]:
# names of all cryptocurrencies other than the main one
names = currencies.copy()

In [ ]:
for name in names:
    X = df[["avg_price"]] # independent
    y = df[f"{name}_avg_price"] # dependent
    reg = get_reg(X, y)
    dw = durbin_watson(resids=np.array(reg.resid))
    print(f'Durbin-Watson between main currency and {name}: {dw:.5f}')

Let's calculate Durbin-Watson for all pairs of the other cryptocurrencies.

In [ ]:
# generating column names (independent variables) for our new dataframe
cols = [f"{name}_ind" for name in names]
# generating row names (dependent variables) for our new dataframe
idxs = [f"{name}_dep" for name in names]

# empty dataframe creation
dw_df = pd.DataFrame(columns=cols, index=idxs)

# we use itertools.permutations to generate pairs of different cryptocurrencies
# refer to https://docs.python.org/3/library/itertools.html#itertools.permutations for more details
for (curr1, curr2) in itertools.permutations(names, 2):
    X = df[[f"{curr1}_avg_price"]] # independent
    y = df[f"{curr2}_avg_price"] # dependent
    reg = get_reg(X, y)
    dw = durbin_watson(resids=np.array(reg.resid))
    dw_df.loc[f"{curr2}_dep", f"{curr1}_ind"] = dw
dw_df

Replace NaN values with empty values.

In [ ]:
np.fill_diagonal(dw_df.values, ' ')
dw_df

## 8. Granger Causality Test

<b>Granger causality test</b> is a statistical test that helps us determine whether past (lagged) values of one time series x improve the prediction of another time series y beyond what the past values of y alone provide. The null hypothesis is that x does not Granger-cause y. We reject the null hypothesis if the p-value is smaller than the desired significance level.
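
The idea behind the test can be sketched as a comparison of two regressions: one that predicts y from its own previous value only, and one that also uses the previous value of x. If adding x reduces the residual error by more than chance would explain, x is said to Granger-cause y. The example below is a simplified sketch under stated assumptions: it uses synthetic data, a single lag, and a plain F-test (comparable in spirit to the `ssr_ftest` entry that statsmodels reports, not the `ssr_chi2test` that this lab reads). The helper names `rss` and `lag1_granger_p` are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 500

# Synthetic series: by construction, y depends on the previous value of x.
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal(scale=0.5)

def rss(design: np.ndarray, target: np.ndarray) -> float:
    """Residual sum of squares of a least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(resid @ resid)

def lag1_granger_p(cause: np.ndarray, effect: np.ndarray) -> float:
    """P-value of an F-test for 'cause does not Granger-cause effect' at lag 1."""
    target = effect[1:]
    const = np.ones(target.size)
    # Restricted model: the effect's own past only.
    restricted = np.column_stack([const, effect[:-1]])
    # Unrestricted model: additionally uses the past of the candidate cause.
    unrestricted = np.column_stack([const, effect[:-1], cause[:-1]])
    rss_r, rss_u = rss(restricted, target), rss(unrestricted, target)
    dof = target.size - unrestricted.shape[1]
    f_stat = (rss_r - rss_u) / (rss_u / dof)  # one restriction tested
    return float(stats.f.sf(f_stat, 1, dof))

p_x_causes_y = lag1_granger_p(x, y)
p_y_causes_x = lag1_granger_p(y, x)
print(f"x -> y: p={p_x_causes_y:.3g}, y -> x: p={p_y_causes_x:.3g}")
```

Here the p-value for "x does not Granger-cause y" is tiny because the dependence was built into the data, while the reverse direction has no built-in dependence. The test is directional, so each pair of series yields two separate p-values.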

If needed, re-run the import cell above to reassign `plt` to Matplotlib's pyplot (we used another plotting library earlier, and it may break the visualizations below).

In [ ]:
left_ax, right_ax = 'avg_price', 'xrp_avg_price'
df_to_test = df[[left_ax, right_ax]]
# assign variables for plot
t = df.index # ts
data1 = df_to_test[left_ax]
data2 = df_to_test[right_ax]

Now let's draw plot based on variables we created in previous cell. 

In [ ]:
fig, ax1 = plt.subplots(figsize = (16,9))

color = 'tab:red'
ax1.set_xlabel('Time', fontsize=14)
ax1.set_ylabel('Main cryptocurrency (BTC)', color=color, fontsize=14)
ax1.plot(t, data1, color=color)
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()

color = 'tab:blue'
ax2.set_ylabel('XRP', color=color, fontsize=14)
ax2.plot(t, data2, color=color)
ax2.tick_params(axis='y', labelcolor=color)

fig.tight_layout()
plt.show()

In [ ]:
def grangers_causation_matrix(data, maxlag, variables, test='ssr_chi2test', verbose=False):    
    """Check Granger Causality of all possible combinations of the Time series.
    The rows are the response variables, columns are predictors. The values in the table 
    are the P-Values. A P-Value smaller than the significance level (0.05) implies that
    the Null Hypothesis (the coefficients of the corresponding past values are 
    zero, that is, X does not cause Y) can be rejected.

    data      : pandas dataframe containing the time series variables
    variables : list containing names of the time series variables.
    """
    df = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables)
    for c in df.columns:
        for r in df.index:
            test_result = grangercausalitytests(data[[r, c]], maxlag=maxlag, verbose=False)
            p_values = [round(test_result[i+1][0][test][1],4) for i in range(maxlag)]
            if verbose: print(f'Y = {r}, X = {c}, P Values = {p_values}')
            min_p_value = np.min(p_values)
            df.loc[r, c] = min_p_value
    df.columns = [var + '_x' for var in variables]
    df.index = [var + '_y' for var in variables]
    return df

In [ ]:
grangers_causation_matrix(df_to_test, 1, variables=df_to_test.columns)

How to interpret the p-values?

Each cell holds the p-value for the null hypothesis that the column variable (`_x`) does not Granger-cause the row variable (`_y`); a p-value below 0.05 lets us reject that hypothesis. In our case, the value of 0.2117 in the first row and second column is above 0.05, so we cannot conclude that the past values of xrp_avg_price help predict the future values of avg_price once the past values of avg_price are accounted for. On the other hand, the value of 0.0494 in the second row and first column is just below 0.05, which suggests that the past values of avg_price do help predict the future values of xrp_avg_price beyond what the past values of xrp_avg_price already explain, although the evidence is borderline.
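
The reading rule can be sketched in a few lines. The example below rebuilds a 2x2 matrix from the two off-diagonal p-values quoted in this lab (0.2117 and 0.0494) and compares each one against the 0.05 significance level. The diagonal is set to `NaN` here on the assumption that a variable tested against itself is not meaningful; the names `ALPHA`, `p_matrix` and `summary` are illustrative only.

```python
import numpy as np
import pandas as pd

ALPHA = 0.05  # significance level used throughout this lab

# rows are responses (_y), columns are predictors (_x)
p_matrix = pd.DataFrame(
    [[np.nan, 0.2117],
     [0.0494, np.nan]],
    index=["avg_price_y", "xrp_avg_price_y"],
    columns=["avg_price_x", "xrp_avg_price_x"],
)

findings = []
for response in p_matrix.index:
    for predictor in p_matrix.columns:
        p = p_matrix.loc[response, predictor]
        if pd.isna(p):
            continue  # skip the diagonal
        findings.append({
            "predictor": predictor[:-2],   # drop the "_x" suffix
            "response": response[:-2],     # drop the "_y" suffix
            "p_value": float(p),
            "reject_null": bool(p < ALPHA),
        })

summary = pd.DataFrame(findings)
print(summary)
```

A row with `reject_null` equal to `True` means the predictor's past values add predictive information about the response at the chosen significance level.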

Let's run the Granger causality test between our cryptocurrency and each of the others.

In [ ]:
names = currencies.copy()

In [ ]:
for name in names:
    tmp_df = df[["avg_price", f"{name}_avg_price"]]
    curr_res_df = grangers_causation_matrix(tmp_df, 1, variables=tmp_df.columns)
    print(curr_res_df)

We get many small dataframes, one per currency, and can draw some conclusions from them. Let's take the first matrix into consideration. Remember that these values are p-values, not correlation coefficients: the smaller the value, the stronger the evidence that the column variable (`_x`) helps predict the row variable (`_y`).

The value of __0.2902 in the upper right corner__ is the p-value for the hypothesis that ape_avg_price does not Granger-cause avg_price. It is __well above 0.05__, so we __cannot conclude that past changes in the average price of APE help predict the average price of our main cryptocurrency__.
The value of __0.1238 in the lower left corner__ is the p-value for the opposite direction (avg_price_x predicting ape_avg_price_y). It is smaller but __still above 0.05__, so __the evidence that past BTC prices help predict APE prices is not statistically significant either__.

*Overall, __the smaller p-value suggests that, if there is a predictive relationship, it is more likely to run from our main cryptocurrency (BTC) to APE than the other way around__. However, neither direction is significant at the 0.05 level with a single lag, so this pair alone gives no strong evidence of Granger causality.*

***In conclusion, comparing the matrices printed above, DOGE and APE are more likely to affect our main currency (BTC).***

Let's calculate Granger Causality Test for all available pairs.

In [ ]:
cols = [f"{name}_x" for name in names]
idxs = [f"{name}_y" for name in names]

gc_df = pd.DataFrame(columns=cols, index=idxs)

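# each unordered pair is tested once; the 2x2 result already holds both directions
for (curr1, curr2) in itertools.combinations(names, 2):
    df_to_test_2 = df[[f"{curr1}_avg_price", f"{curr2}_avg_price"]]
    res_df = grangers_causation_matrix(df_to_test_2, 1, variables=df_to_test_2.columns)
    # p1: curr1 is the predictor (_x) and curr2 the response (_y); p2 is the reverse
    p1 = res_df.loc[f"{curr2}_avg_price_y", f"{curr1}_avg_price_x"]
    p2 = res_df.loc[f"{curr1}_avg_price_y", f"{curr2}_avg_price_x"]
    # keep the same layout as above: rows are responses (_y), columns are predictors (_x)
    gc_df.loc[f"{curr2}_y", f"{curr1}_x"] = p1
    gc_df.loc[f"{curr1}_y", f"{curr2}_x"] = p2

# replace diagonal values with space char
for name in names:
    gc_df.loc[f"{name}_y", f"{name}_x"] = ' '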
    
gc_df

Now, you can draw your own conclusions based on the matrix above and the ideas we showed you previously.

It's worth noting that the __Granger Causality test does not prove causation in the traditional sense, but rather identifies potential causal relationships between variables based on statistical patterns__. Therefore, __it's important to interpret the results with caution and consider other factors that may be influencing the relationship between the variables__.

Finally, let's save our dataframe so we can use it later (we will add categorical data to it).

In [ ]:
df.to_csv('BTCBUSD_1min_categories.csv', index=True)
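
When the file is loaded again later, the time index has to be restored explicitly, because a CSV stores it as plain text. The sketch below shows the round trip on a tiny made-up frame written to a temporary file. It assumes the index holds timestamps; the index name `ts`, the file name and the prices are hypothetical.

```python
import os
import tempfile

import pandas as pd

# tiny stand-in for the lab dataframe: a 1-minute time index and one price column
sample = pd.DataFrame(
    {"avg_price": [23000.5, 23010.0, 22995.25]},
    index=pd.date_range("2023-03-01 00:00", periods=3, freq="min", name="ts"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_1min.csv")
    sample.to_csv(path, index=True)            # the index is written as the first column
    # index_col=0 turns that column back into the index, parse_dates=True parses it as datetimes
    restored = pd.read_csv(path, index_col=0, parse_dates=True)

print(type(restored.index).__name__)
print(restored)
```

Without `index_col=0` and `parse_dates=True`, the timestamps would come back as an ordinary string column and time-based operations would not work on them.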

### Conclusion

Now we have a better idea of our data and understand which cryptocurrencies affect our main cryptocurrency (BTCBUSD) the most.

As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.

# **Thank you for completing Lab 3!**

## Authors

<a href="https://author.skills.network/instructors/nazar_kohut">Nazar Kohut</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/mariya_fleychuk?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Mariya Fleychuk, DrSc, PhD</a>

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By   | Change Description                                         |
| ----------------- | ------- | -------------| ---------------------------------------------------------- |
|     2023-03-11    |   1.0   | Nazar Kohut  | Lab created                                                |

<hr>

<h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>