<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="500" alt="cognitiveclass.ai logo"  />
</center>




# Data Analysis with Python on the example BTC/USD and technical financial indicators ATR, OBV, ADV, RSI, AD

# Lab 3. Exploratory Data Analysis

Estimated time needed: **30** minutes

## Objectives

After completing this lab you will be able to:

*   Explore features or characteristics to predict the price of BTC


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li>Import Data from Module 2</li>
    <li>Analyzing Individual Feature Patterns using Visualization</li>
    <li>Descriptive Statistical Analysis</li>
    <li>Basics of Grouping</li>
    <li>Correlation and Causation</li>
    <li>ANOVA</li>
    <li>Durbin-Watson Test</li>
    <li>Granger Causality Test</li>
</ol>

</div>

<hr>


### What are the main characteristics that have the most impact on the BTC price?


## 1. Import Data from Module 2


#### Setup


Import libraries:


If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:


In [ ]:
# If you run the lab locally using Anaconda, uncomment the following lines
# to install the specific library versions used in this lab:
#! mamba install pandas=1.3.3 -y
#! mamba install numpy=1.21.2 -y
#! mamba install scipy=1.7.1 -y
#! mamba install seaborn=0.9.0 -y
! conda install -c conda-forge mplfinance -y

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
import matplotlib.dates as mpl_dates
import mplfinance as mpf
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson as dwtest
import itertools

import warnings
warnings.filterwarnings("ignore")

Load the data and store it in dataframe `df`:


This dataset is hosted on IBM Cloud Object Storage. Click <a href="https://cocl.us/DA101EN_object_storage?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01">HERE</a> for free storage.


In [ ]:
# set the URL of the file that holds our dataset
file='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0KGFEN/BTCBUSD.csv'

Set headers for our main crypto:
    

In [ ]:
headers = ["Time","Open","High","Low","Close","Volume","Rec_count","Avg_price","ATR","OBV","Gain","Loss","Ema_gain","Ema_loss",
          "RS","RSI_14","ADTV","MFI","MFV","ADL"]

In [ ]:
# read our csv
df = pd.read_csv(file, low_memory=False)
df.columns = headers
df.dropna(axis=0, inplace=True)

# reset index, because we dropped some rows
df.reset_index(drop=True, inplace=True)

df.head()

## 2. Analyzing Individual Feature Patterns Using Visualization


#### How to choose the right visualization method?
<p>When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.</p>


In [ ]:
#set correct types
df["Time"] = pd.to_datetime(df["Time"])

In [ ]:
# list the data types for each column
print(df.dtypes)

Use the <code>mplfinance</code> and <code>matplotlib</code> packages for our candlestick chart:

In [ ]:

# Extracting Data for plotting
ohlc = df[["Time", "Open", "High", "Low", "Close", "Volume"]].copy()

ohlc["Time"] = pd.to_datetime(ohlc["Time"])
ohlc.index = ohlc["Time"]

# Resampling to 1 day
ohlc = ohlc.resample("1d").agg({
    "Open": "first",
    "High": "max",
    "Low": "min",
    "Close": "last",
    "Volume": "sum"
})

ohlc["Time"] = ohlc.index
ohlc["Time"] = ohlc["Time"].apply(mpl_dates.date2num)

mpf.plot(ohlc, type="candle", 
         volume=True, 
         style="yahoo", 
         ylabel="Price", 
         xlabel="Date", 
         title="Daily Candlestick Chart of BTCBUSD")

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<p><b style="font-size: 2em; font-weight: bold;">Question #1:</b></p>

<b>What is the data type of the column "Open"? </b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df['Open'].dtypes
```

</details>


For example, we can calculate the correlation between variables  of type "int64" or "float64" using the method "corr":


In [ ]:
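# keep only numeric columns so that "corr" also works on recent pandas versions
corr = df.select_dtypes(include="number").corr()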
corr

The diagonal elements are always one. We will study correlation, more precisely Pearson correlation, in depth at the end of the notebook.


Now that we have the correlation matrix, we can build a <b>heatmap</b>:

In [ ]:
sns.heatmap(corr)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<p><b style="font-size: 2em; font-weight: bold;">Question #2:</b></p>

<p>Find the correlation between the following columns: Avg_price, Volume, ATR, OBV, ADTV, RSI_14, ADL.</p>
<p>Hint: if you would like to select those columns, use the following syntax: df[["Avg_price","Volume", "ATR", "OBV", "ADTV", "RSI_14", "ADL"]]</p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python
corr = df[["Avg_price","Volume", "ATR", "OBV", "ADTV", "RSI_14", "ADL"]].corr()
sns.heatmap(corr)
```

</details>


## Continuous Numerical Variables: 

<p>Continuous numerical variables are variables that may contain any value within some range. They can be of type "int64" or "float64". A great way to visualize these variables is by using scatterplots with fitted lines.</p>

<p>In order to start understanding the (linear) relationship between Avg_price and the other indicators, we can use "regplot", which plots the scatterplot plus the fitted regression line for the data.</p>
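As a rough sketch of what the fitted line represents: for two variables, the regression line is an ordinary least-squares fit, which can be reproduced with NumPy's `polyfit`. The arrays `x` and `y` below are synthetic stand-ins (an assumption for illustration), not columns from this lab's dataset.

```python
import numpy as np

# Synthetic data (assumed for illustration): y rises by about 2 per unit of x, plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Degree-1 least-squares fit: returns the slope and intercept of the fitted line
slope, intercept = np.polyfit(x, y, 1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

A slope far from zero, together with points lying close to the line, suggests a useful linear predictor; a nearly flat line suggests the opposite.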


Let's see several examples of different linear relationships:


### Positive Linear Relationship


Let's draw the scatterplot of "Avg_price" and "OBV".


In [ ]:
# OBV as potential predictor variable of price
sns.regplot(x="Avg_price", y="OBV", data=df,scatter_kws={'s':2})
plt.ylim(0,)

<p>As the OBV goes up, the price goes up: this indicates a positive direct correlation between these two variables. OBV seems like a reasonable predictor of price, since the regression line has a clear upward slope, although the points are scattered around it rather than lying on a perfect diagonal.</p>


We can examine the correlation between 'OBV' and 'Avg_price' and see that it's approximately 0.57760.


In [ ]:
df[["Avg_price", "OBV"]].corr()

Avg_price is a potential predictor variable of ADL. Let's draw the scatterplot of "Avg_price" and "ADL".


In [ ]:
sns.regplot(x="Avg_price", y="ADL", data=df,scatter_kws={'s':2})

<p>As price goes up, the ADL goes down: this indicates a negative relationship between these two variables. Avg_price could potentially be a predictor of ADL.</p>


We can examine the correlation between 'Avg_price' and 'ADL' and see it's approximately -0.46447.


In [ ]:
df[['Avg_price', 'ADL']].corr()

### Weak Linear Relationship


Let's see if "Avg_price" is a predictor variable of "Volume".


In [ ]:
sns.regplot(x="Avg_price", y="Volume", data=df,scatter_kws={'s':2})

<p>Avg_price does not seem like a good predictor of the Volume at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore, it's not a reliable variable.</p>


We can examine the correlation between 'Avg_price' and 'Volume' and see it's approximately -0.03253.


In [ ]:
df[['Avg_price','Volume']].corr()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<p><b style="font-size: 2em; font-weight: bold;">Question #3a:</b></p>

<p>Find the correlation  between x="Avg_price" and y="ATR".</p>
<p>Hint: if you would like to select those columns, use the following syntax: df[["Avg_price","ATR"]].  </p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python

#The correlation is 0.491169.

df[["Avg_price","ATR"]].corr()

```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<p><b style="font-size: 2em; font-weight: bold;">Question #3b:</b></p>

<p>Given the correlation results between "Avg_price" and "ATR", do you expect a linear relationship?</p> 
<p>Verify your results using the function "regplot()".</p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python

#There is a positive correlation between the variables 'Avg_price' and 'ATR'. We can use "regplot" to demonstrate this.

#Code: 
sns.regplot(x="Avg_price", y="ATR", data=df,scatter_kws={'s':2})

```

</details>


### Make Categorical Variables for next point
<p>Let's make categorical values as in the previous lab.</p>

<p>We would like 5 bins of equal width, so we use numpy's <code>linspace(start_value, end_value, numbers_generated)</code> function.</p>
<p>Since we want to include the minimum value of Avg_price, we want to set start_value = min(df["Avg_price"]).</p>
<p>Since we want to include the maximum value of Avg_price, we want to set end_value = max(df["Avg_price"]).</p>
<p>Since we are building 5 bins of equal length, there should be 6 dividers, so numbers_generated = 6.</p>


We build a bin array with a minimum value to a maximum value by using the bandwidth calculated above. The values will determine when one bin ends and another begins.


In [ ]:
bins = np.linspace(min(df["Avg_price"]), max(df["Avg_price"]), 6)
bins

We set group  names:


In [ ]:
group_names = ['Low','Lower Medium' ,'Medium', 'Upper Medium' ,'High']

We apply the function "cut" to determine which bin each value of `df['Avg_price']` belongs to.


In [ ]:
df['Avg_price-binned'] = pd.cut(df['Avg_price'], bins, labels=group_names, include_lowest=True )
df[['Avg_price','Avg_price-binned']].head()
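To make the binning steps concrete, here is a minimal sketch on a handful of made-up prices. The series `prices` and the names `edges`, `labels` and `binned` are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, used only to illustrate the binning steps
prices = pd.Series([10.0, 14.0, 25.0, 33.0, 41.0, 50.0])

# 6 evenly spaced dividers from min to max define 5 equal-width bins
edges = np.linspace(prices.min(), prices.max(), 6)
labels = ["Low", "Lower Medium", "Medium", "Upper Medium", "High"]

# include_lowest=True puts the minimum value into the first bin
binned = pd.cut(prices, edges, labels=labels, include_lowest=True)
print(edges)
print(binned.tolist())
```

Each bin is closed on the right, so a value that sits exactly on a divider falls into the bin below it. Without `include_lowest=True`, the minimum value would fall outside every bin and become NaN.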

Same for Volume:

In [ ]:
bins = np.linspace(min(df["Volume"]), max(df["Volume"]), 6)
bins

In [ ]:
df['Volume-binned'] = pd.cut(df['Volume'], bins, labels=group_names, include_lowest=True )
df[['Volume','Volume-binned']].head()

Same for Rec_count:
    

In [ ]:
bins = np.linspace(min(df["Rec_count"]), max(df["Rec_count"]), 6)
bins

In [ ]:
df['Rec_count-binned'] = pd.cut(df['Rec_count'], bins, labels=group_names, include_lowest=True )
df[['Rec_count','Rec_count-binned']].head()

### Categorical Variables

<p>These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type "object", "int64" or, as for the binned columns created above, "category". A good way to visualize categorical variables is by using boxplots.</p>


Let's look at the relationship between "Avg_price" and "Avg_price-binned".


In [ ]:
sns.boxplot(x="Avg_price-binned", y="Avg_price", data=df)

<p>Here we see that the distributions of price across these five categories are distinct, with no overlap, so the category is a good predictor of price. This is expected, because the bins were built from the price itself. Let's examine "Volume" and "Volume-binned":</p>


In [ ]:
sns.boxplot(x="Volume-binned", y="Volume", data=df)

<p>We can see the same.</p>


Let's examine "Avg_price-binned" and "Volume".


In [ ]:
# Volume by price category
sns.boxplot(x="Avg_price-binned", y="Volume", data=df)

<p>We see that the distributions of Volume between the different price categories have a significant overlap, so Avg_price-binned would not be a good predictor of Volume.</p>


## 3. Descriptive Statistical Analysis


<p>Let's first take a look at the variables by utilizing a description method.</p>

<p>The <b>describe</b> function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.</p>

This will show:

<ul>
    <li>the count of that variable</li>
    <li>the mean</li>
    <li>the standard deviation (std)</li> 
    <li>the minimum value</li>
    <li>the quartiles (25%, 50% and 75%), from which the IQR (Interquartile Range) can be derived</li>
    <li>the maximum value</li>
</ul>


We can apply the method "describe" as follows:


In [ ]:
df.describe()

The default setting of "describe" skips variables of type category. We can apply the method "describe" on the variables of type 'category' as follows:


In [ ]:
df.describe(include='category')

### Value Counts


<p>Value counts is a good way of understanding how many units of each characteristic/variable we have. We can apply the "value_counts" method on the column "Avg_price-binned". Don’t forget the method "value_counts" only works on pandas series, not pandas dataframes. As a result, we only include one bracket <code>df['Avg_price-binned']</code>, not two brackets <code>df[['Avg_price-binned']]</code>.</p>


In [ ]:
df['Avg_price-binned'].value_counts()

We can convert the series to a dataframe as follows:


In [ ]:
df['Avg_price-binned'].value_counts().to_frame()

Let's repeat the above steps but save the results to the dataframe "BTCBUSD_avg_price_binned_counts" and rename its single column to 'value_counts'.


In [ ]:
BTCBUSD_avg_price_binned_counts = df['Avg_price-binned'].value_counts().to_frame()
# the column is named 'Avg_price-binned' in pandas < 2.0 and 'count' in pandas >= 2.0,
# so assign the new name directly instead of renaming by the old one
BTCBUSD_avg_price_binned_counts.columns = ['value_counts']
BTCBUSD_avg_price_binned_counts

Now let's rename the index to 'Avg_price-binned':


In [ ]:
BTCBUSD_avg_price_binned_counts.index.name = 'Avg_price-binned'
BTCBUSD_avg_price_binned_counts

<p>After examining the value counts of Avg_price-binned, we see that the categories are unevenly populated: most of our prices fall in 'Medium'. Keep this imbalance in mind when using the category as a predictor variable for the price.</p>


## 4. Basics of Grouping


<p>The "groupby" method groups data by different categories. The data is grouped based on one or several variables, and analysis is performed on the individual groups.</p>

<p>For example, let's group by the variable "Avg_price-binned". We see that there are 5 different categories of Avg_price-binned.</p>


In [ ]:
df['Avg_price-binned'].unique()

<p>If we want to know, on average, which type of Avg_price-binned is most valuable, we can group "Avg_price-binned" and then average them.</p>

<p>We can select the columns 'Avg_price-binned' and 'Avg_price', then assign it to the variable "df_group_one".</p>


In [ ]:
df_group_one = df[['Avg_price-binned','Avg_price']]

We can then calculate the average price for each of the different categories of data.


In [ ]:
# grouping results
df_group_one = df_group_one.groupby(['Avg_price-binned'],as_index=False).mean()
df_group_one

<p>From our data, the 'High' category has the highest average price, as expected.</p>

<p>You can also group by multiple variables. For example, let's group by both 'Avg_price-binned' and 'Volume-binned'. This groups the dataframe by the unique combination of 'Avg_price-binned' and 'Volume-binned'. We can store the results in the variable 'grouped_test1'.</p>


In [ ]:
# grouping results
df_gptest = df[['Avg_price-binned','Volume-binned','Avg_price']]
grouped_test1 = df_gptest.groupby(['Avg_price-binned','Volume-binned'],as_index=False).mean()
grouped_test1

<p>This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the columns and another along the rows. We can convert the grouped dataframe to a pivot table using the method "pivot".</p>

<p>In this case, we will use the Volume-binned variable as the rows of the table and Avg_price-binned as the columns of the table:</p>


In [ ]:
grouped_pivot = grouped_test1.pivot(index='Volume-binned',columns='Avg_price-binned')
grouped_pivot
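The group-then-pivot pattern can be sketched on a tiny made-up table. The dataframe `toy` and its column names are hypothetical, chosen only to mirror the structure used above.

```python
import pandas as pd

# Hypothetical data: two categorical columns and one numeric column
toy = pd.DataFrame({
    "price_bin": ["Low", "Low", "High", "High", "High"],
    "volume_bin": ["Low", "High", "Low", "Low", "High"],
    "price": [10.0, 12.0, 30.0, 34.0, 40.0],
})

# Average price for every (price_bin, volume_bin) combination
grouped = toy.groupby(["price_bin", "volume_bin"], as_index=False).mean()

# One row per volume_bin, one column per price_bin, cells hold the mean price
pivot = grouped.pivot(index="volume_bin", columns="price_bin", values="price")
print(pivot)
```

Passing `values="price"` keeps the column labels flat. Omitting it, as in the cell above, produces a two-level column index with the value name on top.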

<p>Often, we won't have data for some of the pivot cells. We can fill these missing cells with the value 0, but any other value could potentially be used as well. It should be mentioned that missing data is quite a complex subject and is an entire course on its own.</p>


In [ ]:
grouped_pivot = grouped_pivot.fillna(0) #fill missing values with 0
grouped_pivot

We can also use a cross-tabulation to count how many observations fall into each combination of bins:

In [ ]:
crossed_table = pd.crosstab(df['Volume-binned'],df['Avg_price-binned'])
crossed_table

As we can see, most observations fall in the cell where the bin for Volume is 'Low' and the bin for Avg_price is 'Medium'.

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<p><b style="font-size: 2em; font-weight: bold;">Question #4:</b></p>

<p>Use the "groupby" function to find the average "Volume" of each trade based on "Volume-binned".</p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
# grouping results
df_gptest2 = df[['Volume','Volume-binned']]
grouped_test = df_gptest2.groupby(['Volume-binned'],as_index= False).mean()
grouped_test

```

</details>


#### Variables: Avg_price-binned vs. Volume-binned


Let's use a heat map to visualize the relationship between Avg_price-binned vs Volume-binned.


In [ ]:
#use the cross-tabulated counts
plt.grid(False)
plt.pcolor(crossed_table, cmap='RdBu')
plt.colorbar()
plt.show()

<p>The heatmap plots the number of observations in each cell of the crossed table as colour, with 'Volume-binned' and 'Avg_price-binned' on the vertical and horizontal axis, respectively. This allows us to visualize how often each combination of 'Avg_price-binned' and 'Volume-binned' occurs.</p>

<p>The default labels convey no useful information to us. Let's change that:</p>


In [ ]:
fig, ax = plt.subplots()
plt.grid(False)
im = ax.pcolor(crossed_table, cmap='RdBu')

#label names: columns of the table go on the x-axis, index values on the y-axis
x_labels = crossed_table.columns
y_labels = crossed_table.index

#move ticks and labels to the center
ax.set_xticks(np.arange(crossed_table.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(crossed_table.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(x_labels, minor=False)
ax.set_yticklabels(y_labels, minor=False)

#rotate label if too long
plt.xticks(rotation=90)
fig.colorbar(im)
plt.show()

<p>Visualization is very important in data science, and Python visualization packages provide great freedom. We will go more in-depth in a separate Python visualizations course.</p>

<p>The main question we want to answer in this module is, "What are the main characteristics that have the most impact on the BTC price?".</p>

<p>To get a better measure of the important characteristics, we look at the correlation of these variables with the BTC avg price. In other words: how is the BTC avg price dependent on this variable?</p>


## 5. Correlation and Causation


<p><b>Correlation</b>: a measure of the extent of interdependence between variables.</p>

<p><b>Causation</b>: the relationship between cause and effect between two variables.</p>

<p>It is important to know the difference between these two. Correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.</p>


<p><b>Pearson Correlation</b></p>
<p>The Pearson Correlation measures the linear dependence between two variables X and Y.</p>
<p>The resulting coefficient is a value between -1 and 1 inclusive, where:</p>
<ul>
    <li><b>1</b>: Perfect positive linear correlation.</li>
    <li><b>0</b>: No linear correlation, the two variables most likely do not affect each other.</li>
    <li><b>-1</b>: Perfect negative linear correlation.</li>
</ul>
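The coefficient is the covariance of X and Y divided by the product of their standard deviations. A minimal sketch that computes it by hand and checks it against NumPy; the arrays `x` and `y` are hypothetical values chosen for illustration:

```python
import numpy as np

# Hypothetical paired observations, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r = covariance(x, y) / (std(x) * std(y))
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# NumPy's built-in correlation matrix should agree with the manual value
r_numpy = np.corrcoef(x, y)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```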


<p>Pearson Correlation is the default method of the function "corr". Like before, we can calculate the Pearson Correlation of the 'int64' or 'float64' variables.</p>


In [ ]:
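# the binned 'category' columns are not numeric, so select the numeric columns first
df.select_dtypes(include="number").corr()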

Sometimes we would like to know the significance of the correlation estimate.


<b>P-value</b>

<p>What is this P-value? The P-value is the probability of observing a correlation at least as strong as the one we measured if the two variables were in fact uncorrelated. Normally, we choose a significance level of 0.05, which means that we call the correlation statistically significant when the P-value is below 0.05.</p>

By convention, when

<ul>
    <li>the p-value is $<$ 0.001: there is strong evidence that the correlation is significant.</li>
    <li>the p-value is $<$ 0.05: there is moderate evidence that the correlation is significant.</li>
    <li>the p-value is $<$ 0.1: there is weak evidence that the correlation is significant.</li>
    <li>the p-value is $>$ 0.1: there is no evidence that the correlation is significant.</li>
</ul>


We can obtain this information using the "stats" module in the "scipy" library.
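When a p-value is extremely small, formatting it with `'%.3f'` prints `0.000`, which hides its magnitude. A minimal sketch on synthetic data that prints the p-value in scientific notation instead; the arrays `a` and `b` are assumptions for illustration, not lab data:

```python
import numpy as np
from scipy import stats

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(42)
a = rng.normal(size=1000)
b = 0.5 * a + rng.normal(size=1000)  # b depends linearly on a, plus noise

# pearsonr returns the correlation coefficient and its two-sided p-value
r, p = stats.pearsonr(a, b)
print("r = %.3f, p = %.3e" % (r, p))
```

Note that with many observations even a very weak correlation can have a tiny p-value, so the coefficient and the p-value should always be read together.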


### Avg_price vs. Volume


Let's calculate the  Pearson Correlation Coefficient and P-value of 'Avg_price' and 'Volume'.


In [ ]:
pearson_coef, p_value = stats.pearsonr(df['Avg_price'], df['Volume'])
print("The Pearson Correlation Coefficient is",'%.3f' % pearson_coef, " with a P-value of P =",'%.3f' % p_value)  

#### Conclusion:
<p>Since the p-value is $<$ 0.001, the correlation between Avg_price and Volume is statistically significant, but the linear relationship is very weak (~ -0.033).</p>


### Avg_price vs. OBV


Let's calculate the  Pearson Correlation Coefficient and P-value of 'Avg_price' and 'OBV'.


In [ ]:
pearson_coef, p_value = stats.pearsonr(df['Avg_price'], df['OBV'])
print("The Pearson Correlation Coefficient is",'%.3f' % pearson_coef, " with a P-value of P = ",'%.3f' % p_value)  

#### Conclusion:

<p>Since the p-value is $<$ 0.001, the correlation between Avg_price and OBV is statistically significant, and the linear relationship is moderately strong (~0.578).</p>


### Avg_price vs. ADTV

Let's calculate the  Pearson Correlation Coefficient and P-value of 'Avg_price' and 'ADTV'.


In [ ]:
pearson_coef, p_value = stats.pearsonr(df['Avg_price'], df['ADTV'])
print("The Pearson Correlation Coefficient is",'%.3f' % pearson_coef, " with a P-value of P = ",'%.3f' % p_value)  

#### Conclusion:
<p>Since the p-value is $<$ 0.001, the correlation between Avg_price and ADTV is statistically significant, but the linear relationship is weak (~ -0.134).</p>


### Avg_price vs. Gain


Let's calculate the Pearson Correlation Coefficient and P-value of 'Avg_price' and 'Gain':


In [ ]:
pearson_coef, p_value = stats.pearsonr(df['Avg_price'], df['Gain'])
print("The Pearson Correlation Coefficient is",'%.3f' % pearson_coef, " with a P-value of P =",'%.3f' % p_value ) 

#### Conclusion:

Since the p-value is < 0.001, the correlation between Avg_price and Gain is statistically significant, but the linear relationship is very weak (\~  -0.064).


### Avg_price vs. Loss


Let's calculate the Pearson Correlation Coefficient and P-value of 'Avg_price' and 'Loss':


In [ ]:
pearson_coef, p_value = stats.pearsonr(df['Avg_price'], df['Loss'])
print( "The Pearson Correlation Coefficient is",'%.3f' % pearson_coef, " with a P-value of P = ",'%.3f' % p_value)  

#### Conclusion:
<p>Since the p-value is $<$ 0.001, the correlation between Avg_price and Loss is statistically significant, but the linear relationship is very weak (~ -0.071).</p>


### Avg_price vs. ATR


Let's calculate the Pearson Correlation Coefficient and P-value of 'Avg_price' and 'ATR':


In [ ]:
pearson_coef, p_value = stats.pearsonr(df['Avg_price'], df['ATR'])
print( "The Pearson Correlation Coefficient is",'%.3f' % pearson_coef, " with a P-value of P = ",'%.3f' % p_value)  

#### Conclusion:
<p>Since the p-value is $<$ 0.001, the correlation between Avg_price and ATR is statistically significant, but the linear relationship is weak (~ -0.173).</p>


### Avg_price vs. RSI_14

Let's calculate the Pearson Correlation Coefficient and P-value of 'Avg_price' and 'RSI_14':


In [ ]:
pearson_coef, p_value = stats.pearsonr(df['Avg_price'], df['RSI_14'])
print( "The Pearson Correlation Coefficient is",'%.3f' % pearson_coef, " with a P-value of P = ",'%.3f' % p_value)  

#### Conclusion:
<p>Since the p-value is $<$ 0.001, the correlation between Avg_price and RSI_14 is statistically significant, but the linear relationship is very weak (~ 0.034).</p>


### Avg_price vs. ADL

Let's calculate the Pearson Correlation Coefficient and P-value of 'Avg_price' and 'ADL':


In [ ]:
pearson_coef, p_value = stats.pearsonr(df['Avg_price'], df['ADL'])
print( "The Pearson Correlation Coefficient is",'%.3f' % pearson_coef, " with a P-value of P = ",'%.3f' % p_value)  

#### Conclusion:
<p>Since the p-value is $<$ 0.001, the correlation between Avg_price and ADL is statistically significant, and the negative linear relationship is moderate (~ -0.464).</p>


## 6. ANOVA


### ANOVA: Analysis of Variance
<p>The Analysis of Variance  (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:</p>

<p><b>F-test score</b>: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.</p>

<p><b>P-value</b>:  P-value tells how statistically significant our calculated score value is.</p>

<p>If our price variable is strongly correlated with the variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.</p>
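The F-test score compares the variation between group means with the variation inside the groups. The sketch below computes it by hand for three small made-up groups; the `groups` list is hypothetical, and `stats.f_oneway` should report the same F value for these inputs.

```python
import numpy as np

# Three hypothetical groups of observations, for illustration only
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([5.0, 6.0, 7.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)          # number of groups
n = all_values.size      # total number of observations

# Variation of the group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between groups / mean square within groups
f_score = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f_score)
```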


### Avg_price-binned

<p>Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.</p>

<p>To see if the different categories of 'Avg_price-binned' differ in 'Avg_price', we group the data.</p>


In [ ]:
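# group by a single column name (not a one-element list) so that get_group('Low') accepts a plain label
grouped_test2 = df_gptest[['Avg_price-binned', 'Avg_price']].groupby('Avg_price-binned')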
grouped_test2.head(2)

We can obtain the values of a single group using the method "get_group".


In [ ]:
grouped_test2.get_group('Low')['Avg_price']

We can use the function 'f_oneway' in the module 'stats' to obtain the <b>F-test score</b> and <b>P-value</b>.


In [ ]:
# ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group('Low')['Avg_price'], 
                              grouped_test2.get_group('Lower Medium')['Avg_price'],
                              grouped_test2.get_group('Medium')['Avg_price'],
                              grouped_test2.get_group('Upper Medium')['Avg_price'],
                              grouped_test2.get_group('High')['Avg_price'])  
 
print( "ANOVA results: F=",'%.3f' % f_val, ", P =",'%.3f' % p_val)   

This is a great result: the large F-test score shows a strong difference between the group means, and a P-value of almost 0 implies almost certain statistical significance. But does this mean that every pair of the five tested groups differs this strongly?

Let's examine them separately.


#### Low and Lower Medium

In [ ]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('Low')['Avg_price'], 
                              grouped_test2.get_group('Lower Medium')['Avg_price'])  
 
print( "ANOVA results: F=",'%.3f' % f_val, ", P =",'%.3f' % p_val )

Let's examine the other groups.


#### Low and Medium


In [ ]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('Low')['Avg_price'], 
                              grouped_test2.get_group('Medium')['Avg_price'])  
   
print( "ANOVA results: F=",'%.3f' % f_val, ", P =",'%.3f' % p_val)   

#### Low and Upper Medium


In [ ]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('Low')['Avg_price'], 
                              grouped_test2.get_group('Upper Medium')['Avg_price'])  
 
print("ANOVA results: F=",'%.3f' % f_val, ", P =",'%.3f' % p_val)   

#### Low and High

In [ ]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('Low')['Avg_price'], 
                              grouped_test2.get_group('High')['Avg_price'])  
 
print("ANOVA results: F=",'%.3f' % f_val, ", P =",'%.3f' % p_val)   

#### Lower Medium and Medium

In [ ]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('Lower Medium')['Avg_price'], 
                              grouped_test2.get_group('Medium')['Avg_price'])  
 
print("ANOVA results: F=",'%.3f' % f_val, ", P =",'%.3f' % p_val)   

#### Lower Medium and Upper Medium

In [ ]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('Lower Medium')['Avg_price'], 
                              grouped_test2.get_group('Upper Medium')['Avg_price'])  
 
print("ANOVA results: F=",'%.3f' % f_val, ", P =",'%.3f' % p_val)   

#### Lower Medium and High

In [ ]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('Lower Medium')['Avg_price'], 
                              grouped_test2.get_group('High')['Avg_price'])  
 
print("ANOVA results: F=",'%.3f' % f_val, ", P =",'%.3f' % p_val)   

#### Medium and Upper Medium

In [ ]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('Medium')['Avg_price'], 
                              grouped_test2.get_group('Upper Medium')['Avg_price'])  
 
print("ANOVA results: F=",'%.3f' % f_val, ", P =",'%.3f' % p_val)   

#### Medium and High

In [ ]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('Medium')['Avg_price'], 
                              grouped_test2.get_group('High')['Avg_price'])  
 
print("ANOVA results: F=",'%.3f' % f_val, ", P =",'%.3f' % p_val)   

#### Upper Medium and High

In [ ]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('Upper Medium')['Avg_price'], 
                              grouped_test2.get_group('High')['Avg_price'])  
 
print("ANOVA results: F=",'%.3f' % f_val, ", P =",'%.3f' % p_val)   
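The pairwise comparisons above repeat the same call for every pair of groups. As a sketch of a more compact alternative, `itertools.combinations` can generate the pairs in a loop. The `samples` dictionary below is hypothetical toy data standing in for the binned prices.

```python
import itertools
import numpy as np
from scipy import stats

# Hypothetical samples keyed by group name, standing in for the binned prices
samples = {
    "Low": np.array([1.0, 2.0, 3.0]),
    "Medium": np.array([2.0, 3.0, 4.0]),
    "High": np.array([5.0, 6.0, 7.0]),
}

pairwise = {}
# combinations() yields each unordered pair of group names exactly once
for name_a, name_b in itertools.combinations(samples, 2):
    f_val, p_val = stats.f_oneway(samples[name_a], samples[name_b])
    pairwise[(name_a, name_b)] = (f_val, p_val)
    print(f"{name_a} vs {name_b}: F = {f_val:.3f}, P = {p_val:.3f}")
```

With the lab's data, the same loop could iterate over `group_names` and pass `grouped_test2.get_group(name)['Avg_price']` for each member of the pair.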

## 7. Durbin-Watson Test

#### What is Durbin-Watson Test?


In regression analysis, Durbin-Watson (DW) is useful for checking the first-order autocorrelation (serial correlation). It analyzes the residuals for independence over time points (autocorrelation). The autocorrelation varies from -1 (negative autocorrelation) to 1 (positive autocorrelation).

<p>The Durbin-Watson test analyzes the following hypotheses:</p>
<p>Null hypothesis (H<sub>0</sub>): Residuals from the regression are not autocorrelated (autocorrelation coefficient, ρ = 0).</p>
<p>Alternative hypothesis (H<sub>a</sub>): Residuals from the regression are autocorrelated (autocorrelation coefficient, ρ > 0).</p>

We will use <b>durbin_watson</b> for the Durbin-Watson test and <b>OLS</b> to get the residuals, both from the "statsmodels" library:

In [ ]:
X = df["Avg_price"] # independent
y = df["Volume"] # dependent
# to get intercept
X = sm.add_constant(X)
# fit the regression model
reg = sm.OLS(y, X).fit()

In [ ]:
print('%.5f' % dwtest(resids=np.array(reg.resid)))
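The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below computes it with NumPy only. The helper `durbin_watson_manual` and the two synthetic residual series are illustrative assumptions, not part of the lab's data.

```python
import numpy as np

def durbin_watson_manual(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(1)
white_noise = rng.normal(size=5000)    # independent residuals
random_walk = np.cumsum(white_noise)   # strongly positively autocorrelated residuals

dw_white = durbin_watson_manual(white_noise)
dw_walk = durbin_watson_manual(random_walk)
print(dw_white)   # close to 2: no first-order autocorrelation
print(dw_walk)    # close to 0: strong positive autocorrelation
```

The statistic lies between 0 and 4. Values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.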

Let's calculate Durbin-Watson for all available pairs

In [ ]:

cols = ["Avg_price_dep","ATR_dep","ADTV_dep","OBV_dep","ADL_dep",
         "Volume_dep","RSI_14_dep"]
idxs = ["Avg_price_indep","ATR_indep","ADTV_indep","OBV_indep","ADL_indep",
         "Volume_indep","RSI_14_indep"]

files = ["Avg_price","ATR","ADTV","OBV","ADL",
         "Volume","RSI_14"]


durbin = pd.DataFrame(columns=cols, index=idxs)

for (file1, file2) in itertools.permutations(files, 2):
    curr1 = file1
    curr2 = file2
    X = df[curr1] # independent
    y = df[curr2] # dependent
    # to get intercept
    X = sm.add_constant(X)
    # fit the regression model
    reg = sm.OLS(y, X).fit()
    dw = dwtest(resids=np.array(reg.resid))
    durbin.loc[f"{curr1}_indep", f"{curr2}_dep"] = dw
    
np.fill_diagonal(durbin.values, "—")
pd.options.display.float_format = '{:.5f}'.format 
durbin

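Now we can see a dataframe that consists of all the Durbin-Watson statistics. <p>How to interpret these values?</p> For any value, its column gives the dependent variable and its row gives the independent variable. The statistic ranges from 0 to 4: a value near 2 indicates no first-order autocorrelation in the residuals, values below 2 indicate positive autocorrelation, and values above 2 indicate negative autocorrelation.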

## 8. Granger Causality Test

<b>What is Granger causality test?</b>
<p>The Granger causality test is a statistical hypothesis test for determining whether one time series offers useful information for forecasting another time series.</p>

<p>For example, given a question: Could we use today’s Apple’s stock price to predict tomorrow’s Tesla’s stock price? If this is true, our statement will be Apple’s stock price Granger causes Tesla’s stock price. If this is not true, we say Apple’s stock price does not Granger cause Tesla’s stock price.</p>

In [ ]:
from statsmodels.tsa.stattools import grangercausalitytests

def GCT(coin_name):
    test_df = df[["Avg_price", "Volume"]]
    x = df.index
    y1 = test_df["Avg_price"]
    y2 = test_df["Volume"]
    
    fig, ax1 = plt.subplots(1, 1, figsize = (16,9), dpi = 80)
    ax1.plot(x, y1, color = 'tab:red')
    
    ax2 = ax1.twinx()
    ax2.plot(x, y2, color = 'tab:blue')
    
    # decor for ax1
    ax1.set_xlabel("Time", fontsize = 20)
    ax1.tick_params(axis = 'x', rotation = 0, labelsize = 12)
    ax1.set_ylabel("BTCBUST_avg_price", color = 'tab:red', fontsize = 20)
    ax1.tick_params(axis = 'y', rotation = 0, labelcolor = 'tab:red')
    ax1.grid(alpha=.4)
    
    #decor for ax2
    ax2.set_ylabel("Volume", color = 'tab:blue', fontsize = 20)
    ax2.tick_params(axis = 'y', rotation = 0, labelcolor = 'tab:blue')
    ax2.set_xticks(np.arange(0, len(x), 60))
    ax2.set_xticklabels(x[::60], rotation = 90, fontdict = {'fontsize':10})
    ax2.set_title("Visualizing Leading Indicator Phenomenon", fontsize = 22)
    
    fig.tight_layout()
    plt.show()

Call our function:

In [ ]:
GCT("BTC")

Now let's define a custom function that runs the Granger causality test and returns the result as a <b>pd.DataFrame</b>:

In [ ]:
def grangers_causation_matrix(data, maxlag, variables, test='ssr_chi2test', verbose=False):    
    """Check Granger Causality of all possible combinations of the Time series.
    The rows are the response variables, the columns are the predictors. The values in the
    table are p-values. A p-value below the significance level (0.05) means we can reject
    the null hypothesis that the coefficients of the corresponding past values are zero,
    that is, that X does not Granger-cause Y.

    data      : pandas dataframe containing the time series variables
    variables : list containing names of the time series variables.
    """
    df = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables)
    for c in df.columns:
        for r in df.index:
            test_result = grangercausalitytests(data[[r, c]], maxlag=maxlag, verbose=False)
            p_values = [round(test_result[i+1][0][test][1],4) for i in range(maxlag)]
            if verbose: print(f'Y = {r}, X = {c}, P Values = {p_values}')
            min_p_value = np.min(p_values)
            df.loc[r, c] = min_p_value
    df.columns = [var + '_x' for var in variables]
    df.index = [var + '_y' for var in variables]
    return df

In [ ]:
pd.options.display.float_format = '{:.2f}'.format
grangers_causation_matrix(df[["Avg_price", "Volume"]], 1, variables=["Avg_price", "Volume"])

How to interpret the p-values?

Assuming a significance level of 0.05, if the p-value is less than 0.05, we reject the null hypothesis that X does NOT Granger-cause Y.

So, in the above table, the p-value for Avg_price_x and Volume_y is 0.00. So we reject the null hypothesis and conclude that (Avg_price) granger causes (Volume).

That means, Avg_price will likely be helpful in predicting the Volume.

But the p-value for Volume_x and Avg_price_y is 0.42.

Since the p-value isn't less than 0.05, we can't reject the null hypothesis. That is, we have no evidence that "Volume_x" is predictive of "Avg_price_y".
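
The decision rule can be written down as a tiny helper. This is only a sketch: the function name `granger_decision` is hypothetical, and the two p-values are the ones quoted in the text above.

```python
def granger_decision(p_value, predictor, response, alpha=0.05):
    """Turn a Granger-test p-value into a plain-language decision.

    H0: `predictor` does not Granger-cause `response`.
    """
    if p_value < alpha:
        return f"reject H0: {predictor} Granger-causes {response}"
    return f"fail to reject H0: no evidence that {predictor} Granger-causes {response}"

# p-values quoted in the text above
print(granger_decision(0.00, "Avg_price", "Volume"))
print(granger_decision(0.42, "Volume", "Avg_price"))
```

Failing to reject the null hypothesis is not proof that there is no predictive relationship; it only means the data do not show one at the chosen significance level.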

Let's run the Granger causality test for the other pairs of interest:

In [ ]:
grangers_causation_matrix(df[["Avg_price", "OBV"]], 1, variables=["Avg_price", "OBV"])

In [ ]:
grangers_causation_matrix(df[["Avg_price", "ATR"]], 1, variables=["Avg_price", "ATR"])

In [ ]:
grangers_causation_matrix(df[["Avg_price", "ADL"]], 1, variables=["Avg_price", "ADL"])

In [ ]:
grangers_causation_matrix(df[["Avg_price", "RSI_14"]], 1, variables=["Avg_price", "RSI_14"])

In [ ]:
grangers_causation_matrix(df[["Avg_price", "ADTV"]], 1, variables=["Avg_price", "ADTV"])

In [ ]:
grangers_causation_matrix(df[["Volume", "OBV"]], 1, variables=["Volume", "OBV"])

In [ ]:
grangers_causation_matrix(df[["Volume", "ADTV"]], 1, variables=["Volume", "ADTV"])

### Conclusion:


<p>We now have a better idea of what our data looks like and which indicators are most closely related to the BTC price:</p>

<ul>
    <li>RSI_14</li>
    <li>ADL</li>

</ul>
<p> On Volume:</p>
<ul>
    <li>OBV</li>
</ul>
<p>The <b>Granger Causality Test</b> also suggests that the BTC price Granger-causes (helps predict):</p>
<ul>
    <li>Volume</li>
    <li>ATR</li>
    <li>ADL</li>

</ul>

<p>As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.</p>


Save our new dataset to a file:

In [ ]:
df.to_csv("BTC.csv", index=False)

### Thank you for completing this lab!

## Authors

<a href="https://www.linkedin.com/in/bohdan-tsisinskyi-539913255/" target="_blank">Bohdan Tsisinskyi</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/mariya_fleychuk">Prof. Mariya Fleychuk, DrSc, PhD</a>.

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                 |
| ----------------- | ------- | ---------- | ---------------------------------- |
| 2023-03-11        | 1.0     | Bohdan Tsisinskyi   | Lab created                   |


<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
