<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="400" alt="cognitiveclass.ai logo">
</center>

# **Investigating relationships between the BTC/BUSD exchange rate and the ADOSC, NATR, TRANGE indicators**

## Lab 3. Data Analysis with Python

Estimated time needed: **30** minutes


### Objectives

After completing this lab you will be able to:

*   Explore features or characteristics to predict the price of a cryptocurrency
*   Visualize cryptocurrency dynamics using a candlestick chart
*   Estimate the strength of relationships between cryptocurrency characteristics and indicators
*   Perform financial statistic tests

<h3>Table of Contents</h3>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li>Import and Load Data</li>
    <li>Analyzing Individual Feature Patterns using Visualization</li>
    <li>Descriptive Statistical Analysis</li>
    <li>Basics of Grouping</li>
    <li>Correlation and Causation</li>
    <li>ANOVA</li>
    <li>Durbin Watson Test</li>
    <li>Granger Causality Test</li>
</ol>

</div>
<hr>


## Dataset Description

### Context
The dataset contains historical changes of ***BTC/BUSD*** and the ***ADOSC, NATR, TRANGE indicators*** for the period from *11/11/2022 to 11/24/2022* with a *1-minute* aggregation time.

### Columns

#### Input columns
* ***Time*** - the timestamp of the record
* ***Open*** -  the price of the asset at the beginning of the trading period
* ***High*** -  the highest price of the asset during the trading period
* ***Low*** - the lowest price of the asset during the trading period.
* ***Close*** - the price of the asset at the end of the trading period
* ***Volume*** - the total number of shares or contracts of a particular asset that are traded during a given period
* ***Count*** -  the number of individual trades or transactions that have been executed during a given time period
* ***ADOSC*** - Chaikin oscillator indicator
* ***NATR*** - normalized average true range (ATR) indicator
* ***TRANGE*** - true range indicator

#### Target column
* ***Price*** - the average price at which a particular asset has been bought or sold during a given period


----


## 1. Import and Load Data


### Setup


If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:


In [ ]:
# Install the specific versions of the libraries used in this lab
#! mamba install pandas==1.3.3 -y
#! mamba install numpy=1.21.2 -y
#! mamba install scipy=1.7.1 -y
#! mamba install seaborn=0.9.0 -y
! conda install -c conda-forge mplfinance -y
! conda install -c conda-forge astropy -y

Import libraries:


In [ ]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from typing import List
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.tsa.stattools import grangercausalitytests
import matplotlib.pyplot as plt
import seaborn as sns
import mplfinance as fplt
%matplotlib inline 
from astropy.visualization import astropy_mpl_style
import itertools
from itertools import combinations
from IPython.display import display

### Load Data


First, we assign the URL of the dataset to <code>path</code>.


In [ ]:
path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX043BEN/BTCBUSD_trades_1m.csv"

Use the Pandas method <code>read_csv()</code> to load the data from the web address. Set the parameter <code>index_col=0</code> in order to use the first column of the CSV file as the index of the dataframe.


In [ ]:
df = pd.read_csv(path, index_col=0)

Convert the dataframe index to <strong>datetime</strong> with the <code>pd.to_datetime()</code> method so that time series operations work correctly.


In [ ]:
df.index = pd.to_datetime(df.index)

In the previous lab we calculated technical financial indicators. Since their calculation relies on values from previous periods, the first few rows of the dataframe contain `NaN` values.

The methods for recovering missing data used in this module do not work correctly on the first rows of a time series. Therefore, we remove the rows that contain `NaN` values with the `df.dropna(inplace=True)` method.


In [ ]:
df.dropna(inplace=True)
df.head()
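To see why the first rows end up as `NaN`, here is a minimal sketch on made-up data. The 3-period rolling mean only stands in for a technical indicator (it is not one of the lab's indicators), and the `toy` frame, the `SMA3` column and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical 1-minute prices; the values are made up for illustration
toy = pd.DataFrame(
    {"Price": [10.0, 11.0, 12.0, 13.0, 14.0]},
    index=pd.date_range("2022-11-11 00:00", periods=5, freq="min"),
)

# A 3-period rolling mean needs 3 observations, so the first 2 rows are NaN
toy["SMA3"] = toy["Price"].rolling(window=3).mean()

# Drop the warm-up rows that contain NaN
toy_clean = toy.dropna()
print(toy_clean)
```

Only the leading warm-up rows are removed; every later row keeps its value.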

Let's check the dimensionality of our dataframe.


In [ ]:
df.shape

## 2. Analyzing Individual Feature Patterns Using Visualization


#### How to choose the right visualization method?
<p>When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.</p>


In [ ]:
# list the data types for each column
print(df.dtypes)

As you can see, all columns have correct types regarding their meanings. 


### Candlestick visualization


<p>A type of financial chart called a <strong>candlestick chart</strong> is used to show how the price of a security, derivative, or currency has changed over time.
</p>

Each candlestick contains all four crucial pieces of information for that day: open and close in the thick body; high and low in the "candle wick." This makes it similar to a bar chart. Due to its high information density, it frequently depicts trade trends over brief periods of time, such as a few days or trading sessions.


#### Candlestick chart for cryptocurrency

A candlestick displays the change in an asset's price over time and it is ideal for our case. Each candlestick, which serves as the fundamental indicator in a crypto chart, depicts a <em>particular price movement</em>, including <em>the opening</em> and <em>closing values</em> as well as <em>the highest</em> and <em>lowest price points</em>.


<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX043BEN/candlestick_chart.png" width="400" height="500">
</center>
Read more about candlestick charts <a href="https://en.wikipedia.org/wiki/Candlestick_chart?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX043BEN2378-2023-01-01">here</a>.


As we checked before, our dataframe has more than 16000 rows. That is too many to visualize the entire period of the price movement at 1-minute resolution. To deal with it, we apply <em><strong>data resampling</strong></em> from the previous module.


Let's downsample the series from 1-minute into 1-day intervals.


In [ ]:
df_candlestick = df.copy()

df_candlestick = df_candlestick.loc[:, 'Price':'Low'].resample("1D").agg({
    'Open': 'first',
    'High': 'max',
    'Close': 'last',
    'Low': 'min',
    'Volume': 'sum',
    'Count': 'sum',
    'Price': 'mean'
})
df_candlestick.head()
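The aggregation rules are easier to follow on a tiny example. The sketch below uses four made-up 1-minute bars that straddle midnight; the `bars` frame and its values are hypothetical and are not taken from the lab dataset.

```python
import pandas as pd

# Hypothetical 1-minute bars spanning two calendar days; values are made up
idx = pd.to_datetime([
    "2022-11-11 23:58", "2022-11-11 23:59",
    "2022-11-12 00:00", "2022-11-12 00:01",
])
bars = pd.DataFrame(
    {
        "Open":   [100.0, 101.0, 103.0, 102.0],
        "High":   [102.0, 104.0, 105.0, 103.0],
        "Low":    [99.0, 100.0, 101.0, 98.0],
        "Close":  [101.0, 103.0, 102.0, 99.0],
        "Volume": [5.0, 7.0, 3.0, 4.0],
    },
    index=idx,
)

# One daily candle per day: first open, highest high, lowest low, last close, total volume
daily = bars.resample("1D").agg(
    {"Open": "first", "High": "max", "Low": "min", "Close": "last", "Volume": "sum"}
)
print(daily)
```

Each column gets the aggregation that matches its meaning, and the result keeps the columns in the order of the dictionary keys.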

Now let's plot candlesticks for BTCBUSD currency.


In [ ]:
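# Lowercase the names in place so each label stays attached to its own data
# (the aggregated column order is Open, High, Close, Low, ...)
df_candlestick.columns = df_candlestick.columns.str.lower()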
fplt.plot(
            df_candlestick,
            type='candle',
            style='charles',
            title='BTCBUSD',
            ylabel='Price (BUSD)'
        )

From this candlestick chart, we can follow the historical prices of the BTCBUSD pair and get a good summary of the price's behavior.


The <code>mplfinance</code> library can also display the volume traded on each day. You can display the volume chart below the candlestick chart by simply passing <code>volume=True</code> to the <code>plot()</code> method. You can also pass <code>ylabel_lower</code> to change the y-axis label of the volume chart.

#### Why do we need volume in a candlestick chart?

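Volume shows how much trading activity lies behind each candle, which helps to judge how much weight a price move deserves. With <code>volume=True</code>, volume is drawn below the candlesticks as a series of bars. The bars are coloured like the candles: in the <code>charles</code> style, green bars mark days on which the price closed above its open, and red bars mark days on which it closed below.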


In [ ]:
fplt.plot(
            df_candlestick,
            type='candle',
            style='charles',
            title='BTCBUSD',
            ylabel='Price',
            volume=True,
            ylabel_lower='Volume',
            )

### Correlation


#### What is a correlation?

A statistical measure known as <strong>correlation</strong> expresses how closely two variables are related linearly (meaning they change together at a constant rate). It's a typical technique for describing straightforward connections without explicitly stating cause and effect.


We can calculate the correlation between numerical variables, i.e. columns of type <code>int64</code> or <code>float64</code>, using the method <code>corr()</code>.


Let's calculate the correlation between the BTCBUSD parameters and its indicators.


In [ ]:
# find correlation
corr = df.corr()
corr

#### How to interpret correlation results?


The magnitude of the correlation coefficient indicates the strength of the association. A correlation of -1.0 indicates a perfect negative correlation, and a correlation of 1.0 indicates a perfect positive correlation. If the correlation coefficient is greater than zero, it is a positive relationship. Conversely, if the value is less than zero, it is a negative relationship.
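A small made-up example makes the three reference cases concrete. The `toy` frame below is hypothetical: `up` and `down` are exact linear functions of `x`, while `flat` was chosen so that its linear association with `x` cancels out.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["up"] = 2 * toy["x"] + 1      # rises with x along a straight line
toy["down"] = -3 * toy["x"] + 10  # falls with x along a straight line
toy["flat"] = [2, 1, 3, 1, 2]     # no linear trend with x

# Pearson correlation of every pair of columns
toy_corr = toy.corr()
print(toy_corr["x"])
```

The column for `x` contains a correlation of 1 with `up`, -1 with `down`, and approximately 0 with `flat`.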


To better understand the concept of correlation we should visualize the strength of relationships between numerical variables using <b>correlation heatmaps</b>. 

#### What is Correlation Heatmap?

<strong>Correlation heatmaps</strong> are a type of plot that visualize the strength of relationships between numerical variables. 

Correlation plots are used to understand which variables are related to each other and the strength of this relationship.


In [ ]:
sns.heatmap(corr)

#### How do you interpret a correlation heatmap?

Correlation ranges from -1 to +1. Values close to 0 mean there is little or no linear trend between the two variables. The closer the correlation is to 1, the more positively associated they are; the closer it is to -1, the more negatively associated. The diagonal elements are always 1, since every variable is perfectly correlated with itself.

From the heatmap, it can be seen that Volume and NATR are the parameters most correlated with the price of our cryptocurrency. However, their correlation values are not that high.


We will study correlation more precisely (Pearson correlation in-depth) at the end of the notebook.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: 600;">Question #1:</b>

<p>Find the correlation between the BTCBUSD currency <strong>Price</strong> and its <strong>Volume, ADOSC, NATR, TRANGE indicators</strong> columns. 

Build a heatmap to understand the results more precisely.</p>
</div>


<strong><em>Note:</em></strong> if you would like to select certain columns, use the following syntax: <code>df[['Price','Volume', 'ADOSC', 'NATR', 'TRANGE']]</code>.


In [ ]:
# Write your code below and press Shift+Enter to execute

# find correlation
corr = df[['Price', 'Volume', 'ADOSC', 'NATR', 'TRANGE']].corr()
corr

<details><summary>Click here for the solution</summary>

```python
# find correlation
corr = df[['Price', 'Volume', 'ADOSC', 'NATR', 'TRANGE']].corr()
corr
```

</details>


In [ ]:
# Write your code below and press Shift+Enter to execute

# build heatmap
sns.heatmap(corr, annot=True)

<details><summary>Click here for the solution</summary>

```python
# build heatmap
sns.heatmap(corr, annot=True)
```

</details>


The Volume and NATR parameters have the highest correlation with the price of our coin, as can be seen from the heatmap. Their correlation values are not very high, though.
It is also noticeable that NATR and TRANGE are highly correlated with Volume. They can be useful for predicting its future behavior.


### Continuous Numerical Variables

#### What are Continuous numerical variables?

<p><strong>Continuous numerical variables</strong> are variables that may contain any value within some range. They can be of type <code>int64</code> or <code>float64</code>. A great way to visualize these variables is by using scatterplots with fitted lines.</p>

<p>In order to start understanding the <em>linear relationship</em> between an individual variable and the price, we can use <code>regplot</code> which plots the <strong>scatterplot</strong> plus the fitted regression line for the data.</p>

<pre>
<strong>A scatter plot</strong> is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data.
</pre>


Let's see several examples of different linear relationships:


#### Weak Linear Relationship


First, we declare a custom <code>regplot()</code> function responsible for plotting the linear regression.


In [ ]:
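def regplot(data: pd.DataFrame, x: str, y: str) -> None:
    """ Plot a scatterplot of y against x with a fitted linear regression line.
    """
    # the first parameter is named `data` so that it does not shadow the pandas module `pd`
    sns.regplot(x=x, y=y, data=data)
    # start the y-axis at 0 (negative values, if any, fall outside the plot)
    plt.ylim(0,)

    plt.xlabel(f"{x}, BUSD")
    plt.ylabel(f"{y}, BUSD")
    plt.show()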

Let's find the scatterplot of BTCBUSD price and Volume. 


In [ ]:
# Volume as a potential predictor variable of the BTCBUSD price
regplot(df, 'Price', 'Volume')

<p>Volume does not seem like a good predictor of the price at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore, it's not a reliable variable.</p>


We can examine the correlation between BTCBUSD price and Volume and see that it's approximately -0.0915.


In [ ]:
df[["Price", "Volume"]].corr()

#### Positive Linear Relationship


TRANGE may be a potential predictor variable of Volume. Let's find the scatterplot of "TRANGE" and "Volume".


In [ ]:
regplot(df, 'Volume', 'TRANGE')

<p>As TRANGE goes up, Volume goes up: this indicates a positive correlation between these two variables. TRANGE seems like a reasonable predictor of Volume since the regression line is almost aligned with the diagonal.</p>


We can examine the correlation between them and see it's approximately 0.789.


In [ ]:
df[["Volume", "TRANGE"]].corr()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">

<b style="font-size: 2em; font-weight: 600;">Question #2 a):</b>

<p>Find the correlation between <strong>Price</strong> and <strong>ADOSC</strong>.</p>
</div>


<strong><em>Note:</em></strong> if you would like to select those columns, use the following syntax: <code>df[['Price','ADOSC']]</code>.


In [ ]:
# Write your code below and press Shift+Enter to execute

# find the correlation
df[["Price", "ADOSC"]].corr()

<details><summary>Click here for the solution</summary>

```python
# find the correlation
df[["Price", "ADOSC"]].corr()
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: 600;">Question #2 b):</b>

<p>Given the correlation results between <strong>Price</strong> and <strong>ADOSC</strong>, do you expect a linear relationship?</p> 
<p>Verify your results using the function <code>regplot()</code>.</p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 

regplot(df, 'Price', 'ADOSC')

<details><summary>Click here for the solution</summary>

```python

#There is a weak correlation between the variables. As such regression will not work well. We can see this using "regplot" to demonstrate this.

#Code: 
regplot(df, 'Price', 'ADOSC')

```

</details>


There is a weak correlation between the variables Price and ADOSC. As such regression will not work well.


### Categorical Variables

<p><strong>Categorical variables</strong> are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type <code>object</code> or <code>int64</code>. A good way to visualize categorical variables is by using <strong>boxplots.</strong></p>


Let's create the following categorical values: 


In [ ]:
group_names = ["Low", "Medium", "High"]

Similar to the previous module, declare a function to create categorical variables for a given dataframe column.


In [ ]:
def to_categorical(column: pd.Series, categories: List[str]) -> pd.Series:
    """ Return categorical variables for a given dataframe column.
    """
    bins = np.linspace(min(column), max(column), len(categories) + 1)
    return pd.cut(column, bins, labels=categories, include_lowest=True)
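To make the equal-width binning concrete, here is the same idea applied to a handful of made-up numbers. The `values` series is hypothetical; the bin edges are computed the same way as in `to_categorical()`.

```python
import numpy as np
import pandas as pd

values = pd.Series([0.0, 1.0, 2.0, 4.0, 5.0, 9.0])  # made-up data
labels = ["Low", "Medium", "High"]

# Three equal-width bins between the minimum and maximum: edges 0, 3, 6, 9
edges = np.linspace(values.min(), values.max(), len(labels) + 1)

# include_lowest=True makes the first bin closed on the left, so the minimum is not lost
binned = pd.cut(values, edges, labels=labels, include_lowest=True)
print(binned.tolist())
```

Because the bins have equal width rather than equal size, the number of observations per category can be very uneven, which is worth keeping in mind when reading the value counts later on.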

Let's apply the declared function to the BTCBUSD price.


In [ ]:
df['Price_binned'] = to_categorical(df['Price'], group_names)
df[['Price', 'Price_binned']].head()

In [ ]:
df['Price_binned'].value_counts()

Great! Now we are one step closer to fully understanding our dataset.


#### Visualization of categorical variables


#### What is  a boxplot?

<strong><em>A boxplot</em></strong> is a graph that gives you a good indication of how the values in the data are spread out. We use a boxplot below to analyze the relationship between a categorical feature and a continuous feature.
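The numbers behind a boxplot can be computed directly. The sketch below uses made-up values and assumes the common 1.5 × IQR rule for the whiskers, which is the default in matplotlib and seaborn; the variable names are illustrative only.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # made-up values; 30 is an outlier

# The box spans the first to the third quartile, with a line at the median
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3, iqr, outliers)
```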


Let's look at the relationship between <strong>"Price"</strong> and <strong>"Price_binned"</strong>.


In [ ]:
sns.boxplot(x='Price', y='Price_binned', data=df)

Here we see that the price distributions of the three categories are clearly separated, so the price alone determines which category an observation falls into. This is expected, since the categories were created by binning the price.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: 600;">Question #3:</b>

<p>Create the same categories for <strong>Volume</strong> and <strong>indicators ADOSC, NATR, TRANGE.</strong></p>
</div>


<strong><em>Note:</em></strong> use the declared <code>to_categorical()</code> function.


In [ ]:
# Write your code below and press Shift+Enter to execute

# create categories for Volume and the indicators
indicators = ['Volume', 'ADOSC', 'NATR', 'TRANGE']
for indicator in indicators:
    df[f'{indicator}_binned'] = to_categorical(df[indicator], group_names)

# filter display categorical columns
df.filter(regex='_binned').head()

<details><summary>Click here for the solution</summary>

```python 
# create categories for Volume and the indicators
indicators = ['Volume', 'ADOSC', 'NATR', 'TRANGE']
for indicator in indicators:
    df[f'{indicator}_binned'] = to_categorical(df[indicator], group_names)

# filter display categorical columns
df.filter(regex='_binned').head()

```

</details>


## 3. Descriptive Statistical Analysis


<p>Let's first take a look at the variables by utilizing a description method.</p>

<p>The <b>describe</b> function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.</p>

This will show:

<ul>
    <li>the count of that variable</li>
    <li>the mean</li>
    <li>the standard deviation (std)</li> 
    <li>the minimum value</li>
    <li>the quartiles (25%, 50% and 75%), from which the IQR (interquartile range) can be derived</li>
    <li>the maximum value</li>
</ul>


We can apply the method "describe" as follows:


In [ ]:
df.describe()

By default, "describe" skips non-numeric columns. We can apply the method "describe" to the categorical (<code>category</code>) columns as follows:


In [ ]:
df.describe(include=['category'])

#### Value Counts


<p><strong>Value counts</strong> is a good way of understanding how many units of each characteristic/variable we have. We can apply the <code>value_counts()</code> method on the column <strong>"Price_binned"</strong>. Here we call <code>value_counts()</code> on a pandas series, so we use single brackets <code>df['Price_binned']</code>, not double brackets <code>df[['Price_binned']]</code>, which would select a dataframe.</p>


In [ ]:
df['Price_binned'].value_counts()

We can convert the series to a dataframe as follows:


In [ ]:
df['Price_binned'].value_counts().to_frame()

Let's repeat the above steps but save the result to the dataframe <strong>"BTCBUSD_Price_binned_counts"</strong> and name its column <strong>"value_counts"</strong>.


In [ ]:
# name the counts column explicitly so that the result does not depend on the pandas version
BTCBUSD_Price_binned_counts = df['Price_binned'].value_counts().to_frame(name='value_counts')
BTCBUSD_Price_binned_counts

Now let's rename the index to <strong>"category"</strong>:


In [ ]:
BTCBUSD_Price_binned_counts.index.name = 'category'
BTCBUSD_Price_binned_counts

Now let's add indicators into this dataframe:


In [ ]:
category_count = BTCBUSD_Price_binned_counts.copy()
category_count.rename(columns = {'value_counts':'BTCBUSD_value_counts'}, inplace = True)

for indicator in indicators:
    column_name = f'{indicator}_binned'
    # count number of values
    counts = df[column_name].value_counts().to_frame()
    # add new column to category_count
    category_count[f'{indicator}_value_counts'] = counts

category_count

## 4. Basics of Grouping


<p>The <code>groupby</code> method groups data by different categories. The data is grouped based on one or several variables, and analysis is performed on the individual groups.</p>

<p>For example, let's group by the variable <strong>"Price_binned"</strong>. We see that there are 3 different categories of price.</p>


In [ ]:
df['Price_binned'].unique()

<p>If we want to know the average price within each <strong>"Price_binned"</strong> category, we can group by <strong>"Price_binned"</strong> and then average.</p>

<p>We can select the columns <code>'Price_binned'</code> and <code>'Price'</code>, then assign the result to the variable <strong>"df_group_one".</strong></p>


In [ ]:
df_group_one = df[['Price_binned', 'Price']]

We can then calculate the average price for each of the different categories of data.


In [ ]:
# grouping results
df_group_one = df_group_one.groupby(['Price_binned'], as_index=True).mean()
df_group_one

<p>Unsurprisingly, the <strong>'High'</strong> category has the highest average price.</p>

<p>You can also group by multiple variables. For example, let's group by both <code>'Price_binned'</code> and <code>'ADOSC_binned'</code>. This groups the dataframe by the unique combinations of <code>'Price_binned'</code> and <code>'ADOSC_binned'</code>. We can store the results in the variable <code>'grouped_test1'</code>.</p>


In [ ]:
# grouping results
df_gptest = df[['Price_binned','ADOSC_binned','Price']]
grouped_test1 = df_gptest.groupby(['Price_binned','ADOSC_binned'],as_index=False).mean()
grouped_test1

This grouped data is much easier to visualize when it is made into a <strong>pivot table</strong>. 

#### What is a pivot table?
A pivot table is like an Excel spreadsheet, with one variable along the columns and another along the rows. We can convert the grouped dataframe to a pivot table using the method <code>pivot</code>.

<p>In this case, we will leave the <strong>'Price_binned'</strong> variable as the rows of the table, and pivot <strong>'ADOSC_binned'</strong> to become the columns of the table:</p>


In [ ]:
grouped_pivot = grouped_test1.pivot(index='Price_binned',columns='ADOSC_binned')
grouped_pivot

Often, we won't have data for some of the pivot cells. We can fill these missing cells with the value 0 by <code>.fillna(0)</code>, but any other value could potentially be used as well. It should be mentioned that missing data is quite a complex subject and is an entire course on its own.
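As an illustration of an empty pivot cell and of `.fillna(0)`, here is a minimal sketch on made-up grouped data. The frame `grouped` and the column names `price_bin` and `adosc_bin` are hypothetical stand-ins for the lab's binned columns.

```python
import pandas as pd

# Hypothetical grouped averages; the combination (High, Low) never occurs
grouped = pd.DataFrame({
    "price_bin": ["Low", "Low", "High"],
    "adosc_bin": ["Low", "High", "High"],
    "Price": [16000.0, 16200.0, 17000.0],
})

# Rows come from price_bin, columns from adosc_bin, cell values from Price
pivot = grouped.pivot(index="price_bin", columns="adosc_bin", values="Price")

# The missing combination shows up as NaN; replace it with 0
filled = pivot.fillna(0)
print(filled)
```

Whether 0 is a sensible replacement depends on what the cell means: for an average price, 0 reads as "no observations" rather than as a real price.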


We can also use a cross-tabulation to see how many observations fall into each combination of categories.


In [ ]:
crossed_table = pd.crosstab(df['Price_binned'], df['ADOSC_binned'])
crossed_table

The table suggests that the 'Medium' ADOSC bin is fairly well distributed across the Price bins.
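Raw counts can be hard to compare when the categories have very different sizes. One option, sketched below on made-up labels, is to pass `normalize="index"` to `pd.crosstab` so that each row is expressed as shares that sum to 1. The series `a` and `b` are hypothetical.

```python
import pandas as pd

# Hypothetical category labels for five observations
a = pd.Series(["Low", "Low", "High", "High", "High"], name="price_bin")
b = pd.Series(["Low", "High", "High", "High", "Low"], name="adosc_bin")

counts = pd.crosstab(a, b)                     # number of observations per combination
shares = pd.crosstab(a, b, normalize="index")  # each row divided by its row total

print(counts)
print(shares)
```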


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: 600;">Question #4:</b>

<p>Use the <code>groupby()</code> function to find <strong>the average TRANGE value</strong> of each category based on the <strong>'TRANGE_binned'</strong>.</p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 

# grouping results
df_gptest2 = df[['TRANGE_binned','TRANGE']]
grouped_test_bodystyle = df_gptest2.groupby(['TRANGE_binned'],as_index=True).mean()
grouped_test_bodystyle

<details><summary>Click here for the solution</summary>

```python

# grouping results
df_gptest2 = df[['TRANGE_binned','TRANGE']]
grouped_test_bodystyle = df_gptest2.groupby(['TRANGE_binned'],as_index=True).mean()
grouped_test_bodystyle

```

</details>


Well done! Let's move to visualization. 


### Visualization of relationships 


Visualization methods show relationships and connections between the data or show correlations between two or more variables.

Execute the following to apply the astropy plotting style with the grid turned off.


In [ ]:
astropy_mpl_style['axes.grid'] = False
plt.style.use(astropy_mpl_style)

Let's use a heat map to visualize the relationship between <strong>Price_binned</strong> <em>vs.</em> <strong>ADOSC_binned</strong>.


In [ ]:
#use the grouped results
plt.pcolor(crossed_table, cmap='RdBu')
plt.colorbar()
plt.show()

<p>The heatmap shows the number of observations in each cell of the crossed table as colour, with <strong>'Price_binned'</strong> on the vertical axis and <strong>'ADOSC_binned'</strong> on the horizontal axis. This allows us to see how the ADOSC categories are distributed across the price categories.</p>

<p>The default labels convey no useful information to us. Let's change that:</p>


In [ ]:
fig, ax = plt.subplots()
im = ax.pcolor(crossed_table, cmap='RdBu')

#label names: table rows (Price_binned) go on the y-axis, table columns (ADOSC_binned) on the x-axis
row_labels = crossed_table.index
col_labels = crossed_table.columns

#move ticks and labels to the center
ax.set_xticks(np.arange(crossed_table.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(crossed_table.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(col_labels, minor=False)
ax.set_yticklabels(row_labels, minor=False)

plt.xlabel("ADOSC")
plt.ylabel("BTCBUSD")

#rotate label if too long
plt.xticks(rotation=90)

fig.colorbar(im, label='Count')
plt.show()

Now we can see more precisely how the 'Medium' ADOSC bin is distributed over the BTCBUSD price categories. The same cannot be said about the other ADOSC bins.


<p>Visualization is very important in data science, and Python visualization packages provide great freedom. We will go more in-depth in a separate Python visualizations course.</p>

<p>The main question we want to answer in this module is <strong><em>"What are the main characteristics that have the most impact on the price of cryptocurrency?"</em></strong>.</p>

<p>To get a better measure of the important characteristics, we look at the correlation of our currency with its indicators. In other words: <strong><em>how is the price dependent on other variables?</em></strong></p>


## 5. Correlation and Causation


#### What is correlation and causation?

<strong>Correlation</strong>: a measure of the extent of interdependence between variables.

<strong>Causation</strong>: the relationship between cause and effect between two variables.

<em><strong>It is important to know the difference between these two</strong></em>.

<ul>
    <li>Correlation does not imply causation.</li>
    <li>Determining correlation is much simpler than determining causation, as causation may require independent experimentation.</li>
</ul>


#### Pearson Correlation
<p><strong><em>The Pearson Correlation</em></strong> measures the linear dependence between two variables X and Y. In addition, it estimates the strength of the relationship between the two continuous variables.</p>
<p>The resulting coefficient is a value between <strong>-1</strong> and <strong>1</strong> inclusive, where:</p>
<ul>
    <li><strong>1</strong>: Perfect positive linear correlation.</li>
    <li><strong>0</strong>: No linear correlation; the two variables may still be related in a non-linear way.</li>
    <li><strong>-1</strong>: Perfect negative linear correlation.</li>
</ul>

The formula for the Pearson Correlation between X and Y is:

$$
r = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}
$$
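The formula can be translated into code almost term by term. The function `pearson_r` below is a hypothetical helper written for illustration (it is not part of pandas or SciPy), the data are made up, and the result is cross-checked against `np.corrcoef`.

```python
import math
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed directly from the sums in the formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

x = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up data
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(x, y)
print(round(r, 4), round(np.corrcoef(x, y)[0, 1], 4))
```

For real data, prefer the library routines; the sum-based formula can lose precision when the values are large compared with their spread.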


<p>Pearson Correlation is the default method of the function <code>corr()</code>. Like before, we can calculate the Pearson Correlation of the <code>int64</code> or <code>'float64'</code> variables.</p>


In [ ]:
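# the *_binned columns are categorical, so restrict the calculation to numeric columns
df.select_dtypes(include='number').corr()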

Sometimes we would like to know the significance of the correlation estimate.


#### P-value

<strong>What is this P-value?</strong> 

<em><strong>The P-value</strong></em> is the probability of observing a correlation at least as strong as the one in our sample if the two variables were in fact uncorrelated. A small P-value is therefore evidence against the "no correlation" hypothesis. Normally, we choose a significance level of 0.05: if the P-value is below it, we call the correlation statistically significant.

By convention, when

<ul>
    <li><em>the p-value</em> is $<$ <strong>0.001</strong>: we say there is strong evidence that the correlation is significant.</li>
    <li><em>the p-value</em> is $<$ <strong>0.05</strong>: there is moderate evidence that the correlation is significant.</li>
    <li><em>the p-value</em> is $<$ <strong>0.1</strong>: there is weak evidence that the correlation is significant.</li>
    <li><em>the p-value</em> is $>$ <strong>0.1</strong>: there is no evidence that the correlation is significant.</li>
</ul>
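The convention above can be written as a small lookup. The function name `evidence_label` and the returned strings are hypothetical; only the thresholds come from the list.

```python
def evidence_label(p_value: float) -> str:
    """Map a p-value to the conventional strength of evidence."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"

for p in (0.0002, 0.03, 0.08, 0.4):
    print(p, evidence_label(p))
```

The thresholds are checked from the smallest upwards, so each p-value receives the strongest label it qualifies for.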


Let's calculate the  Pearson Correlation Coefficient and P-value of our currency price and indicators.


In [ ]:
df_stats = pd.DataFrame({"indicator":[], "pearson": [], "p-value": []})
pd.options.display.float_format = '{:.3f}'.format

for indicator in indicators:
    pearson_coef, p_value = stats.pearsonr(df['Price'], df[indicator])
    df_stats.loc[len(df_stats.index)] = [indicator, pearson_coef, p_value]

df_stats.set_index('indicator', inplace=True)
df_stats

#### Conclusion:
<p>Since the <strong><em>p-value is $<$ 0.001</em></strong>, the correlation between our currency price and the other parameters is statistically significant, although the linear relationship is weak because the correlation coefficients are close to zero.</p>


## 6. ANOVA


### Analysis of Variance (ANOVA)
<p><strong>The Analysis of Variance (ANOVA)</strong> is a statistical method used to test whether there are significant differences between the means of two or more groups.</p>
    
ANOVA returns two parameters: <strong><em>F-test score</em></strong> and <strong><em>P-value</em></strong>.

<p><strong>F-test score</strong>: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.</p>
<p><strong>P-value</strong>:  P-value tells how statistically significant our calculated score value is.</p>

<p>If our price variable is strongly correlated with the variable we are analyzing, we expect ANOVA to return a <strong><em>sizeable F-test score</em></strong> and a <strong><em>small p-value</em></strong>.</p>
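To show what the F-test score measures, the sketch below computes it by hand for three small made-up groups and compares the result with `stats.f_oneway`. The group values are invented for illustration.

```python
import numpy as np
from scipy import stats

# Three hypothetical groups with means 2, 3 and 6
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([5.0, 6.0, 7.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)      # number of groups
n = len(all_values)  # total number of observations

# Variation of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the between-group to the within-group mean square
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy, p_scipy)
```

A large F means the group means are far apart compared with the spread inside each group.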


### Category


<p>Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.</p>

<p>To see if different types of <strong>'Price_binned'</strong> impact <strong>'Price'</strong>, we group the data.</p>


In [ ]:
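# group by a single column name (not a one-element list) so that get_group('Medium') works with a plain label
grouped_test2 = df[['Price', 'Price_binned']].groupby('Price_binned')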
grouped_test2.head(2)

We can obtain the values of a particular group using the method <code>get_group()</code>.


In [ ]:
grouped_test2.get_group('Medium')['Price'].to_frame()

We can use the function <code>f_oneway</code> in the module <strong>'stats'</strong> to obtain the <strong>F-test score</strong> and <strong>P-value</strong>.


In [ ]:
# ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group('Low')['Price'],
                              grouped_test2.get_group('Medium')['Price'],
                              grouped_test2.get_group('High')['Price'])

print( "ANOVA results: F=", f_val, ", P =", p_val)

This is a great result: the large F-test score shows a strong difference between the group means, and a P-value of 0 implies almost certain statistical significance. But does this mean every pair of groups differs this strongly?


#### ANOVA for different category combinations


In [ ]:
df_ANOVA = pd.DataFrame({"Price categories":[], "f-score": [], "p-value": []})

combinator = list(itertools.combinations(group_names, 2))
for first, second in combinator:
    f_val, p_val = stats.f_oneway(grouped_test2.get_group(first)['Price'], 
                                  grouped_test2.get_group(second)['Price'])
    df_ANOVA.loc[len(df_ANOVA.index)] = [f'{first} and {second}', f_val, p_val]

df_ANOVA.set_index('Price categories', inplace=True)
df_ANOVA

As you can see, every pair of categories gives a large F-test score and a P-value of 0, which implies statistical significance. This means the mean prices of all these groups differ significantly from each other.


### ANOVA for indicators


Let's check how strongly the groups of each indicator differ.


In [ ]:
for indicator in indicators:
    grouped_test_indicator = df[[indicator, f'{indicator}_binned']].groupby([f'{indicator}_binned'])

    df_ANOVA = pd.DataFrame({f'{indicator} categories':[], 'f-score': [], 'p-value': []})

    # ANOVA
    f_val, p_val = stats.f_oneway(grouped_test_indicator.get_group('Low')[indicator],
                                  grouped_test_indicator.get_group('Medium')[indicator],
                                  grouped_test_indicator.get_group('High')[indicator])
    df_ANOVA.loc[len(df_ANOVA.index)] = ['High, Medium and Low', f_val, p_val]

    combinator = list(itertools.combinations(group_names, 2))
    for first, second in combinator:
        f_val, p_val = stats.f_oneway(grouped_test_indicator.get_group(first)[f'{indicator}'], 
                                      grouped_test_indicator.get_group(second)[f'{indicator}'])
        df_ANOVA.loc[len(df_ANOVA.index)] = [f'{first} and {second}', f_val, p_val]

    df_ANOVA.set_index(f'{indicator} categories', inplace=True)
    display(df_ANOVA)
    print()

We can see that the Volume parameter has the highest F-scores across its groups of categories, while the other indicators show slightly lower values. However, every P-value is close to 0, which implies statistical significance.


## 7. Durbin Watson Test


#### What Is the Durbin Watson Statistic?

<strong>The Durbin Watson (DW) statistic</strong> is a test for autocorrelation in the residuals from a statistical model or regression analysis. The Durbin-Watson statistic will always have <em>a value ranging between 0 and 4</em>.

<li><strong>a value of 2.0</strong> indicates there is no autocorrelation detected in the sample.</li> 
<li><strong>values from 0 to less than 2</strong> point to positive autocorrelation.</li>
<li><strong>values above 2 up to 4</strong> point to negative autocorrelation.</li>

<br>
<p>
A rule of thumb is that DW test statistic values <em><strong> in the range of 1.5 to 2.5 are relatively normal</strong></em>. Values outside this range could, however, be a cause for concern. 
</p>


This test uses the following hypotheses:
<li><strong>$H_0$ (null hypothesis):</strong> There is no correlation among the residuals.</li>
<li><strong>$H_A$ (alternative hypothesis):</strong> The residuals are autocorrelated.</li>
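
For residuals $e_1, \dots, e_T$ the statistic is $DW = \sum_{t=2}^{T}(e_t - e_{t-1})^2 / \sum_{t=1}^{T} e_t^2$. The sketch below computes it by hand and compares it with <code>durbin_watson</code> from statsmodels. The two synthetic series stand in for regression residuals and are assumptions made purely for illustration, as is the helper name <code>dw_manual</code>.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

def dw_manual(resid) -> float:
    """Durbin-Watson: sum of squared successive differences over sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

rng = np.random.default_rng(0)
# White noise: no autocorrelation, so DW should be close to 2.
white = rng.normal(size=5_000)
# Random walk: strong positive autocorrelation, so DW should be close to 0.
walk = np.cumsum(white)

for name, series in [("white noise", white), ("random walk", walk)]:
    print(name, dw_manual(series), durbin_watson(series))
```

Slowly drifting series such as prices tend to behave like the random walk here, which is why regressions on price levels often produce DW values far below 1.5.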


Let's implement a function that creates regression models:


In [ ]:
def get_reg(x: pd.Series, y: pd.Series):
    """ Return regression model.
    """
    # to get intercept
    X = sm.add_constant(x)
    # fit the regression model
    reg = sm.OLS(y, X).fit()
    return reg

Now let's test this function and perform Durbin Watson Test.


In [ ]:
# independent
X = df['NATR']
# dependent
y = df['Price']
reg = get_reg(X, y)

In [ ]:
print('DW test stats:', durbin_watson(resids=np.array(reg.resid)))

The DW test statistic is NOT in the range of 1.5 to 2.5. Therefore, the residuals of the regression of the price on NATR are autocorrelated, which is a cause for concern.


Now that we know how to calculate the Durbin-Watson statistic, we can evaluate it for the price against each of the indicators.


In [ ]:
df_durbin = pd.DataFrame({'indicator':[], 'durbin-watson': []})

for indicator in indicators:
    # independent
    X = df[indicator]
    # dependent
    y = df['Price']
    reg = get_reg(X, y)
    dw = durbin_watson(resids=np.array(reg.resid))
    df_durbin.loc[len(df_durbin.index)] = [indicator, dw]

df_durbin.set_index('indicator', inplace=True)
df_durbin

For the other indicators we observe the same situation: the Durbin-Watson statistic falls outside the range of 1.5 to 2.5, so the influence of the indicators on the price trend of the cryptocurrency is not confirmed.


Let's calculate Durbin-Watson for all available indicator pairs.


In [ ]:
variables = ['Price', 'Volume', 'ADOSC', 'NATR', 'TRANGE']

cols = [f"{variable}_dep" for variable in variables]
idxs = [f"{variable}_ind" for variable in variables]

dw_df = pd.DataFrame(columns=cols, index=idxs)

for (curr1, curr2) in itertools.permutations(variables, 2):
    # independent variable
    X = df[f"{curr1}"]
    # dependent variable
    y = df[f"{curr2}"]
    # to get intercept
    X = sm.add_constant(X)
    # fit the regression model
    reg = sm.OLS(y, X).fit()
    dw = durbin_watson(resids=np.array(reg.resid))
    dw_df.loc[f"{curr1}_ind", f"{curr2}_dep"] = dw
    
dw_df

Replace the NaN values on the diagonal with blanks.


In [ ]:
np.fill_diagonal(dw_df.values, ' ')
dw_df

#### Conclusion


Each variable has a row, where it acts as the independent variable, and a column, where it acts as the dependent variable.

If we look at the column of the dependent variable Price, we see that none of the independent variables (indicators) gives a value in the range of 1.5 to 2.5. This suggests that the listed indicators cannot be good predictors of the BTCBUSD price in a simple linear regression.


## 8. Granger-Causality Test


#### What is Granger-Causality test?

<strong>The Granger causality test</strong> is a statistical hypothesis test for determining whether one time series is useful in forecasting another. If the p-value is less than the chosen <em>significance level</em>, the null hypothesis is rejected at that level.


#### What is the difference between correlation and Granger causality?

<strong><em>Correlation</em></strong> is a measure of linear dependence between two random variables. So no additional variables are involved in the calculation of the correlation between X and Z, and also, in principle these variables may be just random variables and not time series.

<strong><em>Granger causality</em></strong> is a concept of marginal predictability. So here the time dimension of the potential relationship between X and Z is important.


#### What is the null hypothesis in Granger Causality test?

<strong>The null hypothesis ($H_0$)</strong> for the test is that lagged x-values do not explain the variation in y. In other words, it assumes that $x_t$ doesn't Granger-cause $y_t$. Theoretically, you can run the Granger Test to find out if two variables are related at an instantaneous moment in time.


Let's visualize the price trend over a specific time period for the BTCBUSD currency and the ADOSC indicator.


In [ ]:
x = df.head(9000).index
y1 = df['Price'].head(9000)
y2 = df['ADOSC'].head(9000)

# # Plot Line1 (Left Y Axis)
fig, ax1 = plt.subplots(1,1,figsize=(16,9), dpi= 80)
ax1.plot(x, y1, color='tab:red')

# # Plot Line2 (Right Y Axis)
# instantiate a second axes that shares the same x-axis
ax2 = ax1.twinx()
ax2.plot(x, y2, color='tab:blue')

# Decorations
# ax1 (left Y axis)
ax1.set_xlabel('Time', fontsize=20)
ax1.tick_params(axis='x', rotation=0, labelsize=12)
ax1.set_ylabel('Price', color='tab:red', fontsize=20)
ax1.tick_params(axis='y', rotation=0, labelcolor='tab:red' )
ax1.grid(alpha=.4)

# # ax2 (right Y axis)
ax2.set_ylabel('ADOSC', color='tab:blue', fontsize=20)
ax2.tick_params(axis='y', rotation=0, labelsize=12, labelcolor='tab:blue')
ax2.set_title("Visualizing Leading Indicator Phenomenon", fontsize=22)

fig.tight_layout()
plt.show()

Let's import the function that implements the Granger causality test from the statsmodels module and declare a helper that computes it for every pair of series.


In [ ]:
from typing import List

from statsmodels.tsa.stattools import grangercausalitytests

def grangers_causation_matrix(data: pd.DataFrame, maxlag: int, columns: List[str]) -> pd.DataFrame:
    """Check Granger Causality of all possible combinations of the Time series.
    data      : pandas dataframe containing the time series variables
    maxlag    : a maximum possible time delay
    columns   : list containing names of the time series variables.
    Returns a dataframe of p-values: columns (suffix _x) are the candidate
    causes, rows (suffix _y) are the responses.
    """
    df = pd.DataFrame(np.zeros((len(columns), len(columns))), columns=columns, index=columns)
    for c in df.columns:
        for r in df.index:
            # tests whether the second column (c) Granger-causes the first one (r)
            test_result = grangercausalitytests(data[[r, c]], maxlag=maxlag, verbose=False)
            # p-value of the SSR-based chi-square test for every lag from 1 to maxlag
            p_values = [round(test_result[i+1][0]['ssr_chi2test'][1],4) for i in range(maxlag)]
            # keep the smallest p-value across the tested lags
            min_p_value = np.min(p_values)
            df.loc[r, c] = min_p_value
    df.columns = [col + '_x' for col in columns]
    df.index = [col + '_y' for col in columns]
    return df

Let's perform the Granger Causality test on the price of the BTCBUSD currency and the ADOSC indicator.


In [ ]:
grangers_causation_matrix(df[['Price', 'ADOSC']], 1, columns=['Price', 'ADOSC'])

#### How are the P-values to be read?

If the P-value is less than 0.05, then, assuming a significance level of 0.05, we reject the null hypothesis that X does not Granger-cause Y.
The p-value for <strong>ADOSC_x</strong> and <strong>Price_y</strong> in the above table is equal to 0.0, so we reject the null hypothesis and conclude that ADOSC Granger-causes the BTCBUSD price.

<em>Therefore, it is likely that the ADOSC movement will be useful in projecting the BTCBUSD price.</em>

<p>
In the opposite direction, <strong>Price_x</strong> and <strong>ADOSC_y</strong> have a P-value of 0.0118. This is also below 0.05, so the null hypothesis is rejected at the 0.05 level as well, although the evidence is weaker: at a stricter significance level of 0.01 the null hypothesis could not be rejected.

<em>In other words, the evidence that ADOSC can be predicted from the BTCBUSD price is weaker than the evidence for the opposite direction.</em>
</p>
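
The reading rule above can be applied to a whole table at once. The following is a minimal sketch with a hypothetical helper <code>flag_significant</code> and a made-up p-value table in the same layout (columns are candidate causes, rows are responses). It assumes every cell is numeric.

```python
import pandas as pd

def flag_significant(p_values: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Return True where the p-value is below alpha (null hypothesis rejected)."""
    return p_values.astype(float) < alpha

# Hypothetical p-values, NOT results from the lab:
# columns = candidate cause ("_x"), rows = response ("_y").
p_table = pd.DataFrame(
    {"A_x": [1.0, 0.30], "B_x": [0.002, 1.0]},
    index=["A_y", "B_y"],
)

decisions = flag_significant(p_table)
print(decisions)
```

A <code>True</code> cell means that, at the chosen significance level, the series in the column Granger-causes the series in the row.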


Let's calculate the Granger Causality test for all available indicator pairs.


In [ ]:
cols = ['Price_y'] + [f"{indicator}_y" for indicator in indicators]
idxs = ['Price_x'] + [f"{indicator}_x" for indicator in indicators]

df_granger = pd.DataFrame(columns=cols, index=idxs)

for (curr1, curr2) in itertools.permutations(['Price'] + indicators, 2):
    df_test = df[[f"{curr1}", f"{curr2}"]]
    # the helper's third parameter is named `columns`
    res_df = grangers_causation_matrix(df_test, 1, columns=list(df_test.columns))
    p1 = res_df[f"{curr1}_x"][f"{curr2}_y"]
    p2 = res_df[f"{curr2}_x"][f"{curr1}_y"]
    df_granger.loc[f"{curr1}_x", f"{curr2}_y"] = p1
    df_granger.loc[f"{curr2}_x", f"{curr1}_y"] = p2

# replace diagonal values with space char
np.fill_diagonal(df_granger.values, ' ')

df_granger

Since we are interested in the influence and possible prediction of the price of our cryptocurrency with the help of indicators, we will consider the BTCBUSD price as a dependent variable <strong>Price_y</strong>.

Hence, the p-values for <strong>Volume_x</strong>, <strong>ADOSC_x</strong>, <strong>NATR_x</strong>, <strong>TRANGE_x</strong> and <strong>Price_y</strong> in the above table are less than 0.05, so we reject the null hypothesis and conclude that these indicators Granger-cause the BTCBUSD price.

<em>Therefore, it is likely that the movement of all the indicators will be useful in projecting the BTCBUSD price.</em>


### Conclusion:


<p>We now have a better idea of what our data looks like and which variables are more closely related to the BTCBUSD price. Most of the indicators are in some way related to the BTCBUSD price and can be used to predict its value. We have also narrowed the most related indicators down to the following variables:</p>
<ul>
    <li>Volume</li>
    <li>ADOSC</li>
    <li>NATR</li>
</ul>

<p>As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.</p>


## Save Dataset
<p>
Pandas also enables us to save the dataset to CSV. With the <code>dataframe.to_csv()</code> method, you pass the file path and name in quotation marks inside the brackets.
</p>
<p>
Let's save the dataframe <strong>df</strong> as <strong>BTCBUSD_trades_1m.csv</strong> in order to use it in the following modules. You may use the syntax below, where <code>index = True</code> means the row names will be written as well.
</p>


In [ ]:
df.to_csv("BTCBUSD_trades_1m.csv", index=True)
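
To check that a file saved this way can be loaded back with its index intact, a minimal round-trip sketch is shown below. It uses a small stand-in dataframe with made-up values and a temporary file instead of the lab's <strong>df</strong>, and it assumes the index holds timestamps.

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame (hypothetical values), indexed by timestamp.
demo = pd.DataFrame(
    {"Price": [16800.5, 16810.0], "Volume": [12.3, 8.7]},
    index=pd.to_datetime(["2022-11-11 00:00", "2022-11-11 00:01"]),
)
demo.index.name = "Time"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    demo.to_csv(path, index=True)
    # index_col=0 restores the saved index; parse_dates asks pandas to parse it as dates.
    restored = pd.read_csv(path, index_col=0, parse_dates=True)

print(restored)
```

Without <code>index_col=0</code> the saved index would come back as an ordinary column, so the following modules would need to set it again.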

Great! You have successfully reached the end!


### Thank you for completing this lab!

## Authors

<a href="https://author.skills.network/instructors/yaryna_beida?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX043BEN2378-2023-01-01">Yaryna Beida</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX043BEN2378-2023-01-01">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/mariya_fleychuk?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX043BEN2378-2023-01-01">Prof. Mariya Fleychuk, DrSc, PhD</a>


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                 |
| ----------------- | ------- | ---------- | ---------------------------------- |
|     2023-03-11    |   1.0   |Yaryna Beida| Lab created                        |

<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
