<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="500" alt="cognitiveclass.ai logo">
</center>

#  **Investigation of cryptocurrency exchange rate dynamic (on the example of cryptocurrency pair MATIC/BUSD), сalculation and analysis of technical financial indicators, characterizing the cryptocurrency market (the example of ATR, OBV, RSI, AD)**


## **Lab 3. Data Analysis with Python**

Estimated time needed: **30** minutes

## **The tasks**

* Explore features or characteristics to predict price of cryptocurrency;
* Visualize cryptocurrency dynamics using Candlestick Chart;
* Estimate high or low relationships level between cryptocurrency characteristics and indicators;
* Perform financial statistic tests.

## **Objectives**

After completing this lab you will be able to:

* find correlation and causation between cryptocurrencies indicators;
* group data;
* evaluate Durbin-Watson Test, Granger Causality test etc.;
* use Analysis of Variance (ANOVA).


## **Table of Contents**

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li>Import Data</li>
    <li>Analyzing Individual Feature Patterns using Visualization</li>
    <ul>
        <li>Choosing the right visualization method</li>
        <li>Candlestick chart</li>
        <li>Correlation calculation</li>
        <li>Continuous Numerical Variables</li>
        <li>Linear Relationship</li>
        <li>Categorical variables</li>
    </ul>
    <li>Descriptive Statistical Analysis</li>
    <ul>
        <li>Value Counts</li>
    </ul>
    <li>Basics of Grouping</li>
    <li>Correlation and Causation</li>
    <ul>
        <li>Pearson Correlation</li>
        <li>P-value</li>
    </ul>
    <li>ANOVA: Analysis of Variance</li>
    <li>Durbin-Watson Test</li>
    <li>Granger Causality Test</li>
    <li>Sources</li>
</ol>

</div>

<hr>


## **Dataset Description**

### **Files**
* #### **MATICBUSD_trades_1m_preprocessed.csv** - the file contains historical changes of the pair **MATIC/BUSD** and ATR, OBV, RSI, AD indicators for the period from 11/11/2022 to 12/29/2022 with an aggregation time of 1 minute. **MATIC/BUSD** - the exchange rate of **MATIC** cryptocurrency to **BUSD** cryptocurrency

### **Columns**

* #### `Ts` - the timestamp of the record
* #### `Open` -  the price of the asset at the beginning of the trading period
* #### `High` -  the highest price of the asset during the trading period
* #### `Low` - the lowest price of the asset during the trading period.
* #### `Close` - the price of the asset at the end of the trading period
* #### `Volume` - the total number of shares or contracts of a particular asset that are traded during a given period
* #### `Rec_count` -  the number of individual trades or transactions that have been executed during a given time period
* #### `Avg_price` - the average price at which a particular asset has been bought or sold during a given period
* #### `ATR` - average true range indicator
* #### `OBV` - on-balance volume indicator
* #### `RSI` - relative strength index indicator
* #### `AD` - accumulation / distribution indicator


### How to find the correlation between cryptocurrencies and conduct various tests on them?


# **1. Import Data**


If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:


In [ ]:
# If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:
# ! conda install pandas -y
# ! conda install numpy -y
# ! conda install scipy -y
# ! conda install matplotlib -y
# ! conda install seaborn -y
! conda install -c conda-forge mplfinance -y
# ! conda install -c anaconda statsmodels -y

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.dates as mpl_dates
import mplfinance as mpf
import statsmodels.api as sm

from scipy import stats
from statsmodels.stats.stattools import durbin_watson as dwtest
from statsmodels.tsa.stattools import grangercausalitytests

import itertools
import warnings
import datetime as dt

pd.set_option("display.precision", 5) # setting numbers after digits
pd.options.display.float_format = '{:.5f}'.format
warnings.filterwarnings("ignore") # filterig warnings
sns.set() # setting theme

We will use dataset that we created in the first lab "MATICBUSD_trades_1m_preprocessed.csv"


In [ ]:
filename = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0CJ2EN/MATICBUSD_trades_1m_preprocessed.csv"

Load the data and store it in dataframe `df`:


In [ ]:
df = pd.read_csv(filename, parse_dates=["Ts"]) # We set parameter parse_dates to specify columns which need to perceived as datetime

This dataset was hosted <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0CJ2EN/MATICBUSD_trades_1m_preprocessed.csv">HERE</a>


Let's take a look on data that we have got


In [ ]:
df.head(20)

How we can see our first 15 rows of columns **"ATR"** and **"RSI"** are `NaN`'s so we need to drop them


In [ ]:
df = df.dropna()

# **2. Analyzing Individual Feature Patterns Using Visualization**


## **Choosing the right visualization method**

When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.


In [ ]:
# list the data types for each column
print(df.dtypes)

Let's define function that plots **candlestick chart**


In [ ]:
def plot_candlestick_chart(df: pd.DataFrame, curr: str) -> None:
    """
    Plots candlestick chart
    
    Parameters
    ----------
    df: pd.DataFrame
        Pandas dataframe that needs to contain columns "Ts", "Open", "High", "Low", "Close", "Volume"
    curr: str
        Name of currency that `df` contains
    """
    # Extracting Data for plotting
    ohlc = df[["Ts", "Open", "High", "Low", "Close", "Volume"]].copy()

    # Setting "Ts" column as datatime if it's not yet
    ohlc["Ts"] = pd.to_datetime(ohlc["Ts"])
    ohlc.index = ohlc["Ts"]

    # Resampling to 1 day
    ohlc = ohlc.resample("1d").agg({
        "Open": "first",
        "High": "max",
        "Low": "min",
        "Close": "last",
        "Volume": "sum"
    })

    # Setting "Ts" column as index and result in the correct format
    ohlc["Ts"] = ohlc.index
    ohlc["Ts"] = ohlc["Ts"].apply(mpl_dates.date2num)

    # Plotting the candlestick chart
    mpf.plot(ohlc, type="candle", 
             volume=True, 
             style="yahoo", 
             ylabel="Price", 
             xlabel="Date", 
             title=f"Daily Candlestick Chart of {curr}")

## **Candlestick chart**


A **candlestick chart** (also called **Japanese candlestick** chart or **K-line**) is a style of financial chart used to describe price movements of a security, derivative, or currency.

It is similar to a bar chart in that each candlestick represents all four important pieces of information for that day: open and close in the thick body; high and low in the “candle wick”. Being densely packed with information, it tends to represent trading patterns over short periods of time, often a few days or a few trading sessions.

Candlestick charts are most often used in technical analysis of equity and currency price patterns. They are used by traders to determine possible price movement based on past patterns, and who use the opening price, closing price, high and low of that time period. They are visually similar to box plots, though box plots show different information.


<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0CJ2EN/1024px-Candlestick_chart_scheme_01-en.svg.png" width="300" alt="candlestick">
</center>


Scheme of a single candlestick chart. A candlestick as this one is usually shaded red as the close is lower than the open. The Low and High caps are usually not present but may be added to ease reading.


Let's see our candlestick chart


In [ ]:
plot_candlestick_chart(df, "MATIC/BUSD")

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1><strong>Question  #1:</strong></h1>

**What is the data type of the column "Avg_price"?**

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df["Avg_price"].dtypes
```

</details>


## **Correlation calculation**


For example, we can calculate the correlation between variables  of type `int64` or `float64` using the method `.corr()`:


In [ ]:
corr = df.corr()
corr

The diagonal elements are always one; we will study correlation more precisely Pearson correlation in-depth at the end of the notebook.


We have a dataframe of correlations now we can build **heatmap** based on that correlation to perceive this data visually


**heatmap** plots rectangular data as a color-encoded matrix.


In [ ]:
sns.heatmap(corr)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1><strong>Question  #2:</strong></h1>

**Find the correlation between the following columns: "Avg_price", "ATR" and "AD".**<br><br>
<strong>Hint: if you would like to select those columns, use the following syntax: `df[["Avg_price", "ATR", "AD"]]`</strong>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df[["Avg_price", "ATR", "AD"]].corr()
```

</details>


## **Continuous Numerical Variables**

Continuous numerical variables are variables that may contain any value within some range. They can be of type `int64` or `float64`. A great way to visualize these variables is by using scatterplots with fitted lines.

In order to start understanding the (linear) relationship between an individual variable and the price, we can use `regplot` which plots the scatterplot plus the fitted regression line for the data.


Let's see several examples of different linear relationships:


## **Linear Relationship**


A **linear relationship** (or **linear association**) is a statistical term used to describe a straight-line relationship between two variables. Linear relationships can be expressed either in a graphical format where the variable and the constant are connected via a straight line or in a mathematical format where the independent variable is multiplied by the slope coefficient, added by a constant, which determines the dependent variable. 


Let's take 2 columns **"Avg_price"** and **"ATR"** and see how they are correlated


In [ ]:
# ATR as potential predictor variable of Avg_price
sns.regplot(x="ATR", y="Avg_price", data=df)

We can see that at small values of **"ATR"** the correlation is quite good, but as **"ATR"** increases the correlation decreases


In [ ]:
df[["ATR", "Avg_price"]].corr()

We can examine the correlation between **"ATR"** and **"Avg_price"** and see that it's approximately 0.33449


Let's find the scatterplot of **"OBV"** and **"Avg_price"**


In [ ]:
sns.regplot(x="OBV", y="Avg_price", data=df)

**"OBV"** does not seem like a good predictor of **"Avg_price"** since the regression line is close to horizontal and in most cases the data points are located far from the fitted line. 


In [ ]:
df[["OBV", "Avg_price"]].corr()

We can examine the correlation between **"OBV"** and **"Avg_price"** and see that it's approximately 0.19047


Let's find the scatterplot of **"AD"** and **"Avg_price"**


In [ ]:
sns.regplot(x="AD", y="Avg_price", data=df)

<p>We see that there is a fairly conditional correlation. in some cases the points are located close to the straight line, in some - not. </p>


In [ ]:
df[["AD", "Avg_price"]].corr()

We can examine the correlation between **"AD"** and **"Avg_price"** and see that it's approximately 0.48184


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1><strong>Question  3 a):</strong></h1>

<strong>Find the correlation  between "Avg_price" and "RSI".</strong><br><br>
<strong>Hint: if you would like to select those columns, use the following syntax: df[["Avg_price", "RSI"]]</strong>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python

#The correlation is 0.0298, the non-diagonal elements of the table.

df[["Avg_price", "RSI"]].corr()

```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1><strong>Question  3 b):</strong></h1>

**Given the correlation results between "Avg_price" and "RSI", do you expect a linear relationship?**
**Verify your results using the function `regplot()`**
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python

# There is no correlation between the variables "RSI" and "Avg_price". We can see this using "regplot" to demonstrate this.

# Code: 
sns.regplot(x="Avg_price", y="RSI", data=df)

```

</details>


## **Categorical Variables**

These are variables that describe a "characteristic" of a data unit, and are selected from a small group of categories. The categorical variables can have the type `object` or `int64`. A good way to visualize categorical variables is by using boxplots.


But firstly we need to create the categories. Let's split "Avg_price" into 5 categories (Low, Lower Medium, Medium, Upper Medium, High). To do this we will use `np.linspace` and `pd.cut`. We used `pd.cut` in second lab. `np.linspace` returns evenly spaced numbers over a specified interval.


In [ ]:
group_names = ["Low", "Lower Medium", "Medium", "Upper Medium", "High"]

Let's define a function which convert absolute values to categorical ones


In [ ]:
def to_categorical(column: pd.Series, labels: list) -> pd.Series:
    """
    Convert `column` into categorical pd.Series with labels given as parameter `labels`
    
    Parameters
    ----------
    column: pd.Series
        Column to convert
    labels: list
        Labels which we will use as categories
    
    Returns
    -------
    res: pd.Series
        Categorical column
    """
    cat_number = len(labels)
    bins = np.linspace(min(column), max(column), cat_number+1)
    res = pd.cut(column, bins, labels=labels, include_lowest=True)
    return res

In [ ]:
df["ap_cat"] = to_categorical(df["Avg_price"], group_names)
df[["ap_cat", "Avg_price"]].tail()

Now let's do the same with **"AD"**


In [ ]:
df["AD_cat"] = to_categorical(df["AD"], group_names)

Let's look at the relationship between **"Avg_price"** and **"ap_cat"**.


In [ ]:
sns.boxplot(x="Avg_price", y="ap_cat", data=df)

Here we see that the distribution of price between these five categories, Low, Lower Medium, Medium, Upper Medium and High.


# **3. Descriptive Statistical Analysis**


Let's first take a look at the variables by utilizing a description method.

The `.describe()` function automatically computes basic statistics for all continuous variables. Any `NaN` values are automatically skipped in these statistics.

This will show:

* the count of that variable
* the mean
* the standard deviation (std)
* the minimum value
* the IQR (Interquartile Range: 25%, 50% and 75%)
* the maximum value


We can apply the method `.describe()` as follows:


In [ ]:
df.describe()

However `.describe()` does not include categorical columns so let's include them specifying the `include` parameter


In [ ]:
df.describe(include="category")

## **Value Counts**


Value counts is a good way of understanding how many units of each characteristic/variable we have. We can apply the `.value_counts()` method on the column **"ap_cat"**. Don’t forget the method `.value_counts()` only works on pandas series, not pandas dataframes. As a result, we only include one bracket `df["ap_cat"]`, not two brackets `df[["ap_cat"]]`


In [ ]:
df["ap_cat"].value_counts()

We can convert the series to a dataframe as follows:


In [ ]:
df["ap_cat"].value_counts().to_frame()

Let's repeat the above steps but save the results to the dataframe **"ap_cat"** and rename the column **"ap_cat"** to `.value_counts()`


In [ ]:
ap_counts = df["ap_cat"].value_counts().to_frame()
ap_counts.rename(columns={"ap_cat": "value_counts"}, inplace=True)
ap_counts

Now let's rename the index to **"ap_cat"**:


In [ ]:
ap_counts.index.name = "ap_cat"
ap_counts

We can repeat the above process for the variable **"AD_cat"**.


In [ ]:
# AD_cat as variable
ad_cat = df["AD_cat"].value_counts().to_frame()
ad_cat.rename(columns={"AD_cat": "value_counts"}, inplace=True)
ad_cat.index.name = "AD_cat"
ad_cat.head(10)

# **4. Basics of Grouping**


The `groupby()` method groups data by different categories. The data is grouped based on one or several variables, and analysis is performed on the individual groups

For example, let's group by the variable **"ap_cat"**. We see that there are 5 different categories


In [ ]:
df["ap_cat"].unique()

If we want to know, on average, which type of **"ap_cat"** is most valuable, we can group **"ap_cat"** and then average them.

We can select the columns **"ap_cat"** and **"Avg_price"**, then assign it to the variable `df_group_one`


In [ ]:
df_group_one = df[["ap_cat", "Avg_price"]]

We can then calculate the average price for each of the different categories of data


In [ ]:
# grouping results
df_group_one = df_group_one.groupby(["ap_cat"] ,as_index=False).mean()
df_group_one

Obviously, High category is, on average, the most expensive

You can also group by multiple variables. For example, let's group by both **"ap_cat"** and **"AD_cat"**. This groups the dataframe by the unique combination of **"ap_cat"** and **"AD_cat"**. We can store the results in the variable `grouped_test1`


In [ ]:
# grouping results
df_gptest = df[["ap_cat", "AD_cat", "Avg_price"]]
grouped_test1 = df_gptest.groupby(["ap_cat", "AD_cat"], as_index=False).mean()
grouped_test1

This grouped data is much easier to visualize when it is made into a **cross table**. A cross table is a two-way table consisting of columns and rows. It is also known as a pivot table or a multi-dimensional table. Its greatest strength is its ability to structure, summarize and display large amounts of data. Cross tables can also be used to determine whether there is a relation between the row variable and the column variable or not.

In this case, we will leave the **"ap_cat"** variable as the rows of the table, and **"AD_cat"** to become the columns of the table:


In [ ]:
crossed_table = pd.crosstab(df["ap_cat"], df["AD_cat"])
crossed_table

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1><strong>Question 4:</strong></h1>

**Use the `groupby()` function to find the average price of each category based on "AD_cat"**
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
# grouping results
df_gptest2 = df[["AD_cat", "Avg_price"]]
grouped_test_bodystyle = df_gptest2.groupby(["AD_cat"], as_index= False).mean()
grouped_test_bodystyle

```

</details>


Variables: **"ap_cat"** vs **"AD_cat"**


Let's use a heat map to visualize the relationship between **"ap_cat"** vs **"AD_cat"**


In [ ]:
# use the grouped results
plt.pcolor(crossed_table, cmap="RdBu")
plt.colorbar()
plt.show()

The heatmap plots relationship between these 2 variables. The larger the diagonal elements, the more these two variables are dependent.

The default labels convey no useful information to us. Let's change that:


In [ ]:
fig, ax = plt.subplots()
im = ax.pcolor(crossed_table, cmap="RdBu")

# label names
row_labels = crossed_table.columns.categories
col_labels = crossed_table.index

# move ticks and labels to the center
ax.set_xticks(np.arange(crossed_table.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(crossed_table.shape[0]) + 0.5, minor=False)

# insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

# rotate label if too long
plt.xticks(rotation=90)

fig.colorbar(im)
plt.show()

Visualization is very important in data science, and Python visualization packages provide great freedom. We will go more in-depth in a separate Python visualizations course

To get a better measure of the important characteristics, we look at the correlation of these variables with the price. In other words: how is the **"Avg_price"** dependent on other variables?


# **5. Correlation and Causation**


**Correlation**: a measure of the extent of interdependence between variables.

**Causation**: the relationship between cause and effect between two variables.

It is important to know the difference between these two. Correlation does not imply causation. Determining correlation is much simpler  the determining causation as causation may require independent experimentation.


Correlations are useful because they can indicate a predictive relationship that can be exploited in practice. For example, an electrical utility may produce less power on a mild day based on the **correlation** between electricity demand and weather. In this example, there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling. However, in general, the presence of a correlation is not sufficient to infer the presence of a causal relationship (i.e., correlation does not imply causation). 


## **Pearson Correlation**

The Pearson Correlation measures the linear dependence between two variables $X$ and $Y$. The Pearson correlation coefficient attempts to establish a line of best fit through a dataset of two variables by essentially laying out the expected values and the resulting Pearson's correlation coefficient indicates how far away the actual dataset is from the expected values. Depending on the sign of our Pearson's correlation coefficient, we can end up with either a negative or positive correlation if there is any sort of relationship between the variables of our data set.</p>
The resulting coefficient is a value between -1 and 1 inclusive, where:

* **1**: Perfect positive linear correlation.
* **0**: No linear correlation, the two variables most likely do not affect each other.
* **1**: Perfect negative linear correlation.


The population correlation coefficient $ \rho_{X,Y}$ between two random variables $X$ and $Y$ with expected values $\mu _{X}$ and $\mu _{Y}$ and standard deviations $\sigma _{X}$ and $\sigma_Y$ is defined as:

<center><h1>$\rho_{X,Y} = \operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X,Y)}{\sigma _{X} \sigma_Y} = \frac{\operatorname{E}[(X \; - \; \mu_{X})(Y \; - \; \mu_{Y})]}{\sigma _{X} \sigma_Y}, \quad \text{if} \; \sigma_{X} \sigma_Y > 0 $</h1></center>

where $\operatorname{E}$ is the expected value operator, $\operatorname{cov}$ means covariance, and $\operatorname {corr}$ is a widely used alternative notation for the correlation coefficient. The Pearson correlation is defined only if both standard deviations are finite and positive. An alternative formula purely in terms of moments is:

<center><h1>$\rho_{X,Y} = \frac{\operatorname{E}(XY) \; - \; \operatorname{E}(X) \operatorname{E}(Y)}{\sqrt{\operatorname{E}(X^{2}) \; - \; \operatorname{E}(X)^2} \sqrt{\operatorname{E}(Y^{2}) \; - \; \operatorname{E}(Y)^2}}$</h3></center><br>


<center><img width="700" src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0CJ2EN/R%20SVG%20Plot.svg"></center>
Several sets of $(x, y)$ points, with the Pearson correlation coefficient of $x$ and $y$ for each set. The correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0 but in that case, the correlation coefficient is undefined because the variance of $Y$ is zero. <br>


Pearson Correlation is the default method of the function `.corr()`. Like before, we can calculate the Pearson Correlation of the of the `int64` or `float64` variables


In [ ]:
df.corr()

Sometimes we would like to know the significant of the correlation estimate.


## **P-value**

<p>What is this P-value? The P-value is the probability value that the correlation between these two variables is statistically significant. Normally, we choose a significance level of 0.05, which means that we are 95% confident that the correlation between the variables is significant.</p>

By convention, when the

<ul>
    <li>p-value is $<$ 0.001: we say there is strong evidence that the correlation is significant.</li>
    <li>the p-value is $<$ 0.05: there is moderate evidence that the correlation is significant.</li>
    <li>the p-value is $<$ 0.1: there is weak evidence that the correlation is significant.</li>
    <li>the p-value is $>$ 0.1: there is no evidence that the correlation is significant.</li>
</ul>


Let's calculate the  Pearson Correlation Coefficient and P-value of "Avg_price" and "ATR", "OBV", "RSI", "AD".


In [ ]:
indicators = ["ATR", "OBV", "RSI", "AD"] # indicators that we want to calculate PCS and p-value with "Avg_price"

performance = pd.DataFrame({"pair": [], "PCS": [], "P-value": []}) # PCS (Pearson Correlation Coefficient)

# Iterating over all the indicators and calculating needed characteristics
for indicator in indicators:
    pearson_coef, p_value = stats.pearsonr(df["Avg_price"], df[indicator])
    pair = f"Avg_price, {indicator}"
    performance.loc[len(performance.index)] = [pair, pearson_coef, p_value]
   
# Printing results
performance.sort_values(by=["PCS"], ascending=False).head()

## **Conclusion:**

Since the p-value for all pairs is $<$ 0.001 we say there is strong evidence that the correlation is significant. Only correlaton of **"Avg_price"** and **"AD"** is moderate, all others are weak


# **6. ANOVA: Analysis of Variance**


The Analysis of Variance  (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:

ANOVA is used in the analysis of comparative experiments, those in which only the difference in outcomes is of interest. The statistical significance of the experiment is determined by a ratio of two variances. This ratio is independent of several possible alterations to the experimental observations: Adding a constant to all observations does not alter significance. Multiplying all observations by a constant does not alter significance. So ANOVA statistical significance result is independent of constant bias and scaling errors as well as the units used in expressing observations. In the era of mechanical calculation it was common to subtract a constant from all observations (when equivalent to dropping leading digits) to simplify data entry. This is an example of data coding.


**F-test score**: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.

**P-value**:  P-value tells how statistically significant our calculated score value is.


<center><h1>$F = \frac{MST}{MSE}$</h1></center>


$\text{where:}$ <br>
$F = \text{F-test score}$<br>
$MST = \text{Mean sum of squares due to treatment}$<br>
$MSE = \text{Mean sum of squares due to error}$


If our price variable is strongly correlated with the variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.


Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average before hand.


In [ ]:
grouped_test2 = df[["ap_cat", "Avg_price"]].groupby(["ap_cat"])
grouped_test2.head()

We can obtain the values of the method group using the method `.get_group()`.


In [ ]:
grouped_test2.get_group("Medium")["Avg_price"]

We can use the function `stats.f_oneway` in the module `stats` to obtain the **F-test score** and **P-value**.


In [ ]:
# ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group("Low")["Avg_price"], grouped_test2.get_group("Lower Medium")["Avg_price"], grouped_test2.get_group("Medium")["Avg_price"], grouped_test2.get_group("Upper Medium")["Avg_price"], grouped_test2.get_group("High")["Avg_price"])  

print("ANOVA results: F=%.2f" % f_val, ", P =", p_val)

This is a great result with a large F-test score showing a strong correlation and a P-value of 0 implying almost certain statistical significance. But does this mean all five tested groups are all this highly correlated?

Let's examine them separately.


In [ ]:
performance_anova = pd.DataFrame({"pair": [], "F-test": [], "P-value": []})

# Iterating over all the groups and calculating needed characteristics
for comb in itertools.combinations(group_names, 2):
    f_val, p_val = stats.f_oneway(grouped_test2.get_group(comb[0])["Avg_price"], grouped_test2.get_group(comb[1])["Avg_price"])
    pair = f"{comb[0]}, {comb[1]}"
    performance_anova.loc[len(performance_anova.index)] = [pair, f_val, p_val]
   
# Printing results
performance_anova.sort_values(by=["F-test"], ascending=False)

## **Conclusion:**

Every pair of group has p-value of 0 what means that our calculated score value is significant. Every pair has high **F-score** but Low, Medium pairs have the highest of 780563.41995


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1><strong>Question 5:</strong></h1>

**Get ANOVA score using `stats.f_oneway` function between "Low" and "Medium" groups**
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
f_val, p_val = stats.f_oneway(grouped_test2.get_group("Low")["Avg_price"], grouped_test2.get_group("Medium")["Avg_price"])  

print("ANOVA results: F=%.2f" % f_val, ", P =", p_val)

```

</details>


# **7. Durbin-Watson Test**


In regression analysis, Durbin-Watson (DW) is useful for checking the first-order autocorrelation (serial correlation). It analyzes the residuals for independence over time points (autocorrelation). The Durbin-Watson statistic will always have a value ranging between 0 and 4. A value of 2.0 indicates there is no autocorrelation detected in the sample. Values from 0 to less than 2 point to positive autocorrelation and values from 2 to 4 means negative autocorrelation. The closer to 0 the statistic, the more evidence for positive serial correlation. The closer to 4, the more evidence for negative serial correlation.

Durbin-Watson test analyzes the following hypotheses,

Null hypothesis (H<sub>0</sub>): Residuals from the regression are not autocorrelated (autocorrelation coefficient, ρ = 0)
Alternative hypothesis (H<sub>a</sub>): Residuals from the regression are autocorrelated (autocorrelation coefficient, ρ > 0)

A rule of thumb is that DW test statistic values in the range of 1.5 to 2.5 are relatively normal. Values outside this range could, however, be a cause for concern. The Durbin–Watson statistic, while displayed by many regression analysis programs, is not applicable in certain situations. 

<center><h1>$DW = \frac{\sum_{t=2}^{T} ((e_{t} \; - \; e_{t-1})^{2}) }{ \sum_{t=1}^{T} e^{2}_{t} }$</h1></center>


We will use <code>statsmodels.stats.stattools.durbin_watson</code> for Durbin-Watson Test and <code>sm.OLS</code> to get residuals from `statsmodels` library


Let's define function that will return durbin-watson score


In [ ]:
def dw_test(df: pd.DataFrame, ind_col: str, dep_col) -> float:
    """
    Does Durbin-Watson test and return result as float
    
    Parameters
    ----------
    df: pd.DataFrame
        Pandas dataframe that needs to contain columns `ind_col` and `dep_col`
    ind_col: str
        Name of independant currency
    dep_col: str
        Name of dependant currency
    
    Returns
    -------
    score: float
        Durbin-Watson score which has range of [0, 4]
    """
    
    # We want to check on autocorrelation so we suppose that {ind_col} is depandant on {dep_col}
    X = df[ind_col] # independent
    y = df[dep_col] # dependent
    # to get intercept
    X = sm.add_constant(X)
    # fit the regression model
    reg = sm.OLS(y, X).fit()
    score = dwtest(resids=np.array(reg.resid))
    return score

Let's try this function on **"Avg_price"**, **"AD"** columns


In [ ]:
# We want to check on autocorrelation so we suppose that "Avg_price" is depandant on "AD"
dw_test(df, "AD", "Avg_price")

## **Conclusion:**

Because the score is very close to 0 we conclude that there is low positive autocorrelation


Let's calculate Durbin-Watson for every possible pair


In [ ]:
dw_elements = ["ATR", "OBV", "RSI", "AD", "Avg_price"]

cols = [f"{el}_dep" for el in dw_elements]
idxs = [f"{el}_ind" for el in dw_elements]

dw_df = pd.DataFrame(columns=cols, index=idxs)

for (curr1, curr2) in itertools.permutations(dw_elements, 2):
    dw = dw_test(df, curr1, curr2)
    dw_df.loc[f"{curr2}_ind", f"{curr1}_dep"] = dw
    
np.fill_diagonal(dw_df.values, "—")
    
dw_df

## **Conclusion:**

Because the scores are very close to 0 we conclude that every pair has positive autocorrelation. However the values are very low that's why autocorrelation is also low.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1><strong>Question 6:</strong></h1>

**Get DW score using `dw_test` function on columns "ATR" (ind_col), "Avg_price" (dep_col)**
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
dw_test(df, "ATR", "Avg_price")

```

</details>


# **8. Granger Causality Test**


Granger Causality test is a statistical test that is used to determine if a given time series and it’s lags is helpful in explaining the value of another series. 


The Null hypothesis for grangercausalitytests is that the time series in
the second column, $x_2$, does NOT Granger cause the time series in the first
column, $x_1$. Grange causality means that past values of $x_2$ have a
statistically significant effect on the current value of $x_1$, taking past
values of $x_1$ into account as regressors. We reject the null hypothesis
that $x_2$ does not Granger cause $x_1$ if the p-values are below a desired size
of the test.


**How to interpret the p-values?**

Assuming a significance level of 0.05, if the p-value is lesser than 0.05, then we do NOT reject the null hypothesis that $X$ does NOT granger cause $Y$.


Let's define function that will plot 2 variables into 1 plot


In [ ]:
def plot_two_variables(df: pd.DataFrame, col1: str, col2: str) -> None:
    """
    Plots `col1` and `col2` currencies into a single plot with adjusted axes
    
    Parameters
    ----------
    df: pd.DataFrame
        Pandas dataframe that needs to contain columns `col1` and `col2`
    col1: str
        Name of first currency to plot
    col2: str
        Name of second currency to plot
    """
    
    df_to_test = df[[col1, col2]]
    x = df["Ts"]
    y1 = df_to_test[col1]
    y2 = df_to_test[col2]

    # Plot Line1 (Left Y Axis)
    fig, ax1 = plt.subplots(1,1,figsize=(16,9), dpi= 80)
    ax1.plot(x, y1, color="tab:red")

    # Plot Line2 (Right Y Axis)
    ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
    ax2.plot(x, y2, color="tab:blue")

    # Decorations
    # ax1 (left Y axis)
    ax1.set_xlabel("Time", fontsize=20)
    ax1.tick_params(axis="x", rotation=0, labelsize=12)
    ax1.set_ylabel(col1, color="tab:red", fontsize=20)
    ax1.tick_params(axis="y", rotation=0, labelcolor="tab:red")
    ax1.grid(alpha=.4)

    # ax2 (right Y axis)
    ax2.set_ylabel(col2, color="tab:blue", fontsize=20)
    ax2.tick_params(axis="y", labelcolor="tab:blue")
    # ax2.set_xticklabels(x[::60], rotation=90, fontdict={"fontsize":10})
    ax2.set_title("Visualizing Leading Indicator Phenomenon", fontsize=22)
    plt.show()

In [ ]:
plot_two_variables(df, "Avg_price", "OBV")

We will use `grangercausalitytests` for Granger Causality Test from `statsmodels` library


Now let's define custom function which will do Granger Causality Test and return result as `pd.DataFrame`


In [ ]:
def grangers_causation_matrix(data: pd.DataFrame, maxlag: int, variables: list, test: str ="ssr_chi2test", verbose: bool = False):    
    """
    Check Granger Causality of all possible combinations of the Time series.
    The rows are the response variable, columns are predictors. The values in the table 
    are the P-Values. P-Values lesser than the significance level (0.05), implies 
    the Null Hypothesis that the coefficients of the corresponding past values is 
    zero, that is, the X does not cause Y can be rejected.

    Parameters
    ----------
    data: pd.DataFrame
        pandas dataframe containing the time series variables
    maxlag: int
        Number of lags
    variables: 
        list containing names of the time series variables
    test: str
        Name of test
    verbose: bool
        If verbose = True we print in detail
    """
    df = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables)
    for c in df.columns:
        for r in df.index:
            test_result = grangercausalitytests(data[[r, c]], maxlag=maxlag, verbose=False)
            p_values = [round(test_result[i+1][0][test][1],4) for i in range(maxlag)]
            if verbose: print(f"Y = {r}, X = {c}, P Values = {p_values}")
            min_p_value = np.min(p_values)
            df.loc[r, c] = min_p_value
    df.columns = [var + "_x" for var in variables]
    df.index = [var + "_y" for var in variables]
    return df

In [ ]:
cols_to_test = ["Avg_price", "OBV"]
grangers_causation_matrix(df[cols_to_test], 1, variables=cols_to_test)

## **Conclusion:**

We can see that 0.2516 $>$ 0.05 and 0.3861 $>$ 0.05 so we conclude that **"OBV"** does not granger-cause **"Avg_price"** and **"Avg_price"** does not granger-cause **"OBV"**


Let's calculate Granger Causality Test for all possible pairs


In [ ]:
cols = [f"{el}_x" for el in dw_elements]
idxs = [f"{el}_y" for el in dw_elements]

gc_df = pd.DataFrame(columns=cols, index=idxs)

for (curr1, curr2) in itertools.combinations(dw_elements, 2):
    df_to_test_2 = df[[curr1, curr2]]
    res_df = grangers_causation_matrix(df_to_test_2, 1, variables=df_to_test_2.columns)
    p1 = res_df[f"{curr1}_x"][f"{curr2}_y"]
    p2 = res_df[f"{curr2}_x"][f"{curr1}_y"]
    gc_df.loc[f"{curr1}_y", f"{curr2}_x"] = p1
    gc_df.loc[f"{curr2}_y", f"{curr1}_x"] = p2
    
np.fill_diagonal(gc_df.values, "—")
    
gc_df

## **Conclusion:**

**"ATR"** granger-causes **"OBV"**, **"RSI"**, **"AD"**. **"Avg_price"**. **"OBV"** granger-causes **"ATR"**, **"RSI"**, **"AD"** granger-causes **"ATR"**, **"OBV"**. **"Avg_price"** granger-causes "**RSI"**


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1><strong>Question 7 a):</strong></h1>

**Plot "ATR" and "Avg_price" using function `plot_two_variables`**
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
plot_two_variables(df, "Avg_price", "ATR")

```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1><strong>Question 7 b):</strong></h1>

**Run Granger Causality Test on "Avg_price" and "OBV" columns with `maxlag=1`**
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
cols_to_test = ["Avg_price", "ATR"]
grangers_causation_matrix(df[cols_to_test], 1, variables=cols_to_test)

```

</details>


Let's save our dataset that will be needed for the next lab


In [ ]:
df.to_csv("MATICBUSD_trades_lab3.csv", index=False)

# **Conclusion:**


## **Correlation**


We now have a better idea of what our data looks like and which indicators are more related to **MATIC/BUSD**.

The most related indicators:

* **"ATR"**
* **"AD"**


## **Durbin-Watson Test**


**"Avg_price"** has high serial correlation (when it's dependant on) **"OBV"**, **"AD"**.


## **Granger Causality Test**


**"ATR"** granger-causes **"OBV"**, **"RSI"**, **"AD"**. **"Avg_price"**. **"OBV"** granger-causes **"ATR"**, **"RSI"**, **"AD"** granger-causes **"ATR"**, **"OBV"**. **"Avg_price"** granger-causes **"RSI"**.

We can conclude that all 4 indicators will be usefull in predicting **"Avg_price"**

As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.


# **9. Sources**:

<ul>
    <li><a href="https://en.wikipedia.org/wiki/Correlation?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0CJ2EN2387-2023-01-01">https://en.wikipedia.org/wiki/Correlation</a></li>
    <li><a href="https://www.investopedia.com/terms/a/anova.asp?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0CJ2EN2387-2023-01-01">https://www.investopedia.com/terms/a/anova.asp</a></li>
    <li><a href="https://www.statsmodels.org/dev/generated/statsmodels.stats.stattools.durbin_watson.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0CJ2EN2387-2023-01-01">https://www.statsmodels.org/dev/generated/statsmodels.stats.stattools.durbin_watson.html</a></li>
    <li><a href="https://www.investopedia.com/terms/d/durbin-watson-statistic.asp?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0CJ2EN2387-2023-01-01">https://www.investopedia.com/terms/d/durbin-watson-statistic.asp</a></li>
    <li><a href="https://en.wikipedia.org/wiki/Granger_causality?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0CJ2EN2387-2023-01-01">https://en.wikipedia.org/wiki/Granger_causality</a></li>
    <li><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/9/92/Candlestick_chart_scheme_01-en.svg/1024px-Candlestick_chart_scheme_01-en.svg.png?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0CJ2EN2387-2023-01-01">https://upload.wikimedia.org/wikipedia/commons/thumb/9/92/Candlestick_chart_scheme_01-en.svg/1024px-Candlestick_chart_scheme_01-en.svg.png</a></li>
    <li><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/1920px-Correlation_examples2.svg.png?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0CJ2EN2387-2023-01-01">https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/1920px-Correlation_examples2.svg.png</a></li>
</ul>


# **Thank you for completing this lab!**

## Author

<a href="https://author.skills.network/instructors/borys_melnychuk?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0CJ2EN2387-2023-01-01" >Borys Melnychuk</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/mariya_fleychuk?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Mariya Fleychuk, DrSc, PhD</a>



## Change Log

| Date (YYYY-MM-DD) | Version | Changed By      | Change Description                                         |
| ----------------- | ------- | ----------------| ---------------------------------------------------------- |
|     2023-03-11    |   1.0   | Borys Melnychuk | Creation of the lab                                        |

<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
