
<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="500" alt="cognitiveclass.ai logo">
</center>

# **Investigation of cryptocurrency exchange rate dynamics (on the example of the cryptocurrency pair MATIC/BUSD), calculation and analysis of technical financial indicators characterizing the cryptocurrency market (the example of ATR, OBV, RSI, AD)**

## **Lab 2. Dataset wrangling**

Estimated time needed: **30** minutes

## **The tasks**

* To find empty cells and handle missing values;
* To analyze data format, find the wrong format and correct data format;
* To resample data;
* To standardize and normalize data series.

## **Objectives**

### After completing this lab you will be able to:

*   handle missing values;
*   correct data format;
*   resample data; 
*   standardize and normalize data.


## **Table of Contents**

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li>Import data</li>
    <ul>
        <li>Reading the dataset from the URL</li>
    </ul>
    <li>Handle incorrect values</li>
    <ul>
        <li>Evaluating for Missing Data</li>
        <li>Count missing values in each column</li>
        <li>Deal with missing data</li>
        <li>Replacing with Pandas interpolation</li>
        <li>Correct data format</li>
        <li>Convert data types to proper format</li>
    </ul>
    <li>Data standardization</li>
    <li>Data normalization (centering/scaling)</li>
    <li>Binning and indicator variable</li>
    <ul>
        <li>Example of Binning Data In Pandas</li>
        <li>Bins Visualization</li>
        <li>Indicator Variable (or Dummy Variable)</li>
    </ul>
    <li>Resampling</li>
</ol>
</div>

<hr>


## **Dataset Description**

### **Files**
* #### **MATICBUSD_trades_1m_preprocessed.csv** - the file contains exchange rates of **MATIC/BUSD** and ATR, OBV, RSI, AD indicators for the period from 11/11/2022 to 12/29/2022 with an aggregation time of 1 minute. **MATIC/BUSD** - the exchange rate of **MATIC** cryptocurrency to **BUSD** cryptocurrency. 

<!-- * #### **MATICBUSD_trades_1m.csv** - **MATIC/BUSD** - the exchange rate of **MATIC** cryptocurrency to **BUSD** cryptocurrency. The file contains exchange rates of **MATIC/BUSD** for the period from 11/11/2022 to 12/29/2022 with an aggregation time of 1 minute -->

### **Columns**

* #### `Ts` - the timestamp of the record
* #### `Open` -  the price of the asset at the beginning of the trading period
* #### `High` -  the highest price of the asset during the trading period
* #### `Low` - the lowest price of the asset during the trading period.
* #### `Close` - the price of the asset at the end of the trading period
* #### `Volume` - the total number of shares or contracts of a particular asset that are traded during a given period
* #### `Rec_count` -  the number of individual trades or transactions that have been executed during a given time period
* #### `Avg_price` - the average price at which a particular asset has been bought or sold during a given period
* #### `ATR` - average true range indicator
* #### `OBV` - on-balance volume indicator
* #### `RSI` - relative strength index indicator
* #### `AD` - accumulation / distribution indicator


# **What is the purpose of data wrangling?**


Data wrangling is the process of converting data from the initial format to a format that may be better for analysis.


# **1. Import data**

You can find the "MATIC/BUSD Dataset" at the following link: <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0IXGEN/MATICBUSD_trades_1m_preprocessed.csv">https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0IXGEN/MATICBUSD_trades_1m_preprocessed.csv</a>. 
We will be using this dataset throughout this course.


Run the following cell to install required libraries:


In [ ]:
# If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:
# install specific version of libraries used in lab
# ! conda update -n base -c defaults conda -y
# ! conda install pandas -y
# ! conda install numpy -y
# ! conda install -c anaconda requests -y
! conda install scikit-learn -y
# ! conda install -c conda-forge matplotlib -y

In [ ]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import requests

## **Reading the dataset from the URL**


First, we assign the URL of the dataset to `filename`


The dataset is hosted <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0IXGEN/MATICBUSD_trades_1m_preprocessed.csv">here</a>.


In [ ]:
filename = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0IXGEN/MATICBUSD_trades_1m_preprocessed.csv"

Use the Pandas method `read_csv()` to load the data from the web address. 


In [ ]:
df = pd.read_csv(filename)

Use the method `head()` to display the first five rows of the dataframe.


In [ ]:
# To see what the data set looks like, we'll use the head() method.
df.head()

As we can see, several `NaN` values appear in the dataframe; these are missing data that may hinder our further analysis.


# **2. Handle incorrect values**


## **Evaluating for Missing Data**

When the CSV file is read, Pandas marks missing values as `NaN` by default. There are two methods to detect missing data:

* #### `isnull()`
* #### `notnull()`

Each method returns a boolean object of the same shape as its input: `isnull()` marks missing entries with `True`, and `notnull()` marks present entries with `True`.


In [ ]:
missing_data = df.isnull()
missing_data.head()

`True` means the value is a missing value while `False` means the value is not a missing value.


## **Count missing values in each column**

Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, `True` represents a missing value and `False` means the value is present in the dataset. In the body of the for loop, the method `.value_counts()` counts how many times `True` and `False` occur in each column.



In [ ]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts(), "\n")

Based on the summary above, each column has 66861 rows of data, and two of the columns contain missing data:

1. **"ATR"**: 15 missing data
2. **"RSI"**: 15 missing data
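
The same per-column counts can also be obtained without a loop. The sketch below uses a small toy dataframe (its values are illustrative and are not taken from the lab's dataset) and relies on the fact that `True` counts as 1 when a boolean mask is summed.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the lab's dataframe (illustrative values only)
toy = pd.DataFrame({
    "Close": [0.91, 0.92, 0.93, 0.94],
    "ATR": [np.nan, np.nan, 0.004, 0.005],
    "RSI": [np.nan, 48.2, 51.7, 50.3],
})

# isnull() gives a boolean mask; summing it counts the NaN in each column
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Applied to the lab's dataframe, the equivalent call would be `df.isnull().sum()`.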


## **Deal with missing data**


When replacing missing values, there are 2 approaches:

1. If the missing values follow each other at the beginning or at the end of the dataset - drop them
2. If in the middle of the dataset - use interpolation

**What is interpolation?**

Interpolation is a type of estimation: a method of constructing new data points within the range of a discrete set of known data points.
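
To make the two approaches concrete, here is a minimal sketch on a toy series (the values are made up for illustration). It trims the `NaN` at both ends and then fills the remaining interior gap by linear interpolation, which places the missing point on the straight line between its neighbours.

```python
import numpy as np
import pandas as pd

# Toy series: NaN at the start, in the middle, and at the end
s = pd.Series([np.nan, np.nan, 1.0, np.nan, 3.0, np.nan])

# Approach 1: drop the leading and trailing NaN by slicing
# between the first and the last valid index label
trimmed = s.loc[s.first_valid_index():s.last_valid_index()]

# Approach 2: fill the interior gap; halfway between 1.0 and 3.0 is 2.0
filled = trimmed.interpolate(method="linear")
print(filled)
```

Note that `dropna()` on its own would also remove the interior gap, so slicing is used here to keep the middle of the series intact.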


Drop `NaN`: since our case matches approach number 1, we drop the rows with `NaN`. We also need to reset the index.


In [ ]:
df = df.dropna()
df = df.reset_index(drop=True)

Now we can generate `NaN` values in the middle of the dataset to test how well our data can be interpolated.


Let's define a `spoil_df` function that produces incorrect data.


In [ ]:
def spoil_df(df: pd.DataFrame, cols: list = ["ATR", "RSI"]):
    """
    Replaces the column element with nan with a constant probability of 0.1
    
    Parameters
    ----------
    df: pd.DataFrame
        The data frame to apply function on
    cols: list
        Columns to be updated
    Returns
    -------
    new_df: pd.DataFrame
        The updated data frame with NaN's
    """
    rng = np.random.default_rng(seed=42)
    new_df = df.copy()
    
    for col in cols:
        m = rng.random(len(df))
        l1 = 0.1
        mask1 = m < l1 # NaN
        new_df.loc[mask1, col] = np.nan  # np.NaN was removed in NumPy 2.0
        
    return new_df

In [ ]:
spoiled_df = spoil_df(df, cols=["ATR", "RSI", "OBV"])

Let's see our `spoiled_df`


In [ ]:
spoiled_df.head()

Now we can replace the missing data using interpolation.


**Replace by Pandas interpolation:**

* **"ATR"**: 6584 missing data, replace with interpolation
* **"OBV"**: 6660 missing data, replace with interpolation
* **"RSI"**: 6656 missing data, replace with interpolation


## **Replacing with Pandas interpolation**


Let's try different interpolation methods on the **"ATR"** column and take the best one. We use MSE (mean squared error) and MAPE (mean absolute percentage error) to measure performance.
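
As a reminder of what the two metrics measure, the sketch below computes them by hand with NumPy on made-up numbers. MSE averages the squared errors, while MAPE averages the absolute errors relative to the true values. Like `mean_absolute_percentage_error` from scikit-learn, this version returns MAPE as a fraction, which is why it has to be multiplied by 100 to show it as a percentage.

```python
import numpy as np

# Illustrative values only: "true" data and an interpolated estimate
y_true = np.array([2.0, 4.0, 5.0])
y_pred = np.array([1.0, 4.0, 6.0])

# Mean squared error: (1 + 0 + 1) / 3
mse = np.mean((y_true - y_pred) ** 2)

# Mean absolute percentage error as a fraction: (0.5 + 0 + 0.2) / 3
mape = np.mean(np.abs((y_true - y_pred) / y_true))

print(f"MSE = {mse:.4f}, MAPE = {mape * 100:.2f}%")
```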


In [ ]:
# Setting precision
pd.set_option("display.precision", 10)

# Methods without order
methods = ["linear", "nearest", "slinear", "quadratic", "cubic", "piecewise_polynomial", "pchip", "akima", "cubicspline"]
# Methods with order
order_methods = ["spline", "polynomial"]
performance = pd.DataFrame({"name": [], "MSE": [], "MAPE": []})

for method in methods:
    nan_rows = spoiled_df["ATR"].isna()
    interpolated_ap = spoiled_df["ATR"].interpolate(method=method)
    # Calculating MSE and MAPE
    mse = mean_squared_error(df.loc[nan_rows, "ATR"], interpolated_ap[nan_rows])
    mape = mean_absolute_percentage_error(df.loc[nan_rows, "ATR"], interpolated_ap[nan_rows])
    # Adding results to dataframe
    performance.loc[len(performance.index)] = [method, mse, mape]
    
for method in order_methods:
    for order in [3, 5]:
        nan_rows = spoiled_df["ATR"].isna()
        interpolated_ap = spoiled_df["ATR"].interpolate(method=method, order=order)
        # Calculating MSE and MAPE
        mse = mean_squared_error(df.loc[nan_rows, "ATR"], interpolated_ap[nan_rows])
        mape = mean_absolute_percentage_error(df.loc[nan_rows, "ATR"], interpolated_ap[nan_rows])
        # Adding results to dataframe
        performance.loc[len(performance.index)] = [f"{method}_{order}", mse, mape]
        
performance = performance.sort_values(by=["MAPE", "MSE"], ascending=True)
# Forming MAPE as percentage (0-100%)
performance["MAPE"] = performance["MAPE"] * 100
performance["MAPE"] = performance["MAPE"].astype("str")
performance["MAPE"] = performance["MAPE"].str.slice(stop=8) + "%"

performance.head(15)

As we can see, the best interpolation method is `pchip`, so we'll use it to replace the `NaN` values.


In [ ]:
cols_to_replace = ["ATR", "RSI"]
for col in cols_to_replace:
    spoiled_df[col] = spoiled_df[col].interpolate(method="pchip")

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    
# **Question  #1:**

**Based on the example above, replace** `NaN` **in "OBV" column with interpolated values**

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
spoiled_df["OBV"] = spoiled_df["OBV"].interpolate(method="pchip")
```

</details>


In [ ]:
spoiled_df.isna().sum()

**Good!** Now, we have a dataset with no missing values.


## **Correct data format**

**We are almost there!**

The last step in data cleaning is checking and making sure that all data is in the correct format (`int`, `float`, `text` or other).

In Pandas, we use:

* #### `.dtypes` to check the data types (an attribute, not a method)
* #### `.astype()` to change the data type


Let's list the data types for each column


In [ ]:
spoiled_df.dtypes

As we can see above, some columns are not of the correct data type. Numerical variables should have type `float` or `int`, and variables with timestamps should have type `datetime`. For example, the **"Open"**, **"High"**, **"Low"**, **"Close"** and **"Avg_price"** variables are numerical values that describe the price, so we should expect them to be of type `float`. **"Rec_count"** should have type `int` because it describes a quantity. The **"Ts"** column should have type `datetime`; however, that column is shown as type `object`. We have to convert data types into a proper format for each column using the `.astype()` method.
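
The idea can be sketched on a toy dataframe before applying it to the real one. The column names mirror the lab's dataset, but the values are illustrative. `pd.to_datetime` is shown here as a common alternative to `.astype()` for parsing timestamps.

```python
import pandas as pd

# Toy frame: timestamps stored as text, counts stored as floats
toy = pd.DataFrame({
    "Ts": ["2022-11-11 14:38:00", "2022-11-11 14:39:00"],
    "Rec_count": [12.0, 7.0],
})
print(toy.dtypes)

# Parse the text into datetimes and cast the counts to integers
toy["Ts"] = pd.to_datetime(toy["Ts"])
toy["Rec_count"] = toy["Rec_count"].astype("int64")
print(toy.dtypes)
```

Casting a float column to `int` only works when it contains no `NaN`, which is one reason to handle missing values first.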


## **Convert data types to proper format**


In [ ]:
spoiled_df["Ts"] = spoiled_df["Ts"].astype("datetime64[ns]")

Let us list the columns after the conversion


In [ ]:
spoiled_df.dtypes

**Wonderful!**

Now we have finally obtained the cleaned dataset, with no missing values and with all data in its proper format.


# **3. Data Standardization**

Data is usually collected from different agencies in different formats.
(Data standardization is also a term for a particular type of data normalization where we subtract the mean and divide by the standard deviation.)
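
The second meaning mentioned in the parentheses above, often called z-score standardization, can be sketched as follows. The numbers are illustrative, and the population standard deviation (`ddof=0`) is an assumption made for this example; the sample standard deviation is an equally common choice.

```python
import pandas as pd

# Illustrative price series
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Subtract the mean and divide by the standard deviation
z = (prices - prices.mean()) / prices.std(ddof=0)

# The result has mean 0 and standard deviation 1
print(z.mean(), z.std(ddof=0))
```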

**What is standardization?**

Standardization is the process of transforming data into a common format, allowing the researcher to make meaningful comparisons.

**The example**

Transform BUSD TO EUR:

In our dataset, **"Open"**, **"High"**, **"Low"**, **"Close"**, **"Volume"** and **"Avg_price"** are expressed in BUSD (Binance USD) units. But what if we want to express them in another currency?
We will need to apply a **data transformation** to convert BUSD into EUR.


We can do many mathematical operations directly in Pandas.


In [ ]:
spoiled_df.head()

We define a function that uses the CoinGecko API to fetch the current exchange rate and convert the values.


In [ ]:
def convert(series: pd.Series, from_curr: str, to_curr: str) -> pd.Series:
    """
    Converts `from_curr` into `to_curr`
    
    Parameters
    ----------
    series: pd.Series
        The column to be converted
    from_curr: str
        The name of the currency from which we will convert
    to_curr: str
        The name of the currency into which we will convert
    
    Returns
    -------
    series: pd.Series
        The converted series
    """
    from_curr, to_curr = from_curr.lower(), to_curr.lower()
    res = requests.get(f"https://api.coingecko.com/api/v3/simple/price?ids={from_curr}&vs_currencies={to_curr}")
    res = res.json()
    rate = float(res[from_curr][to_curr])
    
    print(f"The exchange rate is 1 {from_curr} = {rate} {to_curr}")
    series = series * rate
    return series

Let's convert BUSD to EUR:


In [ ]:
# Convert BUSD to EUR by mathematical operation
spoiled_df["Avg_price_EUR"] = convert(spoiled_df["Avg_price"], "BUSD", "EUR")

# check your transformed data 
spoiled_df[["Avg_price", "Avg_price_EUR"]].head()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    
# **Question  #2:**

**According to the example above, transform "Open" (price in BUSD) to EUR (using** `convert` **function) and name the column "Open_EUR"**

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
# transform BUSD to EUR by mathematical operation
spoiled_df["Open_EUR"] = convert(spoiled_df["Open"], "BUSD", "EUR")

# check your transformed data 
spoiled_df[["Open", "Open_EUR"]].head()
```

</details>


# **4. Data Normalization**

**Why normalization?**

Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.

**The example**

To demonstrate normalization, let's say we want to scale the columns **"Open"**, **"Close"** and **"Avg_price"**.

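**Target:** we would like to normalize those variables so that their values range from 0 to 1.

**Approach:** either replace each original value by (original value)/(maximum value), or use `sklearn.preprocessing.MinMaxScaler`, which computes (original value - minimum value)/(maximum value - minimum value).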
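
The two approaches are not identical, which a small hand computation makes visible. The sketch below uses made-up numbers and plain Pandas arithmetic; the second formula is the one `MinMaxScaler` applies with its default `feature_range` of (0, 1).

```python
import pandas as pd

# Illustrative values only
x = pd.Series([2.0, 4.0, 6.0, 10.0])

# Scaling by the maximum: the largest value maps to 1,
# but the smallest maps to min/max rather than 0
max_scaled = x / x.max()

# Min-max scaling: the smallest value maps to 0 and the largest to 1
min_max = (x - x.min()) / (x.max() - x.min())

print(pd.DataFrame({"x": x, "max_scaled": max_scaled, "min_max": min_max}))
```

For strictly positive data such as prices, both results stay within the range from 0 to 1, but only min-max scaling is guaranteed to reach 0.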


In [ ]:
# replace (original value) by (original value)/(maximum value)
spoiled_df["Open_norm"] = spoiled_df["Open"] / spoiled_df["Open"].max()

# replace (original value) by MinMaxScaler
scaler = MinMaxScaler()
spoiled_df["Close_norm"] = scaler.fit_transform(spoiled_df["Close"].to_numpy().reshape(-1, 1))

In [ ]:
spoiled_df[["Open_norm", "Close_norm"]].head()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    
# **Question #3:**

**According to the example above, normalize the column "Avg_price" using** `MinMaxScaler`

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
scaler = MinMaxScaler()
spoiled_df["Avg_price_norm"] = scaler.fit_transform(spoiled_df["Avg_price"].to_numpy().reshape(-1, 1))


# show the scaled column
spoiled_df[["Avg_price_norm"]].head()


```

</details>


Here we can see we've normalized **"Open"**, **"Close"** and **"Avg_price"** in the range of \[0,1].


# **5. Binning and indicator variable**

**Why binning?**

Binning is a process of transforming continuous numerical variables into discrete categorical "bins" for grouped analysis.

**The example:**

In our dataset, **"Volume"** is a real-valued variable. What if we want to know how often trading volume was low, medium or high? We can rearrange the values into three "bins" to simplify the analysis.

We will use the Pandas method `cut` to segment the **"Volume"** column into 3 bins.
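
Before working with the real column, here is a minimal sketch of how `cut` assigns values to bins. The volumes and bin edges are illustrative assumptions. Each bin includes its right edge, and `include_lowest=True` makes the first bin include its left edge as well.

```python
import pandas as pd

# Illustrative volumes and bin edges
volume = pd.Series([500, 12000, 45000, 9000, 30000])
edges = [0, 10000, 30000, 50000]
labels = ["Low", "Medium", "High"]

# Bins are [0, 10000], (10000, 30000], (30000, 50000]
binned = pd.cut(volume, bins=edges, labels=labels, include_lowest=True)
print(binned)
```

A value of exactly 30000 therefore falls into "Medium", not "High".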


## **Example of Binning Data In Pandas**


Let's plot the histogram of **"Volume"** to see what the distribution of **"Volume"** looks like.


In [ ]:
plt.hist(spoiled_df["Volume"])

# set x/y labels and plot title
plt.xlabel("Volume")
plt.ylabel("Count")
plt.title("Volume bins")

In [ ]:
bins = [min(spoiled_df["Volume"]), 10000, 30000, max(spoiled_df["Volume"])]
bins

In [ ]:
spoiled_df["Volume"].describe()

We set group  names:


In [ ]:
group_names = ["Low", "Medium", "High"]

We apply the function <b>cut</b> to determine which bin each value of `spoiled_df["Volume"]` belongs to.


In [ ]:
spoiled_df["Volume-binned"] = pd.cut(spoiled_df["Volume"], bins, labels=group_names, include_lowest=True)
spoiled_df[["Volume", "Volume-binned"]].head()

Let's see the number of records in each bin:


In [ ]:
spoiled_df["Volume-binned"].value_counts()

Let's plot the distribution of each bin:


In [ ]:
plt.bar(list(spoiled_df["Volume-binned"].value_counts().index), spoiled_df["Volume-binned"].value_counts())

# set x/y labels and plot title
plt.xlabel("Volume")
plt.ylabel("Count")
plt.title("Volume bins")

As we can see, we managed to create 3 classes based on **"Volume"**.


## **Bins Visualization**

Normally, a histogram is used to visualize the distribution of bins we created above. 


In [ ]:
# draw histogram of attribute "Volume-binned" with bins = 3
plt.hist(spoiled_df["Volume-binned"], bins=3, edgecolor='white')

# set x/y labels and plot title
plt.xlabel("Volume")
plt.ylabel("Count")
plt.title("Volume-binned bins")

The plot above shows the binning result for the attribute **"Volume"**.


## **Indicator Variable (or Dummy Variable)**

**What is an indicator variable?**

An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called "dummies" because the numbers themselves don't have inherent meaning. 

**Why do we use indicator variables?**

We use indicator variables so we can use categorical variables for regression analysis in the later modules.

**Example**

We see the column **"Volume-binned"** has three unique values: "Low", "Medium" or "High". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert **"Volume-binned"** to indicator variables.

We will use pandas' method `get_dummies` to assign numerical values to different categories of **"Volume"**. 
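
A small sketch on a toy series shows what `get_dummies` produces. The `dtype=int` argument is an addition made for this example: it forces 0/1 output, since recent Pandas versions return boolean columns by default.

```python
import pandas as pd

# Toy categorical data with the same three labels as "Volume-binned"
levels = pd.Series(["Low", "High", "Medium", "Low"], name="Volume-binned")

# One column per category; exactly one of them is 1 in every row
dummies = pd.get_dummies(levels, prefix="Volume", dtype=int)
print(dummies)
```

For plain text data the resulting columns are ordered alphabetically, while for a categorical column such as the output of `cut` they follow the category order.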


Get the indicator variables and assign them to the data frame `dummy_variable_1`:


In [ ]:
dummy_variable_1 = pd.get_dummies(spoiled_df["Volume-binned"], prefix="Volume")
dummy_variable_1.head()

In [ ]:
# merge data frame "spoiled_df" and "dummy_variable_1" 
spoiled_df = pd.concat([spoiled_df, dummy_variable_1], axis=1)

# drop original column "volume-binned" from "spoiled_df"
spoiled_df.drop("Volume-binned", axis = 1, inplace=True)

In [ ]:
spoiled_df.head()

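The last three columns are now the indicator variable representation of the volume variable. Each holds only two values: 0 and 1 (displayed as `False` and `True` in pandas 2.0 and later, where `get_dummies` returns boolean columns by default).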


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    
# **Question  #4:**

**Similar to before, create indicator variables for the day of the week of the column "Ts": create the category column `spoiled_df["Ts_day_name"] = spoiled_df["Ts"].dt.day_name()`, apply `get_dummies` to that column, and concatenate the result with `spoiled_df`**

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
# Create a new column
spoiled_df["Ts_day_name"] = spoiled_df["Ts"].dt.day_name()
# get indicator variables of day_name and assign it to data frame "dummy_variable_2"
dummy_variable_2 = pd.get_dummies(spoiled_df["Ts_day_name"], prefix="Ts")
# Concatenate df's
spoiled_df = pd.concat([spoiled_df, dummy_variable_2], axis=1)
spoiled_df.head()

```

</details>


# **6. Resampling**


**What is resampling?**

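Resampling changes the frequency of a time series. Here we downsample: records are grouped into longer time intervals, and each group is expressed in summary form by an aggregation function.


Let's set **"Ts"** as the index so that we can resample our dataframe.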


In [ ]:
spoiled_df.index = spoiled_df["Ts"]

After resampling we can use different aggregation functions such as:

<ul>
    <li><code>mean()</code></li>
    <li><code>sum()</code></li>
    <li><code>prod()</code></li>
    <li><code>first()</code></li>
    <li><code>last()</code></li>
    <li><code>min()</code></li>
    <li><code>max()</code></li>
</ul>
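
As a minimal illustration, the sketch below builds six synthetic 1-minute records (the values are made up) and resamples them into 3-minute intervals, using a different aggregation function for each column.

```python
import pandas as pd

# Six synthetic 1-minute records with a datetime index
idx = pd.date_range("2022-11-11 00:00", periods=6, freq="1min")
toy = pd.DataFrame(
    {"Close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     "Volume": [10, 20, 30, 40, 50, 60]},
    index=idx,
)

# Group into 3-minute intervals: keep the last close, add up the volume
out = toy.resample("3min").agg({"Close": "last", "Volume": "sum"})
print(out)
```

The six rows collapse into two: each keeps the closing price of its last minute and the total volume of its three minutes.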


In [ ]:
resampled_df1 = spoiled_df[["Open", "High", "Low", "Close", "Volume"]].resample("15min").agg({
    "Open": "first",
    "High": "max",
    "Low": "min",
    "Close": "last",
    "Volume": "sum",
})
resampled_df1.head()

As we can see, we resampled the data with an aggregation time of 15 minutes, which reduces the amount of data and generalizes it. We can now reason about a much wider time window (15 minutes instead of 1 minute).


 <div class="alert alert-danger alertdanger" style="margin-top: 20px">

# **Question  #5:**

**Resample `spoiled_df` with an aggregation time of 1 hour for the columns "High", "Low" and "Close", applying the `max()`, `min()` and `last()` functions respectively. Assign the result to the `resampled_df2` variable and display it**

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
resampled_df2 = spoiled_df[["High", "Low", "Close"]].resample("1h").agg({
    "High": "max",
    "Low": "min",
    "Close": "last"
})
resampled_df2.head()
```

</details>


Save the new csv:


In [ ]:
spoiled_df.to_csv("MATIC BUSD-lab2.csv", index=False)

# **Thank you for completing this lab!**

## Author

<a href="https://author.skills.network/instructors/borys_melnychuk?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0IXGEN2343-2023-01-01" >Borys Melnychuk</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/mariya_fleychuk?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Mariya Fleychuk, DrSc, PhD</a>



## Change Log

| Date (YYYY-MM-DD) | Version | Changed By      | Change Description                                         |
| ----------------- | ------- | ----------------| ---------------------------------------------------------- |
|     2023-03-04    |   1.0   | Borys Melnychuk | Creation of the lab                                        |

<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
