<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="400" alt="cognitiveclass.ai logo">
</center>

# **Investigating relationships between the BTC/BUSD exchange rate and the ADOSC, NATR, TRANGE indicators**


## Lab 2. Data Wrangling


Estimated time needed: **30** minutes


### Objectives

After completing this lab you will be able to:

*   Handle missing values
*   Correct data format
*   Standardize and normalize data
*   Resample data

<h3>Table of Contents</h3>
<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li>Import and Load Data</li>
    <li>Generating missing values</li>
    <li>Identify and handle missing values
        <ul>
            <li>Identify missing values</li>
            <li>Deal with missing values</li>
            <li>Correct data format</li>
        </ul>
    </li>
    <li>Data standardization</li>
    <li>Data normalization (centering/scaling)</li>
    <li>Binning</li>
    <li>Indicator variable</li>
    <li>Resampling</li>
</ol>

</div>


## Dataset Description

### Context
The dataset contains historical values of the ***BTC/BUSD*** exchange rate and the ***ADOSC, NATR, TRANGE indicators*** for the period from *11/11/2022 to 11/24/2022* with a *1-minute* aggregation time.

### Columns

#### Input columns
* ***Time*** - the timestamp of the record
* ***Open*** -  the price of the asset at the beginning of the trading period
* ***High*** -  the highest price of the asset during the trading period
* ***Low*** - the lowest price of the asset during the trading period.
* ***Close*** - the price of the asset at the end of the trading period
* ***Volume*** - the total number of shares or contracts of a particular asset that are traded during a given period
* ***Count*** -  the number of individual trades or transactions that have been executed during a given time period
* ***ADOSC*** - Chaikin oscillator indicator
* ***NATR*** - normalized average true range (ATR) indicator
* ***TRANGE*** - true range indicator

#### Target column
* ***Price*** - the average price at which a particular asset has been bought or sold during a given period


## 1. Import and Load Data

### Import data
<p>
You can find the dataset at the following <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX08LZES/BTCBUSD_trades_1m.csv">link</a>. We will be using this dataset throughout this course.
</p>


If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:


In [ ]:
# install the specific versions of the libraries used in this lab
#! mamba install pandas==1.3.3
#! mamba install numpy=1.21.2
#! mamba install scipy=1.7.1 -y
#! mamba install scikit-learn

Let's import the modules we will use:


In [ ]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
import requests
import json
import pandas as pd
import numpy as np
import scipy as sc
import random
import string
import sklearn.metrics
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
%matplotlib inline
# "plt" refers to the top-level matplotlib package here, so later cells call plt.pyplot.<function>
import matplotlib as plt
from matplotlib import pyplot

### Read Dataset

First, we assign the URL of the dataset to <code>"path"</code>.

This dataset is hosted on IBM Cloud Object Storage. Click <a href="https://cocl.us/corsera_da0101en_notebook_bottom?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01">HERE</a> for free storage.


In [ ]:
path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX08LZES/BTCBUSD_trades_1m.csv"

Use the Pandas method <code>read_csv()</code> to load the data from the web address. Set the parameter <code>index_col=0</code> to use the first column of the CSV file as the index of the dataframe.


In [ ]:
df = pd.read_csv(path, index_col=0)

In finance you sometimes need to use different numbers of decimal places. For ease of reading, let's set the <code>display.precision</code> option to 4 so that four decimal places are displayed (instead of the default 6).


In [ ]:
pd.set_option("display.precision", 4)

Set dataframe index column type to <strong>datetime</strong> using <code>pd.to_datetime()</code> method for correct time series analysis. 


In [ ]:
df.index = pd.to_datetime(df.index)

Use the method <code>head()</code> to display the first five rows of the dataframe.


In [ ]:
# To see what the data set looks like, we'll use the head() method.
df.head()

### Drop NaN


In the previous lab we calculated technical financial indicators. Since the values of previous periods had to be taken into account for their calculation, the first few lines of the dataframe contain `NaN` values.

The methods we will use later in this module to recover missing data cannot reliably restore values in the first rows of a time series, since there is no earlier data to rely on. Therefore, we remove the rows containing `NaN` values with the `df.dropna(inplace=True)` method.


In [ ]:
df.dropna(inplace=True)
df.head()

Great! Now we're ready to get started and get familiar with data wrangling.


### What is the purpose of data wrangling?

<strong><em>Data wrangling</em></strong> is the process of converting data from the initial format to a format that may be better for analysis.


## 2. Generating missing values


One of the important steps of data wrangling is identifying gaps or empty cells in data and either filling or removing them. 
<p>
Let's find out if our dataset has fields with missing values. Use the method <code>isnull()</code> to detect missing values. The output is a dataframe of boolean values indicating whether each cell is missing: <strong>False</strong> means the cell is not empty, <strong>True</strong> indicates a missing value.
</p>


In [ ]:
missing_data = df.isnull()
missing_data.head()

After that use <code>value_counts()</code> to return a series containing counts of unique rows in the dataframe.


In [ ]:
missing_data.value_counts()

As we can see each column of our dataframe has only boolean **False** value that indicates no missing values are present in our dataset. 
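
A more compact check is to sum the boolean mask column by column. The sketch below uses a small hypothetical dataframe named `demo` (not the lab dataset) so the result can be verified by eye.

```python
import numpy as np
import pandas as pd

# hypothetical toy dataframe, used only for illustration
demo = pd.DataFrame({"Open": [1.0, np.nan, 3.0], "Close": [1.5, 2.5, np.nan]})

# True counts as 1, so the column-wise sum is the number of missing cells
missing_per_column = demo.isnull().sum()
print(missing_per_column)
```

Applied to the lab dataframe, the same call would report zero for every column at this point.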

<p>
To gain experience with each step of the data wrangling process, let's generate missing values in our dataset. Then we will try to restore them and compare the result with the real data.
</p>


### Generating incorrect data


Missing data is not the only kind of incorrect data that may occur in datasets. Let us generate the following incorrect data for the <em>OHLC price columns (Open, High, Low, Close)</em> of our dataset:
<ul>
<li> <em>missing values (NaN)</em> </li>
<li> <em>negative values</em> </li>
<li> <em>strings</em> </li>
</ul>
<p>
We declare the function <code>generate_incorrect_data()</code> to handle this task.
</p>
<p>
To pick random cells, we combine <code>np.random.rand()</code> from NumPy's random module with a <code>where()</code> call.
</p>    
<p>
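The code below calls the pandas method <code>Series.where(cond, other)</code>, which keeps a value where <code>cond</code> is True and replaces it with <code>other</code> otherwise. It is the counterpart of the NumPy function: 
 <pre>numpy.where(condition, x, y)</pre>

which returns elements chosen from x or y depending on condition.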
</p>
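
To make the behaviour of `numpy.where(condition, x, y)` concrete, here is a minimal sketch on a hand-made array. The names `values` and `keep` are hypothetical and only serve the illustration.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0])
keep = np.array([True, False, True, False])  # hypothetical condition

# where keep is True take the original value, otherwise take NaN
result = np.where(keep, values, np.nan)
print(result)
```

The second and fourth entries become `NaN`, while the others are left untouched.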
<p>
    In our case the condition compares an array of uniform random numbers from <code>np.random.rand()</code>, one per row, with 0.07. Where the condition is True (roughly 93% of the rows) the original value is kept; otherwise it is replaced by incorrect data such as NaN, strings, and negative values.
</p>
<p>
    To generate negative values we use <code>np.random.uniform(low, high)</code>, which draws samples from a uniform distribution over the <code>[low, high)</code> interval.
</p>
<p>
    To generate strings we use <code>random.choices(array, k)</code> to create a random sample from a given array. In our case the array is the set of ASCII letters <em>string.ascii_letters</em>, and we create sequences of 7 characters with <code>k=7</code>. As the <code>choices()</code> method returns a list of the randomly selected elements, we need to convert it to the <em>string</em> type. To accomplish this we use the <code>str.join()</code> method, which takes all items in an iterable and joins them into one string.
</p>
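
The two generators described above can be tried in isolation. This is a small sketch; the seeds are set only so the illustration is repeatable and are not part of the lab.

```python
import random
import string

import numpy as np

random.seed(0)      # fixed seeds, only for a repeatable illustration
np.random.seed(0)

letters = random.choices(string.ascii_letters, k=7)  # list of 7 single characters
word = "".join(letters)                              # joined into one string
negative = np.random.uniform(-1.0, 0.0)              # one float drawn from [-1.0, 0.0)

print(letters)
print(word, negative)
```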


In [ ]:
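def generate_incorrect_data(data: pd.DataFrame, columns):
    """Return the dataframe with incorrect data injected into the given columns.

    Each pass replaces roughly 7% of the cells in a column.
    """
    for column in columns:
        # missing values
        data[column] = data[column].where(lambda x: np.random.rand(len(x)) > 0.07, np.nan)
        # a negative value (one random draw, reused for every replaced cell)
        data[column] = data[column].where(lambda x: np.random.rand(len(x)) > 0.07, np.random.uniform(-1.0, 0.0))
        # a random 7-letter string (one draw, reused for every replaced cell)
        data[column] = data[column].where(lambda x: np.random.rand(len(x)) > 0.07,
                                          ''.join(random.choices(string.ascii_letters, k=7)))
    return data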

You have probably noticed the keyword <code>lambda</code> in our function.

#### What is lambda?

<pre><i>Lambda</i> - an anonymous function which we can pass in instantly without defining a name like a full traditional function.</pre>
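
As a rough illustration of "anonymous", the sketch below passes the same condition to `Series.where()` once as a named function and once as a lambda. The series `prices` and the helper `is_positive` are hypothetical names introduced for this example only.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, -0.5, 102.0, -0.1])  # hypothetical values

def is_positive(s):
    """Named version of the condition: receives the whole Series, returns a boolean mask."""
    return s > 0

with_named = prices.where(is_positive, np.nan)
with_lambda = prices.where(lambda s: s > 0, np.nan)  # same condition, no name needed

print(with_lambda)
print(with_lambda.equals(with_named))
```

Both calls give the same result; the lambda simply saves defining a one-line function that is used once.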

A lambda function can be applied to both the columns and the rows of a Pandas dataframe. In our case the lambda passed to <code>where()</code> receives the whole column and returns a boolean mask with one entry per cell of that column.


We should keep our initial dataframe unchanged so we can use it in further steps of the current lab. Thus, the following manipulations will be performed on a copy of the dataframe. To make a copy of this object's indices and data use the <code>copy()</code> function:


In [ ]:
df_missing = df.copy()

columns = ['Open', 'High', 'Low', 'Close']
df_missing = generate_incorrect_data(df_missing, columns)
df_missing

As we can see, incorrect data appeared in the dataframe; those may hinder our further analysis.

So, how do we identify all those missing and incorrect values and deal with them?

<strong>How to work with missing data?</strong>

Steps for working with missing data:
<ol>
    <li><em>Identify missing data</em></li>
    <li><em>Deal with missing data</em></li>
    <li><em>Correct data format</em></li>
</ol>


## 3. Identify and handle missing values

### 3.1 Identify missing values

#### Convert incorrect data to NaN

We replace incorrect data with NaN (Not a Number), pandas' default missing value marker, for reasons of computational speed and convenience. Here we use the function: 
 <pre>.replace(A, B, inplace = True) </pre>

to replace A by B.
<br>
<p>Try to replace the string cells in the dataframe using this function. Because the strings were generated randomly, we should use <strong>regular expressions</strong> (<em>aka Regex</em>) to match all dataframe values that consist of a sequence of uppercase and lowercase letters.
<pre><i>RegEx</i>, or regular expression - a sequence of characters that forms a search pattern.</pre>
</p>
<p>
To accomplish this we need to specify the parameter <code>regex=True</code> in <code>replace()</code> method to use regular expressions and pass the regex itself. To determine the strings of letters use regex <code>r'^[A-Za-z]+$'</code>. 
</p>
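
To see what the pattern `r'^[A-Za-z]+$'` does and does not match, here is a short sketch using Python's built-in `re` module on a few hypothetical sample strings.

```python
import re

pattern = re.compile(r'^[A-Za-z]+$')

# hypothetical samples: letters only, a number, mixed content, an empty string
samples = ["aBcDeFg", "16543.2", "abc123", ""]
matches = [bool(pattern.match(s)) for s in samples]
print(matches)
```

Only the first sample consists entirely of letters, so it is the only one that would be replaced.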


In [ ]:
# replace strings with NaN
df_missing.replace(r'^[A-Za-z]+$', np.nan, inplace=True, regex=True)
df_missing

In the next step we replace the negative values in the dataframe with <code>NaN</code>. To accomplish this we use the pandas <code>where()</code> method with the condition <code>x > 0</code>. Where the condition is True the value <code>x</code> is kept; otherwise it is replaced by <code>NaN</code>.


In [ ]:
# replace negative values with NaN
for column in columns:
    df_missing[column] = df_missing[column].where(lambda x: x > 0, np.nan)
df_missing

#### Evaluating for Missing Data

Incorrect values are now marked as <code>NaN</code>. There are two methods to detect missing data:

<ol>
    <li><code>.isnull()</code></li>
    <li><code>.notnull()</code></li>
</ol>

Each returns a dataframe of boolean values indicating whether each cell is (or, for <code>.notnull()</code>, is not) missing.


In [ ]:
missing_data = df_missing.isnull()
missing_data.head()

<strong>True</strong> means the value is a missing value, whereas <strong>False</strong> means the value is not a missing value.


#### Count missing values in each column
<p>
Using a for loop in Python, we can quickly figure out the number of missing values in each column. 
</p>

As mentioned above, <strong>True</strong> represents a missing value and <strong>False</strong> means the value is present in the dataset. In the body of the for loop the method <code>.value_counts()</code> counts the <strong>True</strong> and <strong>False</strong> values in each column. 



In [ ]:
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print('')

Based on the summary above, all four modified columns ("Open", "High", "Low" and "Close") contain missing data.


### 3.2 Deal with missing data

#### How to deal with missing data?
<ol>
    <li><strong>Drop data</strong><br>
        a. Drop <em>the whole row</em><br>
        b. Drop <em>the whole column</em>
    </li>
    <li><strong>Replace data</strong><br>
        a. Replace it <em>by mean</em><br>
        b. Replace it <em>by frequency</em><br>
        c. Replace it <em>based on other functions</em>
    </li>
</ol>


Whole columns should be dropped only if most entries in the column are empty. In our dataset, none of the columns are empty enough to drop entirely.
We have some freedom in choosing which method to use to replace data. However, some methods may seem more reasonable than others. 

We will fill the gaps by <strong>interpolation</strong> with several different techniques. Then we will compare the techniques by calculating the error between each restored dataframe and the real one. 

#### What is an interpolation?

In the mathematical field of numerical analysis, <strong><em>interpolation</em></strong> is a type of estimation, a method of constructing (finding) new data points based on the range of a discrete set of known data points.
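
As a minimal sketch of the idea, linear interpolation places the missing points on the straight line between the known neighbours. The series `s` is a hypothetical example, not lab data.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 16.0])  # two gaps between 10 and 16

# equally spaced points on the line from 10 to 16
filled = s.interpolate(method="linear")
print(filled.tolist())
```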


For further comparison of the obtained dataframes with the initial one, we will create a separate dataframe to record the data recovery method and calculated accuracy.


In [ ]:
df_precision = pd.DataFrame({"method":[], "MSE": [], "MAPE": []})

#### Estimation metrics


To calculate the difference between the restored and initial dataframes we will use the $Mean\ Squared\ Error$ and the $Mean\ Absolute\ Percentage\ Error$. 

#### What is Mean Squared Error (MSE)?

The $Mean\ Squared\ Error$ is a measure of the quality of an estimator. As it is derived from the square of Euclidean distance, it is always non-negative, and it decreases as the error approaches zero.

$$
MSE = \frac{1}{n} \sum \limits _{i=1} ^{n} (A_i - F_i)^{2},
$$ 
<center>where $A_i$ — the actual value, $F_i$ — the forecast value.</center>

Read more about it <a href="https://en.wikipedia.org/wiki/Mean_squared_error?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08LZES2357-2023-01-01">here</a>.

<p>
To calculate it we will use <code>mean_squared_error(y_true, y_pred)</code> from the <strong><em>scikit-learn library</em></strong>, where <code>y_true</code> represents <em>ground truth (correct) target values</em> and <code>y_pred</code> corresponds to <em>estimated target values</em>.
</p>
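
The formula can be checked by hand on a few numbers. The sketch below computes MSE in plain Python for hypothetical `actual` and `restored` values; it is meant as a worked example of the formula, not as a replacement for the scikit-learn function.

```python
actual = [100.0, 102.0, 104.0]     # hypothetical true values (A_i)
restored = [100.0, 101.0, 107.0]   # hypothetical restored values (F_i)

# squared differences: 0, 1 and 9
squared_errors = [(a - f) ** 2 for a, f in zip(actual, restored)]

# average of the squared differences: 10 / 3
mse = sum(squared_errors) / len(actual)
print(mse)
```

The single error of 3 contributes 9 to the sum, which shows how strongly MSE penalizes large deviations.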

#### What is Mean Absolute Percentage Error (MAPE)?

A statistic known as $Mean\ Absolute\ Percentage\ Error$ is used to assess how accurate a forecasting technique is. It represents the average of the absolute percentage errors of each entry in a dataset to calculate how accurate the forecasted quantities were in comparison with the actual quantities. 

$$
MAPE = \frac{100}{n} \sum \limits _{i=1} ^{n} \left\lvert \frac{A_i - F_i}{A_i} \right\rvert,
$$ 
<center>where $A_i$ is the actual value and $F_i$ is the forecast value.</center>

<br>

The absolute value of this ratio is summed over every forecast point in time and divided by the number of fitted points $n$. Read more about it <a href="https://en.wikipedia.org/wiki/Mean_absolute_percentage_error?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08LZES2357-2023-01-01">here</a>.
<br>

To calculate MAPE we can use <code>mean_absolute_percentage_error(y_true, y_pred)</code> from the <strong><em>scikit-learn library</em></strong>, where <code>y_true</code> represents ground truth (correct) target values and <code>y_pred</code> corresponds to estimated target values. Note that this function returns a fraction rather than a percentage.

However, let's use scikit-learn <code>mean_squared_error()</code> function for calculating $MSE$ and create a custom function to calculate $MAPE$.  


In [ ]:
def mape(y_true, y_pred):
    """Return Mean Absolute Percentage Error (MAPE).
    """
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
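
A quick way to sanity-check the helper is to run it on numbers whose result is easy to compute by hand. The sketch below repeats the `mape()` definition so it runs on its own; the `actual` and `restored` lists are hypothetical.

```python
import numpy as np

def mape(y_true, y_pred):
    """Return Mean Absolute Percentage Error (MAPE), same definition as in the lab."""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

actual = [100.0, 200.0, 400.0]
restored = [110.0, 190.0, 400.0]

# relative errors are 10%, 5% and 0%, so the mean is 5%
score = mape(actual, restored)
print(score)
```

Because each error is divided by the actual value, the metric is undefined when an actual value is zero. That is not a concern for the price columns used here.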

Excellent! Let's move to replacing missing data.


#### Replace "NaN" with the interpolation techniques


Let us use <code>interpolate(method, inplace=True)</code> to fill <code>NaN</code> values using interpolation, where the parameter <code>method</code> specifies the interpolation technique to use, and <code>inplace=True</code> modifies the current dataframe in place.


Let's try to fill missing values with the <em>linear interpolation technique</em>. 


In [ ]:
# fill NaN values using an interpolation method 
df_pred = df_missing.astype('float').interpolate(method='linear')
df_pred

As we can see, the missing data has disappeared.

Now that we've tested how interpolation works, let's test its other methods and compare their accuracy.


<p>
Note that even after using interpolation the first and last entries in the columns may remain <code>NaN</code> (if incorrect data is generated in such positions), because there is no data before or after them to use for interpolation. To fix this issue we use <code>fillna(method, inplace)</code> with the <code>'ffill'</code> and <code>'bfill'</code> methods, which fill gaps from the previous and the next valid observation respectively.
</p>
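
The edge case can be reproduced on a tiny hypothetical series. The sketch uses the `ffill()` and `bfill()` shortcut methods, which are equivalent to `fillna(method='ffill')` and `fillna(method='bfill')`.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0])  # gap at the very start and in the middle

# the inner gap is interpolated, the leading NaN has no left neighbour and stays
interp = s.interpolate(method="linear")

# forward fill, then backward fill, takes care of whatever is left at the edges
filled = interp.ffill().bfill()

print(interp.tolist())
print(filled.tolist())
```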
<p>
Along with that, we need to keep updating our dataframe <code>df_precision</code>, which stores the Mean Squared Error (MSE) and Mean Absolute Percentage Error (MAPE) values of each data restoring method used.
</p>


Note that both the <em>polynomial</em> and <em>spline</em> techniques require specifying an <code>order</code> parameter, e.g. <code>df.interpolate(method='polynomial', order=5)</code>.

Considering the amount of data and interpolation methods, the following code may take some time to execute.


In [ ]:
methods = ["linear", "nearest", "quadratic", "cubic"]
order_methods = ["spline", "polynomial"]

for method in methods:
    # fill NaN values using an interpolation method 
    df_pred = df_missing.astype('float').interpolate(method=method)
    df_pred.fillna(method="ffill", inplace=True)
    df_pred.fillna(method='bfill', inplace=True)
    
    # calculate MSE and MAPE
    df_precision.loc[len(df_precision.index)] = [method, mean_squared_error(df[columns], df_pred[columns]), mape(df[columns], df_pred[columns])]

for method in order_methods:
    # fill NaN values using an interpolation method
    df_pred = df_missing.astype('float').interpolate(method=method, order=2)
    df_pred.fillna(method="ffill", inplace=True)
    df_pred.fillna(method='bfill', inplace=True)
    
    # calculate MSE and MAPE
    df_precision.loc[len(df_precision.index)] = [method, mean_squared_error(df[columns], df_pred[columns]), mape(df[columns], df_pred[columns])]

df_precision.set_index('method', inplace=True)
df_precision

As we can see, <em>the linear interpolation method</em> did the best job in restoring data in our dataframe. The MSE and MAPE values it produced are the smallest among all methods.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #1: </h1>

<b>Based on the example above, use the previously declared function to generate incorrect data in the "Close" column and replace it with <code>NaN</code> values. Then fill the missing data using the <em>linear interpolation method</em>.</b>

</div>


<strong><em>Note:</em></strong> to keep our initial dataframe unchanged, make a copy of it using the <code>df.copy()</code> method.


In [ ]:
# Write your code below and press Shift+Enter to execute

# copy initial dataframe
df_copied = df.copy()

# generate incorrect data in the "Close" column
df_copied = generate_incorrect_data(df_copied, ["Close"])

# replace strings in "Close" column with NaN value
df_copied.replace(r'^[A-Za-z]+$', np.nan, inplace=True, regex=True)

# replace negative values in "Close" column with NaN value
df_copied["Close"] = df_copied["Close"].where(lambda x: x > 0, np.nan)

# replace NaN by linear interpolation (cast to float first: the injected strings made the column object dtype)
df_copied["Close"] = df_copied["Close"].astype(float).interpolate(method="linear")

# check changes in "Close" column
df_copied[["Close"]].head()

<details><summary>Click here for the solution</summary>

```python
# copy initial dataframe
df_copied = df.copy()

# generate incorrect data in the "Close" column
df_copied = generate_incorrect_data(df_copied, ["Close"])

# replace strings in "Close" column with NaN value
df_copied.replace(r'^[A-Za-z]+$', np.nan, inplace=True, regex=True)

# replace negative values in "Close" column with NaN value
df_copied["Close"] = df_copied["Close"].where(lambda x: x > 0, np.nan)

# replace NaN by linear interpolation (cast to float first: the injected strings made the column object dtype)
df_copied["Close"] = df_copied["Close"].astype(float).interpolate(method="linear")

# check changes in "Close" column
df_copied[["Close"]].head()
```
    
</details>


<b>Good!</b> Now, we have a dataset with no missing values.


### 3.3 Correct data format

<p>The last step in data cleaning is checking and making sure that all data is in the correct format (<strong>int, float, text</strong> or other).</p>

In Pandas, we use:
<ul>
    <li><code>.dtypes</code> to check the data type</li>
    <li><code>.astype()</code> to change the data type</li>
</ul>


Let's list the data types for each column:


In [ ]:
df.dtypes

<p>As we can see above, all columns are of the correct data type. Numerical variables should have type <code>'float'</code> or <code>'int'</code>, and variables with strings such as categories should have type <code>'object'</code>. 
<br>
For example, 'Open' and 'Count' variables are numerical values, so we should expect them to be of the type <code>'float'</code> or <code>'int'</code>.
</p>

If we have to convert data types into a proper format for each column, we use the <code>astype()</code> method:
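
Since no conversion is needed for our dataset, here is a minimal sketch on a hypothetical dataframe `demo` in which a numeric column was read as text.

```python
import pandas as pd

# hypothetical example: trade counts stored as strings
demo = pd.DataFrame({"Count": ["10", "25", "7"]})
print(demo.dtypes)   # the column starts out as object

# convert the column to 64-bit integers
demo["Count"] = demo["Count"].astype("int64")
print(demo.dtypes)
```

After the conversion, numeric operations such as `demo["Count"].sum()` behave as expected.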


<b>Wonderful!</b>

Now we have finally obtained the cleaned dataset with no missing values with all data in its proper format.


## 4. Data Standardization
<p>
Data is usually collected from different agencies in different formats.
Data standardization is also a term for a particular type of data normalization, where we subtract the mean and divide by the standard deviation.
</p>
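
The "subtract the mean and divide by the standard deviation" variant mentioned above can be sketched in a few lines. The array `values` is hypothetical; the result has mean 0 and standard deviation 1.

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0, 8.0])  # hypothetical data

# z-score: centre on the mean, then scale by the (population) standard deviation
z = (values - values.mean()) / values.std()
print(z)
```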

#### What is data standardization?

<p>Standardization is the process of transforming data into a common format, allowing the researcher to make meaningful comparisons.
</p>

<p><strong>Example:</strong> <em>transform BUSD to USDT</em></p>
<p>In our dataset, the columns "Open", "High", "Low", "Close", "Volume" and "Price" are expressed in BUSD (Binance USD). However, USDT is more commonly used. We will apply a <strong>data transformation</strong> to convert BUSD into USDT.</p>


<p>Let us start by obtaining the current exchange rate. We will use the <code>requests.get()</code> method to make an HTTP request to the official Binance API and fetch exchange rate data from it. </p>


In [ ]:
# get updated USDT rate (keep the raw response so its HTTP status can be checked below)
response = requests.get('https://api.binance.com/sapi/v1/convert/exchangeInfo?fromAsset=BUSD&toAsset=USDT')

The body of the response is expected to be in <strong>JSON format</strong>.

<pre><em><strong>JavaScript Object Notation (JSON)</strong></em> is a standard text-based format for representing structured data based on JavaScript object syntax. It is frequently employed for data transmission in online applications (e.g., sending some data from the server to the client, so it can be displayed on a web page, or vice versa).</pre>

Then we should check the HTTP status of the response. The code <strong>200 (OK success)</strong> indicates that the request has succeeded. To obtain the current rate we need to access the <em>"toAssetMinAmount"</em> field in our response.


In [ ]:
# if the API is unavailable we set fixed rate
try:
    if response.status_code != 200:
        rate = 0.999707
    else:
        # parse the JSON body only after a successful status
        rate = float(response.json()[0]["toAssetMinAmount"])
except Exception:
    rate = 0.999707

print(f"The exchange rate is 1 BUSD = {rate} USDT")

<p>
In the next step we calculate the converted value of each of the following columns: "Open", "High", "Low", "Close", "Volume", "Price".
</p>


In [ ]:
columns_to_convert = ['Open', 'High', 'Low', 'Close', 'Volume', 'Price']
for column in columns_to_convert:
    df[f'{column}_USDT'] = df[column] * rate

<p>Do not forget to drop unnecessary columns in BUSD currency.
</p>


In [ ]:
# drop unnecessary columns
df.drop(columns_to_convert, axis=1, inplace=True)

Finally, we check our transformed data.


In [ ]:
# check your transformed data
df[['Open_USDT', 'High_USDT', 'Low_USDT', 'Close_USDT', 'Volume_USDT', 'Price_USDT']].head()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #2: </h1>

<b>According to the example above, transform the "Price_USDT" column values from USDT to EUR and change the name of the column to "Price_EUR".</b>
</div>


<strong><em>Note:</em></strong> to receive the current data on the euro exchange rate, use the following URL: https://api.binance.com/sapi/v1/convert/exchangeInfo?fromAsset=USDT&toAsset=EUR.


In [ ]:
# Write your code below and press Shift+Enter to execute 

# get current EUR rate (keep the raw response so its HTTP status can be checked)
response = requests.get('https://api.binance.com/sapi/v1/convert/exchangeInfo?fromAsset=USDT&toAsset=EUR')

# if the API is unavailable we set fixed rate
try:
    if response.status_code != 200:
        rate = 0.92
    else:
        rate = float(response.json()[0]['toAssetMinAmount'])
except Exception:
    rate = 0.92
    
print(f'The exchange rate is 1 USDT = {rate} EUR')

# change rate in the column
df['Price_USDT'] = df['Price_USDT'] * rate

# rename column name from "Price_USDT" to "Price_EUR"
df.rename(columns={'Price_USDT':'Price_EUR'}, inplace=True)

# check your transformed data
df[['Price_EUR']].head()

<details><summary>Click here for the solution</summary>

```python
# get current EUR rate (keep the raw response so its HTTP status can be checked)
response = requests.get('https://api.binance.com/sapi/v1/convert/exchangeInfo?fromAsset=USDT&toAsset=EUR')

# if the API is unavailable we set fixed rate
try:
    if response.status_code != 200:
        rate = 0.92
    else:
        rate = float(response.json()[0]['toAssetMinAmount'])
except Exception:
    rate = 0.92
    
print(f'The exchange rate is 1 USDT = {rate} EUR')

# change rate in the column
df['Price_USDT'] = df['Price_USDT'] * rate

# rename column name from "Price_USDT" to "Price_EUR"
df.rename(columns={'Price_USDT':'Price_EUR'}, inplace=True)

# check your transformed data
df[['Price_EUR']].head()
```

</details>


## 5. Data Normalization

#### Why normalization?

<p><strong><em>Normalization</em></strong> is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.
</p>

<strong>Example:</strong>

<p>To demonstrate normalization, let's say we want to scale the "Open_USDT" column.</p>
<p><strong>Target:</strong> would like to normalize this variable so its value ranges from 0 to 1.</p>

<p><strong>Approach:</strong> replace the original value using one of the following:</p>
<ul>
<li> the formula $ \frac{original\ value}{maximum\ value}$ (for positive data this maps the maximum to 1, but the minimum reaches 0 only if it is 0 itself)</li>
<li> the scikit-learn estimator $ MinMaxScaler $.</li>
</ul>
<p>
The $ MinMaxScaler $ transformation is given by:

$$
X_{std} = \frac{ X - X_{min}}{ X_{max} - X_{min}}, \\
X_{scaled} = X_{std}\times (max - min) + min,
$$
        
<center>where $ min, \ max$ are the lower and upper bounds of the target scaling range.</center>
</p>
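
To compare the two approaches numerically, the sketch below applies both formulas with plain NumPy to a hypothetical array `x`. It mirrors the equations above and does not call scikit-learn.

```python
import numpy as np

x = np.array([5.0, 10.0, 15.0, 25.0])  # hypothetical positive values

# simple scaling: original value / maximum value
simple = x / x.max()

# min-max scaling to the target range [lo, hi]
lo, hi = 0.0, 1.0
x_std = (x - x.min()) / (x.max() - x.min())
x_scaled = x_std * (hi - lo) + lo

print(simple)
print(x_scaled)
```

Both results end at 1, but only the min-max version starts at 0. Dividing by the maximum keeps the smallest value at its original proportion of the maximum.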

<p>Let's scale the "Open_USDT" column by first formula.</p>


In [ ]:
# replace original value by (original value)/(maximum value)
df['Open_USDT'] = df['Open_USDT']/df['Open_USDT'].max()
df[['Open_USDT']].head()

Let us do the same for "Close_USDT" column with <code>MinMaxScaler()</code> estimator using its <code>fit_transform()</code> function. 


In [ ]:
# replace (original value) by estimator MinMaxScaler
scaler = MinMaxScaler()
df[['Close_USDT']] = scaler.fit_transform(df[['Close_USDT']])
df[['Close_USDT']].head()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question #3: </h1>

<b>According to the example above, normalize the column "High_USDT" using the formula $ \frac{original\ value}{maximum\ value}$ and the column "Low_USDT" using the $MinMaxScaler$ estimator.</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute

# normalize by (original value)/(maximum value)
df['High_USDT'] = df['High_USDT']/df['High_USDT'].max() 

#  normalize by MinMaxScaler estimator
df[['Low_USDT']] = scaler.fit_transform(df[['Low_USDT']])

# show the scaled columns
df[['High_USDT', 'Low_USDT']].head()

<details><summary>Click here for the solution</summary>

```python
# normalize by (original value)/(maximum value)
df['High_USDT'] = df['High_USDT']/df['High_USDT'].max() 

#  normalize by MinMaxScaler estimator
df[['Low_USDT']] = scaler.fit_transform(df[['Low_USDT']])

# show the scaled columns
df[['High_USDT', 'Low_USDT']].head()

```

</details>


Here we can see we've normalized "High_USDT" and "Low_USDT" in the range of \[0,1].


## 6. Binning
#### Why binning?
<p>
    <strong><em>Binning</em></strong> is a process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis.
</p>

<strong>Example: </strong>

<p>In our dataset, "Price_EUR" is a real-valued variable. What if we only care about the difference between high, medium, and low prices (3 types)? Can we rearrange the values into three 'bins' to simplify analysis? </p>

<p>We will use the pandas method 'cut' to segment the 'Price_EUR' column into 3 bins.</p>

Firstly, let's convert data to correct format:


In [ ]:
df['Price_EUR'] = df['Price_EUR'].astype(float, copy=True)

Let's plot the histogram of price to see what the distribution of price looks like.


In [ ]:
# plot the distribution
plt.pyplot.hist(df['Price_EUR'])

# set x/y labels and plot title
plt.pyplot.xlabel('Price_EUR, euro')
plt.pyplot.ylabel('Count')
plt.pyplot.title('Price_EUR bins')

<p>We would like 3 bins of equal width, so we use numpy's <code>linspace(start_value, end_value, numbers_generated)</code> function.</p>
<p>Since we want to include the minimum value of price, we want to set <code>start_value = min(df['Price_EUR'])</code>.</p>
<p>Since we want to include the maximum value of price, we want to set <code>end_value = max(df['Price_EUR'])</code>.</p>
<p>Since we are building 3 bins of equal length, there should be 4 dividers, so <code>numbers_generated = 4</code>.</p>


We build a bin array with a minimum value to a maximum value by using the bandwidth calculated above. The values will determine when one bin ends and another begins.
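
On a small hypothetical series the same steps can be followed by hand. The names `prices` and `labels` are made up for this sketch and are independent of the lab dataframe.

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, 12.0, 15.0, 18.0, 19.0, 22.0])  # hypothetical prices

# 4 dividers give 3 bins of equal width: 10, 14, 18, 22
bins = np.linspace(prices.min(), prices.max(), 4)

# assign each value to a bin; include_lowest keeps the minimum in the first bin
labels = ["Low", "Medium", "High"]
binned = pd.cut(prices, bins, labels=labels, include_lowest=True)

print(bins)
print(binned.tolist())
```

Each bin is closed on the right, so the value 18 falls into "Medium" rather than "High".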


In [ ]:
bins = np.linspace(min(df['Price_EUR']), max(df['Price_EUR']), 4)
bins

We set group  names:


In [ ]:
group_names = ['Price_Low', 'Price_Medium', 'Price_High']

We apply the function "cut" to determine which bin each value of `df['Price_EUR']` belongs to.


In [ ]:
df['Price-binned'] = pd.cut(df['Price_EUR'], bins, labels=group_names, include_lowest=True )
df[['Price_EUR', 'Price-binned']].head()

Let's see the number of records in each bin:


In [ ]:
df['Price-binned'].value_counts()

Let's plot the distribution of each bin:


In [ ]:
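# plot the distribution; value_counts() sorts by frequency, so reindex
# the counts to match the order of the labels in group_names
pyplot.bar(group_names, df['Price-binned'].value_counts().reindex(group_names))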

# set x/y labels and plot title
plt.pyplot.xlabel("Price")
plt.pyplot.ylabel("Count")
plt.pyplot.title("Price bins")

<p>
    Look at the dataframe above carefully. You will find that the last column provides the bins for <strong>"Price_EUR"</strong> based on 3 categories: <strong>"Price_Low"</strong>, <strong>"Price_Medium"</strong> and <strong>"Price_High"</strong>.
</p>


#### Bins Visualization

Normally, a histogram is used to visualize the distribution of bins we created above. 


In [ ]:
# draw histogram of attribute "Price_EUR" with bins = 3
plt.pyplot.hist(df["Price_EUR"], bins=3)

# set x/y labels and plot title
plt.pyplot.xlabel("Price, euro")
plt.pyplot.ylabel("Count")
plt.pyplot.title("Price bins")

The plot above shows the binning result for the attribute "Price_EUR".
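
The histogram with `bins=3` matches the manual binning because an integer bin count produces equal-width bins between the minimum and the maximum. The sketch below checks this on hypothetical toy prices with `np.histogram` (the array `prices` is an assumption for illustration). One caveat: histogram bins are closed on the left while `pd.cut` bins are closed on the right by default, so a value lying exactly on an interior edge may be counted in a different bin.

```python
import numpy as np

# Hypothetical toy prices, not taken from the lab's dataset
prices = np.array([10.0, 11.0, 13.0, 16.0])

# An integer bin count yields equal-width bins over [min, max]
counts, hist_edges = np.histogram(prices, bins=3)

# The same edges built by hand, as in the binning step
manual_edges = np.linspace(prices.min(), prices.max(), 4)

print(counts)                                  # [2 1 1]
print(np.allclose(hist_edges, manual_edges))   # True
```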


## 7. Indicator Variable (or Dummy Variable)
#### What is an indicator variable?
<p>
    <strong><em>An indicator variable (or dummy variable)</em></strong> is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning. 
</p>

#### Why do we use indicator variables?

<p>
    We use indicator variables so we can use categorical variables for regression analysis in the later modules.
</p>

<strong>Example:</strong>
<p>
    We see the column <code>"Price-binned"</code> has three unique values: <code>"Price_Low"</code>, <code>"Price_Medium"</code> or <code>"Price_High"</code>. Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert <code>"Price-binned"</code> to indicator variables.
</p>

<p>
    We will use the pandas method <code>get_dummies()</code> to assign numerical values to the different categories of <code>"Price-binned"</code>.
</p>


Get the indicator variables and assign them to the data frame <code>dummy_variable_1</code>:


In [ ]:
dummy_variable_1 = pd.get_dummies(df['Price-binned'])
dummy_variable_1.head()

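In the resulting dataframe, each category of <code>'Price-binned'</code> (<code>'Price_Low'</code>, <code>'Price_Medium'</code> and <code>'Price_High'</code>) now has its own column, whose values mark whether a row belongs to that category (shown as 0/1 or False/True, depending on the pandas version).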
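
If numeric 0/1 columns are needed regardless of the pandas version, the `dtype` argument of `get_dummies()` can be set explicitly. A minimal sketch on a hypothetical toy column (the series `binned` is an assumption for illustration, not the lab's data):

```python
import pandas as pd

# Hypothetical toy column of bin labels, not taken from the lab's dataset
binned = pd.Series(['Price_Low', 'Price_High', 'Price_Medium', 'Price_Low'])

# dtype=int requests 0/1 integers; recent pandas versions return booleans by default
dummies = pd.get_dummies(binned, dtype=int)

print(dummies['Price_Low'].tolist())   # [1, 0, 0, 1]
print(dummies.sum(axis=1).tolist())    # [1, 1, 1, 1]
```

Every row sums to 1 because each value belongs to exactly one category.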


In [ ]:
# merge data frame "df" and "dummy_variable_1"
df = pd.concat([df, dummy_variable_1], axis=1)

# drop original column "Price-binned" from "df"
df.drop('Price-binned', axis=1, inplace=True)
df[['Price_EUR', 'Price_Low', 'Price_Medium', 'Price_High']].head()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #4: </h1>

<b>Similar to before, create bins for the column "Open_USDT" with three group names. Plot the distribution of each bin.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute

# create bins
bins = np.linspace(df['Open_USDT'].min(), df['Open_USDT'].max(), 4)
group_names = ['Open_Low', 'Open_Medium', 'Open_High']

# determine which bin each value of df['Open_USDT'] belongs to
df['Open_binned'] = pd.cut(df['Open_USDT'], bins, labels=group_names, include_lowest=True)

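# plot the distribution; reindex keeps the counts in the same order as group_names
pyplot.bar(group_names, df['Open_binned'].value_counts().reindex(group_names))

# draw histogram of attribute "Open_USDT" with bins = 3 on a separate figure
plt.pyplot.figure()
plt.pyplot.hist(df['Open_USDT'], bins=3)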

# set x/y labels and plot title
plt.pyplot.xlabel('Open')
plt.pyplot.ylabel('Count')
plt.pyplot.title('Open bins')

<details><summary>Click here for the solution</summary>

```python
# create bins
bins = np.linspace(df['Open_USDT'].min(), df['Open_USDT'].max(), 4)
group_names = ['Open_Low', 'Open_Medium', 'Open_High']

# determine which bin each value of df['Open_USDT'] belongs to
df['Open_binned'] = pd.cut(df['Open_USDT'], bins, labels=group_names, include_lowest=True)

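# plot the distribution; reindex keeps the counts in the same order as group_names
pyplot.bar(group_names, df['Open_binned'].value_counts().reindex(group_names))

# draw histogram of attribute "Open_USDT" with bins = 3 on a separate figure
plt.pyplot.figure()
plt.pyplot.hist(df['Open_USDT'], bins=3)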

# set x/y labels and plot title
plt.pyplot.xlabel('Open')
plt.pyplot.ylabel('Count')
plt.pyplot.title('Open bins')

```

</details>


## 8. Resampling
#### What is resampling?
<p>
    <strong><em>Resampling</em></strong> is a crucial method for time series analysis that lets you choose the desired level of data resolution. You can either upsample, which increases the number of data points (for example, transforming 5-minute data into 1-minute data), or downsample, which decreases it (for example, transforming 1-minute data into 10-minute data).
</p>
<p>
    The basic syntax for resampling in pandas is the <code>dataframe.resample('desired resolution')</code> method. Different aggregation functions can then be applied to the result.
</p>
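
As a minimal sketch of upsampling, the example below turns hypothetical 5-minute closing prices into 1-minute data (the series `close_5min` is an assumption for illustration). Forward filling is only one possible way to populate the new rows; interpolation would be another.

```python
import pandas as pd

# Hypothetical 5-minute closing prices, not taken from the lab's dataset
idx = pd.date_range('2022-11-11 00:00', periods=3, freq='5min')
close_5min = pd.Series([100.0, 101.0, 99.0], index=idx)

# Upsample to 1-minute resolution; new rows carry the last known value forward
close_1min = close_5min.resample('1min').ffill()

print(len(close_1min))      # 11
print(close_1min.iloc[4])   # 100.0 (00:04 still carries the 00:00 value)
print(close_1min.iloc[5])   # 101.0 (00:05 is an original observation)
```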
<p>
    Start by downsampling the series from 1-minute into 10-minute bins. Considering the semantics of our dataset, for the column <strong>"Open_USDT"</strong> we take <em>the first value</em> of a 10-minute interval, while for <strong>"Close_USDT"</strong> we take <em>the last value</em>. For <strong>"High_USDT"</strong> we take the <em>maximum value</em> within a 10-minute interval, and for <strong>"Low_USDT"</strong> the <em>minimum value</em>. The column <strong>"Volume_USDT"</strong> will store <em>the sum of all values</em> within a 10-minute interval, and <strong>"Price_EUR"</strong> will store their <em>mean</em>.
</p>
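
The per-column aggregation rules can be sketched on a small synthetic frame before applying them to the real data. The column names and values below are hypothetical and chosen only so the result is easy to check by eye:

```python
import pandas as pd

# Hypothetical 1-minute bars, not taken from the lab's dataset
idx = pd.date_range('2022-11-11 00:00', periods=20, freq='1min')
toy = pd.DataFrame({
    'Open': range(100, 120),
    'High': range(101, 121),
    'Low': range(99, 119),
    'Close': range(100, 120),
    'Volume': [1.0] * 20,
}, index=idx)

# Each column gets the aggregation that preserves its meaning
bars_10min = toy.resample('10min').agg({
    'Open': 'first',    # price at the start of the interval
    'High': 'max',      # highest price within the interval
    'Low': 'min',       # lowest price within the interval
    'Close': 'last',    # price at the end of the interval
    'Volume': 'sum',    # total traded volume within the interval
})

print(bars_10min)
```

The 20 one-minute rows collapse into two 10-minute rows. Note that `resample` requires a datetime-like index, which is why the toy frame is indexed by `pd.date_range`.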


Firstly, let's copy the initial dataframe:


In [ ]:
df_resampled = df[['Open_USDT', 'High_USDT', 'Low_USDT', 'Close_USDT', 'Volume_USDT', 'Price_EUR']].copy()

Now we are ready to perform resampling.


In [ ]:
df_resampled.loc[:, 'Open_USDT':'Price_EUR'].resample("10min").agg({
    'Price_EUR': 'mean',
    'Volume_USDT': 'sum',
    'Open_USDT': 'first',
    'High_USDT': 'max',
    'Low_USDT': 'min',
    'Close_USDT': 'last',
}).head()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #5 a): </h1>

<b>Similar to before, downsample the <code>df_resampled</code> into 1-day bins.</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 

df_resampled = df_resampled.loc[:, 'Open_USDT':'Price_EUR'].resample("1D").agg({
    'Price_EUR': 'mean',
    'Volume_USDT': 'sum',
    'Open_USDT': 'first',
    'High_USDT': 'max',
    'Low_USDT': 'min',
    'Close_USDT': 'last',
})

df_resampled.head()

<details><summary>Click here for the solution</summary>

```python
# downsample the series into 1-day bins.
df_resampled = df_resampled.loc[:, 'Open_USDT':'Price_EUR'].resample("1D").agg({
    'Price_EUR': 'mean',
    'Volume_USDT': 'sum',
    'Open_USDT': 'first',
    'High_USDT': 'max',
    'Low_USDT': 'min',
    'Close_USDT': 'last',
})

df_resampled.head()
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #5 b): </h1>

<b>Similar to before, create an indicator variable for the index column "Time" by days of the week using <code>df_resampled.index.day_name()</code>.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute

# get indicator variables of day and assign it to dataframe "dummy_variable_2"
dummy_variable_2 = pd.get_dummies(df_resampled.index.day_name())
dummy_variable_2.set_index(df_resampled.index, inplace=True)

# show first 5 instances of data frame "dummy_variable_2"
dummy_variable_2.head()

<details><summary>Click here for the solution</summary>

```python
# get indicator variables of day and assign it to dataframe "dummy_variable_2"
dummy_variable_2 = pd.get_dummies(df_resampled.index.day_name())
dummy_variable_2.set_index(df_resampled.index, inplace=True)

# show first 5 instances of data frame "dummy_variable_2"
dummy_variable_2.head()

```

</details>


Great job! We have successfully reached the end.


### Thank you for completing this lab!

## Authors

<a href="https://author.skills.network/instructors/yaryna_beida?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08LZES2357-2023-01-01">Yaryna Beida</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08LZES2357-2023-01-01">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/mariya_fleychuk?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08LZES2357-2023-01-01">Prof. Mariya Fleychuk, DrSc, PhD</a>


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                  |
| ----------------- | ------- | ---------- | ----------------------------------- |
|     2023-03-04    |   1.0   |Yaryna Beida| Lab created                         |

<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
