<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo"  />
</center>

# Financial Services: Lab 2. Dataset wrangling (on the example of MATIC/BUSD and several technical indicators:  ADOSC, NATR, TRANGE)

The tasks:
* Find empty cells and handle missing values;
* Analyze the data format, find wrong formats and correct them;
* Standardize and normalize data series.

Estimated time needed: **30** minutes

## Objectives

After completing this lab you will be able to:

*   Handle missing values
*   Correct data format
*   Standardize and normalize data


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ul>
    <li>Identify and handle missing values
        <ul>
            <li>Identify missing values</li>
            <li>Deal with missing values</li>
            <li>Correct data format</li>
        </ul>
    </li>
    <li>Data standardization</li>
    <li>Data normalization (centering/scaling)</li>
    <li>Binning</li>
    <li>Indicator variable</li>
    <li>Resample data</li>
</ul>

</div>

<hr>


## What is the purpose of data wrangling?


Data wrangling is the process of converting data from the initial format to a format that may be better for analysis.


### What is the average price of MATIC in different currencies?

### Import data
<p>
You can find the "MATICBUSD trades Dataset" from the following link:
<a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0B59EN/labs/MATICBUSD_trades_1m%20(1).csv" target="_blank">https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0B59EN/labs/MATICBUSD_trades_1m%20(1).csv</a>

We will be using this dataset throughout this course.
</p>


If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:


In [ ]:
# Install the libraries used in this lab (uncomment the lines below if you run the lab locally with Anaconda)
# ! conda install pandas -y
# ! conda install numpy -y

In [ ]:
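import pandas as pd
import numpy as np

%matplotlib inline
# NOTE: "plt" is bound to the top-level matplotlib package here,
# so plotting calls in this lab are written as plt.pyplot.<function>(...)
import matplotlib as plt
from matplotlib import pyplot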

#set values precision as 6
pd.set_option("display.precision", 6)



## Reading the dataset from the URL and adding the related headers


First, we assign the URL of the dataset to "filename".


This dataset is hosted on IBM Cloud Object Storage. Click <a href="https://cocl.us/corsera_da0101en_notebook_bottom?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01">HERE</a> for free storage.


In [ ]:
filename = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0B59EN/labs/MATICBUSD_trades_1m%20(1).csv"

Then, we create a Python list <b>headers</b> containing the column names.


In [ ]:
headers = ["Ts", "Open", "High", "Low", "Close", "Volume", "Avg_price", "ADOSC", "TRANGE", "NATR"]

Use the Pandas method <b>read_csv()</b> to load the data from the web address. Set the parameter  "names" equal to the Python list "headers".


In [ ]:
df = pd.read_csv(filename, index_col = 0, names = headers, skiprows = 1)

Use the method <b>head()</b> to display the first five rows of the dataframe.


In [ ]:
# To see what the data set looks like, we'll use the head() method.
df.index = df.index.astype("datetime64[ns]")
df.head()

As we can see, several question marks appeared in the dataframe; those are missing values which may hinder our further analysis.


<div>So, how do we identify all those missing values and deal with them?</div> 


<b>How to work with missing data?</b>

Steps for working with missing data:

<ol>
    <li>Identify missing data</li>
    <li>Deal with missing data</li>
    <li>Correct data format</li>
</ol>


## Identify and handle missing values

### Identify missing values
#### Convert missed or wrong data to NaN
In the trades dataset, missing or invalid data appears as a question mark "?", as text, or as negative values.
We replace these with NaN (Not a Number), the default missing-value marker in pandas, for reasons of computational speed and convenience. Here we use the function:
 <pre>.replace(A, B, inplace = True) </pre>
 to replace A with B, or
 <pre>.mask(condition, new_value, inplace = True)</pre>
 to replace the values where a condition holds.
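A minimal sketch of both calls on a small made-up frame. The values are illustrative and are not taken from the MATIC/BUSD dataset; it assumes that "?" marks missing text and that negative volumes are invalid. The `pd.to_numeric` call is an extra step added here so the cleaned column becomes numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "?" and a negative volume stand in for bad values.
toy = pd.DataFrame({"Close": ["0.91", "?", "0.93"],
                    "Volume": [120.0, -5.0, 80.0]})

# replace(A, B): turn the "?" marker into NaN, then convert the text to numbers.
toy["Close"] = pd.to_numeric(toy["Close"].replace("?", np.nan))

# mask(condition, new_value): set values to NaN wherever the condition holds.
toy["Volume"] = toy["Volume"].mask(toy["Volume"] < 0, np.nan)

print(toy)
```

Assigning the result back to the column, instead of passing `inplace = True`, keeps the example independent of how a given pandas version handles in-place changes on a selected column.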


#### Evaluating for Missing Data

Once the missing values are marked as NaN, we can identify them. There are two methods to detect missing data:

<ol>
    <li><b>.isnull()</b></li>
    <li><b>.notnull()</b></li>
</ol>
The output is a boolean value for each cell, indicating whether that value is in fact missing data.


In [ ]:
missing_data = df.isnull()
missing_data.head()

"True" means the value is a missing value while "False" means the value is not a missing value.


#### Count missing values in each column
<p>
Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, "True" represents a missing value and "False" means the value is present in the dataset. In the body of the for loop, the method ".value_counts()" counts how many "True" and "False" values each column contains.
</p>


In [ ]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")    
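As a shorter alternative to the loop, the boolean frame can be summed column by column, because `True` counts as 1. This is a sketch on a made-up frame; the column names are borrowed from the lab but the values are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps.
toy = pd.DataFrame({"Close": [0.91, np.nan, 0.93, np.nan],
                    "NATR": [np.nan, 1.2, 1.3, 1.4]})

# isnull() gives True for each missing cell; sum() adds them up per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```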

Based on the summary above, each column has 50891 rows of data and eight of the columns contain missing data, for example:

<ol>
    <li>"Close" : 2597 missing data</li>
    <li>"ADOSC":  9 missing data</li>
    <li>"NATR": 10440 missing data</li>
</ol>


### Deal with missing data
<b>How to deal with missing data?</b>

<ol>
    <li>Drop data<br>
        a. Drop the whole row<br>
        b. Drop the whole column
    </li>
    <li>Replace data<br>
        a. Replace it by mean<br>
        b. Replace it by frequency<br>
        c. Replace it based on other functions
    </li>
</ol>


Whole columns should be dropped only if most entries in the column are empty. In our dataset, none of the columns are empty enough to drop entirely.
We have some freedom in choosing the method used to replace data; however, some methods may seem more reasonable than others.

Our data is a time series, so dropping rows would leave gaps in the timeline and reduce data quality for future analysis. Replacing missing values with the column mean is not a good option either, because it ignores how the price changes over time.
So we will use interpolation (and also delete some rows) to restore it.
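To make the difference concrete, here is a small sketch comparing a mean fill with linear interpolation on a made-up one-minute price series (the numbers are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical one-minute price series with a two-row gap.
idx = pd.date_range("2023-01-01 00:00", periods=5, freq="min")
price = pd.Series([1.00, np.nan, np.nan, 1.30, 1.40], index=idx)

# Mean fill: every gap gets the same value, regardless of its neighbours.
filled_mean = price.fillna(price.mean())

# Linear interpolation: gaps are filled along the line between known neighbours.
filled_linear = price.interpolate(method="linear")

print(pd.DataFrame({"mean": filled_mean, "linear": filled_linear}))
```

The interpolated values follow the upward movement between the known points, while the mean fill inserts one constant value into a rising series.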

In [ ]:
#set output precision on 6 digits
pd.set_option("display.precision", 6)

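#### Replace "NaN" with the linearly interpolated value in the "Avg_price" column


In [ ]:
# assign the result back so the change does not depend on in-place edits of a selected column
df["Avg_price"] = df["Avg_price"].interpolate(method='linear')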
df.head()

Now let's check whether the "Avg_price" column still has NaN values:

In [ ]:
missing_data = df.isnull()
print("Avg_price")
print (missing_data["Avg_price"].value_counts())

We can see there are no missing values anymore.
Now we can fix the missing values in the indicator columns. The first 10 rows of the ADOSC and NATR columns were used to calculate the subsequent values, so let's drop them.

In [ ]:
df = df.drop(df.index[range(10)])
df.head(15)

Now we are going to fix the remaining NaN values in the ADOSC, TRANGE and NATR columns. We will replace them with interpolated values, using different types of interpolation.

In [ ]:
# assign each result back to its column instead of using inplace=True
df["ADOSC"] = df["ADOSC"].interpolate(method='nearest')
df["TRANGE"] = df["TRANGE"].interpolate(method='quadratic')
df["NATR"] = df["NATR"].interpolate(method='cubic')

indicators = ["ADOSC", "TRANGE", 'NATR']

missing_data = df.isnull()
for indicator in indicators:
    print(indicator)
    print(missing_data[indicator].value_counts())
    print("")

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #1: </h1>

<b>Based on the example above, replace NaN in "Close" column with the linear interpolation value.</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python

# replace NaN with the linearly interpolated value in the "Close" column
df["Close"] = df["Close"].interpolate(method='linear')
df.head(5)

```
</details>


We can also use other interpolation types, such as polynomial. With that type of interpolation we need to choose the order; for example, we use 5:

In [ ]:
df["High"] = df["High"].interpolate(method='polynomial', order=5)

The replacement result is very similar to what we have seen previously with other methods


In [ ]:
missing_data = df.isnull()
print("High")
print (missing_data["High"].value_counts())

No more missing values in the "High" column.
Now we can fill the NaN values in the other columns:
<ul><li>The "Low" column we will fill using the same interpolation that we used for the "ADOSC" column</li></ul>

In [ ]:
df["Low"] = df["Low"].interpolate(method='nearest')

<ul><li>The "Open" column we will fill like the "High" one, but now we will use a spline type of interpolation</li></ul>

In [ ]:
# replace NaN with the spline-interpolated value (order 3) in the "Open" column
df["Open"] = df["Open"].interpolate(method='spline', order=3)

Check that all missing data has been fixed:

In [ ]:
missing_data = df.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")    

<b>Good!</b> Now, we have a dataset with no missing values.


### Correct data format
<b>We are almost there!</b>
<p>The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).</p>

In Pandas, we use:

<p><b>.dtypes</b> to check the data types</p>
<p><b>.astype()</b> to change the data type</p>


<h4>Let's list the data types for each column</h4>


In [ ]:
df.dtypes

<p>As we can see above, some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. We have to convert data types into a proper format for each column using the "astype()" method.</p> 


<h4>Convert data types to proper format</h4>


In [ ]:
df[["Volume"]] = df[["Volume"]].astype("int")

<h4>Let us list the columns after the conversion</h4>


In [ ]:
df.dtypes
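A short sketch of type conversion on a made-up frame. It assumes numeric columns were read as text; `pd.to_numeric` with `errors="coerce"` is not used in this lab and is shown only as one option for columns that still contain non-numeric text.

```python
import pandas as pd

# Hypothetical toy data: numbers stored as text, one entry is not a number.
toy = pd.DataFrame({"Volume": ["120", "85", "n/a"],
                    "Close": ["0.91", "0.92", "0.93"]})

# astype() works when every entry can be converted.
toy["Close"] = toy["Close"].astype("float")

# to_numeric(errors="coerce") converts what it can and turns the rest into NaN.
toy["Volume"] = pd.to_numeric(toy["Volume"], errors="coerce")

print(toy.dtypes)
```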

<b>Wonderful!</b>

Now we have finally obtained the cleaned dataset, with no missing values and with all data in its proper format.


In [ ]:
df.to_csv('MATICBUSD_lab3.csv')

## Data Standardization
<p>
Data is usually collected from different agencies in different formats.
(Data standardization is also a term for a particular type of data normalization where we subtract the mean and divide by the standard deviation.)
</p>

<b>What is standardization?</b>

<p>Standardization is the process of transforming data into a common format, allowing the researcher to make meaningful comparisons.
</p>

<b>Example</b>

<p>Transform BUSD to EUR:</p>
<p>In our dataset, the columns with price values are expressed in BUSD. Assume we are developing an application in a country that expects price values in EUR.</p>
<p>We will need to apply <b>data transformation</b> to transform BUSD into EUR.</p>


In [ ]:
df.head()

In [ ]:
import requests
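# Get the BUSD -> EUR exchange rate from the Binance API; use a fixed fallback rate if the request fails
res = requests.get("https://api.binance.com/sapi/v1/convert/exchangeInfo?fromAsset=BUSD&toAsset=EUR")
if res.status_code != 200:
    rate = 0.93
else:
    res = res.json()
    # NOTE: the field name "toAssetMinAmount" suggests a minimum amount rather than a rate,
    # so check the printed value before relying on it
    rate = float(res[0]["toAssetMinAmount"])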
    
print(f"The exchange rate is 1 BUSD = {rate} EUR")

cols_to_convert = ["Open", "High", "Low", "Close", "Avg_price"]
for col in cols_to_convert:
    df[f"{col}_EUR"] = df[col] * rate

# check your transformed data 
df.head()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #2: </h1>

<b>According to the example above, transform BUSD to GBP in the "Avg_price" column and name the new column appropriately.</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
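# Convert BUSD to GBP by mathematical operation

import requests
# Get the BUSD -> GBP exchange rate from the Binance API; use a fixed fallback rate if the request fails
res = requests.get("https://api.binance.com/sapi/v1/convert/exchangeInfo?fromAsset=BUSD&toAsset=GBP")
if res.status_code != 200:
    rate = 0.83
else:
    res = res.json()
    # NOTE: the field name "toAssetMinAmount" suggests a minimum amount rather than a rate,
    # so check the printed value before relying on it
    rate = float(res[0]["toAssetMinAmount"])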
    
print(f"The exchange rate is 1 BUSD = {rate} GBP")

cols_to_convert = ["Open", "High", "Low", "Close", "Avg_price"]
for col in cols_to_convert:
    df[f"{col}_GBP"] = df[col] * rate

# check your transformed data 
df.head()

```

</details>


## Data Normalization

<b>Why normalization?</b>

<p>Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.
</p>

<b>Example</b>

<p>To demonstrate normalization, let's say we want to scale the columns "Open" and "Close".</p>
<p><b>Target:</b> we would like to normalize those variables so their values range from 0 to 1</p>
<p><b>Approach:</b> replace original value by (original value)/(maximum value)</p>


In [ ]:
df.head(10)

In [ ]:
# replace (original value) by (original value)/(maximum value)
df['Open'] = df['Open']/df['Open'].max()
df['Close'] = df['Close']/df['Close'].max()
df.head(10)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question #3: </h1>

<b>According to the example above, normalize the column "Avg_price".</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df['Avg_price'] = df['Avg_price']/df['Avg_price'].max() 

# show the scaled columns
df[["Open","Close","Avg_price"]].head()


```

</details>


Here we can see we've normalized "Open", "Close" and "Avg_price" in the range of \[0,1].
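The typical normalizations named at the start of this section can be sketched side by side. The series below is made up for illustration; `simple_scaled` is the approach used in this lab, while `min_max` and `z_score` are the other common variants.

```python
import pandas as pd

# Hypothetical toy series.
x = pd.Series([2.0, 4.0, 6.0, 8.0])

# Simple feature scaling (used in this lab): divide by the maximum.
simple_scaled = x / x.max()

# Min-max scaling: values range exactly from 0 to 1.
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score (standardization): mean 0 and standard deviation 1.
z_score = (x - x.mean()) / x.std()

print(pd.DataFrame({"simple": simple_scaled, "min_max": min_max, "z_score": z_score}))
```

Note that dividing by the maximum maps the largest value to 1 but maps the smallest value to 0 only if that value is 0; min-max scaling always uses the full 0 to 1 range.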


## Binning
<b>Why binning?</b>
<p>
    Binning is a process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis.
</p>

<b>Example: </b>

<p>In our dataset, "Avg_price" is a real-valued variable ranging from 0.7661862 to 1.064339, and it has a lot of unique values. What if we only care about the difference between MATIC prices in the high, medium and low Avg_price ranges (3 types)? Can we rearrange them into three 'bins' to simplify the analysis? </p>

<p>We will use the pandas method 'cut' to segment the 'Avg_price' column into 3 bins.</p>


### Example of Binning Data In Pandas


Convert data to correct format:


In [ ]:
df["Avg_price"]=df["Avg_price"].astype(float, copy=True)

Let's plot the histogram of Avg_price to see what the distribution of Avg_price looks like.


In [ ]:
plt.pyplot.hist(df["Avg_price"])

# set x/y labels and plot title
plt.pyplot.xlabel("Avg_price")
plt.pyplot.ylabel("count")
plt.pyplot.title("Avg_price bins")

<p>We would like 3 bins of equal width, so we use numpy's <code>linspace(start_value, end_value, numbers_generated)</code> function.</p>
<p>Since we want to include the minimum value of Avg_price, we want to set start_value = min(df["Avg_price"]).</p>
<p>Since we want to include the maximum value of Avg_price, we want to set end_value = max(df["Avg_price"]).</p>
<p>Since we are building 3 bins of equal length, there should be 4 dividers, so numbers_generated = 4.</p>


We build a bin array with a minimum value to a maximum value by using the bandwidth calculated above. The values will determine when one bin ends and another begins.


In [ ]:
bins = np.linspace(min(df["Avg_price"]), max(df["Avg_price"]), 4)
bins

We set group  names:


In [ ]:
group_names = ['Low', 'Medium', 'High']

We apply the function "cut" to determine which bin each value of `df['Avg_price']` belongs to.


In [ ]:
df['Avg_price-binned'] = pd.cut(df['Avg_price'], bins, labels=group_names, include_lowest=True )
df[['Avg_price','Avg_price-binned']].head(20)

Let's see the number of data in each bin:


In [ ]:
df["Avg_price-binned"].value_counts()
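The same binning steps can be checked on a small made-up series, where the bin of each value is easy to verify by hand. The prices are illustrative and chosen to stay away from the bin edges.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices between 0.80 and 1.10.
prices = pd.Series([0.80, 0.82, 0.95, 0.97, 0.99, 1.10])

# 4 equally spaced edges define 3 bins of equal width (about 0.8, 0.9, 1.0, 1.1).
bins = np.linspace(prices.min(), prices.max(), 4)
labels = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum value inside the first bin.
binned = pd.cut(prices, bins, labels=labels, include_lowest=True)

# value_counts() sorts by count, so reindex to get Low, Medium, High order.
counts = binned.value_counts().reindex(labels)
print(counts)
```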

Let's plot the distribution of each bin:


In [ ]:
# value_counts() sorts by count, so reindex to match the order of group_names
pyplot.bar(group_names, df["Avg_price-binned"].value_counts().reindex(group_names))

# set x/y labels and plot title
plt.pyplot.xlabel("Avg_price")
plt.pyplot.ylabel("count")
plt.pyplot.title("Avg_price bins")

<p>
    Look at the dataframe above carefully. You will find that the last column provides the bins for "Avg_price" based on 3 categories ("Low", "Medium" and "High"). 
</p>
<p>
    We successfully narrowed down the intervals to only 3!
</p>


### Bins Visualization
Normally, a histogram is used to visualize the distribution of bins we created above. 


In [ ]:
# draw histogram of attribute "Avg_price" with bins = 3
plt.pyplot.hist(df["Avg_price"], bins = 3)

# set x/y labels and plot title
plt.pyplot.xlabel("Avg_price")
plt.pyplot.ylabel("count")
plt.pyplot.title("Avg_price bins")

The plot above shows the binning result for the attribute "Avg_price".


## Indicator Variable (or Dummy Variable)
<b>What is an indicator variable?</b>
<p>
    An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning. 
</p>

<b>Why do we use indicator variables?</b>

<p>
    We use indicator variables so we can use categorical variables for regression analysis in the later modules.
</p>
<b>Example</b>
<p>
    We see the column "Avg_price-binned" has three unique values: "Low", "Medium" or "High". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert "Avg_price-binned" to indicator variables.
</p>

<p>
    We will use pandas' method 'get_dummies' to assign numerical values to different categories of Avg_price. 
</p>


In [ ]:
df.columns

Get the indicator variables and assign it to data frame "dummy_variable\_1":


In [ ]:
dummy_variable_1 = pd.get_dummies(df["Avg_price-binned"])
dummy_variable_1.head(5)

Change the column names for clarity:


In [ ]:
dummy_variable_1.rename(columns={'Low':'Avg_price-Low', 'Medium':'Avg_price-Medium', 'High':'Avg_price-High'}, inplace=True)
dummy_variable_1.head()

In the dataframe, the categories 'Low', 'Medium' and 'High' of column 'Avg_price-binned' are now represented as 0s and 1s.


In [ ]:
# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis=1)

# drop original column "Avg_price" from "df"
df.drop("Avg_price", axis = 1, inplace=True)

In [ ]:
df.head()

The last three columns are now the indicator variable representation of the Avg_price variable. They're all 0s and 1s now.
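A compact sketch of the same idea on a made-up categorical column. The column name `price_bin` is hypothetical; the `prefix` and `dtype` arguments are used here so the new columns are named in one step and hold integer 0/1 values.

```python
import pandas as pd

# Hypothetical toy data with one categorical column.
toy = pd.DataFrame({"price_bin": ["Low", "High", "Medium", "Low"]})

# One 0/1 column per category; columns are named <prefix>_<category>.
dummies = pd.get_dummies(toy["price_bin"], prefix="price_bin", dtype=int)

# Attach the indicator columns to the original frame.
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

Each row has exactly one 1 across the indicator columns, which marks the category that row belongs to.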


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #4: </h1>

<b>Similar to before, create an indicator variable for the column "Ts"</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
# get day name from the "Ts" timestamps (stored as the dataframe index)
df['Day_name'] = df.index.day_name()

# get indicator variables of "Day_name" and assign them to data frame "dummy_variable_2"
dummy_variable_2 = pd.get_dummies(df['Day_name'])

# show first 5 instances of data frame "dummy_variable_2"
dummy_variable_2.head()


```

</details>


## Resample time series data


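Resampling a time series means changing the frequency of its observations. Downsampling aggregates many observations into fewer, longer intervals (for example, 1-minute trades into daily or weekly bars), while upsampling moves to a higher frequency and usually requires filling or interpolating the new points. In pandas, this is done with the `resample()` method followed by an aggregation such as `first()`, `max()`, `min()`, `last()` or `sum()`.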

In [ ]:
#create new dataset
wdf = pd.DataFrame()
df.index = df.index.astype("datetime64[ns]")

Build a weekly summary of the price and volume columns:

In [ ]:
wdf['Open'] = df['Open'].resample('W').first()
wdf['High'] = df['High'].resample('W').max()
wdf['Low'] = df['Low'].resample('W').min()
wdf['Close'] = df['Close'].resample('W').last()
wdf['Volume'] = df['Volume'].resample('W').sum()

In [ ]:
wdf.head()
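The aggregation can be checked by hand on a tiny made-up series that crosses midnight. This sketch assumes a datetime index and uses `agg()` with one rule per column, which is an alternative to assigning each column separately.

```python
import pandas as pd

# Hypothetical one-minute data: two rows on 1 January, two rows on 2 January.
idx = pd.date_range("2023-01-01 23:58", periods=4, freq="min")
toy = pd.DataFrame({"Close": [1.00, 1.02, 1.01, 1.05],
                    "Volume": [10, 20, 30, 40]}, index=idx)

# Downsample to daily bars: last close of the day, total volume of the day.
daily = toy.resample("D").agg({"Close": "last", "Volume": "sum"})
print(daily)
```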

 <div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #5: </h1>

<b>Make a daily summary of the "Open", "High", "Low", "Close" and "Volume" columns.</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 



<details><summary>Click here for the solution</summary>

```python
nwdf = pd.DataFrame()
nwdf['Open'] = df['Open'].resample('D').first()
nwdf['High'] = df['High'].resample('D').max()
nwdf['Low'] = df['Low'].resample('D').min()
nwdf['Close'] = df['Close'].resample('D').last()
nwdf['Volume'] = df['Volume'].resample('D').sum()
nwdf.head()

```

</details>


Save the new csv:

In [ ]:
wdf.to_csv('clean_df.csv')

> Note: The csv file cannot be viewed in the JupyterLite-based SN Labs environment. However, you can click <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Module%202/DA0101EN-2-Review-Data-Wrangling.ipynb?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2022-01-01">HERE</a> to download the lab notebook (.ipynb) to your local machine and view the csv file once the notebook is executed.


# **Thank you for completing Lab 2!**

## Authors

<a href="https://author.skills.network/instructors/oleh_lozovyi?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Oleh Lozovyi</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/mariya_fleychuk?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Mariya Fleychuk, DrSc, PhD</a>

<a href="https://www.linkedin.com/in/joseph-s-50398b136/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01">Joseph Santarcangelo</a>


<a href="https://www.linkedin.com/in/fiorellawever/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01">Fiorella Wenver</a>

<a href="https://www.linkedin.com/in/yi-leng-yao-84451275/" target="_blank">Yi Yao</a>


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By   | Change Description                                         |
| ----------------- | ------- | -------------| ---------------------------------------------------------- |
|     2023-03-08    |   1.0   | Oleh Lozovyi | Lab created                                                |

<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>



