<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo"  />
</center>

# Investigation of cryptocurrency exchange rate dynamic (Matic/USD), сalculation and analysis of technical financial indicators, characterizing the cryptocurrency market (ATR, OBV, RSI, AD)


# Lab.2. Dataset wrangling (Matic/USD)

Estimated time needed: **30** minutes

## Objectives

After completing this lab you will be able to:

*   Handle missing values
*   Correct data format
*   Standardize and normalize data

### **Columns**

* #### `Ts` - the timestamp of the record
* #### `Open` -  the price of the asset at the beginning of the trading period
* #### `High` -  the highest price of the asset during the trading period
* #### `Low` - the lowest price of the asset during the trading period.
* #### `Close` - the price of the asset at the end of the trading period
* #### `Volume` - the total number of shares or contracts of a particular asset that are traded during a given period
* #### `Rec_count` -  the number of individual trades or transactions that have been executed during a given time period
* #### `Avg_price` - the average price at which a particular asset has been bought or sold during a given period
* #### `ATR` - average true range indicator
* #### `OBV` - on-balance volume indicator
* #### `RSI` - relative strength index indicator
* #### `AD` - accumulation / distribution indicator
* #### `BTC_price` - the avarage price from BTC/BUSD dataset 
* #### `BNB_price` - the avarage price from BNB/BUSD dataset
* ##### Additional columns:  'Open_EUR', 'BTC_price_EUR', 'high_EUR', 'High_Normalized', 'Low_Normalized','close_low', 'close_medium', 'close_high', 'rec_count-binned','rec_count_low', 'rec_count_medium', 'rec_count_high'

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ul>
    <li><a href="https://#identify_handle_missing_values">Identify and handle missing values</a>
        <ul>
            <li><a href="https://#identify_missing_values">Identify missing values</a></li>
            <li><a href="https://#deal_missing_values">Deal with missing values</a></li>
            <li><a href="https://#correct_data_format">Correct data format</a></li>
        </ul>
    </li>
    <li><a href="https://#data_standardization">Data standardization</a></li>
    <li><a href="https://#data_normalization">Data normalization (centering/scaling)</a></li>
    <li><a href="https://#binning">Binning</a></li>
    <li><a href="https://#indicator">Indicator variable</a></li>
</ul>

</div>

<hr>


## What is the purpose of data wrangling?


Data wrangling is the process of converting data from the initial format to a format that may be better for analysis.


### What is the price of BTC converted to USDT ?


### Import data
<p>
<li>Data source: <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0Z9BEN/Lab1DataSet.csv" target="_blank">https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0Z9BEN/Lab1DataSet.csv</a></li> 
We will be using this dataset throughout this course.
In this lab, we'll need to use a dataset that we created in the Lab 1. Investigation of cryptocurrency exchange rate dynamic (BTC/USD), сalculation and analysis of technical financial indicators, characterizing the cryptocurrency market (ATR, OBV, RSI, AD).
</p>


#### Import pandas


If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:


In [ ]:
#If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:
#install specific version of libraries used in lab
#! mamba install pandas==1.3.3
#! mamba install numpy=1.21.2
! conda install scikit-learn -y 

In [ ]:
import pandas as pd
import matplotlib.pylab as plt
from matplotlib import pyplot
import numpy as np
import pandas as pd 
import numpy as np 
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error 
import requests
%matplotlib inline


#set precision 
pd.set_option("display.precision", 2)
#set precision for float
pd.options.display.float_format = '{:.2f}'.format



## Reading the dataset from the URL and adding the related headers


First, we assign the URL of the dataset to "filename".


This dataset was hosted on IBM Cloud object. Click <a href="https://cocl.us/corsera_da0101en_notebook_bottom?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01">HERE</a> for free storage.


In [ ]:
##Now you need to use dataset, you made in he first lab 
filename = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0Z9BEN/Lab1DataSet.csv"

Then, we create a Python list <b>headers</b> containing name of headers.


Use the Pandas method <b>read_csv()</b> to load the data from the web address.


In [ ]:
df = pd.read_csv(filename,low_memory=False, index_col=0)
#create another data frame, to use in future
spoiled_df = pd.read_csv(filename,low_memory=False, index_col=0)
df = df.reset_index()
spoiled_df = spoiled_df.reset_index()
df.head()

Use the method <b>head()</b> to display the first five rows of the dataframe.


#### Make some values wrong

Often dataset is damaged, so lets damage our dataframe to learn how to renew data

In [ ]:
#Columns to hurt
cols = ['Open', 'Rec_count', 'Close']
#We hurt columns one by one
for col in cols:
        #Get random value in range [0,1]
        m = np.random.rand(len(df))
        #Set coefficients.
        l1 = 0.05 # NaN
        l2 = 0.03 # Text
        l3 = 0.04 # Negative
        
        mask1 = m < l1 # NaN
        mask2 = (m >= l1) & (m < l1+l2) # Text
        mask3 = (m >= l1+l2) & (m < l1+l2+l3) # Negative
        
        #Change data
        spoiled_df.loc[mask1, col] = np.NaN
        spoiled_df.loc[mask2, col] = "?"
        spoiled_df.loc[mask3, col] = spoiled_df.loc[mask3, col]


In [ ]:
# To see what the data set looks like, we'll use the head() method.
spoiled_df.head(50)

As we can see, several question marks appeared in the dataframe; those are missing values which may hinder our further analysis.

<div>So, how do we identify all those missing values and deal with them?</div> 

<b>How to work with missing data?</b>

Steps for working with missing data:

<ol>
    <li>Identify missing data</li>
    <li>Deal with missing data</li>
    <li>Correct data format</li>
</ol>


## Identify and handle missing values
### Identify missing values
#### Convert "?" to NaN
In the our dataset, missing data comes with the question mark "?", negative values and NaN(Not a Number).
We replace "?" and negative values by NaN, Python's default missing value marker for reasons of computational speed and convenience. Here we use the function: 
 <pre>.replace(A, B, inplace = True) </pre>
to replace A by B.


In [ ]:
# replace "?" with NaN
spoiled_df.replace("?", np.nan, inplace = True)

# replace negative values with NaN
num = df._get_numeric_data()
num[num < 0] = np.nan

#Lets see our result
spoiled_df.head(50)

#### Evaluating for Missing Data

The missing values are converted by default. We use the following functions to identify these missing values. There are two methods to detect missing data:

<ol>
    <li><b>.isnull()</b></li>
    <li><b>.notnull()</b></li>
</ol>
The output is a boolean value indicating whether the value that is passed into the argument is in fact missing data.


In [ ]:
missing_data = spoiled_df.isnull()
missing_data.head(5)

"True" means the value is a missing value while "False" means the value is not a missing value.


#### Count missing values in each column
<p>
Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, "True" represents a missing value and "False" means the value is present in the dataset.  In the body of the for loop the method ".value_counts()" counts the number of "True" values. 
</p>


In [ ]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")    

Based on the summary above, each column has 67212 rows of data and seven of the columns containing missing data:

<ol>
    <li>"open": ~5 400 missing data</li>
    <li>"close": ~5 400 missing data</li>
    <li>"rec_count": ~5 400 missing data</li>
</ol>


### Deal with missing data
<b>How to deal with missing data?</b>

<ol>
    <li>Drop data<br>
        a. Drop the whole row<br>
        b. Drop the whole column
    </li>
    <li>Replace data<br>
        a. Replace it by mean<br>
        b. Replace it by frequency<br>
        c. Replace it based on other functions
    </li>
</ol>


Whole columns should be dropped only if most entries in the column are empty. In our dataset, none of the columns are empty enough to drop entirely.
We have some freedom in choosing which method to replace data; however, some methods may seem more reasonable than others. We will apply each method to many different columns:

<b>Replace by mean:</b>

<ul>
    <li>"rec_count": ~5 400 missing data</li>
</ul>

<b>Replace by interpolation:</b>

<ul>
    <li>"open": ~5 400 missing data</li>
    <li>"close": ~5 400 missing data</li>
</ul>



#### Calculate the mean value for the "normalized-losses" column 


In [ ]:
avg_norm_loss = spoiled_df["Rec_count"].astype("float").mean(axis=0)
print("Average of normalized-losses:", avg_norm_loss)

<h4>Replace "NaN" with mean value in "normalized-losses" column</h4>


In [ ]:
spoiled_df["Rec_count"].replace(np.nan, avg_norm_loss, inplace=True)
spoiled_df.head(10)


#### Fill all NaN in the "open" column by interpolating

<b>Pay attention!</b> To replace missing values in a dataset using interpolation, it is necessary to ensure that there are no missing values at the beginning or end of the dataset. One way to do this is shown below


In [ ]:
# check for NaN values at the beginning of the dataset
while spoiled_df.iloc[0].isna().any():
    spoiled_df = spoiled_df.drop(spoiled_df.index[0])

# check for NaN values at the end of the dataset
while spoiled_df.iloc[-1].isna().any():
    spoiled_df = spoiled_df.drop(spoiled_df.index[-1])


#Do same with own dataframe to check values by their indexes
while df.iloc[0].isna().any():
    df = df.drop(df.index[0])

while df.iloc[-1].isna().any():
    df = df.drop(df.index[-1])

    
spoiled_df = spoiled_df.reset_index(drop=True)
df = df.reset_index(drop=True)

Check which method is better for our dataset:
<p>Create lish with interpolation methods without order

In [ ]:
# Methods without order
methods = ["linear", "nearest", "slinear", "quadratic", "cubic", "piecewise_polynomial", "pchip", "akima", "cubicspline"]


Create lish with interpolation methods with order

In [ ]:
# Methods with order
order_methods = ["spline", "polynomial"]
# Create a dataframe to store performance metrics
performance = pd.DataFrame(columns=["name", "MSE", "MAPE"])


<p>First step to check which method is better is to use interpolation method without order

In [ ]:
# Test all methods for func 'pandas.interpolate'
for method in methods:
    # Find all NaN values in spoiled_df['close']
    nan_rows = spoiled_df["Close"].isna()
    # Renew data with pandas.interpolate(). We use only one method at the same time
    interpolated_close = spoiled_df["Close"].interpolate(method=method, limit=5, limit_direction="both")
    interpolated_close[nan_rows].reset_index(drop=True, inplace=True)
    df.reset_index(drop=True, inplace=True)
    # Calculate squared error between real and renewed data
    mse = mean_squared_error(df.loc[nan_rows, "Close"], interpolated_close[nan_rows])
    # Calculate absolute percentage error between real and renewed data
    mape = mean_absolute_percentage_error(df.loc[nan_rows, "Close"], interpolated_close[nan_rows])
    # Append new row to performance dataframe
    performance.loc[len(performance)] = [method, mse, mape]


Second step to check which method is better is to use interpolation method with order

In [ ]:
for method in order_methods:
    for order in [3, 5]:
        # Find all NaN values in spoiled_df['close']
        nan_rows = spoiled_df["Close"].isna()
        # Renew data with pandas.interpolate(). We use only one method at the same time
        interpolated_close = spoiled_df["Close"].interpolate(method=method, order=order)
        # Calculate squared error between real and renewed data
        mse = mean_squared_error(df.loc[nan_rows, "Close"], interpolated_close[nan_rows])
        # Calculate absolute percentage error between real and renewed data
        mape = mean_absolute_percentage_error(df.loc[nan_rows, "Close"], interpolated_close[nan_rows])
        # Append new row to performance dataframe
        performance.loc[len(performance)] = [method, mse, mape]

In [ ]:
# Sort dataframe by MAPE and MSE in ascending order
performance = performance.sort_values(by=["MAPE", "MSE"], ascending=True)

# Convert "MAPE" column data to percents
performance["MSE"] = performance["MSE"].apply(lambda x: '{:.7f}'.format(x))
performance["MAPE"] = performance["MAPE"] * 100
performance["MAPE"] = performance["MAPE"].round(5).astype(str) + "%"

# Display top 15 rows of performance dataframe
print(performance.head(15))

So, 'linear' method is one of the best methods for us, we will use it 

In [ ]:
#Lets see data before interpolation
spoiled_df.head(20)

In [ ]:
spoiled_df['Close'] = spoiled_df["Close"].interpolate(method='linear')

pd.set_option("display.precision", 2)
pd.options.display.float_format = '{:.2f}'.format

#Lets see data after interpolation
spoiled_df.head(20)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;"> Question  #1: </b><br>

<b>Based on the example above, interpolate NaN in "open" column with linear mehod.</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
spoiled_df['Open'] = spoiled_df["Open"].interpolate(method='linear')


<details><summary>Click here for the solution</summary>

```python
spoiled_df['Open'] = spoiled_df["Open"].interpolate(method='linear')
```

</details>


To see which values are present in a particular column, we can use the ".value_counts()" method:

In [ ]:
#We don't need "Spoiled_df" any more, because its reneved, lets contain it to our "df" dataframe   
df = spoiled_df

df['Close'].value_counts()

Lets drop "Nan" values from "Open". Dont foget to reset indexes

In [ ]:
# simply drop whole row with NaN in "open" column
df.dropna(subset=["Open"], axis=0, inplace=True)

# reset index
df.reset_index(drop=True, inplace=True)

In [ ]:
df.head()

<b>Good!</b> Now, we have a dataset with no missing values.Lets save it


In [ ]:
df.to_csv("Lab2ClearDF.csv")

### Correct data format
<b>We are almost there!</b>
<p>The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).</p>

In Pandas, we use:

<p><b>.dtype()</b> to check the data type</p>
<p><b>.astype()</b> to change the data type</p>


#### Let's list the data types for each column


In [ ]:
df.dtypes

<p>As we can see above, some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. For example, 'rec_count' variable are numerical value that describe counnt of records, expect them to be of the type 'int'; however, it is shown as type 'float'. We have to convert data types into a proper format for each column using the "astype()" method.</p> 


#### Convert data types to proper format


In [ ]:
df['Ts'] = pd.to_datetime(df['Ts'])
df["Rec_count"] = df["Rec_count"].astype("int")


#### Let us list the columns after the conversion


In [ ]:
df.dtypes

<b>Wonderful!</b>

Now we have finally obtained the cleaned dataset with no missing values with all data in its proper format.


## Data Standardization
<p>
Data is usually collected from different agencies in different formats.
(Data standardization is also a term for a particular type of data normalization where we subtract the mean and divide by the standard deviation.)
</p>

<b>What is standardization?</b>

<p>Standardization is the process of transforming data into a common format, allowing the researcher to make the meaningful comparison.
</p>

<b>Example</b>

<p>Transform USD to EUR:</p>
<p>We will need to apply <b>data transformation</b> to transform BTC into EUR.</p>
In our dataset prices are in dollars, so to get our price in EUR we have to get USDT - EUR rate


<p>The formula for unit conversion is:<p>
1 USDT = 0.94 EUR (01.03.2023)
<p>We can do many mathematical operations directly in Pandas.</p>


In [ ]:
df.head()

In [ ]:
res = requests.get("https://api.binance.com/sapi/v1/convert/exchangeInfo?fromAsset=USDT&toAsset=EUR") 
if res.status_code != 200: 
    rate = 0.94 
else: 
    res = res.json() 
    rate = float(res[0]["toAssetMinAmount"]) 
     
print(f"The exchange rate is 1 BTC = {rate} EUR") 
 
cols_to_convert = ["Open","BTC_price"] 
for col in cols_to_convert: 
    df[f"{col}_EUR"] = df[col] * rate 
 
# check your transformed data  

df.head()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    <b style="font-size: 2em; font-weight: bold;"> Question  #2: </b><br>

<b>According to the example above, transform BNB to EUR in column "high"</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
res = requests.get("https://api.binance.com/sapi/v1/convert/exchangeInfo?fromAsset=USDT&toAsset=EUR") 
if res.status_code != 200: 
    rate = 0.94 
else: 
    res = res.json() 
    rate = float(res[0]["toAssetMinAmount"]) 
     
print(f"The exchange rate is 1 BTC = {rate} EUR") 
 
df["high_EUR"] = df["High"] * rate
df.head() 

<details><summary>Click here for the solution</summary>

```python
res = requests.get("https://api.binance.com/sapi/v1/convert/exchangeInfo?fromAsset=USDT&toAsset=EUR") 
if res.status_code != 200: 
    rate = 0.94 
else: 
    res = res.json() 
    rate = float(res[0]["toAssetMinAmount"]) 
     
print(f"The exchange rate is 1 BTC = {rate} EUR") 
 
df["high_EUR"] = df["High"] * rate
 

```

</details>


## Data Normalization

<b>Why normalization?</b>

<p>Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.
</p>

<b>Example</b>

<p>To demonstrate normalization, let's say we want to scale the columns "high" and "low".</p>
<p><b>Target:</b> would like to normalize those variables so their value ranges from 0 to 1</p>
<p><b>Approach:</b> replace original value by (original value)/(maximum value)</p>


In [ ]:
df['High'].head(10)

In [ ]:
# replace (original value) by (original value)/(maximum value)
df['High_Normalized'] = df['High']/df['High'].max()
df['High_Normalized'].head(10)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    <b style="font-size: 2em; font-weight: bold;"> Question #3: </b><br>
    <b>According to the example above, normalize the column "low".</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
df['Low_Normalized'] = df['Low']/df['Low'].max() 

# show the scaled columns
df[["High_Normalized","Low_Normalized"]].head(10)

<details><summary>Click here for the solution</summary>

```python
df['low'] = df['low']/df['low'].max() 

# show the scaled columns
df[["high","low"]].head(10)

```

</details>


Here we can see we've normalized "High","Low" columns in the range of \[0,1].


## Binning
<b>Why binning?</b>
<p>
    Binning is a process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis.
</p>

<b>Example: </b>

<p>In our dataset, "close" is a real valued variable ranging from 200 to 320 and it has ~10900 unique values. What if we only care about the price difference between closing minute price(3 types)? Can we rearrange them into three ‘bins' to simplify analysis? </p>

<p>We will use the pandas method 'cut' to segment the 'close' column into 3 bins.</p>


### Example of Binning Data In Pandas

Convert data to correct format:


In [ ]:
df.dtypes

Let's plot the histogram of horsepower to see what the distribution of horsepower looks like.


In [ ]:
plt.hist(df["Close"])

# set x/y labels and plot title
plt.xlabel("Close")
plt.ylabel("Count")
plt.title("close counts")

<p>We would like 3 bins of equal size bandwidth so we use numpy's <code>linspace(start_value, end_value, numbers_generated</code> function.</p>
<p>Since we want to include the minimum value of close, we want to set start_value = min(df["close"]).</p>
<p>Since we want to include the maximum value of close, we want to set end_value = max(df["close"]).</p>
<p>Since we are building 3 bins of equal length, there should be 4 dividers, so numbers_generated = 4.</p>

We build a bin array with a minimum value to a maximum value by using the bandwidth calculated above. The values will determine when one bin ends and another begins.

In [ ]:
bins = np.linspace(min(df["Close"]), max(df["Close"]), 4)
bins

We set group  names:

In [ ]:
group_names = ['Low', 'Medium', 'High']

We apply the function "cut" to determine what each value of `df['close']` belongs to.

In [ ]:
df['close-binned'] = pd.cut(df['Close'], bins, labels=group_names, include_lowest=True )
df[['Close','close-binned']].head(20)

Let's see the number of vehicles in each bin:

In [ ]:
df["close-binned"].value_counts().sort_index()

Let's plot the distribution of each bin:

In [ ]:
pyplot.bar(group_names, df["close-binned"].value_counts().sort_index())
# set x/y labels and plot title
plt.xlabel("close")
plt.ylabel("count")
plt.title("close bins")

<p>
    Look at the dataframe above carefully. You will find that the last column provides the bins for "close" based on 3 categories ("Low", "Medium" and "High"). 
</p>
<p>
    We successfully narrowed down the intervals from 9.000 to 55.000!
</p>


### Bins Visualization
Normally, a histogram is used to visualize the distribution of bins we created above. 


In [ ]:
# draw historgram of attribute "close" with bins = 3
plt.hist(df["Close"], bins = 3)

# set x/y labels and plot title
plt.xlabel("close")
plt.ylabel("count")
plt.title("close bins")

The plot above shows the binning result for the attribute "close".


## Indicator Variable (or Dummy Variable)
<b>What is an indicator variable?</b>
<p>
    An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning. 
</p>

<b>Why we use indicator variables?</b>

<p>
    We use indicator variables so we can use categorical variables for regression analysis in the later modules.
</p>
<b>Example</b>
<p>
    We see the column "close-binned" has two unique values: "Low","Medium" and "High". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert "open-binned" to indicator variables.
</p>

<p>
    We will use pandas' method 'get_dummies' to assign numerical values to different categories of fuel type. 
</p>


In [ ]:
df.columns

Get the indicator variables and assign it to data frame "dummy_variable\_1":


In [ ]:
dummy_variable_1 = pd.get_dummies(df["close-binned"])
dummy_variable_1.head()

Change the column names for clarity:


In [ ]:
dummy_variable_1.rename(columns={'Low':'close_low', 'Medium':'close_medium', 'High':'close_high'}, inplace=True)
dummy_variable_1.head()

In the dataframe, column 'close-binned' has values for 'low','medium' and 'high' as 0s and 1s now.


In [ ]:
# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis=1)

# drop original column "fuel-type" from "df"
df.drop("close-binned", axis = 1, inplace=True)

The last three columns are now the indicator variable. They're all 0s and 1s now.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    <b style="font-size: 2em; font-weight: bold;"> Question  #4: </b><br>
    <b>Similar to before, create an indicator variable for the column "rec_count"</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
group_names = ['Low', 'Medium', 'High']

bins = np.linspace(min(df["Rec_count"]), max(df["Rec_count"]), 4)

df['rec_count-binned'] = pd.cut(df['Rec_count'], bins, labels=group_names, include_lowest=True )
columns={'Low':'rec_count_low', 'Medium':'rec_count_medium', 'High':'rec_count_high'}

# get indicator variables of aspiration and assign it to data frame "dummy_variable_2"
dummy_variable_2 = pd.get_dummies(df['rec_count-binned'])

# change column names for clarity
dummy_variable_2.rename(columns={'Low':'rec_count_low', 'Medium':'rec_count_medium', 'High':'rec_count_high'}, inplace=True)

# show first 5 instances of data frame "dummy_variable_1"
dummy_variable_2.head()

<details><summary>Click here for the solution</summary>

```python
group_names = ['Low', 'Medium', 'High']

bins = np.linspace(min(df["Rec_count"]), max(df["Rec_count"]), 4)

df['rec_count-binned'] = pd.cut(df['Rec_count'], bins, labels=group_names, include_lowest=True )
columns={'Low':'rec_count_low', 'Medium':'rec_count_medium', 'High':'rec_count_high'}

# get indicator variables of aspiration and assign it to data frame "dummy_variable_2"
dummy_variable_2 = pd.get_dummies(df['rec_count-binned'])

# change column names for clarity
dummy_variable_2.rename(columns={'Low':'rec_count_low', 'Medium':'rec_count_medium', 'High':'rec_count_high'}, inplace=True)

# show first 5 instances of data frame "dummy_variable_1"
dummy_variable_2.head()

```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    <b style="font-size: 2em; font-weight: bold;"> Question  #5: </b><br>
    <b>Merge the new dataframe to the original dataframe.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 

# merge the new dataframe to the original datafram
df = pd.concat([df, dummy_variable_2], axis=1)

df.head()


<details><summary>Click here for the solution</summary>

```python
# merge the new dataframe to the original datafram
df = pd.concat([df, dummy_variable_2], axis=1)

df.head()

```

</details>


In [ ]:
df.head()

In [ ]:
df = df.set_index('Ts')
resample_df = pd.DataFrame()
resample_df['Open'] = df['Open'].resample('5min').first()
resample_df['High'] = df['High'].resample('5min').max()
resample_df['Low'] = df['Low'].resample('5min').min()
resample_df['Close'] = df['Close'].resample('5min').last()
resample_df['Volume'] = df['Volume'].resample('5min').sum()

resample_df.to_csv("resampl.csv")
resample_df.head()
df.to_csv("Lab2DataSet.csv")

#### Save the new csv:

> Note : The  csv file cannot be viewed in the jupyterlite based SN labs environment.However you can Click <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Module%202/DA0101EN-2-Review-Data-Wrangling.ipynb?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2022-01-01">HERE</a> to download the lab notebook (.ipynb) to your local machine and view the csv file once the notebook is executed.


# **Thank you for completing this lab!**

## Author

<a href="https://author.skills.network/instructors/ostap_liashenyk" target="_blank" >Ostap Liashenyk</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/mariya_fleychuk?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Mariya Fleychuk, DrSc, PhD</a>




## Change Log

| Date (YYYY-MM-DD) | Version | Changed By      | Change Description                                         |
| ----------------- | ------- | ----------------| ---------------------------------------------------------- |
|     2023-04-01    |   1.0   | Ostap Liashenyk | Creation of the lab                                        |

<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>