<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX04M1EN/SN_web_lightmode.png?1677234044475" width="300" alt="cognitiveclass.ai logo"  />
</center>

# Financial Services: Data Wrangling (on the example of BTC/BUSD)

Estimated time needed: **30** minutes


The tasks:
* Find empty cells and handle missing values;
* Analyze the data format, then find and correct wrong formats;
* Standardize and normalize data series.

## Objectives

After completing this lab you will be able to:

*   Handle missing values
*   Correct data format
*   Standardize and normalize data


<h2>Table of Contents</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ul>
    <li><a href="#identify_handle_missing_values">Identify and handle missing values</a>
        <ul>
            <li><a href="#identify_missing_values">Identify missing values</a></li>
            <li><a href="#deal_missing_values">Deal with missing values</a></li>
            <li><a href="#correct_data_format">Correct data format</a></li>
        </ul>
    </li>
    <li><a href="#data_standardization">Data standardization</a></li>
    <li><a href="#data_normalization">Data normalization (centering/scaling)</a></li>
    <li><a href="#binning">Binning</a></li>
    <li><a href="#indicator">Indicator variable</a></li>
</ul>

</div>

<hr>


<h2>What is the purpose of data wrangling?</h2>


Data wrangling is the process of converting data from the initial format to a format that may be better for analysis.


<h3>What is the average coin cost in USDT?</h3>


<h3>Import data</h3>
<p>
You can find the "Finance Dataset" from the following link: <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX04M1EN/BTCBUSD_resampled_1min.csv"> https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX04M1EN/BTCBUSD_resampled_1min.csv</a>. 
We will be using this dataset throughout this course.
</p>


<h4>Install and import libraries</h4> 


In [ ]:
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install scipy
!pip install seaborn
!pip install requests
!pip install --upgrade scikit-learn

Now, let's import libraries that we will use

In [ ]:
import pandas as pd
import numpy as np
import warnings
import requests

from sklearn.preprocessing import MinMaxScaler
from typing import List, Tuple
%matplotlib inline 
# Note: `plt` is bound to the top-level matplotlib package here,
# so plotting calls in this lab are written as plt.pyplot.<function>
import matplotlib as plt
from matplotlib import pyplot


<h2>Reading the dataset from the URL and adding the related headers</h2>


In [ ]:
filename = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX04M1EN/BTCBUSD_resampled_1min.csv'

Use the Pandas method <b>read_csv()</b> to load the data from the web address.
Our dataset has a header row and a timestamp column named <code>ts</code>, so we set the parameter <code>index_col='ts'</code> to use that column as the index, and then convert the index to datetime values.


In [ ]:
df = pd.read_csv(filename, index_col='ts')
df.index = pd.to_datetime(df.index)

Use the method <b>head()</b> to display the first rows of the dataframe. By default it shows five rows; here we ask for 15.


In [ ]:
# To see what the data set looks like, we'll use the head() method.
df.head(15)

Let's make a copy of our input dataframe (we will need it later).

In [ ]:
df_copy = df.copy()

Use <code>shape</code> to retrieve the number of rows and columns.

In [ ]:
df.shape

Use <code>count()</code> to retrieve, for each column, the number of rows that do not have <code>NaN</code> values.

In [ ]:
df.count()
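
To see how <code>shape</code> and <code>count()</code> relate, here is a minimal sketch on a small made-up dataframe (the column values below are hypothetical and are not taken from our dataset). Subtracting the per-column count from the number of rows gives the number of missing cells.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing value in 'volume'
toy = pd.DataFrame({
    "close": [10.0, 11.0, 12.0],
    "volume": [100.0, np.nan, 300.0],
})

n_rows, n_cols = toy.shape          # counts every row, missing or not
non_missing = toy.count()           # counts only non-NaN cells, per column
missing_per_column = n_rows - non_missing

print(toy.shape)
print(non_missing)
print(missing_per_column)
```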

As we can see, our dataset has almost no missing values (except for the indicators and a few OHLC values), but in the real world this is rarely the case; more often, datasets need some preprocessing. So for the sake of this exercise, we will create missing values and wrong values and then learn how to work with such data.

So let's add some negative values first.

But before that, let's create a <code>raise_warning</code> function with common warning messages. It will be used to warn the user that negative or missing values already exist in the dataframe, which helps ensure that we won't generate too many <code>NaN</code> values.

In [ ]:
def raise_warning(column: str, type: str = 'negative'): 
  """Common warning function used to warn that some values are already in our column"""
  warning_error: str = f'Column {column} already has some {type} values! '
  warning_message: str = f'Column {column} was not changed. '
  warning_note: str = f'Note: columns listed before {column} were changed!'
  warnings.warn(warning_error + 
                warning_message + 
                warning_note)


In [ ]:
# columns which will have negative values
columns_to_add_negative_values = [('volume', 0.75), ('rec_count', 0.002)]

In [ ]:
def has_negative_values(df: pd.DataFrame, 
                       column: str) -> bool:
  """Checks if dataframe column has negative values
    :param df is a dataframe with the column we want to check
    :param column is the name of the column
  """         
  tmp_column = df[column].where(df[column] >= 0)
  return tmp_column.isna().sum() > 0


def create_negative_values(df: pd.DataFrame, 
                     columns: List[Tuple[str, float]]) ->  pd.DataFrame:
  """Adds negative values into specific columns of the dataframe if they 
     do not contain negative values already.
    :param df is a dataframe where you want to add some negative values
    :param columns is a list of column names and probabilities of each 
           cell becoming negative value
           Example: [('first_column', .5), ('second_column', .1)]
    :return updated dataframe or old dataframe 
            if negative values already exist in one of the columns
  """
  for column, percentage in columns:
    if has_negative_values(df, column):
      raise_warning(column)
      return df
    column_shape = df[column].shape
    condition_df = np.random.random(column_shape) < percentage
    df[column] = np.where(condition_df, -df[column], df[column])
  return df
  

In [ ]:
df = create_negative_values(df, columns_to_add_negative_values)

In [ ]:
df.head(50)

Sometimes a column holds values of one type but is stored as a completely different type that is not convenient for us. Let's change the types of some columns to 'wrong' types.

In [ ]:
df = df.astype({'open': 'string', 'close': 'string', 'volume': 'string'})

Let's see the dataframe types to make sure everything worked as expected

In [ ]:
df.dtypes

<b>Now, let's add missing values</b>

We will use the question mark '?' for the missing values. To replace some values, we will use the <code>mask</code> method. This method takes a condition, which is a set of boolean values for each cell. The parameter <code>other</code> specifies the value that will be inserted where the condition is True.

In [ ]:
def has_missing_values(df: pd.DataFrame, 
                       column: str, 
                       missing_value: str) -> bool:
  """Checks if dataframe column has missing values
    :param df is a dataframe with the column we want to check
    :param column is the name of the column
    :param missing_value specifies value that is considered to be a
           missing value
  """         
  tmp_column = df[column].replace(missing_value, np.nan)
  return tmp_column.isna().sum() > 0

def add_missing_data(df: pd.DataFrame, 
                     columns: List[Tuple[str, float]],
                     missing_value: str ='?') -> pd.DataFrame:
  """Adds missing values into specific columns of the dataframe if they 
     do not contain missing values already
    :param df is a dataframe where you want to add some missing value
    :param columns is a list of column names and probabilities of each 
           cell being replaced by a missing value
           Example: [('first_column', .5), ('second_column', .1)]
    :param missing_value specifies value that is 
           considered to be a missing value
    :return updated dataframe
  """
  for column, percentage in columns:
    if has_missing_values(df, column, missing_value):
      raise_warning(column, type='missing')
      return df
    column_shape = df[column].shape
    condition_df = np.random.random(column_shape) < percentage
    df[column] = df[column].mask(condition_df, other=missing_value)
  return df
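
To make the behaviour of <code>mask</code> easier to see, here is a minimal sketch on a small made-up series. The values, the probability of 0.4 and the seed are assumptions chosen only for illustration; the series is created with <code>dtype=object</code> so that it can hold both numbers and the '?' marker.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)   # seeded so the result is repeatable

# Hypothetical values; object dtype can hold both numbers and '?'
toy = pd.Series([1.5, 2.5, 3.5, 4.5, 5.5], name="volume", dtype=object)

# Boolean condition: True marks the cells that will be overwritten
condition = rng.random(toy.shape) < 0.4

# mask() keeps values where the condition is False and
# inserts `other` where the condition is True
masked = toy.mask(condition, other="?")
print(masked)
```

Cells where the condition is True now hold '?', and all other cells are unchanged.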

Let's create a list of columns that are going to have some missing values (in our case '?'), together with the probability for each.

In [ ]:
columns_to_add_missing_data = [('volume', 0.015), ('rec_count', 0.02)]

Make sure to run the code below only once: running it again won't change the dataframe, but will raise a warning.

In [ ]:
df = add_missing_data(df, columns_to_add_missing_data)

In [ ]:
df.head(50)

As we can see, our dataframe is more realistic now: it has wrong values, wrong types and several question marks (those are missing values, which may hinder our further analysis). Now we can learn how to work with such data.


<i><b>So, how do we identify all those missing values and deal with them?</b></i>

<b>How to work with missing data?</b>

Steps for working with missing data:

<ol>
    <li>Identify missing data</li>
    <li>Deal with missing data</li>
    <li>Correct data format</li>
</ol>

<i><b>How do we identify all the dirty values; how to deal with them?</b></i>

<b>How to work with dirty data?</b>

Steps for working with dirty data:

<ol>
    <li>Identify columns with dirty data</li>
    <li>Deal with dirty data</li>
    <li>Correct data format</li>
</ol>

<h2 id="identify_handle_missing_values">Identify and handle missing values</h2>

<h3 id="identify_missing_values">Identify missing values</h3>
<h4>Convert "?" to NaN</h4>
In our dataset, missing data is marked with the question mark "?".
So let's replace "?" with NaN (Not a Number), the default missing value marker in pandas, chosen for reasons of computational speed and convenience. Here we use the function: 
 <pre>.replace(A, B, inplace = True) </pre>
to replace A with B.
Doing so enables us to use the convenient pandas methods for working with missing data (you can find them a few cells below).

In [ ]:
df.replace('?', np.nan, inplace=True)
df.head(50)

In [ ]:
df.dtypes

<h4>Evaluating for Missing Data</h4>

After the conversion above, missing values are represented as NaN. There are two methods to detect missing data:

<ol>
    <li><b>.isnull()</b></li>
    <li><b>.notnull()</b></li>
</ol>
The output is a dataframe of boolean values indicating whether each cell is in fact missing data.


In [ ]:
missing_data = df.isnull()
missing_data.head(50)

"True" means the value is a missing value while "False" means the value is not a missing value.


<h4>Count missing values in each column</h4>
<p>
Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, "True" represents a missing value and "False" means the value is present in the dataset.  In the body of the for loop, the method ".value_counts()" counts how many times "True" and "False" occur.
</p>


In [ ]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print('')    

Based on the summary above, you can see how many cells have missing data: the number next to True is the number of cells with missing values, and the number next to False is the number of cells with values present.
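
As a more compact alternative to the loop, the boolean dataframe can be summed directly, because True counts as 1. The sketch below uses a small made-up dataframe (the values are hypothetical) to show both the number and the share of missing cells per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing cells in 'volume'
toy = pd.DataFrame({
    "volume": [1.0, np.nan, 3.0, np.nan],
    "rec_count": [5, 6, 7, 8],
})

# True counts as 1 when summed, so this is the number of missing cells
missing_counts = toy.isnull().sum()

# The mean of a boolean column is the share of missing cells
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```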


<h3 id="deal_missing_values">Deal with missing data</h3>
<b>How to deal with missing data?</b>

<ol>
    <li>Drop data<br>
        a. Drop the whole row<br>
        b. Drop the whole column
    </li>
    <li>Replace data<br>
        a. Replace it by mean<br>
        b. Replace it by frequency<br>
        c. Replace it based on other functions<br>
        d. Replace it by imputed value
    </li>
</ol>


Whole columns should be dropped only if most entries in the column are empty. In our dataset, none of the columns are empty enough to drop entirely.
We have some freedom in choosing which method to use for replacing data; however, some methods may seem more reasonable than others. We will handle the columns as follows:
<ul>
    <li>Column "volume": each cell had a 1.5% probability of becoming missing. Use interpolation to fill the missing values.
    </li>
    <li>Column "rec_count": each cell had a 2% probability of becoming missing. Use interpolation to fill the missing values.</li>
    <li>Column "ADOSC": has a few missing values. Drop the affected rows.</li>
    <li>Column "NATR": has a few missing values. Drop the affected rows.</li>
    <li>Column "TRANGE": has a few missing values. Drop the affected rows.</li>
</ul>
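
Before we apply these choices to our data, here is a minimal sketch of three of the generic options listed above (drop the row, replace by mean, replace by frequency). The dataframe and its "price" and "side" columns are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing number and one missing category
toy = pd.DataFrame({
    "price": [10.0, np.nan, 14.0, 12.0],
    "side": ["buy", "buy", None, "sell"],
})

# 1a. Drop every row that has a missing value in 'price'
dropped = toy.dropna(subset=["price"])

# 2a. Replace missing numbers with the column mean
mean_filled = toy["price"].fillna(toy["price"].mean())

# 2b. Replace missing categories with the most frequent value
most_frequent = toy["side"].mode()[0]
freq_filled = toy["side"].fillna(most_frequent)

print(dropped)
print(mean_filled)
print(freq_filled)
```

Which option is appropriate depends on the column: for a time series such as ours, interpolation between neighbouring values is often a more natural choice than a global mean.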



First of all, let's process our indicator values (ADOSC, NATR, TRANGE) by dropping the rows where they are missing.

In [ ]:
df.shape

In [ ]:
df = df.dropna(subset=['ADOSC', 'NATR', 'TRANGE'])

In [ ]:
df.head(15)

Next, let's get rid of the negative values.

To do that, we convert the data types and temporarily replace NaN with float('inf'). 
We need float('inf') because our <code>has_negative_values</code> check treats NaN like a negative value (NaN fails the <code>>= 0</code> comparison), while float('inf') passes it.
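
The effect of the placeholder can be seen in a minimal sketch that repeats the same <code>where(... >= 0)</code> check on a small made-up series (the values are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical series: no negative values, but one missing value
toy = pd.Series([5.0, np.nan, 3.0])

# NaN >= 0 evaluates to False, so where() blanks the NaN cell
# exactly as it would blank a negative value
flagged_with_nan = bool(toy.where(toy >= 0).isna().sum() > 0)

# With +inf standing in for NaN, every cell passes the >= 0 check
placeholder = toy.replace(np.nan, float("inf"))
flagged_with_inf = bool(placeholder.where(placeholder >= 0).isna().sum() > 0)

print(flagged_with_nan, flagged_with_inf)
```

The first check reports a problem even though nothing is negative; the second does not.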


In [ ]:
fix_types = {
    'open': np.float64,
    'close': np.float64,
    'high': np.float64,
    'low': np.float64,
    'volume': np.float64,
    'rec_count': np.float64,
    'avg_price': np.float64,
    }
df = df.astype(fix_types)
df[['volume', 'rec_count']] = df[['volume', 'rec_count']].replace(np.nan, float('inf'))
df.head()

In [ ]:
cols_with_neg_values = ['volume', 'rec_count']

In [ ]:
for column in cols_with_neg_values:
    df[column] = df[column].abs()

Let's check if everything is fine by displaying dataframe and using <code>has_negative_values</code> method that we implemented previously.

In [ ]:
for column in cols_with_neg_values:
  col_res = has_negative_values(df, column)
  log_res = 'contains' if col_res else 'does not contain'
  print(f'Column {column} {log_res} negative values')

Now that we no longer have negative values, we can replace float('inf') with np.nan again.

In [ ]:
df.replace(float('inf'), np.nan, inplace = True)

In [ ]:
methods = [{'method': 'linear', 'limit_direction': 'both'}, {'method': 'polynomial', 'order': 3}, {'method': 'pad', 'limit_direction': 'forward'}]

Now we interpolate the missing values with each of these methods and compare the errors in order to choose one.

In [ ]:
# from sklearn.metrics import mean_absolute_percentage_error
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


interp_df = pd.DataFrame()

for params in methods:
  method = params['method']

  # getting mean_absolute_percentage_error for record count
  interp_df[f'{method}_rec_count'] = df['rec_count'].interpolate(**params)
  print(f"rec_count {method} {mean_absolute_percentage_error(df_copy['rec_count'], interp_df[f'{method}_rec_count'])}")
    
  # getting mean_absolute_percentage_error for volume
  interp_df[f'{method}_volume'] = df['volume'].interpolate(**params)
  print(f"volume {method} {mean_absolute_percentage_error(df_copy['volume'], interp_df[f'{method}_volume'])}")
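
The idea behind this comparison can be checked on a tiny made-up series where the true values are known. In the sketch below, the values are hypothetical, <code>mape</code> is a local helper with the same formula as above, and <code>ffill()</code> is used as the forward-fill counterpart of the 'pad' method.

```python
import numpy as np
import pandas as pd

# Hypothetical ground truth, and a copy with two values knocked out
truth = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
with_gaps = truth.copy()
with_gaps.iloc[[1, 3]] = np.nan

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Linear interpolation draws a straight line between the neighbours
linear = with_gaps.interpolate(method="linear")

# Forward fill repeats the last known value
forward = with_gaps.ffill()

print(mape(truth, linear))
print(mape(truth, forward))
```

On this evenly increasing series, linear interpolation recovers the removed values, while forward fill leaves an error; on real data the ranking has to be checked, as we do above.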

In [ ]:
# let's fill the other price columns as well;
# no error is computed here, we only interpolate with the linear
# configuration (the first entry in methods)
linear_params = methods[0]

for column in ['open', 'close', 'low', 'high', 'avg_price']:
  interp_df[column] = df[column].interpolate(**linear_params)

As we can see, the linear method fits our data best, so we use its results.

In [ ]:
pd.options.mode.chained_assignment = None 
df['rec_count'] = interp_df['linear_rec_count']
df['volume'] = interp_df['linear_volume']
df['open'] = interp_df['open']
df['close'] = interp_df['close']
df['high'] = interp_df['high']
df['low'] = interp_df['low']
df['avg_price'] = interp_df['avg_price']
pd.options.mode.chained_assignment = 'warn'

Let's check if all missing values were replaced using previously implemented <code>has_missing_values</code> method.

In [ ]:
has_missing_values(df, 'rec_count', missing_value=np.nan)

In [ ]:
has_missing_values(df, 'volume', missing_value=np.nan)

In [ ]:
o = has_missing_values(df, 'open', missing_value=np.nan)
c = has_missing_values(df, 'close', missing_value=np.nan)
h = has_missing_values(df, 'high', missing_value=np.nan)
l = has_missing_values(df, 'low', missing_value=np.nan)
print(o, c, h, l)

<b>Good!</b> Now, we have a dataset with no missing values.


<h3 id="correct_data_format">Correct data format</h3>
<b>We are almost there!</b>
<p>The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).</p>

In Pandas, we use:

<p><b>.dtypes</b> to check the data types</p>
<p><b>.astype()</b> to change the data type</p>
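
Here is a minimal sketch of both on a small made-up dataframe (the values are hypothetical): prices stored as text and counts stored as floats are converted to <code>float64</code> and <code>int64</code>.

```python
import pandas as pd

# Hypothetical toy frame with 'wrong' types
toy = pd.DataFrame({
    "close": ["101.5", "102.0"],   # numbers stored as text
    "rec_count": [3.0, 4.0],       # counts stored as floats
})

# dtypes is an attribute, so it is used without parentheses
print(toy.dtypes)

# astype() accepts a mapping from column name to target type
toy = toy.astype({"close": "float64", "rec_count": "int64"})
print(toy.dtypes)
```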


<h4>Let's list the data types for each column</h4>


In [ ]:
df.dtypes

<p>As we can see above, not every column has the correct data type. Numerical variables should have type <code>float</code> or <code>int</code>, and datetime values should have type <code>datetime</code>. For example, 'rec_count' holds record counts and should be of type <code>int</code>; however, after the earlier conversion and interpolation it is stored as <code>float64</code>. We have to convert data types into a proper format for each column using the <code>astype()</code> method.</p> 


<h4>Convert data types to proper format</h4>


In [ ]:
correct_types = {
                'open': np.float64,
                'high': np.float64,
                'low': np.float64,
                'close': np.float64,
                'volume': np.float64,
                'rec_count': np.int64,
                'avg_price': np.float64,
                # indicators
                'ADOSC': np.float64,
                'NATR': np.float64,
                'TRANGE': np.float64,
                # other currencies
                'ape_avg_price': np.float64,
                'bnb_avg_price': np.float64,
                'doge_avg_price': np.float64,
                'eth_avg_price': np.float64,
                'xrp_avg_price': np.float64,
                'matic_avg_price': np.float64
                 }

df = df.astype(correct_types)

<h4>Let us list the columns after the conversion</h4>


In [ ]:
df.dtypes

Now we have obtained the cleaned dataset with no missing values and with correct datatypes. Let's save it.


In [ ]:
df.to_csv('BTCBUSD_1min.csv')

<h2 id="data_standardization">Data Standardization</h2>
<p>
Data is usually collected from different sources in different formats.
(Data standardization is also a term for a particular type of data normalization where we subtract the mean and divide by the standard deviation.)
</p>

<b>What is standardization?</b>

<p>Standardization is the process of transforming data into a common format, allowing the researcher to make meaningful comparisons.
</p>
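
The other meaning of the term mentioned above, subtracting the mean and dividing by the standard deviation, can be sketched as follows. The values are hypothetical, and using the population standard deviation (<code>ddof=0</code>) is an assumption made for this illustration; pandas uses <code>ddof=1</code> by default.

```python
import pandas as pd

# Hypothetical prices
prices = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0], name="avg_price")

# z-score: subtract the mean, divide by the standard deviation
z_scores = (prices - prices.mean()) / prices.std(ddof=0)

print(z_scores)
print(z_scores.mean(), z_scores.std(ddof=0))
```

After this transformation the series has mean 0 and standard deviation 1 (up to floating-point rounding).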

<b>Example</b>

<p>Transform BUSD to USDT:</p>
<p>In our dataset, the average price column "avg_price" is expressed in BUSD. Assume we want to perform some analysis using USDT values instead.</p>
<p>We will need to apply <b>data transformation</b> to transform BUSD into USDT.</p>


Let's use the [CoinMarketCap](https://coinmarketcap.com/api/) API to get up-to-date info for the transformation.

We can do many mathematical operations directly in Pandas.

Note: to send requests to [https://coinmarketcap.com/api/](https://coinmarketcap.com/api/) you need to register and get your own API_KEY

In [ ]:
df.head()

In [ ]:
url = "https://pro-api.coinmarketcap.com/v1/cryptocurrency/quotes/latest"

parameters = {
  "symbol": "BUSD",
  "convert": "USDT"
}

headers = {
  "Accepts": "application/json",
  "X-CMC_PRO_API_KEY": "YOUR_API_KEY" # change "YOUR_API_KEY" to your own API key; otherwise,
    # the code will use a default rate that may be outdated
}

response = requests.get(url, headers=headers, params=parameters)

if response.status_code != 200:
    rate = 0.999707
else:
    data = response.json()
    rate = float(data["data"]["BUSD"]["quote"]["USDT"]["price"])

print(f"The exchange rate is 1 BUSD = {rate} USDT")

cols_to_convert = ["avg_price"]
for col in cols_to_convert:
    df[f"{col}_USDT"] = df[col] * rate

# check your transformed data 
df[["avg_price_USDT"]].head(25)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #1: </h1>

<b>According to the example above, transform BUSD to BTC in a new column called "avg_price_BTC".</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 

<details><summary>Click here for the solution</summary>

```python
# Convert BUSD to BTC by mathematical operation
import requests

url = "https://pro-api.coinmarketcap.com/v1/cryptocurrency/quotes/latest"

parameters = {
  "symbol": "BUSD",
  "convert": "BTC"
}

headers = {
  "Accepts": "application/json",
  "X-CMC_PRO_API_KEY": "YOUR_API_KEY" # change "YOUR_API_KEY" to your own API key; otherwise,
    # the code will use a default rate that may be outdated
}

response = requests.get(url, headers=headers, params=parameters)

if response.status_code != 200:
    rate = 4.1659326843755195e-05
else:
    data = response.json()
    rate = float(data["data"]["BUSD"]["quote"]["BTC"]["price"])

print(f"The exchange rate is 1 BUSD = {rate} BTC")

cols_to_convert = ["avg_price"]
for col in cols_to_convert:
    df[f"{col}_BTC"] = df[col] * rate

# check your transformed data 
df[["avg_price_BTC"]].head(25)

```

</details>


<h2 id="data_normalization">Data Normalization</h2>

<b>Why normalization?</b>

<p>Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.
</p>

<b>Example</b>

<p>To demonstrate normalization, let's say we want to scale the columns "rec_count" and "volume".</p>
<p><b>Target:</b> we would like to normalize those variables so that their values range from 0 to 1.</p>
<p><b>Approach:</b> Use <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html">MinMaxScaler</a> that does following computations:</p>
<ul>
<li>
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
</li>
<li>
X_scaled = X_std * (max - min) + min
</li>
</ul>
Where min, max are params from feature_range.
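
To make the two formulas concrete, here is a minimal NumPy sketch that applies them by hand. The function name <code>min_max_scale</code> and the input values are hypothetical; this illustrates the computation and is not the scikit-learn implementation itself.

```python
import numpy as np

def min_max_scale(x, feature_range=(0.0, 1.0)):
    """Apply the two formulas above to a 1-D array."""
    low, high = feature_range
    x = np.asarray(x, dtype=float)
    # First formula: rescale to the range [0, 1]
    x_std = (x - x.min()) / (x.max() - x.min())
    # Second formula: stretch and shift to the requested range
    return x_std * (high - low) + low

values = [2.0, 4.0, 6.0, 10.0]   # hypothetical values
print(min_max_scale(values))
print(min_max_scale(values, feature_range=(-1.0, 1.0)))
```

Note that this simple version divides by zero when all values are equal, so a constant column would need separate handling.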


In [ ]:
# replace (original value) by MinMaxScaler values
# the default value of the feature_range parameter is (0, 1),
# so we do not need to specify it
scaler = MinMaxScaler()
df[['rec_count']] = scaler.fit_transform(df[['rec_count']])
df[['rec_count']]

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question #2: </h1>

<b>According to the example above, normalize the column "volume".</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 

<details><summary>Click here for the solution</summary>

```python
df[['volume']] = scaler.fit_transform(df[['volume']])

# show the scaled columns
df[['volume','rec_count']].head()

```

</details>


Here we can see we've normalized "volume" and "rec_count" to the range [0, 1].


<h2 id="binning">Binning</h2>
<b>Why binning?</b>
<p>
    Binning is a process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis.
</p>

<b>Example: </b>

<p> What if we want to see the periods of time when "avg_price" had its highest values and analyse the difference between "high" and "low" values for high, medium and low "avg_price"?
Can we rearrange them into three 'bins' to simplify analysis? </p>

<p>We will use the pandas method "cut" to segment the "avg_price" column into 3 bins.</p>


<h3>Example of Binning Data In Pandas</h3>


Let's plot the histogram of avg_price to see what the distribution of avg_price looks like.


In [ ]:
plt.pyplot.hist(df["avg_price"])

# set x/y labels and plot title
plt.pyplot.xlabel("Avg_price")
plt.pyplot.ylabel("Count")
plt.pyplot.title("Average price bins")

<p>We would like 3 bins of equal bandwidth, so we use numpy's <code>linspace(start_value, end_value, numbers_generated)</code> function.</p>
<p>Since we want to include the minimum value of "avg_price" column, we want to set start_value = min(df["avg_price"]).</p>
<p>Since we want to include the maximum value of "avg_price" column, we want to set end_value = max(df["avg_price"]).</p>
<p>Since we are building 3 bins of equal length, there should be 4 dividers, so numbers_generated = 4.</p>


We build an array of bin edges running from the minimum value to the maximum value, using the equal spacing described above. The values determine where one bin ends and another begins.
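
The relationship between the number of edges and the number of bins is easy to verify on a small made-up series (the prices below are hypothetical): 4 edges produce 3 bins, and <code>include_lowest=True</code> keeps the minimum value inside the first bin.

```python
import numpy as np
import pandas as pd

# Hypothetical prices
prices = pd.Series([100.0, 130.0, 160.0, 190.0, 220.0, 250.0, 400.0])

# 4 equally spaced edges from min to max define 3 bins
edges = np.linspace(prices.min(), prices.max(), 4)
labels = ["low", "medium", "high"]

# include_lowest=True puts the minimum value into the first bin
binned = pd.cut(prices, edges, labels=labels, include_lowest=True)

print(edges)
print(binned.value_counts(sort=False))
```

Bins of equal width do not have to contain equal numbers of records, as the counts show.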


In [ ]:
bins = np.linspace(min(df['avg_price']), max(df['avg_price']), 4)
bins

We set group  names:


In [ ]:
group_names = ['low', 'medium', 'high']

We apply the function "cut" to determine which bin each value of `df['avg_price']` belongs to.


In [ ]:
df['avg_price_binned'] = pd.cut(df['avg_price'], bins, labels=group_names, include_lowest=True )
df[['avg_price','avg_price_binned']].head(20)

Let's see the number of records in each bin:


In [ ]:
df["avg_price_binned"].value_counts()

Let's plot the distribution of each bin:


In [ ]:
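# value_counts() sorts by frequency, so reindex to match the order of group_names
pyplot.bar(group_names, df["avg_price_binned"].value_counts().reindex(group_names))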

# set x/y labels and plot title
plt.pyplot.xlabel("Avg_price")
plt.pyplot.ylabel("Count")
plt.pyplot.title("Average price bins")

<p>
    Good, we divided our dataset into bins by "avg_price" based on 3 categories ("low", "medium" and "high"). 
</p>

<h3>Bins Visualization</h3>
Normally, a histogram is used to visualize the distribution of bins we created above. 


In [ ]:
# draw histogram of attribute "avg_price" with bins = 3
plt.pyplot.hist(df["avg_price"], bins=3)

# set x/y labels and plot title
plt.pyplot.xlabel("Avg_price")
plt.pyplot.ylabel("Count")
plt.pyplot.title("Average price bins")

The plot above shows the binning result for the attribute "Avg_price".

<h2 id="indicator">Indicator variable</h2>
<p>
    An indicator variable (or dummy variable) is a numerical variable used to label categories. Here we use the pandas method "get_dummies" to turn each bin of "avg_price_binned" into its own column.
</p>


In [ ]:
dummy_variable = pd.get_dummies(df['avg_price_binned'])
dummy_variable

Change the column names for clarity:

In [ ]:
dummy_variable.rename(columns={'low':'low_bin', 'medium':'medium_bin', 'high': 'high_bin'}, inplace=True)

In [ ]:
# merge data frame "df" and "dummy_variable" 
df = pd.concat([df, dummy_variable], axis=1)

# drop original column "avg_price_binned" from "df"
df.drop("avg_price_binned", axis=1, inplace=True)

In [ ]:
df.head()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #3: </h1>

<b>Create an indicator variable for the list of values: <code>['apple', 'orange', 'banana']</code></b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 

<details><summary>Click here for the solution</summary>

```python
# get indicator variables for list of values and assign it to data frame "dummy_variable_1"
list_of_values = ['apple', 'orange', 'banana']
dummy_variable_1 = pd.get_dummies(list_of_values)

dummy_variable_1.head()


```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #4: </h1>

<b>Similar to before, create bins for the column "high" with four group names. Plot the distribution of each bin. </b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 

<details><summary>Click here for the solution</summary>

```python
%matplotlib inline
import matplotlib as plt
from matplotlib import pyplot

# 5 equally spaced edges define 4 bins
bins = np.linspace(df['high'].min(), df['high'].max(), 5)
print(bins)

group_names = ['hlow_bin', 'hmedium_Low_bin', 'hmedium_High_bin' , 'hhigh_bin']

df['high_binned'] = pd.cut(df['high'], bins, labels=group_names, include_lowest=True)

# value_counts() sorts by frequency, so reindex to match the order of group_names
counts = df["high_binned"].value_counts().reindex(group_names)
pyplot.bar(group_names, counts)

plt.pyplot.xlabel("High")
plt.pyplot.ylabel("Count")
plt.pyplot.title("High bins")
```

</details>


<h3>Resampling</h3>


Time series data can be summarized or aggregated over a new time interval. For example, you can summarize minute data into hours, hours into days, etc.

This process of converting data summarized over one time period into another time period is often called resampling.

Let's resample our data from 1 minute to 5 minutes using the <code>pandas</code> <code>resample()</code> method.

First we create a new dataframe and write the aggregated data into it. 

In [ ]:
resample_df5 = pd.DataFrame()
resample_df5['open'] = df['open'].resample('5min').first()
resample_df5['high'] = df['high'].resample('5min').max()
resample_df5['low'] = df['low'].resample('5min').min()
resample_df5['close'] = df['close'].resample('5min').last()
resample_df5['volume'] = df['volume'].resample('5min').sum()

Let's aggregate data to 10 minutes, using shorter syntax.

In [ ]:
resample_df10 = df[["open", "high", "low", "close", "volume"]].resample("10min").agg({
    "open": "first",
    "high": "max",
    "low": "min",
    "close": "last",
    "volume": "sum"
})

In the cell above we specified the new time interval and how each column is aggregated within it.

> <i>For example:</i> for 'volume', all values within each 10-minute interval are summed up; for 'close', we take the last value in each 10-minute interval; for 'high', we take the maximum value within each 10-minute interval.
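
These aggregation rules can be checked by hand on a small made-up series. The sketch below builds ten hypothetical 1-minute bars (the start time and all values are assumptions for illustration) and resamples them into two 5-minute bars.

```python
import pandas as pd

# Ten hypothetical 1-minute bars
index = pd.date_range("2022-01-01 00:00", periods=10, freq="1min")
toy = pd.DataFrame({
    "open":   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "high":   [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    "low":    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "close":  [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5],
    "volume": [10] * 10,
}, index=index)

# One aggregation rule per column, applied within each 5-minute interval
bars_5min = toy.resample("5min").agg({
    "open": "first",
    "high": "max",
    "low": "min",
    "close": "last",
    "volume": "sum",
})
print(bars_5min)
```

Each 5-minute bar takes its open from the first minute, its close from the last minute, the extreme high and low over the interval, and the total volume.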

In [ ]:
resample_df5

As you can see, we successfully resampled our 1-minute data. Let's save the new datasets to CSV files.

In [ ]:
resample_df5.to_csv('BTC BUSD 5min.csv')
resample_df10.to_csv('BTC BUSD 10min.csv')

### Thank you for completing this lab!

## Author

<a href="https://www.linkedin.com/in/joseph-s-50398b136/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01" target="_blank">Joseph Santarcangelo</a>

### Other Contributors

<a href="https://www.linkedin.com/in/mahdi-noorian-58219234/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01" target="_blank">Mahdi Noorian PhD</a>

Bahare Talayian

Eric Xiao

Steven Dong

Parizad

Hima Vasudevan

<a href="https://www.linkedin.com/in/fiorellawever/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01" target="_blank">Fiorella Wever</a>

<a href="https://www.linkedin.com/in/yi-leng-yao-84451275/" target="_blank">Yi Yao</a>.


<hr>

<h3 align="center"> © IBM Corporation 2020. All rights reserved. </h3>
