In [1]:
import pandas as pd
import numpy as np

<h2>Feature Engineering</h2>

In the previous notebook, we created the variables that describe the characteristics of each coin. In this notebook, we go a step further and process these variables into the following features for training:
1. Mean
2. Median
3. Standard Deviation
4. Last Value
5. Overall trend (Difference between first and last value)

These features will be calculated for each variable we created previously and will be built based on a trailing window of width $w$ for a certain coin. $w$ is measured in days. To illustrate this, take a look at an example below.

In [3]:
# Get processed data
df = pd.read_csv("./data/processed_data.csv")

In [8]:
btc_df = df[df['sym'] == 'BTC'].iloc[-11:, :]

In [9]:
btc_df

Unnamed: 0,market_cap,name,price,sym,time,volume,rank,market_share,age,roi
95782,116037000000.0,Bitcoin,7083.8,BTC,2018-04-02,4333440000.0,1.0,0.458644,1800,-0.033819
95783,120415000000.0,Bitcoin,7456.11,BTC,2018-04-03,5499700000.0,1.0,0.458281,1801,-0.049934
95784,126434000000.0,Bitcoin,6853.84,BTC,2018-04-04,4936000000.0,1.0,0.45113,1802,0.087873
95785,116142000000.0,Bitcoin,6811.47,BTC,2018-04-05,5639320000.0,1.0,0.454319,1803,0.00622
95786,115601000000.0,Bitcoin,6636.32,BTC,2018-04-06,3766810000.0,1.0,0.451979,1804,0.026393
95787,112467000000.0,Bitcoin,6911.09,BTC,2018-04-07,3976610000.0,1.0,0.45376,1805,-0.039758
95788,117392000000.0,Bitcoin,7023.52,BTC,2018-04-08,3652500000.0,1.0,0.454923,1806,-0.016008
95789,119516000000.0,Bitcoin,6770.73,BTC,2018-04-09,4894060000.0,1.0,0.451838,1807,0.037336
95790,115306000000.0,Bitcoin,6834.76,BTC,2018-04-10,4272750000.0,1.0,0.446148,1808,-0.009368
95791,116126000000.0,Bitcoin,6968.32,BTC,2018-04-11,4641890000.0,1.0,0.441747,1809,-0.019167


The data above shows the most recent data points in the BTC timeline, ending on 2018-04-11. Assume that we are trying to predict the price on 2018-04-12 (which we denote as $t$) and that we have set $w$ to 10. To calculate the features for this time step $t$, we use the 10 data points preceding $t$. In this case, the average price is the mean of the prices between 2018-04-02 ($t-10$) and 2018-04-11 ($t-1$), inclusive.

In [16]:
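# Average the w = 10 prices from 2018-04-02 (t-10) to 2018-04-11 (t-1), inclusive
window = btc_df[btc_df['time'].between('2018-04-02', '2018-04-11')]
mean_price = window['price'].mean()
print(round(mean_price, 3))

6934.996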


Therefore, the general equation for the mean over a given window width $w$ is:

$$mean = \frac{1}{w}{\textstyle\sum}_{n=1}^{w}{V_{t-n}}$$
<div style="text-align: center">where $V$ can be any of the coin-related variables we defined previously, such as market capitalization, volume, or rank.</div><br>
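
The equation can also be evaluated for every time step at once. The following is a minimal sketch, assuming a series of values already sorted by date; the `trailing_mean` helper and the short sample series are illustrative and not part of the original notebook. The call to `shift(1)` moves each rolling mean down one row, so the feature at $t$ only uses $V_{t-w}$ through $V_{t-1}$ and never the value at $t$ itself.

```python
import pandas as pd

def trailing_mean(values: pd.Series, w: int) -> pd.Series:
    """Mean of the w observations strictly before each row (t-w .. t-1)."""
    # rolling(w).mean() at row i covers rows i-w+1 .. i;
    # shifting by one row makes the result at row t cover rows t-w .. t-1.
    return values.rolling(window=w).mean().shift(1)

# Illustrative sample: the first five BTC prices from the table, with w = 3
prices = pd.Series([7083.8, 7456.11, 6853.84, 6811.47, 6636.32])
feature = trailing_mean(prices, w=3)
print(feature)
```

The first $w$ rows come out as `NaN` because they do not have a full window of history; whether to drop or fill them is a separate modelling decision.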


The median, standard deviation, last value, and overall trend are calculated over the same $w$ data points; only the equation applied to them differs.
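
As a rough sketch of how all five features could be computed for one variable over a single window, consider the helper below. The name `window_features` is hypothetical, the trend is taken as last value minus first value (the sign convention is an assumption), and the standard deviation uses the pandas default of `ddof=1`, which may differ from what the original pipeline uses. The input values are the ten BTC prices from 2018-04-02 to 2018-04-11 shown in the table earlier in this notebook.

```python
import pandas as pd

def window_features(values: pd.Series) -> dict:
    """Summarise one variable over a single trailing window (rows t-w .. t-1)."""
    return {
        "mean": values.mean(),
        "median": values.median(),
        "std": values.std(),  # sample standard deviation (ddof=1), an assumption
        "last": values.iloc[-1],
        # Assumed convention: last value minus first value
        "trend": values.iloc[-1] - values.iloc[0],
    }

# BTC prices for 2018-04-02 .. 2018-04-11, i.e. w = 10
window = pd.Series([7083.8, 7456.11, 6853.84, 6811.47, 6636.32,
                    6911.09, 7023.52, 6770.73, 6834.76, 6968.32])
features = window_features(window)
print(features)
```

The same helper could be applied to any other variable (market capitalization, volume, rank, and so on) by passing that column's window instead of the price.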