In [12]:
import pandas as pd
import numpy as np

<h2>Features</h2>

In the previous notebook, we created the variables to define the characteristic of a certain coin. In this notebook, we will take a step further in processing these variables to create the following features for training:
1. Mean
2. Median
3. Standard Deviation
4. Last Value
5. Overall trend (Difference between first and last value)

These features will be calculated for each variable we created previously and will be built based on a trailing window of width $w$ for a certain coin. $w$ is measured in days. To illustrate this, take a look at an example below.

In [13]:
# Get processed data
df = pd.read_csv("./data/processed_data.csv")

In [14]:
btc_df = df[df['sym'] == 'BTC'].iloc[-11:, :]

In [15]:
btc_df

Unnamed: 0,market_cap,name,price,sym,time,volume,rank,market_share,age,roi
95782,116037000000.0,Bitcoin,7083.8,BTC,2018-04-02,4333440000.0,1.0,0.458644,1800,-0.033819
95783,120415000000.0,Bitcoin,7456.11,BTC,2018-04-03,5499700000.0,1.0,0.458281,1801,-0.049934
95784,126434000000.0,Bitcoin,6853.84,BTC,2018-04-04,4936000000.0,1.0,0.45113,1802,0.087873
95785,116142000000.0,Bitcoin,6811.47,BTC,2018-04-05,5639320000.0,1.0,0.454319,1803,0.00622
95786,115601000000.0,Bitcoin,6636.32,BTC,2018-04-06,3766810000.0,1.0,0.451979,1804,0.026393
95787,112467000000.0,Bitcoin,6911.09,BTC,2018-04-07,3976610000.0,1.0,0.45376,1805,-0.039758
95788,117392000000.0,Bitcoin,7023.52,BTC,2018-04-08,3652500000.0,1.0,0.454923,1806,-0.016008
95789,119516000000.0,Bitcoin,6770.73,BTC,2018-04-09,4894060000.0,1.0,0.451838,1807,0.037336
95790,115306000000.0,Bitcoin,6834.76,BTC,2018-04-10,4272750000.0,1.0,0.446148,1808,-0.009368
95791,116126000000.0,Bitcoin,6968.32,BTC,2018-04-11,4641890000.0,1.0,0.441747,1809,-0.019167


The data above shows the last 11 data points in the BTC timeline. Assume that we are trying to predict for its price on 2018-04-12 (which we can annotate as $t$) and that we have decided to $w$ be 10. This means, to calculate the features to predict for this time step $t$, we will use the previous 10 data points preceding $t$. In this case, the average price will be the average of prices between 2018-04-02 ($t-10$) and 2018-04-11 ($t-1$), inclusive. 

In [16]:
mean_price = btc_df['price'][1:10].mean()
print(mean_price)

6918.462222222222


Therefore, the general equation to calculate mean for a given $w$ is as follows:

$$mean = \frac{1}{w}{\textstyle\sum}_{n=1}^{w}{V_{t-n}}$$
<div style="text-align: center">Where $V$ can be any coin related variables we defined previously such as market capitalization, volume, rank, etc. </div></br>


Similarly median, standard deviation, last value and overall trend are calculated in a similar manner in terms of the data points being used even though the equation to process them are different.

<h2>Feature Engineering</h2>

This part is meant to develop the function to calculate the features for a given $w$.

In [17]:
w = 10

In [18]:
# Function to rename the column for the features created
def rename_columns(df, suffix):
    if 'index' in df.columns:
        df.drop(columns='index', inplace=True)
    else:
        pass
    col_names = [col + '_' + suffix for col in df.columns]
    df.columns = col_names

In [29]:
def create_features(df, w):
    # Calculate mean
    mean_df = df.groupby('sym').shift(1).rolling(w).mean().reset_index()
    rename_columns(mean_df, '1_mean') # Numbers are assigned to the suffix for column ordering

    # Calculate median
    median_df = df.groupby('sym').shift(1).rolling(w).median().reset_index()
    rename_columns(median_df, '2_median')
    
    # Calculate standard deviation
    stdev_df = df.groupby('sym').shift(1).rolling(w).std().reset_index()
    rename_columns(stdev_df, '3_stdev')
    
    # Identify last value
    last_df = df.groupby('sym').shift(1)
    rename_columns(last_df, '4_last')
    
    # Identify first value and calculate difference between first and last value
    first_df = df.groupby('sym').shift(w)
    delta_df = pd.DataFrame()
    for col in first_df.columns:
        col_last = col + '_4_last'
        try:
            delta_df[col] = last_df[col_last] - first_df[col]
        except:
            pass
    rename_columns(delta_df, '5_delta')
    
    # Create base table to match created features with its respective time and coin
    base_df = df[['time', 'sym', 'price']]
    
    # Set current price as the target price to predict based on the features
    base_df = base_df.rename(columns={'price': 'target_price'})
    
    # Combine all features into a single table
    features_df = pd.concat([base_df, mean_df, median_df, stdev_df, last_df, delta_df], axis=1)
    
    # Clean up table by removing unnecessary columns and rearrange for ease of reference
    feature_cols = [col for col in features_df.columns]

    feature_cols.remove('name_4_last')
    feature_cols.remove('time_4_last')
    feature_cols.remove('time')
    feature_cols.remove('sym')
    feature_cols.sort()

    features_df = features_df[['time', 'sym'] + feature_cols + ['target_price']]
    
    # Remove points where there are insufficient data points to calculate features
    # For example, any time step that does not have w data points before it will have NA features
    features_df.dropna(inplace=True)
    
    return features_df

In [30]:
features_df = create_features(df, w)

In [31]:
features_df.head()

Unnamed: 0,time,sym,age_1_mean,age_2_median,age_3_stdev,age_4_last,age_5_delta,market_cap_1_mean,market_cap_2_median,market_cap_3_stdev,...,roi_3_stdev,roi_4_last,roi_5_delta,target_price,volume_1_mean,volume_2_median,volume_3_stdev,volume_4_last,volume_5_delta,target_price.1
10,2015-12-21,$$$,4.5,4.5,3.02765,9.0,9.0,1282.1,1062.5,462.989909,...,0.336553,0.045455,0.045455,2.2e-05,2.4,2.0,1.646545,3.0,2.0,2.2e-05
11,2015-12-22,$$$,5.5,5.5,3.02765,10.0,9.0,1155.0,1053.5,303.439176,...,0.336553,0.0,0.372093,2.2e-05,2.4,2.0,1.646545,1.0,1.0,2.2e-05
12,2015-12-23,$$$,6.5,6.5,3.02765,11.0,9.0,1130.9,1044.5,305.062999,...,0.302512,0.0,-0.954545,2.7e-05,2.5,2.0,1.509231,1.0,-1.0,2.7e-05
13,2015-12-24,$$$,7.5,7.5,3.02765,12.0,9.0,1031.5,1030.5,27.857774,...,0.062266,-0.185185,-0.185185,2.3e-05,2.6,2.5,1.505545,3.0,0.0,2.3e-05
14,2015-12-25,$$$,8.5,8.5,3.02765,13.0,9.0,1053.9,1044.5,62.665691,...,0.087233,0.173913,0.217391,2.3e-05,2.5,2.0,1.509231,2.0,0.0,2.3e-05


<h2>Feature Evaluation</h2>

In this section, we will take a look at the relationship of the features to the target label or the target price to see if we can spot certain trends. Additionally, we will also look at the correlation between these features to avoid any highly correlated features which might overcomplicate the model.