# Daily Stopwatch Data Science: Two Sigma Financial Modeling Challenge

Note: this is a part of a Kaggle competition: Two Sigma Financial Modeling Challenge. For more information, see https://www.kaggle.com/c/two-sigma-financial-modeling

## Stage 1: Ask a question

My objective is to the predict the value of unknown variable $y$, which is a time series.

I measure the performance by R-squared. It seems that the actual competition has online version of it for reinforcement learning. I will get back to it later.

## Stage 2: Set the environment up and get data

First, set up a directory for data and link it to this workplace. Download data into your choice of directory.

In [1]:
#Set up the environment
import time
import numpy as np                         #Numpy
import pandas as pd                        #Pandas
import matplotlib.pyplot as plt            #Plot

In [2]:

# Set up data directory
DataDir = "C:/Users/Admin/Documents/data/"

# Here's an example of loading the CSV using Pandas's built-in HDF5 support:
with pd.HDFStore(DataDir+ "train.h5", "r") as train:
    # Note that the "train" dataframe is the only dataframe in the file
    df = train.get("train")

In [3]:
%matplotlib inline                         
#Make plot appeared inline

## Stage 3: Explore the data

Explore, Visualize, Clean, Transform, Feature engineering

In [4]:
#Basis stats
len(df), len(df.columns)   #number of rows and columns

(1710756, 111)

In [5]:
#See all column names and types
#print(df.dtypes.to_string())

In [6]:
#See first ten rows
#df.head(10)

In [7]:
#See last ten rows
#df.tail(10)

Here one can see something peculiar about this data. There are ids and timestamps. Let's check the number of unique elements

In [8]:
len(df["timestamp"].unique()), len(df["id"].unique())

(1813, 1424)

Let's create a summary table for those columns including type, min, mean, max, sigma, percent of zeros, and percent of missing values

In [9]:
df_column_summary = pd.DataFrame(df.dtypes,columns=['type'])
df_column_summary.reset_index(inplace=True)
df_column_summary['min'] = list(df.min())
df_column_summary['mean'] = list(df.mean())
df_column_summary['max'] = list(df.max())
df_column_summary['sigma'] = list(df.std())

l = ['NA'] * len(df.columns)
for i in range(0,len(df.columns)):
    l[i] = 1.0-np.count_nonzero(df.iloc[:,i])*1.0/len(df)
df_column_summary['zero'] = l

df_column_summary['missing'] = list(1.0-df.count()*1.0/len(df))

In [10]:
#print(df_column_summary.to_string())

From Round 2, we found that there is no significant issue with data. So let's opt the individual feature out.

From Round 3, we found that our method is quite slow. This is because the data is too large. Let's do PCA to make data smaller in a more tractable way. We will look at data at one timestamp at a time. 

First normalize data so that each feature has 0-mean and 1-std (except id, timestamp, and input).

In [11]:
for i in range(2,110):
    mean = df_column_summary.loc[i,'mean']
    sigma = df_column_summary.loc[i,'sigma']
    df[df.columns[i]] = (df[df.columns[i]]-mean)/sigma

Next let's look at output value y from different ids. Convert it into a matrix by pivoting

In [12]:
dy = df[['id','timestamp','y']].pivot(index='timestamp', columns='id', values='y')
dy = dy.fillna(value=0)

In [13]:
len(dy),len(dy.columns)

(1813, 1424)

In [14]:
#Let's try PCA
from sklearn.decomposition import PCA

In [15]:
X = np.array(dy)
pca_derived = PCA(n_components=500)
pca_derived.fit(X)
l = pca_derived.explained_variance_ratio_ 
sum(l)

0.89116884041384048

It seems that there is very small correlation between values among different ids. We should look at individual ids then. Lengthy but hopefully doable. 

In [16]:
#Revisit PCA 
#try group them according to feature types: derived
X = np.array(df[df.columns[range(2,7)]].dropna(axis=0))
pca_derived = PCA(n_components=4)
pca_derived.fit(X)

#try group them according to feature types: fundamental
X = np.array(df[df.columns[range(7,70)]].dropna(axis=0))
pca_fundamental = PCA(n_components=16)
pca_fundamental.fit(X)

#try group them according to feature types: technical
X = np.array(df[df.columns[range(70,110)]].dropna(axis=0))
pca_technical = PCA(n_components=32)
pca_technical.fit(X)
print(pca_technical.explained_variance_ratio_) 

[ 0.144173    0.08449022  0.05374959  0.04201616  0.03940461  0.03401835
  0.02991891  0.02958015  0.02731697  0.02642108  0.02606273  0.02566239
  0.02549551  0.0254263   0.02527166  0.02518373  0.02511065  0.02498515
  0.02477049  0.02456923  0.0228427   0.02177022  0.02136387  0.01724815
  0.01452915  0.01372289  0.01339999  0.01210162  0.0116663   0.0110958
  0.01007953  0.00972421]


In [17]:
#CLEAN
#replace NaN values with zero (mean).
df = df.fillna(value=0)

In [18]:
#TRANFORM/ NEW VARIABLES

#Apply PCA to different group of features and reattached everything
d1 = df[['id','timestamp']]
#derived
d2 = pd.DataFrame(pca_derived.transform(df[df.columns[range(2,7)]]))
d2 = d2.loc[:,0:3]
d2.columns = ['derived_pca_1','derived_pca_2','derived_pca_3','derived_pca_4']

d3 = pd.DataFrame(pca_fundamental.transform(df[df.columns[range(7,70)]]))
d3 = d3.loc[:,0:15]
d3.columns = ['fundamental_pca_' + str(i) for i in range(0,16)]

d4 = pd.DataFrame(pca_technical.transform(df[df.columns[range(70,110)]]))
d4 = d4.loc[:,0:31]
d4.columns = ['technical_pca_' + str(i) for i in range(0,32)]

d5 = df[['y']]

df = pd.concat([d1,d2,d3,d4,d5],axis=1)

## Stage 4: Model the data

Here I have prepared data for validation using 70% train 30% test. Inside the loop for each id, we use the following models:

1. Ordinary Least Square (scikit learn LinearRegression)
2. Lasso with embedded cross-validataion (sklearn LassoLars)
3. Random Forest (sklearn RandomForestRegressor)
4. Gradient Boost regression (sklearn GradientBoostingRegressor)
5. Neural nets (sklearn MLPRegressor)

In [19]:
from sklearn.linear_model import LassoLarsCV, LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
idList = list(df['id'].unique())

In [20]:
performance_summary = pd.DataFrame(idList,columns=['id'])
#performance_summary.reset_index(inplace=True)
performance_summary['size'] = 0
performance_summary['time 1'] = 0
performance_summary['R2_train 1'] = 0
performance_summary['R2_test 1'] = 0
performance_summary['time 2'] = 0
performance_summary['R2_train 2'] = 0
performance_summary['R2_test 2'] = 0
performance_summary['time 3'] = 0
performance_summary['R2_train 3'] = 0
performance_summary['R2_test 3'] = 0
performance_summary['time 4'] = 0
performance_summary['R2_train 4'] = 0
performance_summary['R2_test 4'] = 0
performance_summary['time 5'] = 0
performance_summary['R2_train 5'] = 0
performance_summary['R2_test 5'] = 0

In [21]:
#for i in range(0,len(idList)):
for i in range(0,10):
    d_small = df[df['id']==idList[10]]
    performance_summary.loc[i,'size'] = len(d_small)

    # Now test/train split by random selection. Ideally, we should do cross-validation and parameter average, but save it for later.
    r = np.random.uniform(0,1,len(d_small)) # Random UNIForm numbers, one per row
    train = d_small[ r < 0.7]
    test = d_small[0.7 <= r]
    #len(train), len(test)
    X_train = train.drop('y', axis=1)
    y_train = train['y']
    X_test = test.drop('y', axis=1)
    y_test = test['y']

    #Let's try GLM: OLS and Lasso with cross-validation 
    t1 = time.time()
    model = LinearRegression().fit(X_train, y_train)
    performance_summary.loc[i,'time 1'] = time.time() - t1
    performance_summary.loc[i,'R2_train 1'] = model.score(X_train,y_train)
    performance_summary.loc[i,'R2_test 1'] = model.score(X_test,y_test)

    t1 = time.time()
    model = LassoLarsCV(cv=10).fit(X_train, y_train)
    performance_summary.loc[i,'time 2'] = time.time() - t1
    performance_summary.loc[i,'R2_train 2'] = model.score(X_train,y_train)
    performance_summary.loc[i,'R2_test 2'] = model.score(X_test,y_test)
    
    t1 = time.time()
    model = RandomForestRegressor().fit(X_train, y_train)
    performance_summary.loc[i,'time 3'] = time.time() - t1
    performance_summary.loc[i,'R2_train 3'] = model.score(X_train,y_train)
    performance_summary.loc[i,'R2_test 3'] = model.score(X_test,y_test)
    
    t1 = time.time()
    model = GradientBoostingRegressor().fit(X_train, y_train)
    performance_summary.loc[i,'time 4'] = time.time() - t1
    performance_summary.loc[i,'R2_train 4'] = model.score(X_train,y_train)
    performance_summary.loc[i,'R2_test 4'] = model.score(X_test,y_test)
    
    t1 = time.time()
    model = MLPRegressor().fit(X_train, y_train)
    performance_summary.loc[i,'time 5'] = time.time() - t1
    performance_summary.loc[i,'R2_train 5'] = model.score(X_train,y_train)
    performance_summary.loc[i,'R2_test 5'] = model.score(X_test,y_test)



In [22]:
performance_summary[:10]

Unnamed: 0,id,size,time 1,R2_train 1,R2_test 1,time 2,R2_train 2,R2_test 2,time 3,R2_train 3,R2_test 3,time 4,R2_train 4,R2_test 4,time 5,R2_train 5,R2_test 5
0,10,1813,0.035,0.061907,-0.028552,0.248,2.225798e-09,-0.01322769,0.915,0.775154,-0.257548,0.672,0.485774,-0.13937,0.487,-82.802518,-67.077787
1,11,1813,0.004,0.076076,-0.052691,0.209,1.348906e-09,-0.001886267,1.056,0.786193,-0.248646,0.666,0.508353,-0.139617,0.521,-92.782877,-86.188995
2,12,1813,0.004,0.049545,0.011457,0.195,0.0008825445,-0.00201292,0.899,0.774275,-0.076948,0.612,0.470823,-0.005521,0.782,-53.961576,-43.300283
3,25,1813,0.004,0.068395,-0.041446,0.222,2.577542e-09,-0.0008243471,0.946,0.786814,-0.269605,0.658,0.504237,-0.167374,0.644,-33.364927,-48.679275
4,26,1813,0.003,0.062425,-0.033362,0.235,-5.425558e-10,-0.006356545,0.881,0.790439,-0.283623,0.569,0.506923,-0.1272,0.497,-41.453494,-45.106524
5,27,1813,0.004,0.074375,-0.074235,0.197,-3.117838e-09,1.309091e-10,1.45,0.772715,-0.445998,0.58,0.502901,-0.232244,0.295,-79.907888,-99.06477
6,31,1813,0.004,0.059268,-0.00436,0.181,5.552157e-10,-0.0009268213,0.954,0.793072,-0.157627,0.675,0.458707,-0.037483,0.286,-90.243919,-81.454823
7,38,1813,0.004,0.058595,-0.008301,0.193,3.057372e-09,-0.006863513,0.974,0.774856,-0.195217,0.677,0.511703,-0.090083,0.694,-62.844934,-71.23347
8,39,1813,0.004,0.061035,-0.020137,0.172,1.989556e-09,-0.00115411,0.808,0.780045,-0.197367,0.574,0.510618,-0.061932,0.052,-842766.96845,-841734.135675
9,40,1813,0.004,0.054648,-0.021377,0.181,2.501688e-09,-0.002296333,0.782,0.778624,-0.30314,0.555,0.505776,-0.187429,0.357,-27.176841,-35.798516


Here are some comments

1. We haven't tuned paramters properly for Method 3,4 and 5. It seems hopeless but might worth trying.
2. One thing that we haven't tried is to use values from previous timestamps as an input. Let's try it next time. 

## Stage 5: Communicate the data

*Note to myself*: There is a stupid mistake at the beginning about concatenate data frames. So the code is fixed. It took some time to run so I spent roughly 4 Pomodoros along with side-taks (Translate Khan Academy videos). The result is poor. Try new things tomorrow.