# Predicting the CLOSE value from the LSTM predictions

This notebook reproduces the steps for a REGRESSION on the predictions dataset.
The main objective is to predict the variable actual.

Model: Ridge (L2) regression, trained on 2018-11-04 10:47:55.

Let's start with importing the required libs, and tune pandas display options:

In [1]:
import numpy as np
import pandas as pd
import sklearn as sk
from collections import defaultdict, Counter

In [2]:
pd.set_option('display.width', 3000)
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

#### Importing base data
The first step is to get our machine learning dataset:

In [3]:
ml_dataset = pd.read_csv('/Users/renero/Documents/SideProjects/SailBoatsFactory/data/predictions.csv')
print('Base data has %i rows and %i columns' % (ml_dataset.shape[0], ml_dataset.shape[1]))
# First five records
ml_dataset.head(5)

Base data has 500 rows and 14 columns


Unnamed: 0,1yw3,1yw7,1yw10,5yw3,5yw7,5yw10,10yw3,10yw7,actual,avg,avg_diff,median,med_diff,winner
0,0.669609,0.670776,0.671797,0.669609,0.665964,0.670193,0.66611,0.668735,0.680276,0.669099,0.011177,0.669609,0.010667,1yw10
1,0.7224,0.72633,0.72633,0.72633,0.721685,0.727402,0.726687,0.723114,0.722522,0.725035,0.002513,0.72633,0.003808,1yw3
2,0.805926,0.805926,0.800715,0.805318,0.806621,0.803668,0.8061,0.805058,0.80534,0.804916,0.000423,0.805318,2.2e-05,5yw3
3,0.368011,0.370196,0.369034,0.368708,0.370707,0.367686,0.369963,0.369638,0.366431,0.369243,0.002812,0.369243,0.002812,5yw10
4,0.800756,0.805087,0.805706,0.806943,0.806325,0.798282,0.805706,0.806943,0.788609,0.804469,0.01586,0.805706,0.017097,5yw10


#### Initial data management
The preprocessing aims at making the dataset compatible with modeling. At the end of this step, we will have a matrix of float numbers, with no missing values. We'll use the features and the preprocessing steps defined in Models.

Let's only keep selected features

In [4]:
ml_dataset = ml_dataset[[u'actual', u'10yw7', u'1yw7', u'1yw3', u'1yw10', u'median', u'5yw10', u'10yw3', u'5yw3', u'avg', u'5yw7']]
# First five records
ml_dataset.head(5)

Unnamed: 0,actual,10yw7,1yw7,1yw3,1yw10,median,5yw10,10yw3,5yw3,avg,5yw7
0,0.680276,0.668735,0.670776,0.669609,0.671797,0.669609,0.670193,0.66611,0.669609,0.669099,0.665964
1,0.722522,0.723114,0.72633,0.7224,0.72633,0.72633,0.727402,0.726687,0.72633,0.725035,0.721685
2,0.80534,0.805058,0.805926,0.805926,0.800715,0.805318,0.803668,0.8061,0.805318,0.804916,0.806621
3,0.366431,0.369638,0.370196,0.368011,0.369034,0.369243,0.367686,0.369963,0.368708,0.369243,0.370707
4,0.788609,0.806943,0.805087,0.800756,0.805706,0.805706,0.798282,0.805706,0.806943,0.804469,0.806325


Let's first coerce categorical columns into unicode, numerical features into floats.

In [5]:
# astype('unicode') does not work as expected
def coerce_to_unicode(x):
    # Python 3: str is already unicode, so only raw bytes need decoding
    if isinstance(x, bytes):
        return x.decode('utf-8')
    else:
        return str(x)

categorical_features = []
numerical_features = [u'10yw7', u'1yw7', u'1yw3', u'1yw10', u'median', u'5yw10', u'10yw3', u'5yw3', u'avg', u'5yw7']
text_features = []

for feature in categorical_features:
    ml_dataset[feature] = ml_dataset[feature].apply(coerce_to_unicode)
for feature in text_features:
    ml_dataset[feature] = ml_dataset[feature].apply(coerce_to_unicode)
for feature in numerical_features:
    if ml_dataset[feature].dtype != np.dtype('M8[ns]'):
        ml_dataset[feature] = ml_dataset[feature].astype('double')

We rename the target variable to a column named __target__:

In [6]:
ml_dataset['__target__'] = ml_dataset['actual']
del ml_dataset['actual']

# Remove rows for which the target is unknown.
ml_dataset = ml_dataset[~ml_dataset['__target__'].isnull()]
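
The same two steps can also be written as one pandas chain. This is only a sketch on a hypothetical three-row frame (`toy` and its values are made up for illustration); note that `rename` keeps the column in its original position, whereas the cell above moves it to the end.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ml_dataset, with one unknown target value
toy = pd.DataFrame({"actual": [0.5, np.nan, 0.7], "avg": [0.4, 0.6, 0.8]})

# Rename the target column, then drop the rows where the target is missing
toy = toy.rename(columns={"actual": "__target__"}).dropna(subset=["__target__"])

print(toy)
```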

#### Cross-validation strategy
The dataset needs to be split into 2 new sets, one that will be used for training the model (train set) and another that will be used to test its generalization capability (test set).

This is a simple hold-out strategy: with shuffle=False, the rows keep their original order and the last 20% of them form the test set.

In [10]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(ml_dataset, test_size=0.2, shuffle=False)
print('Train data has %i rows and %i columns' % (train.shape[0], train.shape[1]))
print('Test data has %i rows and %i columns' % (test.shape[0], test.shape[1]))

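
To see what `shuffle=False` does, here is a minimal sketch on a hypothetical 10-row frame (`toy`, `toy_train` and `toy_test` are illustrative names, not part of the notebook). The first 80% of rows go to the train set and the last 20% to the test set, in their original order.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame standing in for ml_dataset
toy = pd.DataFrame({"x": range(10), "__target__": range(10)})

# Same call as in the notebook: 20% test, no shuffling
toy_train, toy_test = train_test_split(toy, test_size=0.2, shuffle=False)

print("train rows:", list(toy_train.index))
print("test rows: ", list(toy_test.index))
```

Keeping the order matters when the rows are time-ordered, because the model is then tested only on rows that come after the ones it was trained on.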

#### Features preprocessing
The first thing to do at the features level is to handle the missing values. Let's reuse the settings defined in the model

In [None]:
drop_rows_when_missing = []
impute_when_missing = [{'impute_with': u'MEAN', 'feature': u'10yw7'}, {'impute_with': u'MEAN', 'feature': u'1yw7'}, {'impute_with': u'MEAN', 'feature': u'1yw3'}, {'impute_with': u'MEAN', 'feature': u'1yw10'}, {'impute_with': u'MEAN', 'feature': u'median'}, {'impute_with': u'MEAN', 'feature': u'5yw10'}, {'impute_with': u'MEAN', 'feature': u'10yw3'}, {'impute_with': u'MEAN', 'feature': u'5yw3'}, {'impute_with': u'MEAN', 'feature': u'avg'}, {'impute_with': u'MEAN', 'feature': u'5yw7'}]

# Features for which we drop rows with missing values
for feature in drop_rows_when_missing:
    train = train[train[feature].notnull()]
    test = test[test[feature].notnull()]
    print('Dropped missing records in %s' % feature)

# Features for which we impute missing values
for feature in impute_when_missing:
    if feature['impute_with'] == 'MEAN':
        v = train[feature['feature']].mean()
    elif feature['impute_with'] == 'MEDIAN':
        v = train[feature['feature']].median()
    elif feature['impute_with'] == 'CREATE_CATEGORY':
        v = 'NULL_CATEGORY'
    elif feature['impute_with'] == 'MODE':
        v = train[feature['feature']].value_counts().index[0]
    elif feature['impute_with'] == 'CONSTANT':
        v = feature['value']
    train[feature['feature']] = train[feature['feature']].fillna(v)
    test[feature['feature']] = test[feature['feature']].fillna(v)
    print('Imputed missing values in feature %s with value %s' % (feature['feature'], v))
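
For the MEAN case, scikit-learn's `SimpleImputer` does the same job. The sketch below uses hypothetical toy frames (`toy_train`, `toy_test`) and is not the code generated for this model; the point it illustrates is that the means are learned on the train set only and then reused on the test set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing values
toy_train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"a": [np.nan, 5.0], "b": [np.nan, 40.0]})

# Learn the column means from the train set only
imputer = SimpleImputer(strategy="mean")
imputer.fit(toy_train)

# Apply the train means to both sets, keeping the DataFrame layout
train_filled = pd.DataFrame(imputer.transform(toy_train),
                            columns=toy_train.columns, index=toy_train.index)
test_filled = pd.DataFrame(imputer.transform(toy_test),
                           columns=toy_test.columns, index=toy_test.index)

print(test_filled)
```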

The list of categorical features is empty for this model, so there is nothing to encode.

Let's rescale the numerical features, using the mean and standard deviation computed on the train set (AVGSTD):

In [None]:
rescale_features = {u'10yw3': u'AVGSTD', u'10yw7': u'AVGSTD', u'1yw7': u'AVGSTD', u'1yw3': u'AVGSTD', u'1yw10': u'AVGSTD', u'median': u'AVGSTD', u'5yw10': u'AVGSTD', u'5yw3': u'AVGSTD', u'avg': u'AVGSTD', u'5yw7': u'AVGSTD'}
for (feature_name, rescale_method) in rescale_features.items():
    if rescale_method == 'MINMAX':
        _min = train[feature_name].min()
        _max = train[feature_name].max()
        scale = _max - _min
        shift = _min
    else:
        shift = train[feature_name].mean()
        scale = train[feature_name].std()
    if scale == 0.:
        del train[feature_name]
        del test[feature_name]
        print('Feature %s was dropped because it has no variance' % feature_name)
    else:
        print('Rescaled %s' % feature_name)
        train[feature_name] = (train[feature_name] - shift).astype(np.float64) / scale
        test[feature_name] = (test[feature_name] - shift).astype(np.float64) / scale
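
The AVGSTD branch is a plain standardization. The sketch below compares it with scikit-learn's `StandardScaler` on hypothetical toy data; the two differ by a constant factor, because pandas `.std()` uses the sample standard deviation (`ddof=1`) while `StandardScaler` uses the population one (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature
toy_train = pd.DataFrame({"f": [1.0, 2.0, 3.0, 4.0]})
toy_test = pd.DataFrame({"f": [2.5, 5.0]})

# Manual version, as in the cell above (pandas std, ddof=1)
shift = toy_train["f"].mean()
scale = toy_train["f"].std()
manual = ((toy_test["f"] - shift) / scale).to_numpy()

# StandardScaler version (population std, ddof=0), fitted on train only
scaler = StandardScaler().fit(toy_train)
sk_scaled = scaler.transform(toy_test)[:, 0]

# The two results differ only by the ratio of the two standard deviations
ratio = toy_train["f"].std(ddof=0) / toy_train["f"].std(ddof=1)
print(manual, sk_scaled * ratio)
```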

#### Modeling

Before actually creating our model, we need to split the datasets into their features and labels parts:

In [None]:
train_X = train.drop('__target__', axis=1)
test_X = test.drop('__target__', axis=1)

train_Y = np.array(train['__target__'])
test_Y = np.array(test['__target__'])

Now we can finally create our model:

In [None]:
from sklearn.linear_model import RidgeCV
# normalize=True was removed in scikit-learn 1.2; features are already standardized above
clf = RidgeCV(fit_intercept=True)
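
`RidgeCV` picks the regularization strength by cross-validation over a grid of candidate `alphas`. As a sketch only, on synthetic data (the grid, the data and the name `model` are assumptions, not the settings used for this model), this is how to pass an explicit grid and read back the selected value:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hypothetical grid of candidate regularization strengths
alphas = np.logspace(-3, 3, 13)
model = RidgeCV(alphas=alphas, fit_intercept=True).fit(X, y)

# alpha_ is the strength selected by cross-validation
print("selected alpha:", model.alpha_)
print("coefficients:", model.coef_)
```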

... And train it

In [None]:
%time clf.fit(train_X, train_Y)

Build up our result dataset

In [None]:
%time _predictions = clf.predict(test_X)
predictions = pd.Series(data=_predictions, index=test_X.index, name='predicted_value')

# Build scored dataset
results_test = test_X.join(predictions, how='left')
results_test = results_test.join(test['__target__'], how='left')
results_test = results_test.rename(columns= {'__target__': 'actual'})

#### Results

You can measure how closely the predictions follow the actual values with the Pearson correlation:

In [None]:
c = results_test[['predicted_value', 'actual']].corr()
print('Pearson correlation: %s' % c.loc['predicted_value', 'actual'])

I measure the score of the model on the test set, as indicated in the Ridge SKLearn manual. For a regressor, score returns the coefficient of determination R².

In [None]:
score = clf.score(test_X, test_Y)
print("Test score: {0:.2f} %".format(100 * score))
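
The two metrics above are not interchangeable. The sketch below, on made-up values (`y_true`, `y_pred` are hypothetical), computes R² by hand and shows that a constant offset in the predictions leaves the Pearson correlation unchanged but lowers R².

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Pearson correlation between actual and predicted values
pearson = np.corrcoef(y_true, y_pred)[0, 1]

# Add a constant bias to the predictions and recompute both metrics
y_shifted = y_pred + 0.5
pearson_shifted = np.corrcoef(y_true, y_shifted)[0, 1]
r2_shifted = r2_score(y_true, y_shifted)

print(r2_manual, pearson, r2_shifted, pearson_shifted)
```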

I dump the model to a pickle file, so that I can use it from the main code.

In [None]:
import pickle
pkl_filename = "/Users/renero/Documents/SideProjects/SailBoatsFactory/networks/ridge_l2_model.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(clf, file)

I check that the reloaded model still works:

In [None]:
pkl_filename = "/Users/renero/Documents/SideProjects/SailBoatsFactory/networks/ridge_l2_model.pkl"
with open(pkl_filename, 'rb') as file:
    pickle_model = pickle.load(file)
new_score = pickle_model.score(test_X, test_Y)
print("Test score: {0:.2f} %".format(100 * new_score))
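
A stricter check than comparing scores is to compare the predictions themselves. Here is a minimal in-memory sketch, on synthetic data and without touching the file system (`model` and `restored` are illustrative names):

```python
import pickle

import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=50)

model = RidgeCV().fit(X, y)

# Serialize to bytes and load back, as pickle.dump / pickle.load do with a file
restored = pickle.loads(pickle.dumps(model))

# The restored model should give the same predictions as the original
same = np.allclose(model.predict(X), restored.predict(X))
print("Predictions identical after round trip:", same)
```

Keep in mind that a pickled model is only guaranteed to load correctly with the scikit-learn version that produced it.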