# Ubiquant: Linear Regression with CV

This notebook presents a simple baseline: linear regression evaluated with cross-validation (CV). Because new assets will appear in the future test set, cross-validating across the existing assets is a reasonable way to estimate model performance and guard against overfitting.

In [None]:
import numpy as np 
import pandas as pd 
import gc, joblib

from sklearn.linear_model import LinearRegression as lr
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error
from scipy.stats import pearsonr as p

import warnings
warnings.simplefilter('ignore')

## Data

A lighter version of the data is used (see https://www.kaggle.com/c/ubiquant-market-prediction/discussion/302223#1658970).

In [None]:
train = pd.read_pickle('../input/ump195gb/train.pkl')
train.head(3)

The number of assets is almost constant up to a time_id of about 400, then increases roughly linearly with time_id. Therefore, we should expect more assets in the portfolio in the future.

In [None]:
df = train.groupby('time_id')['investment_id'].count()
df.plot(color='green', ylabel='Number of assets', figsize=(10,7), linewidth=2)
gc.collect()
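As a small sketch of what the cell above measures, the toy frame below (made-up values, not the competition data) compares `count()` with `nunique()`. The row count per time_id equals the number of assets only under the assumption that each investment_id appears at most once per time_id.

```python
import pandas as pd

# Toy stand-in for the training data (values are invented for illustration).
toy = pd.DataFrame({
    "time_id":       [0, 0, 1, 1, 1, 2],
    "investment_id": [10, 11, 10, 11, 12, 10],
})

# count() tallies rows per time_id; nunique() tallies distinct assets.
rows_per_time = toy.groupby("time_id")["investment_id"].count()
assets_per_time = toy.groupby("time_id")["investment_id"].nunique()

# The two agree only when no asset is repeated within a time_id.
same = bool((rows_per_time == assets_per_time).all())
print(rows_per_time.to_dict(), same)
```

If that assumption were in doubt for a given dataset, `nunique()` would be the safer choice for the plot.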

In this study, we consider only the linear-trend regime (time_id > 400), which should be more representative of future data. We also remove the time_ids where the number of assets drops suddenly.

In [None]:
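# Drop time_ids where the asset count falls by more than 200 versus the previous time_id
df_sub = df.drop(df[df.diff() < -200].index)
# Filter on the index so the cutoff is a time_id label rather than a row position
df_sub = df_sub[df_sub.index > 400]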
df_sub.plot(color='blue', ylabel='Number of assets', figsize=(10,7), linewidth=2)

del df
gc.collect()
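The filtering step can be illustrated on a tiny, hypothetical series of asset counts. The thresholds (`-50` and a cutoff of `1`) are illustrative stand-ins for the notebook's `-200` and `400`, and the time_id labels are deliberately non-contiguous to show why filtering by label differs from slicing by position.

```python
import pandas as pd

# Hypothetical asset counts indexed by non-contiguous time_id labels.
counts = pd.Series(
    [100, 102, 101, 20, 103, 105, 108],
    index=pd.Index([0, 1, 2, 5, 6, 7, 9], name="time_id"),
)

# diff() compares each count with the previous one; a large negative
# value flags a time_id where the number of assets suddenly collapsed.
drop_ids = counts.index[counts.diff() < -50]

# Remove the flagged time_ids, then keep labels above a cutoff.
# Filtering on the index makes the cutoff a time_id, not a row position.
cutoff = 1  # illustrative stand-in for the notebook's 400
kept = counts.drop(drop_ids)
kept = kept[kept.index > cutoff]
print(list(drop_ids), list(kept.index))
```

Note that `diff()` flags only the time_id where the drop occurs; the recovery at the following time_id shows up as a large positive difference and is kept.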

We restrict the training set to the retained time_ids.

In [None]:
train = train[train.time_id.isin(df_sub.index)].reset_index(drop=True)
del df_sub
gc.collect()
print(f'Number of rows in the reduced dataset: {len(train)}')

## Model

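time_id is not used as a feature. Note that investment_id stays in X: the split below needs it, so it also enters the regression as a numeric feature.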

In [None]:
X = train.drop(['row_id', 'time_id', 'target'], axis=1)
y = train['target']

del train
gc.collect()

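We use StratifiedKFold with investment_id as the stratification label, so each fold holds a similar share of rows from every asset. Note that under this scheme an asset with enough rows appears in both the training and validation subsets, so the scores below measure generalization to unseen rows of known assets rather than to entirely new assets.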

In [None]:
skf = StratifiedKFold(n_splits=5)
models = []
scores = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, X.investment_id)):
    print('_'*50)
    print(f'Fold: {fold}')
    
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
    
    model = lr()
    model.fit(X_train,y_train)
    
    # Save model for inference or future ensemble models
    models.append(model)
    joblib.dump(model, f'fold_{fold}.pkl')
    
    y_pred = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))  # scikit-learn convention: (y_true, y_pred)
    corr = p(y_pred, y_val)[0]
    scores.append(corr)
    print(f'RMSE: {rmse},\t Pearson correlation score: {corr}')
    
    del X_train, y_train, X_val, y_val, model, y_pred, rmse, corr
    gc.collect()
    
print(f'Average Pearson correlation score: {np.mean(scores, axis=0)}')
del X, y
gc.collect()
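To make the effect of the splitting strategy concrete, the sketch below counts how many assets are shared between the training and validation sides of each fold on toy data. `GroupKFold` is shown only as a point of comparison and is not the method used in this notebook; `asset_ids`, `X_toy`, and `shared_assets` are hypothetical names introduced for the illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

# Toy data: 6 hypothetical assets with 4 rows each (feature values are irrelevant here).
asset_ids = np.repeat(np.arange(6), 4)
X_toy = np.zeros((len(asset_ids), 1))

def shared_assets(splitter, **split_kwargs):
    """Count assets present in both train and validation for each fold."""
    shared = []
    for train_idx, val_idx in splitter.split(X_toy, **split_kwargs):
        common = set(asset_ids[train_idx]) & set(asset_ids[val_idx])
        shared.append(len(common))
    return shared

# Stratifying on the asset id spreads every asset across all folds.
strat_shared = shared_assets(StratifiedKFold(n_splits=2), y=asset_ids)
# Grouping on the asset id keeps each asset entirely on one side of the split.
group_shared = shared_assets(GroupKFold(n_splits=2), groups=asset_ids)
print(strat_shared, group_shared)
```

With stratification, all six assets appear on both sides of every fold; with grouping, none do. Which behaviour is preferable depends on whether the goal is to score unseen rows or unseen assets.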

## Submission

In [None]:
import ubiquant
env = ubiquant.make_env()   # initialize the environment
iter_test = env.iter_test()    # an iterator which loops over the test set and sample submission
for (test_df, sample_prediction_df) in iter_test:
    X_test = test_df.drop(['row_id'], axis=1)

    # Predict with every fold model and average the predictions across models
    sample_prediction_df['target'] = np.mean([model.predict(X_test) for model in models], axis=0) 
    
    env.predict(sample_prediction_df)   # register your predictions
    
    display(sample_prediction_df) # display the predicted results
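The averaging step in the loop above can be sketched on synthetic data. Everything here (`X_toy`, `y_toy`, `fold_models`, the three-way row split) is a made-up stand-in for the fold models saved earlier, not the notebook's actual data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# Fit one model per hypothetical fold, each on a different subset of rows.
fold_models = []
for k in range(3):
    train_mask = np.arange(60) % 3 != k  # leave out every third row
    fold_models.append(LinearRegression().fit(X_toy[train_mask], y_toy[train_mask]))

X_new = rng.normal(size=(5, 3))
# Stack per-model predictions into shape (n_models, n_rows) ...
stacked = np.array([m.predict(X_new) for m in fold_models])
# ... and average over the model axis to get one prediction per row.
blended = stacked.mean(axis=0)
print(stacked.shape, blended.shape)
```

Because the models are linear, averaging their predictions is equivalent to predicting once with the averaged coefficients and intercepts, which could save time at inference if that ever mattered.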

## End. Please upvote if you like or copy this notebook.