# Ubiquant market prediction : EDA, PCA and Linear Regression
This notebook covers:
- an analysis of the dataset of the Ubiquant market prediction Kaggle competition,
- a PCA on a sample of the dataset,
- a Linear Regression used for predictions.

Several ideas, such as the correlations and the PCA, are taken from this kernel https://www.kaggle.com/code/bastiendelaval/analyse-oc.

## Libraries

In [None]:
# Data Manipulation
import numpy as np
import pandas as pd
import random

# Get files content
import os
import joblib

# Data Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

import warnings

warnings.filterwarnings(action="ignore")

# scipy tools
from scipy.stats import pearsonr

# sklearn tools
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    make_scorer,
)
from sklearn.model_selection import (
    learning_curve,
    cross_validate,
    KFold,
    TimeSeriesSplit,
)

## Data importation

We use the Parquet version of the dataset, thanks to this kernel https://www.kaggle.com/code/camilomx/parquet-format-quickstart.

In [None]:
%%time

# Import dataset
df = pd.read_parquet("../input/ubiquant-parquet-low-mem/train_low_mem.parquet")

## First look at the dataset

In [None]:
# Display first rows
df.head(5)

There are 300 features named "f_i", with i ranging from 0 to 299.

The target column is named "target".

The row_id column combines time_id and investment_id.

Each time_id value covers several investment_id values.
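
As a small illustration of this layout, the sketch below builds a toy frame and checks that row_id can be split back into its two parts. The `<time_id>_<investment_id>` ordering of the string and all toy values are assumptions made for the example; they are not verified on the competition files here.

```python
import pandas as pd

# Toy rows; the "<time_id>_<investment_id>" format of row_id is an assumption.
toy = pd.DataFrame(
    {
        "row_id": ["0_1", "0_2", "1_1"],
        "time_id": [0, 0, 1],
        "investment_id": [1, 2, 1],
    }
)

# Split row_id into its two numeric parts
parts = toy["row_id"].str.split("_", expand=True).astype(int)
parts.columns = ["time_part", "investment_part"]

# Check that the parts match the dedicated columns
row_id_matches = bool(
    (parts["time_part"] == toy["time_id"]).all()
    and (parts["investment_part"] == toy["investment_id"]).all()
)

# Number of distinct investments observed at each time step
investments_per_time = toy.groupby("time_id")["investment_id"].nunique()

print(row_id_matches)
print(investments_per_time)
```

The same `groupby(...).nunique()` call could be run on the real dataframe to see how many investments each time_id covers.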

In [None]:
# Dimension
df.shape

In [None]:
# Info about data
df.info()

304 columns and more than 3 million rows.

The row_id column has dtype object.

In [None]:
print("Columns of dtype uint16 : ")
for col in df.select_dtypes("uint16"):
    print(col)

In [None]:
# Data summary
df.describe()

Feature values seem small in magnitude (< 100) and can be negative, as can the target.

As the means are very close to zero, the features f_i appear to have been standardized already.
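
Means near zero are only half of the picture: a standardized feature would also have a standard deviation close to 1. A minimal sketch of such a check on synthetic columns follows; the toy data and the 0.1 tolerance are arbitrary assumptions, and the same summary could be computed on the real feature columns.

```python
import numpy as np
import pandas as pd

# Synthetic columns: two standardized, one deliberately shifted and rescaled
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(10_000, 3)), columns=["f_0", "f_1", "f_2"])
toy["f_2"] = toy["f_2"] * 5 + 2

# One row per feature, with its mean and standard deviation
summary = toy.agg(["mean", "std"]).T

# Flag features whose mean is near 0 and whose std is near 1 (tolerance is arbitrary)
tolerance = 0.1
summary["looks_standardized"] = (summary["mean"].abs() < tolerance) & (
    (summary["std"] - 1).abs() < tolerance
)
print(summary)
```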

In [None]:
# Check if there are missing values
df.isnull().sum().sum()

There is no missing value.

### Reduce memory usage of the dataset
Many kernels use this function to reduce the memory usage of the dataset (to avoid out-of-memory errors). I could not find the original kernel that introduced it.

Be careful, it takes a long time.

EDIT: there seems to be a loss of information, which shows up in particular in the output of describe(). I am not sure it is a good idea.

I wrote a notebook about it https://www.kaggle.com/code/larochemf/ubiquant-low-memory-use-be-careful. It seems that some lines have to be changed. In the end, I did not use this function.
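
Two properties of the function defined in the next cell may explain the loss of information; the sketch below illustrates them on made-up values. First, float16 keeps only about three significant decimal digits. Second, the test `str(col_type)[:3] == "int"` does not match unsigned dtypes such as uint16, so those columns fall through to the float branch.

```python
import numpy as np

# float16 has a 10-bit mantissa, so small differences are rounded away
values = np.array([0.1234567, 1.0009765, 123.456], dtype=np.float32)
as_half = values.astype(np.float16)
abs_error = np.abs(values - as_half.astype(np.float32))

# The dtype test used in reduce_mem_usage does not catch unsigned integers
uint_is_caught = str(np.dtype("uint16"))[:3] == "int"

# Integers above 2048 are no longer all representable in float16
ids = np.array([2049, 3001], dtype=np.uint16)
ids_as_half = ids.astype(np.float16)

print("absolute rounding error:", abs_error)
print("uint16 matched by the int test:", uint_is_caught)
print("ids after a float16 round trip:", ids_as_half.astype(np.int64))
```

If the function were to be reused, one option would be to test the dtype kind (for example with `np.issubdtype(col_type, np.integer)`) and to skip float16 altogether; this is a suggestion, not what the original kernel does.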

In [None]:
%%time

def reduce_mem_usage(df):

    start_mem = df.memory_usage().sum() / 1024 ** 2
    print("Memory usage of dataframe is {:.2f} MB".format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype("category")

    end_mem = df.memory_usage().sum() / 1024 ** 2

    print("Memory usage after optimization is: {:.2f} MB".format(end_mem))
    print("Decreased by {:.1f}%".format(100 * (start_mem - end_mem) / start_mem))

    return df


# Note: reduce_mem_usage modifies df in place, so df_1 and df are the same object
df_1 = reduce_mem_usage(df)

In [None]:
# Data summary
df_1.describe()

## Features analysis

We are going to analyse the features, borrowing some ideas from this kernel https://www.kaggle.com/code/jiahauc/ubiqunt-eda-linearregression
### Investment

In [None]:
investments = df["investment_id"].nunique()
print("Number of unique investment_id : ", investments)

In [None]:
df["investment_id"].value_counts()

It seems that several investments have low frequency. Let's have a look at investment_id = 905.

In [None]:
df.loc[df["investment_id"] == 905]

This investment is only present at the end of the dataset.

In [None]:
# Let's group by investment_id and see distribution
obs_by_investments = df.groupby(["investment_id"])["target"].count()

obs_by_investments.plot(kind="hist", bins=100)
plt.title("Target by investment distribution")
plt.show()

Most investment_id values have a high number of observations.

In [None]:
# Get mean values of the target when grouping by investment_id
mean_targets = df.groupby(["investment_id"])["target"].mean()
mean_targets

In [None]:
# Plot the distribution of these means
mean_targets.plot(kind="hist", bins=100)
plt.title("target mean distribution")
plt.show()

The distribution of the per-investment target means is close to a normal distribution.

In [None]:
ax = sns.jointplot(
    x=obs_by_investments,
    y=mean_targets,
    kind="reg",
    height=10,
    joint_kws={"line_kws": {"color": "red"}},
)
ax.ax_joint.set_xlabel("observations")
ax.ax_joint.set_ylabel("mean of target")
plt.show()

This joint plot of the number of observations per investment against the mean target per investment shows an increasing trend as the number of observations grows. The dispersion of the mean target is also larger when an investment has relatively few observations.

### time_id

In [None]:
timestamps = df["time_id"].nunique()
print("Number of unique time_id : ", timestamps)

In [None]:
df["time_id"].value_counts()

In [None]:
plt.figure(figsize=(30, 8))
df["time_id"].value_counts().plot(kind="bar")
plt.show()

In [None]:
print(
    "Percent of time_id value_counts >= 2000 : {}%".format(
        round(
            (df["time_id"].value_counts() >= 2000).sum()
            / len(df["time_id"].value_counts())
            * 100,
            1,
        )
    )
)

In [None]:
# Let's plot investment_id and time_id together
df[["time_id", "investment_id"]].plot(
    kind="scatter", x="time_id", y="investment_id", figsize=(20, 30), s=0.5
)
plt.show()

We can see that more investment_id values are present at high time_id values.

In [None]:
# Let's see what's happening around time_id 300-400.
df[["time_id", "investment_id"]].plot(
    kind="scatter", x="time_id", y="investment_id", figsize=(20, 30), s=0.5
)
plt.xlim(300, 400)
plt.show()

We can see that some time_id values are missing.
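
The gaps can also be listed explicitly instead of being read off the plot. A minimal sketch on a toy time_id column (the values are invented), comparing the observed ids with the full integer range between the smallest and the largest one:

```python
import numpy as np
import pandas as pd

# Toy column with a gap: time_id 3 and 4 never appear
toy = pd.DataFrame({"time_id": [0, 0, 1, 2, 5, 5, 6]})

# Observed ids versus the full range between min and max
observed = np.sort(toy["time_id"].unique())
expected = np.arange(observed.min(), observed.max() + 1)

# Ids that are expected but never observed
missing_time_ids = np.setdiff1d(expected, observed)
print(missing_time_ids)
```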

### Features f_i
A histogram of all features is available at this kernel https://www.kaggle.com/code/mk1001/eda-f-0-299-histogram/notebook.

Let's look at six of them, chosen at random, with a histogram and a boxplot:

In [None]:
np.random.seed(1)

# Plot histograms and boxplots of 6 randomly chosen features f_i
for f in np.random.choice(range(0, 300), 6):
    
    # Initiate plot
    fig, axes = plt.subplots(2, 1, figsize=(15, 8))
    plt.suptitle("Distribution of feature f_{}".format(f), size=14)
    
    # Feature histogram
    df["f_{}".format(f)].hist(bins=50, ax=axes[0])

    # Feature boxplot
    sns.boxplot(x="f_{}".format(f), data=df, ax=axes[1])
    plt.show()

Some features are centered on zero.

Some of them have outliers and a distribution that is not centered. We could therefore consider scaling the data with a RobustScaler in order to limit the influence of outliers.
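
To make the idea concrete: RobustScaler centers on the median and divides by the interquartile range, so a single extreme value barely affects how the bulk of the data is scaled. The sketch below compares it with StandardScaler on a made-up column; StandardScaler is used here only for contrast and is not part of this notebook.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Five ordinary values and one extreme outlier (invented data)
x = np.array([[-1.0], [-0.5], [0.0], [0.5], [1.0], [50.0]])

robust_scaled = RobustScaler().fit_transform(x)
standard_scaled = StandardScaler().fit_transform(x)

# Manual version of the default RobustScaler: (x - median) / IQR
q25, q75 = np.percentile(x, [25, 75])
manual = (x - np.median(x)) / (q75 - q25)

# Spread of the five ordinary values after each scaling
inlier_spread_robust = robust_scaled[:5].max() - robust_scaled[:5].min()
inlier_spread_standard = standard_scaled[:5].max() - standard_scaled[:5].min()

print("robust  :", inlier_spread_robust)
print("standard:", inlier_spread_standard)
```

With StandardScaler the outlier inflates the standard deviation, which squeezes the ordinary values into a narrow band; the robust version keeps them spread out.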

In [None]:
# List of features columns
features = [f"f_{i}" for i in range(0, 300)]

### Target
Let's see the target distribution.

In [None]:
# Initiate plot
f, axes = plt.subplots(2, 1, figsize=(15, 8))

plt.suptitle("Distribution of the target", size=14)

# Target histogram
df["target"].hist(bins=50, ax=axes[0])

# Target Boxplot
sns.boxplot(x="target", data=df, ax=axes[1])
plt.show()

In [None]:
mean_target = df["target"].mean()
std_target = df["target"].std()
print("Target mean value : ", mean_target)
print("Target std value : ", std_target)

The distribution seems to be Gaussian.
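
A quick numeric complement to the visual check: for a Gaussian sample, skewness and excess kurtosis are both close to 0. The sketch below uses synthetic samples only (a normal one and a heavier-tailed Student t, both chosen for illustration); the same two pandas calls could be applied to the target column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A Gaussian sample and a heavier-tailed one (Student t with 5 degrees of freedom)
gaussian = pd.Series(rng.normal(size=100_000))
heavy_tailed = pd.Series(rng.standard_t(df=5, size=100_000))

# pandas reports Fisher (excess) kurtosis, which is 0 for a normal distribution
shape_stats = pd.DataFrame(
    {
        "skew": [gaussian.skew(), heavy_tailed.skew()],
        "excess_kurtosis": [gaussian.kurt(), heavy_tailed.kurt()],
    },
    index=["gaussian", "heavy_tailed"],
)
print(shape_stats)
```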

Let's plot the target distribution for some investment_id :

In [None]:
np.random.seed(1)

# Initiate counter
i = 1

# Initiate plot
plt.figure(figsize=(15, 12))
plt.suptitle("Target distribution for 6 random investment_id", size=16)

# Plot randomly 6 histograms of the target
for j in np.random.choice(df["investment_id"].unique(), 6):
    plt.subplot(2, 3, i)
    df[df["investment_id"] == j]["target"].hist(bins=50)
    plt.title("Target distribution\nfor investment_id {}".format(j), size=14)
    i += 1

For individual investment_id values, the target distribution looks less Gaussian. Some histograms show high counts in the tails of the distribution (e.g. investment_id 2441, 1337).

## Bidimensional analysis
### Get a sample dataset
Let's take a sample of the data.

In [None]:
sample_df = df.sample(frac=0.05, random_state=1)
sample_df

In [None]:
# Sort by time_id and investment_id to get data in order 
# and reset index
sample_df = sample_df.sort_values(
    ["time_id", "investment_id"], ascending=[True, True]
).reset_index(drop=True)
sample_df

In [None]:
# Dataframe information
sample_df.info()

Let's check the target distribution of the sample.

In [None]:
# Initiate plot
f, axes = plt.subplots(2, 1, figsize=(15, 8))

plt.suptitle("Distribution of the target", size=14)

# Target histogram
sample_df["target"].hist(bins=50, ax=axes[0])

# Target Boxplot
sns.boxplot(x="target", data=sample_df, ax=axes[1])
plt.show()

The distribution is close to that of the full dataset, but the sample does not contain the outliers above 10 and below -8 that are present in the full dataset.


### Correlation
#### Target vs features
Let's see if the target is correlated to the features f_i.

In [None]:
correlation = sample_df[["target"] + features].corr()

In [None]:
# Plot correlation values between target and features f_i
plt.figure(figsize=(7, 7))
correlation["target"].iloc[1:].hist(bins=20)
plt.title("Correlation between target and features f_i", size=14)
plt.show()

Correlation values are very low, which means that the target has no strong linear relationship with any single feature.
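
A low Pearson coefficient only rules out a strong linear link. As an illustration on synthetic data (not on the competition features), a target that depends on a feature purely through its square shows almost no linear correlation with it:

```python
import numpy as np

rng = np.random.default_rng(0)

# y is fully determined by x, but through a symmetric non-linear function
x = rng.normal(size=100_000)
y = x ** 2

# Pearson correlation with the raw feature, then with the squared feature
linear_corr = np.corrcoef(x, y)[0, 1]
corr_with_square = np.corrcoef(x ** 2, y)[0, 1]

print("corr(x, y)   :", linear_corr)
print("corr(x^2, y) :", corr_with_square)
```

This is why weakly correlated features can still be useful to a model, in particular a non-linear one.
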
#### Between features
Let's see the correlation between the features f_i. As there are 300 features, it is difficult to see all correlations.

In [None]:
def mat_corr(df):
    """
    Function to plot correlation matrix heatmap between columns of a dataframe
    
    Arguments :
    - dataframe df
    
    Display :
    - correlation matrix as heatmap
    """

    # Compute correlation
    corr = df.corr()

    # Mask to display only lower part of the heatmap
    mask = np.triu(np.ones_like(corr, dtype=bool))

    # Plot initialization
    f, ax = plt.subplots(figsize=(30, 30))

    # Color mapping
    cmap = sns.diverging_palette(230, 20, as_cmap=True)

    # Heatmap
    sns.heatmap(
        corr,
        mask=mask,
        cmap=cmap,
        # vmax=1,
        center=0,
        square=True,
        linewidths=0.5,
        cbar=True,
        # annot=True, # do not display correlation values
    )
    plt.title("Correlation Matrix", size=20)
    plt.show()

In [None]:
# Display heatmap
mat_corr(sample_df[features])

Most correlations are low.

Let's look at the highest correlations. A correlation is generally considered high above 0.8.

In [None]:
# Compute correlation matrix with absolute values
corr_matrix = sample_df[features].corr().abs()

# Keep high correlations
high_corr_var = np.where(corr_matrix >= 0.80)

# Get pairs of features with high correlations
high_corr_var = [
    (corr_matrix.columns[x], corr_matrix.columns[y])
    for x, y in zip(*high_corr_var)
    if x != y and x < y
]
high_corr_var

We can see that several features are correlated with more than one other feature, such as f_4, f_228, f_41, f_95, f_97...

In [None]:
# Select the lower triangle of the correlation matrix
lower = corr_matrix.where(np.tril(np.ones(corr_matrix.shape), k=-1).astype(bool))
# k = -1 to remove values on diagonal
lower

In [None]:
# Select the upper triangle of the correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# k = 1 to remove values on diagonal
upper

In [None]:
# Find features with correlation greater than 0.80 in lower matrix
to_drop_low = [column for column in lower.columns if any(lower[column] >= 0.8)]
print("{} features with high correlation (>=0.8)".format(len(to_drop_low)))

In [None]:
# Find features with correlation greater than 0.80 in upper matrix
to_drop_up = [column for column in upper.columns if any(upper[column] >= 0.8)]
print("{} features with high correlation (>=0.8)".format(len(to_drop_up)))

In [None]:
# Let's see which features are both in drop lists
feat_common = [f for f in to_drop_low if f in to_drop_up]
feat_common

There are 5 features common to both drop lists. The other 15 features differ depending on whether the upper or the lower part of the matrix is used. This may have an impact on the modelling.
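
Why the two lists differ can be seen on a toy 3x3 correlation matrix (values invented for the example). For a correlated pair, the upper triangle flags the later column of the pair while the lower triangle flags the earlier one, so only features that appear in pairs on both sides end up in both lists.

```python
import numpy as np
import pandas as pd

cols = ["f_a", "f_b", "f_c"]

# Toy absolute correlations: (f_a, f_b) and (f_b, f_c) are highly correlated
corr = pd.DataFrame(
    [[1.0, 0.9, 0.1], [0.9, 1.0, 0.85], [0.1, 0.85, 1.0]],
    index=cols,
    columns=cols,
)

# Keep only the strict upper / lower triangle; the rest becomes NaN
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))

# Columns holding at least one high correlation in each triangle
drop_up = [c for c in upper.columns if (upper[c] >= 0.8).any()]
drop_low = [c for c in lower.columns if (lower[c] >= 0.8).any()]
common = [c for c in drop_low if c in drop_up]

print("upper :", drop_up)
print("lower :", drop_low)
print("common:", common)
```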

Let's have a look at their distributions.

In [None]:
# Plot histograms and boxplot of these features f_
for f in feat_common:
    
    # Initiate plot
    fig, axes = plt.subplots(2, 1, figsize=(15, 8))
    plt.suptitle("Distribution of feature {}".format(f), size=14)
    
    # Feature histogram
    sample_df[f].hist(bins=50, ax=axes[0])

    # Feature boxplot
    sns.boxplot(x=f, data=sample_df, ax=axes[1])
    plt.show()

The distributions are not all centered on zero, and there are many outliers.

So we are going to compare two possibilities: dropping the features flagged by the upper matrix, or those flagged by the lower matrix.

In [None]:
# Drop these features
sample_df_up = sample_df.drop(to_drop_up, axis=1)
sample_df_low = sample_df.drop(to_drop_low, axis=1)

In [None]:
print("sample_df_up shape : ", sample_df_up.shape)
print("sample_df_low shape : ", sample_df_low.shape)

In [None]:
# Remove other columns that are not features
others = ["row_id", "time_id", "investment_id", "target"]

features_up = list(sample_df_up.columns)
features_low = list(sample_df_low.columns)

for x in others:
    features_up.remove(x)
    features_low.remove(x)

In [None]:
len(features_up)

In [None]:
len(features_low)

Correlated features have been removed.

In [None]:
# Let's have a look at correlation matrix
mat_corr(sample_df[features_up])

Some correlations above 0.6 in absolute value remain.

## Split data
We split the data now so that the test part is not influenced by operations fitted on the train part.

In [None]:
# Define X and y
X = sample_df[features].values
X_up = sample_df_up.drop(others, axis=1).values
X_low = sample_df_low.drop(others, axis=1).values
y = sample_df_up["target"].values

In [None]:
print("X shape : ", X.shape)
print("X_up shape : ", X_up.shape)
print("X_low shape : ", X_low.shape)
print("y shape : ", y.shape)

The test part has to be the end of the dataset, as it represents "future" observations (remember that our data are ordered by time_id).
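
The next cell cuts the arrays at a fixed row index. An alternative, sketched here on a toy frame, is to cut on time_id itself so that no time_id is shared between train and test. The 20% test share and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: 10 time steps with 3 rows each, already sorted by time_id
toy = pd.DataFrame(
    {
        "time_id": np.repeat(np.arange(10), 3),
        "target": np.linspace(-1.0, 1.0, 30),
    }
)

# Keep the last 20% of the distinct time steps for the test part
time_ids = np.sort(toy["time_id"].unique())
n_test_times = max(1, len(time_ids) // 5)
test_times = time_ids[-n_test_times:]

is_test = toy["time_id"].isin(test_times)
train_part = toy[~is_test]
test_part = toy[is_test]

print("train rows:", len(train_part), "last time_id:", train_part["time_id"].max())
print("test rows :", len(test_part), "first time_id:", test_part["time_id"].min())
```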

In [None]:
# Split data
X_train = X[:140000]
X_test = X[140000:]

X_up_train = X_up[:140000]
X_up_test = X_up[140000:]

X_low_train = X_low[:140000]
X_low_test = X_low[140000:]

y_train = y[:140000]
y_test = y[140000:]

print("X_train shape : ", X_train.shape)
print("X_test shape : ", X_test.shape)
print("X_up_train shape : ", X_up_train.shape)
print("X_up_test shape : ", X_up_test.shape)
print("X_low_train shape : ", X_low_train.shape)
print("X_low_test shape : ", X_low_test.shape)
print("y_train shape : ", y_train.shape)
print("y_test shape : ", y_test.shape)

In [None]:
perc_test = round(len(X_up_test) / len(X_up) * 100, 1)
print("Percent of data in test set : {}%".format(perc_test))

In [None]:
# Let's see information about the first row of the test set
sample_df.loc[140000]

Test set contains data with time_id above 1116.

In [None]:
sample_df.loc[139999]

In [None]:
sample_df.loc[140001]

## Preprocessing
We are going to use PCA to reduce the number of features.

We are going to compare scaled and unscaled data. As mentioned above, we use a RobustScaler for scaling.

### Scale

In [None]:
# X
robust_scal = RobustScaler().fit(X_train)
X_scaled = robust_scal.transform(X_train)
X_scaled.shape

In [None]:
# Up
robust_scal_up = RobustScaler().fit(X_up_train)
X_up_scaled = robust_scal_up.transform(X_up_train)
X_up_scaled.shape

In [None]:
# Low
robust_scal_low = RobustScaler().fit(X_low_train)
X_low_scaled = robust_scal_low.transform(X_low_train)
X_low_scaled.shape

### PCA

In [None]:
# PCA X
pca = PCA(random_state=0)
pca.fit(X_train)

In [None]:
def display_scree_plot(pca, data):

    """ Function to display the scree plot of a PCA
        
    - Arguments :
        - pca : fitted PCA model
        - data : name of the data on which the PCA has been fitted (string)
    
    - Display :
        - bar plot of the explained variance for each PCA component
        - cumulative percentage of explained variance
    """
    
    # Initiate plot
    plt.figure(figsize=(12, 8))
    
    # Get explained_variance_ratio_
    scree = pca.explained_variance_ratio_ * 100

    # Barplot for each component
    plt.bar(np.arange(len(scree)) + 1, scree)

    # Cumulative sum
    plt.plot(np.arange(len(scree)) + 1, scree.cumsum(), c="red", marker="o")

    plt.xlabel("Rank of the principal component", size=13)
    plt.ylabel("Percentage of explained variance", size=13)
    plt.title("Scree plot of the PCA for {}".format(data), size=14)
    plt.show(block=False)

In [None]:
data = "X"
display_scree_plot(pca, data)

In [None]:
# PCA 0.85 X
pca_85 = PCA(n_components=0.85, random_state=0)
pca_85.fit(X_train)
X_pca85 = pca_85.transform(X_train)
X_pca85.shape

The number of features decreases by 58%.

In [None]:
# PCA X_up
pca_up = PCA(random_state=0)
pca_up.fit(X_up_train)

In [None]:
data = "X_up"
display_scree_plot(pca_up, data)

Let's keep 85% of the explained variance.
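
When n_components is a float between 0 and 1, scikit-learn's PCA keeps the smallest number of components whose cumulative explained variance ratio exceeds that fraction. The sketch below, on synthetic data with arbitrary generating choices, recovers this number from a full PCA and derives the kind of percentage reduction quoted in this section.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 observed features driven by 3 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_toy = latent @ mixing + 0.1 * rng.normal(size=(500, 10))

# Full PCA, then the smallest rank whose cumulative ratio exceeds 0.85
pca_full = PCA(random_state=0).fit(X_toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, 0.85, side="right")) + 1

# Same selection done directly by PCA with a float n_components
toy_pca_85 = PCA(n_components=0.85, random_state=0).fit(X_toy)

# Percentage reduction in the number of features
reduction_pct = 100 * (1 - n_needed / X_toy.shape[1])

print("components needed :", n_needed)
print("components kept   :", toy_pca_85.n_components_)
print("reduction (%)     :", round(reduction_pct, 1))
```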

In [None]:
# PCA 0.85 X_up
pca_up_85 = PCA(n_components=0.85, random_state=0)
pca_up_85.fit(X_up_train)
X_up_pca85 = pca_up_85.transform(X_up_train)
X_up_pca85.shape

The number of features decreases by 55%.

In [None]:
# PCA X_low
pca_low = PCA(random_state=0)
pca_low.fit(X_low_train)

In [None]:
data = "X_low"
display_scree_plot(pca_low, data)

In [None]:
# PCA 0.85 X_low
pca_low_85 = PCA(n_components=0.85, random_state=0)
pca_low_85.fit(X_low_train)
X_low_pca85 = pca_low_85.transform(X_low_train)
X_low_pca85.shape

The number of features decreases by 55.4%.

In [None]:
# PCA X_scaled
pca_scal = PCA(random_state=0)
pca_scal.fit(X_scaled)

In [None]:
data = "X_scaled"
display_scree_plot(pca_scal, data)

In [None]:
# PCA 0.85 X_scaled
pca_scal_85 = PCA(n_components=0.85, random_state=0)
pca_scal_85.fit(X_scaled)
X_scal_pca85 = pca_scal_85.transform(X_scaled)
X_scal_pca85.shape

The number of features decreases by 96.7%!

In [None]:
# PCA X_up_scaled
pca_up_scal = PCA(random_state=0)
pca_up_scal.fit(X_up_scaled)

In [None]:
data = "X_up_scaled"
display_scree_plot(pca_up_scal, data)

In [None]:
# PCA 0.85 X_up_scaled
pca_up_scal_85 = PCA(n_components=0.85, random_state=0)
pca_up_scal_85.fit(X_up_scaled)
X_up_scal_pca85 = pca_up_scal_85.transform(X_up_scaled)
X_up_scal_pca85.shape

The number of features decreases by 96.8%!

In [None]:
# PCA X_low_scaled
pca_low_scal = PCA(random_state=0)
pca_low_scal.fit(X_low_scaled)

In [None]:
data = "X_low_scaled"
display_scree_plot(pca_low_scal, data)

In [None]:
# PCA 0.85 X_low_scaled
pca_low_scal_85 = PCA(n_components=0.85, random_state=0)
pca_low_scal_85.fit(X_low_scaled)
X_low_scal_pca85 = pca_low_scal_85.transform(X_low_scaled)
X_low_scal_pca85.shape

The number of features decreases by 96.8%!

## Linear Regression
### 1st Try 
We are going to see, with a simple linear regression, how the model performs on our different data sets (PCA, up, low, scaled...).

In [None]:
def display_learning_curve(model, X_train, y_train, name_model, name_X):

    """ Function to display learning curve for a model
    
    - Arguments : 
        - model : model to train
        - X_train : data to fit
        - y_train : data to compare
        - name_model : name of the model (string)
        - name_X : name of X_data (string)
    
    - Display :
        - Learning curve with training score and validation score
    
    """
    N, train_score, val_score = learning_curve(
        model,
        X_train,
        y_train,
        cv=n_folds,
        scoring="neg_root_mean_squared_error",
        train_sizes=np.linspace(0.1, 1, 10),
    )

    # Plot learning-curve
    plt.figure(figsize=(6, 6))
    plt.plot(N, -train_score.mean(axis=1), label="train_score")
    plt.plot(N, -val_score.mean(axis=1), label="validation_score")
    plt.xlabel("Dataset size", size=14)
    plt.ylabel("Mean RMSE", size=14)
    # plt.xlim([50,680])
    # plt.ylim([y_min, y_max])

    plt.title("Learning curve for {} with {}".format(name_model, name_X), size=14)
    plt.legend()
    plt.show()

In [None]:
def my_scorer(X, y):
    """
    Function to get the Pearson correlation coefficient between X and y
    
    Arguments :
        - X : actual target values (make_scorer passes them first)
        - y : predicted values
    
    Returns :
        - Pearson correlation coefficient computed with 
        the scipy.stats module
    """
    pearson = pearsonr(X, y)[0]
    return pearson

# Turn my_scorer into a scorer
my_pearson = make_scorer(my_scorer, greater_is_better=True)

# Dictionary of scores
scoring = {
    "neg_root_mean_squared_error": "neg_root_mean_squared_error",
    "neg_mean_absolute_error": "neg_mean_absolute_error",
    "my_pearson": my_pearson,
}

In [None]:
def cross_val(model, X_train, y_train, name_model, name_X):

    """ Function to do cross-validation on a model and get scores in
    a dataframe 
        
    - Arguments :
        - model : model to test
        - X_train : X data 
        - y_train : y data
        - name_model : name of the model (string)
        - name_X : name given to the X data (string)
    
    - Return :
        - dataframe with name_model, name_X and scoring : RMSE, MAE,
        Pearson_coef
    """

    # Cross validation
    scores = cross_validate(model, X_train, y_train, cv=n_folds, scoring=scoring,)

    # Get mean scores
    RMSE = -scores["test_neg_root_mean_squared_error"].mean()
    MAE = -scores["test_neg_mean_absolute_error"].mean()
    pearson = scores["test_my_pearson"].mean()

    # Dataframe creation for results
    df_model = pd.DataFrame(
        [[name_model, name_X, RMSE, MAE, pearson]],
        columns=["model", "X_data", "RMSE", "MAE", "Pearson_coef"],
    )

    return df_model

In [None]:
# Dictionary of X data
dico_X = {
    "X_train": X_train,
    "X_scaled": X_scaled,
    "X_pca85": X_pca85,
    "X_scal_pca85": X_scal_pca85,
    "X_up_pca85": X_up_pca85,
    "X_up_scal_pca85": X_up_scal_pca85,
    "X_low_pca85": X_low_pca85,
    "X_low_scal_pca85": X_low_scal_pca85,
}

In [None]:
%%time

model = LinearRegression()
n_folds = 5

# Dataframe for results
lr_results = pd.DataFrame(columns=[
            "model", "X_data", "RMSE", "MAE", "Pearson_coef",
        ])

# for each kind of X data
for name_X, X_data in dico_X.items() :
    
    # Learning curve
    display_learning_curve(model, X_data, y_train, "LinearRegression", name_X)
                           
    # Get results
    df_lr = cross_val(model, X_data, y_train, "LinearRegression", name_X)
    lr_results = pd.concat([lr_results, df_lr], axis = 0)

lr_results

In [None]:
lr_results = lr_results.reset_index(drop=True)
lr_results.sort_values(by = "RMSE")

In [None]:
lr_results.sort_values(by = "MAE")

In [None]:
lr_results.sort_values(by = "Pearson_coef", ascending=False)

The metrics are close to each other.
The top 3 is the same for each metric.
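
Note that the comparison above runs cross-validation on matrices that were scaled or projected before splitting, so each validation fold has already contributed to the PCA fit. One way to avoid this, sketched here on synthetic data, is to put the PCA and the regression in a Pipeline and pass the pipeline to cross_validate, so that the PCA is refitted on each training fold. The toy data and the names `pearson_score`, `X_toy` and `y_toy` are assumptions for the example.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline


def pearson_score(y_true, y_pred):
    # make_scorer calls this with the actual values first, then the predictions
    return pearsonr(y_true, y_pred)[0]


pearson_scorer = make_scorer(pearson_score, greater_is_better=True)

# Synthetic data: the target depends on the first, high-variance feature
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 8))
X_toy[:, 0] *= 3.0
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=300)

# PCA is refitted inside every training fold
pipe = Pipeline(
    [("pca", PCA(n_components=0.85, random_state=0)), ("lr", LinearRegression())]
)

cv_scores = cross_validate(
    pipe,
    X_toy,
    y_toy,
    cv=5,
    scoring={"rmse": "neg_root_mean_squared_error", "pearson": pearson_scorer},
)

mean_rmse = -cv_scores["test_rmse"].mean()
mean_pearson = cv_scores["test_pearson"].mean()
print("RMSE:", mean_rmse, "Pearson:", mean_pearson)
```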

### Using TimeSeriesSplit
We are going to compare our LinearRegression models using TimeSeriesSplit as the cross-validation splitter.
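
With TimeSeriesSplit, every validation fold comes strictly after its training fold and the training window grows from one split to the next. A small sketch on 20 toy rows (the sizes are chosen only to keep the output readable):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 toy rows, assumed to be sorted in time order
X_toy = np.arange(20).reshape(-1, 1)

toy_cv = TimeSeriesSplit(n_splits=3, test_size=4)
folds = [(train.tolist(), val.tolist()) for train, val in toy_cv.split(X_toy)]

for number, (train, val) in enumerate(folds, start=1):
    print("Split", number, "train:", train[0], "to", train[-1], "validation:", val)
```

Because the rows are ordered by time_id, this avoids validating on observations that are older than the training data, which a shuffled or plain K-fold split would allow.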

In [None]:
# TimeSeriesSplit
ts_cv = TimeSeriesSplit(n_splits=5, test_size = 20000)

In [None]:
# Initiate counter
i = 1

# Get number of samples in each fold
for train_index, val_index in ts_cv.split(X_pca85):
    print(
        "Split ",
        i,
        "\nTrain nb of samples :",
        len(train_index),
        "Validation nb of samples :",
        len(val_index),
        "\n",
    )
    i += 1

#### Model training

In [None]:
%%time

model = LinearRegression()
n_folds = ts_cv

# Dataframe for results
lr_results_ts_cv = pd.DataFrame(columns=[
            "model", "X_data", "RMSE", "MAE", "Pearson_coef",
        ])

# for each kind of X data
for name_X, X_data in dico_X.items() :
                           
    # Get results
    df_lr = cross_val(model, X_data, y_train, "LinearRegression", name_X)
    lr_results_ts_cv = pd.concat([lr_results_ts_cv, df_lr], axis = 0).reset_index(drop=True)

lr_results_ts_cv

In [None]:
lr_results_ts_cv.sort_values(by="RMSE")

The top 3 is the same; RMSE and MAE are better, but Pearson_coef is worse (except when trained on all data).

We choose X_up_pca85 as the best preprocessed data and we are going to evaluate this model.

#### Model evaluation

In [None]:
# Create Pipeline
pipeline_lr = Pipeline(
    [("pca", PCA(n_components=0.85, random_state=0)), ("lr", LinearRegression()),]
)

# Model training
pipeline_lr.fit(X_up_train, y_train)

In [None]:
def get_scores(model, name_model, name_X, X_test, y_test):

    """
    Function to get target predictions and dataframe with
    metrics
    
    Arguments : 
    - model : model to evaluate
    - name_model : name of the model (string)
    - name_X : name of the data (string)
    - X_test : data to get predictions
    - y_test : actual target values
    
    Return :
    - y_pred : array of predictions values
    - results : dataframe with name_model, name_X, RMSE and
    Pearson_coef
    
    """

    # get predictions
    y_pred = model.predict(X_test)

    # Get scores
    RMSE = np.sqrt(mean_squared_error(y_test, y_pred))
    pearson = pearsonr(y_pred, y_test)[0]

    # Data frame for results
    results = pd.DataFrame(
        [[name_model, name_X, RMSE, pearson]],
        columns=["model", "X_data", "RMSE", "Pearson_coef"],
    )

    return y_pred, results

In [None]:
# Model evaluation
y_pred, lr_final = get_scores(
    pipeline_lr, "LinearRegression", "X_up_pca85", X_up_test, y_test
)
lr_final