# **Tabular playground series January 2021**

Dataset: open Kaggle [dataset](https://www.kaggle.com/c/tabular-playground-series-jan-2021)

## Task

Build a regression model that predicts the `target` column from the features provided in the dataset.


## Files

- `train.csv` : training data with the `target` column. 
- `test.csv` : test set. The trained model will be applied here.
- `sample_submission.csv` : a sample submission for the test set.

*@author : Baptiste Mistral - Jan2021*

In [None]:
import os
import numpy as np 
import pandas as pd 
import seaborn as sns 
sns.set()
import matplotlib.pyplot as plt

path_input = '../input/tabular-playground-series-jan-2021/'
path_output = './'

## Exploring dataset

We start by analysing the dataset.

In [None]:
data = pd.read_csv(path_input+'train.csv')
data.head()

Let's check whether the data contains missing or NaN values.

In [None]:
data.info()
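
`data.info()` reports the non-null count of each column. An explicit per-column NaN count can be easier to read. A minimal sketch on a small made-up frame (the column names `cont1`, `cont2` and all values are illustrative assumptions, not the competition data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "cont1": [0.1, np.nan, 0.3],
    "cont2": [1.0, 2.0, 3.0],
    "target": [7.5, 8.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True if any cell in the frame is missing.
has_missing = bool(toy.isna().any().any())
print("Any missing values:", has_missing)
```

Applied to the real frame, the same two calls would be `data.isna().sum()` and `data.isna().any().any()`.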

OK, no problem with the data. Let's focus on the target and check its distribution.

In [None]:
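# `sns.distplot` is deprecated; `sns.displot` draws the same histogram, KDE curve and rug
sns.displot(data=data, x='target', kde=True, rug=True, stat='density', color='g')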

The `id` column seems useless for predicting the target, so we drop it to start reducing the amount of data.

In [None]:
data = data.drop(columns=['id'])

In [None]:
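# Compute the correlation matrix once and reuse it for the mask and the plot
corr = data.corr()
# Hide the upper triangle and the diagonal so each pair is shown only once
mask = np.triu(np.ones_like(corr, dtype=bool))

f, ax = plt.subplots(figsize=(80, 15))
sns.heatmap(corr, vmax=1, square=True, annot=True, cmap='viridis', mask=mask, ax=ax)

plt.title('Correlation between different features')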

The correlations between the `cont` features and the target are not obvious. However, we can see that several features are strongly correlated with each other.
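
To read the strongest feature-to-feature correlations without scanning the heatmap, the pairs can be ranked by absolute correlation. A minimal sketch on synthetic data (the frame `toy` and its columns are assumptions made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic features: cont2 is nearly a copy of cont1, cont3 is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "cont1": base,
    "cont2": base + 0.1 * rng.normal(size=500),
    "cont3": rng.normal(size=500),
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# One row per feature pair, strongest correlation first.
pairs = upper.stack().dropna().sort_values(ascending=False)
print(pairs)
```

On the real data, replacing `toy` with `data.drop(columns=['target'])` would list the feature pairs in the same way.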

## Preprocessing the data

In [None]:
from sklearn.model_selection import train_test_split

y = data['target']                 # all rows, target only
X = data.drop(columns=['target'])  # all rows, all the features and no target

# Hold out 20% of the rows to evaluate the model
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)
print("training size : {}\ntest size : {}".format(x_train.shape, x_test.shape))

# First training attempt

In [None]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error

In [None]:
# Baseline model with default hyperparameters; `gpu_hist` needs a CUDA-capable GPU
boost = xgb.XGBRegressor(tree_method='gpu_hist')
boost.fit(x_train, y_train)
pred = boost.predict(x_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print("Result : RMSE =", rmse)
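
The RMSE reported here is the square root of the mean squared error between predictions and true values. A small worked example with made-up numbers (the arrays are illustrative, not model output):

```python
import numpy as np

y_true = np.array([7.0, 8.0, 9.0])
y_pred = np.array([7.5, 8.0, 8.0])

errors = y_pred - y_true       # [0.5, 0.0, -1.0]
mse = np.mean(errors ** 2)     # (0.25 + 0.0 + 1.0) / 3
rmse = np.sqrt(mse)
print("RMSE =", rmse)
```

Because the errors are squared before averaging, a single large error weighs more than several small ones.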

## Grid Search CV - Tuning parameters

In [None]:
from sklearn.model_selection import GridSearchCV

def tune_model(model, params):
    """Grid-search `params` for `model` on the training split and return the best parameters."""
    modelCV = GridSearchCV(estimator=model,
                           param_grid=params,
                           cv=3,              # 3-fold cross-validation
                           n_jobs=-1,         # use all available CPU cores
                           pre_dispatch='8',  # cap the number of jobs dispatched at once
                           scoring='neg_root_mean_squared_error',
                           verbose=4,
                           refit=True)        # refit the best candidate on the whole training split
    modelCV.fit(x_train, y_train)
    print("CV results :\n{}\n".format(modelCV.cv_results_))
    print("-----------")
    print("Best estimator : \n{}\n".format(modelCV.best_estimator_))
    print("-----------")
    print("Best parameters : \n{}\n".format(modelCV.best_params_))
    print("-----------")
    print("Best score : \n{}\n".format(modelCV.best_score_))
    print("-----------")
    return modelCV.best_params_

With `verbose=4` the output is very detailed, which was useful to be sure everything runs as expected. You can set `verbose=1` for less information. The important parameter is `pre_dispatch`: reducing this number can help avoid an explosion of memory consumption when more jobs get dispatched than the CPUs can process. Visit the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) for more details.
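
One detail worth keeping in mind when reading the output: with `scoring='neg_root_mean_squared_error'`, `best_score_` is the negated RMSE, so it is negative and closer to zero is better. A minimal sketch of this, assuming synthetic data and a scikit-learn `Ridge` model as stand-ins for the competition data and XGBoost:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data and a Ridge model stand in for the notebook's data and XGBoost.
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    cv=3,
    scoring="neg_root_mean_squared_error",
    verbose=1,  # short progress summary instead of one line per fit
)
search.fit(X_toy, y_toy)

# The scorer negates the RMSE so that "higher is better" holds for every scorer.
best_rmse = -search.best_score_
print("Best alpha:", search.best_params_["alpha"])
print("Cross-validated RMSE:", best_rmse)
```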

In [None]:
import xgboost as xgb

# Values tried by the grid search
params_xgb = {'n_estimators': [2200, 2500, 2800],
              'learning_rate': [0.01, 0.015, 0.02]}

# Settings kept fixed during the search
boost = xgb.XGBRegressor(
    objective='reg:squarederror',
    subsample=0.8,
    colsample_bytree=0.8,
    learning_rate=0.01,  # overridden by the values in the grid
    tree_method='gpu_hist')

best_xgb = tune_model(boost, params_xgb)
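
Before launching a search it can help to estimate how many models will be trained. A short sketch using scikit-learn's `ParameterGrid` on the same grid values (`cv_folds` is a name introduced here for illustration and mirrors `cv=3` in `tune_model`):

```python
from sklearn.model_selection import ParameterGrid

params_xgb = {"n_estimators": [2200, 2500, 2800],
              "learning_rate": [0.01, 0.015, 0.02]}
cv_folds = 3  # same value as `cv` in tune_model

# Every combination of the listed values is one candidate.
n_candidates = len(ParameterGrid(params_xgb))
# Each candidate is trained once per fold.
n_fits = n_candidates * cv_folds
print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

With `refit=True`, one more fit of the best candidate on the whole training split follows the cross-validation fits.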

In [None]:
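# Reuse the fixed settings from the search; otherwise the tuned values are applied to a different model
boost = xgb.XGBRegressor(**best_xgb, objective='reg:squarederror', subsample=0.8,
                         colsample_bytree=0.8, tree_method='gpu_hist')
boost.fit(x_train, y_train)
pred = boost.predict(x_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print("Result : RMSE =", rmse)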

# Final prediction and submission

In [None]:
data_test = pd.read_csv(path_input + 'test.csv')
pred = boost.predict(data_test.drop(columns=['id']))
sub = pd.DataFrame({
    "id": data_test["id"],
    "target": pred
})
sub.to_csv(path_output + 'submission.csv', index=False)
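
Before uploading, the submission can be compared against `sample_submission.csv`. A minimal sketch on toy frames, assuming the sample file has the same `id` and `target` columns as the submission written here (the ids and values below are made up):

```python
import pandas as pd

# Toy stand-ins for sample_submission.csv and the generated submission.
sample = pd.DataFrame({"id": [0, 2, 6], "target": [0.0, 0.0, 0.0]})
sub_toy = pd.DataFrame({"id": [0, 2, 6], "target": [7.9, 8.1, 7.6]})

# Same columns in the same order.
same_columns = list(sub_toy.columns) == list(sample.columns)
# Same ids in the same order.
same_ids = sub_toy["id"].equals(sample["id"])
# No missing predictions.
no_missing = not sub_toy["target"].isna().any()
print(same_columns, same_ids, no_missing)
```

On the real files, `sample` would come from `pd.read_csv(path_input + 'sample_submission.csv')` and `sub_toy` would be replaced by `sub`.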