* ## This notebook writes down the pipeline of how a Kaggle competition works, rather than the technical ML aspects (literally a "note"book!)
* ## It covers how to load the data, carry out an analysis (even a very basic one), and produce the submission csv file.

* ## Any comments, feedback, or advice are appreciated :)

> -------

## Progress so far:
* 2/8 - Implementing feature importance and cross validation
* 2/8 - Implementing standardisation and PCA (Public score: 7.94023)
* 1/8 - The first commit, with a linear regression (Public score: 7.93942)

## What's next?
Hyperparameter tuning, SVR and random forest

---------

## Data Loading

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
test = pd.read_csv('/kaggle/input/tabular-playground-series-aug-2021/test.csv')
train = pd.read_csv('/kaggle/input/tabular-playground-series-aug-2021/train.csv')

In [None]:
train.shape, test.shape

In [None]:
train[:5]

In [None]:
test[:5]

## Preprocessing and EDA

In [None]:
y = train["loss"]
Xtrn = train.drop(["loss", "id"], axis = 1)

In [None]:
Xtst = test.drop("id", axis = 1)

In [None]:
#standardisation
normed_Xtst = (Xtst - Xtrn.mean()) / Xtrn.std()
normed_Xtrn = (Xtrn - Xtrn.mean()) / Xtrn.std()
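
The same idea (learn the mean and standard deviation on the training set only, then reuse them on the test set) can also be written with scikit-learn's `StandardScaler`. The sketch below uses small made-up data, and names such as `toy_trn` and `toy_tst` are only illustrative. One difference to keep in mind: pandas `.std()` divides by n-1 by default, while `StandardScaler` divides by n, so the two results differ very slightly unless `ddof=0` is passed.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for Xtrn / Xtst (made-up data, not the competition files)
rng = np.random.default_rng(0)
cols = ["f0", "f1", "f2"]
toy_trn = pd.DataFrame(rng.normal(5.0, 2.0, size=(200, 3)), columns=cols)
toy_tst = pd.DataFrame(rng.normal(5.0, 2.0, size=(50, 3)), columns=cols)

scaler = StandardScaler()
scaled_trn = scaler.fit_transform(toy_trn)  # learn mean/std on the training data only
scaled_tst = scaler.transform(toy_tst)      # reuse the training statistics on the test data

# Manual version as in the cell above, but with ddof=0 to match StandardScaler
manual_trn = (toy_trn - toy_trn.mean()) / toy_trn.std(ddof=0)
print(np.allclose(scaled_trn, manual_trn.to_numpy()))
```

For a dataset of this size the n versus n-1 difference should hardly matter, so either version seems fine.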

In [None]:
#PCA
from sklearn.decomposition import PCA

pcas = PCA(n_components=100)
pcas.fit(normed_Xtrn) #training PCA
projected = pcas.transform(normed_Xtrn) #projecting the data onto Principal components
print(Xtrn.shape)
print(projected.shape)
plt.plot(pcas.explained_variance_); plt.grid();
plt.xlabel('component index')
plt.ylabel('Explained Variance') # the variance is on the y axis, not the x axis
plt.figure()
cumvar = np.cumsum(pcas.explained_variance_ratio_)
dimmin = np.where(cumvar > 0.95)[0][0] # zero-based index of the first component where the cumulative variance exceeds 0.95
print(dimmin)
plt.plot(np.arange(len(pcas.explained_variance_ratio_))+1, cumvar,'o-') #plot the scree graph
plt.axis([1,len(pcas.explained_variance_ratio_),0,1])
plt.hlines(0.95, 1, 100, colors="orange")
plt.vlines(dimmin + 1, 0, 0.95, colors = "red") # the x axis is 1-based, so shift the index by one
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
plt.title('Scree Graph')
plt.grid()
plt.show()
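
Instead of reading the number of components off the graph, scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` and then keeps just enough components to explain that fraction of the variance. Below is a minimal sketch on made-up correlated data (`toy_X` is hypothetical, not the competition data), with `svd_solver="full"` set explicitly for the float form.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy correlated data: 20 observed features driven by 5 latent factors plus noise
latent = rng.normal(size=(500, 5))
toy_X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(500, 20))
toy_X = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)  # standardise first

# A float n_components asks PCA to keep enough components for 95% of the variance
pca_95 = PCA(n_components=0.95, svd_solver="full")
toy_projected = pca_95.fit_transform(toy_X)

cumvar_95 = np.cumsum(pca_95.explained_variance_ratio_)
print(pca_95.n_components_, cumvar_95[-1])
```

The chosen number of components is then available as `pca_95.n_components_`, which avoids hard-coding it in a later cell.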

In [None]:
pca = PCA(n_components=94)
# Fit on the standardised data, to be consistent with the scree graph above
pca.fit(normed_Xtrn) #training PCA
projected_Xtrn = pca.transform(normed_Xtrn)
projected_Xtst = pca.transform(normed_Xtst)

In [None]:
#Random forest
from sklearn.ensemble import RandomForestRegressor
RFclf = RandomForestRegressor(n_estimators=10)
RFmodel = RFclf.fit(Xtrn, y)

In [None]:
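# %store is an IPython magic command, not plain Python.
# "%store RFmodel" saves the fitted model (run it once after training):
#%store RFmodel
# "%store -r RFmodel" restores the saved model in a later session:
%store -r RFmodel

How should we use the %store command in Python?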
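
One possible alternative to `%store` that works in plain Python is to save the fitted model to a file with `joblib` (which is installed alongside scikit-learn). This is only a sketch with a toy model; `toy_model` and `model_path` are hypothetical names, and on Kaggle the file would presumably go under `/kaggle/working/` instead of a temporary directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy regression data standing in for Xtrn / y
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 4))
toy_y = toy_X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

toy_model = RandomForestRegressor(n_estimators=10, random_state=0).fit(toy_X, toy_y)

# Hypothetical file location, used here only for illustration
model_path = os.path.join(tempfile.mkdtemp(), "rf_model.joblib")
joblib.dump(toy_model, model_path)   # save the fitted model to disk

restored = joblib.load(model_path)   # load it back, e.g. in a later session
print(np.allclose(restored.predict(toy_X), toy_model.predict(toy_X)))
```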

In [None]:
feature_importances = pd.DataFrame([Xtrn.columns, RFmodel.feature_importances_]).T
feature_importances.columns = ['features', 'importances']
plt.figure(figsize=(20,10))
plt.title('Importances')
plt.rcParams['font.size']=5
plt.xticks(rotation=90)
sns.barplot(y=feature_importances['importances'] , x=feature_importances['features'], palette='viridis')

There isn't much difference in importance between the variables.

## Analysis

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor


lr = LinearRegression()
svr =  SVR()
rf = RandomForestRegressor(n_estimators=10)
knn = KNeighborsRegressor(n_neighbors=5)

In [None]:
train_data = [Xtrn, normed_Xtrn, projected_Xtrn]
models = [lr]

## Cross Validation

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate

stratifiedkfold = StratifiedKFold(n_splits=3)

def score(X, model):
    scores = cross_validate(model, X, y, scoring="neg_root_mean_squared_error", cv=stratifiedkfold)
    return -np.mean(scores["test_score"]), np.std(scores["test_score"])

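Only the negative RMSE is available as a scorer, not the usual one lol. The reason seems to be that scikit-learn scorers follow a "higher is better" convention, so error metrics are negated. That is why the sign is flipped back in `score` above. Any other ideas are welcome:
https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

Also, it wasn't possible to set n_splits=5...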
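
`StratifiedKFold` is designed for class labels, so for a regression target a plain `KFold` might be a simpler fit. Whether that is what blocked `n_splits=5` here is not confirmed. Below is a minimal sketch on made-up data (`toy_X` and `toy_y` are hypothetical) showing five folds with the negated RMSE scorer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Toy linear data standing in for Xtrn / y
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(300, 5))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
toy_y = toy_X @ true_coef + rng.normal(scale=0.5, size=300)

# KFold splits by position only, so it does not care about the target values
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
cv_result = cross_validate(LinearRegression(), toy_X, toy_y,
                           scoring="neg_root_mean_squared_error", cv=kfold)

# The scorer returns negated RMSE, so flip the sign to get the usual RMSE
rmse_per_fold = -cv_result["test_score"]
print(rmse_per_fold.mean(), rmse_per_fold.std())
```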

In [None]:
data_names = ["raw", "normed", "pca"] # labels for the entries of train_data, in order
for model in models:
    for name, X in zip(data_names, train_data):
        print(model, name, score(X, model))

SVR and Random forest take ages...

In [None]:
bestmodel = lr
bestmodel.fit(Xtrn, y)

prediction = bestmodel.predict(Xtst)

## Producing the results

In [None]:
# Reuse the ids from the test DataFrame loaded earlier instead of reading the csv again
output = pd.DataFrame({'id': test['id'],
                       'loss': prediction,
                       })

output.to_csv('submission.csv', index = False)
output