The notebook was created after studying [this notebook](https://www.kaggle.com/ankitverma2010/tubular-playground-regression). 

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Import Libraries

In [None]:
# Maths and data imports
import numpy as np
import pandas as pd
import scipy.stats as stats

# Plots imports
import seaborn as sns
import matplotlib.pyplot as plt

# ML modeling imports
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from xgboost import XGBRegressor

In [None]:
%matplotlib inline
sns.set()

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
train_path = '/kaggle/input/tabular-playground-series-jan-2021/train.csv'
test_path = '/kaggle/input/tabular-playground-series-jan-2021/test.csv'

train = pd.read_csv(train_path, index_col='id')
test = pd.read_csv(test_path, index_col='id')

In [None]:
train_df = train.copy()
test_df = test.copy()

In [None]:
train_df.head()

## Data preprocessing

In [None]:
# check the shape, to find the number of examples and features in training data
train_df.shape

In [None]:
# check for null values
train_df.isnull().sum()

In [None]:
# let's check for duplicate examples
train_df.duplicated().sum()

In [None]:
# let's check the dtypes (number of categorical and numerical features)
train_df.info()

In [None]:
# let's look at summary statistics for each column
train_df.describe().T

## EDA

In [None]:
fig, axs = plt.subplots(7, 2, figsize=(15, 30))

for i, ax in zip(train_df.drop(['target'], axis=1), axs.flatten()):
    sns.distplot(train_df[i], ax=ax, label='Train')
    sns.distplot(test_df[i], ax=ax, color='red', label='Test')
    ax.set_xlabel(i)
    ax.legend(loc='best')
plt.show()

In [None]:
# let's check for (multi)collinearity
sns.pairplot(train_df)
plt.show()

**Conclusion**

* The explanatory variables don't seem to be multicollinear
* No explanatory variable seems to be correlated with the target
* Further inspection required

In [None]:
fig = plt.figure(figsize=(20, 20))
sns.heatmap(train_df.corr(), annot=True)
plt.show()

In [None]:
corr = train_df.corr()

for col in corr.columns:
    for rel_col in corr[col][corr[col] > 0.7].index:
        if rel_col != col:
            print((col, rel_col))

**Conclusion**

Quite a few variables seem to be positively correlated (correlation above 0.7). Let's leave them in for now; maybe we'll look at them in the next iteration.
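
To revisit these pairs later, it may help to list each one only once. A minimal sketch, assuming an absolute-correlation threshold of 0.7 (the loop above only looks at positive correlations); `correlated_pairs` and the toy columns `a`, `b`, `c` are hypothetical names, not part of the dataset:

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.7):
    """Return each column pair whose absolute correlation exceeds threshold, once."""
    corr = df.corr()
    # Keep only the strict upper triangle so (a, b) and (b, a) are not both reported.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs.abs() > threshold].sort_values(ascending=False)

# Toy data (made up for illustration): b is nearly a copy of a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.normal(size=500),
    'c': rng.normal(size=500),
})
pairs = correlated_pairs(toy)
print(pairs)
```

On the real data the call would be `correlated_pairs(train_df.drop(['target'], axis=1))`.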

In [None]:
# let's check for outliers and skewness
fig = plt.figure(figsize=(20, 10))
sns.boxplot(data=train_df.drop(['target'], axis=1))
plt.xlabel('Explanatory Variables')
plt.ylabel('Values')
plt.show()

**Conclusion**

* The explanatory variables all lie in roughly the same range, so we'll skip standardization for now.
* A few variables, such as cont2, cont3, cont5 and cont8, seem to be skewed. Let's confirm it.
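
One way to put a number on this, as a rough sketch: `DataFrame.skew()` returns the sample skewness per column (about 0 for symmetric data, positive for a right tail). The toy columns below are made up; on the real data this would be `train_df.drop(['target'], axis=1).skew()`.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): one symmetric and one right-skewed column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'symmetric': rng.normal(size=5000),
    'right_skewed': rng.exponential(size=5000),
})

# Sample skewness per column, largest first.
skewness = toy.skew().sort_values(ascending=False)
print(skewness)
```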

In [None]:
# let's check for skewness again
fig = plt.figure(figsize=(20, 10))
sns.violinplot(data=train_df.drop(['target'], axis=1))
plt.xlabel('Explanatory Variables')
plt.ylabel('Values')
plt.show()

**Conclusion**

* cont5 and cont13 seem to be right-skewed
* Most of the variables seem to have multiple peaks

Maybe they contain multiple clusters.

## Analysing the response/target variable

In [None]:
# let's plot its distribution; if it's not normal, let's transform it towards normal
fig = plt.figure(figsize=(10, 5))
sns.distplot(train_df['target'])
plt.show()

**Conclusion**

This looks like a bimodal distribution. It could have been created by mixing two normal distributions. Let's confirm.
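
One possible check, sketched here on synthetic data: fit Gaussian mixtures with one and two components and compare their BIC (lower is better). The means and spreads below are invented for illustration; for the real target the input would be `train_df[['target']].to_numpy()`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic bimodal sample (made-up parameters, not estimated from the data).
rng = np.random.default_rng(0)
sample = np.concatenate([
    rng.normal(6.5, 0.4, 3000),
    rng.normal(8.5, 0.5, 3000),
]).reshape(-1, 1)

# Fit 1- and 2-component mixtures and record the BIC of each.
bic = {}
for k in (1, 2):
    gm = GaussianMixture(n_components=k, random_state=0).fit(sample)
    bic[k] = gm.bic(sample)
print(bic)
```

A clearly lower BIC for two components would support the mixture idea; similar values would not.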

In [None]:
# import statsmodels.api as sm

# fig = plt.figure(figsize=(10, 5))
# sm.qqplot(train_df['target'], line='s')
# plt.show()

In [None]:
z = (train_df.target - train_df.target.mean()) / train_df.target.std()

fig = plt.figure(figsize=(5, 5))
stats.probplot(z, dist='norm', plot=plt)
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Experimental Quantiles')
plt.show()

**Conclusion**

Seems like a normal distribution.

## Train Dev Split

In [None]:
X = train_df.drop(['target'], axis=1)
y = train_df['target']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train.head()

In [None]:
X_valid.head()

In [None]:
y_train.head()

In [None]:
y_valid.head()

## Models

In [None]:
models = {
    'RFR': RandomForestRegressor,
    'ABR': AdaBoostRegressor,
    'XGBR': XGBRegressor
}

In [None]:
def fit_model(name, model, train_ds, valid_ds):
    X, y = train_ds
    X_val, y_val = valid_ds
    
    model.fit(X, y)
    y_hat = model.predict(X)
    y_hat_val = model.predict(X_val)
    
    mse = mean_squared_error(y, y_hat)
    mse_val = mean_squared_error(y_val, y_hat_val)
    
    print(f'Model: {name}, Train MSE: {mse}, Val MSE: {mse_val}')

In [None]:
n_est = [10, 25, 50, 100, 200]
for n in n_est:
    print(f'n_estimators: {n}')
    for name, model_cls in models.items():
        # instantiate a fresh model for every setting
        model = model_cls(n_estimators=n)
        fit_model(name, model, (X_train, y_train), (X_valid, y_valid))
    print('-'*20)

**Conclusion**

It seems `AdaBoostRegressor` and `XGBRegressor` tend to perform well with `n_estimators=50` and `n_estimators=100` respectively. At `n_estimators=100`, `XGBRegressor` seems to overfit slightly.
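
Overfitting is easier to spot when the scores sit side by side rather than in printed lines. A minimal sketch on synthetic data (the `make_regression` set, the `scores` table and its `gap` column are assumptions made for illustration, not part of the run above):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the snippet runs on its own.
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

rows = []
for n in (10, 25, 50):
    model = AdaBoostRegressor(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    val_mse = mean_squared_error(y_va, model.predict(X_va))
    # A large positive gap (validation much worse than train) hints at overfitting.
    rows.append({'n_estimators': n, 'train_mse': train_mse,
                 'val_mse': val_mse, 'gap': val_mse - train_mse})

scores = pd.DataFrame(rows).set_index('n_estimators')
print(scores)
```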

Let's try an ensemble of both.

In [None]:
abr = AdaBoostRegressor(n_estimators=100)
xgbr = XGBRegressor(n_estimators=50)

abr.fit(X_train, y_train)
xgbr.fit(X_train, y_train)

y_hat1, y_hat2 = abr.predict(X_valid), xgbr.predict(X_valid)
y_hat = (y_hat1+y_hat2)/2
mse = mean_squared_error(y_valid, y_hat)
print(f'Ensemble MSE: {mse}')

In [None]:
# average the two models' predictions on the test set
pred = (abr.predict(test_df) + xgbr.predict(test_df))/2
# build the submission with the test ids next to the predictions
submission = pd.DataFrame({'id': test_df.index, 'target': pred})

submission.head()

In [None]:
fig = plt.figure(figsize=(20, 10))
sns.distplot(train['target'], label='Train')
sns.distplot(submission['target'], color='red', label='Test')
plt.legend(loc='best')  # without this the labels above are never shown
plt.show()

**Conclusion**

The predictions look roughly normally distributed, but they form one tall, narrow peak, so we can expect only so-so performance on the test set. We can iterate and try a few more things to improve the model.

In [None]:
submission.to_csv('result.csv', index=False, header=True)