## Tabular Playground competition:
In this notebook, I explore the dataset, then plot and analyze the features. I then create new features based on an analysis of the multimodal distributions. Finally, I train a number of models and create the submission file. The model parameters provided are not the optimal ones, but the markdown at the start of the respective model's section suggests how to tune them. The contents of this notebook are:<br/>
(1) [Basic data exploration](#section1)<br/>
(2) [Multimodal distributions and fitting of training data](#section2)<br/>
(3) [Modeling efforts](#section3)<br/>
Try forking the notebook and optimizing the models to get a good result. If you use the notebook and like the work, consider showing your appreciation. I am also open to suggestions for improving the notebook.<br/>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor as RfReg
import xgboost as xgb
from sklearn.linear_model import LinearRegression as LinReg

## <a id = 'section1'>Basic data exploration</a>:
In this section, we will load the data and check its shape and column names. Then we will plot each column of both the train and test data and note some basic insights and observations.

In [None]:
train_data = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2021/train.csv')
test_data = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2021/test.csv')
print("train data shape is:",train_data.shape)
print("test data shape is:",test_data.shape)
train_data.head()

In [None]:
train_data = train_data.drop('id',axis = 1)
test_data = test_data.drop('id',axis = 1)

In [None]:
def brief_col(data,col):
    print("Name of the column:",col)
    print("Number of missing points is:",data[col].isna().sum())
    print("the description of the column is:")
    print(data[col].describe())
    plt.figure(figsize = (10,10))
    plt.hist(data[col].tolist())
    plt.title(col)
    plt.show()

In [None]:
for col in train_data.columns:
    print("in training data:")
    brief_col(train_data,col)
    if col == 'target':
        # the test data has no target column, so there is nothing to plot
        continue
    print("in test data:")
    brief_col(test_data,col)

From the above plots, these are the important observations:<br/>
(1) A number of variables are bimodal or multimodal. It is a better idea to fit bimodal or multimodal distributions to them, both for outlier treatment and for better modeling.<br/>
(2) The train and test columns look almost similar, but in some cases there are significant differences in range and spread. So normalizing the data before prediction is needed.<br/>
(3) The target follows a compact, roughly normal distribution. We also need to check that the distribution of the predictions is similar.<br/>
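
As a rough sketch of the normalization mentioned in observation (2), one option is to standardize each column with statistics estimated on the training data only and then apply the same transform to the test data. The arrays below are hypothetical stand-ins for a single column, not the competition data, and z-scoring is only one of several reasonable choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for one train column and its test counterpart
train_col = rng.normal(loc=0.5, scale=0.2, size=1000)
test_col = rng.normal(loc=0.6, scale=0.3, size=500)

# Estimate the scaling statistics on the training data only
mu, sigma = train_col.mean(), train_col.std()

# Apply the same transform to both splits
train_scaled = (train_col - mu) / sigma
test_scaled = (test_col - mu) / sigma

print("train mean/std:", train_scaled.mean().round(3), train_scaled.std().round(3))
print("test  mean/std:", test_scaled.mean().round(3), test_scaled.std().round(3))
```

Reusing the training statistics keeps the two splits on the same scale, so any remaining difference in the test mean or spread stays visible instead of being normalized away.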

## <a id = 'section2'>Multimodal distributions and fitting of training data</a>:
In this section, we will go through each of the columns, try to find the optimal number of modes for it, and fit a suitable distribution to model the data properly.<br/>
The libraries we are using for this are sklearn, scipy and statsmodels.<br/>
For more on multimodal distribution fitting:<br/>
(1) [read jakevdp's blog (he is the author of sklearn's KernelDensity)](https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/)<br/>
(2) [read stackoverflow](https://stackoverflow.com/questions/6620471/fitting-empirical-distribution-to-theoretical-ones-with-scipy-python)<br/>
(3) [sklearn kde fitting](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html#sklearn.neighbors.KernelDensity)<br/>

In [None]:
from sklearn.neighbors import KernelDensity
from scipy.stats import gaussian_kde,norm
from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.nonparametric.kernel_density import KDEMultivariate

Using sklearn kernel density to fit multimodal distribution<br/>
The following example is adapted from [this example by sklearn](https://scikit-learn.org/stable/auto_examples/neighbors/plot_kde_1d.html#sphx-glr-auto-examples-neighbors-plot-kde-1d-py).
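
To make clear what a kernel density estimate computes, here is a minimal NumPy-only sketch of a one-dimensional Gaussian KDE: each sample contributes a Gaussian bump of width `bandwidth`, and the bumps are averaged. The helper name `gaussian_kde_1d`, the synthetic sample and the bandwidth value are assumptions made for illustration; the cells below use sklearn's `KernelDensity` instead.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Pairwise scaled distances, shape (len(grid), len(samples))
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    # Average the kernels over the samples and rescale by the bandwidth
    return kernel.mean(axis=1) / bandwidth

rng = np.random.default_rng(1)
# Hypothetical bimodal sample standing in for a column such as cont1
samples = np.concatenate([rng.normal(0.3, 0.05, 300), rng.normal(0.7, 0.05, 300)])
grid = np.linspace(0.0, 1.0, 501)
density = gaussian_kde_1d(samples, grid, bandwidth=0.03)

# The estimate should integrate to roughly 1 over the grid
area = float(np.sum(density) * (grid[1] - grid[0]))
print("approximate area under the density:", round(area, 3))
```

Plotting `density` against `grid` would show two bumps near 0.3 and 0.7 for this sample; a larger bandwidth smooths them together, a smaller one makes the curve noisier.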

In [None]:
import numpy as np
from sklearn.utils.fixes import parse_version

# `normed` is being deprecated in favor of `density` in histograms
if parse_version(matplotlib.__version__) >= parse_version('2.1'):
    density_param = {'density': True}
else:
    density_param = {'normed': True}

# ----------------------------------------------------------------------
# Plot the progression of histograms to kernels
def format_func(x, loc):
    if x == 0:
        return '0'
    elif x == 1:
        return 'h'
    elif x == -1:
        return '-h'
    else:
        return '%ih' % x

def plotstimate(X):    
    np.random.seed(1)
    X_plot = np.array(X)[:, np.newaxis]
    #X_plot = np.linspace(-5, 10, 1000)[:, np.newaxis]
    bins = np.linspace(-5, 10, 10)

    fig, ax = plt.subplots(1,2, sharex=True, sharey=True)
    fig.subplots_adjust(hspace=0.05, wspace=0.05)

    # histogram 1
    #ax[0, 0].hist(X, bins=bins, fc='#AAAAFF', **density_param)
    #ax[0, 0].text(-3.5, 0.31, "Histogram")

    # histogram 2
    #ax[0, 1].hist(X[:, 0], bins=bins + 0.75, fc='#AAAAFF', **density_param)
    #ax[0, 1].text(-3.5, 0.31, "Histogram, bins shifted")

    # tophat KDE
    kde = KernelDensity(kernel='tophat', bandwidth=0.75).fit(X_plot)
    log_dens = kde.score_samples(X_plot)
    ax[0].hist(np.exp(log_dens),fc = '#AAAAFF')
    #ax[0].scatter(X_plot[:, 0], np.exp(log_dens), fc='#AAAAFF')
    ax[0].set_title("Tophat Kernel Density")

    # Gaussian KDE
    kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(X_plot)
    log_dens = kde.score_samples(X_plot)
    ax[1].hist(np.exp(log_dens),fc = '#AAAAFF')
    #ax[1].fill(X_plot[:, 0], np.exp(log_dens), fc='#AAAAFF')
    ax[1].set_title("Gaussian Kernel Density")

In [None]:
plotstimate(train_data.sample(frac = 0.01)['cont1'])

In [None]:
plotstimate(train_data.sample(frac = 0.01)['cont2'])

In [None]:
plotstimate(train_data.sample(frac = 0.01)['cont3'])

This fits the densities well, but it gives no actionable metric that I can take from it. I will therefore fit Gaussian mixture models to the distributions to identify the separate components behind the bimodal variables.<br/>
We will follow the Gaussian mixture model section of [sklearn's documentation](https://scikit-learn.org/stable/modules/mixture.html).

### GMM models

In [None]:
import numpy as np
from sklearn.mixture import GaussianMixture
X = np.array(train_data[['cont1']])
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.means_)
print(gm.weights_)
print(gm.covariances_)

Now, this is a tangible result. Using it, we will split each clearly bimodal variable into two separate variables based on the predicted component label: each row keeps its value in the variable for its own component and gets a 0 in the other one.<br/>
We will also add binary variables denoting whether a row falls in the high or the low component.<br/>
Let's implement it now.

Before diving in, we will note which features are bimodal, which are normal, and so on.
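
As a toy illustration of the split described above, the sketch below builds the label and value columns for a handful of made-up numbers. The `means` and `labels` arrays are hypothetical stand-ins for what a fitted two-component mixture would return (`gm.means_` and `gm.predict(X)`). The sketch also assumes that "low" should mean the component with the smaller mean, since the component numbering of a mixture model is arbitrary.

```python
import numpy as np

# Hypothetical values of one bimodal column
values = np.array([0.21, 0.25, 0.78, 0.82, 0.30, 0.75])
# Hypothetical outputs of a fitted two-component mixture:
# component means (like gm.means_.ravel()) and per-row labels (like gm.predict(X))
means = np.array([0.80, 0.25])
labels = np.array([1, 1, 0, 0, 1, 0])

# Component numbering is arbitrary, so find which component has the smaller mean
low_comp = int(np.argmin(means))

label_low = (labels == low_comp).astype(int)
label_high = 1 - label_low

# Each row keeps its value in the column of its own mode and gets 0 in the other
val_low = values * label_low
val_high = values * label_high

print(label_low)
print(val_low)
print(val_high)
```

By construction `val_low + val_high` reproduces the original column, so the split only redistributes the information across two features, one per mode.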

In [None]:
bimodal = ['cont1','cont2','cont4','cont11','cont12','cont13','cont14']
normal = ['cont3','cont6','cont7','cont9','cont10']
poisson = ['cont5','cont8']
#cont10 has a high peak near 0.8; model it as normal, but add a flag for values at or near 0.8.
#for the poisson-like columns, flag values near the lowest values, which are near 0.3 for both.

In [None]:
for col in bimodal:
    X = np.array(train_data[[col]])
    gm = GaussianMixture(n_components=2, random_state=0).fit(X)
    # component numbering is arbitrary, so treat the component with the smaller mean as "low"
    low_comp = int(np.argmin(gm.means_.ravel()))
    train_data[col+'_label_low'] = (gm.predict(X) == low_comp).astype(int)
    train_data[col+'_label_high'] = 1-train_data[col+'_label_low']
    train_data[col+'_val_low'] = train_data[col]*train_data[col+'_label_low']
    train_data[col+'_val_high'] = train_data[col]*train_data[col+'_label_high']
    # reuse the mixture fitted on the training data to label the test data
    X_test = np.array(test_data[[col]])
    test_data[col+'_label_low'] = (gm.predict(X_test) == low_comp).astype(int)
    test_data[col+'_label_high'] = 1-test_data[col+'_label_low']
    test_data[col+'_val_low'] = test_data[col]*test_data[col+'_label_low']
    test_data[col+'_val_high'] = test_data[col]*test_data[col+'_label_high']
def is_low_val(x):
    if x>0.2 and x<=0.3:
        return 1
    return 0
def is_near_peak(x):
    if x>0.75 and x<=0.85:
        return 1
    return 0
for col in poisson:
    train_data[col+'_lowest_val'] = train_data[col].apply(lambda x: is_low_val(x))
    test_data[col+'_lowest_val'] = test_data[col].apply(lambda x: is_low_val(x))
train_data['cont10_nearHighPeak'] = train_data['cont10'].apply(lambda x: is_near_peak(x))
test_data['cont10_nearHighPeak'] = test_data['cont10'].apply(lambda x: is_near_peak(x))

In [None]:
train_data.head()

In [None]:
for col in bimodal:
    brief_col(train_data,col+'_val_low')
    brief_col(train_data,col+'_val_high')

We have clearly obtained separate, roughly normal distributions from the bimodal ones. Now that these features are created, let's move on to training different models.

## <a id='section3'>Modeling Efforts</a>:
We have tried out the following methods:<br/>
(1) [Linear Model](#linear)<br/>
(2) [Random forest regressor](#rf)<br/>
(3) [MARS spline regressor](#mars)<br/>
(4) [Xgboost regressor](#xgb)<br/>
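
The models below are scored on a hold-out split with scikit-learn's `r2_score`, which expects the true values first and the predictions second. The NumPy sketch below spells out R² next to RMSE and shows why the argument order matters for the former but not the latter. The helper names `r2` and `rmse` and the small arrays are made up for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical targets and predictions
y_true = np.array([7.0, 8.0, 9.0, 8.0])
y_pred = np.array([7.5, 8.0, 8.5, 8.0])

print("R^2 :", r2(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
# Swapping the arguments changes R^2 (the variance is taken over the first argument)
# but leaves RMSE unchanged
print("R^2 with swapped arguments:", r2(y_pred, y_true))
```

Higher is better for R², while lower is better for RMSE, so it helps to label printed scores with the metric that was actually computed.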

In [None]:
print(train_data.shape)
print(test_data.shape)

### <a id = 'linear'>Linear model</a>

In [None]:
from sklearn.model_selection import train_test_split as tts
Y_train = train_data['target']
X_train = train_data.drop('target',axis = 1)
X_trainer,X_train_val,Y_trainer,Y_train_val = tts(X_train,Y_train,test_size = 0.2,
                                                  shuffle = True)

In [None]:
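from sklearn.linear_model import LinearRegression as linreg
from sklearn.metrics import r2_score as rsc
# the `normalize` argument has been removed from recent scikit-learn releases, so it is dropped here
linmodel = linreg(n_jobs = -1)
linmodel.fit(X_trainer,Y_trainer)
pred_trainer = linmodel.predict(X_trainer)
# r2_score expects (y_true, y_pred)
print("train r2 is:",rsc(Y_trainer,pred_trainer))
pred_test = linmodel.predict(X_train_val)
print("validation r2 is:",rsc(Y_train_val,pred_test))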

### <a id = 'rf'>Random Forest Regressor</a>
Bigger models are pretty slow, so a [GPU model](https://medium.com/rapids-ai/accelerating-random-forests-up-to-45x-using-cuml-dfb782a31bea) is needed. This will be implemented in later versions of the notebook.

### I have left the fine-tuning of the models to you. You can tune them on your own and submit.
Tips for optimizing:<br/>
(1) Increase n_estimators<br/>
(2) Increase max_depth and check the effect<br/>
(3) Tune min_samples_split by checking different values<br/>
(4) Try increasing max_samples and check when the R² score improves<br/>
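
One way to follow these tips systematically is a small cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV` on a synthetic dataset that stands in for `X_trainer` and `Y_trainer`. The grid values are arbitrary and kept deliberately tiny so that it runs quickly; they are not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic problem standing in for X_trainer / Y_trainer
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Hypothetical, deliberately tiny grid; widen it for real tuning
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [4, 8],
    "min_samples_split": [2, 10],
    "max_samples": [0.5, 0.9],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="r2",  # same metric as the r2_score used in this notebook
    cv=3,
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best cross-validated R^2:", round(search.best_score_, 3))
```

On the full training data an exhaustive grid can be slow, so sampling a subset of rows or switching to `RandomizedSearchCV` may be more practical.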

In [None]:
from sklearn.ensemble import RandomForestRegressor as rfreg
from sklearn.metrics import r2_score as rsc
regressor = rfreg(n_estimators = 128,
                  max_depth = 4,
                  min_samples_split = 2,  # an integer value must be at least 2
                  max_features = 1.0,  # use all features, which is what 'auto' meant for regressors
                  max_samples = 0.1,
                  n_jobs = -1)
regressor.fit(X_trainer,Y_trainer)
pred_train = regressor.predict(X_trainer)
# rsc is r2_score, so these are R^2 values, not RMSE
print("train r2 is:",rsc(Y_trainer,pred_train))
pred_train_val = regressor.predict(X_train_val)
print("validation r2 is:",rsc(Y_train_val,pred_train_val))

### <a id= 'mars'>MARS model</a>
Read about it from [machine learning mastery](https://machinelearningmastery.com/multivariate-adaptive-regression-splines-mars-in-python/).

In [None]:
!pip install sklearn-contrib-py-earth
import pyearth

In [None]:
from pyearth import Earth
mars_model = Earth()
mars_model.fit(X_trainer,Y_trainer)
pred_train = mars_model.predict(X_trainer)
# rsc is r2_score, so these are R^2 values, not RMSE
print("train r2 is:",rsc(Y_trainer,pred_train))
pred_train_val = mars_model.predict(X_train_val)
print("validation r2 is:",rsc(Y_train_val,pred_train_val))

### <a id='xgb'>Xgboost regressor</a>

In [None]:
import xgboost as xgb
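xg_reg = xgb.XGBRegressor(objective ='reg:squarederror',  # 'reg:linear' is the deprecated name for this objective
                          colsample_bytree = 0.3, 
                          learning_rate = 0.3,
                          max_depth = 2, 
                          reg_alpha = 0,  # L1 regularization term
                          n_estimators = 100)
xg_reg.fit(X_trainer,Y_trainer)
pred_trainer = xg_reg.predict(X_trainer)
# r2_score expects (y_true, y_pred)
print("train r2 is:",rsc(Y_trainer,pred_trainer))
pred_test = xg_reg.predict(X_train_val)
print("validation r2 is:",rsc(Y_train_val,pred_test))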

### Create submission file

In [None]:
submission_file = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2021/sample_submission.csv')
pred_submission = regressor.predict(test_data)
submission_file['target'] = pred_submission
submission_file.to_csv('third_submission.csv',index = False)