# TPS-JUNE 2021 :

![](https://storage.googleapis.com/kaggle-competitions/kaggle/25225/logos/header.png?t=2021-01-27-17-34-26)


## UPVOTE if this helps you :)

This is a quick starter for the Kaggle Tabular Playground Series (TPS).

All the major steps are covered, using the most fundamental approach, to reach a solid baseline accuracy.

This notebook not only shows an approach that carries over to other competitions, but also examines the small changes that lead to better data engineering.

#### Data Gathering :

In [None]:
root = '../input/tabular-playground-series-jun-2021/'
train_path = root + 'train.csv'
test_path = root + 'test.csv'
subm_path = root + 'sample_submission.csv'

#### Libraries :

Importing the basic data manipulation and visualization libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

#### Data Loading :
After loading the data into data frames, we can visualize the deeper patterns or manipulate the data as we wish.

In [None]:
train_df = pd.read_csv(train_path)

train_df.head()

In [None]:
test_df = pd.read_csv(test_path)

test_df.head()

In [None]:
samp_sub = pd.read_csv(subm_path)

samp_sub.head()

The sample submission format tells us that we need to predict the likelihood of each target class. These values are between 0.0 and 1.0.

#### Approach :

 So, now we have to decide on the approach.
 1. We can process the data and feed it through a [neural network](https://en.wikipedia.org/wiki/Neural_network) whose output is a [softmax layer](https://en.wikipedia.org/wiki/Softmax_function).
 2. We can apply [predict_proba()](https://discuss.analyticsvidhya.com/t/what-is-the-difference-between-predict-and-predict-proba/67376) to standard machine learning bagging or boosting models and prepare the submission file.
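
Both options end in the same kind of output: one row of class probabilities per sample. As a minimal sketch (the `softmax` helper and the toy `logits` array are illustrative assumptions, not part of this notebook's pipeline), this is how raw network scores become such probabilities:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax: turns raw scores into probabilities that sum to 1."""
    # Subtracting the row maximum keeps np.exp from overflowing.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical raw scores for 2 samples and 3 classes.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)
print(probs.round(3))
print(probs.sum(axis=1))
```

A fitted scikit-learn style classifier returns an array of the same shape from `predict_proba()`, which is the route taken in the rest of this notebook.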

### Exploratory Data Analysis and Data Processing :
---

We'll try to visualize the deeper patterns in the data and find the anomalies that should be removed to prepare the best trainable data.

As an additional point,
**BEST TRAINABLE DATA** is data that has no noise, duplicates or outliers.

1. #### Target Value Count Distribution: 
---
First, we check the distribution of the target values, because a large imbalance between the classes can lead to poor model learning.

In [None]:
# Target Value Count Distribution:
target_mass = train_df['target'].value_counts()
values = target_mass.values.tolist()
indexes = target_mass.index.tolist()

# plt.subplots returns (figure, axes), in that order
fig, axes = plt.subplots(1,2,figsize=(15,6))
axes[0].pie(values , labels = indexes)   # share of each class
axes[1].bar(indexes,values)              # absolute count of each class
plt.show()

We can see that some target classes are heavily represented while others have very few rows.

#### Approach :
We could keep the same number of rows for every target class. But that would reduce the data size, and as we do not know which rows to remove, we might even remove important ones. So, we will skip target class equalization.
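
To make the trade-off concrete, here is a minimal sketch of what equalizing the classes by undersampling would look like. The toy frame `toy_df` and its class counts are made up for illustration; only the `target` column name follows this notebook.

```python
import pandas as pd

# Hypothetical imbalanced data: 6 rows of Class_1, 3 of Class_2, 1 of Class_3.
toy_df = pd.DataFrame({
    'feature_0': range(10),
    'target': ['Class_1'] * 6 + ['Class_2'] * 3 + ['Class_3'] * 1,
})

# The size of the rarest class decides how many rows every class may keep.
min_count = int(toy_df['target'].value_counts().min())

# Draw the same number of rows from each class.
balanced_df = toy_df.groupby('target').sample(n=min_count, random_state=0)

print(len(toy_df), '--->', len(balanced_df))
print(balanced_df['target'].value_counts())
```

In this toy case 7 of the 10 rows are discarded, which is the kind of data loss the notebook avoids by skipping equalization.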

 2. #### Correlation :
---
Now, we must check the data correlation. In this part, we'll be visualizing the feature-to-feature correlation.

In [None]:
fet_set = train_df.drop(labels=['id','target'],axis=1)
def plot_diag_heatmap(data):
    corr = data.corr()
    mask = np.triu(np.ones_like(corr, dtype=bool))
    f, ax = plt.subplots(figsize=(11, 9))
    sns.heatmap(corr, mask=mask, cmap='YlGnBu', center=0,square=True, linewidths=1, cbar_kws={"shrink": 1.0})
plot_diag_heatmap(fet_set)

Here we can see that some features are light across the whole plot, which marks them as weakly correlated with the other features.
So, we are going to drop those features from both the train and test data.

In [None]:
corr = train_df.iloc[:,1:-1].corr()

In [None]:
corr

We are going to drop the features whose mean correlation with the other features is below a baseline of 0.06.

In [None]:
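# Mean correlation of each feature with all the other features
# (the self-correlation of 1 is subtracted), computed the same way as the drop criterion.
mean_corr = (corr.sum()-1)/(len(corr)-1)
plt.plot(mean_corr)
plt.xticks([])
plt.plot(np.ones(len(corr))*0.06,label = 'baseline',color = 'r')
plt.legend()
plt.show()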

In [None]:
for col in corr.columns:
    if ((sum(corr[col])-1)/(len(corr)-1)) <0.06:
        print(col , (sum(corr[col])-1)/(len(corr)-1))

In [None]:
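# Collect the low-correlation features once, then drop them from both frames
low_corr_cols = [col for col in corr.columns
                 if (sum(corr[col])-1)/(len(corr)-1) < 0.06]
train_df.drop(columns=low_corr_cols, inplace=True)
test_df.drop(columns=low_corr_cols, inplace=True)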

In [None]:
train_df.head()

In [None]:
train_df.describe()

3. #### Outliers :
---
Outliers cannot always be clearly distinguished, but they remain in the data as noise. We can visualize them using [seaborn boxplots](https://seaborn.pydata.org/generated/seaborn.boxplot.html).

In [None]:
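fig,axes = plt.subplots(1,5,figsize=(24,3))
i=1
for col in train_df.columns[1:-1]:
    plt.subplot(1,5,i)
    # keyword arguments: recent seaborn versions no longer accept x and y positionally
    sns.boxplot(x=train_df['target'],y=train_df[col])
    plt.xticks([])
    i+=1
    # after five plots, show the row and start a new figure (unless this was the last feature)
    if i%5==1 and col!=train_df.columns[-2]:
        i=1
        plt.show()
        fig,axes = plt.subplots(1,5,figsize=(24,3))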

In the boxplots, we can see that a large share of the points fall outside the whiskers, i.e. they are flagged as outliers. So, we need to process the data carefully when removing them.

#### Approach:

The approaches have been taken from [here](https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/).

1. Interquartile range (IQR)
2. Z-score



In [None]:
from scipy.stats import zscore

In [None]:
# Using zscore: keep rows whose absolute z-score is at most 2.7 in every feature.
# Work on a copy so that train_df itself is left untouched.

temp_df = train_df.copy()

for col in temp_df.columns[1:-1]:
    # z-scores are recomputed on the rows that survived the previous columns
    temp_df = temp_df[np.abs(zscore(temp_df[col])) <= 2.7]
print(train_df.shape , '--->' , temp_df.shape)

In [None]:
from scipy.stats import iqr

In [None]:
# Using interquartile range

temp_df = train_df

for col in temp_df.columns[1:-1]:
    iqr_val = iqr(temp_df[col])
    # The fences start from the 3rd and 97th percentiles (not the quartiles),
    # so they are wider than in the textbook IQR rule.
    low = np.quantile(temp_df[col] , 0.03)
    high = np.quantile(temp_df[col] , 0.97)
    temp_df = temp_df[temp_df[col]>=low-1.5*iqr_val]
    temp_df = temp_df[temp_df[col]<=high+1.5*iqr_val]
print(train_df.shape,'--->',temp_df.shape)

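In previous versions I used a very small IQR range. In this version the fences are the 3rd and 97th percentiles, each widened by 1.5 times the IQR, so only the most extreme values are removed. Note that this cell starts again from `train_df`, so its result replaces the z-score result above.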
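
For comparison, the textbook IQR rule places the fences at the 25th and 75th percentiles rather than the 3rd and 97th. The sketch below applies that stricter variant to a made-up array; `values` and the helper name `tukey_mask` are assumptions for illustration only.

```python
import numpy as np

def tukey_mask(x, k=1.5):
    """Boolean mask that is True for values inside the Tukey fences."""
    q1, q3 = np.quantile(x, [0.25, 0.75])
    spread = q3 - q1  # interquartile range
    return (x >= q1 - k * spread) & (x <= q3 + k * spread)

# Hypothetical feature values with one obvious outlier.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 50])
mask = tukey_mask(values)
print(values[mask])   # values kept
print(values[~mask])  # values flagged as outliers
```

The wider the starting percentiles, the fewer rows are removed, which is why the notebook's variant is gentler on the data.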

In [None]:
cleaned_train_df = temp_df.copy()  # independent copy, so later in-place edits do not touch temp_df

4. #### Dropping ID :

In [None]:
cleaned_train_df.drop('id',axis=1,inplace=True)
idx = test_df['id']   # keep the test ids for the submission file
test_df.drop('id',axis=1,inplace=True)

5. #### Dropping Duplicates :
---
Now we are going to drop the duplicate rows. This reduces the number of training rows without losing information. Dropping duplicate features is left commented out because it takes too long.

In [None]:
cleaned_train_df.drop_duplicates(inplace=True)
# Dropping duplicate features as well is skipped: it takes too much unnecessary time.
#cleaned_train_df = cleaned_train_df.T.drop_duplicates().T

In [None]:
cleaned_train_df.shape

6. #### Checking other features :
---
 Let's check whether we can find any other pattern.

In [None]:
plt.figure(figsize=(10,4))
# one scatter series per class: feature_0 value against the row index
for i in range(1,10):
    t_df = cleaned_train_df[cleaned_train_df['target']=='Class_'+str(i)]
    plt.scatter(t_df['feature_0'],t_df.index,label='Class_'+str(i),s=7)
plt.legend()
plt.show()

Looks like we can move on to the next step.

7. #### Splitting the data into train and validation :
---
 We are going to use an 80-20 train-validation split. We are also going to convert the target feature into numerical values.
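
The label conversion relies on every class name ending in its number. A minimal sketch of that mapping and its inverse, on a few hand-written labels (the `labels` series is illustrative, not taken from the training data):

```python
import pandas as pd

# Hypothetical target labels in the same 'Class_<n>' format as the training data.
labels = pd.Series(['Class_1', 'Class_9', 'Class_4'])

# 'Class_1' -> 0, ..., 'Class_9' -> 8: strip the prefix, then shift to start at 0.
encoded = labels.str.replace('Class_', '', regex=False).astype(int).to_numpy() - 1

# Inverse mapping, useful for naming the columns of a probability submission.
decoded = ['Class_' + str(i + 1) for i in encoded]

print(encoded)
print(decoded)
```

Stripping the prefix, rather than reading only the last character, would also cope with class numbers of more than one digit.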

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
def split_data(test_size,data):
    # shuffle the rows; the fixed seed keeps the split reproducible
    data = data.sample(frac=1, random_state=1)
    x_train = data.drop('target',axis=1)
    y_1 = data['target'].to_numpy()
    X_train , X_val , y_1 , y_2 = train_test_split( x_train , y_1 ,
                                                         test_size = test_size ,
                                                        random_state =1 ,
                                                        stratify = y_1)
    # 'Class_1' ... 'Class_9' -> 0 ... 8 (uses the last character of the label)
    y_train = [int(value[-1])-1 for value in y_1]
    y_val = [int(value[-1])-1 for value in y_2]
    return X_train , X_val , np.array(y_train) , np.array(y_val)

In [None]:
X_train , X_val , y_train , y_val = split_data(0.2,cleaned_train_df)
X_test = test_df[X_train.columns]

In [None]:
X_train.shape , X_val.shape , y_train.shape , y_val.shape , X_test.shape

8. #### Scaling :
---
Now we need to scale the data, as features with different magnitudes can distort models that are sensitive to scale.
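
The key point when scaling is that the mean and standard deviation come from the training split only and are then reused for the validation and test data. A minimal NumPy sketch of that idea, with made-up arrays (`train` and `val` are assumptions for illustration):

```python
import numpy as np

# Hypothetical single-feature splits.
train = np.array([[1.0], [2.0], [3.0]])
val = np.array([[2.0], [4.0]])

# Statistics are estimated on the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation, as StandardScaler uses

train_scaled = (train - mean) / std
val_scaled = (val - mean) / std  # reuse the training statistics

print(train_scaled.ravel())
print(val_scaled.ravel())
```

Fitting the scaler on validation or test rows would leak information about them into training, which is why only `transform` is applied to those splits.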

In [None]:
from sklearn.preprocessing import StandardScaler as scaler

In [None]:
def scale(train,test,validation):
    sc = scaler()
    columns = train.columns
    train = sc.fit_transform(train)
    test = sc.transform(test)
    validation = sc.transform(validation)

    train = pd.DataFrame(train , columns = columns)
    test = pd.DataFrame(test , columns = columns)
    validation = pd.DataFrame(validation , columns = columns)

    return train , test , validation

In [None]:
X_train , X_test , X_val = scale(X_train , X_test , X_val)

In [None]:
X_train.head()

Now, we will check the target distribution again.

In [None]:
# Target Value Count Distribution:
tm = pd.DataFrame(y_train,columns=['x'])
target_mass = tm['x'].value_counts()
values = target_mass.values.tolist()
indexes = target_mass.index.tolist()

# plt.subplots returns (figure, axes), in that order
fig, axes = plt.subplots(1,2,figsize=(15,6))
axes[0].pie(values , labels = indexes)
axes[1].bar(indexes,values)
plt.show()

It changed heavily :O

### Model Generation and Evaluation :

 As each model is trained and validated on a single split of the data, the accuracy shown here may not be a reliable metric. So, we fit every model and generate a submission file for each.

In [None]:
# importing models

from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.ensemble import ExtraTreesClassifier as ext
from xgboost import XGBClassifier as xgb
from lightgbm import LGBMClassifier as lgb
from catboost import CatBoostClassifier as cbt

In [None]:
# Function to train, report accuracy and predict

def train_and_predict(model , x_1  , x_2 , x_3 , y_1 , y_2):
    # column names of the submission: Class_1 ... Class_9
    labels = ['Class_'+str(i+1) for i in range(9)]
    model.fit(x_1 , y_1)
    print('Training Completed..........')
    print('Train Accuracy : ',model.score(x_1,y_1))
    print('Validation Accuracy : ',model.score(x_2 , y_2))
    print('Model Prediction started....')
    y_pred = model.predict_proba(x_3)
    final_df = pd.DataFrame(y_pred , columns = labels)
    # prepend the test ids saved earlier to build the actual submission file
    final_df = pd.concat([idx,final_df]  , axis = 1)
    # for a dry run without real ids, swap the line above for these two:
    #idxx = pd.DataFrame(np.ones(len(idx)))
    #final_df = pd.concat([idxx,final_df],axis=1)
    return final_df

In [None]:
clf1 = rfc(random_state = 2)
clf2 = ext(random_state = 2)
clf3 = xgb()
clf4 = lgb()
clf5 = cbt(verbose=0)
models = [ clf1 , clf2 , clf3 , clf4 , clf5]
names = ['rfc' , 'ext' , 'xgb' , 'lgb' , 'cbt']

In [None]:
for i in range(len(models)):
    model = models[i]
    print(names[i] , 'model has been opted for training...........')
    submission = train_and_predict(model , X_train , X_val , X_test , y_train , y_val)
    print('submission file created................................\n\n')
    submission.to_csv(names[i]+'.csv',index=False)
print('Task Completed.............................................')
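
Because the submission consists of class probabilities, accuracy alone says little about how well calibrated they are. A probability-based score such as multi-class log loss can be computed on the validation split as well. The sketch below is a plain NumPy version on toy numbers; `multiclass_log_loss`, `y_true` and `proba` are illustrative names, and treating this score as a useful proxy for the leaderboard is an assumption.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    proba = np.clip(proba, eps, 1 - eps)            # avoid log(0)
    picked = proba[np.arange(len(y_true)), y_true]  # probability of the true class
    return -np.mean(np.log(picked))

# Hypothetical integer labels and predicted probabilities for 3 classes.
y_true = np.array([0, 2, 1])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.3, 0.4, 0.3]])

print(multiclass_log_loss(y_true, proba))
```

With the names used in this notebook, the call would presumably be `multiclass_log_loss(y_val, model.predict_proba(X_val))` once a model has been fitted, provided all nine classes are present in the training split.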

# THANK YOU for visiting !!!!

## You can visit my other works on [Kaggle](https://www.kaggle.com/sagnik1511/code) or on [GitHub](https://github.com/sagnik1511?tab=repositories).

## And Always ....................

![](https://i.ytimg.com/vi/GduXLWFxKhQ/maxresdefault.jpg)