![](https://storage.googleapis.com/kaggle-competitions/kaggle/25225/logos/header.png?t=2021-01-27-17-34-26)


# Tabular Playground Walkthrough :
---

In this notebook we are going to walk through a whole Kaggle competition from one end to the other.

We follow a sequence of steps, each broken into smaller operations: loading the data, exploring it, preprocessing it, and training models to reach a good accuracy.

We are going to use tree-based ensemble classifiers from scikit-learn, XGBoost and LightGBM to predict.

# UPVOTE if you like this notebook and also to keep the developer sane :)

## Data Loading :
---
First we have to gather the data paths and load the files into dataframes for further manipulation.

In [None]:
# Data Paths 

train_path = '../input/tabular-playground-series-may-2021/train.csv'

test_path = '../input/tabular-playground-series-may-2021/test.csv'

sample_submission_path = '../input/tabular-playground-series-may-2021/sample_submission.csv'

# Importing primary libraries 

# Data Manipulation
#-------------------------------
import os
import pandas as pd
import numpy as np

# Data Visualization
#--------------------------------
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Train Data

train_data = pd.read_csv( train_path )

train_data.head()

In [None]:
# Test Data

test_data = pd.read_csv( test_path )

test_data.head()

In [None]:
# Sample submission Data 

samp_sub = pd.read_csv( sample_submission_path )

samp_sub.head()

## Primary Visualization & Exploratory Data Analysis:
---

Now we are going to check the basic structure of the data and how we can manipulate it to produce the trainable data.

In [None]:
# Overall train data structure

print(train_data.info())

train_data.describe()

In [None]:
# Overall test data structure

print(test_data.info())

test_data.describe()

We have found that -

 1. From this summary we can see there is one ***id*** column; all other non-target features are integer-valued, and the target feature is categorical.
 2. There are no null values, so we do not have to fill in any missing entries.
 3. Some of the features are binary, while the others are numerical.
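
The observations above can also be checked in code rather than read off the `info()` output. Here is a minimal sketch on a small made-up frame (the name `toy` and all its values are invented for illustration, they are not the competition data):

```python
import pandas as pd

# Toy stand-in for train_data; every value here is invented.
toy = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'feature_0': [0, 1, 0, 1],
    'feature_1': [3, 0, 7, 2],
    'target': ['Class_1', 'Class_2', 'Class_2', 'Class_4'],
})

# Total number of missing cells; 0 means nothing needs to be filled in.
n_missing = int(toy.isnull().sum().sum())

# Numeric feature columns (everything except the id and the target).
numeric_features = (
    toy.drop(columns=['id', 'target'])
       .select_dtypes(include='number')
       .columns.tolist()
)

# Features with at most two distinct values are candidates for binary features.
binary_features = [c for c in numeric_features if toy[c].nunique() <= 2]

print(n_missing, numeric_features, binary_features)
```

Running the same three checks on `train_data` would confirm or refute each point in the list.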

In [None]:
for col in train_data.columns:
    print(col, ' : ', train_data[col].dtype ,end = ' | ')

### Target Distribution:
---

Now, let's check the distribution of the target feature, as it may give us leads towards preparing the best trainable data.

In [None]:
# Target Value Count Distribution:

target_mass = train_data['target'].value_counts()
values = target_mass.values.tolist()
indexes = target_mass.index.tolist()

fig, ax = plt.subplots(1,2,figsize=(15,6))  # subplots returns (figure, axes)
plt.subplot(1,2,1)
plt.pie(values , labels = indexes)
plt.subplot(1,2,2)
plt.bar(indexes,values)
plt.show()

We can see that the target classes are not evenly distributed.
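
How uneven the target classes are can be put into numbers as well as pictures. Below is a small sketch with invented labels (the series `labels` is hypothetical); on the real data the same two lines would be applied to `train_data['target']`:

```python
import pandas as pd

# Invented labels, only to illustrate the calculation.
labels = pd.Series(['Class_2'] * 6 + ['Class_3'] * 2 + ['Class_1'] + ['Class_4'])

# Fraction of rows per class, sorted by class name.
class_share = labels.value_counts(normalize=True).sort_index()

# Ratio between the largest and the smallest class.
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print('imbalance ratio:', imbalance_ratio)
```

A ratio far above 1 is a hint that a stratified split is worth using later on.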

Now we should check each feature's distribution, in case we find a feature that carries no significant information and can be omitted.

In [None]:
fig, ax = plt.subplots(10,5,figsize=(15,15))  # subplots returns (figure, axes)
for i in range(50):
    plt.subplot(10,5,i+1)
    arr =train_data['feature_'+str(i)].tolist()
    plt.scatter(range(len(train_data)),arr,s = 0.2)
plt.show()

So, we've found that every single feature has a wide range of data spread.

### Heatmap:
---
So, we should check their correlations to find any other information.

In [None]:
def plot_diag_heatmap(data):
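    # Correlate numeric columns only; the string-valued target cannot be correlated
    corr = data.select_dtypes(include='number').corr()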
    mask = np.triu(np.ones_like(corr, dtype=bool))
    f, ax = plt.subplots(figsize=(11, 9))
    sns.heatmap(corr, mask=mask, cmap='YlGnBu', vmax=.01, center=0,square=True, linewidths=.5, cbar_kws={"shrink": 1.0})

In [None]:
plot_diag_heatmap(train_data.iloc[:,1:])

In [None]:
# Mean of every feature, computed separately for each target class
colors = ['red' , 'green' , 'blue' , 'orange']
feature_cols = train_data.columns[1:-1]   # skip the id and the target
plt.figure(figsize=(20,7))
for i in range(4):
    label = 'Class_'+str(i+1)
    class_rows = train_data[train_data['target']==label]
    # one line per class; the label is what plt.legend() displays
    plt.plot(class_rows[feature_cols].mean().to_numpy(), color=colors[i], label=label)
plt.legend()
plt.show()

We can see that the per-class mean values differ only negligibly, so we cannot drop the rows of any single target class.

### Outliers Detection :
---

Now we should check for outliers in this data, since removing them may give better trainable data.

In [None]:
fig, ax = plt.subplots(7,7,figsize=(25,20))  # subplots returns (figure, axes)
plt.suptitle('Outliers Detection in Train data',size=20)
for i in range(7):
    for j in range(7):
        plt.subplot(7,7,i*7+j+1)
        sns.violinplot(x=train_data['target'],y=train_data.iloc[:,i*7+j+1])
        plt.title(train_data.columns[i*7+j+1])
plt.show()
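
Violin plots show outliers only visually. One common numeric complement, which is not part of this notebook's pipeline, is the 1.5 x IQR rule. The sketch below assumes that rule; the helper `iqr_outlier_counts` and the frame `toy` are hypothetical names with invented values:

```python
import pandas as pd

def iqr_outlier_counts(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include='number')
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Compare every column against its own lower and upper bound.
    below = numeric.lt(q1 - k * iqr, axis='columns')
    above = numeric.gt(q3 + k * iqr, axis='columns')
    return (below | above).sum()

# Invented values: feature_0 holds one extreme entry, feature_1 holds none.
toy = pd.DataFrame({
    'feature_0': [0, 1, 0, 1, 0, 1, 0, 40],
    'feature_1': [2, 3, 2, 3, 2, 3, 2, 3],
})

counts = iqr_outlier_counts(toy)
print(counts)
```

For heavily skewed count features this rule tends to flag many rows, so the counts are better read as a guide than as a list of rows to delete.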

### Checking for any 2D pattern :
---

As there are many features, there might be a 2D pattern that helps to get a good sense of the data.


In [None]:
frame_pattern = train_data.iloc[:,1:-1].to_numpy()
frame_pattern.shape
fig, ax = plt.subplots(5,5,figsize=(15,10))  # subplots returns (figure, axes)
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.imshow(frame_pattern[i].reshape(5,10)/255.0)
    plt.title(train_data['target'][i])
plt.show()


So, it looks like the data has no significant 2D pattern, and we leave this part here.

## Preprocessing :
---
We're doing some basic data processing to prepare the trainable data.

In [None]:
# Dropping the id columns
# (axis is passed by keyword; the positional form was removed in pandas 2.0)

Train = train_data.drop('id', axis=1)

Test = test_data.drop('id', axis=1)

### MinMaxScaler :
---

We are going to rescale every feature so that its values in the train data lie between 0 and 1.

In [None]:
# Min-max scaling :

def minmaxscaler(data, fin):
    # Scale each numeric feature with the minimum and maximum of `data` (train)
    # and apply the same statistics to `fin` (test).
    for feature in fin.columns:
        if data[feature].dtype != 'object':
            min_value = data[feature].min()
            max_value = data[feature].max()
            data[feature] = (data[feature]-min_value) / (max_value-min_value)
            fin[feature] = (fin[feature]-min_value) / (max_value-min_value)
    
    return data,fin

In [None]:
Train,Test = minmaxscaler(Train,Test)

In [None]:
Train.head()
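
The same scaling is available in scikit-learn as `MinMaxScaler`. The sketch below is an alternative under that assumption, not the function used in this notebook, and the two small arrays are invented. It also shows that test values outside the train range end up outside 0 to 1:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented values standing in for two train and test feature columns.
train_values = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]])
test_values = np.array([[2.5, 25.0], [12.0, 10.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_values)  # learns min and max from train only
test_scaled = scaler.transform(test_values)        # reuses the train min and max

# The same computation written out by hand, for comparison.
col_min = train_values.min(axis=0)
col_max = train_values.max(axis=0)
manual = (test_values - col_min) / (col_max - col_min)

print(train_scaled)
print(test_scaled)
```

The test value 12.0 is larger than the train maximum of 10.0, so it scales to a value above 1. That is expected when the statistics come from the train data only.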

### Drop Low variation data :
---
 Features with a low coefficient of variation (standard deviation divided by mean, which is what `scipy.stats.variation` computes) are poor training data, so we must drop those features.

In [None]:
from scipy.stats import variation as var

In [None]:
for col in Test.columns:   # Test has no target column, so no column needs skipping
    print(col,' : ',var(Train[col]))
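
Note that `scipy.stats.variation` returns the coefficient of variation (standard deviation divided by mean), not the variance. A short check on invented numbers makes the difference visible:

```python
import numpy as np
from scipy.stats import variation

# Invented sample with mean 5 and population standard deviation 2.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

cv = variation(values)                    # coefficient of variation
manual_cv = values.std() / values.mean()  # population std (ddof=0) over mean
variance = values.var()                   # the variance is a different quantity

print(cv, manual_cv, variance)
```

Because the value is relative to the mean, a threshold on it behaves differently from a threshold on the raw variance.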

In [None]:
# dropping the features with a low coefficient of variation
def drop_low_var_values(data,threshold):
    labels = []
    for col in data.columns:
        if data[col].dtype != 'object':
            if var(data[col]) >= threshold:
                labels.append(col)
        else:
            labels.append(col)
    new_data = data[labels]
    print(data.shape[1],' features ------> ',new_data.shape[1],' features .')
    return new_data

In [None]:
Train_data = drop_low_var_values(Train,1.3)
Test_data = Test[Train_data.columns[:-1]]

### Preparing Train and validation :
---
We're going to split the data and hold out 20% of the train data for validation.
 

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
def split_data(test_size,data):
    data = data.sample(frac=1)
    x_train = data.drop('target', axis=1)
    y_1 = data['target']
    x_train = x_train.to_numpy()
    y_1 = y_1.to_numpy()
    X_train , X_val , y_1 , y_2 = train_test_split( x_train , y_1 ,
                                                         test_size = test_size ,
                                                        random_state =1 ,
                                                        stratify = y_1)
    y_train = []
    y_val = []
    for value in y_1:
        y_train.append(int(value[-1])-1)
    for value in y_2:
        y_val.append(int(value[-1])-1)
    return X_train , X_val , np.array(y_train) , np.array(y_val)

In [None]:
X_train , X_val , y_train , y_val = split_data(0.2,Train_data)
X_test = Test_data.to_numpy()   # same array type as X_train and X_val

In [None]:
X_train.shape , X_val.shape , y_train.shape , y_val.shape , X_test.shape
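
The `stratify` argument is what keeps the class proportions the same in the train and validation parts. Here is a small sketch on invented labels (all names in it are hypothetical), together with the `Class_k` to `k-1` encoding used in `split_data`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented, deliberately imbalanced labels and a dummy feature column.
y = np.array(['Class_1'] * 10 + ['Class_2'] * 60 + ['Class_3'] * 20 + ['Class_4'] * 10)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

def class_share(labels):
    """Return the fraction of rows per class as a dict."""
    names, counts = np.unique(labels, return_counts=True)
    return dict(zip(names.tolist(), (counts / counts.sum()).tolist()))

# Encode 'Class_1'..'Class_4' as the integers 0..3.
encoded = np.array([int(name[-1]) - 1 for name in y_tr])

print(class_share(y_tr))
print(class_share(y_va))
```

Both dictionaries should show the same shares as the full label array.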

## Model Generation :
---

The approach is to predict with some well-known classifiers (both bagging and boosting) and to take their predicted class probabilities for the submission.

We will use 

1) **RandomForestClassifier**

2) **ExtraTreesClassifier**

3) **XGBClassifier**

4) **LGBMClassifier**

In [None]:
# importing models

from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.ensemble import ExtraTreesClassifier as ext
from xgboost import XGBClassifier as xgb
from lightgbm import LGBMClassifier as lgb

In [None]:
# Function to train and visualize accuracy and predict

def train_and_predict(model , x_1  , x_2 , x_3 , y_1 , y_2):
    labels = ['Class_1' , 'Class_2' , 'Class_3' , 'Class_4' ]
    model.fit(x_1 , y_1)
    print('Training Completed..........')
    print('Train Accuracy : ',model.score(x_1,y_1))
    print('Validation Accuracy : ',model.score(x_2 , y_2))
    print('Model Prediction started....')
    y_pred = model.predict_proba(x_3)
    final_df = pd.DataFrame(y_pred , columns = labels)
    #final_df = pd.concat([samp_sub['id'],final_df]  , axis = 1)    uncomment this to find the actual submission files.
    
    return final_df
    
    
    

In [None]:
clf1 = rfc(random_state = 2)
clf2 = ext(random_state = 2)
clf3 = xgb()
clf4 = lgb()

models = [ clf1 , clf2 , clf3 , clf4 ]
names = ['rfc' , 'ext' , 'xgb' , 'lgb']

In [None]:
for i in range(4):
    print(names[i] , models[i])

In [None]:
for i in range(len(models)):
    model = models[i]
    print(names[i] , 'model has been opted for training...........')
    submission = train_and_predict(model , X_train , X_val , X_test , y_train , y_val)
    print('submission file created................................')
    submission.to_csv(names[i]+'.csv',index=False)
print('Task Completed.............................................')
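
The loop above reports accuracy, but the submission files hold class probabilities. A probability-based metric such as multi-class log loss is therefore probably a more telling validation check (this is an assumption; the competition's evaluation page is the place to confirm the metric). Below is a sketch on synthetic data from `make_classification`, not on the competition data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic 4-class problem standing in for the real features and labels.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=4, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

model = RandomForestClassifier(n_estimators=50, random_state=2)
model.fit(X_tr, y_tr)

proba = model.predict_proba(X_va)   # one probability column per class
val_accuracy = model.score(X_va, y_va)
val_log_loss = log_loss(y_va, proba, labels=model.classes_)

print('Validation Accuracy :', val_accuracy)
print('Validation Log Loss :', val_log_loss)
```

Lower log loss is better, and two models with the same accuracy can differ clearly on it.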

All four submission files have now been written.
## HURRAH !!!!!! We've reached the end.

### You can visit my other works at [github](https://github.com/sagnik1511) or visit my kaggle profile [sagnik1511](https://kaggle.com/sagnik1511).

# THANK YOU for visiting :)

![](https://st3.depositphotos.com/1006899/12553/i/600/depositphotos_125537970-stock-photo-end-word-hanging-on-ropes.jpg)