# Tabular Playground Series - Jun 2022

### Table of Contents

  * [Data Manipulation](#sec1)
       * [Importing Dataset](#sec1.1)
       * [Dataset View](#sec1.2)
       * [Dataset Information](#sec1.3)
       * [Summary Statistics](#sec1.4)
       * [Checking for unique values in integer type attribute](#sec1.5)
       * [Checking for missing values in each column](#sec1.6)
       * [Percentage of missing values in each column](#sec1.7)
       
  * [Data Visualization](#sec2)
       * [Missing Value Plot](#sec2.1)
       * [Density Plot of Continuous Variable](#sec2.2)
       * [Heatmap](#sec2.3)
       * [Density Plot after applying power transformer](#sec2.4)
       
  * [Modeling](#sec3)
       * [Power Transformer :- Yeo-Johnson transform](#sec3.1)
       * [Iterative Imputer with Linear Regression for predicting missing values](#sec3.2)
       
  * [Importing Submission File](#sec4)

## Data Manipulation <a class="anchor" id="sec1"></a>

### Importing libraries 

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [None]:
pd.set_option("display.max_rows", 100, "display.max_columns", 100)

### Importing dataset <a class="anchor" id="sec1.1"></a>

In [None]:
df=pd.read_csv('../input/tabular-playground-series-jun-2022/data.csv')

### Dataset View <a class="anchor" id="sec1.2"></a>

In [None]:
df.head(10)

### Dataset Information <a class="anchor" id="sec1.3"></a>

In [None]:
df.info()

### Summary Statistics <a class="anchor" id="sec1.4"></a>

In [None]:
df.describe()

### Checking for unique values in integer type attribute <a class="anchor" id="sec1.5"></a>

In [None]:
df.select_dtypes(include=['int64']).nunique().sort_values(ascending=True)

### Checking for missing values in each column <a class="anchor" id="sec1.6"></a>

In [None]:
df.isnull().sum()

### Percentage of missing values in each column <a class="anchor" id="sec1.7"></a>

In [None]:
pd.options.display.float_format = '{:,.2f} %'.format
(df.isnull().sum()/len(df))*100

In [None]:
pd.options.display.float_format = '{:,.2f}'.format

## Data Visualization <a class="anchor" id="sec2"></a>

### Missing Value Plot <a class="anchor" id="sec2.1"></a>

In [None]:
import missingno as msno

In [None]:
# labels is a boolean switch in missingno; True forces the column names to be drawn
msno.matrix(df,labels=True,figsize=(30,16),fontsize=12)

### Checking the data distribution of each continuous variable <a class="anchor" id="sec2.2"></a>

In [None]:
plt.rcParams['axes.facecolor'] = 'black'  # set once; applies to every subplot created below
plt.figure(figsize=(18, 18))
# an 11 x 5 grid holds up to 55 panels, one per float column
for i, col in enumerate(df.select_dtypes(include=['float64']).columns):
    ax = plt.subplot(11, 5, i+1)
    sns.histplot(data=df, x=col, ax=ax, color='red', kde=True)
plt.suptitle('Data distribution of continuous variables')
plt.tight_layout()

Here we can see that many attributes are positively or negatively skewed, so we will use a power transformation to make their distributions more symmetrical.
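
The skew visible in the histograms can also be quantified. The following is a minimal sketch on synthetic data, not on the competition file; the `demo` frame and its column names are made up for illustration. Positive values of `skew()` indicate a long right tail, negative values a long left tail, and values near zero a roughly symmetric distribution.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the float columns (illustrative names, not from data.csv)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "right_skewed": rng.exponential(scale=1.0, size=5000),
    "left_skewed": -rng.exponential(scale=1.0, size=5000),
    "symmetric": rng.normal(size=5000),
})
demo.iloc[::50, 0] = np.nan  # skew() skips missing values by default

# Sample skewness per column, sorted from most negative to most positive
skewness = demo.skew().sort_values()
print(skewness)
```

On the real data the same call would be `df.select_dtypes(include=['float64']).skew()`.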

In [None]:
df1=df.select_dtypes(include=['float64'])  # keep only the float (continuous) columns

### Power Transformer <a class="anchor" id="sec3.1"></a>

#### We will use the Yeo-Johnson transform to transform our data. A power transform makes the probability distribution of a variable more Gaussian-like.

In [None]:
from sklearn.preprocessing import PowerTransformer

In [None]:
power = PowerTransformer(method='yeo-johnson', standardize=False)
df2=power.fit_transform(df1)

In [None]:
df2=pd.DataFrame(df2,columns=list(df1.columns))
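
To illustrate what the transform does, here is a small sketch on a synthetic, right-skewed column with a few missing entries. The names `toy`, `toy_t`, and `pt` are hypothetical and only used here. It relies on the documented behaviour of `PowerTransformer` that NaNs are ignored during `fit` and kept as NaN in `transform`, which is why the transform can be applied before imputation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed column with 20 missing entries (illustrative data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.exponential(scale=2.0, size=2000)})
toy.iloc[::100, 0] = np.nan

# Same settings as in the notebook: Yeo-Johnson, no standardization
pt = PowerTransformer(method="yeo-johnson", standardize=False)
toy_t = pd.DataFrame(pt.fit_transform(toy), columns=toy.columns)

skew_before = toy["x"].skew()
skew_after = toy_t["x"].skew()
print("skew before:", skew_before)
print("skew after: ", skew_after)
print("fitted lambda:", pt.lambdas_)
print("NaNs before/after:", toy["x"].isna().sum(), toy_t["x"].isna().sum())
```

The fitted exponent for each column is stored in `lambdas_`, so it can be inspected to see how strongly each column was reshaped.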

### Heatmap <a class="anchor" id="sec2.3"></a>

In [None]:
plt.figure(figsize=(18,18))
sns.heatmap(df2.corr(),annot=False)
plt.show()

In [None]:
df2.head()

### Checking the data distribution again after applying the power transformation <a class="anchor" id="sec2.4"></a>

In [None]:
plt.rcParams['axes.facecolor'] = 'black'  # set once; applies to every subplot created below
plt.figure(figsize=(18,18))
for i,col in enumerate(df2.select_dtypes(include=['float64']).columns):
    ax=plt.subplot(11,5,i+1)
    sns.histplot(data=df2,x=col,ax=ax,kde=True,color='red')
plt.suptitle('Data distribution after power transformation')
plt.tight_layout()
plt.show()

## Modeling <a class="anchor" id="sec3"></a>

### Applying Iterative Imputer with Linear Regression for predicting missing values <a class="anchor" id="sec3.2"></a>

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

In [None]:
imp=IterativeImputer(estimator=LinearRegression(),missing_values=np.nan)

In [None]:
df3=imp.fit_transform(df2)
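
As a rough illustration of how the imputer fills gaps, the sketch below builds a synthetic array in which one column is an exact linear function of another, hides some of its values, and checks how closely they are recovered. All names (`X_full`, `X_missing`, `hidden`, `toy_imp`) are hypothetical. Real data is noisier, so the recovery will not be this exact.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Synthetic data: column 1 is exactly 2 * column 0 + 1, column 2 is unrelated noise
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x2 = rng.normal(size=200)
x1 = 2.0 * x0 + 1.0
X_full = np.column_stack([x0, x1, x2])

# Hide every tenth value of column 1
X_missing = X_full.copy()
hidden = np.zeros(200, dtype=bool)
hidden[::10] = True
X_missing[hidden, 1] = np.nan

# Each column with gaps is regressed on the other columns and the gaps are predicted
toy_imp = IterativeImputer(estimator=LinearRegression(), missing_values=np.nan)
X_filled = toy_imp.fit_transform(X_missing)

max_error = np.abs(X_filled[hidden, 1] - X_full[hidden, 1]).max()
print("largest imputation error:", max_error)
```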

### Final check for missing values after imputation

In [None]:
df4=pd.DataFrame(df3,columns=df2.columns)

In [None]:
df4.head()

In [None]:
df4.isnull().sum()
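
Note that the imputer was fitted on `df2`, so the values in `df4` are on the Yeo-Johnson scale rather than the scale of `data.csv`. If values on the original scale are wanted, `PowerTransformer.inverse_transform` maps them back. The sketch below shows the round trip on synthetic data only; the names `original`, `transformed`, `restored`, and `pt` are made up for illustration, and whether the submission should be on the original scale is an assumption to verify against the competition rules.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic skewed data (illustrative, not the competition file)
rng = np.random.default_rng(1)
original = rng.exponential(scale=2.0, size=(500, 2))

pt = PowerTransformer(method="yeo-johnson", standardize=False)
transformed = pt.fit_transform(original)

# inverse_transform undoes the per-column power transform
restored = pt.inverse_transform(transformed)
round_trip_ok = np.allclose(restored, original)
print("round trip recovers the original values:", round_trip_ok)
```

In the notebook the equivalent call would be `power.inverse_transform(df3)`. Imputed values that fall outside the range the transform can produce may not invert cleanly, so the result is worth checking for NaNs.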

### Importing Submission file <a class="anchor" id="sec4"></a>

In [None]:
sub=pd.read_csv('../input/tabular-playground-series-jun-2022/sample_submission.csv')

In [None]:
split=sub['row-col'].str.split(pat="-",expand=True)

In [None]:
row=split.iloc[:,0].astype('int64')
col=split.iloc[:,1].astype('str')

In [None]:
# look up the imputed value for every (row, column) pair in the submission
val=[df4.at[a,b] for a,b in zip(row,col)]

In [None]:
sub['value']=val
sub.to_csv('final_submission.csv',index=False)
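
The row-by-row lookup above can also be written as a single NumPy indexing step, which tends to be much faster on large submission files. This is a sketch on made-up data: `filled` stands in for `df4` and `toy_sub` for `sub`, and it assumes the frame has the default 0..n-1 index so that row labels equal row positions.

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for df4 and the sample submission (illustrative values)
filled = pd.DataFrame({"F_1_0": [0.1, 0.2, 0.3], "F_1_1": [1.1, 1.2, 1.3]})
toy_sub = pd.DataFrame({"row-col": ["0-F_1_1", "2-F_1_0", "1-F_1_1"]})

# Split only on the first hyphen: left part is the row, right part the column name
parts = toy_sub["row-col"].str.split("-", n=1, expand=True)
row_pos = parts[0].astype("int64").to_numpy()
col_pos = filled.columns.get_indexer(parts[1].to_numpy())  # column name -> position

# Pick one value per (row, column) pair in a single indexing operation
toy_sub["value"] = filled.to_numpy()[row_pos, col_pos]
print(toy_sub)
```

`get_indexer` returns -1 for a column name it cannot find, so checking `(col_pos >= 0).all()` before indexing guards against silently reading the wrong column.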