**Introduction**

This notebook contains the exploratory data analysis (EDA) for the December 2021 Tabular Playground Series data. The libraries used are pandas, NumPy, seaborn, and Matplotlib (version 3.4).

**Import the necessary libraries**

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

**Load the data and drop unwanted columns**

In [None]:
train = pd.read_csv("../input/tabular-playground-series-dec-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-dec-2021/test.csv")

In [None]:
train = train.drop('Id', axis = 1)
test = test.drop('Id', axis = 1)

In [None]:
print(train.shape)
print(test.shape)

In [None]:
train.head()

In [None]:
test.head()

**Check whether there are null values in the train and test data**

In [None]:
# Count nulls per row in a separate Series so no helper column is added to train
null_count = train.isnull().sum(axis=1)
counts = null_count.value_counts().sort_index().to_dict()
fig, ax = plt.subplots(figsize=(5, 5))
colors = sns.color_palette("bwr_r")[0:5]
null_data = {"{} Null Value(s)".format(k): v for k, v in counts.items() if k < 6}
ax.pie(x=list(null_data.values()), autopct="%.2f%%", explode=[0.05] * len(null_data),
       labels=list(null_data.keys()), pctdistance=0.5, colors=colors)
ax.set_title("Percentage of Null Values Per Row (Train Data)", fontsize=12)
plt.show()

In [None]:
# Count nulls per row in a separate Series so no helper column is added to test
null_count = test.isnull().sum(axis=1)
counts = null_count.value_counts().sort_index().to_dict()
fig, ax = plt.subplots(figsize=(5, 5))
colors = sns.color_palette("bwr_r")[0:5]
null_data = {"{} Null Value(s)".format(k): v for k, v in counts.items() if k < 6}
ax.pie(x=list(null_data.values()), autopct="%.2f%%", explode=[0.05] * len(null_data),
       labels=list(null_data.keys()), pctdistance=0.5, colors=colors)
ax.set_title("Percentage of Null Values Per Row (Test Data)", fontsize=12)
plt.show()
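
The pie charts count nulls per row. A per-column table can complement them when a quick numeric check is enough. The sketch below is a minimal illustration, not part of the original analysis: `null_summary` is a hypothetical helper, and `toy` is a made-up frame standing in for `train` or `test`.

```python
import numpy as np
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null counts and percentages, most-null columns first."""
    null_counts = df.isnull().sum()
    summary = pd.DataFrame({
        "null_count": null_counts,
        "null_pct": null_counts / len(df) * 100,
    })
    return summary.sort_values("null_count", ascending=False)


# Toy frame with hypothetical values (not the competition data).
toy = pd.DataFrame({
    "Elevation": [2500.0, np.nan, 3100.0, 2800.0],
    "Aspect": [45, 90, 180, 270],
})

summary = null_summary(toy)
print(summary)
```

Calling `null_summary(train)` and `null_summary(test)` in the notebook would give the same kind of table for the real data.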

**Find how the target variable is distributed**

In [None]:
fig, ax = plt.subplots()
sns.countplot(x='Cover_Type', data=train, order=sorted(train['Cover_Type'].unique()), ax=ax)
ax.set_title("Distribution of the target variable")
plt.show()
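
The count plot shows the class balance visually. If exact shares are needed, `value_counts(normalize=True)` gives them directly. This is a small sketch on a hypothetical label Series (`toy_target`), assumed here only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical labels standing in for train["Cover_Type"].
toy_target = pd.Series([1, 2, 2, 2, 3, 1, 2, 2], name="Cover_Type")

# normalize=True returns fractions; sort_index orders the classes by label,
# matching the order used in the count plot.
class_share = toy_target.value_counts(normalize=True).sort_index() * 100
print(class_share.round(1))
```

On the real data the same expression would be applied to `train['Cover_Type']`.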

**Use describe to find more details about train and test data**

In [None]:
train.describe().T.style.bar(subset=['mean'], color='orange')\
                            .background_gradient(subset=['std'], cmap='Blues')

In [None]:
test.describe().T.style.bar(subset=['mean'], color='orange')\
                            .background_gradient(subset=['std'], cmap='Blues')

**Comparison between the train and test sets using describe**

In [None]:
(train.describe() - test.describe())[test.columns].T.iloc[:,1:].style\
        .bar(subset=['mean', 'std'], align='mid', color=['red', 'green'])
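
Raw differences between `describe()` outputs are hard to compare across features with different scales. One possible refinement, which is an assumption of this sketch and not something the original notebook does, is to divide the difference in means by the train standard deviation. The frames `toy_train` and `toy_test` are hypothetical.

```python
import pandas as pd

# Hypothetical frames standing in for train and test.
toy_train = pd.DataFrame({
    "Elevation": [2000.0, 3000.0, 4000.0],
    "Slope": [10.0, 20.0, 30.0],
})
toy_test = pd.DataFrame({
    "Elevation": [2500.0, 3000.0, 3500.0],
    "Slope": [20.0, 30.0, 40.0],
})

# Difference of means, scaled by the train std so features are comparable.
shift = (toy_train.mean() - toy_test.mean()) / toy_train.std()

# Largest absolute shift first.
shift = shift.sort_values(key=lambda s: s.abs(), ascending=False)
print(shift)
```

Features near the top of `shift` are the ones whose train and test means differ most relative to their spread.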

**Function to plot the percentage of zeros in each feature**

In [None]:
def zerodata(zero_data):
    """Plot the percentage of zeros per feature as horizontal bars."""
    fig, ax = plt.subplots(1, 1, figsize=(12, 20))
    # Grey background bars mark the full 0-100 % range
    ax.barh(zero_data.index, 100, color='grey', height=0.6)
    barh_label = ax.barh(zero_data.index, zero_data, color='lightblue', height=0.6)
    # bar_label and list indexing of spines require Matplotlib 3.4 or later
    ax.bar_label(barh_label, fmt='%.01f %%', color='black')
    ax.spines[['left', 'bottom']].set_visible(False)
    ax.set_xticks([])
    ax.set_title('% of Zeros (by feature)', loc='center', fontweight='bold', fontsize=15)
    plt.show()

In [None]:
zero_data_train = ((train.iloc[:,:54]==0).sum() / len(train) * 100)[::-1]
zerodata(zero_data_train)
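
The percentage of zeros can also be computed as the mean of a boolean frame, which avoids dividing by `len(train)` by hand. A minimal sketch on a hypothetical frame (`toy`), assumed only for illustration:

```python
import pandas as pd

# Hypothetical frame with two binary columns and one continuous column.
toy = pd.DataFrame({
    "Soil_Type1": [0, 0, 0, 1],
    "Soil_Type2": [0, 1, 0, 1],
    "Elevation": [2500, 2600, 2700, 2800],
})

# eq(0) marks zeros as True; the mean of a boolean column is the fraction of True.
zero_pct = toy.eq(0).mean() * 100
print(zero_pct)
```

The resulting Series has the same form as `zero_data_train`, so it could be passed to `zerodata` in the same way.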

**Find the correlation between the features**

In [None]:
fig, ax = plt.subplots(figsize=(12 , 12))
corr = train.corr()
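# Mask the upper triangle (diagonal included) so each feature pair is drawn once.
# The np.bool alias was removed in NumPy 1.24; the built-in bool works everywhere.
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True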
sns.heatmap(corr,
        square=True, center=0, linewidth=0.2,
        cmap=sns.diverging_palette(250, 20, as_cmap=True),
        mask=mask, ax=ax) 
ax.set_title('Feature Correlation', loc='left', fontweight='bold')
plt.show()
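
With many features the heatmap can be hard to read precisely. A ranked list of feature pairs is one way to pick out the strongest relationships. The sketch below reuses the triangle-index idea from the mask above on synthetic data; the frame `toy` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is built from "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()

# k=1 selects the strict upper triangle, so each pair appears once
# and the diagonal (self-correlation) is skipped.
rows, cols = np.triu_indices_from(corr.values, k=1)
pairs = pd.Series(
    corr.values[rows, cols],
    index=pd.MultiIndex.from_arrays([corr.index[rows], corr.columns[cols]]),
)

# Strongest relationships first, regardless of sign.
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```

Applied to `train.corr()`, `pairs.head(10)` would list the ten most strongly correlated feature pairs.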

If you find this notebook useful, please upvote it.

**References:**

1. [Pandas Docs](https://pandas.pydata.org/docs/)
2. [Matplotlib Docs](https://matplotlib.org/stable/index.html)
3. [TPS-May](https://www.kaggle.com/subinium/tps-may-categorical-eda)
4. [TPS-Dec](https://www.kaggle.com/craigmthomas/tps-dec-2021-eda)