<div style="padding:20px;color:#EAB4DE;margin:0;font-size:200%;text-align:center;border-radius:5px;overflow:hidden;font-weight:500">TPS June 2022</div>

# <b><span style='color:#EAB4DE'>1 |</span><span style='color:#EAB4DE'> Competition Overview</span></b>

The June edition of the 2022 Tabular Playground series is a data imputation problem. The dataset has similarities to the May 2022 Tabular Playground, except that there are no targets. Rather, there are missing data values in the dataset, and our task is to predict what these values should be.

As in TPS May 2022, the dataset contains several variables representing simulated manufacturing control data. Some values are missing due to electronic errors, and we know that only the continuous features have missing values.
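
To make the task concrete, here is a minimal sketch on a made-up frame. The column names and all values are illustrative assumptions, not taken from the competition data. Each missing cell is one value to predict, identified by its row id and column name.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one continuous column with gaps, one complete categorical column
toy = pd.DataFrame({
    "row_id": [0, 1, 2, 3],
    "F_1_0": [0.5, np.nan, -1.2, np.nan],
    "F_2_0": [1, 0, 3, 2],
}).set_index("row_id")

# Positions of the missing cells, in row-major order
rows, cols = np.where(toy.isna().to_numpy())

# One "row-col" identifier per missing cell
cells_to_predict = ["{}-{}".format(toy.index[r], toy.columns[c]) for r, c in zip(rows, cols)]
print(cells_to_predict)
```

Only the cells listed this way need a predicted value; the observed cells are left untouched.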


# <b><span style='color:#EAB4DE'>2 |</span><span style='color:#EAB4DE'> Exploratory Data Analysis</span></b>

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

df = pd.read_csv('/kaggle/input/tabular-playground-series-jun-2022/data.csv')
subm = pd.read_csv('/kaggle/input/tabular-playground-series-jun-2022/sample_submission.csv')

print("The dataset has {} rows and {} columns.".format(len(df), len(df.columns)))

Now that we have imported the dataset, let's look at the first few rows to see what the data looks like:

In [None]:
pd.options.display.max_columns = df.shape[1]
df.head()

We know that the dataset includes both continuous and categorical variables. In fact, if we look at each column, we see both integer and floating-point columns:

In [None]:
columns = df.dtypes

# Print the name and dtype of every column
for name, dtype in columns.items():
    print("- {}: type {} \n".format(name, dtype))

In [None]:
# Subtract 1 from the integer count to exclude the row_id column
print("The dataset contains {} continuous and {} categorical features.".format(
    (len(df.select_dtypes('float64').columns)), (len(df.select_dtypes('int64').columns)-1)))

Now that we have looked at the data types, we can look at a brief statistical summary of the features to better understand their distributions:

In [None]:
summary = df.describe()
display(summary.style.format('{:,.3f}')
        .background_gradient(subset=(summary.index[1:],summary.columns[:]), cmap='PiYG'))

We can see that the columns have different non-null counts. Let's look at how the missing values are distributed across the columns, to check for any pattern or any specific column we should be aware of.

In [None]:
missing_values = df.isna().sum()

plt.figure(figsize=(15,8))
font = {'family' : 'serif',
        'size'   : 10}

matplotlib.rc('font', **font)
# Pass x and y as keywords: recent seaborn versions no longer accept them positionally
ax = sns.barplot(x=missing_values.index, y=missing_values.values)
ax.set_xticklabels(labels=missing_values.index, rotation=90)
plt.show()

The missing values appear to be evenly spread across the continuous features.
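
The visual impression can be backed with numbers. The sketch below uses synthetic data as a stand-in for the continuous features (the column names and the 2% missing rate are assumptions for illustration) and computes the share of missing values per column and the number of missing values per row.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: 1,000 rows and 5 continuous columns
toy = pd.DataFrame(rng.normal(size=(1000, 5)),
                   columns=["F_1_{}".format(i) for i in range(5)])

# Blank out roughly 2% of the cells uniformly at random (assumed rate)
toy = toy.mask(rng.random(toy.shape) < 0.02)

# Share of missing values per column: similar shares mean no column is singled out
missing_share = toy.isna().mean()

# How many rows have 0, 1, 2, ... missing values: shows whether gaps cluster in a few rows
missing_per_row = toy.isna().sum(axis=1).value_counts().sort_index()

print(missing_share.round(3))
print(missing_per_row)
```

On the real data, the same two summaries can be computed on `x_float` or on `df` directly.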

# <b><span style='color:#EAB4DE'>2.1 |</span><span style='color:#EAB4DE'> EDA Continuous Variables</span></b>

In [None]:
x_float = df.select_dtypes('float64')
titles=['Feature {}.{}'.format(i.split('_')[1], i.split('_')[2]) for i in x_float]
fig, ax = plt.subplots(11,5, figsize=(14, 24))
row=0
col=[0,1,2,3,4]*11
for i, column in enumerate(x_float.columns):
    if (i!=0) & (i%5==0):
        row+=1
    color='#EAB4DE'
    rgb=matplotlib.colors.to_rgba(color,0.3)
    ax[row,col[i]].hist(x_float[column],
                        color=rgb, density=True, bins=40)
    ax[row,col[i]].tick_params(left=False,bottom=False)
    ax[row,col[i]].set_title('\n\n{}'.format(titles[i]))
sns.despine(bottom=True, trim=True)
plt.suptitle('Distributions of Numerical Variables',fontsize=16)
plt.tight_layout(rect=[0, 0.02, 1, 0.99])

Almost all variables look approximately normally distributed, with a mean around zero.
A few of them (such as Feature 4.10 or Feature 3.19) are asymmetric and slightly skewed.
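
Skewness can also be quantified rather than judged by eye. The sketch below uses synthetic columns (the names `symmetric` and `skewed` are illustrative) and pandas' `DataFrame.skew()`, which skips missing values by default. Values near zero indicate a symmetric distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two synthetic columns: one normal, one log-normal (right-skewed)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=5000),
    "skewed": rng.lognormal(size=5000),
})

# Sample skewness per column; NaNs are ignored by default
skewness = toy.skew()
print(skewness.round(2))
```

Applied to `x_float`, sorting the result by absolute value would list the most asymmetric features first.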

Now that we have looked at the distributions of the features, it is useful to check whether any of them are strongly correlated:

In [None]:
corr = x_float.corr()

mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(14, 24))

cmap = sns.diverging_palette(145, 300, s=60, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

The continuous variables appear to be uncorrelated, except for those in the Feature 4.x group.
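
A heatmap with many features is hard to read precisely, so it can help to list the most correlated pairs explicitly. This is a minimal sketch on synthetic data, where two columns are built from a shared signal so that they correlate. The column names only mimic the competition's naming.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
shared = rng.normal(size=2000)

toy = pd.DataFrame({
    "F_1_0": rng.normal(size=2000),                 # independent column
    "F_4_0": shared + 0.5 * rng.normal(size=2000),  # two columns driven by the same signal
    "F_4_1": shared + 0.5 * rng.normal(size=2000),
})

corr = toy.corr()

# Absolute correlation for each unordered pair of columns, strongest first
pairs = pd.Series({(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)})
pairs = pairs.abs().sort_values(ascending=False)
print(pairs.head())
```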

# <b><span style='color:#EAB4DE'>2.2 |</span><span style='color:#EAB4DE'> EDA Categorical Variables</span></b>

Now that we have seen how the continuous variables are distributed in our dataset, let's look at the categorical ones.

In [None]:
x_int = df.select_dtypes('int64')
x_int = x_int.drop('row_id', axis=1)

In [None]:
titles=['Feature {}.{}'.format(i.split('_')[1], i.split('_')[2]) for i in x_int]

fig, ax = plt.subplots(5,5, figsize=(14, 24))
row=0
col=[0,1,2,3,4]*5

for i, f in enumerate(x_int.columns):
    if (i!=0) & (i%5==0):
        row+=1
    color='#2CB4CF'
    rgb=matplotlib.colors.to_rgba(color,0.3)
    
    vc_0 = x_int[f].value_counts()
    ax[row,col[i]].bar(vc_0.index, vc_0, color=rgb)
    ax[row,col[i]].set_title('\n\n{}'.format(titles[i]))

sns.despine(bottom=True, trim=True)
plt.suptitle('Distributions of Categorical Variables',fontsize=16)
plt.tight_layout(rect=[0, 0.2, 1, 0.99])
plt.show()

# <b><span style='color:#EAB4DE'>3 |</span><span style='color:#EAB4DE'> Outlier Detection</span></b>

In [None]:
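from scipy import stats
# nan_policy='omit' skips NaNs; the default 'propagate' makes every z-score NaN in columns with gaps
z = np.abs(stats.zscore(x_float, nan_policy='omit'))
# Keep rows whose observed values all have |z| < 3 (missing cells are not flagged)
x_float_no = x_float[((z < 3) | np.isnan(z)).all(axis=1)]
x_float_no.shape

Comparing this shape with `x_float.shape` shows how many rows contain at least one value more than three standard deviations from its column mean. For approximately normal features, a small share of such values (about 0.3% per column) is expected by chance, so they are not necessarily outliers.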
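
One detail is worth checking when z-scores are computed on data with gaps: `scipy.stats.zscore` propagates NaN by default, so a single missing value turns every z-score in that column into NaN, and a test such as `|z| < 3` then fails for the whole column. The sketch below shows the difference on a synthetic column with one planted extreme value and one planted gap.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
col = rng.normal(size=200)
col[0] = 10.0     # planted extreme value
col[1] = np.nan   # planted missing value

# Default behaviour: the NaN propagates into the mean and std, so all z-scores are NaN
z_default = stats.zscore(col)

# With nan_policy='omit' the mean and std are computed from the observed values only
z_omit = stats.zscore(col, nan_policy="omit")

print(np.isnan(z_default).all())
print(np.nanmax(np.abs(z_omit)) > 3)
```

With the `omit` policy the planted value is flagged, while the missing cell itself stays NaN and has to be handled separately.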

# <b><span style='color:#EAB4DE'>4 |</span><span style='color:#EAB4DE'> Missing Values: Median</span></b>

We already mentioned that some of the continuous variables are slightly skewed, so replacing their missing values with the mean may be misleading.
The remaining variables are approximately normally distributed, and in a normal distribution the mean, median and mode coincide, so we can try to replace the missing values in all columns with the median.
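
The reasoning can be illustrated on synthetic data: for a symmetric column the mean and the median nearly coincide, while for a skewed column the mean is pulled toward the long tail. The column names and distributions below are assumptions chosen for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=5000),
    "skewed": rng.lognormal(size=5000),
})

# Plant missing values in every 50th row
toy.iloc[::50] = np.nan

# Gap between mean and median per column (both ignore NaNs by default)
gap = (toy.mean() - toy.median()).abs()
print(gap.round(3))

# Median imputation: each column's gaps are filled with that column's median
filled = toy.fillna(toy.median())
print(filled.isna().sum().sum())
```

The gap is small for the symmetric column and noticeably larger for the skewed one, which is why the median is the safer single choice here.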

In [None]:
df_median = df.fillna(df.median())
df_median.set_index('row_id', inplace=True)

In [None]:
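df_row_id = df.set_index('row_id')
# Boolean mask of missing cells. Indexing with df_row_id[mask] would keep every row
# and only blank out the observed values, so select rows with any missing cell instead.
missing_mask = df_row_id.isnull()
missing_df = df_row_id[missing_mask.any(axis=1)]
missing_df.head()

In [None]:
missing_df.shape

In [None]:
# Positions of the missing cells in row-major order (one entry per missing cell)
rows, cols = np.where(missing_mask.to_numpy())
ids = ["{}-{}".format(df_row_id.index[r], df_row_id.columns[c])
       for r, c in zip(rows, cols)]
# Median-imputed value for each missing cell (df_median has the same row and column order)
values = df_median.to_numpy()[rows, cols]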

In [None]:
submission_df = pd.DataFrame({
    "row-col" : ids,
    "value": values
})
submission_df.to_csv("submission.csv",index=False)

In [None]:
subm.head()
subm.shape

In [None]:
submission_df.head()
submission_df.shape
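
Beyond comparing heads and shapes by eye, the submission can be checked against the sample programmatically. The helper below is a hypothetical sketch, not part of any library. It assumes that the sample file uses the same `row-col` and `value` columns as the frame built above, and the toy frames contain made-up values.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> bool:
    """Hypothetical helper: True if columns and ids match the sample and no value is missing."""
    if list(submission.columns) != list(sample.columns):
        return False
    same_ids = submission["row-col"].tolist() == sample["row-col"].tolist()
    no_gaps = submission["value"].notna().all()
    return bool(same_ids and no_gaps)

# Toy frames standing in for the sample submission and two candidate submissions
sample = pd.DataFrame({"row-col": ["0-F_1_0", "2-F_3_1"], "value": [0.0, 0.0]})
good = pd.DataFrame({"row-col": ["0-F_1_0", "2-F_3_1"], "value": [0.12, -0.4]})
bad = pd.DataFrame({"row-col": ["0-F_1_0", "0-F_1_1"], "value": [0.12, -0.4]})

print(check_submission(good, sample), check_submission(bad, sample))
```

In the notebook, the call would be `check_submission(submission_df, subm)`.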

In [None]:
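# Debug check: print the first few rows of missing_df (iterrows yields the row id and the row as a Series)
for row_id, row in missing_df.head().iterrows():
    print("Printing {}-{}".format(row_id, row))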