<div style="padding:20px;color:#EAB4DE;margin:0;font-size:200%;text-align:center;border-radius:5px;overflow:hidden;font-weight:500">TPS May 2022</div>

# <b><span style='color:#EAB4DE'>1 |</span><span style='color:#EAB4DE'> Competition Overview</span></b>

The May edition of the 2022 Tabular Playground Series is a binary classification problem that includes a number of different feature interactions. 
The dataset contains several variables representing simulated manufacturing control data, which can be used to predict whether the machine is in State 0 or State 1.

# <b><span style='color:#EAB4DE'>2 |</span><span style='color:#EAB4DE'>Exploratory Data Analysis</span></b>

In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib


import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
#    for filename in filenames:
#        print(os.path.join(dirname, filename))

train_df = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/train.csv')
test_df = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/test.csv')
subm = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/sample_submission.csv')

print("The Training dataset is made of {} rows and {} columns.".format(len(train_df), len(train_df.columns)))

Below are the first rows of the training dataset, to see what the data looks like:

In [None]:
pd.options.display.max_columns = train_df.shape[1]
train_df.head()

We can see that the columns do not all hold the same type of data.
In fact, if we look at each column, the types are the following:

In [None]:
columns = train_df.dtypes

for elem in range(len(columns.index)):
    print("- {}: type {} \n".format(columns.index[elem], columns.values[elem]))

We know that the dataset includes both continuous and categorical data, so we can interpret the integer columns as categories and the floating-point columns as numeric variables.
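
The dtype-based split described above can be sketched on a small made-up frame. The column names and values below are hypothetical and are not taken from the competition data:

```python
import pandas as pd

# Hypothetical toy frame: two float columns, one integer column, one string column
toy = pd.DataFrame({
    "f_a": [0.1, -1.2, 0.7],
    "f_b": [1.5, 0.3, -0.4],
    "f_c": [0, 2, 1],
    "f_d": ["AB", "CD", "EF"],
})

# Floating-point columns are treated as continuous features
continuous_cols = toy.select_dtypes("float64").columns.tolist()
# Integer columns are treated as categorical features
categorical_cols = toy.select_dtypes("int64").columns.tolist()

print("continuous:", continuous_cols)
print("categorical:", categorical_cols)
```

The string column falls into neither group, which mirrors how f_27 is left out of both analyses in this notebook.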

Before we dive into the analysis of the features, let's have a look at how the target variable is distributed in our training set, to make sure the classes are not imbalanced, which could mislead us when choosing and evaluating a model.

In [None]:
counting = train_df['target'].value_counts()
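# Build the labels from the index values themselves, so they stay matched to the counts
# even when value_counts() orders the classes differently
lbl = ['Target {}'.format(value) for value in counting.index]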

plt.figure(figsize=(15,8))
font = {'family' : 'serif',
        'weight' : 'bold',
        'size'   : 12}

matplotlib.rc('font', **font)

colors = sns.color_palette("husl", 2)
plt.pie(counting, labels = lbl, colors = colors, autopct='%.0f%%', explode=(0, 0.1),
        shadow=True, startangle=90
       )
plt.show()

It looks like the two target classes are roughly balanced in the training set.
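
The same check can be done numerically instead of by reading a pie chart. This is a minimal sketch on a hypothetical target column; in the notebook the series would be `train_df['target']`:

```python
import pandas as pd

# Hypothetical target column standing in for train_df['target']
target = pd.Series([0, 1, 0, 1, 1, 0, 0, 1, 0, 0], name="target")

# Share of each class, as a fraction of all rows
shares = target.value_counts(normalize=True).sort_index()
# Ratio between the rarest and the most common class (1.0 means perfectly balanced)
balance_ratio = shares.min() / shares.max()

print(shares)
print("balance ratio: {:.2f}".format(balance_ratio))
```

A ratio close to 1 supports plain accuracy as a first metric, while a much lower ratio would be a reason to prefer metrics that are less sensitive to class imbalance.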

In [None]:
print("The number of missing values in the training set is equal to: {}.".format(train_df.isnull().sum().sum()))

In [None]:
features = train_df.drop(columns = ['id', 'target'])

# <b><span style='color:#EAB4DE'>2.1 |</span><span style='color:#EAB4DE'>EDA Continuous Variables</span></b>

All variables included in the dataframe are numeric (either float or integer), except for f_27.
Since the integer columns represent categorical variables, we'll look only at the float ones here.
In order to better understand what's inside these columns, we can look at a brief statistical summary:

In [None]:
x_float = train_df.select_dtypes('float64')
x_float.describe()

Since we have many different columns, getting useful insights from a table may be quite difficult.
A plot may therefore be more useful:

In [None]:
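plt.figure(figsize=(15,8))
# Pass the palette explicitly: calling sns.color_palette() on its own only returns the colors
ax = sns.boxplot(data=x_float, orient="h",
                 palette=sns.color_palette("husl", x_float.shape[1]))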

Since f_28 is on a different scale compared to the other features, instead of plotting the boxplots all in one graph, it is better to split them by column and maybe divide them by the value of the target variable:

In [None]:
float_and_tgt=pd.concat([x_float,train_df['target']], axis=1)
titles=['Feature {}'.format(i.split('_')[-1]) for i in x_float]
fig, ax = plt.subplots(4,4, figsize=(14,24))
row=0
col=[0,1,2,3]*4
for i, column in enumerate(float_and_tgt.columns[:-1]):
    if (i!=0) & (i%4==0):
        row+=1
    color='#2CB4CF'
    rgb=matplotlib.colors.to_rgba(color,0.2)
    ax[row,col[i]].boxplot(float_and_tgt[float_and_tgt['target']==0][column], positions=[0],
                           widths=0.7, patch_artist=True,
                           boxprops=dict(color=color, facecolor=rgb, linewidth=1.5))
    color='#EAB4DE'
    rgb=matplotlib.colors.to_rgba(color,0.2)
    ax[row,col[i]].boxplot(float_and_tgt[float_and_tgt['target']==1][column], positions=[1],
                           widths=0.7, patch_artist=True,
                           boxprops=dict(color=color, facecolor=rgb, linewidth=1.5))
    ax[row,col[i]].grid(visible=True, which='major', axis='y', color='#F2F2F2')
    ax[row,col[i]].tick_params(left=False,bottom=False)
    ax[row,col[i]].set_title('\n\n{}'.format(titles[i]))
sns.despine(bottom=True, trim=True)
plt.suptitle('Distributions of Numerical Variables',fontsize=16)
plt.tight_layout(rect=[0, 0.2, 1, 0.99])

We notice that all features are roughly symmetrical and not very skewed. To check this, let's look at the distributions using histograms:

In [None]:
float_and_tgt=pd.concat([x_float,train_df['target']], axis=1)
titles=['Feature {}'.format(i.split('_')[-1]) for i in x_float]
fig, ax = plt.subplots(4,4, figsize=(14,24))
row=0
col=[0,1,2,3]*4
for i, column in enumerate(float_and_tgt.columns[:-1]):
    if (i!=0) & (i%4==0):
        row+=1
    color='#2CB4CF'
    rgb=matplotlib.colors.to_rgba(color,0.3)
    ax[row,col[i]].hist(float_and_tgt[float_and_tgt['target']==0][column],
                        color=rgb, density=True, bins=40)
    color='#EAB4DE'
    rgb=matplotlib.colors.to_rgba(color,0.3)
    ax[row,col[i]].hist(float_and_tgt[float_and_tgt['target']==1][column],
                       color=rgb, density=True, bins=40)
    #ax[row,col[i]].grid(visible=True, which='major', axis='y', color='#F2F2F2')
    ax[row,col[i]].tick_params(left=False,bottom=False)
    ax[row,col[i]].set_title('\n\n{}'.format(titles[i]))
sns.despine(bottom=True, trim=True)
plt.suptitle('Distributions of Numerical Variables',fontsize=16)
plt.tight_layout(rect=[0, 0.2, 1, 0.99])

The histograms suggest that the variables are symmetrical and approximately normally distributed.
We can also notice that the distribution is nearly the same for both values of the target variable for all features, except for some small spikes.
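
A histogram gives a visual impression of symmetry; sample skewness and excess kurtosis give a numeric one. The sketch below uses synthetic samples rather than the competition features, so the printed values only illustrate how to read the two statistics:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for one feature: a symmetric sample and a right-skewed sample
symmetric = rng.normal(loc=0.0, scale=1.0, size=10_000)
skewed = rng.exponential(scale=1.0, size=10_000)

for name, sample in [("symmetric", symmetric), ("skewed", skewed)]:
    # Skewness near 0 suggests symmetry; excess kurtosis near 0 suggests normal-like tails
    print("{}: skew={:.2f}, excess kurtosis={:.2f}".format(
        name, stats.skew(sample), stats.kurtosis(sample)))
```

Applied to the notebook's data, the same two calls could be run on each column of `x_float`.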

Now that we have looked at the distributions of the features, it can be useful to see whether there is any relevant correlation between them:

In [None]:
corr = x_float.corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(14, 24))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(145, 300, s=60, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

It looks like there is no particular correlation between these variables, except for Feature 28, which seems slightly correlated (with a coefficient between 0.2 and 0.3) with the first 6 features.
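
Reading exact coefficients off a heatmap is hard, so it can help to list the strongest pairs directly. This is a sketch on synthetic data, where column `b` is deliberately built to correlate with column `a`; the column names are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=1_000)
# Synthetic frame: 'b' is built to correlate with 'a', 'c' is independent noise
demo = pd.DataFrame({
    "a": base,
    "b": 0.5 * base + rng.normal(size=1_000),
    "c": rng.normal(size=1_000),
})

corr = demo.corr()
# Keep only the strict upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Flatten to (feature, feature) -> correlation and sort by absolute strength
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```

On the notebook's data, `demo` would be replaced by `x_float`, and the top rows would show which features are most correlated with each other.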

# <b><span style='color:#EAB4DE'>2.2 |</span><span style='color:#EAB4DE'>EDA Categorical Variables</span></b>

Now that we have analyzed how the continuous variables are distributed in our dataset, we will look at the categorical ones.

In [None]:
x_int = train_df.select_dtypes('int64')
x_int = x_int.drop('id', axis=1)
#x_int_tgt = pd.concat([x_int,train_df['target']], axis=1)

In [None]:
sub_titles=['Feature {}'.format(i.split('_')[-1]) for i in x_int.columns[:-1]]

fig, ax = plt.subplots(4,4, figsize=(14,24))

for i, f in enumerate(x_int.columns[:-1]):
    plt.subplot(4, 4, i+1)
    ax = plt.gca()
    color='#2CB4CF'
    rgb=matplotlib.colors.to_rgba(color,0.3)
    
    vc_0 = x_int[x_int['target']==0][f].value_counts()
    ax.bar(vc_0.index, vc_0, color=rgb)
    
    color='#EAB4DE'
    rgb=matplotlib.colors.to_rgba(color,0.3)
    vc_1 = x_int[x_int['target']==1][f].value_counts()
    ax.bar(vc_1.index, vc_1, color=rgb)
    #ax.hist(train[f], density=False, bins=(train[f].max()-train[f].min()+1))
    #ax.set_xlabel(f'Feature {f}')
    ax.set_title('\n\n{}'.format(sub_titles[i]))
    #ax.xaxis.set_major_locator(MaxNLocator(integer=True)) # only integer labels
sns.despine(bottom=True, trim=True)
plt.suptitle('Distributions of Categorical Variables',fontsize=16)
plt.tight_layout(rect=[0, 0.2, 1, 0.99])
plt.show()

Looking at the bar charts, we notice that the majority of the categorical features have between 10 and 15 different levels, with a higher concentration of data in the first 5 levels.
Feature 29 has only two levels, so we may think of it as a boolean variable (with more 0s than 1s), while Feature 30 has 3 levels, quite uniformly distributed.

The counts for each categorical feature are similar for State 0 and State 1.
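
Overlaid bar charts make small differences between the two states hard to see. A normalized cross-tabulation is one possible way to compare them; the frame below is hypothetical toy data, not the competition set:

```python
import pandas as pd

# Hypothetical integer feature and binary target
demo = pd.DataFrame({
    "f_x":    [0, 0, 1, 1, 2, 2, 0, 1],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Each column (target value) sums to 1, so the two classes are comparable
# even if they do not contain the same number of rows
level_shares = pd.crosstab(demo["f_x"], demo["target"], normalize="columns")
print(level_shares)
```

Rows where the two columns differ noticeably point to levels that occur more often in one state than in the other.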

# <b><span style='color:#EAB4DE'>3 |</span><span style='color:#EAB4DE'>Logistic Regression</span></b>

A natural first model to try is a Logistic Regression, since the target variable is binary.

In [None]:
from sklearn.model_selection  import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
x_features = pd.concat([x_float,x_int], axis=1)

In [None]:
# Drop the target from the features before splitting, so no split has to be modified in place
train_x, cv_x, y_train, y_cv  = train_test_split(x_features.drop(columns='target'),
                                                 train_df['target'],
                                                 stratify=train_df['target'])

lr = LogisticRegression(max_iter=500)
lr.fit(train_x.values, y_train.values)
pred = lr.predict(train_x.values)
print("The train accuracy of the Logistic Regression is ",accuracy_score(y_train.values,pred))
pred  = lr.predict(cv_x.values)
print("The cv accuracy of the Logistic Regression is ",accuracy_score(y_cv.values, pred))

The score we got from the logistic model isn't the best. We noticed that the training set contains various outliers in many columns, so let's try to remove them and see if we get better results.

In [None]:
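from scipy import stats
# Keep only the rows where every float feature lies within 3 standard deviations of its mean
x_float_no = x_float[(np.abs(stats.zscore(x_float)) < 3).all(axis=1)]
# Select the matching rows by index label (loc) rather than by position (iloc)
x_int_no = x_int.loc[x_float_no.index]
target_no = x_int_no['target']
# Reassign instead of dropping in place, to avoid modifying a slice of x_int
x_int_no = x_int_no.drop(columns='target')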

In [None]:
x_features_no = pd.concat([x_float_no,x_int_no], axis=1)

train_x, cv_x, y_train, y_cv  = train_test_split(x_features_no,target_no,
                                                 stratify=target_no)

#train_x.drop(['target'],axis=1,inplace=True)
#cv_x.drop(["target"],axis=1,inplace=True)

lr = LogisticRegression(max_iter=500)
lr.fit(train_x.values, y_train.values)
pred = lr.predict(train_x.values)
print("The train accuracy of the Logistic Regression without outliers is ",accuracy_score(y_train.values,pred))
pred  = lr.predict(cv_x.values)
print("The cv accuracy of the Logistic Regression without outliers is ",accuracy_score(y_cv.values, pred))

It looks like removing the outliers made the model worse, not better.
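
One thing worth keeping in mind about the z-score rule is that it also drops rows from perfectly clean data. The sketch below uses 16 independent standard normal columns (a hypothetical count, chosen only for illustration). Since a single normal value exceeds 3 standard deviations with probability of about 0.0027, roughly 1 - 0.9973^16, or about 4%, of the rows are expected to be flagged:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for the float features: 16 independent normal columns
demo = pd.DataFrame(rng.normal(size=(10_000, 16)))

# A row is kept only if every column is within 3 standard deviations of its mean
keep = (np.abs(stats.zscore(demo)) < 3).all(axis=1)
removed_share = 1 - keep.mean()
print("share of rows removed: {:.1%}".format(removed_share))
```

If the features are close to normal, the removed rows may therefore be ordinary tail values rather than true outliers, which could be one reason the filter does not help.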

In [None]:
# Select the test columns in the same order as the training features:
# the model is fitted and queried on .values, so column names are never checked
test_x = test_df[train_x.columns]
test_id = test_df['id'].values

In [None]:
pred = lr.predict(test_x.values)

Let's have a look at what the submission file should look like:

In [None]:
subm.head()

submission_df = pd.DataFrame({
    "id" : test_id,
    "target": pred
})
submission_df.to_csv("submission.csv",index=False)

As expected, the score of the logistic regression is quite low (0.48).
We could try to improve it by using cross-validation or by implementing a more complex model.
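
A minimal sketch of the cross-validation idea, run on a synthetic dataset from `make_classification` rather than on the competition data. The choice of 5 folds and of ROC AUC as the scoring metric are assumptions made for this illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem standing in for the competition data
X, y = make_classification(n_samples=2_000, n_features=10, n_informative=5,
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# 'roc_auc' scores the predicted probabilities rather than the hard 0/1 labels
scores = cross_val_score(LogisticRegression(max_iter=500), X, y,
                         cv=cv, scoring="roc_auc")
print("AUC per fold:", scores.round(3))
print("mean AUC: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))
```

Averaging over several folds gives a more stable estimate than a single train/validation split, and the spread across folds hints at how much the score depends on the particular split.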

# <b><span style='color:#EAB4DE'>4 |</span><span style='color:#EAB4DE'>Random Forest</span></b>

First, we'll try with a simple RandomForestClassifier:

In [None]:
#x_features_no.drop('target', axis=1, inplace=True)
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [None]:
x_features_notgt = x_features.drop('target', axis=1)
target = x_features['target']

x_train, x_test, y_train, y_test = train_test_split(x_features_notgt, target)


model = RandomForestClassifier(n_jobs=-1)
model.fit(x_train, y_train)

# Use the same columns, in the same order, that the model was trained on
pred = model.predict(test_df[x_train.columns])

The Random Forest got a better score (0.51) than the Logistic Regression.
Working on it further may lead to better results.

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(x_features_no,target_no, test_size = 0.25, random_state = 42)

rf = RandomForestRegressor(n_estimators = 100,
                            min_samples_leaf = 5,
                            max_depth = 15,
                            n_jobs = -1,
                            random_state = 42)
rf.fit(train_features, train_labels)

Now that we have chosen our model, let's look at the importance of the variables we included, to check whether it would be better to reduce the number of features:

In [None]:
feature_list = list(train_features.columns)

In [None]:
# Get numerical feature importances
importances = list(rf.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the features and their importances
for pair in feature_importances:
    print('Variable: {} Importance: {}'.format(*pair))

Looks like the first 8 features (ordered by importance) cover more than 80% of the importance, so maybe we should try to create a model only with those.
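
The "how many features cover 80%" question can be answered with a cumulative sum. The pairs below are hypothetical values in the same `(feature, importance)` format as `feature_importances`, already sorted from most to least important:

```python
import numpy as np

# Hypothetical (feature, importance) pairs, already sorted by importance
feature_importances = [("f_a", 0.30), ("f_b", 0.25), ("f_c", 0.20),
                       ("f_d", 0.15), ("f_e", 0.10)]

importances = np.array([imp for _, imp in feature_importances])
cumulative = np.cumsum(importances)
# Position of the first feature at which the running total reaches the threshold
threshold = 0.80
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print("cumulative importance:", cumulative.round(2))
print("features needed for {:.0%}: {}".format(threshold, n_needed))
```

Computing the cut-off this way avoids hard-coding the number of features, which may change if the model is refitted.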

In [None]:
feature_importances[1][0]

In [None]:
top_8_features = []
for row in range(8):
    top_8_features.append(feature_importances[row][0])

In [None]:
x_top_8 = x_features_no[top_8_features]

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(x_top_8,target_no, test_size = 0.25, random_state = 42)

rf_top8 = RandomForestRegressor(n_estimators = 100,
                            min_samples_leaf = 5,
                            max_depth = 15,
                            n_jobs = -1,
                            random_state = 42)
rf_top8.fit(train_features, train_labels)

In [None]:
# Use the forest's predict method on the held-out split
predictions = rf_top8.predict(test_features)
# Calculate the absolute errors between the predicted scores and the 0/1 labels
errors = abs(predictions - test_labels)
# Print out the mean absolute error (MAE); the target has no unit
print('Mean Absolute Error:', round(np.mean(errors), 2))

In [None]:
# MAPE is undefined here because it divides by labels equal to 0,
# so threshold the predicted scores at 0.5 and measure classification accuracy instead
accuracy = 100 * np.mean((predictions >= 0.5).astype(int) == test_labels)
print('Accuracy:', round(accuracy, 2), '%.')

In [None]:
test_x = test_df[top_8_features]
test_id = test_df['id'].values
#test_x.drop("id",axis=1,inplace=True)

In [None]:
pred = rf_top8.predict(test_x.values)
pred

In [None]:
submission_df = pd.DataFrame({
    "id" : test_id,
    "target": pred
})
submission_df.to_csv("submission.csv",index=False)

It looks like limiting the features included in the model substantially improved the score (0.84).
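
One difference between this submission and the previous ones is that the regressor outputs continuous scores, while the earlier models submitted hard 0/1 labels. Assuming the leaderboard metric is ROC AUC (an assumption; the metric is not stated in this notebook), that difference alone may change the score. The sketch below illustrates the effect on synthetic labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic labels and noisy continuous scores that carry some signal
y_true = rng.integers(0, 2, size=5_000)
scores = 0.5 * y_true + rng.normal(scale=0.5, size=5_000)

# Thresholding collapses the ranking into just two groups
hard_labels = (scores >= 0.25).astype(int)

auc_scores = roc_auc_score(y_true, scores)
auc_labels = roc_auc_score(y_true, hard_labels)
print("AUC with continuous scores: {:.3f}".format(auc_scores))
print("AUC with hard 0/1 labels:   {:.3f}".format(auc_labels))
```

If the metric is indeed rank-based, submitting `predict_proba(...)[:, 1]` from a classifier would be a fairer comparison with the regressor than submitting `predict(...)`.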

# <b><span style='color:#EAB4DE'></span><span style='color:#EAB4DE'>Disclaimer</span></b>

This is my first Kaggle competition.
To do this EDA I took some inspiration from notebooks published by other Kagglers and did the best I could with the packages I knew.
Feel free to suggest other analyses that could be done and ways to improve the results I presented!