<b>This notebook summarizes my best submission for the Tabular Playground Series - May 2022 Edition.<br></b>
It shows the most important steps I took: exploring the data, engineering features, and training the best-performing model.

I also analyzed feature importance using SHAP values. If you're interested, you can review it here: <a href="https://www.kaggle.com/code/fajerbolt/tps-may-2022-shap-analysis">TPS May 2022 - SHAP Analysis</a>

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
#Import libraries

#Data Manipulation
import pandas as pd
import numpy as np

#Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

#ML Data Prep
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
import collections

#ML Algorithms
from xgboost import XGBClassifier

#Performance metrics
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay
from xgboost import plot_importance

In [None]:
#Importing data
train = pd.read_csv('../input/tabular-playground-series-may-2022/train.csv')

## 1. Initial look at the data

In [None]:
train.head()

In [None]:
train.info()

There are 31 potential predictor features, plus the target and an ID column:<br>
16 Continuous Features<br>
14 Integer Features<br>
1 Text Feature<br>
No missing data
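
The counts above can be double-checked programmatically. The sketch below uses a tiny made-up frame (`toy` and its values are placeholders, not the competition data) to show the idea. On the real data, the same calls would be applied to `train`.

```python
import pandas as pd

# Toy stand-in for `train`; the values are invented for illustration only
toy = pd.DataFrame({
    'id': [0, 1, 2],
    'f_00': [0.1, -1.2, 0.7],
    'f_07': [2, 0, 5],
    'f_27': ['ABABCADBEF', 'BBBACADAAA', 'ACACBADCCB'],
    'target': [0, 1, 0],
})

# Number of predictor columns per dtype (id and target are excluded)
dtype_counts = toy.drop(columns=['id', 'target']).dtypes.astype(str).value_counts()

# Total number of missing values across the whole frame
n_missing = int(toy.isna().sum().sum())

print(dtype_counts)
print('Missing values:', n_missing)
```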

## 2. EDA

In [None]:
continuous = train.select_dtypes(include = 'float64')
integers = train.select_dtypes(include = 'int64').drop(columns = ['id', 'target'])
text = train.select_dtypes(include = 'object')
numeric = train.select_dtypes(exclude = 'object').drop(columns = ['id', 'target'])

target = train.target

### 2.1 Exploring Continuous Features

In [None]:
#Checking summary statistics

continuous.describe()

The data seems to be fairly symmetric.<br>
Features fall into three groups of similar scale, and the standard deviations follow those scales.<br>

In [None]:
#Pearson correlation for continuous features

pearson_r = continuous.corr()

fig, ax = plt.subplots(figsize = (12, 10))
plt.title('Pearson r for continuous features', fontweight = 'bold')
sns.heatmap(pearson_r, 
            annot = True,
            fmt='.2f',
            center = 0)
plt.show()

In [None]:
#Distribution of continuous features

for column in continuous.columns:
    sns.displot(data = continuous, x = column)
    plt.show()

Distribution plots confirm that the continuous features are fairly symmetric.
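
Symmetry can also be checked numerically through sample skewness, where values near 0 suggest a symmetric distribution. This is only a sketch on synthetic data (`toy_continuous` is a hypothetical frame, not the competition data). On the real data the call would be `continuous.skew()`.

```python
import numpy as np
import pandas as pd

# Synthetic data: one symmetric column and one right-skewed column
rng = np.random.default_rng(42)
toy_continuous = pd.DataFrame({
    'symmetric': rng.normal(size=10_000),
    'skewed': rng.exponential(size=10_000),
})

# Sample skewness per column: close to 0 for the symmetric column,
# clearly positive for the right-skewed one
skewness = toy_continuous.skew()
print(skewness)
```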

In [None]:
#Mean of each continuous feature by target class
for column in continuous.columns:
    sns.barplot(x = train.target, y = train[column])
    plt.show()

<b>Almost all of the continuous features seem to be good predictors.</b>

### 2.2 Integer Features

In [None]:
integers.describe()

All values are non-negative<br>
f_29 and f_30 have lower cardinality<br>
Features with higher cardinality seem to be skewed (small number of higher values)<br>

In [None]:
for column in integers.columns:
    sns.countplot(x = column, data = integers)
    plt.show()

f_29 and f_30 seem to be categorical in nature; all the others could be treated as either numeric or categorical.

In [None]:
integers_numeric = integers.drop(columns = ['f_29', 'f_30'])
integers_cat = integers[['f_29', 'f_30']]

In [None]:
#Mean target (share of positive class) for each value of the integer features
for column in integers.columns:
    sns.barplot(x = train[column], y = train.target)
    plt.show()

In [None]:
#Percentage of positive target class by category

for column in integers_cat:
    print(train.groupby(by = column)['target'].mean())

### 2.3 Text Feature

In [None]:
print('There are ' + str(text.f_27.nunique()) + ' unique values in the text (f_27) feature.')

In [None]:
print('All lengths of text field:')
text.f_27.apply(len).unique()

### 2.4 Target


In [None]:
target.value_counts()

Pretty even distribution of target values.

## 3. Feature Engineering

### 3.1 Text frequency

Compute a running count: for each row, the number of times its text sequence has appeared in the data so far (including that row).

In [None]:
#Running count: how many times each f_27 value has appeared so far (including the current row)
train['text_frequency'] = train.groupby('f_27').cumcount() + 1
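
The cell above assigns each row a running count (how often its sequence has appeared so far), which differs from the total count over the whole data set. A small sketch on made-up values (`toy_texts` is hypothetical) shows the difference:

```python
from collections import Counter

# Made-up sequences, for illustration only
toy_texts = ['AB', 'CD', 'AB', 'AB', 'CD']

# Running count: occurrences seen up to and including the current position
seen = Counter()
running = []
for t in toy_texts:
    seen[t] += 1
    running.append(seen[t])

# Total count: occurrences over the whole list
totals = Counter(toy_texts)
total = [totals[t] for t in toy_texts]

print(running)  # [1, 1, 2, 3, 2]
print(total)    # [3, 2, 3, 3, 2]
```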

In [None]:
sns.barplot(x = 'text_frequency', y = 'target', data = train)

### 3.2 Encoding text feature

In [None]:
#Creating a feature for each letter of f_27

for i in range(10):
    train[f'letter_{i+1}'] = train.f_27.str.get(i).apply(ord) - ord('A')
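
As a worked example of this encoding, each letter is mapped to its offset from 'A', so 'A' becomes 0, 'B' becomes 1, and so on. The string below is made up for illustration:

```python
# Made-up 10-letter string, for illustration only
sample = 'ABABCADBEF'

# Offset of every letter from 'A'; position i becomes feature letter_{i+1}
encoded = [ord(ch) - ord('A') for ch in sample]

print(encoded)  # [0, 1, 0, 1, 2, 0, 3, 1, 4, 5]
```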

### 3.3 Number of distinct letters in text

In [None]:
train['text_distinct_letters'] = train.f_27.apply(set).apply(len)

### 3.4 Number of duplicated letters (distinct)

In [None]:
#Calculates the distinct number of duplicated letters.
#e.g. string 'AABB' would return a value of 2

counts = train.f_27.apply(collections.Counter).apply(dict)

duplicated = []
for index, row in counts.items():
    duplicates = {key:value for key, value in row.items() if value > 1}
    duplicated.append(len(duplicates.keys()))
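
The same logic can be tried on single strings. The helper below (`count_duplicated_letters` is a hypothetical name, not part of the original notebook) is a minimal sketch of the per-row computation:

```python
from collections import Counter

def count_duplicated_letters(s):
    """Number of distinct letters that occur more than once in s."""
    return sum(1 for n in Counter(s).values() if n > 1)

print(count_duplicated_letters('AABB'))  # 2
print(count_duplicated_letters('ABCA'))  # 1
print(count_duplicated_letters('ABCD'))  # 0
```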

### 3.5 Most common letter

Creating a binary feature for every letter found in the text feature, indicating whether that letter is one of the most common letters in a given string (when several letters tie for the highest count, all of them are flagged).

In [None]:
common_letters = []
for index, row in counts.items():
    max_count = max(row.values())
    letters = [key for key, value in row.items() if value == max_count]
    common_letters.append(letters)
    
common_letters_flat = [letter for letters in common_letters for letter in letters]

unique_letters = sorted(set(common_letters_flat))

common_letters_series = pd.Series(common_letters)

In [None]:
def most_common_letter(letter):
    letter_values = []
    for index, row in common_letters_series.items():
        if letter in row:
            letter_values.append(1)
        else:
            letter_values.append(0)
    train[letter + '_most_common'] = letter_values  

In [None]:
for letter in unique_letters:
    most_common_letter(letter)
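
To illustrate how ties are handled, here is a minimal sketch for single strings. `most_common_letters` is a hypothetical helper that mirrors the logic of the cells above, and the strings are made up:

```python
from collections import Counter

def most_common_letters(s):
    """All letters of s that share the highest count, sorted."""
    counts = Counter(s)
    top = max(counts.values())
    return sorted(letter for letter, n in counts.items() if n == top)

print(most_common_letters('AABBC'))  # ['A', 'B']
print(most_common_letters('AAABC'))  # ['A']

# Binary flags, one per letter, as in the <letter>_most_common columns
flags = {letter: int(letter in most_common_letters('AABBC')) for letter in 'ABC'}
print(flags)  # {'A': 1, 'B': 1, 'C': 0}
```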

### 3.6 Combinations of f_29 and f_30

The idea is to create a feature that shows which combination of f_29 and f_30 a row belongs to (6 possible combinations).

In [None]:
combinations = [(train['f_29'] == 0) & (train['f_30'] == 0),
                (train['f_29'] == 0) & (train['f_30'] == 1),
                (train['f_29'] == 0) & (train['f_30'] == 2),
                (train['f_29'] == 1) & (train['f_30'] == 0),
                (train['f_29'] == 1) & (train['f_30'] == 1),
                (train['f_29'] == 1) & (train['f_30'] == 2)]

values = [1, 2, 3, 4, 5, 6]


train['f_29_30'] = np.select(combinations, values)
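
Assuming f_29 only takes the values 0 and 1 and f_30 only the values 0, 1 and 2 (as the conditions above imply), the labels 1 to 6 correspond to the arithmetic `3 * f_29 + f_30 + 1`. The sketch below only checks that mapping. Note that `np.select` falls back to its default of 0 for any row matching none of the conditions, which the formula does not reproduce.

```python
# All (f_29, f_30) pairs covered by the conditions above
pairs = [(f_29, f_30) for f_29 in (0, 1) for f_30 in (0, 1, 2)]

# Arithmetic equivalent of the 1..6 labels
codes = {pair: 3 * pair[0] + pair[1] + 1 for pair in pairs}

for pair, code in codes.items():
    print(pair, '->', code)
```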

In [None]:
train.columns

## 4. Model Training

In [None]:
x = train.drop(columns = ['id', 'target', 'f_27'])
y = train.target

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify = y, random_state = 42)

In [None]:
#Trained the model using Random Search CV

#xgb = XGBClassifier()

#random_params_xgb = {'n_estimators' : [i for i in range(10, 500, 20)],
#             'max_depth' : [i for i in range(2, 15)],
#             'learning_rate' : list(np.arange(0.01, 1, 0.03)),
#             'colsample_bytree' : list(np.arange(0.1, 1, 0.05)),
#             'subsample' : list(np.arange(0.1, 1, 0.05)),
#             'reg_lambda' : list(np.arange(0.001, 1, 0.005)),
#             'n_jobs' : [-1],
#             'random_state' : [42]
#             }

#rs_xgb = RandomizedSearchCV(xgb, random_params_xgb, scoring = 'roc_auc', n_iter = 50)
#rs_xgb.fit(x_train, y_train)



In [None]:
#Best XGB model

best_xgb = XGBClassifier(colsample_bytree = 0.8000000000000002,
                        learning_rate = 0.49,
                        max_depth = 13,
                        n_estimators = 330,
                        reg_lambda = 0.811,
                        subsample = 0.9000000000000002)

best_xgb.fit(x_train, y_train)

print('Train accuracy: ' + str(accuracy_score(y_train, best_xgb.predict(x_train))))
print('Train ROC AUC: ' + str(roc_auc_score(y_train, best_xgb.predict_proba(x_train)[:, 1])))

## 5. Model Performance

In [None]:
print('Test accuracy: ' + str(accuracy_score(y_test, best_xgb.predict(x_test))))
print('Test ROC AUC: ' + str(roc_auc_score(y_test, best_xgb.predict_proba(x_test)[:, 1])))