<b>UPDATES:</b>

5/27/22

- Reconstructed the Optuna function and added a separate cross-validation step.

5/30/22

- Used bag of words and added a new feature counting the number of unique letters in each entry.
- Added visualizations for these features.
- Changed the cross-validation code to produce a ROC AUC plot.

<b>TO-DO-LIST:</b>

- Experiment more with feature engineering (try one-hot encoding first, then letter frequency counts, and possibly a document-term matrix)


This is my submission for the Tabular Playground Series (TPS) May 2022 competition. In this notebook, I use the LightGBM classifier (`LGBMClassifier`) and tune its hyperparameters with Optuna.

To start, we will import the needed packages.

In [None]:
from string import ascii_letters
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score, roc_curve, auc, RocCurveDisplay
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import optuna
from lightgbm import LGBMClassifier, plot_importance
from category_encoders import TargetEncoder

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
main_path = '../input/tabular-playground-series-may-2022/' #the main folder where all the csv files are located

#Configure options for pandas
pd.set_option('display.float_format', '{:.4f}'.format) #show floats with 4 decimal places
pd.set_option('display.max_columns', None)

#Load datasets
train = pd.read_csv(main_path + 'train.csv', index_col='id')
test = pd.read_csv(main_path + 'test.csv', index_col='id')
all_data = pd.concat([train,test])
train.head()

## Profiling

In [None]:
#Display dimensions of each dataset
print('Shape of training dataset: {} rows and {} columns'.format(train.shape[0], train.shape[1]))
print('Shape of testing dataset: {} rows and {} columns'.format(test.shape[0], test.shape[1]))
print('Shape of all data: {} rows and {} columns'.format(all_data.shape[0], all_data.shape[1]))

After combining the data, there are over 1.5 million rows of numeric entries. The feature names carry no meaning; however, features 7 to 18 could indicate the following:
- The values are encoded from ordinal data
- The values indicate either a quantity or a rank

In [None]:
all_data.info()

From this output, we can see that there are no missing values in the features. The target column reports missing values only because the testing dataset doesn't have a target column.

In [None]:
all_data.describe().T

In [None]:
train.target.value_counts()

In [None]:
#Display the percentages of each label
print('Percentage of target entries with a value of 0: {:.4f}'.format(len(train.query('target==0')) / len(train)))
print('Percentage of target entries with a value of 1: {:.4f}'.format(len(train.query('target==1')) / len(train)))

From the last two cells, we can see that the target classes are slightly imbalanced.
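
The degree of imbalance can also be summarized as a single ratio. The following is a minimal sketch on made-up labels; the `labels` list is hypothetical and only stands in for `train.target`.

```python
from collections import Counter

# Hypothetical labels standing in for train.target
labels = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]

counts = Counter(labels)
total = sum(counts.values())

# Share of each class among all rows
shares = {label: count / total for label, count in counts.items()}

# Majority-to-minority ratio; 1.0 would mean perfectly balanced classes
imbalance_ratio = max(counts.values()) / min(counts.values())

print(shares)
print(f'Imbalance ratio: {imbalance_ratio:.2f}')
```

A ratio close to 1 suggests that plain ROC AUC evaluation without resampling is a reasonable starting point.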

## Feature Engineering

Notice that the dataset has a categorical feature called f_27, which holds a sequence of letters. It could be used for feature engineering to help our model. First, let's see if any values occur frequently.

In [None]:
def create_countplot(x, title, ax=None, data=all_data):  
    values = data[x].value_counts(ascending=False)
    g = sns.countplot(x=x, data=data, order=values.index, ax=ax)
    g.set_title(title)
    g.set_xlabel('Letter')

In [None]:
all_data.f_27.value_counts().head(20)

Out of more than 1.5 million entries, even the most common string appears at most 15 times! Let's see how many unique entries there are:

In [None]:
all_data.f_27.nunique()

There are over 1.1 million unique entries. In theory, these strings could be routines for the machine. Now that we have examined this feature, we can start engineering. I will begin by counting the number of unique letters in each entry, then split each string so that every character position becomes its own feature.
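
Both ideas can be sketched on a few toy strings before applying them to the full dataset. The sample strings and the helper names `n_unique_letters` and `split_positions` are hypothetical and only illustrate the logic.

```python
# Hypothetical 10-letter strings in the style of f_27
samples = ['ABABDADBAB', 'ACACCADCEB', 'BBBBBBBBBB']

def n_unique_letters(s):
    """Number of distinct letters in the string."""
    return len(set(s))

def split_positions(s, prefix='f_27', suffixes='abcdefghij'):
    """Map each character to its own positional feature name."""
    return {f'{prefix}{suffix}': ch for suffix, ch in zip(suffixes, s)}

for s in samples:
    print(s, n_unique_letters(s), split_positions(s))
```

In pandas, the same positional split could also be written with `Series.str[i]` for each position `i`, which avoids relying on how an empty separator is handled.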

In [None]:
#Number of distinct letters in each string
func = lambda x: len(set(x))
train['n_unique'] = train.f_27.apply(func)
test['n_unique'] = test.f_27.apply(func)
all_data['n_unique'] = all_data.f_27.apply(func)

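#Split strings into separate features, one column per character.
#Splitting on '' yields an empty string in the first and last columns, so keep columns 1-10 only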
new_features = [f'f_27{i}' for i in 'abcdefghij']
train[new_features] = train.f_27.str.split('', expand=True).loc[:,1:10]
test[new_features] = test.f_27.str.split('', expand=True).loc[:,1:10]
all_data[new_features] = all_data.f_27.str.split('', expand=True).loc[:,1:10]
all_data[new_features].head()

In [None]:
#Display frequency details
descriptive_stats = all_data[new_features].describe().T
descriptive_stats = descriptive_stats.sort_values('unique', ascending=False)
descriptive_stats

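The number of distinct letters differs by position: some of these new features have only 2, 10, 15, or 20 unique values. In each case the letters form a consecutive run of the alphabet that starts at A and ends at, for example, B, O, or T.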
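
The claim that each position uses a consecutive run of the alphabet can be checked programmatically. This is a small sketch; `is_alphabet_prefix` is a hypothetical helper and it assumes uppercase letters, as in f_27.

```python
from string import ascii_uppercase

def is_alphabet_prefix(letters):
    """True if the distinct letters are exactly A, B, ... up to some letter."""
    distinct = sorted(set(letters))
    return distinct == list(ascii_uppercase[:len(distinct)])

# Hypothetical values observed at one position
print(is_alphabet_prefix(['A', 'B', 'B', 'A']))  # True: the letters are A-B
print(is_alphabet_prefix(['A', 'C']))            # False: B is missing
```

Applied to a real column, the input would be the values of one positional feature such as `all_data['f_27a']`.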

We can now plot the count values for each position.

In [None]:
fig, ax = plt.subplots(5, 2, figsize=(20,15))
for i, (ax, letter) in enumerate(zip(ax.flatten(), ascii_letters[:10])):
    create_countplot(f'f_27{letter}', f'Position {i+1}', ax=ax)

fig.tight_layout()

In [None]:
all_data['n_unique'].describe()

In [None]:
sns.histplot(x='n_unique', kde=True, bins=9, data=all_data);

### Correlation

In [None]:
#Correlation table (numeric columns only; the letter columns are strings)
corr = all_data.select_dtypes(include='number').corr()
plt.figure(figsize=(18,18))

#Hide the upper triangular part of the correlation matrix
#Code from: https://seaborn.pydata.org/examples/many_pairwise_correlations.html
#This will create a diagonal correlation matrix
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, linewidths=0.1, fmt='.2f', annot=True, annot_kws={'size': 8});

The correlation heatmap shows that hardly any of the features are correlated. The highest correlation coefficients are around 0.30 to 0.33.
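
Reading the strongest pairs off a large heatmap is tedious, so it can help to rank the pairs directly. Below is a minimal sketch of that idea in plain Python; the three columns in `columns` are made up, and `pearson` is a hypothetical helper implementing the usual Pearson coefficient.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical columns
columns = {
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],  # a scaled copy of 'a'
    'c': [5.0, 3.0, 4.0, 1.0, 2.0],
}

# Visit each unordered pair once, like the lower triangle of the heatmap
pairs = {(p, q): pearson(columns[p], columns[q]) for p, q in combinations(columns, 2)}

# Rank by absolute strength so strong negative correlations are not missed
ranked = sorted(pairs.items(), key=lambda kv: abs(kv[1]), reverse=True)
for (p, q), r in ranked:
    print(f'{p} vs {q}: {r:+.3f}')
```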

## EDA

There isn't much exploration we can do with the data, except for checking for normality and outliers. Thus, I will plot histograms and boxplots to do exactly that.

Since there are a lot of features, I will split them into two halves so that each figure uses less memory and the plots are more readable.
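
For reference, a boxplot marks a point as an outlier when it falls more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule on made-up values (the `values` list is hypothetical):

```python
from statistics import quantiles

# Hypothetical feature values with one obvious outlier
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

# Quartiles with linear interpolation between data points
q1, _, q3 = quantiles(values, n=4, method='inclusive')
iqr = q3 - q1

# Whisker limits used by the usual 1.5 * IQR rule
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]

print(f'Q1={q1}, Q3={q3}, IQR={iqr}, limits=({lower}, {upper})')
print('Outliers:', outliers)
```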

### Histogram

In [None]:
#Split numerical columns in half for easier plotting
num_cols = list(all_data.select_dtypes(exclude=['object']))
midpoint = len(num_cols) // 2
cols_first_half = num_cols[:midpoint]
cols_second_half = num_cols[midpoint:]
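cols_second_half.pop(-1) #Drop the last numeric column (n_unique), which was already plotted above
kde_params = dict(data=all_data, fill=True, palette=['red','green'], hue='target') #fill replaces the deprecated shade argument
subplot_params = dict(nrows=5, ncols=3, figsize=(20,20))

#Plot the density (KDE) of each feature on its own axis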
fig, ax = plt.subplots(**subplot_params)

#First half
for ax, col in zip(ax.flatten(), cols_first_half):
    sns.kdeplot(x=col, ax=ax, **kde_params)

fig.tight_layout()

In [None]:
fig, ax = plt.subplots(**subplot_params)

#Second half
for ax, col in zip(ax.flatten(), cols_second_half):
    sns.kdeplot(x=col, ax=ax, **kde_params)

fig.tight_layout()

From the histograms (drawn here as smoothed KDE curves), we can see that:
- The continuous features appear to be approximately normally distributed
- Plots with multiple peaks show that those features are multimodal, which supports my earlier point that the data could either be discrete or already encoded
- Feature 29 may originally have been a boolean feature that was encoded
- Feature 30 looks like a multiclass label feature
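
Visual impressions of normality can be backed up with two summary numbers: skewness and excess kurtosis, which are both close to 0 for normally distributed data. The sketch below uses synthetic samples rather than the competition features, and `skew_and_excess_kurtosis` is a hypothetical helper based on plain central moments.

```python
import random

def skew_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

random.seed(0)
normal_like = [random.gauss(0, 1) for _ in range(10_000)]    # roughly normal
skewed = [random.expovariate(1.0) for _ in range(10_000)]    # clearly right-skewed

print('normal-like:', skew_and_excess_kurtosis(normal_like))
print('skewed:     ', skew_and_excess_kurtosis(skewed))
```

With pandas, `DataFrame.skew()` and `DataFrame.kurt()` give comparable per-column summaries.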

### Boxplots

In [None]:
#Do the same with boxplots
fig, ax = plt.subplots(5, 3, figsize=(15,10))
for ax, col in zip(ax.flatten(), cols_first_half):
    sns.boxplot(x=col, data=all_data, ax=ax)

fig.tight_layout()

In [None]:
fig, ax = plt.subplots(4, 4, figsize=(15,10))
for ax, col in zip(ax.flatten(), cols_second_half):
    sns.boxplot(x=col, data=all_data, ax=ax)

fig.tight_layout()

## Model Building
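
The letter features are strings, so they are target-encoded before reaching the model. As a rough illustration of the idea, here is a sketch of smoothed target encoding in plain Python. It uses simple additive smoothing toward the global mean, which is an assumption made for clarity and not the exact formula used by `category_encoders.TargetEncoder`; the function name and toy data are hypothetical.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=5.0):
    """Learn a category -> smoothed mean target mapping.

    Each category mean is shrunk toward the global mean (the prior);
    rare categories are shrunk more strongly than frequent ones.
    """
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    mapping = {c: (sums[c] + smoothing * prior) / (counts[c] + smoothing) for c in counts}
    return mapping, prior

# Hypothetical training data
cats = ['A', 'A', 'A', 'B', 'B', 'C']
ys = [1, 1, 0, 0, 0, 1]
mapping, prior = fit_target_encoding(cats, ys, smoothing=5.0)

# Categories never seen during fitting fall back to the prior
encoded = [mapping.get(c, prior) for c in ['A', 'C', 'Z']]
print(mapping)
print(encoded)
```

Because the mapping depends on the target, it should be learned on training rows only, which is why the encoder sits inside a pipeline that is fitted per split.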

In [None]:
#Split data into training and testing
X = train.drop(['f_27','target'], axis=1)
y = train.target
X_test = test.drop('f_27', axis=1)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0, stratify=y)

print('Shape of training set: {} rows and {} columns'.format(X_train.shape[0], X_train.shape[1]))
print('Shape of validation set: {} rows and {} columns'.format(X_valid.shape[0], X_valid.shape[1]))

#This will be used to encode categorical features
encoder = TargetEncoder(smoothing=5)
#transformer = make_column_transformer((TfidfVectorizer(analyzer='char'), 'f_27'), remainder='passthrough')

In [None]:
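#Tune hyperparameters
def objective(trial):
    #Search space; step is passed by keyword so the call works across Optuna versions
    param_grid = dict(n_estimators=trial.suggest_int('n_estimators', 20, 1000, step=10), 
                      #LightGBM needs a strictly positive learning rate, so the range starts above 0
                      learning_rate=trial.suggest_float('learning_rate', 1e-3, 1), 
                      max_depth=trial.suggest_int('max_depth', 3, 12), 
                      min_split_gain=trial.suggest_float('min_split_gain', 0, 5), 
                      min_child_weight=trial.suggest_float('min_child_weight', 1, 10), 
                      colsample_bytree=trial.suggest_float("colsample_bytree", 0.2, 1),
                      #Note: LightGBM only applies subsample when subsample_freq > 0, which is not set here
                      subsample=trial.suggest_float("subsample", 0.2, 1),
                     )
 
    #Fit the encoder and model on the training split, then score on the validation split
    clf = make_pipeline(encoder, LGBMClassifier(**param_grid))
    clf.fit(X_train,y_train)
    y_pred = clf.predict_proba(X_valid)[:,1]
    return round(roc_auc_score(y_valid, y_pred), 5)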

study = optuna.create_study(direction='maximize', study_name='Hyperparameter Tuning')

#Run 30 trials, each with a different hyperparameter combination
study.optimize(objective, n_trials=30, show_progress_bar=True)

In [None]:
#Get the best parameters
best_params = study.best_params
print('Best parameter for:')
for k, v in best_params.items():
    print('{}: {}'.format(k,v))
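
As a mental model of what a study does, the loop of "sample parameters, score them, keep the best" can be sketched with plain random search. This is a simpler stand-in and not Optuna's default sampler (TPE), which uses earlier trials to guide later ones. The toy objective and all names below are hypothetical.

```python
import random

def toy_objective(params):
    """Toy score to maximize; it peaks at x = 0.3 and stands in for a validation AUC."""
    return -(params['x'] - 0.3) ** 2

def random_search(objective, n_trials, seed=0):
    """Sample the search space uniformly and keep the best trial."""
    rng = random.Random(seed)
    best_trial_params, best_trial_value = None, float('-inf')
    for _ in range(n_trials):
        params = {'x': rng.uniform(0.0, 1.0)}  # draw one candidate from the search space
        value = objective(params)
        if value > best_trial_value:           # maximize, like direction='maximize'
            best_trial_params, best_trial_value = params, value
    return best_trial_params, best_trial_value

toy_best_params, toy_best_value = random_search(toy_objective, n_trials=200)
print(toy_best_params, toy_best_value)
```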

In [None]:
#Create the K-Fold ROC AUC plot
#Code from:
#https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html
#https://www.kaggle.com/code/kanncaa1/roc-curve-with-k-fold-cv/notebook

cv = StratifiedKFold(10)
model = make_pipeline(encoder, LGBMClassifier(**best_params))

tprs, aucs = [], []
avg_fpr = np.linspace(0,1,100)
fig, ax = plt.subplots(figsize=(8,8))
lim = [0,1]

for i, (train_index, valid_index) in enumerate(cv.split(X,y)):
    X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
    
    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_valid)[:,1]
    
    fpr, tpr, thresh = roc_curve(y_valid, y_pred)
    interp_tpr = np.interp(avg_fpr, fpr, tpr) #Resample this fold's curve on the common FPR grid
    interp_tpr[0] = 0.0 #Make the curve start at the origin, as in the scikit-learn example
    tprs.append(interp_tpr)
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    
    #roc_auc_plot = RocCurveDisplay.from_estimator(model, X_valid, y_valid, name=f'ROC Fold {i+1}', ax=ax)
    ax.plot(fpr, tpr, label=f'ROC Fold {i+1} (AUC = {roc_auc:.5f})') #Display this on a legend
    #plt.legend(loc='best')

ax.plot(lim, lim, linestyle='--', color='r', label='Chance') #Draw the chance boundary

avg_tpr = np.mean(tprs, axis=0)
std_tpr = np.std(tprs, axis=0)
avg_auc = auc(avg_fpr, avg_tpr)
std_auc = np.std(aucs)
tprs_lower = np.maximum(avg_tpr - std_tpr, 0)
tprs_upper = np.minimum(avg_tpr + std_tpr, 1)

#Raw f-string so that \pm is not treated as an invalid escape sequence
ax.plot(avg_fpr, avg_tpr, color='b', label=rf'Mean ROC (AUC = {avg_auc:.5f} $\pm$ {std_auc:.5f})')

#Display standard deviation boundaries (the band is one standard deviation of the TPR across folds)
ax.fill_between(avg_fpr, tprs_lower, tprs_upper, color='grey', alpha=.8, label=r'$\pm$ 1 std. dev. of TPR');
ax.set(title='ROC AUC Plot', xlabel='False Positive Rate', ylabel='True Positive Rate')
ax.legend();
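
For intuition about the metric itself: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The brute-force sketch below computes it that way on made-up scores; `pairwise_auc` is a hypothetical helper and is far too slow for the real dataset.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities
y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true_toy, y_score_toy))  # 3 of 4 pairs are ordered correctly -> 0.75
```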

In [None]:
#Create importance plot using our model
plot_importance(model['lgbmclassifier'], dpi=90, figsize=(7,7));

In [None]:
#Refit on the full training data; X_train and y_train were overwritten by the last CV fold above
model.fit(X, y)
y_pred = model.predict_proba(X_test)[:,1]

#Make submission file
out = pd.DataFrame({'id': test.index, 'target': y_pred})
out.to_csv('results.csv', index=False)
out