# Tabular Playground Series - Nov 21

For the Tabular Playground Series of November '21, we aim to build a model that identifies spam emails from features extracted from each email. The data consists of 100 feature variables and a binary target. We will first perform some basic EDA to get a better look at the data, and then start working on our models.

## Plan

Moving forward, this is the plan we will follow. Keep in mind that it is not set in stone, and I might change it as we move through the notebook. It should show you how I approach datasets like this one.

- *Memory Reduction*
- *Sampling to Reduce Training Time*
- *Basic EDA*
- *Model Development*
- *Hyperparameter Tuning*
- *Feature Importance from top models*
- *Selecting the best Model*

## Imports 

Let's import the libraries we will be using throughout the notebook.

In [None]:
# Data Import on Kaggle
import os
import time
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Importing processing libraries
import numpy as np
import pandas as pd

# Importing Visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Importing preprocessing, metric, and model-selection utilities
from collections import Counter
from scipy.stats import randint
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    mean_squared_error, precision_recall_fscore_support, roc_auc_score,
)
from sklearn.model_selection import (
    GridSearchCV, KFold, RandomizedSearchCV, StratifiedKFold, train_test_split,
)

# Importing libraries for the models
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
# HistGradientBoostingClassifier is stable since scikit-learn 1.0; older versions
# also need `from sklearn.experimental import enable_hist_gradient_boosting` first
from sklearn.ensemble import (
    AdaBoostClassifier, GradientBoostingClassifier,
    HistGradientBoostingClassifier, RandomForestClassifier,
)
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
data = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv')
test_data = pd.read_csv('../input/tabular-playground-series-nov-2021/test.csv')

In [None]:
data = data.drop('id', axis=1)

## Memory Reduction

If memory is not a concern for you, feel free to skip this step.
Here, we look at how much memory the data and each feature consume, and then try to reduce it.

There are several other methods to save RAM - you can refer to this article on [14 tips to save RAM memory](https://www.kaggle.com/pavansanagapati/14-simple-tips-to-save-ram-memory-for-1-gb-dataset). 

In [None]:
memory_usage = data.memory_usage(deep=True) / 1024 ** 2
print('memory usage of features: \n', memory_usage.head(7))
print('memory usage sum: ',memory_usage.sum())

In [None]:
def reduce_memory_usage(df, verbose=True):
    """Downcast each numeric column to the smallest dtype that holds its value range."""
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                # Inclusive bounds, so values sitting exactly on a limit still fit
                for int_type in (np.int8, np.int16, np.int32, np.int64):
                    if c_min >= np.iinfo(int_type).min and c_max <= np.iinfo(int_type).max:
                        df[col] = df[col].astype(int_type)
                        break
            else:
                # Note: float16 keeps only about 3 significant digits,
                # so this branch trades precision for memory
                for float_type in (np.float16, np.float32, np.float64):
                    if c_min >= np.finfo(float_type).min and c_max <= np.finfo(float_type).max:
                        df[col] = df[col].astype(float_type)
                        break
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} MB ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

data = reduce_memory_usage(data, verbose=True)
test_data = reduce_memory_usage(test_data, verbose=True)
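
To make the idea behind the function above a bit more concrete, here is a minimal sketch on a made-up two-column frame. It uses `pd.to_numeric` with `downcast`, which is a different route from the hand-written function, and the `toy` frame and its column names are purely illustrative. The last lines show the price of going all the way down to `float16`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the competition data (values are made up)
toy = pd.DataFrame({
    "small_int": np.array([0, 1, 127], dtype=np.int64),
    "wide_float": np.array([0.1, 2.5, 1e6], dtype=np.float64),
})

before = toy.memory_usage(deep=True).sum()

# pd.to_numeric picks the smallest dtype that can hold every value
# (for floats it never goes below float32)
toy["small_int"] = pd.to_numeric(toy["small_int"], downcast="integer")
toy["wide_float"] = pd.to_numeric(toy["wide_float"], downcast="float")

after = toy.memory_usage(deep=True).sum()
print(toy.dtypes)
print(f"{before} bytes -> {after} bytes")

# float16 is lossy: the stored value differs from the original
original = 0.1234567
rounding_error = abs(float(np.float16(original)) - original)
print("float16 rounding error:", rounding_error)
```

If a model turns out to be sensitive to this rounding, stopping at `float32` is a reasonable compromise.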

In [None]:
data.describe()

## Sampling Data

Now that we have reduced the memory usage by over 70%, let's sample the data. 

Why are we doing this? Well, you don't have to. But if you're like me and own a MacBook Air that can't handle a dataset bigger than 100 MB, this might be a good idea.

When we perform model selection and hyperparameter tuning later, we can't afford to let the notebook run for hours on end testing every model. Taking a random 20% sample of the dataset approximately preserves the distribution of each feature while cutting the training time.

We can then perform EDA, modelling, hyperparameter tuning and other steps on this sampled data. Once we decide on the model we want to use and improve its performance, we can train the final model on the entire dataset again.

In [None]:
# frac=0.2 keeps 20% of the rows; random_state makes the sample reproducible
sample_df = data.sample(frac=0.2, random_state=42)
sample_df.shape

In [None]:
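# Let's confirm that the sampling retains the feature distributions.
# stat="density" normalizes both histograms, so their shapes are comparable
# even though the sample has only a fifth of the rows.

fig, ax = plt.subplots(figsize=(6, 4))

sns.histplot(
    data=data, x="f6", label="Original data", color="red", alpha=0.3, bins=15, stat="density", ax=ax
)
sns.histplot(
    data=sample_df, x="f6", label="Sample data", color="green", alpha=0.3, bins=15, stat="density", ax=ax
)

plt.legend()
plt.show()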
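
If you prefer numbers to plots, the same check can be done with summary statistics. This is only a sketch on synthetic data: the `full` frame and its single column `f` are made up, not part of the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one feature of the full dataset
rng = np.random.default_rng(0)
full = pd.DataFrame({"f": rng.normal(loc=5.0, scale=2.0, size=100_000)})

# Random 20% sample; random_state makes it repeatable
sample = full.sample(frac=0.2, random_state=0)

# Side-by-side summary statistics of the full data and the sample
summary = pd.DataFrame({
    "full": full["f"].describe(),
    "sample": sample["f"].describe(),
})
print(summary.loc[["mean", "std", "25%", "50%", "75%"]])
```

The two columns should be close to each other. If they are not, the sample is probably too small for that feature.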

## EDA

Let's start looking at any correlations that might exist among the features.
We will also be looking at the densities of every feature.

In [None]:
sample_df

In [None]:
# Check NaN values (total across all columns)
print('Number of NaN values:', sample_df.isna().sum().sum())

print('---------')
# Target class distribution
target_dist = sample_df.target.value_counts()
print('Distribution of Target Class \n', target_dist)
# Share of rows that belong to class 0
print(target_dist[0] / target_dist.sum())

There don't seem to be any NaN values in the data. Also, the target class is split roughly evenly between the two groups.

In [None]:
f, ax = plt.subplots(figsize=(8, 6))
corr = sample_df.iloc[:,:20].corr()
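# np.bool was removed from NumPy, so use the built-in bool.
# An all-False mask hides nothing, so the full correlation matrix is drawn.
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)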
plt.show()
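
A correlation matrix is symmetric, so half of the heatmap is redundant. Below is a small sketch of how a triangular mask can be built; the `demo` frame is random data used only for illustration. The resulting `upper` array is what you would pass as `mask=` to `sns.heatmap`.

```python
import numpy as np
import pandas as pd

# Random stand-in data with four hypothetical columns
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
corr_demo = demo.corr()

# True marks cells to hide: the diagonal and everything above it
upper = np.triu(np.ones_like(corr_demo, dtype=bool))

# Same effect on the table itself: hidden cells become NaN
lower_only = corr_demo.mask(upper)
print(lower_only.round(2))
```

With the mask applied, each pair of features appears exactly once.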

Before we look at distributions, we need to split the data into continuous and categorical variables.

In [None]:
cat_variables = []

for column in sample_df.columns:
    if sample_df[column].nunique() < 10:
        cat_variables.append(column)
print(cat_variables)

So, we have no categorical features in this dataset. Let's look at the distributions of all the features using kdeplot.

In [None]:
fig = plt.figure(figsize = (18, 50))
sns.set_style("white")
feature_cols = sample_df.columns.tolist()[:100]  # the 100 feature columns

for i, col in enumerate(feature_cols):
    plt.subplot(20, 5, i + 1)
    plt.title(col, size = 12, fontname = 'monospace')
    # `fill` replaces the deprecated `shade` argument of kdeplot
    a = sns.kdeplot(sample_df[col], color = '#1a5d57', fill = True, alpha = 0.9, linewidth = 1.5, edgecolor = 'black')
    plt.ylabel('')
    plt.xlabel('')
    plt.xticks(fontname = 'monospace')
    plt.yticks([])
    for j in ['right', 'left', 'top']:
        a.spines[j].set_visible(False)
    a.spines['bottom'].set_linewidth(1.2)

fig.tight_layout(h_pad = 3)

plt.show()

## Data Preparation

In this section, we will do some preprocessing. This part involves Feature Scaling and Splitting the data into Train and Test sets.

### Scaling 

While most of the models I plan to use in the 'model selection' section do not require any form of feature scaling (such as XGBoost and Random Forest), some of them (such as KNN and SVM) need it to work well.

##### Why

In general, algorithms that exploit distances or similarities (e.g. in the form of scalar product) between data samples, such as K-NN and Support Vector Machines, are sensitive to feature transformations.

Graphical-model based classifiers, such as Fisher LDA or Naive Bayes, as well as decision trees and tree-based ensemble methods (Random Forests, XGBoost), are invariant to feature scaling. Still, it might be a good idea to rescale or standardize your data.
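
Here is a tiny sketch of why distance-based models care about scale. The four points are made up: the first column lives in the thousands, the second between 0 and 1. Before scaling, the first column dominates the Euclidean distance; after min-max scaling, both columns count equally and the nearest neighbour of the first row changes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up points: column 0 is on a large scale, column 1 on a small one
points = np.array([
    [1000.0, 0.1],   # row 0: the query point
    [1100.0, 0.9],   # row 1
    [1500.0, 0.1],   # row 2
    [3000.0, 0.5],   # row 3
])

def nearest_to_first(arr):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    dists = np.linalg.norm(arr[1:] - arr[0], axis=1)
    return int(np.argmin(dists)) + 1

raw_nn = nearest_to_first(points)
scaled_nn = nearest_to_first(MinMaxScaler().fit_transform(points))
print("nearest neighbour before scaling: row", raw_nn)
print("nearest neighbour after scaling:  row", scaled_nn)
```

On the raw values row 1 is the closest, but once both columns are on a [0, 1] range row 2 takes over, because the difference in the second column is no longer drowned out.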

In [None]:
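# Scale only the feature columns; the binary target is left untouched
features = data.columns.drop('target')
scale = MinMaxScaler()
# Fit and transform once: calling transform again would rescale already-scaled values
sample_df[features] = scale.fit_transform(sample_df[features])

print('Data scaled using : ', scale)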

### Train-Test Split

Let's split our sampled data into train and test sets

In [None]:
X = sample_df.drop('target', axis=1)
y = sample_df.target

X_train, X_test, y_train, y_test = train_test_split( X, y, train_size=0.7, random_state=42)

del sample_df # we do this to remove sample_df from memory

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
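
In this notebook the scaler is fitted on the sampled data before the split, so the test rows influence the min and max used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; `X_demo`, `y_demo` and the `stratify` argument are my own additions for illustration, not something the notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 1000 rows, 3 features, binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = rng.integers(0, 2, size=1000)

# Split first (stratify keeps the class ratio the same in both parts)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then apply it to both parts
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
print("train range:", X_tr_scaled.min(), X_tr_scaled.max())
print("test range: ", X_te_scaled.min(), X_te_scaled.max())
```

The test range may fall slightly outside [0, 1]. That is expected, since the scaler has never seen those rows.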

In [None]:
id_test_submission = test_data.id
X_test_submission = test_data.drop('id', axis=1)

del test_data

## Model Selection

Finally, now that we're done with the preprocessing and EDA, we are going to take a look at how some basic models perform on a subset of the data (20%) without any parameter tuning. We can then retrain and evaluate the top performing models with a bigger dataset and tuned parameters.

You can take a look at the model_dict and add any models that you think might perform well. Or leave a comment and I'll add them asap!


*P.S: Kaggle has a time-out error if the run time of a notebook exceeds a certain time limit. So, I'll comment some of these models out. However, I'll keep the top performing models uncommented.*

In [None]:
model_dict = {
    'ADABoost': AdaBoostClassifier(),
    'Light GBM': lgb.LGBMClassifier(random_state=0, verbose=-1),
    'Logistic Reg': LogisticRegression(random_state=0, max_iter=350, solver='lbfgs'),
    'Naive Bayes': GaussianNB(),
#     'K Nearest Classifier': KNeighborsClassifier(),
}
model_list = []
train_auc_list = []
test_auc_list = []

for model, clf in model_dict.items():
    start_time = time.time()

    clf.fit(X_train, y_train)

    # test results: hard labels for the reports, probabilities for ROC AUC
    test_pred = clf.predict(X_test)
    test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

    # train results
    train_auc = roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1])

    print(model, 'Model')
    print('Classification Report \n', classification_report(y_test, test_pred))
    print('Confusion Matrix \n', confusion_matrix(y_test, test_pred))
    print('Train ROC AUC: ', train_auc)
    print('Test ROC AUC: ', test_auc)
    print("\n Ran in %s seconds" % (time.time() - start_time))
    print('--------------------------------')

    model_list.append(model)
    train_auc_list.append(train_auc)
    test_auc_list.append(test_auc)


results = pd.DataFrame({"model": model_list, "train_auc": train_auc_list, "test_auc": test_auc_list})

In [None]:
results
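
A quick note on the metric: ROC AUC measures how well a model ranks positives above negatives, so it is best computed from predicted probabilities rather than from hard 0/1 labels. The toy numbers below are made up to show the difference; thresholding throws away the ranking information inside each class.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1, 0, 1])
proba = np.array([0.10, 0.45, 0.40, 0.80, 0.30, 0.55])

# Hard labels obtained with the usual 0.5 threshold
hard = (proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, hard)

print("AUC from probabilities:", auc_from_proba)
print("AUC from hard labels:  ", auc_from_labels)
```

Here 8 of the 9 positive/negative pairs are ranked correctly by the probabilities, while the hard labels turn three of those pairs into ties, which lowers the score.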

## Conclusion

This notebook was an introduction to performing EDA and feature scaling, and to building some models with their default parameters. We also went over how to deal with large datasets: limiting RAM use with memory reduction techniques and sampling to reduce the training time. With these skills, you can add to the EDA and integrate any unique visualizations that you find useful.

### Next Steps

In the future, I will be adding more models to this notebook, but more importantly, I will be working on Feature Engineering and Hyperparameter Tuning in order to improve the predictive performance of these models.

P.S: Feature Engineering is the process of extracting useful features and attributes from the data to improve our model, while Hyperparameter Tuning is the process of picking a model and trying different parameter values to see which combination performs best (either manually or with a search such as GridSearchCV).
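
To give a taste of what the tuning step could look like, here is a minimal sketch using `RandomizedSearchCV` on synthetic data. Everything in it is an assumption made for illustration: the data comes from `make_classification`, the model is a plain logistic regression, and the search range for `C` is arbitrary. It is not the setup this notebook will end up using.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic binary classification data standing in for the real features
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=500),
    # Sample the regularization strength C on a log scale (range is arbitrary)
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10,                 # number of random parameter settings to try
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best cross-validated ROC AUC:", search.best_score_)
```

A random search like this tries a fixed number of settings, which keeps the run time predictable. `GridSearchCV` works the same way but tries every combination on a grid.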

## Submission

Keep in mind that this submission will not yield great results, since it is trained on a basic model with default parameters. We will need to work on the models further to improve the submission results.

In [None]:
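X_final = data.drop('target', axis=1)
y_final = data.target

model = lgb.LGBMClassifier(random_state=0, verbose=-1)
model.fit(X_final, y_final)
# Predict probabilities of the positive class rather than hard labels,
# since the models above are scored with ROC AUC
test_pred = model.predict_proba(X_test_submission)[:, 1]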

In [None]:
new_df = pd.DataFrame({'id': id_test_submission, 'target': test_pred})
new_df.to_csv('./lgbm_submission.csv', index=False)