# Hacking TPS-05 data and models ... next steps


## Since we have synthetic data ... I decided to run a crazy experiment


#### I asked myself: is it possible to build a good model on the TPS-05 data at all? I decided to simplify the problem by removing the influence of class imbalance and to see how an out-of-the-box LightGBM works on the TPS-05 data. This is how a new idea was born.

#### What if I classified only two classes, e.g. Class_1 and Class_4? Both have roughly the same amount of data, so the imbalance problem is eliminated. In addition, the classifier should separate the two classes well: there are only 2 classes instead of 4, and no imbalance to deal with.

#### I decided to conduct a set of experiments. See what came out.

<div class="alert alert-success">
  <strong>My other TPS-05 notebooks in this series:</strong>
    <ul>
        <li><a href = "https://www.kaggle.com/remekkinas/shap-lgbm-looking-for-best-features">SHAP + LGBM - looking for best features</a></li>
        <li><a href = "https://www.kaggle.com/remekkinas/tps-5-weighted-training-xgb-rf-lr-smote">Weighted training - XGB, RF, LR, ... SMOTE</a></li>
    </ul>
</div>

They are not so popular ... so either I'm right or ... I'm not ;) But if I am wrong, please leave a comment ... I would love to learn new things. My motivation is to build a good model for TPS-05.

<div class="progress">
  <div class="progress-bar progress-bar-warning" role="progressbar" aria-valuenow="100"
  aria-valuemin="0" aria-valuemax="100" style="width:100%">
      <strong>100% Complete</strong>
  </div>
</div>

## EXPERIMENT PREPARATION

In [None]:
import shap
import pandas as pd
import numpy as np
import seaborn as sns

from tqdm import tqdm
sns.set_style('whitegrid')
import matplotlib.pyplot as plt

from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier


from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

shap.initjs()

In [None]:
RANDOM_STATE = 2021

In [None]:
train = pd.read_csv("../input/tabular-playground-series-may-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-may-2021/test.csv")

In [None]:
sns.countplot(x = 'target', data = train)

## Let's divide the training dataset into groups

- Group 1 - Class_1 and Class_4 - to avoid class imbalance
- Group 2 - Class_2 and Class_3 - to avoid class imbalance
- Group 3 - Class_1 and Class_2 - to check if we find something interesting
- Group 4 - Class_2 and Class_4 - to check if we find something interesting
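
Before picking pairs, it can help to see how balanced each possible pair really is. The sketch below is only an illustration: the class counts are made-up toy numbers (not the TPS-05 counts) and `imbalance_ratio` is a hypothetical helper I introduce here.

```python
from itertools import combinations

import pandas as pd

# Toy stand-in for train['target']; these counts are invented for illustration.
target = pd.Series(
    ["Class_1"] * 10 + ["Class_2"] * 40 + ["Class_3"] * 20 + ["Class_4"] * 12
)
counts = target.value_counts()


def imbalance_ratio(counts, a, b):
    """Size of the larger class divided by the size of the smaller one."""
    big, small = max(counts[a], counts[b]), min(counts[a], counts[b])
    return float(big / small)


# Ratio for every possible pair of classes (1.0 means perfectly balanced).
ratios = {
    (a, b): imbalance_ratio(counts, a, b)
    for a, b in combinations(sorted(counts.index), 2)
}

for pair, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(pair, round(ratio, 2))
```

On the real data you would replace the toy series with `train['target']`.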

In [None]:
# .copy() makes each subset an independent DataFrame, so encoding the target
# later does not write into a slice of `train` (avoids SettingWithCopyWarning).
train_C14 = train[train.target.isin(['Class_1', 'Class_4'])].copy()
train_C23 = train[train.target.isin(['Class_2', 'Class_3'])].copy()
train_C12 = train[train.target.isin(['Class_1', 'Class_2'])].copy()
train_C24 = train[train.target.isin(['Class_2', 'Class_4'])].copy()

In [None]:
sns.countplot(x = 'target', data = train_C14)

In [None]:
lencoder = LabelEncoder()

train_C14['target'] = lencoder.fit_transform(train_C14['target'])
train_C23['target'] = lencoder.fit_transform(train_C23['target'])
train_C12['target'] = lencoder.fit_transform(train_C12['target'])
train_C24['target'] = lencoder.fit_transform(train_C24['target'])
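
Because the same encoder is refit for every group, the meaning of 0 and 1 changes from group to group. A minimal sketch, using toy label lists instead of the real columns, of how to check which class ends up as which integer (`LabelEncoder` sorts the labels, so the alphabetically first class becomes 0):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy label lists standing in for the 'target' column of two of the groups.
pairs = {
    "C14": ["Class_4", "Class_1", "Class_4"],
    "C23": ["Class_3", "Class_2", "Class_2"],
}

mappings = {}
for name, labels in pairs.items():
    enc = LabelEncoder().fit(pd.Series(labels))
    # classes_ holds the sorted original labels; their position is the code.
    mappings[name] = {
        str(cls): int(code)
        for cls, code in zip(enc.classes_, enc.transform(enc.classes_))
    }

print(mappings)
```

So in the Class_1 / Class_4 group, "1" means Class_4, while in the Class_2 / Class_3 group, "1" means Class_3.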


#### I set aside 10% of the data for model validation

In [None]:
X_train_C14, X_valid_C14, y_train_C14, y_valid_C14 = train_test_split(train_C14.drop(['id','target'], axis = 1), 
                                                                      train_C14.target,  
                                                                      stratify=train_C14.target, 
                                                                      test_size=0.1, 
                                                                      random_state= RANDOM_STATE)

X_train_C23, X_valid_C23, y_train_C23, y_valid_C23 = train_test_split(train_C23.drop(['id','target'], axis = 1), 
                                                                      train_C23.target,  
                                                                      stratify=train_C23.target, 
                                                                      test_size=0.1, 
                                                                      random_state= RANDOM_STATE)

X_train_C12, X_valid_C12, y_train_C12, y_valid_C12 = train_test_split(train_C12.drop(['id','target'], axis = 1), 
                                                                      train_C12.target,  
                                                                      stratify=train_C12.target, 
                                                                      test_size=0.1, 
                                                                      random_state= RANDOM_STATE)

X_train_C24, X_valid_C24, y_train_C24, y_valid_C24 = train_test_split(train_C24.drop(['id','target'], axis = 1), 
                                                                      train_C24.target,  
                                                                      stratify=train_C24.target, 
                                                                      test_size=0.1, 
                                                                      random_state= RANDOM_STATE)
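
The `stratify` argument is what keeps the class proportions the same in the training and validation parts. A small sketch on toy labels (an 80/20 split that I made up, not the competition data) to show the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy binary target with a 80% / 20% class split.
y = np.array([0] * 800 + [1] * 200)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single feature

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, stratify=y, test_size=0.1, random_state=2021
)

# With stratification both parts keep (almost exactly) the original class share.
share_train = y_tr.mean()
share_valid = y_va.mean()
print(f"share of class 1: train={share_train:.3f}, valid={share_valid:.3f}")
```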

## DEFINE SIMPLE CLASSIFIER AND K-FOLD TRAINING LOOP

In [None]:
params = {
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'metric': 'binary_logloss'
}

In [None]:
def train_group(X, y, Xv):
    """Train LightGBM with stratified 5-fold CV and predict on the hold-out set Xv.

    Returns the model fitted on the last fold and the class probabilities for Xv
    summed over all folds (the argmax is the same as for the average).
    """
    test_preds = None
    train_rmse = 0
    val_rmse = 0
    n_splits = 5
    
    model =  LGBMClassifier(**params)
    #model = CatBoostClassifier() # there is no difference - you can try it 
    
    skf = StratifiedKFold(n_splits = n_splits, shuffle = True,  random_state = 0)
    
    for tr_index , val_index in tqdm(skf.split(X.values , y.values), total=skf.get_n_splits(), desc="k-fold"):

        x_train_o, x_val_o = X.iloc[tr_index] , X.iloc[val_index]
        y_train_o, y_val_o = y.iloc[tr_index] , y.iloc[val_index]
        
        eval_set = [(x_val_o, y_val_o)]
        
        # Note: early_stopping_rounds and verbose are accepted by fit() in LightGBM < 4.0;
        # newer releases expect callbacks instead.
        model.fit(x_train_o, y_train_o, eval_set = eval_set, early_stopping_rounds=100, verbose=False)

        # RMSE is computed on hard 0/1 labels, so it equals sqrt(error rate)
        train_preds = model.predict(x_train_o)
        train_rmse += mean_squared_error(y_train_o ,train_preds , squared = False)

        val_preds = model.predict(x_val_o)
        val_rmse += mean_squared_error(y_val_o , val_preds , squared = False)
        
        # Accumulate hold-out probabilities across folds
        if test_preds is None:
            test_preds = model.predict_proba(Xv.values)
        else:
            test_preds += model.predict_proba(Xv.values)

    print(f"\nAverage Training RMSE : {train_rmse / n_splits}")
    print(f"Average Validation RMSE : {val_rmse / n_splits}\n")

    return model, test_preds
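
The loop adds up the fold probabilities instead of averaging them. For picking the predicted class this makes no difference, which the sketch below tries to show on random toy probabilities (the numbers are generated, not model output):

```python
import numpy as np

rng = np.random.default_rng(0)
n_folds, n_rows = 5, 6

# Fake predict_proba outputs: one (n_rows, 2) array per fold, rows sum to 1.
p1 = rng.uniform(size=(n_folds, n_rows))
fold_probas = np.stack([1 - p1, p1], axis=2)  # shape: (folds, rows, classes)

summed = fold_probas.sum(axis=0)     # what the training loop accumulates
averaged = fold_probas.mean(axis=0)  # proper probabilities again

labels_from_sum = summed.argmax(axis=1)
labels_from_mean = averaged.argmax(axis=1)

print(labels_from_sum, labels_from_mean)
```

If you need real probabilities (e.g. for log loss), divide the summed array by the number of folds first.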

In [None]:
def experiment(exp_title, X_train, y_train, X_valid, y_valid):
    print(f'=== {exp_title} ===')
    
    model, preds = train_group(X_train, y_train, X_valid)
    
    # SHAP values for the hold-out set (model fitted on the last fold)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_valid)
    shap.summary_plot(shap_values, X_valid)
    
    # Class with the highest probability summed over the folds
    y_preds = np.argmax(preds, axis=1)
    # For 0/1 labels the MSE equals the misclassification rate (1 - accuracy)
    print(f'MSE Score: {mean_squared_error(y_valid, y_preds)}\n')
    print(classification_report(y_valid, y_preds))
    
    sns.heatmap(pd.DataFrame(confusion_matrix(y_valid, y_preds)), annot=True, linewidths=.5, fmt="d")
    
    return shap_values, explainer
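
A note on the "MSE Score" and the RMSE values printed during training: they are computed on hard 0/1 labels, so they are just accuracy in disguise. A tiny sketch with hand-made labels (not results from this notebook):

```python
import numpy as np

# Hand-made binary labels and predictions: 5 of 8 are correct.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 1])

accuracy = (y_true == y_pred).mean()
mse = ((y_true - y_pred) ** 2).mean()  # each error contributes exactly 1
rmse = np.sqrt(mse)

print(f"accuracy={accuracy}, mse={mse}, rmse={rmse:.4f}")
```

So MSE = 1 - accuracy and RMSE = sqrt(1 - accuracy) in this setting.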

## EXPERIMENTS

#### As I wrote in the introduction, let's simplify the problem. Only two classes. Nothing more. Let's build a simple classifier.

### 1. CLASS_1 AND CLASS_4 ACCURACY CLASSIFIER

In [None]:
shap_values, explainer = experiment('CLASS 1 - 4', X_train_C14, y_train_C14, X_valid_C14, y_valid_C14)

### Let's take a look at feature_16 ...

In [None]:
# Let's look from Class_4 perspective (our 1) 
shap.summary_plot(shap_values[1], X_valid_C14)

In [None]:
sns.histplot(x = 'feature_16', data = X_valid_C14, bins=10)

In [None]:
X_valid_C14.feature_16.value_counts().head(5)

### Let's look at the rest of the TOP features ...

In [None]:
selected_features = ["feature_16", "feature_25", "feature_31", "feature_37"]

plt.figure(figsize=(20,5))
c = 1
for feat in selected_features:
    plt.subplot(1, 4, c)
    sns.histplot(x = feat, data = test, bins=10)
    c = c + 1    
plt.show()

In [None]:
for name in selected_features:
    shap.dependence_plot(name, shap_values[1], X_valid_C14)

According to the SHAP documentation (https://slundberg.github.io/shap/notebooks/plots/dependence_plot.html):

Each dot is a single prediction (row) from the dataset.
* The x-axis is the value of the feature (from the X matrix).
* The y-axis is the SHAP value for that feature, which represents how much knowing that feature's value changes the output of the model for that sample's prediction. 
* The color corresponds to a second feature that may have an interaction effect with the feature we are plotting (by default this second feature is chosen automatically). If an interaction effect is present between this other feature and the feature we are plotting it will show up as a distinct vertical pattern of coloring. 
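
To make "the SHAP value for that feature" a bit more concrete, here is a sketch for the simplest possible case. It is not the tree model from this notebook: I assume a toy linear model with made-up weights, for which (treating features as independent) the SHAP value of feature i is w_i * (x_i - mean(x_i)). The expected value plus all SHAP values of a row gives back the model output for that row.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # toy feature matrix

# Hypothetical linear model: f(x) = x @ w + b
w = np.array([2.0, -1.0, 0.5])
b = 0.3


def f(X):
    return X @ w + b


# Base value: the average model output over the data.
expected_value = f(X).mean()

# SHAP values of a linear model: one value per row and per feature.
phi = w * (X - X.mean(axis=0))

# Additivity: base value + sum of SHAP values = prediction for each row.
reconstructed = expected_value + phi.sum(axis=1)
print(np.allclose(reconstructed, f(X)))
```

A dependence plot for feature i is then simply a scatter of `X[:, i]` against `phi[:, i]`.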

### Let's see one sample from class_1 ...

In [None]:
shap.force_plot(explainer.expected_value[0], shap_values[0][0,:], X_valid_C14.iloc[0,:])

In the plot above, the bold 1.62 is the model's score for this observation. Higher scores lead the model to predict 1 and lower scores lead the model to predict 0. The features that were important to making the prediction for this observation are shown in red and blue, with red representing features that pushed the model score higher, and blue representing features that pushed the score lower. Features that had more of an impact on the score are located closer to the dividing boundary between red and blue, and the size of that impact is represented by the size of the bar.

Source: https://medium.com/mlearning-ai/shap-force-plots-for-classification-d30be430e195
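
If, as I assume here, the score in the force plot is a raw margin in log-odds (the usual default when explaining a binary gradient boosting model with a tree explainer), it can be turned into a probability with the sigmoid function. A minimal sketch using the 1.62 quoted above:

```python
import math


def sigmoid(z):
    """Convert a log-odds score into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


score = 1.62  # example score quoted in the text above
probability = sigmoid(score)

print(round(probability, 3))  # about 0.835
```

A score of 0 corresponds to a probability of 0.5, so positive scores lean towards class 1 and negative ones towards class 0.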

### ... and one sample from class_4

In [None]:
# Use the same row (index 1) for both the SHAP values and the feature values
shap.force_plot(explainer.expected_value[1], shap_values[1][1,:], X_valid_C14.iloc[1,:])

<div class="alert alert-danger">
  <strong>You can analyze this chart interactively - use the options at the TOP and on the LEFT to explore feature interactions ...</strong>
</div>

In [None]:
shap.force_plot(explainer.expected_value[0], shap_values[0][:500,:], X_valid_C14.iloc[:500,:])

<div class="alert alert-success">
  <strong>Conclusions:</strong>
    <ul>
        <li>SHAP shows us that the BIG BLOB (value 0) drives our model towards Class_4 (1); there are also some blue points on the left side, which is why not all 1482 zeros were classified as Class_4 ...</li>
        <li>From the very beginning our model does everything it can to assign observations to Class_4.</li>
        <li>BAD model performance ... look at the confusion matrix.</li>
    </ul>
</div>

### 2. CLASS_2 AND CLASS_3 ACCURACY CLASSIFIER

In [None]:
shap_values, explainer = experiment('CLASS 2 - 3', X_train_C23, y_train_C23, X_valid_C23, y_valid_C23)

In [None]:
shap.summary_plot(shap_values[0], X_valid_C23)

In [None]:
shap.dependence_plot(14, shap_values[0], X_valid_C23)

In [None]:
selected_features = ["feature_14", "feature_15", "feature_6", "feature_34"]

plt.figure(figsize=(20,5))
c = 1
for feat in selected_features:
    plt.subplot(1, 4, c)
    sns.histplot(x = feat, data = test, bins=10)
    c = c + 1    
plt.show()


<div class="alert alert-success">
  <strong>Conclusions:</strong>
    <ul>
        <li>BAD ... Class_2 = 5750, Class_3 = 2142 in the validation split - this pair is closer to balanced than the full four-class problem ... but the model still performs badly - look at the confusion matrix.</li>
    </ul>
</div>


### 3. CLASS_1 AND CLASS_2 ACCURACY CLASSIFIER

In [None]:
shap_values, explainer = experiment('CLASS 1 - 2', X_train_C12, y_train_C12, X_valid_C12, y_valid_C12)

### 4. CLASS_2 AND CLASS_4 ACCURACY CLASSIFIER

In [None]:
shap_values, explainer = experiment('CLASS 2 - 4', X_train_C24, y_train_C24, X_valid_C24, y_valid_C24)

# Finally, let's look at the data ... the zeros in the dataset ...

In [None]:
zero_data = ((train.drop('id', axis = 1).iloc[:,:50]==0).sum() / len(train) * 100)[::-1]
_, ax = plt.subplots(1,1,figsize=(10, 20))

ax.barh(zero_data.index, 100, color='#dadada', height = 1, edgecolor = '#FFFFFF')
barh = ax.barh(zero_data.index, zero_data, height = 1)
ax.bar_label(barh, fmt='%.02f %%')
  
plt.show()

<div class="alert alert-success">
  <strong>Conclusions:</strong>
    <ul>
        <li>We can experiment this way because the data is synthetic (unfortunately, when we split the training data like this we lose some information about correlations between classes, but ... this is still only an experiment).</li>
        <li>As we can see, class imbalance is not the problem.</li>
        <li>My hypothesis is that dealing with data sparsity is the biggest challenge here ...</li>
    </ul>
</div>
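
One way to dig into the sparsity hypothesis would be to compare the share of zeros per feature between classes, and to measure how sparse each row is. The sketch below uses a tiny made-up frame (the feature names mimic the competition's, the values are invented), so treat it as a starting point only:

```python
import pandas as pd

# Tiny invented example; on the real data use `train.drop(columns='id')`.
toy = pd.DataFrame({
    "feature_0": [0, 0, 0, 0, 1, 2],
    "feature_1": [0, 2, 0, 0, 0, 5],
    "target": ["Class_1", "Class_1", "Class_1", "Class_4", "Class_4", "Class_4"],
})

is_zero = toy.drop(columns="target").eq(0)

# Share of zeros per feature, separately for each class.
zero_share = is_zero.groupby(toy["target"]).mean()

# Share of zero-valued features in each row.
row_sparsity = is_zero.mean(axis=1)

print(zero_share)
print(row_sparsity.tolist())
```

Features whose zero share differs a lot between classes are the ones where "zero or not" carries signal by itself.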

### And ... what do you think? Is it possible to build a good classifier (without overfitting) based on this data? :)

### in my opinion ....  hmmmm  ... 


### I know begging for a vote here is not right. But I really spent a lot of time figuring out this competition. I am sharing my discoveries with you. Please appreciate my work - notebooks and this dataset. Thank you!