<div class="alert alert-success">  
</div>

<div class="alert alert-success">  
    <h1 align="center" style="color:darkcyan;">Binary Classification with a Software Defects Dataset</h1> 
    <h3 align="center" style="color:gray;">Playground Series - Season 3, Episode 23</h3> 
</div>

# <div style="color:white;background-color:darkcyan;padding:2%;border-radius:15px 15px;font-size:1em;text-align:center">LightGBM & bayes_opt</div>

**Challenge host:** The dataset for this competition (both train and test) was generated from a deep learning model trained on the [Software Defect Dataset](https://www.kaggle.com/datasets/semustafacevik/software-defect-prediction). Feature distributions are close to, but not exactly the same, as the original. Feel free to use the original dataset as part of this competition, both to explore differences as well as to see whether incorporating the original in training improves model performance.

![](https://cdn-images-1.medium.com/max/1000/1*J5WkxgechmTbjgn_u47Ubw.png)

In [None]:
import warnings # suppress warnings
warnings.filterwarnings('ignore')
#:::::::::::::::::::::::::::::::::::
import os
import gc
import glob
import random
import numpy as np 
import pandas as pd
import seaborn as sns
from tqdm import tqdm
from scipy import stats
from pathlib import Path
from itertools import groupby
#:::::::::::::::::::::::::::::::::::
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.express as px
%matplotlib inline
!ls ../input/*

<div class="alert alert-success">  
</div>

<div>
    <h1 align="center" style="color:darkred;">Competition Data</h1>
</div>

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:darkcyan;font-block:Nexa;overflow:hidden"><p style="padding:15px;color:white;overflow:hidden;font-size:70%;letter-spacing:0.5px;margin:0"><b> </b>Train Set</p></div>

In [None]:
train = pd.read_csv('../input/playground-series-s3e23/train.csv', index_col='id')
train 

In [None]:
MV = train.isnull().sum()
print('Missing Value in train:', MV[MV > 0])
print('Duplicates in train:', train.duplicated().sum())

In [None]:
train.info()  # info() prints its report and returns None, so display() is not needed
#train.describe().transpose()

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:darkcyan;font-block:Nexa;overflow:hidden"><p style="padding:15px;color:white;overflow:hidden;font-size:70%;letter-spacing:0.5px;margin:0"><b> </b>Test Set</p></div>

In [None]:
test = pd.read_csv('../input/playground-series-s3e23/test.csv', index_col='id')
test.shape

In [None]:
MV = test.isnull().sum()
print('Missing Value in test:', MV[MV > 0])
print('Duplicates in test:', test.duplicated().sum())

In [None]:
test.info()  # info() prints its report and returns None, so display() is not needed
#test.describe().transpose()

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:darkcyan;font-block:Nexa;overflow:hidden"><p style="padding:15px;color:white;overflow:hidden;font-size:70%;letter-spacing:0.5px;margin:0"><b> </b>Target</p></div>

In [None]:
plt.gca().set_facecolor('lightcyan')
train['defects'].value_counts(normalize=True).plot(kind='barh', figsize=(15,1), color=['lightblue','pink'])

pd.DataFrame(data= {'Number': train['defects'].value_counts(), 
                    'Percent': train['defects'].value_counts(normalize=True)})

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:darkcyan;font-block:Nexa;overflow:hidden"><p style="padding:15px;color:white;overflow:hidden;font-size:70%;letter-spacing:0.5px;margin:0"><b> </b>Sample Submission</p></div>

In [None]:
df_sample = pd.read_csv('../input/playground-series-s3e23/sample_submission.csv')
df_sample.shape

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:darkcyan;font-block:Nexa;overflow:hidden"><p style="padding:15px;color:white;overflow:hidden;font-size:70%;letter-spacing:0.5px;margin:0"><b> </b>Original data</p></div>

In [None]:
original_data = pd.read_csv('../input/software-defect-prediction/jm1.csv')
original_data.shape

### Attribute Information:

Reference: https://www.kaggle.com/datasets/semustafacevik/software-defect-prediction/data

**1. loc**             : numeric % McCabe's line count of code

**2. v(g)**            : numeric % McCabe "cyclomatic complexity"

**3. ev(g)**           : numeric % McCabe "essential complexity"

**4. iv(g)**           : numeric % McCabe "design complexity"

**5. n**               : numeric % Halstead total operators + operands

**6. v**               : numeric % Halstead "volume"

**7. l**               : numeric % Halstead "program length"

**8. d**               : numeric % Halstead "difficulty"

**9. i**               : numeric % Halstead "intelligence"

**10. e**               : numeric % Halstead "effort"

**11. b**               : numeric % Halstead 

**12. t**               : numeric % Halstead's time estimator

**13. lOCode**          : numeric % Halstead's line count

**14. lOComment**       : numeric % Halstead's count of lines of comments

**15. lOBlank**         : numeric % Halstead's count of blank lines

**16. lOCodeAndComment** : numeric

**17. uniq_Op**         : numeric % unique operators

**18. uniq_Opnd**       : numeric % unique operands

**19. total_Op**        : numeric % total operators

**20. total_Opnd**      : numeric % total operands

**21. branchCount**     : numeric % of the flow graph

**22. defects**         : {false,true} % module has/has not one or more reported defects
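
Several of the Halstead attributes above are derived from the four operator/operand counts. The sketch below uses the textbook Halstead formulas; whether the dataset columns were computed exactly this way is an assumption, and the synthetic competition data need not satisfy these identities. The function name `halstead` and its dictionary keys are hypothetical.

```python
import math


def halstead(uniq_op, uniq_opnd, total_op, total_opnd):
    """Textbook Halstead metrics from operator/operand counts (assumed mapping to columns)."""
    vocabulary = uniq_op + uniq_opnd                         # distinct operators + operands
    n = total_op + total_opnd                                # "n": total operators + operands
    v = n * math.log2(vocabulary)                            # "v": volume
    d = (uniq_op / 2) * (total_opnd / uniq_opnd)             # "d": difficulty
    e = d * v                                                # "e": effort
    t = e / 18                                               # "t": time estimator in seconds
    i = v / d                                                # "i": intelligence content
    return {'n': n, 'v': v, 'd': d, 'e': e, 't': t, 'i': i}


# Example: 4 unique operators, 4 unique operands, 10 operator and 6 operand occurrences
print(halstead(uniq_op=4, uniq_opnd=4, total_op=10, total_opnd=6))
# {'n': 16, 'v': 48.0, 'd': 3.0, 'e': 144.0, 't': 8.0, 'i': 16.0}
```

Because these quantities are functions of one another, strong correlations between the corresponding columns are to be expected.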

<div class="alert alert-success">  
</div>

# <span style="color:darkred;">Features</span>

In [None]:
features = [f for f in train.columns.tolist() if f !='defects']
len(features)

## <span style="color:darkcyan;">Train Set >> Histograms of the features</span>

In [None]:
sns.set(style='whitegrid')  # the 'seaborn-whitegrid' style name is deprecated since matplotlib 3.6
_, axs = plt.subplots(7, 3, figsize=(15,35), facecolor='lightyellow')

for f, ax in zip(features, axs.ravel()):
    ax.set_facecolor('lightblue')
    ax.hist(train[f], bins=80, color='red')
    ax.set_title(f'{f}\nUnique Levels: {train[f].nunique()}', fontsize=10)

plt.suptitle('Train Set (Histograms)', y=0.90, fontsize=14, c='darkred')
plt.show()

## <span style="color:darkcyan;">Train Set >> Histograms of the features - np.log1p function</span>

Thanks to: **@ambrosm** https://www.kaggle.com/code/ambrosm/pss3e23-eda-which-makes-sense

In [None]:
sns.set(style='whitegrid')  # the 'seaborn-whitegrid' style name is deprecated since matplotlib 3.6
_, axs = plt.subplots(7, 3, figsize=(15,35), facecolor='lightyellow')

for f, ax in zip(features, axs.ravel()):
    ax.set_facecolor('lightblue')
    ax.hist(np.log1p(train[f]), bins=80, color='red')
    ax.set_title(f'{f}\nUnique Levels: {train[f].nunique()}', fontsize=10)

plt.suptitle('Train Set (Histograms, log1p)', y=0.90, fontsize=14, c='darkred')
plt.show()

## <span style="color:darkcyan;">Test Set >> Histograms of the features - np.log1p function</span>

In [None]:
sns.set(style='whitegrid')  # the 'seaborn-whitegrid' style name is deprecated since matplotlib 3.6
_, axs = plt.subplots(7, 3, figsize=(15,35), facecolor='lightyellow')

for f, ax in zip(features, axs.ravel()):
    ax.set_facecolor('lightgray')
    ax.hist(np.log1p(test[f]), bins=80, color='red')
    ax.set_title(f'{f}\nUnique Levels: {test[f].nunique()}', fontsize=10)

plt.suptitle('Test Set (Histograms, log1p)', y=0.90, fontsize=14, c='darkred')
plt.show()

<div class="alert alert-success">  
</div>

# <span style="color:darkred;">Correlation Matrix</span>

In [None]:
cor_matrix = train[features].corr()
fig = plt.figure(figsize=(10,10));

cmap=sns.diverging_palette(240, 10, s=75, l=50, sep=1, n=6, center='light', as_cmap=False);
sns.heatmap(cor_matrix, center=0, annot=False, cmap=cmap, linewidths=1);
plt.suptitle('Train Set (Heatmap)', y=0.91, fontsize=18, c='darkred');
plt.show()

In [None]:
corr = train[features].corr(numeric_only=True).round(3)
corr.style.background_gradient(cmap='Pastel1')

## <span style="color:darkcyan;">According to the results above, feature "l" is problematic in its current form.</span>

### <span style="color:navy;"> >> Either the "l" feature is dropped (disabled here):</span>

In [None]:
#features = [f for f in features if f !='l']
#len(features)

### <span style="color:navy;"> >> Or the "l" feature is transformed to 1 - l:</span>

In [None]:
test['l']  = 1.0 - test['l']
train['l'] = 1.0 - train['l']

cor_matrix = train[features].corr()
fig = plt.figure(figsize=(10,10));

cmap=sns.diverging_palette(240, 10, s=75, l=50, sep=1, n=6, center='light', as_cmap=False);
sns.heatmap(cor_matrix, center=0, annot=False, cmap=cmap, linewidths=1);
plt.suptitle('Train Set (Heatmap)', y=0.91, fontsize=18, c='darkred');
plt.show()

## <span style="color:darkcyan;">X | y | test</span>

In [None]:
X = train[features]
y = train['defects']

test = test[features].copy()

<div class="alert alert-success">  
</div>

# <span style="color:darkred;">Evaluation Metric (AUC)</span>

In this Kaggle challenge, submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.
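
One way to read this metric: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as one half. The dependency-free sketch below computes it by brute force over all pairs; `pairwise_auc` is a hypothetical helper for illustration, not the function used in this notebook, and it is far slower than `roc_auc_score` on real data.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t]
    neg = [s for t, s in zip(y_true, y_score) if not t]
    if not pos or not neg:
        raise ValueError('AUC needs at least one positive and one negative example')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four (positive, negative) pairs are ordered correctly
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Only the ordering of the predictions matters, so any monotone rescaling of the predicted probabilities leaves the score unchanged.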

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve, auc

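def roc_auc(true_list, pred_list, figlen):
    """Print the ROC AUC and, if figlen > 0, plot the ROC curve."""
    fpr, tpr, _ = roc_curve(true_list, pred_list)    
    auc_value = auc(fpr, tpr)   # separate name, so the function roc_auc is not shadowed
    print(f'\nROC_AUC: {auc_value:.6f}\n')
    
    if (figlen > 0):
        sns.set(style='whitegrid')  # the 'seaborn-whitegrid' style name is deprecated since matplotlib 3.6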
        plt.figure(figsize=(figlen, figlen), facecolor='lightyellow')
        plt.gca().set_facecolor('lightgray')
        plt.fill_between(fpr, tpr, color='r', alpha=0.1)
        plt.plot(fpr, tpr, color='red', lw=2, label='ROC curve')
        plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
        plt.xlim([-0.01, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('The area under the ROC curve\n', fontsize=16, c='darkred')
        plt.legend(loc="lower right")
        plt.show()

<div class="alert alert-success">  
</div>

# <span style="color:darkred;">Gaussian Naive Bayes (GaussianNB)</span>

In [None]:
from sklearn.naive_bayes import GaussianNB

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import QuantileTransformer

In [None]:
transformed = pd.DataFrame(QuantileTransformer(output_distribution='normal').fit_transform(X))

pipeline = make_pipeline(QuantileTransformer(output_distribution='normal'), GaussianNB())
pipeline.fit(X, y)

In [None]:
cross_val_score(pipeline, X, y, scoring='roc_auc', cv=10).mean()

In [None]:
# In-sample AUC on the training data: expect it to be optimistic compared with the CV score above
roc_auc(y, pipeline.predict_proba(X)[:,1], 6)

In [None]:
preds_bayes = pipeline.predict_proba(test)[:,1]

sns.set()
plt.hist(preds_bayes, bins=50)
plt.gca().set_facecolor('lightblue')
plt.suptitle('Prediction Histogram', y=0.95, fontsize=14, c='darkred')

min(preds_bayes), max(preds_bayes)

## <span style="color:darkcyan;">Submission (bayes)</span>

In [None]:
sub1 = df_sample.copy()
sub1['defects'] = preds_bayes
sub1.to_csv('submission1.csv',index=False)
!ls

<div class="alert alert-success">  
</div>

![](https://cdn-images-1.medium.com/max/1000/1*NTQ0gJrz4exxBqizww3roA.png)

**Bayesian optimization** works by constructing a posterior distribution of functions (a Gaussian process) that best describes the function you want to optimize.

![](https://cdn-images-1.medium.com/max/1000/1*1HhgVrhk7ABeEaLsTLbWHA.gif)

Reference: https://github.com/bayesian-optimization/BayesianOptimization/blob/master/README.md
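
To make the idea concrete, here is a minimal, dependency-free sketch of the loop on a one-dimensional toy function. It is not the `bayes_opt` implementation: the kernel, its length scale, the upper-confidence-bound acquisition rule with `kappa`, the grid search over candidates, and all function names (`rbf`, `solve`, `gp_posterior`, `bayes_opt_1d`) are assumptions chosen for illustration.

```python
import math
import random


def rbf(a, b, length=0.3):
    """Squared-exponential kernel: nearby inputs get strongly correlated outputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_new = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_new, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_new, solve(K, k_new)))
    return mean, math.sqrt(max(var, 0.0))


def bayes_opt_1d(f, lo, hi, init_points=3, n_iter=10, kappa=2.0, seed=0):
    """Maximize f on [lo, hi]: random exploration first, then GP-guided steps."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(init_points)]   # random exploration
    ys = [f(x) for x in xs]
    grid = [lo + (hi - lo) * i / 200 for i in range(201)]
    for _ in range(n_iter):
        candidates = [x for x in grid if x not in xs]

        def ucb(x):
            # Upper confidence bound: prefer high predicted value or high uncertainty
            mean, std = gp_posterior(xs, ys, x)
            return mean + kappa * std

        x_next = max(candidates, key=ucb)
        xs.append(x_next)
        ys.append(f(x_next))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], len(xs)


best_x, best_y, n_evals = bayes_opt_1d(lambda x: -(x - 0.7) ** 2, 0.0, 1.0)
print(best_x, best_y, n_evals)
```

Each step refits the posterior to all points evaluated so far and then queries the candidate where the predicted value plus the remaining uncertainty is largest, which is how the method trades off exploitation against exploration.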

<div class="alert alert-success">  
</div>

# <span style="color:darkred;">LightGBM (Bayesian optimization)</span>

LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks. LightGBM grows trees leaf-wise (vertically): at each step it splits the leaf that yields the largest loss reduction, whereas many other tree-based algorithms grow trees level-wise (horizontally), splitting every leaf at the current depth.
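
The difference between the two growth strategies can be illustrated with a toy tree. The sketch below does not use LightGBM at all: `toy_gain` is a made-up gain function in which splits along the left-most path are far more useful than the rest, and both growers get the same budget of leaves.

```python
import heapq


def toy_gain(node):
    """Hypothetical split gain for a node (root = 1, children = 2i and 2i + 1)."""
    depth = node.bit_length() - 1
    on_leftmost_path = node == 1 << depth
    return (10.0 if on_leftmost_path else 1.0) / 2 ** depth


def grow_leaf_wise(num_leaves):
    """Always split the current leaf with the largest gain."""
    heap = [(-toy_gain(1), 1)]                   # max-heap via negated gains
    leaves, total_gain, depth = 1, 0.0, 0
    while leaves < num_leaves:
        neg_gain, node = heapq.heappop(heap)
        total_gain -= neg_gain
        for child in (2 * node, 2 * node + 1):
            heapq.heappush(heap, (-toy_gain(child), child))
        leaves += 1
        depth = max(depth, node.bit_length())    # depth of the two new children
    return total_gain, depth


def grow_level_wise(num_leaves):
    """Split the leaves one full level at a time, left to right."""
    frontier, leaves, total_gain, depth = [1], 1, 0.0, 0
    while leaves < num_leaves:
        next_frontier = []
        for node in frontier:
            if leaves == num_leaves:
                break
            total_gain += toy_gain(node)
            next_frontier += [2 * node, 2 * node + 1]
            leaves += 1
        frontier = next_frontier
        depth += 1
    return total_gain, depth


print(grow_leaf_wise(4))   # (17.5, 3): deeper tree, more total gain
print(grow_level_wise(4))  # (15.5, 2): balanced tree, less total gain
```

With the same number of leaves, the leaf-wise tree collects more gain but becomes deeper and less balanced, which is why leaf-wise growth is usually paired with constraints such as `num_leaves` and `min_child_samples`.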

## <span style="color:gray;">Parameters</span>

`class lightgbm.LGBMClassifier(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=100, subsample_for_bin=200000, objective=None, class_weight=None, min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20, subsample=1.0, subsample_freq=0, colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0, random_state=None, n_jobs=None, importance_type='split', **kwargs)`

In [None]:
from lightgbm import LGBMClassifier
from bayes_opt import BayesianOptimization

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import RepeatedStratifiedKFold

# HistGradientBoostingClassifier is stable since scikit-learn 1.0, so no experimental import is needed
from sklearn.ensemble import HistGradientBoostingClassifier

## <span style="color:darkcyan;">StandardScaler</span>

In [None]:
# scaler = StandardScaler()

# X = pd.DataFrame(scaler.fit_transform(X))
# test = pd.DataFrame(scaler.transform(test))   # reuse the scaler fitted on X; do not refit on test

## <span style="color:darkcyan;">Bayesian optimization</span>

In [None]:
def lgbm_cl_bo(min_child_samples, colsample_bytree, learning_rate, num_leaves, reg_alpha, reg_lambda):
    
    params_lgbm = {}
    params_lgbm['min_child_samples'] = round(min_child_samples)
    params_lgbm['colsample_bytree'] = colsample_bytree
    params_lgbm['learning_rate'] = learning_rate
    params_lgbm['num_leaves'] = round(num_leaves)
    params_lgbm['reg_alpha'] = reg_alpha
    params_lgbm['reg_lambda'] = reg_lambda    
       
    params_lgbm['boosting_type'] ='gbdt'   # Manual optimization
    params_lgbm['objective'] ='binary'     # Manual optimization
    params_lgbm['subsample'] = 1.0
    params_lgbm['max_bin'] = 1023
    params_lgbm['n_jobs'] = -1

    # Mean ROC AUC over 5 CV folds; this is the value the optimizer maximizes
    scores = cross_val_score(LGBMClassifier(**params_lgbm, random_state=2920), X, y, scoring='roc_auc', cv=5)
    return scores.mean()

**n_iter:** How many steps of Bayesian optimization you want to perform. The more steps, the more likely you are to find a good maximum.

**init_points:** How many steps of random exploration you want to perform. Random exploration can help by diversifying the exploration space.
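
Two practical consequences of these settings can be sketched in a few lines. First, the optimizer proposes real-valued points inside the bounds, so integer hyperparameters such as `num_leaves` and `min_child_samples` have to be rounded before they reach the model. Second, the total cost is `init_points + n_iter` evaluations of the objective, each of which runs a full cross-validation. The helper names below are hypothetical and only illustrate the arithmetic.

```python
# Hyperparameters that must be integers for the model (assumed set, matching the objective above)
INTEGER_PARAMS = {'min_child_samples', 'num_leaves'}


def to_model_params(suggestion):
    """Round integer-valued hyperparameters; keep the others as floats."""
    return {name: int(round(value)) if name in INTEGER_PARAMS else float(value)
            for name, value in suggestion.items()}


def total_model_fits(init_points, n_iter, cv_folds):
    """Number of models trained: one CV run per evaluated point."""
    return (init_points + n_iter) * cv_folds


print(to_model_params({'num_leaves': 22.6, 'min_child_samples': 864.2, 'learning_rate': 0.07}))
# {'num_leaves': 23, 'min_child_samples': 864, 'learning_rate': 0.07}
print(total_model_fits(init_points=20, n_iter=30, cv_folds=5))  # 250
```

The same rounding applies when the best point found by the optimizer is copied into a final model.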

In [None]:
params_lgbm ={'min_child_samples':(800, 1200),
              'colsample_bytree':(0.3, 1.0),
              'learning_rate':(0.005, 0.1),
              'num_leaves':(20, 60),
              'reg_alpha':(0.0, 10.0),
              'reg_lambda':(0.0, 5.0)}

lgbm_bo = BayesianOptimization(lgbm_cl_bo, params_lgbm, random_state=2920)
lgbm_bo.maximize(n_iter=30, init_points=20)

In [None]:
pmax_bayes = lgbm_bo.max['params']
pmax_bayes

<div class="alert alert-success">  
</div>

# <span style="color:darkred;">LightGBM - The Final Model</span>

In [None]:
model = LGBMClassifier(n_estimators= 20000, 
                       learning_rate= 0.07,
                       objective= 'binary', 
                       boosting_type= 'gbdt', 
                       
                       subsample= 1.0,
                       num_leaves= 23,  
                       max_bin= 1023,
                       n_jobs= -1,
                           
                       reg_alpha= 0.65,
                       reg_lambda= 3.1,
                       colsample_bytree= 0.568,
                       min_child_samples= 864,     
                       random_state= 1920)

# <span style="color:darkred;">HistGradientBoosting - The Final Model</span>

In [None]:
model0 = HistGradientBoostingClassifier(max_iter=250,
                                        validation_fraction=None, 
                                        learning_rate=0.007, 
                                        max_depth=10, 
                                        min_samples_leaf=24, 
                                        max_leaf_nodes=60,
                                        random_state=1920,
                                        verbose=0)

## <span style="color:darkcyan;">Cross Validation - RepeatedKFold</span>

In [None]:
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=1920)

print('Total number of folds:', rkf.get_n_splits(X, y))

For each fold, the model is chosen at random: LGBMClassifier with probability 3/4 and HistGradientBoostingClassifier with probability 1/4, so on average three quarters of the folds use LightGBM. In addition, any fold whose validation AUC is less than or equal to 0.79 (an arbitrary threshold) is ignored: it contributes neither to the mean AUC nor to the averaged test predictions.
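
The filtering and averaging step can be sketched independently of any model. In the snippet below, `fold_results` is a hypothetical list of `(validation AUC, test predictions)` pairs with made-up numbers; the function name and the error raised for the all-folds-rejected case are assumptions for illustration.

```python
def average_valid_folds(fold_results, threshold=0.79):
    """Average AUC and test predictions over folds whose AUC exceeds the threshold."""
    kept = [(auc, preds) for auc, preds in fold_results if auc > threshold]
    if not kept:
        raise ValueError('no fold passed the AUC threshold')
    mean_auc = sum(auc for auc, _ in kept) / len(kept)
    n_rows = len(kept[0][1])
    mean_preds = [sum(preds[i] for _, preds in kept) / len(kept) for i in range(n_rows)]
    return mean_auc, mean_preds, len(kept)


# Made-up fold results: the second fold falls below the threshold and is dropped
fold_results = [(0.80, [0.2, 0.4]), (0.78, [0.9, 0.9]), (0.82, [0.4, 0.6])]
mean_auc, mean_preds, n_kept = average_valid_folds(fold_results)
print(n_kept)  # 2
```

Note that discarding weak folds makes the reported mean AUC optimistic, since it is computed only over the folds that passed the threshold.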

In [None]:
from lightgbm import early_stopping, log_evaluation

counter = 0
auc_mean = 0
preds = np.zeros(len(test))
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=1920)

for fold, (train_idx, valid_idx) in enumerate(rkf.split(X)):  
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_valid, y_valid = X.iloc[valid_idx], y.iloc[valid_idx]  

    print(f'\n:::::::::::::::::: Fold ~ {fold+1} :::::::::::::::::::')
    
    # Pick the model for this fold: HGB with probability 1/4, LightGBM otherwise.
    # `random` is not seeded, so the assignment differs between runs.
    if random.randrange(4) == 0:
        print('HGBClassifier >>')
        fold_model = model0
        fold_model.fit(X_train, y_train)
    else:
        print('LGBMClassifier >>\n')
        fold_model = model
        # Callbacks replace the early_stopping_rounds/verbose arguments,
        # which were removed from fit() in LightGBM 4.0
        fold_model.fit(X_train, y_train,
                       eval_set=[(X_valid, y_valid)],
                       eval_metric='auc',
                       callbacks=[early_stopping(250), log_evaluation(100)])
    oof = fold_model.predict_proba(X_valid)[:, -1]
    
    # Named fold_auc so that sklearn's auc() function is not shadowed
    fold_auc = roc_auc_score(y_valid, oof)
    if (fold_auc <= 0.79): 
        print('\nAUC Score:', fold_auc, ' # was ignored.')
    else:
        counter += 1
        print('\nAUC Score:', fold_auc, ' # it is ok.')
        auc_mean += fold_auc
        # Predict the test set with the model that was fitted on this fold
        preds += fold_model.predict_proba(test)[:, -1]

auc_mean = auc_mean / counter      
preds = preds / counter 

print('\n', '='* 40)
print(' .'* 20)
print(' AUC Score (mean):', auc_mean)
print(' .'* 20)
print('='* 40, '\n')

print('Total number of folds:', rkf.get_n_splits(X, y))
print('Number of valid folds:', counter)

In [None]:
sns.set()
plt.hist(preds, bins=50)
plt.gca().set_facecolor('lightblue')
plt.suptitle('Prediction Histogram', y=0.95, fontsize=14, c='darkred')

min(preds), max(preds)

<div class="alert alert-success">  
</div>

## <span style="color:darkred;">LightGBM - Feature importance</span>

"Feature importance" measures how much each feature contributes to the fitted model (by default, LightGBM counts how often a feature is used in a split) and helps identify features that the model barely uses.

In [None]:
from lightgbm import plot_importance 

plot_importance(model, figsize=(12, 8), color=['darkcyan','red','purple','darkblue'], height=0.6, max_num_features=100,
                title='LightGBM - Feature importance', xlabel='Value', ylabel='Name Feature');

plt.gca().set_facecolor('lightblue')
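
The split counts shown in the plot can also be turned into relative shares, which makes it easier to spot features the model never uses. The sketch below works on a plain dictionary of counts; the numbers are made up, and in practice the counts could be taken from the fitted classifier's `feature_importances_` attribute. The helper names are hypothetical.

```python
def importance_shares(split_counts):
    """Sort features by split count and express each as a share of all splits."""
    total = sum(split_counts.values())
    ranked = sorted(split_counts.items(), key=lambda item: item[1], reverse=True)
    return [(name, count / total if total else 0.0) for name, count in ranked]


def unused_features(split_counts):
    """Features that were never chosen for a split."""
    return [name for name, count in split_counts.items() if count == 0]


# Made-up split counts for three features
counts = {'loc': 30, 'v(g)': 10, 'l': 0}
print(importance_shares(counts))  # [('loc', 0.75), ('v(g)', 0.25), ('l', 0.0)]
print(unused_features(counts))    # ['l']
```

Split counts say how often a feature was used, not how much each split improved the loss, so a feature with a low count is not necessarily uninformative.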

## <span style="color:darkcyan;">Submission (lgbm & hgb)</span>

In [None]:
sub2 = df_sample.copy()
sub2['defects'] = preds
sub2.to_csv('submission2.csv',index=False)
!ls

<div class="alert alert-success">  
</div>
