# Software Age-Related Bugs Prediction using Python

## Pre-Sampling Approach

In this paper, we build a model that predicts Age-Related Bugs (ARB) using Python. ARB datasets are usually imbalanced: of the two classes (defective and non-defective), the non-defective class dominates. First, we train the models and check the metrics without any over-sampling.
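Before modelling, it can help to quantify how skewed the classes are. The sketch below is illustrative only: `class_balance` is a hypothetical helper and `toy_labels` is made-up data standing in for the `AgingRelatedBugs` column.

```python
import pandas as pd

def class_balance(labels):
    """Return per-class counts and the majority-to-minority ratio."""
    counts = pd.Series(labels).value_counts()
    ratio = counts.max() / counts.min()
    return counts, ratio

# Toy labels standing in for the AgingRelatedBugs column (hypothetical data).
toy_labels = [0] * 9 + [1]
counts, ratio = class_balance(toy_labels)
print(counts.to_dict())
print(f"imbalance ratio: {ratio:.1f}")
```

On the real data, the same helper could be called as `class_balance(ds1['AgingRelatedBugs'])` once `ds1` is loaded.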

### Importing relevant libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
from sklearn.metrics import roc_auc_score 
import copy

### Importing Data

In [None]:
ds1 = pd.read_csv('dataset_mysql_innodb.csv')
ds1.head()

Unnamed: 0,id,Filename,AltAvgLineBlank,AltAvgLineCode,AltAvgLineComment,AltCountLineBlank,AltCountLineCode,AltCountLineComment,AvgCyclomatic,AvgCyclomaticModified,...,Vol,Dif,Eff,AllocOps,DeallocOps,DerefUse,UniqueDerefUse,DerefSet,UniqueDerefSet,AgingRelatedBugs
0,1,storage/innobase/btr/btr0btr.c,11.0,40.0,10.0,606.14,1932.29,722.14,8.0,8.0,...,48798.0,612.14,1294678.57,0.0,0.0,75.0,29.0,9.0,6.0,0
1,2,storage/innobase/btr/btr0cur.c,15.93,53.93,14.21,774.07,2429.75,888.0,11.0,11.0,...,67150.5,882.86,2031388.79,0.0,0.0,173.5,76.21,49.0,27.93,0
2,3,storage/innobase/btr/btr0pcur.c,11.0,32.0,7.0,134.0,324.57,133.0,10.0,10.0,...,8726.5,122.0,146078.0,0.0,0.0,57.0,14.0,34.0,9.0,0
3,4,storage/innobase/btr/btr0sea.c,20.0,60.07,11.89,398.61,1113.82,317.04,13.0,13.0,...,32102.0,451.79,1123430.0,0.0,0.0,280.61,76.82,76.96,22.0,0
4,5,storage/innobase/buf/buf0buf.c,12.0,37.64,8.0,539.46,1516.89,628.18,6.0,6.0,...,33375.18,436.96,805282.68,0.0,9.0,257.46,57.0,136.0,33.0,0


### Checking for Missing Values

In [None]:
ds1.isnull().sum()

id                                    0
Filename                              0
AltAvgLineBlank                       0
AltAvgLineCode                        0
AltAvgLineComment                     0
AltCountLineBlank                     0
AltCountLineCode                      0
AltCountLineComment                   0
AvgCyclomatic                         0
AvgCyclomaticModified                 0
AvgCyclomaticStrict                   0
AvgEssential                          0
AvgLine                               0
AvgLineBlank                          0
AvgLineCode                           0
AvgLineComment                        0
CountClassBase                        0
CountClassCoupled                     0
CountClassDerived                     0
CountDeclClass                        0
CountDeclClassMethod                  0
CountDeclClassVariable                0
CountDeclFunction                     0
CountDeclInstanceMethod               0
CountDeclInstanceVariable             0


### Checking Data types

In [None]:
ds1['AgingRelatedBugs'] = ds1['AgingRelatedBugs'].astype(str)

In [None]:
ds1.dtypes

id                                      int64
Filename                               object
AltAvgLineBlank                       float64
AltAvgLineCode                        float64
AltAvgLineComment                     float64
AltCountLineBlank                     float64
AltCountLineCode                      float64
AltCountLineComment                   float64
AvgCyclomatic                         float64
AvgCyclomaticModified                 float64
AvgCyclomaticStrict                   float64
AvgEssential                          float64
AvgLine                               float64
AvgLineBlank                          float64
AvgLineCode                           float64
AvgLineComment                        float64
CountClassBase                        float64
CountClassCoupled                     float64
CountClassDerived                     float64
CountDeclClass                        float64
CountDeclClassMethod                  float64
CountDeclClassVariable            

In [None]:
ds1.columns

Index(['id', 'Filename', 'AltAvgLineBlank', 'AltAvgLineCode',
       'AltAvgLineComment', 'AltCountLineBlank', 'AltCountLineCode',
       'AltCountLineComment', 'AvgCyclomatic', 'AvgCyclomaticModified',
       'AvgCyclomaticStrict', 'AvgEssential', 'AvgLine', 'AvgLineBlank',
       'AvgLineCode', 'AvgLineComment', 'CountClassBase', 'CountClassCoupled',
       'CountClassDerived', 'CountDeclClass', 'CountDeclClassMethod',
       'CountDeclClassVariable', 'CountDeclFunction',
       'CountDeclInstanceMethod', 'CountDeclInstanceVariable',
       'CountDeclInstanceVariablePrivate',
       'CountDeclInstanceVariableProtected', 'CountDeclInstanceVariablePublic',
       'CountDeclMethod', 'CountDeclMethodAll', 'CountDeclMethodConst',
       'CountDeclMethodFriend', 'CountDeclMethodPrivate',
       'CountDeclMethodProtected', 'CountDeclMethodPublic', 'CountInput',
       'CountLine', 'CountLineBlank', 'CountLineCode', 'CountLineCodeDecl',
       'CountLineCodeExe', 'CountLineComment', 'CountLi

### Creating feature and target variables and Label Encoding the target variable

In [None]:
cols = ['id','AltAvgLineBlank', 'AltAvgLineCode', 'AltAvgLineComment',
       'AltCountLineBlank', 'AltCountLineCode', 'AltCountLineComment',
       'AvgCyclomatic', 'AvgCyclomaticModified', 'AvgCyclomaticStrict',
       'AvgEssential', 'AvgLine', 'AvgLineBlank', 'AvgLineCode',
       'AvgLineComment', 'CountClassBase', 'CountClassCoupled',
       'CountClassDerived', 'CountDeclClass', 'CountDeclClassMethod',
       'CountDeclClassVariable', 'CountDeclFunction',
       'CountDeclInstanceMethod', 'CountDeclInstanceVariable',
       'CountDeclInstanceVariablePrivate',
       'CountDeclInstanceVariableProtected', 'CountDeclInstanceVariablePublic',
       'CountDeclMethod', 'CountDeclMethodAll', 'CountDeclMethodConst',
       'CountDeclMethodFriend', 'CountDeclMethodPrivate',
       'CountDeclMethodProtected', 'CountDeclMethodPublic', 'CountInput',
       'CountLine', 'CountLineBlank', 'CountLineCode', 'CountLineCodeDecl',
       'CountLineCodeExe', 'CountLineComment', 'CountLineInactive',
       'CountLinePreprocessor', 'CountOutput', 'CountPath', 'CountSemicolon',
       'CountStmt', 'CountStmtDecl', 'CountStmtEmpty', 'CountStmtExe',
       'Cyclomatic', 'CyclomaticModified', 'CyclomaticStrict', 'Essential',
       'Knots', 'MaxCyclomatic', 'MaxCyclomaticModified',
       'MaxCyclomaticStrict', 'MaxEssentialKnots', 'MaxInheritanceTree',
       'MaxNesting', 'MinEssentialKnots', 'PercentLackOfCohesion',
       'RatioCommentToCode', 'SumCyclomatic', 'SumCyclomaticModified',
       'SumCyclomaticStrict', 'SumEssential', 'n1', 'n2', 'N1', 'N2', 'Len',
       'Voc', 'Vol', 'Dif', 'Eff', 'AllocOps', 'DeallocOps', 'DerefUse',
       'UniqueDerefUse', 'DerefSet', 'UniqueDerefSet']
X = ds1[cols]
y = ds1['AgingRelatedBugs']


In [None]:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y = enc.fit_transform(y)
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

### Splitting data into training and test sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)
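With so few defective files, a plain random split can leave the test set with very few positives. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal in both splits. A minimal sketch on hypothetical toy data (`X_toy`, `y_toy` are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 10% positives.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 4))
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy asks for the same class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)
print("positives in train:", y_tr.sum(), "positives in test:", y_te.sum())
```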

### Logistic Regression Model

In [None]:
# Fitting the model

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
# Computing Accuracy
y_pred1 = logreg.predict(X_test)
print('Accuracy of Logistic Regression train set: {:.2f}'.format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic Regression test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of Logistic Regression train set: 0.96
Accuracy of Logistic Regression test set: 0.86


In [None]:
# Computing Performance Metrics and drawing Confusion Matrix
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred1))
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred1))

              precision    recall  f1-score   support

           0       0.93      0.91      0.92        93
           1       0.20      0.25      0.22         8

    accuracy                           0.86       101
   macro avg       0.57      0.58      0.57       101
weighted avg       0.88      0.86      0.87       101

[[85  8]
 [ 6  2]]


In [None]:
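# Applying 5-fold Cross Validation
# Note: this cell cross-validates on the test split (X_test, y_test);
# the later cells cross-validate on the training data instead.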
from sklearn.model_selection import cross_val_score
print(cross_val_score(logreg, X_test, y_test, cv=5))

[0.85714286 0.66666667 0.85714286 0.89473684 0.94736842]




In [None]:
roc_auc_score(y_test,y_pred1)

0.5819892473118279
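Because `roc_auc_score` is called here with hard 0/1 predictions rather than scores, the value it returns for a binary problem reduces to the mean of the two per-class recalls (balanced accuracy). The sketch below recomputes it from the confusion matrix printed above; `balanced_accuracy_from_cm` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def balanced_accuracy_from_cm(cm):
    """Mean of per-class recall for a 2x2 confusion matrix laid out as
    [[TN, FP], [FN, TP]] (scikit-learn's convention)."""
    cm = np.asarray(cm, dtype=float)
    tnr = cm[0, 0] / cm[0].sum()   # recall of class 0
    tpr = cm[1, 1] / cm[1].sum()   # recall of class 1
    return (tnr + tpr) / 2

# Confusion matrix printed above for the logistic regression baseline.
cm_logreg = [[85, 8], [6, 2]]
print(balanced_accuracy_from_cm(cm_logreg))
```

Passing scores such as `logreg.predict_proba(X_test)[:, 1]` instead of `y_pred1` would give a threshold-independent AUC.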

### Support Vector Classifier (RBF Kernel)

In [None]:
from sklearn.svm import SVC
svc = SVC(kernel = 'rbf')
svc.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [None]:
y_pred2 = svc.predict(X_test)
print('Accuracy of SVM train set : {:.2f}'.format(svc.score(X_train, y_train)))
print('Accuracy of SVM test set : {:.2f}'.format(svc.score(X_test, y_test)))
print(classification_report(y_test, y_pred2))
print(confusion_matrix(y_test, y_pred2))
print(cross_val_score(svc, X_train, y_train, cv=5))

Accuracy of SVM train set : 1.00
Accuracy of SVM test set : 0.92
              precision    recall  f1-score   support

           0       0.92      1.00      0.96        93
           1       0.00      0.00      0.00         8

    accuracy                           0.92       101
   macro avg       0.46      0.50      0.48       101
weighted avg       0.85      0.92      0.88       101

[[93  0]
 [ 8  0]]
[0.91803279 0.91803279 0.91666667 0.91666667 0.93220339]




In [None]:
roc_auc_score(y_test,y_pred2)

0.5

### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier().fit(X_train, y_train)
y_pred3 = clf.predict(X_test)
print('Accuracy of Random Forest classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Random Forest classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
print(classification_report(y_test, y_pred3))
print(confusion_matrix(y_test, y_pred3))
print(cross_val_score(clf, X_train, y_train, cv=5))



Accuracy of Random Forest classifier on training set: 0.99
Accuracy of Random Forest classifier on test set: 0.86
              precision    recall  f1-score   support

           0       0.92      0.92      0.92        93
           1       0.12      0.12      0.12         8

    accuracy                           0.86       101
   macro avg       0.52      0.52      0.52       101
weighted avg       0.86      0.86      0.86       101

[[86  7]
 [ 7  1]]
[0.93442623 0.8852459  0.91666667 0.91666667 0.93220339]


In [None]:
roc_auc_score(y_test,y_pred3)

0.5248655913978495

## Post-Sampling Approach

As we saw in the Pre-Sampling Approach, because the dataset contains few defective instances, all models tended to overfit their respective training sets. Hence, we need to apply oversampling techniques to balance the imbalanced dataset. Several general-purpose algorithms are available for this. We will use the SMOTE algorithm, which generates synthetic minority-class instances by interpolating between existing ones, to bring the minority class on par with the majority class. After that, we will again compute the performance metrics for all the models discussed above and compare the results.
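To make the idea behind SMOTE concrete, here is a minimal NumPy sketch of its core step: a new point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not imbalanced-learn's implementation; `smote_like_sample` and `X_minority` are hypothetical names.

```python
import numpy as np

def smote_like_sample(X_min, k=2, n_new=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))          # pick a minority sample
        j = neighbours[i, rng.integers(k)]    # pick one of its neighbours
        gap = rng.random()                    # position along the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Four hypothetical minority points at the corners of the unit square.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_sample(X_minority)
print(new_points.shape)
```

Every synthetic point lies between two existing minority points, so the new samples are near, but not identical to, the originals.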

## SMOTE

### Balancing the Dataset Using the SMOTE Algorithm

In [None]:
from imblearn.over_sampling import SMOTE

print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)

Number transactions X_train dataset:  (301, 83)
Number transactions y_train dataset:  (301,)
Number transactions X_test dataset:  (101, 83)
Number transactions y_test dataset:  (101,)




In [None]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

Before OverSampling, counts of label '1': 24
Before OverSampling, counts of label '0': 277 



In [None]:
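smt = SMOTE()
# fit_resample replaces the deprecated fit_sample alias.
# Note: this resamples the full dataset (X, y), not only the training split,
# so the rows of X_test also appear in the resampled training data.
X_train_res, y_train_res = smt.fit_resample(X, y)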

print('After OverSampling, the shape of X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of y: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))


After OverSampling, the shape of X: (740, 83)
After OverSampling, the shape of y: (740,) 

After OverSampling, counts of label '1': 370
After OverSampling, counts of label '0': 370


### Logistic Regression with SMOTE

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train_res, y_train_res)
y_pred = logreg.predict(X_test)
print('Accuracy of Logistic Regression train set: {:.2f}'.format(logreg.score(X_train_res, y_train_res)))
print('Accuracy of Logistic Regression test set: {}'.format(logreg.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(logreg, X_train_res, y_train_res, cv=5))

Accuracy of Logistic Regression train set: 0.81
Accuracy of Logistic Regression test set: 0.7821782178217822
              precision    recall  f1-score   support

           0       0.97      0.78      0.87        93
           1       0.23      0.75      0.35         8

    accuracy                           0.78       101
   macro avg       0.60      0.77      0.61       101
weighted avg       0.91      0.78      0.83       101

[[73 20]
 [ 2  6]]
[0.77702703 0.79054054 0.78378378 0.85810811 0.83783784]




In [None]:
roc_auc_score(y_test,y_pred)

0.7674731182795699

### SVM with SMOTE

In [None]:
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train_res, y_train_res)
y_pred = svc.predict(X_test)
print('Accuracy of SVM train set : {:.2f}'.format(svc.score(X_train_res, y_train_res)))
print('Accuracy of SVM test set : {:.2f}'.format(svc.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(svc, X_train_res, y_train_res, cv=5))



Accuracy of SVM train set : 1.00
Accuracy of SVM test set : 1.00
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        93
           1       1.00      1.00      1.00         8

    accuracy                           1.00       101
   macro avg       1.00      1.00      1.00       101
weighted avg       1.00      1.00      1.00       101

[[93  0]
 [ 0  8]]




[0.51351351 0.51351351 0.5        0.5        0.5       ]




In [None]:
roc_auc_score(y_test,y_pred)

1.0

### Random Forest Classifier with SMOTE

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf2 = RandomForestClassifier().fit(X_train_res, y_train_res)
y_pred6 = clf2.predict(X_test)
print('Accuracy of Random Forest classifier on training set: {:.2f}'.format(clf2.score(X_train_res, y_train_res)))
print('Accuracy of Random Forest classifier on test set: {}'.format(clf2.score(X_test, y_test)))
print(classification_report(y_test, y_pred6))
print(confusion_matrix(y_test, y_pred6))
print(cross_val_score(clf2, X_train_res, y_train_res, cv=5))



Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        93
           1       1.00      1.00      1.00         8

    accuracy                           1.00       101
   macro avg       1.00      1.00      1.00       101
weighted avg       1.00      1.00      1.00       101

[[93  0]
 [ 0  8]]
[0.91216216 0.9527027  0.89189189 0.99324324 0.87837838]


In [None]:
roc_auc_score(y_test,y_pred6)

1.0

## SMOTE + Standardization

In [None]:
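from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_res = sc.fit_transform(X_train_res)
# Note: X_test is overwritten with its scaled version here, so every later
# cell that uses X_test (including the sections below) sees scaled features.
X_test = sc.transform(X_test)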
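An alternative way to standardize, sketched here under the assumption that a scikit-learn pipeline is acceptable, is to bundle the scaler with the classifier. The scaler is then fitted on the training split only and the original test array is left untouched. The data below is hypothetical toy data, not the InnoDB dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with features on very different scales.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 10000.0])
y_toy = (X_toy[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)
X_te_before = X_te.copy()

# The scaler is fitted on the training split only, inside the pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
print("test accuracy: {:.2f}".format(model.score(X_te, y_te)))
```

Because scaling happens inside `model`, the same unscaled `X_te` can be reused by other models later in a notebook.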

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train_res, y_train_res)
y_pred = logreg.predict(X_test)
print('Accuracy of Logistic Regression train set: {:.2f}'.format(logreg.score(X_train_res, y_train_res)))
print('Accuracy of Logistic Regression test set: {}'.format(logreg.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(logreg, X_train_res, y_train_res, cv=5))
roc_auc_score(y_test,y_pred)



Accuracy of Logistic Regression train set: 0.91
Accuracy of Logistic Regression test set: 0.8712871287128713
              precision    recall  f1-score   support

           0       0.99      0.87      0.93        93
           1       0.37      0.88      0.52         8

    accuracy                           0.87       101
   macro avg       0.68      0.87      0.72       101
weighted avg       0.94      0.87      0.89       101

[[81 12]
 [ 1  7]]
[0.91891892 0.90540541 0.86486486 0.87837838 0.69594595]


0.8729838709677419

In [None]:
svc = SVC()
svc.fit(X_train_res, y_train_res)
y_pred = svc.predict(X_test)
print('Accuracy of SVM train set : {:.2f}'.format(svc.score(X_train_res, y_train_res)))
print('Accuracy of SVM test set : {:.2f}'.format(svc.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(svc, X_train_res, y_train_res, cv=5))
roc_auc_score(y_test,y_pred)



Accuracy of SVM train set : 0.90
Accuracy of SVM test set : 0.83
              precision    recall  f1-score   support

           0       0.99      0.83      0.90        93
           1       0.30      0.88      0.45         8

    accuracy                           0.83       101
   macro avg       0.65      0.85      0.68       101
weighted avg       0.93      0.83      0.87       101

[[77 16]
 [ 1  7]]




[0.93243243 0.92567568 0.81081081 0.96621622 0.70945946]




0.8514784946236559

In [None]:
clf = RandomForestClassifier().fit(X_train_res, y_train_res)
y_pred = clf.predict(X_test)
print('Accuracy of Random Forest classifier on training set: {:.2f}'.format(clf.score(X_train_res, y_train_res)))
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(clf, X_train_res, y_train_res, cv=5))
roc_auc_score(y_test,y_pred)



Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 1.00
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        93
           1       1.00      1.00      1.00         8

    accuracy                           1.00       101
   macro avg       1.00      1.00      1.00       101
weighted avg       1.00      1.00      1.00       101

[[93  0]
 [ 0  8]]
[0.91216216 0.94594595 0.89864865 0.97972973 0.85135135]


1.0

## Random Oversampling

In [None]:
from imblearn.over_sampling import RandomOverSampler

print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)

Number transactions X_train dataset:  (301, 83)
Number transactions y_train dataset:  (301,)
Number transactions X_test dataset:  (101, 83)
Number transactions y_test dataset:  (101,)


In [None]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))
sum(y_test==0)
sum(y_test==1)

Before OverSampling, counts of label '1': 24
Before OverSampling, counts of label '0': 277 



8

In [None]:
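ros = RandomOverSampler()
# fit_resample replaces the deprecated fit_sample alias.
# Note: as in the SMOTE section, the full dataset (X, y) is resampled here.
X_train_resampled, y_train_resampled = ros.fit_resample(X, y)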

print('After OverSampling, the shape of train_X: {}'.format(X_train_resampled.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_resampled.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_resampled==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_resampled==0)))

After OverSampling, the shape of train_X: (740, 83)
After OverSampling, the shape of train_y: (740,) 

After OverSampling, counts of label '1': 370
After OverSampling, counts of label '0': 370
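An alternative that keeps the test rows out of the resampled data is to oversample the training split only. The following is a minimal NumPy sketch of random oversampling for a binary target; `random_oversample` is a hypothetical helper, not imbalanced-learn's `RandomOverSampler`, and the arrays are made-up toy data.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until both classes of a
    binary problem have the same count."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    # Sample minority row indices with replacement.
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

# Hypothetical toy training split: 8 negatives, 2 positives.
X_tr = np.arange(20, dtype=float).reshape(10, 2)
y_tr = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_oversample(X_tr, y_tr)
print(np.bincount(y_bal))
```

Only rows from the training split are duplicated, so a held-out test set stays unseen by the model.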


In [None]:
logreg2 = LogisticRegression()
logreg2.fit(X_train_resampled, y_train_resampled)
y_pred = logreg2.predict(X_test)
print('Accuracy of Logistic Regression train set: {:.2f}'.format(logreg2.score(X_train_resampled, y_train_resampled)))
print('Accuracy of Logistic Regression test set: {:.2f}'.format(logreg2.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(logreg2, X_train_resampled, y_train_resampled, cv=5))

Accuracy of Logistic Regression train set: 0.79
Accuracy of Logistic Regression test set: 0.76
              precision    recall  f1-score   support

           0       0.93      0.81      0.86        93
           1       0.10      0.25      0.14         8

    accuracy                           0.76       101
   macro avg       0.51      0.53      0.50       101
weighted avg       0.86      0.76      0.81       101

[[75 18]
 [ 6  2]]
[0.71621622 0.77702703 0.68918919 0.80405405 0.75675676]




In [None]:
svc2 = SVC()
svc2.fit(X_train_resampled, y_train_resampled)
y_pred = svc2.predict(X_test)
print('Accuracy of SVM train set : {:.2f}'.format(svc2.score(X_train_resampled, y_train_resampled)))
print('Accuracy of SVM test set : {:.2f}'.format(svc2.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(svc2, X_train_resampled, y_train_resampled, cv=5))



Accuracy of SVM train set : 1.00
Accuracy of SVM test set : 0.92
              precision    recall  f1-score   support

           0       0.92      1.00      0.96        93
           1       0.00      0.00      0.00         8

    accuracy                           0.92       101
   macro avg       0.46      0.50      0.48       101
weighted avg       0.85      0.92      0.88       101

[[93  0]
 [ 8  0]]




[1. 1. 1. 1. 1.]




In [None]:
clf2 = RandomForestClassifier().fit(X_train_resampled, y_train_resampled)
y_pred = clf2.predict(X_test)
print('Accuracy of Random Forest classifier on training set: {:.2f}'.format(clf2.score(X_train_resampled, y_train_resampled)))
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(clf2.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(clf2, X_train_resampled, y_train_resampled, cv=5))



Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 0.92
              precision    recall  f1-score   support

           0       0.92      1.00      0.96        93
           1       0.00      0.00      0.00         8

    accuracy                           0.92       101
   macro avg       0.46      0.50      0.48       101
weighted avg       0.85      0.92      0.88       101

[[93  0]
 [ 8  0]]




[0.97297297 0.96621622 0.9527027  0.98648649 0.95945946]


## ADASYN

In [None]:
from imblearn.over_sampling import ADASYN

print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)

print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

ads = ADASYN(n_neighbors = 1)

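# fit_resample replaces the deprecated fit_sample alias.
# Note: as in the SMOTE section, the full dataset (X, y) is resampled here.
X_train_rsm, y_train_rsm = ads.fit_resample(X, y)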

print('After OverSampling, the shape of train_X: {}'.format(X_train_rsm.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_rsm.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_rsm==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_rsm==0)))

Number transactions X_train dataset:  (301, 83)
Number transactions y_train dataset:  (301,)
Number transactions X_test dataset:  (101, 83)
Number transactions y_test dataset:  (101,)
Before OverSampling, counts of label '1': 24
Before OverSampling, counts of label '0': 277 

After OverSampling, the shape of train_X: (738, 83)
After OverSampling, the shape of train_y: (738,) 

After OverSampling, counts of label '1': 368
After OverSampling, counts of label '0': 370


In [None]:
logreg3 = LogisticRegression()
logreg3.fit(X_train_rsm, y_train_rsm)
y_pred = logreg3.predict(X_test)
print('Accuracy of Logistic Regression train set: {:.2f}'.format(logreg3.score(X_train_rsm, y_train_rsm)))
print('Accuracy of Logistic Regression test set: {:.2f}'.format(logreg3.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(logreg3, X_train_rsm, y_train_rsm, cv=5))

Accuracy of Logistic Regression train set: 0.77
Accuracy of Logistic Regression test set: 0.72
              precision    recall  f1-score   support

           0       0.92      0.76      0.84        93
           1       0.08      0.25      0.12         8

    accuracy                           0.72       101
   macro avg       0.50      0.51      0.48       101
weighted avg       0.86      0.72      0.78       101

[[71 22]
 [ 6  2]]
[0.60135135 0.69594595 0.77027027 0.93877551 0.60544218]




In [None]:
svc3 = SVC()
svc3.fit(X_train_rsm, y_train_rsm)
y_pred = svc3.predict(X_test)
print('Accuracy of SVM train set : {:.2f}'.format(svc3.score(X_train_rsm, y_train_rsm)))
print('Accuracy of SVM test set : {:.2f}'.format(svc3.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(svc3, X_train_rsm, y_train_rsm, cv=5))



Accuracy of SVM train set : 1.00
Accuracy of SVM test set : 0.92
              precision    recall  f1-score   support

           0       0.92      1.00      0.96        93
           1       0.00      0.00      0.00         8

    accuracy                           0.92       101
   macro avg       0.46      0.50      0.48       101
weighted avg       0.85      0.92      0.88       101

[[93  0]
 [ 8  0]]




[0.52027027 0.56081081 0.50675676 0.52380952 0.5170068 ]




In [None]:
clf3 = RandomForestClassifier().fit(X_train_rsm, y_train_rsm)
y_pred = clf3.predict(X_test)
print('Accuracy of Random Forest classifier on training set: {:.2f}'.format(clf3.score(X_train_rsm, y_train_rsm)))
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(clf3.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(clf3, X_train_rsm, y_train_rsm, cv=5))



Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 0.92
              precision    recall  f1-score   support

           0       0.92      1.00      0.96        93
           1       0.00      0.00      0.00         8

    accuracy                           0.92       101
   macro avg       0.46      0.50      0.48       101
weighted avg       0.85      0.92      0.88       101

[[93  0]
 [ 8  0]]




[0.78378378 0.85135135 0.93918919 0.97278912 0.8707483 ]


After applying oversampling techniques such as SMOTE, Random Oversampling, and ADASYN with Logistic Regression, SVM, and Random Forest, we observe that ADASYN combined with Logistic Regression performs best across the training and test sets, with the least overfitting and the highest accuracy.