# Software Aging-Related Bugs Prediction using Python

## Pre-Sampling Approach

In this paper, we build a model that predicts Aging-Related Bugs (ARB) using Python. ARB datasets are usually imbalanced: of the two classes (defective and non-defective), the non-defective class dominates. First, we train the models and check the metrics without any over-sampling.

### Importing relevant libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
import copy

### Importing Data

In [None]:
ds1 = pd.read_csv('dataset_linux_driver_net.csv')
ds1.head()

Unnamed: 0,id,Filename,AltAvgLineBlank,AltAvgLineCode,AltAvgLineComment,AltCountLineBlank,AltCountLineCode,AltCountLineComment,AvgCyclomatic,AvgCyclomaticModified,...,Vol,Dif,Eff,AllocOps,DeallocOps,DerefUse,UniqueDerefUse,DerefSet,UniqueDerefSet,AgingRelatedBugs
0,1,drivers/net/3c501.c,6.0,27.82,9.42,129.89,445.95,377.26,5.0,5.0,...,11479.39,186.29,289423.61,0.07,0.07,73.95,19.07,25.75,9.6,0
1,2,drivers/net/3c501.h,0.0,0.0,0.0,17.0,59.24,35.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,3,drivers/net/3c503.c,4.79,35.61,7.58,90.05,529.74,152.68,6.58,6.58,...,14708.61,279.63,594338.34,0.0,0.07,102.85,18.32,16.27,6.05,0
3,4,drivers/net/3c503.h,0.0,0.0,0.0,16.0,44.0,68.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,5,drivers/net/3c505.c,4.0,28.66,6.0,225.5,1038.03,441.71,6.74,6.0,...,37936.53,638.55,1667451.55,0.07,1.07,337.5,53.25,117.7,23.95,0


### Checking for Missing Values

In [None]:
ds1.isnull().sum()

id                                    0
Filename                              0
AltAvgLineBlank                       0
AltAvgLineCode                        0
AltAvgLineComment                     0
AltCountLineBlank                     0
AltCountLineCode                      0
AltCountLineComment                   0
AvgCyclomatic                         0
AvgCyclomaticModified                 0
AvgCyclomaticStrict                   0
AvgEssential                          0
AvgLine                               0
AvgLineBlank                          0
AvgLineCode                           0
AvgLineComment                        0
CountClassBase                        0
CountClassCoupled                     0
CountClassDerived                     0
CountDeclClass                        0
CountDeclClassMethod                  0
CountDeclClassVariable                0
CountDeclFunction                     0
CountDeclInstanceMethod               0
CountDeclInstanceVariable             0


### Checking Data types

In [None]:
ds1['AgingRelatedBugs'] = ds1['AgingRelatedBugs'].astype(str)    
ds1.dtypes

id                                      int64
Filename                               object
AltAvgLineBlank                       float64
AltAvgLineCode                        float64
AltAvgLineComment                     float64
AltCountLineBlank                     float64
AltCountLineCode                      float64
AltCountLineComment                   float64
AvgCyclomatic                         float64
AvgCyclomaticModified                 float64
AvgCyclomaticStrict                   float64
AvgEssential                          float64
AvgLine                               float64
AvgLineBlank                          float64
AvgLineCode                           float64
AvgLineComment                        float64
CountClassBase                          int64
CountClassCoupled                       int64
CountClassDerived                       int64
CountDeclClass                          int64
CountDeclClassMethod                    int64
CountDeclClassVariable            

In [None]:
ds1.columns    

Index(['id', 'Filename', 'AltAvgLineBlank', 'AltAvgLineCode',
       'AltAvgLineComment', 'AltCountLineBlank', 'AltCountLineCode',
       'AltCountLineComment', 'AvgCyclomatic', 'AvgCyclomaticModified',
       'AvgCyclomaticStrict', 'AvgEssential', 'AvgLine', 'AvgLineBlank',
       'AvgLineCode', 'AvgLineComment', 'CountClassBase', 'CountClassCoupled',
       'CountClassDerived', 'CountDeclClass', 'CountDeclClassMethod',
       'CountDeclClassVariable', 'CountDeclFunction',
       'CountDeclInstanceMethod', 'CountDeclInstanceVariable',
       'CountDeclInstanceVariablePrivate',
       'CountDeclInstanceVariableProtected', 'CountDeclInstanceVariablePublic',
       'CountDeclMethod', 'CountDeclMethodAll', 'CountDeclMethodConst',
       'CountDeclMethodFriend', 'CountDeclMethodPrivate',
       'CountDeclMethodProtected', 'CountDeclMethodPublic', 'CountInput',
       'CountLine', 'CountLineBlank', 'CountLineCode', 'CountLineCodeDecl',
       'CountLineCodeExe', 'CountLineComment', 'CountLi

### Creating feature and target variables and Label Encoding the target variable

In [None]:
cols = ['id', 'AltAvgLineBlank', 'AltAvgLineCode',
       'AltAvgLineComment', 'AltCountLineBlank', 'AltCountLineCode',
       'AltCountLineComment', 'AvgCyclomatic', 'AvgCyclomaticModified',
       'AvgCyclomaticStrict', 'AvgEssential', 'AvgLine', 'AvgLineBlank',
       'AvgLineCode', 'AvgLineComment', 'CountClassBase', 'CountClassCoupled',
       'CountClassDerived', 'CountDeclClass', 'CountDeclClassMethod',
       'CountDeclClassVariable', 'CountDeclFunction',
       'CountDeclInstanceMethod', 'CountDeclInstanceVariable',
       'CountDeclInstanceVariablePrivate',
       'CountDeclInstanceVariableProtected', 'CountDeclInstanceVariablePublic',
       'CountDeclMethod', 'CountDeclMethodAll', 'CountDeclMethodConst',
       'CountDeclMethodFriend', 'CountDeclMethodPrivate',
       'CountDeclMethodProtected', 'CountDeclMethodPublic', 'CountInput',
       'CountLine', 'CountLineBlank', 'CountLineCode', 'CountLineCodeDecl',
       'CountLineCodeExe', 'CountLineComment', 'CountLineInactive',
       'CountLinePreprocessor', 'CountOutput', 'CountPath', 'CountSemicolon',
       'CountStmt', 'CountStmtDecl', 'CountStmtEmpty', 'CountStmtExe',
       'Cyclomatic', 'CyclomaticModified', 'CyclomaticStrict', 'Essential',
       'Knots', 'MaxCyclomatic', 'MaxCyclomaticModified',
       'MaxCyclomaticStrict', 'MaxEssentialKnots', 'MaxInheritanceTree',
       'MaxNesting', 'MinEssentialKnots', 'PercentLackOfCohesion',
       'RatioCommentToCode', 'SumCyclomatic', 'SumCyclomaticModified',
       'SumCyclomaticStrict', 'SumEssential', 'n1', 'n2', 'N1', 'N2', 'Len',
       'Voc', 'Vol', 'Dif', 'Eff', 'AllocOps', 'DeallocOps', 'DerefUse',
       'UniqueDerefUse', 'DerefSet', 'UniqueDerefSet']
X = ds1[cols]
y = ds1['AgingRelatedBugs']


In [None]:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y = enc.fit_transform(y)
y

array([0, 0, 0, ..., 0, 0, 0])

### Splitting data into training and test sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)

### Logistic Regression Model

In [None]:
# Fitting the model

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
# Computing Accuracy
y_pred1 = logreg.predict(X_test)
print('Accuracy of Logistic Regression train set: {:.2f}'.format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic Regression test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of Logistic Regression train set: 1.00
Accuracy of Logistic Regression test set: 0.99


In [None]:
# Computing Performance Metrics and drawing Confusion Matrix
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred1))
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred1))

              precision    recall  f1-score   support

           0       0.99      1.00      1.00       569
           1       0.00      0.00      0.00         4

    accuracy                           0.99       573
   macro avg       0.50      0.50      0.50       573
weighted avg       0.99      0.99      0.99       573

[[568   1]
 [  4   0]]
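The report above can be reproduced by hand from the confusion matrix, which shows why 0.99 accuracy coexists with zero recall on the defective class. The following is a minimal sketch, assuming scikit-learn's layout (rows are true labels, columns are predicted labels); the helper name `binary_metrics` is illustrative and not part of the notebook.

```python
def binary_metrics(cm):
    """Derive metrics for the positive class (label 1) from a 2x2 confusion matrix.

    Assumes rows are true labels and columns are predicted labels.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against division by zero when nothing is predicted or labelled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Confusion matrix reported above for logistic regression before oversampling.
acc, prec, rec = binary_metrics([[568, 1], [4, 0]])
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

Accuracy is driven almost entirely by the 568 correctly classified non-defective files, so on data this skewed it says little about how well defective files are found.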


In [None]:
# Applying 5-fold Cross Validation
from sklearn.model_selection import cross_val_score
print(cross_val_score(logreg, X_train, y_train, cv=5))



[0.99709302 0.99418605 0.99418605 0.99709302 0.97959184]




In [None]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test,y_pred1)

0.4991212653778559
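Because `roc_auc_score` receives hard 0/1 predictions here rather than scores, the ROC curve has a single operating point and the area under it reduces to the mean of the true-positive rate and the true-negative rate. The sketch below checks this against the logistic regression confusion matrix reported above; the function name `auc_from_hard_labels` is an assumption made for illustration.

```python
def auc_from_hard_labels(cm):
    """AUC of the single-point ROC curve produced by 0/1 predictions."""
    (tn, fp), (fn, tp) = cm
    tpr = tp / (tp + fn)  # recall on the defective class
    tnr = tn / (tn + fp)  # recall on the non-defective class
    return (tpr + tnr) / 2


auc = auc_from_hard_labels([[568, 1], [4, 0]])
print(round(auc, 4))  # 0.4991
```

To evaluate the full ranking instead of one threshold, a continuous score such as `logreg.predict_proba(X_test)[:, 1]` could be passed as the second argument.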

### Support Vector Classifier (RBF Kernel)

In [None]:
from sklearn.svm import SVC
svc = SVC(kernel = 'rbf')
svc.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [None]:
y_pred2 = svc.predict(X_test)
print('Accuracy of SVM train set : {:.2f}'.format(svc.score(X_train, y_train)))
print('Accuracy of SVM test set : {:.2f}'.format(svc.score(X_test, y_test)))
print(classification_report(y_test, y_pred2))
print(confusion_matrix(y_test, y_pred2))
print(cross_val_score(svc, X_train, y_train, cv=5))

Accuracy of SVM train set : 1.00
Accuracy of SVM test set : 0.99
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       569
           1       0.00      0.00      0.00         4

    accuracy                           0.99       573
   macro avg       0.50      0.50      0.50       573
weighted avg       0.99      0.99      0.99       573

[[569   0]
 [  4   0]]




[0.99709302 0.99709302 0.99709302 0.99709302 0.99708455]


In [None]:
roc_auc_score(y_test,y_pred2)

0.5

### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier().fit(X_train, y_train)
y_pred3 = clf.predict(X_test)
print('Accuracy of Random Forest classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Random Forest classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
print(classification_report(y_test, y_pred3))
print(confusion_matrix(y_test, y_pred3))
print(cross_val_score(clf, X_train, y_train, cv=5))



Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 0.99
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       569
           1       0.00      0.00      0.00         4

    accuracy                           0.99       573
   macro avg       0.50      0.50      0.50       573
weighted avg       0.99      0.99      0.99       573

[[569   0]
 [  4   0]]
[0.99709302 0.99709302 0.99709302 0.99709302 0.99708455]


In [None]:
roc_auc_score(y_test,y_pred3)

0.5

## Post-Sampling Approach

As we saw in the Pre-Sampling Approach, because the dataset contains very few defective instances, all models were biased towards the majority class: they reached about 0.99 accuracy while failing to detect any defective instance in the test set. Hence, we need to apply oversampling techniques to balance the imbalanced dataset. Many general-purpose algorithms are available for this. We will use the SMOTE algorithm, which generates synthetic minority-class instances by interpolating between existing minority samples and their nearest neighbors, bringing the minority class on par with the majority class. After that, we will again compute the performance metrics for all the models discussed above and compare the results.
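The core SMOTE step can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbors, and place a new point at a random position on the segment between them. This is a simplified plain-Python illustration, not the `imblearn` implementation; the function `smote_sketch`, the parameter `k` and the toy points are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        lam = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy two-feature minority class (illustrative values only).
minority_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.0)]
new_points = smote_sketch(minority_points, n_new=5)
print(new_points)
```

Every synthetic point lies between two real minority samples, so the new rows stay inside the region the minority class already occupies.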

## SMOTE

### Balancing the Classes Using the SMOTE Algorithm

In [None]:
from imblearn.over_sampling import SMOTE

print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)

Number transactions X_train dataset:  (1719, 83)
Number transactions y_train dataset:  (1719,)
Number transactions X_test dataset:  (573, 83)
Number transactions y_test dataset:  (573,)




In [None]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

Before OverSampling, counts of label '1': 5
Before OverSampling, counts of label '0': 1714 



In [None]:
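smt = SMOTE()
# Note: SMOTE is fitted on the full dataset (X, y), not only on the training split,
# so the resampled data also contains the rows that make up the test set.
X_train_res, y_train_res = smt.fit_resample(X, y)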

print('After OverSampling, the shape of X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of y: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))


After OverSampling, the shape of X: (4566, 83)
After OverSampling, the shape of y: (4566,) 

After OverSampling, counts of label '1': 2283
After OverSampling, counts of label '0': 2283


### Logistic Regression with SMOTE

In [None]:
logreg1 = LogisticRegression()
logreg1.fit(X_train_res, y_train_res)
y_pred4 = logreg1.predict(X_test)
print('Accuracy of Logistic Regression train set: {:.2f}'.format(logreg1.score(X_train_res, y_train_res)))
print('Accuracy of Logistic Regression test set: {:.2f}'.format(logreg1.score(X_test, y_test)))
print(classification_report(y_test, y_pred4))
print(confusion_matrix(y_test, y_pred4))
print(cross_val_score(logreg1, X_train_res, y_train_res, cv=5))

Accuracy of Logistic Regression train set: 0.87
Accuracy of Logistic Regression test set: 0.86
              precision    recall  f1-score   support

           0       1.00      0.86      0.92       569
           1       0.05      1.00      0.09         4

    accuracy                           0.86       573
   macro avg       0.52      0.93      0.51       573
weighted avg       0.99      0.86      0.92       573

[[487  82]
 [  0   4]]




[0.72210066 0.8512035  0.89715536 0.96491228 0.93530702]


In [None]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test,y_pred4)

0.9279437609841829

In [None]:
from sklearn.metrics import precision_score
precision_score(y_test,y_pred4)

0.046511627906976744

In [None]:
from sklearn.metrics import recall_score
recall_score(y_test,y_pred4)

1.0

### SVM with SMOTE

In [None]:
svc = SVC()
svc.fit(X_train_res, y_train_res)
y_pred = svc.predict(X_test)
print('Accuracy of SVM train set : {}'.format(svc.score(X_train_res, y_train_res)))
print('Accuracy of SVM test set : {}'.format(svc.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(svc, X_train_res, y_train_res, cv=5))



Accuracy of SVM train set : 1.0
Accuracy of SVM test set : 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       569
           1       1.00      1.00      1.00         4

    accuracy                           1.00       573
   macro avg       1.00      1.00      1.00       573
weighted avg       1.00      1.00      1.00       573

[[569   0]
 [  0   4]]




[0.5        0.5        0.50109409 0.50219298 0.50109649]


In [None]:
roc_auc_score(y_test,y_pred)

1.0

### Random Forest Classifier with SMOTE

In [None]:
clf = RandomForestClassifier().fit(X_train_res, y_train_res)
y_pred = clf.predict(X_test)
print('Accuracy of Random Forest classifier on training set: {}'.format(clf.score(X_train_res, y_train_res)))
print('Accuracy of Random Forest classifier on test set: {}'.format(clf.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(clf, X_train_res, y_train_res, cv=5))



Accuracy of Random Forest classifier on training set: 0.9997809899255365
Accuracy of Random Forest classifier on test set: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       569
           1       1.00      1.00      1.00         4

    accuracy                           1.00       573
   macro avg       1.00      1.00      1.00       573
weighted avg       1.00      1.00      1.00       573

[[569   0]
 [  0   4]]
[0.99343545 0.99452954 0.99562363 0.99122807 1.        ]


In [None]:
roc_auc_score(y_test,y_pred)

1.0

## SMOTE + Standardization

In [None]:
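from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Note: X_train_res and X_test are overwritten with their standardized versions,
# so every later cell that uses X_test receives the scaled features.
X_train_res = sc.fit_transform(X_train_res)
X_test = sc.transform(X_test)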
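`StandardScaler` learns the mean and standard deviation from the data passed to `fit_transform` and reuses them in `transform`, so the test features are scaled with the statistics of the fitted data. The following plain-Python sketch shows that arithmetic for a single feature, using the population standard deviation; the values are illustrative and not taken from the dataset.

```python
import statistics

train_feature = [2.0, 4.0, 6.0, 8.0]
test_feature = [5.0, 10.0]

# Statistics come from the training values only.
mean = statistics.fmean(train_feature)
std = statistics.pstdev(train_feature)

train_scaled = [(v - mean) / std for v in train_feature]
test_scaled = [(v - mean) / std for v in test_feature]  # reuses training mean/std

print(train_scaled)
print(test_scaled)
```

After scaling, the training feature has zero mean and unit variance, while the test feature generally does not, because its values are mapped with the training statistics.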

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train_res, y_train_res)
y_pred = logreg.predict(X_test)
print('Accuracy of Logistic Regression train set: {:.2f}'.format(logreg.score(X_train_res, y_train_res)))
print('Accuracy of Logistic Regression test set: {}'.format(logreg.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(logreg, X_train_res, y_train_res, cv=5))
roc_auc_score(y_test,y_pred)



Accuracy of Logistic Regression train set: 0.98
Accuracy of Logistic Regression test set: 0.956369982547993
              precision    recall  f1-score   support

           0       1.00      0.96      0.98       569
           1       0.14      1.00      0.24         4

    accuracy                           0.96       573
   macro avg       0.57      0.98      0.61       573
weighted avg       0.99      0.96      0.97       573

[[544  25]
 [  0   4]]




[0.91575492 0.9726477  0.95842451 0.98245614 0.97149123]


0.9780316344463972

In [None]:
svc = SVC()
svc.fit(X_train_res, y_train_res)
y_pred = svc.predict(X_test)
print('Accuracy of SVM train set : {:.2f}'.format(svc.score(X_train_res, y_train_res)))
print('Accuracy of SVM test set : {:.2f}'.format(svc.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(svc, X_train_res, y_train_res, cv=5))
roc_auc_score(y_test,y_pred)



Accuracy of SVM train set : 0.99
Accuracy of SVM test set : 0.98
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       569
           1       0.25      1.00      0.40         4

    accuracy                           0.98       573
   macro avg       0.62      0.99      0.69       573
weighted avg       0.99      0.98      0.99       573

[[557  12]
 [  0   4]]




[0.95514223 0.98468271 0.96280088 0.97368421 0.9879386 ]


0.9894551845342707

In [None]:
clf = RandomForestClassifier().fit(X_train_res, y_train_res)
y_pred = clf.predict(X_test)
print('Accuracy of Random Forest classifier on training set: {:.2f}'.format(clf.score(X_train_res, y_train_res)))
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(clf, X_train_res, y_train_res, cv=5))
roc_auc_score(y_test,y_pred)



Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 1.00
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       569
           1       0.80      1.00      0.89         4

    accuracy                           1.00       573
   macro avg       0.90      1.00      0.94       573
weighted avg       1.00      1.00      1.00       573

[[568   1]
 [  0   4]]
[0.99124726 0.99343545 0.99562363 0.99122807 1.        ]


0.999121265377856

## Random Oversampling

In [None]:
from imblearn.over_sampling import RandomOverSampler

print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)

Number transactions X_train dataset:  (1719, 83)
Number transactions y_train dataset:  (1719,)
Number transactions X_test dataset:  (573, 83)
Number transactions y_test dataset:  (573,)


In [None]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

Before OverSampling, counts of label '1': 5
Before OverSampling, counts of label '0': 1714 



In [None]:
ros = RandomOverSampler()
# Note: the sampler is fitted on the full dataset (X, y), not only on the training split.
X_train_resampled, y_train_resampled = ros.fit_resample(X, y)

print('After OverSampling, the shape of train_X: {}'.format(X_train_resampled.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_resampled.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_resampled==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_resampled==0)))

After OverSampling, the shape of train_X: (4566, 83)
After OverSampling, the shape of train_y: (4566,) 

After OverSampling, counts of label '1': 2283
After OverSampling, counts of label '0': 2283
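Random oversampling balances the classes by copying existing minority rows, sampled with replacement, until each class is as large as the majority class. The following is a plain-Python sketch of that idea, not the `imblearn` implementation; the function `random_oversample` and the toy data are assumptions.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Sample with replacement, so the same row can be copied several times.
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data: five non-defective rows and one defective row.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.9]]
labels = [0, 0, 0, 0, 0, 1]
res_rows, res_labels = random_oversample(rows, labels)
print(Counter(res_labels))
```

Unlike SMOTE, no new feature values are created: every added row is an exact copy of a row that already exists.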


In [None]:
logreg2 = LogisticRegression()
logreg2.fit(X_train_resampled, y_train_resampled)
y_pred = logreg2.predict(X_test)
print('Accuracy of Logistic Regression train set: {:.2f}'.format(logreg2.score(X_train_resampled, y_train_resampled)))
print('Accuracy of Logistic Regression test set: {:.2f}'.format(logreg2.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(logreg2, X_train_resampled, y_train_resampled, cv=5))

Accuracy of Logistic Regression train set: 0.81




Accuracy of Logistic Regression test set: 0.73
              precision    recall  f1-score   support

           0       1.00      0.72      0.84       569
           1       0.02      1.00      0.05         4

    accuracy                           0.73       573
   macro avg       0.51      0.86      0.44       573
weighted avg       0.99      0.73      0.83       573

[[412 157]
 [  0   4]]




[0.63676149 0.83150985 0.8238512  0.93969298 0.94298246]


In [None]:
svc2 = SVC()
svc2.fit(X_train_resampled, y_train_resampled)
y_pred = svc2.predict(X_test)
print('Accuracy of SVM train set : {:.2f}'.format(svc2.score(X_train_resampled, y_train_resampled)))
print('Accuracy of SVM test set : {:.2f}'.format(svc2.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(svc2, X_train_resampled, y_train_resampled, cv=5))



Accuracy of SVM train set : 1.00
Accuracy of SVM test set : 0.99
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       569
           1       0.00      0.00      0.00         4

    accuracy                           0.99       573
   macro avg       0.50      0.50      0.50       573
weighted avg       0.99      0.99      0.99       573

[[569   0]
 [  4   0]]




[1. 1. 1. 1. 1.]


In [None]:
clf2 = RandomForestClassifier().fit(X_train_resampled, y_train_resampled)
y_pred = clf2.predict(X_test)
print('Accuracy of Random Forest classifier on training set: {:.2f}'.format(clf2.score(X_train_resampled, y_train_resampled)))
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(clf2.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(clf2, X_train_resampled, y_train_resampled, cv=5))



Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 0.99
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       569
           1       0.00      0.00      0.00         4

    accuracy                           0.99       573
   macro avg       0.50      0.50      0.50       573
weighted avg       0.99      0.99      0.99       573

[[569   0]
 [  4   0]]




[0.99890591 1.         0.99671772 0.99671053 1.        ]


## ADASYN

In [None]:
from imblearn.over_sampling import ADASYN

print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)

print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

ads = ADASYN()

# Note: the sampler is fitted on the full dataset (X, y), not only on the training split.
X_train_rsm, y_train_rsm = ads.fit_resample(X, y)

print('After OverSampling, the shape of train_X: {}'.format(X_train_rsm.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_rsm.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_rsm==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_rsm==0)))

Number transactions X_train dataset:  (1719, 83)
Number transactions y_train dataset:  (1719,)
Number transactions X_test dataset:  (573, 83)
Number transactions y_test dataset:  (573,)
Before OverSampling, counts of label '1': 5
Before OverSampling, counts of label '0': 1714 

After OverSampling, the shape of train_X: (4564, 83)
After OverSampling, the shape of train_y: (4564,) 

After OverSampling, counts of label '1': 2281
After OverSampling, counts of label '0': 2283


In [None]:
logreg3 = LogisticRegression()
logreg3.fit(X_train_rsm, y_train_rsm)
y_pred = logreg3.predict(X_test)
print('Accuracy of Logistic Regression train set: {:.2f}'.format(logreg3.score(X_train_rsm, y_train_rsm)))
print('Accuracy of Logistic Regression test set: {:.2f}'.format(logreg3.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(logreg3, X_train_rsm, y_train_rsm, cv=5))



Accuracy of Logistic Regression train set: 0.87
Accuracy of Logistic Regression test set: 0.58
              precision    recall  f1-score   support

           0       1.00      0.58      0.73       569
           1       0.01      0.75      0.02         4

    accuracy                           0.58       573
   macro avg       0.50      0.66      0.38       573
weighted avg       0.99      0.58      0.73       573

[[329 240]
 [  1   3]]




[0.72100656 0.86966046 0.76779847 0.92653509 0.86403509]


In [None]:
svc3 = SVC()
svc3.fit(X_train_rsm, y_train_rsm)
y_pred = svc3.predict(X_test)
print('Accuracy of SVM train set : {:.2f}'.format(svc3.score(X_train_rsm, y_train_rsm)))
print('Accuracy of SVM test set : {:.2f}'.format(svc3.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(svc3, X_train_rsm, y_train_rsm, cv=5))



Accuracy of SVM train set : 1.00
Accuracy of SVM test set : 0.01
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       569
           1       0.01      1.00      0.01         4

    accuracy                           0.01       573
   macro avg       0.00      0.50      0.01       573
weighted avg       0.00      0.01      0.00       573

[[  0 569]
 [  0   4]]




[0.5        0.49945235 0.50054765 0.50219298 0.5       ]


In [None]:
clf3 = RandomForestClassifier().fit(X_train_rsm, y_train_rsm)
y_pred = clf3.predict(X_test)
print('Accuracy of Random Forest classifier on training set: {:.2f}'.format(clf3.score(X_train_rsm, y_train_rsm)))
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(clf3.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(clf3, X_train_rsm, y_train_rsm, cv=5))



Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 0.99
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       569
           1       0.00      0.00      0.00         4

    accuracy                           0.99       573
   macro avg       0.50      0.50      0.50       573
weighted avg       0.99      0.99      0.99       573

[[569   0]
 [  4   0]]




[0.99124726 0.99452355 0.91456736 0.99342105 0.83552632]


After applying oversampling techniques such as SMOTE, Random Oversampling and ADASYN with Logistic Regression, SVM and Random Forest, we observe that ADASYN along with Logistic Regression performs the best throughout the training and test sets, with the least overfitting and maximum accuracy.