# Software Age-Related Bugs Prediction using Python

## Pre-Sampling Approach

In this paper, we build a model that predicts Age-Related Bugs (ARB) using Python. In ARB datasets, the classes are usually imbalanced: the two classes (defective and non-defective) are heavily skewed towards the non-defective class. First, we run the prediction and check the metrics without any over-sampling.
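To make "imbalanced" concrete, the following minimal sketch measures the class shares and the majority-to-minority ratio of a label vector. The helper names and the toy labels are hypothetical and only illustrate the idea; they are not taken from the dataset used below.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: (count, share)} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)

# Hypothetical toy labels: 0 = non-defective, 1 = defective
toy_labels = [0] * 95 + [1] * 5
dist = class_distribution(toy_labels)
ratio = imbalance_ratio(toy_labels)
print(dist)
print(ratio)
```

A ratio far above 1 is a signal that plain accuracy will be dominated by the majority class.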

### Importing relevant libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
from sklearn.metrics import roc_auc_score 
import copy

### Importing Data

In [None]:
ds1 = pd.read_csv('dataset_linux_driver_scsi.csv')
ds1.head()

Unnamed: 0,id,Filename,AltAvgLineBlank,AltAvgLineCode,AltAvgLineComment,AltCountLineBlank,AltCountLineCode,AltCountLineComment,AvgCyclomatic,AvgCyclomaticModified,...,Vol,Dif,Eff,AllocOps,DeallocOps,DerefUse,UniqueDerefUse,DerefSet,UniqueDerefSet,AgingRelatedBugs
0,1,drivers/scsi/3w-9xxx.c,5.47,33.83,3.0,314.73,1676.7,315.0,6.93,6.93,...,58314.1,1034.3,2377344.4,0.47,1.0,584.88,161.56,262.34,87.81,0
1,2,drivers/scsi/3w-9xxx.h,0.0,0.0,0.0,39.47,566.4,103.23,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,3,drivers/scsi/3w-sas.c,6.0,29.0,3.0,314.14,1373.43,291.0,6.0,6.0,...,45835.43,826.14,1346717.0,0.0,1.0,434.57,146.57,193.0,86.0,0
3,4,drivers/scsi/3w-sas.h,0.0,0.0,0.0,33.0,291.0,92.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,5,drivers/scsi/3w-xxxx.c,6.0,35.24,3.0,384.71,1951.37,482.55,7.26,6.87,...,69306.61,1060.37,2670582.32,0.25,2.25,647.88,189.68,348.32,104.47,0


### Checking for Missing Values

In [None]:
ds1.isnull().sum()

id                                    0
Filename                              0
AltAvgLineBlank                       0
AltAvgLineCode                        0
AltAvgLineComment                     0
AltCountLineBlank                     0
AltCountLineCode                      0
AltCountLineComment                   0
AvgCyclomatic                         0
AvgCyclomaticModified                 0
AvgCyclomaticStrict                   0
AvgEssential                          0
AvgLine                               0
AvgLineBlank                          0
AvgLineCode                           0
AvgLineComment                        0
CountClassBase                        0
CountClassCoupled                     0
CountClassDerived                     0
CountDeclClass                        0
CountDeclClassMethod                  0
CountDeclClassVariable                0
CountDeclFunction                     0
CountDeclInstanceMethod               0
CountDeclInstanceVariable             0


### Checking Data types

In [None]:
ds1['AgingRelatedBugs'] = ds1['AgingRelatedBugs'].astype(str)

In [None]:
ds1.dtypes

id                                      int64
Filename                               object
AltAvgLineBlank                       float64
AltAvgLineCode                        float64
AltAvgLineComment                     float64
AltCountLineBlank                     float64
AltCountLineCode                      float64
AltCountLineComment                   float64
AvgCyclomatic                         float64
AvgCyclomaticModified                 float64
AvgCyclomaticStrict                   float64
AvgEssential                          float64
AvgLine                               float64
AvgLineBlank                          float64
AvgLineCode                           float64
AvgLineComment                        float64
CountClassBase                          int64
CountClassCoupled                       int64
CountClassDerived                       int64
CountDeclClass                          int64
CountDeclClassMethod                    int64
CountDeclClassVariable            

In [None]:
ds1.columns

Index(['id', 'Filename', 'AltAvgLineBlank', 'AltAvgLineCode',
       'AltAvgLineComment', 'AltCountLineBlank', 'AltCountLineCode',
       'AltCountLineComment', 'AvgCyclomatic', 'AvgCyclomaticModified',
       'AvgCyclomaticStrict', 'AvgEssential', 'AvgLine', 'AvgLineBlank',
       'AvgLineCode', 'AvgLineComment', 'CountClassBase', 'CountClassCoupled',
       'CountClassDerived', 'CountDeclClass', 'CountDeclClassMethod',
       'CountDeclClassVariable', 'CountDeclFunction',
       'CountDeclInstanceMethod', 'CountDeclInstanceVariable',
       'CountDeclInstanceVariablePrivate',
       'CountDeclInstanceVariableProtected', 'CountDeclInstanceVariablePublic',
       'CountDeclMethod', 'CountDeclMethodAll', 'CountDeclMethodConst',
       'CountDeclMethodFriend', 'CountDeclMethodPrivate',
       'CountDeclMethodProtected', 'CountDeclMethodPublic', 'CountInput',
       'CountLine', 'CountLineBlank', 'CountLineCode', 'CountLineCodeDecl',
       'CountLineCodeExe', 'CountLineComment', 'CountLi

### Creating feature and target variables and Label Encoding the target variable

In [None]:
cols = ['id','AltAvgLineBlank', 'AltAvgLineCode', 'AltAvgLineComment',
       'AltCountLineBlank', 'AltCountLineCode', 'AltCountLineComment',
       'AvgCyclomatic', 'AvgCyclomaticModified', 'AvgCyclomaticStrict',
       'AvgEssential', 'AvgLine', 'AvgLineBlank', 'AvgLineCode',
       'AvgLineComment', 'CountClassBase', 'CountClassCoupled',
       'CountClassDerived', 'CountDeclClass', 'CountDeclClassMethod',
       'CountDeclClassVariable', 'CountDeclFunction',
       'CountDeclInstanceMethod', 'CountDeclInstanceVariable',
       'CountDeclInstanceVariablePrivate',
       'CountDeclInstanceVariableProtected', 'CountDeclInstanceVariablePublic',
       'CountDeclMethod', 'CountDeclMethodAll', 'CountDeclMethodConst',
       'CountDeclMethodFriend', 'CountDeclMethodPrivate',
       'CountDeclMethodProtected', 'CountDeclMethodPublic', 'CountInput',
       'CountLine', 'CountLineBlank', 'CountLineCode', 'CountLineCodeDecl',
       'CountLineCodeExe', 'CountLineComment', 'CountLineInactive',
       'CountLinePreprocessor', 'CountOutput', 'CountPath', 'CountSemicolon',
       'CountStmt', 'CountStmtDecl', 'CountStmtEmpty', 'CountStmtExe',
       'Cyclomatic', 'CyclomaticModified', 'CyclomaticStrict', 'Essential',
       'Knots', 'MaxCyclomatic', 'MaxCyclomaticModified',
       'MaxCyclomaticStrict', 'MaxEssentialKnots', 'MaxInheritanceTree',
       'MaxNesting', 'MinEssentialKnots', 'PercentLackOfCohesion',
       'RatioCommentToCode', 'SumCyclomatic', 'SumCyclomaticModified',
       'SumCyclomaticStrict', 'SumEssential', 'n1', 'n2', 'N1', 'N2', 'Len',
       'Voc', 'Vol', 'Dif', 'Eff', 'AllocOps', 'DeallocOps', 'DerefUse',
       'UniqueDerefUse', 'DerefSet', 'UniqueDerefSet']
X = ds1[cols]
y = ds1['AgingRelatedBugs']


In [None]:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y = enc.fit_transform(y)
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

### Splitting data into training and test sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)

### Logistic Regression Model

In [None]:
# Fitting the model

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
# Computing Accuracy
y_pred = logreg.predict(X_test)
print('Accuracy of Logistic Regression train set: {:.2f}'.format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic Regression test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of Logistic Regression train set: 1.00
Accuracy of Logistic Regression test set: 1.00


In [None]:
# Computing Performance Metrics and drawing Confusion Matrix
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       240
           1       0.00      0.00      0.00         1

    accuracy                           1.00       241
   macro avg       0.50      0.50      0.50       241
weighted avg       0.99      1.00      0.99       241

[[240   0]
 [  1   0]]
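
The report and the confusion matrix above can be tied together by hand. The sketch below recomputes accuracy, precision, recall and F1 for the defective class from a 2x2 confusion matrix in the `[[TN, FP], [FN, TP]]` layout. The function name is hypothetical, and treating an undefined precision (no positive predictions) as 0.0 is an assumption made here to mirror the value shown in the report.

```python
def metrics_from_confusion(cm):
    """cm is [[TN, FP], [FN, TP]] (rows: true class, columns: predicted class)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Undefined ratios (zero denominator) are reported as 0.0 in this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Confusion matrix printed above: the single defective file is missed.
baseline = metrics_from_confusion([[240, 0], [1, 0]])
print(baseline)
```

Accuracy is 240/241, which rounds to 1.00, while recall for the defective class is 0: the model never predicts a defect.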




In [None]:
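# Applying 5-fold Cross Validation
# Note: this call cross-validates on the test set, which holds a single defective sample,
# so at least one training fold contains only class 0 and the solver raises the error below.
from sklearn.model_selection import cross_val_score
print(cross_val_score(logreg, X_test, y_test, cv=5))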



ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0
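
The error above follows from simple counting. The sketch below uses a hypothetical label vector with 240 negatives and 1 positive and plain contiguous folds (a simplification, not scikit-learn's exact fold assignment) to show that whichever fold holds the positive sample leaves a training part with only one class.

```python
def training_classes_per_fold(labels, k):
    """For k contiguous folds, return the set of classes left in each training part."""
    n = len(labels)
    # Fold sizes differ by at most one, as in a plain k-fold split.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    result, start = [], 0
    for size in sizes:
        stop = start + size
        train = labels[:start] + labels[stop:]  # everything outside the held-out fold
        result.append(set(train))
        start = stop
    return result

# Hypothetical labels mirroring the test set: 240 negatives, 1 positive.
labels = [0] * 120 + [1] + [0] * 120
folds = training_classes_per_fold(labels, 5)
for i, classes in enumerate(folds):
    print("fold", i, "training classes:", sorted(classes))
```

With a single positive sample, one of the five training parts always lacks class 1, so a classifier cannot be fitted on it.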

In [None]:
roc_auc_score(y_test,y_pred)

0.5

### Support Vector Classifier (RBF Kernel)

In [None]:
from sklearn.svm import SVC
svc = SVC(kernel = 'rbf')
svc.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [None]:
y_pred = svc.predict(X_test)
print('Accuracy of SVM train set : {:.2f}'.format(svc.score(X_train, y_train)))
print('Accuracy of SVM test set : {:.2f}'.format(svc.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(svc, X_train, y_train, cv=5))

Accuracy of SVM train set : 1.00
Accuracy of SVM test set : 1.00
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       240
           1       0.00      0.00      0.00         1

    accuracy                           1.00       241
   macro avg       0.50      0.50      0.50       241
weighted avg       0.99      1.00      0.99       241

[[240   0]
 [  1   0]]




[0.99310345 0.99310345 0.99310345 1.         1.        ]




In [None]:
roc_auc_score(y_test,y_pred)

0.5

### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier().fit(X_train, y_train)
y_pred3 = clf.predict(X_test)
print('Accuracy of Random Forest classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Random Forest classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
print(classification_report(y_test, y_pred3))
print(confusion_matrix(y_test, y_pred3))
print(cross_val_score(clf, X_train, y_train, cv=5))



Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 1.00
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       240
           1       0.00      0.00      0.00         1

    accuracy                           1.00       241
   macro avg       0.50      0.50      0.50       241
weighted avg       0.99      1.00      0.99       241

[[240   0]
 [  1   0]]
[0.99310345 0.99310345 0.99310345 1.         1.        ]


In [None]:
roc_auc_score(y_test,y_pred3)

0.5

## Post-Sampling Approach

As we saw in the Pre-Sampling Approach, because the dataset contains so few defective instances, every model simply predicted the majority class: accuracy looks perfect, yet recall for the defective class is zero. Hence, we need to apply oversampling techniques to balance the imbalanced dataset. Several general-purpose algorithms are available for that. We will use the SMOTE algorithm, which generates synthetic minority-class instances by interpolating between existing minority samples and their nearest neighbours, bringing the minority class on par with the majority class. After that, we will again compute the performance metrics for all the models discussed above and compare the results.
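
The core step of SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the segment between them. The code below is a simplified pure-Python illustration with hypothetical helper names and toy data; it is not the imbalanced-learn implementation used later.

```python
import random

def nearest_neighbor(x, others):
    """Return the point in `others` closest to x (squared Euclidean distance)."""
    return min(others, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))

def smote_like_sample(x, neighbor, rng):
    """Create a synthetic point on the segment between x and a minority neighbor."""
    u = rng.random()  # uniform in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)
minority = [[1.0, 2.0], [2.0, 3.0], [8.0, 9.0]]  # hypothetical minority rows
base = minority[0]
neighbor = nearest_neighbor(base, [p for p in minority if p is not base])
synthetic = smote_like_sample(base, neighbor, rng)
print(neighbor, synthetic)
```

Because the new point is interpolated rather than copied, it differs from both of its parents while staying inside the region the minority class already occupies.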

## SMOTE

### Balancing the Data Using the SMOTE Algorithm

In [None]:
from imblearn.over_sampling import SMOTE

print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)

Number transactions X_train dataset:  (721, 83)
Number transactions y_train dataset:  (721,)
Number transactions X_test dataset:  (241, 83)
Number transactions y_test dataset:  (241,)




In [None]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

Before OverSampling, counts of label '1': 3
Before OverSampling, counts of label '0': 718 



In [None]:
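smt = SMOTE(k_neighbors = 1)
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide.
# Note: X and y are the full dataset here, so the resampled data also contains the test rows.
X_train_res, y_train_res = smt.fit_resample(X, y)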

print('After OverSampling, the shape of X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of y: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))


After OverSampling, the shape of X: (1916, 83)
After OverSampling, the shape of y: (1916,) 

After OverSampling, counts of label '1': 958
After OverSampling, counts of label '0': 958


### Logistic Regression with SMOTE

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train_res, y_train_res)
y_pred = logreg.predict(X_test)
print('Accuracy of Logistic Regression train set: {:.2f}'.format(logreg.score(X_train_res, y_train_res)))
print('Accuracy of Logistic Regression test set: {}'.format(logreg.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(logreg, X_train_res, y_train_res, cv=5))



Accuracy of Logistic Regression train set: 0.94
Accuracy of Logistic Regression test set: 0.9128630705394191
              precision    recall  f1-score   support

           0       1.00      0.91      0.95       240
           1       0.05      1.00      0.09         1

    accuracy                           0.91       241
   macro avg       0.52      0.96      0.52       241
weighted avg       1.00      0.91      0.95       241

[[219  21]
 [  0   1]]




[0.828125   0.96614583 0.90104167 0.96596859 0.97382199]




In [None]:
roc_auc_score(y_test,y_pred)

0.9562499999999999
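
When `roc_auc_score` receives hard 0/1 predictions instead of scores, the ROC curve has a single interior point, and the area reduces to the mean of the true-positive rate and the true-negative rate. The sketch below (hypothetical function name) reproduces the two values reported so far from their confusion matrices.

```python
def auc_from_hard_labels(cm):
    """AUC of a one-point ROC curve; cm is [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    tpr = tp / (tp + fn)  # recall on the defective class
    tnr = tn / (tn + fp)  # recall on the non-defective class
    return (tpr + tnr) / 2

before = auc_from_hard_labels([[240, 0], [1, 0]])   # Logistic Regression, no resampling
after = auc_from_hard_labels([[219, 21], [0, 1]])   # Logistic Regression with SMOTE
print(before, after)
```

Passing continuous scores (for example probability estimates or decision values) to `roc_auc_score` would give a threshold-independent AUC instead; with a single positive test sample, either variant should be read with caution.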

### SVM with SMOTE

In [None]:
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train_res, y_train_res)
y_pred = svc.predict(X_test)
print('Accuracy of SVM train set : {:.2f}'.format(svc.score(X_train_res, y_train_res)))
print('Accuracy of SVM test set : {:.2f}'.format(svc.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(svc, X_train_res, y_train_res, cv=5))



Accuracy of SVM train set : 1.00
Accuracy of SVM test set : 1.00
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       240
           1       1.00      1.00      1.00         1

    accuracy                           1.00       241
   macro avg       1.00      1.00      1.00       241
weighted avg       1.00      1.00      1.00       241

[[240   0]
 [  0   1]]




[0.50260417 0.50520833 0.5        0.5        0.5       ]


In [None]:
roc_auc_score(y_test,y_pred)

1.0

### Random Forest Classifier with SMOTE

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf2 = RandomForestClassifier().fit(X_train_res, y_train_res)
y_pred6 = clf2.predict(X_test)
print('Accuracy of Random Forest classifier on training set: {:.2f}'.format(clf2.score(X_train_res, y_train_res)))
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(clf2.score(X_test, y_test)))
print(classification_report(y_test, y_pred6))
print(confusion_matrix(y_test, y_pred6))
print(cross_val_score(clf2, X_train_res, y_train_res, cv=5))



Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 1.00
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       240
           1       1.00      1.00      1.00         1

    accuracy                           1.00       241
   macro avg       1.00      1.00      1.00       241
weighted avg       1.00      1.00      1.00       241

[[240   0]
 [  0   1]]
[0.99739583 1.         1.         0.9973822  0.9947644 ]


In [None]:
roc_auc_score(y_test,y_pred6)

1.0

## SMOTE + Standardization

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_res = sc.fit_transform(X_train_res)
X_test = sc.transform(X_test)
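
The important detail in the cell above is that the scaler's statistics are learned on the training data and then reused unchanged on the test data. A minimal single-column sketch of that fit/transform split, with hypothetical helper names and toy values, could look like this:

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values only."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        sigma = 1.0  # constant column: avoid division by zero
    return mu, sigma

def transform(column, mu, sigma):
    """Apply z = (x - mu) / sigma with previously learned statistics."""
    return [(v - mu) / sigma for v in column]

train_col = [2.0, 4.0, 6.0, 8.0]   # hypothetical training values
test_col = [5.0, 10.0]             # hypothetical test values
mu, sigma = fit_scaler(train_col)
scaled_train = transform(train_col, mu, sigma)
scaled_test = transform(test_col, mu, sigma)
print(scaled_train, scaled_test)
```

The scaled training column has mean 0 and unit variance; the test column generally does not, which is expected because it is mapped with the training statistics.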

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train_res, y_train_res)
y_pred = logreg.predict(X_test)
print('Accuracy of Logistic Regression train set: {:.2f}'.format(logreg.score(X_train_res, y_train_res)))
print('Accuracy of Logistic Regression test set: {}'.format(logreg.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(logreg, X_train_res, y_train_res, cv=5))
roc_auc_score(y_test,y_pred)



Accuracy of Logistic Regression train set: 0.99
Accuracy of Logistic Regression test set: 0.970954356846473
              precision    recall  f1-score   support

           0       1.00      0.97      0.99       240
           1       0.12      1.00      0.22         1

    accuracy                           0.97       241
   macro avg       0.56      0.99      0.60       241
weighted avg       1.00      0.97      0.98       241

[[233   7]
 [  0   1]]
[0.95572917 0.99479167 0.96875    0.98429319 0.95026178]


0.9854166666666666

In [None]:
svc = SVC()
svc.fit(X_train_res, y_train_res)
y_pred = svc.predict(X_test)
print('Accuracy of SVM train set : {:.2f}'.format(svc.score(X_train_res, y_train_res)))
print('Accuracy of SVM test set : {:.2f}'.format(svc.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(svc, X_train_res, y_train_res, cv=5))
roc_auc_score(y_test,y_pred)



Accuracy of SVM train set : 1.00
Accuracy of SVM test set : 0.99
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       240
           1       0.25      1.00      0.40         1

    accuracy                           0.99       241
   macro avg       0.62      0.99      0.70       241
weighted avg       1.00      0.99      0.99       241

[[237   3]
 [  0   1]]




[0.9921875  1.         0.98697917 0.9973822  0.98429319]


0.99375

In [None]:
clf = RandomForestClassifier().fit(X_train_res, y_train_res)
y_pred = clf.predict(X_test)
print('Accuracy of Random Forest classifier on training set: {:.2f}'.format(clf.score(X_train_res, y_train_res)))
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(clf, X_train_res, y_train_res, cv=5))
roc_auc_score(y_test,y_pred)



Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 1.00
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       240
           1       1.00      1.00      1.00         1

    accuracy                           1.00       241
   macro avg       1.00      1.00      1.00       241
weighted avg       1.00      1.00      1.00       241

[[240   0]
 [  0   1]]
[0.99479167 1.         1.         0.9973822  0.9973822 ]


1.0

## Random Oversampling

In [None]:
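from imblearn.over_sampling import RandomOverSampler

# X_test was standardized in the previous section; rebuild the unscaled split
# so the models below are evaluated on the same feature scale they are trained on.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)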

In [None]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))
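print("Test set, counts of label '0': {}".format(sum(y_test==0)))
print("Test set, counts of label '1': {}".format(sum(y_test==1)))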

In [None]:
ros = RandomOverSampler()
X_train_resampled, y_train_resampled = ros.fit_resample(X, y)  # fit_resample replaces the older fit_sample

print('After OverSampling, the shape of train_X: {}'.format(X_train_resampled.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_resampled.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_resampled==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_resampled==0)))
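
Unlike SMOTE, random oversampling creates no new points: it draws existing minority rows with replacement until the classes are level. The following pure-Python sketch shows the idea on toy data; the function name and the data are hypothetical, and it is not the imbalanced-learn implementation.

```python
import random
from collections import Counter

def random_oversample(rows, labels, rng):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, lab in zip(rows, labels) if lab == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))  # exact copy of an existing row
            out_labels.append(cls)
    return out_rows, out_labels

rng = random.Random(0)
rows = [[float(i)] for i in range(10)]  # hypothetical single-feature rows
labels = [0] * 8 + [1] * 2
res_rows, res_labels = random_oversample(rows, labels, rng)
balanced = Counter(res_labels)
print(balanced)
```

Since every added row is an exact duplicate, models that can memorise individual points may look better on resampled data than they are on unseen data.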

In [None]:
logreg2 = LogisticRegression()
logreg2.fit(X_train_resampled, y_train_resampled)
y_pred = logreg2.predict(X_test)
print('Accuracy of Logistic Regression train set: {:.2f}'.format(logreg2.score(X_train_resampled, y_train_resampled)))
print('Accuracy of Logistic Regression test set: {:.2f}'.format(logreg2.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(logreg2, X_train_resampled, y_train_resampled, cv=5))

In [None]:
svc2 = SVC()
svc2.fit(X_train_resampled, y_train_resampled)
y_pred = svc2.predict(X_test)
print('Accuracy of SVM train set : {:.2f}'.format(svc2.score(X_train_resampled, y_train_resampled)))
print('Accuracy of SVM test set : {:.2f}'.format(svc2.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(svc2, X_train_resampled, y_train_resampled, cv=5))

In [None]:
clf2 = RandomForestClassifier().fit(X_train_resampled, y_train_resampled)
y_pred = clf2.predict(X_test)
print('Accuracy of Random Forest classifier on training set: {:.2f}'.format(clf2.score(X_train_resampled, y_train_resampled)))
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(clf2.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(clf2, X_train_resampled, y_train_resampled, cv=5))

## ADASYN

In [None]:
from imblearn.over_sampling import ADASYN

print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)

print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

ads = ADASYN(n_neighbors = 1)

X_train_rsm, y_train_rsm = ads.fit_resample(X, y)  # fit_resample replaces the older fit_sample

print('After OverSampling, the shape of train_X: {}'.format(X_train_rsm.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_rsm.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_rsm==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_rsm==0)))
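
ADASYN differs from SMOTE mainly in how many synthetic samples each minority point receives: points whose neighbourhood is dominated by the majority class get more. The sketch below illustrates only this allocation step, under simplifying assumptions (Euclidean distance, rounding of the per-point counts, hypothetical names and toy data); the interpolation step is the same kind as in SMOTE.

```python
def adasyn_allocation(minority, majority, k, total_new):
    """Split `total_new` synthetic samples across minority points, ADASYN-style."""
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    ratios = []
    for x in minority:
        # k nearest neighbours of x among all other points, by squared distance
        others = [(q, lab) for q, lab in labelled if q is not x]
        others.sort(key=lambda item: sum((a - b) ** 2 for a, b in zip(x, item[0])))
        # share of majority-class points in the neighbourhood
        ratios.append(sum(1 for _, lab in others[:k] if lab == 0) / k)
    total = sum(ratios)
    if total == 0:
        raise ValueError("no minority point has a majority-class neighbour")
    # Rounded shares; in general they may not sum exactly to total_new.
    return [round(total_new * r / total) for r in ratios]

minority = [[0.0, 0.0], [5.0, 5.0], [5.5, 5.0]]   # hypothetical minority rows
majority = [[0.1, 0.0], [0.0, 0.2], [9.0, 9.0]]   # hypothetical majority rows
alloc = adasyn_allocation(minority, majority, k=2, total_new=8)
print(alloc)
```

The first minority point sits among majority points and receives the largest share, while the two points that neighbour each other receive fewer.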

In [None]:
logreg3 = LogisticRegression()
logreg3.fit(X_train_rsm, y_train_rsm)
y_pred = logreg3.predict(X_test)
print('Accuracy of Logistic Regression train set: {:.2f}'.format(logreg3.score(X_train_rsm, y_train_rsm)))
print('Accuracy of Logistic Regression test set: {:.2f}'.format(logreg3.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(logreg3, X_train_rsm, y_train_rsm, cv=5))

In [None]:
svc3 = SVC()
svc3.fit(X_train_rsm, y_train_rsm)
y_pred = svc3.predict(X_test)
print('Accuracy of SVM train set : {:.2f}'.format(svc3.score(X_train_rsm, y_train_rsm)))
print('Accuracy of SVM test set : {:.2f}'.format(svc3.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(svc3, X_train_rsm, y_train_rsm, cv=5))

In [None]:
clf3 = RandomForestClassifier().fit(X_train_rsm, y_train_rsm)
y_pred = clf3.predict(X_test)
print('Accuracy of Random Forest classifier on training set: {:.2f}'.format(clf3.score(X_train_rsm, y_train_rsm)))
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(clf3.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(clf3, X_train_rsm, y_train_rsm, cv=5))

After applying oversampling techniques such as SMOTE, Random Oversampling and ADASYN with Logistic Regression, SVM and Random Forest, we observe that ADASYN combined with Logistic Regression performs the best across the training and test sets, with the least overfitting and the highest accuracy.