# Handling Imbalanced Dataset with Machine Learning
Install the library first: run `pip install imbalanced-learn` in the Anaconda Prompt.

## Kaggle Credit Card Dataset: Handling an Imbalanced Dataset

### Context
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

### Content
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; it can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes the value 1 in case of fraud and 0 otherwise.

### Inspiration
Identify fraudulent credit card transactions.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
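The AUPRC recommendation can be tried on a small example before touching the real data. The sketch below uses made-up labels and scores (`y_true` and `y_score` are illustrative, not taken from the credit card dataset) and shows two common ways of summarising the precision-recall curve with scikit-learn.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Hypothetical toy labels and scores (not taken from the credit card data).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.65, 0.90])

# Average precision: a step-wise summary of the precision-recall curve.
ap = average_precision_score(y_true, y_score)

# Trapezoidal area under the same curve; usually close to, but not equal to, AP.
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)

print(f"average precision: {ap:.3f}")
print(f"PR AUC (trapezoid): {pr_auc:.3f}")
```

Both numbers need predicted scores or probabilities (for example from `predict_proba`), not the hard 0/1 labels returned by `predict`.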

### Acknowledgements
The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project.

In [None]:
import pandas as pd
df=pd.read_csv('creditcard.csv')
df.head()

In [3]:
df.shape

(284807, 31)

In [4]:
df['Class'].value_counts() # The two classes differ hugely in size

0    284315
1       492
Name: Class, dtype: int64
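To see why plain accuracy says little here, it may help to work out what a model that never predicts fraud would score. The short sketch below only uses the two counts printed above.

```python
# Class counts copied from the value_counts() output above.
n_legit, n_fraud = 284315, 492
n_total = n_legit + n_fraud

fraud_share = n_fraud / n_total          # fraction of positive (fraud) rows
baseline_accuracy = n_legit / n_total    # accuracy of always predicting class 0

print(f"fraud share: {fraud_share:.4%}")
print(f"always-predict-0 accuracy: {baseline_accuracy:.4%}")
```

Any accuracy figure reported later in the notebook should be read against this baseline rather than against 50%.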

In [28]:
#### Independent and Dependent Features
X=df.drop("Class",axis=1) # equivalent to df.iloc[:,:-1]
y=df['Class'] # equivalent to df.iloc[:,-1]

In [29]:
X

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.524980,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.208038,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,1.475829,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.059616,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.001396,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.127434,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00


In [31]:
y

0         0
1         0
2         0
3         0
4         0
         ..
284802    0
284803    0
284804    0
284805    0
284806    0
Name: Class, Length: 284807, dtype: int64

#### Cross Validation (KFold) and Hyperparameter Tuning

In [34]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report # Do not rely on accuracy for an imbalanced dataset
from sklearn.model_selection import KFold
import numpy as np
from sklearn.model_selection import GridSearchCV

In [39]:
print(10.0**-2)
print(10.0**2)

0.01
100.0


In [42]:
np.arange(-20,2,2) # Values from the start (-20) up to, but not including, the stop (2), in steps of 2

array([-20, -18, -16, -14, -12, -10,  -8,  -6,  -4,  -2,   0])

In [35]:
10.0 **np.arange(-2,3) # from -2 to 2

array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])
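The same grid of `C` values can also be written with `np.logspace`, which some readers find easier to scan because both endpoints are stated explicitly. A quick check that the two spellings agree:

```python
import numpy as np

c_from_arange = 10.0 ** np.arange(-2, 3)      # exponents -2, -1, 0, 1, 2
c_from_logspace = np.logspace(-2, 2, num=5)   # same five values, endpoints inclusive

print(c_from_arange)
print(np.allclose(c_from_arange, c_from_logspace))
```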

In [48]:
log_class=LogisticRegression() # parameters include C and penalty (press Shift+Tab in the notebook to see them)
grid={'C':10.0**np.arange(-2,3),'penalty':['l1','l2']} # l1 and l2 are regularization penalties; l2 is the default for logistic regression
# Note: the default 'lbfgs' solver supports only 'l2', so the 'l1' candidates need solver='liblinear' or 'saga'
# cv: cross-validation splitter
cv=KFold(n_splits=5,random_state=None,shuffle=False)

In [49]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,train_size=0.7)
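With so few fraud rows, an unlucky random split can leave the train and test sets with noticeably different fraud rates, and the split changes on every run. One option, sketched below on made-up stand-in data (`X_demo` and `y_demo` are hypothetical), is to pass `stratify` and `random_state` to `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10,000 rows with 1% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:100] = 1

# stratify keeps the class ratio the same in both parts;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, stratify=y_demo, random_state=42
)

print(y_tr.mean(), y_te.mean())
```

This is a suggestion rather than what the notebook ran, so the counts in the outputs below come from the unstratified split.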

In [50]:
clf=GridSearchCV(log_class,grid,cv=cv,n_jobs=-1,scoring='f1_macro')
# 'macro' computes the metric independently for each class and then averages, so all classes are treated equally
# 'micro' aggregates the contributions of all classes before computing the metric
clf.fit(X_train,y_train)

25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\HP\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\HP\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\HP\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

        nan 0.84869833        nan 0.8

GridSearchCV(cv=KFold(n_splits=5, random_state=None, shuffle=False),
             estimator=LogisticRegression(), n_jobs=-1,
             param_grid={'C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]),
                         'penalty': ['l1', 'l2']},
             scoring='f1_macro')
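The failed fits reported above come from pairing the `l1` penalty with the default `lbfgs` solver. A minimal sketch of one way around this, assuming a solver that accepts both penalties (`liblinear`), is shown below on hypothetical toy data. It also swaps in `StratifiedKFold`, which is not what the notebook used, so that every fold contains some positive rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical imbalanced toy data (about 5% positives), not the credit card set.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# 'liblinear' accepts both 'l1' and 'l2', so no candidate fails to fit.
log_reg = LogisticRegression(solver="liblinear")
param_grid = {"C": 10.0 ** np.arange(-2, 3), "penalty": ["l1", "l2"]}

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(log_reg, param_grid, cv=cv, scoring="f1_macro")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(np.isnan(search.cv_results_["mean_test_score"]).any())
```

With all ten candidates fitting, the grid search compares `l1` and `l2` instead of silently choosing among the `l2` candidates only.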

In [53]:
y_pred=clf.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85250    48]
 [   49    96]]
0.9988647402361809
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85298
           1       0.67      0.66      0.66       145

    accuracy                           1.00     85443
   macro avg       0.83      0.83      0.83     85443
weighted avg       1.00      1.00      1.00     85443
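The `macro avg` row can be reproduced by hand from the confusion matrix, which makes the difference between macro and micro averaging concrete. The sketch below uses only the four counts printed above.

```python
# Confusion matrix counts copied from the output above:
# rows are true classes, columns are predicted classes.
tn, fp = 85250, 48
fn, tp = 49, 96

def f1(tp_, fp_, fn_):
    """F1 score from true positives, false positives and false negatives."""
    return 2 * tp_ / (2 * tp_ + fp_ + fn_)

f1_fraud = f1(tp, fp, fn)   # class 1 treated as the positive class
f1_legit = f1(tn, fn, fp)   # class 0 treated as the positive class

macro_f1 = (f1_fraud + f1_legit) / 2          # unweighted mean over classes
micro_f1 = (tp + tn) / (tp + tn + fp + fn)    # equals accuracy for single-label data

print(f"macro F1: {macro_f1:.2f}, micro F1: {micro_f1:.4f}")
```

The macro value is pulled down by the weak fraud class, while the micro value is dominated by the majority class and looks nearly perfect.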



In [39]:
y_pred=clf.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85251    47]
 [   41   104]]
0.9989700736163291
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85298
           1       0.69      0.72      0.70       145

    accuracy                           1.00     85443
   macro avg       0.84      0.86      0.85     85443
weighted avg       1.00      1.00      1.00     85443



In [64]:
347*100

34700

In [56]:
X_train.value_counts()

Time      V1         V2         V3         V4         V5         V6         V7         V8         V9         V10         V11        V12        V13        V14        V15        V16        V17        V18        V19        V20        V21        V22        V23        V24        V25        V26        V27        V28        Amount
163152.0  -1.203617   1.574009   2.889277   3.381404   1.538663   3.698747   0.560211  -0.150911   0.124136   4.220998    1.384569  -0.706897  -0.256274  -1.562583   1.692915  -0.787338  -0.226776  -0.412354   0.234322   1.385597  -0.366727   0.522223  -0.357329  -0.870174  -0.134166   0.327019  -0.042648  -0.855262  1.51      12
          -1.196037   1.585949   2.883976   3.378471   1.511706   3.717077   0.585362  -0.156001   0.122648   4.217934    1.385525  -0.709405  -0.256168  -1.564352   1.693218  -0.785210  -0.228008  -0.412833   0.234834   1.375790  -0.370294   0.524395  -0.355170  -0.869790  -0.133198   0.327804  -0.035702  -0.858197  7.56      11
43153.0   

In [43]:
y_train.value_counts()

0    199017
1       347
Name: Class, dtype: int64

In [57]:
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()
classifier.fit(X_train,y_train)

RandomForestClassifier()

In [59]:
y_pred=classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85293     5]
 [   34   111]]
0.9995435553526912
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85298
           1       0.96      0.77      0.85       145

    accuracy                           1.00     85443
   macro avg       0.98      0.88      0.93     85443
weighted avg       1.00      1.00      1.00     85443



In [60]:
# Hyperparameter tuning: weight class 1 (fraud) 100 times more heavily than class 0
class_weight={0:1,1:100}
class_weight

{0: 1, 1: 100}

In [61]:
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(class_weight=class_weight)
classifier.fit(X_train,y_train)

RandomForestClassifier(class_weight={0: 1, 1: 100})

In [62]:
y_pred=classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85297     1]
 [   34   111]]
0.9995903701883126
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85298
           1       0.99      0.77      0.86       145

    accuracy                           1.00     85443
   macro avg       1.00      0.88      0.93     85443
weighted avg       1.00      1.00      1.00     85443
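The 1:100 weighting above is a hand-picked value. For comparison, scikit-learn can derive weights from the class frequencies with its `'balanced'` heuristic. The sketch below rebuilds a label vector with the training counts seen in this notebook (199017 and 347) and prints the weights that heuristic would give.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild a label vector with the training counts from this notebook.
y_counts = np.repeat([0, 1], [199017, 347])

# 'balanced' uses n_samples / (n_classes * count_per_class).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_counts
)
balanced = dict(zip([0, 1], weights))

print(balanced)
print(balanced[1] / balanced[0])   # relative weight of fraud vs. non-fraud
```

Passing `class_weight='balanced'` to `RandomForestClassifier` applies the same formula; whether it beats the manual 1:100 setting on this data is not something this sketch can tell.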



#### Under Sampling
Reduce the number of points in the majority class.


In [65]:
from collections import Counter
Counter(y_train) # it is imbalanced

Counter({0: 199017, 1: 347})

In [66]:
from collections import Counter
Counter(X_train)

Counter({'Time': 1,
         'V1': 1,
         'V2': 1,
         'V3': 1,
         'V4': 1,
         'V5': 1,
         'V6': 1,
         'V7': 1,
         'V8': 1,
         'V9': 1,
         'V10': 1,
         'V11': 1,
         'V12': 1,
         'V13': 1,
         'V14': 1,
         'V15': 1,
         'V16': 1,
         'V17': 1,
         'V18': 1,
         'V19': 1,
         'V20': 1,
         'V21': 1,
         'V22': 1,
         'V23': 1,
         'V24': 1,
         'V25': 1,
         'V26': 1,
         'V27': 1,
         'V28': 1,
         'Amount': 1})

In [82]:
from collections import Counter
from imblearn.under_sampling import NearMiss
ns=NearMiss(sampling_strategy=0.8)
X_train_ns,y_train_ns=ns.fit_resample(X_train,y_train)
print('The number of classes before fit {}'.format(Counter(y_train)))
print('The number of classes after fit {}'.format(Counter(y_train_ns)))

The number of classes before fit Counter({0: 199017, 1: 347})
The number of classes after fit Counter({0: 433, 1: 347})


In [97]:
433*0.8

346.40000000000003
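The check above runs the ratio backwards. Going forwards, a float `sampling_strategy` for under-sampling is the desired minority-to-majority ratio after resampling, so the majority class is cut down to roughly `n_minority / sampling_strategy` rows. The truncation to an integer in the sketch below is an assumption inferred from the 433 in the output, not a statement about the library's internals.

```python
n_minority = 347          # fraud rows in y_train, from the Counter output above
sampling_strategy = 0.8   # desired minority / majority ratio after resampling

# Assumption: the majority count is truncated to an integer,
# which matches the 433 shown in the output above.
n_majority_after = int(n_minority / sampling_strategy)

print(n_majority_after)
print(n_minority / n_majority_after)   # achieved ratio, slightly above 0.8
```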

In [84]:
from collections import Counter
from imblearn.under_sampling import NearMiss
ns=NearMiss(sampling_strategy='auto')
X_train_ns,y_train_ns=ns.fit_resample(X_train,y_train)
print("The number of classes before fit {}".format(Counter(y_train)))
print("The number of classes after fit {}".format(Counter(y_train_ns)))

The number of classes before fit Counter({0: 199017, 1: 347})
The number of classes after fit Counter({0: 347, 1: 347})


In [92]:
from collections import Counter
from imblearn.under_sampling import NearMiss
ns=NearMiss(sampling_strategy=1.0)
X_train_ns,y_train_ns=ns.fit_resample(X_train,y_train)
print("The number of classes before fit {}".format(Counter(y_train)))
print("The number of classes after fit {}".format(Counter(y_train_ns)))

The number of classes before fit Counter({0: 199017, 1: 347})
The number of classes after fit Counter({0: 347, 1: 347})


In [93]:
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()
classifier.fit(X_train_ns,y_train_ns)

RandomForestClassifier()

In [94]:
y_pred=classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[63564 21734]
 [    7   138]]
0.7455496646887398
              precision    recall  f1-score   support

           0       1.00      0.75      0.85     85298
           1       0.01      0.95      0.01       145

    accuracy                           0.75     85443
   macro avg       0.50      0.85      0.43     85443
weighted avg       1.00      0.75      0.85     85443



In [None]:
# Avoid under-sampling in general: it discards a large share of the data. It is only worth considering when the dataset is small.


#### Over Sampling

In [98]:
from imblearn.over_sampling import RandomOverSampler

In [100]:
os=RandomOverSampler(sampling_strategy=0.75)
X_train_ns,y_train_ns=os.fit_resample(X_train,y_train)
print("The number of classes before fit {}".format(Counter(y_train)))
print("The number of classes after fit {}".format(Counter(y_train_ns)))

The number of classes before fit Counter({0: 199017, 1: 347})
The number of classes after fit Counter({0: 199017, 1: 149262})
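For over-sampling the ratio works the other way round: the minority class is grown until it reaches about `sampling_strategy` times the majority count. The arithmetic below reproduces the 149262 in the output. As before, the integer truncation is an assumption that happens to match the printed count.

```python
n_majority = 199017       # non-fraud rows in y_train
n_minority = 347          # fraud rows in y_train
sampling_strategy = 0.75  # desired minority / majority ratio after resampling

# Assumption: the target count is truncated to an integer.
n_minority_after = int(n_majority * sampling_strategy)
n_added = n_minority_after - n_minority

print(n_minority_after)
print(n_added)                          # rows created by duplication
print(n_minority_after / n_minority)    # average copies per original fraud row
```

Every added row is a copy of one of the 347 original fraud rows, so each original appears a few hundred times in the resampled training set.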


In [101]:
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()
classifier.fit(X_train_ns,y_train_ns)

RandomForestClassifier()

In [102]:
y_pred=classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85291     7]
 [   29   116]]
0.9995786664794073
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85298
           1       0.94      0.80      0.87       145

    accuracy                           1.00     85443
   macro avg       0.97      0.90      0.93     85443
weighted avg       1.00      1.00      1.00     85443



#### SMOTETomek

In [67]:
# Random over-sampling only duplicates existing minority points.
# SMOTETomek creates new points based on nearest neighbours, which is why it takes much longer to run.
# Under-sampling and over-sampling are applied to the training data while building the model.
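
The idea of creating new points from nearest neighbours can be illustrated with a few lines of NumPy. The function below is a simplified sketch of the interpolation step behind SMOTE, not imbalanced-learn's implementation: `smote_like_samples`, its parameters and the toy points are all hypothetical, and the Tomek-link cleaning step of SMOTETomek is left out.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    A simplified illustration of the SMOTE idea, not imbalanced-learn's code.
    """
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)              # random starting rows
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Hypothetical minority points in two dimensions.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_like_samples(X_min, n_new=10)
print(X_new.shape)
```

Each synthetic row lies on the line segment between a real minority point and one of its neighbours, which is what distinguishes this from the exact copies produced by random over-sampling.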

In [106]:
from imblearn.combine import SMOTETomek
# Creates new synthetic points for the minority class
os=SMOTETomek(sampling_strategy=0.75)
X_train_ns,y_train_ns=os.fit_resample(X_train,y_train)
print("the no of classes before fit{}".format(Counter(y_train)))
print("the no of classes after fit{}".format(Counter(y_train_ns)))

the no of classes before fitCounter({0: 199017, 1: 347})
the no of classes after fitCounter({0: 198332, 1: 148577})


In [107]:
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()
classifier.fit(X_train_ns,y_train_ns)

RandomForestClassifier()

In [108]:
y_pred=classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85280    18]
 [   23   122]]
0.9995201479348805
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85298
           1       0.87      0.84      0.86       145

    accuracy                           1.00     85443
   macro avg       0.94      0.92      0.93     85443
weighted avg       1.00      1.00      1.00     85443



#### Ensemble Techniques

In [109]:
from imblearn.ensemble import EasyEnsembleClassifier

In [111]:
easy=EasyEnsembleClassifier()


In [116]:
easy.fit(X_train_ns,y_train_ns)

EasyEnsembleClassifier()

In [118]:
y_pred=easy.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[83883  1415]
 [   17   128]]
0.9832402888475358
              precision    recall  f1-score   support

           0       1.00      0.98      0.99     85298
           1       0.08      0.88      0.15       145

    accuracy                           0.98     85443
   macro avg       0.54      0.93      0.57     85443
weighted avg       1.00      0.98      0.99     85443


