## Supervised Learning: Challenge

In this challenge, we will try to predict credit card fraud.

Download the data from [here](https://drive.google.com/file/d/1FCQY1SiWIjh_ME6Wtb3FG8Y1sKoRwAUc/view?usp=sharing). The data is originally from a [Kaggle Competition](https://www.kaggle.com/mlg-ulb/creditcardfraud).

The dataset contains transactions made by European cardholders with credit cards over two days in September 2013. **There are 492 occurrences of fraud out of a total of 284,807 transactions.** The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
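
To make the imbalance concrete, the short calculation below uses only the two counts quoted above. The "label everything as legitimate" baseline is an illustrative assumption, not a model trained in this notebook.

```python
# Counts quoted in the dataset description above.
n_fraud = 492
n_total = 284_807

# Share of transactions that are fraudulent.
fraud_share = n_fraud / n_total
print(f"Fraud share: {fraud_share:.4%}")

# Hypothetical baseline: label every transaction as legitimate (class 0).
# It never catches a single fraud, yet its accuracy is very high.
baseline_accuracy = (n_total - n_fraud) / n_total
print(f"Accuracy of the all-legitimate baseline: {baseline_accuracy:.4%}")
```

Any model we train should be judged against this baseline, which is why accuracy on its own says little here.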

____________________
### **Challenge:** Identify fraudulent credit card transactions.

Features V1, V2, … V28 are the principal components obtained with PCA. The only features that are not transformed with PCA are `'Time'` and `'Amount'`.  

- The feature `'Time'` contains the seconds elapsed between each transaction and the first transaction in the dataset.
- The feature `'Amount'` is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.
- The feature `'Class'` is the target variable, and it takes the value of 1 in case of fraud and 0 otherwise.
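
The cost-sensitive idea mentioned for `'Amount'` could be sketched as follows. This is a minimal illustration, assuming that a missed fraud costs the full transaction amount and that every flagged transaction costs a fixed review fee. The function name `total_cost`, the `review_cost` value, and the toy data are all hypothetical and not part of the dataset.

```python
def total_cost(y_true, y_pred, amounts, review_cost=5.0):
    """Example-dependent cost of a set of predictions (illustrative assumption).

    - Flagged transaction (predicted 1): pay a fixed review cost.
    - Missed fraud (true 1, predicted 0): lose the transaction amount.
    - Legitimate transaction that is not flagged: no cost.
    """
    cost = 0.0
    for truth, pred, amount in zip(y_true, y_pred, amounts):
        if pred == 1:
            cost += review_cost
        elif truth == 1:
            cost += amount
    return cost


# Toy data: 1 = fraud, 0 = legitimate.
y_true = [0, 0, 1, 1, 0]
y_pred = [0, 1, 1, 0, 0]
amounts = [20.0, 35.0, 500.0, 120.0, 9.99]

print(total_cost(y_true, y_pred, amounts))
```

Under this assumed cost model, missing a large fraudulent transaction is penalised far more than missing a small one, which plain accuracy cannot express.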

> #### Warning
> The classes are heavily imbalanced, so we need to be careful when evaluating: accuracy alone is misleading. It might be better to use the method `.predict_proba()` with a custom cut-off to search for fraudulent transactions.
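
A custom cut-off could look like the sketch below. The scores stand in for the positive-class column of `.predict_proba()` (for a fitted scikit-learn classifier, that would be `model.predict_proba(X)[:, 1]`). The helper names and the toy scores are hypothetical.

```python
def apply_cutoff(fraud_probs, cutoff):
    """Label a transaction as fraud (1) when P(fraud) >= cutoff."""
    return [1 if p >= cutoff else 0 for p in fraud_probs]


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (fraud = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels and toy fraud probabilities (not taken from the dataset).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
fraud_probs = [0.01, 0.05, 0.20, 0.40, 0.30, 0.02, 0.60, 0.90]

# Lowering the cut-off trades precision for recall.
for cutoff in (0.5, 0.25, 0.1):
    y_pred = apply_cutoff(fraud_probs, cutoff)
    precision, recall = precision_recall(y_true, y_pred)
    print(f"cutoff={cutoff:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

The default `.predict()` corresponds to a cut-off of 0.5 for most probabilistic classifiers; for fraud detection a lower cut-off is often worth the extra false alarms.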

!pip install lazypredict

In [2]:
import numpy as np

In [121]:
# import pandas

import pandas as pd

from sklearn.metrics import (confusion_matrix,accuracy_score,classification_report)
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier


#from lazypredict.Supervised import LazyClassifier

In [5]:
df = pd.read_csv('creditcard.csv')

In [6]:
X = df.drop(columns='Class')
y = df['Class']

X[:5]

In [7]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3) 

In [83]:
y_test                            

0        0
1        0
2        0
3        0
4        0
        ..
85438    0
85439    0
85440    0
85441    0
85442    0
Name: Class, Length: 85443, dtype: int64

In [82]:
y_test.reset_index(drop=True,inplace=True)

In [11]:
svc = SVC()

In [12]:
svc.fit(X_train,y_train)

SVC()

In [13]:
y_pred = svc.predict(X_test)

In [14]:
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


0.9995201479348805
[[85301     3]
 [   38   101]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85304
           1       0.97      0.73      0.83       139

    accuracy                           1.00     85443
   macro avg       0.99      0.86      0.92     85443
weighted avg       1.00      1.00      1.00     85443
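
The precision, recall, and F1 reported for the fraud class follow directly from the confusion matrix above. The sketch below recomputes them by hand from the four counts printed for the SVC; the variable names are illustrative.

```python
# Confusion matrix printed above for the SVC, in scikit-learn's layout:
# rows are true classes, columns are predicted classes.
tn, fp = 85301, 3
fn, tp = 38, 101

precision = tp / (tp + fp)  # of the flagged transactions, how many were fraud
recall = tp / (tp + fn)     # of the actual frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

The accuracy is dominated by the 85,301 true negatives, while the recall shows that 38 of the 139 frauds in the test set were missed.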



### Try Naive Bayes

In [15]:
gnb = GaussianNB()

In [16]:
gnb.fit(X_train,y_train)

GaussianNB()

In [52]:
y_pred_gnb = gnb.predict(X_test)

In [53]:
print(accuracy_score(y_test, y_pred_gnb))
print(confusion_matrix(y_test, y_pred_gnb))
print(classification_report(y_test, y_pred_gnb))

0.976299989466662
[[83296  2008]
 [   17   122]]
              precision    recall  f1-score   support

           0       1.00      0.98      0.99     85304
           1       0.06      0.88      0.11       139

    accuracy                           0.98     85443
   macro avg       0.53      0.93      0.55     85443
weighted avg       1.00      0.98      0.99     85443



In [84]:
df = pd.DataFrame(gnb.predict_proba(X_test))

In [99]:
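def judge_cls(x):
    # x is P(class 0), i.e. column 0 of predict_proba.
    # Flag as fraud (1) only when P(class 0) < 0.1, i.e. P(fraud) > 0.9.
    if x >= 0.1:
        return 0
    else:
        return 1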

In [85]:
df['50%'] = df[0].apply(judge_cls)

In [100]:
df['80%'] = df[0].apply(judge_cls)

In [87]:
df = pd.concat([df,y_test],axis=1)

In [101]:
df[df['80%'] != df['Class']].count()

0        1914
1        1914
50%      1914
Class    1914
80%      1914
dtype: int64

In [93]:
df.head()

|   | 0 | 1 | 50% | Class | 80% |
|---|---|---|---|---|---|
| 0 | 1.0 | 2.274867e-16 | 0 | 0 | 0 |
| 1 | 1.0 | 1.923143e-17 | 0 | 0 | 0 |
| 2 | 1.0 | 6.596116e-17 | 0 | 0 | 0 |
| 3 | 1.0 | 1.203486e-17 | 0 | 0 | 0 |
| 4 | 1.0 | 5.744683e-16 | 0 | 0 | 0 |


### Try XGBoost

In [114]:
data_dmatrix = xgb.DMatrix(data=X_train,label=y_train)

In [117]:
data_test = xgb.DMatrix(data=X_test)

In [115]:
params = {"objective":"binary:logistic",'colsample_bytree': 0.3,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}

In [110]:
xg_reg = xgb.XGBClassifier(objective ='binary:logistic', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 10)

In [116]:
xg_reg = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=10)



In [118]:
y_pred_xg = xg_reg.predict(data_test)

In [119]:
y_pred_xg

array([0.17967121, 0.17967121, 0.17967121, ..., 0.17967121, 0.17967121,
       0.17967121], dtype=float32)

### Try Random Forest

In [123]:

# Create a random forest classifier
clf=RandomForestClassifier(n_estimators=100,n_jobs=-1)

# Train the model on the training set
clf.fit(X_train,y_train)

# Predict classes for the test set
y_pred=clf.predict(X_test)

In [124]:
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.9996371850239341
[[85298     6]
 [   25   114]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85304
           1       0.95      0.82      0.88       139

    accuracy                           1.00     85443
   macro avg       0.97      0.91      0.94     85443
weighted avg       1.00      1.00      1.00     85443



In [126]:
df1 = pd.DataFrame(clf.predict_proba(X_test))

In [138]:
df1.head()

|   | 0 | 1 | ypred |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 0 |
| 1 | 1.0 | 0.0 | 0 |
| 2 | 1.0 | 0.0 | 0 |
| 3 | 1.0 | 0.0 | 0 |
| 4 | 1.0 | 0.0 | 0 |


In [135]:
def pred(x):
    # x is P(fraud), i.e. column 1 of predict_proba; flag as fraud above 0.15.
    if x > 0.15:
        return 1
    else:
        return 0

In [136]:
df1['ypred'] = df1[1].apply(pred)

In [137]:
df1[(df1[0]<0.85) & (df1[1]>0.15)].describe()

|   | 0 | 1 | ypred |
|---|---|---|---|
| count | 149.0 | 149.0 | 149.0 |
| mean | 0.22906 | 0.77094 | 1.0 |
| std | 0.252416 | 0.252416 | 0.0 |
| min | 0.0 | 0.16 | 1.0 |
| 25% | 0.04 | 0.6 | 1.0 |
| 50% | 0.1 | 0.9 | 1.0 |
| 75% | 0.4 | 0.96 | 1.0 |
| max | 0.84 | 1.0 | 1.0 |


In [139]:
print(accuracy_score(y_test, df1['ypred']))
print(confusion_matrix(y_test, df1['ypred']))
print(classification_report(y_test, df1['ypred']))

0.9995318516437859
[[85279    25]
 [   15   124]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85304
           1       0.83      0.89      0.86       139

    accuracy                           1.00     85443
   macro avg       0.92      0.95      0.93     85443
weighted avg       1.00      1.00      1.00     85443

