### Credit Card Fraud Detection

### Building a credit card fraud detection model using machine learning algorithms
Fraud is costly: companies worldwide lose billions of dollars to fraudulent activities every year.
This project uses historical credit card transactions from many cardholders, each labelled as fraudulent or non-fraudulent, to predict whether a transaction is fraudulent.
It is a supervised classification exercise: several models are trained, and the class imbalance in the data is addressed to arrive at the best possible classification.

### Importing libraries and dataset

In [25]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

In [6]:
fd=pd.read_csv('creditcard.csv')

### Initial Analysis

In [7]:
fd.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [8]:
fd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [9]:
fd.shape

(284807, 31)

#### The dataset has 284807 rows and 31 columns: 30 input features plus the target column Class. The shape attribute is a tuple holding the number of rows and the number of columns of the dataset.

In [10]:
print('Column names:\n',fd.columns)

Column names:
 Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')


In [11]:
fd['Class'].unique()

array([0, 1], dtype=int64)

#### The target variable Class takes two values, 0 and 1:

0 for non-fraudulent transactions
1 for fraudulent transactions

In [12]:
fd['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64

#### The value counts above show a severe class imbalance: only 492 of the 284807 transactions are fraudulent. This imbalance affects both model training and how the results should be read.
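To put a number on the imbalance, the short sketch below recomputes the fraud share from the two counts printed above. The counts are typed in by hand so that the snippet runs without creditcard.csv; the variable names are illustrative only.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
class_counts = pd.Series({0: 284315, 1: 492}, name="Class")

# Share of fraudulent transactions in the whole dataset.
fraud_share = class_counts.loc[1] / class_counts.sum()
# How many non-fraud rows exist for every fraud row.
imbalance_ratio = class_counts.loc[0] / class_counts.loc[1]

print(f"Fraud share: {fraud_share:.4%}")
print(f"Non-fraud rows per fraud row: {imbalance_ratio:.0f}")
```

With the real DataFrame, the same share can be read directly from fd['Class'].value_counts(normalize=True).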

### Data Preprocessing

#### Preprocessing is the process of cleaning the dataset. In this step we apply different methods to the raw data so that the modeling phase receives more meaningful input. Typical steps include

    #Removing duplicates or irrelevant samples
    #Filling in missing values with the most relevant values
    #Converting one data type to another, for example categorical to integer
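The three steps listed above could look roughly like this in pandas. The toy DataFrame is hypothetical and only stands in for the real transactions; filling with the median is one possible choice, not something the notebook itself does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one duplicated row and one missing value.
toy = pd.DataFrame({
    "V1": [0.5, 0.5, np.nan, -1.2],
    "Amount": [10.0, 10.0, 25.0, 3.5],
    "Class": [0, 0, 1, 0],
})

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
n_missing = toy.isna().sum()                # missing values per column

# 1. remove duplicates, 2. fill missing values, 3. convert a data type
cleaned = toy.drop_duplicates().copy()
cleaned["V1"] = cleaned["V1"].fillna(cleaned["V1"].median())
cleaned["Class"] = cleaned["Class"].astype("int8")

print(n_duplicates)
print(n_missing)
print(cleaned)
```

On the credit card data, fd.info() above already reports no missing values in the columns it lists, so only the duplicate check would add information there.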

In [13]:
fd = fd.drop(['Time'], axis=1)

In [14]:
fd['norm_amount'] = StandardScaler().fit_transform(fd['Amount'].values.reshape(-1,1))
fd = fd.drop(['Amount'], axis=1)
print(f"few values of Amount column after applying StandardScaler:- \n {fd['norm_amount'][0:4]}")

few values of Amount column after applying StandardScaler:- 
 0    0.244964
1   -0.342475
2    1.160686
3    0.140534
Name: norm_amount, dtype: float64
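As a quick check of what StandardScaler does, the sketch below applies it to the five Amount values shown by fd.head() and compares the result with the formula (x - mean) / std. Because the scaler here is fitted on five values only, the numbers differ from the norm_amount values printed above, which come from a scaler fitted on all rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The five Amount values from the fd.head() output, as a column vector.
amounts = np.array([149.62, 2.69, 378.66, 123.50, 69.99]).reshape(-1, 1)

scaler = StandardScaler()
scaled = scaler.fit_transform(amounts)

# Manual standardisation; StandardScaler uses the population std (ddof=0).
manual = (amounts - amounts.mean()) / amounts.std()

print(np.allclose(scaled, manual))
print(scaled.ravel())
```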


### Splitting dependent and independent features and train_test_split

In [15]:
X = fd.drop(['Class'], axis=1)
y = fd[['Class']]

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(199364, 29)
(85443, 29)
(199364, 1)
(85443, 1)
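With so few fraud rows, a purely random split can leave the train and test sets with slightly different fraud rates. One common option, not used in the cell above, is the stratify argument of train_test_split. The sketch below shows it on synthetic data with 2% positives; X_toy and y_toy are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 20 of them positive (2%).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

print(f"positives in train: {y_tr.sum()} of {len(y_tr)}")
print(f"positives in test:  {y_te.sum()} of {len(y_te)}")
```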


### Our project falls under the supervised learning category, i.e. the dataset has a target value for each row or sample.
#### Credit card fraud detection is a classification problem. The target variable of a classification problem takes integer values (0, 1) or categorical values (fraud, non-fraud). The target variable of our dataset, Class, has only two labels: 0 (non-fraudulent) and 1 (fraudulent).
    #The decision tree algorithm considers all the provided features of the data and identifies the most important ones.
    #The random forest algorithm is an ensemble learning algorithm: it builds N decision tree models and combines their predictions.
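The two points above can be seen on a small synthetic problem. This is only a sketch: make_classification generates stand-in data, and random_state is fixed so the run is repeatable. A fitted tree exposes feature_importances_, and a fitted forest keeps its individual trees in estimators_.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 6 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Importance of each feature according to the single tree (sums to 1).
print(tree.feature_importances_)
# The forest is a collection of N = 50 fitted decision trees.
print(len(forest.estimators_))
```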
 

### DECISION TREE CLASSIFICATION

In [18]:
def decision_tree_classification(X_train, y_train, X_test, y_test):
    # initialize object for DecisionTreeClassifier class
    dt_classifier = DecisionTreeClassifier()
    # train model by using fit method
    print("Model training starts........")
    dt_classifier.fit(X_train, y_train.values.ravel()) 
    #ravel is used to change a 2-dimensional array or a multi-dimensional array into a contiguous flattened array.
    print("Model training completed")
    acc_score = dt_classifier.score(X_test, y_test)
    print(f'Accuracy of model on test dataset :- {acc_score}')
    # predict result using test dataset
    y_pred = dt_classifier.predict(X_test)
    # confusion matrix
    print(f"Confusion Matrix :- \n {confusion_matrix(y_test, y_pred)}")
    # classification report for f1-score
    print(f"Classification Report :- \n {classification_report(y_test, y_pred)}")  

In [19]:
# calling decision_tree_classification method to train and evaluate model
decision_tree_classification(X_train, y_train, X_test, y_test)

Model training starts........
Model training completed
Accuracy of model on test dataset :- 0.9992158515033414
Confusion Matrix :- 
 [[85266    30]
 [   37   110]]
Classification Report :- 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     85296
           1       0.79      0.75      0.77       147

    accuracy                           1.00     85443
   macro avg       0.89      0.87      0.88     85443
weighted avg       1.00      1.00      1.00     85443



#### Although accuracy is about 99.9%, the f1-score for the fraud class is only 0.77, owing to the imbalanced data discussed earlier.
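To see why accuracy says so little here, consider a hypothetical baseline that labels every test transaction as non-fraudulent. Using the support counts from the classification report above (85296 non-fraud and 147 fraud rows), its accuracy would already be about 99.8% while catching no fraud at all.

```python
# Support counts taken from the classification report above.
n_legit, n_fraud = 85296, 147

# A baseline that always predicts class 0 gets every non-fraud row right
# and every fraud row wrong.
baseline_accuracy = n_legit / (n_legit + n_fraud)
baseline_recall = 0 / n_fraud

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline recall on fraud: {baseline_recall:.2f}")
```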

### Random Forest Classifier

In [20]:
def random_forest_classifier(X_train, y_train, X_test, y_test):
     # initialize object for RandomForestClassifier class
     rf_classifier = RandomForestClassifier(n_estimators=50)
     # train model by using fit method
     print("Model training starts........")
     rf_classifier.fit(X_train, y_train.values.ravel())
     acc_score = rf_classifier.score(X_test, y_test)
     print(f'Accuracy of model on test dataset :- {acc_score}')
     # predict result using test dataset
     y_pred = rf_classifier.predict(X_test)
     # confusion matrix
     print(f"Confusion Matrix :- \n {confusion_matrix(y_test, y_pred)}")
     # classification report for f1-score
     print(f"Classification Report :- \n {classification_report(y_test, y_pred)}")

In [21]:
# calling random_forest_classifier
random_forest_classifier(X_train, y_train, X_test, y_test)

Model training starts........
Accuracy of model on test dataset :- 0.9994967405170698
Confusion Matrix :- 
 [[85290     6]
 [   37   110]]
Classification Report :- 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     85296
           1       0.95      0.75      0.84       147

    accuracy                           1.00     85443
   macro avg       0.97      0.87      0.92     85443
weighted avg       1.00      1.00      1.00     85443



#### Accuracy is high, but recall and f1-score for the fraud class are low owing to the unbalanced data: samples of one class label far outnumber, and dominate, those of the other.

#### TREATMENT OF UNBALANCED DATA
For unbalanced or skewed datasets, any of the following metrics is more informative than accuracy.

Recall
Precision
F1-score
Area Under ROC curve

Confusion Matrix
True Positive (TP):-  
The number of positive labels correctly predicted by trained models.  
This means the number of Class-1 samples correctly predicted as Class-1.

True Negative (TN):-
The number of negative labels correctly predicted by trained models.  
This means the number of Class-0 samples correctly predicted as Class-0.

False Positive (FP):-  
The number of samples incorrectly predicted as positive by trained models. 
This means the number of Class-0 samples incorrectly predicted as Class-1.

False Negative (FN):-  
The number of samples incorrectly predicted as negative by trained models. 
This means the number of Class-1 samples incorrectly predicted as Class-0.
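As a worked example, the sketch below recomputes precision, recall and f1-score for the fraud class from the decision tree confusion matrix printed earlier. It relies on the layout used by scikit-learn's confusion_matrix, where rows are actual classes and columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]].

```python
import numpy as np

# Decision tree confusion matrix from the output above: [[TN, FP], [FN, TP]].
cm = np.array([[85266, 30],
               [37, 110]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of all predicted frauds, how many were fraud
recall = tp / (tp + fn)     # of all actual frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The rounded values agree with the class 1 row of the classification report (0.79, 0.75, 0.77).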

#### Area Under ROC curve is another evaluation metric for classification problems. 
It is well suited to skewed datasets.
It measures class separability, that is, how well the model distinguishes between the target classes. 

A more effective model has a higher Area Under the ROC curve value. 

Good models have an AUC value near 1. A model that guesses at random scores about 0.5, and a value near 0 means the predictions are systematically inverted.
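The range of AUC values can be checked on a tiny hand-made example. The scores below are invented for illustration: one set ranks every positive above every negative, one ranks them in exactly the wrong order, and one gives every sample the same score.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

# Every positive scored above every negative.
perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])
# Every positive scored below every negative (inverted ranking).
inverted = roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1])
# Identical scores carry no ranking information.
uninformative = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])

print(perfect, inverted, uninformative)
```

An AUC near 0 is therefore not a useless model but one whose ranking is reversed, while 0.5 is the level of a model with no discriminating power.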

In [23]:
class_val = fd['Class'].value_counts()
print(f"Number of samples for each class :- \n {class_val}")
non_fraud = class_val[0]
fraud = class_val[1]
print(f"Non Fraudulent Numbers :- {non_fraud}")
print(f"Fraudulent Numbers :- {fraud}")
# Balance the two classes: keep every fraud row and an equally sized random sample of non-fraud rows
# indices of the non-fraudulent rows
nonfraud_indexies = fd[fd.Class == 0].index
# indices of the fraudulent rows
fraud_indices = np.array(fd[fd['Class'] == 1].index)
# draw, without replacement, as many non-fraudulent indices as there are fraudulent rows
# (no random seed is set, so the sample changes from run to run)
random_normal_indexies = np.random.choice(nonfraud_indexies, fraud, replace=False)
random_normal_indexies = np.array(random_normal_indexies)

Number of samples for each class :- 
 0    284315
1       492
Name: Class, dtype: int64
Non Fraudulent Numbers :- 284315
Fraudulent Numbers :- 492
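The cell above draws the non-fraud sample with np.random.choice and no seed, so the selected rows change on every run. A reproducible variant might look like the sketch below. It uses a hypothetical toy DataFrame and a seeded NumPy Generator, and selects rows with .loc because the sampled values are index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 5 fraud rows and 95 non-fraud rows.
toy = pd.DataFrame({
    "Amount": np.arange(100, dtype=float),
    "Class": [1] * 5 + [0] * 95,
})

rng = np.random.default_rng(42)  # fixed seed makes the sample repeatable

fraud_idx = toy.index[toy["Class"] == 1].to_numpy()
legit_idx = toy.index[toy["Class"] == 0].to_numpy()

# Draw as many non-fraud labels as there are fraud rows, without replacement.
sampled_legit_idx = rng.choice(legit_idx, size=len(fraud_idx), replace=False)

balanced = toy.loc[np.concatenate([fraud_idx, sampled_legit_idx])]
print(balanced["Class"].value_counts())
```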


### Undersampling techniques

In [24]:
# concatenate both indices of fraud and non fraud
under_sample_indices = np.concatenate([fraud_indices, random_normal_indexies])

#extract all features from whole data for under sample indices only
under_sample_data = fd.iloc[under_sample_indices, :]

# now we have to divide under sampling data to all features & target
x_undersample_data = under_sample_data.drop(['Class'], axis=1)
y_undersample_data = under_sample_data[['Class']]
# now split dataset to train and test datasets as before
X_train_sample, X_test_sample, y_train_sample, y_test_sample = train_test_split(
x_undersample_data, y_undersample_data, test_size=0.2, random_state=0)

In [26]:
#DecisionTreeClassifier after applying undersampling technique
def decision_tree_classification(X_train, y_train, X_test, y_test):
 # initialize object for DecisionTreeClassifier class
 dt_classifier = DecisionTreeClassifier()
 # train model by using fit method
 print("Model training start........")
 dt_classifier.fit(X_train, y_train.values.ravel())
 print("Model training completed")
 acc_score = dt_classifier.score(X_test, y_test)
 print(f'Accuracy of model on test dataset :- {acc_score}')
 # predict result using test dataset
 y_pred = dt_classifier.predict(X_test)
 # confusion matrix
 print(f"Confusion Matrix :- \n {confusion_matrix(y_test, y_pred)}")
 # classification report for f1-score
 print(f"Classification Report :- \n {classification_report(y_test, y_pred)}")
 print(f"AROC score :- \n {roc_auc_score(y_test, y_pred)}")

# calling decision tree classifier function 
decision_tree_classification(X_train_sample, y_train_sample, 
X_test_sample, y_test_sample)

Model training start........
Model training completed
Accuracy of model on test dataset :- 0.9035532994923858
Confusion Matrix :- 
 [[93 13]
 [ 6 85]]
Classification Report :- 
               precision    recall  f1-score   support

           0       0.94      0.88      0.91       106
           1       0.87      0.93      0.90        91

    accuracy                           0.90       197
   macro avg       0.90      0.91      0.90       197
weighted avg       0.91      0.90      0.90       197

AROC score :- 
 0.905712212315986
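The score above is computed from hard 0/1 predictions rather than predicted probabilities. In that case the ROC curve has a single operating point and the area reduces to the mean of the true positive rate and the true negative rate, i.e. the balanced accuracy. The sketch below rebuilds label arrays from the confusion matrix printed above to check this.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Counts from the confusion matrix above: [[TN, FP], [FN, TP]] = [[93, 13], [6, 85]].
tn, fp, fn, tp = 93, 13, 6, 85

# Rebuild label arrays that produce exactly this confusion matrix.
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

auc_from_labels = roc_auc_score(y_true, y_pred)
balanced_accuracy = 0.5 * (tp / (tp + fn) + tn / (tn + fp))

print(auc_from_labels, balanced_accuracy)
```

Passing predict_proba(X_test)[:, 1] instead of y_pred would score the ranking of all test samples and usually gives a different, more informative value.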


In [27]:
# RandomForestClassifier after applying the undersampling technique

def random_forest_classifier(X_train, y_train, X_test, y_test):
 # initialize object for RandomForestClassifier class
 rf_classifier = RandomForestClassifier(n_estimators=50)
 # train model by using fit method
 print("Model training start........")
 rf_classifier.fit(X_train, y_train.values.ravel())
 acc_score = rf_classifier.score(X_test, y_test)
 print(f'Accuracy of model on test dataset :- {acc_score}')
 # predict result using test dataset
 y_pred = rf_classifier.predict(X_test)
 # confusion matrix
 print(f"Confusion Matrix :- \n {confusion_matrix(y_test, y_pred)}")
 # classification report for f1-score
 print(f"Classification Report :- \n {classification_report(y_test, y_pred)}")
 # area under roc curve
 print(f"AROC score :- \n {roc_auc_score(y_test, y_pred)}")

random_forest_classifier(X_train_sample, y_train_sample, X_test_sample, y_test_sample)

Model training start........
Accuracy of model on test dataset :- 0.9441624365482234
Confusion Matrix :- 
 [[101   5]
 [  6  85]]
Classification Report :- 
               precision    recall  f1-score   support

           0       0.94      0.95      0.95       106
           1       0.94      0.93      0.94        91

    accuracy                           0.94       197
   macro avg       0.94      0.94      0.94       197
weighted avg       0.94      0.94      0.94       197

AROC score :- 
 0.9434480613725896
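The decision tree and random forest functions in this notebook differ only in the estimator they create. One possible refactoring, sketched below on synthetic data, is a single helper that accepts any scikit-learn classifier. The name evaluate_classifier is hypothetical, and unlike the cells above it computes the AUC from predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_classifier(model, X_train, y_train, X_test, y_test):
    """Fit the model and return the metrics printed elsewhere in this notebook."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Probability of class 1, used to rank samples for the ROC curve.
    y_score = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": model.score(X_test, y_test),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "report": classification_report(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    }


# Synthetic stand-in data so the sketch runs without creditcard.csv.
X_demo, y_demo = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

results = evaluate_classifier(
    RandomForestClassifier(n_estimators=50, random_state=0), X_tr, y_tr, X_te, y_te
)
print(results["report"])
print(results["roc_auc"])
```

The same helper could be called with DecisionTreeClassifier() and with either the original or the resampled splits, which avoids redefining the functions in each section.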


### Oversampling techniques

In [29]:
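# concatenate both indices of fraud and non fraud
# NOTE: these are the same index arrays used in the undersampling step above, so this
# cell rebuilds the same balanced 984-row subset; no fraud rows are duplicated here
over_sample_indices = np.concatenate([fraud_indices, random_normal_indexies])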

#extract all features from whole data for over sample indices only
over_sample_data = fd.iloc[over_sample_indices, :]

# now we have to divide over sampling data to all features & target
x_oversample_data = over_sample_data.drop(['Class'], axis=1)
y_oversample_data = over_sample_data[['Class']]
# now split dataset to train and test datasets as before
X_traino_sample, X_testo_sample, y_traino_sample, y_testo_sample = train_test_split(
x_oversample_data, y_oversample_data, test_size=0.2, random_state=0)
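The cell above selects rows with the same fraud_indices and random_normal_indexies arrays as the undersampling section, so it works on the same balanced subset rather than on an enlarged minority class. For comparison, a minimal sketch of random oversampling is given below. It is an illustration on a hypothetical toy DataFrame, not the method used to produce the results in this notebook. The minority class is resampled with replacement, and only inside the training split, so that duplicated rows cannot leak into the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data: 10 fraud rows and 190 non-fraud rows.
toy = pd.DataFrame({
    "norm_amount": np.linspace(-1, 1, 200),
    "Class": [1] * 10 + [0] * 190,
})

# Split first, keeping the class proportions in both parts.
train_df, test_df = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["Class"]
)

minority = train_df[train_df["Class"] == 1]
majority = train_df[train_df["Class"] == 0]

# Sample the minority class with replacement until it matches the majority size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
train_balanced = pd.concat([majority, minority_upsampled])

print(train_balanced["Class"].value_counts())
print(test_df["Class"].value_counts())
```

The test split keeps the original class proportions, so metrics measured on it reflect the imbalance the model would face in practice.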

In [30]:
#DecisionTreeClassifier after applying oversampling technique
def decision_tree_classification(X_train, y_train, X_test, y_test):
 # initialize object for DecisionTreeClassifier class
 dt_classifier = DecisionTreeClassifier()
 # train model by using fit method
 print("Model training start........")
 dt_classifier.fit(X_train, y_train.values.ravel())
 print("Model training completed")
 acc_score = dt_classifier.score(X_test, y_test)
 print(f'Accuracy of model on test dataset :- {acc_score}')
 # predict result using test dataset
 y_pred = dt_classifier.predict(X_test)
 # confusion matrix
 print(f"Confusion Matrix :- \n {confusion_matrix(y_test, y_pred)}")
 # classification report for f1-score
 print(f"Classification Report :- \n {classification_report(y_test, y_pred)}")
 print(f"AROC score :- \n {roc_auc_score(y_test, y_pred)}")

# calling decision tree classifier function 
decision_tree_classification(X_traino_sample, y_traino_sample, 
X_testo_sample, y_testo_sample)

Model training start........
Model training completed
Accuracy of model on test dataset :- 0.8984771573604061
Confusion Matrix :- 
 [[95 11]
 [ 9 82]]
Classification Report :- 
               precision    recall  f1-score   support

           0       0.91      0.90      0.90       106
           1       0.88      0.90      0.89        91

    accuracy                           0.90       197
   macro avg       0.90      0.90      0.90       197
weighted avg       0.90      0.90      0.90       197

AROC score :- 
 0.8986626580966204


In [31]:
# RandomForestClassifier after applying the oversampling technique

def random_forest_classifier(X_train, y_train, X_test, y_test):
 # initialize object for RandomForestClassifier class
 rf_classifier = RandomForestClassifier(n_estimators=50)
 # train model by using fit method
 print("Model training start........")
 rf_classifier.fit(X_train, y_train.values.ravel())
 acc_score = rf_classifier.score(X_test, y_test)
 print(f'Accuracy of model on test dataset :- {acc_score}')
 # predict result using test dataset
 y_pred = rf_classifier.predict(X_test)
 # confusion matrix
 print(f"Confusion Matrix :- \n {confusion_matrix(y_test, y_pred)}")
 # classification report for f1-score
 print(f"Classification Report :- \n {classification_report(y_test, y_pred)}")
 # area under roc curve
 print(f"AROC score :- \n {roc_auc_score(y_test, y_pred)}")

random_forest_classifier(X_traino_sample, y_traino_sample, X_testo_sample, y_testo_sample)

Model training start........
Accuracy of model on test dataset :- 0.9593908629441624
Confusion Matrix :- 
 [[102   4]
 [  4  87]]
Classification Report :- 
               precision    recall  f1-score   support

           0       0.96      0.96      0.96       106
           1       0.96      0.96      0.96        91

    accuracy                           0.96       197
   macro avg       0.96      0.96      0.96       197
weighted avg       0.96      0.96      0.96       197

AROC score :- 
 0.9591540534936762


#### For the best models, the AROC value is near 1. Here we resampled the data to balance the two classes and so address the imbalance.

## Conclusion: The Random Forest classifier in the oversampling section gives an Area Under the ROC curve value of about 0.96. The results may improve further with more trees or with additional data preprocessing techniques.

### Source code: https://dataaspirant.com/credit-card-fraud-detection-classification-algorithms-python/