# Credit Card Fraud Detection: A CRISP-DM Approach

### Business Understanding

Credit card fraud detection is a classic class-imbalance problem: for any bank, fraudulent transactions are far fewer than legitimate ones. Models trained directly on such imbalanced data tend to be biased towards the majority class of legitimate transactions and often generalize poorly to new, real-time data. The problem can therefore also be viewed as anomaly detection. The analysis is guided by the following questions:

1. At what times do credit card frauds usually take place?
2. What are the general trends in the amounts of fraudulent transactions?
3. How do we balance the data so that the model does not overfit on legitimate transactions?


In [2]:
# Importing Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

from mlxtend.plotting import plot_learning_curves
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, matthews_corrcoef

import warnings
warnings.filterwarnings("ignore")

The __SGD classifier (Stochastic Gradient Descent)__ is a linear classifier used for classification problems, including binary classification. It is popular for large-scale machine learning tasks because it handles large datasets efficiently. It works by updating the model's weights incrementally, one training example (or small batch) at a time, rather than computing the gradient over the entire dataset. This also allows online learning: the model can be updated as new data arrives, without waiting for the entire dataset to be processed.

The SGD classifier is flexible in terms of the loss function it can optimize. It can optimize loss functions such as hinge loss, logistic loss, and squared loss, making it suitable for a range of tasks. In terms of hyperparameters, the SGD classifier has several parameters that can be tuned to optimize performance, such as the learning rate, the regularization parameter, and the type of penalty used for regularization. The optimal combination of hyperparameters can be found using methods such as grid search or randomized search.
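
As a rough illustration of the per-example update described above, the sketch below applies one SGD step for either hinge or logistic loss with an L2 penalty. The function name `sgd_step` and its parameters (`lr`, `alpha`) are illustrative assumptions rather than scikit-learn's API, and labels are assumed to be encoded as -1/+1.

```python
import math

def _sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sgd_step(w, b, x, y, lr=0.01, alpha=0.0001, loss="hinge"):
    """One SGD update on a single example (x, y), with y in {-1, +1}.

    Illustrative only: names and defaults are assumptions, not a library API.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    margin = y * score
    if loss == "hinge":
        # Derivative of max(0, 1 - y*score) with respect to the score
        g = -y if margin < 1 else 0.0
    elif loss == "log":
        # Derivative of log(1 + exp(-y*score)) with respect to the score
        g = -y * _sigmoid(-margin)
    else:
        raise ValueError(f"unknown loss: {loss}")
    # The L2 penalty (alpha / 2) * ||w||^2 adds alpha * w to the gradient
    new_w = [wi - lr * (g * xi + alpha * wi) for wi, xi in zip(w, x)]
    new_b = b - lr * g
    return new_w, new_b

# Example: one hinge step from zero weights on a positive example
w, b = sgd_step([0.0, 0.0], 0.0, [1.0, 2.0], 1, lr=0.1, alpha=0.0)
```

Switching `loss` changes only the derivative term `g`; the rest of the update is shared, which is why one optimizer can serve several loss functions.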

In credit card fraud detection, the SGD classifier can be used to build a model that learns to distinguish between fraudulent and legitimate transactions based on features such as transaction amount, location, time of day, and so on. The classifier can then be used to predict the probability of fraud for each transaction and flag those transactions that are most likely to be fraudulent.

## Self-built SGD Classifier

The grid-searched SGDClassifier (grid_sgd) is evaluated with the evaluation() function, which takes the test set (y_test, X_test) and the trained model. The classification report shows the precision, recall, and F1-score for each class (0 and 1), as well as the accuracy and the macro- and weighted-average scores; the AUC-ROC score and F1-score are printed separately. The precision for class 0 is 1.00 (after rounding), meaning that virtually all predicted negatives are actually negative, while the precision for class 1 is 0.14: only a small share of the transactions predicted as fraud are actually fraudulent. The recall is 0.99 for class 0 and 0.91 for class 1, so the model correctly classifies almost all legitimate transactions and identifies a large proportion of the fraudulent ones. The F1-score is 1.00 for class 0 but only 0.25 for class 1, reflecting the low precision on the positive class. The accuracy is 0.990 and the AUC-ROC score is 0.948. Overall, the model catches most fraud cases but raises many false alarms.

In [80]:
import numpy as np

class SGDClassifier:
    """Linear classifier trained with per-example hinge-style updates.

    Note: this name shadows sklearn.linear_model.SGDClassifier imported earlier.
    """

    def __init__(self, alpha=0.0001, max_iter=1000, tol=1e-3):
        self.alpha = alpha        # learning rate (step size)
        self.max_iter = max_iter  # maximum number of passes over the data
        self.tol = tol            # stored but not used by fit()

    def fit(self, X, y):
        # The margin test below assumes labels in {-1, +1}; with 0/1 labels,
        # class-0 examples contribute nothing beyond the weight decay
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)  # weight vector
        self.b = 0.0                   # bias
        reg = 2 * (1 / self.max_iter)  # weight-decay factor used in every update
        errors = []                    # margin violations per pass
        print("Training started...")
        for epoch in range(self.max_iter):
            epoch_errors = 0
            for i in range(n_samples):
                xi = X[i]
                yi = y[i]
                if yi * (np.dot(xi, self.w) + self.b) <= 1:
                    # Margin violated: step towards the example, plus weight decay
                    self.w = self.w + self.alpha * ((yi * xi) - reg * self.w)
                    self.b = self.b + self.alpha * (yi - reg * self.b)
                    epoch_errors += 1
                else:
                    # Margin satisfied: apply weight decay only
                    self.w = self.w + self.alpha * (-reg * self.w)
                    self.b = self.b + self.alpha * (-reg * self.b)
            errors.append(epoch_errors)
            # Stop early once no example violates the margin
            if epoch_errors == 0:
                break
        print(f"Training Complete. Final Weight Vector: {self.w}, Final Bias: {self.b}")
        return self

    def predict(self, X):
        # Sign of the decision score: +1, -1, or 0 when the score is exactly zero
        return np.sign(np.dot(X, self.w) + self.b)
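
The update rule in `fit` multiplies by `yi`, which only separates the two classes when labels are encoded as -1/+1, whereas the dataset's `Class` column uses 0/1. Below is a minimal sketch of one way to bridge the two encodings, assuming 1 marks fraud. The helper names `to_signed` and `to_binary` are hypothetical and are not part of the class above.

```python
import numpy as np

def to_signed(y):
    """Map 0/1 class labels to -1/+1, as a hinge-style update expects."""
    y = np.asarray(y)
    return np.where(y == 1, 1, -1)

def to_binary(scores):
    """Map raw decision scores back to 0/1 labels (score > 0 -> class 1)."""
    return (np.asarray(scores) > 0).astype(int)
```

One could, for example, train with `clf.fit(X, to_signed(y))` and compare `to_binary(np.dot(X_test, clf.w) + clf.b)` against the 0/1 test labels.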

### Data Understanding and Data Preparation
We used the Kaggle Credit Card Fraud Detection dataset: <a href="https://www.kaggle.com/mlg-ulb/creditcardfraud">Link</a>

Since the dataset is imbalanced, the SMOTE technique is used to balance the training data.
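
SMOTE balances the classes by creating synthetic minority samples that lie between existing minority samples and their nearest minority neighbours. The following is a simplified sketch of that interpolation idea in plain Python; it is not imblearn's implementation, and the function name `smote_like_sample` and the toy points are made up for illustration.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and one of its
    k nearest minority neighbours (illustrative, not imblearn's code)."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to `base`
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbour = rng.choice(others[:k])
    # Interpolate at a random position on the segment base -> neighbour
    t = rng.random()
    return [a + t * (b - a) for a, b in zip(base, neighbour)]

# Toy minority class with two features
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
synthetic = smote_like_sample(minority, k=2, rng=random.Random(1))
```

Because each synthetic point is a convex combination of two real minority points, it always stays within the range spanned by the minority class.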

In [3]:
# Read Data into a Dataframe
df = pd.read_csv('creditcard.csv')
df1 = pd.read_csv('creditcard.csv',header=None)

In [4]:
df1=df1.drop(0)
df1 = df1.reset_index(drop=True)
df1=df1.astype(float)

In [5]:
# Describe Data
df1.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,21,22,23,24,25,26,27,28,29,30
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.16598e-15,3.416908e-16,-1.37315e-15,2.086869e-15,9.604066e-16,1.490107e-15,-5.556467e-16,1.177556e-16,-2.406455e-15,...,1.656562e-16,-3.44485e-16,2.578648e-16,4.471968e-15,5.340915e-16,1.685502e-15,-3.662461e-16,-1.220404e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


In [6]:
df.columns

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

In [7]:
df.isna().sum()
df1.isna().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
20    0
21    0
22    0
23    0
24    0
25    0
26    0
27    0
28    0
29    0
30    0
dtype: int64

In [8]:
# Create Train and Test Data in ratio 70:30
X = df.drop(labels='Class', axis=1) # Features
y = df.loc[:,'Class']               # Target Variable


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

In [9]:
# Create Train and Test Data in ratio 70:30
X1 = df1.drop(df1.columns[-1], axis=1)       # Features
y1=df1[df1.columns[-1]]                      # Target Variable
y1=y1.astype(int)
X_train_sgd, X_test_sgd, y_train_sgd, y_test_sgd = train_test_split(X1, y1, test_size=0.3, random_state=1, stratify=y1)

In [10]:
# Use Synthetic Minority Oversampling
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

In [67]:
# Use Synthetic Minority Oversampling
sm1 = SMOTE(random_state=42)
X_res_knn, y_res_knn = sm1.fit_resample(X_train_knn, y_train_knn)

### Evaluation

We use the AUC-ROC score, classification report, accuracy, and F1-score to evaluate the performance of the classifiers.

The code defines a class named SGDClassifier that implements a binary linear classifier trained with stochastic gradient descent. The constructor takes three parameters: alpha (learning rate), max_iter (maximum number of passes over the training data) and tol (a tolerance for the stopping criterion, which is stored but not used by the current implementation).

The fit method takes two arguments: X (training data of shape (n_samples, n_features)) and y (target variable of shape (n_samples,)). It initializes the weight vector and bias to zero and updates them one training example at a time: examples that violate the margin move the weights towards the correct side, and every update also applies a small weight decay. The method prints a message when training starts and the final weight vector and bias when it finishes; the per-iteration print statements are commented out. It stops when an entire pass produces no margin violations or when the maximum number of iterations is reached, and returns the object itself.

The predict method takes one argument X (input data of shape (n_samples, n_features)) and returns the sign of the decision score computed from the learned weight vector and bias, that is, -1 or 1 (or 0 when the score is exactly zero).

Overall, the class can be used to fit a binary classifier with stochastic gradient descent and make predictions on new data.

In [81]:
# create an instance of SGDClassifier
clf = SGDClassifier(alpha=0.001, max_iter=100)

# fit the model to your training data
clf.fit(X_res_knn.values, y_res_knn.values)

# make predictions on your test data
y_pred = clf.predict(X_test)

Training started...
Training Complete. Final Weight Vector: [ 2.45559249e-01 -1.42132492e-03  8.32044712e-04 -7.83547340e-04
  2.26649875e-03 -1.89493166e-04 -8.24680801e-04 -1.32428419e-03
  7.33433053e-04 -1.48596822e-03 -1.52358653e-03  1.67092964e-03
 -1.56970558e-03 -2.72103114e-04 -2.38263451e-03  3.35120045e-04
 -5.61938047e-04 -1.46183329e-03  1.01904849e-04  2.39249452e-04
  2.02322119e-04  3.16713957e-04  9.39075778e-06 -1.58248846e-04
  1.50881224e-04  4.15742428e-05  8.49447617e-05  1.22209647e-04
 -7.36605060e-05  3.39794849e-02], Final Bias: 0.0005943838489211871


The code creates an instance of the SGDClassifier class with alpha=0.001 and max_iter=100. The fit method is then called on the classifier object, passing in the training data X_res_knn.values and y_res_knn.values. This trains the model on the given data. The predict method is then called on the trained classifier object, passing in the test data X_test. The output of this prediction is assigned to the variable y_pred, which contains the predicted labels for the test data.

In [82]:
print(classification_report(y_test_knn, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00     85295
           1       0.00      1.00      0.00       148

    accuracy                           0.00     85443
   macro avg       0.00      0.50      0.00     85443
weighted avg       0.00      0.00      0.00     85443



The classification_report function from the sklearn.metrics module computes and returns a text report showing the main classification metrics on a per-class basis. It takes two arguments:
- the true labels of the test set (y_test_knn)
- the predicted labels of the test set (y_pred)

It returns a string that contains four main metrics: precision, recall, f1-score, and support. These metrics are computed for each class (in binary classification there are only two classes: 0 and 1). The metrics are defined as follows:

__Precision:__ the proportion of true positives among the total number of positive predictions (i.e., the ability of the model not to label a negative sample as positive)

__Recall:__ the proportion of true positives among the total number of actual positives (i.e., the ability of the model to find all the positive samples)

__F1-score:__ the harmonic mean of precision and recall

__Support:__ the number of samples in each class
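
In terms of confusion-matrix counts (true positives $TP$, false positives $FP$, false negatives $FN$; this notation is introduced here for convenience), the first three metrics can be written as:

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
```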

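Looking at the report, the model is not performing well at all: it predicts every sample as class 1. Recall is therefore 1.00 for class 1 and 0.00 for class 0, precision for class 1 is about 0.002 (shown as 0.00), and accuracy equals the fraud rate in the test set. A likely cause is a label-encoding mismatch: the update rule in fit multiplies by the label and so expects labels of -1 and +1, whereas the data uses 0 and 1, so legitimate transactions never push the decision boundary in their direction. The unscaled Time and Amount features, which receive the largest weights, probably contribute as well.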

In [83]:
print('AUC-ROC')
print(roc_auc_score(y_test_knn, y_pred))
      
print('F1-Score')
print(f1_score(y_test_knn, y_pred))
    
print('Accuracy')
print(accuracy_score(y_test_knn, y_pred))

AUC-ROC
0.5
F1-Score
0.0034583075323340075
Accuracy
0.0017321489179921118


These numbers are consistent with the classification report. Because every sample is predicted as class 1, recall is 1.00 for class 1 and 0.00 for class 0. An AUC-ROC score of 0.5 means the predictions discriminate between the classes no better than random guessing, and the F1-score and accuracy are close to zero because almost all test samples are legitimate transactions that were labelled as fraud. Tuning the hyperparameters alone may not be enough to fix this; it may be necessary to re-examine how the data is prepared or to consider alternative modeling approaches. We nevertheless tried changing the learning rate, the number of iterations, and the regularization strength of the SGDClassifier to see whether the performance improves.
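
The reported numbers match what a classifier that assigns every test transaction to class 1 would produce. As a quick arithmetic check, the sketch below recomputes the metrics using only the support counts from the report (85,295 legitimate and 148 fraudulent transactions); the variable names are chosen here for illustration.

```python
# Test-set class counts taken from the classification report above
n_legit, n_fraud = 85295, 148

# A classifier that labels every transaction as fraud (class 1)
tp, fp, fn, tn = n_fraud, n_legit, 0, 0

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = (tp + tn) / (n_legit + n_fraud)
# With hard labels, ROC AUC reduces to the mean of the two per-class recalls
auc = 0.5 * (recall + tn / (tn + fp))

print(f"precision={precision:.6f} recall={recall:.2f}")
print(f"f1={f1:.6f} accuracy={accuracy:.6f} auc={auc:.2f}")
```

The F1-score, accuracy, and AUC computed this way agree with the values printed by the cell above.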

In [74]:
# Evaluation of Classifiers
def grid_eval(grid_clf):
    """
        Method to Compute the best score and parameters computed by grid search
        Parameter:
            grid_clf: The Grid Search Classifier 
    """
    print("Best Score", grid_clf.best_score_)
    print("Best Parameter", grid_clf.best_params_)
    
def evaluation(y_test, grid_clf, X_test):
    """
        Method to compute the following:
            1. Classification Report
            2. F1-score
            3. AUC-ROC score
            4. Accuracy
        Parameters:
            y_test: The target variable test set
            grid_clf: Grid classifier selected
            X_test: Input Feature Test Set
    """
    y_pred = grid_clf.predict(X_test)
    print('CLASSIFICATION REPORT')
    print(classification_report(y_test, y_pred))
    
    print('AUC-ROC')
    print(roc_auc_score(y_test, y_pred))
      
    print('F1-Score')
    print(f1_score(y_test, y_pred))
    
    print('Accuracy')
    print(accuracy_score(y_test, y_pred))
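
`evaluation` passes hard 0/1 predictions to `roc_auc_score`. With hard labels the ROC curve has a single operating point, and the AUC reduces to the mean of the two per-class recalls; ranking by a continuous score generally gives a different and more informative value. The sketch below illustrates the difference on made-up data with a small pure-Python helper (`pairwise_auc` is a hypothetical name, not a scikit-learn function).

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half. O(n_pos * n_neg), so only for small examples."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.6, 0.5, 0.9]        # hypothetical continuous scores
labels = [int(s > 0.55) for s in scores]  # thresholded 0/1 predictions

auc_scores = pairwise_auc(y_true, scores)  # uses the full ranking
auc_labels = pairwise_auc(y_true, labels)  # uses only the hard labels

# Mean of per-class recalls for the thresholded predictions
tpr = sum(1 for y, l in zip(y_true, labels) if y == 1 and l == 1) / 2
tnr = sum(1 for y, l in zip(y_true, labels) if y == 0 and l == 0) / 3
balanced_accuracy = 0.5 * (tpr + tnr)
```

In scikit-learn, continuous scores would typically come from `decision_function` or `predict_proba`, where the fitted estimator provides them.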

In [77]:
# Each classifier has its own hyperparameters, so a separate pipeline and
# parameter grid is defined per classifier; this does not violate the DRY principle
# Note: loss='log' was renamed to 'log_loss' in scikit-learn 1.1 and removed in 1.3
param_grid_sgd = [{
    'model__loss': ['log'],
    'model__penalty': ['l1', 'l2'],
    'model__alpha': np.logspace(start=-3, stop=3, num=20)
}, {
    'model__loss': ['hinge'],
    'model__alpha': np.logspace(start=-3, stop=3, num=20),
    'model__class_weight': [None, 'balanced']
}]

pipeline_sgd = Pipeline([
    ('scaler', StandardScaler(copy=False)),
    ('model', SGDClassifier(max_iter=1000, tol=1e-3, random_state=1, warm_start=True))
])

MCC_scorer = make_scorer(matthews_corrcoef)
grid_sgd = GridSearchCV(estimator=pipeline_sgd, param_grid=param_grid_sgd, scoring=MCC_scorer, n_jobs=-1, pre_dispatch='2*n_jobs', cv=5, verbose=1, return_train_score=False)


grid_sgd.fit(X_res, y_res)

Fitting 5 folds for each of 80 candidates, totalling 400 fits
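
The grid search above ranks candidates by the Matthews correlation coefficient (MCC), which uses all four cells of the confusion matrix and is therefore harder to inflate with a dominant class than accuracy. A minimal sketch of the formula follows; the helper name `mcc_from_counts` is hypothetical.

```python
import math

def mcc_from_counts(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention for
    degenerate cases such as a constant prediction.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

perfect = mcc_from_counts(tp=5, fp=0, fn=0, tn=5)            # every prediction correct
inverted = mcc_from_counts(tp=0, fp=5, fn=5, tn=0)           # every prediction wrong
all_positive = mcc_from_counts(tp=148, fp=85295, fn=0, tn=0) # constant "fraud" prediction
```

The MCC ranges from -1 to 1, and a classifier that predicts a single class for every sample scores 0 regardless of how imbalanced the data is.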




## Sklearn Model

In [84]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

# create an instance of SGDClassifier
clf = SGDClassifier()

# define the parameter grid to search
param_grid = {
    'alpha': [0.001, 0.01, 0.1],
    'max_iter': [100, 500, 1000],
    'penalty': ['l1', 'l2', 'elasticnet']
}

# create a grid search object
grid_search = GridSearchCV(clf, param_grid, cv=5)

# fit the grid search to your training data
grid_search.fit(X_res_knn, y_res_knn)

# get the best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

Best parameters: {'alpha': 0.1, 'max_iter': 500, 'penalty': 'l1'}
Best score: 0.9518666465681841


__Grid search__ is a technique for tuning hyperparameters by searching over a specified range of hyperparameters and evaluating the model performance for each combination of hyperparameters.

The code above will perform a grid search over a range of hyperparameters and return the best set of parameters and the corresponding score. We can then use these parameters to train the classifier and evaluate its performance on the test set. 
- First, the code imports the GridSearchCV class from the sklearn.model_selection module and the SGDClassifier class from the sklearn.linear_model module. It then creates an instance of the SGDClassifier class.

- Next, the code defines a dictionary param_grid that specifies the hyperparameters to search over and their corresponding values. In this case, the hyperparameters being searched are alpha, max_iter, and penalty.

- After defining the parameter grid, the code creates a GridSearchCV object with the SGDClassifier model and the parameter grid. The cv parameter specifies the number of cross-validation folds to use during the search.

- The GridSearchCV object is then fit to the training data using the fit method. This will perform a search over all possible combinations of hyperparameters specified in the parameter grid, and evaluate the performance of the model for each combination using cross-validation.

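- Finally, the code prints the best parameters found by the grid search and the corresponding score. Because no scoring argument is passed, this score is the mean cross-validated accuracy of the best parameter combination. The parameters can then be used as the hyperparameters for the SGDClassifier model.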
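
The steps above can be sketched without scikit-learn to show how many candidates the search visits. This is only an illustration of the enumeration, using the same grid values as the cell above; the variable names are chosen here for illustration.

```python
from itertools import product

# Same grid values as in the cell above
param_grid = {
    "alpha": [0.001, 0.01, 0.1],
    "max_iter": [100, 500, 1000],
    "penalty": ["l1", "l2", "elasticnet"],
}

# Cartesian product of all value lists, one dict per candidate
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5
total_fits = len(candidates) * n_folds  # one fit per candidate per fold
print(len(candidates), total_fits)
```

With three values for each of three hyperparameters there are 27 candidates, so 5-fold cross-validation requires 135 fits, plus one final refit of the best candidate on the full training data when `refit` is left at its default.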

In [45]:
grid_eval(grid_dt)

Best Score 0.9510174856798311
Best Parameter {'alpha': 0.01, 'max_iter': 500, 'penalty': 'l1'}


The grid_eval function prints the best cross-validation score found by the grid search and the corresponding parameters. Here the combination alpha=0.01, max_iter=500, and penalty='l1' resulted in the best score of __0.951__.

In [46]:
evaluation(y_test_knn, grid_search, X_test_knn)

CLASSIFICATION REPORT
              precision    recall  f1-score   support

           0       1.00      0.96      0.98     85295
           1       0.04      0.90      0.07       148

    accuracy                           0.96     85443
   macro avg       0.52      0.93      0.53     85443
weighted avg       1.00      0.96      0.98     85443

AUC-ROC
0.930067626979814
F1-Score
0.07459338194054964
Accuracy
0.961377760612338


### Conclusion

The precision, recall, F1-score, accuracy, and AUC-ROC score are computed by passing the test set and the fitted grid search object to the evaluation() function, which evaluates the best model selected by GridSearchCV. The model has an accuracy of 96.14%, meaning it correctly classified 96.14% of the transactions in the test set. However, the precision for the minority class (fraudulent transactions) is low at 0.04: only about 4% of the transactions flagged as fraud are actually fraudulent, so most alerts are false positives. The recall for the minority class is high at 0.90, meaning the model misses about 10% of the fraudulent transactions. Given the class imbalance, accuracy alone is a misleading summary; the minority-class F1-score of 0.07 shows that there is considerable room for improvement.

### Sources

Data - https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud