# Credit Card Fraud Detection
---

# Data Dictionary

| Column | Description |
|--------|-------------|
| Time | Seconds elapsed between each transaction and the first transaction in the dataset. |
| V1-V28 | Principal components obtained through PCA on the original features. Due to confidentiality, the exact nature of these components is not disclosed. |
| Amount | The transaction amount. Can be used for cost-sensitive analysis or to identify patterns related to transaction amounts. |
| Class | The response variable indicating the legitimacy of the transaction (1 for fraud, 0 otherwise). |

## Dataset Characteristics

- Contains transactions made by credit cards in September 2013 by European cardholders.
- Comprises 284,807 transactions over two days, with 492 fraudulent cases.
- The dataset is highly imbalanced, with fraudulent transactions accounting for only 0.172% of all transactions.
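
As a quick sanity check on the figures above, the imbalance can be recomputed from the two counts. This is a minimal sketch that uses only the numbers quoted in the list; the variable names are illustrative.

```python
# Counts quoted in the dataset description above.
n_total = 284_807
n_fraud = 492

# Share of fraudulent transactions and the number of legitimate
# transactions per fraudulent one.
fraud_rate = n_fraud / n_total
legit_per_fraud = (n_total - n_fraud) / n_fraud

print(f"fraud share: {fraud_rate:.4%}")
print(f"legitimate transactions per fraud: {legit_per_fraud:.1f}")
```

Roughly one transaction in 580 is fraudulent, which is why a model can look accurate while missing most fraud.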

## Evaluation Recommendations

- Given the class imbalance, plain accuracy is not meaningful: a model that labels every transaction as legitimate would still be about 99.83% accurate.
- It's recommended to use metrics like the Area Under the Precision-Recall Curve (AUPRC).
- Focus on precision and recall to evaluate the performance of fraud detection models.
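
The recommendations above can be illustrated on synthetic data. The sketch below assumes a made-up label vector with 0.2% positives and two hypothetical scorers (a constant one and a noisy but informative one); it uses scikit-learn's `average_precision_score`, which is one common way to summarize the precision-recall curve, not necessarily the exact AUPRC estimator the dataset authors had in mind.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

rng = np.random.default_rng(0)

# Synthetic labels (assumption): 20 positives among 10,000 samples.
y_true = np.zeros(10_000, dtype=int)
y_true[:20] = 1

# Trivial scorer: gives every transaction the same score, so it never flags fraud.
scores_trivial = np.zeros(10_000)

# Hypothetical informative scorer: positives get scores shifted upward on average.
scores_model = rng.normal(loc=y_true * 2.0, scale=1.0)

# Accuracy of the trivial scorer at a 0.5 threshold looks excellent...
acc_trivial = accuracy_score(y_true, (scores_trivial > 0.5).astype(int))

# ...but its average precision collapses to the positive-class prevalence.
ap_trivial = average_precision_score(y_true, scores_trivial)
ap_model = average_precision_score(y_true, scores_model)

print(f"trivial accuracy:          {acc_trivial:.4f}")
print(f"trivial average precision: {ap_trivial:.4f}")
print(f"model average precision:   {ap_model:.4f}")
```

The useful baseline for average precision is the fraud prevalence, not 0.5, so a score should be read relative to that floor.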

## Implications for Analysis

### Data Preprocessing
- Consider techniques to handle the imbalanced dataset, such as resampling methods.
- Feature scaling may be necessary for the Amount and Time variables if algorithms used are sensitive to feature scales.
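
One way to combine the two preprocessing points above is sketched below. It assumes a small synthetic frame with the same column names as `creditcard.csv` (the values are invented), uses `RobustScaler` as one reasonable choice for the skewed `Amount` column, and uses plain random undersampling as a stand-in for whichever resampling method is eventually chosen. Both steps are fitted on the training split only so that no information from the test split leaks into training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for creditcard.csv (hypothetical values, real column names).
rng = np.random.default_rng(42)
n = 5_000
demo = pd.DataFrame({
    "Time": np.sort(rng.uniform(0, 172_800, size=n)),
    "V1": rng.normal(size=n),
    "Amount": rng.lognormal(mean=3.0, sigma=1.5, size=n),
    "Class": (rng.random(n) < 0.02).astype(int),
})

X = demo.drop(columns="Class")
y = demo["Class"]

# Stratify so both splits keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train = X_train.copy()
X_test = X_test.copy()

# Fit the scaler on the training split, then apply it to both splits.
cols = ["Time", "Amount"]
scaler = RobustScaler()
X_train[cols] = scaler.fit_transform(X_train[cols])
X_test[cols] = scaler.transform(X_test[cols])

# Random undersampling of the majority class, training split only.
fraud_idx = y_train[y_train == 1].index
legit_idx = y_train[y_train == 0].sample(n=len(fraud_idx), random_state=42).index
balanced_idx = fraud_idx.union(legit_idx)
X_train_bal = X_train.loc[balanced_idx]
y_train_bal = y_train.loc[balanced_idx]

print(y_train.value_counts().to_dict())
print(y_train_bal.value_counts().to_dict())
```

The test split keeps its original imbalance on purpose, because evaluation should reflect the class ratio the model will face in practice.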

### Modeling Approaches
- Algorithms that perform well with imbalanced data, such as anomaly detection models, might be appropriate.
- Ensemble methods like Random Forests or Gradient Boosting Machines can also be effective.
- Incorporate cross-validation to ensure the robustness of your model.
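
The cross-validation point above can be sketched as follows. The example assumes synthetic data from `make_classification` with about 2% positives and uses a class-weighted Random Forest as one of the ensemble options mentioned; the hyperparameters are illustrative, not tuned. Stratified folds keep the rare class present in every fold, and scoring uses average precision instead of accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (assumption): roughly 2% positives.
X, y = make_classification(
    n_samples=4_000,
    n_features=10,
    n_informative=5,
    weights=[0.98, 0.02],
    random_state=0,
)

# class_weight="balanced" reweights the rare class instead of resampling it.
model = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
)

# Stratified folds preserve the class ratio in each train/validation split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="average_precision")

print("per-fold average precision:", np.round(scores, 3))
print("mean:", round(scores.mean(), 3), "prevalence baseline:", round(y.mean(), 3))
```

If a resampler such as SMOTE is used instead of class weights, it should be applied inside each training fold only (for example through an `imblearn` pipeline), never to the validation fold.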

In [5]:
import sys
print(sys.version)

3.10.15 (main, Sep  7 2024, 00:20:06) [Clang 16.0.0 (clang-1600.0.26.3)]


### We are using Python 3.10.15

In [6]:
# Install required libraries with specific versions (uncomment to run)
# %pip install numpy==1.26.4
# %pip install pandas==2.1.4
# %pip install matplotlib==3.7.1
# %pip install seaborn==0.13.1
# %pip install scikit-learn==1.3.2
# %pip install imbalanced-learn==0.12.3
# %pip install torch==2.4.1

In [8]:
# List versions of installed libraries
import numpy as np
import pandas as pd
import matplotlib
import seaborn as sns
import sklearn
import imblearn

print(f"numpy version: {np.__version__}")
print(f"pandas version: {pd.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")
print(f"seaborn version: {sns.__version__}")
print(f"scikit-learn version: {sklearn.__version__}")
print(f"imbalanced-learn version: {imblearn.__version__}")

numpy version: 1.26.4
pandas version: 2.1.4
matplotlib version: 3.7.1
seaborn version: 0.13.1
scikit-learn version: 1.3.2
imbalanced-learn version: 0.12.3


In [9]:
# Imported libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD
import matplotlib.patches as mpatches
import time

# Classifier Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import collections


# Other Libraries
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report
from collections import Counter
from sklearn.model_selection import KFold, StratifiedKFold
import warnings
warnings.filterwarnings("ignore")


In [10]:
df = pd.read_csv('creditcard.csv')

In [11]:
df.columns

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

In [12]:
df

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


In [13]:
# Check for missing values: the largest per-column null count is 0, so there are none.
df.isnull().sum().max()

0

In [15]:
# The classes are heavily skewed; we need to address this later.
print('No Frauds', round(df['Class'].value_counts()[0]/len(df) * 100,2), '% of the dataset')
print('Frauds', round(df['Class'].value_counts()[1]/len(df) * 100,2), '% of the dataset')

No Frauds 99.83 % of the dataset
Frauds 0.17 % of the dataset


**Note**: Notice how imbalanced the original dataset is: 99.83% of the transactions are non-fraud. If we train predictive models directly on this dataframe, they can score well simply by predicting "not fraud" for almost every transaction, learning the class ratio rather than the patterns that signal fraud. We don't want the model to assume; we want it to detect those patterns.

In [16]:
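colors = ["#0101DF", "#DF0101"]

# seaborn >= 0.12 takes `data` as the first positional argument, so the old call
# sns.countplot('Class', data=df, ...) raised
# "TypeError: countplot() got multiple values for argument 'data'".
# Pass the column as x= (and as hue= so the palette maps to the two classes).
sns.countplot(x='Class', hue='Class', data=df, palette=colors, legend=False)
plt.title('Class Distributions \n (0: No Fraud || 1: Fraud)', fontsize=14)

fig, ax = plt.subplots(1, 2, figsize=(18,4))

amount_val = df['Amount'].values
time_val = df['Time'].values

# histplot replaces the deprecated distplot; kde=True keeps the density curve
# and bins=50 matches distplot's upper limit on the number of bins.
sns.histplot(amount_val, ax=ax[0], color='r', kde=True, stat='density', bins=50)
ax[0].set_title('Distribution of Transaction Amount', fontsize=14)
ax[0].set_xlim([min(amount_val), max(amount_val)])

sns.histplot(time_val, ax=ax[1], color='b', kde=True, stat='density', bins=50)
ax[1].set_title('Distribution of Transaction Time', fontsize=14)
ax[1].set_xlim([min(time_val), max(time_val)])

plt.show()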