Description: Develop a machine learning model to detect fraudulent transactions in a financial dataset.

Steps:

Data Collection: Obtain historical transaction data, including features like transaction amount, timestamp, etc.

Data Preprocessing: Clean the data, handle missing values, and balance the dataset if needed.

Feature Engineering: Extract relevant features and engineer new ones that could aid in fraud detection.

Exploratory Data Analysis (EDA): Visualize patterns and anomalies in the data using graphs and statistics.

Model Selection: Choose classification algorithms like Random Forest, Support Vector Machines, or Neural Networks.

Model Training: Train the model on the preprocessed dataset.

Model Evaluation: Evaluate the model's performance using metrics like accuracy, precision, recall, and F1-score.

Tech Stack:

Python

Data manipulation libraries

Machine learning libraries

Deep learning libraries


In [52]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
# Load the CSV dataset into a pandas DataFrame
credit_card_data = pd.read_csv("/content/creditcard.csv")

In [None]:
#to obtain first five rows of the dataset
credit_card_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [None]:
#to obtain last five rows of the dataset
credit_card_data.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
81294,58872,1.104942,-0.900171,0.271719,-0.528821,-1.064148,-0.533231,-0.366603,-0.121795,-0.855554,...,-0.384664,-0.8985,0.062703,0.098835,-0.024267,0.920347,-0.062583,0.029725,133.76,0.0
81295,58872,-0.438214,0.49212,2.440138,0.454998,-0.148495,0.476077,0.115241,0.232635,-0.057007,...,0.263317,0.865982,-0.093755,-0.026039,-0.602399,-0.453104,0.032011,-0.068108,17.85,0.0
81296,58872,-0.473593,1.109975,1.704117,-0.019309,0.076774,-0.58551,0.718295,-0.023019,-0.619852,...,-0.165078,-0.312285,0.027214,0.417189,-0.266886,0.076342,0.291547,0.123537,0.89,0.0
81297,58873,-2.769864,1.253167,0.752106,0.455407,-1.894425,0.539661,1.08124,-0.188369,1.288851,...,-0.139859,0.276332,0.23268,0.403653,-0.6998,-0.894091,-1.689498,-0.295784,245.0,0.0
81298,58874,1.116529,-0.157694,-0.149952,1.059937,0.256352,,,,,...,,,,,,,,,,


In [None]:
#dataset information
credit_card_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81299 entries, 0 to 81298
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    81299 non-null  int64  
 1   V1      81299 non-null  float64
 2   V2      81299 non-null  float64
 3   V3      81299 non-null  float64
 4   V4      81299 non-null  float64
 5   V5      81299 non-null  float64
 6   V6      81298 non-null  float64
 7   V7      81298 non-null  float64
 8   V8      81298 non-null  float64
 9   V9      81298 non-null  float64
 10  V10     81298 non-null  float64
 11  V11     81298 non-null  float64
 12  V12     81298 non-null  float64
 13  V13     81298 non-null  float64
 14  V14     81298 non-null  float64
 15  V15     81298 non-null  float64
 16  V16     81298 non-null  float64
 17  V17     81298 non-null  float64
 18  V18     81298 non-null  float64
 19  V19     81298 non-null  float64
 20  V20     81298 non-null  float64
 21  V21     81298 non-null  float64
 22

In [None]:
#checking number of missing values in each column
credit_card_data.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        1
V7        1
V8        1
V9        1
V10       1
V11       1
V12       1
V13       1
V14       1
V15       1
V16       1
V17       1
V18       1
V19       1
V20       1
V21       1
V22       1
V23       1
V24       1
V25       1
V26       1
V27       1
V28       1
Amount    1
Class     1
dtype: int64
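
The counts above show one missing value in every column from V6 onward, which matches the truncated final row seen in the tail() output. The notebook never drops that row explicitly; the later Class == 0 and Class == 1 filters skip it because NaN does not compare equal to either value. A minimal sketch of an explicit clean-up step follows, using a hypothetical miniature frame (`demo`) rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for credit_card_data:
# the last row is cut off, so its later columns are NaN.
demo = pd.DataFrame({
    "Time": [0, 1, 2],
    "V1": [-1.36, 1.19, 1.12],
    "Amount": [149.62, 2.69, np.nan],
    "Class": [0.0, 0.0, np.nan],
})

# Keep only the rows in which every column is present.
demo_clean = demo.dropna().reset_index(drop=True)

# With no NaN left, the label can be stored as an integer.
demo_clean["Class"] = demo_clean["Class"].astype(int)

print(demo_clean.shape)
print(demo_clean.isnull().sum().sum())
```

Dropping is reasonable here only because a single row is affected; with many incomplete rows, imputation might be the better choice.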

In [None]:
# Distribution of legit and fraudulent transactions
credit_card_data["Class"].value_counts()

0.0    81100
1.0      198
Name: Class, dtype: int64

This dataset is highly imbalanced: it contains 81,100 legit transactions but only 198 fraudulent ones.

In a balanced dataset, the fraudulent and legit classes would have roughly the same number of rows.

0 ---> Legit / Normal transaction

1 ---> Fraudulent transaction
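
To make the imbalance concrete, the arithmetic below uses the class counts from the value_counts() output above. It shows that fraud makes up about 0.24% of the labelled rows, and that a trivial classifier which always answers "legit" would reach about 99.76% accuracy while catching no fraud at all. The variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
n_legit = 81100
n_fraud = 198
n_total = n_legit + n_fraud

fraud_share = n_fraud / n_total

# A trivial classifier that labels every transaction as legit (0)
# is right on every legit row and wrong on every fraud row.
baseline_accuracy = n_legit / n_total
baseline_recall = 0 / n_fraud  # it detects none of the fraud cases

print(f"Fraud share:       {fraud_share:.4%}")
print(f"Baseline accuracy: {baseline_accuracy:.4%}")
print(f"Baseline recall:   {baseline_recall:.0%}")
```

This is why accuracy on the raw, imbalanced data says little about fraud detection, and why the classes are balanced before training.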

In [None]:
# Separate the data by class for analysis
legit = credit_card_data[credit_card_data.Class == 0]
fraud = credit_card_data[credit_card_data.Class == 1]

In [None]:
print(legit.shape)
print(fraud.shape)

(81100, 31)
(198, 31)


In [17]:
legit.Amount.describe()

count    81100.000000
mean        98.107837
std        269.738371
min          0.000000
25%          7.730000
50%         26.990000
75%         89.530000
max      19656.530000
Name: Amount, dtype: float64

In [18]:
fraud.Amount.describe()

count     198.000000
mean       93.992778
std       211.099415
min         0.000000
25%         1.000000
50%         6.410000
75%        99.990000
max      1809.680000
Name: Amount, dtype: float64

In [21]:
# Compare the mean feature values of the two classes
credit_card_data.groupby("Class").mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0.0,37682.000937,-0.244848,-0.049587,0.701227,0.150821,-0.266673,0.100652,-0.097695,0.045923,-0.00512,...,0.041506,-0.031324,-0.104952,-0.037601,0.008468,0.134375,0.025923,0.000776,0.002077,98.107837
1.0,33363.156566,-6.471142,4.615004,-8.642655,5.16364,-4.72632,-1.982527,-6.839513,3.005157,-3.106588,...,0.381224,0.754921,-0.156429,-0.222226,-0.092181,0.229928,0.089311,0.568505,0.0539,93.992778


Under Sampling

In [22]:
# Build a sample dataset containing equal numbers of normal and fraudulent transactions
# Number of fraudulent transactions --> 198
legit_sample = legit.sample(n=198)

Concatenating two dataframes

In [23]:
new_dataset=pd.concat([legit_sample,fraud],axis=0)
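
Because sample() is called without a seed, each run draws a different set of legit rows, so the numbers shown in the following outputs will not reproduce exactly. The sketch below wraps the sample-and-concatenate steps in a hypothetical helper, `undersample`, that takes a `random_state`. It is one possible way to make the step repeatable, not the notebook's own code, and it runs on toy data.

```python
import pandas as pd

def undersample(df, label_col="Class", random_state=0):
    """Return a frame in which every class has as many rows as the rarest one."""
    n_minority = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_minority, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts, axis=0).sample(frac=1, random_state=random_state)

# Hypothetical toy data: 95 legit rows and 5 fraud rows.
toy = pd.DataFrame({
    "Amount": range(100),
    "Class": [0.0] * 95 + [1.0] * 5,
})
balanced = undersample(toy)
print(balanced["Class"].value_counts())
```

Note that undersampling discards most of the legit rows, so the model sees only a small slice of normal behaviour.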

In [25]:
new_dataset.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
63442,50717,-2.610969,-1.55222,0.707471,-1.225738,1.29807,-2.256848,-0.344909,0.429789,0.563636,...,0.365515,0.264119,-0.258938,0.434109,0.546325,-0.722874,0.011578,-0.323859,31.88,0.0
58646,48485,1.425331,-0.445505,0.420292,-0.749901,-0.938776,-0.836112,-0.544497,-0.177098,-0.763418,...,-0.116029,-0.458606,0.087143,0.013649,0.279336,-0.459968,0.009496,0.019892,12.36,0.0
65479,51603,-0.719207,0.779626,1.937894,-0.530865,0.406526,-0.340402,0.601786,0.137095,-0.912934,...,-0.114895,-0.448577,-0.091192,0.223174,0.191762,0.015543,-0.066244,-0.035279,1.29,0.0
8220,11054,-0.201642,0.69284,1.602409,-0.730746,0.422121,0.362535,0.368477,-0.222186,1.046974,...,-0.049633,0.254356,-0.347401,-0.786066,-0.100415,1.002001,-0.353959,-0.23927,14.95,0.0
39943,40032,-0.244696,1.484332,-0.401158,0.465751,0.902428,-0.006098,0.616202,-0.833625,-0.490033,...,0.846109,0.248029,-0.025853,-0.845498,-0.507681,-0.350188,0.57661,0.276724,1.0,0.0


In [26]:
new_dataset.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
79835,58199,0.340391,2.015233,-2.77733,3.812024,-0.461729,-1.152022,-2.001959,0.548681,-2.344042,...,0.299769,-0.583283,-0.187696,-0.329256,0.732328,0.05808,0.553143,0.318832,1.75,1.0
79874,58217,-0.443794,1.271395,1.206178,0.790371,0.418935,-0.848376,0.917691,-0.235511,-0.285692,...,0.119279,0.513479,-0.264243,0.443311,0.029516,-0.335141,-0.188815,-0.123391,5.09,1.0
79883,58222,-1.322789,1.552768,-2.276921,2.992117,-1.947064,-0.480288,-1.362388,0.953242,-2.329629,...,0.614969,-0.1952,0.590711,-0.233378,-0.164285,-0.277498,0.42861,0.246394,270.0,1.0
80760,58642,-0.451383,2.225147,-4.95305,4.342228,-3.65619,-0.020121,-5.407554,-0.748436,-1.362198,...,-0.575924,0.495889,1.154128,-0.016186,-2.079928,-0.554377,0.455179,0.001321,113.92,1.0
81186,58822,-4.384221,3.264665,-3.077158,3.403594,-1.938075,-1.221081,-3.310317,-1.111975,-1.977593,...,2.076383,-0.990303,-0.330358,0.158378,0.006351,-0.49386,-1.537652,-0.994022,45.64,1.0


In [28]:
new_dataset["Class"].value_counts()

0.0    198
1.0    198
Name: Class, dtype: int64

In [30]:
new_dataset.groupby("Class").mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0.0,37183.050505,-0.280416,0.155356,0.670913,0.02764,-0.24564,0.023012,-0.175428,-0.190262,-0.08157,...,0.091457,-0.1369,-0.082922,-0.083254,0.007683,0.090733,-0.010932,0.004752,0.014121,84.499747
1.0,33363.156566,-6.471142,4.615004,-8.642655,5.16364,-4.72632,-1.982527,-6.839513,3.005157,-3.106588,...,0.381224,0.754921,-0.156429,-0.222226,-0.092181,0.229928,0.089311,0.568505,0.0539,93.992778


Splitting the data into Features and Targets

In [31]:
# `columns=` already names the axis, so `axis=1` is redundant
X = new_dataset.drop(columns="Class")
Y = new_dataset["Class"]

In [32]:
print(X)

        Time        V1        V2        V3        V4        V5        V6  \
63442  50717 -2.610969 -1.552220  0.707471 -1.225738  1.298070 -2.256848   
58646  48485  1.425331 -0.445505  0.420292 -0.749901 -0.938776 -0.836112   
65479  51603 -0.719207  0.779626  1.937894 -0.530865  0.406526 -0.340402   
8220   11054 -0.201642  0.692840  1.602409 -0.730746  0.422121  0.362535   
39943  40032 -0.244696  1.484332 -0.401158  0.465751  0.902428 -0.006098   
...      ...       ...       ...       ...       ...       ...       ...   
79835  58199  0.340391  2.015233 -2.777330  3.812024 -0.461729 -1.152022   
79874  58217 -0.443794  1.271395  1.206178  0.790371  0.418935 -0.848376   
79883  58222 -1.322789  1.552768 -2.276921  2.992117 -1.947064 -0.480288   
80760  58642 -0.451383  2.225147 -4.953050  4.342228 -3.656190 -0.020121   
81186  58822 -4.384221  3.264665 -3.077158  3.403594 -1.938075 -1.221081   

             V7        V8        V9  ...       V20       V21       V22  \
63442 -0.3449

In [33]:
print(Y)

63442    0.0
58646    0.0
65479    0.0
8220     0.0
39943    0.0
        ... 
79835    1.0
79874    1.0
79883    1.0
80760    1.0
81186    1.0
Name: Class, Length: 396, dtype: float64


Split the data into training and testing data

In [34]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

In [36]:
print(X.shape,X_train.shape,X_test.shape)

(396, 30) (316, 30) (80, 30)
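
The `stratify=Y` argument asks train_test_split to keep the class proportions the same in both splits. A small check on placeholder data with the same label counts as new_dataset (198 per class) illustrates this; `X_demo` and `y_demo` are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced labels mirroring new_dataset: 198 legit, 198 fraud.
y_demo = np.array([0] * 198 + [1] * 198)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# Count how many rows of each class landed in each split.
print("train:", np.bincount(y_tr))
print("test: ", np.bincount(y_te))
```

Without stratification, a random 80-row test set could end up with noticeably more rows of one class than the other.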


Model Training

Logistic Regression Model

In [37]:
model=LogisticRegression()

In [41]:
#training the Logistic Regression Model with Training Data
model.fit(X_train,Y_train)
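
Time and Amount are on a much larger scale than V1 to V28, which can make the default LogisticRegression solver slow to converge. One common remedy, offered here as a sketch rather than as what the notebook did, is to standardise the features in a pipeline and raise `max_iter`. The data below is synthetic (make_classification), and `scaled_model` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and Y (396 rows, 30 features); not the real data.
X_syn, y_syn = make_classification(
    n_samples=396, n_features=30, n_informative=10, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=2
)

# The scaler is fitted on the training split only and reused on the test split.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

preds = scaled_model.predict(X_te)
print("test accuracy:", scaled_model.score(X_te, y_te))
```

Putting the scaler inside the pipeline keeps test-set statistics out of training, which would otherwise leak information.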

Model Evaluation

Accuracy Score

In [44]:
#accuracy on training data
X_train_prediction=model.predict(X_train)

In [45]:
# accuracy_score expects the true labels first, then the predictions
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [48]:
print("Accuracy on Training data:", training_data_accuracy)

Accuracy on Training data: 0.9620253164556962


In [50]:
# Accuracy on test data (true labels first, then predictions)
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print("Accuracy score on Test Data:", test_data_accuracy)

Accuracy score on Test Data: 0.925


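If the accuracy score on the training data is much higher than the score on the test data, the model is likely overfitted. If both scores are low, it is likely underfitted.

Here the accuracy on the training data (about 0.962) is close to the accuracy on the test data (0.925), so there is no strong sign of overfitting.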
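
The plan at the top also lists precision, recall, and F1-score, which matter more than accuracy for fraud detection because they focus on the fraud class. The sketch below computes them with scikit-learn on ten hypothetical labels and predictions (`y_true`, `y_pred`), not on the notebook's model output.

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels and predictions for ten transactions (1 = fraud).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("precision:", precision, "recall:", recall, "F1:", f1)
```

In the notebook, the same calls could be applied to Y_test and X_test_prediction. Scores on the balanced sample will still look better than on the original imbalanced data, so they should be read with that in mind.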