
The **credit card fraud detection** dataset contains the following columns:



**Time**: The time elapsed between this transaction and the first transaction in the dataset (in seconds).

**V1 to V28**: These are anonymized features resulting from a PCA (Principal Component Analysis) transformation. Due to privacy concerns, the original features have been transformed into these principal components to protect the users' sensitive information.

**Amount**: The transaction amount, representing the monetary value of the transaction.

**Class**: The target variable, where 1 indicates a fraudulent transaction and 0 indicates a legitimate (non-fraudulent) transaction.

**Importing the Dependencies**

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
# Load the dataset into a pandas DataFrame

credit_card_data = pd.read_csv('/content/creditcard.csv')

In [3]:
# First 5 rows of the dataset

credit_card_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [4]:
credit_card_data.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.01448,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.05508,2.03503,-0.738589,0.868229,1.058415,0.02433,0.294869,0.5848,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.24964,-0.557828,2.630515,3.03126,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.24044,0.530483,0.70251,0.689799,-0.377961,0.623708,-0.68618,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.0,0
284806,172792.0,-0.533413,-0.189733,0.703337,-0.506271,-0.012546,-0.649617,1.577006,-0.41465,0.48618,...,0.261057,0.643078,0.376777,0.008797,-0.473649,-0.818267,-0.002415,0.013649,217.0,0


In [5]:
# Dataset Information

credit_card_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [6]:
# Checking the number of missing values in each column

credit_card_data.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [7]:
# Distribution of Legit transactions and Fraudulent transactions

credit_card_data['Class'].value_counts()


0    284315
1       492
Name: Class, dtype: int64

**This dataset is highly imbalanced**

Only 492 of the 284,807 transactions (about 0.17%) are fraudulent.

0 --> Normal transaction
1 --> Fraudulent transaction
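With this much imbalance, plain accuracy can look excellent even for a model that never flags fraud. The short calculation below illustrates this using the class counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the value_counts() output above.
n_legit = 284315
n_fraud = 492
n_total = n_legit + n_fraud

# Share of fraudulent transactions in the full dataset.
fraud_rate = n_fraud / n_total

# Accuracy of a trivial rule that labels every transaction as legit (Class 0).
baseline_accuracy = n_legit / n_total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Always-legit baseline accuracy: {baseline_accuracy:.4%}")
```

The baseline scores above 99.8% while catching no fraud at all, which is why the class distribution needs to be handled before training.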

In [8]:
# Separating the data for analysis

legit = credit_card_data[credit_card_data.Class == 0]
fraud = credit_card_data[credit_card_data.Class == 1]


In [9]:
print(legit.shape)
print(fraud.shape)

(284315, 31)
(492, 31)


In [10]:
# Statistical Measures of the data

legit.Amount.describe()

count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64

In [11]:
fraud.Amount.describe()

count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64

In [12]:
# Compare the values for both transactions

credit_card_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94838.202258,0.008258,-0.006271,0.012171,-0.00786,0.005453,0.002419,0.009637,-0.000987,0.004467,...,-0.000644,-0.001235,-2.4e-05,7e-05,0.000182,-7.2e-05,-8.9e-05,-0.000295,-0.000131,88.291022
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


**Under-Sampling**


Build a sample dataset containing an equal number of normal and fraudulent transactions.

Number of fraudulent transactions ---> 492





Undersampling is a technique used in imbalanced datasets to address the problem of having significantly more instances of one class (majority class) than another class (minority class). In such cases, the learning algorithm may be biased towards the majority class and struggle to properly learn the minority class, leading to poor performance in predicting the minority class.

Undersampling involves reducing the number of instances in the majority class to balance the class distribution, typically by randomly removing data points from the majority class until it matches the size of the minority class. By doing so, the dataset becomes more balanced, and the learning algorithm can give equal importance to both classes during training.

However, undersampling discards information from the majority class. Here only 492 of the 284,315 legitimate transactions (fewer than 0.2%) are kept, so most of the available data goes unused, which can reduce overall model performance.
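The idea can be sketched on a small synthetic DataFrame. This is a minimal illustration, not the notebook's data: the `toy` frame, its sizes, and the `random_state=42` seed are all assumptions chosen for the example. Fixing `random_state` makes the random draw repeatable between runs.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 1,000 legit rows and 20 fraud rows.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Amount": rng.exponential(scale=80.0, size=1020),
    "Class": [0] * 1000 + [1] * 20,
})

toy_legit = toy[toy["Class"] == 0]
toy_fraud = toy[toy["Class"] == 1]

# Randomly keep as many legit rows as there are fraud rows.
# random_state is a hypothetical seed; any fixed integer gives a repeatable draw.
toy_legit_sample = toy_legit.sample(n=len(toy_fraud), random_state=42)

# Stack the sampled majority rows on top of all minority rows.
balanced = pd.concat([toy_legit_sample, toy_fraud], axis=0)

print(balanced["Class"].value_counts())
```

After this step both classes have 20 rows each, at the cost of dropping 980 legit rows.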

In [13]:
legit_sample = legit.sample(n=492)

**Concatenating Two DataFrames**

In [14]:
new_dataset = pd.concat([legit_sample, fraud], axis=0)

In [15]:
new_dataset.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
10429,16888.0,0.642693,-0.742042,1.256343,2.17934,-1.022701,0.533586,-0.347161,0.139048,2.52008,...,-0.416222,-0.976711,-0.051149,0.352054,0.272955,-0.590839,0.022854,0.062789,226.51,0
79859,58211.0,-4.582533,3.980154,-0.463068,-1.196031,-0.79108,-0.659116,0.487344,0.305398,2.995103,...,-0.847823,-0.909629,0.072467,-0.114144,0.61253,0.043727,1.092191,0.232851,8.97,0
237073,149083.0,1.941444,-1.414306,0.056841,-0.175477,-1.386071,0.604657,-1.434495,0.195395,0.807275,...,-0.41441,-0.501631,0.295861,0.501749,-0.547999,0.502048,0.020494,-0.016732,81.0,0
72099,54563.0,-0.870128,1.183404,1.621406,-0.066113,-0.190596,-1.112997,0.873975,-0.273749,0.170327,...,-0.328324,-0.584337,0.011152,0.677426,-0.168571,0.033608,0.312916,0.045155,16.99,0
281487,170211.0,2.029622,-0.107618,-1.213999,0.424454,0.015783,-1.00699,0.252247,-0.335974,0.52704,...,-0.238325,-0.523593,0.250473,-0.142818,-0.180681,0.285717,-0.070915,-0.06139,20.85,0


In [16]:
new_dataset.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
279863,169142.0,-1.927883,1.125653,-4.518331,1.749293,-1.566487,-2.010494,-0.88285,0.697211,-2.064945,...,0.778584,-0.319189,0.639419,-0.294885,0.537503,0.788395,0.29268,0.147968,390.0,1
280143,169347.0,1.378559,1.289381,-5.004247,1.41185,0.442581,-1.326536,-1.41317,0.248525,-1.127396,...,0.370612,0.028234,-0.14564,-0.081049,0.521875,0.739467,0.389152,0.186637,0.76,1
280149,169351.0,-0.676143,1.126366,-2.2137,0.468308,-1.120541,-0.003346,-2.234739,1.210158,-0.65225,...,0.751826,0.834108,0.190944,0.03207,-0.739695,0.471111,0.385107,0.194361,77.89,1
281144,169966.0,-3.113832,0.585864,-5.39973,1.817092,-0.840618,-2.943548,-2.208002,1.058733,-1.632333,...,0.583276,-0.269209,-0.456108,-0.183659,-0.328168,0.606116,0.884876,-0.2537,245.0,1
281674,170348.0,1.991976,0.158476,-2.583441,0.40867,1.151147,-0.096695,0.22305,-0.068384,0.577829,...,-0.16435,-0.295135,-0.072173,-0.450261,0.313267,-0.289617,0.002988,-0.015309,42.53,1


In [17]:
new_dataset['Class'].value_counts()

0    492
1    492
Name: Class, dtype: int64

In [18]:
new_dataset.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,92747.471545,0.061653,0.071769,0.018243,0.123135,-0.035506,0.034304,0.019264,-0.029384,0.048903,...,0.04454,-0.013247,0.037467,-0.007817,0.031027,-0.017879,-0.003388,0.044237,0.016075,92.07878
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


**Splitting the data into Features & Targets**

In [19]:
X = new_dataset.drop(columns='Class')
Y = new_dataset['Class']

In [20]:
print(X)


            Time        V1        V2        V3        V4        V5        V6  \
10429    16888.0  0.642693 -0.742042  1.256343  2.179340 -1.022701  0.533586   
79859    58211.0 -4.582533  3.980154 -0.463068 -1.196031 -0.791080 -0.659116   
237073  149083.0  1.941444 -1.414306  0.056841 -0.175477 -1.386071  0.604657   
72099    54563.0 -0.870128  1.183404  1.621406 -0.066113 -0.190596 -1.112997   
281487  170211.0  2.029622 -0.107618 -1.213999  0.424454  0.015783 -1.006990   
...          ...       ...       ...       ...       ...       ...       ...   
279863  169142.0 -1.927883  1.125653 -4.518331  1.749293 -1.566487 -2.010494   
280143  169347.0  1.378559  1.289381 -5.004247  1.411850  0.442581 -1.326536   
280149  169351.0 -0.676143  1.126366 -2.213700  0.468308 -1.120541 -0.003346   
281144  169966.0 -3.113832  0.585864 -5.399730  1.817092 -0.840618 -2.943548   
281674  170348.0  1.991976  0.158476 -2.583441  0.408670  1.151147 -0.096695   

              V7        V8        V9  .

In [21]:
print(Y)

10429     0
79859     0
237073    0
72099     0
281487    0
         ..
279863    1
280143    1
280149    1
281144    1
281674    1
Name: Class, Length: 984, dtype: int64


**Split the data into Training and Test data**

In [22]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

In [23]:
print(X.shape, X_train.shape, X_test.shape)

(984, 30) (787, 30) (197, 30)


In [24]:
print(Y.shape, Y_train.shape, Y_test.shape)

(984,) (787,) (197,)
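The `stratify=Y` argument used in the split keeps the class proportions the same in the training and test sets. The sketch below shows that effect on hypothetical toy labels (`X_toy`, `y_toy` are made up for illustration and are not the notebook's data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 90 zeros and 10 ones, so 10% of rows are positive.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

# Same call pattern as the notebook's split, applied to the toy arrays.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=2
)

# With stratification, both parts keep the 10% positive share.
print("train positives:", y_tr.sum(), "of", len(y_tr))
print("test positives:", y_te.sum(), "of", len(y_te))
```

Without `stratify`, a random split of a small dataset could leave the test set with noticeably more or fewer positives than the training set.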


**Model Training : Logistic Regression**

In [25]:
model = LogisticRegression()

In [26]:
# Training the Logistic Regression Model with Training Data
model.fit(X_train, Y_train)
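The model above is fit on unscaled features. Columns such as Time and Amount span a much wider range than the PCA components, which may slow the solver's convergence. One common remedy, shown here only as a sketch on synthetic data and not as what this notebook ran, is to standardize features inside a pipeline. The names `X_demo`, `y_demo`, and `scaled_model`, and the `max_iter=1000` setting, are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the balanced sample.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# Inflate one column to mimic a large-range feature such as Time.
X_demo[:, 0] = X_demo[:, 0] * 50_000 + 90_000

# StandardScaler rescales every column to zero mean and unit variance
# before the logistic regression sees it.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

print("Training accuracy:", scaled_model.score(X_demo, y_demo))
```

Because the scaler lives inside the pipeline, the same transformation is applied automatically whenever `predict` or `score` is called on new data.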


**Model Evaluation: Calculating Accuracy Score**

In [27]:
# Accuracy score on training data

X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)


In [28]:
# Print Accuracy on Training Data

print('Accuracy on training Data : ', training_data_accuracy)

Accuracy on training Data :  0.9212198221092758


In [29]:
# Accuracy on test data

X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [30]:
print('Accuracy Score on test data: ', test_data_accuracy)

Accuracy Score on test data:  0.9086294416243654
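Accuracy alone does not show how many fraudulent transactions were missed. A confusion matrix together with precision and recall gives that breakdown. The sketch below uses small hand-written label lists (`y_true`, `y_pred` are hypothetical, not the notebook's predictions); in the notebook the same calls would presumably take `Y_test` and `X_test_prediction`.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical ground truth and predictions: 5 legit (0) and 5 fraud (1) cases.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # share of flagged cases that are fraud
recall = recall_score(y_true, y_pred)        # share of fraud cases that were flagged

print(cm)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

In this toy case accuracy is 0.7, yet recall is only 0.6, meaning two of the five fraud cases slipped through.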


**Completed..!!**