
Importing Dependencies

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
# Load the dataset from its CSV path into a pandas DataFrame.

cc_data = pd.read_csv('/content/sample_data/creditcard.csv')

In [3]:
# Show the first five rows
cc_data.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


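Above are the first five rows of the credit card data set loaded from the sample data folder.
The features V1 to V28 are the result of a Principal Component Analysis (PCA) transformation, so their original meaning is not available. Amount is given in US dollars. Time is the number of seconds elapsed since the first transaction in the data set. Classes:
0 = normal transaction.
1 = fraudulent transaction.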
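
To illustrate what a PCA transformation does to a feature table, here is a minimal sketch on synthetic data. The array `raw_features` and the choice of two components are assumptions made for illustration only; the transformation that was actually applied to this data set is not available.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the original features (assumption, not the real data).
rng = np.random.default_rng(0)
raw_features = rng.normal(size=(200, 5))
raw_features[:, 1] += 0.8 * raw_features[:, 0]  # make two columns correlated

# Project the rows onto the two directions of largest variance.
pca = PCA(n_components=2)
components = pca.fit_transform(raw_features)

print(components.shape)               # (200, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
# The transformed columns are uncorrelated with each other.
print(np.corrcoef(components, rowvar=False))
```

The transformed columns no longer correspond to any single original feature, which is why V1 to V28 cannot be interpreted directly.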

In [4]:
cc_data.tail()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.01448,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.05508,2.03503,-0.738589,0.868229,1.058415,0.02433,0.294869,0.5848,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.24964,-0.557828,2.630515,3.03126,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.24044,0.530483,0.70251,0.689799,-0.377961,0.623708,-0.68618,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.0,0
284806,172792.0,-0.533413,-0.189733,0.703337,-0.506271,-0.012546,-0.649617,1.577006,-0.41465,0.48618,...,0.261057,0.643078,0.376777,0.008797,-0.473649,-0.818267,-0.002415,0.013649,217.0,0


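Above are the last five rows of the data set. Time is again in seconds: the last transaction occurs 172792 seconds after the first, so the data covers *almost two days of transactions.* Next we gather summary information about the columns.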
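
The "almost two days" figure follows from simple arithmetic on the last Time value. A small sketch, with the value copied by hand from the tail() output:

```python
# Last value of the Time column, taken from the tail() output above.
last_time_seconds = 172792.0

hours = last_time_seconds / 3600   # 3600 seconds per hour
days = hours / 24                  # 24 hours per day

# How far the recording falls short of two full days.
seconds_short = 2 * 24 * 3600 - last_time_seconds

print(f"{hours:.3f} hours, {days:.5f} days, {seconds_short:.0f} seconds short of two days")
```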

In [5]:
cc_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

The output above lists the number of entries and columns. A null value is a missing value; the non-null count is the number of values that are present. Next we count the missing values in each column.

In [6]:
cc_data.isnull().sum()


Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

Every column has zero missing values, which is a good thing. Now see how the classes are distributed.
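
When the per-column listing is long, the same check can be reduced to a single number. A minimal sketch on a toy frame (the `toy` data is made up for illustration and contains one deliberately missing value):

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (illustrative data only).
toy = pd.DataFrame({"Amount": [10.0, np.nan, 3.5], "Class": [0, 0, 1]})

missing_per_column = toy.isnull().sum()         # same call as used above
total_missing = int(missing_per_column.sum())   # one number for the whole frame

print(missing_per_column)
print("Any missing values:", total_missing > 0)
```

On `cc_data` the total would be 0, since every column count above is 0.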

In [9]:
cc_data['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64


* Normal Transactions: 284315
* Fraud Transactions: 492



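More than 99% of the data belongs to one class, so the data set is highly unbalanced. With so few fraud examples, a model trained on the raw data would struggle to recognize fraudulent transactions, so some preprocessing is needed. First, analyze the two classes separately: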
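
To put numbers on the imbalance, the class shares can be computed from the counts above. This sketch copies the two counts by hand rather than reading the CSV:

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
class_counts = pd.Series({0: 284315, 1: 492}, name="Class")

shares = class_counts / class_counts.sum()
fraud_pct = shares[1] * 100

print(f"Fraud share: {fraud_pct:.3f}%")
# Accuracy of a trivial model that always predicts class 0.
print(f"Always-normal baseline accuracy: {shares[0]:.4f}")
```

A model that labels every transaction as normal would already be right more than 99% of the time on the full data, which is why accuracy on the unbalanced data says little about fraud detection.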

In [10]:
# Split the data into separate DataFrames for real and fraudulent transactions
real = cc_data[cc_data.Class == 0]
fraud = cc_data[cc_data.Class == 1]


In [11]:
print(real.shape)
print(fraud.shape)

(284315, 31)
(492, 31)


In [12]:
real.Amount.describe()  # summary statistics of the transaction amount for real transactions (class 0)

count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64

In [13]:
fraud.Amount.describe()  # same statistics for fraudulent transactions (class 1)

count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64

In [14]:
cc_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94838.202258,0.008258,-0.006271,0.012171,-0.00786,0.005453,0.002419,0.009637,-0.000987,0.004467,...,-0.000644,-0.001235,-2.4e-05,7e-05,0.000182,-7.2e-05,-8.9e-05,-0.000295,-0.000131,88.291022
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


The class means differ widely between normal and fraudulent transactions. Compare V1, for example: about 0.008 for class 0 versus about -4.77 for class 1. The mean amounts also differ (88.29 versus 122.21), as the last column shows. Next we build a sample data set from the original that contains equal numbers of real and fraud transactions.
This means taking a random sample of 492 transactions from the 284315 real transactions and combining it with the 492 fraud transactions so that the data is BALANCED.
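
The balancing step described here is random undersampling of the majority class. A minimal sketch on a toy frame follows; the helper name `undersample` and the fixed `random_state=0` are assumptions for illustration. The notebook's own `sample` call below does not set a seed, so its sampled rows differ between runs.

```python
import numpy as np
import pandas as pd

def undersample(df, label_col="Class", random_state=0):
    """Hypothetical helper: sample every class down to the smallest class size."""
    counts = df[label_col].value_counts()
    n_minority = counts.min()
    parts = [
        df[df[label_col] == label].sample(n=n_minority, random_state=random_state)
        for label in counts.index
    ]
    return pd.concat(parts, axis=0)

# Toy data (illustrative only): 990 normal rows and 10 fraud rows.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Amount": rng.uniform(0, 100, size=1000),
    "Class": [0] * 990 + [1] * 10,
})

balanced = undersample(toy)
print(balanced["Class"].value_counts())
```

Fixing the seed makes the sampled rows, and therefore every later number, repeatable.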

In [16]:
real_sample = real.sample(n = 492)


In [17]:
new_data = pd.concat([real_sample, fraud], axis = 0)


In [18]:
new_data.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
205648,135850.0,2.076793,-0.840551,-0.597995,-0.432304,-0.578205,0.385664,-0.978279,0.160193,-0.354317,...,-0.583066,-1.175672,0.473866,0.155543,-0.65317,-0.083262,0.003191,-0.040372,17.0,0
200486,133453.0,-1.224163,-0.399112,-1.766858,1.288104,-7.446837,4.472428,8.023282,-1.390962,-0.033205,...,-0.698906,-0.478018,0.572865,-0.063001,-1.234399,-1.049478,1.204458,-0.357178,1732.12,0
30586,35993.0,-0.653798,0.77317,1.156996,0.33954,0.045392,-1.027201,0.758313,-0.253625,-0.428612,...,-0.395715,-1.218245,0.126267,0.34919,-0.939598,-0.047007,-0.077395,0.125541,50.9,0
179607,124156.0,1.152773,-0.687627,-2.597123,1.810194,0.547325,-0.711881,1.030763,-0.367577,-0.003638,...,-0.089775,-1.148235,-0.108799,0.379367,-0.205944,-0.994288,-0.044309,0.093037,447.39,0
171737,120792.0,2.159317,-1.522302,-1.726698,-1.849936,-0.564546,-0.197293,-0.764753,-0.215694,-1.895442,...,0.155976,0.690835,-0.080271,0.133631,0.142177,0.050146,-0.032696,-0.050897,110.65,0


The first column is the row index carried over from the original data set. The Class column shows that the rows above are real samples.

In [19]:
new_data.tail()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
279863,169142.0,-1.927883,1.125653,-4.518331,1.749293,-1.566487,-2.010494,-0.88285,0.697211,-2.064945,...,0.778584,-0.319189,0.639419,-0.294885,0.537503,0.788395,0.29268,0.147968,390.0,1
280143,169347.0,1.378559,1.289381,-5.004247,1.41185,0.442581,-1.326536,-1.41317,0.248525,-1.127396,...,0.370612,0.028234,-0.14564,-0.081049,0.521875,0.739467,0.389152,0.186637,0.76,1
280149,169351.0,-0.676143,1.126366,-2.2137,0.468308,-1.120541,-0.003346,-2.234739,1.210158,-0.65225,...,0.751826,0.834108,0.190944,0.03207,-0.739695,0.471111,0.385107,0.194361,77.89,1
281144,169966.0,-3.113832,0.585864,-5.39973,1.817092,-0.840618,-2.943548,-2.208002,1.058733,-1.632333,...,0.583276,-0.269209,-0.456108,-0.183659,-0.328168,0.606116,0.884876,-0.2537,245.0,1
281674,170348.0,1.991976,0.158476,-2.583441,0.40867,1.151147,-0.096695,0.22305,-0.068384,0.577829,...,-0.16435,-0.295135,-0.072173,-0.450261,0.313267,-0.289617,0.002988,-0.015309,42.53,1


The Class column shows that the rows above are fraud samples.

In [20]:
new_data['Class'].value_counts()

0    492
1    492
Name: Class, dtype: int64

Now the data set is balanced: each class has 492 rows.

In [21]:
new_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,96959.987805,0.114729,0.041442,-0.143357,0.050625,-0.048981,-0.005017,-0.017473,-0.041142,0.014272,...,0.045387,-0.011835,-0.063985,-0.021939,-0.021603,-0.036113,-0.01837,0.016744,-0.005722,103.191179
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


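The class means for the new, balanced data set are shown above. The class 1 means are unchanged because every fraud row was kept, and the class 0 means of the random sample still differ clearly from the class 1 means. The nature of the data set has not changed.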

In [22]:
# A holds the features (every column except Class); B holds the labels.
A = new_data.drop(columns='Class')
B = new_data['Class']

In [23]:
print(A)

            Time        V1        V2        V3        V4        V5        V6  \
205648  135850.0  2.076793 -0.840551 -0.597995 -0.432304 -0.578205  0.385664   
200486  133453.0 -1.224163 -0.399112 -1.766858  1.288104 -7.446837  4.472428   
30586    35993.0 -0.653798  0.773170  1.156996  0.339540  0.045392 -1.027201   
179607  124156.0  1.152773 -0.687627 -2.597123  1.810194  0.547325 -0.711881   
171737  120792.0  2.159317 -1.522302 -1.726698 -1.849936 -0.564546 -0.197293   
...          ...       ...       ...       ...       ...       ...       ...   
279863  169142.0 -1.927883  1.125653 -4.518331  1.749293 -1.566487 -2.010494   
280143  169347.0  1.378559  1.289381 -5.004247  1.411850  0.442581 -1.326536   
280149  169351.0 -0.676143  1.126366 -2.213700  0.468308 -1.120541 -0.003346   
281144  169966.0 -3.113832  0.585864 -5.399730  1.817092 -0.840618 -2.943548   
281674  170348.0  1.991976  0.158476 -2.583441  0.408670  1.151147 -0.096695   

              V7        V8        V9  .

In [24]:
print(B)

205648    0
200486    0
30586     0
179607    0
171737    0
         ..
279863    1
280143    1
280149    1
281144    1
281674    1
Name: Class, Length: 984, dtype: int64


Training/Testing Data

In [26]:
A_train, A_test, B_train, B_test = train_test_split(A, B, test_size = 0.2, stratify = B, random_state = 2)
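# test_size = 0.2 holds out 20% of the rows for testing.
# stratify = B keeps the same class proportions in both splits.
# random_state seeds the shuffle so the split is reproducible.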

In [28]:
print(A.shape, A_train.shape, A_test.shape)

(984, 30) (787, 30) (197, 30)


787 rows go into A_train and 197 rows go into A_test; each row has 30 feature columns.
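
The 787/197 split can be reproduced without the real data. In this sketch the feature matrix is a placeholder of zeros (an assumption; only the row count and the labels matter here):

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 984
features = np.zeros((n_rows, 30))          # placeholder features (assumption)
labels = np.array([0] * 492 + [1] * 492)   # balanced labels, as in new_data

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

# The test set size is the ceiling of n_rows * test_size.
print(math.ceil(n_rows * 0.2), len(X_test), len(X_train))
# Stratification keeps the two classes (almost) evenly represented in each split.
print(np.bincount(y_train), np.bincount(y_test))
```

Because 20% of 984 is 196.8, the test set is rounded up to 197 rows and the remaining 787 rows form the training set.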

In [35]:
model = LogisticRegression()

In [36]:
model.fit(A_train, B_train)
#A_train is features of training data
#B_train is corresponding labels (0 or 1)
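
Time and Amount are on much larger scales than V1 to V28, and with unscaled features `LogisticRegression` may reach its default iteration limit before converging. One common option, which this notebook does not use, is to standardize the features first. The sketch below shows the idea on synthetic data; the two Gaussian classes and `max_iter=1000` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic classes (illustrative only), 200 rows each, 4 features.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 4)),
    rng.normal(3.0, 1.0, size=(200, 4)),
])
X[:, 0] *= 1e5   # put one column on a large scale, like Time
y = np.array([0] * 200 + [1] * 200)

# Standardize every feature, then fit logistic regression.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X, y)

train_accuracy = scaled_model.score(X, y)
print(train_accuracy)
```

Putting the scaler inside the pipeline means it is fitted only on the data passed to `fit`, so the same object can later be applied to held-out rows.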

In [38]:
A_train_prediction = model.predict(A_train)
training_data_accuracy = accuracy_score(B_train, A_train_prediction)  # argument order: (y_true, y_pred)

In [39]:
print('Accuracy On Training Data: ', training_data_accuracy)

Accuracy On Training Data:  0.9212198221092758


The model classifies about 92% of the training transactions correctly, that is, about 92 out of every 100 predictions. This is good.
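
Accuracy is simply the fraction of predictions that match the true labels. A small sketch with made-up labels shows that `accuracy_score` agrees with the count done by hand:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels and predictions (illustrative only).
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])

by_hand = (y_true == y_pred).mean()          # 3 of 5 predictions are correct
by_sklearn = accuracy_score(y_true, y_pred)  # signature: (y_true, y_pred)

print(by_hand, by_sklearn)
```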

In [40]:
A_test_prediction = model.predict(A_test)
test_data_accuracy = accuracy_score(B_test, A_test_prediction)  # argument order: (y_true, y_pred)

In [42]:
print('Accuracy on Test Data: ', test_data_accuracy)

Accuracy on Test Data:  0.8984771573604061


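If the accuracy on the training data is much higher than the accuracy on the test data, the model is overfitting; if both scores are low, it is underfitting. Here the two scores are similar (about 0.92 and 0.90), which suggests the model generalizes reasonably well.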
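
The comparison can be written as a small check. The two accuracy values are copied from the outputs above; the threshold `max_gap` is an arbitrary illustrative choice, not a standard value.

```python
# Accuracy values copied from the outputs above.
training_data_accuracy = 0.9212198221092758
test_data_accuracy = 0.8984771573604061

gap = training_data_accuracy - test_data_accuracy
max_gap = 0.05  # arbitrary illustrative threshold, not a standard value

print(f"Train/test accuracy gap: {gap:.4f}")
print("Gap within threshold:", gap <= max_gap)
```

A gap of roughly two percentage points on a test set of only 197 rows is small enough that it does not, by itself, indicate overfitting.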