# Credit Card Fraud Detection



Credit card companies need to spot fraudulent transactions to ensure customers aren't billed for items they didn't buy. This dataset includes European cardholders' transactions from September 2013, with 492 frauds out of 284,807 total transactions, making it highly unbalanced.The dataset includes only numerical input variables created through PCA (Principal Component Analysis), which simplifies complex data. Because of privacy reasons, we can't share the original features and extra details about the data. The data features are numerical, derived from a PCA transformation, with 'Time' and 'Amount' as the original features. The 'Class' feature indicates fraud (1) or no fraud (0).

# Principal Component Analysis 


Principal Component Analysis (PCA) is a method used to simplify complex data. Imagine you have a big, messy set of information with lots of details. PCA helps by finding the most important parts of that information and summarizing it into a smaller, easier-to-understand set of data without losing much of the important stuff.

# Dataset Features

Features:

V1, V2, … V28: These are the new variables created by PCA.

Time: This is the number of seconds between each transaction and the first transaction in the dataset.

Amount: This is the amount of money for each transaction, useful for cost-sensitive learning (considering the cost of fraud detection).

Class: This is the target variable, showing 1 for fraud and 0 for non-fraud transactions.

# Importing the Dependencies

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score,f1_score,precision_score,recall_score

# Data Collection

In [2]:
creditcard_dataset=pd.read_csv("creditcard.csv")
creditcard_dataset

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


In [3]:

creditcard_dataset.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [4]:
creditcard_dataset.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [5]:
# distribution of legit transactions & fraudulent transactions
creditcard_dataset['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64


This is an unbalanced dataset because the non-fraud transaction class has a large number of entries, while the fraud class has very few entries.

# Unbalanced datasets

Unbalanced datasets have a lot more examples of one thing compared to another. For example, there are many more non-fraud transactions than fraud transactions.


0-Normal Transaction

1-Fraudulent Transaction

In [6]:
# separating the data for analysis
legit=creditcard_dataset[creditcard_dataset.Class==0]
fraud=creditcard_dataset[creditcard_dataset.Class==1]

In [7]:
print(legit.shape)
print(fraud.shape)


(284315, 31)
(492, 31)


In [8]:
# statistical measure of datasets
legit.Amount.describe()

count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64

In [9]:
fraud.Amount.describe()

count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64


Based on the statistical measures of the dataset, the mean of legitimate transactions is 88.291022. After applying the same measures to fraudulent transactions, the mean is 122.211321. This shows that the mean of fraudulent transactions amount is higher compared to legitimate transactions amount

In [10]:
# compare the values for both transactions\
creditcard_dataset.groupby('Class').mean()

Unnamed: 0_level_0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
Class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,94838.202258,0.008258,-0.006271,0.012171,-0.00786,0.005453,0.002419,0.009637,-0.000987,0.004467,...,-0.000644,-0.001235,-2.4e-05,7e-05,0.000182,-7.2e-05,-8.9e-05,-0.000295,-0.000131,88.291022
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


# Under-sampling method is used to handle unbalanced datasets

Undersampling: Reduce the number of examples in the larger group to match the smaller group.

Oversampling: Increase the number of examples in the smaller group by duplicating or creating new examples.

In [11]:
legit_sample=legit.sample(n=492)

# Concatenating two dataframes

In [12]:
new_datasets=pd.concat([legit_sample,fraud],axis=0)

In [13]:
new_datasets

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
103331,68600.0,1.261190,-0.034248,-0.248462,0.024486,-0.203934,-1.013095,0.271710,-0.279106,0.103549,...,-0.095604,-0.373524,-0.124154,-0.056400,0.461366,1.037250,-0.113516,-0.002604,48.76,0
69286,53351.0,0.058954,-1.945350,-0.839532,0.550254,-1.006639,-1.031586,1.050187,-0.329990,-0.237256,...,0.375478,-0.430924,-0.602363,0.603425,0.206530,0.979750,-0.246060,0.087741,630.95,0
22726,32404.0,1.305646,-0.798684,0.936919,-0.661343,-1.401596,-0.271885,-1.134607,0.112047,-0.293985,...,0.408681,1.105499,-0.075355,0.110938,0.325264,-0.018992,0.046823,0.023260,24.99,0
15203,26556.0,-0.426772,1.014181,1.280786,-0.101670,0.250977,-0.193988,0.512672,0.227729,-0.618047,...,-0.187817,-0.484541,0.004503,-0.040602,-0.272283,0.085183,0.256661,0.084085,2.99,0
190859,129011.0,1.795251,-0.337696,-0.412867,1.178650,-0.264675,0.235488,-0.446233,0.199219,0.623006,...,-0.130409,-0.476011,0.373588,0.616993,-0.447666,-1.036243,0.034503,-0.018770,65.75,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279863,169142.0,-1.927883,1.125653,-4.518331,1.749293,-1.566487,-2.010494,-0.882850,0.697211,-2.064945,...,0.778584,-0.319189,0.639419,-0.294885,0.537503,0.788395,0.292680,0.147968,390.00,1
280143,169347.0,1.378559,1.289381,-5.004247,1.411850,0.442581,-1.326536,-1.413170,0.248525,-1.127396,...,0.370612,0.028234,-0.145640,-0.081049,0.521875,0.739467,0.389152,0.186637,0.76,1
280149,169351.0,-0.676143,1.126366,-2.213700,0.468308,-1.120541,-0.003346,-2.234739,1.210158,-0.652250,...,0.751826,0.834108,0.190944,0.032070,-0.739695,0.471111,0.385107,0.194361,77.89,1
281144,169966.0,-3.113832,0.585864,-5.399730,1.817092,-0.840618,-2.943548,-2.208002,1.058733,-1.632333,...,0.583276,-0.269209,-0.456108,-0.183659,-0.328168,0.606116,0.884876,-0.253700,245.00,1


In [14]:
new_datasets['Class'].value_counts()

0    492
1    492
Name: Class, dtype: int64

After applying under-sampling method we balance the datasets

In [15]:
new_datasets.groupby('Class').mean()

Unnamed: 0_level_0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
Class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,97527.193089,-0.152262,-0.077768,-0.035806,0.001265,0.057347,-0.083225,0.015826,-0.036859,0.053826,...,-0.058627,-0.0349,0.022001,-0.020008,0.020968,0.021798,-0.051268,-0.015923,0.002954,85.986748
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


In this case, it is used to check whether the sample is good or bad. As we find, the nature of the dataset has not changed.








# Splitting the Features and Target

In [16]:
X=new_datasets.drop(columns='Class',axis=1)
Y=new_datasets['Class']

In [17]:
X

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
103331,68600.0,1.261190,-0.034248,-0.248462,0.024486,-0.203934,-1.013095,0.271710,-0.279106,0.103549,...,-0.008605,-0.095604,-0.373524,-0.124154,-0.056400,0.461366,1.037250,-0.113516,-0.002604,48.76
69286,53351.0,0.058954,-1.945350,-0.839532,0.550254,-1.006639,-1.031586,1.050187,-0.329990,-0.237256,...,1.088399,0.375478,-0.430924,-0.602363,0.603425,0.206530,0.979750,-0.246060,0.087741,630.95
22726,32404.0,1.305646,-0.798684,0.936919,-0.661343,-1.401596,-0.271885,-1.134607,0.112047,-0.293985,...,0.039374,0.408681,1.105499,-0.075355,0.110938,0.325264,-0.018992,0.046823,0.023260,24.99
15203,26556.0,-0.426772,1.014181,1.280786,-0.101670,0.250977,-0.193988,0.512672,0.227729,-0.618047,...,0.075609,-0.187817,-0.484541,0.004503,-0.040602,-0.272283,0.085183,0.256661,0.084085,2.99
190859,129011.0,1.795251,-0.337696,-0.412867,1.178650,-0.264675,0.235488,-0.446233,0.199219,0.623006,...,-0.124711,-0.130409,-0.476011,0.373588,0.616993,-0.447666,-1.036243,0.034503,-0.018770,65.75
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279863,169142.0,-1.927883,1.125653,-4.518331,1.749293,-1.566487,-2.010494,-0.882850,0.697211,-2.064945,...,1.252967,0.778584,-0.319189,0.639419,-0.294885,0.537503,0.788395,0.292680,0.147968,390.00
280143,169347.0,1.378559,1.289381,-5.004247,1.411850,0.442581,-1.326536,-1.413170,0.248525,-1.127396,...,0.226138,0.370612,0.028234,-0.145640,-0.081049,0.521875,0.739467,0.389152,0.186637,0.76
280149,169351.0,-0.676143,1.126366,-2.213700,0.468308,-1.120541,-0.003346,-2.234739,1.210158,-0.652250,...,0.247968,0.751826,0.834108,0.190944,0.032070,-0.739695,0.471111,0.385107,0.194361,77.89
281144,169966.0,-3.113832,0.585864,-5.399730,1.817092,-0.840618,-2.943548,-2.208002,1.058733,-1.632333,...,0.306271,0.583276,-0.269209,-0.456108,-0.183659,-0.328168,0.606116,0.884876,-0.253700,245.00


In [18]:
Y

103331    0
69286     0
22726     0
15203     0
190859    0
         ..
279863    1
280143    1
280149    1
281144    1
281674    1
Name: Class, Length: 984, dtype: int64

# Splitting Training and Testing Data

In [19]:
 X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.2,stratify=Y,random_state=32)

In [20]:
print(X.shape,X_train.shape,X_test.shape)

(984, 30) (787, 30) (197, 30)


In [33]:
X_train

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
102445,68207.0,-13.192671,12.785971,-9.906650,3.320337,-4.801176,5.760059,-18.750889,-37.353443,-0.391540,...,-3.493050,27.202839,-8.887017,5.303607,-0.639435,0.263203,-0.108877,1.269566,0.939407,1.00
269558,163684.0,-3.333827,-4.031111,0.581391,2.247840,2.372068,-2.421126,-2.767098,0.853465,0.155642,...,1.374202,0.541667,-0.036976,0.181361,0.080679,-1.508438,-0.686318,0.317141,-0.524581,1.00
42549,41147.0,-5.314173,4.145944,-8.532522,8.344392,-5.718008,-3.043536,-10.989185,3.404129,-6.167234,...,1.150017,2.331466,0.862996,-0.614453,0.523648,-0.712593,0.324638,2.245091,0.497321,88.23
86899,61445.0,-2.425205,-0.676900,2.694565,2.407243,1.718996,-0.139595,-0.388501,0.400843,-1.254244,...,0.028984,0.300560,0.502795,0.030031,0.011529,0.284794,-0.219990,-0.190089,0.426937,63.01
198868,132688.0,0.432554,1.861373,-4.310353,2.448080,4.574094,-2.979912,-2.792379,-2.719867,-0.276704,...,0.318853,-1.384477,-0.348904,-3.979948,-0.828156,-2.419446,-0.767070,0.387039,0.319402,1.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107067,70270.0,-1.512516,1.133139,-1.601052,2.813401,-2.664503,-0.310371,-1.520895,0.852996,-1.496495,...,1.249586,0.729828,0.485286,0.567005,0.323586,0.040871,0.825814,0.414482,0.267265,318.11
43773,41646.0,-3.240187,2.978122,-4.162314,3.869124,-3.645256,-0.126271,-4.744730,-0.065331,-2.168366,...,-0.224043,2.601441,0.231910,-0.036490,0.042640,-0.438330,-0.125821,0.421300,0.003146,172.32
10484,17187.0,1.088375,0.898474,0.394684,3.170258,0.175739,-0.221981,-0.022989,-0.010874,0.860044,...,-0.217358,-0.423554,-0.800852,0.077614,0.167608,0.350182,-0.118941,0.012948,0.054254,3.79
150668,93860.0,-10.632375,7.251936,-17.681072,8.204144,-10.166591,-4.510344,-12.981606,6.783589,-4.659330,...,-0.810146,2.715357,0.695603,-1.138122,0.459442,0.386337,0.522438,-1.416604,-0.488307,188.52


In [46]:
single_row = X_train.iloc[0]
single_row_array = single_row.to_numpy()
single_row_array

array([ 6.82070000e+04, -1.31926710e+01,  1.27859706e+01, -9.90665002e+00,
        3.32033688e+00, -4.80117593e+00,  5.76005856e+00, -1.87508892e+01,
       -3.73534426e+01, -3.91539744e-01, -5.05250237e+00,  4.40680552e+00,
       -4.61075648e+00, -1.90948797e+00, -9.07271093e+00, -2.26074451e-01,
       -6.21155748e+00, -6.24814535e+00, -3.14924669e+00,  5.15761185e-02,
       -3.49304992e+00,  2.72028392e+01, -8.88701714e+00,  5.30360690e+00,
       -6.39434802e-01,  2.63203123e-01, -1.08876930e-01,  1.26956636e+00,
        9.39407363e-01,  1.00000000e+00])

In [34]:
y_train

102445    1
269558    0
42549     1
86899     0
198868    1
         ..
107067    1
43773     1
10484     1
150668    1
74496     1
Name: Class, Length: 787, dtype: int64

In [36]:
X_test

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
57248,47826.0,-0.887287,1.390002,1.219686,1.661425,1.009228,-0.733908,0.855829,0.000077,-1.275631,...,-0.268347,-0.083734,-0.346930,-0.050619,0.231044,-0.450760,-0.376205,0.034504,0.157775,7.58
57615,47982.0,-1.232804,2.244119,-1.703826,1.492536,-1.192985,-1.686110,-1.864612,0.856122,-1.973535,...,0.207889,0.560475,0.165682,-0.013754,0.474935,-0.218725,0.302809,0.466031,0.250134,0.76
147246,88286.0,1.921039,-0.292250,-0.463353,0.418612,-0.523264,0.021848,-0.961882,0.256224,1.031747,...,-0.098232,0.025630,0.166943,0.318331,0.675312,-0.588244,0.316303,0.003792,-0.009883,10.46
255556,157284.0,-0.242245,4.147186,-5.672349,6.493741,1.591168,-1.602523,-0.950463,0.722903,-4.128505,...,0.562030,0.249023,-0.480286,-0.286080,-1.153575,-0.035571,0.559628,0.409446,0.221048,0.77
124028,77151.0,1.250981,0.325102,0.062237,0.512883,0.118907,-0.284001,0.029608,-0.057516,-0.005526,...,-0.064592,-0.289817,-0.776186,0.086587,-0.491664,0.230858,0.155864,-0.004260,0.023001,1.98
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188956,128208.0,1.925830,-0.383597,-0.146061,0.428985,-0.727647,-0.327027,-0.632131,0.020410,1.055213,...,-0.151015,0.263142,1.028673,0.106533,0.129616,-0.106560,-0.241739,0.044460,-0.043369,9.99
235293,148332.0,1.960400,-0.492671,-0.334183,0.287344,-0.555782,0.039386,-0.747754,0.114090,0.970122,...,-0.145289,0.198656,0.787262,0.103266,-0.396408,-0.260445,0.577306,-0.017914,-0.060251,11.50
276981,167392.0,-1.303063,0.135984,0.905445,-1.827999,-1.885599,-0.362997,-0.938751,0.673399,-2.491332,...,-0.458309,-0.037223,0.055041,-0.254089,-0.009866,0.368597,0.111560,-0.368289,-0.136884,57.00
97245,66127.0,0.864944,0.175779,0.489782,2.540622,-0.035307,-0.040585,0.365706,-0.074742,-0.928481,...,0.165533,-0.072096,-0.536730,-0.013356,0.068699,0.237758,-0.187225,-0.018365,0.049499,150.72


In [37]:
y_test

57248     1
57615     1
147246    0
255556    1
124028    0
         ..
188956    0
235293    0
276981    0
97245     0
27627     1
Name: Class, Length: 197, dtype: int64

# Model Training

In [35]:
models = {
   'lr':LogisticRegression(),
   'rfc':RandomForestClassifier(n_estimators=100),
   'dtc': DecisionTreeClassifier()
}

for name, mod in models.items():
    mod.fit(X_train,y_train)
    y_pred=mod.predict(X_test)
    print(f"{name} accuracy score: {accuracy_score(y_test,y_pred)} Precisionscore {precision_score(y_test,y_pred)}  recallscore {precision_score(y_test,y_pred)} F1score {f1_score(y_test,y_pred)}")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


lr accuracy score: 0.9390862944162437 Precisionscore 0.9484536082474226  recallscore 0.9484536082474226 F1score 0.9387755102040816
rfc accuracy score: 0.9441624365482234 Precisionscore 0.9888888888888889  recallscore 0.9888888888888889 F1score 0.9417989417989419
dtc accuracy score: 0.9086294416243654 Precisionscore 0.8857142857142857  recallscore 0.8857142857142857 F1score 0.9117647058823529



Based on the three models—Logistic Regression, Random Forest Classifier, Decision Tree Classifier, and K-Nearest Neighbors—we used evaluation metrics and found that the Random Forest Classifier has the best accuracy score, precision score, recall score, and F1 score.

# Select Model=RandomForestClassifier

In [38]:
rfc= RandomForestClassifier()
rfc.fit(X_train,y_train)
y_pred=rfc.predict(X_test)

In [39]:
print(y_pred)

[0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0 0 0 1 1 0 0 0 1 1
 0 1 0 1 0 0 1 1 0 0 0 1 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 1 0 1 0
 1 1 0 0 1 0 0 1 1 1 0 0 1 0 1 1 0 1 1 0 1 0 1 0 0 1 0 0 1 1 1 0 0 1 1 1 1
 0 0 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 1 0 0 0
 0 1 0 0 1 1 0 1 1 1 0 1 1 1 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 0 0 1 1 0 1 1 0
 0 1 0 0 0 0 1 0 0 0 0 1]


In [40]:

# Check if they are equal
are_equal = y_pred == y_test

print("Are y_pred and y_test equal?", are_equal)


Are y_pred and y_test equal? 57248     False
57615      True
147246     True
255556     True
124028     True
          ...  
188956     True
235293     True
276981     True
97245      True
27627      True
Name: Class, Length: 197, dtype: bool


# Making Predictive System

In [47]:
input_data = (6.82070000e+04, -1.31926710e+01,  1.27859706e+01, -9.90665002e+00,
        3.32033688e+00, -4.80117593e+00,  5.76005856e+00, -1.87508892e+01,
       -3.73534426e+01, -3.91539744e-01, -5.05250237e+00,  4.40680552e+00,
       -4.61075648e+00, -1.90948797e+00, -9.07271093e+00, -2.26074451e-01,
       -6.21155748e+00, -6.24814535e+00, -3.14924669e+00,  5.15761185e-02,
       -3.49304992e+00,  2.72028392e+01, -8.88701714e+00,  5.30360690e+00,
       -6.39434802e-01,  2.63203123e-01, -1.08876930e-01,  1.26956636e+00,
        9.39407363e-01,  1.00000000e+00)
       
#changing the input data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)
# reshape the data as we are predicting the label for only one instance
input_data_reshaped =input_data_as_numpy_array.reshape(1,-1)
prediction =rfc.predict(input_data_reshaped)
print(prediction) 

[1]




In [49]:
print(y_train.iloc[0])

1


Our predictive system working well.Both the result are equal

# Conclusion

In my project, I developed three different machine learning models: Logistic Regression, Random Forest classifier, and Decision Tree classifier. To evaluate the performance of these models, I used four key metrics: accuracy_score,Precision_score,recall_score,F1_score

Model Evaluation Results:

Logistic Regression:

Accuracy_Score: 0.9390862944162437

Precision_Score:0.9484536082474226 

Recall_Score:0.9484536082474226

F1_score:  0.9387755102040816
    
Random Forest Classifier:

Accuracy_Score:  0.9441624365482234

Precision_Score:0.9888888888888889

Recall_Score:0.9888888888888889

F1_score: 0.9417989417989419
    
Decision Tree Classifier:

Accuracy_Score: 0.9086294416243654

Precision_Score:0.8857142857142857 

Recall_Score:0.8857142857142857 

F1_score: 0.9117647058823529
    
Conclusion:

Based on the evaluation metrics, Random Forest Classifier are the best performing models for this project. The accuracy score ( 0.94), precision score (0.9888), recall score (0.988), and F1 score ( 0.94) .These indicates model give the accurate prediction