# Building a Predictive Model Using Logistic Regression to Detect Fraudulent Transactions

In [1]:
import pandas as pd
import numpy as np

## Importing data downloaded from Kaggle (source: https://www.kaggle.com/ealaxi/paysim1)

In [5]:
transactions = pd.read_csv(r"C:\Users\SANATH\OneDrive\Desktop\Datasets\PS_20174392719_1491204439457_log.csv")

In [17]:
print(transactions.head(10))

   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1   PAYMENT   9839.64  C1231006815      170136.00       160296.36   
1     1   PAYMENT   1864.28  C1666544295       21249.00        19384.72   
2     1  TRANSFER    181.00  C1305486145         181.00            0.00   
3     1  CASH_OUT    181.00   C840083671         181.00            0.00   
4     1   PAYMENT  11668.14  C2048537720       41554.00        29885.86   
5     1   PAYMENT   7817.71    C90045638       53860.00        46042.29   
6     1   PAYMENT   7107.77   C154988899      183195.00       176087.23   
7     1   PAYMENT   7861.64  C1912850431      176087.23       168225.59   
8     1   PAYMENT   4024.36  C1265012928        2671.00            0.00   
9     1     DEBIT   5337.77   C712410124       41720.00        36382.23   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud  
0  M1979787155             0.0            0.00        0               0  
1  M2044282225            

In [7]:
print(transactions.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB
None


## Checking number of fraudulent transactions

In [11]:
print(transactions['isFraud'].count())

6362620


In [13]:
print(transactions['isFraud'].sum())

8213


So, out of 6,362,620 transactions, 8,213 (about 0.13%) are fraudulent.
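
The share of fraudulent rows can also be computed directly, since the mean of a 0/1 column is the fraction of ones. A minimal sketch on a small made-up frame (`toy` is hypothetical; with the real data the same two calls would be applied to `transactions`):

```python
import pandas as pd

# Hypothetical stand-in for the `transactions` frame: 1 fraud in 8 rows.
toy = pd.DataFrame({"isFraud": [0, 0, 0, 1, 0, 0, 0, 0]})

fraud_count = int(toy["isFraud"].sum())  # number of rows labelled 1
fraud_rate = toy["isFraud"].mean()       # mean of a 0/1 column = share of 1s

print(f"{fraud_count} of {len(toy)} rows are fraudulent ({fraud_rate:.2%})")
```

A very small fraud share means the classes are heavily imbalanced, which matters later when reading accuracy scores.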

In [10]:
print(transactions['amount'].describe())

count    6.362620e+06
mean     1.798619e+05
std      6.038582e+05
min      0.000000e+00
25%      1.338957e+04
50%      7.487194e+04
75%      2.087215e+05
max      9.244552e+07
Name: amount, dtype: float64


## Feature Engineering 

Online fraud can happen only through certain transaction types. Let's first check which transaction types appear in the dataset.

In [16]:
print(transactions['type'].unique())

['PAYMENT' 'TRANSFER' 'CASH_OUT' 'DEBIT' 'CASH_IN']


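Let's create two features (columns) based on the transaction type: `isPayment` for 'PAYMENT' and 'TRANSFER', and `isMovement` for 'DEBIT' and 'CASH_OUT'. 'CASH_IN' gets 0 in both columns, which effectively leaves it out of the analysis.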

In [19]:
transactions['isPayment'] = 0
# Assign through .loc so the write goes to the original frame (no chained indexing)
transactions.loc[transactions['type'].isin(['PAYMENT', 'TRANSFER']), 'isPayment'] = 1


In [20]:
transactions['isMovement'] = 0
# Assign through .loc so the write goes to the original frame (no chained indexing)
transactions.loc[transactions['type'].isin(['DEBIT', 'CASH_OUT']), 'isMovement'] = 1
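
An equivalent way to build both indicator columns is to convert the boolean mask returned by `isin` straight to integers, which avoids chained indexing altogether. This is only a sketch on a hypothetical five-row frame (`toy` is made up for illustration):

```python
import pandas as pd

# Hypothetical mini-frame with one row per transaction type.
toy = pd.DataFrame(
    {"type": ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN"]}
)

# isin() returns a boolean Series; astype(int) maps True/False to 1/0.
toy["isPayment"] = toy["type"].isin(["PAYMENT", "TRANSFER"]).astype(int)
toy["isMovement"] = toy["type"].isin(["DEBIT", "CASH_OUT"]).astype(int)

print(toy)
```

The 'CASH_IN' row ends up with 0 in both columns.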


In [21]:
print(transactions.head())

   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1   PAYMENT   9839.64  C1231006815       170136.0       160296.36   
1     1   PAYMENT   1864.28  C1666544295        21249.0        19384.72   
2     1  TRANSFER    181.00  C1305486145          181.0            0.00   
3     1  CASH_OUT    181.00   C840083671          181.0            0.00   
4     1   PAYMENT  11668.14  C2048537720        41554.0        29885.86   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud  \
0  M1979787155             0.0             0.0        0               0   
1  M2044282225             0.0             0.0        0               0   
2   C553264065             0.0             0.0        1               0   
3    C38997010         21182.0             0.0        1               0   
4  M1230701703             0.0             0.0        0               0   

   isPayment  isMovement  
0          1           0  
1          1           0  
2          1     

Most fraudulent transactions send money into new or unused bank accounts. Let's compute the difference between the opening balance of the account the money was debited from and the opening balance of the account it was transferred to.

In [22]:
transactions['accountDiff'] = abs(transactions['oldbalanceOrg']-transactions['oldbalanceDest'])
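
As a quick worked example, the same calculation can be applied to three rows copied from the `head()` output above (the frame name `toy` is hypothetical):

```python
import pandas as pd

# Values copied from rows 0, 2 and 3 of the head() output above.
toy = pd.DataFrame(
    {
        "oldbalanceOrg": [170136.0, 181.0, 181.0],
        "oldbalanceDest": [0.0, 0.0, 21182.0],
    }
)

# Absolute difference between the two opening balances.
toy["accountDiff"] = (toy["oldbalanceOrg"] - toy["oldbalanceDest"]).abs()

print(toy["accountDiff"].tolist())  # → [170136.0, 181.0, 21001.0]
```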

## Features and Label for the Model 

In [23]:
label = transactions['isFraud']
features = transactions[['amount', 'isPayment', 'isMovement', 'accountDiff']]

## Splitting the dataset 

In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.3)
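
With so few fraudulent rows, a plain random split can leave the two sets with slightly different fraud shares. One option, not used for the scores reported in this notebook, is a stratified and seeded split. A minimal sketch on made-up data (`X`, `y` and the seed values are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 1000 rows, 5% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:50] = 1

# stratify=y keeps the positive share (almost) identical in both splits;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(y_tr.mean(), y_te.mean())
```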

## Normalizing 

We are going to use the Logistic Regression model from sklearn, which applies regularization by default. Because the regularization penalty depends on the size of the coefficients, the features need to be on comparable scales before we train the model.

We will use standardization, which rescales each feature to zero mean and unit variance. The scaler is fitted on the training set only, and the same transformation is then applied to the test set.

In [25]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
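
Standardization maps each value x to (x - mean) / std, where the mean and standard deviation come from the training data. The sketch below checks this by hand against `StandardScaler` on made-up numbers (`train` and `test` are hypothetical arrays):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test values for a single feature.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0]])

scaler = StandardScaler().fit(train)

# StandardScaler uses the training mean and the population std (ddof=0).
mean, std = train.mean(axis=0), train.std(axis=0)
manual = (test - mean) / std

print(scaler.transform(test), manual)
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.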

## Modelling with Logistic Regression

In [28]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

## Evaluating the model 

In [29]:
print(model.score(X_train, y_train))

0.9986865249131422


The training accuracy is about 99.87%. Let's check the accuracy on the test data.

In [30]:
print(model.score(X_test, y_test))

0.9986588334155846
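
Because fraud is rare in this dataset (8213 of 6362620 rows), accuracy can look high even for a model that never flags fraud, so it helps to also look at the confusion matrix, precision and recall. A minimal sketch with made-up labels (`y_true` and `y_pred` are hypothetical, not the notebook's results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 2 frauds in 1000 rows, and a model that never predicts fraud.
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1
y_pred = np.zeros(1000, dtype=int)

accuracy = (y_true == y_pred).mean()
# zero_division=0 returns 0 instead of warning when nothing is predicted positive.
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)

print(confusion_matrix(y_true, y_pred))
print(accuracy, precision, recall)
```

Here the accuracy is 0.998 while recall is 0, because both frauds are missed. On the notebook's data the same functions could be called with `y_test` and `model.predict(X_test)`.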


In [31]:
print(model.coef_)

[[ 0.24729259  2.88016895  2.9843517  -0.65425114]]
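
One way to read these coefficients is to exponentiate them: `exp(coefficient)` is the factor by which the odds of fraud change for a one-unit increase in the feature. Since the features were standardized, one unit here means one standard deviation. A sketch using the values printed above (the `feature_names` list simply repeats the column order used for `features`):

```python
import numpy as np

# Coefficients as printed above, in the order of the feature columns.
feature_names = ["amount", "isPayment", "isMovement", "accountDiff"]
coef = np.array([0.24729259, 2.88016895, 2.9843517, -0.65425114])

# exp(coefficient) is the multiplicative change in the odds of fraud
# for a one-unit increase in the (standardized) feature.
odds_ratios = np.exp(coef)

for name, c, o in zip(feature_names, coef, odds_ratios):
    print(f"{name:12s} coef={c:+.3f} odds ratio={o:.2f}")
```

A positive coefficient gives an odds ratio above 1 and a negative one gives a ratio below 1, so in this fit a larger `accountDiff` lowers the predicted odds of fraud.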


## Let's use the model to predict whether new transactions are fraudulent

In [33]:
transaction1 = np.array([123456.78, 0.0, 1.0, 54670.1])
transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])
transaction3 = np.array([543678.31, 1.0, 0.0, 510025.5])
transaction4 = np.array([23456.56, 0.0, 0.0, 11111.01])

In [34]:
combine_transactions = np.stack((transaction1, transaction2, transaction3, transaction4))

## Now normalize this data

In [35]:
normalized = scaler.transform(combine_transactions)



## Results

In [37]:
print(model.predict(normalized))

[0 0 0 0]
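
`predict` only returns hard labels. To see how close each case is to the decision boundary, `predict_proba` gives the estimated class probabilities. The sketch below uses a tiny made-up dataset just to have a fitted model (`clf`, `X`, `y` and `new_points` are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical, well-separated one-feature data just to have a fitted model.
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y)

new_points = np.array([[-1.8], [0.1], [1.7]])

# Column 1 of predict_proba is the estimated probability of class 1.
proba = clf.predict_proba(new_points)[:, 1]

# For a binary model, predict() matches thresholding that probability at 0.5.
labels = (proba >= 0.5).astype(int)

print(proba, labels, clf.predict(new_points))
```

In this notebook, `model.predict_proba(normalized)[:, 1]` would give the corresponding fraud probabilities for the four transactions.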


None of the four transactions is predicted to be fraudulent.

We have now built a predictive model using Logistic Regression to detect fraudulent transactions.