**Importing the necessary libraries**

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

**Loading the data**

In [None]:
data = pd.read_csv("Fraud.csv", nrows = 100000)

In [None]:
data.head()

**Analysis**

In [None]:
# Checking for null values
data.isnull().values.any()

In [None]:
legit = len(data[data.isFraud == 0])
fraud = len(data[data.isFraud == 1])
legit_percent = (legit / len(data.isFraud)) * 100
fraud_percent = (fraud / len(data.isFraud)) * 100

print(f"Percentage of Legit transactions: {legit_percent} %")
print(f"Percentage of Fraud transactions: {fraud_percent} %")

These results show that the data is highly imbalanced: legitimate transactions make up 99.87 % and fraudulent transactions 0.13 %. Decision trees and random forests are therefore tried here, since tree-based methods often cope reasonably well with imbalanced data.
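
The same class shares can be computed more compactly with `value_counts(normalize=True)`. A minimal sketch, assuming a small hypothetical frame `toy` that stands in for `data`:

```python
import pandas as pd

# Hypothetical stand-in for `data`: 1 fraud among 8 transactions.
toy = pd.DataFrame({"isFraud": [0, 0, 0, 1, 0, 0, 0, 0]})

# Share of each class, in percent, indexed by the class label.
class_percent = toy["isFraud"].value_counts(normalize=True) * 100
print(class_percent)
```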

**Label Encoding**

In [None]:
# Checking how many attributes are dtype: object
objList = data.select_dtypes(include = "object").columns

# Label Encoding for the object to numeric conversion
le = LabelEncoder()
for feat in objList:
    data[feat] = le.fit_transform(data[feat].astype(str))
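
The loop above reuses one `LabelEncoder`, so once it finishes only the mapping of the last column is still available. If the mappings are needed later (for example to decode a column back to its labels), one option is to keep one encoder per column. A sketch on a hypothetical toy frame; the values are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; only the column names come from the notebook.
toy = pd.DataFrame({
    "nameOrig": ["C1", "C2", "C1"],
    "nameDest": ["M9", "M7", "M7"],
})

# One fitted encoder per column, so each mapping can be reused or inverted.
encoders = {}
for feat in ["nameOrig", "nameDest"]:
    encoders[feat] = LabelEncoder()
    toy[feat] = encoders[feat].fit_transform(toy[feat].astype(str))

# Decode a column back to its original labels.
original = encoders["nameOrig"].inverse_transform(toy["nameOrig"].to_numpy())
print(list(original))
```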

**Multicollinearity** Checking the correlation

In [None]:
def vif_cal(data):
    # Build a table with one VIF value per column
    vifact = pd.DataFrame()
    vifact["variables"] = data.columns
    vifact["VIF"] = [variance_inflation_factor(data.values, i) for i in range(data.shape[1])]

    return vifact

vif_cal(data)

Output

How did you select the variables to be included in the model? Using the VIF values, we check whether any attributes are highly correlated with each other and then drop the one that is less correlated with the isFraud attribute.

As we can see, oldbalanceOrg, newbalanceOrig, oldbalanceDest and newbalanceDest have high VIF values and are therefore highly correlated with each other, so these attributes are dropped.
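
For reference, the VIF of column j is 1 / (1 - R_j²), where R_j² comes from regressing column j on all the other columns. The sketch below computes this directly with NumPy on synthetic data. It is an illustration under assumptions: the helper `vif_numpy` and the columns `a`, `b`, `c` are hypothetical, and the regression includes an intercept, which `variance_inflation_factor` does not add on its own, so the numbers can differ from the table above.

```python
import numpy as np

def vif_numpy(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.05, size=500)  # nearly a copy of `a`
c = rng.normal(size=500)                  # independent of `a` and `b`

vifs = vif_numpy(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly identical columns get a large VIF, while the independent column stays close to 1.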

In [None]:
data['Actual_amount'] = data['oldbalanceDest'] - data['newbalanceDest']

#Dropping columns
data = data.drop(['oldbalanceOrg','newbalanceOrig','oldbalanceDest','newbalanceDest','step','nameOrig','nameDest'],axis=1)
vif_cal(data)

Output

**Selecting the dependent and independent variables**

In [None]:
Y = data["isFraud"]
X = data.drop(["isFraud"], axis= 1)

**Train-Test Split**

In [None]:
(X_train, X_test, Y_train, Y_test) = train_test_split(X, Y, test_size= 0.3)
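
With only about 0.13 % fraud, a purely random split can leave very few fraud rows in the test set, and the split changes on every run. One possible variation, shown here on hypothetical toy arrays, uses the `stratify` and `random_state` parameters of `train_test_split` to keep the class ratio in both parts and make the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 fraud rows out of 1000 (1 %).
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.array([1] * 10 + [0] * 990)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

# Fraud share in each part stays close to the overall 1 %.
print(y_tr.mean(), y_te.mean())
```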

**Model Training**

In [None]:
# Logistic Regression

logistic_regression = LogisticRegression(random_state = 0)
logistic_regression.fit(X_train, Y_train)

Y_pred_lr = logistic_regression.predict(X_test)
class_score = logistic_regression.score(X_test, Y_test) * 100

In [None]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)

Y_pred_dt = decision_tree.predict(X_test)
decision_tree_score = decision_tree.score(X_test, Y_test) * 100

In [None]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators= 100)
random_forest.fit(X_train, Y_train)

Y_pred_rf = random_forest.predict(X_test)
random_forest_score = random_forest.score(X_test, Y_test) * 100

**Evaluation**

In [None]:
print("Logistic Regression Score: ", class_score)
print("Decision Tree Score: ", decision_tree_score)
print("Random Forest Score: ", random_forest_score)
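
Accuracy alone is hard to interpret on data this imbalanced. Using the class shares quoted earlier (99.87 % legit), a model that always predicts "legit" already reaches about 99.87 % accuracy while catching no fraud at all. A small sketch with hypothetical labels that match those shares:

```python
# Hypothetical labels matching the quoted class shares: 13 fraud in 10,000 rows.
y_true = [1] * 13 + [0] * 9987

# Trivial baseline: predict "legit" (0) for every transaction.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true) * 100
fraud_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
print(f"Baseline accuracy: {accuracy:.2f} %, fraud caught: {fraud_caught}")
```

This is why the confusion matrices and classification reports are more informative than the scores printed above.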

In [None]:
# confusion matrix - LR

confusion_matrix_lr = confusion_matrix(Y_test, Y_pred_lr.round())
print("Confusion Matrix - Logistic Regression")
print(confusion_matrix_lr)

print("----------------------------------------------------------------------------------------")

# confusion matrix - DT

confusion_matrix_dt = confusion_matrix(Y_test, Y_pred_dt.round())
print("Confusion Matrix - Decision Tree")
print(confusion_matrix_dt)

print("----------------------------------------------------------------------------------------")

# confusion matrix - RF

confusion_matrix_rf = confusion_matrix(Y_test, Y_pred_rf.round())
print("Confusion Matrix - Random Forest")
print(confusion_matrix_rf)

Output

In [None]:
# key terms of Confusion Matrix - DT

print("TP,FP,TN,FN - Decision Tree")
tn, fp, fn, tp = confusion_matrix(Y_test, Y_pred_dt).ravel()
print(f'True Positives: {tp}')
print(f'False Positives: {fp}')
print(f'True Negatives: {tn}')
print(f'False Negatives: {fn}')

print("----------------------------------------------------------------------------------------")

# key terms of Confusion Matrix - RF

print("TP,FP,TN,FN - Random Forest")
tn, fp, fn, tp = confusion_matrix(Y_test, Y_pred_rf).ravel()
print(f'True Positives: {tp}')
print(f'False Positives: {fp}')
print(f'True Negatives: {tn}')
print(f'False Negatives: {fn}')

Output

Here, Random Forest looks better.

In [None]:
# classification report - LR

classification_report_lr = classification_report(Y_test, Y_pred_lr)
print("Classification Report - Logistic Regression")
print(classification_report_lr)

print("----------------------------------------------------------------------------------------")

# classification report - DT

classification_report_dt = classification_report(Y_test, Y_pred_dt)
print("Classification Report - Decision Tree")
print(classification_report_dt)

print("----------------------------------------------------------------------------------------")

# classification report - RF

classification_report_rf = classification_report(Y_test, Y_pred_rf)
print("Classification Report - Random Forest")
print(classification_report_rf)

Output

**Conclusion**

We can see that the accuracy of Random Forest and Decision Tree is equal, although the precision of Random Forest is better. In a fraud detection model, precision on the fraud class matters more than correctly predicting normal transactions. We want fraudulent transactions to be flagged and legitimate ones to be left alone. If either of these two conditions is not met, we may catch the innocent or let the culprit go. This is also one of the reasons why Random Forest and Decision Tree are used instead of other algorithms.
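
Precision and recall for the fraud class follow directly from the `tp`, `fp` and `fn` counts unpacked earlier. Precision is the share of flagged transactions that really are fraud (few innocents caught), and recall is the share of fraud that gets flagged (few culprits left). A minimal sketch; the helper `precision_recall` and the counts are hypothetical, not results from this notebook:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive (fraud) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts for illustration only.
p, r = precision_recall(tp=30, fp=10, fn=20)
print(p, r)
```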

What are the key factors that predict a fraudulent customer?

1. Whether the source of the request is secured.
2. The transaction history of vendors.


What kind of prevention should be adopted while the company updates its infrastructure?

1. Use smart, verified apps only.
2. Browse only secured websites.
3. Keep your mobile and laptop security updated.
4. Don't respond to unsolicited calls, SMSs or e-mails.
5. If you feel you have been tricked or your security has been compromised, contact your bank immediately.


Assuming these actions have been implemented, how would you determine if they work?

1. The bank sends e-statements.
2. Customers keep a check on their account activity.
3. Customers always keep a log of their payments.