# Financial Dataset EDA and Fraud Detection

This Jupyter notebook performs exploratory data analysis (EDA) on a financial transactions dataset, then trains a neural network and a decision tree classifier to predict fraud. The dataset contains more than 6 million transactions. It is a synthetic dataset generated with the PaySim simulator. PaySim uses aggregated data from a private dataset to generate synthetic transactions that resemble normal operation, then injects malicious behaviour so that the performance of fraud detection methods can be evaluated.

In [57]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
import warnings
warnings.filterwarnings('ignore')

In [58]:
#Loading Dataset
data = pd.read_csv('../input/paysim1/PS_20174392719_1491204439457_log.csv')
data.head()

**EDA**

In [59]:
data['nameOrig'].value_counts()

In [60]:
data['nameDest'].value_counts()

In [61]:
#Dimensions of the Dataset
data.shape

In [62]:
#Columns in the Dataset
data.columns

In [63]:
#Dataset info
data.info()

In [64]:
data['type'].value_counts()

In [65]:
#Share of fraudulent and non-fraudulent transactions
NoFraud = len(data[data['isFraud'] == 0])
Fraud = len(data[data['isFraud'] == 1])
print("Percentage of No Fraud: {:.2f}%".format(NoFraud / len(data) * 100))
print("Percentage of Fraud: {:.2f}%".format(Fraud / len(data) * 100))

In [66]:
#Checking if the dataset has any null/missing values or not
data.isnull().sum()

The dataset therefore contains no null values.

**Correlation Heatmap**

In [67]:
plt.figure(figsize = (10, 12))
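#Correlate numeric columns only; 'type', 'nameOrig' and 'nameDest' are still strings at this point
sns.heatmap(data.select_dtypes(include='number').corr(), annot = True)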

In [69]:
plt.figure(figsize = (10, 12))
sns.countplot(x = data['type'])

**Number of transactions of each type**

In [70]:
sns.countplot(x = data[data['isFraud'] == 1]['type'])

The plot shows that fraudulent transactions occur only in the 'TRANSFER' and 'CASH_OUT' transaction types.
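
The same observation can be checked numerically rather than read off a plot. The following is a minimal sketch on a small made-up frame (the `toy` rows are illustrative assumptions, not values from the PaySim file); on the real data, `toy` would be replaced by `data` before the label-encoding step.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real transactions
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "CASH_OUT", "DEBIT", "TRANSFER"],
    "isFraud": [0, 1, 1, 0, 0, 0],
})

# Rows: transaction type; columns: isFraud value (0 or 1); cells: counts
fraud_by_type = pd.crosstab(toy["type"], toy["isFraud"])

# Transaction types with at least one fraudulent row
fraud_types = fraud_by_type.index[fraud_by_type[1] > 0].tolist()
print(fraud_by_type)
print(fraud_types)
```

A cross-tabulation also shows the non-fraud counts per type, which a count plot of the fraud rows alone hides.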

In [72]:
from sklearn.preprocessing import LabelEncoder
#Label-encode the categorical columns; a fresh encoder is fitted for each column
for col in ['type', 'nameOrig', 'nameDest']:
    data[col] = LabelEncoder().fit_transform(data[col])
data

In [73]:
#Data info after update
data.info()

In [74]:
#Amount Distribution when Fraud Takes Place
#histplot with kde=True replaces the deprecated distplot
sns.histplot(data[data['isFraud'] == 1]['amount'], kde = True)

In [75]:
data[data['isFraud'] == 1]['nameDest'].value_counts()

In [76]:
numerical = ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']

**Outlier Detection**

In [77]:
from collections import Counter

def detect_outliers(data,features):
    outlier_indices = []
    
    for c in features:
        # 1st quartile
        Q1 = np.percentile(data[c],25)
        # 3rd quartile
        Q3 = np.percentile(data[c],75)
        # IQR
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # detect outliers and their indices
        outlier_list_columns = data[(data[c] < Q1 - outlier_step) | (data[c] > Q3 + outlier_step)].index
        # store indices
        outlier_indices.extend(outlier_list_columns)
    
    # count, per row, the number of features in which it is an outlier
    outlier_indices = Counter(outlier_indices)
    # keep rows that are outliers in more than two features
    multiple_outliers = list(i for i, v in outlier_indices.items() if v > 2)
    
    return multiple_outliers
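
To make the rule concrete, here is a small vectorized sketch of the same idea on a toy frame. A value is flagged when it falls outside the fences `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR`, and a row is selected only when it is flagged in more than two features. The helper name `count_outlier_flags` and the toy columns are hypothetical, and this is an alternative formulation, not the function defined above.

```python
import pandas as pd

def count_outlier_flags(df, features, k=1.5):
    """Return, per row, the number of features whose value lies outside the IQR fences."""
    q1 = df[features].quantile(0.25)
    q3 = df[features].quantile(0.75)
    step = k * (q3 - q1)
    # Comparing a DataFrame with a Series aligns the Series index on the columns
    outside = (df[features] < q1 - step) | (df[features] > q3 + step)
    return outside.sum(axis=1)

# Hypothetical toy data: row 7 is extreme in a, b and c; row 6 only in d
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 1000],
    "b": [1, 2, 3, 4, 5, 6, 7, 1000],
    "c": [1, 2, 3, 4, 5, 6, 7, 1000],
    "d": [1, 2, 3, 4, 5, 6, 1000, 7],
})

flags = count_outlier_flags(toy, ["a", "b", "c", "d"])
rows_to_drop = flags[flags > 2].index.tolist()
print(flags.tolist())
print(rows_to_drop)
```

Row 6 is extreme in a single feature and is kept; only row 7, which is extreme in three features, would be dropped.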

In [78]:
data.loc[detect_outliers(data,numerical)]

In [79]:
#Removing outliers from the dataset
data = data.drop(detect_outliers(data,numerical),axis = 0).reset_index(drop = True)

**Scatterplot Heatmap**

In [80]:
#relplot creates its own figure (sized via height), so no separate plt.figure call is needed
corr_mat = data.corr().stack().reset_index(name="correlation")
g = sns.relplot(
    data=corr_mat,
    x="level_0", y="level_1", hue="correlation", size="correlation",
    palette="YlGnBu", hue_norm=(-1, 1), edgecolor=".7",
    height=10, sizes=(50, 250), size_norm=(-.2, .8),
)
g.fig.suptitle('Scatterplot heatmap',fontsize=22, fontweight='bold', fontfamily='serif', color="#000000")
g.despine(left=True, bottom=True)
g.ax.margins(.02)
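for label in g.ax.get_xticklabels():
    label.set_rotation(90)
# Matplotlib 3.7 renamed Legend.legendHandles to legend_handles; support both
handles = getattr(g.legend, "legend_handles", None) or getattr(g.legend, "legendHandles", [])
for artist in handles:
    artist.set_edgecolor(".7")
plt.show()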

In [81]:
X = data.drop(['isFraud', 'isFlaggedFraud', 'nameOrig', 'nameDest'], axis = 1)
y = data['isFraud']

In [82]:
numerical = [feature for feature in X.columns if X[feature].dtype == 'int64' or X[feature].dtype == 'float64']
numerical

**Scaling the dataset**

In [83]:
scaler = RobustScaler()
X[numerical] = scaler.fit_transform(X[numerical])

**First Five rows of the scaled dataset.**

In [84]:
X.head()

In [85]:
#Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 42)
print("The shape of X_train is      ", X_train.shape)
print("The shape of X_test is       ",X_test.shape)
print("The shape of y_train is      ",y_train.shape)
print("The shape of y_test is       ",y_test.shape)
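
The cells above fit the scaler on the full feature matrix before splitting, so statistics from the test rows influence the scaling of the training rows. With a target this imbalanced, a plain random split also does not guarantee the same fraud rate in both parts. The following is a hedged sketch of an alternative ordering on made-up data: split first with `stratify`, then fit `RobustScaler` on the training part only. The `X_toy` and `y_toy` names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Hypothetical toy data: 1000 rows, 1% positives
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "amount": rng.exponential(100.0, size=1000),
    "oldbalanceOrg": rng.exponential(500.0, size=1000),
})
y_toy = pd.Series(np.r_[np.ones(10, dtype=int), np.zeros(990, dtype=int)])

# stratify keeps the class ratio the same in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Fit the scaler on the training rows only, then apply it to both parts
scaler_toy = RobustScaler().fit(X_tr)
X_tr_scaled = scaler_toy.transform(X_tr)
X_te_scaled = scaler_toy.transform(X_te)

print("Positives in train:", int(y_tr.sum()), "in test:", int(y_te.sum()))
```

Whether this changes the reported metrics on the PaySim data is not something this sketch can show; it only illustrates the ordering.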

**Building Neural Network**

In [86]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout

#Building Neural Network
model = Sequential()

# 1st layer: input_dim=7 (one input per column of X), 12 nodes, ReLU
model.add(Dense(12, input_dim=7, kernel_initializer='random_uniform', activation='relu'))
model.add(Dropout(0.1))

# Three hidden layers of 512 nodes each, ReLU, each followed by dropout.
# The input is already 2-D (batch, features), so no Flatten layers are needed.
for _ in range(3):
    model.add(Dense(512, kernel_initializer='random_uniform', activation='relu'))
    model.add(Dropout(rate=0.1))

# output layer: dim=1, activation sigmoid
model.add(Dense(1, kernel_initializer='random_uniform', activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy',   
             optimizer='adam',
             metrics=['accuracy'])
model.summary()

In [87]:
model.fit(X_train, y_train, batch_size=100, epochs=2)

In [88]:
Y_pred = model.predict(X_test)
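
Because the output layer is a sigmoid, `model.predict` returns probabilities of shape `(n_samples, 1)` rather than class labels, so they need to be thresholded before they can be compared with `y_test`. The following is a minimal sketch using a hand-written array as a stand-in for `Y_pred`; the 0.5 cut-off and the `probs` and `y_true` values are assumptions for illustration.

```python
import numpy as np

# Hypothetical stand-in for the (n_samples, 1) array returned by model.predict
probs = np.array([[0.02], [0.51], [0.97], [0.49]])
y_true = np.array([0, 1, 1, 1])

# Flatten to 1-D and apply a 0.5 threshold to get hard class labels
labels = (probs.ravel() >= 0.5).astype(int)

# Recall on the positive class: true positives / all actual positives
true_pos = int(((labels == 1) & (y_true == 1)).sum())
recall_pos = true_pos / int((y_true == 1).sum())
print(labels, recall_pos)
```

On imbalanced data a lower threshold trades precision for recall, so the cut-off is worth tuning rather than fixing at 0.5.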

**Implementing Decision Tree Classifier**

In [89]:
from sklearn.tree import DecisionTreeClassifier

In [90]:
DecisionTree = DecisionTreeClassifier()
fit = DecisionTree.fit(X_train, y_train)
prediction = DecisionTree.predict(X_test)


In [91]:
from sklearn.metrics import confusion_matrix, roc_curve, auc, accuracy_score, classification_report
CM = confusion_matrix(y_test,prediction)
CR = classification_report(y_test,prediction)
fpr, recall, thresholds = roc_curve(y_test, prediction)
AUC = auc(fpr, recall)

In [92]:
sns.heatmap(CM, annot = True)

In [93]:
print('Classification Report:')
print(CR)

In [94]:
print("Area Under Curve:")
print(AUC)

In [95]:
print("Accuracy Score:",accuracy_score(y_test, prediction))
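
With so few fraud cases, accuracy alone says little, and an ROC curve computed from hard 0/1 predictions has only a single operating point. The following hedged sketch, on synthetic data from `make_classification`, scores a decision tree with `predict_proba` and reports ROC AUC together with average precision. The data, the `max_depth=4` setting and all variable names are assumptions, not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical imbalanced data: roughly 5% positives, 7 features
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=7, weights=[0.95, 0.05], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# A depth-limited tree yields graded leaf probabilities rather than pure 0/1 leaves
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_a, y_a)

# Probability of the positive class for each test row
scores = tree.predict_proba(X_b)[:, 1]

roc_auc = roc_auc_score(y_b, scores)
pr_auc = average_precision_score(y_b, scores)
print("ROC AUC:", roc_auc)
print("Average precision:", pr_auc)
```

Average precision summarizes the precision-recall curve and tends to be more informative than accuracy when positives are rare.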