The goal of this project is to detect fraudulent mobile money transactions. The dataset also carries a flag from the business model, which aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction. The dataset is publicly available at https://www.kaggle.com/ntnu-testimon/paysim1

The different columns in the data are as follows:
* step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps: 744 (described as a 30-day simulation, although 744 hours is 31 days).

* type - type of transaction: CASH_IN, CASH_OUT, DEBIT, PAYMENT or TRANSFER.

* amount - amount of the transaction in local currency.

* nameOrig - customer who started the transaction

* oldbalanceOrg - initial balance before the transaction

* newbalanceOrig - new balance after the transaction

* nameDest - customer who is the recipient of the transaction

* oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for recipients whose name starts with M (Merchants).

* newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for recipients whose name starts with M (Merchants).

* isFraud - marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.

* isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
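
As a rough illustration of the isFlaggedFraud rule described above, the sketch below recomputes such a flag on a few made-up rows. The 200,000 threshold comes from the description; restricting the rule to TRANSFER rows and the column name `flag_sketch` are assumptions made for this example, not something the dataset documentation is quoted on here.

```python
import pandas as pd

# Toy rows, invented for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [250000.0, 1500.0, 300000.0, 9839.64],
})

# Assumed rule: a TRANSFER of more than 200,000 in a single transaction
THRESHOLD = 200_000
is_transfer = toy["type"] == "TRANSFER"
is_large = toy["amount"] > THRESHOLD
toy["flag_sketch"] = (is_transfer & is_large).astype(int)

print(toy)
```

Only the first row is flagged: the large CASH_OUT is not a transfer, and the small TRANSFER is below the threshold.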

A **supervised algorithm** learns a function from labeled input data and predicts labels for unlabeled data. k-NN is a supervised algorithm that assumes similar data points lie close to each other, which you can remember as "birds of a feather flock together" :)

Classification and regression are examples of supervised learning; clustering is an example of unsupervised learning. k-NN is used here as a supervised classification algorithm. In k-NN, 'k' can take any value from 1 to n, where n is the number of training data points. The label of a new data point is predicted from the labels of the k data points closest to it, typically by majority vote. If k = n, all n data points are considered, so the prediction is always the majority label of the training data.
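
To make the voting idea concrete, here is a minimal from-scratch sketch of k-NN classification. It is not how scikit-learn implements it; the function name `knn_predict`, the Euclidean distance, and the toy points are all assumptions chosen for illustration.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data, invented for illustration
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = [0, 0, 0, 1, 1]

near_cluster_0 = knn_predict(points, labels, query=(1.1, 1.0), k=3)
near_cluster_1 = knn_predict(points, labels, query=(5.1, 5.0), k=1)
all_neighbors = knn_predict(points, labels, query=(5.1, 5.0), k=5)

print(near_cluster_0, near_cluster_1, all_neighbors)
```

With k equal to the number of training points, the last call ignores where the query sits and simply returns the majority label, which is why very large values of k tend to underfit.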

In [13]:
#Package Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

In [14]:
#Reading in the dataset
df = pd.read_csv('PS_20174392719_1491204439457_log.csv')

A DataFrame holds data in a two-dimensional tabular structure of rows and columns. It has three major elements: rows, columns, and data. To create a DataFrame from a dictionary of lists, all the lists must have the same length.

pandas is very useful for analyzing data. The head() method returns the first 5 rows of a DataFrame, and head(n) returns the first n rows.
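
A small sketch of both points, using a made-up dictionary rather than the real dataset:

```python
import pandas as pd

# Toy data, invented for illustration; every list has the same length
data = {
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [9839.64, 181.0, 181.0],
    "isFraud": [0, 1, 1],
}
toy = pd.DataFrame(data)

print(toy.shape)    # (rows, columns)
print(toy.head(2))  # first two rows only

# Lists of unequal length are rejected with a ValueError
try:
    pd.DataFrame({"a": [1, 2, 3], "b": [1, 2]})
    unequal_lengths_rejected = False
except ValueError as err:
    unequal_lengths_rejected = True
    print("ValueError:", err)
```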

In [15]:
#Viewing the first 5 rows of the dataset

df.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


A few columns in the dataset are redundant:
* nameOrig: This column is a unique identifier that belongs to each customer. Because it is only an identifier, it won't be of any use in fraud detection.
* nameDest: This column is also a unique identifier, here for the recipient of the transaction.
* isFlaggedFraud: This column flags a transaction as fraudulent if a person tries to transfer more than 200,000 in a single transaction. Since we already have a feature called isFraud that labels a transaction as fraud, this feature becomes redundant.

In [16]:
#Dropping the redundant features

df = df.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1)

axis = 1 refers to columns and axis = 0 refers to rows.
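
A short sketch of the difference on a made-up frame (the column values are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {"step": [1, 1, 2], "amount": [10.0, 20.0, 30.0], "isFraud": [0, 1, 0]}
)

# axis=1: drop a column by its name
without_column = toy.drop(["step"], axis=1)

# axis=0: drop a row by its index label
without_row = toy.drop([0], axis=0)

print(without_column.columns.tolist())
print(without_row.index.tolist())
```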

In [17]:
df.head()

Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud
0,1,PAYMENT,9839.64,170136.0,160296.36,0.0,0.0,0
1,1,PAYMENT,1864.28,21249.0,19384.72,0.0,0.0,0
2,1,TRANSFER,181.0,181.0,0.0,0.0,0.0,1
3,1,CASH_OUT,181.0,181.0,0.0,21182.0,0.0,1
4,1,PAYMENT,11668.14,41554.0,29885.86,0.0,0.0,0


In [18]:
#number of rows or entries in the dataset
df.shape[0]

6362620

The dataset contains about 6.36 million rows. To reduce the computational cost and running time, we reduce it to roughly 20,000 rows.

In [19]:
#Storing the fraudulent data into a dataframe
df_fraud = df[df['isFraud'] == 1]

#get the number of rows i.e. number of fraudulent data entries
df_fraud.shape[0]

8213

In [37]:
df_fraud.head()

Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud
2,1,TRANSFER,181.0,181.0,0.0,0.0,0.0,1
3,1,CASH_OUT,181.0,181.0,0.0,21182.0,0.0,1
251,1,TRANSFER,2806.0,2806.0,0.0,0.0,0.0,1
252,1,CASH_OUT,2806.0,2806.0,0.0,26202.0,0.0,1
680,1,TRANSFER,20128.0,20128.0,0.0,0.0,0.0,1


In [20]:
#Storing the non-fraudulent data into a dataframe 
df_nofraud = df[df['isFraud'] == 0]

#getting the number of non-fraudulent data entries
df_nofraud.shape[0]

6354407

In [38]:
df_nofraud.head()

Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud
0,1,PAYMENT,9839.64,170136.0,160296.36,0.0,0.0,0
1,1,PAYMENT,1864.28,21249.0,19384.72,0.0,0.0,0
4,1,PAYMENT,11668.14,41554.0,29885.86,0.0,0.0,0
5,1,PAYMENT,7817.71,53860.0,46042.29,0.0,0.0,0
6,1,PAYMENT,7107.77,183195.0,176087.23,0.0,0.0,0


In [21]:
#Keeping the first 12,000 rows of non-fraudulent data
df_nofraud = df_nofraud.head(12000)

#Joining both datasets together (12,000 non-fraudulent and 8,213 fraudulent entries)
df = pd.concat([df_fraud, df_nofraud], axis = 0) #concatenate row-wise
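
Note that head(12000) keeps the first 12,000 non-fraudulent rows, which all come from the earliest steps. If a random subset were preferred, one possible alternative (not what this notebook does) is `DataFrame.sample`. The sketch below uses tiny made-up frames; `toy_fraud` and `toy_nofraud` are hypothetical stand-ins for `df_fraud` and `df_nofraud`.

```python
import pandas as pd

# Toy stand-ins for df_fraud and df_nofraud, invented for illustration
toy_fraud = pd.DataFrame({"step": [1, 5, 9], "isFraud": [1, 1, 1]})
toy_nofraud = pd.DataFrame({"step": list(range(1, 11)), "isFraud": [0] * 10})

# Draw 4 non-fraud rows at random instead of taking the first 4
sampled_nofraud = toy_nofraud.sample(n=4, random_state=42)

# ignore_index=True gives the combined frame a clean 0..n-1 index
combined = pd.concat([toy_fraud, sampled_nofraud], axis=0, ignore_index=True)

print(combined["isFraud"].value_counts())
```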

One of the main constraints of scikit-learn is that its estimators cannot work with categorical columns stored as text; such columns must be converted to numbers first. For example, the type column has *five* categories:
* CASH_IN
* CASH_OUT
* DEBIT
* PAYMENT
* TRANSFER

The first step is to convert each category into a number: CASH_IN = 0, CASH_OUT = 1, DEBIT = 2, PAYMENT = 3, TRANSFER = 4.

*We can do this by using the following code:*

In [22]:
#Package Imports

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

#Converting the type column to categorical

df['type'] = df['type'].astype('category') #Cast a pandas object to a specified data type

#Integer encoding the 'type' column

type_encode = LabelEncoder()
df['type'] = type_encode.fit_transform(df.type)

Broadly there are two types of categorical variables:
* Nominal: Variables that don't have an inherent order, such as names or IDs. For example, a person's name or ID cannot be ranked against another person's name or ID.
* Ordinal: Variables that have an inherent order, such as rank or degree. For example, a Master's degree is considered a higher degree than a Bachelor's degree.

For ordinal variables, integer encoding is sufficient, whereas nominal variables require one-hot encoding; otherwise the model may wrongly assume that the categories have some order or hierarchy.
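
The two encodings can be sketched side by side on made-up data. The `degree` column and the mapping `degree_order` are hypothetical, introduced only to show an ordinal variable; `pd.get_dummies` is used here as a compact pandas alternative to the scikit-learn encoders in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "degree": ["Bachelors", "Masters", "Bachelors"],  # ordinal
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],      # nominal
})

# Ordinal: map each level to an integer that respects the order
degree_order = {"Bachelors": 0, "Masters": 1}
toy["degree_encoded"] = toy["degree"].map(degree_order)

# Nominal: one indicator column per category, with no implied order
type_dummies = pd.get_dummies(toy["type"], prefix="type")

print(toy)
print(type_dummies)
```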

In [23]:
#One hot encoding the 'type' column
type_one_hot = OneHotEncoder()
type_one_hot_encode = type_one_hot.fit_transform(df.type.values.reshape(-1,1)).toarray()

Because the categories were already converted to integers with LabelEncoder above, the 'type' column can be passed to OneHotEncoder directly.
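
In scikit-learn 0.20 and later, OneHotEncoder also accepts string categories, so the integer-encoding step is optional there. A minimal sketch on made-up values, assuming such a version is installed:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy 'type' values, invented for illustration
types = np.array(["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"]).reshape(-1, 1)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(types).toarray()  # sparse matrix -> dense array

# categories_ records which category each output column stands for
print(encoder.categories_[0])
print(encoded)
```

Reading `categories_` is also a way to give the one-hot columns meaningful names instead of type_0, type_1, and so on.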


In [24]:
#Adding the one hot encoded variables to the dataset 
ohe_variable = pd.DataFrame(type_one_hot_encode, columns = ["type_"+str(int(i)) for i in range(type_one_hot_encode.shape[1])])
df = pd.concat([df, ohe_variable], axis=1)

#Dropping the original type variable 
df = df.drop('type', axis = 1)

#Viewing the new dataframe after one-hot-encoding 
df.head()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,type_0,type_1,type_2,type_3,type_4
0,1.0,9839.64,170136.0,160296.36,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,1.0,1864.28,21249.0,19384.72,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,1.0,181.0,181.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3,1.0,181.0,181.0,0.0,21182.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
4,1.0,11668.14,41554.0,29885.86,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [25]:
#Checking every column for missing values

df.isnull().any()

step              True
amount            True
oldbalanceOrg     True
newbalanceOrig    True
oldbalanceDest    True
newbalanceDest    True
isFraud           True
type_0            True
type_1            True
type_2            True
type_3            True
type_4            True
dtype: bool

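We can see that every column has some missing values. They come from the concatenation step: ohe_variable has a fresh 0..n-1 index while df keeps its original row labels, so pd.concat with axis=1 aligns rows on labels and leaves NaN wherever a label exists on only one side. For now, we set the missing values to zero.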

In [26]:
#Imputing the missing values with a 0

df = df.fillna(0)
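
Missing values after a column-wise concat are often a sign that the two frames do not share the same index. The sketch below reproduces the effect on made-up frames and shows one possible way to avoid it, by building the second frame on the first frame's index. The names `left`, `right_default` and `right_aligned` are hypothetical.

```python
import pandas as pd

# Toy frame whose index labels are not 0..n-1, as happens after filtering rows
left = pd.DataFrame({"amount": [181.0, 2806.0, 9839.64]}, index=[2, 251, 0])

# Encoded columns built from a plain array get a default 0..n-1 index
right_default = pd.DataFrame({"type_0": [1.0, 1.0, 0.0]})

# Misaligned: concat matches rows by label, so unmatched labels become NaN
misaligned = pd.concat([left, right_default], axis=1)
print(misaligned.isnull().any())

# Aligned: reuse the left frame's index so rows match by position
right_aligned = pd.DataFrame({"type_0": [1.0, 1.0, 0.0]}, index=left.index)
aligned = pd.concat([left, right_aligned], axis=1)
print(aligned.isnull().any())
```

In the misaligned case the result also has more rows than either input, and rows that do share a label are paired with the wrong encoded values, so filling NaN with zero hides the problem rather than fixing it.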

To make later work easier, export this dataset as a .csv file and store it in the same directory as the Jupyter Notebook you are working in.

In [27]:
df.to_csv('fraud_prediction.csv')

Now we will apply the k-NN algorithm. In the dataset, the target variable is called isFraud and contains two labels: 0 if the transaction is not a fraud and 1 if the transaction is a fraud. The features are the remaining variables. We store these in two separate variables. .values is used to convert the features and target into NumPy arrays.

In [29]:
#Creating the features 

features = df.drop('isFraud', axis = 1).values
target = df['isFraud'].values

In [31]:
#split the features and target into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state = 42, stratify = target)

30% of the data is stored in the test set, while 70% is used for training. train_test_split shuffles the data before splitting, and there are two ways in which it can assign rows:

**Random Sampling** : Rows are assigned to the training and test sets at random, so the proportion of each label in y_train and y_test can differ.

**Stratified Sampling** : The split preserves the proportion of each target label in both the training and test sets. This is what stratify = target does above.
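
The difference shows up most clearly on an imbalanced target. The sketch below uses a made-up target with 10% positives; the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 90 zeros and 10 ones (invented for illustration)
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# Stratified split: each set keeps the 10% share of the positive class
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print("stratified:", y_tr.mean(), y_te.mean())

# Plain random split: the share of positives may differ between the sets
X_tr_r, X_te_r, y_tr_r, y_te_r = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print("random:    ", y_tr_r.mean(), y_te_r.mean())
```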

In [32]:
#implementing the knn algorithm
from sklearn.neighbors import KNeighborsClassifier

#Initializing the kNN classifier with 3 neighbors 

knn_classifier = KNeighborsClassifier(n_neighbors=3)

#Fitting the classifier on the training data 

knn_classifier.fit(X_train, y_train)

#Extracting the accuracy score from the test sets

knn_classifier.score(X_test, y_test)

0.9830667920978363

The .score() method returns the accuracy on the test data: the fraction of test rows classified correctly, a value between 0 and 1.

We obtained an accuracy score of 0.983.

**Fine tuning the parameters**

To avoid overfitting and underfitting, the number of nearest neighbors can be fine-tuned. GridSearchCV is used to do this.

**GridSearchCV** takes a grid of candidate values for the number of neighbors, or for any other hyperparameter that we want to optimize. It evaluates each value in the grid with cross-validation and reports the value that performs best. We can use GridSearchCV to find the optimal number of neighbors.

**Cross validation** is a technique in which the training data is first divided into parts (10 here). The first nine parts are used as the training set while the 10th part is used as the test set. In the second iteration, the first eight parts and the 10th part form the training set, while the ninth part is used as the test set. This process is repeated until every part has been used for testing once. The result is a more robust performance estimate, since every part of the data has been used for both training and testing.
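
The fold rotation can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's implementation: the helper `k_fold_indices` is hypothetical, it does not shuffle, and it holds out the first part first rather than the last, which does not change the idea.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of n_folds contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 10 samples split into 5 folds: every sample is tested exactly once
folds = list(k_fold_indices(n_samples=10, n_folds=5))
for train, test in folds:
    print("train:", train, "test:", test)
```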

In [33]:
import numpy as np
from sklearn.model_selection import GridSearchCV

#Initializing a grid with possible number of neighbors from 1 to 24

grid = {'n_neighbors' : np.arange(1, 25)}

#Initializing a k-NN classifier 

knn_classifier = KNeighborsClassifier()

#Using cross validation to find optimal number of neighbors 

knn = GridSearchCV(knn_classifier, grid, cv = 10) #10-fold cross validation

knn.fit(X_train, y_train)

#Extracting the accuracy score for optimal number of neighbors

knn.best_score_


0.9850813971070006

The number of neighbors is searched from 1 to 24 (np.arange(1, 25)). Increasing this range increases the running time. The best cross-validated accuracy, 0.985, is obtained with 1 neighbor. knn.best_params_ is used to extract the optimal number of neighbors and knn.best_score_ to obtain the best accuracy.

In [35]:
#Extracting the optimal number of neighbors
knn.best_params_

{'n_neighbors': 1}
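
Besides the single best value, the fitted search object keeps the mean cross-validated score of every candidate in `cv_results_`, which helps to see how close the alternatives are. A minimal sketch on synthetic data, where `X_toy` and `y_toy` are hypothetical stand-ins for X_train and y_train and the grid is kept small for speed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for X_train / y_train (invented for illustration)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": np.arange(1, 6)}, cv=5)
search.fit(X_toy, y_toy)

# Mean cross-validated accuracy for every candidate, not only the best one
results = search.cv_results_
for k, score in zip(results["param_n_neighbors"], results["mean_test_score"]):
    print(k, round(score, 3))

print(search.best_params_, round(search.best_score_, 3))
```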

When a new data point is given to the k-NN algorithm, it computes distances to find the points that are closest to it. If, for example, feature 1 ranges from 500 to 800 and feature 2 ranges from 0 to 10, feature 1 dominates the distance and the metric stops being meaningful. All features should be on a comparable scale so that each contributes evenly to the distance. One way to do this is to subtract the mean of each feature from its values and divide by the standard deviation of that feature. This is called standardization.
Standardized value = (Row value - Mean of the feature) / Standard deviation of the feature
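
As a quick check of the arithmetic, the sketch below standardizes two made-up features by hand and compares the result with StandardScaler, which divides by the standard deviation (population form) of each feature. The array `X` is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (invented for illustration)
X = np.array([[500.0, 1.0],
              [650.0, 5.0],
              [800.0, 9.0]])

# Manual standardization: subtract the column mean, divide by the column
# standard deviation (population form, ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler performs the same computation
scaled = StandardScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, scaled))
```

After standardization each column has mean 0 and standard deviation 1, so neither feature dominates the distance.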

In [36]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

#Setting up the scaling pipeline 

pipeline_order = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors = 1))]

pipeline = Pipeline(pipeline_order)

#Fitting the classifier to the scaled dataset 

knn_classifier_scaled = pipeline.fit(X_train, y_train)

#Extracting the score 

knn_classifier_scaled.score(X_test, y_test)

0.9960018814675446

After scaling, the test accuracy is 0.996, up from the best cross-validated accuracy of 0.985 without scaling :)
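
The two scores compared here were measured in different ways (one by cross-validation on the training set, the other on the held-out test set). One possible way to compare like with like is to run the unscaled and scaled models through the same cross-validation. The sketch below does this on synthetic data; `X_toy` and `y_toy` are hypothetical stand-ins, and no particular outcome is implied for the fraud dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the fraud features (invented for illustration)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_toy[:, 0] *= 1000  # put one feature on a much larger scale

unscaled = KNeighborsClassifier(n_neighbors=1)
scaled = Pipeline([("scaler", StandardScaler()),
                   ("knn", KNeighborsClassifier(n_neighbors=1))])

# Same 5-fold cross-validation for both models
unscaled_scores = cross_val_score(unscaled, X_toy, y_toy, cv=5)
scaled_scores = cross_val_score(scaled, X_toy, y_toy, cv=5)

print("unscaled:", unscaled_scores.mean())
print("scaled:  ", scaled_scores.mean())
```

Keeping the scaler inside the pipeline means it is refit on each training fold, so no information from the held-out fold leaks into the scaling.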