# Predicting categories with K-Nearest Neighbors

**Aim**: The aim of this notebook is to predict whether a mobile transaction is fraudulent, using the k-NN algorithm in scikit-learn.

## Table of contents

1. Data preparation
2. Implementing the k-NN algorithm
3. Fine-tuning parameters using GridSearchCV
4. Scaling

## Package Requirements

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

## Data preparation

In [3]:
#Reading in the dataset

#df = pd.read_csv('PS_20174392719_1491204439457_log.csv')
#df = pd.read_csv('transactions_train.csv')
df = pd.read_csv('fraud_prediction.csv')

In [4]:
#Viewing the data

df.head()

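| Unnamed: 0.1 | Unnamed: 0 | step | amount | oldbalanceOrig | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | type_0 | type_1 | type_2 | type_3 | type_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1.0 | 181.0 | 181.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 3 | 1.0 | 181.0 | 181.0 | 0.0 | 21182.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 251 | 1.0 | 2806.0 | 2806.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 252 | 1.0 | 2806.0 | 2806.0 | 0.0 | 26202.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 680 | 1.0 | 20128.0 | 20128.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |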


**Dropping the redundant features**

In [5]:
#Dropping the redundant features

#df = df.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1)
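#errors = 'ignore' skips any of these columns that are already absent from the loaded file
df = df.drop(['nameOrig', 'nameDest'], axis = 1, errors = 'ignore')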

In [5]:
#Inspecting the data

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27353 entries, 0 to 27352
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      27353 non-null  int64  
 1   step            27353 non-null  float64
 2   amount          27353 non-null  float64
 3   oldbalanceOrig  27353 non-null  float64
 4   newbalanceOrig  27353 non-null  float64
 5   oldbalanceDest  27353 non-null  float64
 6   newbalanceDest  27353 non-null  float64
 7   isFraud         27353 non-null  float64
 8   type_0          27353 non-null  float64
 9   type_1          27353 non-null  float64
 10  type_2          27353 non-null  float64
 11  type_3          27353 non-null  float64
 12  type_4          27353 non-null  float64
dtypes: float64(12), int64(1)
memory usage: 2.7 MB


**Reducing the size of the data**

In [7]:
#Storing the fraudulent data into a dataframe

df_fraud = df[df['isFraud'] == 1]

In [8]:
#Storing the non-fraudulent data into a dataframe 

df_nofraud = df[df['isFraud'] == 0]

In [9]:
#Storing 12,000 rows of non-fraudulent data

df_nofraud = df_nofraud.head(12000)

In [10]:
#Joining both datasets together 

df = pd.concat([df_fraud, df_nofraud], axis = 0)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19717 entries, 2 to 12071
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   step            19717 non-null  int64  
 1   type            19717 non-null  object 
 2   amount          19717 non-null  float64
 3   oldbalanceOrig  19717 non-null  float64
 4   newbalanceOrig  19717 non-null  float64
 5   oldbalanceDest  19717 non-null  float64
 6   newbalanceDest  19717 non-null  float64
 7   isFraud         19717 non-null  int64  
dtypes: float64(5), int64(2), object(1)
memory usage: 1.4+ MB


**Encoding the categorical feature**

In [12]:
#Converting the type column to categorical

df['type'] = df['type'].astype('category')

In [13]:
#Initializing the label encoder

type_encode = LabelEncoder()

In [14]:
#Integer encoding the 'type' column

df['type'] = type_encode.fit_transform(df.type)

In [15]:
df['type'].value_counts()

3    6732
1    5446
4    4983
0    2186
2     370
Name: type, dtype: int64

In [16]:
#One hot encoding the 'type' column

type_one_hot = OneHotEncoder()
type_one_hot_encode = type_one_hot.fit_transform(df.type.values.reshape(-1,1)).toarray()

In [17]:
#Adding the one hot encoded variables to the dataset 

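#index = df.index keeps the encoded rows aligned with df; concat on axis=1 matches rows by index label
ohe_variable = pd.DataFrame(type_one_hot_encode, columns = ["type_"+str(i) for i in range(type_one_hot_encode.shape[1])], index = df.index)
df = pd.concat([df, ohe_variable], axis=1)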

In [18]:
#Dropping the original type variable 

df = df.drop('type', axis = 1)
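
One detail worth checking in the encoding step: `pd.concat(..., axis=1)` matches rows by index label, so a one-hot frame carrying a fresh 0..n-1 index can introduce rows of missing values when `df` has a non-contiguous index (as it does after the fraud and non-fraud subsets are joined). The following is a minimal sketch of an index-preserving alternative using `pd.get_dummies`. The toy frame and its `type` labels are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with a non-contiguous index, mimicking df after the subsets are joined.
# The amounts and 'type' labels are hypothetical.
toy = pd.DataFrame(
    {"amount": [181.0, 2806.0, 20128.0], "type": ["TRANSFER", "CASH_OUT", "TRANSFER"]},
    index=[2, 251, 680],
)

# get_dummies returns a frame that shares toy's index, so concat lines up row by row.
dummies = pd.get_dummies(toy["type"], prefix="type", dtype=float)
encoded = pd.concat([toy.drop(columns="type"), dummies], axis=1)

print(encoded)
print("Missing values:", int(encoded.isnull().sum().sum()))
```

Because the dummy columns inherit the original index, no rows are added and nothing needs to be imputed afterwards.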

**Checking for missing values**

In [6]:
#Checking every column for missing values

df.isnull().any()

Unnamed: 0        False
step              False
amount            False
oldbalanceOrig    False
newbalanceOrig    False
oldbalanceDest    False
newbalanceDest    False
isFraud           False
type_0            False
type_1            False
type_2            False
type_3            False
type_4            False
dtype: bool

In [20]:
#Imputing the missing values with a 0

df = df.fillna(0)

In [22]:
#Checking if there are missing values left

df.isnull().any()

step              False
amount            False
oldbalanceOrig    False
newbalanceOrig    False
oldbalanceDest    False
newbalanceDest    False
isFraud           False
type_0            False
type_1            False
type_2            False
type_3            False
type_4            False
dtype: bool

**Exporting the dataset**

In [35]:
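#index = False keeps the row index out of the file, so no 'Unnamed: 0' column appears on reload
df.to_csv('fraud_prediction.csv', index = False)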
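
The `Unnamed: 0` column in the `df.info()` output near the top of this notebook is what a saved row index looks like once the file is read back. A small sketch of that round trip, using a hypothetical two-row frame and an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Hypothetical two-row frame with a non-default index
toy = pd.DataFrame({"amount": [181.0, 2806.0], "isFraud": [1, 1]}, index=[2, 251])

# Default behaviour: the index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False leaves the index out of the file.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

If the index does carry information that should survive, `pd.read_csv(..., index_col=0)` is the usual way to restore it as an index instead of a data column.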

## Implementing the k-NN algorithm

In [7]:
#Creating the features 

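#'Unnamed: 0' is a saved row index, not a transaction attribute, so it is excluded along with the target
features = df.drop(['isFraud', 'Unnamed: 0'], axis = 1, errors = 'ignore').values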
target = df['isFraud'].values

**Splitting the data into training and test sets**

In [8]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state = 42, stratify = target)

**Building the k-NN classifier**

In [9]:
knn_classifier = KNeighborsClassifier(n_neighbors=3)

In [10]:
knn_classifier.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=3)
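
For intuition about what a 3-nearest-neighbors classifier does at prediction time with scikit-learn's default settings (Euclidean distance, uniform weights), here is a from-scratch sketch in NumPy. The function `knn_predict` and the toy data are hypothetical, and this is not how scikit-learn implements the search internally; it only illustrates the idea of a majority vote among the closest training points.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict a label for each query row by majority vote among its k nearest training rows."""
    predictions = []
    for row in X_query:
        # Euclidean distance from this query row to every training row
        distances = np.sqrt(((X_train - row) ** 2).sum(axis=1))
        # Indices of the k closest training rows
        nearest = np.argsort(distances)[:k]
        # Majority vote over the neighbours' labels (ties go to the smallest label here)
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Tiny made-up dataset: two well-separated clusters in 2-D
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

queries = np.array([[0.05, 0.05], [5.0, 5.1]])
toy_predictions = knn_predict(X_toy, y_toy, queries, k=3)
print(toy_predictions)
```

Since every prediction depends on raw distances, features measured on large scales (such as balances) can dominate features on small scales (such as the one-hot columns), which is the motivation for the scaling section later on.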

In [11]:
knn_classifier.score(X_test, y_test)

0.9947599317572507
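
Accuracy alone does not show how well the fraudulent class is detected. `confusion_matrix` and `classification_report` are imported at the top of this notebook, and one way they might be used is sketched below. The data here is synthetic, generated with `make_classification` as a stand-in, so the numbers it prints say nothing about the fraud dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the fraud data (roughly 10% positives)
X, y = make_classification(n_samples=1000, n_features=6, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

model = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Precision, recall and F1 for each class
print(classification_report(y_te, y_pred))
```

In the notebook itself, the same two calls would take `y_test` and `knn_classifier.predict(X_test)` as arguments.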

## Fine-tuning parameters using GridSearchCV

In [12]:
#Initializing a grid with possible number of neighbors from 1 to 24

grid = {'n_neighbors' : np.arange(1, 25)}

#Initializing a k-NN classifier 

knn_classifier = KNeighborsClassifier()

#Using cross validation to find optimal number of neighbors 

knn = GridSearchCV(knn_classifier, grid, cv = 10)

knn.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24])})

In [13]:
#Extracting the optimal number of neighbors 

knn.best_params_

{'n_neighbors': 3}

In [14]:
#Extracting the mean cross-validated accuracy for the optimal number of neighbors

knn.best_score_

0.9957173335952483
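
`best_score_` reports only the winning candidate. The scores for every value of `n_neighbors` are stored in `cv_results_`, which can be useful for seeing how close the alternatives were. A sketch on synthetic data (an assumption, used so the example runs on its own; the grid and fold count are smaller than in the cell above):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the fraud dataset
X, y = make_classification(n_samples=500, n_features=6, random_state=42)

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": np.arange(1, 8)}, cv=5)
search.fit(X, y)

# One row per candidate value of n_neighbors, with the mean and spread across folds
results = pd.DataFrame(search.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score"]
]
print(results.sort_values("mean_test_score", ascending=False))
```

In the notebook, the same table could be built from `knn.cv_results_`.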

## Scaling

In [16]:
#Setting up the scaling pipeline 

pipeline_order = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors = 3))]

pipeline = Pipeline(pipeline_order)

#Fitting the classifier to the scaled dataset 

knn_classifier_scaled = pipeline.fit(X_train, y_train)

#Extracting the score 

knn_classifier_scaled.score(X_test, y_test)

0.9962222763831343
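
The pipeline above fixes `n_neighbors = 3`, a value that was selected on unscaled features. One possible extension is to run the grid search over the whole pipeline, so that the scaler is refit on the training portion of each fold and the number of neighbors is chosen on scaled data. The sketch below uses synthetic data and a shortened grid as assumptions; parameters of a pipeline step are addressed as `<step name>__<parameter name>`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the fraud dataset
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# 'knn__n_neighbors' targets the n_neighbors parameter of the step named 'knn'
param_grid = {"knn__n_neighbors": np.arange(1, 8)}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))
```

After fitting, `search` behaves like the best pipeline refit on all of the training data, so `search.score` and `search.predict` can be used directly on the test set.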