
This is my first experience with data visualization and machine learning. I come from a completely non-coding background, so my insights and the way I interpret things might differ from those of someone with a strong machine learning background. Your insights and feedback are welcome.

**Credit Card Fraud Detection**

Importing the required libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn as sk
import matplotlib.pyplot as plt

In [None]:
cc=pd.read_csv('../input/credit-card-fraud-detection/creditcard.csv')

In [None]:
cc.head()

In [None]:
cc.isna().sum()

In [None]:
cc.shape

### Analysing the DataSet

The dataset has 31 columns and 284,807 rows, and it has no null values.
Apart from the Time, Amount and Class columns, the columns (V1 to V28) are anonymised, so we do not know what they signify.

#### This is a supervised learning problem, since every transaction already has a label in the Class column

##### Our goal is to identify whether a transaction is fraudulent or not, so the Class column becomes the dependent variable, our Y variable.

##### This is a classification problem, so we will use classification algorithms to build the model.


## Let's begin by exploring the dataset

In [None]:
#Analysing the dependent or Y variable

cc['Class'].value_counts().plot(kind='bar')

In [None]:
cc['Class'].value_counts()

In [None]:
#share of each class as a percentage of all transactions
class_pct=cc['Class'].value_counts(normalize=True)*100
print(f"Non-fraudulent transactions: {class_pct[0]:.2f}%")
print(f"Fraudulent transactions: {class_pct[1]:.2f}%")

As we can see, almost all of the transactions (about 99.83%) are non-fraudulent and only about 0.17% are fraudulent. This means the dataset is heavily imbalanced.

If we train a model on this dataset as it is, the results can be misleading: the model can reach a very high accuracy simply by assuming that almost every transaction is non-fraudulent, while still missing most of the fraud.
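
To make that point concrete, here is a small sketch on made-up labels (not the real CSV). It assumes the class counts of this dataset, 284,315 non-fraud and 492 fraud, and uses scikit-learn's `DummyClassifier` as a stand-in "model" that always predicts the majority class. The names `y_demo`, `X_demo` and `baseline` are only for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels with the class counts assumed above:
# 284,315 non-fraud (0) and 492 fraud (1)
y_demo = np.array([0] * 284315 + [1] * 492)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; the dummy model ignores them

# A "model" that always predicts the most frequent class (non-fraud)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_hat = baseline.predict(X_demo)

# Accuracy looks excellent, yet not a single fraud is caught
print("Accuracy:", round(accuracy_score(y_demo, y_hat), 4))
print("Recall on fraud:", recall_score(y_demo, y_hat))
```

So a very high accuracy on its own tells us little here; the recall on the fraud class is what shows whether the model is useful.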



In [None]:
fig, ax = plt.subplots(1, 2, figsize=(18,4))

#Let's analyse the 'Amount' column
#histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(cc['Amount'],kde=True,ax=ax[0],color='r')
ax[0].set_title('Distribution of amount')

#now let's see the time distribution
sns.histplot(cc['Time'],kde=True,ax=ax[1],color='violet')
ax[1].set_title('Distribution of Time')

plt.show()

We can see that the distributions above are skewed, so we need to use techniques that reduce the skewness. We will see how to normalise the data at a later stage.
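
One common way to reduce right skew is a log transform. The sketch below is only an illustration on synthetic data: `amount_demo` is a made-up, right-skewed stand-in for `cc['Amount']`, and `np.log1p` is one possible choice of transform, not something used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up right-skewed values standing in for cc['Amount']
amount_demo = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=10_000))

# log1p computes log(1 + x), so amounts of exactly 0 are handled safely
log_amount = np.log1p(amount_demo)

skew_before = amount_demo.skew()
skew_after = log_amount.skew()
print("Skewness before:", round(skew_before, 2))
print("Skewness after log1p:", round(skew_after, 2))
```

A skewness close to 0 means the distribution is roughly symmetric, so a smaller value after the transform is what we are looking for.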

### Dealing with Imbalanced Data

As we have seen, the data is hugely imbalanced, which can bias the model towards the majority class. To make our model work accurately, we will need to balance the fraudulent and non-fraudulent transactions, which means we need an equal number of both classes.

To balance the data, we will take a sub-sample of both fraudulent and non-fraudulent transactions and try to build the prediction models on it. Since we have only 492 fraudulent transactions, we will randomly pick 492 non-fraudulent transactions to create a balanced sub-sample (a new DataFrame).



### Scaling
If we look at the data, all the other variables are already scaled except the Amount and Time columns, so we will first scale these two using StandardScaler.
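
To show what `StandardScaler` actually does, here is a tiny sketch on made-up numbers (the `values` array is hypothetical). It subtracts the column mean and divides by the column standard deviation, so the result has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single column; StandardScaler expects a 2-D array
values = np.array([[1.0], [2.0], [3.0], [10.0]])

scaled = StandardScaler().fit_transform(values)

# The same thing by hand: (x - mean) / std, using the population std (ddof=0)
manual = (values - values.mean()) / values.std()

print(np.allclose(scaled, manual))
print("Mean after scaling:", round(float(scaled.mean()), 6))
print("Std after scaling:", round(float(scaled.std()), 6))
```

Scaling changes the units of a column but not the shape of its distribution, so a skewed column stays skewed after this step.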



In [None]:
from sklearn.preprocessing import StandardScaler
std_slr= StandardScaler()
cc['sld_amt']=std_slr.fit_transform(cc['Amount'].values.reshape(-1,1))
cc['sld_time']=std_slr.fit_transform(cc['Time'].values.reshape(-1,1))

In [None]:
cc.head()

In [None]:
#dropping the old columns
cc.drop(['Amount','Time'],axis=1, inplace=True)
cc.head()

Now all the features are scaled.

Next we have to deal with the imbalanced data before we start building the model. We need an equal number of fraud and non-fraud transactions. To achieve this, we will use random sampling to build a balanced sub-sample of the two classes of the Y variable. This will give us a new dataset.

Remember, we have to test our model on the original dataset, not on the new dataset obtained through sampling.
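
One way to respect this rule is to split first and undersample only the training part, so the test rows keep the original class balance and never appear in training. The sketch below is an assumption on my side, not the exact flow used in this notebook, and it runs on a made-up frame `demo` with hypothetical columns `V1` and `V2`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Made-up imbalanced data: 5,000 rows, of which the first 50 are "fraud"
demo = pd.DataFrame({"V1": rng.normal(size=5000),
                     "V2": rng.normal(size=5000),
                     "Class": 0})
demo.loc[:49, "Class"] = 1

# 1) Split first; stratify keeps the fraud share similar in both parts
train_df, test_df = train_test_split(
    demo, test_size=0.2, random_state=1, stratify=demo["Class"])

# 2) Undersample the non-fraud rows in the training part only
fraud_train = train_df[train_df["Class"] == 1]
nonfraud_train = train_df[train_df["Class"] == 0].sample(
    n=len(fraud_train), random_state=1)
balanced_train = pd.concat([fraud_train, nonfraud_train]).sample(
    frac=1, random_state=1)  # shuffle the combined rows

print(balanced_train["Class"].value_counts())  # balanced training data
print(test_df["Class"].value_counts())         # test data stays imbalanced
```

The model would then be fitted on `balanced_train` and evaluated on `test_df`.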

### First, let's split the dataset

In [None]:
x=cc.drop('Class',axis=1)
y=cc['Class']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=1)


In [None]:
#Initially, let's see how a model built on the imbalanced dataset performs

#Using Naive Bayes Algorithm
from sklearn.naive_bayes import GaussianNB
model=GaussianNB()
model.fit(x_train,y_train)
y_pred=model.predict(x_test)
y_pred

In [None]:
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test,y_pred))

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier=KNeighborsClassifier(n_neighbors=5,metric='minkowski',p=2)
classifier.fit(x_train,y_train)
y_pred=classifier.predict(x_test) #predict with the fitted KNN classifier
y_pred

In [None]:
from sklearn.metrics import confusion_matrix,accuracy_score
acc=accuracy_score(y_test,y_pred)
cm=confusion_matrix(y_test,y_pred)
print("Accuracy:",acc)
print(cm)

##### Note that the accuracy scores above can be misleading, even at 98% or higher, since the data is hugely imbalanced.
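
To see why, here is a sketch with made-up predictions (not results from this notebook): 1,000 transactions, 10 of them fraud, and a hypothetical model that catches 4 frauds and raises 6 false alarms. Precision and recall on the fraud class tell a very different story from accuracy.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up ground truth: 990 non-fraud (0) followed by 10 fraud (1)
y_true = np.array([0] * 990 + [1] * 10)

# Made-up predictions: 984 correct non-fraud, 6 false alarms,
# then 4 frauds caught and 6 frauds missed
y_hat = np.array([0] * 984 + [1] * 6 + [1] * 4 + [0] * 6)

print("Accuracy:", accuracy_score(y_true, y_hat))    # 0.988
print("Precision:", precision_score(y_true, y_hat))  # 4 / (4 + 6) = 0.4
print("Recall:", recall_score(y_true, y_hat))        # 4 / (4 + 6) = 0.4
print(confusion_matrix(y_true, y_hat))               # rows: actual, columns: predicted
```

Here the accuracy is 98.8%, yet the model misses 6 out of 10 frauds, which is why the confusion matrix, precision and recall are worth checking alongside accuracy.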

As said earlier, we need to balance the samples. We have 492 fraud transactions, so we need only 492 non-fraud transactions.
We will use random undersampling to obtain the balanced sample.

### Random Undersampling

In [None]:
#shuffle the data before creating the sub-sample

cc=cc.sample(frac=1) #frac=1 returns every row in random order, i.e. it shuffles the dataset

#extract the fraudulent & non-fraudulent transactions
fraud_df=cc.loc[cc['Class']==1]
nonfraud_df=cc.loc[cc['Class']==0][:492] #using slicing method to select the 492 samples

#combine the datasets
new_df=pd.concat([fraud_df,nonfraud_df])
new_df.shape

In [None]:
#check the distribution of classes
new_df['Class'].value_counts()

In [None]:
#Preview the balanced data before checking how the variables are correlated with the Y variable.
new_df.head()


In [None]:
corr=new_df.corr()

plt.figure(figsize=(18,8))
corr["Class"].sort_values(ascending=True)[:-1].plot(kind="barh")
plt.title("Correlation of variables to Class")
plt.xlabel("Correlation to Class")
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(18,10))
sns.heatmap(corr,annot=False, cmap="Blues")
plt.title("Correlation matrix of the undersampled data")

In [None]:
#Let's see how the model performs with the undersampled data

x=new_df.drop('Class',axis=1)
y=new_df['Class']
from sklearn.model_selection import train_test_split
x_train1,x_test1,y_train1,y_test1=train_test_split(x,y,test_size=0.2,random_state=1)

In [None]:
#Using Naive Bayes Algorithm
from sklearn.naive_bayes import GaussianNB
model=GaussianNB()
model.fit(x_train1,y_train1)
y_pred1=model.predict(x_test1)
y_pred1

In [None]:
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test1,y_pred1))

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier=KNeighborsClassifier(n_neighbors=5,metric='minkowski',p=2)
classifier.fit(x_train1,y_train1)
y_pred1=classifier.predict(x_test1) #predict with the fitted KNN classifier
y_pred1

In [None]:
from sklearn.metrics import confusion_matrix,accuracy_score
acc=accuracy_score(y_test1,y_pred1)
cm=confusion_matrix(y_test1,y_pred1)
print("Accuracy:",acc)
print(cm)

As you can see, the accuracy on the undersampled data is lower than on the original data.

There is a catch: we used the undersampled test data to evaluate our model. This can be misleading too, so we need to test our model on the original data.

In [None]:
#testing on the original Dataset

#Using Naive Bayes Algorithm
from sklearn.naive_bayes import GaussianNB
model=GaussianNB()
model.fit(x_train1,y_train1)
y_pred2=model.predict(x_test)
y_pred2

In [None]:
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test,y_pred2))

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier=KNeighborsClassifier(n_neighbors=5,metric='minkowski',p=2)
classifier.fit(x_train1,y_train1)
y_pred2=classifier.predict(x_test) #predict with the fitted KNN classifier
y_pred2

In [None]:
from sklearn.metrics import confusion_matrix,accuracy_score
acc=accuracy_score(y_test,y_pred2)
cm=confusion_matrix(y_test,y_pred2)
print("Accuracy:",acc)
print(cm)

I have not yet reduced the skewness of the Amount and Time columns (*which I said I would do at a later stage ;)*). Doing so might help build a more accurate model and reduce the overfitting problem, but I'm not sure. Can anyone help me with some insights?


Hope you like my first Notebook. Your comments & feedback are welcome.

**P.S.: I referred to multiple notebooks to get started; thanks to each of their authors.**