# ***FIC4020 assignment by Jeet Gohil - 657076***
## In your Notebook do the following:
1. Import the credit card fraud data set
1. Plot histograms for the frequency/number of fraudulent and non-fraudulent transactions against Amount
1. Draw boxplots showing summary statistics for the Amount column
1. Generate a correlation matrix illustrating using a heatmap the relationship between the different variables
1. Generate a scatterplot for Amount and V2 showing a line of best fit using the equation of a straight line is y = mx + c, where m is the slope of the line and c is the y intercept
1. Build an outlier detection model for your data using the Isolation Forest and the Local Outlier Factor
1. Analyze the models using Errors, Confusion Matrix, Accuracy Score and Classification Report to identify the strengths and weaknesses of the models
1. Discuss as a conclusion the best model and how to use it in the future in identifying fraudulent credit card transactions

# Content of the dataset

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
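
As a quick check of the imbalance figures quoted above, the following sketch recomputes the fraud share from the two counts. It also works out the accuracy that a trivial model would reach by labelling every transaction as non-fraudulent. That baseline is an illustrative addition, not part of the dataset description, but it is a useful yardstick when reading accuracy scores on data this unbalanced.

```python
# Counts quoted in the dataset description
n_fraud = 492
n_total = 284_807

# Share of fraudulent transactions, as a percentage
fraud_pct = 100 * n_fraud / n_total

# Accuracy of a trivial baseline that predicts "not fraud" for every row
baseline_accuracy_pct = 100 * (n_total - n_fraud) / n_total

print(f"Fraud share:       {fraud_pct:.3f}%")
print(f"Baseline accuracy: {baseline_accuracy_pct:.3f}%")
```

The fraud share comes out at roughly 0.17%, so any model has to beat a baseline accuracy of about 99.8% before its accuracy score says anything about fraud detection.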

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes the value 1 in case of fraud and 0 otherwise.

# Importing the credit card fraud data set

In [None]:
import numpy as np
import pandas as pd
import sklearn
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report,accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from pylab import rcParams
rcParams['figure.figsize'] = 14, 8
RANDOM_SEED = 42
LABELS = ["Normal", "Fraud"]

In [None]:
data = pd.read_csv('../input/credit-card-fraud-detection/creditcard.csv',sep=',')
data.head()

In [None]:
data.info()


# Exploring the data we imported

First, we need to check whether the dataset has any missing values. Note that pandas only detects standard missing values (`NaN`/`None`); placeholder values such as empty strings or sentinel numbers are not flagged.

In [None]:
data.isnull().values.any()
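
To illustrate the limitation of the null check, here is a minimal sketch on a small made-up frame. The values and the `-999` sentinel are hypothetical and are not taken from the credit card data. `isnull()` flags `NaN`, but a placeholder value passes through unnoticed unless it is converted first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one true NaN and one sentinel value (-999)
toy = pd.DataFrame({
    "Amount": [12.5, np.nan, 30.0, -999.0],
    "Class": [0, 0, 1, 0],
})

# Standard missing values are counted per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# The sentinel is not counted until it is replaced with NaN
cleaned = toy.replace(-999.0, np.nan)
missing_after_cleaning = cleaned.isnull().sum()
print(missing_after_cleaning)
```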

# Visualization

A graph/chart is one of the best methods to understand the data that we have.

We will start by analyzing how many of the cases in this dataset are fraudulent and how many are not.

## Plot histograms for the frequency/number of fraudulent and non-fraudulent transactions against Amount

In [None]:
# Series.value_counts replaces the deprecated top-level pd.value_counts
count_classes = data['Class'].value_counts(sort=True)

count_classes.plot(kind = 'bar', rot=0)

plt.title("Transaction Class Distribution")

plt.xticks(range(2), LABELS)

plt.xlabel("Class")

plt.ylabel("Frequency")

Looking at the bar chart above, we can see that fraud cases are very few compared to the enormous number of non-fraudulent cases.

In [None]:
fraud = data[data['Class']==1]

normal = data[data['Class']==0]

In [None]:
print(fraud.shape,normal.shape)

We need to extract more information from the transaction data.
How different are the amounts of money involved in the two transaction classes?

In [None]:
fraud.Amount.describe()

In [None]:
normal.Amount.describe()

In [None]:
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
f.suptitle('Amount per transaction by class')
bins = 50
ax1.hist(fraud.Amount, bins = bins)
ax1.set_title('Fraud')
ax2.hist(normal.Amount, bins = bins)
ax2.set_title('Normal')
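# Label and log-scale both panels; plt.ylabel/plt.yscale alone only affect the last axes (ax2)
for ax in (ax1, ax2):
    ax.set_ylabel('Number of Transactions')
    ax.set_yscale('log')
ax2.set_xlabel('Amount ($)')
ax2.set_xlim((0, 20000))
plt.show()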

# Draw boxplots showing summary statistics for the Amount column

A box plot, also known as a **box-and-whisker plot**, summarizes a set of data values through its minimum, first quartile, median, third quartile and maximum. A box is drawn from the first quartile to the third quartile, with a line through the box at the median. In the plots below, the x-axis shows the transaction class and the y-axis shows the Amount; the left plot includes the outliers and the right plot hides them.
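
The summary values behind a box plot can be computed directly. The sketch below uses a handful of made-up amounts (not the real data). It assumes the common convention, which is also the default in matplotlib and seaborn, that points more than 1.5 times the interquartile range beyond the quartiles are drawn as individual outliers.

```python
import numpy as np

# Hypothetical amounts, with one unusually large value
amounts = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, median, q3 = np.percentile(amounts, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (the usual box plot convention)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn individually as outliers ("fliers")
outliers = amounts[(amounts < lower_fence) | (amounts > upper_fence)]

print("min, Q1, median, Q3, max:", amounts.min(), q1, median, q3, amounts.max())
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Here only the value 100 falls outside the fences, which is the kind of point that `showfliers=False` hides.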

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12,6))
s = sns.boxplot(ax = ax1, x="Class", y="Amount", hue="Class",data=data, palette="PRGn",showfliers=True, showmeans=True)
s = sns.boxplot(ax = ax2, x="Class", y="Amount", hue="Class",data=data, palette="PRGn",showfliers=False, showmeans=True)
plt.show();

# Generate a correlation matrix and use a heatmap to illustrate the relationship between the different variables

In [None]:
## Take some sample of the data

data1= data.sample(frac = 0.1,random_state=1)

data1.shape

In [None]:
data.shape

### Determine the number of fraud and valid transactions in the sample


In [None]:
Fraud = data1[data1['Class']==1]

Valid = data1[data1['Class']==0]

outlier_fraction = len(Fraud)/float(len(Valid))

In [None]:
print(outlier_fraction)

print("Fraud Cases : {}".format(len(Fraud)))

print("Valid Cases : {}".format(len(Valid)))

## Correlation Matrix

In [None]:
import seaborn as sns
#get correlations of each features in dataset
corrmat = data1.corr()
top_corr_features = corrmat.index

### Plotting the Heatmap

In [None]:
plt.figure(figsize=(20,20))
g=sns.heatmap(data[top_corr_features].corr(),annot=True,cmap="RdYlGn")
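
A large annotated heatmap can be hard to read, so it may help to also rank the features by how strongly they correlate with `Class`. The following is a minimal sketch on a made-up frame; the column names `V_a` and `V_b` and the helper `rank_by_correlation` are hypothetical and only stand in for the real columns.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "V_a": [1, 2, 3, 10, 11],   # rises sharply for the fraud rows
    "V_b": [5, 4, 5, 4, 5],     # barely related to the label
    "Class": [0, 0, 0, 1, 1],
})

def rank_by_correlation(df, target="Class"):
    """Return features ordered by absolute Pearson correlation with `target`."""
    corr_with_target = df.corr()[target].drop(target)
    order = corr_with_target.abs().sort_values(ascending=False).index
    return corr_with_target.loc[order]

ranking = rank_by_correlation(toy)
print(ranking)
```

On the real data the same helper could be called as `rank_by_correlation(data1)`, assuming `data1` is the sampled frame defined earlier.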

# Generating a scatterplot for Amount and V2 showing a line of best fit, using the equation of a straight line y = mx + c, where m is the slope of the line and c is the y-intercept

## Get the Fraud and the normal dataset

In [None]:
fraud = data[data['Class']==1]

normal = data[data['Class']==0]

In [None]:
print(fraud.shape,normal.shape)

In [None]:
normal.Amount.describe()

In [None]:
m, b = np.polyfit(fraud.V2, fraud.Amount, 1)
plt.plot(fraud.V2, fraud.Amount, 'o')
plt.plot(fraud.V2, m*fraud.V2 + b, color="red")
plt.xlabel("V2")
plt.ylabel("Amount")
plt.show()

In [None]:
m, b = np.polyfit(normal.V2, normal.Amount, 1)
plt.plot(normal.V2, normal.Amount, 'o')
plt.plot(normal.V2, m*normal.V2 + b, color="red")
plt.xlabel("V2")
plt.ylabel("Amount")
plt.show()
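
The cells above obtain the slope and intercept from `np.polyfit(x, y, 1)` (named `m` and `b` there; the intercept is called `c` in the heading). As a sketch of what that fit computes, the least-squares slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The points below are made up so that the answer is known in advance.

```python
import numpy as np

# Hypothetical points lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Closed-form least-squares estimates
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
c = y_mean - m * x_mean

# np.polyfit with degree 1 returns [slope, intercept]
m_fit, c_fit = np.polyfit(x, y, 1)

print("closed form:", m, c)
print("np.polyfit: ", m_fit, c_fit)
```

Both approaches recover a slope of 2 and an intercept of 1 for these points.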

# Building an outlier detection model for the data using the Isolation Forest and the Local Outlier Factor classifiers

In [None]:
#Create independent and Dependent Features
columns = data1.columns.tolist()
# Filter the columns to remove data we do not want 
columns = [c for c in columns if c not in ["Class"]]
# Store the variable we are predicting 
target = "Class"
# Define a random state 
state = np.random.RandomState(42)
X = data1[columns]
Y = data1[target]
X_outliers = state.uniform(low=0, high=1, size=(X.shape[0], X.shape[1]))
# Print the shapes of X & Y
print(X.shape)
print(Y.shape)

In [None]:
classifiers = {
    "Isolation Forest":IsolationForest(n_estimators=100, max_samples=len(X), 
                                       contamination=outlier_fraction,random_state=state, verbose=0),
    "Local Outlier Factor":LocalOutlierFactor(n_neighbors=20, algorithm='auto', 
                                              leaf_size=30, metric='minkowski',
                                              p=2, metric_params=None, contamination=outlier_fraction),
}

In [None]:
type(classifiers)
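
To see what the `contamination` parameter does in both detectors, here is a minimal sketch on synthetic two-dimensional data (made up for illustration, not the credit card features). Both models label inliers as `+1` and outliers as `-1`, and the share of rows flagged on the training data is close to the contamination value.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)

# Hypothetical data: 190 points near the origin plus 10 scattered far away
inliers = rng.normal(loc=0.0, scale=1.0, size=(190, 2))
scattered = rng.uniform(low=6.0, high=10.0, size=(10, 2))
X_toy = np.vstack([inliers, scattered])

contamination = 0.05  # expected share of outliers (10 of 200)

iso = IsolationForest(n_estimators=100, contamination=contamination, random_state=0)
lof = LocalOutlierFactor(n_neighbors=20, contamination=contamination)

# Both return +1 for inliers and -1 for outliers
iso_labels = iso.fit_predict(X_toy)
lof_labels = lof.fit_predict(X_toy)

print("Isolation Forest flagged:", int((iso_labels == -1).sum()))
print("Local Outlier Factor flagged:", int((lof_labels == -1).sum()))
```

Because the threshold is derived from `contamination`, an over- or under-estimate of the fraud share directly changes how many transactions are flagged.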

# Analyzing the models using Errors, Confusion Matrix, Accuracy Score and Classification Report to identify the strengths and weaknesses of the models

## Classification Report

In [None]:
from sklearn.metrics import confusion_matrix
n_outliers = len(Fraud)
LABELS = ["Nonfraudulent", "Fraudulent"]

for i, (clf_name,clf) in enumerate(classifiers.items()):
    #Fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_prediction = clf.negative_outlier_factor_
    elif clf_name == "Isolation Forest":
        clf.fit(X)
        scores_prediction = clf.decision_function(X)
        y_pred = clf.predict(X)
    else:
        print("No other model")
        continue  # skip the metrics below when no predictions were made
    
    #Reshape the prediction values to 0 for Valid transactions , 1 for Fraud transactions
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    n_errors = (y_pred != Y).sum()
    # Run Classification Metrics
    print("{}: {}".format(clf_name,n_errors))
    print("Accuracy Score :")
    print(accuracy_score(Y,y_pred))
    print("Classification Report :")
    print(classification_report(Y,y_pred))
    conf_matrix = confusion_matrix(Y, y_pred)
    sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt='d');
    plt.title('Confusion matrix for ' + clf_name)
    plt.ylabel('True class')
    plt.xlabel('Predicted class')
    plt.show()
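
To make the metrics in the report easier to interpret, the sketch below works through a small made-up example (the labels are hypothetical, not model output from this notebook). It maps the detectors' `+1`/`-1` output to the dataset's `0`/`1` labels and shows how accuracy can stay high while precision and recall for the fraud class are low.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical ground truth: 95 normal rows followed by 5 fraud rows
y_true = np.array([0] * 95 + [1] * 5)

# Hypothetical raw detector output in scikit-learn's convention (+1 inlier, -1 outlier)
raw = np.ones(100, dtype=int)
raw[0] = -1    # a normal transaction wrongly flagged (false positive)
raw[95] = -1   # a fraud correctly flagged (true positive)

# Map to the dataset's labels: 0 = normal, 1 = fraud
y_hat = np.where(raw == -1, 1, 0)

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
accuracy = accuracy_score(y_true, y_hat)
precision = precision_score(y_true, y_hat)   # TP / (TP + FP)
recall = recall_score(y_true, y_hat)         # TP / (TP + FN)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

In this toy case the accuracy is 0.95 even though only one of the five frauds is caught (recall 0.2), which is why the fraud-class precision and recall matter more than accuracy on unbalanced data.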

# Conclusion

1. Isolation Forest made 73 errors, while the Local Outlier Factor model made 97. On this count the Isolation Forest is the better of the two. It also had a higher accuracy: 99.74% versus 99.65% for the Local Outlier Factor.
1. Comparing precision and recall for the two models, the Isolation Forest performed much better than the LOF: it detected around 27% of the fraud cases, versus just 2% for the LOF. Isolation Forest correctly detected 13 fraudulent transactions, while LOF detected only 1.
1. We can improve on these results by increasing the sample size or by using deep learning algorithms, at the cost of greater computational expense. We can also use more complex anomaly detection models to detect more fraudulent cases.
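
As a sketch of how the preferred model could be used on future transactions, the example below fits an Isolation Forest once on historical data and then scores rows it has not seen. The data and the two features are synthetic and hypothetical, and the `contamination` value is an assumption chosen for illustration. Note that `LocalOutlierFactor` can only score unseen rows when it is created with `novelty=True`, whereas `IsolationForest.predict` works on new data directly.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Hypothetical historical transactions (two made-up features)
X_history = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Fit once on historical data
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
model.fit(X_history)

# Score new, unseen transactions: one typical, one far from the training data
X_new = np.array([[0.1, -0.2], [8.0, 8.0]])
raw = model.predict(X_new)               # +1 = inlier, -1 = outlier
is_flagged = np.where(raw == -1, 1, 0)   # 1 = suspected fraud
scores = model.decision_function(X_new)  # lower = more anomalous

print(is_flagged, scores)
```

In practice the flagged transactions would be passed on for manual review, since the precision observed above means many flags will be false alarms.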