# Logistic Regression for Credit Card Fraud Detection 
Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include the confidential variables `V1` through `V28` as well as `Amount`, the amount of the transaction.
 
The target is stored in the `class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

### Loading the data 
Load the data from `fraud_data.csv`.

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv('fraud_data.csv')

## Print the percentage of fraud observations

X = data.iloc[:,:-1]
y = data.iloc[:,-1]

# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print('Percentage of observations that are instances of fraud: ', len(y[y[:]==1])/len(y))

Percentage of observations that are instances of fraud:  0.016410823768035772


**Question:** What percentage of the observations in the dataset are instances of fraud?

- About 1.64% of the observations are instances of fraud. The printed value, 0.0164, is a fraction rather than a percentage.

### Predictions using the majority class label 

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? (Here accuracy is the ratio of the number of correctly classified transactions to the total number of transactions.)

In [2]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
    
## Instantiate and fit a dummy classifier that always predicts the majority class of the training data
## Use DummyClassifier in sklearn with strategy 'most_frequent'
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

dummy_test_pred = dummy.predict(X_test)

## Measure test accuracy of your dummy classifier
dummy_test_acc = accuracy_score(y_test, dummy_test_pred)

print('Dummy classifier accuracy:', dummy_test_acc)

Dummy classifier accuracy: 0.9852507374631269


**Question:** *How does the accuracy of the dummy classifier look (very low, low, high, very high)? Give an explanation.*

- The accuracy of the dummy classifier is approximately 98.5%. Taken at face value this is very high: the dummy classifier labels 98.5% of the test transactions correctly.
- However, only about 1.6% of the observations are instances of fraud, so the vast majority are not fraud. Because the dummy classifier always predicts the majority class of the training data, it is bound to be right on almost every transaction. Its accuracy simply reflects the share of non-fraud transactions in the test set and says nothing about its ability to detect fraud.
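The point above can be checked with a few lines of arithmetic. The following is a minimal sketch on made-up labels (not `fraud_data.csv`); the names `y_true`, `majority_label`, and `majority_share` are illustrative. It shows that a classifier which always predicts the majority class has an accuracy equal to the share of that class.

```python
# Toy labels: 1 = fraud, 0 = not fraud (made-up data, not fraud_data.csv)
y_true = [0] * 985 + [1] * 15

# A majority-class predictor outputs the most common label for every row
majority_label = max(set(y_true), key=y_true.count)
y_pred = [majority_label] * len(y_true)

# Accuracy: correctly classified rows divided by all rows
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Share of the majority class in the labels
majority_share = y_true.count(majority_label) / len(y_true)

print(accuracy, majority_share)  # the two values are identical
```

On heavily imbalanced data this baseline accuracy is close to 1, which is why accuracy alone is a weak yardstick here.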

**Question:** *What fraction of the fraudulent transactions is correctly classified? (This is the **recall** score/measure)*

In [3]:
from sklearn.metrics import recall_score

## Measure test recall score of your dummy classifier
dummy_test_recall = recall_score(y_test, dummy_test_pred)

print('Dummy classifier recall:', dummy_test_recall)

Dummy classifier recall: 0.0


**Question:** *How does the recall of the dummy classifier look (very low, low, high, very high)? Give an explanation.*

- The recall score of the dummy classifier is 0, which is very low. This happens because the dummy classifier always predicts the majority class of the training data, which is 'not fraud' in our case.
- The dummy classifier therefore never assigns a 'fraud' label to any observation in the test data. With no predicted frauds there are no true positives, so the recall score is 0.
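A confusion matrix makes this visible. The sketch below uses a small made-up dataset (`X_toy`, `y_toy` are illustrative names, not the notebook's data) and the same `DummyClassifier` strategy as above. Every fraud row ends up in the false-negative cell.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix, recall_score

# Made-up imbalanced data: 95 legitimate rows (0) and 5 fraud rows (1)
X_toy = np.zeros((100, 1))  # DummyClassifier ignores the feature values
y_toy = np.array([0] * 95 + [1] * 5)

clf = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = clf.predict(X_toy)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_toy, pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

toy_recall = recall_score(y_toy, pred)  # TP / (TP + FN)
print(cm)
print("recall:", toy_recall)
```

Since the second column of the matrix (predicted fraud) is empty, `tp` is 0 and recall is 0 regardless of how many fraud rows exist.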

### Training a logistic regression model 

Train a logistic regression classifier using `X_train` and `y_train`, keeping the default parameters except for `max_iter`, which the code below sets to 1000.

In [4]:
from sklearn.linear_model import LogisticRegression
    
## Instantiate a logistic regression model and fit to the training data
logR = LogisticRegression(max_iter=1000)
logR.fit(X_train, y_train)

logR_test_pred = logR.predict(X_test)

## Measure test accuracy 
logR_test_acc = accuracy_score(y_test, logR_test_pred)

print('Logistic classifier accuracy:', logR_test_acc)

## Measure test recall
logR_test_recall = recall_score(y_test, logR_test_pred)

print('Logistic classifier recall:', logR_test_recall)

Logistic classifier accuracy: 0.9964970501474927
Logistic classifier recall: 0.7875


**Question:** *Compare the results of logistic regression with those of the above dummy classifier*

- Accuracy Score - The accuracy of the Logistic Regression model is higher than that of the Dummy Classifier: about 99.6% of its test predictions are correct, compared with about 98.5% for the Dummy Classifier.

- Recall Score - The recall of the Logistic Regression model is far higher than that of the Dummy Classifier. Recall is the number of correct positive predictions divided by the number of actual positives, that is, the fraction of true fraud cases that the model flags as fraud. The Logistic Regression model has a recall of about 79% (0.7875), whereas the Dummy Classifier has a recall of 0. The Logistic Regression model therefore identified about 79% of the fraud cases, while the Dummy Classifier identified none, because it labels every case as 'not fraud', the majority class.
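Recall is easy to confuse with precision, so a short worked example may help. The counts below are hypothetical and chosen only for illustration; `recall` and `precision` are small helper functions written for this sketch, not scikit-learn functions.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)


# Hypothetical counts: 63 frauds caught, 17 frauds missed, 7 false alarms
tp, fn, fp = 63, 17, 7

print("recall:", recall(tp, fn))        # 63 / 80 = 0.7875
print("precision:", precision(tp, fp))  # 63 / 70 = 0.9
```

The two metrics share a numerator but differ in the denominator: recall divides by the actual frauds, precision by the predicted frauds.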

### Grid search for selecting hyperparameters for Logistic Regression

Perform a grid search over the parameters listed below for a Logistic Regression classifier, using recall for scoring and 3-fold cross-validation (`cv=3`).

`'penalty': ['l1', 'l2']`

`'C':[0.01, 0.1, 1, 10, 100]`

In [8]:
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GridSearchCV

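## Define the grid of logistic regression parameters
## Note: the 'l1' penalty needs a solver that supports it (e.g. solver='liblinear');
## recent scikit-learn versions default to 'lbfgs', which does not support 'l1'
parameters = [{'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}]
model = LogisticRegression()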
    
## Perform grid search CV to find best model parameter setting
cmodel = GridSearchCV(model, parameters, cv=3, scoring='recall')
cmodel.fit(X_train, y_train)

## Fit logistic regression with best parameters to the entire training data
model = LogisticRegression(**cmodel.best_params_)
model.fit(X_train, y_train)
    
logR_test_pred = model.predict(X_test)

## Measure test accuracy
logR_test_acc = accuracy_score(y_test, logR_test_pred)

print('Logistic classifier accuracy:', logR_test_acc)

## Measure test recall
logR_test_recall = recall_score(y_test, logR_test_pred)

print('Logistic classifier recall:', logR_test_recall)

Logistic classifier accuracy: 0.9966814159292036
Logistic classifier recall: 0.8


**Question:** *Compare the results with that of logistic regression with default parameters*

- Accuracy Score - The accuracy of the Logistic Regression model with default parameters and with the parameters selected by GridSearchCV is almost the same: about 99.65% versus 99.67%.
- Recall Score - The recall rises from 0.7875 with default parameters to 0.8 with the parameters selected by GridSearchCV, a difference of 1.25 percentage points. This suggests that the grid-searched model is slightly better at flagging fraud cases as fraud, although the gap is small and is measured on a single test split.
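The grid-search workflow can be sketched end to end on synthetic data. This is an illustration under stated assumptions, not the notebook's exact setup: the data comes from `make_classification` rather than `fraud_data.csv`, and the solver is set to `'liblinear'` because it supports both the `'l1'` and `'l2'` penalties (in recent scikit-learn versions the default `'lbfgs'` solver does not support `'l1'`). Names such as `X_syn`, `search`, and `best` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (roughly 10% positives), standing in for fraud_data.csv
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10, 100]}

# Assumption: liblinear is chosen so that both penalties in the grid can be fit
search = GridSearchCV(
    LogisticRegression(solver="liblinear"), param_grid, cv=3, scoring="recall"
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refit on all of X_tr
best = search.best_estimator_
syn_recall = recall_score(y_te, best.predict(X_te))

print(search.best_params_)
print("test recall:", syn_recall)
```

Using `search.best_estimator_` avoids constructing and fitting a second model by hand, since the search refits the best configuration on the full training data by default.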