# Logistic Regression for Credit Card Fraud Detection (10 pts)

Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include confidential variables `V1` through `V28` as well as `Amount` which is the amount of the transaction. 
 
The target is stored in the `class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

### Loading the data (1 pt)
Load the data from `fraud_data.csv`.

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv('./fraud_data.csv')

## Print the percentage of fraud observations

X = data.iloc[:,:-1]
y = data.iloc[:,-1]

# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)  # Your code here

**Question:** What percentage of the observations in the dataset are instances of fraud?

In [10]:
len(y[y == 1])/len(y)

0.016410823768035772

***1.64% of the observations are instances of fraud.***

### Predictions using the majority class label (4pts)

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? (Here accuracy is the ratio of the number of correctly classified transactions to the total number of transactions)

In [11]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
    
## Instantiate and fit a dummy classifier that always predicts the majority class of the training data
## Use DummyClassifier in sklearn with strategy 'most_frequent'
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
dummy_test_pred = dummy.predict(X_test)

## Measure test accuracy of your dummy classifier
dummy_test_acc = accuracy_score(y_test, dummy_test_pred)

print('Dummy classifier accuracy:', dummy_test_acc)

Dummy classifier accuracy: 0.9852507374631269


**Question:** *How does the accuracy of the dummy classifier look (very low, low, high, very high)?* ***The accuracy looks very high (about 98.5% on the test set). This is because fraud is rare: always predicting the majority class (not fraud) is correct for every non-fraud transaction, and those make up almost all of the data.***
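
The reasoning above can be checked by hand. The sketch below uses made-up toy labels (an assumption for illustration, not the contents of `fraud_data.csv`) and plain Python instead of `DummyClassifier`. It shows that a majority-class predictor scores exactly the majority class's share of the data.

```python
from collections import Counter

# Toy labels (hypothetical, for illustration only): 1000 transactions, 16 fraudulent.
y_toy = [1] * 16 + [0] * 984

# Hand-rolled equivalent of the 'most_frequent' strategy:
# find the most common label and predict it for every row.
majority_label = Counter(y_toy).most_common(1)[0][0]
y_pred = [majority_label] * len(y_toy)

# Accuracy = correctly classified rows / all rows.
accuracy = sum(p == t for p, t in zip(y_pred, y_toy)) / len(y_toy)

# Share of the data that belongs to the majority class.
majority_share = y_toy.count(majority_label) / len(y_toy)

print(majority_label, accuracy, majority_share)
```

Because the two printed fractions are the same number, a high accuracy here says nothing about how well fraud is detected.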

**Question:** *What fraction of the fraudulent transactions are correctly classified? (This is the **recall** score.)*

In [13]:
from sklearn.metrics import recall_score

## Measure test recall score of your dummy classifier
dummy_test_recall = recall_score(y_test, dummy_test_pred)

print('Dummy classifier recall:', dummy_test_recall)

Dummy classifier recall: 0.0


**Question:** *How does the recall of the dummy classifier look (very low, low, high, very high)?* ***The recall is very low (zero), because the dummy classifier never predicts fraud.***
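
To make the definition concrete, here is a minimal sketch of recall computed from true-positive and false-negative counts. The helper `recall` and the toy label lists are hypothetical and written only for this illustration; the notebook itself uses `sklearn.metrics.recall_score`.

```python
def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive: TP / (TP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    # Return 0.0 when there are no actual positives, to avoid dividing by zero.
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy labels (hypothetical): 10 transactions, 4 of them fraudulent.
y_true = [0, 0, 1, 0, 1, 0, 0, 1, 0, 1]

# A majority-class predictor never flags fraud, so TP = 0.
always_zero = [0] * len(y_true)

# A predictor that finds 3 of the 4 frauds and raises one false alarm.
catches_three = [0, 0, 1, 0, 1, 0, 1, 1, 0, 0]

print(recall(y_true, always_zero))
print(recall(y_true, catches_three))
```

The false alarm in the second predictor does not affect recall, since recall only looks at the rows that are actually fraudulent.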

### Training a logistic regression model (3pts)

Train a logistic regression classifier with default parameters using `X_train` and `y_train`.

In [15]:
from sklearn.linear_model import LogisticRegression
    
## Instantiate a logistic regression model and fit to the training data
logR = LogisticRegression().fit(X_train, y_train)

logR_test_pred = logR.predict(X_test)

## Measure test accuracy 
logR_test_acc = accuracy_score(y_test, logR_test_pred)

print('Logistic classifier accuracy:', logR_test_acc)

## Measure test recall
logR_test_recall = recall_score(y_test, logR_test_pred)

print('Logistic classifier recall:', logR_test_recall)

Logistic classifier accuracy: 0.9964970501474927
Logistic classifier recall: 0.7875


**Question:** *Compare the results of logistic regression with those of the above dummy classifier.* ***Both the accuracy and the recall of the logistic regression model are higher than those of the dummy classifier. In particular, logistic regression correctly classifies about 79% of the fraudulent test instances (recall 0.7875), whereas the dummy classifier catches none.***

### Grid search for selecting hyperparameters for Logistic Regression (2pts)

Perform a grid search over the parameters listed below for a logistic regression classifier, using recall for scoring and the default 3-fold cross validation.

`'penalty': ['l1', 'l2']`

`'C':[0.01, 0.1, 1, 10, 100]`

In [16]:
from sklearn.model_selection import GridSearchCV

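## Define the grid of logistic regression parameters
parameters = {'penalty':['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}
model = LogisticRegression()

## Perform grid search CV to find best model parameter setting
## Note: no `scoring` argument is passed here, so candidates are ranked by the
## estimator's default score (accuracy); pass scoring='recall' to rank by recall
cmodel = GridSearchCV(model, parameters)
cmodel.fit(X_train, y_train)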



GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [23]:
cmodel.best_estimator_.C

0.1

In [24]:
cmodel.best_estimator_.penalty

'l2'
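
The task statement asks for recall as the selection metric, while the search above was run with `scoring=None`. As a hedged illustration of a recall-scored search, the sketch below runs the same parameter grid on synthetic data from `make_classification` (an assumption; it is not `fraud_data.csv`). The names `X_demo`, `y_demo`, and `search` are hypothetical. It sets `solver='liblinear'` explicitly because that solver supports both `'l1'` and `'l2'` penalties, and it sets `cv=3` explicitly rather than relying on a version-dependent default. Results on the real data may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (roughly 5% positives); not the fraud dataset.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

param_grid = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}

# scoring='recall' ranks each candidate by mean cross-validated recall.
search = GridSearchCV(
    LogisticRegression(solver='liblinear'),
    param_grid,
    scoring='recall',
    cv=3,
)
search.fit(X_tr, y_tr)

# With refit=True (the default), best_estimator_ is already refit on all of X_tr.
best_model = search.best_estimator_
print(search.best_params_)
print('Test recall:', recall_score(y_te, best_model.predict(X_te)))
```

Since `best_estimator_` is refit on the full training set, it can be used directly for test predictions without copying the best parameters into a new model by hand.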

In [26]:
## Fit logistic regression with best parameters to the entire training data
model = LogisticRegression(C=0.1,penalty='l2').fit(X_train, y_train)
    
logR_test_pred = model.predict(X_test)

## Measure test accuracy
logR_test_acc = accuracy_score(y_test, logR_test_pred)

print('Logistic classifier accuracy:', logR_test_acc)

## Measure test recall
logR_test_recall = recall_score(y_test, logR_test_pred)

print('Logistic classifier recall:', logR_test_recall)

Logistic classifier accuracy: 0.9963126843657817
Logistic classifier recall: 0.775


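**Question:** *Compare the results with those of logistic regression with default parameters.* ***The results differ very little from the defaults: test accuracy is 0.9963 versus 0.9965, and test recall is 0.775 versus 0.7875, so the tuned model is marginally worse on both.***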
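
To see how small these differences are in terms of individual transactions, the sketch below relates accuracy and recall to error counts. Every count in it is a back-calculated assumption: the notebook never prints the test-set size or a confusion matrix. The values are one set of whole numbers that appears to reproduce the printed scores. The helper `summarize` is hypothetical.

```python
# ASSUMED counts, inferred from the printed scores rather than reported in the notebook.
n_test = 5424   # assumed number of test transactions
n_fraud = 80    # assumed number of fraudulent test transactions

def summarize(false_negatives, false_positives):
    """Return (accuracy, recall) for the assumed test set and given error counts."""
    true_positives = n_fraud - false_negatives
    accuracy = (n_test - false_negatives - false_positives) / n_test
    recall = true_positives / n_fraud
    return accuracy, recall

# Dummy classifier: misses every fraud case, raises no false alarms.
dummy_scores = summarize(false_negatives=80, false_positives=0)

# Assumed error counts for the default and the grid-searched logistic models.
default_scores = summarize(false_negatives=17, false_positives=2)
tuned_scores = summarize(false_negatives=18, false_positives=2)

for name, (acc, rec) in [('dummy', dummy_scores),
                         ('default', default_scores),
                         ('tuned', tuned_scores)]:
    print(name, round(acc, 4), round(rec, 4))
```

Under these assumed counts, the default and tuned models differ by a single missed fraud case, which is consistent with the answer above that the two are nearly indistinguishable.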