### Fitting Logistic Regression

In this first notebook, you will fit a logistic regression model to a dataset in order to predict whether a transaction is fraudulent.

To get started, let's import the libraries and take a quick look at the dataset.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm


df = pd.read_csv('./fraud_dataset.csv')
df.head()


Unnamed: 0,transaction_id,duration,day,fraud
0,28891,21.3026,weekend,False
1,61629,22.932765,weekend,False
2,53707,32.694992,weekday,False
3,47812,32.784252,weekend,False
4,43455,17.756828,weekend,False


`1.` As you can see, two columns need to be converted to dummy variables.  Replace each of these columns with its dummy version, using 1 for `weekday` and `True`, and 0 otherwise.  Use the first quiz to answer a few questions about the dataset.

In [12]:
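# get_dummies returns columns in sorted order: (False, True) and (weekday, weekend).
# With the names below, 'True' is therefore 1 for non-fraud rows and
# 'False' is 1 for fraudulent rows; the rest of the notebook relies on this.
df[['True', 'False']] = pd.get_dummies(df['fraud'])
df[['weekday', 'weekend']] = pd.get_dummies(df['day'])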
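
The column order returned by `pd.get_dummies` can be checked directly. The sketch below does this on a hypothetical three-row frame; the `toy` frame and the `is_fraud` / `is_weekday` column names are illustrative and not part of the notebook. It also shows one way to build each indicator from an explicit comparison, which avoids depending on that order.

```python
import pandas as pd

# Hypothetical toy data with the same two columns as the fraud dataset
toy = pd.DataFrame({
    "fraud": [False, True, False],
    "day": ["weekend", "weekday", "weekend"],
})

# get_dummies sorts the category labels, so the columns come back as False, True
fraud_dummies = pd.get_dummies(toy["fraud"])
dummy_columns = list(fraud_dummies.columns)
print(dummy_columns)

# Explicit indicators: 1 for fraud / weekday, 0 otherwise
toy["is_fraud"] = toy["fraud"].astype(int)
toy["is_weekday"] = (toy["day"] == "weekday").astype(int)
print(toy)
```

With indicators named this way, the column name states what a 1 means, so the response passed to the model is harder to misread.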

In [6]:
df.head()

Unnamed: 0,transaction_id,duration,day,fraud,True,False,weekday,weekend
0,28891,21.3026,weekend,False,1,0,0,1
1,61629,22.932765,weekend,False,1,0,0,1
2,53707,32.694992,weekday,False,1,0,1,0
3,47812,32.784252,weekend,False,1,0,0,1
4,43455,17.756828,weekend,False,1,0,0,1


In [11]:
df['False'].mean()  # Proportion of fraudulent transactions ('False' is 1 for fraud rows)

0.012168770612987604

In [12]:
df[df['False'] == 1]['duration'].mean()  # Average duration of fraudulent transactions

4.6242473706156568

In [9]:
df['weekday'].mean()  # Proportion of weekday transactions

0.34527465029000343

In [13]:
df[df['True'] == 1]['duration'].mean()  # Average duration of non-fraudulent transactions ('True' is 1 for non-fraud rows)

30.013583132522555
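
The four summary values above can also be computed straight from the original `fraud` and `day` columns, without going through the dummy columns. This is only a sketch on hypothetical toy data: the `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "duration": [30.0, 4.0, 32.0, 6.0],
    "day": ["weekend", "weekday", "weekday", "weekend"],
    "fraud": [False, True, False, True],
})

# The mean of a boolean column is the proportion of True values
prop_fraud = toy["fraud"].mean()
prop_weekday = (toy["day"] == "weekday").mean()

# Boolean masks select the fraudulent and non-fraudulent rows
mean_fraud_duration = toy.loc[toy["fraud"], "duration"].mean()
mean_legit_duration = toy.loc[~toy["fraud"], "duration"].mean()

print(prop_fraud, prop_weekday, mean_fraud_duration, mean_legit_duration)
```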

`2.` Now that you have dummy variables, fit a logistic regression model to predict whether a transaction is fraudulent using both day and duration.  Don't forget an intercept!  Use the second quiz below to ensure you fit the model correctly.

In [15]:
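df['intercept'] = 1  # sm.Logit does not add an intercept automatically
# The response is the 'False' column, which is 1 for fraudulent transactions
logistic_model = sm.Logit(df['False'], df[['intercept', 'weekday', 'duration']])
results = logistic_model.fit()
results.summary()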

Optimization terminated successfully.
         Current function value: inf
         Iterations 16




Dep. Variable:,False,No. Observations:,8793.0
Model:,Logit,Df Residuals:,8790.0
Method:,MLE,Df Model:,2.0
Date:,"Sun, 10 Feb 2019",Pseudo R-squ.:,
Time:,19:56:26,Log-Likelihood:,-inf
converged:,True,LL-Null:,-inf
,,LLR p-value:,

,coef,std err,z,P>|z|,[0.025,0.975]
intercept,9.8709,1.944,5.078,0.000,6.061,13.681
weekday,2.5465,0.904,2.816,0.005,0.774,4.319
duration,-1.4637,0.290,-5.039,0.000,-2.033,-0.894


## Results

The p-values show that both variables are statistically significant. Before interpreting the coefficients, we need to exponentiate them.

For a quantitative variable, each one-unit increase in X1 multiplies the odds of the outcome being 1 by e^b1, holding all other variables constant.

For a categorical variable, being in category X1 multiplies the odds of the outcome being 1 by e^b1 relative to the baseline category, holding all other variables constant.

In most cases we do not interpret the intercept; we care only about the coefficients attached to the variables.
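
To see why exponentiating a coefficient gives a multiplicative change in the odds, the sketch below plugs the coefficients from the summary table into the model's odds, exp(b0 + b1·weekday + b2·duration), and compares two transactions that differ in a single variable. The `odds` helper and the example durations (5 and 6) are illustrative choices, not part of the notebook.

```python
import math

# Coefficients copied from the summary table
b0, b_weekday, b_duration = 9.8709, 2.5465, -1.4637

def odds(weekday, duration):
    """Odds of fraud implied by the fitted coefficients (exp of the linear predictor)."""
    return math.exp(b0 + b_weekday * weekday + b_duration * duration)

# Same day type, duration one unit apart
ratio_duration = odds(0, 6.0) / odds(0, 5.0)

# Same duration, weekday versus weekend (the baseline)
ratio_weekday = odds(1, 5.0) / odds(0, 5.0)

# Each ratio matches the exponentiated coefficient, whatever the starting values
print(ratio_duration, math.exp(b_duration))
print(ratio_weekday, math.exp(b_weekday))
```

The intercept and the shared terms cancel in each ratio, which is why only the coefficient of the variable that changed remains.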

In [13]:
# Interpreting the results: exponentiate the duration and weekday coefficients

np.exp(-1.4637), np.exp(2.5465)

(0.2313785882117941, 12.762357271496972)

The values above are the multiplicative changes in the odds.

## Results
The odds of fraud are 12.76 times as high on weekdays as on weekends, holding all else constant (i.e., duration doesn't change).

For each one-unit increase in duration, the odds of fraud are multiplied by 0.23, holding all else constant.

In [14]:
# Take the reciprocal to express the effect of a one-unit decrease instead of an increase
1/np.exp(-1.4637)

4.321921089278333

## Results

For every one-unit decrease in duration, the odds of fraud are 4.32 times as high, holding all else constant.
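
As a quick check of the reciprocal relationship used above: a one-unit decrease corresponds to exp(-b), which equals 1/exp(b). The sketch below uses the duration coefficient from the summary table; the variable names are illustrative.

```python
import math

b_duration = -1.4637  # duration coefficient from the summary table

increase_factor = math.exp(b_duration)   # odds multiplier per one-unit increase
decrease_factor = math.exp(-b_duration)  # odds multiplier per one-unit decrease

# The two factors are reciprocals of each other
print(round(increase_factor, 2), round(decrease_factor, 2))  # 0.23 4.32
```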