### Fitting Logistic Regression

In this first notebook, you will fit a logistic regression model to a dataset in order to predict whether a transaction is fraudulent.

To get started, let's import the libraries and take a quick look at the dataset.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
# Workaround: older statsmodels releases call stats.chisqprob, which newer SciPy versions removed
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

df = pd.read_csv('./fraud_dataset.csv')
df.head()



Unnamed: 0,transaction_id,duration,day,fraud
0,28891,21.3026,weekend,False
1,61629,22.932765,weekend,False
2,53707,32.694992,weekday,False
3,47812,32.784252,weekend,False
4,43455,17.756828,weekend,False


`1.` As you can see, two columns (`day` and `fraud`) need to be converted to dummy variables.  Replace each of these columns with its dummy version, using 1 for `weekday` and `True`, and 0 otherwise.  Use the first quiz to answer a few questions about the dataset.

In [2]:
# Add one 0/1 column per category of day (weekday, weekend)
df = df.join(pd.get_dummies(df['day']))

In [3]:
# get_dummies orders its columns False, True, so n_fraud marks non-fraud and Fraud marks fraud
df[['n_fraud', 'Fraud']] = pd.get_dummies(df['fraud'])

In [4]:
df.head()

Unnamed: 0,transaction_id,duration,day,fraud,weekday,weekend,n_fraud,Fraud
0,28891,21.3026,weekend,False,0,1,1,0
1,61629,22.932765,weekend,False,0,1,1,0
2,53707,32.694992,weekday,False,1,0,1,0
3,47812,32.784252,weekend,False,0,1,1,0
4,43455,17.756828,weekend,False,0,1,1,0


In [5]:
# Keep one dummy per variable; weekend and n_fraud carry no extra information
df.drop(['weekend', 'n_fraud'], axis=1, inplace=True)

In [6]:
df.head()

Unnamed: 0,transaction_id,duration,day,fraud,weekday,Fraud
0,28891,21.3026,weekend,False,0,0
1,61629,22.932765,weekend,False,0,0
2,53707,32.694992,weekday,False,1,0
3,47812,32.784252,weekend,False,0,0
4,43455,17.756828,weekend,False,0,0
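The same 0/1 columns can also be built directly with a comparison, which avoids creating and then dropping the extra `weekend` and `n_fraud` columns. The sketch below is an alternative, not the approach used in the cells above. It runs on a small hypothetical frame (`toy`) that only mimics the `day` and `fraud` columns of the dataset.

```python
import pandas as pd

# Hypothetical toy frame mirroring the day and fraud columns of the dataset
toy = pd.DataFrame({
    'day': ['weekend', 'weekday', 'weekend'],
    'fraud': [False, False, True],
})

# A boolean comparison cast to int gives 1 where the condition holds, 0 otherwise
toy['weekday'] = (toy['day'] == 'weekday').astype(int)
toy['Fraud'] = toy['fraud'].astype(int)

print(toy)
```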


In [7]:
# Proportion of transactions that are fraud
df['Fraud'].mean()

0.012168770612987604

In [8]:
# Average duration for non-fraud (0) and fraud (1) transactions
df.groupby(['Fraud'])['duration'].mean()

Fraud
0    30.013583
1     4.624247
Name: duration, dtype: float64

In [9]:
# Proportion of transactions that occurred on a weekday
df['weekday'].mean()

0.34527465029000343

In [10]:
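# Add a constant column so the models include an intercept
df['intercept'] = 1
# OLS on the 0/1 Fraud outcome (a linear probability model)
sm.OLS(df['Fraud'], df[['intercept', 'weekday', 'duration']]).fit().summary()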

Dep. Variable:,Fraud,R-squared:,0.145
Model:,OLS,Adj. R-squared:,0.145
Method:,Least Squares,F-statistic:,747.9
Date:,"Thu, 27 Feb 2020",Prob (F-statistic):,1.13e-300
Time:,05:39:25,Log-Likelihood:,7651.6
No. Observations:,8793,AIC:,-15300.0
Df Residuals:,8790,BIC:,-15280.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
intercept,0.1674,0.005,36.944,0.000,0.159,0.176
weekday,0.0184,0.002,8.071,0.000,0.014,0.023
duration,-0.0054,0.000,-37.539,0.000,-0.006,-0.005

Omnibus:,10942.218,Durbin-Watson:,1.941
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1208518.066
Skew:,6.989,Prob(JB):,0.0
Kurtosis:,58.706,Cond. No.,129.0
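
One limitation of an OLS fit like this is that a linear model for a 0/1 outcome is not constrained to produce values between 0 and 1. As a rough check, the sketch below plugs the rounded coefficients from the table into the fitted line for two rows shown in `df.head()`. The helper name `linear_prediction` is made up for this illustration.

```python
# Rounded coefficients copied from the OLS summary table
intercept, b_weekday, b_duration = 0.1674, 0.0184, -0.0054

def linear_prediction(weekday, duration):
    """Fitted value of the linear model for one transaction."""
    return intercept + b_weekday * weekday + b_duration * duration

# Rows 2 and 3 of df.head(): a weekday and a weekend transaction
pred_weekday = linear_prediction(1, 32.694992)
pred_weekend = linear_prediction(0, 32.784252)
print(pred_weekday, pred_weekend)
```

With these rounded values the weekend row gets a slightly negative fitted value, which cannot be a probability. A logistic model avoids this by passing the linear predictor through a function bounded between 0 and 1.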


`2.` Now that you have dummy variables, fit a logistic regression model to predict whether a transaction is fraud using both day and duration.  Don't forget an intercept!  Use the second quiz below to make sure you fit the model correctly.

In [11]:
model = sm.Logit(df['Fraud'], df[['intercept', 'weekday', 'duration']])
result = model.fit()
result.summary()

Optimization terminated successfully.
         Current function value: inf
         Iterations 16




Dep. Variable:,Fraud,No. Observations:,8793.0
Model:,Logit,Df Residuals:,8790.0
Method:,MLE,Df Model:,2.0
Date:,"Thu, 27 Feb 2020",Pseudo R-squ.:,
Time:,05:39:25,Log-Likelihood:,-inf
converged:,True,LL-Null:,-inf
,,LLR p-value:,

,coef,std err,z,P>|z|,[0.025,0.975]
intercept,9.8709,1.944,5.078,0.000,6.061,13.681
weekday,2.5465,0.904,2.816,0.005,0.774,4.319
duration,-1.4637,0.290,-5.039,0.000,-2.033,-0.894
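
Logistic regression coefficients are on the log-odds scale, so they are usually exponentiated before being interpreted. The sketch below does this with the rounded coefficients copied from the table above, and also converts a linear predictor into a probability with the logistic function. The helper `predicted_probability` is a name introduced here for illustration. Because the fit reported a log-likelihood of `-inf`, these numbers should be treated with caution.

```python
import numpy as np

# Rounded coefficients copied from the Logit summary table
coefs = {'intercept': 9.8709, 'weekday': 2.5465, 'duration': -1.4637}

# exp(coef) is the multiplicative change in the odds of fraud per one-unit increase
odds_weekday = np.exp(coefs['weekday'])      # weekday vs. weekend
odds_duration = np.exp(coefs['duration'])    # per extra unit of duration
# For coefficients below zero the reciprocal is often easier to read
inv_odds_duration = 1 / odds_duration

def predicted_probability(weekday, duration):
    """Logistic function applied to the linear predictor."""
    z = (coefs['intercept']
         + coefs['weekday'] * weekday
         + coefs['duration'] * duration)
    return 1 / (1 + np.exp(-z))

print(round(odds_weekday, 2), round(inv_odds_duration, 2))
```

Under these estimates, the odds of fraud on a weekday are roughly 12.8 times the odds on a weekend, holding duration constant, and each one-unit decrease in duration multiplies the odds of fraud by about 4.3, holding the day constant.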
