### Fitting Logistic Regression

In this first notebook, you will fit a logistic regression model to a dataset in order to predict whether a transaction is fraudulent.

To get started, let's import the libraries and take a quick look at the dataset.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm


df = pd.read_csv('fraud_dataset.csv')
df.head(3)

Unnamed: 0,transaction_id,duration,day,fraud
0,28891,21.3026,weekend,False
1,61629,22.932765,weekend,False
2,53707,32.694992,weekday,False


`1.` As you can see, two columns need to be changed to dummy variables.  Replace each of them with its dummy version, using 1 for `weekday` and `True`, and 0 otherwise.  Use the first quiz to answer a few questions about the dataset.
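
As a minimal sketch of the encoding described here, the toy frame below (hypothetical values, not the fraud dataset) shows what `pd.get_dummies` returns for a `day` column. Passing `dtype=int` is an assumption about wanting 0/1 integers; recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({'day': ['weekend', 'weekday', 'weekend'],
                    'fraud': [False, True, False]})

# One indicator column per category, in sorted order: weekday, weekend
day_dummies = pd.get_dummies(toy['day'], dtype=int)
print(day_dummies)

# Keep only the column that is 1 for 'weekday' and 0 otherwise
toy['weekday'] = day_dummies['weekday']
# A boolean column can be converted to 0/1 directly
toy['fraud'] = toy['fraud'].astype(int)
print(toy)
```

Only one indicator per original column is needed in the model, because the other is fully determined by it.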

In [2]:
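# 1 for 'weekday', 0 otherwise; dtype=int keeps 0/1 integers on newer pandas, where get_dummies returns booleans
df['weekday'] = pd.get_dummies(df['day'], dtype=int)['weekday']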

In [3]:
# get_dummies orders the columns False, True, so they map to not_fraud, fraud
df[['not_fraud','fraud']] = pd.get_dummies(df['fraud'], dtype=int)

In [4]:
df['intercept'] = 1

In [5]:
df.head(2)

Unnamed: 0,transaction_id,duration,day,fraud,weekday,not_fraud,intercept
0,28891,21.3026,weekend,0,0,1,1
1,61629,22.932765,weekend,0,0,1,1


In [6]:
df = df.drop('not_fraud', axis=1)

In [7]:
df.head(2)

Unnamed: 0,transaction_id,duration,day,fraud,weekday,intercept
0,28891,21.3026,weekend,0,0,1
1,61629,22.932765,weekend,0,0,1


In [8]:
df.fraud.mean()

0.012168770612987604

In [9]:
df.groupby('fraud').duration.mean()

fraud
0    30.013583
1     4.624247
Name: duration, dtype: float64

In [10]:
df.weekday.mean()

0.3452746502900034

`2.` Now that you have dummy variables, fit a logistic regression model to predict whether a transaction is fraud using both day and duration.  Don't forget an intercept!  Use the second quiz below to ensure you fit the model correctly.

In [11]:
lm = sm.Logit(df['fraud'], df[['intercept', 'weekday', 'duration']])
results = lm.fit()
results.summary()

Optimization terminated successfully.
         Current function value: inf
         Iterations 16


  return 1/(1+np.exp(-X))
  return np.sum(np.log(self.cdf(q*np.dot(X,params))))


Dep. Variable:,fraud,No. Observations:,8793.0
Model:,Logit,Df Residuals:,8790.0
Method:,MLE,Df Model:,2.0
Date:,"Thu, 06 May 2021",Pseudo R-squ.:,inf
Time:,15:58:42,Log-Likelihood:,-inf
converged:,True,LL-Null:,0.0
Covariance Type:,nonrobust,LLR p-value:,1.0

,coef,std err,z,P>|z|,[0.025,0.975]
intercept,9.8709,1.944,5.078,0.000,6.061,13.681
weekday,2.5465,0.904,2.816,0.005,0.774,4.319
duration,-1.4637,0.290,-5.039,0.000,-2.033,-0.894


In [12]:
np.exp(-1.4637), np.exp(2.5465)

(0.2313785882117941, 12.762357271496972)

In [13]:
# The duration coefficient is negative, so take the reciprocal to express the effect of a 1-unit decrease
1/np.exp(-1.4637)

4.321921089278333

### Holding all else constant, the odds of fraud on a weekday are 12.76 times the odds on a weekend.
### Holding all else constant, each 1-unit decrease in duration multiplies the odds of fraud by 4.32.
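
To see why the exponentiated coefficient is a ratio of odds, the sketch below plugs the coefficients from the summary above into the logistic function. The transaction with duration 6 is a hypothetical example, and `fraud_probability` and `odds` are illustrative helper names, not part of statsmodels.

```python
import math

# Coefficients copied from the fitted summary above
intercept, b_weekday, b_duration = 9.8709, 2.5465, -1.4637

def fraud_probability(weekday, duration):
    """Predicted probability from the logistic (sigmoid) function."""
    log_odds = intercept + b_weekday * weekday + b_duration * duration
    return 1 / (1 + math.exp(-log_odds))

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

# Hypothetical transaction with duration 6, on a weekday vs. a weekend
p_weekday = fraud_probability(1, 6.0)
p_weekend = fraud_probability(0, 6.0)

# The ratio of odds equals exp(b_weekday), whatever duration is held fixed
odds_ratio = odds(p_weekday) / odds(p_weekend)
print(round(odds_ratio, 2), round(math.exp(b_weekday), 2))
```

The ratio of the two probabilities themselves depends on the chosen duration, which is why the interpretation is stated in terms of odds.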

# Admissions Example

### Interpreting Results of Logistic Regression

In this notebook (and the quizzes), you will practice interpreting the coefficients of a logistic regression model.  What you saw in the previous video should help with this notebook.

The dataset contains four variables: `admit`, `gre`, `gpa`, and `prestige`:

* `admit` is a binary variable. It indicates whether a candidate was admitted into UCLA (admit = 1) or not (admit = 0).
* `gre` is the GRE score. GRE stands for Graduate Record Examination.
* `gpa` stands for Grade Point Average.
* `prestige` is the prestige of an applicant's alma mater (the school attended before applying), with 1 being the highest (most prestigious) and 4 the lowest (least prestigious).

To start, let's read in the necessary libraries and data.

In [14]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("admissions.csv")
df.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4


In [15]:
df.groupby('prestige').count()

prestige,admit,gre,gpa
1,61,61,61
2,148,148,148
3,121,121,121
4,67,67,67


There are a few different ways you might choose to work with the `prestige` column in this dataset.  Here, we want the change from prestige 1 to prestige 2 to be able to shift the acceptance rate by a different amount than the change from prestige 3 to prestige 4, which treating `prestige` as a single quantitative variable would not allow.
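
One way to express this idea is to give each prestige level its own indicator column and leave one level out as the baseline. The sketch below uses hypothetical prestige values and the `drop_first` option of `pd.get_dummies`; the notebook's own solution instead creates all four columns and leaves `prestige_1` out of the model.

```python
import pandas as pd

# Hypothetical prestige values covering all four levels
toy = pd.DataFrame({'prestige': [3, 1, 4, 2, 1]})

# drop_first=True omits the first level (prestige 1), which becomes the baseline
dummies = pd.get_dummies(toy['prestige'], prefix='prestige',
                         drop_first=True, dtype=int)
print(dummies)
```

A row with prestige 1 is all zeros, so each remaining coefficient is read as a comparison against prestige 1.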

`1.` With the above idea in place, create the dummy variables needed to treat prestige as a categorical variable rather than a quantitative one, then answer quiz 1 below.

In [16]:
# dtype=int keeps 0/1 integers on newer pandas, where get_dummies returns booleans
df[['prestige_1','prestige_2','prestige_3','prestige_4']] = pd.get_dummies(df['prestige'], dtype=int)

In [17]:
df.head()

Unnamed: 0,admit,gre,gpa,prestige,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380,3.61,3,0,0,1,0
1,1,660,3.67,3,0,0,1,0
2,1,800,4.0,1,1,0,0,0
3,1,640,3.19,4,0,0,0,1
4,0,520,2.93,4,0,0,0,1


In [18]:
df['intercept'] = 1

In [19]:
df.head()

Unnamed: 0,admit,gre,gpa,prestige,prestige_1,prestige_2,prestige_3,prestige_4,intercept
0,0,380,3.61,3,0,0,1,0,1
1,1,660,3.67,3,0,0,1,0,1
2,1,800,4.0,1,1,0,0,0,1
3,1,640,3.19,4,0,0,0,1,1
4,0,520,2.93,4,0,0,0,1,1


`2.` Now, fit a logistic regression model to predict whether an individual is admitted using `gre`, `gpa`, and `prestige`, with a prestige value of `1` as the baseline.  Use the results to answer quizzes 2 and 3 below.  Don't forget an intercept.

In [20]:
lm = sm.Logit(df['admit'], df[['intercept', 'gre', 'gpa', 'prestige_2','prestige_3', 'prestige_4']])
results = lm.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,391.0
Method:,MLE,Df Model:,5.0
Date:,"Thu, 06 May 2021",Pseudo R-squ.:,0.08166
Time:,15:58:42,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
Covariance Type:,nonrobust,LLR p-value:,1.176e-07

,coef,std err,z,P>|z|,[0.025,0.975]
intercept,-3.8769,1.142,-3.393,0.001,-6.116,-1.638
gre,0.0022,0.001,2.028,0.043,7.44e-05,0.004
gpa,0.7793,0.333,2.344,0.019,0.128,1.431
prestige_2,-0.6801,0.317,-2.146,0.032,-1.301,-0.059
prestige_3,-1.3387,0.345,-3.882,0.000,-2.015,-0.663
prestige_4,-1.5534,0.417,-3.721,0.000,-2.372,-0.735


# Remember: a p-value less than 0.05 suggests a statistically significant relationship between that variable and the response variable.
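
The `P>|z|` column can be reproduced from the `z` column under the usual assumption that the test statistic is approximately standard normal. A minimal sketch using only the standard library, with z values copied from the summary table above (`two_sided_p` is an illustrative helper name):

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# z statistics copied from the summary table above
z_scores = {'gre': 2.028, 'gpa': 2.344, 'prestige_2': -2.146}
p_values = {name: two_sided_p(z) for name, z in z_scores.items()}

for name, p in p_values.items():
    print(name, round(p, 3))
```

The values should agree with the table up to rounding, since the z statistics shown there are themselves rounded.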

In [21]:
np.exp(results.params)

intercept     0.020716
gre           1.002221
gpa           2.180027
prestige_2    0.506548
prestige_3    0.262192
prestige_4    0.211525
dtype: float64

In [22]:
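# Reciprocal odds ratios; spelled out rather than using IPython's `_` (previous output)
1 / np.exp(results.params)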

intercept     48.272116
gre            0.997784
gpa            0.458710
prestige_2     1.974147
prestige_3     3.813995
prestige_4     4.727566
dtype: float64

In [23]:
df.groupby('prestige').admit.mean()

prestige
1    0.540984
2    0.358108
3    0.231405
4    0.179104
Name: admit, dtype: float64

# If an individual attended the most prestigious alma mater, their odds of admission are 4.73 times the odds of an individual who attended the least prestigious one, holding all other variables constant.


# If an individual attended the most prestigious alma mater, their odds of admission are 3.81 times the odds of an individual who attended the second least prestigious one, holding all other variables constant.


# If an individual attended the most prestigious alma mater, their odds of admission are 1.97 times the odds of an individual who attended the second most prestigious one, holding all other variables constant.


# For every one-point increase in gpa, the odds of admission are multiplied by 2.18, holding all other variables constant.
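
Odds ratios are easy to confuse with ratios of probabilities. As a rough illustration, the sketch below uses the observed admission rates from the `groupby` output above to compare the two for prestige 1 versus prestige 4. These are raw, unadjusted figures, so they are not expected to match the model's odds ratio, which holds `gre` and `gpa` constant.

```python
# Observed admission rates copied from the groupby output above
rate_prestige_1 = 0.540984
rate_prestige_4 = 0.179104

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

# Ratio of probabilities ("times as likely")
probability_ratio = rate_prestige_1 / rate_prestige_4
# Ratio of odds (the quantity exponentiated logistic coefficients describe)
raw_odds_ratio = odds(rate_prestige_1) / odds(rate_prestige_4)

print(round(probability_ratio, 2), round(raw_odds_ratio, 2))
```

The two ratios differ noticeably, which is why the coefficient interpretations are phrased in terms of odds.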

### Model Diagnostics in Python

In this notebook, you will try out some of the model diagnostics you saw from Sebastian, but in your case there will only be two classes: admitted or not admitted.

First let's read in the necessary libraries and the dataset.

In [24]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split

np.random.seed(42)

df = pd.read_csv('./admissions.csv')
df.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4


`1.` Change prestige to dummy variable columns that are added to `df`.  Then divide your data into training and test data.  Create your test set as 20% of the data, and use a random state of 0.  Your response should be the `admit` column.  [Here](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) are the docs, which you can also find with a quick Google search if you get stuck.

In [25]:
df[['prestige_1','prestige_2','prestige_3','prestige_4']] = pd.get_dummies(df['prestige'])

y = df['admit']

X = df[['gre', 'gpa', 'prestige_2','prestige_3', 'prestige_4']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
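
Conceptually, a holdout split shuffles the rows with a fixed seed and sets a fraction aside for testing. The sketch below is a plain-Python illustration of that idea, not scikit-learn's implementation; `simple_holdout_split` is a hypothetical helper, and rounding the test count up is an assumption made here so that 397 rows yield 80 test rows.

```python
import math
import random

def simple_holdout_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed, then cut off a test fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_size)  # round the test count up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(397))  # stand-in for the 397 observations
train, test = simple_holdout_split(rows)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which is the role `random_state` plays in `train_test_split`; the particular rows selected will differ between this sketch and scikit-learn.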

`2.` Now use [sklearn's Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to fit a logistic model using `gre`, `gpa`, and 3 of your `prestige` dummy variables.  For now, fit the logistic regression model without changing any of the hyperparameters.  

The usual steps are:
* Instantiate
* Fit (on train)
* Predict (on test)
* Score (compare predict to test)

As a first score, obtain the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).  Then answer the first question below about how well your model performed on the test data.

In [26]:
log_mod = LogisticRegression()
log_mod.fit(X_train, y_train)
y_preds = log_mod.predict(X_test)

print(precision_score(y_test, y_preds))
print(recall_score(y_test, y_preds))
print(accuracy_score(y_test, y_preds))
confusion_matrix(y_test, y_preds)

1.0
0.16666666666666666
0.75


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


array([[56,  0],
       [20,  4]], dtype=int64)

| Actual \ Predicted | 0 | 1 |
|---|---|---|
| 0 | 56 | 0 |
| 1 | 20 | 4 |

Therefore, there are 56 non-admitted individuals that we predict to be non-admitted.
There are 20 admitted individuals that we predict to be non-admitted.
There are 0 non-admitted individuals that we predict to be admitted.
There are 4 admitted individuals that we predict to be admitted.

# If we really care about correctly identifying the accepted students as accepted, which metric do we care about the most? **Recall**


# If we only care about obtaining the most correctly identified cases, whether accepted or non-accepted, which metric do we care about the most? **Accuracy**
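
The convergence warning printed with the results above suggests either raising `max_iter` or scaling the data. As a minimal sketch of the second remedy, the code below z-scores the first five `gre` and `gpa` values from the dataset preview using only the standard library; `standardize` is an illustrative helper, and dividing by the population standard deviation is the convention assumed here.

```python
from statistics import mean, pstdev

# First five rows of gre and gpa from the dataset preview above
gre = [380, 660, 800, 640, 520]
gpa = [3.61, 3.67, 4.0, 3.19, 2.93]

def standardize(values):
    """Z-score each value: subtract the mean, divide by the population std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

gre_scaled = standardize(gre)
gpa_scaled = standardize(gpa)
print([round(v, 2) for v in gre_scaled])
print([round(v, 2) for v in gpa_scaled])
```

After scaling, both features are on a comparable scale, which tends to help iterative solvers. In scikit-learn the same transformation is available as `StandardScaler`, and the other remedy is a larger iteration limit, for example `LogisticRegression(max_iter=1000)`, where 1000 is an arbitrary illustrative value.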

`3.` Now, try out a few additional metrics: [precision](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html), [recall](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html), and [accuracy](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) are all popular metrics, which you saw with Sebastian.  You could compute these directly from the confusion matrix, but you can also use the built-in functions in sklearn.
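
Computing the metrics directly from the confusion matrix can be sketched as follows, using the `confusion_matrix` output printed earlier (`[[56, 0], [20, 4]]`), where rows are actual classes and columns are predicted classes.

```python
# Confusion matrix printed earlier: rows are actual classes, columns are predictions
cm = [[56, 0],
      [20, 4]]
tn, fp = cm[0]  # actual 0: true negatives, false positives
fn, tp = cm[1]  # actual 1: false negatives, true positives

precision = tp / (tp + fp)                  # of predicted admits, how many were admitted
recall = tp / (tp + fn)                     # of actual admits, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that were correct
print(precision, recall, accuracy)
```

These reproduce the values printed by `precision_score`, `recall_score`, and `accuracy_score` for the same predictions.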

Another very popular set of metrics are [ROC curves and AUC](http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py).  These actually use the probability from the logistic regression models, and not just the label.  [This](http://blog.yhat.com/posts/roc-curves.html) is also a great resource for understanding ROC curves and AUC.

Try out these metrics to answer the second quiz question below.  I also provided the ROC plot below.  The ideal case is for the curve to shoot all the way to the upper-left corner.  Again, these are discussed in more detail in the Machine Learning Udacity program.
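
To make the link between predicted probabilities and the ROC curve concrete, the sketch below builds the curve by hand: each distinct score is used as a threshold, and the resulting false positive and true positive rates are recorded. The labels and scores are hypothetical, and `roc_points` and `auc_trapezoid` are illustrative helpers rather than sklearn functions.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs obtained by lowering the threshold one score at a time."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [0, 0, 1, 0, 1, 1]               # hypothetical true classes
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]  # hypothetical predicted probabilities
points = roc_points(labels, scores)
auc_value = auc_trapezoid(points)
print(points)
print(auc_value)
```

An AUC of 1 corresponds to a curve that reaches the upper-left corner, while 0.5 corresponds to the dashed diagonal of a random classifier.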

In [27]:
#pip install plotnine

In [28]:
### plotnine provides ggplot, aes, and the geoms used below. Install it first
### (see the previous cell); the older `ggplot` package is not needed.

from plotnine import ggplot, aes, geom_line, geom_abline
from sklearn.metrics import roc_curve, auc
%matplotlib inline

# Predicted probability of the positive class (admit = 1)
preds = log_mod.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, preds)

# Use a separate name so the admissions DataFrame `df` is not overwritten
roc_df = pd.DataFrame(dict(fpr=fpr, tpr=tpr))
(ggplot(roc_df, aes(x='fpr', y='tpr'))
 + geom_line()
 + geom_abline(intercept=0, slope=1, linetype='dashed'))