# DS-NYC-45 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [118]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [119]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'), dtype={'prestige': 'str'})
## forced variable 'prestige' to be categorical as we import the data
df.dropna(inplace = True)

df.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3
1,1,660.0,3.67,3
2,1,800.0,4.0,1
3,1,640.0,3.19,4
4,0,520.0,2.93,4


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [140]:
df['prestige'].value_counts()

2    148
3    121
4     67
1     61
Name: prestige, dtype: int64

In [121]:
df['admit'].value_counts()

0    271
1    126
Name: admit, dtype: int64
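The two `value_counts()` calls above tabulate each variable on its own. A joint frequency table of prestige against admission could be sketched with `pd.crosstab`. The small DataFrame below is made-up illustration data (an assumption, not the admissions file), so the counts it prints say nothing about UCLA.

```python
import pandas as pd

# Hypothetical toy data, NOT the UCLA file: prestige tier and admit flag
toy = pd.DataFrame({
    'prestige': ['1', '1', '2', '2', '2', '3'],
    'admit':    [1,   0,   1,   0,   0,   0],
})

# Rows are prestige tiers, columns are admit outcomes, cells are counts
freq = pd.crosstab(toy['prestige'], toy['admit'])
print(freq)
```

On the real data the equivalent call would presumably be `pd.crosstab(df['prestige'], df['admit'])`.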

## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [122]:
df.dtypes

admit         int64
gre         float64
gpa         float64
prestige     object
dtype: object

In [123]:
pd.get_dummies(df, prefix=['prestige'], columns=['prestige'])

Unnamed: 0,admit,gre,gpa,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0.0,1.0,0.0,0.0
396,0,560.0,3.04,0.0,0.0,1.0,0.0
397,0,460.0,2.63,0.0,1.0,0.0,0.0
398,0,700.0,3.65,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: We need 3 of these binary variables for modeling. The fourth is implied by the other three and serves as the reference level.

> ### Question 4.  Why are we doing this?

Answer: We create dummy variables from the categorical feature because we don't want the model to treat the prestige codes as numeric quantities with equal spacing between levels. We only need 3 of the binary variables: the four dummies always sum to 1, so including all four alongside an intercept would make the design matrix perfectly collinear.
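The collinearity argument can be checked numerically. The sketch below uses a made-up six-row `prestige` column (an assumption for illustration) and compares the rank of the design matrix with all four dummies against the one built with `drop_first=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column covering all four tiers
toy = pd.DataFrame({'prestige': ['1', '2', '3', '4', '2', '3']})

# All four dummies plus an intercept column: 5 columns in total
all_four = pd.get_dummies(toy, columns=['prestige'], dtype=float)
design_full = np.column_stack([np.ones(len(toy)), all_four.to_numpy()])
rank_full = np.linalg.matrix_rank(design_full)

# Dropping the first level leaves 3 dummies plus the intercept: 4 columns
three = pd.get_dummies(toy, columns=['prestige'], drop_first=True, dtype=float)
design_ok = np.column_stack([np.ones(len(toy)), three.to_numpy()])
rank_ok = np.linalg.matrix_rank(design_ok)

print(design_full.shape[1], rank_full)  # more columns than rank: collinear
print(design_ok.shape[1], rank_ok)      # columns equal rank: full column rank
```

The full design has one more column than its rank, which is the redundancy that dropping one dummy removes.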

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [124]:
df1=pd.get_dummies(df, prefix=['prestige'], columns=['prestige'], drop_first=True)

In [125]:
df1

Unnamed: 0,admit,gre,gpa,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0.0,1.0,0.0
1,1,660.0,3.67,0.0,1.0,0.0
2,1,800.0,4.00,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,1.0
...,...,...,...,...,...,...
395,0,620.0,4.00,1.0,0.0,0.0
396,0,560.0,3.04,0.0,1.0,0.0
397,0,460.0,2.63,1.0,0.0,0.0
398,0,700.0,3.65,1.0,0.0,0.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [156]:
df['prestige']=pd.to_numeric(df['prestige'])

In [164]:
df[df['prestige']==1]['admit'].value_counts()

1    33
0    28
Name: admit, dtype: int64

> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [179]:
33/28.0

1.1785714285714286

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [181]:
df[df['prestige']!=1]['admit'].value_counts()

0    243
1     93
Name: admit, dtype: int64

In [182]:
93/243.0

0.38271604938271603

> ### Question 9.  Finally, what's the odds ratio?

In [183]:
(33/28.0)/(93/243.0)

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

Answer: Applicants who attended a #1 ranked college have about 3.1 times the odds of admission of applicants who did not attend a #1 ranked college.

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [184]:
df[df['prestige']==4]['admit'].value_counts()

0    55
1    12
Name: admit, dtype: int64

In [185]:
12/55.0

0.21818181818181817

In [186]:
(33/28.0)/(12/55.0)

5.401785714285714

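Answer: Applicants who attended a top-ranked college have about 5.4 times the odds of admission to UCLA of applicants who attended the least prestigious schools. Equivalently, applicants from the least prestigious schools have about 0.19 times the odds of top-tier applicants (1/5.4).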
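The hand calculations in this part can be collected into two small helpers. This is only a sketch: the function names `odds` and `odds_ratio` are made up for illustration, and the counts are copied from the frequency tables above.

```python
def odds(admitted, rejected):
    """Odds of admission: admitted count divided by rejected count."""
    return admitted / float(rejected)

def odds_ratio(group, reference):
    """Ratio of the odds in `group` to the odds in `reference`."""
    return odds(*group) / odds(*reference)

# (admitted, rejected) counts taken from the value_counts() outputs above
tier1 = (33, 28)
not_tier1 = (93, 243)
tier4 = (12, 55)

or_1_vs_rest = odds_ratio(tier1, not_tier1)  # Question 9
or_1_vs_4 = odds_ratio(tier1, tier4)         # Question 11, tier 1 as numerator
or_4_vs_1 = odds_ratio(tier4, tier1)         # same comparison, tier 4 as numerator

print(or_1_vs_rest, or_1_vs_4, or_4_vs_1)
```

Swapping the numerator and the reference inverts the ratio, so `or_4_vs_1` is the reciprocal of `or_1_vs_4`.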

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [189]:
from statsmodels.discrete import discrete_model

In [205]:
y=df1['admit']
X=df1.iloc[:,1:7]

from statsmodels.tools import tools
X1=tools.add_constant(X, prepend=False)

In [206]:
X

Unnamed: 0,gre,gpa,prestige_2,prestige_3,prestige_4
0,380.00,3.61,0.00,1.00,0.00
1,660.00,3.67,0.00,1.00,0.00
2,800.00,4.00,0.00,0.00,0.00
3,640.00,3.19,0.00,0.00,1.00
4,520.00,2.93,0.00,0.00,1.00
...,...,...,...,...,...
395,620.00,4.00,1.00,0.00,0.00
396,560.00,3.04,0.00,1.00,0.00
397,460.00,2.63,1.00,0.00,0.00
398,700.00,3.65,1.00,0.00,0.00


In [207]:
X1

Unnamed: 0,gre,gpa,prestige_2,prestige_3,prestige_4,const
0,380.00,3.61,0.00,1.00,0.00,1
1,660.00,3.67,0.00,1.00,0.00,1
2,800.00,4.00,0.00,0.00,0.00,1
3,640.00,3.19,0.00,0.00,1.00,1
4,520.00,2.93,0.00,0.00,1.00,1
...,...,...,...,...,...,...
395,620.00,4.00,1.00,0.00,0.00,1
396,560.00,3.04,0.00,1.00,0.00,1
397,460.00,2.63,1.00,0.00,0.00,1
398,700.00,3.65,1.00,0.00,0.00,1


In [212]:
logit=discrete_model.Logit(endog=y, exog=X1)

> ### Question 13.  Print the model's summary results.

In [214]:
logit.fit().params

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


gre           0.00
gpa           0.78
prestige_2   -0.68
prestige_3   -1.34
prestige_4   -1.55
const        -3.88
dtype: float64

In [225]:
## Just tested another method of running the logit which has better result summary function
import statsmodels.api as sm
from statsmodels.formula.api import logit

In [221]:
df1.head()

Unnamed: 0,admit,gre,gpa,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0.0,1.0,0.0
1,1,660.0,3.67,0.0,1.0,0.0
2,1,800.0,4.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,1.0


In [222]:
logit2 = logit("admit ~ gre + gpa + prestige_2 + prestige_3 + prestige_4", df1).fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


In [240]:
logit2.summary()

0,1,2,3
Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,391.0
Method:,MLE,Df Model:,5.0
Date:,"Tue, 17 Jan 2017",Pseudo R-squ.:,0.08166
Time:,17:45:05,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.176e-07

0,1,2,3,4,5
,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,-3.8769,1.142,-3.393,0.001,-6.116 -1.638
gre,0.0022,0.001,2.028,0.043,7.44e-05 0.004
gpa,0.7793,0.333,2.344,0.019,0.128 1.431
prestige_2,-0.6801,0.317,-2.146,0.032,-1.301 -0.059
prestige_3,-1.3387,0.345,-3.882,0.000,-2.015 -0.663
prestige_4,-1.5534,0.417,-3.721,0.000,-2.372 -0.735


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [252]:
logit2.params

Intercept    -3.88
gre           0.00
gpa           0.78
prestige_2   -0.68
prestige_3   -1.34
prestige_4   -1.55
dtype: float64

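In [264]:
## Odds ratios are the exponentiated coefficients; the intercept is not added to each one
odds = np.exp(logit2.params)
odds

Intercept    0.02
gre          1.00
gpa          2.18
prestige_2   0.51
prestige_3   0.26
prestige_4   0.21
dtype: float64

In [286]:
## 95% confidence intervals for the odds ratios: exponentiate the coefficient intervals
np.exp(logit2.conf_int())

Unnamed: 0,0,1
Intercept,0.00,0.19
gre,1.00,1.00
gpa,1.14,4.18
prestige_2,0.27,0.94
prestige_3,0.13,0.52
prestige_4,0.09,0.48

> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: Holding GRE and GPA constant, applicants from a prestige-2 college have about 0.51 times the odds of admission of applicants from a prestige-1 college (exp(-0.6801) ≈ 0.51, 95% CI about 0.27 to 0.94).

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: Holding GRE and prestige constant, each 1-point increase in GPA multiplies the odds of admission by about 2.18 (exp(0.7793) ≈ 2.18, 95% CI about 1.14 to 4.18).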

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is their probability of admission if they come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [282]:
logodds1=logit2.params[0]+ sum(logit2.params[1:6] * [800, 4,0,0,0])
p1 = np.exp(logodds1)/(1+np.exp(logodds1))
p1

0.73403997689044631

In [283]:
logodds2=logit2.params[0]+ sum(logit2.params[1:6] * [800, 4,1,0,0])
p2 = np.exp(logodds2)/(1+np.exp(logodds2))
p2

0.58299511694059036

In [284]:
logodds3=logit2.params[0]+ sum(logit2.params[1:6] * [800, 4,0,1,0])
p3 = np.exp(logodds3)/(1+np.exp(logodds3))
p3

0.41983282060994886

In [285]:
logodds4=logit2.params[0]+ sum(logit2.params[1:6] * [800, 4,0,0,1])
p4 = np.exp(logodds4)/(1+np.exp(logodds4))
p4

0.3686080314202152

Answer: The probability of admission for a student with GRE =800 & GPA=4 is:
if prestige = tier 1, then prob=73.4%
if prestige = tier 2, then prob=58.3%
if prestige = tier 3, then prob=42.0%
if prestige = tier 4, then prob=36.9%
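The four cells above repeat the same calculation: sum the intercept and the relevant coefficients to get the log-odds, then apply the logistic function. A compact sketch follows, using the coefficients as rounded in the summary table. The helper name `admit_probability` is made up for illustration. Because `gre` is rounded to 0.0022 here, the results differ from the cell outputs above in the third decimal place.

```python
import numpy as np

# Coefficients as rounded in the statsmodels summary table above
coef = {'Intercept': -3.8769, 'gre': 0.0022, 'gpa': 0.7793,
        'prestige_2': -0.6801, 'prestige_3': -1.3387, 'prestige_4': -1.5534}

def admit_probability(gre, gpa, tier):
    """Predicted admission probability for one applicant (tier in 1..4)."""
    log_odds = coef['Intercept'] + coef['gre'] * gre + coef['gpa'] * gpa
    if tier != 1:  # tier 1 is the reference level, so it adds nothing
        log_odds += coef['prestige_%d' % tier]
    # Logistic function: converts log-odds to a probability
    return 1.0 / (1.0 + np.exp(-log_odds))

probs = {tier: admit_probability(800, 4.0, tier) for tier in (1, 2, 3, 4)}
for tier, p in sorted(probs.items()):
    print(tier, round(p, 3))
```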

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [227]:
from sklearn.linear_model import LogisticRegression

In [266]:
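logreg = LogisticRegression(penalty='l2', C=10**2)

## Note: X1 still contains the `const` column added for statsmodels, so the last
## coefficient below belongs to `const`. sklearn also fits its own intercept
## (fit_intercept=True by default), stored separately in `logreg.intercept_`.
mdl = logreg.fit(X1, y)
mdl.coef_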

array([[ 0.00209769,  0.71809211, -0.72960187, -1.40029508, -1.60285974,
        -1.77303041]])

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [294]:
logodds_sk = logreg.coef_
logodds_sk

array([[ 0.00209769,  0.71809211, -0.72960187, -1.40029508, -1.60285974,
        -1.77303041]])

In [288]:
odds_sk = np.exp(logreg.coef_)
odds_sk

array([[ 1.0020999 ,  2.05051732,  0.48210089,  0.24652421,  0.20131997,
         0.16981759]])

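In [290]:
## Recall the odds ratios from the statsmodels fit
np.exp(logit2.params)

Intercept    0.02
gre          1.00
gpa          2.18
prestige_2   0.51
prestige_3   0.26
prestige_4   0.21
dtype: float64

Answer: The odds ratios from the sklearn model are close to those from statsmodels: about 1.00 vs. 1.00 for `gre`, 2.05 vs. 2.18 for `gpa`, 0.48 vs. 0.51 for `prestige_2`, 0.25 vs. 0.26 for `prestige_3`, and 0.20 vs. 0.21 for `prestige_4`. The small differences are expected because sklearn applies an L2 penalty (`C = 10 ** 2`), whereas statsmodels fits an unpenalized maximum-likelihood model. The last sklearn value (0.17) belongs to the `const` column in `X1` and is not comparable to the statsmodels intercept, because sklearn also fits its own `intercept_`.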

> ### Question 20.  Again assume a student with a GRE of 800 and a GPA of 4.  What is their probability of admission if they come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [305]:
logreg.coef_

array([[ 0.00209769,  0.71809211, -0.72960187, -1.40029508, -1.60285974,
        -1.77303041]])

In [313]:
x1=(0.00209769*800)+(0.71809211*4)+(0)-1.77303041
p1_sk = np.exp(x1)/(1+np.exp(x1))
p1_sk

0.94144723723495372

In [314]:
x2=(0.00209769*800)+(0.71809211*4)+(-0.72960187)-1.77303041
p2_sk = np.exp(x2)/(1+np.exp(x2))
p2_sk

0.88573405505157254

In [315]:
x3=(0.00209769*800)+(0.71809211*4)+(-1.40029508)-1.77303041
p3_sk = np.exp(x3)/(1+np.exp(x3))
p3_sk

0.79854011907688283

In [316]:
x4=(0.00209769*800)+(0.71809211*4)+(-1.60285974)-1.77303041
p4_sk = np.exp(x4)/(1+np.exp(x4))
p4_sk

0.76398094239186565

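Answer: The hand calculation above gives the following probabilities for a student with GRE = 800 and GPA = 4:
if prestige = tier 1, then prob=94.1%
if prestige = tier 2, then prob=88.6%
if prestige = tier 3, then prob=79.9%
if prestige = tier 4, then prob=76.4%

These values are overstated. `X1` contains the `const` column, so the -1.773 used above is only the coefficient on `const`. `LogisticRegression` fits an intercept by default and stores it in `logreg.intercept_`, which the calculation omits. `logreg.predict_proba` applies both terms. Because the sklearn odds ratios are close to the statsmodels ones, the corrected probabilities should be much closer to the Question 17 results (roughly 73%, 58%, 42%, and 37%).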
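When computing probabilities from a fitted sklearn model by hand, both `coef_` and `intercept_` enter the log-odds. The sketch below shows this on synthetic data (an assumption for illustration, not the admissions file). It checks the hand calculation against `predict_proba` and shows what happens when the intercept is left out.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data, NOT the admissions file: two features and a shifted
# threshold so that the fitted intercept is clearly non-zero
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 2))
noise = rng.normal(size=200)
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] - 1.0 + noise > 0).astype(int)

clf = LogisticRegression(C=10**2).fit(X_toy, y_toy)

# Log-odds = X . coef_ + intercept_; the logistic function maps it to a probability
z = X_toy @ clf.coef_.ravel() + clf.intercept_[0]
manual = 1.0 / (1.0 + np.exp(-z))

# Same calculation with the intercept left out, for comparison
z_no_intercept = X_toy @ clf.coef_.ravel()
manual_no_intercept = 1.0 / (1.0 + np.exp(-z_no_intercept))

builtin = clf.predict_proba(X_toy)[:, 1]  # column 1 is P(y = 1)

print(np.allclose(manual, builtin))
print(np.allclose(manual_no_intercept, builtin))
```

For the admissions model, calling `predict_proba` on rows shaped like `X1` would avoid transcribing coefficients by hand.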