# DS-SF-33 | Unit Project 3: Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [7]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [8]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [9]:
pd.crosstab(df.prestige,df.admit)

prestige,admit=0,admit=1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


In [10]:
pd.crosstab(df.prestige,df.admit,normalize=True)

prestige,admit=0,admit=1
1.0,0.070529,0.083123
2.0,0.239295,0.133501
3.0,0.234257,0.070529
4.0,0.138539,0.030227


In [11]:
pd.crosstab(df.prestige,df.admit,normalize=True).sum()

admit
0    0.68262
1    0.31738
dtype: float64

In [12]:
pd.crosstab(df.prestige,df.admit,normalize='index')

prestige,admit=0,admit=1
1.0,0.459016,0.540984
2.0,0.641892,0.358108
3.0,0.768595,0.231405
4.0,0.820896,0.179104


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [13]:
df.prestige = df.prestige.astype(int)
df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3
1,1,660.0,3.67,3
2,1,800.0,4.00,1
3,1,640.0,3.19,4
4,0,520.0,2.93,4
...,...,...,...,...
395,0,620.0,4.00,2
396,0,560.0,3.04,3
397,0,460.0,2.63,2
398,0,700.0,3.65,2


In [14]:
#one_hot = (pd.get_dummies(df['prestige'], prefix='prestige'))
#one_hot

In [15]:
df = df.join(pd.get_dummies(df['prestige'], prefix='prestige'))
df

Unnamed: 0,admit,gre,gpa,prestige,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,3,0,0,1,0
1,1,660.0,3.67,3,0,0,1,0
2,1,800.0,4.00,1,1,0,0,0
3,1,640.0,3.19,4,0,0,0,1
4,0,520.0,2.93,4,0,0,0,1
...,...,...,...,...,...,...,...,...
395,0,620.0,4.00,2,0,1,0,0
396,0,560.0,3.04,3,0,0,1,0
397,0,460.0,2.63,2,0,1,0,0
398,0,700.0,3.65,2,0,1,0,0


> ### Question 3.  How many of these binary variables do we need for modeling?

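Answer: We need only 3 of the 4 `prestige` binary variables. The fourth is fully determined by the other three, so including all four alongside an intercept would make the predictors perfectly collinear. The omitted category becomes the reference level.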
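
As a minimal sketch of why three columns suffice, the snippet below uses a small made-up `prestige` column (an assumption for illustration, not the admissions data) and the `drop_first=True` option of `pd.get_dummies`. The dropped first level acts as the reference category, and its indicator can always be rebuilt from the remaining three.

```python
import pandas as pd

# Toy data for illustration only; not the admissions dataset.
toy = pd.DataFrame({'prestige': [1, 2, 3, 4, 2, 1]})

# All four indicator columns: each row sums to exactly 1.
full = pd.get_dummies(toy['prestige'], prefix='prestige').astype(int)

# drop_first=True removes prestige_1, making it the reference level.
reduced = pd.get_dummies(toy['prestige'], prefix='prestige',
                         drop_first=True).astype(int)

# The dropped column is recoverable from the remaining three.
recovered = 1 - reduced.sum(axis=1)

print(list(reduced.columns))
print((recovered == full['prestige_1']).all())
```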

> ### Question 4.  Why are we doing this?

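Answer: `prestige` is a categorical ranking, not a quantity, so entering it as a single number would force the effect of moving between adjacent tiers to be the same at every step. One-hot encoding lets the model estimate a separate effect for each tier when predicting admission from GRE, GPA, and prestige.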

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [16]:
del df['prestige']
df

Unnamed: 0,admit,gre,gpa,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0,0,1,0
1,1,660.0,3.67,0,0,1,0
2,1,800.0,4.00,1,0,0,0
3,1,640.0,3.19,0,0,0,1
4,0,520.0,2.93,0,0,0,1
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0,1,0,0
396,0,560.0,3.04,0,0,1,0
397,0,460.0,2.63,0,1,0,0
398,0,700.0,3.65,0,1,0,0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [17]:
pd.crosstab(df.prestige_1,df.admit)

prestige_1,admit=0,admit=1
0,243,93
1,28,33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [18]:
33/28

1.1785714285714286

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [19]:
93/243

0.38271604938271603

> ### Question 9.  Finally, what's the odds ratio?

In [20]:
round((33/28) / (93/243), 2)

3.08

> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission for applicants who attended a #1 ranked college are about 3.08 times the odds for applicants who did not.
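
The hand calculation above can be wrapped in a small helper so that no intermediate rounding creeps in. This is only a sketch; the function names `odds` and `odds_ratio` are made up for illustration, and the counts come from the `prestige_1` frequency table above.

```python
def odds(successes, failures):
    """Odds = count of successes divided by count of failures."""
    return successes / failures


def odds_ratio(a_yes, a_no, b_yes, b_no):
    """Odds of group A divided by odds of group B."""
    return odds(a_yes, a_no) / odds(b_yes, b_no)


# Group A: prestige_1 == 1 (33 admitted, 28 not admitted).
# Group B: prestige_1 == 0 (93 admitted, 243 not admitted).
or_tier1 = odds_ratio(33, 28, 93, 243)
print(round(or_tier1, 2))  # 3.08
```

An odds ratio is not a ratio of probabilities: the admission rates here are 33/61 (about 54%) versus 93/336 (about 28%), a ratio of roughly 1.95.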

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [21]:
pd.crosstab(df.prestige_4,df.admit)

prestige_4,admit=0,admit=1
0,216,114
1,55,12


In [22]:
12/55

0.21818181818181817

In [23]:
114/216

0.5277777777777778

In [24]:
round((12/55) / (114/216), 2)

0.41

Answer: The odds of admission to UCLA for applicants from the least prestigious undergraduate schools are about 0.41 times the odds for all other applicants, i.e. roughly 59% lower (equivalently, about 2.4 times lower).

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [33]:
import statsmodels.formula.api as smf

# Build the formula from a comma-separated feature list.
# prestige_1 is left out so that it acts as the reference level.
# Note: the trailing "-1" removes the intercept from the model; drop it
# to fit the usual model with a baseline (intercept) term.
formula1 = '''
admit ~ gre, gpa, prestige_2, prestige_3, prestige_4 -1'''
formula1 = formula1.replace(",","+")
formula1



'\nadmit ~ gre+ gpa+ prestige_2+ prestige_3+ prestige_4 -1'

In [34]:
lm_stats_1 = smf.logit(formula=formula1, data=df).fit()
print(lm_stats_1.summary())


Optimization terminated successfully.
         Current function value: 0.589121
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      392
Method:                           MLE   Df Model:                            4
Date:                Wed, 19 Apr 2017   Pseudo R-squ.:                 0.05722
Time:                        20:23:17   Log-Likelihood:                -233.88
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.039e-05
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
gre            0.0014      0.001      1.308      0.191        -0.001     0.003
gpa           -0.1323      0.

> ### Question 13.  Print the model's summary results.

In [35]:
print(lm_stats_1.summary())

                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      392
Method:                           MLE   Df Model:                            4
Date:                Wed, 19 Apr 2017   Pseudo R-squ.:                 0.05722
Time:                        20:24:31   Log-Likelihood:                -233.88
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.039e-05
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
gre            0.0014      0.001      1.308      0.191        -0.001     0.003
gpa           -0.1323      0.195     -0.680      0.497        -0.514     0.249
prestige_2    -0.9562      0.302     -3.171      0.0

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [39]:
np.exp(lm_stats_1.params)

gre           1.001368
gpa           0.876073
prestige_2    0.384342
prestige_3    0.214918
prestige_4    0.154135
dtype: float64
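
The cell above gives point estimates only. As a sketch of the interval part of the question, a 95% confidence interval for an odds ratio can be obtained by building the interval on the log-odds scale and exponentiating both endpoints. The helper name `odds_ratio_ci` is made up for illustration, and the inputs are the `gpa` coefficient and standard error printed in the summary above. With the fitted model in hand, `np.exp(lm_stats_1.conf_int())` should return the intervals for all features at once.

```python
import math


def odds_ratio_ci(coef, std_err, z=1.96):
    """Return (odds ratio, lower, upper) for a logit coefficient.

    Uses the normal approximation coef +/- z * std_err on the log-odds
    scale, then exponentiates each endpoint.
    """
    lower = coef - z * std_err
    upper = coef + z * std_err
    return math.exp(coef), math.exp(lower), math.exp(upper)


# gpa row of the summary above: coef = -0.1323, std err = 0.195
or_gpa, lower_gpa, upper_gpa = odds_ratio_ci(-0.1323, 0.195)
print(round(or_gpa, 3), round(lower_gpa, 3), round(upper_gpa, 3))
```

The resulting interval spans 1, which matches the non-significant p-value reported for `gpa`.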

> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: The odds ratio of 0.38 means that, holding GRE and GPA fixed, the odds of admission for an applicant from a prestige = 2 undergraduate school are about 0.38 times the odds for an applicant from a prestige = 1 school, i.e. roughly 62% lower.

> ### Question 16.  Interpret the odds ratio of `gpa`.

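Answer: The odds ratio of 0.88 means that, holding the other features fixed, each one-point increase in GPA multiplies the odds of admission by about 0.88, i.e. lowers them by roughly 12%. The coefficient is not statistically significant (p = 0.497, and the 95% confidence interval for the coefficient, -0.514 to 0.249, includes zero), so this model gives no clear evidence of a GPA effect.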

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

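Answer: Approximately 64% (tier 1), 40% (tier 2), 27% (tier 3), and 21% (tier 4). These follow from multiplying the fitted odds ratios (1.001368^800 × 0.876073^4 ≈ 1.76 for tier 1, then times 0.38, 0.21, or 0.15 for tiers 2 to 4) and converting odds to probability with p = odds / (1 + odds). The odds ratios 0.38, 0.21, and 0.15 are not themselves probabilities.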
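
One way to check this answer is to convert the fitted odds ratios into probabilities. The sketch below copies the odds ratios from the `np.exp(lm_stats_1.params)` output above and assumes, as in the fitted formula, that the model has no intercept. The variable names are chosen for illustration.

```python
# Odds ratios copied from the np.exp(lm_stats_1.params) output above.
odds_ratios = {
    'gre': 1.001368,
    'gpa': 0.876073,
    'prestige_2': 0.384342,
    'prestige_3': 0.214918,
    'prestige_4': 0.154135,
}

gre, gpa = 800, 4

# With no intercept, the odds for a tier-1 applicant are the product of
# the per-unit odds ratios raised to the feature values.
base_odds = odds_ratios['gre'] ** gre * odds_ratios['gpa'] ** gpa

probabilities = {}
for tier in (1, 2, 3, 4):
    # Tier 1 is the reference level, so its multiplier is 1.
    tier_odds = base_odds * odds_ratios.get(f'prestige_{tier}', 1.0)
    probabilities[tier] = tier_odds / (1 + tier_odds)  # odds -> probability

for tier, p in probabilities.items():
    print(f'tier {tier}: {p:.2f}')
```

Because the model was fit without an intercept, these probabilities are best read as a check on the arithmetic and not as reliable predictions.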

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [40]:
from sklearn.linear_model import LogisticRegression

In [41]:
lm = LogisticRegression()

In [42]:
lm.fit(df[['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4',]], df['admit'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [44]:
print(lm.coef_)

[[ 0.00178497  0.23229458 -0.60347467 -1.17214957 -1.37729795]]


> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [46]:
np.exp(lm.coef_)

array([[ 1.00178657,  1.26149128,  0.546908  ,  0.3097005 ,  0.25225925]])

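Answer: The odds ratios for `gre` and the `prestige` variables point in the same direction as in `statsmodels`, though the `prestige` ratios are closer to 1 (0.55, 0.31, 0.25 versus 0.38, 0.21, 0.15). The `gpa` odds ratio changes most, from 0.88 to 1.26. Two differences in setup likely account for this: the `sklearn` model fits an intercept (`fit_intercept=True`) while the `statsmodels` formula removed it with `-1`, and the `sklearn` model above was fit with the default L2 penalty at `C=1.0`, not the `C = 10 ** 2` the question asks for.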

> ### Question 20.  Again, assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

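Answer: The exponentiated coefficients above (0.55, 0.31, and 0.25 for tiers 2, 3, and 4) are odds ratios relative to tier 1, not probabilities. The probabilities themselves also depend on the fitted intercept `lm.intercept_`, which is not shown above, and can be obtained from `lm.predict_proba`.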
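
A minimal sketch of obtaining such probabilities with `predict_proba` follows. It uses synthetic stand-in data: the arrays, the random seed, and the made-up outcome rule are all assumptions for illustration, not the admissions data, so the printed numbers are not the answer to this question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the admissions data (illustration only).
rng = np.random.default_rng(0)
n = 200
gre = rng.integers(300, 801, size=n)
gpa = rng.uniform(2.0, 4.0, size=n)
tier = rng.integers(1, 5, size=n)

# Indicator columns for tiers 2-4; tier 1 is the reference level.
dummies = np.column_stack([(tier == k).astype(int) for k in (2, 3, 4)])
X = np.column_stack([gre, gpa, dummies])

# Made-up outcome: admission chance falls as the tier number rises.
y = (rng.uniform(size=n) < 0.6 - 0.1 * tier).astype(int)

model = LogisticRegression(C=10 ** 2, solver='liblinear').fit(X, y)

# One row per tier for a student with GRE 800 and GPA 4.
students = np.array([
    [800, 4.0, 0, 0, 0],  # tier 1
    [800, 4.0, 1, 0, 0],  # tier 2
    [800, 4.0, 0, 1, 0],  # tier 3
    [800, 4.0, 0, 0, 1],  # tier 4
])

# Column 1 of predict_proba holds P(admit = 1).
probs = model.predict_proba(students)[:, 1]
print(probs.round(2))
```

In the notebook, calling `lm.predict_proba` on four rows with columns in the order `gre`, `gpa`, `prestige_2`, `prestige_3`, `prestige_4` would give the corresponding probabilities for the fitted model.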