# DS-SF-27 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [3]:
# Rows: undergraduate prestige tier; columns: admit (0 = rejected, 1 = admitted)
pd.crosstab(df.prestige, df.admit)

prestige,admit = 0,admit = 1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


In [4]:
df.groupby('prestige').mean()

prestige,admit,gre,gpa
1.0,0.540984,611.803279,3.453115
2.0,0.358108,596.621622,3.367365
3.0,0.231405,574.876033,3.432893
4.0,0.179104,570.149254,3.318358


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [5]:
X = df[ ['admit', 'gre', 'gpa', 'prestige'] ]
X

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


In [6]:
df.groupby('prestige').mean()

prestige,admit,gre,gpa
1.0,0.540984,611.803279,3.453115
2.0,0.358108,596.621622,3.367365
3.0,0.231405,574.876033,3.432893
4.0,0.179104,570.149254,3.318358


In [7]:
c = df.prestige
c

0      3.0
1      3.0
2      1.0
3      4.0
4      4.0
      ... 
395    2.0
396    3.0
397    2.0
398    2.0
399    3.0
Name: prestige, dtype: float64

In [8]:
cs = pd.get_dummies(c, prefix = 'prestige')

In [9]:
cs

Unnamed: 0,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


In [10]:
cs.rename(columns = {'prestige_1.0': 'Prestige_1',
                    'prestige_2.0': 'Prestige_2',
                    'prestige_3.0': 'Prestige_3',
                    'prestige_4.0': 'Prestige_4'}, inplace = True)

In [11]:
cs

Unnamed: 0,Prestige_1,Prestige_2,Prestige_3,Prestige_4
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


In [12]:
cs.drop('Prestige_1', axis=1, inplace=True)
cs

Unnamed: 0,Prestige_2,Prestige_3,Prestige_4
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,0.0,0.0,0.0
3,0.0,0.0,1.0
4,0.0,0.0,1.0
...,...,...,...
395,1.0,0.0,0.0
396,0.0,1.0,0.0
397,1.0,0.0,0.0
398,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: All but one, so: 4 - 1 = 3.
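As a possible shortcut, and assuming a pandas version whose `pd.get_dummies` supports the `drop_first` argument, the encode, rename, and drop steps above can be collapsed into one call. The small `prestige` series below is made-up illustration data, not the admissions file.

```python
import pandas as pd

# Hypothetical stand-in for df.prestige (illustration only)
prestige = pd.Series([3.0, 3.0, 1.0, 4.0, 4.0, 2.0], name='prestige')

# Casting to int avoids the '.0' suffix in the column names;
# drop_first=True drops the first level (prestige 1), which becomes the reference
dummies = pd.get_dummies(prestige.astype(int), prefix='Prestige', drop_first=True)

print(dummies.columns.tolist())
```

This leaves exactly the three columns the answer above calls for, with tier 1 as the implicit baseline.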

> ### Question 4.  Why are we doing this?

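Answer: To avoid perfect multicollinearity (the dummy variable trap). The four indicators always sum to 1, which duplicates the intercept column, so one level is dropped and serves as the reference category.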
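The collinearity can be checked numerically. The sketch below uses six made-up prestige values (not the admissions data) and compares the rank of a design matrix that keeps all four dummies with one that drops the first.

```python
import numpy as np

# Hypothetical prestige levels for six applicants (illustration only)
prestige = np.array([3, 3, 1, 4, 4, 2])

# One indicator column per level, plus an intercept column of ones
one_hot = (prestige[:, None] == np.array([1, 2, 3, 4])).astype(float)
intercept = np.ones((len(prestige), 1))

full = np.hstack([intercept, one_hot])            # intercept + all 4 dummies
reduced = np.hstack([intercept, one_hot[:, 1:]])  # intercept + 3 dummies

# The four dummies sum to the intercept column, so 'full' is rank deficient
rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(rank_full, full.shape[1])
print(rank_reduced, reduced.shape[1])
```

When the rank is below the number of columns, the coefficients are not uniquely determined, which is why one level has to go.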

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [13]:
# Drop the original `prestige` column now that it is encoded in `cs`
df.drop('prestige', axis = 1, inplace=True)
df

Unnamed: 0,admit,gre,gpa
0,0,380.0,3.61
1,1,660.0,3.67
2,1,800.0,4.00
3,1,640.0,3.19
4,0,520.0,2.93
...,...,...,...
395,0,620.0,4.00
396,0,560.0,3.04
397,0,460.0,2.63
398,0,700.0,3.65


In [14]:
dfnew = pd.concat((df, cs), axis = 1)
dfnew

Unnamed: 0,admit,gre,gpa,Prestige_2,Prestige_3,Prestige_4
0,0,380.0,3.61,0.0,1.0,0.0
1,1,660.0,3.67,0.0,1.0,0.0
2,1,800.0,4.00,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,1.0
...,...,...,...,...,...,...
395,0,620.0,4.00,1.0,0.0,0.0
396,0,560.0,3.04,0.0,1.0,0.0
397,0,460.0,2.63,1.0,0.0,0.0
398,0,700.0,3.65,1.0,0.0,0.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [None]:
# Odds of admission for prestige-1 applicants: 33 admitted to 28 rejected, i.e. 33 / 28 ≈ 1.18

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [25]:
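# Odds of admission for applicants not from a prestige-1 school: 93 admitted to 243 rejected
# `prestige` was dropped from df in Question 5, so rebuild the indicator from the dummies in dfnew
not_tier_1 = dfnew[['Prestige_2', 'Prestige_3', 'Prestige_4']].sum(axis = 1) > 0
pd.crosstab(not_tier_1, dfnew.admit)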

> ### Question 9.  Finally, what's the odds ratio?

In [None]:
# Odds ratio = odds for prestige-1 applicants / odds for everyone else = (33 / 28) / (93 / 243) ≈ 3.08
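A minimal sketch of the hand calculation, assuming Python 3 division and rebuilding the Question 1 frequency table from its printed counts rather than from the CSV file:

```python
import pandas as pd

# Counts copied from the Question 1 frequency table (rows: prestige, columns: admit)
counts = pd.DataFrame({0: [28, 95, 93, 55], 1: [33, 53, 28, 12]},
                      index=[1, 2, 3, 4])

tier1 = counts.loc[1]                  # applicants from prestige-1 schools
others = counts.loc[[2, 3, 4]].sum()   # everyone else, pooled

odds_tier1 = tier1[1] / tier1[0]       # 33 / 28
odds_others = others[1] / others[0]    # 93 / 243
odds_ratio = odds_tier1 / odds_others

print(round(odds_tier1, 3), round(odds_others, 3), round(odds_ratio, 3))
```

An odds ratio above 1 means the first group (here, prestige-1 applicants) has higher odds of admission than the comparison group.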

> ### Question 10.  Write this finding in a sentence.

Answer:

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [None]:
# TODO

Answer:
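One way to set up this calculation, sketched in plain Python 3 with the counts copied from the Question 1 table. The helper `odds` is a name introduced here for illustration, and pooling tiers 1 to 3 as the comparison group is an assumption about what the question intends.

```python
# Counts copied from the Question 1 frequency table: (not admitted, admitted)
counts = {1: (28, 33), 2: (95, 53), 3: (93, 28), 4: (55, 12)}

def odds(not_admitted, admitted):
    """Odds of admission: admitted applicants per rejected applicant."""
    return admitted / not_admitted

tier4_odds = odds(*counts[4])  # 12 / 55

# Pool every other tier to form the comparison group
rest_not = sum(counts[t][0] for t in (1, 2, 3))  # 28 + 95 + 93
rest_yes = sum(counts[t][1] for t in (1, 2, 3))  # 33 + 53 + 28
rest_odds = odds(rest_not, rest_yes)

tier4_odds_ratio = tier4_odds / rest_odds
print(round(tier4_odds, 3), round(rest_odds, 3), round(tier4_odds_ratio, 3))
```

An odds ratio below 1 indicates lower odds of admission for tier-4 applicants than for the pooled comparison group.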

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [26]:
#admission = smf.ols(formula = 'admit ~ gre + gpa + Prestige_1 + Prestige_2 + Prestige_3', data = dfnew).fit()
#admission
model = smf.logit(formula = 'admit ~ gre + gpa + Prestige_2 + Prestige_3 + Prestige_4', data = dfnew).fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [27]:
model.summary()

Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,391.0
Method:,MLE,Df Model:,5.0
Date:,"Wed, 02 Nov 2016",Pseudo R-squ.:,0.08166
Time:,21:54:52,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.176e-07

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,-3.8769,1.142,-3.393,0.001,-6.116 -1.638
gre,0.0022,0.001,2.028,0.043,7.44e-05 0.004
gpa,0.7793,0.333,2.344,0.019,0.128 1.431
Prestige_2,-0.6801,0.317,-2.146,0.032,-1.301 -0.059
Prestige_3,-1.3387,0.345,-3.882,0.000,-2.015 -0.663
Prestige_4,-1.5534,0.417,-3.721,0.000,-2.372 -0.735


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [28]:
# Exponentiate the coefficients (log-odds) to get odds ratios
np.exp(model.params)

Intercept     0.020716
gre           1.002221
gpa           2.180027
Prestige_2    0.506548
Prestige_3    0.262192
Prestige_4    0.211525
dtype: float64

In [29]:
np.exp(model.conf_int(alpha = .05))


Unnamed: 0,0,1
Intercept,0.002207,0.19444
gre,1.000074,1.004372
gpa,1.13612,4.183113
Prestige_2,0.272168,0.942767
Prestige_3,0.133377,0.515419
Prestige_4,0.093329,0.479411
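The two outputs above can be read side by side. As a rough cross-check, the sketch below rebuilds approximate Wald intervals by hand from the rounded coefficients and standard errors in the summary table, so the last digits will differ slightly from what `model.conf_int()` reports. The column labels and the use of 1.96 as the normal quantile are choices made here for illustration.

```python
import numpy as np
import pandas as pd

# Coefficients and standard errors copied (rounded) from the summary table
coef = pd.Series({'gre': 0.0022, 'gpa': 0.7793, 'Prestige_2': -0.6801,
                  'Prestige_3': -1.3387, 'Prestige_4': -1.5534})
se = pd.Series({'gre': 0.001, 'gpa': 0.333, 'Prestige_2': 0.317,
                'Prestige_3': 0.345, 'Prestige_4': 0.417})

z = 1.96  # approximate 97.5th percentile of the standard normal

# Build the interval on the log-odds scale, then exponentiate every column
odds_ratios = pd.DataFrame({'2.5%': np.exp(coef - z * se),
                            'OR': np.exp(coef),
                            '97.5%': np.exp(coef + z * se)})
print(odds_ratios.round(3))
```

An interval that excludes 1 corresponds to a coefficient that is significant at the 5% level in the summary table.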


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer:

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer:
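When wording these interpretations, it can help to turn an odds ratio into a percent change in the odds, and to rescale continuous features to a realistic step (a full GPA point is a large jump). The sketch below uses the rounded coefficients from the summary table; `percent_change_in_odds` is a helper name introduced here for illustration.

```python
import math

# Coefficients copied (rounded) from the summary table
beta_prestige_2 = -0.6801
beta_gpa = 0.7793

def percent_change_in_odds(beta, delta=1.0):
    """Percent change in the odds for a `delta`-unit increase in the feature."""
    return (math.exp(beta * delta) - 1) * 100

change_tier2 = percent_change_in_odds(beta_prestige_2)   # tier 2 vs. tier 1
change_gpa_1 = percent_change_in_odds(beta_gpa)          # +1.0 GPA point
change_gpa_01 = percent_change_in_odds(beta_gpa, 0.1)    # +0.1 GPA point

print(round(change_tier2, 1), round(change_gpa_1, 1), round(change_gpa_01, 1))
```

Each figure holds the other features fixed, and it describes a change in odds, not in probability.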

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [30]:
# Predicted probabilities: about 73% for tier 1, 58% for tier 2, 42% for tier 3, and 37% for tier 4
# Each row is [gre, gpa, Prestige_2, Prestige_3, Prestige_4]; tier 1 is the all-zeros reference
print(model.predict(pd.DataFrame([[800, 4, 0, 0, 0]], columns=dfnew.columns[1:])))
print(model.predict(pd.DataFrame([[800, 4, 1, 0, 0]], columns=dfnew.columns[1:])))
print(model.predict(pd.DataFrame([[800, 4, 0, 1, 0]], columns=dfnew.columns[1:])))
print(model.predict(pd.DataFrame([[800, 4, 0, 0, 1]], columns=dfnew.columns[1:])))

[ 0.73403998]
[ 0.58299512]
[ 0.41983282]
[ 0.36860803]


Answer:
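The predictions above can be reproduced by hand with the inverse logit, which makes explicit what `model.predict` is doing. This is a sketch using the printed coefficients. The summary rounds the `gre` coefficient to 0.0022, so it is recovered here from its printed odds ratio. `admission_probability` is a name introduced for illustration.

```python
import math

# Coefficients from the statsmodels output; gre is recovered from its
# printed odds ratio (1.002221) because the summary rounds it to 0.0022
intercept = -3.8769
beta_gre = math.log(1.002221)
beta_gpa = 0.7793
beta_prestige = {1: 0.0, 2: -0.6801, 3: -1.3387, 4: -1.5534}  # tier 1 is the reference

def admission_probability(gre, gpa, tier):
    """Inverse logit of the linear predictor."""
    log_odds = intercept + beta_gre * gre + beta_gpa * gpa + beta_prestige[tier]
    return 1 / (1 + math.exp(-log_odds))

probabilities = {tier: admission_probability(800, 4, tier) for tier in (1, 2, 3, 4)}
for tier, p in probabilities.items():
    print(tier, round(p, 3))
```

The results agree with the `model.predict` output to about three decimal places. The small differences come from using rounded coefficients.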

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [31]:
# TODO
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(C=10**2)
logreg.fit(dfnew.drop('admit', axis=1, inplace=False), dfnew['admit'])

LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [32]:
logreg.predict_proba([[800, 4, 0, 0, 0]])
# Columns follow logreg.classes_: P(admit = 0) first, then P(admit = 1)

array([[ 0.28814605,  0.71185395]])

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [33]:
# TODO

Answer:
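In `sklearn`, the fitted log-odds coefficients live in `coef_`, so odds ratios come from exponentiating that array. The sketch below shows the pattern on synthetic data (`X`, `y`, `true_coef`, and `clf` are all made up for illustration). In this notebook the equivalent would presumably be `np.exp(logreg.coef_[0])`, read in the same column order as the features passed to `fit`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustration only): two features with known log-odds effects
rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 2))
true_coef = np.array([1.0, -0.5])
p = 1 / (1 + np.exp(-(X @ true_coef)))
y = rng.binomial(1, p)

# A large C means weak L2 regularization, so the fit moves toward plain maximum likelihood
clf = LogisticRegression(C=10 ** 2).fit(X, y)

# coef_ has one row of log-odds coefficients for a binary problem
sk_odds_ratios = np.exp(clf.coef_[0])
print(sk_odds_ratios)
```

Because `LogisticRegression` still applies some penalty at `C = 100`, its odds ratios need not match the unpenalized `statsmodels` estimates exactly. That is consistent with the tier-1 probability of about 0.71 above versus 0.73 from `statsmodels`.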

> ### Question 20.  Again assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [34]:
# TODO

Answer:
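One way to approach this is to build all four applicant profiles at once and score them in a single `predict_proba` call. The sketch below only constructs the profile table. `feature_names` and `profiles` are names introduced here, and the final call is left as a comment because it needs the fitted `logreg` from Question 18.

```python
import numpy as np
import pandas as pd

feature_names = ['gre', 'gpa', 'Prestige_2', 'Prestige_3', 'Prestige_4']

# One row per tier: GRE 800, GPA 4, and the matching dummy switched on.
# np.eye(4)[:, 1:] gives all-zero dummies for tier 1 (the reference level).
profiles = pd.DataFrame(np.hstack([np.tile([800, 4], (4, 1)), np.eye(4)[:, 1:]]),
                        columns=feature_names, index=[1, 2, 3, 4])
print(profiles)

# With the fitted model from Question 18 in scope, the admission
# probabilities would be the second column of predict_proba:
# logreg.predict_proba(profiles.values)[:, 1]
```

Keeping the columns in the same order as the training features matters, since `sklearn` matches inputs by position when given a plain array.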