# DS-SF-30 | Unit Project 3: Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [8]:
# TODO
pd.crosstab(df.prestige, df.admit, dropna = False)

prestige,admit=0,admit=1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [7]:
# TODO
hot_encoding = pd.get_dummies(df.prestige, prefix = None)

hot_encoding

Unnamed: 0,1.0,2.0,3.0,4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

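Answer: Only 3 are required. With four prestige levels, the fourth indicator is implied by the other three, so one level is dropped and serves as the reference category.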
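
As a rough illustration of why three indicators are enough, the sketch below rebuilds the dropped indicator from the other three. `one_hot` is a hypothetical helper written only for this example (it is not a pandas function), and the sample values are the `prestige` entries of the first five rows shown above.

```python
# Hypothetical helper for illustration only; pd.get_dummies does the real work in this notebook.
def one_hot(value, levels):
    """Return a list of 0/1 indicators, one per level."""
    return [1 if value == level else 0 for level in levels]

levels = [1.0, 2.0, 3.0, 4.0]
sample = [3.0, 3.0, 1.0, 4.0, 4.0]  # prestige of the first five applicants

for prestige in sample:
    p1, p2, p3, p4 = one_hot(prestige, levels)
    # The dropped indicator is fully determined by the remaining three.
    recovered_p1 = 1 - (p2 + p3 + p4)
    print(prestige, [p2, p3, p4], "-> p1 =", recovered_p1)
```

Because `p1` can always be recovered this way, keeping all four columns adds no information, and tier 1 becomes the baseline the other tiers are compared against.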

> ### Question 4.  Why are we doing this?

Answer: The `prestige` variable is categorical, so each level needs to be encoded as its own 0/1 indicator before our algorithms can use it. Feeding in the raw 1-4 codes would treat the tiers as evenly spaced numbers.

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [12]:
# TODO

df_1 = pd.concat([df, hot_encoding], axis=1)
df_1.drop('prestige', axis=1, inplace=True)

df_1


Unnamed: 0,admit,gre,gpa,1.0,2.0,3.0,4.0
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0.0,1.0,0.0,0.0
396,0,560.0,3.04,0.0,0.0,1.0,0.0
397,0,460.0,2.63,0.0,1.0,0.0,0.0
398,0,700.0,3.65,0.0,1.0,0.0,0.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [20]:
# TODO
df_1.columns = ['admit', 'gre', 'gpa', 'p1','p2','p3','p4']

pd.crosstab(df_1[df_1.p1==1].p1, df_1[df_1.p1==1].admit)


p1,admit=0,admit=1
1.0,28,33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [29]:
# TODO
a = (33/(28* 1.0))
a

1.1785714285714286

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [31]:
# TODO
pd.crosstab(df_1[df_1.p1!=1].p1, df_1[df_1.p1!=1].admit)
b = (93 / (243 * 1.0))
b

0.38271604938271603

> ### Question 9.  Finally, what's the odds ratio?

In [32]:
# TODO

a/b

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

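Answer: The odds of admission for an applicant from a tier-1 school are about 3.08 times the odds for an applicant from any other tier. This is a ratio of odds, not of probabilities, so it does not mean tier-1 applicants are three times as likely to be admitted.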
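
To make the distinction between odds and probability concrete, here is a small sketch that computes both ratios from the counts in the Part A frequency table. The variable names are illustrative and not part of the assignment.

```python
# Counts taken from the prestige-by-admit frequency table in Part A.
admitted_t1, rejected_t1 = 33, 28        # prestige == 1
admitted_rest, rejected_rest = 93, 243   # prestige in {2, 3, 4}

# Odds = admitted / rejected
odds_t1 = admitted_t1 / float(rejected_t1)
odds_rest = admitted_rest / float(rejected_rest)
odds_ratio = odds_t1 / odds_rest

# Probability = admitted / total
prob_t1 = admitted_t1 / float(admitted_t1 + rejected_t1)
prob_rest = admitted_rest / float(admitted_rest + rejected_rest)
probability_ratio = prob_t1 / prob_rest

print("odds ratio:        %.3f" % odds_ratio)
print("probability ratio: %.3f" % probability_ratio)
```

The odds ratio is about 3.08, while the ratio of admission probabilities is noticeably smaller, which is why the finding should be worded in terms of odds.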

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [38]:
# TODO

pd.crosstab(df_1[df_1.p4==1].p4, df_1[df_1.p4==1].admit)

c = (12/(55 * 1.0))
c


0.21818181818181817

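Answer: The odds of admission for tier-4 applicants are 12/55 ≈ 0.218. For applicants from tiers 1-3 the odds are 114/216 ≈ 0.528, so the odds ratio is about 0.41: the odds of admission for a tier-4 applicant are roughly 41% of the odds for an applicant from a tier 1-3 school.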
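
An odds ratio needs two sets of odds: the group of interest and the comparison group. The sketch below, using the counts from the Part A frequency table, computes the odds of one tier against all other tiers combined. `odds_ratio_vs_rest` is a hypothetical helper named for this example.

```python
# Admit / reject counts per prestige tier, copied from the frequency table in Part A.
admitted = {1: 33, 2: 53, 3: 28, 4: 12}
rejected = {1: 28, 2: 95, 3: 93, 4: 55}

def odds_ratio_vs_rest(tier):
    """Odds of admission for `tier` divided by the odds for all other tiers combined."""
    odds_tier = admitted[tier] / float(rejected[tier])
    admitted_rest = sum(n for t, n in admitted.items() if t != tier)
    rejected_rest = sum(n for t, n in rejected.items() if t != tier)
    odds_rest = admitted_rest / float(rejected_rest)
    return odds_tier / odds_rest

print("tier 4 vs. tiers 1-3: %.3f" % odds_ratio_vs_rest(4))
print("tier 1 vs. tiers 2-4: %.3f" % odds_ratio_vs_rest(1))
```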

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [61]:
# TODO

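import statsmodels.api as sm  # array-based API; smf.logit (lowercase) is the formula interface

names_X = ['gre', 'gpa', 'p2', 'p3', 'p4']  # p1 is left out, so tier 1 is the reference level

def X_c(df):
    X = df[ names_X ]
    c = df.admit
    return X, c

# Fit on the full dataset; this notebook does not create a train/test split
train_X, train_c = X_c(df_1)

# No constant column is added, so the model is fit without an intercept
logit = sm.Logit(train_c, train_X)
result = logit.fit()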

Optimization terminated successfully.
         Current function value: 0.589121
         Iterations 5


> ### Question 13.  Print the model's summary results.

In [62]:
# TODO

result.summary()


0,1,2,3
Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,392.0
Method:,MLE,Df Model:,4.0
Date:,"Tue, 07 Feb 2017",Pseudo R-squ.:,0.05722
Time:,17:54:44,Log-Likelihood:,-233.88
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.039e-05

0,1,2,3,4,5
,coef,std err,z,P>|z|,[95.0% Conf. Int.]
gre,0.0014,0.001,1.308,0.191,-0.001 0.003
gpa,-0.1323,0.195,-0.680,0.497,-0.514 0.249
p2,-0.9562,0.302,-3.171,0.002,-1.547 -0.365
p3,-1.5375,0.332,-4.627,0.000,-2.189 -0.886
p4,-1.8699,0.401,-4.658,0.000,-2.657 -1.083


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [63]:
# TODO
np.exp(result.params)

gre    1.001368
gpa    0.876073
p2     0.384342
p3     0.214918
p4     0.154135
dtype: float64

In [65]:
# These bounds are on the log-odds scale; np.exp(conf_int) converts them to odds-ratio intervals
conf_int = result.conf_int()
conf_int

Unnamed: 0,0,1
gre,-0.00068,0.003414
gpa,-0.513657,0.249045
p2,-1.547279,-0.365166
p3,-2.188769,-0.88623
p4,-2.656743,-1.083112
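
The interval bounds above are for the coefficients (log-odds), whereas the question asks about odds ratios. A minimal sketch of the conversion, using only the standard library and the bounds copied from the output above, might look like this; `or_intervals` is an illustrative name.

```python
import math

# 95% confidence bounds for the coefficients (log-odds), copied from the table above.
coef_intervals = {
    'gre': (-0.00068, 0.003414),
    'gpa': (-0.513657, 0.249045),
    'p2': (-1.547279, -0.365166),
    'p3': (-2.188769, -0.88623),
    'p4': (-2.656743, -1.083112),
}

# Exponentiating each bound gives the interval for the odds ratio.
or_intervals = {}
for name in ['gre', 'gpa', 'p2', 'p3', 'p4']:
    low, high = coef_intervals[name]
    or_intervals[name] = (math.exp(low), math.exp(high))
    print("%-3s [%.4f, %.4f]" % (name, or_intervals[name][0], or_intervals[name][1]))
```

An odds-ratio interval that contains 1 (as for `gre` and `gpa`) corresponds to a coefficient interval that contains 0, meaning the data are compatible with no effect.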


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

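Answer: Holding `gre` and `gpa` constant, the odds of admission for an applicant from a tier-2 school are about 0.38 times the odds for an applicant from a tier-1 school (the reference level), i.e. roughly 62% lower.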

> ### Question 16.  Interpret the odds ratio of `gpa`.

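Answer: Holding the other features constant, each one-point increase in GPA multiplies the odds of admission by about 0.88, a decrease of roughly 12%. The effect is not statistically significant (p = 0.497, and the 95% confidence interval for the coefficient spans zero), so this model gives no clear evidence of a GPA effect in either direction.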
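
When wording these interpretations, it can help to convert each odds ratio into a percent change in the odds. The sketch below does that for the odds ratios printed in Question 14; `percent_change_in_odds` is a hypothetical helper for this example.

```python
# Odds ratios copied from the np.exp(result.params) output in Question 14.
odds_ratios = {
    'gre': 1.001368,
    'gpa': 0.876073,
    'p2': 0.384342,
    'p3': 0.214918,
    'p4': 0.154135,
}

def percent_change_in_odds(odds_ratio):
    """An odds ratio of 1 means no change; below 1 is a decrease, above 1 an increase."""
    return (odds_ratio - 1.0) * 100.0

for name in ['gre', 'gpa', 'p2', 'p3', 'p4']:
    print("%-3s odds change by %+.1f%%" % (name, percent_change_in_odds(odds_ratios[name])))
```

For the prestige indicators the change is relative to tier 1; for `gre` and `gpa` it is per one-unit increase.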

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [70]:
# TODO
result.predict([ [800, 4.0, 0, 0, 0] ])
result.predict([ [800, 4.0, 1, 0, 0] ])
result.predict([ [800, 4.0, 0, 1, 0] ])
result.predict([ [800, 4.0, 0, 0, 1] ])

array([ 0.21318433])

Answer: Approximately 0.64, 0.40, 0.27, and 0.21 for tiers 1 through 4, respectively.
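
These probabilities can be checked by hand with the logistic function. The sketch below assumes the coefficients can be recovered as the logarithm of the odds ratios printed in Question 14 and that the model has no intercept, as fitted above. `admit_probability` is a hypothetical helper, not a `statsmodels` function.

```python
import math

# Coefficients recovered as the log of the odds ratios printed in Question 14.
coef = {
    'gre': math.log(1.001368),
    'gpa': math.log(0.876073),
    'p2': math.log(0.384342),
    'p3': math.log(0.214918),
    'p4': math.log(0.154135),
}

def admit_probability(gre, gpa, tier):
    """Logistic function applied to the linear predictor (no intercept; tier 1 is the reference)."""
    z = coef['gre'] * gre + coef['gpa'] * gpa
    if tier > 1:
        z += coef['p%d' % tier]
    return 1.0 / (1.0 + math.exp(-z))

for tier in (1, 2, 3, 4):
    print("tier %d: %.3f" % (tier, admit_probability(800, 4.0, tier)))
```

The tier-4 value should agree with the `result.predict` output shown above, up to the rounding of the printed odds ratios.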

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [74]:
# TODO

X = df_1[ ['gre', 'gpa', 'p2', 'p3', 'p4'] ]
c = df.admit

sklearn_model = linear_model.LogisticRegression(fit_intercept=False, C=10 ** 2).fit(X, c)

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [75]:
# TODO
print(np.exp(sklearn_model.coef_))

[[ 1.00128195  0.87660575  0.41070667  0.22500931  0.16769503]]


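Answer: The odds ratios are close but not identical. For `gre` and `gpa` the two libraries differ by less than 0.001, while the prestige odds ratios are slightly closer to 1 in `sklearn` (about 0.41, 0.23, and 0.17 versus 0.38, 0.21, and 0.15). This is consistent with `LogisticRegression` applying L2 regularization by default, which shrinks coefficients toward zero even at a weak setting such as `C = 10 ** 2`.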
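
To put a number on the comparison, the sketch below computes the relative difference between the two sets of odds ratios, using the values printed in Question 14 and in the cell above. The variable names are illustrative.

```python
names = ['gre', 'gpa', 'p2', 'p3', 'p4']

# Odds ratios copied from the statsmodels and sklearn outputs in this notebook.
statsmodels_or = [1.001368, 0.876073, 0.384342, 0.214918, 0.154135]
sklearn_or = [1.00128195, 0.87660575, 0.41070667, 0.22500931, 0.16769503]

# Relative difference of the sklearn odds ratio with respect to the statsmodels one.
relative_diff = {}
for name, sm_or, sk_or in zip(names, statsmodels_or, sklearn_or):
    relative_diff[name] = (sk_or - sm_or) / sm_or
    print("%-3s %+.2f%%" % (name, 100 * relative_diff[name]))
```

All differences stay under ten percent, and the largest ones are on the prestige indicators.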

> ### Question 20.  Again, assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [82]:
# TODO
sklearn_model.predict_proba([ [800, 4.0, 0, 0, 0] ]) #0.62201467
sklearn_model.predict_proba([ [800, 4.0, 1, 0, 0] ]) #0.40329182
sklearn_model.predict_proba([ [800, 4.0, 0, 1, 0] ]) #0.27022029
sklearn_model.predict_proba([ [800, 4.0, 0, 0, 1] ]) #0.21627627

array([[ 0.78372373,  0.21627627]])

Answer: The `sklearn` probabilities (about 0.62, 0.40, 0.27, and 0.22 for tiers 1 through 4) are within about 0.02 of the `statsmodels` predictions.