# DS-NYC-45 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [3]:
# TODO
df['prestige'].value_counts().sort_index()

1.0     61
2.0    148
3.0    121
4.0     67
Name: prestige, dtype: int64

In [4]:
pd.crosstab(df['prestige'],df['admit'])

prestige,admit=0,admit=1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [5]:
# TODO
df['prestige_2_dummy'] = np.where(df['prestige'] == 2, 1, 0)
df['prestige_3_dummy'] = np.where(df['prestige'] == 3, 1, 0)
df['prestige_4_dummy'] = np.where(df['prestige'] == 4, 1, 0)

In [6]:
df.head()

Unnamed: 0,admit,gre,gpa,prestige,prestige_2_dummy,prestige_3_dummy,prestige_4_dummy
0,0,380.0,3.61,3.0,0,1,0
1,1,660.0,3.67,3.0,0,1,0
2,1,800.0,4.0,1.0,0,0,0
3,1,640.0,3.19,4.0,0,0,1
4,0,520.0,2.93,4.0,0,0,1


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: Three

> ### Question 4.  Why are we doing this?

Answer: We transform a categorical variable into multiple binary variables (number of categories minus one) so that it can be used in a regression. We don't need a dummy variable for every category because one category serves as the "default" (reference) category, for which all the dummy variables equal zero.
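The same "number of categories minus one" encoding can also be produced in one call. The sketch below uses `pd.get_dummies` with `drop_first=True` on a small made-up frame (`toy` is illustrative, not the admissions data). Note that the generated column names are `prestige_2`, `prestige_3`, `prestige_4` rather than the `*_dummy` names used in this notebook.

```python
import pandas as pd

# Toy frame with the same four prestige tiers as the admissions data (values are made up).
toy = pd.DataFrame({'prestige': [3.0, 3.0, 1.0, 4.0, 2.0]})

# Cast to int so the column names read prestige_2 instead of prestige_2.0.
# drop_first=True drops the lowest category (tier 1), which becomes the reference.
dummies = pd.get_dummies(toy['prestige'].astype(int), prefix='prestige', drop_first=True)
dummies = dummies.astype(int)  # newer pandas versions return booleans by default

print(dummies)
```

The tier-1 row comes out as all zeros, which is exactly the "default" category described in the answer above.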

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [7]:
df.drop('prestige', axis=1, inplace=True)
df.head()

Unnamed: 0,admit,gre,gpa,prestige_2_dummy,prestige_3_dummy,prestige_4_dummy
0,0,380.0,3.61,0,1,0
1,1,660.0,3.67,0,1,0
2,1,800.0,4.0,0,0,0
3,1,640.0,3.19,0,0,1
4,0,520.0,2.93,0,0,1


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [8]:
# TODO
df['admit'][(df['prestige_2_dummy']==0) & (df['prestige_3_dummy']==0) & (df['prestige_4_dummy']==0)].value_counts()

1    33
0    28
Name: admit, dtype: int64

> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [9]:
# TODO
33./28

1.1785714285714286

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [10]:
# TODO
df['admit'][(df['prestige_2_dummy']==1) | (df['prestige_3_dummy']==1) | (df['prestige_4_dummy']==1)].value_counts()

0    243
1     93
Name: admit, dtype: int64

In [11]:
93./243

0.38271604938271603

> ### Question 9.  Finally, what's the odds ratio?

In [12]:
# TODO
(33./28)/(93./243)

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission to graduate school for applicants who attended the most prestigious undergraduate schools are about three times the odds for those who did not attend the most prestigious schools.

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [13]:
# TODO
four_odds = 12./55

In [14]:
not_four_odds = float(33+53+28)/(28+95+93)

In [17]:
four_OR = four_odds/not_four_odds
print four_odds, not_four_odds, four_OR

0.218181818182 0.527777777778 0.413397129187


Answer: The odds of admission to graduate school for applicants who attended the least prestigious undergraduate schools are about 41% of the odds for those who attended more prestigious schools.
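The hand calculations in Questions 7 through 11 all follow one pattern: divide admitted by rejected within each group, then divide the two odds. A minimal sketch of that pattern, using the counts from the frequency table in Question 1; the helper names `group_odds` and `odds_ratio_2x2` are hypothetical and not part of the project.

```python
def group_odds(admitted, rejected):
    """Odds of admission = number admitted / number rejected."""
    return admitted / float(rejected)


def odds_ratio_2x2(group, reference):
    """Odds ratio of `group` relative to `reference`; each is an (admitted, rejected) pair."""
    return group_odds(*group) / group_odds(*reference)


# Counts taken from the prestige-by-admit frequency table.
tier_1 = (33, 28)
not_tier_1 = (53 + 28 + 12, 95 + 93 + 55)   # tiers 2, 3, 4 combined
tier_4 = (12, 55)
not_tier_4 = (33 + 53 + 28, 28 + 95 + 93)   # tiers 1, 2, 3 combined

tier_1_or = odds_ratio_2x2(tier_1, not_tier_1)
tier_4_or = odds_ratio_2x2(tier_4, not_tier_4)
print(round(tier_1_or, 3), round(tier_4_or, 3))
```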

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [18]:
# TODO
import statsmodels.api as sm

X = df.drop('admit', axis=1)
y = df['admit']
X['intercept'] = 1
model = sm.Logit(y, X)
results = model.fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [19]:
# TODO
print results.summary()

                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      391
Method:                           MLE   Df Model:                            5
Date:                Mon, 16 Jan 2017   Pseudo R-squ.:                 0.08166
Time:                        19:55:52   Log-Likelihood:                -227.82
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.176e-07
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
gre                  0.0022      0.001      2.028      0.043      7.44e-05     0.004
gpa                  0.7793      0.333      2.344      0.019         0.128     1.431
prestige_2_dummy    -0.6801 

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [20]:
# TODO
odds_ratio = np.exp(results.conf_int())
odds_ratio['odds_ratio'] = np.exp(results.params)
odds_ratio.columns = ['lower_bound', 'upper_bound', 'odds_ratio']

In [21]:
odds_ratio

Unnamed: 0,lower_bound,upper_bound,odds_ratio
gre,1.000074,1.004372,1.002221
gpa,1.13612,4.183113,2.180027
prestige_2_dummy,0.272168,0.942767,0.506548
prestige_3_dummy,0.133377,0.515419,0.262192
prestige_4_dummy,0.093329,0.479411,0.211525
intercept,0.002207,0.19444,0.020716


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: Holding `gre` and `gpa` constant, the odds of admission to graduate school for applicants who attended a tier-2 undergraduate school are about half (0.51 times) the odds for those who attended a tier-1 school.

> ### Question 16.  Interpret the odds ratio of `gpa`.

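Answer: Holding `gre` and prestige constant, each additional 1.0 of GPA multiplies the odds of admission to graduate school by about 2.2, so the odds roughly double. This applies to applicants from every prestige tier, not only tier 1.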
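In a logistic regression without interaction terms, the odds ratio of a feature does not depend on the values of the other features. The sketch below checks this for `gpa` by rebuilding the coefficients from the rounded odds ratios in the Question 14 table (so results match the fitted model only approximately). The `fitted_odds` helper and the example applicant (GRE 600, GPA 2.5 versus 3.5) are illustrative assumptions.

```python
import math

# Odds ratios copied from the Question 14 table; taking logs recovers the coefficients.
odds_ratios = {
    'gre': 1.002221,
    'gpa': 2.180027,
    'prestige_2_dummy': 0.506548,
    'prestige_3_dummy': 0.262192,
    'prestige_4_dummy': 0.211525,
    'intercept': 0.020716,
}
coef = {name: math.log(value) for name, value in odds_ratios.items()}


def fitted_odds(gre, gpa, tier):
    """Odds of admission; tier 1 is the reference category (all dummies zero)."""
    log_odds = coef['intercept'] + coef['gre'] * gre + coef['gpa'] * gpa
    if tier != 1:
        log_odds += coef['prestige_%d_dummy' % tier]
    return math.exp(log_odds)


# Raising GPA by 1.0 multiplies the odds by the same factor in every tier.
for tier in (1, 2, 3, 4):
    ratio = fitted_odds(600, 3.5, tier) / fitted_odds(600, 2.5, tier)
    print(tier, round(ratio, 3))
```

Each tier prints a ratio of about 2.18, the `gpa` odds ratio from the table.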

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [22]:
# TODO
odds_1 = np.exp(np.dot(results.params, [800, 4, 0, 0, 0, 1]))
odds_2 = np.exp(np.dot(results.params, [800, 4, 1, 0, 0, 1]))
odds_3 = np.exp(np.dot(results.params, [800, 4, 0, 1, 0, 1]))
odds_4 = np.exp(np.dot(results.params, [800, 4, 0, 0, 1, 1]))
odds = np.array([odds_1, odds_2, odds_3, odds_4])
table = pd.DataFrame({'odds': odds}, index=[1, 2, 3, 4])
table.index.name = 'tier'
table['probability'] = table.odds / (1 + table.odds)

In [23]:
table

tier,odds,probability
1,2.759964,0.73404
2,1.398053,0.582995
3,0.723641,0.419833
4,0.583802,0.368608


Answer: 73% for tier-1 applicant, 58% for tier-2 applicant, 42% for tier-3 applicant and 37% for tier-4 applicant
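For reference, the conversion used in the code above can be written out as follows. The symbols $\beta_0$, $\beta_{gre}$, $\beta_{gpa}$, and $\beta_{tier}$ are notation introduced here for the fitted intercept and coefficients, with $\beta_{tier} = 0$ for the tier-1 reference category.

$$
z = \beta_0 + \beta_{gre} \cdot 800 + \beta_{gpa} \cdot 4 + \beta_{tier},
\qquad
\text{odds} = e^{z},
\qquad
p = \frac{\text{odds}}{1 + \text{odds}} = \frac{1}{1 + e^{-z}}
$$

Plugging in the odds from the table:

$$
p_{\text{tier 1}} = \frac{2.759964}{1 + 2.759964} \approx 0.734,
\qquad
p_{\text{tier 4}} = \frac{0.583802}{1 + 0.583802} \approx 0.369
$$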

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [24]:
# TODO
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C = 10 ** 2)
X.drop('intercept', axis=1, inplace=True)
logreg.fit(X, y)

LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [25]:
# TODO
sklearn_OR = np.exp(np.append(logreg.coef_,logreg.intercept_))
odds_ratio['sklearn_OR'] = sklearn_OR
odds_ratio

Unnamed: 0,lower_bound,upper_bound,odds_ratio,sklearn_OR
gre,1.000074,1.004372,1.002221,1.002161
gpa,1.13612,4.183113,2.180027,1.960413
prestige_2_dummy,0.272168,0.942767,0.506548,0.533219
prestige_3_dummy,0.133377,0.515419,0.262192,0.285867
prestige_4_dummy,0.093329,0.479411,0.211525,0.208297
intercept,0.002207,0.19444,0.020716,0.029754


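Answer: The sklearn odds ratios ('sklearn_OR') are close to those calculated with statsmodels (for example, 1.96 versus 2.18 for `gpa`), and each one falls inside the corresponding statsmodels 95% confidence interval. They are not identical, most likely because `LogisticRegression` applies an L2 penalty even at `C = 10 ** 2`, while the statsmodels fit is unpenalized.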
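One way to see how `C` influences the sklearn coefficients is to refit the same model at several values of `C`. The sketch below does this on synthetic data (`X_demo` and `y_demo` are made up, not the admissions data) and uses the default solver of current scikit-learn releases, which may differ from the `liblinear` solver shown in the output above. Smaller `C` means a stronger L2 penalty and coefficients pulled closer to zero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data with known log-odds (illustrative only).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(400, 2))
true_log_odds = -0.5 + 1.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1]
y_demo = (rng.uniform(size=400) < 1 / (1 + np.exp(-true_log_odds))).astype(int)

# Fit the same model with a strong, weak, and nearly absent L2 penalty.
coef_norms = {}
for C in (0.01, 100, 1e6):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(C, model.coef_.round(3), model.intercept_.round(3))
```

With `C = 100` the coefficients are already close to the nearly unpenalized fit at `C = 1e6`, which is consistent with the small differences seen in the comparison table.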

> ### Question 20.  Again assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [26]:
# TODO
sklearn_prob = logreg.predict_proba([[800, 4, 0, 0, 0], [800, 4, 1, 0, 0], [800, 4, 0, 1, 0], [800, 4, 0, 0, 1]])
table['sklearn_prob'] = sklearn_prob[:,1]
table

tier,odds,probability,sklearn_prob
1,2.759964,0.73404,0.711854
2,1.398053,0.582995,0.568463
3,0.723641,0.419833,0.413911
4,0.583802,0.368608,0.339755


Answer: 71% for tier-1 applicant, 57% for tier-2 applicant, 41% for tier-3 applicant and 34% for tier-4 applicant