# DS-NYC-45 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [3]:
# TODO
prestige_group = df.groupby("prestige")
prestige_group["admit"].value_counts()

prestige  admit
1.0       1        33
          0        28
2.0       0        95
          1        53
3.0       0        93
          1        28
4.0       0        55
          1        12
Name: admit, dtype: int64
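
The same counts can also be laid out as a two-way table. The sketch below is a minimal illustration, assuming the counts printed above are rebuilt into a small stand-in frame (the name `toy` is hypothetical and is not the notebook's `df`); `pd.crosstab` then puts `prestige` in the rows and `admit` in the columns.

```python
import pandas as pd

# Counts copied from the frequency table above: (prestige, admit) -> count.
counts = {
    (1, 1): 33, (1, 0): 28,
    (2, 0): 95, (2, 1): 53,
    (3, 0): 93, (3, 1): 28,
    (4, 0): 55, (4, 1): 12,
}

# Rebuild one row per applicant so the example is self-contained.
rows = [(p, a) for (p, a), n in counts.items() for _ in range(n)]
toy = pd.DataFrame(rows, columns=["prestige", "admit"])

# Rows are prestige tiers, columns are admit = 0 / admit = 1.
table = pd.crosstab(toy["prestige"], toy["admit"])
print(table)
```

With the real data, the equivalent call would be `pd.crosstab(df["prestige"], df["admit"])`.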

## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [4]:
# TODO
prestige_one_hot = pd.get_dummies(df["prestige"])
prestige_one_hot

Unnamed: 0,1.0,2.0,3.0,4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

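Answer: 3. With four prestige tiers, the fourth dummy variable is fully determined by the other three, so including all four alongside an intercept would make the predictors perfectly collinear. The omitted tier serves as the reference category.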

> ### Question 4.  Why are we doing this?

Answer: "Prestige" is really a categorical variable, but if we pass it to sklearn or statsmodels without transforming it into dummy variables, the library will treat it as numeric. That is misleading because it forces the model to assume that each one-step change in tier (1 to 2, 2 to 3, 3 to 4) has the same effect on the log-odds of admission, and nothing guarantees that the tiers are evenly spaced in that sense. Converting "prestige" to dummy variables lets the model estimate a separate effect for each prestige tier.
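
As an alternative to building all four columns and leaving one out later, `pd.get_dummies` can drop the reference level itself. The sketch below is only an illustration on a handful of made-up prestige values (not the admissions data); it assumes integer-coded tiers so that the generated column names are `prestige_2`, `prestige_3`, and `prestige_4`.

```python
import pandas as pd

# Toy prestige values (illustrative only, not the full admissions data).
prestige = pd.Series([3, 3, 1, 4, 4, 2], name="prestige")

# drop_first=True drops the first level (tier 1), which becomes the
# reference category that the remaining dummies are compared against.
dummies = pd.get_dummies(prestige, prefix="prestige", drop_first=True)
print(dummies.columns.tolist())
```

A tier-1 applicant is then represented by zeros in all three remaining columns.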

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [5]:
df = df.join(prestige_one_hot)
df = df.drop("prestige", axis = 1)
df

Unnamed: 0,admit,gre,gpa,1.0,2.0,3.0,4.0
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0.0,1.0,0.0,0.0
396,0,560.0,3.04,0.0,0.0,1.0,0.0
397,0,460.0,2.63,0.0,1.0,0.0,0.0
398,0,700.0,3.65,0.0,1.0,0.0,0.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [6]:
# TODO
df[df[1.0]==1]["admit"].value_counts()

1    33
0    28
Name: admit, dtype: int64

> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [7]:
# TODO
odds_prestige1 = 33/28
odds_prestige1

1.1785714285714286

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [8]:
# TODO
other_undergrads_not_admitted = len(df[(df[1.0]!=1) & (df["admit"]==0)])
other_undergrads_admitted = len(df[(df[1.0]!=1) & (df["admit"]==1)])
odds_other_undergrads = other_undergrads_admitted / other_undergrads_not_admitted
odds_other_undergrads

0.38271604938271603

> ### Question 9.  Finally, what's the odds ratio?

In [9]:
# TODO
odds_ratio = odds_prestige1 / odds_other_undergrads
odds_ratio

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

Answer: If an applicant attended one of the most prestigious undergraduate schools, then his or her odds of being admitted to graduate school are slightly more than three times the odds of someone who attended a less prestigious school being admitted.

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [10]:
# TODO
odds_admitted_prestige4 = 12/55
others_admitted = len(df[(df[4.0]!=1) & (df["admit"] == 1)])
others_not_admitted = len(df[(df[4.0]!=1) & (df["admit"]==0)])
odds_others_admitted = others_admitted / others_not_admitted
print("Odds of being admitted for applicants from least prestigious schools:", odds_admitted_prestige4)
print("Odds ratio:", odds_admitted_prestige4 / odds_others_admitted)

Odds of being admitted for applicants from least prestigious schools: 0.21818181818181817
Odds ratio: 0.4133971291866028


Answer: If an applicant attended one of the least prestigious undergraduate schools, his or her odds of being admitted to graduate school at UCLA are only about 41% of the odds of someone who attended a more prestigious school being admitted.
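
The two hand calculations in this part follow the same pattern, which can be sketched as a pair of small helpers. The function names (`odds`, `odds_ratio`, `tier_vs_rest`) are hypothetical, and the counts are copied from the frequency table in Part A rather than read from `df`.

```python
def odds(successes, failures):
    """Odds = number of successes per failure."""
    return successes / failures


def odds_ratio(group_yes, group_no, rest_yes, rest_no):
    """Odds of the group divided by the odds of everyone else."""
    return odds(group_yes, group_no) / odds(rest_yes, rest_no)


# Counts copied from the Part A frequency table, keyed by prestige tier.
admitted = {1: 33, 2: 53, 3: 28, 4: 12}
rejected = {1: 28, 2: 95, 3: 93, 4: 55}


def tier_vs_rest(tier):
    """Odds ratio of one tier against all other tiers pooled together."""
    rest_yes = sum(n for t, n in admitted.items() if t != tier)
    rest_no = sum(n for t, n in rejected.items() if t != tier)
    return odds_ratio(admitted[tier], rejected[tier], rest_yes, rest_no)


for tier in (1, 4):
    print(f"Tier {tier} vs. all other tiers: odds ratio = {tier_vs_rest(tier):.4f}")
```

This reproduces the 3.08 and 0.41 odds ratios computed in Questions 9 and 11.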

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [11]:
# TODO
# Only the dummy columns need new names; the other columns keep theirs.
df.rename(columns = {1.0:"prestige1", 2.0:"prestige2", 3.0:"prestige3", 4.0:"prestige4"}, inplace = True)

In [12]:
df['intercept'] = 1

In [13]:
df.head()

Unnamed: 0,admit,gre,gpa,prestige1,prestige2,prestige3,prestige4,intercept
0,0,380.0,3.61,0.0,0.0,1.0,0.0,1
1,1,660.0,3.67,0.0,0.0,1.0,0.0,1
2,1,800.0,4.0,1.0,0.0,0.0,0.0,1
3,1,640.0,3.19,0.0,0.0,0.0,1.0,1
4,0,520.0,2.93,0.0,0.0,0.0,1.0,1


In [14]:
features = ["gre", "gpa", "prestige2", "prestige3", "prestige4", "intercept"]
logreg_model = smf.Logit(df["admit"], df[features])
results = logreg_model.fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [15]:
# TODO
print(results.summary())

                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      391
Method:                           MLE   Df Model:                            5
Date:                Mon, 16 Jan 2017   Pseudo R-squ.:                 0.08166
Time:                        16:49:52   Log-Likelihood:                -227.82
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.176e-07
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
gre            0.0022      0.001      2.028      0.043      7.44e-05     0.004
gpa            0.7793      0.333      2.344      0.019         0.128     1.431
prestige2     -0.6801      0.317     -2.146      0.0

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [16]:
# TODO
print("Odds ratios of features")
print(np.exp(results.params))

Odds ratios of features
gre          1.002221
gpa          2.180027
prestige2    0.506548
prestige3    0.262192
prestige4    0.211525
intercept    0.020716
dtype: float64


In [17]:
print("Odds ratios of features, 95% confidence intervals")
print(np.exp(results.conf_int()))

Odds ratios of features, 95% confidence intervals
                  0         1
gre        1.000074  1.004372
gpa        1.136120  4.183113
prestige2  0.272168  0.942767
prestige3  0.133377  0.515419
prestige4  0.093329  0.479411
intercept  0.002207  0.194440
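
To see where these intervals come from, the sketch below rebuilds the `gpa` interval by hand. It assumes the usual normal-approximation interval (coefficient plus or minus about 1.96 standard errors on the log-odds scale) and uses the rounded `coef` and `std err` values from the summary table, so the result agrees with the output above only to about two decimals.

```python
import numpy as np

# Rounded values from the gpa row of the summary table.
coef, std_err = 0.7793, 0.333

# Approximate 97.5th percentile of the standard normal distribution.
z = 1.96

# The interval is built on the log-odds scale first...
lower, upper = coef - z * std_err, coef + z * std_err

# ...and each endpoint is then exponentiated to get the odds-ratio scale.
or_lower, or_upper = np.exp(lower), np.exp(upper)
print(f"gpa odds ratio 95% CI: [{or_lower:.3f}, {or_upper:.3f}]")
```

Because exponentiation is monotonic, exponentiating the endpoints of the coefficient interval gives a valid interval for the odds ratio, which is what `np.exp(results.conf_int())` does for every feature at once.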


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: Holding GRE and GPA constant, the odds of admission to UCLA for an applicant from a school with a prestige value of 2 are slightly more than half (about 0.51 times) the odds for an applicant from one of the most prestigious schools. Furthermore, we are 95% confident that the odds ratio for prestige = 2 is between 0.272 and 0.943; because this interval excludes 1, the difference from tier 1 is statistically significant at the 5% level.

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: Holding the other variables constant, an increase of one point in an applicant's GPA increases his or her odds of being admitted by 118%.
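
The step from a coefficient to a percentage change in the odds can be sketched as follows. The helper name `pct_change_in_odds` is hypothetical, and the coefficients are the rounded values from the summary table.

```python
import numpy as np


def pct_change_in_odds(coef, delta=1.0):
    """Percent change in the odds when a feature increases by `delta` units."""
    return (np.exp(coef * delta) - 1) * 100


# A one-point increase in GPA.
gpa_pct = pct_change_in_odds(0.7793)

# A one-point GRE change is tiny, so a 100-point change is easier to read.
gre_pct = pct_change_in_odds(0.0022, delta=100)

print(f"gpa +1:   odds change by about {gpa_pct:.0f}%")
print(f"gre +100: odds change by about {gre_pct:.1f}%")
```

Scaling the change before exponentiating matters: the effect of 100 GRE points is `exp(100 * coef)`, not 100 times the one-point effect.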

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [18]:
# Question 17 answer
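def calc_prob(gre=800, gpa=4, prestige2=0, prestige3=0, prestige4=0, intercept=1):
    # Coefficients are copied by hand (rounded to four decimals) from the statsmodels fit.
    log_odds = .0022 * gre + .7793 * gpa - 0.6801 * prestige2 - 1.3387 * prestige3 - 1.5534 * prestige4 - 3.8769 * intercept
    # Convert log-odds to odds, then odds to a probability.
    odds = np.exp(log_odds)
    prob = odds / (1 + odds)
    return prob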

print("Tier 1 probability:", calc_prob())
print("Tier 2 probability:", calc_prob(prestige2=1))
print("Tier 3 probability:", calc_prob(prestige3=1))
print("Tier 4 probability:", calc_prob(prestige4=1))

Tier 1 probability: 0.731117558121
Tier 2 probability: 0.579372992908
Tier 3 probability: 0.416198188472
Tier 4 probability: 0.365145484814
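
The same calculation can be written so that the coefficients live in one place and the log-odds are turned into a probability with the logistic function. This is only a sketch: `COEFS` and `predict_prob` are hypothetical names, and the values are the rounded coefficients used in the cell above (with the fitted model in memory, `results.params` would supply unrounded values).

```python
import numpy as np

# Rounded coefficients from the statsmodels fit (tier 1 is the reference).
COEFS = {
    "gre": 0.0022,
    "gpa": 0.7793,
    "prestige2": -0.6801,
    "prestige3": -1.3387,
    "prestige4": -1.5534,
    "intercept": -3.8769,
}


def predict_prob(**features):
    """Probability of admission for the given feature values."""
    row = {"intercept": 1, **features}
    log_odds = sum(COEFS[name] * value for name, value in row.items())
    # Logistic function: equivalent to odds / (1 + odds).
    return 1 / (1 + np.exp(-log_odds))


for tier in (1, 2, 3, 4):
    # Exactly one dummy is 1 for tiers 2-4; all are 0 for the reference tier.
    dummies = {f"prestige{t}": int(t == tier) for t in (2, 3, 4)}
    print(f"Tier {tier} probability: {predict_prob(gre=800, gpa=4, **dummies):.4f}")
```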


## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [19]:
# TODO
new_features = ["gre", "gpa", "prestige2", "prestige3", "prestige4"]
new_model = linear_model.LogisticRegression(C = 10**2)
new_model.fit(df[new_features], df["admit"])

LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [20]:
# TODO
print("Odds ratios calculated with sklearn")
list(zip(new_features,np.exp(new_model.coef_)[0]))

Odds ratios calculated with sklearn


[('gre', 1.002160546045296),
 ('gpa', 1.9604125879025451),
 ('prestige2', 0.53321935659458952),
 ('prestige3', 0.28586733124690045),
 ('prestige4', 0.20829662792241602)]

In [21]:
print("Odds ratios calculated with statsmodels")
print(np.exp(results.params))

Odds ratios calculated with statsmodels
gre          1.002221
gpa          2.180027
prestige2    0.506548
prestige3    0.262192
prestige4    0.211525
intercept    0.020716
dtype: float64


Answer: For the most part, the odds ratios are similar. The exception is the odds ratio for gpa, which is notably higher when calculated with statsmodels. Holding the other variables constant, the statsmodels fit estimates that the odds of being admitted rise by 118% with a one-point increase in GPA, while the sklearn fit estimates that they rise by 96%. One likely reason for the gap is that sklearn's `LogisticRegression` applies an L2 penalty by default; `C = 100` weakens that penalty but does not remove it, whereas statsmodels fits an unpenalized maximum-likelihood model.
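
The effect of `C` on the fitted coefficients can be illustrated with a small experiment. The sketch below uses synthetic data (an assumption made purely for illustration, not the admissions data) and shows that the coefficient vector is shrunk toward zero when `C` is small and grows toward the unpenalized fit as `C` increases.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative only): two standardized features.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
true_log_odds = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(400) < 1 / (1 + np.exp(-true_log_odds))).astype(int)


def coef_norm(C):
    """Euclidean norm of the fitted coefficients for a given C."""
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X, y)
    return float(np.linalg.norm(model.coef_))


# Smaller C means a stronger L2 penalty and more shrinkage.
norms = {C: coef_norm(C) for C in (0.01, 1.0, 100.0, 1e6)}
for C, norm in norms.items():
    print(f"C = {C:g}: coefficient norm = {norm:.4f}")
```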

> ### Question 20.  Again assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [22]:
coeffs = new_model.coef_[0]
model_intercept = new_model.intercept_
def calc_prob2(gre=800, gpa=4, prestige2=0, prestige3=0, prestige4=0, intercept=model_intercept):
    log_odds = (coeffs[0] * gre) + (coeffs[1] * gpa) + (coeffs[2] * prestige2) + \
    (coeffs[3] * prestige3) + (coeffs[4] * prestige4) + intercept
    odds = np.exp(log_odds)
    prob = odds / (1+ odds)
    return prob

print("Tier 1 probability:", calc_prob2())
print("Tier 2 probability:", calc_prob2(prestige2=1))
print("Tier 3 probability:", calc_prob2(prestige3=1))
print("Tier 4 probability:", calc_prob2(prestige4=1))

Tier 1 probability: [ 0.71185395]
Tier 2 probability: [ 0.56846298]
Tier 3 probability: [ 0.41391064]
Tier 4 probability: [ 0.33975486]
