# DS-SF-30 | Unit Project 3: Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf
import statsmodels.api as sm

from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

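| | admit | gre | gpa | prestige |
|---|---|---|---|---|
| 0 | 0 | 380.0 | 3.61 | 3.0 |
| 1 | 1 | 660.0 | 3.67 | 3.0 |
| 2 | 1 | 800.0 | 4.00 | 1.0 |
| 3 | 1 | 640.0 | 3.19 | 4.0 |
| 4 | 0 | 520.0 | 2.93 | 4.0 |
| ... | ... | ... | ... | ... |
| 395 | 0 | 620.0 | 4.00 | 2.0 |
| 396 | 0 | 560.0 | 3.04 | 3.0 |
| 397 | 0 | 460.0 | 2.63 | 2.0 |
| 398 | 0 | 700.0 | 3.65 | 2.0 |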


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [3]:
df.prestige.value_counts()

2.0    148
3.0    121
4.0     67
1.0     61
Name: prestige, dtype: int64
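
The counts above cover `prestige` on its own. A two-way frequency table that also splits by `admit` could be built with `pd.crosstab`. The sketch below is only an illustration: `toy_df` is a small made-up frame (not the admissions data) so that the snippet runs by itself.

```python
import pandas as pd

# Hypothetical toy data standing in for the admissions DataFrame
toy_df = pd.DataFrame({
    'admit':    [0, 1, 1, 1, 0, 0, 1, 0],
    'prestige': [3.0, 3.0, 1.0, 4.0, 4.0, 2.0, 2.0, 2.0],
})

# Rows: prestige tier; columns: admit (0 = not admitted, 1 = admitted)
freq_table = pd.crosstab(toy_df['prestige'], toy_df['admit'])
print(freq_table)
```

On the project data the equivalent call would be `pd.crosstab(df.prestige, df.admit)`.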

## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [4]:
prestige_df = pd.get_dummies(df.prestige, prefix = 'prestige')
prestige_df

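| | prestige_1.0 | prestige_2.0 | prestige_3.0 | prestige_4.0 |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... |
| 395 | 0.0 | 1.0 | 0.0 | 0.0 |
| 396 | 0.0 | 0.0 | 1.0 | 0.0 |
| 397 | 0.0 | 1.0 | 0.0 | 0.0 |
| 398 | 0.0 | 1.0 | 0.0 | 0.0 |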


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: We only need 3, since the fourth is fully determined by the other three (if all three are 0, the fourth must be 1). The category we leave out becomes the reference level. Based on the counts, `prestige_2.0` is a natural one to leave out since it occurs most frequently.

> ### Question 4.  Why are we doing this?

Answer: We want to avoid perfect multicollinearity: the four binary variables always sum to 1, which duplicates the intercept column. We can use just 3 of them, remove the intercept, or use regularization.
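
The collinearity can be checked numerically. This is a minimal sketch on made-up `prestige` values (not the real data): with an intercept plus all four dummies, the design matrix has fewer independent columns than columns, and dropping one dummy restores full rank.

```python
import numpy as np
import pandas as pd

# Hypothetical prestige values (not the admissions data)
prestige = pd.Series([1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 1.0, 4.0])

all_dummies = pd.get_dummies(prestige, prefix='prestige').astype(float)
intercept = np.ones((len(prestige), 1))

# Intercept + all 4 dummies: 5 columns, but the dummies sum to the intercept column
X_full = np.hstack([intercept, all_dummies.values])

# Intercept + 3 dummies (first category dropped, so it becomes the reference level)
X_reduced = np.hstack([intercept, all_dummies.iloc[:, 1:].values])

rank_full = np.linalg.matrix_rank(X_full)        # one less than the number of columns
rank_reduced = np.linalg.matrix_rank(X_reduced)  # equal to the number of columns
print(X_full.shape[1], rank_full, X_reduced.shape[1], rank_reduced)
```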

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [5]:
df = df[['admit', 'gre', 'gpa']].join(prestige_df)
df

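| | admit | gre | gpa | prestige_1.0 | prestige_2.0 | prestige_3.0 | prestige_4.0 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 380.0 | 3.61 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1 | 660.0 | 3.67 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1 | 800.0 | 4.00 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1 | 640.0 | 3.19 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0 | 520.0 | 2.93 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | 0 | 620.0 | 4.00 | 0.0 | 1.0 | 0.0 | 0.0 |
| 396 | 0 | 560.0 | 3.04 | 0.0 | 0.0 | 1.0 | 0.0 |
| 397 | 0 | 460.0 | 2.63 | 0.0 | 1.0 | 0.0 | 0.0 |
| 398 | 0 | 700.0 | 3.65 | 0.0 | 1.0 | 0.0 | 0.0 |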


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [6]:
df[(df['prestige_1.0'] == 1)].admit.value_counts()

1    33
0    28
Name: admit, dtype: int64

> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [7]:
1.18

1.18

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [8]:
0.38

0.38

> ### Question 9.  Finally, what's the odds ratio?

In [9]:
3.10

3.1

> ### Question 10.  Write this finding in a sentence.

Answer: Applicants who attended a `#1` ranked college have about 3.1 times the odds of being admitted compared to applicants who did not attend a `#1` ranked college.
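
The arithmetic behind Questions 7 to 9 can be sketched as follows. The counts 33 and 28 come from the frequency table in Question 6, and 0.38 is the already-rounded odds reported in Question 8; the variable names are illustrative only.

```python
# Counts from the frequency table for prestige = 1
admitted_p1 = 33
not_admitted_p1 = 28

# Odds = admitted / not admitted (float() avoids integer division on Python 2)
odds_p1 = admitted_p1 / float(not_admitted_p1)

# Odds for all other applicants, as reported in Question 8 (already rounded)
odds_not_p1 = 0.38

# Odds ratio = odds in the group of interest / odds in the comparison group
odds_ratio = odds_p1 / odds_not_p1
print('odds: %.2f, odds ratio: %.1f' % (odds_p1, odds_ratio))
```

Note that odds are not probabilities: odds of 1.18 correspond to a probability of 33 / 61, or about 0.54.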

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [10]:
df[(df['prestige_4.0'] == 1)].admit.value_counts()
odds_p4a = 0.22

df[(df['prestige_4.0'] != 1)].admit.value_counts()
odds_not_p4a = 0.53

odds_ratio = 0.42


Answer: Applicants who attended the least prestigious colleges have less than half the odds (an odds ratio of 0.42) of being admitted to UCLA compared to all other applicants.

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [11]:
df['intercept'] = 1.0

#train_df = df.sample(frac = .8, random_state = 0)
#test_df = df.drop(train_df.index)

train_cols = ['intercept', 'gre', 'gpa', 'prestige_2.0', 'prestige_3.0', 'prestige_4.0']

#logit = sm.Logit(train_df['admit'], train_df[train_cols])
logit = sm.Logit(df['admit'], df[train_cols])

# fit the model
model = logit.fit()


Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [12]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | admit | No. Observations: | 397 |
| Model: | Logit | Df Residuals: | 391 |
| Method: | MLE | Df Model: | 5 |
| Date: | Tue, 07 Feb 2017 | Pseudo R-squ.: | 0.08166 |
| Time: | 20:04:48 | Log-Likelihood: | -227.82 |
| converged: | True | LL-Null: | -248.08 |
| | | LLR p-value: | 1.176e-07 |

| | coef | std err | z | P>\|z\| | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|
| intercept | -3.8769 | 1.142 | -3.393 | 0.001 | -6.116 | -1.638 |
| gre | 0.0022 | 0.001 | 2.028 | 0.043 | 7.44e-05 | 0.004 |
| gpa | 0.7793 | 0.333 | 2.344 | 0.019 | 0.128 | 1.431 |
| prestige_2.0 | -0.6801 | 0.317 | -2.146 | 0.032 | -1.301 | -0.059 |
| prestige_3.0 | -1.3387 | 0.345 | -3.882 | 0.000 | -2.015 | -0.663 |
| prestige_4.0 | -1.5534 | 0.417 | -3.721 | 0.000 | -2.372 | -0.735 |


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [13]:
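print('odds ratios')
print(np.exp(model.params))
print('')
# conf_int() returns bounds on the log-odds (coefficient) scale;
# apply np.exp to them to get intervals for the odds ratios
print('confidence intervals')
print(model.conf_int())

odds ratios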
intercept       0.020716
gre             1.002221
gpa             2.180027
prestige_2.0    0.506548
prestige_3.0    0.262192
prestige_4.0    0.211525
dtype: float64

confidence intervals
                     0         1
intercept    -6.116077 -1.637631
gre           0.000074  0.004362
gpa           0.127619  1.431056
prestige_2.0 -1.301337 -0.058936
prestige_3.0 -2.014579 -0.662776
prestige_4.0 -2.371624 -0.735197
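
The intervals printed above are on the log-odds scale, while the question asks about odds ratios. A minimal sketch of the conversion, with the bounds copied by hand from the output above into a standalone frame (`log_odds_ci` is an illustrative name):

```python
import numpy as np
import pandas as pd

# Log-odds confidence bounds copied from the output above
log_odds_ci = pd.DataFrame(
    {0: [-6.116077, 0.000074, 0.127619, -1.301337, -2.014579, -2.371624],
     1: [-1.637631, 0.004362, 1.431056, -0.058936, -0.662776, -0.735197]},
    index=['intercept', 'gre', 'gpa',
           'prestige_2.0', 'prestige_3.0', 'prestige_4.0'],
)

# exp() is monotonic, so exponentiating each bound gives the odds-ratio interval
odds_ratio_ci = np.exp(log_odds_ci)
print(odds_ratio_ci)
```

On the fitted model the same result would come from `np.exp(model.conf_int())`. An odds-ratio interval that excludes 1 corresponds to a coefficient interval that excludes 0.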


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: Holding `gre` and `gpa` constant, applicants from colleges with `prestige = 2` have about 0.51 times the odds of being admitted compared to applicants from `prestige = 1` colleges (the reference category), i.e., roughly half the odds.

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: Holding the other features constant, a one-point increase in `gpa` multiplies the odds of being admitted by about 2.18.
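
Because the model is linear in the log-odds, the effect of any change in a feature is `exp(coef * change)`. The sketch below uses the rounded coefficients from the summary table; `odds_multiplier` is a hypothetical helper, not part of `statsmodels`.

```python
import math

# Coefficients (log-odds per unit) from the model summary, as rounded there
coef_gpa = 0.7793
coef_gre = 0.0022

def odds_multiplier(coef, delta):
    """Factor by which the odds change when the feature increases by `delta`."""
    return math.exp(coef * delta)

gpa_plus_one = odds_multiplier(coef_gpa, 1.0)      # one full GPA point
gpa_plus_half = odds_multiplier(coef_gpa, 0.5)     # half a GPA point
gre_plus_hundred = odds_multiplier(coef_gre, 100)  # 100 GRE points
print(gpa_plus_one, gpa_plus_half, gre_plus_hundred)
```

The effects multiply rather than add: two half-point GPA increases give the same factor as one full point.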

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [16]:
df_to_predict = pd.DataFrame({
    'intercept': [1.0] * 4,
    'gre': [800.0] * 4,
    'gpa': [4.0] * 4,
    'prestige': [1.0, 2.0, 3.0, 4.0]
})
prestige_df_to_predict = pd.get_dummies(df_to_predict.prestige, prefix = 'prestige')
df_to_predict = df_to_predict[['intercept', 'gre', 'gpa']].join(prestige_df_to_predict)
df_to_predict['admit_prediction_prob'] = model.predict(df_to_predict[train_cols])
df_to_predict

| | intercept | gre | gpa | prestige_1.0 | prestige_2.0 | prestige_3.0 | prestige_4.0 | admit_prediction_prob |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 800.0 | 4.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.73404 |
| 1 | 1.0 | 800.0 | 4.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.582995 |
| 2 | 1.0 | 800.0 | 4.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.419833 |
| 3 | 1.0 | 800.0 | 4.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.368608 |


Answer: A student from a tier-1 college has a 0.73 probability of being admitted. For the same student the probability drops to 0.58 from tier-2, 0.42 from tier-3, and 0.37 from tier-4.
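
These probabilities can be reproduced by hand from the fitted coefficients using the logistic function. In this sketch the coefficients are recovered as logs of the odds ratios printed in Question 14 (so they carry that output's rounding), and `admit_probability` is a hypothetical helper written for illustration.

```python
import math

# Odds ratios printed in Question 14; their logs are the fitted coefficients
odds_ratios = {
    'intercept': 0.020716, 'gre': 1.002221, 'gpa': 2.180027,
    'prestige_2.0': 0.506548, 'prestige_3.0': 0.262192, 'prestige_4.0': 0.211525,
}
coefs = {name: math.log(value) for name, value in odds_ratios.items()}

def admit_probability(gre, gpa, tier):
    """Logistic-regression probability; tier 1 is the reference (no dummy term)."""
    log_odds = coefs['intercept'] + coefs['gre'] * gre + coefs['gpa'] * gpa
    if tier != 1:
        log_odds += coefs['prestige_%d.0' % tier]
    # Logistic (sigmoid) function maps log-odds to a probability
    return 1.0 / (1.0 + math.exp(-log_odds))

probs = [admit_probability(800.0, 4.0, tier) for tier in (1, 2, 3, 4)]
print(probs)
```

Up to rounding, the values should agree with the `admit_prediction_prob` column returned by `model.predict`.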

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [18]:
sk_model = linear_model.LogisticRegression(C = 10 ** 2).fit(df[train_cols], df['admit'])

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [21]:
list(zip(train_cols, np.exp(sk_model.coef_[0])))

[('intercept', 0.16981759074332772),
 ('gre', 1.0020998959288006),
 ('gpa', 2.0505173223177762),
 ('prestige_2.0', 0.48210089088427549),
 ('prestige_3.0', 0.24652420808276629),
 ('prestige_4.0', 0.20131997081905478)]

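Answer: They are not exactly the same as the `statsmodels` values, but they are close for every feature except `intercept` (0.17 vs. 0.02). Two things differ: `LogisticRegression` applies L2 regularization (weak here, since `C = 100`), and by default it fits its own intercept in addition to our constant `intercept` column, so that column's coefficient alone is not comparable.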
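
One likely reason the `intercept` entry differs most is that `LogisticRegression` fits its own intercept (`fit_intercept=True` by default) on top of any constant column in the features. The sketch below illustrates this on synthetic data (all names and numbers are made up): the effective intercept is the constant column's coefficient plus `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: one feature, true log-odds = 0.5 + 1.5 * x (made up)
rng = np.random.RandomState(0)
n = 200
x = rng.normal(size=n)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(0.5 + 1.5 * x)))).astype(int)

# Design matrix with an explicit constant column, as in the notebook
X = np.column_stack([np.ones(n), x])
toy_model = LogisticRegression(C=10 ** 2).fit(X, y)

const_coef = toy_model.coef_[0][0]       # coefficient on the constant column
own_intercept = toy_model.intercept_[0]  # intercept sklearn fits by default
total_intercept = const_coef + own_intercept

# Rebuild the predicted probabilities from the combined intercept
manual = 1 / (1 + np.exp(-(total_intercept + toy_model.coef_[0][1] * x)))
sk_probs = toy_model.predict_proba(X)[:, 1]
print(np.allclose(manual, sk_probs))
```

Passing `fit_intercept=False` would make the constant column the only intercept term, which is closer to the `statsmodels` setup.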

> ### Question 20.  Again, assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [25]:
sk_model.predict_proba(df_to_predict[train_cols])

array([[ 0.26806485,  0.73193515],
       [ 0.43171408,  0.56828592],
       [ 0.59768586,  0.40231414],
       [ 0.64528942,  0.35471058]])

Answer: The probability of admission goes down with decreasing prestige, as expected (from 0.73 for tier-1 to 0.35 for tier-4), and is very close to what the `statsmodels` model predicted (0.73 to 0.37).