# DS-SF-30 | Unit Project 3: Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

     admit    gre   gpa  prestige
0        0  380.0  3.61       3.0
1        1  660.0  3.67       3.0
2        1  800.0  4.00       1.0
3        1  640.0  3.19       4.0
4        0  520.0  2.93       4.0
..     ...    ...   ...       ...
395      0  620.0  4.00       2.0
396      0  560.0  3.04       3.0
397      0  460.0  2.63       2.0
398      0  700.0  3.65       2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [3]:
pd.crosstab(df.prestige,df.admit)

admit      0   1
prestige        
1.0       28  33
2.0       95  53
3.0       93  28
4.0       55  12


In [4]:
pd.crosstab(df.prestige,df.admit,normalize='index')

admit            0         1
prestige                    
1.0       0.459016  0.540984
2.0       0.641892  0.358108
3.0       0.768595  0.231405
4.0       0.820896  0.179104


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [5]:
dummy_ranks = pd.get_dummies(df['prestige'].astype(int), prefix='prestige')

In [16]:
dummy_ranks.head(5)

   prestige_1  prestige_2  prestige_3  prestige_4
0           0           0           1           0
1           0           0           1           0
2           1           0           0           0
3           0           0           0           1
4           0           0           0           1


In [6]:
df=df.join(dummy_ranks)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 397 entries, 0 to 399
Data columns (total 8 columns):
admit         397 non-null int64
gre           397 non-null float64
gpa           397 non-null float64
prestige      397 non-null float64
prestige_1    397 non-null uint8
prestige_2    397 non-null uint8
prestige_3    397 non-null uint8
prestige_4    397 non-null uint8
dtypes: float64(3), int64(1), uint8(4)
memory usage: 17.1 KB


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: 3

> ### Question 4.  Why are we doing this?

Answer: `prestige` is a categorical feature, so we need to convert it to binary numeric features before a model can be fit with the scikit-learn package in Python. We include only 3 of the 4 dummies in the model to avoid perfectly collinear features: once 3 dummies and an intercept are included, the fourth dummy is an exact linear combination of them (1 minus their sum).
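
The collinearity argument can be checked on a toy example. The sketch below uses a small hand-made list of prestige values (hypothetical, not the admissions data) and plain Python lists in place of a DataFrame. It shows that the four dummy columns always sum to 1, so any one of them can be rebuilt from the other three.

```python
# Hypothetical prestige values for six applicants (illustration only)
prestige = [3, 3, 1, 4, 4, 2]
levels = [1, 2, 3, 4]

# One-hot encode: one row per applicant, one 0/1 column per prestige level
dummies = [[int(p == level) for level in levels] for p in prestige]

# Each row contains exactly one 1, so the four columns sum to 1 in every row,
# which duplicates the information carried by an intercept column
row_sums = [sum(row) for row in dummies]

# prestige_1 is therefore fully determined by the other three columns
rebuilt_prestige_1 = [1 - sum(row[1:]) for row in dummies]

print(row_sums)
print(rebuilt_prestige_1 == [row[0] for row in dummies])
```

With pandas, `pd.get_dummies(..., drop_first=True)` drops the first level directly, which leaves that level as the reference category.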

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [8]:
df.drop('prestige',inplace=True,axis=1)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 397 entries, 0 to 399
Data columns (total 7 columns):
admit         397 non-null int64
gre           397 non-null float64
gpa           397 non-null float64
prestige_1    397 non-null uint8
prestige_2    397 non-null uint8
prestige_3    397 non-null uint8
prestige_4    397 non-null uint8
dtypes: float64(2), int64(1), uint8(4)
memory usage: 14.0 KB


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [29]:
pd.crosstab(index=df['prestige_1'],columns=df['admit'])

admit         0   1
prestige_1         
0           243  93
1            28  33


In [32]:
df.admit[df.prestige_1==1].value_counts()

1    33
0    28
Name: admit, dtype: int64

> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [37]:
df1=pd.crosstab(index=df['prestige_1'],columns=df['admit'],normalize='index')
print(df1)

admit              0         1
prestige_1                    
0           0.723214  0.276786
1           0.459016  0.540984


In [44]:
p=df1.loc[1,1]
odds1=p/(1-p)
print(odds1)

1.17857142857


> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [40]:
df2=pd.crosstab(index=df[df.prestige_1==0].prestige_1,columns=df[df.prestige_1==0].admit,normalize='index')

In [42]:
print(df2)

admit              0         1
prestige_1                    
0           0.723214  0.276786


In [45]:
odds0=df2.loc[0,1]/df2.loc[0,0]

> ### Question 9.  Finally, what's the odds ratio?

In [46]:
odds1/odds0

3.0794930875576041

> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission to grad school for applicants from the most prestigious undergraduate schools are about 3.1 times the odds for applicants from all other schools.

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [None]:
# Tier-4 applicants (prestige_4 == 1) versus everyone else, by admission outcome
df4=pd.crosstab(index=df['prestige_4'],columns=df['admit'])

Answer:
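
One way to work this out by hand is to start from the counts in the Question 1 frequency table. The sketch below assumes the comparison group is "all applicants not from a tier-4 school", mirroring Questions 7 to 9. The question could also be read as a comparison against tier 1 only, so treat that choice as an assumption.

```python
# Counts copied from the prestige-by-admit frequency table in Question 1,
# stored as (not admitted, admitted) per prestige tier
counts = {1: (28, 33), 2: (95, 53), 3: (93, 28), 4: (55, 12)}

# Odds of admission for tier-4 applicants: admitted / not admitted
rejected_4, admitted_4 = counts[4]
odds_tier4 = admitted_4 / rejected_4

# Odds of admission for everyone else (tiers 1 to 3 pooled)
rejected_other = sum(counts[t][0] for t in (1, 2, 3))
admitted_other = sum(counts[t][1] for t in (1, 2, 3))
odds_other = admitted_other / rejected_other

odds_ratio = odds_tier4 / odds_other
print(round(odds_tier4, 3), round(odds_other, 3), round(odds_ratio, 3))
```

Under that assumption the odds ratio comes out near 0.41, meaning the odds of admission for tier-4 applicants are roughly 0.4 times the odds for all other applicants.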

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [91]:
df = sm.add_constant(df)
# y = df['admit'].values
# logit_model = logit(y, X).fit()
print(df.head(5))

   const  admit    gre   gpa  prestige_1  prestige_2  prestige_3  prestige_4
0      1      0  380.0  3.61           0           0           1           0
1      1      1  660.0  3.67           0           0           1           0
2      1      1  800.0  4.00           1           0           0           0
3      1      1  640.0  3.19           0           0           0           1
4      1      0  520.0  2.93           0           0           0           1


In [98]:
X=df[['const','gre','gpa','prestige_2','prestige_3','prestige_4']]

In [99]:
X.head(5)

   const    gre   gpa  prestige_2  prestige_3  prestige_4
0      1  380.0  3.61           0           1           0
1      1  660.0  3.67           0           1           0
2      1  800.0  4.00           0           0           0
3      1  640.0  3.19           0           0           1
4      1  520.0  2.93           0           0           1


In [93]:
y=df['admit']

In [100]:
logit_model = sm.Logit(y, X)

In [104]:
sm.Logit.fit?

In [101]:
result = logit_model.fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [102]:
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      391
Method:                           MLE   Df Model:                            5
Date:                Mon, 17 Apr 2017   Pseudo R-squ.:                 0.08166
Time:                        18:24:13   Log-Likelihood:                -227.82
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.176e-07
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -3.8769      1.142     -3.393      0.001        -6.116    -1.638
gre            0.0022      0.001      2.028      0.043      7.44e-05     0.004
gpa            0.7793      0.333      2.344      0.0

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [132]:
sm_res=pd.DataFrame(data=result.params,columns=['coef'])

In [130]:
sm_ci=result.conf_int()
sm_ci.columns=['95% Conf. Int. Lower Bound','95% Conf. Int. Upper Bound']

In [131]:
print(sm_ci)

            95% Conf. Int. Lower Bound  95% Conf. Int. Upper Bound
const                        -6.116077                   -1.637631
gre                           0.000074                    0.004362
gpa                           0.127619                    1.431056
prestige_2                   -1.301337                   -0.058936
prestige_3                   -2.014579                   -0.662776
prestige_4                   -2.371624                   -0.735197


In [133]:
sm_res=sm_res.join(sm_ci)

In [134]:
print(sm_res)

                coef  95% Conf. Int. Lower Bound  95% Conf. Int. Upper Bound
const      -3.876854                   -6.116077                   -1.637631
gre         0.002218                    0.000074                    0.004362
gpa         0.779337                    0.127619                    1.431056
prestige_2 -0.680137                   -1.301337                   -0.058936
prestige_3 -1.338677                   -2.014579                   -0.662776
prestige_4 -1.553411                   -2.371624                   -0.735197


In [137]:
sm_res['odds_ratio']=sm_res.coef.apply(np.exp)
sm_res['95% CI LB Odds Ratio']=sm_res['95% Conf. Int. Lower Bound'].apply(np.exp)
sm_res['95% CI UB Odds Ratio']=sm_res['95% Conf. Int. Upper Bound'].apply(np.exp)

In [138]:
print(sm_res)

                coef  95% Conf. Int. Lower Bound  95% Conf. Int. Upper Bound  \
const      -3.876854                   -6.116077                   -1.637631   
gre         0.002218                    0.000074                    0.004362   
gpa         0.779337                    0.127619                    1.431056   
prestige_2 -0.680137                   -1.301337                   -0.058936   
prestige_3 -1.338677                   -2.014579                   -0.662776   
prestige_4 -1.553411                   -2.371624                   -0.735197   

            odds_ratio  95% CI LB Odds Ratio  95% CI UB Odds Ratio  
const         0.020716              0.002207              0.194440  
gre           1.002221              1.000074              1.004372  
gpa           2.180027              1.136120              4.183113  
prestige_2    0.506548              0.272168              0.942767  
prestige_3    0.262192              0.133377              0.515419  
prestige_4    0.211525   

> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: Holding everything else equal, the odds of admission for a student from a tier-2 school are about 0.51 times the odds for a student from a tier-1 school. Equivalently, the tier-1 student's odds are roughly twice as high (1/0.51 ≈ 1.97).

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: Holding everything else equal, a 1-point increase in GPA multiplies the odds of admission by about 2.18.
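
An odds ratio multiplies the odds, not the probability, so the effect on the probability of admission depends on where the applicant starts. The sketch below applies the `gpa` odds ratio of 2.18 to a few baseline probabilities. The baselines are hypothetical values chosen for illustration, and the helper names are not part of the notebook.

```python
def prob_to_odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)


def odds_to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)


gpa_odds_ratio = 2.18  # odds ratio for gpa reported in the statsmodels results

# Hypothetical baseline probabilities of admission (illustration only)
for baseline in (0.10, 0.30, 0.60):
    new_p = odds_to_prob(prob_to_odds(baseline) * gpa_odds_ratio)
    print(f"{baseline:.2f} -> {new_p:.3f} (probability ratio {new_p / baseline:.2f})")
```

The probability ratio shrinks as the baseline rises, which is why "2.18 times the odds" should not be read as "2.18 times as likely".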

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [172]:
X_test=pd.DataFrame.from_dict({'gre':[800.0,800.0,800.0,800.0],'gpa':[4.00,4.00,4.00,4.00], \
                     'prestige_2':[0,1,0,0],'prestige_3':[0,0,1,0],'prestige_4':[0,0,0,1]})

In [173]:
X_test = sm.add_constant(X_test,has_constant='add')

In [174]:
print(X_test)

   const  gpa    gre  prestige_2  prestige_3  prestige_4
0      1  4.0  800.0           0           0           0
1      1  4.0  800.0           1           0           0
2      1  4.0  800.0           0           1           0
3      1  4.0  800.0           0           0           1


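In [170]:
# The model was fit on columns ordered const, gre, gpa, ...; predict() matches them by position
y_pred = result.predict(X_test[['const','gre','gpa','prestige_2','prestige_3','prestige_4']])

In [171]:
# Caution: X_test lists gpa before gre, so this call swaps the two and every prediction saturates at 1
print(result.predict(X_test))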

[ 1.  1.  1.  1.]


Answer:
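
The all-ones output above is what you would expect if `gre` and `gpa` are swapped: a GPA coefficient of about 0.78 multiplied by 800 pushes the linear predictor far into the saturated range. As a cross-check, the sketch below recomputes the probabilities by hand from the coefficients printed in the results table (rounded to six decimals, so the values are approximate). The `admit_probability` helper is hypothetical and looks coefficients up by name, so column order cannot be mixed up.

```python
import math

# Coefficients as printed in the statsmodels results above (rounded)
coef = {
    "const": -3.876854,
    "gre": 0.002218,
    "gpa": 0.779337,
    "prestige_2": -0.680137,
    "prestige_3": -1.338677,
    "prestige_4": -1.553411,
}


def admit_probability(gre, gpa, tier):
    """Logistic prediction; tier 1 is the reference level and adds nothing."""
    z = coef["const"] + coef["gre"] * gre + coef["gpa"] * gpa
    if tier != 1:
        z += coef[f"prestige_{tier}"]
    return 1 / (1 + math.exp(-z))


probs = {tier: admit_probability(800, 4.0, tier) for tier in (1, 2, 3, 4)}
for tier, p in probs.items():
    print(tier, round(p, 3))

# Swapping gre and gpa reproduces the saturated prediction
z_swapped = coef["const"] + coef["gre"] * 4.0 + coef["gpa"] * 800
p_swapped = 1 / (1 + math.exp(-z_swapped))
print(p_swapped)
```

With these rounded coefficients the probabilities come out at roughly 0.73, 0.58, 0.42, and 0.37 for tiers 1 through 4.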

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [10]:
from sklearn.linear_model import LogisticRegression
lm=LogisticRegression(C=10**2)
X=df[['gre','gpa','prestige_2','prestige_3','prestige_4']]
y=df['admit']
lm.fit(X,y)
lm.coef_

array([[ 0.00215822,  0.67315495, -0.62882239, -1.25222745, -1.56879212]])

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [183]:
coef=lm.coef_
dat=pd.DataFrame(data=coef,columns=X.columns)
dat=dat.T
print(dat)

                   0
gre         0.002158
gpa         0.673155
prestige_2 -0.628822
prestige_3 -1.252227
prestige_4 -1.568792


In [184]:
dat.columns=['coef']

In [185]:
dat['odds_ratio']=dat.coef.apply(np.exp)

In [186]:
dat

                coef  odds_ratio
gre         0.002158    1.002161
gpa         0.673155    1.960413
prestige_2 -0.628822    0.533219
prestige_3 -1.252227    0.285867
prestige_4 -1.568792    0.208297


Answer:
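
To make the comparison concrete, the sketch below lines up the odds ratios printed by each library and computes the relative difference. The numbers are copied from the outputs above. The explanation that follows is a plausible reading rather than something the notebook verifies.

```python
# Odds ratios as printed above for each library
statsmodels_or = {"gre": 1.002221, "gpa": 2.180027, "prestige_2": 0.506548,
                  "prestige_3": 0.262192, "prestige_4": 0.211525}
sklearn_or = {"gre": 1.002161, "gpa": 1.960413, "prestige_2": 0.533219,
              "prestige_3": 0.285867, "prestige_4": 0.208297}

# Relative difference of the sklearn odds ratio from the statsmodels one
rel_diff = {name: (sklearn_or[name] - statsmodels_or[name]) / statsmodels_or[name]
            for name in statsmodels_or}

print(f"{'feature':<11}{'sm':>9}{'sklearn':>9}{'diff':>8}")
for name, diff in rel_diff.items():
    print(f"{name:<11}{statsmodels_or[name]:>9.4f}{sklearn_or[name]:>9.4f}{diff:>+8.1%}")
```

The two sets agree in direction and differ by at most about 10% (for `gpa`). A likely reason is that `LogisticRegression` applies an L2 penalty by default, with `C` as the inverse regularization strength, whereas `sm.Logit` fits an unpenalized maximum-likelihood model. `C = 10 ** 2` makes the penalty weak but not zero.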

> ### Question 20.  Again, assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [11]:
sample=pd.DataFrame.from_dict({'gre':[800,800,800,800],'gpa':[4,4,4,4],'prestige_2':[0,1,0,0], \
                               'prestige_3':[0,0,1,0],'prestige_4':[0,0,0,1]})
print(sample)
print(lm.predict_proba(sample[['gre','gpa','prestige_2','prestige_3','prestige_4']]))
print(lm.predict(sample[['gre','gpa','prestige_2','prestige_3','prestige_4']]))
print(lm.predict(np.array([[2.7, 300.0, 0.0, 0.0, 0.0]])))  # caution: values are in (gpa, gre) order, but the model expects (gre, gpa)

   gpa  gre  prestige_2  prestige_3  prestige_4
0    4  800           0           0           0
1    4  800           1           0           0
2    4  800           0           1           0
3    4  800           0           0           1
[[ 0.28814605  0.71185395]
 [ 0.43153702  0.56846298]
 [ 0.58608936  0.41391064]
 [ 0.66024514  0.33975486]]
[1 1 0 0]
[1]


In [196]:
sample.dtypes

gpa           float64
gre             int64
prestige_2      int64
prestige_3      int64
prestige_4      int64
dtype: object

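In [12]:
# Caution: this selects gpa before gre, but the model was fit on gre before gpa
X_test=sample[['gpa','gre','prestige_2','prestige_3','prestige_4']]

In [13]:
X_test.values  # as_matrix() was removed in pandas 1.0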

array([[  4, 800,   0,   0,   0],
       [  4, 800,   1,   0,   0],
       [  4, 800,   0,   1,   0],
       [  4, 800,   0,   0,   1]])

In [79]:
lm.predict_proba(np.array([[4, 800, 0, 0, 0]]))  # 2-D input (one row); values are still in (gpa, gre) order

array([[ 0.,  1.]])

In [14]:
print(X_test)

   gpa  gre  prestige_2  prestige_3  prestige_4
0    4  800           0           0           0
1    4  800           1           0           0
2    4  800           0           1           0
3    4  800           0           0           1


In [15]:
print(lm.predict_proba(sample.values))  # .values replaces the removed as_matrix(); columns remain in (gpa, gre, ...) order

[[ 0.  1.]
 [ 0.  1.]
 [ 0.  1.]
 [ 0.  1.]]


In [16]:
print(lm.predict_proba(sample[['gpa','gre','prestige_2','prestige_3','prestige_4']]))

[[ 0.  1.]
 [ 0.  1.]
 [ 0.  1.]
 [ 0.  1.]]


In [200]:
print(y_predict_proba)

[[ 0.  1.]
 [ 0.  1.]
 [ 0.  1.]
 [ 0.  1.]]


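Answer: With the features passed in the order used for training (`gre`, `gpa`, `prestige_2`, `prestige_3`, `prestige_4`), the predicted probabilities of admission are about 0.71 (tier 1), 0.57 (tier 2), 0.41 (tier 3), and 0.34 (tier 4). The cells that return a probability of 1 for every row pass `gpa` before `gre`, so they do not reflect the model.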