# DS-SF-27 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [2]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [3]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

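| | admit | gre | gpa | prestige |
|---|---|---|---|---|
| 0 | 0 | 380.0 | 3.61 | 3.0 |
| 1 | 1 | 660.0 | 3.67 | 3.0 |
| 2 | 1 | 800.0 | 4.00 | 1.0 |
| 3 | 1 | 640.0 | 3.19 | 4.0 |
| 4 | 0 | 520.0 | 2.93 | 4.0 |
| ... | ... | ... | ... | ... |
| 395 | 0 | 620.0 | 4.00 | 2.0 |
| 396 | 0 | 560.0 | 3.04 | 3.0 |
| 397 | 0 | 460.0 | 2.63 | 2.0 |
| 398 | 0 | 700.0 | 3.65 | 2.0 |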


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [4]:
pd.crosstab(df.prestige, df.admit)

| prestige | admit = 0 | admit = 1 |
|---|---|---|
| 1.0 | 28 | 33 |
| 2.0 | 95 | 53 |
| 3.0 | 93 | 28 |
| 4.0 | 55 | 12 |


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [5]:
df.prestige.value_counts(dropna = False).sort_index()

1.0     61
2.0    148
3.0    121
4.0     67
Name: prestige, dtype: int64

In [6]:
prestige_df = pd.get_dummies(df.prestige, prefix = 'prestige')
prestige_df.head()

| | prestige_1.0 | prestige_2.0 | prestige_3.0 | prestige_4.0 |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 |


In [7]:
prestige_df.rename(columns = {'prestige_1.0': 'prestige_1',
                           'prestige_2.0': 'prestige_2',
                           'prestige_3.0': 'prestige_3',
                           'prestige_4.0': 'prestige_4'}, inplace = True)

> ### Question 3.  How many of these binary variables do we need for modeling?

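Answer: 3 of 4. If we know the values of three of the binary variables, the fourth is fully determined, so it carries no additional information. Including all four alongside an intercept would make the columns perfectly collinear, so we drop one and treat it as the reference category.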
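The redundancy can be checked directly on a small example. The sketch below uses a hypothetical six-row sample (not the admissions data) and assumes a recent pandas in which `get_dummies` accepts `dtype` and `drop_first`.

```python
import pandas as pd

# Toy prestige values (hypothetical sample, not the admissions data).
prestige = pd.Series([3.0, 3.0, 1.0, 4.0, 4.0, 2.0], name='prestige')

# Casting to int first yields column names prestige_1 ... prestige_4 directly.
dummies = pd.get_dummies(prestige.astype(int), prefix='prestige', dtype=int)

# Every row has exactly one indicator set, so the four columns sum to 1 ...
row_sums = dummies.sum(axis=1)

# ... which means the fourth column can be rebuilt from the other three.
rebuilt_4 = 1 - dummies[['prestige_1', 'prestige_2', 'prestige_3']].sum(axis=1)

# drop_first=True keeps only the three non-reference columns (tier 1 is dropped).
reduced = pd.get_dummies(prestige.astype(int), prefix='prestige',
                         drop_first=True, dtype=int)

print(list(reduced.columns))
```

Dropping the first level this way makes tier 1 the reference category, which matches the reference point requested later in Question 12.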

> ### Question 4.  Why are we doing this?

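Answer: By converting `prestige` to binary variables, the model estimates a separate effect for each prestige level instead of assuming that every one-step change (1 to 2, 2 to 3, etc.) shifts admission by the same amount. Because we no longer assume that the effect of prestige is proportional, we can build a more accurate model.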

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [8]:
df = df.join([prestige_df])
df.columns

Index([u'admit', u'gre', u'gpa', u'prestige', u'prestige_1', u'prestige_2',
       u'prestige_3', u'prestige_4'],
      dtype='object')

In [9]:
df.drop(['prestige'], axis = 1, inplace = True)

In [10]:
df.head()

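| | admit | gre | gpa | prestige_1 | prestige_2 | prestige_3 | prestige_4 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 380.0 | 3.61 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1 | 660.0 | 3.67 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1 | 800.0 | 4.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1 | 640.0 | 3.19 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0 | 520.0 | 2.93 | 0.0 | 0.0 | 0.0 | 1.0 |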


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [11]:
fr_t = pd.crosstab(df.admit, df.prestige_1==1)[1]
fr_t

admit
0    28
1    33
Name: True, dtype: int64

> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [12]:
p_admitted_pr1 = fr_t[1] * 1.0 / (df.prestige_1==1).sum()
print 'Probability: ', p_admitted_pr1*100

Probability:  54.0983606557


In [13]:
odds_admitted_pr1 = p_admitted_pr1 / (1 -  p_admitted_pr1)
print 'Odds: ', odds_admitted_pr1

Odds:  1.17857142857


> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [14]:
fr_t2 = pd.crosstab(df.admit, df.prestige_1==0)[1]
fr_t2

admit
0    243
1     93
Name: True, dtype: int64

In [15]:
p_admitted_not_pr1 = 1.0 * fr_t2[1] / (df.prestige_1==0).sum()
print 'Probability: ', p_admitted_not_pr1*100

Probability:  27.6785714286


In [16]:
odds_admitted_not_pr1 = p_admitted_not_pr1 / (1 - p_admitted_not_pr1)
print 'Odds: ', odds_admitted_not_pr1

Odds:  0.382716049383


> ### Question 9.  Finally, what's the odds ratio?

In [17]:
ratio = odds_admitted_pr1 * 1.0 / odds_admitted_not_pr1

In [18]:
print ratio

3.07949308756


> ### Question 10.  Write this finding in a sentence.

Answer: About 54% of students who attended the most prestigious undergraduate schools are admitted to graduate school.
Only about 28% of students who attended other schools are admitted.
The odds of admission for a student from the most prestigious undergraduate schools are about 3.1 times the odds for students from other schools.
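As a worked check, the same odds ratio follows directly from the four cell counts in the frequency tables (33 admitted and 28 not admitted for tier 1; 93 admitted and 243 not admitted for all other tiers):

$$
\text{odds}_{\text{tier 1}} = \frac{33}{28} \approx 1.179, \qquad
\text{odds}_{\text{other}} = \frac{93}{243} \approx 0.383, \qquad
\text{OR} = \frac{33/28}{93/243} = \frac{33 \times 243}{28 \times 93} = \frac{8019}{2604} \approx 3.08
$$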

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [19]:
fr_t3 = pd.crosstab(df.admit, df.prestige_4==1)[1]
fr_t3

admit
0    55
1    12
Name: True, dtype: int64

In [20]:
p_admitted_pr4 = fr_t3[1] * 1.0 / (df.prestige_4==1).sum()
print 'Probability: ', p_admitted_pr4*100

Probability:  17.9104477612


In [21]:
odds_admitted_pr4 = p_admitted_pr4 / (1 -  p_admitted_pr4)
print 'Odds: ', odds_admitted_pr4

Odds:  0.218181818182


In [22]:
fr_t4 = pd.crosstab(df.admit, df.prestige_4==0)[1]
fr_t4

admit
0    216
1    114
Name: True, dtype: int64

In [23]:
p_admitted_not_pr4 = 1.0 * fr_t4[1] / (df.prestige_4==0).sum()
print 'Probability: ', p_admitted_not_pr4*100

Probability:  34.5454545455


In [24]:
odds_admitted_not_pr4 = p_admitted_not_pr4 / (1 - p_admitted_not_pr4)
print 'Odds: ', odds_admitted_not_pr4

Odds:  0.527777777778


In [25]:
ratio = odds_admitted_pr4 * 1.0 / odds_admitted_not_pr4
ratio

0.41339712918660282

Answer: The odds of admission for students from the least prestigious schools are about 0.41 times the odds for students from all other schools, i.e. roughly 59% lower.
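The hand calculations in this part can be repeated for every tier at once. The sketch below works only from the counts in the Part A frequency table; the helper names `odds` and `odds_ratio_vs_rest` are hypothetical and introduced here for illustration.

```python
# (not admitted, admitted) counts per prestige tier, copied from the Part A table.
counts = {1: (28, 33), 2: (95, 53), 3: (93, 28), 4: (55, 12)}


def odds(tiers):
    """Odds of admission (admitted / not admitted) pooled over the given tiers."""
    rejected = sum(counts[t][0] for t in tiers)
    admitted = sum(counts[t][1] for t in tiers)
    return admitted / float(rejected)


def odds_ratio_vs_rest(tier):
    """Odds ratio of one tier against all remaining tiers combined."""
    rest = [t for t in counts if t != tier]
    return odds([tier]) / odds(rest)


for tier in sorted(counts):
    print("tier {}: odds = {:.3f}, odds ratio vs. rest = {:.3f}".format(
        tier, odds([tier]), odds_ratio_vs_rest(tier)))
```

For tiers 1 and 4 this should reproduce the ratios computed cell by cell above.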

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [26]:
model = smf.logit(formula = 'admit ~ gre + gpa + prestige_2 + prestige_3 + prestige_4', data = df).fit()


Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [27]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | admit | No. Observations: | 397 |
| Model: | Logit | Df Residuals: | 391 |
| Method: | MLE | Df Model: | 5 |
| Date: | Thu, 03 Nov 2016 | Pseudo R-squ.: | 0.08166 |
| Time: | 13:37:17 | Log-Likelihood: | -227.82 |
| converged: | True | LL-Null: | -248.08 |
| | | LLR p-value: | 1.176e-07 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -3.8769 | 1.142 | -3.393 | 0.001 | -6.116 -1.638 |
| gre | 0.0022 | 0.001 | 2.028 | 0.043 | 7.44e-05 0.004 |
| gpa | 0.7793 | 0.333 | 2.344 | 0.019 | 0.128 1.431 |
| prestige_2 | -0.6801 | 0.317 | -2.146 | 0.032 | -1.301 -0.059 |
| prestige_3 | -1.3387 | 0.345 | -3.882 | 0.000 | -2.015 -0.663 |
| prestige_4 | -1.5534 | 0.417 | -3.721 | 0.000 | -2.372 -0.735 |


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [28]:
print np.exp(model.params)

Intercept     0.020716
gre           1.002221
gpa           2.180027
prestige_2    0.506548
prestige_3    0.262192
prestige_4    0.211525
dtype: float64
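The cell above prints the odds ratios but not their 95% confidence intervals. With the fitted `model` still in memory, `np.exp(model.conf_int())` would give them without rounding. The sketch below instead exponentiates the bounds copied from the summary table, so its results carry that table's rounding.

```python
import math

# (coef, lower 95% bound, upper 95% bound) on the log-odds scale,
# copied from the summary table above, so they carry its rounding.
log_odds = {
    'gre':        (0.0022, 7.44e-05, 0.004),
    'gpa':        (0.7793, 0.128, 1.431),
    'prestige_2': (-0.6801, -1.301, -0.059),
    'prestige_3': (-1.3387, -2.015, -0.663),
    'prestige_4': (-1.5534, -2.372, -0.735),
}

# exp() is monotonic, so the log-odds bounds map straight to odds-ratio bounds.
odds_ratios = {name: tuple(math.exp(v) for v in vals)
               for name, vals in log_odds.items()}

for name in ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4']:
    ratio, lower, upper = odds_ratios[name]
    print("{:<11} OR = {:.3f}  95% CI = [{:.3f}, {:.3f}]".format(
        name, ratio, lower, upper))
```

An interval that lies entirely below 1 (as for the prestige terms) indicates lower odds than the reference category at the 5% level.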


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

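Answer: The odds of admission for students from schools with prestige = 2 are about 0.51 times the odds for students from prestige = 1 schools (the reference category), i.e. about 49% lower, holding GRE and GPA constant.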

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: Each 1-point increase in GPA multiplies the odds of admission by about 2.18, an increase of about 118%, holding GRE and prestige constant.
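An odds ratio converts to a percent change in the odds as (OR - 1) x 100. The sketch below applies this to the odds ratios printed by `np.exp(model.params)`; `pct_change_in_odds` is a hypothetical helper name.

```python
def pct_change_in_odds(odds_ratio):
    """Percent change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0


# Odds ratios as printed by np.exp(model.params) above.
or_prestige_2 = 0.506548
or_gpa = 2.180027
or_gre = 1.002221

print("prestige_2 vs. prestige_1: {:+.1f}%".format(pct_change_in_odds(or_prestige_2)))
print("gpa, per 1 point:          {:+.1f}%".format(pct_change_in_odds(or_gpa)))

# Odds ratios compound multiplicatively, so a 100-point GRE gain is OR ** 100.
or_gre_100 = or_gre ** 100
print("gre, per 100 points:       {:+.1f}%".format(pct_change_in_odds(or_gre_100)))
```

The per-point GRE odds ratio looks negligible only because one GRE point is a tiny step; rescaling to 100 points makes it easier to compare with the other features.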

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [35]:
predict_X1 = [800, 4, 0, 0, 0]
predict_X2 = [800, 4, 1, 0, 0]
predict_X3 = [800, 4, 0, 1, 0]
predict_X4 = [800, 4, 0, 0, 1]




In [146]:
print 'School prestige_1: ', model.predict(predict_X1)
print 'School prestige_2: ', model.predict(predict_X2)
print 'School prestige_3: ', model.predict(predict_X3)
print 'School prestige_4: ', model.predict(predict_X4)

School prestige_1:  [ 0.63739858]
School prestige_2:  [ 0.40320425]
School prestige_3:  [ 0.27420161]
School prestige_4:  [ 0.21318433]


Answer: P(tier-1) = 63.7% ; P(tier-2) = 40.3%; P(tier-3) = 27.4%; P(tier-4) = 21.3%.
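As a cross-check, a predicted probability can be reproduced by hand by plugging the coefficients into the logistic function. This is a sketch assuming the rounded coefficients from the summary table in Question 13; `predict_probability` is a hypothetical helper. If hand-computed values differ noticeably from a cell's printed output, it is worth confirming that the cell was run against the same fit (the out-of-sequence counter `In [146]` suggests checking).

```python
import math

# Coefficients copied from the summary table (rounded to four decimals there).
coef = {
    'Intercept': -3.8769,
    'gre': 0.0022,
    'gpa': 0.7793,
    'prestige_2': -0.6801,
    'prestige_3': -1.3387,
    'prestige_4': -1.5534,
}


def predict_probability(gre, gpa, tier):
    """Logistic-model probability of admission; tier 1 is the reference level."""
    log_odds = coef['Intercept'] + coef['gre'] * gre + coef['gpa'] * gpa
    if tier in (2, 3, 4):
        log_odds += coef['prestige_{}'.format(tier)]
    return 1.0 / (1.0 + math.exp(-log_odds))


probabilities = [predict_probability(800, 4, tier) for tier in (1, 2, 3, 4)]
for tier, p in zip((1, 2, 3, 4), probabilities):
    print("tier {}: {:.3f}".format(tier, p))
```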

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [29]:
X = df[ ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4'] ]
y = df.admit

model_new = linear_model.LogisticRegression(C = 10 ** 2).fit(X, y)

In [30]:
model_new.score(X,y)

0.70528967254408059

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [32]:
print np.exp(model_new.intercept_)
print np.exp(model_new.coef_)

[ 0.02975414]
[[ 1.00216055  1.96041259  0.53321936  0.28586733  0.20829663]]


In [160]:
zip(X, np.exp(model_new.coef_[0]))

[('gre', 1.002160546050425),
 ('gpa', 1.9604125876545186),
 ('prestige_2', 0.53321935757232597),
 ('prestige_3', 0.28586733162404865),
 ('prestige_4', 0.20829662748852928)]

In [33]:
print 'statsmodel\n', np.exp(model.params)

statsmodel
Intercept     0.020716
gre           1.002221
gpa           2.180027
prestige_2    0.506548
prestige_3    0.262192
prestige_4    0.211525
dtype: float64


Answer: The odds ratio for GPA is noticeably lower in sklearn (1.96 vs. 2.18 in statsmodels). The other odds ratios are close in the two models, although prestige_2 and prestige_3 are slightly higher in sklearn.
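The gaps are easier to judge side by side. The sketch below uses only the odds ratios printed above (rounded to six decimals); `relative_difference` is a hypothetical helper. One likely contributor to the gap is that `LogisticRegression` applies an L2 penalty by default, with `C` as the inverse regularization strength, so `C = 10 ** 2` shrinks the coefficients only weakly but not to zero effect.

```python
# Odds ratios as printed above for each library.
statsmodels_or = {'gre': 1.002221, 'gpa': 2.180027, 'prestige_2': 0.506548,
                  'prestige_3': 0.262192, 'prestige_4': 0.211525}
sklearn_or = {'gre': 1.002161, 'gpa': 1.960413, 'prestige_2': 0.533219,
              'prestige_3': 0.285867, 'prestige_4': 0.208297}


def relative_difference(name):
    """Percent difference of the sklearn odds ratio from the statsmodels one."""
    return (sklearn_or[name] / statsmodels_or[name] - 1.0) * 100.0


for name in ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4']:
    print("{:<11} statsmodels = {:.4f}  sklearn = {:.4f}  diff = {:+.1f}%".format(
        name, statsmodels_or[name], sklearn_or[name], relative_difference(name)))
```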

> ### Question 20.  Again assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [36]:
print 'School prestige_1: ', model_new.predict_proba([predict_X1])
print 'School prestige_2: ', model_new.predict_proba([predict_X2])
print 'School prestige_3: ', model_new.predict_proba([predict_X3])
print 'School prestige_4: ', model_new.predict_proba([predict_X4])

School prestige_1:  [[ 0.28814605  0.71185395]]
School prestige_2:  [[ 0.43153702  0.56846298]]
School prestige_3:  [[ 0.58608936  0.41391064]]
School prestige_4:  [[ 0.66024514  0.33975486]]




Answer: P(tier-1) = 71.2%; P(tier-2) = 56.8%; P(tier-3) = 41.4%; P(tier-4) = 34.0%.
The predicted probability of admission drops with each step down in prestige, so it looks like the hypothesis was correct.