# DS-SF-27 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [44]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [45]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [46]:
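# columns="prestige" passes a literal label, not the column, so this call
# only counts rows per `admit` value. A breakdown by prestige level needs
# columns=df["prestige"].
table = pd.crosstab(index=df["admit"], columns="prestige")
table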

col_0,prestige
admit,Unnamed: 1_level_1
0,271
1,126
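
To break the counts down by prestige level, the column itself (rather than the string `"prestige"`) can be passed as `columns`. A minimal sketch on a small made-up DataFrame; the six rows in `toy` are hypothetical and are not taken from the admissions file:

```python
import pandas as pd

# Hypothetical toy data standing in for the admissions DataFrame.
toy = pd.DataFrame({
    "admit":    [0, 1, 1, 0, 0, 1],
    "prestige": [3, 3, 1, 4, 4, 2],
})

# Rows are admit values, columns are prestige levels, cells are counts.
freq = pd.crosstab(index=toy["admit"], columns=toy["prestige"])
print(freq)
```

Applied to `df`, the same call would give one column of admitted/rejected counts per prestige level.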


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [47]:
# TODO
prestige_df = pd.get_dummies(df.prestige, prefix = 'prestige')
prestige_df

Unnamed: 0,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


In [48]:
prestige_df.rename(columns = {'prestige_1.0' : 'prestige_1', 
                              'prestige_2.0' : 'prestige_2',
                              'prestige_3.0' : 'prestige_3',
                              'prestige_4.0' : 'prestige_4',}, inplace = True)
prestige_df

Unnamed: 0,prestige_1,prestige_2,prestige_3,prestige_4
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

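Answer: We need three. With four prestige levels, the fourth dummy is fully determined by the other three (it is 1 exactly when the others are all 0), so including all four alongside an intercept makes the columns perfectly collinear. The omitted level becomes the reference category.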

> ### Question 4.  Why are we doing this?

Answer: `prestige` is a categorical feature stored as the numbers 1 to 4. One-hot encoding replaces it with one binary column per level, so the model can estimate a separate effect for each level instead of treating the codes as a numeric scale with equal spacing.

The output is a matrix in which each column corresponds to one level of the feature, and each row has a 1 in exactly one of those columns.
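
With four prestige levels, three dummy columns carry the full information, because the omitted level is the row where all three are 0. A small sketch of this, assuming integer prestige codes in a made-up Series (`drop_first=True` omits the first level, here prestige 1):

```python
import pandas as pd

# Hypothetical prestige codes; the real column holds the same four levels.
prestige = pd.Series([3, 3, 1, 4, 2], name="prestige")

# drop_first=True leaves out prestige_1, which becomes the reference level.
dummies = pd.get_dummies(prestige, prefix="prestige", drop_first=True)
print(dummies)

# A row with all three dummies equal to 0 is a prestige 1 applicant.
is_reference = dummies.sum(axis=1) == 0
print(is_reference.tolist())
```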


> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [53]:
# Join the dummy columns onto df (aligned on the index). Run this only once:
# a second join fails because the prestige_* columns already exist in df.
df = df.join(prestige_df)
df

Unnamed: 0,admit,gre,gpa,prestige,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,3.0,0.0,0.0,1.0,0.0
1,1,660.0,3.67,3.0,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,1.0,0.0,0.0,0.0
3,1,640.0,3.19,4.0,0.0,0.0,0.0,1.0
4,0,520.0,2.93,4.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...
395,0,620.0,4.00,2.0,0.0,1.0,0.0,0.0
396,0,560.0,3.04,3.0,0.0,0.0,1.0,0.0
397,0,460.0,2.63,2.0,0.0,1.0,0.0,0.0
398,0,700.0,3.65,2.0,0.0,1.0,0.0,0.0


In [69]:
#now we need to remove the prestige feature 
df.drop('prestige', axis=1, inplace=True)
df

Unnamed: 0,admit,gre,gpa,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0.0,1.0,0.0,0.0
396,0,560.0,3.04,0.0,0.0,1.0,0.0
397,0,460.0,2.63,0.0,1.0,0.0,0.0
398,0,700.0,3.65,0.0,1.0,0.0,0.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [123]:
# TODO
table1 = pd.crosstab(index=df['admit'],columns=[df.prestige_1])
table1

prestige_1,0.0,1.0
admit,Unnamed: 1_level_1,Unnamed: 2_level_1
0,243,28
1,93,33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [None]:
# The most prestigious schools are prestige = 1 (column 1.0 of table1 above).
# Odds of admission = admitted / not admitted
33 / 28  # ≈ 1.18 (equivalently (33/61) / (1 - 33/61))

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [None]:
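# Applicants with prestige != 1 (column 0.0 of table1): admitted / not admitted
93 / 243  # ≈ 0.38 (equivalently (93/336) / (1 - 93/336))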

> ### Question 9.  Finally, what's the odds ratio?

In [None]:
# Odds ratio = odds for prestige = 1 divided by odds for everyone else
(33 / 28) / (93 / 243)  # ≈ 3.08

> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission for an applicant from a #1 ranked undergraduate school are about 3.1 times the odds for an applicant from a lower-ranked school.

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [130]:
# TODO
table4 = pd.crosstab(index=df['admit'],columns=[df.prestige_4])
table4

prestige_4,0.0,1.0
admit,Unnamed: 1_level_1,Unnamed: 2_level_1
0,216,55
1,114,12


In [None]:
# Odds for prestige = 4 (admitted / not admitted): 12 / 55 ≈ 0.218
# Odds for prestige != 4: 114 / 216 ≈ 0.528
(12 / 55) / (114 / 216)  # odds ratio ≈ 0.41

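Answer: The odds of admission for an applicant from a least prestigious (#4 ranked) undergraduate school are about 0.41 times the odds for applicants from higher-ranked schools, i.e. roughly 59% lower.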
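
The hand calculations in this part follow one pattern: odds are admitted divided by not admitted, and the odds ratio compares the group of interest with everyone else. A minimal sketch using the counts from the two tables above; the helper names `odds` and `odds_ratio` are illustrative and not part of any library:

```python
def odds(admitted, rejected):
    """Odds of admission: admitted count divided by rejected count."""
    return admitted / rejected


def odds_ratio(admitted_in, rejected_in, admitted_out, rejected_out):
    """Odds for the group of interest divided by odds for everyone else."""
    return odds(admitted_in, rejected_in) / odds(admitted_out, rejected_out)


# Counts read off the prestige_1 and prestige_4 tables above.
or_prestige_1 = odds_ratio(33, 28, 93, 243)
or_prestige_4 = odds_ratio(12, 55, 114, 216)

print(round(or_prestige_1, 2))  # 3.08
print(round(or_prestige_4, 2))  # 0.41
```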

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [129]:
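# This is an OLS (linear probability) fit with only the prestige_4 dummy. The logistic
# model asked for, with prestige_1 as the reference level, would be
# smf.logit('admit ~ gre + gpa + prestige_2 + prestige_3 + prestige_4', data = df).fit()
smf.ols(formula = 'admit ~ gre + gpa + prestige_4', data = df).fit().summary()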

0,1,2,3
Dep. Variable:,admit,R-squared:,0.059
Model:,OLS,Adj. R-squared:,0.052
Method:,Least Squares,F-statistic:,8.181
Date:,"Tue, 25 Oct 2016",Prob (F-statistic):,2.7e-05
Time:,17:44:14,Log-Likelihood:,-247.69
No. Observations:,397,AIC:,503.4
Df Residuals:,393,BIC:,519.3
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,-0.4413,0.211,-2.092,0.037,-0.856 -0.027
gre,0.0005,0.000,2.443,0.015,0.000 0.001
gpa,0.1404,0.065,2.158,0.032,0.012 0.268
prestige_4,-0.1428,0.061,-2.337,0.020,-0.263 -0.023

0,1,2,3
Omnibus:,410.966,Durbin-Watson:,1.938
Prob(Omnibus):,0.0,Jarque-Bera (JB):,58.503
Skew:,0.696,Prob(JB):,1.98e-13
Kurtosis:,1.735,Cond. No.,5740.0
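
For reference, a logistic fit with all three non-reference prestige dummies can be sketched as follows. This is a hedged example on synthetic data: the `toy` DataFrame, the random seed, and the coefficients used to generate `admit` are all made up, so the printed numbers say nothing about the real admissions data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the admissions data (all values are made up).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "gre": rng.integers(300, 801, size=n).astype(float),
    "gpa": rng.uniform(2.5, 4.0, size=n),
    "prestige": rng.integers(1, 5, size=n),
})

# Arbitrary "true" log-odds, used only to generate a binary outcome.
log_odds = -4.0 + 0.002 * toy["gre"] + 0.8 * toy["gpa"] - 0.5 * (toy["prestige"] - 1)
toy["admit"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# One dummy per prestige level; prestige_1 is left out of the formula as the reference.
toy = toy.join(pd.get_dummies(toy["prestige"], prefix="prestige").astype(float))

result = smf.logit(
    "admit ~ gre + gpa + prestige_2 + prestige_3 + prestige_4", data=toy
).fit(disp=0)  # disp=0 silences the optimizer's progress message
print(result.summary())
```

Leaving `prestige_1` out of the formula is what makes the most prestigious schools the reference point: each remaining prestige coefficient is a difference in log-odds relative to prestige 1.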


> ### Question 13.  Print the model's summary results.

In [None]:
# See the summary printed in the previous cell.

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [None]:
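# Odds ratios come from exponentiating logit coefficients, not from p-values:
# odds ratio = exp(coef), with 95% CI = (exp(lower bound), exp(upper bound)).
# The fit above is OLS, so its coefficients are changes in admission probability,
# not log-odds; odds ratios for gre, gpa, and prestige need a logit fit first.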
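
Because logistic regression coefficients live on the log-odds scale, an odds ratio and its confidence interval are obtained by exponentiating the coefficient and the interval bounds. A minimal sketch with placeholder numbers; `coef`, `ci_low`, and `ci_high` are hypothetical values chosen for illustration, not results from the admissions data:

```python
import math

# Hypothetical logit output: coefficient and 95% CI bounds on the log-odds scale.
# These numbers are illustrative only and do not come from the admissions data.
coef, ci_low, ci_high = 0.80, 0.15, 1.45

# Exponentiating moves from log-odds to odds ratios.
odds_ratio = math.exp(coef)
or_ci = (math.exp(ci_low), math.exp(ci_high))

print(round(odds_ratio, 2))
print(tuple(round(bound, 2) for bound in or_ci))
```

With a fitted `statsmodels` logit result, the same step is usually written as `np.exp(result.params)` and `np.exp(result.conf_int())`.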

> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer:

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer:

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is their probability of admission if they come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer:
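
A predicted probability follows from a logit model by summing the intercept and the coefficient-weighted features, then applying the inverse logit. The sketch below shows the mechanics only: the intercept and coefficients are hypothetical placeholders (not fitted to the admissions data), and `admission_probability` is an illustrative helper name.

```python
import math


def admission_probability(intercept, coefs, features):
    """Inverse logit of the linear predictor intercept + sum(coef * value)."""
    log_odds = intercept + sum(coefs[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical coefficients, for illustration only (not fitted to the admissions data).
intercept = -4.0
coefs = {"gre": 0.002, "gpa": 0.8, "prestige_2": -0.7, "prestige_3": -1.3, "prestige_4": -1.6}

probs = {}
for tier in (1, 2, 3, 4):
    features = {"gre": 800, "gpa": 4.0}
    if tier > 1:  # tier 1 is the reference level, so it has no dummy term
        features[f"prestige_{tier}"] = 1
    probs[tier] = admission_probability(intercept, coefs, features)
    print(tier, round(probs[tier], 3))
```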

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [None]:
# TODO


> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [None]:
# TODO

Answer:

> ### Question 20.  Again assume a student has a GRE of 800 and a GPA of 4.  What is their probability of admission if they come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer:
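
The three steps of this part (fit, odds ratios, predicted probabilities) can be sketched with `sklearn` as below. This is a hedged example on synthetic data: the feature matrix, the seed, and the generating coefficients are made up, and `max_iter=5000` is an added assumption to give the solver room on unscaled GRE values.

```python
import numpy as np
from sklearn import linear_model

# Synthetic stand-in data (made up): columns are gre, gpa, prestige_2, prestige_3, prestige_4.
rng = np.random.default_rng(0)
n = 500
gre = rng.integers(300, 801, size=n).astype(float)
gpa = rng.uniform(2.5, 4.0, size=n)
prestige = rng.integers(1, 5, size=n)
X = np.column_stack([gre, gpa] + [(prestige == k).astype(float) for k in (2, 3, 4)])

# Arbitrary "true" log-odds, used only to generate a binary outcome.
log_odds = -4.0 + 0.002 * gre + 0.8 * gpa - 0.5 * (prestige - 1)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# C is the inverse regularization strength; a large C means weak regularization.
model = linear_model.LogisticRegression(C=10 ** 2, max_iter=5000).fit(X, y)

# Odds ratios: exponentiate the fitted log-odds coefficients.
odds_ratios = np.exp(model.coef_[0])

# GRE 800, GPA 4.0 at each tier; tier 1 is the row with all dummies equal to 0.
applicants = np.array(
    [[800, 4.0] + [float(tier == k) for k in (2, 3, 4)] for tier in (1, 2, 3, 4)]
)
probabilities = model.predict_proba(applicants)[:, 1]

print(odds_ratios)
print(probabilities)
```

Unlike `statsmodels`, `LogisticRegression` applies an L2 penalty by default, so its odds ratios are expected to be close to, but not identical with, an unpenalized fit; the larger `C` is, the smaller that gap.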