# DS-SF-25 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn import linear_model

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df.head()

| | admit | gre | gpa | prestige |
|---|---|---|---|---|
| 0 | 0 | 380.0 | 3.61 | 3.0 |
| 1 | 1 | 660.0 | 3.67 | 3.0 |
| 2 | 1 | 800.0 | 4.0 | 1.0 |
| 3 | 1 | 640.0 | 3.19 | 4.0 |
| 4 | 0 | 520.0 | 2.93 | 4.0 |


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [3]:
pd.crosstab(df.admit, df.prestige)

| admit | prestige 1.0 | prestige 2.0 | prestige 3.0 | prestige 4.0 |
|---|---|---|---|---|
| 0 | 28 | 95 | 93 | 55 |
| 1 | 33 | 53 | 28 | 12 |


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [4]:
prestige = pd.get_dummies(df.prestige, prefix = 'prestige')
prestige

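| | prestige_1.0 | prestige_2.0 | prestige_3.0 | prestige_4.0 |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... |
| 395 | 0.0 | 1.0 | 0.0 | 0.0 |
| 396 | 0.0 | 0.0 | 1.0 | 0.0 |
| 397 | 0.0 | 1.0 | 0.0 | 0.0 |
| 398 | 0.0 | 1.0 | 0.0 | 0.0 |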


> ### Question 3.  How many of these binary variables do we need for modeling?

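Answer: 3.  With four prestige levels, one level acts as the reference category, so three binary variables carry all the information.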

> ### Question 4.  Why are we doing this?

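Answer: To let a linear model use the information in `prestige`.  `prestige` is a categorical rank, so entering it as a single number would force equal steps between tiers; separate binary variables let each tier have its own effect.  Prestige may be significant in explaining admission, but that is not clear from the variable in its current form.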
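
As a minimal sketch of the point made in Questions 3 and 4, `pd.get_dummies` can drop one level directly with `drop_first=True`, leaving three indicators and making tier 1 the implicit reference. The `toy_prestige` Series below is a made-up stand-in for `df.prestige`, not the real data.

```python
import pandas as pd

# Toy stand-in for df.prestige (illustrative values, not the real data)
toy_prestige = pd.Series([3.0, 3.0, 1.0, 4.0, 2.0], name='prestige')

# Full one-hot encoding: one column per tier (four columns)
full = pd.get_dummies(toy_prestige, prefix='prestige', dtype=float)

# drop_first=True omits the first level (tier 1), which becomes the reference
reduced = pd.get_dummies(toy_prestige, prefix='prestige', drop_first=True, dtype=float)

print(list(full.columns))
print(list(reduced.columns))
```

Every row of `full` sums to 1, which is why the fourth column adds no information once the other three are known.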

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [5]:
df = df.join(prestige)

In [6]:
df = df.drop('prestige', axis = 1)

## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [7]:
presXadmit_1 = pd.crosstab(df['prestige_1.0'],df.admit)
presXadmit_1

| prestige_1.0 | admit = 0 | admit = 1 |
|---|---|---|
| 0.0 | 243 | 93 |
| 1.0 | 28 | 33 |


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [8]:
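# Odds of admission for prestige-1 applicants: admitted / not admitted (row 1.0 of the table)
round(float(presXadmit_1.loc[1.0, 1]) / float(presXadmit_1.loc[1.0, 0]), 4)

1.1786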

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [9]:
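# Odds of admission for applicants from all other tiers: admitted / not admitted (row 0.0 of the table)
round(float(presXadmit_1.loc[0.0, 1]) / float(presXadmit_1.loc[0.0, 0]), 4)

0.3827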

> ### Question 9.  Finally, what's the odds ratio?

In [10]:
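# Odds ratio = odds of admission for prestige-1 applicants / odds of admission for all other applicants
odds_ratio_1 = (float(presXadmit_1.loc[1.0, 1]) / presXadmit_1.loc[1.0, 0]) / (float(presXadmit_1.loc[0.0, 1]) / presXadmit_1.loc[0.0, 0])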

> ### Question 10.  Write this finding in a sentence.

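Answer: Odds relate how often an event happens to how often it does not: above 1, the event is more likely to happen than not; below 1, it is less likely.  Here, applicants from tier-1 schools have odds of admission of 33/28 ≈ 1.18, about 3.1 times the 93/243 ≈ 0.38 odds of applicants from all other schools.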

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [11]:
presXadmit_4 = pd.crosstab(df['prestige_4.0'],df.admit)
presXadmit_4

| prestige_4.0 | admit = 0 | admit = 1 |
|---|---|---|
| 0.0 | 216 | 114 |
| 1.0 | 55 | 12 |


In [12]:
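# Odds of admission for prestige-4 applicants: admitted / not admitted (row 1.0 of the table)
round(float(presXadmit_4.loc[1.0, 1]) / float(presXadmit_4.loc[1.0, 0]), 4)

0.2182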

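Answer: The odds of being admitted to UCLA from a least-prestigious (tier-4) school are 12/55 ≈ 0.22, much lower than the 33/28 ≈ 1.18 for tier-1 applicants.  Relative to all other applicants (114/216 ≈ 0.53), the odds ratio is about 0.41.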
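
The hand calculations in this part can be collected into a small helper. This is only a sketch: the function names `odds` and `odds_ratio` are made up for illustration, and the counts are read off the two frequency tables above.

```python
def odds(successes, failures):
    """Odds = number of times an event happens / number of times it does not."""
    return successes / float(failures)


def odds_ratio(group_yes, group_no, rest_yes, rest_no):
    """Odds of the event in one group divided by the odds in everyone else."""
    return odds(group_yes, group_no) / odds(rest_yes, rest_no)


# Counts from the prestige_1.0 table: 33 admitted / 28 not, versus 93 / 243 for the rest
tier1_or = odds_ratio(33, 28, 93, 243)

# Counts from the prestige_4.0 table: 12 admitted / 55 not, versus 114 / 216 for the rest
tier4_or = odds_ratio(12, 55, 114, 216)

print(round(tier1_or, 2))  # 3.08
print(round(tier4_or, 2))  # 0.41
```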

## Part C.  Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [17]:
y = df['admit']
X = df[['gre', 'gpa', 'prestige_1.0']]

# OLS fits a linear probability model (here without an intercept), not a logistic regression
model = smf.OLS(y, X).fit()
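
The cell above calls `OLS` on a single prestige indicator, whereas Question 12 asks for a logistic regression with tier 1 as the reference. One possible shape for such a fit is sketched below with the `statsmodels` formula interface. The data frame `toy` is synthetic (generated from assumed coefficients), so the numbers it yields say nothing about the real admissions data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the admissions data (hypothetical values, not the real file)
rng = np.random.RandomState(0)
n = 400
toy = pd.DataFrame({
    'gre': rng.randint(300, 801, size=n).astype(float),
    'gpa': rng.uniform(2.5, 4.0, size=n),
    'prestige': rng.randint(1, 5, size=n),
})

# Simulate admissions from an assumed logistic relationship
log_odds = -4.0 + 0.003 * toy['gre'] + 0.8 * toy['gpa'] - 0.5 * (toy['prestige'] - 1)
toy['admit'] = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-log_odds))).astype(int)

# C(prestige) dummy-codes the tiers; the first level (tier 1) is the default reference.
# The formula interface adds an intercept automatically.
logit_model = smf.logit('admit ~ gre + gpa + C(prestige)', data=toy).fit(disp=False)

print(logit_model.params)
```

The fitted parameters are on the log-odds scale, with one term each for tiers 2, 3, and 4 measured against tier 1.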


> ### Question 13.  Print the model's summary results.

In [18]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | admit | R-squared: | 0.363 |
| Model: | OLS | Adj. R-squared: | 0.358 |
| Method: | Least Squares | F-statistic: | 74.92 |
| Date: | Fri, 23 Sep 2016 | Prob (F-statistic): | 2.33e-38 |
| Time: | 12:46:03 | Log-Likelihood: | -245.92 |
| No. Observations: | 397 | AIC: | 497.8 |
| Df Residuals: | 394 | BIC: | 509.8 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| gre | 0.0004 | 0.000 | 1.860 | 0.064 | -2.2e-05 0.001 |
| gpa | 0.0170 | 0.036 | 0.466 | 0.642 | -0.055 0.089 |
| prestige_1.0 | 0.2452 | 0.063 | 3.889 | 0.000 | 0.121 0.369 |

| | | | |
|---|---|---|---|
| Omnibus: | 246.643 | Durbin-Watson: | 1.978 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 61.144 |
| Skew: | 0.756 | Prob(JB): | 5.28e-14 |
| Kurtosis: | 1.813 | Cond. No. | 1670.0 |


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?
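
For a logistic model, a coefficient is a change in log-odds, so the odds ratio is its exponential, and an approximate 95% interval comes from exponentiating the coefficient plus or minus 1.96 standard errors. The sketch below shows only that conversion; the helper name `odds_ratio_with_ci` and the coefficient and standard error passed to it are hypothetical, not results from this dataset.

```python
import math


def odds_ratio_with_ci(coef, std_err, z=1.96):
    """Convert a log-odds coefficient and its standard error into an
    odds ratio and an approximate 95% (Wald) confidence interval."""
    lower = math.exp(coef - z * std_err)
    upper = math.exp(coef + z * std_err)
    return math.exp(coef), lower, upper


# Hypothetical coefficient and standard error, for illustration only
ratio, lower, upper = odds_ratio_with_ci(coef=-0.68, std_err=0.32)
print(round(ratio, 3), round(lower, 3), round(upper, 3))
```

An interval that contains 1 would mean the data are consistent with no difference in odds for that feature.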

> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer:

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer:

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO
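
One way to approach this question, once a logistic model with an intercept and tier-2 to tier-4 indicators has been fitted, is to add up the log-odds for the student and pass the total through the logistic function. The sketch below assumes that setup; the function `admission_probability` and every coefficient value are hypothetical placeholders, to be replaced by the fitted parameters.

```python
import math


def admission_probability(intercept, coefs, features):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(coef * value))))."""
    log_odds = intercept + sum(coefs[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical log-odds coefficients with tier 1 as the reference (not fitted values)
intercept = -3.9
coefs = {'gre': 0.002, 'gpa': 0.8,
         'prestige_2.0': -0.7, 'prestige_3.0': -1.3, 'prestige_4.0': -1.6}

probs = {}
for tier in (1, 2, 3, 4):
    features = {'gre': 800, 'gpa': 4.0}
    if tier > 1:
        # Tier 1 is the reference, so it has no indicator of its own
        features['prestige_%d.0' % tier] = 1
    probs[tier] = admission_probability(intercept, coefs, features)
    print(tier, round(probs[tier], 3))
```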

Answer:

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [20]:
X = df[ ['gre', 'gpa', 'prestige_1.0'] ]
y = df['admit']
model = linear_model.LogisticRegression(C = 10 ** 2).fit(X, y)
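
In `sklearn`, `C` is the inverse of the regularization strength, so `C = 10 ** 2` penalizes the coefficients only weakly, and an intercept is fitted by default. The sketch below shows how odds ratios and tier-by-tier probabilities might be read off such a model. It runs on synthetic data generated from assumed coefficients, and it uses indicators for tiers 2 to 4 (tier 1 as reference) rather than the single `prestige_1.0` column used in the cell above.

```python
import numpy as np
from sklearn import linear_model

# Synthetic stand-in for the admissions data (hypothetical values)
rng = np.random.RandomState(0)
n = 400
gre = rng.randint(300, 801, size=n).astype(float)
gpa = rng.uniform(2.5, 4.0, size=n)
tier = rng.randint(1, 5, size=n)

# Columns: gre, gpa, then indicators for tiers 2, 3, 4 (tier 1 is the reference)
X = np.column_stack([gre, gpa] + [(tier == k).astype(float) for k in (2, 3, 4)])
log_odds = -4.0 + 0.003 * gre + 0.8 * gpa - 0.5 * (tier - 1)
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-log_odds))).astype(int)

# max_iter is raised because the features are left unscaled
model = linear_model.LogisticRegression(C=10 ** 2, max_iter=5000).fit(X, y)

# Coefficients are log-odds; exponentiate to get odds ratios
odds_ratios = np.exp(model.coef_[0])
print(odds_ratios)

# A student with GRE 800 and GPA 4 from tier 1, 2, 3, and 4 in turn
candidates = np.array([[800, 4.0, 0, 0, 0],
                       [800, 4.0, 1, 0, 0],
                       [800, 4.0, 0, 1, 0],
                       [800, 4.0, 0, 0, 1]], dtype=float)
probs = model.predict_proba(candidates)[:, 1]  # column 1 = P(admit = 1)
print(probs)
```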

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

Answer:

> ### Question 20.  Again assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer: