# DS-NYC-45 | Unit Project 4: Notebook with Executive Summary

In this project, you will summarize and present your analysis from Unit Projects 1-3.

> ## Question 1.  Introduction
> Write a problem statement for this project.

Answer: Predict which applicants will be admitted to graduate school at UCLA from their application data (GRE score, GPA, and prestige of undergraduate school), using the dataset from UCLA's Logit Regression in R tutorial.

> ## Question 2.  Dataset
> Write up a description of your data and any cleaning that was completed.

Answer:

- Data Dictionary

Variable | Description and Range | Type of Variable
---|---|---
admit | 0 = Not admitted, 1 = Admitted | Categorical
gre | GRE score: [220, 800] | Continuous
gpa | GPA score: [2.26, 4.00] | Continuous
prestige | Prestige of applicant's alma mater: 1 = highest tier, 4 = lowest tier | Categorical

- Data Cleaning: Dropped records with missing data. Out of 400 records, 3 records were dropped.

In [44]:
import numpy as np
import pandas as pd

df = pd.read_csv('dataset/ucla-admissions.csv')
df.dropna(inplace=True)
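
The count of dropped records can be checked before calling `dropna`. Below is a minimal sketch on a hypothetical toy frame (the values are made up for illustration and are not rows from `ucla-admissions.csv`); it counts missing values per column and the rows that `dropna` would remove.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the admissions data.
toy = pd.DataFrame({
    'admit': [0, 1, 1, 0],
    'gre': [380.0, np.nan, 800.0, 640.0],
    'gpa': [3.61, 3.67, 4.00, np.nan],
    'prestige': [3.0, 3.0, 1.0, 4.0],
})

# Missing values per column, and rows with at least one missing value.
missing_per_column = toy.isnull().sum()
rows_with_missing = int(toy.isnull().any(axis=1).sum())

# dropna() removes exactly the rows counted above.
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(rows_with_missing, rows_dropped)
```

Run on the real data, the same check should report the 3 dropped records mentioned above.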

> ## Question 3.  Demo
> Provide a table that explains the data by admission status.

Answer: 

- Average of GRE and GPA scores by admission status:

In [45]:
df.groupby('admit').mean()

admit | gre | gpa | prestige
---|---|---|---
0 | 573.579336 | 3.347159 | 2.645756
1 | 618.571429 | 3.489206 | 2.150794


In [46]:
# Number of applicants by prestige and admission status:
pd.crosstab(df['admit'],df['prestige'])

admit \ prestige | 1.0 | 2.0 | 3.0 | 4.0
---|---|---|---|---
0 | 28 | 95 | 93 | 55
1 | 33 | 53 | 28 | 12
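
The raw counts are easier to compare as admission rates per tier. The sketch below copies the counts from the crosstab above into a plain dictionary (an illustrative structure, not part of the original notebook) and divides admitted applicants by the tier total.

```python
# Counts copied from the crosstab above: tier -> (not admitted, admitted).
counts = {1: (28, 33), 2: (95, 53), 3: (93, 28), 4: (55, 12)}

admit_rate = {}
for tier, (rejected, admitted) in sorted(counts.items()):
    # Share of applicants from this tier who were admitted.
    admit_rate[tier] = admitted / float(rejected + admitted)
    print(tier, round(admit_rate[tier], 3))
```

The admission rate falls steadily with tier, from about 54% for tier 1 to about 18% for tier 4.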


> ## Question 4. Methods
> Write up the methods used in your analysis.

Answer: 

- Logistic regression was fit to the admissions data with admission to graduate school (`admit`) as the outcome variable.
- Input variables were GRE score, GPA, and one-hot encoded binary variables for `prestige`.
- For the one-hot encoding, three dummy variables were created for `prestige` = 2, 3 and 4, so that the highest-prestige tier (`prestige` = 1) is the reference category.
- The model is `sklearn`'s `LogisticRegression` with `C=100`, trained on all 397 records that remained after dropping rows with missing data.

In [47]:
prestige_dummy = pd.get_dummies(df['prestige'])
df[['prestige_2','prestige_3','prestige_4']]=prestige_dummy[[2, 3, 4]]
df.drop('prestige', axis=1, inplace=True)

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C = 10 ** 2)
X = df.drop('admit', axis=1)
y = df['admit']
logreg.fit(X, y)

LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
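
As an aside, the same reference-category encoding can be produced in one call. This is a minimal sketch on a hypothetical toy column, assuming `prestige` has been cast to integers; `drop_first=True` drops the first level (tier 1) so it becomes the reference, and `dtype=int` is passed so the dummies are 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy column; integer tiers are assumed here.
toy = pd.DataFrame({'prestige': [1, 2, 3, 4, 2]})

# drop_first=True omits the first level (tier 1), making it the reference.
dummies = pd.get_dummies(toy['prestige'], prefix='prestige',
                         drop_first=True, dtype=int)
print(dummies)
```

A tier 1 applicant is then represented by zeros in all three dummy columns, which is why the model's intercept absorbs the tier 1 effect.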

> ## Question 5. Results
> Write up your results.

Answer:


- Applying the exponential function, `np.exp()`, to the logistic regression coefficients gives the estimated odds ratio for each feature.
- The probability of admission, given a student's GRE score, GPA, and undergraduate school prestige, can be calculated from the fitted coefficients and intercept (or with the model's `predict_proba(X)` method).

> ## Question 6. Visuals
> Provide a table or visualization of these results.

Answer:

- Odds Ratios

In [59]:
# Label the odds ratios with the feature columns the model was fit on.
odds_ratios = pd.DataFrame(data=np.exp(logreg.coef_), index=['odds_ratio'], columns=X.columns)
odds_ratios

| | gre | gpa | prestige_2 | prestige_3 | prestige_4 |
|---|---|---|---|---|---|
| odds_ratio | 1.002161 | 1.960413 | 0.533219 | 0.285867 | 0.208297 |
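
The GRE odds ratio looks negligible only because it is expressed per single GRE point. Odds ratios compound multiplicatively, so a k-unit change scales the odds by the ratio raised to the power k. The sketch below applies this to the values in the table above; the 100-point and 0.5-point increments are illustrative choices, not figures from the original analysis.

```python
# Odds ratios copied from the table above.
odds_ratio = {'gre': 1.002161, 'gpa': 1.960413}

# A k-unit increase multiplies the odds of admission by ratio ** k.
gre_100 = odds_ratio['gre'] ** 100   # 100 more GRE points
gpa_half = odds_ratio['gpa'] ** 0.5  # half a GPA point

print(round(gre_100, 2), round(gpa_half, 2))
```

Under this model, 100 additional GRE points raise the odds of admission by roughly 24%, and half a GPA point raises them by roughly 40%.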


- Predicted probabilities of admission for a student with a GRE of 800 and a GPA of 4 (by prestige):

In [67]:
prob = logreg.predict_proba([[800, 4, 0, 0, 0], [800, 4, 1, 0, 0], [800, 4, 0, 1, 0], [800, 4, 0, 0, 1]])
probability = pd.DataFrame(data=prob[:,1], index=[1, 2, 3, 4], columns=['probability'])
probability.index.name = 'tier'
probability

tier | probability
---|---
1 | 0.711854
2 | 0.568463
3 | 0.413911
4 | 0.339755
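
These probabilities are consistent with the odds ratios reported earlier. As a sanity check, the sketch below converts the tier 1 probability to odds, multiplies by each prestige odds ratio, and converts back to a probability. It uses only numbers printed in this notebook and does not need the fitted model.

```python
# Values copied from the tables above.
p_tier1 = 0.711854
odds_ratio = {2: 0.533219, 3: 0.285867, 4: 0.208297}

# Probability -> odds for the reference tier.
odds_tier1 = p_tier1 / (1 - p_tier1)

reconstructed = {}
for tier, ratio in sorted(odds_ratio.items()):
    odds = odds_tier1 * ratio                 # scale the reference odds
    reconstructed[tier] = odds / (1 + odds)   # odds -> probability
    print(tier, round(reconstructed[tier], 4))
```

The reconstructed values agree with the `predict_proba` output for tiers 2 to 4 to about four decimal places.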


> ## Question 7.  Discussion
> Write up your discussion and future steps.

Answer:

In [70]:
p = df['admit'].mean()
print(p, 1 - p)

0.317380352645 0.682619647355


In [71]:
logreg.score(X,y)

0.70528967254408059

- 32% of the 397 applicants (the full dataset after dropping incomplete records) were admitted and 68% were not.
- The logistic regression model correctly predicted the admission status of only 71% of applicants (71% accuracy on the training set).
- This is poor performance: the model only marginally outperforms a trivial baseline that predicts no one is admitted, which is right 68% of the time. Furthermore, the model was evaluated without cross validation or a separate test set, so even the 71% figure is likely optimistic.
- Next steps are to try other models, such as random forests, and to evaluate them with cross validation.
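
The majority-class baseline can be verified directly from the crosstab in the Demo section. The sketch below sums the counts per admission status and compares the baseline with the training accuracy reported above; the variable names are illustrative.

```python
# Row totals of the crosstab in the Demo section.
not_admitted = 28 + 95 + 93 + 55
admitted = 33 + 53 + 28 + 12
total = not_admitted + admitted

# Accuracy of always predicting the more common class.
baseline_accuracy = max(not_admitted, admitted) / float(total)

# Training accuracy of the logistic regression, as reported above.
logreg_accuracy = 0.70528967254408059
margin = logreg_accuracy - baseline_accuracy

print(total, round(baseline_accuracy, 4), round(margin, 4))
```

The logistic regression beats the baseline by only about two percentage points.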

In [88]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 15)
model.fit(X, y)
model.score(X, y)

0.98236775818639799

In [93]:
from sklearn.model_selection import cross_val_score

print(cross_val_score(model, X, y, scoring='accuracy', cv=5).mean())

0.654789810908


- The random forest's prediction accuracy on the training set is high (98%).
- However, with 5-fold cross validation the accuracy drops to 65%, below the 68% baseline of predicting that no one is admitted.
- The high accuracy on the training data was therefore due to overfitting.
- With only 397 records, the dataset is too small for robust fitting and testing with cross validation; more data is needed.
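
One possible shape for a more careful evaluation is sketched below: hold out a test set, cross validate on the remaining data, and score the held-out set once at the end. The data here is synthetic, generated with `make_classification` as a stand-in with a similar size and class balance; it is NOT the admissions dataset, so the printed scores say nothing about the real problem. The split sizes, `max_iter`, and random seeds are assumptions made for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

# Synthetic stand-in for the admissions data (NOT the real dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=5,
                                     n_informative=3, weights=[0.68, 0.32],
                                     random_state=0)

# Hold out a test set that is used only once, at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

# Stratified folds keep the admitted/not-admitted ratio similar in each fold.
model_demo = LogisticRegression(C=100, max_iter=1000)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model_demo, X_train, y_train,
                            scoring='accuracy', cv=folds)

# Final check on the held-out data after refitting on the training split.
model_demo.fit(X_train, y_train)
test_accuracy = model_demo.score(X_test, y_test)

print(cv_scores.mean(), test_accuracy)
```

Comparing both numbers with the majority-class baseline, rather than with training accuracy, gives a fairer picture of whether a model has learned anything useful.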