# Project 4

In this project, you will summarize and present your analysis from Projects 1-3.

### Intro: Write a problem statement / specific aim for this project

Answer: The goal of this project was to determine which features have the strongest impact on admission to graduate school at UCLA. We want to predict whether an applicant will be admitted based on their GRE score, GPA, and the prestige rank of their undergraduate institution.

### Dataset:  Write up a description of your data and any cleaning that was completed

Answer: The data contains four variables: admit, gre, gpa, and prestige (see the data dictionary below for descriptions). The admit variable is the binary outcome we are predicting. GRE and GPA are continuous variables that we use without transformation. Prestige has four categories, so we convert it into dummy variables and drop one of them (prestige 1, which becomes the reference category) to prevent collinearity. Rows with missing values (NAs) are dropped before the dummy variables are created. As for the distributions, GPA is negatively skewed (its median is greater than its mean), GRE is positively skewed (its median is less than its mean), and prestige is positively skewed.
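
The cleaning steps described above can be sketched as follows. This is a minimal illustration on a made-up five-row frame (the values are assumptions, not rows from the admissions file), and it uses `drop_first=True` as one possible way to drop a dummy column; the notebook itself drops the first category by slicing the dummy columns.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; these are not the real admissions data.
toy = pd.DataFrame({
    "admit":    [0, 1, 1, 0, 1],
    "gre":      [380.0, 660.0, 800.0, np.nan, 520.0],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 2.0],
})

# Drop rows with any missing value first.
clean = toy.dropna()

# One indicator column per prestige level. drop_first=True omits the lowest
# level present (1.0 here), which becomes the reference category and avoids
# perfect collinearity with an intercept term.
dummies = pd.get_dummies(clean["prestige"], prefix="prestige", drop_first=True)

model_df = clean[["admit", "gre", "gpa"]].join(dummies)
print(model_df.columns.tolist())
```

Because the row with the missing GRE score is dropped before the dummies are built, a prestige level that only appears in dropped rows gets no column at all.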

### Data Dictionary

Variable | Description | Type of Variable
---| ---| ---
*Admit* | 0 = not admitted to the program, 1 = admitted to the program; this is the binary target variable (the outcome) of the admissions data set | binary
*GRE* | the candidate's GRE score (Graduate Record Examinations, a standardized exam for graduate school admission) | continuous
*GPA* | the candidate's GPA (grade point average, averaged across all of the candidate's undergraduate classes) | continuous
*Prestige* | the ranking of an applicant's undergraduate school | discrete


### Demo: Provide a table that explains the data by admission status

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
import os



Mean (STD) or counts by admission status for each variable 

Variable | Not Admitted | Admitted
---| ---| ---
GPA | mean (std) | mean (std)
GRE | mean (std) | mean (std)
Prestige 1 | frequency (%) | frequency (%)
Prestige 2 | frequency (%) | frequency (%)
Prestige 3 | frequency (%) | frequency (%)
Prestige 4 | frequency (%) | frequency (%)

In [24]:
# Data import, cleaning, and dummy-ing
df_raw = pd.read_csv("/Users/scottonigman/Desktop/GA/homework/sonigman-dat-GA-HW/admissions.csv")
df = df_raw.dropna()  # drop rows with missing values
# One indicator column per prestige level
dummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige')
cols_to_keep = ['admit', 'gre', 'gpa']
df_new = df[cols_to_keep].join(dummy_ranks)
df_new.head()

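| | admit | gre | gpa | prestige_1.0 | prestige_2.0 | prestige_3.0 | prestige_4.0
---| ---| ---| ---| ---| ---| ---| ---
0 | 0 | 380.0 | 3.61 | 0 | 0 | 1 | 0
1 | 1 | 660.0 | 3.67 | 0 | 0 | 1 | 0
2 | 1 | 800.0 | 4.0 | 1 | 0 | 0 | 0
3 | 1 | 640.0 | 3.19 | 0 | 0 | 0 | 1
4 | 0 | 520.0 | 2.93 | 0 | 0 | 0 | 1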


In [34]:
#aggregating for this table
agglist = {'gre':['mean'], 'gpa':['mean'], 'prestige_1.0':['sum'] , 'prestige_2.0':['sum'], 'prestige_3.0':['sum'], 'prestige_4.0':['sum']}
admitstatus = df_new.groupby('admit').agg(agglist)
admitstatus.T
# I think the frequency % here is off: it should be the count for each admit status divided by the row total, which could be done with a lambda function included in the agglist object

Variable | Aggregation | admit = 0 | admit = 1
---| ---| ---| ---
prestige_4.0 | sum | 55.0 | 12.0
prestige_1.0 | sum | 28.0 | 33.0
gpa | mean | 3.347159 | 3.489206
gre | mean | 573.579336 | 618.571429
prestige_2.0 | sum | 95.0 | 53.0
prestige_3.0 | sum | 93.0 | 28.0
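
The mean (std) and frequency (%) layout of the template table could be produced along these lines. This is a sketch on made-up rows, not the notebook's actual method: it swaps the lambda idea from the comment above for `pd.crosstab` with `normalize="index"`, which divides each count by its row total.

```python
import pandas as pd

# Toy rows for illustration only; these are not the real admissions data.
toy = pd.DataFrame({
    "admit":    [0, 0, 0, 1, 1, 1],
    "gre":      [400.0, 500.0, 600.0, 600.0, 700.0, 800.0],
    "gpa":      [3.0, 3.2, 3.4, 3.4, 3.6, 3.8],
    "prestige": [2, 3, 3, 1, 1, 2],
})

# Mean and standard deviation of the continuous variables by admission status.
cont = toy.groupby("admit")[["gre", "gpa"]].agg(["mean", "std"])

# Raw counts: one row per prestige level, one column per admission status.
counts = pd.crosstab(toy["prestige"], toy["admit"])

# Each count divided by its row total, expressed as a percentage.
row_pct = pd.crosstab(toy["prestige"], toy["admit"], normalize="index") * 100

print(cont.T)
print(counts)
print(row_pct.round(1))
```

Passing `normalize="columns"` instead would give the share of each admission group that comes from each prestige level, which answers a different question, so the choice should match how the table is meant to be read.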


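#### Random starter code while figuring out how to combine agg functions
# Scratch notes only: the frames and columns below are placeholders, not the admissions data.

# A dict maps each column to one or more aggregation functions
# (a custom function can be passed in place of a function name).
f = {'A': ['sum', 'mean'], 'B': ['prod']}
agg_spec = {'duration': 'sum',        # find the sum of the durations for each group
            'network_type': 'count',  # find the number of network type entries
            'date': 'first'}          # take the first date in each group

# Named aggregation gives flat, renamed output columns, so no droplevel() is needed.
air = (df.groupby(['origin', 'dest'])
         .agg(dep_mean=('dep_delay', 'mean'), dep_count=('dep_delay', 'count'),
              arr_mean=('arr_delay', 'mean'), arr_count=('arr_delay', 'count')))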

### Methods: Write up the methods used in your analysis

Answer:
### EDA
In order to analyze this data set, we will:
0. Write a data dictionary
1. Use `describe()` to see the count, mean, standard deviation, minimum, maximum, and quartiles of each variable
2. Use `info()` to check the shape of the data set and whether any values are missing
3. Plot a histogram of each variable as a visual check on whether it is normally distributed
4. Plot the variables against each other and against the outcome to look for relationships
5. Aggregate by admission status using the group by functions in pandas
6. Test the normality of the data set and filter out any outliers using an agreed-upon significance level (0.05 or 0.01)
7. Create dummy variables from the prestige variable so that it can be used in the logistic regression
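
The numeric part of this EDA plan can be sketched as below. The frame is made up for illustration (the values are assumptions, not the admissions data), and the plots are left out so the example stays self-contained.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; these are not the real admissions data.
toy = pd.DataFrame({
    "gre": [380.0, 520.0, 560.0, 580.0, 600.0, 800.0, np.nan],
    "gpa": [2.5, 3.4, 3.5, 3.6, 3.7, 3.8, 4.0],
})

# Count, mean, std, min, quartiles, and max for each column.
summary = toy.describe()

# Number of missing values per column.
missing = toy.isna().sum()

# Sample skewness: negative means a longer left tail, positive a longer right tail.
skewness = toy.skew()

# A quick cross-check on skew direction: a mean below the median suggests
# negative skew, a mean above the median suggests positive skew.
mean_minus_median = toy.mean() - toy.median()

print(summary)
print(missing)
print(skewness)
print(mean_minus_median)
```

The mean versus median comparison is only a rule of thumb, so it is worth reading alongside the skewness statistic and the histograms rather than on its own.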

### Analysis Method
1. Import data set
2. Clean data using previously described EDA plan
3. Create dummy variables from the prestige variable, since it is discrete
4. Fit a logistic regression
5. Evaluate the coefficients and odds ratios in the statsmodels output


In [27]:
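# Drop prestige_1.0 so that it serves as the reference category
df_new2 = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2.0':])
# statsmodels does not add an intercept automatically
df_new2['intercept'] = 1.0
# Every column except the target 'admit' is a predictor
train_cols = df_new2.columns[1:]
logit = sm.Logit(df_new2['admit'], df_new2[train_cols])
result = logit.fit()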

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


### Results: Write up your results

Answer: The coefficients are on the log-odds scale. For every one-unit increase in gre, the log-odds of admission increase by 0.0022; for every one-unit increase in gpa, they increase by 0.78, which multiplies the odds of admission by about 2.18 (exp(0.7793)). Relative to applicants from prestige 1 schools, applicants from prestige 2, 3, or 4 schools have log-odds of admission that are lower by 0.68, 1.33, and 1.55, respectively. The gre (p = 0.043) and gpa (p = 0.019) effects are significant at the 0.05 level.
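
Logistic regression coefficients become odds ratios when exponentiated. The sketch below does that for the reported values; the gre, gpa, and prestige 2 coefficients come from the regression summary, while the prestige 3 and 4 values are the rounded figures quoted in the answer, so their odds ratios are approximate. The dictionary keys are labels chosen for this example.

```python
import math

# Log-odds coefficients as reported in this write-up.
coefs = {
    "gre": 0.0022,
    "gpa": 0.7793,
    "prestige_2": -0.6801,
    "prestige_3": -1.33,   # rounded value quoted in the answer
    "prestige_4": -1.55,   # rounded value quoted in the answer
}

# exp(coefficient) is the multiplicative change in the odds of admission
# for a one-unit increase in the predictor, holding the others fixed.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")

# A one-point GRE change is tiny, so a 100-point change is easier to interpret.
gre_100_ratio = math.exp(100 * coefs["gre"])
print(f"gre (+100 points): odds ratio = {gre_100_ratio:.3f}")
```

An odds ratio below 1 means lower odds than the reference group, which is why all three prestige ratios fall below 1 relative to prestige 1.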

### Visuals: Provide a table or visualization of these results


In [28]:
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      391
Method:                           MLE   Df Model:                            5
Date:                Wed, 30 Aug 2017   Pseudo R-squ.:                 0.08166
Time:                        18:12:31   Log-Likelihood:                -227.82
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.176e-07
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
gre              0.0022      0.001      2.028      0.043    7.44e-05       0.004
gpa              0.7793      0.333      2.344      0.019       0.128       1.431
prestige_2.0    -0.6801      0.317     -2.14

### Discussion: Write up your discussion and future steps

Answer: Since this model didn't have a particularly good fit (the pseudo R-squared is about 0.08), I would try to engineer other features from those that currently exist. I would also look for relationships between gre and gpa, gpa and prestige, or gre and prestige that could explain any tradeoffs in attending one undergrad institution over another in order to boost potential graduate admissions. One could also join other data sets, for example demographic data or more years of admissions data, which could yield more accurate predictions. Finally, one could try a random forest algorithm to make the prediction, but be sure to constrain the trees (for example, by limiting their depth), as this is a small data set that could easily be overfit.