# Project 3

In this project, you will perform a logistic regression on the admissions data we worked with previously in Projects 1 and 2.

#### Part 0. Load packages

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np

%matplotlib inline



In [2]:
df_raw = pd.read_csv("./datasets/admissions.csv")
df = df_raw.dropna() 
print (df.head())

   admit    gre   gpa  prestige
0      0  380.0  3.61       3.0
1      1  660.0  3.67       3.0
2      1  800.0  4.00       1.0
3      1  640.0  3.19       4.0
4      0  520.0  2.93       4.0


## Part 1. Frequency Tables

#### 1. Let's create a frequency table of our variables ([hint](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.crosstab.html))

In [3]:
# frequency table for prestige and whether or not someone was admitted
pd.crosstab(df['admit'], df['prestige'], rownames=['admit'])

prestige  1.0  2.0  3.0  4.0
admit
0          28   95   93   55
1          33   53   28   12


## Part 2. Return of dummy variables

#### 2.1 Create class or dummy variables for prestige 

In [5]:
prest_dummy = pd.get_dummies(df['prestige'])
prest_dummy.head()

   1.0  2.0  3.0  4.0
0    0    0    1    0
1    0    0    1    0
2    1    0    0    0
3    0    0    0    1
4    0    0    0    1


#### 2.2 When modeling our class variables, how many do we need? (caution: be sure to avoid the dreaded [dummy variable trap](http://www.algosome.com/articles/dummy-variable-trap-regression.html))  
All 4? 3? 2? 1? 

Answer: 3
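
A minimal sketch of why three dummies are enough, using a small made-up `prestige` column (the values are hypothetical, not the admissions data). With all four dummies plus an intercept, the dummy columns sum to the intercept column, so the design matrix loses a rank; dropping one category restores full rank.

```python
import numpy as np
import pandas as pd

# Toy prestige column (hypothetical values, every category appears at least once).
prestige = pd.Series([1.0, 2.0, 3.0, 4.0, 2.0, 3.0], name="prestige")

# All four dummies plus an intercept column.
all_four = pd.get_dummies(prestige).astype(float)
all_four["intercept"] = 1.0

# Three dummies (first category dropped) plus an intercept column.
three = pd.get_dummies(prestige, drop_first=True).astype(float)
three["intercept"] = 1.0

# With all four dummies the columns are linearly dependent:
# the rank is smaller than the number of columns.
rank_all_four = np.linalg.matrix_rank(all_four.to_numpy())
rank_three = np.linalg.matrix_rank(three.to_numpy())

print(all_four.shape[1], rank_all_four)
print(three.shape[1], rank_three)
```

A rank below the column count means the coefficients cannot be estimated uniquely, which is the dummy variable trap.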

## Part 3. Hand calculating [odds ratios](https://lifeinthefastlane.com/ccc/odds-ratio/)

Develop your intuition about expected outcomes by hand-calculating odds ratios.

In [6]:
cols_to_keep = ['admit', 'gre', 'gpa']
handCalc = df[cols_to_keep].join(prest_dummy.loc[:, 1.0:])
print (handCalc.head())

   admit    gre   gpa  1.0  2.0  3.0  4.0
0      0  380.0  3.61    0    0    1    0
1      1  660.0  3.67    0    0    1    0
2      1  800.0  4.00    1    0    0    0
3      1  640.0  3.19    0    0    0    1
4      0  520.0  2.93    0    0    0    1


In [7]:
# make a frequency table of the prestige 1.0 dummy (column 1.0) against whether or not someone was admitted
pd.crosstab(handCalc['admit'], handCalc[1.0], rownames=['admit'])



1.0      0   1
admit
0      243  28
1       93  33


#### 3.1 Use the values from the [cross tab](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.crosstab.html) produced above to calculate the odds of being admitted to grad school if you attended a #1 ranked college

In [8]:
ex1 = 33.0/28.0
ex1

1.1785714285714286

#### 3.2 Now calculate the odds of admission if you did not attend a #1 ranked college

In [9]:
ex2 = 93.0/243.0
ex2

0.38271604938271603
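
Odds are easy to confuse with probabilities, so here is a small sketch of the conversion between the two, using the counts from the cross tab above. The helper names `odds_to_prob` and `prob_to_odds` are made up for this illustration.

```python
def odds_to_prob(odds):
    """Convert odds to a probability: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)


def prob_to_odds(p):
    """Convert a probability to odds: odds = p / (1 - p)."""
    return p / (1.0 - p)


# Odds from 3.1 and 3.2 (admitted / not admitted).
odds_top = 33.0 / 28.0
odds_other = 93.0 / 243.0

# The matching probabilities are admitted / total in each group.
p_top = odds_to_prob(odds_top)      # same as 33 / (33 + 28)
p_other = odds_to_prob(odds_other)  # same as 93 / (93 + 243)

print(round(p_top, 3), round(p_other, 3))
```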

#### 3.3 Use 3.1 and 3.2 to calculate the odds ratio (OR)

In [10]:
ex1/ex2

3.079493087557604

#### 3.4 Write this finding in a sentence

Answer: The odds of admission are about 3.08 times as high for applicants who attended a #1-ranked college as for those who did not.

#### 3.5 Print the cross tab for 'prestige_4.0'

In [11]:
pd.crosstab(handCalc['admit'], handCalc[4.0], rownames=['admit'])

4.0      0   1
admit
0      216  55
1      114  12


#### 3.6 Calculate the OR 

In [31]:
(12/55)/(114/216)

0.4133971291866028

#### 3.7 Write this finding in a sentence

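Answer: The odds of admission for applicants who attended a #4-ranked college are about 0.41 times the odds for those who did not; equivalently, the odds are about 2.4 times as high for applicants who did not attend a #4-ranked college.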
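
The hand calculations in this part can be wrapped in a small helper. This is a sketch only: `odds_ratio` is a hypothetical function name, and the two tables are typed in from the cross tabs above rather than computed from `handCalc`.

```python
import pandas as pd


def odds_ratio(table):
    """Odds ratio from a 2x2 table with rows admit=0/1 and columns group=0/1."""
    odds_in_group = table.loc[1, 1] / table.loc[0, 1]   # admitted / rejected, inside the group
    odds_outside = table.loc[1, 0] / table.loc[0, 0]    # admitted / rejected, outside the group
    return odds_in_group / odds_outside


# Counts copied from the cross tabs for the 1.0 and 4.0 dummies.
prestige_1 = pd.DataFrame([[243, 28], [93, 33]], index=[0, 1], columns=[0, 1])
prestige_4 = pd.DataFrame([[216, 55], [114, 12]], index=[0, 1], columns=[0, 1])

or_1 = odds_ratio(prestige_1)
or_4 = odds_ratio(prestige_4)

# The reciprocal gives the comparison in the other direction.
print(round(or_1, 3), round(or_4, 3), round(1 / or_4, 3))
```

An odds ratio below 1 and its reciprocal describe the same association; which one to report depends on which group is treated as the baseline.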

## Part 4. Analysis

Set 1 (aka most prestigious) as your reference category and merge prestige_2, prestige_3 and prestige_4 back into the dataset. 

#### Reminder- How to use dummy variables to represent an n-category variable:
1. First note that we use a set of n-1 dummy variables as tools to represent an n‑category variable.
2. Choose one of the categories to serve as the “reference” category, the category to which you compare the other categories.
3. Create dummy (0/1) variables to represent each of the other categories.  Each dummy is coded so that it has the value 1 if a case is in that category, and 0 if not.
4. Interpret the regression coefficient for each dummy variable as how that category compares to the reference category.
([source](http://web.pdx.edu/~stipakb/))
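
The steps above can be sketched on a toy column. The data and the names `reference` and `design` are assumptions for illustration; the notebook itself builds its dummies from `df['prestige']`.

```python
import pandas as pd

# Toy prestige column (hypothetical values).
prestige = pd.Series([3.0, 3.0, 1.0, 4.0, 4.0, 2.0], name="prestige")

# One 0/1 column per category.
dummies = pd.get_dummies(prestige, prefix="prestige").astype(int)

# Step 2: pick the reference category; step 3: keep dummies for the others.
reference = "prestige_1.0"
design = dummies.drop(columns=[reference])

# A row of all zeros identifies a case in the reference category.
is_reference = design.sum(axis=1) == 0

print(design)
print(is_reference.tolist())
```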

In [12]:
# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
data = df[cols_to_keep].join(prest_dummy.loc[:, 2:])
data.head()

   admit    gre   gpa  2.0  3.0  4.0
0      0  380.0  3.61    0    1    0
1      1  660.0  3.67    0    1    0
2      1  800.0  4.00    0    0    0
3      1  640.0  3.19    0    0    1
4      0  520.0  2.93    0    0    1


We're going to add a constant term for our Logistic Regression. The statsmodels function we're going to be using requires that intercepts/constants are specified explicitly.

In [13]:
# manually add the intercept
data['intercept'] = 1.0

#### 4.1 Define the labels of the covariates (columns) as a variable called 'train_cols'

In [14]:
train_cols = data.columns[1:]

#### 4.2 Fit the model
e.g.  
```python
logit = sm.Logit(y, data[train_cols])  
result = logit.fit()  
```

In [16]:
# instantiate the model
logit = sm.Logit(data['admit'], data[train_cols])
# then fit it
result = logit.fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6
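
The "Current function value" reported above matches the average negative log-likelihood (227.82 / 397 ≈ 0.5739, using the figures in the summary below). As far as I know, `Logit.fit` uses Newton-Raphson by default; the following is a minimal NumPy sketch of that idea on synthetic data, not the statsmodels implementation. The function name `fit_logit_newton` and the simulated data are assumptions.

```python
import numpy as np


def fit_logit_newton(X, y, max_iter=25, tol=1e-8):
    """Fit a logistic regression by Newton-Raphson; returns (beta, iterations)."""
    beta = np.zeros(X.shape[1])
    for iteration in range(1, max_iter + 1):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # predicted probabilities
        gradient = X.T @ (y - p)                    # gradient of the log-likelihood
        hessian = -(X.T * (p * (1.0 - p))) @ X      # Hessian of the log-likelihood
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step                          # Newton update
        if np.max(np.abs(step)) < tol:
            break
    return beta, iteration


# Synthetic data (hypothetical): one predictor plus an intercept column.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([x, np.ones_like(x)])
true_beta = np.array([1.5, -0.5])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat, n_iter = fit_logit_newton(X, y)

# Average negative log-likelihood at the fitted coefficients.
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))
avg_neg_loglik = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
print(beta_hat, n_iter, avg_neg_loglik)
```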


#### 4.3 Print the summary results

In [17]:
print(result.summary2())

                        Results: Logit
Model:              Logit            No. Iterations:   6.0000  
Dependent Variable: admit            Pseudo R-squared: 0.082   
Date:               2018-04-24 11:37 AIC:              467.6399
No. Observations:   397              BIC:              491.5435
Df Model:           5                Log-Likelihood:   -227.82 
Df Residuals:       391              LL-Null:          -248.08 
Converged:          1.0000           Scale:            1.0000  
----------------------------------------------------------------
             Coef.   Std.Err.     z     P>|z|    [0.025   0.975]
----------------------------------------------------------------
gre          0.0022    0.0011   2.0280  0.0426   0.0001   0.0044
gpa          0.7793    0.3325   2.3438  0.0191   0.1276   1.4311
2.0         -0.6801    0.3169  -2.1459  0.0319  -1.3013  -0.0589
3.0         -1.3387    0.3449  -3.8819  0.0001  -2.0146  -0.6628
4.0         -1.5534    0.4175  -3.7211  0.0002  -2.3716  -

#### 4.4 Calculate the odds ratio of the coefficients and their 95% [confidence intervals](http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.LogitResults.conf_int.html)

hints: 
```python
np.exp(X)
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
```

In [19]:
params = result.params # get coefficients

# odds ratios of the coefficients and their 95% confidence intervals
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']  # label the interval bounds, as in the hint
np.exp(conf)

               2.5%     97.5%        OR
gre        1.000074  1.004372  1.002221
gpa        1.136120  4.183113  2.180027
2.0        0.272168  0.942767  0.506548
3.0        0.133377  0.515419  0.262192
4.0        0.093329  0.479411  0.211525
intercept  0.002207  0.194440  0.020716
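
To see where these numbers come from, the interval for one coefficient can be rebuilt by hand. This sketch assumes the usual normal approximation (coefficient ± 1.96 standard errors) and uses the rounded `gpa` values printed in the summary, so the last digits differ slightly from the table.

```python
import numpy as np

# Coefficient and standard error for gpa, copied from the summary in 4.3.
coef, std_err = 0.7793, 0.3325
z = 1.96  # approximate 97.5th percentile of the standard normal

# Confidence interval on the log-odds scale.
ci_low = coef - z * std_err
ci_high = coef + z * std_err

# Exponentiate to move from log-odds to odds ratios.
odds_ratio_gpa = np.exp(coef)
or_low, or_high = np.exp(ci_low), np.exp(ci_high)

print(round(ci_low, 4), round(ci_high, 4))
print(round(or_low, 3), round(odds_ratio_gpa, 3), round(or_high, 3))
```

The interval is symmetric around the coefficient on the log-odds scale, but not around the odds ratio after exponentiating.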


In [None]:
# confidence intervals


#### 4.5 [Interpret](https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/) the OR of Prestige_2

#### 4.6 Interpret the OR of GPA

Answer: 
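
One common way to phrase an odds ratio is as a percentage change in the odds, holding the other covariates fixed. The sketch below applies that to the two odds ratios from 4.4; the helper name `percent_change_in_odds` is made up, and the wording in the comments is one possible reading rather than the required answer.

```python
def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0


# Odds ratios copied from the table in 4.4.
or_prestige_2 = 0.506548
or_gpa = 2.180027

# Prestige 2 vs. the reference category (prestige 1), GRE and GPA held fixed.
change_prestige_2 = percent_change_in_odds(or_prestige_2)

# A one-point increase in GPA, GRE and prestige held fixed.
change_gpa = percent_change_in_odds(or_gpa)

print(round(change_prestige_2, 1), round(change_gpa, 1))
```

A negative value means lower odds than the baseline and a positive value means higher odds; neither is a change in probability.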

## Part 5. Predicted probabilities

As a way of evaluating our classifier, we're going to recreate the dataset with every logical combination of input values. This will allow us to see how the predicted probability of admission increases or decreases across different variables. First we're going to generate the combinations using a helper function called cartesian (defined below).

We're going to use np.linspace to create a range of values for "gre" and "gpa". This creates a range of linearly spaced values from a specified minimum and maximum value--in our case just the min/max observed values.
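
A quick look at what `np.linspace` returns, using the GRE range of 220 to 800 that appears in the output further down. Both endpoints are included, and the spacing is (max - min) / (n - 1).

```python
import numpy as np

# 10 evenly spaced values from 220 to 800, endpoints included.
grid = np.linspace(220, 800, 10)

# Distance between neighbouring values.
step = np.diff(grid)

print(grid)
print(step[0])
```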

In [53]:
def cartesian(arrays, out=None):
    """Return the cartesian product of the input arrays as a 2-D array."""
    arrays = [np.asarray(x) for x in arrays]
    # use a common dtype so float columns (e.g. GPA) are not truncated to int
    dtype = np.result_type(*arrays)

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    # integer division: m is used as a slice index below
    m = n // arrays[0].size
    out[:, 0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m, 1:])
        for j in range(1, arrays[0].size):
            out[j*m:(j+1)*m, 1:] = out[0:m, 1:]
    return out

In [54]:
# instead of generating all possible values of GRE and GPA, we're going
# to use an evenly spaced range of 10 values from the min to the max 
gres = np.linspace(data['gre'].min(), data['gre'].max(), 10)
gres = list(map(int, gres))
print(gres)
# before truncating to int:
# array([ 220.        ,  284.44444444,  348.88888889,  413.33333333,
#         477.77777778,  542.22222222,  606.66666667,  671.11111111,
#         735.55555556,  800.        ])
gpas = np.linspace(data['gpa'].min(), data['gpa'].max(), 10)
# keep GPAs as floats (rounded to 2 decimals); casting to int would collapse them to 2, 3 and 4
gpas = [round(float(g), 2) for g in gpas]
print(gpas)
# before rounding:
# array([ 2.26      ,  2.45333333,  2.64666667,  2.84      ,  3.03333333,
#         3.22666667,  3.42      ,  3.61333333,  3.80666667,  4.        ])

[220, 284, 348, 413, 477, 542, 606, 671, 735, 800]
[2.26, 2.45, 2.65, 2.84, 3.03, 3.23, 3.42, 3.61, 3.81, 4.0]


In [55]:
# enumerate all possibilities
combos = pd.DataFrame(cartesian([gres, gpas, [1, 2, 3, 4], [1.]]))
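
As a cross-check on the helper, the same grid can be built with the standard library's `itertools.product`. This is an alternative sketch, not the approach the notebook uses; `combos_alt` is a made-up name, and the GRE and GPA ranges are typed in from the commented values above instead of being read from `data`.

```python
import itertools

import numpy as np
import pandas as pd

# Ranges taken from the commented linspace output above.
gres = np.linspace(220, 800, 10)
gpas = np.linspace(2.26, 4.0, 10)

# Every combination of GRE, GPA, prestige and the constant intercept term.
rows = itertools.product(gres, gpas, [1, 2, 3, 4], [1.0])
combos_alt = pd.DataFrame(list(rows), columns=["gre", "gpa", "prestige", "intercept"])

print(combos_alt.shape)
print(combos_alt.tail(4))
```

With 10 GRE values, 10 GPA values and 4 prestige levels, the grid has 400 rows.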

#### 5.1 Re-create the dummy variables

In [43]:
# re-create the dummy variables
combos.columns = ['gre', 'gpa', 'prestige', 'intercept']
dummy_ranks = pd.get_dummies(combos['prestige'])
# label the dummies 1.0-4.0 so they match the column names in train_cols
dummy_ranks.columns = [1.0, 2.0, 3.0, 4.0]

# keep only what we need for making predictions
cols_to_keep = ['gre', 'gpa', 'prestige', 'intercept']
combos = combos[cols_to_keep].join(dummy_ranks.loc[:, 2.0:])

#### 5.2 Make predictions on the enumerated dataset using the model we created previously
e.g.
```python
result.predict(combos[train_cols])
```
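
To make clear what `predict` computes, here is a hand-rolled sketch of the logistic formula. It recovers the coefficients by taking logs of the odds ratios printed in 4.4, so results are only as precise as those rounded values; `predict_admit` is a hypothetical helper, not a statsmodels function.

```python
import numpy as np

# Odds ratios copied from the table in 4.4; their logs are the coefficients.
odds_ratios = {"gre": 1.002221, "gpa": 2.180027, 2.0: 0.506548,
               3.0: 0.262192, 4.0: 0.211525, "intercept": 0.020716}
coefs = {name: np.log(value) for name, value in odds_ratios.items()}


def predict_admit(gre, gpa, prestige):
    """Predicted probability of admission for one hypothetical applicant."""
    log_odds = coefs["intercept"] + coefs["gre"] * gre + coefs["gpa"] * gpa
    if prestige != 1:  # prestige 1 is the reference category, so it adds nothing
        log_odds += coefs[float(prestige)]
    return 1.0 / (1.0 + np.exp(-log_odds))


# Same GRE and GPA, most vs. least prestigious undergraduate school.
p_best = predict_admit(800, 4.0, 1)
p_worst = predict_admit(800, 4.0, 4)
print(round(p_best, 3), round(p_worst, 3))
```

Comparing applicants who differ only in prestige shows how much the dummy coefficients move the predicted probability.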

#### 5.3 Interpret findings for the last 4 observations

Answer: 

## Bonus!

Plot the probability of being admitted into graduate school, stratified by GPA and GRE score

##### inspired by the great blog post: http://blog.yhathq.com/posts/logistic-regression-and-python.html  