# Project 3

In this project, you will perform a logistic regression on the admissions data we've been working with in projects 1 and 2.

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
import statsmodels.formula.api as smf



In [4]:
df_raw = pd.read_csv("../assets/admissions.csv")
df = df_raw.dropna() 
print df.head()

   admit    gre   gpa  prestige
0      0  380.0  3.61       3.0
1      1  660.0  3.67       3.0
2      1  800.0  4.00       1.0
3      1  640.0  3.19       4.0
4      0  520.0  2.93       4.0


## Part 1. Frequency Tables

#### 1. Let's create a frequency table of our variables

In [5]:
print pd.crosstab(df['admit'], df['prestige'], rownames=['admit'])

prestige  1.0  2.0  3.0  4.0
admit                       
0          28   95   93   55
1          33   53   28   12


## Part 2. Return of dummy variables

#### 2.1 Create class or dummy variables for prestige 

In [6]:
prestige_dummies = pd.get_dummies(df["prestige"].astype(int),prefix="prestige",prefix_sep='_',drop_first=True)

In [7]:
prestige_dummies.head()

   prestige_2  prestige_3  prestige_4
0           0           1           0
1           0           1           0
2           0           0           0
3           0           0           1
4           0           0           1


#### 2.2 When modeling our class variables, how many do we need? 



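A categorical variable with n levels needs n - 1 dummy variables, because the omitted level serves as the reference category. Prestige has 4 levels, so we need 3 dummy variables (prestige_1 is the reference).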
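
As a minimal sketch of why n - 1 columns are enough, the snippet below dummy-codes a few prestige values by hand in plain Python. The helper name `dummy_code` is hypothetical and the input list is made up for illustration; it only mirrors the idea behind `pd.get_dummies(..., drop_first=True)` and is not how pandas implements it.

```python
def dummy_code(values, prefix="prestige"):
    """Return one dict per value with n - 1 indicator columns (first level dropped)."""
    levels = sorted(set(values))
    kept = levels[1:]  # the lowest level becomes the reference category
    rows = []
    for v in values:
        rows.append(dict(("%s_%d" % (prefix, lvl), int(v == lvl)) for lvl in kept))
    return rows

# Illustrative input only: four prestige levels in arbitrary order
rows = dummy_code([3, 3, 1, 4, 4, 2])
for row in rows:
    print(sorted(row.items()))
```

A row of all zeros (the third one here) identifies the reference level, which is why a fourth column would carry no extra information.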

## Part 3. Hand calculating odds ratios

Develop your intuition about expected outcomes by hand calculating odds ratios.

In [8]:
cols_to_keep = ['admit', 'gre', 'gpa']
handCalc = df[cols_to_keep].join(prestige_dummies.loc[:, 'prestige_2':])
print handCalc.head()

   admit    gre   gpa  prestige_2  prestige_3  prestige_4
0      0  380.0  3.61           0           1           0
1      1  660.0  3.67           0           1           0
2      1  800.0  4.00           0           0           0
3      1  640.0  3.19           0           0           1
4      0  520.0  2.93           0           0           1




In [9]:
pd.crosstab(df["prestige"], df["admit"])

admit      0   1
prestige        
1.0       28  33
2.0       95  53
3.0       93  28
4.0       55  12


#### 3.1 Use the cross tab above to calculate the odds of being admitted to grad school if you attended a #1 ranked college

In [10]:
print "The odds of getting admitted if you attend a #1 ranked college are " + str(float(33)/float(28))

The odds of getting admitted if you attend a #1 ranked college are 1.17857142857


#### 3.2 Now calculate the odds of admission if you did not attend a #1 ranked college

In [11]:
print "The odds of getting admitted if you did not attend a #1 ranked college are " + str(float(53+28+12)/float(95+93+55))

The odds of getting admitted if you did not attend a #1 ranked college are 0.382716049383


#### 3.3 Calculate the odds ratio

In [12]:
float(33)/float(28)/(float(53+28+12)/float(95+93+55))

3.079493087557604
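
Odds are not probabilities, so an odds ratio of about 3.08 does not mean that applicants from #1 ranked colleges are three times as likely to be admitted. The sketch below uses only the counts from the cross tab above to put the odds ratio next to the ratio of probabilities. Variable names are illustrative.

```python
# Counts taken from the prestige x admit cross tab above
admit_1, reject_1 = 33, 28                               # prestige 1
admit_other, reject_other = 53 + 28 + 12, 95 + 93 + 55   # prestige 2-4 combined

odds_1 = float(admit_1) / reject_1
odds_other = float(admit_other) / reject_other
odds_ratio = odds_1 / odds_other

# probability = admitted / total, which equals odds / (1 + odds)
p_1 = float(admit_1) / (admit_1 + reject_1)
p_other = float(admit_other) / (admit_other + reject_other)

print("odds ratio:        %.3f" % odds_ratio)
print("probability ratio: %.3f" % (p_1 / p_other))
```

The odds ratio comes out noticeably larger than the ratio of probabilities, which is why the two should not be used interchangeably.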

#### 3.4 Write this finding in a sentence

In [13]:
print "The odds of admission for students from a #1 ranked college are " + str(float(33)/float(28)/(float(53+28+12)/float(95+93+55))) + " times the odds for students from other colleges"

The odds of admission for students from a #1 ranked college are 3.07949308756 times the odds for students from other colleges


#### 3.5 Print the cross tab for prestige_4

In [14]:
pd.crosstab(df[df['prestige']== 4]['prestige'],df[df['prestige']== 4]['admit'])

admit      0   1
prestige        
4.0       55  12


#### 3.6 Calculate the OR 

In [15]:
float(12)/float(55)

0.21818181818181817
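
The value above is the odds of admission for prestige 4 applicants, not yet a ratio. Here is a minimal sketch of the corresponding odds ratio, assuming the comparison group is all applicants from prestige 1 to 3 colleges combined. The exercise does not state the comparison group, so that choice is an assumption.

```python
# Counts from the cross tab in In [9]
odds_p4 = float(12) / 55                           # prestige 4: admitted / rejected
odds_rest = float(33 + 53 + 28) / (28 + 95 + 93)   # prestige 1-3 combined (assumed comparison group)
or_p4 = odds_p4 / odds_rest
print("Odds ratio, prestige 4 vs. prestige 1-3: %.3f" % or_p4)
```

An odds ratio below 1 indicates lower odds of admission for prestige 4 applicants than for the comparison group.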

#### 3.7 Write this finding in a sentence

In [16]:
print "The odds of getting admitted if you attend a #4 ranked college are " + str(float(12)/float(55))

The odds of getting admitted if you attend a #4 ranked college are 0.218181818182


## Part 4. Analysis

In [17]:
# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
data = df[cols_to_keep].join(prestige_dummies.loc[:, 'prestige_2':])
print data.head()

   admit    gre   gpa  prestige_2  prestige_3  prestige_4
0      0  380.0  3.61           0           1           0
1      1  660.0  3.67           0           1           0
2      1  800.0  4.00           0           0           0
3      1  640.0  3.19           0           0           1
4      0  520.0  2.93           0           0           1


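We're going to add a constant term for our logistic regression. The array-based statsmodels interface (`sm.Logit`) requires that intercepts/constants are specified explicitly. The formula interface used in 4.2 (`smf.logit`) adds an intercept automatically, so the column created here is not referenced in that formula.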

In [18]:
# manually add the intercept
data['intercept'] = 1.0

In [19]:
data.head()

Unnamed: 0,admit,gre,gpa,prestige_2,prestige_3,prestige_4,intercept
0,0,380.0,3.61,0,1,0,1.0
1,1,660.0,3.67,0,1,0,1.0
2,1,800.0,4.0,0,0,0,1.0
3,1,640.0,3.19,0,0,1,1.0
4,0,520.0,2.93,0,0,1,1.0


#### 4.1 Set the covariates to a variable called train_cols

In [29]:
# the covariates are the predictor columns, i.e. every column except
# the outcome 'admit': gre, gpa, prestige_2, prestige_3, prestige_4, intercept
# (data.cov() would compute a covariance matrix, which is not what is asked)
train_cols = data.columns[1:]


#### 4.2 Fit the model

In [31]:
lm = smf.logit(formula = "admit ~ gre + gpa + prestige_2 + prestige_3 + prestige_4", data=data).fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


#### 4.3 Print the summary results

In [32]:
lm.summary()

Dep. Variable:              admit   No. Observations:         397
Model:                      Logit   Df Residuals:             391
Method:                       MLE   Df Model:                   5
Date:            Sat, 01 Jul 2017   Pseudo R-squ.:        0.08166
Time:                    10:33:44   Log-Likelihood:       -227.82
converged:                   True   LL-Null:              -248.08
                                    LLR p-value:        1.176e-07

                coef   std err        z    P>|z|    [0.025    0.975]
Intercept    -3.8769     1.142   -3.393    0.001    -6.116    -1.638
gre           0.0022     0.001    2.028    0.043  7.44e-05     0.004
gpa           0.7793     0.333    2.344    0.019     0.128     1.431
prestige_2   -0.6801     0.317   -2.146    0.032    -1.301    -0.059
prestige_3   -1.3387     0.345   -3.882    0.000    -2.015    -0.663
prestige_4   -1.5534     0.417   -3.721    0.000    -2.372    -0.735


#### 4.4 Calculate the odds ratios of the coefficients and their 95% confidence intervals

hint 1: np.exp(X)

hint 2: conf['OR'] = params; conf.columns = ['2.5%', '97.5%', 'OR']

In [33]:
params = lm.params
conf = lm.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
print np.exp(conf)

                2.5%     97.5%        OR
Intercept   0.002207  0.194440  0.020716
gre         1.000074  1.004372  1.002221
gpa         1.136120  4.183113  2.180027
prestige_2  0.272168  0.942767  0.506548
prestige_3  0.133377  0.515419  0.262192
prestige_4  0.093329  0.479411  0.211525


#### 4.5 Interpret the OR of Prestige_2

Holding GRE and GPA constant, the odds of admission for a student from a prestige 2 school are about 0.51 times the odds for a student from a prestige 1 school (the reference category), a drop of roughly 49%.

#### 4.6 Interpret the OR of GPA

Holding GRE and prestige constant, each 1-point increase in GPA multiplies the odds of admission by about 2.18, an increase of roughly 118%.
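
The odds ratios in 4.4 are the exponentiated coefficients from the summary in 4.3. As a quick check by hand, the sketch below exponentiates the rounded coefficients and converts each odds ratio into a percent change in odds. Because the inputs are rounded to four decimals, the results match the table in 4.4 only approximately.

```python
import math

# Coefficients copied (rounded) from the summary table in 4.3
coefs = {"gre": 0.0022, "gpa": 0.7793, "prestige_2": -0.6801,
         "prestige_3": -1.3387, "prestige_4": -1.5534}

odds_ratios = dict((name, math.exp(b)) for name, b in coefs.items())
for name in sorted(odds_ratios):
    o = odds_ratios[name]
    # (OR - 1) * 100 is the percent change in odds per one-unit increase
    print("%-10s OR = %.3f  (%+.1f%% change in odds)" % (name, o, (o - 1) * 100))
```

An odds ratio is a multiplier: a value of 2.18 corresponds to a 118% increase in odds, and a value of 0.51 to a 49% decrease.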

## Part 5. Predicted probabilities


As a way of evaluating our classifier, we're going to recreate the dataset with every combination of a grid of input values. This will allow us to see how the predicted probability of admission increases or decreases across different variables. First we're going to generate the combinations using a helper function called cartesian (defined below).

We're going to use np.linspace to create a range of values for "gre" and "gpa". This creates a range of linearly spaced values between a specified minimum and maximum, in our case the minimum and maximum observed values.
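
To make the behaviour of np.linspace concrete, here is a minimal pure-Python sketch that produces the same kind of evenly spaced grid. The function name `linspace_sketch` is hypothetical, and the endpoints 220 and 800 are the minimum and maximum GRE values that appear in the printed output further down.

```python
def linspace_sketch(start, stop, num):
    """Return `num` evenly spaced values from start to stop, both inclusive."""
    if num == 1:
        return [float(start)]
    step = (stop - start) / float(num - 1)
    return [start + i * step for i in range(num)]

gre_grid = linspace_sketch(220.0, 800.0, 10)
print("%.2f, %.2f, ..., %.2f" % (gre_grid[0], gre_grid[1], gre_grid[-1]))
```

With 10 points there are 9 equal steps, so each step is (800 - 220) / 9.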

In [39]:
def cartesian(arrays, out=None):
    """
    Generate a cartesian product of input arrays.
    Parameters
    ----------
    arrays : list of array-like
        1-D arrays to form the cartesian product of.
    out : ndarray
        Array to place the cartesian product in.
    Returns
    -------
    out : ndarray
        2-D array of shape (M, len(arrays)) containing cartesian products
        formed of input arrays.
    Examples
    --------
    >>> cartesian(([1, 2, 3], [4, 5], [6, 7]))
    array([[1, 4, 6],
           [1, 4, 7],
           [1, 5, 6],
           [1, 5, 7],
           [2, 4, 6],
           [2, 4, 7],
           [2, 5, 6],
           [2, 5, 7],
           [3, 4, 6],
           [3, 4, 7],
           [3, 5, 6],
           [3, 5, 7]])
    """

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    m = n / arrays[0].size
    out[:,0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m,1:])
        for j in xrange(1, arrays[0].size):
            out[j*m:(j+1)*m,1:] = out[0:m,1:]
    return out
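
The standard library offers the same enumeration through `itertools.product`. The sketch below is an alternative to the helper above, not something the notebook relies on; it reproduces the docstring example, with the last input varying fastest. The name `combos_list` is illustrative.

```python
from itertools import product

# Every combination of one element from each input list
combos_list = [list(row) for row in product([1, 2, 3], [4, 5], [6, 7])]
print(len(combos_list))    # number of rows: 3 * 2 * 2
print(combos_list[:3])     # first few combinations
```

Unlike the NumPy helper, this returns plain Python lists, so mixed types are not coerced to the dtype of the first array.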

In [40]:
# instead of generating all possible values of GRE and GPA, we're going
# to use an evenly spaced range of 10 values from the min to the max 
gres = np.linspace(data['gre'].min(), data['gre'].max(), 10)
print gres
# array([ 220.        ,  284.44444444,  348.88888889,  413.33333333,
#         477.77777778,  542.22222222,  606.66666667,  671.11111111,
#         735.55555556,  800.        ])
gpas = np.linspace(data['gpa'].min(), data['gpa'].max(), 10)
print gpas
# array([ 2.26      ,  2.45333333,  2.64666667,  2.84      ,  3.03333333,
#         3.22666667,  3.42      ,  3.61333333,  3.80666667,  4.        ])


# enumerate all possibilities
combos = pd.DataFrame(cartesian([gres, gpas, [1, 2, 3, 4], [1.]]))

[ 220.          284.44444444  348.88888889  413.33333333  477.77777778
  542.22222222  606.66666667  671.11111111  735.55555556  800.        ]
[ 2.26        2.45333333  2.64666667  2.84        3.03333333  3.22666667
  3.42        3.61333333  3.80666667  4.        ]


#### 5.1 Recreate the dummy variables

In [57]:
print combos.head()

       0         1    2    3
0  220.0  2.260000  1.0  1.0
1  220.0  2.260000  2.0  1.0
2  220.0  2.260000  3.0  1.0
3  220.0  2.260000  4.0  1.0
4  220.0  2.453333  1.0  1.0


In [58]:
combos.columns = ['gre', 'gpa', 'prestige', 'intercept']

In [59]:
print combos.head()

     gre       gpa  prestige  intercept
0  220.0  2.260000       1.0        1.0
1  220.0  2.260000       2.0        1.0
2  220.0  2.260000       3.0        1.0
3  220.0  2.260000       4.0        1.0
4  220.0  2.453333       1.0        1.0


In [60]:
prestige_dummies_2 = pd.get_dummies(combos["prestige"].astype(int),prefix="prestige",prefix_sep='_',drop_first=True)

In [65]:
cols_to_keep = ['gre','gpa','intercept']
data2 = combos[cols_to_keep].join(prestige_dummies_2.loc[:, 'prestige_2':])
print data2.head()

     gre       gpa  intercept  prestige_2  prestige_3  prestige_4
0  220.0  2.260000        1.0           0           0           0
1  220.0  2.260000        1.0           1           0           0
2  220.0  2.260000        1.0           0           1           0
3  220.0  2.260000        1.0           0           0           1
4  220.0  2.453333        1.0           0           0           0




#### 5.2 Make predictions on the enumerated dataset

In [81]:
data2['predict_admit'] = lm.predict(data2[['gre','gpa','prestige_2','prestige_3','prestige_4']])

In [83]:
data2.tail()

       gre       gpa  intercept  prestige_2  prestige_3  prestige_4  predict_admit
395  800.0  3.806667        1.0           0           0           1       0.334286
396  800.0  4.000000        1.0           0           0           0       0.734040
397  800.0  4.000000        1.0           1           0           0       0.582995
398  800.0  4.000000        1.0           0           1           0       0.419833
399  800.0  4.000000        1.0           0           0           1       0.368608


#### 5.3 Interpret findings for the last 4 observations

For a student with a GRE score of 800 and a GPA of 4.0, the predicted probability of admission by undergraduate school prestige is:

- 73.4% for a prestige 1 school
- 58.3% for a prestige 2 school
- 42.0% for a prestige 3 school
- 36.9% for a prestige 4 school
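
These probabilities come from plugging the inputs into the fitted logistic function. As a rough check, the sketch below recomputes them from the rounded coefficients in the 4.3 summary, so the results agree with `predict_admit` only to about two decimal places. The function name `predict_prob` is hypothetical.

```python
import math

# Rounded coefficients from the summary table in 4.3
INTERCEPT, B_GRE, B_GPA = -3.8769, 0.0022, 0.7793
B_PRESTIGE = {1: 0.0, 2: -0.6801, 3: -1.3387, 4: -1.5534}  # prestige 1 is the reference

def predict_prob(gre, gpa, prestige):
    """Inverse logit of the linear predictor."""
    z = INTERCEPT + B_GRE * gre + B_GPA * gpa + B_PRESTIGE[prestige]
    return 1.0 / (1.0 + math.exp(-z))

probs = dict((p, predict_prob(800, 4.0, p)) for p in (1, 2, 3, 4))
for p in sorted(probs):
    print("prestige %d: %.3f" % (p, probs[p]))
```

The small gaps relative to the notebook output come mostly from the GRE coefficient, which is rounded to 0.0022 in the summary and then multiplied by 800.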

## Bonus

Plot the probability of being admitted into graduate school, stratified by GPA and GRE score.