# General Assembly London (DAT-11)
## Unit Project 2

- **Assigned:** Monday 24th October 2016
- **Due:** Sunday 6th November 2016, 11:59pm
- **Submission URL:** https://app.schoology.com/assignment/851058098/info

In this project, you will perform a logistic regression on the admissions data we've been working with in project 1.

## Goal
A completed IPython notebook that includes basic modelling using logistic regression.

## Suggestions for getting started
- Review logistic regression, odds ratios and probabilities from lesson material.
- Read the docs for [Statsmodels](http://statsmodels.sourceforge.net/). Most of the time, there is a tutorial that you can follow, but not always, and learning to read documentation is crucial to your success as a data scientist!

In [None]:
# Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
import statsmodels.api as sm
%matplotlib inline

# Optional for bonus question if you prefer these libraries
# import seaborn as sns
# sns.set_style("darkgrid")

# Import data
DATA_DIR = Path('.')
df = pd.read_csv(DATA_DIR / 'admissions.csv').dropna()
df.head()

## Part 1. Frequency Tables

#### Create a frequency table of our variables
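
As a reminder of what `pd.crosstab()` returns, here is a minimal sketch on made-up data. The `toy` frame and its `group` and `passed` columns are hypothetical and are not part of the admissions dataset.

```python
import pandas as pd

# Hypothetical toy data, not the admissions dataset
toy = pd.DataFrame({
    "passed": [1, 0, 1, 1, 0, 0],
    "group": ["a", "a", "b", "b", "b", "a"],
})

# Rows are the values of `group`, columns are the values of `passed`;
# each cell counts the rows that have that combination
freq = pd.crosstab(toy["group"], toy["passed"])
print(freq)
```

The first argument becomes the row index and the second becomes the columns, so swapping them transposes the table.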

In [None]:
# Create a frequency table for prestige vs whether or not someone was admitted (hint: look at pd.crosstab())


## Part 2. Dummy variables

#### 2.1 Create four new dummy variables for prestige
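
One common way to build dummy variables in pandas is `pd.get_dummies()`. The sketch below uses a hypothetical `tier` series rather than the prestige column, so the names will differ from the ones you need.

```python
import pandas as pd

# Hypothetical categorical column with three levels
tiers = pd.Series([1, 3, 2, 1], name="tier")

# One indicator column per level, named "<prefix>_<level>"
dummies = pd.get_dummies(tiers, prefix="tier")
print(dummies.columns.tolist())
print(dummies)
```

Each row has exactly one indicator switched on, which is worth keeping in mind for question 2.2.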

In [None]:
# Create dummy vars here
dummy_ranks = 

#### 2.2 When modelling our prestige categorical variables, how many do we need? Why?
All 4? 3? 2? 1?

Answer: 

## Part 3. Hand calculating odds ratios

Develop your intuition about expected outcomes by hand calculating odds ratios.
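
The arithmetic can be sketched on a hypothetical 2x2 table. The counts below are invented for illustration and are not taken from the admissions data.

```python
# Hypothetical 2x2 counts, not taken from the admissions data
exposed_yes, exposed_no = 30, 20      # outcome yes / no within the exposed group
unexposed_yes, unexposed_no = 25, 75  # outcome yes / no within the unexposed group

# Odds = count with the outcome / count without the outcome
odds_exposed = exposed_yes / exposed_no        # 30 / 20
odds_unexposed = unexposed_yes / unexposed_no  # 25 / 75

# Odds ratio = odds in one group / odds in the other
odds_ratio = odds_exposed / odds_unexposed

# Odds and probability are related by p = odds / (1 + odds)
prob_exposed = odds_exposed / (1 + odds_exposed)

print(odds_exposed, odds_unexposed, odds_ratio, prob_exposed)
```

An odds ratio above 1 means the outcome is more likely in the first group, below 1 means less likely, and exactly 1 means no difference in odds.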

In [None]:
cols_to_keep = ['admit', 'gre', 'gpa']
hand_calc = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_1':])
hand_calc.head()

#### 3.1 Cross-tabulate prestige_1 and admission

In [None]:
# Use pd.crosstab to create a frequency table of prestige_1 vs admission


#### 3.2 Use the cross-tab above to calculate the odds of being admitted to grad school if you attended a #1 ranked college

#### 3.3 Now calculate the odds of admission if you did not attend a #1 ranked college

#### 3.4 Calculate the odds ratio

#### 3.5 Write this finding in a sentence 

Answer: 

#### 3.6 Print the cross-tab of admission vs prestige_4

#### 3.7 Calculate the odds ratio

#### 3.8 Write this finding in a sentence

Answer:

## Part 4. Analysis
First we'll create a clean data frame for the regression analysis.

In [None]:
# We'll set the top tier (#1, aka most prestigious) as our reference category
# and merge prestige_2, prestige_3 and prestige_4 back into the dataset
cols_to_keep = ['admit', 'gre', 'gpa']
data = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2':])
data.head()

We're going to add a constant term for our logistic regression. The statsmodels function we're going to be using requires that intercepts/constants are specified explicitly.

In [None]:
# Manually add the intercept
data['intercept'] = 1.0

#### 4.1 Assign the predictor column names to a variable called train_cols

In [None]:
train_cols = # predictor col names

#### 4.2 Fit a logistic regression model using statsmodels
We want to model admit ~ gre + gpa + prestige_2 + prestige_3 + prestige_4 + intercept using logistic regression.

In [None]:
logit = # log reg model

#### 4.3 Print the logistic regression summary results

#### 4.5 Calculate the odds ratio of the coefficients and their 95% confidence intervals

Hints:
1. `np.exp(X)`
2. Can print all in one object by adding:
  - `conf['odds_ratio'] = params`
  - and renaming cols to something like `conf.columns = ['2.5%', '97.5%', 'odds_ratio']`
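
The hints above can be sketched without a fitted model. The `params` and `conf` objects below are hypothetical stand-ins with made-up numbers; in your notebook they would presumably come from the fitted statsmodels result (its `params` attribute and `conf_int()` method).

```python
import numpy as np
import pandas as pd

# Hypothetical log-odds coefficients and interval bounds, made up for illustration
params = pd.Series({"x1": 0.0, "x2": np.log(2.0)})
conf = pd.DataFrame({0: [-0.5, 0.2], 1: [0.5, 1.2]}, index=params.index)

# Put the point estimates next to the interval bounds
conf["odds_ratio"] = params
conf.columns = ["2.5%", "97.5%", "odds_ratio"]

# Exponentiating converts log-odds into odds ratios
odds = np.exp(conf)
print(odds)
```

A coefficient of 0 maps to an odds ratio of 1 (no effect), and an interval that contains 1 on the odds-ratio scale corresponds to one that contains 0 on the log-odds scale.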

#### 4.6 Interpret the odds ratio of prestige_2

Answer: 

#### 4.7 Interpret the odds ratio of GPA

Answer: 

## Part 5: Predicted probabilities
As a way of evaluating our classifier, we're going to recreate the dataset with every possible combination of input values. By doing this we can see how the predicted probability of admission increases/decreases as different variables change.

First we're going to generate the combinations using a helper function called `cartesian()` (see below).
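
To see what "every possible combination" means, here is a small sketch that uses the standard library's `itertools.product` in place of the notebook's helper. It is shown only to illustrate the idea on tiny inputs.

```python
from itertools import product

import numpy as np

# Every combination of one value from each list, in row-major order
grid = np.array(list(product([1, 2, 3], [4, 5], [6, 7])))

print(grid.shape)  # 3 * 2 * 2 rows, one column per input list
print(grid[:3])
```

The number of rows is the product of the input lengths, so the grid grows quickly as inputs are added.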

We'll also use `np.linspace` to create a range of values for "gre" and "gpa". This creates a sequence of evenly spaced values between a specified minimum and maximum (in our case, the minimum and maximum observed values).
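
A minimal illustration of `np.linspace` on an arbitrary interval (the endpoints here are just examples):

```python
import numpy as np

# Five evenly spaced values from 0 to 1; both endpoints are included
grid = np.linspace(0.0, 1.0, 5)
print(grid)

# The spacing is (max - min) / (num - 1)
step = grid[1] - grid[0]
print(step)
```

Note that the third argument is the number of points, not the step size, which is the main difference from `np.arange`.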

In [None]:
def cartesian(arrays, out=None):
    """
    Generate a cartesian product of input arrays.
    Parameters
    ----------
    arrays : list of array-like 1-D arrays to form the cartesian product of.
    out : ndarray to place the cartesian product in.
    Returns
    -------
    out : ndarray
        2-D array of shape (M, len(arrays)) containing cartesian products
        formed of input arrays.
    Examples
    --------
    >>> cartesian(([1, 2, 3],
                   [4, 5],
                   [6, 7]))
    array([[1, 4, 6],
           [1, 4, 7],
           [1, 5, 6],
           [1, 5, 7],
           [2, 4, 6],
           [2, 4, 7],
           [2, 5, 6],
           [2, 5, 7],
           [3, 4, 6],
           [3, 4, 7],
           [3, 5, 6],
           [3, 5, 7]])
    """

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

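    # Integer division: m is used as a repeat count and a slice bound
    m = n // arrays[0].size
    out[:,0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        # Fill the first block recursively, then copy it for each remaining value
        cartesian(arrays[1:], out=out[0:m,1:])
        for j in range(1, arrays[0].size):
            out[j*m:(j+1)*m,1:] = out[0:m,1:]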
    return out

In [None]:
# Instead of generating all possible values of GRE and GPA, we're going
# to use an evenly spaced range of 10 values from the min to the max 
gres = np.linspace(data['gre'].min(), data['gre'].max(), 10)
print(gres)
print()
# array([ 220.        ,  284.44444444,  348.88888889,  413.33333333, 477.77777778,
#         542.22222222,  606.66666667,  671.11111111,  735.55555556,  800.        ])

gpas = np.linspace(data['gpa'].min(), data['gpa'].max(), 10)
print(gpas)
# array([ 2.26      ,  2.45333333,  2.64666667,  2.84      ,  3.03333333,
#         3.22666667,  3.42      ,  3.61333333,  3.80666667,  4.        ])

# Enumerate all possibilities
combos = pd.DataFrame(cartesian([gres,
                                 gpas,
                                 [1, 2, 3, 4],  # prestige
                                 [1.]           # intercept
                                ]))

#### 5.1 Recreate the dummy variables

In [None]:
# Recreate the dummy variables

# Keep only what we need for making predictions (don't forget to set prestige_1 as a reference category)


#### 5.2 Make predictions for 'admit' on your new enumerated dataset using the trained logistic regression model from earlier

#### 5.3 Interpret your findings for the last 4 observations (i.e. last 4 rows)

Answer: 

## Bonus:

Plot the probability of being admitted into graduate school against GPA, stratified by prestige of undergrad school.

Repeat with GRE instead of GPA.