##### The purpose of this notebook is to output a CSV file with acceptance data for all schools.

In [72]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import presetup as p
import seaborn as sns
import re
from collections import Counter
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
%matplotlib inline

In [116]:
df = pd.read_csv('../data/raw/raw_data.csv', low_memory=False)

Let's create an instance of the PreSetup class from our custom module 'presetup'; we'll use it to fetch a list of interpretable column names.

In [118]:
ps = p.PreSetup()
col_dict = ps.parseCols('../data/reference/column_names.txt')
df.columns = ps.updateCols(df.columns.values)

In [120]:
df['Who are you?'].value_counts()

Admit Creating College / Grad School Profile    18465
Student Applying To College / Grad School         102
Other Professional                                 32
School Faculty / Professional                      12
Parent                                              7
Name: Who are you?, dtype: int64

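We're only interested in profiles created by admits, i.e. people who have already gone through the college admissions process. That means keeping only the 'Admit Creating College / Grad School Profile' rows and dropping current applicants, school faculty, parents, and other professionals.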

In [121]:
df = df[df['Who are you?']=='Admit Creating College / Grad School Profile'].copy()
# Renumber rows 0..n-1 and discard the old index in one step
df.reset_index(drop=True, inplace=True)
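
The contiguous 0..n-1 index matters later on, because df_schools is built with a fresh integer index and joined back to df on that index. Here is a minimal sketch of the filter-and-renumber step on toy data (the rows and `id` values below are made up for illustration), using `reset_index(drop=True)` to renumber the rows and discard the old index in one call:

```python
import pandas as pd

# Toy stand-in for the survey data; the real frame has many more columns.
toy = pd.DataFrame({
    'Who are you?': ['Admit Creating College / Grad School Profile',
                     'Parent',
                     'Admit Creating College / Grad School Profile'],
    'id': [10, 11, 12],
})

# Keep only admits, then renumber the surviving rows from 0.
is_admit = toy['Who are you?'] == 'Admit Creating College / Grad School Profile'
admits = toy[is_admit].copy()
admits.reset_index(drop=True, inplace=True)

print(admits.index.tolist())   # row labels after renumbering
print(admits['id'].tolist())   # ids of the rows that were kept
```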

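Some more housekeeping: some cells hold base64-encoded empty values instead of real data (for example, 'YTowOnt9' decodes to 'a:0:{}', an empty serialized array). Let's replace these with NaN.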

In [122]:
vals = ['YTowOnt9', 'ytowont9', 'czowOiIiOw==']
# replace() accepts a list of values, so no loop is needed
df.replace(to_replace=vals, value=np.nan, inplace=True)
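
To see why these strings count as nulls, it helps to decode them. The sketch below uses only the standard library. It assumes that 'ytowont9' is simply a lowercased copy of 'YTowOnt9', so it is not decoded here (base64 is case-sensitive, and the lowercased form does not decode to anything meaningful).

```python
import base64

# The two case-sensitive sentinels from the list above.
sentinels = ['YTowOnt9', 'czowOiIiOw==']

# Decode each one to the text it encodes.
decoded = [base64.b64decode(s).decode('ascii') for s in sentinels]

for raw, text in zip(sentinels, decoded):
    print('%s -> %s' % (raw, text))
```

The results, 'a:0:{}' and 's:0:"";', look like an empty array and an empty string in PHP's serialization format, which is consistent with treating them as missing values.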

### Extracting Acceptance Data for All Schools

#### Step 1
Create an instance of the Schools class.

In [123]:
sc = p.Schools()

#### Step 2
Get list of all schools.

In [124]:
all_schools = sc.getSchools('../data/reference/table_references.csv')

In [12]:
# Total entries vs. distinct entries (this form works on Python 2 and 3)
print('%d %d' % (len(all_schools), len(set(all_schools))))

1355 1353


There are two duplicates (1355 entries, 1353 distinct); let's remove them.

In [125]:
all_schools = list(set(all_schools))
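
One caveat: `list(set(...))` does not guarantee any particular ordering, so the column order of df_schools may not match the order in the reference file. If a stable order is wanted, an order-preserving de-duplication is a small change. This is a sketch, and `dedupe_keep_order` is a hypothetical helper, not part of the presetup module:

```python
def dedupe_keep_order(items):
    """Return items with duplicates removed, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


# Toy list with one repeated entry.
schools = ['Yale University (New Haven, CT)',
           'Brown University (Providence, RI)',
           'Yale University (New Haven, CT)']
unique_schools = dedupe_keep_order(schools)
print(unique_schools)
```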

#### Step 3
Create a new DataFrame, df_schools, where the columns are schools and the rows are students.

In [126]:
df_schools = pd.DataFrame(index=range(len(df)), columns=all_schools)

#### Step 4
Update acceptance status for each school in df_schools.

In [127]:
sc.extractFromApplied(df['Undergraduate Schools Applied'], df_schools)

Due to some formatting hiccups in the data, a handful of non-school columns were generated. Let's get rid of these.

In [128]:
df_schools = df_schools[all_schools]

#### Step 5
Create a separate DataFrame df_topschools (where the cols are just the 'top schools'). This step is currently disabled; the code is left commented out for reference.

In [23]:
# top_schools = ['Harvard University (Cambridge, MA)', 'Yale University (New Haven, CT)', 
#                'Cornell University (Ithaca, NY)', 'Columbia University (New York, NY)',
#                'University of Pennsylvania (Philadelphia, PA)', 'Princeton University (Princeton, NJ)',
#                'Brown University (Providence, RI)', 'Dartmouth College (Hanover, NH)',
#                'Massachusetts Institute of Technology (Cambridge, MA)','Stanford University (Stanford, CA)']
# df_topschools = df_schools[top_schools].copy()

#### Step 6
Convert each school column to binary (1 for accepted, 0 for not accepted).

In [129]:
for school in all_schools:
    # Only string cells carry a raw status; NaN cells are left untouched
    df_schools[school] = df_schools[school].apply(lambda x: sc.cleanFromApplied(x) if isinstance(x, str) else x)

In [25]:
# df_topschools['any_top_school'] = (df_topschools.sum(axis=1)).apply(lambda x: 1 if x>0 else np.nan)

Let's keep just the schools with more than 60 acceptance/denial records, then fill the remaining missing values with 0.

In [130]:
mask_school = df_schools.notnull().sum(axis=0)
df_schools2 = df_schools[mask_school[mask_school>60].index].copy()
df_schools2 = df_schools2.fillna(value=0)
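
The same count-then-filter pattern on a toy frame makes the mechanics easier to follow. The school names, values, and the threshold of 2 below are invented for illustration (the notebook uses 60):

```python
import numpy as np
import pandas as pd

# Rows are students, columns are schools; NaN means "no record for this student".
toy = pd.DataFrame({
    'School A': [1, 0, np.nan, 1],
    'School B': [np.nan, np.nan, 1, np.nan],
    'School C': [0, 0, 1, np.nan],
})
min_records = 2  # illustrative threshold only

# Count non-null records per school, then keep schools strictly above the threshold.
counts = toy.notnull().sum(axis=0)
kept = toy[counts[counts > min_records].index].copy()
kept = kept.fillna(value=0)

print(counts.to_dict())
print(kept.columns.tolist())
```

Note that after `fillna(0)` a student with no record for a school is indistinguishable from a student who was not accepted there, which is worth keeping in mind when interpreting the zeros downstream.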

In [131]:
subset_schools = df_schools2.columns

In [63]:
# df_schools2.to_csv('../data/all_schools.csv')

#### Step 7
Join back with the main df and make any necessary adjustments.

In [113]:
# df = df[df.columns[:-232]].copy()

In [134]:
# Join back with main df
df = df.join(df_schools2)

In [68]:
# Count non-null entries in each source column
print(df['Undergraduate Schools Applied'].notnull().sum())
print(df['Undergraduate Schools Attended'].notnull().sum())

6614
17639


There's another problem here: we just extracted info from the 'Undergraduate Schools Applied' column, but it has only 6,614 non-null values, compared with 17,639 for 'Undergraduate Schools Attended'. Relying on Applied alone means we're missing out on a lot of potentially important data. Let's cover our bases by incorporating the data in the Attended column.

In [136]:
for s in subset_schools:
    df[s+'_v2'] = df['Undergraduate Schools Attended'].apply(lambda x: sc.extractAllFromAttended(x, s))

Now, for each school, let's combine the data from the two columns into a final column: a student counts as accepted if either source says so.

In [148]:
for s in subset_schools:
    df[s+'_final'] = ((df[s] + df[s+'_v2'])>0).astype(int)
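
The combination above is effectively a logical OR of two indicator columns. Here is a small sketch on toy data. The column names are placeholders, and the NaN in the last row is an assumption, included only to show how a missing value would behave if one of the inputs ever contained one:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'applied':  [1, 0, 0, 0],        # indicator derived from the Applied column
    'attended': [0, 1, 0, np.nan],   # indicator derived from the Attended column
})

# Sum the two indicators; any positive total means "accepted by either source".
# NaN + 0 is NaN, and NaN > 0 evaluates to False, so missing values fall back to 0.
toy['final'] = ((toy['applied'] + toy['attended']) > 0).astype(int)

print(toy['final'].tolist())
```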

In [153]:
output_cols = [s+'_final' for s in subset_schools]
output_cols.append('id')
df_output = df[output_cols].copy()

In [156]:
df_output.to_csv('../data/all_schools.csv')