# Datamodel and Database module

The TIdatabase module encapsulates all loading, storing and joining of the student, college and applications dataframes.

The module is imported at the beginning of every IPython notebook.

In [1]:
import TIdatabase as ti
%matplotlib inline 
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import statsmodels.api as sm
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
from matplotlib import rcParams

## The Datamodel
This [Google Docs](https://docs.google.com/spreadsheets/d/1dm73Vmov8bhNoVRUtyg6TU-IgE7DPDVlukMkvnaCqAg/edit#gid=0&vpid=A1) spreadsheet lists what we believe are the important factors in the college admission decision. The list does not include items such as recommendation letters, because it is impossible to get data for them or to quantify them. The doc also gives the column name that each feature has in our dataframes. We distinguish 3 dataframes:
- A students dataframe contains all academic and personal data of a particular student (scores, gender, etc)
- A college dataframe contains all information of a university (acceptance rate, public/private, etc)
- An applications dataframe contains application-specific data for a particular student in a particular university, for example and most importantly, the result of the decision procedure
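
As a rough illustration of how the three tables relate, here is a minimal sketch in plain pandas. It does not use `TIdatabase`; the frame names, the student IDs `S1`/`S2` and all values except the two acceptance rates are made up, and only a handful of the documented columns are shown.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables (illustrative values only).
students_df = pd.DataFrame({
    "studentID": ["S1", "S2"],
    "GPA": [3.9, 3.2],
    "female": [1, -1],
})
colleges_df = pd.DataFrame({
    "collegeID": ["Harvard", "Yale"],
    "acceptrate": [0.06, 0.063],
})
# One row per (student, college) pair; acceptStatus is the decision outcome.
applforms_df = pd.DataFrame({
    "studentID": ["S1", "S1", "S2"],
    "collegeID": ["Harvard", "Yale", "Yale"],
    "acceptStatus": [1, 0, 0],
})

# Each application row points at exactly one student and one college,
# so joining on the two ID columns yields one wide row per application.
full = (applforms_df
        .merge(students_df, on="studentID")
        .merge(colleges_df, on="collegeID"))
```

The application table is the only one that references the other two, which is why it carries both `studentID` and `collegeID`.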

## Generating Mock Data

The module can fill the dataframes with mock data, which is useful for starting to write classification code before we finish scraping the actual data.

In [2]:
students = ti.Student()
# populate students with random values
students.fillRandom(10)
students.df 

Unnamed: 0,studentID,classrank,admissionstest,AP,averageAP,SATsubject,GPA,GPA_w,program,intendedgradyear,...,canAfford,female,MinorityGender,MinorityRace,international,firstinfamily,sports,artist,workexp,schooltype
0,FI8X237WXX,0.283403,0.127457,0.753054,0.659981,0.899612,0.636109,0.728772,0.137809,2017,...,1,0,0,0,0,0,0,1,1,0
1,OXCEKC6Q5M,0.927929,0.041205,0.423455,0.963762,0.655037,0.369301,0.636181,0.80623,2019,...,0,1,0,0,1,0,0,0,0,1
2,GWFFIC2WXC,0.272376,0.991073,0.240869,0.580914,0.421302,0.840936,0.544604,0.771702,2015,...,0,0,1,0,1,1,1,0,1,0
3,PB337R4QOU,0.903054,0.615584,0.31279,0.838768,0.341788,0.16186,0.007461,0.036614,2020,...,1,0,0,1,1,0,0,1,0,1
4,0K62IU84V3,0.91692,0.446423,0.448245,0.285839,0.325604,0.788694,0.323615,0.776077,2015,...,0,0,1,0,0,0,1,1,0,1
5,5U92IRA0LB,0.080794,0.871656,0.307633,0.516451,0.415726,0.905907,0.328973,0.997801,2020,...,0,0,0,0,1,1,0,0,1,0
6,27WBA009NZ,0.336914,0.413281,0.515587,0.964437,0.113144,0.952846,0.058152,0.988163,2012,...,0,0,0,0,0,0,0,0,1,1
7,JFJCC7B6B0,0.574745,0.226006,0.463635,0.933696,0.112832,0.887331,0.589565,0.055493,2015,...,0,1,1,0,1,0,0,0,0,1
8,SJE2F0OC5P,0.949242,0.154963,0.509023,0.889407,0.682222,0.426865,0.336591,0.604438,2010,...,0,0,1,0,0,0,0,1,1,0
9,86LCGJZ0JH,0.721326,0.131686,0.59543,0.839838,0.316607,0.577106,0.908437,0.747246,2010,...,0,0,1,0,0,0,1,1,1,1


#### Simulating Missing Data

You can also simulate NaNs in the mock data. `fillRandom` takes an optional second parameter that gives the fraction of values to replace with NaN (for example, `0.25` for 25%).

In [3]:
students.fillRandom(10, 0.25) # 25% of values will be NaN
students.df

Unnamed: 0,studentID,classrank,admissionstest,AP,averageAP,SATsubject,GPA,GPA_w,program,intendedgradyear,...,canAfford,female,MinorityGender,MinorityRace,international,firstinfamily,sports,artist,workexp,schooltype
0,N9UPOT7JGZ,0.144291,0.532656,0.353175,,0.771708,0.551163,,,2019.0,...,,0.0,0.0,1.0,1.0,1.0,,1.0,,1.0
1,VVCW2RRJKM,,0.388682,0.147036,0.177695,0.758574,0.447227,,0.327176,2014.0,...,,0.0,0.0,0.0,1.0,1.0,,1.0,1.0,
2,85ZOXYU3G2,0.607921,0.542437,0.205999,0.622215,,0.349137,,0.935319,2017.0,...,0.0,1.0,0.0,,,0.0,1.0,,1.0,
3,SP0OSW4F1L,0.309242,0.040114,0.560514,,0.393784,0.580495,,0.774766,2012.0,...,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,,
4,JYNHFW8A1P,0.538917,0.564008,0.581378,0.098013,0.370698,0.368872,0.737168,,2010.0,...,0.0,,0.0,0.0,,1.0,0.0,0.0,,
5,2KFXEC002L,,0.990525,,0.460506,,0.074084,0.705907,,2014.0,...,,1.0,0.0,1.0,0.0,1.0,1.0,,1.0,0.0
6,RI0TDJICKH,,0.373696,,0.736861,0.032116,0.177107,,,2014.0,...,,1.0,0.0,1.0,0.0,0.0,1.0,,0.0,1.0
7,GQ6JYIVSQO,,0.901815,0.437347,0.909051,0.775625,,0.770134,0.508655,,...,1.0,,0.0,0.0,,0.0,0.0,,0.0,1.0
8,4M286BGLY5,0.15946,,0.44915,0.093557,0.321188,0.678027,0.818415,0.936367,2013.0,...,1.0,0.0,,,,1.0,1.0,0.0,,
9,ZO2OTW0WE7,0.092386,0.703828,0.331464,0.688822,0.420225,,0.150748,,2015.0,...,0.0,,1.0,,1.0,,0.0,,0.0,1.0
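
How `fillRandom` blanks out values internally is not shown here. The sketch below is one way to get the same effect in plain pandas and NumPy; the helper `add_missing`, its `protect` argument and the demo frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def add_missing(df, frac, protect=("studentID",), seed=0):
    """Return a copy of df with roughly `frac` of the unprotected cells set to NaN."""
    rng = np.random.RandomState(seed)
    out = df.copy()
    cols = [c for c in out.columns if c not in protect]
    # True marks a cell that should be blanked out.
    mask = rng.rand(len(out), len(cols)) < frac
    # where() keeps values where the condition is True and inserts NaN elsewhere.
    out[cols] = out[cols].where(~mask)
    return out

demo = pd.DataFrame({
    "studentID": ["A", "B", "C", "D"],
    "GPA": [0.1, 0.2, 0.3, 0.4],
    "female": [0, 1, 0, 1],
})
noisy = add_missing(demo, 0.25)
```

NumPy-backed integer columns in pandas are upcast to float once they hold a NaN, which is consistent with `intendedgradyear` appearing as `2019.0` rather than `2019` in the output above.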


Since we have a fixed list of only 25 colleges, the college information is not scraped. The list of colleges is populated from a CSV stored in the same directory as this notebook. It can be edited using your favorite CSV editor, such as Excel. When you create a new instance of `College`, the values are read in from the CSV.

In [4]:
# populate with list of known colleges
colleges = ti.College()
colleges.df

Unnamed: 0,collegeID,name,acceptrate,size,public,finAidPct,instatePct
0,Princeton,Princeton,0.074,5142,-1,0.0,0
1,Harvard,Harvard,0.06,19929,-1,0.75,0
2,Yale,Yale,0.063,12336,-1,0.0,0
3,Columbia,Columbia,0.07,24221,-1,0.0,0
4,Stanford,Stanford,0.051,16795,-1,0.0,0
5,UChicago,UChicago,0.088,12558,-1,0.0,0
6,MIT,MIT,0.079,11319,-1,0.0,0
7,Duke,Duke,0.114,15856,-1,0.0,0
8,UPenn,UPenn,0.104,21296,-1,0.0,0
9,CalTech,CalTech,0.088,2209,-1,0.0,0
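
For readers who want to inspect such a file without the module, a minimal sketch with `pandas.read_csv` follows. The in-memory string stands in for the CSV on disk (its real file name is not given here), and the three rows are copied from the table above.

```python
import io
import pandas as pd

# Stand-in for the CSV file on disk; rows copied from the table above.
csv_text = u"""collegeID,name,acceptrate,size,public,finAidPct,instatePct
Princeton,Princeton,0.074,5142,-1,0.0,0
Harvard,Harvard,0.06,19929,-1,0.75,0
Yale,Yale,0.063,12336,-1,0.0,0
"""
colleges_df = pd.read_csv(io.StringIO(csv_text))

# Example query: the college with the lowest acceptance rate in this excerpt.
most_selective = colleges_df.sort_values("acceptrate").iloc[0]["collegeID"]
```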


The table of application forms combines a student with a university and carries the information specific to each application. `acceptStatus` is the outcome we want to predict. `acceptProb` is our $\hat{Y}$, the predicted probability of acceptance.

In [5]:
applForm = ti.ApplForm()
applForm.fillRandom(30)
applForm.df

Unnamed: 0,studentID,collegeID,earlyAppl,visited,alumni,outofstate,acceptStatus,acceptProb
0,RI0TDJICKH,Columbia,0,0,0,1,0,0.879352
1,GQ6JYIVSQO,Dartmouth,0,0,1,1,1,0.880599
2,N9UPOT7JGZ,Cornell,0,1,0,0,1,0.962043
3,VVCW2RRJKM,NotreDame,1,0,1,0,0,0.340978
4,RI0TDJICKH,Harvard,0,0,1,1,1,0.617261
5,SP0OSW4F1L,JohnsHopkins,1,1,1,1,1,0.224123
6,VVCW2RRJKM,UCLA,0,1,0,1,0,0.831938
7,SP0OSW4F1L,UCB,1,1,1,1,1,0.122284
8,85ZOXYU3G2,Northwestern,0,1,1,1,1,0.342427
9,GQ6JYIVSQO,JohnsHopkins,1,0,0,1,0,0.149746


To combine the student and application form tables, we use the Pandas `merge` function. By default it matches rows on the columns that have identical names in both tables, which is `studentID` in this case:

In [27]:
applications = pd.merge(students.df,applForm.df)
applications

Unnamed: 0,studentID,classrank,admissionstest,AP,averageAP,SATsubject,GPA,program,schooltype,intendedgradyear,...,firstinfamily,alumni,sports,artist,workexp,collegeID,earlyAppl,visited,acceptStatus,acceptProb
0,RPGQH572X2,,0.98,,,,4.5,,-1,2020,...,,,,,,Harvard,0,0,0,
1,QNNFGQA7TP,,0.98,,,,4.5,,-1,2019,...,,,,,,Yale,0,0,0,
2,Q1LJY003VB,,0.65,,,,2.2,,1,2019,...,,,,,,Columbia,1,1,1,
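
Because the default join key is "every column the two tables share", it can be safer to name the key explicitly. The sketch below uses two small made-up frames in which `alumni` is a hypothetical second shared column, to show how the two calls differ.

```python
import pandas as pd

students_df = pd.DataFrame({
    "studentID": ["S1", "S2"],
    "GPA": [4.5, 2.2],
    "alumni": [0, 1],          # hypothetical column that also exists in the other table
})
applforms_df = pd.DataFrame({
    "studentID": ["S1", "S2"],
    "collegeID": ["Harvard", "Yale"],
    "alumni": [1, 1],
})

# Default: joins on every shared column (studentID AND alumni).
# S1 has alumni=0 in one table and alumni=1 in the other, so it is dropped.
implicit = pd.merge(students_df, applforms_df)

# Explicit key: joins on studentID only; the clashing column gets suffixes.
explicit = pd.merge(students_df, applforms_df, on="studentID",
                    suffixes=("_student", "_appl"))
```

With `on="studentID"` both students survive the join, and the two `alumni` columns are kept side by side instead of silently acting as an extra filter.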


Now the `applications` Pandas DataFrame is ready to use for either regression (by overwriting the `acceptProb` column) or building the public-facing website.

## Saving Scraped Data

### Part 1 - The Student Data

First, let's start fresh and delete the previously created objects. This is only necessary because this sample script is running within Jupyter, where all variables are global to the notebook. In a separate Python file run from the command line, this step can be skipped. Simply creating a new instance would not work because, behind the scenes, DataFrames are shared between objects.

In [8]:
if ('students' in locals()): 
    students.cleanup()
    del students
if ('applications' in locals()): del applications
if ('applForm' in locals()): del applForm
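
How `TIdatabase` shares its DataFrames is not shown in this notebook. One common way such sharing arises in Python is a class-level attribute; the class `SharedTable` below is purely hypothetical and only illustrates why a fresh instance would still see old rows until an explicit cleanup runs.

```python
import pandas as pd

class SharedTable(object):
    # Class attribute: a single DataFrame shared by every instance (hypothetical design).
    df = pd.DataFrame(columns=["studentID"])

    def insert(self, student_id):
        # self.df resolves to the class attribute, so this mutates the shared frame.
        self.df.loc[len(self.df)] = [student_id]

    @classmethod
    def cleanup(cls):
        # Replace the shared frame with a fresh, empty one.
        cls.df = pd.DataFrame(columns=["studentID"])

first = SharedTable()
first.insert("RPGQH572X2")

second = SharedTable()                 # a "new" instance...
rows_seen_by_second = len(second.df)   # ...still sees the row inserted via `first`

SharedTable.cleanup()
rows_after_cleanup = len(SharedTable().df)
```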


Let's create a new students instance. It will be an empty Pandas DataFrame with the correct columns.

In [9]:
students = ti.Student()
students.df

Unnamed: 0,studentID,classrank,admissionstest,AP,averageAP,SATsubject,GPA,program,schooltype,intendedgradyear,...,female,MinorityGender,MinorityRace,outofstate,international,firstinfamily,alumni,sports,artist,workexp


Populate a dictionary with the values that the scraper has for a given row. Make sure the keys match the column names, as only the matching columns will be saved. There is no need to add the `studentID` key: a unique value is generated automatically and returned from the insert. The returned IDs are in the same order as the provided rows. Saving the generated student IDs will be helpful later when populating the `applForm` foreign key.

In [10]:

# Example: international male who scored in 98th percentile in ACT/SAT, went to a public school and is applying for
# Class of 2020

newrow = {'admissionstest': 0.98,
         'GPA': 4.5,
         'female' : -1,
         'international': 1,
         'schooltype': -1,
         'intendedgradyear':2020}

newsinglestudentID = students.insert(newrow)
print("New studentID: {}".format(newsinglestudentID))
students.df

New studentID: ['RPGQH572X2']


Unnamed: 0,studentID,classrank,admissionstest,AP,averageAP,SATsubject,GPA,program,schooltype,intendedgradyear,...,female,MinorityGender,MinorityRace,outofstate,international,firstinfamily,alumni,sports,artist,workexp
0,RPGQH572X2,,0.98,,,,4.5,,-1,2020,...,-1,,,,1,,,,,
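
The IDs in the output look like ten random uppercase letters and digits. The module's actual generator is not shown, so the following is only a sketch of one way to produce IDs of that shape while avoiding collisions; `new_student_id` and `ALPHABET` are hypothetical names.

```python
import random
import string

ALPHABET = string.ascii_uppercase + string.digits

def new_student_id(existing, length=10, rng=random):
    """Draw random IDs until one is found that is not already in `existing`."""
    while True:
        candidate = "".join(rng.choice(ALPHABET) for _ in range(length))
        if candidate not in existing:
            return candidate

taken = {"RPGQH572X2"}
sid = new_student_id(taken)
```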


It is more efficient to add multiple rows in one step. To do so, create a list of dictionaries and pass it to the same method. Here, two new rows are added to the DataFrame in one step.

In [11]:
rows = []
a = {'schooltype': -1, 'admissionstest': 0.98, 'GPA': 4.5, 'female': 1, 'intendedgradyear': 2019, 'international': 0}
rows.append(a)
a = {'schooltype': 1, 'admissionstest': 0.65, 'GPA': 2.2, 'female': -1, 'intendedgradyear': 2019, 'international': 0}
rows.append(a)
newmanystudentIDs = students.insert(rows)
print("New studentIDs: {}".format(newmanystudentIDs))
students.df

New studentIDs: ['QNNFGQA7TP', 'Q1LJY003VB']


Unnamed: 0,studentID,classrank,admissionstest,AP,averageAP,SATsubject,GPA,program,schooltype,intendedgradyear,...,female,MinorityGender,MinorityRace,outofstate,international,firstinfamily,alumni,sports,artist,workexp
0,RPGQH572X2,,0.98,,,,4.5,,-1,2020,...,-1,,,,1,,,,,
0,QNNFGQA7TP,,0.98,,,,4.5,,-1,2019,...,1,,,,0,,,,,
1,Q1LJY003VB,,0.65,,,,2.2,,1,2019,...,-1,,,,0,,,,,
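
The sketch below shows, in plain pandas, why batching helps and one plausible reason the index labels above read 0, 0, 1. It assumes the insert is built on something like `pd.concat`, which is not confirmed by this notebook; every concatenation copies the existing table, so one call for many rows is cheaper than one call per row.

```python
import pandas as pd

columns = ["studentID", "admissionstest", "GPA", "intendedgradyear"]
table = pd.DataFrame(
    [{"studentID": "RPGQH572X2", "admissionstest": 0.98, "GPA": 4.5, "intendedgradyear": 2020}],
    columns=columns,
)

rows = [
    {"studentID": "QNNFGQA7TP", "admissionstest": 0.98, "GPA": 4.5, "intendedgradyear": 2019},
    {"studentID": "Q1LJY003VB", "admissionstest": 0.65, "GPA": 2.2, "intendedgradyear": 2019},
]
batch = pd.DataFrame(rows, columns=columns)   # build the whole batch once

# Keeping each frame's own labels reproduces the repeated 0 in the index.
keep_labels = pd.concat([table, batch])

# ignore_index=True renumbers the rows 0..n-1 instead.
fresh_labels = pd.concat([table, batch], ignore_index=True)
```

Duplicate index labels are harmless for the merge on `studentID`, but they make label-based lookups such as `df.loc[0]` return more than one row.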


Now we are ready to save. The data is saved in CSV format for ease of interpretability.

In [12]:
students.save("mydata.csv")

Let's delete all the data and check that we can read it back successfully.

In [13]:
if ('students' in locals()): 
    students.cleanup()
    del students
if ('applications' in locals()): del applications
if ('applForm' in locals()): del applForm



In [14]:
students = ti.Student()
students.read("mydata.csv")
students.df

Unnamed: 0,studentID,classrank,admissionstest,AP,averageAP,SATsubject,GPA,program,schooltype,intendedgradyear,...,female,MinorityGender,MinorityRace,outofstate,international,firstinfamily,alumni,sports,artist,workexp
0,RPGQH572X2,,0.98,,,,4.5,,-1,2020,...,-1,,,,1,,,,,
0,QNNFGQA7TP,,0.98,,,,4.5,,-1,2019,...,1,,,,0,,,,,
1,Q1LJY003VB,,0.65,,,,2.2,,1,2019,...,-1,,,,0,,,,,


Et, voilà, the data is back.
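
A header like `Unnamed: 0` typically appears when a DataFrame is written to CSV together with its index and then read back without telling pandas which column holds the index. The sketch below reproduces that with an in-memory buffer; it uses plain pandas calls and makes no claim about what `save` and `read` do internally.

```python
import io
import pandas as pd

df = pd.DataFrame({"studentID": ["RPGQH572X2"], "GPA": [4.5]})

buf = io.StringIO()
df.to_csv(buf)                 # the index is written as a first column with an empty header

buf.seek(0)
naive = pd.read_csv(buf)       # that column comes back as a data column named "Unnamed: 0"

buf.seek(0)
clean = pd.read_csv(buf, index_col=0)   # treat the first column as the index again
```

Passing `index=False` to `to_csv` is the other common way to avoid the extra column, at the cost of not preserving the index labels.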

### Part 2 - The Application Data

This is pretty much the same, **except** for two important differences:

* The studentID and collegeID must both be populated and exist in the respective DataFrames
* The columns are a little different. Normally acceptProb would not be populated from the scraper but could be used to store prediction runs. 
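
Whether `insert` itself rejects unknown IDs is not shown here, so a scraper may want to check the foreign keys before inserting. The following is a minimal sketch with made-up frames; `missing_keys` is a hypothetical helper, not part of `TIdatabase`.

```python
import pandas as pd

students_df = pd.DataFrame({"studentID": ["RPGQH572X2", "QNNFGQA7TP"]})
colleges_df = pd.DataFrame({"collegeID": ["Harvard", "Yale"]})

# Candidate application rows; the second one refers to a student that does not exist.
new_rows = pd.DataFrame({
    "studentID": ["RPGQH572X2", "UNKNOWN000"],
    "collegeID": ["Harvard", "Yale"],
})

def missing_keys(rows, parent, key):
    """Return the values of rows[key] that have no match in parent[key]."""
    orphan = ~rows[key].isin(parent[key])
    return sorted(rows.loc[orphan, key].unique())

bad_students = missing_keys(new_rows, students_df, "studentID")
bad_colleges = missing_keys(new_rows, colleges_df, "collegeID")
```

Rows with a non-empty result can then be dropped or logged before they reach the application table.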

In [15]:
# we already wiped out applForm above
applForm = ti.ApplForm()
applForm.df

Unnamed: 0,studentID,collegeID,earlyAppl,visited,acceptStatus,acceptProb


In [16]:
# either pick one from the students.df DataFrame, like this:
#--- studentID = students.df.iloc[1].studentID
# or use the studentID from the insert in the students DataFrame
studentID = newsinglestudentID[0]
collegeID = colleges.df.iloc[1].collegeID

In [17]:
print("{} {}".format(studentID, collegeID))

RPGQH572X2 Harvard


In [18]:
newrow = {'studentID': studentID,
         'collegeID': collegeID,
         'earlyAppl' : 0,
         'visited': 0,
         'acceptStatus': 0}

applForm.insert(newrow)
applForm.df

Unnamed: 0,studentID,collegeID,earlyAppl,visited,acceptStatus,acceptProb
0,RPGQH572X2,Harvard,0,0,0,


Now let's insert multiple rows:

In [19]:
rows = []
# either pick one by position from the students.df DataFrame:
#--- studentID = students.df.iloc[2].studentID
# or use the list returned by the insert into the students DataFrame
# (iterate over newmanystudentIDs if necessary)
studentID = newmanystudentIDs[0]
collegeID = colleges.df.iloc[2].collegeID
newrow = {'studentID': studentID,
         'collegeID': collegeID,
         'earlyAppl' : 0,
         'visited': 0,
         'acceptStatus': 0}
rows.append(newrow)
# note: a different student (third row of students.df), applying to another school
studentID = students.df.iloc[2].studentID
collegeID = colleges.df.iloc[3].collegeID
newrow = {'studentID': studentID,
         'collegeID': collegeID,
         'earlyAppl' : 1,
         'visited': 1,
         'acceptStatus': 1}
rows.append(newrow)
applForm.insert(rows)
applForm.df

Unnamed: 0,studentID,collegeID,earlyAppl,visited,acceptStatus,acceptProb
0,RPGQH572X2,Harvard,0,0,0,
0,QNNFGQA7TP,Yale,0,0,0,
1,Q1LJY003VB,Columbia,1,1,1,


Let's save it.

In [20]:
applForm.save("applform1.csv")

Then delete the local variable.

In [21]:
if ('applForm' in locals()): del applForm

Then read it back.

In [22]:
applForm = ti.ApplForm()
applForm.read("applform1.csv")
applForm.df


Unnamed: 0,studentID,collegeID,earlyAppl,visited,acceptStatus,acceptProb
0,RPGQH572X2,Harvard,0,0,0,
0,QNNFGQA7TP,Yale,0,0,0,
1,Q1LJY003VB,Columbia,1,1,1,


And now let's check that the merge still works:

In [23]:
applications = pd.merge(students.df,applForm.df)
applications

Unnamed: 0,studentID,classrank,admissionstest,AP,averageAP,SATsubject,GPA,program,schooltype,intendedgradyear,...,firstinfamily,alumni,sports,artist,workexp,collegeID,earlyAppl,visited,acceptStatus,acceptProb
0,RPGQH572X2,,0.98,,,,4.5,,-1,2020,...,,,,,,Harvard,0,0,0,
1,QNNFGQA7TP,,0.98,,,,4.5,,-1,2019,...,,,,,,Yale,0,0,0,
2,Q1LJY003VB,,0.65,,,,2.2,,1,2019,...,,,,,,Columbia,1,1,1,


Ok, we are done for today.