# Retention rates for US Universities

This project is being built from the ground up to be customizable and reproducible.

After importing the libraries we need, we're going to load up a table of instructions containing which
of the several thousand attributes we're going to select from the IPEDS database for data mining. This is a lot, so it seems more reasonable to externally store the data list we'll be drawing from.

Many of these, such as website addresses or mission statements, aren't going to be terribly useful. Some others we'll need to exclude as it is too closely related to retention rate, and we risk overfitting or circular logic ("hey, here's how to raise your retention rate--have more of them graduate!")

At the root, each entry in the JSON file denotes a separate table. Included alongside the table name are instructions on whether all the table should be imported as default or not, which attributes are continuous, which are discrete, and  which are strings, and whether multiple records exist for each primary key. This is all derived from the associated documentation that comes with IPEDS. 

We're going to load in each table (obviously checking that it doesn't exist first), and extract the correct tables from it. 

(Should you want to change what we're measuring, you can change the predictive variable within the JSON file.)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy
import json
import wget
import sys
import re
import os
from zipfile import ZipFile as zf

Below, we're going to set some variables for how (and where) we process the data.

* *MAXIMUM_NAN* : the ratio of NaNs at which we remove the column entirely.
* *MAXIMUM_COR* : the maximum Pearson correlation that a column can have with another before it's removed.
* *DATA_PATH* : where the data is stored.

In [2]:
MAXIMUM_NAN = 0.80
MAXIMUM_COR = 0.85
DATA_PATH = "data"

### Step 1: Reading in the IPEDS data

Next come the functions for reading in the data from files/ZIP archives, corraling them all to one row per unique key, and making sure they're the right data type.

In [3]:
def convert(tdf, columnlist, totype):

# this changes the datatype in each column. where columns have been unstacked, this is supposed
# to change all instances of each unstacked column to that datatype

    for col in columnlist:
        r = re.compile("(^"+col+"*)|(_"+col+")")
        # column name must either be first or have a _ before it, depending on whether we're
        # putting the table names next to it to make it unique
        for c in filter(r.match, list(tdf)): # filter all the column names through this regex
            tdf[c] = tdf[c].astype(totype)

def import_data(filename):
    with open(filename) as file:
        instructions = json.load(file)
        
# does the data directory exist? if no, create it.
    
    if os.path.isdir(DATA_PATH) is False:
        try:  
            os.mkdir(DATA_PATH)
        except OSError:  
            print ("Could not create data folder.")
            raise
            
# start with an empty dataframe to fill, and get the primary key.
# the primary key should be in the first file in the data list
# also figure out if we are adding the tablename to the attributes to make them unique

    ourdata = pd.DataFrame()
    pk = instructions["primarykey"]
    unique_headers = instructions["uniqueheaders"]

# loop through each table. 

    for table in instructions["tables"]:
        
# first, check if the file in the table has been downloaded. if not, download it from
# the path given and throw an error if something is wrong.

        filename = table["name"]+instructions["format"]
        filelocation = DATA_PATH+"/"+filename
        csvfile = DATA_PATH+"/"+table["name"]+".csv"

        if os.path.exists(filelocation) is False:
            try:
                wget.download(instructions["url"]+filename, filelocation)
            except Exception as e:
                print("Problem downloading and saving", table["name"], ":", e)
                raise

# next, if they are zip files, unzip them

        if instructions["format"] == ".zip":
            if os.path.exists(str(csvfile).lower()) is False:
                print("Unzipping "+table["name"])
                with zf(filelocation,"r") as zip_ref:
                    zip_ref.extractall(DATA_PATH)

# load each CSV file into a temporary data frame, tdf

        tdf = pd.read_csv(str(csvfile).lower(), encoding = "ISO-8859-1")

# filter the data according to any values needed

        if "filter" in table:
            for filterinfo in table["filter"]:
                tdf = tdf.loc[tdf[filterinfo[0]] == filterinfo[1]]

# then, depending on the instructions in the JSON file, include all the headers, or include a selection.

        if table["includeall"]:

# include the whole list, excluding the specific tables (might be an empty list)
# and add the primary key to select

            to_include = list(tdf)
            to_exclude = table["exclude"]
            headers = [x for x in to_include if x not in to_exclude]
            headers.append(pk)

# otherwise, stick these three lists together plus the primary key

        else:
            headers = [pk, *table["strings"], *table["discrete"], *table["continuous"]]

        selected_headers = [x for x in tdf.columns if x in headers]

# columns that begin with an X should be removed, at least for now, because they don't describe
# anything other than how the data was collected
    
        selected_headers = [x for x in selected_headers if x[:1] is not "X"]

# ok, so do we have a primary key? if not, stop right there

        if pk not in tdf.columns:
            raise KeyError("Primary key "+pk+" not found in "+table["name"])

# now we can select the headers we want.

        tdf = tdf[selected_headers]
    
# the code below adds the table name to the headers now, to prevent issues with duplication
# to everythng other than the key

        if unique_headers:
            tdf.rename(columns = lambda x: pk if x == pk else table["name"]+"_"+x, inplace = True)

# next we must check in the JSON instructions if this table contains multiple rows for each
# unique ID. if so, we need to put them all on the same row. to do this, we change them to strings
# then read them into a multiple index, unstack, and then join the column names.

        if "multi" in table:

            if unique_headers:
                multi = [table["name"]+"_"+x for x in table["multi"]]
            else:
                multi = table["multi"]
            tdf[multi] = tdf[multi].astype(str)
            tdf = tdf.set_index([pk, *multi])
            tdf = tdf.unstack(multi)               # need to specify ALL the levels
            tdf.columns = ['_'.join(col) for col in tdf.columns.values]
            tdf = tdf.reset_index(level=pk)

# any '.', '. ', '(X)' data should be NaN

        tdf = tdf.replace(r"\. ?", np.nan, regex=True)

# one important thing to do here is to set the right data types--strings or discrete data.
# most of the data in IPEDS and ACS are continuous, but there are a couple of strings and a 
# handful of discrete data.

        if "defaulttype" in table:
            tdfheds = list(tdf) # get list of headers
            hedstoremove = [pk, *table["strings"], *table["discrete"], *table["continuous"]]
            tdfheds = [x for x in tdfheds if x not in hedstoremove]
            if table["defaulttype"] == "discrete":
                convert(tdf, tdfheds, "category")  
            elif table["defaulttype"] == "string":
                convert(tdf, tdfheds, "str")
            else:
                convert(tdf, tdfheds, "float")
                # only way to store NaNs

        convert(tdf, table["strings"], "str")
        convert(tdf, table["discrete"], "category")
        convert(tdf, table["continuous"], "float")
        
        if pk in tdf:
            tdf[pk] = tdf[pk].astype('str') # it is in the index if the data is unstacked

# if it's the first time around the loop, take the first set of data.

        if ourdata.empty:
            ourdata = tdf

# if not, then we need to join on the primary key

        else:
            ourdata = ourdata.merge(tdf,on=pk,how="left")

        print("Imported "+ table["name"]+ ": "+str(len(ourdata.columns))+" columns total, "
              + str(round(float(ourdata.memory_usage().sum() / 1048576), 2)) + "MB")
    
    if "predictive" in table:
        return ourdata, table["predictive"]
    else
        return ourdata, None

ipeds, predictive = import_data("ipeds-instructions.json")

Imported HD2016: 27 columns total, 0.4MB
Imported IC2016: 132 columns total, 1.67MB
Imported IC2016_AY: 252 columns total, 8.56MB
Imported ADM2016: 290 columns total, 10.74MB
Imported EFFY2016: 380 columns total, 15.91MB
Imported EF2016A: 650 columns total, 31.4MB
Imported EF2016B: 776 columns total, 38.63MB
Imported EF2016C: 906 columns total, 46.09MB
Imported EF2016D: 917 columns total, 46.72MB
Imported EF2016A_DIST: 962 columns total, 49.3MB
Imported SFA1516: 1285 columns total, 67.83MB
Imported SFAV1516: 1303 columns total, 68.87MB
Imported F1516_F1A: 1413 columns total, 75.13MB
Imported F1516_F2: 1553 columns total, 83.11MB
Imported F1516_F3: 1632 columns total, 87.53MB
Imported EAP2016: 3990 columns total, 222.84MB
Imported SAL2016_IS: 4123 columns total, 230.47MB
Imported SAL2016_NIS: 4151 columns total, 232.08MB
Imported S2016_OC: 4833 columns total, 271.21MB
Imported S2016_SIS: 4910 columns total, 275.63MB
Imported S2016_NH: 4941 columns total, 277.41MB
Imported AL2016: 4971 c

We have the two-letter state codes and the county names. We now need to convert them to County Name, State in order to make the join with the American Community Survey data. We'll create a new column and drop the old county name column.

As you can see from the selection below, there's a one-to-many relationship between the counties and the educational institutions. And, of course, many counties won't have institutions in at all...

In [4]:
with open("states_hash.json") as file:
    states = json.load(file)
if "County_Name" not in ipeds:
    ipeds["County_Name"] = ipeds["COUNTYNM"]+", "+ipeds["STABBR"].map(states)
    ipeds = ipeds.drop(columns="COUNTYNM")
print(ipeds["County_Name"][1000:1025])

1000       Dupage County, Illinois
1001         Cook County, Illinois
1002         Cook County, Illinois
1003         Cook County, Illinois
1004         Cook County, Illinois
1005         Cook County, Illinois
1006         Cook County, Illinois
1007         Cook County, Illinois
1008         Cook County, Illinois
1009         Cook County, Illinois
1010         Cook County, Illinois
1011         Cook County, Illinois
1012         Cook County, Illinois
1013         Cook County, Illinois
1014         Cook County, Illinois
1015         Cook County, Illinois
1016         Cook County, Illinois
1017    Vermilion County, Illinois
1018    Vermilion County, Illinois
1019         Cook County, Illinois
1020       Dupage County, Illinois
1021      Mchenry County, Illinois
1022       Dupage County, Illinois
1023         Cook County, Illinois
1024        Coles County, Illinois
Name: County_Name, dtype: object


### Step 2. Preparing the data for analysis

This is, clearly, a horrific amount of data. Let's remove surplus columns before putting them under analysis. First, find out how many NaNs there are for each column and remove those above the threshold. (We should do this before figuring out correlations.)

In [5]:
original_length = len(ipeds.columns)

nanlist = ipeds.isna().sum(axis=0) / len(ipeds)
nans = nanlist.loc[nanlist >= MAXIMUM_NAN]
print (str(round(float(len(nans) / len(ipeds)),4) * 100) + 
       "% of columns removed for missing too much data")

31.96% of columns removed for missing too much data


Then we need to find out which datasets are too similar to one another, and remove them. 
The most basic way to do this is to do a Pearson correlation between all of the attributes, and
then remove one of each pair, or all-but-one of each collection.

As there are around 4,000 attributes, this will take a little while to process.

In [None]:
# get the pearson correlation between each column. 

cmatrix = ipeds.corr(method="pearson").abs()

# change the matrix to a dataframe, and put the indices in the first two columns

c = cmatrix.unstack().to_frame().reset_index()
c.columns = ["A", "B", "Correlation"]

# remove NaNs
c = c.dropna()

# remove pairs (everything correlates with itself)
c = c.loc[c["A"] != c["B"]]

# and remove everything that is less than the maximum correlation.
c = c.loc[c["Correlation"] >= MAXIMUM_COR]

print (c.head())

In [7]:
# we now have a list of pairs of variables that correlate with each other.
# now we need to extract which variables to remove from our main dataframe.

# first, let's sort the two columns into alphabetical order so that the first always lies behind
# the second. we do this by applying a lambda across each row and broadcasting the result back
# into the dataframe. 

c = c.apply(lambda row: [row["B"], row["A"], row["Correlation"]] 
                         if row["A"] > row["B"] else [row["A"], row["B"], row["Correlation"]],
                         axis=1, result_type='broadcast')

# we then drop all the duplicates. this should remove probably half or so of the rows.

c = c.drop_duplicates()

# if we then just take the first column, and dedupe that, we'll get a list of columns we can remove.
# if there are more than 2 columns that correlate with one another, one column will always
# only appear in the 'B' list. this will be the one that goes forward to the next stage.

cols_to_remove = c["A"].drop_duplicates()

ipeds = ipeds.drop(columns = cols_to_remove)

print ("Original number of attributes: "+str(original_length))
print ("Reduced number of attributes: "+str(len(ipeds.columns)))
print (str(round(float(len(ipeds.columns) / original_length),4) * 100) + "% of original.")


Original number of attributes: 4971
Reduced number of attributes: 732
14.729999999999999% of original.


That's better.