In [2]:
import pandas as pd

## Converting csv into pandas dataframe

Reading the **MERGEDXXXX_YY_PP.csv** files into dataframes with `read_csv()` raises a `DtypeWarning` about mixed types in some columns. I'm passing `low_memory=False` to get past it for now, but should consider using the `converters` or `dtype` arguments instead.

With `converters`:    
```Python
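def convert_dtype(x):
    # Return empty cells as '' and everything else as a string
    if not x:
        return ''
    try:
        return str(x)
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        return ''

pd.read_csv('file.csv', converters={'first_column': convert_dtype, 'second_column': convert_dtype})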
```

With `dtype`:
```Python
pd.read_csv('file.csv', dtype={'first_column': 'string', 'second_column': 'string'})
```

The problem with either approach here is that the dataset is wide (2000+ columns) and many columns seem to have mixed types, so listing them one by one is impractical. I will research how to solve this without ignoring memory concerns.
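One possible way around listing 2000+ columns by hand is to read every column as a string with `dtype=str` and then convert only the columns that parse cleanly as numbers. The sketch below is a minimal illustration, not a tested solution for this dataset: the inline CSV, its column names, and the `PrivacySuppressed` placeholder value are made-up stand-ins, and `to_numeric_if_clean` is a hypothetical helper.

```python
import io

import pandas as pd

# Hypothetical stand-in for a wide CSV with one mixed-type column
csv_text = (
    "school_id,name,median_earnings\n"
    "1,Alpha College,35000\n"
    "2,Beta University,PrivacySuppressed\n"
    "3,Gamma Institute,41000\n"
)

# dtype=str reads every column as text, so pandas never has to infer a type
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)


def to_numeric_if_clean(col):
    """Return col as numbers if every non-null value parses, else unchanged."""
    converted = pd.to_numeric(col, errors="coerce")
    # Coercion turns unparseable values into NaN; keep the result only if none were lost
    if converted.notna().sum() == col.notna().sum():
        return converted
    return col


typed = pd.DataFrame({name: to_numeric_if_clean(raw[name]) for name in raw.columns})
print(typed.dtypes)
```

Reading everything as text generally costs more memory than numeric dtypes, so this trades memory for not having to enumerate columns up front.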

The advantage we have when using `low_memory=False` is that the dataset is not large (6000+ rows), so the conversion from csv to dataframe is quick (~80ms on my machine).

Reference: <a href="https://www.roelpeters.be/solved-dtypewarning-columns-have-mixed-types-specify-dtype-option-on-import-or-set-low-memory-in-pandas/">dtype warning: columns have mixed types</a>

In [3]:
df18 = pd.read_csv("data/MERGED2018_19_PP.csv",low_memory=False)
dfc18 = df18.copy() # create a copy of original df

## Analysis

First, we want to drop all fully empty columns.

In [4]:
dfc18 = dfc18.dropna(axis='columns',how='all')
print(f'original \'18 shape: {df18.shape} \n      after dropna: {dfc18.shape}')

original '18 shape: (6806, 2044) 
      after dropna: (6806, 711)


We removed 1333 fully empty columns. Note: *We should probably inspect which columns they are, in case the null values provide some sort of insight.*
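To follow up on that note, the fully empty columns can be listed by name before they are discarded. A minimal sketch on a toy frame, with made-up column names standing in for the real ones:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df18; the column names are made up for illustration
toy = pd.DataFrame(
    {
        "school_id": [1, 2, 3],
        "always_empty": [np.nan, np.nan, np.nan],
        "partly_empty": [0.5, np.nan, 0.7],
    }
)

# Columns where every value is null: the ones dropna(how='all') removes
all_null_cols = toy.columns[toy.isna().all()].tolist()
print(all_null_cols)

# Same call as in the notebook; partly empty columns survive
trimmed = toy.dropna(axis="columns", how="all")
```

Applied to `df18`, the same expression would give the names of the dropped columns.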

Next, let's filter down to the schools we are focused on (4-year institutions):

In [5]:
is_4year = dfc18['HIGHDEG'] >= 3
dfc18_4yr = dfc18[is_4year]
dfc18_4yr.shape

(2725, 711)
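The cutoff of 3 assumes the `HIGHDEG` coding that I believe the College Scorecard data dictionary uses (0 = non-degree-granting, 1 = certificate, 2 = associate, 3 = bachelor's, 4 = graduate); this is worth checking against the dictionary itself. A small sketch with made-up rows showing how the filter behaves under that assumption:

```python
import pandas as pd

# Assumed HIGHDEG coding; verify against the data dictionary before relying on it
HIGHDEG_LABELS = {
    0: "non-degree-granting",
    1: "certificate",
    2: "associate",
    3: "bachelor's",
    4: "graduate",
}

# Made-up rows standing in for dfc18
toy = pd.DataFrame({"school_id": [1, 2, 3, 4, 5], "HIGHDEG": [0, 2, 3, 4, 4]})

# Count schools per degree level, labelled for readability
counts = toy["HIGHDEG"].map(HIGHDEG_LABELS).value_counts()
print(counts)

# Same filter as in the notebook: highest degree is bachelor's or above
toy_4yr = toy[toy["HIGHDEG"] >= 3]
```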

We've got 2725 4-year schools. Let's see how many columns have null values, and go from there.

In [6]:
# Note: the null mask is computed on dfc18 (all schools), not on the 4-year subset
len(dfc18_4yr.columns[dfc18.isnull().any()].tolist())

693
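Beyond a single count, it may help to see how much is missing in each column of the subset being analysed. A minimal sketch on a toy frame (made-up column names), with the null mask computed on the subset itself:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 4-year subset; column names are made up
subset = pd.DataFrame(
    {
        "school_id": [1, 2, 3, 4],
        "tuition": [1000.0, np.nan, 3000.0, np.nan],
        "enrollment": [500.0, 600.0, np.nan, 800.0],
    }
)

# Fraction of missing values per column, largest first
null_fraction = subset.isna().mean().sort_values(ascending=False)

# Number of columns with at least one null
n_cols_with_nulls = int(subset.isna().any().sum())

print(null_fraction)
print(n_cols_with_nulls)
```

Sorting by missing fraction makes it easier to decide which columns are too sparse to keep.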

In [11]:
rec_co = pd.read_csv("data/Most-Recent-Cohorts-All-Data-Elements.csv", low_memory=False)

In [9]:
rec_co.shape

(6806, 2045)