# Check Data - EDA

## Introduction

Dataset documentation: https://collegescorecard.ed.gov/data/documentation/

Seaborn documentation: https://seaborn.pydata.org/

First, load the libraries and define the `read_college_scorecard` and `fix_missing_values` functions.

Load the libraries.

In [4]:
import pandas  as pd
import numpy   as np
import seaborn as sns
(pd.__version__,
 np.__version__,
 sns.__version__
)

In [5]:
def read_college_scorecard(file_path='/dbfs/mnt/datalab-datasets/college-scorecard/MERGED2013_PP.csv',
                           n_rows=None):
  from shutil import copyfile
  import os.path
  # Copy the file to local /tmp once; later calls read the cached copy.
  cache_file_path = '/tmp/MERGED2013_PP.csv'
  if not os.path.exists(cache_file_path):
    copyfile(file_path,
             cache_file_path)
  return pd.read_csv(cache_file_path,
                     nrows=n_rows,
                     na_values=['PrivacySuppressed']
                     )\
            .rename(columns = {'\ufeffUNITID':'UNITID'})
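
# Drop sparse columns and rows, fill the URL columns with 'NA', then fill the
# remaining gaps with column medians computed from the full dataset.
def fix_missing_values(df):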
  median_dict = \
    read_college_scorecard()\
      .dropna(thresh=7024,axis=1)\
      .dropna(thresh= 254,axis=0)\
      .fillna(value={'NPCURL': 'NA', 
                     'INSTURL': 'NA'})\
      .pipe(lambda df: df.loc[:,df.isnull().any()])\
      .apply(lambda x: x.median(skipna=True),axis=0)\
      .to_dict()
  return df.dropna(thresh=7024,axis=1)\
           .dropna(thresh= 254,axis=0)\
           .fillna(value={'NPCURL': 'NA',
                          'INSTURL': 'NA'})\
           .fillna(value=median_dict)

In [6]:
read_college_scorecard()\
  .pipe(fix_missing_values)\
  .info()
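The cleaning steps in `fix_missing_values` can be tried on a small frame. The following is a minimal sketch, not the notebook's actual function: the toy data, the `URL` column, the name `fix_missing_values_toy`, and the thresholds are all made up for illustration, and the medians are computed from the frame passed in rather than from a fresh read of the full file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the scorecard data (all values are made up).
toy = pd.DataFrame({
    'A':   [1.0, 2.0, np.nan, 4.0, np.nan],
    'B':   [np.nan, np.nan, np.nan, 8.0, np.nan],   # mostly missing
    'C':   [10.0, np.nan, 30.0, 50.0, np.nan],
    'URL': ['a.edu', 'b.edu', None, 'd.edu', None],
})

def fix_missing_values_toy(df, col_thresh=3, row_thresh=1):
    # Hypothetical helper: keep columns with at least col_thresh non-missing
    # values, then rows with at least row_thresh non-missing values.
    kept = (df.dropna(thresh=col_thresh, axis=1)
              .dropna(thresh=row_thresh, axis=0)
              .fillna(value={'URL': 'NA'}))   # placeholder for the text column
    # Medians only for the columns that still contain missing values.
    medians = kept.loc[:, kept.isnull().any()].median().to_dict()
    return kept.fillna(value=medians)

cleaned = fix_missing_values_toy(toy)
print(cleaned)
```

Here `thresh` counts non-missing values, so column `B` (one value) and the last row (no values) are dropped before any filling happens.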

## Read Data Dictionary

Find the data dictionary as a CSV file.

In [9]:
%sh ls /dbfs/mnt/datalab-datasets/college-scorecard

Check its contents, in particular, the variable names.

In [11]:
%sh head /dbfs/mnt/datalab-datasets/college-scorecard/CollegeScorecardDataDictionary-09-12-2015.csv

Create the function `data_dictionary` to read the data dictionary into a DataFrame, and rename the variables. Keep only some of the columns.

In [13]:
def data_dictionary():
  data_dict = pd.read_csv('/dbfs/mnt/datalab-datasets/college-scorecard/CollegeScorecardDataDictionary-09-12-2015.csv',
                          usecols=['NAME OF DATA ELEMENT','dev-category','VARIABLE NAME','API data type','SCORECARD? Y/N'],
                          dtype='object')
  data_dict.columns = [x.replace('-','_').replace(' ','_').replace('?_y/n','') 
                       for x in data_dict.columns.str.lower()
                      ]
  return data_dict

Notice that the dtypes are all `object`, as specified above.

In [15]:
data_dictionary().dtypes
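The renaming and the `dtype='object'` setting in `data_dictionary` can be checked without the file. This sketch applies the same string replacements to the five column names passed to `usecols`; the small CSV text and its `CODE` column are made up to show that `dtype='object'` keeps values as strings.

```python
import io
import pandas as pd

# The five column names requested via usecols in data_dictionary().
raw_columns = pd.Index(['NAME OF DATA ELEMENT', 'dev-category', 'VARIABLE NAME',
                        'API data type', 'SCORECARD? Y/N'])

# Same chain of replacements as in data_dictionary().
clean_columns = [name.replace('-', '_').replace(' ', '_').replace('?_y/n', '')
                 for name in raw_columns.str.lower()]
print(clean_columns)

# Made-up CSV: with dtype='object' the leading zeros in CODE are preserved.
sample = pd.read_csv(io.StringIO("VARIABLE NAME,CODE\nUNITID,00123\n"),
                     dtype='object')
print(sample.loc[0, 'CODE'])
```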

The dataset originally had `1729` columns. The DataFrame returned by `data_dictionary` has entries for `1720` of them, so most columns are documented.

In [17]:
data_dictionary().info()

I expect the values of `dev_category` to help organize the dataset descriptions and summaries.

In [19]:
data_dictionary().dev_category.unique()

The `api_data_type` might also be useful.

In [21]:
data_dictionary().api_data_type.unique()

## Check `root` Variables

Store the `root` variables from the data dictionary that are also present in the cleaned dataset in the set `root_var_list`.

In [24]:
root_var_list = \
  data_dictionary()\
    .query('dev_category == "root"')\
    .loc[:,'variable_name']\
    .pipe(set)\
    .intersection(read_college_scorecard().pipe(fix_missing_values).columns)
root_var_list

These names are used below to select columns.
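The selection pattern above (filter the dictionary by category, turn the names into a set, intersect with the dataset's columns) can be sketched on toy data. The two small frames and their category assignments below are made up for illustration. The set is sorted into a list before it is passed to `.loc`, since recent pandas versions no longer accept a set as an indexer.

```python
import pandas as pd

# Made-up stand-ins for the data dictionary and the cleaned dataset.
toy_dictionary = pd.DataFrame({
    'variable_name': ['UNITID', 'OPEID', 'INSTNM', 'CITY'],
    'dev_category':  ['root', 'root', 'school', 'school'],
})
toy_data = pd.DataFrame({'UNITID': [1, 2],
                         'INSTNM': ['x', 'y'],
                         'CITY':   ['p', 'q']})

# Names in the "root" category that survive in the dataset's columns.
root_vars = (toy_dictionary
             .query('dev_category == "root"')
             .loc[:, 'variable_name']
             .pipe(set)
             .intersection(toy_data.columns))

# Sort into a list so the column order is deterministic and .loc accepts it.
root_columns = sorted(root_vars)
subset = toy_data.loc[:, root_columns]
print(subset)
```

`OPEID` is in the toy dictionary but not in the toy dataset, so the intersection leaves only `UNITID`.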

Display the `root` columns remaining in the dataset after rows and columns were removed because of missing values.

In [26]:
read_college_scorecard()\
  .pipe(fix_missing_values)\
  .loc[:,sorted(root_var_list)]\
  .head()

These ID variables will only be useful if we match or merge this DataFrame with another.
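As an illustration of that use, the sketch below merges on `UNITID`. Both frames are made up: `other` is a hypothetical second source, and its `enrollment` column is invented for the example.

```python
import pandas as pd

# Made-up frames: a slice of scorecard-style IDs and a hypothetical second source.
scorecard_ids = pd.DataFrame({'UNITID': [1, 2, 3],
                              'INSTNM': ['A', 'B', 'C']})
other = pd.DataFrame({'UNITID': [2, 3, 4],
                      'enrollment': [500, 1200, 300]})

# Left merge keeps every scorecard row; indicator=True adds a _merge column
# showing whether each row found a match in the other frame.
merged = scorecard_ids.merge(other, on='UNITID', how='left', indicator=True)
print(merged)
```

The `_merge` column makes unmatched institutions easy to spot before relying on the joined values.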

## Check `school` Variables

In [29]:
school_var_list = \
  data_dictionary()\
    .query('dev_category == "school"')\
    .loc[:,'variable_name']\
    .pipe(set)\
    .intersection(read_college_scorecard().pipe(fix_missing_values).columns)
school_var_list

Check the `dtype` for these variables.

In [31]:
read_college_scorecard()\
  .pipe(fix_missing_values)\
  .loc[:,sorted(school_var_list)]\
  .dtypes

__The End__