# `pandas` - Read DataFrames

__Contents__: This notebook describes the process of reading in a dataframe from a CSV file. This entails:
1. Finding the CSV file and any companion files
1. Visually checking their contents, especially the CSV file
1. Reading the CSV file into a DataFrame
1. Checking the datatypes of the DataFrame columns
1. Possibly rereading the DataFrame with different parameters, then rechecking the datatypes

Load libraries.

In [4]:
import pandas  as pd
import numpy   as np
(pd.__version__,
 np.__version__
)

## Find Files

In [6]:
%sh ls -oh /dbfs/mnt/datalab-datasets/college-scorecard/

### Check File Content

In [8]:
%sh head /dbfs/mnt/datalab-datasets/college-scorecard/MERGED2013_PP.csv

## Read CSV File

- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

Create a function to read the file.

In [12]:
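def read_college_scorecard(file_path='/dbfs/mnt/datalab-datasets/college-scorecard/MERGED2013_PP.csv',
                           n_rows=None):
  from shutil import copyfile
  import os.path
  # Copy the file to /tmp once so that later reads can use the local copy.
  cache_file_path = os.path.join('/tmp', os.path.basename(file_path))
  if not os.path.exists(cache_file_path):
    copyfile(file_path,
             cache_file_path)
  # Read from the cached copy, not from the original location.
  return pd.read_csv(cache_file_path,nrows=n_rows)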

Check some of the rows and columns using the function.

In [14]:
read_college_scorecard(n_rows=5)

## Check Datatypes

Look for numeric columns that were read in as `object` (character) types.

In [17]:
read_college_scorecard()\
  .select_dtypes(include=['object'])\
  .columns

In [18]:
read_college_scorecard()\
  .select_dtypes(include=['object'])\
  .head()

There are too many variables to check visually. 

Create a list of acceptable `object` variables to drop before checking the remaining `object` variables as above.

In [20]:
object_var_list = ['INSTNM','CITY','STABBR','AccredAgency','INSTURL','NPCURL']

Check the object variables after dropping those from `object_var_list`.

In [22]:
read_college_scorecard()\
  .select_dtypes(include=['object'])\
  .drop(object_var_list,axis=1)\
  .head()

There seem to be two problems:
1. The `ZIP` variable takes two formats (ZIP5 and ZIP9)
1. Several columns that should be numeric are read as `object` because they contain the string `PrivacySuppressed`.

The first problem will be dealt with in the next notebook. 

The second problem will be dealt with by coding `PrivacySuppressed` as a missing value.
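To see why a single sentinel string changes a column's type, here is a minimal sketch on a small in-memory CSV. The data and the `EARN` column name are made up for illustration; only the `na_values` parameter of `read_csv` is taken from the approach described above.

```python
import io
import pandas as pd

# Made-up three-row CSV; 'EARN' is a hypothetical column name.
csv_text = "UNITID,EARN\n1,41000\n2,PrivacySuppressed\n3,38500\n"

# Without na_values, the sentinel string keeps the column from being numeric.
raw = pd.read_csv(io.StringIO(csv_text))
raw_is_numeric = pd.api.types.is_numeric_dtype(raw['EARN'])

# With na_values, the sentinel becomes a missing value and the column is float64.
clean = pd.read_csv(io.StringIO(csv_text), na_values=['PrivacySuppressed'])
clean_is_numeric = pd.api.types.is_numeric_dtype(clean['EARN'])
```

The cleaned column is stored as `float64` rather than `int64` because `NaN` is a floating-point value.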

In [24]:
def read_college_scorecard(file_path='/dbfs/mnt/datalab-datasets/college-scorecard/MERGED2013_PP.csv',
                           n_rows=None):
  from shutil  import copyfile
  from pathlib import Path
  # Copy the file to /tmp once so that later reads can use the local copy.
  cache_file_obj = Path('/tmp') / Path(file_path).name
  if not cache_file_obj.exists():
    copyfile(file_path,
             str(cache_file_obj))
  # Read from the cached copy and code 'PrivacySuppressed' as missing.
  return pd.read_csv(str(cache_file_obj),
                     nrows=n_rows,
                     na_values=['PrivacySuppressed']
                     )

Check the `object` variables after excluding those in `object_var_list`.

In [26]:
read_college_scorecard()\
  .select_dtypes(include=['object'])\
  .drop(object_var_list,axis=1)\
  .head()

The only remaining problem seems to be the `ZIP` variable, which will be dealt with in the next notebook.
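Because `head()` shows only five rows, a programmatic check can complement the visual one. The sketch below flags `object` columns in which most non-missing values parse as numbers. The helper name `mostly_numeric_object_columns`, the 0.5 threshold, and the toy data are all assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

def mostly_numeric_object_columns(df, threshold=0.5):
    """Hypothetical helper: names of object columns in which more than
    `threshold` of the non-missing values can be parsed as numbers."""
    suspects = []
    for name in df.select_dtypes(include=['object']).columns:
        values = df[name].dropna()
        if values.empty:
            continue
        # Values that cannot be parsed become NaN instead of raising.
        parsed = pd.to_numeric(values, errors='coerce')
        if parsed.notna().mean() > threshold:
            suspects.append(name)
    return suspects

# Made-up data with the same kinds of columns as the discussion above.
toy = pd.DataFrame({
    'INSTNM': ['College A', 'College B', 'College C'],
    'ZIP':    ['35762', '35294-0110', '36117'],
    'EARN':   ['41000', 'PrivacySuppressed', '38500'],
}, dtype=object)

suspects = mostly_numeric_object_columns(toy)
```

On the toy data both `ZIP` and `EARN` are flagged. A flag is only a prompt to look closer: `EARN` should become numeric, whereas `ZIP` is an identifier that is better kept as text.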

Check the `object` variables in the first three rows.

In [28]:
read_college_scorecard().select_dtypes(include=['object']).head(3)

That looks appropriate for `object` variables.

The `read_csv` function assigns a numeric type to a column only if every value in it fits that type, so the visual check isn't essential.

In this case there are too many numeric variables for the two checks below to be definitive, but it's still a good idea to visually check the data.

In [31]:
read_college_scorecard().select_dtypes(include=['int64']).head()

In [32]:
read_college_scorecard().select_dtypes(include=['float64']).head()

This looks good. 
- The variables/columns are typed correctly. 
- The `PrivacySuppressed` in the CSV has been coded as NA, specifically as numpy `NaN`.

The following function reads in the data set correctly and will be used in the following notebooks.

In [34]:
def read_college_scorecard(file_path='/dbfs/mnt/datalab-datasets/college-scorecard/MERGED2013_PP.csv',
                           n_rows=None):
  from shutil  import copyfile
  from pathlib import Path
  # Copy the file to /tmp once so that later reads can use the local copy.
  cache_file_obj = Path('/tmp') / Path(file_path).name
  if not cache_file_obj.exists():
    copyfile(file_path,
             str(cache_file_obj))
  # Read from the cached copy and code 'PrivacySuppressed' as missing.
  return pd.read_csv(str(cache_file_obj),
                     nrows=n_rows,
                     na_values=['PrivacySuppressed']
                     )

In [35]:
read_college_scorecard().columns[0]

In [36]:
def read_college_scorecard(file_path='/dbfs/mnt/datalab-datasets/college-scorecard/MERGED2013_PP.csv',
                           n_rows=None):
  from shutil  import copyfile
  from pathlib import Path
  # Copy the file to /tmp once so that later reads can use the local copy.
  cache_file_obj = Path('/tmp') / Path(file_path).name
  if not cache_file_obj.exists():
    copyfile(file_path,
             str(cache_file_obj))
  # Read from the cached copy, code 'PrivacySuppressed' as missing, and
  # strip the byte order mark (BOM) from the first column name if present.
  return pd.read_csv(str(cache_file_obj),
                     nrows=n_rows,
                     na_values=['PrivacySuppressed']
                    )\
            .rename(columns = {'\ufeffUNITID':'UNITID'})

In [37]:
read_college_scorecard().columns[0]
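The `\ufeff` prefix handled above is a UTF-8 byte order mark (BOM) at the start of the file. As an alternative to renaming the column, the file can be decoded with Python's `utf-8-sig` codec, which drops a leading BOM. The sketch below uses made-up in-memory bytes rather than the real file, so treat it as an illustration of the idea rather than a tested replacement for the function above.

```python
import io
import pandas as pd

# Made-up file contents that start with a UTF-8 BOM, as the real CSV appears to.
raw_bytes = '\ufeffUNITID,INSTNM\n100654,College A\n'.encode('utf-8')

# The 'utf-8-sig' codec removes the BOM while decoding.
df = pd.read_csv(io.BytesIO(raw_bytes), encoding='utf-8-sig')
first_column = df.columns[0]
```

With this encoding the first column name comes back without the prefix, so the `rename` step would have nothing left to do.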

### DataFrame Summaries
The remainder of the notebook consists of a few methods that provide basic summaries of DataFrames.

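The `info` method lists the name, number of non-null values, and dtype of each column when the DataFrame has few columns.

For wider DataFrames (more columns than the `display.max_info_columns` option, 100 by default), only a summary is printed and the individual columns are not listed.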

In [40]:
read_college_scorecard().info()
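The switch between the per-column listing and the short summary can be controlled through the `verbose` and `max_cols` parameters of `info`. The following sketch uses a made-up three-column DataFrame and captures the output in a buffer so the two forms can be compared.

```python
import io
import pandas as pd

# Made-up DataFrame standing in for the much wider scorecard data.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# With more columns than max_cols, info() prints only the short summary.
summary_buf = io.StringIO()
toy.info(buf=summary_buf, max_cols=2)
summary_text = summary_buf.getvalue()

# verbose=True forces the per-column listing regardless of width.
verbose_buf = io.StringIO()
toy.info(buf=verbose_buf, verbose=True)
verbose_text = verbose_buf.getvalue()
```

For the full dataset, `read_college_scorecard().info(verbose=True)` would list every column, which produces very long output.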

The `columns` attribute lists the column names.

In [42]:
read_college_scorecard().columns

The output above can be converted into a list of strings with the `list` function. With this many columns, the notebook may truncate the displayed list.

In [44]:
list(read_college_scorecard().columns)

The `dtypes` attribute lists the column names and their types.

In [46]:
read_college_scorecard().dtypes
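With many columns, a count of columns per dtype is often easier to read than the full listing. A minimal sketch on made-up data, converting the dtypes to strings before counting so the result is indexed by plain names:

```python
import pandas as pd

# Made-up DataFrame with explicitly typed columns.
toy = pd.DataFrame({
    'UNITID': pd.Series([1, 2], dtype='int64'),
    'region': pd.Series([5, 5], dtype='int64'),
    'EARN':   pd.Series([41000.0, None], dtype='float64'),
})

# Count how many columns have each dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()
```

The same expression applied to `read_college_scorecard().dtypes` would summarize the whole dataset in a few lines.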

The `dtypes` attribute returns a `Series` with a string index.

In [48]:
type(read_college_scorecard().dtypes)

In [49]:
read_college_scorecard().dtypes.index

Retrieving a sub-series seems to work as expected.

In [51]:
read_college_scorecard().dtypes.loc[['ZIP','region']]

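But retrieving a single element does not return a `Series`. It returns a bare NumPy dtype object (for example `dtype('O')` for an `object` column), so the output looks different.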

In [53]:
read_college_scorecard().dtypes.loc['region']

In [54]:
read_college_scorecard().dtypes.loc['ZIP']
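The difference between the two kinds of lookup can be checked directly. The sketch below uses a made-up two-column DataFrame whose column names mirror the ones above, and shows one way to test a single dtype with `pd.api.types.is_numeric_dtype`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in with one text column and one integer column.
toy = pd.DataFrame({
    'ZIP':    pd.Series(['35762', '35294-0110'], dtype=object),
    'region': pd.Series([5, 5], dtype='int64'),
})

# A list of labels returns a Series of dtypes.
sub_series = toy.dtypes.loc[['ZIP', 'region']]

# A single label returns the dtype object itself.
single = toy.dtypes.loc['region']
is_int64 = single == np.dtype('int64')
is_numeric = pd.api.types.is_numeric_dtype(single)
```

Comparing against a dtype, or using the helpers in `pd.api.types`, avoids relying on how a dtype object happens to be displayed.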

__The End__