# Tutorial 3.2: Pandas Data Loading
Python for Data Analytics | Module 3  
Professor James Ng

In [None]:
import numpy as np
import pandas as pd

So far in our course, I've taken care of loading all of the data sets that you've interacted with. **But that changes today!** In this tutorial, we will go over a couple of different ways to load data into your notebooks with Pandas.

## Loading CSV Files with `pd.read_csv()`

The *Pandas* **`read_csv()`** function is likely to become one of your most used tools. It creates a **`DataFrame`** object from a CSV file (surprise!).

Let's go over the basics of using it and some of the common options you might want to specify.

### Simple Case

In [None]:
# Download the Chicago recent crime dataset from OSF
!curl -L https://osf.io/u6xqa/download --create-dirs -o data-sets/chicago-recent-crime.csv

In [None]:
# In the simplest case, all you have to do is pass the 
# location of your CSV file to the function.
chicago_crime = pd.read_csv('data-sets/chicago-recent-crime.csv')
chicago_crime.head()

In [None]:
# You can also point it directly at a csv or zipped csv file that is online.
# Super cool! 

# BUT, this will be slow depending on the size of the CSV
# and your connection speed. And if your internet connection
# goes belly up, you are out of luck.
online_csv = pd.read_csv(
    'https://bulkdata.uspto.gov/data/trademark/assignment/economics/2016/tm_convey.csv.zip')
online_csv.head()
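
If you'd like to see the zipped-CSV behavior without depending on your connection, here is a minimal sketch that builds a small zip archive locally and reads it back. The file name `sample.csv.zip` and the rows in it are made up for illustration; the only thing it relies on is that `pd.read_csv` infers the compression from the `.zip` extension when the archive holds a single file.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical sample rows, used only for this illustration.
csv_text = "id,conveyance\n1,ASSIGNMENT\n2,MERGER\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "sample.csv.zip")

    # Write a zip archive that contains exactly one CSV file.
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("sample.csv", csv_text)

    # The '.zip' extension lets pandas infer the compression,
    # so no extra arguments are needed.
    zipped_df = pd.read_csv(zip_path)

print(zipped_df)
```

The same call works on a URL ending in `.zip`, which is what the cell above is doing.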

### Specify an Encoding

Text files are **encoded** in different formats when they are written. To read them, you must decode them with the same standard or you'll have a problem.

For example, our `college-scorecard-data-scrubbed.csv` file was encoded using `latin-1`, but the default setting for Pandas in Python 3 is `utf-8`, so we will get an error if we try to read the file without specifying the correct encoding, like so:

In [None]:
# Download the College Scorecard dataset from OSF
!curl -L https://osf.io/cz253/download --create-dirs -o data-sets/college-scorecard-data-scrubbed.csv

In [None]:
college_scorecard = pd.read_csv(
    'data-sets/college-scorecard-data-scrubbed.csv')

To avoid this error, we need to specify the correct encoding with the `encoding` parameter when we call the function.

In [None]:
# Load the CSV file with 'latin-1' encoding specified.
college_scorecard = pd.read_csv(
    'data-sets/college-scorecard-data-scrubbed.csv', encoding='latin-1') 
college_scorecard.head()
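
To reproduce the problem on a tiny scale, the sketch below encodes a one-row CSV as `latin-1` in memory and reads it both ways. The institution and city are invented for the example, and `io.BytesIO` simply stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical one-row CSV containing an accented character,
# stored as latin-1 bytes (as if it had been saved that way).
raw_bytes = "institution_name,city\nUniversidad Ejemplo,Bayamón\n".encode("latin-1")

# Reading with the default utf-8 encoding fails on the accented byte.
try:
    pd.read_csv(io.BytesIO(raw_bytes))
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print("utf-8 failed:", err)

# Passing the encoding the data was written with fixes it.
fixed = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
print(fixed)
```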

#### Tip: Encoding
Text isn't actually stored in its natural form inside of your computer. Instead it is "encoded" into a format that the computer understands, but would look like gibberish to a human. 

Without knowing how a file was "encoded" you don't have a way to turn that gibberish back into readable text, which is called "decoding".
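
Here is a small plain-Python sketch of that idea, using the word "café" as an arbitrary example. It shows that the same text turns into different bytes under different encodings, and that decoding with the wrong one can quietly hand you gibberish instead of raising an error.

```python
text = "café"

# The same text becomes different bytes under different encodings.
latin1_bytes = text.encode("latin-1")
utf8_bytes = text.encode("utf-8")
print(latin1_bytes)  # b'caf\xe9'
print(utf8_bytes)    # b'caf\xc3\xa9'

# Decoding with the matching standard recovers the original text.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip)  # café

# Decoding with the wrong standard can "succeed" but produce gibberish.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # cafÃ©
```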

### Choosing an Index Column

The default behavior of `pd.read_csv` is to generate an integer index, but you can override this by specifying a column name.

We can tell from the previous example that our `college-scorecard-data-scrubbed.csv` has an 'institution_name' column. Let's tell *pandas* to use that as our index. When the data is loaded, the values of that column will become the index.

In [None]:
# Specify a data column to use as the index
college_scorecard = pd.read_csv(
    'data-sets/college-scorecard-data-scrubbed.csv', 
    index_col='institution_name', encoding='latin-1')
college_scorecard.head()
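
One reason you might want a meaningful index is that it lets you look rows up by label with `.loc`. The sketch below uses a tiny made-up CSV (the colleges and cities are not from the real data set) read through `io.StringIO` so it runs without any downloads.

```python
import io

import pandas as pd

# Hypothetical miniature version of a college data set.
csv_text = (
    "institution_name,city,state\n"
    "Example State University,Springfield,IL\n"
    "Sample College,Portland,OR\n"
)

colleges = pd.read_csv(io.StringIO(csv_text), index_col="institution_name")

# With the name as the index, a row can be fetched by its label.
sample_city = colleges.loc["Sample College", "city"]
print(sample_city)
```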

Suppose I now change my mind about setting 'institution_name' as the DataFrame's index. How do I reset the index back to the default integer form?

In [None]:
college_scorecard.reset_index()

In [None]:
# Let's check to see if the index was reset. It wasn't!
college_scorecard.head()

In [None]:
# reset_index() returns a new DataFrame, so you need to assign the result
# to a variable: either a new one or the original.

# To assign back to itself, you could either do this:
# college_scorecard = college_scorecard.reset_index()

# or use the inplace option
college_scorecard.reset_index(inplace=True)
college_scorecard.head()
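
The same behavior can be checked on a small hand-built DataFrame. This is only a sketch with invented rows; it confirms that `reset_index()` leaves the original untouched and hands back a new DataFrame with the old index moved into a regular column.

```python
import pandas as pd

# Hypothetical DataFrame indexed by institution name.
colleges = pd.DataFrame(
    {"city": ["Springfield", "Portland"]},
    index=pd.Index(
        ["Example State University", "Sample College"],
        name="institution_name",
    ),
)

# reset_index() returns a new DataFrame...
returned = colleges.reset_index()

# ...while the original still has its named index.
original_index_name = colleges.index.name
print(original_index_name)
print(returned)

# Assigning the result back keeps the change.
colleges = returned
```

Note that `reset_index(inplace=True)` returns `None`, so the reassignment form is the one to use if you want to chain further method calls.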

### Limiting the Columns to Load

Data sets will often contain hundreds of columns. But in many cases, we will only be interested in working with a subset of them.

When this happens, loading all the data would not only result in a *DataFrame* that is difficult to work with, but would also take up unnecessary computer memory and slow down your processing. 

Thankfully, we can limit the CSV columns to load via the `usecols` parameter, which takes a list of column names you want to load.

In [None]:
# Download another data set for this example from OSF.
!curl -L https://osf.io/vesuh/download --create-dirs -o data-sets/college-loan-default-rates.csv

In [None]:
# We're only interested in the name, city, and state columns
# for all colleges in the loan defaults data set.
college_default_rates = pd.read_csv(
    'data-sets/college-loan-default-rates.csv',
    usecols=['name', 'city', 'state']
)
college_default_rates.head()

In [None]:
# You can also choose the columns to include in your
# DataFrame by passing their positions instead of their
# names. As with index numbers, positions start at 0.
college_default_rates = pd.read_csv(
    'data-sets/college-loan-default-rates.csv',
    usecols=[0, 1, 2]
)
college_default_rates.head()

# RULE OF THUMB: If a column has a name, use the name! Column positions can change.
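
To see why that rule of thumb matters, here is a sketch with two invented versions of the same file, where the second has a column moved to the front. Selecting by name gives the same columns both times, while selecting by position silently picks up a different set.

```python
import io

import pandas as pd

# Two hypothetical versions of the same file; only the column order differs.
original = (
    "name,city,state,default_rate\n"
    "Example State University,Springfield,IL,4.2\n"
)
reordered = (
    "default_rate,name,city,state\n"
    "4.2,Example State University,Springfield,IL\n"
)

cols_by_name = [
    list(pd.read_csv(io.StringIO(text), usecols=["name", "city", "state"]).columns)
    for text in (original, reordered)
]
cols_by_position = [
    list(pd.read_csv(io.StringIO(text), usecols=[0, 1, 2]).columns)
    for text in (original, reordered)
]

print(cols_by_name)      # same columns for both versions
print(cols_by_position)  # second version picks up 'default_rate'
```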


### Manually Specifying the Column Names
The default behavior of `pd.read_csv()` is to use the values found in the first row of the CSV file as the column header values. You can, however, override this and manually specify the names of the columns. 

To do so, you must provide the `names` parameter and, often, the `skiprows` parameter:
* `names`: Allows you to specify a list of names to use for the column headers.
* `skiprows`: Can be passed an `int` indicating the number of rows at the start of the file to skip. *Often, if you are changing the header names, you are doing so because you don't like the existing ones, not because there aren't any.* Using this parameter keeps the original header row from being processed and added to your DataFrame as an extra data row.

In [None]:
# Use the `names` parameter to override the default
# column names. 
college_default_rates = pd.read_csv(
    'data-sets/college-loan-default-rates.csv',
    usecols=[0, 1, 2], 
    names=['obtuse_college_id', 'college_name', 'college_address'])
college_default_rates.head()

In [None]:
# In the previous example you can see that the original header names
# were still processed and the result wasn't ideal.

# Use skiprows to specify the number of rows that 
# should not be processed from the original data set.
college_default_rates = pd.read_csv(
    'data-sets/college-loan-default-rates.csv',
    usecols=[0, 1, 2], 
    names=['obtuse_college_id', 'college_name', 'college_address'],
    skiprows=1)
college_default_rates.head()
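
As a side note, `pd.read_csv` also accepts `header=0` together with `names`, which tells it that the first row is a header to be replaced. The sketch below, on a made-up three-column CSV, compares the three variants: no handling of the old header, `skiprows=1`, and `header=0`.

```python
import io

import pandas as pd

# Hypothetical CSV with header names we want to replace.
csv_text = (
    "id,name,address\n"
    "1,Example State University,Springfield\n"
    "2,Sample College,Portland\n"
)
new_names = ["college_id", "college_name", "college_address"]

# Only `names`: the old header row is read in as a data row.
naive = pd.read_csv(io.StringIO(csv_text), names=new_names)

# `skiprows=1` drops the old header row before parsing.
with_skiprows = pd.read_csv(io.StringIO(csv_text), names=new_names, skiprows=1)

# `header=0` marks the first row as a header that `names` replaces.
with_header = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)

print(len(naive), len(with_skiprows), len(with_header))
```

For a file whose header sits on the very first line, the last two approaches should give the same result, so which one you use is mostly a matter of taste.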

### Automatically Parsing Dates
One of the most painful things to work with in data sets is date conversions. Just thinking about it makes me cringe!

Thankfully, Pandas does an awesome job of converting various string representations into datetime values for us if we just ask it to with the `parse_dates` parameter. To demonstrate this, let's load a data set containing information on all of SpaceX's launches. First we will do it without parsing the date(s) and then with date parsing.

In [None]:
# Download SpaceX data set
!curl -L https://osf.io/xz98h/download --create-dirs -o data-sets/spacex-launch-data.csv

In [None]:
# Just load the data to get started. You'll see that there is a 'Date'
# column which we could tell pandas to parse.
space_x = pd.read_csv('data-sets/spacex-launch-data.csv')
space_x.head()

In [None]:
# Without the parsing directive, dates will be treated as 
# strings, which is problematic for analysis.

# Check out the current data type of the 'Date' Series object.
# It will be 'O' which stands for object and is used for strings.
space_x['Date'].dtype

In [None]:
# But if we pass the 'Date' column name to pd.read_csv via
# the `parse_dates` parameter it will convert it 
# to datatype '<M8[ns]', which is a cryptic way of saying datetime64 (a DateTime object).
space_x = pd.read_csv('data-sets/spacex-launch-data.csv', parse_dates=['Date'])
space_x['Date'].dtype
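
To show what parsing buys you, here is a sketch on a tiny CSV with invented flight dates (not the real SpaceX values). It checks the column type with `pd.api.types.is_datetime64_any_dtype` rather than comparing dtype strings, since the exact dtype text can differ between Pandas versions, and then uses the `.dt` accessor, which is only available once the column holds real datetimes.

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical launch dates, for illustration only.
csv_text = "Flight Number,Date\n1,2020-01-15\n2,2020-03-01\n"

as_strings = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(is_datetime64_any_dtype(as_strings["Date"]))  # False
print(is_datetime64_any_dtype(parsed["Date"]))      # True

# Datetime columns expose date parts through the .dt accessor.
launch_months = parsed["Date"].dt.month.tolist()
print(launch_months)  # [1, 3]
```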

In [None]:
# Take another look at the columns in this data set that we printed above.
# There is another date-related field, 'Time (UTC)' which represents the
# specific time of day that the launch took place.

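# You can tell Pandas to combine both of those columns into a single data column
# by passing a nested list to the `parse_dates` parameter like so.
# NOTE: newer Pandas releases (2.2 and later) deprecate this nested-list form,
# so you may see a warning or an error depending on your version.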
space_x = pd.read_csv(
    'data-sets/spacex-launch-data.csv', 
    parse_dates=[['Date', 'Time (UTC)']])
space_x

# Make sure to notice how it removes the original two columns and replaces 
# them with a single one that takes into account both pieces of data. AWESOME!
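
As far as I'm aware, the nested-list form of `parse_dates` was deprecated in Pandas 2.2, so on newer versions you may need another route. One option, sketched below on invented rows, is to load the columns as plain text and combine them yourself with `pd.to_datetime`. The `launch_datetime` column name is my own choice for this example, not something Pandas generates.

```python
import io

import pandas as pd

# Hypothetical launch rows, for illustration only.
csv_text = (
    "Flight Number,Date,Time (UTC)\n"
    "1,2020-01-15,18:45\n"
    "2,2020-03-01,07:30\n"
)

launches = pd.read_csv(io.StringIO(csv_text))

# Join the two text columns with a space, then parse the result.
launches["launch_datetime"] = pd.to_datetime(
    launches["Date"] + " " + launches["Time (UTC)"]
)

# Drop the originals to mirror what the nested-list form did.
launches = launches.drop(columns=["Date", "Time (UTC)"])
print(launches)
```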

## EXERCISE

In [None]:
# How much time elapsed between the first flight and the second flight?

### Learning More
You can learn more about all the other available parameters for the `read_csv` function in the <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv' target='_blank'>Pandas online documentation.</a>