# Tutorial 3.3: Pandas Data Selection
Python for Data Analytics | Module 3  
Professor James Ng

In [None]:
# SETUP: DO NOT CHANGE
import numpy as np
import pandas as pd

## Introduction

In this tutorial, we will be exploring how to extract data from **`Series`** and **`DataFrame`** objects.

You'll find that many of the techniques we used with NumPy arrays are also available for these objects. In addition, they offer some functionality that will feel very familiar if you have worked with dictionaries.

To get started, let's load our college scorecard data set.

In [None]:
# Download the College Scorecard dataset from OSF
!curl -L https://osf.io/cz253/download --create-dirs -o data-sets/college-scorecard-data-scrubbed.csv

college_scorecard = pd.read_csv(
    'data-sets/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1')
college_scorecard.head()

## Selecting Data from `DataFrame` Objects

You can retrieve an individual `Series` from a `DataFrame` by passing the `Series` name/key to the `DataFrame`:

In [None]:
college_scorecard['religious_affiliation_desc']

You can retrieve multiple columns at once by passing a list of column names. When you do this, you actually get a new `DataFrame`, rather than a list/ndarray of individual `Series` objects:

In [None]:
college_scorecard[['institution_name', 'city']][0:10]

You can test for the existence of a given `Series` in a `DataFrame` with the `in` operator:

In [None]:
'city' in college_scorecard
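To make the difference between these selections concrete, here is a minimal sketch on a small made-up `DataFrame`. The column names are borrowed from the tutorial, but the rows are invented for illustration and are not taken from the College Scorecard file.

```python
import pandas as pd

# Hypothetical toy data (invented rows, not the real data set)
toy = pd.DataFrame({
    'institution_name': ['Alpha College', 'Beta University'],
    'city': ['Tucson', 'Tempe'],
    'state': ['AZ', 'AZ'],
})

one_column = toy['city']                          # single key -> Series
two_columns = toy[['institution_name', 'city']]   # list of keys -> DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)

# `in` looks at the column labels, not at the values stored in the cells
print('city' in toy)
print('Tucson' in toy)
```

Note that `'Tucson' in toy` is `False` even though that value appears in the `city` column, because `in` only checks the column names.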

### Dropping Columns

To drop a column, use `pandas.DataFrame.drop()`. By default it returns a new `DataFrame` and leaves the original unchanged.

In [None]:
# Drop the url column
college_scorecard.drop(columns = ['url'])
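One thing that is easy to miss is that `drop()` does not modify the `DataFrame` it is called on unless you ask it to. The sketch below uses a made-up two-column `DataFrame` (the values are invented) to show that you need to keep the returned object if you want the column gone.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'city': ['Tucson', 'Tempe'],
    'url': ['alpha.example', 'beta.example'],
})

# drop() returns a new DataFrame; `toy` itself is left alone
without_url = toy.drop(columns=['url'])

print('url' in toy)           # the original still has the column
print('url' in without_url)   # the returned copy does not
```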

### Slicing `DataFrame` Objects by Rows

Slicing with standard syntax affects rows, not columns, in a `DataFrame`. In other words, you can slice based on the index values, but not the column names.

To demonstrate, let's get a slice of the *first 5 rows* of our `DataFrame`:

In [None]:
college_scorecard[0: 5]

#### Slicing with String-Based Indexes
Now, you might be wondering how this would work if we were using a string-based index instead of an integer-based index. Well, let's find out.

In [None]:
# First, we will update the index of our DataFrame 
# to use the "institution_name" column.
college_scorecard.index = college_scorecard['institution_name']

# Now display a few rows so that you can see how the index has been changed
college_scorecard.head()

With *pandas*, you can pass in two string values as the `start` and `stop` elements of a slice notation. SUPER AWESOME!

In [None]:
# Retrieve all rows from "Southwest University of Visual Arts-Tucson" to "Thunderbird School of Global Management"
# (inclusive! See note in next cell.)
college_scorecard["Southwest University of Visual Arts-Tucson":"Thunderbird School of Global Management"]

**WARNING**  
When you "slice" a `DataFrame` using strings like this, the row that matches the `stop` string <strong>is</strong> included in your results as you can see here. This is different from normal slice notation, which does not include the stop element in the results. 

#### Slicing (Implicit Index)

When we changed the index of our `college_scorecard` DataFrame object, *pandas* did a little sleight of hand. It didn't really get rid of the original integer index. It just told it to go and hide backstage, so to speak. Every row still has an integer position (0, 1, 2, and so on), and you can still slice by it:

In [None]:
# Even though we changed the index to be the 'institution_name' column, 
# the original integer index is still hiding in the background.
college_scorecard[0:5]
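To see the label-based and position-based row slices side by side, here is a minimal sketch on a made-up four-row `DataFrame`. The institution names and states are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data with a string index
toy = pd.DataFrame(
    {'state': ['AZ', 'IN', 'NY', 'CA']},
    index=['Alpha College', 'Beta University', 'Gamma Institute', 'Delta School'],
)

# Slicing by label: the row matching the stop label IS included
by_label = toy['Alpha College':'Gamma Institute']

# Slicing by integer position: the stop position is NOT included
by_position = toy[0:2]

print(len(by_label))      # 3 rows
print(len(by_position))   # 2 rows
```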

### Slicing `DataFrame` Objects by Columns

`DataFrame` objects have two properties, `loc` and `iloc`, that allow you to slice both rows and columns of a `DataFrame` in a similar manner to slicing a 2-dimensional `ndarray`:
* `loc`: Use this when you want to slice rows/columns based on their labels (ex. string based or non-zero based numeric indexes)
* `iloc`: Use this to slice rows/columns based on their implicit integer indices.

First let's demonstrate using `iloc` as it is practically identical to slicing a 2-dimensional NumPy array:

In [None]:
# This will retrieve rows 5-9 and columns 0-4
college_scorecard.iloc[5:10, 0:5]

We've already talked about how we currently have a hidden integer index for the rows of our *college_scorecard* `DataFrame`.  

There is also a hidden integer index for the columns of a `DataFrame` (first column gets the 0 index, second column gets 1, etc). This is why we were able to pass `0:5` to retrieve the first 5 columns of the DataFrame when using the `iloc` attribute.

And now for a quick example of using the `loc` property to slice rows/columns by their labels:

In [None]:
college_scorecard.loc[
    "Southwest University of Visual Arts-Tucson":"Thunderbird School of Global Management", # Slice the rows
    "institution_name": "url" # Slice the columns
] 

#### .loc Slices Include the 'End' Element
Again, it's very important to remember one interesting aspect of `.loc` behavior. 

Unlike normal slice notation, slices done with this attribute will **include** the element at the `end` of your slice. You can see this in the example above.

This is the opposite of both normal slicing and `.iloc`, which leave the end element out. *Make sure to remember this or you will get quite confused at times.*
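The difference is easiest to see by comparing the shapes of the results. The following is a small sketch on a made-up three-row, three-column `DataFrame` (all values are invented for illustration).

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame(
    {
        'city': ['Tucson', 'South Bend', 'Ithaca'],
        'state': ['AZ', 'IN', 'NY'],
        'url': ['alpha.example', 'beta.example', 'gamma.example'],
    },
    index=['Alpha College', 'Beta University', 'Gamma Institute'],
)

# .loc slices by label and includes the end label on both axes
label_slice = toy.loc['Alpha College':'Beta University', 'city':'state']

# .iloc slices by position and excludes the end position on both axes
position_slice = toy.iloc[0:1, 0:1]

print(label_slice.shape)      # (2, 2)
print(position_slice.shape)   # (1, 1)
```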

### Masking
You are already quite familiar with masking based on your work in NumPy. Masking is also used in *pandas*, but here a mask can be used to filter the entire `DataFrame` - not just one `Series`. 

Let's demonstrate with a simple example:

In [None]:
# First, let's create a mask using a comparison operator
# based on which rows have a "state" value of `IN`
mask = college_scorecard['state'] == 'IN'

# Now display the first 10 rows of our object
# so that you can see what it looks like
mask[:10]

Ok, now notice what is returned here. It is a Boolean `Series` object, very similar to the Boolean `ndarray` that a similar operation would return in NumPy.

The key difference here is the presence of the index "column". The boolean values of the Series are tied to specific index values. What this means is that we can take this mask and apply it to any other *pandas* object where these index values are present, including entire `DataFrame` objects.

Let's take our mask and apply it to our `college_scorecard` DataFrame and inspect the results:

In [None]:
college_scorecard[mask]

Can you see how all the rows of our resulting `DataFrame` are for institutions in Indiana (IN)? That is so awesome! This ability in Pandas allows you to focus on the data you want to analyze from a larger data set in almost no time at all.

In [None]:
# Just like in NumPy, you can use BITWISE operators to 
# combine multiple comparisons into a single mask.

# For example, which colleges in NY primarily offer Bachelors degrees?
# Remember that parentheses are important when joining comparisons.
mask = (college_scorecard['state'] == 'NY') & (college_scorecard['predominant_degree_desc'] == 'Bachelors')
college_scorecard[mask]

<div class="alert alert-block alert-info">
<p>
All the standard comparison operators that you used on NumPy arrays are also available on <em>pandas</em> objects.
</p>
</div> 
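Besides `&`, the other bitwise operators from NumPy masks, `|` (or) and `~` (not), work on a Boolean `Series` as well. Here is a minimal sketch on a made-up `DataFrame`. The column names follow the tutorial, but the values are invented.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'state': ['NY', 'IN', 'NY', 'CA'],
    'sat_average': [1250.0, 1410.0, 1480.0, 1100.0],
})

# OR: rows in NY or in IN
in_ny_or_in = (toy['state'] == 'NY') | (toy['state'] == 'IN')

# NOT: rows that are not in NY
not_ny = ~(toy['state'] == 'NY')

# Masks can be combined further, again with parentheses around comparisons
high_sat_outside_ny = not_ny & (toy['sat_average'] > 1400)

print(toy[in_ny_or_in])
print(toy[high_sat_outside_ny])
```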

## Selecting Data from `Series` Objects

Let's start by grabbing the `url` series object out of our *DataFrame*:

In [None]:
url_series = college_scorecard['url']
url_series.head()

As a reminder, a `Series` object is comprised of an explicit index and the values. **Notice here that our `Series` object inherits the 'institution_name' column values as the index from the `DataFrame`.**

Several of the methods available on Python **`dict`** objects are also available on `Series` objects. The reason that this is possible is because Pandas maintains a mapping relationship between the explicit index elements and the Series values - just like standard Python does between the keys & values of a dictionary.  

#### Value Retrieval via Index "Key"
You can retrieve a value from the `Series` by passing it the index "key" you are interested in.

In [None]:
url_series['University of Notre Dame']

#### `keys()` and  `items()` methods
These methods, which exist on all Python dictionaries, are also available on `Series` objects.

In [None]:
# Series.keys() returns all the elements of the Series index
# and is equivalent to calling Series.index
url_series.keys()

In [None]:
# Series.items() returns something called an iterator, which 
# is a special type of object that will generate a new value 
# from an existing data structure each time you ask for it.

smaller_sized_series = url_series[0:10]

for record in smaller_sized_series.items():
    print(record)

In [None]:
# If you remember how method chaining works, you can simplify 
# what we just did into the following. 
for record in url_series[0:10].items():
    print(record)
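If the output above scrolls by too quickly, the following small sketch shows the dictionary-like behavior on a made-up two-element `Series` (the names and URLs are invented). Each record produced by `items()` is an `(index, value)` tuple, much like the `(key, value)` pairs of a dictionary.

```python
import pandas as pd

# Hypothetical toy Series for illustration only
toy_urls = pd.Series(
    ['alpha.example', 'beta.example'],
    index=['Alpha College', 'Beta University'],
)

# items() yields (index, value) tuples
pairs = list(toy_urls.items())
print(pairs)

# keys() gives the index labels, and `in` tests against those labels
print(list(toy_urls.keys()))
print('Alpha College' in toy_urls)
```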

In [None]:
# DataFrame.iterrows() also returns an iterator but on a DataFrame. 
# It enables you to loop through each row in a DataFrame, returning an iterator
# containing the index of each row and the data for that row as a Series.

In [None]:
for idx, row in college_scorecard.iterrows():
    print(idx)  

In [None]:
for idx, row in college_scorecard.iterrows():
    print(row['url'])  

In [None]:
for idx, row in college_scorecard.iterrows():
    # An f-string also copes with rows whose url is missing (NaN)
    print(f"The URL for {idx} is {row['url']}")
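To check what `iterrows()` hands back on each pass of the loop, here is a minimal sketch on a made-up two-row `DataFrame` (all values are invented). It records the index label, the type of the row object, and one value looked up from the row.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame(
    {'url': ['alpha.example', 'beta.example'], 'state': ['AZ', 'IN']},
    index=['Alpha College', 'Beta University'],
)

seen = []
for idx, row in toy.iterrows():
    # idx is the row's index label; row is a Series keyed by column name
    seen.append((idx, type(row).__name__, row['url']))

print(seen)
```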

### Slicing with Explicit Indexes & Implicit Indexes

Just as we saw with `DataFrame` objects, you can use both the explicit index (i.e. labels) and the implicit index (always integers) to retrieve data via slice notation on a `Series` object.

In [None]:
# Slicing by the explicit index
# Remember that the 'end' element will be included in the results
url_series['Stanford University':'University of Notre Dame']

<div class="alert alert-block alert-info">
<p>
It is important to note that the reverse request, <code>url_series['University of Notre Dame': 'Stanford University']</code>, would have yielded no results.
</p>
<p>
This is because Notre Dame appears after Stanford in the CSV file. Remember that, technically, the first item in a slice notation is the 'start' and the second is the 'end'. It is important that you have them in the right order.
</p>
</div> 
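The effect of the ordering can be reproduced on a tiny example. The sketch below assumes a made-up three-element `Series` with unique index labels (the URLs are invented) in which Stanford comes before Notre Dame, as it does in the CSV file.

```python
import pandas as pd

# Hypothetical toy Series; the index order matters here
toy_urls = pd.Series(
    ['stanford.example', 'purdue.example', 'nd.example'],
    index=['Stanford University', 'Purdue University', 'University of Notre Dame'],
)

# Start label appears before the end label: everything in between is returned
forward = toy_urls['Stanford University':'University of Notre Dame']

# Start label appears after the end label: the slice is empty
reverse = toy_urls['University of Notre Dame':'Stanford University']

print(len(forward))
print(len(reverse))
```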

In [None]:
# Using the implicit hidden integer index 
# to slice off the last 10 elements in the Series
url_series[-10:]

### Series Masking
You can do masking on `Series` objects in the same way you did with NumPy arrays. The only difference is that your results will always include the index values along with the data values.

In [None]:
# Let's get a new Series object with numeric data on SAT average scores.
sat_average_series = college_scorecard['sat_average']

In [None]:
# Return schools with SAT averages over 1400
sat_average_series[sat_average_series > 1400]

In [None]:
# You can build up multiple comparisons in your masks as normal.

# What schools have averages between 1400 & 1500?
# Remember the parentheses for these are important!
sat_average_series[(sat_average_series >= 1400) & (sat_average_series <=1500)]

We demonstrated above how you can apply a mask to an entire `DataFrame` object. You can also apply a mask directly to another `Series`. This is the same thing we did in NumPy when we would apply a mask from one array to another array of the same shape.

In [None]:
# Generate a mask from one series with a compound comparison.
mask = (sat_average_series >= 1400) & (sat_average_series <=1500)

# Then apply it to another series.
# This gives us all the states with colleges where the
# average SAT scores are between 1400 and 1500.
college_scorecard['state'][mask]
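Here is the same idea as a minimal sketch on two made-up `Series` that share an index (the names, scores, and states are invented). Because the mask carries the index labels, the result keeps them too.

```python
import pandas as pd

# Hypothetical toy data: two Series sharing the same index
index = ['Alpha College', 'Beta University', 'Gamma Institute']
sat = pd.Series([1450.0, 1200.0, 1500.0], index=index)
state = pd.Series(['IN', 'NY', 'CA'], index=index)

# Build the mask from one Series...
mask = (sat >= 1400) & (sat <= 1500)

# ...and apply it to the other
selected = state[mask]
print(selected)
```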

### Unique Values
In NumPy we had the convenient `np.unique()` function that would return all the unique values in an *ndarray*. Not surprisingly, *pandas* has a method that provides that same functionality for *Series* objects.

In [None]:
# Obtain the unique predominant degree descriptions from our DataFrame
college_scorecard['predominant_degree_desc'].unique()

Notice a couple of things here:
1. The object returned from this method is a NumPy array, **not** a Series. This is somewhat surprising.
2. Unlike the np.unique() function, the results of this method are **not** sorted by default. You would have to do that yourself.

In [None]:
# Sorting the results of unique()
# Remember that we are dealing with a NumPy array here
# so we are sorting using NumPy.
unique_degree_descs = college_scorecard['predominant_degree_desc'].unique()
np.sort(unique_degree_descs)
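The ordering difference between the two approaches shows up clearly on a small numeric example. The sketch below uses a made-up `Series` of integer codes (invented values, loosely modeled on a degree code column): `Series.unique()` keeps the order of first appearance, while `np.unique()` sorts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series of integer codes
codes = pd.Series([3, 1, 3, 2, 1])

in_order_seen = codes.unique()               # order of first appearance
sorted_values = np.unique(codes.to_numpy())  # sorted ascending

print(type(in_order_seen).__name__)
print(in_order_seen)
print(sorted_values)
```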

**Note**  
The preceding method is not supported on `DataFrame` objects. To find the unique rows of a `DataFrame`, use the `DataFrame.drop_duplicates()` method.
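As a quick illustration, here is a minimal sketch of `drop_duplicates()` with its default settings on a made-up `DataFrame` in which one row is repeated exactly (the values are invented). Like `drop()`, it returns a new `DataFrame` and leaves the original as it was.

```python
import pandas as pd

# Hypothetical toy data: the first and third rows are identical
toy = pd.DataFrame({
    'city': ['Tucson', 'Tempe', 'Tucson'],
    'state': ['AZ', 'AZ', 'AZ'],
})

# With no arguments, rows are compared across all columns
unique_rows = toy.drop_duplicates()

print(len(toy))
print(len(unique_rows))
print(unique_rows)
```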

### Sorting a DataFrame

In [None]:
# Sort the college_scorecard DataFrame by predominant_degree_code
college_scorecard.sort_values(by=['predominant_degree_code'])

In [None]:
# Sort by predominant_degree_code and median_student_earnings
college_scorecard.sort_values(by=['predominant_degree_code', 'median_student_earnings'])

## Exercise
Find the college in each city with the highest median student earnings.

Steps:
1. Sort the `college_scorecard` DataFrame by city, state, and descending order of median student earnings (so the highest earnings appear first).
2. Use `drop_duplicates` to check for duplicated city and state values and keep the first of the duplicates.

Feel free to google it!