# Tutorial 3.8: Pandas - Handling Missing Data
Python for Data Analytics | Module 3  
Professor James Ng

In [None]:
# SETUP: DO NOT CHANGE
import numpy as np
import pandas as pd

In this tutorial, we will discuss how *`pandas`* represents missing data and experiment with the methods that are available to deal with missing data inside of `DataFrame` and `Series` objects.

In [None]:
# Load the college scorecard admissions data set
# which contains a bunch of info on SAT/ACT scores of 
# students at thousands of different universities
!curl -L https://osf.io/ekwz5/download --create-dirs -o data-sets/college-scorecard-admissions.csv

college_scorecard = pd.read_csv('https://www3.nd.edu/~jng2/college-scorecard-admissions.csv')
college_scorecard.head()

## How Missing Data is Represented in pandas

In *`Series`* objects, missing data can be represented by two possible values: **`None`** and **`np.nan`**. The value that will be present in a given Series will be determined by the data type held in the series.

If a given object has only numeric (and missing) values, then missing values will always be represented by `np.nan` objects, which simply appear as `NaN`.

A good example of this is the SAT Average series in our `college_scorecard` object:

In [None]:
# Grab the SAT Average series
sat_average = college_scorecard['SAT_AVG']
sat_average[:10]

If a given series has a `dtype` of *object*, which generally results from having strings inside of it, then missing values can be represented by either `None` or `np.nan`. Here is an example of a series where this occurs:

In [None]:
missing_data_series = pd.Series(
    ['We', 'have', '2', np.nan, None, 'pieces of missing data'])
missing_data_series

In practical terms, at least for beginners, we can think of these two values as interchangeable because the methods that we will introduce next handle both of them seamlessly (because *pandas* is your friend).
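
To make that concrete, here is a minimal sketch using a small made-up series (the names `mixed` and `numeric` are illustrative and not part of the scorecard data). It assumes only that `numpy` and `pandas` are installed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series that mixes both missing-value markers
mixed = pd.Series(['a', np.nan, None, 'b'])

# isnull() flags np.nan and None alike
missing_flags = mixed.isnull()
print(missing_flags.tolist())   # [False, True, True, False]

# In a numeric series, None is converted to NaN on construction
numeric = pd.Series([1.0, None, 3.0])
print(numeric.dtype)            # float64
```

Either marker ends up being treated as "missing", which is why the distinction rarely matters in day-to-day work.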

## Missing values in pandas vs numpy

numpy includes NaN values in calculations, but pandas ignores them. That is, consider the mean of these numbers: [1, 2, NaN, 3].
<br>Passing it to numpy.mean() will return NaN, but pandas's mean() will return 2.

In [None]:
np.mean([1, 2, np.nan, 3])

In [None]:
pd.DataFrame([1, 2, np.nan, 3]).mean()

This is an important difference. Which behavior is more correct depends on the situation. To force pandas to adopt numpy's behavior, specify `skipna=False`.

In [None]:
pd.DataFrame([1, 2, np.nan, 3]).mean(skipna=False)
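
As a side-by-side summary, the sketch below collects both behaviors in one place. It also uses `np.nanmean()`, a NumPy function not covered above, which goes the other direction and makes NumPy skip `NaN` the way pandas does by default. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

values = [1, 2, np.nan, 3]

# NumPy: NaN propagates through mean(), but nanmean() skips it
numpy_mean = np.mean(values)          # nan
numpy_nanmean = np.nanmean(values)    # 2.0

# pandas: NaN is skipped by default; skipna=False propagates it
pandas_mean = pd.Series(values).mean()                # 2.0
pandas_strict = pd.Series(values).mean(skipna=False)  # nan

print(numpy_mean, numpy_nanmean, pandas_mean, pandas_strict)
```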

## `Series` Methods for Handling Missing Data
There are three common methods for dealing with missing data available to us: `isnull()`, `notnull()`, and `dropna()`.

We will first go over how to use them with *`Series`* objects and then we will discuss how they work with *`DataFrame`* objects.

### `isnull()`
The `isnull()` method returns a boolean mask indicating which elements of a `Series` object are `np.nan` or `None`.

In [None]:
# Use isnull() to generate a mask of which values in sat_average are missing.
na_mask = sat_average.isnull()
na_mask[:15]

In [None]:
# As is true for any boolean mask, we can use this one to 
# pull out which indices have missing data.
sat_average[na_mask][:15]

In [None]:
# You can also invert the mask to pull out non-missing records
# You'll see there is another method that effectively does this in a moment.
sat_average[~na_mask][:15]

**Pythonista Note**  
Want to count the number of NaN records?

Well, go ahead and try using the `count()` method. It won't work, 
because, if you remember, `count()` only counts non-null values.

You can use the `len()` function instead like so: `len(sat_average[na_mask])`
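
Another option, sketched here on a small hypothetical series standing in for `sat_average`, is to sum the mask directly. This works because `True` counts as 1 and `False` as 0.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sat_average
scores = pd.Series([1050.0, np.nan, 1210.0, np.nan, 980.0])

n_missing = scores.isnull().sum()   # each True counts as 1
n_present = scores.count()          # count() sees non-null values only

print(n_missing, n_present, len(scores))   # 2 3 5
```

The two counts always add up to the length of the series, which makes for a handy sanity check.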

### `notnull()`
This method is the logical inverse of `isnull()`. Use it to return a boolean mask indicating which elements are not null. The results of this method can be used to pull back all non-null elements of a Series when used as a mask/filter.

In [None]:
mask = sat_average.notnull()
sat_average[mask].head()

### `dropna()`
Using `notnull()` to generate a mask and then applying it back to the original *Series* object is so common that the *`pandas`* developers created the convenience method *`dropna()`* which does both steps for you.

In [None]:
# This takes the place of the two statements in the previous example.
sat_average.dropna()[:25]
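
If you want to convince yourself that the two approaches agree, here is a minimal check on a hypothetical toy series (not the scorecard data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sat_average
scores = pd.Series([1050.0, np.nan, 1210.0, np.nan, 980.0])

# Two steps: build the mask, then apply it
masked = scores[scores.notnull()]

# One step: let dropna() do both
dropped = scores.dropna()

print(masked.equals(dropped))    # True
print(dropped.index.tolist())    # [0, 2, 4]
```

Note that the original index labels are kept, so the result has gaps where the missing values used to be.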

## `DataFrame` Methods for Handling Missing Data
Now let's cover how our missing data methods work with *`DataFrame`* objects.

We'll start by extracting a couple of SAT & ACT related Series from our original *DataFrame* so that we have something more manageable to work with.

In [None]:
subset_scorecard = college_scorecard[['SATVRMID', 'SATWRMID', 'ACTMTMID', 'ACTCMMID']]
subset_scorecard.head()

### `isnull()` & `notnull()`
These methods work exactly as you'd expect them to: `isnull()` returns a `DataFrame` of boolean values indicating where values are missing, and `notnull()` returns one indicating where they are not.

**Important: This is a Boolean Mask that is also a DataFrame!**

Up until this point in our class, I believe we've only used single Series masks. When you have a DataFrame boolean mask, *pandas* will apply the mask based on the combination of index and column name.

Essentially you can think of it as a two-dimensional filter.

In [None]:
# Using isnull() to identify missing data in a DataFrame
subset_scorecard.isnull().head()

In [None]:
# And notnull() does the opposite
subset_scorecard.notnull().head()

**Do not use the results of `isnull()` or `notnull()` as a mask with DataFrame objects!**

You may be tempted to use the results of `isnull()` or `notnull()` as a mask on a DataFrame to pull out the non-null values.

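This won't work as you expect: indexing a DataFrame with a boolean DataFrame keeps every row and column and simply puts `NaN` in the cells where the mask is `False`, so nothing is actually removed. Use the next method instead.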
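
Here is a small sketch of that pitfall, using a made-up three-row DataFrame (the column names `sat` and `act` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with one missing value per column
df = pd.DataFrame({
    'sat': [500.0, np.nan, 610.0],
    'act': [21.0, 25.0, np.nan],
})

# Masking with a boolean DataFrame keeps the shape; the missing
# cells are still there as NaN
masked = df[df.notnull()]
print(masked.shape)    # (3, 2)

# dropna() actually removes the rows with missing values
cleaned = df.dropna()
print(cleaned.shape)   # (1, 2)
```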

### `dropna()`
The `dropna()` method works in a similar fashion on `DataFrame` objects to what we've seen with `Series` objects. It does, however, have some additional parameters that can be used.

In [None]:
# Simplest Usage
subset_scorecard.dropna()[:10]

The default behavior of `dropna()` when used on a `DataFrame` is to drop all rows where *any* column has a missing value. You can adjust this so that rows will be included if there is at least one non-missing value with the **`how`** parameter.

In [None]:
# Specify 'all' as the value of `how`
# To return all rows with at least one
# non-missing value.
subset_scorecard.dropna(how='all')[:10]
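
The difference between the two settings is easier to see on a tiny example. The sketch below uses a hypothetical three-row DataFrame in which row 0 is complete, row 1 is partially missing, and row 2 is entirely missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame
df = pd.DataFrame({
    'sat': [500.0, np.nan, np.nan],
    'act': [21.0, 25.0, np.nan],
})

# Default (how='any'): drop a row if any value is missing
any_dropped = df.dropna()
print(any_dropped.index.tolist())   # [0]

# how='all': drop a row only if every value is missing
all_dropped = df.dropna(how='all')
print(all_dropped.index.tolist())   # [0, 1]
```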

Alternatively, you can use the `thresh` parameter to indicate the minimum number of non-missing values that must be present in a given record (row) for it to be included in the results.

In [None]:
# Indicate that at least 3 non-NaN values must be present
# in each record for it to be included.
subset_scorecard.dropna(thresh=3).head()
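
As a quick illustration of how `thresh` counts, consider this hypothetical DataFrame whose rows contain 3, 2, and 1 non-missing values respectively:

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame: rows have 3, 2, and 1 non-NaN values
df = pd.DataFrame({
    'a': [1.0, 1.0, np.nan],
    'b': [2.0, np.nan, np.nan],
    'c': [3.0, 3.0, 3.0],
})

# Keep only rows with at least 2 non-NaN values
kept = df.dropna(thresh=2)
print(kept.index.tolist())   # [0, 1]
```

Keep in mind that `thresh` counts the values that are present, not the values that are missing.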

You can also drop columns instead of rows by specifying a value of *columns* or 1 in the `axis` parameter:

In [None]:
# Note that this will actually drop all
# of our columns, since they all have
# NaN values in them.
subset_scorecard.dropna(axis='columns')[:10]

And of course you could combine these parameters:

In [None]:
# Drop all columns where there are not at least 1000 non-NaN values.
subset_scorecard.dropna(axis='columns', thresh=1000)[:10]
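
The same idea on a small scale: in the sketch below, the hypothetical columns `full`, `partial`, and `sparse` contain 3, 2, and 1 non-missing values respectively.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame
df = pd.DataFrame({
    'full': [1.0, 2.0, 3.0],
    'partial': [1.0, np.nan, 3.0],
    'sparse': [np.nan, np.nan, 3.0],
})

# Drop every column that has any missing value
no_missing = df.dropna(axis='columns')
print(no_missing.columns.tolist())     # ['full']

# Keep columns with at least 2 non-NaN values
at_least_two = df.dropna(axis='columns', thresh=2)
print(at_least_two.columns.tolist())   # ['full', 'partial']
```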