**SA433 &#x25aa; Data Wrangling and Visualization &#x25aa; Fall 2024**

# Lesson 17. Working with Missing Values in Pandas

## In this lesson...

- We learned a little bit about missing values back in Lesson 13


- In this lesson, we'll learn about working with missing values in Pandas in more detail &mdash; quirks and all

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Another small example dataset 

* Let's start by importing Pandas and NumPy:

In [None]:
import pandas as pd
import numpy as np

* For this lesson, we'll work with a small example dataset, in the CSV file `data/toy.csv` in the same folder as this notebook

In [None]:
toy_df = pd.read_csv(
    'data/toy.csv',
    parse_dates=['datetimes']
)

- Note that `parse_dates=...` takes a list of columns to be parsed as a datetime data type (dtype)
    - See [the documentation for `pd.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for details


* Let's see what this dataset looks like:

In [None]:
toy_df

- Looking at the raw CSV file, we see that the `integers` column contains integers or missing values


- However, note that the `integers` column was read in with a float dtype, even though every non-missing value in it is an integer


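- Unfortunately, Pandas cannot (easily) handle a Series of integers with NaN values: `NaN` is itself a float, so the default integer dtype has no way to store it
    - Actually, [there is a way to do this](https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html) with the nullable integer dtype `Int64`, but it's currently experimental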
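- As a minimal sketch of the nullable integer dtype, here is a small Series built by hand
    - The values below are made up for illustration; they are not the contents of `data/toy.csv`

```python
import pandas as pd

# Hypothetical stand-in values, not the contents of data/toy.csv
int_series = pd.Series([1, None, 3], dtype='Int64')

# The dtype stays integer, and the missing value is marked with pd.NA
int_series
```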

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Values considered "missing"

- We saw in an earlier lesson that Pandas uses `NaN` to mark values that are *"missing"* or *"not available"* or *"NA"*


- Pandas has other markers for missing values, such as `NaT` for datetime dtypes


- Good news: `.isna()` and `.notna()` will detect NA values, no matter the marker used

- For example, we can query for all rows with a missing value of `datetimes` like this:
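- One way to write such a query is sketched below, using a small hand-built stand-in (`demo_df`, a hypothetical name with made-up values) so the cell runs on its own
    - With the lesson's data, the same pattern would be applied to `toy_df`

```python
import pandas as pd

# Hypothetical stand-in for toy_df; the values are made up for illustration
demo_df = pd.DataFrame({
    'datetimes': pd.to_datetime(['2024-01-01', None, '2024-01-03']),
    'floats': [1.5, 2.5, None],
})

# .isna() gives a Boolean Series that is True where `datetimes` is NaT;
# using it as a mask keeps only those rows
missing_dates = demo_df[demo_df['datetimes'].isna()]
missing_dates
```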

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Sorting with missing data

- Recall that when we use `.sort_values()` to sort a DataFrame by the values of one of its columns, the missing values get placed at the end


- For example, let's see what happens when we sort our DataFrame by `floats`:
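- A sketch of this sort, using a hypothetical stand-in DataFrame (`demo_df`) with made-up values in place of `toy_df`
    - The `na_position=...` keyword argument of `.sort_values()` controls where the missing values go

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for toy_df; the values are made up for illustration
demo_df = pd.DataFrame({'floats': [3.0, np.nan, 1.0, 2.0]})

# By default, rows with a missing value in `floats` are placed last
sorted_df = demo_df.sort_values('floats')

# na_position='first' places them at the top instead
sorted_na_first = demo_df.sort_values('floats', na_position='first')
sorted_df
```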

## Computations with missing values

- Arithmetic operations with missing values result in NA


- For example, let's see what happens when we add the `floats` and `integers` columns together:
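- A sketch of this addition on a hypothetical stand-in DataFrame (`demo_df`, made-up values)
    - Any row with an NA in either column gives an NA in the result

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for toy_df; the values are made up for illustration
demo_df = pd.DataFrame({
    'floats': [1.5, np.nan, 3.5],
    'integers': [1.0, 2.0, np.nan],
})

# Element-wise addition: only the first row has both values present
total = demo_df['floats'] + demo_df['integers']
total
```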

- Happily, many of the built-in Pandas methods for descriptive statistics and other computations, like the *reduction*/*aggregation* and _"same size"_ methods from Lessons 15 and 16, are written to account for missing values


- For example, when using `.sum()`, NA values are skipped by default


- To illustrate, let's see what happens when we sum the values in the `floats` and `integers` columns across all the rows:
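- A sketch of these column sums on a hypothetical stand-in DataFrame (`demo_df`, made-up values)

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for toy_df; the values are made up for illustration
demo_df = pd.DataFrame({
    'floats': [1.5, np.nan, 3.5],
    'integers': [1.0, 2.0, np.nan],
})

# Sum down the rows of each column; NA values are skipped
column_sums = demo_df[['floats', 'integers']].sum()
column_sums
```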

- [The documentation for the `.sum()` DataFrame method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html) describes the `skipna=...` keyword argument, which is `True` by default


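- NA values are also skipped by default in Series/DataFrame methods like `.max()`, `.mean()`, `.median()`, `.min()`, `.std()`, `.var()`, etc.
    - `.mode()` also ignores NA values by default, but it uses a `dropna=...` keyword argument instead of `skipna=...`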
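- To see what `skipna=...` changes, here is a small sketch on a hand-built Series with made-up values
    - With `skipna=False`, a single NA value makes the whole result NA

```python
import numpy as np
import pandas as pd

# Hypothetical values, made up for illustration
floats = pd.Series([1.5, np.nan, 3.5])

# Default behavior: the NA value is skipped, so the mean is over 2 values
mean_skipping_na = floats.mean()

# skipna=False: the NA value propagates to the result
mean_with_na = floats.mean(skipna=False)
mean_skipping_na, mean_with_na
```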

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Groupby with missing values

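- By default, rows with NA values in the grouping column are excluded from groupby operations
    - Passing `dropna=False` to `.groupby()` keeps them as their own group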


- For example, let's group our dataset by the values of `strings` and compute the mean of `floats` for each group:
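- A sketch of this groupby on a hypothetical stand-in DataFrame (`demo_df`, made-up values)
    - The row whose `strings` value is missing does not appear in any group, and the NA in `floats` is skipped when computing the mean

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for toy_df; the values are made up for illustration
demo_df = pd.DataFrame({
    'strings': ['a', 'b', None, 'a', 'b'],
    'floats': [1.0, 2.0, 3.0, np.nan, 4.0],
})

# Default: the row with a missing group key is left out
group_means = demo_df.groupby('strings')['floats'].mean()

# dropna=False keeps the missing key as its own group
group_means_with_na = demo_df.groupby('strings', dropna=False)['floats'].mean()
group_means
```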

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Filling missing values

- Sometimes ignoring or skipping NA values is the way to go, as `.sum()` and other Pandas built-in methods do by default


- Other times, you may want to fill NA values with something else that makes sense


- We can use the `.fillna()` Series/DataFrame method to accomplish this


- For example, we can replace the NA values in the `floats` column with 0:
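- A sketch of this replacement on a hand-built Series with made-up values, standing in for `toy_df['floats']`

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for toy_df['floats']
floats = pd.Series([1.5, np.nan, 3.5])

# Replace every NA value with 0; the original Series is left unchanged
filled = floats.fillna(0)
filled
```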

- Note that `.fillna()`, when applied to a Series, returns a Series
    - When applied to a DataFrame, `.fillna()` returns a DataFrame

- Instead of using 0, we can replace the NA values in the `floats` column with the mean of the non-NA values:
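- A sketch of filling with the mean, again on a hand-built Series with made-up values
    - Since `.mean()` skips NA values by default, it gives the mean of the non-NA values

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for toy_df['floats']
floats = pd.Series([1.0, np.nan, 3.0])

# .mean() skips the NA value, so this is the mean of 1.0 and 3.0
filled_with_mean = floats.fillna(floats.mean())
filled_with_mean
```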

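- We can also forward-fill or back-fill values with the `.ffill()` and `.bfill()` Series/DataFrame methods, like this:
    - Older code passes `method='ffill'` or `method='bfill'` to `.fillna()`, but that keyword argument has been deprecated since Pandas 2.1

In [None]:
# Forward-fill: each NA takes the last non-NA value that comes before it
toy_df['floats'].ffill()

In [None]:
# Back-fill: each NA takes the next non-NA value that comes after it
toy_df['floats'].bfill()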

- This way of filling NA values often makes sense for time series data, sorted chronologically


- See [the documentation for `.fillna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) for details

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Dropping rows or columns with missing values

- If you just want to exclude rows or columns with missing values, you can use `.dropna()`


- By default, `.dropna()` returns a new DataFrame with all the rows containing NA values (in any column) dropped:

In [None]:
toy_df.dropna()
# toy_df.dropna(axis='rows') does the same thing

- We can obtain a new DataFrame with all the columns containing NA values (in any row) dropped by using the `axis='columns'` keyword argument:
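- A sketch of dropping columns on a hypothetical stand-in DataFrame (`demo_df`)
    - The `ids` column is made up for illustration so that one column has no NA values

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for toy_df; `ids` is a made-up complete column
demo_df = pd.DataFrame({
    'ids': [1, 2, 3],
    'floats': [1.5, np.nan, 3.5],
})

# Keep only the columns that have no NA values in any row
no_na_columns = demo_df.dropna(axis='columns')
no_na_columns
```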

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Replacing values with NA

- Sometimes, missing values are encoded with a **sentinel value**: a special value designated for missing values


- For example, suppose `-99` is a sentinel value in our dataset: we see that the `integers` column contains the value `-99.0`, which was originally `-99` in the CSV file:

In [None]:
toy_df

- We can replace values of `-99.0` with a proper NA value marker using the `.replace()` Series/DataFrame method, like this:
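- A sketch of this replacement on a hand-built Series with made-up values, standing in for `toy_df['integers']`

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for toy_df['integers']
integers = pd.Series([1.0, -99.0, 3.0])

# Replace the sentinel value with NaN
integers_with_na = integers.replace(-99.0, np.nan)
integers_with_na
```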

- Note that `.replace()` returns a new Series when applied to a Series
    - `.replace()` returns a new DataFrame when applied to a DataFrame

- So, we can add a new column to our DataFrame with the revised `integers` column, like this:
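- A sketch of adding such a column to a hypothetical stand-in DataFrame (`demo_df`, made-up values)
    - The column name `integers_clean` is an assumption; any new name works

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for toy_df; the values are made up for illustration
demo_df = pd.DataFrame({'integers': [1.0, -99.0, 3.0]})

# Assign the Series returned by .replace() to a new column
# (`integers_clean` is a hypothetical column name)
demo_df['integers_clean'] = demo_df['integers'].replace(-99.0, np.nan)
demo_df
```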

- Alternatively, we can replace the `NaN` values in the `integers` column with a sentinel value


- Then, we can convert the `integers` column to a proper `int` dtype with the `.astype()` method


- Like this:
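- A sketch of these two steps on a hand-built Series with made-up values, standing in for `toy_df['integers']`
    - Filling with `.fillna()` is one option here; the conversion only works once no NA values remain

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for toy_df['integers']
integers = pd.Series([1.0, np.nan, 3.0])

# Step 1: replace NaN with the sentinel value -99
# Step 2: convert the now NA-free column to an integer dtype
integers_as_int = integers.fillna(-99).astype(int)
integers_as_int
```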

- Here is [the documentation for the `.replace()` DataFrame method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html)


- Here is [the documentation for the `.astype()` Series method](https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html)

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Problems

### Problem 0

In the same folder as this notebook, there is a zipped CSV file `data/nycflights13_flights.csv.zip`, containing the same nycflights13 dataset we used in previous lessons. Read the CSV file into a DataFrame. Display the top 5 rows of the DataFrame.

### Problem 1

Are there any rows with missing `time_hour` values?

### Problem 2

Drop all rows with missing values in the nycflights13 dataset. How many rows remain?

### Problem 3

Compute the average arrival delay for each month, when you 

1. assume that any missing arrival delay has a value of 0, and 
2. omit all missing arrival delays. 

Compare the values you get from both computations. Do they make sense?

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Notes and sources

- From the [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html):
    - [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)