<a href="https://github.com/theonaunheim">
    <img style="border-radius: 100%; float: right;" src="static/strawberry_thief_square.png" width=10% alt="Theo Naunheim's Github">
</a>

<br style="clear: both">
<hr>
<br>

<h1 align='center'>Missing Data</h1>

<br>

<div style="display: table; width: 100%">
    <div style="display: table-row; width: 100%;">
        <div style="display: table-cell; width: 50%; vertical-align: middle;">
            <img src="static/empty_set.png">
        </div>
        <div style="display: table-cell; width: 10%">
        </div>
        <div style="display: table-cell; width: 40%; vertical-align: top;">
            <blockquote>
                    <p style="font-style: italic; color: white;">
                    "The devil's finest trick is to persuade you that he does not exist."
                    </p>
                    <br>
                    <p style="color: white;">-Charles Baudelaire</p>
            </blockquote>
        </div>
    </div>
</div>

<br>


<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Empty_set.svg'>Octahedron80</a>, released into the public domain.
</div>


<hr>

# Generally

Missing data is a fact of life. Consequently, pandas has good resources for dealing with it. Before we get to the tools and tactics available for dealing with missing data, here is a brief primer on NaNs, Nones, and Nulls:

<!--- The parenthesis at the end screws up typical markdown --->
### <a href="https://en.wikipedia.org/wiki/Null_(SQL)">Null</a>


Null (or NULL, or null) is a special marker in Structured Query Language (SQL) that signifies the absence of a value. It is not the same as zero (0), and it is not the same as an empty string (''). Nulls will generally be converted to something more Pythonic like NaN or None before you work with them.

Example:

    SELECT 
        height 
    FROM
        table 
    WHERE
        height IS NOT NULL;
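
To see how a SQL NULL arrives in Python, here is a minimal sketch using the standard library's sqlite3 module with an in-memory database. The table name `people` and its columns are hypothetical and chosen only for illustration; other database drivers may behave differently.

```python
import sqlite3

# In-memory database with a hypothetical table for illustration.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (name TEXT, height REAL)')
conn.executemany(
    'INSERT INTO people VALUES (?, ?)',
    [('a', 1.8), ('b', None)],  # Python None is stored as SQL NULL
)

# NULL comes back to Python as None.
rows = conn.execute('SELECT name, height FROM people ORDER BY name').fetchall()

# IS NOT NULL filters out the rows with a missing height.
not_null = conn.execute(
    'SELECT height FROM people WHERE height IS NOT NULL'
).fetchall()
conn.close()

print(rows)
print(not_null)
```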

### [None](https://docs.python.org/3.6/library/constants.html#None)

None is Python's built-in object for signifying the absence of a value. You assign it to a variable like anything else. It is different from False, although bool(None) == False, and the idiomatic way to test for it is `x is None`.

Example:

    placeholder = None
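
A short sketch of how None behaves in comparisons. The variable names are illustrative only.

```python
placeholder = None

# None is a singleton, so an identity check (`is`) is the idiomatic test.
is_missing = placeholder is None

# None is falsy, but it is not the same object as False.
is_falsy = bool(placeholder) is False
is_false_object = placeholder is False

print(is_missing, is_falsy, is_false_object)
```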

### [NaN](https://docs.scipy.org/doc/numpy-1.13.0/user/misc.html)

NaN (not a number) is a special floating-point value that numpy exposes as numpy.nan (usually written np.nan; the older np.NaN alias was removed in NumPy 2.0). It is [weird by design](https://en.wikipedia.org/wiki/NaN), but this weirdness allows for some pretty nifty workarounds. Two quick caveats: 1) NaN is technically a float and mixes best with other floats or Python objects; 2) if you are comparing something to NaN, use a specialized function such as np.isnan() or pd.isnull(), because np.nan == np.nan evaluates to False.

Related to NaN is NaT (not a time), which is like a NaN for datetime types.

Example:
    
    s = pd.Series([np.nan, 5.0, 1.0, np.nan])
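
The two caveats can be checked directly. This is a minimal sketch rather than an exhaustive tour. It uses np.nan, which should work on both old and new NumPy releases.

```python
import math

import numpy as np
import pandas as pd

nan = float('nan')  # the same kind of float value as np.nan

# NaN never compares equal to anything, including itself.
equal_to_self = (nan == nan)

# Use a dedicated function to test for NaN instead of ==.
checks = (math.isnan(nan), bool(np.isnan(np.nan)), bool(pd.isna(np.nan)))

# NaT plays the same role for datetime data.
nat_is_missing = bool(pd.isna(pd.NaT))

# A Series holding NaN alongside numbers gets a float dtype.
s = pd.Series([np.nan, 5.0, 1.0, np.nan])
print(equal_to_self, checks, nat_is_missing, s.dtype)
```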


---

# Modules covered

### Standard Library
* None

### Third-Party Libraries
* [pandas](https://pandas.pydata.org/pandas-docs/stable/)
* [numpy](https://docs.scipy.org/doc/)


# Modules not covered

### Standard Library
* None

### Third-Party Libraries
* None

---

In [None]:
# Stdlib imports
import pathlib

# Third party imports
import numpy as np
import pandas as pd

### <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html">read_csv()</a> allows you to load your NaNs on import.

It will automatically pick up 'NA', 'N/A', '#N/A', '', 'NULL', 'null', 'NaN', 'nan', and a variety of other values by default. You can add your own custom NA values using the na_values argument.

In [None]:
# Example of what it looks like.
print(pathlib.Path('./data/disability.csv').read_text()[:500])

In [None]:
# And here's a more complete example.
df = pd.read_csv(
    './data/disability.csv',
    na_values=['CUSTOM_NULL_VALUE'],
    parse_dates=['Update Date'],
    infer_datetime_format=True,
)

# Numeric columns containing NaN are loaded as floats, because NaN is a float.
# Don't convert them to int unless needed.
df.head(5)
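
If you do not have the CSV file at hand, the same behavior can be sketched with an in-memory file. The CSV text below is made up for illustration and is not taken from disability.csv.

```python
import io

import pandas as pd

# Hypothetical CSV text; column names and values are illustrative only.
csv_text = (
    "name,height,weight\n"
    "a,NA,70\n"
    "b,180,CUSTOM_NULL_VALUE\n"
    "c,,65\n"
)

demo = pd.read_csv(io.StringIO(csv_text), na_values=['CUSTOM_NULL_VALUE'])

# 'NA' and the empty field are recognized by default;
# 'CUSTOM_NULL_VALUE' is recognized because of na_values.
print(demo.isna().sum())

# Both numeric columns become floats because they contain NaN.
print(demo.dtypes)
```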

### <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html">dropna()</a>

Both Series and DataFrames support the dropna() method. Unsurprisingly, this drops missing values.

In [None]:
# By default, dropna() will drop any row containing a NaN value.
df.dropna().head(5)

In [None]:
# Demonstrating on a Series as well. We will skip the Series examples going forward.
df['State Code'].dropna().head(5)

In [None]:
# Using the 'subset' argument allows you to drop rows only if a specific column or columns
# have NA values. First, the unfiltered head for comparison:
df.head(5)

In [None]:
# Drop rows with NaN in 'Population age 18-64'
df.dropna(subset=['Population age 18-64']).head(5)

In [None]:
# Drop rows with NaN in 'Update Date' or 'State Code'
df.dropna(subset=['Update Date', 'State Code']).head(5)
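
The difference between the default behavior and the subset argument is easier to see on a tiny frame. The sketch below uses made-up data, and the column names `state` and `population` are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'state': ['AL', None, 'AZ', None],
    'population': [100.0, 200.0, np.nan, np.nan],
})

# Default: drop every row that has at least one missing value.
any_missing = demo.dropna()

# subset: only look at 'state' when deciding which rows to drop.
state_only = demo.dropna(subset=['state'])

# On a Series, dropna() simply removes the missing entries.
population_only = demo['population'].dropna()

print(any_missing.index.tolist())
print(state_only.index.tolist())
print(population_only.index.tolist())
```

Note that dropna() returns a new object and leaves `demo` unchanged.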

### <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html">fillna()</a>

Instead of dropping values, sometimes you will want to replace the NaN value or impute a value. This is what the fillna() method does. It supports several kinds of fills: a single value, a per-column mapping, or a computed statistic.

In [None]:
# This replaces all the NAs with zero.
df.fillna(0).head(5)

In [None]:
# This replaces missing values in 'File Name' with 'Unknown' and in 'Update Date' with NaT
df.fillna({'File Name': 'Unknown', 'Update Date': np.datetime64('NaT')}).head(5)

In [None]:
# It can get as complex as you want.
# Note: the nice thing is we can calculate column statistics without
# getting rid of the NaN values first. See median usage below.
median_val = df['Population age 18-64'].median()
df['Population age 18-64'].fillna(median_val).head()

In [None]:
# It also has handy forward fill and backward fill options.
# For comparison, here are the missing values filled with an empty string.
df['State Code'].fillna('').head(13)

In [None]:
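# Forward fill: carry the last valid value forward.
# ffill() is equivalent to fillna(method='ffill'), which newer pandas versions deprecate.
df['State Code'].ffill().head(13)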

In [None]:
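# Backward fill: pull the next valid value backward.
# bfill() is equivalent to fillna(method='bfill'), which newer pandas versions deprecate.
df['State Code'].bfill().head(13)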
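
The fill strategies above can be compared side by side on a small frame. This is a sketch with made-up data, and the column names `code` and `value` are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'code': ['AL', None, None, 'AZ'],
    'value': [1.0, np.nan, 3.0, 8.0],
})

# Per-column fill values via a dict. median() skips NaN by default.
filled = demo.fillna({'code': 'Unknown', 'value': demo['value'].median()})

# Forward fill carries the last valid value forward;
# backward fill pulls the next valid value backward.
forward = demo['code'].ffill()
backward = demo['code'].bfill()

print(filled)
print(forward.tolist())
print(backward.tolist())
```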

### <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.isnull.html">isnull()</a> /  <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.notnull.html">notnull()</a>

The pandas functions isnull() and notnull() are used to test whether values are null. They return boolean masks, which allows for boolean indexing.

In [None]:
# View the head
df['State Code'].head(6)

In [None]:
# Check whether null
null_mask = df['State Code'].head(6).isnull()
null_mask

In [None]:
# Get the rows with NA State Codes
df.head(6)[null_mask]

In [None]:
# Check whether notnull
notnull_mask = df['State Code'].head(6).notnull()
notnull_mask

In [None]:
# Get the rows with valid state codes
df.head(6)[notnull_mask]

In [None]:
# Note, you can also negate the mask with the negation operator ~
df.head(6)[~notnull_mask]
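
The same masks are also handy for counting missing values. A minimal sketch on made-up data follows; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'code': ['AL', None, 'AZ'],
    'value': [1.0, 2.0, np.nan],
})

# Boolean mask: True where 'code' is missing.
null_mask = demo['code'].isnull()

# Boolean indexing with the mask and with its negation.
missing_rows = demo[null_mask]
valid_rows = demo[~null_mask]

# Summing a boolean frame counts the missing values per column.
counts = demo.isnull().sum()

print(null_mask.tolist())
print(counts)
```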

### np.nan as a placeholder

In [None]:
# Creating empty data columns (np.nan also works on NumPy 2.0+, where np.NaN was removed)
df = pd.DataFrame({'x': np.arange(10) * 7, 'y': np.nan})
df

In [None]:
df.loc[df['x'] < 30, 'y'] = 'I am less than 30!'
df.loc[df['x'] >= 30, 'y'] = 'I am 30 or greater!'
df
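
One benefit of a NaN placeholder is that you can always ask which rows have not been filled in yet. The sketch below assumes a numeric result column; the names `x` and `y` mirror the cells above, but the data is made up.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'x': np.arange(5) * 7})

# Start with an all-NaN placeholder column.
demo['y'] = np.nan

# Fill in only part of the column.
demo.loc[demo['x'] < 15, 'y'] = 1.0

# The placeholder shows how many rows still need a value.
remaining = int(demo['y'].isna().sum())
print(remaining)
```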

# Additional Learning Resources

* ### [Pandas Missing Data Tutorial](https://pandas.pydata.org/pandas-docs/stable/missing_data.html)

---

# Next Up: [Dtype Specific Techniques](4_dtype_specific_techniques.ipynb)

<br>

<div align='left'>

<img style="margin-left: 0;" src="./static/my_kingdom_for_a_decent_data_type_image.png" width="200">


<br>

Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Binario_cropped.png'>MdeVicente</a>, released into the public domain.
</div>


---