# Missing data

In this lesson we'll cover the different types of missingness and briefly introduce how to impute (replace) missing data.

Some datasets may be perfectly complete, but many will arrive with some missing values, and cleaning can increase the amount of missing data. Even if missingness is random, it can cause difficulties for analysis. Many Python implementations of basic statistical methods like ANOVA, t-tests, and correlations will either raise an error or return NaN if any of the variables involved has a missing value. One way to solve this problem is to drop any rows that contain missing values in your variables of interest. The pandas package has the .dropna() data frame method for doing exactly this:

In [10]:
import pandas as pd

# Sample data to play with and clean.
data = {
    'age': [27, 50, 34, None, None, None],
    'gender': ['f', 'f', 'f', 'm', 'm', None],
    'height' : [64, None, 71, 66, 68, None],
    'weight' : [140, None, 130, 110, 160, None],
}
df = pd.DataFrame(data)

# Full dataset.
print(df)

# Drop all rows that have any missing values in any column.
print(df.dropna()) 

# Drop only rows where all values are missing.
print(df.dropna(how='all'))

# Keep only rows with at least two non-missing values (thresh=2). With these
# four columns, that drops rows where more than two values are missing.
print(df.dropna(thresh=2))

# Drop all rows that have any missing values in the 'gender' or 'height' columns.
print(df.dropna(subset=['gender','height']))

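# Your turn. Write code below to drop rows where both height and weight
# are missing and print the result.
print('\n')
# how='all' drops a row only when every column in subset is missing; the
# default how='any' would also drop rows missing just one of the two.
print(df.dropna(subset=['height', 'weight'], how='all'))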

    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f     NaN     NaN
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0
5   NaN   None     NaN     NaN
    age gender  height  weight
0  27.0      f    64.0   140.0
2  34.0      f    71.0   130.0
    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f     NaN     NaN
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0
    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f     NaN     NaN
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0
    age gender  height  weight
0  27.0      f    64.0   140.0
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0


    age gender  height  weight
0  27.0      f    64.0   140.0
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0
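
To see why missing values get in the way of analysis, here is a small sketch using the height and weight columns from the sample data. It uses NumPy's corrcoef as one example of a basic statistical routine; that choice is an assumption made for illustration, and other libraries may raise an error or handle missing values differently.

```python
import numpy as np
import pandas as pd

# Height and weight columns copied from the sample data above.
df = pd.DataFrame({
    'height': [64, None, 71, 66, 68, None],
    'weight': [140, None, 130, 110, 160, None],
})

# With missing values present, the correlation comes back as NaN.
r_with_missing = np.corrcoef(df['height'], df['weight'])[0, 1]
print(r_with_missing)

# After dropping incomplete rows, the same call returns a usable number.
complete = df.dropna(subset=['height', 'weight'])
r_complete = np.corrcoef(complete['height'], complete['weight'])[0, 1]
print(round(r_complete, 3))
```

Note that the second correlation is based on only four of the six rows.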


# When Does Missingness Matter?

Sometimes dropping all rows with missing data is fine, but sometimes it creates problems. Missing data matter if we believe the missingness will cause:

- Loss of statistical power because so many rows have to be thrown out, making it harder to detect effects, or
- Bias because certain values are more likely to be missing than others.
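
One rough way to gauge the potential loss of power is to count how much is missing before dropping anything. The sketch below reuses the sample data from earlier in the lesson; the variable names are just illustrative.

```python
import pandas as pd

# Same sample data as earlier in the lesson.
data = {
    'age': [27, 50, 34, None, None, None],
    'gender': ['f', 'f', 'f', 'm', 'm', None],
    'height': [64, None, 71, 66, 68, None],
    'weight': [140, None, 130, 110, 160, None],
}
df = pd.DataFrame(data)

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# How many rows would survive dropping every incomplete row.
rows_total = len(df)
rows_complete = len(df.dropna())
print(f"{rows_complete} of {rows_total} rows are complete")
```

If only a small share of rows is complete, dropping the rest may leave too little data to detect the effects you care about.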

To know when to worry about missing data and when to throw out incomplete cases and proceed as planned, see where the missingness falls in the following categories:

Missing Completely at Random ("MCAR"):

A catastrophic flood washed away some of the servers and 20% of the data was lost.

Unless so much data is lost that the sample is now too small, it is fair to throw out the rows with missing values and proceed.

Missing at Random ("MAR"):

Women are more likely to skip a question about weight, regardless of their actual weight.

Because we can explain why the data is missing using data we have, we can proceed as long as we include the variable that "explains" the missingness in our analyses.

There is no way to know for certain that data is MAR, but sometimes we can assume it is. If we find a variable in our dataset that seems to differentiate really well between missing and non-missing (say, 90% of the people with missing values on the "depression" score are men), we have reason to suspect MAR.
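
A quick way to look for such a variable is to compare the rate of missingness across groups. The sketch below uses a small made-up survey (the values are hypothetical, not real data) in which women skip the weight question more often than men.

```python
import pandas as pd

# Hypothetical survey responses, invented for illustration only.
survey = pd.DataFrame({
    'gender': ['f', 'f', 'f', 'f', 'm', 'm', 'm', 'm'],
    'weight': [None, 130, None, None, 160, 175, None, 180],
})

# Flag missing weights, then take the share of missing values per gender.
missing_rate_by_gender = survey['weight'].isna().groupby(survey['gender']).mean()
print(missing_rate_by_gender)
```

A large gap between groups suggests the missingness depends on that variable. It cannot rule out that missingness also depends on the unobserved values themselves.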

Missing Not at Random ("MNAR"):

LGBT individuals are less likely to answer a survey question about their sexual orientation.

Systematic missingness: People who would answer in a certain way (LGBT vs. Heterosexual) are less likely to answer at all.

Stop, do not pass Go, do not collect $200. If we throw out MNAR data, we end up with a biased sample (proportionately fewer LGBT people than in the population we want to study) and biased conclusions.
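
The difference between dropping MCAR data and dropping MNAR data can be sketched with a small simulation. Everything here is an assumption made for illustration: the weights are simulated, and the missingness probabilities (20% for everyone under MCAR; 50% for heavier people and 5% for lighter people under MNAR) are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "true" weights that we would see if nobody skipped the question.
true_weight = rng.normal(loc=150, scale=20, size=10_000)

# MCAR: every value has the same 20% chance of being lost.
mcar_missing = rng.random(true_weight.size) < 0.20

# MNAR: the chance of being missing depends on the value itself.
p_missing = np.where(true_weight > 150, 0.50, 0.05)
mnar_missing = rng.random(true_weight.size) < p_missing

# Compare the mean of what is left after dropping the missing values.
mean_true = true_weight.mean()
mean_after_mcar = true_weight[~mcar_missing].mean()
mean_after_mnar = true_weight[~mnar_missing].mean()

print(f"true mean:          {mean_true:.1f}")
print(f"after MCAR dropped: {mean_after_mcar:.1f}")
print(f"after MNAR dropped: {mean_after_mnar:.1f}")
```

Under MCAR the remaining mean stays close to the true mean, while under MNAR it is pulled downward because heavier people are underrepresented in what remains.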

Note that since, by definition, we don't know what people would have said for questions they don't answer, MNAR is an assumption based on looking at the data and noticing what isn't there: Abnormally low counts of LGBT people, almost no men who say they are depressed, variables with missingness where nobody picks the highest or lowest value, etc.

What do you do if you have MNAR data you can't drop (or if it is MCAR or MAR but dropping missing values leaves your sample too small)?

# Imputing Data

In cases where we want to keep all the information from all rows, even incomplete ones, we can "guess" what the missing data would have been and fill in that cell with our guess. This approach is called imputation.

There are many methods for imputing data, from the simple to the very complex. The most straightforward involves replacing missing values with the mode, mean, or median of the variable. This method isn't perfect: it keeps central tendency the same, but reduces variance and correlations among variables. Here's how it works:

In [11]:
import pandas as pd

# Sample data to play with.
data = {
    'age': [27, 50, 34, None, None, None],
    'gender': ['f', 'f', 'f', 'm', 'm', None],
    'height' : [64, None, 71, 66, 68, None],
    'weight' : [140, None, 130, 110, 160, None],
}
df = pd.DataFrame(data)

# For each numeric column, replace the missing values with the mean for that column.
# numeric_only=True skips the non-numeric 'gender' column, which has no mean.
df.fillna(df.mean(numeric_only=True), inplace=True)
print(df)

# For each column, replace the missing values with the most common value for that
# column. Useful for filling in missing categorical values.
# As written, this command will fill in missing values for both numerical and
# categorical columns. When several values tie for most common (as in 'age',
# where every value appears once), which one is used depends on how
# value_counts() orders the ties.
df = pd.DataFrame(data)
df = df.apply(lambda x: x.fillna(x.value_counts().index[0]))
print(df)

# Your turn. Try replacing each value with the median, mode, or other statistic
# of your choice.

# Rebuild the data frame first: after the mode fill above, df has no missing
# values left, so a second fillna() on it would change nothing.
df = pd.DataFrame(data)
df.fillna(df.median(numeric_only=True), inplace=True)
print(df)

    age gender  height  weight
0  27.0      f   64.00   140.0
1  50.0      f   67.25   135.0
2  34.0      f   71.00   130.0
3  37.0      m   66.00   110.0
4  37.0      m   68.00   160.0
5  37.0   None   67.25   135.0
    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f    68.0   160.0
2  34.0      f    71.0   130.0
3  34.0      m    66.0   110.0
4  34.0      m    68.0   160.0
5  34.0      f    68.0   160.0
    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f    67.0   135.0
2  34.0      f    71.0   130.0
3  34.0      m    66.0   110.0
4  34.0      m    68.0   160.0
5  34.0   None    67.0   135.0
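
The trade-off mentioned above (central tendency is preserved, variance shrinks) can be checked directly. This is a minimal sketch using only the height column from the sample data:

```python
import pandas as pd

# Height values from the sample data, with two missing entries.
height = pd.Series([64, None, 71, 66, 68, None], dtype='float64')

# Fill the gaps with the mean of the observed values.
height_imputed = height.fillna(height.mean())

# The mean is unchanged, but the standard deviation shrinks because the
# filled-in values all sit exactly at the mean.
print(height.mean(), height_imputed.mean())
print(height.std(), height_imputed.std())
```

The more values are imputed this way, the more the spread of the variable is understated.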


If you'd like to see a more sophisticated method of replacing missing data, involving grouping existing entries into "similar" groups and filling in the missing values within a group with the mean for that group, check out this in-depth tutorial.
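
As a rough sketch of the group-based idea (not necessarily the approach taken in that tutorial), the code below fills each missing height with the mean height for that row's gender. Using gender as the grouping variable and the column name height_filled are assumptions made for this example.

```python
import pandas as pd

# Same sample data as earlier in the lesson.
data = {
    'age': [27, 50, 34, None, None, None],
    'gender': ['f', 'f', 'f', 'm', 'm', None],
    'height': [64, None, 71, 66, 68, None],
    'weight': [140, None, 130, 110, 160, None],
}
df = pd.DataFrame(data)

# Mean height within each gender group, aligned to the original rows.
group_mean_height = df.groupby('gender')['height'].transform('mean')

# Fill each missing height with the mean of that row's gender group.
df['height_filled'] = df['height'].fillna(group_mean_height)
print(df[['gender', 'height', 'height_filled']])
```

The last row has no gender, so it belongs to no group and its height stays missing. A row like that would need a fallback, such as the overall mean.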

Imputation is a deep and complex topic, and will be discussed more in Unit 5 as one of the optional specializations.

# Beyond Imputation

If the causes of MNAR (or of major, catastrophic amounts of missingness that is MCAR or MAR) are clear and easy to fix, then fixing those causes and collecting new data may be easier than imputation. Either run the study afresh, or collect more data with an intentional focus on the groups with highest missingness. For example, if a coding error in a tech usage survey means data wasn't recorded for any Mac users, it may be easier to fix the coding error and run the study again (or fix the coding error and collect data from just Mac users) than try to impute such a centrally important variable.


# Wrap up

After completing this lesson, you should feel comfortable creating attractive, clean plots and subplots using Seaborn, and know which plots to choose to highlight various data features. You should also understand how visualization is critical to data cleaning. Finally, you should be able to explain how various kinds of dirty and missing data threaten the validity of analyses, and implement some basic approaches for dealing with them.