# Pandas Missing Data

This notebook explains how to identify missing data with `pandas`.

## Packages

This tutorial uses:
* [pandas](https://pandas.pydata.org/docs/)

In [1]:
import pandas as pd

## Creating the data

We will create a dataframe with 20 rows and four columns of different types (text, numeric, boolean, and datetime) to use in this example.

In [2]:
df = pd.DataFrame({'A': ['text']*20,
                   'B': [1, 2.2]*10,
                   'C': [True, False]*10,
                   'D': pd.to_datetime('2020-01-01')
                  })

Next, overwrite some of the entries with `None` to create missing data.

In [3]:
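# Column A (text): rows 0, 1 and 10
df.iloc[0,0] = None
df.iloc[1,0] = None
df.iloc[10,0] = None
# Column B (numeric): rows 5 and 7
df.iloc[5,1] = None
df.iloc[7,1] = None
# Column C (boolean): rows 4, 5, 9 and 12
df.iloc[4,2] = None
df.iloc[5,2] = None
df.iloc[9,2] = None
df.iloc[12,2] = None
# Column D (datetime): rows 2 and 12
df.iloc[2,3] = None
df.iloc[12,3] = None
df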

,A,B,C,D
0,,1.0,1.0,2020-01-01
1,,2.2,0.0,2020-01-01
2,text,1.0,1.0,NaT
3,text,2.2,0.0,2020-01-01
4,text,1.0,,2020-01-01
5,text,,,2020-01-01
6,text,1.0,1.0,2020-01-01
7,text,,0.0,2020-01-01
8,text,1.0,1.0,2020-01-01
9,text,2.2,,2020-01-01
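
The output above shows `NaT` in column D where `None` was assigned. As a minimal sketch of why, the snippet below assigns `None` into a float Series and a datetime Series and checks how each stores it. The names `floats` and `dates` are illustrative and are not part of the tutorial's `df`.

```python
import pandas as pd

# Hypothetical stand-ins for a numeric and a datetime column.
floats = pd.Series([1.0, 2.2, 1.0])
dates = pd.Series(pd.to_datetime(['2020-01-01'] * 3))

floats.iloc[0] = None   # stored as NaN; the dtype stays float64
dates.iloc[0] = None    # stored as NaT; the dtype stays datetime64

print(floats.dtype, dates.dtype)
print(floats.isna().tolist(), dates.isna().tolist())
```

In both cases the column keeps its dtype and the missing entry is represented by that dtype's own missing-value marker.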


## Identify missing data

The `isna` method identifies missing values in the data. It returns a dataframe of the same shape, with `True` wherever a value is missing.

In [4]:
missing = df.isna()
missing

,A,B,C,D
0,True,False,False,False
1,True,False,False,False
2,False,False,False,True
3,False,False,False,False
4,False,False,True,False
5,False,True,True,False
6,False,False,False,False
7,False,True,False,False
8,False,False,False,False
9,False,False,True,False
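
As a small self-contained illustration, the sketch below applies `isna` to a hypothetical three-row frame (`sample`, not the tutorial's `df`) whose middle row is missing in every column. It also checks two related methods: `isnull`, which is an alias of `isna`, and `notna`, which is its element-wise inverse.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the middle row is missing in every column.
sample = pd.DataFrame({
    'num': [1.0, np.nan, 3.0],
    'when': pd.to_datetime(['2020-01-01', None, '2020-01-03']),
    'flag': [True, None, False],
})

mask = sample.isna()
print(mask)

# `isnull` is an alias of `isna`; `notna` is the element-wise inverse.
assert mask.equals(sample.isnull())
assert mask.equals(~sample.notna())
```

`NaN`, `NaT`, and `None` are all flagged as missing, regardless of the column's dtype.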


Use `sum` to get the count of missing values in each column.

In [5]:
missing.sum()

A    3
B    2
C    4
D    2
dtype: int64
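
The same boolean mask supports a few other summaries. The sketch below uses a hypothetical frame `toy` (an assumption for illustration only) to show the per-column count, the fraction missing per column via `mean`, the per-row count via `sum(axis=1)`, and the overall total.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, separate from the tutorial's `df`.
toy = pd.DataFrame({
    'x': [1.0, np.nan, 3.0, np.nan],
    'y': ['a', 'b', None, 'd'],
})
toy_missing = toy.isna()

per_column = toy_missing.sum()        # missing values in each column
fraction = toy_missing.mean()         # share of each column that is missing
per_row = toy_missing.sum(axis=1)     # missing values in each row
total = int(toy_missing.sum().sum())  # missing values in the whole frame

print(per_column, fraction, per_row, total, sep='\n')
```

These work because `True` is counted as 1 and `False` as 0 when summing or averaging.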

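The rows that contain missing data can be selected using the `any` method with **axis** set to **1**, which returns `True` for each row that has at least one missing value. Indexing `df` with the resulting boolean Series keeps only those rows.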

In [6]:
anymissing = missing.any(axis=1)
anymissing

0      True
1      True
2      True
3     False
4      True
5      True
6     False
7      True
8     False
9      True
10     True
11    False
12     True
13    False
14    False
15    False
16    False
17    False
18    False
19    False
dtype: bool

In [7]:
df[anymissing]

,A,B,C,D
0,,1.0,1.0,2020-01-01
1,,2.2,0.0,2020-01-01
2,text,1.0,1.0,NaT
4,text,1.0,,2020-01-01
5,text,,,2020-01-01
7,text,,0.0,2020-01-01
9,text,2.2,,2020-01-01
10,,1.0,1.0,2020-01-01
12,text,1.0,,NaT
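
The row mask can also be inverted, and the same idea applies to columns. The following is a sketch on a hypothetical frame `toy` (not the tutorial's `df`): `~` selects the rows with no missing values, and `any(axis=0)` flags the columns that contain at least one.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'x' and 'y' each have one missing value, 'z' has none.
toy = pd.DataFrame({
    'x': [1.0, np.nan, 3.0],
    'y': ['a', 'b', None],
    'z': [10, 20, 30],
})
toy_missing = toy.isna()

# Rows: invert the mask to keep only the complete rows.
row_has_missing = toy_missing.any(axis=1)
complete_rows = toy[~row_has_missing]

# Columns: axis=0 checks down each column instead of across each row.
col_has_missing = toy_missing.any(axis=0)
incomplete_cols = toy.loc[:, col_has_missing]

print(complete_rows)
print(list(incomplete_cols.columns))
```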
