# Managing missing data with pandas

This notebook introduces some ways to manage missing data using Pandas DataFrames. For more information, see the Pandas documentation: [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) and [Missing data cookbook](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook-missing-data).

> **See also:**
> 
> * [Dora](https://github.com/NathanEpstein/Dora)
> * [Badfish](https://github.com/harshnisar/badfish)

In [1]:
import pandas as pd
from numpy import random

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv')

## 1. Check the data

In [3]:
df.head(20)

,timestamp,username,temperature,heartrate,build,latest,note
0,2017-01-01T12:00:23,michaelsmith,12.0,67,4e6a7805-8faa-2768-6ef6-eb3198b483ac,0.0,interval
1,2017-01-01T12:01:09,kharrison,6.0,78,7256b7b0-e502-f576-62ec-ed73533c9c84,0.0,wake
2,2017-01-01T12:01:34,smithadam,5.0,89,9226c94b-bb4b-a6c8-8e02-cb42b53e9c90,0.0,
3,2017-01-01T12:02:09,eddierodriguez,28.0,76,,0.0,update
4,2017-01-01T12:02:36,kenneth94,29.0,62,122f1c6a-403c-2221-6ed1-b5caa08f11e0,,
5,2017-01-01T12:03:04,bryanttodd,13.0,86,0897dbe5-9c5b-71ca-73a1-7586959ca198,0.0,interval
6,2017-01-01T12:03:51,andrea98,17.0,81,1c07ab9b-5f66-137d-a74f-921a41001f4e,1.0,
7,2017-01-01T12:04:35,scott28,16.0,76,7a60219f-6621-e548-180e-ca69624f9824,,interval
8,2017-01-01T12:05:05,hillpamela,5.0,82,a8b87754-a162-da28-2527-4bce4b3d4191,1.0,
9,2017-01-01T12:05:41,moorejeffrey,25.0,63,585f1a3c-0679-0ffe-9132-508933c70343,0.0,wake


In [4]:
df.dtypes

timestamp       object
username        object
temperature    float64
heartrate        int64
build           object
latest         float64
note            object
dtype: object

Count how often each value occurs in the `note` column:

In [5]:
df.note.value_counts()

wake        16496
user        16416
interval    16274
sleep       16226
update      16213
test        16068
Name: note, dtype: int64
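Note that `value_counts` leaves out missing values by default, so the counts above do not add up to the number of rows. The following minimal sketch uses a small made-up Series (not the IoT data) to show how `dropna=False` makes the missing entries visible:

```python
import numpy as np
import pandas as pd

# Made-up example values; three of the six entries are missing.
notes = pd.Series(["wake", "sleep", np.nan, "wake", np.nan, np.nan])

# Default behaviour: NaN entries are not counted.
without_nan = notes.value_counts()

# dropna=False adds a separate entry for the missing values.
with_nan = notes.value_counts(dropna=False)

print(without_nan.sum())  # number of non-missing entries
print(with_nan.sum())     # number of all entries
```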

## 2. Remove all null values (including the indication `n/a`)

[pandas.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) already converts many common markers for missing data, such as `NA` or `NaN`, into missing values when reading a file. Additional markers can be specified with `na_values`.

In [6]:
df = pd.read_csv('https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv',
                 na_values=['n/a'])
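As a minimal sketch of the effect of `na_values`, the following reads a small made-up CSV from a string instead of the URL above. The column names and the `unknown` marker are assumptions chosen for illustration only:

```python
import io

import pandas as pd

# Hypothetical inline CSV; "unknown" stands in for a custom missing-value marker.
csv_text = "sensor,reading\na,1.5\nb,unknown\nc,2.0\n"

# Without na_values, "unknown" is kept as text, so the column is not numeric.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, "unknown" is converted to NaN and the column becomes float.
cleaned = pd.read_csv(io.StringIO(csv_text), na_values=["unknown"])

print(raw["reading"].isna().sum())
print(cleaned["reading"].isna().sum())
print(cleaned["reading"].dtype)
```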

### 2.1 Test if we can use [pandas.DataFrame.dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)

> **See also:**
> 
> * [DataFrame.isna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html) indicates missing values
> * [DataFrame.notna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.notna.html) indicates existing (not missing) values
> * [DataFrame.fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html) replaces missing values
> * [Series.dropna](https://pandas.pydata.org/docs/reference/api/pandas.Series.dropna.html) deletes missing values
> * [Index.dropna](https://pandas.pydata.org/docs/reference/api/pandas.Index.dropna.html) deletes missing indices

To do this, we first display the dimensionality of the DataFrame with [pandas.DataFrame.shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html):

In [7]:
df.shape

(146397, 7)

In [8]:
df.dropna().shape

(46116, 7)

In [9]:
df.dropna(how='all', axis=1).shape

(146397, 7)

* `how='all'` removes a row or column only if all of its values are NA. The default, `how='any'`, removes it as soon as any NA value is present.
* `axis=1` applies the check to columns instead of rows, so columns are removed rather than rows.
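The difference between the `how` and `axis` options is easier to see on a small example. This is a sketch with a made-up DataFrame, not the IoT data:

```python
import numpy as np
import pandas as pd

# Made-up data: every row and every column contains at least one NaN,
# row 1 and column "c" consist of NaN only.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [np.nan, np.nan, np.nan],
})

any_rows = toy.dropna()                     # drop rows with any NaN
all_rows = toy.dropna(how="all")            # drop rows that are entirely NaN
any_cols = toy.dropna(axis=1)               # drop columns with any NaN
all_cols = toy.dropna(how="all", axis=1)    # drop columns that are entirely NaN

print(any_rows.shape, all_rows.shape, any_cols.shape, all_cols.shape)
```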

### 2.2 Find all columns where all data is present

In [10]:
my_columns = list(df.columns)

In [11]:
my_columns

['timestamp',
 'username',
 'temperature',
 'heartrate',
 'build',
 'latest',
 'note']

In [12]:
list(df.dropna(thresh=int(df.shape[0] * .9), axis=1).columns)

['timestamp', 'username', 'heartrate']

`thresh` is the minimum number of non-NA values required to keep a row or column. Here, with `axis=1`, a column is dropped unless at least 90% of its values are present.
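A minimal sketch of `thresh` on a made-up DataFrame, assuming a threshold of 75% instead of the 90% used above:

```python
import numpy as np
import pandas as pd

# Made-up data with 4, 3 and 1 non-missing values per column.
toy = pd.DataFrame({
    "full": [1, 2, 3, 4],
    "mostly": [1.0, 2.0, 3.0, np.nan],
    "sparse": [np.nan, np.nan, np.nan, 4.0],
})

# Keep only columns with at least 75% non-missing values (3 of 4 rows).
min_present = int(len(toy) * 0.75)
kept = list(toy.dropna(thresh=min_present, axis=1).columns)

print(min_present)
print(kept)
```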

### 2.3 Find all columns where data is missing

In [13]:
missing_info = list(df.columns[df.isnull().any()])

* [pandas.DataFrame.isnull](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html) returns a boolean DataFrame that marks missing values with `True`.
* [pandas.DataFrame.any](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html) returns whether any element is `True`, by default for each column.

In [14]:
missing_info

['temperature', 'build', 'latest', 'note']
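The combination of `isnull` and `any` can be sketched on a small made-up DataFrame. The column names are assumptions for illustration. The counterpart, `notnull().all()`, selects the columns that are complete:

```python
import numpy as np
import pandas as pd

# Made-up data: "id" is complete, "temp" and "note" have gaps.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "temp": [20.5, np.nan, 19.0],
    "note": ["ok", None, None],
})

# Boolean Series indexed by column name: True where a column has any missing value.
has_missing = toy.isnull().any()

columns_with_gaps = list(toy.columns[has_missing])
complete_columns = list(toy.columns[toy.notnull().all()])

print(columns_with_gaps)
print(complete_columns)
```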

In [15]:
for col in missing_info:
    # Summing the boolean mask counts the missing values in the column
    num_missing = df[col].isnull().sum()
    print('number missing for column {}: {}'.format(col, 
                                                    num_missing))

number missing for column temperature: 32357
number missing for column build: 32350
number missing for column latest: 32298
number missing for column note: 48704


* `num_missing` indicates the number of missing values per column.

In [16]:
for col in missing_info:
    # Share of missing values as a fraction between 0 and 1
    percent_missing = df[col].isnull().sum() / df.shape[0]
    print('percent missing for column {}: {}'.format(
        col, percent_missing))

percent missing for column temperature: 0.22102228870810195
percent missing for column build: 0.22097447352063226
percent missing for column latest: 0.22061927498514314
percent missing for column note: 0.332684412931959


### 2.4 Replace missing data

To be able to check our changes we use [pandas.Series.value_counts](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html). It returns a series containing counts of unique values:

In [17]:
df.latest.value_counts()

0.0    75735
1.0    38364
Name: latest, dtype: int64

Now we replace the missing values with [DataFrame.fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html):

In [18]:
df.latest = df.latest.fillna(0)

In [19]:
df.latest.value_counts()

0.0    108033
1.0     38364
Name: latest, dtype: int64
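`fillna` also accepts a dictionary that maps column names to replacement values, which allows a different default per column. The following is a sketch on made-up data. The replacement value `'unknown'` is an assumption for illustration and is not used in the notebook:

```python
import numpy as np
import pandas as pd

# Made-up data with one gap in each column.
toy = pd.DataFrame({
    "latest": [0.0, np.nan, 1.0],
    "note": ["wake", None, "sleep"],
})

# One replacement value per column; fillna returns a new DataFrame.
filled = toy.fillna({"latest": 0, "note": "unknown"})

print(filled)
```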

### 2.5 Replace missing data using `backfill`

For this, we first set `timestamp` as the index with [set_index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html):

In [20]:
df = df.set_index('timestamp')

In [21]:
df.head(20)

timestamp,username,temperature,heartrate,build,latest,note
2017-01-01T12:00:23,michaelsmith,12.0,67,4e6a7805-8faa-2768-6ef6-eb3198b483ac,0.0,interval
2017-01-01T12:01:09,kharrison,6.0,78,7256b7b0-e502-f576-62ec-ed73533c9c84,0.0,wake
2017-01-01T12:01:34,smithadam,5.0,89,9226c94b-bb4b-a6c8-8e02-cb42b53e9c90,0.0,
2017-01-01T12:02:09,eddierodriguez,28.0,76,,0.0,update
2017-01-01T12:02:36,kenneth94,29.0,62,122f1c6a-403c-2221-6ed1-b5caa08f11e0,0.0,
2017-01-01T12:03:04,bryanttodd,13.0,86,0897dbe5-9c5b-71ca-73a1-7586959ca198,0.0,interval
2017-01-01T12:03:51,andrea98,17.0,81,1c07ab9b-5f66-137d-a74f-921a41001f4e,1.0,
2017-01-01T12:04:35,scott28,16.0,76,7a60219f-6621-e548-180e-ca69624f9824,0.0,interval
2017-01-01T12:05:05,hillpamela,5.0,82,a8b87754-a162-da28-2527-4bce4b3d4191,1.0,
2017-01-01T12:05:41,moorejeffrey,25.0,63,585f1a3c-0679-0ffe-9132-508933c70343,0.0,wake


Then we use [pandas.DataFrame.groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) to group the rows by `username` and fill the missing data with the `backfill` method of [pandas.core.groupby.DataFrameGroupBy.fillna](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.fillna.html). Within each group, a missing value is replaced by the next valid value that follows it. `limit` defines the maximum number of consecutive `NaN` values to fill:

In [22]:
df.temperature = df.groupby('username').temperature.fillna(
    method='backfill', limit=3)

In [23]:
for col in missing_info:
    # Summing the boolean mask counts the missing values in the column
    num_missing = df[col].isnull().sum()
    print('number missing for column {}: {}'.format(col, 
                                                    num_missing))

number missing for column temperature: 22633
number missing for column build: 32350
number missing for column latest: 0
number missing for column note: 48704
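Some temperature values remain missing because `limit` caps how many consecutive gaps are filled, and because a group may have no later valid value to fill from. The following sketch shows this on made-up data with `limit=2`. It uses the groupby `bfill` method, since newer pandas releases deprecate the `method` argument of `fillna`:

```python
import numpy as np
import pandas as pd

# Made-up data: "ann" has a gap of three values, "bob" a gap of one.
toy = pd.DataFrame({
    "username": ["ann", "ann", "ann", "ann", "bob", "bob"],
    "temperature": [np.nan, np.nan, np.nan, 20.0, np.nan, 15.0],
})

# Backfill within each user, filling at most two consecutive NaN values.
toy["filled"] = toy.groupby("username")["temperature"].bfill(limit=2)

print(toy)
```

Only the two gaps closest to the next valid value of `ann` are filled; the first row stays `NaN`, and no value crosses over from one user to another.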
