<img src="https://pandas.pydata.org/static/img/pandas.svg" width="250">

## <center> Working with missing data </center>

In [1]:
import pandas as pd

In [2]:
temps = pd.DataFrame({"sequence":[1,2,3,4,5],
          "measurement_type":['actual','actual','actual',None,'estimated'],
          "temperature_f":[67.24,84.56,91.61,None,49.64]
         })
temps

,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,,
4,5,estimated,49.64


Using `isna()` to identify null values in a DataFrame

In [3]:
temps.isna()

,sequence,measurement_type,temperature_f
0,False,False,False
1,False,False,False
2,False,False,False
3,False,True,True
4,False,False,False
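
The boolean mask returned by `isna()` is often more useful when summarized. The following is a minimal sketch, rebuilding the same `temps` frame so it runs on its own; the names `null_counts` and `rows_with_nulls` are illustrative and not part of the original notebook.

```python
import pandas as pd

temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement_type": ["actual", "actual", "actual", None, "estimated"],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# True counts as 1, so summing the mask gives the number of nulls per column.
null_counts = temps.isna().sum()

# Select the rows that contain at least one null value.
rows_with_nulls = temps[temps.isna().any(axis=1)]

print(null_counts)
print(rows_with_nulls)
```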


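How do calculations handle missing data? By default, `cumsum()` skips nulls (`skipna=True`); with `skipna=False`, every value from the first null onward becomes `NaN`.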

In [4]:
temps['temperature_f'].cumsum()

0     67.24
1    151.80
2    243.41
3       NaN
4    293.05
Name: temperature_f, dtype: float64

In [5]:
temps['temperature_f'].cumsum(skipna=False)

0     67.24
1    151.80
2    243.41
3       NaN
4       NaN
Name: temperature_f, dtype: float64
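
The same `skipna` switch applies to reductions such as `sum()` and `mean()`. A small sketch using the temperature values from above (the variable names are illustrative only):

```python
import pandas as pd

temperature_f = pd.Series([67.24, 84.56, 91.61, None, 49.64], name="temperature_f")

# Default skipna=True: the NaN is ignored.
total_skipping = temperature_f.sum()

# skipna=False: a single NaN makes the whole result NaN.
total_strict = temperature_f.sum(skipna=False)

# The mean is taken over the four non-null values only.
mean_skipping = temperature_f.mean()

print(total_skipping, total_strict, mean_skipping)
```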

In [9]:
# dropna=False keeps rows whose group key is null as their own group
temps.groupby(by=['measurement_type'],dropna=False).max()

measurement_type,sequence,temperature_f
actual,3,91.61
estimated,5,49.64
NaN,4,NaN
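
For comparison, here is a sketch of the same grouping with and without `dropna=False`. With the default (`dropna=True`), the row whose `measurement_type` is null is left out of the result entirely. The variable names are illustrative.

```python
import pandas as pd

temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement_type": ["actual", "actual", "actual", None, "estimated"],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# Default behaviour: null group keys are dropped.
default_groups = temps.groupby("measurement_type").max()

# dropna=False: the null key forms its own group.
with_na_groups = temps.groupby("measurement_type", dropna=False).max()

print(default_groups)
print(with_na_groups)
```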


Dealing with missing data: The blunt approach using `dropna()`

In [7]:
# drop rows that contain any null (axis=0 is the default)
temps.dropna()

,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
4,5,estimated,49.64


In [10]:
# drop columns that contain any null using axis=1
temps.dropna(axis=1)

,sequence
0,1
1,2
2,3
3,4
4,5
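
`dropna()` can be made less blunt with its `subset`, `how`, and `thresh` parameters. The sketch below rebuilds `temps` so it runs on its own; the result names are illustrative.

```python
import pandas as pd

temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement_type": ["actual", "actual", "actual", None, "estimated"],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# Only consider nulls in temperature_f when deciding which rows to drop.
temp_required = temps.dropna(subset=["temperature_f"])

# how="all" drops a row only if every value in it is null.
# No row qualifies here, because sequence is always populated.
all_null_dropped = temps.dropna(how="all")

# thresh=2 keeps rows that have at least two non-null values.
at_least_two = temps.dropna(thresh=2)

print(temp_required)
print(all_null_dropped)
print(at_least_two)
```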


Replace null values using `fillna()`

In [11]:
temps.fillna(0)

,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,0,0.0
4,5,estimated,49.64
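
Filling every column with `0` puts a number into the text column `measurement_type`. One alternative is to pass a dict so each column gets its own replacement. This is only a sketch: the `"unknown"` label and the choice of the column mean are assumptions made for illustration, not recommendations from the original notebook.

```python
import pandas as pd

temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement_type": ["actual", "actual", "actual", None, "estimated"],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# Per-column fill values: a label for the text column, the mean for the numeric one.
fill_values = {
    "measurement_type": "unknown",                  # hypothetical placeholder label
    "temperature_f": temps["temperature_f"].mean(),  # mean of the non-null values
}

# fillna returns a new DataFrame; temps itself is left unchanged.
filled = temps.fillna(fill_values)
print(filled)
```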


In [12]:
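# forward-fill: carry the last valid value down into the gap
# (fillna(method='pad') is deprecated in recent pandas; ffill() is the equivalent)
temps.ffill()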

,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,actual,91.61
4,5,estimated,49.64
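
Forward-filling has a mirror image, `bfill()`, which pulls the next valid value back into the gap. A short sketch on the temperature values alone (variable names are illustrative):

```python
import pandas as pd

temperature_f = pd.Series([67.24, 84.56, 91.61, None, 49.64], name="temperature_f")

# Forward fill: the gap takes the previous valid value.
forward_filled = temperature_f.ffill()

# Backward fill: the gap takes the next valid value.
backward_filled = temperature_f.bfill()

print(forward_filled[3], backward_filled[3])
```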


Estimate missing values with `interpolate()`

In [16]:
# linear interpolation (the default) fills the gap with a value
# between the previous and next valid values
temps.interpolate()

,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,,70.625
4,5,estimated,49.64
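
Interpolation only makes sense for numeric data, which is why `measurement_type` stays empty in the output above. A sketch that applies it to the numeric column alone, assuming evenly spaced rows so that linear interpolation gives the midpoint of the two neighbours:

```python
import pandas as pd

temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement_type": ["actual", "actual", "actual", None, "estimated"],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# Interpolate only the numeric column; method="linear" is the default.
interpolated = temps["temperature_f"].interpolate(method="linear")

# With one missing point between two known values, the result is their midpoint.
midpoint = (91.61 + 49.64) / 2
print(interpolated[3], midpoint)
```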
