# Agenda: Cleaning data

1. `NaN` and cleaning it up
    - Series
    - Data frames
    - Two techniques: (a) replacing and (b) removing
2. Nullable types
3. Interpolation
4. Replacement of values

# Why clean our data? Because the real world is messy

- Sensors go dead
- People make mistakes
- People don't report data on time
- Weird errors

We have to balance removing the bad data against keeping enough data to work with. If we're data purists, then we run the risk of not having enough data left at all.

# `NaN` -- what is it, and how can we handle it?

Remember that `NaN` stands for "not a number," and it is a float value. It isn't equal to anything, including itself. Because `NaN` is a float, a series that contains it alongside integers gets a dtype of `float64`.
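Here is a quick way to check those claims for yourself. This is just a minimal sketch; the variable names are mine, chosen for illustration.

```python
import numpy as np
import pandas as pd

# NaN is an ordinary Python float under the hood
is_float = isinstance(np.nan, float)

# NaN is not equal to anything, including itself
equal_to_itself = (np.nan == np.nan)

# so instead of ==, use the .isna method to find NaN values
s = pd.Series([10, 20, np.nan, 40, 50])
mask = s.isna()
n_missing = int(mask.sum())

print(is_float, equal_to_itself, n_missing)
```

Since `==` can never find `NaN`, `isna` (or `pd.isna`) is the tool to reach for when you want to locate missing values.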

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
s = Series([10, 20, np.nan, 40, 50])

In [3]:
s

0    10.0
1    20.0
2     NaN
3    40.0
4    50.0
dtype: float64

In [4]:
s.astype(np.int64)

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

In [5]:
# one way to get rid of NaN is to use the .fillna method
# this replaces all NaN values with whatever value we give

s.fillna(5)

0    10.0
1    20.0
2     5.0
3    40.0
4    50.0
dtype: float64

In [6]:
# I can also calculate a value, and insert it there

s.fillna(s.mean())   # this is a common technique

0    10.0
1    20.0
2    30.0
3    40.0
4    50.0
dtype: float64

# Don't use `inplace=True`!

`fillna` and many other methods in Pandas take an optional keyword argument, `inplace`. If I pass `inplace=True`, then the series/data frame itself is modified, and we get `None` back from our operation.

This sounds like it'll save memory and be really convenient. It is neither! It doesn't save memory, and it means that we cannot do method chaining, because we get `None` back. The core Pandas developers keep threatening to deprecate and then remove the `inplace=True` option.
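To see why getting a new series back is useful, here is a small sketch of method chaining without `inplace`. Converting back to integers at the end is my own choice for illustration, not something the exercise requires.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, np.nan, 40, 50])

# fillna returns a new series, so we can keep chaining methods on it
result = (
    s
    .fillna(s.mean())     # replace NaN with the mean of the other values
    .astype(np.int64)     # works now, because no NaN values remain
)

print(result.tolist())
print(s.isna().sum())     # the original series still has its NaN
```

If you want to keep the result, assign it back to a variable, as in `s = s.fillna(s.mean())`.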

In [7]:
s = Series([10, 20, np.nan, 40, 50, 60, 70, np.nan, 90, 100])
s.fillna(s.mean())

0     10.0
1     20.0
2     55.0
3     40.0
4     50.0
5     60.0
6     70.0
7     55.0
8     90.0
9    100.0
dtype: float64

In [8]:
# there is another option, namely dropna
# as you can imagine from its name, it returns a new series without any of the 
# original series' NaN values

s.dropna()

0     10.0
1     20.0
3     40.0
4     50.0
5     60.0
6     70.0
8     90.0
9    100.0
dtype: float64

In [9]:
# you can still retrieve values via .iloc, which always uses the position
# but if you use .loc, be prepared to have things go missing on you when you dropna
# of course, if your index is a bunch of strings, then that's totally fine...
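The difference between `.iloc` and `.loc` after `dropna` can be sketched as follows. The variable names are mine, and the `reset_index` step at the end is just one possible way to renumber, assuming the original labels don't matter to you.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, np.nan, 40, 50, 60, 70, np.nan, 90, 100])
cleaned = s.dropna()

# .iloc is positional, so position 2 now holds what used to be at index 3
third_by_position = cleaned.iloc[2]

# .loc uses labels, and labels 2 and 7 were dropped along with their NaN values
missing_labels = [label for label in (2, 7) if label not in cleaned.index]

# optional: renumber from 0 if the original labels don't matter
renumbered = cleaned.reset_index(drop=True)

print(third_by_position)
print(missing_labels)
print(renumbered.index.tolist())
```

Asking for `cleaned.loc[2]` would raise a `KeyError`, because that label no longer exists.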

# Exercise: Missing weather details

1. Create a series in which the index is days of the week, and the values are the projected high temperatures for where you live in the next 10 days.
2. Assign `NaN` to three of those values.
3. First, use `fillna` to replace those values with the mean, and then with the median. Which seems to give closer/better values?
4. Next, use `dropna` to remove the `NaN` values. What happens now if you try to show a forecast? What advantages and disadvantages do you see?

In [3]:
s = Series([30, 30, 28, 28, 29, 30, 27, 26, 27, 28],
           index='Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu'.split())
s

Tue    30
Wed    30
Thu    28
Fri    28
Sat    29
Sun    30
Mon    27
Tue    26
Wed    27
Thu    28
dtype: int64

In [4]:
# 'Wed' appears twice in the index, so this sets three values to NaN
s.loc[['Wed', 'Sat']] = np.nan
s

Tue    30.0
Wed     NaN
Thu    28.0
Fri    28.0
Sat     NaN
Sun    30.0
Mon    27.0
Tue    26.0
Wed     NaN
Thu    28.0
dtype: float64

In [5]:
s.mean()

28.142857142857142

In [6]:
s.median()

28.0
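One possible way to finish steps 3 and 4 of the exercise is sketched here. It is not the only solution. I build the series with a float dtype from the start, which is my own assumption, so that assigning `NaN` doesn't need to change the dtype.

```python
import numpy as np
import pandas as pd

# same forecast as above, but stored as floats from the start
s = pd.Series([30, 30, 28, 28, 29, 30, 27, 26, 27, 28],
              index='Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu'.split(),
              dtype=float)
s.loc[['Wed', 'Sat']] = np.nan         # both 'Wed' entries, plus 'Sat'

mean_filled = s.fillna(s.mean())       # gaps become about 28.14
median_filled = s.fillna(s.median())   # gaps become 28.0
dropped = s.dropna()                   # 7 of the 10 days remain

print(mean_filled.round(2).tolist())
print(median_filled.tolist())
print(dropped.index.tolist())
```

The original values at the three missing positions were 30, 29, and 27, so both the mean and the median land within about two degrees of each. With `dropna`, the forecast simply skips those days, and because day names repeat in the index, it becomes harder to tell which week a remaining 'Tue' or 'Thu' belongs to.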