# Agenda: Cleaning data

1. `NaN` and cleaning it up
    - Series
    - Data frames
    - Two techniques: (a) replacing and (b) removing
2. Nullable types
3. Interpolation
4. Replacement of values

# Why clean our data? Because the real world is messy

- Sensors go dead
- People make mistakes
- People don't report data on time
- Weird errors

We have to balance removing the bad data against getting rid of too much data. If we're data purists, then we run the risk of not having enough data left to work with at all.

# `NaN` -- what is it, and how can we handle it?

Remember that `NaN` stands for "not a number," and it is a float value. It isn't equal to anything, including itself.
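
A minimal sketch of what "not equal to itself" means in practice. The variable names here are just for illustration; the point is that `==` can never find a `NaN`, while `isna` can.

```python
import numpy as np
import pandas as pd

nan_equals_itself = (np.nan == np.nan)      # NaN is not equal to anything, itself included
nan_is_float = isinstance(np.nan, float)    # NaN is a float value

s = pd.Series([10, 20, np.nan, 40])
eq_mask = (s == np.nan)     # comparing with == never matches the NaN
isna_mask = s.isna()        # isna marks exactly the missing values

print(nan_equals_itself, nan_is_float)
print(eq_mask.tolist())
print(isna_mask.tolist())
```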

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
s = Series([10, 20, np.nan, 40, 50])

In [3]:
s

0    10.0
1    20.0
2     NaN
3    40.0
4    50.0
dtype: float64

In [4]:
s.astype(np.int64)

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

In [5]:
# one way to get rid of NaN is to use the .fillna method
# this replaces all NaN values with whatever value we give

s.fillna(5)

0    10.0
1    20.0
2     5.0
3    40.0
4    50.0
dtype: float64

In [6]:
# I can also calculate a value, and insert it there

s.fillna(s.mean())   # this is a common technique

0    10.0
1    20.0
2    30.0
3    40.0
4    50.0
dtype: float64

# Don't use inplace=True!

`fillna` and many other methods in Pandas take an optional keyword argument, `inplace`. If I pass `inplace=True`, then the series/data frame itself is modified, and we get `None` back from our operation.

This sounds like it'll save memory and be really convenient. It is neither! It doesn't save memory, and it means that we cannot do method chaining, because we get `None` back. The core Pandas developers keep threatening to deprecate and then remove the `inplace=True` option.
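
Here is a small sketch of the chaining style that leaving out `inplace` makes possible. It assumes we want integers back after filling, which is one way around the `IntCastingNaNError` we saw earlier; assigning the result back to a variable takes the place of `inplace=True`.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, np.nan, 40, 50])

# fillna returns a new series, so we can call astype directly on its result
chained = s.fillna(s.mean()).astype(np.int64)

# to keep the cleaned version, assign the result rather than modifying in place
s = s.fillna(s.mean())

print(chained)
```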

In [7]:
s = Series([10, 20, np.nan, 40, 50, 60, 70, np.nan, 90, 100])
s.fillna(s.mean())

0     10.0
1     20.0
2     55.0
3     40.0
4     50.0
5     60.0
6     70.0
7     55.0
8     90.0
9    100.0
dtype: float64

In [8]:
# there is another option, namely dropna
# as you can imagine from its name, it returns a new series without any of the 
# original series' NaN values

s.dropna()

0     10.0
1     20.0
3     40.0
4     50.0
5     60.0
6     70.0
8     90.0
9    100.0
dtype: float64

In [9]:
# you can still retrieve values via .iloc, which always uses the position
# but if you use .loc, be prepared to have index labels go missing on you after dropna
# of course, if your index is a bunch of strings, then that's totally fine...
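
A sketch of that difference, using a shorter series than the one above. The call to `reset_index(drop=True)` is one possible way to renumber the labels, not something this exercise requires.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, np.nan, 40, 50])
cleaned = s.dropna()

by_position = cleaned.iloc[2]           # third remaining value, whatever its label is
labels_left = cleaned.index.tolist()    # label 2 disappeared along with its NaN

try:
    cleaned.loc[2]
    loc_found = True
except KeyError:
    loc_found = False                   # .loc[2] fails, because that label is gone

# renumber the labels from 0 if the gaps get in the way
renumbered = cleaned.reset_index(drop=True)

print(labels_left, by_position, loc_found)
```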

# Exercise: Missing weather details

1. Create a series in which the index is days of the week, and the values are the projected high temperatures for where you live in the next 10 days.
2. Assign `NaN` to three of those values.
3. First, use `fillna` to replace those values with the mean and the median. Which seems to give closer/better values?
4. Next, use `dropna` to remove the `NaN` values. What happens now if you try to show a forecast? What advantages and disadvantages do you see?

In [7]:
s = Series([30, 30, 28, 28, 29, 30, 34, 26, 27, 28],
           index='Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu'.split())
s

Tue    30
Wed    30
Thu    28
Fri    28
Sat    29
Sun    30
Mon    34
Tue    26
Wed    27
Thu    28
dtype: int64

In [8]:
# 'Wed' appears twice in the index, so this assigns NaN to three values
s.loc[['Wed', 'Sat']] = np.nan
s

Tue    30.0
Wed     NaN
Thu    28.0
Fri    28.0
Sat     NaN
Sun    30.0
Mon    34.0
Tue    26.0
Wed     NaN
Thu    28.0
dtype: float64

In [9]:
s.mean()

29.142857142857142

In [10]:
s.median()

28.0

In [11]:
# fill in the values with mean 
s.fillna(s.mean())

Tue    30.000000
Wed    29.142857
Thu    28.000000
Fri    28.000000
Sat    29.142857
Sun    30.000000
Mon    34.000000
Tue    26.000000
Wed    29.142857
Thu    28.000000
dtype: float64

In [12]:
s.fillna(s.median())

Tue    30.0
Wed    28.0
Thu    28.0
Fri    28.0
Sat    28.0
Sun    30.0
Mon    34.0
Tue    26.0
Wed    28.0
Thu    28.0
dtype: float64

In [13]:
s.dropna()

Tue    30.0
Thu    28.0
Fri    28.0
Sun    30.0
Mon    34.0
Tue    26.0
Thu    28.0
dtype: float64

# Data frames and `NaN`

As a general rule, anything that we can do with a series, we can also do with a data frame. And when we do that, the method is applied to every single column.

In [14]:
np.random.seed(0)
df = DataFrame(np.random.randint(-500, 500, [3,4]),
               index=list('abc'),
               columns=list('wxyz'))
df

|   | w | x | y | z |
|---|---|---|---|---|
| a | 184 | 59 | 129 | -308 |
| b | 335 | 263 | 207 | -141 |
| c | -491 | 223 | -223 | 254 |


In [15]:
df.loc['a', 'w'] = np.nan
df.loc['a', 'y'] = np.nan
df.loc['c', 'y'] = np.nan

df



|   | w | x | y | z |
|---|---|---|---|---|
| a | NaN | 59 | NaN | -308 |
| b | 335.0 | 263 | 207.0 | -141 |
| c | -491.0 | 223 | NaN | 254 |


In [16]:
# if I want to run fillna, I can with a scalar value
df.fillna(9999)

|   | w | x | y | z |
|---|---|---|---|---|
| a | 9999.0 | 59 | 9999.0 | -308 |
| b | 335.0 | 263 | 207.0 | -141 |
| c | -491.0 | 223 | 9999.0 | 254 |


In [17]:
# I can do better than that!
# I can say df.mean()

df.mean()

w    -78.000000
x    181.666667
y    207.000000
z    -65.000000
dtype: float64

In [18]:
# watch what happens when we now use df.fillna with the results of df.mean:

df.fillna(df.mean())

|   | w | x | y | z |
|---|---|---|---|---|
| a | -78.0 | 59 | 207.0 | -308 |
| b | 335.0 | 263 | 207.0 | -141 |
| c | -491.0 | 223 | 207.0 | 254 |


In [19]:
# I can pass, if I want, a dict to fillna, whose keys are
# the column names. We can indicate what value(s) we want to 
# pass to each column.

df.fillna({'w':9999, 'y':df['y'].mean()})

|   | w | x | y | z |
|---|---|---|---|---|
| a | 9999.0 | 59 | 207.0 | -308 |
| b | 335.0 | 263 | 207.0 | -141 |
| c | -491.0 | 223 | 207.0 | 254 |


In [20]:
df

|   | w | x | y | z |
|---|---|---|---|---|
| a | NaN | 59 | NaN | -308 |
| b | 335.0 | 263 | 207.0 | -141 |
| c | -491.0 | 223 | NaN | 254 |


In [21]:
# what will happen when I use dropna?
# every row containing NaN will be removed
# (or really, will not be in the new data frame that's returned)

df.dropna()

|   | w | x | y | z |
|---|---|---|---|---|
| b | 335.0 | 263 | 207.0 | -141 |


In [22]:
# we can limit the degree to which nan is seen as a problem, and dropped. We can pass
# one or both of the following keyword arguments:

# 1. thresh, an integer indicating how many good values a row must have to be kept
# 2. subset, a list of column names (strings) that we want to look at when determining if we should drop the row
#   meaning: if a column is not in subset, then we don't look at it when making a decision

df.dropna(thresh=3)  # this means: a row with at least 3 non-NaN values is kept

|   | w | x | y | z |
|---|---|---|---|---|
| b | 335.0 | 263 | 207.0 | -141 |
| c | -491.0 | 223 | NaN | 254 |


In [24]:
df

|   | w | x | y | z |
|---|---|---|---|---|
| a | NaN | 59 | NaN | -308 |
| b | 335.0 | 263 | 207.0 | -141 |
| c | -491.0 | 223 | NaN | 254 |


In [26]:
df.dropna(subset=['w', 'z'])

|   | w | x | y | z |
|---|---|---|---|---|
| b | 335.0 | 263 | 207.0 | -141 |
| c | -491.0 | 223 | NaN | 254 |
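
Since `thresh` and `subset` can be passed together, here is a sketch of how they combine. It rebuilds the same data frame by hand so that it runs on its own; the choice of columns and threshold is only an illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'w': [np.nan, 335.0, -491.0],
                   'x': [59, 263, 223],
                   'y': [np.nan, 207.0, np.nan],
                   'z': [-308, -141, 254]},
                  index=list('abc'))

# look only at columns w and y, and keep rows with at least 1 non-NaN value there
kept = df.dropna(subset=['w', 'y'], thresh=1)

print(kept)
```

Row `a` has `NaN` in both `w` and `y`, so it is dropped. Row `c` still has a value in `w`, so it stays, even though its `y` is `NaN`.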


# Exercise: Boston temperatures

1. Read the file at `../data/boston,ma.csv` into a data frame. We only care about two columns, namely the min and max temperatures that were recorded during that period.
2. Put `NaN` values in every 5th item of the maxtemp.
3. Put `NaN` values in every 3rd item of the mintemp. (Note: For both of these, using `iloc` and a slice will really come in handy.)
4. How many rows remain (as a percentage) if we do a total `dropna`?
5. What if we only drop those rows where there is a `NaN` in the mintemp?
6. Replace the `NaN` values with the mean.
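
As a hint, here is a sketch of the `iloc`-plus-slice idea on a small made-up data frame (not the Boston data). It assumes that "every 5th" and "every 3rd" start counting from the first row, and the column values are invented for illustration.

```python
import numpy as np
import pandas as pd

# made-up stand-in data with 15 rows, not the Boston file
df = pd.DataFrame({'maxtemp': np.arange(10, 25, dtype=float),
                   'mintemp': np.arange(0, 15, dtype=float)})

df.iloc[::5, 0] = np.nan    # every 5th row of column 0 (maxtemp)
df.iloc[::3, 1] = np.nan    # every 3rd row of column 1 (mintemp)

# percentage of rows that survive each kind of dropna
pct_all = len(df.dropna()) / len(df) * 100
pct_min = len(df.dropna(subset=['mintemp'])) / len(df) * 100

# replace each column's NaN values with that column's mean
filled = df.fillna(df.mean())

print(pct_all, pct_min)
```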

In [27]:
filename = '../data/boston,ma.csv'
!head $filename

date_time,"boston,ma_maxtempC","boston,ma_mintempC","boston,ma_totalSnow_cm","boston,ma_sunHour","boston,ma_uvIndex","boston,ma_uvIndex","boston,ma_moon_illumination","boston,ma_moonrise","boston,ma_moonset","boston,ma_sunrise","boston,ma_sunset","boston,ma_DewPointC","boston,ma_FeelsLikeC","boston,ma_HeatIndexC","boston,ma_WindChillC","boston,ma_WindGustKmph","boston,ma_cloudcover","boston,ma_humidity","boston,ma_precipMM","boston,ma_pressure","boston,ma_tempC","boston,ma_visibility","boston,ma_winddirDegree","boston,ma_windspeedKmph"
2018-12-11 00:00:00,1,-4,0.0,8.7,2,0,21,10:19 AM,08:12 PM,07:03 AM,04:11 PM,-7,-3,0,-3,10,0,57,0.0,1022,-3,10,339,8
2018-12-11 03:00:00,1,-4,0.0,8.7,2,0,21,10:19 AM,08:12 PM,07:03 AM,04:11 PM,-7,-1,1,-1,7,2,57,0.0,1023,-3,10,319,6
2018-12-11 06:00:00,1,-4,0.0,8.7,2,0,21,10:19 AM,08:12 PM,07:03 AM,04:11 PM,-9,-5,-3,-5,8,4,60,0.0,1023,-4,10,334,7
2018-12-11 09:00:00,1,-4,0.0,8.7,2,2,21,10:19 AM,08:12 PM,07:03 AM,04:11 PM,-9,1,1,1,3,6,49,0.0,1022,-1,10,334,

In [29]:
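# header=0 tells read_csv to skip the file's own header row;
# otherwise, passing names= would cause that row to be read as data
df = pd.read_csv(filename, usecols=[1, 2],
                 names=['maxtemp', 'mintemp'], header=0)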