# Handling Missing Data

Missing data is common in most data analysis applications.

One of the goals in designing pandas was to make working with missing data as painless as possible.

In [None]:
import pandas as pd

In [None]:
from numpy import nan
NaN = NAN = nan  # NumPy 2.0 removed these aliases of nan, so define them locally

**Pandas** uses the floating-point value NaN (Not a Number) to represent missing data in
both floating-point and non-floating-point arrays.
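
As a minimal sketch of what this means in practice (the Series names `numbers` and `cities` are made up for illustration), a missing entry in numeric data is stored as NaN, which makes the whole Series floating-point, while missing entries among strings are still detected:

```python
import numpy as np
import pandas as pd

# A missing entry in integer-looking data is stored as NaN,
# so the Series ends up with a float dtype.
numbers = pd.Series([1, 2, None])
print(numbers.dtype)  # float64

# Among strings, both NaN and None are flagged as missing.
cities = pd.Series(['limbe', np.nan, None])
print(cities.isna().tolist())  # [False, True, True]
```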

Missing values are neither true nor false, and they are not equal to each other.
The reason behind this is that a missing value could be 10, 2.5, or anything else:
we have no idea what it is.

In [None]:
nan == True

In [None]:
nan == False

In [None]:
nan == nan

In [None]:
pd.isnull(nan)

In [None]:
pd.isnull(42)

In [None]:
pd.isnull('-')

In [None]:
NaN == nan
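
One practical consequence, sketched here on a made-up Series called `values`: comparing with `==` never finds missing entries, so a test such as `isna()` (an alias of `isnull()`) is needed instead.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# Equality never matches NaN, so this mask is False everywhere.
eq_mask = values == np.nan

# isna() is the reliable way to locate missing entries.
na_mask = values.isna()

print(eq_mask.tolist())  # [False, False, False]
print(na_mask.tolist())  # [False, True, False]
```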

Let's create a Series with missing values in it.

In [None]:
data = pd.Series(['limbe', 'douala', nan, 'buea', NAN, 'muyuka', NaN, 'bafut', 'wum'])
data

In [None]:
data.isnull()  # append .value_counts() to tally missing vs. present values

In [None]:
data.notnull()

We have a number of options for filtering out missing data. While doing it by hand is
always an option, **dropna** can be very helpful. On a Series, it returns the Series with only
the non-null data and index values:

In [None]:
data.dropna()

In [None]:
data[data.notnull()]

With DataFrame objects, things are a bit more complex. You may want to drop rows
or columns that are all NA, or just those containing any NAs. By default, **dropna** drops
any row containing a missing value.

In [None]:
df = pd.DataFrame([
    ['alan', 'bob', 'tim', NAN, 'jonas', 'kate'],
    [2, 4, 6, 8, 10, 12],
    [NaN] * 6,  # a row made entirely of missing values
    list('abcdef'),
])

In [None]:
df

In [None]:
df.dropna(axis=0)

In [None]:
df.dropna(axis=1)

In [None]:
df.dropna(how='all' , axis=0)

Dropping columns in the same way is only a matter of passing **axis=1**.

In [None]:
df.dropna(axis=1, how='all')

In [None]:
df
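
To compare the **dropna** options side by side, here is a small sketch on a made-up frame called `toy` (its column names are illustrative only). It also shows one common combination: drop the all-NA columns first, then drop any row that is still incomplete.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'x': [1.0, 2.0, np.nan],
    'y': [4.0, np.nan, np.nan],
    'z': [np.nan, np.nan, np.nan],
})

rows_any = toy.dropna()                   # every row has a NaN (column z), so nothing is left
rows_all = toy.dropna(how='all')          # only row 2 is entirely NaN
cols_all = toy.dropna(axis=1, how='all')  # only column z is entirely NaN

# Drop the empty column first, then the rows that are still incomplete.
cleaned = toy.dropna(axis=1, how='all').dropna()

print(list(rows_all.index))    # [0, 1]
print(list(cols_all.columns))  # ['x', 'y']
print(list(cleaned.index))     # [0]
```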

In [None]:
# The Ebola dataset: 2014 West Africa outbreak
ebola = pd.read_csv('data/ebola_country_timeseries.csv')

In [None]:
ebola.head()

In [None]:
ebola.info()  # the dtype "object" usually means the column holds strings

In [None]:
# tally of every value in the column, missing values included
ebola['Cases_Liberia'].value_counts(dropna=False).head()
# counts are sorted in descending order; any count above 1 marks a duplicated value
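
A related tally, sketched on a made-up frame (`toy`, `cases`, and `deaths` are illustrative names, not columns of the Ebola file), is the number of missing values per column, obtained by summing the boolean mask from `isna()`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'cases': [10.0, np.nan, 30.0, np.nan],
    'deaths': [1.0, 2.0, np.nan, 4.0],
})

# True counts as 1, so summing the mask counts the missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# With dropna=False, NaN is kept as its own category in the tally.
counts = toy['cases'].value_counts(dropna=False)
print(counts)
```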

In [None]:
# how many unique
ebola['Cases_Guinea'].nunique()

In [None]:
# the actual unique values
ebola['Cases_Guinea'].unique()

In [None]:
ebola.head()

### Filling the missing values

Rather than filtering out missing data (and potentially discarding other data along with
it), you may want to fill in the “holes” in any number of ways. 

For most purposes, the **fillna** method is the workhorse function to use.
Calling fillna with a constant replaces missing values with that value.

In [None]:
ebola.fillna(100)

In [None]:
ebola.head()
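
Note that **fillna** returns a new object rather than modifying its input, so the original frame keeps its missing values unless the result is assigned back. A minimal sketch on a made-up frame (`toy` and its columns are illustrative), which also passes a dict to use a different fill value per column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})

filled = toy.fillna(0)                       # new DataFrame; toy is unchanged
per_column = toy.fillna({'a': -1, 'b': 99})  # one fill value per column

print(toy.isna().sum().sum())     # 2
print(filled.isna().sum().sum())  # 0
print(per_column)
```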

However, there are more sophisticated ways to fill missing values.


For certain measurements, it can be convenient to fill each missing value with the last valid value before it or the next valid value after it.

We therefore use the forward fill **ffill** and the backward fill **bfill**.

In [None]:
ebola.ffill().head()  # fillna(method='ffill') is deprecated in recent pandas

In [None]:
ebola.bfill().head()  # fillna(method='bfill') is deprecated in recent pandas

In [None]:
ebola.head()
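
The direction of the fill matters at the edges. In this sketch on a made-up Series `s`, a leading gap survives a forward fill because there is no earlier value to carry forward, and a trailing gap survives a backward fill for the mirror-image reason:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()   # carry the last valid value forward
backward = s.bfill()  # pull the next valid value backward

print(forward.tolist())   # [nan, 1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # [1.0, 1.0, 4.0, 4.0, 4.0, nan]
```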

When summing a column that contains missing values, the .sum() method automatically skips them.
The same holds for other descriptive statistics such as mean(), std(), and var().

In [None]:
ebola['Cases_Guinea'].sum()

In [None]:
ebola['Cases_Guinea'].sum(skipna=False)
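
The effect of `skipna` is easier to verify on a tiny made-up Series (`s` is illustrative): by default the statistics use only the observed values, whereas `skipna=False` lets a single NaN propagate to the result.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                     # NaN skipped: 1.0 + 3.0
strict_total = s.sum(skipna=False)  # NaN propagates to the result
mean = s.mean()                     # average of the two observed values

print(total)         # 4.0
print(strict_total)  # nan
print(mean)          # 2.0
print(s.count())     # 2
```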

In [None]:
df = pd.DataFrame({
    'a': [1, 2, 88, 99],
    'b': [3, NaN, 999, NaN]
})
df

In [None]:
# convert the values 88, 99, and 999 to NaN so they are treated as missing
df.replace(to_replace=[88, 99, 999], value=NaN)