

# Synopsis

This notebook covers the following topics:


1) What a missing value is

2) How missing values are created

3) Calculations with missing values


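# What is NaN

- NaN stands for "Not a Number"
- A special floating-point value (its Python type is `float`)
- Not equal to anything, including itself


- Defined in `numpy` as `nan`
- NumPy versions before 2.0 also provide the aliases `NaN` and `NAN`; NumPy 2.0 removed `NaN`, so `nan` is the portable spelling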

In [None]:
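# Import the missing value marker from numpy.
# NumPy 2.0 removed the `NaN` alias, so import `nan` and define the
# other spellings used in this notebook locally.
from numpy import nan

NaN = NAN = nan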

## Comparing to other values

NaN is not equal to `True`, `False`, `0`, or the empty string.

In [None]:
print(NaN == True)

print(NaN == False)

print(NaN == 0)

print(NaN == '')


## Comparing to other NaNs

NaN is not equal to any other NaN, or even to itself.

In [None]:
print(NaN == NaN)

print(NaN == nan)

print(NaN == NAN)

print(nan == NAN)
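
Because NaN never compares equal to itself, an equality test cannot be used to detect it. The following is a minimal sketch, using only `math` and `numpy`, of checks that do work; the variable names are illustrative.

```python
import math

import numpy as np

value = np.nan

# NaN is an ordinary Python float, not a separate data type
is_float = isinstance(value, float)

# Equality never matches, so `value == np.nan` cannot detect a missing value
equal_to_itself = value == value

# Dedicated checks do detect it
detected_math = math.isnan(value)
detected_numpy = bool(np.isnan(value))

print(is_float, equal_to_itself, detected_math, detected_numpy)
# True False True True
```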

## Testing for a missing value

Use the built-in pandas functions (also available under the names `isna()` and `notna()`):

- `isnull()` 
- `notnull()`

In [None]:
import pandas as pd

print(pd.isnull(NaN))
print(pd.isnull(nan))
print(pd.isnull(NAN))

print(pd.notnull(NaN))
print(pd.notnull(42))
print(pd.notnull('some text'))
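
The same checks also work element by element on a whole Series. The sketch below uses a small made-up Series (not one of the notebook's data files) to show how the resulting boolean mask can be used to filter out missing entries.

```python
import numpy as np
import pandas as pd

# Toy data for illustration; None is stored as NaN in a float Series
readings = pd.Series([1.5, np.nan, 3.0, None])

missing_mask = readings.isnull()   # True where a value is missing
present_mask = readings.notnull()  # the exact opposite

print(missing_mask.tolist())            # [False, True, False, True]
print(readings[present_mask].tolist())  # [1.5, 3.0]
```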

# Where does missing data come from?

The main sources are:

- input data
- merging data
- re-indexing


Less common sources are:
- user input
- bugs in code

## 1) From input data

In [None]:
# Load in data - accept all defaults
df = pd.read_csv(filepath_or_buffer='../Data/survey_visited.csv')

display(df)

In [None]:
# Load in data - without converting the default missing-value markers (e.g. empty fields, 'NA') to NaN

df = pd.read_csv(filepath_or_buffer='../Data/survey_visited.csv', keep_default_na=False)

display(df)

In [None]:
# Manually specify which strings count as missing (here, only empty fields)

df = pd.read_csv(filepath_or_buffer='../Data/survey_visited.csv', na_values=[''], keep_default_na=False)

display(df)
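
The effect of the three calls above can be reproduced without the data file. The sketch below reads a small inline CSV (made-up rows, not the contents of `survey_visited.csv`) and compares which cells end up as NaN under each setting.

```python
import io

import pandas as pd

# Hypothetical CSV text: one empty field and one literal "NA"
csv_text = "ident,site,dated\n619,DR-1,1927-02-08\n752,DR-3,\n837,MSK-4,NA\n"

# Defaults: both the empty field and "NA" are read as NaN
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False: nothing is converted, the strings are kept as-is
raw_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Only the empty string counts as missing; "NA" stays a normal string
custom_df = pd.read_csv(io.StringIO(csv_text), na_values=[''], keep_default_na=False)

print(default_df['dated'].isnull().tolist())  # [False, True, True]
print(raw_df['dated'].isnull().tolist())      # [False, False, False]
print(custom_df['dated'].isnull().tolist())   # [False, True, False]
```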

## 2) From merging data

In [None]:
visited = '../Data/survey_visited.csv'
survey = '../Data/survey_survey.csv'

df_visited = pd.read_csv(filepath_or_buffer=visited)
df_survey = pd.read_csv(filepath_or_buffer=survey)

# Inner join (the default): only rows whose keys match in both tables are kept
df_merged = df_visited.merge(right=df_survey, left_on='ident', right_on='taken')

display(df_merged)
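
`merge` performs an inner join by default, which only keeps rows whose keys match, so any NaN in `df_merged` was already present in the input files. A merge creates new missing values when it keeps unmatched rows, for example with `how='left'` or `how='outer'`. The sketch below uses two small made-up tables to illustrate this.

```python
import pandas as pd

# Hypothetical tables: visit 622 has no matching reading
visits = pd.DataFrame({'ident': [619, 622, 734], 'site': ['DR-1', 'DR-1', 'DR-3']})
readings = pd.DataFrame({'taken': [619, 734], 'reading': [9.82, 8.41]})

# Inner join: the unmatched visit is dropped, no NaN is created
inner = visits.merge(right=readings, left_on='ident', right_on='taken')

# Left join: every visit is kept, unmatched columns are filled with NaN
left = visits.merge(right=readings, how='left', left_on='ident', right_on='taken')

print(len(inner))                      # 2
print(left['reading'].isnull().sum())  # 1
```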

## 3) From re-indexing

In [None]:
df_gap = pd.read_csv(filepath_or_buffer='../Data/gapminder.tsv', sep='\t')

# Group by year, take the lifeExp column and calculate the mean
life_exp = df_gap.groupby(by=['year'])['lifeExp'].mean()

display(life_exp)

# Reindex to the years 2000-2009 (range excludes 2010); years with no data become NaN
display(life_exp.reindex(range(2000, 2010)))
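
The sketch below repeats the re-indexing step on a small made-up Series so the result is easy to inspect. The values are hypothetical; `fill_value` is an optional `reindex` parameter that supplies a replacement instead of NaN.

```python
import pandas as pd

# Hypothetical yearly averages, with data for only two years
life_exp_toy = pd.Series([65.7, 67.0], index=[2002, 2007])

# Index labels with no existing data are filled with NaN
reindexed = life_exp_toy.reindex(range(2000, 2010))
print(len(reindexed))               # 10
print(reindexed.notnull().sum())    # 2

# fill_value replaces the NaN that reindexing would otherwise create
filled = life_exp_toy.reindex(range(2000, 2010), fill_value=0)
print(filled.isnull().sum())        # 0
```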

# Working with missing Data

## Find and Count missing data

In [None]:
df_countries = pd.read_csv(filepath_or_buffer='../Data/country_timeseries.csv')

# Number of non-missing values
print(df_countries.count())

In [None]:
# Subtract the number of non-missing values in each column from the total number of rows
num_rows = df_countries.shape[0]
num_missing = num_rows - df_countries.count()
print(num_missing)

In [None]:
# Use count_nonzero from numpy
import numpy as np

print(np.count_nonzero(df_countries.isnull()))

print(np.count_nonzero(df_countries['Cases_Guinea'].isnull()))

In [None]:
# Use value_counts for a frequency distribution (missing values are excluded by default)
display(df_countries['Cases_Guinea'].value_counts().head())


# Pass dropna=False to include missing values in the counts
display(df_countries['Cases_Guinea'].value_counts(dropna=False).head())
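
A common alternative for counting is to sum the boolean mask returned by `isnull()`, since `True` counts as 1. The sketch below uses a small made-up DataFrame; the column names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data with three missing cells in total
cases = pd.DataFrame({'Cases_A': [10.0, np.nan, 12.0],
                      'Cases_B': [np.nan, np.nan, 3.0]})

# Missing values per column
missing_per_column = cases.isnull().sum()

# Missing values in the whole DataFrame
total_missing = int(cases.isnull().sum().sum())

# Fraction of missing values per column
share_missing = cases.isnull().mean()

print(missing_per_column.to_dict())  # {'Cases_A': 1, 'Cases_B': 2}
print(total_missing)                 # 3
```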

# Cleaning Missing Data

## 1) Recode / Replace

Use `fillna`

In [None]:
# Recode missing values as zero

# Before
print(df_countries.iloc[0:10, 0:5])

# After
print(df_countries.fillna(0).iloc[0:10, 0:5])
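
`fillna` also accepts a dictionary, so each column can get its own replacement value. The sketch below uses made-up data and, purely as an illustration, fills one column with zero and another with its column mean.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are hypothetical
cases = pd.DataFrame({'Cases_A': [10.0, np.nan],
                      'Deaths_A': [np.nan, 2.0]})

# One replacement value per column; mean() skips NaN by default
filled = cases.fillna({'Cases_A': 0, 'Deaths_A': cases['Deaths_A'].mean()})

print(filled['Cases_A'].tolist())   # [10.0, 0.0]
print(filled['Deaths_A'].tolist())  # [2.0, 2.0]
```

`fillna` returns a new DataFrame here; `cases` itself is left unchanged.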

## 2) Fill Forward

In [None]:
# Before
print(df_countries.iloc[0:10, 0:5])

# After (ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas)
print(df_countries.ffill().iloc[0:10, 0:5])

## 3) Fill Backward

In [None]:
# Before
print(df_countries.iloc[0:10, 0:5])

# After (bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas)
print(df_countries.bfill().iloc[0:10, 0:5])
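
The difference between the two directions is easier to see on a short made-up Series. The sketch below also shows the optional `limit` parameter, which caps how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill copies the last valid value downwards;
# a leading NaN has nothing before it and stays missing
forward = s.ffill()

# Backward fill copies the next valid value upwards;
# a trailing NaN has nothing after it and stays missing
backward = s.bfill()

# limit=1 fills at most one consecutive missing value per gap
limited = s.ffill(limit=1)

print(forward.tolist())   # [nan, 1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # [1.0, 1.0, 4.0, 4.0, 4.0, nan]
print(limited.tolist())   # [nan, 1.0, 1.0, nan, 4.0, 4.0]
```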

## 4) Interpolate

In [None]:
# Before
print(df_countries.iloc[0:10, 0:5])

# After
print(df_countries.interpolate().iloc[0:10, 0:5])
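
By default `interpolate()` uses linear interpolation and treats the rows as equally spaced. The sketch below applies it to a short made-up Series to show how a gap between two known values is filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# The gap between 1.0 and 4.0 is filled in equal steps;
# the leading NaN has no earlier value to start from and stays missing
interpolated = s.interpolate()

print(interpolated.tolist())  # [nan, 1.0, 2.0, 3.0, 4.0]
```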

## 5) Drop missing values

In [None]:
# Before
print(df_countries.shape)

# After
print(df_countries.dropna().shape)
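
By default `dropna()` removes every row that contains at least one missing value, which can discard a lot of data. The sketch below uses a small made-up DataFrame to compare the default with the `how`, `subset`, and `thresh` parameters.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [2.0, 5.0, np.nan],
                   'c': [3.0, 6.0, np.nan]})

# Default: drop rows with ANY missing value
any_missing = df.dropna()

# Drop only rows where ALL values are missing
all_missing = df.dropna(how='all')

# Only consider column 'b' when deciding which rows to drop
by_subset = df.dropna(subset=['b'])

# Keep rows that have at least 2 non-missing values
by_thresh = df.dropna(thresh=2)

print(any_missing.shape)  # (1, 3)
print(all_missing.shape)  # (2, 3)
print(by_subset.shape)    # (2, 3)
print(by_thresh.shape)    # (2, 3)
```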

# Calculations with Missing Data

In [None]:
# Add the three case columns row by row
df_countries['Multiple'] = (df_countries['Cases_Guinea'] +
                            df_countries['Cases_Liberia'] +
                            df_countries['Cases_SierraLeone'])

cols = ['Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone', 'Multiple']
df_countries.loc[:, cols].head(10)

# Note: the `Multiple` column is only calculated when ALL of the individual cells have data

Many built-in methods have a `skipna` parameter,

e.g. `sum()` and `mean()`.

In [None]:
# Default
print(df_countries[cols].sum())

# skipna=True is the default, so this gives the same result
print(df_countries[cols].sum(skipna=True))

# With skipna=False, any column that contains a NaN sums to NaN
print(df_countries[cols].sum(skipna=False))
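
The same rules apply when adding across columns. The sketch below uses a small made-up DataFrame to compare the `+` operator with a row-wise `sum()`, and shows the optional `min_count` parameter, which controls what happens when every value in a row is missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, np.nan],
                   'y': [2.0, 3.0, np.nan]})

# The + operator propagates NaN: one missing cell makes the result NaN
added = df['x'] + df['y']

# sum() skips NaN by default, so an all-missing row sums to 0
row_sum = df.sum(axis=1)

# min_count=1 requires at least one valid value, otherwise the result is NaN
row_sum_min = df.sum(axis=1, min_count=1)

print(added.tolist())        # [3.0, nan, nan]
print(row_sum.tolist())      # [3.0, 3.0, 0.0]
print(row_sum_min.tolist())  # [3.0, 3.0, nan]
```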