# Data Prep - Part 2

## Missing Values

In [4]:
import numpy as np
import pandas as pd

from pandas import Series, DataFrame

### Finding out what data is missing

In [5]:
missing = np.nan
s1 = Series(['r1', 'r2', 'r3', missing, 'r5', 'r6', missing, 'r8'])
s1

0     r1
1     r2
2     r3
3    NaN
4     r5
5     r6
6    NaN
7     r8
dtype: object

In [6]:
s1.isnull()

0    False
1    False
2    False
3     True
4    False
5    False
6     True
7    False
dtype: bool
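
The boolean mask returned by `isnull()` can also be used to select entries. The following is a minimal sketch rather than part of the original notebook; the names `mask`, `missing_labels` and `present` are made up for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(['r1', 'r2', 'r3', np.nan, 'r5', 'r6', np.nan, 'r8'])

mask = s1.isnull()                          # True where a value is missing
missing_labels = s1[mask].index.tolist()    # index labels of the missing entries
present = s1[~mask]                         # equivalent to s1[s1.notnull()]

print(missing_labels)   # [3, 6]
print(len(present))     # 6
```

Inverting the mask with `~` keeps only the entries that hold a value.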

### Filling in missing values

In [8]:
np.random.seed(25)
df = DataFrame(np.random.rand(36).reshape(6,6))
df

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.870124 | 0.582277 | 0.278839 | 0.185911 | 0.4111 | 0.117376 |
| 1 | 0.684969 | 0.437611 | 0.556229 | 0.36708 | 0.402366 | 0.113041 |
| 2 | 0.447031 | 0.585445 | 0.161985 | 0.520719 | 0.326051 | 0.699186 |
| 3 | 0.366395 | 0.836375 | 0.481343 | 0.516502 | 0.383048 | 0.997541 |
| 4 | 0.514244 | 0.559053 | 0.03445 | 0.71993 | 0.421004 | 0.436935 |
| 5 | 0.281701 | 0.900274 | 0.669612 | 0.456069 | 0.289804 | 0.525819 |


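Set some values to missing for the demo: rows 3 to 5 of column 0 and rows 1 to 4 of column 5. Note that `.loc` slices include both endpoints.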

In [12]:
df.loc[3:5, 0] = missing
df.loc[1:4, 5] = missing
df

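| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.870124 | 0.582277 | 0.278839 | 0.185911 | 0.4111 | 0.117376 |
| 1 | 0.684969 | 0.437611 | 0.556229 | 0.36708 | 0.402366 | NaN |
| 2 | 0.447031 | 0.585445 | 0.161985 | 0.520719 | 0.326051 | NaN |
| 3 | NaN | 0.836375 | 0.481343 | 0.516502 | 0.383048 | NaN |
| 4 | NaN | 0.559053 | 0.03445 | 0.71993 | 0.421004 | NaN |
| 5 | NaN | 0.900274 | 0.669612 | 0.456069 | 0.289804 | 0.525819 |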


To replace missing values with a value we specify, use `fillna()`. It returns a new dataframe, so the original `df` is not modified:

In [14]:
new_df = df.fillna(0)
new_df

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.870124 | 0.582277 | 0.278839 | 0.185911 | 0.4111 | 0.117376 |
| 1 | 0.684969 | 0.437611 | 0.556229 | 0.36708 | 0.402366 | 0.0 |
| 2 | 0.447031 | 0.585445 | 0.161985 | 0.520719 | 0.326051 | 0.0 |
| 3 | 0.0 | 0.836375 | 0.481343 | 0.516502 | 0.383048 | 0.0 |
| 4 | 0.0 | 0.559053 | 0.03445 | 0.71993 | 0.421004 | 0.0 |
| 5 | 0.0 | 0.900274 | 0.669612 | 0.456069 | 0.289804 | 0.525819 |
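
As a quick check that `fillna()` leaves the original untouched, the sketch below rebuilds the demo dataframe and counts the missing values before and after. The names `remaining_in_original` and `remaining_in_copy` are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the demo dataframe with the same seed and the same missing cells
np.random.seed(25)
df = pd.DataFrame(np.random.rand(36).reshape(6, 6))
df.loc[3:5, 0] = np.nan
df.loc[1:4, 5] = np.nan

new_df = df.fillna(0)   # returns a filled copy

remaining_in_original = int(df.isnull().sum().sum())
remaining_in_copy = int(new_df.isnull().sum().sum())

print(remaining_in_original)   # 7, the original still has its NaNs
print(remaining_in_copy)       # 0
```

To keep the filled result under the original name, reassign it, for example `df = df.fillna(0)`.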


We can also give each column its own fill value by passing a dictionary of the form `{column: value}`:

In [16]:
another_df = df.fillna({0: 0.1, 5: 1.25})
another_df

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.870124 | 0.582277 | 0.278839 | 0.185911 | 0.4111 | 0.117376 |
| 1 | 0.684969 | 0.437611 | 0.556229 | 0.36708 | 0.402366 | 1.25 |
| 2 | 0.447031 | 0.585445 | 0.161985 | 0.520719 | 0.326051 | 1.25 |
| 3 | 0.1 | 0.836375 | 0.481343 | 0.516502 | 0.383048 | 1.25 |
| 4 | 0.1 | 0.559053 | 0.03445 | 0.71993 | 0.421004 | 1.25 |
| 5 | 0.1 | 0.900274 | 0.669612 | 0.456069 | 0.289804 | 0.525819 |
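
The per-column values do not have to be constants. One common variation, not used in the original notebook, is to fill each column with its own mean. The sketch below assumes that is a sensible choice for the data at hand; `mean_filled` is an illustrative name.

```python
import numpy as np
import pandas as pd

np.random.seed(25)
df = pd.DataFrame(np.random.rand(36).reshape(6, 6))
df.loc[3:5, 0] = np.nan
df.loc[1:4, 5] = np.nan

# df.mean() skips NaN by default and returns one mean per column;
# fillna() matches the Series index against the column labels
column_means = df.mean()
mean_filled = df.fillna(column_means)

print(mean_filled[[0, 5]])
```

Passing a Series works like passing a dictionary: each column is filled with the value stored under its label.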


We can also replace each missing value with the previous non-missing value in the same column, known as a forward fill. Recent pandas versions deprecate `fillna(method='ffill')` in favour of calling `ffill()` directly, which gives the same result.

In [17]:
yet_another_df = df.ffill()
yet_another_df

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.870124 | 0.582277 | 0.278839 | 0.185911 | 0.4111 | 0.117376 |
| 1 | 0.684969 | 0.437611 | 0.556229 | 0.36708 | 0.402366 | 0.117376 |
| 2 | 0.447031 | 0.585445 | 0.161985 | 0.520719 | 0.326051 | 0.117376 |
| 3 | 0.447031 | 0.836375 | 0.481343 | 0.516502 | 0.383048 | 0.117376 |
| 4 | 0.447031 | 0.559053 | 0.03445 | 0.71993 | 0.421004 | 0.117376 |
| 5 | 0.447031 | 0.900274 | 0.669612 | 0.456069 | 0.289804 | 0.525819 |
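
Two related options are worth knowing, shown here as a sketch that goes beyond the original notebook: `limit` caps how many consecutive missing values are filled, and `bfill()` fills backward from the next valid value instead. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(25)
df = pd.DataFrame(np.random.rand(36).reshape(6, 6))
df.loc[3:5, 0] = np.nan
df.loc[1:4, 5] = np.nan

# Forward fill, but only one step past each valid value
limited = df.ffill(limit=1)

# Backward fill: column 0 has no valid value after row 2,
# so its trailing NaNs cannot be filled
backward = df.bfill()

print(int(limited.isnull().sum().sum()))    # 5
print(int(backward.isnull().sum().sum()))   # 3
```

Neither direction can fill a gap that has no valid value on the side it reads from, so it is worth counting the remaining NaNs after filling.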


### Counting missing values

In [20]:
df.isnull().sum()

0    3
1    0
2    0
3    0
4    0
5    4
dtype: int64
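
The same idea extends to rows, totals and proportions. This is a small sketch with illustrative names (`per_row`, `total`, `fraction`), assuming the demo dataframe built earlier.

```python
import numpy as np
import pandas as pd

np.random.seed(25)
df = pd.DataFrame(np.random.rand(36).reshape(6, 6))
df.loc[3:5, 0] = np.nan
df.loc[1:4, 5] = np.nan

per_row = df.isnull().sum(axis=1)      # missing values in each row
total = int(df.isnull().sum().sum())   # missing values in the whole dataframe
fraction = df.isnull().mean()          # share of missing values per column

print(per_row.tolist())   # [0, 1, 1, 2, 2, 1]
print(total)              # 7
print(fraction[0])        # 0.5
```

Taking the mean of the boolean mask works because `True` counts as 1 and `False` as 0.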

### Filtering out missing values

Here we can use the `dropna()` method. Every row except row 0 contains at least one missing value, so only one row is left in the resulting dataframe.

In [21]:
df_no_NaN = df.dropna()
df_no_NaN

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.870124 | 0.582277 | 0.278839 | 0.185911 | 0.4111 | 0.117376 |


Called without arguments, `dropna()` removes every row that contains a missing value. Pass `axis=1` to drop the _columns_ that contain missing values instead:

In [22]:
df_no_NaN = df.dropna(axis=1)
df_no_NaN

| | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 0 | 0.582277 | 0.278839 | 0.185911 | 0.4111 |
| 1 | 0.437611 | 0.556229 | 0.36708 | 0.402366 |
| 2 | 0.585445 | 0.161985 | 0.520719 | 0.326051 |
| 3 | 0.836375 | 0.481343 | 0.516502 | 0.383048 |
| 4 | 0.559053 | 0.03445 | 0.71993 | 0.421004 |
| 5 | 0.900274 | 0.669612 | 0.456069 | 0.289804 |
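
Dropping every row or column that has any missing value can discard a lot of data. `dropna()` accepts a few parameters that make it less aggressive. The sketch below is an addition to the original notebook and its variable names are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(25)
df = pd.DataFrame(np.random.rand(36).reshape(6, 6))
df.loc[3:5, 0] = np.nan
df.loc[1:4, 5] = np.nan

# Drop a row only if ALL of its values are missing (no such row here)
all_missing_dropped = df.dropna(how='all')

# Keep rows that have at least 5 non-missing values
mostly_complete = df.dropna(thresh=5)

# Only consider column 0 when deciding which rows to drop
column_0_complete = df.dropna(subset=[0])

print(len(all_missing_dropped))            # 6
print(mostly_complete.index.tolist())      # [0, 1, 2, 5]
print(column_0_complete.index.tolist())    # [0, 1, 2]
```

Which option fits depends on how much missing data the later analysis can tolerate.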
