
![image](../Utilities/datacleaning.png)

__Author: Christian Urcuqui__

__Date: 23 August 2018__



# Data Cleaning and Preparation


Data cleaning and preparation is usually the most time-consuming stage of a data science project; how long it takes depends on the complexity of the data and its problems. In this notebook we will see different methods in Python for transforming raw data into tidy data for later analyses.

This notebook is divided into:

+ [Introduction](#Introduction)
+ [Handling Missing Data](#Handling-Missing-Data)
+ [Filtering out missing data](#Filtering-out-missing-data)
+ [Filling In Missing Data](#Filling-In-Missing-Data)



## Introduction

Datasets can present many different problems. To find them, we must pay close attention to the data dictionary.


## Handling Missing Data

Missing data appears in many data projects for a variety of reasons, such as human error and system problems. Pandas represents these missing values with the floating-point value NaN (Not a Number).

In [3]:
import numpy as np  # needed for np.nan below
from pandas import Series

example = Series(['ftp', 'ssh', np.nan, 'icmp'])

example

0     ftp
1     ssh
2     NaN
3    icmp
dtype: object

In [4]:
example.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [5]:
example.isna()

0    False
1    False
2     True
3    False
dtype: bool

The value None in Python is also treated as NA in object arrays.

In [7]:
example[0] = None

example.isnull()

0     True
1    False
2     True
3    False
dtype: bool
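
As a minimal sketch of the point above, the following cell shows that `isna` flags `None` and `NaN` alike, and that summing the resulting boolean mask counts the missing entries. The Series `protocols` and its values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Illustrative Series mixing both missing-value markers (hypothetical data)
protocols = pd.Series(['ftp', None, np.nan, 'icmp'])

# isna() flags None and NaN alike
mask = protocols.isna()

# Summing a boolean mask counts the True entries, i.e. the missing values
n_missing = int(mask.sum())
print(n_missing)
```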

Some methods for NA handling are:
+ _dropna_, filters out the axis labels (rows or columns) that contain missing values
+ _fillna_, fills in missing data with a given value or with a method such as 'ffill' or 'bfill'
+ _isnull_, returns boolean values indicating which entries are missing
+ _notnull_, the negation of _isnull_
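
The four methods listed above can be sketched side by side on a small Series. The data and variable names here are assumptions chosen for illustration; note that each call returns a new object and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

dropped = s.dropna()        # removes the entry at label 1
filled = s.fillna(0.0)      # replaces the NaN with 0.0
is_missing = s.isnull()     # True where the value is missing
is_present = s.notnull()    # element-wise negation of isnull()

# The original Series is unchanged by any of the calls above
print(int(s.isnull().sum()))
```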


## Filtering out missing data

Using some of the methods previously mentioned, we can filter the NaNs out of our datasets.

In [11]:
from numpy import nan as NA
import pandas as pd

data = Series([1, NA, 2.5, NA, 9])

data.dropna()

0    1.0
2    2.5
4    9.0
dtype: float64

In [13]:
data[data.notnull()]

0    1.0
2    2.5
4    9.0
dtype: float64

In the next examples we will see the same filtering applied to DataFrame objects. By default, _dropna_ drops every row that contains at least one NaN.

In [15]:
from pandas import DataFrame

data = DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])

cleaned = data.dropna()

data

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |


In [16]:
cleaned

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |


In [27]:
# Pay attention to the how parameter: with how='all', dropna removes only the rows in which every value is NaN
data.dropna(how='all')

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |


To drop the columns in which every value is NaN, we can apply the same approach with axis=1.

In [28]:
data[4] = None
data

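|   | 0 | 1 | 2 | 4 |
|---|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 | None |
| 1 | 1.0 | NaN | NaN | None |
| 2 | NaN | NaN | NaN | None |
| 3 | NaN | 6.5 | 3.0 | None |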


In [29]:
data.dropna(axis=1, how='all')

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |


If we only want to keep the rows that have a certain number of observations, we can use the thresh parameter of _dropna_. To build the example, remember that we can select values by position with the iloc indexer of a DataFrame object.

In [30]:
# Build the DataFrame that we will process
import numpy as np
df = DataFrame(np.random.rand(7,3))
df

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.822287 | 0.745434 | 0.374203 |
| 1 | 0.953188 | 0.727975 | 0.304002 |
| 2 | 0.283246 | 0.00619 | 0.047044 |
| 3 | 0.575857 | 0.720681 | 0.979117 |
| 4 | 0.513575 | 0.636307 | 0.847163 |
| 5 | 0.165529 | 0.937517 | 0.423376 |
| 6 | 0.188833 | 0.944823 | 0.072005 |


In [32]:
df.iloc[:4, 1] = NA # We are changing the first four rows in the second column to NaNs

df.iloc[:2, 2] = NA 

df


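|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.822287 | NaN | NaN |
| 1 | 0.953188 | NaN | NaN |
| 2 | 0.283246 | NaN | 0.047044 |
| 3 | 0.575857 | NaN | 0.979117 |
| 4 | 0.513575 | 0.636307 | 0.847163 |
| 5 | 0.165529 | 0.937517 | 0.423376 |
| 6 | 0.188833 | 0.944823 | 0.072005 |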


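The thresh parameter sets the minimum number of non-NaN values that a row must contain in order to be kept; rows with fewer valid values are dropped. With thresh=2, rows 0 and 1 are removed because each has only one valid value.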

In [34]:
df.dropna(thresh=2)

|   | 0 | 1 | 2 |
|---|---|---|---|
| 2 | 0.283246 | NaN | 0.047044 |
| 3 | 0.575857 | NaN | 0.979117 |
| 4 | 0.513575 | 0.636307 | 0.847163 |
| 5 | 0.165529 | 0.937517 | 0.423376 |
| 6 | 0.188833 | 0.944823 | 0.072005 |
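
One way to check what thresh does is to count the non-missing values per row by hand and compare. The sketch below assumes a small hypothetical DataFrame (`frame`, with columns `a`, `b`, `c`) rather than the random data above, so that the result is reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 0 and 2 have one valid value, row 1 has three
frame = pd.DataFrame({
    'a': [1.0, 2.0, np.nan],
    'b': [np.nan, 5.0, np.nan],
    'c': [np.nan, 6.0, 9.0],
})

# Number of non-missing values in each row
non_na_per_row = frame.notna().sum(axis=1)

# Keep only the rows with at least two non-missing values
kept = frame.dropna(thresh=2)

# The same selection written as an explicit boolean filter
manual = frame[non_na_per_row >= 2]
```

Both `kept` and `manual` should contain the same rows, which makes the meaning of thresh explicit: it is a lower bound on valid values, not a count of rows to keep.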


## Filling In Missing Data




We can fill in missing data in different ways. One of them is to call the _fillna_ method with a constant value, which replaces every NaN with that constant.

In [36]:
df.fillna(0)

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.822287 | 0.0 | 0.0 |
| 1 | 0.953188 | 0.0 | 0.0 |
| 2 | 0.283246 | 0.0 | 0.047044 |
| 3 | 0.575857 | 0.0 | 0.979117 |
| 4 | 0.513575 | 0.636307 | 0.847163 |
| 5 | 0.165529 | 0.937517 | 0.423376 |
| 6 | 0.188833 | 0.944823 | 0.072005 |


In the same way, we can pass a dictionary to specify a different replacement value for each column.

In [38]:
df.fillna({1: 0.5, 2: 0})  # note that the dictionary keys are matched against the column labels

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.822287 | 0.5 | 0.0 |
| 1 | 0.953188 | 0.5 | 0.0 |
| 2 | 0.283246 | 0.5 | 0.047044 |
| 3 | 0.575857 | 0.5 | 0.979117 |
| 4 | 0.513575 | 0.636307 | 0.847163 |
| 5 | 0.165529 | 0.937517 | 0.423376 |
| 6 | 0.188833 | 0.944823 | 0.072005 |
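
It is worth noting that the calls above return a new object instead of modifying the original. The following sketch, which assumes a small hypothetical DataFrame `scores` with string column labels, fills each column with its own value and then confirms that the source data still contains its NaNs.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
scores = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 4.0]})

# A different replacement per column, keyed by column label
filled = scores.fillna({'x': 0.5, 'y': 0.0})

# fillna returned a new DataFrame; the original still has its NaNs
still_missing = int(scores.isna().sum().sum())
print(still_missing)
```

To keep the filled result, assign it to a name, as done with `filled` here.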


We can also use the filling methods built into _fillna_. Rather than interpolating, they propagate the nearest valid observation forward or backward into the NaN values.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
 ```
method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

Method to use for filling holes in reindexed Series
pad / ffill: propagate last valid observation forward to next valid
backfill / bfill: use NEXT valid observation to fill gap
 ```


In [39]:
df2 = DataFrame(np.random.randn(6,3))

df2.iloc[2:, 1] = NA
df2.iloc[4:, 2] = NA

df2

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -1.529642 | -1.908967 | 1.19381 |
| 1 | -0.083051 | -1.171295 | -0.063588 |
| 2 | -0.403066 | NaN | -0.732555 |
| 3 | 0.894471 | NaN | -1.013693 |
| 4 | 0.878349 | NaN | NaN |
| 5 | 0.508237 | NaN | NaN |


In [43]:
df2.fillna(method='ffill')

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -1.529642 | -1.908967 | 1.19381 |
| 1 | -0.083051 | -1.171295 | -0.063588 |
| 2 | -0.403066 | -1.171295 | -0.732555 |
| 3 | 0.894471 | -1.171295 | -1.013693 |
| 4 | 0.878349 | -1.171295 | -1.013693 |
| 5 | 0.508237 | -1.171295 | -1.013693 |


In [44]:
df2.fillna(method='ffill', limit=2)

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -1.529642 | -1.908967 | 1.19381 |
| 1 | -0.083051 | -1.171295 | -0.063588 |
| 2 | -0.403066 | -1.171295 | -0.732555 |
| 3 | 0.894471 | -1.171295 | -1.013693 |
| 4 | 0.878349 | NaN | -1.013693 |
| 5 | 0.508237 | NaN | -1.013693 |
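
Recent pandas releases (2.1 onward, as far as I am aware) deprecate the `method` keyword of _fillna_ in favor of the dedicated `ffill` and `bfill` methods. The sketch below shows those equivalents on a hypothetical Series `readings`; it is an alternative spelling of the same idea, not the notebook's original code.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a gap of three consecutive NaNs
readings = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill: propagate the last valid value into the gap
forward = readings.ffill()

# Forward fill at most two consecutive NaNs; the third stays missing
forward_limited = readings.ffill(limit=2)

# Backward fill: use the next valid value to fill the gap
backward = readings.bfill()
```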


However, it is sometimes important to evaluate other ways of filling our data first, for example by applying basic statistics such as the mean.

In [46]:
data = Series([1., NA, 3.5, NA, 7])

data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
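
The same idea extends to DataFrame objects. As a minimal sketch, assuming a small hypothetical DataFrame `table`, passing the Series of column means to _fillna_ fills each column with its own mean, because the values are matched by column label.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
table = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 4.0, 8.0]})

# mean() skips NaNs by default and returns one value per column
column_means = table.mean()

# Each NaN is replaced with the mean of its own column
filled = table.fillna(column_means)
print(filled)
```

Whether the mean is a sensible replacement depends on the data; for skewed columns the median may be a more robust choice.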