
![image](../Utilities/datacleaning.png)

__Author: Christian Urcuqui__

__Date: 23 August 2018__

__Last updated: 5 September 2018__


# Data Cleaning and Preparation


Data cleaning and preparation is usually the most time-consuming stage of a data science project, and the effort it requires depends on the complexity of the information and on its problems. In this notebook we will see different methods in Python to transform raw data into tidy data for later analyses.

This notebook is divided into the following sections:

+ [Introduction](#Introduction)
+ [Handling Missing Data](#Handling-Missing-Data)
+ [Filtering out missing data](#Filtering-out-missing-data)
+ [Filling In Missing Data](#Filling-In-Missing-Data)
+ [Verifying the Format and the Variable Types](#Verifying-the-Format-and-the-Variable-Types)
+ [Removing duplicates](#Removing-duplicates)
+ [Discretization and Binning](#Discretization-and-Binning)
+ [Detecting and Filtering Outliers](#Detecting-and-Filtering-Outliers)
+ [Computing Dummy Variables](#Computing-Dummy-Variables)
+ [Preprocessing Data](#Preprocessing-Data)
+ [References](#References)


## Introduction

Datasets can present many different situations or problems. To find them we must pay close attention to the data dictionary.


## Handling Missing Data

Missing data appears in many data projects due to a variety of causes, such as human and system errors. Pandas represents these missing values with the floating-point value NaN (Not a Number).

In [1]:
from pandas import Series
import numpy as np

example = Series(['ftp', 'ssh', np.nan, 'icmp'])

example

0     ftp
1     ssh
2     NaN
3    icmp
dtype: object

In [2]:
example.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [3]:
# isna is an alias of isnull; without the parentheses we would only get the bound method
example.isna()

0    False
1    False
2     True
3    False
dtype: bool

The value None in Python is also treated as NA in object arrays.

In [4]:
example[0] = None

example.isnull()

0     True
1    False
2     True
3    False
dtype: bool

Some methods for NA handling are:
+ _dropna_, filters out the axis labels (rows or columns) whose values contain missing data
+ _fillna_, fills in missing data with some value or with a method such as 'ffill' or 'bfill'
+ _isnull_, returns boolean values indicating which values are missing
+ _notnull_, the negation of _isnull_
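
The four methods listed above can be sketched on a small Series. This is only a minimal illustration; the variable name `protocols` and the fill value `'unknown'` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Illustrative Series with two missing entries (None and np.nan)
protocols = pd.Series(['ftp', None, 'ssh', np.nan])

missing_mask = protocols.isnull()      # True where the value is missing
present_mask = protocols.notnull()     # element-wise negation of isnull
without_na = protocols.dropna()        # keeps only the labels 0 and 2
filled = protocols.fillna('unknown')   # replaces every NA with a constant

print(missing_mask.tolist())
print(without_na.index.tolist())
print(filled.tolist())
```

None of these calls modifies `protocols`; each one returns a new object.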


## Filtering out missing data

Using some of the methods previously mentioned, we can filter the NaNs out of our datasets.

In [5]:
from numpy import nan as NA
import pandas as pd

data = Series([1, NA, 2.5, NA, 9])

data.dropna()

0    1.0
2    2.5
4    9.0
dtype: float64

In [6]:
data[data.notnull()]

0    1.0
2    2.5
4    9.0
dtype: float64

In the next examples we will see the same filtering applied to DataFrame objects. By default _dropna_ drops every row that contains at least one NaN.

In [7]:
from pandas import DataFrame

data = DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])

cleaned = data.dropna()

data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [8]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [9]:
# pay attention to the how parameter of dropna: with how='all' only the rows whose values are all NaN are dropped
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


If we want to drop the columns whose values are all NaN instead, we can combine how='all' with axis=1.

In [10]:
data[4] = None
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [11]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


Suppose we only want to keep the rows that contain a certain number of observations, that is, of non-missing values. To build an example, remember that we can select positions in a DataFrame with the iloc indexer.

In [12]:
# we will make the dataframe to process
import numpy as np
df = DataFrame(np.random.rand(7,3))
df

Unnamed: 0,0,1,2
0,0.896416,0.150864,0.194446
1,0.701061,0.4328,0.824707
2,0.226927,0.892425,0.45786
3,0.888534,0.088153,0.745627
4,0.307353,0.559184,0.672312
5,0.732523,0.64007,0.488645
6,0.744788,0.866652,0.953361


In [13]:
df.iloc[:4, 1] = NA # We are changing the first four rows in the second column to NaNs

df.iloc[:2, 2] = NA 

df


Unnamed: 0,0,1,2
0,0.896416,,
1,0.701061,,
2,0.226927,,0.45786
3,0.888534,,0.745627
4,0.307353,0.559184,0.672312
5,0.732523,0.64007,0.488645
6,0.744788,0.866652,0.953361


The thresh parameter defines the minimum number of non-NaN values that a row must contain in order to be kept; rows with fewer are dropped.

In [14]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.226927,,0.45786
3,0.888534,,0.745627
4,0.307353,0.559184,0.672312
5,0.732523,0.64007,0.488645
6,0.744788,0.866652,0.953361
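
Because the frame above is random, the effect of thresh may be easier to follow on fixed data. The sketch below uses an illustrative DataFrame (the name `frame` and its values are assumptions) whose rows hold 3, 2, 1 and 0 non-missing values.

```python
import numpy as np
import pandas as pd

# Illustrative frame: the rows hold 3, 2, 1 and 0 non-missing values respectively
frame = pd.DataFrame([[1.0, 2.0, 3.0],
                      [1.0, np.nan, 3.0],
                      [np.nan, np.nan, 3.0],
                      [np.nan, np.nan, np.nan]])

non_missing = frame.notnull().sum(axis=1)   # non-NaN count per row
kept = frame.dropna(thresh=2)               # keep rows with at least 2 non-NaN values

print(non_missing.tolist())
print(kept.index.tolist())
```

Filtering with `frame[non_missing >= 2]` selects the same rows, which shows what thresh is counting.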


## Filling In Missing Data




We can use different methods to fill in missing data. One of them is calling _fillna_ with a constant value, which replaces every NaN with that constant.

In [15]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.896416,0.0,0.0
1,0.701061,0.0,0.0
2,0.226927,0.0,0.45786
3,0.888534,0.0,0.745627
4,0.307353,0.559184,0.672312
5,0.732523,0.64007,0.488645
6,0.744788,0.866652,0.953361


In the same way, we can pass a dictionary to use a different fill value for each column.

In [16]:
df.fillna({1:0.5, 2:0}) # the dictionary keys are column labels: column 1 is filled with 0.5 and column 2 with 0

Unnamed: 0,0,1,2
0,0.896416,0.5,0.0
1,0.701061,0.5,0.0
2,0.226927,0.5,0.45786
3,0.888534,0.5,0.745627
4,0.307353,0.559184,0.672312
5,0.732523,0.64007,0.488645
6,0.744788,0.866652,0.953361


The _fillna_ method also accepts a method parameter that fills NaN values by propagating neighboring valid observations. From the pandas documentation:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
```
method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

Method to use for filling holes in reindexed Series.
pad / ffill: propagate last valid observation forward to next valid.
backfill / bfill: use NEXT valid observation to fill gap.
```


In [None]:
df2 = DataFrame(np.random.randn(6,3))

df2.iloc[2:, 1] = NA
df2.iloc[4:, 2] = NA

df2

In [None]:
df2.fillna(method = 'ffill')

In [None]:
df2.fillna(method = 'ffill', limit=2)
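
Recent pandas releases deprecate the method argument of _fillna_ in favor of the dedicated _ffill_ and _bfill_ methods. The following sketch shows the same propagation with those methods on a small Series; the name `gaps` and its values are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative Series with a run of three missing values
gaps = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = gaps.ffill()                  # propagate the last valid value forward
forward_limited = gaps.ffill(limit=2)   # fill at most 2 consecutive NaNs
backward = gaps.bfill()                 # use the next valid value to fill the gap

print(forward.tolist())
print(forward_limited.tolist())
print(backward.tolist())
```

With limit=2 the third NaN of the run is left unfilled.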

Sometimes it is important to evaluate other ways of filling in the data first, for example with basic statistics such as the mean.

In [None]:
data = Series([1., NA, 3.5, NA, 7])

data.fillna(data.mean())
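
The same idea extends to a DataFrame, where each column can be filled with its own statistic. This is a minimal sketch with an illustrative frame named `scores`; whether the mean or the median is the better choice depends on the data.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value per column
scores = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                       'b': [10.0, 20.0, np.nan]})

# DataFrame.mean() skips NaNs by default and returns one value per column;
# fillna aligns that Series with the column labels.
filled_mean = scores.fillna(scores.mean())
filled_median = scores.fillna(scores.median())

print(filled_mean)
```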

## Verifying the Format and the Variable Types

As we saw in the introduction, a common issue in a dataset is that the format and types of the variables do not match the data dictionary. We must be careful, because the loading function often infers the variable types on its own. For this reason we need to review the data and fix any mismatch before continuing with the project.

In [17]:
# we will use the iris dataset to verify the type of the data 
import seaborn as sns
data_iris = sns.load_dataset('iris')

print(type(data_iris))
print(data_iris.shape)

<class 'pandas.core.frame.DataFrame'>
(150, 5)


In [18]:
data_iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [19]:
data_iris.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

In [20]:
data_iris2 = data_iris.copy() # copy, so the original DataFrame is not modified
print(data_iris2.columns)
# we will change the type of the species column (object) to a categorical variable
data_iris2.species = data_iris2.species.astype('category')
print(data_iris2.dtypes)
print(data_iris2.head())

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')
sepal_length     float64
sepal_width      float64
petal_length     float64
petal_width      float64
species         category
dtype: object
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
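
A frequent variant of this problem is a numeric column that was loaded as text. The sketch below uses a hypothetical frame (`raw`, with made-up `port` and `protocol` columns) to show one way of repairing the types with _pd.to_numeric_ and _astype_.

```python
import pandas as pd

# Hypothetical raw data in which the numbers were read as strings
raw = pd.DataFrame({'port': ['22', '80', 'unknown'],
                    'protocol': ['ssh', 'http', 'ssh']})

# errors='coerce' turns the values that cannot be parsed into NaN
raw['port'] = pd.to_numeric(raw['port'], errors='coerce')
raw['protocol'] = raw['protocol'].astype('category')

print(raw.dtypes)
print(raw)
```

Coercing invalid entries to NaN makes them visible to the missing-data methods described earlier, instead of leaving them hidden inside a text column.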


## Removing duplicates

Another problem is duplicate data in the dataset. We must be careful with it, because descriptive and predictive analyses can be misleading if many rows are repeated.

In [21]:
data = DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 
                 'k2': [1,1,2,3,3,4,4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


The method _duplicated_ from pandas returns a boolean Series that indicates whether each row is a repetition of an earlier row.

In [22]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

We can remove them with the method __drop_duplicates__ of the DataFrame object. By default the method keeps the first occurrence and drops the later duplicates, but we can keep the last occurrence instead with the parameter _keep='last'_.

In [None]:
data.drop_duplicates()

In [23]:
data.drop_duplicates(keep='last')

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
6,two,4
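
Both methods consider all the columns by default. As a hedged sketch, the subset parameter restricts the comparison to some columns, and keep=False flags every member of a duplicated group. The frame is rebuilt here under the illustrative name `records`.

```python
import pandas as pd

records = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                        'k2': [1, 1, 2, 3, 3, 4, 4]})

# Treat rows as duplicates when they share the same k1 value,
# keeping the first occurrence of each value
by_k1 = records.drop_duplicates(subset=['k1'])

# keep=False marks every member of a duplicated group, not only the repeats
all_copies = records.duplicated(keep=False)

print(by_k1)
print(all_copies.tolist())
```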


## Discretization and Binning

Continuous data is often discretized or otherwise separated into "bins" for analysis. Suppose you have an array of ages and you want to discretize them into age buckets.

In [24]:
import pandas as pd

ages = [20 , 22, 27, 21, 23, 37, 32, 45, 15]

# we will divide them into the following age buckets: 18 to 25, 26 to 35, 36 to 60 and 61 to 100

btns = [18, 25, 35, 60, 100]

cats =  pd.cut(ages, btns)

cats

[(18, 25], (18, 25], (25, 35], (18, 25], (18, 25], (35, 60], (25, 35], (35, 60], NaN]
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

Note that the result is a special Categorical object. We can access the bucket code assigned to each value, the categories, and the number of values in each bucket. The age 15 falls outside every bucket, so it is labeled NaN.

In [None]:
# we will display the codes of each bucket
cats.codes

In [None]:
# we will display the categories associated to each bucket
cats.categories

In [None]:
# we will count the number of values in each bucket
pd.value_counts(cats)

To make the previous structure more informative, we can give names to the bins.

In [None]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, btns, labels=group_names)
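
Two details of _pd.cut_ are worth checking on a small case: which side of each interval is closed, and what happens to values outside every bin. The following sketch reuses the ages from above; the variable names are illustrative.

```python
import pandas as pd

ages = [20, 22, 27, 21, 23, 37, 32, 45, 15]
bins = [18, 25, 35, 60, 100]

# Default: intervals are closed on the right, (18, 25], (25, 35], ...
right_closed = pd.cut([25], bins)

# right=False closes them on the left instead: [18, 25), [25, 35), ...
left_closed = pd.cut([25], bins, right=False)

# Values outside every bin get the code -1, which is displayed as NaN
codes = pd.cut(ages, bins).codes

print(right_closed.codes[0], left_closed.codes[0])
print(codes.tolist())
```

The age 25 therefore belongs to the first bucket by default and to the second one with right=False.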

## Detecting and Filtering Outliers

The _describe_ method allows us to understand how the data in a DataFrame is distributed, which is a first step toward spotting outliers.

In [27]:
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(1000, 4))
#data
data.describe() # describe only computes the statistics for the numerical variables in our DataFrame

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.03704,0.01131,0.021231,0.009824
std,0.982585,0.990619,1.006768,0.980379
min,-2.889494,-3.555132,-3.665452,-3.155167
25%,-0.696565,-0.609681,-0.711506,-0.67668
50%,-0.02455,0.039231,0.037445,0.021147
75%,0.59568,0.635408,0.696785,0.712062
max,3.011026,3.261039,3.124056,3.734548
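
Once the summary shows extreme minimum or maximum values, one simple way to detect and filter them is a threshold on the absolute value. The sketch below assumes a cutoff of 3, which is a choice made for this example rather than a rule, and uses a fixed seed only to make it reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                  # fixed seed, for reproducibility only
values = pd.DataFrame(rng.standard_normal((1000, 4)))

threshold = 3                                   # assumed cutoff for this example
is_outlier = values.abs() > threshold

# rows that contain at least one value beyond the threshold
outlier_rows = values[is_outlier.any(axis=1)]

# cap the extreme values at +/- threshold instead of dropping the rows
capped = values.mask(is_outlier, np.sign(values) * threshold)

print(len(outlier_rows))
print(capped.describe())
```

Capping keeps all 1000 rows, while `values[~is_outlier.any(axis=1)]` would drop the affected rows instead.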


In [None]:
from numpy import nan as NA

df2 = pd.DataFrame(np.random.randn(6,3))

df2.iloc[2:, 1] = NA
df2.iloc[4:, 2] = NA

df2.describe() # we must be careful with the NaNs: they are excluded, so count differs between columns

In [30]:
df_park = pd.read_csv('../datasets/parks.csv', index_col=['Park Code'], encoding='utf-8')
print(df_park.head())
print(df_park.shape)
df_park.describe() # note that describe only reports the numerical variables

                        Park Name State   Acres  Latitude  Longitude
Park Code                                                           
ACAD         Acadia National Park    ME   47390     44.35     -68.21
ARCH         Arches National Park    UT   76519     38.68    -109.57
BADL       Badlands National Park    SD  242756     43.75    -102.50
BIBE       Big Bend National Park    TX  801163     29.25    -103.25
BISC       Biscayne National Park    FL  172924     25.65     -80.08
(56, 5)


Unnamed: 0,Acres,Latitude,Longitude
count,56.0,56.0,56.0
mean,927929.1,41.233929,-113.234821
std,1709258.0,10.908831,22.440287
min,5550.0,19.38,-159.28
25%,69010.5,35.5275,-121.57
50%,238764.5,38.55,-110.985
75%,817360.2,46.88,-103.4
max,8323148.0,67.78,-68.21


In [29]:
# to get the statistics for all the variables we need to pass the parameter include='all'
df_park.describe(include = 'all')

Unnamed: 0,Park Name,State,Acres,Latitude,Longitude
count,56,56,56.0,56.0,56.0
unique,56,27,,,
top,Mammoth Cave National Park,AK,,,
freq,1,8,,,
mean,,,927929.1,41.233929,-113.234821
std,,,1709258.0,10.908831,22.440287
min,,,5550.0,19.38,-159.28
25%,,,69010.5,35.5275,-121.57
50%,,,238764.5,38.55,-110.985
75%,,,817360.2,46.88,-103.4
max,,,8323148.0,67.78,-68.21


## Computing Dummy Variables

Computing dummy variables is a transformation technique for statistical modeling or machine learning applications. It converts a categorical variable into a "dummy" or "indicator" matrix, with one column per category.

In [31]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data': range(6)})
df

Unnamed: 0,key,data
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [32]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can be merged with the other data. 

In [33]:
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data']].join(dummies)
df_with_dummy

Unnamed: 0,data,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0
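
The join step can also be left to _get_dummies_ itself by passing the whole DataFrame together with the columns parameter. This is a minimal sketch with the frame rebuilt under the illustrative name `frame`. Recent pandas versions return boolean indicator columns by default, so the output may show True/False instead of 1/0.

```python
import pandas as pd

frame = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                      'data': range(6)})

# Encode the 'key' column and keep the remaining columns as they are
encoded = pd.get_dummies(frame, columns=['key'], prefix='key')

# drop_first=True omits the first level (key_a), which the other columns already imply
reduced = pd.get_dummies(frame, columns=['key'], prefix='key', drop_first=True)

print(encoded.columns.tolist())
print(reduced.columns.tolist())
```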


## Preprocessing Data

_Scikit-learn_ provides functions and transformer classes to change raw feature vectors into a representation that is more suitable for training machine learning algorithms.

The next examples are taken from the scikit-learn documentation.

### Standardization

__Standardization__ of datasets is a common requirement for many machine learning algorithms; they might behave badly if the individual features do not look more or less like standard normally distributed data: Gaussian with zero mean and unit variance.

The function _scale_ performs this operation on a single array-like dataset.

In [34]:
from sklearn import preprocessing
import numpy as np

X_train = np.array([[1., -1, 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]    
])
X_scaled = preprocessing.scale(X_train)

X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

Scaled data has zero mean and unit variance in each column:

In [None]:
X_scaled.mean(axis=0)

In [None]:
X_scaled.std(axis=0)
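
To see what the scaling does, the same result can be reproduced with plain NumPy: subtract the mean of each column and divide by its standard deviation. This is only a sketch of the arithmetic, not the library's implementation; the names `col_mean`, `col_std` and `X_manual` are illustrative.

```python
import numpy as np

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

col_mean = X_train.mean(axis=0)
col_std = X_train.std(axis=0)            # population standard deviation (ddof=0)
X_manual = (X_train - col_mean) / col_std

print(X_manual)
print(X_manual.mean(axis=0), X_manual.std(axis=0))
```

The first row of `X_manual` matches the first row of the scaled array shown above.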

The preprocessing module provides the class __StandardScaler__ that implements the _Transformer_ API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set.

In [35]:
scaler = preprocessing.StandardScaler().fit(X_train)
scaler

StandardScaler(copy=True, with_mean=True, with_std=True)

In [None]:
scaler.mean_

In [None]:
scaler.scale_

In [36]:
scaler.transform(X_train)

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

The next step is to apply this scaler to new data (in this example, the testing set).

In [None]:
X_test = [[-1., 1., 0.]]
scaler.transform(X_test)

We can disable either centering or scaling by passing __with_mean=False__ or __with_std=False__ to the StandardScaler.

### Scaling features to a range

An alternative is to scale features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size.

In [None]:
# example to scale the data matrix to the [0,1] range

X_train = np.array([[ 1., -1., 2.],
                  [2., 0., 0.],
                  [0., 1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax

As in the previous example, the idea is to apply the fitted scaler to new test data that was not seen during the fit call.

In [None]:
X_test = np.array([[-3., -1., 4.]])

X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax
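
The arithmetic behind the [0, 1] scaling can be sketched with NumPy: the minimum and the range of each column are taken from the training data and then reused on the test data. The variable names are illustrative and this is not the library's own code.

```python
import numpy as np

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_test = np.array([[-3., -1., 4.]])

col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min

X_train_scaled = (X_train - col_min) / col_range
X_test_scaled = (X_test - col_min) / col_range   # may fall outside [0, 1]

print(X_train_scaled)
print(X_test_scaled)
```

Test values below the training minimum or above the training maximum end up outside the [0, 1] range, as happens here with -1.5 and about 1.67.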

__MaxAbsScaler__ scales the data so that the training data lies within the range [-1, 1], by dividing each feature by its largest absolute value.

In [None]:
X_train = np.array([[ 1., -1., 2.],
                  [2., 0., 0.],
                  [0., 1., -1.]])
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs


In [None]:
X_test = np.array([[-3., -1., 4.]])

X_test_maxabs = max_abs_scaler.transform(X_test)
X_test_maxabs

### Normalization

__Normalization__ is the process of scaling individual samples to have unit norm. This is useful if you plan to use a quadratic form such as the dot product or any other kernel to quantify the similarity of any pair of samples.

The function _normalize_ performs this operation on a single array-like dataset, using either the l1 or the l2 norm.

In [None]:
X = [[1., -1., 2.],
    [2., 0., 0.],
    [0., 1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')
X_normalized

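The module also provides the __Normalizer__ class, which implements the same operation through the Transformer API so it can be applied to data that was not used during the fit. Each sample is normalized independently, so the fit call does not learn anything from the data.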

In [None]:
X = [[1., -1., 2.],
    [2., 0., 0.],
    [0., 1., -1.]]

normalizer = preprocessing.Normalizer().fit(X)
normalizer

In [None]:
normalizer.transform(X)

In [None]:
normalizer.transform([[-1.,  1., 0.]])  
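
Unlike the scalers above, normalization works row by row. As a sketch of the arithmetic in plain NumPy (the variable names are illustrative), each sample is divided by its own norm:

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# l2: divide each row by its Euclidean norm
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_l2 = X / row_norms

# l1: divide each row by the sum of its absolute values
l1_norms = np.abs(X).sum(axis=1, keepdims=True)
X_l1 = X / l1_norms

print(X_l2)
print(X_l1)
```

Every row of `X_l2` has Euclidean length 1, and the absolute values in every row of `X_l1` add up to 1.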

## References

+ http://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range