## Data Preprocessing

### Data Cleaning

#### Handling missing data

The NumPy constant `np.nan` and the Python built-in constant `None` are both treated as missing values by pandas. It is common to alias `np.nan` as `NA` on import, as shown below.

In [1]:
from numpy import nan as NA

By default, descriptive statistics computed on pandas objects exclude missing data.

In [2]:
import pandas as pd
x = pd.Series(
    [12.3, None, 12.2, 12.0, NA, 12.2]
    )
x

0    12.3
1     NaN
2    12.2
3    12.0
4     NaN
5    12.2
dtype: float64

In [3]:
x.count()

4

In [5]:
x.sum()

48.7

In [6]:
x.mean()

12.175
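
The default exclusion described above can be switched off. The following is a minimal sketch, assuming the same six values as the Series above; the names `readings`, `total_skipping`, `total_strict` and `mean_strict` are illustrative and not part of the original notebook. Passing `skipna=False` makes the missing values propagate into the result.

```python
import math

import numpy as np
import pandas as pd

readings = pd.Series([12.3, None, 12.2, 12.0, np.nan, 12.2])

# Default behaviour: missing values are skipped.
total_skipping = readings.sum()

# skipna=False lets missing values propagate, so both results are NaN.
total_strict = readings.sum(skipna=False)
mean_strict = readings.mean(skipna=False)

print(total_skipping)
print(math.isnan(total_strict), math.isnan(mean_strict))
```

This can be useful when a silent partial sum would be misleading and a NaN result is the preferred signal.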

#### Check for missing values
Pandas Series and DataFrame objects have an `isnull` method that is applied element-wise to check whether each value is missing.


In [7]:
x.isnull()

0    False
1     True
2    False
3    False
4     True
5    False
dtype: bool

Because `True` counts as 1 when summed, the number of missing values can be obtained by applying the `sum` method to the result, as shown below.

In [8]:
x.isnull().sum()

2
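
The same idiom extends to a DataFrame, where `isnull().sum()` yields one count per column. The sketch below uses a small made-up frame (`table` and its columns `a` and `b` are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in column "a", two in column "b".
table = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

# Summing the boolean mask column by column gives per-column counts.
missing_per_column = table.isnull().sum()

# Summing those counts again gives the total for the whole frame.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```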

In [9]:
x[1] = 12.5  # Replace one missing value, leaving a single NaN in the data.

In [10]:
x

0    12.3
1    12.5
2    12.2
3    12.0
4     NaN
5    12.2
dtype: float64

In [11]:
x.isnull().sum()

1

To check whether the data contains any missing value, use the `any` method as shown below.

In [12]:
x.isnull().any()

True

In [13]:
x[4] = 12.1
x.isnull().any()

False

#### Removing observations containing missing values

In [14]:
x[4] = NA
x

0    12.3
1    12.5
2    12.2
3    12.0
4     NaN
5    12.2
dtype: float64

In [18]:
x_clean = x.dropna()
x_clean

0    12.3
1    12.5
2    12.2
3    12.0
5    12.2
dtype: float64

In [16]:
x

0    12.3
1    12.5
2    12.2
3    12.0
4     NaN
5    12.2
dtype: float64

In [13]:
df = pd.DataFrame([[2, 4, 3], [1.9, 4.1, 3.4], [2.1, 3.9, NA]])
df

Unnamed: 0,0,1,2
0,2.0,4.0,3.0
1,1.9,4.1,3.4
2,2.1,3.9,


In [14]:
df.dropna()

Unnamed: 0,0,1,2
0,2.0,4.0,3.0
1,1.9,4.1,3.4
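
`dropna` accepts a few arguments that control what gets dropped. The sketch below is illustrative only: the frame `frame` and the result names are assumptions, while `axis`, `thresh` and `subset` are standard `DataFrame.dropna` parameters.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[2.0, 4.0, 3.0], [1.9, np.nan, np.nan], [2.1, 3.9, np.nan]]
)

# Default: drop every row that has at least one missing value.
rows_complete = frame.dropna()

# axis=1 drops columns instead of rows.
cols_complete = frame.dropna(axis=1)

# thresh=2 keeps rows with at least two non-missing values.
rows_min_two = frame.dropna(thresh=2)

# subset restricts the check to the listed columns.
rows_by_subset = frame.dropna(subset=[1])

print(rows_complete.index.tolist())
print(cols_complete.columns.tolist())
print(rows_min_two.index.tolist())
print(rows_by_subset.index.tolist())
```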


#### Replacing missing values

Pandas provides the `fillna` method for replacing missing values. In the following example, missing values are filled with the value 0.

In [15]:
df.fillna(0)

Unnamed: 0,0,1,2
0,2.0,4.0,3.0
1,1.9,4.1,3.4
2,2.1,3.9,0.0


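Different fill values can be specified for different columns by passing a `dict` or a `Series` as the fill value.

For example, a `dict` can assign a constant to one column and a column mean to another, while passing `df.mean()` (a `Series` indexed by column) fills every column with its own mean.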

In [16]:
df.loc[1, 0] = NA
df

Unnamed: 0,0,1,2
0,2.0,4.0,3.0
1,,4.1,3.4
2,2.1,3.9,


In [18]:
df.fillna({0:1, 2:df[2].mean()})

Unnamed: 0,0,1,2
0,2.0,4.0,3.0
1,1.0,4.1,3.4
2,2.1,3.9,3.2


In [17]:
df.fillna(df.mean())

Unnamed: 0,0,1,2
0,2.0,4.0,3.0
1,2.05,4.1,3.4
2,2.1,3.9,3.2


The `fillna` method can also fill by propagating neighbouring values, selected with the `method` argument. For example, `"ffill"` carries the last valid value in each column forward.

In [19]:
df.fillna(method = "ffill")

Unnamed: 0,0,1,2
0,2.0,4.0,3.0
1,2.0,4.1,3.4
2,2.1,3.9,3.4
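
As far as I am aware, recent pandas releases (2.1 and later) deprecate the `method` argument of `fillna` in favour of the dedicated `ffill` and `bfill` methods, so the sketch below uses those instead. The frame `frame` mirrors the data above; the names `forward` and `backward` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[2.0, 4.0, 3.0], [np.nan, 4.1, 3.4], [2.1, 3.9, np.nan]]
)

# Forward fill: carry the last valid value in each column downward.
forward = frame.ffill()

# Backward fill: pull the next valid value in each column upward.
# A trailing NaN has no later value to borrow, so it stays missing.
backward = frame.bfill()

print(forward)
print(backward)
```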


The `fillna` method returns a new Series/DataFrame with the missing values filled as requested. To modify the original Series/DataFrame instead, pass `inplace=True`.
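
The difference between the returned copy and an in-place fill can be checked with a short sketch. The frame and variable names here are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})

# fillna returns a new object; the original still has its missing values.
filled = frame.fillna(0)
remaining_before = int(frame.isnull().sum().sum())

# inplace=True modifies the original and returns None.
frame.fillna(0, inplace=True)
remaining_after = int(frame.isnull().sum().sum())

print(remaining_before, remaining_after)
```

Reassigning the result (`frame = frame.fillna(0)`) is an equivalent alternative that many code bases prefer.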

To fill the missing values using interpolation, use the `interpolate` method.

In [20]:
df.interpolate()

Unnamed: 0,0,1,2
0,2.0,4.0,3.0
1,2.05,4.1,3.4
2,2.1,3.9,3.4


By default, the `interpolate` method uses linear interpolation, which treats the values as equally spaced. Other methods, including those available in the `scipy` library, can be selected with the `method` argument.
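
To illustrate the effect of the `method` argument without requiring `scipy`, the sketch below compares the default with `method="index"`, which uses the numeric index values as the x-coordinates. The Series `s` and its unevenly spaced index are assumptions chosen to make the difference visible.

```python
import numpy as np
import pandas as pd

# The index is unevenly spaced: the gap sits at 1, between 0 and 4.
s = pd.Series([0.0, np.nan, 4.0], index=[0, 1, 4])

# Default linear interpolation ignores the index and treats values as equally spaced.
linear = s.interpolate()

# method="index" interpolates against the actual index values.
by_index = s.interpolate(method="index")

print(linear.loc[1])
print(by_index.loc[1])
```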

### Identifying duplicates

In [21]:
df.loc[3]=df.loc[0]
df

Unnamed: 0,0,1,2
0,2.0,4.0,3.0
1,,4.1,3.4
2,2.1,3.9,
3,2.0,4.0,3.0


In [22]:
df.duplicated()

0    False
1    False
2    False
3     True
dtype: bool

In [23]:
df.duplicated().any()

True

In [24]:
df.drop_duplicates()

Unnamed: 0,0,1,2
0,2.0,4.0,3.0
1,,4.1,3.4
2,2.1,3.9,


By default, these methods consider the entire row when detecting or dropping duplicates. However, we can also restrict the comparison to a subset of the columns.

In [25]:
df.loc[4] = pd.Series([2.0, 4.0, 3.5])
df

Unnamed: 0,0,1,2
0,2.0,4.0,3.0
1,,4.1,3.4
2,2.1,3.9,
3,2.0,4.0,3.0
4,2.0,4.0,3.5


In [26]:
df.duplicated()

0    False
1    False
2    False
3     True
4    False
dtype: bool

In [27]:
df.duplicated([0, 1]) # Check only columns 0 and 1

0    False
1    False
2    False
3     True
4     True
dtype: bool
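
The same column subset can be passed to `drop_duplicates`, and the `keep` argument decides which of the duplicated rows survives. A minimal sketch, using a small made-up frame and illustrative result names:

```python
import pandas as pd

# Rows 0 and 2 agree on columns 0 and 1 but differ in column 2.
frame = pd.DataFrame([[2.0, 4.0, 3.0], [2.1, 3.9, 3.1], [2.0, 4.0, 3.5]])

# keep="first" (the default) retains the first occurrence of each duplicate group.
first_kept = frame.drop_duplicates(subset=[0, 1])

# keep="last" retains the last occurrence instead.
last_kept = frame.drop_duplicates(subset=[0, 1], keep="last")

# keep=False drops every row that belongs to a duplicate group.
none_kept = frame.drop_duplicates(subset=[0, 1], keep=False)

print(first_kept.index.tolist())
print(last_kept.index.tolist())
print(none_kept.index.tolist())
```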

### Data Transformation

In [20]:
salaries = pd.DataFrame({
    'Gender': ["Male", "Male", "Female", "Male", "Female", "Female"],
    'Salary': [15000, 17000, 12000, 14000, 13000, 12500],
})
salaries

Unnamed: 0,Gender,Salary
0,Male,15000
1,Male,17000
2,Female,12000
3,Male,14000
4,Female,13000
5,Female,12500


#### Transformation of a categorical variable

In [29]:
salaries.Gender = salaries.Gender.map({'Male': "M", "Female":"F"})
salaries

Unnamed: 0,Gender,Salary
0,M,15000
1,M,17000
2,F,12000
3,M,14000
4,F,13000
5,F,12500
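
One thing to watch with `map` is that any value missing from the dictionary becomes `NaN`. The sketch below shows this with a hypothetical third category, `"Other"`, and contrasts it with `replace`, which leaves unmapped values untouched. The names `gender`, `codes`, `mapped` and `replaced` are illustrative.

```python
import pandas as pd

gender = pd.Series(["Male", "Female", "Other"])
codes = {"Male": "M", "Female": "F"}

# map: values without a dictionary entry turn into NaN.
mapped = gender.map(codes)

# replace: values without a dictionary entry are kept as they are.
replaced = gender.replace(codes)

print(mapped.tolist())
print(replaced.tolist())
```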


#### Transformation of a numerical variable

In [21]:
def standardize(x):
    # z-score: subtract the mean, then divide by the sample standard deviation
    return (x - x.mean()) / x.std()

salaries.Salary = salaries.Salary.transform(standardize)
salaries

Unnamed: 0,Gender,Salary
0,Male,0.583953
1,Male,1.662019
2,Female,-1.033147
3,Male,0.044919
4,Female,-0.494114
5,Female,-0.763631


In [26]:
# Min-max scaling: the smallest value maps to 0 and the largest to 1
salaries.Salary = salaries.Salary.transform(
    lambda x: (x - x.min()) / (x.max() - x.min())
)
salaries

Unnamed: 0,Gender,Salary
0,Male,0.6
1,Male,1.0
2,Female,0.0
3,Male,0.4
4,Female,0.2
5,Female,0.1
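
The two rescalings above can be sanity-checked by their defining properties: a standardized variable has mean 0 and standard deviation 1, and a min-max scaled variable spans exactly [0, 1]. The sketch below applies the same formulas directly to a Series holding the salary values; the names `salary`, `z` and `scaled` are illustrative. Note that `Series.std` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

salary = pd.Series([15000, 17000, 12000, 14000, 13000, 12500], dtype="float64")

# Standardization (z-score) with the sample standard deviation.
z = (salary - salary.mean()) / salary.std()

# Min-max scaling to the interval [0, 1].
scaled = (salary - salary.min()) / (salary.max() - salary.min())

print(round(z.mean(), 10), round(z.std(), 10))
print(scaled.min(), scaled.max())
```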


#### Discretization

Data discretization can be achieved with the `cut` function of pandas, which assigns each value to one of a set of bins.

In [27]:
from numpy import random
x = random.normal(10, 0.4, 100)
print(x.min())
print(x.max())

9.036320180511632
10.695158308156882


In [28]:
x[:5]

array([10.04694437, 10.13603498, 10.66385418,  9.50587559,  9.99081607])

In [29]:
bins = [9, 9.25, 9.5, 9.75, 10, 10.25, 10.5, 10.75, 11]
xd = pd.cut(x, bins)
xd

[(10.0, 10.25], (10.0, 10.25], (10.5, 10.75], (9.5, 9.75], (9.75, 10.0], ..., (9.5, 9.75], (9.5, 9.75], (9.5, 9.75], (10.0, 10.25], (10.25, 10.5]]
Length: 100
Categories (8, interval[float64, right]): [(9.0, 9.25] < (9.25, 9.5] < (9.5, 9.75] < (9.75, 10.0] < (10.0, 10.25] < (10.25, 10.5] < (10.5, 10.75] < (10.75, 11.0]]
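
Once values are binned, a frequency table is often the next step. The sketch below is one way to get it, using a small fixed list instead of random draws so the result is reproducible; `values`, `edges`, `binned` and `counts` are illustrative names.

```python
import pandas as pd

values = [9.1, 9.6, 9.7, 10.1, 10.2, 10.2, 10.6]
edges = [9, 9.5, 10, 10.5, 11]

# Each value is assigned to a right-closed interval such as (9.0, 9.5].
binned = pd.cut(values, edges)

# Count the values per bin and list the bins in their natural order.
counts = pd.Series(binned).value_counts().sort_index()

print(counts)
```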

In [34]:
Age = [5, 13, 20, 12, 25, 45, 22, 18, 75]
bins = [0, 12, 19, 40, 60, 100]
dAge = pd.cut(Age, bins, labels = ['Child', 'Teen', 'Young', 'Middle Aged', 'Old'])
dAge

['Child', 'Teen', 'Young', 'Child', 'Young', 'Middle Aged', 'Young', 'Teen', 'Old']
Categories (5, object): ['Child' < 'Teen' < 'Young' < 'Middle Aged' < 'Old']
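
Two details of `cut` are easy to overlook: bins are closed on the right by default, and values outside the outermost edges become `NaN`. The sketch below demonstrates both with a few hand-picked ages (`ages`, `right_closed` and `left_closed` are illustrative names; the bins and labels repeat those used above).

```python
import pandas as pd

ages = [5, 12, 13, 101]
bins = [0, 12, 19, 40, 60, 100]
labels = ['Child', 'Teen', 'Young', 'Middle Aged', 'Old']

# Default right=True: intervals are (0, 12], (12, 19], ..., so 12 is a 'Child'.
right_closed = pd.cut(ages, bins, labels=labels)

# right=False: intervals are [0, 12), [12, 19), ..., so 12 is a 'Teen'.
left_closed = pd.cut(ages, bins, labels=labels, right=False)

# 101 lies beyond the last edge and is therefore missing in both results.
print(list(right_closed))
print(list(left_closed))
```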