In [1]:
import pandas as pd

# Feature Engineering

Table of contents:

- [Encoding categorical features](#1.-Enconding-categorical-features)
- [Normalization and Standardization](#2.-Normalization-and-Standardization)
- [Data imputation](#3.-Data-imputation)
- [Polynomial features](#4.-Polynomial-features)

In [33]:
# load titanic dataset
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/titanic.csv'
titanic = pd.read_csv(url, index_col='PassengerId')
titanic.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


**DataFrame columns:**

- Survived: 0 = No; 1 = Yes
- Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- Name: Name
- Sex: Sex
- Age: Age
- SibSp: Number of Siblings/Spouses Aboard
- Parch: Number of Parents/Children Aboard
- Ticket: Ticket Number
- Fare: Passenger Fare
- Cabin: Cabin
- Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

**Categorical features in the Titanic dataset**: Sex, Ticket, Cabin, Embarked.

**Numerical features in the Titanic dataset**: Pclass, Age, SibSp, Parch, Fare.

In [34]:
# missing values
titanic.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

## 1. Enconding categorical features

- [Ordinal encoding](#1.1.-Ordinal-encoding)
- [One hot encoding](#1.2.-One-hot-encoding)

Features are often categorical rather than continuous. For example, a person could be described by features such as ["male", "female"], ["from Europe", "from US", "from Asia"], or ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]. Such features can be coded efficiently as integers.
To convert categorical features to integer codes, we can use the `OrdinalEncoder`.

### 1.1. Ordinal encoding

In [35]:
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()

In [36]:
# ordinal encoding of the "Sex" feature
titanic.Sex.unique()

array(['male', 'female'], dtype=object)

In [37]:
# fit the encoder and transform the column sex
encoder.fit(titanic[['Sex']])
encoder.transform(titanic[['Sex']])

array([[1.],
       [0.],
       [0.],
       ...,
       [0.],
       [1.],
       [1.]])

The feature `Sex` has been replaced by numeric codes.

In [38]:
encoder.categories_ # female is mapped to 0, male to 1

[array(['female', 'male'], dtype=object)]

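IMPORTANT: ordinal encoding assumes that there is a clear ordering of the categories. By default, the codes follow the sorted order of the labels (alphabetical for strings), which is not necessarily a meaningful order.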
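When a feature really is ordered, the order can be passed to the encoder explicitly. The following is a minimal sketch on a hypothetical `size` column (it is not part of the Titanic data), using the `categories` parameter of `OrdinalEncoder`:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature, used only for illustration
sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# List the categories in their meaningful order;
# without this argument they would be sorted alphabetically
size_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
size_codes = size_encoder.fit_transform(sizes[['size']])

print(size_codes.ravel())  # small -> 0, medium -> 1, large -> 2
```

With the explicit order, the integer codes increase with the size, so a model that treats them as numbers sees a sensible ranking.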

In [39]:
# ordinal encoding of the "Embarked" feature
titanic.Embarked.unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [45]:
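# encoded_missing_value requires scikit-learn >= 1.1;
# older versions raise the TypeError shown below
encoder = OrdinalEncoder(encoded_missing_value=-1)
encoder.fit(titanic[['Embarked']])
encoder.transform(titanic[['Embarked']])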

TypeError: __init__() got an unexpected keyword argument 'encoded_missing_value'
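The `encoded_missing_value` argument is only available from scikit-learn 1.1 onward, which would explain the `TypeError` above. One version-independent workaround, sketched here on a toy column rather than the full dataset, is to replace the missing values with an explicit placeholder label before encoding. The label `'missing'` is an arbitrary choice made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column with one missing port, mimicking "Embarked"
ports = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'Q']})

# Replace NaN with a placeholder label (arbitrary name, chosen for this sketch)
ports_filled = ports.fillna('missing')

port_encoder = OrdinalEncoder()
port_codes = port_encoder.fit_transform(ports_filled[['Embarked']])

print(port_encoder.categories_)  # the placeholder becomes its own category
print(port_codes.ravel())
```

The placeholder is treated like any other category, so it receives its own integer code instead of propagating `nan`.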

In [41]:
encoder.categories_

[array(['C', 'Q', 'S', nan], dtype=object)]

In [11]:
# ordinal encoding of the "Sex" and "Embarked" features
encoder.fit(titanic[['Sex','Embarked']])
encoder.transform(titanic[['Sex','Embarked']])

array([[1., 2.],
       [0., 0.],
       [0., 2.],
       ...,
       [0., 2.],
       [1., 0.],
       [1., 1.]])

In [12]:
encoder.categories_

[array(['female', 'male'], dtype=object), array(['C', 'Q', 'S'], dtype=object)]

Such an integer representation cannot, however, be used directly with all scikit-learn models: many of them expect continuous input and would interpret the categories as being ordered, which is often not desired.

### 1.2. One hot encoding

Another way to convert categorical features into features that can be used with scikit-learn models is one-of-K encoding, also known as one-hot encoding.
This type of encoding can be obtained with the `OneHotEncoder`, which transforms each categorical feature with `n_categories` possible values into `n_categories` binary features, with one of them 1 and all others 0.
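As a small illustration of that description, the sketch below one-hot encodes a toy column that reuses the labels of `Embarked`. It relies on the encoder's default sparse output and converts it with `toarray()`, so it does not depend on the name of the dense-output argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data using the same labels as the "Embarked" column
ports = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})

onehot = OneHotEncoder()
# The default output is a SciPy sparse matrix; toarray() makes it dense for display
dense = onehot.fit_transform(ports[['Embarked']]).toarray()

print(onehot.categories_)  # one output column per category: C, Q, S
print(dense)               # each row contains exactly one 1
```

Three categories give three binary columns, and every row sums to 1.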

In [13]:
# one hot encoding of the "sex" feature
from sklearn.preprocessing import OneHotEncoder
# initialize the one-hot encoder with dense output
# (scikit-learn >= 1.2 renames this argument to sparse_output)
encoder = OneHotEncoder(sparse=False)

In [14]:
titanic[['Sex']].head()

| PassengerId | Sex |
|---|---|
| 1 | male |
| 2 | female |
| 3 | female |
| 4 | female |
| 5 | male |


In [15]:
encoder.fit(titanic[['Sex']])
encoder.transform(titanic[['Sex']])

array([[0., 1.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [0., 1.],
       [0., 1.]])

In [16]:
encoder.categories_

[array(['female', 'male'], dtype=object)]

In [17]:
# one hot encoding of the "Embarked" feature
titanic.Embarked.head(10)

PassengerId
1     S
2     C
3     S
4     S
5     S
6     Q
7     S
8     S
9     S
10    C
Name: Embarked, dtype: object

In [18]:
encoder.fit(titanic[['Embarked']])
encoder.transform(titanic[['Embarked']])

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

In [19]:
encoder.categories_

[array(['C', 'Q', 'S'], dtype=object)]

In [20]:
# one hot encoding of the "Sex" and "Embarked" features
encoder.fit_transform(titanic[['Sex','Embarked']])

array([[0., 1., 0., 0., 1.],
       [1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 1.],
       ...,
       [1., 0., 0., 0., 1.],
       [0., 1., 1., 0., 0.],
       [0., 1., 0., 1., 0.]])

In [21]:
encoder.categories_

[array(['female', 'male'], dtype=object), array(['C', 'Q', 'S'], dtype=object)]

## 2. Normalization and Standardization 

- [Normalization](#2.1.-Normalization)
- [Standardization](#2.2.-Standardization)

In [22]:
X = titanic[['Pclass','Age','Fare']] # feature matrix
X

| PassengerId | Pclass | Age | Fare |
|---|---|---|---|
| 1 | 3 | 22.0 | 7.2500 |
| 2 | 1 | 38.0 | 71.2833 |
| 3 | 3 | 26.0 | 7.9250 |
| 4 | 1 | 35.0 | 53.1000 |
| 5 | 3 | 35.0 | 8.0500 |
| ... | ... | ... | ... |
| 887 | 2 | 27.0 | 13.0000 |
| 888 | 1 | 19.0 | 30.0000 |
| 889 | 3 | NaN | 23.4500 |
| 890 | 1 | 26.0 | 30.0000 |
| 891 | 3 | 32.0 | 7.7500 |


### 2.1. Normalization

**Normalization** is the process of converting the actual range of values that a numerical feature can take into a standard range, typically the interval [0, 1] or [-1, 1].
This can be achieved using `MinMaxScaler` or `MaxAbsScaler`, respectively.

Normalizing the data is not a strict requirement.
In practice, however, it can speed up training, particularly for models fitted with gradient-based optimization.
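To make the two rescalings concrete, here is a minimal NumPy sketch of the column-wise formulas that `MinMaxScaler` and `MaxAbsScaler` apply. It uses only the first five rows of the table above, so the numbers differ from the full-dataset outputs, which use the minimum and maximum over all rows.

```python
import numpy as np

# First five rows of Pclass, Age, and Fare from the table above
X_small = np.array([[3, 22.0, 7.25],
                    [1, 38.0, 71.2833],
                    [3, 26.0, 7.925],
                    [1, 35.0, 53.1],
                    [3, 35.0, 8.05]])

# Min-max scaling: (x - min) / (max - min), computed per column -> range [0, 1]
col_min = X_small.min(axis=0)
col_max = X_small.max(axis=0)
minmax = (X_small - col_min) / (col_max - col_min)

# Max-abs scaling: x / max(|x|), computed per column -> range [-1, 1]
maxabs = X_small / np.abs(X_small).max(axis=0)

print(minmax.round(3))
print(maxabs.round(3))
```

Because all values here are non-negative, the max-abs result also falls in [0, 1]; negative values would map into [-1, 0).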

In [23]:
# min-max scaler scales data to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X)
scaler.transform(X)

array([[1.        , 0.27117366, 0.01415106],
       [0.        , 0.4722292 , 0.13913574],
       [1.        , 0.32143755, 0.01546857],
       ...,
       [1.        ,        nan, 0.04577135],
       [0.        , 0.32143755, 0.0585561 ],
       [1.        , 0.39683338, 0.01512699]])

In [24]:
# max-abs scaler scales data to the [-1, 1] range 
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
scaler.fit(X)
scaler.transform(X)

array([[1.        , 0.275     , 0.01415106],
       [0.33333333, 0.475     , 0.13913574],
       [1.        , 0.325     , 0.01546857],
       ...,
       [1.        ,        nan, 0.04577135],
       [0.33333333, 0.325     , 0.0585561 ],
       [1.        , 0.4       , 0.01512699]])

### 2.2. Standardization

**Standardization (or mean removal and variance scaling)** rescales each feature so that it has mean 0 and standard deviation 1, the mean and standard deviation of a standard normal distribution. It shifts and scales the values but does not change the shape of their distribution.
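The underlying formula is z = (x - mean) / std. The sketch below applies it with NumPy to the first five ages from the table above; it uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# First five ages from the table above
ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0])

# np.std defaults to the population formula (ddof=0)
z = (ages - ages.mean()) / ages.std()

print(z.round(3))
print(z.mean().round(6), z.std().round(6))  # approximately 0 and 1
```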

In [25]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
scaler.transform(X)

array([[ 0.82520863, -0.52766856, -0.50023975],
       [-1.57221121,  0.57709388,  0.78894661],
       [ 0.82520863, -0.25147795, -0.48664993],
       ...,
       [ 0.82520863,         nan, -0.17408416],
       [-1.57221121, -0.25147795, -0.0422126 ],
       [ 0.82520863,  0.16280796, -0.49017322]])

## 3. Data imputation

In [26]:
X.isnull().sum()

Pclass      0
Age       177
Fare        0
dtype: int64

Typical approaches to dealing with missing values for a feature include:
- removing rows with missing features from the dataset (feasible if the dataset is big enough)
- using a **data imputation** technique
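Both approaches can be sketched with plain pandas. The toy frame below copies rows 888 to 890 of `X`, one of which has a missing age; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Rows 888-890 of X; row 889 has no recorded age
toy = pd.DataFrame({'Pclass': [1, 3, 1],
                    'Age': [19.0, np.nan, 26.0],
                    'Fare': [30.0, 23.45, 30.0]},
                   index=[888, 889, 890])

# Approach 1: drop every row that contains a missing value
dropped = toy.dropna()

# Approach 2: impute the missing age with the mean of the observed ages
filled = toy.fillna({'Age': toy['Age'].mean()})

print(len(toy), len(dropped))  # one row is lost when dropping
print(filled)
```

Dropping loses the fare and class information of the removed row, while imputation keeps the row at the cost of inventing a plausible age.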

The `SimpleImputer` class provides basic strategies for imputing missing values.
Missing values can be imputed with a provided constant value, or with a statistic (mean, median, or most frequent value) of the column in which they are located.
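The strategies can be compared side by side on a small column. This is a sketch on toy values (four observed ages and one missing), not on the Titanic data; the constant `-1` is an arbitrary sentinel chosen for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column: four observed ages and one missing value
ages = np.array([[22.0], [np.nan], [26.0], [35.0], [35.0]])

# Value used to fill the gap under each statistic-based strategy
fills = {}
for strategy in ('mean', 'median', 'most_frequent'):
    imputed = SimpleImputer(strategy=strategy).fit_transform(ages)
    fills[strategy] = float(imputed[1, 0])
print(fills)

# Constant strategy: fill with a user-provided value (arbitrary sentinel here)
constant = SimpleImputer(strategy='constant', fill_value=-1).fit_transform(ages)
print(constant.ravel())
```

The mean and the median differ whenever the observed values are skewed, so the choice of strategy can matter for features such as `Fare`.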

In [27]:
from sklearn.impute import SimpleImputer

In [28]:
titanic.Age.mean()

29.64209269662921

In [29]:
imputer = SimpleImputer(strategy='mean')
imputer.fit(X)
imputed_X = imputer.transform(X)
imputed_X

array([[ 3.       , 22.       ,  7.25     ],
       [ 1.       , 38.       , 71.2833   ],
       [ 3.       , 26.       ,  7.925    ],
       ...,
       [ 3.       , 29.6420927, 23.45     ],
       [ 1.       , 26.       , 30.       ],
       [ 3.       , 32.       ,  7.75     ]])

## 4. Polynomial features

It is often useful to add complexity to the model by considering nonlinear features of the input data.
A simple and common method is **polynomial features**, which adds high-order and interaction terms of the existing features.
It is implemented in `PolynomialFeatures`.
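For two features a and b, a degree-2 expansion produces the columns 1, a, b, a², ab, b². The sketch below checks this on a single made-up sample with a = 2 and b = 3.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features, a = 2 and b = 3 (made-up values)
sample = np.array([[2.0, 3.0]])

# Columns: 1, a, b, a^2, a*b, b^2
expanded = PolynomialFeatures(degree=2).fit_transform(sample)
print(expanded)

# include_bias=False drops the constant column of ones
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(sample)
print(no_bias)
```

The number of generated columns grows quickly with the degree and the number of input features, so low degrees are the usual starting point.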

In [30]:
from sklearn.preprocessing import PolynomialFeatures

In [31]:
poly = PolynomialFeatures(degree=2)
poly.fit(imputed_X)
poly.transform(imputed_X)

array([[1.00000000e+00, 3.00000000e+00, 2.20000000e+01, ...,
        4.84000000e+02, 1.59500000e+02, 5.25625000e+01],
       [1.00000000e+00, 1.00000000e+00, 3.80000000e+01, ...,
        1.44400000e+03, 2.70876540e+03, 5.08130886e+03],
       [1.00000000e+00, 3.00000000e+00, 2.60000000e+01, ...,
        6.76000000e+02, 2.06050000e+02, 6.28056250e+01],
       ...,
       [1.00000000e+00, 3.00000000e+00, 2.96420927e+01, ...,
        8.78653659e+02, 6.95107074e+02, 5.49902500e+02],
       [1.00000000e+00, 1.00000000e+00, 2.60000000e+01, ...,
        6.76000000e+02, 7.80000000e+02, 9.00000000e+02],
       [1.00000000e+00, 3.00000000e+00, 3.20000000e+01, ...,
        1.02400000e+03, 2.48000000e+02, 6.00625000e+01]])

In [None]:
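# get_feature_names was removed in scikit-learn 1.2;
# from version 1.0 onward use poly.get_feature_names_out(X.columns)
poly.get_feature_names(X.columns)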