In [None]:
import pandas as pd

Table of contents:

- [Encoding categorical features](#1.-Enconding-categorical-features)
- [Normalization and Standardization](#2.-Normalization-and-Standardization)
- [Data imputation](#3.-Data-imputation)
- [Polynomial features](#4.-Polynomial-features)

In [None]:
# load titanic dataset
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/titanic.csv'
titanic = pd.read_csv(url, index_col='PassengerId')
titanic.head()

**Dataframe Columns:**

- Survived: Survival (0 = No; 1 = Yes)
- Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- Name: Name
- Sex: Sex
- Age: Age
- SibSp: Number of Siblings/Spouses Aboard
- Parch: Number of Parents/Children Aboard
- Ticket: Ticket Number
- Fare: Passenger Fare
- Cabin: Cabin
- Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

**Categorical features in the Titanic dataset**: Sex, Ticket, Cabin, Embarked.

**Numerical features in the Titanic dataset**: Pclass, Age, SibSp, Parch, Fare.

In [None]:
# missing values
titanic.isnull().sum()

## 1. Enconding categorical features

- [Ordinal encoding](#1.1.-Ordinal-encoding)
- [One hot encoding](#1.2.-One-hot-encoding)

Features are often categorical rather than continuous. For example, a person could be described by features such as ["male", "female"], ["from Europe", "from US", "from Asia"], ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]. Such features can be efficiently encoded as integers. 
To convert categorical features to integer codes, we can use the OrdinalEncoder.

### 1.1. Ordinal encoding

In [None]:
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()

In [None]:
# ordinal encoding of the "Sex" feature
titanic.Sex.unique()

In [None]:
encoder.fit(titanic[['Sex']])
encoder.transform(titanic[['Sex']])

In [None]:
encoder.categories_

In [None]:
# ordinal encoding of the "Embarked" feature
titanic.Embarked.unique()

In [None]:
# OrdinalEncoder does not encode missing values (older scikit-learn versions raise an error);
# for this example, we'll drop the 2 missing values in the "Embarked" column
titanic.dropna(subset=['Embarked'], how='any', axis=0, inplace=True)

encoder.fit(titanic[['Embarked']])
encoder.transform(titanic[['Embarked']])

In [None]:
encoder.categories_

In [None]:
# ordinal encoding of the "Sex" and "Embarked" features
encoder.fit(titanic[['Sex','Embarked']])
encoder.transform(titanic[['Sex','Embarked']])

In [None]:
encoder.categories_

Such an integer representation cannot, however, be used directly with every scikit-learn model: many estimators expect continuous input and would interpret the categories as being ordered, which is often not desired.
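To illustrate the ordering issue, here is a minimal sketch on a hypothetical toy column (`sizes` is made up for this example and is not part of the Titanic data). By default the integer codes follow alphabetical order; the `categories` parameter lets us state the intended order when one exists.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with a natural order (not from the Titanic dataset)
sizes = np.array([['small'], ['large'], ['medium'], ['small']], dtype=object)

# Default behaviour: categories are sorted alphabetically,
# so large=0, medium=1, small=2
default_codes = OrdinalEncoder().fit_transform(sizes)

# Explicit order: small=0, medium=1, large=2
ordered_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = ordered_encoder.fit_transform(sizes)

print(default_codes.ravel())  # [2. 0. 1. 2.]
print(ordered_codes.ravel())  # [0. 2. 1. 0.]
```

For features without a meaningful order (such as "Embarked"), neither set of codes is appropriate for models that treat the input as continuous.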

### 1.2. One hot encoding

Another way to convert categorical features into features that scikit-learn models can use is one-of-K encoding, also known as one-hot encoding. 
This encoding can be obtained with the OneHotEncoder, which transforms each categorical feature with n_categories possible values into n_categories binary features, exactly one of which is 1 and all others 0.

In [None]:
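# one hot encoding of the "Sex" feature
from sklearn.preprocessing import OneHotEncoder
# initialize the one-hot encoder with dense (non-sparse) output
# (scikit-learn >= 1.2; older versions use sparse=False instead)
encoder = OneHotEncoder(sparse_output=False)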

In [None]:
titanic[['Sex']].head()

In [None]:
encoder.fit(titanic[['Sex']])
encoder.transform(titanic[['Sex']])

In [None]:
encoder.categories_

In [None]:
# one hot encoding of the "Embarked" feature
titanic.Embarked.head(10)

In [None]:
encoder.fit(titanic[['Embarked']])
encoder.transform(titanic[['Embarked']])

In [None]:
encoder.categories_

In [None]:
# one hot encoding of the "Sex" and "Embarked" features
encoder.fit_transform(titanic[['Sex','Embarked']])

In [None]:
encoder.categories_

## 2. Normalization and Standardization 

- [Normalization](#2.1.-Normalization)
- [Standardization](#2.2.-Standardization)

In [None]:
X = titanic[['Pclass','Age','Fare']] # feature matrix
X

### 2.1. Normalization

**Normalization** is the process of converting the actual range of values that a numerical feature can take into a standard range of values, typically the interval [0,1] or [-1,1].
This can be achieved using MinMaxScaler or MaxAbsScaler, respectively.

Normalizing the data is not a strict requirement. 
However, in practice, it can lead to an increased speed of training. 

In [None]:
# MinMaxScaler scales each feature to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X)
scaler.transform(X)

In [None]:
# MaxAbsScaler scales each feature to the [-1, 1] range
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
scaler.fit(X)
scaler.transform(X)
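As a sanity check, both scalers can be reproduced by hand. The sketch below uses a hypothetical toy array (`toy`, not Titanic data) and assumes the default settings of both scalers: min-max scaling computes `(x - min) / (max - min)` per column, and max-abs scaling computes `x / max(|x|)` per column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

# Hypothetical toy data; the first column contains a negative value
toy = np.array([[-4.0, 10.0],
                [ 2.0, 30.0],
                [ 1.0, 50.0]])

# min-max scaling by hand: (x - min) / (max - min), column by column
col_min, col_max = toy.min(axis=0), toy.max(axis=0)
manual_minmax = (toy - col_min) / (col_max - col_min)

# max-abs scaling by hand: x / max(|x|), column by column
manual_maxabs = toy / np.abs(toy).max(axis=0)

print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(toy)))
print(np.allclose(manual_maxabs, MaxAbsScaler().fit_transform(toy)))
```

Note that max-abs scaling keeps the sign of each value and does not shift the data, so a column with only non-negative values ends up in [0, 1].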

### 2.2. Standardization

**Standardization (or mean removal and variance scaling)** is the procedure by which the feature values are rescaled so that each feature has mean 0 and standard deviation 1, as in a standard normal distribution.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
scaler.transform(X)
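The rescaling can also be written out by hand as `z = (x - mean) / std`, computed per column. The following sketch uses a hypothetical toy array (`toy`) and assumes the default StandardScaler settings, which use the population standard deviation (`ddof=0`, the NumPy default).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not from the Titanic dataset)
toy = np.array([[1.0, 10.0],
                [2.0, 30.0],
                [3.0, 50.0]])

# standardization by hand: subtract the column mean, divide by the column std
# (np.std uses ddof=0 by default, matching StandardScaler)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```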

## 3. Data imputation

In [None]:
X.isnull().sum()

Typical approaches to dealing with missing values for a feature include:
- removing rows with missing features from the dataset (this can be done if your dataset is big enough)
- using a **data imputation** technique

The SimpleImputer class provides basic strategies for imputing missing values. 
Missing values can be imputed with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the missing values are located. 

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
titanic.Age.mean()

In [None]:
imputer = SimpleImputer(strategy='mean')
imputer.fit(X)
imputed_X = imputer.transform(X)
imputed_X
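The other strategies work the same way. As a brief sketch on a hypothetical one-column toy array (`toy`, not Titanic data), the value learned for each strategy can be inspected through the `statistics_` attribute after fitting:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing value
toy = np.array([[1.0], [1.0], [np.nan], [4.0], [10.0]])

learned = {}
for strategy in ['mean', 'median', 'most_frequent']:
    toy_imputer = SimpleImputer(strategy=strategy)
    toy_imputer.fit(toy)
    # statistics_ holds the fill value computed for each column
    learned[strategy] = toy_imputer.statistics_[0]

# a constant fill value does not depend on the data
constant_imputer = SimpleImputer(strategy='constant', fill_value=0.0)
filled_constant = constant_imputer.fit_transform(toy)

print(learned)
print(filled_constant.ravel())
```

On skewed features such as "Fare", the median is often a more robust choice than the mean, although which strategy works best depends on the data and the model.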

## 4. Polynomial features

It is often useful to add complexity to the model by considering nonlinear features of the input data. 
A simple and common method is to use **polynomial features**, which add high-order and interaction terms of the existing features. 
They are implemented in PolynomialFeatures.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
poly = PolynomialFeatures(degree=2)
poly.fit(imputed_X)
poly.transform(imputed_X)

In [None]:
# get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out
poly.get_feature_names_out(X.columns)