# Feature Engineering

Welcome to the Feature Engineering lesson. In this lesson we will be covering:

- <b>Feature Engineering</b>
    - Transformation of categorical and numerical variables
    - Data imputation
    - One hot encoding
    - Binning
- <b>Feature Selection</b>
    - Applying domain knowledge
    - Correlation
- <b>Feature Scaling</b>
    - Normalization
    - Standardization

The lab for Lesson 8 will consist of all the exercises that you will find throughout the notebook. 

### Our goal will be to build a model that predicts whether a passenger survived the Titanic.

For this lesson we will again be using the Titanic Survival Dataset from Kaggle.

Let's review the column values once more as a reminder of the data we are using:

- Survived: Outcome of survival (0 = No; 1 = Yes)
- Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- Name: Name of passenger
- Sex: Sex of the passenger
- Age: Age of the passenger (Some entries contain ?)
- SibSp: Number of siblings and spouses of the passenger aboard
- Parch: Number of parents and children of the passenger aboard
- Ticket: Ticket number of the passenger
- Fare: Fare paid by the passenger
- Cabin: Cabin number of the passenger (Some entries contain ?)
- Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
- Boat: Lifeboat (if survived)
- Body: Body Number (if did not survive and body was recovered)
- Home.Dest: Home / Destination

In [None]:
# import libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import titanic dataset
titanic_data = pd.read_csv("titanic_data.csv")
titanic_data.head(10)

### Before we start our feature engineering process, we will need to handle our missing values.
We will need to first get rid of the question marks. Then we will take a look at the percentage of null values.

#### Review from lesson 6:

In [None]:
# replace ? with none
titanic_data = titanic_data.replace({'?': None})

# change the type of the age and fare columns to numeric
titanic_data['age'] = pd.to_numeric(titanic_data['age'], errors = 'coerce')
titanic_data['fare'] = pd.to_numeric(titanic_data['fare'], errors = 'coerce')

# Let's round the age values to the nearest whole number
titanic_data['age'] = titanic_data['age'].round()

# lets check for null values again
titanic_data.isnull().sum()/len(titanic_data)*100

We see there are a few features that have a large number of nulls.

***Should we get rid of them?*** ---> This is one example of how we are conducting <b>feature engineering</b> by applying our domain knowledge.

In this case, there is no right or wrong percentage threshold. However, if a large portion of data is missing, it may be best to remove the column since imputing may add bias to our models. But before we do that...

We need to first ask ourselves the reason behind why the data may be missing. 

- Cabin: we may assume that the missing values are because some passengers may not have had cabins, which could tell you something about whether or not they survived (missing data = perhaps no cabin, survived)
- Embarked: we may assume that missing values are because some passengers may not have given this info (small percentage)
- Boat: indicates whether or not they survived (missing data = perhaps did not survive)
- Body: indicates whether or not they survived (missing data = perhaps survived)
- Home.Dest: we may assume that missing values are because some passengers may not have given this info 
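
One way to check whether missingness itself carries signal is to compare survival rates between rows where a value is missing and rows where it is present. The sketch below uses a small made-up frame, not the Titanic data; the column names mirror the dataset, but the values are purely illustrative.

```python
import pandas as pd

# Toy data for illustration only; these are not rows from the Titanic dataset.
toy = pd.DataFrame({
    "survived": [1, 1, 0, 0, 0, 1],
    "boat": ["2", "11", None, None, None, "4"],
    "cabin": ["B5", None, None, "C22", None, None],
})

# Survival rate grouped by whether each column is missing (True = missing).
boat_missing_rate = toy.groupby(toy["boat"].isnull())["survived"].mean()
cabin_missing_rate = toy.groupby(toy["cabin"].isnull())["survived"].mean()

print(boat_missing_rate)
print(cabin_missing_rate)
```

In this toy frame the survival rate differs sharply by whether "boat" is missing but not by whether "cabin" is missing. Running the same comparison on the real columns is one way to test the assumptions listed above.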

Since home.dest is the only variable with a large percentage of missing values that is not associated with survival, let's look at the number of occurrences to see if there is any pattern. If there is a pattern, we can feature engineer this column to be more useful.

In [None]:
#Let's look at the percentage of occurrences 

titanic_data["home.dest"].value_counts()/titanic_data["home.dest"].value_counts().sum()*100

- The home/destination with the highest percentage is New York, NY with 8.6% of passengers arriving at this destination. The total number of locations is 369. 
- Since this is a large spread and the highest occurrence is less than 10%, it is safe to assume there are no patterns detected and we can remove this column. 

In [None]:
#Removing the home.dest column
titanic_data = titanic_data.drop(columns=['home.dest'])

titanic_data.head()

### Exercise 1

We dropped the "home.dest" column, but if we wanted to keep it, what could we have done to make this column useful to predict whether a passenger survived? 

Hint: What kind of information does the home or destination give you about a passenger? 

Hint Hint: Google the following cities: 
  - Southampton, UK
  - Queenstown (now Cobh), Ireland
  - Cherbourg, FR










(Double click here)

Response Exercise 1








### Feature transformation

Now let's consider three of the remaining variables with missing values: cabin, boat, and body. Since these features may help us predict survival, we can transform them so they are more useful to our model.

In [None]:
#Instead of using cabin number, let's create a new column showing whether or not the passenger had a cabin

titanic_data['has_cabin'] = ~titanic_data.cabin.isnull()

titanic_data.head()

In [None]:
#We will now repeat this process using the boat and body columns 

titanic_data['has_boat'] = ~titanic_data.boat.isnull()
titanic_data['has_body'] = ~titanic_data.body.isnull()

titanic_data.head()

If we take a look at the name column, we see there are titles within the names such as Miss, Master, Mr., Mrs., etc. 

We can conduct feature engineering on this column by creating a new column showing the title. This information may be useful to understand social status, profession, etc. which could help us understand the passengers' chances of survival.

We can use regular expressions to extract the title from each name. You can visit these links to learn more about regular expressions: 
- https://docs.python.org/3/howto/regex.html 
- https://docs.python.org/3/library/re.html
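
As a minimal sketch of how such a pattern behaves, the example below applies it to a few hypothetical names written in the dataset's "Surname, Title. Given names" layout. The helper `extract_title` and the sample names are assumptions made for illustration; unlike a bare `.group(1)` call, the helper returns `None` when no title is found instead of raising an error.

```python
import re

# Pattern: a space, a capitalised word, then a literal period.
TITLE_PATTERN = re.compile(r" ([A-Z][a-z]+)\.")

def extract_title(name):
    """Return the title found in a passenger name, or None if nothing matches."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

# Hypothetical names for illustration only.
sample_names = [
    "Smith, Mr. John",
    "Jones, Miss. Mary Ann",
    "Brown, Master. Tom",
    "No title here",
]
titles = [extract_title(n) for n in sample_names]
print(titles)  # ['Mr', 'Miss', 'Master', None]
```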

In [None]:
#Extracting title from the names and storing into a new column 
import re

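# A raw string (r'...') keeps the backslash in \. from being treated as a Python escape sequence
titanic_data['title'] = titanic_data['name'].apply(lambda x: re.search(r' ([A-Z][a-z]+)\.', x).group(1))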
titanic_data.head()

In [None]:
#Let's plot the title column to see how this looks
titanic_data['title'].value_counts().plot(kind='bar')

We see there are several titles that are less common. To help our model with its prediction, we can group the less common titles into one single group. 

In [None]:
#Grouping the less common titles as "Other"
titanic_data['title'] = titanic_data['title'].replace(['Don', 'Dona', 'Rev', 'Dr',
                    'Major', 'Lady', 'Sir', 'Col', 'Capt', 'Countess', 'Jonkheer', 'Ms', 'Mlle', 'Mme'],'Other')

#Plotting the results
titanic_data['title'].value_counts().plot(kind='bar')

We can now drop the irrelevant columns that are no longer needed.

In [None]:
#Removing the name, cabin, boat, body, ticket columns
titanic_data = titanic_data.drop(columns=['name','ticket', 'cabin', 'boat', 'body'])

titanic_data.head(10)

Now let's look at the embarked column. Since less than 1% of its values are missing, we can impute them using the most common value, i.e. the mode.

In [None]:
#How many missing values do we have?
titanic_data.embarked.isnull().sum()

In [None]:
#What are the other values? 
titanic_data.embarked.unique()

In [None]:
#What is the most common occurrence? 
titanic_data.embarked.value_counts()

In [None]:
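#We can impute the missing values by replacing None with the mode
#Assigning the result back avoids calling fillna(inplace=True) on a column selection, which newer pandas versions warn about
titanic_data['embarked'] = titanic_data['embarked'].fillna(titanic_data['embarked'].mode()[0])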

titanic_data.embarked.value_counts()

### Exercise 2

Feature engineer a new column by combining the "sibsp" and "parch" columns. You can call it "Family Size".

Hint: Think back to your pandas lessons (aka use pandas to combine these columns).

In [None]:
#Write Code Here











### One Hot Encoding

#### Transforming categorical variables into numerical variables
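
One hot encoding replaces each categorical column with one indicator column per category. The sketch below illustrates this on a small made-up frame (the values are not taken from the dataset) using `pd.get_dummies` with `drop_first=True`, which drops the first category of each column so that k categories become k-1 indicator columns.

```python
import pandas as pd

# Toy frame for illustration; the values are made up.
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
    "fare": [7.25, 71.28, 8.05],
})

# Numeric columns pass through unchanged; text columns are expanded.
# With drop_first=True, "female" and "C" become the implicit baselines.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

Dropping one category per column avoids perfectly redundant indicator columns: a row with `sex_male` equal to 0 is already known to be female.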



In [None]:
#Let's look at the dataframe
titanic_data.head(10)

In [None]:
#Creating a new dataframe that contains dummy variables 
titanic_data = pd.get_dummies(titanic_data, drop_first=True)
titanic_data.head()

#### Congratulations, we've now completed a few feature engineering techniques! 

#### But wait........ there's more!

Now let's go back to our dataframe and see what other features still contain missing values. 

In [None]:
#Let's check for null values again
titanic_data.isnull().sum()/len(titanic_data)*100

In [None]:
#Let's plot the 2 variables 

fg = sns.FacetGrid(titanic_data, col='survived')
fg.map(sns.histplot, "age", bins=20)
fg.add_legend()


fg = sns.FacetGrid(titanic_data, col='survived')
fg.map(sns.histplot, "fare", bins=20)
fg.add_legend()

Let's impute missing data for the remaining 2 numerical features.

Let's replace the missing values in the "age" and "fare" columns with the mean of each column.

In [None]:
# Impute missing values for 'age' and 'fare'
titanic_data['age'] = titanic_data.age.fillna(titanic_data.age.mean())
titanic_data['fare'] = titanic_data.fare.fillna(titanic_data.fare.mean())

In [None]:
#Verifying that all missing values were taken care of
titanic_data.isnull().sum()/len(titanic_data)*100

Let's plot the remaining 2 numerical variables - age and fare. 

In [None]:
#Using facetgrid to plot based on survival
fg = sns.FacetGrid(titanic_data, col='survived')
fg.map(sns.histplot, "age", bins=20)
fg.add_legend()

fg = sns.FacetGrid(titanic_data, col='survived')
fg.map(sns.histplot, "fare", bins=20)
fg.add_legend()

### Binning Numerical Data

One technique used when dealing with ranges of numerical data is creating bins that reflect patterns in the data. Binning puts all observations within a certain range into the same bin, which is one way to keep outliers in the data while reducing the noise they might otherwise create.

In this case, we can use the pandas function **`qcut()`** to bin the age column.
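
To illustrate what quantile-based binning does, the sketch below compares `pd.qcut` with the equal-width alternative `pd.cut` on a handful of made-up ages that include one extreme value. The numbers are assumptions chosen for illustration, not values from the dataset.

```python
import pandas as pd

# Made-up ages for illustration, including one extreme value.
ages = pd.Series([1, 5, 18, 22, 25, 30, 38, 80])

# qcut: quantile-based bins, so each bin holds roughly the same number of rows.
quantile_bins = pd.qcut(ages, q=4, labels=False)

# cut: equal-width bins, so a single extreme value can leave some bins empty.
width_bins = pd.cut(ages, bins=4, labels=False)

print(quantile_bins.tolist())
print(width_bins.tolist())
```

With `qcut` the extreme age simply lands in the top bin alongside its neighbours, while with `cut` it stretches the bin edges and ends up in a bin of its own.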

In [None]:
#Binning the age column 
titanic_data['age_bin'] = pd.qcut(titanic_data.age, q=4, labels=False)

#Plotting this column
titanic_data['age_bin'].value_counts().plot(kind='bar')

In [None]:
###Exercise 3 

#Using the qcut method, create a new bin column using "fare" that contains 10 bins



#Plot new column





## Feature Scaling

#### Normalization

- Normalization rescales the values so they fall within the [0, 1] range.
- It subtracts the minimum value from each value and divides the result by the range (the maximum minus the minimum).
- It is also known as Min-Max scaling.

<img src="Normalization Formula.PNG">

We can use the <b>min()</b> and <b>max()</b> methods in pandas to normalize our age column.
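
As a small worked example of the formula, the sketch below applies min-max scaling to a made-up Series using vectorized arithmetic, which operates on the whole column at once. The values are assumptions chosen so the result is easy to check by hand.

```python
import pandas as pd

# Made-up values for illustration.
values = pd.Series([10.0, 20.0, 25.0, 50.0])

# Min-max scaling: (x - min) / (max - min).
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized.tolist())  # [0.0, 0.25, 0.375, 1.0]
```

The smallest value always maps to 0 and the largest to 1, so a single extreme value compresses everything else toward one end of the range.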

In [None]:
# Applying normalization technique by using pandas min and max methods

# Find the max and min from the age column
max_age = titanic_data.age.max()
min_age = titanic_data.age.min()

# Use the Mapping Function
titanic_data['age'] = titanic_data.age.map(lambda p: (p - min_age)/(max_age - min_age))
  
# View normalized data
titanic_data.head()

In [None]:
#Let's plot the age column to see our new range of values
import plotly.express as px

hist = px.histogram(titanic_data,x = "age", opacity = 0.7)
hist

#### Standardization

- Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. 
- This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
- Standardized Measurement = (original measurement – mean of the variable) divided by (standard deviation of the variable)
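
As a small worked example of this formula, the sketch below standardizes a made-up Series. One detail worth knowing: the pandas `std()` method defaults to the sample standard deviation (`ddof=1`), whereas some other tools divide by the population standard deviation (`ddof=0`), so results can differ slightly between libraries.

```python
import pandas as pd

# Made-up values for illustration: mean is 5, population std is 2.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Default pandas behaviour: sample standard deviation (ddof=1).
z_sample = (values - values.mean()) / values.std()

# Population standard deviation (ddof=0) for comparison.
z_population = (values - values.mean()) / values.std(ddof=0)

print(z_sample.round(3).tolist())
print(z_population.round(3).tolist())
```

Either way the standardized column has a mean of zero; only the scale differs slightly depending on which standard deviation is used.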

<img src="Standardization Formula.PNG">

In [None]:
# Applying standardization technique by using pandas mean and standard deviation methods
column = 'fare'
titanic_data[column] = (titanic_data[column] - titanic_data[column].mean()) / titanic_data[column].std()

# View standardized data
titanic_data.head()

In [None]:
#Let's plot the fare column to see our new range of values
fare_hist = px.histogram(titanic_data, x="fare", opacity=0.7)
fare_hist

### Correlation 

- Pearson and Spearman are two statistical methods used to measure the strength of correlation between 2 variables. 

- <b>Pearson Correlation Coefficient </b> can be used with continuous variables that have a linear relationship. 

- <b>Spearman Correlation Coefficient </b> is rank-based, so it can be used when the relationship is monotonic but not necessarily linear, or when the variables are ordinal.

In this case, we will be using correlation techniques to select the best features for our model --> <b>feature selection</b>!
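
To make the difference between the two coefficients concrete, the sketch below computes both on a made-up relationship that always increases but is not linear (y = x cubed). The data is an assumption chosen for illustration; `DataFrame.corr` accepts a `method` argument for selecting the coefficient.

```python
import pandas as pd

# Made-up monotonic but non-linear relationship: y = x ** 3.
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman compares ranks only, so it reports a perfect correlation here, while Pearson comes out lower because the points do not lie on a straight line.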

In [None]:
def correlation_heatmap(data):
    # Pairwise correlations between all columns (Pearson by default)
    correlations = data.corr()

    fig, ax = plt.subplots(figsize=(20, 20))
    sns.heatmap(correlations, vmax=1.0, center=0, fmt='.2f', ax=ax,
                square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .70})
    plt.show()

correlation_heatmap(titanic_data)

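To avoid multicollinearity, we should now remove one feature from each highly correlated pair:
- age OR age_bin
- title_Mr OR sex_male
- has_boat (survived is the target, so it stays; has_boat is nearly a proxy for the outcome and is the one to drop)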
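
Reading pairs off a large heatmap by eye can be error-prone, so one option is to list them programmatically. The sketch below is a minimal illustration: the helper `high_correlation_pairs` and the 0.8 threshold are hypothetical choices, and the toy data is made up.

```python
import pandas as pd

def high_correlation_pairs(df, threshold=0.8):
    """Return (col_a, col_b, corr) for column pairs whose |correlation| exceeds threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    # Visit each unordered pair once (upper triangle of the matrix).
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], round(float(value), 3)))
    return pairs

# Made-up data: "b" is an exact multiple of "a"; "c" is only moderately related to "a".
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],
    "c": [3, 1, 4, 1, 5, 9],
})
pairs = high_correlation_pairs(toy, threshold=0.8)
print(pairs)
```

The threshold is a judgment call rather than a fixed rule, and which member of a pair to drop still depends on domain knowledge.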

# Congratulations Future Data Scientist/Machine Learning Engineer! 

## You've now added many awesome techniques to your toolbox. 

## You are one step closer to the $1,000.