Feature Engineering: Overloaded Operators
Feature engineering means adding or modifying features in your data. PCA is one example of feature engineering, but there are many other ways to add, separate, change, or combine features that may lead to better machine learning results.
You already know other forms of feature engineering as well, such as scaling and imputing. These transform your data into another form and often improve the results of machine learning.

In [1]:
import pandas as pd

df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vReZBM5OC6GLYbacisp_ToNiu3CLWxqPXw7mWBsdRjnYOFLWNufdQ4qd8u5qTzUF2_sBUAMEi5cgy1U/pub?gid=1040198428&single=true&output=csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Summing Features:
Pandas 'overloads' many binary operators, such as +, -, * and /. When used with Pandas Series, these are applied elementwise, to every item in a column.
In the Titanic data "SibSp" is the number of siblings and spouses a passenger has on board with them, and "Parch" is the number of parents and children that are with them. Let's say we want to add a new column that represents the total number of family members a person has aboard. We can define the column by adding the other two columns together.

In [2]:
df['TotalFamily'] = df['SibSp'] + df['Parch']
df = df.drop(['SibSp', 'Parch'], axis=1)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked,TotalFamily
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.05,,S,0
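
One caveat with the elementwise sum above is how missing values behave. The sketch below uses toy Series (the names `sibsp` and `parch` and their values are made up for illustration, not read from the Titanic file) to compare the overloaded `+` with `Series.add` and its `fill_value` parameter.

```python
import pandas as pd

# Toy data (not from the Titanic file): None marks a missing count.
sibsp = pd.Series([1, None, 0])
parch = pd.Series([0, 2, None])

# The overloaded '+' propagates missing values: NaN + anything is NaN.
plain_sum = sibsp + parch

# Series.add with fill_value=0 treats a value missing on one side as 0.
filled_sum = sibsp.add(parch, fill_value=0)

print(plain_sum.tolist())   # [1.0, nan, nan]
print(filled_sum.tolist())  # [1.0, 2.0, 0.0]
```

Whether treating a missing count as zero is appropriate depends on the data, so this is a choice to make deliberately rather than a default.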


Concatenating Features:
Let's say we want to do something more complex, like have a column with information about both the sex of a passenger and their approximate age by decade.
One way to start would be to round the 'Age' column to the nearest decade. The argument of Series.round() is the number of decimal places to round to. To round to 10s, we pass a negative number: df['Age'].round(-1). The cell below concatenates the ages as they are, without this rounding step, as its output shows.
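
As a minimal sketch of the rounding step described above, the following uses a toy Series of ages (the values and the names `ages` and `decades` are chosen for illustration only) to show what a negative argument to `Series.round` does.

```python
import pandas as pd

# Toy ages chosen for illustration; not read from the Titanic file.
ages = pd.Series([22.0, 38.0, 26.0, 54.0])

# decimals=-1 rounds to the nearest ten (decimals=-2 would round to hundreds).
decades = ages.round(-1)

print(decades.tolist())  # [20.0, 40.0, 30.0, 50.0]
```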

In [3]:
# Pandas overloads '+' to concatenate two string columns, as well as to sum
# two numeric columns. It raises an error if we use it with one string and
# one numeric feature, though, so we need to change the datatype of 'Age'
# to a string before concatenating with '+'.
df['GenderAge'] = df['Sex'] + df['Age'].astype('string')
df.drop(columns=['Sex','Age'], inplace=True)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Ticket,Fare,Cabin,Embarked,TotalFamily,GenderAge
0,1,0,3,"Braund, Mr. Owen Harris",A/5 21171,7.25,,S,1,male22.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",PC 17599,71.2833,C85,C,1,female38.0
2,3,1,3,"Heikkinen, Miss. Laina",STON/O2. 3101282,7.925,,S,0,female26.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",113803,53.1,C123,S,1,female35.0
4,5,0,3,"Allen, Mr. William Henry",373450,8.05,,S,0,male35.0
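
The type rule mentioned in the cell above can be checked on a small scale. This is a sketch on toy Series (the names `sex`, `age`, and `gender_age` and the values are assumptions for illustration); `dtype=object` is set explicitly so the text column holds plain Python strings.

```python
import pandas as pd

# Toy columns; dtype=object keeps the text column as plain Python strings.
sex = pd.Series(['male', 'female'], dtype=object)
age = pd.Series([22.0, None])  # the second age is missing

# String + numeric is not defined, so pandas raises a TypeError.
try:
    sex + age
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Converting the numeric column to strings first turns '+' into concatenation.
# A missing age stays missing (<NA>) in the result.
gender_age = sex + age.astype('string')

print(mixed_ok)             # False
print(gender_age.tolist())
```

Note that rows with a missing 'Age' end up with a missing combined value, so they may need imputing before or after the concatenation.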


Squaring and Multiplying Features:
Let's do one more thing. Let's say we want to normalize the fares that passengers paid. We decide the way to do this is to multiply the fare by the square of the Pclass.

In [4]:
df['NormedFare'] = df['Fare'] * df['Pclass']**2
df.drop(columns='Fare', inplace=True)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Ticket,Cabin,Embarked,TotalFamily,GenderAge,NormedFare
0,1,0,3,"Braund, Mr. Owen Harris",A/5 21171,,S,1,male22.0,65.25
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",PC 17599,C85,C,1,female38.0,71.2833
2,3,1,3,"Heikkinen, Miss. Laina",STON/O2. 3101282,,S,0,female26.0,71.325
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",113803,C123,S,1,female35.0,53.1
4,5,0,3,"Allen, Mr. William Henry",373450,,S,0,male35.0,72.45
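
The expression used for 'NormedFare' relies on Python operator precedence and on scalar broadcasting. The sketch below reproduces it on two toy rows (the Series names `fare` and `pclass` are illustrative, with values copied from the first two rows shown above).

```python
import pandas as pd

# Values copied from the first two rows of the output above.
fare = pd.Series([7.25, 71.2833])
pclass = pd.Series([3, 1])

# '**' binds tighter than '*', so this is fare * (pclass ** 2).
# The scalar 2 is applied to every row; no explicit loop is needed.
normed = fare * pclass ** 2

# Same computation with the grouping written out.
explicit = fare * (pclass ** 2)

print(normed.tolist())
```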


Summary:
Pandas 'overloads', or changes the behavior of, many Python operators, such as '+', '-', '/', '*' and '**', so that they apply elementwise to pairs of features. This can be used to combine two features into one.