# Feature Engineering: Overloaded Operators

Learning Objectives:

    - Combine two features together using binary operators.

---

# Feature Engineering

Feature engineering is the process of adding features to your data or modifying the ones it already has. PCA is one example of feature engineering, but there are many more ways to add, separate, change, or combine your features, and these may lead to better machine learning results.

You already know other ways of feature engineering as well, such as scaling and imputing. These transform your data into another form and often improve the results of machine learning.
### Load Data

Let's do some feature engineering on our Titanic data.

In [1]:
# Imports 
import pandas as pd

In [2]:
df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vReZBM5OC6GLYbacisp_ToNiu3CLWxqPXw7mWBsdRjnYOFLWNufdQ4qd8u5qTzUF2_sBUAMEi5cgy1U/pub?gid=1040198428&single=true&output=csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Summing Features:

Pandas 'overloads' many binary operators, such as +, -, * and /. When used with Pandas Series, these are applied element-wise, that is, to every item in a column.

In the Titanic data "SibSp" is the number of siblings and spouses a passenger has on board with them, and "Parch" is the number of parents and children that are with them. Let's say we want to add a new column that represents the total number of family members a person has aboard. We can define the column by adding the other two columns together.

In [3]:
df['TotalFamily'] = df['SibSp'] + df['Parch']
df = df.drop(['SibSp', 'Parch'], axis=1)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked,TotalFamily
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.05,,S,0
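As a minimal sketch of what "applied to every item" means, the toy DataFrame below uses made-up values (not the Titanic data) to show several operators working row by row. The name `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy data with made-up values; the column names mirror the Titanic columns
toy = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 2, 1]})

# Each operator works element-wise: row i of the result uses row i of each column
total = toy['SibSp'] + toy['Parch']       # 1+0, 0+2, 3+1
difference = toy['SibSp'] - toy['Parch']  # 1-0, 0-2, 3-1
product = toy['SibSp'] * toy['Parch']     # 1*0, 0*2, 3*1

print(total.tolist())       # [1, 2, 4]
print(difference.tolist())  # [1, -2, 2]
print(product.tolist())     # [0, 0, 3]
```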


### Concatenating Features:

Let's say we want to do something more complex, like have a column with the information about both the sex of a passenger and the approximate age by decade.

First we would round the 'Age' column to the nearest decade. The argument of Series.round() is the number of decimal places to round to. We want to round to the nearest 10, so we use a negative number.

In [4]:
df['Age'] = df['Age'].round(-1)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked,TotalFamily
0,1,0,3,"Braund, Mr. Owen Harris",male,20.0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,40.0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,30.0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,40.0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,40.0,373450,8.05,,S,0
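The sketch below repeats the rounding on a handful of made-up ages. One detail worth knowing: values exactly halfway between two decades are rounded to the nearest even multiple, so 35 becomes 40 but 25 becomes 20. If you would rather always round down to the start of the decade, floor division is one alternative (shown here as an option, not as part of the lesson's approach).

```python
import pandas as pd

# Made-up ages, including two exact halfway cases (35 and 25)
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 25.0])

# decimals=-1 rounds to the nearest 10; halfway cases go to the even multiple
decades = ages.round(-1)
print(decades.tolist())  # [20.0, 40.0, 30.0, 40.0, 20.0]

# Alternative: floor to the start of the decade with floor division
floored = ages // 10 * 10
print(floored.tolist())  # [20.0, 30.0, 20.0, 30.0, 20.0]
```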


Pandas overloads the '+' operator to concatenate two string columns as well as to sum two numeric columns. It raises an error, though, if we try to use it with one string feature and one numeric feature. We need to change the datatype of 'Age' to a string before concatenating with '+'.

In [5]:
df['GenderAge'] = df['Sex'] + df['Age'].astype('string')
df.drop(columns=['Sex', 'Age'], inplace=True)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Ticket,Fare,Cabin,Embarked,TotalFamily,GenderAge
0,1,0,3,"Braund, Mr. Owen Harris",A/5 21171,7.25,,S,1,male20.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",PC 17599,71.2833,C85,C,1,female40.0
2,3,1,3,"Heikkinen, Miss. Laina",STON/O2. 3101282,7.925,,S,0,female30.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",113803,53.1,C123,S,1,female40.0
4,5,0,3,"Allen, Mr. William Henry",373450,8.05,,S,0,male40.0
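Here is a small sketch of the error and one way around it, using made-up values. The `dtype=object` argument is an assumption that mimics the object dtype text columns often have after `read_csv`. The second half shows an optional variation: converting through the nullable integer type `'Int64'` before converting to a string drops the trailing `.0` seen in the output above.

```python
import pandas as pd

# Toy values; dtype=object mimics a typical text column loaded from a CSV
sex = pd.Series(['male', 'female'], dtype=object)
age = pd.Series([20.0, 40.0])

# Adding a string column to a numeric column raises a TypeError
try:
    sex + age
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# Converting to a string first makes '+' concatenate instead
with_decimal = sex + age.astype('string')

# Optional: go through 'Int64' first so the result has no '.0'
gender_age = sex + age.astype('Int64').astype('string')
print(gender_age.tolist())  # ['male20', 'female40']
```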


### Squaring and Multiplying Features

Let's do one more thing. Let's say we want to normalize the fares that passengers paid. We decide the way to do this is to multiply the fare by the square of the Pclass.

In [6]:
df['NormedFare'] = df['Fare'] * df['Pclass']**2
df.drop(columns=['Fare'], inplace=True)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Ticket,Cabin,Embarked,TotalFamily,GenderAge,NormedFare
0,1,0,3,"Braund, Mr. Owen Harris",A/5 21171,,S,1,male20.0,65.25
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",PC 17599,C85,C,1,female40.0,71.2833
2,3,1,3,"Heikkinen, Miss. Laina",STON/O2. 3101282,,S,0,female30.0,71.325
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",113803,C123,S,1,female40.0,53.1
4,5,0,3,"Allen, Mr. William Henry",373450,,S,0,male40.0,72.45
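The expression above relies on Python's operator precedence: `**` is evaluated before `*`. The sketch below, on two made-up rows, shows that the unparenthesized form matches `fare * (pclass ** 2)` and differs from `(fare * pclass) ** 2`. The variable names are illustrative.

```python
import pandas as pd

# Two made-up rows: a third-class fare and a first-class fare
fare = pd.Series([7.25, 53.1])
pclass = pd.Series([3, 1])

# ** binds tighter than *, so this is fare * (pclass ** 2)
normed = fare * pclass**2
explicit = fare * (pclass ** 2)

# Parentheses in a different place give a different calculation
different = (fare * pclass) ** 2

print(normed.tolist())            # [65.25, 53.1]
print(normed.equals(explicit))    # True
print(normed.equals(different))   # False
```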


---

# Feature Engineering: Strings
Learning Objectives:

    - Use Pandas string methods to divide, process, and combine text features.

---

# Strings
Strings are a common kind of data you will encounter. Luckily, Pandas has a robust set of string methods that let you manipulate them in useful ways.
## Load Data
Let's do some feature engineering on our Titanic data.

In [7]:
df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vReZBM5OC6GLYbacisp_ToNiu3CLWxqPXw7mWBsdRjnYOFLWNufdQ4qd8u5qTzUF2_sBUAMEi5cgy1U/pub?gid=1040198428&single=true&output=csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Separating Features: Strings

Pandas also contains string methods that can be applied to entire columns. These are accessible by putting .str before common string methods. For example, say we want separate columns for first and last names. We can use .str.split(). By default it splits on white space and replaces each string with a list of its pieces. By adding some arguments we can split on commas and expand the result into separate columns. The first argument of the Series.str.split() method is the separator to split on. expand=True returns 2 new columns rather than one column containing lists of strings.

In [8]:
# create 2 new columns, LastName and FirstName, by splitting the Name column on the comma
df[['LastName', 'FirstName']] = df['Name'].str.split(',', expand=True)
# drop the 'Name' column
df.drop('Name', axis=1, inplace=True)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,LastName,FirstName
0,1,0,3,male,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris
1,2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer)
2,3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina
3,4,1,1,female,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel)
4,5,0,3,male,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry
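To make the difference between the default behavior and `expand=True` concrete, here is a minimal sketch on two names copied from the table above. The `n=1` argument is an optional extra not used in the lesson: it limits the split to the first comma, which guarantees exactly two columns even if a name were to contain more than one comma.

```python
import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'])

# Without expand, each string is replaced by a list of pieces
as_lists = names.str.split(',')
print(as_lists[0])  # ['Braund', ' Mr. Owen Harris']

# With expand=True the pieces become columns; n=1 splits on the first comma only
parts = names.str.split(',', n=1, expand=True)
parts.columns = ['LastName', 'FirstName']
print(parts.shape)                # (2, 2)
print(parts.loc[0, 'LastName'])   # Braund
```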


## Clean-up

One thing we need to address is a little leftover dirt in our data from the last operation. We separated the 'Name' string on the comma, not the white space, so the white space remained in the string.

In [9]:
df.loc[0, 'FirstName']

' Mr. Owen Harris'

Let's clean up that column with another Pandas string method, Series.str.strip(). This removes whitespace from the front and back of all of the strings in a column.

In [10]:
df['FirstName'] = df['FirstName'].str.strip()
df.loc[0, 'FirstName']

'Mr. Owen Harris'
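Series.str.strip() has two one-sided relatives, Series.str.lstrip() and Series.str.rstrip(), which remove whitespace from only the front or only the back. A quick sketch on made-up strings:

```python
import pandas as pd

# Made-up strings with leading and trailing spaces
first = pd.Series([' Mr. Owen Harris', ' Miss. Laina '])

print(first.str.strip().tolist())   # ['Mr. Owen Harris', 'Miss. Laina']
print(first.str.lstrip().tolist())  # ['Mr. Owen Harris', 'Miss. Laina ']
print(first.str.rstrip().tolist())  # [' Mr. Owen Harris', ' Miss. Laina']
```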

## Combining Strings

We can also combine strings. Perhaps we decided that instead of 'Last Name, First Name' we wanted the format of our 'Name' column to be 'First Name Last Name'. We could use the '+' operator to combine them in reverse order to the original and leave out the comma. However, we want a space between them, so we will add that as well.

In [11]:
df['Name'] = df['FirstName'] + ' ' + df['LastName']
df.drop(columns=['LastName', 'FirstName'], inplace=True)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name
0,1,0,3,male,22.0,1,0,A/5 21171,7.25,,S,Mr. Owen Harris Braund
1,2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs. John Bradley (Florence Briggs Thayer) Cum...
2,3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss. Laina Heikkinen
3,4,1,1,female,35.0,1,0,113803,53.1,C123,S,Mrs. Jacques Heath (Lily May Peel) Futrelle
4,5,0,3,male,35.0,0,0,373450,8.05,,S,Mr. William Henry Allen
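As an alternative to chaining '+', Pandas also offers Series.str.cat(), which takes the separator as an argument. The sketch below uses two made-up rows to show that both approaches give the same result. Which one to use is mostly a matter of taste.

```python
import pandas as pd

first = pd.Series(['Mr. Owen Harris', 'Miss. Laina'])
last = pd.Series(['Braund', 'Heikkinen'])

# Concatenate with the overloaded '+' operator
with_plus = first + ' ' + last

# Same result with str.cat(), passing the separator explicitly
with_cat = first.str.cat(last, sep=' ')

print(with_plus.tolist())  # ['Mr. Owen Harris Braund', 'Miss. Laina Heikkinen']
print(with_cat.tolist())   # ['Mr. Owen Harris Braund', 'Miss. Laina Heikkinen']
```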


---

# Feature Engineering: Datetime

Learning Objectives:

    - Convert date-like objects into datetime types.
    - Use Pandas datetime methods to engineer new features using datetime features.

## Datetime

Another fantastic feature of Pandas is its robust handling of dates and times. Using these tools we can extract the year, month, week, day, or even the name of the day of the week into new columns. Maybe, in your dataset, there is a correlation between what month something happens in and the target you want to predict. Let's take a quick look at some data from King County house sales and create new columns for the month a house was sold and the day of the week it was sold.
## Load the Data

We will be using data about King County House Sales to get a datetime feature to work with.

In [12]:
import pandas as pd
df2 = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vSrgrUnz8mdosU-_k0aECouymqwds_mlaHpYlXzRtf7MBJ4N1r1inCfSDebaXwTVfLtH133EhwKf3mi/pub?gid=394699239&single=true&output=csv',
                  usecols=['date','price','bedrooms','bathrooms'])
df2.head()

Unnamed: 0,date,price,bedrooms,bathrooms
0,20141013T000000,221900.0,3,1.0
1,20141209T000000,538000.0,3,2.25
2,20150225T000000,180000.0,2,1.0
3,20141209T000000,604000.0,4,3.0
4,20150218T000000,510000.0,3,2.0


In this case the 'date' column loaded as an 'object' type. We can use pd.to_datetime() to convert the column to a datetime type.

In [13]:
df2['date'] = pd.to_datetime(df2['date'])
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date       21613 non-null  datetime64[ns]
 1   price      21613 non-null  float64       
 2   bedrooms   21613 non-null  int64         
 3   bathrooms  21613 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(1)
memory usage: 675.5 KB
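pd.to_datetime() inferred the format on its own here. If you prefer to be explicit, you can pass a `format` string. The sketch below assumes the `YYYYMMDDTHHMMSS` layout seen in the first rows above. The `errors='coerce'` option, also shown, turns unparseable values into missing timestamps (NaT) instead of raising an error. The value `'unknown'` is made up for the demonstration.

```python
import pandas as pd

# Two date strings copied from the table above
raw = pd.Series(['20141013T000000', '20150225T000000'])

# Spell out the layout: year, month, day, a literal 'T', then hour, minute, second
fmt = '%Y%m%dT%H%M%S'
parsed = pd.to_datetime(raw, format=fmt)
print(parsed.dt.year.tolist())  # [2014, 2015]

# errors='coerce' converts values that do not match into NaT
messy = pd.Series(['20141013T000000', 'unknown'])
coerced = pd.to_datetime(messy, format=fmt, errors='coerce')
print(coerced.isna().tolist())  # [False, True]
```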


Now that the column is a datetime type, we can do all sorts of fun stuff with the Series.dt set of attributes, like Series.dt.year, Series.dt.month, Series.dt.day (day of the month), Series.dt.weekday (day of the week, where Monday is 0 and Sunday is 6), etc. [Here is more information about what Pandas datetime functionality can do](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html). There are also some fun methods to retrieve the names of the months or weekdays. Note that these are methods, not attributes like those above, so they need parentheses: Series.dt.day_name() and Series.dt.month_name().

Let's make new columns for the day of the week and the month of the year that a house was sold. Maybe houses sell for more on weekends, or during certain seasons. We want our model to be able to capture that trend.

In [14]:
df2['year'] = df2['date'].dt.year
df2['month (numeric)'] = df2['date'].dt.month
df2['month (name)'] = df2['date'].dt.month_name()
df2['day of month'] = df2['date'].dt.day
df2['day of week (numeric)'] = df2['date'].dt.weekday
df2['day of week (name)'] = df2['date'].dt.day_name()
df2.head()

Unnamed: 0,date,price,bedrooms,bathrooms,year,month (numeric),month (name),day of month,day of week (numeric),day of week (name)
0,2014-10-13,221900.0,3,1.0,2014,10,October,13,0,Monday
1,2014-12-09,538000.0,3,2.25,2014,12,December,9,1,Tuesday
2,2015-02-25,180000.0,2,1.0,2015,2,February,25,2,Wednesday
3,2014-12-09,604000.0,4,3.0,2014,12,December,9,1,Tuesday
4,2015-02-18,510000.0,3,2.0,2015,2,February,18,2,Wednesday
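If the hunch is that weekend sales behave differently, the numeric day of the week can be turned directly into a True/False feature. This is a sketch on three hand-picked dates (a Monday, a Saturday, and a Sunday). The name `is_weekend` is an assumption, not a column from the lesson.

```python
import pandas as pd

# Three hand-picked dates: Monday, Saturday, Sunday
dates = pd.to_datetime(pd.Series(['2014-10-13', '2014-10-18', '2014-10-19']))

day_names = dates.dt.day_name()
# weekday is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
is_weekend = dates.dt.weekday >= 5

print(day_names.tolist())   # ['Monday', 'Saturday', 'Sunday']
print(is_weekend.tolist())  # [False, True, True]
```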


---

# Feature Engineering: Functions

Learning Objectives:

    - Use Series.apply() to apply a function to a column in a Pandas Dataframe.
    - Use Python lambda functions to create quick one-time use functions and use them with Series.apply().

---

## Functions

Often the changes we want to make to our data are not covered by the built-in methods of Pandas. Pandas also lets us define our own functions to apply to features. If we define a function that changes one input into another, we can use Series.apply() to apply that function to every value in a feature.
## Load Data

Let's do some more feature engineering on our Titanic data.

In [27]:
df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vReZBM5OC6GLYbacisp_ToNiu3CLWxqPXw7mWBsdRjnYOFLWNufdQ4qd8u5qTzUF2_sBUAMEi5cgy1U/pub?gid=1040198428&single=true&output=csv')
df.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Let's say instead of a float for the Fare value, we want to bin these values into 'Expensive' and 'Cheap' tickets. We could find the median ticket price and change every fare value above that price to 'Expensive' and every value at or below that price to 'Cheap'.

We might write a function for that. The function is written as if a single value were passed to it (not a whole array).

In [28]:
# find the median fare price
median_fare = df['Fare'].median()
# define a function that returns 'Expensive' or 'Cheap'
def bin_fare(fare):
    if fare > median_fare:
        return 'Expensive'
    else:
        return 'Cheap'

We defined `median_fare` outside of the function, but that's okay because Python will go looking in the global scope for that variable if it is not passed as an argument and available in the local scope of the function. Also notice that `bin_fare` just takes a single value.

Series.apply() will apply a function to each value in a series. Notice that we do not use parentheses after the function name when passing it: we are handing over the function itself, not calling it.

In [29]:
# apply bin_fare() function to each item in the 'Fare' column
df['Fare'] = df['Fare'].apply(bin_fare)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,Cheap,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,Expensive,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,Cheap,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,Expensive,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,Cheap,,S
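Relying on a global variable works, but the threshold can also be passed in explicitly. Series.apply() forwards extra keyword arguments to the function, so one option is the sketch below. The names `bin_value` and `threshold` are hypothetical, and the five fares are just the values from the first rows of the table, so their median differs from the median of the full dataset.

```python
import pandas as pd

def bin_value(value, threshold):
    """Return 'Expensive' if value is above threshold, otherwise 'Cheap'."""
    if value > threshold:
        return 'Expensive'
    return 'Cheap'

# Five fares taken from the first rows shown above
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])
threshold = fares.median()  # 8.05 for these five values

# Extra keyword arguments given to apply() are passed on to bin_value
binned = fares.apply(bin_value, threshold=threshold)
print(binned.tolist())  # ['Cheap', 'Expensive', 'Cheap', 'Expensive', 'Cheap']
```

Passing the threshold as an argument keeps the function reusable for other columns and other cut-off values.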


## Lambda Functions

Lambda functions can simplify our code by creating single-use functions. In the above example we wrote a whole new function just to transform one column, and we probably won't use that function anywhere else. A lambda function is a way to create a temporary function with less code.

Let's say we want to do the same kind of binning, but we want to bin the 'Age' column. We might decide that everyone over 30 is 'elderly' and everyone 30 or under is 'young'. I'm not saying that's true, but the mean age on the Titanic was about 30, so it's a good middle point.

In [18]:
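# Preview only: the result goes into a separate, illustrative variable so that
# 'Age' stays numeric until the final cell of this walkthrough
age_bins = df['Age'].apply(lambda x: 'elderly' if x > 30 else 'young')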

There are a few parts to using lambda functions, as shown below:

First we tell Python we are creating a lambda function with the special keyword 'lambda'.

In [None]:
lambda

Then we define the input for the function. There could be more than one input, but in our case the function just takes one input, the passenger's age. The age that is being passed as an input to the function will be called 'x'. We could call it anything, but 'x' is traditional for the input to a function.

In [None]:
lambda x:

Notice that there is a colon (:) after the x. That tells Python that we are done defining inputs and are ready to start defining the return statement, or the outputs.

In [None]:
lambda x: 'elderly' if x > 30

The syntax is a little backward compared to an if statement in a regular named function, but this is how it works with lambda functions. First you say what the output should be, then you state the condition. However, the lambda function needs to know what to do with all of the values, so it needs an 'else' clause as well. Without one, the snippet above is incomplete and will not run.

If we wanted to change values over 30 to 'elderly' and leave the rest alone, we would have to say that in the function.

In [None]:
lambda x: 'elderly' if x > 30 else x

However, we want it to change values of 30 or less to 'young'.

In [None]:
lambda x: 'elderly' if x > 30 else 'young'

The function will return 'elderly' if x (the age) is greater than 30, or else it will return 'young'.

Now we can pass this lambda function to the .apply() method.

In [30]:
df['Age'] = df['Age'].apply(lambda x: 'elderly' if x > 30 else 'young')
df.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,young,1,0,A/5 21171,Cheap,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,elderly,1,0,PC 17599,Expensive,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,young,0,0,STON/O2. 3101282,Cheap,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,elderly,1,0,113803,Expensive,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,elderly,0,0,373450,Cheap,,S
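A lambda is just a compact way of writing a small function. The sketch below, on made-up ages, shows the lambda next to an equivalent named function (`bin_age` is a hypothetical name). It also shows a detail to watch for: a missing age (NaN) is not greater than 30, so it falls into the 'else' branch and is labeled 'young'. If the column contains missing values and you want to keep them missing, one option is to check for them first.

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value
ages = pd.Series([22.0, 38.0, np.nan])

# A named function equivalent to the lambda
def bin_age(x):
    if x > 30:
        return 'elderly'
    else:
        return 'young'

from_def = ages.apply(bin_age)
from_lambda = ages.apply(lambda x: 'elderly' if x > 30 else 'young')

# NaN > 30 is False, so the missing age ends up as 'young'
print(from_lambda.tolist())          # ['young', 'elderly', 'young']
print(from_def.equals(from_lambda))  # True

# One way to leave missing ages untouched
kept_missing = ages.apply(
    lambda x: x if pd.isna(x) else ('elderly' if x > 30 else 'young')
)
print(kept_missing.isna().tolist())  # [False, False, True]
```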
