# L9 Feature Engineering

This lesson covers a few aspects of Feature Engineering, which is similar to Data Wrangling. In particular, we're going to focus on the Feature Cross, and finish with One Hot Encoding.

Also, we're going to look at a new data set: a subsample of NYC taxi rides taken from the NYC Taxi and Limousine Commission (https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). This data set is one of Seaborn's example data sets, and can be loaded directly from Jupyter with `sns.load_dataset`. You can find all of the other Seaborn-provided data sets on GitHub (https://github.com/mwaskom/seaborn-data).

Let's take a look!

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = sns.load_dataset('taxis')
df.shape

In [None]:
df.head(5)

Before we begin, you may like to explore this data set and get a feel for it.

## Feature Crosses

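The Feature Cross is a broad category of feature engineering where you combine multiple features into a new one. Often the features involved are numeric.

As a first example, we can compute each ride's travel time by subtracting the `pickup` column from the `dropoff` column. Since these are `datetime` objects, the difference is a time delta, so we'll have to convert it to seconds.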

In [None]:
df.dtypes

In [None]:
df['travel_time'] = (df['dropoff'] - df['pickup']).apply(lambda x: x.total_seconds())
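
The same conversion can also be written without `apply`, using the vectorized `.dt.total_seconds()` accessor that pandas provides for time-delta columns. Below is a minimal sketch on a made-up two-row frame (the timestamps are illustrative, not taken from the data set; only the column names mirror it).

```python
import pandas as pd

# Toy data, made up for illustration; column names mirror the taxis data set.
toy = pd.DataFrame({
    "pickup": pd.to_datetime(["2019-03-23 20:21:09", "2019-03-04 16:11:55"]),
    "dropoff": pd.to_datetime(["2019-03-23 20:27:24", "2019-03-04 16:19:00"]),
})

# Subtracting two datetime columns yields a timedelta column.
delta = toy["dropoff"] - toy["pickup"]

# The .dt accessor converts every timedelta to seconds in one vectorized call.
toy["travel_time"] = delta.dt.total_seconds()
print(toy["travel_time"].tolist())  # [375.0, 425.0]
```

On a frame this small the two approaches are interchangeable; the accessor mainly pays off in readability and speed on larger data.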

In [None]:
ax = sns.histplot(data=df, x='travel_time')
ax.set_xlabel('Travel time (s)')

Another thing we might like to do is see how travel time changes across boroughs. But customers can travel between boroughs, so we really only want to look at rides that start and end in the same borough.

Here we create a simple `pd.Series` of `True`/`False` values indicating whether the pickup and dropoff boroughs are the same.

In [None]:
df['same_borough'] =  (df['dropoff_borough'] == df['pickup_borough'])

In [None]:
# A boolean column can be used directly as a row mask.
sns.histplot(data=df[df['same_borough']], x='travel_time', hue='dropoff_borough')
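
To see what the filter keeps, here is a small sketch on made-up rides (the values are hypothetical; only the column names follow the data set). The boolean column works as a row mask, so only intra-borough rides survive.

```python
import pandas as pd

# Toy rides, made up for illustration.
toy = pd.DataFrame({
    "pickup_borough": ["Manhattan", "Queens", "Brooklyn", "Manhattan"],
    "dropoff_borough": ["Manhattan", "Manhattan", "Brooklyn", "Bronx"],
    "travel_time": [375.0, 1500.0, 600.0, 1200.0],
})

# Element-wise comparison of two columns gives a True/False Series.
toy["same_borough"] = toy["dropoff_borough"] == toy["pickup_borough"]

# Indexing with the boolean Series keeps only the rows where it is True.
intra = toy[toy["same_borough"]]
print(intra[["pickup_borough", "dropoff_borough", "travel_time"]])
```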

Although I am not using this specific Feature Cross to draw conclusions, it helped me plot intra-borough travel times very quickly.

Also, we see we have a lot more trips in Manhattan, so we should probably compare relative frequencies within each borough rather than raw counts.

In [None]:
sns.histplot(data=df[df['same_borough']], x='travel_time', hue='dropoff_borough',
             stat='percent', common_norm=False)
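
The combination `stat='percent', common_norm=False` normalizes each borough's histogram on its own, so every borough's bars add up to 100. The sketch below reproduces that idea by hand with `np.histogram` on made-up travel times and two hypothetical bins; it is an illustration of the normalization, not Seaborn's internal code.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration: Manhattan has three times as many rides.
toy = pd.DataFrame({
    "dropoff_borough": ["Manhattan"] * 6 + ["Queens"] * 2,
    "travel_time": [200, 300, 400, 700, 800, 900, 300, 800],
})
bins = [0, 500, 1000]  # two hypothetical bins, in seconds

percent_by_borough = {}
for borough, group in toy.groupby("dropoff_borough"):
    counts, _ = np.histogram(group["travel_time"], bins=bins)
    # Normalize within the group, so each borough sums to 100 percent.
    percent_by_borough[borough] = (100 * counts / counts.sum()).tolist()

print(percent_by_borough)
```

Here both boroughs come out as 50% per bin even though the raw counts differ (3 versus 1 per bin), which is exactly why per-group normalization makes the shapes comparable.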

Lastly, another example of a feature cross is to numerically manipulate features. For instance, we might want to know if rides with multiple passengers have different costs. Or simply, what is the cost per passenger?

In [None]:
df.head(5)

In [None]:
df['cost_per_passenger'] = (df['total'] / df['passengers'])
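
One thing to watch with a ratio feature is the denominator. If any ride records zero passengers (worth checking with `df['passengers'].min()`; this is an assumption about the data, not something shown above), the division yields `inf`. A minimal sketch of one way to guard against that, on made-up values:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; the second ride has zero passengers.
toy = pd.DataFrame({"total": [12.95, 9.30, 36.95], "passengers": [1, 0, 3]})

# Replace a zero passenger count with NaN so the division yields NaN, not inf.
passengers = toy["passengers"].replace(0, np.nan)
toy["cost_per_passenger"] = toy["total"] / passengers

print(toy)
```

Most pandas and Seaborn routines skip `NaN` values, so the affected rows simply drop out of summaries and plots. Whether to drop, flag, or impute such rows is a judgment call.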

In [None]:
sns.histplot(data=df, x='cost_per_passenger')

Here we might also create a new feature specifying solo or multi-passenger rides. I'll use a `lambda` function to quickly do this.

In [None]:
df['trip_type'] = df['passengers'].apply(lambda x: 'solo' if x == 1 else 'multi')
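
For a two-way split like this, `np.where` is a common vectorized alternative to `apply` with a `lambda`. A small sketch on made-up passenger counts:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({"passengers": [1, 2, 1, 5]})

# np.where(condition, value_if_true, value_if_false) works on the whole column.
toy["trip_type"] = np.where(toy["passengers"] == 1, "solo", "multi")

print(toy["trip_type"].tolist())  # ['solo', 'multi', 'solo', 'multi']
```

Note that either version labels anything other than exactly one passenger as `'multi'`.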

In [None]:
sns.boxplot(data=df, x='trip_type', y='cost_per_passenger', color='tab:blue')

Here we see that if you travel with friends, the cost per person is lower.
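
A boxplot comparison can be backed up with numbers by grouping on the new feature and summarizing. The sketch below uses made-up costs purely to show the mechanics; the actual values would come from running the same `groupby` on `df`.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "trip_type": ["solo", "solo", "solo", "multi", "multi", "multi"],
    "cost_per_passenger": [10.0, 14.0, 30.0, 5.0, 7.0, 9.0],
})

# One row per trip type, with the median, mean, and number of rides.
summary = toy.groupby("trip_type")["cost_per_passenger"].agg(["median", "mean", "count"])
print(summary)
```

The median is the line drawn inside each box, so it is the most direct number to compare against the plot.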

## One Hot Encoding (OHE)

Finally, I'd like to introduce OHE, which "binarizes" a categorical variable by replacing it with one 0/1 (or `True`/`False`) column per category. This is useful in Machine Learning, and sometimes in data visualization.

To OHE features, you can use the `pd.get_dummies` function. If you don't specify any columns, it will operate on _all_ categorical features (columns with `object`, string, or `category` dtype).

For example, it expands our dataframe from 18 features to 423 columns!

In [None]:
df.head(5)

In [None]:
df.shape

In [None]:
pd.get_dummies(df).head(5)
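
To make the default behavior easier to see than on the full data set, here is a minimal sketch on a made-up three-row frame (the values are illustrative). The numeric column passes through untouched, while each text column is replaced by one indicator column per distinct value.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "distance": [1.6, 0.8, 7.7],
    "color": ["yellow", "green", "yellow"],
    "payment": ["cash", "credit card", "cash"],
})

# With no `columns` argument, every text/categorical column is encoded.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
# ['distance', 'color_green', 'color_yellow', 'payment_cash', 'payment_credit card']
```

The column count grows with the number of distinct values, which is why a column with many unique entries can inflate the frame dramatically.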

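If we specify only the `payment` column, it is replaced by one column per payment type, named `payment_<value>`: `payment_cash` and `payment_credit card` (the value itself contains a space). All other columns are left as they are.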

In [None]:
pd.get_dummies(df, columns=['payment'])
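
Two details are worth knowing when encoding a single column: the `dtype` argument controls whether the indicators are booleans or integers, and by default a missing value produces a row of all zeros rather than its own column. A small sketch on made-up data (the payment values are assumed to mirror those in the data set):

```python
import pandas as pd

# Toy data, made up for illustration; the last payment is missing.
toy = pd.DataFrame({
    "color": ["yellow", "green", "yellow"],
    "payment": ["cash", "credit card", None],
})

# Encode only `payment`, and request 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["payment"], dtype=int)
print(encoded)

# Each row has exactly one 1 among the payment columns, unless it was missing.
payment_cols = [c for c in encoded.columns if c.startswith("payment_")]
print(encoded[payment_cols].sum(axis=1).tolist())  # [1, 1, 0]
```

Passing `dummy_na=True` would instead add a separate indicator column for the missing values.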