# Feature Engineering

We saw how TF-IDF can be used to create features on text data. Let's now look at an example of a special transformation very common in the retail industry: RFM, or recency-frequency-monetary transformation. The goal of this assignment is to create RFM features for the `retail-churn.csv` data. You will see that having time series data opens us up to many types of features (although how useful they will ultimately be is another question).


In [1]:
import pandas as pd
import datetime as dt
import numpy as np
pd.__version__

col_names = ['user_id', 'gender', 'address', 'store_id', 'trans_id', 'timestamp', 'item_id', 'quantity', 'dollar']
churn = pd.read_csv("retail-churn.csv", sep = ",", skiprows = 1, names = col_names)
churn.head()

Unnamed: 0,user_id,gender,address,store_id,trans_id,timestamp,item_id,quantity,dollar
0,101981,F,E,2860,818463,11/1/2000 0:00,4710000000000.0,1,37
1,101981,F,E,2861,818464,11/1/2000 0:00,4710000000000.0,1,17
2,101981,F,E,2862,818465,11/1/2000 0:00,4710000000000.0,1,23
3,101981,F,E,2863,818466,11/1/2000 0:00,4710000000000.0,1,41
4,101981,F,E,2864,818467,11/1/2000 0:00,4710000000000.0,8,288


Run the following steps to feature engineer the data.

1. Convert the `timestamp` column to be of type `datetime`. 

In [3]:
churn['timestamp'] = pd.to_datetime(churn['timestamp'])
churn.dtypes

user_id               int64
gender               object
address              object
store_id              int64
trans_id              int64
timestamp    datetime64[ns]
item_id             float64
quantity              int64
dollar                int64
dtype: object
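
As a small aside on step 1, the sketch below parses timestamps with an explicit format instead of letting `pandas` infer it. It assumes the month/day/year layout seen in the raw file (`11/1/2000 0:00`). The `raw` series is toy data made up for illustration, and `errors="coerce"` is one possible choice for handling bad values, not a requirement of the assignment.

```python
import pandas as pd

# Toy strings in the same layout as the raw `timestamp` column (illustrative only)
raw = pd.Series(["11/1/2000 0:00", "2/28/2001 0:00", "not a date"])

# An explicit format avoids day/month ambiguity; unparseable values become NaT
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M", errors="coerce")

print(parsed)
print("unparseable rows:", parsed.isna().sum())
```

Counting the `NaT` values after parsing is a cheap way to spot malformed timestamps before they reach later steps.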

2. Extract the date from `timestamp` and store it in a new column called `date`.

In [4]:
churn['date'] = churn['timestamp'].dt.date
churn.head()

Unnamed: 0,user_id,gender,address,store_id,trans_id,timestamp,item_id,quantity,dollar,date
0,101981,F,E,2860,818463,2000-11-01,4710000000000.0,1,37,2000-11-01
1,101981,F,E,2861,818464,2000-11-01,4710000000000.0,1,17,2000-11-01
2,101981,F,E,2862,818465,2000-11-01,4710000000000.0,1,23,2000-11-01
3,101981,F,E,2863,818466,2000-11-01,4710000000000.0,1,41,2000-11-01
4,101981,F,E,2864,818467,2000-11-01,4710000000000.0,8,288,2000-11-01
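
Note that `.dt.date` returns plain Python `date` objects, so the new column has `object` dtype. The sketch below compares it with `.dt.normalize()`, which keeps the `datetime64` dtype by setting the time to midnight. This is only an alternative to consider, shown on toy timestamps that are not from the data set.

```python
import pandas as pd

# Two toy timestamps on the same day (illustrative only)
ts = pd.Series(pd.to_datetime(["2000-11-01 09:30", "2000-11-01 17:45"]))

as_date = ts.dt.date         # Python date objects, stored with object dtype
as_norm = ts.dt.normalize()  # midnight timestamps, still a datetime64 dtype

print(as_date.dtype, as_norm.dtype)
```

Either version works as a grouping key; the normalized one does not need to be converted back with `pd.to_datetime` before time-based operations such as `rolling`.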


Notice that the **granularity** of the data is not daily spend, but rather individual transactions. We can see that because the same user has multiple transactions with the same timestamp. Before we run RFM, we need to **aggregate** the data so we have daily granularity.
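
One way to confirm the granularity is to count rows per `user_id` and `date` pair. The sketch below does this on a small made-up frame (`toy` is hypothetical, not the churn data); any count above 1 means the data is finer than daily.

```python
import pandas as pd

# Toy transactions: user 1 has two rows on the same day (illustrative only)
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "date": pd.to_datetime(["2000-11-01", "2000-11-01", "2000-11-02", "2000-11-01"]),
    "dollar": [37, 17, 23, 41],
})

# Number of rows for each (user_id, date) pair
rows_per_user_day = toy.groupby(["user_id", "date"]).size()
print(rows_per_user_day)

# The data has daily granularity only if no (user_id, date) pair repeats
is_daily = not toy.duplicated(["user_id", "date"]).any()
print("daily granularity:", is_daily)
```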

3. Aggregate `quantity` and `dollar` to daily data (so that `user_id` and `date` are unique for each row). Call the aggregated data `churn_agg`. You can ignore all the other columns, as they are not needed. 

In [5]:
churn_agg = churn.groupby(['user_id','date']).agg({'quantity':'sum','dollar':'sum'})
churn_agg = churn_agg.reset_index()
churn_agg['date'] = pd.to_datetime(churn_agg['date'])
churn_agg.dtypes

user_id              int64
date        datetime64[ns]
quantity             int64
dollar               int64
dtype: object

4. Using the aggregated data, obtain recency, frequency and monetary features for both `dollar` and `quantity`. Use a 7-day moving window for frequency and monetary. Call your new features `last_visit_ndays` (recency), `quantity_roll_sum_7D` (frequency) and `dollar_roll_sum_7D` (monetary).
  In `pandas`, recency is a kind of **difference** feature, because it is based on the difference between the current date and a previous date (called a **lag**). We can use the `diff` method to get recency. Frequency and monetary features are called **rolling** features, because each is a kind of cumulative sum computed over a moving window. We can use the `rolling` method to get frequency and monetary, where the `window` and `on` arguments need to be chosen carefully.
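
To make the difference and rolling ideas concrete, here is a minimal sketch on a four-row toy frame. The column name `gap` and the frame `toy` are illustrative assumptions. The sketch sets `date` as the index rather than using the `on` argument, which is one of several equivalent ways to tell `rolling` which column carries the time.

```python
import pandas as pd

# Toy daily data, sorted by user and date (illustrative only)
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "date": pd.to_datetime(["2000-11-12", "2000-11-26", "2000-11-27", "2001-02-04"]),
    "dollar": [420, 558, 624, 734],
}).sort_values(["user_id", "date"]).reset_index(drop=True)

# Difference feature: time since the same user's previous row (NaT for the first)
toy["gap"] = toy.groupby("user_id")["date"].diff()

# Rolling feature: sum of `dollar` over the trailing 7 days, within each user
roll = (
    toy.set_index("date")
       .groupby("user_id")["dollar"]
       .rolling("7D")
       .sum()
)

# `roll` comes back ordered by user and date, matching the sorted rows of `toy`
toy["dollar_roll_sum_7D"] = roll.to_numpy()
print(toy)
```

Only the third row picks up an earlier visit (558 + 624), because it is the only row with another visit by the same user inside the trailing 7-day window.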

In [6]:
# Frequency and monetary: 7-day rolling sums within each user
quantity_roll_sum_7D = churn_agg.groupby('user_id').rolling(window = '7D', on = 'date')['quantity'].sum()
quantity_roll_sum_7D = quantity_roll_sum_7D.reset_index()
dollar_roll_sum_7D = churn_agg.groupby('user_id', as_index = False).rolling(window = '7D', on = 'date')['dollar'].sum()
dollar_roll_sum_7D = dollar_roll_sum_7D.reset_index()
# Recency: time since the same user's previous visit (NaT for a user's first visit)
last_visit_ndays = churn_agg.groupby('user_id')['date'].diff()
last_visit_ndays

0           NaT
1       14 days
2        1 days
3       40 days
4           NaT
          ...  
37053       NaT
37054       NaT
37055       NaT
37056       NaT
37057       NaT
Name: date, Length: 37058, dtype: timedelta64[ns]

5. Combine all three features into a single `DataFrame` and call it `churn_roll`. 

In [9]:
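# Note: this binds a second name to the same DataFrame (no copy is made),
# so the columns added below also appear in churn_agg
churn_roll = churn_agg
churn_roll['last_visit_ndays'] = last_visit_ndays
churn_roll['quantity_roll_sum_7D'] = quantity_roll_sum_7D['quantity']
churn_roll['dollar_roll_sum_7D'] = dollar_roll_sum_7D['dollar']
churn_roll.head()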


Unnamed: 0,user_id,date,quantity,dollar,last_visit_ndays,quantity_roll_sum_7D,dollar_roll_sum_7D
0,1113,2000-11-12,5,420,NaT,5.0,420.0
1,1113,2000-11-26,3,558,14 days,3.0,558.0
2,1113,2000-11-27,6,624,1 days,9.0,1182.0
3,1113,2001-01-06,9,628,40 days,9.0,628.0
4,1250,2001-02-04,5,734,NaT,5.0,734.0


6. Use `fillna` to replace missing values for recency with a large value like 100 days (whatever makes business sense). You can use `pd.Timedelta('100 days')` to set the value.

In [10]:
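# Fill only the recency column, so missing values in other columns are left alone
churn_roll = churn_roll.fillna({'last_visit_ndays': pd.Timedelta('100 days')})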
churn_roll.head()

Unnamed: 0,user_id,date,quantity,dollar,last_visit_ndays,quantity_roll_sum_7D,dollar_roll_sum_7D
0,1113,2000-11-12,5,420,100 days,5.0,420.0
1,1113,2000-11-26,3,558,14 days,3.0,558.0
2,1113,2000-11-27,6,624,1 days,9.0,1182.0
3,1113,2001-01-06,9,628,40 days,9.0,628.0
4,1250,2001-02-04,5,734,100 days,5.0,734.0
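
The recency column is still a `timedelta64` value at this point. Many models expect plain numbers, so a natural follow-up (an assumption here, not part of the assignment) is to convert it to whole days with the `.dt.days` accessor. The sketch uses a toy series with the same values as the first three rows above.

```python
import pandas as pd

# Toy recency values after filling first visits with 100 days (illustrative only)
gaps = pd.Series(pd.to_timedelta(["100 days", "14 days", "1 days"]))

# .dt.days extracts the whole-day component as integers
ndays = gaps.dt.days
print(ndays.tolist())
```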


7. To see if things worked, merge the aggregated data `churn_agg` with the RFM features in `churn_roll`. You can use the `merge` method to do this with the right keys specified.

In [11]:
churn_merge = churn_agg.merge(churn_roll, how = 'right', on = 'user_id')

8. Check the features we created to make sure they appear to show the right calculations. 

In [12]:
churn_merge[0:10]

Unnamed: 0,user_id,date_x,quantity_x,dollar_x,last_visit_ndays_x,quantity_roll_sum_7D_x,dollar_roll_sum_7D_x,date_y,quantity_y,dollar_y,last_visit_ndays_y,quantity_roll_sum_7D_y,dollar_roll_sum_7D_y
0,1113,2000-11-12,5,420,NaT,5.0,420.0,2000-11-12,5,420,100 days,5.0,420.0
1,1113,2000-11-26,3,558,14 days,3.0,558.0,2000-11-12,5,420,100 days,5.0,420.0
2,1113,2000-11-27,6,624,1 days,9.0,1182.0,2000-11-12,5,420,100 days,5.0,420.0
3,1113,2001-01-06,9,628,40 days,9.0,628.0,2000-11-12,5,420,100 days,5.0,420.0
4,1113,2000-11-12,5,420,NaT,5.0,420.0,2000-11-26,3,558,14 days,3.0,558.0
5,1113,2000-11-26,3,558,14 days,3.0,558.0,2000-11-26,3,558,14 days,3.0,558.0
6,1113,2000-11-27,6,624,1 days,9.0,1182.0,2000-11-26,3,558,14 days,3.0,558.0
7,1113,2001-01-06,9,628,40 days,9.0,628.0,2000-11-26,3,558,14 days,3.0,558.0
8,1113,2000-11-12,5,420,NaT,5.0,420.0,2000-11-27,6,624,1 days,9.0,1182.0
9,1113,2000-11-26,3,558,14 days,3.0,558.0,2000-11-27,6,624,1 days,9.0,1182.0
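
In the output above, `date_x` and `date_y` differ on most rows: joining on `user_id` alone pairs every row of a user with every other row of that user. The sketch below contrasts that with a join on both `user_id` and `date`, using small made-up frames (`agg` and `feats` are hypothetical stand-ins for `churn_agg` and `churn_roll`). The `validate` argument is optional and simply raises an error if the keys are not unique on both sides.

```python
import pandas as pd

# Toy daily aggregates: two visits for user 1, one for user 2 (illustrative only)
agg = pd.DataFrame({
    "user_id": [1, 1, 2],
    "date": pd.to_datetime(["2000-11-12", "2000-11-26", "2001-02-04"]),
    "dollar": [420, 558, 734],
})

# Toy feature table with one row per (user_id, date)
feats = agg[["user_id", "date"]].assign(dollar_roll_sum_7D=[420.0, 558.0, 734.0])

# Joining on user_id only: user 1 contributes 2 x 2 rows, user 2 contributes 1
by_user = agg.merge(feats, on="user_id")

# Joining on both keys: exactly one output row per user-day
by_user_day = agg.merge(feats, on=["user_id", "date"], validate="one_to_one")

print(len(by_user), len(by_user_day))
```

With both keys, each row lines up a day's raw totals with that same day's features, which makes the check in the next step easier to read.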


One takeaway from the above example is that feature engineering can be complicated and relies to some extent on creativity and domain knowledge, as we saw with time series data and RFM. For this reason, some modern machine learning libraries are working on what is called **automated feature engineering**, in which algorithms try to find a good set of features for the machine learning model to use.

