# Feature Engineering

We still want to read the `retail-churn.csv` dataset that we examined previously and begin to pre-process it. The goal of the assignment is to become familiar with some common pre-processing and feature engineering steps by implementing them.

In [1]:
import pandas as pd

col_names = ['user_id', 'gender', 'address', 'store_id', 'trans_id', 'timestamp', 'item_id', 'quantity', 'dollar']
churn = pd.read_csv("retail-churn.csv", sep = ",", skiprows = 1, names = col_names)
churn.head()

Unnamed: 0,user_id,gender,address,store_id,trans_id,timestamp,item_id,quantity,dollar
0,101981,F,E,2860,818463,11/1/2000 0:00,4710000000000.0,1,37
1,101981,F,E,2861,818464,11/1/2000 0:00,4710000000000.0,1,17
2,101981,F,E,2862,818465,11/1/2000 0:00,4710000000000.0,1,23
3,101981,F,E,2863,818466,11/1/2000 0:00,4710000000000.0,1,41
4,101981,F,E,2864,818467,11/1/2000 0:00,4710000000000.0,8,288


Some pre-processing steps are straight-forward, while others may require some work. Pre-process the data using the steps outlined below. Create a new data called `churn_processed` which stores only the pre-processed as you run through each of the these steps. You will need to make sure your columns are properly named.

1. Remove `store_id` from the data.

In [2]:
churn_processed = pd.DataFrame(churn)
churn_processed = churn_processed.drop(['store_id'], axis = 1)
churn_processed.head()

Unnamed: 0,user_id,gender,address,trans_id,timestamp,item_id,quantity,dollar
0,101981,F,E,818463,11/1/2000 0:00,4710000000000.0,1,37
1,101981,F,E,818464,11/1/2000 0:00,4710000000000.0,1,17
2,101981,F,E,818465,11/1/2000 0:00,4710000000000.0,1,23
3,101981,F,E,818466,11/1/2000 0:00,4710000000000.0,1,41
4,101981,F,E,818467,11/1/2000 0:00,4710000000000.0,8,288


2. Convert `timestamp` into a `datetime` column and extract two new columns: `dow` is the day of the week and `month` is the month of the year.

In [3]:
churn_processed['timestamp'] = pd.to_datetime(churn_processed['timestamp'])
churn_processed['dow'] = churn_processed['timestamp'].dt.dayofweek
churn_processed['month'] = churn_processed['timestamp'].dt.month
churn_processed.dtypes

user_id               int64
gender               object
address              object
trans_id              int64
timestamp    datetime64[ns]
item_id             float64
quantity              int64
dollar                int64
dow                   int64
month                 int64
dtype: object

3. One-hot encode `address`, `dow` and `month`.

In [4]:
from sklearn.preprocessing import OneHotEncoder

cat_col = ['address','dow','month']
churn_processed[cat_col] = churn_processed[cat_col].astype('category')
churn_cat = churn_processed[cat_col].copy() # only select columns that have type 'category'
onehot = OneHotEncoder(sparse = False) # initialize one-hot-encoder
onehot.fit(churn_cat)
col_names = onehot.get_feature_names(churn_cat.columns) # this allows us to properly name columns
churn_onehot =  pd.DataFrame(onehot.transform(churn_cat), columns = col_names)
churn_processed[churn_onehot.columns] = churn_onehot
churn_processed.head()

Unnamed: 0,user_id,gender,address,trans_id,timestamp,item_id,quantity,dollar,dow,month,...,dow_1,dow_2,dow_3,dow_4,dow_5,dow_6,month_1,month_2,month_11,month_12
0,101981,F,E,818463,2000-11-01,4710000000000.0,1,37,2,11,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,101981,F,E,818464,2000-11-01,4710000000000.0,1,17,2,11,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,101981,F,E,818465,2000-11-01,4710000000000.0,1,23,2,11,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,101981,F,E,818466,2000-11-01,4710000000000.0,1,41,2,11,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,101981,F,E,818467,2000-11-01,4710000000000.0,8,288,2,11,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


4. Rescale `dollar` using min-max normalization. Use `pandas` to do it and call the rescaled column `dollar_std_minmax`. 

In [5]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
minmax_scaler = MinMaxScaler()
churn_dollar = churn_processed['dollar'].to_numpy()
minmax_scaler.fit(churn_dollar.reshape(-1,1))
churn_processed['dollar_std_minmax'] = pd.DataFrame(minmax_scaler.transform(churn_dollar.reshape(-1,1)))
churn_processed.head()

Unnamed: 0,user_id,gender,address,trans_id,timestamp,item_id,quantity,dollar,dow,month,...,dow_2,dow_3,dow_4,dow_5,dow_6,month_1,month_2,month_11,month_12,dollar_std_minmax
0,101981,F,E,818463,2000-11-01,4710000000000.0,1,37,2,11,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.00051
1,101981,F,E,818464,2000-11-01,4710000000000.0,1,17,2,11,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000227
2,101981,F,E,818465,2000-11-01,4710000000000.0,1,23,2,11,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000312
3,101981,F,E,818466,2000-11-01,4710000000000.0,1,41,2,11,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000567
4,101981,F,E,818467,2000-11-01,4710000000000.0,8,288,2,11,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.004066


You can read about **robust normalization** [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html). The word **robust** in statistics generally refers to methods that are not affected by outliers. For example, you can say that the median is a *robust* measure for the "average" of the data, while the mean is not. 

5. Robust normalization is using interquartile range (IQR = 3rd quartile - 1st quartile) while Z-normalization is using mean and std.

In [6]:
znorm_scaler = RobustScaler()
churn_quantity = churn['quantity'].to_numpy()
znorm_scaler.fit(churn_quantity.reshape(-1,1))
churn_processed['qty_std_robust'] = pd.DataFrame(znorm_scaler.transform(churn_quantity.reshape(-1,1)))
churn_processed.head()

Unnamed: 0,user_id,gender,address,trans_id,timestamp,item_id,quantity,dollar,dow,month,...,dow_3,dow_4,dow_5,dow_6,month_1,month_2,month_11,month_12,dollar_std_minmax,qty_std_robust
0,101981,F,E,818463,2000-11-01,4710000000000.0,1,37,2,11,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.00051,0.0
1,101981,F,E,818464,2000-11-01,4710000000000.0,1,17,2,11,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000227,0.0
2,101981,F,E,818465,2000-11-01,4710000000000.0,1,23,2,11,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000312,0.0
3,101981,F,E,818466,2000-11-01,4710000000000.0,1,41,2,11,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000567,0.0
4,101981,F,E,818467,2000-11-01,4710000000000.0,8,288,2,11,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.004066,7.0


6. Rescale `quantity` a second time, using Z-normalization, but normalize `quantity` **per user**, i.e. group by `user_id` so that the mean and standard deviation computed to normalize are computed separately by each `user_id`. Call the rescaled feature `qty_std_Z_byuser`. This kind of normalization is used when data is weighted by some population. For example, because of the difference in money value, standard of living cost is different amongst countries and thus cannot conclude that London's rent is too expensive.

In [7]:
znorm_scaler = StandardScaler()
churn_quantity = churn['quantity'].to_numpy()
znorm_scaler.fit(churn_quantity.reshape(-1,1), churn_processed.groupby('user_id', axis = 1)['quantity'])
churn_processed['qty_std_Z_byuser'] = pd.DataFrame(znorm_scaler.transform(churn_quantity.reshape(-1,1), churn_processed.groupby('user_id', axis = 1)))
churn_processed.head()

Unnamed: 0,user_id,gender,address,trans_id,timestamp,item_id,quantity,dollar,dow,month,...,dow_4,dow_5,dow_6,month_1,month_2,month_11,month_12,dollar_std_minmax,qty_std_robust,qty_std_Z_byuser
0,101981,F,E,818463,2000-11-01,4710000000000.0,1,37,2,11,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.00051,0.0,-0.10408
1,101981,F,E,818464,2000-11-01,4710000000000.0,1,17,2,11,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000227,0.0,-0.10408
2,101981,F,E,818465,2000-11-01,4710000000000.0,1,23,2,11,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000312,0.0,-0.10408
3,101981,F,E,818466,2000-11-01,4710000000000.0,1,41,2,11,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000567,0.0,-0.10408
4,101981,F,E,818467,2000-11-01,4710000000000.0,8,288,2,11,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.004066,7.0,1.784889


7. Convert `item_id` into a `category` column, then replace the `item_id` of all the items sold only once in the entire data with `999999`.

In [8]:
churn_processed['item_id'] = churn_processed['item_id'].astype('category')
churn_processed['item_id'] = churn_processed['item_id'].cat.add_categories(['999999'])
churn_processed['item_id']
churn_processed['freq']= churn_processed.groupby('quantity')['quantity'].transform('count')
churn_processed.loc[churn_processed.loc[:,'freq'] == 1, 'item_id']='999999' 
churn_processed = churn_processed.drop(['freq'], axis =1)
churn_processed.head()

Unnamed: 0,user_id,gender,address,trans_id,timestamp,item_id,quantity,dollar,dow,month,...,dow_4,dow_5,dow_6,month_1,month_2,month_11,month_12,dollar_std_minmax,qty_std_robust,qty_std_Z_byuser
0,101981,F,E,818463,2000-11-01,4710000000000.0,1,37,2,11,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.00051,0.0,-0.10408
1,101981,F,E,818464,2000-11-01,4710000000000.0,1,17,2,11,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000227,0.0,-0.10408
2,101981,F,E,818465,2000-11-01,4710000000000.0,1,23,2,11,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000312,0.0,-0.10408
3,101981,F,E,818466,2000-11-01,4710000000000.0,1,41,2,11,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000567,0.0,-0.10408
4,101981,F,E,818467,2000-11-01,4710000000000.0,8,288,2,11,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.004066,7.0,1.784889
