# (3) Manual Feature Scaling, Selection, and Encoding

**Feature scaling** and **feature_selection** are yet two components of a classical data science pipeline that isn't fully adressed in neither `featuretools` and `h2o`. `featuretools` has no mention of these, and `h2o` includes them in some models as hyperparameters to learn and doesn't include them in others. Whereas **data encoding** for categorical variables are handled intrinsically in `h2o` as we shall discuss in the next notebook,  it does so by applying different encodings for different algorithms. For example, XGBoost models perform an internal *one-hot encoding* and Gradient Boosting Machine (GBM) models perform *enum encoding*. For this reason, we will both demonstrate the case where we handle encoding ourselves and the case where we completely leave it to `h2o`.

In [1]:
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.feature_selection import SelectKBest, chi2, SelectFromModel, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from utils import categorical_to_onehot_columns

SEED = 42
pd.options.mode.chained_assignment = None  # suppress SettingWithCopyWarning() for chained assignments

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style("white", {'ytick.major.size': 8.0})
sns.set_context("poster", font_scale=0.8)

## Reading Data

In [2]:
X_train = pd.read_csv('(2)data_automated_ops/train_users.csv')
X_train, Y_train = X_train.drop('country_destination', axis=1), X_train['country_destination']
X_train.head()

Unnamed: 0,gender,age,signup_method,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,...,buckets.CUM_SUM(CA),buckets.CUM_SUM(DE),buckets.CUM_SUM(FR),buckets.CUM_SUM(GB),buckets.CUM_SUM(AU),buckets.CUM_SUM(NL),buckets.CUM_SUM(US),buckets.CUM_SUM(IT),buckets.CUM_SUM(PT),buckets.CUM_SUM(ES)
0,FEMALE,31.0,basic,en,direct,direct,omg,Web,Mac Desktop,Safari,...,12114.0,23369.0,21961.0,21458.0,8783.0,5457.0,119834.0,16818.0,3092.0,13463.0
1,FEMALE,43.857143,basic,en,direct,direct,untracked,Web,iPad,Mobile Safari,...,16926.0,33405.0,29925.0,29738.0,12100.0,7523.0,161807.0,25250.0,4714.0,21218.0
2,FEMALE,45.857143,basic,en,direct,direct,untracked,Web,Windows Desktop,Firefox,...,19322.0,39395.0,34357.0,34188.0,13713.0,8742.0,182625.0,30158.0,5525.0,25122.0
3,MALE,40.0,basic,en,direct,direct,untracked,Web,Mac Desktop,Safari,...,18105.0,36038.0,32137.0,31839.0,12920.0,8105.0,171966.0,27678.0,5130.0,23270.0
4,FEMALE,33.714286,basic,en,direct,direct,untracked,Moweb,Windows Desktop,Chrome,...,12114.0,23369.0,21961.0,21458.0,8783.0,5457.0,119834.0,16818.0,3092.0,13463.0


In [3]:
X_test = pd.read_csv('(2)data_automated_ops/test_users.csv')
X_test.head()

Unnamed: 0,gender,age,signup_method,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,...,buckets.CUM_SUM(CA),buckets.CUM_SUM(DE),buckets.CUM_SUM(FR),buckets.CUM_SUM(GB),buckets.CUM_SUM(AU),buckets.CUM_SUM(NL),buckets.CUM_SUM(US),buckets.CUM_SUM(IT),buckets.CUM_SUM(PT),buckets.CUM_SUM(ES)
0,MALE,34.428571,basic,zh,seo,google,linked,Web,Mac Desktop,Chrome,...,13376.0,25996.0,23996.0,23648.0,9664.0,5960.0,130818.0,18632.0,3463.0,15211.0
1,FEMALE,23.0,facebook,en,seo,facebook,linked,Web,iPhone,Mobile Safari,...,7178.0,13444.0,13753.0,12950.0,5352.0,3410.0,74974.0,10143.0,1837.0,8015.0
2,MALE,19.0,basic,en,seo,google,untracked,Web,Mac Desktop,Safari,...,6019.0,11283.0,11806.0,10993.0,4565.0,2906.0,63880.0,8629.0,1562.0,6909.0
3,FEMALE,46.714286,basic,en,direct,direct,untracked,iOS,iPhone,Mobile Safari,...,19322.0,39395.0,34357.0,34188.0,13713.0,8742.0,182625.0,30158.0,5525.0,25122.0
4,MALE,41.0,facebook,en,seo,facebook,linked,Web,Android Phone,Silk,...,18105.0,36038.0,32137.0,31839.0,12920.0,8105.0,171966.0,27678.0,5130.0,23270.0


## Feature Scaling
**Feature scaling** is notoriously effective in linear models and neural networks. As it is a rather conventional aspect of the data processing pipeline, we decided to include it here no matter what. An importing thing to note is that, we have to apply scaling (sometimes also referred to as **normalization**) only to *originally numerical* columns at first. Later in the notebook when we encode categorical variables, we will apply another scaling there as well. Let's first observe the different numeric and categorical variables we have.

In [4]:
numeric_variables = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_variables = [x for x in X_train.columns.tolist() if x not in numeric_variables]
print("NUMERIC VARIABLES: ", numeric_variables)
print("CATEGORICAL VARIABLES: ", categorical_variables)

NUMERIC VARIABLES:  ['age', 'signup_flow', 'LAST(sessions.secs_elapsed)', 'NUM_UNIQUE(sessions.action)', 'NUM_UNIQUE(sessions.action_type)', 'NUM_UNIQUE(sessions.action_detail)', 'NUM_UNIQUE(sessions.device_type)', 'SKEW(sessions.secs_elapsed)', 'MIN(sessions.secs_elapsed)', 'MEAN(sessions.secs_elapsed)', 'STD(sessions.secs_elapsed)', 'MAX(sessions.secs_elapsed)', 'MEDIAN(sessions.secs_elapsed)', 'HOUR(date_account_created)', 'HOUR(timestamp_first_active)', 'DAY(date_account_created)', 'DAY(timestamp_first_active)', 'WEEK(date_account_created)', 'WEEK(timestamp_first_active)', 'MONTH(date_account_created)', 'MONTH(timestamp_first_active)', 'YEAR(date_account_created)', 'YEAR(timestamp_first_active)', 'CUM_SUM(age)', 'buckets.CA', 'buckets.DE', 'buckets.FR', 'buckets.GB', 'buckets.AU', 'buckets.NL', 'buckets.US', 'buckets.IT', 'buckets.PT', 'buckets.ES', 'LAST(sessions.CUM_SUM(secs_elapsed))', 'SKEW(sessions.CUM_SUM(secs_elapsed))', 'MIN(sessions.CUM_SUM(secs_elapsed))', 'MEAN(sessions.

### Scaler Choice
You can check the most common feature scaling methods from [here](https://en.wikipedia.org/wiki/Feature_scaling). The most effective algorithms are **Min-Max Scaling**, **Mean Normalization**, **Gaussian (Standard) Scaling**, **Unit-Lenth Scaling**, **Robust Scaling**, **Logarithmic Scaling**, and **Exponential Scaling** depending on the application. Here, we went with a fairly safe method: Gaussian (Standard) scaling based on mean and standard deviation of the variable sample. 

In [5]:
scaler = StandardScaler()
scaler = scaler.fit(X_train[numeric_variables])
X_train[numeric_variables] = scaler.transform(X_train[numeric_variables])
X_test[numeric_variables] = scaler.transform(X_test[numeric_variables])
# Observe scaling effect
X_train.head()

Unnamed: 0,gender,age,signup_method,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,...,buckets.CUM_SUM(CA),buckets.CUM_SUM(DE),buckets.CUM_SUM(FR),buckets.CUM_SUM(GB),buckets.CUM_SUM(AU),buckets.CUM_SUM(NL),buckets.CUM_SUM(US),buckets.CUM_SUM(IT),buckets.CUM_SUM(PT),buckets.CUM_SUM(ES)
0,FEMALE,-0.644729,basic,en,direct,direct,omg,Web,Mac Desktop,Safari,...,-0.632065,-0.622391,-0.619435,-0.62944,-0.637644,-0.620349,-0.623034,-0.664049,-0.687217,-0.710146
1,FEMALE,0.504372,basic,en,direct,direct,untracked,Web,iPad,Mobile Safari,...,0.296425,0.237997,0.273689,0.285184,0.326004,0.25072,0.295706,0.283107,0.334959,0.342713
2,FEMALE,0.683121,basic,en,direct,direct,untracked,Web,Windows Desktop,Firefox,...,0.75874,0.751521,0.770716,0.77674,0.794609,0.764677,0.751388,0.834417,0.846047,0.87274
3,MALE,0.159642,basic,en,direct,direct,untracked,Web,Mac Desktop,Safari,...,0.523916,0.463725,0.521754,0.517265,0.564228,0.496104,0.518075,0.555842,0.59712,0.621303
4,FEMALE,-0.402141,basic,en,direct,direct,untracked,Moweb,Windows Desktop,Chrome,...,-0.632065,-0.622391,-0.619435,-0.62944,-0.637644,-0.620349,-0.623034,-0.664049,-0.687217,-0.710146


## Feature Selection
Not only that feature selection yields smaller training & test sets and decreases training time vastly, it also prevents models from overfitting and allows them to generalize better. Moreover, as different models react to high numbers of feature spaces differently, we decided that feature selection might allow us to compare these models in a more fair way.

### Setting Back Unknown Categorical Levels & Setting 0's for Numeric NaNs
In the previous notebook, we have converted the newly introduced missing values (from automated feature engineering) in all columns to NaNs for uniformity in representation. However, NaNs for numeric variables tell us that the corresponding users didn't have any related session information. For the majority of the numeric features we have generated, plugging in 0s as missing values seems to be logical here. On the other hand, NaNs for categorical variables tell us that corresponding users have untracked information or they used a tool/utility/method that is not recognized. Defining in a new categorical level of 'UNKNOWN' seems to be logical here. We also need these operations to eliminate missing values before applying feature selection algorithms.

In [6]:
X_train[categorical_variables] = X_train[categorical_variables].fillna('UNKNOWN')
X_test[categorical_variables] = X_test[categorical_variables].fillna('UNKNOWN')
X_train[numeric_variables] = X_train[numeric_variables].fillna(0.0)
X_test[numeric_variables] = X_test[numeric_variables].fillna(0.0)

### One-Hot Encoding Categorical Variables
Before we proceed any further, we will have to *temporarily* one-hot encode our categorical variables so that they fit in with our feature selection methods. (NOTE: Particulary, we are referring to the estimation of *chi-squared test statistic* and the utilization of *logistic regression* models.)

In [7]:
# Transform training set to one-hot encoded representation & get encoded columns
X_train_onehot_encoded = categorical_to_onehot_columns(df=X_train)
encoded_columns = X_train_onehot_encoded.columns.values.tolist()
# Transform test set to one-hot encoded representation
X_test_onehot_encoded = categorical_to_onehot_columns(df=X_test)
# Add categorical levels that exist in training set to test set
for fitted_column in encoded_columns:
    if fitted_column not in X_test_onehot_encoded.columns.values.tolist():
        X_test_onehot_encoded[fitted_column] = 0
# Drop categorical levels that don't exist in training set from test set
for column in X_test_onehot_encoded.columns.values.tolist():
    if column not in encoded_columns:
        X_test_onehot_encoded.drop(column, axis=1, inplace=True)
# Ensure that training and test sets have the same column-wise order
X_test_onehot_encoded = X_test_onehot_encoded[encoded_columns]
assert len(X_train_onehot_encoded.columns) == len(X_test_onehot_encoded.columns)

### Feature Selection Algorithms
Common feature selection algorithms include:
* **Pearson Correlation Coefficient**: Perphaps the most common and easiest way to perform feature selection. However, we will have to try different methods as our response variables are categorical rather than numerical. 
* **Chi-Squared Test Statistic**: The chi-square test is a statistical test of independence to determine the dependency of two variables. If the target variable is independent of the feature variable, we can discard that feature variable. If they are dependent, the feature variable is very important. It's a pretty conventional method.
* **Recursive Feature Elimination (RFE)**: Feature ranking with recursive feature elimination.
* **Variable Importances from a Baseline Model**: Applies common algorithms such as Random Forests, Logistic Regression, and XGBoost.

For the problem at hand, we have focused on the last three algorithms.

#### Chi-Squared Test Statistic
The chi-squared test does not apply to negative values,  because it assumes a distribution of frequencies. Hence, we will first check if any negative values exist in the data, and normalize values to [0, 1] if so.

In [8]:
# Normalize values to [0, 1] interval, column-wise
if len(X_train_onehot_encoded[X_train_onehot_encoded < 0]) > 0:
    min_max_scaler = MinMaxScaler()
    normalized_values = min_max_scaler.fit_transform(X_train_onehot_encoded.T)
    X_train_onehot_encoded_normalized = pd.DataFrame(normalized_values.T,
                                                     columns=X_train_onehot_encoded.columns,
                                                     index=X_train_onehot_encoded.index)
else:
    X_train_onehot_encoded_normalized = X_train_onehot_encoded
    
chi_model = SelectKBest(chi2, k=200)
chi_model.fit(X_train_onehot_encoded_normalized, Y_train)
chi_support = chi_model.get_support()
chi_selected_columns = X_train_onehot_encoded.loc[:, chi_support].columns.tolist()

#### Recursive Feature Elimination (RFE) by Logistic Regression (L2)

In [9]:
rfe_model = RFE(estimator=LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=150, 
                                             penalty='l2', random_state=SEED, n_jobs=-1), 
                n_features_to_select=200, step=100, verbose=5)
rfe_model.fit(X_train_onehot_encoded, Y_train)
rfe_support = rfe_model.get_support()
rfe_selected_columns = X_train_onehot_encoded.loc[:, rfe_support].columns.tolist()

Fitting estimator with 1031 features.
Fitting estimator with 931 features.
Fitting estimator with 831 features.
Fitting estimator with 731 features.
Fitting estimator with 631 features.
Fitting estimator with 531 features.
Fitting estimator with 431 features.
Fitting estimator with 331 features.
Fitting estimator with 231 features.


#### Variable Importances by Logistic Regression (L2)

In [10]:
lr_model = SelectFromModel(LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=150,
                                              penalty='l2', random_state=SEED, n_jobs=-1), 
                           threshold='1.25*median')
lr_model.fit(X_train_onehot_encoded, Y_train)
lr_support = lr_model.get_support()
lr_selected_columns = X_train_onehot_encoded.loc[:, lr_support].columns.tolist()

### Combining All Methods
By observing the *union* and *intersection* of the different features we have selected with the given methods above, we have decided that limiting the training & test sets to the *intersection* features seems like best option. This was partly because the *union* features were still large in number. It should be noted that the feature selection process adopted here is not optimized in any way. We could have used different *models*, *parameters*, and *numbers of independent features to select* in each method.

In [11]:
# Get the union of all features
all_selected_columns = list(set(chi_selected_columns + rfe_selected_columns + lr_selected_columns))
print("TOTAL NUMBER OF FEATURES SELECTED: ", len(all_selected_columns))
print("ALL SELECTED FEATURES: ", all_selected_columns)

TOTAL NUMBER OF FEATURES SELECTED:  487
ALL SELECTED FEATURES:  ['buckets.LAST(users.gender)=UNKNOWN', 'buckets.LAST(users.first_device_type)=Other/Unknown', 'MODE(sessions.action)=update', 'buckets.MODE(users.first_browser)=IE', 'buckets.MODE(users.first_affiliate_tracked)=UNKNOWN', 'LAST(sessions.action)=authenticate', 'buckets.MAX(sessions.secs_elapsed)', 'buckets.LAST(users.affiliate_provider)=facebook', 'LAST(sessions.action)=faq_category', 'LAST(sessions.action_detail)=oauth_login', 'buckets.NUM_UNIQUE(users.affiliate_channel)', 'WEEK(timestamp_first_active)', 'LAST(sessions.action)=edit_verification', 'language=en', 'buckets.LAST(sessions.action_type)=submit', 'LAST(sessions.action)=header_userpic', 'age_gender_bucket=65-69female', 'first_device_type=Desktop (Other)', 'MAX(sessions.secs_elapsed)', 'MODE(sessions.action_type)=data', 'buckets.MODE(sessions.device_type)=Windows Desktop', 'MODE(sessions.action)=faq_category', 'age_gender_bucket=30-34other', 'LAST(sessions.action_det

In [12]:
# Get the intersection of all features
common_selected_columns = list(set(chi_selected_columns).intersection(set(rfe_selected_columns), set(lr_selected_columns)))
print("NUMBER OF COMMON FEATURES SELECTED: ", len(common_selected_columns))
print("COMMON SELECTED FEATURES: ", common_selected_columns)

NUMBER OF COMMON FEATURES SELECTED:  83
COMMON SELECTED FEATURES:  ['MODE(sessions.action)=update', 'LAST(sessions.CUM_SUM(secs_elapsed))', 'MODE(sessions.action)=show', 'buckets.LAST(sessions.action_type)=view', 'buckets.LAST(users.affiliate_provider)=facebook', 'affiliate_provider=google', 'first_device_type=Windows Desktop', 'MAX(sessions.secs_elapsed)', 'MODE(sessions.action_type)=data', 'NUM_UNIQUE(sessions.action_type)', 'MODE(sessions.device_type)=iPhone', 'buckets.MODE(sessions.device_type)=Windows Desktop', 'first_device_type=iPhone', 'signup_method=basic', 'first_device_type=Mac Desktop', 'buckets.MIN(users.age)', 'buckets.MODE(users.first_affiliate_tracked)=omg', 'LAST(sessions.action_detail)=view_search_results', 'LAST(sessions.action_type)=data', 'buckets.MEDIAN(users.age)', 'MODE(sessions.device_type)=iPad Tablet', 'LAST(sessions.action_detail)=UNKNOWN', 'buckets.MODE(users.language)=en', 'age_gender_bucket=20-24female', 'first_browser=Firefox', 'affiliate_channel=sem-bra

The columns mentioned above are selected from the one-hot encoded representation of our training & test sets. We can convert them to base features corresponding to the original columns in our data when we read it in the beginning of this notebook. 

In [13]:
all_base_selected_columns = list(set([column.split('=')[0] for column in all_selected_columns]))
common_base_selected_columns = list(set([column.split('=')[0] for column in common_selected_columns]))
print("NUMBER OF BASE COLUMNS SELECTED: ", len(common_base_selected_columns))
print("BASE COLUMNS SELECTED: ", common_base_selected_columns)

NUMBER OF BASE COLUMNS SELECTED:  53
BASE COLUMNS SELECTED:  ['buckets.LAST(sessions.action_type)', 'age_gender_bucket', 'LAST(sessions.CUM_SUM(secs_elapsed))', 'buckets.MEDIAN(users.age)', 'buckets.MODE(sessions.action_detail)', 'signup_app', 'MODE(sessions.action_type)', 'buckets.LAST(sessions.action_detail)', 'buckets.LAST(users.signup_app)', 'buckets.LAST(users.affiliate_provider)', 'NUM_UNIQUE(sessions.action_detail)', 'buckets.MODE(users.signup_method)', 'MONTH(timestamp_first_active)', 'buckets.MODE(users.first_affiliate_tracked)', 'LAST(sessions.action_detail)', 'buckets.LAST(sessions.device_type)', 'first_device_type', 'MODE(sessions.action)', 'buckets.MODE(users.signup_app)', 'MAX(sessions.secs_elapsed)', 'buckets.SKEW(sessions.secs_elapsed)', 'buckets.NUM_UNIQUE(sessions.action_type)', 'NUM_UNIQUE(sessions.action_type)', 'signup_method', 'first_affiliate_tracked', 'affiliate_provider', 'buckets.MAX(users.age)', 'buckets.LAST(users.first_affiliate_tracked)', 'first_browser', 

Now, let's create the trimmed versions of our data frames.

In [14]:
X_train_trimmed = X_train[common_base_selected_columns]
X_test_trimmed = X_test[common_base_selected_columns]
assert X_train_trimmed.shape[1] == X_test_trimmed.shape[1]
X_train_trimmed.head()

Unnamed: 0,buckets.LAST(sessions.action_type),age_gender_bucket,LAST(sessions.CUM_SUM(secs_elapsed)),buckets.MEDIAN(users.age),buckets.MODE(sessions.action_detail),signup_app,MODE(sessions.action_type),buckets.LAST(sessions.action_detail),buckets.LAST(users.signup_app),buckets.LAST(users.affiliate_provider),...,LAST(sessions.action),age,YEAR(date_account_created),NUM_UNIQUE(sessions.action),buckets.LAST(users.first_browser),LAST(sessions.action_type),buckets.MODE(users.language),buckets.MIN(users.age),SKEW(sessions.secs_elapsed),buckets.LAST(users.affiliate_channel)
0,view,30-34female,0.577406,-0.55661,view_search_results,Web,view,p3,Web,google,...,personalize,-0.644729,1.039225,0.057369,Chrome,data,en,-0.532341,0.501181,seo
1,view,40-44female,0.0,0.36276,view_search_results,Web,UNKNOWN,user_profile,Web,direct,...,UNKNOWN,0.504372,-0.027888,0.0,Chrome,UNKNOWN,en,0.366522,0.0,direct
2,view,45-49female,0.0,0.699431,view_search_results,Web,UNKNOWN,message_thread,Web,direct,...,UNKNOWN,0.683121,-1.095002,0.0,Mobile Safari,UNKNOWN,en,0.815954,0.0,direct
3,UNKNOWN,40-44male,0.0,0.349811,view_search_results,Web,UNKNOWN,UNKNOWN,Web,google,...,UNKNOWN,0.159642,-0.027888,0.0,Firefox,UNKNOWN,en,0.366522,0.0,sem-brand
4,view,30-34female,0.0,-0.55661,view_search_results,Moweb,UNKNOWN,p3,Web,google,...,UNKNOWN,-0.402141,-0.027888,0.0,Chrome,UNKNOWN,en,-0.532341,0.0,seo


## Save Progress with Raw & Trimmed Data

In [15]:
# Add back the response variable to sets & save progress
X_train_trimmed.loc[:, 'country_destination'] = Y_train.loc[:]
X_train_trimmed.to_csv('(3)data_trimmed/raw/train_users.csv', index=None)
X_test_trimmed.to_csv('(3)data_trimmed/raw/test_users.csv', index=None)
# Drop response variable again, as we will continue processing features
X_train_trimmed.drop('country_destination', axis=1, inplace=True)

## Encoding Categorical Variables
Although we had chosen **one-hot encoding** as our strategy previously, those encodings were solely used for the purposes of feature selection and abondoned afterwards. Besides, there exists other and more efficient strategies in the literature that can be explored:

* **Labeled Encoding**: Interprets categories as ordered integers. The extracted ordinality is almost always wrong, hence this is not really a preferred method.
* **Frequency Encoding**: Encodes categorical levels of each feature to values between 0.0 and 1.0 based on their relative frequincy. This method especially works when there is a high number of categorical levels that are somewhat imbalanced in distribution.
* **Target Mean Encoding**: Encodes categorical levels of each feature to the mean of the response. This method works best with binary classification, but it often yields *data leakage*.

It should be noted that one-hot encoding variables increases the feature space vastly, and this may decrease the potential performance of the *tree-based models*. This is why we don't want to use this type of encoding to get our final data form. Instead, we have chosen to go with labeled encoding as an experiment. As previously discussed, `h2o`'s automated modelling method applies different kind of encodings, but labeled encoding was one strategy that wasn't automatically applied.

In [16]:
remaining_categorical_vars = []
for column in X_train_trimmed.columns.values.tolist():
    if not is_numeric_dtype(X_train_trimmed[column]):
        print("Currently encoding column: ", column)
        remaining_categorical_vars.append(column)
        
        encoder = LabelEncoder()
        encoder.fit(X_train_trimmed[column])
        available_levels = list(encoder.classes_)
        for test_level in set(X_test_trimmed[column].values.tolist()):
            if test_level not in available_levels:
                X_test_trimmed.loc[X_test_trimmed[column] == test_level, column] = X_train_trimmed[column].mode()[0]
        
        X_train_trimmed.loc[:, column] = encoder.transform(X_train_trimmed[column])
        X_test_trimmed.loc[:, column] = encoder.transform(X_test_trimmed[column])

Currently encoding column:  buckets.LAST(sessions.action_type)
Currently encoding column:  age_gender_bucket
Currently encoding column:  buckets.MODE(sessions.action_detail)
Currently encoding column:  signup_app
Currently encoding column:  MODE(sessions.action_type)
Currently encoding column:  buckets.LAST(sessions.action_detail)
Currently encoding column:  buckets.LAST(users.signup_app)
Currently encoding column:  buckets.LAST(users.affiliate_provider)
Currently encoding column:  buckets.MODE(users.signup_method)
Currently encoding column:  buckets.MODE(users.first_affiliate_tracked)
Currently encoding column:  LAST(sessions.action_detail)
Currently encoding column:  buckets.LAST(sessions.device_type)
Currently encoding column:  first_device_type
Currently encoding column:  MODE(sessions.action)
Currently encoding column:  buckets.MODE(users.signup_app)
Currently encoding column:  signup_method
Currently encoding column:  first_affiliate_tracked
Currently encoding column:  affiliate_

In [17]:
X_train_trimmed.head()

Unnamed: 0,buckets.LAST(sessions.action_type),age_gender_bucket,LAST(sessions.CUM_SUM(secs_elapsed)),buckets.MEDIAN(users.age),buckets.MODE(sessions.action_detail),signup_app,MODE(sessions.action_type),buckets.LAST(sessions.action_detail),buckets.LAST(users.signup_app),buckets.LAST(users.affiliate_provider),...,LAST(sessions.action),age,YEAR(date_account_created),NUM_UNIQUE(sessions.action),buckets.LAST(users.first_browser),LAST(sessions.action_type),buckets.MODE(users.language),buckets.MIN(users.age),SKEW(sessions.secs_elapsed),buckets.LAST(users.affiliate_channel)
0,6,11,0.577406,-0.55661,3,2,7,7,2,4,...,143,-0.644729,1.039225,0.057369,1,3,1,-0.532341,0.501181,6
1,6,17,0.0,0.36276,3,2,0,14,2,2,...,4,0.504372,-0.027888,0.0,1,0,1,0.366522,0.0,2
2,6,20,0.0,0.699431,3,2,0,6,2,2,...,4,0.683121,-1.095002,0.0,4,0,1,0.815954,0.0,2
3,0,18,0.0,0.349811,3,2,0,0,2,4,...,4,0.159642,-0.027888,0.0,2,0,1,0.366522,0.0,4
4,6,11,0.0,-0.55661,3,1,0,7,2,4,...,4,-0.402141,-0.027888,0.0,1,0,1,-0.532341,0.0,6


## Feature Scaling for Remaining Categorical Variables

After this step, both our training & test sets are fully scaled and ready for being passed on to predictive models.

In [18]:
scaler = StandardScaler()
scaler = scaler.fit(X_train_trimmed[remaining_categorical_vars])
X_train_trimmed.loc[:, remaining_categorical_vars] = scaler.transform(X_train_trimmed.loc[:, remaining_categorical_vars])
X_test_trimmed.loc[:, remaining_categorical_vars] = scaler.transform(X_test_trimmed.loc[:, remaining_categorical_vars])

X_train_trimmed.head()

Unnamed: 0,buckets.LAST(sessions.action_type),age_gender_bucket,LAST(sessions.CUM_SUM(secs_elapsed)),buckets.MEDIAN(users.age),buckets.MODE(sessions.action_detail),signup_app,MODE(sessions.action_type),buckets.LAST(sessions.action_detail),buckets.LAST(users.signup_app),buckets.LAST(users.affiliate_provider),...,LAST(sessions.action),age,YEAR(date_account_created),NUM_UNIQUE(sessions.action),buckets.LAST(users.first_browser),LAST(sessions.action_type),buckets.MODE(users.language),buckets.MIN(users.age),SKEW(sessions.secs_elapsed),buckets.LAST(users.affiliate_channel)
0,0.673844,-0.611161,0.577406,-0.55661,0.066229,-0.018717,2.199263,-0.569216,-0.380246,1.331396,...,1.404108,-0.644729,1.039225,0.057369,-0.803753,0.743575,0.03625,-0.532341,0.501181,2.129106
1,0.673844,0.406107,0.0,0.36276,0.066229,-0.018717,-0.575552,0.773677,-0.380246,-0.697524,...,-0.599231,0.504372,-0.027888,0.0,-0.803753,-0.540773,0.03625,0.366522,0.0,-0.646366
2,0.673844,0.914741,0.0,0.699431,0.066229,-0.018717,-0.575552,-0.761058,-0.380246,-0.697524,...,-0.599231,0.683121,-1.095002,0.0,1.100618,-0.540773,0.03625,0.815954,0.0,-0.646366
3,-2.192308,0.575652,0.0,0.349811,0.066229,-0.018717,-0.575552,-1.912109,-0.380246,1.331396,...,-0.599231,0.159642,-0.027888,0.0,-0.168962,-0.540773,0.03625,0.366522,0.0,0.74137
4,0.673844,-0.611161,0.0,-0.55661,0.066229,-2.145454,-0.575552,-0.569216,-0.380246,1.331396,...,-0.599231,-0.402141,-0.027888,0.0,-0.803753,-0.540773,0.03625,-0.532341,0.0,2.129106


## Save Progress with Label Encoded & Trimmed Data

In [19]:
# Add back the response variable to sets & save progress
X_train_trimmed.loc[:, 'country_destination'] = Y_train.loc[:]
X_train_trimmed.to_csv('(3)data_trimmed/label_encoded/train_users.csv', index=None)
X_test_trimmed.to_csv('(3)data_trimmed/label_encoded/test_users.csv', index=None)