### Feature Engineering

* Feature engineering is sometimes also called data preprocessing.
* In this exercise you'll apply more advanced encodings to the categorical variables to improve your classifier model.
* The encodings you will implement are:
    * Count Encoding
    * Target Encoding
    * Leave-one-out Encoding
    * CatBoost Encoding
    * Feature embedding with SVD

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
import lightgbm as lgb

In [2]:
clicks = pd.read_parquet("Datasets/Feature Enginering/baseline_data.pqt")

In [3]:
clicks

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second
0,27226,3,1,13,120,2017-11-06 15:13:23,,0,6,15,13,23
1,110007,35,1,13,10,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7
2,1047,6,1,13,157,2017-11-06 15:42:32,,0,6,15,42,32
3,76270,3,1,13,120,2017-11-06 15:56:17,,0,6,15,56,17
4,57862,3,1,13,120,2017-11-06 15:57:01,,0,6,15,57,1
...,...,...,...,...,...,...,...,...,...,...,...,...
2300556,9791,2,1,19,166,2017-11-09 15:59:59,,0,9,15,59,59
2300557,6240,14,1,13,146,2017-11-09 15:59:59,,0,9,15,59,59
2300558,15098,12,2,17,50,2017-11-09 16:00:00,,0,9,16,0,0
2300559,10538,12,1,15,41,2017-11-09 16:00:00,,0,9,16,0,0


In [None]:
clicks.head()

In [None]:
clicks.tail()

In [None]:
clicks.describe()

In [None]:
import pandas_profiling as pp
pp.ProfileReport(clicks)

In [4]:
def get_data_splits(dataframe,valid_fraction=0.1):
    
    """ Splits a dataframe into train, validation, and test sets. First, orders by 
        the column 'click_time'. Set the size of the validation and test sets with
        the valid_fraction keyword argument.
    """
    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe)*valid_fraction)
    train = dataframe[:-valid_rows*2]
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train,valid,test

In [5]:
def train_model(train, valid, test=None, feature_cols=None):
    """Train a LightGBM classifier and report the validation (and optional test) AUC."""
    # Default to every column except the timestamps and the label
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time', 'is_attributed'])

    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])

    param = {'num_leaves': 64, 'objective': 'binary',
             'metric': 'auc', 'seed': 7}

    num_round = 1000
    print("Training model!")

    # Note: `early_stopping_rounds` and `verbose_eval` are accepted by older LightGBM
    # releases; LightGBM 4.0+ expects callbacks such as `lgb.early_stopping(20)` instead.
    bst = lgb.train(param,
                    dtrain,
                    num_round,
                    valid_sets=[dvalid],
                    early_stopping_rounds=20,
                    verbose_eval=False)
    valid_pred = bst.predict(valid[feature_cols])
    # Score of model
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    print(f"Validation AUC score: {valid_score}")

    if test is not None:
        test_pred = bst.predict(test[feature_cols])
        test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
        return bst, valid_score, test_score
    else:
        return bst, valid_score
        

In [6]:
print("Baseline model")
train,valid,test =get_data_splits(clicks)
_ = train_model(train,valid)

Baseline model
Training model!
Validation AUC score: 0.9622743228943659


In [7]:
import category_encoders as ce

In [8]:
clicks.columns

Index(['ip', 'app', 'device', 'os', 'channel', 'click_time', 'attributed_time',
       'is_attributed', 'day', 'hour', 'minute', 'second'],
      dtype='object')

In [9]:
cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)
# Create the count encoder
count_enc = ce.CountEncoder(cols=cat_features)

# Learn encoding from the training set
count_enc.fit(train[cat_features])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_count` as a suffix to the new columns
train_encoded = train.join(count_enc.transform(train[cat_features]).add_suffix('_count'))
valid_encoded = valid.join(count_enc.transform(valid[cat_features]).add_suffix('_count'))

In [10]:
train_encoded.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second,ip_count,app_count,device_count,os_count,channel_count
0,27226,3,1,13,120,2017-11-06 15:13:23,,0,6,15,13,23,68,292254,1648091,370652,26760
1,110007,35,1,13,10,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7,4,60114,1648091,370652,41256
2,1047,6,1,13,157,2017-11-06 15:42:32,,0,6,15,42,32,118,19564,1648091,370652,31221
3,76270,3,1,13,120,2017-11-06 15:56:17,,0,6,15,56,17,29,292254,1648091,370652,26760
4,57862,3,1,13,120,2017-11-06 15:57:01,,0,6,15,57,1,31,292254,1648091,370652,26760


In [11]:
_ = train_model(train_encoded, valid_encoded)

Training model!
Validation AUC score: 0.9653051135205329


**Count encoding improved our model's score: the validation AUC rose from 0.9623 (baseline) to 0.9653.**
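
As a rough illustration of what count encoding computes, the sketch below reproduces the idea in plain Python, without `category_encoders`, to keep the mechanics visible. The lists `train_apps` and `valid_apps` are made-up toy values, not taken from `clicks`, and the choice to encode unseen categories as 0 is an assumption of this sketch; the library's own handling of unseen values may differ.

```python
from collections import Counter

# Hypothetical toy values, for illustration only (not taken from `clicks`)
train_apps = [3, 3, 35, 6, 3]
valid_apps = [3, 6, 99]  # 99 never appears in the training rows

# Count how often each category occurs in the training rows only
app_counts = Counter(train_apps)

# Replace each category with its training count; unseen categories get 0 here
train_app_count = [app_counts[a] for a in train_apps]
valid_app_count = [app_counts[a] for a in valid_apps]

print(train_app_count)  # [3, 3, 1, 1, 3]
print(valid_app_count)  # [3, 1, 0]
```

The counts are learned from the training rows alone and then looked up for the validation rows, which mirrors the `fit` on `train` followed by `transform` on both sets in the cell above.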

### Target encoding

* Target encoding is a supervised encoding: it uses the labels (the targets) to transform categorical features, replacing each category value with the average target value for that category.
* Create the target encoder from the `category_encoders` library.
* Then learn the encodings from the training dataset only, apply them to all the datasets, and retrain the model.
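
A minimal sketch of the idea behind target encoding, written in plain Python on made-up data: each category is replaced by the mean label observed for it in the training rows. The toy rows and the helper `encode` are hypothetical, and falling back to the global mean for unseen categories is an assumption of this sketch. `ce.TargetEncoder` additionally smooths each category mean toward the global mean, which is omitted here.

```python
from collections import defaultdict

# Hypothetical toy data: (category, label) pairs standing in for (app, is_attributed)
train_rows = [(3, 0), (3, 0), (35, 1), (3, 1), (35, 1), (6, 0)]
valid_cats = [3, 35, 99]  # 99 is unseen in training

# Accumulate label sums and row counts per category from the training rows only
sums, counts = defaultdict(float), defaultdict(int)
for cat, label in train_rows:
    sums[cat] += label
    counts[cat] += 1

global_mean = sum(label for _, label in train_rows) / len(train_rows)

def encode(cat):
    """Mean training label for `cat`; fall back to the global mean if unseen."""
    if cat not in counts:
        return global_mean
    return sums[cat] / counts[cat]

train_target = [encode(cat) for cat, _ in train_rows]
valid_target = [encode(cat) for cat in valid_cats]
print(valid_target)
```

Because the statistics come only from the training rows, the validation labels never leak into the encoded features.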

In [14]:
cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

target_enc = ce.TargetEncoder(cols = cat_features)
target_enc.fit(train[cat_features],train["is_attributed"])

TargetEncoder(cols=['ip', 'app', 'device', 'os', 'channel'])

In [15]:
train_encoded = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_encoded = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

In [16]:
_ = train_model(train_encoded,valid_encoded)

Training model!
Validation AUC score: 0.9540530347873288


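### Leave-one-out Encoding
**Here we drop the `ip` feature and repeat the target encoding on the remaining categorical features. Note that the cell below still uses `TargetEncoder`, not a leave-one-out encoder.**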


In [17]:
cat_features = ['app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

target_enc = ce.TargetEncoder(cols = cat_features)
target_enc.fit(train[cat_features],train["is_attributed"])

train_encoded = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_encoded = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

In [18]:
_ = train_model(train_encoded,valid_encoded)

Training model!
Validation AUC score: 0.9627457957514338


In [27]:
clicks.shape[0]

2300561

In [26]:
clicks['ip'].nunique()/clicks.shape[0]

0.1131193652330888

Target encoding attempts to estimate the population mean of the target for each level of a categorical feature. When there is less data per level, the estimated mean has more variance and can land further from the "true" mean. The ratio above shows that there is about one distinct IP address for every nine clicks, so there is little data per IP address and its estimates are likely much noisier than those for the other features. Because the encoded IP looks extremely predictive on the training data, the model relies heavily on it, makes fewer splits on the other features, and fits those features only on the errors left over after accounting for IP address. As a result, the model performs poorly on IP addresses that were not in the training data (which is likely most new data). Removing `ip` brought the validation AUC back up from 0.9541 to 0.9627, so going forward we'll leave out the IP feature when trying different encodings.
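
To put a rough number on "noisier", the sketch below uses the standard error of a sample proportion, sqrt(p(1-p)/n), as a simple stand-in for how far a per-level mean can drift from the true rate. The conversion rate `p = 0.1` and the sample size of 100,000 are hypothetical values chosen only for illustration; the rows-per-IP figure is derived from the ratio computed above.

```python
import math

def proportion_std_error(p, n):
    """Standard error of a sample proportion estimated from n independent rows."""
    return math.sqrt(p * (1 - p) / n)

# Average rows per distinct IP, from the ratio computed above (unique IPs / rows)
rows_per_ip = 1 / 0.1131193652330888

# Hypothetical conversion rate, chosen only for illustration
p = 0.1

se_small = proportion_std_error(p, 9)        # roughly one IP's worth of rows
se_large = proportion_std_error(p, 100_000)  # a frequently seen category level

print(f"rows per IP: {rows_per_ip:.2f}")
print(f"std. error with n=9:      {se_small:.4f}")
print(f"std. error with n=100000: {se_large:.4f}")
```

Under these assumptions the estimate for a level with nine rows is about a hundred times noisier than the estimate for a level with 100,000 rows, which is consistent with the argument for dropping `ip`.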

### CatBoost Encoding

The CatBoost encoder is supposed to work well with the LightGBM model. Encode the categorical features with `CatBoostEncoder` and train the model on the encoded data again.
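
The idea usually associated with CatBoost-style encoding is an ordered target statistic: each row is encoded using only the labels of rows that came before it, blended with a prior. The plain-Python sketch below illustrates that general idea on made-up data. The toy rows, the `weight` of 1.0, and the use of the global mean as the prior are assumptions of this sketch, and it should not be read as the exact formula used by `ce.CatBoostEncoder`.

```python
# Hypothetical toy data in time order: (category, label) pairs
rows = [("a", 1), ("a", 0), ("b", 1), ("a", 1), ("b", 0)]

prior = sum(label for _, label in rows) / len(rows)  # global mean of the labels
weight = 1.0  # assumed smoothing weight given to the prior

running_sum, running_count = {}, {}
encoded = []
for cat, label in rows:
    s = running_sum.get(cat, 0.0)
    n = running_count.get(cat, 0)
    # Encode using only the labels of earlier rows with the same category
    encoded.append((s + prior * weight) / (n + weight))
    # Only now add the current row's label to the running statistics
    running_sum[cat] = s + label
    running_count[cat] = n + 1

print([round(v, 3) for v in encoded])  # [0.6, 0.8, 0.6, 0.533, 0.8]
```

Since a row's own label never contributes to its encoded value, this kind of statistic is less prone to target leakage than a plain per-category mean.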


In [28]:
train, valid, test = get_data_splits(clicks)

cb_enc = ce.CatBoostEncoder(cols = cat_features,random_state=7)
cb_enc.fit(train[cat_features],train["is_attributed"])

# Make sure to add `_cb` as a suffix to the new columns
train_encoded = train.join(cb_enc.transform(train[cat_features]).add_suffix('_cb'))
valid_encoded = valid.join(cb_enc.transform(valid[cat_features]).add_suffix('_cb'))

In [29]:
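# NOTE: this call passes the un-encoded `train`/`valid` frames, which is why the
# score below is identical to the baseline. To evaluate the CatBoost features,
# call train_model(train_encoded, valid_encoded) instead.
_ = train_model(train, valid)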

Training model!
Validation AUC score: 0.9622743228943659


* To find the best features from the `CatBoostEncoder`, add the encoded columns to `clicks`:


In [33]:
encode = cb_enc.transform(clicks[cat_features])
# Append each CatBoost-encoded column to `clicks` with a `_cb` suffix
for col in encode:
    clicks.insert(len(clicks.columns), col + '_cb', encode[col])

In [34]:
clicks.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second,app_cb,device_cb,os_cb,channel_cb
0,27226,3,1,13,120,2017-11-06 15:13:23,,0,6,15,13,23,0.028329,0.152087,0.138712,0.034049
1,110007,35,1,13,10,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7,0.995828,0.152087,0.138712,0.950244
2,1047,6,1,13,157,2017-11-06 15:42:32,,0,6,15,42,32,0.009261,0.152087,0.138712,0.019384
3,76270,3,1,13,120,2017-11-06 15:56:17,,0,6,15,56,17,0.028329,0.152087,0.138712,0.034049
4,57862,3,1,13,120,2017-11-06 15:57:01,,0,6,15,57,1,0.028329,0.152087,0.138712,0.034049


### Feature embedding with SVD