In [3]:
import pandas as pd
import numpy as np

## Trick 1: Numeric feature discretization (bucketing)

What is feature discretization? - Transform numeric features into categorical features.

Advantages: 
* Faster computation: optimization runs faster on a small set of discrete values
* Prevent overfitting: coarse buckets smooth over small fluctuations in the raw values
* Add robustness: easy to handle outliers

Methods:
* Quantile bucketing 
* Bucketing based on domain knowledge

### By quantile

In [2]:
df = pd.DataFrame({
   'age': [2, 67, 40, 32, 4, 15, 82, 99, 26, 30, 50, 78]
})
df['age_group'] = pd.qcut(df['age'], 3)

In [3]:
df

Unnamed: 0,age,age_group
0,2,"(1.999, 28.667]"
1,67,"(55.667, 99.0]"
2,40,"(28.667, 55.667]"
3,32,"(28.667, 55.667]"
4,4,"(1.999, 28.667]"
5,15,"(1.999, 28.667]"
6,82,"(55.667, 99.0]"
7,99,"(55.667, 99.0]"
8,26,"(1.999, 28.667]"
9,30,"(28.667, 55.667]"


### Customized

In [6]:
bins = np.array([0,3,12,18,45,60,100])

In [7]:
df['age_group'] = pd.cut(df['age'], bins)
df

Unnamed: 0,age,age_group
0,2,"(0, 3]"
1,67,"(60, 100]"
2,40,"(18, 45]"
3,32,"(18, 45]"
4,4,"(3, 12]"
5,15,"(12, 18]"
6,82,"(60, 100]"
7,99,"(60, 100]"
8,26,"(18, 45]"
9,30,"(18, 45]"
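The interval labels above can be hard to read downstream. A minimal sketch of the same custom bucketing with named buckets, assuming the illustrative label names below (they are not part of the original notebook):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [2, 67, 40, 32, 4, 15, 82, 99, 26, 30, 50, 78]
})

# Same edges as the custom bins above; the label names are illustrative.
bins = np.array([0, 3, 12, 18, 45, 60, 100])
labels = ['infant', 'child', 'teen', 'adult', 'middle_aged', 'senior']

# right=True (the default) closes each bin on the right: (0, 3], (3, 12], ...
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels)

# Values outside the outermost edges become NaN, so count them as a sanity check.
n_unbinned = int(df['age_group'].isna().sum())

# Integer codes are convenient for models that need numeric input.
df['age_group_code'] = df['age_group'].cat.codes
```

Because `labels` is ordered from youngest to oldest, the resulting categorical is ordered and the integer codes preserve that order.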


## Trick 2: Log Transformation 

What is log transformation? - Replace each value with its logarithm, which compresses a right-skewed distribution into a more symmetric, closer-to-normal one.

Advantages: 
* Better predictive power on skewed data (e.g. revenue)
* Robustness: large outliers are compressed, so they have less influence on the model

In [11]:
df = pd.DataFrame({
   'income': [10, 5, 10000, 300, 500, 0.5, 5000, 80000]
})
df['income_transformed'] = np.log(df['income'])
df

Unnamed: 0,income,income_transformed
0,10.0,2.302585
1,5.0,1.609438
2,10000.0,9.21034
3,300.0,5.703782
4,500.0,6.214608
5,0.5,-0.693147
6,5000.0,8.517193
7,80000.0,11.289782
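`np.log` returns `-inf` for zero and `NaN` for negative values, which matters for columns like income that can contain zeros. A small sketch using `np.log1p` (which computes log(1 + x)) and its inverse `np.expm1`; the zero added to the data here is hypothetical, included only to show the edge case:

```python
import numpy as np
import pandas as pd

# Hypothetical data: same scale as above, but with a zero income included.
df = pd.DataFrame({
    'income': [0, 5, 10000, 300, 500, 0.5, 5000, 80000]
})

# log1p(x) = log(1 + x), so a zero income maps to 0 instead of -inf.
df['income_log1p'] = np.log1p(df['income'])

# expm1 inverts log1p, which is needed to map predictions back to the original scale.
df['income_restored'] = np.expm1(df['income_log1p'])
```

If a model is trained on the transformed target, its predictions need the same inverse step before they are reported in the original units.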


## Trick 3: Pseudo labeling

What is pseudo labeling? - A semi-supervised learning technique: use a model trained on the labeled data to generate labels for the unlabeled data, then retrain on both. Frequently used in computer vision tasks.

Advantages: 
* Useful when you have a small amount of labeled data and a large amount of unlabeled data
* Labels data automatically at minimal cost
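The pseudo-labeling loop can be sketched as follows. This is one possible version, assuming scikit-learn, a synthetic dataset, a logistic regression model, and a 0.9 confidence threshold; none of these choices come from the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: pretend only the first 50 rows have labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_labeled, y_labeled = X[:50], y[:50]
X_unlabeled = X[50:]

# Step 1: train on the small labeled set.
model = LogisticRegression(max_iter=1000)
model.fit(X_labeled, y_labeled)

# Step 2: predict on the unlabeled set and keep only confident predictions.
proba = model.predict_proba(X_unlabeled)
confidence = proba.max(axis=1)
pseudo_labels = model.classes_[proba.argmax(axis=1)]
threshold = 0.9  # assumed cutoff; tune on a validation set
keep = confidence >= threshold

# Step 3: retrain on the labeled data plus the confident pseudo-labeled data.
X_combined = np.vstack([X_labeled, X_unlabeled[keep]])
y_combined = np.concatenate([y_labeled, pseudo_labels[keep]])
final_model = LogisticRegression(max_iter=1000)
final_model.fit(X_combined, y_combined)
```

The threshold trades quantity for quality: a lower value adds more pseudo-labeled rows but also more wrong labels, which the retrained model may then reinforce.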

## Trick 4: Regularization

Goal: prevent overfitting and improve generalization performance. L1 regularization can shrink the weights of less important features to exactly zero, effectively performing feature selection.
* L1: Lasso, penalizes the Manhattan norm of the weights
* L2: Ridge, penalizes the squared Euclidean norm of the weights
* Combination of L1 and L2: Elastic Net, a weighted combination of the two penalties
* Dropout in neural networks
* Early stopping based on training and validation performance
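The difference between the L1 and L2 penalties is easiest to see on the fitted coefficients. A minimal sketch with scikit-learn's `Lasso` and `Ridge`, assuming synthetic data in which only the first two of ten features affect the target; the `alpha` values are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 1 carry signal; the other eight are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 tends to set noise coefficients exactly to zero; L2 only shrinks them.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso zero coefficients:", n_zero_lasso)
print("Ridge zero coefficients:", n_zero_ridge)
```

Inspecting `lasso.coef_` after fitting is a quick way to see which features the L1 penalty kept.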

## Trick 5: Target Encoding

Also called mean encoding: replace each category of a variable with the mean value of the target over the observations in that category.

Key idea: Monotonic relationships between variable and target tend to improve linear model performance.

Cons: potential data leakage and overfitting, because each row's encoding is computed from target values that include its own.

In [4]:
data = {
  "gender": ['male', 'male', 'female','female'],
  "value": [50, 40, 45, 60]
}
data = pd.DataFrame(data)
data["gender_encoded"] = data.groupby("gender")["value"].transform("mean")
data

Unnamed: 0,gender,value,gender_encoded
0,male,50,45.0
1,male,40,45.0
2,female,45,52.5
3,female,60,52.5
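The `groupby` encoding above includes each row's own target value in its category mean, which is where the leakage risk comes from. One common mitigation is out-of-fold encoding: each row is encoded with means computed on the other folds only. A sketch assuming scikit-learn's `KFold` and a slightly larger hypothetical dataset (six rows instead of four, so every fold still sees both categories):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical data, extended from the example above.
data = pd.DataFrame({
    "gender": ['male', 'male', 'female', 'female', 'male', 'female'],
    "value": [50, 40, 45, 60, 55, 35],
})
data["gender_encoded_oof"] = np.nan

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(data):
    fit_part = data.iloc[fit_idx]
    # Category means computed without the rows being encoded.
    fold_means = fit_part.groupby("gender")["value"].mean()
    # Categories unseen in the fitting folds fall back to the fitting-fold mean.
    fallback = fit_part["value"].mean()
    encoded = data.iloc[enc_idx]["gender"].map(fold_means).fillna(fallback)
    data.loc[data.index[enc_idx], "gender_encoded_oof"] = encoded.to_numpy()
```

At prediction time, new rows would be encoded with category means computed on the full training set.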


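The `TargetEncoder` in `category_encoders` uses a different method: it shrinks each category mean toward the global mean of the target, so small categories get encodings close to the global mean.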

In [8]:
# from sklearn.preprocessing import TargetEncoder (archived)
from category_encoders import TargetEncoder
data = {
  "gender": ['male', 'male', 'female','female'],
  "value": [50, 40, 45, 60]
}
data = pd.DataFrame(data)

encoder = TargetEncoder()
data["gender_encoded"] = encoder.fit_transform(data["gender"], data["value"])
data

Unnamed: 0,gender,value,gender_encoded
0,male,50,48.218059
1,male,40,48.218059
2,female,45,49.281941
3,female,60,49.281941
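The encoded values above are consistent with a sigmoid-weighted blend of the category mean and the global mean. The sketch below reimplements that blend by hand. The parameter values `min_samples_leaf = 20` and `smoothing = 10.0` are assumptions chosen because they reproduce the numbers shown; the defaults depend on the installed `category_encoders` version, so check its documentation before relying on them.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "gender": ['male', 'male', 'female', 'female'],
    "value": [50, 40, 45, 60],
})

# Assumed parameter values (see the note above).
min_samples_leaf = 20
smoothing = 10.0

prior = data["value"].mean()  # global mean of the target: 48.75
stats = data.groupby("gender")["value"].agg(["mean", "count"])

# The weight on the category mean grows with the category's sample count.
weight = 1.0 / (1.0 + np.exp(-(stats["count"] - min_samples_leaf) / smoothing))
smoothed = prior * (1.0 - weight) + stats["mean"] * weight

data["gender_encoded"] = data["gender"].map(smoothed)
```

With only two rows per category the weight is about 0.14, so both encodings stay close to the global mean of 48.75 rather than the raw category means of 45.0 and 52.5.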


## XGBoost Tricks

https://www.kaggle.com/general/197466