In [1]:
# This sample introduces only two commonly used encoders;
# many others exist for categorical variables.
# We may cover more when we talk about feature engineering.
import numpy as np
import pandas as pd

np.random.seed(42)  # for reproducibility

# Sizes
n = 1000

# Simulate the non-ordinal categorical variable X1 with 3 types
X1 = np.random.choice(['Type1', 'Type2', 'Type3'], size=n)

# Simulate the ordinal categorical variable X2 with 3 types
X2 = np.random.choice(['Low', 'Medium', 'High'], size=n, p=[0.3, 0.4, 0.3])

# Simulate the target variable Y based on X1 and X2
# For simplicity, this uses an additive model (no interaction term). Adjust as needed.
Y = np.array([np.random.normal(0, 1) + (1 if x1 == 'Type2' else 0) + 
              (0.5 if x2 == 'Medium' else 1 if x2 == 'High' else -0.5) for x1, x2 in zip(X1, X2)])

# Create a DataFrame
df = pd.DataFrame({'X1': X1, 'X2': X2, 'Y': Y})


In [2]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

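# Encoder for X1 (non-ordinal)
# Note: `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions.
encoder_X1 = OneHotEncoder(sparse_output=False)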
X1_encoded = encoder_X1.fit_transform(df[['X1']])

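# Encoder for X2 (ordinal). Note that LabelEncoder assigns codes in alphabetical order
# (High=0, Low=1, Medium=2), not Low < Medium < High; see the notes at the bottom
# for the difference between LabelEncoder and OrdinalEncoder.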
encoder_X2 = LabelEncoder()
df['X2_encoded'] = encoder_X2.fit_transform(df['X2'])

# Combine X1_encoded and X2_encoded
X = np.hstack((X1_encoded, df[['X2_encoded']]))

# Target variable
Y = df['Y'].values




In [3]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from lightgbm import LGBMRegressor

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.1, 0.5],
    'max_depth': [3, 5, 7],
}

# Initialize the model
model = LGBMRegressor()

# Grid search
grid_search = GridSearchCV(model, param_grid, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(X_train, Y_train)

# Make predictions and evaluate the model
Y_pred = grid_search.predict(X_test)
rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))

print(f'Best parameters: {grid_search.best_params_}')
print(f'RMSE: {rmse}')


Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
RMSE: 1.0162537818095863


# Categorical Encoders in Python

Categorical data are common in data science and represent membership in a category, such as a country, color, or brand. Most machine learning algorithms require numerical input, so categorical variables must be encoded as numbers before modeling. The notes below cover label encoding and then compare `LabelEncoder` with `OrdinalEncoder`.
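As a small illustration of what an encoder produces, the sketch below one-hot encodes a nominal column with `pandas.get_dummies`. The `toy` DataFrame and its `Color` column are hypothetical and are not part of the notebook above.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative only.
toy = pd.DataFrame({"Color": ["red", "green", "blue", "green"]})

# One indicator column per category. astype(int) gives 0/1 values
# regardless of the pandas version (newer versions return booleans).
one_hot = pd.get_dummies(toy["Color"], prefix="Color").astype(int)
print(one_hot)
```

Each row contains exactly one 1, so no order between the categories is implied.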

## 1. Label Encoding

Label encoding is the simplest method: it maps each distinct value in a column to an integer.

- **Use Case**: Ordinal data (e.g., low, medium, high), provided the integer codes follow the category order. `LabelEncoder` assigns codes in sorted (alphabetical) order, so this is not guaranteed.
- **Scikit-learn Implementation**: `LabelEncoder`
- **Limitations**: Implies an order even when there is none, which can mislead algorithms that treat the codes as magnitudes (e.g., linear models).
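The ordering caveat can be checked directly. This is a minimal sketch, assuming scikit-learn is installed; the `levels` list and the `mapping` dictionary are names introduced here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative ordinal levels, listed in their natural order.
levels = ["Low", "Medium", "High"]

encoder = LabelEncoder().fit(levels)

# classes_ holds the labels in sorted order; the position is the integer code.
mapping = {str(label): int(code)
           for label, code in zip(encoder.classes_,
                                  encoder.transform(encoder.classes_))}
print(mapping)  # {'High': 0, 'Low': 1, 'Medium': 2}
```

The codes follow alphabetical order rather than Low < Medium < High, which is why the natural order of an ordinal feature can be lost.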

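```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Example with a small illustrative DataFrame
data = pd.DataFrame({'Category': ['low', 'high', 'medium', 'low']})

# Codes follow sorted order: high=0, low=1, medium=2
label_encoder = LabelEncoder()
data['Category'] = label_encoder.fit_transform(data['Category'])
```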


# Differences between LabelEncoder and OrdinalEncoder

When dealing with categorical data in machine learning, it's crucial to encode categorical variables properly. `LabelEncoder` and `OrdinalEncoder` are two common encoders provided by the scikit-learn library in Python. Though they might seem similar at first glance, they serve different purposes and are used in different scenarios.

## LabelEncoder

`LabelEncoder` is a utility class that normalizes labels so that they contain only values between 0 and `n_classes-1`. It can also convert non-numerical labels, such as the string classes of a nominal target, to numerical labels.

### Key Characteristics of LabelEncoder:
- It encodes labels with a value between 0 and `n_classes-1`.
- It's used for encoding target values (y), i.e., the response variable, not the input (X) variables.
- It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.
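The characteristics above can be sketched with a small round trip. The `y` list is hypothetical example data, and `inverse_transform` is used to recover the original labels from the codes.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels for a binary classification problem.
y = ["spam", "not spam", "spam", "not spam"]

target_encoder = LabelEncoder()
y_encoded = target_encoder.fit_transform(y)              # 1D integer array
y_decoded = target_encoder.inverse_transform(y_encoded)  # back to strings

print(y_encoded)                # [1 0 1 0]
print(target_encoder.classes_)  # sorted labels: 'not spam' first, then 'spam'
```

Keeping the fitted encoder around makes it possible to translate model predictions back into the original class names.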

### Usage:
`LabelEncoder` is typically used to encode the target values (y), for instance in classification problems with labels such as "spam" and "not spam".

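```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({'Target': ['spam', 'not spam', 'spam']})  # illustrative data
label_encoder = LabelEncoder()
data['Target'] = label_encoder.fit_transform(data['Target'])  # not spam=0, spam=1
```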


## OrdinalEncoder

`OrdinalEncoder` converts categorical features to integer codes. It is designed for feature encoding (X) rather than target variable encoding (y).

### Key Characteristics of OrdinalEncoder:
- **Encodes categorical features as an integer array**: The encoder converts the data into a format that is interpretable by the machine learning model, assigning each unique category in the feature column an integer value.
- **Meant for feature encoding (X)**: Unlike `LabelEncoder` which is often used for the target variable (y), `OrdinalEncoder` is designed to preprocess the input features (X).
- **Handles 2D data**: The input to this transformer should be a 2D array-like structure representing multiple features.
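A minimal sketch of these points, assuming a hypothetical `features` DataFrame with two ordinal columns. The `categories` argument takes one list per column, in the intended order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical feature table with two ordinal columns.
features = pd.DataFrame({
    "Rating": ["good", "bad", "average", "good"],
    "Size": ["S", "L", "M", "S"],
})

# One category list per column, each given in its natural order.
ordinal_encoder = OrdinalEncoder(
    categories=[["bad", "average", "good"], ["S", "M", "L"]]
)
encoded = ordinal_encoder.fit_transform(features)  # 2D float array

print(encoded.shape)  # (4, 2)
print(encoded)
```

Both columns are encoded in a single call, and the output keeps the 2D shape of the input.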

### Usage:
`OrdinalEncoder` is particularly useful for categorical features whose categories have a meaningful order, such as ratings (e.g., "bad", "average", "good"). By default the categories are ordered by sorting (alphabetically for strings), so pass the intended order through the `categories` parameter when it matters.

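```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Suppose 'Feature' is an ordinal categorical column in the DataFrame 'data'
data = pd.DataFrame({'Feature': ['good', 'bad', 'average']})

# Pass the category order explicitly; the default is sorted (alphabetical) order
ordinal_encoder = OrdinalEncoder(categories=[['bad', 'average', 'good']])
data_encoded = ordinal_encoder.fit_transform(data[['Feature']])
```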


## Comparison between LabelEncoder and OrdinalEncoder

| Aspect                  | LabelEncoder                                    | OrdinalEncoder                                        |
|-------------------------|-------------------------------------------------|-------------------------------------------------------|
| Use Case                | Encoding target variables (y)                   | Encoding input features (X)                           |
| Input Shape             | 1D array                                        | 2D array                                              |
| Output Shape            | 1D array                                        | 2D array                                              |
| Suitable for            | Nominal Categorical Variables (class labels)    | Ordinal Categorical Variables                         |
| Relationship Imposed    | Ordinal, in sorted label order (even if nominal)| Ordinal, sorted by default or set via `categories`    |
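The shape rows of the table can be verified with a short sketch. The `values` array is hypothetical; both encoders are left at their defaults, so both use sorted order.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical labels used only to compare input and output shapes.
values = np.array(["b", "a", "c", "a"])

# LabelEncoder: 1D in, 1D out.
label_out = LabelEncoder().fit_transform(values)

# OrdinalEncoder: 2D in (one column here), 2D out.
ordinal_out = OrdinalEncoder().fit_transform(values.reshape(-1, 1))

print(label_out.shape, ordinal_out.shape)  # (4,) (4, 1)
```

With default settings the two encoders assign the same codes; the difference lies in the expected shapes and in the option to set the order explicitly on `OrdinalEncoder`.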

In summary, while both `LabelEncoder` and `OrdinalEncoder` can be used to transform categorical variables into numbers, they are intended for different types of data and serve different purposes in a machine learning workflow. `LabelEncoder` is primarily for the target variable, and `OrdinalEncoder` is for input features, especially when those features are ordinal.
