# Example of how one-hot encoding works

Author: Umberto Michelucci

In [8]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

We will use the Iris dataset for this small test. Here is a brief description.

## Iris Dataset Overview

The Iris dataset is a classic and widely used dataset in the field of machine learning and statistics. It was introduced by the British statistician and biologist Ronald Fisher in 1936.

### Key Features

- **Data Points**: The dataset contains 150 data points, divided equally among three species of Iris flowers: Iris setosa, Iris versicolor, and Iris virginica.
- **Features**: There are four features measured in centimeters for each sample:
  - Sepal Length
  - Sepal Width
  - Petal Length
  - Petal Width
- **Target Variable**: The species of the Iris plant. The dataset includes three classes:
  1. Iris setosa
  2. Iris versicolor
  3. Iris virginica

### Usage

The Iris dataset is often used in data science and machine learning, particularly for:
- Demonstrating classification algorithms.
- Teaching data visualization and exploratory data analysis.
- Testing data preprocessing techniques.



In [10]:
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

In [11]:
y[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Did you notice that the labels are not shuffled but sorted? Do you think this may be a problem?
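One way to see why sorted labels can matter is to split the data without shuffling. The following is a minimal sketch, assuming a hold-out fraction of 20% and `random_state=0` (both are illustrative choices, not part of the original notebook):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Without shuffling, the test set is simply the last 20% of the rows,
# which in a sorted dataset all belong to the same class.
_, _, _, y_test_sorted = train_test_split(X, y, test_size=0.2, shuffle=False)
print(np.unique(y_test_sorted))

# With shuffling and stratification, every class is represented.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0
)
print(np.bincount(y_test_strat))
```

A model evaluated on the unshuffled split would be tested on a single species only, so shuffling (the default in `train_test_split`) or stratifying is usually the safer choice.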

## One-hot Encoding the Labels

### One-Hot Encoding Process

One-hot encoding is a process used to convert categorical data into a numerical format that can be understood by machine learning algorithms. It is particularly useful for handling nominal data, where there is no inherent order in the categories. Here's a brief description of the process:

### Steps in One-Hot Encoding
1. **Identify Categorical Data**:
   - The first step is to identify which features in your dataset are categorical. These could be in string format or represented as discrete values.

2. **List Unique Categories**:
   - For each categorical feature, list all the unique categories it contains.

3. **Create Binary Features**:
   - For each unique category, create a new binary feature (column) in your dataset.
   - The number of binary features created for each categorical feature will be equal to the number of unique categories in that feature.

4. **Assign Binary Values**:
   - For each record, assign a value of 1 in the binary feature that corresponds to the category present in the original feature and 0 in all other binary features created for that category.
   - Essentially, only one of these binary features will have a 1 for each record, and the rest will have 0s.
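
The steps above can be sketched by hand with NumPy. This is only an illustration of the idea, not how scikit-learn implements it; the small `species` array is made up for the example:

```python
import numpy as np

# Step 1: a categorical feature (hypothetical sample values).
species = np.array(["setosa", "versicolor", "virginica", "setosa"])

# Step 2: list the unique categories (np.unique returns them sorted)
# and, for each record, the index of its category.
categories, codes = np.unique(species, return_inverse=True)

# Step 3: create one binary column per unique category.
one_hot = np.zeros((species.size, categories.size), dtype=int)

# Step 4: put a single 1 in each row, in the column of its category.
one_hot[np.arange(species.size), codes] = 1

print(categories)
print(one_hot)
```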

### Example
Consider a categorical feature "Color" with three categories: Red, Blue, and Green. One-hot encoding will create three new features: "Color_Red", "Color_Blue", and "Color_Green". For a record with the color Red, the values for these features will be:

- Color_Red: 1
- Color_Blue: 0
- Color_Green: 0
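
The same result can be reproduced with scikit-learn. In this sketch the column order is fixed to Red, Blue, Green through the `categories` parameter (by default the categories would be sorted alphabetically), and the `Color_` prefix is added by hand for illustration:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fix the column order explicitly; the default would be Blue, Green, Red.
encoder = OneHotEncoder(categories=[["Red", "Blue", "Green"]])
colors = np.array([["Red"], ["Green"], ["Blue"]])

# The default output is a sparse matrix; convert it to a dense array.
encoded = encoder.fit_transform(colors).toarray()

# Build the column names by hand to mirror the example above.
names = [f"Color_{c}" for c in encoder.categories_[0]]
for name, value in zip(names, encoded[0]):  # first record: Red
    print(f"{name}: {int(value)}")
```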

### Why Use One-Hot Encoding?
- **Machine Learning Compatibility**: Most machine learning models operate on numerical arrays and cannot consume strings or other categorical values directly. One-hot encoding converts categorical data into a numerical format these models can work with.
- **Removing Ordinality**: Encoding categories as integers (0, 1, 2, ...) implies an order and a spacing between them that an algorithm may wrongly exploit. One-hot encoding avoids this, which is important for nominal categories.
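
The ordinality point can be illustrated by comparing distances between three classes under both encodings. This is a small NumPy sketch with made-up labels:

```python
import numpy as np

labels = np.array([0, 1, 2])

# Integer encoding: class 0 appears twice as far from class 2 as from class 1.
int_dist = np.abs(labels[:, None] - labels[None, :])
print(int_dist)

# One-hot encoding: every pair of distinct classes is equally far apart.
one_hot = np.eye(3)[labels]
oh_dist = np.linalg.norm(one_hot[:, None, :] - one_hot[None, :, :], axis=-1)
print(oh_dist.round(3))
```

With integer labels the pairwise distances differ, while with one-hot vectors all distinct classes are separated by the same distance (the square root of 2).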

### Considerations
- **Dimensionality Increase**: One-hot encoding adds one column per unique category, so features with many categories can greatly increase the number of columns in your dataset and contribute to the "curse of dimensionality".
- **Dummy Variable Trap**: The binary columns created for one feature always sum to 1, so they are perfectly collinear when the model also has an intercept. This multicollinearity can be a problem for unregularized linear models. To avoid it, one category is sometimes dropped (reducing the number of binary features by one).
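
A short sketch of the dummy variable trap and of dropping one category, using the `drop="first"` option of scikit-learn's `OneHotEncoder` (the sample colors are made up):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["Red"], ["Blue"], ["Green"], ["Blue"]])

# Full encoding: one column per category (sorted: Blue, Green, Red).
full = OneHotEncoder().fit_transform(colors).toarray()

# Reduced encoding: the first category (Blue) is dropped.
reduced = OneHotEncoder(drop="first").fit_transform(colors).toarray()

# Every row of the full encoding sums to 1, which is the source of the collinearity.
print(full.sum(axis=1))
print(full.shape, reduced.shape)
print(reduced)
```

In the reduced encoding the dropped category is represented by a row of zeros, so no information is lost.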


### The `OneHotEncoder` Class

`OneHotEncoder` is a class from scikit-learn, a popular Python library for machine learning. It is a preprocessing transformer that encodes categorical variables as a one-hot numeric array, the input format that many machine learning algorithms require. Here's a detailed description:

### Overview of `OneHotEncoder()`
1. **Purpose**:
   - The primary purpose of `OneHotEncoder()` is to convert categorical data into a format that can be more easily processed by machine learning algorithms. It creates a binary column for each category and returns a sparse matrix or dense array.

2. **Functionality**:
   - For each unique category in a feature, `OneHotEncoder()` creates a new binary feature that indicates whether the category is present in each row.

### Key Features and Parameters
1. **Parameters**:
   - `categories`: Specifies the categories to be encoded. By default, categories are automatically determined from the training data.
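   - `drop`: Specifies whether to drop one category per feature (for example `drop='first'`) to avoid creating collinear features. By default, no category is dropped.
   - `sparse_output` (called `sparse` before scikit-learn 1.2): If set to `True` (the default), the encoder returns a sparse matrix; if `False`, a dense array is returned.
   - `handle_unknown`: Specifies how to handle categories not seen during training (for example `'error'`, the default, or `'ignore'`).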

2. **Methods**:
   - `fit(X[, y])`: Fit the OneHotEncoder to X.
   - `transform(X)`: Transform X using the one-hot encoding.
   - `fit_transform(X[, y])`: Fit to data, then transform it.

3. **Output**:
   - After transformation, each unique category value of the feature is converted into a binary column, with `1` indicating the presence of the category and `0` indicating the absence.
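
The following sketch shows the `fit` and `transform` methods together with `handle_unknown='ignore'`. The training and new values are made up; `'purple'` plays the role of a category not seen during fitting:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["red"], ["green"], ["blue"]])
new = np.array([["green"], ["purple"]])  # 'purple' was not seen during fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Categories learned during fit, in column order (sorted alphabetically).
print(encoder.categories_)

# The unknown category is encoded as a row of zeros instead of raising an error.
encoded_new = encoder.transform(new).toarray()
print(encoded_new)
```

With the default `handle_unknown='error'`, the same call to `transform` would raise an error for the unseen category.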

### Example Usage
```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample data
data = np.array([['red'], ['green'], ['blue']])

# Create the encoder
encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2

# Fit and transform data
encoded_data = encoder.fit_transform(data)

print(encoded_data)
```

### Considerations
- **Dummy Variable Trap**: One-hot encoding creates one binary column per category, and the columns of a feature always sum to 1, which can lead to multicollinearity. The `drop` parameter can be used to drop one of the binary columns and avoid this issue.
- **Handling Unknown Categories**: The parameter `handle_unknown` controls what happens when `transform` meets a category that was not seen during training: raise an error (the default) or, with `'ignore'`, encode it as all zeros.
- **Memory Usage**: For a large number of categories, one-hot encoding can significantly increase the dataset's size. Keeping the default sparse output (`sparse_output=True`; the parameter was called `sparse` before scikit-learn 1.2) mitigates this, because only the non-zero entries are stored.
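
The memory point can be checked with a rough comparison of sparse and dense storage. This sketch assumes a synthetic feature with up to 1,000 distinct integer IDs and 10,000 rows; the sizes are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Synthetic high-cardinality feature (hypothetical data).
rng = np.random.default_rng(0)
ids = rng.integers(0, 1000, size=(10_000, 1))

# The default output is a SciPy sparse matrix in CSR format.
sparse_encoded = OneHotEncoder().fit_transform(ids)

# Bytes used by the three arrays that make up the CSR matrix.
sparse_bytes = (
    sparse_encoded.data.nbytes
    + sparse_encoded.indices.nbytes
    + sparse_encoded.indptr.nbytes
)
# Bytes used by the equivalent dense array.
dense_bytes = sparse_encoded.toarray().nbytes

print(sparse_encoded.shape, sparse_encoded.nnz)
print(f"sparse: {sparse_bytes} bytes, dense: {dense_bytes} bytes")
```

Only one entry per row is non-zero, so the sparse representation grows with the number of rows rather than with rows times categories.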


In [12]:
# The y array currently contains integers representing the classes.
# We will now convert these class labels into one-hot encoded format.
# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # dense output; use sparse=False on scikit-learn < 1.2

# Reshape y to 2D array as fit_transform expects 2D array
y = y.reshape(-1, 1)

# Apply the encoder on the reshaped y
y_one_hot = encoder.fit_transform(y)

# y_one_hot now contains the one-hot encoded labels
print(y_one_hot[0:10])


[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
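
As a sanity check, the one-hot labels can be mapped back to the original integer classes. The sketch below is self-contained (it reloads the data and converts the default sparse output with `.toarray()`), and uses both `inverse_transform` and a plain `argmax`; the latter works here only because the Iris classes are the integers 0, 1, 2:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder

# Reload the labels and reshape them to the 2D form the encoder expects.
y = load_iris().target.reshape(-1, 1)

encoder = OneHotEncoder()
y_one_hot = encoder.fit_transform(y).toarray()
print(y_one_hot.shape)  # one row per sample, one column per class

# Recover the original labels with the encoder itself...
recovered = encoder.inverse_transform(y_one_hot)

# ...or with argmax, since column i corresponds to class i in this dataset.
recovered_argmax = np.argmax(y_one_hot, axis=1)

print(np.array_equal(recovered, y))
print(np.array_equal(recovered_argmax, y.ravel()))
```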


