# Example of how one-hot encoding works

Author: Umberto Michelucci

In [8]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

We will use the Iris dataset for this small test. Here is a brief description.

---
## Iris Dataset Overview

The Iris dataset is a classic and widely used dataset in the field of machine learning and statistics. It was introduced by the British statistician and biologist Ronald Fisher in 1936.

### Key Features

- **Data Points**: The dataset contains 150 data points, divided equally among three species of Iris flowers: Iris setosa, Iris versicolor, and Iris virginica.
- **Features**: There are four features measured in centimeters for each sample:
  - Sepal Length
  - Sepal Width
  - Petal Length
  - Petal Width
- **Target Variable**: The species of the Iris plant. The dataset includes three classes, which scikit-learn's `load_iris` encodes as the integers 0, 1, and 2, in this order:
  1. Iris setosa
  2. Iris versicolor
  3. Iris virginica
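
The figures listed above can be checked directly against the object returned by `load_iris`. The following is a minimal sketch; the variable names `iris_demo` and `class_counts` are chosen only for this illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the dataset into a separate name so the notebook's own variables are untouched
iris_demo = load_iris()

# 150 samples, 4 features each
print(iris_demo.data.shape)

# Names of the four measured features and of the three species
print(iris_demo.feature_names)
print(iris_demo.target_names)

# Number of samples per class (the classes are encoded as 0, 1, 2)
class_counts = np.bincount(iris_demo.target)
print(class_counts)
```

The class counts should confirm that the 150 samples are split equally among the three species.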

### Usage

The Iris dataset is often used in data science and machine learning, particularly for:
- Demonstrating classification algorithms.
- Teaching data visualization and exploratory data analysis.
- Testing data preprocessing techniques.

### Significance

Its simplicity and small size make the Iris dataset ideal for beginners to learn the fundamentals of machine learning and pattern recognition.

---

In [10]:
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

In [11]:
y[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Did you notice that the labels are not shuffled but sorted? Do you think this may be a problem?
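
One way to explore this question is to compare a naive, unshuffled split with a shuffled one. The sketch below is only an illustration under some assumptions: the 80/20 split ratio, the `random_state` value, and all variable names are chosen for this example and do not come from the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# Naive split: take the last 30 samples as a test set without shuffling.
# Because the labels are sorted, this test set contains a single class.
naive_test_classes = np.unique(y_demo[-30:])
print("Classes in naive test set:", naive_test_classes)

# Shuffled, stratified split: every class is represented in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,      # assumed ratio, for illustration only
    shuffle=True,
    stratify=y_demo,    # keep class proportions equal in both subsets
    random_state=42,    # arbitrary seed so the result is reproducible
)
stratified_test_counts = np.bincount(y_test)
print("Samples per class in stratified test set:", stratified_test_counts)
```

With sorted labels, any split that simply cuts the array at a fixed position risks training on some classes and testing on others, which is why shuffling (optionally with stratification) is usually applied before splitting.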

In [12]:
# The y array currently contains integers representing the classes.
# We will now convert these class labels into one-hot encoded format.
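# Initialize the OneHotEncoder
# sparse_output=False returns a dense NumPy array (scikit-learn >= 1.2;
# older versions use sparse=False instead)
encoder = OneHotEncoder(sparse_output=False)

# Reshape y to a 2D column vector, since fit_transform expects a 2D array
y = y.reshape(-1, 1)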

# Apply the encoder on the reshaped y
y_one_hot = encoder.fit_transform(y)

# y_one_hot now contains the one-hot encoded labels
print(y_one_hot[0:10])


[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
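
To make the mechanics of the encoding more concrete, the sketch below builds the same one-hot matrix by hand with NumPy and then maps it back to integer labels. It is a minimal illustration, not the notebook's own method: the names `labels`, `enc`, `encoded`, `manual`, and `recovered` are hypothetical, and the encoder is used with its default sparse output followed by `.toarray()` so that the code does not depend on the `sparse` / `sparse_output` parameter name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder

labels = load_iris().target  # integer labels 0, 1, 2 with shape (150,)

# Encode with scikit-learn; the default output is sparse, so convert to dense
enc = OneHotEncoder()
encoded = enc.fit_transform(labels.reshape(-1, 1)).toarray()

# The categories learned by the encoder, one array per input column
print(enc.categories_)

# Manual one-hot encoding: row k of the identity matrix has a 1 in column k,
# so indexing the identity matrix with the labels yields one row per sample
n_classes = len(enc.categories_[0])
manual = np.eye(n_classes)[labels]
print(np.array_equal(encoded, manual))

# Decoding: the position of the 1 in each row is the original label
recovered = encoded.argmax(axis=1)
print(np.array_equal(recovered, labels))

# The encoder can also undo the transformation itself
restored = enc.inverse_transform(encoded).ravel()
print(np.array_equal(restored, labels))
```

Each row of the encoded matrix contains exactly one 1, in the column that corresponds to the sample's class, which is why `argmax` along the rows is enough to recover the original integer labels.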


