# Module 1: Data Analysis and Data Preprocessing

## Section 3: Encoding categorical variables

### Part 1: Label Encoding

Label Encoding is a data preprocessing technique that converts categorical variables into numerical labels. It is particularly useful when working with algorithms that require numeric inputs.

### 1.1 Understanding Label Encoding

Label Encoding assigns a unique integer value to each category in a categorical variable, allowing algorithms to work with the numerical representations. It is best suited to categorical variables with an inherent ordinal relationship, such as "low," "medium," and "high," provided the integers are assigned in that same order.

The key idea behind Label Encoding is to transform categorical variables into a numerical format that algorithms can process. It does not introduce any additional columns or dimensions like One-Hot Encoding. Instead, it replaces each category with a unique integer label.

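Label encoding can be applied to both ordinal data (data with an inherent order) and nominal data (data without any inherent order), but it is a natural fit only for ordinal data, where the integer codes can mirror the order of the categories.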
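To make the point about ordinal data concrete, the following sketch compares two ways of encoding the levels 'low', 'medium', and 'high'. It assumes Scikit-Learn's OrdinalEncoder, which is not otherwise used in this section, as one way to supply an explicit category order; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Sample ordinal data (illustrative values)
levels = ['low', 'high', 'medium', 'low']

# LabelEncoder sorts the categories alphabetically: high=0, low=1, medium=2
alphabetical = LabelEncoder().fit_transform(levels)

# OrdinalEncoder expects a 2D array and accepts an explicit category order,
# so the codes can follow the real ranking: low=0, medium=1, high=2
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
ordered = ordinal_encoder.fit_transform(np.array(levels).reshape(-1, 1)).ravel()

print("Alphabetical codes:", alphabetical)
print("Ordered codes:", ordered)
```

With alphabetical codes, 'high' receives the smallest integer, so the numeric order no longer matches the meaning of the categories. Supplying the order explicitly avoids this.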

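For nominal data, label encoding introduces an artificial order (for example, 'blue' < 'green' < 'red'), which algorithms that treat inputs as magnitudes, such as linear models or distance-based methods, may misinterpret. In such cases, one-hot encoding (dummy variables) is usually the preferred method.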
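As a rough illustration of the one-hot alternative for nominal data, the sketch below uses pandas' get_dummies function. The column name 'color' and the choice of pandas are assumptions made for this example only.

```python
import pandas as pd

# Sample nominal data: the colors have no inherent order
colors = pd.Series(['red', 'blue', 'green', 'green', 'red', 'blue'], name='color')

# One column per category; each row has a single 1 marking its category
one_hot = pd.get_dummies(colors, prefix='color', dtype=int)

print(one_hot)
```

Each category gets its own 0/1 column, so no category is numerically larger than another. The cost is one extra column per category.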

### 1.2 Using Label Encoding

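To apply Label Encoding, we need a dataset with categorical variables. The encoding process maps each category in a categorical variable to a unique integer label. How the integers are assigned depends on the tool: Scikit-Learn's LabelEncoder sorts the categories (alphabetically for strings), while other tools, such as pandas' factorize, follow the order of appearance in the dataset. Note that LabelEncoder is designed for encoding target labels (y) and works on a single one-dimensional sequence at a time.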
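The two mapping strategies mentioned above can be compared directly. This is a minimal sketch: it reads the mapping learned by LabelEncoder from its classes_ attribute and uses pandas' factorize as an assumed example of order-of-appearance encoding.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = ['red', 'blue', 'green', 'green', 'red', 'blue']

# LabelEncoder stores the sorted categories in classes_;
# the position of a category in classes_ is its integer label
encoder = LabelEncoder().fit(data)
sorted_mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}

# pandas.factorize assigns integers in order of first appearance
codes, uniques = pd.factorize(pd.Series(data))
appearance_mapping = {str(label): int(code) for code, label in enumerate(uniques)}

print("Sorted order:       ", sorted_mapping)
print("Order of appearance:", appearance_mapping)
```

The same category can therefore receive different integers depending on the tool, which is worth checking before comparing encoded datasets.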

Here's an example of how to use it:

In [None]:
from sklearn.preprocessing import LabelEncoder

# Sample categorical data
data = ['red', 'blue', 'green', 'green', 'red', 'blue']

# Create the LabelEncoder object
encoder = LabelEncoder()
# Fit and transform the data using LabelEncoder
encoded_data = encoder.fit_transform(data)

print("Original Data:", data)
print("Encoded Data:", encoded_data)

In this example, we have a list of colors ('red', 'blue', and 'green'). LabelEncoder sorts the categories alphabetically before assigning integers, so 'blue' becomes 0, 'green' becomes 1, and 'red' becomes 2.

### 1.3 Inverse Transformation

In some cases, you may need to convert the encoded labels back to their original categorical form. Scikit-Learn's LabelEncoder provides the inverse_transform method for reversing the encoding process and obtaining the original categories from the encoded labels.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Sample categorical data
data = ['red', 'blue', 'green', 'green', 'red', 'blue']
print("Original Data:", data)

# Create the LabelEncoder object
encoder = LabelEncoder()
# Fit and transform the data using LabelEncoder
encoded_data = encoder.fit_transform(data)
# Print the encoded data
print("Encoded Data:", encoded_data)

# Inverse transform the encoded data back to original categorical values
decoded_data = encoder.inverse_transform(encoded_data)
# Print the decoded data
print("Decoded Data:", decoded_data)

This example encodes the same list of colors as before ('blue' becomes 0, 'green' becomes 1, and 'red' becomes 2) and then passes the encoded labels to the inverse_transform method of LabelEncoder. As the output shows, the decoded data contains the same values, in the same order, as the original data before encoding. Note that the result is returned as a NumPy array rather than a Python list.
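One practical caveat is that a fitted encoder only knows the categories it saw during fitting. The sketch below, using the made-up unseen value 'purple', shows that transform raises a ValueError for unknown categories; filtering them out first is just one possible way to handle this, not a recommended default.

```python
from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the known categories
encoder = LabelEncoder()
encoder.fit(['red', 'blue', 'green'])

# New data containing a category the encoder has never seen
new_data = ['green', 'purple']

try:
    encoder.transform(new_data)
    unseen_error = None
except ValueError as error:
    unseen_error = error
    print("Could not encode:", error)

# Keep only the values the encoder knows about before transforming
known_classes = set(encoder.classes_)
known = [value for value in new_data if value in known_classes]
print("Encodable values:", known, "->", encoder.transform(known))
```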

### 1.4 Conclusion

Label Encoding is a data preprocessing technique used to convert categorical variables into numerical labels. It assigns a unique integer value to each category, allowing algorithms to work with the numerical representations without adding new columns. Scikit-Learn provides the LabelEncoder class for performing Label Encoding and for reversing it with inverse_transform. To use Label Encoding effectively in practice, check whether a variable is ordinal or nominal and be aware of how the integers are assigned, since an implied order can mislead some algorithms.