# Module 1: Introduction to Scikit-Learn

## Section 2: Exploratory Data Analysis (EDA) and Data Preprocessing

### Part 3: Dealing with Categorical Variables

In this part, we will explore techniques for handling categorical variables in our dataset. Categorical variables take their values from a limited set of categories or groups and are usually stored as strings. Most machine learning algorithms expect numeric input, so these variables must be encoded during the data preprocessing phase.

### 3.1 Understanding Categorical Variables

Categorical variables can be divided into two types: nominal and ordinal.

- Nominal variables do not have an inherent order or ranking. Examples include colors, countries, or types of objects.
- Ordinal variables have a natural ordering or ranking. Examples include education levels (e.g., high school, bachelor's, master's) or survey responses (e.g., strongly agree, agree, neutral, disagree, strongly disagree).

### 3.2 One-Hot Encoding

One-Hot Encoding transforms a categorical variable into a numerical format that machine learning algorithms can consume. Each category becomes its own binary column, where a value of 1 indicates the presence of that category and 0 indicates its absence. Because no column is ranked above another, this encoding suits nominal variables.

Scikit-Learn provides the OneHotEncoder class to perform one-hot encoding. Here's an example of how to use it:

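```python
from sklearn.preprocessing import OneHotEncoder

# Example data: OneHotEncoder expects 2D input of shape (n_samples, n_features)
categorical_feature = [["red"], ["green"], ["blue"], ["green"]]

# Create an instance of the OneHotEncoder
encoder = OneHotEncoder()

# Fit and transform the categorical feature (the result is a sparse matrix by default)
encoded_features = encoder.fit_transform(categorical_feature)

# Convert to a dense array to inspect the binary columns
print(encoded_features.toarray())
```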

### 3.3 Label Encoding

Label Encoding is another technique used to convert categorical variables into numerical format. It assigns a unique integer to each category in the variable. However, the integers imply an ordering (for example, that category 2 is "greater than" category 0), which can mislead algorithms that treat the values as magnitudes.

Scikit-Learn provides the LabelEncoder class to perform label encoding. It is designed for encoding the target variable (`y`) and expects one-dimensional input. Here's an example:

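```python
from sklearn.preprocessing import LabelEncoder

# Example data: LabelEncoder expects a 1D sequence of labels
categorical_feature = ["cat", "dog", "bird", "dog"]

# Create an instance of the LabelEncoder
encoder = LabelEncoder()

# Fit and transform the labels; classes are sorted, so bird=0, cat=1, dog=2
encoded_feature = encoder.fit_transform(categorical_feature)
```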
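To illustrate the ordering issue mentioned above, here is a small sketch using made-up size labels (the `sizes` list is an assumption for illustration only). LabelEncoder sorts the categories before numbering them, so the resulting codes follow alphabetical order rather than the natural ranking small < medium < large.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical ordinal labels whose natural order is small < medium < large
sizes = ["small", "medium", "large", "medium"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sizes)

# classes_ holds the sorted categories; each category's index is its code
classes = list(encoder.classes_)
print(classes)      # ['large', 'medium', 'small']
print(list(codes))  # [2, 1, 0, 1]

# inverse_transform maps the integer codes back to the original labels
restored = list(encoder.inverse_transform(codes))
print(restored)
```

Here "large" receives the smallest code and "small" the largest, which is the reverse of the intended ranking.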

### 3.4 Ordinal Encoding

Ordinal Encoding is used when dealing with ordinal categorical variables. It assigns an integer to each category based on its position in the ranking, which preserves the ordinal relationship between the categories.

Scikit-Learn provides the OrdinalEncoder class for this purpose. By default it numbers the categories in sorted order (alphabetical for strings), so when the natural ranking differs from the sorted order, pass the intended order explicitly through the `categories` parameter.
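As a minimal sketch of ordinal encoding with an explicit ranking, the example below uses `OrdinalEncoder` from `sklearn.preprocessing`. The education data and the `education_order` list are made up for illustration.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical education levels, listed from lowest to highest rank
education_order = ["high school", "bachelor's", "master's"]

# One column of example data, shape (n_samples, 1)
education = [["bachelor's"], ["high school"], ["master's"], ["bachelor's"]]

# Pass one list of ordered categories per input column
encoder = OrdinalEncoder(categories=[education_order])
encoded_education = encoder.fit_transform(education)

print(encoded_education.ravel())  # [1. 0. 2. 1.]
```

Without the `categories` argument, the encoder would sort the strings and place "bachelor's" before "high school", which does not match the intended ranking.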

### 3.5 Handling High Cardinality Categorical Variables

High cardinality categorical variables are variables with a large number of unique categories. One-hot encoding them creates one column per category, which can greatly increase feature dimensionality. In such cases, it is common to use feature hashing, which maps categories into a fixed number of columns, or learned embeddings to represent these variables more compactly.
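Feature hashing can be sketched with Scikit-Learn's `FeatureHasher`. The city names, the `city=` prefix, and the choice of 8 output columns are assumptions for illustration; in practice `n_features` is usually much larger to reduce hash collisions.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical high-cardinality column (e.g. city names)
cities = ["London", "Paris", "Tokyo", "Lagos", "Lima", "Paris"]

# With input_type="string", each sample is an iterable of string tokens.
# Prefixing the column name keeps tokens from different columns distinct.
samples = [[f"city={city}"] for city in cities]

# The output width is fixed up front, regardless of how many categories exist
hasher = FeatureHasher(n_features=8, input_type="string")

# FeatureHasher is stateless, so transform can be called without fitting
hashed = hasher.transform(samples)
dense = hashed.toarray()

print(dense.shape)  # (6, 8)
```

Identical categories always hash to the same column, but distinct categories may collide, so the encoding trades some information for a bounded feature count.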

### 3.6 Dealing with Unknown or Unseen Categories

Sometimes, categories appear in the test or production data that were not present in the training data. By default, OneHotEncoder raises an error when it encounters such a category during `transform`, so it is important to have a strategy in place. You can ignore the new categories (for example, with `handle_unknown="ignore"`, which encodes them as all zeros), treat them as a separate "unknown" category, or use a technique like feature hashing that does not depend on a fixed list of known categories.
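The "ignore" strategy can be sketched with the `handle_unknown` parameter of `OneHotEncoder`. The color data below is made up for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test data; "purple" never appears in training
train_colors = [["red"], ["green"], ["blue"]]
test_colors = [["green"], ["purple"]]

# handle_unknown="ignore" encodes unseen categories as a row of zeros
# instead of raising an error at transform time
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_colors)

encoded_test = encoder.transform(test_colors).toarray()

print(encoder.categories_)  # learned categories, sorted: blue, green, red
print(encoded_test)
```

The row for "green" has a 1 in the green column, while the row for "purple" is all zeros, so the model sees it as belonging to none of the known categories.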

### 3.7 Summary

Dealing with categorical variables is an important step in data preprocessing. One-Hot Encoding, Label Encoding, and Ordinal Encoding are common techniques for converting categorical variables into numerical format. Choose the encoding method based on the nature of the variable (nominal or ordinal, low or high cardinality) and the requirements of the machine learning algorithm.

In the next part, we will explore feature scaling and normalization techniques to prepare our data for modeling.

Feel free to practice handling categorical variables using the techniques discussed in this section. It is important to adapt the strategies to your specific dataset and problem domain.