## **Encoding Techniques in Machine Learning**

In machine learning, most algorithms require numeric input. However, categorical data, like city names or colors, is often represented as text or labels. **Encoding techniques** are used to transform these categorical features into a format that can be provided to machine learning algorithms.

### 1. **Label Encoding**
Label Encoding assigns a unique integer to each category, replacing the categorical values with numeric labels. It suits categorical data that has a natural order or ranking. Applied to unordered categories, however, it can create unintended ordinal relationships between them.

#### How it works:
- Each unique category is assigned an integer value starting from 0.
- No new features are created, and the original categorical feature is simply replaced by integer values.

#### Example:
Imagine you have a column `Color` with three categories: "Red," "Blue," and "Green." scikit-learn’s `LabelEncoder` assigns integers in sorted (alphabetical) order, so the mapping is:

| Color  | Label Encoded |
|--------|---------------|
| Red    | 2             |
| Blue   | 0             |
| Green  | 1             |

#### Pros:
- Simple and memory-efficient as it does not increase the dimensionality.
  
#### Cons:
- Imposes an ordinal relationship between categories, which might mislead the model if there is no actual ranking.

#### Code Example:
```python
from sklearn.preprocessing import LabelEncoder

# Example data
data = ['Red', 'Blue', 'Green', 'Blue', 'Red']

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the data
encoded_data = label_encoder.fit_transform(data)

print(encoded_data)
```
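To check which integer each category received, the fitted encoder can be inspected. The following is a minimal sketch, assuming scikit-learn's `LabelEncoder`; the variable names (`colors`, `mapping`, `restored`) are illustrative and not part of the original example.

```python
from sklearn.preprocessing import LabelEncoder

# Same illustrative data as above
colors = ['Red', 'Blue', 'Green', 'Blue', 'Red']

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the unique categories in sorted order;
# the position of a category in classes_ is its integer label
mapping = {str(category): index for index, category in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps the integers back to the original categories
restored = [str(value) for value in encoder.inverse_transform(codes)]
print(restored)
```

Because the integers follow sorted order rather than any meaningful ranking, it is worth inspecting `classes_` before treating the codes as ordered.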

### 2. **One-Hot Encoding**
One-Hot Encoding transforms categorical variables into multiple binary columns, where each column represents a unique category. If a category is present in a row, the corresponding column gets a value of 1, and all other columns get a value of 0.

#### How it works:
- For each unique category, a new column is created.
- Each column represents one of the categories, and it has binary values (0 or 1).

#### Example:
Using the same `Color` column with values "Red," "Blue," and "Green":

| Color  | Red | Blue | Green |
|--------|-----|------|-------|
| Red    |  1  |   0  |   0   |
| Blue   |  0  |   1  |   0   |
| Green  |  0  |   0  |   1   |

#### Pros:
- Avoids introducing ordinal relationships, making it better suited for nominal categorical features.
  
#### Cons:
- Can lead to **high dimensionality** when a feature has many unique categories, which aggravates the curse of dimensionality.
- Memory-inefficient for features with many categories, since each category adds a column that is mostly zeros.

#### Code Example:
```python
import pandas as pd

# Example data
data = {'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)

# Perform One-Hot Encoding
one_hot_encoded_data = pd.get_dummies(df['Color'])

print(one_hot_encoded_data)
```
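The dimensionality concern listed under Cons can be made concrete. The sketch below uses a hypothetical `City` column with 500 distinct values (invented for illustration) and compares a dense one-hot result with pandas' `sparse=True` option, which stores only the non-zero entries.

```python
import pandas as pd

# Hypothetical high-cardinality feature: 500 rows, each a distinct category
cities = pd.DataFrame({'City': [f'city_{i}' for i in range(500)]})

# Dense one-hot encoding creates one column per unique category
dense = pd.get_dummies(cities['City'])
print(dense.shape)

# sparse=True keeps the same columns but stores only the non-zero entries
sparse = pd.get_dummies(cities['City'], sparse=True)

dense_bytes = dense.memory_usage().sum()
sparse_bytes = sparse.memory_usage().sum()
print(dense_bytes, sparse_bytes)
```

The column count is the same in both cases; the sparse representation reduces memory use, not the number of features the model sees.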

### When to Use:
- **Label Encoding**: Best suited for ordinal categorical variables whose categories have a meaningful order, for example "Low," "Medium," and "High," provided the integers are assigned in that order.
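- **One-Hot Encoding**: Ideal for **nominal** categorical variables (no natural order), such as colors, product categories, or city names. It’s commonly used with linear models and neural networks, which would otherwise read integer labels as magnitudes; tree-based models can often work with integer-coded categories directly.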
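
For ordinal data, the integers are only useful if they follow the real ranking. The sketch below is one possible approach, not the only one: it assumes scikit-learn's `OrdinalEncoder` with an explicitly supplied category order, and contrasts it with `LabelEncoder`, which sorts alphabetically. The `levels` data is made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative ordinal data
levels = ['Low', 'High', 'Medium', 'Low']

# LabelEncoder sorts alphabetically: High=0, Low=1, Medium=2
alphabetical = LabelEncoder().fit_transform(levels)
print(alphabetical)

# OrdinalEncoder accepts an explicit order per feature: Low=0, Medium=1, High=2.
# It expects 2-D input (rows x features), hence the single-element rows.
ordinal_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
ranked = ordinal_encoder.fit_transform([[value] for value in levels]).ravel()
print(ranked)
```

With the explicit order, larger integers correspond to higher levels, which is the property an ordinal encoding is meant to preserve.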

### Which Technique to Use?

- **Use Label Encoding** when the categories have an **ordinal relationship** (an inherent ranking, like `low`, `medium`, `high`).

- **Use One-Hot Encoding** when the categories are **nominal** (no order or ranking between them).

### Summary:
- **Label Encoding** is simple and works well when the categorical feature has a natural order.
- **One-Hot Encoding** is preferable for features without an inherent order, but it can increase the feature space significantly.

Each technique is useful in different situations, and the choice depends on the specific nature of the categorical data and the machine learning model being used.

In [1]:
import pandas as pd
data = {'Color':['Red', 'Blue', 'Green', 'Blue', 'Red']}
dataframe = pd.DataFrame(data)
print(dataframe)

   Color
0    Red
1   Blue
2  Green
3   Blue
4    Red


In [2]:
one_hot_encoded_df = pd.get_dummies(dataframe, columns=['Color'])
print(one_hot_encoded_df)

   Color_Blue  Color_Green  Color_Red
0       False        False       True
1        True        False      False
2       False         True      False
3        True        False      False
4       False        False       True


### Dummy Variable Trap

In [3]:
# Drop the first dummy column to remove redundant information and avoid multicollinearity
one_hot_encoded_df = pd.get_dummies(dataframe, columns=['Color'], drop_first=True)
print(one_hot_encoded_df)

   Color_Green  Color_Red
0        False       True
1        False      False
2         True      False
3        False      False
4        False       True
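
Why one column is redundant can be checked directly. The following is a small sketch using the same `Color` data; `dtype=int` is passed to `pd.get_dummies` only to make the arithmetic explicit, and the names `full`, `reduced`, and `recovered_blue` are illustrative.

```python
import pandas as pd

dataframe = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Full one-hot encoding: exactly one column is 1 in every row
full = pd.get_dummies(dataframe, columns=['Color'], dtype=int)
row_sums = full.sum(axis=1)
print(row_sums.tolist())

# With drop_first=True the first category (Blue) is dropped...
reduced = pd.get_dummies(dataframe, columns=['Color'], drop_first=True, dtype=int)

# ...yet it can be recovered from the remaining columns,
# so no information is lost
recovered_blue = 1 - reduced.sum(axis=1)
print(recovered_blue.tolist())
```

Since every row of the full encoding sums to 1, the columns are linearly dependent once a model includes an intercept, which is the multicollinearity that dropping one column avoids.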


In [5]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Load the dataset
file_path = './carprices.csv'
car_data = pd.read_csv(file_path)

# Display the first few rows of the dataset to understand its structure
print("Original Data:\n", car_data.head(14))

Original Data:
                 Car Model  Mileage  Sell Price($)  Age(yrs)
0                  BMW X5    69000          18000         6
1                  BMW X5    35000          34000         3
2                  BMW X5    57000          26100         5
3                  BMW X5    22500          40000         2
4                  BMW X5    46000          31500         4
5                 Audi A5    59000          29400         5
6                 Audi A5    52000          32000         5
7                 Audi A5    72000          19300         6
8                 Audi A5    91000          12000         8
9   Mercedez Benz C class    67000          22000         6
10  Mercedez Benz C class    83000          20000         7
11  Mercedez Benz C class    79000          21000         7
12  Mercedez Benz C class    59000          33000         5


In [6]:
car_data.dtypes

Car Model        object
Mileage           int64
Sell Price($)     int64
Age(yrs)          int64
dtype: object

In [7]:
car_data.shape

(13, 4)

In [9]:
# Extract the 'Car Model' column
car_models = car_data[['Car Model']]

# Apply One-Hot Encoding
one_hot_encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = one_hot_encoder.fit_transform(car_models)

# Convert one-hot encoding result to DataFrame
one_hot_encoded_df = pd.DataFrame(one_hot_encoded, columns=one_hot_encoder.get_feature_names_out(['Car Model']))

# Combine the one-hot encoded columns with the original data
car_data_one_hot_encoded = pd.concat([car_data, one_hot_encoded_df], axis=1)

# Display the one-hot encoded data
print("\nOne-Hot Encoded Data:\n", car_data_one_hot_encoded.head(14))


One-Hot Encoded Data:
                 Car Model  Mileage  Sell Price($)  Age(yrs)  \
0                  BMW X5    69000          18000         6   
1                  BMW X5    35000          34000         3   
2                  BMW X5    57000          26100         5   
3                  BMW X5    22500          40000         2   
4                  BMW X5    46000          31500         4   
5                 Audi A5    59000          29400         5   
6                 Audi A5    52000          32000         5   
7                 Audi A5    72000          19300         6   
8                 Audi A5    91000          12000         8   
9   Mercedez Benz C class    67000          22000         6   
10  Mercedez Benz C class    83000          20000         7   
11  Mercedez Benz C class    79000          21000         7   
12  Mercedez Benz C class    59000          33000         5   

    Car Model_Audi A5  Car Model_BMW X5  Car Model_Mercedez Benz C class  
0                 

In [10]:
# Apply Label Encoding
label_encoder = LabelEncoder()
label_encoded = label_encoder.fit_transform(car_models['Car Model'])

# Add the label encoded column to the original data
car_data_label_encoded = car_data.copy()
car_data_label_encoded['Car Model (Label Encoded)'] = label_encoded

# Display the label encoded data
print("\nLabel Encoded Data:\n", car_data_label_encoded.head(14))


Label Encoded Data:
                 Car Model  Mileage  Sell Price($)  Age(yrs)  \
0                  BMW X5    69000          18000         6   
1                  BMW X5    35000          34000         3   
2                  BMW X5    57000          26100         5   
3                  BMW X5    22500          40000         2   
4                  BMW X5    46000          31500         4   
5                 Audi A5    59000          29400         5   
6                 Audi A5    52000          32000         5   
7                 Audi A5    72000          19300         6   
8                 Audi A5    91000          12000         8   
9   Mercedez Benz C class    67000          22000         6   
10  Mercedez Benz C class    83000          20000         7   
11  Mercedez Benz C class    79000          21000         7   
12  Mercedez Benz C class    59000          33000         5   

    Car Model (Label Encoded)  
0                           1  
1                           1  