Feature Encoding Techniques (Label Encoding, One-Hot Encoding)

Why Feature Encoding is Needed

   - Most machine learning algorithms operate on numbers, not text. If a dataset has categorical features (like “Color = Red, Blue, Green” or “Gender = Male/Female”), they must be converted into numeric values first. This process is called feature encoding.

1. Label Encoding

Definition:
Label encoding converts each unique category into a unique integer.

How it works: Suppose we have a column “Color”:

Color
Red
Blue
Green
Blue

After Label Encoding (integers assigned in alphabetical order of the category names, so Blue = 0, Green = 1, Red = 2):

     Color   Encoded
     Red     2
     Blue    0
     Green   1
     Blue    0
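The mapping in the table can be reproduced with a few lines of plain Python. This is a minimal sketch of the idea, not scikit-learn's implementation; it assumes the categories are numbered in alphabetical order, and the variable names are illustrative.

```python
# Minimal sketch of label encoding in plain Python (illustrative only).
colors = ["Red", "Blue", "Green", "Blue"]

# Sort the unique categories so the mapping is deterministic (alphabetical).
categories = sorted(set(colors))  # ['Blue', 'Green', 'Red']

# Assign each category its position in the sorted list.
mapping = {category: index for index, category in enumerate(categories)}

# Replace every value with its integer code.
encoded = [mapping[color] for color in colors]

print(mapping)
print(encoded)
```

Keeping the `mapping` dictionary around makes it possible to encode new rows the same way, or to translate codes back into category names.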


Pros:

 - Simple and fast.
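 - Works well for ordinal data (where order matters, e.g., “Low, Medium, High”), provided the integers are assigned in the categories' natural order. An encoder that numbers categories alphabetically, as scikit-learn's LabelEncoder does, would give High = 0, Low = 1, Medium = 2 and lose that order.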

Cons:

 - Can mislead algorithms if categories are non-ordinal, because numbers imply order and distance. For example, a linear or distance-based model may read 0 < 1 < 2 as “Blue < Green < Red”, although colors have no such ranking.

2. One-Hot Encoding

Definition:
One-hot encoding creates a binary column for each category. A value of 1 indicates the presence of that category, and 0 otherwise.

Example: Using the same “Color” column:

Color
Red
Blue
Green
Blue

After One-Hot Encoding:

     Red   Blue   Green
     1     0      0
     0     1      0
     0     0      1
     0     1      0

Pros:

 - Avoids misleading algorithms, since no order is implied between categories.
 - Works well for nominal data (no inherent order, like color, city, brand).

Cons:

 - Adds one column per unique value, so a feature with many unique values (high cardinality) can greatly enlarge the feature space.
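To get a feel for the high-cardinality problem, the sketch below one-hot encodes a made-up column with 500 distinct values. The `user_id` column and its contents are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical data: 1,000 rows containing 500 distinct user IDs.
df = pd.DataFrame({"user_id": [f"user_{i % 500}" for i in range(1000)]})

# One-hot encode the single column.
encoded = pd.get_dummies(df)

print(df["user_id"].nunique())  # number of unique categories
print(encoded.shape)            # (rows, one column per category)
```

A single input column becomes 500 columns here, which is why very high-cardinality features are usually handled with care before one-hot encoding.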

In [2]:
%pip install scikit-learn

In [None]:
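# Label encoding

from sklearn.preprocessing import LabelEncoder

data = ['Low', 'Medium', 'High', 'Medium']

# LabelEncoder sorts the classes alphabetically, so High=0, Low=1, Medium=2;
# the natural order Low < Medium < High is not preserved.
le = LabelEncoder()
encoded = le.fit_transform(data)

print(encoded)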


[1 2 0 2]
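The output above follows alphabetical order (High = 0, Low = 1, Medium = 2), so the ranking Low < Medium < High is lost. One possible way to keep it, sketched here under the assumption that the intended order is known in advance, is scikit-learn's `OrdinalEncoder` with an explicit `categories` list. Note that it expects 2-D input and returns floats.

```python
from sklearn.preprocessing import OrdinalEncoder

# 2-D input: one column, four rows.
data = [["Low"], ["Medium"], ["High"], ["Medium"]]

# Pass the intended order explicitly so that Low < Medium < High is preserved.
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
encoded = encoder.fit_transform(data)

print(encoded.ravel())
```

With this ordering, Low maps to 0, Medium to 1, and High to 2. A plain dictionary such as `{"Low": 0, "Medium": 1, "High": 2}` would serve equally well for a single column.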


In [6]:
# One-hot encoding

import pandas as pd

data = pd.DataFrame({'city': ['Pune', 'Mumbai', 'Delhi', 'Pune']})

# get_dummies creates one boolean column per city, named <column>_<value>
encoded = pd.get_dummies(data)
print(encoded)

   city_Delhi  city_Mumbai  city_Pune
0       False        False       True
1       False         True      False
2        True        False      False
3       False        False       True
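`pd.get_dummies` only knows the categories present in the data it is given. When the same encoding has to be reapplied to new data later, scikit-learn's `OneHotEncoder` is one alternative, because it remembers the categories seen during `fit`. The sketch below is illustrative; "Chennai" is a hypothetical city that does not appear in the original data.

```python
from sklearn.preprocessing import OneHotEncoder

# 2-D input: one column, four rows (same cities as in the cell above).
train = [["Pune"], ["Mumbai"], ["Delhi"], ["Pune"]]

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train).toarray()  # sparse matrix -> dense array

# "Chennai" is a hypothetical city that was not present during fit.
new_encoded = encoder.transform([["Delhi"], ["Chennai"]]).toarray()

print(encoder.categories_)
print(train_encoded)
print(new_encoded)
```

The columns come out in sorted order (Delhi, Mumbai, Pune), matching the `get_dummies` output, and the unseen city becomes a row of zeros.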
