##  Overview of One-Hot Encoding Task

###  **Goal**
The main goal of this task is to **convert categorical (text-based) columns** in the dataset into **numeric form** using One-Hot Encoding, so that machine learning models can process the data effectively.

---

###  **Dataset Used**
| buyer | fruits | gender | value |
|:------|:--------|:--------|:------|
| Hari | apple | male | 2 |
| Sri | mango | male | 3 |
| Samyu | orange | female | 4 |
| Manu | banana | male | 4 |

- **Categorical columns:** `buyer`, `fruits`, `gender`  
- **Numeric column:** `value`

---

###  **Why Encoding is Needed**
Most machine learning models operate on numbers and cannot interpret text values like "apple" or "male" directly.  
Hence, we convert these categorical values into **binary numeric columns (0/1)** using **One-Hot Encoding**.
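
To make the idea concrete, here is a minimal hand-rolled sketch of one-hot encoding for the `fruits` values using only NumPy. It is an illustration of the concept, not the method used in the rest of this notebook; the variable names `categories` and `one_hot` are chosen here for the example.

```python
import numpy as np

# The 'fruits' values from the dataset above
fruits = np.array(['apple', 'mango', 'orange', 'banana'])

# Unique categories in sorted order: apple, banana, mango, orange
categories = np.unique(fruits)

# Compare every row against every category; True/False becomes 1/0
one_hot = (fruits[:, None] == categories[None, :]).astype(int)

print(categories)
print(one_hot)
```

Each row contains exactly one 1, in the column of the category that row belongs to.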

---



In [4]:
# Create a DataFrame with the given data
import pandas as pd

# Creating a dictionary with data
data = {
    'buyer': ['Hari', 'Sri', 'Samyu', 'Manu'],
    'fruits': ['apple', 'mango', 'orange', 'banana'],
    'gender': ['male', 'male', 'female', 'male'],
    'value': [2, 3, 4, 4]
}

# Converting dictionary into a pandas DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
df


| | buyer | fruits | gender | value |
|:--|:------|:--------|:--------|:------|
| 0 | Hari | apple | male | 2 |
| 1 | Sri | mango | male | 3 |
| 2 | Samyu | orange | female | 4 |
| 3 | Manu | banana | male | 4 |


### **Techniques Used**

#### 1. Using `pandas.get_dummies()`
- Simple method in pandas for encoding multiple categorical columns.

### pd.get_dummies(data=df, columns=['buyer', 'fruits', 'gender'], dtype=int)

- Converts the categorical columns ['buyer', 'fruits', 'gender'] into dummy (one-hot encoded) numeric columns.  
- Each unique category in these columns becomes a new column with 0s and 1s as values.  
- The parameter `dtype=int` ensures all dummy values are stored as integers instead of boolean (True/False).  
- Does not drop any category, so it creates dummy columns for every unique value in each column.  
- Useful for data analysis and visualization to represent categorical data in numerical form.  
- Example:  
  If `gender` = ['male', 'female'], two columns are created: `gender_female` and `gender_male`.  
  Each will have 1 if that row belongs to the respective category, otherwise 0.
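
The points above can be checked with a short, self-contained sketch. It rebuilds the same DataFrame so it runs on its own; the name `encoded` is introduced here only for illustration.

```python
import pandas as pd

# Same data as the DataFrame created earlier
df = pd.DataFrame({
    'buyer': ['Hari', 'Sri', 'Samyu', 'Manu'],
    'fruits': ['apple', 'mango', 'orange', 'banana'],
    'gender': ['male', 'male', 'female', 'male'],
    'value': [2, 3, 4, 4],
})

# One dummy column per unique category; 'value' is left untouched
encoded = pd.get_dummies(data=df, columns=['buyer', 'fruits', 'gender'], dtype=int)

print(encoded.shape)
print(encoded.columns.tolist())
```

With 4 buyers, 4 fruits, and 2 genders, the result should have 10 dummy columns plus the original `value` column.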



In [13]:
#pd.get_dummies(data=df,columns=['buyer','fruits','gender'],dtype=int)

### pd.get_dummies(data=df, columns=['buyer', 'fruits', 'gender'], dtype=int, drop_first=True)

- Converts the categorical columns ['buyer', 'fruits', 'gender'] into numeric dummy variables (0s and 1s).  
- Each unique category becomes a new column, but the first category of each column (in sorted order) is dropped.  
- The parameter `drop_first=True` helps avoid the dummy variable trap (redundant columns causing multicollinearity). Note that `drop_first` is a boolean flag, unlike `drop='first'` in scikit-learn's `OneHotEncoder`.  
- The `dtype=int` ensures that the dummy values are stored as integers (0 or 1).  
- For example:  
  If `gender` = ['male', 'female'], only one column `gender_male` is created, because `gender_female` comes first alphabetically and is dropped.  
  `gender_male` = 1 for male and 0 for female.  
- Commonly used when preparing data for machine learning models to ensure efficient and non-redundant encoding.
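
As a quick check of which columns `drop_first=True` removes, the following sketch compares the full and reduced encodings. The names `cat_cols`, `full`, `reduced`, and `dropped` are illustrative and not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'buyer': ['Hari', 'Sri', 'Samyu', 'Manu'],
    'fruits': ['apple', 'mango', 'orange', 'banana'],
    'gender': ['male', 'male', 'female', 'male'],
    'value': [2, 3, 4, 4],
})
cat_cols = ['buyer', 'fruits', 'gender']

# Encoding with every category kept, and with the first category dropped
full = pd.get_dummies(df, columns=cat_cols, dtype=int)
reduced = pd.get_dummies(df, columns=cat_cols, dtype=int, drop_first=True)

# Columns present in the full encoding but missing from the reduced one
dropped = sorted(set(full.columns) - set(reduced.columns))
print(dropped)
print(reduced.shape)
```

The dropped columns are the alphabetically first category of each feature, so a row of all zeros in a feature's remaining columns identifies the dropped category.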


In [14]:
#pd.get_dummies(data=df,columns=['buyer','fruits','gender'],dtype=int,drop_first=True)

#### 2. Using `sklearn.preprocessing.OneHotEncoder`

In [15]:
# Importing OneHotEncoder from sklearn
from sklearn.preprocessing import OneHotEncoder

# Creating an object of OneHotEncoder
# drop='first' → drops the first category of each feature to avoid the dummy variable trap
# dtype=int → ensures the encoded output values are integers (0s and 1s)
# sparse_output=False → returns a dense array instead of a sparse matrix (easier to view as a DataFrame)
# Note: sparse_output requires scikit-learn 1.2 or newer (older versions use sparse=False)
ohe = OneHotEncoder(drop='first', dtype=int, sparse_output=False)


In [16]:
# Applying OneHotEncoder to the selected categorical columns
# fit_transform() → first learns the unique categories (fit) and then converts them into encoded form (transform)
# df[['buyer','fruits','gender']] ‚Üí selects only the categorical columns to be encoded
# The result is a NumPy array containing 0s and 1s representing each category (after dropping the first category)
One_Hot_Encode_scaled = ohe.fit_transform(df[['buyer', 'fruits', 'gender']])


In [17]:
One_Hot_Encode_scaled

array([[0, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 1],
       [0, 1, 0, 0, 0, 1, 0],
       [1, 0, 0, 1, 0, 0, 1]])

In [18]:
# Getting the feature (column) names created by OneHotEncoder
# get_feature_names_out() ‚Üí returns the names of all encoded columns after transformation
# These names correspond to each category (except the dropped ones)
cols = ohe.get_feature_names_out()

# Display the list of new encoded column names
cols


array(['buyer_Manu', 'buyer_Samyu', 'buyer_Sri', 'fruits_banana',
       'fruits_mango', 'fruits_orange', 'gender_male'], dtype=object)

###  Conclusion:
- The categorical columns **'buyer'**, **'fruits'**, and **'gender'** have been successfully converted into numeric form using **OneHotEncoder**.  
- Since we used `drop='first'`, the first category of each column in sorted order (`Hari`, `apple`, `female`) was dropped to prevent multicollinearity (dummy variable trap).  
- The output (`One_Hot_Encode_scaled`) is a **NumPy array** of 0s and 1s representing the presence of each remaining category.  
- This encoded data can now be combined with the numerical columns (like 'value') and used as input for machine learning models.
