
**Encoding is the process of converting categorical data into a numerical format. This is done because many machine learning algorithms and statistical models require numerical inputs.**

**One-Hot Encoding:**

**Male: [1, 0]**
**Female: [0, 1]**
**With one-hot encoding, you create a separate column for each category. If it's male, you put a 1 in the "male" column and 0 in the "female" column. If it's female, you put a 1 in the "female" column and 0 in the "male" column.**

**Think of it like checkboxes: in one-hot encoding, you check only the box that represents the category and leave the rest unchecked. For example, if it's male, you check the "male" box and leave the "female" box unchecked. If it's female, you check the "female" box and leave the "male" box unchecked.**
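
**The checkbox idea can be sketched in a few lines of NumPy. This is only an illustration: the helper name `one_hot` and the three sample values are assumptions made for this example and are not part of the notebook.**

```python
import numpy as np

def one_hot(values, categories):
    """Return one row per value, with a 1 in the column of its category."""
    # Map each category to its column position, e.g. {"Male": 0, "Female": 1}
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1  # "check the box" for this category
    return encoded

# Column order follows the categories list: ["Male", "Female"]
result = one_hot(["Male", "Female", "Male"], ["Male", "Female"])
print(result)
```

**Each row contains exactly one 1, in the column that belongs to its category.**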

**/*One Hot Encoding*/**

In [22]:
import pandas as pd
import numpy as np
dataset=pd.read_csv("Mall_Customers.csv")

In [23]:
dataset

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40
...,...,...,...,...,...
195,196,Female,35,120,79
196,197,Female,45,126,28
197,198,Male,32,126,74
198,199,Male,32,137,18


In [24]:
for i in dataset:
    print(i)

CustomerID
Gender
Age
Annual Income (k$)
Spending Score (1-100)


In [25]:
dataset.select_dtypes(include='object').columns

Index(['Gender'], dtype='object')

In [26]:
dataset.isnull().sum().sum()

0

**Assign the column we want to one-hot encode to a variable**

In [27]:
en_data=dataset[["Gender"]]

**get_dummies performs one-hot encoding, but here it returns the result as bool values, so an extra step is needed to convert it to 0/1 form. Alternatively, we can use OneHotEncoder from sklearn.preprocessing, which returns numeric 0/1 values directly.**

In [42]:
one_hot_encoding=pd.get_dummies(en_data)
one_hot_encoding

Unnamed: 0,Gender_Female,Gender_Male
0,False,True
1,False,True
2,True,False
3,True,False
4,True,False
...,...,...
195,True,False
196,True,False
197,False,True
198,False,True


**Here we can see that the dtype of both columns is bool**

In [29]:
pd.get_dummies(en_data).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Gender_Female  200 non-null    bool 
 1   Gender_Male    200 non-null    bool 
dtypes: bool(2)
memory usage: 532.0 bytes
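
**If integer 0/1 columns are wanted directly from get_dummies, its `dtype` parameter can request them. A minimal sketch, assuming a small hand-made sample in place of the Mall_Customers data (the variable names `sample` and `dummies_int` are illustrative):**

```python
import pandas as pd

# Hypothetical three-row sample standing in for the "Gender" column
sample = pd.DataFrame({"Gender": ["Male", "Male", "Female"]})

# dtype=int asks get_dummies for integer 0/1 columns instead of bool
dummies_int = pd.get_dummies(sample, dtype=int)
print(dummies_int)
print(dummies_int.dtypes)
```

**An existing bool result can also be converted afterwards with `.astype(int)`.**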


In [30]:
dataset

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40
...,...,...,...,...,...
195,196,Female,35,120,79
196,197,Female,45,126,28
197,198,Male,32,126,74
198,199,Male,32,137,18


**To apply the changes to the original dataset**

In [31]:

# Drop the original "Gender" column from the dataset
dataset = dataset.drop(columns=["Gender"])

# Join the one-hot encoded columns back onto the dataset (rows align on the index)
dataset = dataset.join(one_hot_encoding)
dataset

Unnamed: 0,CustomerID,Age,Annual Income (k$),Spending Score (1-100),Gender_Female,Gender_Male
0,1,19,15,39,False,True
1,2,21,15,81,False,True
2,3,20,16,6,True,False
3,4,23,16,77,True,False
4,5,31,17,40,True,False
...,...,...,...,...,...,...
195,196,35,120,79,True,False
196,197,45,126,28,True,False
197,198,32,126,74,False,True
198,199,32,137,18,False,True
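
**As a possible shortcut, get_dummies can be given the whole DataFrame together with the `columns` argument, in which case it drops the listed column and appends the encoded columns in one call. A sketch on a three-row sample whose values are copied from the first rows shown above (the names `sample` and `encoded` are illustrative):**

```python
import pandas as pd

# First three rows of the dataset, reduced to three columns for brevity
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Gender": ["Male", "Male", "Female"],
    "Age": [19, 21, 20],
})

# Encode only "Gender"; the other columns are passed through unchanged
encoded = pd.get_dummies(sample, columns=["Gender"])
print(encoded)
```

**The original "Gender" column is replaced by Gender_Female and Gender_Male, which are placed after the untouched columns.**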


**Now we will do one-hot encoding with another method, which returns the result as numeric 0/1 values**

In [32]:
dataset=pd.read_csv("Mall_Customers.csv")
dataset

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40
...,...,...,...,...,...
195,196,Female,35,120,79
196,197,Female,45,126,28
197,198,Male,32,126,74
198,199,Male,32,137,18


**The OneHotEncoder class from sklearn.preprocessing returns the result as numeric 0/1 values**

In [33]:
from sklearn.preprocessing import OneHotEncoder


**Create an encoder and assign it to a variable**

In [34]:
ohe=OneHotEncoder()

**Apply to column**

**fit learns the categories present in the column, and transform converts the data into one-hot form using those categories. fit_transform does both in one call.**

In [35]:
ohe.fit_transform(en_data)

<200x2 sparse matrix of type '<class 'numpy.float64'>'
	with 200 stored elements in Compressed Sparse Row format>
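
**To make the difference between fit and transform concrete, the two steps can also be called separately. A minimal sketch, assuming two small hand-made DataFrames (`train` and `new_rows` are illustrative names, not part of the notebook):**

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one frame to learn from, one to encode later
train = pd.DataFrame({"Gender": ["Male", "Female", "Female"]})
new_rows = pd.DataFrame({"Gender": ["Female", "Male"]})

encoder = OneHotEncoder()
encoder.fit(train)          # learns the categories, sorted alphabetically
print(encoder.categories_)  # the learned categories, one array per column

# transform reuses the learned categories; it does not learn anything new
new_encoded = encoder.transform(new_rows).toarray()
print(new_encoded)
```

**The column order of the output follows `categories_`, so "Female" is the first column and "Male" the second.**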

**To convert the sparse result into a dense array**

In [36]:
ohe.fit_transform(en_data).toarray()

array([[0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       ...

**Assign the array to a variable so that we can use it in a DataFrame**

In [37]:
ar=ohe.fit_transform(en_data).toarray()

In [38]:
pd.DataFrame(ar,columns=["Gender_Female","Gender_Male"])

Unnamed: 0,Gender_Female,Gender_Male
0,0.0,1.0
1,0.0,1.0
2,1.0,0.0
3,1.0,0.0
4,1.0,0.0
...,...,...
195,1.0,0.0
196,1.0,0.0
197,0.0,1.0
198,0.0,1.0


**Store the encoded DataFrame in a variable**

In [39]:
change=pd.DataFrame(ar,columns=["Gender_Female","Gender_Male"])

**To apply the changes to the original dataset**

In [40]:

# Drop the original "Gender" column from the dataset
dataset = dataset.drop(columns=["Gender"])

# Join the one-hot encoded columns back onto the dataset (rows align on the index)
dataset = dataset.join(change)
dataset

Unnamed: 0,CustomerID,Age,Annual Income (k$),Spending Score (1-100),Gender_Female,Gender_Male
0,1,19,15,39,0.0,1.0
1,2,21,15,81,0.0,1.0
2,3,20,16,6,1.0,0.0
3,4,23,16,77,1.0,0.0
4,5,31,17,40,1.0,0.0
...,...,...,...,...,...,...
195,196,35,120,79,1.0,0.0
196,197,45,126,28,1.0,0.0
197,198,32,126,74,0.0,1.0
198,199,32,137,18,0.0,1.0


**If you have a lot of categories, one-hot encoding can create a large number of new columns, making your dataset very wide and potentially difficult to work with. With drop="first", the encoder drops the column for the first category of each encoded feature. That category is still represented (all remaining columns are 0), so no information is lost, and the dataset has one fewer redundant column per encoded feature.**

In [43]:
ohe=OneHotEncoder(drop="first")

**fit learns the categories and transform applies the encoding. We can call them separately or together with fit_transform; we call them separately when deploying a model, so that new data is transformed with the categories learned during training.**

In [44]:
ar=ohe.fit_transform(en_data).toarray()

In [47]:
# drop="first" removes the first category ("Female"), so the remaining column is Gender_Male
pd.DataFrame(ar,columns=["Gender_Male"])

Unnamed: 0,Gender_Male
0,1.0
1,1.0
2,0.0
3,0.0
4,0.0
...,...
195,0.0
196,0.0
197,1.0
198,1.0
