# Encoding
Many datasets contain categorical features, but most machine learning algorithms require numerical input, so categorical data must be converted into a numerical format. This process is known as encoding.

Why Encode Categorical Data?
Before diving into the encoding techniques, it's important to understand why encoding is necessary:

1. Machine Learning Algorithms: Most machine learning algorithms, such as linear regression, support vector machines, and neural networks, require numerical input. Categorical data needs to be converted into a numerical format to be used effectively.
2. Model Performance: The choice of encoding affects model performance. For example, integer codes applied to nominal data imply an order and spacing between categories that do not exist, which can mislead linear and distance-based models and produce inaccurate predictions.
3. Data Preprocessing: Encoding is a crucial step in the data preprocessing pipeline, ensuring that the data is in a suitable format for training and evaluation.
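
To make the second point concrete, here is a minimal sketch comparing two encodings of a nominal feature. The integer assignment in `codes` is an arbitrary, hypothetical one chosen for illustration; the point is only that integer codes make some categories look closer together than others, while one-hot vectors keep every pair equally far apart.

```python
import numpy as np

# Hypothetical integer codes for a nominal feature (no real order exists).
codes = {"blue": 0, "green": 1, "red": 2}
gap_red_blue = abs(codes["red"] - codes["blue"])      # 2
gap_green_blue = abs(codes["green"] - codes["blue"])  # 1
print(gap_red_blue, gap_green_blue)  # unequal gaps suggest a false ordering

# One-hot vectors: each category gets its own axis.
one_hot = {
    "blue": np.array([1, 0, 0]),
    "green": np.array([0, 1, 0]),
    "red": np.array([0, 0, 1]),
}
d_red_blue = np.linalg.norm(one_hot["red"] - one_hot["blue"])
d_green_blue = np.linalg.norm(one_hot["green"] - one_hot["blue"])
print(d_red_blue == d_green_blue)  # True: all distinct pairs are equidistant
```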

Types of Categorical Data: 
Categorical data can be broadly classified into two types:

1. Nominal Data: This type of data represents categories without any inherent order. Examples include gender (male, female), color (red, blue, green), and country (USA, India, UK).
2. Ordinal Data: This type of data represents categories with a meaningful order or ranking. Examples include education level (high school, bachelor's, master's, PhD) and customer satisfaction (low, medium, high).
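
The distinction between the two types can be sketched with pandas categoricals. This is only an illustration using the example values from the list above; the variable names are hypothetical.

```python
import pandas as pd

# Nominal: categories carry no order.
color = pd.Series(pd.Categorical(["red", "blue", "green", "blue"]))

# Ordinal: declare the ranking explicitly, from lowest to highest.
satisfaction = pd.Series(
    pd.Categorical(
        ["low", "high", "medium", "low"],
        categories=["low", "medium", "high"],
        ordered=True,
    )
)

print(color.cat.ordered)         # False
print(satisfaction.cat.ordered)  # True

# The integer codes follow the declared order: low=0, medium=1, high=2.
print(satisfaction.cat.codes.tolist())  # [0, 2, 1, 0]

# Order-aware comparisons are only meaningful for ordinal data.
print((satisfaction >= "medium").tolist())  # [False, True, True, False]
```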

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("c:/users/sakshi yadav/Downloads/Iris_dataset.csv")

In [3]:
df.shape

(150, 6)

In [4]:
df.head()

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa


In [5]:
df.Species

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: Species, Length: 150, dtype: object

In [6]:
df.Species.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [7]:
new_df = pd.get_dummies(df.Species)

In [8]:
new_df

     Iris-setosa  Iris-versicolor  Iris-virginica
0           True            False           False
1           True            False           False
2           True            False           False
3           True            False           False
4           True            False           False
...          ...              ...             ...
145        False            False            True
146        False            False            True
147        False            False            True
148        False            False            True
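
As a variation on the call above, `pd.get_dummies` can also encode a column in place within a larger DataFrame and return 0/1 integers instead of booleans. The sketch below uses a hypothetical three-row frame (`toy`) rather than the Iris file, so it runs on its own.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the Iris data.
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# columns= limits encoding to the named columns;
# dtype=int yields 0/1 instead of True/False.
encoded = pd.get_dummies(toy, columns=["Species"], dtype=int)

# New columns are named "<original column>_<category>".
print(encoded.columns.tolist())
print(encoded)
```

Untouched columns such as `SepalLengthCm` are carried over as they are, which avoids a separate concatenation step.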


1. One-Hot Encoding: 
It converts categorical data into a binary matrix with one column per category. Each row has a 1 in the column of its category and 0 in every other column. Because no order is implied between the columns, this method is suitable for nominal data.

In [9]:
# One Hot Encoding
from sklearn.preprocessing import OneHotEncoder

In [10]:
# sparse_output=False returns a dense NumPy array instead of a SciPy sparse matrix
ohe = OneHotEncoder(sparse_output=False)

In [11]:
df2 = ohe.fit_transform(pd.DataFrame(df['Species']))

In [12]:
df2

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       ...,
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])

In [13]:
new_df2 = pd.DataFrame(df2, columns=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])

In [14]:
new_df2

     Iris-setosa  Iris-versicolor  Iris-virginica
0            1.0              0.0             0.0
1            1.0              0.0             0.0
2            1.0              0.0             0.0
3            1.0              0.0             0.0
4            1.0              0.0             0.0
...          ...              ...             ...
145          0.0              0.0             1.0
146          0.0              0.0             1.0
147          0.0              0.0             1.0
148          0.0              0.0             1.0


2. Ordinal Encoding:
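It assigns an integer to each category so that the integers follow the category order, which makes it the appropriate choice for ordinal data. Note that scikit-learn's OrdinalEncoder numbers categories in sorted (alphabetical) order by default; to preserve a meaningful ranking, the order must be passed explicitly through its categories parameter. The Iris Species column used below is nominal, so it only demonstrates the API.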
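
A minimal sketch of order-preserving encoding, using the education levels mentioned earlier as hypothetical data. It contrasts the default alphabetical numbering with an explicit ranking supplied through `categories`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature.
edu = pd.DataFrame({"education": ["master's", "high school", "PhD", "bachelor's"]})

# Default: categories are sorted as strings, which is not the real ranking
# ("PhD" sorts first because uppercase letters precede lowercase ones).
default_codes = OrdinalEncoder().fit_transform(edu).ravel().tolist()

# Explicit ranking, lowest to highest; one inner list per input column.
order = [["high school", "bachelor's", "master's", "PhD"]]
ranked_codes = OrdinalEncoder(categories=order).fit_transform(edu).ravel().tolist()

print(default_codes)  # [3.0, 2.0, 0.0, 1.0]
print(ranked_codes)   # [2.0, 0.0, 3.0, 1.0]
```

Only the second result respects the intended order, with high school as 0 and PhD as 3.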

In [15]:
# Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder

In [16]:
ode = OrdinalEncoder()

In [17]:
new_df = ode.fit_transform(pd.DataFrame(df['Species']))

In [18]:
new_df

array([[0.],
       [0.],
       [0.],
       ...,
       [2.],
       [2.],
       [2.]])

In [19]:
df3 = pd.DataFrame(new_df)

In [20]:
df3

       0
0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
...  ...
145  2.0
146  2.0
147  2.0
148  2.0


In [21]:
# bfill() replaces each missing value with the next valid value further down the same column
fdf = pd.read_excel("c:/users/sakshi yadav/Downloads/Financial_Sample.xlsx").bfill()

In [22]:
fdf.head()

      Segment  Country    Product  Discount Band  Units Sold  Manufacturing Price  Sale Price  Gross Sales  Discounts    Sales     COGS   Profit        Date  Month Number  Month Name  Year
0  Government   Canada  Carretera            Low      1618.5                    3        20.0      32370.0        0.0  32370.0  16185.0  16185.0  2014-01-01             1     January  2014
1  Government  Germany  Carretera            Low      1321.0                    3        20.0      26420.0        0.0  26420.0  13210.0  13210.0  2014-01-01             1     January  2014
2   Midmarket   France  Carretera            Low      2178.0                    3        15.0      32670.0        0.0  32670.0  21780.0  10890.0  2014-06-01             6        June  2014
3   Midmarket  Germany  Carretera            Low       888.0                    3        15.0      13320.0        0.0  13320.0   8880.0   4440.0  2014-06-01             6        June  2014
4   Midmarket   Mexico  Carretera            Low      2470.0                    3        15.0      37050.0        0.0  37050.0  24700.0  12350.0  2014-06-01             6        June  2014


In [23]:
fdf.shape

(700, 16)

In [24]:
fdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Segment              700 non-null    object        
 1   Country              700 non-null    object        
 2   Product              700 non-null    object        
 3   Discount Band        700 non-null    object        
 4   Units Sold           700 non-null    float64       
 5   Manufacturing Price  700 non-null    int64         
 6   Sale Price           700 non-null    float64       
 7   Gross Sales          700 non-null    float64       
 8   Discounts            700 non-null    float64       
 9    Sales               700 non-null    float64       
 10  COGS                 700 non-null    float64       
 11  Profit               700 non-null    float64       
 12  Date                 700 non-null    datetime64[ns]
 13  Month Number         700 non-null  

In [25]:
from sklearn.preprocessing import OrdinalEncoder

In [26]:
ode = OrdinalEncoder()

In [27]:
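# Encode all object-dtype (string) columns at once; each column's categories
# are numbered independently, in alphabetical order
new_df3 = ode.fit_transform(fdf.select_dtypes(include="object"))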

In [28]:
new_df3

array([[2., 0., 1., 1., 4.],
       [2., 2., 1., 1., 4.],
       [3., 1., 1., 1., 6.],
       ...,
       [2., 3., 2., 0., 3.],
       [2., 0., 3., 0., 0.],
       [0., 4., 4., 0., 8.]], shape=(700, 5))

In [29]:
new_df = pd.DataFrame(new_df3, columns=fdf.select_dtypes(include="object").columns)

In [30]:
new_df.head()

   Segment  Country  Product  Discount Band  Month Name
0      2.0      0.0      1.0            1.0         4.0
1      2.0      2.0      1.0            1.0         4.0
2      3.0      1.0      1.0            1.0         6.0
3      3.0      2.0      1.0            1.0         6.0
4      3.0      3.0      1.0            1.0         6.0


3. Label Encoding: 
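It is a simple method that assigns a unique integer to each category. scikit-learn's LabelEncoder is intended for encoding the target variable (y) rather than input features, and it numbers the classes in sorted (alphabetical) order, so the integers do not reflect any meaningful ranking. For ordinal input features, use Ordinal Encoding with an explicit category order instead.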

In [31]:
# Label Encoding 
from sklearn.preprocessing import LabelEncoder

In [32]:
le = LabelEncoder()

In [33]:
new_df = le.fit_transform(df['Species'])

In [34]:
new_df

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
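
To see which integer belongs to which class, and to convert predictions back to the original labels, the fitted encoder exposes `classes_` and `inverse_transform`. A short sketch on a hypothetical four-item list:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels.
species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

le = LabelEncoder()
codes = le.fit_transform(species)

# The position of each class in classes_ is its integer code.
print(le.classes_.tolist())
print(codes.tolist())  # [2, 0, 1, 0]

# Map integer codes back to the original labels.
decoded = le.inverse_transform(codes).tolist()
print(decoded)
```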