### DATA ENCODING

#### 1) Nominal / One Hot Encoding
#### 2) Label and Ordinal Encoding
#### 3) Target Guided Ordinal Encoding

#### 1) Nominal / One Hot Encoding

#This method involves creating a binary vector for each category, with a 
#value of 1 indicating the presence of the category and a value of 0 
#indicating the absence of the category. This method works well for 
#categorical features with a small number of categories.

#One hot encoding, sometimes loosely called nominal encoding, is a technique used 
#to represent categorical data as numerical data, which is more suitable 
#for machine learning algorithms. 
#In this technique, each category is represented as a binary vector
#where each bit corresponds to a unique category. For example, if we 
#have a categorical variable "color" with three possible values 
#(red, green, blue), we can represent it using one hot encoding as follows:

#1. Red: [1, 0, 0]
#2. Green: [0, 1, 0]
#3. Blue: [0, 0, 1]

In [4]:
import pandas as pd

In [6]:
from sklearn.preprocessing import OneHotEncoder

In [25]:
#Create a simple DataFrame
df=pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue']
})

In [26]:
df

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red
5,blue


In [9]:
##Create an instance of OneHotEncoder

In [10]:
encoder=OneHotEncoder()

In [11]:
##Fit and then transform, or do both in one step with fit_transform

In [15]:
arr=encoder.fit_transform(df[['color']]).toarray()

In [18]:
arr

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [13]:
#The encoder orders the columns alphabetically by category
#(blue, green, red)

In [27]:
df1=pd.DataFrame(arr,columns=encoder.get_feature_names_out())

In [28]:
df1

Unnamed: 0,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


In [29]:
pd.concat([df,df1],axis=1)

Unnamed: 0,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,red,0.0,0.0,1.0
5,blue,1.0,0.0,0.0


#### DIFFERENCE BETWEEN NOMINAL AND ONE HOT ENCODING


In [32]:
#DIFFERENCE BETWEEN NOMINAL AND ONE HOT ENCODING


#Nominal encoding and one-hot encoding are both techniques used to transform categorical data into numerical form that can be used in 
#machine learning algorithms. However, there are some key differences between the two techniques:

#Nominal encoding: Nominal encoding is a technique that assigns a unique numerical value to each category in a categorical variable. 
#These numerical values are typically integers starting from 0, and there is no inherent order or ranking to the values.
#Nominal encoding is useful when the categories do not have any inherent order or ranking.

#One-hot encoding: One-hot encoding is a technique that creates a set of binary variables, one for each category in a categorical variable.
#Each binary variable represents a category, and the value is 1 if the category is present and 0 if it is not. One-hot encoding is
#useful when the categories do not have any inherent order or ranking and when the number of categories is small.

#The main difference between nominal encoding and one-hot encoding is in the resulting encoded variables. Nominal encoding produces a 
#single numerical variable with multiple values, one for each category, while one-hot encoding produces multiple binary variables, 
#one for each category.

#Here's an example to illustrate the difference between nominal and one-hot encoding. Let's say we have a categorical variable "Color"
#with the following categories: "Red", "Green", and "Blue".

#Nominal encoding: We can assign numerical values to each category as follows: "Red" = 0, "Green" = 1, and "Blue" = 2. 
#The resulting variable would be a single numerical variable with values ranging from 0 to 2.

#One-hot encoding: We can create three binary variables, one for each category, as follows: "Red" = [1, 0, 0], "Green" = [0, 1, 0], and 
#"Blue" = [0, 0, 1]. The resulting variables would be three binary variables, with each variable indicating whether a particular 
#category is present or not.
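The contrast above can be sketched with pandas alone. This is a minimal illustration, not the notebook's own method: `pd.factorize` stands in for nominal (integer) encoding and `pd.get_dummies` for one-hot encoding, and the variable names are hypothetical. Note that `pd.factorize` numbers categories in order of first appearance, so the integers depend on the data order.

```python
import pandas as pd

colors = pd.Series(["Red", "Green", "Blue", "Green"], name="Color")

# Nominal (integer) encoding: a single column, one integer per category.
# pd.factorize assigns integers in order of first appearance.
codes, uniques = pd.factorize(colors)

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(colors, dtype=int)

print(codes)          # one integer per row
print(list(uniques))  # category behind each integer
print(one_hot)        # one 0/1 column per category
```

The integer version is compact but can suggest an order that does not exist; the one-hot version avoids that at the cost of one column per category.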

#### Difference between One Hot Encoding and Binary Encoding Technique

In [33]:
#One-hot encoding and binary encoding are both techniques used to transform categorical data into a numerical format suitable for machine learning algorithms. 
#The main difference between these techniques lies in the number of binary features used to represent each categorical value.

#In one-hot encoding, each unique value in the categorical feature is represented by a binary vector with as many elements as there are unique values. 
#For example, if a feature "color" has three unique values: "red", "green", and "blue", one-hot encoding would represent these values as [1, 0, 0], [0, 1, 0], and [0, 0, 1], 
#respectively. Only one element in the vector is 1, and the rest are 0s.

#In binary encoding, each unique value in the categorical feature is first assigned an integer, and that integer is written as a binary code of a fixed length, typically ceil(log2(n)) bits, where n is the number of unique values.
#For example, if the feature "color" has three unique values, binary encoding needs 2 bits and could represent them as [0, 0], [0, 1], and [1, 0]. The main difference is that in binary encoding, each 
#unique value is represented by a binary code with fewer bits than the number of unique values, whereas in one-hot encoding, each unique value is represented by a binary vector 
#with the same length as the number of unique values.
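Binary encoding as described above can be sketched by hand. The version below is an illustrative assumption, not a library API: categories are numbered alphabetically, and the helper `to_bits` and the column names `color_bit0`, `color_bit1` are hypothetical.

```python
import math
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Step 1: give each category an integer code (alphabetical order here).
categories = sorted(colors.unique())
code_of = {cat: i for i, cat in enumerate(categories)}

# Step 2: number of bits needed to tell n categories apart.
n_bits = max(1, math.ceil(math.log2(len(categories))))

# Step 3: write each code in binary, most significant bit first.
def to_bits(code):
    return [(code >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]

binary = pd.DataFrame(
    [to_bits(code_of[c]) for c in colors],
    columns=[f"color_bit{i}" for i in range(n_bits)],
)
print(binary)
```

With three categories this produces two columns instead of the three that one-hot encoding would need; the saving grows with the number of categories.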

### Label Encoding 
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

Label encoding assigns a unique integer label to each category in the variable. The labels are usually assigned in alphabetical order or by category frequency, and the numbers themselves carry no meaning. For example, for a categorical variable "color" with three possible values (red, green, blue), one possible assignment is:

1. Red: 1
2. Green: 2
3. Blue: 3

In practice the integers often start at 0 and increase sequentially. For example, in a dataset with the categorical feature "fruit" and categories "apple," "banana," and "orange," label encoding would convert "apple" to 0, "banana" to 1, and "orange" to 2.

Ordinal encoding, on the other hand, is a similar process that assigns numerical values to categorical variables, but the values are assigned based on the order or rank of the categories. For example, if we have a dataset with a categorical feature "size" and categories "small," "medium," and "large," ordinal encoding would assign "small" a lower value than "medium," and "medium" a lower value than "large."

In summary, label encoding assigns numerical values to categories without considering any order or rank, while ordinal encoding assigns values based on the order or rank of the categories.
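The difference can be sketched with pandas alone. This is a minimal illustration under stated assumptions: the frame `df_demo` and the mapping `size_rank` are hypothetical, and pandas category codes stand in for label encoding.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "fruit": ["banana", "apple", "orange", "apple"],
    "size": ["small", "large", "medium", "small"],
})

# Label-style encoding: codes follow alphabetical order, which carries no meaning.
df_demo["fruit_code"] = df_demo["fruit"].astype("category").cat.codes

# Ordinal encoding: the order is stated explicitly by the analyst.
size_rank = {"small": 0, "medium": 1, "large": 2}
df_demo["size_code"] = df_demo["size"].map(size_rank)

print(df_demo)
```

Here `size_code` can be compared meaningfully (small < medium < large), while comparing values of `fruit_code` says nothing about the fruits.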

In [2]:
import pandas as pd

In [19]:
df=pd.DataFrame({"colours":["Red","Green","Blue","Green","Red","Blue","Red","Green"]})

In [20]:
df

Unnamed: 0,colours
0,Red
1,Green
2,Blue
3,Green
4,Red
5,Blue
6,Red
7,Green


In [6]:
from sklearn.preprocessing import LabelEncoder

In [7]:
scaler=LabelEncoder()

In [14]:
import numpy as np

In [21]:
arr=scaler.fit_transform(df["colours"]) #LabelEncoder expects a 1D column, so pass a Series rather than a one-column DataFrame


In [22]:
arr

array([2, 1, 0, 1, 2, 0, 2, 1])

In [12]:
df

Unnamed: 0,colours
0,Red
1,Green
2,Blue
3,Green
4,Red
5,Blue
6,Red
7,Green


In [26]:
scaler.transform(["Red"]) #Individual values can also be encoded separately

array([2])

In [27]:
scaler.transform(["Blue"])

array([0])
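To see which integer belongs to which category, the fitted encoder can be inspected and reversed. The sketch below is self-contained and uses a hypothetical encoder named `le`; `classes_` and `inverse_transform` are standard `LabelEncoder` members.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["Red", "Green", "Blue", "Green", "Red"])

# classes_ holds the learned categories in sorted order;
# the position of a category in this array is its integer label.
print(le.classes_)

# The encoded values for the input list.
print(codes)

# inverse_transform maps integer labels back to the original categories.
print(le.inverse_transform([0, 2]))
```

Because the classes are sorted, "Blue" receives 0 and "Red" receives 2, which matches the outputs shown above.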

### Ordinal Encoding
Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, if we have a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate), we can represent it using ordinal encoding as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4

In [28]:
from sklearn.preprocessing import OrdinalEncoder

In [29]:
scaler=OrdinalEncoder()

In [30]:
df=pd.DataFrame({"Degree":["BTECH","BE","BCA","BBA","BSC","BBA","PHD","BTECH"]})

In [31]:
df

Unnamed: 0,Degree
0,BTECH
1,BE
2,BCA
3,BBA
4,BSC
5,BBA
6,PHD
7,BTECH


In [45]:
#Create an OrdinalEncoder with an explicit category order (lowest rank first), then call fit_transform
scaler=OrdinalEncoder(categories=[["BBA","BCA","BSC","BE","BTECH","PHD"]])

In [46]:
scaler.fit_transform(df[["Degree"]]) #Each value is encoded by its rank in the given order

array([[4.],
       [3.],
       [1.],
       [0.],
       [2.],
       [0.],
       [5.],
       [4.]])

In [43]:
scaler.categories_

[array(['BBA', 'BCA', 'BSC', 'BE', 'BTECH', 'PHD'], dtype=object)]

### Target Guided Ordinal Encoding
Target guided ordinal encoding is a technique used to encode categorical variables based on their relationship with the target variable. It is useful when we have a categorical variable with a large number of unique categories and we want to use this variable as a feature in our machine learning model.

In Target Guided Ordinal Encoding, we replace each category in the categorical variable with a numerical value based on the mean or median of the target variable for that category. This creates a monotonic relationship between the categorical variable and the target variable, which can improve the predictive power of our model.

In [67]:
import pandas as pd

# create a sample dataframe with a categorical variable and a target variable
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

In [68]:
df

Unnamed: 0,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [69]:
df.groupby('city')['price'].mean()

city
London      150.0
New York    190.0
Paris       310.0
Tokyo       250.0
Name: price, dtype: float64

In [70]:
mean_price=df.groupby('city')['price'].mean().to_dict()

In [85]:
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [76]:
df['city_encoded']=df['city'].map(mean_price)

In [77]:
df

Unnamed: 0,city,price,city_encoded
0,New York,200,190.0
1,London,150,150.0
2,Paris,300,310.0
3,Tokyo,250,250.0
4,New York,180,190.0
5,Paris,320,310.0


In [60]:
import pandas as pd

# create a sample dataset
data = {'education': ['high school', 'college', 'graduate school', 'high school', 'college'],
        'default': [1, 0, 0, 1, 0]}
df = pd.DataFrame(data)

# calculate the mean target variable for each category of 'education'
target_means = df.groupby('education')['default'].mean().sort_values()

# create a dictionary to map categories to numerical values based on target means
target_dict = {k: i for i, k in enumerate(target_means.index)}

# apply the target-guided ordinal encoding to the 'education' column
df['education_encoded'] = df['education'].map(target_dict)

print(df)


         education  default  education_encoded
0      high school        1                  2
1          college        0                  0
2  graduate school        0                  1
3      high school        1                  2
4          college        0                  0
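The same idea can be wrapped in a small helper so that the mapping is learned once and then applied to other rows. This is a sketch under stated assumptions, not an established API: the function `fit_target_order`, the frames `train` and `new_rows`, and the choice to encode unseen categories as -1 are all hypothetical.

```python
import pandas as pd

def fit_target_order(frame, column, target):
    """Rank the categories of `column` by their mean `target` value."""
    means = frame.groupby(column)[target].mean().sort_values()
    return {category: rank for rank, category in enumerate(means.index)}

train = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})
new_rows = pd.DataFrame({"city": ["Paris", "London", "Berlin"]})

# Learn the ranking from the rows that have a target value.
city_rank = fit_target_order(train, "city", "price")

# Apply it elsewhere; categories never seen in `train` get -1 (an assumption).
new_rows["city_encoded"] = new_rows["city"].map(city_rank).fillna(-1).astype(int)

print(city_rank)
print(new_rows)
```

Learning the mapping only from rows with a known target keeps the target of the rows being encoded out of their own features.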
