## Data Encoding

Data encoding is an important step in preparing data for machine learning. It involves transforming data from its original format into a format that can be understood by machine learning algorithms.

The encoding process can vary depending on the type of data being used. Some common types of encoding include:

1. One-hot encoding: This technique is used for categorical data. It creates a binary vector for each category in the data, where all values are zero except for the index of the category, which is set to one.

2. Label encoding: This technique is also used for categorical data. It assigns a unique numerical value to each category in the data.

3. Normalization: This technique is used for numerical data. It scales the data so that it falls within a specific range, such as between 0 and 1.

4. Standardization: This technique is also used for numerical data. It transforms the data so that it has a mean of zero and a standard deviation of one.

5. Embedding: This technique is used for text data. It maps each word to a dense numerical vector, typically with far fewer dimensions than the vocabulary has words, so that the machine learning algorithm can process the text.

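The choice of encoding technique depends on the nature of the data and the machine learning algorithm being used. For example, distance-based and gradient-based models are usually sensitive to feature scale, while tree-based models largely are not. Appropriate encoding can improve both the accuracy and the training behavior of the model.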
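
To make the difference between normalization and standardization concrete, here is a minimal NumPy sketch. The `ages` array is made-up illustrative data, and the population standard deviation (NumPy's default) is assumed for the z-score.

```python
import numpy as np

# Hypothetical numeric feature; the values are illustrative only.
ages = np.array([18.0, 25.0, 40.0, 60.0])

# Min-max normalization: rescale the values into the [0, 1] range.
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Standardization (z-score): subtract the mean, divide by the standard deviation.
# np.std uses the population standard deviation (ddof=0) by default.
standardized = (ages - ages.mean()) / ages.std()

print(normalized)
print(standardized.mean(), standardized.std())
```

After normalization the smallest value maps to 0 and the largest to 1. After standardization the feature has a mean of (approximately) zero and a standard deviation of one.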

**For categorical data in particular, encoding means converting categories (nominal or ordinal) into a numerical format that can be used as input to machine learning models. Three commonly used techniques are:**

- Nominal/OHE Encoding: Each category of a nominal variable is converted into a binary vector with a single 1 marking that category. This technique is also known as one-hot encoding (OHE). It is most useful for variables with a small number of categories, because it adds one column per category.

- Label and Ordinal Encoding: Label encoding assigns each category a unique integer without implying any meaningful order (scikit-learn's `LabelEncoder`, for example, numbers the categories in sorted order). Ordinal encoding is used when the categories have a natural ordering: each category is assigned an integer based on its position in that ordering, which you specify yourself.

- Target Guided Ordinal Encoding: This technique is used when the categorical variable is strongly related to the target variable. Each category is assigned a numerical value based on the mean or median of the target variable for that category. It is useful when the number of categories is large and OHE is not feasible.

**Nominal/OHE Encoding: Nominal encoding is also known as one-hot encoding (OHE). It is used to convert categorical variables into a numerical format that can be used in machine learning models. In this encoding, each unique value in a categorical feature is represented by a binary vector in which exactly one element is 1 and the rest are 0. For example, suppose we have a categorical feature "color" with three possible values: red, blue, and green. We can use one-hot encoding to represent each of these values as a binary vector: red = [1,0,0], blue = [0,1,0], and green = [0,0,1]. Which position belongs to which category is a convention; scikit-learn's `OneHotEncoder` orders the columns alphabetically (blue, green, red).**
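
The color example can be reproduced by hand. This is a minimal sketch, not how any particular library implements it; the `categories` order and the helper `one_hot` are assumptions introduced for illustration.

```python
import numpy as np

# The category order is an explicit assumption; it fixes which position
# each category occupies in the vector.
categories = ["red", "blue", "green"]
index = {name: i for i, name in enumerate(categories)}

def one_hot(value):
    """Return a binary vector with a single 1 at the category's position."""
    vec = np.zeros(len(categories), dtype=int)
    vec[index[value]] = 1
    return vec

print(one_hot("red"))    # [1 0 0]
print(one_hot("blue"))   # [0 1 0]
print(one_hot("green"))  # [0 0 1]
```

Changing the order of `categories` changes the vectors but not the information they carry.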

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
## Create DataFrame

df = pd.DataFrame({"color": ['red', 'blue', 'green', 'green', 'red', 'blue']})
df.head()

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | green |
| 4 | red   |


In [3]:
## Create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [4]:
## perform fit and transform
encoded = encoder.fit_transform(df[["color"]]).toarray()

In [5]:
encoder_df = pd.DataFrame(encoded, columns = encoder.get_feature_names_out())

In [6]:
encoder_df

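|   | color_blue | color_green | color_red |
|---|------------|-------------|-----------|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |
| 5 | 1.0 | 0.0 | 0.0 |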


In [7]:
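## Pass a DataFrame with the fitted column name to avoid a feature-name warning
encoder.transform(pd.DataFrame({"color": ["blue"]})).toarray()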



array([[1., 0., 0.]])

In [8]:
pd.concat([df,encoder_df], axis = 1)

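|   | color | color_blue | color_green | color_red |
|---|-------|------------|-------------|-----------|
| 0 | red   | 0.0 | 0.0 | 1.0 |
| 1 | blue  | 1.0 | 0.0 | 0.0 |
| 2 | green | 0.0 | 1.0 | 0.0 |
| 3 | green | 0.0 | 1.0 | 0.0 |
| 4 | red   | 0.0 | 0.0 | 1.0 |
| 5 | blue  | 1.0 | 0.0 | 0.0 |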


**Label Encoding: Label encoding is another way of encoding categorical variables in machine learning. In this encoding, each unique value in a categorical feature is assigned a unique integer value. The integers are usually assigned in alphabetical order or based on the frequency of the categories; scikit-learn's `LabelEncoder` uses sorted (alphabetical) order and is designed for target labels, so it expects one-dimensional input.**

**For example, suppose we have a categorical feature "size" with three possible values: small, medium, and large. Label encoding represents each value as a unique integer. With alphabetical assignment this gives large = 0, medium = 1, and small = 2, so the integers do not reflect the natural order of the sizes.**

In [9]:
df.head()

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | green |
| 4 | red   |


In [10]:
from sklearn.preprocessing import LabelEncoder

In [11]:
lbl = LabelEncoder()

In [12]:
## LabelEncoder expects 1D input, so pass the column as a Series
lbl.fit_transform(df["color"])


array([2, 0, 1, 1, 2, 0])

In [13]:
## Pass a flat list (1D), not a nested one
lbl.transform(["red"])


array([2])
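
To see which integer each category received, you can inspect the fitted encoder. The sketch below uses the same color values as above with a fresh encoder; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "blue", "green", "green", "red", "blue"]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(colors)  # 1D input, as LabelEncoder expects

# classes_ lists the categories in sorted order; a category's position is its code.
mapping = {label: code for code, label in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes.
print(label_encoder.inverse_transform(codes))
```

The mapping confirms the alphabetical assignment: blue is 0, green is 1, and red is 2.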

**Ordinal encoding is similar to label encoding, but it is used for features that have an inherent order or ranking, such as ratings or levels. In this encoding, each unique value in a categorical feature is assigned a unique integer value based on its rank or order. For example, suppose we have a categorical feature "rating" with four possible values: poor, fair, good, and excellent. We can use ordinal encoding to represent each of these values as a unique integer based on their rank: poor = 0, fair = 1, good = 2, and excellent = 3.**

In [14]:
from sklearn.preprocessing import OrdinalEncoder

In [15]:
df1 = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})

In [16]:
df1

|   | size   |
|---|--------|
| 0 | small  |
| 1 | medium |
| 2 | large  |
| 3 | medium |
| 4 | small  |
| 5 | large  |


In [17]:
## create an instance of OrdinalEncoder and then fit_transform
ordi = OrdinalEncoder(categories=[['small', 'medium', 'large']])

In [18]:
ordi.fit_transform(df1[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [19]:
ordi.transform(pd.DataFrame({'size': ['small']}))



array([[0.]])

In [20]:
ordi.transform(pd.DataFrame({'size': ['large']}))



array([[2.]])
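
The "rating" example from the text can also be encoded with pandas alone, as an alternative to `OrdinalEncoder`. This is a sketch under the assumption that the rank order is known in advance; the `ratings` DataFrame is made-up data.

```python
import pandas as pd

ratings = pd.DataFrame({"rating": ["good", "poor", "excellent", "fair", "good"]})

# The rank order must be stated explicitly; it cannot be inferred from the strings.
order = ["poor", "fair", "good", "excellent"]

# An ordered Categorical assigns each value the index of its category in `order`.
ratings["rating_encoded"] = pd.Categorical(
    ratings["rating"], categories=order, ordered=True
).codes
print(ratings)
```

Values that are not listed in `order` receive the code -1, so it is worth checking for that value before training.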

**Target Guided Ordinal Encoding: Target-guided ordinal encoding is used for categorical features that are strongly related to the target variable. We compute the mean or median of the target variable for each category and replace each category with a numerical value derived from it: either the statistic itself, or the category's integer rank when the categories are sorted by that statistic. The encoded feature is then monotonically related to the per-category target mean (or median), which can improve the predictive power of our model.**

**For example, suppose we have a categorical feature "city" with five possible values: New York, Los Angeles, Chicago, Houston, and Miami. We can use target-guided ordinal encoding to assign a unique integer value to each city based on the average target variable value for each city. This can help the machine learning model capture the relationship between the categorical feature and the target variable more effectively.**

In [21]:
df2 = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

In [22]:
df2

|   | city     | price |
|---|----------|-------|
| 0 | New York | 200 |
| 1 | London   | 150 |
| 2 | Paris    | 300 |
| 3 | Tokyo    | 250 |
| 4 | New York | 180 |
| 5 | Paris    | 320 |


In [25]:
Mean_Price = df2.groupby('city')['price'].mean().to_dict()

In [26]:
Mean_Price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [27]:
df2['City_Encoded'] = df2['city'].map(Mean_Price)

In [29]:
df2[['city','City_Encoded']]

|   | city     | City_Encoded |
|---|----------|--------------|
| 0 | New York | 190.0 |
| 1 | London   | 150.0 |
| 2 | Paris    | 310.0 |
| 3 | Tokyo    | 250.0 |
| 4 | New York | 190.0 |
| 5 | Paris    | 310.0 |
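
The cells above replace each city with its mean price. If integer ranks are wanted instead, the cities can be sorted by mean price and numbered in that order. This is a minimal sketch using the same data; the names `listings`, `ordered_cities`, and `city_rank` are introduced here for illustration.

```python
import pandas as pd

listings = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Sort the cities by mean price, then number them 0, 1, 2, ... in that order.
ordered_cities = listings.groupby("city")["price"].mean().sort_values().index
city_rank = {city: rank for rank, city in enumerate(ordered_cities)}

listings["city_rank"] = listings["city"].map(city_rank)
print(city_rank)
print(listings)
```

Because the mapping is computed from the target, it would normally be derived from the training rows only and then applied to validation or test rows. Otherwise target information leaks into the features.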
