# Data Encoding & Nominal Encoding

## Introduction to Data Encoding
* Data encoding refers to the process of converting categorical or qualitative data into a numerical format that can be easily processed by machine learning algorithms.
* It plays a crucial role in preparing and structuring data for analysis and model training.
* Most machine learning algorithms, such as linear models, support vector machines, and neural networks, operate on numerical arrays and cannot consume string categories directly.

## Understanding Nominal Encoding
* Nominal encoding is a technique for representing categorical data in which no order or ranking is implied among the categories (for example, colors or species names).
* It maps each category to a unique numerical representation so that machine learning algorithms can process it, without introducing an artificial ordering between categories.
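
As a small illustration of the idea above, the sketch below one-hot encodes a nominal column with `pd.get_dummies`. The `toy` DataFrame and its `color` column are hypothetical and are not part of the Iris example that follows.

```python
import pandas as pd

# Hypothetical toy data: "color" is a nominal feature (no inherent order).
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One indicator column per category; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["color"], dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, so no category is treated as larger or smaller than another.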



In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Load the Iris dataset
iris = load_iris()
data = iris.data
target = iris.target

# Create a DataFrame from the dataset
columns = [f"feature_{i}" for i in range(data.shape[1])]
df = pd.DataFrame(data, columns=columns)
df["target"] = target

# Display the original DataFrame
print("Original DataFrame:")
print(df.head())

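# Label Encoding for the 'target' column.
# Note: iris.target is already integer-coded (0, 1, 2), so 'target_encoded'
# is identical to 'target' here; LabelEncoder matters when labels are strings.
le = LabelEncoder()
df['target_encoded'] = le.fit_transform(df['target'])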

# Display the DataFrame after Label Encoding
print("\nDataFrame after Label Encoding:")
print(df.head())



Original DataFrame:
   feature_0  feature_1  feature_2  feature_3  target
0        5.1        3.5        1.4        0.2       0
1        4.9        3.0        1.4        0.2       0
2        4.7        3.2        1.3        0.2       0
3        4.6        3.1        1.5        0.2       0
4        5.0        3.6        1.4        0.2       0

DataFrame after Label Encoding:
   feature_0  feature_1  feature_2  feature_3  target  target_encoded
0        5.1        3.5        1.4        0.2       0               0
1        4.9        3.0        1.4        0.2       0               0
2        4.7        3.2        1.3        0.2       0               0
3        4.6        3.1        1.5        0.2       0               0
4        5.0        3.6        1.4        0.2       0               0
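
Because the Iris target is already numeric, the two columns above are identical. The sketch below shows what `LabelEncoder` does with string labels instead. Building the `species` array from `iris.target_names` is an addition for illustration, not a step in the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

iris = load_iris()
# Map the integer targets to species names to get a string-valued label array.
species = iris.target_names[iris.target]

le = LabelEncoder()
codes = le.fit_transform(species)

print(le.classes_)                      # categories, stored in sorted order
print(codes[:5])                        # integer codes for the first five rows
print(le.inverse_transform(codes[:5]))  # codes mapped back to species names
```

`LabelEncoder` assigns codes by sorted category order, so the integers carry no meaning beyond identity. It is intended for target labels; for nominal input features, one-hot encoding is usually the safer choice.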


In [None]:

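# One-Hot Encoding for the 'target' column.
# scikit-learn >= 1.2 uses `sparse_output`; the older `sparse` argument was
# deprecated in 1.2 and later removed, so `sparse=False` fails on recent versions.
ohe = OneHotEncoder(sparse_output=False)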
ohe_result = ohe.fit_transform(df[['target']])

# Create a new DataFrame for One-Hot Encoding results
ohe_df = pd.DataFrame(ohe_result, columns=[f"target_{i}" for i in range(ohe_result.shape[1])])

# Concatenate the One-Hot Encoding DataFrame with the original DataFrame
df = pd.concat([df, ohe_df], axis=1)

# Display the DataFrame after One-Hot Encoding
print("\nDataFrame after One-Hot Encoding:")
print(df.head())

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop(['target', 'target_encoded'] + ohe_df.columns.tolist(), axis=1),
                                                    df['target_encoded'], test_size=0.2, random_state=42)




DataFrame after One-Hot Encoding:
   feature_0  feature_1  feature_2  feature_3  target  target_encoded  \
0        5.1        3.5        1.4        0.2       0               0   
1        4.9        3.0        1.4        0.2       0               0   
2        4.7        3.2        1.3        0.2       0               0   
3        4.6        3.1        1.5        0.2       0               0   
4        5.0        3.6        1.4        0.2       0               0   

   target_0  target_1  target_2  
0       1.0       0.0       0.0  
1       1.0       0.0       0.0  
2       1.0       0.0       0.0  
3       1.0       0.0       0.0  
4       1.0       0.0       0.0  




In [None]:
# Display the training and testing sets
print("\nTraining Set:")
print(X_train.head())
print("\nTesting Set:")
print(X_test.head())



Training Set:
    feature_0  feature_1  feature_2  feature_3
22        4.6        3.6        1.0        0.2
15        5.7        4.4        1.5        0.4
65        6.7        3.1        4.4        1.4
11        4.8        3.4        1.6        0.2
42        4.4        3.2        1.3        0.2

Testing Set:
     feature_0  feature_1  feature_2  feature_3
73         6.1        2.8        4.7        1.2
18         5.7        3.8        1.7        0.3
118        7.7        2.6        6.9        2.3
78         6.0        2.9        4.5        1.5
76         6.8        2.8        4.8        1.4
