In [None]:
Data encoding and decoding play a crucial role in data science because they bridge raw data and actionable insights. 
Encoding prepares data for analysis by transforming it into a format that algorithms and models can process; decoding maps the encoded values back to their original form so that results can be interpreted.


In [None]:
Nominal encoding, also known as label encoding, converts a categorical variable into numerical form by assigning a unique integer label to each category. Unlike ordinal encoding, the assigned labels do not imply any order between the categories; the integers are arbitrary identifiers. Note that some models (for example, linear models) will still treat the integers as magnitudes, so the choice of encoding should match the model being used.
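
As a minimal sketch of the idea, the mapping from category to integer can be written out by hand. The `colors` column and its values are hypothetical and serve only as an illustration; sorting the categories is an arbitrary choice made here so that the mapping is reproducible.

```python
import pandas as pd

# Hypothetical example column; the values are illustrative only.
colors = pd.Series(['red', 'green', 'blue', 'green', 'red'])

# Build an explicit category -> integer mapping.
# Sorting is an assumption made here so the codes are reproducible;
# the integers carry no order or magnitude.
mapping = {category: code for code, category in enumerate(sorted(colors.unique()))}

# Replace each category with its integer label.
encoded = colors.map(mapping)

print(mapping)
print(encoded.tolist())
```

Any one-to-one assignment of integers would serve equally well, since the labels only distinguish the categories from one another.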

In [None]:
Nominal encoding (label encoding) is preferred over one-hot encoding when a categorical variable has a large number of unique categories and the order of the categories doesn't matter. It keeps the variable in a single integer column, whereas one-hot encoding creates one binary column per category, which leads to a high-dimensional and sparse feature space when the number of categories is large.

A practical example is text classification, such as sentiment analysis or topic classification, where the vocabulary is large. Each word in the vocabulary is a unique category, so one-hot encoding would produce a very high-dimensional feature space, making model training computationally expensive and prone to overfitting.
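
The difference in size can be sketched with a toy vocabulary. The words, the `word_to_id` mapping, and the sample sentence below are all hypothetical; a real vocabulary would typically be far larger, which is what makes the one-hot width a concern.

```python
import numpy as np

# Hypothetical toy vocabulary; real vocabularies are usually much larger.
vocabulary = ['good', 'bad', 'great', 'terrible', 'okay', 'fine']
word_to_id = {word: idx for idx, word in enumerate(vocabulary)}

sentence = ['good', 'fine', 'bad']

# Integer (label) encoding: one number per token.
integer_encoded = np.array([word_to_id[word] for word in sentence])

# One-hot encoding: one column per vocabulary entry for every token.
one_hot_encoded = np.eye(len(vocabulary), dtype=int)[integer_encoded]

print(integer_encoded.shape)   # one value per token
print(one_hot_encoded.shape)   # tokens x vocabulary size
```

The one-hot width grows with the vocabulary size, while the integer representation stays at one value per token.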

In [15]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample dataset with categorical data
data = {'category': ['A', 'B', 'C', 'D', 'E']}
df = pd.DataFrame(data)

# One-Hot Encoding
one_hot_encoder = OneHotEncoder()
one_hot_encoded = one_hot_encoder.fit_transform(df[['category']])
# Convert to DataFrame for display
one_hot_encoded_df = pd.DataFrame(one_hot_encoded.toarray(), columns=one_hot_encoder.categories_[0])

print("One-Hot Encoding:")
print(one_hot_encoded_df)


One-Hot Encoding:
     A    B    C    D    E
0  1.0  0.0  0.0  0.0  0.0
1  0.0  1.0  0.0  0.0  0.0
2  0.0  0.0  1.0  0.0  0.0
3  0.0  0.0  0.0  1.0  0.0
4  0.0  0.0  0.0  0.0  1.0


In [16]:
# Sample dataset dimensions
num_rows = 1000
num_categorical_columns = 2

# Assuming number of unique categories for each categorical column
unique_categories_col1 = 5
unique_categories_col2 = 7

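# Calculate the number of new columns created for each categorical column.
# This counts k - 1 columns per feature (dummy coding: one category is dropped
# and becomes the reference level). Full one-hot encoding would instead create
# k columns per feature, i.e. 5 + 7 = 12 in total.
new_columns_col1 = unique_categories_col1 - 1
new_columns_col2 = unique_categories_col2 - 1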

# Total number of new columns created
total_new_columns = new_columns_col1 + new_columns_col2

print("Total number of new columns created for nominal encoding:", total_new_columns)


Total number of new columns created for nominal encoding: 10
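
In [None]:
The count can also be checked empirically. The following is a sketch under assumptions: `df_demo` and its category values are made up so that one column has 5 unique categories and the other has 7, and `pd.get_dummies` is used here as one possible way to produce the columns.

```python
import pandas as pd

# Hypothetical data: col1 has 5 unique categories, col2 has 7.
col1_values = ['a', 'b', 'c', 'd', 'e']
col2_values = ['p', 'q', 'r', 's', 't', 'u', 'v']
df_demo = pd.DataFrame({
    'col1': [col1_values[i % 5] for i in range(35)],
    'col2': [col2_values[i % 7] for i in range(35)],
})

# Full one-hot encoding: k columns per feature.
full_one_hot = pd.get_dummies(df_demo, columns=['col1', 'col2'])

# Dummy coding: k - 1 columns per feature (first category is the reference level).
dummy_coded = pd.get_dummies(df_demo, columns=['col1', 'col2'], drop_first=True)

print(full_one_hot.shape[1])
print(dummy_coded.shape[1])
```

With `drop_first=True` the result matches the k - 1 calculation; without it, each feature contributes one column per category.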


In [17]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample animal dataset
data = {
    'Species': ['Lion', 'Elephant', 'Monkey', 'Lion', 'Monkey'],
    'Habitat': ['Savannah', 'Jungle', 'Forest', 'Savannah', 'Forest'],
    'Diet': ['Carnivore', 'Herbivore', 'Omnivore', 'Carnivore', 'Omnivore']
}

df = pd.DataFrame(data)

# Apply label encoding to each categorical column.
# A separate LabelEncoder is fitted per column so that each fitted mapping
# (classes_) is kept and can later be used with inverse_transform.
encoders = {}
for column in ['Species', 'Habitat', 'Diet']:
    encoders[column] = LabelEncoder()
    df[f'{column}_encoded'] = encoders[column].fit_transform(df[column])

# Display the encoded DataFrame
print(df)


    Species   Habitat       Diet  Species_encoded  Habitat_encoded  \
0      Lion  Savannah  Carnivore                1                2   
1  Elephant    Jungle  Herbivore                0                1   
2    Monkey    Forest   Omnivore                2                0   
3      Lion  Savannah  Carnivore                1                2   
4    Monkey    Forest   Omnivore                2                0   

   Diet_encoded  
0             0  
1             1  
2             2  
3             0  
4             2  
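
In [None]:
Decoding is the reverse step: mapping the integer labels back to the original categories. A minimal sketch, reusing the `Species` values from the cell above; the variable names `species_encoder`, `codes`, and `decoded` are chosen here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

species = ['Lion', 'Elephant', 'Monkey', 'Lion', 'Monkey']

# Fit one encoder for this column and keep it for decoding later.
species_encoder = LabelEncoder()
codes = species_encoder.fit_transform(species)

# classes_ holds the sorted categories; a category's position is its code.
print(list(species_encoder.classes_))

# inverse_transform maps the integer codes back to the original labels.
decoded = species_encoder.inverse_transform(codes)
print(list(decoded))
```

Decoding only works with the encoder that was fitted on that column, which is why a fitted encoder should be kept per column rather than refitted and overwritten.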


In [18]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample dataset
data = {
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'age': [35, 42, 28, 55, 30],
    'contract_type': ['Month-to-month', 'Two year', 'One year', 'Month-to-month', 'Two year'],
    'monthly_charges': [65.5, 80.2, 45.3, 75.1, 85.6],
    'tenure': [10, 24, 5, 12, 36]
}

df = pd.DataFrame(data)
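# One-hot encode the nominal 'contract_type' column
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['contract_type']])
encoded_df = pd.DataFrame(encoded.toarray(), columns=encoder.get_feature_names_out())

# Display the encoded columns together with the original DataFrame
encoded_df, df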

(   contract_type_Month-to-month  contract_type_One year  \
 0                           1.0                     0.0   
 1                           0.0                     0.0   
 2                           0.0                     1.0   
 3                           1.0                     0.0   
 4                           0.0                     0.0   
 
    contract_type_Two year  
 0                     0.0  
 1                     1.0  
 2                     0.0  
 3                     0.0  
 4                     1.0  ,
    gender  age   contract_type  monthly_charges  tenure
 0    Male   35  Month-to-month             65.5      10
 1  Female   42        Two year             80.2      24
 2    Male   28        One year             45.3       5
 3  Female   55  Month-to-month             75.1      12
 4    Male   30        Two year             85.6      36)