
#### Introduction to feature engineering: Categorical Values
- import the dataset from a CSV file:

In [39]:
import pandas as pd


# Build the raw-file URL for the CSV dataset hosted on GitHub
github_url = "https://github.com/reitezuz/18NES1-2025-/blob/55cb9d26187c960873bf576ffdfdff0f658e662a/week2/OcrData.csv"
url = github_url.replace("github.com", "raw.githubusercontent.com").replace("blob/", "")

# Load CSV file into a DataFrame
#df_orig = pd.read_csv("categorical.csv")
df_orig = pd.read_csv(url)
df_orig['Fur'] = df_orig['Fur'].fillna('None')  # missing Fur values are read as NaN; restore them as the explicit category 'None'
df_orig


Unnamed: 0,Size,Fur,Speaks,Movement,Class
0,Small,Short,No,Runs,Cat
1,Large,Long,No,Runs,Dog
2,Small,None,Yes,Flies,Parrot
3,Medium,Short,No,Runs,Cat
4,Small,None,No,Swims,Carp
5,Medium,Long,No,Runs,Dog
6,Large,Short,No,Runs,Dog
7,Small,Long,No,Runs,Cat
8,Large,None,No,Swims,Carp
9,Medium,None,No,Flies,Parrot
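
The `fillna` step in the cell above can be checked on a small stand-in table. The sketch below assumes a hypothetical inline CSV (not the course file) in which the `Fur` field is empty for some rows. It counts the missing values before and after they are replaced by the explicit category `'None'`.

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for the real file; empty Fur fields are parsed as NaN
csv_text = """Size,Fur,Class
Small,Short,Cat
Small,,Parrot
Large,,Carp
"""
demo = pd.read_csv(io.StringIO(csv_text))

missing_before = int(demo['Fur'].isna().sum())  # number of NaN entries in Fur
demo['Fur'] = demo['Fur'].fillna('None')        # make "no fur" an explicit category
missing_after = int(demo['Fur'].isna().sum())

print(missing_before, missing_after)
print(demo['Fur'].tolist())
```

Counting NaN values this way before encoding is a cheap guard: a NaN that reaches `.cat.codes` is silently encoded as -1.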


- convert some of the categorical values to ordinal encoding (if order matters):

In [65]:
# ORDINAL ENCODING using pandas `.astype('category')`
df = df_orig.copy()
#df['Size'] = df['Size'].astype('category').cat.codes


# Define custom category order
size_order = pd.api.types.CategoricalDtype(categories=['Small', 'Medium', 'Large'], ordered=True)
fur_order = pd.api.types.CategoricalDtype(categories=['None', 'Short', 'Long'], ordered=True)
speaks_order = pd.api.types.CategoricalDtype(categories=['No', 'Yes'], ordered=True)

df['Size'] = df['Size'].astype(size_order).cat.codes
df['Fur'] = df['Fur'].astype(fur_order).cat.codes
df['Speaks'] = df['Speaks'].astype(speaks_order).cat.codes

# Normalization
#ordinal_cols = ['Size', 'Fur', 'Speaks']
#df[ordinal_cols] = df[ordinal_cols].apply(lambda x: -1 + 2*(x - x.min()) / (x.max() - x.min()))

#df.head()
df

Unnamed: 0,Size,Fur,Speaks,Movement,Class
0,0,1,0,Runs,Cat
1,2,2,0,Runs,Dog
2,0,0,1,Flies,Parrot
3,1,1,0,Runs,Cat
4,0,0,0,Swims,Carp
5,1,2,0,Runs,Dog
6,2,1,0,Runs,Dog
7,0,2,0,Runs,Cat
8,2,0,0,Swims,Carp
9,1,0,0,Flies,Parrot
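
A short sketch of why the custom category order in the cell above matters. Without an explicit `CategoricalDtype`, pandas sorts string categories alphabetically, so the codes no longer follow Small < Medium < Large. The value `'Huge'` is a made-up example of a label that is not in the declared categories.

```python
import pandas as pd

sizes = pd.Series(['Small', 'Large', 'Medium', 'Huge'])  # 'Huge' is a made-up unseen value

# Without an explicit order, categories are sorted alphabetically:
# Huge=0, Large=1, Medium=2, Small=3
alphabetical_codes = sizes.astype('category').cat.codes.tolist()

# With an explicit ordered dtype, codes follow the declared order;
# values outside the declared categories become NaN and get code -1
size_order = pd.api.types.CategoricalDtype(
    categories=['Small', 'Medium', 'Large'], ordered=True
)
ordered_codes = sizes.astype(size_order).cat.codes.tolist()

print(alphabetical_codes)
print(ordered_codes)
```

A code of -1 after encoding is therefore a useful signal that a value was misspelled or missing from the category list.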


- convert the categorical values to one-hot encoding (if order doesn't matter):

In [75]:
# ONE-HOT ENCODING using pandas `.get_dummies()`
one_hot_columns = ['Movement', 'Class']
df_onehot = pd.get_dummies(df, columns=one_hot_columns)
df_onehot = df_onehot.astype(int)

df_onehot.head()



Unnamed: 0,Size,Fur,Speaks,Movement_Flies,Movement_Runs,Movement_Swims,Class_Carp,Class_Cat,Class_Dog,Class_Parrot
0,0,1,0,0,1,0,0,1,0,0
1,2,2,0,0,1,0,0,0,1,0
2,0,0,1,1,0,0,0,0,0,1
3,1,1,0,0,1,0,0,1,0,0
4,0,0,0,0,0,1,1,0,0,0


- normalize the data to the range [-1, 1] (optional):

In [68]:
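# Min-max normalization of every column to the range [-1, 1]
df_onehot = df_onehot.apply(lambda x: -1 + 2*(x - x.min()) / (x.max() - x.min()))
# The cast to int is exact here only because every scaled value is -1, 0, or 1
df_onehot = df_onehot.astype(int)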

df_onehot.head()

Unnamed: 0,Size,Fur,Speaks,Movement_Flies,Movement_Runs,Movement_Swims,Class_Carp,Class_Cat,Class_Dog,Class_Parrot
0,-1,0,-1,-1,1,-1,-1,1,-1,-1
1,1,1,-1,-1,1,-1,-1,-1,1,-1
2,-1,-1,1,1,-1,-1,-1,-1,-1,1
3,0,0,-1,-1,1,-1,-1,1,-1,-1
4,-1,-1,-1,-1,-1,1,1,-1,-1,-1
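
The normalization above maps each column with x' = -1 + 2(x - min)/(max - min). The sketch below wraps that formula in a hypothetical helper, `scale_to_unit_range` (not part of the original notebook). It assumes that constant columns should map to -1 instead of producing a 0/0 division. It also shows why casting to `int` is only safe when all scaled values are whole numbers.

```python
import pandas as pd

def scale_to_unit_range(frame):
    """Min-max scale each column to [-1, 1]; constant columns map to -1."""
    col_min = frame.min()
    span = (frame.max() - col_min).replace(0, 1)  # avoid 0/0 for constant columns
    return -1 + 2 * (frame - col_min) / span

# Made-up data: 'a' has four distinct levels, 'b' is constant
demo = pd.DataFrame({'a': [0, 1, 2, 4], 'b': [5, 5, 5, 5]})
scaled = scale_to_unit_range(demo)

print(scaled['a'].tolist())              # floats, including -0.5
print(scaled.astype(int)['a'].tolist())  # truncation toward zero loses the -0.5
```

With the three-level features in this dataset the cast is harmless, but keeping floats is the safer default for features with more levels.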


- split the data into input features (X) and target outputs (y):

In [70]:
classes = ['Class_Carp', 'Class_Cat', 'Class_Dog', 'Class_Parrot']
X = df_onehot.drop(columns=classes)
y = df_onehot[classes]
X.head()


Unnamed: 0,Size,Fur,Speaks,Movement_Flies,Movement_Runs,Movement_Swims
0,-1,0,-1,-1,1,-1
1,1,1,-1,-1,1,-1
2,-1,-1,1,1,-1,-1
3,0,0,-1,-1,1,-1
4,-1,-1,-1,-1,-1,1


- split the dataset into training and test sets:


In [76]:
from sklearn.model_selection import train_test_split
import numpy as np

# Split dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training Data Shape:", X_train.shape, y_train.shape)
print("Testing Data Shape:", X_test.shape, y_test.shape)



Training Data Shape: (8, 6) (8, 4)
Testing Data Shape: (2, 6) (2, 4)
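
Two properties of the split above are worth checking on such a small dataset. A fixed `random_state` makes the split reproducible, and a 2-sample test set cannot contain all 4 classes. The sketch below uses synthetic stand-in arrays (`X_demo`, `y_demo`, illustrative only) with integer labels instead of the one-hot targets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10 samples, 2 features, integer class labels 0..3
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([1, 2, 3, 1, 0, 2, 2, 1, 0, 3])

# The same random_state yields the same split on every call
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = split_a
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# Classes that never appear in the test set cannot be evaluated on it
missing_in_test = sorted(set(y_demo.tolist()) - set(y_te.tolist()))

print(X_tr.shape, X_te.shape, same_split)
print("Classes absent from the test set:", missing_in_test)
```

With only 10 rows, the test accuracy is therefore a rough indication at best.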


- convert the data into NumPy arrays:

In [77]:
# Convert the DataFrames to NumPy arrays
X_train, X_test = np.array(X_train), np.array(X_test)
y_train, y_test = np.array(y_train), np.array(y_test)

print("\nConverted to NumPy:")
print("Training Data Shape:", X_train.shape, y_train.shape)
print("Testing Data Shape:", X_test.shape, y_test.shape)


Converted to NumPy:
Training Data Shape: (8, 6) (8, 4)
Testing Data Shape: (2, 6) (2, 4)
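
A possible variant of the conversion above, sketched on a small made-up frame: `DataFrame.to_numpy` accepts an explicit `dtype`, which can be convenient when the training code expects floating-point inputs. The choice of `float32` is an assumption here, not something the original notebook specifies.

```python
import numpy as np
import pandas as pd

# Made-up frame with already-normalized integer features
demo = pd.DataFrame({'Size': [-1, 0, 1], 'Speaks': [-1, 1, -1]})

as_array = np.array(demo)                    # same call as in the cell above; keeps the integer dtype
as_float32 = demo.to_numpy(dtype='float32')  # explicit floating-point dtype

print(as_array.dtype, as_float32.dtype)
print(as_float32.shape)
```

Both calls produce arrays with the same shape and values; only the element type differs.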


- now we are ready to train the model