<a href="https://colab.research.google.com/github/rhodes-byu/cs180-winter25/blob/main/notebooks/13-encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.datasets import fetch_openml
import seaborn as sns
import pandas as pd
import numpy as np

## Transforming Continuous (Numeric) Features

#### Standardization
Standardization is the process of scaling features to have a mean of 0 and a standard deviation of 1. The formula for standardization is:

$$ z = \frac{x - \mu}{\sigma} $$

where:
- $z$ is the standardized value  
- $x$ is the original value  
- $\mu$ is the mean of the feature  
- $\sigma$ is the standard deviation of the feature  
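
As a quick check of the formula, the sketch below standardizes a small array by hand with NumPy. The values are made up for illustration; note that `np.std` defaults to the population standard deviation (`ddof=0`).

```python
import numpy as np

# Hypothetical feature values, chosen only so the arithmetic is easy to follow
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()          # mean of the feature (5.0 here)
sigma = x.std()        # population standard deviation, ddof=0 (2.0 here)
z = (x - mu) / sigma   # standardized values

print(mu, sigma)
print(z)
print(z.mean(), z.std())  # approximately 0 and 1
```

After the transform, each value is expressed as a number of standard deviations above or below the mean.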

#### Normalization
Normalization is the process of scaling features to a range of [0, 1]. The formula for normalization is:

$$ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} $$

where:
- $x'$ is the normalized value  
- $x$ is the original value  
- $x_{\min}$ is the minimum value of the feature  
- $x_{\max}$ is the maximum value of the feature  
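
The same kind of hand calculation works for min-max normalization. This is a minimal sketch on made-up values; it assumes the feature is not constant, since $x_{\max} = x_{\min}$ would make the denominator zero.

```python
import numpy as np

# Hypothetical feature values
x = np.array([10.0, 15.0, 20.0, 30.0])

x_min, x_max = x.min(), x.max()
x_norm = (x - x_min) / (x_max - x_min)  # maps the minimum to 0 and the maximum to 1

print(x_norm)  # values 0, 0.25, 0.5, 1
```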


## Normalization vs. Standardization

### **Use Normalization (Scaling to [0, 1] or [-1, 1]) When:**
- **Bounded Data**: Features have a fixed range (e.g., pixel values [0, 255]).
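- **Deep Learning**: Neural networks often train more reliably when inputs are on a small, consistent scale (e.g., pixel values scaled to [0, 1]).
- **Distance-Based Models**: k-NN, K-Means, and other distance-based methods need features on comparable scales so that no single feature dominates the distance. (Standardization also achieves this.)
- **Non-Gaussian Data**: Makes no assumption about the shape of the distribution.
- **Interpretability**: A scaled value reads directly as a position between the observed minimum (0) and maximum (1).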

### **Use Standardization (Zero Mean, Unit Variance) When:**
- **Gaussian-Like Data**: Ideal for normally distributed features.
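- **Linear Models & PCA**: PCA, SVMs, and regularized regression are sensitive to feature scale, so features are usually standardized first.
- **Outliers**: Somewhat less sensitive to extreme values than min-max scaling, which is set entirely by the minimum and maximum. Outliers still shift the mean and inflate the standard deviation, so standardization is not truly robust.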
- **Different Units**: Useful when features have varying scales (e.g., income vs. age).
- **Optimization Stability**: Models trained with gradient-based optimizers (SGD, Adam) often converge faster on standardized features.

### **Key Takeaways:**
- **Normalization**: Best for bounded data, deep learning, and distance-based models.
- **Standardization**: Best for Gaussian-like data, linear models, and handling different units.
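
To see how a single extreme value affects both scalers, the sketch below uses a made-up feature with nine typical values and one outlier, then measures how wide a band the typical values occupy after each transform. The data are hypothetical and chosen only to make the effect visible.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature: nine typical values plus one extreme outlier
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

x_minmax = MinMaxScaler().fit_transform(x).ravel()
x_standard = StandardScaler().fit_transform(x).ravel()

# Width of the band occupied by the nine typical values after each transform
minmax_spread = x_minmax[:9].max() - x_minmax[:9].min()
standard_spread = x_standard[:9].max() - x_standard[:9].min()

print(minmax_spread)
print(standard_spread)
```

In this example both transforms squeeze the nine typical values into a narrow band, because the minimum and maximum as well as the mean and standard deviation are all pulled by the outlier. If outliers are a concern, scikit-learn's `RobustScaler`, which scales using the median and interquartile range, is one alternative worth considering.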

### Sklearn Scaling / Normalizing

#### Scaling

In [None]:
X = np.random.normal(loc = 10, scale = 3, size = 1000)

In [None]:
np.mean(X), np.std(X)

In [None]:
scaler = StandardScaler()

# Note: scikit-learn expects a 2D array of shape (n_samples, n_features);
# reshape(-1, 1) turns the 1D array into a single column
X_scaled = scaler.fit_transform(X.reshape(-1, 1))

In [None]:
np.mean(X_scaled), np.std(X_scaled)

#### Normalizing

In [None]:
normalizer = MinMaxScaler()

X_normalized = normalizer.fit_transform(X.reshape(-1, 1))

In [None]:
np.min(X_normalized), np.max(X_normalized)

### Pandas Scaling

In [None]:
df = sns.load_dataset('iris')

In [None]:
# StandardScaler returns a NumPy array, so the column names are lost
scaler = StandardScaler()
scaler.fit_transform(df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])

In [None]:
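# Use pandas apply to keep a DataFrame; only float columns are standardized
# Note: pandas .std() uses ddof=1 (sample std), so values differ slightly
# from StandardScaler, which uses ddof=0
df_standardized = df.apply(lambda x: (x - x.mean()) / x.std() if x.dtype == 'float64' else x)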
df_standardized.head()

In [None]:
# Pandas normalization
df_normalized = df.apply(lambda x: (x - x.min()) / (x.max() - x.min()) if x.dtype == 'float64' else x)
df_normalized.head()
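
Another option is to keep using `StandardScaler` but write its output back into a copy of the DataFrame, so the column names and non-numeric columns survive. The sketch below uses a small made-up frame (`df_demo`, a hypothetical stand-in for iris) so that it runs without downloading anything.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small hypothetical frame standing in for the iris data
df_demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.3, 2.7],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

# Select the numeric columns by dtype rather than listing them by hand
numeric_cols = df_demo.select_dtypes(include="number").columns

# Scale only the numeric columns and assign the result back into a copy
df_scaled = df_demo.copy()
df_scaled[numeric_cols] = StandardScaler().fit_transform(df_demo[numeric_cols])

print(df_scaled)
# StandardScaler uses ddof=0, so check the standard deviation the same way
print(df_scaled[numeric_cols].std(ddof=0))
```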

## Processing Categorical Features

### Label Encoding

Label encoding is typically used to encode the labels (targets) when they are categories.

`LabelEncoder` from `sklearn.preprocessing` maps each category (string) to an integer, assigned according to the sorted order of the categories.


### One-Hot Encoding

One-hot encoding splits a single categorical feature (e.g., `['cat', 'dog', 'fish']`) into one binary column per category: each observation has a 1 in the column for its own category and a 0 in all the others.

For example, the animal column with values `['cat', 'dog', 'fish', 'cat']` would map to:

| cat | dog | fish |
|-----|-----|------|
|  1  |  0  |  0   |
|  0  |  1  |  0   |
|  0  |  0  |  1   |
|  1  |  0  |  0   |
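
The table above can be reproduced with `pd.get_dummies`. This is a minimal sketch; the series name `animal` is just an illustrative label, and `dtype=int` is passed so the output shows 0/1 rather than booleans.

```python
import pandas as pd

# The example animal column from the table above
animals = pd.Series(["cat", "dog", "fish", "cat"], name="animal")

# One binary column per category
one_hot = pd.get_dummies(animals, dtype=int)
print(one_hot)
```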




### Sklearn Encoding

#### Label Encoding

In [None]:
y_str = ['zebra', 'dog', 'cat', 'fish', 'dog', 'cat', 'fish']

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_str)

In [None]:
print(y_encoded)
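
To see which integer corresponds to which category, the fitted encoder exposes `classes_`, and `inverse_transform` maps the integers back to the original strings. The sketch below repeats the example above so that it runs on its own.

```python
from sklearn.preprocessing import LabelEncoder

y_str = ['zebra', 'dog', 'cat', 'fish', 'dog', 'cat', 'fish']

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_str)

# classes_ is sorted; a category's position in it is its integer code
print(label_encoder.classes_)   # cat, dog, fish, zebra
print(y_encoded)                # codes 3, 1, 0, 2, 1, 0, 2

# Recover the original strings from the integer codes
y_decoded = label_encoder.inverse_transform(y_encoded)
print(y_decoded)
```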

#### One-Hot Encoding

In [None]:
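# sparse_output=False returns a dense NumPy array instead of a sparse matrix
one_hot_encoder = OneHotEncoder(sparse_output = False)
# OneHotEncoder expects a 2D array, so reshape the labels into a single column
y_one_hot = one_hot_encoder.fit_transform(y_encoded.reshape(-1, 1))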

In [None]:
print(y_one_hot)

### Pandas Encoding

#### Label Encoding

In [None]:
# Load the Titanic dataset from OpenML as a DataFrame
data = fetch_openml(data_id=40945, parser = 'auto')
titanic = data.frame
# Drop columns we won't use, then drop rows that still have missing values
titanic.drop(['body', 'boat', 'name', 'ticket', 'home.dest', 'cabin'], axis = 1, inplace = True)
titanic.dropna(inplace = True)

In [None]:
titanic.head()

In [None]:
titanic.info()

In [None]:
# Replace each categorical column with its integer codes; leave other columns unchanged
titanic_encoded = titanic.apply(lambda x: pd.Categorical(x).codes if x.dtype == 'category' else x)

In [None]:
titanic_encoded.head()

#### One-Hot Encoding

In [None]:
# One-hot encode every categorical column; numeric columns pass through unchanged
titanic_one_hot = pd.get_dummies(titanic)

In [None]:
titanic_one_hot.head()

# **In-Class Activity: Predicting Obesity Levels from Eating Habits and Physical Condition**

In this activity, you will work with a dataset designed to predict obesity levels based on various eating habits and physical conditions. Your goal is to preprocess the data, experiment with different encoding strategies, and compare classification models.

---

## **Review the Dataset**
Before beginning, take some time to familiarize yourself with the dataset and its features. Feature descriptions can be found [here](https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition).

Consider the following as you review the dataset:
- What types of features are present? *(Numerical, ordinal, categorical?)*  
- How should these features be encoded for use in machine learning models?
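
As a starting point for these questions, the sketch below separates numeric from non-numeric columns and encodes an ordinal and a nominal feature differently. The frame `demo`, its column names, and the category order are all hypothetical stand-ins, not the actual obesity dataset; check the feature descriptions to decide the real orderings.

```python
import pandas as pd

# Hypothetical stand-in data; column names and values are made up
demo = pd.DataFrame({
    "age": [21.0, 35.0, 28.0, 44.0],
    "snacking": ["Sometimes", "no", "Frequently", "Sometimes"],   # ordinal
    "transport": ["Walking", "Automobile", "Bike", "Automobile"],  # nominal
})

# Split columns by dtype to see what needs encoding
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
other_cols = demo.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, other_cols)

# Ordinal feature: integer codes that follow an explicit, assumed order
snacking_order = ["no", "Sometimes", "Frequently", "Always"]
demo["snacking_code"] = pd.Categorical(
    demo["snacking"], categories=snacking_order, ordered=True
).codes

# Nominal feature: one-hot encode, since the categories have no natural order
demo_encoded = pd.get_dummies(
    demo.drop(columns="snacking"), columns=["transport"], dtype=int
)
print(demo_encoded)
```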

---

## **Data Preprocessing**
- **Encoding:** Decide how to encode categorical and ordinal variables appropriately.
- **Splitting:** Divide the dataset into **80% training** and **20% testing** using `train_test_split` from `sklearn.model_selection`.
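
An 80/20 split could be sketched as follows. The arrays `X_demo` and `y_demo` are synthetic placeholders for the encoded features and target, and `random_state=42` and `stratify` are choices made here for reproducibility and balanced classes, not requirements of the activity.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 100 rows, 3 features, two balanced classes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.repeat([0, 1], 50)

# Hold out 20% for testing; stratify keeps class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```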


## **Model Training & Cross-Validation**
- Apply **cross-validation** on the training set to fine-tune hyperparameters and evaluate model performance.
- Compare the results of **$k$-Nearest Neighbors (k-NN) and Logistic Regression** using cross-validation scores.
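
One way to run this comparison is with `cross_val_score`, wrapping each model in a pipeline so that scaling is refit inside every fold rather than on the full training set. The sketch below uses synthetic data from `make_classification` as a placeholder for the training split; the specific settings (`n_neighbors=5`, `max_iter=1000`, 5 folds) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder for the training split
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)

# Each pipeline scales the features, then fits the classifier
models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# 5-fold cross-validated accuracy for each model
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```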

### **Evaluation:**
1. Compare the models based on accuracy.
2. Consider hyperparameter tuning for both models:
   - For **k-NN**, experiment with different values of k, metrics, and weighting.
   - For **Logistic Regression**, consider trying different penalties. (View the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).)
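
Hyperparameter tuning for k-NN could be sketched with `GridSearchCV`, which cross-validates every combination in a grid. The grid values below are arbitrary examples, and the synthetic data again stands in for the training split; the step names `scale` and `knn` are labels chosen here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder for the training split
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "knn__n_neighbors": [3, 5, 11],
    "knn__metric": ["euclidean", "manhattan"],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

The same pattern applies to Logistic Regression by swapping the final step and the grid.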

## **Wait Before Testing!**
🚨 **Do NOT evaluate your model on the test set until instructed to do so!** 🚨  

- The test set should remain **unseen** throughout training and validation.
- We will use it **only once** to assess the final model’s performance.
- Keep track of your cross-validation results to decide which model to use for final testing.

### **Why is this important?**
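Evaluating on the test set early, and then using those results to guide modeling choices, leaks information from the test set into model selection. This is a form of **data leakage**: the model ends up **overfit** to the test set, and the reported performance is overly optimistic. The test set should serve as a final, unbiased evaluation of the model.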




In [None]:
# Here is the data:
df = pd.read_csv('https://raw.githubusercontent.com/rhodes-byu/cs180-winter25/refs/heads/main/data/obesity.csv')
df.head()