# Categorical Data & Pandas Input

In this notebook, we learn how to encode categorical data and work with heterogeneous data from pandas DataFrames.

<a href="https://colab.research.google.com/github/thomasjpfan/ml-workshop-intermediate-v2/blob/main/notebooks/02-categorical-data.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

In [None]:
# Install dependencies for google colab
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    %pip install -r https://raw.githubusercontent.com/thomasjpfan/ml-workshop-intermediate-v2/main/requirements.txt

In [None]:
import sklearn
assert sklearn.__version__.startswith("1.2"), "Please install scikit-learn 1.2"

## Categorical Data

In [None]:
import pandas as pd

df_train = pd.DataFrame({
    "pet": ["snake", "dog", "cat", "cow"],
})
df_train

## OrdinalEncoder

In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
ord_encoder = OrdinalEncoder()
ord_encoder.fit_transform(df_train)

In [None]:
ord_encoder.categories_

In [None]:
df_test = pd.DataFrame({
    "pet": ["cow", "cat"]
})
df_test

In [None]:
ord_encoder.transform(df_test)
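
The integer codes come from the position of each category in `categories_`, which is sorted. A minimal sketch of that relationship, reusing the same toy pets (the names `encoder`, `code_for`, and `test_codes` are only illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"pet": ["snake", "dog", "cat", "cow"]})
encoder = OrdinalEncoder().fit(train)

# categories_ holds one sorted array per input column;
# a category's position in that array is its integer code.
pet_categories = encoder.categories_[0]
code_for = {category: code for code, category in enumerate(pet_categories)}
print(code_for)

# Transforming new rows looks up the same codes.
test_codes = encoder.transform(pd.DataFrame({"pet": ["cow", "cat"]})).ravel()
print(test_codes)
```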

### Categories that are unknown during `fit`

In [None]:
df_test_unknown = pd.DataFrame({
    "pet": ["bear"]
})
df_test_unknown

In [None]:
try:
    ord_encoder.transform(df_test_unknown)
except ValueError as err:
    print(err)

### How to handle unknown categories in OrdinalEncoder?

In [None]:
ord_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

In [None]:
ord_encoder.fit_transform(df_train)

In [None]:
ord_encoder.transform(df_test_unknown)
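
With `handle_unknown='use_encoded_value'`, known categories keep their usual codes and only the unseen ones receive `unknown_value`. A small sketch with a made-up test frame that mixes both cases:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"pet": ["snake", "dog", "cat", "cow"]})
# Hypothetical test data: "bear" was never seen during fit.
mixed = pd.DataFrame({"pet": ["cat", "bear", "snake"]})

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

# Known categories are encoded as before; "bear" becomes -1.
mixed_codes = encoder.transform(mixed).ravel()
print(mixed_codes)
```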

## OneHotEncoder

In [None]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
X_trans = ohe.fit_transform(df_train)
X_trans

By default, it's sparse!

In [None]:
X_trans.toarray()

In [None]:
ohe.get_feature_names_out()

### Pandas output requires non-sparse data

In [None]:
ohe = OneHotEncoder(sparse_output=False)
ohe.fit_transform(df_train)

In [None]:
ohe.set_output(transform="pandas")

In [None]:
ohe.fit_transform(df_train)

### Unknown categories during transform?

In [None]:
df_test_unknown

In [None]:
# this will fail
try:
    ohe.transform(df_test_unknown)
except ValueError as exc:
    print(exc)

### OHE can handle unknowns

In [None]:
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
ohe.set_output(transform="pandas")
ohe.fit(df_train)

In [None]:
ohe.transform(df_test_unknown)

### Two categorical features

In [None]:
df_train = pd.DataFrame({
    "pet": ["cat", "dog", "snake"],
    "city": ["New York", "London", "London"]
})

In [None]:
ohe.fit(df_train)

In [None]:
ohe.transform(df_train)

## Back to slides!

# ColumnTransformer

In [None]:
X_df = pd.DataFrame({
    'age': [10, 20, 15, 5, 20, 14],
    'height': [5, 7, 6.5, 4.1, 5.4, 5.4],
    'pet': ['dog', 'snake', 'cat', 'dog', 'cat', 'cat']
})
X_df

## With OrdinalEncoder

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

In [None]:
ct = ColumnTransformer([
    ('numerical', StandardScaler(), ['age', 'height']),
    ('categorical', OrdinalEncoder(), ['pet'])
])

ct.fit_transform(X_df)

### Pandas output

In [None]:
ct = ColumnTransformer([
    ('numerical', StandardScaler(), ['age', 'height']),
    ('categorical', OrdinalEncoder(), ['pet'])
], verbose_feature_names_out=False)

ct.set_output(transform="pandas")

In [None]:
ct.fit_transform(X_df)

## With OneHotEncoder

In [None]:
ct = ColumnTransformer([
    ('numerical', StandardScaler(), ['age', 'height']),
    ('categorical', OneHotEncoder(sparse_output=False), ['pet'])
], verbose_feature_names_out=False)

ct.set_output(transform="pandas")

In [None]:
ct.fit_transform(X_df)

## In a ML Pipeline

In [None]:
from sklearn.datasets import fetch_openml

In [None]:
titanic = fetch_openml(data_id=40945, as_frame=True, parser="pandas")

In [None]:
X, y = titanic.data, titanic.target

In [None]:
X.head()

## Are there categories already encoded in the dataset?

In [None]:
X.dtypes

In [None]:
X.shape[0]

## Are there missing values in the dataset?

In [None]:
missing_values = pd.concat(
    {
        "missing_count": X.isna().sum(),
        "dtypes": X.dtypes,
    },
    axis='columns',
)
missing_values

## Split data into training and test sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

In [None]:
X_train

## Numerical preprocessing

In [None]:
missing_values

In [None]:
numerical_features = ["fare", "body", "age", "pclass", "sibsp", "parch"]

### Global pandas output!

In [None]:
import sklearn

In [None]:
sklearn.set_config(transform_output="pandas")

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

num_prep = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler())
])

In [None]:
num_prep.fit_transform(X_train[numerical_features])

### Categorical features

In [None]:
missing_values

In [None]:
categorical_features = ["sex", "embarked"]

In [None]:
cat_prep = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

In [None]:
cat_prep.fit_transform(X_train[categorical_features])

## ColumnTransformer!

In [None]:
ct = ColumnTransformer([
    ("numerical", num_prep, numerical_features),
    ("categorical", cat_prep, categorical_features),
])

In [None]:
ct.fit_transform(X_train)

In [None]:
ct = ColumnTransformer([
    ("numerical", num_prep, numerical_features),
    ("categorical", cat_prep, categorical_features),
], verbose_feature_names_out=False)

In [None]:
ct.fit_transform(X_train)

## ML Pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

In [None]:
log_reg = Pipeline([
    ('preprocess', ct),
    ('log_reg', LogisticRegression(random_state=42))
])

In [None]:
log_reg.fit(X_train, y_train)

In [None]:
log_reg.score(X_test, y_test)

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = Pipeline([
    ('preprocess', ct),
    ('random_forest', RandomForestClassifier(random_state=42))
])

In [None]:
rf.fit(X_train, y_train)

In [None]:
rf.score(X_test, y_test)

## Exercise 1

1. The Penguin dataset is loaded into `X` and `y`. Is `y` a classification or regression problem?
1. What are the feature names in the dataset?
1. Which features have missing values?
1. What are the categorical features and the numerical features? Store them into `categorical_features` and `numerical_features` respectively.
    - **Hint:** Use `df.select_dtypes(include="category").columns` to find the categorical features
    - **Hint:** Use `df.select_dtypes(include="number").columns` to find the numerical features
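
The two hints can be tried on a small, made-up frame first (the column names below are hypothetical and unrelated to the penguin data):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "red"]),
    "weight": [1.5, 2.0, 3.25],
    "count": [3, 7, 5],
})

# Columns stored with pandas' category dtype.
categorical_cols = toy_df.select_dtypes(include="category").columns.tolist()
# Integer and float columns both count as "number".
numerical_cols = toy_df.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(numerical_cols)
```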

In [None]:
from sklearn.datasets import fetch_openml

penguins = fetch_openml(data_id=42585, as_frame=True, parser="pandas")

X, y = penguins.data, penguins.target

**If you are running locally**, you can uncomment the following cell to load the solution into the cell. On **Google Colab**, [see solution here](https://github.com/thomasjpfan/ml-workshop-intermediate-v2/blob/main/notebooks/solutions/02-ex01-solutions.py). 

In [None]:
# %load solutions/02-ex01-solutions.py

## Exercise 2

1. Use `train_test_split` to split data into a training and test set. **Hint:** Use `random_state=0` and `stratify`
1. Build a `ColumnTransformer` for the penguin dataset with the following transformers:
    - For the numerical features use a `SimpleImputer`
    - For the categorical features use an `OrdinalEncoder` with `encoded_missing_value=-1`.
    - **Hint:** Use `verbose_feature_names_out=False`
1. Build a pipeline with the `ColumnTransformer` from the previous step and a `RandomForestClassifier` with `random_state=0`.
1. Evaluate the model on the test set.

**If you are running locally**, you can uncomment the following cell to load the solution into the cell. On **Google Colab**, [see solution here](https://github.com/thomasjpfan/ml-workshop-intermediate-v2/blob/main/notebooks/solutions/02-ex02-solutions.py). 

In [None]:
# %load solutions/02-ex02-solutions.py