# Pandas output

In this notebook, we review the pandas output API introduced in scikit-learn v1.2.

<a href="https://colab.research.google.com/github/thomasjpfan/ml-workshop-intermediate-v2/blob/main/notebooks/01-pandas-output.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

In [None]:
# Install dependencies when running on Google Colab
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    %pip install -r https://raw.githubusercontent.com/thomasjpfan/ml-workshop-intermediate-v2/main/requirements.txt

In [None]:
import sklearn
assert sklearn.__version__.startswith("1.2"), "Please install scikit-learn 1.2"

## Loading wine dataset

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

In [None]:
wine = load_wine(as_frame=True)
X, y = wine.data, wine.target

In [None]:
y

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

## Default Scaler

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
scaler.fit_transform(X_train)

## Scaler with Pandas output

In [None]:
scaler.set_output(transform="pandas")
scaler.fit_transform(X_train)

## In an ML Pipeline

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectPercentile
from sklearn.pipeline import make_pipeline

In [None]:
log_reg = make_pipeline(
    StandardScaler(),
    SelectPercentile(percentile=50),
    LogisticRegression()
)

In [None]:
log_reg.set_output(transform="pandas")
log_reg.fit(X_train, y_train)

In [None]:
log_reg[-1]

In [None]:
log_reg[-1].feature_names_in_

## Exercise 1

1. The Wisconsin cancer dataset is loaded into `X` and `y`. Split the dataset into a training set and a test set.
    - **Hint**: Remember to use `stratify=y`.
2. Which feature(s) of the dataset have missing values?
    - **Hint**: Use pandas' `isna().sum()`.
3. Use a `SimpleImputer` with `add_indicator=True` and `set_output(transform="pandas")` and run `fit_transform` on the training set. What is the shape of the transformed data?

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer

cancer = fetch_openml(data_id=15, as_frame=True, parser="pandas")
X, y = cancer.data, cancer.target

**If you are running locally**, you can uncomment the following cell to load the solution into the cell. On **Google Colab**, [see solution here](https://github.com/thomasjpfan/ml-workshop-intermediate-v2/blob/main/notebooks/solutions/01-ex01-solutions.py). 

In [None]:
# %load solutions/01-ex01-solutions.py

## Exercise 2

1. Build a pipeline with `StandardScaler`, `KNNImputer(add_indicator=True)`, and `LogisticRegression`, and configure it for pandas output.
1. Train the pipeline on the Wisconsin cancer training set and evaluate the performance of the model on the test set.
1. Create a pandas Series whose values are the coefficients of `LogisticRegression` and whose index is its `feature_names_in_`.
    - **Hint**: The logistic regression estimator is the final step of the pipeline.
    - **Hint**: The coefficients are stored as `coef_` on the logistic regression estimator. (Use `ravel` to flatten the array.)
1. Which feature has a negative impact on cancer?

In [None]:
from sklearn.impute import KNNImputer
import pandas as pd

**If you are running locally**, you can uncomment the following cell to load the solution into the cell. On **Google Colab**, [see solution here](https://github.com/thomasjpfan/ml-workshop-intermediate-v2/blob/main/notebooks/solutions/01-ex02-solutions.py). 

In [None]:
# %load solutions/01-ex02-solutions.py

## Global configuration

With `sklearn.set_config(transform_output="pandas")`, every transformer outputs pandas DataFrames by default!

In [None]:
import sklearn
sklearn.set_config(transform_output="pandas")

In [None]:
cancer = fetch_openml(data_id=15, as_frame=True, parser="pandas")
X, y = cancer.data, cancer.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

In [None]:
# SimpleImputer is an imputer, not a scaler, so name the variable accordingly
imputer = SimpleImputer(add_indicator=True)

In [None]:
imputer.fit_transform(X_train)