# Polynomial Features and Interactions

We start from the existing baseline model (imports, data split, preprocessing, and a random forest pipeline):

In [None]:
import pandas as pd

# data processing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (OneHotEncoder,
                                   PolynomialFeatures, 
                                   KBinsDiscretizer)
from sklearn.impute import SimpleImputer

# pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# models
from sklearn.ensemble import RandomForestClassifier

# metrics
from sklearn.metrics import roc_auc_score

In [None]:
stroke = pd.read_csv('../data/stroke.csv').rename(columns=str.lower)

In [None]:
# Variable definitions
categorical_cols = ['work_type', 'smoking_status', 'who', 'gender', 'residence_type']
numeric_cols = ['age', 'hypertension', 'heart_disease', 'ever_married', 'avg_glucose_level', 'height', 'weight']
missing_cols = ['age','height', 'weight']
drop_cols = ['id','address']

target = 'stroke'

def create_Xy(df, drop_cols, target_col):
    df = df.drop(columns=drop_cols)
    return (
        df.drop(columns=target_col),
        df[target_col]
    )

X, y = stroke.pipe(create_Xy, 
                   drop_cols=drop_cols, 
                   target_col=target,
                   )

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.25,
                                                    random_state = 123,
                                                    stratify = y,
                                                    )

In [None]:
onehot = Pipeline(steps = [
    ('onehot', OneHotEncoder(drop = "if_binary")),
])

impute = Pipeline(steps = [
    ('impute', SimpleImputer(strategy ='mean')),
])

preprocessor = ColumnTransformer(transformers = [
    ('onehot', onehot, categorical_cols),
    ('impute', impute, numeric_cols)
], remainder = 'passthrough')

base_model = RandomForestClassifier(class_weight='balanced',
                                    max_depth=5,
                                    random_state=123,
                                    )

base_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', base_model)
])

In [None]:
base_pipeline.fit(X_train, y_train)

y_baseline_train_probs = base_pipeline.predict_proba(X_train)[:,1]
y_baseline_test_probs = base_pipeline.predict_proba(X_test)[:,1]

print(f'Train AUC: {round(roc_auc_score(y_train, y_baseline_train_probs),4)}',
      f'Test AUC: {round(roc_auc_score(y_test, y_baseline_test_probs),4)}',
      sep='\n'
      )

<a id='polynomial'></a>
### Polynomial Features and Interactions

***Polynomial features*** are created by raising the original features to a power greater than one. For example, a simple linear model can be expressed as $y(x_i) = a \cdot x_i + b$, with $y$ being the prediction, $x_i$ the feature, and $a$ and $b$ the coefficients. Adding a second-degree polynomial feature gives $y(x_i) = a \cdot x_i + c \cdot x_i^2 + b$. For some models (especially those that are linear in nature), polynomial features can be a useful addition.

***Interactions*** are products of two (or more) features. We could, for example, add an interaction to the previous equation like so: $y(x_i) = a \cdot x_i + c \cdot x_i^2 + d \cdot x_i \cdot x_i^2 + b$. Here, the term $d \cdot x_i \cdot x_i^2$ is the interaction between $x_i$ and $x_i^2$, which is simply the cubic term $d \cdot x_i^3$. More commonly, interactions are formed between different features, such as $x_1 \cdot x_2$, with $x_1$ being one feature and $x_2$ another.
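
To make the two definitions concrete, here is a minimal sketch that builds a squared term and an interaction term by hand with NumPy. The arrays `x1` and `x2` and their values are illustrative assumptions, not columns from the stroke data.

```python
import numpy as np

# Two toy features (hypothetical values, not from the stroke dataset)
x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 6.0])

# Polynomial feature: raise a single feature to a power greater than one
x1_squared = x1 ** 2

# Interaction: multiply two different features element-wise
x1_x2 = x1 * x2

# Stack the original features and the new terms into one feature matrix
X_expanded = np.column_stack([x1, x2, x1_squared, x1_x2])
print(X_expanded)
```

Each row now holds the two original values followed by $x_1^2$ and $x_1 \cdot x_2$.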

**Why can they be useful?**

Polynomials and interactions can help:

1. capture non-linear and complex relationships between the independent variable(s) and the target variable,
2. add additional model flexibility by including interactions within and between different degree polynomials,
3. and therefore (potentially) better capture the complexity of the underlying data patterns.
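
The first point can be illustrated with a small sketch on synthetic data, where the target is assumed to depend on the feature purely quadratically. The names `x_toy` and `y_toy` are made up for this example, and a linear regression stands in for "a model that is linear in nature".

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a purely quadratic relationship (assumption for illustration)
x_toy = np.linspace(-3, 3, 61).reshape(-1, 1)
y_toy = x_toy.ravel() ** 2

# A plain linear model only sees x and cannot follow the curve
linear = LinearRegression().fit(x_toy, y_toy)
r2_linear = linear.score(x_toy, y_toy)

# The same linear model fitted on [x, x^2] can
poly_linear = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(x_toy, y_toy)
r2_poly = poly_linear.score(x_toy, y_toy)

print(f'R^2 linear: {r2_linear:.3f}',
      f'R^2 with x^2: {r2_poly:.3f}',
      sep='\n'
      )
```

Tree-based models such as the random forest used here can already approximate non-linear relationships through their splits, so the gain from polynomial terms is typically smaller for them than for linear models.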

In `sklearn`, adding polynomial features and interactions is as easy as using the `PolynomialFeatures` class ([click for documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)), and adding it to our existing pipeline.

Let's add second-degree polynomial terms and pairwise interactions for all features.

In [None]:
bin_cols = ['age','bmi','avg_glucose_level']

preprocessor = ColumnTransformer(transformers = [
    ('onehot', onehot, categorical_cols),
    ('impute', impute, missing_cols),
], remainder='passthrough')

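from sklearn.base import clone

poly_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    # Placed after the preprocessor, so every preprocessed feature is expanded;
    # to expand only some columns, it would go inside the ColumnTransformer instead
    ('polynomial', PolynomialFeatures(degree = 2, interaction_only = False)),
    # clone() gives an unfitted copy, so base_pipeline keeps its own fitted model
    ('model', clone(base_model))
])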

poly_pipeline.fit(X_train, y_train)

y_poly_train_probs = poly_pipeline.predict_proba(X_train)[:,1]
y_poly_test_probs = poly_pipeline.predict_proba(X_test)[:,1]

print(f'Train AUC: {round(roc_auc_score(y_train, y_poly_train_probs),4)}',
      f'Test AUC: {round(roc_auc_score(y_test, y_poly_test_probs),4)}',
      sep='\n'
      )
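
One thing to keep in mind when expanding all features this way is how quickly the number of columns grows. The sketch below counts the output columns for a hypothetical input with 10 features (`n_features` and `X_dummy` are illustrative, not taken from the stroke pipeline) and compares them with the closed-form counts.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 10  # hypothetical number of preprocessed columns
X_dummy = np.zeros((1, n_features))

# Full degree-2 expansion: bias + originals + squares + pairwise products
full = PolynomialFeatures(degree=2, interaction_only=False).fit(X_dummy)

# Interactions only: bias + originals + pairwise products (no squares)
inter = PolynomialFeatures(degree=2, interaction_only=True).fit(X_dummy)

print(full.n_output_features_, comb(n_features + 2, 2))
print(inter.n_output_features_, 1 + n_features + comb(n_features, 2))
```

For 10 input features this gives 66 columns for the full expansion and 56 with `interaction_only=True`, so wide one-hot encoded inputs can grow into several hundred columns at degree 2.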