<img src='../images/gdd-logo.png' width='250px' align='right' style="padding: 15px">

# Feature Engineering

How well a machine learning model performs depends partly on how you choose to represent your data. Feature engineering is the **practice of creating new features** from your **existing** and **additional** data sources to improve model performance and/or model interpretability. It is often where the biggest improvements in model performance happen: not through fancy algorithms, but by giving the model better representations of the data.

**Program:**
- [Introduction to Feature Engineering](#discussion)
- [About the data](#data)
- [Baseline model](#baseline)
- [Feature Engineering](#engineering)
    - [Indices and ratios](#ratios)
    - [Engineering from texts](#texts)
    - [Discretization](#binning)
    - [Combining outside sources](#outside)
- [Types of Feature Engineering](#types)
- [Conclusion](#conclusion)
- [Next Steps](#next)

<a id=data></a>

## About the data

In [None]:
import pandas as pd

# data processing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (OneHotEncoder, KBinsDiscretizer)
from sklearn.impute import SimpleImputer

# pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# metrics
from sklearn.metrics import roc_auc_score

<img src='../images/feature_engineering/stroke_man.png' width='250px' align='right' style="padding: 15px">

According to the World Health Organization (WHO), strokes are the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.

In this notebook, you will **predict whether a patient is likely to get a** `stroke`, based on input parameters like gender, age and whether or not they smoke. Each row in the data provides relevant information about the patient.

### Features

1. `id`: unique identifier
1. `address`: A general address (town, state (abbreviation) & postal code)
1. `gender`: "Male", "Female" or "Other"
1. `age`: age of the patient
1. `height`: height of the patient
1. `weight`: weight of the patient
1. `hypertension`: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
1. `heart_disease`: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
1. `ever_married`: "No" or "Yes"
1. `work_type`: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
1. `residence_type`: "Rural" or "Urban"
1. `avg_glucose_level`: average glucose level in blood
1. `smoking_status`: "formerly smoked", "never smoked", "smokes" or "Unknown"
1. `stroke`: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient.*

Let's start by importing the data and looking at the dataframe:

In [None]:
stroke = pd.read_csv('../data/stroke.csv').rename(columns=str.lower)
stroke.head()

<mark>**Question:**</mark> What kind of features do you think would be useful to generate from this data?

<details>

  <summary><span style="color:blue">Show answers</span></summary>
  
Feature engineering often requires a good level of domain knowledge. The following areas could be looked into:
  - **Indices**: From height and weight, we could create a bmi column that is a bit more interpretable.
  - **Grouping continuous features**: Some features, such as age, may not show a linear relationship with the target. Grouping them into separate categories (e.g., young, middle-aged, elderly) can capture the different risk factors associated with stroke in those groups.
  - **Location Features**: Extracting information from the address column, such as separating town and state, or deriving features like region or proximity to healthcare facilities, could be useful in understanding geographical influences on stroke occurrence.
  - **Demographic Features**: Creating binary indicators for specific demographics, such as "is_urban" for residence_type or "is_married" for ever_married, can help the model capture demographic-specific patterns.
  - **External features**: If the state for each patient is extracted, an external feature could be mapped to the state, for example *number of healthcare facilities in the state per person*.
    
</details>

<a id='baseline'></a>

## Baseline model

Before you do any feature engineering, you should first create a baseline model on the existing features to see how the model performs from the start.

Below, you can see the usual setup already implemented:
1. Define the features $X$ and target $y$,
2. Split the data into training and test data,
3. Define a `Pipeline` for one-hot encoding, imputing missing values, and the model, and
4. Fit and evaluate a model.

#### 1. Define features X and target y

In [None]:
# Variable definitions
categorical_cols = ['work_type', 'smoking_status', 'who', 'gender', 'residence_type']
numeric_cols = ['age', 'hypertension', 'heart_disease', 'ever_married', 'avg_glucose_level', 'height', 'weight']
missing_cols = ['age','height', 'weight']
drop_cols = ['id','address']

target = 'stroke'

def create_Xy(df, drop_cols, target_col):
    df = df.drop(columns=drop_cols)
    return (
        df.drop(columns=target_col),
        df[target_col]
    )

X, y = stroke.pipe(create_Xy, 
                   drop_cols=drop_cols, 
                   target_col=target,
                   )

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.25,
                                                    random_state = 123,
                                                    stratify = y,
                                                    )

#### 2. Set up pipeline, fit and score model

In [None]:
onehot = Pipeline(steps = [
    ('onehot', OneHotEncoder(drop = "if_binary")),
])

impute = Pipeline(steps = [
    ('impute', SimpleImputer(strategy ='mean')),
])

preprocessor = ColumnTransformer(transformers = [
    ('onehot', onehot, categorical_cols),
    ('impute', impute, numeric_cols)
], remainder = 'passthrough')

base_model = RandomForestClassifier(class_weight='balanced',
                                    max_depth=3,
                                    random_state=123,
                                    )

base_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', base_model)
])

In [None]:
base_pipeline.fit(X_train, y_train)

In [None]:
y_baseline_train_probs = base_pipeline.predict_proba(X_train)[:,1]
y_baseline_test_probs = base_pipeline.predict_proba(X_test)[:,1]

print(f'Train AUC: {roc_auc_score(y_train, y_baseline_train_probs):.4f}',
      f'Test AUC: {roc_auc_score(y_test, y_baseline_test_probs):.4f}',
      sep='\n'
      )
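The fit-and-evaluate step above is repeated for every new feature set in this notebook, so it can be convenient to wrap it in a small helper. The sketch below is one possible way to do that. The helper name `fit_and_score` and the synthetic data from `make_classification` are assumptions for illustration and are not part of the original notebook. The helper uses `sklearn.base.clone` so that every call fits a fresh, independent copy of the pipeline instead of re-fitting a model instance that other pipelines share.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def fit_and_score(pipeline, X_train, X_test, y_train, y_test):
    """Fit a fresh copy of `pipeline` and return it with train/test AUC."""
    # clone() copies the unfitted pipeline, so the original stays untouched
    fitted = clone(pipeline).fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, fitted.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, fitted.predict_proba(X_test)[:, 1])
    return fitted, train_auc, test_auc


# Synthetic, imbalanced stand-in data (NOT the stroke dataset)
X_demo, y_demo = make_classification(n_samples=400,
                                     n_features=6,
                                     weights=[0.9, 0.1],
                                     random_state=123,
                                     )
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo,
                                                        y_demo,
                                                        test_size=0.25,
                                                        random_state=123,
                                                        stratify=y_demo,
                                                        )

demo_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('model', RandomForestClassifier(class_weight='balanced',
                                     max_depth=3,
                                     random_state=123)),
])

demo_fitted, demo_train_auc, demo_test_auc = fit_and_score(
    demo_pipeline, Xd_train, Xd_test, yd_train, yd_test
)

print(f'Train AUC: {demo_train_auc:.4f}',
      f'Test AUC: {demo_test_auc:.4f}',
      sep='\n'
      )
```

Returning the fitted copy keeps each experiment separate, which makes it easier to compare pipelines side by side later on.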

--- 

<a id='engineering'></a>

## Feature Engineering

Let's start engineering some features!

<a id='ratios'></a>
### Indices and Ratios

***Indices or ratios*** are features that are calculated by combining multiple original features via division, multiplication, addition, or subtraction.

For example, `bmi` is widely adopted as an index in the healthcare domain, which can be created as the combination of `height` and `weight` using this formula:


$$ BMI = \dfrac{weight}{height^2} $$

In practice, indices often do not improve model performance. However, they can increase overall model interpretability. For example, it is easier to look at the effect of BMI on stroke than to work out the interaction of height and weight ourselves.
    


**Note:** There is no Transformer in sklearn that creates a BMI index, so you will do this in Pandas for now (ideally, you would want to build a custom transformer later on).
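As a rough idea of what such a custom transformer could look like, the sketch below wraps a plain function in scikit-learn's `FunctionTransformer`. The function name `add_bmi` and the toy values are made up for illustration, and the units (weight in kg, height in m) are an assumption about the data.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def add_bmi(df):
    """Return a copy of df with an extra bmi column (weight / height squared)."""
    return df.assign(bmi=lambda d: d['weight'] / d['height'] ** 2)


# Wrapping the function makes it usable as a step in a Pipeline
bmi_transformer = FunctionTransformer(add_bmi)

# Made-up toy data; units assumed to be kg and m
toy = pd.DataFrame({
    'height': [1.60, 1.80, None],
    'weight': [64.0, 81.0, 70.0],
})

toy_bmi = bmi_transformer.fit_transform(toy)
print(toy_bmi)
```

A missing height or weight results in a missing BMI, so the new column needs imputation just like its inputs.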

#### <mark>Exercise:</mark> Create a BMI index

What you should do:
1. Create a new feature matrix and call it `X_bmi` (*Hint: We already provided the scaffolding for this below*)
2. Using the Pandas `.assign()` method, calculate the BMI and call it `bmi`.

In [None]:
X_bmi, _ = (
    stroke
    # create a bmi column here
    .pipe(create_Xy,
          drop_cols=drop_cols,
          target_col=target)
)

**Answers**: Uncomment and run the cells below for answers

In [None]:
# %load ../answers/feature_engineering/create-bmi.py

With the `bmi` column created, you can now re-fit your baseline model to see whether this new feature improves performance.

In [None]:
# Train-test split
X_train_bmi, X_test_bmi, _, _ = train_test_split(X_bmi,
                                                    y,
                                                    test_size = 0.25,
                                                    random_state = 123,
                                                    stratify = y,
                                                    )


# add bmi to list of cols with NA's
missing_cols = ['age','height', 'weight', 'bmi']

# Refit model pipeline
preprocessor_bmi = ColumnTransformer(transformers = [
    ('onehot', onehot, categorical_cols),
    ('impute', impute, missing_cols),
], remainder='passthrough')

pipeline_bmi = Pipeline(steps=[
    ('preprocessor', preprocessor_bmi),
    ('model', base_model)
])

pipeline_bmi.fit(X_train_bmi, y_train)

y_train_probs_bmi = pipeline_bmi.predict_proba(X_train_bmi)[:,1]
y_test_probs_bmi = pipeline_bmi.predict_proba(X_test_bmi)[:,1]

print(f'Train AUC: {round(roc_auc_score(y_train, y_train_probs_bmi),4)}',
      f'Test AUC: {round(roc_auc_score(y_test, y_test_probs_bmi),4)}',
      sep='\n'
      )


<a id='texts'></a>
### Engineering features from texts or strings

<img src='../images/feature_engineering/confused-robot.png' width='150px' align='right' style="padding: 15px">

In real-life settings, a lot of the information available to us is in the form of unstructured or unprocessed data, such as texts, longer strings, or pictures. For traditional ML models to understand this data, you will have to engineer meaningful features yourself before feeding it into the model.

<mark>**Question:**</mark> Which string-based feature is currently being ignored and what information would be useful to get from it?

<details>

  <summary><span style="color:blue">Show answers</span></summary>

The `address` column is currently being dropped, due to the fact that it is unique for each patient. However, it contains categories (such as states and zip codes) that may be useful. 

</details>

First look at how many states and zip codes are in the data:

In [None]:
stroke['address'].str.split().str[-1].nunique()

Zip codes are not a good basis for grouping patients, since there is a vast number of them. States look more promising:

In [None]:
stroke['address'].str.split().str[-2].nunique()

There are only 17 unique states. Let's look at how many are in each group:

In [None]:
stroke['address'].str.split().str[-2].value_counts()
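To see what the `.str.split().str[-1]` and `.str.split().str[-2]` calls above do, here is a small sketch on made-up addresses. It assumes the "town, state postal code" layout described in the feature list. The addresses themselves are invented for illustration.

```python
import pandas as pd

# Made-up addresses following the assumed "town, state postal-code" layout
toy_addresses = pd.Series([
    'Springfield, IL 62704',
    'New Haven, CT 06510',
])

parts = toy_addresses.str.split()   # split each address on whitespace
zip_code = parts.str[-1]            # last token: postal code
state = parts.str[-2]               # second-to-last token: state

print(pd.DataFrame({'state': state, 'zip_code': zip_code}))
```

Indexing from the end of the token list keeps the extraction working for town names that consist of several words.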

<mark>**Question:**</mark> Will zip codes be a useful feature here? Why/Why not?

<details>

  <summary><span style="color:blue">Show answers</span></summary>

It will not be useful because there are almost as many zip codes as patients. This will make for very sparse and uninformative features when one-hot encoded.

</details>

The smallest states have only 51 patients. Considering how few actual stroke cases are in the dataset (~5%), you may have too few samples per state. Ideally, you would group states further into larger groups.

#### <mark>Exercise:</mark> Extract the state from the address

What you should do:
1. Create a new feature matrix and call it `X_state`
2. Extract the *state* of each address and add a `state` column to the dataframe.<br>
*Hint: To extract the states from the address, take a look at the code above where we calculated the number of unique states.* 
3. Why could the state be a useful feature?

<details>
  <summary><span style="color:blue">Show answers</span></summary>

3. There may be systematic differences between states in terms of lifestyles, obesity, genetics, demographics etc., which makes the state an indirect predictor of these differences.
    
</details>

In [None]:
X_state, _ = (
    stroke
    .assign(bmi = lambda df: df['weight'] / df['height']**2)
    # create a state column here
    .pipe(create_Xy, 
          drop_cols=drop_cols, 
          target_col=target,
          )
)

In [None]:
# %load ../answers/feature_engineering/create-states.py

You can now re-fit the model to see if the state data increases model performance:

In [None]:
# Train-test split
X_train_state, X_test_state, _, _ = train_test_split(X_state,
                                                    y,
                                                    test_size = 0.25,
                                                    random_state = 123,
                                                    stratify = y,
                                                    )

# include state in list of categorical cols
categorical_cols = ['work_type', 'smoking_status', 'who', 'gender', 'residence_type', 'state']

# pipeline using both bmi and state
preprocessor_state = ColumnTransformer(transformers = [
    ('onehot', onehot, categorical_cols),
    ('impute', impute, missing_cols),
], remainder='passthrough')

pipeline_state = Pipeline(steps=[
    ('preprocessor', preprocessor_state),
    ('model', base_model)
])

pipeline_state.fit(X_train_state, y_train)

y_state_train_probs = pipeline_state.predict_proba(X_train_state)[:,1]
y_state_test_probs = pipeline_state.predict_proba(X_test_state)[:,1]

print(f'Train AUC: {round(roc_auc_score(y_train, y_state_train_probs),4)}',
      f'Test AUC: {round(roc_auc_score(y_test, y_state_test_probs),4)}',
      sep='\n'
      )

<a id='binning'></a>
### Discretization

*Discretization* refers to the idea of categorizing continuous data.

For example, you may want to discretize the average glucose level into "too low", "normal", and "too high". 

Discretization helps simplify variables, which can make the model more interpretable. In addition, it can help reduce noise and dampen the influence of outliers in the data.

In `sklearn`, you can use the [`KBinsDiscretizer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html) transformer, which can (you guessed it) be added to our existing `preprocessor` pipeline.
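To get a feeling for what `KBinsDiscretizer` produces, the sketch below bins a handful of made-up ages into three quantile-based bins. It uses `encode='ordinal'` purely to make the result easy to inspect. The default is `encode='onehot'`, which returns one indicator column per bin.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up ages as a single feature column (2D array, as sklearn expects)
toy_age = np.array([[5], [15], [25], [35], [45], [55], [65], [75], [85]])

# encode='ordinal' returns the bin index instead of one-hot columns
toy_binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
toy_bins = toy_binner.fit_transform(toy_age)

print(toy_binner.bin_edges_[0])   # bin boundaries learned from the data
print(toy_bins.ravel())           # bin index assigned to each age
```

With `strategy='quantile'`, the boundaries are chosen so that each bin holds roughly the same number of samples.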

For demonstration, let's discretize `age`, `bmi`, and `avg_glucose_level` into 6 quantile-based bins and see how the model does.

In [None]:
binning_cols = ['age','bmi','avg_glucose_level']

binner = Pipeline(steps = [
    ('impute', SimpleImputer(strategy ='mean')),
    ('bin', KBinsDiscretizer(n_bins = 6, strategy = 'quantile'))
])

preprocessor_bin = ColumnTransformer(transformers = [
    ('onehot', onehot, categorical_cols),
    ('impute', impute, missing_cols),
    ('bin', binner, binning_cols)
], remainder='passthrough')

pipeline_bin = Pipeline(steps=[
    ('preprocessor', preprocessor_bin),
    ('model', base_model)
])

pipeline_bin.fit(X_train_state, y_train)

<mark>**Question:**</mark> Why does it not make sense to discretize `hypertension` or `heart_disease`?

<details>

  <summary><span style="color:blue">Show answers</span></summary>

1. Because these are binary (1 vs. 0) measures.
    
</details>

In [None]:
y_bin_train_probs = pipeline_bin.predict_proba(X_train_state)[:,1]
y_bin_test_probs = pipeline_bin.predict_proba(X_test_state)[:,1]

print(f'Train AUC: {round(roc_auc_score(y_train, y_bin_train_probs),4)}',
      f'Test AUC: {round(roc_auc_score(y_test, y_bin_test_probs),4)}',
      sep='\n'
)

<a id='outside'></a>
### Combining outside sources
Combining outside sources for feature engineering involves integrating external information with the original dataset to create new, more informative features.

Incorporating outside information provides additional contextual information that may not be available in the original dataset, potentially improving model performance.
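Mechanically, this usually comes down to joining a lookup table onto the original rows. The sketch below shows the idea on made-up data. The table contents and the column name `beds_per_1000` are hypothetical and do not come from any real source.

```python
import pandas as pd

# Made-up patient rows and a made-up state-level lookup table
toy_patients = pd.DataFrame({
    'patient': [1, 2, 3, 4],
    'state': ['CT', 'IL', 'CT', 'TX'],
})
toy_lookup = pd.DataFrame({
    'state': ['CT', 'IL'],
    'beds_per_1000': [2.0, 2.5],   # hypothetical external feature
})

toy_merged = toy_patients.merge(
    toy_lookup,
    on='state',
    how='left',               # keep every patient, even without a match
    validate='many_to_one',   # raise if a state appears twice in the lookup
)

print(toy_merged)
```

A left join keeps the number and order of patient rows unchanged, so the result still lines up with the target. States missing from the lookup (here `TX`) end up with missing values that need to be imputed.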

#### <mark>Exercise:</mark> Using public stroke statistics to augment your data
<img src='../images/feature_engineering/cdc-stroke-mortality.png' width='400px' align='right' style="padding: 15px">


The ***US Centers for Disease Control and Prevention (CDC)*** publishes [yearly stroke mortality rates](https://www.cdc.gov/nchs/pressroom/sosmap/stroke_mortality/stroke.htm) for each US state. You can use this information to link each patient to their state's mortality rate.

1. Incorporate the `rate` data from `2021` into your dataset and call the new feature matrix `X_rate`.
2. Why is the `rate` column more informative than the total `deaths` column?

*Hint: You will need to [merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) the `state_rate2021` dataframe with your X_state dataframe.*

<details>

  <summary><span style="color:blue">Show answers</span></summary>

2. Because the `rate` column is already normalized by the population of each state. If we went with total `deaths`, Texas (much more populous) would seem much more deadly compared to, e.g., New Hampshire.

</details>

In [None]:
state_mortality = pd.read_csv('../data/cdc_mortality.csv').rename(columns=str.lower)

state_mortality.head()

In [None]:
# prune CDC data to 2021 and only state and rate columns
state_rate2021 = state_mortality.loc[lambda df: df['year'] == 2021, ['state','rate']]

# New feature matrix
X_rate, _ = (
    stroke
    .assign(bmi = lambda df: df['weight'] / df['height']**2)
    .assign(state = lambda df: df['address'].str.split().str[-2])

    ### your code here - merge the state_rate2021

    
    .pipe(create_Xy, 
          drop_cols=drop_cols, 
          target_col=target,
          )
)

In [None]:
# %load ../answers/feature_engineering/create-rate.py

Re-fit and evaluate the model pipeline with the new feature:

In [None]:
# Train-test split
X_train_rate, X_test_rate, _, _ = train_test_split(X_rate,
                                                    y,
                                                    test_size = 0.25,
                                                    random_state = 123,
                                                    stratify = y,
                                                    )

# Refit model pipeline
preprocessor_state = ColumnTransformer(transformers = [
    ('onehot', onehot, categorical_cols),
    ('bin', binner, binning_cols)
], remainder='passthrough', verbose_feature_names_out=False)

pipeline_state = Pipeline(steps=[
    ('preprocessor', preprocessor_state),
    ('impute', impute),
    ('model', base_model)
])

pipeline_state.fit(X_train_rate, y_train)

y_rate_train_probs = pipeline_state.predict_proba(X_train_rate)[:,1]
y_rate_test_probs = pipeline_state.predict_proba(X_test_rate)[:,1]

print(f'Train AUC: {round(roc_auc_score(y_train, y_rate_train_probs),4)}',
      f'Test AUC: {round(roc_auc_score(y_test, y_rate_test_probs),4)}',
      sep='\n'
      )

## Summary of the engineered features

Let's have a look at how many new features we have created:

In [None]:
len(base_pipeline[:-1].get_feature_names_out())

In [None]:
len(pipeline_state[:-1].get_feature_names_out())

Now that you have a good number of new features, the next step is to prune the feature space so that it only includes the most useful ones.

---

<a id='types'></a>
## Other types of feature engineering

Feature engineering is a way to highlight key information based on your own domain knowledge, which helps the model focus on what matters most. The techniques above are by no means an exhaustive list. Other examples include:
* **Date and time features**: Creating features from the dates available, e.g. holidays, time of the day or day of the week.  
* **Grouping sparse classes**: If some values of a feature have a low sample count, you might group them together under a broader category. For example, we could group our `state` column (which generates one new column per state through the one-hot encoder) into *northern* vs. *southern*, and *east-coast* vs. *west-coast* vs. *mid-west/central* states.
* **Group from threshold**: A new categorical variable derived from thresholds on another variable, e.g. `underweight`, `normal`, and `obese` based on the `BMI`.
* **Indicator from threshold**: An indicator variable (0 or 1) based on a threshold on a column, e.g. `retired` based on `age`. 
* **Statistical features**: For time-series or otherwise co-dependent data, it is useful to look at statistical features such as the variance or skewness of a data distribution. In financial applications, volatility (the variance) of a feature can be a good predictor for future performance (e.g., volatility in the last 30 days can indicate a trend shift in stock value).   
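Several of the ideas in this list take only a line or two of Pandas. The sketch below illustrates date features, grouping sparse classes, a group from thresholds, and an indicator from a threshold on made-up data. The `visit_date` column, the state-to-region mapping, and the retirement age of 67 are assumptions for illustration only. The BMI cut-offs (18.5, 25, 30) follow the commonly used WHO categories.

```python
import pandas as pd

# Made-up toy data; visit_date is a hypothetical column
toy = pd.DataFrame({
    'age': [30, 70, 55],
    'bmi': [17.0, 31.5, 23.0],
    'state': ['CT', 'TX', 'IL'],
    'visit_date': pd.to_datetime(['2021-01-04', '2021-07-10', '2021-12-25']),
})

# Hypothetical mapping from state to a coarser region
region_map = {'CT': 'northeast', 'IL': 'midwest', 'TX': 'south'}

toy_features = toy.assign(
    # date and time feature: day of the week
    visit_weekday=lambda d: d['visit_date'].dt.day_name(),
    # grouping sparse classes: many states -> few regions
    region=lambda d: d['state'].map(region_map),
    # group from threshold: left-closed bins, e.g. [18.5, 25) is "normal"
    bmi_group=lambda d: pd.cut(
        d['bmi'],
        bins=[0, 18.5, 25, 30, float('inf')],
        labels=['underweight', 'normal', 'overweight', 'obese'],
        right=False,
    ),
    # indicator from threshold: 1 if at or above the assumed retirement age
    retired=lambda d: (d['age'] >= 67).astype(int),
)

print(toy_features)
```

Which thresholds and groupings make sense depends on the domain, so it is worth checking them with a subject-matter expert before relying on them.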

### Other modalities
We have so far only dealt with tabular data. In practice, you may want to also include other modalities such as text (e.g., patient history or doctor's notes) and images (e.g., MRI scans, X-Rays or CT scans).

---
<a id='conclusion'></a>
## Conclusion

In this notebook, you have
- learned the essentials of feature engineering and its impact on model performance and interpretability,
- implemented core feature engineering techniques such as indices/ratios, binning, text extraction, and combining outside information to generate new features,
- gained insight into other techniques, including those used with different types of data, and
- seen that new features do not always increase model performance or interpretability, so it is good practice to assess both whenever you add new features.