# Feature Engineering: Derive New Input Variables

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

## Overview

This module covers polynomial feature engineering for predictive modeling tasks. We explore how to create new features by transforming existing input variables using polynomial combinations and interactions between features.

## Learning Objectives

- Learn how machine learning algorithms perform with polynomial input features
- Understand how to use the polynomial features transform to create new versions of input variables
- Examine how the degree of polynomial impacts the number of input features created
- Apply polynomial feature transformation to real datasets

### Tasks to complete

- Implement polynomial feature transforms
- Evaluate model performance with different polynomial degrees
- Compare results across transformations
- Create visualizations of feature relationships

## Prerequisites

- Python 3.x environment
- Basic understanding of Python programming
- Basic understanding of machine learning concepts
- Familiarity with NumPy and scikit-learn libraries

## Get Started

Setup steps:

- Install required Python packages:
  - numpy
  - pandas
  - requests
  - scikit-learn
  - matplotlib

To start, we install required packages, import the necessary libraries, and define a helper function to download data using the `requests` library.

### Install required packages

In [None]:
%pip install matplotlib numpy pandas requests scikit-learn

### Import necessary libraries

In [None]:
import warnings
from pathlib import Path

import requests
from matplotlib import pyplot as plt
from numpy import asarray, mean, std
from pandas import DataFrame, read_csv
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, PolynomialFeatures


### Define utility functions

Define a helper function for downloading example datasets.  

*Note!* It is not essential that you understand the following code.  It is just for getting the example data.

In [None]:
def download(url, to_file):
    """Download content from the given URL and save it to a file.

    Args:
        url (str): The URL to download the content from.
        to_file (str): The name of the file to save the downloaded content to.

    """
    response = requests.get(url, timeout=10, headers={"user-agent": "curl/7.81.0"})
    # fail loudly on HTTP errors instead of saving an error page to disk
    response.raise_for_status()
    Path(to_file).write_bytes(response.content)
    print(f"downloaded file '{to_file}'")

## Polynomial Features

Polynomial features are features created by raising existing features to an exponent. For
example, if a dataset had one input feature X, a polynomial feature would be a new feature
(column) whose values are calculated by squaring the values in X, i.e. $X^2$.
This process can be repeated for each input variable in the dataset, creating a transformed
version of each. As such, polynomial features are a type of feature engineering, i.e. the creation
of new input features based on the existing features. The degree of the polynomial controls
the number of features added, e.g. a degree of 3 will add two new variables (the square and the
cube) for each input variable. Typically a small degree is used, such as 2 or 3.

It is also common to add new variables that represent the interaction between features, e.g.
a new column that represents one variable multiplied by another. This too can be repeated
across the dataset, creating a new interaction variable for each pair of input variables. A
squared or cubed version of an input variable will change its probability distribution, separating
the small and large values, and the separation increases with the size of the exponent.

This separation can help some machine learning algorithms make better predictions and is
common for regression predictive modeling tasks and generally tasks that have numerical input
variables. Typically linear algorithms, such as linear regression and logistic regression, respond
well to the use of polynomial input variables.
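As a rough illustration of the two ideas above, the sketch below builds the squared and interaction columns by hand with NumPy. The array values and the names `X`, `a`, `b`, and `X_poly` are made up for this example and are not part of the original material.

```python
import numpy as np

# two illustrative input features (values chosen arbitrarily)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

a, b = X[:, 0], X[:, 1]

# keep the raw inputs, then add a squared version of each input
# and one interaction column (a multiplied by b)
X_poly = np.column_stack([a, b, a**2, a * b, b**2])
print(X_poly)
```

Each row now carries five values instead of two: the raw inputs, their squares, and their product.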

## Polynomial Feature Transform

The polynomial features transform is available in the scikit-learn Python machine learning
library via the `PolynomialFeatures` class.

It generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form $[a, b]$, the degree-2 polynomial features are $[1, a, b, a^2, ab, b^2]$.

In [None]:
# demonstrate the types of polynomial features created

# define the dataset
data = asarray([[2, 3], [2, 3], [2, 3]])
print(data)

# perform a polynomial features transform of the dataset
# generate polynomial and interaction features.
# 'degree' specifies the maximal degree of the polynomial features
trans = PolynomialFeatures(degree=2)
data = trans.fit_transform(data)
print(data)

Running the example first reports the raw data with two features (columns), where each column
holds a single repeated value: 2 in the first and 3 in the second. Then the polynomial features
are created, resulting in six features, matching what was described above.

The *degree* argument controls the number of features created and defaults to 2. The
*interaction_only* argument means that only the raw values (degree 1) and the interaction
(pairs of values multiplied with each other) are included, defaulting to False. The *include_bias*
argument defaults to True to include the bias feature.
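The effect of these arguments can be checked on a single row. This is a minimal sketch using the same values 2 and 3 as the example above; the variable names `row`, `inter`, and `no_bias` are illustrative.

```python
from numpy import asarray
from sklearn.preprocessing import PolynomialFeatures

row = asarray([[2, 3]])

# interaction_only=True keeps the bias, the raw values and the pairwise
# product, but drops the squared terms: 1, a, b, ab
inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(row)
print(inter)

# include_bias=False keeps all degree-2 terms but drops the constant
# column: a, b, a^2, ab, b^2
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(row)
print(no_bias)
```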

## Sonar Dataset

The sonar dataset is a standard machine learning dataset for binary classification. It involves
60 real-valued inputs and a two-class target variable. The dataset contains sonar signals bounced
off a metal cylinder (a simulated mine) or rocks, obtained from a variety of aspect angles. Each
number represents the energy within a particular frequency band, integrated over a certain period
of time. There are 208 examples in the dataset and the classes are reasonably balanced. You can
learn more about the dataset from here:

* Sonar Dataset ([sonar.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv))
* Sonar Dataset Description ([sonar.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.names))

### Download Sonar data files

In [None]:
download(
    url="https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv",
    to_file="sonar.csv",
)

download(
    url="https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.names",
    to_file="sonar.names",
)

### Summarizing the variables from the sonar dataset

In [None]:
# load and summarize the sonar dataset

# load dataset
dataset = read_csv("sonar.csv", header=None)
print(dataset.head())

# summarize the shape of the dataset
print(dataset.shape)

# summarize each variable
print(dataset.describe())


This confirms the 60 input variables, one output variable, and 208 rows of data. A statistical
summary of the input variables is provided showing that values are numeric and range
approximately from 0 to 1.

In [None]:
# histograms of the variables
axes = dataset.hist(xlabelsize=4, ylabelsize=4)
# shrink each subplot title so the 60 small plots stay legible
for ax in axes.ravel():
    ax.title.set_size(4)

# show the plot
plt.show()

Finally, a histogram is created for each input variable. If we ignore the clutter of the plots and
focus on the histograms themselves, we can see that many variables have a skewed distribution.
The dataset provides a good candidate for using a quantile transform to make the variables
more Gaussian.

Next, let's fit and evaluate a machine learning model on the raw dataset. We will use
a k-nearest neighbor algorithm with default hyperparameters and evaluate it using repeated
stratified k-fold cross-validation.

In [None]:
# evaluate knn on the raw sonar dataset

# KFold is a cross-validator that divides the dataset into k folds.
#
# Stratified is to ensure that each fold of dataset has the same proportion of
#   observations with a given label.
#
# Repeated provides a way to improve the estimated performance of a machine
#   learning model.
#
# This involves simply repeating the cross-validation procedure multiple times
# and reporting the mean result across all folds from all runs. This mean result
# is expected to be a more accurate estimate of the true unknown underlying mean
# performance of the model on the dataset, as calculated using the standard
# error.

# load dataset
dataset = read_csv("sonar.csv", header=None)
data = dataset.values

# separate into input and output columns
X, y = data[:, :-1], data[:, -1]

# ensure inputs are floats and output is an integer label
X = X.astype("float32")
y = LabelEncoder().fit_transform(y.astype("str"))

# define and configure the model
# n_neighbors : int, default=5
model = KNeighborsClassifier()

# evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

# report model performance
print("Accuracy: %.3f (%.3f)" % (mean(n_scores), std(n_scores)))

In this case we can see that the model achieved a mean classification accuracy of about 79.7
percent.
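The repeated stratified k-fold procedure used above can be illustrated on a tiny example. The sketch below uses made-up data (20 rows, two balanced classes) and smaller settings than the tutorial so the splits are easy to inspect.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# toy data: 20 rows, two balanced classes (made up for illustration)
X = np.zeros((20, 1))
y = np.array([0] * 10 + [1] * 10)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
splits = list(cv.split(X, y))

# 5 folds repeated 2 times gives 10 train/test splits
print(len(splits))

# every test fold keeps the 50/50 class balance of the full dataset
for train_idx, test_idx in splits:
    print(np.bincount(y[test_idx]))
```

With `n_splits=10` and `n_repeats=3`, as in the tutorial, `cross_val_score` returns 30 accuracy scores, and the reported mean and standard deviation are computed over those.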

## Polynomial Feature Transform Example

We can apply the polynomial features transform to the Sonar dataset directly. In this case, we
will use a degree of 3.

In [None]:
# visualize a polynomial features transform of the sonar dataset

# load dataset
dataset = read_csv("sonar.csv", header=None)

# retrieve just the numeric input values
data = dataset.values[:, :-1]

# perform a polynomial features transform of the dataset
trans = PolynomialFeatures(degree=3)
data = trans.fit_transform(data)

# convert the array back to a dataframe
dataset = DataFrame(data)

# summarize
print(dataset.shape)

We can see that the 60 input features of the raw dataset (61 columns including the target)
become 39,711 columns after the transform: 39,710 polynomial features plus the constant bias
column.

Next, let's evaluate the same KNN model as the previous section, but in this case on a
polynomial features transform of the dataset.

In [None]:
# evaluate knn on the sonar dataset with polynomial features transform

# load dataset
dataset = read_csv("sonar.csv", header=None)
data = dataset.values

# separate into input and output columns
X, y = data[:, :-1], data[:, -1]

# ensure inputs are floats and output is an integer label
X = X.astype("float32")
y = LabelEncoder().fit_transform(y.astype("str"))

# define the pipeline
trans = PolynomialFeatures(degree=3)
model = KNeighborsClassifier()
pipeline = Pipeline(steps=[("t", trans), ("m", model)])

# evaluate the pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

# report pipeline performance
print("Accuracy: %.3f (%.3f)" % (mean(n_scores), std(n_scores)))

We can see that the polynomial features transform results in a slight lift in
performance from 79.7 percent accuracy without the transform to about 80.0 percent with the
transform.

## Effect of Polynomial Degree

The degree of the polynomial dramatically increases the number of input features. To get an
idea of how much this impacts the number of features, we can perform the transform with a
range of different degrees and compare the number of features in the dataset.

In [None]:
# compare the effect of the degree on the number of created features

# get the dataset
def get_dataset(filename):
    # load dataset
    dataset = read_csv(filename, header=None)
    data = dataset.values
    # separate into input and output columns
    X, y = data[:, :-1], data[:, -1]
    # ensure inputs are floats and output is an integer label
    X = X.astype("float32")
    y = LabelEncoder().fit_transform(y.astype("str"))
    return X, y


# define dataset
X, y = get_dataset("sonar.csv")

# calculate change in number of features
num_features = []
degrees = list(range(1, 5))
for d in degrees:
    # create transform
    trans = PolynomialFeatures(degree=d)
    # fit and transform
    data = trans.fit_transform(X)
    # record number of features
    num_features.append(data.shape[1])
    # summarize
    print("Degree: %d, Features: %d" % (d, data.shape[1]))

# plot degree vs number of features
plt.plot(degrees, num_features)
plt.show()

We can see that a degree of 1 leaves the inputs unchanged apart from the added bias column, and
that the number of features dramatically increases from 2 through to 4. This highlights that for
anything other than very small datasets, a degree of 2 or 3 should be used to avoid a dramatic
increase in input variables.
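These counts can also be worked out without running the transform. With $n$ input variables and degree $d$, the number of polynomial terms of total degree at most $d$, including the bias term, is the binomial coefficient $\binom{n+d}{d}$. The sketch below evaluates it for the 60 sonar inputs; the variable names are illustrative.

```python
from math import comb

n_inputs = 60  # number of input variables in the sonar dataset

for degree in range(1, 5):
    # number of monomials of total degree <= degree in n_inputs variables,
    # including the constant (bias) term
    n_features = comb(n_inputs + degree, degree)
    print(f"Degree: {degree}, Features: {n_features}")
```

For degree 3 this gives 39,711, matching the shape reported earlier for the transformed sonar data.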

More features may result in more overfitting, and in turn, worse results. It may be a good
idea to treat the degree for the polynomial features transform as a hyperparameter and test
different values for your dataset. The example below explores degree values from 1 to 4 and
evaluates their effect on classification accuracy with the chosen model.

*Note: The following code block could take a few minutes to complete.*

In [None]:
# explore the effect of degree on accuracy for the polynomial features transform

warnings.filterwarnings("ignore")


# get the dataset
def get_dataset(filename):
    # load dataset
    dataset = read_csv(filename, header=None)
    data = dataset.values
    # separate into input and output columns
    X, y = data[:, :-1], data[:, -1]
    # ensure inputs are floats and output is an integer label
    X = X.astype("float32")
    y = LabelEncoder().fit_transform(y.astype("str"))
    return X, y


# get a list of models to evaluate
def get_models():
    models = {}
    for d in range(1, 5):
        # define the pipeline
        trans = PolynomialFeatures(degree=d)
        model = KNeighborsClassifier()
        models[str(d)] = Pipeline(steps=[("t", trans), ("m", model)])
    return models


# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # Feel free to adjust `n_jobs` to use as many cores as you would like.
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=1)
    return scores


# define dataset
X, y = get_dataset("sonar.csv")

# get the models to evaluate
models = get_models()

# evaluate the models and store results
results = []
names = []
for name, model in models.items():
    scores = evaluate_model(model, X, y)
    results.append(scores)
    names.append(name)
    print(">%s %.3f (%.3f)" % (name, mean(scores), std(scores)))


We can see that performance is generally worse than with no transform (degree 1), except for
degree 3. It might be interesting to explore scaling the data before or after
performing the transform to see how it impacts model performance.
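One way to try the scaling idea is to add a scaler as an extra pipeline step ahead of the transform. The sketch below is only a starting point under stated assumptions: it uses `MinMaxScaler` (one possible choice of scaler), a degree of 2, and a synthetic dataset from `make_classification` as a stand-in so that it runs on its own. Swap in the sonar `X` and `y` to compare against the results above.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# synthetic stand-in data (assumption, not the sonar dataset)
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# scale each input to [0, 1] first, then expand to degree-2 polynomial features
pipeline = Pipeline(
    steps=[
        ("s", MinMaxScaler()),
        ("t", PolynomialFeatures(degree=2)),
        ("m", KNeighborsClassifier()),
    ]
)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv)
print("Accuracy: %.3f (%.3f)" % (mean(scores), std(scores)))
```

Moving the `"s"` step after `"t"` scales the polynomial features instead of the raw inputs, which is the other ordering mentioned above.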

Box and whisker plots can be created to summarize the classification accuracy scores for each
polynomial degree.

In [None]:
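# plot model performance for comparison
plt.boxplot(results, showmeans=True)
# label each box with its polynomial degree (boxes sit at x = 1, 2, ...)
plt.xticks(range(1, len(names) + 1), names)
plt.show()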

We can see that performance remains flat, perhaps with the first signs of overfitting with a
degree of 4.
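Treating the degree as a hyperparameter can also be expressed as a grid search rather than a manual loop. The sketch below is one possible setup, not the approach used in this module: it relies on scikit-learn's `GridSearchCV`, a synthetic stand-in dataset, and a reduced range of degrees so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# synthetic stand-in data (assumption, not the sonar dataset)
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

pipeline = Pipeline(steps=[("t", PolynomialFeatures()), ("m", KNeighborsClassifier())])

# "t__degree" addresses the degree parameter of the pipeline step named "t"
grid = GridSearchCV(
    pipeline,
    param_grid={"t__degree": [1, 2, 3]},
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

The per-degree mean scores are available in `grid.cv_results_["mean_test_score"]` if a comparison across all degrees is needed.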

## Conclusion

Through this module, we learned how to create polynomial features to potentially improve model performance. We explored the impact of different polynomial degrees on feature space dimensionality and model accuracy. The techniques demonstrated show how feature engineering can expose non-linear relationships in data that may improve predictive modeling results.

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.
