# Structural Patterns

Structural patterns are guidelines for composing objects so that they can provide new functionality. That functionality would be harder to achieve without the newly created class or function.

Structural patterns also make relationships between entities easier to handle and more manageable.

In simple words: **How to compose complex objects**

The seven structural patterns available are:
1. Adapter
2. Bridge
3. Composite
4. Decorator
5. Facade
6. Flyweight
7. Proxy

Each of these patterns is unique and suits different situations and conditions.

# Flyweight

**What is the flyweight?**

This design pattern suggests classifying an object's internal state as either intrinsic or extrinsic. The **intrinsic state** is what stays the same across your code, so it is a good candidate for caching. The **extrinsic state**, on the other hand, changes during execution, so it is not worth storing every variant of it. If you store all of your extrinsic states, you are likely to run out of memory while the program runs.
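To make the split concrete, here is a minimal sketch of a flyweight factory. The names `ProcessedData` and `ProcessedDataFactory` are hypothetical and exist only for this illustration; the "processing" is a toy division by the maximum value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessedData:
    """Intrinsic state: identical for every model, so it is shared."""

    name: str
    rows: tuple


class ProcessedDataFactory:
    """Builds each processed dataset once and returns the same object afterwards."""

    def __init__(self) -> None:
        self._cache = {}
        self.builds = 0  # how many times the expensive step actually ran

    def get(self, name: str, raw_rows: list) -> ProcessedData:
        if name not in self._cache:
            self.builds += 1
            top = max(raw_rows)
            self._cache[name] = ProcessedData(name, tuple(v / top for v in raw_rows))
        return self._cache[name]


factory = ProcessedDataFactory()
raw = [2.0, 4.0, 8.0]

# Extrinsic state (model name, seed) travels with each job and is never cached.
jobs = [
    (factory.get("toy", raw), {"model": model, "seed": seed})
    for model in ("rf", "lm", "xgboost")
    for seed in (0, 1)
]

print(factory.builds)  # 1
print(len(jobs))  # 6
```

All six jobs point to the same `ProcessedData` object, while the per-job settings stay outside of it.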

**When should we use it?**

Use it when the code repeats common steps whose results can be cached to speed up execution. The cache then works like a feature store that serves downstream executions.
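One lightweight way to cache a common step in Python is `functools.lru_cache`. The sketch below is only an illustration: the `scaled` function and the `calls` counter are hypothetical, and it assumes the input is a hashable tuple with at least two distinct values.

```python
from functools import lru_cache

calls = {"count": 0}  # tracks how often the real computation runs


@lru_cache(maxsize=None)
def scaled(values: tuple) -> tuple:
    """Min-max scale a tuple of numbers (assumes max(values) > min(values))."""
    calls["count"] += 1
    low, high = min(values), max(values)
    return tuple((v - low) / (high - low) for v in values)


raw = (10.0, 20.0, 30.0)

# Three downstream consumers ask for the same step; only the first one computes it.
for _ in range(3):
    result = scaled(raw)

print(result)  # (0.0, 0.5, 1.0)
print(calls["count"])  # 1
```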

**Scenario**

You need to create an AutoML pipeline from scratch that processes the data and trains a random forest, a linear model, and an XGBoost model. At the end, it should print the metrics of each trained model.

Load a toy dataset.

In [67]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

import pandas as pd

from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error as mae

import time

In [68]:
df = load_diabetes(as_frame=True, return_X_y=True)

In [69]:
df = pd.concat([df[0], df[1]], axis=1)

Split the data into train and test sets.

In [70]:
y = df["target"]
x = df.drop(["target"], axis=1)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=99
)

## Antipattern

The easiest antipattern to reproduce is one that runs everything from scratch each time a model is trained.

Why is this an antipattern? It introduces processing overhead: every model is trained on the same data, yet the data is reprocessed for each one.

Here, the intrinsic state is everything the data processing produces before model training. Why? Up to that point, the data is the same for every model.

What happens after that point? From there on we reach the extrinsic state. Each model can take a different configuration, and every training run may differ, even if only because of the random seed.

Let's start with our antipattern class.

In [71]:
class AutoMLPipeline:
    def __init__(self, model_names: list):
        self.model_names = model_names
        self.model_instances = {}

    def fit(self, x_train: pd.DataFrame, y_train: pd.Series) -> None:
        """
        Fits the models for the AutoML pipeline
        """
        x_train = x_train.copy()
        for model in self.model_names:
            self.model_instances[model] = self.__fit_model(model, x_train, y_train)

    def __fit_model(self, model_name: str, x_train: pd.DataFrame, y_train: pd.Series):
        """
        Fits a model based on the model name provided.
        """
        x_train = self._minmax_scaler(x_train)
        if model_name == "rf":
            return self._train_rf(x_train, y_train)
        elif model_name == "lm":
            return self._train_lm(x_train, y_train)
        elif model_name == "xgboost":
            return self._train_xgboost(x_train, y_train)
        else:
            raise ValueError(f"Model {model_name} not recognized.")

    def _minmax_scaler(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Fits a MinMaxScaler on the data, keeps it for scoring, and scales all columns.
        """
        self.scaler = MinMaxScaler()
        df = pd.DataFrame(self.scaler.fit_transform(df), columns=df.columns)
        return df

    def _train_rf(self, x_train: pd.DataFrame, y_train: pd.Series):
        """
        Trains a RandomForestRegressor model.
        """
        model = RandomForestRegressor()
        model.fit(x_train, y_train)
        return model

    def _train_lm(self, x_train: pd.DataFrame, y_train: pd.Series):
        """
        Trains a LinearRegression model.
        """
        from sklearn.linear_model import LinearRegression

        model = LinearRegression()
        model.fit(x_train, y_train)
        return model

    def _train_xgboost(self, x_train: pd.DataFrame, y_train: pd.Series):
        """
        Trains an XGBRegressor model.
        """
        from xgboost import XGBRegressor

        model = XGBRegressor()
        model.fit(x_train, y_train)
        return model

    def get_scores(self, x_test: pd.DataFrame, y_test: pd.Series) -> dict:
        """
        Gets MAE for each model trained
        """
        # Apply the scaler fitted on the training data so the test features match
        x_test = pd.DataFrame(self.scaler.transform(x_test), columns=x_test.columns)
        scores = {}
        for model in self.model_names:
            y_pred = self.model_instances[model].predict(x_test)
            scores[model] = mae(y_true=y_test, y_pred=y_pred)
        return scores

Let's run the pipeline and time how long the execution takes.

In [72]:
start_time = time.time()
automl = AutoMLPipeline(model_names=["rf", "lm", "xgboost"])
automl.fit(x_train, y_train)
end_time = time.time()
scores = automl.get_scores(x_test, y_test)

In [73]:
print(end_time - start_time)

0.24162721633911133


As the output shows, it took around 0.24 seconds. That is less than a second, but as more preprocessing steps are added to the AutoML class, the run will take several times longer. Even though it is not a big problem right now, if the pipeline grows in the future, execution will take a long time.
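A single wall-clock measurement can be noisy, so differences of a few hundredths of a second are hard to trust. One option is to repeat the measurement with the standard library's `timeit.repeat` and keep the best run. In this sketch, `fit_stand_in` is a hypothetical placeholder for the call to `automl.fit(x_train, y_train)`.

```python
import timeit


def fit_stand_in() -> int:
    """Hypothetical placeholder for `automl.fit(x_train, y_train)`."""
    return sum(i * i for i in range(50_000))


# Run the call once per repeat, five repeats, and keep every timing.
timings = timeit.repeat(fit_stand_in, number=1, repeat=5)

# The minimum is usually the run least disturbed by other processes.
print(f"best of {len(timings)}: {min(timings):.4f} s")
```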

**How to solve the antipattern?**

1. Perform the data processing outside each model's training (extrinsic -> intrinsic).
2. Cache the results of the preprocessing.
3. Use the cached data for model training.

## Pattern

Let's create a new class with the modifications.

In [74]:
class NewAutoMLPipeline:
    def __init__(self, model_names: list):
        self.model_names = model_names
        self.model_instances = {}
        self.df_cache = None

    def fit(self, x_train: pd.DataFrame, y_train: pd.Series) -> None:
        """
        Fits the models for the AutoML pipeline
        """
        x_train = x_train.copy()
        self.df_cache = self._minmax_scaler(x_train)
        for model in self.model_names:
            self.model_instances[model] = self.__fit_model(
                model, self.df_cache, y_train
            )

    def __fit_model(self, model_name: str, x_train: pd.DataFrame, y_train: pd.Series):
        """
        Fits a model based on the model name provided.
        """
        if model_name == "rf":
            return self._train_rf(x_train, y_train)
        elif model_name == "lm":
            return self._train_lm(x_train, y_train)
        elif model_name == "xgboost":
            return self._train_xgboost(x_train, y_train)
        else:
            raise ValueError(f"Model {model_name} not recognized.")

    def _minmax_scaler(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Fits a MinMaxScaler on the data, keeps it for scoring, and scales all columns.
        """
        self.scaler = MinMaxScaler()
        df = pd.DataFrame(self.scaler.fit_transform(df), columns=df.columns)
        return df

    def _train_rf(self, x_train: pd.DataFrame, y_train: pd.Series):
        """
        Trains a RandomForestRegressor model.
        """
        model = RandomForestRegressor()
        model.fit(x_train, y_train)
        return model

    def _train_lm(self, x_train: pd.DataFrame, y_train: pd.Series):
        """
        Trains a LinearRegression model.
        """
        from sklearn.linear_model import LinearRegression

        model = LinearRegression()
        model.fit(x_train, y_train)
        return model

    def _train_xgboost(self, x_train: pd.DataFrame, y_train: pd.Series):
        """
        Trains an XGBRegressor model.
        """
        model = XGBRegressor()
        model.fit(x_train, y_train)
        return model

    def get_scores(self, x_test: pd.DataFrame, y_test: pd.Series) -> dict:
        """
        Gets MAE for each model trained
        """
        # Apply the scaler fitted on the training data so the test features match
        x_test = pd.DataFrame(self.scaler.transform(x_test), columns=x_test.columns)
        scores = {}
        for model in self.model_names:
            y_pred = self.model_instances[model].predict(x_test)
            scores[model] = mae(y_true=y_test, y_pred=y_pred)
        return scores

In [75]:
start_time = time.time()
automl = NewAutoMLPipeline(model_names=["rf", "lm", "xgboost"])
automl.fit(x_train, y_train)
end_time = time.time()
scores = automl.get_scores(x_test, y_test)

In [76]:
print(end_time - start_time)

0.21244382858276367


As the output shows, we saved a few hundredths of a second of preprocessing (around 0.03 seconds). With bigger setups or datasets, though, this is where the flyweight pattern really shines!
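To get a feel for how the gap could grow, the following sketch simulates a more expensive preprocessing step with `time.sleep`. Everything here is an assumption made for illustration: `preprocess` and `run` are hypothetical helpers, the 0.01 second cost is arbitrary, and no real model is trained.

```python
import time


def preprocess(cost_s: float) -> str:
    """Hypothetical stand-in for an expensive preprocessing step."""
    time.sleep(cost_s)
    return "processed"


def run(n_models: int, cost_s: float, cached: bool) -> float:
    """Return the seconds spent preparing data for n_models models."""
    start = time.perf_counter()
    # Pattern: preprocess once up front. Antipattern: preprocess inside the loop.
    shared = preprocess(cost_s) if cached else None
    for _ in range(n_models):
        data = shared if cached else preprocess(cost_s)
        # Model training on `data` would happen here.
    return time.perf_counter() - start


for n_models in (3, 6):
    naive = run(n_models, cost_s=0.01, cached=False)
    shared = run(n_models, cost_s=0.01, cached=True)
    print(f"{n_models} models: naive {naive:.3f} s, cached {shared:.3f} s")
```

In this simulation the naive time grows with the number of models, while the cached time stays close to the cost of a single preprocessing pass.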