# Linear Pipeline Example

A Linear Pipeline is a sequence of steps that are applied sequentially to a dataset. Each step should be an instantiated class with both `fit` and `transform` methods. The final step should be an instantiated class with both `fit` and `predict` methods.
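
As a rough illustration of this interface, the toy sketch below chains an intermediate step (with `fit` and `transform`) into a final step (with `fit` and `predict`). The `Scaler` and `ThresholdClassifier` classes are hypothetical and are not part of Iguanas; they only show the shape each step is expected to have:

```python
# Toy illustration of the step interface; these classes are hypothetical
# and are not part of Iguanas.

class Scaler:
    """Intermediate step: learns the maximum value, then scales by it."""

    def fit(self, X, y=None):
        self.max_ = max(X)
        return self

    def transform(self, X):
        return [x / self.max_ for x in X]


class ThresholdClassifier:
    """Final step: learns a threshold, then predicts 1 at or above it."""

    def fit(self, X, y):
        # Use the smallest value seen among the positive cases as the threshold
        self.threshold_ = min(x for x, label in zip(X, y) if label == 1)
        return self

    def predict(self, X):
        return [int(x >= self.threshold_) for x in X]


X = [1.0, 2.0, 3.0, 4.0]
y = [0, 0, 1, 1]

scaler = Scaler().fit(X)
X_scaled = scaler.transform(X)                 # output of the first step...
clf = ThresholdClassifier().fit(X_scaled, y)
y_pred = clf.predict(X_scaled)                 # ...is the input to the final step
print(y_pred)
```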

The diagram below shows the high-level structure of a Linear Pipeline:


<center><img src="images/linear_pipeline_structure.png"/></center>

It is used to carry out multiple processes sequentially, using the output of each step as the input to the next step. One such use case is when we want to:

1. Generate a set of rules.
2. Filter the rule set to remove those which are:
    * Poorly performing.
    * Correlated.
3. Use the resulting rule set to optimise a decision engine.

An example of this workflow is shown below:

<center><img src="images/linear_pipeline_example.png"/></center>

To create this workflow in Iguanas, each step will be added to a Linear Pipeline (whilst maintaining the order shown).

**We'll see how this workflow can be generated in the following example.**

---

## Import packages

In [1]:
from iguanas.rule_generation import RuleGeneratorDT
from iguanas.rule_optimisation import BayesianOptimiser
from iguanas.rule_selection import SimpleFilter, CorrelatedFilter
from iguanas.metrics import FScore, Precision, JaccardSimilarity
from iguanas.rbs import RBSOptimiser, RBSPipeline
from iguanas.correlation_reduction import AgglomerativeClusteringReducer
from iguanas.pipeline import LinearPipeline
from iguanas.pipeline.class_accessor import ClassAccessor

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from category_encoders.one_hot import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

## Read in data

Let's read in the famous Titanic data set and split it into training and test sets:

In [2]:
df = pd.read_csv('../../../examples/dummy_data/titanic.csv', index_col='PassengerId')
target_col = 'Survived'
cols_to_drop = ['Name', 'Ticket', 'Cabin']
X = df.drop([target_col] + cols_to_drop, axis=1)
y = df[target_col]

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.33,
    random_state=42
)

## Data processing

Let's apply the following simple steps to process the data:
* One hot encode categorical variables (accounting for nulls)
* Impute numeric features with -1

In [4]:
# OHE
encoder = OneHotEncoder(
    use_cat_names=True
)
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

# Impute
X_train.fillna(-1, inplace=True)
X_test.fillna(-1, inplace=True)


----

## Set up pipeline

To create the workflow shown at the beginning of the notebook, let's first instantiate the classes for each step. We'll optimise our decision engine to maximise the **F1 score**:

In [5]:
# Optimisation metric
f1 = FScore(beta=1)

# Rule generation
generator = RuleGeneratorDT(
    metric=f1.fit,
    n_total_conditions=4,
    tree_ensemble=RandomForestClassifier(
        n_estimators=2,
        random_state=0
    ),
    target_feat_corr_types='Infer'
)

# Rule filter (performance-based)
simple_filterer = SimpleFilter(
    threshold=0.1, # Filter out rules with an F1 score < 0.1
    operator='>=', 
    metric=f1.fit
)

# Rule filter (correlation-based)
js = JaccardSimilarity()
corr_filterer = CorrelatedFilter(
    correlation_reduction_class=AgglomerativeClusteringReducer(
        threshold=0.9, # Filter out rules in the same cluster with a Jaccard Similarity >= 0.9 
        strategy='top_down', 
        similarity_function=js.fit, 
        metric=f1.fit
    )
)

# Decision engine (to be optimised)
rbs_pipeline = RBSPipeline(
    config=[],
    final_decision=0
)

# Decision engine optimiser
rbs_optimiser = RBSOptimiser(
    pipeline=rbs_pipeline,
    metric=f1.fit, 
    pos_pred_rules=ClassAccessor(
        class_tag='corr_filterer', 
        class_attribute='rules_to_keep'
    ),
    rules=ClassAccessor(
        class_tag='generator',
        class_attribute='rules'
    ),
    n_iter=10
)
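
The correlation-based filter above compares rules using Jaccard similarity. As a minimal sketch of the standard definition (the size of the intersection divided by the size of the union of the records flagged by two rules), the helper below works on plain 0/1 lists. The `jaccard_similarity` function is written for illustration only and is not the Iguanas `JaccardSimilarity` implementation:

```python
def jaccard_similarity(a, b):
    """Jaccard similarity between two binary (0/1) sequences of equal length."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    # Convention chosen for this sketch: if neither rule flags anything, return 0.0
    return both / either if either else 0.0


# Each list marks which records a (toy) rule flags
rule_a = [1, 1, 0, 1, 0]
rule_b = [1, 1, 0, 0, 0]
rule_c = [1, 1, 0, 1, 0]

sim_ab = jaccard_similarity(rule_a, rule_b)  # 2 shared flags out of 3 flagged overall
sim_ac = jaccard_similarity(rule_a, rule_c)  # identical rules
print(sim_ab, sim_ac)
```

With a threshold of 0.9, `rule_a` and `rule_c` would be treated as correlated in this toy example, while `rule_a` and `rule_b` would not.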

**Note:** The arguments passed to the `pos_pred_rules` and `rules` parameters in the `RBSOptimiser` class are `ClassAccessor` objects. This object extracts the specified attribute from the given class in the pipeline. This allows users to pass attributes from earlier steps in the pipeline as parameters of later steps in the pipeline.

In this example, the names of the rules that remain after the `corr_filterer` step are passed to the `pos_pred_rules` parameter of the `RBSOptimiser` class, so that the `RBSOptimiser` knows which rules predict positive cases. Arguably this doesn't need to be specified here, as we only have one type of rule set. However, when you have a set of rules in which some predict positive cases and some predict negative cases, you must specify which rules predict which case, using the `pos_pred_rules` and `neg_pred_rules` parameters.

Also, the `rules` attribute created in the `generator` step is passed to the `rules` parameter of the `RBSOptimiser` class. This is so the rules remaining after the decision engine optimisation can be easily extracted from the trained pipeline.
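
To make the idea of deferred attribute lookup more concrete, here is a small sketch. The `AttributeAccessor` and `ToyGenerator` classes are hypothetical stand-ins written for this illustration; how `ClassAccessor` resolves attributes internally is not shown in this notebook, so treat this as one possible way such a lookup could work:

```python
# Hypothetical stand-in for the idea behind ClassAccessor;
# not the Iguanas implementation.

class AttributeAccessor:
    """Stores a step tag and an attribute name, to be resolved later."""

    def __init__(self, class_tag, class_attribute):
        self.class_tag = class_tag
        self.class_attribute = class_attribute

    def get(self, steps):
        """Look up the step by its tag and return the requested attribute."""
        step = dict(steps)[self.class_tag]
        return getattr(step, self.class_attribute)


class ToyGenerator:
    def fit(self, X, y=None):
        # This attribute only exists once the step has been fitted
        self.rule_names = [f"rule_{i}" for i in range(len(X[0]))]
        return self


generator = ToyGenerator()
steps = [("generator", generator)]

# The accessor is created before the attribute exists...
accessor = AttributeAccessor(class_tag="generator", class_attribute="rule_names")

# ...and resolved only after the earlier step has been fitted
generator.fit([[1, 0], [0, 1]])
resolved = accessor.get(steps)
print(resolved)
```

The point of the design is that the attribute does not exist when the pipeline is defined, so the later step holds a reference to where the value will be, rather than the value itself.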

Now we can create the steps of our pipeline. Each step should be a tuple of two elements:

1. The first element should be a string which refers to the step.
2. The second element should be the instantiated class which runs at that step.

In [6]:
steps = [
    ('generator', generator),
    ('simple_filterer', simple_filterer),
    ('corr_filterer', corr_filterer),
    ('rbs_optimiser', rbs_optimiser)
]

Finally, we can instantiate our pipeline:

In [7]:
lp = LinearPipeline(
    steps=steps,
    verbose=1 # Set to 1 to see overall progress
)

## Using the pipeline

### `fit` method

By running the `fit` method, we sequentially run the `fit_transform` methods of each step in the pipeline, except for the last step, where the `fit` method is run:

In [8]:
lp.fit(
    X=X_train, 
    y=y_train,
    sample_weight=None
)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 44.24it/s]
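
The control flow described above can be sketched with toy steps. The `toy_fit` helper and the step classes below are hypothetical and only mirror the behaviour stated in this section (`fit_transform` on every step except the last, then `fit` on the last); they are not the Iguanas implementation:

```python
# Toy sketch of the control flow described above; not Iguanas code.

class AddOne:
    def fit_transform(self, X, y=None):
        return [x + 1 for x in X]


class Double:
    def fit_transform(self, X, y=None):
        return [x * 2 for x in X]


class RecordInput:
    """Final step: only `fit` is called, so it just records what it receives."""

    def fit(self, X, y=None):
        self.seen_ = X
        return self


def toy_fit(steps, X, y=None):
    intermediate, final_step = steps[:-1], steps[-1][1]
    for _, step in intermediate:
        X = step.fit_transform(X, y)  # the output feeds the next step
    final_step.fit(X, y)              # the last step is fitted, not transformed


steps = [("add", AddOne()), ("double", Double()), ("final", RecordInput())]
toy_fit(steps, [1, 2, 3])
final_input = steps[-1][1].seen_
print(final_input)
```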


#### Outputs

The `fit` method doesn't return anything. However, you can access the attributes of the fitted classes using the `get_params` method.

To see the rules that remain after the decision engine optimisation, we can extract the `rules` attribute from the `rbs_optimiser` step in the trained pipeline:

In [9]:
rules = lp.get_params()['rbs_optimiser']['rules']

Then access the `rule_strings` attribute to see the logic of each rule:

In [10]:
rules.rule_strings

{'RGDT_Rule_20220221_0': "(X['Age']>14.5)&(X['Pclass']<=2)&(X['Sex_female']==True)",
 'RGDT_Rule_20220221_5': "(X['Embarked_C']==True)&(X['Pclass']<=1)",
 'RGDT_Rule_20220221_8': "(X['Embarked_S']==False)&(X['Fare']>14.85)&(X['Sex_female']==True)",
 'RGDT_Rule_20220221_9': "(X['Embarked_S']==False)&(X['Fare']>14.9771)&(X['Sex_female']==True)",
 'RGDT_Rule_20220221_13': "(X['Fare']>52.2771)",
 'RGDT_Rule_20220221_16': "(X['Pclass']<=1)&(X['Sex_male']==False)",
 'RGDT_Rule_20220221_17': "(X['Pclass']<=2)&(X['Sex_female']==True)",
 'RGDT_Rule_20220221_18': "(X['Pclass']<=2)&(X['Sex_male']==False)",
 'RGDT_Rule_20220221_19': "(X['Sex_female']==True)",
 'RGDT_Rule_20220221_20': "(X['Sex_male']==False)",
 'RGDT_Rule_20220221_21': "(X['SibSp']>=1)"}

### `fit_predict` method

By running the `fit_predict` method, we sequentially run the `fit_transform` methods of each step in the pipeline, except for the last step, where the `fit_predict` method is run:

In [11]:
y_pred_train = lp.fit_predict(
    X=X_train, 
    y=y_train,
    sample_weight=None
)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 63.26it/s]


#### Outputs

The `fit_predict` method returns the prediction generated by the class in the final step of the pipeline - in this case, the `RBSOptimiser`:

In [12]:
y_pred_train

PassengerId
7      0
719    0
686    1
74     1
883    1
      ..
107    1
271    0
861    1
436    1
103    1
Length: 596, dtype: int64

### `predict` method

By running the `predict` method, we sequentially run the `transform` methods of each step in the pipeline, except for the last step, where the `predict` method is run. Note that before using this method, you should first run either the `fit` or `fit_predict` methods:

In [13]:
y_pred_test = lp.predict(X_test)
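
The need to fit before predicting can be illustrated with toy steps. In the sketch below, `MeanCentre`, `SignClassifier` and `toy_predict` are hypothetical names written for this example; they are not Iguanas code. The intermediate step reuses a value learned during fitting, so calling `transform` on an unfitted step fails:

```python
# Toy sketch; the classes and `toy_predict` helper are hypothetical.

class MeanCentre:
    def fit_transform(self, X, y=None):
        self.mean_ = sum(X) / len(X)        # learned during fitting
        return self.transform(X)

    def transform(self, X):
        return [x - self.mean_ for x in X]  # reuses the learned mean


class SignClassifier:
    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return [int(x > 0) for x in X]


def toy_predict(steps, X):
    for _, step in steps[:-1]:
        X = step.transform(X)               # no refitting on new data
    return steps[-1][1].predict(X)


steps = [("centre", MeanCentre()), ("clf", SignClassifier())]

# Predicting before fitting fails, as `mean_` has not been learned yet
try:
    toy_predict(steps, [5, 15])
    failed_before_fit = False
except AttributeError:
    failed_before_fit = True

# Fit on (toy) training data, then predict on unseen data
X_fitted = steps[0][1].fit_transform([0, 10, 20])
steps[1][1].fit(X_fitted)
y_pred_toy = toy_predict(steps, [5, 15])
print(failed_before_fit, y_pred_toy)
```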

#### Outputs

The `predict` method returns the prediction generated by the class in the final step of the pipeline - in this case, the `RBSOptimiser`:

In [14]:
y_pred_test

PassengerId
710    1
440    0
841    0
721    1
40     1
      ..
716    0
526    0
382    1
141    1
174    0
Length: 295, dtype: int64

We can now calculate the F1 score of our pipeline using the test data:

In [15]:
f1.fit(y_pred_test, y_test)

0.7038327526132405
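
For reference, the F-beta score combines precision and recall, and with `beta=1` it reduces to the F1 score. The helper below is a minimal sketch of the standard formula on plain lists; the `f_beta` function is written for illustration and is not the Iguanas `FScore` class, and the toy labels are made up:

```python
def f_beta(y_pred, y_true, beta=1.0):
    """F-beta score for binary (0/1) predictions, using the standard formula."""
    tp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(y_pred, y_true) if p == 0 and t == 1)
    if tp == 0:
        return 0.0  # no true positives: treat the score as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative
y_true_toy = [1, 1, 1, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0]

score = f_beta(y_pred_toy, y_true_toy, beta=1)
print(score)
```

Here precision and recall are both 2/3, so the F1 score is also 2/3.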

This approach is very powerful when optimising hyperparameters for the overall performance of a Rules-Based System - see the `BayesSearchCV` class in the `rule_selection` module for more information.

---

## Using the initial data in a subsequent pipeline step

The example above assumes that each step takes the output of the previous step as its input. However, there are cases where you may want a later step in a pipeline to use an attribute created by an earlier step, while still taking the initial data as its input.

For example, consider the workflow below:

![title](images/linear_pipeline_use_init_data_incorrect.png)

This workflow would fail as part of a standard Linear Pipeline, since the rule optimisation step requires the initial dataset to optimise the rules. In this workflow, the output of the rule generation step (which is a dataset of the binary columns of the rules) would be passed to the `X` parameter of the rule optimiser's `fit_transform` method. This would fail, as the features referenced in the rules would not be present in that dataset.

Instead, we need to construct a workflow like so:

![title](images/linear_pipeline_use_init_data.png)

Here, we generate a rule set using the processed data and the binary target. We then pass this rule set to the rule optimiser, but use the initial data as the input.

We can create this workflow using the `use_init_data` parameter in the `LinearPipeline` class - let's see how this can be done below:

In [16]:
# Optimisation metric
f1 = FScore(beta=1)

# Rule generation
generator = RuleGeneratorDT(
    metric=f1.fit,
    n_total_conditions=4,
    tree_ensemble=RandomForestClassifier(
        n_estimators=2,
        random_state=0
    ),
    target_feat_corr_types='Infer'
)

# Rule optimisation
optimiser = BayesianOptimiser(
    rule_lambdas=ClassAccessor( # Use a ClassAccessor to extract rule_lambdas from the generator step
        class_tag='generator',
        class_attribute='rule_lambdas'
    ),
    lambda_kwargs=ClassAccessor( # Use a ClassAccessor to extract lambda_kwargs from the generator step
        class_tag='generator',
        class_attribute='lambda_kwargs'
    ),
    metric=f1.fit,
    n_iter=10
)

# Decision engine (to be optimised)
rbs_pipeline = RBSPipeline(
    config=[],
    final_decision=0
)

# Decision engine optimiser
rbs_optimiser = RBSOptimiser(
    pipeline=rbs_pipeline,
    metric=f1.fit, 
    pos_pred_rules=ClassAccessor(
        class_tag='optimiser', 
        class_attribute='rule_names'
    ),
    rules=ClassAccessor(
        class_tag='optimiser',
        class_attribute='rules'
    ),
    n_iter=10
)

Now, when we set up our `LinearPipeline`, we need to specify that the `optimiser` step will use the initial dataset. We do this by adding the pipeline step tag to a list and passing that to the `use_init_data` parameter:

In [17]:
steps = [
    ('generator', generator),
    ('optimiser', optimiser),
    ('rbs_optimiser', rbs_optimiser)
]

In [18]:
lp = LinearPipeline(
    steps=steps,
    use_init_data=['optimiser'],
    verbose=2 # Set to 2 to see current step being trained
)
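
To show what routing the initial data to a named step means in practice, here is a toy sketch. The `toy_fit` helper and the step classes are hypothetical and only imitate the behaviour described in this section; they are not how `LinearPipeline` is implemented:

```python
# Toy sketch of the idea behind `use_init_data`; not the Iguanas implementation.

class ToRuleColumns:
    """Replaces the data with a binary 'rule' column (x > 2)."""

    def fit_transform(self, X, y=None):
        return [int(x > 2) for x in X]


class NeedsRawData:
    """Records its input, to show which dataset it was given."""

    def fit_transform(self, X, y=None):
        self.seen_ = X
        return X


class Final:
    def fit(self, X, y=None):
        return self


def toy_fit(steps, X, y=None, use_init_data=None):
    use_init_data = use_init_data or []
    X_init = X  # keep a handle on the initial data
    for tag, step in steps[:-1]:
        # Tagged steps receive the initial data instead of the previous output
        X_in = X_init if tag in use_init_data else X
        X = step.fit_transform(X_in, y)
    steps[-1][1].fit(X, y)


raw = [1, 2, 3, 4]
steps = [
    ("generator", ToRuleColumns()),
    ("optimiser", NeedsRawData()),
    ("final", Final()),
]

toy_fit(steps, raw)
seen_default = steps[1][1].seen_      # binary rule columns from the generator

toy_fit(steps, raw, use_init_data=["optimiser"])
seen_with_init = steps[1][1].seen_    # the initial data
print(seen_default, seen_with_init)
```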

Now we can use the `LinearPipeline` methods outlined earlier in the notebook:

In [19]:
# fit method
lp.fit(
    X=X_train,
    y=y_train,
    sample_weight=None
)

--- Applying `fit_transform` method for step `generator` ---
--- Applying `fit_transform` method for step `optimiser` ---




--- Applying `fit` method or step `rbs_optimiser` ---


In [20]:
rules = lp.get_params()['rbs_optimiser']['rules']

In [21]:
rules.rule_strings

{'RGDT_Rule_20220221_1': "(X['Age']>10.941004768044909)&(X['Sex_female']==True)",
 'RGDT_Rule_20220221_2': "(X['Age']>47)&(X['Embarked_S']==False)&(X['Fare']>52.2771)",
 'RGDT_Rule_20220221_4': "(X['Age']>10.941004768044909)",
 'RGDT_Rule_20220221_5': "(X['Embarked_C']==True)&(X['Pclass']<=2)",
 'RGDT_Rule_20220221_8': "(X['Embarked_S']==False)&(X['Fare']>14.85)&(X['Sex_female']==True)",
 'RGDT_Rule_20220221_9': "(X['Embarked_S']==False)&(X['Fare']>14.9771)&(X['Sex_female']==True)",
 'RGDT_Rule_20220221_10': "(X['Embarked_S']==False)&(X['Fare']>47.162206755851706)",
 'RGDT_Rule_20220221_15': "(X['Pclass']<=2)",
 'RGDT_Rule_20220221_16': "(X['Pclass']<=2)&(X['Sex_male']==False)",
 'RGDT_Rule_20220221_17': "(X['Pclass']<=2)&(X['Sex_female']==True)",
 'RGDT_Rule_20220221_18': "(X['Pclass']<=2)&(X['Sex_male']==False)"}

In [22]:
# fit_predict method
y_pred_train = lp.fit_predict(
    X=X_train, 
    y=y_train,
    sample_weight=None
)

--- Applying `fit_transform` method for step `generator` ---
--- Applying `fit_transform` method for step `optimiser` ---




--- Applying `fit` method or step `rbs_optimiser` ---


In [23]:
# predict method
y_pred_test = lp.predict(X=X_test)