# Parallel Pipeline Example

A Parallel Pipeline is a set of steps that run independently of one another; their outputs are then combined and returned. Each step should be an instantiated class with both `fit` and `transform` methods.

The diagram below shows the high-level structure of a Parallel Pipeline:

<center><img src="images/parallel_pipeline_structure.png"/></center>

A Parallel Pipeline is used to carry out multiple processes separately and combine their outputs. One such use case is when we want to:

1. Generate a new rule set, and
2. Optimise an existing rule set, then
3. Use the combined rule sets to optimise a decision engine.

An example of this workflow is shown below:

<center><img src="images/parallel_pipeline_example.png"/></center>

To create this workflow in Iguanas, the rule generation and rule optimisation steps are added to a Parallel Pipeline, so they run separately and their outputs are combined. This Parallel Pipeline is then added to a Linear Pipeline, along with the decision engine optimisation step, so that the output of the Parallel Pipeline is fed into the decision engine optimiser.
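The nesting described above can be pictured with a toy sketch in plain Python. This is not the Iguanas implementation: `AddOne`, `Double`, `RowSum`, `ToyParallel` and `ToyLinear` are hypothetical stand-ins, and "combining outputs" is assumed here to mean placing each step's output columns side by side.

```python
# Toy sketch only: every class here is a hypothetical stand-in, not an Iguanas class.

class AddOne:
    """Stand-in for one parallel step (e.g. rule generation)."""
    def fit_transform(self, X, y=None):
        return {"add_one": [x + 1 for x in X]}


class Double:
    """Stand-in for another parallel step (e.g. rule optimisation)."""
    def fit_transform(self, X, y=None):
        return {"double": [x * 2 for x in X]}


class ToyParallel:
    """Runs every step on the same input and merges the output columns."""
    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, X, y=None):
        combined = {}
        for _, step in self.steps:
            combined.update(step.fit_transform(X, y))
        return combined


class RowSum:
    """Stand-in for the final step (e.g. the decision engine optimiser)."""
    def fit(self, X, y=None):
        columns = list(X.values())
        self.row_sums_ = [sum(row) for row in zip(*columns)]
        return self


class ToyLinear:
    """Feeds each step's output into the next; the last step is only fitted."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None):
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X, y)
        self.steps[-1][1].fit(X, y)
        return self


parallel = ToyParallel([("add_one", AddOne()), ("double", Double())])
final_step = RowSum()
linear = ToyLinear([("pp", parallel), ("final", final_step)])
linear.fit([1, 2, 3])
print(final_step.row_sums_)  # [4, 7, 10]
```

The final step only ever sees the merged output of the parallel steps, which is the property the real workflow relies on.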

**We'll see how this workflow can be generated in the following example.**

---

## Import packages

In [1]:
from iguanas.rule_generation import RuleGeneratorDT
from iguanas.rule_optimisation import BayesianOptimiser
from iguanas.rule_selection import SimpleFilter, CorrelatedFilter
from iguanas.metrics import FScore, Precision, JaccardSimilarity
from iguanas.rbs import RBSOptimiser, RBSPipeline
from iguanas.correlation_reduction import AgglomerativeClusteringReducer
from iguanas.pipeline import LinearPipeline, ParallelPipeline, ClassAccessor
from iguanas.rules import Rules

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from category_encoders.one_hot import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

## Read in data

Let's read in the famous Titanic data set and split it into training and test sets:

In [2]:
df = pd.read_csv('../../../examples/dummy_data/titanic.csv', index_col='PassengerId')
target_col = 'Survived'
cols_to_drop = ['Name', 'Ticket', 'Cabin']
X = df.drop([target_col] + cols_to_drop, axis=1)
y = df[target_col]

In [3]:
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.33,
    random_state=42
)

## Data processing

Let's apply the following simple steps to process the data:
* One hot encode categorical variables (accounting for nulls)
* Impute numeric features with -1

In [4]:
# OHE
encoder = OneHotEncoder(
    use_cat_names=True
)
X_train = encoder.fit_transform(X_train_raw)
X_test = encoder.transform(X_test_raw)

# Impute
X_train.fillna(-1, inplace=True)
X_test.fillna(-1, inplace=True)
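To make "accounting for nulls" concrete, here is a small plain-Python sketch of the two processing steps on made-up values. The helpers `one_hot` and `impute` are hypothetical, and the column naming (including the `_nan` suffix) is an assumption for illustration rather than a statement of what the encoder above produces.

```python
# Toy sketch in plain Python; helper names and column names are illustrative assumptions.

def one_hot(values, prefix):
    """One-hot encode `values`, giving nulls (None) their own indicator column."""
    categories = sorted({v for v in values if v is not None})
    if any(v is None for v in values):
        categories.append(None)  # nulls become a category in their own right
    return {
        f"{prefix}_{'nan' if cat is None else cat}": [int(v == cat) for v in values]
        for cat in categories
    }


def impute(values, fill_value=-1):
    """Replace nulls in a numeric column with `fill_value`."""
    return [fill_value if v is None else v for v in values]


embarked = one_hot(["S", None, "C", "S"], prefix="Embarked")
age = impute([22.0, None, 38.0, 26.0])
print(embarked)
print(age)
```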


----

## Set up pipeline

To create the workflow shown at the beginning of the notebook, let's first assume we have the following existing rule set (stored in the standard Iguanas string format):

In [5]:
existing_rules = Rules(
    rule_strings = {
        'AgeRule': "(X['Age']>20)|(X['Age'].isna())",
        'CombinedRule': "(X['Pclass']<=2)&(X['Sex']=='female')"
    }
)

**Note:** these rules use the unprocessed data, as they contain conditions that either look for null values or compare to a given category. We'll need to use the unprocessed data when optimising these rules.
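The sketch below illustrates why these rule strings need the unprocessed columns. It evaluates the two rule strings against a tiny made-up DataFrame using Python's built-in `eval`; this evaluation approach and the sample values are assumptions chosen for illustration, not a description of how Iguanas applies rules internally.

```python
import pandas as pd

# Small made-up sample in the unprocessed format (values are illustrative)
X = pd.DataFrame({
    "Age": [22.0, None, 15.0],
    "Pclass": [1, 3, 2],
    "Sex": ["female", "male", "female"],
})

age_rule = "(X['Age']>20)|(X['Age'].isna())"
combined_rule = "(X['Pclass']<=2)&(X['Sex']=='female')"

# Each rule string is a pandas expression over X that yields a boolean Series
age_flags = eval(age_rule).astype(int).tolist()
combined_flags = eval(combined_rule).astype(int).tolist()

print(age_flags)       # [1, 1, 0]
print(combined_flags)  # [1, 0, 1]
```

After one-hot encoding and imputation, the `Sex` column would no longer exist and `Age` would contain no nulls, so neither rule would behave as intended on the processed data.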

To use these rules in a rule optimiser, we need to convert them to the standard Iguanas lambda expression format:

In [6]:
existing_rule_lambdas = existing_rules.as_rule_lambdas(
    as_numpy=False, 
    with_kwargs=True
)
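As intuition for why a lambda representation suits optimisation: a rule written as a function of its thresholds can be re-rendered with a different threshold on each trial. The sketch below is illustrative only; the template, the keyword name `Age`, and the variable names are assumptions and may not match the exact objects returned by `as_rule_lambdas`.

```python
# Illustrative only: a rule expressed as a function of its threshold.
# The keyword name `Age` and this exact template are assumptions.
age_rule = lambda **kwargs: "(X['Age']>{Age})|(X['Age'].isna())".format(**kwargs)

# The current threshold values, stored separately from the rule logic
age_rule_kwargs = {"Age": 20}

# Rendering with the stored values reproduces the original rule string
original = age_rule(**age_rule_kwargs)

# An optimiser can render the same rule with a candidate threshold instead
candidate = age_rule(Age=35)

print(original)   # (X['Age']>20)|(X['Age'].isna())
print(candidate)  # (X['Age']>35)|(X['Age'].isna())
```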

Now let's instantiate the classes used in our Parallel Pipeline - this will consist of a rule generation step and a rule optimisation step. We'll optimise both our rules and our decision engine to maximise the **F1 score**:

In [7]:
# Optimisation metric
f1 = FScore(beta=1)

# Rule generation
generator = RuleGeneratorDT(
    metric=f1.fit,
    n_total_conditions=3,
    tree_ensemble=RandomForestClassifier(
        n_estimators=2,
        random_state=0
    ),
    target_feat_corr_types='Infer'
)
# Rule optimisation
optimiser = BayesianOptimiser(
    rule_lambdas=existing_rule_lambdas,
    lambda_kwargs=existing_rules.lambda_kwargs,
    metric=f1.fit,
    n_iter=10,
)
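For reference, the F-beta score being maximised combines precision and recall as (1 + beta^2) * precision * recall / (beta^2 * precision + recall), which reduces to the F1 score when beta is 1. The plain-Python function below is a minimal sketch of that formula on made-up labels; `f_beta` is a hypothetical helper, not the `FScore` class used above.

```python
def f_beta(y_true, y_pred, beta=1.0):
    """F-beta score for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative
y_true_example = [1, 1, 1, 0, 0, 0]
y_pred_example = [1, 1, 0, 1, 0, 0]
score = f_beta(y_true_example, y_pred_example, beta=1)
print(round(score, 4))  # 0.6667
```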

Now we can create the steps of our Parallel Pipeline. Each step should be a tuple of two elements:

1. The first element should be a string which refers to the step.
2. The second element should be the instantiated class which runs at that step.

In [8]:
steps = [
    ('generator', generator),
    ('optimiser', optimiser)
]

Then we can instantiate our Parallel Pipeline:

In [9]:
# Instantiate the ParallelPipeline
pp = ParallelPipeline(
    steps=steps,
    verbose=1 # Set to 1 to see overall progress
)

Now that we have our Parallel Pipeline defined, we can define our Linear Pipeline, the first step of which will be our Parallel Pipeline. This will run the rule generation and optimisation steps separately, combine their outputs, and feed them into the decision engine optimiser (our second step in the Linear Pipeline):

In [10]:
# Decision engine (to be optimised)
rbs_pipeline = RBSPipeline(
    config=[], # Use an empty list here - the RBSOptimiser will create the config
    final_decision=0
)

# Decision engine optimiser
rbs_optimiser = RBSOptimiser(
    pipeline=rbs_pipeline,
    metric=f1.fit, 
    pos_pred_rules=ClassAccessor(
        class_tag='pp', 
        class_attribute='rule_names'
    ),
    rules=ClassAccessor(
        class_tag='pp', 
        class_attribute='rules'    
    ),
    n_iter=20
)

**Note:** The arguments passed to the `pos_pred_rules` and `rules` parameters in the `RBSOptimiser` class are `ClassAccessor` objects. This object extracts the specified attribute from the given class in the pipeline. This allows users to pass attributes from earlier steps in the pipeline as parameters of later steps in the pipeline.
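The idea of deferring an attribute lookup until an earlier step has been fitted can be sketched as follows. `ToyAccessor` and `FittedStep` are hypothetical classes written for this illustration; how `ClassAccessor` resolves attributes internally is not shown here and may differ.

```python
# Toy sketch: `ToyAccessor` and `FittedStep` are hypothetical, not Iguanas classes.

class FittedStep:
    """Stand-in for an earlier pipeline step that has already been fitted."""
    def __init__(self, rule_names):
        self.rule_names = rule_names


class ToyAccessor:
    """Remembers which step and attribute to read, and resolves it later."""
    def __init__(self, class_tag, class_attribute):
        self.class_tag = class_tag
        self.class_attribute = class_attribute

    def get(self, steps):
        # Find the step with the matching tag and return the named attribute
        for tag, step in steps:
            if tag == self.class_tag:
                return getattr(step, self.class_attribute)
        raise ValueError(f"No step tagged {self.class_tag!r}")


toy_steps = [("pp", FittedStep(rule_names=["RuleA", "RuleB"]))]
accessor = ToyAccessor(class_tag="pp", class_attribute="rule_names")
resolved = accessor.get(toy_steps)
print(resolved)  # ['RuleA', 'RuleB']
```

The accessor is created before the step's attribute exists and is only resolved once the earlier step has produced it.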

In this example, the names of the rules in the concatenated output produced by the `ParallelPipeline` are passed to the `pos_pred_rules` parameter of the `RBSOptimiser` class, so that the `RBSOptimiser` knows which rules predict positive cases. One might argue that this doesn't need to be specified here, as we only have one type of rule set. However, when you have a mix of rules, some that predict positive cases and some that predict negative cases, you must specify which rules predict which case using the `pos_pred_rules` and `neg_pred_rules` parameters.

Also, the `rules` attribute created in the `ParallelPipeline` is passed to the `rules` parameter of the `RBSOptimiser` class. This is so the rules remaining after the decision engine optimisation can be easily extracted from the trained pipeline.
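For intuition about the decision engine being optimised, the sketch below shows one way a staged rule-based system can turn rule outputs into a decision. It assumes each stage is a `(decision, rule_names)` tuple, that the first stage with a triggered rule decides the outcome, and that `final_decision` applies when nothing triggers. The function `toy_rbs_predict`, the rule names, and the config are all hypothetical.

```python
def toy_rbs_predict(rule_flags, config, final_decision):
    """Return the decision of the first stage that has a triggered rule."""
    for decision, rule_names in config:
        if any(rule_flags[name] for name in rule_names):
            return decision
    # No stage triggered, so fall back to the default decision
    return final_decision


# Hypothetical two-stage config: a negative stage first, then a positive stage
toy_config = [(0, ["RuleC"]), (1, ["RuleA", "RuleB"])]

records = [
    {"RuleA": 1, "RuleB": 0, "RuleC": 0},  # only the positive stage triggers
    {"RuleA": 1, "RuleB": 0, "RuleC": 1},  # the earlier negative stage wins
    {"RuleA": 0, "RuleB": 0, "RuleC": 0},  # nothing triggers: final decision
]
predictions = [toy_rbs_predict(r, toy_config, final_decision=0) for r in records]
print(predictions)  # [1, 0, 0]
```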

Now we can define the steps of our Linear Pipeline:

In [11]:
steps=[
    ('pp', pp),
    ('rbs_optimiser', rbs_optimiser)
]

Finally, we can instantiate our Linear Pipeline:

In [12]:
# Instantiate the LinearPipeline
lp = LinearPipeline(
    steps=steps,
    verbose=2 # Set to 2 to see current step being trained
)

## Using the pipeline

### `fit` method

By running the `fit` method, we sequentially run the `fit_transform` methods of each step in the pipeline, except for the last step, where the `fit` method is run. 

**Note:** we need to pass the unprocessed data to the rule optimiser step - we can do this by feeding a dictionary to the parameter `X`, where the key of the dictionary corresponds to the step where the given dataset (value) should be passed:

In [13]:
lp.fit(
    X={
        'generator': X_train,
        'optimiser': X_train_raw,
    }, 
    y=y_train, 
    sample_weight=None
)

--- Applying `fit_transform` method for step `pp` ---
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 19.23it/s]
--- Applying `fit` method or step `rbs_optimiser` ---
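The dictionary form of `X` used above can be pictured as a simple lookup keyed by step tag. The helper below is a hypothetical sketch of that routing idea, with placeholder strings standing in for the datasets; it is not the pipeline's actual code.

```python
def select_input(X, step_tag):
    """Return the dataset for a step: look it up by tag if X is a dict."""
    if isinstance(X, dict):
        return X[step_tag]
    # A single dataset is shared by every step
    return X


# Placeholder strings stand in for the processed and unprocessed datasets
X_by_step = {"generator": "processed data", "optimiser": "raw data"}

generator_input = select_input(X_by_step, "generator")
optimiser_input = select_input(X_by_step, "optimiser")
shared_input = select_input("one dataset", "generator")

print(generator_input)  # processed data
print(optimiser_input)  # raw data
print(shared_input)     # one dataset
```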


#### Outputs

The `fit` method doesn't return anything. However, you can access the attributes of the fitted classes using the `get_params` method.

To see the rules that remain after the decision engine optimisation, we can extract the `rules` attribute from the `rbs_optimiser` step in the trained pipeline:

In [14]:
rules = lp.get_params()['rbs_optimiser']['rules']

Then access the `rule_strings` attribute to see the logic of each rule:

In [15]:
rules.rule_strings

{'RGDT_Rule_20220221_1': "(X['Embarked_Q']==True)&(X['Embarked_S']==False)&(X['Sex_female']==True)",
 'CombinedRule': "(X['Pclass']<=2)&(X['Sex']=='female')"}

### `fit_predict` method

By running the `fit_predict` method, we sequentially run the `fit_transform` methods of each step in the pipeline, except for the last step, where the `fit_predict` method is run.

**Note:** we need to pass the unprocessed data to the rule optimiser step - we can do this by feeding a dictionary to the parameter `X`, where the key of the dictionary corresponds to the step where the given dataset (value) should be passed:

In [16]:
y_pred_train = lp.fit_predict(
    X={
        'generator': X_train,
        'optimiser': X_train_raw,
    }, 
    y=y_train, 
    sample_weight=None
)

--- Applying `fit_transform` method for step `pp` ---
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 23.22it/s]
--- Applying `fit` method or step `rbs_optimiser` ---


#### Outputs

The `fit_predict` method returns the prediction generated by the class in the final step of the pipeline - in this case, the `RBSOptimiser`:

In [17]:
y_pred_train

PassengerId
7      0
719    0
686    0
74     0
883    0
      ..
107    0
271    0
861    0
436    1
103    0
Length: 596, dtype: int64

### `predict` method

By running the `predict` method, we sequentially run the `transform` methods of each step in the pipeline, except for the last step, where the `predict` method is run. Note that before using this method, you should first run either the `fit` or `fit_predict` method.

**Note:** we need to pass the unprocessed data to the rule optimiser step - we can do this by feeding a dictionary to the parameter `X`, where the key of the dictionary corresponds to the step where the given dataset (value) should be passed:

In [18]:
y_pred_test = lp.predict(
    X={
        'generator': X_test,
        'optimiser': X_test_raw,
    }
)

#### Outputs

The `predict` method returns the prediction generated by the class in the final step of the pipeline - in this case, the `RBSOptimiser`:

In [19]:
y_pred_test

PassengerId
710    0
440    0
841    0
721    1
40     0
      ..
716    0
526    0
382    0
141    0
174    0
Length: 295, dtype: int64

Because the whole workflow is wrapped in a single pipeline object, this approach is well suited to optimising hyperparameters for the overall performance of a Rules-Based System - see the `BayesSearchCV` class in the `rule_selection` module for more information.

---