# Linear Pipeline Example

A Linear Pipeline is a sequence of steps applied in order to a dataset. Every step except the last should be an instantiated class with both `fit` and `transform` methods. The final step should be an instantiated class with both `fit` and `predict` methods.
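As a rough illustration of this interface requirement, the sketch below checks a list of steps for the expected methods. The classes `ToyScaler` and `ToyThresholdPredictor` and the helper `check_steps` are hypothetical names invented for this example; they are not part of Iguanas, and the library may validate steps differently (or not at all).

```python
class ToyScaler:
    """Hypothetical intermediate step: has fit and transform."""

    def fit(self, X, y=None):
        self.max_ = max(X)
        return self

    def transform(self, X):
        return [x / self.max_ for x in X]


class ToyThresholdPredictor:
    """Hypothetical final step: has fit and predict."""

    def fit(self, X, y=None):
        self.threshold_ = sum(X) / len(X)
        return self

    def predict(self, X):
        return [int(x > self.threshold_) for x in X]


def check_steps(steps):
    """Return a list of problems found in a list of (tag, instance) steps."""
    problems = []
    for i, (tag, step) in enumerate(steps):
        is_last = i == len(steps) - 1
        # Intermediate steps need fit/transform; the final step needs fit/predict
        required = ("fit", "predict") if is_last else ("fit", "transform")
        for method in required:
            if not callable(getattr(step, method, None)):
                problems.append(f"{tag}: missing {method}")
    return problems


good_steps = [("scaler", ToyScaler()), ("predictor", ToyThresholdPredictor())]
bad_steps = [("predictor", ToyThresholdPredictor()), ("scaler", ToyScaler())]

good_problems = check_steps(good_steps)
bad_problems = check_steps(bad_steps)
print(good_problems)
print(bad_problems)
```

Putting the predictor first and the transformer last is reported as two problems, since each class lacks the method required at its position.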

Pipelines will usually have the following high-level structure:

1. Rule generation/optimisation step
2. Rule processing steps
3. Rule predictor step

**Note that currently, it is not possible to have a rule optimisation step after a rule generation step (i.e. to optimise rules that have been generated), but this feature is being developed!**

This example shows how to create a pipeline to perform a set of given steps.

---

## Import packages

In [1]:
from iguanas.rule_generation import RuleGeneratorDT
from iguanas.rule_selection import SimpleFilter, CorrelatedFilter
from iguanas.metrics import FScore, Precision, JaccardSimilarity
from iguanas.rbs import RBSOptimiser, RBSPipeline
from iguanas.correlation_reduction import AgglomerativeClusteringReducer
from iguanas.pipeline import LinearPipeline
from iguanas.pipeline.class_accessor import ClassAccessor

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from category_encoders.one_hot import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

## Read in data

Let's read in the famous Titanic data set and split it into training and test sets:

In [2]:
df = pd.read_csv('../../../examples/dummy_data/titanic.csv', index_col='PassengerId')
target_col = 'Survived'
cols_to_drop = ['Name', 'Ticket', 'Cabin']
X = df.drop([target_col] + cols_to_drop, axis=1)
y = df[target_col]

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.33,
    random_state=42
)

## Data processing

Let's apply the following simple steps to process the data:
* One-hot encode categorical variables (accounting for nulls)
* Impute numeric features with -1

In [4]:
# OHE
encoder = OneHotEncoder(
    use_cat_names=True
)
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

# Impute
X_train.fillna(-1, inplace=True)
X_test.fillna(-1, inplace=True)


---

## Set up pipeline

Let's say that we want to apply the following processes as part of our pipeline:

1. Rule generation step
    * Use `RuleGeneratorDT` to generate rules using the processed data.
2. Rule processing step
    * Apply `SimpleFilter`, keeping rules with F1 score >= 0.1
    * Apply `CorrelatedFilter`, removing rules with a Jaccard Similarity >= 0.9
3. Rule predictor step
    * Use the `RBSOptimiser` to optimise an `RBSPipeline` for F1 score. This will create a rule predictor.
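To make the correlation criterion in step 2 concrete, here is a minimal sketch of Jaccard similarity between the binary outputs of two rules, compared against the 0.9 threshold. The function `jaccard_similarity` and the rule vectors are made up for illustration; how `AgglomerativeClusteringReducer` clusters rules and decides which one to keep is not reproduced here.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two binary vectors: |A and B| / |A or B|."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0


# Hypothetical rule outputs (1 = rule triggers on that record)
rule_a = [1] * 19 + [0] * 5
rule_b = [1] * 20 + [0] * 4   # triggers on almost the same records as rule_a
rule_c = [0] * 19 + [1] * 5   # triggers on entirely different records

threshold = 0.9
sim_ab = jaccard_similarity(rule_a, rule_b)
sim_ac = jaccard_similarity(rule_a, rule_c)

# Pairs at or above the threshold would be treated as correlated
ab_correlated = sim_ab >= threshold
ac_correlated = sim_ac >= threshold
print(sim_ab, ab_correlated)
print(sim_ac, ac_correlated)
```

Here `rule_a` and `rule_b` overlap on 19 of the 20 records that either rule flags, so the pair counts as correlated, while `rule_a` and `rule_c` share no records.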

To create a pipeline to do this, let's first instantiate the relevant classes:

In [16]:
f1 = FScore(beta=1)
js = JaccardSimilarity()

In [6]:
# Rule generation
generator = RuleGeneratorDT(
    metric=f1.fit,
    n_total_conditions=4,
    tree_ensemble=RandomForestClassifier(
        n_estimators=10,
        random_state=0
    )
)
# Rule processing
simple_filterer = SimpleFilter(
    threshold=0.1, 
    operator='>=', 
    metric=f1.fit
)
corr_filterer = CorrelatedFilter(
    correlation_reduction_class=AgglomerativeClusteringReducer(
        threshold=0.9, 
        strategy='top_down', 
        similarity_function=js.fit, 
        metric=f1.fit
    )
)
# Rule prediction
rbs_pipeline = RBSPipeline(
    config=[],
    final_decision=0
)
rbs_optimiser = RBSOptimiser(
    pipeline=rbs_pipeline,
    metric=f1.fit, 
    pos_pred_rules=ClassAccessor(
        class_tag='corr_filterer', 
        class_attribute='rules_to_keep'
    ),
    n_iter=10
)

**Note:** The argument passed to the `pos_pred_rules` parameter of the `RBSOptimiser` class is a `ClassAccessor` object. It takes the names of the rules that remain after the `CorrelatedFilter` has been applied and passes them to the `pos_pred_rules` parameter of the `RBSOptimiser` class.
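The idea behind this kind of accessor is deferred lookup: the rule names do not exist when the classes are instantiated, so a placeholder records where to find them once the earlier step has been fitted. The sketch below shows that idea with hypothetical classes (`ToyAccessor`, `ToyFilter`) and a hypothetical `resolve` method; it is an assumption about the general pattern, not the Iguanas implementation of `ClassAccessor`.

```python
class ToyAccessor:
    """Hypothetical placeholder naming a step and one of its attributes."""

    def __init__(self, class_tag, class_attribute):
        self.class_tag = class_tag
        self.class_attribute = class_attribute

    def resolve(self, steps):
        """Look up the attribute on the step with the matching tag."""
        for tag, step in steps:
            if tag == self.class_tag:
                return getattr(step, self.class_attribute)
        raise ValueError(f"No step tagged {self.class_tag!r}")


class ToyFilter:
    """Hypothetical filter that only knows which rules to keep once fitted."""

    def fit(self, rule_scores):
        self.rules_to_keep = [
            name for name, score in rule_scores.items() if score >= 0.1
        ]
        return self


toy_filter = ToyFilter()
accessor = ToyAccessor(class_tag="corr_filterer", class_attribute="rules_to_keep")
steps = [("corr_filterer", toy_filter)]

# The attribute only exists after fitting, so the lookup happens afterwards
toy_filter.fit({"rule_1": 0.4, "rule_2": 0.05, "rule_3": 0.2})
resolved = accessor.resolve(steps)
print(resolved)
```

The `class_tag` has to match the string used for the step in the pipeline, which is why the tag `'corr_filterer'` appears both in the accessor and in the step list.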

Now we can create the steps of our pipeline. Each step should be a tuple of two elements:

1. The first element should be a string which refers to the step.
2. The second element should be the instantiated class which runs at that step.

In [7]:
steps = [
    ('generator', generator),
    ('simple_filterer', simple_filterer),
    ('corr_filterer', corr_filterer),
    ('rbs_optimiser', rbs_optimiser)
]

Finally, we can instantiate our pipeline:

In [8]:
lp = LinearPipeline(steps=steps)

## Using the pipeline

### `fit` method

By running the `fit` method, we sequentially run the `fit_transform` methods of each step in the pipeline, except for the last step, where the `fit` method is run:

In [9]:
lp.fit(X_train, y_train, None)

#### Outputs

The `fit` method doesn't return anything. However, you can access the attributes of the fitted classes using the `get_params` method.

### `fit_predict` method

By running the `fit_predict` method, we sequentially run the `fit_transform` methods of each step in the pipeline, except for the last step, where the `fit_predict` method is run:

In [11]:
y_pred_train = lp.fit_predict(X_train, y_train, None)

#### Outputs

The `fit_predict` method returns the prediction generated by the class in the final step of the pipeline - in this case, the `RBSOptimiser`:

In [12]:
y_pred_train

PassengerId
7      1
719    1
686    1
74     1
883    1
      ..
107    1
271    1
861    1
436    1
103    0
Name: Stage=0, Decision=1, Length: 596, dtype: int64

### `predict` method

By running the `predict` method, we sequentially run the `transform` methods of each step in the pipeline, except for the last step, where the `predict` method is run. Note that before using this method, you should first run either the `fit` or `fit_predict` methods:

In [13]:
y_pred_test = lp.predict(X_test)

#### Outputs

The `predict` method returns the prediction generated by the class in the final step of the pipeline - in this case, the `RBSOptimiser`:

In [14]:
y_pred_test

PassengerId
710    1
440    1
841    1
721    1
40     1
      ..
716    1
526    1
382    1
141    1
174    1
Name: Stage=0, Decision=1, Length: 295, dtype: int64
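The chaining behaviour of `fit`, `fit_predict` and `predict` described in this section can be sketched with a toy pipeline. Everything below (`ToyLinearPipeline`, `ToyCentre`, `ToySignPredictor`) is a hypothetical stand-in written for this example; it omits the sample weight argument and the `ClassAccessor` handling, and should not be read as the `LinearPipeline` source.

```python
class ToyLinearPipeline:
    """Toy stand-in for the chaining described above; not the Iguanas code."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        # fit_transform every step except the last, then fit the last step
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X, y)
        self.steps[-1][1].fit(X, y)

    def fit_predict(self, X, y):
        # Same chain, but the last step runs fit_predict
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X, y)
        return self.steps[-1][1].fit_predict(X, y)

    def predict(self, X):
        # transform with the already fitted steps, then predict
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


class ToyCentre:
    """Hypothetical intermediate step: subtracts the training mean."""

    def fit_transform(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self.transform(X)

    def transform(self, X):
        return [x - self.mean_ for x in X]


class ToySignPredictor:
    """Hypothetical final step: predicts 1 for positive inputs."""

    def fit(self, X, y=None):
        return self

    def fit_predict(self, X, y=None):
        return self.fit(X, y).predict(X)

    def predict(self, X):
        return [int(x > 0) for x in X]


toy = ToyLinearPipeline([("centre", ToyCentre()), ("predictor", ToySignPredictor())])
train_preds = toy.fit_predict([1, 2, 3, 6], [0, 0, 0, 1])
test_preds = toy.predict([2, 5])
print(train_preds, test_preds)
```

Note that `predict` reuses the mean learned on the training data, which is why the pipeline must be fitted before it is called.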

We can now calculate the F1 score of our pipeline using the test data:

In [15]:
f1.fit(y_pred_test, y_test)

0.5651105651105651
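For reference, the F-beta score combines precision and recall as `(1 + beta^2) * precision * recall / (beta^2 * precision + recall)`, which reduces to the F1 score when `beta=1`. The sketch below computes it from scratch on made-up labels; the function `f_beta` is a hypothetical helper, and its argument order simply mirrors the `f1.fit(y_pred_test, y_test)` call above.

```python
def f_beta(y_pred, y_true, beta=1.0):
    """F-beta score for binary labels, computed from TP, FP and FN counts."""
    tp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(y_pred, y_true) if p == 0 and t == 1)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up predictions and labels: TP=2, FP=2, FN=1
y_pred_toy = [1, 1, 1, 0, 0, 1]
y_true_toy = [1, 0, 1, 1, 0, 0]

score = f_beta(y_pred_toy, y_true_toy, beta=1)
print(round(score, 4))
```

With precision 0.5 and recall 2/3, the F1 score for this toy example works out to 4/7.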

This approach is very powerful when optimising hyperparameters for the overall performance of a Rules-Based System - see the `BayesSearchCV` class in the `rule_selection` module for more information.

---