# Maggy Ablation: Feature Ablation for the Titanic Dataset

In this notebook, we demonstrate Maggy's Feature Ablation API using a TensorFlow Keras Sequential model trained on the [Titanic Dataset](https://www.kaggle.com/c/titanic/data). To follow along, make sure you have the Titanic training dataset registered in your Project's Feature Store, as explained [in this example notebook](https://github.com/logicalclocks/hops-examples/blob/master/notebooks/featurestore/datasets/TitanicTrainingDatasetPython.ipynb).

## Wait ... What is an *Ablation Study*?

An ablation study, in medical and psychological research, is a research method in which the roles and functions of an organ, tissue, or any other part of a living organism are examined by surgically removing it and observing the organism's behaviour in its absence. This method, also known as experimental ablation, was pioneered by the French physiologist [Marie Jean Pierre Flourens](https://en.wikipedia.org/wiki/Jean_Pierre_Flourens) in the early nineteenth century. Flourens would perform ablative brain surgeries on animals, removing different parts of their nervous systems and observing the effects on their behaviour. The method has since been used in a variety of disciplines, most prominently in medical and psychological research and in neuroscience.

## What Does it Have to Do with Machine Learning?

In the context of machine learning, we can define an ablation study as *“a scientific examination of a machine learning system by removing its building blocks in order to gain insight into their effects on its overall performance”*. Dataset features and model components are notable examples of these building blocks (hence the corresponding terms **feature ablation** and **model ablation**), but any design choice or module of the system may be included in an ablation study.

## Experiments and Trials

We can think of an ablation study as an *experiment* that consists of several *trials*. For example, each model ablation trial involves training a model with one or more of its components (e.g. a layer) removed. Similarly, a feature ablation trial involves training a model on a different subset of the dataset features and observing the outcome.

## Ablation Studies with Maggy

With Maggy, performing ablation studies of your machine learning or deep learning systems is a fairly simple task that consists of the following steps:

1. Creating an `AblationStudy` instance,
2. Specifying the components that you want to ablate by *including* them in your `AblationStudy` instance,
3. Defining a *base model generator function* and/or a *dataset generator function*,
4. Wrapping your TensorFlow/Keras code in a Python function (let's call it the **training function**) that receives two arguments (`model_function` and `dataset_function`), and
5. Launching your experiment with Maggy while specifying an *ablation policy*.

It's as simple as that.

## What Changes Should I Make in my TensorFlow/Keras Code?

Not much. You'll see an example shortly, but the most important points are:

- For **model ablation**, you need to define a function that returns a TF/Keras `model`, and use that in your code instead of defining the model in your training function. If you want to perform **layer ablation**, then you should provide a `name` argument while adding layers to your `tf.keras.Sequential` model, and include those names in your `AblationStudy` instance as well.

- For **feature ablation**:
    - if you have your training dataset in the [**Feature Store**](https://www.logicalclocks.com/featurestorepage) in the form of `tfrecord` files, you can directly include the features you want to ablate by name and call a *dataset generator function* in your training function. Maggy creates the dataset generator function under the hood for each feature ablation trial.
    - alternatively, you can define your own *dataset generator function* and pass it to your `AblationStudy` instance initializer as an argument. 
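
To see why layer names matter for layer ablation, here is a minimal, framework-free sketch. It represents a model as a list of layer descriptions and removes one layer by name. The helper `ablate_layer` and the layer names `my_dense_one` and `output` are hypothetical and exist only for illustration; `my_dense_two` and `my_dense_sigmoid` are the names used later in this notebook. This is not how Maggy implements layer ablation internally.

```python
# Framework-free sketch of layer ablation. `ablate_layer` is a hypothetical
# helper for illustration only; it is not part of Maggy or Keras.

# 'my_dense_two' and 'my_dense_sigmoid' are the names used in this notebook;
# 'my_dense_one' and 'output' are made-up names for the unnamed layers.
base_layers = [
    {"name": "my_dense_one", "units": 64, "activation": "relu"},
    {"name": "my_dense_two", "units": 64, "activation": "relu"},
    {"name": "my_dense_sigmoid", "units": 2, "activation": "sigmoid"},
    {"name": "output", "units": 1, "activation": "linear"},
]


def ablate_layer(layers, layer_name):
    """Return a copy of `layers` without the layer called `layer_name`."""
    if layer_name not in [layer["name"] for layer in layers]:
        raise ValueError(f"No layer named {layer_name!r}")
    return [layer for layer in layers if layer["name"] != layer_name]


ablated_layers = ablate_layer(base_layers, "my_dense_two")
print([layer["name"] for layer in ablated_layers])
```

Without an explicit name, there would be no stable handle for saying which layer a trial should leave out.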
    
Now let's see how this actually works.
Get your `SparkSession` by executing the following cell:

In [1]:
from hops import hdfs
from hops import featurestore
import maggy

Starting Spark application

| ID | YARN Application ID | Kind | State | Spark UI | Driver log |
|---|---|---|---|---|---|
| 91 | application_1596125182098_0095 | pyspark | idle | Link | Link |

SparkSession available as 'spark'.


The next step is to create an `AblationStudy` instance. Here, the required arguments are 1) the name of your training dataset *as it is in your project's feature store*, and 2) the name of the *label* column.

You can also provide the version of your training dataset in the feature store, but the default version is `1`.

In [2]:

# create an AblationStudy instance.

from maggy.ablation import AblationStudy

ablation_study = AblationStudy('titanic_train_dataset', training_dataset_version=1,
                              label_name='survived')

## Feature Ablation

We perform feature ablation by **including** features in our `AblationStudy` instance. Including a feature means that there will be a trial where the model will be trained *without* that feature. In other words, you include features in the ablation study so that they will be excluded from the training dataset.

We have the following features in our training dataset:

`['age', 'fare', 'parch', 'pclass', 'sex', 'sibsp', 'survived']`

You can include features using the `features.include()` method of your `AblationStudy` instance, passing the feature names either separately or as a list of strings:

In [3]:
# include features one by one

ablation_study.features.include('pclass')

# include a list of features

list_of_features = ['fare', 'sibsp', 'sex', 'parch', 'age']
ablation_study.features.include(list_of_features)

In [4]:
ablation_study.features.list_all()

fare
sibsp
pclass
age
sex
parch

In [5]:
# define the base model generator function

def base_model_generator():
    import tensorflow as tf
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(64, activation='relu'))
    model.add(tf.keras.layers.Dense(64, name='my_dense_two', activation='relu'))
    model.add(tf.keras.layers.Dense(2, name='my_dense_sigmoid', activation='sigmoid'))
    # output layer
    model.add(tf.keras.layers.Dense(1, activation='linear'))
    return model

In [6]:
# set the base model generator

ablation_study.model.set_base_model_generator(base_model_generator)

Just to recap, the ablator will generate one trial for each feature included in the `AblationStudy` instance, plus one base trial that contains all the features.
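
The trial set described in this recap can be sketched in plain Python. The helper `loco_feature_trials` below is hypothetical and only mirrors the idea of leaving one feature out per trial; it is not Maggy's implementation. The feature names are the ones listed earlier in this notebook.

```python
# Hypothetical helper, not part of Maggy: enumerate leave-one-out feature trials.
def loco_feature_trials(all_features, label, included):
    """Return one base trial plus one trial per included feature."""
    # The label column is never a training input.
    inputs = [f for f in all_features if f != label]
    # Base trial: nothing is ablated, all input features are used.
    trials = [{"ablated_feature": None, "training_features": inputs}]
    for feature in included:
        remaining = [f for f in inputs if f != feature]
        trials.append({"ablated_feature": feature, "training_features": remaining})
    return trials


all_features = ['age', 'fare', 'parch', 'pclass', 'sex', 'sibsp', 'survived']
included = ['pclass', 'fare', 'sibsp', 'sex', 'parch', 'age']

trials = loco_feature_trials(all_features, label='survived', included=included)
for trial in trials:
    print(trial["ablated_feature"], "->", trial["training_features"])
```

With six included features, this yields seven trials: the base trial and one trial per ablated feature.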

Now the only thing you need to do is wrap your training code in a Python function. You can name this function whatever you wish, but we will refer to it as the *training* or *wrapper* function. The ablator generates the `model_function` and `dataset_function` for each trial and passes them to your function, where you should call them. Apart from that, this is your everyday TensorFlow/Keras code:

In [7]:
# wrap your code in a Python function

from maggy import experiment

def training_fn(dataset_function, model_function):
    import tensorflow as tf
    epochs = 10
    batch_size = 30
    
    # since no custom dataset function is provided, maggy will use its own
    # dataset generator (implemented in ablation.utils package) to prepare
    # the dataset from the project featurestore
    dataset = dataset_function(epochs, batch_size)
    
    # 80% training, 20% test
    split = 4
    train_set = dataset.window(split, split + 1).flat_map(lambda *ds: ds[0] if len(ds) == 1 else tf.data.Dataset.zip(ds))
    test_set = dataset.skip(split).window(1, split + 1).flat_map(lambda *ds: ds[0] if len(ds) == 1 else tf.data.Dataset.zip(ds))
    
    
    model = model_function()
    model.compile(optimizer=tf.train.AdamOptimizer(0.001),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    
    history = model.fit(train_set, epochs=epochs, steps_per_epoch=30, verbose=0)
    
    test_score = model.evaluate(test_set)
    
    print('Test loss:', test_score[0])
    print('Test accuracy:', test_score[1])
    
    tf.keras.backend.clear_session()

    return test_score[1]
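
The `window`-based 80/20 split in the training function can be hard to read. As a rough illustration, the plain-Python sketch below mimics its index pattern: with `split = 4`, every group of five consecutive elements sends the first four to the training set and the fifth to the test set. The helper `interleaved_split` is hypothetical and works on a list rather than a `tf.data.Dataset`, whose elements here are presumably batches.

```python
# Hypothetical list-based illustration of the interleaved split; the notebook
# itself uses tf.data's window/flat_map on a dataset instead.
def interleaved_split(items, split=4):
    """Mimic window(split, split + 1) and skip(split).window(1, split + 1)."""
    period = split + 1
    # Positions 0..split-1 of every period go to training.
    train = [x for i, x in enumerate(items) if i % period < split]
    # The last position of every period goes to testing.
    test = [x for i, x in enumerate(items) if i % period == split]
    return train, test


train, test = interleaved_split(list(range(12)))
print(train)  # [0, 1, 2, 3, 5, 6, 7, 8, 10, 11]
print(test)   # [4, 9]
```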

In [8]:
# launch the experiment

result = experiment.lagom(map_fun=training_fn, experiment_type='ablation',
                           ablation_study=ablation_study, 
                           ablator='loco', 
                           name='TITANIC-LOCO-10-epochs-features'
                          )

0: Test loss: 0.5853615204493204
0: Test accuracy: 0.68421054
1: Test loss: 6.169979492823283
1: Test accuracy: 0.5964912
2: Test loss: 0.6550388385852178
2: Test accuracy: 0.7134503
0: Test loss: 6.169979492823283
0: Test accuracy: 0.5964912
3: Test loss: 6.169979492823283
3: Test accuracy: 0.5964912
1: Test loss: 0.5243698060512543
1: Test accuracy: 0.7719298
2: Test loss: 0.588441381851832
2: Test accuracy: 0.6608187
Started Maggy Experiment: TITANIC-LOCO-10-epochs-features, application_1596125182098_0095, run 1

------ LOCO Results ------ 
BEST Config Excludes {"ablated_feature": "fare", "ablated_layer": "None"} -- metric 0.7719298
WORST Config Excludes {"ablated_feature": "sex", "ablated_layer": "None"} -- metric 0.5964912
AVERAGE metric -- 0.6599832858358111
Total Job Time 0 hours, 0 minutes, 59 seconds

Finished Experiment
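
As a quick sanity check on the summary, the snippet below recomputes the best, worst, and average metric from the seven test accuracies printed in the log above. The values are copied from that output in the order they appear. The log does not say which accuracy belongs to which ablated feature, so no such mapping is assumed here.

```python
# Test accuracies copied from the trial output above, in log order.
accuracies = [
    0.68421054,
    0.5964912,
    0.7134503,
    0.5964912,
    0.5964912,
    0.7719298,
    0.6608187,
]

best = max(accuracies)
worst = min(accuracies)
average = sum(accuracies) / len(accuracies)

print(f"best={best}, worst={worst}, average={average:.4f}")
```

The recomputed average agrees with the reported `AVERAGE metric` to about four decimal places; the small remaining difference is consistent with the printed accuracies being rounded.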
