# Workflow Interface 405: Federated Evaluation
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/intel/openfl/blob/develop/openfl-tutorials/experimental/workflow/405_MNIST_FederatedEvaluation.ipynb)

Welcome to the first OpenFL Federated Evaluation Workflow Interface tutorial! This notebook demonstrates OpenFL's capability to run a horizontal federated evaluation workflow. This work has the following goals:

- Provide a template for federated evaluation that exposes key metrics after an evaluation run (e.g., model accuracy)
- Build on the first learning example for the workflow API (see the [101 MNIST Notebook](https://github.com/securefederatedai/openfl/tree/develop/openfl-tutorials/experimental/workflow/101_MNIST.ipynb)) by using the MNIST dataset to perform fedeval (federated evaluation)

# Getting Started

First, we install the necessary dependencies for the workflow interface as described in the [installation guide](https://openfl.readthedocs.io/en/latest/installation.html).

In [None]:
#Uncomment this if running in Google Colab and set USERNAME if running in docker container.
#!pip install -r https://raw.githubusercontent.com/intel/openfl/develop/openfl-tutorials/experimental/workflow/workflow_interface_requirements.txt
#import os
#os.environ["USERNAME"] = "colab"

Evaluation requires a pre-trained model, so this notebook has one prerequisite:
- A pre-trained model that can be loaded for evaluation

For this tutorial, let's use the final output model of a [101 MNIST Notebook](https://github.com/securefederatedai/openfl/tree/develop/openfl-tutorials/experimental/workflow/101_MNIST.ipynb) run. A sample is saved at [Pre-trained model](../pretrainedmodels/cnn_mnist.pth).

Sample output from the training run that produced the saved model:

In [None]:
Sample of the final model weights: tensor([[[ 0.1221, -0.0846, -0.0635,  0.0590, -0.2059],
         [ 0.1558, -0.0202,  0.1005,  0.0272, -0.0148],
         [ 0.1034,  0.0560,  0.1089, -0.0367,  0.0182],
         [ 0.0086,  0.0602,  0.0315,  0.2058,  0.0909],
         [-0.0778, -0.1234, -0.0414, -0.0904, -0.0548]]])

Final aggregated model accuracy for 2 rounds of training: 0.8463999927043915

Let's first define our dataloaders, model, and some helper functions as we would for any other deep learning experiment. Notice, however, how this setup differs from the typical training/learning setup detailed in the [101 MNIST Notebook](https://github.com/securefederatedai/openfl/tree/develop/openfl-tutorials/experimental/workflow/101_MNIST.ipynb):
- There is no need to download the training set because we only perform evaluation
- No optimizer settings are needed

In [None]:
import torch.nn as nn
import torch.nn.functional as F
import torch
import torchvision
import numpy as np

# Evaluation only: no optimizer import, learning rate, or log interval is required.
batch_size_test = 1000

random_seed = 1
torch.backends.cudnn.enabled = False
torch.manual_seed(random_seed)

mnist_test = torchvision.datasets.MNIST(
    "./files/",
    train=False,
    download=True,
    transform=torchvision.transforms.Compose(
        [
            torchvision.transforms.ToTensor(),
            torchvision.transforms.Normalize((0.1307,), (0.3081,)),
        ]
    ),
)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        # Pass dim explicitly; relying on the implicit dimension is deprecated.
        return F.log_softmax(x, dim=1)

def inference(network, test_loader):
    # Evaluate the network on the loader and return the fraction of correct predictions.
    network.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            output = network(data)
            # Sum the per-sample losses so the total can be averaged over the dataset.
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            pred = output.max(1, keepdim=True)[1]
            # .item() keeps the running count a plain Python int rather than a tensor.
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(test_loader.dataset)
    print('\nTest set: Avg. loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
    accuracy = correct / len(test_loader.dataset)
    return accuracy
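
The `inference` helper accumulates a summed loss and a correct-prediction count across batches and divides by the dataset size only once at the end. The following is a minimal plain-Python sketch of why that matters when batches differ in size; the per-batch numbers are hypothetical and are not taken from an actual run.

```python
# Hypothetical per-batch results: (summed loss, correct predictions, batch size).
batches = [
    (230.0, 850, 1000),
    (250.0, 840, 1000),
    (150.0, 430, 500),   # a smaller final batch
]

total_loss = 0.0
total_correct = 0
total_samples = 0
for summed_loss, correct, size in batches:
    total_loss += summed_loss
    total_correct += correct
    total_samples += size

# Dataset-level metrics: divide once by the total number of samples.
avg_loss = total_loss / total_samples
accuracy = total_correct / total_samples

# Averaging the per-batch means instead would over-weight the small batch.
mean_of_batch_means = sum(loss / size for loss, _, size in batches) / len(batches)

print(f"Avg. loss: {avg_loss:.4f}, accuracy: {accuracy:.4f}")
print(f"Mean of per-batch mean losses: {mean_of_batch_means:.4f}")
```

The two loss figures differ here because the last batch is half the size of the others; summing first and dividing once gives every sample equal weight.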

Next we import the `FLSpec`, `LocalRuntime`, and placement decorators.

- `FLSpec` – Defines the flow specification. User defined flows are subclasses of this.
- `Runtime` – Defines where the flow runs, infrastructure for task transitions (how information gets sent). The `LocalRuntime` runs the flow on a single node.
- `aggregator/collaborator` – Placement decorators that define where the task will be assigned

Now we come to the flow definition. The OpenFL Workflow Interface adopts the conventions set by Metaflow: every workflow begins with the `start` task and concludes with the `end` task. The aggregator begins with an optionally passed-in model; because this flow only evaluates, no optimizer is involved. In `start`, the list of collaborators is extracted from the runtime (`self.collaborators = self.runtime.collaborators`) and is then used as the list of participants that run the task passed to `self.next`, namely `evaluate`. The model, and any other attribute that is not explicitly excluded in the `self.next` call, is passed from the `start` task on the aggregator to the `evaluate` task on each collaborator. Where a task runs is determined by the placement decorator that precedes its definition (`@aggregator` or `@collaborator`). Each collaborator defined in the runtime runs `evaluate`, which performs only model evaluation/validation without any training and prints that collaborator's accuracy. Once all collaborators have finished, the results `join` at the aggregator, where the per-collaborator accuracies are averaged and the aggregated accuracy is printed.

![FedEval.png](../../../docs/images/FedEval.png)

In [None]:
from copy import deepcopy

from openfl.experimental.workflow.interface import FLSpec, Aggregator, Collaborator
from openfl.experimental.workflow.runtime import LocalRuntime
from openfl.experimental.workflow.placement import aggregator, collaborator


class FederatedEvaluationFlow(FLSpec):

    def __init__(self, model=None, rounds=1, **kwargs):
        super().__init__(**kwargs)
        if model is not None:
            self.model = model
        else:
            self.model = Net()
            
        self.rounds = rounds

    @aggregator
    def start(self):
        print(f'Performing initialization for model')
        self.collaborators = self.runtime.collaborators
        self.private = 10
        self.current_round = 0
        self.next(self.evaluate, foreach='collaborators', exclude=['private'])

    @collaborator
    def evaluate(self):
        print(f'Performing model evaluation for collaborator {self.input}')
        self.agg_validation_score = inference(self.model, self.test_loader)
        print(f'{self.input} value of {self.agg_validation_score}')
        self.next(self.join)

    @aggregator
    def join(self, inputs):
        self.aggregated_model_accuracy = sum(
            input.agg_validation_score for input in inputs) / len(inputs)
        print(f'Average aggregated model accuracy values = {self.aggregated_model_accuracy}')
        self.current_round += 1
        if self.current_round < self.rounds:
            self.next(self.evaluate,
                      foreach='collaborators', exclude=['private'])
        else:
            self.next(self.end)

    @aggregator
    def end(self):
        print(f'This is the end of the flow')
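
To make the data movement in this flow easier to follow, here is a minimal plain-Python stand-in for the `start` → `evaluate` → `join` sequence. It does not use OpenFL at all: the `evaluate` and `join` functions, the state dictionary, and the scores are hypothetical and only mimic how excluded attributes are withheld and how `join` averages the per-collaborator results.

```python
def evaluate(name, shared_state, local_score):
    # "Collaborator" step: receives the shared state and returns its own result.
    assert "private" not in shared_state  # excluded attributes never arrive here
    return {"input": name, "agg_validation_score": local_score}


def join(inputs):
    # "Aggregator" step: unweighted mean of the per-collaborator scores.
    return sum(i["agg_validation_score"] for i in inputs) / len(inputs)


# State held by the "aggregator" after start; 'private' is excluded from the hand-off.
state = {"model": "pretrained-weights", "private": 10, "current_round": 0}
excluded = ["private"]
shared = {key: value for key, value in state.items() if key not in excluded}

# Hypothetical local accuracies, one per collaborator.
local_scores = {"Portland": 0.84, "Seattle": 0.86, "Chandler": 0.85, "Bangalore": 0.83}

results = [evaluate(name, shared, score) for name, score in local_scores.items()]
aggregated = join(results)
print(f"Average aggregated model accuracy = {aggregated:.3f}")
```

In the real flow, the runtime performs the fan-out and the filtering of excluded attributes; this sketch only illustrates the shape of the result that `join` receives.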

Now let's set up the participants in a similar fashion to the basic learning/training tutorial. Notice the difference in the setup below: since we are only doing evaluation, there is no need to configure training-related data, targets, or data loaders.

In [None]:
# Setup participants
aggregator = Aggregator()
aggregator.private_attributes = {}

# Setup collaborators with private attributes
collaborator_names = ['Portland', 'Seattle', 'Chandler','Bangalore']
collaborators = [Collaborator(name=name) for name in collaborator_names]
for idx, collaborator in enumerate(collaborators):
    local_test = deepcopy(mnist_test)
    local_test.data = mnist_test.data[idx::len(collaborators)]
    local_test.targets = mnist_test.targets[idx::len(collaborators)]
    collaborator.private_attributes = {
            'test_loader': torch.utils.data.DataLoader(local_test,batch_size=batch_size_test, shuffle=True)
    }

local_runtime = LocalRuntime(aggregator=aggregator, collaborators=collaborators, backend='single_process')
print(f'Local runtime collaborators = {local_runtime.collaborators}')
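
The slicing `[idx::len(collaborators)]` gives each collaborator every fourth test sample, starting at its own index. The following small sketch shows the same strided slicing on a plain Python list (a stand-in for the dataset tensors) so the resulting shards are easy to inspect.

```python
# Stand-in for dataset indices; the notebook slices tensors in the same way.
samples = list(range(10))
num_collaborators = 4

# Collaborator idx takes every num_collaborators-th sample, starting at idx.
shards = [samples[idx::num_collaborators] for idx in range(num_collaborators)]
for idx, shard in enumerate(shards):
    print(idx, shard)

# The MNIST test set has 10,000 images, so 4 collaborators get equal shards.
mnist_shard_sizes = [len(range(10_000)[idx::4]) for idx in range(4)]
print(mnist_shard_sizes)
```

The shards are disjoint and together cover every sample. Because the MNIST shards are equal in size, the unweighted mean computed in `join` matches the accuracy over the full test set; with unequal shards, a sample-weighted mean would be needed for that to hold.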

Now that we have our evaluation flow and runtime defined, let's run the experiment! Since this is evaluation, we only need to run one round of validation. First, we load a pre-trained model.

In [None]:
!wget https://github.com/securefederatedai/openfl/raw/refs/heads/develop/openfl-tutorials/experimental/workflow/pretrainedmodels/cnn_mnist.pth -O cnn_mnist.pth
model = Net()
model.load_state_dict(torch.load('cnn_mnist.pth'))
best_model = model
flflow = FederatedEvaluationFlow(model, checkpoint=True)
flflow.runtime = local_runtime
flflow.run()

Now that the flow has completed, let's get the model accuracy

In [None]:
print(f'\nFinal model accuracy for {flflow.rounds} rounds of evaluation: {flflow.aggregated_model_accuracy}')

The reported accuracy should be within about ±0.05 of the accuracy of the pre-trained model used in this experiment, which, as shown above, was ~0.846.
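
The expected range can also be checked programmatically. This is a minimal sketch, assuming the ~0.846 reference accuracy and the ±0.05 tolerance mentioned above; the helper name `within_tolerance` is hypothetical and is not part of OpenFL.

```python
import math

reference_accuracy = 0.846  # approximate accuracy reported for the pre-trained model
tolerance = 0.05            # acceptable absolute deviation


def within_tolerance(measured, reference=reference_accuracy, tol=tolerance):
    # True when the measured accuracy is within +/- tol of the reference.
    return math.isclose(measured, reference, abs_tol=tol)


print(within_tolerance(0.85))  # a value close to the reference
print(within_tolerance(0.70))  # a value well outside the tolerance
```

In the notebook, the measured value would be `flflow.aggregated_model_accuracy`.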

Now that the flow is complete, let's dig into some of the information captured along the way

In [None]:
run_id = flflow._run_id
from metaflow import Metaflow, Flow, Task, Step
m = Metaflow()
s = Step(f'FederatedEvaluationFlow/{run_id}/evaluate')
list(s)

Now we see **4** tasks: **4** collaborators each performed **1** round of model evaluation.
Let's look at one of those data points.

In [None]:
t = Task(f'FederatedEvaluationFlow/{run_id}/evaluate/2')

Now let's look at the data artifacts this task generated

In [None]:
t.data

In [None]:
t.data.input

Now let's look at its log output (stdout)

In [None]:
print(t.stdout)

For more details on checkpointing and on using Metaflow to inspect a federation run, please refer to the previous tutorials on learning, such as the [101 MNIST Notebook](https://github.com/securefederatedai/openfl/tree/develop/openfl-tutorials/experimental/workflow/101_MNIST.ipynb).

# Congratulations!
You've successfully completed your first Federated Evaluation workflow interface quickstart notebook.