# The basics of the ML platform
This ML platform is a customisable tool for building and training machine learning models. It takes a configuration object with parameters such as data location, image size, model type, and loss function, so that models can be built and trained with minimal code. The platform supports both central and decentralised training, and it also gives access to the data without running any training.

In this notebook we will go through the basics of the platform. We will start by importing the platform and creating a configuration object. We will then use the configuration object to create a dataset and a model. We will then train the model and evaluate the results.

## About the platform
The platform was developed to allow fast iteration when benchmarking and comparing different learning methods. It currently supports the following methods:

- Centralised learning
- Federated learning (uses Flower)
- Swarm learning (uses Ray)
- Federated baseline (edgeless graph case of swarm)

### Tutorial notebooks
There are only a few tutorials and not many more are planned; however, here are some examples and walkthroughs of the platform.
- [Example 0: The basics of the ML platform](/platform/example-0-platform.ipynb)
- [Example 1: Working with data](/platform/example-1-data.ipynb)

### Experiments
Experiments that we have run that are relevant to our thesis are made available in the experiments folder. For each experiment we have a notebook that describes the experiment and a configuration file that can be used to run the experiment. 

### Some known issues
- Running multiple instances of the platform may not work as expected, because Ray is the backend for federated and swarm learning.
- For federated and swarm learning, Ray leaves idle workers with high GPU memory usage, which may cause CUDA out-of-memory (OOM) errors. This can partially be mitigated by minimising the number of available processes/workers.
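
To judge how far the worker count needs to come down, it can help to estimate how many clients Ray can run at the same time. The sketch below is a back-of-the-envelope estimate, not platform code: `max_concurrent_clients` is a hypothetical helper, and it assumes Ray's logical resource accounting, where a client is scheduled only while both its CPU and GPU share fit within the totals given to Ray. The numbers are the ones used in the example configuration in this notebook.

```python
def max_concurrent_clients(ray_init_args, client_resources):
    """Estimate how many clients fit at once under logical resource limits."""
    limits = []
    for key in ("num_cpus", "num_gpus"):
        per_client = client_resources.get(key, 0)
        if per_client > 0:
            # Small epsilon guards against floating-point error in e.g. 2 / 0.2.
            limits.append(int(ray_init_args.get(key, 0) / per_client + 1e-9))
    # With no per-client requirement there is no resource-based limit to report.
    return min(limits) if limits else None


ray_init_args = {"num_cpus": 12, "num_gpus": 2}
client_resources = {"num_cpus": 2, "num_gpus": 0.2}

# CPU-bound: 12 / 2 = 6 clients; GPU-bound: 2 / 0.2 = 10 clients.
print(max_concurrent_clients(ray_init_args, client_resources))  # 6
```

Under these assumptions, lowering the total `num_cpus` or raising the per-client `num_gpus` reduces the number of clients that can hold GPU memory at once.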

## The configuration object
The configuration object is a dictionary that contains all the parameters needed to build and train a model. The configuration object is passed to the platform when creating a dataset, a model, or when training a model. The configuration object is also stored so that it can be used to reproduce the results.

Let's take a look at the `config`:

```python
config = {
    "data": {
        "path": "/mnt/ZOD",
        "version": "full",
        "ratio": 0.003,
        "shuffle_seed": 42,
        "img_size": 160,
        "transforms": "[Resize((img_size, img_size))]",
        "dataloader_args": {
            "prefetch_factor": 2,
            "num_workers": 2,
            "batch_size": 32,
        },
    },
    "model": {
        "name": "default",
        "args": {
            "num_output": 66,
        },
        "loss": "MSELoss",
    },
    "central": {
        "train": "false",
        "use_gpu": "true",
        "epochs": 10,
    },
    "decentralised": {
        "train": ["federated", "swarm", "baseline"],
        "global": {
            "n_clients": 5,
            "global_rounds": 3,
            "client_resources": {
                "num_cpus": 2,
                "num_gpus": 0.2,
            },
            "ray_init_args": {
                "include_dashboard": True,
                "num_cpus": 12,
                "num_gpus": 2,
            },
            "swarm_orchestrator": "synchronous_fixed_rounds_fc",
        },
        "client": {
            "epochs": 3,
        },
    },
}
```

Not all parameters are required. The example configuration above is a minimal one: it uses the same global and client configuration for all decentralised methods and few to no custom methods. For more information about the schema of the configuration object, please see the [configuration schema](utils/templates/config_schema.json).

Now, let's run the platform with the configuration object. The above configuration will create our required datasets and models, and train them. When you create an instance of the platform, a `run` is initialised. A `run` is stored under the `runs` folder and is named after the timestamp of when the platform was initialised (for example `2023-03-14_15:02:49`). The results and ongoing progress are logged to tensorboard files in this `run` folder, and all log files are stored there as well.
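
To make the folder layout concrete, here is a minimal sketch of how such a run folder could be created. It is an illustration only: `create_run_dir` and the `config.json` file name are hypothetical, and the platform may organise its run folder differently.

```python
import json
from datetime import datetime
from pathlib import Path


def create_run_dir(config, root="runs", now=None):
    """Create <root>/<timestamp>/ and store the config inside it."""
    now = now or datetime.now()
    # Same layout as the example name 2023-03-14_15:02:49.
    # Note that ":" is not allowed in file names on Windows.
    run_dir = Path(root) / now.strftime("%Y-%m-%d_%H:%M:%S")
    run_dir.mkdir(parents=True, exist_ok=False)
    # Hypothetical file name; the platform may store the config elsewhere.
    (run_dir / "config.json").write_text(json.dumps(config, indent=4))
    return run_dir
```

Calling `create_run_dir(config)` would then return a path such as `runs/2023-03-14_15:02:49`, which is the kind of folder that `tensorboard --logdir runs` picks up.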

To begin, we will start a tensorboard session to monitor the training progress. To do this, run the following command in a terminal:

```bash
tensorboard --logdir runs
```

This will start a web server that allows you to monitor the training progress. You can access it at [http://localhost:6006](http://localhost:6006) (or on another port if you specified one).

```python
from fedswarm import Platform

platform = Platform(config)
```

### Optional parameters
If required, you can specify different client and global configurations for the different decentralised methods. The summary notation lets you define one `global` and one `client` config for all decentralised tasks, but each task can also have its own.

The config parser reads the `decentralised` configuration and expands it, so the config fed to the platform is actually one where each decentralised task has its own `global` and `client` configuration. This expansion exists to simplify configuring the platform, and the expanded config is saved to tensorboard. It also means that you can specify the `global` and `client` configuration for an individual task directly.

For example, if we want to use a different `global` configuration for swarm learning but the same for federated and baseline, then we can use the following sub-config. Here the `swarm` values mirror the shared ones; change them as needed:

```python
    "decentralised":{
        "train": ["federated", "baseline"],
        "global":{
            "n_clients":5,
            "global_rounds":3,
            "client_resources":{
                "num_cpus": 2, 
                "num_gpus": 0.2
            },
            "ray_init_args":{
                "include_dashboard": True,
                "num_cpus": 12,
                "num_gpus": 2
            },
            "swarm_orchestrator": "synchronous_fixed_rounds_fc",
        },
        "client": {
            "epochs": 3
        }
    },
    "swarm":{
        "train": "true",
        "global":{
            "n_clients":5,
            "global_rounds":3,
            "client_resources":{
                "num_cpus": 2, 
                "num_gpus": 0.2
            },
            "ray_init_args":{
                "include_dashboard": True,
                "num_cpus": 12,
                "num_gpus": 2
            },
            "orchestrator": "synchronous_fixed_rounds_fc",
        },
        "client": {
            "epochs": 3
        }
    }
```

This can be done in the same manner for all decentralised methods, which allows for a lot of flexibility in the configuration of the platform. Note that a method-specific configuration must include the `"train"` key, set to `"true"`, to be enabled.
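
One way to picture the expansion step is the sketch below. It is not the platform's parser: `expand_decentralised` is a hypothetical helper, it ignores the shorthand keys, and it assumes that a method-specific section simply replaces the shared `global` and `client` values for that method. The `global_rounds` and `epochs` values for swarm are changed purely for illustration.

```python
from copy import deepcopy


def expand_decentralised(config):
    """Return one {"global": ..., "client": ...} config per enabled method."""
    shared = config.get("decentralised", {})
    expanded = {}
    # Methods listed under "train" all receive a copy of the shared settings.
    for method in shared.get("train", []):
        expanded[method] = {
            "global": deepcopy(shared.get("global", {})),
            "client": deepcopy(shared.get("client", {})),
        }
    # A method-specific section is used as-is when its "train" key is "true".
    for method in ("federated", "swarm", "baseline"):
        section = config.get(method)
        if section and section.get("train") == "true":
            expanded[method] = {
                "global": deepcopy(section.get("global", {})),
                "client": deepcopy(section.get("client", {})),
            }
    return expanded


config = {
    "decentralised": {
        "train": ["federated", "baseline"],
        "global": {"n_clients": 5, "global_rounds": 3},
        "client": {"epochs": 3},
    },
    "swarm": {
        "train": "true",
        "global": {"n_clients": 5, "global_rounds": 10},
        "client": {"epochs": 1},
    },
}

for method, method_config in expand_decentralised(config).items():
    print(method, method_config["global"]["global_rounds"])
```

In this sketch, federated and baseline share three global rounds while swarm runs with ten.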

### Shorthand notation
For easier configuration, the platform supports a shorthand notation. It allows you to define method-specific global arguments without writing a method-specific configuration. For instance, in the example above, note how we have included `swarm_orchestrator` in the `global` configuration. This is shorthand: the config parser will add `orchestrator` and its value to the `swarm` config under the `global` parameter.

To add more method specific parameters, ensure that the parameter in the summary notation begins with the method name and is followed by `"_"`. This is how the parser identifies which method the parameter belongs to.
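
The prefix rule can be sketched as follows. This is an assumption about how such a parser might work, not the platform's implementation; `split_shorthand` and `METHODS` are hypothetical names.

```python
METHODS = ("federated", "swarm", "baseline")


def split_shorthand(global_config):
    """Separate shared keys from keys prefixed with '<method>_'."""
    shared = {}
    per_method = {method: {} for method in METHODS}
    for key, value in global_config.items():
        for method in METHODS:
            prefix = method + "_"
            if key.startswith(prefix):
                # "swarm_orchestrator" becomes "orchestrator" for swarm only.
                per_method[method][key[len(prefix):]] = value
                break
        else:
            # No method prefix matched, so the key applies to every method.
            shared[key] = value
    return shared, per_method


shared, per_method = split_shorthand(
    {
        "n_clients": 5,
        "global_rounds": 3,
        "swarm_orchestrator": "synchronous_fixed_rounds_fc",
    }
)
print(shared)               # {'n_clients': 5, 'global_rounds': 3}
print(per_method["swarm"])  # {'orchestrator': 'synchronous_fixed_rounds_fc'}
```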

### Argument parameters
There are several argument parameters in the config. The dictionary provided in the config will be passed to the function that uses it as keyword arguments. Thus, it is important to ensure that the keys in the dictionary are the same as the keyword arguments of the function.

- `dataloader_args`: Arguments to be passed to the dataloader.
- `model_args`: Arguments to be passed to the model (given as `args` under `model` in the example above).
- `ray_init_args`: Arguments to be passed to `ray.init`.
- `client_resources`: Resources to be passed to the Ray client.

For requirements on the arguments, please see the documentation of the functions that use them.
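
The keyword-argument behaviour can be illustrated with plain Python. In the sketch below, `build_dataloader` is a hypothetical stand-in for whichever function receives `dataloader_args`; it only records what it was given.

```python
def build_dataloader(dataset, batch_size=1, num_workers=0, prefetch_factor=None):
    """Stand-in for a dataloader constructor; it only records its arguments."""
    return {
        "n_samples": len(dataset),
        "batch_size": batch_size,
        "num_workers": num_workers,
        "prefetch_factor": prefetch_factor,
    }


dataloader_args = {"prefetch_factor": 2, "num_workers": 2, "batch_size": 32}

# The dictionary is unpacked into keyword arguments.
loader = build_dataloader(range(100), **dataloader_args)
print(loader["batch_size"])  # 32

# A misspelled key fails at call time rather than being silently ignored.
try:
    build_dataloader(range(100), **{"batchsize": 32})
except TypeError as error:
    print("rejected:", error)
```

This is why a key that does not match a keyword argument of the receiving function typically surfaces as a `TypeError` when the platform starts, not as a silently ignored setting.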

### Function parameters
There are several parameters that define which functions to use, for instance:
- `model`: The model to use.
- `loss`: The loss function to use.
- `orchestrator`: The orchestrator to use for swarm learning.
- `train_val_id_generator`: The function to use to generate the train, validation and test ids.
- `transforms`: The transforms to use for the dataset.

For many of these, such as `loss`, the value must be written exactly as it would appear in code. For instance, to use MSE loss, write `MSELoss` as in `torch.nn.MSELoss`. This is because the platform uses `eval` to evaluate the string to a function. The same general concept applies to most of the function parameters.
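
The idea of turning a config string into an object with `eval` can be sketched without any dependencies. Everything below is illustrative: `MSELoss` and `Resize` are tiny stand-ins for the real `torch.nn.MSELoss` and image transform classes, `resolve` is a hypothetical helper, and the namespace the platform actually evaluates against is an assumption.

```python
class MSELoss:
    """Tiny pure-Python stand-in for torch.nn.MSELoss."""

    def __call__(self, predictions, targets):
        squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
        return sum(squared) / len(squared)


class Resize:
    """Stand-in for an image transform; it only stores the target size."""

    def __init__(self, size):
        self.size = size


def resolve(expression, namespace):
    """Evaluate a config string using only the names in `namespace`."""
    # Builtins are removed so the string can only use names from the namespace.
    return eval(expression, {"__builtins__": {}}, dict(namespace))


namespace = {"MSELoss": MSELoss, "Resize": Resize, "img_size": 160}

loss_fn = resolve("MSELoss", namespace)()
transforms = resolve("[Resize((img_size, img_size))]", namespace)

print(loss_fn([1.0, 2.0], [1.0, 4.0]))  # 2.0
print(transforms[0].size)               # (160, 160)
```

The sketch also shows why a string such as `"[Resize((img_size, img_size))]"` can refer to `img_size`: the name only has to exist in the namespace used for evaluation. Since `eval` executes whatever the string contains, and stripping builtins is not a real sandbox, configuration files should come from trusted sources.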