# Secure XGBoost Tutorial
#### RISE Camp tutorial on the Secure XGBoost project.

## Single Party XGBoost on Data Subset
First, we'll train an XGBoost model on a subset of the data. This simulates the federated setting, in which each party holds only a subset of the data that a central trusted server would have available for training. We'll then look at the performance of an XGBoost model trained on this subset alone.

Import the necessary libraries

In [None]:
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

Load in and examine our training data

In [None]:
training_data_subset = pd.read_csv('/data/msd_training_data_split.csv', sep=",", header=None)
training_data_subset.head()

In [None]:
y_train_subset = training_data_subset.iloc[:, 0]
y_train_subset.head()

In [None]:
x_train_subset = training_data_subset.iloc[:, 1:]
x_train_subset.head()

Do the same with the test data

In [None]:
test_data_subset = pd.read_csv('/data/msd_test_data_split.csv', sep=",", header=None)
y_test_subset = test_data_subset.iloc[:, 0]
x_test_subset = test_data_subset.iloc[:, 1:]

Train the model with the training data

In [None]:
model = xgb.XGBRegressor(n_estimators=40)
model.fit(x_train_subset, y_train_subset)

Get predictions and evaluate the model with the test data

In [None]:
preds = model.predict(x_test_subset)
# Root mean squared error on the test subset; scikit-learn's signature is (y_true, y_pred)
np.sqrt(mean_squared_error(y_test_subset, preds))
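
The import cell also brings in `mean_absolute_error`, which the evaluation cell does not use. As a minimal sketch, both metrics could be reported together with a small helper. `regression_report` is a hypothetical name, not part of the tutorial, and the values below are made-up stand-ins for `y_test_subset` and `preds`.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Return RMSE and MAE for two equal-length sequences (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    return {
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "mae": float(np.mean(np.abs(errors))),
    }


# Made-up values standing in for y_test_subset and preds
report = regression_report([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(report)
```

RMSE penalizes large errors more heavily than MAE, so looking at both gives a fuller picture of how the subset-trained model misses.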

## Single Party XGBoost on Full Dataset
Next, we'll look at the performance (in terms of both speed and accuracy) of training an XGBoost model on the full dataset. A model trained on the full dataset is generally more accurate than one trained on a single party's subset, which helps motivate running XGBoost in the federated setting.

In [None]:
training_data = pd.read_csv('/data/msd_training_data.csv', sep=",", header=None)
y_train = training_data.iloc[:, 0]
x_train = training_data.iloc[:, 1:]

In [None]:
test_data = pd.read_csv('/data/msd_test_data.csv', sep=",", header=None)
y_test = test_data.iloc[:, 0]
x_test = test_data.iloc[:, 1:]

In [None]:
full_model = xgb.XGBRegressor(n_estimators=40)
full_model.fit(x_train, y_train)

In [None]:
preds = full_model.predict(x_test)
np.sqrt(mean_squared_error(y_test, preds))
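
One way to compare training speed between the subset and the full dataset is to wrap each call in a wall-clock timer. The helper below is only a sketch: `timed` is a hypothetical name, and the `sum` call stands in for `full_model.fit(x_train, y_train)` so that the example runs without the dataset.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook this could wrap model.fit or full_model.fit
result, seconds = timed(sum, range(1_000_000))
print(f"finished in {seconds:.4f} s")
```

In the notebook, `_, seconds = timed(full_model.fit, x_train, y_train)` would time the training cell, and the same call on `model.fit` gives the subset figure to compare against.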

## Multiparty XGBoost with Distributed Data
This section demonstrates a workflow in which each party has its own data and sends a copy of it to the master. All the training data therefore travels over the network to the master, which collects it and trains a model locally on the combined data. We will also measure the number of bytes sent over the network to show how much bandwidth this workflow requires.

In [None]:
# If you are a worker, run this cell
!scp /data/training_data_1.csv <master_ip>:/data
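
The size of each CSV on disk gives a rough estimate of the bytes this workflow moves; it ignores `scp` protocol and encryption overhead. The following is a sketch in which `total_bytes` and `format_bytes` are hypothetical helpers and a small temporary file stands in for `/data/training_data_1.csv`.

```python
import os
import tempfile


def total_bytes(paths):
    """Sum the on-disk sizes, in bytes, of the given files."""
    return sum(os.path.getsize(path) for path in paths)


def format_bytes(num_bytes):
    """Render a byte count with a binary unit suffix."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024


# Temporary stand-in for the file a worker would send to the master
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "training_data_1.csv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("2001,0.5,1.5\n")
    sent = total_bytes([demo_path])

print(f"approximate payload: {format_bytes(sent)}")
```

On the master, passing the paths of all received files to `total_bytes` approximates the total traffic for the federation.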

If you are the master, load all the data that has been sent to your machine. For example, if three other parties sent you data, make four calls to `read_csv()`: one for your own data and one for each of the three other parties' data.

In [None]:
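# The master's own split of the data
master_training_data = pd.read_csv('/data/msd_training_data_split.csv', sep=",", header=None)
# Data a worker copied over with scp; adjust the filename to match what each party sent
p1_training_data = pd.read_csv('/data/training_data_1.csv', sep=",", header=None)
master_training_data.shape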

Concatenate all the data in preparation for training

In [None]:
aggregated_training_data = pd.concat([master_training_data, p1_training_data])
aggregated_training_data.shape

In [None]:
y_agg_train = aggregated_training_data.iloc[:, 0]
x_agg_train = aggregated_training_data.iloc[:, 1:]

In [None]:
multiparty_model = xgb.XGBRegressor(n_estimators=40)
multiparty_model.fit(x_agg_train, y_agg_train)

In [None]:
# TODO: how are we going to set up test data?
preds = multiparty_model.predict(x_test)
np.sqrt(mean_squared_error(y_test, preds))

## Federated XGBoost
We will now run XGBoost in the federated setting. Unlike in the previous exercise, all data stays on its respective machine. This eliminates the need to transfer the training data over the network, a step that incurs high overhead and requires significant bandwidth. Instead, in each iteration, each party sends the master a summary of the update made to its local model. The master aggregates these updates, applies the aggregated update to its own model, and broadcasts the new model to all parties. The parties then train locally on the new model and send their next updates to the master.
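
To make the send, aggregate, and broadcast loop concrete, here is a toy sketch. It assumes the "model" is a single constant prediction and that each party's update is the sum and count of its residuals. This is an illustrative simplification, not the update that federated XGBoost actually exchanges, and the names `local_update`, `aggregate`, and `federated_fit` are hypothetical.

```python
def local_update(labels, model):
    """Party side: summarize residuals against the current model.

    Only the summary (residual sum, row count) leaves the party, not the labels.
    """
    residual_sum = sum(y - model for y in labels)
    return residual_sum, len(labels)


def aggregate(updates):
    """Master side: combine party summaries into one mean residual."""
    total_residual = sum(residual for residual, _ in updates)
    total_rows = sum(rows for _, rows in updates)
    return total_residual / total_rows


def federated_fit(parties, rounds=20, learning_rate=0.5):
    """Run the loop: parties send updates, master aggregates and broadcasts."""
    model = 0.0
    for _ in range(rounds):
        updates = [local_update(labels, model) for labels in parties]
        model += learning_rate * aggregate(updates)  # new model broadcast to all
    return model


# Three parties with made-up labels of different sizes
parties = [[1.0, 2.0, 3.0], [10.0], [4.0, 4.0]]
model = federated_fit(parties)
print(model)
```

The result approaches the mean of all labels across parties, which is what a single machine holding every row would compute, even though no party ever shares its raw data.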

### Edit hosts.config
The `hosts.config` file should contain the IPs and ports of all workers in the federation. After loading in the `hosts.config` file, modify it to contain the IPs of your new friends! Then write the new addresses back to the file by adding a magic to the top of the cell:

`%%writefile hosts.config`

Make sure to delete the `# %load hosts.config` line from the cell before saving it. We'll be continually using the `%load` and `%%writefile` magics in this tutorial to edit files.

In [None]:
%load hosts.config
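
Before writing the file back, it can help to sanity-check the entries. The sketch below assumes one `ip:port` entry per line and treats lines starting with `#` as comments; the tutorial does not spell out the exact file format, so adjust the parsing if your `hosts.config` differs. `parse_hosts` is a hypothetical helper and the addresses are placeholders.

```python
import ipaddress


def parse_hosts(text):
    """Parse 'ip:port' lines (assumed format) into (ip, port) tuples."""
    hosts = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        host, sep, port = line.rpartition(":")
        if not sep:
            raise ValueError(f"line {line_no}: expected ip:port, got {line!r}")
        ipaddress.ip_address(host)  # raises ValueError on a malformed IP
        port_num = int(port)  # raises ValueError on a non-numeric port
        if not 0 < port_num < 65536:
            raise ValueError(f"line {line_no}: port {port_num} out of range")
        hosts.append((host, port_num))
    return hosts


# Placeholder addresses from the documentation range, not real workers
sample = """
192.0.2.10:22
192.0.2.11:22
"""
hosts = parse_hosts(sample)
print(hosts)
```

A malformed line raises `ValueError` with the offending line number, which is easier to debug than a job that fails to reach a worker.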

### Add SSH keys
We now need to add the master's public SSH key to each worker's `authorized_keys` file.

In [None]:
# Run this if you're the master to get your SSH public key
!cat ~/.ssh/id_rsa.pub

In [None]:
# Run this if you're a worker to authorize the master to SSH into your machine.
# Replace the <master_ssh_key> with the actual key
!echo "<master_ssh_key>" >> ~/.ssh/authorized_keys
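
Running the `echo` cell more than once appends the same key repeatedly. As a sketch, the append can be made idempotent in Python. `authorize_key` is a hypothetical helper, the key string is a placeholder, and the example writes to a temporary file rather than the real `~/.ssh/authorized_keys`.

```python
import tempfile
from pathlib import Path


def authorize_key(public_key, authorized_keys):
    """Append public_key to the file unless it is already listed.

    Returns True if the key was added, False if it was already present.
    """
    key = public_key.strip()
    path = Path(authorized_keys)
    text = path.read_text() if path.exists() else ""
    if key in (line.strip() for line in text.splitlines()):
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as handle:
        if text and not text.endswith("\n"):
            handle.write("\n")  # keep the new key on its own line
        handle.write(key + "\n")
    return True


# Demonstration against a temporary file with a placeholder key
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "authorized_keys"
    placeholder_key = "ssh-rsa AAAAEXAMPLEKEY master@example"
    first = authorize_key(placeholder_key, demo_file)
    second = authorize_key(placeholder_key, demo_file)
    line_count = len(demo_file.read_text().splitlines())

print(first, second, line_count)
```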

### Modifying the Training/Eval Script
We will now modify the script that will be run for training and evaluation. Load it in by running the following cell. The contents of the script should appear in the cell.

In [None]:
%load tutorial.py

### Start Job
After modifying the script, we can start our job! We use the `start_job.sh` script with the given options to do so.

The following flags must be specified when running the script.

`./start_job.sh`

* `-m | --worker-memory` string, specified as `<memory>g`, e.g. `3g`
    * Amount of memory on workers allocated to job
* `-p | --num-parties` integer
    * Number of parties in the federation
* `-d | --dir` string
    * Path to created subdirectory containing job script, e.g. `/home/ubuntu/mc2/federated-xgboost/risecamp`
* `-j | --job` string
    * Path to job script. This should be the parameter passed into the `--dir` option concatenated with the job script file name, e.g. `/home/ubuntu/mc2/federated-xgboost/risecamp/tutorial.py`
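
Because the `--job` path is the `--dir` path joined with the script name, assembling the command programmatically keeps the two consistent. This sketch uses only the flags listed above; `build_start_job_command` is a hypothetical helper, and the directory is the example path from the flag description.

```python
import posixpath
import shlex


def build_start_job_command(num_parties, worker_memory_gb, job_dir, script_name):
    """Build the start_job.sh argument list from the documented flags."""
    job_path = posixpath.join(job_dir, script_name)  # --job = --dir + file name
    return [
        "./start_job.sh",
        "-p", str(num_parties),
        "-m", f"{worker_memory_gb}g",
        "-d", job_dir,
        "-j", job_path,
    ]


cmd = build_start_job_command(
    num_parties=3,
    worker_memory_gb=3,
    job_dir="/home/ubuntu/mc2/federated-xgboost/risecamp",
    script_name="tutorial.py",
)
print(shlex.join(cmd))
```

The printed line can be pasted after `!` in a notebook cell, or the list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains `start_job.sh`.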

In [None]:
!./start_job.sh -p 3 -m 3g -d /home/$USER -j /home/$USER/tutorial.py