# Case Study 4 - Augmentation
These notebooks are also available on Google Colab. This enables you to run the notebooks without having to set up an environment locally and gives you access to GPUs to run the notebooks on.

[![Run in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JOstMJmhI2wcufyBqZ1iV3YqOdThJ-_U?usp=sharing#scrollTo=4fu5aheXqwv6)

## 1. Introduction
One of the most common problems for machine learning practitioners is having only a small dataset for the specific problem they are working on. Traditionally, this can lead to dead ends or hold-ups in projects while more data is collected. However, if you have a different dataset that shares common features, Synthcity may be able to solve the issue for you with "augmentation by domain adaptation".

### 1.1 The Task
Augment a small dataset using the concept of domain adaptation (or transfer learning). For this we will be using a RadialGAN as discussed in [this paper](https://arxiv.org/pdf/1802.06403.pdf).

## 2. Imports
Let's import the required standard and third-party libraries and the relevant Synthcity modules. We can also set the level of logging here.

In [None]:
# stdlib
import warnings
from pathlib import Path

# 3rd Party
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import auc, accuracy_score
import xgboost as xgb
import seaborn as sns
# from tqdm import tqdm

# synthcity absolute
import synthcity.logger as log
from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import GenericDataLoader
from synthcity.utils import serialization

warnings.filterwarnings("ignore")
# Set the level for the logging
# (to enable debug logging, add `import sys` above and uncomment the next line)
# log.add(sink=sys.stderr, level="DEBUG")
log.remove()

# Set up paths to resources
AUG_RES_PATH = Path("../resources/augmentation/")

## 3. The Scenario

Brazil is divided geopolitically into five macroregions: north, northeast, central-west, southeast, and south. For this case study, we will be acting as government officials in the Central-West Region of Brazil. Central-West Brazil is the smallest region in the country by population. It is also one of the larger and more rural regions. This means the number of COVID-19 patient records is significantly smaller compared to the larger regions.

<img src="../data/Brazil_COVID/Brazil_Labelled_Map.png" alt="Brazil Region Map" width="400"/>

COVID-19 hit different regions at different times. Cases peaked later in the Central-West than in the more densely populated and well-connected regions. This leaves us with scarce data on COVID-19 patients in our region, but with the potential lifeline of larger datasets from the other regions, which we can learn from in order to augment our dataset. We cannot simply train our model on the data from all regions, because there is significant covariate shift between the regions, and so we will achieve a better classifier by training solely on Central-West data, even if it is synthetic.

## 4. Load the data
Let's set it up as a classification task with a death-at-time-horizon column, as we did in a previous case study.

In [None]:
time_horizon = 14
X = pd.read_csv(f"../data/Brazil_COVID/covid_normalised_numericalised.csv")

X.loc[(X["Days_hospital_to_outcome"] <= time_horizon) & (X["is_dead"] == 1), f"is_dead_at_time_horizon={time_horizon}"] = 1
X.loc[(X["Days_hospital_to_outcome"] > time_horizon), f"is_dead_at_time_horizon={time_horizon}"] = 0
X.loc[(X["is_dead"] == 0), f"is_dead_at_time_horizon={time_horizon}"] = 0
X[f"is_dead_at_time_horizon={time_horizon}"] = X[f"is_dead_at_time_horizon={time_horizon}"].astype(int)

X.drop(columns=["is_dead", "Days_hospital_to_outcome"], inplace=True) # drop survival columns as they are not needed for a classification problem
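The labelling rule in the cell above can be checked on a small toy table. The sketch below is a compact restatement of the same rule, not the notebook's own code: `add_horizon_label` is a hypothetical helper and the toy values are made up. One difference to be aware of: rows with a missing `Days_hospital_to_outcome` receive label 0 here, whereas the cell above would leave them unset.

```python
import pandas as pd


def add_horizon_label(df: pd.DataFrame, time_horizon: int) -> pd.DataFrame:
    """Return a copy of df with a binary death-within-horizon column."""
    out = df.copy()
    label = f"is_dead_at_time_horizon={time_horizon}"
    # 1 only if the patient died AND the outcome fell within the horizon
    died_within = (out["is_dead"] == 1) & (
        out["Days_hospital_to_outcome"] <= time_horizon
    )
    out[label] = died_within.astype(int)
    return out


# Toy records (made-up values): died early, died late, survived, survived
toy = pd.DataFrame(
    {
        "is_dead": [1, 1, 0, 0],
        "Days_hospital_to_outcome": [5, 30, 5, 30],
    }
)
labelled = add_horizon_label(toy, time_horizon=14)
print(labelled["is_dead_at_time_horizon=14"].tolist())  # [1, 0, 0, 0]
```

Only the patient who died within 14 days is labelled 1; a death after the horizon counts as 0, as does survival.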

Here we define a region mapper, which maps the region encoding to the real values. These can be found in `synthetic-data-lab/data/Brazil_COVID/Brazil_COVID_data.md`.

In [None]:
# Define the mappings from region index to region
region_mapper = {
    0: "Central-West",
    1: "North",
    2: "Northeast",
    3: "South",
    4: "Southeast",
}

As we are acting as officials from Central-West Brazil, split the data into data from our region and data from other regions.

In [None]:
our_region_index = 0

# Flatten region to simulate the scenario where we don't know where the data has come from, we just have our data and other data
other_region_index = 1 if our_region_index != 1 else 0
X.loc[X["Region"] != our_region_index, "Region"] = other_region_index
print(X["Region"].value_counts().rename({our_region_index: "Our region", other_region_index: "Another region"}))

X_our_region_only = X.loc[X["Region"] == our_region_index].copy()
X_other_regions = X.loc[X["Region"] != our_region_index].copy()

display(X_our_region_only)
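The scenario assumes covariate shift between our region and the others. One rough way to look for it is to compare feature means between the two groups, scaled by a pooled standard deviation. The sketch below is an illustration only: `standardised_mean_difference` is a hypothetical helper, the toy column names and values are made up, and a large value is a hint of shift rather than a formal test.

```python
import pandas as pd


def standardised_mean_difference(
    ours: pd.DataFrame, others: pd.DataFrame
) -> pd.Series:
    """Absolute per-column difference in means, divided by a pooled std."""
    cols = ours.columns.intersection(others.columns)
    mean_diff = ours[cols].mean() - others[cols].mean()
    pooled_std = ((ours[cols].var() + others[cols].var()) / 2) ** 0.5
    # Constant columns have zero spread; mark them as NaN instead of dividing by 0
    smd = mean_diff / pooled_std.replace(0, float("nan"))
    return smd.abs().sort_values(ascending=False)


# Toy data standing in for the two groups (made-up columns and values)
toy_ours = pd.DataFrame({"Age": [30, 40, 50], "Fever": [1, 1, 0]})
toy_others = pd.DataFrame({"Age": [60, 70, 80], "Fever": [1, 0, 1]})
smd = standardised_mean_difference(toy_ours, toy_others)
print(smd)
```

In the notebook, the same function could be called as `standardised_mean_difference(X_our_region_only, X_other_regions)`; the `Region` column is constant within each group, so it comes out as NaN and can be ignored.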

## 5. The problem

Let's see how a model trained just on our data from the Central-West region performs.

### 5.1 Set up the data splits using train_test_split from sklearn.

In [None]:
y = X_our_region_only["is_dead_at_time_horizon=14"]
X_in = X_our_region_only.drop(columns=["is_dead_at_time_horizon=14"])

X_train, X_test, y_train, y_test = train_test_split(X_in, y, random_state=4)
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

### 5.2 Load classifier
Define the XGBoost classifier and train it on Central-West data only. To load a previously saved model instead, use the commented-out `load_model` line.

In [None]:
# Define the model
xgb_model = xgb.XGBClassifier(
    n_estimators=2000,
    learning_rate=0.01,
    max_depth=5,
    subsample=0.8, 
    colsample_bytree=1, 
    gamma=1, 
    objective="binary:logistic",
    random_state=42,
)

# Path of the saved model trained on Central-West data only
saved_model_path = AUG_RES_PATH / f"augmentation_xgboost_real_{region_mapper[our_region_index]}_data.json"
# To reuse a previously saved model, uncomment the next line and skip the fit/save below
# xgb_model.load_model(saved_model_path)

# Train the model and save it
xgb_model.fit(X_train, y_train)
xgb_model.save_model(saved_model_path)

### 5.3 Evaluate classifier
Now print the performance of the model trained on only Central-West data.

In [None]:
calculated_accuracy_score_train = accuracy_score(y_train, xgb_model.predict(X_train))
y_pred = xgb_model.predict(X_test)
calculated_accuracy_score_test = accuracy_score(y_test, y_pred)
print(f"Evaluating accuracy: train set: {calculated_accuracy_score_train} | test set: {calculated_accuracy_score_test}")

As you can see, the model is over-fitting significantly because the dataset is very small: accuracy on the training set is much higher than on the test set. There is clear room for improvement.
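Accuracy is easier to interpret next to a simple reference point. If the two classes are imbalanced (the notebook does not report the class balance, so this is an assumption), always predicting the most frequent label can already score well. The sketch below computes that baseline; `majority_class_accuracy` is a hypothetical helper and the toy labels are made up.

```python
import pandas as pd


def majority_class_accuracy(y: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent label in y."""
    return float(y.value_counts(normalize=True).max())


toy_labels = pd.Series([0, 0, 0, 1])  # made-up labels: three survivors, one death
baseline = majority_class_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:0.4f}")  # 0.7500
```

In the notebook, calling `majority_class_accuracy(y_test)` gives a number to compare the test accuracy against.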

## 6. Concatenation, not Augmentation

Let's test our assertion that we can't just train on the data from all regions and apply the model to our region.

### 6.1 Set up the training and testing data sets for the model

Make sure the training sets come from the all-region dataset, but the test sets come from our region.

In [None]:
y = X["is_dead_at_time_horizon=14"]
X_in = X.drop(columns=["is_dead_at_time_horizon=14"])

X_train, _, y_train, _ = train_test_split(X_in, y, random_state=4)
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)

y = X_our_region_only["is_dead_at_time_horizon=14"]
X_in = X_our_region_only.drop(columns=["is_dead_at_time_horizon=14"])

_, X_test, _, y_test = train_test_split(X_in, y, random_state=4)
X_test.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

### 6.2 Load the model trained on data from all regions.

In [None]:
# Define the model
xgb_model = xgb.XGBClassifier(
    n_estimators=2000,
    learning_rate=0.01,
    max_depth=5,
    subsample=0.8, 
    colsample_bytree=1, 
    gamma=1, 
    objective="binary:logistic",
    random_state=42,
)

# Path of the saved model trained on data from all regions
saved_model_path = AUG_RES_PATH / "augmentation_xgboost_real_all_data.json"
# To reuse a previously saved model, uncomment the next line and skip the fit/save below
# xgb_model.load_model(saved_model_path)

# Train the model and save it
xgb_model.fit(X_train, y_train)
xgb_model.save_model(saved_model_path)

### 6.3 Show the performance of the model.

In [None]:
calculated_accuracy_score_train = accuracy_score(y_train, xgb_model.predict(X_train))
y_pred = xgb_model.predict(X_test)
calculated_accuracy_score_test = accuracy_score(y_test, y_pred)
print(f"Evaluating accuracy: train set: {calculated_accuracy_score_train} | test set: {calculated_accuracy_score_test}")

As you can see, our accuracy does improve, but we can do better! There may well be cases where a greater covariate shift reduces this accuracy much further. It is also worth bearing in mind that there are contexts where the above approach is not even an option, such as when the datasets have only partially overlapping (or missing) features.

## 7. The Solution

Augment this dataset with the use of a RadialGAN.

### 7.1 Load the data

First, let's load the superset of data from all regions into a `GenericDataLoader` object.

In [None]:
loader = GenericDataLoader(
    X, # X is the dataframe which is a superset of all region data
    target_column="is_dead_at_time_horizon=14", # The column containing the labels to predict
    sensitive_features=["Ethnicity"], # The sensitive features in the dataset
    domain_column="Region", # This labels the domain that each record is from. Where it is `0` it is from our small dataset.
    random_state=42,
)

### 7.2 Load/Create the synthetic data model
Let's use a RadialGAN to augment the data. We need to load the plugin and then fit it to the dataloader object.

In [None]:
outdir = Path("saved_models")
outdir.mkdir(parents=True, exist_ok=True)  # make sure the save directory exists
prefix = "augmentation"
model = "radialgan"
n_iter = 100
random_state = 42

# Define saved model name
save_file = outdir / f"{prefix}.{model}_numericalised_{region_mapper[our_region_index]}_n_iter={n_iter}_rnd={random_state}.bkp"
# Load if available
if Path(save_file).exists():
    syn_model = serialization.load_from_file(save_file)
# create and fit if not available
else:
    syn_model = Plugins().get(model, n_iter=n_iter, random_state=random_state)
    syn_model.fit(loader)
    serialization.save_to_file(save_file, syn_model)


### 7.3 Augment the dataset

Let's use our synthetic model to generate some data and use it to augment our original dataset.

In [None]:
n_gen_records = 1000

synth_data = syn_model.generate(n_gen_records, domains=[our_region_index], random_state=random_state)

# Now we can augment our original dataset with our new synthetic data
augmented_data = pd.concat([
    synth_data.dataframe(),
    X_our_region_only,
])
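One thing to watch: `X_our_region_only` contains every real Central-West row, including the rows that `train_test_split` assigned to `X_test` in section 6.1, so a classifier trained on `augmented_data` has already seen the rows it is later tested on. A minimal sketch of one alternative, shown on made-up toy data, is to combine the synthetic rows with the real training split only. `augment_training_split` is a hypothetical helper, not part of Synthcity.

```python
import pandas as pd


def augment_training_split(
    synthetic: pd.DataFrame, real_train: pd.DataFrame
) -> pd.DataFrame:
    """Stack synthetic rows on top of the real training rows only."""
    return pd.concat([synthetic, real_train], ignore_index=True)


# Toy real data, split by hand into train and held-out test rows
toy_real = pd.DataFrame({"feature": [1, 2, 3, 4], "label": [0, 1, 0, 1]})
toy_train = toy_real.iloc[:3]
toy_test = toy_real.iloc[3:]

# Toy stand-in for generated records
toy_synth = pd.DataFrame({"feature": [10, 11], "label": [1, 0]})

toy_augmented = augment_training_split(toy_synth, toy_train)
print(len(toy_augmented))  # 5 rows: 2 synthetic + 3 real training rows
```

In the notebook this would mean re-creating the Central-West training split (features and label together) and passing it in place of `X_our_region_only`, so that the test rows stay unseen.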

### 7.4 Set up the data for the classifier
Now we need to test a model trained on the augmented dataset, so we need to set up our data splits again.

In [None]:
augmented_y = augmented_data["is_dead_at_time_horizon=14"]
augmented_X = augmented_data.drop(columns=["is_dead_at_time_horizon=14"])

augmented_y.reset_index(drop=True, inplace=True)
augmented_X.reset_index(drop=True, inplace=True)

### 7.5 Load the trained model

In [None]:
# Define the model
xgb_model = xgb.XGBClassifier(
    n_estimators=2000,
    learning_rate=0.01,
    max_depth=5,
    subsample=0.8, 
    colsample_bytree=1, 
    gamma=1, 
    objective="binary:logistic",
    random_state=42,
)

# Path of the saved model trained on the augmented dataset
saved_model_path = AUG_RES_PATH / "augmentation_xgboost_augmented_data.json"
# To reuse a previously saved model, uncomment the next line and skip the fit/save below
# xgb_model.load_model(saved_model_path)

# Train the model and save it
xgb_model.fit(augmented_X, augmented_y)
xgb_model.save_model(saved_model_path)

### 7.6 Evaluate the classifier's performance
Show the performance of the trained model.

In [None]:
y_pred_train = xgb_model.predict(augmented_X)
calculated_accuracy_score_train = accuracy_score(augmented_y, y_pred_train)
print(f"Accuracy on training set (augmented data): {calculated_accuracy_score_train:0.4f}")
y_pred = xgb_model.predict(X_test)
calculated_accuracy_score_test = accuracy_score(y_test, y_pred)
print(f"Accuracy on real data test set: {calculated_accuracy_score_test:0.4f}")

### 7.7 Results

Over-fitting on the training data is significantly reduced, and the accuracy is much higher than for the small dataset comprised solely of data from the Central-West region. We also see a significant improvement over training the model on the superset of the real data.

## 8. Extension
Use the code block below as a space to complete the following extension exercises.

### Can you generate some more augmented datasets to answer the following questions?
1) How much synthetic data should you create for best results?
2) How much does changing the RadialGAN plugin parameter `n_iter` change the quality of the generated data?
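Both questions come down to repeating the same experiment for several values of one parameter and comparing the scores. The scaffold below is one possible way to organise that. `sweep` and `dummy_experiment` are hypothetical names, and the dummy score is a placeholder so the sketch runs on its own; it says nothing about how the real results behave.

```python
from typing import Callable, Dict, Iterable


def sweep(
    values: Iterable[int], run_experiment: Callable[[int], float]
) -> Dict[int, float]:
    """Run one experiment per parameter value and collect the returned scores."""
    return {value: run_experiment(value) for value in values}


def dummy_experiment(n_gen_records: int) -> float:
    # Placeholder score only. Replace the body with code that generates
    # n_gen_records synthetic rows, retrains the classifier on the augmented
    # data and returns the accuracy on the real test set.
    return min(1.0, 0.5 + n_gen_records / 10_000)


results = sweep([500, 1000, 2000], dummy_experiment)
best = max(results, key=results.get)
print(results)
print(f"Best value in this sweep: {best}")
```

The same scaffold can be reused for `n_iter` by passing a function that refits the plugin with the given number of iterations before generating data.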