# Case Study 4 - Augmentation

TODO: Whole dataset model performance too high...

## The Task
Augment a small dataset using the concept of domain adaptation (or transfer learning). For this we will be using a RadialGAN as discussed in [this paper](https://arxiv.org/pdf/1802.06403.pdf).

### Imports
Let's import the required standard-library and third-party packages and the relevant Synthcity modules. We can also set the logging level here.

In [1]:
# stdlib
import warnings
from pathlib import Path

# 3rd Party
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import auc, accuracy_score
import xgboost as xgb
import seaborn as sns
# from tqdm import tqdm

# synthcity absolute
import synthcity.logger as log
from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import GenericDataLoader
from synthcity.utils import serialization

warnings.filterwarnings("ignore")
# Set the level for the logging
# log.add(sink=sys.stderr, level="DEBUG")
log.remove()

# Set up paths to resources
AUG_RES_PATH = Path("../resources/augmentation/")



# The Scenario

Brazil is divided geopolitically into five macroregions: north, northeast, central-west, southeast, and south. For this case study, we will be acting as government officials in the Central-West Region of Brazil. Central-West Brazil is the smallest region in the country by population. It is also one of the larger and more rural regions by area. This means it has significantly fewer COVID-19 patient records than the more populous regions.

<img src="../data/Brazil_COVID/Brazil_Labelled_Map.png" alt="Brazil Region Map" width="400"/>

COVID-19 hit different regions at different times. Cases peaked later in the Central-West than in the more densely populated and better-connected regions. This leaves us with scarce COVID-19 patient data for our region, but with the potential lifeline of larger datasets from the other regions, which we can learn from in order to augment our own. We cannot simply train our model on the data from all regions, because there is significant covariate shift between the regions. We therefore expect to achieve a better classifier by training solely on Central-West data, even if some of it is synthetic.

### Load the data
Let's set it up as a classification task with a death-at-time-horizon column, as we did in a previous case study.

In [2]:
time_horizon = 14
X = pd.read_csv("../data/Brazil_COVID/covid_normalised_numericalised.csv")

X.loc[(X["Days_hospital_to_outcome"] <= time_horizon) & (X["is_dead"] == 1), f"is_dead_at_time_horizon={time_horizon}"] = 1
X.loc[(X["Days_hospital_to_outcome"] > time_horizon), f"is_dead_at_time_horizon={time_horizon}"] = 0
X.loc[(X["is_dead"] == 0), f"is_dead_at_time_horizon={time_horizon}"] = 0
X[f"is_dead_at_time_horizon={time_horizon}"] = X[f"is_dead_at_time_horizon={time_horizon}"].astype(int)

X.drop(columns=["is_dead", "Days_hospital_to_outcome"], inplace=True) # drop survival columns as they are not needed for a classification problem
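
The labelling rule above can also be written as a single boolean expression. The sketch below applies it to a small made-up frame (the values are illustrative, not taken from the dataset) and assumes neither column contains missing values. `add_horizon_label` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd


def add_horizon_label(df: pd.DataFrame, time_horizon: int) -> pd.DataFrame:
    """Return a copy of df with a binary death-within-horizon label.

    Hypothetical helper: a patient is labelled 1 only if they died and the
    outcome happened within `time_horizon` days of hospitalisation.
    """
    out = df.copy()
    died_in_time = (out["Days_hospital_to_outcome"] <= time_horizon) & (out["is_dead"] == 1)
    out[f"is_dead_at_time_horizon={time_horizon}"] = died_in_time.astype(int)
    return out


# Toy data (illustrative values only)
toy = pd.DataFrame(
    {
        "Days_hospital_to_outcome": [3, 20, 10, 14],
        "is_dead": [1, 1, 0, 1],
    }
)
labelled = add_horizon_label(toy, time_horizon=14)
print(labelled["is_dead_at_time_horizon=14"].tolist())
```

Only the first and last toy patients are labelled 1: the second died after the horizon and the third survived.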

Here we define a region mapper, which maps the region encoding to the real values. These can be found in `synthetic-data-lab/data/Brazil_COVID/Brazil_COVID_data.md`.

In [3]:
# Define the mappings from region index to region
region_mapper = {
    0: "Central-West",
    1: "North",
    2: "Northeast",
    3: "South",
    4: "Southeast",
}

As we are acting as officials from Central-West Brazil, split the data into data from our region and data from other regions.

In [23]:
our_region_index = 0

# Flatten region to simulate the scenario where we don't know where the data has come from, we just have our data and other data
other_region_index = 1 if our_region_index != 1 else 0
X.loc[X["Region"] != our_region_index, "Region"] = other_region_index
print(X["Region"].value_counts().rename({our_region_index: "Our region", other_region_index: "Another region"}))

X_our_region_only = X.loc[X["Region"] == our_region_index].copy()
X_other_regions = X.loc[X["Region"] != our_region_index].copy()

display(X_our_region_only)

Another region    6777
Our region         105
Name: Region, dtype: int64


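| Unnamed: 0 | Age | Sex | Ethnicity | Region | Fever | Cough | Sore_throat | Shortness_of_breath | Respiratory_discomfort | SPO2 | ... | Cardiovascular | Asthma | Diabetis | Pulmonary | Immunosuppresion | Obesity | Liver | Neurologic | Renal | is_dead_at_time_horizon=14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 37 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 33 | 62 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 42 | 56 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 44 | 25 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 45 | 27 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6818 | 58 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6819 | 63 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6820 | 30 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6821 | 38 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |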
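
The scenario assumes covariate shift between our region and the others. One rough way to look for it is to compare per-feature means between the two groups, scaled by a pooled standard deviation. The sketch below does this on randomly generated toy data. The function name `standardised_mean_difference` and the toy frames are assumptions for illustration, and the measure only compares marginal distributions, so it can miss shifts in how features relate to each other.

```python
import numpy as np
import pandas as pd


def standardised_mean_difference(ours: pd.DataFrame, others: pd.DataFrame) -> pd.Series:
    """Per-feature (mean_ours - mean_others) / pooled standard deviation."""
    pooled_std = np.sqrt((ours.var(ddof=1) + others.var(ddof=1)) / 2)
    smd = (ours.mean() - others.mean()) / pooled_std
    # Features that are constant and equal in both groups give 0/0; report them as 0
    smd = smd.fillna(0.0)
    # Largest absolute difference first
    return smd.reindex(smd.abs().sort_values(ascending=False).index)


# Toy data: "Age" is shifted between the groups, "Fever" is not
rng = np.random.default_rng(0)
toy_ours = pd.DataFrame({"Age": rng.normal(45, 10, 200), "Fever": rng.integers(0, 2, 200)})
toy_others = pd.DataFrame({"Age": rng.normal(60, 10, 500), "Fever": rng.integers(0, 2, 500)})

smd = standardised_mean_difference(toy_ours, toy_others)
print(smd)
```

In the notebook, the same function could be called on `X_our_region_only` and `X_other_regions` after dropping the `Region` column, which is constant within each group. Values near zero suggest similar marginal distributions for that feature.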


# The Problem

Let's see how a model trained just on our data from the Central-West region performs.

Set up the data splits using `train_test_split` from sklearn.

In [5]:
y = X_our_region_only["is_dead_at_time_horizon=14"]
X_in = X_our_region_only.drop(columns=["is_dead_at_time_horizon=14"])

X_train, X_test, y_train, y_test = train_test_split(X_in, y, random_state=4)
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

Load the XGBoost classifier that has already been trained on Central-West data only.

In [8]:
# Define the model
xgb_model = xgb.XGBClassifier(
    n_estimators=2000,
    learning_rate=0.01,
    max_depth=5,
    subsample=0.8, 
    colsample_bytree=1, 
    gamma=1, 
    objective="binary:logistic",
    random_state=42,
)

# Load the model trained on Central-West data only
saved_model_path = AUG_RES_PATH / f"augmentation_xgboost_real_{region_mapper[our_region_index]}_data.json"
xgb_model.load_model(saved_model_path)

# # The saved model was trained with the following code:
# xgb_model.fit(X_train, y_train)
# xgb_model.save_model(saved_model_path)

Now print the performance of the model trained on only Central-West data.

In [9]:
calculated_accuracy_score_train = accuracy_score(y_train, xgb_model.predict(X_train))
y_pred = xgb_model.predict(X_test)
calculated_accuracy_score_test = accuracy_score(y_test, y_pred)
print(f"Evaluating accuracy: train set: {calculated_accuracy_score_train} | test set: {calculated_accuracy_score_test}")

Evaluating accuracy: train set: 0.9102564102564102 | test set: 0.7407407407407407


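As you can see, the model is overfitting significantly: accuracy drops from about 0.91 on the training set to about 0.74 on the test set, which is what we would expect from such a small dataset.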
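
With only 105 Central-West records and the default 25% test split, the test set holds 27 records, so the test accuracy of 0.7407 corresponds to 20 correct predictions out of 27. A score measured on so few records is itself quite uncertain. As a rough illustration, the sketch below computes an approximate 95% Wilson score interval for that proportion. The helper `wilson_interval` is an assumption added for illustration, not part of the notebook.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# 0.7407... on the test set corresponds to 20 correct out of 27 records
low, high = wilson_interval(successes=20, n=27)
print(f"test accuracy 20/27 = {20 / 27:.3f}, interval = ({low:.2f}, {high:.2f})")
```

The interval is wide, so small differences between test scores on this split should be read with caution.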

### Now let's test our assertion that we can't just train on data from all regions and apply the model to our region

Set up the training and testing datasets for the model, making sure the training set comes from the all-region dataset but the test set comes from our region only.

In [10]:
y = X["is_dead_at_time_horizon=14"]
X_in = X.drop(columns=["is_dead_at_time_horizon=14"])

X_train, _, y_train, _ = train_test_split(X_in, y, random_state=4)
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)

y = X_our_region_only["is_dead_at_time_horizon=14"]
X_in = X_our_region_only.drop(columns=["is_dead_at_time_horizon=14"])

_, X_test, _, y_test = train_test_split(X_in, y, random_state=4)
X_test.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

Load the model trained on data from all regions.

In [12]:
# Define the model
xgb_model = xgb.XGBClassifier(
    n_estimators=2000,
    learning_rate=0.01,
    max_depth=5,
    subsample=0.8, 
    colsample_bytree=1, 
    gamma=1, 
    objective="binary:logistic",
    random_state=42,
)

# Load the model trained on the whole dataset
saved_model_path = AUG_RES_PATH / f"augmentation_xgboost_real_{region_mapper[our_region_index]}_data.json"
xgb_model.load_model(saved_model_path)

# # The saved model was trained with the following code:
# xgb_model.fit(X_train, y_train)
# xgb_model.save_model(saved_model_path)

Show the performance of the model.

In [13]:
calculated_accuracy_score_train = accuracy_score(y_train, xgb_model.predict(X_train))
y_pred = xgb_model.predict(X_test)
calculated_accuracy_score_test = accuracy_score(y_test, y_pred)
print(f"Evaluating accuracy: train set: {calculated_accuracy_score_train} | test set: {calculated_accuracy_score_test}")

Evaluating accuracy: train set: 0.8006200348769619 | test set: 0.9259259259259259


As you can see, test accuracy does improve (from about 0.74 to about 0.93), but this approach has its limits. There may well be cases where a greater covariate shift between regions impacts this accuracy to a much greater extent. It is also worth bearing in mind that there are contexts where the above approach is not even an option, such as when the features only partially overlap (or are missing) between datasets.
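
One caveat about the split used above: the all-region training split is drawn from `X`, which still contains every Central-West record, so some rows of the Central-West test set can also end up in the training set. That may flatter the test accuracy. A minimal sketch of one way to avoid the overlap is to hold out the Central-West test rows first and then remove them from the training pool by index. The toy frame below is made up, and `DataFrame.sample` is used in place of `train_test_split` purely for brevity.

```python
import pandas as pd

# Toy stand-in for X: 8 records from other regions (Region=1), 4 from ours (Region=0)
toy = pd.DataFrame(
    {
        "Region": [1] * 8 + [0] * 4,
        "Age": [34, 51, 62, 47, 70, 29, 58, 66, 45, 39, 73, 55],
        "label": [0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0],
    }
)
our_region_index = 0

# 1. Hold out test rows from our region only
ours = toy.loc[toy["Region"] == our_region_index]
test = ours.sample(frac=0.25, random_state=4)

# 2. Train on everything else, so no held-out row is seen during training
train = toy.drop(index=test.index)

overlap = train.index.intersection(test.index)
print(f"train rows: {len(train)}, test rows: {len(test)}, overlap: {len(overlap)}")
```

The key point is that the row indices must be compared before any `reset_index` call, because resetting discards the information needed to detect the overlap.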

# The Solution

Let's augment this dataset with the use of a RadialGAN.

First, let's load the superset of data from all regions into a `GenericDataLoader` object.

In [14]:
loader = GenericDataLoader(
    X, # X is the dataframe which is a superset of all region data
    target_column="is_dead_at_time_horizon=14", # The column containing the labels to predict
    sensitive_features=["Age", "Sex", "Ethnicity", "Region"], # The sensitive features in the dataset (Not needed here?)
    domain_column="Region", # This labels the domain that each record is from. Where it is `0` it is from our small dataset.
    random_state=42,
)


Let's use a RadialGAN to augment the data. We need to load the plugin and then fit it to the dataloader object.

In [15]:
outdir = Path("saved_models")
prefix = "Augmentation"
model="radialgan"
n_iter = 15
random_state = 42

print(model)

# Define saved model name
save_file = outdir / f"{prefix}.{model}_numericalised_{region_mapper[our_region_index]}_n_iter={n_iter}_rnd={random_state}.bkp"
# Load if available
if save_file.exists():
    syn_model = serialization.load_from_file(save_file)
# create and fit if not available
else:
    syn_model = Plugins().get(model, n_iter=n_iter, random_state=random_state)
    syn_model.fit(loader)
    outdir.mkdir(parents=True, exist_ok=True)  # make sure the output directory exists before saving
    serialization.save_to_file(save_file, syn_model)


radialgan


100%|██████████| 15/15 [00:07<00:00,  1.93it/s]


### The Solution
Let's train the model on an augmented dataset and see what our performance is now.

First, we use our synthetic model to generate some data for our region and use it to augment our original dataset.

In [16]:
n_gen_records = 1000

synth_data = syn_model.generate(n_gen_records, domains=[our_region_index], random_state=random_state)

# Now we can augment our original dataset with our new synthetic data
augmented_data = pd.concat([
    synth_data.dataframe(),
    X_our_region_only,
])

Now we need to test a model trained on the augmented dataset, so we need to set up our data splits again.

In [17]:
augmented_y = augmented_data["is_dead_at_time_horizon=14"]
augmented_X_in = augmented_data.drop(columns=["is_dead_at_time_horizon=14"])

X_train, X_test, y_train, y_test = train_test_split(augmented_X_in, augmented_y, random_state=4)
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)
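
Note that the split above draws the test set from `augmented_data`, so it contains synthetic records as well as real ones. If you want a score that reflects real Central-West patients only, one option is to hold out real records first and add the synthetic records to the training portion only. A minimal sketch on made-up data follows. The `is_synthetic` column and the toy frames are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins: 8 real records from our region and 20 synthetic ones
real = pd.DataFrame({"Age": range(40, 48), "label": [0, 1, 0, 0, 1, 0, 1, 0]})
synthetic = pd.DataFrame({"Age": range(50, 70), "label": [0, 1] * 10})

# Hold out a quarter of the REAL records for testing
real_test = real.sample(frac=0.25, random_state=4)
real_train = real.drop(index=real_test.index)

# Augment only the training portion, flagging where each row came from
augmented_train = pd.concat(
    [real_train.assign(is_synthetic=False), synthetic.assign(is_synthetic=True)],
    ignore_index=True,
)

n_synthetic = int(augmented_train["is_synthetic"].sum())
print(f"train rows: {len(augmented_train)} ({n_synthetic} synthetic)")
print(f"test rows:  {len(real_test)} (all real)")
```

The `is_synthetic` flag would need to be dropped before fitting a classifier. Strictly speaking, the generator should also be fitted without the held-out real rows, otherwise the synthetic data may still carry information about them.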

Load the trained model.

In [20]:
# Define the model
xgb_model = xgb.XGBClassifier(
    n_estimators=2000,
    learning_rate=0.01,
    max_depth=5,
    subsample=0.8, 
    colsample_bytree=1, 
    gamma=1, 
    objective="binary:logistic",
    random_state=42,
)

# Load the model trained on the augmented dataset
saved_model_path = AUG_RES_PATH / "augmentation_xgboost_augmented_data.json"
xgb_model.load_model(saved_model_path)

# # The saved model was trained with the following code:
# xgb_model.fit(X_train, y_train)
# xgb_model.save_model(saved_model_path)

Show the performance of the trained model.

In [21]:
y_pred = xgb_model.predict(X_test)
calculated_accuracy_score_train = accuracy_score(y_train, xgb_model.predict(X_train))
calculated_accuracy_score_test = accuracy_score(y_test, y_pred)
print(f"Evaluating accuracy: n_gen_records: {n_gen_records} train set: {calculated_accuracy_score_train} | test set: {calculated_accuracy_score_test}")

Evaluating accuracy: n_gen_records: 1000 train set: 0.9251207729468599 | test set: 0.9169675090252708


### Results

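Overfitting is significantly reduced: training and test accuracy are now about 0.93 and 0.92, and the test accuracy is much higher than the 0.74 achieved with the small dataset made up solely of real data from the Central-West region. Test accuracy is also on a par with the model trained on the superset of the real data (about 0.93), although the two figures are not directly comparable, because the test set here is drawn from the augmented data and so includes synthetic records.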

### Can you generate some more augmented datasets to answer the following questions?
 - How much synthetic data should you create for best results?
 - How much does changing the RadialGAN plugin parameter `n_iter` change the quality of the generated data?
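
One way to organise these experiments is a small grid sweep over `n_gen_records` and `n_iter`. The sketch below shows only the bookkeeping. `sweep` and `dummy_experiment` are hypothetical names, and the dummy function returns an arbitrary made-up number so that the example runs on its own.

```python
from itertools import product
from typing import Callable, Iterable


def sweep(
    run_experiment: Callable[[int, int], float],
    n_gen_values: Iterable[int],
    n_iter_values: Iterable[int],
) -> list[dict]:
    """Call run_experiment for every (n_gen_records, n_iter) pair and collect results."""
    results = []
    for n_gen_records, n_iter in product(n_gen_values, n_iter_values):
        score = run_experiment(n_gen_records, n_iter)
        results.append({"n_gen_records": n_gen_records, "n_iter": n_iter, "test_accuracy": score})
    # Best configuration first
    return sorted(results, key=lambda r: r["test_accuracy"], reverse=True)


def dummy_experiment(n_gen_records: int, n_iter: int) -> float:
    """Placeholder returning an arbitrary number; replace with the real pipeline."""
    return round(0.5 + 0.0001 * n_gen_records + 0.001 * n_iter, 4)


results = sweep(dummy_experiment, n_gen_values=[100, 1000], n_iter_values=[15, 100])
for row in results:
    print(row)
```

In the notebook, `run_experiment` would wrap the steps shown earlier: fit the plugin with `Plugins().get("radialgan", n_iter=n_iter, random_state=random_state)`, generate `n_gen_records` records for our region, train the classifier on the augmented data, and return the test accuracy.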