# Task 2: Training and Tuning with Ray

## Part 1: Training with Ray Train and XGBoost
In this task, you will train a machine learning model on the preprocessed data. The goal is to train an XGBoost model that predicts the user rating for a product.

In [None]:
import ray
import psutil
ray.shutdown()
NINE_GIB = 9 * 1024 * 1024 * 1024
ray.init(object_store_memory=int(NINE_GIB))
import pandas as pd
import os
import json
import random
import numpy as np
from pathlib import Path
seed = 41
random.seed(seed)
np.random.seed(seed)

from ray.train.xgboost import XGBoostTrainer
from ray.train import ScalingConfig, RunConfig

In [None]:
# load the preprocessed dataset as dense vectors in the parquet format
train_data_path = os.path.expanduser("ml_features_train.parquet")
train_data = ray.data.read_parquet(train_data_path)

### Training
Train a regression model (MSE objective) to predict `overall` using `XGBoostTrainer`.
- Objective: regression with squared error (`reg:squarederror`)
- Model params: `max_depth=3`, `eta=0.3`; leave all others at their defaults.
- Use Ray Train with a suitable `ScalingConfig`; you can set 2 CPUs per worker.
- After training, run inference on the test set.

Note: By default, Ray tries to store results in `~/ray_results`. This can cause permission errors on DataHub, so change the location to `~/private/ray_results` using [RunConfig](https://docs.ray.io/en/latest/train/api/doc/ray.train.RunConfig.html).
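
As a minimal sketch of the settings listed above, the model parameters can be collected in a plain dictionary and the results directory resolved before either is handed to the trainer. The names `xgb_params` and `storage_dir` are illustrative assumptions, not names the assignment requires; how they are passed to `XGBoostTrainer` and `RunConfig` is left to the exercise.

```python
import os

# XGBoost parameters from the task description; anything not listed stays at its default.
xgb_params = {
    "objective": "reg:squarederror",  # regression with squared-error loss
    "max_depth": 3,
    "eta": 0.3,
}

# Expand "~" to the current user's home directory (hypothetical variable name).
storage_dir = os.path.expanduser("~/private/ray_results")

print(xgb_params)
print(storage_dir)
```

Keeping these values in named variables makes it easier to reuse the same settings later when tuning.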

In [None]:
# trainer = 
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
result = trainer.fit()

In [None]:
print(result)

### Analyzing test data performance

Next, use the trained model to generate predictions on the test data. Calculate the root mean squared error (RMSE) of the test predictions and report it in the output.
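
For reference, RMSE is the square root of the mean of the per-row squared errors. The sketch below shows that computation on toy numbers in plain Python; the helper name `rmse_from_squared_errors` and the values are made up for illustration and are not part of the assignment.

```python
import math


def rmse_from_squared_errors(squared_errors):
    """Return the square root of the mean of per-row squared errors."""
    squared_errors = list(squared_errors)
    if not squared_errors:
        raise ValueError("need at least one squared error")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values only: predictions that are off by 1, 2 and 2 give squared errors 1, 4 and 4.
toy_squared_errors = [1.0, 4.0, 4.0]
print(rmse_from_squared_errors(toy_squared_errors))
```

Note that the mean is taken over all rows before the square root is applied; averaging per-batch RMSE values generally gives a different number.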

For this task, we will use [`map_batches`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html) to apply a stateful transformation to the test data.
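
"Stateful" here means the transformation is a callable class: expensive setup (such as loading a model) happens once in `__init__`, and `__call__` is then applied to one batch at a time. The sketch below imitates that pattern in plain Python without Ray; the class `AddOffset` and its data are hypothetical stand-ins, with the offset playing the role of the loaded model.

```python
class AddOffset:
    """Callable with state: setup runs once in __init__, __call__ runs per batch."""

    def __init__(self, offset):
        self.offset = offset  # stands in for an expensive-to-load model
        self.batches_seen = 0

    def __call__(self, batch):
        self.batches_seen += 1
        return {"value": [v + self.offset for v in batch["value"]]}


# Construct once, then reuse the same instance for every batch.
transform = AddOffset(offset=10)
batches = [{"value": [1, 2]}, {"value": [3]}]
outputs = [transform(b) for b in batches]
print(outputs, transform.batches_seen)
```

The point of the design is that the setup cost is paid once per instance rather than once per batch.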

In [None]:
test_data_path = os.path.expanduser("ml_features_test.parquet")
test_dataset = ray.data.read_parquet(test_data_path)

In [None]:
model = trainer.get_model(result.checkpoint)

In [None]:
import pandas as pd
from ray.train import Checkpoint
import xgboost
import math

class Predictor:

    def __init__(self, checkpoint: Checkpoint):
        self.model = XGBoostTrainer.get_model(checkpoint)
        self.label_col = "overall"

    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        """
        1. Get the predictions on a batch of data for an xgboost model.
        2. Return the squared errors for each entry using the label column
        """
        # Separate the feature columns from the label column
        batch_features = batch.drop(columns=[self.label_col])
        batch_labels = batch[self.label_col]
        dmatrix = xgboost.DMatrix(batch_features)
        predictions = self.model.predict(dmatrix)
        # Return a DataFrame so the output matches the annotated return type
        errors = (predictions - batch_labels) ** 2
        return pd.DataFrame({"se": errors})

def predict_xgboost(test_dataset, result):
    """
    Obtains the predictions for a test dataset given a `ray.train.Result` object and returns the squared errors for each entry
    Hint: ds.map_batches()
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return squared_errors

In [None]:
# get the root mean squared error for the test dataset using the result.
# Save the test rmse in `test_rmse` 

# YOUR CODE HERE
raise NotImplementedError()

# collect the results for the check below
res_2_1 = {"test_rmse": test_rmse, 
          "train_rmse": result.metrics["train-rmse"]}

In [None]:
expected_path = Path("expected_2_1.json").expanduser()
with open(expected_path) as expected_file:
    expected = json.load(expected_file)

assert math.isclose(float(expected["train_rmse"]), float(res_2_1["train_rmse"]), abs_tol=0.01), \
    f"train_rmse mismatch: expected {expected['train_rmse']} vs {res_2_1['train_rmse']} (±0.01)"

assert math.isclose(float(expected["test_rmse"]), float(res_2_1["test_rmse"]), abs_tol=0.01), \
    f"test_rmse mismatch: expected {expected['test_rmse']} vs {res_2_1['test_rmse']} (±0.01)"

print("✅ Task 2.1 output matches expected within ±0.01 for RMSE.")


## Part 2: Tuning with Ray Tune

Building on the `XGBoostTrainer` you just set up, we will now tune the following hyperparameters:

1. `max_depth`
2. `eta`

You can read more about each hyperparameter in the [official docs](https://xgboost.readthedocs.io/en/stable/parameter.html). Since the full search space is large and our compute budget is limited, we will run only 4 *trials* (one per `(max_depth, eta)` pair) with a grid search. Here are the values:

1. `max_depth`: $[3, 5]$ 
2. `eta`: $[0.3, 0.5]$
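
To make the trial count concrete, the sketch below enumerates the grid above in plain Python: two values per hyperparameter give 2 × 2 = 4 combinations. The names `search_space` and `trials` are illustrative assumptions; Ray Tune's `tune.grid_search` performs this kind of expansion for you.

```python
from itertools import product

# The grid from the task description.
search_space = {"max_depth": [3, 5], "eta": [0.3, 0.5]}

# Cartesian product of the value lists, one dict per trial.
names = list(search_space)
trials = [dict(zip(names, values)) for values in product(*search_space.values())]

for trial in trials:
    print(trial)
```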

Steps to implement, repeated from the problem statement:
1. Create new training and validation sets from the original training data using a random 75/25 split.
2. Train XGBoost models with 4 hyperparameter trials over the given grid using Ray Tune. [Official Example](https://docs.ray.io/en/latest/tune/examples/tune-xgboost.html)
3. Select the best model, i.e. the one with the lowest validation RMSE.
4. Report the test RMSE for the best model and the lowest validation RMSE.
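
The idea behind step 1 can be sketched on row indices in plain Python: shuffle with a fixed seed, then cut at 75%. The helper `split_indices` is hypothetical, and the seed of 41 simply mirrors the one set at the top of this notebook. Ray Data offers its own splitting methods (for example `Dataset.train_test_split`), so consult its documentation for the exact behavior before relying on this sketch.

```python
import random


def split_indices(n_rows, train_fraction=0.75, seed=41):
    """Shuffle row indices with a fixed seed and cut them into two disjoint parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


train_idx, valid_idx = split_indices(n_rows=8)
print(len(train_idx), len(valid_idx))
```

Fixing the seed keeps the split reproducible, which matters when results are compared against expected values.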

Make sure to use the same `ScalingConfig` as before. Restrict the number of concurrent trials to 1 for memory efficiency. Store the final `tune.ResultGrid` object in `result_grid` and the best result in the variable `best_result`.

In [None]:
from ray import tune

# store your answers in these
best_result = None
result_grid = None


# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print(best_result)

Now:
1. Get the root mean squared error for the test dataset using the best result from the hyperparameter tuning experiments.
2. Report the validation RMSE values for the best model as well as for the specific configurations given below.
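
The selection and lookup logic can be sketched on plain dictionaries, independent of Ray. Everything below is an assumption for illustration: the record layout, the helper names `best_record` and `find_record`, and especially the RMSE numbers, which are invented placeholders rather than expected results. With a real `ResultGrid`, the corresponding information lives in each result's `config` and `metrics`, and the exact keys depend on how the tuner was set up.

```python
# Placeholder records: the RMSE numbers are invented for illustration only.
trial_records = [
    {"config": {"max_depth": 3, "eta": 0.3}, "valid_rmse": 1.4},
    {"config": {"max_depth": 3, "eta": 0.5}, "valid_rmse": 1.3},
    {"config": {"max_depth": 5, "eta": 0.3}, "valid_rmse": 1.2},
    {"config": {"max_depth": 5, "eta": 0.5}, "valid_rmse": 1.1},
]


def best_record(records, metric="valid_rmse"):
    """Return the record with the lowest value of the given metric."""
    return min(records, key=lambda record: record[metric])


def find_record(records, **wanted):
    """Return the first record whose config matches all the given values."""
    for record in records:
        if all(record["config"].get(k) == v for k, v in wanted.items()):
            return record
    raise KeyError(f"no trial with config {wanted}")


print(best_record(trial_records)["config"])
print(find_record(trial_records, max_depth=5, eta=0.3)["valid_rmse"])
```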

In [None]:
def get_task_2_2_results(result_grid: tune.ResultGrid, best_result: ray.train.Result):
    res = {
        "test_rmse": None, # test rmse for the best model
        "valid_rmse": None, # validation rmse for the best model
        "valid_depth_5_eta_0.3": None, # validation rmse for max_depth=5, eta=0.3
        "valid_depth_3_eta_0.5": None, # validation rmse for max_depth=3, eta=0.5
    }

    # YOUR CODE HERE
    raise NotImplementedError()
    return res

In [None]:
res_2_2 = get_task_2_2_results(result_grid, best_result)

In [None]:
expected_path = Path("expected_2_2.json").expanduser()
with open(expected_path) as expected_file:
    expected = json.load(expected_file)

for key in res_2_2.keys():
    assert math.isclose(float(expected[key]), float(res_2_2[key]), abs_tol=0.01), \
        f"{key} mismatch: expected {expected[key]} vs {res_2_2[key]} (±0.01)"

print("✅ Task 2.2 output matches expected within ±0.01 for RMSE.")


In [None]:
# shutdown!
ray.shutdown()