In [None]:
%run "./Imports"

In [None]:
%run "./Model_Tuning"

In [None]:
%run "./General_Functions"

This notebook contains the Demand Forecasting Pipeline: the process that tunes (trains and validates) and back-tests
the models for each product (SKU) under each of the defined experiments.

For each product, the pipeline trains and tunes the models, yields a best model (the best set of hyperparameters) for
each experiment, and back-tests it. Every product therefore ends up with as many "best models" as there are
experiments. The final model is the one with the lowest validation WAPE.

As a result, the best model of each experiment is logged for every product to the MLflow tracking API, and each of
these models generates a forecast for the back-testing period. Only the forecast from the best model across all
experiments is then kept.
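
The selection rule described above can be sketched in a few lines. This is a minimal illustration rather than the pipeline's own code: `wape` assumes the common definition (sum of absolute errors divided by the sum of absolute actuals), which may differ from the implementation in `Model_Tuning`, and the `runs` records and their values are made up for the example.

```python
def wape(actuals, forecasts):
    """Weighted absolute percentage error: sum(|y - y_hat|) / sum(|y|)."""
    total_actual = sum(abs(y) for y in actuals)
    if total_actual == 0:
        raise ValueError("WAPE is undefined when all actuals are zero")
    total_error = sum(abs(y - f) for y, f in zip(actuals, forecasts))
    return total_error / total_actual


# Hypothetical per-experiment results for one SKU (values are illustrative only).
runs = [
    {"algorithm": "lightgbm", "val_wape": 0.18},
    {"algorithm": "prophet", "val_wape": 0.22},
    {"algorithm": "sarimax", "val_wape": 0.25},
]

# The final model is the experiment with the lowest validation WAPE.
best_run = min(runs, key=lambda run: run["val_wape"])
print(best_run["algorithm"])
print(wape([100, 50, 50], [90, 60, 50]))
```

Selecting on the validation WAPE rather than the test WAPE keeps the back-testing period untouched by the model choice.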

The functions included are:

| Function | Description |
| -------- | ----------- |
| `obtain_models` | for each product, obtains the best forecasting model per experiment |

###### Initializing variables

In [None]:
# Experiment variables
algorithms = ["lightgbm"]
holidays = False
num_evals = 2

# Dates for validation
start_val = "2019-01-01"
end_val = "2019-06-30"

# Dates for testing
start_test = "2019-07-01"
end_test = "2019-12-31"
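
As a rough picture of how these four dates carve up a series, the sketch below labels each date as train, validation, or test. It assumes that both windows are inclusive and that everything before `start_val` is training data; the actual `split_series` helper lives in `General_Functions` and may behave differently. `label_period` is a hypothetical name used only here.

```python
from datetime import date

# Same windows as the cell above, repeated so the sketch is self-contained.
start_val, end_val = date(2019, 1, 1), date(2019, 6, 30)
start_test, end_test = date(2019, 7, 1), date(2019, 12, 31)


def label_period(ds):
    """Return the split a date falls into (hypothetical helper, inclusive bounds assumed)."""
    if ds < start_val:
        return "train"
    if start_val <= ds <= end_val:
        return "validation"
    if start_test <= ds <= end_test:
        return "test"
    return "out_of_range"


for ds in (date(2018, 12, 31), date(2019, 3, 15), date(2019, 7, 1), date(2020, 1, 1)):
    print(ds.isoformat(), label_period(ds))
```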

###### Defining search space of each algorithm

In [None]:
# Defining search space for LightGBM
params_lightgbm = {
    "n_estimators":  hp.randint("n_estimators", 15, 200),
    "max_depth": hp.randint("max_depth", 3, 50),
    "learning_rate": hp.choice("learning_rate", [0.01, 0.05, 0.1, 0.2]),
    "reg_alpha": hp.choice("reg_alpha", [0.2, 1, 5, 10]),
    "reg_lambda": hp.choice("reg_lambda", [0.2, 1, 5, 10])
}
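
To make the shape of this search space concrete, the sketch below draws candidate configurations from an equivalent space using only the standard library. It is a stand-in for illustration: it uses plain random sampling rather than hyperopt, and `sample_lightgbm_params` is a hypothetical name. The integer ranges exclude the upper bound, matching `hp.randint(label, low, high)`.

```python
import random


def sample_lightgbm_params(rng):
    """Draw one configuration from a space shaped like params_lightgbm (stand-in, not hyperopt)."""
    return {
        "n_estimators": rng.randrange(15, 200),  # integers in [15, 200)
        "max_depth": rng.randrange(3, 50),       # integers in [3, 50)
        "learning_rate": rng.choice([0.01, 0.05, 0.1, 0.2]),
        "reg_alpha": rng.choice([0.2, 1, 5, 10]),
        "reg_lambda": rng.choice([0.2, 1, 5, 10]),
    }


rng = random.Random(0)  # seeded so repeated runs draw the same candidates
num_evals = 2           # same value as in the notebook
candidates = [sample_lightgbm_params(rng) for _ in range(num_evals)]
for params in candidates:
    print(params)
```

If `num_evals` is the number of configurations evaluated per product, a value of 2 keeps the run fast but explores very little of this space.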

###### Setting MLflow experiment

In [None]:
# Defining experiment path
mlflow_exp = r"/Users/n.garcia.aramouni@accenture.com/UDP_E2E_Forecasting/nerdearla_udp_shipments"

# Launching Mlflow client
client = MlflowClient()

# Creating experiment or re-using it if already exists
experiment = client.get_experiment_by_name(mlflow_exp)
if experiment is None:
    exp_id = mlflow.create_experiment(mlflow_exp)
else:
    exp_id = experiment.experiment_id

###### Defining modeling function

In [None]:
def obtain_models(data, df_frds):
    """
    For each product, obtains the best forecasting model per experiment; where an experiment is defined by an algorithm
    in the context of the workshop.

    Obtaining the best model of each experiment is done by performing hyperparameter tuning, which involves training and
    validating multiple sets of hyperparameters to then select the best performing set according to a specific metric on
    the validation set. Finally, all the best models from the experiments are used to generate the forecast for the
    back-testing set and their performances on that set are recorded.

    Results for each experiment, such as train WAPE, validation WAPE and test WAPE, are logged into MLflow.

    Parameters
    ----------
        data (pd.DataFrame): Dataset with the time series of the product.
        df_frds (pd.DataFrame): Dataset with holidays.

    Returns
    -------
        df_forecasts (pd.DataFrame): Table with the forecasts of the best models for the back-testing set.
    """
    # Ensuring order of observations
    data = data.sort_values(by="ds", ascending=True).reset_index(drop=True)
    
    # Obtaining product info
    sku = data["n_sku"][0]
    
    # Order columns
    regressors = [x for x in data.columns if x not in ["ds", "n_sku", "y"]]
    data = data[["ds"] + regressors + ["y"]]

    # Splitting the series
    df_trainval, df_test = split_series(data, start_test, end_test)

    # Defining the output object
    df_forecasts = pd.DataFrame()

    # Looping over the algorithms
    for algorithm in algorithms:
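        # Selecting the search space of the algorithm
        # (only params_lightgbm is defined in this notebook; the others must come from elsewhere)
        if algorithm == "sarimax":
            search_space = params_sarimax
        elif algorithm == "prophet":
            search_space = params_prophet
        elif algorithm == "lightgbm":
            search_space = params_lightgbm
        else:
            raise ValueError(f"Unsupported algorithm: {algorithm}")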

        # Tuning the model
        results = tune_ts_model(
            algorithm, search_space, num_evals, df_trainval, start_val, end_val, holidays=holidays, df_frds=None
        )

        # Re-fitting model and generating forecast
        df_fcst_alg, test_wape = refit_generate_forecast(
            algorithm, results["params"], df_trainval, df_test, holidays, df_frds
        )

        # Appending forecast to the output object (DataFrame.append was removed in pandas 2.0)
        df_forecasts = pd.concat([df_forecasts, df_fcst_alg])

        # Starting run and assigning tags; the context manager ends the run even if logging fails
        with mlflow.start_run(experiment_id=exp_id, run_name=str(sku)):
            mlflow.set_tags(
                {
                    "experiment": "Nerdearla 2021",
                    "product": sku,
                    "algorithm": algorithm,
                }
            )

            # Logging results in MLflow
            mlflow.log_metrics(
                {
                    "train_wape": results["train_wape"],
                    "val_wape": results["val_wape"],
                    "test_wape": test_wape,
                }
            )

    # Adding identification column
    df_forecasts["n_sku"] = sku

    return df_forecasts

##### Training pipeline main code

###### 1. Loading the preprocessed data from a CSV file

In [None]:
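# Read data (all columns are loaded as strings because no schema is inferred)
df = spark.read.csv("/FileStore/tables/raw_consumption_data_clean.csv", header=True)

# Casting the date and target columns to their proper types
df = (
    df.withColumn("ds", df["ds"].cast(DateType()))
    .withColumn("y", df["y"].cast(IntegerType()))
)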

In [None]:
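from pyspark.sql import Window
from pyspark.sql.functions import col, lag, lead

# One-step lead and lag of the target, computed per SKU in date order
sku_window = Window.partitionBy("n_sku").orderBy("ds")
df = df.withColumn("lead", lead(col("y"), 1).over(sku_window))
df = df.withColumn("lag", lag(col("y"), 1).over(sku_window))
# Nulls in numeric columns (e.g. each SKU's first lag and last lead) are replaced with 50
df = df.na.fill(50)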
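
For readers less familiar with Spark window functions, the following plain-Python sketch reproduces what the cell above computes for a single SKU: the previous and next value of `y` in date order, with the missing edge values replaced by 50. The input values are made up, and `add_lag_lead` is a hypothetical name used only for this illustration.

```python
def add_lag_lead(rows, fill_value=50):
    """Add one-step lag and lead of y to the date-ordered rows of a single SKU."""
    ordered = sorted(rows, key=lambda row: row["ds"])
    result = []
    for i, row in enumerate(ordered):
        # The first row has no previous value and the last row has no next value.
        lag = ordered[i - 1]["y"] if i > 0 else fill_value
        lead = ordered[i + 1]["y"] if i < len(ordered) - 1 else fill_value
        result.append({**row, "lead": lead, "lag": lag})
    return result


# Illustrative series for one SKU (ISO date strings sort chronologically).
series = [
    {"ds": "2019-01-03", "y": 30},
    {"ds": "2019-01-01", "y": 10},
    {"ds": "2019-01-02", "y": 20},
]

for row in add_lag_lead(series):
    print(row)
```

Note that `df.na.fill(50)` also replaces nulls in every other numeric column, including `y`, which this sketch does not model.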

###### 2. Performing modeling of SKUs

In [None]:
# Defining schema of the resulting dataframe:
result_schema = StructType(
    [
     StructField("algorithm", StringType(), False),
     StructField("ds", DateType(), False),
     StructField("fcst", FloatType(), False),
     StructField("n_sku", StringType(), False)
    ]
)

# Performing modeling of the SKUs
# (holidays are disabled, so no holidays DataFrame is passed as df_frds)
df_fcsts = df.groupBy("n_sku") \
    .applyInPandas(
        lambda pdf: obtain_models(pdf, None),
        result_schema
    ) \
    .persist(StorageLevel.MEMORY_ONLY)

# Adding identification key of experiments
df_fcsts = df_fcsts.withColumn("exp_key", concat(df_fcsts["n_sku"], lit("_"), df_fcsts["algorithm"]))
display(df_fcsts)