#### Notebook with all the general purpose functions

This notebook contains the general-purpose functions used throughout the solution. In this context,
"general purpose" means that these functions don't belong to a particular process such as preprocessing, modeling or
prediction; instead, they are auxiliary functions that accomplish specific tasks within the code while preserving
modularity and order.

The functions included are:

| Function | Description |
| -------- | ----------- |
| `split_series`  | splits a time series into train and test sets according to the given dates |
| `add_holidays`  | adds the holidays to the time series dataset as an additional binary column |
| `shift_time_series` | creates a set of lags and leads of a specific column (Spark DataFrames) |
| `create_orders_lags` | creates a set of lags of a time series column (pandas DataFrames) |

###### Definition of functions

In [0]:
import pandas as pd


def split_series(data, start_test, end_test):
    """
    Splits a time series into train and test sets according to the given date boundaries.

    Note that all records before the start of the test set are considered to be part of the train set. Furthermore, the
    end of the train set and start of the test set are assumed to be contiguous.

    Parameters
    __________
        data (pd.DataFrame): Dataset with the time series to split.
        start_test (str): Start of test set (included).
        end_test (str): End of test set (included).

    Returns
    ________
        train_df (pd.DataFrame): Train set DataFrame.
        test_df (pd.DataFrame): Test set DataFrame.
    """
    # Splitting dataset
    train_df = data[data['ds'] < pd.to_datetime(start_test)]
    test_df = data[(data['ds'] >= pd.to_datetime(start_test)) & (data['ds'] <= pd.to_datetime(end_test))]

    return train_df, test_df
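
As a quick check of the boundary behaviour described in the docstring, the sketch below runs the function on a small made-up series. The toy data and the chosen dates are assumptions for illustration only, and the function body is a condensed copy so that the snippet runs on its own.

```python
import pandas as pd


def split_series(data, start_test, end_test):
    # Condensed copy of the function above so the snippet runs on its own
    start, end = pd.to_datetime(start_test), pd.to_datetime(end_test)
    train_df = data[data["ds"] < start]
    test_df = data[(data["ds"] >= start) & (data["ds"] <= end)]
    return train_df, test_df


# Hypothetical toy series: ten consecutive days with a made-up target "y"
toy_df = pd.DataFrame({
    "ds": pd.date_range("2021-01-01", periods=10, freq="D"),
    "y": range(10),
})

train_df, test_df = split_series(toy_df, "2021-01-08", "2021-01-09")

print(len(train_df))  # 7 rows: 2021-01-01 through 2021-01-07
print(len(test_df))   # 2 rows: 2021-01-08 and 2021-01-09
```

Note that 2021-01-10 ends up in neither set: rows dated after `end_test` are dropped rather than assigned to train or test.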

In [0]:
def add_holidays(df_data, df_holidays):
    """
    Adds the holidays to the time series dataset as an additional binary column, where the value of this column is 1 for
    the dates where there is a holiday and 0 otherwise.

    Parameters
    __________
        df_data (pd.DataFrame): Dataset with the time series; must contain a "ds" date column.
        df_holidays (pd.DataFrame): Dataset with holidays; must contain a "ds" date column.

    Returns
    ________
        df_data (pd.DataFrame): Same input "df_data" dataset but modified after adding the binary column with the
            holidays.
    """
    # Flagging the dates that appear in the holidays dataset (1 = holiday, 0 = otherwise).
    # Unlike a merge-based mask, this does not depend on the index of "df_data" and is not
    # affected by duplicated dates in "df_holidays". Both "ds" columns must share the same dtype.
    df_data["holiday"] = df_data["ds"].isin(df_holidays["ds"]).astype(int)

    return df_data
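
The expected output of this function can be illustrated on toy data. The sketch below assumes a made-up orders series and a single made-up holiday; the function body is a stand-in with the same intended behaviour, repeated so that the snippet runs on its own.

```python
import pandas as pd


def add_holidays(df_data, df_holidays):
    # Stand-in with the same intended behaviour as the function above,
    # repeated so the snippet runs on its own
    df_data["holiday"] = df_data["ds"].isin(df_holidays["ds"]).astype(int)
    return df_data


# Hypothetical toy data: five days of orders and a single made-up holiday
orders_df = pd.DataFrame({
    "ds": pd.date_range("2021-12-23", periods=5, freq="D"),
    "y": [10, 12, 3, 8, 11],
})
holidays_df = pd.DataFrame({"ds": pd.to_datetime(["2021-12-25"])})

orders_df = add_holidays(orders_df, holidays_df)
print(orders_df["holiday"].tolist())  # [0, 0, 1, 0, 0]
```

Because the function writes the new column into the DataFrame it receives, the caller's original object is modified as well; pass a copy if the input must stay untouched.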

In [None]:
from pyspark.sql import Window
from pyspark.sql.functions import lag, lead


def shift_time_series(sdf_data, column, lag_lead=7, suf=""):
    """
    Creates a set of lags and leads of the column specified in "column". The number of lags and leads is the same and is
    defined by the "lag_lead" argument. Lags and leads are computed within each "n_sku" partition, ordered by "ds".

    Note that this function operates over Spark DataFrames only.

    Parameters
    __________
        sdf_data (pyspark.sql.DataFrame): Dataset with the column to shift; must contain "n_sku" and "ds" columns.
        column (str): Name of the column to shift.
        lag_lead (int, defaults to 7): Number of lags and leads to create from "column".
        suf (str, defaults to ""): Suffix to use for the new columns.

    Returns
    ________
        sdf_data (pyspark.sql.DataFrame): Same input dataset but modified after adding the columns of lags and leads.
    """
    # Defining granularity level of the window: one date-ordered series per SKU
    window = Window.partitionBy("n_sku").orderBy("ds")

    # Creating lags and leads
    for shift in range(1, lag_lead + 1):
        sdf_data = (
            sdf_data
            .withColumn(f"lag_{shift}{suf}", lag(column, shift).over(window))
            .withColumn(f"lead_{shift}{suf}", lead(column, shift).over(window))
        )

    return sdf_data
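
To sanity-check the values this function should produce without starting a Spark session, the same per-SKU lag and lead logic can be reproduced in pandas on a tiny dataset. This is only an analogue under stated assumptions: the toy data is made up, and `groupby(...).shift` stands in for the Spark window functions.

```python
import pandas as pd

# Hypothetical toy data: two SKUs with three daily observations each
toy_df = pd.DataFrame({
    "n_sku": ["A", "A", "A", "B", "B", "B"],
    "ds": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"] * 2),
    "y": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Same ordering and partitioning as the Spark window: by "ds" within each "n_sku"
toy_df = toy_df.sort_values(["n_sku", "ds"]).reset_index(drop=True)
grouped = toy_df.groupby("n_sku")["y"]

toy_df["lag_1"] = grouped.shift(1)    # previous value within the SKU
toy_df["lead_1"] = grouped.shift(-1)  # next value within the SKU

print(toy_df)
```

The first row of each SKU has no lag and the last row has no lead, so those cells are null; values never cross from one SKU into another.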

In [None]:
def create_orders_lags(df, column, lags=7):
    """
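    Creates a set of lags of a column that contains a time series. The number of lags is defined by the "lags" argument.
    The leading rows of each lag, which have no earlier observation, are backfilled with the first available value.

    Note that this function operates over pandas DataFrames only.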

    Parameters
    __________
        df (pd.DataFrame): DataFrame containing the column given in "column", which will be used to create the lags.
        column (str): Name of the column to shift.
        lags (int, defaults to 7): Number of lags to create.

    Returns
    ________
        df (pd.DataFrame): Same input dataset but modified after adding the lag columns "y_lag_1", ..., "y_lag_<lags>".
    """
    # Creating lags; .bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas versions
    for i in range(1, lags + 1):
        df[f"y_lag_{i}"] = df[column].shift(i).bfill()

    return df
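
The effect of the backfill is easiest to see on a short series. The sketch below uses a made-up four-row DataFrame; the function body is a condensed copy so that the snippet runs on its own.

```python
import pandas as pd


def create_orders_lags(df, column, lags=7):
    # Condensed copy of the function above so the snippet runs on its own
    for i in range(1, lags + 1):
        df[f"y_lag_{i}"] = df[column].shift(i).bfill()
    return df


# Hypothetical toy series with four observations
toy_df = pd.DataFrame({"y": [5.0, 6.0, 7.0, 8.0]})
toy_df = create_orders_lags(toy_df, "y", lags=2)

print(toy_df["y_lag_1"].tolist())  # [5.0, 5.0, 6.0, 7.0]
print(toy_df["y_lag_2"].tolist())  # [5.0, 5.0, 5.0, 6.0]
```

The leading values of each lag column (the first row of `y_lag_1`, the first two rows of `y_lag_2`) are backfilled rather than true lags, which is worth keeping in mind when those rows are used for training.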