# Distributed Training with Spark Demo

PySpark is often used with large datasets that don't fit in memory on a single machine. Distributed training refers to training a model over several workers across a cluster. Spark's MLlib handles distributing the machine learning training process, which generates one model for a huge dataset.

In this demo, we'll be doing something different. We're going to train one model per group, and then scale this training process with Spark. We'll be training multiple models across multiple workers and keeping track of the performance. Pandas and Sklearn code will be wrapped in a Pandas User Defined Function (UDF) and applied by PySpark.
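
To make the "one model per group" idea concrete before any Spark is involved, here is a minimal pure-Python sketch. The records, the group keys, and the `fit_line` helper are hypothetical stand-ins for the per-product timeseries and the Sklearn models used later in the notebook.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy records: (group key, feature, target)
records = [
    ("item_a", 1.0, 2.0), ("item_a", 2.0, 4.0), ("item_a", 3.0, 6.0),
    ("item_b", 1.0, 5.0), ("item_b", 2.0, 4.0), ("item_b", 3.0, 3.0),
]

# Collect the rows that belong to each group
grouped = defaultdict(list)
for key, x, y in records:
    grouped[key].append((x, y))

# Fit one independent model per group; each fit only needs its own rows,
# which is what makes this pattern easy to spread across workers
models = {}
for key, rows in grouped.items():
    xs, ys = zip(*rows)
    models[key] = fit_line(xs, ys)
```

Because no group depends on another, the loop body is the unit of work that Spark hands to each worker later on.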

Note that the purpose of the demo is not to create the best model, but to illustrate how to scale the training of multiple models in a distributed fashion. As such, we won't go very deep into the data.

* [Installing Prerequisites](#installing-prerequisites)
* [Exploring the Dataset](#exploring-dataset)
    - [Competition Setup](#competition-setup)
    - [Data Files](#data-files)
    - [Initial Exploration](#initial-exploration)
* [Compressing Timeseries](#compressing-timeseries)
* [Binary Blob](#binary-blob)
* [Spark Orchestration](#spark-orchestration)
* [Performance Evaluation](#evaluation)

<a id = "installing-prerequisites"></a>
## Installing Prerequisites (PySpark and Java 8)

Even though the [documentation](https://spark.apache.org/docs/3.0.0/#downloading) says PySpark 3+ works with Java 11, I was running into errors with Pandas UDFs and PyArrow types, so I decided to install Java 8 instead. In my experience, PySpark has been more stable with Java 8.

In [None]:
# Install java 8
! apt remove -y openjdk-11-jre-headless
! apt install -y openjdk-8-jdk openjdk-8-jre

# Check version
! java -version

In [None]:
# Install pyspark
!pip install pyspark==3.0.1

<a id = "exploring-dataset"></a>
## Exploring the Dataset

In [None]:
# Imports
import pandas as pd
import numpy as np
import pickle
import seaborn as sns
import matplotlib.pyplot as plt

from datetime import date
from typing import List, Any, Dict

%matplotlib inline

<a id = "competition-setup"></a>
### Competition Setup

This notebook will use data from the [M5 Forecasting](https://www.kaggle.com/c/m5-forecasting-accuracy) competition, which asks participants to predict sales of Walmart products over a 28-day period, given the historical sales data. The plot below is taken from this [notebook](https://www.kaggle.com/headsortails/back-to-predict-the-future-interactive-m5-eda). 

The plot shows the competition setup. Orange is the training period; yellow and blue show the validation and evaluation periods, respectively. 

For this demo, we'll only be concerned with the training and validation periods (the validation period will be referred to as the test set).

![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1014468%2F5855ba35843d22a319e3682e5bb2e9de%2FScreenshot%202020-05-29%20at%2020.23.16.png?generation=1590866269400767&alt=media)

<a id = "data-files"></a>
### Data Files

For this competition, `sales_train_validation.csv` was given initially. One month before the competition deadline, 
`sales_train_evaluation.csv` was released with labels for the final 28 days.

For this demo we'll just concern ourselves with 3 files:


- `calendar.csv` - Contains information about the dates on which the products are sold.
- `sales_train_validation.csv` - Contains the historical daily unit sales data per product and store [d_1 - d_1913]
- `sell_prices.csv` - Contains information about the price of the products sold per store and date.

<a id = "initial-exploration"></a>
### Initial Exploration

**Calendar**

First, we'll take a look at the calendar data, the smallest of the three files. It contains the dates, the week identifier `wm_yr_wk`, and the events that happened, along with a binary variable indicating whether SNAP purchases were allowed on that [date](https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/133614).

In [None]:
# Loading in calendar data
calendar = pd.read_csv('../input/m5-forecasting-accuracy/calendar.csv')
calendar['date'] = pd.to_datetime(calendar['date'])
calendar.head()

In [None]:
START_DATE = calendar['date'].min()
END_DATE = calendar['date'].max()
print("Calendar length = " + str(calendar.shape[0]) + " days")
print("Ending date: " + str(END_DATE))

Note that the length of the calendar is 1969 days. Days 1 - 1913 are the training set. Days 1914 - 1941 are the validation set, and days 1942 - 1969 are the evaluation set.
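
The day ranges above can be turned into a small lookup. This is a sketch only: `FIRST_DAY`, `day_to_date`, and `period_of` are hypothetical names, and the start date of 2011-01-29 is an assumption about the first row of `calendar.csv` (in the notebook itself, `START_DATE` plays this role).

```python
from datetime import date, timedelta

# Assumption: d_1 corresponds to the first calendar date, taken here as 2011-01-29
FIRST_DAY = date(2011, 1, 29)


def day_to_date(d: int) -> date:
    """Map a 1-based day number (d_1, d_2, ...) to its calendar date."""
    return FIRST_DAY + timedelta(days=d - 1)


def period_of(d: int) -> str:
    """Return which competition period a 1-based day number falls in."""
    if 1 <= d <= 1913:
        return "train"
    if 1914 <= d <= 1941:
        return "validation"
    if 1942 <= d <= 1969:
        return "evaluation"
    raise ValueError(f"day {d} is outside the calendar")


last_train_date = day_to_date(1913)
last_calendar_date = day_to_date(1969)
```

Both the validation and the evaluation periods span 28 days, which is the forecast horizon of the competition.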

**Prices**



In [None]:
# Loading in the prices
prices = pd.read_csv('../input/m5-forecasting-accuracy/sell_prices.csv')
prices.head()

It looks like the price is defined per week, so we need to join it to the calendar data to get the appropriate daily time series for each product. I am keeping `wday` to include it as another feature.

In [None]:
# Join prices to calendar
prices = calendar[['wm_yr_wk', 'date', 'wday']].merge(prices, on = 'wm_yr_wk', how='inner')
prices.head()

In [None]:
# This is a check to make sure we know the sell_prices beforehand
# to determine if we can use it as a feature
prices['date'].max()

In [None]:
# Plotting one timeseries
temp = prices.loc[(prices['item_id'] == 'HOBBIES_1_012') & (prices['store_id'] == "CA_1")]
sns.lineplot(x=temp['date'], y=temp['sell_price'])  # keyword arguments work across seaborn versions
plt.title('Sell price for HOBBIES_1_012 over time')

<a id = "compressing-timeseries"></a>
## Compressing timeseries data into a list

The first concept in this demo is that we can compress our data by storing each timeseries as a list and keeping track of only its start date. This assumes the data is continuous, with no missing days.
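
As a rough illustration of this idea outside of Pandas, the sketch below compresses dated observations into a start date plus a list and expands them back. The `compress` and `expand` helpers are hypothetical and not part of the notebook; the continuity assumption is checked explicitly.

```python
from datetime import date, timedelta


def compress(observations):
    """Turn [(date, value), ...] into (start_date, [values]).

    Raises ValueError if the dates are not consecutive days, because the
    list representation can only be expanded back for continuous data.
    """
    observations = sorted(observations)
    start = observations[0][0]
    for offset, (day, _) in enumerate(observations):
        if day != start + timedelta(days=offset):
            raise ValueError(f"series is not continuous at {day}")
    return start, [value for _, value in observations]


def expand(start, values):
    """Rebuild the dated observations from a start date and a list."""
    return [(start + timedelta(days=i), v) for i, v in enumerate(values)]


# Hypothetical daily prices for one product
daily = [(date(2016, 1, 1), 8.26), (date(2016, 1, 2), 8.26), (date(2016, 1, 3), 7.98)]
start, values = compress(daily)
```

Storing one date per series instead of one per observation is where the saving comes from.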

In [None]:
%%time
prices = prices.groupby(['store_id', 'item_id']).agg({'date': 'min', 'sell_price': list,
                                                      'wday': list}).reset_index()\
           .rename(columns = {'date':'sell_price_start_date'})
prices.head()

In [None]:
sales = pd.read_csv('../input/m5-forecasting-accuracy/sales_train_validation.csv')
sales.head()

Note the format of the raw data: each day is a column, so the timeseries for an item runs from left to right.

Similar to the transformation we did for the price data, we can also convert this to a list to save memory.

In [None]:
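# The first six columns are identifiers; the remaining ones are the daily sales
cols = sales.columns[6:]
sales['sales'] = sales[cols].values.tolist()
# Copy so the assignment below does not write to a slice of the original frame
sales = sales[['item_id', 'store_id', 'sales']].copy()
sales['sales_start_date'] = START_DATE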

In [None]:
# Plotting one timeseries for sales
temp = sales.loc[(sales['item_id'] == 'HOBBIES_1_012') & (sales['store_id'] == "CA_1")]
dr = pd.date_range(START_DATE,periods=len(temp.iloc[0]["sales"]), freq="d")
temp = pd.DataFrame({'date': dr, 'sales': temp.iloc[0]['sales']})
sns.lineplot(x=temp['date'], y=temp['sales'])  # keyword arguments work across seaborn versions
plt.title('Sales for HOBBIES_1_012 over time')

In [None]:
data = sales.merge(prices, on = ['item_id', 'store_id'], how = 'inner')
print(data.shape)
# Keep head() as the last expression so the notebook displays it
data.head()

### Sampling Due to Memory Constraints

We sample 10% of the products because of the memory limitations of Kaggle notebooks.

In [None]:
# drop=True avoids keeping the old index as an extra column
data = data.sample(frac = 0.1).reset_index(drop=True)
data.head()

<a id = "binary-blob"></a>
## Using Binary Blobs to Pass Data

Here we have a code snippet that combines the timeseries of one product into a single DataFrame. In the cell after it, one such DataFrame is built per row and pickled into a binary blob that will later be passed to the workers.
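
The pickling step itself is just a round trip through `bytes`. The sketch below uses a plain dictionary as a hypothetical stand-in for the per-product DataFrame; the field names and values are made up for illustration.

```python
import pickle
from datetime import date

# Hypothetical stand-in for one product's combined timeseries
series = {
    "start_date": date(2016, 1, 1),
    "quantity": [0, 2, 1],
    "price": [8.26, 8.26, 7.98],
}

# Serialize to a bytes object that fits in a single binary column
blob = pickle.dumps(series)

# A worker would reverse the step to get the original object back.
# Only unpickle data you trust: loading a pickle can execute arbitrary code.
restored = pickle.loads(blob)
```

Since the blob is an opaque byte string, it can be stored in one cell of a DataFrame regardless of how long the underlying timeseries is.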

In [None]:
row = data.iloc[0]
dr1 = pd.date_range(row["sales_start_date"],periods=len(row["sales"]), freq="d")
df = pd.DataFrame({"sales":row["sales"]},index = dr1)
dr2 = pd.date_range(row["sell_price_start_date"],periods=len(row["sell_price"]), freq="d")
df["price"] = pd.Series(row["sell_price"],index = dr2)
df['wday'] = pd.Series(row["wday"], index = dr2)
df.dropna(inplace = True)
df.head(10)

In [None]:
%%time
for index ,row in data.iterrows():
    dr1 = pd.date_range(row["sales_start_date"],periods=len(row["sales"]), freq="d")
    df = pd.DataFrame({"quantity":row["sales"]},index = dr1)
    dr2 = pd.date_range(row["sell_price_start_date"],periods=len(row["sell_price"]), freq="d")
    df["price"] = pd.Series(row["sell_price"],index = dr2)
    df['wday'] = pd.Series(row["wday"], index = dr2)
    df=df.dropna()
    data.loc[index, "start_date"] = df.index[0].date()
    data.loc[index, "timeseries"] = pickle.dumps(df)
data.head()

In [None]:
def train_test_split(df: pd.DataFrame, train_end = pd.to_datetime("2015-06-01"), target='quantity', test_period = 28):
    # Skip series that end fewer than test_period days after train_end
    if (df.index[-1]-train_end).days<test_period:
        return None
    # Hold out the last test_period days as the test set
    n_train = df.shape[0] - test_period
    y = df[target]
    x = df.drop(target, axis = 1)
    X_train = x.iloc[:n_train]
    X_test = x.iloc[n_train:]
    y_train = y.iloc[:n_train]
    y_test = y.iloc[n_train:]
    return X_train, X_test, y_train, y_test
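
The split above is positional rather than random: the most recent days are held out so the model is always evaluated on data that comes after its training data. A minimal sketch of the same idea on a plain list, with a hypothetical `holdout_split` helper:

```python
def holdout_split(values, test_period=28):
    """Split a time-ordered list into (train, test), holding out the tail.

    Returns None when the series is too short to leave any training data.
    """
    if len(values) <= test_period:
        return None
    n_train = len(values) - test_period
    return values[:n_train], values[n_train:]


# Hypothetical series of 100 daily observations
series = list(range(100))
train, test = holdout_split(series)
```

A random shuffle would leak future observations into the training set, which is why it is avoided for timeseries.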


In [None]:
def generate_train_set(df:pd.DataFrame):
    for i, row in df.iterrows():
        result = train_test_split(pickle.loads(row["timeseries"]))
        if result is None:
            continue
        X_train, X_test, y_train, y_test = result
        df.loc[i, "X_train"] = pickle.dumps(X_train)
        df.loc[i, "X_test"] = pickle.dumps(X_test)
        df.loc[i, "y_train"] = pickle.dumps(y_train)
        df.loc[i, "y_test"] = pickle.dumps(y_test)
    return df

In [None]:
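data = generate_train_set(data)
data = data[['item_id', 'store_id', 'X_train', 'X_test', 'y_train', 'y_test']]
# Drop products skipped by train_test_split; the non-nullable Spark schema below rejects missing values
data = data.dropna().reset_index(drop=True)
data.head()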

<a id = "spark-orchestration"></a>
## Spark as Orchestration DataFrame

Here we create a Spark DataFrame from the Pandas DataFrame with the pickled rows. This Spark DataFrame will serve as the orchestration piece to run Sklearn models on each row.

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, FloatType, BinaryType
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
in_schema = StructType([
    StructField("item_id", StringType(), False),
    StructField("store_id", StringType(), False),
    StructField("X_train", BinaryType(), False),
    StructField("X_test", BinaryType(), False),
    StructField("y_train", BinaryType(), False),
    StructField("y_test", BinaryType(), False)
])
data_spark = spark.createDataFrame(data, in_schema)
data_spark.show(10)

<a id = "evaluation"></a>
## Setting up models for evaluation

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import r2_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM

def eval_lr(x_train, y_train, x_test):
    model = LinearRegression(fit_intercept=True)
    model.fit(x_train,y_train)
    return model, model.predict(x_test)

def eval_svr(x_train, y_train, x_test):
    model = SVR(C=1.0, epsilon=0.2)
    model.fit(x_train,y_train)
    return model, model.predict(x_test)

def eval_nn(x_train,y_train,x_test):
    model = Sequential()
    model.add(Dense(32, input_dim=x_train.shape[1], activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(1, activation='linear'))
    model.compile(
        loss="mae",
        optimizer="adam",
        metrics=["mean_absolute_error"],
    )
    model.fit(x_train,y_train, epochs=4, batch_size=16)
    # The Keras model is not pickled in this demo, so return a placeholder in its place
    return 1, model.predict(x_test)

In [None]:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType

out_schema = StructType([
    StructField("item_id", StringType(), False),
    StructField("store_id", StringType(), False),
    StructField("score", FloatType(), False),
    StructField("model_name", StringType(), False),
    StructField("model", BinaryType(), False)
])

# Grouped-map UDF: input and output are both a pandas.DataFrame
@pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)
def evaluate_model(df):
    # Each group is expected to hold one row: a single (item_id, store_id) pair
    row = df.iloc[0].to_dict()
    x_train, y_train = pickle.loads(row["X_train"]), pickle.loads(row["y_train"])
    x_test, y_test = pickle.loads(row["X_test"]), pickle.loads(row["y_test"])
    results = []
    for eval_func in [eval_lr, eval_svr, eval_nn]:
        model, pred = eval_func(x_train, y_train, x_test)
        score = r2_score(y_test, pred)
        results.append({"item_id": row["item_id"], "store_id": row["store_id"],
                        "score": score, "model_name": eval_func.__name__,
                        "model": pickle.dumps(model)})

    # Build the frame once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(results)

In [None]:
%%time
evaluation = data_spark.groupBy(["item_id", "store_id"]).apply(evaluate_model).toPandas()
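
Conceptually, the grouped apply above splits the rows by key, runs the function on each group independently, and concatenates whatever each call returns. The sketch below mimics that on a single machine with plain Python; `grouped_map` and `evaluate` are hypothetical helpers, not Spark APIs, and the rows are made up.

```python
from itertools import groupby
from operator import itemgetter


def grouped_map(rows, key_fields, func):
    """Apply func to each group of rows sharing the same key and
    concatenate the rows that every call returns."""
    key = itemgetter(*key_fields)
    out = []
    for _, group in groupby(sorted(rows, key=key), key=key):
        out.extend(func(list(group)))
    return out


def evaluate(group):
    """Hypothetical stand-in for the UDF: emit one result row per model."""
    first = group[0]
    return [
        {"item_id": first["item_id"], "store_id": first["store_id"], "model_name": name}
        for name in ("eval_lr", "eval_svr")
    ]


rows = [
    {"item_id": "A", "store_id": "CA_1"},
    {"item_id": "B", "store_id": "CA_1"},
    {"item_id": "A", "store_id": "TX_1"},
]
results = grouped_map(rows, ["item_id", "store_id"], evaluate)
```

In Spark the groups are processed on different workers instead of in a loop, but the shape of the result is the same: one output row per group and model.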

## Comparing Model Performance

In [None]:
# Reading in data to join back categorical
sales = pd.read_csv('../input/m5-forecasting-accuracy/sales_train_validation.csv')
evaluation = evaluation.merge(sales[['cat_id', 'state_id', 'item_id', 'store_id']], on = ['item_id', 'store_id'], how = 'left')
evaluation.head()

In [None]:
# Average score for each model, keeping only the models with a positive score
temp = evaluation.loc[evaluation['score'] > 0]
sns.barplot(data=temp, x='state_id', y='score', hue='model_name')
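
The filter on positive scores matters because R² is not bounded below by zero: a model that does worse than always predicting the mean of the test targets gets a negative score. A small hand-rolled sketch of the formula (the `r2` helper is hypothetical and skips the edge cases that Sklearn's `r2_score` handles, such as constant targets):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Raises ZeroDivisionError when y_true is constant, a case that
    sklearn.metrics.r2_score treats specially.
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2(y_true, [1.0, 2.0, 3.0, 4.0])   # predictions match exactly
baseline = r2(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
worse = r2(y_true, [4.0, 3.0, 2.0, 1.0])     # predictions in reverse order
```

Averaging a few very negative scores together with the rest would dominate the bar plot, so they are left out of it.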

Next, we keep only the best-scoring model for each product.

In [None]:
evaluation = evaluation.sort_values('score', ascending = False)
evaluation = evaluation.groupby(['item_id', 'store_id']).first()
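
Sorting by score and taking the first row of each group amounts to a per-group maximum. A plain-Python sketch of the same selection, with hypothetical rows and a hypothetical `best_per_group` helper:

```python
def best_per_group(rows):
    """Keep the highest-scoring row for each (item_id, store_id) pair."""
    best = {}
    for row in rows:
        key = (row["item_id"], row["store_id"])
        if key not in best or row["score"] > best[key]["score"]:
            best[key] = row
    return best


# Hypothetical evaluation results for two products
rows = [
    {"item_id": "A", "store_id": "CA_1", "model_name": "eval_lr", "score": 0.10},
    {"item_id": "A", "store_id": "CA_1", "model_name": "eval_svr", "score": 0.25},
    {"item_id": "B", "store_id": "CA_1", "model_name": "eval_lr", "score": 0.40},
    {"item_id": "B", "store_id": "CA_1", "model_name": "eval_svr", "score": -0.20},
]
winners = best_per_group(rows)
```

Each product ends up with exactly one row, so counting `model_name` afterwards counts wins per model.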

This plot shows how many times each model performed the best for a given product.

In [None]:
sns.countplot(x="cat_id", hue="model_name", data=evaluation)
plt.title('Best model')

In [None]:
sns.countplot(x="state_id", hue="model_name", data=evaluation)
plt.title('Best model')