# Model training and registration
This notebook shows how to train a model, convert it to ONNX, and upload the ONNX model to Azure Storage.

## Explore the training data
The following cells load the source CSV file into a Spark DataFrame and create a temporary view that can be used to query the data with Spark SQL.

WWI has provided a small CSV file that you can use to demonstrate training a simple model.

They have already loaded it into the data lake for you.
It is located in the `wwi-02` container at the path `/sale-csv/wwi-factsale.csv`.
Build the correct path to the file, then run the cells that follow to load and query the data.


In [None]:
df = spark.read.load(
    'abfss://<REPLACE-WITH-YOUR-PATH>', format="csv", header=True, sep="|"
)
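
ADLS Gen2 paths read through the `abfss` driver follow the pattern `abfss://<container>@<account>.dfs.core.windows.net/<path>`. The sketch below assembles such a URI from its parts. The helper name `build_abfss_uri` and the account name `myaccount` are hypothetical placeholders, so substitute the name of your own data lake account.

```python
def build_abfss_uri(container: str, account: str, path: str) -> str:
    """Assemble a URI of the form abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    # Strip any leading slash so the URI does not contain a double slash.
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# 'myaccount' is a hypothetical storage account name; replace it with your own.
csv_uri = build_abfss_uri("wwi-02", "myaccount", "/sale-csv/wwi-factsale.csv")
print(csv_uri)
```

The resulting string is what you would pass as the first argument to `spark.read.load`.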

Next, WWI would like you to show them how to create a temporary view over the loaded DataFrame.

The view should be named `facts`.

Complete the code in the cell and run it.


In [None]:
df.#<- can you complete this?

In the next cell, WWI would like you to explore the data with an initial query.

Preview all sales that have a `Customer Key` of `11`, ordered by `Stock Item Key`.


In [None]:
display(spark.sql("<INSERT YOUR SQL QUERY HERE>"))
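
As a sketch of one possible query, assuming the `facts` view from the previous step exists: because the CSV was loaded without schema inference, every column is a string, so this example casts `Stock Item Key` to an integer to get numeric rather than lexicographic ordering. The variable name `customer_sales_query` is illustrative only.

```python
# Column names that contain spaces must be wrapped in backticks in Spark SQL.
customer_sales_query = (
    "SELECT * FROM facts "
    "WHERE `Customer Key` = 11 "
    "ORDER BY CAST(`Stock Item Key` AS INT)"
)
print(customer_sales_query)

# In the notebook, pass the string to Spark SQL:
# display(spark.sql(customer_sales_query))
```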

## Predict Quantity given Customer Key and Stock Item Key
In the following cells, we load a subset of the data that contains only the fields needed for training.

WWI's data scientists have already provided some of the code for you.

Read through and run the following cells.



In [None]:
from pyspark.sql.functions import col
df3 = spark.sql("SELECT double(`Customer Key`) as customerkey, double(`Stock Item Key`) as stockitemkey, double(`Quantity`) as quantity FROM facts").where(col("quantity").isNotNull())
df3.cache()

Next, we package the data into the format expected by Spark ML's LinearRegression. It requires a DataFrame with two columns: `features`, which holds the input vector, and a label column with the values to predict (`quantity` in this case).


In [None]:
from pyspark.ml.feature import VectorAssembler

vectorAssembler = VectorAssembler(inputCols = ['customerkey', 'stockitemkey'], outputCol = 'features')
df4 = vectorAssembler.transform(df3)
df5 = df4.select(['features', 'quantity'])
df5.show(10)

Now, we split our DataFrame into training and testing DataFrames. Holding out a test set is a best practice because it lets you evaluate the model on data it did not see during training.

WWI would like you to complete the final line that produces the train and test dataframes. 

Once you have completed the cell, run it.


In [None]:
trainingFraction = 0.7
testingFraction = (1-trainingFraction)
seed = 42

# Split the dataframe into test and training dataframes
df_train, df_test = # use df5 to create the two dataframes
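
Spark DataFrames expose `randomSplit(weights, seed)`, which returns one DataFrame per weight. A minimal sketch of the split follows, wrapped in a hypothetical helper (`split_train_test` is not part of the notebook) so that the fraction handling is explicit.

```python
def split_train_test(df, training_fraction=0.7, seed=42):
    """Split a Spark DataFrame into (train, test) using DataFrame.randomSplit."""
    if not 0.0 < training_fraction < 1.0:
        raise ValueError("training_fraction must be strictly between 0 and 1")
    weights = [training_fraction, 1.0 - training_fraction]
    # A fixed seed keeps the split repeatable when the input data and
    # its partitioning are unchanged.
    train, test = df.randomSplit(weights, seed=seed)
    return train, test


# In the notebook, assuming df5, trainingFraction and seed are defined:
# df_train, df_test = split_train_test(df5, trainingFraction, seed)
```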

In the following cell, you will train your LinearRegression model.

The goal of this regressor is to predict the `quantity` field given all of the features. 

Complete the missing parameters and the last line to train the model.


In [None]:
from pyspark.ml.regression import LinearRegression

lin_reg = LinearRegression(featuresCol = '<REPLACE WITH YOUR ANSWER>', labelCol='<REPLACE WITH YOUR ANSWER>', maxIter = 10, regParam=0.3)
lin_reg_model = # complete this line, using df_train to train the linear regression model 

Now that you have a trained model in hand, WWI wants to verify you can use it to make predictions against the test DataFrame.

Complete the first line to use your trained model to make predictions against the `df_test` dataframe.


In [None]:
df_pred = #<-complete this to use your model to make predictions against df_test 
display(df_pred)
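
In Spark ML, an estimator's `fit` method returns a fitted model, and that model's `transform` method appends a `prediction` column to the DataFrame it is given. The following is a hedged sketch of the two steps together; `train_and_predict` is a hypothetical helper, not something the notebook defines.

```python
def train_and_predict(estimator, train_df, test_df):
    """Fit a Spark ML estimator on train_df and score test_df with the fitted model."""
    # For a LinearRegression estimator, fit() returns a LinearRegressionModel.
    model = estimator.fit(train_df)
    # transform() returns a new DataFrame with a 'prediction' column by default.
    predictions = model.transform(test_df)
    return model, predictions


# In the notebook, assuming lin_reg, df_train and df_test are defined:
# lin_reg_model, df_pred = train_and_predict(lin_reg, df_train, df_test)
```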

## Convert model to ONNX
In the cells that follow, WWI wants you to show how to convert the model to ONNX and display how ONNX represents the Spark ML model.

They have already provided the code; you just need to run the cells.


In [None]:
from onnxmltools import convert_sparkml
from onnxmltools.convert.common.data_types import FloatTensorType

initial_types = [ 
    ("features", FloatTensorType([1, lin_reg_model.numFeatures])),
    # (repeat for the required inputs)
]

In [None]:
# The second argument is only the name given to the ONNX graph; it does not select the converter.
model_onnx = convert_sparkml(lin_reg_model, 'sparkml GeneralizedLinearRegression', initial_types)
model_onnx

## Upload the model to Azure Storage

For the T-SQL `PREDICT` function to use an ONNX model, the model must first be uploaded to Azure Storage.

WWI wants you to show them how to serialize the model to disk and then upload the model file to Azure Storage.

Run the following cell to temporarily save the ONNX model to the local storage of the Spark driver node.

In [None]:
with open("model.onnx", "wb") as f:
    f.write(model_onnx.SerializeToString())
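
Before uploading, it can be worth confirming that the file on the driver holds exactly the serialized bytes. The sketch below uses only the standard library. `write_and_verify` is a hypothetical helper, and the byte string in the usage lines is a stand-in for `model_onnx.SerializeToString()`.

```python
import os
import tempfile


def write_and_verify(payload: bytes, path: str) -> int:
    """Write payload to path, read it back, and return the file size in bytes."""
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        if f.read() != payload:
            raise IOError(f"Contents of {path} do not match the serialized model")
    return os.path.getsize(path)


# Stand-in bytes; in the notebook use model_onnx.SerializeToString() and "model.onnx".
demo_payload = b"\x08\x07stand-in"
demo_path = os.path.join(tempfile.gettempdir(), "demo_model.onnx")
size = write_and_verify(demo_payload, demo_path)
print(size)
```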

Next, you need to show WWI how to use the Azure Storage Python SDK to upload the ONNX model to Azure Storage.

Replace the account name and account key placeholders with the correct values for your non-hierarchical Storage Account.


In [None]:
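# BlockBlobService is part of the legacy azure-storage-blob SDK (v2.x and earlier).
from azure.storage.blob import BlockBlobService

block_blob_service = BlockBlobService(
    account_name='#DATA_LAKE_ACCOUNT_NAME#', account_key='#DATA_LAKE_ACCOUNT_KEY#')

# SerializeToString() returns bytes, so upload with the bytes variant of the API.
block_blob_service.create_blob_from_bytes('wwi-02', '/ml/onnx/model.onnx', model_onnx.SerializeToString())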