

## Model persistence using SparkML


Estimated time needed: **30** minutes


<p style='color: red'>The purpose of this lab is to show you how to use SparkML to persist a model and load the persisted model.</p>


## __Table of Contents__

<ol>
  <li>
    <a href="#Objectives">Objectives
    </a>
  </li>
  <li>
    <a href="#Datasets">Datasets
    </a>
  </li>
  <li>
    <a href="#Setup">Setup
    </a>
    <ol>
      <li>
        <a href="#Installing-Required-Libraries">Installing Required Libraries
        </a>
      </li>
      <li>
        <a href="#Importing-Required-Libraries">Importing Required Libraries
        </a>
      </li>
    </ol>
  </li>
  <li>
    <a href="#Examples">Examples
    </a>
    <ol>
    <li>
      <a href="#Task-1---Create-a-model">Task 1 - Create a model
      </a>
    </li>
    <li>
      <a href="#Task-2---Save-the-model">Task 2 - Save the model
      </a>
    </li>
    <li>
      <a href="#Task-3---Load-the-model">Task 3 - Load the model
      </a>
    </li>
    <li>
      <a href="#Task-4---Predict-using-the-loaded-model">Task 4 - Predict using the loaded model
      </a>
    </li>
    </ol>
  </li>
  <li>
    <a href="#Exercises">Exercises
    </a>
    <ol>
    <li>
      <a href="#Exercise-1---Create-a-model">Exercise 1 - Create a model
      </a>
    </li>
    <li>
      <a href="#Exercise-2---Save-the-model">Exercise 2 - Save the model
      </a>
    </li>
    <li>
      <a href="#Exercise-3---Load-the-model">Exercise 3 - Load the model
      </a>
    </li>
    <li>
      <a href="#Exercise-4---Predict-using-the-loaded-model">Exercise 4 - Predict using the loaded model
      </a>
    </li>
    </ol>
  </li>
</ol>


## Objectives

After completing this lab you will be able to:

 - Save a trained model.
 - Load a saved model.
 - Make predictions using the loaded model.


## Datasets

In this lab you will use the following datasets:

 - Modified version of car mileage dataset. Original dataset available at https://archive.ics.uci.edu/ml/datasets/auto+mpg 
 - Modified version of diamonds dataset. Original dataset available at https://www.openml.org/search?type=data&sort=runs&id=42225&status=active
 


----


## Setup


For this lab, we will be using the following libraries:

*   [`PySpark`](https://spark.apache.org/docs/latest/api/python/index.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMSkillsNetworkBD0231ENCoursera2789-2023-01-01) for connecting to the Spark Cluster


### Installing Required Libraries

Spark Cluster is pre-installed in the Skills Network Labs environment. However, you need libraries like pyspark and findspark to connect to this cluster.

If you wish to download this Jupyter notebook and run it on your local computer, follow the instructions <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/labs/Connecting_to_spark_cluster_using_Skills_Network_labs.ipynb">here.</a>



The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You will need to run the following cell__ to install them:


In [ ]:
!pip install pyspark==3.1.2 -q
!pip install findspark -q

### Importing Required Libraries

_We recommend you import all required libraries in one place (here):_


In [ ]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

#import functions/Classes for sparkml

from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

# import functions/Classes for metrics
from pyspark.ml.evaluation import RegressionEvaluator


# Examples


## Task 1 - Create a model


In [ ]:
#Create SparkSession
#Ignore any warnings by SparkSession command

spark = SparkSession.builder.appName("Model Persistence").getOrCreate()

Download the data file


In [ ]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv


Load the dataset into a Spark dataframe


In [ ]:
# spark.read.csv loads the data into a dataframe.
# header=True tells Spark that the CSV file has a header row.
# inferSchema=True tells Spark to infer the data types of the columns automatically.

# Load mpg dataset
mpg_data = spark.read.csv("mpg.csv", header=True, inferSchema=True)


Print the schema of the dataset


In [ ]:
mpg_data.printSchema()

Show top 5 rows from the dataset


In [ ]:
mpg_data.show(5)

We ask the VectorAssembler to combine the columns listed in inputCols into a single vector column named "features"


In [ ]:
# Prepare feature vector
assembler = VectorAssembler(inputCols=["Cylinders", "Engine Disp", "Horsepower", "Weight", "Accelerate", "Year"], outputCol="features")
mpg_transformed_data = assembler.transform(mpg_data)
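
To make the transformation concrete, here is a minimal plain-Python sketch of what assembling does conceptually for a single row. It is not Spark's implementation; the `assemble` helper and the sample row values are hypothetical and only illustrate that the selected columns end up, in order, in one vector.

```python
# Conceptual sketch only: VectorAssembler applies this idea to every row of a dataframe.
def assemble(row, input_cols):
    """Return the values of input_cols, in order, as one list of floats."""
    return [float(row[col]) for col in input_cols]

# Hypothetical row using the column names from mpg.csv (the values are made up).
sample_row = {"MPG": 20.0, "Cylinders": 4, "Engine Disp": 120.0,
              "Horsepower": 90, "Weight": 2500, "Accelerate": 15.0, "Year": 75}

input_cols = ["Cylinders", "Engine Disp", "Horsepower", "Weight", "Accelerate", "Year"]
features = assemble(sample_row, input_cols)
print(features)  # the label column "MPG" is not part of the feature vector
```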


Display the assembled "features" and the label column "MPG"


In [ ]:
mpg_transformed_data.select("features","MPG").show()

We split the data set in a 70:30 ratio: 70% training data and 30% testing data.


In [ ]:
# Split data into training and testing sets
(training_data, testing_data) = mpg_transformed_data.randomSplit([0.7, 0.3])
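
Because `randomSplit` assigns rows at random, the split is only approximately 70:30 and changes between runs unless a seed is passed, for example `mpg_transformed_data.randomSplit([0.7, 0.3], seed=42)` (the seed value 42 is arbitrary). The plain-Python sketch below illustrates the idea; `random_split` is a hypothetical helper, not Spark's implementation.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        # choices() picks one split index according to the weights
        index = rng.choices(range(len(weights)), weights=weights)[0]
        splits[index].append(row)
    return splits

rows = list(range(1000))
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))  # roughly 700 and 300, not exactly

# The same seed reproduces the same split.
train_again, _ = random_split(rows, [0.7, 0.3], seed=42)
print(train == train_again)
```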


Create a linear regression (LR) model and train it using a pipeline on the training data set


In [ ]:
# Train linear regression model
# Ignore any warnings
lr = LinearRegression(labelCol="MPG", featuresCol="features")
pipeline = Pipeline(stages=[lr])
model = pipeline.fit(training_data)


## Task 2 - Save the model


Create a folder where the model will be saved


In [ ]:
!mkdir model_storage

In [ ]:
# Persist the model to the path "./model_storage/"

model.write().overwrite().save("./model_storage/")

# overwrite() replaces anything already at the path; without it, save() fails
# when the path exists (as it does here, because we created the folder above).
# save() takes the path where the model should be written.
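
Spark persists a model as a directory rather than a single file. One way to check what was written is to list that directory; the sketch below uses a hypothetical `list_tree` helper built on the standard library. For a pipeline model you would typically expect `metadata` and `stages` entries, though the exact file names depend on the Spark version.

```python
import os

def list_tree(root):
    """Return every file and folder under root as sorted relative paths."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            entries.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(entries)

# Prints nothing if the folder has not been created yet.
for entry in list_tree("./model_storage"):
    print(entry)
```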



## Task 3 - Load the model


In [ ]:
from pyspark.ml.pipeline import PipelineModel

# Load persisted model
loaded_model = PipelineModel.load("./model_storage/")
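
A quick sanity check after loading is to confirm that the loaded model carries the same parameters as the one trained in memory. The helper below is a hypothetical sketch; it only assumes that both objects expose `coefficients` and `intercept`, as a fitted linear regression stage does.

```python
import math

def same_linear_model(a, b, tol=1e-9):
    """Compare two fitted linear models by coefficients and intercept."""
    coef_a, coef_b = list(a.coefficients), list(b.coefficients)
    if len(coef_a) != len(coef_b):
        return False
    coefs_match = all(math.isclose(x, y, abs_tol=tol) for x, y in zip(coef_a, coef_b))
    return coefs_match and math.isclose(a.intercept, b.intercept, abs_tol=tol)
```

In this notebook the linear regression is the only pipeline stage, so the check could be called as `same_linear_model(model.stages[0], loaded_model.stages[0])`.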

## Task 4 - Predict using the loaded model


In [ ]:
# Make predictions on test data using the loaded model.
# PipelineModel.load restored the persisted model from disk; its transform
# method now makes predictions on new data, just like the original model.
predictions = loaded_model.transform(testing_data)

Display the first 5 predictions.


In [ ]:
predictions.select("prediction").show(5)
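
The `RegressionEvaluator` imported earlier can score these predictions, for example with `RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="rmse").evaluate(predictions)`. As a rough illustration of what that metric measures, here is a plain-Python sketch of root mean squared error; the `rmse` helper and the sample numbers are made up for illustration.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between true labels and predicted values."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values: the errors are 1, -1 and 2, so RMSE = sqrt((1 + 1 + 4) / 3) = sqrt(2)
print(rmse([20.0, 25.0, 30.0], [19.0, 26.0, 28.0]))
```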

Stop Spark Session


In [ ]:
spark.stop()

# Exercises


### Exercise 1 - Create a model


Create a spark session with appname "Model Persistence Exercise"


In [ ]:
spark = #TODO

<details>
    <summary>Click here for a Hint</summary>
    
Use the SparkSession.builder

</details>


<details>
    <summary>Click here for Solution</summary>

```python
spark = SparkSession.builder.appName("Model Persistence Exercise").getOrCreate()
```

</details>


Download the data set


In [ ]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv


Load the dataset into a spark dataframe


In [ ]:
diamond_data = #TODO

<details>
    <summary>Click here for a Hint</summary>
    
Use the spark.read.csv function

</details>


<details>
    <summary>Click here for Solution</summary>

```python
diamond_data = spark.read.csv("diamonds.csv", header=True, inferSchema=True)
```
</details>


Display sample data from dataset


In [ ]:
diamond_data.show(5)

Assemble the columns carat, depth and table into a single column named "features"


In [ ]:
assembler = #TODO
diamond_transformed_data = #TODO

<details>
    <summary>Click here for a Hint</summary>
    
Refer to Task 1
</details>


<details>
    <summary>Click here for Solution</summary>

```python
assembler = VectorAssembler(inputCols=["carat", "depth", "table"], outputCol="features")
diamond_transformed_data = assembler.transform(diamond_data)
```

</details>


Print the vectorized features and label columns


In [ ]:
diamond_transformed_data.select("features","price").show()

Split the dataset into training and testing sets in the ratio of 70:30.


In [ ]:
(training_data, testing_data) =  #TODO

<details>
    <summary>Click here for a Hint</summary>
    
Use the randomSplit method
</details>


<details>
    <summary>Click here for Solution</summary>

```python
(training_data, testing_data) = diamond_transformed_data.randomSplit([0.7, 0.3])
```

</details>


Create a linear regression (LR) model and train it using a pipeline on the training data set


In [ ]:
# Train linear regression model
# Ignore any warnings
lr =  #TODO
pipeline =  #TODO
model =  #TODO


<details>
    <summary>Click here for a Hint</summary>
    
Use a Pipeline with the lr model as its only stage
</details>


<details>
    <summary>Click here for Solution</summary>

```python
lr = LinearRegression(labelCol="price", featuresCol="features")
pipeline = Pipeline(stages=[lr])
model = pipeline.fit(training_data)
```

</details>


### Exercise 2 - Save the model


Create a folder "diamond_model". This is where the model will be saved


In [ ]:
!mkdir diamond_model

Persist the model to the folder "diamond_model"


In [ ]:
#your code goes here

<details>
    <summary>Click here for a Hint</summary>
    
Use the write method of the model
</details>


<details>
    <summary>Click here for Solution</summary>

```python
model.write().overwrite().save("./diamond_model/")
```

</details>


### Exercise 3 - Load the model


Load the model from the folder "diamond_model"


In [ ]:
from pyspark.ml.pipeline import PipelineModel

# Load persisted model
loaded_model = #TODO

<details>
    <summary>Click here for a Hint</summary>
    
Use the load method of the PipelineModel
</details>


<details>
    <summary>Click here for Solution</summary>

```python
loaded_model = PipelineModel.load("./diamond_model/")
```

</details>


### Exercise 4 - Predict using the loaded model


Make predictions on test data


In [ ]:

predictions = #TODO


<details>
    <summary>Click here for a Hint</summary>
    
Use the transform method of the loaded model
</details>


<details>
    <summary>Click here for Solution</summary>

```python
predictions = loaded_model.transform(testing_data)
```

</details>


In [ ]:
predictions.select("prediction").show(5)

Stop Spark Session


In [ ]:
spark.stop()

Congratulations! You have completed this lab.<br>


## Authors


[Ramesh Sannareddy](https://www.linkedin.com/in/rsannareddy/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMBD0231ENSkillsNetwork866-2023-01-01)


### Other Contributors


Copyright © 2023 IBM Corporation. All rights reserved.


<!--
## Change Log
-->


<!--
|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-05-04|0.1|Ramesh Sannareddy|Initial Version Created|
-->
