<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMSkillsNetworkBD0231ENCoursera2789-2023-01-01">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


## Regression using SparkML


Estimated time needed: **30** minutes


<p style='color: red'>The purpose of this lab is to show you how to use SparkML to predict the mileage of a car.</p>


## __Table of Contents__

<ol>
  <li>
    <a href="#Objectives">Objectives
    </a>
  </li>
  <li>
    <a href="#Datasets">Datasets
    </a>
  </li>
  <li>
    <a href="#Setup">Setup
    </a>
    <ol>
      <li>
        <a href="#Installing-Required-Libraries">Installing Required Libraries
        </a>
      </li>
      <li>
        <a href="#Importing-Required-Libraries">Importing Required Libraries
        </a>
      </li>
    </ol>
  </li>
  <li>
    <a href="#Examples">Examples
    </a>
    <ol>
    <li>
      <a href="#Task-1---Create-a-spark-session">Task 1 - Create a spark session
      </a>
    </li>
    <li>
      <a href="#Task-2---Load-the-data-in-a-csv-file-into-a-dataframe">Task 2 - Load the data in a csv file into a dataframe
      </a>
    </li>
    <li>
      <a href="#Task-3---Identify-the-label-column-and-the-input-columns">Task 3 - Identify the label column and the input columns
      </a>
    </li>
    <li>
      <a href="#Task-4---Split-the-data">Task 4 - Split the data
      </a>
    </li>
    <li>
      <a href="#Task-5---Build-and-Train-a-Linear-Regression-Model">Task 5 - Build and Train a Linear Regression Model
      </a>
    </li>
    <li>
      <a href="#Task-6---Evaluate-the-model">Task 6 - Evaluate the model
      </a>
    </li>
    </ol>
  </li>
  <li>
    <a href="#Exercises">Exercises
    </a>
    <ol>
    <li>
      <a href="#Exercise-1---Create-a-spark-session">Exercise 1 - Create a spark session
      </a>
    </li>
    <li>
      <a href="#Exercise-2---Load-the-data-in-a-csv-file-into-a-dataframe">Exercise 2 - Load the data in a csv file into a dataframe
      </a>
    </li>
    <li>
      <a href="#Exercise-3---Identify-the-label-column-and-the-input-columns">Exercise 3 - Identify the label column and the input columns
      </a>
    </li>
    <li>
      <a href="#Exercise-4---Split-the-data">Exercise 4 - Split the data
      </a>
    </li>
    <li>
      <a href="#Exercise-5---Build-and-Train-a-Linear-Regression-Model">Exercise 5 - Build and Train a Linear Regression Model
      </a>
    </li>
    <li>
      <a href="#Exercise-6---Evaluate-the-model">Exercise 6 - Evaluate the model
      </a>
    </li>
    </ol>
  </li>
</ol>


## Objectives

After completing this lab you will be able to:

 - Use PySpark to connect to a Spark cluster.
 - Create a Spark session.
 - Read a CSV file into a dataframe.
 - Split the dataset into training and testing sets.
 - Use VectorAssembler to combine multiple columns into a single vector column.
 - Use Linear Regression to build a prediction model.
 - Use metrics to evaluate the model.
 - Stop the Spark session.
 
 
 


## Datasets

In this lab you will use the following datasets:

 - Modified version of car mileage dataset. Original dataset available at https://archive.ics.uci.edu/ml/datasets/auto+mpg 
 - Modified version of diamonds dataset. Original dataset available at https://www.openml.org/search?type=data&sort=runs&id=42225&status=active
 


----


## Setup


For this lab, we will be using the following libraries:

*   [`PySpark`](https://spark.apache.org/docs/latest/api/python/index.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMSkillsNetworkBD0231ENCoursera2789-2023-01-01) for connecting to the Spark Cluster


### Installing Required Libraries

A Spark cluster is pre-installed in the Skills Network Labs environment. However, you need libraries such as pyspark and findspark to connect to this cluster.

If you wish to download this Jupyter notebook and run it on your local computer, follow the instructions <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/labs/Connecting_to_spark_cluster_using_Skills_Network_labs.ipynb">here</a>.


The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You will need to run the following cell__ to install them:


In [ ]:
!pip install pyspark==3.1.2 -q
!pip install findspark -q

In [ ]:
# You can use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')


### Importing Required Libraries

_We recommend you import all required libraries in one place (here):_


In [ ]:
# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

from pyspark.sql import SparkSession

#import functions/Classes for sparkml

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# import functions/Classes for metrics
from pyspark.ml.evaluation import RegressionEvaluator


## Task 1 - Create a spark session


In [ ]:
#Create SparkSession
#Ignore any warnings by SparkSession command

spark = SparkSession.builder.appName("Regression using SparkML").getOrCreate()

## Task 2 - Load the data in a csv file into a dataframe


Download the data file


In [ ]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv


Load the dataset into the spark dataframe


In [ ]:
# Use the spark.read.csv function to load the data into a dataframe.
# header=True indicates that the csv file has a header row.
# inferSchema=True tells Spark to infer the data types of the columns automatically.

# Load mpg dataset
mpg_data = spark.read.csv("mpg.csv", header=True, inferSchema=True)


Print the schema of the dataset


In [ ]:
mpg_data.printSchema()

Show the first 5 rows of the dataset


In [ ]:
mpg_data.show(5)

## Task 3 - Identify the label column and the input columns


The label column is "MPG". We ask the VectorAssembler to combine the input columns listed in inputCols into a single vector column named "features"


In [ ]:
# Prepare feature vector
assembler = VectorAssembler(inputCols=["Cylinders", "Engine Disp", "Horsepower", "Weight", "Accelerate", "Year"], outputCol="features")
mpg_transformed_data = assembler.transform(mpg_data)


Display the assembled "features" and the label column "MPG"


In [ ]:
mpg_transformed_data.select("features","MPG").show()
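
Conceptually, assembling features means turning several numeric columns of each row into one vector, while the label stays in its own column. The following plain-Python sketch illustrates that idea only; it is not how VectorAssembler is implemented, and the `assemble` helper and the row values are hypothetical, made up for illustration.

```python
# Toy rows standing in for a dataframe; the values are made up for illustration.
rows = [
    {"MPG": 15.0, "Cylinders": 8, "Horsepower": 190, "Weight": 3850},
    {"MPG": 24.0, "Cylinders": 4, "Horsepower": 95, "Weight": 2372},
]

input_cols = ["Cylinders", "Horsepower", "Weight"]  # columns to combine
label_col = "MPG"                                   # column to predict


def assemble(row, cols):
    """Return the values of the chosen columns as one list of floats."""
    return [float(row[c]) for c in cols]


# Each output record holds a single "features" vector plus the label.
assembled = [
    {"features": assemble(row, input_cols), label_col: row[label_col]}
    for row in rows
]

for record in assembled:
    print(record)
```

The order of the values in each vector follows the order of the column names, which is why the same list of input columns should be used consistently.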

## Task 4 - Split the data


We split the dataset in a 70:30 ratio: 70% training data and 30% testing data.


In [ ]:
# Split data into training and testing sets
(training_data, testing_data) = mpg_transformed_data.randomSplit([0.7, 0.3], seed=42)


The "seed" argument of randomSplit sets the seed of the random sampling that assigns rows to each split. Pass the same integer to get a reproducible split across runs. The weights describe the expected proportions, so the resulting split sizes are approximate rather than exact.
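
The effect of a fixed seed can be illustrated without Spark. The sketch below is a plain-Python illustration of the general idea (each row is assigned to a split by a seeded random draw); it is not Spark's implementation, and the `seeded_split` helper is hypothetical.

```python
import random


def seeded_split(rows, train_fraction, seed):
    """Assign each row to train or test using a seeded random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1000))

# The same seed produces the same split every time.
train_a, test_a = seeded_split(rows, 0.7, seed=42)
train_b, test_b = seeded_split(rows, 0.7, seed=42)

print("training rows:", len(train_a), "testing rows:", len(test_a))
print("same split with the same seed:", train_a == train_b)
```

Because every row is drawn independently, the training set holds roughly, not exactly, 70% of the rows.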


## Task 5 - Build and Train a Linear Regression Model


Create a linear regression (LR) model and train it on the training dataset


In [ ]:
# Train linear regression model
# Ignore any warnings

lr = LinearRegression(featuresCol="features", labelCol="MPG")
model = lr.fit(training_data)
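
Fitting a linear regression means finding the coefficients that minimize the squared error between predictions and labels. As a minimal sketch of that idea, assuming a single feature and toy data made up for illustration, the least-squares line can be computed directly in plain Python. This is not what LinearRegression does internally, and the `predict` helper is hypothetical.

```python
# Toy data (made up for illustration): y is exactly 2 * x + 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares estimates for a single feature:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
#   intercept = mean_y - slope * mean_x
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x


def predict(value):
    """Apply the fitted line to a new input."""
    return slope * value + intercept


print("slope =", slope, "intercept =", intercept)
print("prediction for x = 5:", predict(5.0))
```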

## Task 6 - Evaluate the model


Your model is now trained. We use the testing data to make predictions.


In [ ]:
# Make predictions on testing data
predictions = model.transform(testing_data)

##### R Squared


In [ ]:
#R-squared (R2): R2 is a statistical measure that represents the proportion of variance
#in the dependent variable (target) that is explained by the independent variables (features).
#Higher values indicate better performance.

evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)
print("R Squared =", r2)
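
To make the definition concrete, R squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses toy values made up for illustration, and the variable names are hypothetical.

```python
# Toy values (made up for illustration).
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 8.0, 9.5]

mean_actual = sum(actual) / len(actual)

# Residual sum of squares: variation the predictions fail to explain.
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Total sum of squares: variation of the labels around their mean.
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r2_manual = 1 - ss_res / ss_tot
print("R Squared =", r2_manual)
```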


##### Root Mean Squared Error


In [ ]:
#Root Mean Squared Error (RMSE): RMSE is the square root of the average of the squared differences
#between the predicted and actual values. It is expressed in the same units as the label and
#penalizes large errors more heavily than small ones. Lower values indicate better performance.

evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("RMSE =", rmse)
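
As a worked example of the definition, RMSE can be computed by hand: square each error, average the squares, and take the square root. The values below are made up for illustration, and the variable names are hypothetical.

```python
import math

# Toy values (made up for illustration).
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 8.0, 9.5]

# Mean of the squared differences between predicted and actual values.
mse_manual = sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# RMSE is the square root of that mean.
rmse_manual = math.sqrt(mse_manual)
print("RMSE =", rmse_manual)
```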


##### Mean Absolute Error


In [ ]:
#Mean Absolute Error (MAE): MAE is the average of the absolute differences between the predicted and
#actual values. It measures the average absolute distance between the predicted and actual values, and
#lower values indicate better performance.

evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)
print("MAE =", mae)
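
MAE can likewise be checked by hand: take the absolute value of each error and average them. The values below are made up for illustration, and the variable names are hypothetical.

```python
# Toy values (made up for illustration).
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 8.0, 9.5]

# Average of the absolute differences between predicted and actual values.
mae_manual = sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)
print("MAE =", mae_manual)
```

Unlike RMSE, MAE weights every error in proportion to its size, so a single large error affects it less.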


Stop the Spark session


In [ ]:
spark.stop()

## Exercises


### Exercise 1 - Create a spark session


Create a Spark session with the appName "Diamond Price Prediction"


In [ ]:
spark = #TODO

<details>
    <summary>Click here for a Hint</summary>
    
Use the SparkSession.builder

</details>


<details>
    <summary>Click here for Solution</summary>

```python
spark = SparkSession.builder.appName("Diamond Price Prediction").getOrCreate()
```

</details>


### Exercise 2 - Load the data in a csv file into a dataframe


Download the data set


In [ ]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv


Load the dataset into a spark dataframe


In [ ]:
diamond_data = #TODO

<details>
    <summary>Click here for a Hint</summary>
    
Use the spark.read.csv function

</details>


<details>
    <summary>Click here for Solution</summary>

```python
diamond_data = spark.read.csv("diamonds.csv", header=True, inferSchema=True)
```
</details>


Display sample data from dataset


In [ ]:
#TO DO

<details>
    <summary>Click here for Solution</summary>

```python
diamond_data.show(5)
```
</details>


### Exercise 3 - Identify the label column and the input columns


 - Use the price column as the label column.
 - Use the columns carat, depth and table as features.


Assemble the columns carat, depth and table into a single column named "features"


In [ ]:
assembler = #TODO
diamond_transformed_data = #TODO


<details>
    <summary>Click here for a Hint</summary>
    
Refer to Task 3
</details>


<details>
    <summary>Click here for Solution</summary>

```python
assembler = VectorAssembler(inputCols=["carat", "depth", "table"], outputCol="features")
diamond_transformed_data = assembler.transform(diamond_data)
```

</details>


Print the vectorized features and label columns


In [ ]:
#TO DO

<details>
    <summary>Click here for a Hint</summary>
    
Refer to Task 3
</details>


<details>
    <summary>Click here for Solution</summary>

```python
diamond_transformed_data.select("features","price").show()
```

</details>


### Exercise 4 - Split the data


Split the dataset into training and testing sets in the ratio of 70:30.


In [ ]:
(training_data, testing_data) = # TODO

<details>
    <summary>Click here for a Hint</summary>
    
use the randomSplit method
</details>


<details>
    <summary>Click here for Solution</summary>

```python
(training_data, testing_data) = diamond_transformed_data.randomSplit([0.7, 0.3])
```

</details>


### Exercise 5 - Build and Train a Linear Regression Model


Build a linear regression and train it


In [ ]:
lr = #TODO
model = #TODO


<details>
    <summary>Click here for a Hint</summary>
    
use the fit method of LinearRegression
</details>


<details>
    <summary>Click here for Solution</summary>

```python
lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(training_data)
```

</details>


Predict the values using the test data


### Exercise 6 - Evaluate the model


Your model is now trained. Use it to make predictions on testing_data


In [ ]:
predictions = #TODO


<details>
    <summary>Click here for a Hint</summary>
    
use the transform method of the model
</details>


<details>
    <summary>Click here for Solution</summary>

```python
predictions = model.transform(testing_data)
```

</details>


Print the metrics:
- R squared
- mean absolute error
- root mean squared error


In [ ]:
#your code goes here

<details>
    <summary>Click here for a Hint</summary>
    
Use the RegressionEvaluator
</details>


<details>
    <summary>Click here for Solution</summary>

```python
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)
print("R Squared =", r2)


evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)
print("MAE =", mae)


evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("RMSE =", rmse)
```

</details>


Congratulations! You have completed this lab.<br>


## Authors


[Ramesh Sannareddy](https://www.linkedin.com/in/rsannareddy/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMBD0231ENSkillsNetwork866-2023-01-01)


### Other Contributors


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-04-25|0.1|Ramesh Sannareddy|Initial Version Created|


Copyright © 2023 IBM Corporation. All rights reserved.
