## Exercise: Train a Model and Create Predictions

Train a linear regression model on the Boston dataset using a different set of input features, then predict on new data.

In [0]:
%run "../Includes/Classroom-Setup"

Import the Boston dataset, which contains median house values in thousands of dollars (`medv`) along with a variety of features.  Because each record is labeled with its median value, this is a supervised machine learning use case.

In [0]:
bostonDF = (spark.read
  .option("header", True)
  .option("inferSchema", True)
  .csv("/mnt/training/bostonhousing/bostonhousing/bostonhousing.csv")
)

display(bostonDF)

### Step 1: Create the Features

Using `bostonDF`, create a `VectorAssembler` object named `assembler` that builds a new column `newFeatures` from the following three variables:

1. `indus`: proportion of non-retail business acres per town
2. `age`: proportion of owner-occupied units built prior to 1940
3. `dis`: weighted distances to five Boston employment centers

Save the results to `bostonFeaturizedDF2`.
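
As a rough illustration of what the assembling step does, the sketch below packs the three input columns of each row into a single list, leaving the original columns in place. It is plain Python and not Spark's implementation: the `assemble` helper and the sample rows are hypothetical, and the values are made up rather than taken from the Boston dataset.

```python
# Conceptual sketch only: mimics the effect of a VectorAssembler on plain dicts.
# `assemble` and `sample_rows` are hypothetical; the values are made up.
def assemble(rows, input_cols, output_col):
    """Return copies of the rows with the input columns packed into one list."""
    assembled = []
    for row in rows:
        new_row = dict(row)  # keep the original columns alongside the new one
        new_row[output_col] = [float(row[c]) for c in input_cols]
        assembled.append(new_row)
    return assembled

sample_rows = [
    {"indus": 2.5, "age": 60.0, "dis": 4.0, "medv": 24.0},
    {"indus": 7.0, "age": 80.0, "dis": 5.0, "medv": 21.5},
]

featurized = assemble(sample_rows, ["indus", "age", "dis"], "newFeatures")
print(featurized[0]["newFeatures"])  # [2.5, 60.0, 4.0]
```

The order of the names in `input_cols` determines the order of the values in the assembled vector, so new data must later be supplied in the same order.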

In [0]:
# ANSWER
from pyspark.ml.feature import VectorAssembler

featureCols = ["indus", "age", "dis"]
assembler = VectorAssembler(inputCols=featureCols, outputCol="newFeatures")

bostonFeaturizedDF2 = assembler.transform(bostonDF)

display(bostonFeaturizedDF2)

In [0]:
# TEST - Run this cell to test your solution
dbTest("ML1-P-02-01-01", True, set(assembler.getInputCols()) == {'indus', 'age', 'dis'})
dbTest("ML1-P-02-01-02", True, bool(bostonFeaturizedDF2.schema['newFeatures'].dataType))

print("Tests passed!")

### Step 2: Train the Model

Instantiate a linear regression model `lrNewFeatures`.  Save the trained model to `lrModelNew`.
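
To give a sense of what fitting a linear regression estimates, here is a minimal single-feature sketch of ordinary least squares in plain Python. It is an assumption-laden illustration, not how Spark's `LinearRegression` is implemented (Spark handles multiple features and regularization and uses its own solvers). The `fit_simple_ols` helper and the toy data are hypothetical.

```python
# Single-feature ordinary least squares, for illustration only.
# `fit_simple_ols` is a hypothetical helper; the data are toy values.
def fit_simple_ols(xs, ys):
    """Return (slope, intercept) minimizing the squared error of y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data lying exactly on the line y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_simple_ols(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

With three features, as in this exercise, the fitted model holds one coefficient per feature plus an intercept.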

In [0]:
# ANSWER
from pyspark.ml.regression import LinearRegression

lrNewFeatures = LinearRegression(labelCol="medv", featuresCol="newFeatures")

lrModelNew = lrNewFeatures.fit(bostonFeaturizedDF2)

In [0]:
# TEST - Run this cell to test your solution
dbTest("ML1-P-02-02-01", True, lrNewFeatures.getFeaturesCol() == "newFeatures")
dbTest("ML1-P-02-02-02", True, lrNewFeatures.getLabelCol() == "medv")
dbTest("ML1-P-02-02-03", True, lrModelNew.hasSummary)

print("Tests passed!")

### Step 3: Create Predictions

Create the DataFrame `predictionsDF` by applying the trained model to the following values, which are provided for you in `newDataDF`:

| Feature | Datapoint 1 | Datapoint 2 | Datapoint 3 |
|:--------|:------------|:------------|:------------|
| `indus` | 11          | 6           | 19          |
| `age`   | 68          | 35          | 74          |
| `dis`   | 4           | 2           | 8           |
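
For intuition, a linear regression prediction is the intercept plus the dot product of the coefficients with the feature vector. The sketch below applies that formula to the three datapoints in the table. The `INTERCEPT` and `COEFFICIENTS` values are placeholders chosen for easy arithmetic, not the values of the trained model, so the printed numbers will not match the exercise's predictions; on a fitted Spark model the real values are available as `lrModelNew.intercept` and `lrModelNew.coefficients`.

```python
# Placeholder parameters for illustration only; NOT the trained model's values.
INTERCEPT = 1.0
COEFFICIENTS = [1.0, 0.5, 2.0]  # order matches the features: indus, age, dis

def predict(features):
    """Intercept plus the dot product of coefficients and features."""
    return INTERCEPT + sum(c * x for c, x in zip(COEFFICIENTS, features))

# The three datapoints from the table, as (indus, age, dis)
datapoints = [
    [11.0, 68.0, 4.0],
    [6.0, 35.0, 2.0],
    [19.0, 74.0, 8.0],
]

predictions = [predict(p) for p in datapoints]
print(predictions)  # [54.0, 28.5, 73.0]
```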

In [0]:
# ANSWER
from pyspark.ml.linalg import Vectors

data = [(Vectors.dense([11., 68., 4.]), ),
        (Vectors.dense([6., 35., 2.]), ),
        (Vectors.dense([19., 74., 8.]), )]
newDataDF = spark.createDataFrame(data, ["newFeatures"])
predictionsDF = lrModelNew.transform(newDataDF)

display(predictionsDF)

In [0]:
# TEST - Run this cell to test your solution
predictions = [row.prediction for row in predictionsDF.select("prediction").collect()]

dbTest("ML1-P-02-02-01", True, 20 < predictions[0] < 23)
dbTest("ML1-P-02-02-01", True, 30 < predictions[1] < 34)
dbTest("ML1-P-02-02-01", True, 7 < predictions[2] < 11)

print("Tests passed!")