## Exercise: Select Optimal Model by Tuning Hyperparameters

Use grid search and cross-validation to tune the hyperparameters of a logistic regression model.

Run the following cell to set up our environment.

In [0]:
%run "./Includes/Classroom-Setup"

### Step 1: Import the Data

Import the data and perform a train/test split.

In [0]:
from pyspark.sql.functions import col

cols = ["index",
 "sample-code-number",
 "clump-thickness",
 "uniformity-of-cell-size",
 "uniformity-of-cell-shape",
 "marginal-adhesion",
 "single-epithelial-cell-size",
 "bare-nuclei",
 "bland-chromatin",
 "normal-nucleoli",
 "mitoses",
 "class"]

cancerDF = (spark.read  # read the data
  .option("HEADER", True)
  .option("inferSchema", True)
  .csv("/mnt/training/cancer/biopsy/biopsy.csv")
)

cancerDF = (cancerDF    # Add column names and flag non-null bare-nuclei values as 1/0
  .toDF(*cols)
  .withColumn("bare-nuclei", col("bare-nuclei").isNotNull().cast("integer"))
)

display(cancerDF)

Perform a train/test split to create `trainCancerDF` and `testCancerDF`.  Put 80% of the data in `trainCancerDF` and use the seed that is set for you.

In [0]:
# TODO
seed = 42
trainCancerDF, testCancerDF = cancerDF.randomSplit([0.8, 0.2], seed=seed)

### Step 2: Create a Pipeline

Create a pipeline `cancerPipeline` that consists of the following stages:

1. `indexer`: a `StringIndexer` that takes `class` as an input and outputs the column `is-malignant`
2. `assembler`: a `VectorAssembler` that takes the feature columns (`cols[2:-1]`, that is, everything except `index`, `sample-code-number`, and `class`) as input and outputs the column `features`
3. `logr`: a `LogisticRegression` that takes `features` as the input and `is-malignant` as the label

In [0]:
# TODO
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler

indexer = StringIndexer(inputCol="class", outputCol="is-malignant")
assembler = VectorAssembler(inputCols=cols[2:-1], outputCol="features")
logr = LogisticRegression(labelCol="is-malignant", featuresCol="features")
cancerPipeline = Pipeline(stages=[indexer, assembler, logr])

In [0]:
# TEST - Run this cell to test your solution
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler

dbTest("ML1-P-08-02-01", True, type(indexer) == type(StringIndexer()))
dbTest("ML1-P-08-02-02", True, indexer.getInputCol() == 'class')
dbTest("ML1-P-08-02-03", True, indexer.getOutputCol() == 'is-malignant')

dbTest("ML1-P-08-02-04", True, type(assembler) == type(VectorAssembler()))
dbTest("ML1-P-08-02-05", True, assembler.getInputCols() == cols[2:-1])
dbTest("ML1-P-08-02-06", True, assembler.getOutputCol() == 'features')

dbTest("ML1-P-08-02-07", True, type(logr) == type(LogisticRegression()))
dbTest("ML1-P-08-02-08", True, logr.getLabelCol() == "is-malignant")
dbTest("ML1-P-08-02-09", True, logr.getFeaturesCol() == 'features')

dbTest("ML1-P-08-02-10", True, type(cancerPipeline) == type(Pipeline()))

print("Tests passed!")

### Step 3: Create Grid Search Parameters

Take a look at the parameters for our `LogisticRegression` object.  Use this to build the inputs to grid search.

In [0]:
print(logr.explainParams())

Create a `ParamGridBuilder` object with two grids:

1. A regularization parameter `regParam` of `[0., .2, .8, 1.]`
2. Test both with and without an intercept using `fitIntercept`

In [0]:
# TODO
from pyspark.ml.tuning import ParamGridBuilder

cancerParamGrid = (ParamGridBuilder()
  .addGrid(logr.regParam, [0.0, 0.2, 0.8, 1.0])
  .addGrid(logr.fitIntercept, [True, False])
  .build()
)

In [0]:
# TEST - Run this cell to test your solution
dbTest("ML1-P-08-03-01", True, type(cancerParamGrid) == list)

print("Tests passed!")

### Step 4: Perform 3-Fold Cross-Validation

Create a `BinaryClassificationEvaluator` object and use it to perform 3-fold cross-validation.

In [0]:
# TODO
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

binaryEvaluator = BinaryClassificationEvaluator(
  labelCol = "is-malignant", 
  metricName = "areaUnderROC"
)

cancerCV = CrossValidator(
  estimator = cancerPipeline,           # Estimator (individual model or pipeline)
  estimatorParamMaps = cancerParamGrid, # Grid of parameters to try (grid search)
  evaluator = binaryEvaluator,          # Evaluator
  numFolds = 3,                         # Set k to 3
  seed = 42                             # Seed to make sure our results are the same if run again
)

cancerCVModel = cancerCV.fit(trainCancerDF)

In [0]:
# TEST - Run this cell to test your solution
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

dbTest("ML1-P-08-04-01", True, type(binaryEvaluator) == type(BinaryClassificationEvaluator()))
dbTest("ML1-P-08-04-02", True, type(cancerCV) == type(CrossValidator()))

print("Tests passed!")

### Step 5: Examine the Results

Take a look at the results.  Which combination of hyperparameters learned the most from the data?

In [0]:
for params, score in zip(cancerCVModel.getEstimatorParamMaps(), cancerCVModel.avgMetrics):
  print("".join([param.name+"\t"+str(params[param])+"\t" for param in params]))
  print("\tScore: {}".format(score))


---

### **Setup: Environment Setup with `%run`**
```python
%run "./Includes/Classroom-Setup"
```
- `%run` is a notebook magic command that executes another notebook or file inline, so everything it defines becomes available in the current notebook.
- Here, `"./Includes/Classroom-Setup"` is a path to a file that sets up the environment, like preparing libraries, variables, and settings for this notebook.
- Running this cell ensures the workspace is ready to use, so we don’t need to manually set up every tool or configuration needed to run the notebook.

---

### **Step 1: Import the Data**
#### Code Block 1: Data Import and Initial Cleaning
```python
from pyspark.sql.functions import col

cols = ["index", "sample-code-number", "clump-thickness", "uniformity-of-cell-size", "uniformity-of-cell-shape", 
        "marginal-adhesion", "single-epithelial-cell-size", "bare-nuclei", "bland-chromatin", "normal-nucleoli", 
        "mitoses", "class"]

cancerDF = (spark.read  # read the data
  .option("HEADER", True)
  .option("inferSchema", True)
  .csv("/mnt/training/cancer/biopsy/biopsy.csv"))

cancerDF = (cancerDF
    .toDF(*cols)  # Add column names
    .withColumn("bare-nuclei", col("bare-nuclei").isNotNull().cast("integer")))  # Replace values with a 1/0 "is present" flag
display(cancerDF)
```

1. **`from pyspark.sql.functions import col`**:
   - Imports the `col` function from PySpark, which is used to reference columns easily in DataFrames.

2. **Define Column Names**:
   - `cols` is a list of column names that describe each attribute of the data, like `sample-code-number` (an ID for each patient), `clump-thickness` (a feature to analyze), and `class` (the diagnosis, indicating whether the tumor is benign or malignant).

3. **Read CSV File**:
   - `spark.read` initiates reading data with Spark's `DataFrame` API.
   - `.option("HEADER", True)` tells Spark to treat the first row as column names.
   - `.option("inferSchema", True)` allows Spark to automatically detect data types (e.g., integer, string) for each column.
   - `.csv("/mnt/training/cancer/biopsy/biopsy.csv")` loads the file stored at this path into a DataFrame called `cancerDF`.

4. **Rename Columns**:
   - `.toDF(*cols)` renames columns in `cancerDF` to match the list `cols`.

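5. **Flag Missing Values**:
   - `.withColumn("bare-nuclei", col("bare-nuclei").isNotNull().cast("integer"))`:
     - `col("bare-nuclei").isNotNull()` returns a boolean for each row: true where "bare-nuclei" has a value, false where it is null.
     - `.cast("integer")` converts that boolean to `1` or `0`.
     - The result overwrites the "bare-nuclei" column with a 1/0 indicator. No rows are dropped and no missing values are filled in; the original measurements in this column are replaced by the flag.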

6. **Display the DataFrame**:
   - `display(cancerDF)` shows the DataFrame in a table format for easy viewing.
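
To make the effect of `isNotNull().cast("integer")` concrete, here is a plain-Python analogy. It is not Spark code, and the sample values are made up for illustration.

```python
# Plain-Python analogy of col("bare-nuclei").isNotNull().cast("integer").
# The sample values are invented; None stands in for a null cell.
bare_nuclei = [1, 10, None, 4, None]

# isNotNull() gives True/False per row; casting the boolean to integer gives 1/0.
flags = [int(value is not None) for value in bare_nuclei]

print(flags)  # [1, 1, 0, 1, 0]
```

Note that the measurements `1`, `10`, and `4` do not survive: only the "was a value present" information is kept.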

#### Code Block 2: Split Data into Training and Testing Sets
```python
seed = 42
trainCancerDF, testCancerDF = cancerDF.randomSplit([0.8, 0.2], seed=seed)
```

1. **Set a Seed**:
   - `seed = 42` stores the seed value passed to `randomSplit`, so that the same input data yields the same split on every run.

2. **Random Split**:
   - `cancerDF.randomSplit([0.8, 0.2], seed=seed)`:
     - Splits `cancerDF` into two parts (the weights are targets, so the actual sizes are approximate):
       - Roughly **80%** of the data goes to `trainCancerDF` for training the model.
       - Roughly **20%** goes to `testCancerDF` for evaluating the model.
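
The sketch below illustrates, in plain Python, why a seeded random split is repeatable and why the resulting sizes are only approximately 80/20. The `random_split` function is a hypothetical helper written for this illustration; Spark's own implementation differs in its details.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split using an independent weighted random draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries in [0, 1], e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            if draw < bound:
                splits[i].append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating-point rounding
    return splits

rows = list(range(1000))
train, test = random_split(rows, [0.8, 0.2], seed=42)
train_again, test_again = random_split(rows, [0.8, 0.2], seed=42)

print(len(train), len(test))   # close to 800 and 200, but not necessarily exact
print(train == train_again)    # True: the same seed reproduces the same split
```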

---

### **Step 2: Create a Pipeline**

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler

indexer = StringIndexer(inputCol="class", outputCol="is-malignant")
assembler = VectorAssembler(inputCols=cols[2:-1], outputCol="features")
logr = LogisticRegression(labelCol="is-malignant", featuresCol="features")

cancerPipeline = Pipeline(stages=[indexer, assembler, logr])
```

1. **Import Required Libraries**:
   - `Pipeline` helps to create and apply a series of steps.
   - `LogisticRegression` is the machine learning model we’ll train.
   - `StringIndexer` converts labels (like "benign" or "malignant") into numeric values.
   - `VectorAssembler` combines multiple columns into a single "features" vector.

2. **Stage 1 - Indexing**:
   - `indexer = StringIndexer(inputCol="class", outputCol="is-malignant")`:
     - Takes the "class" column (indicating benign or malignant tumors) and assigns it to a new column "is-malignant" as numerical labels (0 or 1).

3. **Stage 2 - Assemble Features**:
   - `assembler = VectorAssembler(inputCols=cols[2:-1], outputCol="features")`:
     - Takes feature columns (like cell size, thickness, etc.) and combines them into a single "features" column. This "features" column is what the model will analyze.

4. **Stage 3 - Model Setup**:
   - `logr = LogisticRegression(labelCol="is-malignant", featuresCol="features")`:
     - Initializes a Logistic Regression model that will take "features" as inputs and predict "is-malignant."

5. **Pipeline Creation**:
   - `cancerPipeline = Pipeline(stages=[indexer, assembler, logr])`:
     - Combines all three steps into a single pipeline so we can run all transformations and model training in one go.
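
One detail worth checking in Stage 1: by default, `StringIndexer` orders labels by descending frequency, so the most common value of `class` receives index `0.0`. The plain-Python sketch below mimics that rule; `fit_string_indexer` is a hypothetical helper and the labels are made up, since the actual values in `biopsy.csv` are not shown here.

```python
from collections import Counter

def fit_string_indexer(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], str(label)))
    return {label: float(i) for i, label in enumerate(ordered)}

# Invented labels for illustration only.
labels = ["benign", "malignant", "benign", "benign", "malignant"]
mapping = fit_string_indexer(labels)
indexed = [mapping[label] for label in labels]

print(mapping)  # {'benign': 0.0, 'malignant': 1.0}
print(indexed)  # [0.0, 1.0, 0.0, 0.0, 1.0]
```

If the malignant class happened to be the more frequent one, it would be indexed as `0.0`, and the column name `is-malignant` would be misleading. Inspecting the fitted indexer's labels is a cheap way to confirm the mapping.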

---

### **Step 3: Set Up Grid Search Parameters**

#### Code Block 1: Print Logistic Regression Parameters
```python
print(logr.explainParams())
```

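- `logr.explainParams()` prints all the configurable parameters for Logistic Regression along with their documentation and current values. Two of them are tuned in this exercise: `regParam` (the regularization strength; larger values penalize large coefficients more heavily, which can reduce overfitting) and `fitIntercept` (whether the model includes an intercept term).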
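
As a rough guide to what these two parameters control, the regularized objective can be sketched as below. This assumes labels encoded as $y_i \in \{-1, +1\}$ and follows the general form used in Spark's linear-methods documentation; treat it as an illustration rather than a statement of the exact implementation.

$$
\min_{w,\,b}\;\; \frac{1}{n}\sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\,(w^\top x_i + b)}\right) \;+\; \lambda\left(\alpha\,\lVert w\rVert_1 + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\right)
$$

Here $\lambda$ corresponds to `regParam` and $\alpha$ to `elasticNetParam`, which this notebook leaves at its default of `0.0` (a pure L2 penalty). Setting `regParam` to `0.0` removes the penalty entirely, and setting `fitIntercept` to `False` fixes $b = 0$.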

#### Code Block 2: Create Parameter Grid
```python
from pyspark.ml.tuning import ParamGridBuilder

cancerParamGrid = (ParamGridBuilder()
  .addGrid(logr.regParam, [0.0, 0.2, 0.8, 1.0])
  .addGrid(logr.fitIntercept, [True, False])
  .build())
```

1. **Initialize ParamGridBuilder**:
   - `ParamGridBuilder()` helps to build a grid of parameters for tuning the model.

2. **Define Hyperparameters to Test**:
   - `.addGrid(logr.regParam, [0.0, 0.2, 0.8, 1.0])`:
     - Adds a range of values to test for `regParam` (regularization parameter).
   - `.addGrid(logr.fitIntercept, [True, False])`:
     - Tests whether including an intercept (fitIntercept) improves performance.

3. **Build the Grid**:
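   - `.build()` returns the finished grid as a list of parameter maps, one per combination. With 4 values of `regParam` and 2 values of `fitIntercept`, that is 4 × 2 = 8 combinations.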
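
Conceptually, the grid is the Cartesian product of the value lists. The following plain-Python sketch builds the same eight combinations with `itertools.product`; the dictionary layout is an illustration, not the structure Spark returns.

```python
from itertools import product

reg_params = [0.0, 0.2, 0.8, 1.0]
fit_intercepts = [True, False]

# One dictionary per (regParam, fitIntercept) combination.
grid = [
    {"regParam": reg, "fitIntercept": intercept}
    for reg, intercept in product(reg_params, fit_intercepts)
]

print(len(grid))  # 8
print(grid[0])    # {'regParam': 0.0, 'fitIntercept': True}
```

Because the grid size is a product, each additional hyperparameter multiplies the number of models to train.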

---

### **Step 4: Perform 3-Fold Cross-Validation**

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

binaryEvaluator = BinaryClassificationEvaluator(
  labelCol = "is-malignant", 
  metricName = "areaUnderROC"
)

cancerCV = CrossValidator(
  estimator = cancerPipeline,         # Pipeline to evaluate (indexer, assembler, model)
  estimatorParamMaps = cancerParamGrid, # Parameter grid
  evaluator = binaryEvaluator,        # Metric to evaluate performance
  numFolds = 3,                       # 3-fold cross-validation
  seed = 42                           # Random seed
)

cancerCVModel = cancerCV.fit(trainCancerDF)
```

1. **Create an Evaluator**:
   - `BinaryClassificationEvaluator` is used to measure how well the model distinguishes between two classes (malignant vs. benign).
   - `metricName = "areaUnderROC"` selects the area under the ROC curve (AUC) as the metric. A value of 1.0 means the model ranks every malignant case above every benign one; 0.5 is no better than random guessing.

2. **Set Up Cross-Validation**:
   - `CrossValidator` trains multiple versions of the model using different parameter settings from `cancerParamGrid` and evaluates their performance using `binaryEvaluator`.
   - `numFolds = 3` splits the training data into 3 folds. Each parameter combination is trained 3 times, each time on two of the folds and evaluated on the remaining one, and the 3 scores are averaged.

3. **Train the Model with Cross-Validation**:
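   - `cancerCVModel = cancerCV.fit(trainCancerDF)` runs cross-validation on the training data, selects the parameter combination with the best average score, and refits on the full training set using that combination. The refitted model is available as `cancerCVModel.bestModel`.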
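
The fold mechanics can be sketched in plain Python. The `k_fold_indices` helper below is hypothetical and assigns rows to folds round-robin for simplicity, whereas Spark assigns them randomly using the seed.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) once per fold."""
    # Round-robin fold assignment: row i goes to fold i % k.
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, validation

splits = list(k_fold_indices(n_rows=9, k=3))
for train_idx, validation_idx in splits:
    print("train:", train_idx, "validate:", validation_idx)

# Cost of the search in this exercise: every combination is fitted once per fold,
# plus one final refit of the best combination on all the training data.
n_combinations = 8
n_folds = 3
total_fits = n_combinations * n_folds + 1
print(total_fits)  # 25
```

Every row is used for validation exactly once and for training in the other two rounds, which is what makes the averaged score a fairer estimate than a single split.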

---

### **Step 5: Examine the Results**

```python
for params, score in zip(cancerCVModel.getEstimatorParamMaps(), cancerCVModel.avgMetrics):
  print("".join([param.name+"\t"+str(params[param])+"\t" for param in params]))
  print("\tScore: {}".format(score))
```



1. **Get Parameter-Score Pairs**:
   - `cancerCVModel.getEstimatorParamMaps()` provides the list of parameter combinations tested.
   - `cancerCVModel.avgMetrics` provides the average score for each parameter combination.

2. **Print Parameters and Scores**:
   - `for params, score in zip(...):` loops through each parameter combination and its corresponding score.
   - `print("".join([...]))` and `print("\tScore: ...")` print each parameter combination along with its score to see which settings performed best. 

This final output helps identify the best parameter settings for predicting if a case is malignant or benign.
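
Rather than reading the printed list by eye, the best combination can be picked programmatically. The sketch below uses a hypothetical helper, `best_combination`, and placeholder scores that are invented purely to exercise it; they are not results from this dataset.

```python
def best_combination(param_maps, avg_metrics):
    """Return the (params, score) pair with the highest average metric."""
    return max(zip(param_maps, avg_metrics), key=lambda pair: pair[1])

# Placeholder inputs, invented for illustration only.
param_maps = [
    {"regParam": 0.0, "fitIntercept": True},
    {"regParam": 0.2, "fitIntercept": True},
    {"regParam": 0.8, "fitIntercept": False},
]
avg_metrics = [0.1, 0.3, 0.2]

best_params, best_score = best_combination(param_maps, avg_metrics)
print(best_params, best_score)  # {'regParam': 0.2, 'fitIntercept': True} 0.3
```

In the notebook, the same function could be called with `cancerCVModel.getEstimatorParamMaps()` and `cancerCVModel.avgMetrics`. Taking the maximum is appropriate here because a larger area under the ROC curve is better.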


---

### Setup: Getting the Environment Ready
The first line of code, `%run "./Includes/Classroom-Setup"`, is like preparing your workspace. It sets up the environment in a way that makes sure everything runs smoothly. Think of it as preparing all the tools you need before you start working.

---

### Step 1: Import the Data
1. **Load the Data**:
   - First, the biopsy data is loaded. Each row represents one sample, and each column is a different measurement of it (like cell size, thickness, etc.), which might help us predict whether the tumor is malignant.
   - This data is read into a `DataFrame`, a kind of table that organizes data nicely for us to work with.

2. **Clean the Data**:
   - We give each column a name to make it easier to understand.
   - One column, "bare-nuclei," sometimes has missing information. The code replaces this column with a simple flag: `1` if a value was recorded and `0` if it was missing.

3. **Split the Data**:
   - We divide the data into two parts:
      - **Training Data (80%)**: Used to teach the model how to make predictions.
      - **Testing Data (20%)**: Used to see how well the model learned.

---

### Step 2: Create a Pipeline
A **pipeline** is like an assembly line for building a model. It has three stages:

1. **Stage 1 - Label the Classes**:
   - We create a new column, "is-malignant," that labels whether each case is cancerous (malignant) or not. This step helps the model understand what it’s supposed to predict.

2. **Stage 2 - Assemble the Features**:
   - All the columns that describe the patient (like cell size and shape) are combined into one big list, called "features." The model will look at this list to find patterns.

3. **Stage 3 - Create the Model**:
   - We make a **logistic regression model**, which is a type of model that can be used to predict categories, like "cancer" or "no cancer."

4. **Put it All Together**:
   - All three steps are put into a pipeline called `cancerPipeline`, so we can run everything at once when we're ready.

---

### Step 3: Set Up Grid Search Parameters
**Grid search** helps find the best settings (or **hyperparameters**) for our model.

1. **Set Different Values to Test**:
   - Two important settings are tested here:
      - `regParam` (short for "regularization parameter"): Controls how flexible or strict the model is. We try values like `0.0`, `0.2`, `0.8`, and `1.0`.
      - `fitIntercept`: Decides if the model should use an intercept, like adding a starting point to the equation.
   - `ParamGridBuilder` builds every combination of these settings (4 × 2 = 8 in total). The cross-validation in Step 4 is what actually tests them.

---

### Step 4: Perform 3-Fold Cross-Validation
**Cross-validation** is like taking multiple "practice tests" to check how well the model might work in real life.

1. **Set Up a Test Evaluator**:
   - We use an evaluator to see how well each model version performs. The evaluator looks at how well the model separates malignant cases from benign ones, using a score called the **area under the ROC (Receiver Operating Characteristic) curve**. A score of 1.0 is perfect and 0.5 is no better than guessing.

2. **Run Cross-Validation**:
   - We test each combination of hyperparameters (from Step 3) with the training data. Cross-validation splits the data into 3 parts (or folds). In each round, the model is trained on two of the folds and checked on the one that was held out, so every fold serves as the "practice test" exactly once. The three scores are then averaged.
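
To give a feel for what the area under the ROC curve measures, here is a small plain-Python sketch. It uses the pairwise-ranking definition of the metric, which is not how Spark computes it internally; the function name and sample data are made up for illustration.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Invented labels (1 = malignant, 0 = benign) and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(area_under_roc(labels, scores))  # 0.75
```

Three of the four malignant/benign pairs are ranked correctly (only `0.35` versus `0.4` is wrong), which gives 0.75.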

---

### Step 5: Examine the Results
1. **Check Each Model's Score**:
   - For each combination of settings tested, we print out the parameters and their score. The scores help us see which settings produced the best model for making predictions. 

After scoring every combination, `CrossValidator` keeps the settings with the best average score and retrains the model on the full training data using them.