### Supplementary Lab 06 activities

#### Create a dataset with 7 inputs, 1 output, and 1,000 data points

In this dataset, 'Gender', 'Major', 'TypeOfSchool', 'Region', and 'Grade' are categorical; 'Age', 'StudyHours', and 'Participation' are numeric.

In [1]:
# Create a completely random dataset
import pandas as pd
import random

data = {
    "Age": [random.randint(18, 30) for _ in range(1000)],
    "Gender": [random.choice(["Male", "Female"]) for _ in range(1000)],
    "StudyHours": [random.randint(5, 25) for _ in range(1000)],
    "Participation": [random.randint(1, 10) for _ in range(1000)],
    "Major": [random.choice(["Computer Science", "Biology", "Business", "Literature", "Physics"]) for _ in range(1000)],
    "TypeOfSchool": [random.choice(["Public", "Private", "Online"]) for _ in range(1000)],
    "Region": [random.choice(["North", "South", "East", "West", "Central"]) for _ in range(1000)],
    "Grade": [random.choice(["Pass", "Fail"]) for _ in range(1000)]
}

df = pd.DataFrame(data)
df.to_csv("students_grades.csv", index=False)

In [2]:
# Create a dataset with stronger relationship between Inputs and Outputs
import pandas as pd
import random

def determine_grade(study_hours, participation, school_type):
    # Base probability of passing
    base_prob = 0.4  # Adjusted down to allow for larger swings based on criteria
    
    # Adjust the probability based on study hours
    if study_hours > 20:
        base_prob += 0.4
    elif study_hours > 15:
        base_prob += 0.3
    elif study_hours > 10:
        base_prob += 0.2
    else:  # study_hours <= 10
        base_prob -= 0.1
    
    # Adjust the probability based on participation
    # (a participation score of 6 leaves the probability unchanged)
    if participation > 8:
        base_prob += 0.3
    elif participation > 6:
        base_prob += 0.2
    elif participation <= 5:
        base_prob -= 0.2
    
    # Adjust the probability based on school type
    if school_type == "Private":
        base_prob += 0.2
    elif school_type == "Online":
        base_prob -= 0.2
    
    # Final decision
    return "Pass" if random.random() < base_prob else "Fail"

data = {
    "Age": [random.randint(18, 30) for _ in range(1000)],
    "Gender": [random.choice(["Male", "Female"]) for _ in range(1000)],
    "StudyHours": [random.randint(5, 25) for _ in range(1000)],
    "Participation": [random.randint(1, 10) for _ in range(1000)],
    "Major": [random.choice(["Computer Science", "Biology", "Business", "Literature", "Physics"]) for _ in range(1000)],
    "TypeOfSchool": [random.choice(["Public", "Private", "Online"]) for _ in range(1000)],
    "Region": [random.choice(["North", "South", "East", "West", "Central"]) for _ in range(1000)],
}
data["Grade"] = [determine_grade(data["StudyHours"][i], data["Participation"][i], data["TypeOfSchool"][i]) for i in range(1000)]

df = pd.DataFrame(data)
df.to_csv("students_grades.csv", index=False)
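To see how strongly the rules above tie the inputs to the output, the sketch below recomputes only the probability part of `determine_grade`, without the random draw. The helper name `pass_probability` is hypothetical and not part of the lab code; it simply mirrors the thresholds used in the cell above.

```python
def pass_probability(study_hours, participation, school_type):
    """Hypothetical helper: the value that determine_grade compares
    against random.random(), computed without any random draw."""
    prob = 0.4  # base probability, as in determine_grade

    # Study-hours adjustment
    if study_hours > 20:
        prob += 0.4
    elif study_hours > 15:
        prob += 0.3
    elif study_hours > 10:
        prob += 0.2
    else:
        prob -= 0.1

    # Participation adjustment (a score of 6 changes nothing)
    if participation > 8:
        prob += 0.3
    elif participation > 6:
        prob += 0.2
    elif participation <= 5:
        prob -= 0.2

    # School-type adjustment ("Public" changes nothing)
    if school_type == "Private":
        prob += 0.2
    elif school_type == "Online":
        prob -= 0.2

    return prob


# Three illustrative student profiles (made up for this sketch)
for profile in [(22, 9, "Private"), (12, 6, "Public"), (8, 3, "Online")]:
    print(profile, round(pass_probability(*profile), 2))
```

The first profile sums to about 1.3 and the last to about -0.1. Because `random.random()` returns a value in [0, 1), such students always pass or always fail, so the value behaves like a probability only while it stays between 0 and 1.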

In [3]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics
from sklearn.metrics import classification_report

In [4]:
# Initialize Spark Session
spark = SparkSession.builder.appName("SupervisedLearning").getOrCreate()

In [5]:
# Load the dataset
df = spark.read.csv('students_grades.csv', header=True, inferSchema=True)

In [6]:
# Handle missing data by deletion
df = df.dropna()

In [7]:
# Showing the type of each column
df.printSchema()

root
 |-- Age: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- StudyHours: integer (nullable = true)
 |-- Participation: integer (nullable = true)
 |-- Major: string (nullable = true)
 |-- TypeOfSchool: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Grade: string (nullable = true)



In machine learning, categorical data are generally encoded as numbers before running an ML algorithm.

#### What is Categorical Data?

- Categorical data are variables that contain label values rather than numeric values.
- The number of possible values is often limited to a fixed set.
- Categorical variables whose values have no natural order are called **nominal**; those whose values are ordered are called **ordinal**.

Some examples include:

- A "pet" variable with the values "dog" and "cat" (nominal).
- A "color" variable with the values "red", "green", and "blue" (nominal).
- A "place" variable with the values "first", "second", and "third" (ordinal).

#### What is the Problem with Categorical Data?
- Some algorithms can work with categorical data directly.
- Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

#### Solution: Convert Categorical Data to Numerical Data:

In [8]:
# List all categorical columns: four inputs plus the output column 'Grade'
categorical_cols = ['Gender', 'Major', 'TypeOfSchool', 'Region', 'Grade']

#### StringIndexer:

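`StringIndexer` is a PySpark ML feature transformer that converts a categorical string column in a DataFrame into a column of numerical indices. By default, the most frequent label gets index 0.0, the next most frequent gets 1.0, and so on.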
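The frequency-based indexing idea can be sketched in plain Python. This is only an illustration of the concept, not Spark's implementation; the function name `frequency_index` and the sample list are assumptions made for this example.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct label to a float index, most frequent label first.

    Ties are broken alphabetically in this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


grades = ["Pass", "Fail", "Pass", "Pass", "Fail"]  # made-up sample, not lab data
mapping = frequency_index(grades)
encoded = [mapping[g] for g in grades]

print(mapping)
print(encoded)
```

Because the index depends on label frequency, the same label can receive a different number on a different dataset, so the fitted mapping should be reused rather than refitted when encoding new data.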


#### Pipeline:
Pipeline is a tool from the PySpark ML library that allows for the chaining and structuring of multiple stages of data processing and/or modeling steps.

`stages=indexers` builds the pipeline from the `indexers` list. Each stage is a `StringIndexer` transformation that converts one categorical string column into numeric indices. Because every indexer in the cell that follows is already fitted with `.fit(df)`, `pipeline.fit(df)` only has to chain the fitted stages together.

In [9]:
indexers = [StringIndexer(inputCol=col, outputCol=col + "Numeric").fit(df) for col in categorical_cols]

pipeline = Pipeline(stages=indexers)
df_encoded = pipeline.fit(df).transform(df)

In [10]:
# Show the dataset
df_encoded.toPandas()

Unnamed: 0,Age,Gender,StudyHours,Participation,Major,TypeOfSchool,Region,Grade,GenderNumeric,MajorNumeric,TypeOfSchoolNumeric,RegionNumeric,GradeNumeric
0,18,Female,8,6,Literature,Public,North,Fail,0.0,0.0,2.0,2.0,1.0
1,26,Female,12,9,Literature,Public,West,Pass,0.0,0.0,2.0,0.0,0.0
2,25,Male,23,6,Literature,Private,North,Pass,1.0,0.0,0.0,2.0,0.0
3,24,Female,6,9,Computer Science,Online,West,Pass,0.0,2.0,1.0,0.0,0.0
4,21,Female,7,7,Physics,Private,West,Pass,0.0,4.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,21,Male,18,7,Computer Science,Public,East,Pass,1.0,2.0,2.0,4.0,0.0
996,25,Male,11,3,Physics,Online,East,Fail,1.0,4.0,1.0,4.0,1.0
997,25,Male,16,6,Business,Online,North,Fail,1.0,1.0,1.0,2.0,1.0
998,18,Female,11,9,Business,Online,East,Fail,0.0,1.0,1.0,4.0,1.0


### VectorAssembler

VectorAssembler is a transformer in PySpark's MLlib that combines a given list of columns into a **single vector** column. It is commonly used in the preprocessing stages of a machine learning pipeline to bring together features into one aggregate column, which is often a requirement for ML algorithms in Spark.

In [11]:
# Define feature columns and assemble them as a vector
assembler = VectorAssembler(
    inputCols=['Age', 'GenderNumeric', 'StudyHours', 'Participation', 'MajorNumeric', 'TypeOfSchoolNumeric', 'RegionNumeric'],
    outputCol='features')

df_assembled = assembler.transform(df_encoded)

Now all inputs (features) have been assembled into a single vector column named 'features'.
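Conceptually, assembling means placing the selected column values side by side in a fixed order. The plain-Python sketch below illustrates this for the first row of the encoded table; the helper `assemble` is hypothetical and stands in for what `VectorAssembler` does on every row.

```python
# Column order copied from the inputCols list of the assembler above
input_cols = ["Age", "GenderNumeric", "StudyHours", "Participation",
              "MajorNumeric", "TypeOfSchoolNumeric", "RegionNumeric"]


def assemble(row, cols):
    """Hypothetical helper: concatenate the selected columns of one row
    into a single list of floats, in the order given by cols."""
    return [float(row[c]) for c in cols]


# First row of the encoded table shown earlier
row = {"Age": 18, "Gender": "Female", "StudyHours": 8, "Participation": 6,
       "GenderNumeric": 0.0, "MajorNumeric": 0.0,
       "TypeOfSchoolNumeric": 2.0, "RegionNumeric": 2.0}

features = assemble(row, input_cols)
print(features)
```

The position inside the vector is what the model later refers to: index 0 is Age, index 2 is StudyHours, index 3 is Participation, and so on.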

In [12]:
df_assembled.toPandas()

Unnamed: 0,Age,Gender,StudyHours,Participation,Major,TypeOfSchool,Region,Grade,GenderNumeric,MajorNumeric,TypeOfSchoolNumeric,RegionNumeric,GradeNumeric,features
0,18,Female,8,6,Literature,Public,North,Fail,0.0,0.0,2.0,2.0,1.0,"[18.0, 0.0, 8.0, 6.0, 0.0, 2.0, 2.0]"
1,26,Female,12,9,Literature,Public,West,Pass,0.0,0.0,2.0,0.0,0.0,"[26.0, 0.0, 12.0, 9.0, 0.0, 2.0, 0.0]"
2,25,Male,23,6,Literature,Private,North,Pass,1.0,0.0,0.0,2.0,0.0,"[25.0, 1.0, 23.0, 6.0, 0.0, 0.0, 2.0]"
3,24,Female,6,9,Computer Science,Online,West,Pass,0.0,2.0,1.0,0.0,0.0,"[24.0, 0.0, 6.0, 9.0, 2.0, 1.0, 0.0]"
4,21,Female,7,7,Physics,Private,West,Pass,0.0,4.0,0.0,0.0,0.0,"[21.0, 0.0, 7.0, 7.0, 4.0, 0.0, 0.0]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,21,Male,18,7,Computer Science,Public,East,Pass,1.0,2.0,2.0,4.0,0.0,"[21.0, 1.0, 18.0, 7.0, 2.0, 2.0, 4.0]"
996,25,Male,11,3,Physics,Online,East,Fail,1.0,4.0,1.0,4.0,1.0,"[25.0, 1.0, 11.0, 3.0, 4.0, 1.0, 4.0]"
997,25,Male,16,6,Business,Online,North,Fail,1.0,1.0,1.0,2.0,1.0,"[25.0, 1.0, 16.0, 6.0, 1.0, 1.0, 2.0]"
998,18,Female,11,9,Business,Online,East,Fail,0.0,1.0,1.0,4.0,1.0,"[18.0, 0.0, 11.0, 9.0, 1.0, 1.0, 4.0]"


From this point forward, we just need two columns:
1. **features** which includes all Inputs
2. **GradeNumeric** which is the Output of the model

In [13]:
# Filtering the Input and Output columns into a new dataframe
df_assembled_filtered = df_assembled.select("features", "GradeNumeric")

In [14]:
df_assembled_filtered.toPandas()

Unnamed: 0,features,GradeNumeric
0,"[18.0, 0.0, 8.0, 6.0, 0.0, 2.0, 2.0]",1.0
1,"[26.0, 0.0, 12.0, 9.0, 0.0, 2.0, 0.0]",0.0
2,"[25.0, 1.0, 23.0, 6.0, 0.0, 0.0, 2.0]",0.0
3,"[24.0, 0.0, 6.0, 9.0, 2.0, 1.0, 0.0]",0.0
4,"[21.0, 0.0, 7.0, 7.0, 4.0, 0.0, 0.0]",0.0
...,...,...
995,"[21.0, 1.0, 18.0, 7.0, 2.0, 2.0, 4.0]",0.0
996,"[25.0, 1.0, 11.0, 3.0, 4.0, 1.0, 4.0]",1.0
997,"[25.0, 1.0, 16.0, 6.0, 1.0, 1.0, 2.0]",1.0
998,"[18.0, 0.0, 11.0, 9.0, 1.0, 1.0, 4.0]",1.0


### Building the MODEL

In [15]:
# Train-Test split
train_data, test_data = df_assembled_filtered.randomSplit([0.8, 0.2])

In [16]:
# Train a Decision Tree model
dtc = DecisionTreeClassifier(featuresCol='features', labelCol="GradeNumeric")
model = dtc.fit(train_data)

### Prediction using the Trained Model

In [17]:
# Predictions using test_data
predictions = model.transform(test_data)

In [18]:
# "Raw prediction" for each possible label. The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).
predictions.toPandas()

Unnamed: 0,features,GradeNumeric,rawPrediction,probability,prediction
0,"(18.0, 0.0, 5.0, 7.0, 0.0, 0.0, 0.0)",0.0,"[213.0, 22.0]","[0.9063829787234042, 0.09361702127659574]",0.0
1,"[18.0, 0.0, 6.0, 5.0, 0.0, 2.0, 1.0]",0.0,"[3.0, 64.0]","[0.04477611940298507, 0.9552238805970149]",1.0
2,"[18.0, 0.0, 11.0, 2.0, 3.0, 2.0, 2.0]",0.0,"[88.0, 50.0]","[0.6376811594202898, 0.36231884057971014]",0.0
3,"[18.0, 0.0, 11.0, 9.0, 1.0, 1.0, 4.0]",1.0,"[32.0, 9.0]","[0.7804878048780488, 0.21951219512195122]",0.0
4,"[18.0, 0.0, 13.0, 8.0, 4.0, 2.0, 1.0]",0.0,"[213.0, 22.0]","[0.9063829787234042, 0.09361702127659574]",0.0
...,...,...,...,...,...
171,"[30.0, 1.0, 20.0, 5.0, 2.0, 2.0, 0.0]",0.0,"[88.0, 50.0]","[0.6376811594202898, 0.36231884057971014]",0.0
172,"[30.0, 1.0, 21.0, 3.0, 2.0, 0.0, 0.0]",1.0,"[62.0, 19.0]","[0.7654320987654321, 0.2345679012345679]",0.0
173,"[30.0, 1.0, 21.0, 8.0, 2.0, 0.0, 0.0]",0.0,"[213.0, 22.0]","[0.9063829787234042, 0.09361702127659574]",0.0
174,"[30.0, 1.0, 21.0, 9.0, 0.0, 0.0, 1.0]",0.0,"[213.0, 22.0]","[0.9063829787234042, 0.09361702127659574]",0.0


In [19]:
# Print Decision Tree rules
print(model.toDebugString)

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_c3227c165b06, depth=5, numNodes=39, numClasses=2, numFeatures=7
  If (feature 3 <= 6.5)
   If (feature 2 <= 10.5)
    If (feature 5 in {0.0})
     If (feature 1 in {1.0})
      If (feature 3 <= 4.5)
       Predict: 1.0
      Else (feature 3 > 4.5)
       Predict: 0.0
     Else (feature 1 not in {1.0})
      If (feature 6 in {0.0,2.0,3.0,4.0})
       Predict: 0.0
      Else (feature 6 not in {0.0,2.0,3.0,4.0})
       Predict: 1.0
    Else (feature 5 not in {0.0})
     If (feature 3 <= 5.5)
      Predict: 1.0
     Else (feature 3 > 5.5)
      If (feature 4 in {3.0})
       Predict: 0.0
      Else (feature 4 not in {3.0})
       Predict: 1.0
   Else (feature 2 > 10.5)
    If (feature 5 in {0.0,2.0})
     If (feature 2 <= 20.5)
      If (feature 3 <= 1.5)
       Predict: 1.0
      Else (feature 3 > 1.5)
       Predict: 0.0
     Else (feature 2 > 20.5)
      Predict: 0.0
    Else (feature 5 not in {0.0,2.0})
     If (feature 2 <= 1
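The rules refer to inputs by position ("feature 3") rather than by name. A small sketch, assuming the column order of the `inputCols` list used in the assembler, can substitute the names back in; the helper `name_features` is hypothetical.

```python
import re

# Same order as the inputCols list passed to VectorAssembler
feature_names = ["Age", "GenderNumeric", "StudyHours", "Participation",
                 "MajorNumeric", "TypeOfSchoolNumeric", "RegionNumeric"]


def name_features(debug_text, names):
    """Hypothetical helper: replace 'feature N' with the N-th column name."""
    return re.sub(r"feature (\d+)",
                  lambda m: names[int(m.group(1))],
                  debug_text)


# First two rules copied from the tree printed above
sample = "If (feature 3 <= 6.5)\n If (feature 2 <= 10.5)"
print(name_features(sample, feature_names))
```

In the notebook, `model.toDebugString` could be passed in place of `sample`. With names substituted, the top splits read as Participation, StudyHours, and TypeOfSchoolNumeric, which are the three inputs that `determine_grade` uses to generate the grade.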

### Evaluate the performance of a binary classification model

**BinaryClassificationEvaluator:** This is an evaluator for binary classification, which expects two input columns: **raw prediction** and **label**.

Parameters:

- `rawPredictionCol="rawPrediction"`: tells the evaluator to expect the column named "rawPrediction" in the dataset (here, `predictions`) to hold the raw prediction values from the model.
- `labelCol="GradeNumeric"`: tells the evaluator that the true labels for the binary classification task are in the "GradeNumeric" column of the dataset.

`evaluate()` method:

- `evaluator.evaluate(predictions)`: this is where the actual evaluation happens. The `evaluate()` method computes the metric (Area Under ROC, by default) for the predictions dataset using the true labels and raw predictions.

**Area Under ROC:**

The code calculates the Area Under the Receiver Operating Characteristic (ROC) curve, a metric used to evaluate binary classification models. The value of Area Under ROC (often abbreviated as AUC) ranges between 0 and 1. A value of 0.5 indicates no discriminative power (the model ranks examples no better than random guessing), a value of 1.0 indicates perfect classification, and a value below 0.5 means the model ranks examples worse than random guessing. A higher AUC indicates a better model.
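One way to build intuition for AUC is its ranking interpretation: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The plain-Python sketch below follows that definition; it is not how Spark computes the metric internally, and the function name and sample data are assumptions.

```python
def auc_by_pairs(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores for the positive class, not taken from the lab output
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
auc = auc_by_pairs(scores, labels)
print(auc)
```

Here three of the four positive/negative pairs are ordered correctly, so the sketch yields 0.75. A model that gives every example the same score would get 0.5.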

In [20]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="GradeNumeric")
area_under_roc = evaluator.evaluate(predictions)
print("Area Under ROC:", area_under_roc)

Area Under ROC: 0.4803830296581992


Spark's RDD-based `MulticlassMetrics` class (imported from `pyspark.mllib.evaluation`) expects the predictions and actual labels in a specific format: a **Resilient Distributed Dataset (RDD)** of (prediction, label) tuples.

The next cell therefore converts the `predictions` DataFrame, which holds both the predicted and the actual values, into such an RDD. Each tuple contains two elements: the **predicted value** first and the **actual label** second.

In [21]:
# Convert the 'predictions' DataFrame to an RDD of (prediction, label) tuples
prediction_and_label = predictions.select("prediction", "GradeNumeric").rdd.map(lambda row: (float(row["prediction"]), float(row["GradeNumeric"])))
prediction_and_label

PythonRDD[125] at RDD at PythonRDD.scala:53

In [22]:
import sys
print(sys.executable)
print(sys.version)

C:\Users\yubar\anaconda3\envs\spark_env\python.exe
3.10.19 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 16:41:31) [MSC v.1929 64 bit (AMD64)]


In [23]:
# Use 'collect' to bring the RDD's contents to the driver and print them
for pred, label in prediction_and_label.collect():
    print(f"Prediction: {pred}, Actual Label: {label}")

Prediction: 0.0, Actual Label: 0.0
Prediction: 1.0, Actual Label: 0.0
Prediction: 0.0, Actual Label: 0.0
Prediction: 0.0, Actual Label: 1.0
Prediction: 0.0, Actual Label: 0.0
Prediction: 1.0, Actual Label: 1.0
Prediction: 0.0, Actual Label: 0.0
Prediction: 0.0, Actual Label: 0.0
Prediction: 0.0, Actual Label: 1.0
Prediction: 0.0, Actual Label: 0.0
Prediction: 0.0, Actual Label: 1.0
Prediction: 0.0, Actual Label: 0.0
Prediction: 0.0, Actual Label: 0.0
Prediction: 1.0, Actual Label: 1.0
Prediction: 1.0, Actual Label: 1.0
Prediction: 1.0, Actual Label: 1.0
Prediction: 0.0, Actual Label: 0.0
Prediction: 0.0, Actual Label: 0.0
Prediction: 1.0, Actual Label: 0.0
Prediction: 0.0, Actual Label: 0.0
Prediction: 0.0, Actual Label: 0.0
Prediction: 0.0, Actual Label: 0.0
Prediction: 1.0, Actual Label: 1.0
Prediction: 0.0, Actual Label: 0.0
Prediction: 0.0, Actual Label: 0.0
Prediction: 0.0, Actual Label: 1.0
Prediction: 0.0, Actual Label: 1.0
Prediction: 0.0, Actual Label: 0.0
Prediction: 0.0, Act

### Confusion Matrix

A confusion matrix for a binary classifier has four cells, where:

- **TN (True Negative):** The number of actual negatives (0s) that were correctly predicted as negatives by the model.
- **FP (False Positive):** The number of actual negatives (0s) that were incorrectly predicted as positives (1s) by the model.
- **FN (False Negative):** The number of actual positives (1s) that were incorrectly predicted as negatives (0s) by the model.
- **TP (True Positive):** The number of actual positives (1s) that were correctly predicted as positives by the model.


###### Interpretation:

**High values of TP and TN, along with low values of FP and FN, generally indicate a good model.**
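The four counts can be tallied directly from (prediction, label) pairs. The sketch below is a plain-Python illustration with made-up pairs; the function name `confusion_counts` is hypothetical, and 1.0 is treated as the positive class. In this lab the encoded table shows "Fail" mapped to 1.0, so "positive" here means a failing grade.

```python
def confusion_counts(pairs):
    """Count TN, FP, FN, TP from (prediction, label) pairs,
    treating 1.0 as the positive class and 0.0 as the negative class."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for prediction, label in pairs:
        if label == 0.0:
            counts["TN" if prediction == 0.0 else "FP"] += 1
        else:
            counts["TP" if prediction == 1.0 else "FN"] += 1
    return counts


# Made-up (prediction, label) pairs for illustration
pairs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]
c = confusion_counts(pairs)

# Rows are actual labels, columns are predicted labels
matrix = [[c["TN"], c["FP"]],
          [c["FN"], c["TP"]]]
print(matrix)
```

The layout with actual labels in rows and predicted labels in columns matches the one used by `MulticlassMetrics.confusionMatrix()`.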

In [24]:
# Create a MulticlassMetrics object to compute the confusion matrix
metrics = MulticlassMetrics(prediction_and_label)
confusion_matrix = metrics.confusionMatrix()



In [25]:
# Print the confusion matrix (rows: actual labels, columns: predicted labels)
print("Confusion Matrix:")
print(confusion_matrix)

Confusion Matrix:
DenseMatrix([[90., 13.],
             [32., 41.]])


### Using the scikit-learn package to get a detailed classification report

In [26]:
# Convert 'predictions' DataFrame to a Pandas DataFrame
predictions_pd = predictions.select("prediction", "GradeNumeric").toPandas()

In [27]:
# Calculate the classification report (true labels first, then predictions)
report = classification_report(predictions_pd["GradeNumeric"], predictions_pd["prediction"])
print("Classification Report:")
print(report)

Classification Report:
              precision    recall  f1-score   support

         0.0       0.74      0.87      0.80       103
         1.0       0.76      0.56      0.65        73

    accuracy                           0.74       176
   macro avg       0.75      0.72      0.72       176
weighted avg       0.75      0.74      0.74       176
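The per-class numbers in this report can be recomputed by hand from the confusion matrix printed earlier. The sketch below is one way to do that, assuming rows are actual labels and columns are predicted labels; the helper name `per_class_metrics` is hypothetical.

```python
# Confusion matrix printed earlier: rows = actual label, columns = predicted label
matrix = [[90, 13],
          [32, 41]]


def per_class_metrics(matrix, cls):
    """Precision, recall, F1 and support for one class of a square confusion matrix."""
    tp = matrix[cls][cls]
    predicted_as_cls = sum(row[cls] for row in matrix)  # column total
    support = sum(matrix[cls])                          # row total
    precision = tp / predicted_as_cls
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support


for cls in (0, 1):
    p, r, f1, n = per_class_metrics(matrix, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={n}")

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total
print(f"accuracy={accuracy:.2f} over {total} rows")
```

Precision divides the correct predictions for a class by everything predicted as that class (a column total), while recall divides them by everything that actually belongs to the class (a row total, which the report calls support).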



#### As homework, please investigate the meaning of these metrics.