# DSC 232R: Week 7 Discussion
## Ensemble Methods: Boosting, Margins, and Applications
Today we will cover the core theoretical concepts behind Boosting, including how it compares to other models, why it resists overfitting, and its real-world applications. Afterwards, we will jump into PySpark code to implement these concepts.

### 1. Generative vs. Discriminative Models
* **Generative Models:** The goal is to explain how the data is generated. Generative models are more accurate when their modeling assumptions are correct.
* **Discriminative Models:** The goal is to predict a property of the data (such as a label). Discriminative models are more robust to poor modeling assumptions and outliers.



### 2. Introduction to Boosting and Weak Learners
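* **Boosting Concept:** AdaBoost is an algorithm for combining weak learners. The resulting weighted majority rule is more accurate than any of the individual weak rules.
* **Focus on Hard Examples:** Intuitively, boosting concentrates on the hard examples: each round gives more weight to the examples the current ensemble gets wrong.
* **Weak Learners:** Almost anything can be a weak learner, as long as it does slightly better than random guessing on the weighted training set.

Popular choices: boosted decision trees, boosted decision stumps, and alternating decision trees.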
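To make the "weighted majority of weak rules" idea concrete, here is a minimal AdaBoost sketch in plain Python that uses one-feature decision stumps as weak learners. The helper names (`stump_predict`, `best_stump`, `adaboost`, `predict`, `training_errors`) and the ten-point toy dataset are assumptions made for illustration, not part of any library or of the lecture code.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump on one scalar feature: vote `polarity` if x > threshold, else -polarity."""
    return polarity if x > threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) for the stump with the lowest weighted error."""
    values = sorted(set(xs))
    # Candidate thresholds: one below all points, plus midpoints between neighbours.
    thresholds = [values[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Train `rounds` stumps; return a list of (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, t, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, t, polarity))
        # Up-weight mistakes, down-weight correct examples, then renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score > 0 else -1

def training_errors(ensemble, xs, ys):
    """Count training examples the ensemble misclassifies."""
    return sum(1 for x, y in zip(xs, ys) if predict(ensemble, x) != y)

# Toy 1-D data (hypothetical) that no single stump can classify perfectly.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
for rounds in (1, 3):
    print(rounds, "round(s):", training_errors(adaboost(xs, ys, rounds), xs, ys), "training errors")
```

On this toy set the best single stump misclassifies three of the ten points, while the weighted vote of three stumps classifies all of them correctly.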



### 3. AdaBoost and Gradient Descent

* **AdaBoost vs. LogitBoost:**
  * AdaBoost: Performs well when the achievable error rate is close to zero (the almost-consistent case). Misclassified examples (those with negative margins) get very large weights, so AdaBoost can overfit noisy data.
  * LogitBoost: Inferior to AdaBoost when the achievable error rate is close to zero, but often better when the achievable error rate is high. The weight on any example is never larger than 1.


* **Loss Functions:** AdaBoost minimizes the exponential loss, which is an upper bound on the 0-1 (classification) loss.
  $$L_{AdaBoost} = e^{-y \cdot f(x)}$$
* **LogitBoost Loss:** Uses a logistic loss function, which grows linearly rather than exponentially for large negative margins, making it more robust to outliers.
  $$L_{LogitBoost} = \log(1 + e^{-y \cdot f(x)})$$
* **Gradient Descent:** AdaBoost finds the weight vector $\alpha$ through coordinate-wise gradient descent, adding weak rules one at a time to minimize the exponential loss.
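The difference between the two losses is easiest to see numerically. The sketch below evaluates both loss formulas at a few margins and also prints a per-example weight, taken here (as an assumption) to be the magnitude of the loss derivative with respect to the margin: $e^{-m}$ for the exponential loss and $1/(1+e^{m})$ for the logistic loss. The function names are illustrative only.

```python
import math

def exp_loss(margin):
    """Exponential loss e^{-m}, where m = y * f(x)."""
    return math.exp(-margin)

def logistic_loss(margin):
    """Logistic loss log(1 + e^{-m})."""
    return math.log(1.0 + math.exp(-margin))

def adaboost_weight(margin):
    """|d/dm exp_loss| = e^{-m}: unbounded as the margin becomes very negative."""
    return math.exp(-margin)

def logitboost_weight(margin):
    """|d/dm logistic_loss| = 1 / (1 + e^{m}): always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(margin))

print("margin  exp_loss  logistic_loss  ada_weight  logit_weight")
for m in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"{m:6.1f}  {exp_loss(m):8.3f}  {logistic_loss(m):13.3f}  "
          f"{adaboost_weight(m):10.3f}  {logitboost_weight(m):12.3f}")
```

A badly misclassified point (large negative margin) dominates the AdaBoost weights, while its LogitBoost weight saturates just below 1. This is the mechanism behind the robustness claim above.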



### 4. Over-fitting, Margins, and The Bias/Variance Tradeoff


* **Sources of Error:**
  * *Model Bias:* error resulting from the inability of the model class to represent the true distribution.
  * *Data Variation:* error resulting from the difference between the training set and the true distribution.
* **Bagging vs. Boosting:**
  * Bagging decreases data variation without significantly increasing model bias.
  * Boosting reduces both variation and model bias.


* **The Definition of Margin:** In binary classification where $y \in \{-1, +1\}$, the margin is defined as $y \cdot f(x)$. A positive margin means a correct prediction; a negative margin means a mistake.
* **Resistance to Over-fitting:** Boosting tends to increase the margins on the training examples. Even after reaching zero training error, further rounds keep widening the margins, so small perturbations of an example are less likely to flip its prediction.
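As a small illustration of the margin definition, the sketch below computes margins for a voted classifier $f(x) = \sum_t \alpha_t h_t(x)$. Dividing by $\sum_t |\alpha_t|$ so that margins fall in $[-1, 1]$ is an added assumption (a common normalization), and the weights, votes, and labels are made-up values.

```python
def normalized_margins(alphas, votes, labels):
    """Margin y * f(x) for each example, scaled to [-1, 1] by the total vote weight."""
    total = sum(abs(a) for a in alphas)
    margins = []
    for example_votes, y in zip(votes, labels):
        score = sum(a * h for a, h in zip(alphas, example_votes))
        margins.append(y * score / total)
    return margins

def fraction_at_or_below(margins, theta):
    """Fraction of examples whose margin is <= theta (theta = 0 gives the error rate)."""
    return sum(1 for m in margins if m <= theta) / len(margins)

# Hypothetical ensemble of three weak rules and four labelled examples.
alphas = [0.5, 0.3, 0.2]
votes = [
    [1, 1, 1],     # all rules agree with the label: large margin
    [1, -1, 1],    # one rule disagrees: smaller margin
    [-1, 1, -1],   # correct, but with one dissenting rule
    [1, -1, 1],    # the weighted vote is wrong: negative margin
]
labels = [1, 1, -1, -1]

margins = normalized_margins(alphas, votes, labels)
print("margins:", [round(m, 2) for m in margins])
print("training error:", fraction_at_or_below(margins, 0.0))
print("fraction with margin <= 0.5:", fraction_at_or_below(margins, 0.5))
```

Two ensembles with the same training error can have very different margin distributions. The margin-based explanation of boosting looks at the fraction of examples below a threshold like 0.5, not just the fraction below 0.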

### 5. Applications of Boosting
* **Voice Request Classification:** Classify voice requests through a pipeline: voice -> text -> category.
* **Face Detection**



### 6. PySpark Implementation: Boosting vs. Bagging
In this section, we implement Bagging (Random Forest) and Boosting (Gradient-Boosted Trees) using PySpark MLlib, allowing us to observe these concepts in a distributed Big Data context.

```python
# Setup PySpark session for the discussion
from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassifier, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
import random

# Initialize Spark Session
spark = SparkSession.builder \
    .appName('DSC232R_Week7_Boosting_vs_Bagging') \
    .getOrCreate()

# Generate a larger synthetic dataset with ~15% inherent noise
random.seed(42)
data = []
for _ in range(500):
    f1 = random.uniform(0, 10)
    f2 = random.uniform(0, 10)

    # Base decision boundary: if f1 + f2 > 10, label is 1, else 0
    label = 1 if (f1 + f2) > 10 else 0

    # Introduce noise: flip the true label 15% of the time to simulate data variation
    if random.random() < 0.15:
        label = 1 - label

    data.append((label, f1, f2))

columns = ['label', 'feature1', 'feature2']
df = spark.createDataFrame(data, columns)

# Assemble features into a vector
assembler = VectorAssembler(inputCols=['feature1', 'feature2'], outputCol='features')
df_assembled = assembler.transform(df)

# Split data into training and test sets
train_df, test_df = df_assembled.randomSplit([0.75, 0.25], seed=42)
print(f"Training rows: {train_df.count()}, Testing rows: {test_df.count()}")
train_df.show(5)
```

```
Training rows: 387, Testing rows: 113
+-----+--------------------+------------------+--------------------+
|label|            feature1|          feature2|            features|
+-----+--------------------+------------------+--------------------+
|    0|0.026757226028857328| 4.505037046177024|[0.02675722602885...|
|    0| 0.09669699608339966|0.7524386007376704|[0.09669699608339...|
|    0|  0.2366443470145152|1.9312978832770866|[0.23664434701451...|
|    0| 0.24786361898188725| 7.365644717550821|[0.24786361898188...|
|    0|  0.2567842549746602| 3.119572443946952|[0.25678425497466...|
+-----+--------------------+------------------+--------------------+
only showing top 5 rows
```


```python
# 1. Bagging: Random Forest Classifier
# Bagging decreases data variation (variance) without significantly increasing model bias.
# It handles the 15% noise well by taking a majority vote across multiple randomized trees.
rf = RandomForestClassifier(featuresCol='features', labelCol='label', numTrees=20, maxDepth=5, seed=42)
rf_model = rf.fit(train_df)
rf_predictions = rf_model.transform(test_df)

evaluator = MulticlassClassificationEvaluator(labelCol='label', predictionCol='prediction', metricName='accuracy')
rf_accuracy = evaluator.evaluate(rf_predictions)
print(f'Random Forest (Bagging) Accuracy: {rf_accuracy:.2f}')

# 2. Boosting: Gradient-Boosted Trees (GBT)
# Boosting reduces both variation and model bias by iteratively focusing on hard examples.
# However, if the data is highly noisy, it can sometimes overfit by trying too hard to correct the unavoidable noise.
# Note: Spark's GBTClassifier minimizes the logistic loss (lossType='logistic'), not AdaBoost's exponential loss.
gbt = GBTClassifier(featuresCol='features', labelCol='label', maxIter=20, maxDepth=3, seed=42)
gbt_model = gbt.fit(train_df)
gbt_predictions = gbt_model.transform(test_df)

gbt_accuracy = evaluator.evaluate(gbt_predictions)
print(f'Gradient-Boosted Trees (Boosting) Accuracy: {gbt_accuracy:.2f}')
```

```
Random Forest (Bagging) Accuracy: 0.78
Gradient-Boosted Trees (Boosting) Accuracy: 0.75
```


### Discussion: Why did Random Forest beat Gradient-Boosted Trees?

You might be surprised to see that Random Forest (bagging) achieved a higher accuracy (0.78) than Gradient-Boosted Trees (0.75). Since boosting is often considered the more powerful technique because it reduces both bias and variance, why did it lose here?

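A likely explanation is the **15% random label noise** we injected into the dataset. Keep in mind that on a 113-row test set a 0.03 gap is only about three examples, so treat this run as an illustration rather than proof.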

1. **Bagging's Strength:** Random Forest builds its trees independently on bootstrap samples. Taking a majority vote averages out much of the effect of the flipped labels, making the model robust to messy data.
2. **Boosting's Weakness:** Boosting algorithms (like GBT and AdaBoost) are designed to keep focusing on the examples the current ensemble gets wrong, by increasing their weights (AdaBoost) or fitting their residual errors (GBT). When a dataset has inherent noise, the algorithm treats the mislabeled points as merely hard-to-learn examples and spends later rounds fitting them (points with negative margins), which can lead to **overfitting**.

**Key Takeaway:** While boosting pushes for larger margins and typically yields highly accurate models on clean data, it can be sensitive to label noise. If your achievable error rate is high due to messy data, bagging or a boosting variant with a more noise-tolerant loss function (like LogitBoost) is often the safer choice.
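As a rough sanity check on how much weight the 0.78 vs. 0.75 comparison can bear, the sketch below estimates the sampling uncertainty of an accuracy measured on 113 test rows. It uses a simple binomial standard error, which is an assumption: both models are scored on the same test set, so their errors are correlated and this is only a ballpark figure. The helper name `accuracy_standard_error` is illustrative.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy estimated from n independent test examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

n_test = 113                      # test rows reported by the notebook
rf_acc, gbt_acc = 0.78, 0.75      # accuracies reported by the notebook

gap = rf_acc - gbt_acc
se_rf = accuracy_standard_error(rf_acc, n_test)
se_gbt = accuracy_standard_error(gbt_acc, n_test)

print(f"gap in accuracy:        {gap:.3f} (about {round(gap * n_test)} test examples)")
print(f"std. error (RF):        {se_rf:.3f}")
print(f"std. error (GBT):       {se_gbt:.3f}")
```

Because the gap is smaller than one standard error of either accuracy, repeating the experiment with several seeds or a larger test set would give a more reliable comparison.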