# Question 4.1: PySpark Random Forest Implementation

This notebook ports the pandas Random Forest implementation to PySpark, using only PySpark APIs for data processing and MLlib for machine learning.

## Overview
- Load data from Google Cloud Storage
- Join datasets using PySpark DataFrame operations
- Perform feature engineering with PySpark
- Train Random Forest using PySpark MLlib
- Evaluate model and save to disk

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
# Import only what is used; a wildcard import from pyspark.sql.functions
# would shadow Python builtins such as max() and sum()
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder, Imputer, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
import numpy as np

In [None]:
# Initialize Spark Session
spark = SparkSession.builder \
    .appName("RandomForestPySpark") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()

print(f"Spark version: {spark.version}")
print(f"Spark context: {spark.sparkContext}")

## Data Loading
Load the same datasets from Google Cloud Storage as in the original implementation.

In [None]:
# Download data files (same as original implementation)
import subprocess

# Download external sources
subprocess.run([
    "wget", 
    "https://storage.googleapis.com/bdt-spark-store/external_sources.csv", 
    "-O", "gcs_external_sources.csv"
], check=True)

# Download internal data
subprocess.run([
    "wget", 
    "https://storage.googleapis.com/bdt-spark-store/internal_data.csv", 
    "-O", "gcs_internal_data.csv"
], check=True)

print("Data files downloaded successfully")

In [None]:
# Load data using PySpark
df_data = spark.read.csv("gcs_internal_data.csv", header=True, inferSchema=True)
df_ext = spark.read.csv("gcs_external_sources.csv", header=True, inferSchema=True)

print(f"Internal data shape: {df_data.count()} rows, {len(df_data.columns)} columns")
print(f"External data shape: {df_ext.count()} rows, {len(df_ext.columns)} columns")

# Show schema
print("\nInternal data schema:")
df_data.printSchema()
print("\nExternal data schema:")
df_ext.printSchema()

## Data Joining
Join the datasets on their common identifier key using PySpark DataFrame operations.

In [None]:
# Join datasets on SK_ID_CURR (equivalent to pandas merge)
df_full = df_data.join(df_ext, on="SK_ID_CURR", how="inner")

print(f"Joined data shape: {df_full.count()} rows, {len(df_full.columns)} columns")

# Show first few rows
df_full.show(5, truncate=False)

## Feature Selection
Select the same features as in the original implementation.

In [None]:
# Select the same columns as in the original implementation
columns_extract = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
                  'DAYS_BIRTH', 'DAYS_EMPLOYED', 'NAME_EDUCATION_TYPE',
                  'DAYS_ID_PUBLISH', 'CODE_GENDER', 'AMT_ANNUITY',
                  'DAYS_REGISTRATION', 'AMT_GOODS_PRICE', 'AMT_CREDIT',
                  'ORGANIZATION_TYPE', 'DAYS_LAST_PHONE_CHANGE',
                  'NAME_INCOME_TYPE', 'AMT_INCOME_TOTAL', 'OWN_CAR_AGE', 'TARGET']

df = df_full.select(*columns_extract)

print(f"Selected features shape: {df.count()} rows, {len(df.columns)} columns")
df.show(3, truncate=False)

## Train-Test Split
Split the data into training and testing sets using PySpark's randomSplit.
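
`randomSplit` assigns each row to a split independently at random, so the resulting sizes are only approximately 80/20 rather than exact. The plain-Python sketch below mimics that idea with a hypothetical `bernoulli_split` helper; it illustrates the behaviour and is not Spark's actual implementation.

```python
import random

def bernoulli_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    total = sum(weights)
    # Cumulative, normalized boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split catches any draw not claimed by an earlier one
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = bernoulli_split(range(10_000), [0.8, 0.2], seed=101)
print(len(train), len(test))  # roughly 8000 and 2000; exact counts depend on the seed
```

Fixing the seed makes the split repeatable, but the row counts should still be checked rather than assumed.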

In [None]:
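# Checkpoint directory for Spark (unrelated to seeding; not strictly needed for this split)
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoint")

# Split data 80/20 for train/test (matching the original 0.8 split).
# The seed passed to randomSplit plays the role of np.random.RandomState(101):
# it makes the split repeatable, but it will not select the same rows as the pandas version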
train_df, test_df = df.randomSplit([0.8, 0.2], seed=101)

print(f"Training set: {train_df.count()} rows")
print(f"Test set: {test_df.count()} rows")

# Check target distribution
print("\nTraining set target distribution:")
train_df.groupBy("TARGET").count().show()

print("Test set target distribution:")
test_df.groupBy("TARGET").count().show()

## Feature Engineering Pipeline
Create a PySpark ML Pipeline for preprocessing steps including categorical encoding, imputation, and scaling.
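
A `Pipeline` is fitted on the training set only, so statistics such as imputation medians are learned from training rows and then reused unchanged on the test set. The sketch below illustrates that fit/transform separation in plain Python; `fit_median` and `impute` are hypothetical helpers, not MLlib APIs.

```python
from statistics import median

def fit_median(values):
    """Learn the median from the non-missing training values."""
    return median(v for v in values if v is not None)

def impute(values, fill):
    """Replace missing entries with the previously learned value."""
    return [fill if v is None else v for v in values]

train_col = [3.0, None, 1.0, 7.0]
test_col = [None, 10.0]

fill_value = fit_median(train_col)    # learned from training data only
print(impute(train_col, fill_value))
print(impute(test_col, fill_value))   # test rows reuse the training median
```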

In [None]:
# Identify categorical and numerical columns
categorical_cols = ['NAME_EDUCATION_TYPE', 'CODE_GENDER', 'ORGANIZATION_TYPE', 'NAME_INCOME_TYPE']
numerical_cols = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
                 'DAYS_ID_PUBLISH', 'AMT_ANNUITY', 'DAYS_REGISTRATION', 'AMT_GOODS_PRICE', 
                 'AMT_CREDIT', 'DAYS_LAST_PHONE_CHANGE', 'AMT_INCOME_TOTAL', 'OWN_CAR_AGE']

print(f"Categorical columns: {categorical_cols}")
print(f"Numerical columns: {numerical_cols}")

In [None]:
# Create preprocessing pipeline stages
stages = []

# String indexing for categorical variables
indexers = [StringIndexer(inputCol=col, outputCol=col + "_indexed", handleInvalid="keep") 
           for col in categorical_cols]
stages.extend(indexers)

# One-hot encoding for categorical variables (analogous to pd.get_dummies;
# note that OneHotEncoder drops the last category by default, dropLast=True)
encoders = [OneHotEncoder(inputCol=col + "_indexed", outputCol=col + "_encoded") 
           for col in categorical_cols]
stages.extend(encoders)

# Imputation for numerical columns (median strategy)
imputer = Imputer(inputCols=numerical_cols, 
                 outputCols=[col + "_imputed" for col in numerical_cols],
                 strategy="median")
stages.append(imputer)

# Prepare feature columns for vector assembler
encoded_categorical_cols = [col + "_encoded" for col in categorical_cols]
imputed_numerical_cols = [col + "_imputed" for col in numerical_cols]
feature_cols = encoded_categorical_cols + imputed_numerical_cols

# Vector assembler to combine all features
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features_unscaled")
stages.append(assembler)

# Standard scaling (analogous to sklearn's StandardScaler; tree ensembles do not
# require it, and withMean=True turns sparse one-hot vectors into dense ones)
scaler = StandardScaler(inputCol="features_unscaled", outputCol="features", 
                       withStd=True, withMean=True)
stages.append(scaler)

print(f"Created preprocessing pipeline with {len(stages)} stages")

## Model Training
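Train the Random Forest model using PySpark MLlib, mirroring the original parameters where MLlib has a counterpart (`numTrees=100`, `seed=50`). `maxDepth=10` is set explicitly here, so results need not match the original exactly.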

In [None]:
# Random Forest Classifier (equivalent to sklearn RandomForestClassifier)
rf = RandomForestClassifier(
    featuresCol="features",
    labelCol="TARGET",
    numTrees=100,  # equivalent to n_estimators=100
    seed=50,       # equivalent to random_state=50
    maxDepth=10,   # explicit choice; MLlib's default is 5 and the maximum is 30
    minInstancesPerNode=1
)

# Add Random Forest to pipeline
stages.append(rf)

# Create and fit the complete pipeline
pipeline = Pipeline(stages=stages)

print("Training the model...")
model = pipeline.fit(train_df)
print("Model training completed!")

## Model Evaluation
Make predictions and calculate accuracy metrics.

In [None]:
# Make predictions on test set
predictions = model.transform(test_df)

# Show predictions
predictions.select("TARGET", "prediction", "probability").show(10)

# Calculate accuracy (equivalent to sklearn accuracy_score)
evaluator = MulticlassClassificationEvaluator(
    labelCol="TARGET", 
    predictionCol="prediction", 
    metricName="accuracy"
)

accuracy = evaluator.evaluate(predictions)
print(f"\nAccuracy: {accuracy:.10f}")

# Additional metrics (MLlib's "f1" is the F1 score weighted across classes, not the class-1 F1)
f1_evaluator = MulticlassClassificationEvaluator(
    labelCol="TARGET", 
    predictionCol="prediction", 
    metricName="f1"
)
f1_score = f1_evaluator.evaluate(predictions)
print(f"F1 Score: {f1_score:.10f}")

# AUC for binary classification
auc_evaluator = BinaryClassificationEvaluator(
    labelCol="TARGET", 
    rawPredictionCol="rawPrediction", 
    metricName="areaUnderROC"
)
auc = auc_evaluator.evaluate(predictions)
print(f"AUC: {auc:.10f}")

## Feature Importance
Extract and display feature importance from the trained Random Forest model.
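
Importances indexed by position are hard to read. `VectorAssembler` normally records per-slot names in the output column's `ml_attr` metadata, so one option is to read them back from the `features_unscaled` column (the scaled `features` column may not carry this metadata). The helper below is a hypothetical sketch that works on the metadata dictionary; in this notebook it could be fed `model.transform(test_df).schema["features_unscaled"].metadata`.

```python
def feature_names_from_metadata(metadata):
    """Return slot names ordered by vector index from an ml_attr metadata dict."""
    attrs = metadata.get("ml_attr", {}).get("attrs", {})
    indexed = []
    # Attributes are grouped by type (e.g. "numeric", "binary", "nominal")
    for group in attrs.values():
        for attr in group:
            indexed.append((attr["idx"], attr.get("name", f"slot_{attr['idx']}")))
    return [name for _, name in sorted(indexed)]

# Hand-written example mimicking the metadata layout (values are illustrative)
example_metadata = {
    "ml_attr": {
        "attrs": {
            "binary": [
                {"idx": 0, "name": "CODE_GENDER_encoded_F"},
                {"idx": 1, "name": "CODE_GENDER_encoded_M"},
            ],
            "numeric": [{"idx": 2, "name": "EXT_SOURCE_1_imputed"}],
        },
        "num_attrs": 3,
    }
}

names = feature_names_from_metadata(example_metadata)
importances = [0.10, 0.05, 0.85]  # made-up values for illustration
for name, score in sorted(zip(names, importances), key=lambda p: -p[1]):
    print(f"{name}: {score:.2f}")
```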

In [None]:
# Extract the Random Forest model from the pipeline
rf_model = model.stages[-1]

# Get feature importances
feature_importances = rf_model.featureImportances.toArray()

# Create feature importance DataFrame
# Note: only vector indices are shown; each index is a slot in the assembled
# feature vector (one-hot slots first, then the imputed numerical columns)
importance_data = list(zip(range(len(feature_importances)), feature_importances))
importance_df = spark.createDataFrame(importance_data, ["feature_index", "importance"])

# Sort by importance and show top features
importance_df.orderBy(col("importance").desc()).show(20)

print(f"\nTotal number of features after preprocessing: {len(feature_importances)}")

## Model Persistence
Save the trained model to disk for future use.

In [None]:
# Save the complete pipeline model
model_path = "./pyspark_random_forest_model"
model.write().overwrite().save(model_path)
print(f"Model saved to: {model_path}")

# Also save just the Random Forest model
rf_model_path = "./pyspark_rf_only_model"
rf_model.write().overwrite().save(rf_model_path)
print(f"Random Forest model saved to: {rf_model_path}")

# Question 4.2: Accuracy Metric Analysis

## Is accuracy a good choice of metric for this problem?

**Answer: No, accuracy is not the best metric for this problem.**

**Reasoning:**

1. **Class Imbalance**: The dataset shows a severe class imbalance with approximately:
   - Class 0 (no default): ~91.9%
   - Class 1 (default): ~8.1%

2. **Accuracy Paradox**: With such imbalance, a naive classifier that always predicts class 0 would achieve ~91.9% accuracy without learning anything meaningful about the data.

3. **Business Context**: In credit risk assessment, the cost of missing a default (false negative) is typically much higher than incorrectly flagging a good customer (false positive). Accuracy treats both errors equally.
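
The accuracy paradox can be checked with simple arithmetic. The sketch below assumes a hypothetical sample of 1,000 applicants split 919/81 to match the approximate proportions above, and scores a classifier that always predicts class 0.

```python
# Hypothetical sample of 1,000 rows matching the ~91.9% / ~8.1% split
n_negative, n_positive = 919, 81

# "Always predict class 0" makes no positive predictions at all
tp, fp = 0, 0
fn, tn = n_positive, n_negative

naive_accuracy = (tp + tn) / (tp + tn + fp + fn)
naive_recall = tp / (tp + fn)                             # share of defaults caught
naive_precision = tp / (tp + fp) if (tp + fp) else 0.0    # undefined here, reported as 0

print(f"accuracy={naive_accuracy:.3f} recall={naive_recall:.3f} precision={naive_precision:.3f}")
```

The classifier scores 91.9% accuracy while catching none of the 81 defaults.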

**Better Metrics for This Problem:**

1. **Precision and Recall for Class 1**: Focus on how well we identify actual defaults
2. **F1-Score (class 1)**: Harmonic mean of class-1 precision and recall, which is not inflated by the majority class
3. **AUC-ROC**: Measures the model's ability to distinguish between classes across all thresholds
4. **Precision-Recall AUC**: Particularly useful for imbalanced datasets
5. **Cost-sensitive metrics**: Incorporate business costs of different types of errors
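
A cost-sensitive metric can be as simple as the average misclassification cost. The sketch below uses made-up unit costs (a missed default costing ten times a false alarm) and made-up error counts purely for illustration; real values would have to come from the business.

```python
def average_cost(fp, fn, total, cost_fp=1.0, cost_fn=10.0):
    """Mean cost per applicant; the default cost values are hypothetical."""
    return (fp * cost_fp + fn * cost_fn) / total

# Two hypothetical classifiers on 1,000 applicants (81 true defaults)
always_zero = average_cost(fp=0, fn=81, total=1000)    # misses every default
cautious = average_cost(fp=150, fn=30, total=1000)     # more false alarms, fewer misses

print(f"always-zero: {always_zero:.2f}, cautious: {cautious:.2f}")
```

Under these assumed costs the second classifier is less accurate (82.0% versus 91.9%) yet has the lower average cost, which is the trade-off accuracy alone cannot express.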

**Conclusion**: While accuracy provides a quick overview, it can be misleading for imbalanced datasets like this credit risk problem. The AUC calculated above is more informative. The F1 score calculated above is MLlib's `f1` metric, which is weighted across both classes, so class-1 precision and recall should be reported alongside it.
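
To see how a class-weighted F1 (which is what MLlib's `f1` metric reports) can mask poor minority-class performance, the sketch below reuses a hypothetical 919/81 sample and an always-predict-0 classifier. The helper `f1` is illustrative plain Python, not an MLlib call.

```python
def f1(tp, fp, fn):
    """F1 for one class from its confusion counts (0.0 when undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

n_neg, n_pos = 919, 81  # hypothetical sample

# Always predicting class 0: every negative is right, every positive is missed
f1_class0 = f1(tp=n_neg, fp=n_pos, fn=0)
f1_class1 = f1(tp=0, fp=0, fn=n_pos)

# Weight each class's F1 by its share of the true labels
weighted_f1 = (n_neg * f1_class0 + n_pos * f1_class1) / (n_neg + n_pos)

print(f"class 0: {f1_class0:.3f}, class 1: {f1_class1:.3f}, weighted: {weighted_f1:.3f}")
```

The weighted score comes out at roughly 0.88 even though the class-1 F1 is zero, which is why the per-class figure is worth reporting separately.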


In [None]:
# Demonstrate the class imbalance issue
print("Class distribution in test set:")
class_counts = test_df.groupBy("TARGET").count().collect()
total_count = test_df.count()

for row in class_counts:
    class_label = row['TARGET']
    count = row['count']
    percentage = (count / total_count) * 100
    print(f"Class {class_label}: {count} samples ({percentage:.2f}%)")

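# Calculate baseline accuracy (always predict majority class).
# sorted() avoids any clash with pyspark.sql.functions.max, which a wildcard
# import would bring into scope in place of Python's built-in max()
majority_class_count = sorted(row['count'] for row in class_counts)[-1]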
baseline_accuracy = majority_class_count / total_count
print(f"\nBaseline accuracy (always predict majority class): {baseline_accuracy:.4f}")
print(f"Our model accuracy: {accuracy:.4f}")
print(f"Improvement over baseline: {accuracy - baseline_accuracy:.4f}")

In [None]:
# Clean up
spark.stop()
print("Spark session stopped.")