# Question 4.3: Koalas Random Forest Implementation

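This notebook revisits the PySpark Random Forest implementation from Question 4.1 and replaces as much PySpark code as possible with the Koalas API, which provides a pandas-like interface for Spark DataFrames. Since Spark 3.2, Koalas ships with PySpark as the pandas API on Spark (`pyspark.pandas`), which is the module imported here.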

## Overview
- Use Koalas for data loading and manipulation (pandas-like syntax)
- Leverage Koalas for feature engineering and preprocessing
- Convert to PySpark DataFrame only when necessary for MLlib
- Train Random Forest using PySpark MLlib (as Koalas doesn't have ML algorithms)
- Compare the simplicity and readability of the code with the pure PySpark approach

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
import pyspark.pandas as ps  # Koalas is now integrated as pyspark.pandas
# Only the names used below are imported; a wildcard import of
# pyspark.sql.functions would shadow Python builtins such as max, min and sum
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml import Pipeline

# Configure Koalas to work with larger datasets
ps.set_option('compute.default_index_type', 'distributed')
ps.set_option('compute.ops_on_diff_frames', True)

In [None]:
# Initialize Spark Session
spark = SparkSession.builder \
    .appName("RandomForestKoalas") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()

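import pyspark

print(f"Spark version: {spark.version}")
# Koalas ships inside PySpark as pyspark.pandas, so it shares the PySpark version
print(f"Koalas (pyspark.pandas) version: {pyspark.__version__}")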

## Data Loading with Koalas
Use Koalas to load data with pandas-like syntax, making the code more familiar and readable.

In [None]:
# Download data files (same as original implementation)
import subprocess

# Download external sources
subprocess.run([
    "wget", 
    "https://storage.googleapis.com/bdt-spark-store/external_sources.csv", 
    "-O", "gcs_external_sources.csv"
], check=True)

# Download internal data
subprocess.run([
    "wget", 
    "https://storage.googleapis.com/bdt-spark-store/internal_data.csv", 
    "-O", "gcs_internal_data.csv"
], check=True)

print("Data files downloaded successfully")

In [None]:
# Load data using Koalas (pandas-like syntax)
df_data = ps.read_csv("gcs_internal_data.csv")
df_ext = ps.read_csv("gcs_external_sources.csv")

print(f"Internal data shape: {df_data.shape}")
print(f"External data shape: {df_ext.shape}")

# Show basic info (pandas-like)
print("\nInternal data info:")
print(df_data.dtypes.head(10))
print("\nExternal data info:")
print(df_ext.dtypes.head(10))

## Data Joining with Koalas
Use pandas-like merge syntax instead of PySpark join operations.

In [None]:
# Join datasets using pandas-like merge syntax
df_full = df_data.merge(df_ext, on="SK_ID_CURR", how="inner")

print(f"Joined data shape: {df_full.shape}")

# Show first few rows (pandas-like)
print("\nFirst 3 rows:")
print(df_full.head(3))

## Feature Selection with Koalas
Use pandas-like column selection syntax.

In [None]:
# Select the same columns as in the original implementation (pandas-like syntax)
columns_extract = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
                  'DAYS_BIRTH', 'DAYS_EMPLOYED', 'NAME_EDUCATION_TYPE',
                  'DAYS_ID_PUBLISH', 'CODE_GENDER', 'AMT_ANNUITY',
                  'DAYS_REGISTRATION', 'AMT_GOODS_PRICE', 'AMT_CREDIT',
                  'ORGANIZATION_TYPE', 'DAYS_LAST_PHONE_CHANGE',
                  'NAME_INCOME_TYPE', 'AMT_INCOME_TOTAL', 'OWN_CAR_AGE', 'TARGET']

df = df_full[columns_extract]  # pandas-like column selection

print(f"Selected features shape: {df.shape}")
print("\nFirst 3 rows of selected features:")
print(df.head(3))

## Data Exploration with Koalas
Use pandas-like methods for data exploration and analysis.

In [None]:
# Explore data using pandas-like methods
print("Data types:")
print(df.dtypes)

print("\nTarget distribution:")
print(df['TARGET'].value_counts())

print("\nBasic statistics for numerical columns:")
print(df.describe())

## Train-Test Split with Koalas
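Split the data into disjoint training and test sets, then continue with pandas-like syntax.

In [None]:
# Create the train-test split
# Note: Koalas has no train_test_split, and df.drop(train.index) cannot take a
# distributed Koalas index, so the rows are split once with Spark's randomSplit
# and both parts are converted straight back to Koalas DataFrames.
train_sdf, test_sdf = df.to_spark().randomSplit([0.8, 0.2], seed=101)
train_df_koalas = train_sdf.pandas_api()
test_df_koalas = test_sdf.pandas_api()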

print(f"Training set shape: {train_df_koalas.shape}")
print(f"Test set shape: {test_df_koalas.shape}")

# Check target distribution using pandas-like syntax
print("\nTraining set target distribution:")
print(train_df_koalas['TARGET'].value_counts())

print("\nTest set target distribution:")
print(test_df_koalas['TARGET'].value_counts())
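
Whatever API performs the split, the two sets must be disjoint and together cover every row. The plain-Python sketch below illustrates that idea with one seeded random draw per row. The helper name `bernoulli_split` is hypothetical, and the sketch is only similar in spirit to a per-row random assignment, not a reproduction of Spark's internals.

```python
import random


def bernoulli_split(rows, train_fraction=0.8, seed=101):
    """Assign every row to exactly one side using a single seeded draw per row."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # One draw per row: a row can never land in both sets or in neither.
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1000))
train_rows, test_rows = bernoulli_split(rows)
print(len(train_rows), len(test_rows))
```

The split sizes are only approximately 80/20, because each row is assigned independently.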

## Feature Engineering with Koalas
Use pandas-like operations for feature engineering including dummy encoding and data preprocessing.

In [None]:
# Identify categorical and numerical columns
categorical_cols = ['NAME_EDUCATION_TYPE', 'CODE_GENDER', 'ORGANIZATION_TYPE', 'NAME_INCOME_TYPE']
numerical_cols = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
                 'DAYS_ID_PUBLISH', 'AMT_ANNUITY', 'DAYS_REGISTRATION', 'AMT_GOODS_PRICE', 
                 'AMT_CREDIT', 'DAYS_LAST_PHONE_CHANGE', 'AMT_INCOME_TOTAL', 'OWN_CAR_AGE']

print(f"Categorical columns: {categorical_cols}")
print(f"Numerical columns: {numerical_cols}")

In [None]:
# One-hot encoding using pandas-like get_dummies (much simpler than PySpark!)
train_encoded = ps.get_dummies(train_df_koalas, columns=categorical_cols, prefix=categorical_cols)
test_encoded = ps.get_dummies(test_df_koalas, columns=categorical_cols, prefix=categorical_cols)

print(f"Training set shape after encoding: {train_encoded.shape}")
print(f"Test set shape after encoding: {test_encoded.shape}")

# Align columns between train and test (pandas-like operation)
# Keep the training column order so the feature vector layout is reproducible
# (a plain set intersection has no stable order between runs)
test_cols = set(test_encoded.columns)
common_cols = [c for c in train_encoded.columns if c in test_cols]

# TARGET is present in both frames, so it is part of common_cols
train_aligned = train_encoded[common_cols]
test_aligned = test_encoded[common_cols]

print(f"Aligned training set shape: {train_aligned.shape}")
print(f"Aligned test set shape: {test_aligned.shape}")
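
Encoding the training and test sets separately can yield different dummy columns when a category appears in only one of them. A minimal plain-Python sketch of the alignment step follows. The helper `align_columns` and the column lists are made up for illustration.

```python
def align_columns(train_columns, test_columns, label="TARGET"):
    """Return the shared feature columns in training order, with the label last."""
    test_set = set(test_columns)
    shared = [c for c in train_columns if c in test_set and c != label]
    return shared + [label]


# Hypothetical column lists: one category shows up only in the training set.
train_columns = ["AMT_CREDIT", "CODE_GENDER_F", "CODE_GENDER_M", "CODE_GENDER_XNA", "TARGET"]
test_columns = ["AMT_CREDIT", "CODE_GENDER_M", "CODE_GENDER_F", "TARGET"]

aligned = align_columns(train_columns, test_columns)
print(aligned)
```

Dummy columns that exist in only one set are dropped, so a rare category contributes nothing to the model. Encoding before the split would be one way to avoid that.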

In [None]:
# Handle missing values using pandas-like fillna
# Calculate median for numerical columns in training set
numerical_medians = {}
for col in numerical_cols:
    if col in train_aligned.columns:
        numerical_medians[col] = train_aligned[col].median()

print("Numerical medians for imputation:")
for col, median_val in numerical_medians.items():
    print(f"{col}: {median_val}")

# Fill missing values with median (pandas-like)
train_imputed = train_aligned.fillna(numerical_medians)
test_imputed = test_aligned.fillna(numerical_medians)

print(f"\nTraining set shape after imputation: {train_imputed.shape}")
print(f"Test set shape after imputation: {test_imputed.shape}")
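
The medians come from the training set only and are then reused for the test set, so no test-set statistics leak into preprocessing. The plain-Python sketch below shows that pattern on toy values. The helpers `fit_medians` and `impute` are hypothetical. The medians pandas-on-Spark returns are, by default, approximations based on percentile estimates, so they can differ slightly from an exact median like the one computed here.

```python
from statistics import median


def fit_medians(rows, columns):
    """Compute per-column medians from the observed (non-missing) training values."""
    return {
        col: median(r[col] for r in rows if r[col] is not None)
        for col in columns
    }


def impute(rows, medians):
    """Replace missing values with the medians learned from the training set."""
    return [
        {k: (medians[k] if v is None and k in medians else v) for k, v in r.items()}
        for r in rows
    ]


# Toy values, not taken from the real dataset.
train = [{"AMT_ANNUITY": 10.0}, {"AMT_ANNUITY": None},
         {"AMT_ANNUITY": 30.0}, {"AMT_ANNUITY": 20.0}]
test = [{"AMT_ANNUITY": None}, {"AMT_ANNUITY": 99.0}]

medians = fit_medians(train, ["AMT_ANNUITY"])
print(impute(test, medians))
```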

## Convert to PySpark for MLlib
Convert Koalas DataFrames to PySpark DataFrames for machine learning with MLlib.

In [None]:
# Convert Koalas DataFrames to PySpark DataFrames for MLlib
train_spark = train_imputed.to_spark()
test_spark = test_imputed.to_spark()

print(f"Converted to PySpark - Training: {train_spark.count()} rows, {len(train_spark.columns)} columns")
print(f"Converted to PySpark - Test: {test_spark.count()} rows, {len(test_spark.columns)} columns")

# Prepare feature columns (exclude TARGET)
feature_columns = [col for col in train_spark.columns if col != 'TARGET']
print(f"\nNumber of feature columns: {len(feature_columns)}")

## Model Training with PySpark MLlib
Use PySpark MLlib for the actual machine learning (as Koalas doesn't provide ML algorithms).

In [None]:
# Create feature vector and scaling pipeline
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features_unscaled")
scaler = StandardScaler(inputCol="features_unscaled", outputCol="features", 
                       withStd=True, withMean=True)

# Random Forest Classifier
rf = RandomForestClassifier(
    featuresCol="features",
    labelCol="TARGET",
    numTrees=100,  # equivalent to n_estimators=100
    seed=50,       # equivalent to random_state=50
    maxDepth=10,
    minInstancesPerNode=1
)

# Create pipeline
pipeline = Pipeline(stages=[assembler, scaler, rf])

print("Training the model...")
model = pipeline.fit(train_spark)
print("Model training completed!")
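
With `withMean=True` and `withStd=True`, `StandardScaler` centers each feature and divides it by its standard deviation. The plain-Python sketch below shows that arithmetic for a single feature, assuming the sample (n - 1) standard deviation that the Spark documentation describes. The helper name `standardize` is hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Center the values on zero and scale them to unit sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(v - mu) / sigma for v in values]


scaled = standardize([2.0, 4.0, 6.0])
print(scaled)
```

Tree splits depend on the ordering of feature values rather than their scale, so this step is unlikely to change the Random Forest much. It is kept to mirror the original pipeline.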

## Model Evaluation
Evaluate the model performance using the same metrics as in the PySpark implementation.

In [None]:
# Make predictions on test set
predictions = model.transform(test_spark)

# Show predictions
predictions.select("TARGET", "prediction", "probability").show(10)

# Calculate accuracy
evaluator = MulticlassClassificationEvaluator(
    labelCol="TARGET", 
    predictionCol="prediction", 
    metricName="accuracy"
)

accuracy = evaluator.evaluate(predictions)
print(f"\nAccuracy: {accuracy:.10f}")

# Additional metrics
f1_evaluator = MulticlassClassificationEvaluator(
    labelCol="TARGET", 
    predictionCol="prediction", 
    metricName="f1"
)
f1_score = f1_evaluator.evaluate(predictions)
print(f"F1 Score: {f1_score:.10f}")

# AUC for binary classification
auc_evaluator = BinaryClassificationEvaluator(
    labelCol="TARGET", 
    rawPredictionCol="rawPrediction", 
    metricName="areaUnderROC"
)
auc = auc_evaluator.evaluate(predictions)
print(f"AUC: {auc:.10f}")
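
According to the Spark documentation, the `f1` metric of `MulticlassClassificationEvaluator` is a support-weighted F-measure, so the majority class dominates it. The plain-Python sketch below computes that weighted score from a confusion matrix. The helper `weighted_f1` and the counts are invented for illustration.

```python
def weighted_f1(confusion):
    """confusion[actual][predicted] holds counts; returns the support-weighted F1."""
    labels = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    score = 0.0
    for label in labels:
        true_pos = confusion[label].get(label, 0)
        support = sum(confusion[label].values())
        predicted = sum(confusion[actual].get(label, 0) for actual in labels)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * support / total
    return score


# Hypothetical counts: 920 negatives, 80 positives, and a model that always predicts 0.
confusion = {0: {0: 920, 1: 0}, 1: {0: 80, 1: 0}}
print(round(weighted_f1(confusion), 4))
```

In this made-up case the weighted F1 stays high even though no positive case is detected, which is why AUC is worth reading alongside accuracy and F1.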

## Feature Importance Analysis
Extract and analyze feature importance from the trained model.

In [None]:
# Extract the Random Forest model from the pipeline
rf_model = model.stages[-1]

# Get feature importances
feature_importances = rf_model.featureImportances.toArray()

# Create feature importance analysis using Koalas (pandas-like)
importance_data = {
    'feature_name': feature_columns,
    'importance': feature_importances
}

# Convert to Koalas DataFrame for easier manipulation
importance_df = ps.DataFrame(importance_data)

# Sort by importance (pandas-like syntax)
importance_sorted = importance_df.sort_values('importance', ascending=False)

print("Top 20 most important features:")
print(importance_sorted.head(20))

print(f"\nTotal number of features: {len(feature_importances)}")
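
Because `ps.get_dummies()` spreads each categorical column over many dummy columns, that column's importance is split across them. One possible way to read the ranking per original column is to sum the dummy importances back by prefix. The plain-Python sketch below does this with a hypothetical helper `group_importances` and invented importance values.

```python
from collections import defaultdict


def group_importances(feature_names, importances, categorical_cols):
    """Sum dummy-column importances back onto their source categorical column."""
    # Try longer names first so a column name that prefixes another one
    # cannot claim the other column's dummies.
    prefixes = sorted(categorical_cols, key=len, reverse=True)
    grouped = defaultdict(float)
    for name, value in zip(feature_names, importances):
        source = next((c for c in prefixes if name.startswith(c + "_")), name)
        grouped[source] += value
    return sorted(grouped.items(), key=lambda item: item[1], reverse=True)


# Invented importances, chosen only to make the arithmetic easy to follow.
names = ["EXT_SOURCE_2", "CODE_GENDER_F", "CODE_GENDER_M", "DAYS_BIRTH"]
scores = [0.5, 0.0625, 0.0625, 0.375]

ranking = group_importances(names, scores, ["CODE_GENDER"])
print(ranking)
```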

## Model Persistence
Save the trained model to disk.

In [None]:
# Save the complete pipeline model
model_path = "./koalas_random_forest_model"
model.write().overwrite().save(model_path)
print(f"Model saved to: {model_path}")

# Also save just the Random Forest model
rf_model_path = "./koalas_rf_only_model"
rf_model.write().overwrite().save(rf_model_path)
print(f"Random Forest model saved to: {rf_model_path}")

## Comparison: Koalas vs Pure PySpark

### Advantages of Using Koalas:

1. **Familiar Syntax**: Koalas provides pandas-like syntax, making it easier for data scientists familiar with pandas to work with Spark DataFrames.

2. **Simplified Data Manipulation**:
   - `df.merge()` instead of `df.join()`
   - `ps.get_dummies()` instead of complex StringIndexer + OneHotEncoder pipeline
   - `df.fillna()` instead of Imputer transformations
   - `df.sample()` for pandas-style sampling, although a disjoint train-test split is still most reliably produced with `randomSplit()`

3. **Easier Data Exploration**:
   - `df.describe()`, `df['TARGET'].value_counts()`, `df.head()` work as expected
   - More intuitive data inspection and analysis

4. **Reduced Code Complexity**: The Koalas implementation requires significantly fewer lines of code for data preprocessing.

### Limitations:

1. **No ML Algorithms**: Koalas doesn't provide machine learning algorithms, so we still need to convert to PySpark DataFrames for MLlib.

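2. **Performance Considerations**: Koalas adds overhead that native PySpark avoids, for example attaching a default index to every new DataFrame and joining frames under the hood when `compute.ops_on_diff_frames` is enabled.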

3. **Feature Completeness**: Not all pandas features are available in Koalas.

### Conclusion:

Koalas significantly simplifies data preprocessing and exploration tasks, making the code more readable and maintainable. However, for machine learning tasks, we still need to rely on PySpark MLlib. The hybrid approach (Koalas for data processing + PySpark MLlib for ML) provides the best of both worlds: familiar syntax for data manipulation and powerful distributed ML capabilities.


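## Class Imbalance Check
Compare the model's accuracy with a majority-class baseline to put the score in context.

In [None]:
# Demonstrate the class imbalance issue using Koalas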
print("Class distribution analysis using Koalas:")

# Convert predictions back to Koalas for easier analysis
# (pandas_api() is the pyspark.pandas replacement for the legacy to_koalas())
predictions_koalas = predictions.select("TARGET", "prediction").pandas_api()

print("\nActual vs Predicted distribution:")
print("Actual TARGET distribution:")
print(predictions_koalas['TARGET'].value_counts())

print("\nPredicted distribution:")
print(predictions_koalas['prediction'].value_counts())

# Calculate baseline accuracy
total_samples = len(predictions_koalas)
majority_class_count = predictions_koalas['TARGET'].value_counts().max()
baseline_accuracy = majority_class_count / total_samples

print(f"\nBaseline accuracy (always predict majority class): {baseline_accuracy:.4f}")
print(f"Our model accuracy: {accuracy:.4f}")
print(f"Improvement over baseline: {accuracy - baseline_accuracy:.4f}")
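
The baseline used above is the accuracy of a model that always predicts the most frequent class. The plain-Python sketch below shows the arithmetic with a hypothetical helper `majority_baseline` and invented class counts.

```python
def majority_baseline(label_counts):
    """Accuracy obtained by always predicting the most frequent label."""
    total = sum(label_counts.values())
    return max(label_counts.values()) / total


# Invented counts: 920 rows of class 0 and 80 rows of class 1.
label_counts = {0: 920, 1: 80}
baseline = majority_baseline(label_counts)

# A model accuracy only slightly above this value adds little over guessing class 0.
model_accuracy = 0.93  # made-up number for the comparison
print(f"baseline={baseline:.4f}, lift={model_accuracy - baseline:.4f}")
```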

In [None]:
# Clean up
spark.stop()
print("Spark session stopped.")