<a href="https://colab.research.google.com/github/watsonselah/bubba-watson/blob/master/koalas_random_forest_new.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Import Libraries
Import required libraries including pyspark.pandas (Koalas), PySpark MLlib, and other dependencies.

In [1]:
!pip install numpy==1.26.4



In [None]:
# NumPy 2.0 & PyArrow Compatibility Setup
import os
import sys
import warnings

# Set PyArrow environment variable to suppress timezone warnings
os.environ['PYARROW_IGNORE_TIMEZONE'] = '1'
print("✓ PyArrow timezone environment variable set")

# NumPy 2.0 compatibility fix
try:
    import numpy as np

    # Check if np.NaN exists (it was removed in NumPy 2.0)
    if not hasattr(np, 'NaN'):
        # Add np.NaN as an alias to np.nan for backward compatibility
        np.NaN = np.nan

    print(f"NumPy version: {np.__version__}")

except Exception as e:
    print(f"⚠️ NumPy compatibility setup failed: {e}")
    print("Continuing with standard NumPy import...")

In [2]:
# Import required libraries
from pyspark.sql import SparkSession
import pyspark.pandas as ps  # Koalas is now integrated as pyspark.pandas
from pyspark.pandas import config
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder, Imputer, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
import subprocess

# Configure Koalas to work with larger datasets
ps.set_option('compute.default_index_type', 'distributed')
ps.set_option('compute.ops_on_diff_frames', True)

print("Libraries imported successfully!")



Libraries imported successfully!


## 2. Spark Session Setup
Initialize Spark session with optimized configuration for Koalas operations.

In [3]:
# Initialize Spark Session with optimized configuration
spark = SparkSession.builder \
    .appName("RandomForestKoalas") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .getOrCreate()

print(f"Spark version: {spark.version}")
print(f"Spark context: {spark.sparkContext}")

Spark version: 3.5.1
Spark context: <SparkContext master=local[*] appName=pandas-on-Spark>


## 3. Data Loading with Koalas
Use Koalas to load data with pandas-like syntax, making the code more familiar and readable compared to PySpark's DataFrame API.

In [4]:
# Download data files (same as original implementation)
print("Downloading data files...")

# Download external sources
subprocess.run([
    "wget",
    "https://storage.googleapis.com/bdt-spark-store/external_sources.csv",
    "-O", "gcs_external_sources.csv"
], check=True)

# Download internal data
subprocess.run([
    "wget",
    "https://storage.googleapis.com/bdt-spark-store/internal_data.csv",
    "-O", "gcs_internal_data.csv"
], check=True)

print("Data files downloaded successfully!")

Downloading data files...
Data files downloaded successfully!


In [5]:
# Load data using Koalas
print("Loading data with Koalas...")

df_data = ps.read_csv("gcs_internal_data.csv")
df_ext = ps.read_csv("gcs_external_sources.csv")

print(f"Internal data shape: {df_data.shape}")
print(f"External data shape: {df_ext.shape}")

# Show basic info using pandas-like methods
print("\nInternal data column types (first 10):")
print(df_data.dtypes.head(10))

print("\nExternal data column types:")
print(df_ext.dtypes)

Loading data with Koalas...




Internal data shape: (307511, 119)
External data shape: (307511, 4)

Internal data column types (first 10):
SK_ID_CURR              int32
TARGET                  int32
NAME_CONTRACT_TYPE     object
CODE_GENDER            object
FLAG_OWN_CAR           object
FLAG_OWN_REALTY        object
CNT_CHILDREN            int32
AMT_INCOME_TOTAL      float64
AMT_CREDIT            float64
AMT_ANNUITY           float64
dtype: object

External data column types:
SK_ID_CURR        int32
EXT_SOURCE_1    float64
EXT_SOURCE_2    float64
EXT_SOURCE_3    float64
dtype: object


## 4. Data Joining with Koalas
Use pandas-like merge syntax instead of PySpark join operations. This is much more intuitive for data scientists familiar with pandas.

In [6]:
# Join datasets using pandas-like merge syntax (much simpler than PySpark join!)
print("Joining datasets using Koalas merge...")

df_full = df_data.merge(df_ext, on="SK_ID_CURR", how="inner")

print(f"Joined data shape: {df_full.shape}")

# Show first few rows using pandas-like head() method
print("\nFirst 3 rows of joined data:")
print(df_full.head(3))

print("\nColumn names (first 10):")
print(df_full.columns[:10].tolist())

Joining datasets using Koalas merge...
Joined data shape: (307511, 122)

First 3 rows of joined data:
   SK_ID_CURR  TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY  CNT_CHILDREN  AMT_INCOME_TOTAL  AMT_CREDIT  AMT_ANNUITY  AMT_GOODS_PRICE  NAME_TYPE_SUITE NAME_INCOME_TYPE            NAME_EDUCATION_TYPE    NAME_FAMILY_STATUS  NAME_HOUSING_TYPE  REGION_POPULATION_RELATIVE  DAYS_BIRTH  DAYS_EMPLOYED  DAYS_REGISTRATION  DAYS_ID_PUBLISH  OWN_CAR_AGE  FLAG_MOBIL  FLAG_EMP_PHONE  FLAG_WORK_PHONE  FLAG_CONT_MOBILE  FLAG_PHONE  FLAG_EMAIL OCCUPATION_TYPE  CNT_FAM_MEMBERS  REGION_RATING_CLIENT  REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START  HOUR_APPR_PROCESS_START  REG_REGION_NOT_LIVE_REGION  REG_REGION_NOT_WORK_REGION  LIVE_REGION_NOT_WORK_REGION  REG_CITY_NOT_LIVE_CITY  REG_CITY_NOT_WORK_CITY  LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE  APARTMENTS_AVG  BASEMENTAREA_AVG  YEARS_BEGINEXPLUATATION_AVG  YEARS_BUILD_AVG  COMMONAREA_AVG  ELEVATORS_AVG  ENTRANCES_AVG  FLOORS

## 5. Feature Selection with Koalas
Use pandas-like column selection syntax. This is identical to pandas, making it very familiar.

In [7]:
# Select the same columns as in the original implementation using pandas-like syntax
columns_extract = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
                  'DAYS_BIRTH', 'DAYS_EMPLOYED', 'NAME_EDUCATION_TYPE',
                  'DAYS_ID_PUBLISH', 'CODE_GENDER', 'AMT_ANNUITY',
                  'DAYS_REGISTRATION', 'AMT_GOODS_PRICE', 'AMT_CREDIT',
                  'ORGANIZATION_TYPE', 'DAYS_LAST_PHONE_CHANGE',
                  'NAME_INCOME_TYPE', 'AMT_INCOME_TOTAL', 'OWN_CAR_AGE', 'TARGET']

# Pandas-like column selection - much cleaner than PySpark select()
df = df_full[columns_extract]

print(f"Selected features shape: {df.shape}")
print(f"Selected columns: {len(columns_extract)}")

print("\nFirst 3 rows of selected features:")
print(df.head(3))

Selected features shape: (307511, 18)
Selected columns: 18

First 3 rows of selected features:
   EXT_SOURCE_1  EXT_SOURCE_2  EXT_SOURCE_3  DAYS_BIRTH  DAYS_EMPLOYED            NAME_EDUCATION_TYPE  DAYS_ID_PUBLISH CODE_GENDER  AMT_ANNUITY  DAYS_REGISTRATION  AMT_GOODS_PRICE  AMT_CREDIT ORGANIZATION_TYPE  DAYS_LAST_PHONE_CHANGE NAME_INCOME_TYPE  AMT_INCOME_TOTAL  OWN_CAR_AGE  TARGET
0      0.311267      0.622246           NaN      -16765          -1188               Higher education             -291           F      35698.5            -1186.0        1129500.0   1293502.5            School                  -828.0    State servant          270000.0          NaN       0
1           NaN      0.322738           NaN      -19932          -3038  Secondary / secondary special            -3458           M      21865.5            -4311.0         513000.0    513000.0          Religion                 -1106.0          Working          121500.0          NaN       0
2           NaN      0.354225      

## 6. Data Exploration with Koalas
Use pandas-like methods for data exploration and analysis. This provides familiar and intuitive data inspection capabilities.

In [8]:
# Explore data using pandas-like methods - much more intuitive than PySpark!
print("Data exploration with Koalas:")

print("\nData types:")
print(df.dtypes)

print("\nTarget distribution (pandas-like value_counts):")
print(df['TARGET'].value_counts())

print("\nBasic statistics for numerical columns:")
print(df.describe())

# Check for missing values using pandas-like syntax
print("\nMissing values count:")
print(df.isnull().sum())

Data exploration with Koalas:

Data types:
EXT_SOURCE_1              float64
EXT_SOURCE_2              float64
EXT_SOURCE_3              float64
DAYS_BIRTH                  int32
DAYS_EMPLOYED               int32
NAME_EDUCATION_TYPE        object
DAYS_ID_PUBLISH             int32
CODE_GENDER                object
AMT_ANNUITY               float64
DAYS_REGISTRATION         float64
AMT_GOODS_PRICE           float64
AMT_CREDIT                float64
ORGANIZATION_TYPE          object
DAYS_LAST_PHONE_CHANGE    float64
NAME_INCOME_TYPE           object
AMT_INCOME_TOTAL          float64
OWN_CAR_AGE               float64
TARGET                      int32
dtype: object

Target distribution (pandas-like value_counts):




0    282686
1     24825
Name: TARGET, dtype: int64

Basic statistics for numerical columns:
        EXT_SOURCE_1  EXT_SOURCE_2   EXT_SOURCE_3     DAYS_BIRTH  DAYS_EMPLOYED  DAYS_ID_PUBLISH    AMT_ANNUITY  DAYS_REGISTRATION  AMT_GOODS_PRICE    AMT_CREDIT  DAYS_LAST_PHONE_CHANGE  AMT_INCOME_TOTAL    OWN_CAR_AGE         TARGET
count  134133.000000  3.068510e+05  246546.000000  307511.000000  307511.000000    307511.000000  307499.000000      307511.000000     3.072330e+05  3.075110e+05           307510.000000      3.075110e+05  104582.000000  307511.000000
mean        0.502130  5.143927e-01       0.510853  -16036.995067   63815.045904     -2994.202373   27108.573909       -4986.120328     5.383962e+05  5.990260e+05             -962.858788      1.687979e+05      12.061091       0.080729
std         0.211062  1.910602e-01       0.194844    4363.988632  141275.766519      1509.450419   14493.737315        3522.886321     3.694465e+05  4.024908e+05              826.808487      2.371231e+05   

## 7. Train-Test Split with Koalas
Use pandas-like sampling for train-test split. While not as sophisticated as sklearn's train_test_split, it's much simpler than PySpark's randomSplit.

In [10]:
# Create train-test split using Koalas-compatible method
print("Creating train-test split with Koalas-compatible approach...")

try:
    # Method 1: Use random column approach (Koalas-compatible)
    import pyspark.sql.functions as F

    # Add a random column for splitting
    df_with_random = df.to_spark().withColumn("random_col", F.rand(seed=101))

    # Convert back to Koalas for pandas-like operations
    df_random = df_with_random.to_pandas_on_spark()

    # Split based on random column (80/20 split)
    train_df_koalas = df_random[df_random['random_col'] <= 0.8].drop('random_col', axis=1)
    test_df_koalas = df_random[df_random['random_col'] > 0.8].drop('random_col', axis=1)

    print("✓ Train-test split completed using random column method")

except Exception as e:
    print(f"⚠️ Random column method failed: {e}")
    print("Falling back to simple sampling approach...")

    # Method 2: Simple fallback using numpy-style split
    # Convert to pandas temporarily for splitting
    df_pandas = df.to_pandas()
    df_shuffled = df_pandas.sample(frac=1, random_state=101).reset_index(drop=True)

    # Calculate split index
    split_idx = int(0.8 * len(df_shuffled))

    # Split the data
    train_pandas = df_shuffled[:split_idx]
    test_pandas = df_shuffled[split_idx:]

    # Convert back to Koalas if available
    if KOALAS_AVAILABLE:
        train_df_koalas = ps.from_pandas(train_pandas)
        test_df_koalas = ps.from_pandas(test_pandas)
    else:
        train_df_koalas = train_pandas
        test_df_koalas = test_pandas

    print("✓ Train-test split completed using fallback method")

print(f"\nTraining set shape: {train_df_koalas.shape}")
print(f"Test set shape: {test_df_koalas.shape}")

# Check target distribution using pandas-like syntax
print("\nTraining set target distribution:")
print(train_df_koalas['TARGET'].value_counts())

print("\nTest set target distribution:")
print(test_df_koalas['TARGET'].value_counts())

# Calculate class distribution percentages
train_class_dist = train_df_koalas['TARGET'].value_counts(normalize=True) * 100
print("\nTraining set class distribution (%):")
print(train_class_dist)

test_class_dist = test_df_koalas['TARGET'].value_counts(normalize=True) * 100
print("\nTest set class distribution (%):")
print(test_class_dist)

Creating train-test split with Koalas-compatible approach...




✓ Train-test split completed using random column method

Training set shape: (245877, 18)
Test set shape: (61634, 18)

Training set target distribution:




0    226053
1     19824
Name: TARGET, dtype: int64

Test set target distribution:




0    56633
1     5001
Name: TARGET, dtype: int64





Training set class distribution (%):
0    91.937432
1     8.062568
Name: TARGET, dtype: float64





Test set class distribution (%):
0    91.885972
1     8.114028
Name: TARGET, dtype: float64


## 8. Feature Engineering with Koalas
Use pandas-like operations for feature engineering including dummy encoding and data preprocessing. This is significantly simpler than PySpark's StringIndexer + OneHotEncoder pipeline.

In [11]:
# Identify categorical and numerical columns
categorical_cols = ['NAME_EDUCATION_TYPE', 'CODE_GENDER', 'ORGANIZATION_TYPE', 'NAME_INCOME_TYPE']
numerical_cols = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
                 'DAYS_ID_PUBLISH', 'AMT_ANNUITY', 'DAYS_REGISTRATION', 'AMT_GOODS_PRICE',
                 'AMT_CREDIT', 'DAYS_LAST_PHONE_CHANGE', 'AMT_INCOME_TOTAL', 'OWN_CAR_AGE']

print(f"Categorical columns ({len(categorical_cols)}): {categorical_cols}")
print(f"Numerical columns ({len(numerical_cols)}): {numerical_cols}")

# Check unique values in categorical columns
print("\nUnique values in categorical columns:")
for col in categorical_cols:
    unique_count = df[col].nunique()
    print(f"{col}: {unique_count} unique values")

Categorical columns (4): ['NAME_EDUCATION_TYPE', 'CODE_GENDER', 'ORGANIZATION_TYPE', 'NAME_INCOME_TYPE']
Numerical columns (13): ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'AMT_ANNUITY', 'DAYS_REGISTRATION', 'AMT_GOODS_PRICE', 'AMT_CREDIT', 'DAYS_LAST_PHONE_CHANGE', 'AMT_INCOME_TOTAL', 'OWN_CAR_AGE']

Unique values in categorical columns:
NAME_EDUCATION_TYPE: 5 unique values
CODE_GENDER: 3 unique values
ORGANIZATION_TYPE: 58 unique values
NAME_INCOME_TYPE: 8 unique values


In [12]:
# One-hot encoding using pandas-like get_dummies (MUCH simpler than PySpark!)
print("Performing one-hot encoding with Koalas get_dummies...")

# This single line replaces complex PySpark StringIndexer + OneHotEncoder pipeline!
train_encoded = ps.get_dummies(train_df_koalas, columns=categorical_cols, prefix=categorical_cols)
test_encoded = ps.get_dummies(test_df_koalas, columns=categorical_cols, prefix=categorical_cols)

print(f"Training set shape after encoding: {train_encoded.shape}")
print(f"Test set shape after encoding: {test_encoded.shape}")

# Align columns between train and test (pandas-like operation)
print("\nAligning columns between train and test sets...")

# Get common columns
train_cols = set(train_encoded.columns)
test_cols = set(test_encoded.columns)
common_cols = list(train_cols.intersection(test_cols))

# Ensure TARGET is included
if 'TARGET' not in common_cols:
    common_cols.append('TARGET')

# Sort columns for consistency
common_cols.sort()

train_aligned = train_encoded[common_cols]
test_aligned = test_encoded[common_cols]

print(f"Aligned training set shape: {train_aligned.shape}")
print(f"Aligned test set shape: {test_aligned.shape}")
print(f"Number of common columns: {len(common_cols)}")

Performing one-hot encoding with Koalas get_dummies...
Training set shape after encoding: (245877, 88)
Test set shape after encoding: (61634, 87)

Aligning columns between train and test sets...
Aligned training set shape: (245877, 87)
Aligned test set shape: (61634, 87)
Number of common columns: 87


In [13]:
# Handle missing values using pandas-like fillna (much simpler than PySpark Imputer!)
print("Handling missing values with Koalas fillna...")

# Calculate median for numerical columns in training set
numerical_medians = {}
for col in numerical_cols:
    if col in train_aligned.columns:
        median_val = train_aligned[col].median()
        numerical_medians[col] = median_val

print("Numerical medians for imputation:")
for col, median_val in numerical_medians.items():
    print(f"{col}: {median_val:.6f}")

# Fill missing values with median using pandas-like fillna
# This is much simpler than PySpark's Imputer transformer
train_filled = train_aligned.fillna(numerical_medians)
test_filled = test_aligned.fillna(numerical_medians)

print(f"\nTraining set shape after filling missing values: {train_filled.shape}")
print(f"Test set shape after filling missing values: {test_filled.shape}")

# Check if there are any remaining missing values
print("\nRemaining missing values in training set:")
print(train_filled.isnull().sum().sum())

print("\nRemaining missing values in test set:")
print(test_filled.isnull().sum().sum())

Handling missing values with Koalas fillna...
Numerical medians for imputation:
EXT_SOURCE_1: 0.506146
EXT_SOURCE_2: 0.565822
EXT_SOURCE_3: 0.535276
DAYS_BIRTH: -15741.000000
DAYS_EMPLOYED: -1212.000000
DAYS_ID_PUBLISH: -3256.000000
AMT_ANNUITY: 24907.500000
DAYS_REGISTRATION: -4498.000000
AMT_GOODS_PRICE: 450000.000000
AMT_CREDIT: 514777.500000
DAYS_LAST_PHONE_CHANGE: -758.000000
AMT_INCOME_TOTAL: 144000.000000
OWN_CAR_AGE: 9.000000

Training set shape after filling missing values: (245877, 87)
Test set shape after filling missing values: (61634, 87)

Remaining missing values in training set:
0

Remaining missing values in test set:
0


## 9. Convert to PySpark DataFrame for MLlib
Convert Koalas DataFrames to PySpark DataFrames only when necessary for machine learning with MLlib. This demonstrates the seamless integration between Koalas and PySpark.

In [14]:
# Convert Koalas DataFrames to PySpark DataFrames for MLlib
print("Converting Koalas DataFrames to PySpark DataFrames for MLlib...")

# Convert to PySpark DataFrames using .to_spark()
train_spark = train_filled.to_spark()
test_spark = test_filled.to_spark()

print(f"PySpark training DataFrame count: {train_spark.count()}")
print(f"PySpark test DataFrame count: {test_spark.count()}")

# Show schema
print("\nPySpark DataFrame schema (first 10 columns):")
for field in train_spark.schema.fields[:10]:
    print(f"{field.name}: {field.dataType}")

print("\nFirst 3 rows of PySpark training DataFrame:")
train_spark.show(3, truncate=False)

Converting Koalas DataFrames to PySpark DataFrames for MLlib...




PySpark training DataFrame count: 245877
PySpark test DataFrame count: 61634

PySpark DataFrame schema (first 10 columns):
AMT_ANNUITY: DoubleType()
AMT_CREDIT: DoubleType()
AMT_GOODS_PRICE: DoubleType()
AMT_INCOME_TOTAL: DoubleType()
CODE_GENDER_F: ByteType()
CODE_GENDER_M: ByteType()
DAYS_BIRTH: DoubleType()
DAYS_EMPLOYED: DoubleType()
DAYS_ID_PUBLISH: DoubleType()
DAYS_LAST_PHONE_CHANGE: DoubleType()

First 3 rows of PySpark training DataFrame:
+-----------+----------+---------------+----------------+-------------+-------------+----------+-------------+---------------+----------------------+-----------------+------------------+------------------+------------------+-----------------------------------+------------------------------------+-------------------------------------+-----------------------------------+-------------------------------------------------+----------------------------+-------------------------------------+--------------------------------+--------------------------+

## 10. Feature Preparation for MLlib
Prepare features for PySpark MLlib using VectorAssembler. This is the only part where we need to use PySpark-specific operations.

In [15]:
# Prepare features for MLlib using VectorAssembler
print("Preparing features for MLlib...")

# Get feature columns (all columns except TARGET)
feature_cols = [col for col in train_spark.columns if col != 'TARGET']
print(f"Number of feature columns: {len(feature_cols)}")

# Create VectorAssembler
assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="features",
    handleInvalid="skip"  # Skip rows with invalid values
)

# Transform training and test data
train_assembled = assembler.transform(train_spark)
test_assembled = assembler.transform(test_spark)

print(f"Training data with features vector: {train_assembled.count()} rows")
print(f"Test data with features vector: {test_assembled.count()} rows")

# Select only the features and target columns for training
train_final = train_assembled.select("features", "TARGET")
test_final = test_assembled.select("features", "TARGET")

print("\nFinal training data schema:")
train_final.printSchema()

print("\nSample of assembled features:")
train_final.show(3, truncate=False)

Preparing features for MLlib...
Number of feature columns: 86
Training data with features vector: 245885 rows
Test data with features vector: 61626 rows

Final training data schema:
root
 |-- features: vector (nullable = true)
 |-- TARGET: integer (nullable = true)


Sample of assembled features:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|features                                                                                                                                                                                                   |TARGET|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|(86,[0,1,2,3,4,6,7,8,9,10,11,12,13,15,23,66,85],[35698.5,12935

## 11. Random Forest Model Training
Train the Random Forest model using PySpark MLlib. This is where we must use PySpark since Koalas doesn't provide machine learning algorithms.

In [16]:
# Train Random Forest model using PySpark MLlib
print("Training Random Forest model...")

# Create Random Forest classifier with same parameters as original
rf = RandomForestClassifier(
    featuresCol="features",
    labelCol="TARGET",
    numTrees=100,  # Same as n_estimators=100 in sklearn
    seed=101
)

# Train the model
print("Fitting Random Forest model...")
rf_model = rf.fit(train_final)

print("Model training completed!")
print(f"Number of trees: {rf_model.getNumTrees}")
print(f"Feature importance vector size: {len(rf_model.featureImportances)}")

Training Random Forest model...
Fitting Random Forest model...
Model training completed!
Number of trees: 100
Feature importance vector size: 86


## 12. Model Evaluation
Evaluate the model using various metrics including accuracy, F1-score, and AUC.

In [17]:
# Make predictions on test set
print("Making predictions on test set...")
predictions = rf_model.transform(test_final)

# Show sample predictions
print("\nSample predictions:")
predictions.select("TARGET", "prediction", "probability").show(10)

# Calculate evaluation metrics
print("\nCalculating evaluation metrics...")

# Accuracy
accuracy_evaluator = MulticlassClassificationEvaluator(
    labelCol="TARGET",
    predictionCol="prediction",
    metricName="accuracy"
)
accuracy = accuracy_evaluator.evaluate(predictions)

# F1 Score
f1_evaluator = MulticlassClassificationEvaluator(
    labelCol="TARGET",
    predictionCol="prediction",
    metricName="f1"
)
f1_score = f1_evaluator.evaluate(predictions)

# AUC
auc_evaluator = BinaryClassificationEvaluator(
    labelCol="TARGET",
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC"
)
auc = auc_evaluator.evaluate(predictions)

print(f"\n=== MODEL EVALUATION RESULTS ===")
print(f"Accuracy: {accuracy:.6f}")
print(f"F1-Score: {f1_score:.6f}")
print(f"AUC: {auc:.6f}")

Making predictions on test set...

Sample predictions:
+------+----------+--------------------+
|TARGET|prediction|         probability|
+------+----------+--------------------+
|     0|       0.0|[0.92259315023424...|
|     0|       0.0|[0.92688126014464...|
|     0|       0.0|[0.92403199026644...|
|     0|       0.0|[0.92001367924317...|
|     0|       0.0|[0.91704108296053...|
|     0|       0.0|[0.92536513722143...|
|     0|       0.0|[0.91791817914322...|
|     0|       0.0|[0.91981700926013...|
|     0|       0.0|[0.90150008271025...|
|     0|       0.0|[0.91836415221207...|
+------+----------+--------------------+
only showing top 10 rows


Calculating evaluation metrics...

=== MODEL EVALUATION RESULTS ===
Accuracy: 0.919044
F1-Score: 0.880273
AUC: 0.716582


## 13. Question 4.2: Is Accuracy a Good Metric for This Problem?

**Answer: No, accuracy is NOT a good choice of metric for this problem.**

### Reasons why accuracy is inappropriate:

1. **Severe Class Imbalance**: The dataset shows a significant class imbalance with approximately 91.9% of samples belonging to class 0 (no default) and only 8.1% belonging to class 1 (default). In such imbalanced scenarios, accuracy can be misleading.

2. **Accuracy Paradox**: A naive classifier that always predicts the majority class (no default) would achieve ~92% accuracy without learning anything meaningful about the problem. This demonstrates the "accuracy paradox" where high accuracy doesn't necessarily indicate good model performance.

3. **Business Impact**: In credit risk assessment, the cost of false negatives (missing actual defaults) is typically much higher than false positives (incorrectly flagging good customers). Accuracy treats both types of errors equally.

4. **Lack of Insight**: Accuracy doesn't provide information about the model's ability to identify the minority class (defaults), which is the primary objective in this credit risk problem.

### Better Metrics for This Problem:

1. **Precision**: Measures the proportion of predicted defaults that are actual defaults
2. **Recall (Sensitivity)**: Measures the proportion of actual defaults that are correctly identified
3. **F1-Score**: Harmonic mean of precision and recall, providing a balanced measure
4. **AUC-ROC**: Measures the model's ability to distinguish between classes across all thresholds
5. **Precision-Recall AUC**: Particularly useful for imbalanced datasets
6. **Business-specific metrics**: Such as expected loss or profit-based evaluation

### Conclusion:
For this credit default prediction problem, metrics like F1-score, AUC, precision, and recall provide much more meaningful insights into model performance than accuracy alone.

## 14. Model Saving
Save the trained Random Forest model to disk for future use.

In [18]:
# Save the trained model
model_path = "./koalas_random_forest_model"
print(f"Saving model to: {model_path}")

try:
    rf_model.write().overwrite().save(model_path)
    print("Model saved successfully!")
except Exception as e:
    print(f"Error saving model: {e}")
    # Alternative: save to a different path
    import tempfile
    import os
    temp_path = os.path.join(tempfile.gettempdir(), "koalas_rf_model")
    rf_model.write().overwrite().save(temp_path)
    print(f"Model saved to temporary location: {temp_path}")

Saving model to: ./koalas_random_forest_model
Model saved successfully!


## 15. Feature Importance Analysis
Analyze and display the most important features identified by the Random Forest model.

In [19]:
# Extract and display feature importances
print("Analyzing feature importances...")

# Get feature importances
importances = rf_model.featureImportances.toArray()

# Create feature importance DataFrame using Koalas (pandas-like syntax)
feature_importance_data = list(zip(feature_cols, importances))
feature_importance_df = ps.DataFrame(feature_importance_data, columns=['feature', 'importance'])

# Sort by importance (pandas-like syntax)
feature_importance_df = feature_importance_df.sort_values('importance', ascending=False)

print("\nTop 15 Most Important Features:")
print(feature_importance_df.head(15))

# Show summary statistics
print("\nFeature Importance Statistics:")
print(f"Total features: {len(feature_cols)}")
print(f"Sum of importances: {importances.sum():.6f}")
print(f"Mean importance: {importances.mean():.6f}")
print(f"Max importance: {importances.max():.6f}")
print(f"Min importance: {importances.min():.6f}")

Analyzing feature importances...

Top 15 Most Important Features:
                                              feature  importance
13                                       EXT_SOURCE_3    0.218866
11                                       EXT_SOURCE_1    0.210531
12                                       EXT_SOURCE_2    0.135659
15               NAME_EDUCATION_TYPE_Higher education    0.059525
4                                       CODE_GENDER_F    0.057114
26                           NAME_INCOME_TYPE_Working    0.049414
18  NAME_EDUCATION_TYPE_Secondary / secondary special    0.037667
7                                       DAYS_EMPLOYED    0.035960
6                                          DAYS_BIRTH    0.026908
2                                     AMT_GOODS_PRICE    0.018784
5                                       CODE_GENDER_M    0.018425
85                                        OWN_CAR_AGE    0.015661
10                                  DAYS_REGISTRATION    0.014688
22        