# Baseline Model

## Table of Contents
1. [Model Choice](#model-choice)
2. [Feature Selection](#feature-selection)
3. [Implementation](#implementation)
4. [Evaluation](#evaluation)


In [9]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from scipy.stats import spearmanr
# Import chosen baseline model
from sklearn.ensemble import RandomForestRegressor

## Model Choice

I have chosen the **Random Forest Regressor** as my baseline model.
It is well suited to tabular data, handles non-linear relationships well, and is generally more robust against overfitting than a single decision tree. It requires minimal data preprocessing and serves as a strong benchmark for future improvements. It also came out on top in most papers on this topic.


## Feature Selection

I am using all available features from the dataset.
To make the categorical features (`day_of_week`, `month_of_year`, `hour`) usable for the model, I apply **One-Hot Encoding**. I also ensure that the input features (`x_train`) and targets (`y_train`) are correctly aligned by merging them on the ID column.
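
To illustrate the two steps described above, here is a minimal sketch on made-up toy data. The frames `features` and `targets` and all their values are hypothetical; only the column names mirror the real files. The `validate="one_to_one"` argument is an optional extra check that is not used in the actual cell below.

```python
import pandas as pd

# Toy stand-ins for the feature and target files (all values are made up).
features = pd.DataFrame({
    "Column1": [0, 1, 2, 3],
    "day_of_week": [1, 2, 1, 7],
    "hour": [8, 8, 17, 23],
})
targets = pd.DataFrame({
    "Column1": [3, 2, 1, 0],              # deliberately in a different row order
    "invalid_ratio": [0.4, 0.3, 0.2, 0.1],
})

# Merging on the ID pairs each feature row with its own target,
# regardless of the row order in the two files.
# validate="one_to_one" raises a MergeError if an ID is duplicated on either side.
merged = pd.merge(features, targets, on="Column1", how="inner", validate="one_to_one")

# Each listed column is replaced by one indicator column per distinct value.
encoded = pd.get_dummies(merged, columns=["day_of_week", "hour"])

print(merged[["Column1", "invalid_ratio"]])
print(encoded.columns.tolist())
```

Merging on the ID instead of relying on row position means a reordered or partially missing target file cannot silently pair features with the wrong target.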


In [10]:
# Load the datasets
# Note: Ensure the file paths match your uploaded files
x_train_path = '/content/x_train_final_asAbTs5.csv'
y_train_path = '/content/y_train_final_YYyFil7.csv'

X_raw = pd.read_csv(x_train_path)
y_raw = pd.read_csv(y_train_path)

# Merge on ID to align rows (assuming 'Column1' or first column is ID)
id_col = 'Column1' if 'Column1' in X_raw.columns else X_raw.columns[0]
data = pd.merge(X_raw, y_raw, on=id_col, how='inner')

# Separate Target and Features
target_col = 'invalid_ratio'
y = data[target_col]
X = data.drop([target_col, id_col], axis=1)

# Feature selection / Engineering
# Convert categorical columns to numeric using One-Hot Encoding
categorical_cols = ['day_of_week', 'month_of_year', 'hour']
X = pd.get_dummies(X, columns=categorical_cols, dummy_na=False)
X = X.fillna(0) # Handle potential missing values

# Splitting the dataset into Training and Validation sets
# I rename X_test to X_val here to avoid confusion with the real test submission file
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training shape: {X_train.shape}")
print(f"Validation shape: {X_val.shape}")


Training shape: (4861236, 39)
Validation shape: (1215310, 39)


## Implementation

In [11]:
# Initialize the baseline model
model = RandomForestRegressor(
    n_estimators=50,       # Number of trees
    max_depth=15,          # Limit depth to prevent overfitting and memory issues
    max_samples=0.2,       # Use 20% of data per tree for speed
    n_jobs=2,              # Parallel processing
    random_state=42,       # Reproducible results
    verbose=1
)

# Train the model
model.fit(X_train, y_train)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 10.5min
[Parallel(n_jobs=2)]: Done  50 out of  50 | elapsed: 11.3min finished
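
To see what the size-limiting hyperparameters do, the following sketch fits a much smaller forest on synthetic data and inspects the fitted trees. The data, the names `X_toy`, `y_toy`, and `toy_model`, and the reduced settings are assumptions chosen for illustration, not the configuration used above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem (purely illustrative).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(2000, 5))
y_toy = np.sin(X_toy[:, 0]) + 0.1 * rng.normal(size=2000)

toy_model = RandomForestRegressor(
    n_estimators=10,   # number of trees
    max_depth=4,       # no tree may grow deeper than this
    max_samples=0.2,   # each tree is fit on a bootstrap sample of 20% of the rows
    random_state=42,
)
toy_model.fit(X_toy, y_toy)

# The fitted trees are available in estimators_; get_depth() reports each tree's depth.
tree_depths = [tree.get_depth() for tree in toy_model.estimators_]
print(len(toy_model.estimators_), max(tree_depths))
```

Capping the depth bounds the size of every tree, and subsampling reduces the work per tree, which is why both settings matter on a training set with several million rows.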


## Evaluation

I am using three metrics to evaluate the model's performance from different angles:

1.  **Spearman Correlation Coefficient (Primary Metric):**
    This is the official metric for the challenge. It evaluates how well the model ranks the `invalid_ratio` values (their ordering) rather than how close the predictions are in absolute terms. Because it depends only on ranks, it is robust against outliers.

2.  **MSE (Mean Squared Error):**
    I calculate MSE because it squares the prediction errors. This means it penalizes larger deviations much more heavily than smaller ones, helping me identify if the model makes any drastic mistakes or produces extreme outliers.

3.  **R² Score (Coefficient of Determination):**
    I use the R² Score to understand how much of the variance in the target variable is explained by my model. It provides a normalized value that is easier to interpret and compare than raw error values: 1 is a perfect fit, 0 is no better than always predicting the mean, and on held-out data it can even be negative. An R² of 0.30, for example, would mean the model explains 30% of the variance in the target.
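
The three metrics can be worked through by hand on a tiny example. This is a sketch with made-up numbers; the helper functions `mse` and `r2` are written out here only to show the formulas and are not the scikit-learn functions used in the evaluation cell.

```python
import numpy as np
from scipy.stats import spearmanr

y_true = np.array([0.0, 0.1, 0.2, 0.4, 0.9])
y_hat = np.array([0.1, 0.2, 0.3, 0.5, 0.6])   # correct ordering, imperfect values

def mse(y, p):
    """Mean of the squared prediction errors."""
    return float(np.mean((y - p) ** 2))

def r2(y, p):
    """1 minus (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y - p) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Spearman only looks at ranks: the ordering is perfect, so rho is 1
# even though the predicted values are off.
rho, _ = spearmanr(y_true, y_hat)

# A monotone transform of the predictions leaves the ranks, and so rho, unchanged.
rho_squashed, _ = spearmanr(y_true, np.sqrt(y_hat))

# A constant prediction far from the target mean yields a negative R².
r2_bad = r2(y_true, np.full_like(y_true, 2.0))

print(rho, rho_squashed, mse(y_true, y_hat), r2(y_true, y_hat), r2_bad)
```

The example shows why the metrics can disagree: a model can rank perfectly (Spearman of 1) while still having non-zero MSE and an R² below 1.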



In [12]:
# Evaluate the baseline model on the validation set
y_pred = model.predict(X_val)

# Calculate metrics
spearman_score, _ = spearmanr(y_val, y_pred)
mse = mean_squared_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)

print(f"Spearman Correlation: {spearman_score:.5f}")
print(f"Mean Squared Error:   {mse:.5f}")
print(f"R² Score:             {r2:.5f}")


[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    8.7s
[Parallel(n_jobs=2)]: Done  50 out of  50 | elapsed:    9.2s finished


Spearman Correlation: 0.53389
Mean Squared Error:   0.09299
R² Score:             0.31544


In [13]:
# --- BONUS: Generate Submission File (RAM Safe Version) ---
import gc # Garbage Collector for memory management

print("Phase 1: Retraining on full data...")

# 1. Train on the full dataset (X and y are currently in memory)
model.fit(X, y)
print("Training complete.")

# 2. Save the list of columns the model learned!
# We need this later to align the test data
model_columns = X.columns.tolist()

# 3. CRITICAL STEP: Delete training data to free up RAM
# Now that the model is trained, we don't need the data anymore.
del X, y, X_train, X_val, y_train, y_val, data, X_raw, y_raw
gc.collect() # Force memory cleanup
print("Training data deleted from RAM. Memory freed.")

# 4. Load Real Test Data
print("Phase 2: Loading test data...")
X_test_real = pd.read_csv('/content/x_test_final_fIrnA7Q.csv')

# 5. Preprocess Test Data (Independent of Train data)
if id_col in X_test_real.columns:
    X_test_real = X_test_real.drop(id_col, axis=1)
if target_col in X_test_real.columns:
    X_test_real = X_test_real.drop(target_col, axis=1)

# One-Hot Encoding (Only on Test Data)
X_test_real = pd.get_dummies(X_test_real, columns=categorical_cols, dummy_na=False)

# 6. Align Columns (The Magic Step)
# Instead of combining datasets, we simply force the test data
# to have exactly the same columns as the training data.
# - Missing columns are filled with 0.
# - Extra columns in test are dropped.
print("Aligning features...")
# Fill any remaining NaNs with 0, matching the training preprocessing
X_test_ready = X_test_real.reindex(columns=model_columns, fill_value=0).fillna(0)

# Cleanup raw test dataframe to save more RAM
del X_test_real
gc.collect()

# 7. Predict
print("Phase 3: Predicting...")
final_predictions = model.predict(X_test_ready)

# 8. Save
submission = pd.DataFrame(final_predictions, columns=['invalid_ratio'])
submission.index = np.arange(len(submission))
submission.index.name = None
submission.to_csv('submission_baseline.csv', index=True)

print("Success! Submission file 'submission_baseline.csv' saved.")

# Trigger download (Colab only)
try:
    from google.colab import files
    files.download('submission_baseline.csv')
except ImportError:
    print("Download not available automatically.")


Phase 1: Retraining on full data...


[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 10.3min
[Parallel(n_jobs=2)]: Done  50 out of  50 | elapsed: 10.9min finished


Training complete.
Training data deleted from RAM. Memory freed.
Phase 2: Loading test data...
Aligning features...
Phase 3: Predicting...


[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    8.7s
[Parallel(n_jobs=2)]: Done  50 out of  50 | elapsed:    9.5s finished


Success! Submission file 'submission_baseline.csv' saved.
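
The column alignment in step 6 is easiest to see on a toy example. The frames below are made up; the sketch only demonstrates how `reindex` behaves when the test data contains a category the training data never saw (`hour` 5) and lacks categories it did see (`hour` 0 and 2).

```python
import pandas as pd

# Toy train/test frames with different sets of 'hour' values (values are made up).
train_raw = pd.DataFrame({"hour": [0, 1, 2], "x": [1.0, 2.0, 3.0]})
test_raw = pd.DataFrame({"hour": [1, 5], "x": [4.0, 5.0]})

# Encode each frame independently, as in the submission cell.
train_enc = pd.get_dummies(train_raw, columns=["hour"])
test_enc = pd.get_dummies(test_raw, columns=["hour"])

# Force the test frame onto the training column layout:
# columns missing in test are created and filled with 0,
# columns unknown to the model (here hour_5) are dropped.
train_columns = train_enc.columns.tolist()
test_aligned = test_enc.reindex(columns=train_columns, fill_value=0)

print(test_enc.columns.tolist())
print(test_aligned)
```

One consequence worth keeping in mind: a test row whose category was dropped (the `hour` 5 row here) ends up with zeros in all `hour_*` indicator columns.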



In [15]:
# Successful submission!
# Your submission score is: 0.46875158558649865
# Benchmark: 0.1965