## Model Training

Train regression and classification models for shelter occupancy prediction.

**Regression Models:**
- Linear Regression (baseline)
- Random Forest Regressor (production-ready)
- XGBoost Regressor (best performance)

**Classification Models:**
- Logistic Regression (baseline)
- XGBoost Classifier (best performance)
- Random Forest Classifier (backup)


In [37]:
import os
import sys
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from xgboost import XGBClassifier, XGBRegressor

def get_project_root():
    """Find the project root directory by looking for a marker file/directory."""
    # Start from current working directory
    current = Path(os.getcwd())
    
    # Look for project markers (like .git, requirements.txt, or data/ directory)
    markers = ['.git', 'requirements.txt', 'data', 'notebooks']
    
    # Walk up the directory tree
    for path in [current] + list(current.parents):
        # Check if this looks like the project root
        if any((path / marker).exists() for marker in markers):
            return path
    
    # Fallback: if we're in notebooks/, go up one level
    if current.name == 'notebooks':
        return current.parent
    
    # Last resort: current directory
    return current

project_root = get_project_root()
processed_dir = project_root / "data" / "processed"
models_dir = project_root / "models"
models_dir.mkdir(exist_ok=True)

print("Loading train data...")
train_df = pd.read_csv(processed_dir / "train.csv")

# Load feature list (one column name per line; blank lines are skipped)
with open(processed_dir / "feature_list.txt", 'r') as f:
    feature_cols = [line.strip() for line in f if line.strip()]

# Prepare features and targets
X_train = train_df[feature_cols].values
y_train_reg = train_df['OCCUPANCY_RATE_BEDS'].values
y_train_clf = train_df['overcapacity'].values

print(f"Training on {len(X_train)} samples with {len(feature_cols)} features")
print(f"Regression target range: {y_train_reg.min():.2f} to {y_train_reg.max():.2f}")
print(f"Classification target distribution: {np.bincount(y_train_clf)}")


Loading train data...
Training on 132641 samples with 33 features
Regression target range: 3.33 to 101.61
Classification target distribution: [ 25216 107425]
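The class counts printed above show that the positive class (`overcapacity == 1`) is the majority, which matters when choosing class weights. As a rough sketch, the snippet below reproduces the relevant arithmetic from those two counts alone. The variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Class counts copied from the output above: index 0 = negative, index 1 = positive.
counts = np.array([25216, 107425])
n_samples = counts.sum()

# Share of each class in the training set.
shares = counts / n_samples

# Common heuristic for XGBoost's scale_pos_weight: sum(negative) / sum(positive).
neg_pos_ratio = counts[0] / counts[1]

# Formula scikit-learn applies for class_weight='balanced':
# n_samples / (n_classes * bincount(y)).
balanced_weights = n_samples / (len(counts) * counts)

print(f"Class shares:     {shares.round(3)}")
print(f"neg/pos ratio:    {neg_pos_ratio:.3f}")
print(f"Balanced weights: {balanced_weights.round(3)}")
```

A ratio below 1 means the positive class already dominates, so a `scale_pos_weight` above 1 would tilt a model further toward it.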


### Regression Models


In [18]:
# Baseline: Linear Regression
print("1. Training Linear Regression...")
lr_reg = LinearRegression()
lr_reg.fit(X_train, y_train_reg)
joblib.dump(lr_reg, models_dir / "regression_lr.pkl")
print("   ✓ Saved to models/regression_lr.pkl")


1. Training Linear Regression...
   ✓ Saved to models/regression_lr.pkl


In [19]:
# Production: Random Forest Regressor
print("2. Training Random Forest Regressor...")
rf_reg = RandomForestRegressor(
    n_estimators=300,
    max_depth=20,
    min_samples_split=5,
    random_state=42,
    n_jobs=-1
)
rf_reg.fit(X_train, y_train_reg)
joblib.dump(rf_reg, models_dir / "regression_rf.pkl")
print("   ✓ Saved to models/regression_rf.pkl")


2. Training Random Forest Regressor...
   ✓ Saved to models/regression_rf.pkl


In [20]:
# Advanced: XGBoost Regressor
print("3. Training XGBoost Regressor...")
xgb_reg = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=8,
    subsample=0.9,
    colsample_bytree=0.9,
    random_state=42,
    n_jobs=-1
)
xgb_reg.fit(X_train, y_train_reg)
joblib.dump(xgb_reg, models_dir / "regression_xgb.pkl")
print("   ✓ Saved to models/regression_xgb.pkl")


3. Training XGBoost Regressor...
   ✓ Saved to models/regression_xgb.pkl
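Labels such as "baseline" or "best performance" only mean something relative to held-out data, and this section trains the models without scoring them. One way to compare the fitted regressors is sketched below. The helper `evaluate_regressors` and the synthetic data are assumptions made for illustration; in the notebook the models would be `lr_reg`, `rf_reg`, and `xgb_reg`, scored on a validation split that this section does not show.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error


def evaluate_regressors(models, X, y):
    """Return {name: {"rmse": ..., "mae": ...}} for each fitted model (hypothetical helper)."""
    results = {}
    for name, model in models.items():
        preds = model.predict(X)
        results[name] = {
            "rmse": float(np.sqrt(mean_squared_error(y, preds))),
            "mae": float(mean_absolute_error(y, preds)),
        }
    return results


# Synthetic stand-in data so the sketch runs on its own.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_fit, X_val = X_demo[:150], X_demo[150:]
y_fit, y_val = y_demo[:150], y_demo[150:]

demo_model = LinearRegression().fit(X_fit, y_fit)
scores = evaluate_regressors({"lr": demo_model}, X_val, y_val)
print(scores)
```

Passing all three regressors in one dictionary keeps the comparison on identical rows and metrics.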


### Classification Models


In [25]:
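# Baseline: Logistic Regression
# Note: the default lbfgs solver still hits max_iter on the unscaled features (see the warning in the output).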
print("1. Training Logistic Regression...")
log_clf = LogisticRegression(class_weight='balanced', max_iter=10000, random_state=42, n_jobs=-1)
log_clf.fit(X_train, y_train_clf)
joblib.dump(log_clf, models_dir / "classification_lr.pkl")
print("   ✓ Saved to models/classification_lr.pkl")


1. Training Logistic Regression...
   ✓ Saved to models/classification_lr.pkl


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=10000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
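The warning above suggests scaling the data. A minimal sketch of that suggestion follows, assuming a `StandardScaler` placed in front of the classifier in a scikit-learn pipeline. The synthetic features and the name `scaled_clf` are illustrative, and this is not the configuration the notebook actually saved.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Two synthetic features on very different scales, a common cause of slow convergence.
X_demo = np.column_stack([
    rng.normal(0, 1, 500),
    rng.normal(0, 1000, 500),
])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 1000 > 0).astype(int)

# Standardize the features, then fit the classifier on the scaled values.
scaled_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42),
)
scaled_clf.fit(X_demo, y_demo)

# n_iter_ reports how many solver iterations were actually used.
n_iter = scaled_clf.named_steps["logisticregression"].n_iter_[0]
print(f"Solver iterations used: {n_iter}")
```

Saving the whole pipeline with `joblib.dump` would keep the scaler and classifier together, so inference applies the same scaling as training.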


In [22]:
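# Production: XGBoost Classifier
print("2. Training XGBoost Classifier...")
# Class 1 is the majority here, so weight positives by neg/pos (< 1).
# A fixed value above 1 would push the model further toward the majority class.
neg_pos_ratio = (y_train_clf == 0).sum() / (y_train_clf == 1).sum()
xgb_clf = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    scale_pos_weight=neg_pos_ratio,
    random_state=42,
    n_jobs=-1
)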
xgb_clf.fit(X_train, y_train_clf)
joblib.dump(xgb_clf, models_dir / "classification_xgb.pkl")
print("   ✓ Saved to models/classification_xgb.pkl")


2. Training XGBoost Classifier...
   ✓ Saved to models/classification_xgb.pkl


In [23]:
# Backup: Random Forest Classifier
print("3. Training Random Forest Classifier...")
rf_clf = RandomForestClassifier(
    n_estimators=300,
    max_depth=20,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)
rf_clf.fit(X_train, y_train_clf)
joblib.dump(rf_clf, models_dir / "classification_rf.pkl")
print("   ✓ Saved to models/classification_rf.pkl")


3. Training Random Forest Classifier...
   ✓ Saved to models/classification_rf.pkl
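With roughly four positives for every negative in the training set, plain accuracy can look good for a model that always predicts the majority class. The sketch below illustrates why imbalance-aware metrics are a safer way to compare the three classifiers. The synthetic data and variable names are assumptions for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Synthetic stand-in with a positive-heavy target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 2))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=1000) > -0.9).astype(int)

# A "model" that always predicts the majority class.
majority_pred = np.ones_like(y_demo)
plain_acc = (majority_pred == y_demo).mean()
majority_bal_acc = balanced_accuracy_score(y_demo, majority_pred)

# A real classifier, fit on the first 700 rows and scored on the rest.
clf = LogisticRegression(class_weight="balanced").fit(X_demo[:700], y_demo[:700])
val_pred = clf.predict(X_demo[700:])
val_proba = clf.predict_proba(X_demo[700:])[:, 1]
model_bal_acc = balanced_accuracy_score(y_demo[700:], val_pred)
model_auc = roc_auc_score(y_demo[700:], val_proba)

print(f"Majority baseline: accuracy={plain_acc:.3f}, balanced accuracy={majority_bal_acc:.3f}")
print(f"Classifier:        balanced accuracy={model_bal_acc:.3f}, ROC AUC={model_auc:.3f}")
```

Balanced accuracy averages recall over both classes, so the always-positive baseline lands at 0.5 regardless of how skewed the labels are.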


In [24]:
# Save feature columns for inference
joblib.dump(feature_cols, models_dir / "feature_columns.pkl")
print("✓ Saved feature columns to models/feature_columns.pkl")
print("\n✓ All models trained successfully!")


✓ Saved feature columns to models/feature_columns.pkl

✓ All models trained successfully!
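The saved feature list exists so that inference can select columns in the same order used for training. A minimal sketch of loading the artifacts back is shown below. The helper `predict_with_saved_model`, the column names, and the temporary directory are all hypothetical stand-ins; in the project the paths would point at files under `models/`.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression


def predict_with_saved_model(model_path, feature_columns_path, df):
    """Load a saved model and its feature list, then predict on df (hypothetical helper)."""
    model = joblib.load(model_path)
    feature_cols = joblib.load(feature_columns_path)
    missing = [col for col in feature_cols if col not in df.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    # Select columns in training order; .values mirrors how the models were fit.
    return model.predict(df[feature_cols].values)


# Demo with a stand-in model written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    demo_cols = ["feat_a", "feat_b"]
    # Columns deliberately listed in a different order than demo_cols.
    demo_df = pd.DataFrame({"feat_b": [1.0, 2.0, 3.0], "feat_a": [0.0, 1.0, 0.0]})
    demo_target = 2.0 * demo_df["feat_a"] + 3.0 * demo_df["feat_b"]

    demo_model = LinearRegression().fit(demo_df[demo_cols].values, demo_target.values)
    joblib.dump(demo_model, tmp_dir / "model.pkl")
    joblib.dump(demo_cols, tmp_dir / "feature_columns.pkl")

    preds = predict_with_saved_model(
        tmp_dir / "model.pkl", tmp_dir / "feature_columns.pkl", demo_df
    )

print(preds)
```

Because the models were fit on bare NumPy arrays, column order is the only link between a feature name and its position, which is why the check for missing columns runs before prediction.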
