## Notes to the report (not part of delivery)

I first tried a single model only; the code for that attempt is below.

As noted in the code comments, I first tried polynomial degrees 2–6. Degrees 4 and 5 gave nearly the same CV R² (about 0.93), so degree 4 is probably the better choice because it carries a lower risk of overfitting, and that is what I use in `mymodel`. This is worth mentioning in the report to show reflection.
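
To make the overfitting concern concrete, here is a small sketch that counts how many columns `PolynomialFeatures(include_bias=False)` produces for 6 input features at each degree, and compares that with the 560 rows in each training split of a 5-fold CV on 700 samples. The 6 and 700 come from the shape comments in the script below; the helper name `n_poly_terms` is my own and only for illustration.

```python
from math import comb

n_features = 6                           # from the X_train shape comment: (700, 6)
n_samples = 700
n_train_per_fold = n_samples * 4 // 5    # rows in each training split of a 5-fold CV


def n_poly_terms(n_features, degree):
    """Number of monomials of total degree 1..degree (no bias column)."""
    return comb(n_features + degree, degree) - 1


counts = {d: n_poly_terms(n_features, d) for d in range(2, 7)}
for degree, n_terms in counts.items():
    ratio = n_terms / n_train_per_fold
    print(f"degree {degree}: {n_terms} terms ({ratio:.2f} x training rows per fold)")
```

Degree 4 gives 209 terms and degree 5 gives 461. Degree 6 gives 923, which is more than the number of training rows, so at that point the fit depends entirely on the regularization.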

In [None]:
# train_regression.py
# -------------------
# 1) Reads the training data
# 2) Does model selection with GridSearchCV (PolynomialFeatures + StandardScaler + Ridge)
# 3) Trains the best model on the full dataset
# 4) Saves the model (best_model.pkl) + a small JSON file with the best hyperparameters

import json, joblib, numpy as np
from pathlib import Path
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge

import os
print("Current working directory:", os.getcwd())


# Data 
base = Path(__file__).resolve().parent  # path to this script
X = np.load(base / "X_train.npy")   # shape (700, 6)
y = np.load(base / "Y_train.npy")   # shape (700,)

# Pipeline and hyperparameters
pipe = Pipeline([
    ("poly",   PolynomialFeatures(include_bias=False)),
    ("scaler", StandardScaler()),
    ("ridge",  Ridge())
])

param_grid = {
    # Originally [2, 3, 4, 5, 6]; degrees 4 and 5 gave nearly the same CV R^2,
    # so 4 is probably better because of its lower overfitting risk.
    "poly__degree": [2, 3, 4],
    "ridge__alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000, 10000],  # regularization strength
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
gs = GridSearchCV(pipe, param_grid, cv=cv, scoring="r2", refit=True, n_jobs=-1)

print("Running GridSearchCV...")
gs.fit(X, y)
print("Best params:", gs.best_params_, " | CV R^2:", gs.best_score_)

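# Train the best model on the full dataset.
# Note: refit=True has already refitted it on all of X, y, so the explicit fit below is redundant but harmless.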
best_model = gs.best_estimator_
best_model.fit(X, y)

# Save model and metadata
joblib.dump(best_model, base / "best_model.pkl")
with open(base / "best_model_info.json", "w") as f:
    json.dump({
        "best_params": gs.best_params_,
        "best_cv_r2": gs.best_score_,
        "model_family": "PolynomialFeatures + StandardScaler + Ridge"
    }, f, indent=2)

print("Saved: best_model.pkl and best_model_info.json")


Next, I added two more models and compared them. The new code checks the three models against each other and saves the best one to `best_model.pkl`, which is then loaded by `mymodel.py`. The new code is in the next code cell.
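
For context, here is a minimal sketch of how a script like `mymodel.py` could load the saved pipeline and predict. The actual contents of `mymodel.py` are not shown in these notes, so the helper `load_and_predict`, the synthetic data, and the stand-in pipeline are all assumptions. Only `joblib` and the file name `best_model.pkl` are taken from the training code.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def load_and_predict(model_path, X_new):
    """Hypothetical helper: load a pickled pipeline and return its predictions."""
    model = joblib.load(model_path)
    return model.predict(np.asarray(X_new, dtype=float))


# Stand-in for the training script: fit a small pipeline on synthetic data.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 6))
y_demo = X_demo.sum(axis=1)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("kr", KernelRidge(kernel="rbf", alpha=0.01, gamma=0.1)),
])
pipe.fit(X_demo, y_demo)

# Save to a temporary directory, then load and predict as a consumer script would.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "best_model.pkl"
    joblib.dump(pipe, model_path)
    preds = load_and_predict(model_path, X_demo[:5])

print(preds.shape)
```

Because the whole `Pipeline` is pickled, the loaded object applies the same scaling as at training time, so the caller passes raw features.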

In [None]:
# Tests 3 model families (Poly+Ridge, Poly+Lasso, KernelRidge RBF).
# Saves best_model.pkl (the winner), best_model_info.json (structured),
# and cv_results.txt (I added this log to use when writing the report, not necessarily for the model).

import json, joblib, numpy as np
from pathlib import Path
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge, Lasso
from sklearn.kernel_ridge import KernelRidge

# --- Load data ---
base = Path(__file__).resolve().parent
X = np.load(base / "X_train.npy")
y = np.load(base / "Y_train.npy")

cv = KFold(n_splits=5, shuffle=True, random_state=42) 
# 42 = the answer to life, the universe, and everything (a reference to The Hitchhiker's Guide to the Galaxy,
# our inside joke when choosing random states :) It is "random" but fixed, so results are reproducible.
results = {}

def summarize_gs(gs):
    """Pick out best score/params + std for the best row."""
    idx = gs.best_index_
    mean = float(gs.cv_results_["mean_test_score"][idx])
    std  = float(gs.cv_results_["std_test_score"][idx])
    return {
        "best_params": gs.best_params_,
        "cv_r2_mean": mean,
        "cv_r2_std": std,
    }

# Model 1: Poly+Ridge 
pipe_ridge = Pipeline([
    ("poly",   PolynomialFeatures(include_bias=False)),
    ("scaler", StandardScaler()),
    ("ridge",  Ridge())
])
grid_ridge = {
    "poly__degree": [2, 3, 4], # 4 and 5 were very close, but 4 is likely better to avoid overfitting
    "ridge__alpha": [1e-3, 1e-2, 1e-1, 1, 10, 100],
}
gs_ridge = GridSearchCV(pipe_ridge, grid_ridge, cv=cv, scoring="r2", refit=True, n_jobs=-1)
gs_ridge.fit(X, y)
results["poly_ridge"] = summarize_gs(gs_ridge)

# Model 2: Poly+Lasso 
pipe_lasso = Pipeline([
    ("poly",   PolynomialFeatures(include_bias=False)),
    ("scaler", StandardScaler()),
    ("lasso",  Lasso(max_iter=50000, tol=1e-3, selection="cyclic")) # Max iter increased to ensure convergence (It didn't converge with default 1000)
])
grid_lasso = {
    "poly__degree": [2, 3, 4], # 4 and 5 were very close, but 4 is likely better to avoid overfitting (see notes.ipynb)
    "lasso__alpha": [1e-3, 1e-2, 1e-1, 1, 10, 100], 
}
gs_lasso = GridSearchCV(pipe_lasso, grid_lasso, cv=cv, scoring="r2", refit=True, n_jobs=-1)
gs_lasso.fit(X, y)
results["poly_lasso"] = summarize_gs(gs_lasso)

# Model 3: Kernel Ridge (RBF)
pipe_kr = Pipeline([
    ("scaler", StandardScaler()),
    ("kr",     KernelRidge(kernel="rbf"))
])
grid_kr = {
    "kr__alpha": [0.01, 0.1, 1.0, 10.0],
    "kr__gamma": [1e-3, 1e-2, 1e-1, 1.0],
}
gs_kr = GridSearchCV(pipe_kr, grid_kr, cv=cv, scoring="r2", refit=True, n_jobs=-1)
gs_kr.fit(X, y)
results["kernel_ridge_rbf"] = summarize_gs(gs_kr)

# We select the best model based on CV R^2
best_name = max(results.keys(), key=lambda k: results[k]["cv_r2_mean"])
best_est  = {
    "poly_ridge": gs_ridge.best_estimator_,
    "poly_lasso": gs_lasso.best_estimator_,
    "kernel_ridge_rbf": gs_kr.best_estimator_,
}[best_name]

best_est.fit(X, y)
joblib.dump(best_est, base / "best_model.pkl")

# JSON log; source of inspiration: https://stackoverflow.com/questions/12309269/how-do-i-write-json-data-to-a-file
with open(base / "best_model_info.json", "w") as f:
    json.dump({
        "winner": best_name,
        "results": results,
        "notes": "cv_r2_mean ± cv_r2_std with 5-fold KFold(shuffle, rs=42). "
                 "best_model.pkl is trained on all 700."
    }, f, indent=2)

# Text log for easy reading when we are writing the report
log_path = base / "cv_results.txt"
with open(log_path, "w") as f:
    f.write("CV Results (5-fold, R^2)\n")
    for k, info in results.items():
        mean = info["cv_r2_mean"]; std = info["cv_r2_std"]
        f.write(f"{k:>18s}: {mean:.4f} ± {std:.4f} | params: {info['best_params']}\n")
    f.write(f"\nWinner: {best_name}\n")

print(f"\nWinner: {best_name} --> saved to best_model.pkl")
print("Details saved to best_model_info.json and cv_results.txt")




### Model comparison (5-fold CV, R²)

Below are the results from GridSearchCV (`KFold` with 5 splits, `shuffle=True`, `random_state=42`). I compared three models:  
- **Poly + Ridge** (polynomial features + standardization + Ridge)  
- **Poly + Lasso** (polynomial features + standardization + Lasso)  
- **Kernel Ridge (RBF)** (standardization + RBF kernel)

| Model              | Best hyperparameters      | R² (mean ± std)     |
|--------------------|---------------------------|---------------------|
| Poly + Ridge       | degree = 4, alpha = 10    | **0.9331 ± 0.0142** |
| Poly + Lasso       | degree = 4, alpha = 0.01  | **0.9414 ± 0.0067** |
| Kernel Ridge (RBF) | alpha = 0.01, gamma = 0.1 | **0.9829 ± 0.0028** |

**Observations / interpretation**

- Kernel Ridge (RBF) scores clearly highest on CV R² (0.983 versus 0.933–0.941) and also has the lowest spread across folds (std ≈ 0.003). This suggests that the RBF kernel captures a more complex but relatively "smooth" non-linear relationship in the data.  
- Poly + Ridge and Poly + Lasso score lower but still high (≈0.93–0.94). Lasso does slightly better than Ridge here, probably because ℓ1 regularization shrinks some coefficients exactly to zero when the number of polynomial terms gets large.  
- Degree 4 gave the best balance for the polynomial models in my setup. I also tested higher degrees and saw only marginal improvements, at the cost of increased complexity.  
- Based on the CV results, I chose **Kernel Ridge (RBF)** as the final model for the submission. (It is then trained on all 700 samples before being saved as `best_model.pkl`.)
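
As a rough illustration of what the chosen model computes, the sketch below implements the standard closed form of kernel ridge regression with an RBF kernel in plain NumPy: solve (K + alpha·I) c = y, then predict with k(x_new, X) c. It uses a synthetic sine curve, not the assignment data, and the function names are hypothetical. The values alpha = 0.01 and gamma = 0.1 are borrowed from the table above but applied to different data, so nothing about the reported scores should be read from it.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """k(a, b) = exp(-gamma * ||a - b||^2) for every row pair of A and B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))  # clip tiny negative round-off


def fit_kernel_ridge(X, y, alpha, gamma):
    """Solve (K + alpha * I) c = y for the dual coefficients c."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + alpha * np.eye(len(X)), y)


def predict_kernel_ridge(X_train, dual_coef, X_new, gamma):
    """Prediction is a kernel-weighted sum over the training points."""
    return rbf_kernel(X_new, X_train, gamma) @ dual_coef


# Synthetic, noise-free demo data (an assumption, not the assignment data).
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3.0, 3.0, size=(200, 1))
y_demo = np.sin(X_demo[:, 0])

dual = fit_kernel_ridge(X_demo, y_demo, alpha=0.01, gamma=0.1)
X_grid = np.linspace(-2.5, 2.5, 11).reshape(-1, 1)
pred = predict_kernel_ridge(X_demo, dual, X_grid, gamma=0.1)
max_err = np.max(np.abs(pred - np.sin(X_grid[:, 0])))
print(f"max abs error on grid: {max_err:.4f}")
```

A smaller `gamma` widens the kernel and gives a smoother fit, while `alpha` controls how strongly the dual coefficients are shrunk. Those are the two knobs the grid search tunes for this model.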

*Technical details:*  
- CV setup: `KFold(n_splits=5, shuffle=True, random_state=42)`, `scoring="r2"`.  
- All models run inside a `Pipeline` to avoid data leakage (scaling and feature generation are fitted on the training fold only).
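
The leakage point can be shown without scikit-learn. This is a minimal NumPy sketch, on synthetic data and with a plain closed-form ridge as a stand-in model, of what the `Pipeline` does inside each CV split: the scaling statistics are computed from the training rows only and then reused on the validation rows. The data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 6))   # synthetic stand-in features
y = X @ np.arange(1.0, 7.0) + rng.normal(scale=0.5, size=100)

folds = np.array_split(rng.permutation(len(X)), 5)  # shuffled 5-fold split
alpha = 1.0                                         # ridge penalty for the stand-in model
scores = []

for k, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])

    # Scaling statistics come from the training rows only (no leakage).
    mu = X[train_idx].mean(axis=0)
    sigma = X[train_idx].std(axis=0)
    X_tr = (X[train_idx] - mu) / sigma
    X_va = (X[val_idx] - mu) / sigma                # reuse the training statistics

    # Closed-form ridge on the centered target; the intercept is the training mean.
    y_mean = y[train_idx].mean()
    w = np.linalg.solve(
        X_tr.T @ X_tr + alpha * np.eye(X.shape[1]),
        X_tr.T @ (y[train_idx] - y_mean),
    )
    pred = X_va @ w + y_mean

    # R^2 on the held-out fold.
    ss_res = np.sum((y[val_idx] - pred) ** 2)
    ss_tot = np.sum((y[val_idx] - y[val_idx].mean()) ** 2)
    scores.append(1.0 - ss_res / ss_tot)

print(f"CV R^2: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
```

Computing `mu` and `sigma` on all rows before splitting would let the validation fold influence preprocessing. Putting the scaler inside the `Pipeline` passed to `GridSearchCV` rules that out.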

In [None]:
{
  "winner": "kernel_ridge_rbf",
  "results": {
    "poly_ridge": {
      "best_params": {
        "poly__degree": 4,
        "ridge__alpha": 10
      },
      "cv_r2_mean": 0.933108596320227,
      "cv_r2_std": 0.014155560800823133
    },
    "poly_lasso": {
      "best_params": {
        "lasso__alpha": 0.01,
        "poly__degree": 4
      },
      "cv_r2_mean": 0.9413730769574187,
      "cv_r2_std": 0.0067469533124386045
    },
    "kernel_ridge_rbf": {
      "best_params": {
        "kr__alpha": 0.01,
        "kr__gamma": 0.1
      },
      "cv_r2_mean": 0.9828514788918407,
      "cv_r2_std": 0.002842091276182927
    }
  },
  "notes": "cv_r2_mean \u00b1 cv_r2_std with 5-fold KFold(shuffle, rs=42). best_model.pkl is trained on all 700."
}