# Codeforces Problem Rating Predictor (Machine Learning Project)

This notebook outlines the steps for building a regression model to predict the difficulty rating of a Codeforces competitive programming problem based on its features, such as algorithmic tags, problem index, and number of successful submissions.

---

## I. Data Acquisition and Preparation

This section handles connecting to the Codeforces API, fetching the problems data, merging problem details with statistics, and saving the raw datasets.

### Cell 1: Import Necessary Libraries

We import the fundamental libraries required for the project, including utilities for API calls, data handling, and the core ML components used throughout the notebook.

In [1]:
import requests
import pandas as pd
import numpy as np
import os
import re 
from sklearn.preprocessing import MultiLabelBinarizer # For one-hot encoding tags

### Cell 2: Create Data Storage Folder

A directory named `data` is created to store the fetched raw files and processed CSVs. `exist_ok=True` prevents errors if the folder already exists.

In [2]:
os.makedirs("data", exist_ok=True)

### Cell 3: Fetch Data from Codeforces API

This cell calls the Codeforces `problemset.problems` API endpoint. It fetches two main lists: `problems` (containing ratings and tags) and `problemStatistics` (containing solved counts). A check ensures the API request was successful.

In [3]:
url = "https://codeforces.com/api/problemset.problems"
resp = requests.get(url)
data = resp.json()

if data['status'] != 'OK':
    raise Exception("API request failed")

problems = data['result']['problems']
stats = data['result']['problemStatistics']


### Cells 4 & 5: Convert to DataFrame and Merge

Lists are converted to pandas DataFrames (`df_problems`, `df_stats`) and merged into a single master DataFrame (`df`) based on the unique problem identifiers: **`contestId`** and **`index`**.

In [4]:
df_problems = pd.DataFrame(problems)
df_stats = pd.DataFrame(stats)


In [5]:
df = pd.merge(df_problems, df_stats, on=['contestId', 'index'])

### Cell 6: Initial Data Inspection

Viewing the head confirms the successful merge. We can see that many problems lack a `rating` (NaN) and the `tags` column is stored as a list of strings.

In [6]:
df.head(20)

Unnamed: 0,contestId,index,name,type,tags,points,rating,solvedCount
0,2162,H,Beautiful Problem,PROGRAMMING,[dp],,,143
1,2162,G,Beautiful Tree,PROGRAMMING,"[constructive algorithms, math, probabilities,...",,,2328
2,2162,F,Beautiful Intervals,PROGRAMMING,"[constructive algorithms, greedy]",,,2372
3,2162,E,Beautiful Palindromes,PROGRAMMING,"[constructive algorithms, greedy, schedules]",,,6618
4,2162,D,Beautiful Permutation,PROGRAMMING,"[binary search, interactive]",,,10258
5,2162,C,Beautiful XOR,PROGRAMMING,"[bitmasks, constructive algorithms, greedy]",,,16736
6,2162,B,Beautiful String,PROGRAMMING,"[brute force, constructive algorithms]",,,21314
7,2162,A,Beautiful Average,PROGRAMMING,"[brute force, greedy]",,,31419
8,2160,C,Reverse XOR,PROGRAMMING,[bitmasks],1250.0,,11593
9,2160,B,Distinct Elements,PROGRAMMING,"[greedy, math]",1000.0,,13926


### Cell 7: Save Full Dataset

The complete dataset, including both rated and unrated problems, is saved to CSV for archival and easy reloading.

In [7]:
df.to_csv("data/codeforces_problems_full.csv", index=False)
print("Saved full dataset to data/codeforces_problems_full.csv")

Saved full dataset to data/codeforces_problems_full.csv


### Cell 8: Filter and Save Rated Problems (Training Data)

Problems with missing `rating` values are dropped. The resulting `df_rated` is the **training dataset** for the model, as it contains our target variable.

In [8]:
df_rated = df.dropna(subset=['rating'])
df_rated.to_csv("data/codeforces_problems_rated.csv", index=False)
print("Saved rated problems to data/codeforces_problems_rated.csv")

Saved rated problems to data/codeforces_problems_rated.csv


### Cell 9: Rated Data Head Check

We check the head of `df_rated` to ensure all rows now contain a non-NaN value in the `rating` column.

In [9]:
df_rated.head(5)

Unnamed: 0,contestId,index,name,type,tags,points,rating,solvedCount
25,2155,F,Juan's Colorful Tree,PROGRAMMING,"[data structures, dfs and similar, dfs and sim...",3000.0,2800.0,312
26,2155,E,Mimo & Yuyu,PROGRAMMING,"[games, greedy, math]",2250.0,2200.0,1648
27,2155,D,Batteries,PROGRAMMING,"[brute force, constructive algorithms, graph m...",2000.0,1800.0,4447
28,2155,C,The Ancient Wizards' Capes,PROGRAMMING,"[brute force, greedy, implementation]",1750.0,1500.0,7398
29,2155,B,Abraham's Great Escape,PROGRAMMING,"[constructive algorithms, graphs]",1000.0,1100.0,13405


### Cell 10: Filter and Save Unrated Problems

Problems with missing `rating` are isolated into `df_unrated`. This dataset is reserved for making final predictions.

In [10]:
df_unrated = df[df['rating'].isna()]
df_unrated.to_csv("data/codeforces_problems_unrated.csv", index=False)
print("Saved unrated problems to data/codeforces_problems_unrated.csv")


Saved unrated problems to data/codeforces_problems_unrated.csv


### Cell 9: Unrated Data Head Check

We check the head of `df_unrated` to ensure all rows now contain a NaN value in the `rating` column.

In [11]:
df_unrated.head(5)

Unnamed: 0,contestId,index,name,type,tags,points,rating,solvedCount
0,2162,H,Beautiful Problem,PROGRAMMING,[dp],,,143
1,2162,G,Beautiful Tree,PROGRAMMING,"[constructive algorithms, math, probabilities,...",,,2328
2,2162,F,Beautiful Intervals,PROGRAMMING,"[constructive algorithms, greedy]",,,2372
3,2162,E,Beautiful Palindromes,PROGRAMMING,"[constructive algorithms, greedy, schedules]",,,6618
4,2162,D,Beautiful Permutation,PROGRAMMING,"[binary search, interactive]",,,10258


### Cell 12 & 13: Unrated and Full Data Checks

Final confirmation that `df_unrated` only contains problems with `NaN` rating, and a check on the original `df`.

In [12]:
df_unrated.head()

Unnamed: 0,contestId,index,name,type,tags,points,rating,solvedCount
0,2162,H,Beautiful Problem,PROGRAMMING,[dp],,,143
1,2162,G,Beautiful Tree,PROGRAMMING,"[constructive algorithms, math, probabilities,...",,,2328
2,2162,F,Beautiful Intervals,PROGRAMMING,"[constructive algorithms, greedy]",,,2372
3,2162,E,Beautiful Palindromes,PROGRAMMING,"[constructive algorithms, greedy, schedules]",,,6618
4,2162,D,Beautiful Permutation,PROGRAMMING,"[binary search, interactive]",,,10258


In [13]:
df.head()

Unnamed: 0,contestId,index,name,type,tags,points,rating,solvedCount
0,2162,H,Beautiful Problem,PROGRAMMING,[dp],,,143
1,2162,G,Beautiful Tree,PROGRAMMING,"[constructive algorithms, math, probabilities,...",,,2328
2,2162,F,Beautiful Intervals,PROGRAMMING,"[constructive algorithms, greedy]",,,2372
3,2162,E,Beautiful Palindromes,PROGRAMMING,"[constructive algorithms, greedy, schedules]",,,6618
4,2162,D,Beautiful Permutation,PROGRAMMING,"[binary search, interactive]",,,10258


## II. Feature Engineering

This section loads the rated data and transforms categorical and text-based columns into a numeric format suitable for machine learning models.

### Cell 14: Load Rated Data for Processing 

We reload the `df_rated` to ensure we have a clean DataFrame before feature transformations begin. Note that the `tags` column may appear as a string representation of a list when loaded from CSV.

In [14]:
# ============================================================
# Step 1: Load rated dataset
# ============================================================
import pandas as pd

df_rated = pd.read_csv("data/codeforces_problems_rated.csv")
df_rated.head()


Unnamed: 0,contestId,index,name,type,tags,points,rating,solvedCount
0,2155,F,Juan's Colorful Tree,PROGRAMMING,"['data structures', 'dfs and similar', 'dfs an...",3000.0,2800.0,312
1,2155,E,Mimo & Yuyu,PROGRAMMING,"['games', 'greedy', 'math']",2250.0,2200.0,1648
2,2155,D,Batteries,PROGRAMMING,"['brute force', 'constructive algorithms', 'gr...",2000.0,1800.0,4447
3,2155,C,The Ancient Wizards' Capes,PROGRAMMING,"['brute force', 'greedy', 'implementation']",1750.0,1500.0,7398
4,2155,B,Abraham's Great Escape,PROGRAMMING,"['constructive algorithms', 'graphs']",1000.0,1100.0,13405


### Cell 15: Transform Features (Tags & Index) 🛠️

This is the core feature engineering step:

1.  **Tags Parsing (Multi-Label):** A robust function (`parse_tags_string`) is defined and applied to convert the string list representations of tags into actual Python lists, which is necessary for multi-label processing.
2.  **One-Hot Encoding:** The **`MultiLabelBinarizer`** converts the tag lists into binary feature columns, where each unique tag becomes a feature (0/1).
3.  **Index Conversion:** The single-letter problem `index` (e.g., 'A', 'B', 'C') is converted into a numeric rank **`index_num`** (1, 2, 3...) to capture the difficulty curve within a contest.
4.  The original `tags` and intermediate `tags_list` columns are dropped.

In [15]:
# ============================================================
# Step 2: Feature Engineering
# - One-hot encode tags
# - Convert index (A, B, C, ...) to numeric
# ============================================================

# 1. Prepare Tags for MultiLabelBinarizer
def parse_tags_string(tags_str):
    """Parses a tags string '[tag1, tag2]' into a list ['tag1', 'tag2']."""
    if not isinstance(tags_str, str):
        return []

    # Remove surrounding brackets and split by comma + optional space
    content = tags_str.strip().strip('[]')
    if not content:
        return []

    # Split by comma and clean up each tag
    tags_list = [tag.strip() for tag in re.split(r',\s*', content) if tag.strip()]
    return tags_list

# Apply the parsing function to create a new 'tags_list' column
df_rated['tags_list'] = df_rated['tags'].apply(parse_tags_string)

# 2. One-hot encode tags using MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Use the 'tags_list' column for transformation
tags_encoded = mlb.fit_transform(df_rated['tags_list'])
tags_df = pd.DataFrame(tags_encoded, columns=mlb.classes_, index=df_rated.index)

# Add encoded tags to the main dataframe
df_rated = pd.concat([df_rated, tags_df], axis=1)

# Drop the original 'tags' and intermediate 'tags_list' columns
df_rated = df_rated.drop(columns=['tags', 'tags_list'], errors='ignore')

# 3. Convert problem index to numeric: A=1, B=2, ...
df_rated['index_num'] = df_rated['index'].astype(str).str[0].str.upper().apply(lambda x: ord(x) - 64)


### Cell 16: Check Unique Tag Features

Displaying `mlb.classes_` confirms all unique tags extracted from the dataset are now the column headers for the one-hot encoded features.

In [16]:
print(mlb.classes_)

["'*special'" "'2-sat'" "'binary search'" "'bitmasks'" "'brute force'"
 "'chinese remainder theorem'" "'combinatorics'"
 "'constructive algorithms'" "'data structures'" "'dfs and similar'"
 "'divide and conquer'" "'dp'" "'dsu'" "'expression parsing'" "'fft'"
 "'flows'" "'games'" "'geometry'" "'graph matchings'" "'graphs'"
 "'greedy'" "'hashing'" "'implementation'" "'interactive'" "'math'"
 "'matrices'" "'meet-in-the-middle'" "'number theory'" "'probabilities'"
 "'schedules'" "'shortest paths'" "'sortings'"
 "'string suffix structures'" "'strings'" "'ternary search'" "'trees'"
 "'two pointers'"]


### Cell 17: Check First Row's Final Feature Vector

Inspects the first row of the processed data to verify that `index_num` has been correctly computed and the tag columns contain `1` for present tags and `0` for absent tags.

In [17]:
df_rated.head().iloc[0]

contestId                                      2155
index                                             F
name                           Juan's Colorful Tree
type                                    PROGRAMMING
points                                       3000.0
rating                                       2800.0
solvedCount                                     312
'*special'                                        0
'2-sat'                                           0
'binary search'                                   0
'bitmasks'                                        0
'brute force'                                     0
'chinese remainder theorem'                       0
'combinatorics'                                   0
'constructive algorithms'                         0
'data structures'                                 1
'dfs and similar'                                 1
'divide and conquer'                              0
'dp'                                              0
'dsu'       

## III. Model Training and Evaluation

This section prepares the final feature matrix $\mathbf{X}$ and target vector $\mathbf{y}$, splits the data, and trains several regression models to find the best predictor for the problem rating.

### Cell 18 & 19: Prepare Features (X) and Target (y)

1.  The `rating` column is isolated as the prediction target ($\mathbf{y}$).
2.  Non-numeric, non-feature columns (`index`, `name`, `type`) are dropped to create the feature matrix $\mathbf{X}$.
3.  The head of the final feature matrix $\mathbf{X}$ is displayed.

In [18]:
# ============================================================
# Step 3: Prepare features and target
# ============================================================

# Drop columns not used for ML
X = df_rated.drop(columns=['index','name','rating','type'])
y = df_rated['rating']


In [19]:
X.head(5)

Unnamed: 0,contestId,points,solvedCount,'*special','2-sat','binary search','bitmasks','brute force','chinese remainder theorem','combinatorics',...,'probabilities','schedules','shortest paths','sortings','string suffix structures','strings','ternary search','trees','two pointers',index_num
0,2155,3000.0,312,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,6
1,2155,2250.0,1648,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5
2,2155,2000.0,4447,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,4
3,2155,1750.0,7398,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,3
4,2155,1000.0,13405,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2


### Cell 20: Split Dataset (Train / CV / Test)

The data is split into three subsets using two applications of `train_test_split` with `random_state=42` for reproducibility:

1.  **Train (60%):** Used for fitting models.
2.  **Cross-Validation (CV) (20%):** Used for hyperparameter tuning and model selection.
3.  **Test (20%):** Used for final, unbiased evaluation.

Any remaining NaN values (primarily in `points`) are imputed with `0` in all splits.

In [20]:
# ============================================================
# Step 4: Split dataset into train / cross-validation / test
# ============================================================

from sklearn.model_selection import train_test_split

RANDOM_STATE = 42

# Step 4a: Split train+cv vs test (80% train+cv, 20% test)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

# Step 4b: Split train vs cross-validation (75% train, 25% cv of temp → 60% train, 20% cv)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=RANDOM_STATE
)

# Replace NaNs with 0 in all splits (mostly affects 'points')
X_train = X_train.fillna(0)
X_cv = X_cv.fillna(0)
X_test = X_test.fillna(0)


# Check sizes
print("Train set:", X_train.shape)
print("Cross-validation set:", X_cv.shape)
print("Test set:", X_test.shape)


Train set: (6271, 41)
Cross-validation set: (2091, 41)
Test set: (2091, 41)


### Cells 25-27: Inspect Data Splits

Verify the heads of the split feature datasets ($X_{\text{train}}, X_{\text{cv}}, X_{\text{test}}$) to confirm they align with the expected feature set.

In [21]:
X_train.head()

Unnamed: 0,contestId,points,solvedCount,'*special','2-sat','binary search','bitmasks','brute force','chinese remainder theorem','combinatorics',...,'probabilities','schedules','shortest paths','sortings','string suffix structures','strings','ternary search','trees','two pointers',index_num
2651,1716,0.0,1266,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6
2461,1748,2750.0,369,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,6
6052,1032,1000.0,5078,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
3423,1571,0.0,360,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,4
4167,1423,0.0,233,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,7


In [22]:
X_cv.head()

Unnamed: 0,contestId,points,solvedCount,'*special','2-sat','binary search','bitmasks','brute force','chinese remainder theorem','combinatorics',...,'probabilities','schedules','shortest paths','sortings','string suffix structures','strings','ternary search','trees','two pointers',index_num
1232,1957,500.0,24992,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
5420,1166,1000.0,11802,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
4741,1304,2000.0,5633,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,5
8637,429,500.0,11445,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
6703,888,0.0,8530,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,5


In [23]:
X_test.head()

Unnamed: 0,contestId,points,solvedCount,'*special','2-sat','binary search','bitmasks','brute force','chinese remainder theorem','combinatorics',...,'probabilities','schedules','shortest paths','sortings','string suffix structures','strings','ternary search','trees','two pointers',index_num
6134,1012,2500.0,356,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5
5270,1195,3500.0,977,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,6
3210,1613,0.0,22914,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,2
3643,1526,1000.0,27065,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
2908,1672,1250.0,894,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,6


### Cell 28: Model Comparison and Selection

This crucial cell runs the machine learning pipeline:

1.  **Utilities:** Helper functions like `round_to_nearest_100` are defined to handle Codeforces' 100-point rating steps.
2.  **Training & Tuning:** Multiple models (Linear, Ridge, Polynomial, Decision Tree, Random Forest, and XGBoost) are trained and tuned using **`GridSearchCV`** on the **training set**.
3.  **Cross-Validation Evaluation:** Models are evaluated on the **CV set**. The best model is selected based on the lowest **RMSE on rounded predictions**.
4.  **Final Test Evaluation:** The best model (**XGBoostRegressor**) is retrained on the combined Train+CV set and evaluated on the reserved **Test Set** to get the final, unbiased performance metrics.

In [None]:
import warnings
warnings.filterwarnings("ignore")

from pathlib import Path
import joblib

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Optional XGBoost
try:
    from xgboost import XGBRegressor
    XGB_PRESENT = True
except Exception:
    XGB_PRESENT = False

# Create folder for models/results
Path("models").mkdir(exist_ok=True)

RANDOM_STATE = 42

# ---------------------------
# Utilities
# ---------------------------
def round_to_nearest_100(preds):
    """Round predictions to nearest 100. 850 -> 900 (ties upward)."""
    a = np.array(preds)
    return ((a + 50) // 100 * 100).astype(int)

def rmse(a, b):
    return np.sqrt(mean_squared_error(a, b))

def evaluate_continuous_and_rounded(y_true, y_pred):
    """Return a dict of metrics for raw and rounded predictions."""
    y_true_arr = np.array(y_true)
    raw_rmse = rmse(y_true_arr, y_pred)
    pred_rounded = round_to_nearest_100(y_pred)
    rounded_rmse = rmse(y_true_arr, pred_rounded)
    rounded_rmse_hundreds = rounded_rmse / 100.0
    mae_steps = np.mean(np.abs(pred_rounded - y_true_arr) / 100.0)
    exact_match = np.mean(pred_rounded == y_true_arr)
    within_one_step = np.mean(np.abs(pred_rounded - y_true_arr) <= 100)
    return {
        "rmse_raw": float(raw_rmse),
        "rmse_rounded": float(rounded_rmse),
        "rmse_hundreds": float(rounded_rmse_hundreds),
        "mae_steps": float(mae_steps),
        "exact_match": float(exact_match),
        "within_±1_step": float(within_one_step)
    }

# ---------------------------
# Prepare numeric column list (used for polynomial features)
# ---------------------------
numeric_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()

# ---------------------------
# Model definitions & simple tuning
# - Train on X_train
# - Evaluate on X_cv
# ---------------------------
models_info = []

# 1) Linear Regression (scaled)
lr = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LinearRegression())
])
lr.fit(X_train.fillna(0), y_train)
pred_cv = lr.predict(X_cv.fillna(0))
models_info.append({
    "name": "LinearRegression",
    "estimator": lr,
    **evaluate_continuous_and_rounded(y_cv, pred_cv)
})

# 2) Ridge (small grid search)
ridge_pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge(random_state=RANDOM_STATE))])
ridge_grid = {"ridge__alpha": [0.1, 1.0, 10.0, 50.0]}
ridge_search = GridSearchCV(ridge_pipe, ridge_grid, cv=3, scoring="neg_mean_squared_error", n_jobs=-1)
ridge_search.fit(X_train.fillna(0), y_train)
best_ridge = ridge_search.best_estimator_
pred_cv = best_ridge.predict(X_cv.fillna(0))
models_info.append({
    "name": "Ridge",
    "estimator": best_ridge,
    "best_params": ridge_search.best_params_,
    **evaluate_continuous_and_rounded(y_cv, pred_cv)
})

# 3 & 4) Polynomial regression (degree 2 & 3) on numeric columns only
for deg in (2, 3):
    poly_pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=deg, include_bias=False)),
        ("scaler", StandardScaler()),
        ("lr", LinearRegression())
    ])
    X_train_num = X_train[numeric_cols].fillna(0)
    X_cv_num = X_cv[numeric_cols].fillna(0)
    poly_pipe.fit(X_train_num, y_train)
    pred_cv = poly_pipe.predict(X_cv_num)
    models_info.append({
        "name": f"Polynomial_deg{deg}",
        "estimator": poly_pipe,
        "used_cols": numeric_cols,
        **evaluate_continuous_and_rounded(y_cv, pred_cv)
    })

# 5) Decision Tree Regressor (small grid)
dt_params = {"max_depth": [5, 10, 20, None], "min_samples_leaf": [1, 5]}
dt_search = GridSearchCV(DecisionTreeRegressor(random_state=RANDOM_STATE), dt_params, cv=3,
                         scoring="neg_mean_squared_error", n_jobs=-1)
dt_search.fit(X_train.fillna(0), y_train)
best_dt = dt_search.best_estimator_
pred_cv = best_dt.predict(X_cv.fillna(0))
models_info.append({
    "name": "DecisionTreeRegressor",
    "estimator": best_dt,
    "best_params": dt_search.best_params_,
    **evaluate_continuous_and_rounded(y_cv, pred_cv)
})

# 6) RandomForestRegressor (small grid)
rf_params = {"n_estimators": [50, 100], "max_depth": [10, 20, None], "min_samples_leaf": [1, 5]}
rf_search = GridSearchCV(RandomForestRegressor(random_state=RANDOM_STATE, n_jobs=-1),
                         rf_params, cv=3, scoring="neg_mean_squared_error", n_jobs=-1)
rf_search.fit(X_train.fillna(0), y_train)
best_rf = rf_search.best_estimator_
pred_cv = best_rf.predict(X_cv.fillna(0))
models_info.append({
    "name": "RandomForestRegressor",
    "estimator": best_rf,
    "best_params": rf_search.best_params_,
    **evaluate_continuous_and_rounded(y_cv, pred_cv)
})

# 7) XGBoost (if available)
if XGB_PRESENT:
    xgb_params = {"n_estimators": [50, 100], "max_depth": [3, 6], "learning_rate": [0.05, 0.1]}
    xgb_search = GridSearchCV(XGBRegressor(objective="reg:squarederror", random_state=RANDOM_STATE, n_jobs=-1),
                              xgb_params, cv=3, scoring="neg_mean_squared_error", n_jobs=-1)
    xgb_search.fit(X_train.fillna(0), y_train)
    best_xgb = xgb_search.best_estimator_
    pred_cv = best_xgb.predict(X_cv.fillna(0))
    models_info.append({
        "name": "XGBoostRegressor",
        "estimator": best_xgb,
        "best_params": xgb_search.best_params_,
        **evaluate_continuous_and_rounded(y_cv, pred_cv)
    })
    
# --------------------------
# SAVE ALL TRAINED MODELS HERE (Using .joblib) 💾
# --------------------------
print("Saving all individual trained models...")
for info in models_info:
    model_name = info["name"]
    estimator = info["estimator"]
    # Save the model trained ONLY on the train set (for individual inspection)
    joblib.dump(estimator, f"models/{model_name}_trained_on_train.joblib")
    print(f"  Saved: models/{model_name}_trained_on_train.joblib")
print("Finished saving individual models.")
    
# --------------------------
# Summary table (sorted by cv_rmse_rounded)
# --------------------------
summary = []
for info in models_info:
    summary.append({
        "model": info["name"],
        "cv_rmse_raw": info["rmse_raw"],
        "cv_rmse_rounded": info["rmse_rounded"],
        "cv_rmse_hundreds": info["rmse_hundreds"],
        "cv_mae_steps": info["mae_steps"],
        "cv_exact_match": info["exact_match"],
        "cv_within_±1_step": info["within_±1_step"],
        "best_params": info.get("best_params", None)
    })
summary_df = pd.DataFrame(summary).sort_values("cv_rmse_rounded").reset_index(drop=True)
print("\nModel comparison (sorted by CV RMSE on rounded predictions):")
display(summary_df)

# --------------------------
# Select best by cv_rmse_rounded, retrain on train+cv, evaluate on test
# --------------------------
best_model_row = summary_df.iloc[0]
best_name = best_model_row["model"]
print(f"\nSelected best model by CV (rounded): {best_name}")

# retrieve estimator object (the one trained on X_train/y_train)
best_est = None
for info in models_info:
    if info["name"] == best_name:
        best_est = info["estimator"]
        break

# prepare training on combined train+cv
X_traincv = pd.concat([X_train, X_cv], axis=0)
y_traincv = pd.concat([y_train, y_cv], axis=0)

# handle polynomial models (they expect numeric subset)
if best_name.startswith("Polynomial"):
    X_traincv_in = X_traincv[numeric_cols].fillna(0)
    X_test_in = X_test[numeric_cols].fillna(0)
else:
    X_traincv_in = X_traincv.fillna(0)
    X_test_in = X_test.fillna(0)

# retrain and evaluate
best_est.fit(X_traincv_in, y_traincv)
pred_test_raw = best_est.predict(X_test_in)
test_metrics = evaluate_continuous_and_rounded(y_test, pred_test_raw)

print("\nTest metrics (selected model):")
print(f"  RMSE (raw preds)      = {test_metrics['rmse_raw']:.2f} rating points")
print(f"  RMSE (rounded preds)  = {test_metrics['rmse_rounded']:.2f} rating points")
print(f"  RMSE (in 100s)        = {test_metrics['rmse_hundreds']:.3f}")
print(f"  MAE (in steps)        = {test_metrics['mae_steps']:.3f}")
print(f"  Exact-match acc       = {test_metrics['exact_match']:.3%}")
print(f"  Within ±1 step acc    = {test_metrics['within_±1_step']:.3%}")

# show a small sample of test predictions
test_sample = pd.DataFrame({
    "y_true": y_test.values,
    "pred_raw": pred_test_raw,
    "pred_rounded": round_to_nearest_100(pred_test_raw)
}).reset_index(drop=True)
display(test_sample.head(20))

# Save summary and the best model (retrained on train+cv)
summary_df.to_csv("models/model_comparison_by_cv_rmse_rounded.csv", index=False)
# Saving the final best model (retrained on Train+CV)
joblib.dump(best_est, f"models/best_model_{best_name}_final.joblib")
print(f"\nSaved comparison to models/model_comparison_by_cv_rmse_rounded.csv")
print(f"Saved **final best model** to models/best_model_{best_name}_final.joblib")

## IV. Prediction on Unrated Problem

This section demonstrates how to use the final selected model to predict the rating of a specific, unrated Codeforces problem.

In [None]:
# Predict rating for an unrated problem using all models + best model
import requests
import re
import random
import numpy as np
import pandas as pd

# --------------------------
# Helpers
# --------------------------
def fetch_problem_from_api(contestId=None, index=None, url=None):
    """
    Return the problem dict from Codeforces API (problemset.problems result).
    Provide either contestId+index (ints/str) or a problem url (contest+index).
    If nothing provided, returns None.
    """
    api_url = "https://codeforces.com/api/problemset.problems"
    resp = requests.get(api_url)
    resp.raise_for_status()
    data = resp.json()
    if data.get("status") != "OK":
        raise RuntimeError("Codeforces API error")
    problems = data["result"]["problems"]

    if url:
        # try to extract contestId and index from common URL patterns
        m = re.search(r'/contest/(\d+)/problem/([A-Za-z0-9]+)', url)
        if not m:
            m = re.search(r'/problemset/problem/(\d+)/([A-Za-z0-9]+)', url)
        if m:
            contestId = int(m.group(1))
            index = m.group(2)

    if contestId is not None and index is not None:
        # find matching problem
        for p in problems:
            # contestId might be int or str in source
            if str(p.get("contestId")) == str(contestId) and str(p.get("index")).upper() == str(index).upper():
                return p
        raise ValueError(f"Problem {contestId}-{index} not found in API result.")
    return None

def sample_unrated_problem_from_csv(csv_path="data/codeforces_problems_unrated.csv"):
    df_unrated = pd.read_csv(csv_path)
    if df_unrated.empty:
        raise ValueError("No unrated problems in CSV")
    row = df_unrated.sample(1).iloc[0].to_dict()
    return row

def build_feature_vector(problem, X_columns, numeric_cols):
    """
    Build a single-row DataFrame aligned with training feature columns (X_columns).
    """
    # Start with zeros
    feat = {c: 0 for c in X_columns}

    # contestId
    if "contestId" in problem and not pd.isna(problem.get("contestId")):
        feat["contestId"] = int(problem.get("contestId"))
    # index -> index_num (A->1, B->2 ...)
    idx = problem.get("index", None)
    if isinstance(idx, str) and len(idx) > 0:
        try:
            feat["index_num"] = ord(idx[0].upper()) - 64
        except Exception:
            feat["index_num"] = 0
    # points (may be missing)
    points = problem.get("points", None)
    if points is not None and not pd.isna(points):
        try:
            feat["points"] = float(points)
        except Exception:
            feat["points"] = 0.0
    
    # tags: API gives list of tags
    tags = problem.get("tags", [])
    # Some CSV rows store tags as "tag1,tag2,..."
    if isinstance(tags, str):
        tags = [t.strip() for t in tags.split(",") if t.strip()]

    # For every tag column in X_columns, set 1 if tag present
    for tag in tags:
        # Handle cases where tags may be quoted (e.g. from CSV load) or unquoted
        tag_clean = tag.strip("' ")
        if tag_clean in feat:
            feat[tag_clean] = 1
        # Check for quoted version (as columns in X are quoted)
        tag_quoted = f"'{tag_clean}'"
        if tag_quoted in feat:
            feat[tag_quoted] = 1

    # Ensure dtypes & final DataFrame
    row_df = pd.DataFrame([feat], columns=X_columns)
    row_df = row_df.fillna(0)
    return row_df

def predict_with_models_on_row(row_df, models_info, numeric_cols):
    """
    For each model in models_info, call .predict() appropriately and return results list.
    """
    out = []
    for info in models_info:
        name = info["name"]
        est = info["estimator"]
        if name.startswith("Polynomial"):
            # polynomial models were trained on numeric subset only
            X_in = row_df[numeric_cols].fillna(0)
        else:
            X_in = row_df.fillna(0)
        pred_raw = est.predict(X_in)[0]
        pred_rounded = int(((pred_raw + 50) // 100) * 100)
        out.append({
            "model": name,
            "pred_raw": float(pred_raw),
            "pred_rounded": pred_rounded
        })
    return out

# --------------------------
# 1) Choose a problem:
# --------------------------
# Example problem to predict (currently unrated on the latest API dump):
contestId_input = 2156
index_input = "A"
problem_url = "https://codeforces.com/contest/2156/problem/A"

# fetch problem info
if contestId_input or index_input or problem_url:
    try:
        prob = fetch_problem_from_api(contestId=contestId_input, index=index_input, url=problem_url)
        print("Fetched problem from Codeforces API:")
        print(f"  contestId={prob.get('contestId')}, index={prob.get('index')}, name={prob.get('name')}")
    except Exception as e:
        # Fallback to sampling unrated from CSV if API fails or problem isn't found in API dump.
        print("API fetch failed or problem not found. Sampling a random unrated problem from CSV.")
        sampled = sample_unrated_problem_from_csv("data/codeforces_problems_unrated.csv")
        prob = {
            "contestId": sampled.get("contestId"),
            "index": sampled.get("index"),
            "name": sampled.get("name"),
            "points": sampled.get("points", None),
            "tags": sampled.get("tags")
        }
        if isinstance(prob["tags"], str):
            prob["tags"] = [t.strip() for t in prob["tags"].split(",") if t.strip()]
        print("Sampled unrated problem:")
        print(f"  contestId={prob.get('contestId')}, index={prob.get('index')}, name={prob.get('name')}")
else:
    # sample a random unrated problem from CSV you saved earlier
    print("No contestId/index/url provided — sampling a random unrated problem from data/codeforces_problems_unrated.csv")
    sampled = sample_unrated_problem_from_csv("data/codeforces_problems_unrated.csv")
    prob = {
        "contestId": sampled.get("contestId"),
        "index": sampled.get("index"),
        "name": sampled.get("name"),
        "points": sampled.get("points", None),
        "tags": sampled.get("tags")
    }
    if isinstance(prob["tags"], str):
        prob["tags"] = [t.strip() for t in prob["tags"].split(",") if t.strip()]
    print("Sampled unrated problem:")
    print(f"  contestId={prob.get('contestId')}, index={prob.get('index')}, name={prob.get('name')}")

# --------------------------
# 2) Build feature vector matching X_train columns
# --------------------------
X_columns = X_train.columns.tolist()
row_feat = build_feature_vector(prob, X_columns, numeric_cols)
print("\nFeature vector prepared (showing non-zero entries):")
display(row_feat.loc[:, (row_feat != 0).any(axis=0)])

# --------------------------
# 3) Predict with every model and display results
# --------------------------
preds = predict_with_models_on_row(row_feat, models_info, numeric_cols)
preds_df = pd.DataFrame(preds).sort_values("pred_rounded").reset_index(drop=True)
print("\nPredictions from all models (rounded to nearest 100):")
display(preds_df)

# --------------------------
# 4) Best model prediction (best_est was retrained on train+cv in the previous cell)
# --------------------------
# best_est is the retrained best estimator
try:
    # detect polynomial best case
    best_name_local = best_model_row["model"]
    if best_name_local.startswith("Polynomial"):
        X_in = row_feat[numeric_cols].fillna(0)
    else:
        X_in = row_feat.fillna(0)
    best_raw = best_est.predict(X_in)[0]
    best_rounded = int(((best_raw + 50) // 100) * 100)
    print(f"\nBest model ({best_name_local}) prediction:")
    print(f"  raw prediction  = {best_raw:.2f}")
    print(f"  rounded (±50 rule) = {best_rounded}")
except Exception as e:
    print("Could not compute best-model prediction — make sure previous cell ran and defined `best_est` and `best_model_row`.")
    raise

# --------------------------
# 5) Save the single-row feature vector and predictions for later inspection (optional)
# --------------------------
row_feat.to_csv("models/last_unrated_problem_features.csv", index=False)
preds_df.to_csv("models/last_unrated_problem_predictions.csv", index=False)
print("\nSaved feature vector to models/last_unrated_problem_features.csv")
print("Saved predictions to models/last_unrated_problem_predictions.csv")


Fetched problem from Codeforces API:
  contestId=2156, index=A, name=Pizza Time

Feature vector prepared (showing non-zero entries):


Unnamed: 0,contestId,points,'brute force','constructive algorithms','greedy',index_num
0,2156,500.0,1,1,1,1



Predictions from all models (rounded to nearest 100):


Unnamed: 0,model,pred_raw,pred_rounded
0,LinearRegression,1402.003759,1400
1,Ridge,1403.494502,1400
2,Polynomial_deg3,1528.238148,1500
3,Polynomial_deg2,1731.337233,1700
4,RandomForestRegressor,3441.0,3400
5,DecisionTreeRegressor,3500.0,3500



Best model (RandomForestRegressor) prediction:
  raw prediction  = 3441.00
  rounded (±50 rule) = 3400

Saved feature vector to models/last_unrated_problem_features.csv
Saved predictions to models/last_unrated_problem_predictions.csv
