# FDS Challenge: Starter Notebook

This notebook will guide you through the first steps of the competition. Our goal here is to show you how to:

1.  Load the `train.jsonl` and `test.jsonl` files from the competition data.
2.  Create a very simple set of features from the data.
3.  Train a basic model.
4.  Generate a `submission.csv` file in the correct format.
5.  Submit your results.

Let's get started!

### 1. Loading and Inspecting the Data

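When you create a notebook within a Kaggle competition, the competition's data is automatically attached and available under the `../input/` directory. The code below reads from a local `data` folder instead, so adjust the path to match where the files live in your environment.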

The dataset is in a `.jsonl` format, which means each line is a separate JSON object. This is great because we can process it one line at a time without needing to load the entire large file into memory.
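
As a minimal sketch of that line-by-line idea, the generator below yields one battle at a time instead of building a full list. The helper name `iter_battles` and the tiny demo file are illustrative assumptions, not part of the competition kit.

```python
import json
import os
import tempfile

def iter_battles(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline at end of file
                yield json.loads(line)

# Demo on a small throwaway file (made-up records, not competition data)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write('{"battle_id": 0, "player_won": true}\n')
    f.write('{"battle_id": 1, "player_won": false}\n')

# Only one battle is held in memory at a time while counting wins
n_wins = sum(1 for battle in iter_battles(demo_path) if battle["player_won"])
print(n_wins)
```

A generator like this is mostly useful when the file is too large to keep in memory; for this starter notebook, loading everything into a list is simpler.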

Let's write a simple loop to load the training data and inspect the first battle.

In [None]:
import json
import pandas as pd
import os

# --- Define the path to our data ---
train_file_path = os.path.join("data", 'train.jsonl')
test_file_path = os.path.join("data", 'test.jsonl')
train_data = []

# Read the file line by line
print(f"Loading data from '{train_file_path}'...")
try:
    with open(train_file_path, 'r') as f:
        for line in f:
            # json.loads() parses one line (one JSON object) into a Python dictionary
            train_data.append(json.loads(line))

    print(f"Successfully loaded {len(train_data)} battles.")

    # Let's inspect the first battle to see its structure
    print("\n--- Structure of the first train battle: ---")
    if train_data:
        first_battle = train_data[0]
        
        # To keep the output clean, we can create a copy and truncate the timeline
        max_turns_shown = 10
        battle_for_display = first_battle.copy()
        battle_for_display['battle_timeline'] = battle_for_display.get('battle_timeline', [])[:max_turns_shown]  # Show only the first turns
        
        # Use json.dumps for pretty-printing the dictionary
        print(json.dumps(battle_for_display, indent=4))
        if len(first_battle.get('battle_timeline', [])) > max_turns_shown:
            print("    ...")
            print("    (battle_timeline has been truncated for display)")


except FileNotFoundError:
    print(f"ERROR: Could not find the training file at '{train_file_path}'.")
    print("Please make sure you have added the competition data to this notebook.")

### 2. Basic Feature Engineering

A successful model will likely require creating many complex features. For this starter notebook, however, we will create a very simple feature set based **only on the initial team stats**. This will be enough to train a model and generate a submission file.

It's up to you to engineer more powerful features!
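
One possible direction, sketched under assumptions: beyond the mean, the maximum and spread of a base stat across a team may carry extra signal. The function `team_stat_summary` is hypothetical; it relies only on the `base_*` keys that the feature code in this section reads, and the sample team is invented for the demo.

```python
from statistics import mean, pstdev

def team_stat_summary(team, stat):
    """Return mean, max and population std of one base stat over a team."""
    values = [p.get(stat, 0) for p in team]
    if not values:
        # An empty team yields neutral values instead of raising an error
        return {"mean": 0.0, "max": 0, "std": 0.0}
    return {"mean": mean(values), "max": max(values), "std": pstdev(values)}

# Made-up team containing only the key needed for the demo
demo_team = [{"base_spe": 100}, {"base_spe": 60}, {"base_spe": 80}]
summary = team_stat_summary(demo_team, "base_spe")
print(summary)
```

Whether such features help is an empirical question; check them with cross-validation before relying on them.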

In [None]:
from tqdm.notebook import tqdm
import numpy as np

def create_simple_features(data: list[dict]) -> pd.DataFrame:
    """
    A very basic feature extraction function.
    It only uses the aggregated base stats of the player's team and opponent's lead.
    """
    feature_list = []
    for battle in tqdm(data, desc="Extracting features"):
        features = {}
        
        # --- Player 1 Team Features ---
        p1_team = battle.get('p1_team_details', [])
        if p1_team:
            features['p1_mean_hp'] = np.mean([p.get('base_hp', 0) for p in p1_team])
            features['p1_mean_spe'] = np.mean([p.get('base_spe', 0) for p in p1_team])
            features['p1_mean_atk'] = np.mean([p.get('base_atk', 0) for p in p1_team])
            features['p1_mean_def'] = np.mean([p.get('base_def', 0) for p in p1_team])
            features['p1_mean_spa'] = np.mean([p.get('base_spa', 0) for p in p1_team])
            features['p1_mean_spd'] = np.mean([p.get('base_spd', 0) for p in p1_team])
        # --- Player 2 Lead Features ---
        p2_lead = battle.get('p2_lead_details')
        if p2_lead:
            # Player 2's lead Pokémon's stats
            features['p2_lead_hp'] = p2_lead.get('base_hp', 0)
            features['p2_lead_spe'] = p2_lead.get('base_spe', 0)
            features['p2_lead_atk'] = p2_lead.get('base_atk', 0)
            features['p2_lead_def'] = p2_lead.get('base_def', 0)
            features['p2_lead_spa'] = p2_lead.get('base_spa', 0)
            features['p2_lead_spd'] = p2_lead.get('base_spd', 0)

        # We also need the ID and the target variable (if it exists)
        features['battle_id'] = battle.get('battle_id')
        if 'player_won' in battle:
            features['player_won'] = int(battle['player_won'])
            
        feature_list.append(features)
        
    return pd.DataFrame(feature_list).fillna(0)

# Create feature DataFrames for both training and test sets
print("Processing training data...")
train_df = create_simple_features(train_data)

print("\nProcessing test data...")
test_data = []
with open(test_file_path, 'r') as f:
    for line in f:
        test_data.append(json.loads(line))
test_df = create_simple_features(test_data)

print("\nTraining features preview:")
display(train_df.head())

#### Creating Dynamic Features from the Battle Timeline

In [None]:
from collections import Counter

TYPE_EFFECTIVENESS = {
    "Normal":   {"Rock": 0.5, "Ghost": 0.0,     "Steel": 0.5},
    "Fire":     {"Fire": 0.5, "Water": 0.5, "Grass": 2.0,  "Ice": 2.0,  "Bug": 2.0,  "Rock": 0.5, "Dragon": 0.5, "Steel": 2.0},
    "Water":    {"Fire": 2.0,  "Water": 0.5, "Grass": 0.5, "Ground": 2.0, "Rock": 2.0, "Dragon": 0.5},
    "Electric": {"Water": 2.0, "Electric": 0.5, "Grass": 0.5, "Ground": 0.0, "Flying": 2.0, "Dragon": 0.5},
    "Grass":    {"Fire": 0.5, "Water": 2.0, "Grass": 0.5, "Poison": 0.5, "Ground": 2.0, "Flying": 0.5, "Bug": 0.5, "Rock": 2.0, "Dragon": 0.5, "Steel": 0.5},
    "Ice":      {"Fire": 0.5, "Water": 0.5, "Grass": 2.0, "Ice": 0.5, "Ground": 2.0, "Flying": 2.0, "Dragon": 2.0, "Steel": 0.5},
    "Fighting": {"Normal": 2.0, "Ice": 2.0, "Rock": 2.0, "Dark": 2.0, "Steel": 2.0, "Poison": 0.5, "Flying": 0.5, "Psychic": 0.5, "Bug": 0.5, "Ghost": 0.0, "Fairy": 0.5},
    "Poison":   {"Grass": 2.0, "Fairy": 2.0, "Poison": 0.5, "Ground": 0.5, "Rock": 0.5, "Ghost": 0.5, "Steel": 0.0},
    "Ground":   {"Fire": 2.0, "Electric": 2.0, "Poison": 2.0, "Rock": 2.0, "Steel": 2.0, "Grass": 0.5, "Bug": 0.5, "Flying": 0.0},
    "Flying":   {"Grass": 2.0, "Fighting": 2.0, "Bug": 2.0, "Electric": 0.5, "Rock": 0.5, "Steel": 0.5},
    "Psychic":  {"Fighting": 2.0, "Poison": 2.0, "Psychic": 0.5, "Steel": 0.5, "Dark": 0.0},
    "Bug":      {"Grass": 2.0, "Psychic": 2.0, "Dark": 2.0, "Fire": 0.5, "Fighting": 0.5, "Poison": 0.5, "Flying": 0.5, "Ghost": 0.5, "Steel": 0.5, "Fairy": 0.5},
    "Rock":     {"Fire": 2.0, "Ice": 2.0, "Flying": 2.0, "Bug": 2.0, "Fighting": 0.5, "Ground": 0.5, "Steel": 0.5},
    "Ghost":    {"Psychic": 2.0, "Ghost": 2.0, "Dark": 0.5, "Normal": 0.0},
    "Dragon":   {"Dragon": 2.0, "Steel": 0.5, "Fairy": 0.0},
    "Dark":     {"Psychic": 2.0, "Ghost": 2.0, "Fighting": 0.5, "Dark": 0.5, "Fairy": 0.5},
    "Steel":    {"Ice": 2.0, "Rock": 2.0, "Fairy": 2.0, "Fire": 0.5, "Water": 0.5, "Electric": 0.5, "Steel": 0.5},
    "Fairy":    {"Fighting": 2.0, "Dragon": 2.0, "Dark": 2.0, "Fire": 0.5, "Poison": 0.5, "Steel": 0.5},
}

def compute_type_advantage(attacker_types, defender_types):
    if not attacker_types or not defender_types:
        return 1.0
    multipliers = []
    for atk_type in attacker_types:
        for def_type in defender_types:
            mult = TYPE_EFFECTIVENESS.get(atk_type, {}).get(def_type, 1.0)
            multipliers.append(mult)
    return np.mean(multipliers) if multipliers else 1.0

def build_player_dict(timeline, prefix):
    """Track the last observed state of each Pokémon seen for one player ("p1" or "p2")."""
    player_pokemons = {}

    for turn in timeline:
        state = turn.get(f"{prefix}_pokemon_state", {})
        if not state or "name" not in state:
            continue

        name = state["name"]
        if name not in player_pokemons:
            player_pokemons[name] = {
                "hp": 1,
                "status": "",
                "moves": [],
                "boosts": {k: 0 for k in ["atk", "def", "spa", "spd", "spe"]},
            }

        # HP and status: keep the most recent values seen in the timeline
        player_pokemons[name]["hp"] = state.get("hp_pct", 0)
        player_pokemons[name]["status"] = state.get("status", "")

        # Boosts (a missing or null entry is treated as no boosts)
        boosts = state.get("boosts") or {}
        for k in player_pokemons[name]["boosts"]:
            player_pokemons[name]["boosts"][k] = boosts.get(k, 0)

        # Moves used (each move name is recorded once)
        move_details = turn.get(f"{prefix}_move_details")
        if move_details is not None:
            move_name = move_details.get("name")
            if move_name is not None and move_name not in player_pokemons[name]["moves"]:
                player_pokemons[name]["moves"].append(move_name)

    # One entry per Pokémon, holding its latest observed state
    return player_pokemons


def aggregate_player_stats(player_dict):
    """Aggregates all Pokémon stats for one player."""
    if not player_dict:
        # Same keys as the non-empty case, so callers can index either result
        return {
            "total_hp_left": 0,
            "num_seen": 0,
            "num_fainted": 0,
            "num_paralyzed": 0,
            "num_frozen": 0,
            "num_psn": 0,
            "num_brn": 0,
            "num_slp": 0,
        }

    pokemons = list(player_dict.values())

    total_hp_left = np.sum([p["hp"] for p in pokemons])
    num_fainted = sum(1 for pokemon in pokemons if pokemon['status'] == "fnt")
    status_counts = Counter(p['status'] for p in pokemons if p.get('status'))
    num_paralyzed = status_counts['par']
    num_frozen = status_counts['frz']
    num_psn = status_counts['psn']
    num_brn = status_counts['brn']
    num_slp = status_counts['slp']

    #types = [p["types"] for p in pokemons]
    #boosts = {k: np.mean([p["boosts"][k] for p in pokemons]) for k in ["atk", "def", "spa", "spd", "spe"]}
    #all_statuses = sum([p["status"] for p in pokemons], [])
    #status_freq = {s: all_statuses.count(s) / (len(all_statuses) + 1e-9)
    #               for s in ["par", "frz", "psn", "brn", "slp"]}

    return {
        "total_hp_left": total_hp_left,
        "num_seen": len(pokemons),
        "num_fainted": num_fainted,
        "num_paralyzed": num_paralyzed,
        "num_frozen": num_frozen,
        "num_psn": num_psn,
        "num_brn": num_brn,
        "num_slp": num_slp
    }

def create_dynamic_features(data: list[dict]) -> pd.DataFrame:
    feature_list = []

    for battle in tqdm(data, desc="Extracting two-dict features"):
        timeline = battle.get("battle_timeline", [])
        if not timeline:
            continue

        p1_dict = build_player_dict(timeline, "p1")
        p2_dict = build_player_dict(timeline, "p2")

        p1_stats = aggregate_player_stats(p1_dict)
        p2_stats = aggregate_player_stats(p2_dict)

        features = {
            "battle_id": battle.get("battle_id"),
            "hp_ratio": p1_stats["total_hp_left"] / (p2_stats["total_hp_left"] + 1e-9),
            "p1_num_seen": p1_stats["num_seen"],
            "p2_num_seen": p2_stats["num_seen"],
            "p1_num_fainted": p1_stats["num_fainted"],
            "p2_num_fainted": p2_stats["num_fainted"],
            "num_paralyzed_diff": p1_stats["num_paralyzed"] - p2_stats["num_paralyzed"],
            "num_frozen_diff": p1_stats["num_frozen"] - p2_stats["num_frozen"],
            "num_psn_diff": p1_stats["num_psn"] - p2_stats["num_psn"],
            "num_brn_diff": p1_stats["num_brn"] - p2_stats["num_brn"],
            "num_slp_diff": p1_stats["num_slp"] - p2_stats["num_slp"],
            "num_seen_diff": p1_stats["num_seen"] - p2_stats["num_seen"],
            "num_fainted_diff": p1_stats["num_fainted"] - p2_stats["num_fainted"],
        }

        # Boosts
        #for stat in ["atk", "def", "spa", "spd", "spe"]:
        #    features[f"p1_avg_boost_{stat}"] = p1_stats["avg_boosts"][stat]
        #    features[f"p2_avg_boost_{stat}"] = p2_stats["avg_boosts"][stat]

        # Status frequencies
        #for s in ["par", "frz", "psn", "brn", "slp"]:
        #    features[f"p1_freq_{s}"] = p1_stats["status_freq"][s]
        #    features[f"p2_freq_{s}"] = p2_stats["status_freq"][s]

        # Type advantage
        #p1_type_adv = compute_type_advantage(p1_stats["types"], p2_stats["types"])
        #p2_type_adv = compute_type_advantage(p2_stats["types"], p1_stats["types"])
        #features["p1_type_adv"] = p1_type_adv
        #features["p2_type_adv"] = p2_type_adv
        #features["type_adv_diff"] = p1_type_adv - p2_type_adv

        feature_list.append(features)

    return pd.DataFrame(feature_list).fillna(0)

print("Processing training data...")
train_df_dynamic = create_dynamic_features(train_data)

print("\nProcessing test data...")
# test_data was already loaded in the previous cell, so it is reused here
test_df_dynamic = create_dynamic_features(test_data)

print("\nTraining features preview:")
display(train_df_dynamic.head())

In [None]:
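# Combine Dynamic Features with Simple Features
# A left merge keeps every battle, including any that were skipped for lacking
# a timeline; their missing dynamic features are filled with 0
train_df_combined = pd.merge(train_df, train_df_dynamic, on="battle_id", how="left").fillna(0)
test_df_combined = pd.merge(test_df, test_df_dynamic, on="battle_id", how="left").fillna(0)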
print(train_df_combined.head(), train_df_combined.columns)

### 3. Training a Baseline Model

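Now that we have some features, let's train a model. Rather than fitting a single `LogisticRegression`, the cells below use Optuna to search over several model families (logistic regression, random forest, XGBoost, LightGBM, CatBoost, and a small neural network) and keep the one with the best cross-validated accuracy. This will give us a starting point for our predictions.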
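
A useful reference point for any model trained here is the accuracy of always predicting the most frequent class; a trained model should clearly beat it. The sketch below computes that number in plain Python. The function `majority_baseline_accuracy` and the label list are illustrative assumptions.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    if not labels:
        raise ValueError("labels must not be empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

demo_labels = [1, 0, 1, 1, 0, 1, 0, 1]  # made-up target values
baseline = majority_baseline_accuracy(demo_labels)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

With the notebook's data, this could be called as `majority_baseline_accuracy(list(y_train))` once `y_train` is defined.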

In [None]:
features = [col for col in train_df_combined.columns if col not in ['battle_id', 'player_won']]
X_train = train_df_combined[features]
y_train = train_df_combined['player_won']
X_test = test_df_combined[features]

#### Training Pipeline

In [None]:
!pip install xgboost
!pip install scikit-learn
!pip install lightgbm
!pip install catboost
!pip install optuna

In [None]:
import optuna
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import make_scorer, accuracy_score


In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scorer = make_scorer(accuracy_score)

# --- Define objective function ---
def objective(trial):
    model_name = trial.suggest_categorical("model", ["LogReg", "RandomForest", "XGBoost", "LightGBM", "CatBoost", "NeuralNet"])
    
    if model_name == "LogReg":
        # suggest_loguniform is deprecated; suggest_float(..., log=True) is the replacement
        C = trial.suggest_float("logreg_C", 0.01, 10, log=True)
        model = Pipeline([("scaler", StandardScaler()),
                          ("model", LogisticRegression(C=C, penalty="l2", solver="lbfgs", max_iter=1000))])
    
    elif model_name == "RandomForest":
        n_estimators = trial.suggest_int("rf_n_estimators", 100, 500)
        max_depth = trial.suggest_int("rf_max_depth", 3, 20)
        min_samples_split = trial.suggest_int("rf_min_samples_split", 2, 10)
        min_samples_leaf = trial.suggest_int("rf_min_samples_leaf", 1, 5)
        model = RandomForestClassifier(n_estimators=n_estimators,
                                       max_depth=max_depth,
                                       min_samples_split=min_samples_split,
                                       min_samples_leaf=min_samples_leaf,
                                       random_state=42)
    
    elif model_name == "XGBoost":
        n_estimators = trial.suggest_int("xgb_n_estimators", 100, 500)
        max_depth = trial.suggest_int("xgb_max_depth", 3, 10)
        learning_rate = trial.suggest_float("xgb_lr", 0.01, 0.3, log=True)
        subsample = trial.suggest_float("xgb_subsample", 0.6, 1.0)
        # use_label_encoder is omitted: it is deprecated and ignored by recent XGBoost releases
        model = xgb.XGBClassifier(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  learning_rate=learning_rate,
                                  subsample=subsample,
                                  eval_metric="logloss",
                                  random_state=42)
    
    elif model_name == "LightGBM":
        n_estimators = trial.suggest_int("lgb_n_estimators", 100, 500)
        num_leaves = trial.suggest_int("lgb_num_leaves", 15, 127)
        learning_rate = trial.suggest_float("lgb_lr", 0.01, 0.3, log=True)
        colsample_bytree = trial.suggest_float("lgb_colsample", 0.6, 1.0)
        model = lgb.LGBMClassifier(n_estimators=n_estimators,
                                   num_leaves=num_leaves,
                                   learning_rate=learning_rate,
                                   colsample_bytree=colsample_bytree,
                                   random_state=42)
    
    elif model_name == "CatBoost":
        iterations = trial.suggest_int("cat_iterations", 100, 500)
        depth = trial.suggest_int("cat_depth", 3, 10)
        learning_rate = trial.suggest_float("cat_lr", 0.01, 0.3, log=True)
        model = CatBoostClassifier(iterations=iterations,
                                   depth=depth,
                                   learning_rate=learning_rate,
                                   verbose=False,
                                   random_state=42)
    
    elif model_name == "NeuralNet":
        hidden_layer_sizes = trial.suggest_categorical("nn_hidden", [(128, 64), (256, 128, 64)])
        activation = trial.suggest_categorical("nn_activation", ["relu", "tanh"])
        alpha = trial.suggest_float("nn_alpha", 1e-4, 1e-2, log=True)
        learning_rate_init = trial.suggest_float("nn_lr", 1e-4, 5e-3, log=True)
        model = Pipeline([("scaler", StandardScaler()),
                          ("model", MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,
                                                  activation=activation,
                                                  alpha=alpha,
                                                  learning_rate_init=learning_rate_init,
                                                  max_iter=500,
                                                  random_state=42))])
    
    score = cross_val_score(model, X_train, y_train, cv=cv, scoring=scorer, n_jobs=-1)
    return score.mean()

In [None]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

print("Best model and params:")
print(study.best_trial.params)
print("Best CV accuracy:", study.best_trial.value)


In [None]:
best_params = dict(study.best_trial.params)
best_model_name = best_params.pop("model")

# Prefix used for each model's hyperparameter names in the objective function
PARAM_PREFIXES = {"LogReg": "logreg_", "RandomForest": "rf_", "XGBoost": "xgb_",
                  "LightGBM": "lgb_", "CatBoost": "cat_", "NeuralNet": "nn_"}
# Short trial names that differ from the constructor argument names
PARAM_RENAMES = {
    "XGBoost": {"lr": "learning_rate"},
    "LightGBM": {"lr": "learning_rate", "colsample": "colsample_bytree"},
    "CatBoost": {"lr": "learning_rate"},
    "NeuralNet": {"hidden": "hidden_layer_sizes", "lr": "learning_rate_init"},
}

# Strip the prefix and translate names so the params can be passed as keyword arguments
prefix = PARAM_PREFIXES[best_model_name]
renames = PARAM_RENAMES.get(best_model_name, {})
clean_params = {}
for key, value in best_params.items():
    short_name = key[len(prefix):] if key.startswith(prefix) else key
    clean_params[renames.get(short_name, short_name)] = value
print("Best params:", clean_params)

print("Training best model:", best_model_name)
if best_model_name == "LogReg":
    model = Pipeline([("scaler", StandardScaler()),
                      ("model", LogisticRegression(**clean_params, penalty="l2", solver="lbfgs", max_iter=1000))])
elif best_model_name == "RandomForest":
    model = RandomForestClassifier(**clean_params, random_state=42)
elif best_model_name == "XGBoost":
    model = xgb.XGBClassifier(**clean_params, eval_metric="logloss", random_state=42)
elif best_model_name == "LightGBM":
    model = lgb.LGBMClassifier(**clean_params, random_state=42)
elif best_model_name == "CatBoost":
    model = CatBoostClassifier(**clean_params, verbose=False, random_state=42)
elif best_model_name == "NeuralNet":
    model = Pipeline([("scaler", StandardScaler()), ("model", MLPClassifier(**clean_params, max_iter=500, random_state=42))])
else:
    raise ValueError(f"Unknown model name: {best_model_name}")

model.fit(X_train, y_train)


In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

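# Note: this is accuracy on the same data the model was fit on, so it is an
# optimistic estimate; the cross-validated score from the Optuna study is a fairer one.
y_pred = model.predict(X_train)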
acc = accuracy_score(y_train, y_pred)
print(f"Training Accuracy: {acc:.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_train, y_pred))

### 4. Creating the Submission File

The competition requires a `.csv` file with two columns: `battle_id` and `player_won`. Let's use our trained model to make predictions on the test set and format them correctly.

In [None]:
# Make predictions on the test data
print("Generating predictions on the test set...")
test_predictions = model.predict(X_test)

# Create the submission DataFrame
submission_df = pd.DataFrame({
    # Take the IDs from the same frame that produced X_test so rows stay aligned
    'battle_id': test_df_combined['battle_id'],
    'player_won': test_predictions
})

# Save the DataFrame to a .csv file
submission_df.to_csv('submission.csv', index=False)

print("\n'submission.csv' file created successfully!")
display(submission_df.head())

### 5. Submitting Your Results

Once you have generated your `submission.csv` file, there are two primary ways to submit it to the competition.
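
Before submitting, it can be worth checking the file's shape. The sketch below is a hypothetical helper, `check_submission`, that assumes the two columns appear in the order `battle_id`, `player_won` and that predictions are written as `0` or `1`; the demo file is made up and deliberately contains a faulty row.

```python
import csv
import os
import tempfile

def check_submission(path, expected_columns=("battle_id", "player_won")):
    """Return a list of problems found in a submission file (empty if none)."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if tuple(reader.fieldnames or ()) != tuple(expected_columns):
            return [f"unexpected columns: {reader.fieldnames}"]
        seen_ids = set()
        for row_number, row in enumerate(reader, start=2):  # row 1 is the header
            if row["battle_id"] in seen_ids:
                problems.append(f"row {row_number}: duplicate battle_id")
            seen_ids.add(row["battle_id"])
            if row["player_won"] not in {"0", "1"}:  # assumed value format
                problems.append(f"row {row_number}: player_won is not 0 or 1")
    return problems

# Made-up file whose last row repeats an ID and holds an invalid prediction
demo_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["battle_id", "player_won"], [0, 1], [1, 0], [1, 2]])

issues = check_submission(demo_path)
print(issues)
```

An empty list means none of these particular checks failed; it does not guarantee that the competition will accept the file.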

---

#### Method A: Submitting Directly from the Notebook

This is the standard method for code competitions. It ensures that your submission is linked to the code that produced it, which is crucial for reproducibility.

1.  **Save Your Work:** Click the **"Save Version"** button in the top-right corner of the notebook editor.
2.  **Run the Notebook:** In the pop-up window, select **"Save & Run All (Commit)"** and then click the **"Save"** button. This will run your entire notebook from top to bottom and save the output, including your `submission.csv` file.
3.  **Go to the Viewer:** Once the save process is complete, navigate to the notebook viewer page.
4.  **Submit to Competition:** In the viewer, find the **"Submit to Competition"** section. This is usually located in the header of the output section or in the vertical "..." menu on the right side of the page. Clicking the **Submit** button will submit your generated `submission.csv` file.

After submitting, you will see your score in the **"Submit to Competition"** section or on the [Public Leaderboard](https://www.kaggle.com/competitions/fds-pokemon-battles-prediction-2025/leaderboard?).

---

#### Method B: Manual Upload

You can also generate your predictions and submission file using any environment you prefer (this notebook, Google Colab, or your local machine).

1.  **Generate the `submission.csv` file** using your model.
2.  **Download the file** to your computer.
3.  **Navigate to the [Leaderboard Page](https://www.kaggle.com/competitions/fds-pokemon-battles-prediction-2025/leaderboard?)** and click on the **"Submit Predictions"** button.
4.  **Upload Your File:** Drag and drop or select your `submission.csv` file to upload it.

This method is quick, but keep in mind that for the final evaluation, you might be required to provide the code that generated your submission.

Good luck!