# FDS Challenge: Starter Notebook

This notebook will guide you through the first steps of the competition. Our goal here is to show you how to:

1.  Load the `train.jsonl` and `test.jsonl` files from the competition data.
2.  Create a very simple set of features from the data.
3.  Train a basic model.
4.  Generate a `submission.csv` file in the correct format.
5.  Submit your results.

Let's get started!

### 1. Loading and Inspecting the Data

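When you create a notebook within a Kaggle competition, the competition's data is automatically attached and available under the `../input/` directory. The code below reads from a local `data` folder instead, so adjust the two paths if you run it on Kaggle.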
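
Because the data location differs between a local run and Kaggle, a small lookup helper can save some path editing. The sketch below is only an illustration: `find_data_file` and the candidate directory list are hypothetical names, not part of the competition tooling.

```python
from pathlib import Path

# Hypothetical candidate locations; adjust to wherever the competition data actually lives.
CANDIDATE_DIRS = [Path("data"), Path("../input")]

def find_data_file(filename, candidate_dirs=CANDIDATE_DIRS):
    """Return the first existing path to `filename`, or None if it is not found."""
    for base in candidate_dirs:
        if not base.is_dir():
            continue
        direct = base / filename
        if direct.is_file():
            return direct
        # Attached datasets are often nested one level down, e.g. ../input/<competition>/
        for match in sorted(base.glob(f"*/{filename}")):
            return match
    return None

train_path = find_data_file("train.jsonl")
print(train_path if train_path else "train.jsonl not found in any candidate directory")
```

Returning `None` instead of raising keeps the decision about how to react (stop, warn, or fall back) with the caller.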

The dataset is in a `.jsonl` format, which means each line is a separate JSON object. This is great because we can process it one line at a time without needing to load the entire large file into memory.
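
To make the line-by-line idea concrete, here is a minimal sketch of a generator that yields one record at a time. `iter_jsonl` is a hypothetical helper, and the in-memory `sample` stands in for an open file; only the field names come from this dataset.

```python
import io
import json

def iter_jsonl(lines):
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for an open file handle; a real run would pass open(path) instead.
sample = io.StringIO(
    '{"battle_id": 0, "player_won": true}\n'
    '{"battle_id": 1, "player_won": false}\n'
)

# Only one battle is held in memory at any moment.
wins = sum(1 for battle in iter_jsonl(sample) if battle["player_won"])
print(wins)  # 1
```

The loop in the next cell appends every battle to a list, which is fine at this data size; the generator form matters only when the file no longer fits in memory.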

Let's write a simple loop to load the training data and inspect the first battle.

In [1]:
import json
import pandas as pd
import os

# --- Define the path to our data ---
train_file_path = os.path.join("data", 'train.jsonl')
test_file_path = os.path.join("data", 'test.jsonl')
train_data = []

# Read the file line by line
print(f"Loading data from '{train_file_path}'...")
try:
    with open(train_file_path, 'r') as f:
        for line in f:
            # json.loads() parses one line (one JSON object) into a Python dictionary
            train_data.append(json.loads(line))

    print(f"Successfully loaded {len(train_data)} battles.")

    # Let's inspect the first battle to see its structure
    print("\n--- Structure of the first train battle: ---")
    if train_data:
        first_battle = train_data[0]
        
        # To keep the output clean, we can create a copy and truncate the timeline
        max_turns_shown = 10
        battle_for_display = first_battle.copy()
        battle_for_display['battle_timeline'] = battle_for_display.get('battle_timeline', [])[:max_turns_shown]  # Show the first 10 turns
        
        # Use json.dumps for pretty-printing the dictionary
        print(json.dumps(battle_for_display, indent=4))
        if len(first_battle.get('battle_timeline', [])) > max_turns_shown:
            print("    ...")
            print("    (battle_timeline has been truncated for display)")


except FileNotFoundError:
    print(f"ERROR: Could not find the training file at '{train_file_path}'.")
    print("Please make sure you have added the competition data to this notebook.")

print("\nProcessing test data...")
test_data = []
with open(test_file_path, 'r') as f:
    for line in f:
        test_data.append(json.loads(line))

Loading data from 'data\train.jsonl'...
Successfully loaded 10000 battles.

--- Structure of the first train battle: ---
{
    "player_won": true,
    "p1_team_details": [
        {
            "name": "starmie",
            "level": 100,
            "types": [
                "psychic",
                "water"
            ],
            "base_hp": 60,
            "base_atk": 75,
            "base_def": 85,
            "base_spa": 100,
            "base_spd": 100,
            "base_spe": 115
        },
        {
            "name": "exeggutor",
            "level": 100,
            "types": [
                "grass",
                "psychic"
            ],
            "base_hp": 95,
            "base_atk": 95,
            "base_def": 85,
            "base_spa": 125,
            "base_spd": 125,
            "base_spe": 55
        },
        {
            "name": "chansey",
            "level": 100,
            "types": [
                "normal",
                "notype"
            ],

### 2. Basic Feature Engineering

A successful model will likely require creating many complex features. For this starter notebook, however, we will create a very simple feature set based **only on the initial team stats**. This will be enough to train a model and generate a submission file.

It's up to you to engineer more powerful features!
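
As one possible direction, the sketch below derives a single extra feature from the same initial team data: the share of player 1's team that outspeeds the opposing lead. The function name and the toy battle are illustrative assumptions, and nothing here has been checked for predictive value.

```python
def fraction_faster_than_lead(battle):
    """Share of player 1's team whose base speed exceeds the opposing lead's."""
    team = battle.get("p1_team_details", [])
    lead = battle.get("p2_lead_details")
    if not team or not lead:
        return 0.0
    lead_speed = lead.get("base_spe", 0)
    faster = sum(1 for p in team if p.get("base_spe", 0) > lead_speed)
    return faster / len(team)

# Toy battle using the same keys as the competition records.
toy_battle = {
    "p1_team_details": [
        {"name": "starmie", "base_spe": 115},
        {"name": "exeggutor", "base_spe": 55},
    ],
    "p2_lead_details": {"name": "tauros", "base_spe": 110},
}
print(fraction_faster_than_lead(toy_battle))  # 0.5
```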

In [2]:
def extract_pokemon_and_types(battles):
    all_pokemon = []
    all_types = set()

    for battle in battles:
        # Player 1's team Pokémon
        for p in battle["p1_team_details"]:
            all_pokemon.append({
                "name": p["name"].lower(),
                "types": p["types"],
                "base_hp": p["base_hp"],
                "base_atk": p["base_atk"],
                "base_def": p["base_def"],
                "base_spa": p["base_spa"],
                "base_spd": p["base_spd"],
                "base_spe": p["base_spe"],
            })
            all_types.update([t for t in p["types"] if t != "notype"])
        # Player 2's lead Pokémon
        enemy_pokemon = battle["p2_lead_details"]
        all_pokemon.append({
            "name": enemy_pokemon["name"].lower(),
            "types": enemy_pokemon["types"],
            "base_hp": enemy_pokemon["base_hp"],
            "base_atk": enemy_pokemon["base_atk"],
            "base_def": enemy_pokemon["base_def"],
            "base_spa": enemy_pokemon["base_spa"],
            "base_spd": enemy_pokemon["base_spd"],
            "base_spe": enemy_pokemon["base_spe"],
        })
        all_types.update([t for t in enemy_pokemon["types"] if t != "notype"])
    # Build a DataFrame and drop duplicate Pokémon by name
    pokemon_df = pd.DataFrame(all_pokemon).drop_duplicates(subset="name").reset_index(drop=True)
    types_list = sorted(list(all_types))

    return pokemon_df, types_list


# Example:
pokemon_df, all_types = extract_pokemon_and_types(train_data)

print("🧩 All types:")
print(all_types)
print("\n📊 Pokémon DataFrame:")
print(pokemon_df)


🧩 All types:
['dragon', 'electric', 'fire', 'flying', 'ghost', 'grass', 'ground', 'ice', 'normal', 'poison', 'psychic', 'rock', 'water']

📊 Pokémon DataFrame:
          name               types  base_hp  base_atk  base_def  base_spa  \
0      starmie    [psychic, water]       60        75        85       100   
1    exeggutor    [grass, psychic]       95        95        85       125   
2      chansey    [normal, notype]      250         5         5       105   
3      snorlax    [normal, notype]      160       110        65        65   
4       tauros    [normal, notype]       75       100        95        70   
5     alakazam   [notype, psychic]       55        50        45       135   
6         jynx      [ice, psychic]       65        50        35        95   
7      slowbro    [psychic, water]       95        75       110        80   
8       gengar     [ghost, poison]       60        65        60       130   
9       rhydon      [ground, rock]      105       130       120

In [3]:
# Compute the type advantage and stat advantage between player 1's team and the opponent's revealed Pokémon
import numpy as np

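# Simplified Gen 1 type chart: attacking type -> {defending type: damage multiplier}.
# Pairs that are not listed are treated as neutral (1.0) by compute_type_advantage.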
type_chart = {
    "normal":    {"rock": 0.5, "ghost": 0.0},
    "fire":      {"fire": 0.5, "water": 0.5, "grass": 2, "ice": 2, "bug": 2, "rock": 0.5, "dragon": 0.5},
    "water":     {"fire": 2, "water": 0.5, "grass": 0.5, "ground": 2, "rock": 2, "dragon": 0.5},
    "electric":  {"water": 2, "electric": 0.5, "grass": 0.5, "ground": 0, "flying": 2, "dragon": 0.5},
    "grass":     {"fire": 0.5, "water": 2, "grass": 0.5, "poison": 0.5, "ground": 2, "flying": 0.5, "rock": 2},
    "ice":       {"water": 0.5, "grass": 2, "ice": 0.5, "ground": 2, "flying": 2, "dragon": 2},
    "fighting":  {"normal": 2, "ice": 2, "rock": 2, "ghost": 0, "psychic": 0.5},
    "poison":    {"grass": 2, "poison": 0.5, "ground": 0.5, "rock": 0.5, "ghost": 0.5},
    "ground":    {"fire": 2, "electric": 2, "grass": 0.5, "poison": 2, "flying": 0, "rock": 2},
    "flying":    {"electric": 0.5, "grass": 2, "fighting": 2, "bug": 2, "rock": 0.5},
    "psychic":   {"fighting": 2, "poison": 2, "psychic": 0.5},
    "bug":       {"fire": 0.5, "grass": 2, "fighting": 0.5, "poison": 2, "flying": 0.5, "psychic": 2},
    "rock":      {"fire": 2, "ice": 2, "fighting": 0.5, "ground": 0.5, "flying": 2, "bug": 2},
    "ghost":     {"normal": 0, "psychic": 0},
    "dragon":    {"dragon": 2},
    "notype":    {}
}

def compute_type_advantage(p1_types, p2_types, chart):
    multipliers = []
    for atk_type in p1_types:
        for def_type in p2_types:
            if atk_type in chart:
                mult = chart[atk_type].get(def_type, 1.0)
            else:
                mult = 1.0
            multipliers.append(mult)
    if not multipliers:
        return 1.0
    return np.mean(multipliers)

def compute_type_and_stat_advantages(battles, pokemon_df, all_types):
    type_advantages = []
    stat_advantages = []

    type_to_index = {t: i for i, t in enumerate(all_types)}

    for battle in battles:
        p1_pokemon = []
        for p in battle["p1_team_details"]:
            p1_pokemon.append(p["name"].lower())
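        # Collect every opposing Pokémon revealed so far (the lead plus anything seen in the timeline).
        # Names are lowercased so they match pokemon_df['name'].
        enemy_pokemon = [battle["p2_lead_details"]['name'].lower()]
        for turn in battle["battle_timeline"]:
            e_p = turn['p2_pokemon_state']['name'].lower()
            if e_p not in enemy_pokemon:
                enemy_pokemon.append(e_p)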
        
        p1_types = []
        p1_stats = []
        for p_name in p1_pokemon:
            p_data = pokemon_df[pokemon_df['name'] == p_name].iloc[0]
            p1_types.append(p_data['types'])
            p1_stats.append([
                p_data['base_hp'],
                p_data['base_atk'],
                p_data['base_def'],
                p_data['base_spa'],
                p_data['base_spd'],
                p_data['base_spe'],
            ])
        enemy_types = []
        enemy_stats = []
        for e_name in enemy_pokemon:
            e_data = pokemon_df[pokemon_df['name'] == e_name].iloc[0]
            enemy_types.append(e_data['types'])
            enemy_stats.append([
                e_data['base_hp'],
                e_data['base_atk'],
                e_data['base_def'],
                e_data['base_spa'],
                e_data['base_spd'],
                e_data['base_spe'],
            ])
        #print(p1_pokemon, enemy_pokemon)
        #print(p1_types, enemy_types)
        #print(p1_stats, enemy_stats)
        
        # now compute the type advantage and stat advantage
        # flatten type lists, remove "notype"
        p1_all_types = [t for ts in p1_types for t in ts if t != "notype"]
        p2_all_types = [t for ts in enemy_types for t in ts if t != "notype"]
        # team vs team type advantage
        p1_type_adv = compute_type_advantage(p1_all_types, p2_all_types, type_chart)
        p2_type_adv = compute_type_advantage(p2_all_types, p1_all_types, type_chart)

        type_advantage = p1_type_adv / p2_type_adv
        
        # convert to numpy arrays
        p1_stats = np.array(p1_stats, dtype=float)
        p2_stats = np.array(enemy_stats, dtype=float)

        # compute mean overall stat
        p1_avg = p1_stats.mean()
        p2_avg = p2_stats.mean()

        stat_advantage = p1_avg / p2_avg

        type_advantages.append(type_advantage)
        stat_advantages.append(stat_advantage)

    return type_advantages, stat_advantages
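

To see what the averaging in `compute_type_advantage` produces, here is a small dependency-free sketch of the same idea. `mean_multiplier` is a stand-in name, and `mini_chart` copies only the water and psychic rows from the chart above.

```python
# Two rows copied from the chart above: attacking type -> {defending type: multiplier}
mini_chart = {
    "water": {"fire": 2, "water": 0.5, "grass": 0.5, "ground": 2, "rock": 2, "dragon": 0.5},
    "psychic": {"fighting": 2, "poison": 2, "psychic": 0.5},
}

def mean_multiplier(attacker_types, defender_types, chart):
    """Average multiplier over every attacker/defender type pair (unlisted pairs count as 1.0)."""
    pairs = [(a, d) for a in attacker_types for d in defender_types]
    if not pairs:
        return 1.0
    return sum(chart.get(a, {}).get(d, 1.0) for a, d in pairs) / len(pairs)

# Starmie (psychic/water) attacking Rhydon (ground/rock):
# psychic->ground 1.0, psychic->rock 1.0, water->ground 2.0, water->rock 2.0
print(mean_multiplier(["psychic", "water"], ["ground", "rock"], mini_chart))  # 1.5
```

Note the design consequence: the average treats each type pair independently, so water against a ground/rock defender contributes two separate 2.0 values, whereas in the game the two multipliers combine to 4.0. Multiplying per defender before averaging would be one alternative to explore.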



In [4]:
from tqdm.notebook import tqdm
import numpy as np

def create_simple_features(data: list[dict]) -> pd.DataFrame:
    """
    A very basic feature extraction function.
    It only uses the aggregated base stats of the player's team and opponent's lead.
    """
    feature_list = []
    for battle in tqdm(data, desc="Extracting features"):
        features = {}
        
        # --- Player 1 Team Features ---
        p1_team = battle.get('p1_team_details', [])
        if p1_team:
            features['p1_mean_hp'] = np.mean([p.get('base_hp', 0) for p in p1_team])
            features['p1_mean_spe'] = np.mean([p.get('base_spe', 0) for p in p1_team])
            features['p1_mean_atk'] = np.mean([p.get('base_atk', 0) for p in p1_team])
            features['p1_mean_def'] = np.mean([p.get('base_def', 0) for p in p1_team])
            features['p1_mean_spa'] = np.mean([p.get('base_spa', 0) for p in p1_team])
            features['p1_mean_spd'] = np.mean([p.get('base_spd', 0) for p in p1_team])
        # --- Player 2 Lead Features ---
        p2_lead = battle.get('p2_lead_details')
        if p2_lead:
            # Player 2's lead Pokémon's stats
            features['p2_lead_hp'] = p2_lead.get('base_hp', 0)
            features['p2_lead_spe'] = p2_lead.get('base_spe', 0)
            features['p2_lead_atk'] = p2_lead.get('base_atk', 0)
            features['p2_lead_def'] = p2_lead.get('base_def', 0)
            features['p2_lead_spa'] = p2_lead.get('base_spa', 0)
            features['p2_lead_spd'] = p2_lead.get('base_spd', 0)

        # We also need the ID and the target variable (if it exists)
        features['battle_id'] = battle.get('battle_id')
        if 'player_won' in battle:
            features['player_won'] = int(battle['player_won'])
            
        feature_list.append(features)
        
    return pd.DataFrame(feature_list).fillna(0)

# Create feature DataFrames for both training and test sets
print("Processing training data...")
train_df = create_simple_features(train_data)

test_df = create_simple_features(test_data)

print("\nTraining features preview:")
display(train_df.head())

Processing training data...


Extracting features:   0%|          | 0/10000 [00:00<?, ?it/s]

Extracting features:   0%|          | 0/5000 [00:00<?, ?it/s]


Training features preview:


| | p1_mean_hp | p1_mean_spe | p1_mean_atk | p1_mean_def | p1_mean_spa | p1_mean_spd | p2_lead_hp | p2_lead_spe | p2_lead_atk | p2_lead_def | p2_lead_spa | p2_lead_spd | battle_id | player_won |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 115.833333 | 80.0 | 72.5 | 63.333333 | 100.0 | 100.0 | 60 | 115 | 75 | 85 | 100 | 100 | 0 | 1 |
| 1 | 123.333333 | 61.666667 | 72.5 | 65.833333 | 90.0 | 90.0 | 55 | 120 | 50 | 45 | 135 | 135 | 1 | 1 |
| 2 | 124.166667 | 65.833333 | 84.166667 | 71.666667 | 90.0 | 90.0 | 250 | 50 | 5 | 5 | 105 | 105 | 2 | 1 |
| 3 | 121.666667 | 75.833333 | 77.5 | 65.833333 | 103.333333 | 103.333333 | 75 | 110 | 100 | 95 | 70 | 70 | 3 | 1 |
| 4 | 114.166667 | 72.5 | 75.833333 | 79.166667 | 97.5 | 97.5 | 60 | 115 | 75 | 85 | 100 | 100 | 4 | 1 |


#### Create Dynamic Features out of the battle timeline
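
The timeline stores one state snapshot per turn, so the dynamic features mostly come down to keeping the last known state of each Pokémon and counting changes between consecutive turns. The sketch below shows that idea on a toy timeline. `summarize_side` is a hypothetical helper, and the `"nostatus"` value is an assumption about how a healthy Pokémon is encoded; `"par"` and `"fnt"` are the codes used in the cell that follows.

```python
# Toy timeline with the same per-turn keys as the competition data.
toy_timeline = [
    {"p1_pokemon_state": {"name": "starmie", "hp_pct": 1.0, "status": "nostatus"}},
    {"p1_pokemon_state": {"name": "starmie", "hp_pct": 0.6, "status": "par"}},
    {"p1_pokemon_state": {"name": "chansey", "hp_pct": 1.0, "status": "nostatus"}},
    {"p1_pokemon_state": {"name": "starmie", "hp_pct": 0.0, "status": "fnt"}},
]

def summarize_side(timeline, prefix):
    """Collapse per-turn snapshots into each Pokémon's last known state plus a switch count."""
    last_state = {}
    switches = 0
    previous = None
    for turn in timeline:
        state = turn.get(f"{prefix}_pokemon_state") or {}
        name = state.get("name")
        if name is None:
            continue
        # A change of active Pokémon between consecutive turns counts as one switch.
        if previous is not None and name != previous:
            switches += 1
        previous = name
        # Later turns overwrite earlier ones, so only the final snapshot survives.
        last_state[name] = {"hp": state.get("hp_pct", 0), "status": state.get("status", "")}
    return last_state, switches

last_state, switches = summarize_side(toy_timeline, "p1")
fainted = [name for name, s in last_state.items() if s["status"] == "fnt"]
print(switches, fainted)  # 2 ['starmie']
```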

In [5]:
from collections import Counter

# Moves treated as stalling: healing, screens, passive damage, status, accuracy drops, and disruption

STALL_MOVES = {
    # healing
    "recover", "softboiled", "rest",

    # defensive buffs / shields
    "lightscreen", "reflect", "substitute",

    # passive damage / tempo control
    "toxic", "leechseed",

    # status spreading
    "thunderwave", "stunspore", "poisonpowder",
    "sleeppowder", "sing", "hypnosis",

    # accuracy / confusion stall
    "confuseray", "supersonic",
    "flash", "kinesis", "smokescreen", "sandattack",

    # utility disruption
    "disable"
}

def get_sum_stall_moves(timeline, player_prefix="p1"):
    stall_move_count = 0

    for turn in timeline:
        move_details = turn.get(f"{player_prefix}_move_details")
        if move_details and move_details["name"].lower() in STALL_MOVES:
            stall_move_count += 1

    return stall_move_count

BOOST_MOVES = {
    "swordsdance", "meditate", "sharpen",
    "amnesia", "growth",
    "agility",
    "harden", "defensecurl", "barrier", "acidarmor",
    "doubleteam", "minimize"
}

def get_sum_boost_moves(timeline, player_prefix="p1"):
    boost_move_count = 0

    for turn in timeline:
        move_details = turn.get(f"{player_prefix}_move_details")
        if move_details and move_details["name"].lower() in BOOST_MOVES:
            boost_move_count += 1

    return boost_move_count

def get_number_of_switches(timeline, player_prefix="p1"):
    switch_count = 0

    previous_pokemon = None
    for turn in timeline:
        state = turn.get(f"{player_prefix}_pokemon_state", {})
        if not state or "name" not in state:
            continue

        current_pokemon = state["name"]
        if previous_pokemon is not None and current_pokemon != previous_pokemon:
            switch_count += 1
        previous_pokemon = current_pokemon

    return switch_count

SETUP_POKEMON = {
    # Amnesia users
    "slowbro", "snorlax", "mewtwo", "mew",
    
    # Swords Dance users
    "pinsir", "kingler", "scyther", "sandslash", "mew",
    
    # Growth users
    "victreebel", "venusaur", "tangela",
    
    # Agility sweepers
    "jolteon", "zapdos", "dragonite", "fearow",
    
    # Barrier / Acid Armor
    "mrmime", "dewgong", "muk", "vaporeon", "mewtwo",
}

def get_setup_pokemon(pokemon_dict):
    setup_users = []

    for pokemon, values in pokemon_dict.items():
        moves = values.get("moves", [])
        for m in moves:
            normalized = m.lower().replace(" ", "")
            if normalized in BOOST_MOVES:
                setup_users.append(pokemon)
                break  # no need to check more moves

    return setup_users

def num_setup(pokemon_dict):
    return len(get_setup_pokemon(pokemon_dict))

def build_player_dict(timeline, prefix):
    player_pokemons = {}

    for turn in timeline:
        state = turn.get(f"{prefix}_pokemon_state", {})
        if not state or "name" not in state:
            continue

        name = state["name"]
        if name not in player_pokemons:
            player_pokemons[name] = {
                "hp": 1,
                "status": "",
                "moves": [],
                # Latest boost stage per stat; overwritten on every turn below
                "boosts": {k: 0 for k in ["atk", "def", "spa", "spd", "spe"]},
            }

        # HP and status (the latest turn overwrites earlier values)
        player_pokemons[name]["hp"] = state.get("hp_pct", 0)
        player_pokemons[name]["status"] = state.get("status", "")

        # Boosts
        boosts = state.get("boosts", {})
        for k in player_pokemons[name]["boosts"]:
            player_pokemons[name]["boosts"][k] = boosts.get(k, 0)

        # Moves used (each move name is recorded once per Pokémon)
        move_details = turn.get(f"{prefix}_move_details")
        if move_details is not None:
            if move_details['name'] not in player_pokemons[name]["moves"]:
                player_pokemons[name]["moves"].append(move_details["name"])

    # One entry per Pokémon, holding its last known state
    return player_pokemons


def aggregate_player_stats(player_dict):
    """Aggregates all Pokémon stats for one player."""
    if not player_dict:
        # Same (stats, pokemon_left) shape and keys as the normal return below,
        # so callers can always unpack two values
        return {
            "total_hp_left": 0,
            "num_seen": 0,
            "num_fainted": 0,
            "num_paralyzed": 0,
            "num_frozen": 0,
            "num_psn": 0,
            "num_brn": 0,
            "num_slp": 0,
            "avg_boosts": {k: 0 for k in ["atk", "def", "spa", "spd", "spe"]},
        }, []

    pokemons_names = player_dict.keys()
    pokemons = list(player_dict.values())
    num_fainted = sum(1 for pokemon in pokemons if pokemon['status'] == "fnt")
    
    # return the pokemon left
    pokemon_left = [name for name, p in zip(pokemons_names, pokemons) if p['status'] != "fnt"]
    pokemon_left_stats = [p for name, p in zip(pokemons_names, pokemons) if p['status'] != "fnt"]
    total_hp_left = sum(p["hp"] for p in pokemon_left_stats)
    status_counts = Counter(p['status'] for p in pokemon_left_stats if p.get('status'))
    num_paralyzed = status_counts['par']
    num_frozen = status_counts['frz']
    num_psn = status_counts['psn']
    num_brn = status_counts['brn']
    num_slp = status_counts['slp']
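    # np.mean of an empty list is NaN (with a RuntimeWarning), so default to 0 when no Pokémon is left
    boosts = {
        k: float(np.mean([p["boosts"][k] for p in pokemon_left_stats])) if pokemon_left_stats else 0.0
        for k in ["atk", "def", "spa", "spd", "spe"]
    }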
    #print("Boosts:", boosts)

    return {
        "total_hp_left": total_hp_left,
        "num_seen": len(pokemons),
        "num_fainted": num_fainted,
        "num_paralyzed": num_paralyzed,
        "num_frozen": num_frozen,
        "num_psn": num_psn,
        "num_brn": num_brn,
        "num_slp": num_slp,
        "avg_boosts": boosts,
    }, pokemon_left


def create_dynamic_features(data: list[dict]) -> pd.DataFrame:
    feature_list = []

    for battle in tqdm(data, desc="Extracting two-dict features"):
        timeline = battle.get("battle_timeline", [])
        if not timeline:
            continue

        p1_dict = build_player_dict(timeline, "p1")
        p2_dict = build_player_dict(timeline, "p2")
        #print("Player 1 dict:", p1_dict, "\n")
        #print("Player 2 dict:", p2_dict, "\n")
        p1_stats, p1_pokemon_left = aggregate_player_stats(p1_dict)
        p2_stats, p2_pokemon_left = aggregate_player_stats(p2_dict)

        n_stall_moves_p1 = get_sum_stall_moves(timeline)
        n_stall_moves_p2 = get_sum_stall_moves(timeline, player_prefix="p2")
        
        stall_move_usage_diff = n_stall_moves_p1 - n_stall_moves_p2

        n_boost_moves_p1 = get_sum_boost_moves(timeline)
        n_boost_moves_p2 = get_sum_boost_moves(timeline, player_prefix="p2")

        boost_move_usage_diff = n_boost_moves_p1 - n_boost_moves_p2

        n_switches_p1 = get_number_of_switches(timeline)
        n_switches_p2 = get_number_of_switches(timeline, player_prefix="p2")
        switch_frequency_diff = n_switches_p1 - n_switches_p2
        
        num_setup_p1 = num_setup(p1_dict)
        num_setup_p2 = num_setup(p2_dict)
        num_setup_diff = num_setup_p1 - num_setup_p2

        has_recovery_p1 = any(move.lower() == 'recover' for p in p1_dict.values() for move in p['moves'])
        has_recovery_p2 = any(move.lower() == 'recover' for p in p2_dict.values() for move in p['moves'])
        has_recovery_diff = has_recovery_p1 - has_recovery_p2

        def type_and_stat_advantages(p1_pokemon_left, p2_pokemon_left, pokemon_df, all_types):
            # all_types is unused here; it is kept so the call below stays unchanged
            stat_cols = ['base_hp', 'base_atk', 'base_def', 'base_spa', 'base_spd', 'base_spe']

            p1_types = []
            p1_stats = []
            for p_name in p1_pokemon_left:
                p_data = pokemon_df[pokemon_df['name'] == p_name].iloc[0]
                p1_types.append(p_data['types'])
                p1_stats.append([p_data[c] for c in stat_cols])
            enemy_types = []
            enemy_stats = []
            for e_name in p2_pokemon_left:
                e_data = pokemon_df[pokemon_df['name'] == e_name].iloc[0]
                enemy_types.append(e_data['types'])
                enemy_stats.append([e_data[c] for c in stat_cols])

            # The advantages are computed once, after both lookup loops have finished
            # flatten type lists, remove "notype"
            p1_all_types = [t for ts in p1_types for t in ts if t != "notype"]
            p2_all_types = [t for ts in enemy_types for t in ts if t != "notype"]
            # team vs team type advantage
            p1_type_adv = compute_type_advantage(p1_all_types, p2_all_types, type_chart)
            p2_type_adv = compute_type_advantage(p2_all_types, p1_all_types, type_chart)

            type_advantage = p1_type_adv / p2_type_adv

            # If one side has no Pokémon left, the mean stat is undefined: return NaN
            # (replaced with 0 by fillna later) instead of averaging an empty array
            if not p1_stats or not enemy_stats:
                return type_advantage, np.nan

            # compute the mean over all remaining Pokémon and all six base stats
            p1_avg = np.array(p1_stats, dtype=float).mean()
            p2_avg = np.array(enemy_stats, dtype=float).mean()

            stat_advantage = p1_avg / p2_avg

            return type_advantage, stat_advantage

        type_advantage, stat_advantage = type_and_stat_advantages(p1_pokemon_left, p2_pokemon_left, 
                                                                                        pokemon_df, 
                                                                                        all_types)
        
        boost_advantage = np.mean(list(p1_stats["avg_boosts"].values())) - np.mean(list(p2_stats["avg_boosts"].values()))
        
        features = {
            "battle_id": battle.get("battle_id"),
            "hp_ratio": p1_stats["total_hp_left"] / (p2_stats["total_hp_left"] + 1e-9),
            "p1_num_seen": p1_stats["num_seen"],
            "p2_num_seen": p2_stats["num_seen"],
            "p1_num_fainted": p1_stats["num_fainted"],
            "p2_num_fainted": p2_stats["num_fainted"],
            "num_paralyzed_diff": p1_stats["num_paralyzed"] - p2_stats["num_paralyzed"],
            "num_frozen_diff": p1_stats["num_frozen"] - p2_stats["num_frozen"],
            "num_psn_diff": p1_stats["num_psn"] - p2_stats["num_psn"],
            "num_brn_diff": p1_stats["num_brn"] - p2_stats["num_brn"],
            "num_slp_diff": p1_stats["num_slp"] - p2_stats["num_slp"],
            "num_seen_diff": p1_stats["num_seen"] - p2_stats["num_seen"],
            "num_fainted_diff": p1_stats["num_fainted"] - p2_stats["num_fainted"],
            "type_advantage": type_advantage,
            "stat_advantage": stat_advantage,
            "boost_advantage": boost_advantage,
            "stall_move_usage_diff": stall_move_usage_diff,
            "boost_move_usage_diff": boost_move_usage_diff,
            "switch_frequency_diff": switch_frequency_diff,
            "num_setup_diff": num_setup_diff,
            "has_recovery_diff": has_recovery_diff
        }

        

        feature_list.append(features)

    return pd.DataFrame(feature_list).fillna(0)

print("Processing training data...")
train_df_dynamic = create_dynamic_features(train_data)

print("\nProcessing test data...")
# test_data was already loaded in the first cell, so it is reused here
test_df_dynamic = create_dynamic_features(test_data)

print("\nTraining features preview:")
display(train_df_dynamic.head())

Processing training data...


Extracting two-dict features:   0%|          | 0/10000 [00:00<?, ?it/s]

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  p1_avg = p1_stats.mean()



Processing test data...


Extracting two-dict features:   0%|          | 0/5000 [00:00<?, ?it/s]


Training features preview:


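| | battle_id | hp_ratio | p1_num_seen | p2_num_seen | p1_num_fainted | p2_num_fainted | num_paralyzed_diff | num_frozen_diff | num_psn_diff | num_brn_diff | ... | num_seen_diff | num_fainted_diff | type_advantage | stat_advantage | boost_advantage | stall_move_usage_diff | boost_move_usage_diff | switch_frequency_diff | num_setup_diff | has_recovery_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2.892367 | 4 | 4 | 1 | 1 | 0 | -1 | 0 | 0 | ... | 0 | 0 | 0.923077 | 0.962733 | 0.0 | 7 | 0 | -4 | 0 | -1 |
| 1 | 1 | 0.614786 | 6 | 6 | 3 | 0 | -2 | 0 | 0 | 0 | ... | 0 | 3 | 0.951613 | 0.946708 | 0.066667 | -2 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0.546296 | 3 | 4 | 1 | 0 | -1 | 0 | 0 | 0 | ... | -1 | 1 | 1.0 | 0.95082 | 0.1 | -2 | 8 | 2 | 1 | -1 |
| 3 | 3 | 1.209302 | 5 | 4 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 3 | 1.057143 | 1.032558 | 0.0 | 3 | 0 | 1 | 0 | 0 |
| 4 | 4 | 1.283721 | 5 | 5 | 1 | 0 | -2 | 0 | 0 | 0 | ... | 0 | 1 | 1.046875 | 1.018868 | 0.0 | -6 | 0 | -2 | 0 | 0 |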


In [6]:
# Combine Dynamic Features with Simple Features
train_df_combined = pd.merge(train_df, train_df_dynamic, on="battle_id", how="inner")
test_df_combined = pd.merge(test_df, test_df_dynamic, on="battle_id", how="inner")

print(train_df_combined.head(), train_df_combined.columns)

   p1_mean_hp  p1_mean_spe  p1_mean_atk  p1_mean_def  p1_mean_spa  \
0  115.833333    80.000000    72.500000    63.333333   100.000000   
1  123.333333    61.666667    72.500000    65.833333    90.000000   
2  124.166667    65.833333    84.166667    71.666667    90.000000   
3  121.666667    75.833333    77.500000    65.833333   103.333333   
4  114.166667    72.500000    75.833333    79.166667    97.500000   

   p1_mean_spd  p2_lead_hp  p2_lead_spe  p2_lead_atk  p2_lead_def  ...  \
0   100.000000          60          115           75           85  ...   
1    90.000000          55          120           50           45  ...   
2    90.000000         250           50            5            5  ...   
3   103.333333          75          110          100           95  ...   
4    97.500000          60          115           75           85  ...   

   num_seen_diff  num_fainted_diff  type_advantage  stat_advantage  \
0              0                 0        0.923077        0.962733   
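
One thing to watch with the merge above: `create_dynamic_features` skips battles whose timeline is empty, and an inner merge then silently drops those battles, which would leave the test set short of rows at submission time. A minimal sketch of a more defensive merge on toy tables follows; the column values are made up, and filling gaps with zeros is an assumption that mirrors the `fillna(0)` used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-ins for the static and dynamic feature tables; battle 2 has no dynamic row.
static_df = pd.DataFrame({"battle_id": [0, 1, 2], "p1_mean_hp": [115.8, 123.3, 124.2]})
dynamic_df = pd.DataFrame({"battle_id": [0, 1], "hp_ratio": [2.89, 0.61]})

merged = pd.merge(
    static_df, dynamic_df,
    on="battle_id",
    how="left",             # keep every battle from the static table
    validate="one_to_one",  # raise if battle_id is duplicated on either side
    indicator=True,         # adds a _merge column: "both" or "left_only"
)

# Battles that had no dynamic features
missing = merged.loc[merged["_merge"] == "left_only", "battle_id"].tolist()
print(missing)  # [2]

# Fill the gaps with zeros and drop the helper column
merged = merged.drop(columns="_merge").fillna(0)
print(len(merged))  # 3
```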


### 3. Training a Baseline Model

Now that we have some features, let's prepare the training matrices. A simple `LogisticRegression` model gives us a starting point for our predictions; the training pipeline further below also tries several other model families.
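
As a reference point, a scaled logistic regression evaluated with stratified cross-validation can be sketched as follows. The data here is synthetic (`X_demo`, `y_demo` are made-up stand-ins), so the printed score says nothing about the competition data; in the notebook you would pass the real feature matrix and labels instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: 200 battles, 4 features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
# The label depends on the first feature plus noise, so there is something to learn.
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Scaling inside the pipeline means each CV fold fits the scaler on its own training part only.
baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(baseline, X_demo, y_demo, cv=cv_demo, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```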

In [7]:
features = [col for col in train_df_combined.columns if col not in ['battle_id', 
                                                                    'player_won'
                                                                    ]]
print("Final feature set:", features)
print(test_df_combined.head(), test_df_combined.columns)
X_train = train_df_combined[features]
y_train = train_df_combined['player_won']
X_test = test_df_combined[features]
# Maybe compute the type and stat advantages only on the pokemon that are alive

Final feature set: ['p1_mean_hp', 'p1_mean_spe', 'p1_mean_atk', 'p1_mean_def', 'p1_mean_spa', 'p1_mean_spd', 'p2_lead_hp', 'p2_lead_spe', 'p2_lead_atk', 'p2_lead_def', 'p2_lead_spa', 'p2_lead_spd', 'hp_ratio', 'p1_num_seen', 'p2_num_seen', 'p1_num_fainted', 'p2_num_fainted', 'num_paralyzed_diff', 'num_frozen_diff', 'num_psn_diff', 'num_brn_diff', 'num_slp_diff', 'num_seen_diff', 'num_fainted_diff', 'type_advantage', 'stat_advantage', 'boost_advantage', 'stall_move_usage_diff', 'boost_move_usage_diff', 'switch_frequency_diff', 'num_setup_diff', 'has_recovery_diff']
   p1_mean_hp  p1_mean_spe  p1_mean_atk  p1_mean_def  p1_mean_spa  \
0  117.500000    78.333333    74.166667    61.666667    93.333333   
1   70.166667    95.833333    95.666667    96.666667    94.166667   
2  120.000000    61.666667    90.833333    88.333333    78.333333   
3  114.166667    71.666667    70.000000    71.666667    97.500000   
4  116.666667    78.333333    75.000000    65.833333    99.166667   

   p1_mean_spd

## Training Pipeline

In [18]:
!pip install xgboost
!pip install scikit-learn
!pip install lightgbm
!pip install catboost
!pip install optuna



In [8]:
import optuna
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import make_scorer, accuracy_score


In [11]:
from sklearn.model_selection import train_test_split


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scorer = make_scorer(accuracy_score)

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train,
    test_size=0.2,     # 20% validation set
    random_state=42,   # makes it reproducible
    stratify=y_train         # keeps class balance
)

# --- Define objective function ---
def objective(trial):
    model_name = trial.suggest_categorical("model", ["LogReg", "RandomForest", "XGBoost", "LightGBM", "CatBoost", "NeuralNet"])
    
    if model_name == "LogReg":
        # suggest_loguniform is deprecated in Optuna; suggest_float(..., log=True) is the replacement
        C = trial.suggest_float("logreg_C", 0.01, 10, log=True)
        model = Pipeline([("scaler", StandardScaler()),
                          ("model", LogisticRegression(C=C, penalty="l2", solver="lbfgs", max_iter=1000))])
    
    elif model_name == "RandomForest":
        n_estimators = trial.suggest_int("rf_n_estimators", 20, 200)
        max_depth = trial.suggest_int("rf_max_depth", 3, 20)
        min_samples_split = trial.suggest_int("rf_min_samples_split", 2, 10)
        min_samples_leaf = trial.suggest_int("rf_min_samples_leaf", 1, 5)
        model = RandomForestClassifier(n_estimators=n_estimators,
                                       max_depth=max_depth,
                                       min_samples_split=min_samples_split,
                                       min_samples_leaf=min_samples_leaf,
                                       random_state=42)
    
    elif model_name == "XGBoost":
        n_estimators = trial.suggest_int("xgb_n_estimators", 20, 200)
        max_depth = trial.suggest_int("xgb_max_depth", 3, 10)
        learning_rate = trial.suggest_float("xgb_lr", 0.01, 0.3, log=True)
        subsample = trial.suggest_float("xgb_subsample", 0.6, 1.0)
        model = xgb.XGBClassifier(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  learning_rate=learning_rate,
                                  subsample=subsample,
                                  use_label_encoder=False,
                                  eval_metric="logloss",
                                  random_state=42)
    
    elif model_name == "LightGBM":
        n_estimators = trial.suggest_int("lgb_n_estimators", 20, 200)
        num_leaves = trial.suggest_int("lgb_num_leaves", 5, 100)
        learning_rate = trial.suggest_float("lgb_lr", 0.01, 0.3, log=True)
        colsample_bytree = trial.suggest_float("lgb_colsample", 0.6, 1.0)
        model = lgb.LGBMClassifier(n_estimators=n_estimators,
                                   num_leaves=num_leaves,
                                   learning_rate=learning_rate,
                                   colsample_bytree=colsample_bytree,
                                   random_state=42)
    
    elif model_name == "CatBoost":
        iterations = trial.suggest_int("cat_iterations", 100, 500)
        depth = trial.suggest_int("cat_depth", 3, 10)
        learning_rate = trial.suggest_float("cat_learning_rate", 0.01, 0.3, log=True)
        model = CatBoostClassifier(iterations=iterations,
                                   depth=depth,
                                   learning_rate=learning_rate,
                                   verbose=False,
                                   random_state=42)
    
    elif model_name == "NeuralNet":
        # Only one architecture is searched; add more tuples to this list to tune layer sizes
        hidden_layer_sizes = trial.suggest_categorical("nn_hidden", [(8, 16, 32)])
        activation = trial.suggest_categorical("nn_activation", ["relu", "tanh"])
        alpha = trial.suggest_float("nn_alpha", 1e-4, 1e-2, log=True)
        learning_rate_init = trial.suggest_float("nn_lr", 1e-4, 5e-3, log=True)
        model = Pipeline([("scaler", StandardScaler()),
                          ("model", MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,
                                                  activation=activation,
                                                  alpha=alpha,
                                                  learning_rate_init=learning_rate_init,
                                                  learning_rate='adaptive',
                                                  max_iter=500,
                                                  random_state=42))])
        
    score = cross_val_score(model, X_train, y_train, cv=cv, scoring=scorer, n_jobs=-1)
    return score.mean()
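
Several ranges in `objective` (the regularization strengths and the learning rates) are searched on a log scale. The sketch below illustrates what that scale means, using only Python's `random` module: values are drawn so that their logarithm is uniform, which gives each order of magnitude roughly the same share of draws. The helper `sample_log_uniform` is hypothetical and shows the scale only; it is not how Optuna's sampler picks values.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
# Same bounds as the "logreg_C" range above: three decades from 0.01 to 10.
samples = [sample_log_uniform(rng, 0.01, 10) for _ in range(10_000)]

# Each decade should receive about a third of the draws.
for lo, hi in [(0.01, 0.1), (0.1, 1.0), (1.0, 10.0)]:
    share = sum(lo <= s < hi for s in samples) / len(samples)
    print(f"[{lo}, {hi}): {share:.2f}")
```

A plain uniform draw over the same interval would place about 90% of the values between 1 and 10, so small values of `C` or of a learning rate would rarely be tried.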

In [None]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

print("Best model and params:")
print(study.best_trial.params)
print("Best CV accuracy:", study.best_trial.value)

In [28]:
nn_model = Pipeline([
    ("scaler", StandardScaler()),
    ("nn", MLPClassifier(
        hidden_layer_sizes=(64,32,16),   # You can tune this
        activation="relu",
        solver="adam",
        learning_rate_init=0.001,
        alpha=0.0005,          # L2 regularization
        max_iter=500,
        early_stopping=True,
        random_state=42
    ))
])

nn_model.fit(X_train, y_train)


In [None]:
preds = nn_model.predict(X_val)
acc = accuracy_score(y_val, preds)
print("Validation accuracy:", acc)

# Make predictions on the test data
print("Generating predictions on the test set...")
test_predictions = nn_model.predict(X_test)

# Create the submission DataFrame
submission_df = pd.DataFrame({
    'battle_id': test_df['battle_id'],
    'player_won': test_predictions
})

# Save the DataFrame to a .csv file
submission_df.to_csv('submission.csv', index=False)

print("\n'submission.csv' file created successfully!")
display(submission_df.head())

Validation accuracy: 0.8405


In [58]:
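best_params = study.best_trial.params
print(best_params)

# study.best_trial.params uses the prefixed names from objective() (e.g. "xgb_lr"),
# so each branch maps them back to the estimator's own argument names.
best_model_name = best_params["model"]
print("Training best model:", best_model_name)
if best_model_name == "LogReg":
    model = Pipeline([("scaler", StandardScaler()),
                      ("model", LogisticRegression(C=best_params["logreg_C"], penalty="l2", solver="lbfgs", max_iter=1000))])
elif best_model_name == "RandomForest":
    model = RandomForestClassifier(n_estimators=best_params["rf_n_estimators"],
                                   max_depth=best_params["rf_max_depth"],
                                   min_samples_split=best_params["rf_min_samples_split"],
                                   min_samples_leaf=best_params["rf_min_samples_leaf"],
                                   random_state=42)
elif best_model_name == "XGBoost":
    model = xgb.XGBClassifier(n_estimators=best_params["xgb_n_estimators"],
                              max_depth=best_params["xgb_max_depth"],
                              learning_rate=best_params["xgb_lr"],
                              subsample=best_params["xgb_subsample"],
                              use_label_encoder=False, eval_metric="logloss", random_state=42)
elif best_model_name == "LightGBM":
    model = lgb.LGBMClassifier(num_leaves=best_params["lgb_num_leaves"],
                               n_estimators=best_params["lgb_n_estimators"],
                               learning_rate=best_params["lgb_lr"],
                               colsample_bytree=best_params["lgb_colsample"],
                               random_state=42)
elif best_model_name == "CatBoost":
    model = CatBoostClassifier(iterations=best_params["cat_iterations"],
                               depth=best_params["cat_depth"],
                               learning_rate=best_params["cat_learning_rate"], verbose=False, random_state=42)
elif best_model_name == "NeuralNet":
    model = Pipeline([("scaler", StandardScaler()),
                      ("model", MLPClassifier(hidden_layer_sizes=best_params["nn_hidden"],
                                              activation=best_params["nn_activation"],
                                              alpha=best_params["nn_alpha"],
                                              learning_rate_init=best_params["nn_lr"],
                                              learning_rate="adaptive",
                                              max_iter=500, random_state=42))])

model.fit(X_train, y_train)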


{'model': 'CatBoost', 'cat_iterations': 275, 'cat_depth': 5, 'cat_learning_rate': 0.08029789776807468}
Training best model: CatBoost


<catboost.core.CatBoostClassifier at 0x20896bcc7d0>

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = model.predict(X_val)
acc = accuracy_score(y_val, y_pred)
print(f"Validation Accuracy: {acc:.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_val, y_pred))

Validation Accuracy: 0.8405

Confusion Matrix:
[[859 141]
 [178 822]]
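
The confusion matrix above also yields per-class metrics. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, classes ordered 0 then 1) and treats `player_won = 1` as the positive class; the variable names are illustrative.

```python
# Counts copied from the confusion matrix printed above.
# Assumed layout (scikit-learn convention): rows = true class, columns = predicted class.
cm = [[859, 141],
      [178, 822]]

tn, fp = cm[0]  # true class 0: correctly predicted losses, losses predicted as wins
fn, tp = cm[1]  # true class 1: wins predicted as losses, correctly predicted wins

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of the battles predicted as wins, the share that were wins
recall = tp / (tp + fn)     # of the actual wins, the share that were predicted as wins
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The accuracy computed this way matches the 0.8405 reported above. Under the assumed layout, the model misses more actual wins (178) than it predicts wins that did not happen (141).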


### 4. Creating the Submission File

The competition requires a `.csv` file with two columns: `battle_id` and `player_won`. Let's use our trained model to make predictions on the test set and format them correctly.

In [None]:
# Make predictions on the test data
print("Generating predictions on the test set...")
test_predictions = model.predict(X_test)

# Create the submission DataFrame
submission_df = pd.DataFrame({
    'battle_id': test_df['battle_id'],
    'player_won': test_predictions
})

# Save the DataFrame to a .csv file
submission_df.to_csv('submission.csv', index=False)

print("\n'submission.csv' file created successfully!")
display(submission_df.head())

Generating predictions on the test set...

'submission.csv' file created successfully!


   battle_id  player_won
0          0           0
1          1           1
2          2           1
3          3           1
4          4           1
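
Before submitting, it can help to check that the file matches the required format. The sketch below is one possible check using only the standard library; `check_submission` is a hypothetical helper, and it assumes labels are written as `0`/`1`, as in the preview above.

```python
import csv


def check_submission(path, expected_rows=None):
    """Return a list of problems found in a submission file (empty if none)."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # The competition expects exactly these two columns.
        if reader.fieldnames != ["battle_id", "player_won"]:
            problems.append(f"unexpected columns: {reader.fieldnames}")
            return problems
        rows = list(reader)

    ids = [row["battle_id"] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicate battle_id values")

    # Assumption: predictions are stored as the integers 0 and 1.
    bad = [row for row in rows if row["player_won"] not in {"0", "1"}]
    if bad:
        problems.append(f"{len(bad)} rows with a player_won value other than 0 or 1")

    if expected_rows is not None and len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    return problems
```

In this notebook it could be called as `check_submission('submission.csv', expected_rows=len(test_df))`; an empty list means none of these checks failed.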


### 5. Submitting Your Results

Once you have generated your `submission.csv` file, there are two primary ways to submit it to the competition.

---

#### Method A: Submitting Directly from the Notebook

This is the standard method for code competitions. It ensures that your submission is linked to the code that produced it, which is crucial for reproducibility.

1.  **Save Your Work:** Click the **"Save Version"** button in the top-right corner of the notebook editor.
2.  **Run the Notebook:** In the pop-up window, select **"Save & Run All (Commit)"** and then click the **"Save"** button. This will run your entire notebook from top to bottom and save the output, including your `submission.csv` file.
3.  **Go to the Viewer:** Once the save process is complete, navigate to the notebook viewer page. 
4.  **Submit to Competition:** In the viewer, find the **"Submit to Competition"** section. This is usually located in the header of the output section or in the vertical "..." menu on the right side of the page. Clicking the **Submit** button will submit your generated `submission.csv` file.

After submitting, you will see your score in the **"Submit to Competition"** section or on the [Public Leaderboard](https://www.kaggle.com/competitions/fds-pokemon-battles-prediction-2025/leaderboard?).

---

#### Method B: Manual Upload

You can also generate your predictions and submission file using any environment you prefer (this notebook, Google Colab, or your local machine).

1.  **Generate the `submission.csv` file** using your model.
2.  **Download the file** to your computer.
3.  **Navigate to the [Leaderboard Page](https://www.kaggle.com/competitions/fds-pokemon-battles-prediction-2025/leaderboard?)** and click on the **"Submit Predictions"** button.
4.  **Upload Your File:** Drag and drop or select your `submission.csv` file to upload it.

This method is quick, but keep in mind that for the final evaluation, you might be required to provide the code that generated your submission.

Good luck!