# Download and Load MABe Mouse Behavior Detection Dataset

This notebook downloads the [MABe Mouse Behavior Detection](https://www.kaggle.com/competitions/MABe-mouse-behavior-detection) competition dataset using the Python `kaggle` module, then loads its metadata for a first look at the tracked body parts and labeled behaviors.

## Prerequisites
1. **Kaggle Account**: Join the competition on Kaggle and accept its rules; downloads are not permitted until you do.
2. **API Token**: You need a `kaggle.json` file (API Token) from your account settings.
   - Place it in `~/.kaggle/kaggle.json` (Linux/Mac) or `%USERPROFILE%\.kaggle\kaggle.json` (Windows).
   - Or set `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables.
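
The two credential options above can be checked before attempting a download. The following is a minimal sketch, not this notebook's own download cell: `find_kaggle_credentials` and `download_competition` are hypothetical helper names, the competition slug is taken from the URL above, and the download step assumes `KaggleApi.competition_download_files` from the `kaggle` package.

```python
import os
from pathlib import Path


def find_kaggle_credentials(home=None, env=None):
    """Return "env", "file", or None depending on where credentials are found."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "env"
    if (home / ".kaggle" / "kaggle.json").is_file():
        return "file"
    return None


def download_competition(slug="MABe-mouse-behavior-detection", dest="data"):
    """Download the competition archive into dest (requires network access)."""
    # Imported lazily so the credential check above works even when the
    # kaggle package is not installed.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    Path(dest).mkdir(parents=True, exist_ok=True)
    api.competition_download_files(slug, path=dest)  # saves a .zip into dest


print("Credentials found via:", find_kaggle_credentials())
```

If the printed value is `None`, neither option is set up yet, and calling `download_competition()` would be expected to fail at the authentication step.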

## Manual Dataset Placement

If you have already downloaded the dataset manually, please create a directory named `data` in the same location as this notebook.
Then, place the **unzipped contents** of the competition dataset directly into this `data` directory.

For example, if you downloaded `MABe-mouse-behavior-detection.zip`, after unzipping it, you would copy the folders and files (e.g., `train`, `test`, `sample_submission.csv`) directly into `./data/`.

If you proceed this way, you can skip the download cells in this notebook and continue with the loading cells.
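
Before relying on a manual copy, it can help to confirm that the files landed where the notebook expects them. A minimal sketch follows; `missing_entries` is a hypothetical helper, and the expected names are only those mentioned in this notebook (`train.csv` is read by the loading cell; `train`, `test`, and `sample_submission.csv` are the examples given above), so adjust the list to the actual archive contents.

```python
from pathlib import Path


def missing_entries(data_root, expected):
    """Return the names from `expected` that do not exist under data_root."""
    root = Path(data_root)
    return [name for name in expected if not (root / name).exists()]


# Names taken from this notebook's text; the real archive may contain more.
EXPECTED = ["train.csv", "train", "test", "sample_submission.csv"]

missing = missing_entries("data", EXPECTED)
if missing:
    print("Missing under ./data:", missing)
else:
    print("All expected entries found under ./data")
```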

In [1]:
import os
from pathlib import Path
import ast
from collections import Counter, defaultdict

import numpy as np
import pandas as pd

DATA_ROOT = Path("data")  # adjust if your folder name is different

In [2]:
meta = pd.read_csv(DATA_ROOT / "train.csv")

def parse_list_column(s):
    if isinstance(s, str):
        return ast.literal_eval(s)
    return []

# body parts: string -> list[str]
meta["body_parts_tracked_list"] = meta["body_parts_tracked"].apply(parse_list_column)

# behaviors: string -> list[str]
meta["behaviors_labeled_list"] = meta["behaviors_labeled"].apply(parse_list_column)

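# convenience: count mice per video; a mouse slot (mouse1..mouse4) counts as
# present when its strain field holds a string rather than NaN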
def count_mice(row):
    c = 0
    for i in range(1, 5):
        if isinstance(row.get(f"mouse{i}_strain", np.nan), str):
            c += 1
    return c

meta["n_mice"] = meta.apply(count_mice, axis=1)

print("Number of videos:", len(meta))
print("Example metadata row:")
print(meta.iloc[0])

Number of videos: 8789
Example metadata row:
lab_id                                                        AdaptableSnail
video_id                                                            44566106
mouse1_strain                                                     CD-1 (ICR)
mouse1_color                                                           white
mouse1_sex                                                              male
mouse1_id                                                               10.0
mouse1_age                                                        8-12 weeks
mouse1_condition                                             wireless device
mouse2_strain                                                     CD-1 (ICR)
mouse2_color                                                           white
mouse2_sex                                                              male
mouse2_id                                                               24.0
mouse2_age                     
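
The row-wise `count_mice` helper can also be written as a column-wise operation. The sketch below runs on a small invented frame rather than the real `train.csv`, and counts non-null `mouseN_strain` values; it matches the `isinstance(..., str)` check only under the assumption that strain columns hold either strings or missing values.

```python
import numpy as np
import pandas as pd

# Toy metadata, invented for illustration only.
toy = pd.DataFrame(
    {
        "video_id": [1, 2, 3],
        "mouse1_strain": ["CD-1 (ICR)", "C57BL/6J", "CD-1 (ICR)"],
        "mouse2_strain": ["CD-1 (ICR)", np.nan, "C57BL/6J"],
        "mouse3_strain": [np.nan, np.nan, "C57BL/6J"],
    }
)

# Select every mouseN_strain column and count the non-missing entries per row.
strain_cols = toy.filter(regex=r"^mouse\d+_strain$")
toy["n_mice"] = strain_cols.notna().sum(axis=1)

print(toy[["video_id", "n_mice"]])
```

Unlike the loop over `range(1, 5)`, the regular expression picks up however many `mouseN_strain` columns exist, which may or may not be desirable.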

In [3]:
bp_counter = Counter()
for bps in meta["body_parts_tracked_list"]:
    bp_counter.update(bps)

n_videos = len(meta)
bp_stats = (
    pd.DataFrame(
        {"bodypart": list(bp_counter.keys()),
         "count": list(bp_counter.values())}
    )
    .assign(freq=lambda d: d["count"] / n_videos)
    .sort_values("freq", ascending=False)
)

print("\nBodypart frequency stats:")
print(bp_stats.head(20))

# relaxed criterion: keep bodyparts tracked in at least 80% of videos
# (requiring 100% would leave only ear_left, ear_right and tail_base)
THRESH = 0.8
common_bodyparts = (
    bp_stats.query("freq >= @THRESH")["bodypart"]
    .sort_values()
    .tolist()
)

print(f"\nCommon bodyparts (freq >= {THRESH}):")
print(common_bodyparts)

# fallback if threshold is too strict
if len(common_bodyparts) == 0:
    common_bodyparts = sorted(bp_counter.keys())
    print("\nThreshold produced empty set, using all bodyparts instead:")
    print(common_bodyparts)



Bodypart frequency stats:
         bodypart  count      freq
2       ear_right   8789  1.000000
15      tail_base   8789  1.000000
1        ear_left   8789  1.000000
14           nose   8772  0.998066
13           neck   8577  0.975879
0     body_center   8114  0.923199
17       tail_tip   8030  0.913642
16  tail_midpoint   7943  0.903743
24  hindpaw_right   7926  0.901809
23   hindpaw_left   7926  0.901809
22  forepaw_right   7926  0.901809
21   forepaw_left   7926  0.901809
19      hip_right    655  0.074525
18       hip_left    655  0.074525
11   lateral_left    169  0.019229
12  lateral_right    169  0.019229
27  tail_middle_1     21  0.002389
26        spine_2     21  0.002389
25        spine_1     21  0.002389
28  tail_middle_2     21  0.002389

Common bodyparts (freq >= 0.8):
['body_center', 'ear_left', 'ear_right', 'forepaw_left', 'forepaw_right', 'hindpaw_left', 'hindpaw_right', 'neck', 'nose', 'tail_base', 'tail_midpoint', 'tail_tip']
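
One way to use the common set is to flag which videos track every bodypart in it. This is a minimal sketch on invented toy lists (the real notebook would use `meta["body_parts_tracked_list"]` and `common_bodyparts`); the column name `has_all_common` is hypothetical.

```python
import pandas as pd

# Toy stand-ins for meta["body_parts_tracked_list"] and common_bodyparts.
toy = pd.DataFrame(
    {
        "video_id": [1, 2, 3],
        "body_parts_tracked_list": [
            ["nose", "ear_left", "ear_right", "tail_base"],
            ["nose", "tail_base"],
            ["ear_left", "ear_right", "tail_base", "nose", "neck"],
        ],
    }
)
common = {"nose", "ear_left", "ear_right", "tail_base"}

# A video qualifies when the common set is a subset of its tracked bodyparts.
toy["has_all_common"] = toy["body_parts_tracked_list"].apply(common.issubset)

print(toy[["video_id", "has_all_common"]])
print("Videos covering the common set:", int(toy["has_all_common"].sum()))
```

Because the threshold is applied per bodypart, a video can still lack some of the common bodyparts, so a check like this is worth running before assuming a fixed feature layout.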


In [4]:
beh_counter = Counter()
for b_list in meta["behaviors_labeled_list"]:
    beh_counter.update(b_list)

def extract_action(s):
    # "mouse1,mouse2,chase" -> "chase"
    # some labels carry stray quotes around the action (e.g. "'attack'");
    # strip them so they merge with the unquoted form
    return s.split(",")[-1].strip().strip("'\"")

all_actions = sorted({extract_action(s) for s in beh_counter.keys()})
print("\nBehavior action vocabulary:")
print(all_actions)


Behavior action vocabulary:
['allogroom', 'approach', 'attack', 'attemptmount', 'avoid', 'biteobject', 'chase', 'chaseattack', 'climb', 'defend', 'dig', 'disengage', 'dominance', 'dominancegroom', 'dominancemount', 'ejaculate', 'escape', 'exploreobject', 'flinch', 'follow', 'freeze', 'genitalgroom', 'huddle', 'intromit', 'mount', 'rear', 'reciprocalsniff', 'rest', 'run', 'selfgroom', 'shepherd', 'sniff', 'sniffbody', 'sniffface', 'sniffgenital', 'submit', 'tussle']
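
A natural follow-up is counting how many videos label each action, rather than only listing the vocabulary. The sketch below uses invented label strings in the `mouse1,mouse2,chase` form; `normalize_action` is a hypothetical helper that takes the last comma-separated field and also strips stray quotes and whitespace.

```python
from collections import Counter


def normalize_action(label):
    """Return the last comma-separated field without quotes or whitespace."""
    return label.split(",")[-1].strip().strip("'\"")


# Invented per-video label lists, for illustration only.
videos = [
    ["mouse1,mouse2,chase", "mouse1,mouse2,'attack'"],
    ["mouse1,mouse2,attack", "mouse2,mouse1,attack"],
    ["mouse1,self,rear"],
]

# Count each action once per video, regardless of which mice are involved.
videos_per_action = Counter()
for labels in videos:
    videos_per_action.update({normalize_action(s) for s in labels})

for action, n in sorted(videos_per_action.items()):
    print(f"{action}: {n} video(s)")
```

On the real metadata, the same loop could run over `meta["behaviors_labeled_list"]` to show which actions are labeled in only a handful of videos.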
