# Get to Know the ivrit-ai Knesset Plenums Dataset

## 🧩 1. Introduction

This notebook introduces the **ivrit-ai/knesset-plenums** dataset: long-form Hebrew parliamentary audio aligned to human-written protocols and refined with Stable-Whisper forced alignment.

The dataset contains:

* Full audio from each plenum session (audio.m4a)

* Raw protocol text

* Cleaned transcript text

* Timestamp-indexed transcript text (transcript.json)

* Force-aligned transcript with word-level timing and probabilities (transcript.aligned.json)

* Detailed metadata (metadata.json)

* A dataset-wide manifest (manifest.csv)

* A lookup table for official plenum IDs (knesset_plenums_metadata.csv)

📌 Important: This dataset is stored as a repository of raw files, not as a Hugging Face Arrow dataset.
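
Because the repository is a plain file tree, a small helper can confirm which per-session files are present in a folder. The sketch below is illustrative only: `check_session_files` is a hypothetical helper, and it checks just the four file names given in the list above. The file names of the raw protocol and cleaned transcript are not stated here, so they are left out.

```python
from pathlib import Path

# File names taken from the list above; other per-session files may exist.
EXPECTED_FILES = (
    "audio.m4a",
    "transcript.json",
    "transcript.aligned.json",
    "metadata.json",
)


def check_session_files(session_dir):
    """Map each expected file name to True/False depending on whether it exists."""
    session_dir = Path(session_dir)
    return {name: (session_dir / name).is_file() for name in EXPECTED_FILES}
```

Once the data is on disk, calling `check_session_files` on a session folder gives a quick view of what was actually downloaded for that session.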

## 🔧 2. Setup

In [None]:
import os
import json
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from IPython.display import Audio, display


## 3. Load the Dataset

Because this is a file-based dataset, we download it with `snapshot_download` from `huggingface_hub`:


In [None]:
!pip install huggingface_hub
from huggingface_hub import snapshot_download


### Download (recommended settings)

In [None]:
DATA_ROOT = Path("knesset_plenums")

snapshot_download(
    repo_id="ivrit-ai/knesset-plenums",
    repo_type="dataset",
    local_dir=DATA_ROOT,
    etag_timeout=60,      # Wait longer for file metadata requests before giving up
    max_workers=1         # Fewer parallel downloads: slower, but gentler on unstable connections
)


After the download completes, list the top-level contents:

In [None]:
DATA_ROOT = Path("knesset_plenums")
print("Top-level contents:", sorted(p.name for p in DATA_ROOT.iterdir()))
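
A full snapshot includes the complete audio of every session, which may be more than a first look needs. The sketch below shows one way to fetch only the root CSV files plus the JSON files of selected sessions, using the `allow_patterns` argument of `snapshot_download`. `build_patterns` and `download_subset` are hypothetical helpers, and the `<plenum_id>/<file>` layout is an assumption based on how session folders are addressed later in this notebook.

```python
from pathlib import Path


def build_patterns(plenum_ids, include_audio=False):
    """Return glob patterns for the root CSVs and the chosen session folders."""
    patterns = ["manifest.csv", "knesset_plenums_metadata.csv"]
    for pid in plenum_ids:
        patterns.append(f"{pid}/*.json")  # assumed layout: <plenum_id>/<file>
        if include_audio:
            patterns.append(f"{pid}/audio.m4a")
    return patterns


def download_subset(plenum_ids, local_dir="knesset_plenums", include_audio=False):
    """Download only the selected files (requires network access)."""
    # Imported lazily so build_patterns works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return Path(
        snapshot_download(
            repo_id="ivrit-ai/knesset-plenums",
            repo_type="dataset",
            local_dir=local_dir,
            allow_patterns=build_patterns(plenum_ids, include_audio),
        )
    )
```

A reasonable workflow is to call `download_subset([])` first to get only the CSV files, pick a few IDs from `manifest.csv`, and then call it again with those IDs.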


## 4. Explore the Manifest

The root contains:

* manifest.csv

* knesset_plenums_metadata.csv

* Folders for each plenum session

### Load manifest.csv

In [None]:
manifest_path = DATA_ROOT / "manifest.csv"
manifest = pd.read_csv(manifest_path)
manifest.head()


### Quick stats

In [None]:
len(manifest), manifest["plenum_date"].min(), manifest["plenum_date"].max()


In [None]:
# Plot session durations:
manifest["duration"] = manifest["duration"].astype(float)

manifest.sort_values("duration", ascending=False).head(20).plot(
    x="plenum_id",
    y="duration",
    kind="bar",
    figsize=(12,4),
    title="Top 20 Longest Plenum Sessions"
)
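
Raw duration values are hard to read in a table or on a chart axis. Assuming `duration` is stored in seconds (an assumption; check a value against a session whose length you know), a small formatter such as the hypothetical `format_duration` below turns them into `H:MM:SS` strings.

```python
def format_duration(seconds):
    """Format a duration given in seconds as H:MM:SS (fractions are dropped)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


print(format_duration(3725.4))  # 1:02:05
```

Applied to the manifest, this would look like `manifest["duration"].map(format_duration)`, and `manifest["duration"].sum() / 3600` would give the total number of hours under the same assumption.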


## 5. Explore Session-level Metadata

In [None]:
# Pick a session:
example_plenum = manifest.iloc[0]["plenum_id"]
folder = DATA_ROOT / str(example_plenum)
folder, list(folder.iterdir())


In [None]:
# Load metadata:

with open(folder / "metadata.json", "r", encoding="utf-8") as f:
    meta = json.load(f)

meta


Key fields include:

* duration
* quality_score
* per_segment_quality_scores
* segments_count
* words_count
* avg_words_per_minute, etc.
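
To compare sessions, it can help to pull just the scalar fields out of each `metadata.json`. The sketch below uses only the field names listed above; `summarize_metadata` is a hypothetical helper, and it skips any field that is missing rather than assuming every file contains all of them.

```python
KEY_FIELDS = (
    "duration",
    "quality_score",
    "segments_count",
    "words_count",
    "avg_words_per_minute",
)


def summarize_metadata(meta):
    """Return the scalar key fields that are present, skipping missing ones."""
    return {name: meta[name] for name in KEY_FIELDS if name in meta}


# Made-up values, for illustration only.
example = {"duration": 120.0, "quality_score": 0.5, "other_field": "ignored"}
print(summarize_metadata(example))
```

Collecting these dictionaries across sessions into a list and passing it to `pd.DataFrame` gives one row per session.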

## 6. Plot Segment-Level Quality Score Distribution

In [None]:
scores = [s["probability"] for s in meta["per_segment_quality_scores"]]

plt.figure(figsize=(10, 4))  # plt.hist() does not accept a figsize argument
plt.hist(scores, bins=50)
plt.title("Segment Alignment Quality Scores")
plt.xlabel("Score")
plt.ylabel("Count")
plt.show()

np.percentile(scores, [0, 25, 50, 75, 90, 95, 99])
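
The percentiles suggest where a cutoff might sit. As a minimal sketch, the hypothetical `low_quality_indices` helper below returns the positions of segments scoring under a threshold, so they can be inspected or excluded. The default of `0.5` is an arbitrary placeholder, not a value recommended by the dataset authors.

```python
def low_quality_indices(scores, threshold=0.5):
    """Return the positions of scores strictly below the threshold."""
    return [
        i for i, score in enumerate(scores)
        if score is not None and score < threshold
    ]


# Made-up scores, for illustration only.
print(low_quality_indices([0.9, 0.2, 0.7, 0.4]))  # [1, 3]
```

With the `scores` list from the cell above, `low_quality_indices(scores, threshold=0.3)` would list the segments worth a closer look.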


## 7. Listen to the Audio

Each plenum folder contains an `audio.m4a` file with the full session audio.

In [None]:
audio_file = folder / "audio.m4a"
display(Audio(str(audio_file)))
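
Full sessions are long, so loading the whole file into the notebook player can be slow. One option is to cut a short excerpt first. The sketch below builds an `ffmpeg` command for that; it assumes `ffmpeg` is installed and on the `PATH`, and `excerpt_command` and `extract_excerpt` are hypothetical helpers, not part of the dataset tooling.

```python
import subprocess


def excerpt_command(src, dst, start_s, length_s):
    """Build an ffmpeg command that copies a time window without re-encoding."""
    return [
        "ffmpeg", "-y",          # -y: overwrite dst if it already exists
        "-ss", str(start_s),     # seek to the start offset (seconds)
        "-t", str(length_s),     # keep this many seconds
        "-i", str(src),
        "-vn",                   # ignore any video or cover-art stream
        "-c:a", "copy",          # stream copy: fast, cut points are approximate
        str(dst),
    ]


def extract_excerpt(src, dst, start_s, length_s):
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(excerpt_command(src, dst, start_s, length_s), check=True)
    return dst
```

For example, `extract_excerpt(audio_file, "clip.m4a", 60, 30)` would write a 30-second clip starting one minute in, which can then be passed to `Audio`.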

## 8. Load the Aligned Transcript

File: `transcript.aligned.json` (a Stable-Whisper `WhisperResult` dump with word-level alignment)

In [None]:
with open(folder / "transcript.aligned.json", "r", encoding="utf-8") as f:
    aligned = json.load(f)

aligned.keys()


In [None]:
segments = aligned["segments"]
len(segments)


In [None]:
segments[0]
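
With segment timings available, a common next step is to look up what was said during a given stretch of audio. The sketch below assumes each segment dictionary carries `start` and `end` times in seconds and a `text` field, as Stable-Whisper output typically does; `segments_in_window` is a hypothetical helper.

```python
def segments_in_window(segments, t_start, t_end):
    """Return segments whose time interval overlaps [t_start, t_end]."""
    return [
        seg for seg in segments
        if seg.get("start") is not None
        and seg.get("end") is not None
        and seg["start"] < t_end
        and seg["end"] > t_start
    ]


# Made-up segments, for illustration only.
toy = [
    {"start": 0.0, "end": 4.0, "text": "first"},
    {"start": 4.0, "end": 9.0, "text": "second"},
    {"start": 9.0, "end": 12.0, "text": "third"},
]
print([s["text"] for s in segments_in_window(toy, 3.0, 5.0)])  # ['first', 'second']
```

On the real data, `segments_in_window(segments, 60, 90)` would return the segments overlapping the second half of the first minute and a half.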


## 9. Extract Words, Timings, and Probabilities

In [None]:
words = []
for seg in segments:
    for w in seg.get("words", []):
        words.append({
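            # Stable-Whisper word entries usually store the token under "word";
            # fall back to "text" in case this dump uses that key instead.
            "text": w.get("word", w.get("text")),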
            "start": w.get("start"),
            "end": w.get("end"),
            "probability": w.get("probability")
        })

df_words = pd.DataFrame(words)
df_words.head()


Plot the distribution of word-level probabilities:

In [None]:
plt.figure(figsize=(12, 4))
df_words["probability"].hist(bins=50)
plt.title("Word-level Alignment Probability Distribution")
plt.xlabel("Probability")
plt.ylabel("Count")
plt.show()
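
Low word probabilities tend to be more useful when grouped, since a run of uncertain words points to a stretch where audio and protocol may disagree. The sketch below merges consecutive low-probability words into spans. It works on the `words` list of dictionaries built above; `low_confidence_spans` is a hypothetical helper and the `0.3` threshold is an arbitrary placeholder.

```python
def _to_span(run):
    """Collapse a run of word dicts into one (start, end, text) tuple."""
    text = " ".join((w.get("text") or "").strip() for w in run)
    return (run[0]["start"], run[-1]["end"], text)


def low_confidence_spans(words, threshold=0.3):
    """Group consecutive words with probability below threshold into spans."""
    spans, run = [], []
    for w in words:
        prob = w.get("probability")
        if prob is not None and prob < threshold:
            run.append(w)
            continue
        if run:  # a confident word closes the current run
            spans.append(_to_span(run))
            run = []
    if run:  # flush a run that reaches the end of the list
        spans.append(_to_span(run))
    return spans
```

Each returned tuple gives the start time, end time, and joined text of one uncertain stretch, which can be fed to an excerpt or playback step for manual review.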


## License
This dataset is released under the ivrit.ai license, a modified CC-BY license permitting use for machine-learning model training while prohibiting deepfake generation and certain misuse scenarios.
Full terms:
https://www.ivrit.ai/en/license-faqs/

## Citation
If you use this dataset, cite:

> Marmor, Yanir; Lifshitz, Yair; Snapir, Yoad; Misgav, Kinneret (2025). *Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing*. Interspeech 2025.

