# 📘 GET TO KNOW A DATASET: ivrit-ai Crowd-Transcribe v5


## 🧩 1. Introduction

This notebook introduces the ivrit-ai Crowd-Transcribe v5 dataset, a high-quality, volunteer-validated Hebrew speech transcription dataset.
It contains:

* Raw audio
* Machine-generated original sentences
* Human-corrected sentences
* Metadata including segment duration, quality flags, and transcription worker IDs

The dataset is stored as Parquet files on [Hugging Face](https://huggingface.co/datasets/ivrit-ai/crowd-transcribe-v5).

The goal of this notebook is to help users:
1. Explore and understand the dataset structure
2. Inspect transcription quality
3. Play audio segments
4. Compute statistics (durations, worker activity, error patterns)

In [None]:
DATA_ROOT = "/path/to/ivrit-ai-audio-v2"


## 🔧 2. Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Audio, display
from datasets import load_dataset

import json
import ast

## 3. Load the Dataset

Load directly from Hugging Face:


In [None]:
ds = load_dataset("ivrit-ai/crowd-transcribe-v5")
ds


If the dataset has a single split, it loads as "train".
If it has multiple splits, adjust the split name accordingly.

Convert to a pandas DataFrame for convenience:

In [None]:
df = ds["train"].to_pandas()
df.head()


## 4. Inspect Schema

In [None]:
df.info()

Expected columns:

* uuid - formatted as `<source>/<episode name>/<segment #>`, for example "Tziun3/2021.06.03 פרק #207 – חזרתו של היוני/343"

* audio - an audio object

* orig_sentence - machine transcription

* sentence - human-corrected transcription

* is_retranscribe - True if re-labeled for quality control

* transcriber - numeric worker ID

* extra_data - a dict with metadata

## 5. Parse extra_data Into Columns

The extra_data field holds a dict of per-segment metadata. The function below also accepts its JSON string form.

Extract the key values into regular columns.

In [None]:
def extract_extra(row):
    # Accept either a dict or its JSON string form
    d = row
    if isinstance(row, str):
        d = json.loads(row)
    # Keep only the fields used later in this notebook
    return {
        "duration": d.get("duration"),
        "foreign_language": d.get("foreign_language"),
        "multiple_speakers": d.get("multiple_speakers"),
        "noisy": d.get("noisy"),
        "skipped": d.get("skipped"),
        "max_logprob": d.get("max_logprob"),
        "orig_text_adapted": d.get("orig_text"),
        "too_long": d.get("too_long"),
        "unintelligible": d.get("unintelligible")
    }

# Build the new columns in one step; this avoids the slow .apply(pd.Series) pattern
extra_df = pd.DataFrame(df["extra_data"].map(extract_extra).tolist(), index=df.index)
df = pd.concat([df, extra_df], axis=1)
df.head()
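
The extraction above assumes every extra_data value is a dict or a valid JSON string. If some rows turn out to be missing or malformed (an assumption, not something verified against the dataset), a more defensive coercion step could look like the sketch below. The helper name `to_dict` and the sample values are illustrative only.

```python
import json

# Hypothetical helper, not part of the dataset tooling: coerce one
# extra_data value into a plain dict, whatever form it arrives in.
def to_dict(value):
    if value is None:
        return {}
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return {}
        # A JSON string could decode to a list or a number; keep dicts only
        return parsed if isinstance(parsed, dict) else {}
    return dict(value)

# Illustrative values only; real rows may carry different keys.
samples = [
    {"duration": 4.2, "noisy": False},
    '{"duration": 7.5, "noisy": true}',
    None,
    "not json",
]
rows = [to_dict(s) for s in samples]
durations = [r.get("duration") for r in rows]
print(durations)
```

Rows that cannot be parsed end up as empty dicts, so every `.get(...)` call returns None for them instead of raising.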


## 6. Derive Source, Episode, and Segment ID

In [None]:
parts = df["uuid"].str.split("/", n=2, expand=True)
df["source"] = parts[0]
df["episode"] = parts[1]
df["segment_id"] = parts[2]
df.head()
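
Splitting from the left with `n=2` works as long as the episode name contains no "/". If one did, the extra pieces would end up in the segment ID. A minimal sketch of a split that anchors on both ends is shown below, assuming the source never contains a slash and the segment number is always a trailing integer. The helper `parse_uuid` and the sample identifiers are made up for illustration.

```python
# Hypothetical helper: split a uuid of the form
# <source>/<episode name>/<segment #> into its three parts.
def parse_uuid(uuid):
    # The source is assumed to have no slash, so take it from the left...
    source, rest = uuid.split("/", 1)
    # ...and the segment number comes last, so take it from the right.
    episode, segment = rest.rsplit("/", 1)
    return source, episode, int(segment)

# Made-up identifiers in the documented format.
print(parse_uuid("SourceA/Episode 12/343"))
print(parse_uuid("SourceA/Part 1/2 of the story/7"))
```

In pandas, the same idea can be expressed with `str.split("/", n=1)` followed by `str.rsplit("/", n=1)` on the remainder.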


## 7. Basic Dataset Statistics

In [None]:
# Number of Sources

df["source"].nunique()


In [None]:
# Segments Per Source

df_source_counts = df["source"].value_counts()
df_source_counts.head()

In [None]:
df_source_counts.head(20).plot(kind="bar", figsize=(12,4), title="Top 20 Sources by Segment Count")


## 8. Duration Statistics

In [None]:
# A notebook cell only shows its last expression, so display the summary explicitly
display(df["duration"].describe())

# total hours
total_hours = df["duration"].sum() / 3600
total_hours

In [None]:
df["duration"].hist(bins=50, figsize=(12,4))
plt.title("Segment Duration Distribution (sec)")
plt.xlabel("seconds")

## 9. Transcription Quality Comparison

In [None]:
%pip install jiwer


jiwer computes standard transcription similarity metrics:

* WER - Word Error Rate
* WIL - Word Information Lost
* CER - Character Error Rate
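
To make the definitions concrete: WER is the word-level edit distance (substitutions + deletions + insertions) between a hypothesis and a reference, divided by the number of words in the reference. CER is the same ratio computed over characters. The sketch below is a simplified pure-Python illustration, not jiwer's implementation; jiwer applies its own text processing, so its numbers may differ slightly. The function names and the English sample sentences are illustrative.

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over two sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def simple_wer(reference, hypothesis):
    # Word-level errors divided by the number of reference words
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def simple_cer(reference, hypothesis):
    # Character-level errors divided by the number of reference characters
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(simple_wer(reference, hypothesis))  # 2 errors over 6 reference words
print(simple_cer(reference, hypothesis))
```

Because the denominator is the reference length, swapping reference and hypothesis generally changes the result.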

In [None]:
from jiwer import cer, process_words

def compute_jiwer_metrics(orig, corrected):
    # jiwer expects (reference, hypothesis): the human-corrected sentence is
    # the reference and the machine transcription is the hypothesis.
    try:
        words = process_words(corrected, orig)
        return pd.Series({
            "wer": words.wer,
            "cer": cer(corrected, orig),
            "wil": words.wil
        })
    except Exception:
        # For example, a missing or empty sentence: leave the metrics undefined
        return pd.Series({
            "wer": np.nan,
            "cer": np.nan,
            "wil": np.nan
        })

metrics_df = df.apply(lambda r: compute_jiwer_metrics(r["orig_sentence"], r["sentence"]), axis=1)
df = pd.concat([df, metrics_df], axis=1)

df[["wer", "cer", "wil"]].head()
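
When summarizing the per-segment "wer" column, note that a plain mean weights a two-word segment as heavily as a twenty-word one. A corpus-level figure (total errors over total reference words) is often more representative. The sketch below uses made-up error counts to show how far the two can diverge.

```python
# Hypothetical per-segment counts: (word errors, words in the reference).
segments = [
    (1, 2),    # short segment, one error
    (0, 20),   # long segment, no errors
    (2, 18),   # long segment, two errors
]

# Mean of per-segment WER: every segment counts equally.
mean_wer = sum(e / n for e, n in segments) / len(segments)

# Corpus-level WER: total errors over total reference words.
corpus_wer = sum(e for e, _ in segments) / sum(n for _, n in segments)

print(round(mean_wer, 3), round(corpus_wer, 3))
```

With jiwer, passing lists of reference and hypothesis strings to `wer` in a single call should yield the corpus-level value rather than a mean of per-segment values.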


## License
This dataset is released under the ivrit.ai license, a modified CC-BY license permitting use for machine-learning model training while prohibiting deepfake generation and certain misuse scenarios.
Full terms:
https://www.ivrit.ai/en/license-faqs/

## Citation
If you use this dataset, cite:

> Marmor, Yanir; Lifshitz, Yair; Snapir, Yoad; Misgav, Kinneret (2025). *Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing*. Interspeech 2025.

