Add earnings config #130

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Add earnings config #130

wants to merge 2 commits into from

Conversation

nithinraok (Collaborator)

Add Earnings21/22 Dataset Processing Pipeline with Forced Alignment

Overview

This PR introduces a complete 7-step processing pipeline for converting Earnings21 and Earnings22 datasets to NeMo format with advanced forced alignment capabilities. The pipeline supports both full dataset processing and evaluation subsets with optional speaker segmentation.

High-Level Changelog

New Features

Core Pipeline Processors:

  • CreateInitialAudioAndManifest: Initial audio manifest creation with automatic audio conversion (MP3 → WAV, multi-channel → mono, any sample rate → 16kHz)
  • CreateFullAudioManifestEarnings21: Ground truth text reconstruction from NLP token files with punctuation/capitalization preservation
  • NeMoForcedAligner: Word-level forced alignment using NeMo ASR models with CTC heads
  • CreateSentenceSegmentedManifest: Intelligent sentence-level segmentation based on CTM files with punctuation-aware splitting
  • SpeakerSegmentedManifest: Speaker-change detection and segmentation with optional metadata mapping
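The punctuation-aware splitting done by CreateSentenceSegmentedManifest can be illustrated with a simplified, self-contained sketch. This is not the processor's actual code; the function name and the terminal-punctuation heuristic are illustrative assumptions:

```python
# Simplified sketch of punctuation-aware sentence segmentation from
# word-level alignments (illustrative only; not the actual processor code).
SENTENCE_END = (".", "!", "?")

def segment_sentences(words):
    """Group word alignments into sentence-level segments.

    ``words`` is a list of dicts like {"word": "This", "start": 45.3, "end": 45.6};
    a sentence ends whenever a word ends with terminal punctuation.
    """
    segments, current = [], []

    def flush():
        segments.append({
            "text": " ".join(w["word"] for w in current),
            "offset": current[0]["start"],
            "duration": round(current[-1]["end"] - current[0]["start"], 3),
            "alignment": list(current),
        })
        current.clear()

    for w in words:
        current.append(w)
        if w["word"].endswith(SENTENCE_END):
            flush()
    if current:  # flush a trailing segment without terminal punctuation
        flush()
    return segments
```

Each emitted segment carries the same fields as the sentence-level manifest entries shown below in this PR (text, offset, duration, alignment).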

Dataset Support:

  • Earnings21 support (full dataset + eval10 subset)
  • Earnings22 support
  • Dual NLP file location handling for flexible dataset structures
  • Speaker metadata CSV integration for name mapping

Audio Processing:

  • Automatic audio format conversion (MP3/WAV → 16kHz mono WAV)
  • Accurate duration calculation from audio files
  • Batch processing with configurable test mode
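The audio normalization described above maps directly onto standard ffmpeg flags (`-ac 1` for mono, `-ar 16000` for a 16 kHz sample rate). A minimal sketch that builds such a command; the helper name is illustrative and the pipeline's own conversion code may differ:

```python
from pathlib import Path

def build_ffmpeg_cmd(src: Path, dst: Path) -> list:
    """Return an ffmpeg command converting any input to 16 kHz mono WAV."""
    return [
        "ffmpeg", "-y",   # overwrite the output file if it exists
        "-i", str(src),   # input: MP3 or WAV, any channel count / sample rate
        "-ac", "1",       # downmix to mono
        "-ar", "16000",   # resample to 16 kHz
        str(dst),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to perform the conversion.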

Pipeline Configuration

7-Step Processing Workflow:

  1. Initial Audio Manifest → Full audio files with duration
  2. Text Population → Add ground truth transcripts from NLP files
  3. Text Cleaning → Remove artifacts, brackets, special characters
  4. Forced Alignment → Generate word-level CTM files with timestamps
  5. Sentence Segmentation → Create sentence-level segments from CTM data
  6. Speaker Segmentation → Create speaker-level segments (optional)
  7. Field Filtering → Keep only required manifest fields

Key Configuration Options:

  • dataset_type: "earnings21" | "earnings22"
  • subset: "full" | "eval10" (earnings21 only)
  • forced_alignment_model: Configurable NeMo ASR model
  • preserve_punctuation / preserve_capitalization: Text processing options
  • include_speaker_info / include_tags: Optional metadata inclusion
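Combining these options, a config override block might look like the following hypothetical fragment. The key names come from the list above, but the nesting and defaults are assumptions, not the actual config.yaml from this PR:

```yaml
# Hypothetical fragment; the real dataset_configs/english/earnings21/config.yaml may differ.
dataset_type: earnings21          # or "earnings22"
subset: eval10                    # "full" | "eval10" (earnings21 only)
forced_alignment_model: nvidia/parakeet-tdt_ctc-1.1b
preserve_punctuation: true
preserve_capitalization: true
include_speaker_info: true
include_tags: false
```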

Output Formats

Sentence-Level Segments (Primary Output):

{
  "audio_filepath": "/path/to/audio.wav",
  "duration": 15.2,
  "offset": 45.3,
  "text": "This is a complete sentence with proper punctuation.",
  "alignment": [
    {"word": "This", "start": 45.3, "end": 45.6},
    {"word": "is", "start": 45.6, "end": 45.8}
  ]
}

Speaker-Level Segments (Optional):

{
  "audio_filepath": "/path/to/audio.wav", 
  "duration": 0,
  "text": "Speaker segment text...",
  "speaker": "speaker_1",
  "segment_id": 0
}
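The speaker-change detection behind these segments can be sketched as merging consecutive tokens that share a speaker label. This is a simplified illustration with an assumed `(speaker, text)` input shape, not the processor's real implementation:

```python
def group_by_speaker_change(tokens):
    """Merge consecutive tokens sharing a speaker label into segments.

    ``tokens`` is a list of (speaker, text) pairs in temporal order;
    a new segment starts whenever the speaker label changes.
    """
    segments = []
    for speaker, text in tokens:
        if segments and segments[-1]["speaker"] == speaker:
            segments[-1]["text"] += " " + text  # same speaker: extend segment
        else:
            segments.append({             # speaker changed: open a new segment
                "speaker": speaker,
                "text": text,
                "segment_id": len(segments),
            })
    return segments
```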

Usage Examples

# Process Earnings21 full dataset
python main.py --config-path=dataset_configs/english/earnings21 --config-name=config \
  dataset_type=earnings21 \
  dataset_root=/path/to/earnings21 \
  output_directory=/path/to/output

# Process Earnings22 with custom model
python main.py --config-path=dataset_configs/english/earnings21 --config-name=config \
  dataset_type=earnings22 \
  forced_alignment_model=nvidia/parakeet-tdt_ctc-1.1b \
  dataset_root=/path/to/earnings22 \
  output_directory=/path/to/output

Signed-off-by: Nithin Rao Koluguri <nithinrao.koluguri@gmail.com>
@lilithgrigoryan (Collaborator) left a comment:

Hi @nithinraok. Thanks!

Overall this looks good to me. I left some comments about the docs and docstrings that need to be fixed, and I will review the code once more. Also, please consider adding end-to-end tests.

@@ -0,0 +1,24 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
2025

@@ -0,0 +1,90 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
2025

@@ -0,0 +1,1010 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
2025


logger.info(f"Loaded {len(self.file_ids)} file IDs for {self.dataset_type} subset {self.subset}.")

def _convert_audio_if_needed(self, audio_file: Path, file_id: str) -> Path:
We already have a dedicated processor for this functionality: see FfmpegConvert.

Let's reuse this processor to avoid duplicating functionality.

@@ -0,0 +1,197 @@
# Configuration for processing Earnings21/22 datasets to NeMo format
For proper docs building we need documentation in specific format. You can refer to this config.

You need to have documentation section with dataset name. Without this section, the dataset will not appear in the generated docs.

Also, please, consider moving any file-level comments (i.e. "Expected output from this 5-step pipeline:..") into the documentation section to ensure they are captured by the doc builder.

You can test docs locally by

cd docs
make clean
make html SPHINXOPTS="-b linkcheck -W --keep-going -n"

then opening docs/html/index.html in your browser to verify. There must be a single warning: 404 Client Error: Not Found for url: https://github.com/NVIDIA/.../dataset_configs/english/earnings21/config.yaml



# Step 1: Create Initial Audio and Manifest (Full Audio)
class CreateInitialAudioAndManifest(BaseParallelProcessor):
@lilithgrigoryan (Jun 18, 2025)
Please add docstrings in the following format. Examples may be skipped if not necessary.

class DuplicateFields(BaseParallelProcessor):
    """This processor duplicates fields in all manifest entries.

    It is useful for when you want to do downstream processing of a variant
    of the entry. E.g. make a copy of "text" called "text_no_pc", and
    remove punctuation from "text_no_pc" in downstream processors.

    Args:
        duplicate_fields (dict): dictionary where keys are the original
            fields to be copied and their values are the new names of
            the duplicate fields.

    Returns:
        The same data as in the input manifest with duplicated fields
        as specified in the ``duplicate_fields`` input dictionary.

    Example:
        .. code-block:: yaml

            - _target_: sdp.processors.modify_manifest.common.DuplicateFields
              input_manifest_file: ${workspace_dir}/test1.json
              output_manifest_file: ${workspace_dir}/test2.json
              duplicate_fields: {"text":"answer"}
    """

Also, please update docs/src/sdp/api.rst with this processors.
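For reference, an api.rst entry for the new processors might look like the following sketch. The module paths, directive options, and section heading are assumptions, not taken from this PR or the repository:

```rst
Earnings21/22 processors
------------------------

.. autoclass:: sdp.processors.CreateInitialAudioAndManifest
   :show-inheritance:
   :member-order: bysource

.. autoclass:: sdp.processors.CreateFullAudioManifestEarnings21
   :show-inheritance:
   :member-order: bysource
```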


# Step 2: Populate Full Text for Manifest
class CreateFullAudioManifestEarnings21(BaseParallelProcessor):
"""
@lilithgrigoryan (Jun 18, 2025)

Same here. Please add proper docstrings and update api.rst.


# Step 3: Create Speaker-level Segmented Manifest (renamed from CreateFinalSegmentedManifest)
class SpeakerSegmentedManifest(BaseParallelProcessor):
"""
also, docstrings)


class NeMoForcedAligner(BaseProcessor):
"""
Step 4: Apply NeMo Forced Aligner to get word-level timestamps.
This one too

@@ -21,6 +21,9 @@ RUN apt-get update \
# Update pip
RUN pip install --upgrade pip

#install typing-ext manually
RUN pip install typing-extensions
@lilithgrigoryan (Jun 18, 2025)
@Jorjeous Let’s move this change to a separate PR.
Also, please add an informative comment explaining why this needs to be installed manually, if the reasoning is clear.

Signed-off-by: Nithin Rao Koluguri <nithinrao.koluguri@gmail.com>
@nithinraok (Collaborator, Author)

Updated based on comments. @lilithgrigoryan, please have a look.
