Add earnings config #130

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Add earnings config #130

wants to merge 2 commits into from

Conversation

nithinraok (Collaborator)

Add Earnings21/22 Dataset Processing Pipeline with Forced Alignment

Overview

This PR introduces a complete 7-step processing pipeline for converting Earnings21 and Earnings22 datasets to NeMo format with advanced forced alignment capabilities. The pipeline supports both full dataset processing and evaluation subsets with optional speaker segmentation.

High-Level Changelog

New Features

Core Pipeline Processors:

  • CreateInitialAudioAndManifest: Initial audio manifest creation with automatic audio conversion (MP3 → WAV, multi-channel → mono, any sample rate → 16kHz)
  • CreateFullAudioManifestEarnings21: Ground truth text reconstruction from NLP token files with punctuation/capitalization preservation
  • NeMoForcedAligner: Word-level forced alignment using NeMo ASR models with CTC heads
  • CreateSentenceSegmentedManifest: Intelligent sentence-level segmentation based on CTM files with punctuation-aware splitting
  • SpeakerSegmentedManifest: Speaker-change detection and segmentation with optional metadata mapping
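The punctuation-aware splitting done by CreateSentenceSegmentedManifest can be illustrated with a simplified, self-contained sketch. This is not the processor's actual code; the function name and the terminal-punctuation heuristic are illustrative assumptions:

```python
# Simplified sketch of punctuation-aware sentence segmentation from
# word-level alignments (illustrative only; not the actual processor code).
SENTENCE_END = (".", "!", "?")

def segment_sentences(words):
    """Group word alignments into sentence-level segments.

    ``words`` is a list of dicts like {"word": "This", "start": 45.3, "end": 45.6};
    a sentence ends whenever a word ends with terminal punctuation.
    """
    segments, current = [], []

    def flush():
        segments.append({
            "text": " ".join(w["word"] for w in current),
            "offset": current[0]["start"],
            "duration": round(current[-1]["end"] - current[0]["start"], 3),
            "alignment": list(current),
        })
        current.clear()

    for w in words:
        current.append(w)
        if w["word"].endswith(SENTENCE_END):
            flush()
    if current:  # flush a trailing segment without terminal punctuation
        flush()
    return segments
```

Each emitted segment carries the same fields as the sentence-level manifest entries shown below in this PR (text, offset, duration, alignment).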

Dataset Support:

  • Earnings21 support (full dataset + eval10 subset)
  • Earnings22 support
  • Dual NLP file location handling for flexible dataset structures
  • Speaker metadata CSV integration for name mapping

Audio Processing:

  • Automatic audio format conversion (MP3/WAV → 16kHz mono WAV)
  • Accurate duration calculation from audio files
  • Batch processing with configurable test mode
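The audio normalization described above maps directly onto standard ffmpeg flags (`-ac 1` for mono, `-ar 16000` for a 16 kHz sample rate). A minimal sketch that builds such a command; the helper name is illustrative and the pipeline's own conversion code may differ:

```python
from pathlib import Path

def build_ffmpeg_cmd(src: Path, dst: Path) -> list:
    """Return an ffmpeg command converting any input to 16 kHz mono WAV."""
    return [
        "ffmpeg", "-y",   # overwrite the output file if it exists
        "-i", str(src),   # input: MP3 or WAV, any channel count / sample rate
        "-ac", "1",       # downmix to mono
        "-ar", "16000",   # resample to 16 kHz
        str(dst),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to perform the conversion.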

Pipeline Configuration

7-Step Processing Workflow:

  1. Initial Audio Manifest → Full audio files with duration
  2. Text Population → Add ground truth transcripts from NLP files
  3. Text Cleaning → Remove artifacts, brackets, special characters
  4. Forced Alignment → Generate word-level CTM files with timestamps
  5. Sentence Segmentation → Create sentence-level segments from CTM data
  6. Speaker Segmentation → Create speaker-level segments (optional)
  7. Field Filtering → Keep only required manifest fields

Key Configuration Options:

  • dataset_type: "earnings21" | "earnings22"
  • subset: "full" | "eval10" (earnings21 only)
  • forced_alignment_model: Configurable NeMo ASR model
  • preserve_punctuation / preserve_capitalization: Text processing options
  • include_speaker_info / include_tags: Optional metadata inclusion
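Combining these options, a config override block might look like the following hypothetical fragment. The key names come from the list above, but the nesting and defaults are assumptions, not the actual config.yaml from this PR:

```yaml
# Hypothetical fragment; the real dataset_configs/english/earnings21/config.yaml may differ.
dataset_type: earnings21          # or "earnings22"
subset: eval10                    # "full" | "eval10" (earnings21 only)
forced_alignment_model: nvidia/parakeet-tdt_ctc-1.1b
preserve_punctuation: true
preserve_capitalization: true
include_speaker_info: true
include_tags: false
```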

Output Formats

Sentence-Level Segments (Primary Output):

{
  "audio_filepath": "/path/to/audio.wav",
  "duration": 15.2,
  "offset": 45.3,
  "text": "This is a complete sentence with proper punctuation.",
  "alignment": [
    {"word": "This", "start": 45.3, "end": 45.6},
    {"word": "is", "start": 45.6, "end": 45.8}
  ]
}

Speaker-Level Segments (Optional):

{
  "audio_filepath": "/path/to/audio.wav", 
  "duration": 0,
  "text": "Speaker segment text...",
  "speaker": "speaker_1",
  "segment_id": 0
}
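The speaker-change detection behind these segments can be sketched as merging consecutive tokens that share a speaker label. This is a simplified illustration with an assumed `(speaker, text)` input shape, not the processor's real implementation:

```python
def group_by_speaker_change(tokens):
    """Merge consecutive tokens sharing a speaker label into segments.

    ``tokens`` is a list of (speaker, text) pairs in temporal order;
    a new segment starts whenever the speaker label changes.
    """
    segments = []
    for speaker, text in tokens:
        if segments and segments[-1]["speaker"] == speaker:
            segments[-1]["text"] += " " + text  # same speaker: extend segment
        else:
            segments.append({             # speaker changed: open a new segment
                "speaker": speaker,
                "text": text,
                "segment_id": len(segments),
            })
    return segments
```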

Usage Examples

# Process Earnings21 full dataset
python main.py --config-path=dataset_configs/english/earnings21 --config-name=config \
  dataset_type=earnings21 \
  dataset_root=/path/to/earnings21 \
  output_directory=/path/to/output

# Process Earnings22 with custom model
python main.py --config-path=dataset_configs/english/earnings21 --config-name=config \
  dataset_type=earnings22 \
  forced_alignment_model=nvidia/parakeet-tdt_ctc-1.1b \
  dataset_root=/path/to/earnings22 \
  output_directory=/path/to/output

Signed-off-by: Nithin Rao Koluguri <nithinrao.koluguri@gmail.com>
@lilithgrigoryan (Collaborator) left a comment:

Hi @nithinraok. Thanks!

Overall this looks good to me. I left some comments about the docs and docstrings that need to be fixed, and I will review the code once more. Also, please consider adding end-to-end tests.

@@ -0,0 +1,24 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
2025

@@ -0,0 +1,90 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
2025

@@ -0,0 +1,1010 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
2025


logger.info(f"Loaded {len(self.file_ids)} file IDs for {self.dataset_type} subset {self.subset}.")

def _convert_audio_if_needed(self, audio_file: Path, file_id: str) -> Path:
We already have a dedicated processor for this functionality: see FfmpegConvert.

Let's reuse this processor to avoid duplicating functionality.

@@ -0,0 +1,197 @@
# Configuration for processing Earnings21/22 datasets to NeMo format
For proper docs building we need documentation in specific format. You can refer to this config.

You need to have documentation section with dataset name. Without this section, the dataset will not appear in the generated docs.

Also, please, consider moving any file-level comments (i.e. "Expected output from this 5-step pipeline:..") into the documentation section to ensure they are captured by the doc builder.

You can test docs locally by

cd docs
make clean
make html SPHINXOPTS="-b linkcheck -W --keep-going -n"

then opening docs/html/index.html in your browser to verify. There must be a single warning: 404 Client Error: Not Found for url: https://github.com/NVIDIA/.../dataset_configs/english/earnings21/config.yaml



# Step 1: Create Initial Audio and Manifest (Full Audio)
class CreateInitialAudioAndManifest(BaseParallelProcessor):
@lilithgrigoryan (Jun 18, 2025)
Please add docstrings in the following format. Examples may be skipped if not necessary.

class DuplicateFields(BaseParallelProcessor):
    """This processor duplicates fields in all manifest entries.

    It is useful for when you want to do downstream processing of a variant
    of the entry. E.g. make a copy of "text" called "text_no_pc", and
    remove punctuation from "text_no_pc" in downstream processors.

    Args:
        duplicate_fields (dict): dictionary where keys are the original
            fields to be copied and their values are the new names of
            the duplicate fields.

    Returns:
        The same data as in the input manifest with duplicated fields
        as specified in the ``duplicate_fields`` input dictionary.

    Example:
        .. code-block:: yaml

            - _target_: sdp.processors.modify_manifest.common.DuplicateFields
              input_manifest_file: ${workspace_dir}/test1.json
              output_manifest_file: ${workspace_dir}/test2.json
              duplicate_fields: {"text":"answer"}
    """

Also, please update docs/src/sdp/api.rst with this processors.
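For reference, an api.rst entry for the new processors might look like the following sketch. The module paths, directive options, and section heading are assumptions, not taken from this PR or the repository:

```rst
Earnings21/22 processors
------------------------

.. autoclass:: sdp.processors.CreateInitialAudioAndManifest
   :show-inheritance:
   :member-order: bysource

.. autoclass:: sdp.processors.CreateFullAudioManifestEarnings21
   :show-inheritance:
   :member-order: bysource
```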


# Step 2: Populate Full Text for Manifest
class CreateFullAudioManifestEarnings21(BaseParallelProcessor):
"""
@lilithgrigoryan (Jun 18, 2025)

Same here. Please add proper docstrings and update api.rst.


# Step 3: Create Speaker-level Segmented Manifest (renamed from CreateFinalSegmentedManifest)
class SpeakerSegmentedManifest(BaseParallelProcessor):
"""
also, docstrings)


class NeMoForcedAligner(BaseProcessor):
"""
Step 4: Apply NeMo Forced Aligner to get word-level timestamps.
This one too

@@ -21,6 +21,9 @@ RUN apt-get update \
# Update pip
RUN pip install --upgrade pip

#install typing-ext manually
RUN pip install typing-extensions
@lilithgrigoryan (Jun 18, 2025)
@Jorjeous Let’s move this change to a separate PR.
Also, please add an informative comment explaining why this needs to be installed manually, if the reasoning is clear.

Signed-off-by: Nithin Rao Koluguri <nithinrao.koluguri@gmail.com>
@nithinraok (Collaborator, Author)

Updated based on comments. @lilithgrigoryan, please have a look.
