Add earnings config #130
base: main
Signed-off-by: Nithin Rao Koluguri <nithinrao.koluguri@gmail.com>
Hi @nithinraok. Thanks!
Overall looks good to me. I left some comments about the docs and docstrings that need to be fixed; after that I will review the code once more. Also, please consider adding end-to-end tests.
@@ -0,0 +1,24 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
2025
@@ -0,0 +1,90 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
2025
@@ -0,0 +1,1010 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
2025
logger.info(f"Loaded {len(self.file_ids)} file IDs for {self.dataset_type} subset {self.subset}.")

def _convert_audio_if_needed(self, audio_file: Path, file_id: str) -> Path:
We already have a dedicated processor for this functionality: see FfmpegConvert.
Let's reuse that processor to avoid duplicating functionality.
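For context, the conversion in question (MP3 to WAV, mono, 16 kHz) comes down to a single ffmpeg invocation. The sketch below only builds that command line; `build_ffmpeg_command` is a hypothetical helper written for this illustration, not the API of the existing processor, and it assumes `ffmpeg` is available on the PATH when the command is actually run.

```python
from pathlib import Path
from typing import List


def build_ffmpeg_command(
    src: Path, dst: Path, sample_rate: int = 16000, channels: int = 1
) -> List[str]:
    """Build an ffmpeg command that resamples and downmixes `src` into `dst`.

    Hypothetical helper for illustration only. The flags used are standard
    ffmpeg options: -y (overwrite output), -i (input file),
    -ac (audio channel count), -ar (audio sample rate in Hz).
    """
    return [
        "ffmpeg",
        "-y",
        "-i", str(src),
        "-ac", str(channels),
        "-ar", str(sample_rate),
        str(dst),
    ]


cmd = build_ffmpeg_command(Path("call.mp3"), Path("call.wav"))
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Keeping this logic in one shared processor means the sample rate and channel count are configured in one place.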
@@ -0,0 +1,197 @@
# Configuration for processing Earnings21/22 datasets to NeMo format
For the docs to build properly, the documentation must follow a specific format. You can refer to this config.
The config needs a documentation section with the dataset name. Without this section, the dataset will not appear in the generated docs.
Also, please consider moving any file-level comments (e.g. "Expected output from this 5-step pipeline:..") into the documentation section so that the doc builder captures them.
You can test the docs locally by running
    cd docs
    make clean
    make html SPHINXOPTS="-b linkcheck -W --keep-going -n"
and then opening docs/html/index.html in your browser to verify. There should be a single warning: 404 Client Error: Not Found for url: https://github.com/NVIDIA/.../dataset_configs/english/earnings21/config.yaml
# Step 1: Create Initial Audio and Manifest (Full Audio)
class CreateInitialAudioAndManifest(BaseParallelProcessor):
Please add docstrings in the following format. Examples may be skipped if not necessary.
class DuplicateFields(BaseParallelProcessor):
    """This processor duplicates fields in all manifest entries.

    It is useful for when you want to do downstream processing of a variant
    of the entry. E.g. make a copy of "text" called "text_no_pc", and
    remove punctuation from "text_no_pc" in downstream processors.

    Args:
        duplicate_fields (dict): dictionary where keys are the original
            fields to be copied and their values are the new names of
            the duplicate fields.

    Returns:
        The same data as in the input manifest with duplicated fields
        as specified in the ``duplicate_fields`` input dictionary.

    Example:
        .. code-block:: yaml

            - _target_: sdp.processors.modify_manifest.common.DuplicateFields
              input_manifest_file: ${workspace_dir}/test1.json
              output_manifest_file: ${workspace_dir}/test2.json
              duplicate_fields: {"text":"answer"}
    """
Also, please update docs/src/sdp/api.rst with these processors.
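To make the behavior described in that example docstring concrete, here is a minimal standalone sketch of duplicating fields in one manifest entry. The function `duplicate_fields` and the sample entry are assumptions for illustration; this is not the actual `DuplicateFields` implementation.

```python
from typing import Dict


def duplicate_fields(entry: Dict[str, object], mapping: Dict[str, str]) -> Dict[str, object]:
    """Return a copy of `entry` with each source field copied under a new name.

    `mapping` maps an existing field name to the name of its duplicate,
    mirroring the ``duplicate_fields`` argument in the docstring example.
    Illustrative stand-in only, not the SDP processor itself.
    """
    out = dict(entry)  # shallow copy so the input entry is left untouched
    for src, dst in mapping.items():
        out[dst] = entry[src]
    return out


entry = {"audio_filepath": "a.wav", "text": "Hello, world."}
result = duplicate_fields(entry, {"text": "answer"})
print(result)
```

A docstring written in the requested format should let a reader predict this input-to-output behavior without opening the source.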
# Step 2: Populate Full Text for Manifest
class CreateFullAudioManifestEarnings21(BaseParallelProcessor):
    """
Same here. Please, add proper docstrings and update api.rst
# Step 3: Create Speaker-level Segmented Manifest (renamed from CreateFinalSegmentedManifest)
class SpeakerSegmentedManifest(BaseParallelProcessor):
    """
also, docstrings)
class NeMoForcedAligner(BaseProcessor):
    """
    Step 4: Apply NeMo Forced Aligner to get word-level timestamps.
This one too
docker/Dockerfile
Outdated
@@ -21,6 +21,9 @@ RUN apt-get update \
# Update pip
RUN pip install --upgrade pip

#install typing-ext manually
RUN pip install typing-extensions
@Jorjeous Let’s move this change to a separate PR.
Also, please add an informative comment explaining why this needs to be installed manually, if the reasoning is clear.
Signed-off-by: Nithin Rao Koluguri <nithinrao.koluguri@gmail.com>
Updated based on comments. @lilithgrigoryan pls have a look
Add Earnings21/22 Dataset Processing Pipeline with Forced Alignment
Overview
This PR introduces a complete 7-step processing pipeline for converting Earnings21 and Earnings22 datasets to NeMo format with advanced forced alignment capabilities. The pipeline supports both full dataset processing and evaluation subsets with optional speaker segmentation.
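As a rough illustration of the optional speaker segmentation mentioned above, the sketch below merges consecutive tokens that share a speaker into one segment and starts a new segment at every speaker change. The token layout (dicts with `token` and `speaker` keys) and the helper name `segment_by_speaker` are assumptions made for this example; they are not taken from the processors in this PR.

```python
from itertools import groupby
from typing import Dict, List


def segment_by_speaker(tokens: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Group consecutive tokens with the same speaker into text segments.

    Hypothetical helper: `tokens` is an ordered list of dicts with
    "token" and "speaker" keys. groupby only merges adjacent items,
    so a speaker who returns later gets a separate segment.
    """
    segments = []
    for speaker, group in groupby(tokens, key=lambda t: t["speaker"]):
        text = " ".join(t["token"] for t in group)
        segments.append({"speaker": speaker, "text": text})
    return segments


tokens = [
    {"token": "Good", "speaker": "1"},
    {"token": "morning.", "speaker": "1"},
    {"token": "Thanks.", "speaker": "2"},
    {"token": "Welcome.", "speaker": "1"},
]
segments = segment_by_speaker(tokens)
print(segments)
```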
High-Level Changelog
New Features
Core Pipeline Processors:
CreateInitialAudioAndManifest: Initial audio manifest creation with automatic audio conversion (MP3 → WAV, multi-channel → mono, any sample rate → 16kHz)
CreateFullAudioManifestEarnings21: Ground truth text reconstruction from NLP token files with punctuation/capitalization preservation
NeMoForcedAligner: Word-level forced alignment using NeMo ASR models with CTC heads
CreateSentenceSegmentedManifest: Intelligent sentence-level segmentation based on CTM files with punctuation-aware splitting
SpeakerSegmentedManifest: Speaker-change detection and segmentation with optional metadata mapping
Dataset Support:
Audio Processing:
Pipeline Configuration
7-Step Processing Workflow:
Key Configuration Options:
dataset_type: "earnings21" | "earnings22"
subset: "full" | "eval10" (earnings21 only)
forced_alignment_model: Configurable NeMo ASR model
preserve_punctuation / preserve_capitalization: Text processing options
include_speaker_info / include_tags: Optional metadata inclusion
Output Formats
Sentence-Level Segments (Primary Output):
Speaker-Level Segments (Optional):
Usage Examples