Remove pandas dependency from meldataset.py #362

Open
netlinux-ai wants to merge 1 commit into yl4579:main from netlinux-ai:chore/drop-pandas

@netlinux-ai

Problem

meldataset.py imports pandas and uses it for one operation: filtering data_list by speaker_id when sampling a reference clip. Pandas is otherwise unused.

This adds a heavyweight dependency for a single filter call. It also pulls in pandas's transitive pyarrow dependency, which is the source of compatibility friction:

  • pyarrow's published wheels assume an x86-64-v2 baseline CPU (SSE4.1 instructions in static initialisers); older CPUs cannot load them.
  • pyarrow does not always keep pace with new Python releases; its wheels sometimes lag behind by a release.
  • pip install weight goes up by ~80 MB for one filter line.

Fix

Replace the single pandas usage with a list comprehension + random.choice:

# before
ref_data = (self.df[self.df[2] == str(speaker_id)]).sample(n=1).iloc[0].tolist()

# after
matching = [r for r in self.data_list if r[2] == str(speaker_id)]
ref_data = random.choice(matching)

This drops both pandas and pyarrow from the dependency graph. The now-unused self.df member and the import pandas as pd line are removed as well.
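
For reviewers who want to try the new sampling path in isolation, here is a minimal standalone sketch. The `sample_reference` helper and the toy rows are hypothetical and not part of this PR; it assumes each `data_list` row is a `[path, text, speaker_id]` list with the speaker id stored as a string, which is what the `r[2] == str(speaker_id)` comparison implies. The explicit empty-match check is also an addition for illustration: `random.choice` raises `IndexError` on an empty list, so a speaker with no rows fails either way.

```python
import random


def sample_reference(data_list, speaker_id, rng=random):
    """Return one row chosen uniformly at random among rows for speaker_id.

    Hypothetical helper for illustration; the PR inlines these two steps.
    """
    # Same filter as the PR: column 2 holds the speaker id as a string.
    matching = [r for r in data_list if r[2] == str(speaker_id)]
    if not matching:
        # Illustrative guard only; the PR relies on random.choice failing.
        raise ValueError(f"no rows for speaker_id={speaker_id!r}")
    return rng.choice(matching)


# Toy rows in the assumed [path, text, speaker_id] layout.
data_list = [
    ["a.wav", "hello", "0"],
    ["b.wav", "world", "1"],
    ["c.wav", "again", "0"],
]

# A seeded Random instance keeps the draw reproducible in a test.
ref = sample_reference(data_list, 0, rng=random.Random(0))
```

Note that `random.choice` returns the row object stored in `data_list` rather than a copy, whereas `.iloc[0].tolist()` built a new list, so callers should not mutate the returned row in place.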

Tested with

  • Full fine-tune training run on PyTorch 2.7.0
  • Reference-clip sampling behaviour confirmed identical: same uniform-random selection within speaker
  • ~80 MB dep removal verified
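
One way to check the uniform-selection claim is to draw many times with a seeded generator and compare per-clip counts. The sketch below is an assumption about how such a check could look, not the test that was actually run; `draw_counts` and the toy rows are hypothetical.

```python
import random
from collections import Counter


def draw_counts(data_list, speaker_id, n_draws, seed=0):
    """Count how often each clip path is picked for one speaker."""
    rng = random.Random(seed)  # seeded so the check is repeatable
    matching = [r for r in data_list if r[2] == str(speaker_id)]
    # Tally by path (column 0 in the assumed row layout).
    return Counter(rng.choice(matching)[0] for _ in range(n_draws))


# Toy rows in the assumed [path, text, speaker_id] layout.
data_list = [
    ["a.wav", "hello", "0"],
    ["b.wav", "world", "1"],
    ["c.wav", "again", "0"],
    ["d.wav", "more", "0"],
]

counts = draw_counts(data_list, 0, n_draws=30_000)
expected = 30_000 / 3  # three clips belong to speaker "0"
# Largest relative deviation from a perfectly uniform split.
max_rel_dev = max(abs(c - expected) / expected for c in counts.values())
```

With three matching clips and 30,000 draws, each count should land close to 10,000; a small `max_rel_dev` is consistent with uniform selection within the speaker, and clips from other speakers should never appear in `counts`.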

The only use of pandas was a single speaker_id filter when sampling a
reference clip; replaced with a list comprehension and random.choice.
Drops pandas (and its transitive pyarrow dep) from the dependency graph.