Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Label.from_category is O(num_classes) complexity, for every single sample #6368

Closed
NicolasHug opened this issue Aug 4, 2022 · 6 comments
Closed
Assignees

Comments

@NicolasHug
Copy link
Member

NicolasHug commented Aug 4, 2022

This function is called when loading every single sample, but its complexity is O(num_classes) because of the .index() call, which is probably overkill. A simple dict lookup should be enough and much faster here.

@classmethod
def from_category(
cls,
category: str,
*,
categories: Sequence[str],
**kwargs: Any,
) -> Label:
return cls(categories.index(category), categories=categories, **kwargs)

CC @pmeier

@pmeier
Copy link
Collaborator

pmeier commented Aug 10, 2022

I agree that there is a theoretical benefit, but does this an impact on a practical case? Benchmarking it on my machine gives me roughly 3 ns per category in categories. Meaning even for ImageNet (the dataset with the most categories we currently have on record) it takes roughly 3 µs for this call:

import random
import timeit

num_categories = 1_000  # maximum that we currently have
num_runs = 1_000_000

categories = list(range(num_categories))

total = timeit.timeit(lambda: categories.index(random.randint(0, num_categories - 1)), number=num_runs)

print(f"Looking up the category by index with {num_categories} total categories took {total / num_runs * 1e6:.1f} µs")

Your proposal means we should switch catgories attribute to a dictionary with the categories as keys and the index as value, correct? Do you want to keep the functionality to create a label with a categories sequence?

@pmeier pmeier self-assigned this Aug 10, 2022
@NicolasHug
Copy link
Member Author

NicolasHug commented Aug 10, 2022

Categories are strings, not integers. String comparison will take much longer. I did observe a significant difference last time I benchmarked this

@pmeier
Copy link
Collaborator

pmeier commented Aug 10, 2022

Makes no significant difference for me:

import random
from time import perf_counter_ns
import string

num_categories = 1_000  # maximum that we currently have
num_runs = 1_000_000

categories = ["".join(random.choices(string.ascii_lowercase, k=20)) for _ in range(num_categories)]

timings = []

for _ in range(num_runs):
    category = random.choice(categories)

    tic = perf_counter_ns()
    categories.index(category)
    tac = perf_counter_ns()

    timings.append((tac - tic) * 1e-9)

print(
    f"Looking up the category by index with {num_categories} total categories took "
    f"{sum(timings) / num_runs * 1e6:.1f} µs"
)

With string of 20 characters lookup is now linear with 4 ns, i.e. 4 µs for ImageNet.

I did observe a significant difference last time I benchmarked this

Could you share your benchmark?

@NicolasHug
Copy link
Member Author

import random
import string

str_len = 20
num_categories = 1000
batch_size = 512

categories = [''.join(random.choices(string.ascii_lowercase, k=str_len)) for _ in range(num_categories)]
mapping = {cat:i for (i, cat) in enumerate(categories)}
batch_targets = [random.choice(categories) for _ in range(batch_size)]
%%timeit
[categories.index(cat) for cat in batch_targets]

2.59 ms ± 62.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
[mapping[cat] for cat in batch_targets]

14 µs ± 318 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

A few ms per batch is always good to take, especially considering how simple the solution is, and when we're trying to exhibit benefits of using datapipes over our current datasets. Data-loading speed will be the main factor for adoption.

Do you want to keep the functionality to create a label with a categories sequence?

Sorry, not sure what you mean here?

@pmeier
Copy link
Collaborator

pmeier commented Aug 10, 2022

Right now we create the Label with a sequence of categories:

categories: Optional[Sequence[str]] = None,

categories: Sequence[str],

Do we want to keep this and only internally use dictionaries or should we switch to dictionaries everywhere? That would also mean that we need to touch the info dictionary of each dataset that provides categories:

NAME = "fer2013"
@register_info(NAME)
def _info() -> Dict[str, Any]:
return dict(categories=("angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"))

@NicolasHug
Copy link
Member Author

I don't know yet, honestly. But I'm open to removing this categories logic altogether. We're still in prototype stage and our main focus is loading speed right now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants