`Label.from_category` is O(num_classes) complexity, for every single sample #6368
Comments
I agree that there is a theoretical benefit, but does this have an impact in a practical case? Benchmarking it on my machine gives me roughly 3 ns per category:

```python
import random
import timeit

num_categories = 1_000  # maximum that we currently have
num_runs = 1_000_000
categories = list(range(num_categories))

total = timeit.timeit(lambda: categories.index(random.randint(0, num_categories - 1)), number=num_runs)
print(f"Looking up the category by index with {num_categories} total categories took {total / num_runs * 1e6:.1f} µs")
```

Your proposal means we should switch
Categories are strings, not integers, so string comparison will take much longer. I did observe a significant difference the last time I benchmarked this.
Makes no significant difference for me:

```python
import random
import string
from time import perf_counter_ns

num_categories = 1_000  # maximum that we currently have
num_runs = 1_000_000
categories = ["".join(random.choices(string.ascii_lowercase, k=20)) for _ in range(num_categories)]

timings = []
for _ in range(num_runs):
    category = random.choice(categories)
    tic = perf_counter_ns()
    categories.index(category)
    tac = perf_counter_ns()
    timings.append((tac - tic) * 1e-9)

print(
    f"Looking up the category by index with {num_categories} total categories took "
    f"{sum(timings) / num_runs * 1e6:.1f} µs"
)
```

With strings of 20 characters the lookup still scales linearly at roughly 4 ns per category, i.e. about 4 µs for ImageNet.
Could you share your benchmark?
```python
import random
import string

str_len = 20
num_categories = 1000
batch_size = 512
categories = ["".join(random.choices(string.ascii_lowercase, k=str_len)) for _ in range(num_categories)]
mapping = {cat: i for i, cat in enumerate(categories)}
batch_targets = [random.choice(categories) for _ in range(batch_size)]
```

```python
%%timeit
[categories.index(cat) for cat in batch_targets]
```

2.59 ms ± 62.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

```python
%%timeit
[mapping[cat] for cat in batch_targets]
```

14 µs ± 318 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Saving a few ms per batch is always worth taking, especially considering how simple the solution is, and when we're trying to demonstrate the benefits of datapipes over our current datasets. Data-loading speed will be the main factor for adoption.
Sorry, I'm not sure what you mean here?
Right now we create the
Do we want to keep this and only use dictionaries internally, or should we switch to dictionaries everywhere? That would also mean we need to touch the info dictionary of each dataset that provides categories: `torchvision/prototype/datasets/_builtin/fer2013.py`, lines 12 to 17 at `2e70ee1`.
I don't know yet, honestly. But I'm open to removing this
This function is called when loading every single sample, but its complexity is O(num_classes) because of the `.index()` call, which is probably overkill. A simple dict lookup should be enough and much faster here: `torchvision/prototype/features/_label.py`, lines 35 to 43 at `96aa3d9`.
CC @pmeier