`Label.from_category` is O(num_classes) complexity, for every single sample #6368

NicolasHug · 2022-08-04T09:50:30Z

This function is called when loading every single sample, but its complexity is O(num_classes) because of the .index() call, which is probably overkill. A simple dict lookup should be enough and much faster here.

vision/torchvision/prototype/features/_label.py

Lines 35 to 43 in 96aa3d9

    
           @classmethod 
        
           def from_category( 
        
               cls, 
        
               category: str, 
        
               *, 
        
               categories: Sequence[str], 
        
               **kwargs: Any, 
        
           ) -> Label: 
        
               return cls(categories.index(category), categories=categories, **kwargs)

CC @pmeier

The text was updated successfully, but these errors were encountered:

pmeier · 2022-08-10T09:55:40Z

I agree that there is a theoretical benefit, but does this an impact on a practical case? Benchmarking it on my machine gives me roughly 3 ns per category in categories. Meaning even for ImageNet (the dataset with the most categories we currently have on record) it takes roughly 3 µs for this call:

import random
import timeit

num_categories = 1_000  # maximum that we currently have
num_runs = 1_000_000

categories = list(range(num_categories))

total = timeit.timeit(lambda: categories.index(random.randint(0, num_categories - 1)), number=num_runs)

print(f"Looking up the category by index with {num_categories} total categories took {total / num_runs * 1e6:.1f} µs")

Your proposal means we should switch catgories attribute to a dictionary with the categories as keys and the index as value, correct? Do you want to keep the functionality to create a label with a categories sequence?

NicolasHug · 2022-08-10T09:59:58Z

Categories are strings, not integers. String comparison will take much longer. I did observe a significant difference last time I benchmarked this

pmeier · 2022-08-10T10:09:02Z

Makes no significant difference for me:

import random
from time import perf_counter_ns
import string

num_categories = 1_000  # maximum that we currently have
num_runs = 1_000_000

categories = ["".join(random.choices(string.ascii_lowercase, k=20)) for _ in range(num_categories)]

timings = []

for _ in range(num_runs):
    category = random.choice(categories)

    tic = perf_counter_ns()
    categories.index(category)
    tac = perf_counter_ns()

    timings.append((tac - tic) * 1e-9)

print(
    f"Looking up the category by index with {num_categories} total categories took "
    f"{sum(timings) / num_runs * 1e6:.1f} µs"
)

With string of 20 characters lookup is now linear with 4 ns, i.e. 4 µs for ImageNet.

I did observe a significant difference last time I benchmarked this

Could you share your benchmark?

NicolasHug · 2022-08-10T10:31:47Z

import random
import string

str_len = 20
num_categories = 1000
batch_size = 512

categories = [''.join(random.choices(string.ascii_lowercase, k=str_len)) for _ in range(num_categories)]
mapping = {cat:i for (i, cat) in enumerate(categories)}
batch_targets = [random.choice(categories) for _ in range(batch_size)]

%%timeit
[categories.index(cat) for cat in batch_targets]

2.59 ms ± 62.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
[mapping[cat] for cat in batch_targets]

14 µs ± 318 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

A few ms per batch is always good to take, especially considering how simple the solution is, and when we're trying to exhibit benefits of using datapipes over our current datasets. Data-loading speed will be the main factor for adoption.

Do you want to keep the functionality to create a label with a categories sequence?

Sorry, not sure what you mean here?

pmeier · 2022-08-10T11:07:44Z

Right now we create the Label with a sequence of categories:

vision/torchvision/prototype/features/_label.py

Line 18 in 2e70ee1

categories: Optional[Sequence[str]] = None,

vision/torchvision/prototype/features/_label.py

Line 40 in 2e70ee1

categories: Sequence[str],

Do we want to keep this and only internally use dictionaries or should we switch to dictionaries everywhere? That would also mean that we need to touch the info dictionary of each dataset that provides categories:

vision/torchvision/prototype/datasets/_builtin/fer2013.py

Lines 12 to 17 in 2e70ee1

    
           NAME = "fer2013" 
        
           @register_info(NAME) 
        
           def _info() -> Dict[str, Any]: 
        
               return dict(categories=("angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"))

NicolasHug · 2022-08-10T11:14:01Z

I don't know yet, honestly. But I'm open to removing this categories logic altogether. We're still in prototype stage and our main focus is loading speed right now.

pmeier self-assigned this Aug 10, 2022

NicolasHug closed this as completed Oct 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`Label.from_category` is O(num_classes) complexity, for every single sample #6368

`Label.from_category` is O(num_classes) complexity, for every single sample #6368

NicolasHug commented Aug 4, 2022 •

edited

pmeier commented Aug 10, 2022 •

edited

NicolasHug commented Aug 10, 2022 •

edited

pmeier commented Aug 10, 2022 •

edited

NicolasHug commented Aug 10, 2022

pmeier commented Aug 10, 2022

NicolasHug commented Aug 10, 2022

Label.from_category is O(num_classes) complexity, for every single sample #6368

Label.from_category is O(num_classes) complexity, for every single sample #6368

Comments

NicolasHug commented Aug 4, 2022 • edited

pmeier commented Aug 10, 2022 • edited

NicolasHug commented Aug 10, 2022 • edited

pmeier commented Aug 10, 2022 • edited

NicolasHug commented Aug 10, 2022

pmeier commented Aug 10, 2022

NicolasHug commented Aug 10, 2022

`Label.from_category` is O(num_classes) complexity, for every single sample #6368

`Label.from_category` is O(num_classes) complexity, for every single sample #6368

NicolasHug commented Aug 4, 2022 •

edited

pmeier commented Aug 10, 2022 •

edited

NicolasHug commented Aug 10, 2022 •

edited

pmeier commented Aug 10, 2022 •

edited