
add frequency metric to determine some average per-second metrics #760

Merged: 20 commits merged into pytorch:master on Feb 3, 2020

Conversation

@erip (Contributor) commented Feb 1, 2020

Fixes # N/A

Description:

This code computes X-per-second performance metrics (e.g., words per second, images per second). It will most likely be used in conjunction with ignite.metrics.RunningAverage for maximum utility.
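
A rough usage sketch of what is envisioned here (the FrequencyMetric name and the "ntokens" output key are placeholders from this PR's discussion, not a final API; 'engine' is assumed to be an existing ignite Engine):

from ignite.metrics import RunningAverage

# FrequencyMetric is the (notional) class added in this PR
wps = RunningAverage(FrequencyMetric(output_transform=lambda out: out["ntokens"]))
wps.attach(engine, "wps")
# after each iteration, engine.state.metrics["wps"] holds a smoothed words-per-second value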

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

@vfdev-5 (Collaborator) commented Feb 1, 2020

@erip thanks for the PR! I see the idea; maybe we can iterate on the implementation...
Could you please provide a snippet of the usage you have in mind?

And to keep our CI happy, I can add some tests :)

@erip (Contributor, Author) commented Feb 1, 2020

@vfdev-5 absolutely! I mostly wanted to put pen to paper quickly - happy to add some tests and see where that leads the implementation.

@vfdev-5 (Collaborator) commented Feb 1, 2020

Thanks for the update. I just saw the docs on usage. How about doing it the way GpuInfo does? There, the metric is configured without RunningAverage and it simply fills in the metric value on each iteration.
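
For reference, a rough sketch of that GpuInfo pattern (illustrative only; GpuInfo lives in ignite.contrib.metrics and needs pynvml, and 'trainer' is assumed to be an existing ignite Engine):

from ignite.contrib.metrics import GpuInfo

# GpuInfo attaches itself directly and writes into engine.state.metrics
# on every iteration, without any RunningAverage wrapper
GpuInfo().attach(trainer, name="gpu")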

@erip (Contributor, Author) commented Feb 1, 2020

@vfdev-5 I have fixed some simple flake8 issues and added a docstring with an envisioned usage. I suspect there are some improvements to be made. Since I'd like to compute average throughput, I think it might be good to include a class that inherits from RunningAverage and just wraps this notional FrequencyMetric (as in the docstring). Do you have strong opinions one way or the other?

@erip (Contributor, Author) commented Feb 1, 2020

Ah, you beat me to the comment. :-)

@erip (Contributor, Author) commented Feb 2, 2020

@vfdev-5 Ok, hopefully the distributed is-init'd checks are consistent with the rest of ignite.

@vfdev-5 (Collaborator) commented Feb 2, 2020

@erip maybe we can add a single CPU distrib test to ensure the correct behavior.

For example, like here:

def test_distrib_cpu(distributed_context_single_node_gloo):

If you have any questions about how to do this, please do not hesitate to ask.

The distributed tests can be run on CPU with the following command:

py.test --cov ignite --cov-append --cov-report term-missing --dist=each --tx $WORLD_SIZE*popen//python=python$TRAVIS_PYTHON_VERSION tests -m distributed -vvv
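
A rough sketch of what such a test could look like, reusing the _test_frequency_with_engine helper discussed later in this thread and the gloo fixture named above (illustrative, not the merged test):

import torch.distributed as dist

def test_distrib_cpu(distributed_context_single_node_gloo):
    # the fixture provides a single-node gloo process group for the test
    device = "cpu"
    _test_frequency_with_engine(device, workers=dist.get_world_size())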

@erip (Contributor, Author) commented Feb 2, 2020

Brilliant. That's the next thing I wanted to add. 😄

@erip (Contributor, Author) commented Feb 2, 2020

Yay, it looks like it works. 😄

@vfdev-5 (Collaborator) commented Feb 2, 2020

Okay, let's wait until CI finishes its job and then go ahead with merging.

@erip (Contributor, Author) commented Feb 2, 2020

Awesome! For your awareness, I'm hoping to begin tackling facebookresearch/fairseq#1648 and this is the first step in that journey. You may see more of me as I run into features that ignite doesn't currently support that fairseq needs for parity.

@vfdev-5 (Collaborator) left a review comment

LGTM! Thanks @erip

@erip changed the title from "[WIP] add frequency metric to determine some average per-second metrics" to "add frequency metric to determine some average per-second metrics" on Feb 2, 2020
@vfdev-5 (Collaborator) commented Feb 2, 2020

That would be great! So, yes, feel free to send other PRs and we can work them out as well.

@vfdev-5 (Collaborator) commented Feb 2, 2020

Just one more point I would like to discuss before merging, given the context of where this could potentially be used:

Maybe we can put this metric directly into the core part, as it does not require any additional packages?
Maybe we can also rename it to just Frequency instead of FrequencyMetric?

cc @justusschock thoughts?

@erip (Contributor, Author) commented Feb 2, 2020

I'm happy to do that - I thought contrib was the landing ground for external contributions. I'll make the two changes separately and they can be cherry-picked as desired.

@erip (Contributor, Author) commented Feb 2, 2020

@vfdev-5 it looks like I've run into some flakiness in the tests, which is likely just a result of my misunderstanding torch.distributed -- the logic of my test is basically that the X-per-second rate of a process with intentional delays should fall somewhere between 90% of the "ideal" frequency given those delays and the ideal frequency itself. In a distributed environment, these "ideals" (and indeed the Xps) should scale up with the number of workers. I assumed this scaling factor would be torch.distributed.get_world_size(), but it seems this is not the case. My new hypothesis is that it scales as get_world_size() * num_nodes, where the world size is the number of workers on a given node.

Is this a correct understanding? If so, is num_nodes equal to the local rank?

@vfdev-5 (Collaborator) commented Feb 2, 2020

@erip well, I think we did not implement that correctly. This is my fault; I suggested to all_reduce the elapsed time. What happens now is the following: say we have 4 processes (world_size=4), and for a single iteration we get _n per process of [10, 11, 10, 10] and elapsed per process of [1.0, 1.0, 0.99, 1.0]. What we do in compute is "all reduce" _n and elapsed with the sum operation (i.e. compute the totals of processed _n and elapsed), so _n becomes sum([10, 11, 10, 10]) and elapsed becomes sum([1.0, 1.0, 0.99, 1.0]). So what we actually get is the mean number of processed objects (tokens) per process.
Possibly, we would like to compute (all_reduce(_n) / all_reduce(elapsed)) * world_size instead? A rough sketch follows.
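
A rough sketch of that suggestion in compute() (the _n / _elapsed attribute names follow this discussion and are not necessarily the merged code):

import torch
import torch.distributed as dist

def compute(self):
    n = torch.tensor(self._n, dtype=torch.float64)
    elapsed = torch.tensor(self._elapsed, dtype=torch.float64)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(n)        # sum of counts across processes
        dist.all_reduce(elapsed)  # sum of elapsed times across processes
        # sum(n) / sum(elapsed) is the mean per-process rate, so scale by world size
        return (n / elapsed).item() * dist.get_world_size()
    return (n / elapsed).item()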

@vfdev-5 (Collaborator) commented Feb 2, 2020

@erip to help with your random search for a good test, you can put the Frequency class into a script together with the following:

import time

import torch.distributed as dist

from ignite.engine import Engine


class Frequency:
    # ... (the Frequency metric implementation from this PR goes here)
    pass


def _test_frequency_with_engine(device, workers):
    artificial_time = 2  # seconds of sleep per iteration
    batch_size = 4
    n_tokens = 10000
    total_tokens = n_tokens * batch_size
    time_per_epoch = total_tokens / artificial_time
    average_upper_bound = time_per_epoch * workers
    average_lower_bound = average_upper_bound * 0.9

    def update_fn(engine, batch):
        time.sleep(artificial_time)
        return {"ntokens": len(batch)}

    engine = Engine(update_fn)
    wps_metric = Frequency(output_transform=lambda x: x["ntokens"], device=device)
    wps_metric.attach(engine, 'wps')
    data = [list(range(n_tokens))] * batch_size
    wps = engine.run(data, max_epochs=1).metrics['wps']
    print("{} | {} | wps: {} | {}".format(dist.get_rank(), average_lower_bound, wps, average_upper_bound))


def test_frequency_with_engine_nondistributed():
    device = "cpu"
    _test_frequency_with_engine(device, workers=1)


if __name__ == "__main__":

    dist.init_process_group("gloo", init_method="env://")

    device = "cpu"
    _test_frequency_with_engine(device, workers=dist.get_world_size())

and run it like this:

python3 -u -m torch.distributed.launch --nproc_per_node=1 frequency_metric_distrib.py

@erip (Contributor, Author) commented Feb 2, 2020

Thanks! I tried the local testing prescribed by the Travis script, but ran into a weird issue. 😓

@vfdev-5 (Collaborator) commented Feb 2, 2020

Actually, another thing I forgot to mention about distributed: the idea is to perform DDP, so we split the token data across processes. So, if the world size is 4, each process sees 1/4 of the total data. And I think this is not coded in the tests...

@erip (Contributor, Author) commented Feb 2, 2020

There's the missing factor. 😅

@vfdev-5 (Collaborator) commented Feb 2, 2020

Another thing about the notation that I don't think is right: batch_size=4, but data = [list(range(n_tokens))] * batch_size, so the number of iterations is 4 and each batch contains n_tokens elements.

In distributed code, the batch size is generally scaled by the number of processes (world_size) so that the effective batch size is the same regardless of the configuration; see the sketch below.
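
A minimal sketch of that convention (names are illustrative):

import torch.distributed as dist

global_batch_size = 128
# each process takes its share, so the effective batch size stays constant
# across single-process and multi-process runs
local_batch_size = global_batch_size // dist.get_world_size()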

@erip (Contributor, Author) commented Feb 2, 2020

Interestingly, I find that when the world size is 2, there seem to be two passes over the data. I added print(f"Batch size: {len(batch)}") to the update_fn and I see:

Batch size: 10000
Batch size: 10000
Batch size: 10000
Batch size: 10000
Batch size: 10000
Batch size: 10000
Batch size: 10000
Batch size: 10000

My algebra is failing me for some reason today...

@erip (Contributor, Author) commented Feb 2, 2020

Ok, I think I really found it this time... whew

@erip (Contributor, Author) commented Feb 2, 2020

I believe my own bastardization of "batch_size" was causing a lot of confusion.

@vfdev-5 (Collaborator) commented Feb 2, 2020

Interestingly I find that when the world size is 2, there seems to be two passes over the data?

Yes, there are two processes running the training. That's why we need DDP, to make it look like this:

world_size = 2

Total Data = 4 batches
[----|----|----|----]

Processes seeing data:
[----|----|----|----]
[1111|2222|1111|2222]

So we have an overall batch size of 8 elements, 4 per process. The epoch now takes half as long: 2 iterations vs 4 iterations (non-distributed).

So when you print, you see the stdout of all processes. Normally, we guard that with an if:

if dist.get_rank() == 0:
    print(...)

@erip (Contributor, Author) commented Feb 2, 2020

Ok, I think this is better now. Thanks for your patience and help!

@vfdev-5 (Collaborator) commented Feb 2, 2020

@erip I'm playing with the tests and the code, and this is probably not the end of it :)
The current test has only one iteration, so only a single update is done, and there is a bug with multiple updates. Here is my code:

import time

import torch.distributed as dist

from ignite.engine import Engine, Events

# Frequency is the metric class from this PR, assumed to be defined in the same script


def _test_frequency_with_engine(device, workers):
    artificial_time = 0.1  # seconds of sleep per iteration
    total_tokens = 2000
    batch_size = 128 // workers

    def update_fn(engine, batch):
        time.sleep(artificial_time)
        return {"ntokens": len(batch)}

    engine = Engine(update_fn)
    wps_metric = Frequency(output_transform=lambda x: x["ntokens"], device=device)
    wps_metric.attach(engine, 'wps')

    @engine.on(Events.ITERATION_COMPLETED)
    def assert_wps(e):
        wps = e.state.metrics['wps']
        if dist.get_rank() == 0:
            print("{}: wps={}".format(e.state.iteration, wps))

    data = [[i] * batch_size for i in range(0, total_tokens, batch_size)]
    engine.run(data, max_epochs=1)


if __name__ == "__main__":

    dist.init_process_group("gloo", init_method="env://")
    device = "cpu"
    _test_frequency_with_engine(device, workers=dist.get_world_size())

If executed as

python3 -u -m torch.distributed.launch --nproc_per_node=1 frequency_metric_distrib.py

the output is

1: wps=1258
2: wps=1263
3: wps=1247
4: wps=1243
....
13: wps=1241
14: wps=1240
15: wps=1239

It is OK, as it is about 128 samples per 0.1 seconds.

If executed as

python3 -u -m torch.distributed.launch --nproc_per_node=2 frequency_metric_distrib.py

we have

1: wps=1185
2: wps=1805
3: wps=2820
4: wps=4544
5: wps=7540
...
30: wps=43612306556
31: wps=84412682684
32: wps=163572946560

Something still needs to be fixed in the distributed configuration.

PS: I'm curious what they do in fairseq for this?

@erip (Contributor, Author) commented Feb 2, 2020

Fairseq uses what they call a TimeMeter, which is a less robust version of ignite's Metric. The avg property is logged after each batch. The meters are reset at each batch, I think. I don't think this is a great way to approach it because it makes the wps very sensitive to a lot of things, including IO.

@vfdev-5 (Collaborator) commented Feb 2, 2020

@erip I found the problem; let me commit the fix and the updated test directly.

@vfdev-5 (Collaborator) commented Feb 2, 2020

@erip I made the following changes vs your code:

  • measure the time in update instead of compute.
  • updated the tests to measure wps on each iteration and check the ranges.

it makes the wps very sensitive to a lot of things including IO.

In this code it will also be sensitive to IO, as the timer measures the time between iterations: read data -> batch prep -> update model.
If we would like to exclude read data -> batch prep, we need to stop the timer on Events.GET_BATCH_STARTED and resume it on Events.GET_BATCH_COMPLETED; see the sketch below.

self._fire_event(Events.GET_BATCH_STARTED)
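
A minimal sketch of how an ignite Timer could be wired to exclude the data-loading part (the event choices here are an assumption for illustration, not what was merged; 'engine' is an existing ignite Engine):

from ignite.engine import Events
from ignite.handlers import Timer

timer = Timer(average=False)
# start at epoch start, pause while the next batch is being fetched,
# resume once the batch is available, so only processing time accumulates
timer.attach(engine,
             start=Events.EPOCH_STARTED,
             pause=Events.GET_BATCH_STARTED,
             resume=Events.GET_BATCH_COMPLETED)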

@erip (Contributor, Author) commented Feb 2, 2020

I suspect there's no way around it. :-)

@vfdev-5 (Collaborator) commented Feb 2, 2020

I suspect there's no way around it. :-)

Well, we need to set up the timer as is done here:

def attach(self, engine: Engine, start: str = Events.STARTED,

on the correct events...

@vfdev-5 (Collaborator) commented Feb 3, 2020

@erip if you are OK with this implementation, we can merge it and, if needed, update the code later to exclude data-processing time.

@erip (Contributor, Author) commented Feb 3, 2020

I'm OK with the implementation as-is for now. I think there may be complications surrounding wiring the Frequency._timer for the right events because Frequency._timer will be None in the call to Frequency.attach.

@vfdev-5 (Collaborator) commented Feb 3, 2020

Thanks for pointing that out. Actually, reset() is called in Frequency.__init__ via super().__init__, so the timer starts counting even before trainer.run. Normally, metrics attach reset to Events.EPOCH_STARTED, so that every epoch (= a run, for validation) computes the metric from scratch; see the sketch below. Here, we are missing that.

So, in this way, Frequency._timer is already created and can be set up more finely to avoid data IO, if that is really what we want.
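
For reference, a simplified sketch of how the base Metric.attach wires these events in ignite (the handler-existence checks of the real implementation are omitted):

from ignite.engine import Events

def attach(self, engine, name):
    # reset the metric's internal state at the start of every epoch
    engine.add_event_handler(Events.EPOCH_STARTED, self.started)
    # accumulate on every iteration
    engine.add_event_handler(Events.ITERATION_COMPLETED, self.iteration_completed)
    # write the computed value into engine.state.metrics[name] at the end of the epoch
    engine.add_event_handler(Events.EPOCH_COMPLETED, self.completed, name)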

@erip (Contributor, Author) commented Feb 3, 2020

I will defer to you about whether to merge now or to wait for a more complete solution. For fairseq this is good enough. 👍

@vfdev-5 vfdev-5 merged commit 0375a6e into pytorch:master Feb 3, 2020
@erip erip deleted the feature/add-frequency-metric branch February 3, 2020 13:22