adding A100, other configs comparison for fun #6

ngam · 2021-10-27T15:32:37Z

Summary tables (more details below and in comments):

Model	M1 7c	M1 32c	A 100 (-)	V 100 (-)	P 100 (-)	T4 (-)	K 80 (-)	Q P 5000 (-)	Q M 4000 (-)	Q RTX 4000 (-)
RN50	10	135	611	347	211	134	F	131	F	F
MNV2	23	352	269	187	125	193	94	181	F	F
DBERT	15	120	761	187	149	94	47	109	39	129
BERTL	1	18	136	31	16	15	4	17	F	F

Model	M1 7c	M1 32c	A 100 (+)	V 100 (+)	P 100 (+)	T4 (+)	K 80 (+)	Q P 5000 (+)	Q M 4000 (+)	Q RTX 4000 (+)
RN50	10	135	1147	v100	252	na	na	na	na	na
MNV2	23	352	1870	v100	128	na	na	na	na	na
DBERT	15	120	1909	v100	209	na	na	na	na	na
BERTL	1	18	309	v100	23	na	na	na	na	na

Really good and useful work here --- thank you! I can potentially fill the remaining values if you think this would be of interest. Note: I simply copied-pasted the content of your .py files and ran them in a notebook. Also note: accuracy will improve by using float32 as op indicates M1 can only use float32, hence comparison should be without optimization imo. RTX 3090 and A100 are somewhat similar in my understanding in terms if benchmarks like these fwiw.

M1 on MBA (7-core, 8GB RAM). Completely froze my laptop! Even the trackpad stopped responding... first time ever I notice I have this slowdown/lag happens on this MBa, but results incoming! Obviously it's taking forever...

Model	GPU	BatchSize	Throughput	Power	Memory
ResNet50	M1 7c	64	10.3 img/sec	?	?
MobileNetV2	M1 7c	128	22.7 img/sec	?	?
DistilBERT	M1 7c	64	15.2 seq/sec	?	?
BERTLarge	M1 7c	32	0.6 seq/sec	?	?

with optimization:

Model	GPU	BatchSize	Throughput	Power	Memory
ResNet50	A100 40GB	64	1147.4 img/sec	?	?
MobileNetV2	A100 40GB	128	1869.7 img/sec	?	?
DistilBERT	A100 40GB	64	1909.3 seq/sec	?	?
BERTLarge	A100 40GB	32	309.3 seq/sec	?	?

without optimziation:

Model	GPU	BatchSize	Throughput	Power	Memory
ResNet50	A100 40GB	64	610.8 img/sec	?	?
MobileNetV2	A100 40GB	128	269.4 img/sec	?	?
DistilBERT	A100 40GB	64	761.1 seq/sec	?	?
BERTLarge	A100 40GB	32	135.5 seq/sec	?	?

Model	GPU	BatchSize	Throughput	Power	Memory
ResNet50	M1 Max 32c	64	135 img/sec	40W	13 GB
MobileNetV2	M1 Max 32c	128	352 img/sec	37W	15 GB
DistilBERT	M1 Max 32c	64	120 seq/sec	35W	9 GB
BERTLarge	M1 Max 32c	32	18 seq/sec	36W	14 GB

Model	GPU	BatchSize	Throughput	Power
ResNet50	3090	64	957 img/sec	300W
MobileNetV2	3090	128	1927 img/sec	310W
DistilBERT	3090	64	1040 seq/sec	310W
BERTLarge	3090	32	164 seq/sec	320W

ngam · 2021-10-27T16:15:54Z

@tlkh feel free to incorporate this somehow. I can fill in the remaining values if people want that. I am actually curious about your memory values --- these tests ran successfully on my M1 MBA with 8GB RAM and 7c GPU...

tlkh · 2021-10-27T16:23:25Z

@tlkh feel free to incorporate this somehow. I can fill in the remaining values if people want that. I am actually curious about your memory values --- these tests ran successfully on my M1 MBA with 8GB RAM and 7c GPU...

You mentioned that your MBA froze - it might have been swapping and using the SSD as RAM.

ngam · 2021-10-27T16:28:56Z

@tlkh feel free to incorporate this somehow. I can fill in the remaining values if people want that. I am actually curious about your memory values --- these tests ran successfully on my M1 MBA with 8GB RAM and 7c GPU...

You mentioned that your MBA froze - it might have been swapping and using the SSD as RAM.

Ah! Good point! If you have easy-to-conduct tests to show the behavior in more details, let me know and I can run those. I also can run on a variety of more GPUs I have access to: V100, P100, etc

tlkh · 2021-10-27T16:37:58Z

Unfortunately I have found it hard to observe the system state/metrics of these M1 macs while running code. I have put together a small CLI app, you can try it and see if it is useful.

https://github.com/tlkh/asitop

Currently it doesn’t display swap status, but now I that I think of it, it should haha. I’ll add that in eventually…

Note that I literally just built it, and I only tested on my M1 Max MacBook.

SubaruSpirit · 2021-10-28T00:37:39Z

Can you test other less powerful gpu that’s more comparable to M1 max please? For example, 3060, 3070, or even 3080 laptop gpu. Would be helpful on people’s buying decision. Not a very fair competition comparing 40w and 300w… but still very interesting, thank you

tlkh · 2021-10-28T01:42:01Z

Can you test other less powerful gpu that’s more comparable to M1 max please? For example, 3060, 3070, or even 3080 laptop gpu. Would be helpful on people’s buying decision. Not a very fair competition comparing 40w and 300w… but still very interesting, thank you

I don’t have any other GPU with which to test. I’m not a tech YouTuber, haha

ngam · 2021-10-28T02:14:08Z

LOL --- but @SubaruSpirit has a good point! Hopefully people who see this will add their own testing using your files!

tlkh · 2021-10-28T02:39:57Z

@ngam thanks for your testing by the way -- I did not respond earlier since I noticed you were iteratively editing your post to update the results. I do think that for more accurate measurements, we need to make sure MBA is not swapping, and also we need to make the dataset (hence epoch time) longer on A100. On my 3090, some of the tests finishes in seconds, which leads to more overhead in the measurement. Setting dataset size to batch*100 should be fine.

garyongguanjie · 2021-10-28T06:20:56Z

I tested this using a Kaggle notebooks (unoptmized).

https://www.kaggle.com/garyongguanjie/kaggle-p100-vs-apple-m1

Model	GPU	BatchSize	Throughput	Peak Power	Memory
ResNet50	P100	128	211 img/sec	?	?
MobileNetV2	P100	128	125 img/sec	?	?
DistilBERT	P100	64	149 seq/sec	?	?
BERTLarge	P100	16	16 seq/sec	?	?

tlkh · 2021-10-28T06:56:21Z

@garyongguanjie thanks, good idea to compare to P100 from Kaggle, and thanks for the notebook.

Added some cli flags to the scripts for better comparison to GPU performance. Updated notebook: https://www.kaggle.com/timothyliukh/kaggle-p100-vs-apple-m1-max-32core

Model	GPU	BatchSize	Throughput	Peak Power	Memory
ResNet50	P100	128	252 img/sec	?	?
MobileNetV2	P100	128	312 img/sec	?	?
DistilBERT	P100	64	209 seq/sec	?	?
BERTLarge	P100	16	23 seq/sec	?	?

Main change here is to enable the XLA compiler, which gives decent amount of improvement.

garyongguanjie · 2021-10-28T07:50:32Z

@garyongguanjie thanks, good idea to compare to P100 from Kaggle, and thanks for the notebook.

Added some cli flags to the scripts for better comparison to GPU performance. Updated notebook: https://www.kaggle.com/timothyliukh/kaggle-p100-vs-apple-m1-max-32core

Model GPU BatchSize Throughput Peak Power Memory
ResNet50 P100 128 252 img/sec ? ?
MobileNetV2 P100 128 312 img/sec ? ?
DistilBERT P100 64 209 seq/sec ? ?
BERTLarge P100 16 23 seq/sec ? ?
Main change here is to enable the XLA compiler, which gives decent amount of improvement.

Interestingly MobileNetV2 is faster on MacBook Pro. Probably because the MacBook uses same architecture as iPhone and which mobilenet was optimized for?

tlkh · 2021-10-28T08:26:26Z

Interestingly MobileNetV2 is faster on MacBook Pro. Probably because the MacBook uses same architecture as iPhone and which mobilenet was optimized for?

I think that would be an over simplification, MobileNet (using separable convolutions) is designed for minimum parameters and MAC count, and it has a much lower arithmetic intensity (ratio of compute : memory access) compared to normal convolution. Hence, it performs better on CPUs in general. I suspect on M1, the large on-chip cache is helping it a lot, but we wont know for sure unless someone does an in-depth profiling.

SubaruSpirit · 2021-10-28T08:57:46Z

I recommend Jordan Harris's YouTube channel, she's already bought the new MacBook Pro with M1 max and is doing the testing on dl, hopefully we will see a more in-depth review soon. But it's a little disappointing see how bad m1 max performs already compared to 3090, given how expensive it is. Just hope someone with a mobile 3000series can give it a go to set some benchmarks.

ngam · 2021-10-28T12:31:53Z

Makes sense: nvidia apparently refused to offer optimization for separableconv2d --- according to the guy who wrote keras

ngam · 2021-10-28T13:07:02Z

I will add V100, T4, RTX4000, P5000, and some of the k series today. Note that I don't have access to any consumer level GPUs, so I won't be able to provide any stuff on that. However, I think these values should suffice to get people to understand what the benefit/downside of M1 Max 32c compared with the NVIDIA GPUs in general. It's important to understand that only NVIDIA GPUs have been optimized for tensorflow/pytorch and the rest are never going to be equivalent for the time being.

Agree with @SubaruSpirit, it is disappointing, but not surprising at all... Unless Apple convinces TF/PT to add even deeper support, it won't improve much.

ngam · 2021-10-28T13:22:34Z

I mean, it looks like the M1 Max 32c is equivalent to Tesla T4 --- that's pretty good!

tlkh · 2021-10-28T13:29:36Z

@ngam amazing, thanks! I think this will be very useful.

Fwiw:

accuracy will improve by using float32 as op indicates M1 can only use float32, hence comparison should be without optimization imo

In general using mixed precision (FP16+FP32) is quite a standard practice across most larger scale training, and typically do not incur accuracy cost. Most AI chips also do this, many can't even train in FP32.

If the goal is to compare what you get easily today on NVIDIA GPU vs Apple/Metal GPU, then we compare what is available out-of-the-box without too much effort. Since it is only 2 line of code (1 to enable XLA compiler, 1 to enable FP16), no reason not to use it in comparison, otherwise we leave a lot of perf (on the NVIDIA side) out of the comparison.

In addition, I am starting to think TF may be less performant than PyTorch on NVIDIA GPU, but that's probably getting too far off topic...

ngam · 2021-10-28T13:36:36Z

That's fair @tlkh --- though, I should stress the results I am compiling in the main text above should be taken as a general guideline and by no means for legit comparison. I simply copied your repo and ran your files right after opening an instance. Not exactly ideal (other things may be happening then), but should give a general guideline to people. See if you spot any issue as more results are added and let me know if I should repeat any of them to confirm!

tlkh · 2021-10-28T13:43:39Z

I did have some concerns - nothing against you, but I am hesitant to endorse results I cannot verify as well.

I do think it's a good community effort putting this together, and we can label it as such (crowdsourced benchmarks).

ngam · 2021-10-28T13:47:00Z

Yes, exactly --- can you upload results for 3090 without the xla and fp16 flags?

tlkh · 2021-10-28T14:18:36Z

RN50 - 464
MNV2 - 380
DistilBERT - 396
BERTLarge - 62

One thing I just realised: what is the batch sizes you are using? Are you just using my script? Yesterday, I updated the script to use BS=128 for ResNet and BS=16 for BERTLarge. (M1 is slightly faster in this case)

If you're just pulling my git repo, some of your earlier benchmarks might have different batch size... we might need to sync up a bit, alternatively, we just indicate the batch size used in the benchmarks.

Sorry about that!

ngam · 2021-10-28T14:26:03Z

Yes, I will redo the tests later from scratch and indicate the different settings as well.

However, I think from my random tests so far: the picture is somewhat clear.
M1 Max 32c is equivalent to a low-end or mid-range NVIDIA GPU when it comes to DL. That's pretty good, imo.

Another way to look at it:
Going from M1 7c (MBA) to M1 Max 32c gets you 10x improvement when it comes to DL; going from M1 Max 32c to A100 40GB gets you another 10x improvement (with optimization flags turned on). A100 being the absolute best you can get in a professional setting which is also more or less equivalent to RTX 3090, this isn't bad; it also gets slightly better for M1 if you are forced to use 32 only (some scientific applications require 32).

One question for @tlkh: How is the fan on the M1 Max 32c??? Does it go crazy when you run these benchmarks? Any insights on the fan noise/heating in general? I love the fact that my fanless MBA never makes noise!!!

tlkh · 2021-10-28T14:31:01Z

I agree, the M1 Max is actually very good, both in perf/watt and especially since it does not have Tensor Cores.

Also, and this might surprise you: I haven’t heard my fans yet! (even with high power mode) I’m actually wondering why these chips aren’t allowed to clock higher.

ngam · 2021-10-28T14:43:02Z

That's fantastic about the fans --- I am waiting for my 14-inch order (64 GB RAM + M1 Max 32c) to come in a week or two.

And I totally agree with you: This is really good for DL. I honestly thought it was going to be completely meh. Anyway, if someone is that serious about performance wrt DL, they will need to go all in and get the DGX or something like that. A laptop should only be used for basic and quick prototyping, nothing more. I thought quite hard about getting a home setup for DL, but I decided it was simply not worth it, especially when you consider how cheap cloud GPU alternatives are. (Fwiw, I am also associated with an academic institution and I get access to all sorts of GPUs / compute resources for free anyway.)

ngam · 2021-10-28T16:14:35Z

Jordan Harris's YouTube channel

Can you provide a link @SubaruSpirit --- who's this person??? 👀

SubaruSpirit · 2021-10-28T16:25:44Z

Jordan Harris's YouTube channel

Can you provide a link @SubaruSpirit --- who's this person??? 👀

My apology it should be Jordan Harrod , the autocorrect changed it to Harris 😅 she hasn't made the video yet, but should be out next Monday. She did a few videos on M1 Macbook using Tensorflow, and she recently purchased m1 pro and max for testing, you can go to the community page of her channel, there is a post where she asked you to post what you want her to test as well.

ngam · 2021-10-28T16:32:50Z

Jordan Harrod

Thanks! Subscribing now 😃

I think her take is pretty spot on here: https://youtube.com/shorts/kRSvkjuc4ac?feature=share

ngam · 2021-11-05T00:24:11Z

Also, and this might surprise you: I haven’t heard my fans yet! (even with high power mode) I’m actually wondering why these chips aren’t allowed to clock higher.

@SubaruSpirit + @tlkh

I got my MBP14inch2021 (M1 Max, 64 GB, 1 TB). Sad to report, the fans do turn on pretty quickly with the tests in this repo. Results are in line with what's published in this issue and in the main readme, ~5--10% lower, I'd say. The fans are not as noisy as the Intel versions; and the laptop isn't as hot. However, if fan noise is an issue for you, the M1 Air has no fans! Having said that, the noise is easy to miss --- the fans are very quiet even when they are on.

In my opinion, besides the display and potentially pro video editing, M1 Air is more or less the same. This MBP does offer a little bit of extra smoothness and it goes well with the extra smooth display. Is it worth 4x the M1 Air though? That'd be the question for you to answer 🤞

tlkh · 2021-11-05T01:52:06Z

Seems roughly in line with reviews, the 14 inch model definitely has poorer cooling! Glad you’re enjoying your new MacBook!

ngam · 2021-11-08T17:06:36Z

@tlkh closing this, but feel free to copy/edit the values, etc. 👍

ngam changed the title ~~adding A100 comparison for fun~~ adding A100, other configs comparison for fun Oct 27, 2021

ngam closed this as completed Nov 8, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adding A100, other configs comparison for fun #6

adding A100, other configs comparison for fun #6

ngam commented Oct 27, 2021 •

edited

ngam commented Oct 27, 2021

tlkh commented Oct 27, 2021

ngam commented Oct 27, 2021

tlkh commented Oct 27, 2021

SubaruSpirit commented Oct 28, 2021

tlkh commented Oct 28, 2021

ngam commented Oct 28, 2021

tlkh commented Oct 28, 2021

garyongguanjie commented Oct 28, 2021

tlkh commented Oct 28, 2021

garyongguanjie commented Oct 28, 2021

tlkh commented Oct 28, 2021

SubaruSpirit commented Oct 28, 2021

ngam commented Oct 28, 2021

ngam commented Oct 28, 2021

ngam commented Oct 28, 2021

tlkh commented Oct 28, 2021

ngam commented Oct 28, 2021

tlkh commented Oct 28, 2021

ngam commented Oct 28, 2021

tlkh commented Oct 28, 2021

ngam commented Oct 28, 2021

tlkh commented Oct 28, 2021

ngam commented Oct 28, 2021

ngam commented Oct 28, 2021

SubaruSpirit commented Oct 28, 2021

ngam commented Oct 28, 2021

ngam commented Nov 5, 2021

tlkh commented Nov 5, 2021

ngam commented Nov 8, 2021

adding A100, other configs comparison for fun #6

adding A100, other configs comparison for fun #6

Comments

ngam commented Oct 27, 2021 • edited

ngam commented Oct 27, 2021

tlkh commented Oct 27, 2021

ngam commented Oct 27, 2021

tlkh commented Oct 27, 2021

SubaruSpirit commented Oct 28, 2021

tlkh commented Oct 28, 2021

ngam commented Oct 28, 2021

tlkh commented Oct 28, 2021

garyongguanjie commented Oct 28, 2021

tlkh commented Oct 28, 2021

garyongguanjie commented Oct 28, 2021

tlkh commented Oct 28, 2021

SubaruSpirit commented Oct 28, 2021

ngam commented Oct 28, 2021

ngam commented Oct 28, 2021

ngam commented Oct 28, 2021

tlkh commented Oct 28, 2021

ngam commented Oct 28, 2021

tlkh commented Oct 28, 2021

ngam commented Oct 28, 2021

tlkh commented Oct 28, 2021

ngam commented Oct 28, 2021

tlkh commented Oct 28, 2021

ngam commented Oct 28, 2021

ngam commented Oct 28, 2021

SubaruSpirit commented Oct 28, 2021

ngam commented Oct 28, 2021

ngam commented Nov 5, 2021

tlkh commented Nov 5, 2021

ngam commented Nov 8, 2021

ngam commented Oct 27, 2021 •

edited