NVIDIA Kepler results (GTX 760M, driver version 419.67) #14

Closed
ash3D opened this issue Apr 2, 2019 · 2 comments

ash3D commented Apr 2, 2019

PerfTest
To select adapter, use: PerfTest.exe [ADAPTER_INDEX]

Adapters found:
0: NVIDIA GeForce GTX 760M
1: Intel(R) HD Graphics 4600
2: Microsoft Basic Render Driver
Using adapter 0

Running 30 warm-up frames and 30 benchmark frames:
.............................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Performance compared to Buffer.Load random

Buffer.Load uniform: 3.073ms 62.440x
Buffer.Load linear: 195.662ms 0.981x
Buffer.Load random: 197.022ms 0.974x
Buffer.Load uniform: 3.227ms 59.465x
Buffer.Load linear: 195.179ms 0.983x
Buffer.Load random: 196.785ms 0.975x
Buffer.Load uniform: 3.598ms 53.329x
Buffer.Load linear: 193.676ms 0.991x
Buffer.Load random: 191.866ms 1.000x
Buffer.Load uniform: 3.031ms 63.308x
Buffer.Load linear: 195.622ms 0.981x
Buffer.Load random: 197.009ms 0.974x
Buffer.Load uniform: 3.025ms 63.434x
Buffer.Load linear: 195.135ms 0.983x
Buffer.Load random: 196.860ms 0.975x
Buffer.Load uniform: 3.443ms 55.728x
Buffer.Load linear: 193.744ms 0.990x
Buffer.Load random: 191.929ms 1.000x
Buffer.Load uniform: 2.970ms 64.605x
Buffer.Load linear: 195.751ms 0.980x
Buffer.Load random: 197.141ms 0.973x
Buffer.Load uniform: 3.175ms 60.425x
Buffer.Load linear: 195.351ms 0.982x
Buffer.Load random: 196.911ms 0.974x
Buffer.Load uniform: 3.621ms 52.985x
Buffer.Load linear: 350.658ms 0.547x
Buffer.Load random: 350.633ms 0.547x
ByteAddressBuffer.Load uniform: 3.758ms 51.055x
ByteAddressBuffer.Load linear: 191.898ms 1.000x
ByteAddressBuffer.Load random: 216.928ms 0.884x
ByteAddressBuffer.Load2 uniform: 4.682ms 40.977x
ByteAddressBuffer.Load2 linear: 390.852ms 0.491x
ByteAddressBuffer.Load2 random: 442.053ms 0.434x
ByteAddressBuffer.Load3 uniform: 572.822ms 0.335x
ByteAddressBuffer.Load3 linear: 568.316ms 0.338x
ByteAddressBuffer.Load3 random: 570.361ms 0.336x
ByteAddressBuffer.Load4 uniform: 752.691ms 0.255x
ByteAddressBuffer.Load4 linear: 758.795ms 0.253x
ByteAddressBuffer.Load4 random: 763.638ms 0.251x
ByteAddressBuffer.Load2 unaligned uniform: 4.199ms 45.692x
ByteAddressBuffer.Load2 unaligned linear: 391.542ms 0.490x
ByteAddressBuffer.Load2 unaligned random: 442.574ms 0.434x
ByteAddressBuffer.Load4 unaligned uniform: 752.793ms 0.255x
ByteAddressBuffer.Load4 unaligned linear: 758.698ms 0.253x
ByteAddressBuffer.Load4 unaligned random: 763.679ms 0.251x
StructuredBuffer.Load uniform: 3.103ms 61.827x
StructuredBuffer.Load linear: 195.674ms 0.981x
StructuredBuffer.Load random: 196.991ms 0.974x
StructuredBuffer.Load uniform: 3.301ms 58.120x
StructuredBuffer.Load linear: 195.167ms 0.983x
StructuredBuffer.Load random: 196.749ms 0.975x
StructuredBuffer.Load uniform: 3.846ms 49.882x
StructuredBuffer.Load linear: 350.461ms 0.547x
StructuredBuffer.Load random: 350.494ms 0.547x
cbuffer{float4} load uniform: 4.478ms 42.844x
cbuffer{float4} load linear: 9217.404ms 0.021x
cbuffer{float4} load random: 3333.476ms 0.058x
Texture2D.Load uniform: 3.384ms 56.695x
Texture2D.Load linear: 202.197ms 0.949x
Texture2D.Load random: 204.327ms 0.939x
Texture2D.Load uniform: 3.731ms 51.424x
Texture2D.Load linear: 198.542ms 0.966x
Texture2D.Load random: 211.881ms 0.906x
Texture2D.Load uniform: 4.306ms 44.558x
Texture2D.Load linear: 196.088ms 0.978x
Texture2D.Load random: 195.847ms 0.980x
Texture2D.Load uniform: 3.419ms 56.118x
Texture2D.Load linear: 202.264ms 0.949x
Texture2D.Load random: 204.311ms 0.939x
Texture2D.Load uniform: 3.673ms 52.243x
Texture2D.Load linear: 198.553ms 0.966x
Texture2D.Load random: 211.917ms 0.905x
Texture2D.Load uniform: 4.115ms 46.626x
Texture2D.Load linear: 196.084ms 0.978x
Texture2D.Load random: 350.561ms 0.547x
Texture2D.Load uniform: 3.517ms 54.547x
Texture2D.Load linear: 202.339ms 0.948x
Texture2D.Load random: 204.392ms 0.939x
Texture2D.Load uniform: 3.705ms 51.783x
Texture2D.Load linear: 198.537ms 0.966x
Texture2D.Load random: 350.591ms 0.547x
Texture2D.Load uniform: 4.028ms 47.637x
Texture2D.Load linear: 350.589ms 0.547x
Texture2D.Load random: 350.519ms 0.547x


ash3D commented Apr 2, 2019

Some thoughts and a comparison with newer NVIDIA architectures

  • The results show that Volta and Turing use the read-write LSU pipeline (which is typically used for UAVs and shared memory) for untyped loads (both raw and structured buffers). This is beneficial thanks to the higher LSU count compared with TMUs and the shorter latency. The LSUs are now backed by a read-write L1$ (similar to Fermi), so the read-only TMU pipeline no longer has an advantage.
    Typed loads still seem to go through the TMU, even though Maxwell introduced format-conversion hardware for UAV loads. Probably not all formats are supported by the LSU (e.g. sRGB or shared exponent), so the driver has to stick with the TMU for typed loads, since it can't know up front which format will be used and whether the LSU supports it. On the other hand, using TMUs can be beneficial in some situations: since NVIDIA GPUs have separate LSU and TMU pipelines, they can presumably be utilized simultaneously, so in UAV- or shared-memory-intensive workloads, using the otherwise idle TMU for read-only SRV access can be essentially free. (A minimal HLSL sketch of the load types being compared is shown after this list.)

  • Maxwell and Pascal don't have a read-write L1$, so the texture L1$ gives the TMU an advantage over the LSU. Apparently the TMU is used for structured buffer loads, while the LSU is now used for raw buffers (results from a previous test version with older drivers suggested the TMU for raw buffers too). Maybe this discrepancy is due to NVIDIA hardware being unable to perform full-speed unaligned 64-bit fetches (alignment is not guaranteed for raw buffers). Accesses to structured buffers, on the other hand, are aligned, and 64-bit loads run at full speed, so the driver decided to take advantage of the texture L1$. The decision to use the LSU without an L1$ for raw buffers seems to give better performance than could be expected from the TMU based on unit counts, except for random Load4, which is slower than the expected TMU theoretical rate of 1/4 (apparently the lack of an L1$ shows up in this case).

  • Kepler uses the TMU for all SRV loads. Maybe the NVIDIA driver team decided that the read-only texture L1$ offers more benefit than the larger LSU count on this architecture, but another explanation for this difference from Maxwell and Pascal, which have a similar cache hierarchy, is that Kepler has a fixed 64 slots for UAVs (it supports unlimited descriptor tables for SRVs only). Maxwell extended the descriptor table approach to UAVs, so it can use the LSU for untyped SRV access if desired.
    My own experiments showed that UAV loads can be faster than SRV loads on Kepler in some cases (it strongly depends on various factors, including driver version), but in general SRVs offer more predictable and consistent performance.
    It's notable that the uniform load optimization works for 1d and 2d raw buffer loads but is disabled for 3d and 4d ones. Maybe due to GPR pressure?
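
Below is a minimal, hypothetical HLSL sketch of the SRV load flavors compared in the list above (typed, raw, and structured), plus a wave-uniform index of the kind the driver's uniform load optimization can scalarize. The resource names, cbuffer layout, and output UAV are invented for illustration; this is not the actual PerfTest shader code.

```hlsl
// Hypothetical compute shader illustrating the load flavors discussed above.
Buffer<float4>           typedBuf  : register(t0); // typed SRV load (format conversion; TMU path)
ByteAddressBuffer        rawBuf    : register(t1); // raw (untyped) SRV load
StructuredBuffer<float4> structBuf : register(t2); // structured (untyped, fixed-stride) SRV load
RWBuffer<float>          outBuf    : register(u0); // UAV for writing results

cbuffer Constants : register(b0)
{
    uint uniformIndex; // identical for every thread -> "uniform" access pattern
};

[numthreads(256, 1, 1)]
void main(uint3 tid : SV_DispatchThreadID)
{
    // Uniform: all lanes read the same element; the driver can scalarize this
    // (the "uniform load optimization" visible in the results above).
    float4 u = typedBuf[uniformIndex];

    // Linear: consecutive lanes read consecutive elements.
    float4 s = structBuf[tid.x];

    // Raw loads of different widths (Load2/Load4 in the results); addresses
    // are byte offsets, only guaranteed to be 4-byte aligned in general.
    uint2 r2 = rawBuf.Load2(tid.x * 8);
    uint4 r4 = rawBuf.Load4(tid.x * 16);

    outBuf[tid.x] = u.x + s.y + asfloat(r2.x) + asfloat(r4.w);
}
```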

RGB32 loads

I've also tried the RGB32 (float3) format for typed buffer loads and textures. The results differed from my previous experiments. The current test configuration shows a 2/3 rate (somewhat strange) for buffers and for linear texture access, and a 1/3 rate for random texture reads. My previous experiments gave a 1/3 rate for linear buffer loads (similar to raw buffer Load3) and somewhat slower for random loads. A 1/3 rate seemed reasonable: it is consistent with the assumption that NVIDIA TMU hardware falls back to 32-bit fetches on unaligned access (otherwise a 1/2 rate would be expected, as with RGBA32). I played with the test configuration a little, tuning the thread group size and loop iteration count, and got different results: near 1/3, or much slower in some cases. Maybe cache bank conflicts start to appear.
Such dependence on the test configuration prompted the idea of driver optimizations (maybe similar to the uniform load optimization): 64- or 128-bit fetches combined to assemble a 96-bit result while sharing data with other threads or loop iterations. It's remarkable that the 2/3 rate is achieved when both the thread group size and the loop iteration count are 256. But this assumption could be wrong: the TMU hardware may simply be able to offer a 2/3 rate for the RGB32 format, and the drop to 1/3 or slower under some conditions may be caused by cache bank conflicts or other reasons. (A minimal sketch of the RGB32 typed load setup is shown below.)
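
For clarity, here is a minimal sketch of what the RGB32 typed loads look like on the shader side; the 96-bit element format itself (DXGI_FORMAT_R32G32B32_FLOAT) is specified on the SRV at view-creation time in the API, not in HLSL. The resource names are made up for this example.

```hlsl
// Hypothetical shader-side declarations for the RGB32 tests; the SRVs are
// assumed to be created with DXGI_FORMAT_R32G32B32_FLOAT (12-byte elements).
Buffer<float3>    rgb32Buf : register(t0);
Texture2D<float3> rgb32Tex : register(t1);
RWBuffer<float>   outBuf   : register(u0);

[numthreads(256, 1, 1)]
void main(uint3 tid : SV_DispatchThreadID)
{
    // Linear typed buffer load: each 12-byte element straddles 8/16-byte
    // fetch boundaries, which is why a straight 1/2 rate (as with RGBA32)
    // is not expected here.
    float3 b = rgb32Buf[tid.x];

    // Typed texture load of the same format (compare Texture2D.Load above).
    float3 t = rgb32Tex.Load(int3(tid.x & 255, tid.x >> 8, 0));

    outBuf[tid.x] = b.x + b.y + b.z + t.x;
}
```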


sebbbi commented Jul 30, 2019

Thanks! Added Kepler results. This confirms that Nvidia's uniform load driver optimization affects Kepler too.

sebbbi closed this as completed Aug 1, 2019