Initial commit

commit 3e7bbbe22f57891a4c0ebe875e7902d3bdfa4c4f 0 parents
@rygorous authored
Showing with 277 additions and 0 deletions.
  1. +26 −0 main.tex
  2. +12 −0 sw_stack.bib
  3. +239 −0 sw_stack.tex
26 main.tex
@@ -0,0 +1,26 @@
+\title{A Trip Through The Graphics Pipeline 2011}
+\author{Fabian Giesen}
+% Include sections here
+blah blah blah.
12 sw_stack.bib
@@ -0,0 +1,12 @@
+@misc{wddm,
+  Author = {Microsoft},
+  Title = {Windows Vista Display Driver Model (WDDM) Reference},
+  Year = {2006},
+  URL = {}
+}
+@misc{umd-ddi,
+  Author = {Microsoft},
+  Title = {User-Mode Display Driver Functions},
+  Year = {2006},
+  URL = {}
+}
239 sw_stack.tex
@@ -0,0 +1,239 @@
+\chapter{The Software Stack}
+This section assumes a reasonably recent Windows version (Vista or later),
+which uses the WDDM~\citep{wddm} driver model. Older driver models (and other
+platforms) are somewhat different, but that's outside the scope of this
+text---I just picked WDDM because it's probably the most relevant model on
+PCs right now.
+\section{Application and API}
+It all starts with the application. On PC, all communication between an app and
+the GPU is mediated by the graphics API; apps may occasionally get direct
+access to memory that's GPU-addressable (such as Vertex Buffers or Textures),
+but on PC they can't directly generate native GPU commands\footnote{Not
+officially, anyway; since the UMD writes command buffers and runs in user mode,
+an app could conceivably figure out where the UMD stores its current write
+pointer and insert GPU commands manually, but that's not exactly supported or
+recommended behavior.} --- all that has to go through the API and the driver.
+The API is the recipient of the app's resource creation, state-setting, and draw
+calls. The API runtime keeps track of the current state your app has set,
+validates parameters and does other error and consistency checking, manages
+user-visible resources, and may or may not validate shader code and shader
+linkage (it does in D3D, while in OpenGL this is handled at the driver level).
+It can also merge batches if possible. It then packages it all up nicely and
+hands it over to the graphics driver---more precisely, the user-mode driver.
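To make the state-tracking part concrete, here's a toy sketch (all names invented, not any real runtime interface) of how an API runtime might remember current state and filter redundant state-setting calls before anything reaches the driver:

```python
class StateTracker:
    """Toy model of an API runtime's state tracking (names invented):
    it remembers the current state and drops redundant set calls so
    they never reach the driver."""

    def __init__(self):
        self.current = {}       # state name -> current value
        self.driver_calls = []  # calls actually forwarded to the driver

    def set_state(self, name, value):
        if self.current.get(name) == value:
            return              # redundant: state unchanged, skip it
        self.current[name] = value
        self.driver_calls.append((name, value))

rt = StateTracker()
rt.set_state("blend_mode", "alpha")
rt.set_state("blend_mode", "alpha")   # filtered: nothing new to do
rt.set_state("blend_mode", "opaque")  # real change: forwarded
```

Real runtimes track far more (bound resources, shader linkage, and so on), but the pattern is the same: known current state means less pointless work downstream.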
+\section{The User-Mode Driver (UMD)}
+This is where most of the ``magic'' on the CPU side happens. If your app
+crashes because of some API call you did, it will usually be in here :). It's
+called ``nvd3dum.dll'' (NVidia) or ``atiumd*.dll'' (AMD). As the name suggests,
+this is user-mode code; it's running in the same context and address space as
+your app (and the API runtime) and has no elevated privileges whatsoever. It
+implements a lower-level API (the DDI~\citep{umd-ddi}) that is called by D3D;
+this API is fairly similar to the one you're seeing on the surface, but a bit
+more explicit about things like memory and state management.
+This module is where things like shader compilation happen. D3D passes a
+pre-validated shader token stream to the UMD---i.e. it's already checked that
+the code is valid in the sense of being syntactically correct and obeying D3D
+constraints (using the right types, not using more textures/samplers than
+available, not exceeding the number of available constant buffers, stuff like
+that). This is compiled from HLSL code and usually has quite a number of
+high-level optimizations (various loop optimizations, dead-code elimination,
+constant propagation, predicating ifs etc.) applied to it---this is good news
+since it means the driver benefits from all these relatively costly
+optimizations that have been performed at compile time. However, it also has a
+bunch of lower-level optimizations (such as register allocation and loop
+unrolling) applied that drivers would rather do themselves; long story short,
+this usually just gets immediately turned into an intermediate representation
+(IR) and then compiled some more; shader hardware is close enough to D3D
+bytecode that compilation doesn't need to work wonders to give good results
+(and the HLSL compiler having done some of the high-yield and high-cost
+optimizations already definitely helps), but there's still lots of low-level
+details (such as HW resource limits and scheduling constraints) that D3D
+neither knows nor cares about, so this is not a trivial process.
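To give a feel for one of the optimization passes involved, here's a toy constant-folding pass over an invented three-address IR (the instruction format and names are made up for illustration; a real shader compiler backend works on a much richer IR):

```python
def constant_fold(ir):
    """Fold constant arithmetic in a toy three-address IR (invented
    format): each instruction is (dest, op, a, b); ints are constants,
    strings are registers. Folded values propagate into later uses."""
    env, out = {}, []
    for dest, op, a, b in ir:
        a = env.get(a, a)  # substitute already-known constants
        b = env.get(b, b)
        if isinstance(a, int) and isinstance(b, int):
            env[dest] = a + b if op == "add" else a * b  # fold, emit nothing
        else:
            out.append((dest, op, a, b))
    return out, env

# t0 = 2+3 and t1 = t0*4 fold away entirely; only the instruction that
# depends on the runtime input "x" survives.
ir = [("t0", "add", 2, 3), ("t1", "mul", "t0", 4), ("t2", "add", "x", "t1")]
optimized, folded = constant_fold(ir)
```

As noted above, the HLSL compiler has usually done this kind of thing already; the driver's compiler spends most of its effort on the hardware-specific parts (register allocation, scheduling) instead.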
+And of course, if your app is a well-known game, programmers at NV/AMD have
+probably looked at your shaders and wrote hand-optimized replacements for their
+hardware---though they had better produce the same results lest there be a scandal
+:). These shaders get detected and substituted by the UMD too. You're welcome.
+More fun: Some of the API state may actually end up being compiled into the
+shader - to give an example, relatively exotic (or at least infrequently used)
+features such as texture borders are probably not implemented in the texture
+sampler, but emulated with extra code in the shader (or just not supported at
+all). This means that there's sometimes multiple versions of the same shader
+floating around, for different combinations of API states.
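A toy sketch of what such variant management might look like (names and the state encoding are invented; which state bits actually affect code generation is entirely hardware-dependent):

```python
class ShaderVariantCache:
    """Toy sketch (invented names) of per-state shader variants: the UMD
    keys compiled code on the subset of API state that affects codegen,
    e.g. whether texture borders must be emulated in the shader."""
    BORDER_BIT = 1  # pretend this state bit changes the generated code

    def __init__(self):
        self.cache = {}
        self.compiles = 0

    def get(self, bytecode, state_bits):
        key = (bytecode, state_bits & self.BORDER_BIT)
        if key not in self.cache:
            self.compiles += 1  # the expensive backend compile happens here
            self.cache[key] = f"{bytecode}|border={bool(key[1])}"
        return self.cache[key]

cache = ShaderVariantCache()
cache.get("ps_main", 0b00)
cache.get("ps_main", 0b10)  # differs only in irrelevant state: cache hit
cache.get("ps_main", 0b01)  # border emulation on: new variant compiled
```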
+Incidentally, this is also the reason why you'll often see a delay the first
+time you use a new shader or resource; a lot of the creation/compilation work
+is deferred by the driver and only executed when it's actually necessary (you
+wouldn't believe how much unused crap some apps create!). Graphics programmers
+know the other side of the story - if you want to make sure something is
+actually created (as opposed to just having memory reserved), you need to issue
+a dummy draw call that uses it to ``warm it up''. Ugly and annoying, but this has
+been the case since I first started using 3D hardware in 1999 - meaning, it's
+pretty much a fact of life by this point, so get used to it. :)
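The deferred-creation pattern itself is simple; a toy sketch (invented API, not any real driver interface):

```python
class LazyTexture:
    """Toy deferred creation (invented API): the create call only records
    parameters; the expensive allocation happens on first use, which is
    why a dummy draw is needed to 'warm up' a resource."""

    def __init__(self, width, height):
        self.width, self.height = width, height
        self.backing = None  # no storage allocated yet

    def ensure_created(self):
        if self.backing is None:
            self.backing = bytearray(self.width * self.height * 4)  # RGBA8
        return self.backing

def draw_with(texture):
    texture.ensure_created()  # first use forces the real creation work

tex = LazyTexture(4, 4)
created_at_create_time = tex.backing is not None  # False: deferred
draw_with(tex)  # the "warm-up" dummy draw
```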
+Anyway, moving on. The UMD also gets to deal with fun stuff like all the D3D9
+``legacy'' shader versions and the fixed function pipeline - yes, all of that
+will get faithfully passed through by D3D. The 3.0 shader profile ain't that
+bad (it's quite reasonable in fact), but 2.0 is crufty and the various 1.x
+shader versions are seriously weird---remember 1.3 pixel shaders? Or, for that
+matter, the fixed-function vertex pipeline with vertex lighting and such? Yeah,
+support for all that's still there in D3D and the guts of every modern graphics
+driver, though of course they just translate it to newer shader versions by now
+(and have been doing so for quite some time).
+Then there's things like memory management. The UMD will get things like
+texture creation commands and need to provide space for them. Actually, the UMD
+just suballocates some larger memory blocks it gets from the KMD (Kernel-Mode
+Driver); actually mapping and unmapping pages (and managing which part of video
+memory the UMD can see, and conversely which parts of system memory the GPU may
+access) is a kernel-mode privilege and can't be done by the UMD.
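The suballocation idea can be sketched with a toy bump allocator (invented interface; real UMD allocators also free, compact, and track multiple blocks, but the point is the same: no kernel round-trip per allocation):

```python
class SubAllocator:
    """Toy bump suballocator (invented interface) over one big block the
    UMD got from the KMD; the UMD hands out aligned slices of it without
    ever calling into the kernel."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.offset = 0  # everything below this has been handed out

    def alloc(self, size, align=256):
        start = (self.offset + align - 1) & ~(align - 1)  # round up
        if start + size > self.block_size:
            return None  # block full: would ask the KMD for another one
        self.offset = start + size
        return start

heap = SubAllocator(1 << 20)  # one 1 MiB block from the KMD
a = heap.alloc(100)
b = heap.alloc(100)
```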
+But the UMD can swizzle textures, for example - that is, go from linear pixel
+layout to something that's more likely to get good cache hit rates during 3D
+rendering; we'll see this later. Some GPUs can also do the swizzling
+themselves, during a 2D blit or copy. The UMD can also schedule transfers
+between system memory and (mapped) video memory and the like. Most importantly,
+it can also write command buffers (or ``DMA buffers'' - I'll be using these two
+names interchangeably) once the KMD has allocated them and handed them over. A
+command buffer contains, well, commands :). All your state-changing and drawing
+operations will be converted by the UMD into commands that the hardware
+understands. As will a lot of things you don't trigger manually - such as
+uploading textures and shaders to video memory.
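One classic swizzled layout, by the way, is the Morton (Z-order) curve, which interleaves the bits of the x and y texel coordinates so that spatially close texels land close together in memory. Actual hardware layouts are vendor-specific and more involved, but a minimal sketch of the index computation looks like this:

```python
def part1by1(x):
    """Spread the low 16 bits of x out, one zero bit between each."""
    x &= 0x0000FFFF
    x = (x ^ (x << 8)) & 0x00FF00FF
    x = (x ^ (x << 4)) & 0x0F0F0F0F
    x = (x ^ (x << 2)) & 0x33333333
    x = (x ^ (x << 1)) & 0x55555555
    return x

def morton2d(x, y):
    """Linear (x, y) texel coordinates -> Z-order (swizzled) index."""
    return part1by1(x) | (part1by1(y) << 1)
```

Walking a 2x2 quad of texels now touches four consecutive indices, which is exactly the kind of locality the texture cache wants during 3D rendering.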
+In general, drivers will try to put as much of the actual processing into the
+UMD as possible; the UMD is user-mode code, so anything that runs in it doesn't
+need any costly kernel-mode transitions, it can freely allocate memory, farm
+work out to multiple threads, and so on - it's just a regular DLL (even though
+it's loaded by the API, not directly by your app). This has advantages for
+driver development too - if the UMD crashes, the app crashes with it, but not
+the whole system; it can just be replaced while the system is running (it's
+just a DLL!); it can be debugged with a regular debugger; and so on. So it's
+not only efficient, it's also convenient.
+But there's a big elephant in the room that I haven't mentioned yet.
+\subsection*{Did I say ``user-mode driver''? I meant ``user-mode drivers.''}
+As said, the UMD is just a DLL. Okay, one that happens to have the blessing of
+D3D and a direct pipe to the KMD, but it's still a regular DLL, and it runs in
+the address space of its calling process.
+But we're using multi-tasking OSes nowadays. In fact, we have been for some
+time now. This ``GPU'' thing I keep talking about? That's a shared resource.
+There's only
+one that drives your main display (even if you use SLI/Crossfire). Yet we have
+multiple apps that try to access it (and pretend they're the only ones doing
+it). This doesn't just work automatically; back in The Olden Days, the solution
+was to only give 3D to one app at a time, and while that app was active, all
+others wouldn't have access. But that doesn't really cut it if you're trying to
+have your windowing system use the GPU for rendering. Which is why you need
+some component that arbitrates access to the GPU and allocates time-slices.
+\section{The scheduler}
+This is a system component, part of the OS - note the ``the'' is somewhat
+misleading; I'm talking about the graphics scheduler here, not the CPU or IO
+schedulers. This does exactly what you think it does - it arbitrates access to
+the 3D pipeline by time-slicing it between different apps that want to use it.
+A context switch incurs, at the very least, some state switching on the GPU
+(which generates extra commands for the command buffer) and possibly also
+swapping some resources in and out of video memory. And of course only one
+process gets to actually submit commands to the 3D pipe at any given time.
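As a wildly simplified sketch of the arbitration idea only (the real scheduler deals with priorities, preemption, and the state/memory switching costs just mentioned), round-robin time-slicing looks like this:

```python
from collections import deque

def time_slice(contexts, slices):
    """Toy round-robin GPU arbitration (wildly simplified): each context
    gets the 3D pipe for one slice in turn; in reality every switch also
    costs GPU state save/restore and possibly paging resources in."""
    queue, order = deque(contexts), []
    for _ in range(slices):
        ctx = queue.popleft()
        order.append(ctx)  # ctx is the only one submitting right now
        queue.append(ctx)  # back of the line afterwards
    return order

order = time_slice(["game", "browser", "compositor"], 5)
```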
+You'll often find console programmers complaining about the fairly high-level,
+hands-off nature of PC 3D APIs, and the performance cost this incurs. But the
+thing is that 3D APIs/drivers on PC really have a more complex problem to solve
+than console games - they really do need to keep track of the full current
+state for example, since someone may pull the metaphorical rug from under them
+at any moment! They also work around broken apps and try to fix performance
+problems behind their backs; this is a rather annoying practice that no-one's
+happy with, certainly including the driver authors themselves, but the fact is
+that the business perspective wins here; people expect stuff that runs to
+continue running (and doing so smoothly). You just won't win any friends by
+yelling ``BUT IT'S WRONG!'' at the app and then sulking and going through an
+ultra-slow path.
+Anyway, on with the pipeline. Next stop: Kernel mode!
+\section{The Kernel-Mode Driver (KMD)}
+This is the part that actually deals with the hardware. There may be multiple
+UMD instances running at any one time, but there's only ever one KMD, and if
+that crashes, then boom you're dead---used to be ``blue screen'' dead, but by now
+Windows actually knows how to kill a crashed driver and reload it (progress!).
+As long as it happens to be just a crash and not some kernel memory corruption
+at least - if that happens, all bets are off.
+The KMD deals with all the things that are just there once. There's only one
+GPU memory, even though there's multiple apps fighting over it. Someone needs
+to call the shots and actually allocate (and map) physical memory. Similarly,
+someone must initialize the GPU at startup, set display modes (and get mode
+information from displays), manage the hardware mouse cursor, program the HW
+watchdog timer so the GPU gets reset if it stays unresponsive for a certain
+time, respond to interrupts, and so on. This is what the KMD does.
+There's also this whole content protection/DRM bit about setting up a
+protected/DRM'ed path between a video player and the GPU so that the actual
+precious decoded video pixels aren't visible to any dirty user-mode code that
+might do awful forbidden things like dump them to disk (...whatever). The KMD
+has some involvement in that too.
+Most importantly for us, the KMD manages the \emph{actual} command buffer. You
+know, the one that the hardware actually consumes. The command buffers that the
+UMD produces aren't the real deal --- as a matter of fact, they're just random
+slices of GPU-addressable memory. What actually happens with them is that the
+UMD finishes them, submits them to the scheduler, which then waits until that
+process is up and then passes the UMD command buffer on to the KMD. The KMD
+then writes a call to that command buffer into the main command buffer, and
+depending on whether the GPU command processor can read from main memory or
+not, it may also need to DMA it to video memory first. The main command buffer
+is usually a (quite small) ring buffer --- the only thing that ever gets
+written there is system/initialization commands and calls to the ``real'',
+meaty 3D command buffers.
+But this is still just a buffer in memory right now. Its position is known to
+the graphics card --- there's usually a read pointer, which is where the GPU is
+in the main command buffer, and a write pointer, which is how far the KMD has
+written the buffer yet (or more precisely, how far it has \emph{told} the GPU
+it has written yet). These are hardware registers, and they are memory-mapped
+--- the KMD updates them periodically, usually whenever it submits a new chunk
+of work. How memory transfers to and from the GPU work will be explained in a
+later chapter.
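The ring-buffer-with-two-pointers mechanism can be modeled in a few lines (names invented; in reality both pointers are memory-mapped hardware registers, not Python attributes):

```python
class CommandRing:
    """Toy model (invented names) of the main command ring buffer: the
    KMD advances the write pointer, the GPU advances the read pointer;
    both would really be memory-mapped hardware registers."""

    def __init__(self, size):
        self.slots = [None] * size
        self.read = 0   # GPU's progress through the ring
        self.write = 0  # how far the KMD has told the GPU it has written

    def space(self):
        return len(self.slots) - (self.write - self.read)

    def kmd_submit(self, cmd):
        if self.space() == 0:
            raise RuntimeError("ring full: wait for the GPU to catch up")
        self.slots[self.write % len(self.slots)] = cmd
        self.write += 1  # bump write pointer: new work is visible to GPU

    def gpu_consume(self):
        assert self.read < self.write, "nothing to do: GPU idles"
        cmd = self.slots[self.read % len(self.slots)]
        self.read += 1
        return cmd

ring = CommandRing(4)
ring.kmd_submit("call dma_buffer_0")  # call into a meaty 3D command buffer
ring.kmd_submit("call dma_buffer_1")
first = ring.gpu_consume()
```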
+\section{Aside: OpenGL and other platforms}
+OpenGL is fairly similar to what I just described, except there's not as sharp
+a distinction between the API and UMD layer. And unlike D3D, the (GLSL) shader
+compilation is not handled by the API at all, it's all done by the driver. An
+unfortunate side effect is that there are as many GLSL frontends as there are
+3D hardware vendors, all of them basically implementing the same spec, but with
+their own bugs and idiosyncrasies. Not fun. And it also means that the drivers
+have to do all the optimizations themselves whenever they get to see the
+shaders - including expensive optimizations. The D3D bytecode format is really
+a cleaner solution for this problem - there's only one compiler (so no slightly
+incompatible dialects between different vendors!) and it allows for some
+costlier data-flow analysis than you would normally do.
+Open Source implementations of GL tend to use either Mesa or Gallium3D, both of
+which have a single shared GLSL frontend that generates a device-independent IR
+and supports multiple pluggable backends for actual hardware. In other words,
+that space is fairly similar to the D3D model.
+\section{Further Reading}
+This is just a very coarse overview of the WDDM graphics stack, meant to give
+you a general idea of what fits where. For details, refer to the official WDDM
+documentation~\citep{wddm}.