Having previously tinkered only very briefly in assembly, I was keen to try my hand at more.
I do best with a practical, defined problem to solve. Having used more or less the same unrolled-loop implementation of a 4x4 matrix multiplication since I wrote it in university, it seemed a good candidate for a 21st-century update using Advanced Vector Extensions (AVX), which first shipped with Sandy Bridge processors in 2011. Non-trivial, but tractable.
Performance was never a motivation for this side project (the problem is too small), but there wouldn't be much point if the output were slower. And it isn't: on my Ivy Bridge MacBook Pro it executes in half as many cycles as my previous unrolled-loop implementation, and in slightly more than two-thirds as many cycles on a Haswell Ultrabook.
But it is not faster than DirectXMath's XMMatrixMultiply.
P.S. The built executable has a dependency on the Visual C++ 2012 Update 4 runtime and does not check that the host CPU supports AVX instructions.
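For anyone who wants to add the missing check, here is a minimal sketch of how a host-side AVX test might look. It is not part of the project: `host_supports_avx` is a hypothetical helper name, and the sketch assumes an x86 target built with either Visual C++ (using the `__cpuid` and `_xgetbv` intrinsics) or GCC/Clang (using their `__builtin_cpu_supports` builtin). The point it illustrates is that the CPUID feature bit alone is not enough; the operating system must also have enabled saving of the AVX register state.

```cpp
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv
#endif

// Hypothetical helper (not from the original project): returns true only if
// AVX instructions can be executed on this machine.
bool host_supports_avx()
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = {0, 0, 0, 0};   // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);             // CPUID leaf 1: processor feature flags

    const bool osxsave = (regs[2] & (1 << 27)) != 0;  // OS uses XSAVE/XRSTOR
    const bool avx     = (regs[2] & (1 << 28)) != 0;  // CPU implements AVX
    if (!osxsave || !avx)
        return false;

    // XCR0 bits 1 (SSE state) and 2 (AVX state) must both be set by the OS,
    // otherwise the upper halves of the YMM registers are not preserved.
    const unsigned long long xcr0 = _xgetbv(0);
    return (xcr0 & 0x6) == 0x6;
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
    // GCC and Clang provide a builtin for the same query.
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;  // not an x86 target, so AVX is not available
#endif
}
```

Calling something like this once at startup, and exiting with a message or falling back to the unrolled-loop version when it returns false, would avoid an illegal-instruction crash on older processors.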