
BaseMemoryLibSse2: Take advantage of write combining buffers #1009

Merged: 1 commit into tianocore:master on Oct 16, 2020

Conversation

@lgao4 (Contributor) commented Oct 14, 2020

The current SSE2 implementation of the ZeroMem(), SetMem(),
SetMem16(), SetMem32() and SetMem64() functions writes only 16 bytes
per iteration. This hurts performance so badly that it is even slower
(by 4%) than a simple 'rep stos' in regular DRAM.

To take full advantage of the 'movntdq' instruction, it is better to
"queue" a total of 64 bytes in the write combining buffers. This
patch implements such a change. The table below shows the time
(measured with 'rdtsc') to write an entire 100 MB RAM buffer; these
functions now operate almost two times faster.

| Function | Arch | Untouched | 64 bytes | Result |
|----------+------+-----------+----------+--------|
| ZeroMem  | Ia32 |  17765947 |  9136062 | 1.945x |
| ZeroMem  | X64  |  17525170 |  9233391 | 1.898x |
| SetMem   | Ia32 |  17522291 |  9137272 | 1.918x |
| SetMem   | X64  |  17949261 |  9176978 | 1.956x |
| SetMem16 | Ia32 |  18219673 |  9372062 | 1.944x |
| SetMem16 | X64  |  17523331 |  9275184 | 1.889x |
| SetMem32 | Ia32 |  18495036 |  9273053 | 1.994x |
| SetMem32 | X64  |  17368864 |  9285885 | 1.870x |
| SetMem64 | Ia32 |  18564473 |  9241362 | 2.009x |
| SetMem64 | X64  |  17506951 |  9280148 | 1.886x |

Signed-off-by: Jeremy Compostella <jeremy.compostella@intel.com>
Reviewed-by: Liming Gao <gaoliming@byosoft.com.cn>
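
For illustration, here is a minimal C sketch of the technique (the actual patch modifies the NASM sources of BaseMemoryLibSse2; the helper name `zero_mem_nt` and the 64-byte-multiple length assumption below are mine, not part of the patch). Issuing four back-to-back non-temporal stores queues a full 64-byte write combining buffer, which is then flushed to memory in a single burst:

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_stream_si128 generates movntdq */
#include <stddef.h>

/* Hypothetical helper: zero `length` bytes at a 16-byte-aligned `buffer`
   using non-temporal stores, 64 bytes per iteration. Assumes `length` is
   a multiple of 64; the real ZeroMem also handles unaligned heads, tails
   and short lengths, which this sketch omits. */
static void zero_mem_nt(void *buffer, size_t length)
{
  __m128i  zero   = _mm_setzero_si128();
  __m128i *p      = (__m128i *)buffer;
  size_t   blocks = length / 64;

  while (blocks-- != 0) {
    _mm_stream_si128(p + 0, zero);  /* bytes  0..15 */
    _mm_stream_si128(p + 1, zero);  /* bytes 16..31 */
    _mm_stream_si128(p + 2, zero);  /* bytes 32..47 */
    _mm_stream_si128(p + 3, zero);  /* bytes 48..63 */
    p += 4;
  }
  _mm_sfence();  /* order the non-temporal stores before returning */
}
```

The unroll factor matches the 64-byte write combining buffer size; per the measurements above, queuing a full buffer per iteration rather than a single 16-byte store is what yields the roughly 2x speedup.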

mergify bot commented Oct 14, 2020

The PR cannot be merged due to a PatchCheck failure. Please resolve and resubmit.

@lgao4 added the `push` label (Auto push patch series in PR if all checks pass) on Oct 16, 2020
mergify bot merged commit d25fd87 into tianocore:master on Oct 16, 2020