Skip to content

How To Do Multithreading With a Read Only Multi Level Cache

Hüseyin Tuğrul BÜYÜKIŞIK edited this page Oct 11, 2021 · 6 revisions
  • LLC(Last Level Cache) should be created only once and wrapped by an std::shared_ptr. It is the synchronization point for all reads (for now. Write-sync will be added later. But single-threaded read+write works.).
  • Every thread needs to create its own CacheThreader object using the common LLC object's shared_ptr smart pointer.
  • Now every thread can read the backing-store (file read, web download, whatever it is) freely
  • Currently DirectMappedMultiThreadCache.h and LruClockCache.h support LLC usage. LruClockCache has better hit-ratio but DirectMappedMultiThreadCache has much less lock contention between threadsa. Benchmarks indicate up to 2.5 billion lookups per second performance when setup is made with DirectMappedMultiThreadCache as LLC and CacheThreader as the per-thread private cache.

Benchmark

     #include "CpuBenchmarker.h"
     #include "LruClockCache.h"
     #include "DirectMappedCache.h"
     #include "CacheThreader.h"
     #include <atomic>
     #include <memory>
     #include <iostream>
     #include <whatever you can use as PARALLEL_FOR loop that can use 8 threads simultaneously>

    int main()
    {
		std::atomic<size_t> total =0;
		for(float benchmark=2;benchmark>=0.5f;benchmark*=0.97f)
		{
		  const int LLCsize=1024*128;

		  // randomly fill array
		  std::vector<int> backingStore(5000000);
			std::random_device rd;
			std::mt19937 rng(rd());
			std::uniform_real_distribution<float> rnd(0,5000000);
			for(int i=0;i<5000000;i++)
				backingStore[i]=rnd(rng);

		  auto LLC = std::make_shared<LruClockCache<int,int>>(LLCsize,
			[ & ](int key) {

			  return backingStore[key];
			},
			[ & ](int key, int value) {

			  backingStore[key] = value;
			});

		  const int L2size=LLCsize/4;
		  const int L1size=L2size/4; // L1 size has to be integer power of 2 !!!


			const size_t N=100*benchmark;
			const size_t repeat = 50/(benchmark*benchmark*benchmark*benchmark*benchmark);
			const int numThreads = 8;
			const size_t totalWork = repeat*N*N*numThreads;
			CpuBenchmarker bench(totalWork*sizeof(int),"image pixels = "+std::to_string(N)+"x"+std::to_string(N)+"="+std::to_string(N*N)+
					"   8 threads (LLC="+std::to_string(LLCsize)+
					", L2="+std::to_string(L2size)+
					" x"+std::to_string(numThreads)+", L1="+std::to_string(L1size)+" x"+std::to_string(numThreads)+")",totalWork);
			{

				PARALLEL_FOR(0,numThreads,k,
				{
						CacheThreader<LruClockCache,int,int> cache(LLC,L1size,L2size);
						size_t subTotal = 0;
						for(int m=0;m<repeat;m++)
							for(int j=0;j<N;j++)
								for(int i=0;i<N;i++)
								{
									subTotal +=cache.get(i+j*N);
								}

						total += subTotal;
				});

			}
		}
		std::cout<<"sum of random numbers:"<<total<<std::endl;
		return 0;
    }

Results

image pixels = 200x200=40000   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 295056015 nanoseconds     (bandwidth = 4.34 MB/s)      (throughput = 922.05 nanoseconds per iteration) 
image pixels = 194x194=37636   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 281589781 nanoseconds     (bandwidth = 4.28 MB/s)      (throughput = 935.24 nanoseconds per iteration) 
image pixels = 188x188=35344   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 489024940 nanoseconds     (bandwidth = 4.63 MB/s)      (throughput = 864.76 nanoseconds per iteration) 
image pixels = 182x182=33124   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 459537696 nanoseconds     (bandwidth = 4.61 MB/s)      (throughput = 867.08 nanoseconds per iteration) 
image pixels = 177x177=31329   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 233811738 nanoseconds     (bandwidth = 8.58 MB/s)      (throughput = 466.44 nanoseconds per iteration) 
image pixels = 171x171=29241   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 222362845 nanoseconds     (bandwidth = 12.62 MB/s)      (throughput = 316.85 nanoseconds per iteration) 
image pixels = 166x166=27556   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 212062835 nanoseconds     (bandwidth = 12.47 MB/s)      (throughput = 320.65 nanoseconds per iteration) 
image pixels = 161x161=25921   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 195904680 nanoseconds     (bandwidth = 16.94 MB/s)      (throughput = 236.18 nanoseconds per iteration) 
image pixels = 156x156=24336   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 183465751 nanoseconds     (bandwidth = 21.22 MB/s)      (throughput = 188.47 nanoseconds per iteration) 
image pixels = 152x152=23104   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 180227348 nanoseconds     (bandwidth = 24.61 MB/s)      (throughput = 162.51 nanoseconds per iteration) 
image pixels = 147x147=21609   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 165683176 nanoseconds     (bandwidth = 29.21 MB/s)      (throughput = 136.92 nanoseconds per iteration) 
image pixels = 143x143=20449   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 154924644 nanoseconds     (bandwidth = 33.79 MB/s)      (throughput = 118.38 nanoseconds per iteration) 
image pixels = 138x138=19044   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 146176998 nanoseconds     (bandwidth = 37.52 MB/s)      (throughput = 106.61 nanoseconds per iteration) 
image pixels = 134x134=17956   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 141101035 nanoseconds     (bandwidth = 44.79 MB/s)      (throughput = 89.30 nanoseconds per iteration) 
image pixels = 130x130=16900   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 132227012 nanoseconds     (bandwidth = 53.17 MB/s)      (throughput = 75.23 nanoseconds per iteration) 
image pixels = 126x126=15876   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 123245615 nanoseconds     (bandwidth = 61.83 MB/s)      (throughput = 64.69 nanoseconds per iteration) 
image pixels = 122x122=14884   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 116445411 nanoseconds     (bandwidth = 69.53 MB/s)      (throughput = 57.53 nanoseconds per iteration) 
image pixels = 119x119=14161   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 113286742 nanoseconds     (bandwidth = 80.00 MB/s)      (throughput = 50.00 nanoseconds per iteration) 
image pixels = 115x115=13225   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 108334235 nanoseconds     (bandwidth = 93.75 MB/s)      (throughput = 42.66 nanoseconds per iteration) 
image pixels = 112x112=12544   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 102359169 nanoseconds     (bandwidth = 109.80 MB/s)      (throughput = 36.43 nanoseconds per iteration) 
image pixels = 108x108=11664   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 96468722 nanoseconds     (bandwidth = 123.81 MB/s)      (throughput = 32.31 nanoseconds per iteration) 
image pixels = 105x105=11025   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 91924295 nanoseconds     (bandwidth = 145.84 MB/s)      (throughput = 27.43 nanoseconds per iteration) 
image pixels = 102x102=10404   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 87428045 nanoseconds     (bandwidth = 167.55 MB/s)      (throughput = 23.87 nanoseconds per iteration) 
image pixels = 99x99=9801   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 79328282 nanoseconds     (bandwidth = 201.63 MB/s)      (throughput = 19.84 nanoseconds per iteration) 
image pixels = 96x96=9216   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 74810216 nanoseconds     (bandwidth = 236.53 MB/s)      (throughput = 16.91 nanoseconds per iteration) 
image pixels = 93x93=8649   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 64802386 nanoseconds     (bandwidth = 298.97 MB/s)      (throughput = 13.38 nanoseconds per iteration) 
image pixels = 90x90=8100   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 63040780 nanoseconds     (bandwidth = 333.04 MB/s)      (throughput = 12.01 nanoseconds per iteration) 
image pixels = 87x87=7569   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 58062629 nanoseconds     (bandwidth = 396.29 MB/s)      (throughput = 10.09 nanoseconds per iteration) 
image pixels = 85x85=7225   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 57438513 nanoseconds     (bandwidth = 446.79 MB/s)      (throughput = 8.95 nanoseconds per iteration) 
image pixels = 82x82=6724   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 53513334 nanoseconds     (bandwidth = 518.69 MB/s)      (throughput = 7.71 nanoseconds per iteration) 
image pixels = 80x80=6400   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 49910281 nanoseconds     (bandwidth = 615.50 MB/s)      (throughput = 6.50 nanoseconds per iteration) 
image pixels = 77x77=5929   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 47385689 nanoseconds     (bandwidth = 700.68 MB/s)      (throughput = 5.71 nanoseconds per iteration) 
image pixels = 75x75=5625   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 49338256 nanoseconds     (bandwidth = 744.25 MB/s)      (throughput = 5.37 nanoseconds per iteration) 
image pixels = 73x73=5329   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 45812908 nanoseconds     (bandwidth = 882.18 MB/s)      (throughput = 4.53 nanoseconds per iteration) 
image pixels = 71x71=5041   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 42368550 nanoseconds     (bandwidth = 1054.64 MB/s)      (throughput = 3.79 nanoseconds per iteration) 
image pixels = 68x68=4624   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 40359238 nanoseconds     (bandwidth = 1180.54 MB/s)      (throughput = 3.39 nanoseconds per iteration) 
image pixels = 66x66=4356   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 38815133 nanoseconds     (bandwidth = 1346.69 MB/s)      (throughput = 2.97 nanoseconds per iteration) 
image pixels = 64x64=4096   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 39576391 nanoseconds     (bandwidth = 1447.29 MB/s)      (throughput = 2.76 nanoseconds per iteration) 
image pixels = 62x62=3844   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 36549208 nanoseconds     (bandwidth = 1713.06 MB/s)      (throughput = 2.33 nanoseconds per iteration) 
image pixels = 60x60=3600   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 34613809 nanoseconds     (bandwidth = 1973.59 MB/s)      (throughput = 2.03 nanoseconds per iteration) 
image pixels = 59x59=3481   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 34877694 nanoseconds     (bandwidth = 2203.71 MB/s)      (throughput = 1.82 nanoseconds per iteration) 
image pixels = 57x57=3249   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 36332778 nanoseconds     (bandwidth = 2300.68 MB/s)      (throughput = 1.74 nanoseconds per iteration) 
image pixels = 55x55=3025   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 40545700 nanoseconds     (bandwidth = 2237.02 MB/s)      (throughput = 1.79 nanoseconds per iteration) 
image pixels = 53x53=2809   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 35214927 nanoseconds     (bandwidth = 2784.84 MB/s)      (throughput = 1.44 nanoseconds per iteration) 
image pixels = 52x52=2704   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 35149004 nanoseconds     (bandwidth = 3126.42 MB/s)      (throughput = 1.28 nanoseconds per iteration) 
image pixels = 50x50=2500   8 threads (LLC=131072, L2=32768 x8, L1=8192 x8): 33146839 nanoseconds     (bandwidth = 3569.57 MB/s)      (throughput = 1.12 nanoseconds per iteration) 
sum of random numbers:838841069478984

For FX8150(no turbo) + 1 channel memory at 1600MHz + Ubuntu 18.04 LTS, multithreading achieved 2x bandwidth and 50% average latency of single-threaded version.

For now, only read-only use for multiple-threads is allowed. So the backing store needs to be filled with relevant data first. It could be a training data for some AI-training or anything that is a constant dataset. For single-thread, both writes and reads work. Just don't forget to call cache.flush() and LLC.flush() at the end to write latest bits to backing-store. LLC.flush() must be called after the threads' flush.