VirtualMultiArray uses all graphics cards that support OpenCL to accommodate storage for a virtual array of objects in C++. Sometimes we don't have time to upgrade hardware, or all the slots are already filled, and there is that low-end graphics card doing nothing while we code. So why not leverage the economical advantage of some "virtual"-ness while keeping a few write cycles off the SSD's lifespan? Or just for fun?

Anyway, on a system with just 4GB of RAM (half of it already taken by developer tools, IDEs, etc.), responsiveness is not lost when working on 4GB of "seamless" data:

// header names assumed to follow the repository's layout
#include "GraphicsCardSupplyDepot.h"
#include "VirtualMultiArray.h"

GraphicsCardSupplyDepot depot;

const size_t n = 500000000; // number of elements; an integer multiple of pageSize

// synonym for cache-line size (of VirtualMultiArray, not of the CPU's L1)
const size_t pageSize=100000; // don't make this too big if the object type (double here) is big too

// synonym for number of cache lines per channel
const int maxActivePagesPerGpu = 1; // caching per virtual gpu

// 50 OpenCL command queues per graphics card, interleaved over different page accesses
// supports up to 150 (50+50+50) CPU threads (with diminishing returns, since cards have only 1-2 async copy engines)
// only 120MB of RAM is used for paging: pageSize * sizeof(double) * (total command queues = 50+50+50)
// 4GB of data stored across 3 graphics cards
VirtualMultiArray<double> arr(n,depot.requestGpus(),pageSize,maxActivePagesPerGpu,{50,50,50});

#pragma omp parallel for
for(size_t i=0;i<n;i++) 
{ 
    arr.set(i,(double)i); // arr[i] = (double)i;
}

It just works (on both Windows and Linux), much faster than an HDD, but capacity is limited to the total VRAM of the system. The {50,50,50} integer vector can be tuned to set the data-distribution ratio between graphics cards, e.g. {0,10,100} or {1,2,1}, which also proportionally shifts the bandwidth usage on each card. A value of 1 means only 1 OpenCL command queue for that card (similar to a CUDA stream). {M > 1, N > 1, ...} means every card has multiple command queues that overlap data transfers in both directions. Increasing these numbers improves load balancing between element accesses, because it raises the probability that two concurrently accessed elements go through different channels. Also, every channel contains an LRU cache with "maxActivePagesPerGpu" cache lines, and each cache line holds "pageSize" elements (not bytes).
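For example, to put ten times more data (and bandwidth load) on the third card than on the second, and nothing on the first, only the last constructor argument changes. A minimal sketch reusing the names from the snippet above (whether requestGpus() may be called a second time is an assumption here):

// same page geometry as before; only the per-card queue/ratio vector changes:
// card 1 stores nothing, card 3 stores 10x as much data as card 2 and gets 10x the command queues
VirtualMultiArray<double> arrSkewed(n, depot.requestGpus(), pageSize, maxActivePagesPerGpu, {0,10,100});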

some VRAM + some CPU cycles = increased SSD lifespan for repeated temporary work

Since this is a virtual array that does its I/O with graphics cards under the hood, you can use more threads than your CPU supports as logical cores. If your CPU has 8 cores / 8 threads, then

#pragma omp parallel for num_threads(32)
for(size_t i=0;i<n;i++) 
{ 
    arr[i]=(double)i;
}

can be twice as fast as this:

#pragma omp parallel for num_threads(8)
for(size_t i=0;i<n;i++) 
{ 
    arr[i]=(double)i;
}

because once a thread is waiting for data, it can yield its time slice and another thread wakes up and starts doing useful work until the first thread's DMA operation (which does not use the CPU) completes. DMA transfers still consume available RAM bandwidth, but at least the CPU can spend the extra threads on other tasks such as computing math, saving data to the HDD, or accessing another virtual array (or another index of the same virtual array).
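As a concrete sketch of that overlap: with oversubscribed threads, one thread can wait on a page transfer while others keep the cores busy with math and with a second virtual array. This assumes a second array dst constructed like arr above, and a get() accessor mirroring set() (the snippets above only show set() and operator[] writes):

#pragma omp parallel for num_threads(32)
for(size_t i=0;i<n;i++) 
{ 
    double x = arr.get(i);    // DMA-backed read; while this waits, other threads keep the cores busy
    dst.set(i, x*x + 1.0);    // some CPU math, then a DMA-backed write to the other virtual array
}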