<a href="https://colab.research.google.com/github/weedge/doraemon-nb/blob/main/my_colab_gpu_topk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# task
Given a data file of about 8.5 million records, each record (called a doc) is a vector of integer IDs with at most 128 dimensions, and ID values are in the range 0-50000.
Given an integer ID vector of at most 128 dimensions (called a query), score it against the docs; the data set may be spread out (for example, split) for optimization.

```shell
# Generate test data (already sorted in ascending order). By default: one docs file with 10 documents, one per line, and 10 query files
make gen
```
For each query, find the top-k (k=100) docs ranked by intersection score. The score is defined as follows:
each pair with query[i] == doc[j] (0<=i<query_size, 0<=j<doc_size) counts as one intersection, and the score is the number of intersections between query and doc divided by max(query_size, doc_size).
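
The score definition above can be sketched in a few lines of Python. This is a literal reading of the definition, not the code in `topk.cu`; the function name `intersection_score` is an assumption made for illustration.

```python
from collections import Counter


def intersection_score(query, doc):
    """Count pairs (i, j) with query[i] == doc[j], divided by max(len(query), len(doc))."""
    if not query or not doc:
        return 0.0
    q_counts = Counter(query)
    d_counts = Counter(doc)
    # Each shared ID contributes (occurrences in query) * (occurrences in doc) pairs.
    hits = sum(n * d_counts[v] for v, n in q_counts.items())
    return hits / max(len(query), len(doc))


print(intersection_score([1, 2, 3], [2, 3, 4, 5]))  # 2 shared IDs / max(3, 4)
```

When IDs are unique within a vector, this reduces to the size of the set intersection divided by the longer length.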

``` shell
./bin/query_doc_scoring <doc_file_name> <query_file_name> <output_filename>
```

# optimize
Note: only the stand-alone (single-node) program is optimized here; a distributed map/reduce (fan-out/fan-in) architecture can schedule multiple such instances.

0. GPU device round-robin (RR) balancing per user request
1. concurrency (CPU thread pool) + parallelism (CPU OpenMP + GPU warp threads): CPU (baseline) -> CPU thread concurrency -> CPU + GPU -> CPU thread concurrency/parallelism + GPU stream concurrency/warp-thread parallelism => distributed
2. find or filter: use a hash map or bitmap (Bloom filter) in CPU memory, GPU global memory, or GPU shared memory
3. top-k sort: heap sort (partial_sort) on CPU -> bitonic/radix sort for parallel top-k on GPU, then reduce the top-k results on the CPU
4. search: build an index (list (IVF, skip list), tree, graph), i.e. an ordered structure/model
5. SIMD: CPU instruction-set extensions (Intel SSE, AVX2, AVX-512, etc.)
6. sequential IO stream pipeline: read the query/docs files (one batch per thread, accelerator-parallel multibyte_split), write the result file
7. resource pools
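
Items 2 and 3 above can be illustrated together with a small CPU-only sketch. It assumes IDs are unique within the query, uses one byte per ID as a stand-in for a real bitmap, and breaks ties by lower doc index; the names `build_bitmap` and `top_k` are hypothetical and not taken from the project sources.

```python
import heapq

MAX_ID = 50000  # ID range given in the task statement


def build_bitmap(query):
    """Membership table: bitmap[v] == 1 if v occurs in the query (one byte per ID)."""
    bitmap = bytearray(MAX_ID + 1)
    for v in query:
        bitmap[v] = 1
    return bitmap


def top_k(query, docs, k=100):
    """Return up to k (score, doc_index) pairs, best first."""
    bitmap = build_bitmap(query)
    scored = []
    for doc_index, doc in enumerate(docs):
        hits = sum(bitmap[v] for v in doc)  # O(len(doc)) lookup instead of a nested loop
        score = hits / max(len(query), len(doc)) if doc else 0.0
        scored.append((score, doc_index))
    # Partial selection: only the best k are ordered, as with std::partial_sort.
    return heapq.nsmallest(k, scored, key=lambda t: (-t[0], t[1]))


print(top_k([2, 3], [[1, 2, 3], [2, 3], [9]], k=2))
```

The lookup table makes the per-doc cost linear in the doc length, which is the property a shared-memory bitmap would also aim for on the GPU.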

# result
Latest addition: top-k over file chunks read on the GPU. Measured on a Google Colab A100 instance.
## cpu_readfile -> vec docs -> cpu_topk (cpu_baseline)

1. read file cost 33054 ms (line-by-line read)
2. topk cost 87230 ms
3. all cost 120284 ms

---

## cpu_readfile -> vec docs split -> cpu_concurrency_topk
The thread pool uses as many threads as the Colab A100 instance has CPU cores (12).

1. read file cost 33054 ms (line-by-line read)
2. topk cost 14206 ms, reduction: (87230-14206)/87230=**83.71%** compared with `cpu_baseline`
3. all cost 47654 ms, reduction: (120284-47654)/120284=**60.38%** compared with `cpu_baseline`
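
The split-then-merge structure behind this variant can be sketched as follows. This is an illustration only: the function names are hypothetical, IDs are assumed unique within the query, and because of Python's GIL the sketch shows the structure rather than the speedup a C++ thread pool gives.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def _rank_key(item):
    score, doc_index = item
    return (-score, doc_index)  # higher score first, then lower doc index


def partial_top_k(query_set, query_len, docs, start, end, k):
    """Top-k over docs[start:end], keeping global doc indices."""
    scored = []
    for doc_index in range(start, end):
        doc = docs[doc_index]
        hits = sum(1 for v in doc if v in query_set)
        score = hits / max(query_len, len(doc)) if doc else 0.0
        scored.append((score, doc_index))
    return heapq.nsmallest(k, scored, key=_rank_key)


def concurrent_top_k(query, docs, k=100, workers=12):
    query_set = set(query)
    step = max(1, -(-len(docs) // workers))  # ceil(len(docs) / workers)
    ranges = [(s, min(s + step, len(docs))) for s in range(0, len(docs), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda r: partial_top_k(query_set, len(query), docs, r[0], r[1], k), ranges
        )
        merged = [item for part in parts for item in part]
    # Reduce step: the global top-k is contained in the union of the partial top-k lists.
    return heapq.nsmallest(k, merged, key=_rank_key)


docs = [[i, i + 1] for i in range(50)]
print(concurrent_top_k([10, 11], docs, k=3, workers=4))
```

Each worker only returns k candidates, so the final merge handles at most `workers * k` items regardless of how many docs there are.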

---

## cpu_readfile -> vec docs -> gpu_cpu_topk (gpu_baseline)

1. read file cost 33054 ms (line-by-line read)
2. topk cost 2504 ms, reduction: (87230-2504)/87230=**97.13%** compared with `cpu_baseline`; (14206-2504)/14206=**82.37%** compared with `cpu_concurrency`
3. all cost 36026 ms, reduction: (120284-36026)/120284=**70.05%** compared with `cpu_baseline`; (47654-36026)/47654=**24.40%** compared with `cpu_concurrency`
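
The percentages in this section all follow the same formula, (before - after) / before. A small helper, with the hypothetical name `reduction_pct`, reproduces them from the measured timings listed above:

```python
def reduction_pct(before_ms, after_ms):
    """Relative cost reduction in percent, rounded to two decimals."""
    return round((before_ms - after_ms) / before_ms * 100, 2)


# top-k cost: cpu_baseline -> gpu_baseline, cpu_concurrency -> gpu_baseline
print(reduction_pct(87230, 2504))   # 97.13
print(reduction_pct(14206, 2504))   # 82.37
# total cost: cpu_baseline -> gpu_baseline, cpu_concurrency -> gpu_baseline
print(reduction_pct(120284, 36026))  # 70.05
print(reduction_pct(47654, 36026))   # 24.4
```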

---

## cpu_readfile -> vec docs split -> cpu_concurrency_gpu_topk (slower than gpu_baseline)

1. read file cost 33054 ms (line-by-line read)
2. topk cost 2915 ms, increase: (2915-2504)/2504=**16.41%** compared with `gpu_baseline`
3. all cost 36230 ms, increase: (36230-36026)/36026=**0.57%** compared with `gpu_baseline`

The slowdown is attributed to the added **CPU context-switch cost**.

---

## gpu_readfile -> vec docs -> gpu_cpu_topk

1. read file cost drops from 34274 ms (line-by-line read) to 9196 ms (GPU chunked multibyte_split), a reduction of (34274-9196)/34274 = **73.17%**
2. total cost reduction: (35551 - 11589)/35551 = **67.40%**

---

## gpu_readfile -> gpu_chunk_topk -> gpu_cpu_topk

1. pipeline: read the file in chunks and rank top-k per chunk on the GPU
2. total cost reduction: (35551 - 7021)/35551 = **80.25%** compared with `gpu baseline`
3. total cost reduction: (11589 - 7021)/11589 = **39.42%** compared with reading file chunks on the GPU into CPU vec docs, then loading them to the GPU to rank top-k
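
The idea of ranking per chunk and keeping only a running top-k can be sketched on the CPU with a bounded min-heap. This is not the GPU pipeline itself; `chunked_top_k` is a hypothetical name, IDs are assumed unique within the query, and ties are broken by lower doc index.

```python
import heapq


def chunked_top_k(query, doc_chunks, k=100):
    """Stream chunks of docs and keep only the k best (score, doc_index) pairs."""
    query_set = set(query)
    heap = []  # min-heap of (score, -doc_index); heap[0] is the weakest kept candidate
    doc_index = 0
    for chunk in doc_chunks:
        for doc in chunk:
            hits = sum(1 for v in doc if v in query_set)
            score = hits / max(len(query), len(doc)) if doc else 0.0
            item = (score, -doc_index)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)  # evict the weakest candidate
            doc_index += 1
    return [(score, -neg_index) for score, neg_index in sorted(heap, reverse=True)]


chunks = [[[1, 2], [3, 4]], [[2, 3], [9]]]
print(chunked_top_k([2, 3], chunks, k=2))
```

Because only k candidates survive each chunk, the full doc set never has to be resident at once, which is what allows reading and ranking to overlap.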

---

## (gpu_readfile -> gpu_chunk_topk -> gpu_cpu_topk) + stream pool + rmm
(todo)

---

## use select k -> sort -> top k. gpu accelerate
 (todo)

---

# reference
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
- https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
- https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html
- https://docs.nvidia.com/cuda/cuda-runtime-api/index.html
- https://docs.nvidia.com/cuda/thrust/index.html
- https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
- https://nvlabs.github.io/cub/index.html
- https://stotko.github.io/stdgpu/api/memory.html
- https://www.youtube.com/watch?v=cOBtkPsgkus
- **https://www.youtube.com/watch?v=Na9_2G6niMw**
- https://www.csd.uwo.ca/~mmorenom/HPC-Slides/Many_core_computing_with_CUDA.pdf
- [Exploring Performance Portability for Accelerators via High-level Parallel Patterns](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=4Ab_NBkAAAAJ&citation_for_view=4Ab_NBkAAAAJ:hqOjcs7Dif8C), [PPT](https://pdfs.semanticscholar.org/b34a/f7c4739d622379fa31a1e88155335061c1b1.pdf)
- https://zhuanlan.zhihu.com/p/52344300
- https://passlab.github.io/OpenMPProgrammingBook/cover.html
- https://developer.nvidia.com/blog/maximizing-performance-with-massively-parallel-hash-maps-on-gpus/
- https://github.com/rapidsai/raft/blob/branch-23.12/docs/source/vector_search_tutorial.md


## view paper
1. [Fast Segmented Sort on GPUs.](https://raw.github.com/weedge/learn/main/gpu/Fast%20Segmented%20Sort%20on%20GPUs.pdf)
2. [Efficient Top-K query processing on massively parallel hardware](https://raw.githubusercontent.com/weedge/learn/main/gpu/Efficient%20Top-K%20Query%20Processing%20on%20Massively%20Parallel%20Hardware.pdf)
3. [stdgpu: Efficient STL-like Data Structures on the GPU](https://www.researchgate.net/publication/335233070_stdgpu_Efficient_STL-like_Data_Structures_on_the_GPU)
4. [Parallel Top-K Algorithms on GPU: A Comprehensive Study and New Methods](https://sc23.supercomputing.org/presentation/?id=pap294&sess=sess156)

## view code
1. https://github.com/rapidsai/cudf/pull/8702 , https://github.com/rapidsai/cudf/blob/branch-23.12/cpp/tests/io/text/multibyte_split_test.cpp
2. https://github.com/vtsynergy/bb_segsort (k/v), https://github.com/Funatiq/bb_segsort (k,k/v)
3. https://github.com/anilshanbhag/gpu-topk
4. https://github.com/heavyai/heavydb/blob/master/QueryEngine/TopKSort.cu
5. https://github.com/rapidsai/raft/blob/branch-23.12/cpp/include/raft/neighbors/detail/cagra/topk_for_cagra/topk_core.cuh
6. https://github.com/rapidsai/raft/blob/branch-23.12/cpp/include/raft/matrix/select_k.cuh , https://github.com/rapidsai/raft/blob/branch-23.12/cpp/test/matrix/select_k.cuh

## run baseline

In [None]:
!lsb_release -a
!uname -a
!python --version
!lsblk

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.3 LTS
Release:	22.04
Codename:	jammy
Linux a1fb2ef70bdd 5.15.120+ #1 SMP Wed Aug 30 11:19:59 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux


In [None]:
!nvidia-smi

Fri Nov 10 04:03:27 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    44W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!nvidia-smi -q

In [None]:
!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nsight-systems-2023.2.3_2023.2.3.1001-1_amd64.deb
!apt update
!apt install ./nsight-systems-2023.2.3_2023.2.3.1001-1_amd64.deb
!apt --fix-broken install


In [1]:
!wget "http://220.181.33.47:80/v1/ai-studio-online/9805dd2d2e8e472693efac637628e16b9f9c5be0fe30438bb4a80de3b386781a?responseContentDisposition=attachment%3B%20filename%3DSTI2_1017.zip&authorization=bce-auth-v1%2F5cfe9a5e1454405eb2a975c43eace6ec%2F2023-10-18T12%3A42%3A27Z%2F-1%2F%2F6b5388dcd9013bc9b340bb1806476afa938ce0c65f2f595e1a75f529e90e4187" -O STI2_1017.zip

--2023-12-13 14:52:27--  http://220.181.33.47/v1/ai-studio-online/9805dd2d2e8e472693efac637628e16b9f9c5be0fe30438bb4a80de3b386781a?responseContentDisposition=attachment%3B%20filename%3DSTI2_1017.zip&authorization=bce-auth-v1%2F5cfe9a5e1454405eb2a975c43eace6ec%2F2023-10-18T12%3A42%3A27Z%2F-1%2F%2F6b5388dcd9013bc9b340bb1806476afa938ce0c65f2f595e1a75f529e90e4187
Connecting to 220.181.33.47:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1005669898 (959M) [application/octet-stream]
Saving to: ‘STI2_1017.zip’


2023-12-13 14:52:59 (30.8 MB/s) - ‘STI2_1017.zip’ saved [1005669898/1005669898]



In [2]:
!rm -rf STI2 && unzip STI2_1017.zip && mv STI2\ 2 STI2

Archive:  STI2_1017.zip
   creating: STI2 2/
  inflating: __MACOSX/._STI2 2       
   creating: STI2 2/bin/
  inflating: __MACOSX/STI2 2/._bin   
   creating: STI2 2/translate/
  inflating: __MACOSX/STI2 2/._translate  
  inflating: STI2 2/run.sh           
  inflating: __MACOSX/STI2 2/._run.sh  
  inflating: STI2 2/build.sh         
  inflating: __MACOSX/STI2 2/._build.sh  
   creating: STI2 2/src/
  inflating: __MACOSX/STI2 2/._src   
  inflating: STI2 2/bin/query_doc_scoring  
  inflating: __MACOSX/STI2 2/bin/._query_doc_scoring  
   creating: STI2 2/translate/res/
  inflating: __MACOSX/STI2 2/translate/._res  
   creating: STI2 2/translate/querys/
  inflating: __MACOSX/STI2 2/translate/._querys  
  inflating: STI2 2/translate/docs.txt  
  inflating: __MACOSX/STI2 2/translate/._docs.txt  
  inflating: STI2 2/src/topk.h       
  inflating: __MACOSX/STI2 2/src/._topk.h  
  inflating: STI2 2/src/topk.cu      
  inflating: __MACOSX/STI2 2/src/._topk.cu  
  inflating: STI2 2/src/main.cpp

In [None]:
!sh STI2/build.sh

In [None]:
!STI2/bin/query_doc_scoring STI2/translate/docs.txt STI2/translate/querys ./res_gpu_baseline.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!nvcc STI2/src/main.cpp STI2/src/topk.cu -o STI2/bin/query_doc_scoring_gpu  \
	-ISTI2/src \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 \
	-O3 \
	-g


In [None]:
!STI2/bin/query_doc_scoring_gpu STI2/translate/docs.txt STI2/translate/querys ./res_3.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!diff res_3.txt STI2/translate/res/result.txt

1c1
< 3175
---
> 2990


In [None]:
!nvprof --print-gpu-trace STI2/bin/query_doc_scoring_gpu STI2/translate/docs.txt STI2/translate/querys ./res.txt

In [None]:
!ncu --set full --call-stack --nvtx -o report_gpu STI2/bin/query_doc_scoring_gpu STI2/translate/docs.txt STI2/translate/querys ./res.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!nvcc STI2/src/main.cpp topk/topk_query_stream.cu -o STI2/bin/query_doc_scoring_gpu_stream  \
	-ISTI2/src \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 \
	-O3 \
	-g

In [None]:
!STI2/bin/query_doc_scoring_gpu_stream STI2/translate/docs.txt STI2/translate/querys ./res_gpu_stream.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!diff ./res_gpu_stream.txt STI2/translate/res/result.txt

1c1
< 2850
---
> 2990


In [None]:
!nvprof --print-gpu-trace STI2/bin/query_doc_scoring_gpu_stream STI2/translate/docs.txt STI2/translate/querys ./res_gpu_stream.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

## run topk

In [None]:
!make -C topk/ BUILD_TYPE=Release

In [None]:
!topk/bin/query_doc_scoring_cpu STI2/translate/docs.txt STI2/translate/querys ./cpu_res.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!diff cpu_res.txt STI2/translate/res/result.txt

1c1
< 87230
---
> 2990


### cpu_readfile -> vec docs -> cpu_topk (cpu_baseline)

1. read file cost 33054 ms (line-by-line read)
2. topk cost 87230 ms
3. all cost 120284 ms



In [None]:
!topk/bin/query_doc_scoring_cpu_concurrency STI2/translate/docs.txt STI2/translate/querys ./cpu_concurency_res.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!diff cpu_concurency_res.txt STI2/translate/res/result.txt

1c1
< 14206
---
> 2990


### cpu_readfile -> vec docs split -> cpu_concurrency_topk
The thread pool uses as many threads as the Colab A100 instance has CPU cores (12).

1. read file cost 33054 ms (line-by-line read)
2. topk cost 14206 ms, reduction: (87230-14206)/87230=**83.71%** compared with `cpu_baseline`
3. all cost 47654 ms, reduction: (120284-47654)/120284=**60.38%** compared with `cpu_baseline`

---


In [None]:
!make -C topk/ build_cpu_gpu BUILD_TYPE=Release NVCCFLAGS="-std=c++11"

make: Entering directory '/content/topk'
mkdir -p bin
nvcc ./main.cpp ./topk.cu -o ./bin/query_doc_scoring_cpu_gpu  \
	-I./ \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 \
	-O3 \
	-DGPU \
	-g
make: Leaving directory '/content/topk'


In [None]:
!topk/bin/query_doc_scoring_cpu_gpu STI2/translate/docs.txt STI2/translate/querys ./cpu_gpu_res.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!diff cpu_gpu_res.txt STI2/translate/res/result.txt

1c1
< 2504
---
> 2990


### cpu_readfile -> vec docs -> gpu_cpu_topk (gpu_baseline)

1. read file cost 33054 ms (line-by-line read)
2. topk cost 2504 ms, reduction: (87230-2504)/87230=**97.13%** compared with `cpu_baseline`; (14206-2504)/14206=**82.37%** compared with `cpu_concurrency`
3. all cost 36026 ms, reduction: (120284-36026)/120284=**70.05%** compared with `cpu_baseline`; (47654-36026)/47654=**24.40%** compared with `cpu_concurrency`

---


In [None]:
!nvprof --print-gpu-trace topk/bin/query_doc_scoring_cpu_gpu STI2/translate/docs.txt STI2/translate/querys ./cpu_gpu_res_1.txt

In [None]:
!nsys profile  -o a100_report_cpu_gpu.nsys-rep topk/bin/query_doc_scoring_cpu_gpu STI2/translate/docs.txt STI2/translate/querys ./cpu_gpu_res_1.txt


In [None]:
!ncu --set full --call-stack --nvtx -o a100_report_cpu_gpu topk/bin/query_doc_scoring_cpu_gpu STI2/translate/docs.txt STI2/translate/querys ./cpu_gpu_res_1.txt

In [None]:
!make -C topk/ build_cpu_concurrency_gpu BUILD_TYPE=Release NVCCFLAGS="-std=c++11"

make: Entering directory '/content/topk'
mkdir -p bin
nvcc ./main.cpp ./topk.cu -o ./bin/query_doc_scoring_cpu_concurrency_gpu  \
	-I./ \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 \
	-O3 \
	-DCPU_CONCURRENCY \
	-DGPU \
	-g
make: Leaving directory '/content/topk'


In [None]:
!topk/bin/query_doc_scoring_cpu_concurrency_gpu STI2/translate/docs.txt STI2/translate/querys ./cpu_concurency_gpu_res.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!diff cpu_concurency_gpu_res.txt STI2/translate/res/result.txt

1c1
< 2915
---
> 2990


### cpu_readfile -> vec docs split -> cpu_concurrency_gpu_topk (slower than gpu_baseline)

1. read file cost 33054 ms (line-by-line read)
2. topk cost 2915 ms, increase: (2915-2504)/2504=**16.41%** compared with `gpu_baseline`
3. all cost 36230 ms, increase: (36230-36026)/36026=**0.57%** compared with `gpu_baseline`

The slowdown is attributed to the added **CPU context-switch cost**.

---


In [None]:
!nvprof --print-gpu-trace topk/bin/query_doc_scoring_cpu_concurrency_gpu STI2/translate/docs.txt STI2/translate/querys ./cpu_concurency_gpu_res.txt

In [None]:
!nsys profile  -o a100_report_cpu_concurrency_gpu.nsys-rep topk/bin/query_doc_scoring_cpu_concurrency_gpu STI2/translate/docs.txt STI2/translate/querys ./cpu_concurency_gpu_res.txt


In [None]:
!ncu --set full --call-stack --nvtx -o a100_report_cpu_concurrency_gpu topk/bin/query_doc_scoring_cpu_concurrency_gpu STI2/translate/docs.txt STI2/translate/querys ./cpu_concurency_gpu_res.txt

### topk_pinned_memory

## sample test

In [None]:
!make -C topk/ build_example_readfile_cpu BUILD_TYPE=Release CXXFLAGS="-std=c++11"

make: Entering directory '/content/topk'
nvcc -o bin/example_readfile_cpu example_readfile.cpp -DFMT_HEADER_ONLY \
	-I./ \
	-std=c++11 \
	-O3 \
	-g
make: Leaving directory '/content/topk'


In [None]:
!topk/bin/example_readfile_cpu STI2/translate/docs.txt line

docs_size:7853051 doc_lens_size:7853051
read file cost 33616 ms 


In [None]:
!topk/bin/example_readfile_cpu STI2/translate/docs.txt buffer

readcnt: 7 fread size: 3287461913
docs_size:7853051 doc_lens_size:7853051
read file cost 41724 ms 


In [None]:
!cd topk && nvcc ./stream.cu -o ./bin/stream && ./bin/stream

Number of device(s): 1
Device 0
    Name:                    Tesla T4
    Glocbal memory:          15101.8 MB
    Shared memory per block: 48 KB
    Warp size:               32
    Max thread per block:    1024
    Thread dimension limits: 1024 x 1024 x 64
    Max grid size:           2147483647 x 65535 x 65535
    Compute capability:      7.5
 
Generating 7680 x 4320 BRGA8888 image, data size: 132710400
 
Computing results using CPU.
 
    Whole process took 497.971ms.
 
Computing results using GPU, default stream.
 
    Move data to GPU.
        Data transfer took 12.0095ms.
        Performance is 11.0504GB/s.
    Convert 8-bit BGRA to 8-bit YUV.
        Processing of 8K image took 1.70637ms.
        Performance is 77.7736GB/s.
    Move data to CPU.
        Data transfer took 8.13226ms.
        Performance is 12.2393GB/s.
    Whole process took 21.8481ms.
    Compare CPU and GPU results ...
        Results are the same.
 
Computing results using GPU, using 16 streams.
 
    Creating 

In [None]:
!make -C topk/ build_cpu_gpu_doc_stream BUILD_TYPE=Release NVCCFLAGS="-std=c++11"

make: Entering directory '/content/topk'
mkdir -p bin
nvcc ./main.cpp ./topk_doc_stream.cu -o ./bin/query_doc_scoring_cpu_gpu_doc_stream  \
	-I./ \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 \
	-O3 \
	-DGPU \
	-g
make: Leaving directory '/content/topk'


In [None]:
!topk/bin/query_doc_scoring_cpu_gpu_doc_stream STI2/translate/docs.txt STI2/translate/querys ./res_gpu_doc_stream.txt

In [None]:
!diff ./res_gpu_doc_stream.txt STI2/translate/res/result.txt

# rapidsai - cudf
Use cuDF's chunked multibyte_split and strings split to parse the docs file with GPU acceleration.

1. https://github.com/rapidsai/cudf/blob/branch-23.12/CONTRIBUTING.md#build-cudf-from-source
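
The chunked splitting idea can be shown with a CPU-only sketch that does not use cuDF at all: read fixed-size chunks, split on the delimiter, and carry the incomplete tail into the next chunk. The name `iter_docs` is hypothetical, and the default chunk size mirrors the 268435456-byte chunks reported by the readfile runs further down.

```python
import io


def iter_docs(stream, chunk_size=1 << 28, delimiter=b"\n"):
    """Yield one list of integer IDs per line, reading the stream chunk by chunk."""
    carry = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parts = (carry + chunk).split(delimiter)
        carry = parts.pop()  # last piece may be an incomplete record
        for line in parts:
            if line.strip():
                yield [int(tok) for tok in line.decode().split(",")]
    if carry.strip():  # final record without a trailing delimiter
        yield [int(tok) for tok in carry.decode().split(",")]


data = b"0, 1, 3\n1, 2, 3, 4\n4, 5, 6, 5\n7, 2\n"
print(list(iter_docs(io.BytesIO(data), chunk_size=5)))
```

The tiny `chunk_size=5` in the usage line forces records to straddle chunk boundaries, which is the case the carry-over buffer exists to handle.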

In [None]:
!rm -rf cudf && git clone https://github.com/weedge/cudf.git

Cloning into 'cudf'...
remote: Enumerating objects: 351573, done.
remote: Counting objects: 100% (24844/24844), done.
remote: Compressing objects: 100% (2001/2001), done.
remote: Total 351573 (delta 23466), reused 23327 (delta 22826), pack-reused 326729
Receiving objects: 100% (351573/351573), 129.01 MiB | 2.32 MiB/s, done.
Resolving deltas: 100% (260502/260502), done.
git: 'co' is not a git command. See 'git --help'.

The most similar commands are
	commit
	clone
	log


In [None]:
!cd cudf && git branch

* branch-23.10
  branch-24.02


In [None]:
!cd cudf && ./build.sh clean && INSTALL_PREFIX=$HOME/rapidsai ./build.sh libcudf --cmake-args=\"-DBUILD_SHARED_LIBS=OFF\"

In [None]:
!ls $HOME/rapidsai/lib/libcudf.a

/root/rapidsai/lib/libcudf.a


In [None]:
!git clone https://github.com/gabime/spdlog.git

fatal: destination path 'spdlog' already exists and is not an empty directory.


In [None]:
!cd spdlog && cmake -B build -S . && make -C build -j

In [None]:
!ls $HOME/rapidsai/include

arrow  cudf	  fmt	    gmock  kvikio   native  nvcomp.h	nvtext	spdlog
cuco   cudf_test  gdeflate  gtest  libcudf  nvcomp  nvcomp.hpp	rmm


In [None]:
!cp -r ./spdlog/include/spdlog/fmt/bundled $HOME/rapidsai/include/spdlog/fmt/

In [None]:
!ls /usr/local/cuda*

/usr/local/cuda:
bin  compat  compute-sanitizer	doc  extras  gds  include  lib64  nvml	nvvm  share  src  targets

/usr/local/cuda-11:
bin  compat  compute-sanitizer	doc  extras  gds  include  lib64  nvml	nvvm  share  src  targets

/usr/local/cuda-11.8:
bin  compat  compute-sanitizer	doc  extras  gds  include  lib64  nvml	nvvm  share  src  targets


In [None]:
!tar -zcvf libcudf.tar.gz /include /lib/libcudf.so /lib/libarrow*

In [None]:
!tar -zxvf libcudf.tar.gz

In [None]:
!cd topk && /opt/nvidia/hpc_sdk/Linux_x86_64/2023/cuda/bin/nvcc -o bin/example_readfile_gpu example_readfile.cpp readfile.cu -DGPU -DFMT_HEADER_ONLY \
	-I./  \
	-std=c++17 --expt-relaxed-constexpr \
	-L/opt/nvidia/hpc_sdk/Linux_x86_64/2023/cuda/lib64 -lcudart -lcuda  \
	-L/root/rapidsai/lib -lcudf -I/root/rapidsai/include  \
	-O2 \
	-g

In [None]:
!make -C topk build_example_readfile_gpu BUILD_TYPE=Release NVCCSTD=c++17 RAPIDSAI_DIR=$HOME/rapidsai

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc -o bin/example_readfile_gpu example_readfile.cpp readfile.cu -DGPU -DFMT_HEADER_ONLY \
	-I./ -I/usr/local/cuda/include \
	-std=c++17 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-L/usr/local/cuda/lib64 -lcudart -lcuda  \
	-L/root/rapidsai/lib -lcudf -I/root/rapidsai/include  \
	-O2 \
	-g
make: Leaving directory '/content/topk'


In [None]:
!git clone --recurse-submodules https://github.com/rapidsai/rmm.git


Cloning into 'rmm'...
remote: Enumerating objects: 18856, done.
remote: Total 18856 (delta 0), reused 0 (delta 0), pack-reused 18856
Receiving objects: 100% (18856/18856), 4.72 MiB | 10.49 MiB/s, done.
Resolving deltas: 100% (12040/12040), done.


In [None]:
!cd rmm && git checkout branch-23.10

Branch 'branch-23.10' set up to track remote branch 'branch-23.10' from 'origin'.
Switched to a new branch 'branch-23.10'


In [None]:
!cd rmm && git branch

* branch-23.10
  branch-24.02


In [None]:
!cp -r rmm/include/rmm /root/rapidsai/include/

In [None]:
!cd rmm && cmake -B build -S . -DCMAKE_INSTALL_PREFIX=$HOME/rapidsai

In [None]:
!cat topk/data.txt

0, 1, 3
1, 2, 3, 4
4, 5, 6, 5
7, 2

In [None]:
!topk/readfile topk/data.txt chunk

file size: 34
chunk size: 268435456
 fread size: 34
 buffer: 0, 1, 3
1, 2, 3, 4
4, 5, 6, 5
7, 2

tid:0 docid:0 s:0 e:3 sub_view_size:3

tid:1 docid:1 s:3 e:7 sub_view_size:4

tid:2 docid:2 s:7 e:11 sub_view_size:4

tid:3 docid:3 s:11 e:13 sub_view_size:2
0,1,4,7,1,2,5,2,3,3,6,4,5,readcnt: 1
doccnt: 4
docs_size:0 doc_lens_size:0
read file cost 1183 ms 


In [None]:
!topk/readfile STI2/translate/docs.txt line

docs_size:7853051 doc_lens_size:7853051
read file cost 34274 ms 


In [None]:
!topk/readfile STI2/translate/docs.txt buffer

readcnt: 7 fread size: 3287461913
docs_size:7853051 doc_lens_size:7853051
read file cost 42369 ms 


In [None]:
!topk/bin/example_readfile_gpu STI2/translate/docs.txt chunk

chunk size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 66239130
readcnt: 13
doccnt: 7853052
docs_size:7853052 doc_lens_size:7853052
read file cost 29039 ms 


In [None]:
!make -C topk/ build_cpu_gpu_readfile BUILD_TYPE=Release NVCCFLAGS="-std=c++17 --expt-relaxed-constexpr"

make: Entering directory '/content/topk'
mkdir -p bin
nvcc ./main.cpp ./readfile.cu ./topk.cu -o ./bin/query_doc_scoring_cpu_gpu_readfile \
	-I./ \
	-std=c++17 --expt-relaxed-constexpr \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-L/lib -lcudf -I/include  \
	-O3 \
	-DGPU -DFMT_HEADER_ONLY -DPIO \
	-g
make: Leaving directory '/content/topk'


In [None]:
!topk/bin/query_doc_scoring_cpu_gpu_readfile STI2/translate/docs.txt STI2/translate/querys ./res_cpu_gpu_readfile.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!diff res_cpu_gpu_readfile.txt STI2/translate/res/result.txt

1c1
< 2393
---
> 2990


In [None]:
!nsys profile  -o a100_report_cpu_gpu_readfile.nsys-rep \
  topk/bin/query_doc_scoring_cpu_gpu_readfile STI2/translate/docs.txt STI2/translate/querys ./res_cpu_gpu_readfile.txt


In [None]:
!ncu --set full --call-stack --nvtx -o a100_report_cpu_gpu_readfile \
  topk/bin/query_doc_scoring_cpu_gpu_readfile STI2/translate/docs.txt STI2/translate/querys ./res_cpu_gpu_readfile.txt

### gpu_readfile -> vec docs -> gpu_cpu_topk

1. Read-file cost drops from 34274 ms (line by line) to 9196 ms (GPU chunk multi_split), a reduction of (34274 - 9196)/34274 = **73.17%**.
2. Total cost drops by (35551 - 11589)/35551 = **67.40%**.
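
The percentages above follow the usual relative-reduction formula, (before - after) / before. A small sketch that reproduces them from the timings quoted in the list; `reduction_pct` is a hypothetical helper name used only for illustration.

```python
def reduction_pct(before_ms: float, after_ms: float) -> float:
    """Relative cost reduction in percent, rounded to two decimals."""
    return round((before_ms - after_ms) / before_ms * 100, 2)


# Timings (ms) taken from the list above
read_file = reduction_pct(34274, 9196)
total = reduction_pct(35551, 11589)
print(f"read file: {read_file}%  total: {total}%")
```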

---



In [None]:
!make -C topk/ build_gpu_cudf_strings BUILD_TYPE=Release NVCCFLAGS="-std=c++17 --expt-relaxed-constexpr"

make: Entering directory '/content/topk'
mkdir -p bin
nvcc ./main.cpp ./readfile.cu ./topk_doc_cudf_strings.cu -o ./bin/query_doc_scoring_gpu_cudf_strings \
	-I./ \
	-std=c++17 --expt-relaxed-constexpr \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-L/lib -lcudf -I/include  \
	-O3 \
	-DFMT_HEADER_ONLY -DGPU -DPIO_TOPK \
	-g
make: Leaving directory '/content/topk'


In [None]:
!topk/bin/query_doc_scoring_gpu_cudf_strings STI2/translate/docs.txt STI2/translate/querys ./res_gpu_cudf_strings.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!diff res_gpu_cudf_strings.txt STI2/translate/res/result.txt

1c1
< 0
---
> 2990


In [None]:
!nsys profile  -o a100_report_gpu_cudf_strings.nsys-rep \
  topk/bin/query_doc_scoring_gpu_cudf_strings STI2/translate/docs.txt STI2/translate/querys ./res_gpu_cudf_strings.txt

In [None]:
!ncu --set full --call-stack --nvtx -o a100_report_gpu_cudf_strings \
  topk/bin/query_doc_scoring_gpu_cudf_strings STI2/translate/docs.txt STI2/translate/querys ./res_gpu_cudf_strings.txt

### gpu_readfile -> gpu_chunk_topk -> gpu_cpu_topk

1. Pipeline the file read in chunks and rank the top-k of each chunk on the GPU.
2. Total cost drops by (35551 - 7021)/35551 = **80.25%** compared with `gpu baseline`.
3. Total cost drops by (11589 - 7021)/11589 = **39.42%** compared with `gpu read file chunk to cpu vec docs then load to gpu rank topk`.
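
The chunk pipeline rests on the fact that a global top-k can be built by taking the top-k of each chunk and merging the partial results. Below is a CPU-only sketch of that idea, not the repository's CUDA implementation. `score` and `topk_chunked` are hypothetical names, and the scoring assumes ids are compared as sets (repeated ids count once), which may differ from the real kernel.

```python
import heapq
from typing import Iterable, List, Tuple


def score(query: List[int], doc: List[int]) -> float:
    # Assumption: ids are compared as sets, so repeated ids count once.
    return len(set(query) & set(doc)) / max(len(query), len(doc))


def topk_chunked(query: List[int], chunks: Iterable[List[List[int]]],
                 k: int) -> List[Tuple[int, float]]:
    """Rank documents chunk by chunk, keeping only k candidates between chunks."""
    best: List[Tuple[float, int]] = []  # (-score, doc_id): smallest tuple = best match
    doc_id = 0
    for chunk in chunks:
        local = []
        for doc in chunk:
            local.append((-score(query, doc), doc_id))
            doc_id += 1
        # Per-chunk top-k first, then merge with the running top-k.
        best = heapq.nsmallest(k, best + heapq.nsmallest(k, local))
    return [(d, -s) for s, d in best]


docs = [[0, 1, 3], [1, 2, 3, 4], [4, 5, 6, 5], [7, 2]]
result = topk_chunked([1, 2, 3], [docs[:2], docs[2:]], k=2)
print(result)
```

Because only k candidates survive each chunk, the memory held between chunks stays bounded by k regardless of file size. Ties are broken toward the lower document id here, which is a design choice of this sketch.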

---




### (gpu_readfile -> gpu_chunk_topk -> gpu_cpu_topk) + stream pool + rmm (todo)

# rapidsai - RAFT

Use GPU-accelerated select-k, then sort the selected candidates to obtain the top-k.

1. https://github.com/rapidsai/raft/blob/branch-23.12/docs/source/build.md
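
The select-k then sort flow can be illustrated on the CPU: first pick the k largest scores without ordering them, then sort only those k. The sketch below uses a quickselect-style partition in plain Python. It does not call RAFT, and `select_k` and `top_k` are hypothetical names chosen for this example.

```python
import random
from typing import List


def select_k(scores: List[float], k: int) -> List[int]:
    """Return indices of the k largest scores, in no particular order."""
    idx = list(range(len(scores)))
    chosen: List[int] = []
    rng = random.Random(0)  # fixed seed so pivot choices are reproducible
    while k > 0 and idx:
        pivot = scores[rng.choice(idx)]
        greater = [i for i in idx if scores[i] > pivot]
        equal = [i for i in idx if scores[i] == pivot]
        if k <= len(greater):
            idx = greater  # the answer lies entirely among the larger values
        elif k <= len(greater) + len(equal):
            chosen += greater + equal[:k - len(greater)]
            k = 0
        else:
            chosen += greater + equal
            k -= len(greater) + len(equal)
            idx = [i for i in idx if scores[i] < pivot]
    return chosen


def top_k(scores: List[float], k: int) -> List[int]:
    """Select k candidates first, then sort only those k (best first)."""
    return sorted(select_k(scores, k), key=lambda i: (-scores[i], i))


scores = [0.2, 0.9, 0.5, 0.9, 0.1, 0.7]
print(top_k(scores, 3))
```

Sorting only the k selected entries costs O(k log k), which is why selection comes before the sort when k (100 in this task) is much smaller than the number of documents.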

In [None]:
!apt install ninja-build

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  ninja-build
0 upgraded, 1 newly installed, 0 to remove and 19 not upgraded.
Need to get 111 kB of archives.
After this operation, 358 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 ninja-build amd64 1.10.1-1 [111 kB]
Fetched 111 kB in 1s (78.4 kB/s)
Selecting previously unselected package ninja-build.
(Reading database ... 120874 files and directories currently installed.)
Preparing to unpack .../ninja-build_1.10.1-1_amd64.deb ...
Unpacking ninja-build (1.10.1-1) ...
Setting up ninja-build (1.10.1-1) ...
Processing triggers for man-db (2.10.2-1) ...


In [None]:
!git clone https://github.com/rapidsai/raft.git

Cloning into 'raft'...
remote: Enumerating objects: 30810, done.
remote: Counting objects: 100% (658/658), done.
remote: Compressing objects: 100% (376/376), done.
remote: Total 30810 (delta 360), reused 472 (delta 261), pack-reused 30152
Receiving objects: 100% (30810/30810), 12.60 MiB | 10.56 MiB/s, done.
Resolving deltas: 100% (22149/22149), done.


In [None]:
!cd raft && ./build.sh --help

./build.sh [<target> ...] [<flag> ...] [--cmake-args="<args>"] [--cache-tool=<tool>] [--limit-tests=<targets>] [--limit-bench-prims=<targets>] [--limit-bench-ann=<targets>] [--build-metrics=<filename>]
 where <target> is:
   clean            - remove all existing build artifacts and configuration (start over)
   libraft          - build the raft C++ code only. Also builds the C-wrapper library
                      around the C++ code.
   pylibraft        - build the pylibraft Python package
   raft-dask        - build the raft-dask Python package. this also requires pylibraft.
   docs             - build the documentation
   tests            - build the tests
   bench-prims      - build micro-benchmarks for primitives
   bench-ann        - build end-to-end ann benchmarks
   template         - build the example RAFT application template

 and <flag> is:
   -v                          - verbose build mode
   -g                          - build for debug
   -n                          - 

In [None]:
!sleep 3600

In [None]:
!cd raft && ./build.sh libraft --compile-lib

In [None]:
!ls /content/raft/cpp/build/install/include
!ls /content/raft/cpp/build/install/lib

cuco  cutlass  fmt  raft  raft_runtime	rapids	rmm  spdlog
cmake	   libfmt.so.9	    libraft.a	libspdlog.so	   libspdlog.so.1.11.0	rapids
libfmt.so  libfmt.so.9.1.0  libraft.so	libspdlog.so.1.11  pkgconfig


In [None]:
!ls /include/ /lib/
!cp -r /content/raft/cpp/build/install/include/* /include/
!cp -r /content/raft/cpp/build/install/lib/* /lib/

/include/:
cuco  cutlass  fmt  raft  raft_runtime	rapids	rmm  spdlog

/lib/:
apt		  libarmadillo.so.10	  libraft.so		python3.10
bfd-plugins	  libarmadillo.so.10.8.2  libR.so		python3.11
binfmt.d	  libBLT.2.5.so.8.6	  libspdlog.so		R
blt2.5		  libBLTlite.2.5.so.8.6   libspdlog.so.1.11	rapids
clang		  libdfalt.a		  libspdlog.so.1.11.0	sasl2
cmake		  libdfalt.la		  libvpf.so		software-properties
compat-ld	  libdfalt.so		  libvpf.so.4		ssl
cpp		  libdfalt.so.0		  libvpf.so.4.1		sysctl.d
dbus-1.0	  libdfalt.so.0.0.0	  llvm-14		systemd
debug		  libfmt.so		  locale		sysusers.d
dh-elpa		  libfmt.so.9		  lsb			tc
dpkg		  libfmt.so.9.1.0	  man-db		tcl8.6
emacsen-common	  libgdal.a		  mime			tclConfig.sh
environment.d	  libgdal.so		  modprobe.d		tclooConfig.sh
file		  libgdal.so.30		  modules		tcltk
gcc		  libgdal.so.30.0.3	  modules-load.d	terminfo
girepository-1.0  libhdf4.settings	  ogdi			tk8.6
git-core	  libmfhdfalt.a		  openssh		tkConfig.sh
gnupg		  libmfhdfalt.la	  os-release		tmpfiles.d
g

In [None]:
!ls /include/spdlog/fmt

bin_to_hex.h  bundled  chrono.h  compile.h  fmt.h  ostr.h  ranges.h  xchar.h


In [None]:
!tar -zcvf libraft.tar.gz /content/raft/cpp/build/install

In [None]:
!make -C topk build_cpu_gpu_sort NVCCSTD="c++17" BUILD_TYPE=Release

In [None]:
!topk/bin/query_doc_scoring_cpu_gpu_sort STI2/translate/docs.txt STI2/translate/querys ./res_cpu_gpu_sort.txt

In [None]:
!diff res_cpu_gpu_sort.txt STI2/translate/res/result.txt

# profiling (A100)

In [None]:
!tar zcvf a100_gpu_topk_ncu_nsys_profile.tar.gz ./a100*
!ls -gh a100*

./a100_report_cpu_concurrency_gpu.ncu-rep
./a100_report_cpu_concurrency_gpu.nsys-rep
./a100_report_cpu_gpu.ncu-rep
./a100_report_cpu_gpu.nsys-rep
./a100_report_cpu_gpu_readfile.ncu-rep
./a100_report_cpu_gpu_readfile.nsys-rep
./a100_report_gpu_cudf_strings.ncu-rep
./a100_report_gpu_cudf_strings.nsys-rep
-rw-r--r-- 1 root  66M Nov 10 15:03 a100_gpu_topk_ncu_nsys_profile.tar.gz
-rw-r--r-- 1 root  29M Nov 10 07:23 a100_report_cpu_concurrency_gpu.ncu-rep
-rw-rw-r-- 1 root  11M Nov 10 07:21 a100_report_cpu_concurrency_gpu.nsys-rep
-rw-r--r-- 1 root 2.6M Nov 10 07:16 a100_report_cpu_gpu.ncu-rep
-rw-rw-r-- 1 root 5.8M Nov 10 07:15 a100_report_cpu_gpu.nsys-rep
-rw-r--r-- 1 root 229M Nov 10 14:23 a100_report_cpu_gpu_readfile.ncu-rep
-rw-rw-r-- 1 root 583K Nov 10 14:11 a100_report_cpu_gpu_readfile.nsys-rep
-rw-r--r-- 1 root 253M Nov 10 14:37 a100_report_gpu_cudf_strings.ncu-rep
-rw-rw-r-- 1 root 621K Nov 10 14:37 a100_report_gpu_cudf_strings.nsys-rep


# install deps

In [3]:
!rm -rf topk && git clone https://github.com/weedge/topk.git

Cloning into 'topk'...
remote: Enumerating objects: 464, done.
remote: Counting objects: 100% (168/168), done.
remote: Compressing objects: 100% (117/117), done.
remote: Total 464 (delta 113), reused 98 (delta 51), pack-reused 296
Receiving objects: 100% (464/464), 5.90 MiB | 18.43 MiB/s, done.
Resolving deltas: 100% (300/300), done.


## install deps rapidsai cudf RAFT

In [4]:
!sh -x topk/build_deps_rapidsai.sh

Streaming output truncated to the last 5000 lines.
spdlog-1.11.0/tests/test_file_helper.cpp
spdlog-1.11.0/tests/test_file_logging.cpp
spdlog-1.11.0/tests/test_fmt_helper.cpp
spdlog-1.11.0/tests/test_macros.cpp
spdlog-1.11.0/tests/test_misc.cpp
spdlog-1.11.0/tests/test_mpmc_q.cpp
spdlog-1.11.0/tests/test_pattern_formatter.cpp
spdlog-1.11.0/tests/test_registry.cpp
spdlog-1.11.0/tests/test_sink.h
spdlog-1.11.0/tests/test_stdout_api.cpp
spdlog-1.11.0/tests/test_stopwatch.cpp
spdlog-1.11.0/tests/test_systemd.cpp
spdlog-1.11.0/tests/test_time_point.cpp
spdlog-1.11.0/tests/utils.cpp
spdlog-1.11.0/tests/utils.h
+ cp -r ./spdlog-1.11.0/include/spdlog/fmt/bundled /root/rapidsai/include/spdlog/fmt/
+ [ 0 -eq 0 ]
+ cd /root/rapidsai
+ zip -r -v rapidsai.zip ./include ./lib
  adding: include/	(in=0) (out=0) (stored 0%)
  adding: include/nvcomp/	(in=0) (out=0) (stored 0%)
  adding: include/nvcomp/lz4.h	(in=10961) (out=2826) (deflated 74%)
  adding: include/nvcomp/bitcomp.hpp	(in=2224) (out=1149) (deflated 48%

# example readfile

In [None]:
!make -C topk build_example_readfile_cpu BUILD_TYPE=Release

make: Entering directory '/content/topk'
mkdir -p bin
g++ -o bin/example_readfile_cpu example_readfile.cpp -DFMT_HEADER_ONLY \
	-I./ \
	-std=c++11 -march=native -fopenmp \
	-O2 \
	-g
make: Leaving directory '/content/topk'


In [None]:
!topk/bin/example_readfile_cpu STI2/translate/docs.txt line

docs_size:0 doc_lens_size:0
read file cost 0 ms 


In [None]:
!topk/bin/example_readfile_cpu STI2/translate/docs.txt buffer

readcnt: 7 fread size: 3287461913
docs_size:7853051 doc_lens_size:7853051
read file cost 43283 ms 


In [None]:
!topk/bin/example_readfile_cpu STI2/translate/docs.txt map

file_size: 3287460378
docs_size:7853051 doc_lens_size:7853051
read file cost 42443 ms 


In [None]:
!make -C topk build_example_readfile_gpu BUILD_TYPE=Release RAPIDSAI_DIR=$HOME/rapidsai NVCCSTD=c++17

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc -o bin/example_readfile_gpu example_readfile.cpp readfile.cu -DGPU -DFMT_HEADER_ONLY \
	-I./ \
	-std=c++17 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-L/usr/local/cuda/lib64 -lcudart -lcuda  \
	-L/root/rapidsai/lib -lcudf -I/root/rapidsai/include  \
	-O2 \
	-g
make: Leaving directory '/content/topk'


In [None]:
!topk/bin/example_readfile_gpu topk/docs.txt chunk

chunk size: 268435456
 fread size: 36
1 10 10 11 13
2 3 4
11 12 13 len:1
2 3 4
11 12 13 len:1
11 12 13 len:2
 10 10 11 13
2 3 4
11 12 13 len:3
 3 4
11 12 13 len:2
 12 13 len:3
 10 11 13
2 3 4
11 12 13 len:3
 4
11 12 13 len:3
 13 len:3
 11 13
2 3 4
11 12 13 len:3
 13
2 3 4
11 12 13 len:4
readcnt: 1
doccnt: 3
1 63946 63946 63947 49628 
2 65379 63938 
11 63948 63949 
docs_size:3 doc_lens_size:3
read file cost 507 ms 


In [None]:
!sleep 3600

# example factory selectk

In [None]:
!cd topk && rm -f ./third_party/done.txt  ./lib/libfaiss.so  ./lib/libgpu_selection.so


In [None]:
!cd topk && make clean_3d_faiss && make clean_3d_gpu_selection

In [None]:
!cd topk && sh -x build_examples_factory_selectk.sh 75

+ set -e
+ dirname build_examples_factory_selectk.sh
+ cd .
+ pwd
+ ROOT_DIR=/content/topk
+ cd /content/topk
+ mkdir -p bin
+ ARCH=70
+ [ -n 75 ]
+ ARCH=75
+ [ ! -f ./third_party/done.txt ]
+ [ ! -f ./lib/libfaiss.so ]
+ [ ! -f ./lib/libgpu_selection.so ]
+ nvcc -o bin/example_factory_selectk example_factory_selectk.cu -O2 -std=c++17 -Xcompiler -Wall -Wextra -Wno-unused-parameter --expt-relaxed-constexpr --extended-lambda -arch=sm_75 -gencode=arch=compute_75,code=sm_75 -I./include -I./third_party -isystem ./third_party/DrTopKSC/bitonic/LargerKVersions/largerK/ -I./third_party/DrTopKSC/baseline+filter+beta+shuffle/ -I./third_party/gpu_selection/include -I./third_party/gpu_selection/lib -L/usr/local/cuda/lib64 -lcudart -lcuda -L./lib -lfaiss -Xlinker -rpath=./lib -L./lib -lgpu_selection -Xlinker -rpath=./lib -L./lib -lgridselect -Xlinker -rpath=./lib -g


In [None]:
!ldd topk/lib/lib*.so
#!ldd topk/bin/example_factory_selectk
!cd topk && ldd bin/example_factory_selectk

Supported algorithms:

cub	drtopk_bitonic	drtopk_radix	faiss_block	faiss_warp	grid_select	sampleselect	sampleselect-bucket	sampleselect-quick


In [None]:
!cd topk && bin/example_factory_selectk 10 2 cpu g

size:1000 scores:
-0.928571,-0.9,-0.896552,-0.875,-0.875,-0.870968,-0.866667,-0.866667,-0.866667,-0.857143,-0.857143,-0.857143,-0.83871,-0.83871,-0.83871,-0.818182,-0.818182,-0.8125,-0.8125,-0.8125,-0.8125,-0.8125,-0.806452,-0.8,-0.794118,-0.794118,-0.787879,-0.787879,-0.787879,-0.78125,-0.771429,-0.771429,-0.771429,-0.764706,-0.764706,-0.764706,-0.764706,-0.757576,-0.742857,-0.742857,-0.72973,-0.72973,-0.722222,-0.722222,-0.722222,-0.722222,-0.722222,-0.71875,-0.702703,-0.702703,-0.702703,-0.702703,-0.702703,-0.702703,-0.692308,-0.685714,-0.684211,-0.684211,-0.676471,-0.666667,-0.658537,-0.634146,-0.628571,-0.611111,-0.606061,-0.604651,-0.571429,-0.568182,-0.565217,-0.522727,-0.490566,-0.485714,-0.482759,-0.466667,-0.466667,-0.464286,-0.464286,-0.464286,-0.464286,-0.464286,-0.464286,-0.464286,-0.454545,-0.451613,-0.451613,-0.451613,-0.451613,-0.451613,-0.451613,-0.448276,-0.448276,-0.448276,-0.448276,-0.448276,-0.444444,-0.441176,-0.441176,-0.441176,-0.4375,-0.4375,-0.4375,-0.4375,-0.

In [None]:
!cd topk && make clean_3d_gpu_selection
#!cd topk && make clean_3d_faiss

# optimize

## gpu_baseline

In [5]:
!make -C topk build_cpu_gpu BUILD_TYPE=Release

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./topk.cu -o ./bin/query_doc_scoring_cpu_gpu  \
	-I./ \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-O2 \
	-DGPU \
	-g
make: Leaving directory '/content/topk'


In [6]:
!topk/bin/query_doc_scoring_cpu_gpu STI2/translate/docs.txt STI2/translate/querys ./query_doc_scoring_cpu_gpu.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

## topk_doc_pinned_memory


In [7]:
!make -C topk build_cpu_gpu_pinned_memory BUILD_TYPE=Release

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./topk_pinned_memory.cu -o ./bin/query_doc_scoring_pinned_memory \
	-I./ \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-O2 \
	-DGPU \
	-g
make: Leaving directory '/content/topk'


In [8]:
!topk/bin/query_doc_scoring_pinned_memory STI2/translate/docs.txt STI2/translate/querys ./query_doc_scoring_pin.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [9]:
!topk/bin/query_doc_scoring_pinned_memory STI2/translate/docs.txt STI2/translate/querys ./query_doc_scoring_pin.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [10]:
!make -C topk build_cpu_gpu_pinned_map_memory BUILD_TYPE=Release

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./topk_pinned_memory.cu -o ./bin/query_doc_scoring_pinned_map_memory \
	-I./ \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-O2 \
	-DGPU -DMAP_HOST_MEMORY \
	-g
make: Leaving directory '/content/topk'


In [11]:
!topk/bin/query_doc_scoring_pinned_map_memory STI2/translate/docs.txt STI2/translate/querys ./query_doc_scoring_pinned_map_memory.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

## query_stream

Data in different streams should not depend on each other, so that the streams can run in parallel as much as possible.
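
As a CPU-side analogy only (this is not CUDA stream code), the same constraint can be shown with a thread pool: each query is scored as an independent task that reads the shared documents and writes nothing shared, so tasks can run in any order or in parallel. `best_doc` and `run_independent` are hypothetical names, and the set-based scoring is an assumption of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def best_doc(query: List[int], docs: List[List[int]]) -> int:
    """Pure function: reads the shared docs, returns the best-matching doc index."""
    q = set(query)

    def s(i: int) -> float:
        return len(q & set(docs[i])) / max(len(query), len(docs[i]))

    return max(range(len(docs)), key=s)  # first index wins on ties


def run_independent(queries: List[List[int]], docs: List[List[int]],
                    workers: int = 4) -> List[int]:
    """One task per query; no task depends on another task's output."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda q: best_doc(q, docs), queries))


docs = [[0, 1, 3], [1, 2, 3, 4], [4, 5, 6, 5], [7, 2]]
answers = run_independent([[1, 2, 3], [4, 5]], docs)
print(answers)
```

If one task needed another task's result, the work would have to be serialized, which is the situation the note above says to avoid.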

In [12]:
!make -C topk build_cpu_gpu_query_stream BUILD_TYPE=Release

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./topk_query_stream.cu -o ./bin/query_doc_scoring_cpu_gpu_query_stream  \
	-I./ \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-O2 \
	--default-stream per-thread \
	-DGPU \
	-g
make: Leaving directory '/content/topk'


In [13]:
!topk/bin/query_doc_scoring_cpu_gpu_query_stream STI2/translate/docs.txt STI2/translate/querys ./query_doc_scoring_cpu_gpu_query_stream.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [14]:
!topk/bin/query_doc_scoring_cpu_gpu_query_stream STI2/translate/docs.txt STI2/translate/querys ./query_doc_scoring_cpu_gpu_query_stream.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [15]:
!nsys profile  -o query_doc_scoring_cpu_gpu_query_stream.nsys-rep --force-overwrite true \
  topk/bin/query_doc_scoring_cpu_gpu_query_stream STI2/translate/docs.txt STI2/translate/querys ./query_doc_scoring_cpu_gpu_query_stream.txt

/bin/bash: line 1: nsys: command not found


In [16]:
!topk/bin/query_doc_scoring_cpu_gpu_query_stream STI2/translate/docs.txt STI2/translate/querys ./query_doc_scoring_cpu_gpu_query_stream.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!diff query_doc_scoring_cpu_gpu_query_stream.txt STI2/translate/res/result.txt

1c1
< 2607
---
> 2990


## topk_doc_align_locality


In [17]:
!make -C topk build_cpu_gpu_align_locality BUILD_TYPE=Release

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./topk_doc_align_locality.cu -o ./bin/query_doc_scoring_align_locality \
	-I./ \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-O2 \
	--default-stream per-thread \
	-DGPU \
	-g
make: Leaving directory '/content/topk'


In [18]:
!topk/bin/query_doc_scoring_align_locality STI2/translate/docs.txt STI2/translate/querys ./query_doc_scoring_align_locality.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [19]:
!diff query_doc_scoring_align_locality.txt STI2/translate/res/result.txt

1c1
< 1105
---
> 2990


In [20]:
!make -C topk build_cpu_gpu_pinned_align_locality BUILD_TYPE=Release

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./topk_doc_align_locality.cu -o ./bin/query_doc_scoring_pinned_align_locality \
	-I./ \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-O2 \
	--default-stream per-thread \
	-DGPU -DPINNED_MEMORY \
	-g
make: Leaving directory '/content/topk'


In [21]:
!topk/bin/query_doc_scoring_pinned_align_locality STI2/translate/docs.txt STI2/translate/querys ./query_doc_scoring_pinned_align_locality.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [22]:
!diff query_doc_scoring_pinned_align_locality.txt STI2/translate/res/result.txt

1c1
< 1698
---
> 2990


## topk_doc_stream


In [23]:
!make -C topk build_cpu_gpu_doc_stream BUILD_TYPE=Release

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./topk_doc_stream.cu -o ./bin/query_doc_scoring_cpu_gpu_doc_stream  \
	-I./ \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-O2 \
	--default-stream per-thread \
	-DGPU \
	-g
make: Leaving directory '/content/topk'


In [24]:
!topk/bin/query_doc_scoring_cpu_gpu_doc_stream STI2/translate/docs.txt STI2/translate/querys ./query_doc_scoring_cpu_gpu_doc_stream.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [26]:
!diff query_doc_scoring_cpu_gpu_doc_stream.txt STI2/translate/res/result.txt

1c1
< 2495
---
> 2990


In [25]:
!make -C topk build_cpu_gpu_pinned_doc_stream BUILD_TYPE=Release

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./topk_doc_stream.cu -o ./bin/query_doc_scoring_cpu_gpu_pinned_doc_stream  \
	-I./ \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-O2 \
	--default-stream per-thread \
	-DGPU -DPINNED_MEMORY \
	-g
make: Leaving directory '/content/topk'


In [27]:
!topk/bin/query_doc_scoring_cpu_gpu_pinned_doc_stream STI2/translate/docs.txt STI2/translate/querys ./query_doc_scoring_cpu_gpu_pinned_doc_stream.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [30]:
!diff query_doc_scoring_cpu_gpu_pinned_doc_stream.txt STI2/translate/res/result.txt

1c1
< 2448
---
> 2990


In [None]:
!nsys profile  -o query_doc_scoring_cpu_gpu_pinned_doc_stream.nsys-rep --force-overwrite true \
  topk/bin/query_doc_scoring_cpu_gpu_pinned_doc_stream STI2/translate/docs.txt STI2/translate/querys ./query_doc_scoring_cpu_gpu_pinned_doc_stream.txt

## topk_hashtable


In [28]:
!make -C topk build_cpu_gpu_hashtable BUILD_TYPE=Release

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./topk_hashtable.cu -o ./bin/query_doc_scoring_cpu_gpu_hashtable \
	-I./ \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-O2 \
	-DGPU \
	-g
make: Leaving directory '/content/topk'


In [29]:
!topk/bin/query_doc_scoring_cpu_gpu_hashtable STI2/translate/docs.txt STI2/translate/querys ./query_doc_scoring_cpu_gpu_hashtable.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [32]:
!diff query_doc_scoring_cpu_gpu_hashtable.txt STI2/translate/res/result.txt

1c1
< 2916
---
> 2990
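
For reference, a minimal CPU sketch of the lookup-table idea behind this variant. It is not the code in `topk_hashtable.cu`: the helper names `build_query_table` and `score_doc` are hypothetical, and it assumes ids lie in 0-50000 and are distinct within a doc. The score follows the task definition: number of intersections divided by max(query_size, doc_size).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the repo): mark every query id in a flat
// lookup table covering the assumed id range [0, max_id].
std::vector<uint8_t> build_query_table(const std::vector<int>& query, int max_id = 50000) {
    std::vector<uint8_t> table(static_cast<std::size_t>(max_id) + 1, 0);
    for (int id : query) {
        if (id >= 0 && id <= max_id) table[id] = 1;  // ignore out-of-range ids
    }
    return table;
}

// Hypothetical helper: score = |query ∩ doc| / max(query_size, doc_size).
// Assumes the ids inside one doc are distinct, so each hit counts once.
float score_doc(const std::vector<uint8_t>& table, std::size_t query_size,
                const std::vector<int>& doc) {
    const std::size_t denom = std::max(query_size, doc.size());
    if (denom == 0) return 0.0f;  // empty query and empty doc
    int hits = 0;
    for (int id : doc) {
        if (id >= 0 && static_cast<std::size_t>(id) < table.size()) hits += table[id];
    }
    return static_cast<float>(hits) / static_cast<float>(denom);
}
```

Usage: build the table once per query, then call `score_doc` for every doc. Each lookup is O(1), so one doc costs O(doc_size) instead of a nested loop or merge over both vectors. On the GPU the same table could live in global or shared memory, as listed in the optimize notes.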


## topk_raft_selectk

In [31]:
!make -C topk build_gpu_raft_selectk BUILD_TYPE=Release  RAPIDSAI_DIR=$HOME/rapidsai NVCCSTD=c++17

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./topk_raft_selectk.cu -o ./bin/query_doc_scoring_gpu_raft_selectk \
	-I./ \
	-std=c++17 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-L/usr/local/cuda/lib64 -lcudart -lcuda  \
	-L/root/rapidsai/lib -lraft -I/root/rapidsai/include \
	-O2 \
	--default-stream per-thread \
	-DGPU -DFMT_HEADER_ONLY \
	-g
make: Leaving directory '/content/topk'


In [41]:
!topk/bin/query_doc_scoring_gpu_raft_selectk STI2/translate/docs.txt STI2/translate/querys ./query_doc_scoring_gpu_raft_selectk.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [40]:
!diff query_doc_scoring_gpu_raft_selectk.txt STI2/translate/res/result.txt

1c1
< 2284
---
> 2990
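
For comparison, a CPU-side sketch of the select-k step using `std::partial_sort` (the heap-style top-k mentioned in the optimize notes). This is not RAFT's API: `topk_indices` is a hypothetical helper, and the tie-break rule (lower doc index first) is an assumption rather than something confirmed by the expected result file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: return the indices of the k highest scores,
// ordered by descending score; ties go to the lower index (assumed rule).
std::vector<int> topk_indices(const std::vector<float>& scores, std::size_t k) {
    std::vector<int> idx(scores.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = static_cast<int>(i);
    k = std::min(k, idx.size());
    // Only the first k positions end up sorted; the rest stay unordered.
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&scores](int a, int b) {
                          if (scores[a] != scores[b]) return scores[a] > scores[b];
                          return a < b;
                      });
    idx.resize(k);
    return idx;
}
```

With k=100 and millions of docs, this does O(n log k) work on one core. A GPU select-k spreads the same selection across many threads, which is the motivation for this variant.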


## topk_doc_cudf_strings

In [36]:
!make -C topk build_gpu_cudf_strings BUILD_TYPE=Release  RAPIDSAI_DIR=$HOME/rapidsai NVCCSTD=c++17

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./readfile.cu ./topk_doc_cudf_strings.cu -o ./bin/query_doc_scoring_gpu_cudf_strings \
	-I./ \
	-std=c++17 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-L/usr/local/cuda/lib64 -lcudart -lcuda  \
	-L/root/rapidsai/lib -lcudf -I/root/rapidsai/include  \
	-O2 \
	--default-stream per-thread \
	-DFMT_HEADER_ONLY -DGPU -DPIO_TOPK \
	-g
make: Leaving directory '/content/topk'


In [37]:
!topk/bin/query_doc_scoring_gpu_cudf_strings STI2/translate/docs.txt STI2/translate/querys ./query_doc_scoring_gpu_cudf_strings.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [39]:
!diff query_doc_scoring_gpu_cudf_strings.txt STI2/translate/res/result.txt

1c1
< 0
---
> 2990


## topk_doc_cudf_strings_raft_selectk

In [42]:
!make -C topk build_gpu_cudf_strings_raft_selectk BUILD_TYPE=Release  RAPIDSAI_DIR=$HOME/rapidsai NVCCSTD=c++17


make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./readfile.cu ./topk_doc_cudf_strings_raft_selectk.cu -o ./bin/query_doc_scoring_gpu_cudf_strings_raft_selectk \
	-I./ \
	-std=c++17 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-L/usr/local/cuda/lib64 -lcudart -lcuda  \
	-L/root/rapidsai/lib -lcudf -I/root/rapidsai/include  \
	-L/root/rapidsai/lib -lraft -I/root/rapidsai/include \
	-O2 \
	--default-stream per-thread \
	-DFMT_HEADER_ONLY -DGPU -DPIO_TOPK \
	-g
make: Leaving directory '/content/topk'


In [47]:
!topk/bin/query_doc_scoring_gpu_cudf_strings_raft_selectk STI2/translate/docs.txt STI2/translate/querys \
 ./query_doc_scoring_gpu_cudf_strings_raft_selectk.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [44]:
!diff query_doc_scoring_gpu_cudf_strings_raft_selectk.txt STI2/translate/res/result.txt

1c1
< 0
---
> 2990


## topk_cpu_concurrency_gpu_cudf_strings_raft_selectk

In [45]:
!make -C topk build_cpu_concurrency_gpu_cudf_strings_raft_selectk BUILD_TYPE=Release  RAPIDSAI_DIR=$HOME/rapidsai NVCCSTD=c++17

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./readfile.cu ./topk_doc_cudf_strings_raft_selectk.cu -o ./bin/query_doc_scoring_cpu_concurrency_gpu_cudf_strings_raft_selectk \
	-I./ \
	-std=c++17 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-L/usr/local/cuda/lib64 -lcudart -lcuda  \
	-L/root/rapidsai/lib -lcudf -I/root/rapidsai/include  \
	-L/root/rapidsai/lib -lraft -I/root/rapidsai/include \
	-O2 \
	--default-stream per-thread \
	-DFMT_HEADER_ONLY -DGPU -DPIO_TOPK -DPIO_CPU_CONCURRENCY \
	-g
make: Leaving directory '/content/topk'


In [None]:
!topk/bin/query_doc_scoring_cpu_concurrency_gpu_cudf_strings_raft_selectk STI2/translate/docs.txt STI2/translate/querys \
 ./cpu_concurrency_gpu_cudf_strings_raft_selectk.txt

In [None]:
!diff cpu_concurrency_gpu_cudf_strings_raft_selectk.txt STI2/translate/res/result.txt


## topk_doc_align_locality_query_stream_raft_selectk

In [48]:
!make -C topk build_gpu_doc_align_locality_query_stream_raft_selectk BUILD_TYPE=Release  RAPIDSAI_DIR=$HOME/rapidsai NVCCSTD=c++17

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./topk_doc_align_locality_query_stream_raft_selectk.cu -o ./bin/query_doc_scoring_gpu_doc_align_locality_query_stream_raft_selectk \
	-I./ \
	-std=c++17 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-L/usr/local/cuda/lib64 -lcudart -lcuda  \
	-L/root/rapidsai/lib -lraft -I/root/rapidsai/include \
	-O2 \
	--default-stream per-thread \
	-DGPU -DFMT_HEADER_ONLY \
	-g
make: Leaving directory '/content/topk'


In [50]:
!topk/bin/query_doc_scoring_gpu_doc_align_locality_query_stream_raft_selectk STI2/translate/docs.txt STI2/translate/querys \
 ./query_doc_scoring_gpu_doc_align_locality_query_stream_raft_selectk.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [51]:
!diff query_doc_scoring_gpu_doc_align_locality_query_stream_raft_selectk.txt STI2/translate/res/result.txt

1c1
< 852
---
> 2990


In [None]:
!nsys profile  -o query_doc_scoring_gpu_doc_align_locality_query_stream_raft_selectk.nsys-rep --force-overwrite true \
  topk/bin/query_doc_scoring_gpu_doc_align_locality_query_stream_raft_selectk STI2/translate/docs.txt STI2/translate/querys \
  ./query_doc_scoring_gpu_doc_align_locality_query_stream_raft_selectk.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [54]:
!make -C topk build_gpu_pinned_doc_align_locality_query_stream_raft_selectk BUILD_TYPE=Release  RAPIDSAI_DIR=$HOME/rapidsai NVCCSTD=c++17


make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./topk_doc_align_locality_query_stream_raft_selectk.cu -o ./bin/query_doc_scoring_gpu_pinned_doc_align_locality_query_stream_raft_selectk \
	-I./ \
	-std=c++17 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-L/usr/local/cuda/lib64 -lcudart -lcuda  \
	-L/root/rapidsai/lib -lraft -I/root/rapidsai/include \
	-O2 \
	--default-stream per-thread \
	-DGPU -DFMT_HEADER_ONLY -DPINNED_MEMORY \
	-g
make: Leaving directory '/content/topk'


In [55]:
!topk/bin/query_doc_scoring_gpu_pinned_doc_align_locality_query_stream_raft_selectk STI2/translate/docs.txt STI2/translate/querys \
 ./query_doc_scoring_gpu_pinned_doc_align_locality_query_stream_raft_selectk.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!nsys profile  -o query_doc_scoring_gpu_pinned_doc_align_locality_query_stream_raft_selectk.nsys-rep --force-overwrite true \
  topk/bin/query_doc_scoring_gpu_pinned_doc_align_locality_query_stream_raft_selectk STI2/translate/docs.txt STI2/translate/querys \
  ./query_doc_scoring_gpu_pinned_doc_align_locality_query_stream_raft_selectk.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [56]:
!diff query_doc_scoring_gpu_pinned_doc_align_locality_query_stream_raft_selectk.txt STI2/translate/res/result.txt

1c1
< 1552
---
> 2990


## topk_readfile_gpu_baseline

In [57]:
!make -C topk build_cpu_gpu_readfile BUILD_TYPE=Release NVCCSTD=c++17 RAPIDSAI_DIR=$HOME/rapidsai

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./readfile.cu ./topk.cu -o ./bin/query_doc_scoring_cpu_gpu_readfile \
	-I./ \
	-std=c++17 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-L/usr/local/cuda/lib64 -lcudart -lcuda  \
	-L/root/rapidsai/lib -lcudf -I/root/rapidsai/include  \
	-O2 \
	-DGPU -DFMT_HEADER_ONLY -DPIO \
	-g
make: Leaving directory '/content/topk'


In [58]:
!topk/bin/query_doc_scoring_cpu_gpu_readfile STI2/translate/docs.txt STI2/translate/querys \
 ./query_doc_scoring_cpu_gpu_readfile.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [59]:
!diff query_doc_scoring_cpu_gpu_readfile.txt STI2/translate/res/result.txt

1c1
< 2359
---
> 2990


## topk_readfile_doc_align_locality_query_stream_raft_selectk

In [60]:
!make -C topk build_gpu_readfile_doc_align_locality_query_stream_raft_selectk BUILD_TYPE=Release  RAPIDSAI_DIR=$HOME/rapidsai NVCCSTD=c++17

make: Entering directory '/content/topk'
mkdir -p bin
/usr/local/cuda/bin/nvcc ./main.cpp ./readfile.cu ./topk_doc_align_locality_query_stream_raft_selectk.cu \
	-o ./bin/query_doc_scoring_gpu_readfile_doc_align_locality_query_stream_raft_selectk \
	-I./ \
	-std=c++17 -Xcompiler="-fopenmp" --expt-relaxed-constexpr --extended-lambda -arch=sm_70 -gencode=arch=compute_70,code=sm_70  \
	-L/usr/local/cuda/lib64 -lcudart -lcuda  \
	-L/root/rapidsai/lib -lcudf -I/root/rapidsai/include  \
	-L/root/rapidsai/lib -lraft -I/root/rapidsai/include \
	-O2 \
	--default-stream per-thread \
	-DGPU -DFMT_HEADER_ONLY -DPIO \
	-g
make: Leaving directory '/content/topk'


In [61]:
#A100
!topk/bin/query_doc_scoring_gpu_readfile_doc_align_locality_query_stream_raft_selectk STI2/translate/docs.txt STI2/translate/querys \
 ./query_doc_scoring_gpu_readfile_doc_align_locality_query_stream_raft_selectk.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
#T4
!topk/bin/query_doc_scoring_gpu_readfile_doc_align_locality_query_stream_raft_selectk STI2/translate/docs.txt STI2/translate/querys \
 ./query_doc_scoring_gpu_readfile_doc_align_locality_query_stream_raft_selectk.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!nsys profile  -o query_doc_scoring_gpu_readfile_doc_align_locality_query_stream_raft_selectk.nsys-rep --force-overwrite true \
  topk/bin/query_doc_scoring_gpu_readfile_doc_align_locality_query_stream_raft_selectk STI2/translate/docs.txt STI2/translate/querys \
  ./query_doc_scoring_gpu_readfile_doc_align_locality_query_stream_raft_selectk.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [62]:
!diff query_doc_scoring_gpu_readfile_doc_align_locality_query_stream_raft_selectk.txt STI2/translate/res/result.txt

1c1
< 719
---
> 2990


# hpc nvc openmp

## install hpc-compiler


In [None]:
!curl https://developer.download.nvidia.com/hpc-sdk/ubuntu/DEB-GPG-KEY-NVIDIA-HPC-SDK | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-hpcsdk-archive-keyring.gpg
!echo 'deb [signed-by=/usr/share/keyrings/nvidia-hpcsdk-archive-keyring.gpg] https://developer.download.nvidia.com/hpc-sdk/ubuntu/amd64 /' | sudo tee /etc/apt/sources.list.d/nvhpc.list
!sudo apt-get update -y


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1626  100  1626    0     0  14155      0 --:--:-- --:--:-- --:--:-- 14263
deb [signed-by=/usr/share/keyrings/nvidia-hpcsdk-archive-keyring.gpg] https://developer.download.nvidia.com/hpc-sdk/ubuntu/amd64 /
Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:4 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:6 https://developer.download.nvidia.com/hpc-sdk/ubuntu/amd64  InRelease [2,126 B]
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Get:8 https://ppa.launc

In [None]:
!sudo apt-get install -y nvhpc-23-9

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  nvhpc-23-9
0 upgraded, 1 newly installed, 0 to remove and 32 not upgraded.
Need to get 3,331 MB of archives.
After this operation, 12.3 GB of additional disk space will be used.
Get:1 https://developer.download.nvidia.com/hpc-sdk/ubuntu/amd64  nvhpc-23-9 23.9 [3,331 MB]
Fetched 3,331 MB in 2min 1s (27.5 MB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package nvhpc-23-9.
(Reading database ... 135277 files and dire

In [None]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/2023/compilers/bin/nvcc --version
!/opt/nvidia/hpc_sdk/Linux_x86_64/2022/compilers/bin/nvcc --version
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0


In [None]:
!ls /opt/nvidia/hpc_sdk/Linux_x86_64/2023/cuda/

12.2  bin  include  lib64  nvvm


In [None]:
!export PATH=$PATH:/opt/nvidia/hpc_sdk/Linux_x86_64/2023/compilers/bin/ && nvc --version


nvc 22.11-0 64-bit target on x86-64 Linux -tp skylake-avx512 
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.


## nvc openmp tutorial

In [None]:
!git clone https://github.com/UoB-HPC/openmp-tutorial.git

Cloning into 'openmp-tutorial'...
remote: Enumerating objects: 797, done.
remote: Counting objects: 100% (173/173), done.
remote: Compressing objects: 100% (101/101), done.
remote: Total 797 (delta 131), reused 99 (delta 72), pack-reused 624
Receiving objects: 100% (797/797), 147.51 MiB | 1.83 MiB/s, done.
Resolving deltas: 100% (570/570), done.


In [None]:
!export PATH=$PATH:/opt/nvidia/hpc_sdk/Linux_x86_64/2022/compilers/bin/ && make -C openmp-tutorial

make: Entering directory '/content/openmp-tutorial'
nvc -fast -mp=gpu -gpu=cc75 -c pi.c
nvc -fast -mp=gpu -gpu=cc75 -o pi pi.o 
nvc -fast -mp=gpu -gpu=cc75 -c jac_solv.c
nvc -fast -mp=gpu -gpu=cc75 -c mm_utils.c
nvc -fast -mp=gpu -gpu=cc75 -o jac_solv jac_solv.o mm_utils.o 
nvc -fast -mp=gpu -gpu=cc75 -c vadd.c
nvc -fast -mp=gpu -gpu=cc75 -o vadd vadd.o 
nvc -fast -mp=gpu -gpu=cc75 -c vadd_heap.c
nvc -fast -mp=gpu -gpu=cc75 -o vadd_heap vadd_heap.o 
nvc -fast -mp=gpu -gpu=cc75 -c heat.c
nvc -fast -mp=gpu -gpu=cc75 -o heat heat.o 
nvc -fast -mp=gpu -gpu=cc75 -c heat_map.c
nvc -fast -mp=gpu -gpu=cc75 -o heat_map heat_map.o 
make: Leaving directory '/content/openmp-tutorial'


In [None]:
!cd openmp-tutorial && ./vadd
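
A rough sketch of what a vector add with OpenMP target offload looks like. It is written from scratch, not copied from the tutorial's `vadd.c`, and the function name and signature are assumptions. When the compiler has no OpenMP support enabled, the pragma is ignored and the loop runs serially on the host with the same result.

```cpp
#include <cassert>
#include <vector>

// Hypothetical vector add: c[i] = a[i] + b[i].
// With an offload-capable build (e.g. -mp=gpu as in the make log above),
// the map clauses copy a and b to the device and copy c back afterwards.
void vadd(const float* a, const float* b, float* c, int n) {
    #pragma omp target teams distribute parallel for \
        map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

The `map(to: ...)` / `map(from: ...)` clauses make the host-device copies explicit, which plays the same role as the `cudaMemcpy` calls in the CUDA variants above.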

# test

In [None]:
!cd topk && nvcc -o topk_test topk_test.cu

In [None]:
!cd topk && ./topk_test

array: 
383 886 777 915 793 335 386 492 649 421 362 27 690 59 763 926 540 426 172 736 211 368 567 429 782 530 862 123 67 135 929 802 22 58 69 167 393 456 11 42 229 373 421 919 784 537 198 324 315 370 413 526 91 980 956 873 862 170 996 281 305 925 84 327 336 505 846 729 313 857 124 895 582 545 814 367 434 364 43 750 87 808 276 178 788 584 403 651 754 399 932 60 676 368 739 12 226 586 94 539 795 570 434 378 467 601 97 902 317 492 652 756 301 280 286 441 865 689 444 619 440 729 31 117 97 771 481 675 709 927 567 856 497 353 586 965 306 683 219 624 528 871 732 829 503 19 270 368 708 715 340 149 796 723 618 245 846 451 921 555 379 488 764 228 841 350 193 500 34 764 124 914 987 856 743 491 227 365 859 936 432 551 437 228 275 407 474 121 858 395 29 237 235 793 818 428 143 11 928 529 776 404 443 763 613 538 606 840 904 818 128 688 369 917 917 996 324 743 470 183 490 499 772 725 644 590 505 139 954 786 669 82 542 464 197 507 355 804 348 611 622 828 299 343 746 568 340 422 311 810 605 801 661 730