Add random subsampling for IVF methods #2077

Merged
merged 11 commits into from
Jan 23, 2024

Conversation

tfeher
Contributor

@tfeher tfeher commented Jan 3, 2024

While building IVF-Flat or IVF-PQ indices, we usually subsample the dataset to create a smaller training set for k-means clustering. Until now this subsampling used a fixed stride; this PR changes it to random subsampling.

The input is always randomized, even if all the vectors of the dataset are used.
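As a rough illustration of the difference between the two selection schemes (plain Python rather than the RAFT code; the function names and the `ratio` parameter are hypothetical), a minimal sketch might look like this:

```python
import random

def strided_subsample(n_rows, ratio):
    """Fixed-stride selection: every `ratio`-th row index, starting at 0."""
    return list(range(0, n_rows, ratio))[: n_rows // ratio]

def random_subsample(n_rows, ratio, seed=137):
    """Uniform sample without replacement of n_rows // ratio row indices.

    With ratio == 1 this returns a random permutation of all rows, so the
    training input is shuffled even when every vector is used.
    """
    rng = random.Random(seed)
    return rng.sample(range(n_rows), n_rows // ratio)

print(strided_subsample(10, 2))  # [0, 2, 4, 6, 8]
print(random_subsample(10, 2))   # 5 distinct indices in random order
```

The strided version always picks the same rows, so any ordering in the dataset file leaks into the training set; the random version removes that dependence at the cost of a scattered memory access pattern.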

Random sampling adds an overhead that is proportional to the training set size. If the dataset is on the host, this overhead can be partially or completely masked by the H2D transfer. The overhead is small compared to k-means training.

To completely overlap random sampling of the data with H2D copies, we utilize OpenMP parallelization to increase the effective bandwidth for gathering the data.
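The work split behind a parallel gather can be sketched as follows. This is an illustration only, in plain Python with a thread pool standing in for an OpenMP `parallel for`; `gather_rows` and `n_threads` are hypothetical names, and Python threads will not show the bandwidth gain that native threads give.

```python
from concurrent.futures import ThreadPoolExecutor

def gather_rows(dataset, indices, n_threads=4):
    """Copy dataset[i] for each i in indices into a new list, keeping the order."""
    out = [None] * len(indices)

    def copy_chunk(start, stop):
        # Each worker fills a disjoint slice of the output buffer.
        for pos in range(start, stop):
            out[pos] = list(dataset[indices[pos]])

    chunk = max(1, -(-len(indices) // n_threads))  # ceiling division
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(copy_chunk, s, min(s + chunk, len(indices)))
            for s in range(0, len(indices), chunk)
        ]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return out

dataset = [[i, i * 10] for i in range(6)]
print(gather_rows(dataset, [5, 0, 3], n_threads=2))  # [[5, 50], [0, 0], [3, 30]]
```

Because every worker writes to its own output range, no locking is needed; the same property is what makes the gather easy to parallelize with a static loop schedule.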

@tfeher added the feature request, non-breaking, and Vector Search labels on Jan 3, 2024
@tfeher tfeher requested review from a team as code owners January 3, 2024 10:33
@tfeher
Contributor Author

tfeher commented Jan 3, 2024

The build time increases slightly when random subsampling is enabled. The measurements below are for IVF-Flat on an H100 with subsets of the DEEP dataset; random_seed=-1 denotes the original fixed-stride subsampling.

| build time (s) | index_size | nlist | random_seed | ratio |
|---|---|---|---|---|
| 0.37 | 1.00E+06 | 5000 | -1 | 2 |
| 0.43 | 1.00E+06 | 5000 | 137 | 2 |
| 46.7 | 1.00E+08 | 50000 | -1 | 10 |
| 47.22 | 1.00E+08 | 50000 | 137 | 10 |
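For reference, the relative overhead implied by these build times can be computed directly from the numbers in the table (the helper name is only for illustration):

```python
def relative_overhead(baseline_s, random_s):
    """Extra build time of random subsampling as a fraction of the baseline."""
    return (random_s - baseline_s) / baseline_s

small = relative_overhead(0.37, 0.43)    # 1M vectors, nlist=5000
large = relative_overhead(46.7, 47.22)   # 100M vectors, nlist=50000

print(f"1M vectors:   {small:.1%}")  # 16.2%
print(f"100M vectors: {large:.1%}")  # 1.1%
```

So the absolute cost grows with the training set, but as a share of total build time it shrinks considerably on the larger index.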

@tfeher
Contributor Author

tfeher commented Jan 3, 2024

There is also slight variation in recall. DEEP-100M, IVF-Flat:

| n_probes | recall orig | recall random | diff |
|---|---|---|---|
| 20 | 85.68% | 85.57% | -0.11% |
| 30 | 90.02% | 90.00% | -0.02% |
| 40 | 92.44% | 92.48% | 0.04% |
| 50 | 94.09% | 94.03% | -0.06% |
| 100 | 97.37% | 97.36% | -0.02% |
| 200 | 98.93% | 98.98% | 0.05% |
| 500 | 99.70% | 99.74% | 0.03% |
| 1000 | 99.88% | 99.88% | 0.00% |

IVF-PQ for the same dataset

| nprobe | recall orig | recall random | recall diff |
|---|---|---|---|
| 20 | 85.57% | 85.46% | -0.11% |
| 30 | 90.02% | 89.91% | -0.11% |
| 40 | 92.48% | 92.32% | -0.16% |
| 50 | 94.05% | 93.91% | -0.14% |
| 100 | 97.28% | 97.24% | -0.04% |
| 200 | 98.87% | 98.80% | -0.07% |
| 1000 | 99.74% | 99.74% | 0.00% |
| 2000 | 99.78% | 99.78% | 0.00% |
| 5000 | 99.79% | 99.78% | -0.01% |

@tfeher
Contributor Author

tfeher commented Jan 3, 2024

Tagging @abc99lr to rebase #2052 on this and run tests.

@tfeher tfeher self-assigned this Jan 3, 2024
Contributor

@lowener lowener left a comment

Nice change! Is there a reason why subsample is added to the raft::spatial::knn::detail::utils namespace? It is used in raft::neighbors, so it would make sense to add it under raft::neighbors::detail::utils.

Review thread on cpp/include/raft/spatial/knn/detail/ann_utils.cuh (outdated, resolved)
@tfeher
Contributor Author

tfeher commented Jan 17, 2024

To establish a baseline for the expected variation, I ran tests with the original RAFT code (fixed-stride subsampling), but with input data where the vectors are shuffled.

The tables show recall values in percent (%). Index building and search were run with 10 different permutations of the input, and statistics on the recall variation are presented in the tables below. The dataset is DEEP-10M and the subsample ratio is 10.

When the number of clusters is larger (while keeping the number of probes fixed), the variation in the results is larger.

IVF-Flat 1k clusters

| nprobe | mean | std | min | max | min-max diff |
|---|---|---|---|---|---|
| 20 | 97.3981 | 0.055991 | 97.322 | 97.5 | 0.178 |
| 30 | 98.6231 | 0.029797 | 98.576 | 98.667 | 0.091 |
| 40 | 99.1678 | 0.022685 | 99.135 | 99.216 | 0.081 |
| 50 | 99.4493 | 0.024459 | 99.427 | 99.508 | 0.081 |
| 100 | 99.8575 | 0.008114 | 99.844 | 99.871 | 0.027 |
| 200 | 99.9432 | 0.004614 | 99.934 | 99.95 | 0.016 |
| 500 | 99.9579 | 0.002132 | 99.954 | 99.96 | 0.006 |
| 1000 | 99.9581 | 0.002132 | 99.954 | 99.961 | 0.007 |

IVF-Flat 10k clusters

| nprobe | mean | std | min | max | min-max diff |
|---|---|---|---|---|---|
| 20 | 88.9535 | 0.144761 | 88.737 | 89.192 | 0.455 |
| 30 | 92.7403 | 0.098084 | 92.581 | 92.892 | 0.311 |
| 40 | 94.7593 | 0.083885 | 94.633 | 94.871 | 0.238 |
| 50 | 96.0016 | 0.072469 | 95.893 | 96.109 | 0.216 |
| 100 | 98.4533 | 0.040351 | 98.396 | 98.523 | 0.127 |
| 200 | 99.4852 | 0.024225 | 99.453 | 99.519 | 0.066 |
| 500 | 99.8841 | 0.008252 | 99.871 | 99.897 | 0.026 |
| 1000 | 99.9466 | 0.003978 | 99.939 | 99.952 | 0.013 |

IVF-PQ 1k clusters, dim_pq=64, pq_bits=5

| nprobe | mean | std | min | max | min-max diff |
|---|---|---|---|---|---|
| 20 | 73.0448 | 0.090379 | 72.907 | 73.163 | 0.256 |
| 30 | 73.5117 | 0.084429 | 73.4 | 73.636 | 0.236 |
| 40 | 73.6941 | 0.083665 | 73.575 | 73.817 | 0.242 |
| 50 | 73.7857 | 0.088112 | 73.664 | 73.892 | 0.228 |
| 100 | 73.9107 | 0.084362 | 73.803 | 74.021 | 0.218 |
| 200 | 73.9334 | 0.084866 | 73.831 | 74.042 | 0.211 |
| 1000 | 73.931 | 0.083273 | 73.83 | 74.038 | 0.208 |

IVF-PQ 1k clusters, dim_pq=64, pq_bits=8

| nprobe | mean | std | min | max | min-max diff |
|---|---|---|---|---|---|
| 20 | 87.8706 | 0.061715 | 87.775 | 87.964 | 0.189 |
| 30 | 88.5719 | 0.063796 | 88.49 | 88.671 | 0.181 |
| 40 | 88.8595 | 0.059461 | 88.759 | 88.965 | 0.206 |
| 50 | 89.0077 | 0.064338 | 88.906 | 89.134 | 0.228 |
| 100 | 89.2005 | 0.063746 | 89.11 | 89.314 | 0.204 |
| 200 | 89.2359 | 0.062199 | 89.149 | 89.341 | 0.192 |
| 1000 | 89.2373 | 0.065332 | 89.144 | 89.342 | 0.198 |

IVF-PQ 10k clusters, dim_pq=96, pq_bits=8

| nprobe | mean | std | min | max | min-max diff |
|---|---|---|---|---|---|
| 20 | 83.026 | 0.100094 | 82.856 | 83.223 | 0.367 |
| 30 | 85.7558 | 0.068037 | 85.663 | 85.856 | 0.193 |
| 40 | 87.1303 | 0.078033 | 86.974 | 87.235 | 0.261 |
| 50 | 87.9287 | 0.070296 | 87.765 | 88.009 | 0.244 |
| 100 | 89.3685 | 0.065758 | 89.221 | 89.449 | 0.228 |
| 200 | 89.9053 | 0.063934 | 89.775 | 89.988 | 0.213 |
| 1000 | 90.1055 | 0.062324 | 90.012 | 90.197 | 0.185 |
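Each row in the tables above condenses the recall of the 10 runs into five numbers. A small sketch of that reduction is shown here, assuming the std column is the sample standard deviation (n - 1 in the denominator), which the comment does not state; the input values are placeholders, not measurements from this PR.

```python
import statistics

def summarize(recalls):
    """Reduce per-run recall values (%) to the columns used in the tables."""
    lo, hi = min(recalls), max(recalls)
    return {
        "mean": statistics.mean(recalls),
        "std": statistics.stdev(recalls),  # sample std (n - 1); an assumption
        "min": lo,
        "max": hi,
        "min-max diff": hi - lo,
    }

# Placeholder recall values (%) for illustration only, not measured data.
runs = [97.3, 97.5, 97.4, 97.4]
stats = summarize(runs)
print({name: round(value, 4) for name, value in stats.items()})
```

The min-max diff column is simply max minus min, which can be checked against any table row, for example 97.5 - 97.322 = 0.178 in the first row of the IVF-Flat 1k table.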

@tfeher
Contributor Author

tfeher commented Jan 17, 2024

I have investigated random subsampling versus shuffling the input data and keeping fixed-stride subsampling. In both cases the recall fluctuates with a std of about 0.14 or less. I compared the mean recall of the two methods over 10 iterations. The difference between the mean recalls is at most 0.03 percentage points in absolute value. I would conclude that there is no significant change in recall.

Example results: 10 build/search iterations were run, each with a different random seed. The table shows recall values and their variation for (orig) the original code with a shuffled dataset file and (this PR) random subsampling. The last column is the difference between the mean recall values. All recall values are in percent.

The table shows results for deep-10M. Results for bigANN-10M are similar.

| index | nprobe | recall mean (orig) | std (orig) | min-max diff (orig) | recall mean (this PR) | std (this PR) | min-max diff (this PR) | mean recall diff (this PR - orig) |
|---|---|---|---|---|---|---|---|---|
| ivf-flat 10K | 20 | 88.95 | 0.14 | 0.46 | 88.93 | 0.10 | 0.29 | -0.03 |
| ivf-flat 10K | 30 | 92.74 | 0.10 | 0.31 | 92.73 | 0.08 | 0.24 | -0.01 |
| ivf-flat 10K | 40 | 94.76 | 0.08 | 0.24 | 94.77 | 0.08 | 0.24 | 0.01 |
| ivf-flat 10K | 50 | 96.00 | 0.07 | 0.22 | 96.01 | 0.09 | 0.27 | 0.01 |
| ivf-flat 10K | 100 | 98.45 | 0.04 | 0.13 | 98.46 | 0.05 | 0.16 | 0.01 |
| ivf-flat 10K | 200 | 99.49 | 0.02 | 0.07 | 99.49 | 0.02 | 0.08 | 0.00 |
| ivf-flat 10K | 500 | 99.88 | 0.01 | 0.03 | 99.89 | 0.01 | 0.03 | 0.00 |
| ivf-flat 10K | 1000 | 99.95 | 0.00 | 0.01 | 99.95 | 0.00 | 0.01 | 0.00 |
| ivf-PQ d64b5n1K | 20 | 73.04 | 0.09 | 0.26 | 73.06 | 0.10 | 0.33 | 0.01 |
| ivf-PQ d64b5n1K | 30 | 73.51 | 0.08 | 0.24 | 73.52 | 0.10 | 0.34 | 0.01 |
| ivf-PQ d64b5n1K | 40 | 73.69 | 0.08 | 0.24 | 73.70 | 0.10 | 0.33 | 0.01 |
| ivf-PQ d64b5n1K | 50 | 73.79 | 0.09 | 0.23 | 73.79 | 0.10 | 0.33 | 0.01 |
| ivf-PQ d64b5n1K | 100 | 73.91 | 0.08 | 0.22 | 73.91 | 0.11 | 0.36 | 0.00 |
| ivf-PQ d64b5n1K | 200 | 73.93 | 0.08 | 0.21 | 73.93 | 0.10 | 0.34 | 0.00 |
| ivf-PQ d64b5n1K | 1000 | 73.93 | 0.08 | 0.21 | 73.94 | 0.10 | 0.34 | 0.01 |
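One way to put the "no significant change" reading on a numeric footing is a Welch-style t statistic on the rounded table values. This is a back-of-the-envelope check that was not part of the original analysis; it assumes the 10 runs per method are independent and that the std columns are sample standard deviations.

```python
import math

def welch_t(mean_a, std_a, mean_b, std_b, n_a=10, n_b=10):
    """Difference of means (b - a) divided by its estimated standard error."""
    se = math.sqrt(std_a**2 / n_a + std_b**2 / n_b)
    return (mean_b - mean_a) / se

# ivf-flat 10K, nprobe=20: orig 88.95 +/- 0.14, this PR 88.93 +/- 0.10
t = welch_t(88.95, 0.14, 88.93, 0.10)
print(f"t = {t:.2f}")
```

For this row, which has the largest mean difference in the table, the magnitude of t stays well below 1, so the difference between the means is smaller than the run-to-run noise.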

@github-actions github-actions bot removed the python label Jan 18, 2024
Contributor Author

@tfeher tfeher left a comment

Thanks @lowener for the review!

why subsample is added to raft::spatial::knn::detail::utils namespace?

Currently the helper function is still in the old namespace. I have started a separate branch to move the ann_utils.cuh file to the neighbors namespace, and I will submit a separate PR for 24.04.

Review thread on cpp/include/raft/spatial/knn/detail/ann_utils.cuh (outdated, resolved)
Review thread on cpp/include/raft/neighbors/ivf_flat_types.hpp (outdated, resolved)
@tfeher
Contributor Author

tfeher commented Jan 18, 2024

Currently the overhead of the random subsampling is larger than ideal. I am running tests to quantify it.

@tfeher
Contributor Author

tfeher commented Jan 21, 2024

Overhead reduced. I switched to a batchwise copy of the data: this reduces the time spent allocating temporary buffers and makes it possible to overlap gathering the vectors with host-to-device copies.
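The batchwise scheme can be pictured as a small producer/consumer pipeline. The sketch below is plain Python, not the RAFT implementation: `pipelined_copy` is a hypothetical name, and `copy_to_device` is a stand-in callable for the real host-to-device transfer.

```python
import queue
import threading

def pipelined_copy(dataset, indices, batch_size, copy_to_device):
    """Gather rows batch by batch while a second thread copies finished batches."""
    ready = queue.Queue(maxsize=2)  # at most two staged batches, like double buffering
    errors = []

    def consumer():
        while True:
            batch = ready.get()
            if batch is None:  # sentinel: no more batches
                return
            if errors:
                continue  # keep draining so the producer never blocks
            try:
                copy_to_device(batch)
            except Exception as exc:
                errors.append(exc)

    worker = threading.Thread(target=consumer)
    worker.start()
    try:
        for start in range(0, len(indices), batch_size):
            # Gather one batch into a small contiguous staging buffer.
            batch = [dataset[i] for i in indices[start:start + batch_size]]
            ready.put(batch)  # blocks if the copier is two batches behind
    finally:
        ready.put(None)
        worker.join()
    if errors:
        raise errors[0]

dataset = [[i] for i in range(10)]
device = []  # stands in for device memory
pipelined_copy(dataset, [9, 1, 4, 7, 2], batch_size=2, copy_to_device=device.extend)
print(device)  # [[9], [1], [4], [7], [2]]
```

Only a bounded number of batches is staged at any time, so the temporary buffer stays small regardless of the training set size, and gathering batch k+1 can proceed while batch k is being copied.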

The relative overhead is larger for IVF-Flat than for IVF-PQ (IVF-PQ building has more work to do, so preparing the training set takes a smaller fraction of the build time). The attached example with trainset ratio=1 is the corner case where the trainset copies take the largest relative fraction. With 1 thread the overhead is significant; it drops to 2% when 8 threads are used.

[Figure: IVF-Flat build time (DEEP-10M, 1K clusters)]

When the trainset is a smaller fraction of the dataset, the overhead becomes smaller.

The data is gathered into a contiguous buffer before it is copied to the device. This can be slightly faster than the strided copy using cudaMemcpy2D that we had earlier. In the best case the H2D copies are slightly accelerated and completely overlap with data gathering, which can lead to a 1% build time speedup.

[Figure: IVF methods build time (DEEP-10M)]

Contributor

@achirkin achirkin left a comment

Thanks for the PR @tfeher! Especially for the comprehensive analysis of the perf impact and for reducing the size of the ivf_pq_build.cuh file :)
A couple of small things regarding the use of new raft primitives below.

Review thread on cpp/include/raft/matrix/detail/gather.cuh (resolved)
Review thread on cpp/include/raft/spatial/knn/detail/ann_utils.cuh (outdated, resolved)
Contributor Author

@tfeher tfeher left a comment

Thanks @achirkin for the review, I have addressed the issues.

Review thread on cpp/include/raft/spatial/knn/detail/ann_utils.cuh (outdated, resolved)
Review thread on cpp/include/raft/matrix/detail/gather.cuh (resolved)
Contributor

@achirkin achirkin left a comment

Thanks for the updates, LGTM!

@cjnolet
Member

cjnolet commented Jan 23, 2024

/merge

@rapids-bot rapids-bot bot merged commit 9c35f73 into rapidsai:branch-24.02 Jan 23, 2024
61 checks passed
cjnolet added a commit to cjnolet/raft that referenced this pull request Jan 31, 2024
rapids-bot bot pushed a commit to rapidsai/cuvs that referenced this pull request Aug 1, 2024
Random sampling of training set for IVF methods was reverted in rapidsai/raft#2144 due to the large memory usage of the subsample method.

Since then, PR rapidsai/raft#2155 has implemented a new random sampling method with improved memory utilization.  Using that we can now enable random sampling of IVF methods (rapidsai/raft#2052 and rapidsai/raft#2077).

Random subsampling has measurable overhead for IVF-Flat, therefore it is only enabled for IVF-PQ.

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #122
divyegala pushed a commit to divyegala/cuvs that referenced this pull request Aug 7, 2024
Labels
cpp, feature request, non-breaking, Vector Search

4 participants