Embeddings: search with SIMD #51372
Conversation
```asm
// Sign-extend 16 bytes into 16 int16s
VPMOVSXBW (AX), Y1
VPMOVSXBW (BX), Y2
```
We do operations on int16s because there is (surprisingly) no instruction to multiply-add signed int8 vectors.
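The widening trick can be modeled in scalar Go. This sketch is for illustration only (it is not code from this PR): each byte is sign-extended to int16 before multiplying, mirroring `VPMOVSXBW` followed by a widening multiply-add, with adjacent products accumulated into an int32. int16 is wide enough because the largest-magnitude product of two int8s is (-128) × (-128) = 16384.

```go
package main

import "fmt"

// dotBlock is a scalar model of one 16-byte block of the AVX2 loop:
// sign-extend each int8 to int16, multiply pairwise (the product of two
// int8s always fits in an int16), and add adjacent products into an
// int32 accumulator.
func dotBlock(a, b []int8) int32 {
	var acc int32
	for i := 0; i+1 < len(a); i += 2 {
		p0 := int32(int16(a[i]) * int16(b[i]))
		p1 := int32(int16(a[i+1]) * int16(b[i+1]))
		acc += p0 + p1
	}
	return acc
}

func main() {
	a := []int8{-128, 127, 3, -4}
	b := []int8{-128, 127, -5, 6}
	fmt.Println(dotBlock(a, b)) // 16384 + 16129 - 15 - 24 = 32474
}
```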
```asm
// X0 is the low bits of Y0.
// Extract the high bits into X1, fold in half, add, repeat.
VEXTRACTI128 $1, Y0, X1
VPADDD X0, X1, X0

VPSRLDQ $8, X0, X1
VPADDD X0, X1, X0

VPSRLDQ $4, X0, X1
VPADDD X0, X1, X0
```
This section sums the 8 32-bit ints in Y0 by repeatedly folding the vector in half and adding vertically. We are left with the sum in the rightmost position of X0.
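The fold-and-add reduction can be sketched in plain Go (illustrative only; lane orientation differs from the assembly, which leaves the sum in the rightmost lane, while this model uses index 0):

```go
package main

import "fmt"

// reduce sums 8 int32 lanes the way the assembly does: fold the vector
// in half and add the halves vertically, then repeat on ever-smaller
// halves (widths 4, 2, 1), leaving the total in a single lane.
func reduce(v [8]int32) int32 {
	for width := 4; width >= 1; width /= 2 {
		for i := 0; i < width; i++ {
			v[i] += v[i+width]
		}
	}
	return v[0]
}

func main() {
	fmt.Println(reduce([8]int32{1, 2, 3, 4, 5, 6, 7, 8})) // 36
}
```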
```asm
// In tailloop, we add to the dot product one at a time
tailloop:
	CMPQ DX, $0
	JE end

	// Load values from the input slices
	MOVBQSX (AX), R9
	MOVBQSX (BX), R10

	// Multiply and accumulate
	IMULQ R9, R10
	ADDQ R10, R8

	INCQ AX
	INCQ BX
	DECQ DX
	JMP tailloop
```
In case our input is not a multiple of 16 (which it will be for OpenAI embeddings), this handles the remainder.
```
goos: linux
goarch: amd64
pkg: github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings
cpu: Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz
                                   │ /tmp/before.txt │         /tmp/after.txt          │
                                   │     sec/op      │    sec/op     vs base           │
SimilaritySearch/numWorkers=1-24       1817.3m ± 48%   286.2m ± 67%  -84.25% (p=0.000 n=10)
SimilaritySearch/numWorkers=2-24        845.2m ± 31%   150.9m ± 21%  -82.15% (p=0.000 n=10)
SimilaritySearch/numWorkers=4-24        593.8m ± 21%   107.0m ± 15%  -81.99% (p=0.000 n=10)
SimilaritySearch/numWorkers=8-24       302.67m ± 19%   77.49m ± 14%  -74.40% (p=0.000 n=10)
SimilaritySearch/numWorkers=16-24      173.93m ± 12%   83.05m ±  6%  -52.25% (p=0.000 n=10)
geomean                                 544.9m         124.3m        -77.18%
```
```go
func CosineSimilarity(row []int8, query []int8) int32 {
	similarity := int32(0)

	count := len(row)
	if count > len(query) {
		// Do this ahead of time so the compiler doesn't need to bounds check
		// every time we index into query.
		panic("mismatched vector lengths")
	}

	i := 0
	for ; i+3 < count; i += 4 {
		m0 := int32(row[i]) * int32(query[i])
		m1 := int32(row[i+1]) * int32(query[i+1])
		m2 := int32(row[i+2]) * int32(query[i+2])
		m3 := int32(row[i+3]) * int32(query[i+3])
		similarity += (m0 + m1 + m2 + m3)
	}

	for ; i < count; i++ {
		similarity += int32(row[i]) * int32(query[i])
	}

	return similarity
}

func CosineSimilarityFloat32(row []float32, query []float32) float32 {
	similarity := float32(0)

	count := len(row)
	if count > len(query) {
		// Do this ahead of time so the compiler doesn't need to bounds check
		// every time we index into query.
		panic("mismatched vector lengths")
	}

	i := 0
	for ; i+3 < count; i += 4 {
		m0 := row[i] * query[i]
		m1 := row[i+1] * query[i+1]
		m2 := row[i+2] * query[i+2]
		m3 := row[i+3] * query[i+3]
		similarity += (m0 + m1 + m2 + m3)
	}

	for ; i < count; i++ {
		similarity += row[i] * query[i]
	}

	return similarity
}
```
I moved these into `dot.go` and renamed them to `Dot*`. The dot product is only equivalent to cosine similarity if the vectors are normalized, so I think the rename is justified in case we ever use non-normalized vectors.
This makes sense to me, but I'm not up-to-speed on Go assembly! Maybe there's someone more knowledgeable who could give a timely review?
Also, it's interesting that when testing on the GCE instance, we get a max 2x speedup. This is different from what we observed when testing locally, where the speedup scales with the number of workers: #51372. It means we should limit the request parallelism to something more conservative like 2 threads, rather than the number of processors as we do now. This would be for a follow-up, as it's separate from this PR.
I went through the code (on mobile 😄 ) and it LGTM. It's been two decades since I last did assembly and, of course, I have no idea of these new instructions, but it looks reasonable, and the test coverage is convincing. Overall, wow, impressive work!
Looks good to me from the config and search side. I will dig through the assembly at another time :)
The backport to `5.0` failed.
To backport manually, run these commands in your terminal:

```shell
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-5.0 5.0
# Navigate to the new working tree
cd .worktrees/backport-5.0
# Create a new branch
git switch --create backport-51372-to-5.0
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 17a8ec942c1eaca26ae62191460e7ff9bd6285aa
# Push it to GitHub
git push --set-upstream origin backport-51372-to-5.0
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-5.0
```

Then, create a pull request.
This implements a hand-written assembly version of the int8 dot product that takes advantage of AVX2 SIMD instructions. This speeds up our embeddings searches by roughly 10x on modern x86_64 machines. (cherry picked from commit 17a8ec9)
```asm
#include "textflag.h"
```
This file is missing from this patch; maybe remove this `#include`?
What do you mean "missing from this patch"? The `#include "textflag.h"` is needed to define the NOSPLIT symbol.
If you mean "you didn't commit a textflag.h file", it's a Go compiler builtin.
```asm
	SUBQ $16, DX
	JMP blockloop

reduce:
```
Given that `reduce:` and `tailloop:` run only once (or are very small), it would make things a bit simpler to remove the assembly for them and write them in Go. (Assuming it's possible to access Y0 from Go; otherwise it would make sense to leave the `reduce` code in assembly.)
AFAIK, it is not possible to access Y0 from Go without assembly
```asm
MOVQ a_base+0(FP), AX
MOVQ b_base+24(FP), BX
MOVQ a_len+8(FP), DX
```
Is this hard-coding the stack offsets based on the ABI of a slice? If so, it'll break if the compiler starts using SROA for slices.
Instead, you could use https://pkg.go.dev/unsafe#SliceData to get the underlying pointer in a stable way. Then this function would take in two pointers and the one length as the arguments, instead of hard-coding stack offsets here.
Otherwise, at least add a comment describing where these hard-coded constants come from.
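A rough Go sketch of that suggestion (hypothetical names; `dotAsm` here is a pure-Go stand-in for the assembly routine, which would really be declared in a `.s` file and take raw pointers):

```go
package main

import (
	"fmt"
	"unsafe"
)

// dot resolves the slice headers in Go with unsafe.SliceData (Go 1.20+)
// and hands the low-level routine plain pointers plus a length, instead
// of hard-coding the slice ABI layout in the assembly itself.
func dot(a, b []int8) int32 {
	if len(a) != len(b) {
		panic("mismatched vector lengths")
	}
	return dotAsm(unsafe.SliceData(a), unsafe.SliceData(b), len(a))
}

// dotAsm is a stand-in for the assembly implementation, written in Go so
// this sketch is runnable.
func dotAsm(pa, pb *int8, n int) int32 {
	as := unsafe.Slice(pa, n)
	bs := unsafe.Slice(pb, n)
	var s int32
	for i := 0; i < n; i++ {
		s += int32(as[i]) * int32(bs[i])
	}
	return s
}

func main() {
	fmt.Println(dot([]int8{1, 2, 3}, []int8{4, 5, 6})) // 4 + 10 + 18 = 32
}
```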
My understanding was that, unless I opt into `ABIInternal` (or any future stable ABI), I can depend on the current (stack-based) ABI to be stable.

Of note, the `a_base` notation is a mnemonic that is checked by go vet. So if the field offset does not line up with the FP offset I specified there, go vet will complain.

I'll add some comments describing the offsets though 👍
```asm
VPMOVSXBW (AX), Y1
VPMOVSXBW (BX), Y2
```
Could you add a link to the calling convention where it's described whether these registers are preserved or clobbered across a call? It seems like this code is assuming that all the registers it is using are caller-preserve (i.e. OK to clobber).
See "Clobber sets" here. Based on my read of it, I do not need to worry about callee-saved registers
```asm
	JMP tailloop

end:
	MOVQ R8, ret+48(FP)
```
Is there a more "stable"/"reliable" way to get this rather than hard-coding the stack offset? IIUC this is just writing the return value. It'll also break if Go starts returning small return values in registers.
I'm surprised this is actually working, I thought Go started using a register based calling convention recently... Maybe only for parameters or for functions implemented in Go?
All the examples I've seen hard-code the stack offset. I agree it's awkward and error-prone.

The register-based calling convention is only used for compiled Go source code unless you opt into it with the `ABIInternal` flag on the function definition. So, since I did not opt into it, I am using the old stack-based calling convention.
This will also be caught by go vet, though. `ret` is the implicit return variable name, so go vet will check that `48(FP)` is, in fact, the target address for the return value.
Actually, though, I thought go vet runs in CI. It does not. Apparently, I miscalculated the frame size. PR incoming.
go vet should be running 🏃♀️ It runs as part of the nogo linters 🤔 I'll double check the BAZEL config to make sure
neat.
> Also, it's interesting that when testing on the GCE instance, we get a max 2x speedup
@jtibshirani is it possible that the node you are running on has a lot more CPUs than kubernetes is configured to let you use? So we end up doing too much parallelism / something else is confusing in the measurements. I would make sure we are using automaxprocs: https://sourcegraph.com/search?q=context:global+r:%5Egithub%5C.com/sourcegraph/+maxprocs.Set&patternType=standard&sm=0&groupBy=repo
```go
got := Dot(a, b)

if want != got {
	t.Fatalf("a: %#v\nb: %#v\ngot: %d\nwant: %d", a, b, got, want)
```
This should be `t.Log`, otherwise you never return false.
@keegancsmith, the benchmarks were not running in Kubernetes, so it's unlikely to be related to reserved CPUs.

For clarity, we're seeing different behaviors on the different machines I've tested on. M1 scales linearly up to 8 cores, which is what we based our initial assumption of scaling on. My home server (2014 Intel 12 core) scales linearly up to 16 cores without SIMD, but only up to 8 cores with SIMD. I expect it's starting to hit memory bandwidth and/or cache limits with the SIMD implementation (M1s have stupidly good memory bandwidth). The GCE n2-standard-4 scales up to 2 cores without SIMD, and 4 cores with SIMD, but the 4 cores is only ~2x faster than 1 core. This is where the "2x" number is coming from.

This problem should be very parallel-friendly, but we could be hitting cache effects. I put together a spreadsheet with the numbers I'm working from. Note that these numbers aren't super rigorous and I haven't looked into this very closely. It's more of an observation that I thought was interesting and probably deserves a little bit of looking to make sure we're not throwing more CPU at the problem than we can use.
Something we could consider here, since we are using… It might be worth it to consider building another set of binaries that unlock these gains at the runtime level by setting…
AFAICT, the runtime uses very few v3-specific features so far. Am I looking at the right thing?
I'm comparing this hand-written assembly to what clang generates from C++ code, and shouldn't there be a…
This implements a hand-written assembly version of the int8 dot product that takes advantage of AVX2 SIMD instructions. This speeds up our embeddings searches by roughly 10x on modern x86_64 machines.
Test plan
Added quickchecks and fuzz tests to compare output with the go version.
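As a sketch of what such a quickcheck looks like (illustrative only; the real tests compare the assembly `Dot` against the Go version, while here an unrolled Go loop stands in for it):

```go
package main

import (
	"fmt"
	"testing/quick"
)

// dotRef is the straightforward reference implementation.
func dotRef(a, b []int8) int32 {
	var s int32
	for i := range a {
		s += int32(a[i]) * int32(b[i])
	}
	return s
}

// dotUnrolled stands in for the optimized implementation under test.
func dotUnrolled(a, b []int8) int32 {
	var s int32
	i := 0
	for ; i+3 < len(a); i += 4 {
		s += int32(a[i])*int32(b[i]) + int32(a[i+1])*int32(b[i+1]) +
			int32(a[i+2])*int32(b[i+2]) + int32(a[i+3])*int32(b[i+3])
	}
	for ; i < len(a); i++ {
		s += int32(a[i]) * int32(b[i])
	}
	return s
}

func main() {
	// Property: both implementations agree on random inputs of any length.
	prop := func(a, b []int8) bool {
		n := len(a)
		if len(b) < n {
			n = len(b)
		}
		return dotRef(a[:n], b[:n]) == dotUnrolled(a[:n], b[:n])
	}
	if err := quick.Check(prop, nil); err != nil {
		fmt.Println("mismatch:", err)
	} else {
		fmt.Println("ok")
	}
}
```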
The following benchmark is for an `n2-standard-4` GCE instance, which is a very standard machine type. For the single core benchmark, we can search about 6 million embeddings per second, which is the equivalent of a 6GB monorepo.

This is low-risk to merge because it is disabled by default.