all: consistently use cpu_profile as flag name #895
Merged
keegancsmith merged 1 commit into main on Jan 17, 2025
Conversation
This is more of a workaround: a transitive dependency has introduced a global flag "cpuprofile", leading to a panic due to registering the flag twice. To make ourselves immune to this issue we could refactor our usages to use a FlagSet, even for "main", but that is a bigger and frankly inconvenient change for a somewhat rare occurrence. Instead we just rename our flag. I feel comfortable renaming since this flag should only really be used by Zoekt developers. There will be the issue that the flag is shown twice for commands, but I will report this problem to the upstream repo.

Test Plan: `go get -u ./... && go run ./cmd/zoekt-git-index` works
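The failure mode being worked around can be demonstrated with a minimal sketch. This is not Zoekt's code; it only shows that Go's `flag` package panics when the same flag name is registered twice on one FlagSet, which is what happened when the transitive dependency registered a global "cpuprofile" flag alongside Zoekt's:

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// registerTwice reports whether registering the same flag name twice
// on one FlagSet panics. This is the failure mode the PR avoids by
// renaming Zoekt's flag to cpu_profile.
func registerTwice(name string) (panicked bool) {
	fs := flag.NewFlagSet("demo", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // silence the "flag redefined" message
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	fs.String(name, "", "write cpu profile to file")
	fs.String(name, "", "write cpu profile to file") // panics: flag redefined
	return false
}

func main() {
	fmt.Println(registerTwice("cpuprofile"))
}
```

The same panic happens on the default `flag.CommandLine` FlagSet, which is why a global flag registered in a dependency's `init` collides with one registered in `main`.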
Member
Author
This change is part of the following stack. Change managed by git-spice.
janhartman
approved these changes
Jan 17, 2025
janhartman
left a comment
Bit of a shame that we have to do this - but like you said, this is a dev-only flag so it should be completely fine.
abraverm
pushed a commit
to abraverm/zoekt
that referenced
this pull request
Jan 15, 2026
msukkari
added a commit
to sourcebot-dev/zoekt
that referenced
this pull request
Apr 21, 2026
* nix: use latest version of go in devenv (sourcegraph#890)

* Support BM25 scoring for chunk matches (sourcegraph#889)

Currently, BM25 scoring only applies to the overall `FileMatch` score. The algorithm gathered term frequencies from all candidate matches in the file to produce a file-level score. However, `LineMatch` and `ChunkMatch` scores were still calculated using the classic Zoekt scoring algorithm.

This PR implements BM25 scoring for `LineMatch` and `ChunkMatch`. It does so by calculating BM25 per line. Compared to the classic Zoekt algorithm, this rewards multiple term matches on a line. Because our term frequency calculation also boosts symbol matches, the score smoothly balances between "many term matches" and "interesting term matches".

Now, the code is structured as follows:
  * `scoreChunk`: goes through each line in the chunk, calculating its score through `scoreLine`, and returns the best-scoring line
  * `scoreLine`: calculates the score for a single line

The mental model is that "the score of a chunk is always the score of its best line".

* Remove special merging behavior for line matches (sourcegraph#888)

Usually, if there are candidate matches with overlapping ranges, we just remove matches that overlap. However, when `opts.ChunkMatches = false`, we had special logic to merge overlapping matches. This PR removes that merging logic to simplify the behavior. I couldn't see a good reason to keep this special handling. Plus, we are moving towards making `ChunkMatches` the default.

Another benefit of this change is that it makes the BM25 behavior easier to understand. If we merged together ranges, then we would be calculating term frequencies for spurious terms (like `new`, `queue`, `newqueue`, `queuenew`, etc.)

Note: we currently only use BM25 with `ChunkMatches = true`, so there's not an active bug here.

* Expand tests for match merging (sourcegraph#892)

Follow up to sourcegraph#888, where I forgot to improve the test coverage.
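The chunk-scoring mental model ("the score of a chunk is always the score of its best line") can be sketched as below. `scoreLine` here is a hypothetical stand-in that just counts query-term occurrences; Zoekt's real per-line scoring is the BM25 calculation described above:

```go
package main

import (
	"fmt"
	"strings"
)

// scoreLine is a hypothetical stand-in for Zoekt's per-line scoring:
// here it simply counts occurrences of the query terms on the line.
func scoreLine(line string, terms []string) float64 {
	score := 0.0
	for _, t := range terms {
		score += float64(strings.Count(line, t))
	}
	return score
}

// scoreChunk mirrors the structure described above: it scores each
// line in the chunk and returns the best-scoring one.
func scoreChunk(chunk string, terms []string) (best float64, bestLine string) {
	for _, line := range strings.Split(chunk, "\n") {
		if s := scoreLine(line, terms); s > best {
			best, bestLine = s, line
		}
	}
	return best, bestLine
}

func main() {
	chunk := "func NewQueue() *Queue {\n\treturn &Queue{}\n}"
	score, line := scoreChunk(chunk, []string{"Queue"})
	fmt.Println(score, line)
}
```

Note how a line with two term matches outscores a line with one, which is exactly the "rewards multiple term matches on a line" behavior the PR describes.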
* Add metric for total number of shards (sourcegraph#891)

Currently, we only record the number of compound shards. This PR adds a metric for the total number of index shards. This metric is helpful for alerting, as it's important to catch a significant decrease in the amount of indexed data.

* Update jidicula/go-fuzz-action to fix CI (sourcegraph#894)

This fixes the failing 'fuzz test' check in CI. Example from sourcegraph#891:

```
Download action repository 'jidicula/go-fuzz-action@0206b61afc603b665297621fa5e691b1447a5e57' (SHA:0206b61afc603b665297621fa5e691b1447a5e57)
Getting action download info
Error: This request has been automatically failed because it uses a deprecated version of `actions/upload-artifact: v3`. Learn more: https://github.blog/changelog/2024-04-16-deprecation-notice-v3-of-the-artifact-actions/
```

* all: consistently use cpu_profile as flag name (sourcegraph#895)

This is more of a workaround: a transitive dependency has introduced a global flag "cpuprofile", leading to a panic due to registering the flag twice. To make ourselves immune to this issue we could refactor our usages to use a FlagSet, even for "main", but that is a bigger and frankly inconvenient change for a somewhat rare occurrence. Instead we just rename our flag. I feel comfortable renaming since this flag should only really be used by Zoekt developers. There will be the issue that the flag is shown twice for commands, but I will report this problem to the upstream repo.

Test Plan: `go get -u ./... && go run ./cmd/zoekt-git-index` works

* gomod: go get -u ./... (sourcegraph#896)

* update sha1cd to v0.3.2 (sourcegraph#898)

This removes the transitive dependency avo... which also removes the global registration of the cpuprofile flag.
* Move several packages to internal/ (sourcegraph#901)

This PR moves the following packages to `internal` to avoid exposing them in the API:
  * `ctags`
  * `debugserver`
  * `gitindex`
  * `shards`
  * `trace`

* Move root-level index code to index package (sourcegraph#902)

In the repo root, we have a bunch of low-level logic around index building and searching. So we end up exposing internal logic through the main public `zoekt` package, for example `zoekt.Merge(...)`. This PR moves it into the `build` package, so all code related to index building lives together. It then renames `build` to `index` to reflect the broader focus on indexing and searching the index.

* Document all the commands + packages (sourcegraph#904)

This PR adds doc comments for all packages and commands.

* print raw file (sourcegraph#903)

When adding `format=raw` to a `/print` request, we receive the raw file content in response.

Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>

* Update README (sourcegraph#905)

This PR updates the README to clarify Zoekt's current design and explain the main usage patterns.

* Add a title to README (sourcegraph#906)

Follow up to sourcegraph#905. I realized the README looked much better with a title!

* Add short contributing guide (sourcegraph#907)

Closes sourcegraph#241

* Rename IndexBuilder -> ShardBuilder (sourcegraph#908)

When navigating the code, I've often forgotten the difference between `NewBuilder` and `NewIndexBuilder`. This rename clarifies that one of these indexes a whole repo, while the other builds individual shards. Also, `index.NewShardBuilder` sounds better.

* Fully revert windows support (sourcegraph#909)

We added support for Windows in sourcegraph#535. We then partially reverted the change, specifically the parts related to mmap (sourcegraph#706). This was okay because we no longer build for Windows. This PR fully removes support to avoid being in a partially-implemented state.
* Tackle the issue of XML files filtered as binaries in search results (sourcegraph#910)

When skipping a doc, we currently report the detected language as "binary" (if it looks like binary) or "skipped" (if it's skipped for any other reason). Skipped docs are still added to the index and can still be returned as search results, for example if you only match on filename. So sometimes file matches are returned with "skipped" as their language, even though the file path is clearly some other language like XML.

This PR updates the indexing logic to still detect the language even if the document is skipped. However, we avoid passing the contents to the language detection library to avoid running detection on huge files.

Co-authored-by: Julie Tibshirani <julietibs@apache.org>

* add security md (sourcegraph#911)

* Correct the compound shards metric (sourcegraph#913)

Fixes a bug I introduced in sourcegraph#891.

* scoring: remove IDF from BM25 scoring (sourcegraph#912)

We remove IDF from our BM25 scoring, effectively treating it as constant. This is supported by our evaluations, which showed that for keyword-style queries, IDF can down-weight the score of important keywords too much, leading to a worse ranking. The intuition is that for code search, each keyword is important independently of how frequently it appears in the corpus.

Removing IDF allows us to apply BM25 scoring to a wider range of query types. Previously, BM25 was limited to queries with individual terms combined using OR, as IDF was calculated on the fly at query time.

Test plan: updated tests

* support both classical nix and flakes (sourcegraph#916)

* ranking: add tiebreakers to BM25 (sourcegraph#914)

This adds repo freshness and file order as tiebreakers to the final BM25 score, just like we have for Zoekt's default scoring. During testing I found that it is a lot less likely for the tiebreakers to have an effect with BM25, because the score depends on qualities of the document, such as the relative length and number of matches, which usually differ even when the quality of the match is similar.

Test plan:
  - Score tests still pass
  - manual testing: see screenshots

* repoID bitmap for speeding up findShard in compound shards (sourcegraph#899)

We add a new section to shards which contains a roaring bitmap for quickly checking if a shard contains a repo ID. We can then load just this small amount of data to rule out a compound shard. We use roaring bitmaps since we already have that dependency in our codebase.

The reason we speed up this operation is that on a large instance containing thousands of tiny repos we spent so much time in findShard that our indexing queue would always fall behind. It is possible this new section won't speed this up enough and we need some sort of global oracle (or in-memory cache in indexserver?). This is noted in the code for future travellers.

Test Plan: the existing unit tests already cover whether this is forwards and backwards compatible. Additionally I added some logging to zoekt to test whether older versions of shards still work correctly in findShard, as well as whether older versions of zoekt can read the new shards. Added a benchmark to check the impact. See comments in the code.

Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>

* ranking: add phrase boosting to BM25 (sourcegraph#917)

With this change we recognize boosted queries in our BM25 scoring and adjust the overall score accordingly. We need to take care of 2 parts: the overall BM25 score of the document, and the line score determining the order in which we return the chunks.

Co-authored-by: Julie Tibshirani <julietibs@apache.org>

* Remove unused repoPathRanks struct (sourcegraph#921)

This was left over from our PageRank prototype.
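The BM25 variant described in the scoring commits above (saturating term frequency, document-length normalization, IDF treated as constant) can be sketched as follows. The parameter values `k1` and `b` and the function shape are standard BM25 conventions, not Zoekt's actual implementation:

```go
package main

import "fmt"

// bm25NoIDF scores a document against query terms using BM25's term
// frequency saturation and length normalization, but with IDF treated
// as a constant of 1, as described above. k1 and b are the standard
// BM25 parameters; the values here are the common defaults and are
// only illustrative.
func bm25NoIDF(termFreqs map[string]int, docLen, avgDocLen float64) float64 {
	const k1, b = 1.2, 0.75
	score := 0.0
	for _, tf := range termFreqs {
		f := float64(tf)
		// Saturating: grows with tf but approaches k1+1 asymptotically,
		// so one keyword cannot dominate the score without bound.
		score += f * (k1 + 1) / (f + k1*(1-b+b*docLen/avgDocLen))
	}
	return score
}

func main() {
	// Two matches of one term in an average-length document.
	fmt.Printf("%.3f\n", bm25NoIDF(map[string]int{"queue": 2}, 100, 100))
}
```

With IDF constant, every keyword contributes on the same scale, which is the intuition stated above: for code search, each keyword matters independently of how common it is in the corpus.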
* sourcegraph-indexserver: add grpc server (sourcegraph#920)

Relates to SPLF-874

This adds a grpc server to sourcegraph-indexserver. For now it supports just one method. The diff is quite big, so I left comments to mark the most important bits. I used the opportunity to clean up a bit (hence the big diff):
  - Reuse grpc logic from the webserver and move those bits to "/grpc/..."
  - Move "protos" inside the new "grpc" directory (requires changes to import statements in Sourcegraph)
  - Refactor import aliases for grpc packages across the codebase

Test plan: I tested this locally by calling the new grpc endpoint directly.

* ranking: incorporate file signals into BM25F (sourcegraph#922)

This PR reworks the way we incorporate file signals into BM25. Previously, we were applying them as a tie-breaker. But in dogfooding, we found that these rarely impact results, because it's so rare to have a tie in BM25 scores. Now, we take the file signal into account when computing BM25F. The interpretation is that this data lives in a separate 'field' that is half the priority of regular content.

* sourcegraph-indexserver: GRPC, implement DeleteAllData (sourcegraph#923)

Relates to sourcegraph#920, SPLF-874

This implements DeleteAllData. We hold the global lock while deleting all simple shards belonging to a tenant. We also handle compound shards by disassembling them first.

Note that this "only" deletes persisted data. Updating the queue, for example, seems fragile because it might immediately get updated by Sourcegraph. This implies that Sourcegraph has to delete the tenant in the Sourcegraph DB first and then call this new endpoint. Even if the queue still has a reference to a deleted tenant, indexserver won't be able to retrieve index options or clone the repo from gitserver.

Test plan:
  - new unit tests
  - manual testing: I ran this together with Sourcegraph and triggered a delete by calling DeleteAllData directly. I confirmed that all shards, including compound shards, are deleted.

* ranking: downweight binary files (sourcegraph#924)

In testing, I noticed another problem with BM25: sometimes a binary file is ranked highly because of a match on its filename. In classic Zoekt scoring, these are ranked low because they are skipped, and we always sort skipped docs to the end of the index. This PR ensures they're also ranked low for BM25 by adding a 'binary' category and marking it low priority.

Adding this category required updating `SkipReason` to track the reason a document was skipped. This is necessary because we set the content of skipped docs to `nil`, and `SkipReason` is the only lasting indication that it was binary.

* all: run modernize across codebase (sourcegraph#919)

The latest release of gopls has a feature called modernize which will update your code where it can to use modern Go features/packages. https://github.com/golang/tools/releases/tag/gopls%2Fv0.18.0

Generated with: go run golang.org/x/tools/gopls/internal/analysis/modernize/cmd/modernize@latest -fix -test ./...

Test Plan: CI

* fix empty branch names getting returned by using uint64 id (sourcegraph#926)

The gatherBranches() function shifts a mask of type uint64 until it is 0, but uses an id of type uint32 for accessing a map. This results in empty branch names getting returned from this function. It can be fixed by changing the type of id to also be uint64.

* Move shards package to root and rename to 'search' (sourcegraph#930)

In sourcegraph#901 I moved several packages to 'internal' to clean up the exported API. This PR moves the `shards` package back to root, since it contains important methods like `NewDirectorySearcher`. It also renames the `shards` package to `search` to clarify the usage.

Relates to sourcegraph#901 (comment), sourcegraph#927

* sourcegraph-indexserver: add GRPC methods for Index and Delete (sourcegraph#933)

Closes SPLF-913. This only affects Sourcegraph.
This adds Index and Delete methods to sourcegraph-indexserver's GRPC server. The methods aren't called yet by Sourcegraph. Notable differences to our current indexing loop:
  - IndexOptions are pushed by the client instead of pulled by Zoekt
  - A semaphore restricts concurrency
  - A new env var `SRC_STOP_INDEXING` stops the "competing" indexing loop. This is just for testing and will eventually be removed
  - The Index method tries to recover a possibly older index from `indexDir/.trash` before starting to index

Review: I recommend reviewing the `.proto` file first, followed by `main.go`.

Test plan:
  - I did a bunch of manual testing to test the trash recovery and the semaphore logic. I successfully triggered index and delete calls to the GRPC server. I will put the Insomnia collection I used for testing in the ticket.
  - New unit tests

* Fix compilation on 32-bit architectures (sourcegraph#936)

This PR fixes a bug where Zoekt would not compile on 32-bit architectures. It also takes the opportunity to start using the `math` library everywhere instead of our own constants like `maxUInt32`, to help prevent this sort of issue in the future by encouraging devs to select the most accurate "max" type for their specific situation.

Closes sourcegraph#935

* zoekt-mirror-github: allow cloning without github token (sourcegraph#937)

Make the flag "token" optional. Exit if the user-provided token does not exist (as before). Only fall back to the token from the default location if the user did not provide a token. Continue with an unauthenticated GitHub client (`nil` OAuth client) if the fallback did not work (does not exist, cannot be read, ...).

* Simplify go-git optimization and reenable it (sourcegraph#940)

This PR re-enables the go-git optimization that was disabled in sourcegraph#870 because it caused a huge bug. It removes the part of the optimization that was causing the bug, setting `LargeObjectThreshold`. An alternative would be to track down the root cause of the bug in go-git, so that we could keep setting `LargeObjectThreshold`. However, I don't feel that's worth it as:
  * The memory improvement from this change was pretty small (see sourcegraph#854)
  * Eventually we want to move away from go-git towards "vanilla" direct git usage

* Simplify trace methods (sourcegraph#942)

Small refactor to simplify the logic in `trace.New`. It inlines a couple of public methods that don't need to be exposed separately.

* Add README for proto directories (sourcegraph#943)

Somehow I just found out about `gen-proto.sh` recently, after many months of working on Zoekt! This PR adds brief `README` files in the proto directories to guide people (and LLMs) to it.

* feat(web): Add context lines to search results (sourcegraph#931)

This adds the before and after lines to the search results. The default is still 0, so the user has to increase "context lines" to see more.

* Avoid setting git author name in test (sourcegraph#944)

This fixes an annoying problem where, after running `go test ./...` locally, your local git `user.name` would be set to `thomas`.

* only sample spans where parent is sampled (sourcegraph#945)

* fix wildcards not getting expanded in delta build (sourcegraph#946)

Hi zoekt team, when testing with our local instance I noticed that delta builds were not correctly indexing only the delta that I expected, but much more. I looked at prepareDeltaBuild() and prepareNormalBuild() and found that prepareNormalBuild() expands the branches before processing them with expandBranches(), but prepareDeltaBuild() does not. This leads to a lot more indexing, because prepareDeltaBuild() can't find any known branches (that match the wildcard). I added a call to expandBranches() in prepareDeltaBuild() to fix this and also added a test.

Regards, Daniel

* index: remove shard_prefix option (sourcegraph#947)

We instead rely on the environment, which is inherited, to make this decision. The logic will become a little different in the next PR, which will instead decide layout based on whether we are in "workspaces" mode vs just wanting to enforce tenants. Note: this is Sourcegraph specific. We want to move to always enforcing tenants in our environments, even outside of our multi-tenant environments.

Test Plan: go test

* refactor(all): goimports -w -local github.com/sourcegraph/zoekt (sourcegraph#948)

* add AGENT.md (sourcegraph#950)

Generated this and refined it to make sense. Hopefully this helps. Test Plan: n/a

* ci: remove docker step on main (sourcegraph#951)

We no longer consume these images, so let's stop building them. Additionally, we recently rotated some credentials and this step broke. Rather than fixing it, I think we should just remove it since we don't use the artifacts anymore.

Test Plan: watch CI on main

* index: decide between tenant and non-tenant shard name in one place (sourcegraph#953)

We have two places with duplicated logic around how to decide the layout of shards on disk. This moves that decision into one place. Additionally, we can now unexport index.ShardName. It was only used in one place outside the package, and that was easy to replace with a hardcoded string since it is just a test.

Test Plan: just CI. This has no actual change in functionality, just refactoring.

* docs: some updates (sourcegraph#952)

Asked an agent to inspect the codebase and docs and update the docs to match reality more. I ended up not using all of its suggestions, but these are the ones that seem reasonable. Test Plan: n/a

* index: remove Options.{TenantID,RepoID} (sourcegraph#954)

These are duplicated with RepositoryDescription. There is exactly one place we set these values, and in that case we set it in the repository description as well.
Note: we update some of the calls in builder.go from o.IndexOptions.RepoID to just o.RepoID. This makes it consistent with how we access TenantID. IndexOptions is embedded in that struct (indexArgs), which is why we can take the shortcut.

Test Plan: go test is sufficient for this change. I will do a larger round of manual tests soon with all my changes stacked.

* index: inline tenant.SrcPrefix (sourcegraph#955)

It had exactly one call site and the logic was super simple, so I think it is clearer to just inline the logic here.

Test Plan: mechanical refactor, so just CI. In a future PR I will be adding better test coverage just on this function.

* index: use ID based shard names only on multi-tenant instances (sourcegraph#956)

We only want to use ID based shard names on multi-tenant instances. Before this change we decided this based on whether tenant enforcement was turned on. However, we would like to always have tenant enforcement on at Sourcegraph, but not migrate all shards to the new naming scheme (just yet).

This change is quite simple. However, it works since we never actually read the filenames to extract tenant ID or repository ID. We only ever use the data that is persisted inside of the shard to do the enforcement. The environment variable we read (WORKSPACES_API_URL) is the same one we use in the Sourcegraph codebase to determine multi-tenant specific code.

Test Plan: added a unit test. Will do a much larger manual E2E test with the Sourcegraph project.

* tenant: unexport EnforceTenant (sourcegraph#957)

The only place this was called outside of the tenant package was for deciding if we should call tenant.Log. However, we can inline that check and both simplify the tenant exported API as well as how you call tenant.Log.

Test Plan: CI

* gitindex: unmarshal tenantID from git config (sourcegraph#958)

* gitlab: Move from github.com/xanzy/go-gitlab to maintained version of the client (sourcegraph#963)

github.com/xanzy/go-gitlab has been archived and is now maintained by GitLab. Update the import path and catch up with the API changes.

* Add support for indexing and searching custom fields for repositories (sourcegraph#962)

At GitLab, we encountered limitations when searching within large namespaces containing thousands of repositories. Specifically, we cannot pass a complete list of RepoIDs due to size constraints.

This change introduces support for indexing and searching custom repository metadata by extending Repository to include an additional Metadata field. All fields within Repository.Metadata are searchable using a regular expression evaluator. This enables more scalable filtering by allowing clients to express regular expression prefix queries on metadata fields, such as: traversal_ids:123-456-.* Or any field really: haystack:nee.*le

* query: remove TestMetaSimplify (sourcegraph#965)

This was a leftover from the commit merged yesterday, for code that was removed. Test Plan: go test ./...

* feat(indexserver): Introduce KeepDeleted Config for keeping stale Repos (sourcegraph#964)

* query: handle parse function not consuming all of input (sourcegraph#569)

Previously we silently ignored it. For example, a search like `(foo))` would just work, since we would only parse `(foo)`. We now report the problem back to the user. Note: we already had special handling for unbalanced parentheses like `((foo)`.

Test Plan: added more test cases. I especially tried to explore areas where we could validly not consume the whole input, but couldn't find one.

* Allow building on FreeBSD (sourcegraph#968)

This fixes sourcegraph#967.
I also reported it here: https://gitlab.com/gitlab-org/gitlab-zoekt-indexer/-/issues/91

* bump go-ctags for handling non-fatal startup errors (sourcegraph#972)

Currently, if you switch to universal-ctags 6.2.0, it will log a few warnings on startup. The old version of go-ctags would error out. Now instead it will log to its info log.

Test Plan: added the problematic file from the linked issue to the ./foo/repo directory. Then ran "go run ./cmd/zoekt-index -require_ctags -index ./foo/index ./foo/repo" with uctags 6.2.0.

* switch to github.com/sourcegraph/cody-public-snapshot for e2e test (sourcegraph#974)

e2e tests were failing because the github.com/sourcegraph/cody repository has been set to private and all tags were removed. Test plan: CI

* sourcegraph: remove GRPC index methods (sourcegraph#977)

This was part of an effort to move the queue from Zoekt to Sourcegraph. However, we are not going to pursue this for now, so we can remove the corresponding grpc methods. Sourcegraph never had callers of Index and Delete, so both are safe to remove. `DeleteAllData` is still being called from Sourcegraph. Test plan: CI

* chore: fix grpc drift (sourcegraph#978)

In sourcegraph#962 a new field, metadata, was added to zoekt.Repository, but it was not added to the grpc types and converter methods, which is why the fuzz test occasionally fails. Test plan: CI

* copy languages package from Sourcegraph to Zoekt (sourcegraph#979)

We want Zoekt and Sourcegraph to use the same languages package. In this PR we move the languages package from Sourcegraph to Zoekt, so that Zoekt can use it and Sourcegraph can import it. Notes:
  - Zoekt doesn't need to fetch content async, which is why I added a little helper func `GetLanguagesFromContent` to make the call sites in Zoekt less awkward.
  - Sourcegraph's languages package always classified .cls files as Apex, while Zoekt did a content-based check. With this PR we follow Zoekt's approach. Specifically, I removed .cls from `unsupportedByEnryExtensionToNameMap`. I added an additional unit test to cover this case.

Test plan: I appended the test cases from the old Zoekt languages package to the tests I copied over from Sourcegraph

* gitindex: move package from internal (sourcegraph#981)

Co-authored-by: ARyzhikov <ARyzhikov@nota.tech>

* Cache docMatchTree results for Meta conditions (sourcegraph#982)

* go mod tidy (sourcegraph#983)

* Add support for indexing submodules without repocache (sourcegraph#985)

  * Fix git.Open() fs argument for .git dir instead of worktree

    The filesystem argument submitted to git.Open() is of the .git directory; it should be for the worktree, as in go-git. This logic is copied from: https://github.com/go-git/go-git/blob/145daf2492dd86b1d7e7787555ebd9837b93bff8/repository.go#L373-L375

    One side effect of this bug was that go-git would refuse to fetch repo submodules.

  * Fix missing RepoURLs and LineFragments for result subrepositories

    RepoURLs for subrepositories are currently filtered out when results are streamed per-repo from shards.
  * Add submodule indexing without repo cache

    When not using the repo cache, index git submodules recursively into subrepos using the go-git API.

  * Adjust subrepo naming conventions

    - Revert repo-cache behavior to unchanged
    - Add support for naming repos and modules with no remote (use the dirname for the former and the subrepo path for the latter)
    - Add test for submodule no-repo-cache behavior

  * Fix overallocation in repoURL and lineFragment filtering

  * Revert basename treatment from 08f6759

    Missed the fact that it is already done in the zoekt-git-index command; the test was improperly configured: https://github.com/sourcegraph/zoekt/blob/4e4a529c3b63c7d4c7897ba736f1cd52cc163134/cmd/zoekt-git-index/main.go#L93-L100

* Add shard filtering for meta queries (sourcegraph#986)

Co-authored-by: John Mason <jmason@gitlab.com>

* Support Gitea in the indexserver (sourcegraph#988)

* zoekt-mirror-gerrit: Don't remove /a/ prefix from git clone URLs (sourcegraph#990)

The /a/ prefix is required for authenticated git operations in default Gerrit configurations. Gerrit's HttpScheme intentionally includes /a/ in clone URLs to trigger authentication. Removing it breaks instances using the default config. The ServerInfo API already returns the correct URL format based on the instance's configuration, so we should use it as-is.

Amp-Thread-ID: https://ampcode.com/threads/T-6944ab20-837d-4e78-a9ad-51083263baec
Co-authored-by: Amp <amp@ampcode.com>

* Add support for filtering out private repositories (sourcegraph#991)

* go get github.com/sourcegraph/mountinfo@main (sourcegraph#992)

Includes one update to prevent some WARN log spam if your data directory is not backed by a block device.

* ci: fix fmt string errors (sourcegraph#994)

I think these snuck in as we changed Go versions.

* Add cors_origin flag to zoekt-webserver to enable CORS headers (sourcegraph#993)

* migrate otelgrpc from deprecated interceptors to stats handler API (sourcegraph#998)

  - Replace otelgrpc.StreamServerInterceptor() and otelgrpc.UnaryServerInterceptor() with otelgrpc.NewServerHandler()
  - Upgrade go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc from v0.58.0 to v0.63.0
  - The stats handler approach is the recommended way forward and more efficient than the interceptor-based approach

Amp-Thread-ID: https://ampcode.com/threads/T-2b4e5684-212d-42e5-b790-82569bc9f447
Co-authored-by: Amp <amp@ampcode.com>

* Update dependencies to resolve CVEs (sourcegraph#999)

  - github.com/cloudflare/circl v1.5.0 -> v1.6.1 (fixes CVE-2025-8556)
  - golang.org/x/crypto v0.41.0 -> v0.45.0 (fixes CVE-2025-47914, CVE-2025-58181)

Amp-Thread-ID: https://ampcode.com/threads/T-542737d5-98ed-4e85-879f-76abc31d5b77
Co-authored-by: Amp <amp@ampcode.com>

* cmd/zoekt: add support for jsonl output (sourcegraph#997)

Currently the zoekt command's output isn't able to surface information like which branch a match corresponds to. By streaming out the matches as lines of JSON, the output can be easily processed by other programs like jq. This is potentially useful for a lot of CLI tools. Personally, I'm looking to use this with fzf and bat to build a little search TUI.

* fix: resolve infinite recursion in Repository.UnmarshalJSON (sourcegraph#1001)

This fixes a bug caused by jsonv2 json.Unmarshal() behavior that is not backward compatible. The infinite recursion occurs when UnmarshalJSON calls json.Unmarshal on itself. See: golang/go#75361

* fix: Escape language queries correctly (sourcegraph#1003)

`escape` escapes certain characters, but those don't include `+`, so clicking a `C++` language link will result in the `+`s not being escaped in the URL.
This causes the search to look for `"C "`, which finds no results. `encodeURIComponent` is what should be used.

* Add ShardPrefixOverride to support custom shard naming (sourcegraph#1005)

Co-authored-by: Ravi Kumar <rkumar@gitlab.com>

* Update go-gitlab to a v1.x.x release (sourcegraph#1007)

* bump deps with CVEs (sourcegraph#1014)

Bumped a bunch of dependencies which trivy reported had CVEs to latest:

```
go get \
  github.com/cloudflare/circl@latest \
  github.com/go-git/go-git/v5@latest \
  go.opentelemetry.io/otel/sdk@latest
go mod tidy
```

* Fix zoekt-mirror-gitea documentation (sourcegraph#1006)

This had likely been copied from the `zoekt-mirror-gerrit` documentation.

* search: deflake TestDirWatcherUnloadOnce (sourcegraph#1015)

TestDirWatcherUnloadOnce occasionally failed in CI with a queued "spurious load" event after Stop(). This is expected under fsnotify: a single logical write can emit multiple filesystem events, and DirectoryWatcher can therefore enqueue extra load notifications before shutdown completes. The old assertion treated any leftover load event as a bug, which made the test timing-sensitive.

Update the test to drain queued load notifications after Stop() and keep asserting that no extra drop event is emitted. That preserves the behavior we care about (delete triggers unload, and unload is not repeated) while removing a brittle assumption about load event multiplicity.

Reproduced the flake before, and validated the fix after, with: go test ./search -run TestDirWatcherUnloadOnce -count=500

Amp-Thread-ID: https://ampcode.com/threads/T-019d0104-367c-719f-a5f0-e7a95136407c
Co-authored-by: Amp <amp@ampcode.com>

* query: case has effect in nested expressions (sourcegraph#1022)

Previously only the "top-level" "case:" affected the query. So if you had a "case:" in any sub-expression it would have no effect. This adjusts our implementation to track nested cases in our query parser and allow them to affect the final case sensitivity on query.Qs.

Test Plan: added more test cases to demonstrate the problem. These failed before this commit.

* fix/query: resolve type lifting after OR parsing (sourcegraph#1023)

The parser used to lift `type:` directives before converting `or` placeholders into real boolean nodes. That let parse-time `orOp` sentinels leak into `Type` children for unparenthesized queries and allowed malformed inputs like `type:repo or` to parse successfully until runtime. Resolve operator placeholders before wrapping with `Type`, so unparenthesized `or` inside a type scope produces a valid `Or` subtree and malformed `or` placement is rejected during parsing. The docs now clarify that `type:` applies to the whole expression in its current scope, including `or` clauses.

Amp-Thread-ID: https://ampcode.com/threads/T-019d0be7-4c9d-76fd-b556-103420d14be9
Co-authored-by: Amp <amp@ampcode.com>

* gitindex: diff config updates for existing clones (sourcegraph#1019)

This supersedes sourcegraph#635 by porting its selective config-sync idea onto current main with a smaller, easier-to-read shape. Clone orchestration stays in gitindex/clone.go, while config argument generation and existing-clone sync logic now live in gitindex/clone_config.go.

For existing clones, we now diff each zoekt.* setting before writing. Unchanged values are skipped, changed values are updated, and settings that disappear are removed. CloneRepo returns the repo destination only when a setting change actually happened, so the caller can trigger reindexing only when needed. cmd/zoekt-mirror-github was also cleaned up so optional integer metadata keys are only added to the config map when present, which avoids pushing empty config values downstream.

Note: the sourcegraph#635 review mentioned using go-git. This commit initially explored that, but it ended up being a _lot_ more code due to missing utilities around easily setting values based on a git config string.

* index: add hybrid go-re2 engine for large file content matching (sourcegraph#1024)

Adds an optional hybrid regex engine (internal/hybridre2) that transparently switches between grafana/regexp and wasilibs/go-re2 (RE2 via WebAssembly) based on file content size. Disabled by default: no behaviour change without opt-in via ZOEKT_RE2_THRESHOLD_BYTES.

## Motivation

Issue sourcegraph#323 identified regex as the dominant CPU consumer in zoekt's webserver profile. Go's regexp engine (including the grafana/regexp fork already in use) lacks a lazy DFA. RE2's lazy DFA provides linear-time matching with much better constant factors for alternations, character classes, and complex patterns on large inputs. The tradeoff: go-re2 uses WebAssembly (~600ns per-call overhead), making it slower than grafana/regexp for small inputs (<4KB) but dramatically faster above the threshold. A full engine swap would regress small-file searches, so a threshold-based hybrid is the pragmatic approach.

## Implementation

### New package: internal/hybridre2

hybridre2.Regexp compiles both engines once at query-parse time and dispatches FindAllIndex based on len(input) >= Threshold():

```go
func (re *Regexp) FindAllIndex(b []byte, n int) [][]int {
    if useRE2(len(b)) {
        return re.re2.FindAllIndex(b, n)
    }
    return re.grafana.FindAllIndex(b, n)
}
```

### Change to index/matchtree.go

regexpMatchTree gains a hybridRegexp field used for file content matching; filename matching keeps using grafana/regexp directly (filenames are always short, so WASM overhead dominates there).
### Configuration ZOEKT_RE2_THRESHOLD_BYTES env var, read once at startup: -1 (default): disabled — always grafana/regexp, zero behaviour change 0: always use go-re2 (useful for evaluation/testing) 32768: use go-re2 for files >= 32KB (recommended starting point) ## Benchmarks Hardware: AMD EPYC 9B14, go-re2 v1.10.0 (WASM, no CGO). Alternations — `func|var|const|type|import`: 32KB: grafana 2505µs go-re2 467µs 5.4x speedup 128KB: grafana 9900µs go-re2 1699µs 5.8x speedup 512KB: grafana 40.7ms go-re2 6.8ms 6.0x speedup Complex — `(func|var)\s+[A-Z]\w*\s*(`: 32KB: grafana 1237µs go-re2 230µs 5.4x speedup 128KB: grafana 4935µs go-re2 911µs 5.4x speedup 512KB: grafana 19.9ms go-re2 3.8ms 5.3x speedup Literal — `main` (grafana wins; threshold protects this case): 32KB: grafana 33.2µs go-re2 59.8µs ## Testing go test ./internal/hybridre2/ # unit + correctness matrix go test ./index/ -short # full existing suite: passes go test ./... -short # full suite: passes Correctness verified by asserting identical match offsets between grafana and go-re2 for 9 patterns x 5 sizes (64B-256KB). ## Notes - Binary/non-UTF-8 content: go-re2 stops at invalid UTF-8 (vs. grafana which replaces with the replacement character). The default threshold of -1 ensures zero behaviour change. Operators enabling the threshold should be aware; future work could detect non-UTF-8 and force the grafana path. - Dependency: github.com/wasilibs/go-re2 v1.10.0 — pure Go WASM, no system deps. Binary size increase: ~2MB (the embedded RE2 WASM module). - Rollout plan: enable in GitLab via feature flag starting at 32KB, compare p95 regex latency before/after using per-shard timing in search responses. 
* index: 3.7x faster posting list construction via direct-indexed ASCII array (sourcegraph#1020)

* index: reduce GC pressure in posting list construction by 1.8x

  Three targeted changes to newSearchableString, the hot path where ~54% of CPU was spent on runtime memory management (memclr, memmove, madvise, mapassign):

  1. Merge the postings and lastOffsets maps into a single map[ngram]*postingList. Pointer-stored values mean the map is only written when a new ngram is first seen (~282K writes for kubernetes) rather than on every trigram occurrence (~169M). This cuts per-trigram map operations from 4 to ~1.
  2. Pre-size the map using estimateNgrams(shardMaxBytes) and pre-allocate each posting list to 1024 bytes, reducing slice growth events and eliminating map rehashing.
  3. Pool postingsBuilder instances via sync.Pool on the Builder, so sequential and parallel shard builds reuse map and slice allocations across shards instead of re-creating them.

  Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max):
  - Cold path (NewSearchableString): time 9.3s → 5.3s (-43%); allocs 901K → 677K (-25%); B/op 1358 → 1536 MB (+13%, from the larger pre-alloc)
  - Warm path (pooled reuse across shards): time 9.3s → 5.1s (-45%); allocs 901K → 23K (-97%); B/op 1358 → 0.6 MB (-99.96%)
  - WritePostings: ~155ms, unchanged.

* index: direct-indexed array for ASCII trigrams, 3.7x faster builds

  Replace map lookups with a direct-indexed [2M]*postingList array for ASCII trigrams (3 × 7-bit runes = 21-bit index, 16 MB of pointers). Since 99%+ of trigrams in source code are pure ASCII, this eliminates nearly all hash and probe overhead from the hot loop. Non-ASCII trigrams still fall back to the map. Also inline the ASCII check (data[0] < utf8.RuneSelf) to avoid utf8.DecodeRune function call overhead on the 95-99% of bytes that are ASCII. In writePostings, collect ngrams from both the array and map into a single sorted slice for writing. The on-disk format is unchanged.
  Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max):
  - Cold path (NewSearchableString): before (with map opt) 5.3s, 677K allocs, 1536 MB; after 2.5s, 676K allocs, 1544 MB; speedup 2.1x (3.7x vs original baseline)
  - Warm path (pooled reuse): before (with map opt) 5.1s, 23K allocs, 0.6 MB; after 2.3s, 23K allocs, 0.6 MB; speedup 2.2x (4.0x vs original baseline)
  - WritePostings: ~130ms, unchanged.

  The non-ASCII map now holds only ~6K entries (vs ~282K before), since the vast majority of trigrams are served by the direct array.

* index: fix stale comments after ASCII array introduction

  Update three comments that still referenced map-only storage after the direct-indexed ASCII array was added: the postingList doc, the estimateNgrams doc, and the reset() doc.

* index: address review feedback from keegancsmith

  - Replace putPostingsBuilder with returnPostingsBuilders(*ShardBuilder), which returns both the content and name builders to the pool and nils the fields, so any subsequent misuse panics obviously.
  - Drop error-path pool returns in newShardBuilder: setRepository errors are extremely rare (invalid templates or >64 branches), not worth the code complexity.
  - Thread shardMax through newShardBuilder(shardMax int). Callers without a shard size context (merge, tests, public API) pass 0 for the default (100 MB). Builder.newShardBuilder passes b.opts.ShardMax via the pool.
  - Switch sort.Slice to slices.SortFunc in writePostings for type safety and to avoid interface boxing overhead.

* index: sparse ASCII index, reduce initialPostingCap to 64

  Address review feedback:
  - Add an asciiPopulated []uint32 sparse index so reset() and writePostings iterate only the populated slots (~275K) instead of scanning all 2M. Retains postingList allocations for pool reuse via len(pl.data)==0 detection on the hot path.
  - Reduce initialPostingCap from 1024 to 64. On kubernetes (282K trigrams), the median posting list is 10 bytes and 78% are under 64. Drops pre-allocation waste from 244 MB to 11 MB. Cold-path B/op: 1558 MB → 1352 MB (-13%).

* gitindex: replace go-git blob reading with pipelined git cat-file --batch (sourcegraph#1021)

* gitindex: replace go-git blob reading with pipelined git cat-file --batch

  Replace the serial go-git BlobObject calls in indexGitRepo with a single pipelined "git cat-file --batch --buffer" subprocess. A writer goroutine feeds all blob SHAs to stdin while the main goroutine reads responses from stdout, forming a concurrent pipeline that eliminates per-object packfile seek overhead and leverages git's internal delta base cache. Submodule blobs fall back to the existing go-git createDocument path.

  Benchmarked on kubernetes (29,188 files, 261 MB), Apple M1 Max, 5 runs:
  - go-git BlobObject (before): 2.94s, 685K allocs, 691 MB
  - cat-file pipelined (after): 0.60s, 58K allocs, 276 MB
  - Speedup: 4.9x time, 12x fewer allocs, 2.5x less memory

* gitindex: streaming catfileReader API, skip large blobs without reading

  Replace the bulk readBlobsPipelined (which read all blobs into a []blobResult slice) with a streaming catfileReader modeled after archive/tar.Reader:

      cr, _ := newCatfileReader(repoDir, ids)
      for {
          size, missing, err := cr.Next()
          if size > maxSize {
              continue // auto-skipped, never read
          }
          content := make([]byte, size)
          io.ReadFull(cr, content)
      }

  Next() reads the cat-file header and returns the blob's size. The caller decides whether to Read the content or skip it; calling Next() again automatically discards unread bytes via bufio.Reader.Discard. Large blobs over SizeMax are never allocated or read into Go memory. Also split the single interleaved loop into two: one for main-repo blobs streamed via cat-file, one for submodule blobs via go-git's createDocument. The builder sorts documents internally, so ordering between the loops does not matter. Peak memory is now bounded by ShardMax (one shard's worth of content) rather than total repository size.
* gitindex: harden catfileReader Close, add kill switch and SkipReasonMissing

  Address review feedback on PR sourcegraph#1021:
  - Make Close() idempotent via sync.Once; kill the git process first (matching Gitaly's pattern) instead of draining all remaining stdout, so early termination is fast. Suppress the expected SIGKILL exit error. Add defer close(writeErr) in the writer goroutine to prevent deadlock on double-close.
  - Change the Next() return and the pending field from int64 to int, and use strconv.Atoi. This removes casts at all call sites; SizeMax is already an int.
  - Add SkipReasonMissing for blobs that git cat-file reports as missing, instead of reusing SkipReasonTooLarge. Missing is unexpected for local repos (corruption, shallow clone, gc race), so log a warning.
  - Extract indexCatfileBlobs() with defer cr.Close(), eliminating four manual Close() calls on error paths.
  - Add a ZOEKT_DISABLE_CATFILE_BATCH env var kill switch following the existing ZOEKT_DISABLE_GOGIT_OPTIMIZATION pattern. When set, all blobs fall back to the go-git createDocument path.
  - Deduplicate skippedLargeDoc/skippedMissingDoc into skippedDoc(reason).
  - Add 19 hardening tests covering the Close lifecycle (double close, concurrent close, early termination), Read edge cases (partial reads, 1-byte buffer, empty blobs, read-without-next), missing object sequences, large blob byte precision, and duplicate SHAs.

  Benchmarked on kubernetes (29,188 files): no performance regression (geomean -0.89%, within noise).

* Handle git indexing path correctly for Git worktrees (sourcegraph#1025)

  `gitindex` assumed repositories could be opened with the default go-git repo opener. That works for normal clones, but it breaks for Git worktrees because refs and object data are split between the worktree git dir and the shared common git dir. This change introduces a small `plainOpenRepo` helper that uses `git.PlainOpenWithOptions` with `DetectDotGit` and `EnableDotGitCommonDir`. The helper is then used in the plain-open code paths inside `gitindex`, including config/template loading. With this change, Zoekt can read Git worktrees through go-git using the same logical repository view as Git itself, instead of failing on missing refs or incomplete repository layouts.

* search: use structured logging for crashed shards (sourcegraph#1028)

  We recently had some crashes in production while experimenting with ZOEKT_RE2_THRESHOLD_BYTES and found it hard to consume the stack traces since they spanned multiple lines.

* gitindex: set filter for cat-file (sourcegraph#1026)

  At Sourcegraph we do a sparse clone which excludes files based on max file size. However, cat-file will hydrate in missing objects, so we pass in the same filter to avoid hydrating those files.

* indexserver: set blob:limit based on FileLimit (sourcegraph#1027)

  Previously we hardcoded this value to 1 MB (which aligns with what is set in Sourcegraph), but we might as well set it based on the limit that is passed in. Additionally, we set the special HTTP headers Sourcegraph needs in the config so that when git double-checks whether it needs to hydrate an object, it can do so successfully.

* gitindex: handle bare repos in plainOpenRepo (sourcegraph#1030)

  If you set ZOEKT_DISABLE_GOGIT_OPTIMIZATION=true, then zoekt-git-index would fail on bare repositories. This was a regression from a0f5789 ("Handle git indexing path correctly for Git worktrees"). We also increase our test coverage to handle bare vs. normal vs. worktree repositories, and add a test for disabling the optimization on bare repos, specifically targeting the regression we saw.

* gitindex: speed up tests (sourcegraph#1031)

* search: log query when a shard crashes (sourcegraph#1032)

  It would be useful to know the query which caused a shard to crash.
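As a hedged illustration of the blob:limit partial-clone filter mentioned above (the URL and the 1 MB value are examples; Zoekt derives the limit from FileLimit rather than hardcoding it):

```shell
# Exclude blobs larger than 1 MB from the initial clone; git fetches
# them on demand only if something later asks for their content.
git clone --bare --filter=blob:limit=1m https://example.com/repo.git
```

This is why the cat-file path needs the same filter: without it, reading a filtered-out blob triggers an on-demand fetch ("hydration") of exactly the large files the clone excluded.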
* indexserver: integration test for sourcegraph (sourcegraph#1033)

* gitindex: disable git cat-file optimization by default (sourcegraph#1038)

  In our production sourcegraph.com cluster we are regularly seeing git grow to over 9GB in size, leading to OOMs. Once we disabled this flag, the OOMs went away. For now let's default this feature to off until it is resolved.

* gitindex: optimize git index time by ~21% (sourcegraph#1036)

* gitindex: BenchmarkPrepareNormalBuild uses ZOEKT_BENCH_REPO (sourcegraph#1041)

* ci: publish image to ghcr.io (sourcegraph#1042)

* ci: correctly clean up ctags tmp in docker (sourcegraph#1043)

* zoekt-mirror-gitea: fix getting all repos (sourcegraph#1045)

* gitea: remove duplication in fetching repos (sourcegraph#1046)

  We had the same pagination logic repeated 3 times; factor it out.

* Potential fix for pull request finding 'CodeQL / Incorrect conversion between integer types'

* Potential fix for pull request finding 'CodeQL / Incorrect conversion between integer types'

* Potential fix for pull request finding 'CodeQL / Incorrect conversion between integer types'

  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

---------

Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
Co-authored-by: Julie Tibshirani <julietibs@apache.org>
Co-authored-by: sinyawskiy <sinyawskiy@gmail.com>
Co-authored-by: Srijan Saket <srijanskt@gmail.com>
Co-authored-by: Vincent <evict@users.noreply.github.com>
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
Co-authored-by: Janik <80165193+Janik-Haag@users.noreply.github.com>
Co-authored-by: dschiemann80 <36302850+dschiemann80@users.noreply.github.com>
Co-authored-by: teleivo <teleivo@users.noreply.github.com>
Co-authored-by: Marc W. <113890636+MarcWort@users.noreply.github.com>
Co-authored-by: Camden Cheek <camden@sourcegraph.com>
Co-authored-by: Holger Freyther <holger@moiji-mobile.com>
Co-authored-by: John Mason <john@johnmason.io>
Co-authored-by: Matthias Fechner <mfechner@users.noreply.github.com>
Co-authored-by: Aleksander Ryzhickov <makseljoinb@gmail.com>
Co-authored-by: ARyzhikov <ARyzhikov@nota.tech>
Co-authored-by: Phillip Dykman <Phil.d324@gmail.com>
Co-authored-by: John Mason <jmason@gitlab.com>
Co-authored-by: Keshane Gan <keshanegan@gmail.com>
Co-authored-by: Amp <amp@ampcode.com>
Co-authored-by: Rikard Almgren <rikard.almgren@protonmail.com>
Co-authored-by: Łukasz Kurowski <crackcomm@users.noreply.github.com>
Co-authored-by: Josh Heinrichs <joshiheinrichs@gmail.com>
Co-authored-by: ZheNing Hu <adlternative@gmail.com>
Co-authored-by: Daryl Haresign <github@daryl.haresign.com>
Co-authored-by: Ravi Kumar <kumar.ravi.ec@gmail.com>
Co-authored-by: Ravi Kumar <rkumar@gitlab.com>
Co-authored-by: Marc Plano-Lesay <marc.planolesay@gmail.com>
Co-authored-by: Dmitry Gruzd <dgruzd@gitlab.com>
Co-authored-by: Clémence Lesné <clemence@lesne.pro>
Co-authored-by: Manuel van Rijn <manuel@manuelvanrijn.nl>
Co-authored-by: yufan <yufan@live.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Fixes #893