Skip to content

Commit

Permalink
rockskip: skip non-utf8 paths (#62197)
Browse files Browse the repository at this point in the history
We have seen errors on S2 for inserting non-utf-8 data to the symbols table. `path` and `name` (symbol name) are the only user controlled fields. We already check for non-utf-8 content in go-ctags, so that leaves `path`, which is not required to be utf-8 compatible.

With this change we simply skip the path, which means that symbols within that path will not exist in the index. This is somewhat in line with ctags, where we skip indexing non-utf-8 content.

I believe this is an edge case and it is acceptable to skip those files to keep the code simple.

Alternatives:
- We could store arbitrary bytes `bytea` instead of `text` in Postgres.

Test plan:
CI
  • Loading branch information
stefanhengl committed Apr 29, 2024
1 parent 3de01f9 commit 83885b0
Showing 1 changed file with 10 additions and 0 deletions.
10 changes: 10 additions & 0 deletions internal/rockskip/index.go
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ import (
"database/sql"
"fmt"
"time"
"unicode/utf8"

"github.com/amit7itz/goset"
pg "github.com/lib/pq"
Expand Down Expand Up @@ -136,6 +137,15 @@ func (s *Service) Index(ctx context.Context, repo, givenCommit string) (err erro
deletedPaths := []string{}
addedPaths := []string{}
for _, pathStatus := range entry.PathStatuses {
if !utf8.ValidString(pathStatus.Path) {
s.logger.Warn(
"Rockskip skipping file due to path not being utf-8 encoded",
log.String("repo", repo),
log.String("path", pathStatus.Path),
)
continue
}

if pathStatus.Status == gitdomain.DeletedAMD || pathStatus.Status == gitdomain.ModifiedAMD {
deletedPaths = append(deletedPaths, pathStatus.Path)
}
Expand Down

0 comments on commit 83885b0

Please sign in to comment.