Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul npm detector #2264

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

Conversation

rgmz
Copy link
Contributor

@rgmz rgmz commented Dec 28, 2023

Description:

This fixes #1455. It matches new npm tokens (npm_xxx...), old npm tokens (NpmToken.0000-..., 0000-...), and "non-standard" tokens such as Artifactory using a JWT or GitHub packages using a PAT. Basic auth is a separate can of worms, so is out of scope for this PR.

The high-level flow for all detectors is:

flowchart TD
    A[Got token] -->|Search for associated URL| B{URL found?}
    B -->|Yes| C[Verify against URL]
    B -->|No| D[Search for any potential URLs]
    D --> E[Verify against all URLs]
    
    C-->Z[END]
    E-->Z
Loading

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint this requires golangci-lint)?

@rgmz rgmz requested a review from a team as a code owner December 28, 2023 18:37
@rgmz rgmz force-pushed the feat/update-npm branch 2 times, most recently from f583bf3 to 0080bf4 Compare December 28, 2023 18:44
Copy link
Contributor Author

@rgmz rgmz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This went through several iterations so there may be some leftover cruft or confusing bits.

}
}

func BenchmarkFromDataV1(benchmark *testing.B) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main downside of consolidating everything under one package is that you can't have BenchmarkFromData multiple times.

pkg/detectors/npm/registry.go Show resolved Hide resolved
// Skip the first two indices: 1 is the entire string, 2 is the protocol.
index, uri := firstNonEmptyMatch(matches, 2)
info := &registryInfo{
RegistryType: registryType(index - 1),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is spooky black magic that relies on knownRegistryPat having the same number of capture groups as registryType has values. Unfortunately, it doesn't seem possible to programmatically get enum values so this can't be verified at compile time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be feasible to create some sort of registry struct that contains a pattern and a name and a const int, and then smoosh them all together into a list in one place? That way there wouldn't have to be these multiple lists maintained across multiple files. (If not, then I think some loud comments in all the places that are magically correlated are warranted.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered doing that but was concerned that looping through n patterns would be much less efficient. In the worst case, you'd need to loop through all the "known" patterns, find all "generic" registry matches, and do a secondary lookup to determine the scheme for each registry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I'm either misunderstanding something or being unclear. I'm just proposing merging the three lists of registries you have (const ints, string names, and patterns) into one list of structs, each of which contains three fields. Why would that add any additional loops?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I'm misunderstanding. Can you provide an example of what you mean?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type registry struct {
    // maybe an int too, i don't know if it's necessary
    name    string
    pattern string
}

// somewhere, maybe at a top level
registries := []registry {
    {
        name: "npm",
        pattern: `(registry\.(?:npmjs\.org|yarnpkg\.com))`
    },
    {
        name: "artifactoryCloud",
        pattern: `([a-z0-9-]+\.jfrog\.io/(?:artifactory|[a-z0-9._-]+)/api/npm/[a-z][a-z0-9._-]+)`
    {,
    // etc
}

(This could also be some kind of map. I see that some of the patterns are pretty long and decomposable, so maybe they get built out of a function somehow.) Building the big monster pattern is a matter of looping through the slice and just joining all the patterns together. This would eliminate the spooky magic that ties together the big pattern and the registry list.

pkg/engine/ahocorasick/ahocorasickcore.go Show resolved Hide resolved
detectors.Versioner
} = (*ScannerGeneric)(nil)

func (s ScannerGeneric) Version() int { return 0 }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version 0 might not be the most idiomatic way to approach this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mcastorina is there any reason not to use an explicit version 0 when also using versioned detectors?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc the way we implemented filtering depends on the version being non-zero, so you wouldn't be able to exclude/ignore just the v0 detector.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a test here to codify that assumption.

Copy link
Contributor

@rosecodym rosecodym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a ton of work, thanks for diving into it. I added some questions inline.

req.Header.Set("Authorization", fmt.Sprintf("Bearer %s", token))
res, err := client.Do(req)
if err != nil {
// A |tls.RecordHeaderError| likely means that the server is using HTTP, not HTTPS.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I paranoid to be concerned by an automatic fallback here? It seems like we should be trying to the scheme in the original URI and failing if https is expected but not present. Am I thinking about this right?

Copy link
Contributor Author

@rgmz rgmz Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some circumstances it isn't possible to determine the scheme of the registry URI. For example, //registry.host.com/npm/:_authToken=... does not contain a scheme but we can do a secondary lookup to tell that it should be HTTP.

//registry.host.com/npm/:_authToken=...
registry=http://registry.host.com/npm/

If that information (registry=http://...) isn't present in the chunk then the scheme is unknown. HTTPS is a reasonable default but may not necessarily be correct. Trying to connect to a HTTP server with HTTPS seems to reliably result in a tls.RecordHeaderError; conversely, I haven't found a reliable indicator that you're to connect to a HTTPS port with HTTP.

We could limit the automation fallback to cases where the scheme is unknown. Personally, I don't think there's any risk in having it.

_ = res2.Body.Close()

if res2.StatusCode == http.StatusOK {
return false, nil, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tell me if I've got this right:

  • If the whoami user is anonymous,
  • And the request succeeds even without the token sent at all,
  • Then we conclude that the token isn't valid and the endpoint is just bad at communicating that fact
  • But if the request fails after removing the token, then we conclude that the token legitimately has no user attached

If I got that correctly, then it took me a bit to understand it. (And if I didn't, then I definitely didn't understand it!) I think I would have figured it out faster if this logic were factored out into a named function, but maybe there's another approach that would clarify it as well.

Copy link
Contributor Author

@rgmz rgmz Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I got that correctly, then it took me a bit to understand it. (And if I didn't, then I definitely didn't understand it!) I think I would have figured it out faster if this logic were factored out into a named function, but maybe there's another approach that would clarify it as well.

Your understanding is correct, but the verification logic is probably going to get even more convoluted, so it would make sense to extract this into its own function.

Some registries do not implement the whoami endpoint and return a 404. Others, like npm.fontawesome.com, have a no-op whoami endpoint that always returns 200 {"username": null}. That's not to mention that some registries don't implement either /-/whoami or /-/all but do implement /-/v1/search or only support direct package references.

e.g., I have encountered dozens of Azure and Google registry URLs but they all return 404. I don't know if that means the tokens are invalid, the registry doesn't implement the endpoint, or the registry no longer exists.

pkg/detectors/npm/common.go Show resolved Hide resolved
detectors.Versioner
} = (*ScannerGeneric)(nil)

func (s ScannerGeneric) Version() int { return 0 }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mcastorina is there any reason not to use an explicit version 0 when also using versioned detectors?

func (s ScannerGeneric) Version() int { return 0 }

// genericKeyPat should match all possible values for .npmrc auth tokens.
// TODO: Ensure this works with Yarn and UPM configurations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this PR degrade any existing functionality?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done some further testing and having dedicated scanners for each pattern type (.npmrc, .yarnrc, .upmconfig.toml, etc.) seems to produce fewer false positives. However, I'm not sure how to best organize them. Are "versions" meant to have any semantic meaning/order?

i.e., is it confusing to have one detector type with numerical versions that are completely unrelated (npm version 1 & 2 are siblings, version 3 is yarn, version 4 is rennovate, version 5 is a hypothetical update to the yarn config, etc.)

},
"_authToken/scoped/artifactory": {
input: `registry=https://artifactory.example.com/artifactory/api/npm/npm/
//artifactory.example.com/artifactory/api/npm/npm/:_authToken=eyJ2ZXIiOiIyIiwidHlwIjoiSldUIiwiYWxnIjoiUlMyNTYiLCJraWQiOiJSZ25DcnBEVXlKOV9yNElVRnNSU2hqU0E0aGpibEpjZ0M2bnJhN3ZqcTNNIn0.eyJzdWIiOiJqZnJ0QDAxY2pwdDc0N3ZyNHo0MTU4MHNiN3MxYW14XC91c2Vyc1wvYXJ0dXJvLmNhbXBvcyIsInNjcCI6ImFwcGxpZWQtcGVybWlzc2lvbnNcL2dyb3VwczpyZWFkZXJzLGRlcGxveS1kZXYtbnBtLGRlcGxveS1sb2NhbCIsImF1ZCI6ImpmcnRAMDFjanB0NzQ3dnI0ejQxNTgwc2I3czFhbXgiLCJpc3MiOiJqZnJ0QDAxY2pwdDc0N3ZyNHo0MTU4MHNiN3MxYW14XC91c2Vyc1wvYXJ0dXJvLmNhbXBvcyIsImV4cCI6MTY1NjAwNDUxOSwiaWF0IjoxNjU2MDAwOTE5LCJqdGkiOiJjOWZhM2VhNS01MzE0LTQwMzUtYjNiYy03OGNjMDJmYmM1NWMifQ.EIEPLzz9H2oQ3rKLKI_t1qtxi-G9ym6P6J5z7xWLsq79aHK3XV_E8_yuUK8giaOOdl_CaITOjl8Bt2aUVXTZKFKIOiQMLscBM4B6qGj_n4Qq8jiMwcKjnD_iWx8Wo-aNHHxgjvrRdWf-2UPIm6lSc77oZbUNAjhA5Q-W3uQRG7d50FGZpq_EEZsfbOcD7EMU2ZnvfYNTgTtmhZWfLefzB6xUF8WHgiDAVHJKQ2fKLX45Z9trc2SkKQmPBxaS-pBtKBhK15kQZ3x625KtLRr2ZwgOaJKHcg4SwuGOpyTF48nTk53SDorSj6fqlypTavVQRi-5cuSGTPrqLObk6lwpRg`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, how did you generate your test file secrets?

"invalid/placeholder_x": {
input: `//registry.npmjs.org/:_authToken=npm_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX`,
},
"invalid/word_boundary": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ all your tests

// Skip the first two indices: 1 is the entire string, 2 is the protocol.
index, uri := firstNonEmptyMatch(matches, 2)
info := &registryInfo{
RegistryType: registryType(index - 1),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be feasible to create some sort of registry struct that contains a pattern and a name and a const int, and then smoosh them all together into a list in one place? That way there wouldn't have to be these multiple lists maintained across multiple files. (If not, then I think some loud comments in all the places that are magically correlated are warranted.)

pkg/engine/engine.go Outdated Show resolved Hide resolved
Comment on lines +49 to +58
for _, registry := range registries {
verified, extraData, err = doVerification(ctx, s.client, registry, token)
if verified {
return true, extraData, err
}
if err != nil {
errs = append(errs, err)
}
}
return false, nil, errors.Join(errs...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tell me if I'm reading this right: Any verification errors returned from any registry will propagate to the final result, even if any (other) individual registry returns a determinate failure (i.e. no error.) This means that if any registry reports indeterminate verification, then either the token will be verified or indeterminately unverified.

If I'm right, this is an issue, because it means that tokens can plausibly be impossible to determinately revoke. (A similar problem exists with the JDBC detector.)

Copy link
Contributor Author

@rgmz rgmz Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort of. The 'happy path' is that a token is associated with a single known registry(1) and we get either a determinate verification or failure. If the token is associated with a single registry but it isn't a 'known' pattern(2), it's possible to encounter an indeterminate failure caused by implementation-specific differences.

# 1
//artifactory.corp.com/artifactory/api/npm/npm/:_authToken=*abcdef123456789*

# 2 .npmrc
//packages.corp.net/custom_layout/:_authToken=...
# 2 CLI flag
npm config set registry https://packages.corp.net/custom_layout/

The worst-case scenario is that a token cannot be correlated with a single registry, and there are multiple potential registries1. In that case, it can be difficult to tell whether something determinately or indeterminately failed. Some implementations are extremely sensitive, with things as small as a trailing slash being the difference between a 200 API response and a generic 400 HTML error page.

So you can either report failures for all attempted URLs, or report none of them and just mark the token as unverified2. Realistically, this is a problem when any multi-part secret has multiple potential matches. For example, Azure is a combination of secret (distinct) + clientId (uuid) + tenantId (uuid); if it fails and you have multiple UUIDs is it better to report ALL possible combinations or none of them?

Footnotes

  1. Honestly, the detector that's most likely to trigger this right now is the existing npm v1 one that just searches for npm.+{uuid}. It's prone to false positives — even GitHub Advanced Security flags random UUIDs as npm tokens.

  2. prior versions had more logic to handle invalid registries (e.g., being redirected to https://landing.jfrog.com/reactivate-server/test or a HTML 404 page). That was a bit complicated to refactor so I ended up dropping it.

@CLAassistant
Copy link

CLAassistant commented Jan 4, 2024

CLA assistant check
All committers have signed the CLA.

if err != nil {
return false, nil, fmt.Errorf("request failed for %s: %w", reqUrl, err)
}
_, _ = io.Copy(io.Discard, res.Body)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be res2.Body?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely. 😅

// language=regexp
`(?i)(//%s(?:/[a-z0-9._-]+)*)/:(?:_auth(?:Token)?|_password).{1,20}%s`, hostPat, token))
matches := registryAuthPat.FindStringSubmatch(data)
if len(matches) == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be len(matches) < 2 since we're checking matches[1] next?

Copy link
Contributor Author

@rgmz rgmz Jan 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe so. If a regex pattern matches, it always returns a fixed number of match groups. If you have mutually exclusive match groups like (?:(foo)|(bar)) then the non-matching group will just be an empty string.

func main() {
	pat := regexp.MustCompile(`(?i)hello (?:(john)|(anne)|(world)|(.+))`)
	matches := pat.FindStringSubmatch("Hello Anne")
	fmt.Printf("Matches return %d groups: %s\n", len(matches), strings.Join(matches, ", "))
}

// Matches return 5 groups: Hello Anne, , Anne, , 

https://go.dev/play/p/cdPwcqZwDcT

TL;DR len(matches) should always be 0 or > 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose you could make the argument that it prevents failure, but if your pattern changes I think it's better to explicitly fail than silently break.

e.g., stuff like this will silently break if the pattern is updated to add another match group.

if len(match) != 2 {
continue
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

NpmToken detector should verify against the registry URL, if present
5 participants