SHA1 -> SHA256 #816

Open
jorangreef opened this Issue Oct 12, 2016 · 63 comments

@jorangreef

jorangreef commented Oct 12, 2016

I have done a quick check but it looks like Yarn is using SHA1 for checksumming?

If so, I think it would be a good time to switch to SHA256.

@ljharb


ljharb commented Oct 12, 2016

it's worth noting that SHA1 is achievable in POSIX, but SHA256 is much harder to make work cross-platform.

@Daniel15


Member

Daniel15 commented Oct 12, 2016

SHA256 is much harder to make work cross-platform.

Just need to use PowerShell everywhere, it has it built-in ($hash = (Get-FileHash -Path $file -Algorithm SHA256).Hash) 😛

Node's crypto library supports SHA256, so I think we'd use that.

> var crypto = require('crypto')
undefined
> crypto.createHash('sha256').update('Hello World').digest('hex');
'a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e'
@Daniel15


Member

Daniel15 commented Oct 12, 2016

Looks like there are three hashing functions in the codebase:

"hash" in src/util/misc, uses SHA256
"HashStream" in src/util/crypto, uses SHA1
"hash" in src/util/crypto, takes the algorithm as an argument and defaults to MD5

hash in src/util/misc and hash in src/util/crypto should probably be unified, and they should all use SHA256. We'd need to think about backwards compatibility, though.

@kittens


Member

kittens commented Oct 12, 2016

Thanks for the issue! The shasums we use are given to us by the npm registry so there's not much we can do about this unfortunately. 😢

@kittens kittens closed this Oct 12, 2016

@Daniel15


Member

Daniel15 commented Oct 12, 2016

@kittens - Can we compute our own hash once the file has been downloaded, and store that in the local cache?

@graingert


Contributor

graingert commented Mar 6, 2017

might as well jump straight to SHA3

@bestander


Member

bestander commented Mar 6, 2017

Now it is time to reconsider the issue.
A few problems to solve first:

  • forwards compatibility / migration path
  • speed concerns
  • should we just ignore sha1 coming from npm or use it anyway?

Maybe sha2/3 could be an additional field?

@graingert


Contributor

graingert commented Mar 6, 2017

It looks like "hash" in src/utils/misc.js is gone now, and I can only see md5 used in:

const dest = this.config.getTemp(crypto.hash(url));

and
https://github.com/yarnpkg/yarn/blob/master/src/util/git.js#L35

@graingert


Contributor

graingert commented Mar 6, 2017

looks like it would be safe to change the default to sha256. I'm not sure where the yarn.lock tgz hash files are checked though.

@bestander


Member

bestander commented Mar 6, 2017

@graingert


Contributor

graingert commented Mar 6, 2017

@bestander where is the value from the yarn.lock file passed in?

@bestander


Member

bestander commented Mar 7, 2017

@zkat


Contributor

zkat commented May 13, 2017

Hi y'all.

Just dropping in to point out that the npm registry has started serving more secure hashes with packages that have been published with it.

You'll find the new hashes in the dist.integrity field of package manifests, in the same object as dist.tarball and dist.shasum. The integrity field is a proper subresource integrity field. Please note that in order to be fully forward-compatible here, you should follow the subresource integrity spec and parse this field out, pick algorithms, etc (or just use the linked library). We will almost certainly use/abuse the ?opt syntax from the spec in the future, and eventually start shipping multi-hash strings. It's also possible we'll have multiple hashes for the same algorithm, only one of which will apply to the specific tarball being downloaded, but that's more a door to leave open in the future.

On the publishing side, you should continue generating sha1 hashes into shasum, but add an integrity field with a stronger algorithm. See https://github.com/npm/npm-registry-client/pull/157/files for the npm-registry-client implementation of it.

Right now, the registry only supports sha512 for tarballs that have been published using npm5, but we'll be generating integrity strings for all tarballs on the registry at some point in the near-ish future, so they should be more widely available then.

lmk if y'all have any questions! It's exciting to start moving on from sha1 and setting ourselves up for a smoother transition to sha3 and beyond :)

edit: p.s. I saw something about generating these from the command line. If you're relying on sha1sum for that, srisum should be a near-drop-in replacement.
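
To illustrate the shape of that integrity field, a minimal parser sketch for a space-separated SRI string follows. This is assumption-level illustration only; the ssri library linked above handles the spec's edge cases and should be used in practice:

```javascript
// Minimal sketch of parsing a subresource-integrity string.
// Each space-separated entry is "<algorithm>-<base64 digest>[?option]*".
function parseIntegrity(integrity) {
  const byAlgorithm = {};
  for (const entry of integrity.trim().split(/\s+/)) {
    const match = /^([^-]+)-([^?]+)(\?.*)?$/.exec(entry);
    if (!match) continue; // skip malformed entries
    const [, algorithm, digest, opts] = match;
    (byAlgorithm[algorithm] = byAlgorithm[algorithm] || []).push({
      digest,
      // preserve the ?opt values zkat mentions, e.g. "?blah"
      options: opts ? opts.split('?').slice(1) : [],
    });
  }
  return byAlgorithm;
}
```

Note that this keeps multiple entries per algorithm, which matters for the forward-compatibility plans described above.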

@bestander


Member

bestander commented May 19, 2017

Thanks for commenting this, @zkat, very much appreciated!

@bestander


Member

bestander commented May 19, 2017

Afaik this is not a large change to do now, so a PR with a fix is welcome

@nikshepsvn


nikshepsvn commented Aug 24, 2017

@bestander Is this still open? I would love to give it a shot :) -- Also, I hope I'm not asking for too much, but can you give me an idea on how I should get started?

@bestander


Member

bestander commented Aug 24, 2017

@nikshepsvn yes it is, give it a go.

A good start would be to check where Yarn checks hashes, for example https://github.com/yarnpkg/yarn/blob/master/src/fetchers/tarball-fetcher.js#L83.
Another point: npm started returning sha256 or sha512 in package info; we should probably start using it.

@nikshepsvn


nikshepsvn commented Aug 24, 2017

@bestander Thank you so much, I'll do my best to understand as much as I can and get back to you with any questions -- is that okay?

@bestander


Member

bestander commented Aug 24, 2017

@ghost


ghost commented Oct 4, 2017

Currently NPM has SHA-512 for new packages and SHA-1 for old ones.

@bestander


Member

bestander commented Oct 4, 2017

Well someone send us a PR

@BYK


Member

BYK commented Nov 27, 2017

@imsnif I remember seeing an issue around this earlier but I thought it was fixed. I'd say this is a new issue that we are not aware of so go ahead and send a PR if you have something handy. If not I'd appreciate a bug report. I'd close it as a duplicate if I find the old one open and similar enough :)

@bestander


Member

bestander commented Nov 27, 2017

Great to see this progressing and thanks for thinking about backwards compat.
Looks good to me, too.

@zkat


Contributor

zkat commented Nov 28, 2017

@BYK @imsnif If you decide to represent integrity in your own format, please note that there can be multiple valid entries for each individual hash, so the right data structure would end up being:

{
  integrity: { sha1: ['deadbeef', 'c0ffee'], sha3: [...] }
}

Note also that order within an algorithm matters. And yes, this is definitely a feature that we're planning on using in the future. You should also be sure to preserve the ?options values for each individual hash. My 2c is that keeping it as a single string is by far the best representation, but if you -really- want line-by-line, I recommend splitting the SRI on spaces and keeping each entry whole instead of trying to split it up further:

{
    integrity: [
      'sha1-deadbeef',
      'sha1-c0ffee',
      'sha512-deadc0ffee?blah'
    ]
}

This will be the most compatible long-term.
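
The line-by-line suggestion above amounts to a trivial round-trip, splitting on whitespace and keeping each entry whole (illustrative names only):

```javascript
// Round-trip between the single SRI string and the entry-per-line
// array form suggested above; each "algo-digest?opts" entry stays whole.
function toEntries(integrity) {
  return integrity.trim().split(/\s+/);
}

function toIntegrityString(entries) {
  return entries.join(' ');
}
```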

@imsnif


Member

imsnif commented Nov 28, 2017

@BYK - I personally think there is great value in consistency between the two clients here (not to mention adherence to the standard), and so I would opt for the string representation (such as the one created by ssri's stringify method). I say this even though I already implemented the method discussed above and would have to go back on it. :) What do you think?

@zkat - thanks for weighing in! There has been a concern raised here about the sorting of the sri string. Right now, the ssri module does not enforce any order: https://github.com/zkat/ssri/blob/latest/index.js#L81-L85 (unless I missed something?)
So I wonder if you would be open to the ssri module sorting the algorithms when stringifying? I'd be happy to issue a PR for that before continuing with this one.

@BYK


Member

BYK commented Nov 28, 2017

@zkat thanks a lot for weighing in, super useful feedback!

@imsnif I still think having this as a real array as @zkat suggested would be better than having an opaque string unless there's something I'm missing :)

@imsnif


Member

imsnif commented Nov 28, 2017

@BYK - So, my reasoning goes like this.

The pros of having the integrity field as a string are:

  1. yarn would be adhering to the standard: https://www.w3.org/TR/CSP2/#source-list-syntax, as linked from the sri standard: https://w3c.github.io/webappsec-subresource-integrity/#integrity-metadata-description.
  2. The yarn.lock integrity field would be 100% consistent with the package-lock.json integrity field, even going so far as using the same module for parsing/stringifying/validating. I think this is very valuable and would help protect against future changes in npm.

The cons are:

  1. A slightly less readable yarn.lock file.
  2. Potential problems with consistency and sorting.

Regarding the cons:

  1. I think this is the major rub. And I admit it's a trade-off. But on the other hand, I don't think there will be many cases in which users would need to scrutinize the hashes in the yarn.lock file, seeing as all errors that are given for inconsistencies are quite verbose. But I will of course defer to your experience in the matter.
  2. Even if providing a consistently sorted string would be an issue for the ssri module (re-reading @zkat's response, I think she mentioned the ordering of the hashes might be important - so I'd of course understand if they'd want to keep things as they are), the only thing that would change a yarn.lock in such a case is a change in the Object.keys function or in the ssri module itself (please correct me if I'm wrong and didn't think of something!) And even then, the only thing that would happen is a slight diff in the file, with all other functionality remaining intact.

In my opinion, the pros outweigh the cons here. But I will of course defer to your judgement if you disagree. :)

@BYK


Member

BYK commented Nov 28, 2017

@imsnif I understand where you are coming from and thanks for sharing the relevant specs which helped me see a clearer picture. That said, the only reason for the large string blob concatenated with spaces seems to be the restrictions in HTML (you cannot have the same attribute name more than once in an element so when you need to express an array you either need children, which is not possible with every tag, or need a serialization method).

These are indeed a list of hash specs. If you ask me, we should denormalize this data even more, to make it explicit which part is the hash specifier and which part is the hash itself. I don't see any value in doing another pass of string parsing when we already have a data format that supports lists and objects.

I wasn't a part of the package-lock.json file format decisions, so I cannot comment on how or why they chose a single string blob for this field, but I do not see a good reason to store serialized data inside an already serialized data format.

@zkat - is it possible for you to provide the reasons why you have chosen to use this field as a string blob instead of a JSON array?

@BYK


Member

BYK commented Nov 28, 2017

I don't see any value in doing another pass of string parsing when we already have a data format that supports lists and objects.

To contradict myself, URLs themselves are serialized string blobs in an already serialized data format here but when it comes to lists, I think this makes a significant improvement in terms of reviewing changes in the lockfile. Does that make sense? :)

@imsnif


Member

imsnif commented Nov 28, 2017

@BYK - alright, I understand the reasoning. Thanks for taking the time to explain it.

@zkat - I too would be very interested in hearing your reasoning, if you'd be willing.

Otherwise though, I'll continue working on this (I've got some other issues to get through, so this won't block me for now). For the time being though, just to make sure, our agreed solution is @zkat's suggestion:

{
    integrity: [
      'sha1-deadbeef',
      'sha1-c0ffee',
      'sha512-deadc0ffee?blah'
    ]
}

right?

@imsnif


Member

imsnif commented Nov 28, 2017

Or, rather:

    integrity:
      sha1-deadbeef
      sha1-c0ffee
      sha512-deadc0ffee?blah
@BYK


Member

BYK commented Nov 28, 2017

@imsnif yup :)

@graingert


Contributor

graingert commented Nov 28, 2017

what's the behavior with two different sha1s?

@graingert


Contributor

graingert commented Nov 28, 2017

what's the behavior if the sha1 matches but the sha512 does not match?

@imsnif


Member

imsnif commented Nov 28, 2017

@graingert - You raise valid and important questions, whose answers I plan to articulate with tests. So would it be fine if we go into the details after I submit the PR? I think it'll be much easier. :)
If you'd like a sneak preview, I'm validating with https://github.com/zkat/ssri (my current implementation uses integrityStream, but it's subject to change).

@BYK BYK added cat-compatibility and removed help wanted labels Nov 28, 2017

@zkat


Contributor

zkat commented Nov 28, 2017

The reason for having had a string was purely to be close to the actual SRI spec. That is, providing them as single, whitespace-separated strings. I can see npm bumping the lockfileVersion in the future for the sake of making it an array, but it'll be a long time before it starts making any real impact on diffs at all.

While in theory there can be a bunch of stuff in that integrity field, in practice, it's almost always going to be a single entry (until we start upgrading algorithms or using more advanced options). The idea that it would significantly affect diffs didn't really cross my mind, cause it's a relatively stable field: if you get a diff, you'll likely get a diff for the entire entry, so you won't get individual lines for each hash anyway.

As far as ordering goes: I may have misspoken here as far as whether ordering within an algorithm matters. I think it's actually fine because individual hashes within an algorithm will just be tried until one succeeds.

When you have multiple algorithms, though, for security reasons, you must always use the strongest one that you know you can use. So if you have sha512 and sha1, you must always use sha512 and never, ever fall back to sha1. OTOH, if you see a sha3 and sha1, and you're on a version of Yarn that does not support sha3, you can use sha1. This is a future-proofing feature. You'll notice that ssri does its own algorithm ordering, and it's fairly conservative about it (again, for security reasons).
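
That "strongest supported, never fall back below it" rule can be sketched as a small selection helper. The priority list here is an assumption for illustration; ssri maintains its own, more conservative ordering:

```javascript
// Algorithms this client knows, strongest first (illustrative only).
const STRENGTH = ['sha512', 'sha384', 'sha256', 'sha1'];

// Pick the strongest algorithm we support from an integrity field.
// Unknown future algorithms (e.g. sha3 on an old client) are skipped,
// so the client falls back to the best algorithm it does understand,
// but never below one it supports.
function pickAlgorithm(available) {
  for (const algo of STRENGTH) {
    if (available.includes(algo)) return algo;
  }
  return null; // fail closed: nothing we can verify
}
```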

@imsnif


Member

imsnif commented Nov 30, 2017

@BYK - slight hitch with the array approach. It is not backward compatible. Issuing the yarn command from latest with this yarn.lock file:

# THIS IS AN AUTOGENERATED FILE. DO NOT EDIT THIS FILE DIRECTLY.
# yarn lockfile v1


array-flatten@1.1.1:
  version "1.1.1"
  resolved "https://registry.npmjs.org/array-flatten/-/array-flatten-1.1.1.tgz#9a5f699051b1e7073328f2a008968b64ea2955d2"
  integrity:
    sha1-ml9pkFGx5wczKPKgCJaLZOopVdI=

Results in:

$ yarn
yarn install v1.3.2
error An unexpected error occurred: "Invalid value type 9:0 in /home/aram/code/yarn-test/yarn.lock".

If I am not mistaken, this is because the current lockfile format does not support arrays.

So seeing as we cannot use arrays (unless I'm missing something with the above? please do correct me as I'm fairly new to the codebase and could be overlooking something!)
And we can't use objects (because the spec allows for multiple identical keys and npm plans on using that option)... that brings us back to strings?
I know that is not an ideal solution as you mentioned, but given the circumstances - if my reasoning is correct - I think it's the best option. What say you?

Edit:
I already got the lockfile to parse/stringify arrays before thinking about this. It is not extremely difficult, but I feel it would be easier to save this breaking change for another occasion and handle all its ramifications separately.

@zkat


Contributor

zkat commented Nov 30, 2017

Like I said above: you can totally get away with strings until such a time when you're ready for the breaking change. integrity will almost definitely only contain sha512 for at least a couple of years, is my bet. So if the issue is visual diffs, there will be no difference between the two approaches until a future where node integrates new algorithms and npm starts publishing using them (I think it'll be a while after node even adds sha3 support for npm to actually start using it -- sha512 is much more stable, implementation-wise, and there's no reason to believe it's going to be vulnerable any time soon)

@BYK


Member

BYK commented Dec 1, 2017

@imsnif that's a bummer, for some reason I thought arrays were possible in the lockfile. Alright, let's go with the string and see how bad it hurts. I don't like this, but the only alternative without a breaking change seems to be treating arrays as objects with numeric indexes, which would make both the lockfile (0: sha256-xxxx) and the code using it quite weird.

@vog


vog commented May 22, 2018

Has there been any progress on this?

Given the strong security promises on the website, it is very disappointing to see SHA-1 checksums in use.

@imsnif


Member

imsnif commented May 22, 2018

Hey @vog - you can follow our progress here: #5042
Hoping to have this merged really soon.

@Gudahtt


Contributor

Gudahtt commented Aug 19, 2018

Now that that PR has been merged, should this issue be closed?

The only missing piece now is that some dependencies (e.g. those installed from git repositories) still don't have an integrity field. npm includes an integrity field for these packages, so I'm assuming they must calculate it after downloading the package.

@zkat


Contributor

zkat commented Aug 20, 2018

npm has not been generating integrity hashes for git dependencies for a while, because git packages can often change hashes unless devs are careful to make builds reproducible.
