Block package names that conflict with core libraries #2151

Closed
GadgetSteve opened this Issue Jun 28, 2017 · 21 comments

GadgetSteve commented Jun 28, 2017

It has been pointed out online, on Hacker Noon, that PyPI currently allows people to register and upload packages with the same names as core Python libraries. This presents a potential attack vector, as pip install -U will "upgrade" the core library to the uploaded package, which may be given as a dependency of some other package.

Any attempt to do this, with the possible exception of the core Python developers, should definitely fail with an error message and possibly be flagged as suspicious activity.

I have tried to suggest blocking any upgrades to core packages at the pip level, in pypa/pip#4527, but there is a consensus that this is really a problem at the PyPI/Warehouse end.

Contributor

jonemo commented Sep 15, 2017

What would be the correct/best way to compile the list of standard library modules that should be blocked? I am aware of the standard library module index at https://docs.python.org/3/py-modindex.html. However, that only covers the CPython 3.6 standard library. Other Python implementations have additional modules (e.g. IronPython has clr). Occasionally, module names change between versions (e.g. xmlrpclib vs xmlrpc and copy_reg vs copyreg from 2.7 to 3.0).

In summary: The first step to dealing with this is to compile an authoritative list of package names.

It seems like the only place where the name of the uploaded package is checked is here. If that's true, the only blocked package names are requirements.txt and rrequirements.txt. Note that I'm very new to this codebase, so this is definitely worth double-checking.
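As a rough sketch of how such a list could be started (not an authoritative answer): take whatever the running interpreter reports as its standard library and merge in names from other versions and implementations by hand. This assumes Python 3.10 or later for `sys.stdlib_module_names`, which did not exist when this issue was opened. `EXTRA_NAMES`, `interpreter_stdlib_names` and `build_blocklist` are hypothetical names, and the extra entries are only the examples mentioned above.

```python
import sys

# Names from other Python versions/implementations, added by hand.
# These are only the examples from the comment above, not a complete list.
EXTRA_NAMES = {"xmlrpclib", "copy_reg", "clr"}


def interpreter_stdlib_names():
    """Return the stdlib module names known to the running interpreter."""
    names = getattr(sys, "stdlib_module_names", None)  # Python 3.10+
    if names is None:
        # Older interpreters: only compiled-in modules are listed here,
        # so this fallback is incomplete.
        names = sys.builtin_module_names
    return set(names)


def build_blocklist(extra=EXTRA_NAMES):
    """Merge interpreter-reported names with hand-collected extras."""
    # Private modules (leading underscore) are skipped in this sketch.
    public = {n for n in interpreter_stdlib_names() if not n.startswith("_")}
    return {n.lower() for n in public | set(extra)}
```

A list built this way only reflects one interpreter plus whatever is added manually, so it would still need review against older versions and other implementations.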

GadgetSteve commented Sep 15, 2017

Contributor

jonemo commented Sep 15, 2017

I am quite curious about this issue and would be willing to help move it forward, but after another half hour of background reading, I am not certain whether there is community/maintainer support for this proposal.

A few observations and thoughts (please correct me if I'm wrong):

  • In the discussion in pypa/pip#4527, the agreed-upon response is that pip should not be responsible for preventing users from installing (potentially) malicious code.
  • The conclusion there is that maybe PyPI can provide this functionality by blocking specific names.
  • It seems like nobody is suggesting going beyond filtering package names (e.g. by inspecting package content).
  • Examples show that package names that could have unintended or dangerous effects on one system are useful on other systems:
  • Given that the selection of "dangerous" package names is system dependent, it cannot be performed by the package index. I know, this is the opposite conclusion from the one reached in pypa/pip#4527.
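To illustrate the "system dependent" point, here is a minimal sketch of a purely local check. It asks the running interpreter whether a name already resolves to a module, so the answer differs between, say, CPython and IronPython for a name like clr. `is_importable_here` is a hypothetical helper, not something pip or Warehouse provides, and it will also report True for third-party packages that happen to be installed.

```python
import importlib.util


def is_importable_here(name):
    """Return True if `name` already resolves to a module on this system."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for dotted names with a missing parent, or
        # for modules in sys.modules that lack a usable __spec__.
        return False


# The result depends on the interpreter and environment running this.
for candidate in ("json", "clr"):
    print(candidate, is_importable_here(candidate))
```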

Contributor

jonemo commented Sep 16, 2017

Related PR: #2396

Member

dstufft commented Sep 17, 2017

One problem to sort out here: what should happen when a new standard library module is added whose name already collides with an existing project on PyPI? And what if someone wants to backport a new module to older versions of Python?

Contributor

jonemo commented Sep 17, 2017

List of Python 3.6 standard library packages as text file: https://gist.github.com/jonemo/57c0eeff88ac5495592d4a4f9d60a96b
Script I used to check for existence and author/maintainer of each on PyPI: https://gist.github.com/jonemo/a1c0f4768f2c0aa25e31388c0fd6e377
Output of said script shortly before the timestamp of this comment: https://docs.google.com/spreadsheets/d/15WoAkoaUW1BRSVt9yAOcObHgkWhfQOqUY0_xNbkTwL8/edit?usp=sharing

Stats:

  • standard lib module names that are also PyPI package names: 71
  • of those 71, registered by @stestagg (originally misattributed here to @GadgetSteve) as part of his disclosure: 13
  • standard lib module names that are not PyPI package names: 139

My (relatively uninformed newbie/bystander) suggestion is to:

Possible next steps after this:

  • Review the 58 remaining PyPI-registered packages that clash with standard library names for:
    • malicious content
    • abandoned, unused and otherwise delete-worthy content
  • Collect a list of standard library module names from previous Python versions and add to the list of banned names (e.g. xmlrpclib)
  • Collect a list of standard library module names from other Python implementations to add to the list of banned names (e.g. clr from IronPython)
  • Also block obvious cases of "typo-squatting" (either manually or automatically via string-similarity metric) to avoid the problem described here
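For readers who want to reproduce this kind of survey, the sketch below shows one possible approach. It is not the script from the gist linked above. It assumes PyPI's per-project JSON endpoint (https://pypi.org/pypi/NAME/json) answers 404 for unregistered names, and `exists_on_pypi` and `split_by_registration` are hypothetical helper names. The lookup is passed in as a parameter so the partitioning logic can be tried without network access.

```python
import urllib.error
import urllib.request


def exists_on_pypi(name):
    """Return True if PyPI serves JSON metadata for `name` (needs network)."""
    url = "https://pypi.org/pypi/{}/json".format(name)
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # anything else is a real failure, not "unregistered"


def split_by_registration(names, exists=exists_on_pypi):
    """Partition names into (registered, unregistered) on the index."""
    registered, unregistered = [], []
    for name in sorted(names):
        (registered if exists(name) else unregistered).append(name)
    return registered, unregistered
```

Calling `split_by_registration(stdlib_names)` with the default lookup makes one HTTP request per name, so a real run over a couple of hundred names should be throttled.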

GadgetSteve commented Sep 17, 2017

@jonemo Nice report, but please note that I don't have a single package registered in my name on PyPI; the above makes it sound like I have 13. Those 13 names were registered by @stestagg (another Steve), who specifically stated in pypa/pypi-legacy#585 that "As the owner of these packages, I don't mind them being taken off me, or access to them disabled as part of any fix."
I did raise an enhancement proposal to build filtering into pip (pypa/pip#4527), but that was felt not to be worth pursuing at the pip end, as it would not treat the root cause and would not cover any other package installer; hence this ticket.

Member

ewdurbin commented Sep 17, 2017

https://pypi.org/project/stdlib-list is maintained and appears to be kept up to date. It looks like it could be helpful, thanks to @jackmaney.

Member

ewdurbin commented Sep 18, 2017

With #2409 shipped, here's what I see as the remaining items to wrap this up:

  • Audit currently registered packages which conflict. (thanks for analysis @jonemo)
  • Remove project names currently prohibited by the blacklist from said list
  • Determine what stdlib modules exist in other Python Interpreters, PR to stdlib_list
  • Improve messaging/documentation (https://pypi.org/help)

Anything else?

I think that

Also block obvious cases of "typo-squatting" (either manually or automatically via string-similarity metric) to avoid the problem described here

is a separate issue, as that will be a more difficult problem to get right.

Member

ewdurbin commented Sep 18, 2017

#2410 addresses messaging/documentation

jackmaney commented Sep 18, 2017

Thank you for using my library (stdlib-list)! I update it after every minor version release (i.e. the next update will be for 3.7). Please let me know if you find something that's missing in any of the lists.

hangtwenty commented Sep 20, 2017

Regarding this point:

[Blocking obvious cases of typo-squatting] is another issue, as that will be a more difficult problem to get right.

I understand this hesitation, but perfect is the enemy of good, no? It seems like it could be gotten right enough for the top N most popular downloads. If there is a possibility of going down this path, I would be glad to enlist to help.

Contributor

jonemo commented Sep 20, 2017

Now that new uploads of stdlib-shadowing names are no longer possible, can someone with the power to do so please remove the dummy packages that have been placed there by @stestagg? See @GadgetSteve's comment for context and pypa/pypi-legacy#585 for a list of these dummy packages.

@GadgetSteve: Apologies for confusing you with @stestagg. Who could have known that one Steve would report an issue previously blogged about by another Steve? 😬

GadgetSteve commented Sep 21, 2017

@jonemo No problem about the confusion; it is not exactly new to me. At work we have, in a different division, someone with the same first name and surname as me, and someone in the same office with a similar-sounding surname.
@hangtwenty Just to point out that there are 2 types of typo-squatting. One is things like duplicated and transposed letters (e.g. urlllib or urlilb). The other, increasingly popular, is Unicode mimicry: a package called аррӏе (actually u"\u0430\u0440\u0440\u04cf\u0435") could spoof one called apple. One approach to the latter would be to require all packages to be named with 7-bit ASCII or similar, but that has obvious limitations and may not be desirable.
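The mimicry case can be made visible with a few lines of Python. This is only an illustration of why the two names differ despite looking alike; `is_ascii` is a hypothetical helper, not an existing PyPI check.

```python
import unicodedata

spoof = "\u0430\u0440\u0440\u04cf\u0435"  # the look-alike from the comment above
real = "apple"


def is_ascii(name):
    """Return True if every character of `name` is 7-bit ASCII."""
    return all(ord(ch) < 128 for ch in name)


print(spoof == real)    # False: the code points differ
print(is_ascii(spoof))  # False
print(is_ascii(real))   # True

# List each code point with its Unicode name to show what the string contains.
for ch in spoof:
    print("U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "<unnamed>")))
```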

Member

ncoghlan commented Sep 21, 2017

@GadgetSteve We do indeed restrict PyPI name registrations to 7-bit ASCII: https://www.python.org/dev/peps/pep-0508/#names

While we don't spell out the reasoning there, the vast array of Unicode confusables is indeed the reason we have that restriction - with ASCII, it's mainly only l1 and O0 that you need to worry about.

As far as the actual typosquatting problem goes, my proposal in #2268 is to distribute the review workload by notifying the maintainers of the projects with similar names, rather than always notifying the PyPI admins (since admin time and attention is a very limited resource). The PyPI admins would then only get direct notifications when registered project names are close to ones on the already prohibited list.
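A minimal sketch of the two ideas above, under stated assumptions: the pattern is the project-name regex given in PEP 508, with `re.ASCII` added here so that case-insensitive matching cannot admit non-ASCII characters. The `FOLD` table and `skeleton` function are hypothetical and only cover the l/1 and O/0 pairs mentioned above. Nothing here reflects how Warehouse actually implements the restriction, and the notification scheme from #2268 is not modelled.

```python
import re

# Project-name pattern as given in PEP 508, matched case-insensitively.
NAME_RE = re.compile(
    r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$",
    re.IGNORECASE | re.ASCII,
)

# Hypothetical folding table for the ASCII look-alikes mentioned above.
FOLD = str.maketrans({"l": "1", "o": "0"})


def is_valid_name(name):
    """Return True if `name` matches the PEP 508 name pattern."""
    return NAME_RE.match(name) is not None


def skeleton(name):
    """Lowercase `name` and fold l->1 and o->0 so look-alikes compare equal."""
    return name.lower().translate(FOLD)


print(is_valid_name("urllib"))                   # True
print(skeleton("urllib") == skeleton("ur1lib"))  # True
print(skeleton("urllib") == skeleton("urlib"))   # False
```

Comparing skeletons only catches character substitutions; inserted, dropped or swapped letters need a distance measure instead.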

GadgetSteve commented Sep 21, 2017

hangtwenty commented Sep 21, 2017

This might be obvious to people but for calculating the similarity we could use Levenshtein distance.

Relevant blog post by the way:
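For reference, a small self-contained sketch of the Levenshtein idea, using the misspellings mentioned earlier in this thread. `near_misses` and its `max_distance` threshold are hypothetical; a real check would need a tuned threshold and a list of popular project names.

```python
def levenshtein(a, b):
    """Edit distance: insertions, deletions and substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def near_misses(candidate, known_names, max_distance=2):
    """Return known names within max_distance edits of candidate."""
    return sorted(
        n for n in known_names
        if 0 < levenshtein(candidate, n) <= max_distance
    )


print(levenshtein("urllib", "urlllib"))  # 1 (one duplicated letter)
print(levenshtein("urllib", "urlilb"))   # 2 (a transposition counts as two edits)
print(near_misses("urlilb", {"urllib", "json", "zlib"}))  # ['urllib']
```

Plain Levenshtein distance charges two edits for a swapped pair of letters, so a threshold of 1 would miss transpositions such as urlilb.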

stestagg commented Sep 23, 2017

Please only remove my packages if the name blocking is applied to pypi as well as warehouse!

Member

ewdurbin commented Sep 23, 2017

@stestagg Blocking of names only occurs on upload of a new package name, and all such uploads must now go through Warehouse, so we’re good here!
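The behaviour described here (names are vetted only when a project is first created) could look roughly like the sketch below. It is not Warehouse's actual code: `check_upload`, `NameProhibited` and the error text are hypothetical, and only the normalisation rule is taken from PEP 503.

```python
import re


class NameProhibited(Exception):
    """Raised when a new project name collides with a prohibited name."""


def normalize(name):
    """Normalise a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def check_upload(name, existing_projects, prohibited_names):
    """Allow uploads to existing projects; vet only brand-new names."""
    key = normalize(name)
    if key in {normalize(n) for n in existing_projects}:
        return  # the project already exists, so no name check is made
    if key in {normalize(n) for n in prohibited_names}:
        raise NameProhibited(
            "The name {!r} conflicts with a standard library module.".format(name)
        )
```

Because existing projects skip the check, already-registered names that shadow the standard library still have to be audited separately.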

stestagg commented Sep 23, 2017

ok, cool, I wasn't aware that had happened :)

GadgetSteve commented Sep 24, 2017

Very happy with the outcome.

Apologies to @stestagg for not CCing him on the original submission of this ticket.
