Detect packages being published with typo'ish names #4998

Open
aaronlelevier opened this issue Nov 2, 2018 · 9 comments
Labels
feature request malware-detection Issues related to automated malware detection.


@aaronlelevier

What's the problem this feature will solve?
Prevent malicious packages from being published with typo'ish names.

Describe the solution you'd like
I'd like to propose an algorithm that blocks malicious packages from being published when their names closely resemble those of well-known packages.

Recently there were articles about 12 malicious packages found on PyPI and removed. Several of them had names very close to Django, and as an avid Django user, this got my attention.

The algorithm could combine Levenshtein distance with other input features, such as the number of similar file names and the number of similar lines of code when compared against legitimate packages with a similar name. If there is a close resemblance, the package could either be held back from publication until a human reviews it, or be blocked permanently.
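To make the name-similarity part concrete, here is a minimal sketch of a Levenshtein check against a list of popular names. The `POPULAR` list, the `near_matches` helper, and the `max_distance` threshold are all assumptions for illustration; this is not how PyPI works today, and it ignores the other features (file names, code lines) mentioned above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # drop a character from a
                current[j - 1] + 1,      # add a character to a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


# Hypothetical list; a real check would need download statistics or similar.
POPULAR = ["django", "requests", "numpy"]


def near_matches(candidate, popular=POPULAR, max_distance=1):
    """Return popular names within max_distance edits of candidate."""
    name = candidate.lower()
    return [p for p in popular
            if p != name and levenshtein(name, p) <= max_distance]
```

With these assumptions, `near_matches("djago")` returns `["django"]`. A transposition such as `dajngo` is two edits away under plain Levenshtein distance, so it only shows up with `max_distance=2`, which in turn would flag many more legitimate names.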

The algorithm could also be a lot more sophisticated, for example something like Android's approach, which uses machine learning to detect malicious apps and measures more than 700 features, I believe.

I am just proposing something of this nature if it hasn't already been proposed.

Additional context
Here is the article link that I am referencing:

https://www.zdnet.com/article/twelve-malicious-python-libraries-found-and-removed-from-pypi/

Thanks,
Aaron

@di
Member

di commented Nov 3, 2018

This is more or less the same as #2268, so I'm going to close this as a duplicate, but thanks for the feature request!

@di di closed this as completed Nov 3, 2018
@aaronlelevier
Author

@di thanks for taking a look at my issue. I'll check the existing issue then. Thanks.

@brainwane
Contributor

I'm actually going to reopen this, because I think it would be useful to have this issue (about typosquatting prevention/detection before/during upload) distinct from #2268 (which is about notifications, alerts, a "packages with similar names" widget, etc.). Thanks @aaronlelevier!

@brainwane
Contributor

Today I discussed this idea -- checking for typosquatting, pre-upload -- with @dstufft and @ewdurbin. It would be pretty hard to do this without LOTS of false positives. Donald mentioned a person at Netflix whose approach was: remove the dashes from popular project names, register the resulting strings.

We could increase the scope of our current normalization rules to cover more scenarios, but some existing project names would then collide with each other, including the names registered by that preemptive registration project.
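As a rough sketch of what a broader normalization would mean, the snippet below compares PEP 503 name normalization with a stricter key that drops separators entirely, and lists existing names that would collide under the stricter key. The `squashed` and `find_collisions` helpers are hypothetical and are not part of Warehouse.

```python
import re
from collections import defaultdict


def pep503_normalize(name: str) -> str:
    """PEP 503 normalization: runs of -, _ and . become a single dash."""
    return re.sub(r"[-_.]+", "-", name).lower()


def squashed(name: str) -> str:
    """Hypothetical stricter key: remove the separators altogether."""
    return re.sub(r"[-_.]+", "", name).lower()


def find_collisions(names):
    """Group distinct PEP 503 names that share one squashed key."""
    groups = defaultdict(set)
    for name in names:
        groups[squashed(name)].add(pep503_normalize(name))
    return {key: sorted(group)
            for key, group in groups.items() if len(group) > 1}
```

Running `find_collisions` over the existing project list before changing any rule would give a sense of how many collisions admins would have to resolve by hand.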

In any case, this kind of checking ought to be built as part of a pipeline where automated systems run checks, and then flag packages/projects for deletion/review/ok by PyPI admins.

@brainwane brainwane changed the title Prevent malicious packages being published with typo'ish names Detect packages being published with typo'ish names Sep 2, 2019
@brainwane
Contributor

Per conversation last week: We'll be addressing this problem during upcoming work on automated detection of malicious uploads/typosquatting. First we'll need to develop good tools to detect and flag the pytosquatting/typosquatting, then we'll add tools in that pipeline for PyPI to automatically prevent/reject publication of packages that hit a certain "hey, that looks dodgy" score.
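One possible shape for such a pipeline, as a sketch only: each check returns a score between 0 and 1, and the highest score decides whether an upload is fine, flagged for admin review, or rejected. The `Verdict` enum, the two toy checks, the `POPULAR` set, and both thresholds are assumptions made for illustration; none of this is the Warehouse API.

```python
import re
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    REVIEW = "review"
    REJECT = "reject"


# Hypothetical set of popular, already-normalized project names.
POPULAR = {"django", "python-dateutil"}


def _normalize(name: str) -> str:
    return re.sub(r"[-_.]+", "-", name).lower()


def separator_variant(name: str) -> float:
    """Toy check: 1.0 if the name equals a popular one minus separators."""
    norm = _normalize(name)
    key = norm.replace("-", "")
    hit = any(norm != p and key == p.replace("-", "") for p in POPULAR)
    return 1.0 if hit else 0.0


def one_char_off(name: str) -> float:
    """Toy check: 0.6 if exactly one character differs from a popular name."""
    norm = _normalize(name)
    hit = any(
        len(norm) == len(p) and sum(a != b for a, b in zip(norm, p)) == 1
        for p in POPULAR
    )
    return 0.6 if hit else 0.0


def classify(name, checks, review_at=0.5, reject_at=0.9) -> Verdict:
    """Run every check and map the highest 'looks dodgy' score to a verdict."""
    score = max((check(name) for check in checks), default=0.0)
    if score >= reject_at:
        return Verdict.REJECT
    if score >= review_at:
        return Verdict.REVIEW
    return Verdict.OK
```

Keeping the REVIEW band wide at first would let admins see how noisy the checks are before anything is rejected automatically.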

@xmunoz
Contributor

xmunoz commented Feb 18, 2020

PR #7377 has been merged. If someone wants to contribute such a malware check, the documentation for how is here: https://warehouse.pypa.io/development/malware-checks/

@pradyunsg
Contributor

From pypi/support#526 (comment):

One idea that has been floated before is automatically blocking (or at least requiring approval for) any new project name for which PyPI currently serves a non-trivial number of 404 responses. This would easily identify internal-only project names like this one and prevent them from being used maliciously.
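A minimal sketch of the 404 idea, assuming request logs can be reduced to `(project_name, status_code)` pairs. That log format, the helper names, and the threshold of 100 are assumptions for illustration, not a description of how PyPI collects or stores this data.

```python
from collections import Counter


def names_with_many_404s(log_entries, threshold=100):
    """Return names that were requested but missing at least threshold times."""
    misses = Counter(
        name.lower() for name, status in log_entries if status == 404
    )
    return {name for name, count in misses.items() if count >= threshold}


def needs_approval(new_name, flagged):
    """True if registering new_name should wait for admin approval."""
    return new_name.lower() in flagged
```

A name that many installers already ask for, but that does not exist on PyPI, is likely either an internal-only package or a common typo, so either way a registration attempt deserves a closer look.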

@Julian
Contributor

Julian commented May 13, 2021

Are we aware of this paper from March 2020 which investigated typosquatting on PyPI? Or have the authors reached out?

I've only skimmed the abstract, and haven't looked at the tool they say they developed there, but it seemed like an interesting read.

(I'm finding this issue after seeing #9527).

@miketheman miketheman added the malware-detection Issues related to automated malware detection. label Feb 9, 2024