Detect malicious packages, for later removal #5117

Closed
E3V3A opened this issue Nov 26, 2018 · 16 comments
Labels: feature request; needs discussion (a product management/policy issue maintainers and users should discuss)

Comments

@E3V3A

E3V3A commented Nov 26, 2018

Looking at the simple package index, there are a number of highly questionable packages (at least judging by their names).

Packages without proper names, authors or descriptions should probably be removed. If not to reduce bloat, then for security reasons.

Stuff like this:

@E3V3A changed the title from "Remove scary number of suspicious packages" to "Remove bad or suspicious packages" on Nov 26, 2018
@di
Member

di commented Nov 26, 2018

There are almost 200K projects on PyPI. We don't have the ability to manually audit each one. How do you propose this should be done?

@E3V3A
Author

E3V3A commented Nov 26, 2018

> There are almost 200K projects on PyPI

Exactly! -- And probably 99.9% useless, outdated, fake, deprecated (at best), or possibly containing malware, at worst!

> How do you propose this should be done?

:) We are programmers so I'm sure we can figure that out!

How about searching for packages that:

  1. Have a weird name (random or repeated ASCII)
  2. Have no author
  3. Have an author who has not provided an email
  4. Have:
    • No valid homepage URL
    • No description
    • No releases in the last 3 years
    • No downloads/installs in the last 2 years
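A minimal sketch of how the criteria above could be turned into a filter. The `PackageInfo` class, its field names, and the "weird name" rule are hypothetical and chosen only for illustration; they are not PyPI's or Warehouse's actual data model. Download statistics are left out because they would have to come from a separate source.

```python
import re
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class PackageInfo:
    """Hypothetical, simplified view of a project's metadata."""
    name: str
    author: Optional[str] = None
    author_email: Optional[str] = None
    home_page: Optional[str] = None
    description: Optional[str] = None
    last_release: Optional[date] = None


def looks_random(name: str) -> bool:
    """Crude 'weird name' check: no letters, one repeated letter, or no vowels."""
    letters = re.sub(r"[^a-z]", "", name.lower())
    if not letters:
        return True
    if len(letters) >= 3 and len(set(letters)) == 1:
        return True
    return len(letters) >= 5 and not re.search(r"[aeiouy]", letters)


def suspicion_flags(pkg: PackageInfo, today: date) -> List[str]:
    """Return the list of criteria from the comment above that the package trips."""
    flags = []
    if looks_random(pkg.name):
        flags.append("weird-name")
    if not pkg.author:
        flags.append("no-author")
    if not pkg.author_email:
        flags.append("no-email")
    if not (pkg.home_page and re.match(r"https?://\S+\.\S+", pkg.home_page)):
        flags.append("no-homepage")
    if not (pkg.description and pkg.description.strip()):
        flags.append("no-description")
    # "No releases in the last 3 years" (approximated as 3 * 365 days).
    if pkg.last_release is None or today - pkg.last_release > timedelta(days=3 * 365):
        flags.append("stale")
    return flags
```

A package that trips several flags would be a candidate for manual review, not automatic removal; each rule on its own will match plenty of legitimate projects.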

That's just a start... and would probably remove a siht load of crud.
It would definitely be interesting to make such a search to see just how many hits we'd get.

@E3V3A
Author

E3V3A commented Nov 26, 2018

Another related issue is that there seems to be some kind of cybersquatting of package names going on there as well: packages with little or meaningless content that occupy useful names.

How do you plan to deal with that?

@di
Member

di commented Nov 27, 2018

> Another related issue is that there seems to be some kind of cybersquatting of package names going on there as well: packages with little or meaningless content that occupy useful names.
>
> How do you plan to deal with that?

See PEP 541 and #1506.

@brainwane changed the title from "Remove bad or suspicious packages" to "Find and remove malicious packages" on Jun 20, 2019
@brainwane added the "needs discussion" label (a product management/policy issue maintainers and users should discuss) on Jun 21, 2019
@brainwane
Contributor

Thanks for filing this issue, @E3V3A!

Per discussion today, we'll be addressing this problem during upcoming work on automated detection of malicious uploads. In this issue we'll be nailing down our criteria for "how do we determine what is a bad package?" and plans for removing those packages.

(Note that we're distinguishing between a malicious upload and spam, and between malware and typosquatting, and that there are other issues -- like #194, #4319 and #4004 -- that concentrate on filtering re: packages that have noncompliant metadata or no recent releases.)

@brainwane changed the title from "Find and remove malicious packages" to "Detect malicious packages, for later removal" on Sep 2, 2019
@brainwane
Contributor

Per a discussion with @ewdurbin last week:

The work we'll do on automated detection of malicious uploads will first concentrate on finding malicious packages, and building the tools around that. Only after that will we be able to provide automated tools to help PyPI admins remove them.

@di
Member

di commented Dec 5, 2019

From #7061:

What's the problem this feature will solve?
Malicious and insecure packages are a challenge in the open source community. Malicious packages have been removed several times in the last few years. Improved automated auditing techniques would make it easier for security specialists to quickly remove malicious packages. Smart bad actors would be able to use the same test suite, certainly, but it would at minimum allow for the vetting of existing packages. Likewise, this would set up an automated process which could be enhanced over time.

Describe the solution you'd like
Python's exec() function is not secure, and its use may be a good heuristic for finding malicious packages. There may be additional heuristics that make a package appear more suspicious, and so a likely target for manual auditing. Add a badge or other indicator for packages that pass/fail these tests.
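As a rough illustration of the exec() heuristic, the sketch below walks a file's syntax tree with the standard-library `ast` module and reports direct calls to flagged builtins. The function name `find_suspicious_calls` is made up for this example, and including `eval` alongside `exec` is an assumption that goes beyond what the proposal names. It is not the check that was later implemented.

```python
import ast
from typing import List, Tuple

# Only exec() is named in the proposal; eval() is added here as an assumption.
SUSPICIOUS_CALLS = {"exec", "eval"}


def find_suspicious_calls(source: str) -> List[Tuple[int, str]]:
    """Return (line number, builtin name) for each direct call to a flagged builtin."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Source that does not parse (e.g. Python 2 only code) is skipped here.
        return []
    hits = []
    for node in ast.walk(tree):
        # Match plain calls such as exec(...); attribute calls like obj.exec(...) are ignored.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in SUSPICIOUS_CALLS:
                hits.append((node.lineno, node.func.id))
    return sorted(hits)
```

A hit only marks a file for closer inspection. Plenty of legitimate packages call exec(), and a determined attacker can reach it indirectly (for example through `getattr`), which a name-based scan like this will miss.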

@mertzjames

I'm very interested in this effort and would like to help. Given how many packages there are, here are a few suggestions:

  • Most legit packages will have a few things in common:
    • a readme/description
    • link to source code
    • 2 or more contributors
    • other common fields filled out such as classifiers
  • A list of the 1000 (or more) most-downloaded packages could be collected and compared against other/new packages:
    • Compare the name of the package to see if it's very similar to a popular one (typosquatting)
    • Compare the name of the package to see if it's very different from typical names (for example, random-looking). This is a common issue with malicious websites and there's a tool to calculate this (https://github.com/MarkBaggett/freq)
    • Code analysis? This would be a very difficult thing to do I think
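The name-similarity idea could be sketched as follows, using the standard-library `difflib.SequenceMatcher` as the similarity measure. The function name `closest_popular_name`, the 0.85 threshold, and the short `popular` list are assumptions for illustration; a real check would use the actual list of most-downloaded projects and a tuned threshold.

```python
import difflib
import re
from typing import Iterable, Optional


def normalize(name: str) -> str:
    """Lowercase and collapse runs of '-', '_' and '.', as in PEP 503 name normalization."""
    return re.sub(r"[-_.]+", "-", name).lower()


def closest_popular_name(
    candidate: str, popular: Iterable[str], threshold: float = 0.85
) -> Optional[str]:
    """Return the popular name that `candidate` closely resembles, or None."""
    cand = normalize(candidate)
    best_name, best_ratio = None, 0.0
    for name in popular:
        norm = normalize(name)
        if norm == cand:
            return None  # identical after normalization: the same name, not a look-alike
        ratio = difflib.SequenceMatcher(None, cand, norm).ratio()
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name if best_ratio >= threshold else None


# Usage with a tiny, illustrative list of popular names.
popular = ["requests", "numpy", "django"]
for new_name in ["reqeusts", "numpy", "flask"]:
    print(new_name, "->", closest_popular_name(new_name, popular))
```

An edit-distance measure would serve the same purpose; `SequenceMatcher` is used here only because it ships with Python. Short names need care, since one changed character in a four-letter name already drops the ratio well below the threshold.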

@xmunoz
Contributor

xmunoz commented Dec 13, 2019

Hello friends! I will be working on the backend implementation of the system for adding malware checks. You can track the progress of this work by checking out the malware-detection label.

@nicowaisman

Hey everyone.
We are currently working on a proof of concept at GitHub to detect malicious code in package managers.
We are setting up an environment to run our tests, but our first step is to use a static analysis tool, CodeQL, to model the way certain backdoors work so that we can detect them as they get uploaded to PyPI.

@brainwane
Contributor

@xmunoz I'm excited about this work! Will we be able to discuss it with you at PyCon and/or help improve it during the sprints?

@xmunoz
Contributor

xmunoz commented Feb 12, 2020

Yes, absolutely! I'm actually giving a charla about this system at PyCon, but for interested non-Spanish speakers, I can give the English version during the sprints. Also, I'd really love to get feedback on this contribution documentation, and this sounds like a great way to do that.

#7369

@nicowaisman

@xmunoz Are there any slides of that charla?
Do you maintain a database of previous backdoors/malware introduced to PyPI? I have slowly started building my own collection, and I would love to expand it.

@xmunoz
Contributor

xmunoz commented Feb 12, 2020

For the first question, I'll follow up over email :)

The second question could potentially be answered by @ewdurbin.

@xmunoz
Contributor

xmunoz commented Feb 18, 2020

The malware-detection branch has been merged into master with PR #7377.

@farhaan710

I need to develop a tool that detects malicious repositories. Can you help me with it, @xmunoz?


7 participants