Tools for PyPI analysis
PyPI metadata (including project descriptions, release information, etc.) for all ~130000 Python packages uploaded to PyPI is provided for research purposes as part of this project. The metadata is stored as one gzip-compressed JSON file per package.
- February 11th, 2018: pypi-metadata-anonymized-2018-02-11.tar.xz (309 MB xz compressed, 119008 packages)
- February 19th, 2018: pypi-metadata-anonymized-2018-02-11.tar.xz (318 MB xz compressed, 128712 packages)
These datasets are obtained by running the pypi_metadata_download.py download script (see below), followed by the pypi_metadata_anonymize.py script, which removes author email information for privacy reasons.
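As a rough illustration of the anonymization step, the sketch below drops the author_email field from one package's metadata. It assumes each file holds the PyPI JSON API response with a top-level "info" object; the function names strip_author_email and anonymize_file are hypothetical and are not taken from pypi_metadata_anonymize.py.

```python
import copy
import gzip
import json


def strip_author_email(metadata):
    """Return a copy of one package's metadata without info["author_email"].

    Assumes the PyPI JSON API layout, where author details live under "info".
    """
    cleaned = copy.deepcopy(metadata)
    info = cleaned.get("info")
    if isinstance(info, dict):
        info.pop("author_email", None)
    return cleaned


def anonymize_file(src_path, dst_path):
    """Read one gzipped JSON file, strip the email field, write it back out."""
    with gzip.open(src_path, "rt", encoding="utf-8") as src:
        metadata = json.load(src)
    with gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        json.dump(strip_author_email(metadata), dst)
```

Working on a deep copy keeps the input dictionary untouched, which makes it easy to compare a record before and after anonymization.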
- pypi_metadata_download.py: download PyPI metadata for all packages. The included script uses the PyPI JSON API to asynchronously fetch metadata for all Python packages (~130000) uploaded to PyPI. It uses aiohttp (requires Python 3.5+) and takes ~30 min with 50 parallel download channels (cf. warehouse/issues/2912 (comment) for more details), assuming a good internet connection.
- pypi_metadata_json_to_parquet.py: convert the downloaded PyPI metadata from one gzipped JSON file per package to Apache Parquet format for more efficient analytics.
- pypi_metadata_anonymize.py: remove the author_email fields from the individual JSON files.
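The asynchronous download with a fixed number of parallel channels, as described for pypi_metadata_download.py, could be sketched roughly as follows. This is not the script's actual code: the names fetch_metadata, fetch_all and main are hypothetical, the URL pattern is assumed to be the public PyPI JSON endpoint, and a semaphore is just one common way to cap concurrency at 50.

```python
import asyncio

# Assumed form of the PyPI JSON API endpoint for a single package.
PYPI_JSON_URL = "https://pypi.org/pypi/{name}/json"


async def fetch_metadata(session, name):
    """Fetch one package's metadata using an aiohttp.ClientSession."""
    async with session.get(PYPI_JSON_URL.format(name=name)) as resp:
        resp.raise_for_status()
        return await resp.json()


async def fetch_all(names, fetch_one, max_parallel=50):
    """Call fetch_one(name) for every name, at most max_parallel at a time."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def bounded(name):
        # The semaphore acts as the pool of parallel download channels.
        async with semaphore:
            return name, await fetch_one(name)

    pairs = await asyncio.gather(*(bounded(name) for name in names))
    return dict(pairs)


async def main(names):
    # aiohttp is third-party; it is imported here so that the helpers above
    # can be exercised without it.
    import aiohttp

    async with aiohttp.ClientSession() as session:
        return await fetch_all(names, lambda n: fetch_metadata(session, n))
```

On Python 3.7+ this could be started with `asyncio.run(main(["requests", "numpy"]))`; on 3.5 or 3.6 the equivalent is `asyncio.get_event_loop().run_until_complete(...)`. A real run over all packages would also need error handling for missing or failing packages, which this sketch omits.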
pypi_spam_url_blacklist_matching.ipynb: following the PyPI spam incident in February 2018, this notebook is a somewhat unsuccessful attempt to detect spam in the uploaded Python packages by matching links contained in the description against blacklisted domain names.
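To make the matching idea concrete, here is a minimal sketch of checking links in a package description against a set of blacklisted domains. It is not the notebook's code: the regular expression, the function names, and the rule that a subdomain of a blacklisted domain also counts as a hit are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple URL pattern; real descriptions may need something stricter.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def extract_domains(description):
    """Return the set of host names found in http(s) links in the text."""
    domains = set()
    for url in URL_RE.findall(description or ""):
        try:
            host = urlsplit(url.rstrip(".,;:!?")).hostname
        except ValueError:
            continue  # skip malformed URLs
        if host:
            domains.add(host.rstrip("."))
    return domains


def blacklisted_domains(description, blacklist):
    """Return hosts in the description that match a blacklisted domain.

    A host matches if it, or any parent domain above the bare TLD,
    is in the blacklist (assumed matching rule).
    """
    hits = set()
    for host in extract_domains(description):
        parts = host.split(".")
        for i in range(len(parts) - 1):
            if ".".join(parts[i:]) in blacklist:
                hits.add(host)
                break
    return hits
```

For example, with a hypothetical blacklist `{"pills.example"}`, a description linking to `http://cheap.pills.example/offer` would be flagged, while a link to `https://pypi.org/project/demo` would not.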
Please open an issue or a Pull Request for any comments about this repository.