rth/pypi-stats-viz

pypi-stats-viz

Tools for PyPI analysis

PyPI metadata

PyPI metadata (including project descriptions, release information, etc.) for all ~130,000 Python packages uploaded to PyPI is provided for research purposes as part of this project. Files are provided in gzip-compressed JSON format.
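As a rough illustration of working with such files, the sketch below reads one gzip-compressed JSON file using only the Python standard library. The function name `load_package_metadata` is hypothetical, and the naming and layout of the dataset files are not specified here, so any path passed in is an assumption.

```python
import gzip
import json


def load_package_metadata(path):
    """Read one gzip-compressed JSON file and return the parsed object.

    Hypothetical helper: the actual file names used by the dataset
    are not documented in this README.
    """
    # "rt" opens the gzip stream in text mode so json.load can consume it.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)
```

A call such as `load_package_metadata("some_package.json.gz")` (a placeholder path) would return the decoded metadata as ordinary Python dictionaries and lists.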

These datasets are obtained by running the pypi_metadata_download.py script (see below), followed by the pypi_metadata_anonymize.py script, which removes author email information for privacy reasons.
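The anonymization step can be sketched as follows. This is not the code from pypi_metadata_anonymize.py; it is a minimal illustration that assumes the metadata has already been parsed into Python objects and that the email appears under keys named `author_email`, wherever they occur in the structure. The function name `strip_author_email` is hypothetical.

```python
def strip_author_email(obj):
    """Return a copy of a JSON-like object without any "author_email" keys.

    Assumption: the field to remove is always named "author_email".
    The function walks nested dicts and lists, so it does not depend
    on where exactly the field sits in the metadata.
    """
    if isinstance(obj, dict):
        return {
            key: strip_author_email(value)
            for key, value in obj.items()
            if key != "author_email"
        }
    if isinstance(obj, list):
        return [strip_author_email(item) for item in obj]
    # Strings, numbers, booleans and None are returned unchanged.
    return obj
```

Because the function builds a new object instead of modifying its input, the original record stays intact until the cleaned copy has been written out.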

Scripts

  1. pypi_metadata_download.py - download PyPI metadata for all packages

    The included script uses the PyPI JSON API to asynchronously fetch metadata for all Python packages (~130,000) uploaded to PyPI. It uses aiohttp (requires Python 3.5+) and takes ~30 min with 50 parallel download channels, assuming a good internet connection (cf. warehouse/issues/2912(Comment) for more details).

  2. pypi_metadata_json_to_parquet.py - convert the downloaded PyPI metadata from one gzipped JSON file per package to the Apache Parquet format for more efficient analytics.

  3. pypi_metadata_anonymize.py - remove the author_email fields from the individual JSON files.
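The bounded-concurrency download described for the first script can be sketched roughly as below. This is not the code from pypi_metadata_download.py: the helper names (`fetch_metadata`, `fetch_all`, `download`) are hypothetical, and limiting parallelism with an `asyncio.Semaphore` is an assumed way of obtaining "50 parallel download channels". The URL pattern is the public PyPI JSON API endpoint. No retries or error handling are included.

```python
import asyncio

# Public PyPI JSON API endpoint for a single project.
PYPI_JSON_URL = "https://pypi.org/pypi/{name}/json"


async def fetch_metadata(session, name):
    """Fetch the JSON metadata of one package with an aiohttp session."""
    async with session.get(PYPI_JSON_URL.format(name=name)) as resp:
        resp.raise_for_status()
        return await resp.json()


async def fetch_all(names, fetch, max_parallel=50):
    """Run fetch(name) for every name, at most max_parallel at a time."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def bounded(name):
        # The semaphore caps how many fetches are in flight at once.
        async with semaphore:
            return name, await fetch(name)

    pairs = await asyncio.gather(*(bounded(name) for name in names))
    return dict(pairs)


async def download(names, max_parallel=50):
    """Download metadata for the given package names (needs aiohttp)."""
    import aiohttp  # imported here so fetch_all stays usable without it

    async with aiohttp.ClientSession() as session:
        return await fetch_all(
            names, lambda name: fetch_metadata(session, name), max_parallel
        )
```

On a recent Python version, `asyncio.run(download(["requests", "numpy"]))` would return a dictionary mapping each package name to its metadata. Passing the fetch coroutine as an argument keeps the concurrency logic separate from the HTTP client, so it can be exercised without network access.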

Notebooks

  1. pypi_spam_url_blacklist_matching.ipynb: following the PyPI spam incident in February 2018, this notebook is a somewhat unsuccessful attempt to detect spam in the uploaded Python packages by matching links contained in the package descriptions against blacklisted domain names.
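The general idea of matching description links against a domain blacklist can be illustrated with the sketch below. It is not taken from the notebook: the URL regular expression, the function name `find_blacklisted_links`, and the rule that subdomains of a blacklisted domain also count as matches are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Simplified URL pattern (assumption): http(s) links up to the next
# whitespace, angle bracket, parenthesis or square bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def find_blacklisted_links(description, blacklist):
    """Return the links in description whose host is on the blacklist.

    A host matches if it equals a blacklisted domain or is a subdomain
    of one (e.g. "spam.example.com" matches "example.com").
    """
    hits = []
    for url in URL_PATTERN.findall(description):
        host = (urlparse(url).hostname or "").lower().rstrip(".")
        if any(host == domain or host.endswith("." + domain) for domain in blacklist):
            hits.append(url)
    return hits
```

Comparing whole host names, and not searching for the domain as a substring of the URL, avoids flagging unrelated hosts such as `notexample.com` when `example.com` is blacklisted.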

Contributing

Please open an issue or a Pull Request for any comments about this repository.
