Exploring complex data with Elasticsearch and Python

Supporting code from my Elasticsearch and Python talk at PyBay 2016.

Slides: https://speakerdeck.com/simon/exploring-complex-data-with-elasticsearch-and-python

Video should be available online some time after the conference.

index_docs.py

This script recursively walks a folder containing the official Django documentation and outputs it in a format that can be passed directly to Elasticsearch via the _bulk indexing endpoint. It should also work on other folders containing reStructuredText documentation.

To run this script, first download the latest version of Django and unzip it.

Then run the following:

python index_docs.py django/docs/ https://docs.djangoproject.com/en/1.10/

This will output newline-separated JSON.
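As a rough illustration of that output, the sketch below builds a _bulk body by hand. It is not the code from index_docs.py: the `bulk_lines` helper, the document ID, and the `title` and `url` fields are all hypothetical. It assumes the index and type come from the request URL, so each action line only carries an `_id`.

```python
import json


def bulk_lines(docs):
    """Yield _bulk lines: an action line, then the document source."""
    for doc_id, source in docs:
        # Action line: asks Elasticsearch to index the document that follows.
        yield json.dumps({"index": {"_id": doc_id}})
        # Source line: the document itself, serialized onto a single line.
        yield json.dumps(source)


# Hypothetical document; the ID and field names are illustrative only.
docs = [
    ("intro/overview", {
        "title": "Django at a glance",
        "url": "https://docs.djangoproject.com/en/1.10/intro/overview/",
    }),
]

# The _bulk endpoint expects the body to end with a newline.
body = "\n".join(bulk_lines(docs)) + "\n"
print(body, end="")
```

Each document takes two lines, which is why the output can be streamed straight into a single request.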

To index those documents in Elasticsearch, first run Elasticsearch at a known URL (port 9200 on your local machine is fine), then pipe the output of the above command to curl like so:

python index_docs.py django/docs/ \
    https://docs.djangoproject.com/en/1.10/ | \
    curl -s -XPOST localhost:9200/docsearch/doc/_bulk \
    --data-binary @-

fetch_pypi_metadata.py

My second demo used data pulled from the Python Package Index. PyPI offers a JSON API to retrieve metadata about individual packages. This script loops through the list of 80,000+ packages (retrieved using XML-RPC) and downloads the metadata for each one to a local metadata/ directory.

Since this hits PyPI 80,000+ times, you shouldn't run it! You'll have to edit the script to get it to work. Instead of using this script, I suggest downloading this .zip file containing the metadata/ directory I used during the talk:

http://s3.amazonaws.com/files.simonwillison.net/2016/pypi-metadata/metadata.zip
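To illustrate the JSON API mentioned above, here is a minimal sketch that fetches the metadata for a single package rather than all 80,000+. It is not the code from fetch_pypi_metadata.py: the function names and the metadata/<package>.json file layout are assumptions, and the URL uses the current pypi.org endpoint.

```python
import json
import os
from urllib.request import urlopen


def metadata_url(package):
    # PyPI's per-package JSON endpoint.
    return "https://pypi.org/pypi/{}/json".format(package)


def fetch_metadata(package, directory="metadata", opener=urlopen):
    """Download one package's metadata and save it as <directory>/<package>.json."""
    os.makedirs(directory, exist_ok=True)
    with opener(metadata_url(package)) as response:
        data = json.loads(response.read().decode("utf-8"))
    # Assumed layout: one .json file per package.
    path = os.path.join(directory, package + ".json")
    with open(path, "w") as fp:
        json.dump(data, fp)
    return path
```

Calling `fetch_metadata("requests")` would make one request and write metadata/requests.json. The `opener` argument exists only so the download can be swapped out when testing.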

index_pypi_metadata.py

This script reads every .json file in the metadata/ directory described above and indexes the corresponding packages and releases into Elasticsearch. It first initializes indexes for those data types with the relevant mapping (the Elasticsearch equivalent of a schema).
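The sketch below shows roughly what a mapping and the package/release split could look like. It is an illustration under assumptions, not the code from index_pypi_metadata.py: the chosen fields and the `split_metadata` helper are hypothetical, and the `keyword` and `text` field types are from Elasticsearch 5 and later (older versions use `string`). The `info` and `releases` keys do come from PyPI's JSON format.

```python
# Hypothetical field mappings; the real script's fields may differ.
PACKAGE_PROPERTIES = {
    "name": {"type": "keyword"},
    "summary": {"type": "text"},
    "author": {"type": "text"},
}
RELEASE_PROPERTIES = {
    "package": {"type": "keyword"},
    "version": {"type": "keyword"},
    "upload_time": {"type": "date"},
}


def split_metadata(data):
    """Turn one PyPI JSON document into a package doc and a list of release docs."""
    info = data["info"]
    # Keep only the fields that the package mapping declares.
    package = {key: info.get(key) for key in PACKAGE_PROPERTIES}
    releases = []
    for version, files in data.get("releases", {}).items():
        # A release can have several uploaded files; use the earliest upload time.
        times = [f["upload_time"] for f in files if f.get("upload_time")]
        releases.append({
            "package": info["name"],
            "version": version,
            "upload_time": min(times) if times else None,
        })
    return package, releases
```

Declaring `name` and `version` as exact-match fields and `summary` as full text is the kind of decision a mapping records up front, much as a schema would.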

Suggested usage:

wget http://s3.amazonaws.com/files.simonwillison.net/2016/pypi-metadata/metadata.zip
unzip metadata.zip
python index_pypi_metadata.py

This script needs some dependencies, which are listed in requirements.txt.