vmarkovtsev/ggmbox

ggmbox

Google Groups raw email crawler and parser. Fast and reliable! The downloaded messages are in RFC 822 format, taken verbatim from the Google servers.

Installation

Docker

Docker is the simplest option. The image is available on DockerHub. Prepend docker run -it --rm vmarkovtsev/ggmbox to all the commands in the "Usage" section.

Crawler

Requirements: Python 3 and Scrapy. Download the ggmbox.py file.

Parser

Requirements: Go.

go get -v github.com/vmarkovtsev/ggmbox

Usage

Crawler

scrapy runspider -a name=golang-nuts -o result.json -t json ggmbox.py

Replace "golang-nuts" with the actual group name. By default, the raw emails are saved to the directory corresponding to the group.
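
Because the saved messages are RFC 822 text, they can be inspected with Python's standard email package. The following is a minimal sketch, not part of ggmbox: the helper names iter_messages and summarize are hypothetical, and the code assumes only that every regular file under the output directory holds one raw message (the exact directory layout is not described here).

```python
import email
import email.policy
from pathlib import Path


def iter_messages(root):
    """Yield (path, message) for every regular file under root.

    Assumption: each file contains exactly one raw RFC 822 message.
    """
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            # policy.default yields EmailMessage objects with decoded headers.
            yield path, email.message_from_binary_file(
                fh, policy=email.policy.default
            )


def summarize(root):
    """Return a list of (subject, sender) pairs for the messages under root."""
    return [(msg["Subject"], msg["From"]) for _, msg in iter_messages(root)]
```

For example, summarize("golang-nuts") would list the subject and sender of each downloaded message, assuming the crawler wrote its output to a directory of that name.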

scrapy runspider -a name=chromium-dev -a prefix=a/chromium.org -o result.json -t json ggmbox.py

Note the "prefix" argument: it sets the name of the parent that the group belongs to (here, a/chromium.org). Some groups require it.

Parser

./parse golang-nuts > dataset.csv

Replace "golang-nuts" with the name of the directory that contains the raw emails. The plain text threads will be written to dataset.csv, one thread per line. Special characters are escaped.
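
Since each thread occupies one line, a quick sanity check on the parser output is to count the non-empty lines. This is a sketch under that one-thread-per-line assumption only; the function name count_threads is hypothetical, and the column layout and escaping scheme of dataset.csv are not specified here, so the code does not try to decode them.

```python
def count_threads(path):
    """Count non-empty lines, i.e. threads under the one-thread-per-line layout."""
    # errors="replace" keeps the count going even if a line has invalid UTF-8.
    with open(path, encoding="utf-8", errors="replace") as fh:
        return sum(1 for line in fh if line.strip())
```

Comparing count_threads("dataset.csv") with the number of topics the crawler reported is one way to spot an incomplete parse.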

Performance

Crawler

The golang-nuts group was fully fetched on 24/02/2018: 30043 topics and 192654 messages in 3 hours over a 1 Gbps connection. The raw emails occupied 1.6 GB on disk.
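
The reported totals can be turned into average rates with a few lines of arithmetic. The sketch below is illustrative only: crawl_stats is a hypothetical helper, and it assumes that 1.6 GB means 1.6e9 bytes and that the crawl ran for exactly 3 hours.

```python
def crawl_stats(topics, messages, hours, size_bytes):
    """Derive average rates from the totals reported for a crawl."""
    seconds = hours * 3600
    return {
        "messages_per_second": messages / seconds,
        "messages_per_topic": messages / topics,
        "bytes_per_message": size_bytes / messages,
        "megabits_per_second": size_bytes * 8 / seconds / 1e6,
    }


# Figures from the golang-nuts run; 1.6 GB is taken as 1.6e9 bytes (an assumption).
stats = crawl_stats(topics=30043, messages=192654, hours=3, size_bytes=1.6e9)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the crawl averaged roughly 18 messages per second and about 8 KB per message. The average payload rate comes out near 1.2 Mbit/s, which suggests that the 1 Gbps link itself was probably not the limiting factor.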

For comparison, icy/google-group-crawler ran for 1 day, fetched only 63%, and then stopped without reporting any errors; henryk/gggd fetched only 3% within one hour and then stopped unexpectedly too.

Parser

It takes 7 seconds to parse 1.6 GB of raw emails on a 32-core machine.

Contributions

...are welcome! See CONTRIBUTING.md and CODE_OF_CONDUCT.md.

License

MIT.
