vmarkovtsev Add the missing context methods
Signed-off-by: Vadim Markovtsev <vadim@sourced.tech>
Latest commit 83876e2 Aug 7, 2019
Type Name Latest commit message Commit time
README.md Fix the package name in the README Aug 7, 2019
custom_newline.py Add the missing context methods Aug 7, 2019
parse.py Harden the message parsing script May 30, 2019

README.md

Commit Messages size 46GB

Download link.

1.3 billion commit messages extracted from GHTorrent dumps.

The dataset consists of 3 files:

  1. commits.bin - commit hashes, 24GB.
  2. repos.txt.xz - GitHub repository names, 5GB.
  3. messages.txt.xz - commit messages, 17GB.

The exact number of commits is 1,288,456,749. The dataset may contain duplicate commits; contributions of a deduplicated version are welcome.
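As a starting point for deduplication, here is a minimal sketch that finds the index of the first occurrence of each hash in a stream of 20-byte records. It assumes that "duplicate" means a repeated commit hash, which the dataset description does not spell out, and the function name `first_occurrence_indices` is hypothetical. The in-memory set makes it suitable only for samples: the full 1.3 billion hashes would need an external sort or a disk-backed index instead.

```python
import io


def first_occurrence_indices(stream, hash_size=20):
    """Yield the zero-based index of every hash not seen earlier in the stream.

    Assumption: a duplicate commit is one whose hash has already appeared.
    """
    seen = set()
    index = 0
    while True:
        raw = stream.read(hash_size)
        if len(raw) < hash_size:  # end of stream (or a truncated tail)
            break
        if raw not in seen:
            seen.add(raw)
            yield index
        index += 1


if __name__ == "__main__":
    # Tiny synthetic sample: hashes A, B, A, C.
    a, b, c = b"a" * 20, b"b" * 20, b"c" * 20
    sample = io.BytesIO(a + b + a + c)
    print(list(first_occurrence_indices(sample)))
```

Because all three files share the same order, the yielded indices can also be used to select the matching entries from the repository and message files.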

Format

  1. commits.bin - contiguous binary stream, 20 bytes per commit hash. Hashes are effectively random data, so compressing this file would not reduce its size.
  2. repos.txt.xz - strings separated by \0 (the NUL character), xz-compressed. The order matches commits.bin. There is a trailing '\0'.
  3. messages.txt.xz - strings separated by \0, xz-compressed. The order matches commits.bin. There is a trailing '\0'.
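Because every record in commits.bin has a fixed width, the hash at a given position can be read with a single seek. The following is a minimal sketch of that idea; the helper names `count_hashes` and `read_hash` are hypothetical and not part of this repository.

```python
import os

HASH_SIZE = 20  # bytes per commit hash, as described in the format above


def count_hashes(path):
    """Return the number of hashes in the file; the size must be a multiple of 20."""
    size = os.path.getsize(path)
    if size % HASH_SIZE:
        raise ValueError(f"{path}: size {size} is not a multiple of {HASH_SIZE}")
    return size // HASH_SIZE


def read_hash(path, index):
    """Return the hash at the given zero-based index as a 40-character hex string."""
    if index < 0:
        raise IndexError("index must be non-negative")
    with open(path, "rb") as f:
        f.seek(index * HASH_SIZE)
        raw = f.read(HASH_SIZE)
    if len(raw) != HASH_SIZE:
        raise IndexError(f"no hash at index {index}")
    return raw.hex()
```

The same arithmetic gives a quick sanity check on a download: 1288456749 hashes of 20 bytes each come to 25769134980 bytes, which is about 24 GiB and agrees with the size listed for commits.bin.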

Sample code

Python:

import lzma
from custom_newline import CustomNewlineReader

# The three files are aligned: the i-th hash, repository name, and message
# all describe the same commit.
with open("commits.bin", "rb") as commf, \
        CustomNewlineReader(lzma.open("repos.txt.xz"), b"\0") as reposf, \
        CustomNewlineReader(lzma.open("messages.txt.xz"), b"\0") as msgf:
    for msg, repo in zip(msgf, reposf):
        commit = commf.read(20).hex()  # 20 raw bytes -> 40 hex characters
        print(commit, repo.decode(), msg.decode())

custom_newline.py is included in this repository.
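If custom_newline.py is not at hand, the NUL-separated files can be read with the standard library alone. The sketch below is not the repository's CustomNewlineReader; `iter_delimited` is a hypothetical helper that shows one way to split a compressed stream on \0 without loading it into memory.

```python
import io
import lzma


def iter_delimited(stream, delimiter=b"\0", chunk_size=1 << 16):
    """Yield the byte strings of a binary stream that are separated by `delimiter`.

    Empty records are preserved so that positions stay aligned with commits.bin.
    A trailing delimiter does not produce an extra empty record.
    """
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last delimiter is complete; keep the rest.
        *records, buffer = buffer.split(delimiter)
        yield from records
    if buffer:  # final record without a trailing delimiter
        yield buffer


if __name__ == "__main__":
    # Synthetic stand-in for repos.txt.xz: two names, each followed by \0.
    compressed = io.BytesIO(lzma.compress(b"user/repo-a\0user/repo-b\0"))
    with lzma.open(compressed) as f:
        for name in iter_delimited(f):
            print(name.decode())
```

Reading in fixed-size chunks keeps memory use bounded by the chunk size plus the longest single record.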

Origin

The data comes from the GHTorrent MongoDB dumps made before 2019-03-18. The command to generate the dataset was:

(
  for dd in 2019-03-17 2019-03-16 ... 2015-12-01; do
    wget -O - http://ghtorrent-downloads.ewi.tudelft.nl/mongo-daily/mongo-dump-$dd.tar.gz |
    tar -xzO dump/github/commits.bson
  done
  for dd in 2015-12-01 2015-10-03 2015-08-03; do
    wget -O - http://ghtorrent-downloads.ewi.tudelft.nl/mongo-full/commits-dump.$dd.tar.gz |
    tar -xzO dump/github/commits.bson
  done
  wget -O - http://ghtorrent-downloads.ewi.tudelft.nl/mongo-full/commits-1-dump.2015-08-04.tar.gz |
  tar -xzO dump/github/commits.bson
) | python3 parse.py

The dates 2019-03-17 2019-03-16 ... 2015-12-01 stand for the dump dates listed at ghtorrent.org/downloads.html. parse.py is included in this repository.
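The pipeline above can concatenate many commits.bson files on one pipe because a BSON file is a sequence of documents, each starting with its own total length as a little-endian 32-bit integer. The sketch below illustrates only that framing; it is not the logic of parse.py, and `iter_bson_documents` is a hypothetical helper that yields raw, undecoded documents.

```python
import io
import struct


def iter_bson_documents(stream):
    """Yield each raw BSON document (length prefix included) from a binary stream."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated BSON length prefix")
        (length,) = struct.unpack("<i", header)
        if length < 5:  # 4 bytes of length plus the terminating zero byte
            raise ValueError(f"invalid BSON document length: {length}")
        body = stream.read(length - 4)
        if len(body) != length - 4:
            raise ValueError("truncated BSON document")
        yield header + body


if __name__ == "__main__":
    empty_doc = b"\x05\x00\x00\x00\x00"  # the BSON encoding of {}
    stream = io.BytesIO(empty_doc * 3)   # three documents back to back
    print(sum(1 for _ in iter_bson_documents(stream)))
```

Decoding the documents into commit fields would need a BSON library and is left out here.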

License

Open Data Commons Open Database License (ODbL)
