Extract, transform, and analyze bibliographic data from Wikidata dumps


WikiCite data

This repository contains scripts to extract, transform, and analyze bibliographic data from Wikidata.

The current state of this project is experimental.

Overview

Bibliographic data can be extracted from Wikidata dumps, which are provided weekly at https://dumps.wikimedia.org/wikidatawiki/entities/ as documented at https://www.wikidata.org/wiki/Wikidata:Database_download. Old JSON dumps are archived at the Internet Archive starting from October 2014. The Wikidata JSON dump format was only introduced in July 2014, so older data (going back to February 2013) would require additional preprocessing.
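
A full JSON dump is one large array with one entity per line, so individual records can be inspected without loading the whole file. As a rough sketch, assuming the standard dump file name (each line inside the array is one entity object terminated by a comma, which has to be stripped before piping to jq):

# Peek at the first entity of a compressed dump (sketch)
zcat wikidata-20170626-all.json.gz | head -2 | tail -1 | sed 's/,$//' | jq '{id, type, label: .labels.en.value}'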

Processing Wikidata dumps requires storage, processing time, and knowledge. With the scripts in this repository, Wikidata dumps can be pre-processed and provided in a simplified form that is better suited to working with bibliographic data from Wikidata. The repository further contains checksums, lists of publication types, and statistics derived from Wikidata dumps. Full dumps are not included but must be shared by other means.

Data processing flow

The following diagram illustrates the processing of Wikidata dumps into bibliographic records and summaries. Dotted parts are not included in the git repository. Grey parts have not been implemented yet.

[Figure: data processing flow (dataflow.png)]

Requirements

The current scripts require the following technologies:

  • standard Unix command line tools (bash, make, wget, gzip, zcat)
  • Node.js >= 6.4.0, npm, and the packages listed in package.json
  • jq

Usage

Download dumps

The download-dump script can be used to download a full, compressed JSON dump from https://dumps.wikimedia.org/wikidatawiki/entities/ and place it in a subdirectory named by the date of the dump:

./download-dump 20170626

Old dumps must be downloaded manually from the Internet Archive.
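
For reference, downloading a dump by hand amounts to roughly the following, assuming the standard dump URL pattern (the download-dump script may differ in detail):

# Manual equivalent of ./download-dump 20170626 (sketch)
mkdir -p 20170626
wget -P 20170626 https://dumps.wikimedia.org/wikidatawiki/entities/20170626/wikidata-20170626-all.json.gz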

An MD5 hash of the extracted dump can be computed like this:

make 20170626/wikidata-20170626-all.md5

The MD5 hash is committed in git for reference.

The number of entities in a dump is counted as follows; the result is also committed in git:

make 20170626/wikidata-20170626-all.ids.count
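
Both targets essentially stream the decompressed dump once. Roughly, keeping in mind that the actual Makefile rules may differ:

# MD5 of the decompressed dump (sketch)
zcat 20170626/wikidata-20170626-all.json.gz | md5sum
# Entity count: one entity per line, minus the two lines holding the array brackets (sketch)
zcat 20170626/wikidata-20170626-all.json.gz | wc -l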

Extract publication types

To find out which Wikidata items refer to bibliographic objects, we must extract all subclasses of Q732577 (publication). The class hierarchy must be derived from the JSON dump itself because it will likely have changed in the meantime.

First extract all truthy subclass-of statements:

make 20170626/wikidata-20170626.classes.csv

Then get all subclasses of Q732577 and Q191067 (the latter was missing as a subclass of the former until mid-September 2017):

make 20170626/wikidata-20170626.pubtypes

The list of publication types is sorted and committed for reference.
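
The first step boils down to reading property P279 (subclass of) from every entity; collecting all direct and indirect subclasses of Q732577 and Q191067 is then a transitive closure over the resulting pairs. A per-entity jq sketch of the extraction (field names follow the Wikidata JSON format; the actual scripts may differ, and truthy filtering is only approximated here by dropping deprecated ranks):

# Emit "child","parent" pairs for P279 (subclass of) claims (sketch)
jq -r '.id as $c
       | (.claims.P279 // [])[]
       | select(.mainsnak.snaktype == "value" and .rank != "deprecated")
       | [$c, .mainsnak.datavalue.value.id] | @csv'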

Extract bibliographic items

Extract all bibliographic items, with simplified truthy statements, based on the list of publication types:

make 20170626/wikidata-20170626.publications.ndjson.gz

The number of bibliographic items is counted as follows; the result is also committed in git:

make 20170626/wikidata-20170626.publications.ids.count
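
Conceptually this keeps every entity whose P31 (instance of) statement points to one of the collected publication types and then simplifies its statements. A rough per-entity sketch of the filtering part (the pubtypes file name and format used here are assumptions; the actual scripts may differ):

# Keep entities that are an instance (P31) of any known publication type (sketch)
# pubtypes.json is assumed to hold a JSON array of class IDs such as ["Q732577", ...]
jq -c --slurpfile types pubtypes.json '
  select([ .claims.P31[]?.mainsnak
           | select(.snaktype == "value")
           | .datavalue.value.id ]
         | any(. as $t | $types[0] | index($t)))'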

FIXME:

  • Author names are not sorted yet
  • Claims with the special value "unknown" are not included, although including them might be useful

Extract labels

The WikiCite dump does not contain information about non-bibliographic items such as people and places. To make further use of the data you will likely need labels:

make 20170626/wikidata-20170626-all.labels.ndjson

Uncompressed label files tend to get large, so compression or reduction to a selected language may be added later.
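
If only one language is needed, a per-entity reduction along these lines would already help (sketch; English is used as an example, and the exact format of the labels file is not fixed):

# Keep only the entity ID and its English label (sketch)
jq -c 'select(.labels.en) | {id: .id, label: .labels.en.value}'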

Convert to other bibliographic formats

To be done (especially CSL-JSON and MARCXML)

Combine statistics

The file stats.json contains summary statistics:

  • md5: MD5 hash of the full, uncompressed Wikidata JSON dump
  • size: size of the compressed full dump
  • entities: number of entities
  • publications
    • items: number of publication items
    • size: size of the compressed publications dump
  • pubtypes: number of publication types

Run make stats in the base directory to update this file.
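
For orientation, an entry might look roughly like this (values are placeholders, and keying the records by dump date is an assumption):

"20170626": {
  "md5": "0123456789abcdef0123456789abcdef",
  "size": 123456789,
  "entities": 25000000,
  "publications": { "items": 500000, "size": 12345678 },
  "pubtypes": 1234
}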

See also

License

CC0