bionames-data

Data loaders for the BioNames project.

Journals

Publications

Unique ids

Many publications lack external identifiers, so we create an MD5 hash of a clean citation string and use it to cluster references that are probably the same. If we have very limited publication details (such as only journal and year), this will merge references that are actually different. To minimise this, the clean citation string includes any external identifier we have, such as a DOI or BioStor id.
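As a rough illustration of the idea, here is a minimal shell sketch of hashing a cleaned citation string. The function names `clean_citation` and `citation_hash`, the cleaning rules (lowercase, punctuation to spaces, collapsed whitespace) and the example DOIs are all assumptions made for this example; the actual rules live in `unique_pub_ids.php` and may differ. It assumes `md5sum` from GNU coreutils is available.

```shell
#!/bin/sh
# Sketch only: the normalisation rules below are assumptions and are not
# taken from unique_pub_ids.php. Requires md5sum (GNU coreutils).

# clean_citation CITATION [IDENTIFIER]
# Lowercase, replace every non-alphanumeric byte with a space, collapse runs
# of spaces, and trim. The optional identifier (DOI, BioStor id) is appended
# before cleaning so that it becomes part of the hashed string.
clean_citation() {
    printf '%s %s' "$1" "${2:-}" \
        | LC_ALL=C tr '[:upper:]' '[:lower:]' \
        | LC_ALL=C tr -c '[:alnum:]' ' ' \
        | tr -s ' ' \
        | sed 's/^ //; s/ $//'
}

# citation_hash CITATION [IDENTIFIER]
# Print the 32-character hex MD5 of the cleaned citation string.
citation_hash() {
    clean_citation "$1" "${2:-}" | md5sum | cut -d' ' -f1
}

# Two renderings of the same (made-up) reference give the same hash.
citation_hash "Smith, J. (1999) A new species. Zootaxa 12: 1-5"
citation_hash "SMITH J 1999 A new species Zootaxa 12 1 5"

# With only journal and year, the (hypothetical) DOI is what keeps these apart.
citation_hash "Zootaxa 1999" "10.1234/example.1"
citation_hash "Zootaxa 1999" "10.1234/example.2"
```

The first two calls print the same hash, while the last two print different hashes because the identifier is part of the cleaned string.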

Firstly we dump a list of all ION ids that have publication information:

SELECT id INTO OUTFILE "/tmp/pub_ids.txt"
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
FROM names WHERE publication IS NOT NULL;

Query OK, 1520478 rows affected (7 min 9.37 sec)

Then we read this list and generate MD5 hashes for each publication. From root of bionames-data:

cd cleaning/ion-pubids

cp /tmp/pub_ids.txt .

php unique_pub_ids.php > all.sql

(about 5 mins)

mysql -u <username> -p<password> <db> < all.sql

(about 5 mins)

Load publications into CouchDB

Get list of all MD5 hashes:

SELECT DISTINCT sici INTO OUTFILE "/tmp/all_md5.txt"
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
FROM names;

Query OK, 393545 rows affected (2 min 17.07 sec)
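Because this query has no WHERE clause, any name without a hash would contribute a NULL row, which MySQL writes to the outfile as `\N`. The following is a minimal sketch of filtering the list down to well-formed hashes before loading. The function name `filter_md5` is hypothetical, and it assumes the `sici` column stores the hash as 32 lowercase hex characters; whether `load_all.php` already skips other lines has not been checked here.

```shell
#!/bin/sh
# Sketch only: assumes each valid line is a lowercase 32-character hex MD5,
# and that NULL values were exported by MySQL as \N.

# filter_md5 FILE
# Print only the lines of FILE that look like an MD5 hash.
filter_md5() {
    grep -E '^[0-9a-f]{32}$' "$1"
}

# Demonstration on a small made-up list (not real BioNames data).
demo=$(mktemp)
printf '%s\n' 'd41d8cd98f00b204e9800998ecf8427e' '\N' '' > "$demo"
filter_md5 "$demo"    # prints only the first line
rm -f "$demo"
```

If filtering turns out to be needed, `filter_md5 /tmp/all_md5.txt > all_md5.txt` could stand in for the plain copy in the next step.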

From root of bionames-data:

cd publication

cp /tmp/all_md5.txt .

php load_all.php

(overnight)

Load GBIF

Load clusters

Get list of all cluster ids:

SELECT DISTINCT cluster_id INTO OUTFILE "/tmp/cluster_ids.txt"
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
FROM names;

Query OK, 3913417 rows affected (19 min 55.74 sec)

From root of bionames-data:

cd clusters

cp /tmp/cluster_ids.txt .

php load_all.php
