MEDOC is a free Python wrapper to clone MEDLINE into a local MySQL database

MEDOC (MEdline DOwnloading Contrivance)

More information about MEDOC is available on the OMICTools website and in MEDOC's publication on arXiv.org.

About

Development

Thanks to rafspiny for his many corrections and feedback!

What is MEDLINE?

MEDLINE is a database of scientific articles released by the NIH. PubMed is the most common way to query this database and is used daily by many scientists around the world.

The NIH provides free APIs for building automated queries; however, a local relational database can be more efficient.

The aim of this project is to download the XML files that MEDLINE provides on an FTP server and to build a relational MySQL database from their content.
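
The idea can be illustrated with a minimal sketch. It uses only the Python standard library, with an in-memory SQLite database standing in for MySQL. The `article` table and the `load_articles` helper are assumptions made for illustration; they are not MEDOC's actual schema or code, and the sample XML is hand-written rather than real MEDLINE data.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the shape of a MEDLINE XML file (not real data).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article><ArticleTitle>First example title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article><ArticleTitle>Second example title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def load_articles(xml_text, connection):
    """Parse articles from xml_text and insert them into a hypothetical table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS article (pmid INTEGER PRIMARY KEY, title TEXT)"
    )
    root = ET.fromstring(xml_text.strip())
    rows = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        title = article.findtext("MedlineCitation/Article/ArticleTitle")
        rows.append((int(pmid), title))
    connection.executemany("INSERT INTO article (pmid, title) VALUES (?, ?)", rows)
    connection.commit()
    return len(rows)


# SQLite in memory is a stand-in for the MySQL server used by MEDOC.
connection = sqlite3.connect(":memory:")
inserted = load_articles(SAMPLE_XML, connection)
titles = connection.execute("SELECT pmid, title FROM article ORDER BY pmid").fetchall()
print(inserted, titles)
```

Once the articles are in a table, questions that would otherwise need one API call each become plain SQL queries.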

Launch

Clone this repository

The first step is to clone this GitHub repository to your local machine.

Open a terminal.

git clone "https://github.com/MrMimic/MEDOC"
cd ./MEDOC

Setup

This section covers the prerequisites and the installation procedure.

Prerequisites

XML parsing libraries may be needed. You can install them on any Debian-derived system with:

sudo apt-get install libxml2-dev libxslt1-dev zlib1g-dev

You may also need python-dev, which can be installed the same way:

sudo apt-get install python-dev

Installation

The quickest option is to install the dependencies listed in the repository with pip. Run the following command from the MEDOC folder:

pip3 install -r requirements.txt

This will install the following Python packages:

beautifulsoup4==4.6.0
bs4==0.0.1
Cython==0.28.1
html5lib==1.0.1
lxml==4.2.1
PyMySQL==0.8.0
six==1.11.0
SQLAlchemy==1.2.5
webencodings==0.5.1

NOTE: If python3 is your default, you do not need to specify python3 or pip3 but just use python and pip.

Alternatively, you can run the file SETUP.py, which checks for the dependencies and prints out what is missing.

This script will:

  • Check for pip3 and give you the command to install it
  • Check for pymysql and give you the command to install it
  • Check for bs4 and give you the command to install it
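
A check of this kind could look roughly like the sketch below. It is not the content of SETUP.py; the helper names `missing_modules` and `install_hint` are hypothetical, and the hint assumes the pip package name matches the module name, which is not true for every package.

```python
import importlib.util
import shutil


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


def install_hint(module_name):
    """Build the pip command a user could run to install module_name."""
    return f"pip3 install {module_name}"


# Look for the pip3 executable on the PATH.
if shutil.which("pip3") is None:
    print("pip3 not found; on Debian-derived systems: sudo apt-get install python3-pip")

# Report each missing module together with a suggested install command.
for name in missing_modules(["pymysql", "bs4"]):
    print(f"{name} is missing; try: {install_hint(name)}")
```

The script prints nothing when everything is already installed.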

Configuration

Before you can run the code, take a look at the configuration.cfg file and customize it for your environment.

In addition, if you already have a database user for the DB you wish to create, you can change the schema file to reflect that. The commands below replace the DB_USER and DB_PASSWORD fields; other database connection options can be changed in the same way. Suppose your credentials are: my_custom_user/my_secret_password

export MEDOC_SQL_FILE='database_creation.sql'
sed -i'' -e "s/\bDB_USER\b/my_custom_user/g" "$MEDOC_SQL_FILE"
sed -i'' -e "s/\bDB_PASSWORD\b/my_secret_password/g" "$MEDOC_SQL_FILE"

One setting you will certainly need to change is the path for the log files. Suppose $MEDOC_SOURCE_DIR contains your MEDOC path:

escaped_rhs=$(printf '%s\n' "$MEDOC_SOURCE_DIR" | sed 's:[\/&]:\\&:g;$!s/$/\\/')
sed -i'' -e "s/\/home\/emeric\/1_Github\/MEDOC/${escaped_rhs}/g" configuration.cfg
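
The same substitution can be sketched in Python, which avoids escaping the replacement string for sed. The helper names and the example configuration line are hypothetical; only the default path comes from the command above.

```python
from pathlib import Path

# Path hard-coded in the shipped configuration.cfg, taken from the sed command above.
DEFAULT_PREFIX = "/home/emeric/1_Github/MEDOC"


def rewrite_prefix(text, new_prefix, old_prefix=DEFAULT_PREFIX):
    """Replace every occurrence of old_prefix in text with new_prefix."""
    return text.replace(old_prefix, new_prefix)


def rewrite_config(config_path, new_prefix):
    """Rewrite the file at config_path in place and return the new content."""
    path = Path(config_path)
    updated = rewrite_prefix(path.read_text(encoding="utf-8"), new_prefix)
    path.write_text(updated, encoding="utf-8")
    return updated


# Hypothetical configuration line, used only to show the substitution.
example = "log_file = /home/emeric/1_Github/MEDOC/log/medoc.log"
print(rewrite_prefix(example, "/opt/MEDOC"))
```

Calling `rewrite_config("configuration.cfg", "/path/to/your/MEDOC")` from the MEDOC folder would apply the change to the real file.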

Launch the program

Make sure configuration.cfg is correctly set up before proceeding.

If your computer has 16 GB of RAM or more, you can set 'insert_command_limit' to '1000' or greater.

Then, simply execute:

python3 __execution__.py 

Output

The first lines should report the database creation and the number of files to download.

Then, the output for each loaded file should look like:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - DOWNLOADING FILE
Downloading baseline/medline17n0216.xml.gz ..
Elapsed time: 12.32 sec for module: download
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - FILE EXTRACTION
Elapsed time: 0.42 sec for module: extract
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - XML FILE PARSING
Elapsed time: 72.47 sec for module: parse
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - SQL INSERTION
10000 articles inserted for file baseline/medline17n0216.xml.gz
20000 articles inserted for file baseline/medline17n0216.xml.gz
30000 articles inserted for file baseline/medline17n0216.xml.gz
Total time for file medline17n0216.xml.gz: 5.29 min
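
If you want to track which step dominates the runtime, the timing lines can be collected with a few lines of Python. This is a minimal sketch that assumes the output format shown above stays exactly the same; `module_timings` is a hypothetical helper, not part of MEDOC.

```python
import re

# Pattern for the timing lines shown in the sample output above.
TIMING_LINE = re.compile(
    r"^Elapsed time: (?P<seconds>\d+(?:\.\d+)?) sec for module: (?P<module>\w+)$"
)


def module_timings(log_text):
    """Map each module name to its elapsed time in seconds."""
    timings = {}
    for line in log_text.splitlines():
        match = TIMING_LINE.match(line.strip())
        if match:
            timings[match.group("module")] = float(match.group("seconds"))
    return timings


sample = """\
Downloading baseline/medline17n0216.xml.gz ..
Elapsed time: 12.32 sec for module: download
Elapsed time: 0.42 sec for module: extract
Elapsed time: 72.47 sec for module: parse
"""
print(module_timings(sample))
```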

Issues

Program stops running because of 'Segmentation fault (core dumped)'

Indexing a file with 30K articles takes time and RAM (if you know of a parser more RAM-friendly than lxml, please open a PR). Open the file /lib_medline/python_functions/E_parse_xml.py and go to the line:

soup = BeautifulSoup(file_content, 'lxml')

Change 'lxml' to 'html.parser' (the parser built into Python that BeautifulSoup can use) and re-launch SETUP.py.

Or simply try lowering the 'insert_command_limit' parameter, so that values are inserted into the database more often and fewer rows are held in RAM at once.
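
The effect of such a limit can be shown with a minimal batching sketch. This is an illustration of the general technique, not MEDOC's insertion code; `flush_in_batches` and its `insert_batch` callback are hypothetical names.

```python
def flush_in_batches(rows, insert_batch, limit):
    """Send rows to insert_batch in chunks of at most limit rows.

    Only one chunk is held in memory at a time, which is why a smaller
    limit lowers peak RAM usage at the cost of more database round trips.
    """
    batch = []
    batches_sent = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= limit:
            insert_batch(batch)
            batches_sent += 1
            batch = []
    if batch:  # flush the final, partially filled batch
        insert_batch(batch)
        batches_sent += 1
    return batches_sent


# Stand-in for a database insert: just remember each chunk that was sent.
received = []
count = flush_in_batches(range(10), received.append, limit=4)
print(count, [len(chunk) for chunk in received])
```

With ten rows and a limit of four, the callback is invoked three times, with chunks of four, four and two rows.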

SQL insertions are taking a very long time (more than 15 min / file)

Drop the SQL database so that it can be recreated, by running the following command:

DROP DATABASE pubmed ;

Then, comment out every line about indexes (CREATE INDEX) or foreign keys (ALTER TABLE) in the SQL creation file. Indexes slow down insertions.

When the database is full, run the index and ALTER commands one at a time.
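
Instead of commenting lines out by hand, the statements can be separated programmatically. The sketch below is deliberately naive and rests on stated assumptions: statements are split on `;`, only statements starting with `CREATE INDEX` or `ALTER TABLE` are deferred (so `CREATE UNIQUE INDEX` would not be caught), and the schema fragment is hypothetical rather than MEDOC's actual creation file.

```python
DEFERRED_PREFIXES = ("CREATE INDEX", "ALTER TABLE")


def split_schema(sql_text):
    """Split SQL statements into (run_now, run_after_load).

    Naive on purpose: statements are split on ';', so this assumes no
    semicolons inside string literals or comments.
    """
    run_now, run_after_load = [], []
    for statement in sql_text.split(";"):
        statement = statement.strip()
        if not statement:
            continue
        if statement.upper().startswith(DEFERRED_PREFIXES):
            run_after_load.append(statement)
        else:
            run_now.append(statement)
    return run_now, run_after_load


# Hypothetical schema fragment, not MEDOC's actual creation file.
schema = """
CREATE TABLE article (pmid INT, title TEXT);
CREATE INDEX idx_pmid ON article (pmid);
ALTER TABLE author ADD FOREIGN KEY (pmid) REFERENCES article (pmid);
"""
now, later = split_schema(schema)
print(len(now), len(later))
```

The first list can be executed before loading the files and the second once the load has finished.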

Problem installing lxml

Make sure you have all the required dependencies installed.

On Debian-based machines, try running:

sudo apt-get install python-dev libxml2-dev libxslt1-dev zlib1g-dev