w2wei/dataset_retrieval_pipeline

1. Configure running environment
    Set up an Ubuntu 14.04 system with about 32 GB of memory and 500 GB of disk space.

    Set up your working path
    	WORK_PATH="/your/path/here"
    	mkdir -p $WORK_PATH/code $WORK_PATH/data $WORK_PATH/results $WORK_PATH/tools $WORK_PATH/downloads
    	cd $WORK_PATH
    Install the Oracle Java JDK (https://www.digitalocean.com/community/tutorials/how-to-install-java-on-ubuntu-with-apt-get)
        sudo apt-get install python-software-properties
        sudo add-apt-repository ppa:webupd8team/java
        sudo apt-get update
        sudo apt-get install oracle-java8-installer
        If you have multiple Java JDKs installed on the VM, see this link for tips on managing them with update-alternatives: https://askubuntu.com/questions/233190/what-exactly-does-update-alternatives-do
    Install Anaconda for Python 2.7 (https://docs.continuum.io/anaconda/install)
        wget -P $WORK_PATH/tools https://repo.continuum.io/archive/Anaconda2-4.3.1-Linux-x86_64.sh
        bash $WORK_PATH/tools/Anaconda2-4.3.1-Linux-x86_64.sh
    Install the Python packages NLTK (corpora/stopwords and punkt/english data), Biopython, and elasticsearch; a quick import check follows these steps
        NLTK packages:
            Option 1: python -m nltk.downloader all ## install all corpora and models
            Option 2: python -m nltk.downloader ## enter an interactive interface and select the required packages
        Biopython: 
            conda install -c anaconda biopython=1.68
        elasticsearch (Python client):
            pip install elasticsearch
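        ## Quick sanity check for the Python dependencies (a minimal sketch; it only verifies
        ## that the packages and NLTK data above are importable, nothing pipeline-specific):
            # check_env.py -- illustrative import check, run with the Anaconda python
            import nltk
            from nltk.corpus import stopwords        # needs the NLTK "stopwords" corpus
            from Bio import SeqIO                    # Biopython
            from elasticsearch import Elasticsearch  # elasticsearch Python client

            nltk.data.find('tokenizers/punkt/english.pickle')  # raises LookupError if punkt is missing
            print 'English stopwords loaded:', len(stopwords.words('english'))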
    Install Elasticsearch 5.0.2 (ES) (https://www.elastic.co/guide/en/elasticsearch/reference/5.0/install-elasticsearch.html) 
        cd $WORK_PATH/tools ## install ES under $WORK_PATH/tools
        wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.0.2.tar.gz
        sha1sum elasticsearch-5.0.2.tar.gz ## compare the checksum with the value published by Elastic
        tar -xzf elasticsearch-5.0.2.tar.gz
    Configure ES
        Change the JVM heap size for ES (https://www.elastic.co/guide/en/elasticsearch/reference/5.0/heap-size.html)
            vim $WORK_PATH/tools/elasticsearch-5.0.2/config/jvm.options
        Set both Xms and Xmx to a reasonable value, such as 10g: -Xms10g, -Xmx10g
        Add protected words to ES
            wget -P $WORK_PATH/downloads ftp://nlmpubs.nlm.nih.gov/online/mesh/2016/asciimesh/d2016.bin          
            mkdir $WORK_PATH/tools/elasticsearch-5.0.2/config/analysis 
            ## call get_MeSH_vocab.py to generate "mesh_and_entry_vocab.txt"
            python $WORK_PATH/code/get_MeSH_vocab.py $WORK_PATH/downloads/d2016.bin $WORK_PATH/tools/elasticsearch-5.0.2/config/analysis 
            ## check if "mesh_and_entry_vocab.txt" is in $WORK_PATH/tools/elasticsearch-5.0.2/config/analysis
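            ## For reference, the ASCII MeSH file d2016.bin stores one field per line (MH, ENTRY,
            ## PRINT ENTRY, ...). A rough sketch of the kind of extraction get_MeSH_vocab.py performs
            ## is shown below; the provided script is authoritative, this is only illustrative.
                # mesh_vocab_sketch.py <d2016.bin> <output_dir> -- illustrative only
                import sys

                def mesh_terms(path):
                    # MeSH headings (MH) and entry terms (ENTRY / PRINT ENTRY); the term is the
                    # text before the first '|' of the field value
                    with open(path) as fh:
                        for line in fh:
                            if line.startswith(('MH = ', 'ENTRY = ', 'PRINT ENTRY = ')):
                                term = line.split(' = ', 1)[1].split('|')[0].strip()
                                if term:
                                    yield term.lower()

                if __name__ == '__main__':
                    src, out_dir = sys.argv[1], sys.argv[2]
                    with open(out_dir.rstrip('/') + '/mesh_and_entry_vocab.txt', 'w') as out:
                        for t in sorted(set(mesh_terms(src))):
                            out.write(t + '\n')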
        Start ES as a daemon process (https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html)
            $WORK_PATH/tools/elasticsearch-5.0.2/bin/elasticsearch -d -p pid ## -d: daemonize, -p: write the process ID to the file "pid"
        Get ES status
            curl -XGET '127.0.0.1:9200/_stats?pretty'  ## show general information
            curl -XGET 'http://127.0.0.1:9200/_cat/indices?v' ## show all indices
            curl -XGET 'http://127.0.0.1:9200/_count?pretty' ## count the documents in all indices
            ## delete an index
            curl -XDELETE 'http://127.0.0.1:9200/index_name_here?pretty'
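            ## The same checks can be run from Python with the client installed earlier
            ## (a minimal sketch; "index_name_here" is a placeholder):
                from elasticsearch import Elasticsearch

                es = Elasticsearch(['127.0.0.1:9200'])
                print es.ping()                # True if the node is reachable
                print es.count()               # document count across all indices
                print es.cat.indices(v=True)   # same information as the _cat/indices?v call
                # es.indices.delete(index='index_name_here')  # delete an index (use with care)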
    Install MetaMap (optional)
        ## follow the instructions at https://metamap.nlm.nih.gov/Installation.shtml

2. Prepare data 
    a. Metadata of datasets provided by the bioCADDIE Challenge
        wget -P $WORK_PATH/data https://biocaddie.org/sites/default/files/update_json_folder.zip
        unzip $WORK_PATH/data/update_json_folder.zip -d $WORK_PATH/data
        mv $WORK_PATH/data/update_json_folder $WORK_PATH/data/datamed_json ## rename the decompressed directory to datamed_json
    b. Collect additional information for the datasets
       Option 1: use prepared documents
        mv $WORK_PATH/code/additional_fields.tar.gz $WORK_PATH/data
        tar -zxf $WORK_PATH/data/additional_fields.tar.gz -C $WORK_PATH/data
       Option 2: get additional information from scratch
        mkdir $WORK_PATH/data/additional_fields
        ## Run split_tasks.py to prepare data
        cd $WORK_PATH/code/ext_fields
        python split_tasks.py
        ## Run ret.sh to collect additional information
        sh ret.sh
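        ## If you adapt Option 2, note that ret.sh performs the network retrieval and the scripts
        ## themselves define what is fetched. As a purely illustrative example of one way to pull
        ## records from NCBI with the Biopython installed above (not necessarily what ret.sh does):
            from Bio import Entrez

            Entrez.email = "you@example.org"   # placeholder; NCBI asks for a contact address
            handle = Entrez.esearch(db="pubmed", term="gene expression dataset", retmax=5)
            record = Entrez.read(handle)
            handle.close()
            print record["IdList"]             # PubMed IDs matching the query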
    c. Re-generate the keyword query file kw_questions.txt for PSD-keywords (optional).
       *kw_questions.txt is already included under $WORK_PATH/code*
        ## Create one file per question
        ## Run MetaMap on each file with the default settings and save the output files
        cd $WORK_PATH/code/rerank/metamap
        javac *.java
        java KeyWordExtractor /path/to/your/metamap_output_file
        ## Merge the results to generate kw_questions.txt
    d. Re-generate Google-returned documents for the Distribution Shift method (optional)
       *google_questions.txt is already included under $WORK_PATH/code*
        ## Make sure you have a graphical user interface (a browser window will be launched)
        ## Install Selenium from http://www.seleniumhq.org/
        ## Update $WORK_PATH/code/rerank/dist_shift/Google.java according to the browser you use; it uses Chrome by default
        cd $WORK_PATH/code/rerank/dist_shift
        javac *.java
        java Google $WORK_PATH/code/all_questions.txt $WORK_PATH/code/google_questions.txt
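        ## The general pattern Google.java follows (open a browser, issue the query, save the
        ## returned results) looks roughly like this in Python with Selenium; purely illustrative,
        ## and it assumes chromedriver is on your PATH:
            import urllib
            from selenium import webdriver

            driver = webdriver.Chrome()        # or webdriver.Firefox(), matching your browser
            query = "example question text"
            driver.get("https://www.google.com/search?q=" + urllib.quote_plus(query))
            titles = [e.text for e in driver.find_elements_by_css_selector("h3")]
            print "\n".join(titles)
            driver.quit()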

3. Run experiments
    ## Bash scripts for experiments are under $WORK_PATH/code/experiments
    ## Check $WORK_PATH/code/rerank/PSD/Constants.java and make sure all the paths are correct
