w2wei/dataset_retrieval_pipeline

1. Configure running environment
    Set up an Ubuntu 14.04 system with about 32 GB of memory and 500 GB of disk space.

    Set up your working path
    	WORK_PATH="/your/path/here"
    	mkdir -p $WORK_PATH/code $WORK_PATH/data $WORK_PATH/results $WORK_PATH/tools $WORK_PATH/downloads
    	cd $WORK_PATH
    Install the Oracle Java JDK (https://www.digitalocean.com/community/tutorials/how-to-install-java-on-ubuntu-with-apt-get)
        sudo apt-get install python-software-properties
        sudo add-apt-repository ppa:webupd8team/java
        sudo apt-get update
        sudo apt-get install oracle-java8-installer
        If you have multiple Java JDKs installed on the VM, see this link for tips on managing them with update-alternatives: https://askubuntu.com/questions/233190/what-exactly-does-update-alternatives-do
    Install Anaconda for Python 2.7 (https://docs.continuum.io/anaconda/install)
        wget -P $WORK_PATH/tools https://repo.continuum.io/archive/Anaconda2-4.3.1-Linux-x86_64.sh
        bash $WORK_PATH/tools/Anaconda2-4.3.1-Linux-x86_64.sh
    Install the Python packages NLTK (corpora/stopwords and punkt/english data), Biopython, and elasticsearch; a quick import check follows these steps
        NLTK packages:
            Option 1: python -m nltk.downloader all ## install all corpora and models
            Option 2: python -m nltk.downloader ## enter an interactive interface and select the required packages
        Biopython: 
            conda install -c anaconda biopython=1.68
        elasticsearch (Python client):
            pip install elasticsearch
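        ## Quick sanity check for the Python dependencies (a minimal sketch; it only verifies
        ## that the packages and NLTK data above are importable, nothing pipeline-specific):
            # check_env.py -- illustrative import check, run with the Anaconda python
            import nltk
            from nltk.corpus import stopwords        # needs the NLTK "stopwords" corpus
            from Bio import SeqIO                    # Biopython
            from elasticsearch import Elasticsearch  # elasticsearch Python client

            nltk.data.find('tokenizers/punkt/english.pickle')  # raises LookupError if punkt is missing
            print 'English stopwords loaded:', len(stopwords.words('english'))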
    Install Elasticsearch 5.0.2 (ES) (https://www.elastic.co/guide/en/elasticsearch/reference/5.0/install-elasticsearch.html) 
        cd $WORK_PATH/tools ## install ES under $WORK_PATH/tools
        wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.0.2.tar.gz
        sha1sum elasticsearch-5.0.2.tar.gz ## compare the checksum with the value published by Elastic
        tar -xzf elasticsearch-5.0.2.tar.gz
    Configure ES
        Change the JVM heap size for ES (https://www.elastic.co/guide/en/elasticsearch/reference/5.0/heap-size.html)
            vim $WORK_PATH/tools/elasticsearch-5.0.2/config/jvm.options
        Set both Xms and Xmx to a reasonable value, such as 10g: -Xms10g, -Xmx10g
        Add protected words to ES
            wget -P $WORK_PATH/downloads ftp://nlmpubs.nlm.nih.gov/online/mesh/2016/asciimesh/d2016.bin          
            mkdir $WORK_PATH/tools/elasticsearch-5.0.2/config/analysis 
            ## call get_MeSH_vocab.py to generate "mesh_and_entry_vocab.txt"
            python $WORK_PATH/code/get_MeSH_vocab.py $WORK_PATH/downloads/d2016.bin $WORK_PATH/tools/elasticsearch-5.0.2/config/analysis 
            ## check if "mesh_and_entry_vocab.txt" is in $WORK_PATH/tools/elasticsearch-5.0.2/config/analysis
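            ## For reference, the ASCII MeSH file d2016.bin stores one field per line (MH, ENTRY,
            ## PRINT ENTRY, ...). A rough sketch of the kind of extraction get_MeSH_vocab.py performs
            ## is shown below; the provided script is authoritative, this is only illustrative.
                # mesh_vocab_sketch.py <d2016.bin> <output_dir> -- illustrative only
                import sys

                def mesh_terms(path):
                    # MeSH headings (MH) and entry terms (ENTRY / PRINT ENTRY); the term is the
                    # text before the first '|' of the field value
                    with open(path) as fh:
                        for line in fh:
                            if line.startswith(('MH = ', 'ENTRY = ', 'PRINT ENTRY = ')):
                                term = line.split(' = ', 1)[1].split('|')[0].strip()
                                if term:
                                    yield term.lower()

                if __name__ == '__main__':
                    src, out_dir = sys.argv[1], sys.argv[2]
                    with open(out_dir.rstrip('/') + '/mesh_and_entry_vocab.txt', 'w') as out:
                        for t in sorted(set(mesh_terms(src))):
                            out.write(t + '\n')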
        Start ES as a daemon process (https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html)
            $WORK_PATH/tools/elasticsearch-5.0.2/bin/elasticsearch -d -p pid ## -d: daemonize, -p: write the process ID to the file "pid"
        Get ES status
            curl -XGET '127.0.0.1:9200/_stats?pretty'  ## show general information
            curl -XGET 'http://127.0.0.1:9200/_cat/indices?v' ## show all indices
            curl -XGET 'http://127.0.0.1:9200/_count?pretty' ## count the documents in all indices
            ## delete an index
            curl -XDELETE 'http://127.0.0.1:9200/index_name_here?pretty'
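            ## The same checks can be run from Python with the client installed earlier
            ## (a minimal sketch; "index_name_here" is a placeholder):
                from elasticsearch import Elasticsearch

                es = Elasticsearch(['127.0.0.1:9200'])
                print es.ping()                # True if the node is reachable
                print es.count()               # document count across all indices
                print es.cat.indices(v=True)   # same information as the _cat/indices?v call
                # es.indices.delete(index='index_name_here')  # delete an index (use with care)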
    Install MetaMap (optional)
        ## follow the instructions at https://metamap.nlm.nih.gov/Installation.shtml

2. Prepare data 
    a. Metadata of datasets provided by the bioCADDIE Challenge
        wget -P $WORK_PATH/data https://biocaddie.org/sites/default/files/update_json_folder.zip
        unzip $WORK_PATH/data/update_json_folder.zip -d $WORK_PATH/data
        mv $WORK_PATH/data/update_json_folder $WORK_PATH/data/datamed_json ## rename the decompressed directory to datamed_json
    b. Collect additional information for the datasets
       Option 1: use prepared documents
        mv $WORK_PATH/code/additional_fields.tar.gz $WORK_PATH/data
        tar -zxf $WORK_PATH/data/additional_fields.tar.gz -C $WORK_PATH/data
       Option 2: get additional information from scratch
        mkdir $WORK_PATH/data/additional_fields
        ## Run split_tasks.py to prepare data
        cd $WORK_PATH/code/ext_fields
        python split_tasks.py
        ## Run ret.sh to collect additional information
        sh ret.sh
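        ## If you adapt Option 2, note that ret.sh performs the network retrieval and the scripts
        ## themselves define what is fetched. As a purely illustrative example of one way to pull
        ## records from NCBI with the Biopython installed above (not necessarily what ret.sh does):
            from Bio import Entrez

            Entrez.email = "you@example.org"   # placeholder; NCBI asks for a contact address
            handle = Entrez.esearch(db="pubmed", term="gene expression dataset", retmax=5)
            record = Entrez.read(handle)
            handle.close()
            print record["IdList"]             # PubMed IDs matching the query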
    c. Re-generate the keyword query file kw_questions.txt for PSD-keywords (optional).
       *kw_questions.txt is already included under $WORK_PATH/code*
        ## Create one file per question
        ## Run MetaMap on each file with the default settings and save the output files
        cd $WORK_PATH/code/rerank/metamap
        javac *.java
        java KeyWordExtractor /path/to/your/metamap_output_file
        ## Merge the results to generate kw_questions.txt
    d. Re-generate Google-returned documents for the Distribution Shift method (optional)
       *google_questions.txt is already included under $WORK_PATH/code*
        ## Make sure you have a graphical user interface (a browser window will be launched)
        ## Install Selenium from http://www.seleniumhq.org/
        ## Update $WORK_PATH/code/rerank/dist_shift/Google.java according to the browser you use; it uses Chrome by default
        cd $WORK_PATH/code/rerank/dist_shift
        javac *.java
        java Google $WORK_PATH/code/all_questions.txt $WORK_PATH/code/google_questions.txt
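        ## The general pattern Google.java follows (open a browser, issue the query, save the
        ## returned results) looks roughly like this in Python with Selenium; purely illustrative,
        ## and it assumes chromedriver is on your PATH:
            import urllib
            from selenium import webdriver

            driver = webdriver.Chrome()        # or webdriver.Firefox(), matching your browser
            query = "example question text"
            driver.get("https://www.google.com/search?q=" + urllib.quote_plus(query))
            titles = [e.text for e in driver.find_elements_by_css_selector("h3")]
            print "\n".join(titles)
            driver.quit()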

3. Run experiments
    ## Bash scripts for experiments are under $WORK_PATH/code/experiments
    ## Check $WORK_PATH/code/rerank/PSD/Constants.java and make sure all the paths are correct
