Prerequisites
1. Install the software needed to retrieve data from the FTP server (Windows 10)
    - Install Chocolatey: [Chocolatey Installation](https://chocolatey.org/install#individual).
    - Install Wget by typing the following into your shell: `choco install wget`.
2. Retrieve RELISH data set
    - Retrieve from [RELISH](https://figshare.com/projects/RELISH-DB/60095).
    - Extract the `RELISH_v1.json` at `data/RELISH`.
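3. Retrieve PubMed data via the FTP server (Windows 10)
    - Retrieve it using Wget by typing the following into your shell: `wget -mkx -e robots=off https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/`. The `-m` and `-x` options mirror the remote directory tree locally, so the files land in `ftp.ncbi.nlm.nih.gov/pubmed/baseline` under the current directory.
    - Extract all `.gz` files in `ftp.ncbi.nlm.nih.gov/pubmed/baseline`.
    - Move the resulting XML files to `data/RELISH/pubmed`.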
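
The extract-and-move steps can also be scripted instead of done by hand. The following is a minimal sketch using only the Python standard library; the helper name `extract_baseline` is hypothetical and not part of this repository, and it assumes every archive is a plain gzip file named like `something.xml.gz`.

```python
import gzip
import shutil
from pathlib import Path


def extract_baseline(src_dir, dest_dir):
    """Decompress every .gz file in src_dir into dest_dir and return the new paths.

    Hypothetical helper, not part of the repository.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for archive in sorted(src.glob("*.gz")):
        # Path.stem drops only the last suffix: "x.xml.gz" -> "x.xml".
        target = dest / archive.stem
        with gzip.open(archive, "rb") as fin, open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)  # stream, so large files are not held in memory
        written.append(target)
    return written
```

Under those assumptions, `extract_baseline('ftp.ncbi.nlm.nih.gov/pubmed/baseline', 'data/RELISH/pubmed')` would cover both bullets in one call, leaving the original archives in place.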

In [None]:
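import os
# Change into the code directory so that its modules can be imported.
os.chdir('../')
os.chdir('../code')
from InputDataPreprocess import parseRelish
# preProcessFTP is imported from the FTP-approach subdirectory.
os.chdir('FTP-approach')
from preProcessFTP import structureDataset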

Code Strategy
1. Get PMIDs from RELISH by parsing `RELISH_v1.json` into a set of PMIDs.
2. Retrieve PMIDs from the FTP data set and structure the resulting data set.
    - Iterate through the FTP data set and check whether each entry belongs to the RELISH data set.
    - If a PMID is found, its metadata is saved in `data/RELISH/Original`.
    - Create new XML files at `data/RELISH/Formatted` that follow the format of the BioC API PubMed data.
    - Remove unnecessary HTML headings.
    - Create a TSV file at `data/RELISH/RELISH_Formatted.tsv` containing PMID, title and abstract.
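
To make the filtering idea concrete, here is a minimal sketch of the "iterate, check membership, write TSV" part of the strategy. It is not the repository's `structureDataset`; the function names are hypothetical, the PMIDs are assumed to be strings, and the element paths assume the standard PubMed baseline layout (`PubmedArticle/MedlineCitation/PMID`, `Article/ArticleTitle`, `Article/Abstract/AbstractText`). Saving the original and BioC-formatted XML is left out.

```python
import csv
import xml.etree.ElementTree as ET


def iter_matching_articles(xml_path, pmids):
    """Yield (pmid, title, abstract) for each PubmedArticle whose PMID is in pmids.

    Hypothetical helper; pmids is assumed to be a set of strings.
    """
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        pmid = elem.findtext("MedlineCitation/PMID", default="")
        if pmid in pmids:
            title_node = elem.find("MedlineCitation/Article/ArticleTitle")
            # itertext() joins the text of any inline child elements into plain text.
            title = "".join(title_node.itertext()) if title_node is not None else ""
            abstract = " ".join(
                "".join(node.itertext())
                for node in elem.iterfind("MedlineCitation/Article/Abstract/AbstractText")
            )
            yield pmid, title, abstract
        elem.clear()  # release the finished article so memory use stays flat


def write_tsv(rows, tsv_path):
    """Write the (pmid, title, abstract) rows as tab-separated values."""
    with open(tsv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)
```

Streaming with `iterparse` rather than loading each file whole is a design choice made here because baseline files are large; the actual implementation may work differently.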

Step 1: Get PMIDs from `RELISH_v1.json`. Returns the PMIDs as a set.

In [None]:
os.chdir('../')
pmidList = parseRelish('../data/RELISH/RELISH_v1.json')
pmidList = set(list(pmidList)[:10]) # Keeps only 10 PMIDs for a quick test run. To retrieve all data, delete this line and rerun the code.
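
For readers who want to see what this step might look like internally, below is a rough sketch of collecting PMIDs from a RELISH-style JSON file. It is not the repository's `parseRelish`. It assumes, without confirmation from this notebook, that the file is a JSON list whose entries carry a `pmid` field plus a `response` object mapping relevance labels to lists of PMIDs; the name `collect_pmids` is hypothetical.

```python
import json


def collect_pmids(json_path):
    """Return every PMID mentioned in a RELISH-style JSON file as a set of strings.

    Hypothetical helper; the entry layout ("pmid" and "response") is an assumption.
    """
    with open(json_path, encoding="utf-8") as handle:
        entries = json.load(handle)
    pmids = set()
    for entry in entries:
        pmids.add(str(entry["pmid"]))  # the seed article itself
        for group in entry.get("response", {}).values():
            pmids.update(str(p) for p in group)  # articles judged against the seed
    return pmids
```

Returning a set removes duplicates across entries and gives constant-time membership checks in the following step.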

Step 2: Retrieve PMIDs from the FTP data set and structure the results. Produces two sets of XML files, in `data/RELISH/Original` and `data/RELISH/Formatted`, as well as a TSV file: `data/RELISH/RELISH_Formatted.tsv`.

In [None]:
structureDataset(pmidList, 'data/RELISH/pubmed', 'data/RELISH', 'data/RELISH/RELISH_Formatted.tsv')