## ACS XML Parser
This parser extracts the text, sequences, and metadata from ACS articles. Parsing has two parts: first it processes the article xml files to extract the text and metadata, then it searches the supplementary files for genetic sequences. At the end, the parser writes the extracted data to pickle files for downstream tasks. For a more detailed specification of the input/output formats, please visit the [Github repo](https://github.com/synbioks/ACS-XML-to-text/tree/acs-parser-jiawei).

### Environment Configuration
To run this script you need to install the following packages: `lxml` for text and metadata extraction, `requests` for the sbol api queries, and `tqdm` for progress bars (`pprint`, used for debug output, ships with the Python standard library). You also need to install `pdftotext` for pdf sequence extraction (we use the pdftotext cli instead of importing it as a python package because the cli has the option we need).

In [1]:
import os
import re
import sys
import requests
import pickle
import subprocess as sp
from pprint import pprint
from lxml import etree
from tqdm.notebook import tqdm

### Input Data
The input to this script is the data dump from ACS. The data dump consists of folders, each of which contains one ACS article in xml format and one folder called suppl. The suppl folder holds all the supplementary files related to that article.

example:
```
sb000001           <- this is one folder; we refer to "sb000001" as the publication number
|-- sb000001.xml   <- this is the article file
+-- suppl          <- this folder contains all the supplementary files of this article
    |-- file_1.pdf
    +-- file_2.pdf
```

Currently there are 1597 folders like the example in the ACS data dump, and you can find those folders in `/sbksvol/data/acs-data/article-files`. We call this path the input path of the parser.
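
The folder layout above can be walked with a few lines of standard-library code. The sketch below is only an illustration: it rebuilds the example layout in a temporary directory (the folder and file names are made up) and maps each publication number to its article file and supplementary files. `list_articles` is a hypothetical helper and is not part of the parser.

```python
import os
import tempfile

def list_articles(input_path):
    """Map publication number -> (article xml path, sorted supplementary file names)."""
    articles = {}
    for pub_num in sorted(os.listdir(input_path)):
        folder = os.path.join(input_path, pub_num)
        if not os.path.isdir(folder):
            continue
        # the article file is named after the publication number
        xml_file = os.path.join(folder, pub_num + ".xml")
        suppl_dir = os.path.join(folder, "suppl")
        suppl = sorted(os.listdir(suppl_dir)) if os.path.isdir(suppl_dir) else []
        articles[pub_num] = (xml_file, suppl)
    return articles

# rebuild the example layout in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sb000001", "suppl"))
    for rel in ("sb000001/sb000001.xml",
                "sb000001/suppl/file_1.pdf",
                "sb000001/suppl/file_2.pdf"):
        open(os.path.join(tmp, *rel.split("/")), "w").close()

    layout = list_articles(tmp)
    xml_name = os.path.basename(layout["sb000001"][0])
    suppl_names = layout["sb000001"][1]

print(xml_name, suppl_names)
```

The sketch only looks at the top level of suppl; the real supplementary folders can be nested, which is why the parser searches them recursively.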

### Script Arguments
The parser takes three arguments:\
`input_path` tells the parser where to look for ACS data dump.\
`output_path` tells the parser where to store the extracted text, metadata, and genetic sequences of articles.\
`txt_path` tells the parser where to store the text of articles. 

The differences between `output_path` and `txt_path` are: `output_path` contains pickle files that hold all the information extracted from articles and supplementary files (text, metadata, genetic sequences), whereas `txt_path` contains txt files that hold only the text of articles. We keep both because plain txt files are easier to read for tasks that do not require metadata and sequence information. The text data in both paths are identical.

In [2]:
# arguments
# the directory that stores our supplementary data dump
# since the supplementary data dump also contains the original article
# we can use this as the input path
input_path = os.path.abspath("/mnt/data1/jiawei/acs-data/suppl-files/")

# the directory that stores our processed pickle files
output_path = os.path.abspath("/mnt/data1/jiawei/acs-data/processed-files/")

# we also provide extracted text in plain text format, this is the output path
# of the plain text
txt_path = os.path.abspath("/mnt/data1/jiawei/acs-data/txt-files")

### Additional Parameters
The parser also has some additional parameters:\
`valid_cache_path` points to a pickle file that stores the results of sbol api queries. This saves time on repeated runs and debugging, because api queries are very slow over the network.\
`non_acs_article_path` points to a txt file that lists publication numbers. Articles with these publication numbers are non-research articles.

**Important**: `non_acs_article_path` is deprecated because the list is not accurate.

In [3]:
# local file parameters; don't change these unless you know what you are doing

# sbol api query cache, use to speed up the process time
valid_cache_path = os.path.abspath("./sbol-validation-cache.pkl")

# deprecated, the list is inaccurate (false negative, non research article not in the list)
non_acs_article_path = os.path.abspath("./non-acs-article.txt")

### SBOL API
The [sbol api](http://synbiodex.github.io/SBOL-Validator/#introduction) is an online service provided by the [Myers Research Group at University of Utah](https://async.ece.utah.edu/tools/sbol-validatorconverter/). It takes any file, checks whether the file is a *valid sequence file* (meaning a file that contains sequence information and adheres to one of the popular computer-readable formats), and converts the sequence file into one of four formats (fasta, genbank, sbol, sbol2). We use this api in two ways: to check whether a supplementary file is a valid sequence file, and to convert a valid sequence file in any format to [fasta format](https://en.wikipedia.org/wiki/FASTA_format).

By inspecting the supplementary files, we found that files with extensions such as ".gb", ".sbol", ".fasta", and ".dna" are valid sequence files, that ".txt" supplementary files can be genbank files, and that ".xml" supplementary files can be sbol files.

In [4]:
# sbol api request params
sbol_validator_url = "https://validator.sbolstandard.org/validate/"

# file extensions that may belong to valid sequence files
sbol_allowed_file_type = set([
    "gb", "fasta", "sbol", "txt", "xml", "dna"
])
validator_param = {
    'options': {
        'language' : "FASTA",
        'test_equality': False,
        'check_uri_compliance': False,
        'check_completeness': False,
        'check_best_practices': False,
        'fail_on_first_error': True,
        'provide_detailed_stack_trace': False,
        'subset_uri': '',
        'uri_prefix': 'dummy', # we need to have something here or the api will return an error
        'version': '',
        'insert_type': False
    },
    "main_file": None,
    "return_file": True
}

## Part 1: Text and Metadata Extraction

### Part 1.1: Helper Functions for Extraction

**Note**: To better understand the code below, I strongly suggest first reading the [lxml documentation](https://lxml.de/tutorial.html) and the [XPath 1.0 documentation](https://www.w3.org/TR/1999/REC-xpath-19991116/), and keeping an article xml file open on the side while going through this section (I used sb9b00030.xml as a reference when writing this documentation).

The following code cells define helper functions for the text and metadata extraction process. Currently, the parser has the following features:

- **Extract Text:** The text of an article is wrapped in `<p>` and `<title>` tags inside the xml file under the xpath `/article/body`. We use `extract_body()` to find those tags. An example is shown below:
```
<p>
    Spontaneous and random back and forth switching between states 
    is certainly not suitable for processes requiring irreversible
    cell fate determination, such as development and cell 
    differentiation. In these contexts, the intracellular noise 
    would need to be low enough to avoid spontaneous state 
    switching. To identify possible ratio control strategies in 
    such a low-noise environment, we transitioned experimentally 
    to a less-noisy system: the chromosomally integrated mutual 
    inhibition toggle in
    <italic toggle="yes">S. cerevisiae</italic>
    .
    <xref rid="ref25" ref-type="bibr"/>
    This circuit has the same topology as that shown in
    <xref rid="fig1" ref-type="fig">
    Figure
    <xref rid="fig1" ref-type="fig"/>
    </xref>
    a, exhibits hysteretic behavior (
    <xref rid="fig2" ref-type="fig">
    Figure
    <xref rid="fig2" ref-type="fig"/>
    </xref>
    a), and favors the TetR-dominant, low-GFP state under no induction. 
    However, differences in promoters, copy number, and transcription
    –translation processes between
    <italic toggle="yes">E. coli</italic>
    and yeast serve to reduce intracellular noise and shift the 
    bistable region up to roughly 3–17 ng/mL ATc. Unlike in
    <italic toggle="yes">E. coli</italic>
    , the bulk of the bistable range was impervious to the effects 
    of intracellular noise, resulting in a single peak homogeneous 
    expression profile even when the system is operating within the 
    bistable region.
</p>
```
In general the text contains formatting tags, embedded references, and captions. Our goal is to remove these and extract the plain text. To do this, we implement `p_helper()` to retrieve the plain text in fragments, join them into one line, and fix punctuation/parenthesis placement. We call this one line of sanitized text returned by `p_helper()` a *paragraph*. `title_helper()` is used to categorize a title; the plain text of the title itself is retrieved in `extract_body()`. In addition to extracting paragraphs and titles, we also use the titles to label which section (intro, method, results, discussion, and so on) the paragraphs are in. The logic is coded in `extract_body()`: if we see a paragraph following a title containing "method", we mark the paragraph as part of the "method" section.


- **Extract Abstract:** Abstract is found in tag `<abstract>`. It is wrapped in a `<p>` tag similar to a paragraph in the body. We use `extract_abstract()` to locate the tag and `p_helper()` to extract the plain text.


- **Extract Keywords:** Keywords can be found in tag `<kwd-group>`. `extract_keywords()` and `kwd_helper()` are used to locate and retrieve the keywords in the article.


- **Extract Publication Dates:** There are two types of dates in the metadata: the electronic publication date and the issue publication date. Both are stored in `<pub-date>` tags, and you can tell them apart by looking at the attributes of the tags. `extract_date()` handles date extraction.


- **Extract History of Events:** Stored in `<history>`, the history of events is a list of events the article has been through. Each entry in the list contains the name and the date of the event. `extract_history()` handles history extraction.


- **Extract Article Type:** Located in `<subj-group>`, it indicates the type of the article, and we currently use it to determine whether an article is a research article. `extract_article_type()` handles this process.


- **Extract id and internal id:** Overall, we can find three different ids that uniquely identify an article: 1) the publication number (mentioned in the "Input Data" section above), 2) the id in `<article-id>`, and 3) the internal id in the attribute of `<article>`. We believe 3) is an id used internally by ACS because we cannot find any publicly available information using this id.
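
To make the paragraph flattening described under **Extract Text** concrete, here is a minimal sketch on a shortened version of the example paragraph. It uses the standard-library `xml.etree.ElementTree` as a stand-in for `lxml` so that it runs without extra packages; the two libraries share the `.text`/`.tail` model but are not identical. `flatten_paragraph` and `SNIPPET` are illustrative names, and the function only mirrors the idea behind `p_helper()` rather than reproducing it exactly.

```python
import re
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml in this sketch

# shortened, hand-written variant of the example paragraph above
SNIPPET = (
    '<p>inhibition toggle in\n'
    '<italic>S. cerevisiae</italic>\n'
    '.\n'
    '<xref rid="ref25"/>\n'
    'This circuit exhibits hysteretic behavior (\n'
    '<xref rid="fig2">Figure</xref>\n'
    'a).</p>'
)

SKIPPED_TAGS = ("named-content", "inline-formula")

def flatten_paragraph(node):
    # text before the first child, then for every child: its inner text and its tail
    fragments = [node.text or ""]
    for child in node:
        if child.tag not in SKIPPED_TAGS:
            fragments.append(" ".join(child.itertext()))
        fragments.append(child.tail or "")

    # one line, single spaces
    line = re.sub(r"\s+", " ", " ".join(fragments)).strip()

    # remove the space before closing punctuation and after an opening parenthesis
    line = re.sub(r"\s([.,\):;])", r"\1", line)
    line = re.sub(r"\(\s", "(", line)
    return line

paragraph = flatten_paragraph(ET.fromstring(SNIPPET))
print(paragraph)
```

The empty `<xref/>` contributes no text, so only its tail survives, and the stray spaces around "." and "(" are removed by the punctuation fixes.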

In [5]:
# function to collect matching files and dirs
# basically a fancy find method
# pattern is a re pattern
# collect_dirs is a bool, when set to true it will collect directories/folders
# return a list of files/directories that matches the pattern
def collect_files(root, res, pattern="", collect_dirs=True, min_depth=None, max_depth=None):
    
    # check max depth
    if max_depth is not None and max_depth == 0:
        return
    
    # go through all item in the dir
    for item in os.listdir(root):
        
        # process item
        item_path = os.path.join(root, item)
        item_is_dir = os.path.isdir(item_path)
        
        # put valid file in res if min depth has reached
        if min_depth is None or min_depth - 1 <= 0:
            if re.match(pattern, item_path):
                if not item_is_dir or collect_dirs:
                    res.append(item_path)
        
        # recursively collect all files
        if item_is_dir:
            next_min_depth = None if min_depth is None else min_depth - 1
            next_max_depth = None if max_depth is None else max_depth - 1
            collect_files(item_path, res, pattern, collect_dirs, next_min_depth, next_max_depth)

In [6]:
# helps to extract text from paragraph
def p_helper(node):
    
    # <p/> does not have text
    if node.text is None:
        return ""
    
    # each paragraph is put into a line
    line_list = [node.text]
    for child in node:

        # get the text inside the child if the tag isn't 
        # named-content or inline-formula (those are mathematical formulas)
        # and the text following the child
        if child.tag not in ("named-content", "inline-formula"):
            line_list.append(" ".join(child.xpath(".//text()")))
        line_list.append(child.tail)

    # child.tail is None when nothing follows the child, so drop those entries
    line_list = [fragment for fragment in line_list if fragment is not None]
        
    # re dark magic below
    # remove new line and spaces
    line = " ".join(line_list)
    line = line.strip()
    line = line.replace("\n", " ")

    # clean up consecutive spaces
    line = re.sub(r"\s+", " ", line)

    # fix the space around punctuation
    line = re.sub(r"\s([.,\):;])", r"\1", line)
    line = re.sub(r"\(\s", r"(", line)
    line = re.sub(r"\s*([-/])\s*", r"\1", line)
    return line

In [7]:
# strip format from keyword nodes
def kwd_helper(node):
    
    # return a keyword string
    kwd_tokens = node.xpath(".//text()")
    kwd = " ".join(kwd_tokens).replace("\n", " ").strip()
    kwd = re.sub(r"\s+", " ", kwd)
    return kwd

In [8]:
# this returns the section labels matched by interesting titles
# for example: intro, method, and results
# returns an empty list for non interesting titles
def title_helper(node):
    
    # extract text from title node
    title = " ".join(node.xpath(".//text()"))
    title = title.replace("\n", " ")
    title = re.sub(r"\s+", " ", title)
    title = title.strip()
    title = title.lower()
    
    # categorize title
    res = []
    if "intro" in title:
        res.append("introduction")
    if "result" in title:
        res.append("result")
    if "discuss" in title:
        res.append("discussion")
    if "material" in title:
        res.append("materials")
    if "method" in title or "procedure" in title:
        res.append("method")
    if "summary" in title:
        res.append("summary")
    return res

In [9]:
# extract text from xpath nodes
def extract_body(root):
    
    # we are interested in the text in the body section
    curr_title = []
    text = []
    text_nodes = root.xpath("/article/body//*[self::p or (self::title and not(ancestor::caption))]")
    for text_node in text_nodes:
        
        # handle title
        if text_node.tag == "title":
            tmp_title = title_helper(text_node)
            if len(tmp_title) > 0:
                curr_title = tmp_title
            title = " ".join(text_node.xpath(".//text()"))
            title = title.replace("\n", " ")
            title = re.sub(r"\s+", " ", title)
            title = title.strip()
            text.append({
                "text": title,
                "section": curr_title
            })
        
        # handle paragraph
        elif text_node.tag == "p":
            text.append({
                "text": p_helper(text_node),
                "section": curr_title
            })
    return text

In [10]:
# extract abstract
def extract_abstract(root):
    
    # get the abstract paragraph
    abstract = []
    abstract_nodes = root.xpath("//abstract/p")
    if abstract_nodes:
        abstract.append(p_helper(abstract_nodes[0]))
    return abstract

In [11]:
# extract_keywords from meta data
def extract_keywords(root):
    
    # get the keywords
    keywords = []
    kwd_nodes = root.xpath("//kwd-group/kwd")
    for kwd_node in kwd_nodes:
        keywords.append(kwd_helper(kwd_node))
    return keywords

In [12]:
# extract date information from meta data
def extract_date(root):
    
    issue_pub_date = None
    electron_pub_date = None
    
    # traverse to the date nodes
    date_nodes = root.xpath("/article/front/article-meta/pub-date")
    
    # get the time
    for node in date_nodes:
        year = node.xpath("./year")[0].text.strip()
        month = node.xpath("./month")[0].text.strip()
        day = node.xpath("./day")[0].text.strip()

        if "date-type" in node.attrib and node.attrib["date-type"] == "issue-pub":
            issue_pub_date = "%s/%s/%s" % (month, day, year)
        else:
            electron_pub_date = "%s/%s/%s" % (month, day, year)
    
    return issue_pub_date, electron_pub_date

In [13]:
# extract article-id from meta data
def extract_id(root):
    
    id_node = root.xpath("/article/front/article-meta/article-id")[0]
    article_id = id_node.text.strip()
    return article_id

In [14]:
# extract internal id, the one in the first line of the xml file
def extract_internal_id(root):
    
    id_node = root.xpath("/article")[0]
    return id_node.attrib["id"]

In [15]:
# extract a list of history from the meta data
def extract_history(root):
    
    res = []
    dates = root.xpath("/article/front/article-meta/history/date")
    for date in dates:
        year = date.xpath("./year")[0].text.strip()
        month = date.xpath("./month")[0].text.strip()
        day = date.xpath("./day")[0].text.strip()
        res.append({
            "event": date.attrib["date-type"],
            "time": "%s/%s/%s" % (month, day, year)
        })
    return res

In [16]:
# extract the subject of the article from meta data
def extract_article_type(root):
    article_type = root.xpath("/article/front/article-meta/article-categories/subj-group")
    if len(article_type) > 1:
        print("article has 2 or more types")
    raw = article_type[0].xpath(".//text()")
    # clean up
    res = ''.join(raw).strip()
    res = re.sub("\n", "", res)
    return res

### Part 1.2: Extraction Procedure

1. **Collect xml Files:** `collect_files()` is used to find all the article xml files. Notice that we only collect xml files that are one directory deep, because the suppl folder can also contain xml files (for example, supplementary sbol files) that are not articles.
2. **Extract Information:** We open each xml file and use the helper functions above to extract the information we want.

All information is stored in a dictionary called `processed_files`. It maps publication numbers to the information extracted from articles. We refer to the values of this dictionary as *article info*. For example, `processed_files["sb9b00030"]` returns the article info retrieved from sb9b00030.xml.
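
As a rough illustration of how an article info can be queried, the sketch below builds a tiny stand-in for `processed_files` and pulls out every body entry labeled with a given section. The entry contents are invented, and `texts_in_section` is a hypothetical helper; only the `"body"`, `"text"`, and `"section"` keys follow the structure produced by `extract_body()`.

```python
# a tiny stand-in for processed_files; the entry below is invented for illustration
example_files = {
    "sb000001": {
        "is_research": True,
        "type": "Research Article",
        "body": [
            {"text": "Introduction", "section": ["introduction"]},
            {"text": "Toggle switches are bistable.", "section": ["introduction"]},
            {"text": "Materials and Methods", "section": ["materials", "method"]},
            {"text": "Cells were grown overnight.", "section": ["materials", "method"]},
        ],
    }
}

def texts_in_section(article_info, section):
    """Return the text of every body entry (titles and paragraphs) labeled with `section`."""
    return [item["text"] for item in article_info["body"] if section in item["section"]]

method_texts = texts_in_section(example_files["sb000001"], "method")
print(method_texts)
```

Titles are stored in `"body"` alongside paragraphs, so a section query returns the section heading as well as the paragraphs under it.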

In [17]:
# collect all xml files
xml_paths = []
collect_files(input_path, xml_paths, pattern=r".*\.xml$", collect_dirs=False, min_depth=1, max_depth=2)
print("total xml files: %d" % len(xml_paths))

total xml files: 1597


In [18]:
# deprecated: get a list of non research article
non_acs_article_list = set()
with open(non_acs_article_path, "r") as f:
    for line in f:
        non_acs_article_list.add(line.strip())

In [19]:
# parse the files
# after this block, the information will be extracted to var processed_files
# processed_files: key(pub_num) -> value(article_info)
processed_files = {}
for xml_path in tqdm(xml_paths):

    # print("\nparsing %s" % xml_path)
    pub_num = xml_path.split("/")[-1].split(".")[0]

    # get the root of the xml
    root = etree.parse(xml_path).getroot()
    
    # get the pub date
    issue_pub_date, electron_pub_date = extract_date(root)
    
    # get the article type
    article_type = extract_article_type(root)

    # create a dictionary holding the extracted data
    xml_data = {
        "is_research": "research" in article_type.lower(),
        "keywords": extract_keywords(root),
        "abstract": extract_abstract(root),
        "body": extract_body(root),
        "issue_pub_date": issue_pub_date,
        "electron_pub_date": electron_pub_date,
        "article_id": extract_id(root),
        "internal_id": extract_internal_id(root),
        "history": extract_history(root),
        "type": article_type
    }

    # save the data
    processed_files[pub_num] = xml_data


In [20]:
# print one sample for inspection
pprint(processed_files["sb9b00030"])

{'abstract': ['Robust and precise ratio control of heterogeneous phenotypes '
              'within an isogenic population is an essential task, especially '
              'in the development and differentiation of a large number of '
              'cells such as bacteria, sensory receptors, and blood cells. '
              'However, the mechanisms of such ratio control are poorly '
              'understood. Here, we employ experimental and mathematical '
              'techniques to understand the combined effects of signal '
              'induction and gene expression stochasticity on phenotypic '
              'multimodality. We identify two strategies to control phenotypic '
              'ratios from an initially homogeneous population, suitable '
              'roughly to high-noise and low-noise intracellular environments, '
              'and we show that both can be used to generate precise '
              'fractional differentiation. In noisy gene expression contexts, '
   

In [21]:
# show some stats
type_count = {}
event_count = {}
for k, article_info in processed_files.items():
    article_type = article_info["type"]
    if not article_type in type_count:
        type_count[article_type] = 0
    type_count[article_type] += 1
    
    events = article_info["history"]
    for e in events:
        if not e["event"] in event_count:
            event_count[e["event"]] = 0
        event_count[e["event"]] += 1
pprint(type_count)
print()
pprint(event_count)

{'Correction': 4,
 'Editorial': 27,
 'In This Issue': 69,
 'Introducing Our Authors': 66,
 'Letter': 368,
 'Research Article': 907,
 'Review': 27,
 'Technical Note': 82,
 'Tutorial': 4,
 'Viewpoint': 43}

{'accepted': 1,
 'asap': 1398,
 'issue-pub': 483,
 'just-accepted': 1285,
 'received': 1597}


## Part 2: Genetic Sequences Extraction

In this part we go through all the supplementary files in the data dump and recover genetic sequences from them. This requires handling different file types and using the sbol api. Once we have extracted the genetic sequences, we store them in fasta format in the `"sequences"` field of the matching entry in `processed_files[pub_num]["suppl_files"]` for later use.
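
Because the sequences are stored as one fasta formatted string per supplementary file, downstream code needs to split that string back into records. The following is a minimal sketch of such a reader, not part of the parser: `parse_fasta` is a hypothetical helper and the sample string is made up.

```python
def parse_fasta(fasta_text):
    """Split a fasta formatted string into (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # a new header closes the previous record
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # sequence lines of one record are concatenated
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

sample = ">id_0\nATCGATCG\nGGCC\n>id_1\nTTAA\n"
records = parse_fasta(sample)
print(records)
```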

### Part 2.1: Unzipping

Some supplementary files are compressed into zip files, so we unzip them before doing anything else. Because this only needs to be done once, the code below is commented out.

In [None]:
# find out all the zip files and unzip them
# do this once when there is new data
# suppl_files_zip = []
# collect_files(input_path, suppl_files_zip, pattern=".*\.zip$", collect_dirs=False, min_depth=3)
# print("zip files: %d" % len(suppl_files_zip))
# for zip_file in suppl_files_zip:
#     zip_file_dir = re.sub("/[^/]*$", "", zip_file)
#     res = sp.run(["unzip", "-n", zip_file, "-d", zip_file_dir])
#     if res.returncode != 0:
#         print(res)

### Part 2.2: Finding Paths to All Supplementary Files

Unlike the locations of article xml files, the structure of the suppl folder is unpredictable. Therefore we need to recursively find all the files in the suppl folder. We store the absolute paths of all supplementary files in a list called `suppl_files_all`. Once we have the list, we put this information into `processed_files` to achieve two goals: 1) to query supplementary files using the publication number, and 2) to locate supplementary files relative to `input_path` using the publication number.
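
The two goals boil down to splitting each absolute path into a publication number and a path relative to `input_path`. One possible way to do this is sketched below with `os.path.relpath`; this is an alternative formulation for illustration, not the approach used in the cells that follow, and the paths and the `split_suppl_path` helper are hypothetical.

```python
import os

def split_suppl_path(input_path, suppl_path):
    """Return (publication number, path relative to input_path, file name)."""
    rpath = os.path.relpath(suppl_path, input_path)
    # the first component below input_path is the publication folder
    pub_num = rpath.split(os.sep)[0]
    return pub_num, rpath, os.path.basename(suppl_path)

# made-up paths for illustration
example_input = os.path.join(os.sep, "data", "article-files")
example_file = os.path.join(example_input, "sb000001", "suppl", "nested", "file_1.pdf")

pub_num, rpath, filename = split_suppl_path(example_input, example_file)
print(pub_num, rpath, filename)
```

Joining `input_path` with the stored relative path recovers the absolute location, which is how the later cells reopen each supplementary file.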

In [None]:
# collect all suppl files
suppl_files_all = []
collect_files(input_path, suppl_files_all, pattern="", collect_dirs=False, min_depth=3)
suppl_files_all = [x for x in suppl_files_all if not re.match(".*__MACOSX.*", x)]
print("suppl files: %d" % len(suppl_files_all))

In [None]:
# attach all the suppl path to processed files
# clear the suppl list of each article first to make the following code idempotent
for article_info in processed_files.values():
    article_info["suppl_files"] = []
    
for path in suppl_files_all:
    path = path.split("/")
    
    # get basic attrib
    suppl_filename = path[-1]
    suppl_dir = input_path.split("/")[-1]
    suppl_dir_idx = 0
    for i, item in enumerate(path):
        if item == suppl_dir:
            suppl_dir_idx = i
            break
    else:
        assert False # we should not be here
    pub_num = path[suppl_dir_idx + 1]
    rpath = os.path.join(*path[suppl_dir_idx + 1:]) # relative path
    
    # create info dict
    suppl_info = {
        "suppl_filename": suppl_filename,
        "rpath": rpath,
        "sequences": None
    }
    
    # push it into the processed files dict
    if pub_num in processed_files:
        processed_files[pub_num]["suppl_files"].append(suppl_info)

### Part 2.3: SBOL API Validation

In this part, we use the sbol api to find all the valid sequence files and extract the sequences in fasta format. The sbol api has some limitations: 1) the response time over the network is long, and 2) the api only accepts files up to 64MB. To work around these, we currently ignore large files and maintain a local cache to improve performance.
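
The combination of a size limit and a local cache can be sketched independently of the network. In the example below, `fake_validate` is a stand-in for the real api call and simply records how often it is invoked; `cached_validate`, `MAX_API_BYTES`, and the file names are illustrative assumptions rather than names from the parser.

```python
MAX_API_BYTES = 64 * 2 ** 20  # the 64MB limit mentioned above

def cached_validate(key, size_in_bytes, validate, cache):
    """Skip oversized files, otherwise call `validate` at most once per key."""
    if size_in_bytes >= MAX_API_BYTES:
        return False, None
    if key not in cache:
        cache[key] = validate(key)
    return cache[key]

calls = []

def fake_validate(key):
    # stand-in for the slow network request; returns (is_valid, response data)
    calls.append(key)
    return True, {"result": ">seq\nATCG\n"}

cache = {}
first = cached_validate("sb000001/suppl/a.gb", 1024, fake_validate, cache)
second = cached_validate("sb000001/suppl/a.gb", 1024, fake_validate, cache)
too_big = cached_validate("sb000001/suppl/b.gb", MAX_API_BYTES, fake_validate, cache)
print(len(calls), too_big)
```

Keying the cache on the relative path keeps it valid when the data dump is moved to a different `input_path`.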

In [None]:
# get supplementary files by extension
def filter_suppl_file_by_ext(all_articles, allowed_ext):
    suppl_files_to_check = []
    for article_info in all_articles.values():
        for suppl_info in article_info["suppl_files"]:
            # only allow the following extension
            ext = suppl_info["suppl_filename"].split(".")[-1]
            if ext in allowed_ext:
                suppl_files_to_check.append((suppl_info, article_info))
    return suppl_files_to_check

In [None]:
# use sbol api to validate sequence files
def validate_sequence(file):
    
    # restrict file size to be less than 64mb
    # this is an api restriction
    file_size = os.path.getsize(file)
    if file_size >= 64 * 2 ** 20:
        return False, None
    
    # try to read the content
    try:
        with open(file, "r", encoding="utf-8") as f:
            content = f.read()
    except UnicodeDecodeError:
        return False, None
    
    # validate file
    validator_param["main_file"] = content
    res = requests.post(sbol_validator_url, json=validator_param).json()
    return res["valid"], res

In [None]:
# use the api to check if the file is a sequence file
# we will cache the request result to reduce server load
valid_res_cache = None
if os.path.exists(valid_cache_path):
    with open(valid_cache_path, "rb") as f:
        valid_res_cache = pickle.load(f)
else:
    valid_res_cache = {}

# create a list of all the suppl files that we need to check
sbol_suppl_files_to_check = filter_suppl_file_by_ext(processed_files, sbol_allowed_file_type)

for suppl_info, article_info in tqdm(sbol_suppl_files_to_check):

    # use the api or the cache to convert the sequence file and store it in fasta format
    path = os.path.join(input_path, suppl_info["rpath"])
    if not suppl_info["rpath"] in valid_res_cache:
        is_valid, data = validate_sequence(path)
        valid_res_cache[suppl_info["rpath"]] = (is_valid, data)
    else:
        is_valid, data = valid_res_cache[suppl_info["rpath"]]
    suppl_info["sequences"] = data["result"] if is_valid else None

# save cache
with open(valid_cache_path, "wb") as f:
    pickle.dump(valid_res_cache, f)

### Part 2.4: Extract Sequences from PDF

Most of the supplementary files are not valid sequence files, but many of them are pdf files that contain sequences. In this part we use pdftotext to convert each pdf to plain text first, and then salvage sequences from the plain text.

When a sequence in a pdf is converted to plain text, two things can happen: 1) the sequence is split across multiple lines, or 2) the sequence remains on one line. The script extracts the multiline sequences first, replaces them with a placeholder, and then extracts the single-line sequences.
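
The two-pass idea can be tried on an in-memory string without running pdftotext. The sketch below applies the same two patterns to a made-up sample; `find_sequences` is an illustrative name, and the patterns are written with a non-capturing group and `finditer`, which is assumed to be equivalent to the capturing form used in the parser.

```python
import re

# a run of at least 3 bases followed by one or more lines that are also runs of bases
MULTI_LINE = re.compile(r"[ATCG]{3,}(?:\n[ATCG]{3,})+")
# a run of at least 10 bases on a single line
SINGLE_LINE = re.compile(r"[ATCG]{10,}")

def find_sequences(text):
    text = text.upper()

    # pass 1: sequences that were split across lines, rejoined into one string
    seqs = [m.group(0).replace("\n", "") for m in MULTI_LINE.finditer(text)]

    # replace them with a placeholder so pass 2 does not count them again
    text = MULTI_LINE.sub(" ###REMOVED### ", text)

    # pass 2: sequences that stayed on one line
    seqs.extend(SINGLE_LINE.findall(text))
    return seqs

sample = "Primer 1\nATCGATCGAT\nGGCCTTAAGG\nend\nPrimer 2: ttaaccggttaacc done"
found = find_sequences(sample)
print(found)
```

Because the text is upper-cased first, ordinary words made only of the letters A, T, C, and G can also match, so the extracted strings are candidates rather than confirmed sequences.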

In [None]:
# finds any strings that looks like a sequence in pdf
# need to have pdftotext installed
def extract_sequence_from_pdf(path):
    sp.run(["pdftotext", "-raw", path, "/tmp/sbks-pdf-tmp.txt"]) # use tmp dir maybe it is ram so faster?
    seqs = []
    with open("/tmp/sbks-pdf-tmp.txt", "r") as file:
        lines = file.readlines()
        text = "".join(lines).upper()
        
        # first find all the multiline sequences
        for multi_line_seq in re.findall("([ATCG]{3,}(\n[ATCG]{3,}){1,})", text):
            seqs.append(re.sub("\n", "", multi_line_seq[0]))
            
        # remove extracted sequences to avoid double counting
        text = re.sub("([ATCG]{3,}(\n[ATCG]{3,}){1,})", " ###REMOVED### ", text)
        
        # then find all the one line sequences
        for single_line_seq in re.findall("[ATCG]{10,}", text):
            seqs.append(single_line_seq)
            
#     sp.run(["rm", "tmp.txt"])
    return seqs

In [None]:
# extract sequences from pdf
# create a list of all the suppl files that we need to check
pdf_suppl_files_to_check = filter_suppl_file_by_ext(processed_files, set(["pdf"]))

for suppl_info, article_info in tqdm(pdf_suppl_files_to_check):

    # use pdftotext to extract the sequences from pdf
    path = os.path.join(input_path, suppl_info["rpath"])
    sequences = extract_sequence_from_pdf(path)
    if sequences:
        res = []
        for i, s in enumerate(sequences):
            res.append(f">{article_info['internal_id']}_{suppl_info['suppl_filename']}_{i}\n")
            res.append(s + "\n")
        suppl_info["sequences"] = "".join(res)

### Part 2.5: Writing Results

There are three things written to the disk:
- **Processed Files:** Each entry is saved into one pickle file named after the publication number. The pickle files are saved in `output_path`.
- **Plain Text:** The text data of each article is saved into one txt file named after the publication number. The txt files are saved in `txt_path`, under a `research` or `non-research` subfolder.
- **Sequences:** The sequences extracted from each article are saved into one fasta formatted txt file. They are saved in the `./sequence-files` folder.

In [None]:
# write the sequence files
for pub_num, article_info in processed_files.items():
    seq_to_write = []
    for suppl_info in article_info["suppl_files"]:
        if not suppl_info["sequences"] is None:
            seq_to_write.append(suppl_info["sequences"])
    if len(seq_to_write) > 0:
        with open(os.path.join("sequence-files", pub_num + "_" + article_info["internal_id"] + ".seq.txt"), "w") as outfile:
            for seq in seq_to_write:
                outfile.write(seq)

In [None]:
# pickle the processed files
for pub_num, data in tqdm(processed_files.items()):
    with open(os.path.join(output_path, pub_num + ".pkl"), "wb") as out:
        pickle.dump(data, out)
    text_data = [d["text"] + "\n" for d in data["body"]]
    
    # save plain text
    if data["is_research"]:
        text_data_path = os.path.join(txt_path, "research")
    else:
        text_data_path = os.path.join(txt_path, "non-research")
    # make sure the target folder exists before writing
    os.makedirs(text_data_path, exist_ok=True)
    with open(os.path.join(text_data_path, pub_num + ".txt"), "w") as out:
        out.writelines(text_data)

In [22]:
# check one pickle
with open(os.path.join(output_path, "sb9b00030.pkl"), "rb") as ifile:
    article_info = pickle.load(ifile)
    pprint(article_info.keys())
    pprint(article_info["type"])
    pprint(article_info["is_research"])

dict_keys(['is_research', 'keywords', 'abstract', 'body', 'issue_pub_date', 'electron_pub_date', 'article_id', 'internal_id', 'history', 'type', 'suppl_files'])
'Research Article'
True
