In [1]:
import re

# Finding Evidence Workshop

## DIY Text Mining Session

In this session, we learn how dictionary-based text mining finds entities in publications. First, we define a simple dictionary and search for its terms in a sample text. As problems arise, we improve the dictionary step by step.

## 1. Simple dictionary-based text mining

Dictionary-based text mining usually uses regular expressions to capture text patterns. A regular expression is a small language for specifying text search strings. Regular expressions are particularly useful when we have a pattern to search for and a corpus of texts to search through: a regular expression search function goes through the corpus and returns every piece of text that matches the pattern. Below are some examples of regular expressions.

<img src="img/regex_examples.png" width="500"/>
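
As a minimal sketch of how such patterns behave in Python's `re` module, the cell below runs three small searches. The sentence `demo_text` and the three patterns are illustrative assumptions; they are not taken from the figure or from the workshop data.

```python
import re

# Illustrative sentence (an assumption, not part of the workshop data).
demo_text = "Asthma and cancer were studied in 2019; Cancer cases rose by 12%."

# A literal pattern finds the exact lowercase string only.
literal_hits = re.findall(r"cancer", demo_text)

# A character class accepts either an upper- or lowercase first letter.
either_case_hits = re.findall(r"[Cc]ancer", demo_text)

# \d+ matches one or more consecutive digits.
number_hits = re.findall(r"\d+", demo_text)

print(literal_hits)      # ['cancer']
print(either_case_hits)  # ['cancer', 'Cancer']
print(number_hits)       # ['2019', '12']
```

`re.findall` returns the matched strings only; the rest of this session uses `finditer` instead because it also reports where each match sits in the text.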

By describing the text pattern, we can extract wanted texts from the corpus. Therefore, we need a dictionary to define the pattern. Let's have a look at a simple dictionary below.

##### A dictionary is a collection of terms.

In [2]:
sample_dictionary = {'cancer', 'diabete', 'asthma'}

##### We use a dictionary to find out whether a term appears in a given sample text.

In [3]:
sample_text = 'Cancer, diabete and asthma are diseases. Retrospective analysis carried out with NCBI GEO "Analyze Dataset" tool of the study GDS1695. Thanks for support from Cancer Research Organization.'

##### We need a piece of magic to find the terms in a text. This piece of code goes through the sample text and checks whether each word in the dictionary actually appears in it.

If it does appear in the sample text, then keep a record of its location in the sample text. 

In [4]:
def print_evidences(evidences, text):
    char_lens = 20
    for evidence in evidences:
        start,end = evidence.span(0)
        pre_start = max(0, start-char_lens)
        pre = text[pre_start:start]
        post = text[end:end+char_lens]
        print(f"one evidence found: '{evidence.group(0)}', with pre: '{pre}' and post: '{post}'")
    
def find_evidence_sample(dictionary, text):
    """
    given a dictionary and a piece of text, find the evidence from the text
    """
    evidences = []
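    for d in dictionary:
        # re.escape makes the term match literally, even if it contains regex metacharacters
        pattern = re.compile(re.escape(d))
        # finditer yields one match object per occurrence (none if the term is absent)
        for m in pattern.finditer(text):
            evidences.append(m)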
    print_evidences(evidences, text)
    

##### Let's run the code and see what happens. Note that we also record a few characters before and after each match, i.e. `pre` and `post`.

<img src="img/match.png" width="500"/>

In [5]:
find_evidence_sample(sample_dictionary, sample_text)

one evidence found: 'diabete', with pre: 'Cancer, ' and post: ' and asthma are dise'
one evidence found: 'asthma', with pre: 'Cancer, diabete and ' and post: ' are diseases. Retro'


##### We successfully found 'diabete' and 'asthma' in the text. However, "Cancer" with a capital 'C' is missing because our dictionary only contains the lowercase "cancer". The search is case-sensitive.

It seems we need to either lowercase the sample text, add another word 'Cancer' to the dictionary, or tell the regular expression to ignore case. Let's convert the sample text to lowercase (simply use `sample_text.lower()`) and try again.

In [6]:
find_evidence_sample(sample_dictionary, sample_text.lower())

one evidence found: 'cancer', with pre: '' and post: ', diabete and asthma'
one evidence found: 'cancer', with pre: 'ks for support from ' and post: ' research organizati'
one evidence found: 'diabete', with pre: 'cancer, ' and post: ' and asthma are dise'
one evidence found: 'asthma', with pre: 'cancer, diabete and ' and post: ' are diseases. retro'
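
The third option mentioned above, telling the regular expression to ignore case, can be sketched with the `re.IGNORECASE` flag. The helper name `find_terms_ignore_case` and the shortened `demo_text` are assumptions made for this illustration; the workshop functions themselves are unchanged.

```python
import re

# Shortened version of the sample text, for illustration only.
demo_text = ('Cancer, diabete and asthma are diseases. '
             'Thanks for support from Cancer Research Organization.')
demo_terms = {'cancer', 'diabete', 'asthma'}

def find_terms_ignore_case(terms, text):
    """Return (matched_text, start, end) tuples, sorted by position."""
    hits = []
    for term in terms:
        # re.IGNORECASE lets 'cancer' match 'Cancer', 'CANCER', ...
        pattern = re.compile(re.escape(term), flags=re.IGNORECASE)
        for m in pattern.finditer(text):
            hits.append((m.group(0), m.start(), m.end()))
    return sorted(hits, key=lambda hit: hit[1])

demo_hits = find_terms_ignore_case(demo_terms, demo_text)
for matched, start, end in demo_hits:
    print(matched, start, end)
```

Unlike lowercasing the whole text, this keeps the original casing in the reported match and in the surrounding context, which can help when the evidence is read by a person.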


## 2. Improved version

In the section above, every "cancer" in the sample text is identified. However, we don't want the "Cancer" in "Cancer Research Organization" to be identified. What can we do?

A simple strategy is to validate the match by looking at the surrounding words. For example, if we find 'organization' after 'cancer', we discard the match.

<img src="img/match.png" width="500"/>

This is a heuristic approach and requires some domain knowledge to make the constraints as accurate as possible. Let's improve our dictionary by looking at the words after the match.

In [7]:
sample_dictionary_refined = {'cancer': {'exclude': {'post':{'organization'}}}, 'diabete':{}, 'asthma':{}}

In the refined dictionary, we added `exclude` to `cancer` for validation. `post` means look at the context after the match and `pre` means look at the context before the match. 

In this case, if 'organization' appears in the context after, we invalidate and ignore the match.

In [8]:
def check_exclude(dictionary, context, direction):
    """
    Given the exclude rules of a term and a context, return True if
    any exclude word for the given direction appears in the context
    (i.e. the match should be discarded)
    """
    if direction not in ('pre', 'post'):
        raise ValueError(f"direction must be 'pre' or 'post' but got '{direction}'")
    for exclude in dictionary[direction]:
        exclude_pattern = re.compile(re.escape(exclude))
        if exclude_pattern.search(context):
            return True
    return False
    
def find_evidence_refined(dictionary, text, chars_len=30):
    """
    given a dictionary and a piece of text, find the evidence from the text
    """
    evidences = []
    for d in dictionary:
        pattern = re.compile(re.escape(d))
        for m in pattern.finditer(text):
            start,end = m.span(0)
            exclude_rules = dictionary[d].get('exclude', {})
            # 'pre' rules look at the characters before the match
            if exclude_rules.get('pre'):
                pre_context = text[max(0, start-chars_len):start]
                if check_exclude(exclude_rules, pre_context, direction='pre'):
                    print("skip this entity")
                    continue
            # 'post' rules look at the characters after the match
            if exclude_rules.get('post'):
                post_context = text[end:end+chars_len]
                if check_exclude(exclude_rules, post_context, direction='post'):
                    print("skip this entity")
                    continue
            evidences.append(m)
    print_evidences(evidences, text)

##### We can also exclude an entity by looking at the context before the words. 

If we try it again, we can see that one match has been ignored because it hits the `exclude` rule.

In [9]:
find_evidence_refined(sample_dictionary_refined, sample_text.lower())

skip this entity
one evidence found: 'cancer', with pre: '' and post: ', diabete and asthma'
one evidence found: 'diabete', with pre: 'cancer, ' and post: ' and asthma are dise'
one evidence found: 'asthma', with pre: 'cancer, diabete and ' and post: ' are diseases. retro'
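
Excluding on the context before the match works the same way, only with the window taken from the other side. The cell below is a minimal, self-contained sketch: the sentence `demo_sentence`, the rule on 'zodiac', and the helper `context_windows` are all hypothetical and exist only for this illustration.

```python
import re

def context_windows(text, start, end, chars_len=30):
    """Return the text just before and just after the span [start, end)."""
    pre = text[max(0, start - chars_len):start]
    post = text[end:end + chars_len]
    return pre, post

# Hypothetical sentence and rule, not part of the workshop data.
demo_sentence = 'the zodiac sign cancer is not a disease, but lung cancer is.'
demo_rules = {'cancer': {'exclude': {'pre': {'zodiac'}}}}

kept = []
for term, rule in demo_rules.items():
    for m in re.finditer(re.escape(term), demo_sentence):
        pre, post = context_windows(demo_sentence, *m.span(), chars_len=20)
        pre_words = rule.get('exclude', {}).get('pre', set())
        # discard the match if any excluded word occurs before it
        if any(word in pre for word in pre_words):
            continue
        kept.append((m.start(), pre))

print(kept)
```

Only the second 'cancer' survives here, because 'zodiac' falls inside the 20-character window before the first one.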


## 3. Final optional version
To validate an entity, we can also require that certain words appear in the `pre` or `post` context. This is useful for accession numbers, which can otherwise clash with unrelated numbers in a publication.

In [10]:
sample_dictionary_final = {
    'cancer': {'exclude': {'post':{'organization'}}}, 
    'diabete':{}, 
    'asthma': {},
    'gds1695': {'include': {'pre': {'geo'}}}
}

In [11]:
def check_exclude(dictionary, context, direction):
    """
    Given the exclude rules of a term and a context, return True if
    any exclude word for the given direction appears in the context
    (i.e. the match should be discarded)
    """
    if direction not in ('pre', 'post'):
        raise ValueError(f"direction must be 'pre' or 'post' but got '{direction}'")
    for exclude in dictionary[direction]:
        exclude_pattern = re.compile(re.escape(exclude))
        if exclude_pattern.search(context):
            return True
    return False

def check_include(dictionary, context, direction):
    """
    Given the include rules of a term and a context, return True if
    any include word for the given direction appears in the context
    (i.e. the match is confirmed)
    """
    if direction not in ('pre', 'post'):
        raise ValueError(f"direction must be 'pre' or 'post' but got '{direction}'")
    for include in dictionary[direction]:
        include_pattern = re.compile(re.escape(include))
        if include_pattern.search(context):
            return True
    return False

def find_evidence_final(dictionary, text, chars_len=30):
    """
    given a dictionary and a piece of text, find the evidence from the text
    """
    evidences = []
    for d in dictionary:
        pattern = re.compile(re.escape(d))
        for m in pattern.finditer(text):
            start,end = m.span(0)
            # context windows before and after the match
            pre_context = text[max(0, start-chars_len):start]
            post_context = text[end:end+chars_len]
            # discard the match if an 'exclude' word appears in the context
            exclude_rules = dictionary[d].get('exclude', {})
            if exclude_rules.get('pre'):
                if check_exclude(exclude_rules, pre_context, direction='pre'):
                    print(f"skip this entity: {m}")
                    continue
            if exclude_rules.get('post'):
                if check_exclude(exclude_rules, post_context, direction='post'):
                    print(f"skip this entity: {m}")
                    continue
            # discard the match if no 'include' word appears in the context
            include_rules = dictionary[d].get('include', {})
            if include_rules.get('pre'):
                if not check_include(include_rules, pre_context, direction='pre'):
                    print(f"skip this entity: {m}")
                    continue
            if include_rules.get('post'):
                if not check_include(include_rules, post_context, direction='post'):
                    print(f"skip this entity: {m}")
                    continue
            evidences.append(m)
    print_evidences(evidences, text)

Let's try this version. Finally, we also get the GEO accession number, by checking for the presence of 'GEO' ahead of the match.

In [12]:
find_evidence_final(sample_dictionary_final, sample_text.lower(), chars_len=40)

skip this entity: <_sre.SRE_Match object; span=(159, 165), match='cancer'>
one evidence found: 'cancer', with pre: '' and post: ', diabete and asthma'
one evidence found: 'diabete', with pre: 'cancer, ' and post: ' and asthma are dise'
one evidence found: 'asthma', with pre: 'cancer, diabete and ' and post: ' are diseases. retro'
one evidence found: 'gds1695', with pre: '" tool of the study ' and post: '. thanks for support'


## 4. Run your own dictionary

We have prepared 4 articles for you to try dictionary based text mining.

[Sharing and reusing cell image data: PMC5994892](papers/PMC5994892.txt)

[Targeting malaria parasite invasion of red blood cells as an antimalarial strategy: PMC6524681](papers/PMC6524681.txt)

[Minimal exposure of lipid II cycle intermediates triggers cell wall antibiotic resistance: PMC6588590](papers/PMC6588590.txt)

[Protein-Protein Interactions in Candida albicans: PMC6693483](papers/PMC6693483.txt)

You can define your own dictionary with any terms you want to try, and feel free to add rules for validation. The 4 papers are ready for you to use:

In [13]:
papers = ['papers/PMC5994892.txt', 'papers/PMC6524681.txt', 
          'papers/PMC6588590.txt', 'papers/PMC6693483.txt']
def read_paper(path):
    # read the whole file and close it afterwards
    with open(path, 'r') as f:
        return f.read()

pmc5994892 = read_paper(papers[0])
pmc6524681 = read_paper(papers[1])
pmc6588590 = read_paper(papers[2])
pmc6693483 = read_paper(papers[3])

Define your own dictionary below for one or all of the articles.

In [14]:
your_dictionary1 = {
    'term': {'include': {}, 'exclude': {}}
}

Use your own dictionary to search in one article. Have a look at what you get. 

* Are there any unexpected terms matched or missing? 
* Can we improve our dictionary further?

In [139]:
find_evidence_final(dictionary=your_dictionary1, text=pmc5994892, chars_len=30)

one evidence found: 'term', with pre: 'great challenges in ' and post: 's of storage, retrie'
one evidence found: 'term', with pre: 'rchiving would be de' and post: 'ined per-case based '
one evidence found: 'term', with pre: 'iology and for long-' and post: ' maintenance of larg'
one evidence found: 'term', with pre: 'l. 
Standard domain ' and post: 'inology, formally te'
one evidence found: 'term', with pre: 'rminology, formally ' and post: 'ed controlled vocabu'
one evidence found: 'term', with pre: ' sets. 
Many of the ' and post: 's in a controlled vo'
one evidence found: 'term', with pre: 'te the data, all in ' and post: 's of the controlled '
one evidence found: 'term', with pre: 'ould be carefully de' and post: 'ined. 
Cell image da'
one evidence found: 'term', with pre: 'emely beneficial in ' and post: 's of reuse. 
The Cel'
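
Several of the matches above sit inside longer words such as 'determined' or 'terminology'. One possible refinement, sketched here as an assumption rather than as part of the workshop code, is to wrap the escaped term in `\b` word boundaries. The sentence `demo_snippet` only loosely echoes the output above, and the optional plural `s?` is an extra illustrative choice.

```python
import re

# Illustrative sentence, loosely echoing the matches printed above.
demo_snippet = ('all in terms of the controlled vocabulary should be '
                'carefully determined. long-term maintenance.')

# Plain substring search, as used in find_evidence_final.
loose_pattern = re.compile(re.escape('term'))

# Assumed refinement: whole-word match, with an optional plural 's'.
strict_pattern = re.compile(r'\b' + re.escape('term') + r's?\b')

loose_hits = [m.group(0) for m in loose_pattern.finditer(demo_snippet)]
strict_hits = [m.group(0) for m in strict_pattern.finditer(demo_snippet)]

print(loose_hits)   # ['term', 'term', 'term']
print(strict_hits)  # ['terms', 'term']
```

The hyphen in 'long-term' counts as a word boundary, so that match is kept, while the 'term' inside 'determined' is dropped. Whether this is desirable depends on the dictionary: a stem such as 'diabete' relies on matching inside 'diabetes', so word boundaries would need to be applied per term.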
