# Regular expressions and Python

Regular expressions provide the programmer with a reliable and repeatable way of extracting structured information from unstructured text. This hinges on the text having some sort of consistent structure to analyse.

In this chapter we introduce the Python `re` module and try to identify the figures mentioned in scientific papers and the number of times each one is mentioned.

## Regular expression language

A regular expression describes text in a generic way, so that we can match patterns that crop up in long strings of information.

We will focus on a few basic concepts:
<table>
<tr>
    <th>Expression</th>
    <th>Meaning</th>
    <th>Examples that match</th>
    <th>Examples that don't match</th>
</tr>
<tr>
<td>[A-Z]</td>
<td>Matches any character A-Z</td>
<td>A, B, C</td>
<td>a, AA, 0</td>
</tr>
<tr>
<td>[A-Z]+</td>
<td>Matches any character A-Z 1-to-many times</td>
<td>A, AA, AAA, AAB, ABCD, JAMES, COFFEE, SPAM</td>
<td>a, aaa, james, coffee, Coffee, or the empty string</td>
</tr>
<tr>
<td>[A-Za-z]+</td>
<td>Matches any character A-Z or a-z 1-to-many times</td>
<td>James, Aa, Abc</td>
<td>Test123, C O F F E E</td>
</tr>
<tr>
<td>[A-Za-z0-9]+</td>
<td>Matches any character A-Z, a-z or 0-9 1-to-many times</td>
<td>James, Aa, Abc, Test123</td>
<td>C O F F E E, Coffee?, Coffee!</td>
</tr>
</table>

Let's try some of these in Python.


In [2]:
import re

def match(pattern, string):

    # re.match only tries to match at the start of the string
    m = re.match(pattern, string)
    result = m is not None

    print("Testing if {} will match {}. Result: {}".format(pattern, string, result))

    # return the Match object (or None) so callers can inspect it
    return m


match("[A-Z]", "A")
match("[A-Z]", "a")
match("[A-Z]", "0")
match("[A-Z]", "AA")

Testing if [A-Z] will match A. Result: True
Testing if [A-Z] will match a. Result: False
Testing if [A-Z] will match 0. Result: False
Testing if [A-Z] will match AA. Result: True


<_sre.SRE_Match object; span=(0, 1), match='A'>

Why does AA match against [A-Z]? Well, it doesn't really: `re.match` only requires the pattern to match at the start of the string, not the whole string. Let's examine the substring that Python matched.


In [3]:
result = match("[A-Z]", "AA")

print("Matched against substring '{}'".format(result.group(0)))

Testing if [A-Z] will match AA. Result: True
Matched against substring 'A'


As you can see, it has matched only the first A in the string "AA". This is correct behaviour: A belongs to the character class [A-Z], but we did not add a '+' to match one or more characters, so only the first character is returned.
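If the intent is to test whether the whole string fits the pattern, as the table at the start of this section does, one option is `re.fullmatch` (available since Python 3.4), which only succeeds when the pattern consumes the entire string. The following is a minimal sketch; `matches_whole` is a hypothetical helper name introduced here for illustration, not part of the `re` module.

```python
import re

def matches_whole(pattern, string):
    """Return True only if the pattern matches the entire string."""
    return re.fullmatch(pattern, string) is not None

for candidate in ["A", "AA", "a", "0"]:
    prefix = re.match("[A-Z]", candidate) is not None   # start of string only
    whole = matches_whole("[A-Z]", candidate)           # entire string
    print("{!r}: prefix match={}, whole-string match={}".format(candidate, prefix, whole))
```

With this helper, "AA" is rejected by [A-Z] but accepted by [A-Z]+, which agrees with the table.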

Let's try with [A-Z]+

In [4]:
match("[A-Z]+", "A")
match("[A-Z]+", "a")
match("[A-Z]+", "0")
result = match("[A-Z]+", "AA")
print("Matched against substring '{}'".format(result.group(0)))

result = match("[A-Z]+", "C O F F E E")
print("Matched against substring '{}'".format(result.group(0)))

result = match("[A-Z]+", "James")
print("Matched against substring '{}'".format(result.group(0)))

Testing if [A-Z]+ will match A. Result: True
Testing if [A-Z]+ will match a. Result: False
Testing if [A-Z]+ will match 0. Result: False
Testing if [A-Z]+ will match AA. Result: True
Matched against substring 'AA'
Testing if [A-Z]+ will match C O F F E E. Result: True
Matched against substring 'C'
Testing if [A-Z]+ will match James. Result: True
Matched against substring 'J'


This time the whole substring AA is matched, because '+' keeps consuming consecutive characters for as long as they fit the class. The result for 'C O F F E E' may be surprising: the match takes the 'C' and, having satisfied 1-to-many capital letters, stops successfully at the first space. Similarly, in James it matches the capital 'J' and returns success because J alone fulfils the requirement of 1-to-many capitals.
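To collect every run of capitals in a string rather than only the one at the start, `re.findall` scans the whole string and returns all non-overlapping matches. A short sketch using the same test strings (the variable names are illustrative only):

```python
import re

# Each capital in the spaced string is a separate run, because the
# spaces between them do not belong to the class [A-Z].
spaced = re.findall("[A-Z]+", "C O F F E E")
mixed = re.findall("[A-Z]+", "James")
solid = re.findall("[A-Z]+", "COFFEE")

print(spaced)  # ['C', 'O', 'F', 'F', 'E', 'E']
print(mixed)   # ['J']
print(solid)   # ['COFFEE']
```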

## More examples of regular expression grammar

I'm not going to create an exhaustive list of regular expression syntax here. If you want to find out more about what Python supports, you can [read the documentation page](https://docs.python.org/3/library/re.html#regular-expression-syntax) on regular expressions.

Here are a few more examples that are useful for this exercise.

<table>
<tr>
<th>Regular Expression</th>
<th>Meaning</th>
</tr>
<tr>
<td>.</td>
<td>Match any single character except a newline. That's a bit like a square brackets expression such as [A-Za-z0-9], but it also includes punctuation marks and spaces.</td>
</tr>
<tr>
<td>\*</td>
<td>Match 0-many of the preceding pattern. For example, .* would match any number of characters (other than newlines), including no input at all.</td>
</tr>
<tr>
<td>?</td>
<td>Match the preceding pattern 0-1 times. This is great for specifying that something is optional.</td>
</tr>
<tr>
<td>\s</td>
<td>Matches whitespace characters, including space, tab and newline.</td>
</tr>
</table>
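The behaviour of these four constructs can be checked directly. The sketch below uses `re.fullmatch` so that each test applies to the entire string; the `colou?r` pattern is an illustrative example that does not come from the corpus.

```python
import re

# '.' matches any character except a newline, so a space counts too.
dot_space = re.fullmatch(".", " ") is not None
dot_newline = re.fullmatch(".", "\n") is not None

# '*' allows zero repetitions, so the empty string matches.
star_empty = re.fullmatch("[A-Z]*", "") is not None

# '?' makes the preceding item optional.
opt_short = re.fullmatch("colou?r", "color") is not None
opt_long = re.fullmatch("colou?r", "colour") is not None

# '\s' matches spaces, tabs and newlines (raw string keeps the backslash).
whitespace = re.findall(r"\s", "a b\tc\nd")

print(dot_space, dot_newline)  # True False
print(star_empty)              # True
print(opt_short, opt_long)     # True True
print(whitespace)              # [' ', '\t', '\n']
```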

## Real-world application

Let's find out how many figures there are in the ART corpus and how many times they are brought up. We will use our XML processing and filesystem navigation knowledge to facilitate this.

### Loading and parsing all ART corpus papers

Assuming we have downloaded and extracted all papers from the CoreSC corpus to the assets folder, we can now load all of the papers and find sentences in each.

In [5]:
from xml.etree import ElementTree as ET
import os

def load_sentences(filename):

    # open and parse the paper
    tree = ET.parse(filename)
    root = tree.getroot()

    # find all sentences (<s> elements) in the paper
    sentences = []
    for sent_el in root.iter("s"):
        sid = sent_el.get("sid")
        # join the text of the element and all of its children
        text = "".join(sent_el.itertext())
        sentences.append((filename, sid, text))

    return sentences


all_sentences = []
for root, dirs, files in os.walk("assets/ART_Corpus"):

    for name in files:
        if name.endswith(".xml"):
            all_sentences.extend(load_sentences(os.path.join(root, name)))

print("Number of sentences loaded: ", len(all_sentences))

Number of sentences loaded:  34680
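To see what `iter("s")` and `itertext()` do without needing the corpus on disk, here is a small sketch run against an in-memory XML string. The fragment is a made-up example: only the `s` element and its `sid` attribute are taken from the code above, and real corpus files may be structured differently.

```python
from xml.etree import ElementTree as ET

# Hypothetical fragment for illustration; not copied from the corpus.
sample = (
    '<PAPER>'
    '<s sid="1">See <b>Fig. 1</b> for details.</s>'
    '<s sid="2">No figures here.</s>'
    '</PAPER>'
)

root = ET.fromstring(sample)

# itertext() yields the text of an element and of all its descendants,
# so the nested <b> markup does not break the sentence apart.
sentences = [(s.get("sid"), "".join(s.itertext())) for s in root.iter("s")]
print(sentences)
# [('1', 'See Fig. 1 for details.'), ('2', 'No figures here.')]
```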


### Defining a regular expression

Now we are interested in finding out where the authors of these papers talk about figures. Depending on their writing style, some authors use Figure 1, some use Fig. 1 and some use Fig 1 (no dot). We should try to account for each of these.

Sometimes figures have subfigures (e.g. Fig 1.A or 1.B), so we need to match these too.

In [23]:
# raw string so that the backslash in \s is passed to the regex engine unchanged
pattern = r"Fig(ure)?.?\s+([0-9A-B]([A-Za-z0-9])*)"

print(re.match(pattern, "Fig. 1"))
print(re.match(pattern, "Fig 1"))
print(re.match(pattern, "Figure 1"))
print(re.match(pattern, "Fig 1.A"))

<_sre.SRE_Match object; span=(0, 6), match='Fig. 1'>
<_sre.SRE_Match object; span=(0, 5), match='Fig 1'>
<_sre.SRE_Match object; span=(0, 8), match='Figure 1'>
<_sre.SRE_Match object; span=(0, 5), match='Fig 1'>


You may wonder about all the extra brackets in these expressions. Parentheses allow you to define "groups" to capture as variables and also to define sub-patterns. Notice that we put brackets around 'ure' followed by a ? to express the fact that the 'ure' in Figure is optional (the author might just say Fig).

We also put brackets around the bit of the expression that describes the figure number to allow us to capture it. Extracting this info is demonstrated below:

In [24]:
tests = ["Fig. 1", "Fig 1", "Figure 1", "Fig. 1.A", "Figure 2.C", "Figure. 3.2"]

for t in tests:
    m = re.match(pattern,t)
    
    if not m:
        print("Test failed for ", t)
    else:
        print ("Whole match '{}'".format( m.group(0)))
        print ("Figure or Fig?", m.group(1))
        print ("Fig ID '{}'".format(m.group(2)))
        print("--------------------")

Whole match 'Fig. 1'
Figure or Fig? None
Fig ID '1'
--------------------
Whole match 'Fig 1'
Figure or Fig? None
Fig ID '1'
--------------------
Whole match 'Figure 1'
Figure or Fig? ure
Fig ID '1'
--------------------
Whole match 'Fig. 1'
Figure or Fig? None
Fig ID '1'
--------------------
Whole match 'Figure 2'
Figure or Fig? ure
Fig ID '2'
--------------------
Whole match 'Figure. 3'
Figure or Fig? ure
Fig ID '3'
--------------------


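The `re.match` and `re.search` functions both return a `Match` object, or `None` if the pattern did not match. `re.match` only looks at the start of the string, whereas `re.search` scans for the first match anywhere in it. Match objects have a `group` method that allows you to extract the groups denoted by `()` in your expressions.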

Group 0 always returns the string that matched the whole expression from start to end. 
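Sentences in a paper rarely start with the word "Fig", so it helps to see how the different `re` functions behave on a full sentence. The sketch below reuses the pattern defined earlier (written as a raw string) on a made-up sentence. It also shows that `re.findall` returns one tuple of groups per match when the pattern contains several groups, with unmatched groups reported as empty strings.

```python
import re

# Same pattern as earlier in this chapter, as a raw string.
pattern = r"Fig(ure)?.?\s+([0-9A-B]([A-Za-z0-9])*)"

# Hypothetical sentence, for illustration only.
sentence = "The results are shown in Fig. 2a and Figure 3."

# re.match fails because the sentence does not begin with "Fig".
at_start = re.match(pattern, sentence)
print(at_start)  # None

# re.search finds the first occurrence anywhere in the sentence.
first = re.search(pattern, sentence)
print(first.group(0), "->", first.group(2))  # Fig. 2a -> 2a

# re.findall returns every match as a tuple (group 1, group 2, group 3).
every = re.findall(pattern, sentence)
print(every)  # [('', '2a', 'a'), ('ure', '3', '')]
```

The figure ID is therefore found at index 1 of each tuple returned by `re.findall`.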

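This pattern seems robust enough for a first pass, with one caveat visible in the output above: dotted subfigure labels are cut short, so 'Fig. 1.A' yields the ID '1'. Let's find out how many times figures are brought up in papers in the ART corpus.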
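If dotted subfigure labels matter, one possible tightening is sketched below. This is a hypothetical variant and not the pattern used to produce the results in this chapter. `strict_pattern` is an illustrative name, the dot after "Fig" is escaped so that it only matches a literal full stop, `(?:...)` is used for grouping without capturing, and the figure ID is assumed to be a number optionally followed by a single letter with or without a dot.

```python
import re

# Hypothetical variant, NOT the pattern used for the corpus results.
#   Fig(?:ure)?            "Fig" or "Figure" (non-capturing group)
#   \.?                    optional literal full stop
#   \s+                    one or more whitespace characters
#   (\d+(?:\.?[A-Za-z])?)  number, optionally followed by ".A" or "a"
strict_pattern = re.compile(r"Fig(?:ure)?\.?\s+(\d+(?:\.?[A-Za-z])?)")

tests = ["Fig. 1", "Fig 1", "Figure 1", "Fig. 1.A", "Figure 2.C", "Fig. 2a"]
ids = []
for text in tests:
    m = strict_pattern.match(text)
    ids.append(m.group(1) if m else None)
    print("{!r} -> {!r}".format(text, ids[-1]))
```

Because the only capturing group is the figure ID, it is available as `group(1)`. The trade-off is that a numeric subfigure such as "Figure 3.2" would still be reduced to "3", and a sentence-ending full stop followed directly by a letter would be misread as a subfigure label.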

In [33]:
from collections import Counter
from ipywidgets import FloatProgress
from IPython.display import display

# one Counter of figure IDs per paper
figs = {filename: Counter() for filename, sid, text in all_sentences}

# progress bar widget for the notebook
f = FloatProgress(min=0, max=len(all_sentences))
display(f)


def match_sent(sent):
    filename, sid, text = sent

    # findall returns one tuple of groups per match; index 1 holds the figure ID
    sfigs = [m[1] for m in re.findall(pattern, text)]

    return filename, sid, sfigs


for filename, sid, sentfigs in map(match_sent, all_sentences):
    f.value += 1
    figs[filename].update(sentfigs)


for file in figs:
    print("File ", file)
    print("References to figures...: ")
    print(figs[file])

        

File  assets/ART_Corpus/ann5/b310282c_mode2.xml
References to figures...: 
Counter({'8': 2, '2a': 2, '3': 2, '5': 1, '7': 1, '4a': 1, '1': 1, '4': 1, '2b': 1, '6': 1})
File  assets/ART_Corpus/ann6/b315038k_mode2.xml
References to figures...: 
Counter({'1': 1})
File  assets/ART_Corpus/ann1/b312620j_mode2.xml
References to figures...: 
Counter({'4': 2, '3': 2, '2b': 2, '7': 2, '1d': 1, '1c': 1, '1b': 1, '1a': 1, '2a': 1, '1': 1, '5': 1, '6': 1})
File  assets/ART_Corpus/ann8/b413703e_mode2.xml
References to figures...: 
Counter({'2': 3, '3a': 2, '1': 2, '4': 2, '1b': 2, '1a': 1, '3': 1, '3b': 1})
File  assets/ART_Corpus/ann9/b502789f_mode2.xml
References to figures...: 
Counter({'7b': 3, '8b': 2, '2': 2, '5': 2, '9b': 1, '3b': 1, '7a': 1, '3a': 1, '7d': 1, '4': 1, '10': 1, '9a': 1, '7c': 1, '11b': 1, '1': 1, '11a': 1, '6': 1, '8a': 1})
File  assets/ART_Corpus/ann1/b403450c_mode2.xml
References to figures...: 
Counter({'2': 3, '4': 2, '3': 1, '1a': 1, '1b': 1})
File  assets/ART_Corpus/ann6

Now that we know which papers reference which figures, we can find out which paper has the most distinct figures and which one references figures most often.


In [41]:
sorted_figs_by_refcount = [ (x, sum( figs[x].values())) for x in 
               sorted(figs, key=lambda x: sum( figs[x].values()), reverse=True ) ] 

sorted_figs_by_variety = [ (x, len(figs[x])) for x in 
               sorted(figs, key=lambda x: len( figs[x] ), reverse=True ) ] 

print("Top 5 papers by number of references to figures (frequency)")
for paper,count in sorted_figs_by_refcount[0:5]:
    print("Title: {} Count: {}".format(paper,count))
print("\n\n")
print("Top 5 papers by number of different figures in paper (variety)")
for paper,count in sorted_figs_by_variety[0:5]:
    print("Title: {} Count: {}".format(paper,count))




Top 5 papers by number of references to figures (frequency)
Title: assets/ART_Corpus/ann8/b506075c_mode2.xml Count: 48
Title: assets/ART_Corpus/ann5/b311582h_mode2.xml Count: 45
Title: assets/ART_Corpus/ann2/b308202d_mode2.xml Count: 39
Title: assets/ART_Corpus/ann6/b308201f_mode2.xml Count: 37
Title: assets/ART_Corpus/ann9/b413603a_mode2.xml Count: 37



Top 5 papers by number of different figures in paper (variety)
Title: assets/ART_Corpus/ann7/b413535k_mode2.xml Count: 26
Title: assets/ART_Corpus/ann3/b405531d_mode2.xml Count: 20
Title: assets/ART_Corpus/ann5/b402623c_mode2.xml Count: 20
Title: assets/ART_Corpus/ann8/b316895f_mode2.xml Count: 19
Title: assets/ART_Corpus/ann9/b502789f_mode2.xml Count: 18


## Conclusion

We have used regular expressions to parse semi-structured data inside the ART Corpus and determine which of the papers have the most diverse and most frequent references to figures.

Our final chapter looks at [NLTK](NLTK.ipynb).