# Regular expressions and Python

Regular expressions are a formalism for extracting structured information from unstructured text. 
With this formalism we specify a pattern, and strings are selected according to whether their structure matches it. <br>
For example, we might want to keep only the strings that contain at least one digit; a regular expression lets us specify and detect such strings.
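
As a minimal sketch of that filtering idea, the snippet below keeps only the strings that contain at least one digit. The sample strings are made up for illustration, and `re.search` is used because it scans the whole string rather than only its beginning.

```python
import re

# Hypothetical sample strings, chosen only for illustration.
samples = ["room 101", "no digits here", "v2", ""]

# re.search returns a match object if a digit occurs anywhere in the string,
# and None otherwise, so it can be used directly as a filter condition.
with_digits = [s for s in samples if re.search(r"[0-9]", s)]

print(with_digits)  # ['room 101', 'v2']
```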

In this chapter, we introduce the Python `re` module, then use it to identify the figures in scientific papers and count how many times each one is mentioned.

## Regular expression formalism

Regular expressions are used to express text in a generic way so that we can match patterns that crop up in long strings of information. 

We will focus on a few basic concepts:
<table>
<tr>
    <th>Expression</th>
    <th>Meaning</th>
    <th>Examples that match</th>
    <th>Examples that don't match</th>
</tr>
<tr>
<td>[A-Z]</td>
<td>Matches any character A-Z</td>
<td>A, B, C</td>
<td>a, AA, 0</td>
</tr>
<tr>
<td>[A-Z]+</td>
<td>Matches any character A-Z 1-to-many times</td>
<td>A, AA, AAA, AAB, ABCD, JAMES, COFFEE, SPAM</td>
<td>a, aaa, james, coffee, Coffee or the empty string</td>
</tr>
<tr>
<td>[A-Za-z]+</td>
<td>Matches any character A-Z or a-z 1-to-many times</td>
<td>James, Aa, Abc</td>
<td>Test123, C O F F E E</td>
</tr>
<tr>
<td>[A-Za-z0-9]+</td>
<td>Matches any character A-Z, a-z or 0-9 1-to-many times</td>
<td>James, Aa, Abc, Test123</td>
<td>C O F F E E, Coffee? Coffee!</td>
</tr>
</table>

Let's try some of these in Python.


In [2]:
import re

def match(pattern, string):
    # re.match only tries to match at the beginning of the string.
    # It returns a match object on success and None otherwise.
    m = re.match(pattern, string)
    result = m is not None

    print("Testing if {} will match {}. Result: {}".format(pattern, string, result))

    return m


match("[A-Z]", "A")
match("[A-Z]", "a")
match("[A-Z]", "0")
match("[A-Z]", "AA")

Testing if [A-Z] will match A. Result: True
Testing if [A-Z] will match a. Result: False
Testing if [A-Z] will match 0. Result: False
Testing if [A-Z] will match AA. Result: True


<_sre.SRE_Match at 0x7fa42429e578>

Why does AA match against [A-Z]? Strictly speaking, it doesn't. Let's examine the substring that Python matched.


In [None]:
result = match("[A-Z]", "AA")

print("Matched against substring '{}'".format(result.group(0)))

As you can see, only the first A in the string "AA" was matched. This is correct behaviour: A belongs to the character class [A-Z], but we did not add a '+' to match one or more characters, so `re.match` stopped after the first one.
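
To require that the pattern covers the entire string rather than just a prefix, one option is `re.fullmatch` (available since Python 3.4). The following is a small sketch of the difference:

```python
import re

# re.match anchors only at the start, so it succeeds on the first "A".
prefix = re.match(r"[A-Z]", "AA")
print(prefix.group(0))  # A

# re.fullmatch requires the pattern to cover the whole string.
whole = re.fullmatch(r"[A-Z]", "AA")
print(whole)  # None, because "AA" is two characters long

# Adding "+" lets the pattern cover both characters.
whole_plus = re.fullmatch(r"[A-Z]+", "AA")
print(whole_plus.group(0))  # AA
```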

Let's try with [A-Z]+:

In [6]:
match("[A-Z]+", "A")
match("[A-Z]+", "a")
match("[A-Z]+", "0")

print("")
result = match("[A-Z]+", "AA")
print("Matched against substring '{}'\n".format(result.group(0)))

result = match("[A-Z]+", "C O F F E E")
print("Matched against substring '{}'\n".format(result.group(0)))

result = match("[A-Z]+", "James")
print("Matched against substring '{}'".format(result.group(0)))

Testing if [A-Z]+ will match A. Result: True
Testing if [A-Z]+ will match a. Result: False
Testing if [A-Z]+ will match 0. Result: False

Testing if [A-Z]+ will match AA. Result: True
Matched against substring 'AA'

Testing if [A-Z]+ will match C O F F E E. Result: True
Matched against substring 'C'

Testing if [A-Z]+ will match James. Result: True
Matched against substring 'J'


This time the whole string AA is matched, because the '+' consumes consecutive capital letters. In "C O F F E E" the match is just 'C': having matched one capital letter, the engine stops successfully as soon as it reaches the space. Similarly, in "James" it matches the capital 'J' and stops at the lowercase 'a', since a single J already fulfils the requirement of 1-to-many capitals.
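
`re.match` reports only the first run of capitals at the start of the string. As a sketch of how to collect every run instead, `re.findall` keeps scanning after each match:

```python
import re

# re.match stops at the first character that breaks the pattern.
first_run = re.match(r"[A-Z]+", "C O F F E E").group(0)

# re.findall returns every non-overlapping run of capitals, left to right.
all_runs = re.findall(r"[A-Z]+", "C O F F E E")
james_runs = re.findall(r"[A-Z]+", "James")

print(first_run)   # C
print(all_runs)    # ['C', 'O', 'F', 'F', 'E', 'E']
print(james_runs)  # ['J']
```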

## More examples of regular expression syntax

An exhaustive list of regular expression syntax would be too long to reproduce here. To find out more about what Python supports, check out [the documentation page](https://docs.python.org/3/library/re.html#regular-expression-syntax) on regular expressions.

Here are a few more examples that are useful for the following exercises.

<table>
<tr>
<th>Regular Expression</th>
<th>Meaning</th>
</tr>
<tr>
<td>.</td>
<td>Match any single character except a newline. That's a bit like a square brackets expression such as [A-Za-z0-9], but it also includes punctuation marks and spaces.</td>
</tr>
<tr>
<td>\*</td>
<td>Match 0-many of the preceding pattern. For example, .* matches any number of characters (other than newlines), including no input at all.</td>
</tr>
<tr>
<td>?</td>
<td>Match the preceding pattern 0-1 times. This is great for specifying that something is optional.</td>
</tr>
<tr>
<td>\s</td>
<td>Matches any whitespace character, including space, tab and newline.</td>
</tr>
</table>
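
A quick way to check how these four constructs behave in Python's `re` module is sketched below. The word "colour" is just an illustrative example, and `re.fullmatch` is used so that each pattern has to cover the whole test string.

```python
import re

# "." matches any single character except a newline, so a space is accepted.
dot_matches_space = re.fullmatch(r".", " ") is not None
dot_matches_newline = re.fullmatch(r".", "\n") is not None

# "*" means zero or more, so ".*" also matches the empty string.
star_matches_empty = re.fullmatch(r".*", "") is not None

# "?" makes the preceding item optional: the "u" may appear once or not at all.
optional_u = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour", "colouur")]

# "\s" matches space, tab and newline without any flags.
whitespace_run = re.fullmatch(r"\s+", " \t\n") is not None

print(dot_matches_space, dot_matches_newline)  # True False
print(star_matches_empty)                      # True
print(optional_u)                              # [True, True, False]
print(whitespace_run)                          # True
```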

## Real-world application

Let's find out how many figures there are in the ART corpus (https://www.aber.ac.uk/en/cs/research/cb/projects/art/art-corpus/) and how many times they are referenced. We will read a previously prepared Pickle file containing the filename, id and text of each sentence in the corpus.

### Loading and parsing all ART corpus papers

In [None]:
import pickle

# Load the dataset previously saved as a Pickle file.
# 'all_sentences' is a list of (filename, id, text) tuples, one per sentence.
with open("Datasets/art_dataset.pickle", "rb") as f:
    all_sentences = pickle.load(f)


print ("Number of sentences loaded: ",len(all_sentences))
for s in all_sentences[:3]:
    print("\nS: ", s)

### Defining a regular expression

Now, we are interested in finding out where the authors talk about figures in the papers. Depending on their writing style, some authors use "Figure 1", some use "Fig. 1" and some use "Fig 1" (without the dot). We should check and account for each of these.

Also, sometimes figures have subfigures (e.g. Fig 1.A or 1.B), so we need to match these too.

In [4]:
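# Raw string, so that "\s" reaches the regex engine unchanged.
# Note that the unescaped "." matches any single character, not only a literal dot.
pattern = r"Fig(ure)?.?\s+([0-9A-B]([A-Za-z0-9])*)"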


print (re.match(pattern, "Fig. 1"))
print (re.match(pattern, "Fig 1"))
print (re.match(pattern, "Figure 1"))
print (re.match(pattern, "Fig 1.A"))

#pattern = r"Fig(ure)?.?\s+([0-9A-B](\.[A-Za-z0-9])*)"
#print (re.fullmatch(pattern, "Fig 1.A"))

<_sre.SRE_Match object at 0x7fa426bf5750>
<_sre.SRE_Match object at 0x7fa426bf5750>
<_sre.SRE_Match object at 0x7fa426bf5750>
<_sre.SRE_Match object at 0x7fa426bf5750>


You may wonder about all the extra brackets in these expressions. Parentheses let you define "groups", which both delimit sub-patterns and capture the text they match so that it can be retrieved later. Notice that we put brackets around 'ure' followed by a ? to express that the "ure" in "Figure" is optional (the author might just write "Fig").

We also put brackets around the portion that describes the figure number so that we can extract it separately. Below, we perform a quick check of our current regular expression:

In [5]:
tests = ["Fig. 1", "Fig 1", "Figure 1", "Fig. 1.A", "Figure 2.C", "Figure. 3.2"]

for t in tests:
    m = re.match(pattern,t)
    
    if not m:
        print("Test failed for ", t)
    else:
        print ("Whole match '{}'".format( m.group(0)))
        print ("Figure or Fig?", m.group(1))
        print ("Fig ID '{}'".format(m.group(2)))
        print("--------------------")

Whole match 'Fig. 1'
('Figure or Fig?', None)
Fig ID '1'
--------------------
Whole match 'Fig 1'
('Figure or Fig?', None)
Fig ID '1'
--------------------
Whole match 'Figure 1'
('Figure or Fig?', 'ure')
Fig ID '1'
--------------------
Whole match 'Fig. 1'
('Figure or Fig?', None)
Fig ID '1'
--------------------
Whole match 'Figure 2'
('Figure or Fig?', 'ure')
Fig ID '2'
--------------------
Whole match 'Figure. 3'
('Figure or Fig?', 'ure')
Fig ID '3'
--------------------


The `re.match` and `re.search` functions both return a `Match` object, or `None` if the regex failed to match. Match objects have a `group` method that allows you to extract the groups denoted by `()` in your expressions.

Group 0 always returns the string that matched the whole expression from start to end. 
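
The difference between the two functions matters for real sentences, where a figure reference rarely sits at the very start. A minimal sketch, using a made-up sentence rather than one from the ART corpus:

```python
import re

# Same pattern as above, written as a raw string.
pattern = r"Fig(ure)?.?\s+([0-9A-B]([A-Za-z0-9])*)"

# Hypothetical sentence, not taken from the ART corpus.
sentence = "The results are shown in Fig. 2 below."

# re.match only looks at the beginning of the string.
at_start = re.match(pattern, sentence)
print(at_start)  # None, because the sentence does not start with "Fig"

# re.search scans forward until it finds the first match.
found = re.search(pattern, sentence)
print(found.group(0))  # Fig. 2
print(found.group(2))  # 2
```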

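The pattern handles all three spellings, although the output above shows that it captures only the leading figure number and drops sub-figure suffixes such as ".A". That is enough for a first count, so let's find out how many times figures are mentioned in the papers of the ART corpus.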
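
To keep sub-figure labels such as "1.A", one possible variant is sketched below. The name `pattern_sub` and the exact expression are assumptions for illustration, not the pattern used in the rest of this chapter: it escapes the dot after "Fig" and allows any number of ".suffix" parts after the figure number. The figure ID stays in group 2.

```python
import re

# Hypothetical variant: "\." is a literal dot, and "(\.[A-Za-z0-9]+)*"
# allows zero or more ".suffix" parts after the figure number.
pattern_sub = r"Fig(ure)?\.?\s+([0-9A-B][A-Za-z0-9]*(\.[A-Za-z0-9]+)*)"

tests = ["Fig. 1", "Fig 1", "Figure 1", "Fig. 1.A", "Figure 2.C", "Figure. 3.2"]

# Group 2 holds the figure ID, including any sub-figure suffix.
ids = [re.match(pattern_sub, t).group(2) for t in tests]

print(ids)  # ['1', '1', '1', '1.A', '2.C', '3.2']
```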

In [None]:
from collections import Counter

# Create a dictionary that maps each filename to its own Counter.
# A Counter is like a dictionary that stores a count for each key.
figs = { filename:Counter() for filename,id,text in all_sentences}    

# Example of Counter usage
# Counter(['blue', 'red', 'blue', 'yellow', 'blue', 'red'])
# Counter({'blue': 3, 'red': 2, 'yellow': 1})


# Return a list of figures mentioned in "sent"
def match_sent(sent):
    filename,id,text = sent
    sfigs = []
    
    # "findall()" returns all non-overlapping matches in the string, scanned left to right.
    # Because the pattern contains several groups, each match is a tuple of groups; m[1] is the figure ID.
    for m in re.findall(pattern, text):
        sfigs.append(m[1])
        
    return filename,id,sfigs


for filename,id,sentfigs in map(match_sent, all_sentences):
    # For each filename, update the counts of the figures mentioned in the sentence.
    figs[filename].update(sentfigs)
    
    print(sentfigs)
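
To see why `match_sent` reads `m[1]`, it helps to look at what `re.findall` returns for a pattern with several groups. The sketch below uses a made-up sentence, not one from the ART corpus; groups that did not take part in a match show up as empty strings.

```python
import re
from collections import Counter

pattern = r"Fig(ure)?.?\s+([0-9A-B]([A-Za-z0-9])*)"

# Hypothetical sentence, not taken from the ART corpus.
text = "Compare Figure 3 with Fig. 4a and Fig 3."

# One tuple per match: (optional "ure", figure ID, last trailing character).
matches = re.findall(pattern, text)
print(matches)  # [('ure', '3', ''), ('', '4a', 'a'), ('', '3', '')]

# Index 1 of each tuple is the figure ID, which is what the Counter tallies.
fig_ids = [m[1] for m in matches]
counts = Counter(fig_ids)
print(counts["3"], counts["4a"])  # 2 1
```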

In [None]:
for file in figs:
    print("\nFile ", file)
    print("References to figures...: ")
    print(figs[file])

        

So, now we know which papers mention which figures, and we can find out which paper cites the highest number of different figures and which one references figures most often.


In [None]:
sorted_figs_by_refcount = [ (x, sum( figs[x].values())) for x in 
               sorted(figs, key=lambda x: sum( figs[x].values()), reverse=True ) ] 

sorted_figs_by_variety = [ (x, len(figs[x])) for x in 
               sorted(figs, key=lambda x: len( figs[x] ), reverse=True ) ] 

print("Top 5 papers by number of references to figures (frequency)")
for paper,count in sorted_figs_by_refcount[0:5]:
    print("Title: {} Count: {}".format(paper,count))
print("\n\n")
print("Top 5 papers by number of different figures in paper (variety)")
for paper,count in sorted_figs_by_variety[0:5]:
    print("Title: {} Count: {}".format(paper,count))
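
The two rankings can be checked on a small toy example. The dictionary `toy_figs` below is hypothetical data standing in for `figs`; it shows that a paper can lead one ranking without leading the other.

```python
from collections import Counter

# Hypothetical per-paper counters, standing in for the `figs` dictionary.
toy_figs = {
    "paper_a": Counter({"1": 4, "2": 1}),          # 5 references, 2 distinct figures
    "paper_b": Counter({"1": 1, "2": 1, "3": 1}),  # 3 references, 3 distinct figures
    "paper_c": Counter(),                          # no figure references
}

# Frequency: total number of figure references in each paper.
by_refcount = sorted(toy_figs, key=lambda name: sum(toy_figs[name].values()), reverse=True)

# Variety: number of distinct figures referenced in each paper.
by_variety = sorted(toy_figs, key=lambda name: len(toy_figs[name]), reverse=True)

print(by_refcount)  # ['paper_a', 'paper_b', 'paper_c']
print(by_variety)   # ['paper_b', 'paper_a', 'paper_c']

# Counter.most_common gives the most frequently cited figure within one paper.
print(toy_figs["paper_a"].most_common(1))  # [('1', 4)]
```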
    
    


## Conclusion

We have used regular expressions to parse semi-structured data inside the ART Corpus and determine which of the papers have the most diverse and most frequent references to figures.