## Regular Expressions

**Start reading [here](https://docs.python.org/3.6/howto/regex.html#regex-howto)**.

[The Python 3.6 documentation](https://docs.python.org/3.6/library/re.html).

[Doug Knox's introduction](https://programminghistorian.org/en/lessons/understanding-regular-expressions).  Examples are in Open Office, not Python, but it's worth reviewing.

## Test data

Let's use an excerpt of about 670 characters from the start of Edwin Abbott's *Flatland* (I downloaded this data from Project Gutenberg and manually removed the PG header and footer).

In [1]:
text = open('data/pg97_Abbott_Edwin_Flatland.txt').read()[1:673]

print(text)

FLATLAND



PART 1

THIS WORLD



SECTION 1  Of the Nature of Flatland


I call our world Flatland, not because we call it so, but to make its
nature clearer to you, my happy readers, who are privileged to live in
Space.

Imagine a vast sheet of paper on which straight Lines, Triangles,
Squares, Pentagons, Hexagons, and other figures, instead of remaining
fixed in their places, move freely about, on or in the surface, but
without the power of rising above or sinking below it, very much like
shadows--only hard with luminous edges--and you will then have a pretty
correct notion of my country and countrymen.  Alas, a few years ago, I
should have said "my universe:"  


## The basics . . . sub and split

sub, right from the doc:

    re.sub(pattern, repl, string, count=0, flags=0)
    
which returns a string.
    
And split:

    re.split(pattern, string, maxsplit=0, flags=0)
    
which returns a list of strings.
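
As a minimal sketch of the two signatures above, here is how `count` and `maxsplit` limit the work done. The sample string is made up for illustration, and the pattern is a plain comma rather than anything from the regular expression language.

```python
import re

sample = 'a,b,c,d'   # hypothetical sample string, not from Flatland

# Replace every comma with a space.
all_replaced = re.sub(',', ' ', sample)

# count=1 replaces only the first comma.
first_replaced = re.sub(',', ' ', sample, count=1)

# Split at every comma.
all_split = re.split(',', sample)

# maxsplit=1 splits once and leaves the rest of the string untouched.
first_split = re.split(',', sample, maxsplit=1)

print(all_replaced)     # a b c d
print(first_replaced)   # a b,c,d
print(all_split)        # ['a', 'b', 'c', 'd']
print(first_split)      # ['a', 'b,c,d']
```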

There are several layers of complexity here:

* The pattern can become quite complex; patterns are expressed using a **regular expression language**.
* The string passed as an argument to re.sub and re.split can be expressed as a variable which contains a string, or as a function which returns a string; i.e., these things can nest.
* The pattern can be precompiled to yield a regex object, which in turn has .sub and .split methods.

Note, however, **most of the complexity here--and there's as much as you have appetite for--is in the pattern.**
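
To make the third point concrete, here is a small sketch of a precompiled pattern. The sample string and the variable names are my own, and the pattern `\s+` simply means "one or more whitespace characters".

```python
import re

# Compile once; the resulting regex object has its own .sub and .split methods.
whitespace = re.compile(r'\s+')

sample = 'I call  our\nworld   Flatland'   # made-up sample string

# The module-level function and the compiled object's method do the same job.
via_module = re.sub(r'\s+', ' ', sample)
via_object = whitespace.sub(' ', sample)
words = whitespace.split(sample)

print(via_object)                 # I call our world Flatland
print(words)                      # ['I', 'call', 'our', 'world', 'Flatland']
print(via_module == via_object)   # True
```

Precompiling is mostly a convenience when the same pattern is reused many times; for one-off calls the module-level functions are fine.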

## Two simple examples.

1.  Replace every occurrence of one or more whitespace characters with a space.

2.  Split the text at every point in which one or more whitespace characters occur.

We express "one or more whitespace characters" with the special character "\s" and the repeating metacharacter "+".


In [2]:
import re

#  Raw strings (r'...') keep the backslash in "\s" from being read as a Python string escape.

print(re.sub(r'\s+', ' ', text.strip()), '\n\n')

print(re.split(r'\s+', text.strip()), '\n\n')

FLATLAND PART 1 THIS WORLD SECTION 1 Of the Nature of Flatland I call our world Flatland, not because we call it so, but to make its nature clearer to you, my happy readers, who are privileged to live in Space. Imagine a vast sheet of paper on which straight Lines, Triangles, Squares, Pentagons, Hexagons, and other figures, instead of remaining fixed in their places, move freely about, on or in the surface, but without the power of rising above or sinking below it, very much like shadows--only hard with luminous edges--and you will then have a pretty correct notion of my country and countrymen. Alas, a few years ago, I should have said "my universe:" 


['FLATLAND', 'PART', '1', 'THIS', 'WORLD', 'SECTION', '1', 'Of', 'the', 'Nature', 'of', 'Flatland', 'I', 'call', 'our', 'world', 'Flatland,', 'not', 'because', 'we', 'call', 'it', 'so,', 'but', 'to', 'make', 'its', 'nature', 'clearer', 'to', 'you,', 'my', 'happy', 'readers,', 'who', 'are', 'privileged', 'to', 'live', 'in', 'Space.', 'Im

## A tour of "patterns"

[Again, start reading here](https://docs.python.org/3.6/howto/regex.html#regex-howto).  More on syntax [here](https://docs.python.org/3.6/library/re.html#regular-expression-syntax).

In [3]:
import re

de_spaced_text = re.sub(r'\s+', ' ', text.strip())

#  We can match characters in a very simple fashion; note that I've added a flag
#  so that the find "half" of the sub is case-insensitive.

print(re.sub('Flatland', 'Missouri', de_spaced_text, flags=re.IGNORECASE), '\n\n')

#  Usually, however, we want to think about characters in broad classes.  For example,
#  we might want to change every character to a "Z" (useless in this instance, except
#  to show that "." means "any character").

print(re.sub('.', 'Z', de_spaced_text), '\n\n')

#  Or, we might want to get rid of any punctuation ("\W" means "any non-word character", i.e.,
#  anything other than a letter, a digit, or an underscore):

print(re.sub(r'\W', '', de_spaced_text), '\n\n')

#  Not great, however; we'd like to keep the spaces (the brackets define a character class, which
#  in this case consists of the letters a through z, A through Z, 0 through 9, and whitespace; "^" 
#  immediately after the opening bracket means "not"; i.e., we're saying, "match any character which
#  is not in the class a through z, etc.").

print(re.sub(r'[^a-zA-Z0-9\s]', '', de_spaced_text), '\n\n')

#  That's pretty good, but it doesn't handle the double hyphens.  If I build a character class that
#  explicitly declares all the punctuation, and replace punctuation with spaces, I get a pretty good
#  result.  

#  NOTE that the period in the character class acts as a literal, and not as the metacharacter
#  meaning "any character".  This is the sort of thing that makes me grumble.

import string

print(string.punctuation, '\n\n')

generated_expression = '[' + string.punctuation + ']'

print(generated_expression, '\n\n')

print(re.sub(generated_expression, ' ', de_spaced_text), '\n\n')

#  And if I pass the sub results through a space-normalizing sub, I get a better result.

print(re.sub(r'\s+', ' ',
                re.sub(generated_expression, ' ', de_spaced_text)), '\n\n')

Missouri PART 1 THIS WORLD SECTION 1 Of the Nature of Missouri I call our world Missouri, not because we call it so, but to make its nature clearer to you, my happy readers, who are privileged to live in Space. Imagine a vast sheet of paper on which straight Lines, Triangles, Squares, Pentagons, Hexagons, and other figures, instead of remaining fixed in their places, move freely about, on or in the surface, but without the power of rising above or sinking below it, very much like shadows--only hard with luminous edges--and you will then have a pretty correct notion of my country and countrymen. Alas, a few years ago, I should have said "my universe:" 


ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
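
A hedged side note on the generated character class: as far as I can tell, wrapping `string.punctuation` in brackets works only because of how its characters happen to be ordered (the backslash ends up escaping the `]`, and `,-.` reads as a range that includes the hyphen). A more defensive sketch, which is my suggestion and not part of the cell above, passes the punctuation through `re.escape` first. The name `escaped_expression` and the sample string are assumptions for illustration.

```python
import re
import string

# Escape every punctuation character so that "]", "\", "^" and "-" are
# always treated as literals inside the brackets.
escaped_expression = '[' + re.escape(string.punctuation) + ']'

sample = 'shadows--only hard, "my universe:"'   # phrases borrowed from the excerpt

# Replace punctuation with spaces, then normalize the runs of whitespace.
spaced = re.sub(escaped_expression, ' ', sample)
normalized = re.sub(r'\s+', ' ', spaced).strip()

print(normalized)   # shadows only hard my universe

# Sanity check: every punctuation character matches the class on its own.
covers_all = all(re.fullmatch(escaped_expression, ch) for ch in string.punctuation)
print(covers_all)   # True
```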

## Patterns, plus a finditer, plus a little code . . . 

In this case, we can create a simple Key Word in Context (KWIC) display.  Three or four times, because I want to demo a kind of progressive elaboration of the pattern, and to play with converting an iterator to a sequence.

Note that here we introduce a find-type function which uses a regular expression to search a text for matching sequences.

Python's re module has at least three find-type functions: match (which checks for a match only at the beginning of the string), search (which checks anywhere in the string and returns the first match), and finditer.  [See the match vs search doc](https://docs.python.org/3.6/library/re.html#search-vs-match).  I tend to use re.finditer a lot, since it searches the whole string and returns an iterator which provides information about all matches.
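
A quick sketch of how the find-type functions differ, using a made-up sentence rather than the Flatland text:

```python
import re

sample = 'A Circle is not a circle until it is drawn.'   # made-up sample string

# match only looks at the very beginning of the string, so this finds nothing.
at_start = re.match('circle', sample, re.IGNORECASE)

# search scans the whole string and stops at the first match.
anywhere = re.search('circle', sample, re.IGNORECASE)

# finditer yields a match object for every non-overlapping match.
all_spans = [m.span() for m in re.finditer('circle', sample, re.IGNORECASE)]

print(at_start)           # None
print(anywhere.group())   # Circle
print(all_spans)          # [(2, 8), (18, 24)]
```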

The non-regex parts of this line:

    for m in list(re.finditer('circle', de_spaced_text, re.IGNORECASE))[:10]:
    
may be a little complicated.  It's complicated because I only want the first 10 matches ("m" is my name for a match).  This:

    for m in re.finditer('circle', de_spaced_text, re.IGNORECASE)[:10]:
    
won't work because re.finditer returns an iterator, which provides access to the items one-by-one, but which can't be accessed all at once, or sliced ("\[:10\]") like a sequence.  Casting the iterator to a list ("list( . . . )") causes the iterator to realize/instantiate all its matches; it's effectively like saying:

    for m in [m for m in re.finditer('circle', de_spaced_text, re.IGNORECASE)][:10]:
    
i.e., like using a list comprehension to force the iterator into a list.
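
If realizing every match just to keep ten of them feels wasteful, one alternative (my suggestion, not what the cell below does) is `itertools.islice`, which takes the first N items from an iterator without building the full list. The sample text here is made up.

```python
import re
from itertools import islice

sample = 'circle ' * 25   # made-up text containing 25 matches

# Slicing the iterator directly raises a TypeError.
try:
    re.finditer('circle', sample, re.IGNORECASE)[:10]
    slicing_failed = False
except TypeError:
    slicing_failed = True

# islice pulls only the first 10 matches and then stops.
first_ten = list(islice(re.finditer('circle', sample, re.IGNORECASE), 10))
starts = [m.start() for m in first_ten]

print(slicing_failed)   # True
print(len(first_ten))   # 10
print(starts)           # [0, 7, 14, 21, 28, 35, 42, 49, 56, 63]
```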

In [7]:
import re, string
    
text = open('data/pg97_Abbott_Edwin_Flatland.txt').read()

de_spaced_text = re.sub(r'\s+', ' ', text.strip())

for m in list(re.finditer('circle', de_spaced_text, re.IGNORECASE))[:10]:
    
    a = m.start() - 40
    if a < 0:
        a = 0
    b = m.end() + 40
    if b > len(de_spaced_text):
        b = len(de_spaced_text)
        
    print(de_spaced_text[a: b])

print('\n\n')

for m in [m for m in re.finditer('circle', de_spaced_text, re.IGNORECASE)][:10]:
    
    a = m.start() - 40
    if a < 0:
        a = 0
    b = m.end() + 40
    if b > len(de_spaced_text):
        b = len(de_spaced_text)
        
    print(de_spaced_text[a: b])

print('\n\n')

for m in list(re.finditer(r'circle[\s' + string.punctuation + ']', de_spaced_text, re.IGNORECASE))[:10]:
    
    a = m.start() - 40
    if a < 0:
        a = 0
    b = m.end() + 40
    if b > len(de_spaced_text):
        b = len(de_spaced_text)
        
    print(de_spaced_text[a: b])

print('\n\n')

for m in list(re.finditer(r'\scircle[\s' + string.punctuation + ']', de_spaced_text, re.IGNORECASE))[:10]:
    
    a = m.start() - 40
    if a < 0:
        a = 0
    b = m.end() + 40
    if b > len(de_spaced_text):
        b = len(de_spaced_text)
        
    print(de_spaced_text[a: b])

print('\n\n')

it, look down upon it. It will appear a circle. But now, drawing back to the edge of t
 a Triangle, Square, Pentagon, Hexagon, Circle, what you will--a straight Line he look
e figure cannot be distinguished from a circle, he is included in the Circular or Prie
gth too much even for the wisdom of the Circles. But a wise ordinance of Nature has de
of this Law of Nature, the Polygons and Circles are almost always able to stifle sedit
l body of their brethren whom the Chief Circle keeps in pay for emergencies of this ki
 it has been found by the wisest of our Circles or Statesmen that the multiplication o
sh promises by which the more judicious Circle can in a moment pacify his consort. The
me of the Isosceles; and by many of our Circles the destructiveness of the Thinner Sex
ursuits; and the cautious wisdom of the Circles has ensured safety at the cost of dome



it, look down upon it. It will appear a circle. But now, drawing back to the edge of t
 a Triangle, Square, Pentagon, Hexagon, 

## More "regex + a little code"

Here, we get the 25 most common words in *Flatland*, "most common" meaning, "except the stuff like 'and', 'the', 'a', etc."

[nltk](https://www.nltk.org/) is more or less the standard way to start learning Natural Language Processing in Python, and there's [a standard book](https://www.nltk.org/book/) available online, which everyone has at least looked at one time or another.  The library has other NLTK books online; simply search for nltk on [the library's website](https://library.wustl.edu/).

[Counter](https://docs.python.org/3/library/collections.html#collections.Counter) is super useful, since we're likely to find ourselves constantly needing to count and order items in sequences.

I find that this pattern

     re.split('[^a-z]', text.strip().lower())
     
is one I use quite often; in effect, it converts a plain text file into a list of lower case words . . . it's a kind of quick-and-dirty [tokenization](https://www.techopedia.com/definition/13698/tokenization).  However, for a more rigorous approach, I always turn to a natural language processing library with more sophisticated rules.  Why?  Consider the word

    quick-and-dirty
    
My regex approach yields \["quick", "and", "dirty"\]; I would hope that a real NLP library would leave "quick-and-dirty" all together as one token.
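
Here is a small sketch of that behavior on a made-up sentence. The second pattern, which keeps single-hyphen compounds together, is only my own illustration of a slightly more careful regex; it is not what NLTK or any other NLP library actually does.

```python
import re

sample = 'A quick-and-dirty approach: shadows--only hard.'   # made-up sample

# Splitting on every non-letter breaks hyphenated words apart and leaves
# empty strings wherever two separators sit side by side.
raw_tokens = re.split('[^a-z]', sample.strip().lower())
tokens = [w for w in raw_tokens if w]

# A hypothetical alternative: find runs of letters, optionally joined by
# single hyphens, so "quick-and-dirty" survives but "shadows--only" splits.
hyphen_tokens = re.findall(r'[a-z]+(?:-[a-z]+)*', sample.lower())

print(raw_tokens)
print(tokens)          # ['a', 'quick', 'and', 'dirty', 'approach', 'shadows', 'only', 'hard']
print(hyphen_tokens)   # ['a', 'quick-and-dirty', 'approach', 'shadows', 'only', 'hard']
```

The empty strings in `raw_tokens` are the reason the word-count cell filters its tokens before counting them.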

In [10]:
import re
from collections import Counter
from nltk.corpus import stopwords

sw = set(stopwords.words('english'))
    
some_list = ['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e', 'a', 'b', 'c']
    
for word in Counter(some_list).most_common():
    print(word)

print('\n\n')
    
for word, word_count in Counter(some_list).most_common():
    print(word, word_count)

print('\n\n')
    
text = open('data/pg97_Abbott_Edwin_Flatland.txt').read()

for word, word_count in Counter([w for w in re.split('[^a-z]', text.strip().lower()) 
                                     if w and w not in sw]).most_common(25):
    print(word, word_count)

('a', 3)
('b', 3)
('c', 3)
('d', 2)
('e', 2)



a 3
b 3
c 3
d 2
e 2



one 157
would 114
see 113
line 95
flatland 81
two 80
could 77
three 77
must 76
even 72
sides 63
circle 62
square 59
every 59
say 58
sphere 58
may 57
space 56
straight 55
us 53
women 53
yet 50
woman 50
said 49
point 49


## Extra Credit

What happens if, instead of

    '[^a-z]'
    
I used

    '([^a-z])'
   
What character occurs 31,452 times?  3,489 times?

In [11]:
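import re
from collections import Counter
from nltk.corpus import stopwords

#  Rebuild the stopword set so that this cell also runs on its own.
sw = set(stopwords.words('english'))
    
text = open('data/pg97_Abbott_Edwin_Flatland.txt').read()

for word, word_count in Counter([w for w in re.split('([^a-z])', text.strip().lower()) 
                                     if w and w not in sw]).most_common(25):
    print(word, word_count)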

  31452

 3489
, 2443
. 1139
- 662
" 541
; 373
one 157
? 143
' 124
would 114
see 113
line 95
flatland 81
two 80
( 78
) 78
could 77
three 77
must 76
! 74
even 72
: 71
sides 63
circle 62
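
Why do separators show up in the counts? When the pattern passed to `re.split` contains capturing parentheses, the text matched by the group is returned in the result list alongside the pieces between the matches. A minimal sketch on a made-up string:

```python
import re
from collections import Counter

sample = 'a b,c d'   # made-up sample string

# Without a capturing group, the separators are thrown away.
without_group = re.split('[^a-z]', sample)

# With a capturing group, each separator is kept as its own list item.
with_group = re.split('([^a-z])', sample)

# Counting the items now counts the separators too.
separator_counts = Counter(w for w in with_group if not w.isalpha())

print(without_group)      # ['a', 'b', 'c', 'd']
print(with_group)         # ['a', ' ', 'b', ',', 'c', ' ', 'd']
print(separator_counts)   # Counter({' ': 2, ',': 1})
```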
