## Regular Expressions


![](regular_expressions.png "")

Class notes can be found at [**https://github.com/spenteco/substitute_teaching**](https://github.com/spenteco/substitute_teaching).

More importantly, the Python documentation for regular expressions is at [**https://docs.python.org/3.7/howto/regex.html**](https://docs.python.org/3.7/howto/regex.html).

### First, let's get some text:

In [7]:
import urllib.request  # a bare "import urllib" does not reliably load the request submodule

text = urllib.request.urlopen('https://www.gutenberg.org/files/11/11-0.txt').read().decode('utf-8') 

#for a, l in enumerate(text.split('\n')):
#    print(a, l)
    
text = '\n'.join(text.split('\n')[31:3370])

print(text)

ALICE’S ADVENTURES IN WONDERLAND

Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0




CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, ‘and what is the use of a book,’ thought Alice ‘without pictures or
conversations?’

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

There was nothing so VERY remarkable in that; nor did Alice think it so
VERY much out of the way to hear the Rabbit say to itself, ‘Oh dear!
Oh dear! I shall be late!’ (when she thought it over afterwards, it
occurred to her that she ought to have wondered a

## match vs search vs finditer

See [**match vs search**](https://docs.python.org/3.7/howto/regex.html#match-versus-search) in the Python documentation.

I usually use `finditer` for this; it yields a match object for every occurrence, not just the first one.
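
As a rough illustration of the difference, here is a small sketch on a made-up sample string (the string and variable names are assumptions for this example, not part of the class notes). `re.match` only looks at the start of the string, `re.search` returns the first match anywhere, and `re.finditer` yields every non-overlapping match.

```python
import re

# Made-up sample text, for illustration only.
sample = 'Down the Rabbit-Hole Alice followed; Alice fell.'

# match() succeeds only if the pattern matches at position 0.
at_start = re.match(r'Alice', sample)
print(at_start)

# search() scans forward and returns the first match found anywhere.
first = re.search(r'Alice', sample)
print(first.start(), first.end())

# finditer() yields one match object per non-overlapping occurrence.
spans = [(m.start(), m.end()) for m in re.finditer(r'Alice', sample)]
print(spans)
```

Because each match object carries its own `start()` and `end()`, `finditer` is a natural fit for cutting a window of context around every hit.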

Two regular expressions in this cell:

    r'\s+'
    
    r'Alice'
    
* special sequences ("\s") 
* special characters ("+")
* character literal ("Alice")
* raw Python strings ("r'\ . . . ")
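
The pieces in this list can be tried out on a short string before running them on the whole book. This is a minimal sketch; the `messy` string is invented for the example.

```python
import re

# Made-up sample with irregular whitespace, for illustration only.
messy = 'Alice   was\r\n\r\nbeginning \t to get very tired'

# \s is a special sequence (any whitespace character); + means "one or more".
# Together, \s+ matches each whole run of whitespace.
tidy = re.sub(r'\s+', ' ', messy)
print(tidy)

# A character literal matches itself, letter for letter.
found = re.findall(r'Alice', tidy)
print(found)

# Without the r prefix, Python processes backslash escapes before re sees them:
# '\b' is a single backspace character, while r'\b' is a backslash plus "b",
# which re interprets as a word boundary.
print(len('\b'), len(r'\b'))
```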

In [25]:
import re

def make_keyword_in_context(pattern_to_find, text_to_search):

    text_normalized_spaces = re.sub(r'\s+', r' ', text_to_search)

    for match in re.finditer(pattern_to_find, text_normalized_spaces, flags=re.IGNORECASE):

        start_snippet = match.start() - 40
        end_snippet = match.end() + 40

        if start_snippet < 0:
            start_snippet = 0
        if end_snippet > len(text_normalized_spaces):
            end_snippet = len(text_normalized_spaces)

        print(text_normalized_spaces[start_snippet: end_snippet])

# ------------------------------------------------------------------
        
make_keyword_in_context(r'Alice', text)

ALICE’S ADVENTURES IN WONDERLAND Lewis Carrol
ION 3.0 CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired of sitt
and what is the use of a book,’ thought Alice ‘without pictures or conversations?’ So
ing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to
 and looked at it, and then hurried on, Alice started to her feet, for it flashed acr
 the hedge. In another moment down went Alice after it, never once considering how in
 dipped suddenly down, so suddenly that Alice had not a moment to think about stoppin
ds as she fell past it. ‘Well!’ thought Alice to herself, ‘after such a fall as this,
d miles down, I think--’ (for, you see, Alice had learnt several things of this sort 
at Latitude or Longitude I’ve got to?’ (Alice had no idea what Latitude was, or Longi
 down. There was nothing else to do, so Alice soon began talking again. ‘Dinah’ll mis
t do cats eat bats, I wonder?’ And here Alice began to get rather sleepy, and went on
 and dry

## Substitution

In [26]:
new_text = re.sub(r'\bAlice\b', 'Jane', text, flags=re.IGNORECASE)

make_keyword_in_context(r'Jane', new_text)

Jane’S ADVENTURES IN WONDERLAND Lewis Carrol
ION 3.0 CHAPTER I. Down the Rabbit-Hole Jane was beginning to get very tired of sitt
and what is the use of a book,’ thought Jane ‘without pictures or conversations?’ So
ing so VERY remarkable in that; nor did Jane think it so VERY much out of the way to
 and looked at it, and then hurried on, Jane started to her feet, for it flashed acr
 the hedge. In another moment down went Jane after it, never once considering how in
 dipped suddenly down, so suddenly that Jane had not a moment to think about stoppin
ds as she fell past it. ‘Well!’ thought Jane to herself, ‘after such a fall as this,
d miles down, I think--’ (for, you see, Jane had learnt several things of this sort 
at Latitude or Longitude I’ve got to?’ (Jane had no idea what Latitude was, or Longi
 down. There was nothing else to do, so Jane soon began talking again. ‘Dinah’ll mis
t do cats eat bats, I wonder?’ And here Jane began to get rather sleepy, and went on
 and dry leaves, and

In [90]:
new_text = re.sub(r'\bsaid (\w+)', r'hollered \g<1> loudly', text)

make_keyword_in_context(r'loudly', new_text)

allen by this time?’ she hollered aloud loudly. ‘I must be getting somewhere near the 
ly was not here before,’ hollered Alice loudly,) and round the neck of the bottle was 
What a curious feeling!’ hollered Alice loudly; ‘I must be shutting up like a telescop
se in crying like that!’ hollered Alice loudly to herself, rather sharply; ‘I advise y
te a little bit, and hollered anxiously loudly to herself, ‘Which way? Which way?’, ho
be ashamed of yourself,’ hollered Alice loudly, ‘a great girl like you,’ (she might we
are not the right words,’ hollered poor loudly Alice, and her eyes filled with tears a
g all alone here!’ As she hollered this loudly she looked down at her hands, and was s
at WAS a narrow escape!’ hollered Alice loudly, a good deal frightened at the sudden c
bad, that it is!’ As she hollered these loudly words her foot slipped, and in another 
an go back by railway,’ she hollered to loudly herself. (Alice had been to the seaside
I hadn’t cried so much!’ hollered Alice lou
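
Notice in the output above that `(\w+)` captures whatever word follows "said", so "loudly" also lands after "poor", "to" and "this". One possible refinement, sketched here on a made-up sentence, is to require the captured word to start with a capital letter. This is only one way to narrow the pattern, and it would still catch any capitalized word, not just names.

```python
import re

# Made-up sample sentence, for illustration only.
sample = "‘Oh dear!’ said Alice. ‘Hush,’ said poor Alice, and she said to herself."

# Group 1 is whatever word follows "said"; \g<1> re-inserts it in the replacement.
naive = re.sub(r'\bsaid (\w+)', r'hollered \g<1> loudly', sample)
print(naive)

# Requiring a capital first letter limits the rewrite to "said <Capitalized word>".
stricter = re.sub(r'\bsaid ([A-Z]\w*)', r'hollered \g<1> loudly', sample)
print(stricter)
```

`\g<1>` means the same as `\1`, but stays unambiguous when the replacement text continues with a digit.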

## What are all the capitalized words?

<span style="color:red; font-weight: bold">Why doesn't it work?</span>

In [85]:
make_keyword_in_context(r'\b[A-Z].{3,10}\b', new_text)

ALICE’S ADVENTURES IN WONDERLAND Lewis Carroll T
ALICE’S ADVENTURES IN WONDERLAND Lewis Carroll THE MILLENNI
ALICE’S ADVENTURES IN WONDERLAND Lewis Carroll THE MILLENNIUM FULCRUM EDI
ALICE’S ADVENTURES IN WONDERLAND Lewis Carroll THE MILLENNIUM FULCRUM EDITION 3
ALICE’S ADVENTURES IN WONDERLAND Lewis Carroll THE MILLENNIUM FULCRUM EDITION 3.0 CHAPTER 
ENTURES IN WONDERLAND Lewis Carroll THE MILLENNIUM FULCRUM EDITION 3.0 CHAPTER I. Down the 
WONDERLAND Lewis Carroll THE MILLENNIUM FULCRUM EDITION 3.0 CHAPTER I. Down the Rabbit-H
ND Lewis Carroll THE MILLENNIUM FULCRUM EDITION 3.0 CHAPTER I. Down the Rabbit-Hole Alice w
roll THE MILLENNIUM FULCRUM EDITION 3.0 CHAPTER I. Down the Rabbit-Hole Alice was beginning
LLENNIUM FULCRUM EDITION 3.0 CHAPTER I. Down the Rabbit-Hole Alice was beginning to get v
FULCRUM EDITION 3.0 CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired o
ION 3.0 CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired of sitting b
APTER I.

ttle shrieks, and more sounds of broken glass. ‘What a number of cucumber-frames there m
ieks, and more sounds of broken glass. ‘What a number of cucumber-frames there must be!
nd more sounds of broken glass. ‘What a number of cucumber-frames there must be!’ thought 
unds of broken glass. ‘What a number of cucumber-frames there must be!’ thought Alice. ‘I
roken glass. ‘What a number of cucumber-frames there must be!’ thought Alice. ‘I wonder
lass. ‘What a number of cucumber-frames there must be!’ thought Alice. ‘I wonder what they’
 a number of cucumber-frames there must be!’ thought Alice. ‘I wonder what they’ll do
mber of cucumber-frames there must be!’ thought Alice. ‘I wonder what they’ll do next! A
cucumber-frames there must be!’ thought Alice. ‘I wonder what they’ll do next! As for pull
rames there must be!’ thought Alice. ‘I wonder what they’ll do next! As for pulling me out 
must be!’ thought Alice. ‘I wonder what they’ll do next! As for pulling me out of the windo
hought Alice

dly Hatter. ‘You might just as well say that “I see what I eat” is the same thing as “I eat
‘You might just as well say that “I see what I eat” is the same thing as “I eat what I see
ust as well say that “I see what I eat” is the same thing as “I eat what I see”!’ ‘You migh
say that “I see what I eat” is the same thing as “I eat what I see”!’ ‘You might just as we
see what I eat” is the same thing as “I eat what I see”!’ ‘You might just as well say,’ add
eat” is the same thing as “I eat what I see”!’ ‘You might just as well say,’ added the Marc
same thing as “I eat what I see”!’ ‘You might just as well say,’ added the March Hare, ‘tha
as “I eat what I see”!’ ‘You might just as well say,’ added the March Hare, ‘that “I like w
 I see”!’ ‘You might just as well say,’ added the March Hare, ‘that “I like what I get” is
‘You might just as well say,’ added the March Hare, ‘that “I like what I get” is the same 
st as well say,’ added the March Hare, ‘that “I like what I get” is the same thing 

e--evening, Beautiful, beautiful Soup!’ CHAPTER XI. Who Stole the Tarts? The King and Quee
Beautiful, beautiful Soup!’ CHAPTER XI. Who Stole the Tarts? The King and Queen of Hearts 
 beautiful Soup!’ CHAPTER XI. Who Stole the Tarts? The King and Queen of Hearts were seated
Soup!’ CHAPTER XI. Who Stole the Tarts? The King and Queen of Hearts were seated on their
APTER XI. Who Stole the Tarts? The King and Queen of Hearts were seated on their throne wh
Who Stole the Tarts? The King and Queen of Hearts were seated on their throne when they ar
the Tarts? The King and Queen of Hearts were seated on their throne when they arrived, with
he King and Queen of Hearts were seated on their throne when they arrived, with a great c
nd Queen of Hearts were seated on their throne when they arrived, with a great crowd assemb
Hearts were seated on their throne when they arrived, with a great crowd assembled ab
s were seated on their throne when they arrived, with a great crowd assembled about them-
ated
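
Two things appear to go wrong in the cell above. First, `make_keyword_in_context` always passes `re.IGNORECASE`, so `[A-Z]` matches lowercase letters too. Second, `.` matches any character, spaces included, so `.{3,10}` runs on across several words. The sketch below checks both on a made-up sentence (the sample and variable names are assumptions for this example).

```python
import re

# Made-up sample sentence, for illustration only.
sample = 'Alice was beginning to get very tired of sitting by her sister'

# With IGNORECASE, [A-Z] accepts any letter, and "." happily consumes spaces.
broken = re.findall(r'\b[A-Z].{3,10}\b', sample, flags=re.IGNORECASE)
print(broken)

# Case-sensitive matching plus \S (any non-whitespace character)
# keeps each match inside a single capitalized word.
fixed = re.findall(r'\b[A-Z]\S{3,5}\b', sample)
print(fixed)
```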

In [86]:
import re

def output_result_line(match, text_normalized_spaces):

    start_snippet = match.start() - 40
    end_snippet = match.end() + 40

    if start_snippet < 0:
        start_snippet = 0
    if end_snippet > len(text_normalized_spaces):
        end_snippet = len(text_normalized_spaces)

    print(text_normalized_spaces[start_snippet: end_snippet])

def better_keyword_in_context(pattern_to_find, text_to_search, flags=re.IGNORECASE):

    text_normalized_spaces = re.sub(r'\s+', r' ', text_to_search)

    # re.finditer expects an integer flag value, so treat None as "no flags" (0).
    for match in re.finditer(pattern_to_find, text_normalized_spaces, flags=flags or 0):
        output_result_line(match, text_normalized_spaces)
        
# --------------------------------------------------------------------

better_keyword_in_context(r'\b[A-Z]\S{3,5}\b', new_text, flags=None)

ALICE’S ADVENTURES IN WONDERLAND Lewis Carroll
ALICE’S ADVENTURES IN WONDERLAND Lewis Carroll THE MILLENNIUM FULCRUM EDITION 
LLENNIUM FULCRUM EDITION 3.0 CHAPTER I. Down the Rabbit-Hole Alice was beginning to 
FULCRUM EDITION 3.0 CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very ti
 EDITION 3.0 CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired o
ION 3.0 CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired of sitt
and what is the use of a book,’ thought Alice ‘without pictures or conversations?’ So
nd picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
king the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There 
Rabbit with pink eyes ran close by her. There was nothing so VERY remarkable in that;
 ran close by her. There was nothing so VERY remarkable in that; nor did Alice think
ing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to
able in that; n

, if the Mock Turtle would be so kind,’ Alice replied, so eagerly that the Gryphon sa
ed tone, ‘Hm! No accounting for tastes! Sing her “Turtle Soup,” will you, old fellow
Hm! No accounting for tastes! Sing her “Turtle Soup,” will you, old fellow?’ The Mock 
accounting for tastes! Sing her “Turtle Soup,” will you, old fellow?’ The Mock Turtl
urtle Soup,” will you, old fellow?’ The Mock Turtle sighed deeply, and began, in a v
 Soup,” will you, old fellow?’ The Mock Turtle sighed deeply, and began, in a voice so
d with sobs, to sing this:-- ‘Beautiful Soup, so rich and green, Waiting in a hot tu
 Who for such dainties would not stoop? Soup of the evening, beautiful Soup! Soup of
t stoop? Soup of the evening, beautiful Soup! Soup of the evening, beautiful Soup! B
p? Soup of the evening, beautiful Soup! Soup of the evening, beautiful Soup! Beau--o
ul Soup! Soup of the evening, beautiful Soup! Beau--ootiful Soo--oop! Beau--ootiful 
p! Soup of the evening, beautiful Soup! Beau--ootiful Soo--o

## Splitting

In [87]:
import re

tokens = re.split(r'\b', text)

print(tokens[:50])
print()
print()

tokens = [t.strip() for t in re.split(r'\b', text) if t.strip()]

print(tokens[:50])
print()

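# Spell out both letter ranges: A-z would also cover [ \ ] ^ _ and the backtick.
tokens = re.split(r'[^A-Za-z]', text)

print(tokens[:50])
print()

tokens = [t for t in re.split(r'[^A-Za-z0-9]', text) if t]

print(tokens[:50])
print()

# Splitting on the letters and digits instead keeps what lies between the words.
tokens = [t for t in re.split(r'[A-Za-z0-9]', text) if t]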

print(tokens[:50])
print()

['', 'ALICE', '’', 'S', ' ', 'ADVENTURES', ' ', 'IN', ' ', 'WONDERLAND', '\r\n\r\n', 'Lewis', ' ', 'Carroll', '\r\n\r\n', 'THE', ' ', 'MILLENNIUM', ' ', 'FULCRUM', ' ', 'EDITION', ' ', '3', '.', '0', '\r\n\r\n\r\n\r\n\r\n', 'CHAPTER', ' ', 'I', '. ', 'Down', ' ', 'the', ' ', 'Rabbit', '-', 'Hole', '\r\n\r\n', 'Alice', ' ', 'was', ' ', 'beginning', ' ', 'to', ' ', 'get', ' ', 'very']


['ALICE', '’', 'S', 'ADVENTURES', 'IN', 'WONDERLAND', 'Lewis', 'Carroll', 'THE', 'MILLENNIUM', 'FULCRUM', 'EDITION', '3', '.', '0', 'CHAPTER', 'I', '.', 'Down', 'the', 'Rabbit', '-', 'Hole', 'Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she']

['ALICE', 'S', 'ADVENTURES', 'IN', 'WONDERLAND', '', '', '', 'Lewis', 'Carroll', '', '', '', 'THE', 'MILLENNIUM', 'FULCRUM', 'EDITION', '', '', '', '', '', '', '', '', '', '', '', '', '', 'CHAPTER', 'I', '', 'Down'
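
One pitfall worth knowing when writing letter ranges: `[A-z]` is not the same as `[A-Za-z]`. In ASCII, the characters `[`, `\`, `]`, `^`, `_` and the backtick sit between `Z` and `a`, so the range `A-z` includes them as well. A small sketch on a made-up string shows the difference.

```python
import re

# Made-up sample containing characters that fall between "Z" and "a" in ASCII.
sample = 'the_Queen [later editions] said `off` with his head'

# The range A-z treats _ [ ] and ` as if they were letters, so they stay in the tokens.
loose = [t for t in re.split(r'[^A-z]', sample) if t]
print(loose)

# Spelling out both ranges keeps letters only.
strict = [t for t in re.split(r'[^A-Za-z]', sample) if t]
print(strict)
```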

In [51]:
from collections import Counter

# The text is lowercased first, so a-z is the only letter range needed.
words = [t for t in re.split(r'[^a-z]', text.lower()) if t]

for word, n_occurrences in Counter(words).most_common(25):
    print(word, n_occurrences)



the 1644
and 872
to 729
a 632
it 595
she 553
i 543
of 514
said 462
you 411
alice 398
in 369
was 357
that 315
as 263
her 248
t 218
at 212
s 201
on 193
all 182
with 181
had 178
but 170
for 153
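
The lone "t" and "s" in this list are most likely fragments of contractions and possessives ("don’t", "Alice’s"), because splitting on every non-letter also splits at the apostrophe. One possible alternative, sketched here on a made-up sentence, is to match the words themselves with `re.findall` and allow an apostrophe inside a word. The pattern is an assumption for this example, not the approach used in the class notes.

```python
import re
from collections import Counter

# Made-up sample sentence, for illustration only.
sample = "‘I can’t explain myself,’ said Alice, ‘because I’m not myself, you see.’ It’s Alice’s turn."

# Splitting on non-letters breaks contractions into fragments such as "t" and "s".
split_words = [t for t in re.split(r'[^a-z]', sample.lower()) if t]

# Matching tokens instead: letters, optionally followed by an apostrophe
# (straight or curly) and more letters, keeps "can’t" and "alice’s" whole.
word_pattern = r"[a-z]+(?:['’][a-z]+)*"
whole_words = re.findall(word_pattern, sample.lower())

print(Counter(split_words).most_common(5))
print(Counter(whole_words).most_common(5))
```

Because the apostrophe must be followed by at least one letter, a closing quotation mark at the end of a word is not absorbed into the token.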
