Introduction to NLP course (2017-2018).

Notebook 2: Regular Expressions. Tokenization.

by Venelin Kovatchev, University of Barcelona

In [1]:
# Import section

# Import nltk
import nltk
from nltk import word_tokenize
from nltk import regexp_tokenize
# Import regular expressions
import re

# Import corpora
from nltk.corpus import treebank_raw
from nltk.corpus import treebank_chunk

1) Tokenization: 

- input: a string (raw corpus) 
- output: a list of strings; each element of the list is a linguistic unit (a token)

In [2]:
# Corpus in raw format (input):
corpus_text = treebank_raw.raw(treebank_raw.fileids())
print("\nInput of the tokenizer: \n {0}".format(corpus_text[:50]))

# Corpus in tokenized format (desired output)
corpus_token = treebank_chunk.words(treebank_chunk.fileids())
print("\nDesired output of the tokenizer: \n {0}".format([str(tok) for tok in corpus_token[:10]]))


Input of the tokenizer: 
 .START 

Pierre Vinken, 61 years old, will join th

Desired output of the tokenizer: 
 ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the']


2) Tokenization is more complex than .split()

- punctuation
- multi-word expressions
- special characters

In [3]:
# Naive tokenization: split on single space characters (newlines and tabs are left inside the tokens)
corpus_naive = corpus_text.split(" ")
print("\nSeparate the corpus by whitespaces: \n {}".format([str(tok) for tok in corpus_naive[:10]]))
print("\nDesired output of the tokenizer: \n {0}".format([str(tok) for tok in corpus_token[:10]]))


Separate the corpus by whitespaces: 
 ['.START', '\n\nPierre', 'Vinken,', '61', 'years', 'old,', 'will', 'join', 'the', 'board']

Desired output of the tokenizer: 
 ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the']
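
One small step beyond splitting on spaces is to treat every punctuation mark as a token of its own. The following is only a minimal sketch using the re module; the pattern and the names `sample` and `tokens` are illustrative assumptions, not part of the course material:

```python
import re

sample = "Pierre Vinken, 61 years old, will join the board"

# \w+ grabs a run of letters/digits; [^\w\s] grabs one punctuation mark
tokens = re.findall(r"\w+|[^\w\s]", sample)
print(tokens[:10])
# ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the']
```

For this sentence the result agrees with the desired output, but the rule is still too crude for the cases listed in section 3.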


3) More non-trivial examples that can't be handled by splitting on whitespace and punctuation.

- R.E.N.F.E
- 16:20
- $100m
- Mr. 
- 4:1
- flight LU-345
- jacapetti@miami.maffia.com 

Problems with multi-word expressions:

- J.A. Capetti 
- Mr. Capetti 
- New York
- kick the bucket
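
To see why these examples are hard, it may help to run a simple "words or single punctuation marks" rule over some of them. This is a sketch only; the pattern and the names `examples` and `broken` are assumptions made for illustration:

```python
import re

examples = ["R.E.N.F.E", "16:20", "$100m", "LU-345", "jacapetti@miami.maffia.com"]

# Split each example into runs of word characters and single punctuation marks
broken = {text: re.findall(r"\w+|[^\w\s]", text) for text in examples}

for text in examples:
    print("{0!r:32} -> {1}".format(text, broken[text]))
# '16:20' becomes ['16', ':', '20'], 'LU-345' becomes ['LU', '-', '345'], etc.
```

Every example is cut into several pieces, although each one should be a single token.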

4) Regular expressions and rule based tokenization:

- for English, a standard approach to tokenization is to use a combination of regular expression rules and a dictionary
- a regular expression is a sequence of characters that defines a search pattern. In Python, regular expressions are available through the re module.
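
The combination of rules and a dictionary can be sketched roughly as follows. This is not the method of any particular tokenizer: `LEXICON` and `rule_tokenize` are hypothetical names, and the two dictionary entries are chosen only for illustration.

```python
import re

# Hypothetical mini-dictionary of strings that must stay in one piece
LEXICON = ["New York", "Mr."]

def rule_tokenize(text, lexicon=LEXICON):
    """Tokenize with dictionary entries first, then generic regex rules."""
    # Longest entries first, so they win over shorter overlapping ones
    entries = "|".join(re.escape(e) for e in sorted(lexicon, key=len, reverse=True))
    # Fallback rules: a run of word characters, or one punctuation mark
    pattern = entries + r"|\w+|[^\w\s]"
    return re.findall(pattern, text)

result = rule_tokenize("Mr. Capetti flew to New York.")
print(result)
# ['Mr.', 'Capetti', 'flew', 'to', 'New York', '.']
```

Because alternatives are tried from left to right, placing the dictionary entries before the generic rules keeps "Mr." and "New York" intact.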

A very brief illustration of different regular expressions (for more detailed information, refer to the official Python documentation):

In [4]:
# Basic regular expression examples:

# Matching literal text
# Except for metacharacters (+ ? . * ^ $ ( ) [ ] { } | \), all characters
# match themselves. You can escape a metacharacter by preceding it with a backslash.

# re.findall(PATTERN, INPUT); the r prefix marks a raw string literal,
# so backslashes reach the regex engine unchanged
# This should match both occurrences of "dog"
print(re.findall(r"dog","the black dog and the blue dog")) 

['dog', 'dog']
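
The effect of the r prefix and of escaping can be checked with a short sketch. The sample sentence and variable names are assumptions for illustration; `\b` (word boundary) is used here only because it shows the difference clearly:

```python
import re

text = "We are risking $100m. The dog barked."

# Without the r prefix, "\b" is a backspace character, not a word boundary
plain = re.findall("\bdog\b", text)
# With the r prefix, \b reaches the regex engine as a word boundary
raw = re.findall(r"\bdog\b", text)

# A metacharacter such as $ must be escaped to match literally
dollars = re.findall(r"\$100", text)
# re.escape builds the escaped pattern from a literal string
escaped = re.findall(re.escape("$100"), text)

print(plain, raw, dollars, escaped)
# [] ['dog'] ['$100'] ['$100']
```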


In [5]:
# ^ matches the beginning of the string (and of every line when re.MULTILINE is set)
# This should only match the first "the"
print(re.findall(r"^the","the black dog and the blue dog")) 

['the']


In [6]:
# $ matches the end of the string (and of every line when re.MULTILINE is set)
# This should only match the last "dog"
print(re.findall(r"dog$","the black dog and the blue dog"))

['dog']
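
By default, Python anchors ^ and $ to the whole string; the re.MULTILINE flag makes them work line by line. A small sketch with an assumed two-line input:

```python
import re

two_lines = "the black dog\nthe blue dog"

# Default: ^ only matches at the very start, $ only at the very end
default_start = re.findall(r"^the", two_lines)
default_end = re.findall(r"dog$", two_lines)

# re.MULTILINE: ^ and $ also match at the start and end of each line
multiline_start = re.findall(r"^the", two_lines, re.MULTILINE)
multiline_end = re.findall(r"dog$", two_lines, re.MULTILINE)

print(default_start, default_end)      # ['the'] ['dog']
print(multiline_start, multiline_end)  # ['the', 'the'] ['dog', 'dog']
```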


In [7]:
# . matches any single character
# This should match both "dog" and "don"
print(re.findall(r"do.","the black dog and the blue don"))

['dog', 'don']


In [8]:
# [...] matches any single character in the bracket
# This should match both "dog" and "don", but not "dol"
print(re.findall(r"do[gn]","the black dog, the blue don and the yellow dol"))

['dog', 'don']


In [9]:
# [^...] matches any single character that is NOT in the bracket
# N.B.: do NOT confuse [^...] with ^ (beginning of the line)
# This should only match "dol"
print(re.findall(r"do[^gn]","the black dog, the blue don and the yellow dol"))

['dol']


In [10]:
# ? matches the preceding regular expression 0 or 1 times
# This should match "dg" and "dog", but not "doog"
print(re.findall(r"do?g","the black dg, the blue dog and the yellow doog"))

['dg', 'dog']


In [11]:
# + matches the preceding regular expression 1 or more times
# This should match "dog" and "doog", but not "dg"
print(re.findall(r"do+g","the black dg, the blue dog and the yellow doog"))

['dog', 'doog']


In [12]:
# * matches the preceding regular expression 0 or more times
# This should match "dg", "dog" and "doog"
print(re.findall(r"do*g","the black dg, the blue dog and the yellow doog"))

['dg', 'dog', 'doog']


In [13]:
# {m} matches exactly m repetitions, while {n,m} matches between n and m repetitions
# The first regex should match only "dog", the second one matches "dg" and "dog"
print(re.findall(r"do{1}g","the black dg, the blue dog and the yellow doog"))
print(re.findall(r"do{0,1}g","the black dg, the blue dog and the yellow doog"))

['dog']
['dg', 'dog']


In [14]:
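# () are used to group regular expressions. ? + * {} can be applied to groups the same way
# This should match "dodog" and "dododog", but not "dog"
# Because the pattern contains capturing groups, findall returns one tuple of
# group contents per match (a repeated group keeps only its last repetition)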
print(re.findall(r"(do){2,3}(g)","the black dog, the blue dodog and the yellow dododog"))

[('do', 'g'), ('do', 'g')]
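
If the full matched text is wanted instead of the group contents, one option is to make the group non-capturing with (?:...). A minimal sketch on the same sentence (the variable names are illustrative):

```python
import re

text = "the black dog, the blue dodog and the yellow dododog"

# Capturing groups: findall returns tuples with the content of each group
captured = re.findall(r"(do){2,3}(g)", text)

# Non-capturing group: findall returns the whole match as a string
whole = re.findall(r"(?:do){2,3}g", text)

print(captured)  # [('do', 'g'), ('do', 'g')]
print(whole)     # ['dodog', 'dododog']
```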


In [15]:
# a|b matches either a or b
# (?:...) groups without capturing; findall returns the two capturing groups of each match
# This should match "mansion" and "pansion", but not "mission"
print(re.findall(r"((?:man)|(?:pan))(sion)","the black mansion, the blue pansion and the yellow mission"))

[('man', 'sion'), ('pan', 'sion')]
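
Another way to get at both the full match and the individual groups is re.finditer, which yields match objects. A sketch with a slightly simplified version of the pattern above (the names `pattern`, `full_matches` and `prefixes` are assumptions):

```python
import re

text = "the black mansion, the blue pansion and the yellow mission"
pattern = re.compile(r"(man|pan)(sion)")

# group(0) is the whole match; group(1), group(2) are the capturing groups
full_matches = [m.group(0) for m in pattern.finditer(text)]
prefixes = [m.group(1) for m in pattern.finditer(text)]

print(full_matches)  # ['mansion', 'pansion']
print(prefixes)      # ['man', 'pan']
```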


5) Useful pre-built re matches:

- [0-9] matches any digit
- [a-z] matches any lowercase letter
- [A-Z] matches any uppercase letter
- [a-zA-Z0-9] matches any one of the above
- \s matches a whitespace character (including tab and newline)
- \S matches a NON whitespace character
- \t matches a tab
- \n matches newline
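- \w matches word characters: letters, digits and the underscore (equal to [a-zA-Z0-9_] for ASCII text; in Python 3 it also matches non-ASCII letters and digits)
- \W matches non-word characters (equal to [^a-zA-Z0-9_] for ASCII text)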
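
A quick way to compare these classes is to run them over the same string. The sample string below is made up for illustration; note how \w keeps the underscore while [a-zA-Z0-9] does not:

```python
import re

sample = "flight_no LU-345\tat 16:20\n"

digits = re.findall(r"[0-9]+", sample)          # runs of digits
words = re.findall(r"\w+", sample)              # letters, digits, underscore
alnum = re.findall(r"[a-zA-Z0-9]+", sample)     # no underscore
spaces = re.findall(r"\s", sample)              # space, tab, newline

print(digits)  # ['345', '16', '20']
print(words)   # ['flight_no', 'LU', '345', 'at', '16', '20']
print(alnum)   # ['flight', 'no', 'LU', '345', 'at', '16', '20']
print(spaces)  # [' ', '\t', ' ', '\n']
```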

6) RE exercises

Input text: 
J.A. Capetti will arrive by R.E.N.F.E. at 16:20. He will leave on 22nd November on flight LU-345. Contact him at jacapetti@miami.maffia.com and tell me something. The odds are 4:1 and we must prevent them from rising. We are risking $100m. Mr. Capetti is the capo in Miami and we have to go along.

Write a single regex for each of the following strings:

- J.A. Capetti
- R.E.N.F.E.
- 16:20
- 22nd November
- LU-345
- jacapetti@miami.maffia.com
- 4:1
- $100m.
- Mr.

Each regex should return only the required text when used with re.findall():

- do not match strings literally if you can avoid it
- keep redundancy as low as possible

In [16]:
# Input
my_text = "J.A. Capetti will arrive by R.E.N.F.E. at 16:20. He will leave on 22nd November on flight LU-345. Contact him at jacapetti@miami.maffia.com and tell me something. The odds are 4:1 and we must prevent them from rising. We are risking $100m. Mr. Capetti is the capo in Miami and we have to go along."

# Your regex here