# Modern Data Science
**(Module 07: Natural Language Processing)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.
- If you find any issue or bug in this document, please submit an issue at [tulip-lab/mds](https://github.com/tulip-lab/mds/issues)

Prepared by and for 
**Student Members** |
2006-2019 [TULIP Lab](http://www.tulip.org.au)

---

# Session C - Processing Raw Text

## Accessing Text from the Web and from Disk

### Ebook
Text number 2554 is an English translation of `Crime and Punishment`, and we can access it as follows.

In [None]:
import nltk
from urllib import request

In [None]:
url="http://www.gutenberg.org/files/2554/2554-0.txt"

In [None]:
response = request.urlopen(url)

In [None]:
raw = response.read().decode('utf8')

In [None]:
type(raw)

In [None]:
len(raw)

In [None]:
raw[:76]

* Note

The `read()` call will take a few seconds as it downloads this large book. If you're using an internet proxy that is not correctly detected by Python, you may need to specify the proxy manually, before using `urlopen`, as follows:
``` Python

    proxies = {'http': 'http://www.someproxy.com:3128'}
    # Install an opener so that later urlopen() calls go through the proxy.
    request.install_opener(request.build_opener(request.ProxyHandler(proxies)))
    
```

The variable `raw` contains a string with 1,176,967 characters. This is the raw content of the book, including many details we are not interested in such as whitespace, line breaks and blank lines. Notice the `\r` and `\n` in the opening line of the file, which is how Python displays the special carriage return and line feed characters. For our language processing, we want to break up the string into words and punctuation. This step is called **tokenization**, and it produces our familiar structure, a list of words and punctuation.

In [None]:
import nltk, re, pprint
from nltk import word_tokenize

In [None]:
tokens = word_tokenize(raw)

In [None]:
type(tokens)

In [None]:
len(tokens)

In [None]:
tokens[1:10]

Notice that NLTK was needed for **tokenization**, but not for any of the earlier tasks of opening a URL and reading it into a string. If we now take the further step of creating an NLTK text from this list, we can carry out all of the other linguistic processing, along with the regular list operations like **slicing**:

In [None]:
text = nltk.Text(tokens)

In [None]:
type(text)

In [None]:
text[1024:1062]

In [None]:
text.collocations()

Notice that Project Gutenberg appears as a collocation. This is because each text downloaded from Project Gutenberg contains a header with the name of the text, the author, the names of people who scanned and corrected the text, a license, and so on. Sometimes this information appears in a footer at the end of the file. We cannot reliably detect where the content begins and ends, and so have to resort to manual inspection of the file, to discover unique strings that mark the beginning and the end, before trimming raw to be just the content and nothing else:

In [None]:
raw.find("PART I")

In [None]:
raw.rfind("End of Project Gutenberg's Crime")

In [None]:
raw = raw[5336:1157743]

In [None]:
raw.find("PART I")

The `find()` and `rfind()` ("reverse find") methods help us get the right index values to use for slicing the string. We overwrite `raw` with this slice, so now it begins with "PART I" and goes up to (but not including) the phrase that marks the end of the content.
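
The hard-coded offsets above depend on the exact file that was downloaded, so they may not hold if the text on Project Gutenberg changes. A minimal sketch of the same trimming step that computes the indexes instead; the `trim_between` helper and the short sample string are hypothetical stand-ins for the downloaded book:

```python
def trim_between(text, start_marker, end_marker):
    """Return text from the first start_marker up to (not including) the last end_marker."""
    start = text.find(start_marker)
    end = text.rfind(end_marker)
    if start == -1 or end == -1:
        raise ValueError("marker not found")
    return text[start:end]

# Made-up sample with a header, some content and a footer.
sample = ("Header and license text\n"
          "PART I\nOn an exceptionally hot evening...\n"
          "End of Project Gutenberg's Crime\nFooter")
content = trim_between(sample, "PART I", "End of Project Gutenberg's Crime")
print(content.find("PART I"))  # 0, because the content now starts at the marker
```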

This was our first brush with the reality of the web: texts found on the web may contain unwanted material, and there may not be an automatic way to remove it. But with a small amount of extra work we can extract the material we need.

### Dealing with HTML

In [None]:
url = "http://www.gutenberg.org/files/2554/2554-h/2554-h.htm"
html = request.urlopen(url).read().decode('utf8')
html[:60]

You can type `print(html)` to see the HTML content in all its glory, including meta tags, an image map, JavaScript, forms, and tables.

In [None]:
from bs4 import BeautifulSoup

In [None]:
raw = BeautifulSoup(html, 'html.parser').get_text()

In [None]:
tokens = word_tokenize(raw)

In [None]:
tokens

This still contains unwanted material concerning site navigation and related stories. With some trial and error you can find the start and end indexes of the content and select the tokens of interest, and initialize a text as before.

In [None]:
tokens = tokens[110:390]
text = nltk.Text(tokens)
text

In [None]:
text.concordance('Crime')

### Reading Local Files

To read a local file, we use Python's built-in `open()` function, followed by the `read()` method. Suppose you have a file `document.txt`; you can load its contents like this:

In [None]:
import wget

In [None]:
link_to_data = 'https://github.com/tulip-lab/mds/raw/master/Jupyter/data/document.txt'

DataSet = wget.download(link_to_data)

In [None]:
f = open('document.txt')
raw = f.read()

If the interpreter couldn't find your file, you would have seen an error like this:

<img src='https://github.com/tulip-lab/mds/raw/master/Jupyter/image/error-example.png' width = '600' height = '600' align = center />

To check that the file you are trying to open is really in the right directory, examine the current directory from within Python.

In [None]:
import os
os.listdir()

Assuming that you can open the file, there are several methods for reading it.

In [None]:
raw

We can also read a file one line at a time using a `for` loop:

In [None]:
f = open('document.txt')  # text mode; the 'U' flag was removed in Python 3.11
for line in f:
    print(line.strip())

Here we use the `strip()` method to remove the newline character at the end of the input line.
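
The cells above leave the file open. A minimal sketch of the same line-by-line reading with a `with` statement, which closes the file automatically; it writes a small throwaway file first so that it does not depend on `document.txt` (the file name and contents are made up):

```python
import os
import tempfile

# Create a small sample file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf8") as f:
    f.write("Time flies like an arrow.\nFruit flies like a banana.\n")

# The with statement closes the file when the block ends, even on error.
with open(path, encoding="utf8") as f:
    lines = [line.strip() for line in f]

print(lines)
```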

In [None]:
raw = open('document.txt').read()

In [None]:
type(raw)

In [None]:
tokens = nltk.word_tokenize(raw)

In [None]:
type(tokens)

In [None]:
words = [w.lower() for w in tokens]

In [None]:
type(words)

In [None]:
vocab = sorted(set(words))

In [None]:
type(vocab)

We can concatenate strings with strings, and lists with lists, but we cannot concatenate strings with lists. The following cell raises a `TypeError`:

In [None]:
query = "who knows?"
beatles = ['john', 'paul', 'george', 'ringo']
query + beatles

## Strings

### Basic Operations with Strings

Strings are specified using single quotes or double quotes, as shown below. If a string contains a single quote, we must backslash-escape the quote so Python knows a literal quote character is intended, or else put the string in double quotes. Otherwise, the quote inside the string will be interpreted as a close quote, and the Python interpreter will report a syntax error:

In [None]:
monty = 'Monty Python'
monty

In [None]:
circus = "Monty Python's Flying Circus"
circus

In [None]:
circus = 'Monty Python's Flying Circus'
circus

Sometimes strings go over several lines. Python provides us with various ways of entering them. In the next example, a sequence of two strings is joined into a single string. We need to use backslash or parentheses so that the interpreter knows that the statement is not complete after the first line.

In [None]:
couplet = "Shall I compare thee to a Summer's day?"\
          "Thou are more lovely and more temperate:"

In [None]:
print(couplet)

In [None]:
couplet = ("Rough winds do shake the darling buds of May,"
          "And Summer's lease hath all too short a date:")

In [None]:
print(couplet)

Alternatively, we can use a triple-quoted string, which keeps the line break inside the string:

In [None]:
couplet = """Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:"""
print(couplet)

In [None]:
couplet = '''Rough winds do shake the darling buds of May,
And Summer's lease hath all too short a date:'''
print(couplet)

First let's look at the `+` operation, known as **concatenation**. It produces a new string that is a copy of the two original strings pasted together end-to-end. Notice that concatenation doesn't do anything clever like insert a space between the words. We can even multiply a string by an integer:

In [None]:
'very' + 'very' + 'very'

In [None]:
'very'*3

We've seen that the addition and multiplication operations apply to strings, not just numbers. However, note that we cannot use subtraction or division with strings:

In [None]:
'very' - 'y'

In [None]:
'very' / 2

These error messages are another example of Python telling us that we have got our data types in a muddle. In the first case, we are told that the operation of subtraction (i.e., `-`) cannot apply to objects of type `str` (strings), while in the second, we are told that division cannot take `str` and `int` as its two operands.
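
A small sketch that runs the four operations above and reports either the result or the `TypeError`, so the notebook can continue past the failing cases; the `try_operation` helper is hypothetical:

```python
def try_operation(description, operation):
    """Run operation() and report its result or the TypeError it raises."""
    try:
        return f"{description} -> {operation()!r}"
    except TypeError as err:
        return f"{description} -> TypeError: {err}"

results = [
    try_operation("'very' + 'very'", lambda: 'very' + 'very'),
    try_operation("'very' * 3", lambda: 'very' * 3),
    try_operation("'very' - 'y'", lambda: 'very' - 'y'),
    try_operation("'very' / 2", lambda: 'very' / 2),
]
for line in results:
    print(line)
```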

### Accessing Individual Characters

Strings are indexed, starting from zero. When we index a string, we get one of its characters (or letters). A single character is nothing special — it's just a string of length 1.

In [None]:
monty[0]

In [None]:
monty[1]

As with lists, if we try to access an index that is outside of the string we get an error:

In [None]:
monty[20]

Again as with lists, we can use negative indexes for strings, where `-1` is the index of the last character. Positive and negative indexes give us two ways to refer to any position in a string. In this case, where the string has a length of 12, indexes `5` and `-7` both refer to the same character (a space). (Notice that `5 = len(monty) - 7`.)

In [None]:
monty[-1]

In [None]:
monty[5]

In [None]:
monty[-7]
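
The relationship between the two kinds of index can be checked for every position at once. A short sketch, using the same `monty` string:

```python
monty = 'Monty Python'
n = len(monty)

# For every valid positive index i, the negative index i - n names the same character.
pairs = [(i, i - n) for i in range(n)]
same = all(monty[pos] == monty[neg] for pos, neg in pairs)

print(pairs[5])        # (5, -7)
print(repr(monty[5]))  # ' '
print(same)            # True
```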

We can write `for` loops to iterate over the characters in strings. This `print` function includes the optional `end=' '` parameter, which is how we tell Python to print a space instead of a newline at the end.

In [None]:
sent = 'colorless green ideas sleep furiously'
for char in sent:
    print(char, end=' ')

We can count individual characters as well. We should ignore the case distinction by normalizing everything to lowercase, and filter out non-alphabetic characters:

In [None]:
from nltk.corpus import gutenberg

In [None]:
raw = gutenberg.raw('melville-moby_dick.txt')

In [None]:
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())

In [None]:
fdist.most_common(5)

In [None]:
[char for (char, count) in fdist.most_common()]
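
The same kind of character count can be sketched with the standard library's `collections.Counter` on a short string, which is handy when the NLTK corpus data is not installed. The sample sentence here is just an illustration, not the full text of Moby Dick:

```python
from collections import Counter

sample = "Call me Ishmael."

# Lowercase every alphabetic character and count the occurrences.
counts = Counter(ch.lower() for ch in sample if ch.isalpha())
print(counts.most_common(3))
```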

### Accessing Substrings

A substring is any continuous section of a string that we want to pull out for further processing. We can easily access substrings using the same slice notation we used for lists (see the figure below). For example, the following code accesses the substring starting at index `6`, up to (but not including) index `10`:

<img src='https://github.com/tulip-lab/mds/raw/master/Jupyter/image/acc-substrings.png' width = '450' height = '450' align = center />

String Slicing: The string "Monty Python" is shown along with its positive and negative indexes; two substrings are selected using "slice" notation. The slice `[m:n]` contains the characters from position `m` through `n-1`.

In [None]:
monty[6:10]

Here we see the characters are 'P', 'y', 't', and 'h' which correspond to `monty[6]` ... `monty[9]` but not `monty[10]`. This is because a slice *starts* at the first index but finishes *one before* the end index.

We can also slice with negative indexes — the same basic rule of starting from the start index and stopping one before the end index applies; here we stop before the space character.

In [None]:
monty[-12:-7]

As with list slices, if we omit the first value, the substring begins at the start of the string. If we omit the second value, the substring continues to the end of the string:

In [None]:
monty[:5]

In [None]:
monty[6:]

We test if a string contains a particular substring using the `in` operator, as follows:

In [None]:
phrase = 'And now for something completely different'
if 'thing' in phrase:
    print('found "thing"')

We can also find the position of a substring within a string, using `find()`:

In [None]:
monty.find('Python')
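
It is worth knowing what happens when the substring is absent. A short sketch contrasting `find()` with the related `index()` method:

```python
monty = 'Monty Python'

found = monty.find('Python')
missing = monty.find('Circus')
print(found)    # 6
print(missing)  # -1: find() signals "not found" with -1

try:
    monty.index('Circus')  # index() raises ValueError instead of returning -1
except ValueError as err:
    print('ValueError:', err)
```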

### The Difference between Lists and Strings

Strings and lists are both kinds of sequence. We can pull them apart by indexing and slicing them, and we can join them together by concatenating them. However, we cannot join strings and lists:

In [None]:
query = 'Who knows?'
beatles = ['John', 'Paul', 'George', 'Ringo']
query[2]

In [None]:
beatles[2]

In [None]:
query[:2]

In [None]:
beatles[:2]

In [None]:
query + " I don't"

We cannot join strings and lists. If you try, Python raises a <font color='red'>'TypeError'</font>:

In [None]:
beatles + 'Brian'

In [None]:
beatles + ['Brian']

Lists and strings do not have exactly the same functionality. Lists have the added power that you can change their elements:

In [None]:
beatles[0] = "John Lennon"

In [None]:
del beatles[-1]

In [None]:
beatles

On the other hand if we try to do that with a *string* — changing the 0th character in `query` to <font color='green'>'F'</font> — we get:

In [None]:
query[0]='F'

This is because strings are **immutable** — you can't change a string once you have created it. However, lists are **mutable**, and their contents can be modified at any time. As a result, lists support operations that modify the original value rather than producing a new value.
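
A short sketch of what this means in practice: to "change" a string we build a new one, while a list can be modified in place (the variable `alias` is only there to show that both names refer to the same list):

```python
query = 'Who knows?'

# Strings are immutable, so build a new string from a replacement plus a slice.
new_query = 'F' + query[1:]
print(new_query)  # Fho knows?
print(query)      # Who knows? (the original is unchanged)

# Lists are mutable: append() changes the existing list in place.
beatles = ['John', 'Paul', 'George', 'Ringo']
alias = beatles
alias.append('Brian')
print(beatles)    # the change is visible through both names
```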

## Regular Expressions

As a powerful way of searching, replacing, and parsing text with complex patterns of characters, regular expressions are the most significant tools in data parsing. They figure into all kinds of text-manipulation tasks. Searching and search-and-replace are among the most common uses. Regular expressions tend to be easier to write than they are to read. This is less of a problem if you are the only one who ever needs to maintain them. But if several people need to, the syntax can turn into more of a hindrance than an aid. For example,
```python
    ^(|(([A-Za-z0-9]+_+)|([A-Za-z0-9]+\-+)|([A-Za-z0-9]+\.+)|([A-Za-z0-9]+\++))*[A-Za-z0-9]+@((\w+\-+)|(\w+\.))*\w{1,63}\.[a-zA-Z]{2,6})$
```
is a regular expression for validating email addresses.
Please don't try to parse it yourself, 
since even an experienced regular expression user might take a while to parse it.
In this section, we will first go through some good introductory materials on regular expressions,
and then show you some fundamentals of how to use regular expressions to search text.

There are a couple of good online materials that introduce regular expressions in Python. 
We strongly suggest that you study this chapter together with these materials. 
They are 
* [Regular Expression HOWTO](https://docs.python.org/2/howto/regex.html) from Python's official website: An introductory tutorial to using regular expressions in Python with the `re` module. 📖
* [Regular Expressions](http://www.diveintopython3.net/regular-expressions.html), chapter 5 of "**Dive into Python 3**": A series of examples inspired by real world problems are used to show you how to generate regular expressions for parsing street name, Roman numerals, and phone numbers. 📖

The complete list of meta-characters and their behaviour in the context of regular expressions can be found [here](https://docs.python.org/2/library/re.html). If you would like an alternative resource, there is also 

* [RegexOne](http://regexone.com): An interactive tutorial on learning regular expressions with simple exercises.

Before we go through some basics of regular expressions in python, we would like to point out [RegExr](http://regexr.com) by Grant Skinner. It is an online tool to learn, build, & test regular expressions. RegExr provides us with syntax highlighting, contextual help, video tutorial, reference, and searchable community patterns.
You will find a lot of good information in the six tabs provided on its website. In addition, pop-ups appear when you hover over the regular expression or target text in RegExr, giving you helpful information that links a regular expression to the corresponding matches in the text. 
These resources are one of the reasons why RegExr is among our favourite online Regex checkers.

To use regular expressions in Python we need to import the `re` library using: `import re`.

In [None]:
import re

### Backslash

**First, what is `\`?**

`\`, the backslash or escape character, is used to indicate special forms or to allow special characters to be used without invoking their special meaning.

**How about `r""`? When should you use it?**

`r""` is Python's raw string literal prefix, which has nothing to do with regular expressions as such. With `r""` or `r''`, Python does not process backslash escape sequences; in other words, it treats the contents as a raw string. For example, `r"\t"` represents
a two-character string containing `\` and `t`, whereas `"\t"` represents a tab.
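
The difference is easy to check by looking at the lengths of the two literals. A minimal sketch (the variable names are arbitrary):

```python
tab_char = "\t"      # one character: a tab
raw_literal = r"\t"  # two characters: a backslash followed by 't'

print(len(tab_char), len(raw_literal))  # 1 2
print(raw_literal == "\\t")             # True: the same string without the r prefix
```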

Sometimes you can use them interchangeably.

In [None]:
str1 = re.findall('\t', "Please find \t")
print (str1)

str2 = re.findall(r'\t', "Please find \t")
print (str2)

Sometimes not!

In [None]:
str1=re.match(r"\W(.)\1\W", " ff ")
print (str1)

str2=re.match("\W(.)\1\W", " ff ")
print (str2)

str3=re.match("\\W(.)\\1\\W", " ff ")
print (str3)

"\W(.)\1\W" doesn't match ?  What is the difference? 

In [None]:
str4="\W(.)\1\W"
print (str4)
str4

In [None]:
str4=r"\W(.)\1\W"
print (str4)
str4

Now you might be able to guess, what "\W(.)\1\W" will match

In [None]:
str2=re.match("\W(.)\1\W", " f\x01 ")
print (str2)

It matches a non-word character + any one character + "\x01" + a non-word character.

*Conclusion -- always first validate your regular expression, then test it with Python*

\* is ??  <br>
\* is a quantifier, like ? and +; each one applies to the element that precedes it  <br>
\* matches 0 or more repetitions <br>
? matches 0 or 1 repetition <br>
\+ matches 1 or more repetitions <br>

In [None]:
str1 = re.findall(r'.*', 'Please find all.')
print (str1)

In [None]:
str1 = re.findall(r'.?', 'Please find all.')
print (str1)

In [None]:
str1 = re.findall(r'.+', 'Please find all.')
print (str1)

In [None]:
str1 = re.findall(r'l+', 'Please find all')
print (str1)
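
Because `.` matches almost anything, the effect of each quantifier is easier to see when it follows a single literal character. A small sketch with made-up test strings, using `re.fullmatch()` so that the whole string has to match:

```python
import re

candidates = ['ct', 'cat', 'caat']

# Apply each quantifier to the letter 'a' and record which candidates match.
results = {}
for pattern in (r'ca*t', r'ca?t', r'ca+t'):
    results[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]
    print(pattern, results[pattern])
```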

### Matching String Literals
Matching strings with one or more literal characters, called string literals, is similar to the way you might do a search in a word editor or when submitting a keyword to a search engine. When you search for a string of text, you are searching with a string literal.
Let's start with a very simple scenario. 
If we have a sentence like
```
    Today is 26 jan 2016, not 25 Jan 2016.
```
And want to see if the string contains the word `Jan`  using a Python regular expression,
we'd use the following

In [None]:
import re # The Regular Expressions library
str = "Today is 26 jan 2016, not 25 Jan 2016." 
s = re.search("Jan", str)
print (s)

The simple pattern used above is just something like 'J' followed by 'a' followed by 'n' (i.e., 'Jan').
The `search()` method scans through the string, looking for any location where 'Jan' appears. If a match is found, a match object corresponding to the first match is returned. Our search was successful, as the code prints out the match object, which in Python 3.7 or later looks like
```
    <re.Match object; span=(29, 32), match='Jan'>
```
A match object is always truthy, so this is Python's way of saying 'True' or 'Yes'. If no match is found, `search()` returns `None`, and 'None' is printed. 
For example, try the following 

In [None]:
print (re.search("Feb", str))

The returned match object contains information about the match: where it starts and ends, the substring it matched, and more. You can query the match object for information about the matching string. The most important ones are:

In [None]:
print (s.group())
print (s.start()) 
print (s.end())
print (s.span())

The `group()` method returns the string "Jan" matched by the regular expression. 
The `start()` method returns the starting position of "Jan", which is equal to the index of 'J' in the whole string.
Go ahead, count the characters in "Today is 26 jan 2016, not 25 Jan 2016.", starting at 0 for "T", then try:
```python
   str.index('J')
```
It should give the same integer as that given by `s.start()`. 
The `end()` method returns the position just after the last character of the match, 
and the `span()` method returns a tuple containing the (start, end) positions.
This scenario is so simple that you don't need a regular expression.
Instead, you can use a string function, `find()`, which gives you the start position of the target string.

In [None]:
str.find("Jan")

How about finding both "Jan" and "jan"? 
The `find()` method only locates the first occurrence of a literal substring, and `search()` only returns the first match of a regular expression. 
There are two functions in the `re` module that return all of the matches for a given regular expression. 
They are `findall()` and `finditer()`.
The former returns a list of matching strings, 
and the latter returns a sequence of match object instances as an iterator. 
Let's try!

In [None]:
print (re.findall("Jan", str))
for m in re.finditer("Jan", str):
    print (m.group())
    print (m.span())

However, using "Jan" can find the one with uppercase "J", but not the one with lowercase 'j'. 
The reason is that string matching is case-sensitive in regular expressions. 
If you want to match both lower- and uppercase, you can: 
1. Convert all the characters in the string into either lower- or uppercase ones, then use either `re.findall("jan", str.lower())` or `re.findall("JAN", str.upper())` respectively to find the two appearances of "Jan",
2. Update our regular expression to account for both 'J' and 'j', and retrieve both "jan" and "Jan" in their original form, like: 
```python
    [Jj]an
```
where '[ ]' indicates a set of characters, and '[Jj]' will match 'J' or 'j', which is also known as a character class.

In [None]:
re.findall(r"[Jj]an", str)

Another option is to use grouping with alternation. We place the alternatives in parentheses `()` and separate them with a pipe `|`. So we could use:

In [None]:
re.findall(r"(Jan|jan)", str)
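
A further possibility, not used elsewhere in this section, is the `re.IGNORECASE` flag, which makes the whole pattern case-insensitive while still returning the matches in their original form. A minimal sketch on the same sentence (stored here in a variable called `text`):

```python
import re

text = "Today is 26 jan 2016, not 25 Jan 2016."

# The flag applies to the whole pattern, so 'jan' also matches 'Jan' and 'JAN'.
matches = re.findall(r"jan", text, flags=re.IGNORECASE)
print(matches)  # ['jan', 'Jan']
```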

Let's move one step further and find all the words that consist only of alphabetic characters, using only regular expressions. 
It is not feasible to use grouping to enumerate all the possible words. 
Instead, we are going to use '[ ]' together with '+'.
You have seen '[ ]' above. '+' means matching 1 or more repetitions of the preceding regular expression. 
For example,
'an+' will match ‘a’ followed by any non-zero number of ‘n’s; 
it will not match just ‘a’. 
To match non-zero numbers of either lower- or uppercase characters, we derive the following regular expression:

In [None]:
re.findall(r"[a-zA-Z]+", str)

In the example above, we represent a range of characters by giving two characters separated by a '-'. For example [a-z] will match any lowercase ASCII letter, and [A-Z] will match any uppercase ASCII letter. Putting the two together, we derive the regular expression that matches any lower- or uppercase letter.

### Matching Digits
There are several ways to represent digits in regular expressions:
* [0-9]: A range that matches the range of digits 0 through 9, which is the same as "[0123456789]".
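* \d: A character shorthand to match the digits, which is pre-defined in most regular expression engines.
It is equivalent to [0-9] for ASCII text; in Python 3, \d in a `str` pattern also matches other Unicode decimal digits unless the `re.ASCII` flag is set.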

Note that the character shorthand for digits is shorter and simpler, 
but it doesn’t have the power or flexibility of the range. 
With a range, you can pick the exact digits you want to match. 
For example, if you want to match a sequence of the binary digits, like '0010101011', 
you would use
```python
    [01]+
```

To match numbers that have more than one digit, for example, '12' and '123',
you can repeat either representation as many times as you want, like
* [0-9][0-9] or \d\d matches two-digit numbers from 00 to 99.
* [0-9][0-9][0-9] or \d\d\d matches three-digit numbers from 000 to 999.

However, the above approach gets redundant if you try to match '100000' for example.
In this case, we can specify the number of occurrences of those digits by using 
curly brackets, like:  
* [0-9]{2} or \d{2} that matches numbers from 00 to 99.
* \d{1,3} that matches numbers from 0 to 999.
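
One caveat is that a bounded quantifier also matches inside a longer run of digits. A short sketch with a made-up sentence, which adds the word-boundary anchor `\b` (not covered above) to accept only whole numbers of one to three digits:

```python
import re

text = "Rooms 7, 42 and 365 are free; room 100000 is not."

# Without boundaries, 100000 is split into two three-digit chunks.
unbounded = re.findall(r"\d{1,3}", text)
# With \b on both sides, only complete numbers of 1 to 3 digits match.
bounded = re.findall(r"\b\d{1,3}\b", text)

print(unbounded)  # ['7', '42', '365', '100', '000']
print(bounded)    # ['7', '42', '365']
```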

Let's try to extract the year from the given string:

In [None]:
s = re.search(r'\d{4}', str) 
print(s.group()) 

As we discussed before, the `search()` method returns the first match found in the string.
However, if search stops when it finds the first occurrence, what is the point of group?

Regular expressions allow us to not just match text but also to extract information for further processing. This is done by defining groups of characters and capturing them using the special parentheses, i.e., ( and ) meta-characters. Any sub-pattern inside a pair of parentheses will be captured as a group.
Let's try to find a pair of words separated by a white space in the following simple
string
```python
    Isaac Newton, Data Scientist
```
The regular expression we are going to use is 
```python
    ([a-zA-Z]+) ([a-zA-Z]+)
```
It uses two groups. One is used to match the first word in the pair and another matches
the second word. Note that there is a white space between the two groups.

In [None]:
m = re.match(r"([a-zA-Z]+) ([a-zA-Z]+)", "Isaac Newton, Data Scientist")
print(m.group(0) + "\n" + m.group(1)  + "\n" + m.group(2))

As you can see, `m.group(0)` returns the entire match. `m.group(1)` returns the match of the first parenthesized subgroup. And `m.group(2)` returns the match of the second parenthesized subgroup.
You can also retrieve the two groups by using the `groups()` method.

In [None]:
m.groups()

In Python regular expressions, you can also name each group in a regular expression using 
```python
    (?P<name>...)
```
The substring matched by the group is accessible via the symbolic group name 'name'.
For example:

In [None]:
m = re.match(r"(?P<first_name>[a-zA-Z]+) (?P<last_name>[a-zA-Z]+)", "Isaac Newton")
m.groupdict()
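
Besides `groupdict()`, a named group can be read directly by its name. A minimal sketch using the same pattern; the subscript form `m['last_name']` assumes Python 3.6 or later:

```python
import re

m = re.match(r"(?P<first_name>[a-zA-Z]+) (?P<last_name>[a-zA-Z]+)", "Isaac Newton")

first = m.group('first_name')  # access by name
last = m['last_name']          # subscript access
print(first, last)             # Isaac Newton

# Named groups are still numbered, so both forms refer to the same text.
print(m.group(1) == first)     # True
```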

### More on Regular Expression Syntax
We have shown you how to match words and digits in the previous two sections. Here we would like to list some meta-characters that are used very often in regular expressions:
* \D: Matches characters that are not digits, which is equivalent to [^0-9] or [^\d].
* \w: Matches any alphanumeric character or the underscore, which is equivalent to [a-zA-Z0-9_] for ASCII text.
* \W: Matches any character that \w does not match; which is equivalent to [^a-zA-Z0-9_] or [^\w].
* \s: Matches any whitespace character; which is equivalent to [ \t\n\r\f\v], where \t indicates tabs, \n line feeds, \r carriage returns, \f form feeds and \v vertical tabs.
* \S: Matches any non-whitespace character; which is equivalent to [^ \t\n\r\f\v].
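
A quick check of how `\w` behaves in Python 3 compared with an explicit ASCII range. The sample text is made up for illustration, and `\u00e9` is the letter é:

```python
import re

text = "snake_case caf\u00e9 42"

unicode_words = re.findall(r"\w+", text)                # default for str patterns
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)  # restrict \w to ASCII
range_words = re.findall(r"[a-zA-Z0-9]+", text)         # no underscore, no accents

print(unicode_words)
print(ascii_words)
print(range_words)
```

The underscore is matched by `\w` but not by `[a-zA-Z0-9]`, and the accented letter is matched by `\w` only when the pattern is not restricted to ASCII.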

### Raw Strings in Python Regular Expressions
We have been using 'r' in our regular expressions, what does it mean?
It is Python's raw string notation for regular expressions.
It has been used to work around the backslash plague.

In regular expressions the backslash character ('\')  is often used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character. For example, to match a literal backslash, one has to write '\\\\\\\\' as the regular expression string. This is because the regular expression must be '\\\\', and each backslash must be expressed as '\\\\' inside a regular Python string literal. Let's assume that you would like find all the LaTeX commands in a given LaTeX file. Those commands always start with a backslash, like '\\usepackage',
'\\section', '\\title', etc. The regular expression without raw string notation is:
```python
    \\\\\\w+
```
Refer to the previous section for the meaning of "\w".
In contrast, one can prefix the string literal with the letter 'r' or 'R' to form a raw string, which tells 
Python not to handle backslashes in the literal in any special way. With the raw string notation, the regular expression above can be simplified to 
```python
    r"\\\w+"
```
Let's try them out:

In [None]:
m1 = re.match("\\\\\\w+", "\\section")
print (m1.group())

m2 = re.match(r"\\\w+", r"\section")
print (m2.group())

The two lines of matching code above are functionally identical, but the regular expression written in raw string notation is easier to read. Therefore, when writing regular expressions in Python, it is recommended that you use raw strings instead of regular Python strings. 
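
To connect this back to the LaTeX scenario, here is a small sketch that extracts every command from a made-up LaTeX fragment and confirms that the raw and non-raw spellings produce the same pattern string:

```python
import re

# A made-up LaTeX fragment, written as a raw string so the backslashes are kept.
latex = r"\title{Notes} \section{Intro} \usepackage{amsmath}"

commands = re.findall(r"\\\w+", latex)
print(commands)

# The non-raw spelling needs every backslash doubled to give the same pattern.
same_pattern = ("\\\\\\w+" == r"\\\w+")
print(same_pattern)  # True
```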

- - -

## Parsing Dates with Regular Expressions


This section will show you how to parse dates in simple data formats, 
e.g., mm/dd/yyyy, and dd/mm/yyyy. You might think that something as conceptually trivial as a date should be an easy job for a regular expression. But it isn’t, for reasons like: 
* The problem of leading zeros: humans are very sloppy with writing dates. Sometimes we omit the leading zeros, and write dates like "1/1/2016" and "1/01/2016". Therefore, should the regular expression for dates allow leading zeros to be omitted?
* Different date delimiters: besides forward slashes, we can also use white spaces, or hyphens to separate day, month and year.
* Matching a given range of numbers: regular expressions don't deal directly with numbers and don't understand the numerical meanings that humans assign to strings of digits. They treat numbers, like 123, as strings of characters displayed as digits, 1, 2, and 3. Therefore, we cannot tell a regular expression to match a given range of numbers directly, for instance to match months in the range from 1 to 12 or days from 1 to 31.

Therefore, you have to choose how simple or how accurate you want your regular expression to be.
If you already know your text doesn’t contain any invalid dates, you could use a trivial regex such as
```python
    r"\d{2}/\d{2}/\d{4}"
```
The fact that this matches things like 00/00/0000 is irrelevant if those don’t occur in your text.
In most cases, you won't know whether your text has invalid dates or not. 
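
One way to keep the regular expression simple and still reject impossible dates is to let the regex check only the shape and hand the range checks to the standard library. This is a different technique from the pure-regex approach developed in the rest of this section. A minimal sketch, assuming dd/mm/yyyy input; the helper name `is_real_date` is hypothetical:

```python
import re
from datetime import datetime

def is_real_date(text):
    """Check the dd/mm/yyyy shape with a regex, then let datetime validate the values."""
    if not re.fullmatch(r"\d{2}/\d{2}/\d{4}", text):
        return False
    try:
        datetime.strptime(text, "%d/%m/%Y")
        return True
    except ValueError:
        # Raised for impossible values such as day 00 or 31 February.
        return False

for candidate in ["26/01/2016", "00/00/0000", "31/02/2016", "1/1/2016"]:
    print(candidate, is_real_date(candidate))
```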

So given that a basic date is day, month and year, and are all digits, which of the three is easiest to parse with regular expressions?
Give month a try. First define our own function `month`, which accepts a pattern and a month (both text) as arguments and reports whether there is a match:

In [None]:
def month(pattern, m):
    if re.match(pattern, m):
        print (m + " is a month")
    else:
        print (m + " is NOT a month")

It seems that it is trivial to write a regular expression to match the 12 months from 1 to 12 with or without 
leading zeros. 
Let's first assume that all months are represented by two digits. 
In other words, we prepend a zero if the month is between January and September. 
The simplest regular expression one can think of is
```python
    r"\d\d"
```
Try it out

In [None]:
month(r'\d\d', "12")
month(r"\d\d", "03")
month(r"\d\d", "00")
month(r"\d\d", "13")
month(r"\d\d", "3")

The regular expression we used matches exactly two-digit numbers from 00 to 99. 
Although it can match all the months represented by two digits, the problems are that 
* It cannot match months represented by a single digit, e.g., 1 (January), 2 (February), etc.
* It matches numbers that do not represent any month. 
  So one does need to validate the given number to make sure it is in the right range.

Tackling the first problem, we can use curly brackets `{m,n}` (with no space after the comma) to specify the minimum and maximum occurrences of digits. A month has at least one digit and at most two digits. 
So the regular expression should look like
```python
    r"\d{1,2}"
```

In [None]:
month(r"\d{1,2}", "03") 
month(r"\d{1,2}", "3") 
month(r"\d{1,2}", "00") 
month(r"\d{1,2}", "0")
month(r"\d{1,2}", "13")

However, this regular expression still matches invalid months, such as "00", "0" and "13".
The months must be restricted to numbers between 1 and 12.
We use alternation inside a group to match various pairs of digits to form a range of one- or two-digit numbers.

In [None]:
month(r"([1-9]|1[0-2])", "3") 
month(r"([1-9]|1[0-2])", "0") 
month(r"([1-9]|1[0-2])", "01") 

In the above code `[1-9]` matches months that can be represented by a single digit, and `1[0-2]` matches October, November and December. Let's further update the regular expression to allow leading zeros by adding `0?`:

In [None]:
month(r"(0?[1-9]|1[0-2])", "03")

It seems that we have constructed a regular expression that can handle months represented by either one- or two-digit numbers. But sooner or later you will find the following problem: 

In [None]:
month(r"(0?[1-9]|1[0-2])", "13") 
month(r"(0?[1-9]|1[0-2])", "99") 

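Why? The `month` function uses `re.match()`, which only requires the pattern to match at the start of the string, so "13" is accepted because `0?[1-9]` matches its first character, "1".

Some of these patterns seem right but don't always work. 
Regular expressions are quite specific, like mini programs.
You have to get them right, and then they will very effectively block everything that doesn't match.
With a regular expression we say very specifically what we want, as opposed to listing all the exceptions we don't want.
Which is easier?
Consider the alternative of testing all the exceptions, case by case: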
* is the input empty (and this in itself is trouble, one space or two? ' ', tab, CR, LF etc.)
* is the input the correct type (character, number etc.)
* the correct format
* correct range
* positive, negative
* uppercase, lowercase, etc.

Watch out for the difference between greediness and laziness in regular expressions. 
A greedy quantifier matches the longest possible string.
A lazy quantifier matches the shortest possible string. 
Put another way, a lazy quantifier stops as soon as the rest of the pattern can match, 
whereas a greedy quantifier keeps going for as long as the rest of the pattern can still match, which is quite different.
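
The difference is easiest to see on a small example. The sketch below uses a made-up HTML-like string (not part of the course data) and compares the greedy quantifier `+` with its lazy form `+?`.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample string

# Greedy: .+ takes as much as it can, so the match runs to the last '>'.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first '>' that lets the pattern succeed.
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```

The greedy version returns the whole string as one match, while the lazy version returns the four tags separately.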

Consider also the start-of-string and end-of-string anchors. The caret `^` matches the position before the first character in the string. Applying `^a` to "abc" matches the "a" at the start of the string. `^b` does not match "abc" at all, because the "b" cannot be matched right after the start of the string, which is where `^` matches. Similarly, `$` matches right after the last character in the string.
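
A short sketch of how the anchors behave, and of why the unanchored month pattern accepted "13". It assumes that the `month` helper relies on `re.match`, as the `year` function later in this section does.

```python
import re

# ^ anchors the match at the start of the string, $ at the end.
print(re.search(r"^a", "abc"))   # matches only the leading "a"
print(re.search(r"^b", "abc"))   # None: "b" is not at the start
print(re.search(r"c$", "abc"))   # matches the final "c"

# Without anchors, re.match accepts "13" because the leading "1"
# already satisfies the first alternative, 0?[1-9].
partial = re.match(r"(0?[1-9]|1[0-2])", "13")
print(partial.group())
```

`re.match` only requires the pattern to match at the beginning of the string; it does not require the pattern to consume the whole string.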

So here's what we want:
```python
    r"^(0?[1-9]|1[0-2])$" 
```
Let's test it now:

In [None]:
pattern = r"^(0?[1-9]|1[0-2])$"

In [None]:
month(pattern,"03") 
month(pattern,"0")
month(pattern,"033")
month(pattern,"003")
month(pattern,"99")
month(pattern,"3")

Similarly, you can write regular expressions to validate days. 
We will leave this for you to do as an exercise.
Next, we are going to show you the regular expressions for handling years in the 20th and 21st centuries. 
These years are between 1900 and 2099.
The first two digits are either 19 or 20, which can be captured by a group alternating between these two numbers
```python
    (19|20)
```
Each of the last two digits is a number between 0 and 9, which can be easily captured by
```
    \d{2}
```
Put them together and we have
```
    r"(19|20)\d{2}"
```

In [None]:
def year(pattern, m):
    if re.match(pattern, m):
        print (m + " is a year")
    else:
        print (m + " is NOT a year")
        
year(r"(19|20)\d{2}", "1800")
year(r"(19|20)\d{2}", "1900")
year(r"(19|20)\d{2}", "2099")
year(r"(19|20)\d{2}", "2100")

Dealing with leap years is not trivial. Can one write a regular expression that can distinguish days in February in leap years from those in non-leap years? It is easy to write a regular expression that matches February 29th regardless of the year. Allowing February 29th only in leap years would require us to spell out all the years that are leap years, and all the years that aren’t. Regular expressions are therefore not a good choice here; handling leap years requires an extra bit of code. It may be better to do it in two stages:
1. Does it look like a date? (use a regex), then
2. Is it a date? (use code, e.g. convert the month to a number and check that it is > 0 and < 13)

For example, the regular expression we found here:
```python
    r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])'
```
matches a date in yyyy-mm-dd format between 1900-01-01 and 2099-12-31, with a choice of four separators.
However, there are dates that match the regular expression but aren't valid.
For example:

In [None]:
pattern = r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])'
year(pattern, "2016-02-31") 

February cannot have more than 29 days in any year. Instead of using regular expressions to validate 
dates, you can also use Python's `datetime` module. If a given date string cannot be converted to a Python `datetime` object, then the date is not valid. The cell below raises a `ValueError` for exactly this reason.

In [None]:
# use python datetime libraries
import datetime as dt
today = "2016-02-31"
mydt = dt.datetime.strptime(today, '%Y-%m-%d') 
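
To use this check inside a program, the exception can be caught and turned into a boolean. The helper name `is_valid_date` below is hypothetical, chosen only for this sketch.

```python
import datetime as dt

def is_valid_date(text, fmt="%Y-%m-%d"):
    """Return True if text parses as a real calendar date in the given format."""
    try:
        dt.datetime.strptime(text, fmt)
        return True
    except ValueError:
        # strptime raises ValueError for impossible dates such as 31 February
        return False

print(is_valid_date("2016-02-29"))  # 2016 is a leap year
print(is_valid_date("2016-02-31"))
print(is_valid_date("2015-02-29"))
```

This follows the two-stage idea: a regular expression can first decide whether a string looks like a date, and `strptime` then decides whether it is one.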

Therefore, if you get regular expressions right, they can be very useful, as anything that doesn't match the pattern will be blocked. However, getting them wrong will cause many problems.
- - -

## Extract IPs, dates, and email addresses with regular expressions

For the following tasks we will use the mailbox data ([mbox-short.txt](http://www.pythonlearn.com/code3/mbox-short.txt)) provided by the book [Python for Informatics: Exploring Information](http://www.pythonlearn.com/book.php#python-for-informatics). 


In [None]:
!pip install wget

In [None]:
import wget

link_to_data = 'https://github.com/tulip-lab/mds/raw/master/Jupyter/data/mbox-short.txt'

DataSet = wget.download(link_to_data)

In [None]:
!ls

In [None]:
with open('mbox-short.txt','r') as infile:
    text = infile.read()

### Find IP addresses 

In this task we will need to 
1. find all IP addresses in the mbox-short dataset.
2. print unique IP addresses 

Let's have a try first: 

In [None]:
str1 = re.findall(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', "This is a IP address 111.23.39.99")
str1

![](https://github.com/tulip-lab/mds/raw/master/Jupyter/image/regeximg3.png)

From https://regexper.com/

In [None]:
str1= re.findall(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', text)
if len(str1)>0:
    print(str1)

By running the code above, we are able to print all IP addresses. 

Next, can we collect all the unique IP addresses? We read the whole text file into `text` and then apply the `re.findall` function. The `set()` function returns the unique values.

In [None]:
str1=re.findall(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', text)
set(str1)
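
Note that the pattern only checks the shape of an address: it would also accept a string such as `999.1.1.1`. One possible second stage, sketched here on a made-up sample string, is to pass each candidate to the standard library's `ipaddress` module, which rejects octets outside 0 to 255.

```python
import re
import ipaddress

# Hypothetical sample text; the second candidate is not a valid address.
sample = "Received from 111.23.39.99 and 999.1.1.1, again from 111.23.39.99"

candidates = re.findall(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", sample)

valid = set()
for candidate in candidates:
    try:
        ipaddress.IPv4Address(candidate)  # raises ValueError if out of range
        valid.add(candidate)
    except ValueError:
        pass

print(candidates)
print(sorted(valid))
```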

### Extract all date-times 


In the next task, we need to extract all date-time strings from the file. We trust that all of them are valid for now. 



In [None]:
str1=re.findall(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})', text)
set(str1)

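From each extracted date-time string, extract the date and hour information by using a nested group. When a pattern contains more than one group, `re.findall` returns a tuple of the groups for each match.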

In [None]:
str2=re.findall(r'((\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2})', text)
set(str2)
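
As a variation on the pattern above (an illustration, not the notebook's own solution), the date and the hour can each get their own inner group, so that every match unpacks into three parts. The sample line is made up for this sketch.

```python
import re

sample = "Date: 2008-01-05 09:14:16 -0500"  # hypothetical header line

# Group 1 is the full date-time, group 2 the date, group 3 the hour.
matches = re.findall(r"((\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2})", sample)

for full, date, hour in matches:
    print(full, "|", date, "|", hour)
```

Groups are numbered by the position of their opening parenthesis, which is why the outer group comes first in each tuple.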

### Extract author's email address


There are many email addresses included in the file. We would like to extract the email addresses of the authors. The format is normally:

"Author: stephen.marquard@uct.ac.za"

Now let's see if we can use the following regular expression:
```python
r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
```
which was copied and pasted from http://emailregex.com/

Does it work in the task?

In [None]:
str1=re.findall(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+", text)
set(str1)

What if we only want the email addresses that follow `Author:`? 

In [None]:
str1=re.findall(r'Author: ([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)', text)
str1

##  Pre-Processing Text

The possible steps of text pre-processing are nearly the same for all text analysis tasks, though which pre-processing steps are chosen depends on the specific task. The basic steps are as follows:
* Tokenization
* Case normalization
* Removing Stop words
* Stemming and Lemmatization
* Sentence Segmentation

We will walk you through each of these steps with some examples. First, you need to 
decide <font color="red">the scope of the text to be used in the downstream text analysis tasks</font>. Should you use an entire document?
Or should you break the document down into sections, paragraphs, or sentences? Choosing 
the proper scope depends on the goals of the analysis task.
For example, you might choose to use an entire document in document classification and clustering tasks
while you might choose smaller units like paragraphs or sentences in document summarization and information
retrieval tasks. The scope you choose will have an impact on the steps needed in pre-processing.

### Tokenization

Text is usually represented as sequences of characters by computers. 
However, most natural language processing (NLP) and text mining tasks
(e.g., parsing, information extraction, machine translation, document classification, information
retrieval, etc.) need to operate on tokens. 
The process of breaking a stream of text into tokens is often referred to as **tokenization**.
For example, a tokenizer turns a string such as 
```
    A data wrangler is the person performing the wrangling tasks.
```
into a sequence of tokens such as
```
    "A" "data" "wrangler" "is" "the" "person" "performing" "the" "wrangling" "tasks"
```

There is no single right way to do tokenization. 
It completely depends on the corpus and the text analysis task you are going to perform. It is important to ensure that your tokenizer produces proper token types for your downstream text analysis tools. 
Although word tokenization is relatively easy compared with other NLP or text mining tasks, errors made in this phase will propagate into later analysis and cause problems.
In this section, we will demonstrate the process of chopping character sequences into pieces with different tokenizers. 

The major question of the tokenization phase is what counts as a token.
Different linguistic analyses might have different notions of tokens.
In different languages, a token could mean different things. 
Here we are not going to dive into the linguistic aspect of what counts as a token,
as it goes beyond the scope of this unit.
Instead, we consider only English text.
In English, a token can be a string of alphanumeric characters separated by spaces, which
seems quite easy.
However, things get considerably harder when we start considering words having
hyphens, apostrophes, periods and so on. In a word tokenization task, should we
remove hyphens? Should we keep periods? 
According to different text analysis tasks, 
tokens can be unigram words, multi-word phrases (or collocations), or 
other meaningful and identifiable linguistic elements.
Therefore, working out word tokens is not an easy task in pre-processing natural language text.
Reading materials associated with this section are [1], section 3.7 of [2], [3] and section 4.2.2 of [6].
You might be interested in watching a YouTube video on [word tokenization](https://www.youtube.com/watch?v=jBk24DI8kg0).

#### Standard Tokenizer

For English, a straightforward tokenization strategy is to use white spaces as token delimiters. 
The whitespace tokenizer simply splits the text on any sequence of space, tab, or newline characters.
Consider the following hypothetical text.

In [None]:
raw = """The GSO finace group in  U.S.A. provided Cole with about
US$40,000,555.4 in funding, which accounts for 35.3% of Cole's revenue (i.e., AUD113.3m), 
as the ASX-listed firm battles for its survival.
Mr. Johnson said GSO's recapitalisation meant "the current shares are worthless"."""

As a starting point, let's tokenize the text above by using any whitespace characters as token delimiters.
As mentioned, these characters include space (' '), tab ('\t'), newline ('\n'), return ('\r'), and so on.
You have learnt in week 2 that those characters are together represented by a built-in regular expression abbreviation '\s'.
Thus, we will use '\s+' rather than writing something like '[ \t\n]+'.

There are multiple ways of tokenizing a string on whitespace.
The simplest approach might be using Python's string function `split()`.
This function returns a list of tokens in the string.
Another way is to use Python's regular expression package, `re`, as follows:
```python
    import re
    re.split(r"\s+", raw)
```
For our example text, which has no leading or trailing whitespace, the output is exactly the same as that given by the string function `split()`.
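
The two approaches differ once a string starts or ends with whitespace. A minimal sketch with a made-up string:

```python
import re

padded = "  data   wrangling\n"  # hypothetical string with surrounding whitespace

with_str_split = padded.split()            # surrounding whitespace is ignored
with_re_split = re.split(r"\s+", padded)   # empty strings appear at both ends

print(with_str_split)
print(with_re_split)
```

`str.split()` with no argument drops the empty strings, whereas `re.split` keeps one at each end, so the results should be compared with that in mind.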
Here we further demonstrate the use of <font color="blue">RegexpTokenizer</font> from the Natural Language Toolkit (NLTK).

In [None]:
import nltk
from nltk.tokenize import RegexpTokenizer

In [None]:
tokenizer = RegexpTokenizer(r"\s+", gaps=True)
tokens = tokenizer.tokenize(raw)
tokens

A <font color="blue">RegexpTokenizer</font> splits a string into tokens using a regular expression.
Refer to its online [documentation](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.regexp.RegexpTokenizer) 
for more details.
Its constructor takes four arguments.
The compulsory argument is the pattern used to build the tokenizer.
It is in the form of a regular expression. 
In the example above, we used `\s+` to match 1 or more whitespace characters.
If the pattern defines separators between tokens, the value of `gaps` should be
set to `True`. Otherwise, the pattern should be used to find the tokens.
NLTK also provides a whitespace tokenizer, `WhitespaceTokenizer`, which is
equivalent to our tokenizer. Try
```python
    from nltk.tokenize import WhitespaceTokenizer
    WhitespaceTokenizer().tokenize(raw)
```

It seems that word tokenization is quite simple if words in a language are all
separated by whitespace characters. 
However, this is not the case in many languages other than English, such
as Chinese, Japanese, Korean and Ancient Greek. 
In those languages, text is written without any whitespace between words. 
So the whitespace tokenizer is of no use at all.
To handle them, we need more advanced tokenization techniques, often referred to as
word segmentation, which is an important and challenging task in NLP. 
However,
discussing word segmentation is beyond our scope here.

It is not surprising that the whitespace tokenizer is insufficient even for English, since English does not just contain sequences of alphanumeric characters separated by white spaces. 
It often contains punctuation marks, hyphens, apostrophes, and so on.
Sometimes whitespace does not necessarily indicate a word break. 
For example, non-compositional phrases (e.g., "real estate" and "shooting pain") and proper nouns (e.g., "The New York Times") have a different meaning than the sum of their parts. They cannot be split in the process of word tokenization.
They must be treated as a whole in, for instance, information retrieval.

Back to our example, 
the whitespace tokenizer still gives us tokens like "(i.e.,", "funding," and "worthless".".
We would like to remove parentheses, some punctuation, quotation marks and other non-alphanumeric characters.
A simple and straightforward strategy is to use all non-alphanumeric characters as token delimiters.

In [None]:
tokenizer = RegexpTokenizer(r"\W+", gaps=True) 
tokenizer.tokenize(raw)

In regular expressions, '\W' matches any character that is not a word character, while '\w' matches any word character. For ASCII text, the word characters are the letters, the digits and the underscore, so '\W' is equivalent to `[^a-zA-Z0-9_]`. 
Instead of splitting on the delimiters, the counterpart is to extract the tokens that consist only of word characters. Try the following out yourself:
```python
    tokenizer = RegexpTokenizer(r"\w+")
    tokenizer.tokenize(raw)
```
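
What counts as a word character is worth checking before relying on `\w`. The sketch below uses a made-up string to compare `\w+` with an explicit ASCII class; in Python 3, `\w` on `str` patterns also matches the underscore and non-ASCII letters unless `re.ASCII` is given.

```python
import re

sample = "snake_case caf\u00e9 42"  # hypothetical string; \u00e9 is an accented e

unicode_words = re.findall(r"\w+", sample)
explicit_class = re.findall(r"[a-zA-Z0-9]+", sample)
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print(unicode_words)
print(explicit_class)
print(ascii_words)
```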

These two strategies are simple to implement, but there are cases where they may not match the desired behaviour. 
For example, the whitespace tokenizer cannot properly handle non-alphanumeric characters, while the non-alphanumeric tokenizer might over-tokenise some tokens with periods, hyphens, apostrophes, etc.
In the rest of this section, we will discuss the main problems that you might face while tokenising free language text. You will soon find that tokenizers should often be customized to deal with different datasets.

#### Periods in Abbreviations

Word tokens are not always surrounded by whitespace characters. Punctuation marks, such as commas, semicolons, and periods, are often used in English, as they are vital to disambiguate the meaning of sentences. However, it is difficult for computers to handle punctuation, especially periods, properly in tokenization. 
In this part we will focus on the handling of periods.

Periods are usually used to mark the end of sentences. Difficulty arises when the period marks an abbreviation (including acronyms). Please refer to **"Step 2: Handling Abbreviations" in [3]** for a detailed discussion on abbreviations.  In the case of abbreviations, particularly acronyms, separating tokens on punctuation and other non-alphanumeric characters puts different components of the acronym into different tokens, as you have seen in our example, where "U.S.A." was split into three tokens, "U", "S" and "A", losing the meaning of the acronym. To deal with abbreviations, one approach is to maintain a look-up list of known abbreviations during tokenization. Another approach aims for smart tokenization. Here we will show you how to use regular expressions to cover most but not all abbreviations.

An acronym is often formed from the initial components of a multi-word phrase.  Some contain periods, and some do not. Common acronyms with periods are, for example, 
* U.S.A.
* U.N.
* U.K.
* B.B.C.

Other abbreviations with a similar pattern are, for instance, 
* A.M. and P.M.
* A.D. and B.C.
* O.K.
* i.e.
* e.g.

For abbreviations like those, it is not hard to figure out the pattern and the corresponding regular expression.  Each of those abbreviations contains at least one pair of a letter (either uppercase or lowercase) and a period.  The regular expression is
```python
    r"([a-zA-Z]\.)+"
```
To see the graphical representation of the regular expression, please click <a href="https://regexper.com/#(%5Ba-zA-z%5D%5C.)%2B">here</a>. 
Put it into <font color="blue">RegexpTokenizer</font>,

In [None]:
tokenizer = RegexpTokenizer(r"(?:[a-zA-Z]\.)+")
tokenizer.tokenize(raw)

Observe that
1. We introduced <font color="red">(?: )</font> in the regular expression so that the tokenizer returns the whole match rather than only the substring captured by the group. `(?: )` is a non-capturing version of regular parentheses. If the parentheses are used to specify the scope of the pattern, but not to select the matched material to be output, you have to use `(?: )`. To check how `?:` affects the output, try removing it and running the tokenizer again. You will get the following output
```
    ['A.', 'e.', 'l.', 'r.']
```
It returns only the last substring captured by the group in each match.
2. The code also returned 'l.' and 'r.', which are part of 'survival.' and 'Mr.'. 
The period in 'survival.' marks the end of a sentence. 
Indeed, it is very challenging to deal with the period at the end of each sentence, as it can also be part of an abbreviation if the abbreviation appears at the end of a sentence.
For example, the following sentence ends with 'etc.'
```
    I need milk, eggs, bread, etc.
```
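
The effect of `(?: )` described in the first observation can be reproduced with plain `re.findall`, which returns the group's content whenever the pattern contains a capturing group. The sample sentence is made up for this sketch.

```python
import re

sample = "The U.S.A. office, i.e. the head office"  # hypothetical sentence

# With a capturing group, findall returns only the last repetition of the group.
capturing = re.findall(r"([a-zA-Z]\.)+", sample)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r"(?:[a-zA-Z]\.)+", sample)

print(capturing)
print(non_capturing)
```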

Next, let’s further consider some more general abbreviations, like
* Mr. and Mrs.
* Dr.
* st.
* Wash. and Calif. (abbreviations for two U.S. states, Washington and California)

In those abbreviations, the period is always preceded by two or more letters of the English alphabet. Turning this pattern into a regular expression gives
```
    r"[a-zA-Z]{2,}\."
```

In [None]:
tokenizer = RegexpTokenizer(r"[a-zA-Z]{2,}\.")
tokenizer.tokenize(raw)

It is not surprising that the output contains "survival." again. 
The issue of working out which punctuation marks indicate the end of a sentence will be discussed in section 2.5.
Let's put all the cases together. 
The regular expression can be generalised to
```python
    r"([a-zA-Z]+\.)+"
```
which matches both acronyms and abbreviations like "Dr."
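
A caution on character classes for letters: the range `A-z` is not equivalent to `a-zA-Z`. In ASCII, six punctuation characters sit between `Z` and `a`, so `[A-z]` also accepts them. A quick check on a made-up string:

```python
import re

punctuation = "_^[]"  # characters that lie between 'Z' and 'a' in ASCII

too_wide = re.findall(r"[A-z]", punctuation)
letters_only = re.findall(r"[A-Za-z]", punctuation)

print(too_wide)
print(letters_only)
```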

As we mentioned earlier in this chapter, the issues of tokenization are language specific.
The language of the document to be tokenized should be known a priori.
Take computer technology as an example.
It has introduced new types of character sequences that a tokenizer should probably treat as a single token, including email addresses, web URLs, IP addresses, etc. One solution is to simply ignore them by using a non-alphanumeric-based tokenizer. 
However, this comes at the cost of losing the original meaning of those kinds of tokens. For instance, if an IP address like "172.19.197.106" is tokenized into the individual numbers "172", "19", "197", and "106",
it is no longer an IP address, and these numbers can be anything.
To account for strings like
* "172.19.197.106"
* "www.monash.edu.au"

you can simply update our regular expression accounting for abbreviations to 
```python
    (\w+\.?)+
```

Try it out on http://regexr.com/.
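
The same check can be run locally with `re.findall`. This sketch uses a non-capturing group so that whole matches are returned, and two made-up sentences; the second shows that a sentence-final period is swallowed into the token.

```python
import re

pattern = r"(?:\w+\.?)+"

# Hypothetical sentences containing a host name and an IP address.
tokens = re.findall(pattern, "Visit www.monash.edu.au from 172.19.197.106")
tokens_with_full_stop = re.findall(pattern, "Ping 172.19.197.106.")

print(tokens)
print(tokens_with_full_stop)
```

So this pattern keeps such strings together but cannot tell an internal period from the one that ends a sentence.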

#### Currency and Percentages

When analysing financial documents, such as finance reports, a financial analyst might be interested in the monetary numerals mentioned in the reports. One interesting research question in both finance and computer science is whether one can use finance reports to help predict stock market prices. In this case, it would be good for a tokenizer to keep all the monetary numerals.

Currency is usually expressed in symbols and numerals (e.g., $10).
There are many different ways of writing about different currencies.
For example,
* A three-letter currency abbreviation followed by figures, for example,
```
    AUD100, EUR500, CNY330 
```

* A letter or letters symbolising the country, followed by the currency symbol and figures, for example,
```
    A$100 (= AUD100), US$10 (= USD10), C$5 (= CAD5)
```

* A currency symbol ($, £, €, ¥, etc.) followed by figures, for example,
```
    £100.5, €30.0
```

When the integer part has more than three digits, commas are often inserted between every three digits, like
```
    AUD100,000 
```
Let's construct a regular expression that can account for all the following monetary numerals
```
1. $10,000.00
2. €10,000,000.00
3. ¥5.5555
4. AUD100
5. A$10.555
```
The regular expression should look as follows (<a href="https://regexper.com/#(%3F%3A%5BA-Z%5D%7B1%2C3%7D)%3F%5B%5C%24£€¥%5D%3F(%3F%3A%5Cd%7B1%2C3%7D%2C)*%5Cd%7B1%2C3%7D(%3F%3A%5C.%5Cd%2B)%3F">the graphical representation</a>):
```python
    r'''(?x)          
        (?:[A-Z]{1,3})? # (1)
        [\$£€¥]?        # (2)
        (?:\d{1,3},)*   # (3)
        \d{1,3}         # (4)
        (?:\.\d+)?      # (5)
    '''
```

(1) matches the start of a monetary numeral, which consists of one to three uppercase letters that indicate a country symbol or a currency abbreviation.
<br/>
(2) together with (1), matches the start of a monetary numeral, which consists of either only a currency symbol or a country symbol plus a currency symbol.
<br/>
(3) accounts for an integer part that contains more than three digits. It matches all the comma-separated groups of digits in the integer part except for the last group.
<br/>
(4) matches the last one to three digits of the integer part.
<br/>
(5) matches the fractional part.

In [None]:
tokenizer = RegexpTokenizer(r"(?:[A-Z]{1,3})?[\$£€¥]?(?:\d{1,3},)*\d{1,3}(?:\.\d+)?")
tokenizer.tokenize(raw)

Referring back to our example text `raw`, can you find any issue other than the percentage (35.3%)? The regular expression cannot handle "AUD113.3m", where the "m" indicates million. Without the 'm', the number 'AUD113.3' loses its meaning in the original context. As you can see, there might not be a single regular expression that can handle all possible ways of representing currency.
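
One way to check the pattern against the five target numerals, and against "AUD113.3m", is `fullmatch`, which succeeds only when the entire string fits. This is a sketch built around the tokenizer pattern used in the cell above.

```python
import re

currency = re.compile(r"(?:[A-Z]{1,3})?[\$£€¥]?(?:\d{1,3},)*\d{1,3}(?:\.\d+)?")

examples = ["$10,000.00", "€10,000,000.00", "¥5.5555", "AUD100", "A$10.555", "AUD113.3m"]

# True means the whole string is a monetary numeral according to the pattern.
results = {text: bool(currency.fullmatch(text)) for text in examples}

for text, ok in results.items():
    print(text, ok)
```

The first five strings are accepted in full, while "AUD113.3m" is not, because nothing in the pattern accounts for the trailing "m".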

Now we have constructed a regular expression for currencies, even though it is not perfect.
Next, we work out the regular expression for percentages, which is quite easy.
Percentages usually have the following forms
* 23%
* 23.23%
* 23.2323%
* 100.00%

The maximum number of digits in the integer part is 3 and the minimum is 1, so the regular expression is '\d{1,3}'.
A percentage has at most one fractional part, which can be matched by '(\.\d+)?'.
Adding % to the end, we have (<a href="https://regexper.com/#%5Cd%7B1%2C3%7D(%5C.%5Cd%2B)%25">the graphical representation</a>)
```python
    r"\d{1,3}(\.\d+)?%"
```

In [None]:
tokenizer = RegexpTokenizer(r"\d{1,3}(?:\.\d+)?%")
tokenizer.tokenize(raw)

The above code should give you the only percentage in our example text. 
If you compare the regular expression matching percentages with the one matching currency,
you will find that the former is similar to the last part of the latter, except for the percentage sign.
Besides, there are other numerical and special expressions that
we cannot easily handle with regular expressions. For example, these expressions include
email addresses, times, vehicle licence numbers, phone numbers, etc.
If you are interested in dealing with them, you could read the “Regular Expressions Cookbook” by Jan Goyvaerts and Steven Levithan. 

#### Hyphens and Apostrophes 

In English, hyphenation is used for various purposes. The hyphen can be used to form certain compound terms, including hyphenated compound nouns, verbs and adjectives. It can also be used for word division. There are many sources of hyphens in texts. Thus, should one count a sequence of letters with a hyphen as one word or two? Unfortunately, the answer seems to be sometimes one, sometimes two. 
For example, if the hyphen is used to split up vowels in words, such as "co-operate", "co-education" and "pre-process", these words should be regarded as single tokens. In contrast, if the hyphen is used to group a couple of words together, for example, "a state-of-the-art algorithm" and "a money-back guarantee", these hyphenated words should be separated into individual words.
Therefore, handling hyphenated words automatically is one of the most difficult tasks in pre-processing text data.

"**The Art of Tokenization**" [3] categorizes different hyphens into three types:
* **End-of-Line Hyphen**: In professionally printed material (like books and newspapers), the hyphen is used to divide words between the end of one line and the beginning of the next in order to justify the text during typesetting. It seems easy to handle these kinds of hyphens by simply removing them and joining the parts of a word at the end of one line and the beginning of the next.
* **Lexical Hyphen**: Words with a lexical hyphen are better treated as a single word. They are typically included in a dictionary. Examples are words containing certain prefixes, like "co-", "pre-", "multi-", etc., and other words like "so-called" and "forty-two".
* **Sententially Determined Hyphenation**: This type of hyphen is often created dynamically. It includes, for example, nouns modified by an 'ed'-verb (e.g., "text-based" and "hand-made") and sequences of words used as a modifier in a noun group, as in "the 50-cent-an-hour raise". In these cases, we might want to treat those tokens joined by hyphens as individual words.
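
The end-of-line case can be sketched with a single substitution that joins a word fragment, a hyphen and a line break with the fragment that starts the next line. The wrapped text is made up for this illustration.

```python
import re

# Hypothetical text with two words broken across line ends.
wrapped = "Tokenization is a pre-\nprocessing step in text ana-\nlysis."

# Join "fragment-<newline>fragment" into one word.
joined = re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", wrapped)

print(joined)
```

Note the side effect: a lexical hyphen that happens to fall at a line break (as in "pre-processing") is removed as well, so this simple rule cannot tell the first two hyphen types apart.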

The use of hyphens in many such cases is extremely inconsistent, which further increases the complexity of dealing with hyphens in tokenization. People often resort to either using some heuristic rules or treating it as a machine learning problem. However, these go beyond our scope here. It is clear that handling hyphenation is much more complicated than one might expect. You should also be clear that there is no way of handling all the cases above.

Let's assume that we are going to treat all strings of two words separated by a hyphen as a single token. How can we extract them from texts without breaking them into pieces?  In our example text, we are going to view "ASX-listed" as a single token. The pattern here is a sequence of alphanumeric characters, followed by "-", followed by another sequence of alphanumeric characters.
The corresponding regular expression should be 
```python
    r"\w+-\w+"
```

In [None]:
tokenizer = RegexpTokenizer(r"\w+-\w+")
tokenizer.tokenize(raw)

Similar to hyphens, how to handle an apostrophe in tokenization is another complex question. The apostrophe in English is often used in two cases:
* Contractions: a shortened version of a word or multiple words. 
    * don't (do not)
    * she'll (she will)
    * you're (you are)
    * he's (he is or he has)
    * you'd (you would)
* Possessives: used to indicate ownership/possession with nouns.
    * the cat's tail
    * Einstein's theory
    
Should we treat a string containing apostrophes as a single word or two words?
You might think we should separate English contractions into two words, and regard possessives as a single word. 
However, distinguishing contractions from possessives is not easy.
For example, should "cat's" be "cat has/is" or the possessive case of cat?
Thus some NLP processors split the strings in either case into two words, while others do not.
Here we again assume that we are going to retrieve all strings with an apostrophe as single words.
The regular expression is quite similar to the one for handling hyphens.
```
     r"\w+'\w+"
```

In [None]:
tokenizer = RegexpTokenizer(r"\w+'\w+")
tokenizer.tokenize(raw)

Now let's generalise the `\w+` to permit word-internal hyphens and apostrophes (<a href="https://regexper.com/#%5Cw%2B(%3F%3A%5B-'%5D%5Cw%2B)%3F">the graphical representation</a>):
```python
    \w+(?:[-']\w+)? 
```
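
A quick check of this pattern on a made-up phrase shows both what it handles and where it stops: the trailing `?` allows only one hyphen or apostrophe per token. Replacing `?` with `*` is one possible variation (an assumption of this sketch, not part of the pattern above) that keeps longer chains together.

```python
import re

sample = "Cole's ASX-listed firm isn't state-of-the-art"  # hypothetical phrase

at_most_one = re.findall(r"\w+(?:[-']\w+)?", sample)
any_number = re.findall(r"\w+(?:[-']\w+)*", sample)

print(at_most_one)
print(any_number)
```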

You have learnt some simple approaches for handling different issues in word tokenization, which turns out to be far more difficult than you might have expected. It is clear that different NLP and text mining tasks on different text corpora need different word tokenization strategies, as you must decide what counts as a word. Besides the `RegexpTokenizer`, NLTK implements a set of other word tokenization modules. Please refer to [its official webpage](http://www.nltk.org/api/nltk.tokenize.html) for more details.
So far we have only considered well-written text, but there are other types of natural language texts, such as the transcripts of speech corpora and some non-standard texts like tweets, that bring their own additional challenges.

### Case Normalization
After word tokenization, you may find that words can contain either upper- or lowercase letters. 
For example, you might have "data" and "Data" appearing in the same text.
Should one treat them as two different words or as the same word?
Most English texts are written in mixed case. 
In other words, a text can contain both upper- and lowercase letters.
Capitalization helps readers differentiate, for example, between nouns and proper nouns.
In many circumstances, however, a capitalised word should be treated no differently from its lowercase form elsewhere in a document, or even in a corpus.
Therefore, a common strategy is to reduce all letters in a word to lower case.
It is very simple to do so.

In [None]:
tokens = [token.lower() for token in tokens]
tokens

It is often a good idea to do case normalization. For example, with case normalization, you can match "data wrangling" with "Data Wrangling" in an information retrieval task. But for other tasks, like named entity recognition, it is better to leave capitalised words (e.g., proper nouns) capitalised.
People have tried simple heuristics that lowercase only some tokens. 
However, there is a trade-off between getting capitalization right and simply using lowercase regardless of the correct case of words.

### Removing Stop words
[Stopwords](https://en.wikipedia.org/wiki/Stop_words) are words that are extremely common and carry little lexical content. For many NLP and text mining tasks, it is useful to remove stopwords in order to save storage space 
and speed up processing, and the process of removing these words is usually called “stopping.” 
An example stopword list from NLTK is shown below:

In [None]:
from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
stopwords_list

In the NLTK version used to prepare this notebook, the above list contains 127 stopwords in total (the number differs between NLTK releases), which are often [function words](https://en.wikipedia.org/wiki/Function_word) in English, like articles (e.g., "a", "the", and "an"), 
pronouns (e.g., "he", "him", and "they"), particles (e.g., "well", "however" and "thus"), etc.
It is easy to use NLTK's built-in stopword list to remove all the stopwords from a tokenised text.

In [None]:
filtered_tokens = [token for token in tokens if token not in stopwords_list]
filtered_tokens

We have removed 13 stopwords. The number of tokens left is 28. 
To check what stopwords have been excluded from the filtered list, you simply change `not in` to `in`.

There is no single universal list of stop words used by all NLP and text mining tools.
Different stopword lists are available online. For example, the English stopword list 
available at [Kevin Bouge's website](https://sites.google.com/site/kevinbouge/stopwords-lists) 
contains 570 stopwords and is quite fine-grained. 
At the same website, you can also download stopword lists for 27 languages other than English.
Please download the English stopword list from Kevin Bouge's website, and save it into the folder where
you keep this IPython Notebook file. 
We will try out the aforementioned stopword lists on the large
[Reuters corpus](http://about.reuters.com/researchandstandards/corpus/). 

In [None]:
import wget

link_to_data = 'https://github.com/tulip-lab/mds/raw/master/Jupyter/data/stopwords_en.txt'

DataSet = wget.download(link_to_data)

In [None]:
!ls

In [None]:
import nltk
reuters = nltk.corpus.reuters.words()

stopwords_list_570 = []
with open('stopwords_en.txt') as f:
    stopwords_list_570 = f.read().splitlines()

Remove stop words according to NLTK's built-in stopword list.

In [None]:
filtered_reuters = [w for w in reuters if w.lower() not in stopwords_list]
# Fraction of tokens that remain after filtering
len(filtered_reuters) / len(reuters)

Remove stop words according to the downloaded stop word list. (Note: the following script will take a couple of minutes to run, because each membership test scans a Python list linearly.)

In [None]:
filtered_reuters = [w for w in reuters if w.lower() not in stopwords_list_570]
# Fraction of tokens that remain after filtering
len(filtered_reuters) / len(reuters)
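
The running time comes from testing membership in a list, which scans the list item by item. Converting the stopword list to a `set` makes each lookup a hash lookup instead. The sketch below compares the two on synthetic data; the word lists are made up for the timing comparison and are not the real stopwords or the Reuters corpus.

```python
import timeit

# Synthetic stand-ins for a 570-entry stopword list and a corpus.
stopword_list = ['word{}'.format(i) for i in range(570)]
stopword_set = set(stopword_list)
corpus = ['token{}'.format(i % 1000) for i in range(10000)]


def filter_with(container):
    # Same filtering logic as above; only the container type differs.
    return [w for w in corpus if w.lower() not in container]


list_seconds = timeit.timeit(lambda: filter_with(stopword_list), number=3)
set_seconds = timeit.timeit(lambda: filter_with(stopword_set), number=3)

print('list: {:.3f}s, set: {:.3f}s'.format(list_seconds, set_seconds))
```

In the notebook, the same idea would amount to building `set(stopwords_list_570)` once and using it in the list comprehension; the filtered result is identical.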

Thus, with the help of these two stopword lists, we can filter about 36% and 34% of the words respectively.
We have significantly reduced the size of the Reuters corpus. 
The question is: Have we lost lots of information due to removing stopwords? 
For the large majority of NLP and text mining tasks and algorithms, stopwords usually appear to be of little value and have little impact on the final results, as the presence of stopwords in a text does not really help distinguish it from other texts. 
In contrast, text analysis tasks involving phrases are the exception because phrases lose their meaning if some of the words are removed. 
For example, if the two stopwords in the phrase "a bed of roses" are removed, its original meaning in the context of IR will be lost.
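
The phrase example can be checked with a few lines of code. This is a sketch with an illustrative three-word stopword set and a made-up document; `contains_phrase` is a hypothetical helper that looks for the phrase as a contiguous run of tokens.

```python
# Illustrative stopword set (an assumption, not NLTK's actual list).
example_stopwords = {'a', 'of', 'the'}

phrase = 'a bed of roses'.split()
document = 'life is not a bed of roses'.split()


def contains_phrase(tokens, phrase_tokens):
    """Return True if phrase_tokens occurs as a contiguous run in tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))


filtered_document = [t for t in document if t not in example_stopwords]

print(contains_phrase(document, phrase))           # True
print(contains_phrase(filtered_document, phrase))  # False
print(filtered_document)  # ['life', 'is', 'not', 'bed', 'roses']
```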

Stopwords usually refer to the most common words in a language. 
The general strategy for determining whether a word is a stopword or not is to compute its total number of appearances in a corpus. 
We will cover more about removing common words other than stopwords when we further explore text data in the next chapter.
Here we would like to point out that failing to remove those common words could lead to skewed analysis results.
For example, while analysing emails we usually remove headers (e.g., "Subject", "To", and "From") and sometimes
a lengthy legal disclaimer that often appears in many corporate emails.
For short messages, a long disclaimer can overwhelm the actual text when performing any sort of text analysis.
For more discussion on stopping, please read [5] and watch an 8-minute YouTube video on [Stop Words](https://www.youtube.com/watch?v=w36-U-ccajM).
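
The frequency-based strategy described above can be sketched with `collections.Counter`. The three-sentence corpus and the cut-off `k` are arbitrary choices for illustration; on a real corpus the threshold would need to be chosen with more care.

```python
from collections import Counter

# A toy corpus standing in for a real collection of documents.
toy_corpus = [
    'the cat sat on the mat',
    'the dog chased the cat',
    'a dog and a cat met on the road',
]

# Count how often each word occurs across the whole corpus.
term_counts = Counter(word for doc in toy_corpus for word in doc.split())

# Treat the k most frequent words as stopword candidates (k is arbitrary).
k = 3
candidates = [word for word, count in term_counts.most_common(k)]

print(term_counts.most_common(k))
print(candidates)
```

On a corpus this small, a content word such as "cat" ranks near the top, which shows why raw counts are only a starting point and why candidates are usually reviewed before being used as a stopword list.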

### Stemming and Lemmatization

Another question in text pre-processing is whether we want to keep word forms like "educate", "educated", "educating", 
and "educates" separate or to collapse them. Grouping such forms together and working in terms of their base form is 
usually known as stemming or lemmatization.
Typically the stemming process includes the identification and removal of prefixes, suffixes, and pluralisation, 
and leaves you with a stem.
Lemmatization is a more advanced form of stemming that makes use of, for example, the context surrounding the words, 
an existing vocabulary, morphological analysis of words and other grammatical information (e.g., part-of-speech tags) 
to determine the basic or dictionary form of a word, which is known as the lemma.
See Wikipedia entries for [stemming](https://en.wikipedia.org/wiki/Stemming) 
and [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation).

Stemming and lemmatization are the basic text pre-processing methods for texts in languages like English, French, 
German, etc. 
In English, nouns are inflected in the plural, verbs are inflected in the various tenses, and adjectives are 
inflected in the comparative/superlative. 
For example,
* watch &#8594; watches
* party &#8594; parties
* carry &#8594; carrying
* love &#8594; loving
* stop &#8594; stopped
* wet &#8594; wetter
* fat &#8594; fattest
* die &#8594; dying
* meet &#8594; meeting

It is not hard to see that they all follow some inflection rules. 
For instance, to form the plural of a noun ending in a consonant followed by 'y', one usually changes the final 
'y' to 'ie' before adding 's'. 
Indeed, most existing stemming algorithms make intensive use of rules of this kind.
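
To make the idea of rule-based suffix handling concrete, here is a deliberately tiny sketch. The rule table and the function `toy_stem` are invented for teaching purposes; this is not the rule set of the Porter, Lancaster or Snowball stemmers.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
SUFFIX_RULES = [
    ('ies', 'y'),    # parties -> party
    ('ches', 'ch'),  # watches -> watch
    ('ing', ''),     # meeting -> meet
    ('s', ''),       # cats -> cat
]


def toy_stem(word):
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two characters to remain before the suffix.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)] + replacement
    return word


for w in ['parties', 'watches', 'meeting', 'carrying', 'loving', 'dying']:
    print(w, '->', toy_stem(w))
```

The first four words come out as expected, but 'loving' becomes 'lov' and 'dying' becomes 'dy'. Real stemmers need many more rules and conditions to handle such cases, and even then they do not always return dictionary words.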

In morphology, the derivation process creates a new word out of an existing one, often by adding either 
a prefix or a suffix. It brings considerable semantic changes to the word, and the word class often changes, for example,
* dark &#8594; darkness
* agree &#8594; agreement
* friend &#8594; friendship
* derivation &#8594; derivational

The goal of stemming and lemmatization is to reduce either inflectional forms or derivational forms of 
a word to a common base form. 
Before we demonstrate the use of several state-of-the-art stemmers and lemmatizers implemented in NLTK, please read
[4] and section 3.6 in [2].
If you are a visual learner, you could watch the YouTube video on 
[Stemming](https://www.youtube.com/watch?v=2s7f8mBwnko) from Prof. Dan Jurafsky.

NLTK provides interfaces to several well-known stemmers, such as

* Porter Stemmer, which is based on 
[The Porter Stemming Algorithm](http://tartarus.org/martin/PorterStemmer/)
* Lancaster Stemmer, which is based on 
[The Lancaster Stemming Algorithm](http://delivery.acm.org/10.1145/110000/101310/p56-paice.pdf?ip=130.194.73.168&id=101310&acc=ACTIVE%20SERVICE&key=65D80644F295BC0D%2E54DA4E88E6052E5D%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=586402953&CFTOKEN=41173049&__acm__=1456460730_26a9cd5f8f70e5d3e101f527c10e1a82),
* Snowball Stemmer, which is based on [the Snowball Stemming Algorithm](http://snowball.tartarus.org/)

Let's try the three stemmers on the words listed above.

In [None]:
words = ['watches', 'parties', 'carrying', 'loving', 'stopped', 'wetter', 'fattest', 
          'dying', 'darkness', 'agreement', 'friendship', 'derivational', 'denied',  'meeting']

The Porter Stemming Algorithm is one of the most common stemming algorithms.
It makes use of a series of heuristic replacement rules.

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
['{0} -> {1}'.format(w, stemmer.stem(w)) for w in words]

The Porter Stemmer works quite well on general cases, like 'watches' &#8594; 'watch' and 'darkness' &#8594; 'dark'.
However, for some special cases, the Porter Stemmer might not work as expected, 
like  'carrying'  &#8594; 'carri' and 'derivational' &#8594; 'deriv'. 
Note that a concept called "list comprehension" supported by Python is used here.
If you would like to know more about list comprehension, please click [here](http://www.secnetix.de/olli/Python/list_comprehensions.hawk).

The Lancaster Stemmer, published in 1990, is newer than the Porter Stemmer, which dates from 1980.

In [None]:
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()
['{0} -> {1}'.format(w, stemmer.stem(w)) for w in words]

After comparing the output from the Lancaster Stemmer and that from the Porter Stemmer, you might think that
the Lancaster Stemmer could be a bit more aggressive than the Porter Stemmer, since it gets 'agreement' &#8594; 'agr' and 'derivational' &#8594; 'der'. 
At the same time, it seems that the Lancaster Stemmer can handle words like 'parties' and 'carrying' quite well.

Now let's try the Snowball Stemmer.
The version in NLTK is available in 15 languages.
Different from the previous two stemmers, you need to specify which language the Snowball Stemmer will be applied to in its class constructor.
It works in a similar way to the Porter Stemmer.

In [None]:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')
['{0} -> {1}'.format(w, stemmer.stem(w)) for w in words]

A stemmer usually resorts to language-specific rules. 
Different stemmers implement different rules and therefore behave differently, 
as shown above.
The use of inflection and derivation is very complex in English.
There might not exist a set of rules that can cover all the cases.
Therefore, the stemmers that you have played with will inevitably generate some out-of-vocabulary words.

Rather than using a stemmer, you can use a lemmatizer that utilises
more information about the language to accurately identify the lemma
for each word.
As pointed out in "**Stemming and lemmatization**" [4], 
> Stemmers use language-specific rules, but they require less knowledge than a lemmatizer, which needs a complete vocabulary and morphological analysis to correctly lemmatize words.

The WordNet lemmatizer implemented in NLTK is based on WordNet's built-in `morphy` function, and returns the input word unchanged if it cannot be found in WordNet, which sounds more reasonable
than just chopping off prefixes and suffixes. In NLTK, you can use it in the following way:

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
['{0} -> {1}'.format(w, lemmatizer.lemmatize(w)) for w in words]

It is a bit strange that the lemmatizer left nearly all the words unchanged, except for 'watches' and 'parties'.
However, what happens if we specify the POS tag of each word?
Let's try a couple of words in our list.

In [None]:
lemmatizer.lemmatize('dying', pos='v')

In [None]:
lemmatizer.lemmatize('meeting', pos='v')

In [None]:
lemmatizer.lemmatize('meeting', pos='n')

In [None]:
lemmatizer.lemmatize('wetter', pos='a')

In [None]:
lemmatizer.lemmatize('fattest', pos='a')

If we know the POS tags of the words, the WordNet Lemmatizer can identify the corresponding lemmas more accurately.
For example, given the word 'meeting' with different POS tags, the WordNet Lemmatizer returns different lemmas.
If no POS tag is given, it treats the word as a noun by default.
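
In practice the POS tags usually come from a tagger that emits Penn Treebank tags (such as 'NN' or 'VBG'), whereas the lemmatizer expects single-letter WordNet codes ('n', 'v', 'a', 'r'). A small mapping function bridges the two. The helper name `penn_to_wordnet_pos` is hypothetical, and falling back to noun for every other tag is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if penn_tag.startswith('J'):
        return 'a'  # adjective
    if penn_tag.startswith('V'):
        return 'v'  # verb
    if penn_tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, which is also the lemmatizer's default


for tag in ['NN', 'VBG', 'JJR', 'RB', 'DT']:
    print(tag, '->', penn_to_wordnet_pos(tag))
```

With `(word, tag)` pairs from `nltk.pos_tag(tokens)`, each word could then be lemmatized as `lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag))`.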

Both stemming and lemmatization can significantly reduce the number of words in a vocabulary.
In other words, the downstream text analysis tools can benefit from them by saving running time
and memory space. But can stemming and lemmatization also improve the performance
of those tools? This is a debatable question. 
As pointed out in [4], stemming and lemmatization can increase recall but harm precision in information
retrieval. Researchers have also found that English document classification tasks often do not benefit 
from stemming and lemmatization.
However, this might not be the case for languages other than English, for example, German.
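
The recall/precision trade-off can be seen in a toy retrieval experiment. Everything below is made up for illustration: the three documents, the relevance judgements, and the hand-written stem table (a real stemmer would compute the stems). A document matches only if it contains every query term.

```python
# Hand-written stems for illustration only.
toy_stems = {'operating': 'oper', 'operate': 'oper', 'operational': 'oper'}

documents = {
    'd1': 'operating system kernel design',
    'd2': 'how to operate a unix system',
    'd3': 'operational risk in the banking system',
}
relevant = {'d1', 'd2'}  # assumed relevance judgements for the query
query = 'operating system'


def lookup_stem(word):
    return toy_stems.get(word, word)


def search(query_text, use_stemming):
    norm = lookup_stem if use_stemming else (lambda w: w)
    query_terms = {norm(w) for w in query_text.split()}
    hits = set()
    for doc_id, text in documents.items():
        doc_terms = {norm(w) for w in text.split()}
        if query_terms <= doc_terms:  # every query term must be present
            hits.add(doc_id)
    return hits


def evaluate(use_stemming):
    hits = search(query, use_stemming)
    true_positives = len(hits & relevant)
    precision = true_positives / len(hits) if hits else 0.0
    recall = true_positives / len(relevant)
    return hits, precision, recall


for use_stemming in (False, True):
    hits, precision, recall = evaluate(use_stemming)
    print('stemming={}: hits={}, precision={:.2f}, recall={:.2f}'.format(
        use_stemming, sorted(hits), precision, recall))
```

Without stemming only `d1` is retrieved (precision 1.0, recall 0.5). With stemming all three documents match, so recall rises to 1.0 while precision drops to about 0.67 because the unrelated `d3` is now retrieved.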

### Sentence Segmentation

Sentence segmentation is also known as sentence boundary disambiguation or sentence boundary detection.
The following is the Wikipedia definition of sentence boundary disambiguation:
>Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Often natural language processing tools require their input to be divided into sentences for a number of reasons. However sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, decimal point, an ellipsis, or an email address - not the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. As well, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang.

SBD is one of the essential problems for many NLP tasks, like Parsing, Information Extraction, Machine Translation, and Document Summarization. 
The accuracy of the SBD system will directly affect the performance of these applications. 

Sentences are the basic textual unit immediately above the word and phrase. 
So what is a sentence? Is it anything that ends with one of the punctuation marks ".", "!" or "?"?
Does a period always indicate a sentence boundary?
For English texts, it is almost as easy as finding every occurrence of those punctuation marks.
However, some periods occur as part of abbreviations, monetary numerals and percentages, as we 
have discussed in sections 1.2 and 1.3. 
Although you can use a few heuristic rules to correctly
identify the majority of sentence boundaries, SBD is much more complex than one might expect.
Please read section 4.2.4 of [6] and watch a video on [Sentence segmentation](https://class.coursera.org/nlp/lecture/5). 
Discussing more advanced techniques for SBD goes beyond our scope.
Instead, we will show you some sentence segmentation tools implemented in NLTK.
Please also note that there are other tools or packages containing a sentence tokenizer,
for example, Apache OpenNLP, Stanford NLP toolkit, and so on.
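
To see why simple punctuation rules fall short, consider the naive splitter sketched below, which breaks the text after every ".", "!" or "?" that is followed by whitespace. The function name and the sample sentence are made up for this illustration.

```python
import re


def naive_sentence_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


sample = 'Mr. Hussey paid $3.50 for the chowder. Was it good? It was!'
for sentence in naive_sentence_split(sample):
    print(repr(sentence))
```

The decimal point in "$3.50" survives because no whitespace follows it, but the abbreviation "Mr." is wrongly treated as the end of a sentence, so the sample is split into four pieces instead of three.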

NLTK's [Punkt Sentence Tokenizer](http://www.nltk.org/api/nltk.tokenize.html) was designed to split 
text into sentences "*by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences.*" It contains a pre-trained sentence tokenizer for English.
Let's test it out with a couple of examples extracted from Herman Melville's "Moby Dick", available on Project Gutenberg.
First, construct a pre-trained English sentence tokenizer:

In [None]:
import nltk.data
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

Following the instructions on the official website of the Punkt Sentence Tokenizer, we tokenize two snippets extracted
from "Moby Dick":

In [None]:
text1 = '''And so it turned out; Mr. Hosea Hussey being from home, but leaving 
Mrs. Hussey entirely competent to attend to all his affairs. Upon making known our desires 
for a supper and a bed, Mrs. Hussey, postponing further scolding for the present, ushered us 
into a little room, and seating us at a table spread with the relics of a recently concluded repast, 
turned round to us and said—"Clam or Cod?"'''
print('\n-----\n'.join(sent_detector.tokenize(text1.strip())))

In [None]:
text2 = '''A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but
that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"'''
print('\n-----\n'.join(sent_detector.tokenize(text2.strip())))

You can also use `sent_tokenize`, a convenience function that applies the pre-trained Punkt Sentence Tokenizer.
The tokenizer has already been trained on, and works well for, many European languages.
```python
    from nltk.tokenize import sent_tokenize
    sent_tokenize(text1)
```
You should get output similar to the above.

Comparing the two results, we notice that the sentence tokenizer has trouble recognizing abbreviations.
It got "Mrs." right in the first snippet but not in the second. For more on this type of issue, please read the blog post on the NLTK sentence tokenizer [7].
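
When a failure is caused by an abbreviation the tokenizer does not know, one possible workaround is to build a Punkt tokenizer with an explicit abbreviation list. The sketch below starts from empty `PunktParameters`, so it does not benefit from the statistics learned by the pre-trained English model. The abbreviation set and the sample snippet are chosen for illustration only.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Start from empty (untrained) Punkt parameters and register abbreviations.
# Abbreviations are given in lowercase and without the trailing period.
punkt_params = PunktParameters()
punkt_params.abbrev_types = {'mr', 'mrs'}

custom_detector = PunktSentenceTokenizer(punkt_params)

snippet = 'Mrs. Hussey served the clams. We ate them at once.'
custom_sentences = custom_detector.tokenize(snippet)
print('\n-----\n'.join(custom_sentences))
```

Whether this helps on a given text depends on why the original split went wrong, so it is worth comparing the output with that of the pre-trained tokenizer on your own data.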
* * *

## Additional Reading and Resources

1. "[Tokenization](http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)" 📖 .
2. "[Processing Raw Text](http://www.nltk.org/book_1ed/ch03.html)", chapter 3 of "Natural Language Processing with Python".
3. "[The Art of Tokenization](https://www.ibm.com/developerworks/community/blogs/nlp/entry/tokenization?lang=en)": An IBM blog on tokenization. It gives a detailed discussion about word tokenization and its challenges 📖 .
4. "[Stemming and lemmatization](http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)" 📖 .
5. "[Dropping common terms: stop words](http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html)" 📖 .
6. "[Corpus-Based Work](http://cognet.mit.edu.ezproxy.lib.monash.edu.au/system/cogfiles/books/9780262312134/pdfs/9780262312134_chap4.pdf)", Chapter 4 of "Foundations of statistical natural language processing" by Christopher D. Manning 📖 .
7. "[Testing out the NLTK sentence tokenizer](http://www.robincamille.com/2012-02-18-nltk-sentence-tokenizer/)"
8. "[Accessing Text Corpora and Lexical Resources](http://www.nltk.org/book/ch02.html)": Chapter 2 of "Natural Language Processing with Python" by Steven Bird, Ewan Klein & Edward Loper 📖 .
9. "[Corpus Readers](http://www.nltk.org/howto/corpus.html#tagged-corpora)": An NLTK tutorial on accessing the contents of a diverse set of corpora.