# SIT742: Modern Data Science 
**(Week 04: Text Analysis)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.
- If you find any issue or bug in this document, please submit an issue at [tulip-lab/sit742](https://github.com/tulip-lab/sit742/issues)

Prepared by **SIT742 Teaching Team**

---

## Session 4A - The Fundamentals of Text Pre-processing

Table of Contents

* Part 1. Accessing Various Text Resources
* Part 2. Basic Steps of Pre-Processing Text 
* Part 3. Summary
* Part 4. Reading Materials


---

The majority of text data that appears in everyday sources such as books, 
newspapers, magazines, emails, blogs, and tweets 
is free language text. Given the amount of information stored as text on the Internet, it is not feasible for a human to manually explore such a large amount of text data to extract useful information. Therefore, we have to use automatic approaches, such as text analysis algorithms developed in the fields of text mining, natural language processing (NLP) and information retrieval (IR). It is worth knowing that computers cannot directly understand text as humans do. For example, humans can automatically break down sentences into units of meaning, but computers cannot. Therefore, text data must be processed before various text analysis algorithms can use it.

Unlike the data you can retrieve from relational databases, text data usually appears in an unstructured form.
By unstructured we mean that text data exists "in the wild" and has not been converted into a structured format, like a spreadsheet. Therefore, it has to be manipulated and converted into a structured, numerical format that text analysis algorithms can consume; this conversion is referred to as text pre-processing. It is an important task and a critical step in text analysis. The characters, words and sentences identified by text pre-processing are the fundamental units passed to all the downstream text analysis algorithms, such as part-of-speech tagging, parsing, document classification and clustering.
This chapter describes the basic pre-processing steps that are needed to convert unstructured text into a structured 
format.

## Part 1. Accessing Various Text Resources

What are the text corpora and lexical resources often used in text analysis? Where and how can we 
access them? 
Text data used for different text analysis tasks can be derived from various resources, such as 
* **Existing data repositories**, most of which contain corpora that have been either manually annotated or pre-processed into a specific format that can be directly digested by the downstream text analysis algorithms. 
For example,
    *  [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html?format=&task=&att=&area=&numAtt=&numIns=&type=text&sort=nameUp&view=table) contains 30 corpora that can be used in text mining tasks, such as regression, clustering, and classification.  
    * [Linguistic Data Consortium](https://www.ldc.upenn.edu/) contains corpora mainly used in various natural language processing tasks, such as parsing, acoustic analysis, phonological analysis, etc. One disadvantage of using LDC is that its corpora are not free. Users have to buy a license in order to use those corpora.
    
* **NLTK**: A language toolkit that also includes a diverse set of corpora and lexical resources, which include, for example,
    * Plain text corpora, e.g.,
        * The Gutenberg Corpus contains thousands of books.
    * Tagged Corpora, e.g.,
        * The Brown Corpus is annotated with part-of-speech tags. Each word is now paired with its part-of-speech tag.
           You can retrieve words as (word, tag) tuples, rather than just bare word strings.
    * Chunked Corpora, e.g.,
        * The CoNLL corpora include phrasal chunks (CoNLL 2000) and named entity chunks (CoNLL 2002).
    * Parsed Corpora, e.g.,
        * The Treebank corpora provide a syntactic parse for each sentence, like the Penn Treebank based on Wall Street Journal samples.
    * Word List and Lexicons, e.g.,
        * [WordNet](https://wordnet.princeton.edu/): a large lexical database of English, where nouns, verbs, adjectives and adverbs are organized into interlinked synsets (i.e., sets of synonyms)
    * Categorized Corpora: 
        * The Reuters corpus: a corpus of Reuters News stories for use in developing text analysis algorithms.
        
* **Web**: The largest source for getting text data is the Web. Text can be extracted from webpages or be retrieved
via various APIs. For example,
     * **Wikipedia articles**: The Wikimedia website provides links to download dumps of Wikipedia articles. Click [here](https://dumps.wikimedia.org/enwiki) to view various dumps for English Wikipedia articles. 
     * **Tweets**: Twitter allows people to communicate with short, 140-character messages. Fortunately, Twitter provides a well-documented API that we can use to retrieve tweets of interest.
     * Other text data can be scraped from the Internet, like webpages. Here is a <a href="https://www.youtube.com/watch?v=3xQTJi2tqgk">YouTube video</a> on **scraping websites with Python**.

The set of NLTK corpora can be easily accessed with interfaces offered by NLTK. Here we show you how to install the text data that comes with NLTK and all the packages included in NLTK.

In [0]:
import nltk 
# If you are unsure which data or models you need, start with the "popular" collection,
# which downloads a list of commonly used resources.
nltk.download("popular")
# Download the "reuters" resource (the Reuters corpus).
nltk.download("reuters")
# While downloading, NLTK prints the download path (root/nltk_data),
# which is also the first item in the nltk.data.path list.

# nltk.data.path lists the directories in which NLTK searches for its data files.
nltk.data.path

In [35]:
import nltk 
# Opens the NLTK Downloader (a new window on a desktop system, or a text menu in a terminal).
# Type the letter of the command you want at the prompt.
# For example, to check the current NLTK configuration details,
# type 'c' and then type 'd' on the next line.
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------



You can also install NLTK on macOS or Windows. For example, on macOS, running the two commands in the block above opens a window that looks like the following screenshot. For this lab, we use Google Colab, a Linux-like environment, so running `nltk.download()` shows a command line interface (CLI) rather than the window below.

![NLTK](https://github.com/tulip-lab/sit742/raw/master/Jupyter/image/nltkInstallWindow.png "NLTK")



This window, 'NLTK Downloader', shown on macOS allows you to browse the available corpora and packages included in NLTK. The Collections tab on the downloader shows how the packages are grouped into sets. You can select the line labeled "all" and click "download" to obtain all corpora and packages (<font color = "red">Warning: the size is a couple of GBs</font>). It will take a couple of minutes to download the corpora and packages, depending on how fast your Internet connection is. You can also choose to install corpora and packages only as you need them.
* * *

## Part 2. Basic Steps of Pre-Processing Text

The possible steps of text pre-processing are nearly the same for all text analysis tasks, though which pre-processing steps are chosen depends on the specific task. The basic steps are as follows:
* Tokenization
* Case normalization
* Removing Stop words
* Stemming and Lemmatization
* Sentence Segmentation

We will walk you through each of these steps with some examples. First, you need to 
decide <font color="red">the scope of the text to be used in the downstream text analysis tasks</font>. Should you use an entire document?
Or should you break the document down into sections, paragraphs, or sentences? Choosing 
the proper scope depends on the goals of the analysis task.
For example, you might choose to use an entire document in document classification and clustering tasks
while you might choose smaller units like paragraphs or sentences in document summarization and information
retrieval tasks. The scope you choose will have an impact on the steps needed in pre-processing.
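As a minimal sketch of what choosing a scope looks like in code, the snippet below treats the same text either as one unit or as one unit per paragraph. The sample `document` string and the convention that paragraphs are separated by blank lines are assumptions made for illustration.

```python
import re

# Hypothetical two-paragraph document; paragraphs are assumed to be
# separated by blank lines.
document = """Text data is unstructured. It must be pre-processed first.

Tokenization is usually the first step."""

# Scope 1: the entire document as a single unit.
whole_document = [document]

# Scope 2: one unit per paragraph (split on one or more blank lines).
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

print(len(whole_document), len(paragraphs))  # 1 2
```

Splitting further into sentences is harder than it looks, because a period does not always end a sentence.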





### 2.1. Tokenization

Text is usually represented as sequences of characters by computers. 
However, most natural language processing (NLP) and text mining tasks
(e.g., parsing, information extraction, machine translation, document classification, information
retrieval, etc.) need to operate on tokens. 
The process of breaking a stream of text into tokens is often referred to as **tokenization**.
For example, a tokenizer turns a string such as 
```
    A data wrangler is the person performing the wrangling tasks.
```
into a sequence of tokens such as
```
    "A" "data" "wrangler" "is" "the" "person" "performing" "the" "wrangling" "tasks"
```

There is no single right way to do tokenization. 
It completely depends on the corpus and the text analysis task you are going to perform. It is important to ensure that your tokenizer produces proper token types for your downstream text analysis tools. 
Although word tokenization is relatively easy compared with other NLP or text mining tasks, errors made in this phase will propagate into later analysis and cause problems.
In this section, we will demonstrate the process of chopping character sequences into pieces with different tokenizers. 
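The example sentence above can be tokenized in a couple of lines. The sketch below uses only Python's standard library and compares two simple choices; neither is "the" correct tokenizer, they merely make different decisions about the final period.

```python
import re

sentence = "A data wrangler is the person performing the wrangling tasks."

# Splitting on whitespace keeps the final period attached to the last token.
whitespace_tokens = sentence.split()

# Extracting runs of word characters drops the period instead.
word_tokens = re.findall(r"\w+", sentence)

print(whitespace_tokens[-1])  # tasks.
print(word_tokens[-1])        # tasks
print(len(word_tokens))       # 10
```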

The major question of the tokenization phase is what counts as a token.
Different linguistic analyses might have different notions of tokens.
In different languages, a token could mean different things. 
Here we are not going to dive into the linguistic aspect of what counts as a token,
as it goes beyond the scope of this unit.
Instead, we consider only English text.
**In English, a token can be a string of alphanumeric characters separated by spaces, which
seems quite easy.**
However, things get considerably harder when we start considering words containing
hyphens, apostrophes, periods and so on. In a word tokenization task, should we
remove hyphens? Should we keep periods? 
According to different text analysis tasks, 
tokens can be unigram words, multi-word phrases (or collocations), or 
other meaningful and identifiable linguistic elements.
Therefore, working out word tokens is not an easy task in pre-processing natural language text.
You might be interested in watching a YouTube video on [word tokenization](https://www.youtube.com/watch?v=f9o514a-kuc).

In [0]:
raw = """The GSO finace group in  U.S.A. provided Cole with about
US$40,000,555.4 in funding, which accounts for 35.3% of Cole's revenue (i.e., AUD113.3m), 
as the ASX-listed firm battles for its survival.
Mr. Johnson said GSO's recapitalisation meant "the current shares are worthless"."""

#### 2.1.1 Standard Tokenizer

For English, a straightforward tokenization strategy is to use white spaces as token delimiters. 
The whitespace tokenizer simply splits the text on any sequence of space, tab, or newline characters.
Consider the above hypothetical text.
As a starting point, let's tokenize the text above by using any whitespace characters as token delimiters.
As mentioned, these characters include space (' '), tab ('\t'), newline ('\n'), return ('\r'), and so on.
You have learnt in week 2 that those characters are together represented by a built-in regular expression abbreviation '\s'.
Thus, we will use '\s+' rather than writing it as something like '[ \t\n]+'.
You can read the details about the ["\s" Syntax](https://docs.python.org/3/library/re.html).

There are multiple ways of tokenizing a string with whitespaces.
The simplest approach might be using Python's string function `split()`.
This function returns a list of tokens in the string.
Another way is to use Python's regular expression package, `re` as
```python
    import re
    re.split(r"\s+", raw)
```
The output should be exactly the same as that given by the string function `split()`.
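The following sketch checks that claim on a short sample string (the `sample` text is made up for illustration). It also shows one edge case worth knowing: when the text starts or ends with whitespace, `re.split()` returns an empty string at that end, whereas `str.split()` without arguments does not.

```python
import re

sample = "The GSO group\tin  U.S.A.\nprovided Cole with funding"

# Without leading or trailing whitespace, both approaches agree.
same_output = sample.split() == re.split(r"\s+", sample)
print(same_output)  # True

# With leading whitespace, re.split() yields an empty first string.
padded = "  " + sample
regex_tokens = re.split(r"\s+", padded)
plain_tokens = padded.split()
print(repr(regex_tokens[0]))  # ''
print(plain_tokens[0])        # The
```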
Here we further demonstrate the use of <font color="blue">RegexpTokenizer</font> from the Natural Language Toolkit (NLTK).

In [0]:
from nltk.tokenize import RegexpTokenizer

In [0]:
# The `gaps` argument of RegexpTokenizer is a bool:
# True means the pattern is used to find the separators between tokens;
# False means the pattern is used to find the tokens themselves.

# Example with gaps=True: whitespace is the separator, so we get the words
tokenizer = RegexpTokenizer(r"\s+", gaps=True)
tokens = tokenizer.tokenize(raw)
print(tokens)

# Example with gaps=False: the whitespace runs themselves are returned as tokens
tokenizer_test = RegexpTokenizer(r"\s+", gaps=False)
tokens_test = tokenizer_test.tokenize(raw)
print(tokens_test)

A <font color="blue">RegexpTokenizer</font> splits a string into tokens using a regular expression.
Refer to its online [documentation](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.regexp.RegexpTokenizer) 
for more details.
Its constructor takes four arguments.
The compulsory argument is the pattern used to build the tokenizer.
It is in the form of a regular expression. 
**In the example above, we used `\s+` to match 1 or more whitespace characters.**
If the pattern defines separators between tokens, the value of `gaps` should be
set to `True`. Otherwise (the default, `gaps=False`), the pattern is used to find the tokens themselves.
NLTK also provides a whitespace tokenizer, `WhitespaceTokenizer`, which is
equivalent to our tokenizer. Try it:




In [0]:
from nltk.tokenize import WhitespaceTokenizer
WhitespaceTokenizer().tokenize(raw)
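To make the meaning of `gaps` concrete without NLTK, the sketch below mimics the two modes with the standard `re` module: splitting on the pattern versus collecting its matches. This is an illustration of the idea under that assumption, not a copy of NLTK's implementation, and the `text` string is made up.

```python
import re

text = "Cole's revenue\tgrew  35.3%"
pattern = re.compile(r"\s+")

# Like gaps=True: the pattern describes the separators, so we split on it
# and drop any empty strings.
tokens_as_gaps = [tok for tok in pattern.split(text) if tok]

# Like gaps=False: the pattern describes the tokens, so we collect its matches.
tokens_as_matches = pattern.findall(text)

print(tokens_as_gaps)     # ["Cole's", 'revenue', 'grew', '35.3%']
print(tokens_as_matches)  # [' ', '\t', '  ']
```

With a whitespace pattern, only the first mode gives useful word tokens; the second returns the whitespace runs themselves.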

It seems that word tokenization is quite simple if words in a language are all
separated by whitespace characters. 
However, this is not the case in many languages other than English, **such
as Chinese, Japanese, Korean and Ancient Greek.** 
In those languages, text is written without any whitespaces between words. 
So the whitespace tokenizer is of no use at all.
To handle them, we need more advanced tokenization techniques, often referred to as
word segmentation, which is an important and challenging task in NLP. 
**However,
discussing word segmentation is beyond our scope here.**

It is not surprising that the whitespace tokenizer is **insufficient** even for English, since English text does not just contain sequences of alphanumeric characters separated by white spaces. 
It often contains punctuation, hyphens, apostrophes, and so on.
Sometimes **whitespace does not necessarily indicate a word break.**
For example, non-compositional phrases (e.g., "real estate" and "shooting pain") and proper nouns (e.g., "The New York Times") have a different meaning than the sum of their parts. They should not be split in the process of word tokenization.
They must be treated as a whole in, for instance, information retrieval.
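One simple way to keep such phrases together is to merge them back after whitespace tokenization, using a list of known phrases. The sketch below is a plain-Python illustration; the function name `merge_phrases` and the phrase list are hypothetical, and a real system would need a much larger lexicon.

```python
def merge_phrases(tokens, phrases):
    """Greedily merge known multi-word phrases into single tokens."""
    # Try longer phrases first so that the longest match wins.
    ordered = sorted((p.split() for p in phrases), key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for words in ordered:
            if tokens[i:i + len(words)] == words:
                merged.append(" ".join(words))
                i += len(words)
                break
        else:
            # No phrase starts at this position; keep the token as it is.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "She sold real estate ads to The New York Times".split()
merged_tokens = merge_phrases(tokens, ["real estate", "The New York Times"])
print(merged_tokens)
# ['She', 'sold', 'real estate', 'ads', 'to', 'The New York Times']
```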

Back to our example, 
the whitespace tokenizer still gives us tokens like "(i.e.,", "funding," and "worthless".".
We would like to remove parentheses, some punctuation marks, quotation marks and other non-alphanumeric characters.
A simple and straightforward strategy is to use all non-alphanumeric characters as token delimiters.

In [0]:
tokenizer = RegexpTokenizer(r"\W+", gaps=True) 
tokenizer.tokenize(raw)

In regular expressions, '\W' matches any non-word character (for ASCII text, equivalent to `[^a-zA-Z0-9_]`) while '\w' matches any word character (for ASCII text, equivalent to `[a-zA-Z0-9_]`). Note that the underscore counts as a word character. 
The counterpart is to extract tokens that consist only of alphanumeric characters, which never produces empty strings. Try the following out yourself:
```python
    tokenizer = RegexpTokenizer(r"\w+")
    tokenizer.tokenize(raw)
```

These two strategies are simple to implement, but there are cases where they may not match the desired behaviour. 
For example, the whitespace tokenizer cannot properly handle non-alphanumeric characters, while the non-alphanumeric tokenizer might over-tokenise some tokens with periods, hyphens, apostrophes, etc.
In the rest of this section, we will discuss the main problems that you might face while tokenising free language text. You will soon find that tokenizers should often be customized to deal with different datasets.
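The contrast between the two strategies is easy to see on a short made-up phrase. The sketch below uses the standard `re` module rather than NLTK, so it only illustrates the behaviour of the patterns themselves.

```python
import re

sample = "in U.S.A. the ASX-listed firm reported Cole's revenue"

# Whitespace tokenization keeps punctuation attached to the words.
whitespace_tokens = sample.split()

# Splitting on non-word characters breaks acronyms, hyphenated words
# and possessives into pieces.
non_alnum_tokens = [tok for tok in re.split(r"\W+", sample) if tok]

print(whitespace_tokens)
# ['in', 'U.S.A.', 'the', 'ASX-listed', 'firm', 'reported', "Cole's", 'revenue']
print(non_alnum_tokens)
# ['in', 'U', 'S', 'A', 'the', 'ASX', 'listed', 'firm', 'reported', 'Cole', 's', 'revenue']
```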

In [0]:
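# \w matches any alphanumeric character. With gaps=True the pattern \w+ is treated as the separator,
# so the output is the runs of non-alphanumeric characters between the words;
# drop gaps=True to get the words themselves.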
tokenizer = RegexpTokenizer(r"\w+", gaps=True) 
tokenizer.tokenize(raw)

#### 2.1.2 Periods in Abbreviations

Word tokens are not always surrounded by whitespace characters. Punctuation marks, such as commas, semicolons, and periods, are often used in English, as they are vital to disambiguate the meaning of sentences. However, it is difficult for computers to handle punctuation, especially periods, properly in tokenization. 
In this part we will focus on the handling of periods.

Periods are usually used to mark the end of sentences. Difficulty arises when the period marks abbreviations (including acronyms). Please refer to **"Step 2: Handling Abbreviations" in [3]** for a detailed discussion on abbreviations.  In the case of abbreviations, particularly acronyms, separating tokens on punctuation and other non-alphanumeric characters would put different components of the acronym into different tokens, as you have seen in our example, where "U.S.A" has been put into three tokens, "U", "S" and "A", losing the meaning of the acronym. To deal with abbreviations, one approach is to maintain a look-up list of known abbreviations during tokenization. Another approach aims for smart tokenization. Here we will show you how to use regular expressions to cover most but not all abbreviations.
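The look-up list approach mentioned above can be sketched in a few lines. The abbreviation set `KNOWN_ABBREVIATIONS` and the helper `split_trailing_period` are hypothetical names chosen for illustration; a practical list would be far longer and task specific.

```python
# Hypothetical, deliberately tiny list of known abbreviations (lowercased).
KNOWN_ABBREVIATIONS = {"mr.", "mrs.", "dr.", "etc.", "e.g.", "i.e.", "u.s.a."}


def split_trailing_period(token):
    """Keep the period of a known abbreviation; otherwise split it off."""
    if len(token) > 1 and token.endswith(".") and token.lower() not in KNOWN_ABBREVIATIONS:
        return [token[:-1], "."]
    return [token]


text = "Mr. Johnson moved to the U.S.A. for its survival."
tokens = [piece for tok in text.split() for piece in split_trailing_period(tok)]
print(tokens)
# ['Mr.', 'Johnson', 'moved', 'to', 'the', 'U.S.A.', 'for', 'its', 'survival', '.']
```

The weakness of this approach is coverage: any abbreviation missing from the list loses its period.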

An acronym is often formed from the initial components of a multi-word phrase. Some contain periods, and some do not. Common acronyms with periods are, for example, 
* U.S.A
* U.N.
* U.K.
* B.B.C

Other abbreviations with a similar pattern are, for instance, 
* A.M. and P.M.
* A.D. and B.C.
* O.K.
* i.e.
* e.g.

For abbreviations like those, it is not hard to figure out the pattern and the corresponding regular expression.  Each of those abbreviations consists of one or more pairs of a letter (either uppercase or lowercase) and a period.  The regular expression is
```python
    r"([a-zA-Z]\.)+"
```
To see the graphical representation of the regular expression, please visit the [Regexper](https://regexper.com/#%28%5Ba-zA-z%5D%5C.%29%2B) webpage.

In [0]:
# With plain capturing parentheses, as in r"([a-zA-Z]\.)+", the output differs from what you might expect.
tokenizer = RegexpTokenizer(r"([a-zA-Z]\.)+")
tokenizer.tokenize(raw)

In [0]:
# Then we add ?: to the regular expression above to make the group non-capturing.
tokenizer = RegexpTokenizer(r"(?:[a-zA-Z]\.)+")
tokenizer.tokenize(raw)

Observe that
1. We introduced <font color="red">(?: )</font> in the regular expression so that the tokenizer returns the whole match rather than only the captured group. `(?:)` is a non-capturing version of regular parentheses. If the parentheses are used only to specify the scope of the pattern, and not to select the matched material to be output, you have to use `(?:)`. To check how `?:` affects the output, remove it and run the tokenizer again. You will get the following output
```
    ['A.', 'e.', 'l.', 'r.']
```
It returns only the last repetition of the group within each match.
2. The code also returned 'l.' and 'r.', which are parts of 'survival.' and 'Mr.'.
The period in 'survival.' marks the end of a sentence. 
Indeed, it is very challenging to deal with the period at the end of each sentence, as it can also be part of an abbreviation if the abbreviation appears at the end of a sentence.
For example, the following sentence ends with 'etc.'
```
    I need milk, eggs, bread, etc.
```
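The effect of capturing versus non-capturing groups can be reproduced with `re.findall()` alone. The sketch below uses a made-up sentence; it shows that a capturing group yields only the last repetition inside each match, while `(?:...)` yields the whole match.

```python
import re

sample = "The group in U.S.A. (i.e., the parent) survived."

# Capturing group: findall() returns what the group captured last.
captured = re.findall(r"([a-zA-Z]\.)+", sample)

# Non-capturing group: findall() returns the full match.
whole = re.findall(r"(?:[a-zA-Z]\.)+", sample)

print(captured)  # ['A.', 'e.', 'd.']
print(whole)     # ['U.S.A.', 'i.e.', 'd.']
```

Note that both versions still pick up 'd.' from the sentence-final "survived.", which is the end-of-sentence problem described above.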

Next, let’s further consider some more general abbreviations, like
* Mr. and Mrs.
* Dr.
* st.
* Wash. and Calif. (abbreviations for two states in U.S., Washington and California)

In those abbreviations, the period is always preceded by two or more letters of the English alphabet. Turn this pattern into a regular expression
```
    r"[a-zA-Z]{2,}\."
```

In [0]:
tokenizer = RegexpTokenizer(r"[a-zA-Z]{2,}\.")
tokenizer.tokenize(raw)

It is not surprising that the output contains "survival." again. 
The issue of working out which punctuation marks indicate the end of a sentence will be discussed in section 2.5.
Let's put all the cases together. 
The regular expression can be generalised to
```python
    r"(?:[a-zA-Z]+\.)+"
```
which matches both acronyms and abbreviations like "Dr."

In [0]:
tokenizer = RegexpTokenizer(r"(?:[a-zA-Z]+\.)+")
tokenizer.tokenize(raw)

As we mentioned earlier in this chapter, the issues of tokenization are language specific, so
the language of the document to be tokenized should be known a priori.
They also depend on the domain. Take computer technology as an example.
It has introduced new types of character sequences that a tokenizer should probably treat as a single token, including email addresses, web URLs, IP addresses, etc. One solution is to simply ignore them by using a non-alphanumeric-based tokenizer. 
However, this comes at the cost of losing the original meaning of those kinds of tokens. For instance, if an IP address like "172.19.197.106" is tokenized into the individual numbers "172", "19", "197", and "106",
it is no longer an IP address, and these numbers can be anything.
To account for strings like
* "172.19.197.106"
* "www.monash.edu.au"

you can simply update our regular expression accounting for abbreviations to 
```python
    (?:\w+\.?)+
```

Try it out on http://regexr.com/.

In [0]:
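# Tokens matching the numbers in an IP address
tokenizer = RegexpTokenizer(r"\d{1,3}")
print(tokenizer.tokenize("172.19.197.106"))

# Tokens matching the words in a URL
tokenizer = RegexpTokenizer(r"\w{1,}")
print(tokenizer.tokenize("www.monash.edu.au"))

# With a capturing group, only the last part of an IP address or a URL is returned
tokenizer = RegexpTokenizer(r"(\w+\.?)+")
print(tokenizer.tokenize("172.19.197.106"))
print(tokenizer.tokenize("www.monash.edu.au"))

# With a non-capturing group, the whole IP address or URL is kept as one token
tokenizer = RegexpTokenizer(r"(?:\w+\.?)+")
print(tokenizer.tokenize("172.19.197.106"))
print(tokenizer.tokenize("www.monash.edu.au"))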
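One caveat of a pattern that allows an optional trailing period is that it also swallows a sentence-final period. The sketch below, using a made-up sentence and the standard `re` module, compares that pattern with a stricter alternative, `\w+(?:\.\w+)*`, which only accepts periods that sit between word characters. The stricter pattern is an assumption offered for comparison, and it has its own cost: it drops the final period of abbreviations such as "U.S.A.".

```python
import re

sample = "Visit www.monash.edu.au or ping 172.19.197.106."

# Optional trailing period: the sentence-final "." sticks to the IP address.
loose = re.findall(r"(?:\w+\.?)+", sample)

# Periods allowed only between word characters: the final "." is left out.
strict = re.findall(r"\w+(?:\.\w+)*", sample)

print(loose)   # ['Visit', 'www.monash.edu.au', 'or', 'ping', '172.19.197.106.']
print(strict)  # ['Visit', 'www.monash.edu.au', 'or', 'ping', '172.19.197.106']
```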

#### 2.1.3 Currency and Percentages

When analysing financial documents, such as finance reports, a financial analyst might be interested in the monetary numerals mentioned in the reports. One interesting research question in both finance and computer science is whether one can use finance reports to help predict stock market prices. In this case, it would be good for a tokenizer to keep all the monetary numerals.

Currency is usually expressed in symbols and numerals (e.g., $10).
There are many different ways of writing about different currencies.
For example,
* A three-letter currency abbreviation followed by figures, for example,
```
    AUD100, EUR500, CNY330 
```

* A letter or letters symbolising the country, followed by a currency symbol and figures, for example,
```
    A$100 (= AUD100), US$10 (= USD10), C$5 (= CAD5)
```

* A currency symbol ($, £, €, ¥, etc.) followed by figures, for example,
```
    £100.5, €30.0
```

When the number of digits in the integer part is more than three, commas are often inserted between every three digits, like
```
    AUD100,000 
```
Let's construct a regular expression that can account for all the following monetary numerals
```
1. $10,000.00
2. €10,000,000.00
3. ¥5.5555
4. AUD100
5. A$10.555
```
The regular expression should look as follows (<a href="https://regexper.com/#(%3F%3A%5BA-Z%5D%7B1%2C3%7D)%3F%5B%5C%24£€¥%5D%3F(%3F%3A%5Cd%7B1%2C3%7D%2C)*%5Cd%7B1%2C3%7D(%3F%3A%5C.%5Cd%2B)%3F"> the graphical representation</a>), written here in verbose mode so that each part can be annotated:
```python
    r"""(?x)                 # verbose mode: whitespace and comments are ignored
        (?:[A-Z]{1,3})?      # (1)
        [\$£€¥]?             # (2)
        (?:\d{1,3},)*        # (3)
        \d{1,3}              # (4)
        (?:\.\d+)?           # (5)
    """
```

![The diagram for this regular expression](https://github.com/tulip-lab/sit742/raw/master/Jupyter/image/P04A01.png)


(1) matches the start of monetary numerals, which consists of one to three uppercase letters that indicate a country symbol or a currency abbreviation.
<br/>
(2) together with (1), matches the start of monetary numerals, which consists of either only a currency symbol or a country symbol plus a currency symbol.
<br/>
(3) accounts for an integer part that contains more than three digits. It matches all the comma-separated groups of digits in the integer part except for the last group.
<br/>
(4) matches the last group of one to three digits in the integer part.
<br/>
(5) matches the fractional part.


In [0]:
# Let's run the above regular expression
tokenizer = RegexpTokenizer(r"(?:[A-Z]{1,3})?[\$£€¥]?(?:\d{1,3},)*\d{1,3}(?:\.\d+)?")
tokenizer.tokenize(raw)

Refer back to our example text "raw". Can you find any issue other than the percentage (35.3%), whose number is picked up as if it were a monetary numeral? The regular expression cannot handle "AUD113.3m", where the "m" indicates million. Without 'm', the number 'AUD113.3' loses its meaning in the original context. As you can see, there might not be a single regular expression that can handle all possible ways of representing currency.
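As a quick check, the currency pattern can be compiled with the standard `re` module and tested against the five target examples listed earlier. The sketch below uses `re.VERBOSE` so the pattern can be commented part by part; the name `CURRENCY` is chosen for illustration.

```python
import re

CURRENCY = re.compile(r"""
    (?:[A-Z]{1,3})?    # optional country symbol or currency abbreviation
    [\$£€¥]?           # optional currency symbol
    (?:\d{1,3},)*      # leading comma-separated groups of digits
    \d{1,3}            # last group of digits in the integer part
    (?:\.\d+)?         # optional fractional part
""", re.VERBOSE)

examples = ["$10,000.00", "€10,000,000.00", "¥5.5555", "AUD100", "A$10.555"]
all_match = all(CURRENCY.fullmatch(e) is not None for e in examples)
print(all_match)  # True

# The trailing "m" (million) is not part of the pattern, so it is dropped.
print(CURRENCY.findall("AUD113.3m"))  # ['AUD113.3']
```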

Now, we have constructed a regular expression for currencies, even though it is not perfect.
Next, we work out the regular expression for percentages, which is much easier.
Percentages usually have the following forms
* 23%
* 23.23%
* 23.2323%
* 100.00%

The maximum number of digits in the integer part is 3 and the minimum is 1, so the regular expression is '\d{1,3}'.
A percentage can have either one or no fractional part, which can be matched by '(?:\.\d+)?'.
Adding % to the end, we have (<a href="https://regexper.com/#%5Cd%7B1%2C3%7D(%5C.%5Cd%2B)%25">the graphical representation</a>)
```python
    r"\d{1,3}(?:\.\d+)?%"
```

![The diagram for this regular expression](https://github.com/tulip-lab/sit742/raw/master/Jupyter/image/P04A02.png)

In [0]:
tokenizer = RegexpTokenizer(r"\d{1,3}(?:\.\d+)?%")
tokenizer.tokenize(raw)

The above code should give you the only percentage in our example text. 
If you compare the regular expression matching percentages with that matching currency,
you will find that the former is similar to the last part of the latter, except for the percentage sign.
Besides, there are other numerical and special expressions that
are harder to handle with regular expressions. For example, these expressions include
email addresses, times, vehicle licence numbers, phone numbers, etc.
If you are interested in dealing with them, you could read the “Regular Expressions Cookbook” by Jan Goyvaerts and Steven Levithan. 
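Because the two patterns overlap, one way to use them together is to join them with alternation and put the percentage pattern first, so that "35.3%" is not cut down to "35.3" by the currency pattern. The sketch below is one possible combination under that assumption, run on a fragment of the example text with the standard `re` module.

```python
import re

PERCENT = r"\d{1,3}(?:\.\d+)?%"
CURRENCY = r"(?:[A-Z]{1,3})?[\$£€¥]?(?:\d{1,3},)*\d{1,3}(?:\.\d+)?"

# At each position the alternatives are tried from left to right,
# so the percentage pattern gets the first chance to match.
combined = re.compile(f"{PERCENT}|{CURRENCY}")

sample = "about US$40,000,555.4 in funding, which accounts for 35.3% of revenue"
numerals = combined.findall(sample)
print(numerals)  # ['US$40,000,555.4', '35.3%']
```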

#### 2.1.4 Hyphens and Apostrophes 

In English, hyphenation is used for various purposes. The hyphen can be used to form certain compound terms, including hyphenated compound nouns, verbs and adjectives. It can also be used for word division. There are many sources of hyphens in texts. Thus, should one count a sequence of letters with a hyphen as one word or two? Unfortunately, the answer seems to be sometimes one, sometimes two. 
For example, if the hyphen is used to split up vowels in words, such as "co-operate", "co-education" and "pre-process", these words should be regarded as a single token. In contrast, if the hyphen is used to group a couple of words together, for example, "a state-of-the-art algorithm" and "a money-back guarantee", these hyphenated words should be separated into individual words.
Therefore, handling hyphenated words automatically is one of the most difficult tasks in pre-processing text data.

"**The Art of Tokenization**" (please refer to Part 4, Reading Materials) categorizes different hyphens into three types:
* **End-of-Line Hyphen**: In professionally printed material (like books and newspapers), the hyphen is used to divide words between the end of one line and the beginning of the next in order to justify the text during typesetting. These hyphens seem easy to handle: simply remove them and join the part of the word at the end of one line to the part at the beginning of the next.
* **Lexical Hyphen**: Words with a lexical hyphen are better treated as a single word. They are typically included in a dictionary. Examples are words containing certain prefixes, like "co-", "pre-", "multi-", etc., and other words like "so-called" and "forty-two".
* **Sententially Determined Hyphenation**: This type of hyphen is often created dynamically. It includes, for example, nouns modified by an 'ed'-verb (e.g., "text-based" and "hand-made") and sequences of words used as a modifier in a noun group, as in "the 50-cent-an-hour raise". In these cases, we might want to treat those tokens joined by hyphens as individual words.

The use of hyphens in many such cases is extremely inconsistent, which further increases the complexity of dealing with hyphens in tokenization. People often resort to either using some heuristic rules or treating it as a machine learning problem. However, these go beyond our scope here. It is clear that handling hyphenation is much more complicated than one might expect. You should also be aware that no single approach handles all the cases above.
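The end-of-line case described above can be sketched with a single substitution that removes a hyphen followed by a line break when both sides are word characters. The sample text is made up, and the rule is only a heuristic: it would also wrongly join a lexical hyphen that happens to fall at the end of a line (for example, "state-of-" followed by "the-art" on the next line).

```python
import re

text = "Tokeni-\nzation is a critical step in text ana-\nlysis."

# Join the two halves of a word split across lines by an end-of-line hyphen.
joined = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

print(joined)  # Tokenization is a critical step in text analysis.
```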

Let's assume that we are going to treat all strings of two words separated by a hyphen as a single token. How can we extract them from texts without breaking them into pieces? In our example text, we are going to view "ASX-listed" as a single token. The pattern here is a sequence of alphanumeric characters, followed by "-", followed by another sequence of alphanumeric characters.
The corresponding regular expression should be 
```python
    r"\w+-\w+"
```

In [0]:
tokenizer = RegexpTokenizer(r"\w+-\w+")
tokenizer.tokenize(raw)

Similar to hyphens, how to handle an apostrophe in tokenization is another complex question. The apostrophe in English is often used in two cases:
* Contractions: a shortened version of a word or multiple words. 
    * don't (do not)
    * she'll (she will)
    * you're (you are)
    * he's (he is or he has)
    * you'd (you would)
* Possessives: used to indicate ownership/possession with nouns.
    * the cat's tail
    * Einstein's theory
    
Should we treat a string containing apostrophes as a single word or two words?
Perhaps you might think we should separate English contractions into two words, and regard possessives as a single word. 
However, distinguishing contractions from possessives is not easy.
For example, should "cat's" be "cat has/is" or the possessive case of cat?
Thus some NLP processors split the strings in either case into two words, while others do not.
Here we again assume that we are going to retrieve all strings with an apostrophe as single words.
The regular expression is quite similar to the one for handling hyphens.
```
     r"\w+'\w+"
```

In [0]:
tokenizer = RegexpTokenizer(r"\w+'\w+")
tokenizer.tokenize(raw)
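If a task does call for splitting contractions, one simple option is a look-up table applied after tokenization. The sketch below is a hypothetical illustration: the `CONTRACTIONS` table and the `expand_contractions` helper are made up, the table is far from complete, and ambiguous forms such as "he's" are deliberately left out. Possessives like "cat's" are not in the table and pass through unchanged.

```python
# Hypothetical look-up table of unambiguous contractions (lowercased keys).
CONTRACTIONS = {
    "don't": ["do", "not"],
    "she'll": ["she", "will"],
    "you're": ["you", "are"],
    "you'd": ["you", "would"],
}


def expand_contractions(tokens):
    """Replace known contractions with their expanded words."""
    expanded = []
    for token in tokens:
        # Unknown tokens (including possessives) are kept as they are.
        expanded.extend(CONTRACTIONS.get(token.lower(), [token]))
    return expanded


tokens = "You're sure she'll fix the cat's tail".split()
expanded_tokens = expand_contractions(tokens)
print(expanded_tokens)
# ['you', 'are', 'sure', 'she', 'will', 'fix', 'the', "cat's", 'tail']
```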

Now let's generalise the `\w+` to permit word-internal hyphens and apostrophes (<a href="https://regexper.com/#%5Cw%2B(%3F%3A%5B-'%5D%5Cw%2B)%3F">the graphical representation</a>):
```python
    \w+(?:[-']\w+)?
```

You have learnt some simple approaches for handling different issues in word tokenization, which turns out to be far more difficult than you might have expected. It is clear that different NLP and text mining tasks on different text corpora need different word tokenization strategies, as you must decide what counts as a word. Besides the `RegexpTokenizer`, NLTK implements a set of other word tokenization modules. Please refer to [its official webpage](http://www.nltk.org/api/nltk.tokenize.html) for more details.
So far we have only considered well-written text, but there are other types of natural language texts, such as the transcripts of speech corpora and some non-standard texts like tweets, that pose their own additional challenges.

In [0]:
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?")
tokenizer.tokenize(raw)
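The optional group `(?:[-']\w+)?` permits only one word-internal hyphen or apostrophe per token. The sketch below, using the standard `re` module on a made-up sentence, compares it with a variant that uses `*` instead of `?`, which allows any number of them. Which behaviour is preferable depends on whether strings like "state-of-the-art" should stay whole in your task.

```python
import re

sample = "Cole's ASX-listed firm uses a state-of-the-art method"

# At most one internal hyphen or apostrophe per token.
at_most_one = re.findall(r"\w+(?:[-']\w+)?", sample)

# Any number of internal hyphens or apostrophes per token.
any_number = re.findall(r"\w+(?:[-']\w+)*", sample)

print(at_most_one)
# ["Cole's", 'ASX-listed', 'firm', 'uses', 'a', 'state-of', 'the-art', 'method']
print(any_number)
# ["Cole's", 'ASX-listed', 'firm', 'uses', 'a', 'state-of-the-art', 'method']
```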

### 2.2. Case Normalization
After word tokenization, you may find that words can contain either upper- or lowercase letters. 
For example, you might have "data" and "Data" appearing in the same text.
Should one treat them as two different words or as the same word?
Most English texts are written in mixed case. 
In other words, a text can contain both upper- and lowercase letters.
Capitalization helps readers differentiate, for example, between common nouns and proper nouns.
In many circumstances, however, a word written in upper case should be treated no differently from the same word in lower case, whether within a document or across a corpus.
Therefore, a common strategy is to reduce all letters in a word to lower case.
It is very simple to do so.

In [0]:
tokens = [token.lower() for token in tokens]
tokens

It is often a good idea to do case normalization. For example, with case normalization, you can match "data wrangling" with "Data Wrangling" in an information retrieval task. But for other tasks, like named entity recognition, it is better to leave capitalised words (e.g., proper nouns) capitalised.
People have tried simple heuristics that lowercase only some tokens.
However, there is a trade-off between getting capitalization right and simply lowercasing everything regardless of the correct case of words.
You can read about basic formatting issues of text processing in "Corpus-Based Work", listed in Part 4, Reading Materials.
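One heuristic of this kind is to lowercase only the first token of each sentence and leave mid-sentence capitals alone. The sketch below illustrates the idea and its trade-off. The function name `lowercase_sentence_initial` and the sample sentences are hypothetical, and the input is assumed to be already split into tokenised sentences.

```python
def lowercase_sentence_initial(sentences):
    """Lowercase only the first token of each tokenised sentence."""
    result = []
    for tokens in sentences:
        tokens = list(tokens)
        if tokens:
            tokens[0] = tokens[0].lower()
        result.append(tokens)
    return result

# Hypothetical tokenised sentences, used only for illustration.
sentences = [
    ["Data", "wrangling", "is", "fun"],
    ["Melbourne", "is", "in", "Australia"],
]
normalised = lowercase_sentence_initial(sentences)
print(normalised)
# [['data', 'wrangling', 'is', 'fun'], ['melbourne', 'is', 'in', 'Australia']]
```

The heuristic keeps "Australia" capitalised but wrongly lowercases the sentence-initial proper noun "Melbourne", which is exactly the kind of error the trade-off refers to.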

### 2.3. Removing Stop words
[Stopwords](https://en.wikipedia.org/wiki/Stop_words) are words that are extremely common and carry little lexical content. For many NLP and text mining tasks, it is useful to remove stopwords in order to save storage space 
and speed up processing, and the process of removing these words is usually called “stopping.” 
An example stopword list from NLTK is shown below:

In [0]:
from nltk.corpus import stopwords

stopwords_list = stopwords.words('english')

# Show the stopwords in NLTK's 'english' list
stopwords_list

The above list contains 127 stopwords in total (the exact number depends on the installed NLTK version), which are often [function words](https://en.wikipedia.org/wiki/Function_word) in English, like articles (e.g., "a", "the", and "an"), 
pronouns (e.g., "he", "him", and "they"), particles (e.g., "well", "however" and "thus"), etc.
It is easy to use NLTK's built-in stopword list to remove all the stopwords from a tokenised text.

In [0]:
filtered_tokens = [token for token in tokens if token not in stopwords_list]
filtered_tokens

In [0]:
# Show the stopwords that were excluded from the filtered list
excluded_tokens = [token for token in tokens if token in stopwords_list]
excluded_tokens

We have removed 13 stopwords, leaving 28 tokens.
To check which stopwords were excluded from the filtered list, you simply change `not in` to `in`, as done in the cell above.

There is no single universal list of stop words used by all NLP and text mining tools.
Different stopword lists are available online. For example, the English stopword list 
available at [Kevin Bouge's website](https://sites.google.com/site/kevinbouge/stopwords-lists) 
contains 570 stopwords, which makes it quite a fine-grained list. 
At the same website, you can also download stopword lists for 27 languages other than English.
Please download the English stopword list from Kevin Bouge's website, and save it into the folder where
you keep this IPython Notebook file. 
We will try out both stopword lists on the large
[Reuters corpus](https://github.com/teropa/nlp/tree/master/resources/corpora/reuters). 

In [0]:
!pip install wget

import wget

link_to_data = 'https://github.com/tulip-lab/sit742/raw/master/Jupyter/data/stopwords_en.txt'

DataSet = wget.download(link_to_data)

!ls

In [0]:
import nltk
reuters = nltk.corpus.reuters.words()

stopwords_list_570 = []
with open('stopwords_en.txt') as f:
    stopwords_list_570 = f.read().splitlines()
# Show the downloaded stopword list; you can compare it with NLTK's 'english' list above.
# You will find that the downloaded list contains more stopwords than the 'english' list.
stopwords_list_570

Remove stop words according to NLTK's built-in stopword list.

In [0]:
filtered_reuters = [w for w in reuters if w.lower() not in stopwords_list]
# Fraction of Reuters tokens that remain after removing NLTK's 'english' stopwords
len(filtered_reuters)*1.0/len(reuters)

Remove stop words according to the downloaded stop word list. (Note: the following script may take a couple of minutes to run, because each token is looked up in a Python list.)

In [0]:
filtered_reuters = [w for w in reuters if w.lower() not in stopwords_list_570]
# Fraction of Reuters tokens that remain after removing the downloaded stopwords;
# compare it with the fraction obtained for NLTK's built-in list above.
len(filtered_reuters)*1.0/len(reuters)

Thus, with the help of these two stopword lists, we can filter out about 36% and 34% of the words respectively.
We have significantly reduced the size of the Reuters corpus. 
The question is: have we lost a lot of information by removing stopwords? 
For the large majority of NLP and text mining tasks and algorithms, stopwords are usually of little value and have little impact on the final results, as the presence of stopwords in a text does not really help to distinguish it from other texts. 
In contrast, text analysis tasks involving phrases are the exception, because phrases lose their meaning if some of their words are removed. 
For example, if the two stopwords in the phrase "a bed of roses" are removed, its original meaning in the context of IR will be lost.
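The phrase example can be reproduced in a few lines. This is a minimal sketch with a tiny, hypothetical stopword set (`toy_stopwords`) rather than one of the real lists used above. It stores the stopwords in a `set`, which gives constant-time membership tests; wrapping a list in `set(...)` in the same way usually makes filtering a large corpus much faster.

```python
# Tiny hypothetical stopword set; a real list (e.g., NLTK's) is much larger.
toy_stopwords = {"a", "an", "the", "of", "in", "is"}

phrase = "a bed of roses".split()
kept = [w for w in phrase if w.lower() not in toy_stopwords]
removed_fraction = 1 - len(kept) / len(phrase)

print(kept)              # ['bed', 'roses']
print(removed_fraction)  # 0.5
```

Only "bed" and "roses" survive, so a phrase query for "a bed of roses" can no longer be matched exactly.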

Stopwords usually refer to the most common words in a language. 
The general strategy for determining whether a word is a stopword or not is to compute its total number of appearances in a corpus. 
We will cover more about removing common words other than stopwords when we further explore text data in the next chapter.
Here we would like to point out that failing to remove those common words could lead to skewed analysis results.
For example, while analysing emails we usually remove headers (e.g., "Subject", "To", and "From") and sometimes
a lengthy legal disclaimer that often appears in many corporate emails.
For short messages, a long disclaimer can overwhelm the actual text when performing any sort of text analysis.
For more discussion on stopping, please read [5] and watch an 8-minute YouTube video on [Stop Words](https://www.youtube.com/watch?v=w36-U-ccajM).
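The frequency-based strategy described above can be sketched with `collections.Counter`. The three-sentence corpus below is hypothetical and far too small to be representative; it only shows the mechanics of counting word occurrences and ranking them.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat met on the road",
]

# Count how often each word appears across the whole corpus.
counts = Counter(word for doc in docs for word in doc.split())

# The most frequent words are candidate stopwords.
top = counts.most_common(2)
print(top)
# [('the', 5), ('cat', 3)]
```

Note that "cat" ranks second here: in a small or domain-specific corpus, frequent content words can look like stopwords, so candidates found this way are usually reviewed before being removed.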

### 2.4. Stemming and Lemmatization

Another question in text pre-processing is whether we want to keep word forms like "educate", "educated", "educating", 
and "educates" separate or to collapse them. Grouping such forms together and working in terms of their base form is 
usually known as stemming or lemmatization.
Typically the stemming process includes the identification and removal of prefixes, suffixes, and pluralisation, 
and leaves you with a stem.
Lemmatization is a more advanced form of stemming that makes use of, for example, the context surrounding the words, 
an existing vocabulary, morphological analysis of words and other grammatical information (e.g., part-of-speech tags) 
to determine the basic or dictionary form of a word, which is known as the lemma.
See Wikipedia entries for [stemming](https://en.wikipedia.org/wiki/Stemming) 
and [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation).

Stemming and lemmatization are the basic text pre-processing methods for texts in languages like English, French, 
German, etc. 
In English, nouns are inflected in the plural, verbs are inflected in the various tenses, and adjectives are 
inflected in the comparative/superlative. 
For example,
* watch &#8594; watches
* party &#8594; parties
* carry &#8594; carrying
* love &#8594; loving
* stop &#8594; stopped
* wet &#8594; wetter
* fat &#8594; fattest
* die &#8594; dying
* meet &#8594; meeting

It is not hard to see that they all follow some inflection rules. 
For instance, to get the plural form of a noun ending in a consonant followed by 'y', one often changes the ending 
'y' to 'ie' before adding 's'. 
Indeed, most existing stemming algorithms make extensive use of rules of this kind.
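To make the idea of rule-based suffix handling concrete, here is a toy sketch that reverses a few plural rules. The rules and the function name `toy_stem` are illustrative assumptions; they are not the rules of the Porter, Lancaster or Snowball stemmers.

```python
import re

# Toy rewrite rules as (suffix pattern, replacement), tried in order.
# Illustrative assumptions only, not taken from any real stemmer.
RULES = [
    (r"ies$", "y"),                 # parties -> party
    (r"(ch|sh|s|x|z)es$", r"\1"),   # watches -> watch
    (r"s$", ""),                    # cats -> cat
]

def toy_stem(word):
    """Apply the first matching rule; return the word unchanged otherwise."""
    for pattern, replacement in RULES:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

for w in ["parties", "watches", "cats", "dies", "meeting"]:
    print(w, "->", toy_stem(w))
```

The first three words are reduced as intended and "meeting" is left unchanged, but "dies" becomes "dy" because the 'ies' rule fires blindly. Exceptions like this are why real stemmers need many more rules and conditions.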

In morphology, the derivation process creates a new word out of an existing one, often by adding either 
a prefix or a suffix. It brings considerable semantic changes to the word and often changes its word class, for example,
* dark &#8594; darkness
* agree &#8594; agreement
* friend &#8594; friendship
* derivation &#8594; derivational

The goal of stemming and lemmatization is to reduce either inflectional forms or derivational forms of 
a word to a common base form. 
Before we demonstrate the use of several state-of-the-art stemmers and lemmatizers implemented in NLTK, please read
[4] and section 3.6 in [2].
If you are a visual learner, you could watch the YouTube video on 
[Stemming](https://www.youtube.com/watch?v=2s7f8mBwnko) from Prof. Dan Jurafsky.

NLTK provides interfaces to several well-known stemmers, such as

* Porter Stemmer, which is based on 
[The Porter Stemming Algorithm](http://tartarus.org/martin/PorterStemmer/)
* Lancaster Stemmer, which is based on 
[The Lancaster Stemming Algorithm](https://tartarus.org/martin/PorterStemmer/),
* Snowball Stemmer, which is based on [the Snowball Stemming Algorithm](http://snowball.tartarus.org/)

Let's try the three stemmers on the words listed above.


In [0]:
words = ['watches', 'parties', 'carrying', 'loving', 'stopped', 'wetter', 'fattest', 
          'dying', 'darkness', 'agreement', 'friendship', 'derivational', 'denied',  'meeting']

The Porter Stemming Algorithm is one of the most common stemming algorithms.
It makes use of a series of heuristic replacement rules.

In [0]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
['{0} -> {1}'.format(w, stemmer.stem(w)) for w in words]

The Porter Stemmer works quite well on general cases, like 'watches' &#8594; 'watch' and 'darkness' &#8594; 'dark'.
However, for some special cases, the Porter Stemmer might not work as expected, 
like  'carrying'  &#8594; 'carri' and 'derivational' &#8594; 'deriv'. 
Note that a concept called "list comprehension" supported by Python is used here.
If you would like to know more about list comprehension, please click [here](http://www.secnetix.de/olli/Python/list_comprehensions.hawk).

The Lancaster Stemmer, published in 1990, is newer than the Porter Stemmer, which dates from 1980.

In [0]:
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()
['{0} -> {1}'.format(w, stemmer.stem(w)) for w in words]

After comparing the output from the Lancaster Stemmer and that from the Porter Stemmer, you might think that
the Lancaster Stemmer could be a bit more aggressive than the Porter Stemmer, since it gets 'agreement' &#8594; 'agr' and 'derivational' &#8594; 'der'. 
At the same time, it seems that the Lancaster Stemmer can handle words like 'parties' and 'carrying' quite well.

Now let's try the Snowball Stemmer.
The version in NLTK is available in 15 languages.
Different from the previous two stemmers, you need to specify which language the Snowball Stemmer will be applied to in its class constructor.
It works in a similar way to the Porter Stemmer.

In [0]:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')
['{0} -> {1}'.format(w, stemmer.stem(w)) for w in words]

A stemmer usually resorts to language-specific rules. 
Different stemmers implement different rules and therefore behave differently, 
as shown above.
The use of inflection and derivation is very complex in English.
There might not exist a set of rules that can cover all the cases.
Therefore, the stemmers that you have played with will always generate some out-of-vocabulary words.

Rather than using a stemmer, you can use a lemmatizer that utilises
more information about the language to accurately identify the lemma
for each word.
As pointed out in "**Stemming and lemmatization**" (see the Reading Materials in Part 4), 
> Stemmers use language-specific rules, but they require less knowledge than a lemmatizer, which needs a complete vocabulary and morphological analysis to correctly lemmatize words

The WordNet lemmatizer implemented in NLTK is based on WordNet's built-in morphy function, and returns the input word unchanged if it cannot be found in WordNet, which sounds more reasonable
than just chopping off prefixes and suffixes. In NLTK, you can use it in the following way:

In [0]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
['{0} -> {1}'.format(w, lemmatizer.lemmatize(w)) for w in words]

It is a bit strange that the lemmatizer did nothing to nearly all the words, except for 'watches' and 'parties'.
However, what happens if we specify the POS tag of each word?
Let's try a couple of words in our list.

In [0]:
lemmatizer.lemmatize('dying', pos='v')

In [0]:
lemmatizer.lemmatize('meeting', pos='v')

In [0]:
lemmatizer.lemmatize('meeting', pos='n')

In [0]:
lemmatizer.lemmatize('wetter', pos='a')

In [0]:
lemmatizer.lemmatize('fattest', pos='a')

If we know the POS tags of the words, the WordNet Lemmatizer can accurately identify the corresponding lemmas.
For example, given the word 'meeting' with different POS tags, the WordNet Lemmatizer returns different lemmas.
If no POS tag is given, it treats the word as a noun by default.
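In practice, POS tags often come from a tagger such as `nltk.pos_tag`, which returns Penn Treebank tags (e.g., 'VBG', 'JJR') rather than the single-letter codes the lemmatizer expects. A small mapping can bridge the two. The helper name `treebank_to_wordnet_pos` is hypothetical, and the mapping is a common simplification rather than part of NLTK.

```python
def treebank_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r' or 'n')."""
    if tag.startswith("J"):
        return "a"   # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"   # verbs: VB, VBD, VBG, ...
    if tag.startswith("R"):
        return "r"   # adverbs: RB, RBR, RBS
    return "n"       # fall back to noun, the lemmatizer's default

for tag in ["VBG", "JJR", "NN", "RB"]:
    print(tag, "->", treebank_to_wordnet_pos(tag))

# Assumed usage with the lemmatizer created earlier (not executed here):
# lemmatizer.lemmatize("dying", pos=treebank_to_wordnet_pos("VBG"))
```

Falling back to noun mirrors what the lemmatizer does when no tag is supplied, so untagged or unusual tags behave as before.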

Both stemming and lemmatization can significantly reduce the number of words in a vocabulary.
In other words, the downstream text analysis tools can benefit from them by saving running time
and memory space. But can stemming and lemmatization also improve the performance
of those tools? This is a debatable question. 
As pointed out in [4], stemming and lemmatization can increase recall but harm precision in information
retrieval. Researchers have also found that English document classification tasks often do not benefit 
from stemming and lemmatization.
However, this might not be the case for languages other than English, for example, German.

### 2.5. Sentence Segmentation

Sentence segmentation is also known as sentence boundary disambiguation or sentence boundary detection.
The following is the Wikipedia definition of sentence boundary disambiguation:
>Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Often natural language processing tools require their input to be divided into sentences for a number of reasons. However sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, decimal point, an ellipsis, or an email address - not the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. As well, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang.

SBD is one of the essential problems for many NLP tasks, like parsing, information extraction, machine translation, and document summarization. 
The accuracy of the SBD system will directly affect the performance of these applications. 

Sentences are the basic textual unit immediately above the word and phrase. 
So what is a sentence? Is it something ending with one of the punctuation marks ".", "!" or "?"?
Does a period always indicate a sentence boundary?
For English texts, it is almost as easy as finding every occurrence of those punctuation marks.
However, some periods occur as part of abbreviations, monetary numerals and percentages, as we 
have discussed in sections 1.2 and 1.3. 
Although you can use a few heuristic rules to correctly
identify the majority of sentence boundaries, SBD is much more complex than you might expect.
Please read section 4.2.4 of "Corpus-Based Work", listed in Part 4, Reading Materials, and watch a YouTube video on [Sentence segmentation](https://www.youtube.com/watch?v=9LXq3oQEEIA). 
Discussing more advanced techniques for SBD goes beyond our scope.
Instead, we will show you some sentence segmentation tools implemented in NLTK.
Please also note that there are other tools or packages containing a sentence tokenizer,
for example, Apache OpenNLP, the Stanford NLP toolkit, and so on.
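Before turning to NLTK, the following sketch shows why punctuation alone is not enough and how one simple heuristic helps. It splits after ".", "!" or "?" and then re-joins pieces that end in a known abbreviation. The abbreviation set, the function names and the sample text are hypothetical; this is not how the Punkt tokenizer works.

```python
import re

# Hypothetical and far from complete list of abbreviations.
ABBREVIATIONS = {"mr.", "mrs.", "dr."}

def naive_split(text):
    """Split wherever '.', '!' or '?' is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def split_with_abbreviations(text):
    """Naive split, then merge pieces that end in a known abbreviation."""
    sentences = []
    carry = ""
    for piece in naive_split(text):
        candidate = f"{carry} {piece}".strip()
        if not candidate:
            continue
        if candidate.split()[-1].lower() in ABBREVIATIONS:
            carry = candidate   # false boundary: keep accumulating
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences

# Hypothetical sample text, used only for illustration.
text = "Mr. Hussey was from home. Mrs. Hussey asked: Clam or cod?"
print(naive_split(text))
# ['Mr.', 'Hussey was from home.', 'Mrs.', 'Hussey asked: Clam or cod?']
print(split_with_abbreviations(text))
# ['Mr. Hussey was from home.', 'Mrs. Hussey asked: Clam or cod?']
```

The heuristic fixes these two cases, but it still fails when an abbreviation genuinely ends a sentence or is missing from the list, which is why trained models are generally preferred.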

NLTK's [Punkt Sentence Tokenizer](http://www.nltk.org/api/nltk.tokenize.html) was designed to split 
text into sentences "*by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences.*" It comes with a pre-trained sentence tokenizer for English.
Let's test it out with a couple of examples extracted from the book "Moby Dick" by 
Herman Melville, available on Project Gutenberg.
First, construct a pre-trained English sentence tokenizer:

In [0]:
import nltk.data
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

Following the instructions on the official website of the Punkt Sentence Tokenizer, we tokenize two snippets extracted
from "Moby Dick":

In [0]:
text1 = '''And so it turned out; Mr. Hosea Hussey being from home, but leaving 
Mrs. Hussey entirely competent to attend to all his affairs. Upon making known our desires 
for a supper and a bed, Mrs. Hussey, postponing further scolding for the present, ushered us 
into a little room, and seating us at a table spread with the relics of a recently concluded repast, 
turned round to us and said—"Clam or Cod?"'''


# '\n-----\n' is used as a separator between the detected sentences,
# which makes the processed text easier to read.
print('\n-----\n'.join(sent_detector.tokenize(text1.strip())))

In [0]:
text2 = '''A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but
that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"'''
print('\n-----\n'.join(sent_detector.tokenize(text2.strip())))

You can also use `sent_tokenize`, a convenience function that wraps a pre-trained Punkt Sentence Tokenizer.
Pre-trained Punkt models are available for many European languages.
```python
    from nltk.tokenize import sent_tokenize
    sent_tokenize(text1)
```
You should get output similar to the above.

Comparing the two results, we notice that the sentence tokenizer has trouble recognizing abbreviations.
It got "Mrs." right in the first snippet but not in the second. For more on this type of issue, please read the blog post "Testing out the NLTK sentence tokenizer", listed in Part 4, Reading Materials. 
* * *

## Part 3. Summary

In this chapter we have covered the fundamentals of text pre-processing. 
You have learnt how to access different text data, and how to carry out 
the following basic text pre-processing steps:
* Tokenization
* Case normalization
* Stopping
* Stemming and lemmatization
* Sentence segmentation

Now you should be able to perform those pre-processing tasks on a new corpus according
to the requirements of different text analysis tasks. 
We would like to point out that besides NLTK, there are other NLP tools of varying quality that can be used to process text data. For example, [the Stanford NLP Group](http://nlp.stanford.edu/software/) provides a list of tools for parsing, POS tagging, Named Entity Recognition (NER), word segmentation, tokenization, etc.; 
and [Mallet](http://mallet.cs.umass.edu/) is a Java-based package for statistical natural language processing. 
* * *

## Part 4. Reading Materials

1. "[Tokenization](http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)" 📖 .
2. "[Processing Raw Text](http://www.nltk.org/book_1ed/ch03.html)", Chapter 3 of "Natural Language Processing with Python".
3. "[The Art of Tokenization](https://www.ibm.com/developerworks/community/blogs/nlp/entry/tokenization?lang=en)": An IBM blog on tokenization. It gives a detailed discussion about word tokenization and its challenges 📖 .
4. "[Stemming and lemmatization](http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)" 📖 .
5. "[Dropping common terms: stop words](http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html)" 📖 .
6. "[Corpus-Based Work](https://www.deakin.edu.au/library)", Chapter 4 of "Foundations of Statistical Natural Language Processing" by Christopher D. Manning and Hinrich Schütze 📖 .
7. "[Testing out the NLTK sentence tokenizer](http://www.robincamille.com/2012-02-18-nltk-sentence-tokenizer/)"
8. "[Accessing Text Corpora and Lexical Resources](http://www.nltk.org/book/ch02.html)": Chapter 2 of "Natural Language Processing with Python" by Steven Bird, Ewan Klein & Edward Loper 📖 .
9. "[Corpus Readers](http://www.nltk.org/howto/corpus.html#tagged-corpora)": An NLTK tutorial on accessing the contents of a diverse set of corpora.
