# Text Preprocessing

This notebook demonstrates a simple text preprocessing pipeline using the [Natural Language Toolkit (NLTK)](https://www.nltk.org/index.html). 

Install the following two packages before starting the labs:

`pip install --user -U nltk`

`pip install --user -U trectools`

In [1]:
import nltk
import string
from collections import Counter

Raw text from [this Wikipedia page](https://en.wikipedia.org/wiki/Australia).

In [2]:
raw_text = "Australia, officially the Commonwealth of Australia, is a sovereign country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands. With an area of 7,617,930 square kilometres (2,941,300 sq mi), Australia is the largest country by area in Oceania and the world's sixth-largest country. Australia is the oldest, flattest, and driest inhabited continent, with the least fertile soils. It is a megadiverse country, and its size gives it a wide variety of landscapes and climates, with deserts in the centre, tropical rainforests in the north-east, and mountain ranges in the south-east.\nIndigenous Australians have inhabited the continent for approximately 65,000 years. The European maritime exploration of Australia commenced in the early 17th century with the arrival of Dutch explorers. In 1770, Australia's eastern half was claimed by Great Britain and initially settled through penal transportation to the colony of New South Wales from 26 January 1788, a date which became Australia's national day."

In [3]:
print(raw_text)

Australia, officially the Commonwealth of Australia, is a sovereign country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands. With an area of 7,617,930 square kilometres (2,941,300 sq mi), Australia is the largest country by area in Oceania and the world's sixth-largest country. Australia is the oldest, flattest, and driest inhabited continent, with the least fertile soils. It is a megadiverse country, and its size gives it a wide variety of landscapes and climates, with deserts in the centre, tropical rainforests in the north-east, and mountain ranges in the south-east.
Indigenous Australians have inhabited the continent for approximately 65,000 years. The European maritime exploration of Australia commenced in the early 17th century with the arrival of Dutch explorers. In 1770, Australia's eastern half was claimed by Great Britain and initially settled through penal transportation to the colony of New South Wales from 26 Januar

## Sentence splitting

Splitting text into sentences.

In [5]:
from nltk.tokenize import sent_tokenize

In [6]:
help(sent_tokenize)  # show the documentation of `sent_tokenize`

Help on function sent_tokenize in module nltk.tokenize:

sent_tokenize(text, language='english')
    Return a sentence-tokenized copy of *text*,
    using NLTK's recommended sentence tokenizer
    (currently :class:`.PunktSentenceTokenizer`
    for the specified language).
    
    :param text: text to split into sentences
    :param language: the model name in the Punkt corpus



In [7]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /home/r10x8596/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/r10x8596/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [8]:
sentences = sent_tokenize(raw_text)

In [9]:
# print the sentences line by line

print(f'There are {len(sentences)} sentences')
for i, sentence in enumerate(sentences, start=1):
    print(f'{i}: {sentence}')

There are 7 sentences
1: Australia, officially the Commonwealth of Australia, is a sovereign country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands.
2: With an area of 7,617,930 square kilometres (2,941,300 sq mi), Australia is the largest country by area in Oceania and the world's sixth-largest country.
3: Australia is the oldest, flattest, and driest inhabited continent, with the least fertile soils.
4: It is a megadiverse country, and its size gives it a wide variety of landscapes and climates, with deserts in the centre, tropical rainforests in the north-east, and mountain ranges in the south-east.
5: Indigenous Australians have inhabited the continent for approximately 65,000 years.
6: The European maritime exploration of Australia commenced in the early 17th century with the arrival of Dutch explorers.
7: In 1770, Australia's eastern half was claimed by Great Britain and initially settled through penal transportation to t
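
`sent_tokenize` relies on a trained Punkt model rather than a fixed rule. The sketch below is not part of NLTK: `naive_sent_split` is a hypothetical helper that splits after every `.`, `!` or `?` followed by whitespace. It is only meant to illustrate why such a rule is fragile: abbreviations such as "Dr." and "p.m." produce spurious sentence boundaries, which is the kind of error a tokenizer designed to recognise abbreviations tries to avoid.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

sample = "Dr. Smith arrived at 5 p.m. He left early."
pieces = naive_sent_split(sample)

# A human reader sees two sentences; the naive rule produces three pieces.
for i, piece in enumerate(pieces, start=1):
    print(f"{i}: {piece}")
```

Running `sent_tokenize(sample)` on the same string is a quick way to compare the two approaches.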

## Tokenisation

Dividing a string into a list of tokens.

In [10]:
from nltk.tokenize import word_tokenize

In [11]:
# help(word_tokenize)

In [12]:
tokens_list = [word_tokenize(s) for s in sentences]

In [13]:
print(tokens_list)

[['Australia', ',', 'officially', 'the', 'Commonwealth', 'of', 'Australia', ',', 'is', 'a', 'sovereign', 'country', 'comprising', 'the', 'mainland', 'of', 'the', 'Australian', 'continent', ',', 'the', 'island', 'of', 'Tasmania', ',', 'and', 'numerous', 'smaller', 'islands', '.'], ['With', 'an', 'area', 'of', '7,617,930', 'square', 'kilometres', '(', '2,941,300', 'sq', 'mi', ')', ',', 'Australia', 'is', 'the', 'largest', 'country', 'by', 'area', 'in', 'Oceania', 'and', 'the', 'world', "'s", 'sixth-largest', 'country', '.'], ['Australia', 'is', 'the', 'oldest', ',', 'flattest', ',', 'and', 'driest', 'inhabited', 'continent', ',', 'with', 'the', 'least', 'fertile', 'soils', '.'], ['It', 'is', 'a', 'megadiverse', 'country', ',', 'and', 'its', 'size', 'gives', 'it', 'a', 'wide', 'variety', 'of', 'landscapes', 'and', 'climates', ',', 'with', 'deserts', 'in', 'the', 'centre', ',', 'tropical', 'rainforests', 'in', 'the', 'north-east', ',', 'and', 'mountain', 'ranges', 'in', 'the', 'south-east'

The top-10 most common tokens.

In [14]:
Counter([w for x in tokens_list for w in x]).most_common(10)

[('the', 15),
 (',', 14),
 ('of', 8),
 ('Australia', 7),
 ('and', 7),
 ('.', 7),
 ('in', 5),
 ('is', 4),
 ('a', 4),
 ('country', 4)]
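
The counting cell flattens the per-sentence token lists with a nested list comprehension before passing them to `Counter`. As a minimal sketch on toy data (the `nested` list is made up for illustration), `itertools.chain.from_iterable` does the same flattening, and `most_common` lists tokens with equal counts in the order they were first encountered. That is why `'Australia'`, `'and'` and `'.'`, which all occur 7 times, appear in that order in the output above.

```python
from collections import Counter
from itertools import chain

# toy stand-in for tokens_list: one inner list per sentence
nested = [['the', 'cat', ','], ['the', 'dog', ',']]

# flatten lazily, then count
counts = Counter(chain.from_iterable(nested))

# 'the' and ',' both occur twice; 'the' was seen first, so it is listed first
top = counts.most_common(2)
print(top)
```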

### Exercise

Try [other tokenisers provided by NLTK](https://www.nltk.org/api/nltk.tokenize.html) (e.g. RegexpTokenizer, WhitespaceTokenizer, WordPunctTokenizer) and compare their outputs.

In [15]:
from nltk.tokenize import WhitespaceTokenizer
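
As a starting point for the comparison, the sketch below approximates three tokenisation strategies with the standard library only. These are stand-ins, not the NLTK classes themselves: `str.split()` roughly corresponds to `WhitespaceTokenizer`, the pattern `\w+|[^\w\s]+` is the one documented for `WordPunctTokenizer`, and `\w+` is just one pattern that could be passed to `RegexpTokenizer`. The sample sentence is adapted from the raw text.

```python
import re

sentence = "Australia is the world's sixth-largest country."

# split on whitespace only: punctuation stays attached to words
whitespace_tokens = sentence.split()

# runs of word characters OR runs of punctuation become separate tokens
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", sentence)

# keep only runs of word characters: punctuation is discarded
word_only_tokens = re.findall(r"\w+", sentence)

for name, toks in [("whitespace", whitespace_tokens),
                   ("wordpunct", wordpunct_tokens),
                   ("word-only", word_only_tokens)]:
    print(f"{name:10s} {toks}")
```

Note how `world's` and `sixth-largest` are treated differently by each strategy, and compare with the `word_tokenize` output above, which yields `'world'`, `"'s"` and `'sixth-largest'`.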

### Question 

What are the differences and how can we choose the best tokeniser for a task?

## Removing punctuation and stop words

Stop words and punctuation carry little information for many IR tasks, and removing them reduces the number of tokens we need to process.

In [16]:
from nltk.corpus import stopwords

In [17]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/r10x8596/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [18]:
stopwords_en = set(stopwords.words('english'))

In [19]:
print(stopwords_en)

{'just', 'him', "she'll", 'until', 'that', 'weren', 'does', "mightn't", 'then', "she's", 'which', 'because', 'had', 'are', 'same', 'a', 'doing', 'has', 'to', 'at', 'hers', 'through', 'having', 'before', 'further', "haven't", 'than', "aren't", 'mightn', 'where', 'his', "you're", 'don', 'after', 'over', 'me', 'he', 'myself', 'we', 'so', 'themselves', 'why', 'from', 'did', 've', 'all', 'y', 'for', "that'll", 'with', 'as', 'won', 'it', 'very', 'will', "you'd", "couldn't", "i've", "needn't", 'by', 'own', 'some', "she'd", "you'll", 'i', 'do', 'herself', "it'll", 'ours', 'few', 'against', 'between', 'here', "we're", 'above', 'was', 'there', 'both', "it's", 'shan', 'isn', "don't", "hadn't", 'up', 'these', 'too', 'once', 'under', "weren't", 'have', 'needn', 'you', 'who', "we've", 'off', "should've", 'yourselves', 'hadn', "doesn't", 'down', 're', 'should', 'she', 'aren', 'their', 'm', "we'll", 'and', 'yours', 'can', 'any', 'during', "he's", 'himself', "shouldn't", 'more', 'whom', 'its', "i'm", '

In [20]:
tokens_list[:] = [[w for w in x if w not in string.punctuation and w not in stopwords_en] for x in tokens_list]
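
One subtlety in the filter above: `string.punctuation` is a string, so `w not in string.punctuation` is a substring test. It works for single-character tokens such as `,` and `.`, but multi-character punctuation tokens (for example the ``` `` ``` and `''` that `word_tokenize` emits for double quotation marks, or `...`) are not substrings and would be kept. The following is a minimal sketch of a stricter check; `is_punctuation` is a hypothetical helper, not an NLTK function.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punctuation(token):
    """True if the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

# columns: token, result of the substring test, result of the per-character test
for tok in [',', '``', "''", '...', "'s", 'north-east']:
    print(repr(tok), tok in string.punctuation, is_punctuation(tok))
```

For the Wikipedia excerpt used here both tests give the same result, since it contains no quotation marks or ellipses.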

In [21]:
print(tokens_list)

[['Australia', 'officially', 'Commonwealth', 'Australia', 'sovereign', 'country', 'comprising', 'mainland', 'Australian', 'continent', 'island', 'Tasmania', 'numerous', 'smaller', 'islands'], ['With', 'area', '7,617,930', 'square', 'kilometres', '2,941,300', 'sq', 'mi', 'Australia', 'largest', 'country', 'area', 'Oceania', 'world', "'s", 'sixth-largest', 'country'], ['Australia', 'oldest', 'flattest', 'driest', 'inhabited', 'continent', 'least', 'fertile', 'soils'], ['It', 'megadiverse', 'country', 'size', 'gives', 'wide', 'variety', 'landscapes', 'climates', 'deserts', 'centre', 'tropical', 'rainforests', 'north-east', 'mountain', 'ranges', 'south-east'], ['Indigenous', 'Australians', 'inhabited', 'continent', 'approximately', '65,000', 'years'], ['The', 'European', 'maritime', 'exploration', 'Australia', 'commenced', 'early', '17th', 'century', 'arrival', 'Dutch', 'explorers'], ['In', '1770', 'Australia', "'s", 'eastern', 'half', 'claimed', 'Great', 'Britain', 'initially', 'settled',

The top-10 most common tokens.

In [22]:
Counter([w for x in tokens_list for w in x]).most_common(10)

[('Australia', 7),
 ('country', 4),
 ('continent', 3),
 ("'s", 3),
 ('area', 2),
 ('inhabited', 2),
 ('officially', 1),
 ('Commonwealth', 1),
 ('sovereign', 1),
 ('comprising', 1)]

### Question

Will we get a different set of tokens if we lowercase all words before removing stop words? What problems could that cause?
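
One way to explore this question is sketched below on toy data. The stop-word set and the token list are made up for illustration (a tiny stand-in for `stopwords.words('english')`, whose entries are all lowercase). The case-sensitive filter keeps capitalised stop words such as `With`, as seen in the output above, while lowercasing first also removes tokens like the acronym `IT` that merely look like stop words.

```python
# hypothetical mini stop-word list, standing in for stopwords.words('english')
stop_words = {'with', 'it', 'the', 'in', 'is'}

# toy tokens: 'IT' is meant as the acronym, 'it' as the pronoun
tokens = ['With', 'IT', 'support', 'it', 'is', 'easy']

# filter as in the notebook: exact, case-sensitive match
case_sensitive = [w for w in tokens if w not in stop_words]

# lowercase first, then filter
lowercased = [w.lower() for w in tokens if w.lower() not in stop_words]

print(case_sensitive)
print(lowercased)
```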

## Stemming

Turning words into stems.

In [23]:
from nltk.stem import PorterStemmer

In [24]:
stemmer = PorterStemmer()

In [25]:
tokens_stem = [stemmer.stem(w) for x in tokens_list for w in x]

In [26]:
print(tokens_stem)

['australia', 'offici', 'commonwealth', 'australia', 'sovereign', 'countri', 'compris', 'mainland', 'australian', 'contin', 'island', 'tasmania', 'numer', 'smaller', 'island', 'with', 'area', '7,617,930', 'squar', 'kilometr', '2,941,300', 'sq', 'mi', 'australia', 'largest', 'countri', 'area', 'oceania', 'world', "'s", 'sixth-largest', 'countri', 'australia', 'oldest', 'flattest', 'driest', 'inhabit', 'contin', 'least', 'fertil', 'soil', 'it', 'megadivers', 'countri', 'size', 'give', 'wide', 'varieti', 'landscap', 'climat', 'desert', 'centr', 'tropic', 'rainforest', 'north-east', 'mountain', 'rang', 'south-east', 'indigen', 'australian', 'inhabit', 'contin', 'approxim', '65,000', 'year', 'the', 'european', 'maritim', 'explor', 'australia', 'commenc', 'earli', '17th', 'centuri', 'arriv', 'dutch', 'explor', 'in', '1770', 'australia', "'s", 'eastern', 'half', 'claim', 'great', 'britain', 'initi', 'settl', 'penal', 'transport', 'coloni', 'new', 'south', 'wale', '26', 'januari', '1788', 'dat

In [27]:
Counter(tokens_stem).most_common(10)

[('australia', 7),
 ('countri', 4),
 ('contin', 3),
 ("'s", 3),
 ('australian', 2),
 ('island', 2),
 ('area', 2),
 ('inhabit', 2),
 ('explor', 2),
 ('offici', 1)]

### Exercise

Try other NLTK stemmers (e.g. SnowballStemmer, RegexpStemmer). You may need to download additional data packages; see https://www.nltk.org/data.html

In [28]:
# from nltk.stem import SnowballStemmer, RegexpStemmer
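
To build intuition before trying the NLTK stemmers, here is a minimal sketch of rule-based suffix stripping, in the spirit of a regular-expression stemmer. It is not the Porter algorithm and not NLTK code: `strip_suffix`, the suffix list and the minimum length are all assumptions chosen for illustration.

```python
import re

# assumed suffix list for illustration only
SUFFIXES = re.compile(r'ing$|s$|e$|able$')

def strip_suffix(word, min_length=4):
    """Remove one matching suffix from words of at least min_length characters."""
    if len(word) < min_length:
        return word
    return SUFFIXES.sub('', word)

words = ['islands', 'comprising', 'size', 'mass', 'was']
stems = [strip_suffix(w) for w in words]

for w, s in zip(words, stems):
    print(f'{w} -> {s}')
```

The results for `size` and `mass` show how blunt fixed rules can be: the stem is no longer a word, and unrelated words may collapse to the same stem.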

## Lemmatisation

Turning words into lemmas (dictionary entries). This requires knowledge of the context, typically the intended part of speech of the word in that context.

In [29]:
from nltk.stem import WordNetLemmatizer

In [30]:
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package wordnet to /home/r10x8596/nltk_data...
[nltk_data] Downloading package omw-1.4 to /home/r10x8596/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/r10x8596/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/r10x8596/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

POS tagging for lemmatisation.

In [31]:
tags_list = nltk.pos_tag_sents(tokens_list)

A heuristic to convert POS tags to the [four syntactic categories that WordNet recognizes (i.e. **noun**, **verb**, **adj** and **adv**)](https://wordnet.princeton.edu/):
- `n` for nouns
- `v` for verbs
- `a` for adjectives
- `r` for adverbs

In [32]:
print(tags_list)

[[('Australia', 'NNP'), ('officially', 'RB'), ('Commonwealth', 'NNP'), ('Australia', 'NNP'), ('sovereign', 'JJ'), ('country', 'NN'), ('comprising', 'VBG'), ('mainland', 'NN'), ('Australian', 'JJ'), ('continent', 'NN'), ('island', 'NN'), ('Tasmania', 'NNP'), ('numerous', 'JJ'), ('smaller', 'JJR'), ('islands', 'NNS')], [('With', 'IN'), ('area', 'NN'), ('7,617,930', 'CD'), ('square', 'JJ'), ('kilometres', 'NNS'), ('2,941,300', 'CD'), ('sq', 'JJ'), ('mi', 'NN'), ('Australia', 'NNP'), ('largest', 'JJS'), ('country', 'NN'), ('area', 'NN'), ('Oceania', 'NNP'), ('world', 'NN'), ("'s", 'POS'), ('sixth-largest', 'JJ'), ('country', 'NN')], [('Australia', 'NNP'), ('oldest', 'JJS'), ('flattest', 'JJS'), ('driest', 'NN'), ('inhabited', 'VBN'), ('continent', 'NN'), ('least', 'JJS'), ('fertile', 'JJ'), ('soils', 'NNS')], [('It', 'PRP'), ('megadiverse', 'VBZ'), ('country', 'NN'), ('size', 'NN'), ('gives', 'VBZ'), ('wide', 'JJ'), ('variety', 'NN'), ('landscapes', 'NNS'), ('climates', 'VBZ'), ('deserts',

In [33]:
def wordnet_tag(t):  # first letter of a POS tag -> WordNet category, defaulting to noun
    return 'a' if t == 'j' else (t if t in ('n', 'v', 'r') else 'n')
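
The heuristic looks only at the lowercased first letter of each tag. The sketch below spells out the same logic as a standalone function (`to_wordnet_pos` is a name introduced here for illustration) and applies it to a few of the tag types that the tagger produces.

```python
def to_wordnet_pos(penn_tag):
    """Map a POS tag to a WordNet category using its first letter; default to noun."""
    first = penn_tag[0].lower()
    if first == 'j':          # JJ, JJR, JJS -> adjective
        return 'a'
    return first if first in ('n', 'v', 'r') else 'n'

for tag in ['NNP', 'VBG', 'JJS', 'RB', 'CD', 'IN', 'RP']:
    print(tag, '->', to_wordnet_pos(tag))
```

Tags outside the four categories, such as `CD` (number) and `IN` (preposition), fall back to `n`. Because only the first letter is checked, `RP` (particle) is treated as an adverb, which is a side effect of the heuristic rather than a deliberate mapping.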

Lemmatising

In [34]:
lemmatizer = WordNetLemmatizer()

In [35]:
tokens_lemma = [lemmatizer.lemmatize(w.lower(), pos=wordnet_tag(t[0].lower())) for x in tags_list for (w, t) in x]

In [36]:
print(tokens_lemma)

['australia', 'officially', 'commonwealth', 'australia', 'sovereign', 'country', 'comprise', 'mainland', 'australian', 'continent', 'island', 'tasmania', 'numerous', 'small', 'island', 'with', 'area', '7,617,930', 'square', 'kilometre', '2,941,300', 'sq', 'mi', 'australia', 'large', 'country', 'area', 'oceania', 'world', "'s", 'sixth-largest', 'country', 'australia', 'old', 'flat', 'driest', 'inhabit', 'continent', 'least', 'fertile', 'soil', 'it', 'megadiverse', 'country', 'size', 'give', 'wide', 'variety', 'landscape', 'climates', 'desert', 'centre', 'tropical', 'rainforest', 'north-east', 'mountain', 'range', 'south-east', 'indigenous', 'australian', 'inhabited', 'continent', 'approximately', '65,000', 'year', 'the', 'european', 'maritime', 'exploration', 'australia', 'commence', 'early', '17th', 'century', 'arrival', 'dutch', 'explorer', 'in', '1770', 'australia', "'s", 'eastern', 'half', 'claim', 'great', 'britain', 'initially', 'settle', 'penal', 'transportation', 'colony', 'new'

In [37]:
Counter(tokens_lemma).most_common(10)

[('australia', 7),
 ('country', 4),
 ('continent', 3),
 ("'s", 3),
 ('australian', 2),
 ('island', 2),
 ('area', 2),
 ('officially', 1),
 ('commonwealth', 1),
 ('sovereign', 1)]

### Question

Compare the results of stemming and lemmatisation. Can you see the differences and the potential problems with stemming and lemmatisation?
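
As a starting point, the sketch below lines up a few tokens with the stems and lemmas printed in the outputs above (the triples are copied by hand from those outputs, not recomputed) and flags where the two differ.

```python
# (token, Porter stem, WordNet lemma) copied from the notebook outputs
rows = [
    ('officially', 'offici', 'officially'),
    ('country', 'countri', 'country'),
    ('comprising', 'compris', 'comprise'),
    ('continent', 'contin', 'continent'),
    ('islands', 'island', 'island'),
    ('smaller', 'smaller', 'small'),
    ('largest', 'largest', 'large'),
    ('climates', 'climat', 'climates'),
]

# keep the rows where stemming and lemmatisation disagree
differing = [row for row in rows if row[1] != row[2]]

print(f"{'token':12s}{'stem':12s}{'lemma':12s}")
for token, stem, lemma in rows:
    flag = '' if stem == lemma else '  <- differs'
    print(f'{token:12s}{stem:12s}{lemma:12s}{flag}')
```

Stems such as `offici` and `contin` are not dictionary words, while the lemmatiser leaves `climates` unchanged because the tagger labelled it `VBZ`, so lemmatisation is only as good as the POS tags it receives.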