# Week 1. NLP Basics with NLTK

## 1. What is NLTK?

NLTK (Natural Language Toolkit) is one of the earliest and most important Python-based NLP development tools. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.

In summary, NLTK provides a convenient interface to over 50 corpora and lexical resources. One of these is WordNet®, a large lexical database of English developed at Princeton University since the 1980s, in which nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. WordNet is one of the most important and fundamental lexical databases in the NLP world.
Other lexical databases and corpora include the Penn Treebank Corpus, Open Multilingual Wordnet, Problem Report Corpus, and Lin's Dependency Thesaurus.

In fact, the most important feature of NLTK is that it contains basic statistics-based text processing libraries for FIVE fundamental NLP enabling technologies, together with a basic semantic reasoning tool:
- tokenization
- parsing
- classification
- stemming
- tagging
- basic semantic reasoning

In this workshop, all demonstrations use Python 3.11.9 as the running environment. We strongly recommend that you build an independent virtual environment for the workshops of this book. With any pre-installed version of Anaconda, the command is:
### conda create -n *your_virtual_environment_name* python=3.11
Please confirm that you have installed the packages listed below before you start the workshop:
- Python (demo version 3.11.9)
- Tensorflow (demo version 2.17.0)
- NLTK (demo version 3.9.1)
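
One possible way to confirm the installed versions from inside Python is sketched below. It relies only on the standard library; the helper name `installed_version` is made up for this illustration, and the distribution names `nltk` and `tensorflow` are assumed to match how the packages were installed.

```python
import sys
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the interpreter version and the versions of the workshop packages.
print("Python:", sys.version.split()[0])
for name in ("nltk", "tensorflow"):
    version = installed_version(name)
    print(f"{name}:", version if version else "not installed")
```

If a package is reported as "not installed", install it into the active virtual environment before continuing.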


## 2. A Taste of NLTK on Text Tokenization

In [None]:
# Import NLTK package
import nltk

In [None]:
# Create a sample utterance 1 (utt1)
utt1 = "On every weekend, early in the morning. I drive my car to the car center for car washing. Like clock-work."

In [None]:
# Display utterance
utt1

'On every weekend, early in the morning. I drive my car to the car center for car washing. Like clock-work.'

In [None]:
nltk.download('punkt_tab')  # Download the Punkt tokenizer data that word_tokenize() relies on

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
# Create utterance tokens (utokens)
utokens = nltk.word_tokenize(utt1)

In [None]:
# Display utokens
utokens

['On',
 'every',
 'weekend',
 ',',
 'early',
 'in',
 'the',
 'morning',
 '.',
 'I',
 'drive',
 'my',
 'car',
 'to',
 'the',
 'car',
 'center',
 'for',
 'car',
 'washing',
 '.',
 'Like',
 'clock-work',
 '.']

## 3. How to Install NLTK?
#### Type 'pip install nltk'

#### Installing NLTK Data
#### Once you have finished installing NLTK into Python, you can download the NLTK Data
#### 3.1 Run Python
#### 3.2 Type the following to activate the NLTK downloader.
- import nltk
- nltk.download()

## 4. Why Use Python for NLP?

Before Python became popular in AI and NLP, C, C++ and later Java dominated the world of software development.
However, starting from the early 2000s, Python and its associated toolkits and packages began to dominate, especially in the areas of Data Science, AI and NLP.

Several reasons drove this change:
1. Python is a generalist language, meaning that it does not specialize in one area.
2. Other commonly used languages such as Java and JavaScript, on the other hand, are geared more toward the web, so they are most suitable for developing web applications and websites.
3. As a generalist language, Python can be used for a wide variety of purposes, including:
- Data Science Analysis and Applications
- Developing web apps
- Creating software (both Web and Non-web based)
- AI modeling and applications (e.g. building Deep Networks)
- Natural language processing
4. Python is easy to learn and use. Compared with C and C++, it is much easier to learn, which is especially useful for non-computer science students and scientists.
5. In terms of NLP, Python's list and list-processing data types provide an excellent environment for NLP modeling.

The following simple Python program shows how Python handles text as list objects, which by itself already gives us a simple tokenization tool for NLP.

In [None]:
# Define utterance 2 (utt2)
utt2 = "Hello world. How are you?"

In [None]:
# Using split() method to split it into word tokens
utt2.split()

['Hello', 'world.', 'How', 'are', 'you?']

In [None]:
# Count the number of word tokens
nwords = len(utt2.split())
print ("'Hello world. How are you?' contains ",nwords," words.")

'Hello world. How are you?' contains  5  words.
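
Note that split() leaves punctuation attached to words ('world.' and 'you?'). The sketch below shows one way to separate punctuation using only the standard `re` module. The function name `simple_tokenize` is made up for this illustration, and the regular expression is a deliberately naive rule, not the algorithm behind nltk.word_tokenize; for example, it would break a hyphenated word such as "clock-work" into three tokens.

```python
import re

utt2 = "Hello world. How are you?"


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Whitespace splitting keeps punctuation glued to the neighbouring word.
print(utt2.split())
# The regex version emits punctuation marks as separate tokens.
print(simple_tokenize(utt2))
```

With this rule the utterance yields seven tokens instead of five, because '.' and '?' are counted separately.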


## 5. NLTK with Basic Text Processing in NLP

As mentioned, one important feature of NLTK is that it provides simple Python tools and methods for us to learn and practise NLP technology, starting with basic text processing.
They include:
1. Basic text processing as lists of words.
2. Basic statistics on text processing in NLP.
3. Simple text analysis.
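
As a rough illustration of the first two items, the sketch below treats the earlier sample utterance as a list of words and computes a few basic statistics with plain Python. The tokenization rule (whitespace split, strip a few punctuation marks, lowercase) is an assumption chosen for brevity, not what NLTK does.

```python
from collections import Counter

utt1 = ("On every weekend, early in the morning. I drive my car to the "
        "car center for car washing. Like clock-work.")

# Naive tokens: split on whitespace, strip surrounding punctuation, lowercase.
tokens = [w.strip(".,!?").lower() for w in utt1.split()]
tokens = [w for w in tokens if w]

n_tokens = len(tokens)                       # total number of word tokens
vocabulary = set(tokens)                     # distinct word types
lexical_diversity = len(vocabulary) / n_tokens
freq = Counter(tokens)                       # frequency of each word type

print("tokens:", n_tokens)
print("vocabulary size:", len(vocabulary))
print("lexical diversity:", lexical_diversity)
print("most common:", freq.most_common(2))
```

The same three quantities (token count, vocabulary size, and their ratio) can be computed for the NLTK sample texts once they are loaded, since each behaves like a list of tokens.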

Before we start, we of course need some text documents to work with.
Just like the Project Gutenberg word counting we have just learnt, nothing is more straightforward than analyzing classic literature such as Moby Dick.
However, in terms of NLP, it is even better to study a variety of document types, such as classics, news articles and even public speeches.
Why? ....
So NLTK provides NINE typical text documents for us to start with. They include classic literature, Bible texts, famous public speeches, news and articles, and a personals corpus.
So, let's start ...

In [None]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [None]:
nltk.download('genesis')

[nltk_data] Downloading package genesis to /root/nltk_data...
[nltk_data]   Unzipping corpora/genesis.zip.


True

In [None]:
nltk.download('inaugural')

[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.


True

In [None]:
nltk.download('nps_chat')

[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data]   Unzipping corpora/nps_chat.zip.


True

In [None]:
nltk.download('webtext')

[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Unzipping corpora/webtext.zip.


True

In [None]:
nltk.download('treebank')

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


True

In [None]:
# Let's load some sample books from NLTK databank
import nltk
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [None]:
# Display the list of sample books
texts()

text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [None]:
# Check text1
text1

<Text: Moby Dick by Herman Melville 1851>

In [None]:
# Know more about text1, check this
text1?

In [None]:
# Import word_tokenize
from nltk.tokenize import word_tokenize

# Use text1 (Moby Dick) from NLTK book corpus
tholmes = text1

## 6. Simple Text Analysis with NLTK

In text analysis, one common operation is to study how a particular word (or phrase) appears in a text document, especially in classic literature and other famous texts such as public speeches.
Unlike a normal "search" function, the concordance() function lets us study and analyze how a particular word appears within a text document. In other words, it shows not only each occurrence but, more importantly, the neighbouring words and phrases as well.
Let's try some examples with tholmes, which here holds text1 (Moby Dick).

In [None]:
# Check concordance of word "Sherlock"
tholmes.concordance("Sherlock")

no matches


1. The above search returns no matches: tholmes holds text1 (Moby Dick), and "Sherlock" never appears in it. In a text that does contain the word, concordance() lists every occurrence together with its context, so we can study when it is used. As "Sherlock" is a rather "special" word that is strongly linked with the name "Sherlock Holmes", one would expect "Sherlock" and "Holmes" to appear together almost all the time.

2. However, in text analysis, and especially when learning English from great literature such as Moby Dick, one important thing we want to know is how commonly used words and phrases are used by these great authors, and what other words of similar meaning (i.e. synonyms) they use to improve the use of English.

In the following example, let's study how "extreme" is used in Moby Dick.

In [None]:
# Check concordance of word "extreme"
tholmes.concordance("extreme")

Displaying 13 of 13 matches:
he streets take you waterward . Its extreme downtown is the battery , where tha
ll flourish , must indeed have been extreme . But it was not in reasonable natu
ue lurks in these small things when extreme political superstitions invest them
hem for the event . It took off the extreme edge of their wonder ; and so what 
t been descried . Likewise upon the extreme stern of the boat where it was also
mes over a man only in some time of extreme tribulation ; it comes in the very 
, both by night and by day , and so extreme was the hard work they underwent , 
the leaded chocks or grooves in the extreme pointed prow of the boat , where a 
re now to consider that only in the extreme , lower , backward sloping part of 
s . His motions plainly denoted his extreme exhaustion . In most land animals t
ntly rocking , jerking boat , under extreme headway . Steel and wood included ,
rced his groin ; nor was it without extreme difficulty that the agonizing wound
' ll heave 

<img src="./note.png" width = "" height = "" alt="note" align=left />
Note: As one can see, many dictionaries use the concordance technique to teach English, the so-called "Use of English", which concerns not only grammar but also how different words (or phrases) are used in practice. This is what we now call "Learn by Examples".
In this example, we learnt how the word "extreme" is used in various situations and scenarios.
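
To make the idea behind a concordance concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation: the function name `concordance_lines` is made up, the context is measured in tokens rather than characters, and the sample sentence is stitched together from two of the fragments shown above purely for illustration.

```python
def concordance_lines(tokens, target, width=3):
    """Return each occurrence of target with `width` tokens of context per side."""
    target = target.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Mark the matched word with brackets so it stands out.
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


sample = ("It took off the extreme edge of their wonder "
          "and under extreme headway").split()

for line in concordance_lines(sample, "extreme"):
    print(line)
```

A word that never occurs in the token list simply produces an empty result, which corresponds to the "no matches" message seen earlier.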

In [None]:
# Check words used in contexts similar to "extreme" in tholmes
tholmes.similar("extreme")

it long there huge little bound short ishmael nor deck inducements
spent outer roughly ahab terrific jagged severest impressive manifold


In [None]:
# Check concordance of word "extreme" in text2
text2.concordance("extreme")

Displaying 4 of 4 matches:
n another day or two perhaps ; this extreme mildness can hardly last longer -- 
ng her that he was kept away by the extreme affection for herself , which he co
 of his brother , and lamenting the extreme GAUCHERIE which he really believed 
y which had been leading her to the extreme of languid indolence and selfish re


In [None]:
# Check words similar to "extreme" in text2
text2.similar("extreme")

family centre good opinion life death loss house society children
attachment wishes interest goodness heart comfort cheerfulness
existence marriage son


In [None]:
# Check concordance of word "extreme" in text4
text4.concordance("extreme")

Displaying 3 of 3 matches:
 vigilance no Administration by any extreme of wickedness or folly can very ser
ent , and communication between the extreme limits of the country made easier t
the politics of petty bickering and extreme partisanship they plainly deplore .


In [None]:
# Check words similar to "extreme" in text4
text4.similar("extreme")

one other just hope motives act people agency system right form loss
length knowledge science portion quarter narrowest requisite member


As one can see, even for a commonly used word such as "extreme", different authors have different "styles" of usage.
In short, Herman Melville uses the word "extreme" quite frequently in Moby Dick (13 matches), each time in a different way.
Jane Austen's usage of "extreme" is also very "colorful" and "fruitful", but less frequent (4 matches).
In the Inaugural Address Corpus (3 matches), the usage of the word "extreme" becomes more "standard" and "rigid" in some sense.

The common_contexts() method allows you to examine the contexts that are shared by two or more words.

Let's take a look at how it works.

First, using Moby Dick as an example, let's find the common context for the two words "extreme" and "huge".

To do so, call the common_contexts() method of the object "tholmes".

In [None]:
# Check common contexts on tholmes
tholmes.common_contexts(["extreme","huge"])

the_lower


This means that, after analysing the two words "extreme" and "huge", the method finds that the common context for these two words is the pattern the_lower: each word occurs at least once between "the" and "lower", with the underscore marking the position of the word.
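
The idea of a shared context can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not NLTK's code: the helper names `context_index` and `shared_contexts` are made up, a context is taken to be the pair (previous word, next word), and punctuation tokens are dropped before pairing so that "the extreme , lower" yields the context the_lower. The sample sentence is invented.

```python
from collections import defaultdict


def context_index(tokens):
    """Map each lowercased word to the set of (previous, next) word pairs around it."""
    words = [t.lower() for t in tokens if t.isalpha()]  # drop punctuation tokens
    index = defaultdict(set)
    for i in range(1, len(words) - 1):
        index[words[i]].add((words[i - 1], words[i + 1]))
    return index


def shared_contexts(tokens, words):
    """Return the contexts, formatted as left_right, common to all given words."""
    index = context_index(tokens)
    common = set.intersection(*(index[w.lower()] for w in words))
    return sorted(f"{left}_{right}" for left, right in common)


sample = "in the extreme , lower part and in the huge lower jaw".split()
print(shared_contexts(sample, ["extreme", "huge"]))
```

Here both words sit between "the" and "lower", so the only shared context is the_lower.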

To check this, call the concordance() function for each of the two words and compare the results against the extracted pattern, as shown below.

In [None]:
# Check concordance of word "extreme" in tholmes
tholmes.concordance("extreme")

Displaying 13 of 13 matches:
he streets take you waterward . Its extreme downtown is the battery , where tha
ll flourish , must indeed have been extreme . But it was not in reasonable natu
ue lurks in these small things when extreme political superstitions invest them
hem for the event . It took off the extreme edge of their wonder ; and so what 
t been descried . Likewise upon the extreme stern of the boat where it was also
mes over a man only in some time of extreme tribulation ; it comes in the very 
, both by night and by day , and so extreme was the hard work they underwent , 
the leaded chocks or grooves in the extreme pointed prow of the boat , where a 
re now to consider that only in the extreme , lower , backward sloping part of 
s . His motions plainly denoted his extreme exhaustion . In most land animals t
ntly rocking , jerking boat , under extreme headway . Steel and wood included ,
rced his groin ; nor was it without extreme difficulty that the agonizing wound
' ll heave 

In [None]:
# Check concordance of word "huge" in tholmes
tholmes.concordance("huge")

Displaying 25 of 30 matches:
close behind some promontory lie The huge Leviathan to attend their prey , And
. HARRIS COLL . " Here they saw such huge troops of whales , that they were fo
 mummies of those creatures in their huge bake - houses the pyramids . No , wh
being , it seems , for some reason a huge favourite with them , they raised a 
chbowl ;-- taking it I suppose for a huge finger - glass . " Now ," said Queeq
glittering in the clear , cold air . Huge hills and mountains of casks on cask
feet high ; consisting of the long , huge slabs of limber black bone taken fro
 rising solemnly and fumbling in the huge pockets of his broad - skirted drab 
d like the white ivory tusks of some huge elephant , vast curving icicles depe
inctly recognised a peculiar sort of huge mole under the whale ' s eye , which
e thick mists were dimly parted by a huge , vague form . Affrighted , we all s
ng ," make out one whit better . The huge corpulence of that Hogarthian monste
r on their backs as the

In [None]:
# Download the stopwords list, then show frequent word pairs (collocations) in text3
import nltk
nltk.download('stopwords')
text3.collocations()

[nltk_data] Downloading package stopwords to /root/nltk_data...


said unto; pray thee; thou shalt; thou hast; thy seed; years old;
spake unto; thou art; LORD God; every living; God hath; begat sons;
seven years; shalt thou; little ones; living creature; creeping thing;
savoury meat; thirty years; every beast


[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Show collocations in text4
text4.collocations()

United States; fellow citizens; years ago; four years; Federal
Government; General Government; Vice President; American people; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations
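
The outputs above list word pairs that tend to occur together. A much simplified version of that idea is sketched below: it counts adjacent word pairs and skips any pair containing a stopword. Ranking by raw frequency is a simplification adopted here; NLTK's collocations() is understood to rank pairs with a statistical association score instead. The function name `frequent_bigrams`, the tiny stopword set, and the sample sentence are all made up for this illustration.

```python
from collections import Counter


def frequent_bigrams(tokens, stopwords, top_n=3):
    """Count adjacent word pairs, skipping pairs with stopwords or non-words."""
    pairs = Counter()
    for first, second in zip(tokens, tokens[1:]):
        if not (first.isalpha() and second.isalpha()):
            continue
        if first.lower() in stopwords or second.lower() in stopwords:
            continue
        pairs[(first, second)] += 1
    return [" ".join(pair) for pair, _ in pairs.most_common(top_n)]


sample = ("the United States and the American people thank the "
          "United States and fellow citizens of the United States").split()
stopwords = {"the", "and", "of"}

print(frequent_bigrams(sample, stopwords, top_n=1))
```

In the sample, "United States" occurs three times and every other surviving pair occurs once, so it is the top-ranked pair.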
