<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/02_intro_to_NLTK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting started with NLTK


Sections 1.1 and 1.2 of Chapter 1 in the NLTK Book explain how to install Python and NLTK on your own computer. If you are using Google Colaboratory (Colab), you can skip those steps. You are still encouraged to read these sections to understand how Python and NLTK work outside of Google Colab.

*A few other notes*
- Google Colab does not have the `>>>` prompt mentioned in the NLTK book - code cells in Colab take its place.
- The `generate()` function will not work.
- Many other small things are likely not to work because of updates to Python/NLTK and/or the use of Google Colab, such as plotting and other functions which come later in the book.

## Accessing NLTK on Google Colab

- NLTK is pre-installed in Google Colab, but it relies on many additional resources which we need to download. In section 1.2 of Chapter 1, the NLTK book explains that to access these resources one should use `nltk.download()`. We will use the same function but will not see the graphical downloader shown in the book.

- For instance, one of the very first lessons in NLTK section 1.2 asks you to use `from nltk.book import *`, which means import everything from `nltk.book` (the `*` means everything). However, you will get an error if you try this in Colab because the `book` data has not yet been downloaded.

- When you do not have the right resource to run a particular NLTK function, you will see an error which looks like this:
>> ![error](https://drive.google.com/uc?id=1nt76M0KbiLueTYHb72HSFnREC9_1DsM4)

- If you see this error, don't panic! It just means you are missing a specific resource. In this example, the part highlighted in yellow names the missing resource, which here is `stopwords`. All you need to do is ask Colab to download the resource using the `nltk.download()` function. Because the Colab notebook runs on a temporary server, you will need to repeat this each time you connect to a new session. Fortunately, the data does not take long to download.

You can specify which resources you need by passing each resource as a `string` inside a single `list` (i.e., inside square brackets `[]`). In the example below, I include a wide range of specific resources which you will need for the first chapter of NLTK. 

```
# import the main nltk module
import nltk

# define a list of resources and save to variable
nltk_resources = ['gutenberg', 'genesis', 'inaugural', 'nps_chat', 'webtext',
                  'treebank', 'stopwords', 'punkt', 'brown', 'reuters', 'udhr',
                  'words', 'names', 'cmudict', 'swadesh', 'wordnet', 'state_union']

# Pass the list to nltk.download(), which will then download each resource
nltk.download(nltk_resources)
```

Once you have sorted out your ability to access NLTK resources, you are ready to go through the rest of the notebook lessons. 

In [1]:
# import the main nltk module
import nltk

# create a list of resources we will need for this notebook
nltk_resources = ['gutenberg', 'genesis', 'inaugural', 'nps_chat', 'webtext', 'treebank', 'stopwords', 'punkt', 'brown', 'reuters', 'udhr', 'words', 'names', 'cmudict', 'swadesh', 'wordnet', 'state_union']

# download them
nltk.download(nltk_resources)

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/sskalicky/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package genesis to
[nltk_data]     /Users/sskalicky/nltk_data...
[nltk_data]   Package genesis is already up-to-date!
[nltk_data] Downloading package inaugural to
[nltk_data]     /Users/sskalicky/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
[nltk_data] Downloading package nps_chat to
[nltk_data]     /Users/sskalicky/nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!
[nltk_data] Downloading package webtext to
[nltk_data]     /Users/sskalicky/nltk_data...
[nltk_data]   Package webtext is already up-to-date!
[nltk_data] Downloading package treebank to
[nltk_data]     /Users/sskalicky/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sskalicky/nltk_data...
[nltk_data]   Package stopwords is already up

True

# Searching Text

Chapter 1 asks you to load a series of books and corpora which are stored in NLTK as examples. Take a moment to look at the names of the files - some are single books and movie scripts, while others are corpora. What is the difference between the two? A book is a single stand-alone text, whereas a corpus is a collection of related texts or documents, which may be longer or shorter than a single book.

In [2]:
# import everything from nltk.book
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Concordances

NLTK starts off with concordances, a method from corpus linguistics for analysing how words function in texts. Concordances are fundamental to corpus analysis because they let you see a single word or pattern of words **in the context** of the surrounding words.
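
To make the idea concrete, here is a minimal sketch of what a concordance does, written in plain Python rather than with NLTK. The function name `concordance_lines`, the `width` parameter, and the toy sentence are all invented for illustration; this is not how NLTK implements `.concordance()`, only a simplified picture of the same idea.

```python
def concordance_lines(tokens, target, width=3):
    """Return (left context, match, right context) for each occurrence of target.

    Hypothetical helper for illustration only. Matching ignores case,
    and `width` is the number of tokens kept on each side.
    """
    target = target.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# a toy "text", already split into tokens
tokens = "the big red truck passed the big red apple stand".split()

for left, match, right in concordance_lines(tokens, "red", width=2):
    print(f"{left:>10} [{match}] {right}")
```

Each printed line shows the search word with its neighbours on either side, which is the same "word in context" view that the concordance output in this notebook gives you.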

The example in the NLTK book is to search for the word `monstrous` in *Moby Dick*. 

Here are some other concordance searches I ran, based on my own guesses about what you might find in the different corpora.

In [3]:
# search for "color" in Holy Grail
text6.concordance('color')

Displaying 2 of 2 matches:
BRIDGEKEEPER : What is your favorite color ? LAUNCELOT : Blue . BRIDGEKEEPER : 
BRIDGEKEEPER : What is your favorite color ? GALAHAD : Blue . No yel -- auuuuuu


In [4]:
# search for "like" in webchat
text5.concordance('like')

Displaying 25 of 160 matches:
OIN what did you but on e-bay i feel like im in the wrong room yeee haw U30 im
1 . . . but appearently she does not like that . cya later guys single white m
!! answers for U139 ... hi U101 ;) I like it when you do it , U83 iamahotnipwi
how are you ? that sounds freakishly like dr seuss wow . twice , I 'm impresse
you asshole ! U115 can i pm you ? is like no it 's not U12 , you shut yo mouth
y its election season ooo a scorpion like always U7 and yours same well i was 
up lol JOIN shakin that 's the way I like it oh ya < sorry dont drink its k < 
 < always 4.20 at my house ya PART i like mine shook over ice why so mad U19 <
T lol U19 wooooohoooo with an answer like that ... nope .... lol ha maybe not 
se it is all good U17 call me what u like just not late for date or dinner lol
ometime ... lmao lol with party hair like that .... knockin down the door I be
m ... lol had a girlfriend with hair like you U9 howdy doody and what happened
going to be disappoint

In [5]:
# search for 'looking' in the personals corpus
text8.concordance('looking') 

Displaying 25 of 34 matches:
 for a meaningful long term rship . Looking forward to hearing from you all . A
/ Med build , GSOH , high needs and looking for someone similar . You WONT be d
OPPER REDHEAD ? I am 36 y . o . and looking for companionship / friendship . I 
out for meal & c . Fun to be with . Looking for a com panion , aged between 35 
up to a special person in my life . Looking for a caring , honest lady for frie
elationship . BUSINESSMAN 60 '' ish Looking for lady , non - smoker , 56 or tal
ing and music . Love my 2 kids . Am looking for a lady with similar interests ,
 eyes , brown hair , Mid 20s , good looking , honest n / smoker . Likes movies 
 important , nor distance / speed . Looking for someone to complete my social c
endant , under standing , mid 50s , looking for classy lady who wants to retain
till retain her independance and is looking for a special private relationship 
 with 2 teen daughters , 46 y . o . Looking for a special lady FIT & HEALTHY 60
MATURE GENT

# Comparing words

What you probably noticed is that words are used in specific ways in different corpora. This relatively simple analysis has already told us something about the way language works: context determines the way words are used and understood. Corpus linguistic analysis is thus a crucial way to gain a better understanding of word meanings and language use.


The `.similar()` and `.common_contexts()` methods allow you to find words that are used in the same contexts, meaning words which occur before/after the same other words. For example, in the following two phrases the words "truck" and "apple" are similar because they both occur after the word "red":

- A big red truck
- A big red apple
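
The sketch below shows one way this idea could be computed in plain Python. It is a simplified illustration under my own assumptions, not NLTK's implementation: the helper names `build_contexts` and `similar_words`, the `<START>`/`<END>` padding markers, and the toy sentence are all invented. Here a "context" is the pair of words immediately before and after a word.

```python
from collections import defaultdict


def build_contexts(tokens):
    """Map each word to the set of (previous word, next word) pairs around it."""
    # pad the text so the first and last words also have a context
    padded = ["<START>"] + [t.lower() for t in tokens] + ["<END>"]
    contexts = defaultdict(set)
    for prev, word, nxt in zip(padded, padded[1:], padded[2:]):
        contexts[word].add((prev, nxt))
    return contexts


def similar_words(tokens, target):
    """Return words that share at least one context with target (alphabetical)."""
    contexts = build_contexts(tokens)
    target = target.lower()
    target_contexts = contexts.get(target, set())
    return sorted(
        word
        for word, word_contexts in contexts.items()
        if word != target and word_contexts & target_contexts
    )


# the two example phrases, with a full stop after each
tokens = "a big red truck . a big red apple .".split()
print(similar_words(tokens, "truck"))
```

In this toy text "truck" and "apple" both sit between "red" and a full stop, so each counts as similar to the other. Unlike this sketch, NLTK's `.similar()` also orders its results rather than listing them alphabetically.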


Try testing the word "hello" using the `.similar()` method in different corpora. The output that you see represents words that are used in similar contexts as your input word.


In [6]:
# which words occur in the same contexts as 'hello' in Monty Python
text6.similar('hello')

what shit sir


In [7]:
# and what about in the webchat corpus?
text5.similar('hello')

hi part lol join hey and hiya right all m there what one too where wb
hugs if nite cool


The `.common_contexts()` method allows you to compare two words in the same text. The book uses the example of testing `monstrous` and `very` in *Moby Dick*, and then asks you to pick two words of your own and compare them. If you're like me, you will often get no results because the words you searched for do not occur in the corpus or do not share any contexts - which shows how knowledge of a corpus helps predict which words it is likely to contain.
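
The output of `.common_contexts()` can look cryptic at first: each item has the form `before_after`, joining the word that comes before and the word that comes after the two words being compared. The sketch below imitates that format in plain Python. The function name `shared_contexts` and the toy sentence are invented for illustration, and the sketch skips words at the very start or end of the text for simplicity; it is not NLTK's own code.

```python
def shared_contexts(tokens, first, second):
    """Return the 'before_after' contexts in which both words occur."""
    lowered = [t.lower() for t in tokens]

    def contexts_of(word):
        # collect (word before, word after) for every occurrence of word
        word = word.lower()
        return {
            (lowered[i - 1], lowered[i + 1])
            for i in range(1, len(lowered) - 1)
            if lowered[i] == word
        }

    common = contexts_of(first) & contexts_of(second)
    return sorted(f"{before}_{after}" for before, after in common)


# a toy "chat" text, already split into tokens
tokens = "she said hi to me then she said bye to me".split()
print(shared_contexts(tokens, "hi", "bye"))
```

Here both "hi" and "bye" occur between "said" and "to", so the one shared context is written as `said_to`. If two words never appear between the same pair of neighbours, the list is empty.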

In [8]:
# do hello and goodbye have similar contexts in the webchat corpus?
text5.common_contexts(['hello', 'goodbye'])

No common contexts were found


In [9]:
# what about 'hi' and 'bye' in the webchat corpus? 
text5.common_contexts(['hi', 'bye'])

hi_hi part_part part_hey hi_i waves_to


In [10]:
# how about "hi" and "bye" in the personals?
text8.common_contexts(['hi', 'bye'])

('The following word(s) were not found:', 'hi bye')


In [11]:
# these words in moby dick?
text1.common_contexts(['white', 'whale'])

the_- a_- the_, -_, his_- and_- the_head ,_,


If we want to make sure the words are actually in the corpus before running the method, we can use Python's `in` operator to check the result of the corpus's `.vocab()` method, which is a frequency distribution keyed by the words in the corpus.

In [12]:
# check if a word is in the corpus text
'like' in text5.vocab()

True

In [13]:
'knight' in text6.vocab()

True
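
The same check can be wrapped in a small helper that reports which words are missing before you compare them. The sketch below is only an illustration: the function name `missing_words` is invented, and a plain `collections.Counter` built from a toy word list stands in for the result of `.vocab()` (NLTK's frequency distribution is built on `Counter`, so the `in` test behaves the same way).

```python
from collections import Counter


def missing_words(vocab, words):
    """Return the words that do not appear in vocab (the check is case-sensitive)."""
    return [word for word in words if word not in vocab]


# stand-in for something like text5.vocab(); these counts are made up
vocab = Counter("hi hi bye Hello part join hi".split())

print(missing_words(vocab, ["hi", "bye"]))
print(missing_words(vocab, ["hello", "goodbye"]))
```

Note that "hello" is reported as missing even though "Hello" is in the toy vocabulary, because the `in` test distinguishes upper and lower case. With a real text you could pass the result of `.vocab()` as the first argument and only call `.common_contexts()` when nothing is missing.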

### **Your Turn**

1. Spend a few moments using the `.concordance()` method to search for different words in the texts. See if you can find any interesting examples and share them with the class.

2. After looking through some concordances, play with the `.similar()` and `.common_contexts()` methods to see if you can find words used in similar contexts.

3. What can this analysis tell us, if anything, about the nature of the different texts? 

