<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/12_NLTK_corpus_resources.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLTK Text Corpora

The first section of Chapter 2 in NLTK teaches you how to explore the built-in corpora provided by NLTK. It is important to note that some of the examples used in Chapter 01 of NLTK were pedagogical: the authors have provided us with corpora and texts that have already been pre-processed in various ways, or otherwise simplified. While a corpus *is* a large collection of language and documents, a corpus usually also contains metadata telling the user about different categories *within* a corpus. Categories can be anything: genre, speaker, task, etc.

In this notebook, we will explore some of the other corpus resources available through the NLTK.

We will also look at how you can load your own data into Colab to create your own corpus using NLTK functions.

Run the following cell to load in the required resources for this notebook.



In [None]:
# import the NLTK library
import nltk

# download resources for this notebook (takes a bit)
# 'punkt_tab' is included because newer NLTK releases look for it when tokenizing
nltk_resources = ['gutenberg', 'punkt', 'punkt_tab', 'brown', 'state_union']

nltk.download(nltk_resources)

## The Gutenberg Corpus

A good example of a corpus with different categories is the Gutenberg corpus used by NLTK, which is a collection of different public domain books. [Project Gutenberg](https://www.gutenberg.org/) is a website containing thousands of free eBooks, and is named after [Johannes Gutenberg](https://en.wikipedia.org/wiki/Johannes_Gutenberg), associated with the development of the printing press.

<img src = https://i.imgur.com/skJBrKl.png height = '200'>


The Gutenberg corpus is part of the `nltk.corpus` module, which provides several built-in methods for working with text data. You can see a complete list of the methods here: [NLTK corpus package](https://www.nltk.org/api/nltk.corpus.html). You can also see a breakdown in Table 1.3 in Chapter 02 of the NLTK book.

NLTK includes just a small set of books from Project Gutenberg. We access the Gutenberg data using `nltk.corpus.gutenberg` followed by different NLTK functions and methods.

To see the list of all of the files, we use the `.fileids()` method. The filenames are in the form of `author-title.txt`.

In [None]:
# inspect the different files in the gutenberg corpus - have you read any of these books?
nltk.corpus.gutenberg.fileids()

Note that the `.fileids()` represent some metadata: the file id contains both the author and the name of the text. So, in this corpus, there are different texts grouped by different authors. As such, authors represent the categories of this corpus.


Chapter 02 introduces you to the methods somewhat implicitly, but it is good to look through all the possibilities. Perhaps the most basic possibility is to use the `.raw()` method to obtain a view of the raw text file.

Note that we need to input the fileid of the text we are interested in.


In [None]:
# we can select a single book using the book's name and the format we want, such as words or sentences.
# We use `raw` to get the raw text file (as a string)
macbeth = nltk.corpus.gutenberg.raw('shakespeare-macbeth.txt')

# look at the entire text file.
macbeth

There are different methods for sorting the corpus into words and sentences:

In [None]:
# the .words method returns words as tokens
macbeth_words = nltk.corpus.gutenberg.words('shakespeare-macbeth.txt')
macbeth_words[0:10]

In [None]:
# could we do this on our own?
# why do we get different results when we manually split the text?
# do you think the .words is using .split(), or nltk.word_tokenize()?
nltk.corpus.gutenberg.raw('shakespeare-macbeth.txt').split()[:10]
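One way to see why the two results differ is to run both approaches on the same short string. The sketch below is only an illustration: the sample line is made up for the demo, and `WordPunctTokenizer` is used as an example of a punctuation-aware tokenizer that needs no extra downloads. It is not meant as a statement about what `.words()` does internally.

```python
from nltk.tokenize import WordPunctTokenizer

# a made-up sample line, standing in for the raw text of a book
sample = "When shall we three meet again? In thunder, lightning, or in rain?"

# str.split() only breaks on whitespace, so punctuation stays glued to words
split_tokens = sample.split()

# WordPunctTokenizer separates runs of letters/digits from runs of punctuation
punct_tokens = WordPunctTokenizer().tokenize(sample)

print(split_tokens[:6])
print(punct_tokens[:7])
```

Comparing the two printed lists should show where the punctuation ends up in each case.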

In [None]:
# use .sents to get the sentences
macbeth_sents = nltk.corpus.gutenberg.sents('shakespeare-macbeth.txt')
macbeth_sents[0:2]

Chapter 02 also provides a discussion about how to import Python modules into a shorthand, which saves typing. This helps provide some clarity into why many Python scripts and examples start with lines such as `from x import y as z` — this just means import something and give it a shorter name.

For the current example, we can import `gutenberg` directly, and thus can avoid typing the `nltk.corpus` bit before it. And, you can do this same thing with other functions from other packages / modules.

In [None]:
# import the gutenberg package directly
from nltk.corpus import gutenberg

# now you can access `gutenberg` without needing to type `nltk.corpus`
gutenberg.fileids()
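The same import patterns work for any Python module. As a small sketch using only the standard library (the alias `C` is an arbitrary name chosen for this example), the three forms below all reach the same class:

```python
# three equivalent ways to reach the same class in the standard library
import collections
from collections import Counter
from collections import Counter as C  # `C` is just a shorter local name

letters = "banana"

print(collections.Counter(letters))
print(Counter(letters))
print(C(letters))
```

All three lines print the same counts, because the names point at one and the same object.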

## Looping through Gutenberg

Chapter 02 includes a demonstration of the different methods NLTK can provide for a raw text by asking you to think about how the following code works.

First, look at how the loop works. The loop iterates over `gutenberg.fileids()`, which is just a list of the different text file names. In the body of the loop, the fileid is used to access the specific text - this is why the iterator variable `fileid` is placed inside the brackets for `gutenberg.raw()` and all the other functions (`.words`, `.sents`).

Examine the print statement and the output - do you understand how they have controlled the output using this `print()` statement? You might find it useful to include some comments in the code cell below, explaining what each line is doing.

The book also claims there are some patterns related to average word length, average sentence length, and lexical diversity for different authors. Do you see these patterns? What uses could this information provide?

In [None]:
# Can you add comments to this code explaining what each line is doing?
for fileid in gutenberg.fileids():
  num_chars = len(gutenberg.raw(fileid))
  num_words = len(gutenberg.words(fileid))
  num_sents = len(gutenberg.sents(fileid))
  num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
  print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)
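To check your understanding of the three numbers, it can help to compute them on a text small enough to verify by hand. The sketch below wraps the same arithmetic in a hypothetical helper, `text_stats` (not part of NLTK), and feeds it a tiny made-up text that has already been split into words and sentences.

```python
def text_stats(raw, words, sents):
    """Return (average word length, average sentence length, lexical diversity score)."""
    num_vocab = len(set(w.lower() for w in words))
    return (
        round(len(raw) / len(words)),    # characters per word
        round(len(words) / len(sents)),  # words per sentence
        round(len(words) / num_vocab),   # tokens per distinct word
    )

# a tiny made-up text, with punctuation kept as separate tokens
raw = "The cat sat. The dog sat."
sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "."]]
words = [w for s in sents for w in s]

stats = text_stats(raw, words, sents)
print(stats)  # (3, 4, 2)
```

Here there are 25 characters, 8 tokens, 2 sentences, and 5 distinct lowercased tokens, which gives 25/8, 8/2, and 8/5 before rounding.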

## The Brown Corpus

You should read through the explanation of various corpora in Chapter 02. One of the more well-known corpora is the [Brown Corpus](https://en.wikipedia.org/wiki/Brown_Corpus), which has been used in numerous studies of language either for checking word use and collocations, or for compiling frequency statistics of words in written American English. As Chapter 02 explains, the Brown Corpus is an ideal example of how one can use different categories in a corpus to test questions about differences in language use.

We load in Brown in a similar manner to Gutenberg. The Brown corpus has a function that Gutenberg did not have: `.categories()`. This function shows different genres or classifications of texts in the Brown corpus.

In [None]:
# load in Brown and look at the categories
from nltk.corpus import brown

brown.categories()

You can select a specific subsection of the brown corpus using `categories = ` when accessing the brown text as raw, words, sentences, etc.

In [None]:
# look at a sentence from one of the texts labelled as "humor" in the Brown corpus
brown.sents(categories = 'humor')[80]

### Comparing language among different genres

The different genres or categories in Brown provide a means to compare different writing styles. The NLTK book provides an analysis of [modal verbs](https://en.wikipedia.org/wiki/Modal_verb) as an example.

Modal verbs are auxiliary verbs which add a level of certainty, possibility, or urgency to a main verb. Examples include *must*, *will*, and *could*.

In the following example, the authors of NLTK create a frequency distribution for the `news` category of Brown using the `nltk.FreqDist()` function, and then look up the counts for a set of modal verbs.

Note how they do this. First they define a list of modal verbs, so they can provide the program with the targets it is trying to find. Then they save the words of the `news` category to a new variable, `news_text`. Then they create a frequency distribution from a generator expression which lowercases each word (i.e., pre-processes it). This distribution counts *every* word in the category, not just the modals. The final loop then looks up and prints the count for each word in the list of modals, so any word *not* in that list is simply never reported.

Run the cell and then ask yourself, what do you think about the frequency of modal verbs in the `news` category - does it make sense that `will` is the most frequent modal verb for news?


In [None]:
# create a frequency distribution for specific modal verbs

# define list of modal verbs
modals = ['can', 'could', 'may', 'might', 'must', 'will']

# create an object of words which occur in the news category of brown
news_text = brown.words(categories = 'news')

# create a frequency distribution - does it make sense for there to be a .lower() here?
fdist = nltk.FreqDist(w.lower() for w in news_text)

# loop through each modal and print the fdist
for m in modals:
  print(m + ':', fdist[m], end = ' ') # the end argument replaces the default newline which comes at the end of a print statement
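To compare several categories at once rather than one at a time, one option is `nltk.ConditionalFreqDist`, which keeps a separate frequency distribution per condition. The sketch below is a minimal illustration on toy sentences invented for this example, so it runs without downloading anything; with Brown you would presumably use `brown.words(categories=genre)` for each genre instead.

```python
import nltk

# toy data invented for this sketch, not taken from the Brown corpus
toy_genres = {
    'news': "The council will meet and the mayor will speak".split(),
    'romance': "She could stay or she could go if she might".split(),
}
modals = ['can', 'could', 'may', 'might', 'must', 'will']

# one (condition, word) pair per token; the condition is the genre
cfd = nltk.ConditionalFreqDist(
    (genre, word.lower())
    for genre, words in toy_genres.items()
    for word in words
)

# report the count of each modal within each genre
for genre in toy_genres:
    counts = [f'{m}: {cfd[genre][m]}' for m in modals]
    print(genre, ' '.join(counts))
```

Modals that never occur in a genre are reported with a count of 0, which makes the rows easy to compare.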

## **Your Turn**

Have a play with the Brown corpus, explore the different categories and make sure you are comfortable loading them into Python.

## State of the Union Corpus

Another corpus which provides us some interesting data to use for various comparisons is a set of speeches given by US presidents over the years. Each year, the sitting president gives a "state of the union" speech which explains how everything they have done is good and how everything their opponents want to do is bad. Because most US Presidents will serve four or eight consecutive years in office, this provides a neat way to compare the speech of different US Presidents over the years.

We load in the corpus just like Gutenberg and Brown.



In [None]:
# Load in the state_union corpus
from nltk.corpus import state_union

# fileids shows us the different files
state_union.fileids()

Much like Gutenberg, we can access various properties of specific speeches by using the `.raw()`, `.words()`, or `.sents()` methods with a specific fileid in the brackets:

In [None]:
# Raw text of 1945 speech
state_union.raw('1945-Truman.txt')

In [None]:
# tokenized version (truncated output)
state_union.words('1945-Truman.txt')

In [None]:
# sentences (truncated output)
state_union.sents('1945-Truman.txt')

We could ask a variety of questions about the nature of State of the Union addresses as they have occurred over time.

- Have they changed in length?
- What words are similar among all presidents?
- What words are unique to different presidents?
- Which president is the most lexically diverse?
- and so on...

Do you have an idea about how to do this? It probably involves some sort of looping over the fileids and then performing a function. For example, I'll write a for loop which reports the most frequent word from each speech. Run the cell and then inspect the results. What could you do to get more interesting results? And, what does this output say about the distribution of words in the English language?


In [None]:
for fileid in state_union.fileids():
  most_frequent_word = nltk.FreqDist(state_union.words(fileid)).most_common(1)
  print(f'{fileid} most frequent word is: \t {most_frequent_word}')
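One possible way to get more interesting results is to drop punctuation and very common function words before counting. The sketch below shows the idea with a hypothetical helper, `most_frequent_content_word`, a tiny hand-written stopword set, and made-up tokens standing in for `state_union.words(fileid)`. A real analysis would need a much fuller stopword list.

```python
from nltk import FreqDist

# a tiny hand-written stopword list, just for this sketch
stopwords = {'the', 'of', 'and', 'to', 'a', 'in', 'we', 'our'}

def most_frequent_content_word(tokens, stopwords):
    """Return the most common (word, count) pair after filtering."""
    content = [
        t.lower() for t in tokens
        if t.isalpha() and t.lower() not in stopwords  # skip punctuation and stopwords
    ]
    return FreqDist(content).most_common(1)

# made-up tokens standing in for the words of one speech
tokens = ['The', 'peace', 'of', 'the', 'world', ',', 'and', 'the', 'peace', 'we', 'seek', '.']
top_word = most_frequent_content_word(tokens, stopwords)
print(top_word)  # [('peace', 2)]
```

Inside the loop over `state_union.fileids()`, the same helper could be called on the words of each speech.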

## Other NLTK Corpora

There are a number of other NLTK corpora explained in Chapter 2 of the NLTK book. I encourage you to read them on your own and have a think about what they might be used for, and we will look at some of them in more depth later on.

# Using your own data - mounting Google Drive
As useful and interesting as the NLTK data is, you will eventually want to load in your own data. One way to do so involves connecting Google Colab with your Google Drive.

The process of connecting Colab to your Google Drive is known as mounting your drive. To do so, you click on the folder icon on the left side of the Colab page:

<img src = https://i.imgur.com/82Wedue.png>

Then you click the "mount drive" icon in the next menu:

<img src = https://i.imgur.com/d8DxFIu.png>

Colab should then automatically add a code cell like this:

<img src = https://i.imgur.com/ttfUkwi.png>

Run the cell to mount your Google Drive. You will most likely see several permissions prompts asking you if it's okay to make this connection with the associated Google account. It's fine to do this with notebooks you make or the ones I give you, but be wary of other notebooks that might try to ask for your account permissions. There is likely no big risk but I feel obligated to tell you that you should not blindly trust any other Colab notebooks you might come across.



## Accessing files in your Google Drive

Now that your Drive is connected, you can directly access files in your Google Drive account. This is very handy. (You might need to click the refresh button (the folder with the circle arrow) to see the new folder).

You should see a new folder on the left side menu (after clicking on the folder icon) called `drive`. Clicking that folder should then reveal a subfolder called `MyDrive`. The `MyDrive` folder is the root folder for your Google Drive.

<img src = https://i.imgur.com/Av1mGtQ.png>




In order to access files on your drive, you will need to be able to give Python the full filepath to your files. No matter where your files are, the start of your filepath will always be `/content/drive/MyDrive/...`, where the `...` are any additional folders.

So, for example, if you had a file called `mydata.txt` located in the base level of your Google Drive, the filepath location would be `/content/drive/MyDrive/mydata.txt`. If you had that same file located in a folder called `mydata`, the filepath would be `/content/drive/MyDrive/mydata/mydata.txt`, and so on.
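If you would rather build these paths in code than type them out, the standard library's `pathlib` can join the pieces for you. This is just a sketch of the two example paths described above. `PurePosixPath` is used so that the result looks the same on any machine, and no file is actually opened.

```python
from pathlib import PurePosixPath

# the root of a mounted Google Drive inside Colab
drive_root = PurePosixPath('/content/drive/MyDrive')

# a file at the base level of the drive
base_level = drive_root / 'mydata.txt'

# the same file inside a folder called `mydata`
in_folder = drive_root / 'mydata' / 'mydata.txt'

print(base_level)  # /content/drive/MyDrive/mydata.txt
print(in_folder)   # /content/drive/MyDrive/mydata/mydata.txt
```

Wrapping a path in `str()` gives the plain string form that the rest of this notebook uses.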

### Practice uploading a file from your drive

Go [here](https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/sample-texts/marine_biologist.txt), where you should see a page of text. Manually copy and paste the text into a text editor program, such as Notepad on Windows or TextEdit on Mac (don't use Microsoft Word). Save the file as a `.txt` file to your Google Drive folder and name it `marine_biologist.txt`.

Once you've done that, you should be able to read the text into Colab using the following cell.

The code uses the Python `open()` function, which, well, opens files! We need to use the `.read()` method at the end of open to return the contents of the file, which in this case is a string of the raw text in the `.txt` file.

In [None]:
marine_biologist = open('/content/drive/MyDrive/marine_biologist.txt').read()

# a random quote from this text
marine_biologist[15041:15135]

If you are having trouble with this step, make sure you are saving the file to your base level folder in Google Drive. Also make sure your drive is mounted, and that you have saved the file using the same filename I used above. If all else fails, please reach out for some help, because being able to access texts on your Google drive will be an important step for a lot of you in order to read in text data. Of course, if you're comfortable using Jupyter on your local machine, you're under no requirement/obligation to use Google drive to store your files.
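When troubleshooting, it can help to check whether the path exists before trying to read it. The following is a minimal sketch rather than a required step: `read_text` is a hypothetical helper name, and the demo writes a temporary file so that it runs even without a mounted drive.

```python
import os
import tempfile

def read_text(path):
    """Read a text file, with a clearer error if the path is wrong."""
    if not os.path.exists(path):
        raise FileNotFoundError(
            f'No file at {path}: is the drive mounted and the name spelled the same?'
        )
    # the `with` block closes the file once it has been read
    with open(path, encoding='utf-8') as f:
        return f.read()

# demo on a temporary file, so the sketch runs without Google Drive
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('A short line of sample text.')

demo_text = read_text(demo_path)
print(demo_text[:7])  # A short
```

With a mounted drive, the same helper could be pointed at `/content/drive/MyDrive/marine_biologist.txt`.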

### Working with a read file

Once you've loaded the file in, you can perform all of the same operations on it as we have been doing on strings we've typed as well as the built-in data included with NLTK. You should be familiar with the following code at this point — are you able to leave comments explaining what each line is doing?

In [None]:
marine_tokens = nltk.word_tokenize(marine_biologist.lower())

marine_fdist = nltk.FreqDist(marine_tokens)

marine_fdist.most_common(10)

## Using `!wget` and other methods

Using Google Drive is a solid bet for integrating with Colab, but you might not like mounting drives each time you run a notebook, or working with files in your drive.

There are other options which involve reading files directly from the internet, using other functions such as `!wget` or Python libraries for requesting data from URLs, such as the `requests` library.

Using these methods requires that the data already exists on the internet somewhere, and also exists at a URL you can access. GitHub is a nice place to store data, as are plenty of other public links. Therefore, I only recommend using this method if you are able to control the place where your data lives - and it might just be easier for you to use Google Drive if you don't want to go that route.

The main benefit of using `!wget` is that the data is loaded directly into the notebook environment, so you would not need to muck around with sifting through files on the Google Drive.

In [None]:
# using !wget to load a file into the notebook environment
!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/sample-texts/tmoom.txt'

Instead of pointing at `/content/drive/MyDrive/...`, you just point at `/content/...`

You need to use the appropriate method to open the file, such as using `open()` to open a text file:

In [None]:
# read in the text
tmoom = open('tmoom.txt').read()

# split into tokens
tmoom_tokens = nltk.word_tokenize(tmoom)

# look at the first ten tokens!
tmoom_tokens[:10]
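For the Python-library route mentioned in this section, here is a minimal sketch. Note that it uses the standard library's `urllib.request` rather than `requests`, so that it has no extra dependency. `fetch_text` is a hypothetical helper name, and the demo reads a temporary local file through a `file://` URL so that it runs offline.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

def fetch_text(url, encoding='utf-8'):
    """Download the resource at `url` and decode it to a string."""
    with urlopen(url) as response:
        return response.read().decode(encoding)

# offline demo: a local file exposed through a file:// URL
demo_path = Path(tempfile.mkdtemp()) / 'demo.txt'
demo_path.write_text('hello from a url', encoding='utf-8')

demo_text = fetch_text(demo_path.as_uri())
print(demo_text)  # hello from a url
```

Passing a public `https://` address to `fetch_text`, such as the raw GitHub URL used with `!wget` above, should return the text as a string without saving a file to the notebook environment.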

## Creating your own NLTK corpus

Even when you use your own data, you can still use NLTK functions to create an NLTK corpus. There are two ways we can do this; one is to read in a bunch of texts as one single corpus. To do this, we use the `PlaintextCorpusReader` class from NLTK.

In order to use it, we need three things: 1. some files, 2. a filepath which leads to files, and 3. the names of the files.

Again, please follow along. Please go [here](https://github.com/scskalicky/LING-226-vuw/blob/main/other-data/seinfeld.zip) and click the "download" button to download a compressed file containing several scripts from Seinfeld. Download the file, unzip it, and save the folder to your base Google Drive folder. Your files should be located in `/content/drive/MyDrive/seinfeld`. This will be the filepath we feed to the NLTK corpus reader. Let's go ahead and save that to a variable so we only need to type it once:

In [None]:
corpus_root = '/content/drive/MyDrive/seinfeld'

Next, we'll load in the corpus reader from NLTK

In [None]:
# import the module to read in plain text
from nltk.corpus import PlaintextCorpusReader

Now, we need to create a new variable from the `PlaintextCorpusReader`. We need to put the path to the files as the first argument, followed by a list of the file names we want to be included in the corpus. The files in the folder are:

```
THE BOYFRIEND PT 1_cleaned.txt
THE BOYFRIEND PT 2_cleaned.txt
THE CHINESE RESTAURANT_cleaned.txt
THE DEALERSHIP_cleaned.txt
THE DOODLE_cleaned.txt
THE ENGLISH PATIENT_cleaned.txt
THE FACE PAINTER_cleaned.txt
THE GOOD SAMARITAN_cleaned.txt
THE JUNIOR MINT_cleaned.txt
THE LITTLE KICKS_cleaned.txt
THE MARINE BIOLOGIST_cleaned.txt
THE PARKING GARAGE_cleaned.txt
THE PARKING SPACE_cleaned.txt
THE PEZ DISPENSER_cleaned.txt
```

Let's try it out on a single file to start. Hey look, the marine biologist episode is in here, so we can try that again.

In [None]:
# read in my text (I've passed the name in a list, so I could include more than one text if I need to later)
marine_biologist_corpus = PlaintextCorpusReader(corpus_root, ['THE MARINE BIOLOGIST_cleaned.txt'])

Now that we've loaded a corpus (even if it is just one text), we can use the built-in NLTK corpus functions.

In [None]:
# The raw version should be just the string
# note we get the exact same output here as when we read the text in manually, above.
marine_biologist_corpus.raw()[15041:15135]

In [None]:
# we can also get sentences
marine_biologist_corpus.sents()
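If you want to try the three ingredients (some files, a path to them, and their names) without touching Google Drive, the sketch below builds a throwaway corpus from two tiny made-up texts in a temporary folder. The file names and contents are invented for illustration.

```python
import os
import tempfile
from nltk.corpus import PlaintextCorpusReader

# 1. some files: two tiny made-up texts written to a temporary folder
demo_root = tempfile.mkdtemp()
demo_texts = {
    'first.txt': 'Hello there, reader.',
    'second.txt': 'A second tiny text.',
}
for name, text in demo_texts.items():
    with open(os.path.join(demo_root, name), 'w', encoding='utf-8') as f:
        f.write(text)

# 2. the path to the folder, and 3. the names of the files
demo_corpus = PlaintextCorpusReader(demo_root, ['first.txt', 'second.txt'])

print(demo_corpus.fileids())
print(demo_corpus.raw('first.txt'))
print(list(demo_corpus.words('first.txt')))
```

The `.raw()` and `.words()` calls here need no extra NLTK downloads, which makes this a quick way to check that a folder of your own files is being read as expected.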

If you remember from the first part of NLTK, they were using functions like `.concordance()` on the built-in data. We can do the same with our data, but we need to wrap the tokenized words in an nltk function called `Text()`.

In [None]:
# Create a special Text version of the corpus
from nltk.text import Text
mb_txt = Text(marine_biologist_corpus.words())

In [None]:
# now we can look for concordance lines
mb_txt.concordance('GEORGE')

In [None]:
mb_txt.concordance('whale')
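The same `Text()` wrapper works on any list of tokens, not only on tokens from a corpus reader. As a small sketch, the passage below is made up for this example and tokenized with `WordPunctTokenizer`, which needs no extra downloads.

```python
from nltk.text import Text
from nltk.tokenize import WordPunctTokenizer

# a made-up passage standing in for a text loaded from your own files
passage = "The whale surfaced near the boat. Everyone on the boat watched the whale dive again."
demo_tokens = WordPunctTokenizer().tokenize(passage)

demo_text = Text(demo_tokens)

# print each occurrence of the word with some context on either side
demo_text.concordance('whale', width=50)

# concordance_list returns the matches instead of printing them
matches = demo_text.concordance_list('whale')
print(len(matches))  # 2
```

Having the matches as a list can be handy if you want to count or filter concordance lines rather than just read them.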

### Loading in multiple texts to make a corpus

A corpus of a single text is not very interesting. Let's update our `PlaintextCorpusReader` to include all of the texts in our Seinfeld folder. But, it sure would be annoying having to type all of the filenames one-by-one. Fortunately, there's a way around this.

We can use the [`glob` library](https://docs.python.org/3/library/glob.html) to grab all of the filenames in a directory. The `glob` function makes it easy to save all of the filenames from a directory into a variable.  

In [None]:
# import the function which is the same name as the module
from glob import glob

# the * indicates you want everything from the folder.
# we can use more intelligent ways to select only certain files, we'll see this later with regex
filenames = glob('/content/drive/MyDrive/seinfeld/*')

filenames

Doing this gives us the entire filepath, which doesn't really hurt us but is also kind of annoying. We could easily remove this using slicing. Because the part that we want to remove is always the same, we could just slice that part off from each filename. All we need to know is where to start the slice.

In [None]:
# starting at 32 gives us the episode name only.
filenames[1][32:]

In [None]:
# let's write a list comprehension which removes the start of each filename
# (32 is the length of '/content/drive/MyDrive/seinfeld/'; adjust it if your folder lives elsewhere)
filenames_short = [name[32:] for name in filenames]


# voila!
filenames_short
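Slicing works when every path shares the same prefix and you know its length. An alternative that does not depend on counting characters is `os.path.basename`, which keeps only the last part of a path. The sketch below demonstrates it on a throwaway folder with invented file names, and also shows a `*.txt` pattern that skips non-text files.

```python
import os
import tempfile
from glob import glob

# build a small throwaway folder so the sketch runs anywhere
demo_root = tempfile.mkdtemp()
for name in ['episode one.txt', 'episode two.txt', 'notes.md']:
    with open(os.path.join(demo_root, name), 'w', encoding='utf-8') as f:
        f.write('placeholder')

# '*.txt' matches only the text files; sorted() gives a stable order
demo_paths = sorted(glob(os.path.join(demo_root, '*.txt')))

# os.path.basename keeps the last part of a path, whatever came before it
demo_names = [os.path.basename(p) for p in demo_paths]
print(demo_names)  # ['episode one.txt', 'episode two.txt']
```

Because the folder length is never hard-coded, the same lines keep working if the files are moved to a different folder.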

Now we can just pass `filenames_short` to the `PlaintextCorpusReader` function and make a larger corpus. I tested it and it will also work without cleaning the filepath we get from `glob`, but this is nice because we remove the clutter.

In [None]:
# make our seinfeld corpus
seinfeld_corpus = PlaintextCorpusReader(root = corpus_root, fileids = filenames_short)

In [None]:
# we can use the fileids function to see the texts in here
seinfeld_corpus.fileids()
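As an alternative to building the list of names yourself, `PlaintextCorpusReader` also accepts a regular expression as its second argument and collects every matching file under the root. The sketch below tries this on a temporary folder; the file names and contents are made up for the demo.

```python
import os
import tempfile
from nltk.corpus import PlaintextCorpusReader

# a throwaway folder with two text files and one file we want to ignore
demo_root = tempfile.mkdtemp()
for name in ['a_cleaned.txt', 'b_cleaned.txt', 'readme.md']:
    with open(os.path.join(demo_root, name), 'w', encoding='utf-8') as f:
        f.write('Some words here.')

# a regular expression instead of a list: every file ending in .txt
demo_corpus = PlaintextCorpusReader(demo_root, r'.*\.txt')

print(demo_corpus.fileids())
print(len(demo_corpus.words()))  # 4 tokens in each of the 2 matching files
```

For the Seinfeld folder, the same pattern with `corpus_root` as the first argument should pick up all of the `_cleaned.txt` files in one go.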

In [None]:
# what are the ten most common words in our corpus?
from nltk import FreqDist
FreqDist(seinfeld_corpus.words()).most_common(10)

In [None]:
# and I can search for concordances, neat!
Text(seinfeld_corpus.words()).concordance('apartment')

## **Your Turn**

Spend some time repeating the steps above for a different set of text data to make your own corpus. You might want to create a specific folder on your Google Drive which has your data for this course.