# NLTK 1: Interactive exploration of corpora

Learning goals:

- How to install and import NLTK and its corpus data
- How to use NLTK to explore text corpora interactively
- Understand how useful raw text corpora can be
- Understand what quantitative and distributional corpus-linguistic methods can reveal about language
- Know how list comprehensions speed up interactive exploration of corpora


## Installation and Setup

First, we need to install NLTK and download the book data.


In [None]:
!pip install nltk

In [2]:
import nltk

nltk.download("book")

[nltk_data]    |   Unzipping corpora/reuters.zip.
[nltk_data]    | Downloading package senseval to
[nltk_data]    |     /Users/siclemat/nltk_data...
[nltk_data]    |   Package senseval is already up-to-date!
[nltk_data]    | Downloading package state_union to
[nltk_data]    |     /Users/siclemat/nltk_data...
[nltk_data]    |   Package state_union is already up-to-date!
[nltk_data]    | Downloading package stopwords to
[nltk_data]    |     /Users/siclemat/nltk_data...
[nltk_data]    |   Package stopwords is already up-to-date!
[nltk_data]    | Downloading package swadesh to
[nltk_data]    |     /Users/siclemat/nltk_data...
[nltk_data]    |   Package swadesh is already up-to-date!
[nltk_data]    | Downloading package timit to
[nltk_data]    |     /Users/siclemat/nltk_data...
[nltk_data]    |   Package timit is already up-to-date!
[nltk_data]    | Downloading package treebank to
[nltk_data]    |     /Users/siclemat/nltk_data...
[nltk_data]    |   Unzipping corpora/treebank.zip.
[nltk_data

True

## Importing Modules

Before we start working with NLTK, let's understand how Python imports work.

### Statement: `import Module`

When you import a module, you need to use fully qualified dot notation to access its objects and functions:

```python
# Import module book from package nltk
import nltk.book

# Objects and functions from nltk.book can only
# be accessed with fully qualified dot notation.
print("Second token from text1:", nltk.book.text1[1])
# Second token from text1: Moby

# Objects and functions cannot be accessed directly:
print(text1[1])  # This will raise a NameError
# NameError: name 'text1' is not defined
```


### Statement: `from Module import *`

Alternatively, you can import all objects and functions from a module directly into your namespace:

```python
# Load module book from package nltk and
# import all its objects and functions into the current module
from nltk.book import *

# Objects and functions from nltk.book can be used without
# dotted notation: package.module.object
print("Second token of text1:", text1[1])

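# The fully qualified dot notation does not work in this case,
# because the star import does not bind the name `nltk`
# (unless `import nltk` was executed earlier in the same session)
print("Second token of text1:", nltk.book.text1[1])  # This will raise a NameError
# NameError: name 'nltk' is not defined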
```

**Note:** Using `from Module import *` is convenient for interactive exploration but should be used carefully in production code to avoid namespace conflicts.
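
A third option, not used in this notebook, is to import only the names you need. The sketch below uses the standard-library module `math` as a stand-in so that it runs without any NLTK data; the same pattern should apply to `nltk.book`.

```python
# `math` is only a stand-in for any module; it is not part of the NLTK examples.
import math
from math import sqrt

# Fully qualified access works because `import math` bound the name `math`.
print(math.sqrt(16.0))

# Direct access works because `from math import sqrt` bound the name `sqrt`.
print(sqrt(16.0))

# Names that were not imported explicitly are not available directly.
try:
    floor(2.5)
except NameError as err:
    print("NameError:", err)
```

Explicit imports keep the namespace small and show where each name comes from.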


## Loading the NLTK Interactive Demo

Note: The `nltk.book` module is designed for interactive exploration. Most of its methods print their results rather than return values for further computation.


In [3]:
from nltk.book import *

texts()

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Texts are sequences of tokens

We can use the indexing or slicing notation to access tokens in a text.


In [None]:
text1[0:10]

Their type is `nltk.text.Text`, but their slices are simple lists of strings.


In [None]:
type(text1), type(text1[0]), type(text1[0:10])
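
Because slices are ordinary lists, list comprehensions work on them directly. The following sketch uses a small hand-written token list (`tokens` is an illustrative stand-in for a slice of `text1`, so no corpus download is needed).

```python
# Hand-written stand-in for a slice such as text1[0:8]
tokens = ["[", "Moby", "Dick", "by", "Herman", "Melville", "1851", "]"]

# Keep only alphabetic tokens and lowercase them.
words = [tok.lower() for tok in tokens if tok.isalpha()]
print(words)

# Keep only tokens longer than four characters.
long_words = [tok for tok in tokens if len(tok) > 4]
print(long_words)
```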

## Create concordances: KWIC (Keyword in Context)

Which text is "Moby Dick"? Which one is "Sense and Sensibility"?


In [None]:
text1.concordance("man", lines=10, width=68)
print()
text2.concordance("man", lines=10, width=68)

In [None]:
text1.concordance("woman", lines=10, width=68)
print()
text2.concordance("woman", lines=10, width=68)

## Word frequencies in a corpus

Which book talks more about "love" independent of its length? Let's compute relative frequencies...


In [None]:
text1.count("love") / len(text1)

In [None]:
text2.count("love") / len(text2)

These raw numbers are hard to read. Let's format them as percentages with Python's f-strings.


In [None]:
print(
    f"Text1: {text1.count('love')/len(text1):.4%}\nText2:"
    f" {text2.count('love')/len(text2):.4%}"
)
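
Note that `count` is case-sensitive, so a sentence-initial "Love" is not counted as "love". Below is a minimal sketch of a helper that normalizes case first; `relative_frequency` and the toy token list are illustrative names, not part of NLTK.

```python
def relative_frequency(tokens, word):
    """Return the share of tokens equal to `word`, ignoring case."""
    if len(tokens) == 0:
        return 0.0
    target = word.lower()
    matches = sum(1 for tok in tokens if tok.lower() == target)
    return matches / len(tokens)


# Toy data: 2 of the 7 tokens match "love" once case is ignored.
toy_tokens = ["Love", "is", "not", "love", "which", "alters", "."]
print(f"{relative_frequency(toy_tokens, 'love'):.2%}")
```

Since the helper only needs `len()` and iteration, it should also accept the notebook's texts, e.g. `relative_frequency(text1, "love")`.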

## Frequency distributions

Count how often each distinct token (type) occurs in a text.
For larger corpora, the resulting distribution should roughly follow [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law).


In [None]:
fdist = FreqDist(text1)
vocabulary = sorted(fdist, key=fdist.get, reverse=True)
for w in vocabulary[:20]:
    print(w, "\t\t", fdist[w])
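
`FreqDist` is a subclass of Python's `collections.Counter`, so the sort above can also be written with `most_common`. The sketch below uses a plain `Counter` on a toy token list to stay independent of the corpus download.

```python
from collections import Counter

toy_tokens = "the whale and the sea and the ship".split()
counts = Counter(toy_tokens)

# most_common(n) returns the n most frequent (type, count) pairs,
# sorted by descending count.
for word, freq in counts.most_common(2):
    print(word, "\t", freq)
```

In the notebook, `fdist.most_common(20)` should yield the same top 20 types together with their counts.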

### Printing a plot

Make sure that the plot object is rendered by Jupyter.


In [None]:
! pip install matplotlib

In [None]:
%matplotlib inline
fdist.plot(20, cumulative=True)

Let's create a log-log plot to see if Zipf's law holds.


In [None]:
import matplotlib.pyplot as plt

# Get the frequencies of words in the vocabulary, which is already sorted by frequency
frequencies = [fdist[word] for word in vocabulary]
# Generate the ranks for the words
ranks = range(1, len(vocabulary) + 1)

# Create the log-log plot
plt.loglog(ranks, frequencies)
plt.xlabel("Rank", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
plt.title("Zipf's Law")
plt.show()
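
If Zipf's law holds, frequency is roughly proportional to 1/rank, so the log-log plot should be close to a straight line with a slope near -1. One rough numerical check is a least-squares fit on the log values. The function name `loglog_slope` and the synthetic data are assumptions for illustration.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    `frequencies` must be sorted in descending order and contain
    at least two positive values.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic, perfectly Zipfian frequencies: f(rank) = 1000 / rank
ideal = [1000 / rank for rank in range(1, 101)]
print(round(loglog_slope(ideal), 3))
```

Applied to the notebook's data, `loglog_slope(frequencies)` gives a single number to compare against -1. Real corpora usually deviate at the very top and in the long tail of words that occur only once.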

## Distributional Similarity

- "You shall know a word by the company it keeps!" (J. R. Firth, 1957)
- "words that occur in the same contexts tend to have similar meanings" (Pantel, 2005)

Which words appear in similar contexts?

### How does it work technically?

- NLTK ranks similar words by the number of shared context pairs.
- For each target word, NLTK collects all `(left-word, right-word)` contexts in which
  it appears.
- It then computes, for every other word, how many of these context pairs it shares.
- The words are sorted in descending order of shared-context count.

Thus the `similar()` output is a frequency-based ranking: words at the top share the largest number of left–right contexts with the target word.
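
The steps above can be sketched in plain Python. This is a simplified illustration, not NLTK's implementation: `similar_words` is a hypothetical helper, all tokens are lowercased, and the first and last token of the text are skipped because they lack a full context.

```python
from collections import Counter, defaultdict


def similar_words(tokens, target, num=5):
    """Rank words by the number of (left, right) contexts shared with `target`."""
    tokens = [tok.lower() for tok in tokens]
    target = target.lower()

    # Map every word to the set of (left, right) contexts in which it occurs.
    contexts = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((left, right))

    target_contexts = set(contexts[target])
    # Count, for every other word, how many distinct contexts it shares.
    scores = Counter(
        word
        for word, ctxs in contexts.items()
        for ctx in ctxs
        if word != target and ctx in target_contexts
    )
    return [word for word, _ in scores.most_common(num)]


toy = "the man walked home . the woman walked home . the dog ran away .".split()
print(similar_words(toy, "woman"))
```

In the toy text, "man" and "woman" both occur between "the" and "walked", so "man" is the only word that shares a context with "woman".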

Again: Could you guess which text is "Moby Dick" and which one is "Sense and
Sensibility"?


In [None]:
text1.similar("woman")
print()
text2.similar("woman")

In [None]:
text1.similar("love")
print()
text2.similar("love")

## Statistical collocations

Which word pairs occur more often than expected by chance?

- **Expected frequency**: Assume each word occurs independently according to its unigram frequency. Drawing two words consecutively from an urn models the probability of a bigram occurring by chance.
- **Empirical frequency**: Compute the actual bigram distribution observed in the corpus.
- A word pair is a [**statistical collocation**](https://en.wikipedia.org/wiki/Collocation) when its observed bigram frequency substantially exceeds the expected frequency, indicating a non-random association.
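
The comparison of expected and observed frequencies can be sketched as follows. This is a deliberately simple illustration that ranks bigrams by the ratio observed/expected. NLTK's `collocation_list` applies its own association measure and filters, so its ranking may differ. The helper name `observed_vs_expected` and the `min_count` threshold are assumptions.

```python
from collections import Counter


def observed_vs_expected(tokens, min_count=2):
    """Return (bigram, observed, expected) tuples, highest ratio first.

    expected = N_bigrams * P(w1) * P(w2) under the independence assumption.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = n_tokens - 1

    rows = []
    for (w1, w2), observed in bigrams.items():
        if observed < min_count:
            continue  # very rare pairs give unreliable ratios
        expected = n_bigrams * (unigrams[w1] / n_tokens) * (unigrams[w2] / n_tokens)
        rows.append(((w1, w2), observed, expected))
    return sorted(rows, key=lambda row: row[1] / row[2], reverse=True)


toy = "sperm whale saw the sea and the sperm whale saw the ship".split()
for bigram, observed, expected in observed_vs_expected(toy):
    print(bigram, observed, round(expected, 2))
```

A ratio well above 1 means that the pair occurs more often than independent draws would predict.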


In [None]:
print(text1.collocation_list())
print()
print(text2.collocation_list())

## Dispersion plots

How are specific words distributed across a chronological sequence of texts?  
Example: _U.S. Inaugural Addresses_

- The timeline is represented implicitly by the **ordered sequence of speeches**.
- A dispersion plot marks each occurrence of a word along this sequence, showing **when** and **how frequently** it appears.


In [None]:
text4.dispersion_plot(["freedom", "war"])
text4.dispersion_plot(["economy", "war", "digital", "slavery"])
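
Conceptually, a dispersion plot only needs the token offsets at which each word occurs. A minimal sketch (`word_offsets` is an illustrative helper, not an NLTK function):

```python
def word_offsets(tokens, word):
    """Return the positions (token offsets) at which `word` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == word]


toy = "war and peace and war again".split()
print(word_offsets(toy, "war"))
print(word_offsets(toy, "and"))
```

Plotting these offsets on the x-axis, with one row per word, gives the same kind of picture as `dispersion_plot`.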

## Frequency-Based Text Generation

A simple n-gram language model predicts the next word from the **preceding n−1 words**.

- Build an **n-gram frequency table** from the corpus (e.g. trigrams).
- Convert counts into **conditional probabilities**:  
  P(next_word | w1, w2) = count(w1, w2, next_word) / count(w1, w2)
- Generate text by repeatedly **sampling the next word** from this conditional distribution.
- Resulting output reflects local phrase patterns learned from the presidential speeches.
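
The steps above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions (no smoothing, no backoff, a fixed random seed), not the code behind `text4.generate`. The names `build_trigram_model` and `generate` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def build_trigram_model(tokens):
    """Map each (w1, w2) history to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        model[(w1, w2)][w3] += 1
    return model


def generate(model, seed, length=10, rng=None):
    """Extend the two-word `seed` by sampling from P(next_word | w1, w2)."""
    if rng is None:
        rng = random.Random(0)  # fixed seed for reproducible output
    output = list(seed)
    for _ in range(length):
        followers = model.get(tuple(output[-2:]))
        if not followers:
            break  # unseen history: this sketch has no backoff or smoothing
        words = list(followers)
        counts = list(followers.values())
        # Sampling proportionally to counts equals sampling from count / total.
        output.append(rng.choices(words, weights=counts, k=1)[0])
    return output


toy = "we the people of the nation and we the people of the world".split()
model = build_trigram_model(toy)
print(" ".join(generate(model, ["we", "the"], length=4)))
```

With such a tiny corpus most histories have a single continuation, so the output mostly copies the training text. Variation only appears where a history has several followers, here after "of the".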


In [None]:
t = text4.generate(text_seed="Freedom".split(), length=40)

With 2025's generative AI language models, we know what is possible. But here are some
earlier approaches that represent important steps in the development of text generation
(and can be even more fun to play with):

Text generation with recurrent neural networks (RNNs) from 2015, which can take more of the preceding text into account when proposing the next word: https://cyborg.tenso.rs

- Recommended: Language model of (re-)tweets by/with Donald Trump (e.g. start with "America")
- Start with "I love" and select different training corpora (e.g. Linux:-)

Early GPT-2 transformer-based text generation:

- Write your next ACL paper with it: [This paper describes](https://transformer.huggingface.co/doc/arxiv-nlp/BcKBkznNiWnDfJdynrvMxQkF/edit)


## Processing XML-based Corpora

Theater plays have a more complex structure than raw texts.
NLTK contains XML-encoded Shakespeare. See this excerpt:

```xml
<PLAY>
  <TITLE>The Tragedy of Othello</TITLE>
  <PERSONAE>
    <PERSONA>DUKE OF VENICE</PERSONA>
    ...
  </PERSONAE>

  <ACT>
    <SCENE>
      <SPEECH>
        <SPEAKER>RODERIGO</SPEAKER>
        <LINE>Tush! never tell me; I take it much unkindly</LINE>
      </SPEECH>
    </SCENE>
  </ACT>
</PLAY>
```

Let's use the NLTK XML reader to load the NLTK book samples of Shakespeare plays.


In [None]:
from nltk.corpus import shakespeare

# Read XML tree for one play
tree = shakespeare.xml("othello.xml")

# Extract all speaker–line pairs (first scene)
scene = tree.find(".//SCENE")
for speech in scene.findall(".//SPEECH"):
    speaker = speech.findtext("SPEAKER")
    lines = [line.text for line in speech.findall("LINE")]
    print(speaker, lines)
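
The same ElementTree calls can be tried on a small inline document with Python's standard `xml.etree.ElementTree` module, without the corpus download. The XML string below is a shortened, hand-written stand-in modeled on the excerpt above. The second speech contains placeholder lines and is not taken from the play.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written stand-in; the IAGO speech holds placeholder text only.
xml_text = """
<PLAY>
  <TITLE>The Tragedy of Othello</TITLE>
  <ACT>
    <SCENE>
      <SPEECH>
        <SPEAKER>RODERIGO</SPEAKER>
        <LINE>Tush! never tell me; I take it much unkindly</LINE>
      </SPEECH>
      <SPEECH>
        <SPEAKER>IAGO</SPEAKER>
        <LINE>placeholder line one</LINE>
        <LINE>placeholder line two</LINE>
      </SPEECH>
    </SCENE>
  </ACT>
</PLAY>
"""

root = ET.fromstring(xml_text)

# Count the number of LINE elements per speaker.
lines_per_speaker = Counter()
for speech in root.iter("SPEECH"):
    speaker = speech.findtext("SPEAKER")
    lines_per_speaker[speaker] += len(speech.findall("LINE"))

print(lines_per_speaker.most_common())
```

Running the same loop over `tree.iter("SPEECH")` from the cell above should give a rough picture of which characters have the most lines in the whole play.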