# Representation of raw text corpora

## Learning goals

- Explain the Gutenberg corpus reader and how it exposes English raw texts
- Show how raw texts can be represented at different levels: character string, token list, sentence list, paragraph list
- Demonstrate that the `gutenberg` object is an instance of the PlaintextCorpusReader class
- Show how PlaintextCorpusReader can be adapted to languages other than English


In [None]:
from nltk.corpus import gutenberg

# Where are the text files stored?
gutenberg.root

In [None]:
help(gutenberg)

## Text as a single string: method raw()

- Text = sequence of characters


In [None]:
emma_chars = gutenberg.raw("austen-emma.txt")
emma_chars[-224:]

## Text as a sequence of words: method words()

- Text = sequence of words
- Word = sequence of characters (string)


In [None]:
filename = "austen-emma.txt"
emma_words = gutenberg.words(filename)
emma_words[11:40]

## Text as a sequence of sentences: method sents()

- Text = sequence of sentences
- Sentence = sequence of words
- Word = sequence of characters


In [None]:
emma_sents = gutenberg.sents(filename)

# Last 2 sentences
emma_sents[-2:]

## Text as a sequence of paragraphs: method paras()

- Text = sequence of paragraphs
- Paragraph = sequence of sentences
- Sentence = sequence of words
- Word = sequence of characters


In [None]:
emma_paras = gutenberg.paras(filename)
emma_paras[0:4]
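
The nesting described by these four levels can be illustrated without NLTK. The following is a minimal sketch on a hand-made toy text (the data and all variable names are invented for illustration and are not taken from the Gutenberg corpus). It shows how indexing walks down the hierarchy one level at a time and how the deeper levels flatten into the shallower ones.

```python
from itertools import chain

# Toy stand-in for the result of paras(): a list of paragraphs,
# each paragraph a list of sentences, each sentence a list of word strings.
toy_paras = [
    [["Emma", "was", "clever", "."], ["She", "was", "rich", "."]],
    [["Her", "father", "was", "kind", "."]],
]

# Indexing walks down the hierarchy one level at a time.
first_para = toy_paras[0]   # list of sentences
first_sent = first_para[0]  # list of words
first_word = first_sent[0]  # string
first_char = first_word[0]  # single-character string

# Flattening one level turns the paragraph list into a sentence list ...
toy_sents = list(chain.from_iterable(toy_paras))
# ... and flattening once more turns the sentence list into a word list.
toy_words = list(chain.from_iterable(toy_sents))

print(first_word, first_char)
print(len(toy_paras), len(toy_sents), len(toy_words))
```

For a real corpus one would expect the flattened paragraphs to line up with the sentence list, but that is worth checking rather than assuming.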

## Corpus-linguistic questions

### How many paragraphs does "Emma" contain?


In [None]:
len(emma_paras)

### How many sentences does "Emma" contain?


In [None]:
len(emma_sents)

### What is the average number of sentences per paragraph?


In [None]:
len(emma_sents) / len(emma_paras)

Format the result for presentation:


In [None]:
avg = len(emma_sents) / len(emma_paras)
f"Average # of sentences per paragraph: {avg:.2f}"
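
The same kind of question can be asked at other levels of the hierarchy. As a hedged sketch on invented toy data (not "Emma" itself), the following computes the average number of sentences per paragraph and of tokens per sentence directly from a nested paragraph list, using `statistics.mean` from the standard library. Note that punctuation marks count as tokens here.

```python
from statistics import mean

# Toy nested structure: paragraphs -> sentences -> tokens (illustrative data only).
toy_paras = [
    [["Emma", "was", "clever", "."], ["She", "was", "rich", "."]],
    [["Her", "father", "was", "kind", "."]],
    [["They", "lived", "at", "Hartfield", "."], ["It", "was", "pleasant", "."], ["Yes", "."]],
]

# One length per paragraph (in sentences) and one per sentence (in tokens).
sents_per_para = [len(para) for para in toy_paras]
tokens_per_sent = [len(sent) for para in toy_paras for sent in para]

avg_sents = mean(sents_per_para)
avg_tokens = mean(tokens_per_sent)

print(f"Average # of sentences per paragraph: {avg_sents:.2f}")
print(f"Average # of tokens per sentence: {avg_tokens:.2f}")
```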

## Reading your own text corpora

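Text files are stored as bytes, so the reader has to know which encoding to use when turning them into strings. Specifying the correct encoding when loading files (here UTF-8) prevents decoding errors and garbled characters.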
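
To see why the encoding matters, here is a small standard-library sketch (the sample word is chosen for illustration only). Decoding bytes with the wrong codec either produces silently garbled text or raises an error, depending on the combination of codecs.

```python
sample = "Würde"  # German word with one non-ASCII character

utf8_bytes = sample.encode("utf-8")      # 'ü' becomes two bytes: 0xC3 0xBC
latin1_bytes = sample.encode("latin-1")  # 'ü' becomes one byte: 0xFC

# Wrong codec, no error: each UTF-8 byte is misread as one Latin-1 character.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# Wrong codec, with error: 0xFC cannot start a UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print("Decoding failed:", err.reason)

# Correct codec: the round trip restores the original string.
restored = utf8_bytes.decode("utf-8")
print(restored == sample)
```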


In [None]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Adjust this path to the location of your own nltk_data directory
root = "/Users/siclemat/nltk_data/corpora/udhr2/"
file_pattern = r".+\.txt"
my_humanrights = PlaintextCorpusReader(root, file_pattern, encoding="utf-8")

print(my_humanrights.sents("deu_1901.txt")[:3])

How many declarations (languages) are included in the collection?


In [None]:
!ls /Users/siclemat/nltk_data/corpora/udhr2/

In [None]:
# Mandarin Chinese in simplified script; see http://www.iana.org/assignments/lang-tags/zh-cmn-Hans
print(my_humanrights.sents("cmn_hans.txt")[:3])

What types do the objects `gutenberg` and `my_humanrights` have?


In [None]:
print(type(my_humanrights))
print(type(gutenberg))

In [None]:
help(PlaintextCorpusReader)

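When creating a reader for a corpus directory, you can optionally pass your own sentence tokenizer (sentence splitter) as `sent_tokenizer`, your own word tokenizer as `word_tokenizer`, and your own paragraph reader as `para_block_reader`. This is what makes PlaintextCorpusReader adaptable to languages other than English.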
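
As a minimal sketch of this flexibility, the following builds a tiny throwaway corpus in a temporary directory and reads it with two deliberately naive regular-expression tokenizers. The file name `demo.txt`, the sample text, and both patterns are assumptions made for illustration; a real project would likely need more careful tokenizers for its language.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Build a tiny throwaway corpus so the sketch does not depend on local paths.
corpus_dir = Path(tempfile.mkdtemp())
(corpus_dir / "demo.txt").write_text(
    "Alle Menschen sind frei. Sie sind gleich an Würde!\n\nJeder hat Rechte.\n",
    encoding="utf-8",
)

# Naive sentence splitter: any run of characters up to and including ., ! or ?
sent_tokenizer = RegexpTokenizer(r"[^.!?]+[.!?]")
# Naive word tokenizer: runs of word characters, punctuation as separate tokens.
word_tokenizer = RegexpTokenizer(r"\w+|[^\w\s]")

demo = PlaintextCorpusReader(
    str(corpus_dir),
    r".+\.txt",
    word_tokenizer=word_tokenizer,
    sent_tokenizer=sent_tokenizer,
    encoding="utf-8",
)

demo_sents = demo.sents("demo.txt")
for sent in demo_sents:
    print(sent)
print(len(demo.paras("demo.txt")), "paragraphs")
```

Because both tokenizers are passed explicitly, this sketch should not need any downloaded sentence-splitter model. The paragraph reader is left at its default, which splits on blank lines.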


## Downloading texts directly from a URL

Example: how to download a text from the Deutsches Textarchiv:
https://www.deutschestextarchiv.de/book/show/abschatz_gedichte_1704


In [None]:
import urllib.request

url = "https://www.deutschestextarchiv.de/book/download_txt/abschatz_gedichte_1704"
# The context manager closes the connection once the body has been read
with urllib.request.urlopen(url) as response:
    data = response.read()  # a `bytes` object
text = data.decode("utf-8")  # bytes -> str

In [None]:
type(text)

In [None]:
print(text[:200])
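
The download above assumes that the server sends UTF-8. A slightly more defensive variant, sketched here under that same assumption as a fallback, reads the charset declared in the HTTP `Content-Type` header and only falls back to UTF-8 when none is given. The helper names `decode_body` and `fetch_text` are hypothetical and introduced for illustration; the demonstration at the end runs offline on hand-made bytes.

```python
import urllib.request
from typing import Optional


def decode_body(data: bytes, declared_charset: Optional[str], fallback: str = "utf-8") -> str:
    """Decode bytes with the declared charset, else with the fallback codec."""
    codec = declared_charset or fallback
    try:
        return data.decode(codec)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or undecodable bytes: use the fallback and
        # mark bad bytes with U+FFFD instead of raising.
        return data.decode(fallback, errors="replace")


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download a URL and decode it as text (not called here; needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # get_content_charset() returns the charset from the Content-Type
        # header, or None if the server did not declare one.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


# Offline demonstration on hand-made bytes.
print(decode_body("Gedichte über Liebe".encode("utf-8"), None))
print(decode_body("Gedichte über Liebe".encode("latin-1"), "latin-1"))
```

With network access, `fetch_text(url)` could stand in for the separate open, read, and decode steps of the download cell.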