# Representation of pure text corpora
Learning goals:
 - Understand the functionality of the Gutenberg corpus reader object for English raw texts
 - Understand how raw text corpora can be represented on different levels: character string, token list, sentence list, paragraph list
 - Understand that Gutenberg is just an instance of the PlaintextCorpusReader class
 - Understand how this PlaintextCorpusReader can be adapted to languages other than English


In [1]:
from nltk.corpus import gutenberg

# Where are the text files stored?
gutenberg.root

FileSystemPathPointer('/Users/siclemat/nltk_data/corpora/gutenberg')

In [2]:
help(gutenberg)

Help on PlaintextCorpusReader in module nltk.corpus.reader.plaintext object:

class PlaintextCorpusReader(nltk.corpus.reader.api.CorpusReader)
 |  PlaintextCorpusReader(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), sent_tokenizer=None, para_block_reader=<function read_blankline_block at 0x112217240>, encoding='utf8')
 |
 |  Reader for corpora that consist of plaintext documents.  Paragraphs
 |  are assumed to be split using blank lines.  Sentences and words can
 |  be tokenized using the default tokenizers, or by custom tokenizers
 |  specified as parameters to the constructor.
 |
 |  This corpus reader can be customized (e.g., to skip preface
 |  sections of specific document formats) by creating a subclass and
 |  overriding the ``CorpusView`` class variable.
 |
 |  Method resolution order:
 |      PlaintextCorpusReader
 |      nltk.corpus.reader.api.CorpusReader
 |      builtins.o

## Text as a long string: method raw()

- Text = sequence of characters

In [3]:
emma_chars = gutenberg.raw("austen-emma.txt")
emma_chars[-224:]

'"--But, in spite of these deficiencies, the wishes,\nthe hopes, the confidence, the predictions of the small band\nof true friends who witnessed the ceremony, were fully answered\nin the perfect happiness of the union.\n\n\nFINIS\n'

## Text as a sequence of words: method words()
 - Text = sequence of words
 - Word = sequence of characters = string
 

In [4]:
filename = "austen-emma.txt"
emma_words = gutenberg.words(filename)
emma_words[11:40]

['Emma',
 'Woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',',
 'and',
 'rich',
 ',',
 'with',
 'a',
 'comfortable',
 'home',
 'and',
 'happy',
 'disposition',
 ',',
 'seemed',
 'to',
 'unite',
 'some',
 'of',
 'the',
 'best',
 'blessings',
 'of',
 'existence',
 ';']

## Text as a sequence of sentences: method sents()
 - Text = sequence of sentences
 - Sentence = sequence of words
 - Word = sequence of characters

In [5]:
emma_sents = gutenberg.sents(filename)

# Last 2 sentences
emma_sents[-2:]

[['--',
  'But',
  ',',
  'in',
  'spite',
  'of',
  'these',
  'deficiencies',
  ',',
  'the',
  'wishes',
  ',',
  'the',
  'hopes',
  ',',
  'the',
  'confidence',
  ',',
  'the',
  'predictions',
  'of',
  'the',
  'small',
  'band',
  'of',
  'true',
  'friends',
  'who',
  'witnessed',
  'the',
  'ceremony',
  ',',
  'were',
  'fully',
  'answered',
  'in',
  'the',
  'perfect',
  'happiness',
  'of',
  'the',
  'union',
  '.'],
 ['FINIS']]

## Document as a sequence of paragraphs: method paras()
 - document = sequence of paragraphs
 - paragraph = sequence of sentences
 - sentence = sequence of words
 - word = sequence of characters

In [6]:
emma_paras = gutenberg.paras(filename)
emma_paras[0:4]

[[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']],
 [['VOLUME', 'I']],
 [['CHAPTER', 'I']],
 [['Emma',
   'Woodhouse',
   ',',
   'handsome',
   ',',
   'clever',
   ',',
   'and',
   'rich',
   ',',
   'with',
   'a',
   'comfortable',
   'home',
   'and',
   'happy',
   'disposition',
   ',',
   'seemed',
   'to',
   'unite',
   'some',
   'of',
   'the',
   'best',
   'blessings',
   'of',
   'existence',
   ';',
   'and',
   'had',
   'lived',
   'nearly',
   'twenty',
   '-',
   'one',
   'years',
   'in',
   'the',
   'world',
   'with',
   'very',
   'little',
   'to',
   'distress',
   'or',
   'vex',
   'her',
   '.']]]
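The four levels (characters, words, sentences, paragraphs) can be imitated on a toy string with the standard library alone. This is only a sketch: the word pattern is the one shown for the default `WordPunctTokenizer` in the `help(gutenberg)` output above, and paragraphs are split on blank lines as the docstring describes, but the regular-expression sentence splitter and the helper name `toy_paras` are assumptions made for illustration and are not what NLTK itself does.

```python
import re

# Same pattern as the default WordPunctTokenizer shown in help(gutenberg).
WORD_PATTERN = re.compile(r"\w+|[^\w\s]+")


def toy_paras(raw):
    """Split raw text into paragraphs -> sentences -> words (naive sketch)."""
    paras = []
    # Paragraphs are separated by blank lines.
    for block in re.split(r"\n\s*\n", raw.strip()):
        # Crude sentence split: break after ., ! or ? followed by whitespace.
        sents = re.split(r"(?<=[.!?])\s+", block.strip())
        paras.append([WORD_PATTERN.findall(s) for s in sents])
    return paras


raw = "CHAPTER I\n\nEmma was rich. She was clever, too.\n"
paras = toy_paras(raw)                      # list of lists of lists of str
sents = [s for p in paras for s in p]       # flatten one level: sentences
words = [w for s in sents for w in s]       # flatten again: words

print(paras)
print(len(raw), len(words), len(sents), len(paras))
```

Flattening the nested paragraph structure once gives the sentence list, and flattening it twice gives the word list, which mirrors the relationship between `paras()`, `sents()` and `words()`.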

## Corpus linguistic questions
### How many paragraphs does "Emma" have?

In [7]:
len(emma_paras)

2371

### How many sentences does "Emma" have?

In [8]:
len(emma_sents)

7752

### How many sentences does a paragraph have on average?

In [9]:
len(emma_sents)/len(emma_paras)

3.2695065373260226

How can we format that nicely?

In [10]:
avg = len(emma_sents)/len(emma_paras)
f"Average # of sentences per paragraph: {avg:.2f}"

'Average # of sentences per paragraph: 3.27'
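The same kind of question can be asked at other levels of the nested structure. The sketch below uses a hypothetical miniature corpus (the variable `toy`, invented for this example) in the same nested shape that `paras()` returns, so it runs without any corpus files.

```python
from statistics import mean

# Hypothetical miniature corpus in the nested shape returned by paras():
# paragraphs -> sentences -> words.
toy = [
    [["CHAPTER", "I"]],
    [["Emma", "was", "rich", "."], ["She", "was", "clever", "."]],
]

sents_per_para = [len(para) for para in toy]
words_per_sent = [len(sent) for para in toy for sent in para]

print(f"Average # of sentences per paragraph: {mean(sents_per_para):.2f}")
print(f"Average # of words per sentence: {mean(words_per_sent):.2f}")
```

Taking the mean of the per-paragraph sentence counts gives the same number as dividing the total sentence count by the paragraph count, provided the sentence list is exactly the flattened paragraph list.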

# Read your own text corpora
Specifying the encoding of the text files explicitly helps ensure that they are decoded correctly.

In [11]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
root = '/Users/siclemat/nltk_data/corpora/udhr2/'
file_pattern = r'.+\.txt'
my_humanrights = PlaintextCorpusReader(root,
                    file_pattern,
                    encoding='utf-8')

print(my_humanrights.sents('deu_1901.txt')[:3])

[['Die', 'Allgemeine', 'Erklärung', 'der', 'Menschenrechte', 'Resolution', '217', 'A', '(', 'III', ')', 'vom', '10', '.', '12', '.', '1948'], ['Präambel', 'Da', 'die', 'Anerkennung', 'der', 'angeborenen', 'Würde', 'und', 'der', 'gleichen', 'und', 'unveräußerlichen', 'Rechte', 'aller', 'Mitglieder', 'der', 'Gemeinschaft', 'der', 'Menschen', 'die', 'Grundlage', 'von', 'Freiheit', ',', 'Gerechtigkeit', 'und', 'Frieden', 'in', 'der', 'Welt', 'bildet', ',', 'da', 'die', 'Nichtanerkennung', 'und', 'Verachtung', 'der', 'Menschenrechte', 'zu', 'Akten', 'der', 'Barbarei', 'geführt', 'haben', ',', 'die', 'das', 'Gewissen', 'der', 'Menschheit', 'mit', 'Empörung', 'erfüllen', ',', 'und', 'da', 'verkündet', 'worden', 'ist', ',', 'daß', 'einer', 'Welt', ',', 'in', 'der', 'die', 'Menschen', 'Rede', '-', 'und', 'Glaubensfreiheit', 'und', 'Freiheit', 'von', 'Furcht', 'und', 'Not', 'genießen', ',', 'das', 'höchste', 'Streben', 'des', 'Menschen', 'gilt', ',', 'da', 'es', 'notwendig', 'ist', ',', 'die', '

How many declarations have been collected?

In [12]:
! ls  /Users/siclemat/nltk_data/corpora/udhr2/

007.txt		   ell_polytonic.txt  lia.txt		 rmy.txt
008.txt		   emk.txt	      lin.txt		 roh.txt
009.txt		   eml.txt	      lin_tones.txt	 ron.txt
010.txt		   eng.txt	      lit.txt		 ron_1953.txt
011.txt		   epo.txt	      lnc.txt		 ron_1993.txt
abk.txt		   est.txt	      lns.txt		 run.txt
ace.txt		   eus.txt	      lot.txt		 rus.txt
acu.txt		   eve.txt	      loz.txt		 sag.txt
acu_1.txt	   evn.txt	      ltz.txt		 sah.txt
ada.txt		   ewe.txt	      lua.txt		 san.txt
afr.txt		   fao.txt	      lue.txt		 sco.txt
agr.txt		   fij.txt	      lug.txt		 sey.txt
aii.txt		   fin.txt	      lun.txt		 shk.txt
ajg.txt		   flm.txt	      lus.txt		 shp.txt
aka_akuapem.txt    fon.txt	      mad.txt		 skr.txt
aka_asante.txt	   fra.txt	      mag.txt		 slk.txt
aka_fante.txt	   fri.txt	      mah.txt		 slv.txt
als.txt		   fuc.txt	      mai.txt		 sme.txt
alt.txt		   fur.txt	      mal.txt		 smo.txt
amc.txt		   gaa.txt	      mam.txt		 sna.txt
ame.txt		   gag.txt	      mar.txt		 snk.txt
amh.txt		   gax.
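The directory listing shows the files but does not count them. With the reader object, `len(my_humanrights.fileids())` should give the number of files matched by the pattern. The standard-library sketch below illustrates the same idea of counting file names that match `file_pattern`; the temporary directory and its three file names are hypothetical stand-ins for the real `udhr2` directory.

```python
import re
import tempfile
from pathlib import Path

file_pattern = r'.+\.txt'

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in files; the real corpus lives in the udhr2 root.
    for name in ["deu_1901.txt", "eng.txt", "README"]:
        (Path(tmp) / name).write_text("", encoding="utf-8")

    # Keep only the names that match the whole pattern.
    matching = sorted(p.name for p in Path(tmp).iterdir()
                      if re.fullmatch(file_pattern, p.name))

print(len(matching), matching)
```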

In [13]:
# http://www.iana.org/assignments/lang-tags/zh-cmn-Hans
print(my_humanrights.sents('cmn_hans.txt')[:3])

[['世界人权宣言', '联合国大会一九四八年十二月十日第217A', '(', 'III', ')', '号决议通过并颁布', '1948', '年', '12', '月', '10', '日', '，', '联合国大会通过并颁布', '《', '世界人权宣言', '》。', '这一具有历史意义的', '《', '宣言', '》', '颁布后', '，', '大会要求所有会员国广为宣传', '，', '并且', '“', '不分国家或领土的政治地位', ',', '主要在各级学校和其他教育机构加以传播', '、', '展示', '、', '阅读和阐述', '。”《', '宣言', '》', '全文如下', '：'], ['序言', '鉴于对人类家庭所有成员的固有尊严及其平等的和不移的权利的承认', ',', '乃是世界自由', '、', '正义与和平的基础', ',', '鉴于对人权的无视和侮蔑已发展为野蛮暴行', ',', '这些暴行玷污了人类的良心', ',', '而一个人人享有言论和信仰自由并免予恐惧和匮乏的世界的来临', ',', '已被宣布为普通人民的最高愿望', ',', '鉴于为使人类不致迫不得已铤而走险对暴政和压迫进行反叛', ',', '有必要使人权受法治的保护', ',', '鉴于有必要促进各国间友好关系的发展', ',', '鉴于各联合国国家的人民已在联合国宪章中重申他们对基本人权', '、', '人格尊严和价值以及男女平等权利的信念', ',', '并决心促成较大自由中的社会进步和生活水平的改善', ',', '鉴于各会员国业已誓愿同联合国合作以促进对人权和基本自由的普遍尊重和遵行', ',', '鉴于对这些权利和自由的普遍了解对于这个誓愿的充分实现具有很大的重要性', ',', '因此现在', ',', '大会', ',', '发布这一世界人权宣言', ',', '作为所有人民和所有国家努力实现的共同标准', ',', '以期每一个人和社会机构经常铭念本宣言', ',', '努力通过教诲和教育促进对权利和自由的尊重', ',', '并通过国家的和国际的渐进措施', ',', '使这些权利和自由在各会员国本身人民及在其管辖下领土的人民中得到普遍和有效的承认和遵行', ';'], ['第一条', '人人生而自由', ',', '在尊严和权利上

What type do the objects `gutenberg` and `my_humanrights` have?

In [14]:
print(type(my_humanrights))
print(type(gutenberg))

<class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'>
<class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'>


In [None]:
help(PlaintextCorpusReader)

When reading a corpus directory, we can optionally specify the sentence tokenizer (sentence splitter), the word tokenizer, and the block reader for paragraphs. This makes the reader class flexible and general.
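As a minimal sketch of this flexibility, the reader below is given two simple tokenizers from `nltk.tokenize`: `LineTokenizer` treats every line as one sentence and `WhitespaceTokenizer` splits words on whitespace only. The temporary directory, the file `demo.txt` and its content are assumptions made up for this example; a real corpus would more likely need a language-specific tokenizer.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import LineTokenizer, WhitespaceTokenizer

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical one-sentence-per-line file, created only for this demo.
    Path(tmp, "demo.txt").write_text("Eins zwei.\nDrei vier.\n",
                                     encoding="utf-8")

    reader = PlaintextCorpusReader(
        tmp,
        r'.+\.txt',
        word_tokenizer=WhitespaceTokenizer(),  # split on whitespace only
        sent_tokenizer=LineTokenizer(),        # one sentence per line
        encoding='utf-8',
    )
    # Materialize the lazy corpus view while the file still exists.
    demo_sents = [list(sent) for sent in reader.sents("demo.txt")]

print(demo_sents)
```

Because the whitespace tokenizer does not separate punctuation, the full stop stays attached to the preceding word, unlike with the default word tokenizer used for `gutenberg`.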

## Reading from URLs without NLTK
How can we just download the text file from "Deutsches Textarchiv"?
https://www.deutschestextarchiv.de/book/show/abschatz_gedichte_1704

In [None]:
import urllib.request

# Plain-text download URL of the book (split over two lines for readability)
url = ('https://www.deutschestextarchiv.de/book/'
       'download_txt/abschatz_gedichte_1704')
response = urllib.request.urlopen(url)
data = response.read()       # a `bytes` object
text = data.decode('utf-8')  # decode the bytes into a `str`


In [None]:
type(text)

In [None]:
print(text[:200])
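The difference between the `bytes` object and the decoded `str` can be seen without a network connection. The sketch below encodes a German word by hand (a stand-in for what `response.read()` would return) and decodes it once with the correct codec and once with a wrong one.

```python
# Stand-in for the bytes returned by response.read()
data = "Erklärung".encode("utf-8")
print(type(data), len(data))       # more bytes than characters: 'ä' needs two

text = data.decode("utf-8")        # correct codec
garbled = data.decode("latin-1")   # wrong codec: no error, but wrong characters
print(text, garbled)
```

Decoding with the wrong codec does not necessarily raise an exception, so wrong characters can slip into a corpus unnoticed. This is why stating the encoding explicitly is worthwhile.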