# NLTK 1: Interactive exploration of corpora

## Loading the NLTK Interactive Demo

In [None]:
from nltk.book import *
texts()

## Texts are sequences of tokens

In [None]:
text1[0:20]

In [None]:
globals()

## Create concordances
KWIC (Keyword in Context)

In [None]:
text1.concordance("man", lines=10, width=68)
print()
text2.concordance("man", lines=10, width=68)

In [None]:
text1.concordance("woman", lines=10, width=68)
print()
text2.concordance("woman", lines=10, width=68)
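
To make the keyword-in-context idea concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation; the function `kwic`, its fixed five-token window, and the sample sentence are assumptions made for illustration.

```python
def kwic(tokens, keyword, width=30, lines=10):
    """Return up to `lines` keyword-in-context rows for `keyword`."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() != keyword.lower():
            continue
        # Up to five tokens on each side, trimmed to `width` characters
        left = " ".join(tokens[max(0, i - 5):i])[-width:]
        right = " ".join(tokens[i + 1:i + 6])[:width]
        # Right-align the left context so the keywords line up in one column
        rows.append(f"{left:>{width}} {token} {right}")
        if len(rows) == lines:
            break
    return rows

sample = "the old man saw a man on the hill and the man waved".split()
for row in kwic(sample, "man"):
    print(row)
```

Aligning the keyword in a fixed column is what makes recurring left and right neighbors easy to spot by eye.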

## Word frequencies in a corpus

In [None]:
text1.count('love')/len(text1)

In [None]:
text2.count('love')/len(text2) 

<h3>Frequency distributions</h3>
Calculate the frequency of every distinct token (= type) in a text.
For larger text collections, the result should follow [Zipf's law](https://de.wikipedia.org/wiki/Zipfsches_Gesetz): a word's frequency is roughly inversely proportional to its frequency rank.

In [None]:
fdist = FreqDist(text1)
# Sort the types by descending frequency
vocabulary = sorted(fdist, key=fdist.get, reverse=True)
# Print the 20 most frequent types with their counts
for w in vocabulary[:20]:
    print(w, "\t\t", fdist[w])
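
One rough way to check for Zipf-like behavior is to multiply each word's rank by its frequency: if frequency is inversely proportional to rank, the product stays roughly constant. The sketch below uses only the standard library; `rank_frequency` and the toy token list are illustrative assumptions, and a text this small only demonstrates the calculation, not the law itself.

```python
from collections import Counter

def rank_frequency(tokens, top=5):
    """Return (rank, word, frequency, rank * frequency) for the most frequent words."""
    counts = Counter(tokens)
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(counts.most_common(top), start=1)
    ]

# Toy data: frequencies 6, 3, 2, 1 for ranks 1 to 4
tokens = "a a a a a a b b b c c d".split()
for row in rank_frequency(tokens):
    print(row)
```

The same function should accept any token sequence, so passing a full corpus text instead of the toy list is the natural next step.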


<h3>Printing a plot</h3>
Make sure that the plot is rendered inline by Jupyter.

In [None]:
%matplotlib inline  
fdist.plot(20,cumulative=True)

## Similarity
Distributional similarity 
- "You shall know a word by the company it keeps!" (J. R. Firth, 1957)
- "words that occur in the same contexts tend to have similar meanings" (Pantel, 2005)

Which words appear in similar contexts?

In [None]:
text1.similar("woman")
print()
text2.similar("woman")

In [None]:
text1.similar("love")
print()
text2.similar("love")
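
The idea of "appearing in similar contexts" can be sketched in plain Python by treating the pair (left neighbor, right neighbor) as a word's context and counting how many contexts two words share. This is a simplified illustration, not NLTK's actual scoring; `similar_words` and the toy corpus are assumptions.

```python
from collections import Counter, defaultdict

def similar_words(tokens, target, top=5):
    """Rank words by how many (left, right) neighbor contexts they share with `target`."""
    contexts = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((left, right))
    target_contexts = contexts.get(target, set())
    shared = Counter()
    for word, word_contexts in contexts.items():
        if word == target:
            continue
        overlap = len(word_contexts & target_contexts)
        if overlap:
            shared[word] = overlap
    return [word for word, _ in shared.most_common(top)]

tokens = ("the man sat . the woman sat . the man ran . "
          "the woman ran . a dog sat .").split()
print(similar_words(tokens, "woman"))
```

In the toy corpus, "man" and "woman" both occur between "the" and "sat" and between "the" and "ran", so they share two contexts, while "dog" shares none.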

## Statistical collocations
Which words occur unexpectedly often next to each other?
 - Simple **expected frequency** of word bigrams: Each word lies in an urn as often as it occurs in the corpus. Randomly draw two words, one after the other, from the urn.
 - **Empirical frequency** of word bigrams: Create the probability distribution of all word bigrams that actually occur in the corpus.
 - If the empirical frequency deviates strongly from the expected frequency, the bigram is a [statistical collocation](https://en.wikipedia.org/wiki/Collocation).
 
The `collocations()` method is currently buggy; use `collocation_list()` instead.

In [None]:
print(text1.collocation_list())
print()
print(text2.collocation_list())
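
The comparison between expected and empirical bigram frequency can be sketched directly with counters. This is a simplified ratio score for illustration only; the association measure NLTK uses internally may differ, and `collocation_scores`, `min_count`, and the toy corpus are assumptions.

```python
from collections import Counter

def collocation_scores(tokens, min_count=2):
    """Score each bigram by empirical / expected frequency (higher = more surprising)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # very rare bigrams give unreliable ratios
        # Urn model: draw the two words independently of each other
        expected = (unigrams[w1] / n) * (unigrams[w2] / n)
        # Share of this bigram among all adjacent pairs actually observed
        empirical = count / (n - 1)
        scores[(w1, w2)] = empirical / expected
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

tokens = ("new york is big . the city is big . new york is far . "
          "the road is far . the sea is big .").split()
for bigram, score in collocation_scores(tokens):
    print(bigram, round(score, 2))
```

In this toy corpus "new" and "york" only ever occur together, so their bigram receives the highest ratio, while pairs built from frequent words such as "is" score lower.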

## Dispersion plots
Where along a timeline, and how often, do words appear? The corpus consists of American presidential speeches (not yet including Trump).
 - The timeline is implicit in the chronological order of the speeches.

In [None]:
text4.dispersion_plot(["freedom","war"])
text4.dispersion_plot(["economy","war","digital","slavery"])
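
A dispersion plot essentially marks the token positions at which each word occurs. The sketch below collects those positions and prints a crude text version of the plot; `dispersion_offsets`, `render`, and the sample sentence are hypothetical names and data chosen for illustration.

```python
def dispersion_offsets(tokens, words):
    """Map each word of interest to the token positions where it occurs."""
    offsets = {word: [] for word in words}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

def render(offsets, length, columns=40):
    """Print one text row per word, with '|' marking the relative positions."""
    for word, positions in offsets.items():
        cells = [" "] * columns
        for p in positions:
            cells[p * columns // length] = "|"
        print(f"{word:>10} {''.join(cells)}")

tokens = "war and peace and war and then freedom after the war".split()
offsets = dispersion_offsets(tokens, ["war", "freedom"])
render(offsets, len(tokens))
```

Because the tokens are stored in document order, the horizontal axis doubles as a timeline whenever the documents themselves are ordered chronologically.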

## Text generation
Do you have to give a political speech? Let the speeches of past US presidents inspire you.
 - Statistical generation of texts, e.g. from word trigram statistics, reproduces typical word combinations from a corpus. (Currently broken in NLTK 3.)
 - Simple bigram alternative (https://github.com/crisbal/Markov-Chain-Random-Text-Generator):
  - For each word in the corpus, compile the list of all words that follow it.
  - Start with a random word from the corpus.
  - Then randomly select the next word from the successor list, sampled according to observed probability. The more often a successor occurs, the greater the chance that it will be chosen.
  - End the sentence when the chosen token is sentence-final punctuation.

In [None]:
! timeout 1 python3 markov.py -w <all.txt > generated.txt 
! cat generated.txt
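
The bigram procedure described in the list above can be sketched in a few lines of standard-library Python. This is not the code of the linked `markov.py`; the function names, the `max_words` cutoff, and the toy corpus are assumptions for illustration.

```python
import random
from collections import defaultdict

def build_successors(tokens):
    """For each word, list every word that follows it (duplicates are kept)."""
    successors = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        successors[current].append(following)
    return successors

def generate(tokens, rng, max_words=30, enders=(".", "!", "?")):
    """Generate one sentence by walking the successor lists at random."""
    successors = build_successors(tokens)
    word = rng.choice(tokens)  # random start word from the corpus
    sentence = [word]
    while word not in enders and len(sentence) < max_words:
        candidates = successors.get(word)
        if not candidates:  # a word seen only at the very end has no successor
            break
        # Duplicates in the list make frequent successors proportionally likelier
        word = rng.choice(candidates)
        sentence.append(word)
    return " ".join(sentence)

corpus = ("we the people seek peace . we seek freedom . "
          "the people want peace .").split()
print(generate(corpus, random.Random(0)))
```

Keeping duplicates in the successor lists is a simple way to sample by observed frequency without computing explicit probabilities.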

More sophisticated text generation uses recurrent neural networks, which can take somewhat more of the preceding text into account when proposing the next word: https://cyborg.tenso.rs
 - Recommended: Language model of (re-)tweets by/with Donald Trump (e.g. start with "America")
 - Start with "I love" and select different training corpora (e.g. Linux :-)

<h2>List comprehension</h2>

In [30]:
V = set(text2)
long_words = [w for w in V if len(w)>15]
long_words

['companionableness',
 'disqualifications',
 'incomprehensible',
 'disinterestedness']
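
For comparison, the sketch below writes the same filter once as a comprehension and once as an explicit loop. The small word set is an illustrative stand-in for the vocabulary of a real text.

```python
# Hypothetical mini vocabulary standing in for set(text2)
words = {"sense", "sensibility", "disinterestedness", "incomprehensible"}

# Comprehension: filter and collect in a single expression
long_words = sorted(w for w in words if len(w) > 15)

# Equivalent explicit loop
long_words_loop = []
for w in words:
    if len(w) > 15:
        long_words_loop.append(w)

print(long_words)
print(sorted(long_words_loop))
```

Sets are unordered, so sorting the result gives a stable order that does not change between runs.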

In [31]:
help(set)

Help on class set in module builtins:

class set(object)
 |  set() -> new empty set object
 |  set(iterable) -> new set object
 |  
 |  Build an unordered collection of unique elements.
 |  
 |  Methods defined here:
 |  
 |  __and__(self, value, /)
 |      Return self&value.
 |  
 |  __contains__(...)
 |      x.__contains__(y) <==> y in x.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iand__(self, value, /)
 |      Return self&=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __ior__(self, value, /)
 |      Return self|=value.
 |  
 |  __isub__(self, value, /)
 |      Return self-=value.
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __ixor__(self, value, /)
 |      Re