# NLTK 1: Interactive exploration of corpora
Learning goals:
 - Understand how useful raw text corpora can be
 - Understand what quantitative and distributional methods from corpus linguistics can tell us about language
 - Know how list comprehensions help with quick, interactive exploration of corpora

## Loading the NLTK Interactive Demo
This code is meant for interactive exploration: most methods print their results rather than return values to compute with.

In [None]:
from nltk.book import *
texts()

## Texts are sequences of tokens

In [None]:
text1[0:10]

## Create concordances
KWIC (Keyword in Context)

In [None]:
text1.concordance("man", lines=10, width=68)
print()
text2.concordance("man",lines=10, width=68)

In [None]:
text1.concordance("woman", lines=10, width=68)
print()
text2.concordance("woman",lines=10, width=68)
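
To make the idea behind a KWIC display concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation; the function name `kwic`, the `width` parameter (counted in tokens, not characters) and the toy token list are assumptions made for illustration.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, hit, right context) for each occurrence of `keyword`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            # Take up to `width` tokens on each side of the hit
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Toy token list (an assumption for illustration, not one of the NLTK texts)
toy_tokens = "the old man saw a young man and the man smiled".split()
for left, kw, right in kwic(toy_tokens, "man", width=2):
    # Right-align the left context so the keyword lines up in one column
    print(f"{left:>12} [{kw}] {right}")
```

Aligning the keyword in one column is what makes recurring left and right neighbours easy to spot by eye.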

## Word frequencies in a corpus
Let's compute relative frequencies...

In [None]:
text1.count('love')/len(text1)

In [None]:
text2.count('love')/len(text2) 
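
Dividing by the text length is what makes counts comparable across texts of different sizes. The sketch below shows the same computation on two toy token lists; the helper `relative_frequency` and both lists are hypothetical and only serve as an illustration.

```python
def relative_frequency(tokens, word):
    """Share of tokens equal to `word` (exact, case-sensitive match)."""
    return tokens.count(word) / len(tokens)

# Toy token lists (assumptions for illustration)
toy_a = "I love tea and I love coffee".split()                   # 7 tokens, 2 hits
toy_b = "we love long walks on the quiet beach at dawn".split()  # 10 tokens, 1 hit

print(relative_frequency(toy_a, "love"))
print(relative_frequency(toy_b, "love"))
```

Because the match is exact, `"Love"` and `"love"` are counted as different tokens unless the text is lowercased first.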

## Frequency distributions
Calculate the frequency of every distinct token (= type) in a text.
For larger text corpora, these frequencies should follow [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law).

In [None]:
# Count how often each type occurs in text1
fdist = FreqDist(text1)
# Print the 20 most frequent types with their counts
for w, count in fdist.most_common(20):
    print(w, "\t\t", count)
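
Zipf's law says, roughly, that a type's frequency is inversely proportional to its rank, so rank times frequency stays in the same range. The following sketch uses only the standard library; the helper `rank_frequency` is hypothetical, and the toy token list is hand-built so that the pattern is visible, which real corpora show only approximately.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, type, frequency) triples, most frequent type first."""
    counts = Counter(tokens)
    return [(rank, w, f) for rank, (w, f) in enumerate(counts.most_common(), start=1)]

# Hand-built toy data (an assumption): frequencies 6, 3, 2, 1
toy_tokens = "a a a a a a b b b c c d".split()
for rank, w, f in rank_frequency(toy_tokens):
    # The last column is rank * frequency, which Zipf's law expects to be roughly constant
    print(rank, w, f, rank * f)
```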


### Printing a plot
Make sure that the plot object is rendered by Jupyter

In [None]:
%matplotlib inline  
fdist.plot(20,cumulative=True)

## Similarity
Distributional similarity 
- "You shall know a word by the company it keeps!" (J. R. Firth, 1957)
- "words that occur in the same contexts tend to have similar meanings" (Pantel, 2005)

Which words appear in similar contexts?

In [None]:
text1.similar("woman")
print()
text2.similar("woman")

In [None]:
text1.similar("love")
print()
text2.similar("love")
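
One simple way to operationalise "similar contexts" is to treat the pair (left neighbour, right neighbour) as a word's context and to rank other words by how many such contexts they share with the target. The sketch below follows that idea in plain Python; it is not NLTK's exact scoring, and the function names and toy sentence are assumptions.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each word to the set of (left neighbour, right neighbour) pairs it occurs in."""
    ctx = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        ctx[word].add((left, right))
    return ctx

def similar_words(tokens, target):
    """Rank other words by the number of contexts they share with `target`."""
    ctx = contexts_by_word(tokens)
    target_ctx = ctx.get(target, set())
    scores = {w: len(c & target_ctx) for w, c in ctx.items() if w != target}
    # Keep only words with at least one shared context; break ties alphabetically
    return sorted((w for w, s in scores.items() if s > 0), key=lambda w: (-scores[w], w))

# Toy token list (an assumption for illustration)
toy_tokens = "the man sat . the woman sat . the dog ran . a woman ran".split()
print(similar_words(toy_tokens, "woman"))
```

Here "man" is returned because it shares the context (the, sat) with "woman", while "dog" only occurs in (the, ran) and shares nothing.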

## Statistical collocations
Which words occur unexpectedly often next to each other?
 - Simple **expected frequency** of word bigrams: each word lies in an urn as often as it occurs in the corpus. Randomly draw two words, one after the other, from the urn.
 - **Empirical frequency** of word bigrams: build the probability distribution of all word bigrams that actually occur in the corpus.
 - If the empirical frequency deviates strongly from the expected frequency, we have a [statistical collocation](https://en.wikipedia.org/wiki/Collocation).
 
The `collocations()` method is currently buggy; use `collocation_list()` instead.

In [None]:
print(text1.collocation_list())
print()
print(text2.collocation_list())
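
The comparison of expected and empirical bigram frequencies can be sketched in a few lines of plain Python. This is not the association measure NLTK uses; the function `bigram_ratios`, its `min_count` filter and the toy token list are assumptions. The filter is there because a raw observed/expected ratio overrates pairs of rare words.

```python
from collections import Counter

def bigram_ratios(tokens, min_count=2):
    """Observed / expected count for each bigram seen at least `min_count` times."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_bigrams = n - 1  # number of adjacent word pairs in the sequence
    ratios = {}
    for (w1, w2), observed in bigrams.items():
        if observed < min_count:
            continue
        # Urn model: two independent draws, each word with probability count / n
        expected = n_bigrams * (unigrams[w1] / n) * (unigrams[w2] / n)
        ratios[(w1, w2)] = observed / expected
    return ratios

# Toy token list (an assumption for illustration)
toy_tokens = "new york is big and new york is old and the big old town is new".split()
for pair, ratio in sorted(bigram_ratios(toy_tokens).items(), key=lambda kv: -kv[1]):
    print(pair, round(ratio, 2))
```

A ratio well above 1 means the pair occurs more often than the urn model predicts, which is the signal a collocation finder looks for.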

## Dispersion plots
Where in a text does a word occur, and how often? Applied to the U.S. Inaugural Addresses, the position in the text acts as a timeline.
 - The timeline is implicit in the chronological order of the speeches.

In [None]:
text4.dispersion_plot(["freedom","war"])
text4.dispersion_plot(["economy","war","digital","slavery"])
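
The data behind such a plot is simply the list of token positions at which each word occurs. A minimal sketch, assuming a hypothetical helper `offsets` and a toy token list:

```python
def offsets(tokens, word):
    """Token positions at which `word` occurs (exact, case-sensitive match)."""
    return [i for i, tok in enumerate(tokens) if tok == word]

# Toy token list (an assumption for illustration)
toy_tokens = "war and peace and war again".split()
for word in ["war", "peace", "digital"]:
    # A word that never occurs yields an empty list, i.e. an empty row in the plot
    print(word, offsets(toy_tokens, word))
```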

## Text generation
Do you have to give a political speech? Take inspiration from the inaugural addresses of past U.S. presidents.
 - Texts can be generated statistically, e.g. from word trigram statistics: typical word combinations in a corpus are used to build a probability distribution for the next word to generate.


In [None]:
t = text4.generate(text_seed="Freedom".split(),length=40)
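
The trigram idea can be sketched without NLTK: record which words follow each pair of consecutive words, then repeatedly sample a follower of the last two generated words. The function names, the fixed random seed and the toy token list below are assumptions, and the sketch has no smoothing, so it stops when it reaches a pair it has never seen.

```python
import random
from collections import defaultdict

def build_trigram_model(tokens):
    """Map each pair of consecutive words to the list of words observed after it."""
    model = defaultdict(list)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        model[(w1, w2)].append(w3)
    return model

def generate(model, seed_pair, length=10, rng=None):
    """Extend `seed_pair` word by word, sampling followers in proportion to their counts."""
    rng = rng or random.Random(0)  # fixed seed so repeated runs give the same text
    words = list(seed_pair)
    while len(words) < length:
        followers = model.get(tuple(words[-2:]))
        if not followers:  # unseen context: stop early
            break
        words.append(rng.choice(followers))
    return words

# Toy token list (an assumption for illustration)
toy_tokens = "we the people of the united states and we the people of the world".split()
toy_model = build_trigram_model(toy_tokens)
print(" ".join(generate(toy_model, ("we", "the"), length=8)))
```

Storing followers as a list with repeats means `rng.choice` automatically samples each follower in proportion to how often it was observed.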

More sophisticated text generation uses recurrent neural networks, which can take more of the preceding text into account when proposing the next word: https://cyborg.tenso.rs
 - Recommended: Language model of (re-)tweets by/with Donald Trump (e.g. start with "America")
 - Start with "I love" and select different training corpora (e.g. Linux :-)

Transformer-based text generation, currently the best-performing method:
 - [Automatic Text Generation is](https://transformer.huggingface.co/doc/arxiv-nlp/SBXtFhUnmNIAtyDMpRNGxuIl/edit)

# List comprehension

In [None]:
Vocabulary = set(text2)
long_words = [w for w in Vocabulary if len(w)>15]
long_words
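
A list comprehension is a compact way of writing a filter loop. The sketch below shows both forms side by side on a small hand-picked vocabulary; the set `toy_vocabulary` is an assumption for illustration.

```python
# Toy vocabulary (an assumption for illustration)
toy_vocabulary = {"sense", "sensibility", "incomprehensible", "disinterestedness"}

# Explicit loop: collect every word longer than 15 characters
long_words_loop = []
for w in toy_vocabulary:
    if len(w) > 15:
        long_words_loop.append(w)

# Equivalent list comprehension: same filter in one expression
long_words = [w for w in toy_vocabulary if len(w) > 15]

# Sets are unordered, so sort before printing to get a stable result
print(sorted(long_words))
```

The comprehension reads as "each `w` from the vocabulary, if it is long enough", which makes it well suited to one-line queries in an interactive session.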

In [None]:
help(set)