# Exploring a POS-Tagged and Text-Type-Balanced Corpus using Conditional Frequency Distributions
Learning goals:
 - Get to know the Brown Corpus, a pioneering electronic corpus from 1961, the era of punch cards (see http://en.wikipedia.org/wiki/Brown_Corpus)
 - Understand how to represent tagged corpora in a structured format
 - Understand how powerful indexing in structured representations is
 - Understand how conditional frequency distributions allow quick comparative evaluations in corpus linguistics

What does textual representation look like in the original files?

In [None]:
! head ~/nltk_data/corpora/brown/ca22

What is a reasonable data structure for POS tagged corpora?

In [None]:
from nltk.corpus import brown
brown_tagged_words_humor = brown.tagged_words(categories='humor')
brown_tagged_words_humor[:20]

## Corpus as sequence of tuples (word, POS tag)

In [None]:
brown_tagged_words_humor

What does the following expression calculate?

In [None]:
brown_tagged_words_humor[0][1][0]
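
To see how chained indexing works on this kind of structure, here is a minimal sketch on a hand-made list of (word, tag) tuples. The three tokens and their tags are invented for illustration and are not read from the corpus.

```python
# Toy stand-in for a tagged corpus: a list of (word, tag) tuples.
# These tokens are hypothetical examples, not actual corpus output.
toy_tagged = [("The", "AT"), ("dog", "NN"), ("barked", "VBD")]

first_token = toy_tagged[0]      # ("The", "AT"): the first (word, tag) tuple
first_tag = first_token[1]       # "AT": the tag is the second tuple element
first_tag_char = first_tag[0]    # "A": the first character of the tag string

# Chained indexing performs the same three steps in one expression.
assert toy_tagged[0][1][0] == first_tag_char
```

Each pair of brackets selects one level deeper: first a token, then a field of that token, then a character of that field.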

## The balanced corpus contains texts from 15 categories.

In [None]:
brown.categories()

## Show the path of every 25th file among the 500 files in the corpus

In [None]:
print(brown.root)
for f in brown.fileids()[1:500:25]:
    print(brown.abspath(f), f)

## Bivariate frequency distributions 
 - A separate frequency distribution of one variable (the sample) is computed for each value of another variable (the condition)
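
Conceptually, such a distribution is a mapping from each condition to its own counter of samples. The following is a minimal standard-library sketch of that idea on made-up (genre, word) pairs. It is a stand-in for illustration, not NLTK's implementation, and the data is invented.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, sample) pairs; not taken from the corpus.
pairs = [
    ("news", "the"), ("news", "election"), ("news", "the"),
    ("romance", "the"), ("romance", "love"),
]

# One Counter per condition, created on first access.
toy_cfd = defaultdict(Counter)
for condition, sample in pairs:
    toy_cfd[condition][sample] += 1

news_the = toy_cfd["news"]["the"]          # count of "the" under "news"
romance_the = toy_cfd["romance"]["the"]    # count of "the" under "romance"
print(news_the, romance_the)
```

Looking up a condition first and a sample second is the same access pattern as in the `cfd[genre][word]` expressions used in this notebook.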

## Compute a bivariate frequency distribution of words separately for each genre

In [None]:
import nltk
from nltk.corpus import brown


cfd = nltk.ConditionalFreqDist([
    (genre, word)
    for genre in brown.categories()
      for word in brown.words(categories=genre)])
type(cfd)

How many times does the word "woman" appear in each of the two genres "romance" and "religion"?

In [None]:
cfd["romance"]["woman"], cfd["religion"]["woman"]

Display a table of counts for a selection of words in each genre

In [None]:
genres = ['news', 'religion', 'hobbies',
         'science_fiction', 'romance', 'humor']

# Words to compare across genres (these are content words, not modal verbs)
words = ['god', 'good', 'bad', 'man', 'woman']
cfd.tabulate(conditions=genres, samples=words)

A bivariate frequency distribution contains one univariate frequency distribution for each condition

In [None]:
type(cfd['religion'])

In [None]:
cfd['religion'].N(), cfd['news'].N()

Quiz question: Is the relative frequency of the word "woman" greater in the category "news" than in the category "religion"?

In [None]:
cfd['news'].freq('woman')

In [None]:
cfd['religion'].freq('woman')

In [None]:
cfd['news'].freq('woman') > cfd['religion'].freq('woman')
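
The comparison above uses relative rather than absolute frequencies because the genres differ in size. As a minimal sketch with invented counts (the numbers are hypothetical, not corpus results), relative frequency is the count of a sample divided by the total number of counted tokens in that condition. The helper `relative_freq` is an illustrative name, not an NLTK function.

```python
from collections import Counter

# Hypothetical token counts for two genres of different size.
genre_a = Counter({"woman": 2, "man": 6, "the": 92})   # 100 tokens in total
genre_b = Counter({"woman": 1, "man": 4, "the": 15})   # 20 tokens in total

def relative_freq(dist, sample):
    """Return count(sample) / total tokens, or 0.0 for an empty distribution."""
    total = sum(dist.values())
    return dist[sample] / total if total else 0.0

freq_a = relative_freq(genre_a, "woman")   # 2 / 100
freq_b = relative_freq(genre_b, "woman")   # 1 / 20

# genre_a has the higher absolute count, but genre_b the higher relative frequency.
print(genre_a["woman"] > genre_b["woman"], freq_a > freq_b)
```

This is why a raw count comparison and a relative frequency comparison can point in opposite directions.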

Is the word "man" more likely to occur in the category "news" than in the category "religion"?

In [None]:
cfd['news'].freq('man') > cfd['religion'].freq('man')

Your turn: Test a word distribution hypothesis...

# Task: What would a conditional frequency distribution look like that counts, for each word, the distribution of its POS tags?
In other words, how can we build a tagger lexicon in one expression? Can you complete the following code?

In [None]:

tagging_lexicon = nltk.ConditionalFreqDist([
    #your code here
])

In [None]:
tagging_lexicon["can"]
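
To illustrate the kind of result to expect, here is a minimal standard-library sketch on invented tagged tokens, with the word as the condition and the tag as the sample. It does not use the corpus or NLTK, and the toy data is hypothetical. The exercise above still asks for the equivalent single `nltk.ConditionalFreqDist` expression.

```python
from collections import Counter, defaultdict

# Hypothetical tagged tokens: "can" occurs as a modal (MD) and as a noun (NN).
toy_tagged = [
    ("I", "PPSS"), ("can", "MD"), ("open", "VB"), ("the", "AT"), ("can", "NN"),
    ("you", "PPSS"), ("can", "MD"), ("too", "RB"),
]

# Condition = word, sample = tag: one tag counter per word.
toy_lexicon = defaultdict(Counter)
for word, tag in toy_tagged:
    toy_lexicon[word][tag] += 1

can_tags = toy_lexicon["can"]
best_tag = can_tags.most_common(1)[0][0]   # most frequent tag for "can"
print(dict(can_tags), best_tag)
```

A lexicon of this shape can back a simple baseline tagger that assigns each known word its most frequent tag.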