# Brown Corpus: Exploring a POS-Tagged and Text-Type-Balanced Corpus using Conditional Frequency Distributions
Learning goals:
 - Get to know the Brown Corpus, a pioneering electronic corpus from 1961 (see http://en.wikipedia.org/wiki/Brown_Corpus)
 - Understand how to represent tagged corpora in a structured format
 - Understand how powerful indexing in structured representations is
 - Understand how conditional frequency distributions allow quick comparative evaluations in corpus linguistics

What does the textual representation look like in the original files?

In [None]:
! head ~/nltk_data/corpora/brown/ca22

What is a reasonable data structure for POS tagged corpora?

In [None]:
from nltk.corpus import brown
brown_tagged_words = brown.tagged_words(categories='humor')
brown_tagged_words[:20]
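To connect the raw file format with the list of tuples shown above, here is a minimal sketch of how a line of `word/tag` tokens could be turned into `(word, tag)` pairs. The sample line and the helper `parse_tagged_line` are illustrative assumptions, not NLTK's own reader, and nothing is read from disk.

```python
# Illustrative line in the Brown word/tag style (hypothetical input, not read from disk).
raw_line = "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd"


def parse_tagged_line(line):
    """Split a whitespace-separated 'word/tag' line into (word, tag) tuples."""
    pairs = []
    for token in line.split():
        # rpartition splits at the LAST slash, so a word containing '/' stays intact.
        word, _, tag = token.rpartition('/')
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged_line(raw_line)
print(tagged[:3])
```

Splitting at the last slash is a design choice for this sketch: it keeps tokens such as `1/2/cd` together as the word `1/2` with the tag `cd`.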

## Corpus as sequence of tuples (word, POS tag)

In [None]:
brown_tagged_words[:]

What does the following expression calculate?

In [None]:
brown_tagged_words[0][1][0]
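One way to check your answer is to unpack the chained indexing step by step on a small stand-in list. The data below is a hypothetical toy example, not the actual content of the corpus.

```python
# Toy stand-in for brown_tagged_words (hypothetical data).
toy_tagged_words = [('It', 'PPS'), ('was', 'BEDZ'), ('funny', 'JJ')]

first_pair = toy_tagged_words[0]   # first (word, tag) tuple in the sequence
first_tag = first_pair[1]          # second element of the tuple: the POS tag
first_tag_char = first_tag[0]      # first character of the tag string

# The chained expression yields the same value as the three separate steps.
assert toy_tagged_words[0][1][0] == first_tag_char
print(first_pair, first_tag, first_tag_char)
```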

## The balanced corpus contains texts from 15 categories.

In [None]:
brown.categories()

In [None]:
# Show the path of every 20th file of the 500 files in the corpus,
# starting at index 1 (the second file).

for f in brown.fileids()[1:500:20]:
    print(brown.abspath(f))

brown.root
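The slice `[1:500:20]` used above can be examined without the corpus. This sketch applies the same slice to a plain list of 500 positions, which is only a stand-in for the list returned by `brown.fileids()`.

```python
# Stand-in for the 500 positions in brown.fileids() (assumption: 500 entries).
indices = list(range(500))

# Same slice as above: start at index 1, stop before 500, step 20.
selected = indices[1:500:20]

print(len(selected))     # number of selected positions
print(selected[:3])      # first few positions
print(selected[-1])      # last selected position
```

The slice starts at index 1, so the very first file is skipped; `[::20]` would start at index 0 instead.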

## Bivariate frequency distributions

In [None]:
import nltk
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist([
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre)])


genres = ['news', 'religion', 'hobbies',
         'science_fiction', 'romance', 'humor']

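modals = ['can', 'could', 'may', 'might', 'must', 'will']
# A second sample set of content words, kept under its own name
# so that it does not overwrite the modals.
content_words = ['god', 'good', 'bad', 'man', 'woman']
cfd.tabulate(conditions=genres, samples=modals)
cfd.tabulate(conditions=genres, samples=content_words)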


A bivariate frequency distribution contains one univariate frequency distribution for each condition.

In [None]:
cfd['religion'].N()
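The idea of one frequency distribution per condition can be sketched with the standard library alone. This is a simplified illustration, not NLTK's implementation; the pairs and the helper `total` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, sample) pairs, analogous to (genre, word).
pairs = [
    ('news', 'can'), ('news', 'will'), ('news', 'can'),
    ('religion', 'may'), ('religion', 'can'),
]

# One Counter (a univariate distribution) per condition.
toy_cfd = defaultdict(Counter)
for condition, sample in pairs:
    toy_cfd[condition][sample] += 1


def total(dist):
    """Total number of counted samples, analogous to N()."""
    return sum(dist.values())


print(toy_cfd['news'])
print(total(toy_cfd['news']), total(toy_cfd['religion']))
```

Indexing by condition returns the distribution for that condition, and the total is simply the sum of its counts.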

Quiz question: Is the relative frequency of the word "can" greater in the category "news" than in the category "religion"?

In [None]:
cfd['news'].freq('can')

In [None]:
cfd['religion'].freq('can')

In [None]:
cfd['news'].freq('can') > cfd['religion'].freq('can')
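The comparison above uses relative frequencies rather than raw counts, because the categories differ in size. A minimal sketch of the computation (count divided by total), using invented counts and a hypothetical helper `relative_freq`:

```python
from collections import Counter


def relative_freq(dist, sample):
    """Count of sample divided by the total count; 0.0 for an empty distribution."""
    total = sum(dist.values())
    return dist[sample] / total if total else 0.0


# Invented counts for illustration only (not taken from the Brown Corpus).
news = Counter({'can': 4, 'will': 6, 'the': 190})      # 200 tokens in total
religion = Counter({'can': 3, 'will': 2, 'the': 45})   # 50 tokens in total

print(news['can'] > religion['can'])
print(relative_freq(news, 'can') > relative_freq(religion, 'can'))
```

In this toy data the raw count of "can" is higher in `news`, yet its relative frequency is lower, which is why the two comparisons can disagree.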

Is the word "could" relatively more frequent in the category "science_fiction" than in the category "news"?

In [None]:
cfd['science_fiction'].freq('could') > cfd['news'].freq('could')

Your turn: Test a word distribution hypothesis...