# Exploring a POS-Tagged and Text-Type-Balanced Corpus using Conditional Frequency Distributions
Learning goals:
 - Get to know the Brown Corpus, a pioneering electronic corpus from 1961, the punch-card era (see http://en.wikipedia.org/wiki/Brown_Corpus)
 - Understand how to represent tagged corpora in a structured format
 - Understand how powerful indexing in structured representations is
 - Understand how conditional frequency distributions allow quick comparative evaluations in corpus linguistics

What does the textual representation look like in the original files?

In [16]:
! head ~/nltk_data/corpora/brown/ca22



	Emory/np-tl University's/nn$-tl Board/nn-tl of/in-tl Trustees/nns-tl announced/vbd Friday/nr that/cs it/pps was/bedz prepared/vbn to/to accept/vb students/nns of/in any/dti race/nn as/ql soon/rb as/cs the/at state's/nn$ tax/nn laws/nns made/vbd such/abl a/at step/nn possible/jj ./.


	``/`` Emory/np-tl University's/nn$-tl charter/nn and/cc by-laws/nns have/hv never/rb required/vbn admission/nn or/cc rejection/nn of/in students/nns on/in the/at basis/nn of/in race/nn ''/'' ,/, board/nn chairman/nn Henry/np L./np Bowden/np stated/vbd ./.


	But/cc an/at official/jj statement/nn adopted/vbn by/in the/at 33-man/jj Emory/np board/nn at/in its/pp$ annual/jj meeting/nn Friday/nr noted/vbd that/cs state/nn taxing/vbg requirements/nns at/in present/nn are/ber a/at roadblock/nn to/in accepting/vbg Negroes/nps ./.



What is a reasonable data structure for POS tagged corpora?

In [17]:
from nltk.corpus import brown
brown_tagged_words_humor = brown.tagged_words(categories='humor')
brown_tagged_words_humor[:20]

[('It', 'PPS'),
 ('was', 'BEDZ'),
 ('among', 'IN'),
 ('these', 'DTS'),
 ('that', 'CS'),
 ('Hinkle', 'NP'),
 ('identified', 'VBD'),
 ('a', 'AT'),
 ('photograph', 'NN'),
 ('of', 'IN'),
 ('Barco', 'NP'),
 ('!', '.'),
 ('!', '.'),
 ('For', 'CS'),
 ('it', 'PPS'),
 ('seems', 'VBZ'),
 ('that', 'CS'),
 ('Barco', 'NP'),
 (',', ','),
 ('fancying', 'VBG')]

## Corpus as sequence of tuples (word, POS tag)

In [18]:
brown_tagged_words_humor

[('It', 'PPS'), ('was', 'BEDZ'), ('among', 'IN'), ...]
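
The raw files store each token as `word/tag`, while the corpus reader returns `(word, TAG)` tuples. The following is a minimal sketch of that conversion in plain Python, not NLTK's actual implementation. The helper name `parse_tagged_line` is hypothetical, and the uppercasing simply mirrors the tags seen in the output above. NLTK also provides `nltk.tag.str2tuple` for converting a single token.

```python
def parse_tagged_line(line):
    """Split a whitespace-separated line of word/tag tokens into (word, TAG) tuples."""
    pairs = []
    for token in line.split():
        # rpartition splits at the LAST slash, so a word that itself
        # contains "/" keeps its slash and only the tag is cut off.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag.upper()))
    return pairs


# Beginning of the first sentence of file ca22, as shown by `head` above.
raw = "Emory/np-tl University's/nn$-tl Board/nn-tl of/in-tl Trustees/nns-tl announced/vbd"
tagged = parse_tagged_line(raw)
print(tagged[:3])
```

Splitting at the last slash is a design choice: the tag never contains a slash in the lines shown here, whereas a word might.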

What does the following expression calculate?

In [19]:
brown_tagged_words_humor[0][1][0]

'P'
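
The chained index reads from left to right: first element of the list, second element of that tuple, first character of that string. A small sketch on a hand-copied slice of the output above (the variable names are chosen for illustration only):

```python
# First three (word, tag) tuples from the humor category, copied from above.
tagged_words = [("It", "PPS"), ("was", "BEDZ"), ("among", "IN")]

first_pair = tagged_words[0]   # the tuple ('It', 'PPS')
first_tag = first_pair[1]      # the tag string 'PPS'
first_char = first_tag[0]      # the first character of the tag

# The same structure makes it easy to pull out one "column" at a time.
tags = [tag for (word, tag) in tagged_words]
print(first_char, tags)
```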

## The balanced corpus contains texts from 15 categories.

In [20]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

## Show the path of every 25th file (starting with the second) of the 500 files in the corpus.

In [21]:
print(brown.root)
for f in brown.fileids()[1:500:25]:
    print(brown.abspath(f), f)

/Users/siclemat/nltk_data/corpora/brown
/Users/siclemat/nltk_data/corpora/brown/ca02 ca02
/Users/siclemat/nltk_data/corpora/brown/ca27 ca27
/Users/siclemat/nltk_data/corpora/brown/cb08 cb08
/Users/siclemat/nltk_data/corpora/brown/cc06 cc06
/Users/siclemat/nltk_data/corpora/brown/cd14 cd14
/Users/siclemat/nltk_data/corpora/brown/ce22 ce22
/Users/siclemat/nltk_data/corpora/brown/cf11 cf11
/Users/siclemat/nltk_data/corpora/brown/cf36 cf36
/Users/siclemat/nltk_data/corpora/brown/cg13 cg13
/Users/siclemat/nltk_data/corpora/brown/cg38 cg38
/Users/siclemat/nltk_data/corpora/brown/cg63 cg63
/Users/siclemat/nltk_data/corpora/brown/ch13 ch13
/Users/siclemat/nltk_data/corpora/brown/cj08 cj08
/Users/siclemat/nltk_data/corpora/brown/cj33 cj33
/Users/siclemat/nltk_data/corpora/brown/cj58 cj58
/Users/siclemat/nltk_data/corpora/brown/ck03 ck03
/Users/siclemat/nltk_data/corpora/brown/ck28 ck28
/Users/siclemat/nltk_data/corpora/brown/cl24 cl24
/Users/siclemat/nltk_data/corpora/brown/cn19 cn19
/Users/sic

## Bivariate (conditional) frequency distributions
 - A separate frequency distribution of one variable is computed for each value of another variable (the condition)

## Compute a bivariate frequency distribution of words separately for each genre

In [13]:
import nltk
from nltk.corpus import brown


cfd = nltk.ConditionalFreqDist([
    (genre, word)
    for genre in brown.categories()
      for word in brown.words(categories=genre)])
type(cfd)

nltk.probability.ConditionalFreqDist
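
Conceptually, a conditional frequency distribution maps each condition to its own counter of samples. The sketch below imitates that idea with the standard library only. It is an illustration under that assumption, not how NLTK implements `ConditionalFreqDist`, and the `(genre, word)` pairs are invented toy data.

```python
from collections import Counter, defaultdict

# Invented (condition, sample) pairs standing in for (genre, word).
pairs = [
    ("news", "the"), ("news", "board"), ("news", "the"),
    ("romance", "the"), ("romance", "woman"),
]

# One Counter per condition; missing conditions start out empty.
toy_cfd = defaultdict(Counter)
for condition, sample in pairs:
    toy_cfd[condition][sample] += 1

print(toy_cfd["news"]["the"], toy_cfd["romance"]["woman"])
```

Indexing twice, first by condition and then by sample, follows the same access pattern as `cfd[genre][word]`.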

How many times does the word "woman" appear in the two genres "romance" and "religion"?

In [None]:
cfd["romance"]["woman"], cfd["religion"]["woman"]

Display a table for a selection of words for each genre

In [None]:
genres = ['news', 'religion', 'hobbies',
         'science_fiction', 'romance', 'humor']

# These are content words, not modal verbs, so the list is named accordingly.
words = ['god','good','bad','man','woman']
cfd.tabulate(conditions=genres, samples=words)

Bivariate frequency distributions contain one univariate frequency distribution for each condition

In [None]:
type(cfd['religion'])

In [None]:
cfd['religion'].N(), cfd['news'].N()

Quiz question: Is the relative frequency of the word "woman" greater in the category "news" than in the category "religion"?

In [None]:
cfd['news'].freq('woman')

In [None]:
cfd['religion'].freq('woman')

In [None]:
cfd['news'].freq('woman') > cfd['religion'].freq('woman')
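
The comparison uses relative frequency, that is, the count of a word divided by the total number of tokens under that condition, because the genres differ in size. The sketch below shows, on invented counts, how raw counts and relative frequencies can point in opposite directions. The helper `relative_freq` is hypothetical and only mimics the idea behind `freq()`.

```python
from collections import Counter


def relative_freq(counter, sample):
    """Return count(sample) / total tokens, or 0.0 for an empty counter."""
    total = sum(counter.values())
    return counter[sample] / total if total else 0.0


# Invented toy counts: genre_b is three times as large as genre_a.
genre_a = Counter({"woman": 2, "the": 8})    # 10 tokens
genre_b = Counter({"woman": 3, "the": 27})   # 30 tokens

raw_says_b = genre_b["woman"] > genre_a["woman"]
rel_says_a = relative_freq(genre_a, "woman") > relative_freq(genre_b, "woman")
print(raw_says_b, rel_says_a)
```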

Is it more likely to find the word "man" in the category "news" than in the category "religion"?

In [None]:
cfd['news'].freq('man') > cfd['religion'].freq('man')

Your turn: Test a word distribution hypothesis...

## Task: What would a conditional frequency distribution look like that counts, for each word, the distribution of its POS tags?
In other words, how can we build a tagger lexicon in one expression? Can you complete the following code?

In [None]:

tagging_lexicon = nltk.ConditionalFreqDist([
    #your code here
])

In [None]:
tagging_lexicon["can"]
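
To see what a lookup in such a lexicon returns, here is a toy stand-in built with the standard library on invented data. It is one possible reading of the task, not the notebook's intended solution, and the tagged pairs and their counts are made up for illustration.

```python
from collections import Counter, defaultdict

# Invented (word, tag) pairs; the counts do not come from the corpus.
toy_tagged = [
    ("can", "MD"), ("can", "MD"), ("can", "NN"),
    ("the", "AT"), ("run", "VB"), ("run", "NN"),
]

# The word is the condition, the tag is the sample.
toy_lexicon = defaultdict(Counter)
for word, tag in toy_tagged:
    toy_lexicon[word][tag] += 1

# The most frequent tag of a word is a simple baseline tagging decision.
best_tag = toy_lexicon["can"].most_common(1)[0][0]
print(dict(toy_lexicon["can"]), best_tag)
```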