# Assignment 1

This exercise is meant to help you get familiar with some language data. 

The corpus is the **Penn Treebank**, a collection of data from the newspaper 
The Wall Street Journal. The exercise takes no more than a few lines of code; the idea is to examine the data and notice interesting properties.

## Required software

### Easy installation

The easiest way to get the required software is to install Anaconda (see https://www.continuum.io/downloads). It contains all required packages, including Python and Jupyter. You can choose Python 2.7 or 3.5.

### Manual installation

Make sure that you have `numpy` and `matplotlib` installed. If you don't, you can use e.g. `pip install <package> --user` (python2) or `pip3 install <package> --user` (python3).

## Submission
We will grade your assignments with pass/fail/good, so don't forget to hand them in! 
Choose `File->Download as->HTML` and check that the HTML file contains all your answers. You can work and submit in pairs. Please e-mail your TA with `"[NLP1] Assignment 1"` as the subject. 
**The deadline for submission is Sunday 6 Nov 23:59.**

## Start the notebook

Start a terminal, and `cd` into the directory where you saved the notebook. Then type `jupyter notebook`. Your web browser will open.

## Exercise 1.1

You are provided with a corpus containing words and their part-of-speech (POS) tags in the format
**word|POS**, one sentence per line (file name: **sec02-21.gold.tagged**). This data is extracted from Sections 02-21 of the Penn Treebank; these sections are most commonly used for training statistical models (like POS taggers and parsers).

(a) What is the total number of words (tokens) in this corpus? 
What is the number of distinct word types?


In [3]:
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt

from collections import Counter

In [19]:
n_sentences = 0
words = []
pos = []

# Each line holds one sentence; each token has the form word|POS.
with open('sec02-21.gold.tagged', 'r') as f:
    for sentence in f:
        n_sentences += 1
        for token in sentence.split():
            parts = token.split('|')
            words.append(parts[0])
            pos.append(parts[1])

distinct_words = set(words)
distinct_pos = set(pos)

# Single-argument print calls behave the same in Python 2.7 and 3.5.
print("Number of sentences:  %d" % n_sentences)
print("Number of words:  %d" % len(words))
print("Number of distinct words:  %d" % len(distinct_words))
print("Number of distinct pos:  %d" % len(distinct_pos))


Number of sentences:  39604
Number of words:  929552
Number of distinct words:  44210
Number of distinct pos:  48


(b) Plot a graph of word frequency versus rank of a word, in this corpus. Does this corpus obey Zipf’s law?

In [20]:
corpus_words = Counter(words)
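
The cell above only builds the counts. One possible way to check Zipf's law is to sort the frequencies in descending order and plot them against their rank on log-log axes, where Zipf's law predicts a roughly straight line with slope near -1. The following is a minimal sketch under stated assumptions: `rank_frequency` and `plot_zipf` are hypothetical helper names, and a toy token list stands in for `words`.

```python
from collections import Counter


def rank_frequency(counts):
    """Return (ranks, frequencies), with frequencies sorted in descending order."""
    frequencies = sorted(counts.values(), reverse=True)
    ranks = list(range(1, len(frequencies) + 1))
    return ranks, frequencies


def plot_zipf(counts):
    """Plot frequency against rank on log-log axes (requires matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so rank_frequency works without it

    ranks, frequencies = rank_frequency(counts)
    plt.loglog(ranks, frequencies, marker='.', linestyle='none')
    plt.xlabel('rank')
    plt.ylabel('frequency')
    plt.title('Word frequency versus rank')
    plt.show()


# Toy tokens standing in for the `words` list built from the corpus.
toy_words = ['the', 'the', 'the', 'the', 'of', 'of', 'a', 'market']
ranks, frequencies = rank_frequency(Counter(toy_words))
print(ranks)        # [1, 2, 3, 4]
print(frequencies)  # [4, 2, 1, 1]
```

In the notebook, calling `plot_zipf(corpus_words)` would draw the curve for the full corpus.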

(c) What is the 25th most common word in the corpus? How many times does it occur? What about the 50th most common word, the 100th and the 1000th?

In [21]:
# Rank the words once; index k - 1 holds the k-th most common (word, count) pair.
ranked = corpus_words.most_common(100)
most_common_25 = ranked[24]
most_common_50 = ranked[49]
most_common_100 = ranked[99]

# Single-argument print calls behave the same in Python 2.7 and 3.5.
print("The 25th most common word: \n%s" % (most_common_25,))
print("\nThe 50th most common word: \n%s" % (most_common_50,))
print("\nThe 100th most common word: \n%s" % (most_common_100,))

The 25th most common word: 
('Mr.', 4147)

The 50th most common word: 
('had', 1755)

The 100th most common word: 
('A', 860)
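
Part (c) also asks about the 1000th most common word, which the cell above does not compute. A small helper can look up any rank; the sketch below uses a hypothetical function name, `kth_most_common`, and toy counts rather than the corpus. Words with equal counts are ordered arbitrarily by `most_common`, so a rank that falls inside a tie is not unique.

```python
from collections import Counter


def kth_most_common(counts, k):
    """Return the (word, count) pair at rank k (1-based), or None if k is out of range."""
    if k < 1:
        return None
    ranked = counts.most_common(k)
    if len(ranked) < k:
        return None
    return ranked[-1]


# Toy counts standing in for `corpus_words`.
toy_counts = Counter(['the', 'the', 'the', 'of', 'of', 'a'])
print(kth_most_common(toy_counts, 2))  # ('of', 2)
print(kth_most_common(toy_counts, 5))  # None
```

With the corpus loaded, `kth_most_common(corpus_words, 1000)` would give the remaining answer.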


(d) How many different Part-of-speech tags are present in the corpus?

In [18]:
# A single-argument print call behaves the same in Python 2.7 and 3.5.
print("Number of distinct pos:  %d" % len(distinct_pos))

Number of distinct pos:  48


(e) Print a list of the 10 most common part-of-speech tags. Spend a few minutes trying to guess what each tag means, by looking at associated words.

In [23]:
corpus_pos = Counter(pos)

most_common_10 = corpus_pos.most_common(10)

print("The 10 most common POS tags: \n%s" % (most_common_10,))

The 10 most common POS tags: 
[('NN', 132134), ('IN', 99413), ('NNP', 90711), ('DT', 82147), ('JJ', 59643), ('NNS', 59332), (',', 48314), ('.', 39252), ('CD', 36148), ('RB', 30232)]


(f) Assume that the probability $P(w_1^n)$ of a sentence $w_1 \ldots w_n$   can be calculated as follows:

$$P(w_1^n) = P(w_1) \cdot P(w_2) \cdots P(w_n)$$

The probability of a word $w_i$ can be calculated from a corpus as

$$P(w_i) = \frac{\mathrm{count}(w_i)}{N}$$

where $N$ is the total number of word tokens in the corpus. 

What is the probability of the first two sentences in the corpus? 
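
A product of many small probabilities quickly underflows to zero in floating point, so it is common to sum log-probabilities instead. The sketch below illustrates this for the unigram model defined above; `sentence_log_prob` is a hypothetical helper name, and the toy token list stands in for the corpus.

```python
import math
from collections import Counter


def sentence_log_prob(sentence, counts, n_tokens):
    """Natural-log probability of a sentence under the unigram model P(w) = count(w) / N."""
    log_prob = 0.0
    for word in sentence:
        # Every word of a sentence taken from the corpus has count >= 1, so the log is defined.
        log_prob += math.log(counts[word] / float(n_tokens))
    return log_prob


# Toy corpus standing in for the `words` list; N = 8 tokens.
toy_words = ['the', 'cat', 'sat', 'the', 'dog', 'sat', 'the', 'end']
toy_counts = Counter(toy_words)
n_tokens = len(toy_words)

log_p = sentence_log_prob(['the', 'cat', 'sat'], toy_counts, n_tokens)
print(log_p)            # log(3/8) + log(1/8) + log(2/8)
print(math.exp(log_p))  # approximately 6/512
```

Because the `words` list is flat, the first two sentences would have to be kept as separate token lists while reading the file before they can be passed to such a function.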

(g) A word may have several part-of-speech tags; for example, the word 'record' can be a noun or a verb. How many words have more than one POS tag? What are the 10 most frequent combinations of POS tags?
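
One way to approach (g) is to collect the set of tags seen with each word type and then count how often each set occurs. The sketch below uses toy (word, tag) pairs in place of `zip(words, pos)`, and it assumes that "most frequent combination" means the tag set shared by the largest number of word types; counting by tokens would be an equally defensible reading.

```python
from collections import Counter, defaultdict

# Toy (word, tag) pairs standing in for zip(words, pos).
toy_pairs = [
    ('record', 'NN'), ('record', 'VB'), ('the', 'DT'),
    ('run', 'NN'), ('run', 'VB'), ('set', 'NN'),
    ('set', 'VB'), ('set', 'VBD'), ('the', 'DT'),
]

# Collect the set of tags observed for each word type.
tags_per_word = defaultdict(set)
for word, tag in toy_pairs:
    tags_per_word[word].add(tag)

# Keep only the word types seen with more than one tag.
ambiguous = {w: tags for w, tags in tags_per_word.items() if len(tags) > 1}

# Count how many word types share each tag combination;
# sorting the tags makes the key independent of the order in which tags were seen.
combinations = Counter(tuple(sorted(tags)) for tags in ambiguous.values())

print(len(ambiguous))                # 3
print(combinations.most_common(10))  # [(('NN', 'VB'), 2), (('NN', 'VB', 'VBD'), 1)]
```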

## Exercise 1.2

You are also provided with another file called **sec00.gold.tagged**. 
Section 00 of the Penn Treebank is typically used as development data.

(a) How many unseen words are present in the development data (i.e., words that have not occurred in the training data)?
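
Unseen words can be found with a set-membership test against the training vocabulary. The sketch below uses toy token lists in place of the two files and reports both readings of the question: unseen tokens (every occurrence) and unseen types (distinct words). The variable names are illustrative assumptions.

```python
# Toy token lists standing in for the training words (sec02-21) and development words (sec00).
train_words = ['the', 'market', 'fell', 'the', 'index', 'rose']
dev_words = ['the', 'yen', 'fell', 'the', 'yen', 'surged']

# A set gives fast membership tests.
train_vocab = set(train_words)

# Unseen tokens: every development occurrence of a word absent from the training data.
unseen_tokens = [w for w in dev_words if w not in train_vocab]
# Unseen types: the distinct words among those tokens.
unseen_types = set(unseen_tokens)

print(len(unseen_tokens))  # 3
print(len(unseen_types))   # 2
```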

(b) What are the three most common kinds of unseen words (by POS tag)?
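
For (b), the tags of the unseen words can be tallied with a `Counter`. The sketch below again uses toy data, and it counts unseen tokens rather than unseen types, which is an assumption about how the question is meant.

```python
from collections import Counter

# Toy training vocabulary and development (word, tag) pairs standing in for the two files.
train_vocab = set(['the', 'market', 'fell', 'index', 'rose'])
dev_pairs = [
    ('the', 'DT'), ('yen', 'NN'), ('fell', 'VBD'), ('Tokyo', 'NNP'),
    ('yen', 'NN'), ('surged', 'VBD'), ('Nikkei', 'NNP'), ('Osaka', 'NNP'),
]

# Tally the tag of every development token whose word is absent from the training vocabulary.
unseen_tags = Counter(tag for word, tag in dev_pairs if word not in train_vocab)

print(unseen_tags.most_common(3))  # [('NNP', 3), ('NN', 2), ('VBD', 1)]
```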