# Assignment 1

This exercise is meant to help you get familiar with some language data. 

The corpus used is the **Penn Treebank**, which is a collection of data from the newspaper 
The Wall Street Journal. The exercise will not take more than a few lines of code; the idea is to examine the data, and notice interesting properties.

## Required software

### Easy installation

The easiest way to get the required software is to install Anaconda. See https://www.continuum.io/downloads . It contains all required packages, including python and jupyter. You can choose python 2.7 or 3.5.

### Manual installation

Make sure that you have `numpy` and `matplotlib` installed. If you don't, you can use e.g. `pip install <package> --user` (python2) or `pip3 install <package> --user` (python3).

## Submission
We will grade your assignments with pass/fail/good, so don't forget to hand them in! 
Choose `File->Download as->HTML` and check if the HTML-file contains all your answers. You can work and submit in pairs. Please e-mail your TA with `"[NLP1] Assignment 1"` as the subject. 
**The deadline for submission is Sunday 6 nov 23:59.**

## Start the notebook

Start a terminal, and `cd` into the directory where you saved the notebook. Then type `jupyter notebook`. Your web browser will open.

## Exercise 1.1

You are provided with a corpus containing words and their part-of-speech tags in the format
**word|POS** (one sentence per line; file name: **sec02-21.gold.tagged**). This data is extracted from Sections 02-21 of the Penn Treebank: these sections are most commonly used for training statistical models (like POS-taggers and parsers).

(a) What is the total number of words (tokens) in this corpus? 
What is the number of distinct word types?


In [None]:
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt

from collections import Counter

In [None]:

n_sentences = 0
clear_sentences = []  # Untagged sentences, needed to compute probabilities per sentence
words = []
pos = []

with open('sec02-21.gold.tagged', 'r') as f:
    for sentence in f:
        tagged_tokens = sentence.split()
        if not tagged_tokens:
            continue  # Skip blank lines so they are not counted as sentences
        n_sentences += 1
        # Split on the last '|' in case a word itself contains '|'
        pairs = [token.rsplit('|', 1) for token in tagged_tokens]
        words += [pair[0] for pair in pairs]
        pos += [pair[1] for pair in pairs]
        # The file has one sentence per line, so each line yields one clean sentence
        clear_sentences.append(' '.join(pair[0] for pair in pairs))

print(clear_sentences[1])

distinct_words = set(words)
distinct_pos = set(pos)

print("Number of sentences: {}".format(n_sentences))
print("Number of words: {}".format(len(words)))
print("Number of distinct words: {}".format(len(distinct_words)))
print("Number of distinct pos: {}".format(len(distinct_pos)))


(b) Plot a graph of word frequency versus rank of a word, in this corpus. Does this corpus obey Zipf’s law?

In [None]:
corpus_words = Counter(words)

# Drawing one bar per word type is too slow and unreadable,
# so only the 80 most frequent words are plotted.
sorted_corpus_words = corpus_words.most_common(80)

tokens = [x[0] for x in sorted_corpus_words]
counts = [x[1] for x in sorted_corpus_words]

fig = plt.figure(figsize=(15,5))

# Bar height is the word frequency; bars are ordered by rank
plt.bar(range(len(sorted_corpus_words)), counts, width=0.3)

# Label each bar with its word, rotated by 90 degrees so the labels stay readable
plt.xticks(range(len(sorted_corpus_words)), tokens, rotation=90, fontsize=6)
plt.xlabel('Word (ordered by rank)')
plt.ylabel('Frequency')

plt.show()
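
A bar chart of the top 80 words only shows the head of the distribution. Zipf's law says that frequency is roughly proportional to 1/rank, so on a log-log scale the rank-frequency curve should be close to a straight line with slope near -1. The sketch below estimates that slope with an ordinary least-squares fit. The helper name `zipf_slope` and the toy counts are assumptions made for illustration; they are not part of the assignment.

```python
import math
from collections import Counter

def zipf_slope(counter):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(counter.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical toy counts that follow frequency = 1200 / rank exactly
toy_counts = Counter({'w%d' % rank: 1200 // rank for rank in (1, 2, 3, 4, 5, 6)})
slope = zipf_slope(toy_counts)
print(slope)  # close to -1 for perfectly Zipfian counts
```

On the real data, `zipf_slope(corpus_words)` gives a rough indication only, since the long tail of words that occur once pulls the fit away from the head of the curve. Plotting all ranks and frequencies with `plt.loglog` is a useful visual check alongside the number.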

(c) What is the 25th most common word in the corpus? How many times does it occur? What about the 50th most common word, the 100th and the 1000th?

In [None]:
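# most_common(n)[-1] is the (word, count) pair at rank n
most_common_25 = corpus_words.most_common(25)[-1]
most_common_50 = corpus_words.most_common(50)[-1]
most_common_100 = corpus_words.most_common(100)[-1]
most_common_1000 = corpus_words.most_common(1000)[-1]

print("The 25th most common word: \n{}".format(most_common_25))
print("\nThe 50th most common word: \n{}".format(most_common_50))
print("\nThe 100th most common word: \n{}".format(most_common_100))
print("\nThe 1000th most common word: \n{}".format(most_common_1000))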

(d) How many different Part-of-speech tags are present in the corpus?

In [None]:
print("Number of distinct pos: {}".format(len(distinct_pos)))

(e) Print a list of the 10 most common part-of-speech tags. Spend a few minutes trying to guess what each tag means, by looking at associated words.

In [None]:
corpus_pos = Counter(pos)

most_common_10 = corpus_pos.most_common(10)

print("The 10 most common part-of-speech tags: \n{}".format(most_common_10))

(f) Assume that the probability $P(w_1^n)$ of a sentence $w_1 \ldots w_n$   can be calculated as follows:

$$P(w_1^n) = P(w_1) \cdot P(w_2) \ldots P(w_n) $$

The probability of a word $w_i$ can be calculated from a corpus as 
$$P(w_i) = \frac{\mathrm{count}(w_i)}{N}$$
where $N$ is the total number of word tokens in the corpus. 

What is the probability of the first two sentences in the corpus? 

In [None]:
import operator
import functools

# Probability of a unigram w: count(w) / N, where N is the total number
# of word tokens (the sum of all counts, not the number of distinct words)
def get_word_probability(corpus, w):
    n_tokens = sum(corpus.values())
    return float(corpus[w]) / n_tokens

# Returns the probability of each word in the sentence and their product,
# which is the unigram probability of the whole sentence
def get_sentence_probability(corpus, sentence):
    probs_of_s = [get_word_probability(corpus, word) for word in sentence.split()]
    total_prob_s = functools.reduce(operator.mul, probs_of_s, 1)
    return probs_of_s, total_prob_s

print("For the word 'Inc.': {}".format(get_word_probability(corpus_words, 'Inc.')))

# Probability of the first two sentences in the corpus
for sentence in clear_sentences[:2]:
    probs_of_s, total = get_sentence_probability(corpus_words, sentence)
    print(sentence)
    print("The probabilities of all words in the sentence are: \n{}".format(probs_of_s))
    print("The total probability of the sentence is: {}\n".format(total))
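
Multiplying many small probabilities can underflow to zero for long sentences. A common workaround is to sum log-probabilities instead of multiplying probabilities. The sketch below shows this under the same unigram assumption; the function name `sentence_log_probability` and the toy corpus are hypothetical and only serve as an illustration.

```python
import math
from collections import Counter

def sentence_log_probability(counts, sentence):
    """Sum of log unigram probabilities; -inf if any word was never seen."""
    n_tokens = sum(counts.values())
    log_prob = 0.0
    for word in sentence.split():
        if counts[word] == 0:
            # An unseen word has probability 0, so the whole product is 0
            return float('-inf')
        log_prob += math.log(counts[word] / float(n_tokens))
    return log_prob

# Hypothetical toy corpus of 7 tokens
toy_counts = Counter("the cat sat on the mat .".split())
log_p = sentence_log_probability(toy_counts, "the cat .")
print(log_p)
print(math.exp(log_p))  # back to a plain probability: (2/7) * (1/7) * (1/7)
```

With the real data, `sentence_log_probability(corpus_words, clear_sentences[0])` would give the log of the value computed by `get_sentence_probability`.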

(g) A word may have several part-of-speech tags; for example, the word 'record' can be a noun or a verb. How many words have more than one POS tag? What are the 10 most frequent combinations of POS tags?
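
One possible approach is to collect the set of tags seen for each word type and then count how often each multi-tag set occurs. The sketch below does this on hypothetical toy data; the helper `tag_ambiguity` is an assumption for illustration. It counts each combination once per word type, which is only one reading of "most frequent"; weighting by token counts would be another.

```python
from collections import Counter, defaultdict

def tag_ambiguity(words, tags):
    """Return (number of word types with more than one tag, Counter of tag combinations)."""
    tags_per_word = defaultdict(set)
    for word, tag in zip(words, tags):
        tags_per_word[word].add(tag)
    # Sort each tag set so that {NN, VB} and {VB, NN} count as the same combination
    combos = Counter(
        tuple(sorted(tag_set))
        for tag_set in tags_per_word.values()
        if len(tag_set) > 1
    )
    return sum(combos.values()), combos

# Hypothetical toy data: parallel lists of words and tags
toy_words = ['record', 'the', 'record', 'play', 'play', 'a']
toy_tags = ['NN', 'DT', 'VB', 'NN', 'VB', 'DT']
n_ambiguous, combos = tag_ambiguity(toy_words, toy_tags)
print(n_ambiguous)
print(combos.most_common(10))
```

On the corpus, the call would be `tag_ambiguity(words, pos)`, using the parallel lists built in Exercise 1.1 (a).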

## Exercise 1.2

You are also provided with another file called **sec00.gold.tagged**. 
Section 00 of the Penn Treebank is typically used as development data.

(a) How many unseen words are present in the development data (i.e., words that have not occurred in the training data)?

(b) What are the three most common kinds of unseen words (by their POS tags)?
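
A minimal sketch of both questions, assuming the development file has been read into parallel lists of words and tags in the same way as the training file. The function `unseen_words` and the toy lists are hypothetical. The sketch reports unseen tokens and unseen types separately, since the question can be read either way.

```python
from collections import Counter

def unseen_words(train_words, dev_words, dev_tags):
    """Count dev words absent from training, and tally their POS tags."""
    train_vocab = set(train_words)
    unseen = [(w, t) for w, t in zip(dev_words, dev_tags) if w not in train_vocab]
    unseen_types = set(w for w, _ in unseen)
    tag_counts = Counter(t for _, t in unseen)
    return len(unseen), len(unseen_types), tag_counts

# Hypothetical toy data standing in for the training and development sets
toy_train = ['the', 'cat', 'sat']
toy_dev_words = ['the', 'dog', 'sat', 'Rover', 'dog']
toy_dev_tags = ['DT', 'NN', 'VBD', 'NNP', 'NN']

n_unseen_tokens, n_unseen_types, tag_counts = unseen_words(
    toy_train, toy_dev_words, toy_dev_tags)
print(n_unseen_tokens)
print(n_unseen_types)
print(tag_counts.most_common(3))
```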