# Assignment 2 - Elementary Probability and Information Theory 
# Boise State University NLP - Dr. Kennington

### Instructions and Hints:

* This notebook loads some data into a `pandas` dataframe, then does a small amount of preprocessing. Make sure your data can load by stepping through all of the cells up until question 1. 
* Most of the questions require you to write some code. In many cases, you will write some kind of probability function like we did in class using the data. 
* Some of the questions only require you to write answers, so be sure to change the cell type to markdown or raw text.
* Don't worry about normalizing the text this time (e.g., lowercase, etc.). Just focus on probabilities. 
* Most questions can be answered in a single cell, but you can make as many additional cells as you need. 
* Follow the instructions on the corresponding assignment Trello card for submitting your assignment. 

In [1]:
import pandas as pd 

data = pd.read_csv('pnp-train.txt',delimiter='\t',encoding='latin-1', # utf8 encoding didn't work for this
                  names=['type','name']) # supply the column names for the dataframe

# this next line creates a new column with the lower-cased first word
data['first_word'] = data['name'].map(lambda x: x.lower().split()[0])
data[:10]

| | type | name | first_word |
|---|---|---|---|
| 0 | drug | Dilotab | dilotab |
| 1 | movie | Beastie Boys: Live in Glasgow | beastie |
| 2 | person | Michelle Ford-Eriksson | michelle |
| 3 | place | Ramsbury | ramsbury |
| 4 | place | Market Bosworth | market |
| 5 | drug | Cyanide Antidote Package | cyanide |
| 6 | person | Bill Johnson | bill |
| 7 | place | Ettalong | ettalong |
| 8 | movie | The Suicide Club | the |
| 9 | place | Pézenas | pézenas |


In [2]:
data.describe()

| | type | name | first_word |
|---|---|---|---|
| count | 21001 | 21001 | 21001 |
| unique | 5 | 20992 | 13703 |
| top | movie | Iris | the |
| freq | 6262 | 2 | 635 |


## 1. Write a probability function/distribution $P(T)$ over the types. 

Hints:

* The Counter library might be useful: `from collections import Counter`
* Write a function `def P(T='')` that returns the probability of the specific value for T
* You can access the types from the dataframe by calling `data['type']`

In [3]:
print(data['type'].unique(),'\n',
data.groupby('type').count())

['drug' 'movie' 'person' 'place' 'company'] 
          name  first_word
type                     
company  2484        2484
drug     5030        5030
movie    6262        6262
person   3836        3836
place    3389        3389


In [4]:
from collections import Counter as ctr

word_type_ctr = ctr(data.type)

def P(T=''): 
    return word_type_ctr[T] / len(data)

## 2. What is `P(T='movie')` ?

In [5]:
print(P(T='movie'))

0.29817627732012764


## 3. Show that your probability distribution sums to one.

In [6]:
round(P(T='drug') + P(T='movie') + P(T='person') + P(T='place') + P(T='company'),2)

1.0
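
The check above names the five types by hand. As a minimal sketch, the same check can loop over whatever types the counter holds. To stay self-contained, the counts below are copied from the `groupby` output in cell 3 instead of being read from `pnp-train.txt`; `type_counts`, `total`, and `P_sketch` are illustrative names, not part of the assignment.

```python
from collections import Counter
import math

# Counts copied from the groupby output above (assumption: they match the loaded data).
type_counts = Counter({'company': 2484, 'drug': 5030, 'movie': 6262,
                       'person': 3836, 'place': 3389})
total = sum(type_counts.values())  # number of rows

def P_sketch(T=''):
    # Relative frequency of type T; a Counter returns 0 for unseen types.
    return type_counts[T] / total

# Sum over every type in the counter instead of naming each one.
total_probability = sum(P_sketch(T=t) for t in type_counts)
print(math.isclose(total_probability, 1.0))
```

Comparing with `math.isclose` avoids rounding to two decimals, which could hide a small error in the distribution.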

## 4. Write a joint distribution using the type and the first word of the name

Hints:

* The function is $P2(T,W_1)$
* You will need to count up types AND the first words, for example: ('person','bill')
* Using the [itertools.product](https://docs.python.org/2/library/itertools.html#itertools.product) function was useful for me here

In [7]:
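def P2(T='',W1=''):
    # rows where both the type and the first word match
    sub_df = data.loc[(data['type'] == T) & (data['first_word'] == W1)]
    # unseen (T, W1) pairs get a tiny floor instead of 0 (the floor is not renormalized)
    if sub_df.empty: return 1E-10
    return len(sub_df) / len(data)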
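
An alternative sketch counts every (type, first word) pair once with a `Counter`, so each lookup avoids filtering the dataframe. It uses only the ten rows shown in the preview of `data`, so its probabilities differ from the full-data answers; `pairs`, `pair_counts`, and `P2_sketch` are hypothetical names, and unseen pairs return 0 here, with no floor.

```python
from collections import Counter

# The ten (type, first_word) pairs from the preview of `data` above.
# Assumption: this small sample stands in for the full dataframe.
pairs = [('drug', 'dilotab'), ('movie', 'beastie'), ('person', 'michelle'),
         ('place', 'ramsbury'), ('place', 'market'), ('drug', 'cyanide'),
         ('person', 'bill'), ('place', 'ettalong'), ('movie', 'the'),
         ('place', 'pézenas')]

pair_counts = Counter(pairs)  # one pass over the data
n_rows = len(pairs)

def P2_sketch(T='', W1=''):
    # Joint relative frequency of the pair; unseen pairs give 0.
    return pair_counts[(T, W1)] / n_rows

# Summing over the observed pairs covers all of the probability mass.
joint_total = sum(P2_sketch(T=t, W1=w) for (t, w) in pair_counts)
print(P2_sketch(T='person', W1='bill'), joint_total)
```

With the real dataframe, the pairs could be built with `zip(data['type'], data['first_word'])`.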

## 5. What is P2(T='person', W1='bill')? What about P2(T='movie',W1='the')?

In [8]:
P2(T='person', W1='bill') ## should be 0.00047616780153326033

0.00047616780153326033

In [9]:
P2(T='movie', W1='the') ## should be 0.02747488214846912

0.02747488214846912

## 6. Show that your probability distribution P(T,W1) sums to one.

In [10]:
import numpy as np
def Ps(Words, Type=''):
    return np.sum([P2(T=Type,W1=word) for word in Words])

In [11]:
word_types = data.type.unique()

result = []

for word_type in word_types:
    # sum P2 over the first words that actually occur with this type
    sub_set_by_type = data[data.type == word_type]
    probability = Ps(Type=word_type, Words=sub_set_by_type.first_word.unique())
    result.append(probability)

print(result, "\n Total = ",sum(result)) ## should be 1.0

[0.23951240417122996, 0.2981762773201277, 0.18265796866815864, 0.16137326793962198, 0.11828008190086188] 
 Total =  1.0


## 7. Make a new function Q(T) from marginalizing over P(T,W1) and make sure that Q(T) sums to one.

Hints:

* Your Q function will call P(T,W1)
* Your check for the sum to one should be the same answer as Question 3, only it calls Q instead of P.

In [12]:
def Q(T=''):
    if T not in data.type.unique(): return 1E-10
    sub_set_by_type = data[data.type == T]
    probability = Ps(Type=T, Words=sub_set_by_type.first_word.unique())
    return probability

Q(T = 'company')

0.11828008190086188

In [13]:
Q(T = 'movie') ## should equal 0.2981762773201236

0.2981762773201277

In [14]:
Q(T = 'drug') + Q(T = 'movie') + Q(T = 'person') + Q(T = 'place') + Q(T = 'company')

1.0

## 8. What is the KL Divergence of your Q function and your P function for Question 1?

* Even if you know the answer, you still need to write code that computes it.

In [15]:
import math

(P('company') * math.log(P('company') / Q('company')) + 
 P('place') * math.log(P('place') / Q('place')) +
 P('person') * math.log(P('person') / Q('person'))+
 P('movie') * math.log(P('movie') / Q('movie'))+
 P('drug') * math.log(P('drug') / Q('drug')))

-1.1912125810141734e-16
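
The result is zero up to floating-point rounding; in exact arithmetic the KL divergence is never negative. As a hedged sketch, the same computation can be written as a function over two dictionaries. The P values come from the `groupby` counts and the Q values are copied from the Question 6 output; `kl_divergence` and the uniform comparison are illustrative additions, not part of the assignment.

```python
import math

# P(T): counts from the groupby output divided by the number of rows.
counts = {'drug': 5030, 'movie': 6262, 'person': 3836, 'place': 3389, 'company': 2484}
n_rows = sum(counts.values())
p = {t: c / n_rows for t, c in counts.items()}

# Q(T): per-type sums printed in Question 6 (order: drug, movie, person, place, company).
q = {'drug': 0.23951240417122996, 'movie': 0.2981762773201277,
     'person': 0.18265796866815864, 'place': 0.16137326793962198,
     'company': 0.11828008190086188}

def kl_divergence(p_dist, q_dist):
    # D_KL(P || Q) = sum over t of P(t) * log(P(t) / Q(t)); terms with P(t) == 0 are skipped.
    return sum(p_t * math.log(p_t / q_dist[t]) for t, p_t in p_dist.items() if p_t > 0)

kl_pq = kl_divergence(p, q)

# For contrast: divergence from a uniform distribution over the five types.
uniform = {t: 1 / len(p) for t in p}
kl_p_uniform = kl_divergence(p, uniform)
print(kl_pq, kl_p_uniform)
```

Because P and Q describe the same distribution, `kl_pq` is about zero, while the uniform comparison gives a clearly positive value.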

## 9. Convert from P(T,W1) to P(W1|T) 

Hints:

* Just write a comment cell, no code this time. 
* Note that $P(T,W1) = P(W1,T)$

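(try to use markdown math formatting, answer in this cell)

Start from the joint distribution $P(T,W_1)$. The order of the arguments does not matter:

$P(T,W_1) = P(W_1,T)$

By the product rule:

$P(W_1,T) = P(W_1|T) \cdot P(T)$

Dividing both sides by $P(T)$:

$P(W_1|T) = \frac{P(W_1,T)}{P(T)}$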

## 10. Write a function `Pwt` (that calls the functions you already have) to compute $P(W_1|T)$.

* This will be something like the multiplication rule, but you may need to change something

In [16]:
def Pwt(W1 = '', T = ''):
    return P2(T = T, W1 = W1) / P(T = T) 

## 11. What is P(W1='the'|T='movie')?

In [17]:
Pwt(W1='the',T='movie') ## should be 0.09214308527626956

0.09214308527626956

## 12. Use Bayes' rule to convert from P(W1|T) to P(T|W1). Write a function Ptw to reflect this. 

Hints:

* Call your other functions.
* You may need to write a function for P(W1) and you may need a new counter for `data['first_word']`

In [18]:
word_one_ctr = ctr(data.first_word)

def P_word1(W1 = ''):
    return word_one_ctr[W1] / len(data)

def Ptw(W1 = '', T = ''):
    return (Pwt(W1 = W1, T = T) * P(T = T)) / P_word1(W1 = W1)

## 13. 
### What is P(T='movie'|W1='the')? 
### What about P(T='person'|W1='the')?
### What about P(T='drug'|W1='the')?
### What about P(T='place'|W1='the')?
### What about P(T='company'|W1='the')?

In [19]:
Ptw(T='movie',W1='the')

0.9086614173228347

In [20]:
Ptw(T='person',W1='the')

3.3072440944881886e-09

In [21]:
Ptw(T='drug',W1='the')

3.307244094488189e-09

In [22]:
Ptw(T='place',W1='the')

0.0015748031496062992

In [23]:
Ptw(T='company',W1='the')

0.08976377952755905
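
As a quick sanity check, the five posteriors should sum to about one. The sketch below uses only the values printed above. The sum comes out slightly above one because `P2` returns a floor of `1E-10` for the unseen pairs ('person', 'the') and ('drug', 'the'), and that floor is never renormalized; `posteriors` and `excess` are illustrative names.

```python
# Posterior values copied from the outputs of cells 19-23 above.
posteriors = {
    'movie': 0.9086614173228347,
    'person': 3.3072440944881886e-09,
    'drug': 3.307244094488189e-09,
    'place': 0.0015748031496062992,
    'company': 0.08976377952755905,
}

total = sum(posteriors.values())

# How far the sum overshoots 1.0 because of the two floored entries.
excess = total - 1.0
print(total, excess)
```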

## 14. Given this, if the word 'the' is found in a name, what is the most likely type?

The most likely type is 'movie': when the first word (`data.first_word`) is 'the', P(T='movie'|W1='the') is about 0.909, far higher than for any other type.

## 15. Is Ptw(T='movie'|W1='the') the same as Pwt(W1='the'|T='movie')? Why or why not?

In [24]:
Ptw(T='movie',W1='the')

0.9086614173228347

In [25]:
Pwt(W1='the', T='movie')

0.09214308527626956

Ptw(T='movie'|W1='the') is the probability that the type is 'movie' given that the first word is 'the'. Pwt(W1='the'|T='movie') is the probability that the first word is 'the' given that the type is 'movie'. The values differ because each conditions on a different piece of information: about 91% of names that start with 'the' are movies, but only about 9% of movie names start with 'the'. By Bayes' rule, $P(T|W_1) = P(W_1|T) \cdot P(T) / P(W_1)$, so the two are equal only when $P(T) = P(W_1)$. Independence alone would not make them equal: it would give $P(T|W_1) = P(T)$ and $P(W_1|T) = P(W_1)$, which still differ here. In this data the type and the first word are strongly dependent on one another.
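
The relationship between the two conditionals can be checked numerically. This is a sketch built from values already reported in the notebook: Bayes' rule implies that the ratio of the two conditionals equals $P(T)/P(W_1)$. The value of $P(W_1=\text{the})$ is taken from `data.describe()`, where 'the' is the top first word with frequency 635 out of 21001 rows; the variable names are illustrative.

```python
import math

# Values reported earlier in the notebook.
p_t_given_w = 0.9086614173228347   # Ptw(T='movie', W1='the')
p_w_given_t = 0.09214308527626956  # Pwt(W1='the', T='movie')
p_t = 0.29817627732012764          # P(T='movie')
p_w = 635 / 21001                  # P(W1='the'), from the describe() output

# Bayes' rule: P(T|W1) / P(W1|T) == P(T) / P(W1).
ratio_of_conditionals = p_t_given_w / p_w_given_t
ratio_of_marginals = p_t / p_w
print(math.isclose(ratio_of_conditionals, ratio_of_marginals, rel_tol=1e-9))
```

The two conditionals would match only if the two marginals matched, which is not the case here.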

## 16. Do you think modeling Ptw(T|W1) would be better with a continuous function like a Gaussian? Why or why not?

- Answer in a markdown cell


The word types do not follow a normal distribution, and there are only 5 of them. With 5 categorical values, I do not think that a continuous function is needed. The space of first words should also be much larger, approaching an almost infinite space. In 