# Assignment 2 - Elementary Probability and Information Theory 
# Boise State University NLP - Dr. Kennington

### Instructions and Hints:

* This notebook loads some data into a `pandas` dataframe, then does a small amount of preprocessing. Make sure your data can load by stepping through all of the cells up until question 1. 
* Most of the questions require you to write some code. In many cases, you will write some kind of probability function like we did in class using the data. 
* Some of the questions only require you to write answers, so be sure to change the cell type to markdown or raw text.
* Don't worry about normalizing the text this time (e.g., lowercase, etc.). Just focus on probabilities. 
* Most questions can be answered in a single cell, but you can make as many additional cells as you need. 
* Follow the instructions on the corresponding assignment Trello card for submitting your assignment. 

In [1]:
from client.api.notebook import Notebook

ok = Notebook('a2.ok')

ok.auth(inline=True, force=True)

Assignment: A2 Python and Jupyter
OK, version v1.18.1


Successfully logged in as SajiaZafreen@u.boisestate.edu


In [4]:
import pandas as pd 
import numpy as np
from collections import Counter as ctr
data = pd.read_csv('pnp-train.txt',delimiter='\t',encoding='latin-1', # utf8 encoding didn't work for this
                  names=['type','name']) # supply the column names for the dataframe

# this next line creates a new column with the lower-cased first word
data['first_word'] = data['name'].map(lambda x: x.lower().split()[0])

In [5]:
data[:10]

| | type | name | first_word |
|---|---|---|---|
| 0 | drug | Dilotab | dilotab |
| 1 | movie | Beastie Boys: Live in Glasgow | beastie |
| 2 | person | Michelle Ford-Eriksson | michelle |
| 3 | place | Ramsbury | ramsbury |
| 4 | place | Market Bosworth | market |
| 5 | drug | Cyanide Antidote Package | cyanide |
| 6 | person | Bill Johnson | bill |
| 7 | place | Ettalong | ettalong |
| 8 | movie | The Suicide Club | the |
| 9 | place | Pézenas | pézenas |


In [6]:
data.describe()

| | type | name | first_word |
|---|---|---|---|
| count | 21001 | 21001 | 21001 |
| unique | 5 | 20992 | 13703 |
| top | movie | George Washington | the |
| freq | 6262 | 2 | 635 |


## 1. Write a probability function/distribution $P(T)$ over the types. 

Hints:

* The Counter library might be useful: `from collections import Counter`
* Write a function `def P(T='')` that returns the probability of the specific value for T
* You can access the types from the dataframe by calling `data['type']`

In [7]:
data['type']

0          drug
1         movie
2        person
3         place
4         place
          ...  
20996     movie
20997     place
20998     place
20999     place
21000      drug
Name: type, Length: 21001, dtype: object

In [8]:
type_ctr = ctr(data.type)
def P(T=''):
    return type_ctr[T] / len(data)

## 2. What is `P(T='movie')` ?

In [9]:
P('movie')

0.29817627732012764

## 3. Show that your probability distribution sums to one.

In [11]:
# find the unique types
arr = data['type'].unique()
#print(arr)

In [12]:
# Map P over the unique types, then sum the resulting probabilities.
val = sum(map(P, arr))
print(val)

1.0
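
The sum printed here is exactly `1.0`, but with floats a correct distribution can also sum to something like `0.9999999999999999`. A minimal sketch of a tolerant check, assuming a small hypothetical list of types (`toy_types`) in place of `data['type']`:

```python
import math
from collections import Counter

# Hypothetical stand-in for data['type']; not the assignment data.
toy_types = ["drug", "movie", "person", "place", "place", "movie", "movie"]

toy_counts = Counter(toy_types)
toy_total = sum(toy_counts.values())

def P_toy(T=""):
    """Relative frequency of type T; unseen types get probability 0."""
    return toy_counts[T] / toy_total

# Sum over the unique types only, one term per type.
total_prob = sum(P_toy(t) for t in toy_counts)
print(math.isclose(total_prob, 1.0))
```

`math.isclose` compares within a small relative tolerance, so the check does not depend on how the rounding errors happen to fall.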


### Extra Calculation

In [11]:
# Count the rows of each type
data.groupby("type")["first_word"].count()

type
company    2484
drug       5030
movie      6262
person     3836
place      3389
Name: first_word, dtype: int64

In [12]:
type_df = data['type'].value_counts()
df = type_df.to_frame(name = 'types')

print(df)

         types
movie     6262
drug      5030
person    3836
place     3389
company   2484


In [13]:
data.groupby('type').count()

| type | name | first_word |
|---|---|---|
| company | 2484 | 2484 |
| drug | 5030 | 5030 |
| movie | 6262 | 6262 |
| person | 3836 | 3836 |
| place | 3389 | 3389 |


In [14]:
data['type'].apply(P).sum()

4613.304937860102

In [15]:
proba = data['type'].apply(P)

In [16]:
# Put the per-row probabilities into a one-column dataframe
df = proba.to_frame(name = 'Probabilities')

print(df)

       Probabilities
0           0.239512
1           0.298176
2           0.182658
3           0.161373
4           0.161373
...              ...
20996       0.298176
20997       0.161373
20998       0.161373
20999       0.161373
21000       0.239512

[21001 rows x 1 columns]


In [17]:
# Sum the per-row probabilities (one term per row, not one per type)
val = df["Probabilities"].sum()
print(val)

4613.304937860102
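
The value 4613.30 is not a failed normalization check. `data['type'].apply(P)` produces one probability per row, so each type's probability is added once for every row of that type. The result is $\sum_t \text{count}(t) \cdot P(t)$ rather than $\sum_t P(t)$. A short sketch that reproduces the number from the type counts printed above:

```python
# Type counts copied from the groupby output in this notebook.
type_counts = {"company": 2484, "drug": 5030, "movie": 6262,
               "person": 3836, "place": 3389}
n_rows = sum(type_counts.values())

# Summing P over every row adds P(t) once per row of type t.
row_sum = sum(c * (c / n_rows) for c in type_counts.values())

# Summing P over the unique types adds each P(t) exactly once.
type_sum = sum(c / n_rows for c in type_counts.values())

print(round(row_sum, 2))
print(round(type_sum, 4))
```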


## 4. Write a joint distribution using the type and the first word of the name

Hints:

* The function is $P2(T,W_1)$
* You will need to count up types AND the first words, for example: ('person','bill')
* Using the [itertools.product](https://docs.python.org/2/library/itertools.html#itertools.product) function was useful for me here

In [244]:
# Count how often each (type, first_word) pair occurs
data.groupby(['type', 'first_word']).size()

type     first_word          
company  a.                      1
         a/s                     1
         aames                   1
         aar                     1
         aaron                   2
                                ..
place    zoeterwoude-rijndijk    1
         zuid-holland            1
         zwickau                 1
         zwolle                  1
         zülpich                 1
Length: 14240, dtype: int64

In [18]:
sum_column = data['type'] + data['first_word']
data['combine'] = sum_column
print (len(data[data['combine'] == 'moviethe']['combine']))
print(len(data))

577
21001


In [19]:
def P2(T='', W1=''):
    com_word = T + W1
    return (len(data[data['combine'] == com_word]['combine'])) / len(data)
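
Filtering the dataframe on every call works, but it rescans all rows each time. One alternative sketch, assuming a hypothetical toy list of `(type, first_word)` pairs instead of the real dataframe, counts the pairs once with a `Counter` keyed by tuples:

```python
from collections import Counter

# Hypothetical (type, first_word) pairs standing in for
# zip(data['type'], data['first_word']); not the assignment data.
toy_pairs = [
    ("movie", "the"), ("movie", "the"), ("company", "the"),
    ("person", "bill"), ("place", "market"), ("drug", "cyanide"),
]

pair_counts = Counter(toy_pairs)
n_pairs = len(toy_pairs)

def P2_toy(T="", W1=""):
    """Joint relative frequency of (T, W1); unseen pairs get 0."""
    return pair_counts[(T, W1)] / n_pairs

print(P2_toy("movie", "the"))
print(P2_toy("person", "the"))
```

Tuple keys also mean a pair is identified by its two parts rather than by a concatenated string.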

## 5. What is P2(T='person', W1='bill')? What about P2(T='movie',W1='the')?

In [20]:
#data.combine
P2(T='person', W1='bill')

0.00047616780153326033

In [21]:
P2(T='movie', W1='the')

0.02747488214846912

## 6. Show that your probability distribution P(T,W1) sums to one.

In [11]:
types = data['type'].unique()
words = data['first_word'].unique()
types_words = data['combine'].unique()

sum_list = []
for a in types:
    for b in words:
        sum_list.append(P2(a,b))
print (round(sum(sum_list),4))

1.0


## 7. Make a new function Q(T) from marginalizing over P(T,W1) and make sure that Q(T) sums to one.

Hints:

* Your Q function will call P(T,W1)
* Your check for the sum to one should be the same answer as Question 3, only it calls Q instead of P.

In [16]:
words = data['first_word'].unique()
def Q(T=''):
    sum_list = []
    for a in words:
        sum_list.append(P2(T,a))
    return sum(sum_list)

In [17]:
Q('movie')

0.29817627732011875

In [58]:
types = data['type'].unique()
val = sum((list(map(Q,types))))
print(round(val,4))

1.0
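
Note that `Q('movie')` agrees with `P('movie')` only up to the last few digits. `Q` adds thousands of small floats, so rounding error accumulates. A minimal sketch of marginalization on hypothetical toy pairs (the names `toy_pairs`, `Q_toy`, and `direct_counts` are illustrative only), comparing the marginal with the direct estimate under a tolerance:

```python
import math
from collections import Counter

# Hypothetical (type, first_word) pairs; not the assignment data.
toy_pairs = [
    ("movie", "the"), ("movie", "the"), ("movie", "beastie"),
    ("company", "the"), ("person", "bill"), ("place", "market"),
]
joint_counts = Counter(toy_pairs)
n_pairs = len(toy_pairs)
toy_words = {w for _, w in toy_pairs}
toy_types = {t for t, _ in toy_pairs}

def Q_toy(T=""):
    """Marginal probability of T: sum the joint over all first words."""
    return sum(joint_counts[(T, w)] / n_pairs for w in toy_words)

# Direct estimate of P(T) for comparison with the marginal.
direct_counts = Counter(t for t, _ in toy_pairs)

for t in sorted(toy_types):
    print(t, math.isclose(Q_toy(t), direct_counts[t] / n_pairs))
```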


## 8. What is the KL Divergence of your Q function and your P function for Question 1?

* Even if you know the answer, you still need to write code that computes it.

In [72]:
import math
types = data['type'].unique()
sum_types = []
for a in types:
    sum_types.append(P(a)*(math.log((P(a)/Q(a)),10)))
print(round(sum(sum_types),4))

-0.0
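
The quantity computed here is $D_{KL}(P \| Q) = \sum_t P(t) \log \frac{P(t)}{Q(t)}$. Below is a sketch of it as a reusable function, assuming both distributions are given as dictionaries over the same outcomes. The function name and the two example dictionaries are hypothetical. The base of the logarithm only rescales the result: the cell above uses base 10, and this sketch defaults to base 2 so the answer is in bits.

```python
import math

def kl_divergence(p, q, base=2):
    """D_KL(p || q) for two discrete distributions stored as dicts.

    Terms with p[x] == 0 contribute 0 by convention. An outcome with
    p[x] > 0 and q[x] == 0 makes the divergence infinite.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx, base)
    return total

# Hypothetical example distributions.
p_example = {"a": 0.5, "b": 0.5}
q_example = {"a": 0.25, "b": 0.75}

print(kl_divergence(p_example, p_example))
print(kl_divergence(p_example, q_example))
```

Identical distributions give a divergence of zero, which is consistent with the `-0.0` above: `P` and `Q` describe the same distribution and differ only by floating-point rounding.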


## 9. Convert from P(T,W1) to P(W1|T) 

Hints:

* Just write a comment cell, no code this time. 
* Note that $P(T,W1) = P(W1,T)$
* The relationship is $P(W1|T) = P(T,W1) / P(T)$

(try to use markdown math formatting, answer in this cell)
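
One way to write the conversion, starting from the multiplication rule and the symmetry noted in the hints (a sketch of the derivation, assuming $P(T) > 0$):

```latex
\begin{align*}
P(T, W_1) &= P(W_1, T) = P(W_1 \mid T)\, P(T) \\
P(W_1 \mid T) &= \frac{P(T, W_1)}{P(T)}
\end{align*}
```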

## 10. Write a function `Pwt` (that calls the functions you already have) to compute $P(W_1|T)$.

* This will be something like the multiplication rule, but you may need to change something

In [22]:
def Pwt (W1='',T=''):
    return P2(T,W1) / P(T)

## 11. What is P(W1='the'|T='movie')?

In [23]:
Pwt(W1='the',T='movie')

0.09214308527626956

## 12. Use Bayes' rule to convert from P(W1|T) to P(T|W1). Write a function Ptw to reflect this. 

Hints:

* Call your other functions.
* You may need to write a function for P(W1) and you may need a new counter for `data['first_word']`
* $P(T|W1) = P(W1|T) \, P(T) / P(W1)$

In [24]:
word_ctr = ctr(data.first_word)
def Pw(W1=''):
    return word_ctr[W1] / len(data)

def Ptw(T ='',W1=''):
    return Pwt(W1,T)*P(T)/Pw(W1)

## 13. 
### What is P(T='movie'|W1='the')? 
### What about P(T='person'|W1='the')?
### What about P(T='drug'|W1='the')?
### What about P(T='place'|W1='the')?
### What about P(T='company'|W1='the')?

In [25]:
Ptw(T='movie',W1='the')

0.9086614173228347

In [26]:
Ptw(T='person',W1='the')

0.0

In [27]:
Ptw(T='drug',W1='the')

0.0

In [28]:
Ptw(T='place',W1='the')

0.0015748031496062992

In [29]:
Ptw(T='company',W1='the')

0.08976377952755905
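
As a sanity check, the five posteriors should sum to one, since for a fixed first word $P(T|W1)$ is a distribution over the types. A small sketch using the values reported in the cells above:

```python
import math

# Posterior values P(T | W1='the') copied from the outputs above.
posteriors = {
    "movie": 0.9086614173228347,
    "person": 0.0,
    "drug": 0.0,
    "place": 0.0015748031496062992,
    "company": 0.08976377952755905,
}

total = sum(posteriors.values())
print(math.isclose(total, 1.0))
```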

## 14. Given this, if the word 'the' is found in a name, what is the most likely type?

In [31]:
types = data['type'].unique()
# Posterior P(T | W1='the') for every type
max_dict = {}
for a in types:
    max_dict[a] = Ptw(a,'the')
# The key with the largest posterior is the most likely type
max(max_dict, key=max_dict.get)

'movie'

## 15. Is Ptw(T='movie'|W1='the') the same as Pwt(W1='the'|T='movie')? Why or why not?

In [32]:
Ptw(T='movie',W1='the')

0.9086614173228347

In [33]:
Pwt(W1='the', T='movie')

0.09214308527626956

#### Why or why not:
No, they are not the same. The probability of 'movie' given 'the' is a different quantity from the probability of 'the' given 'movie', because 
> $Ptw(T|W1) = P(T,W1) / P(W1)$ and  
> $Pwt(W1|T) = P(W1,T) / P(T)$.  
> Even though $P(T,W1) = P(W1,T)$, the joint probability is divided by a different probability in each case.   
> For a pair with nonzero joint probability, $Ptw(T|W1)$ equals $Pwt(W1|T)$ only when $P(T) = P(W1)$.
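
The counts already shown in this notebook make the difference concrete. Both conditionals share the joint count of 577 rows (type `movie`, first word `the`) and differ only in the denominator: 635 rows whose first word is `the` versus 6262 movie rows. A short sketch of that arithmetic:

```python
# Counts taken from earlier outputs in this notebook.
n_movie_the = 577   # rows with type 'movie' and first word 'the'
n_the = 635         # rows whose first word is 'the'
n_movie = 6262      # rows with type 'movie'

p_movie_given_the = n_movie_the / n_the      # P(T='movie' | W1='the')
p_the_given_movie = n_movie_the / n_movie    # P(W1='the' | T='movie')

print(p_movie_given_the)
print(p_the_given_movie)
```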

## 16. Do you think modeling Ptw(T|W1) would be better with a continuous function like a Gaussian? Why or why not?

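- Probably not. A Gaussian models a continuous, ordered variable, whereas both $T$ and $W1$ are categorical, with no natural ordering or distance between their values, so there is no meaningful mean or variance to fit. For discrete data like this, a distribution estimated from counts is sufficient and a Gaussian is not needed.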


In [35]:
ok.submit()

Saving notebook... Saved 'A2-probability-information-theory.ipynb'.
Submit... 100% complete
Submission successful for user: SajiaZafreen@u.boisestate.edu
URL: https://okpy.org/bsu/nlp/sp21/a2/submissions/14z4MV

