# Cosine Similarity Example

### Intro to Algorithmic Marketing textbook example (Ilya Katsov):
![Cosine similarity textbook example](../images/cos-sim-textbook1.png "Cosine similarity textbook example")


## Finding Magnitude of a Vector

In [22]:
import math
import numpy as np
def magnitude(x): 
    return math.sqrt(sum(i**2 for i in x))

vectorA = [0,3,1,2]

print(f"First approach: {magnitude(vectorA)}")
print(f"Second approach: {np.linalg.norm(vectorA)}")

First approach: 3.7416573867739413
Second approach: 3.7416573867739413
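
The magnitude is the building block for cosine similarity: the dot product of two vectors divided by the product of their magnitudes. Below is a minimal sketch using only the standard library. `vectorB` is a made-up second vector for illustration and is not taken from the textbook example, and `cosine_similarity` is a hypothetical helper name.

```python
import math

def magnitude(x):
    """Euclidean (L2) norm of a vector given as a sequence of numbers."""
    return math.sqrt(sum(i ** 2 for i in x))

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    # Note: this divides by zero if either vector is all zeros.
    return dot / (magnitude(a) * magnitude(b))

vectorA = [0, 3, 1, 2]
vectorB = [1, 2, 0, 2]  # hypothetical second vector, not from the textbook

print(f"Cosine similarity: {cosine_similarity(vectorA, vectorB):.4f}")
```

Because the result depends only on the angle between the vectors, scaling either vector by a positive constant leaves the similarity unchanged.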


# Pointwise Mutual Information

It's important to identify a **context window** when analyzing co-occurrence. In the image below, the context window size is 4 (2 tokens on either side of the target word). Example from [Nourchene Ouerhani](https://www.researchgate.net/figure/Example-The-quick-brown-fox-jumps-over-the-lazy-dog-Model-CBOW-Input-Layer-White_fig4_348558204):

![Context window example](https://raw.githubusercontent.com/ychennay/dso-560-nlp-text-analytics/main/images/context_window.png "Context window example")

For the purposes of the next section, we'll define the **entire document as the context window.**
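
For comparison with the whole-document definition, here is a rough sketch of how a sliding context window could be extracted. The helper name `context_windows` and its `size` parameter (tokens per side, so `size=2` gives a window of 4) are assumptions made for illustration.

```python
def context_windows(tokens, size=2):
    """Yield (target, context) pairs, with up to `size` tokens on each side."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - size):i]   # up to `size` tokens before the target
        right = tokens[i + 1:i + 1 + size]  # up to `size` tokens after the target
        yield target, left + right

tokens = "the quick brown fox jumps over the lazy dog".split()
for target, context in context_windows(tokens, size=2):
    print(target, context)
```

Targets near the start or end of the sentence simply get a shorter context, since there are fewer than two tokens on one side.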

Pointwise mutual information (PMI) compares the **joint probability of two events happening** with the probability of both events happening if they were independent. It is the logarithm of the ratio between the two, and can be defined with the following equation:

$$
PMI_{A,B} = \log\frac{p(A,B)}{p(A)\,p(B)}
$$

Remember that when two events are independent, $p(A,B) = p(A)p(B)$, so the ratio is 1 and the PMI is 0. PMI is often preferable to a raw co-occurrence count because very common words skew raw counts ("the" and "of" will co-occur frequently in the same document, whether or not they are meaningfully associated).
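
As a worked example with made-up numbers (not taken from any dataset): suppose there are 10 documents, token $A$ appears in 4 of them, token $B$ appears in 5, and both appear together in 4. Taking the logarithm in base 2:

$$
p(A) = \frac{4}{10} = 0.4, \quad p(B) = \frac{5}{10} = 0.5, \quad p(A,B) = \frac{4}{10} = 0.4
$$

$$
PMI_{A,B} = \log_2\frac{0.4}{0.4 \times 0.5} = \log_2 2 = 1
$$

If the two tokens had appeared together in only 2 documents, then $p(A,B) = 0.2 = p(A)\,p(B)$ and the PMI would be $\log_2 1 = 0$, which is exactly the independent case.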

```python
import math

def pmi(tokenA, tokenB, documents, word_counts, bigram_freq):
    # word_counts[tokenA] => number of documents that contain tokenA
    # bigram_freq["tokenA tokenB"] => number of documents that contain both tokenA and tokenB
    # len(documents) => total number of documents
    n_documents = float(len(documents))
    prob_A = word_counts[tokenA] / n_documents
    prob_B = word_counts[tokenB] / n_documents
    prob_A_B = bigram_freq[" ".join([tokenA, tokenB])] / n_documents
    # base-2 log of the observed joint probability over the one expected under independence
    return math.log(prob_A_B / (prob_A * prob_B), 2)
```
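
To show where the counts could come from, here is a self-contained sketch that builds document-level counts and computes a base-2 PMI with the whole document as the context window. The helpers `document_counts` and `document_pmi` and the toy `docs` are hypothetical, written for illustration only. Pairs are stored as sorted tuples rather than space-joined strings, which is a design choice of this sketch.

```python
import math
from collections import Counter
from itertools import combinations

def document_counts(documents):
    """Count, per token and per unordered token pair, how many documents contain it."""
    word_counts, pair_counts = Counter(), Counter()
    for doc in documents:
        vocab = sorted(set(doc))  # count each token once per document
        word_counts.update(vocab)
        pair_counts.update(combinations(vocab, 2))  # pairs in sorted order
    return word_counts, pair_counts

def document_pmi(token_a, token_b, documents):
    """PMI (base 2) with the whole document as the context window."""
    word_counts, pair_counts = document_counts(documents)
    n = len(documents)
    joint = pair_counts[tuple(sorted((token_a, token_b)))] / n
    if joint == 0:
        return float("-inf")  # the pair never appears in the same document
    return math.log2(joint / ((word_counts[token_a] / n) * (word_counts[token_b] / n)))

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["new", "york", "city"],
    ["new", "york", "times"],
    ["new", "car"],
    ["old", "city"],
]
print(document_pmi("new", "york", docs))  # positive: co-occur more than chance
print(document_pmi("new", "city", docs))  # negative: co-occur less than chance
```

Returning negative infinity for pairs that never co-occur is one possible convention; other treatments (such as clipping negative values to zero) are also common.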

# Collocation

In previous homework assignments, we often had to find phrases that belong together by hand, for example `New York City`.

From [nltk.org](http://www.nltk.org/howto/collocations.html), **collocation** can be defined as

> expressions of multiple words which commonly co-occur together. 

In [3]:
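from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
# import under an alias so the set below does not shadow the module (keeps the cell safe to re-run)
from nltk.corpus import stopwords as nltk_stopwords

lemmatizer = WordNetLemmatizer()
# English stopwords plus punctuation tokens produced by word_tokenize
stopwords = set(nltk_stopwords.words('english') + [".", ",", ":", "''", "'s", "'", "``", "(", ")", "-"])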

In [4]:
documents = []
articles = [f"../datasets/bbcsport/football/00{i}.txt" for i in range(1,10)]

for path in articles:
    with open(path) as article: # open each sports article; the file is closed automatically afterwards
        for line in article:
            line = line.replace("\n", "") # remove the newline character
            if len(line) > 0: # if the line is not empty, process it
                # tokenize the line, then lemmatize each token
                tokens = [lemmatizer.lemmatize(token) for token in word_tokenize(line)]
                documents.append(tokens)

In [4]:
new_documents = []
for doc in documents:
    new_document = []
    for word in doc:
        if word.strip().lower() not in stopwords:
            new_document.append(word)
    new_documents.append(new_document)

In [26]:
collocation_finder = BigramCollocationFinder.from_documents(new_documents)
measures = BigramAssocMeasures()

collocation_finder.nbest(measures.raw_freq, 15)

[('Champions', 'League'),
 ('Manchester', 'United'),
 ('Cristiano', 'Ronaldo'),
 ('Van', 'Nistelrooy'),
 ('Wayne', 'Rooney'),
 ('Alex', 'Ferguson'),
 ('FA', 'Cup'),
 ('Ferguson', 'wa'),
 ('Gary', 'Neville'),
 ('Man', 'Utd'),
 ('Manchester', 'City'),
 ('Sir', 'Alex'),
 ('national', 'team'),
 ('wa', "n't"),
 ('23', 'minute')]
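
Ranking by `raw_freq` amounts to ranking bigrams by how often they occur. The sketch below approximates that idea with the standard library only; it is not NLTK's implementation, tie-breaking may differ, and both `top_bigrams` and `toy_documents` are hypothetical names and data rather than the BBC articles.

```python
from collections import Counter

def top_bigrams(documents, n=15):
    """Return the n most frequent adjacent token pairs, never spanning two documents."""
    counts = Counter()
    for doc in documents:
        counts.update(zip(doc, doc[1:]))  # adjacent pairs inside one document
    return [bigram for bigram, _ in counts.most_common(n)]

# Hypothetical toy corpus: each document is a list of tokens.
toy_documents = [
    ["Manchester", "United", "won", "Champions", "League"],
    ["Champions", "League", "final"],
    ["Manchester", "United", "lost"],
    ["Champions", "League", "draw"],
]
print(top_bigrams(toy_documents, n=2))
# → [('Champions', 'League'), ('Manchester', 'United')]
```

Raw frequency favors pairs of common words. `BigramAssocMeasures` also provides a `pmi` measure that can be passed to `nbest` in place of `raw_freq` to rank by association strength instead.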