# **CIS 419/519**
## NLP Worksheet
In this notebook, we rank the plays and words of Shakespeare by similarity using a term-context matrix and a term-document matrix.

In [None]:
import os
import csv
import subprocess
import re
import random
import numpy as np
import gdown

if not os.path.exists("play_names.txt"):
    !gdown --id 19To7EPkUd2bHTF0tqJqUkMSq1JRdK33k
if not os.path.exists("vocab.txt"):
    !gdown --id 1doAEuyZ5cBfzEBZUYqrW0RMKcciBZbk5
if not os.path.exists("will_play_text.csv"):
    !gdown --id 1wGL3C7xAoNdlfUi8fcotj31syc2rIyM_

Downloading...
From: https://drive.google.com/uc?id=19To7EPkUd2bHTF0tqJqUkMSq1JRdK33k
To: /content/play_names.txt
100% 587/587 [00:00<00:00, 1.09MB/s]
Downloading...
From: https://drive.google.com/uc?id=1doAEuyZ5cBfzEBZUYqrW0RMKcciBZbk5
To: /content/vocab.txt
100% 206k/206k [00:00<00:00, 73.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1wGL3C7xAoNdlfUi8fcotj31syc2rIyM_
To: /content/will_play_text.csv
100% 10.3M/10.3M [00:00<00:00, 162MB/s]


In [None]:
def read_in_shakespeare():
    '''Reads in the Shakespeare dataset and processes it into a list of tuples.
       Also reads in the vocab and play name lists from files.
    Each tuple consists of
    tuple[0]: The name of the play
    tuple[1]: A line from the play as a list of tokenized words.
    Returns:
      tuples: A list of tuples in the above format.
      document_names: A list of the plays present in the corpus.
      vocab: A list of all tokens in the vocabulary.
    '''

    tuples = []

    with open('will_play_text.csv') as f:
        csv_reader = csv.reader(f, delimiter=';')
        for row in csv_reader:
            play_name = row[1]
            line = row[5]
            line_tokens = re.sub(r'[^a-zA-Z0-9\s]', ' ', line).split()
            line_tokens = [token.lower() for token in line_tokens]

            tuples.append((play_name, line_tokens))

    with open('vocab.txt') as f:
        vocab = [line.strip() for line in f]

    with open('play_names.txt') as f:
        document_names = [line.strip() for line in f]

    return tuples, document_names, vocab


def get_row_vector(matrix, row_id):
    return matrix[row_id, :]


def get_column_vector(matrix, col_id):
    return matrix[:, col_id]
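
The tokenization step inside `read_in_shakespeare` can be tried on its own. The sketch below applies the same regular expression to two made-up lines; the helper name `tokenize` is hypothetical and is not part of the notebook.

```python
import re

def tokenize(line):
    # Replace anything that is not a letter, digit, or whitespace with a space,
    # then split on whitespace and lowercase each token.
    return [t.lower() for t in re.sub(r'[^a-zA-Z0-9\s]', ' ', line).split()]

print(tokenize("To be, or not to be: that is the question."))
print(tokenize("O'er the hills"))
```

Note that punctuation inside a word also becomes a space, so a contraction such as "O'er" is split into two tokens.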

## Term Document Matrix
Recall that a term document matrix describes the frequency of terms that occur in a collection of documents. In the matrix built below, rows correspond to terms in the vocabulary and columns correspond to documents in the collection.

In [None]:
def create_term_document_matrix(line_tuples, document_names, vocab):
    '''Returns a numpy array containing the term document matrix for the input lines.
    Inputs:
      line_tuples: A list of tuples, containing the name of the document and 
      a tokenized line from that document.
      document_names: A list of the document names
      vocab: A list of the tokens in the vocabulary
    Let m = len(vocab) and n = len(document_names).
    Returns:
      td_matrix: A mxn numpy array where the number of rows is the number of words
          and each column corresponds to a document. A_ij contains the
          frequency with which word i occurs in document j.
    '''

    vocab_to_id = dict(zip(vocab, range(0, len(vocab))))
    docname_to_id = dict(zip(document_names, range(0, len(document_names))))
    matrix = np.zeros((len(vocab), len(document_names)))
    for (doc, lines) in line_tuples:
        for w in lines:
            matrix[vocab_to_id[w]][docname_to_id[doc]] += 1
    return matrix
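
To make the layout concrete, here is a minimal self-contained sketch that builds a term document matrix for a tiny invented corpus. The names `toy_lines`, `toy_docs`, and `toy_vocab` are illustrative assumptions; the counting loop mirrors the one in `create_term_document_matrix`.

```python
import numpy as np

# Hypothetical toy corpus: (document name, tokenized line) pairs.
toy_lines = [
    ("PlayA", ["love", "and", "death"]),
    ("PlayA", ["love", "love"]),
    ("PlayB", ["death", "and", "war"]),
]
toy_docs = ["PlayA", "PlayB"]
toy_vocab = ["and", "death", "love", "war"]

word_to_id = {w: i for i, w in enumerate(toy_vocab)}
doc_to_id = {d: j for j, d in enumerate(toy_docs)}

# Rows index words, columns index documents.
toy_td = np.zeros((len(toy_vocab), len(toy_docs)))
for doc, tokens in toy_lines:
    for w in tokens:
        toy_td[word_to_id[w], doc_to_id[doc]] += 1

print(toy_td)
```

Each column is the count vector of one document, and each row shows how one word is distributed across documents. Here the row for "love" is `[3, 0]`.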

## Term Context Matrix (word-word co-occurrence matrix)
Instead of using entire documents, we can use smaller contexts to describe words. For example, the context can be a paragraph or a window of 10 words around each occurrence. A word is then represented by a vector of counts of its context words.

In [None]:
def create_term_context_matrix(line_tuples, vocab, context_window_size=1):
    '''Returns a numpy array containing the term context matrix for the input lines.
    Inputs:
      line_tuples: A list of tuples, containing the name of the document and 
      a tokenized line from that document.
      vocab: A list of the tokens in the vocabulary
    Let n = len(vocab).
    Returns:
      tc_matrix: A nxn numpy array where A_ij contains the frequency with which
          word j was found within context_window_size to the left or right of
          word i in any sentence in the tuples.
    '''

    vocab_to_id = dict(zip(vocab, range(0, len(vocab))))
    matrix = np.zeros((len(vocab), len(vocab)))

    for _, line in line_tuples:
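        for i in range(len(line)):
            w = line[i]
            # range() excludes its upper bound, so add 1 to reach the token at
            # i + context_window_size. The window includes position i itself.
            for j in range(max(0, i-context_window_size), min(len(line), i+context_window_size+1)):
                c = line[j]
                matrix[vocab_to_id[w]][vocab_to_id[c]] += 1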

    return matrix
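
The windowing idea can be checked by hand on a single short line. The sketch below is independent of the function above and makes two assumptions of its own: the window reaches `window` tokens on each side, and the center word is not counted as its own context.

```python
import numpy as np

toy_line = ["to", "be", "or", "not", "to", "be"]
toy_vocab = ["be", "not", "or", "to"]
word_to_id = {w: i for i, w in enumerate(toy_vocab)}
window = 1  # tokens considered on each side of the center word

toy_tc = np.zeros((len(toy_vocab), len(toy_vocab)))
for i, w in enumerate(toy_line):
    lo = max(0, i - window)
    hi = min(len(toy_line), i + window + 1)  # exclusive upper bound
    for j in range(lo, hi):
        if j == i:
            continue  # skip the center word itself (a choice made for this sketch)
        toy_tc[word_to_id[w], word_to_id[toy_line[j]]] += 1

print(toy_tc)
```

With a symmetric window the resulting matrix is symmetric: "to" appears next to "be" twice, so both the ("to", "be") and ("be", "to") entries equal 2.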

## Ranking Similarity of Plays
From the term document matrix, we can represent a document by a vector. Each element of the vector is the number of occurrences of a word in the document. Therefore, by comparing two vectors, we can compare the similarity of two documents.

In [None]:
def rank_plays(target_play_index, term_document_matrix, similarity_fn):
    ''' Ranks the similarity of all of the plays to the target play.
    Inputs:
      target_play_index: The integer index of the play we want to compare all others against.
      term_document_matrix: The term-document matrix as a mxn numpy array.
      similarity_fn: Function that should be used to compare vectors for two
        documents. Either compute_dice_similarity, compute_jaccard_similarity, or
        compute_cosine_similarity.
    Returns:
      A length-n list of integer indices corresponding to play names,
      ordered by decreasing similarity to the play indexed by target_play_index
    '''

    _, N = term_document_matrix.shape
    target = term_document_matrix[:, target_play_index]
    similarity_score = np.zeros(N)
    for i in range(N):
        similarity_score[i] = similarity_fn(term_document_matrix[:, i], target)

    return np.argsort(similarity_score*-1)
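
The ranking step reduces to scoring every column against the target column and sorting by descending score. A minimal sketch on an invented 3x3 count matrix follows; `cosine` is a local helper defined only for this example.

```python
import numpy as np

def cosine(u, v):
    # Cosine of the angle between two nonzero vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 3-word x 3-document count matrix (columns are documents).
toy_td = np.array([
    [2.0, 4.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 5.0],
])
target = 0

scores = np.array([cosine(toy_td[:, j], toy_td[:, target])
                   for j in range(toy_td.shape[1])])
# Negating the scores makes argsort return indices in descending order.
order = np.argsort(-scores)
print(order)
```

The target document is compared with itself as well, so it comes first in the ranking. Callers that want only the other documents can skip the first index.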

## Ranking Similarity of Words
From the term context matrix, we can represent a word by a vector. Each element of the vector is the number of times a particular context word occurs within the window around the word. Therefore, by comparing two vectors, we can compare the similarity of two words.

In [None]:
def rank_words(target_word_index, matrix, similarity_fn):
    ''' Ranks the similarity of all of the words to the target word.
    Inputs:
      target_word_index: The index of the word we want to compare all others against.
      matrix: Numpy matrix where the ith row represents a vector embedding of the ith word.
      similarity_fn: Function that should be used to compare vectors for two word
        embeddings. Either compute_dice_similarity, compute_jaccard_similarity, or
        compute_cosine_similarity.
    Returns:
      A length-n list of integer word indices, ordered by decreasing similarity to the
      target word indexed by target_word_index
    '''

    m, n = matrix.shape
    target = matrix[target_word_index, :]
    similarity_score = np.zeros(m)
    for i in range(m):
        similarity_score[i] = similarity_fn(matrix[i,:], target)

    return np.argsort(similarity_score*-1)


## Similarity Metrics
How can we compare the similarity between two vectors? We introduce three measures: cosine similarity, Jaccard similarity and Dice similarity. 

### Cosine Similarity

$\text{Similarity}({\bf t},{\bf e}) = \frac{{\bf t} \cdot {\bf e}}{\|{\bf t}\| \|{\bf e}\|} = \frac{ \sum_{i=1}^{n}{{\bf t}_i{\bf e}_i} }{ \sqrt{\sum_{i=1}^{n}{({\bf t}_i)^2}} \sqrt{\sum_{i=1}^{n}{({\bf e}_i)^2}} }$

This equals the cosine of the angle between two vectors.
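
Because it measures an angle, cosine similarity ignores vector length. The small sketch below, using invented vectors, shows that doubling every count leaves the score at 1, while vectors with no shared nonzero coordinate score 0.

```python
import numpy as np

def cosine(u, v):
    # Dot product divided by the product of the Euclidean norms.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

t = np.array([1.0, 2.0, 0.0])
e = np.array([2.0, 4.0, 0.0])  # same direction as t, twice the counts
o = np.array([0.0, 0.0, 3.0])  # shares no nonzero coordinate with t

print(cosine(t, e))  # approximately 1
print(cosine(t, o))  # 0.0
```

This is why a long play and a short play with the same relative word frequencies look identical under cosine similarity.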

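### Jaccard Similarity
$\text{Similarity}({\bf t}, {\bf e}) = \frac{|{\bf t}\cap {\bf e}|}{|{\bf t}\cup {\bf e}|}$

For count vectors, the code below uses the weighted generalization $\frac{\sum_{i=1}^{n}{\min({\bf t}_i, {\bf e}_i)}}{\sum_{i=1}^{n}{\max({\bf t}_i, {\bf e}_i)}}$.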

### Dice Similarity
$\text{Similarity}({\bf t}, {\bf e}) = \frac{2J}{J+1}$ where $J$ is the Jaccard Similarity.
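
A short worked example on invented count vectors may help. It assumes the weighted (min/max) form of Jaccard similarity for count vectors, which is what `compute_jaccard_similarity` computes, and then derives Dice from it.

```python
import numpy as np

t = np.array([1.0, 2.0, 0.0])
e = np.array([2.0, 1.0, 1.0])

# Weighted Jaccard: sum of element-wise minima over sum of element-wise maxima.
j = np.sum(np.minimum(t, e)) / np.sum(np.maximum(t, e))  # (1+1+0) / (2+2+1)
# Dice expressed through Jaccard.
d = 2 * j / (j + 1)

print(j, d)
```

Here $J = 2/5$ and the Dice score is $4/7$. The Dice score also equals $2\sum_i \min({\bf t}_i, {\bf e}_i) / (\sum_i {\bf t}_i + \sum_i {\bf e}_i)$, since the element-wise minima and maxima together add up to the two vector sums.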

In [None]:
def compute_cosine_similarity(vector1, vector2):
    '''Computes the cosine similarity of the two input vectors.
    Inputs:
      vector1: A nx1 numpy array
      vector2: A nx1 numpy array
    Returns:
      A scalar similarity value.
    '''

    return np.inner(vector1, vector2)/(np.linalg.norm(vector1)*np.linalg.norm(vector2))


def compute_jaccard_similarity(vector1, vector2):
    '''Computes the jaccard similarity of the two input vectors.
    Inputs:
      vector1: A nx1 numpy array
      vector2: A nx1 numpy array
    Returns:
      A scalar similarity value.
    '''

    return np.sum(np.minimum(vector1, vector2))/np.sum(np.maximum(vector1, vector2))


def compute_dice_similarity(vector1, vector2):
    '''Computes the dice similarity of the two input vectors.
    Inputs:
      vector1: A nx1 numpy array
      vector2: A nx1 numpy array
    Returns:
      A scalar similarity value.
    '''

    j = compute_jaccard_similarity(vector1, vector2)
    return 2*j/(j+1)


## Running the code

In [None]:
tuples, document_names, vocab = read_in_shakespeare()

print('Computing term document matrix...')
td_matrix = create_term_document_matrix(tuples, document_names, vocab)

print('Computing term context matrix...')
tc_matrix = create_term_context_matrix(tuples, vocab, context_window_size=2)


random_idx = 6 #for Hamlet
similarity_fns = [compute_cosine_similarity, compute_jaccard_similarity, compute_dice_similarity]
for sim_fn in similarity_fns:
  print('\nThe 10 most similar plays to "%s" using %s are:' % (document_names[random_idx], sim_fn.__qualname__))
  ranks = rank_plays(random_idx, td_matrix, sim_fn)
  for idx in range(1, 11):
    doc_id = ranks[idx]
    print('%d: %s' % (idx, document_names[doc_id]))

word = 'abhor'
vocab_to_index = dict(zip(vocab, range(0, len(vocab))))
for sim_fn in similarity_fns:
  print('\nThe 10 most similar words to "%s" using %s on term-context frequency matrix are:' % (word, sim_fn.__qualname__))
  ranks = rank_words(vocab_to_index[word], tc_matrix, sim_fn)
  for idx in range(0, 11):
    word_id = ranks[idx]
    print('%d: %s' % (idx, vocab[word_id]))

Computing term document matrix...
Computing term context matrix...

The 10 most similar plays to "Hamlet" using compute_cosine_similarity are:
1: Henry VIII
2: A Winters Tale
3: Troilus and Cressida
4: Cymbeline
5: King Lear
6: Pericles
7: Richard III
8: macbeth
9: King John
10: Loves Labours Lost

The 10 most similar plays to "Hamlet" using compute_jaccard_similarity are:
1: Othello
2: Cymbeline
3: A Winters Tale
4: King Lear
5: Richard III
6: Coriolanus
7: Troilus and Cressida
8: Henry VIII
9: Alls well that ends well
10: Antony and Cleopatra

The 10 most similar plays to "Hamlet" using compute_dice_similarity are:
1: Othello
2: Cymbeline
3: A Winters Tale
4: King Lear
5: Richard III
6: Coriolanus
7: Troilus and Cressida
8: Henry VIII
9: Alls well that ends well
10: Antony and Cleopatra

The 10 most similar words to "abhor" using compute_cosine_similarity on term-context frequency matrix are:
0: abhor
1: i
2: gaps
3: promulgate
4: wis
5: recommended
6: propend
7: debted
8: weringly
9

## Result Analysis
The rankings produced by Jaccard similarity and Dice similarity are identical in both cases. This is expected: the Dice score $\frac{2J}{J+1}$ is a strictly increasing function of the Jaccard score $J$, so the two metrics always order candidates the same way.  \
Many other conclusions could be drawn; these are left as an exercise for the reader.
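
The relationship between the Jaccard and Dice rankings can be checked numerically. The sketch below uses randomly generated scores in place of real Jaccard values, which is an assumption made purely for illustration, and confirms that applying $2J/(J+1)$ leaves the sort order unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in Jaccard scores in [0, 1); real scores would come from the matrices.
jaccard_scores = rng.random(20)
# Dice as a function of Jaccard; its derivative 2 / (J + 1)^2 is positive.
dice_scores = 2 * jaccard_scores / (jaccard_scores + 1)

same_order = np.array_equal(np.argsort(-jaccard_scores),
                            np.argsort(-dice_scores))
print(same_order)
```

The two metrics assign different values but never disagree on which item is more similar, so only one of them is needed when ranking is the goal.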