# Assignment 1 Q3: Sentence Representation

Building on the word representations from Questions 1 and 2, we will represent a sentence by averaging the vectors of the words it contains. Skeleton code is provided in this file, and every method and function is already implemented for you. All you need to do is run the code and write down your answers.


In [None]:
# All Import Statements Defined Here
# Note: Do not add to this list.
# ----------------

import sys
assert sys.version_info[0]==3
assert sys.version_info[1] >= 5

from platform import python_version
assert int(python_version().split(".")[1]) >= 5, "Please upgrade your Python version following the instructions in \
    the README.txt file found in the same directory as this notebook. Your Python version is " + python_version()
from nltk.corpus import stopwords
from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import pprint
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
import nltk
nltk.download('reuters') #to specify download location, optionally add the argument: download_dir='/specify/desired/path/'
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import reuters
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import word_tokenize
import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA
START_TOKEN = '<START>'
END_TOKEN = '<END>'

np.random.seed(0)
random.seed(0)
# ----------------

Here, we will be using the Reuters (business and financial news) corpus. If you haven't run the import cell at the top of this page, please run it now (click it and press SHIFT-RETURN). The corpus consists of 10,788 news documents totaling 1.3 million words. These documents span 90 categories and are split into train and test. For more details, please see https://www.nltk.org/book/ch02.html. You do **not** have to perform any other kind of pre-processing.

In [None]:
def read_corpus():
    """ Read every file in the Reuters corpus.
        Return:
            list of lists, one per file, holding that file's lowercased words
            wrapped in START_TOKEN and END_TOKEN
    """
    files = reuters.fileids()
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]


Let's take a look at what these documents look like.

In [None]:
reuters_corpus = read_corpus()
pprint.pprint(reuters_corpus[:3], compact=True, width=100)
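
To make the shape of the result concrete without downloading the corpus, here is a minimal sketch that applies the same wrapping to two made-up documents. The names `toy_files` and `wrap_documents` are hypothetical and exist only for this illustration; the documents are invented and not taken from Reuters.

```python
# Toy stand-in for the Reuters files: each inner list is one tokenized document.
# These documents are made up for illustration and are not from the corpus.
START_TOKEN = '<START>'
END_TOKEN = '<END>'

toy_files = [["Oil", "prices", "ROSE"], ["Wheat", "exports", "fell", "."]]

def wrap_documents(documents):
    """Lowercase every word and wrap each document in the boundary tokens."""
    return [[START_TOKEN] + [w.lower() for w in doc] + [END_TOKEN] for doc in documents]

toy_corpus = wrap_documents(toy_files)
print(toy_corpus)
```

Each inner list corresponds to one document, so `reuters_corpus[:3]` above shows the first three documents in this form.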

## Vector Representation of Sentences

As discussed in class, prediction-based word vectors such as word2vec and GloVe have more recently demonstrated better performance. Here, we represent a sentence by averaging the GloVe embeddings of its words. For further details on GloVe, see [GloVe's original paper](https://nlp.stanford.edu/pubs/glove.pdf).
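
As a rough illustration of averaging, the sketch below uses hypothetical 3-dimensional vectors in place of the 200-dimensional GloVe embeddings. The dictionary `toy_vectors` and the helper `average_embedding` are assumptions made for this example and are not part of the assignment code.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "cats": np.array([1.0, 0.0, 2.0]),
    "chase": np.array([0.0, 2.0, 0.0]),
    "mice": np.array([2.0, 1.0, 1.0]),
}

def average_embedding(tokens, vectors):
    """Average the vectors of the tokens that appear in `vectors`."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no known tokens to average")
    return np.mean(known, axis=0)

sentence_vector = average_embedding(["cats", "chase", "mice"], toy_vectors)
print(sentence_vector)  # → [1. 1. 1.]
```

The sentence vector has the same dimensionality as the word vectors, regardless of how many words the sentence contains.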

Run the following cells to load the GloVe vectors into memory. **Note**: If this is your first time running these cells, the embedding model has to be downloaded, which will take a couple of minutes. If you've run these cells before, rerunning them will load the model without redownloading it, which will take about 1 to 2 minutes.

In [None]:
def load_embedding_model():
    """ Load GloVe Vectors
        Return:
            wv_from_bin: All 400000 embeddings, each length 200
    """
    import gensim.downloader as api
    wv_from_bin = api.load("glove-wiki-gigaword-200")
    print("Loaded vocab size %i" % len(list(wv_from_bin.index_to_key)))
    return wv_from_bin

In [None]:
# -----------------------------------
# Run Cell to Load Word Vectors
# Note: This will take a couple minutes
# -----------------------------------
wv_from_bin = load_embedding_model()

#### Note
(1) If you receive a "reset by peer" error, rerun the cell to restart the download.

(2) If you run out of memory on your local machine, try closing other applications to free up memory. You may also want to restart your machine, then immediately launch the Jupyter notebook and check whether the word vectors load properly. If you still cannot load the embeddings on your local machine, please go to office hours or contact the course TA.

### Problem (a): Tokenization
Tokenization splits a sentence (string) into tokens, roughly equivalent to words and punctuation. For example, to process the sentence 'I love New York', it needs to be tokenized into ['I', 'love', 'New', 'York']. Many NLP libraries and packages support tokenization because it is one of the most fundamental steps in an NLP pipeline. However, there is no standard solution that every NLP practitioner agrees upon. Let's compare how different NLP packages tokenize sentences.

In [None]:
sentence1="The BBC's correspondent in Athens, Malcolm Brabant, said that in the past few weeks more details had emerged of the alleged mistreatment by Greek-speaking agents."
sentence2="A new chapter has been written into Australia's rich sporting history after the Socceroos qualified for the World Cup finals following their 4-2 win over Uruguay on penalties at the Olympic Stadium in Sydney."

print("sentence 1, word_tokenize:", word_tokenize(sentence1))
print("sentence 1, WordPunctTokenizer:", WordPunctTokenizer().tokenize(sentence1))
print("sentence 2, word_tokenize:", word_tokenize(sentence2))
print("sentence 2, WordPunctTokenizer:", WordPunctTokenizer().tokenize(sentence2))
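
To see why tokenizers disagree, the dependency-free sketch below contrasts two simple rules on a short made-up sample. The regular expression resembles the pattern NLTK documents for `WordPunctTokenizer`, but treating the two as equivalent is an assumption; the function names are hypothetical.

```python
import re

def whitespace_tokenize(text):
    """Split on runs of whitespace only, so punctuation stays glued to words."""
    return text.split()

def word_punct_tokenize(text):
    """Split into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

sample = "The BBC's 4-2 win, in Athens."
print(whitespace_tokenize(sample))
print(word_punct_tokenize(sample))
```

The first rule keeps `BBC's` and `4-2` intact, while the second breaks them apart at every punctuation mark. Possessives, hyphens, and scores are good places to look when comparing the outputs above.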

### Problem (b): Stop Words
Stop words are words in a stop list that are filtered out (i.e. stopped) before or after processing natural language data (text). Run the code below to look at NLTK's English stop word list.

In [None]:
stop_words_list = stopwords.words('english')
print('Number of stop words:', len(stop_words_list))
print('The whole stop word list:', stop_words_list)

Run the code and skim the list. State ***TWO*** reasons why these stop words are filtered out during preprocessing.
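
For reference, filtering itself is a simple membership test. The sketch below uses a tiny hand-picked stop list (`toy_stop_words`, a hypothetical stand-in for NLTK's much longer list) and a hypothetical helper `remove_stop_words`.

```python
# A tiny hand-picked stop list for illustration; NLTK's English list is much longer.
toy_stop_words = {"the", "in", "of", "that", "had", "by"}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["The", "correspondent", "in", "Athens", "said", "that", "details", "had", "emerged"]
filtered = remove_stop_words(tokens, toy_stop_words)
print(filtered)
```

Storing the stop list in a `set` keeps each membership check fast, which matters when the same filter runs over a whole corpus.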

### Problem (c)

When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.
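
The relationship can be checked on small vectors. The sketch below is a minimal NumPy version with hypothetical helper names (`cosine_similarity`, `cosine_distance`); it assumes both inputs are non-zero vectors.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    """Cosine distance is 1 minus cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

u = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])
orthogonal = np.array([0.0, 2.0])
opposite = np.array([-1.0, 0.0])

print(cosine_distance(u, same_direction))  # → 0.0
print(cosine_distance(u, orthogonal))      # → 1.0
print(cosine_distance(u, opposite))        # → 2.0
```

Cosine distance therefore ranges from 0 (same direction) to 2 (opposite direction) and does not depend on vector length, so a smaller value means two sentence vectors point in more similar directions.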

Find three sentences $(s_1,s_2,s_3)$ where $s_1$ and $s_2$ have similar meanings and $s_1$ and $s_3$ have opposite meanings, but Cosine Distance $(s_1,s_3) <$ Cosine Distance $(s_1,s_2)$.

As an example, $s_1$="John likes to watch movies. Mary likes movies too." is closer to $s_3$="Mary likes to watch football games." than to $s_2$="John likes to watch films. So does mary." in the vector space. Please find a different example that satisfies the above. Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.

Use the `counter_intuitive_sentences` function, which returns `True` when the condition above is satisfied.

In [None]:

def counter_intuitive_sentences(s1:str,s2:str,s3:str)->bool:
    # True when s1 is closer to s3 than to s2 in cosine distance.
    s1_embedding=sentence_embedding(s1)
    s2_embedding=sentence_embedding(s2)
    s3_embedding=sentence_embedding(s3)
    return bool(cos_distance(s1_embedding,s3_embedding)<cos_distance(s1_embedding,s2_embedding))

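def cos_distance(a:np.ndarray, b:np.ndarray)->float:
    from scipy.spatial.distance import cosine  # explicit submodule import
    # cosine() already returns the distance, i.e. 1 - cosine similarity
    return cosine(a, b)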

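def sentence_embedding(s: str)->np.ndarray:
    # Lowercase and tokenize, then average the GloVe vectors of the kept tokens.
    s=s.lower()
    s=WordPunctTokenizer().tokenize(s)
    s_embedding=np.zeros([200,])  # np.empty would start from uninitialized memory
    stop_words=set(stopwords.words('english'))
    count=0
    for word in s:
        # Skip stop words and tokens missing from the GloVe vocabulary.
        if word not in stop_words and word in wv_from_bin:
            s_embedding+=wv_from_bin.get_vector(word)
            count+=1
    if count==0:
        raise ValueError("no in-vocabulary, non-stop-word tokens in sentence")
    s_embedding=s_embedding/count
    return s_embedding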

s1="John likes to watch movies. Mary likes movies too."
s2="John likes to watch films. So does mary." 
s3="Mary likes to watch football games."
counter_intuitive_sentences(s1,s2,s3)

In [None]:
#### YOUR EXAMPLE HERE ####
s1=""
s2=""
s3=""
#### BELOW SHOULD RETURN TRUE
print(counter_intuitive_sentences(s1,s2,s3))
