# Session Outline

Go through the rest of session 8:

- Word Sense Disambiguation 
- Named Entity Recognition
- Entity Linking

New Stuff:

- Word-embeddings and Vector Space
- Cosine Similarity

Homework:
- Session 8 exercises + Session 9 exercises = Homework today


In [None]:
import codecs, nltk, string
import gensim
import numpy as np
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

In [None]:


exclude = set(string.punctuation)
stop_word_list = stopwords.words('english')

# input should be a string - we need a simple pipeline for getting word embeddings
def nlp_simple_pipeline(text):
    
    # lowercase the text; whether this is right depends on whether the embedding vocabulary is lowercased too
    text = text.lower()
    
    text = nltk.word_tokenize(text)
        
    text = [token for token in text if token not in exclude and token.isalpha()]
    
    text = [token for token in text if token not in stop_word_list]

    return text

### How do we know if words are related?

- Similarity: two words share a high number of salient features (e.g., synonyms)
- Relatedness: two words are semantically associated without necessarily being similar (e.g., car and pilot)

### Why do we need embeddings?

- To capture the meaning of a word as a position in a vector space
- Words that appear in similar contexts occupy nearby positions in that space

## Vector Space Model

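The most basic and naive method for transforming words into vectors is to count the occurrences of each word in each document: **count vectorizing** (a bag-of-words representation). It is closely related to **one-hot encoding**: each document vector is the sum of the one-hot vectors of its words.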


In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# create CountVectorizer object
vectorizer = CountVectorizer()
corpus = [
          'Text of first document.',
          'Text of the second document made longer.',
          'Number three.',
          'This is number four.',
]

# learn the vocabulary and store CountVectorizer sparse matrix in X
X = vectorizer.fit_transform(corpus)

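# columns of X correspond to the result of this method
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
vectorizer.get_feature_names_out().tolist() == (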
    ['document', 'first', 'four', 'is', 'longer',
     'made', 'number', 'of', 'second', 'text',
     'the', 'this', 'three'])
# retrieving the matrix in the numpy form
X.toarray()


# transforming a new document according to the learned vocabulary
#vectorizer.transform(['A new document.']).toarray()


array([[1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]], dtype=int64)

- The idea is to collect a set of documents (they can be words, sentences, paragraphs or even articles) and count the occurrence of every word in them. 
- Strictly speaking, the columns of the resulting matrix are words and the rows are documents.
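
To make the counting step concrete, here is a minimal pure-Python sketch of what a count vectorizer does. The function name `count_vectorize` and the variable `toy_corpus` are made up for this illustration. The regular expression is an assumption chosen to mirror `CountVectorizer`'s default behaviour of lowercasing and keeping tokens of two or more characters.

```python
from collections import Counter
import re


def count_vectorize(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in docs[i]."""
    # lowercase each document and keep tokens with at least two word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # the vocabulary is the alphabetically sorted set of all tokens
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


toy_corpus = [
    'Text of first document.',
    'Text of the second document made longer.',
    'Number three.',
    'This is number four.',
]
vocabulary, matrix = count_vectorize(toy_corpus)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary word, the same layout as the scikit-learn output above.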

**This is, however, a sparse representation**: most of its values are zero.
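
As a rough check of how sparse this is, the sketch below counts the zeros in the 4x13 count matrix printed earlier. The matrix is copied in by hand so the snippet runs on its own, and the variable names are only illustrative.

```python
import numpy as np

# the dense count matrix from the CountVectorizer example, copied by hand
dense = np.array([
    [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
])

nonzero = np.count_nonzero(dense)          # entries a sparse matrix has to store
zero_fraction = 1 - nonzero / dense.size   # share of entries that are zero

print(nonzero, "non-zero entries out of", dense.size)
print("fraction of zeros:", round(zero_fraction, 3))
```

Even in this tiny corpus about two thirds of the entries are zero. With a realistic vocabulary of tens of thousands of words, the share of zeros is far higher, which is why scikit-learn returns a sparse matrix.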

In [8]:
X = vectorizer.fit_transform(corpus)
X  # sparse count matrix covering all sentences in the corpus

<4x13 sparse matrix of type '<class 'numpy.int64'>'
	with 17 stored elements in Compressed Sparse Row format>

In [9]:
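# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorizer.get_feature_names_out().tolist()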

['document',
 'first',
 'four',
 'is',
 'longer',
 'made',
 'number',
 'of',
 'second',
 'text',
 'the',
 'this',
 'three']

In [10]:
X

<4x13 sparse matrix of type '<class 'numpy.int64'>'
	with 17 stored elements in Compressed Sparse Row format>

In [6]:
X.toarray()

array([[1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]], dtype=int64)

# Word Embeddings

- Lower-dimensional, dense vectors

Word2vec trains a shallow neural network on one of two tasks:
- CBOW (continuous bag of words): predicting a word, given its context
- Skip-gram: predicting the context words, given a word in the sentence
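
The two training tasks listed above are usually called CBOW and skip-gram. The sketch below only shows how training examples for each task could be generated from one tokenized sentence. It does not train a network, and the function names and the `window` parameter are assumptions made for this illustration.

```python
def cbow_examples(tokens, window=2):
    """CBOW-style examples: (context words, word to predict)."""
    examples = []
    for i, target in enumerate(tokens):
        # up to `window` words on each side of the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples


def skipgram_pairs(tokens, window=2):
    """Skip-gram-style examples: (given word, context word to predict)."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["the", "winter", "is", "coming"]
cbow = cbow_examples(sentence, window=1)
pairs = skipgram_pairs(sentence, window=1)
print(cbow)
print(pairs)
```

In both cases the network never sees a definition of a word, only which words occur near it. That is how words with similar contexts end up with similar vectors.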

## You can train your own embeddings: for example on political text, manifestos etc.

## Let's see how we can use them!

In [None]:

# the model maps each word to its embedding vector (word -> embedding)
small_model = gensim.models.KeyedVectors.load_word2vec_format('/Users/Ashrakat/Desktop/small-embeddings.txt', binary=False)

In [None]:
#to see the embeddings of a word, you just do:

print (small_model["clinton"])
print (small_model["obama"])

In [None]:
# KeyedVectors exposes most_similar directly; in gensim 4 it has no .wv attribute
small_model.most_similar(positive=['obama'])


In [None]:
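# get relatedness as the cosine similarity between two word vectors
# (called directly on the KeyedVectors object; in gensim 4 it has no .wv attribute)

print (small_model.similarity('clinton', 'clinton'))
print (small_model.similarity('clinton', 'obama'))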

In [None]:

# you can represent the meaning of an article by the average of its word embeddings
# let's compute the embedding for one article

dataset = codecs.open("/Users/Ashrakat/Desktop/rt_dataset.tsv", "r", "utf-8").read().strip().split("\n")

article = dataset[4].split("\t")[3]

cleaned_article = nlp_simple_pipeline(article)
print (cleaned_article)

In [None]:
# for each word, load its embedding
# note: this raises a KeyError as soon as a word is missing from the model's vocabulary
for word in cleaned_article:
    print (word)
    embed_word = small_model[word]

In [None]:
# handling exceptions
for word in cleaned_article:
    try:
        embed_word = small_model[word]
    except KeyError:
        print (word)
        continue

## Representing the meaning of an article as a single embedding
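
Before using the real model, here is a small sketch of the averaging idea on hand-made vectors. The dictionary `toy_vectors` and the function `average_embedding` are hypothetical stand-ins for the loaded embedding model and the notebook's own function, and the 3-dimensional vectors are invented for the example.

```python
import numpy as np

# invented 3-dimensional "embeddings" standing in for a real model
toy_vectors = {
    "sun":    np.array([1.0, 0.0, 2.0]),
    "summer": np.array([3.0, 2.0, 0.0]),
}


def average_embedding(tokens, vectors):
    """Average the vectors of all known tokens; return None if none are known."""
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        return None
    # mean over the word axis, reshaped to (1, n_dims) as scikit-learn expects
    return np.mean(found, axis=0).reshape(1, -1)


doc_vec = average_embedding(["sun", "summer", "unknownword"], toy_vectors)
print(doc_vec)  # → [[2. 1. 1.]]
```

Words that are missing from the vocabulary are skipped, so they do not affect the average.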


In [None]:

def article_embedding(cleaned_article):
    
    article_embedd = []
    # for each word in the article, you take the embeddings
    for word in cleaned_article:
        try:
            embed_word = small_model[word]
            article_embedd.append(embed_word)
        except KeyError:
            continue
    
    # average vectors of all words
    avg = [float(sum(col))/len(col) for col in zip(*article_embedd)]
    avg = np.array(avg).reshape(1, -1)
    return avg

In [None]:
# take the content column (index 3) of the row, as in the earlier cell
article = dataset[1].split("\t")[3]
cleaned_article = nlp_simple_pipeline(article)
embed_art = article_embedding(cleaned_article)

In [None]:
# inspect the article embedding computed in the previous cell
embed_art

# Cosine Similarity

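- Cosine similarity measures how similar two vectors are by the cosine of the angle between them in a multi-dimensional space.
- The smaller the angle, the higher the cosine similarity: 1 means the same direction, 0 means orthogonal vectors, and -1 means opposite directions.
- We compute it with `cosine_similarity` from the scikit-learn library.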
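
For intuition, cosine similarity can be written out by hand as `cos(theta) = (u . v) / (|u| * |v|)`. The sketch below implements that formula with NumPy. The helper name `cosine` is made up here, and the example vectors are chosen only because their angles are easy to see.

```python
import numpy as np


def cosine(u, v):
    """Cosine of the angle between two non-zero 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


same_direction = cosine([1, 0], [2, 0])   # angle 0, vector length does not matter
orthogonal = cosine([1, 0], [0, 3])       # angle 90 degrees
opposite = cosine([1, 0], [-1, 0])        # angle 180 degrees
diagonal = cosine([1, 1], [1, 0])         # angle 45 degrees

print(same_direction, orthogonal, opposite, round(diagonal, 4))
```

Because the formula divides by both vector lengths, only the direction of the vectors matters. This is convenient when comparing a short query with a long article.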

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
image = mpimg.imread("/Users/Ashrakat/Desktop/cosines1.png")
plt.imshow(image)
plt.gcf().set_size_inches(15, 10)
plt.show()

Image source: Google Images, as used by Dhruvil Karani in his Medium article

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
image = mpimg.imread("/Users/Ashrakat/Desktop/cosines.png")
plt.imshow(image)
plt.gcf().set_size_inches(15, 10)
plt.show()

Source: http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/

In [None]:
def nlp_simple_pipeline(text):
    
    #it depends if the words have been lowercased or not
    text = text.lower()
    
    text = nltk.word_tokenize(text)
        
    text = [token for token in text if token not in exclude and token.isalpha()]
    
    text = [token for token in text if token not in stop_word_list]

    return text


def article_embedding(cleaned_article):
    
    article_embedd = []
    # for each word in the article, you take the embeddings
    for word in cleaned_article:
        try:
            embed_word = small_model[word]
            article_embedd.append(embed_word)
        except KeyError as e:
            print (e,word)
            continue
    
    # average vectors of all words
    avg = [float(sum(col))/len(col) for col in zip(*article_embedd)]
    avg = np.array(avg).reshape(1, -1)
    return avg

In [None]:
Document_A = "The sun is shining today I want to go out"
Document_B = "What a beautiful summer day."
Document_C = "The winter is coming"
Document_D = "It is snowing over here"


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

doc_a=nlp_simple_pipeline(Document_A)
doc_b=nlp_simple_pipeline(Document_B)
doc_c=nlp_simple_pipeline(Document_C)
doc_d=nlp_simple_pipeline(Document_D)

doc_a=article_embedding(doc_a)
doc_b=article_embedding(doc_b)
doc_c=article_embedding(doc_c)
doc_d=article_embedding(doc_d)




In [None]:
print("compare a and b",cosine_similarity(doc_a, doc_b))
print("compare a and c",cosine_similarity(doc_a, doc_c))
print("compare c and d",cosine_similarity(doc_c, doc_d))


In [None]:
#define that you need to exclude punctuation
exclude = set(string.punctuation)

# this represents any text as a single "doc-embedding"; we use it both for the query and for the sentences
# input should be a string
def text_embedding(text):
    
    #this works to lower text
    text = text.lower()
    
    # we tokenize the text in single words
    text = nltk.tokenize.WordPunctTokenizer().tokenize(text)
    
    # we remove numbers and punctuation
    text = [token for token in text if token not in exclude and token.isalpha()]
    
    doc_embed = []
    
    # for each word we get the embedding and we append it to a list
    for word in text:
        try:
            # look the word up in the embedding model loaded earlier
            embed_word = small_model[word]
            doc_embed.append(embed_word)
        except KeyError as e:  # the word is not in the vocabulary: report it and move on
            print(e, word)
            continue
    # we average the embeddings of all the words, getting an overall doc embedding
    if len(doc_embed)>0:
        avg = [float(sum(col))/len(col) for col in zip(*doc_embed)]

        avg = np.array(avg).reshape(1, -1)

        # the output is a doc-embedding
        return avg
    else:
        return "Empty"

In [None]:
text_query = ["crime","criminal","murder","drugs","rape"]

#query = [" ".join(nlp_pipeline(" ".join(text_query)))]

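# clean the query first, then average the embeddings of its tokens
# (article_embedding expects a list of tokens, not a string)
query_tokens = nlp_simple_pipeline(" ".join(text_query))
emb_query = article_embedding(query_tokens)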

emb_query

In [None]:
import csv

all_lines = []
# the with-block closes the file even if reading fails
with open("/Users/Ashrakat/Desktop/rt_dataset.tsv", encoding="utf-8", newline="") as tsv_file:
    read_tsv = csv.reader(tsv_file, delimiter="\t")
    for line in read_tsv:
        print(line)
        all_lines.append(line)

In [None]:
all_lines=all_lines[1:101]
len(all_lines)

In [None]:
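# x is one row of the dataset (a list of columns); x[3] is the article content,
# and its embedding is appended to the row as a new last element
embs_corpus = [x + [text_embedding(x[3])] for x in all_lines]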

In [None]:
embs_corpus

In [None]:
import codecs, nltk, string, os, gensim
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

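# x[4] is the document embedding; skip rows where text_embedding returned "Empty"
scores = [x + [cosine_similarity(x[4], emb_query)[0][0]]
          for x in embs_corpus if not isinstance(x[4], str)]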
scores

In [None]:
#inspired by fede
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
image = mpimg.imread("/Users/Ashrakat/Desktop/screen.png")
plt.imshow(image)
plt.gcf().set_size_inches(15, 10)
plt.show()


# Exercises:

**Exercise 1**

Using Pandas and sascat_excerpts


- Use the word embeddings on your content column: create a new column that holds the embedding of each row's content

**Exercise 2**

Use the rt_dataset.tsv

a)
- write a single function that takes a law/document (one row), cleans the article and creates the article embedding
- you can consider only the first 100 articles for speed

b)
- create a dictionary of article embeddings [key: number of the article, value: the embedding] (of course, here you do that for the content)
- bonus, only if you can: make the dictionary nested, i.e. a dictionary within a dictionary, so that item 0 holds a dictionary with the embedding and the title of the article. For example, the keys can be `embed` (value: the embedding) and `title` (value: the title). If you can't do this step, that's okay; you can continue with the rest
- extract the embedding of article 4
- calculate the cosine similarity between that article (article 4) and all the values in your dictionary, and save them in a list of lists [if you have created the title, include it in this list]
- sort your values

In [None]:
def create_doc_embedding(cleaned_article):
    
    # ....
    
    return doc_emb

In [None]:
def article_embedding(cleaned_article):
    
    article_embedd = []
    # for each word in the article, you take the embeddings
    for word in cleaned_article:
        try:
            embed_word = small_model[word]
            article_embedd.append(embed_word)
        except KeyError:
            continue
    
    # average vectors of all words
    avg = [float(sum(col))/len(col) for col in zip(*article_embedd)]
    avg = np.array(avg).reshape(1, -1)
    return avg

In [None]:
#dictionary of article embeddings
articles_embeddings = {}

# you can limit this to the first 100 articles with
#for k in range(len(dataset[:100])):

for k in range(len(dataset[:1000])):
    columns = dataset[k].split("\t")
    title = columns[1]
    # embed only the content column (index 3), not the whole row
    cleaned_article = nlp_simple_pipeline(columns[3])
    embed_art = article_embedding(cleaned_article)
    articles_embeddings[str(k)] = {"title": title, "embed": embed_art}

In [None]:
articles_embeddings

In [None]:
our_article_embedd = articles_embeddings["4"]["embed"]
print (our_article_embedd)

In [None]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics.pairwise import cosine_similarity
# cosine_similarity takes two 2D arrays of shape (n_samples, n_features)
# and returns the matrix of pairwise similarities between their rows

ranking = []
# items() iterates over the dictionary's (key, value) pairs
for article, values in articles_embeddings.items():
    # both inputs hold a single row, so the result is a 1x1 matrix; [0][0] extracts the score
    similarity_score = cosine_similarity(values["embed"], our_article_embedd)[0][0]
    ranking.append([values["title"], similarity_score])
    
print(ranking)

In [None]:
ranking.sort(key=lambda x: x[1],reverse=True)

ranking

In [None]:
for el,score in ranking[:10]:
    print (el,score)
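
The ranking loop above can also be written without an explicit Python loop. The sketch below is one possible vectorised version on invented data: the titles and 2-dimensional vectors are placeholders, not rows from the dataset, and it assumes no embedding is the zero vector.

```python
import numpy as np

# placeholder titles and embeddings, one row per article
titles = ["article a", "article b", "article c"]
embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
query = np.array([1.0, 0.0])  # embedding of the article we compare against

# cosine similarity of every row with the query in one step
norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
scores = embeddings @ query / norms

# sort indices by descending score and pair each title with its score
order = np.argsort(-scores)
ranking = [(titles[i], float(scores[i])) for i in order]

for title, score in ranking:
    print(title, round(score, 4))
```

Stacking the embeddings into one matrix only helps once there are many articles. For the 100 articles used here, the explicit loop is just as good and easier to read.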