# Prepare the data


## Reading the text

Load the data (see .zip file in Learnweb) into the environment of this notebook. 

If you are using Google Colab, you can upload the file as follows:

In [1]:
from google.colab import files

uploaded = files.upload()

Saving movie_lines.tsv.zip to movie_lines.tsv.zip


Extract the contents of the zip file.

In [2]:
from zipfile import ZipFile

with ZipFile('movie_lines.tsv.zip', 'r') as zip_file:
    zip_file.extractall()

Read the file and extract the text.

Note: reading the file with pandas or the csv module does not work well, because the file is not well-formed TSV.

In [3]:
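movie_lines = []
# Read line by line; the last tab-separated field holds the spoken text.
with open('movie_lines.tsv') as f:
    for line in f:
        line = line.strip()
        # Strip matching outer quotes; the length check avoids an IndexError on very short lines.
        while len(line) >= 2 and line[0] == '"' and line[-1] == '"':
            line = line[1:-1]
        movie_lines.append(line.split('\t')[-1])
# Unescape doubled quotes and collapse double spaces.
movie_lines = [l.replace('""', '"').replace('  ', ' ') for l in movie_lines]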
len(movie_lines)

304713
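
To see what this cleanup does, here is a minimal sketch applied to two made-up lines. The sample strings and the helper name `extract_text` are illustrative assumptions, not rows from the dataset or part of the notebook. The sketch includes a length check so that empty or one-character lines do not raise an IndexError.

```python
def extract_text(raw_line):
    """Return the last tab-separated field of a line, with outer quotes removed."""
    line = raw_line.strip()
    # Peel off matching outer double quotes; the length check guards short lines.
    while len(line) >= 2 and line[0] == '"' and line[-1] == '"':
        line = line[1:-1]
    text = line.split('\t')[-1]
    # Unescape doubled quotes and collapse double spaces.
    return text.replace('""', '"').replace('  ', ' ')

# Hypothetical sample rows, invented for illustration only.
samples = [
    'L1\tu0\tm0\tSPEAKER\tThey do not!\n',
    '"L2\tu1\tm0\tSPEAKER\tShe said ""hi""  to me."\n',
]
cleaned = [extract_text(s) for s in samples]
print(cleaned)
```

The second sample shows the case that trips up a standard TSV reader: the whole row is wrapped in quotes and the inner quotes are doubled.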

## Pre-processing

Tokenization:

In [5]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [6]:
sentences = []
for line in movie_lines:
    line_sents = sent_tokenize(line)
    for sent in line_sents:
        sentences.append(tokenizer.tokenize(sent))
len(sentences)

514866

In [7]:
sentences[76:81]

[['this', 'is', 'my', 'head'],
 ['let', 'me', 'see', 'what', 'i', 'can', 'do', '.'],
 ['gosh', 'if', 'only', 'we', 'could', 'find', 'kat', 'a', 'boyfriend', '...'],
 ["that's", 'a', 'shame', '.'],
 ['unsolved', 'mystery', '.']]

In [8]:
# total number of tokens (word occurrences)
len([c for clist in sentences for c in clist])

3829713

In [9]:
# number of types (unique words)
from collections import Counter

words = Counter(c for clist in sentences for c in clist)
len(words)

61245
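
The difference between the two counts above (every occurrence versus distinct words) can be illustrated on a toy example. The two sentences below are made up; only `Counter` is taken from the notebook.

```python
from collections import Counter

# Toy tokenized sentences, invented for illustration.
toy_sentences = [
    ['i', 'saw', 'the', 'saw', '.'],
    ['i', 'saw', 'you', '.'],
]

# Flatten the sentences and count every occurrence of every word.
toy_counts = Counter(tok for sent in toy_sentences for tok in sent)

total_occurrences = sum(toy_counts.values())  # every word occurrence
distinct_words = len(toy_counts)              # unique words only

print(total_occurrences, distinct_words)  # prints: 9 5
```

Summing the values of the `Counter` gives the same number as flattening the list and taking its length, so one pass over the data is enough for both counts.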

In [10]:
str(words)[:154]

"Counter({'.': 340514, 'you': 128297, '?': 110199, 'i': 103456, 'the': 99061, 'to': 80585, 'a': 71092, '-': 55093, '...': 51426, 'it': 47335, 'and': 45812,"

Load the Flair forward and backward embeddings:

In [99]:
from flair.embeddings import FlairEmbeddings
from flair.data import Sentence

# init Flair forward and backwards embeddings
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')



In [18]:
from flair.embeddings import StackedEmbeddings

# create a StackedEmbeddings object that combines the forward and backward Flair embeddings
# (no GloVe embeddings are included in this list)
stacked_embeddings = StackedEmbeddings([
    flair_embedding_forward,
    flair_embedding_backward,
])

In [21]:
words.items()




saw: count = 1558. It has two meanings: the past tense of "to see" and a tool used for cutting.

teeth: count = 172. It has one meaning.

Collect the sentences that contain the target word, then store the Flair embedding of that word from each sentence in a separate list.

In [58]:
sentences_new = []
sent_array = []
for sentence in sentences:
  if 'teeth' in sentence:
    sentences_new.append(' '.join(sentence))
    sent_array.append(sentence)

print(sentences_new[12])

what about his teeth ?


In [56]:
len(sentences_new)

172

In [50]:
sentences_new2 = []
for sentence in sentences:
  if 'saw' in sentence:
    sentences_new2.append(' '.join(sentence))

print(sentences_new2[1])
print(len(sentences_new2))

i saw your car .
1538
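
Note that the 1538 sentences found here are fewer than the 1558 occurrences of saw counted earlier. A likely explanation (an assumption, not verified against the data) is that some sentences contain the word more than once. The toy sketch below, with invented sentences, shows the effect.

```python
# Toy tokenized sentences, invented for illustration.
toy_sentences = [
    ['i', 'saw', 'what', 'you', 'saw', '.'],
    ['we', 'saw', 'it', '.'],
    ['nothing', 'here', '.'],
]
target = 'saw'

# Sentences that contain the target at least once (what the membership test counts).
n_sentences = sum(1 for sent in toy_sentences if target in sent)

# Total occurrences of the target across all sentences (what the Counter reports).
n_occurrences = sum(sent.count(target) for sent in toy_sentences)

print(n_sentences, n_occurrences)  # prints: 2 3
```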


In [86]:
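x = []
for i in sentences_new:
  # wrap the joined sentence so Flair can embed it in context
  sent = Sentence(i)
  stacked_embeddings.embed(sent)
  for token in sent:
    # keep only the contextual embedding of the target word
    if token.text == 'teeth':
      x.append(token.embedding)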



In [87]:
len(x)


172

In [94]:
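emb_list = []
for i in sentences_new2:
  # wrap the joined sentence so Flair can embed it in context
  sent2 = Sentence(i)
  stacked_embeddings.embed(sent2)
  for token in sent2:
    # keep only the contextual embedding of the target word
    if token.text == 'saw':
      emb_list.append(token.embedding)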

In [95]:
len(emb_list)

1558

Calculate the centroid (mean vector) of each list of embeddings.

In [90]:
import torch

mean_teeth = torch.mean(torch.stack(x), dim=0)

print(mean_teeth)

tensor([-0.0015,  0.0007,  0.0061,  ..., -0.0007, -0.0403,  0.0016])


In [96]:
mean_saw = torch.mean(torch.stack(emb_list), dim=0)

print(mean_saw)

tensor([-0.0027, -0.0089, -0.0079,  ...,  0.0177, -0.0153,  0.0026])


Find the standard deviation of the cosine similarities between each embedding and its centroid.

In [97]:
sim = []
cos = torch.nn.CosineSimilarity(dim = 0)
for i in x:
  sim.append(cos(i,mean_teeth))

std_teeth = torch.std(torch.stack(sim), dim=0)
print("STD of word teeth ",std_teeth)

STD of word teeth  tensor(0.0601)


In [98]:
sim2 = []

for i in emb_list:
  sim2.append(cos(i,mean_saw))

std_saw = torch.std(torch.stack(sim2), dim=0)
print("STD of word saw ",std_saw)

STD of word saw  tensor(0.0770)
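
The measure used above can be sketched without Flair or PyTorch. The example below uses made-up two-dimensional vectors and the standard library only; the function names are illustrative and not part of the notebook. `statistics.stdev` applies Bessel's correction, which matches the default behaviour of `torch.std`.

```python
import math
import statistics

def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def similarity_spread(vectors):
    """Sample standard deviation of each vector's cosine similarity to the centroid."""
    c = centroid(vectors)
    return statistics.stdev([cosine(v, c) for v in vectors])

# Made-up vectors: one tight cluster, one split into two directions.
tight = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
split = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

print(similarity_spread(tight), similarity_spread(split))
```

With these toy inputs the split set gives a larger spread than the tight cluster, which is the pattern the comparison between saw and teeth is looking for.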


The standard deviation is higher for the word with multiple meanings (0.0770 for saw versus 0.0601 for teeth). This is what we would expect, since such a word takes on different meanings in different contexts. However, I expected the difference between the two standard deviations to be larger.