<a href="https://colab.research.google.com/github/volgasezen/is584/blob/main/Lab 5/2 - Sub_and_negative_sampling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 style="margin-bottom:0">IS 584: Deep Learning for Text Analytics</center></h1>
<br>
<h3 style="margin-top:0">Lab 5: Subsampling and Negative Sampling</center></h2>
<h4 style="margin-top:0">Given by Volga Sezen</center></h4>

<i>Thanks to Arif Ozan Kızıldağ</i>

In this tutorial, we will talk about negative sampling and sub-sampling. For this purpose, we will utilize The Children’s [Books Test(CBT) dataset](https://www.kaggle.com/datasets/amoghjrules/babi-childrens-books-facebool-ai). First, let us load the dataset and do some data cleaning.

In [None]:
!wget  -P './data' "https://raw.githubusercontent.com/volgasezen/is584/main/Lab 5/data/text.txt"

In [1]:
with open('data/text.txt', 'r') as f:
    text = f.read()

In [2]:
print(text[:1000])

_BOOK_TITLE_ : Andrew_Lang___The_Grey_Fairy_Book.txt.out
DONKEY SKIN There was once upon a time a king who was so much beloved by his subjects that he thought himself the happiest monarch in the whole world , and he had everything his heart could desire .
His palace was filled with the rarest of curiosities , and his garden with the sweetest flowers , while the marble stalls of his stables stood a row of milk-white Arabs , with big brown eyes .
Strangers who had heard of the marvels which the king had collected , and made long journeys to see them , were , however , surprised to find the most splendid stall of all occupied by a donkey , with particularly large and drooping ears .
It was a very fine donkey ; but still , as far as they could tell , nothing so very remarkable as to account for the care with which it was lodged ; and they went away wondering , for they could not know that every night , when it was asleep , bushels of gold pieces tumbled out of its ears , which were picked 

In [3]:
import re
import nltk
import numpy as np
import pandas as pd

In [4]:
text =re.sub(r"(?m)^(\_BOOK_TITLE\_|CHAPTER).*\n?","",text,re.MULTILINE) # some preprocessing but not all
text =re.sub(r"(?m)-LCB.*RCB-","",text,re.MULTILINE)
text[:1000]

'DONKEY SKIN There was once upon a time a king who was so much beloved by his subjects that he thought himself the happiest monarch in the whole world , and he had everything his heart could desire .\nHis palace was filled with the rarest of curiosities , and his garden with the sweetest flowers , while the marble stalls of his stables stood a row of milk-white Arabs , with big brown eyes .\nStrangers who had heard of the marvels which the king had collected , and made long journeys to see them , were , however , surprised to find the most splendid stall of all occupied by a donkey , with particularly large and drooping ears .\nIt was a very fine donkey ; but still , as far as they could tell , nothing so very remarkable as to account for the care with which it was lodged ; and they went away wondering , for they could not know that every night , when it was asleep , bushels of gold pieces tumbled out of its ears , which were picked up each morning by the attendants .\nAfter many years

In [5]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
nltk.download("popular")

[nltk_data] Error loading popular: <urlopen error [WinError 10054] An
[nltk_data]     existing connection was forcibly closed by the remote
[nltk_data]     host>


False

In [6]:
sentence_tokens= sent_tokenize(text.lower())
sentence_tokens[0:10]

['donkey skin there was once upon a time a king who was so much beloved by his subjects that he thought himself the happiest monarch in the whole world , and he had everything his heart could desire .',
 'his palace was filled with the rarest of curiosities , and his garden with the sweetest flowers , while the marble stalls of his stables stood a row of milk-white arabs , with big brown eyes .',
 'strangers who had heard of the marvels which the king had collected , and made long journeys to see them , were , however , surprised to find the most splendid stall of all occupied by a donkey , with particularly large and drooping ears .',
 'it was a very fine donkey ; but still , as far as they could tell , nothing so very remarkable as to account for the care with which it was lodged ; and they went away wondering , for they could not know that every night , when it was asleep , bushels of gold pieces tumbled out of its ears , which were picked up each morning by the attendants .',
 'aft

In [7]:
word_token =[word_tokenize(token) for token in sentence_tokens]
print(word_token[0])

['donkey', 'skin', 'there', 'was', 'once', 'upon', 'a', 'time', 'a', 'king', 'who', 'was', 'so', 'much', 'beloved', 'by', 'his', 'subjects', 'that', 'he', 'thought', 'himself', 'the', 'happiest', 'monarch', 'in', 'the', 'whole', 'world', ',', 'and', 'he', 'had', 'everything', 'his', 'heart', 'could', 'desire', '.']


In [8]:
token = [tok  for sent in word_token  for tok in sent ]

In [9]:
words = tuple(set(token))
int2str = dict(enumerate(words))
str2int = {ch: i for i, ch in int2str.items()}

In [10]:
print('Length of the vocabulary: ', len(words))
words[0:10]

Length of the vocabulary:  5473


('discovered',
 'shrill',
 'sorcerer',
 'tread',
 'forgetting',
 'mountain',
 'wed',
 'risk',
 'cruellest',
 'dress')

Our first step is sub-sampling. In lectures, we have gone over how sub-sampling works in `Lecture 3, part 5`. Briefly, sub-sampling removes some more frequent words like `the` to create a more uniform dataset. For this purpose, we will create a probability function to see whether we will remove a token from our dataset. This probability will be checked for **each occurrence** of the token. 

In [11]:
from collections import Counter,defaultdict

wordFreq = defaultdict(int)

for sent in word_token:
    for word in sent:
        wordFreq[word] += 1

In [12]:
import math

In [13]:
totalWords = sum([freq for freq in wordFreq.values()])
wi = {word:(freq/totalWords) for word, freq in wordFreq.items()}
wordProb ={ word:(math.sqrt(wi[word]/0.001)+1)*0.001/wi[word]  for word in wi}

In [14]:
posSet = []  ## there is a problem in this approach
dropped = 0
# add positive examples
for sent in word_token:
    for i in range(1, len(sent)-1):
      if   np.random.rand()<wordProb[sent[i]]:
        word = sent[i]
        context_words = [sent[i-1], sent[i+1]]   
        for context in context_words:
            posSet.append((word, context))  # we are creating bi-grams for text generation task here
      else:
        dropped+=1
n_pos_examples = len(posSet)
print(dropped)
posSet[0:10]

33964


[('skin', 'donkey'),
 ('skin', 'there'),
 ('there', 'skin'),
 ('there', 'was'),
 ('was', 'there'),
 ('was', 'once'),
 ('once', 'was'),
 ('once', 'upon'),
 ('upon', 'once'),
 ('upon', 'a')]

There is a problem with the approach above. Do you see what it is?

In [15]:
posSet = [] 
dropped = 0
for sent in word_token:
  dum_sent = sent.copy()
  for i in range(len(dum_sent)-1):
    if   np.random.rand()>wordProb[dum_sent[i]]:
        dum_sent[i] = None
        dropped +=1
  for i in range(1, len(dum_sent)-2):
      if(dum_sent[i]!= None):
        if(dum_sent[i+1]!= None):
          posSet.append((dum_sent[i], dum_sent[i+1]))
        if(dum_sent[i-1]!= None):
          posSet.append((dum_sent[i], dum_sent[i-1]))
print(dropped)
posSet[0:10]

36105


[('skin', 'there'),
 ('skin', 'donkey'),
 ('there', 'skin'),
 ('once', 'upon'),
 ('upon', 'a'),
 ('upon', 'once'),
 ('a', 'time'),
 ('a', 'upon'),
 ('time', 'a'),
 ('time', 'a')]

In [16]:
n_pos_examples = len(posSet)
len(posSet)

87739

If you guessed that it keeps words as context you are right

Now that we finished our sub-sampling, we can do a negative sampling to enrich our data. Negative sampling is utilized to balance the positive examples with negatives so that our network will not overfit positive examples. This is again done by creating a probabilistic function to create examples. In the following code, we again create a word probability function. Then, utilizing this probability, we select create negative examples for each positive example.

In [17]:
totalWords = sum([freq**(3/4) for freq in wordFreq.values()])
wordProb = {word:(freq**(3/4)/totalWords) for word, freq in wordFreq.items()}

In [18]:
n_neg_examples = 0 # 5m run time
negSet = []
import tqdm

for i in tqdm.tqdm(range(n_pos_examples)):
  context=np.random.choice(list(wordProb.keys()), p=list(wordProb.values())) 
  while ((posSet[i][0],context)  in posSet):
    context=np.random.choice(list(wordProb.keys()), p=list(wordProb.values()))
  negSet.append((posSet[i][0], context))


100%|██████████| 87739/87739 [04:53<00:00, 298.86it/s]


In [19]:
len(negSet)

87739

In [20]:
pos_data = pd.DataFrame(posSet,columns=["word","context"])
pos_data["out"] = 1
pos_data.head()

Unnamed: 0,word,context,out
0,skin,there,1
1,skin,donkey,1
2,there,skin,1
3,once,upon,1
4,upon,a,1


In [21]:
neg_data = pd.DataFrame(negSet,columns=["word","context"])
neg_data["out"] = 0
neg_data.head()

Unnamed: 0,word,context,out
0,skin,demanding,0
1,skin,.,0
2,there,my,0
3,once,apartments,0
4,upon,sight,0


In [22]:
data = pd.concat([pos_data,neg_data],axis=0)
data.describe()

Unnamed: 0,out
count,175478.0
mean,0.5
std,0.500001
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0


In [23]:
data2 = data.copy()
data2["text"] =  data["word"]+' '+data["context"]
data2.head(1000)

Unnamed: 0,word,context,out,text
0,skin,there,1,skin there
1,skin,donkey,1,skin donkey
2,there,skin,1,there skin
3,once,upon,1,once upon
4,upon,a,1,upon a
...,...,...,...,...
995,skin,laid,1,skin laid
996,skin,the,1,skin the
997,laid,skin,1,laid skin
998,the,princess,1,the princess


In [24]:
data3 =data2.drop(columns=["context","word"])
data3 = data3[["text","out"]]
data3.head()

Unnamed: 0,text,out
0,skin there,1
1,skin donkey,1
2,there skin,1
3,once upon,1
4,upon a,1


In [25]:
data3.to_csv("data_all_val.csv",index=False)