# Training Bengali News Word Vectors

In this notebook, we will use the data we scraped from news websites to train a Word2Vec model for Bengali.

Then we will test the model to see how well it is performing.

First, we import the packages we need.

In [1]:
import json
import os
import re
import string
import numpy as np

from gensim.models import Word2Vec

Let's define a function that will read the data file and extract the fields we want.

In our case, we will use the article body for training.

In [2]:
def extract_text(filename, field):
    
    extracted_field=[]
    
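    # Read explicitly as UTF-8 so the Bengali text decodes the same way on every platform
    with open(os.path.join('data', filename), 'r', encoding='utf-8') as f: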
        articles=json.load(f)
    
    for article in articles['articles']:
        extracted_field.append(article[field].strip())
    
    return extracted_field
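The function above implies a particular file layout: a JSON object with a top-level `articles` list whose items carry the requested field. The sketch below illustrates that layout with a made-up two-article file written to a temporary directory. The sample content, the file name `sample_articles.txt`, and the helper name `extract_field` are assumptions for illustration only, not part of the scraped data.

```python
import json
import os
import tempfile

# Hypothetical sample mirroring the layout the reader expects:
# {"articles": [{"body": ..., ...}, ...]}
sample = {
    "articles": [
        {"title": "শিরোনাম", "body": "  প্রথম খবর।  "},
        {"title": "শিরোনাম ২", "body": "দ্বিতীয় খবর।\n"},
    ]
}

def extract_field(path, field):
    """Return the whitespace-stripped `field` of every article stored in `path`."""
    with open(path, "r", encoding="utf-8") as f:
        articles = json.load(f)
    return [article[field].strip() for article in articles["articles"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_articles.txt")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)
    bodies = extract_field(path, "body")

print(bodies)
```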

Now we define a function to preprocess our data.

The function does the following:
- It applies a list of (old, new) string replacements, which we use to turn special whitespace characters into plain spaces
- It removes emoji and emoticons from the text
- It removes English text, that is, ASCII letters and digits
- It collapses runs of whitespace into a single space

In [3]:
def replace_strings(texts, replace):
    new_texts=[]
    
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    # ASCII letters and digits; both cases are listed, so no IGNORECASE flag is needed
    english_pattern=re.compile('[a-zA-Z0-9]+')
    
    for text in texts:
        for r in replace:
            text=text.replace(r[0], r[1])
        text=emoji_pattern.sub(r'', text)
        text=english_pattern.sub(r'', text)
        text=re.sub(r'\s+', ' ', text).strip()
        new_texts.append(text)

    return new_texts
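To make the effect of these steps concrete, here is a minimal sketch that applies a cut-down version of the same idea to one made-up string. The helper name `clean`, the sample words, and the reduced emoji range are assumptions chosen for brevity; the full function above uses more ranges and a configurable replacement list.

```python
import re

# Reduced patterns for illustration: one emoticon range plus ASCII letters/digits
emoticons = re.compile("[\U0001F600-\U0001F64F]+")
ascii_alnum = re.compile("[a-zA-Z0-9]+")

def clean(text):
    """Drop emoticons and ASCII letters/digits, then collapse whitespace."""
    text = text.replace("\u200c", " ")   # zero-width non-joiner -> plain space
    text = emoticons.sub("", text)
    text = ascii_alnum.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample: three Bengali words mixed with an emoji, English words and digits
bengali_words = ["আজকের", "খবর", "কলকাতা"]
sample = (bengali_words[0] + " " + bengali_words[1]
          + " 😀 Breaking news 2018\n  " + bengali_words[2])

cleaned = clean(sample)
print(cleaned)
```

Only the Bengali words survive, separated by single spaces.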

We also need to remove punctuation from our data. The `remove_punc` function strips ASCII punctuation along with curly single quotes and em-dashes.

In [4]:
def remove_punc(sentences):
    new_sentences=[]
    # ASCII punctuation plus curly single quotes and the em-dash;
    # a set gives constant-time membership checks
    exclude = set(string.punctuation)
    exclude.update(["’", "‘", "—"])
    for sentence in sentences:
        s = ''.join(ch for ch in sentence if ch not in exclude)
        new_sentences.append(s)
    
    return new_sentences
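The same filtering can also be written with `str.translate`, which deletes every listed character in one pass. The sketch below is an alternative formulation, not the notebook's own code; `strip_punctuation` and the sample sentence are illustrative names and data. It also shows that the Bengali full stop (danda, U+0964) is not part of `string.punctuation`, so it survives this step and remains available for sentence splitting.

```python
import string

# Characters to delete: ASCII punctuation plus right/left curly quotes and the em-dash
EXTRA = "\u2019\u2018\u2014"
TABLE = str.maketrans("", "", string.punctuation + EXTRA)

def strip_punctuation(sentences):
    """Remove the listed punctuation characters from each sentence."""
    return [s.translate(TABLE) for s in sentences]

DANDA = "\u0964"  # Bengali full stop
sample = "তিনি বললেন, \u2018হ্যাঁ\u2019" + DANDA + " ঠিক আছে!"

result = strip_punctuation([sample])[0]
print(result)
print(DANDA in result)
```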

Let's extract the article bodies from Ebala and print one of them at each step to see how the text changes during preprocessing.

In [5]:
ebala_body=extract_text('ebala_articles.txt', 'body')

print("\x1b[31mCrawled Unprocessed Text\x1b[0m")
print(ebala_body[12])

replace=[('\u200c', ' '),  # zero-width non-joiner
         ('\u200d', ' '),  # zero-width joiner
         ('\xa0', ' '),    # no-break space
         ('\n', ' '),
         ('\r', ' ')]

ebala_body=remove_punc(ebala_body)

print("\x1b[31mSentences after removing all punctuations\x1b[0m")
print(ebala_body[12])

ebala_body=replace_strings(ebala_body, replace)

print("\x1b[31mSentences after replacing strings\x1b[0m")
print(ebala_body[12])

Crawled Unprocessed Text
সানিয়া মির্জা কি ইতিমধ্যেই সন্তানের জন্ম দিয়েছেন? হঠাৎই এমন গুজবে উত্তাল সোশ্যাল মিডিয়া। শেষ পর্যন্ত সেই গুজব খণ্ডনে আসরে নামতে হল শোয়েব মালিককে। তিনি টুইট করলেন, ‘‘আমরা সঠিকভাবে সকলকে জানাব যখন আমাদের সন্তান ভূমিষ্ঠ হবে। অনুগ্রহ করে আমাদের জন্য প্রার্থনা করবেন। প্লিজ ইন্টারনেটে যা দেখবেন/পড়বেন তা মোটেই বিশ্বাস করবেন না।’’

We will do a proper announcement when the kid decides to arrive, please keep us i

We apply the same preprocessing to the other two sources.

In [6]:
abz_body=extract_text('anandabazar_articles.txt', 'body')

abz_body=remove_punc(abz_body)
abz_body=replace_strings(abz_body, replace)

In [7]:
zee_body=extract_text('zeenews_articles.txt', 'body')

zee_body=remove_punc(zee_body)
zee_body=replace_strings(zee_body, replace)

In [8]:
body=[]
body.extend(zee_body)
body.extend(abz_body)
body.extend(ebala_body)

print(f"Total Number of training data: {len(body)}")

Total Number of training data: 14205


Finally, we split each article into sentences on the Bengali full stop (danda, `।`), drop sentences with fewer than two words, and split each remaining sentence into words.

Our final training data looks like this:

In [9]:
body=[article.split('।') for article in body]
body=[item for sublist in body for item in sublist]
body=[item.strip() for item in body if len(item.split())>1]

body=[item.split() for item in body]

print(body[:10])

[['যা', 'আমাদের', 'ত্যাগের', 'দিকে', 'তপস্যার', 'দিকে', 'নিয়ে', 'যায়', 'তাকেই', 'বলি', 'মনুষ্যত্ব', 'মানুষের', 'ধর্ম'], ['এইরকমই', 'এক', 'ধর্মের', 'কথা', 'ভেবেছিলেন', 'রবীন্দ্রনাথ'], ['কিন্তু', 'কী', 'ত্যাগ', 'কিসেরই', 'বা', 'তপস্যা', 'রবীন্দ্রনাথ', 'তাকে', 'বলছেন', 'যেখানে', 'আমিকে', 'নাআমির', 'দিকে', 'ছাড়তে', 'বাধা', 'পাই', 'তাকে', 'অহং', 'বেড়ায়', 'বিচ্ছিন্ন', 'সীমাবদ্ধ', 'করে', 'দেখি'], ['এক', 'আত্মলোকে', 'সকল', 'আত্মর', 'অভিমুখে', 'আত্মার', 'সত্য', 'এই', 'সত্
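The splitting step can be checked on a tiny input. The sketch below wraps the same logic in a hypothetical helper, `to_token_lists`, and runs it on one made-up article containing three sentences, one of which has only a single word and is therefore dropped.

```python
DANDA = "\u0964"  # Bengali full stop used as the sentence delimiter

def to_token_lists(articles):
    """Split articles into sentences on the danda, keep sentences with more
    than one word, and return each kept sentence as a list of tokens."""
    sentences = [s for article in articles for s in article.split(DANDA)]
    return [s.split() for s in sentences if len(s.split()) > 1]

# Made-up article: a 3-word sentence, a 1-word sentence, and a 2-word sentence
articles = ["আমি ভাত খাই" + DANDA + " হ্যাঁ" + DANDA + " সে যায়" + DANDA]

tokens = to_token_lists(articles)
print(tokens)
print([len(sentence) for sentence in tokens])
```

The trailing empty string produced by the final danda and the one-word sentence are both filtered out, leaving two token lists.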

Now that we have our preprocessed training data, we can start training our model.

We will generate a 200-dimensional embedding for each word, using a context window of up to 5 words on either side of it. Setting `min_count=1` keeps every word in the vocabulary, including words that appear only once.

In [10]:
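# `size` is the gensim 3.x name for the embedding dimension;
# gensim 4.0 and later renamed this parameter to `vector_size`
model = Word2Vec(body, size=200, window=5, min_count=1)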
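To illustrate what the `window` setting means, the sketch below lists which words count as context for each position in a sentence. It is a plain-Python illustration of the windowing idea only, not gensim's training code; `context_pairs` and the placeholder tokens are assumptions, and the real implementation may sample a smaller effective window for each word.

```python
def context_pairs(tokens, window):
    """Yield (center, context) pairs for every token at most `window` positions away."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

# Placeholder tokens standing in for the words of one sentence
sentence = ["w0", "w1", "w2", "w3", "w4"]
pairs = list(context_pairs(sentence, window=2))

print(len(pairs))
print([context for center, context in pairs if center == "w2"])
```

With `window=2`, the middle word sees two neighbours on each side, while words at the edges of the sentence see fewer.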

In [21]:
print("What are the words most similar to chele")
model.wv.most_similar('ছেলে', topn=5)

What are the words most similar to chele


[('মেয়ে', 0.9279561042785645),
 ('ভাই', 0.8839391469955444),
 ('বোন', 0.8741269111633301),
 ('বাবা', 0.8684152364730835),
 ('বন্ধু', 0.8629558682441711)]

In [20]:
print("What is Father + Girl - Boy =?")
model.wv.most_similar(positive=['বাবা', 'মেয়ে'], negative=['ছেলে'], topn=5)

What is Father + Girl - Boy =?


[('মা', 0.910052478313446),
 ('বাবামা', 0.8313219547271729),
 ('স্ত্রী', 0.7864214777946472),
 ('সন্তান', 0.7833602428436279),
 ('বন্ধুরা', 0.7827222347259521)]
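The analogy query can be pictured as vector arithmetic followed by a cosine-similarity ranking. The sketch below reproduces that idea on made-up three-dimensional vectors with English labels; the vectors, the `analogy` helper, and the axis meanings are all assumptions for illustration and are not taken from the trained model. It follows the general recipe of adding unit-length positive vectors, subtracting unit-length negative vectors, and ranking the remaining words by cosine similarity to the result.

```python
import math

def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up vectors (axes: adult, female, constant); purely illustrative
toy = {
    "father": [1.0, 0.0, 1.0],
    "mother": [1.0, 1.0, 1.0],
    "boy":    [0.0, 0.0, 1.0],
    "girl":   [0.0, 1.0, 1.0],
    "friend": [0.5, 0.5, 1.0],
}

def analogy(positive, negative, vectors):
    """Rank words not in the query by cosine similarity to
    sum(unit positives) - sum(unit negatives)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        query = [q + sign * x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    scored = [(w, dot(query, unit(vectors[w]))) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = analogy(["father", "girl"], ["boy"], toy)
print(ranking)
```

In this toy space "mother" ranks first, mirroring the Father + Girl - Boy result above.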

In [22]:
print('Find the odd one out')
model.wv.doesnt_match("কলকাতা চেন্নাই দিল্লি রবীন্দ্রনাথ".split())

Find the odd one out


'রবীন্দ্রনাথ'
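One way to think about the odd-one-out query is that the word least similar to the average of the group is returned. The sketch below implements that idea on made-up two-dimensional vectors; `odd_one_out`, the labels, and the numbers are assumptions for illustration, and the library's exact computation may differ in detail.

```python
import math

def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Made-up vectors: three points close together and one far away
toy = {
    "city_a": [1.0, 0.1],
    "city_b": [0.9, 0.2],
    "city_c": [1.0, 0.0],
    "poet":   [0.1, 1.0],
}

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group's mean direction."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

odd = odd_one_out(list(toy), toy)
print(odd)
```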

In [23]:
print("How similar are bengali and sweet?")
model.wv.similarity('বাঙালি', 'মিষ্টি')

How similar are bengali and sweet?


0.6867578

In [25]:
model.wv.save_word2vec_format('news_vector_text.txt', binary=False)
model.wv.save_word2vec_format('news_vector_binary.txt', binary=True)
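The text variant of the word2vec format is simple enough to read by hand: a header line with the vocabulary size and the vector dimension, followed by one line per word containing the word and its values separated by spaces. The sketch below parses a tiny in-memory example of that layout. The reader `read_word2vec_text` and the sample numbers are assumptions for illustration, and the reader assumes tokens contain no whitespace; in practice gensim's `KeyedVectors.load_word2vec_format` is the usual way to load these files back.

```python
import io

def read_word2vec_text(handle):
    """Parse the plain-text word2vec layout into a {word: [floats]} dict."""
    header = handle.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != count:
        raise ValueError(f"header promised {count} words, found {len(vectors)}")
    return vectors

# Made-up file content: 2 words, 3 dimensions each
sample = io.StringIO("2 3\nছেলে 0.1 0.2 0.3\nমেয়ে 0.4 0.5 0.6\n")

vectors = read_word2vec_text(sample)
print(len(vectors), [len(v) for v in vectors.values()])
```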

In [26]:
print("What about Bihari and Sweets?")
model.wv.similarity('বিহারি', 'মিষ্টি')

What about Bihari and Sweets?


0.5881788