# Assignment 6 - Creating a Hindi next word predictor using sequence modeling

by Shubham Goyal and Tripti Gupta

In this assignment our goal is to create a sequence model, using numpy implementation of RNN (Recurrent neural networks) to be able to predict the next word in a particular given language (here it is Hindi). We leverage `indic-nlp` library for tokenizing purposes, which is developed by AI4Bharat (a venture by IIT Madras) the documentation can be found [here](https://pypi.org/project/indic-nlp-library/). AI4Bharat has also fine tuned [fasttext](https://fasttext.cc/docs/en/support.html) for Hindi and made public access to the model and [embeddings](https://indicnlp.ai4bharat.org/pages/fasttext/) which we leverage going forward for modeling. 

In [None]:
pip install indic-nlp-library

Collecting indic-nlp-library

  Downloading indic_nlp_library-0.92-py3-none-any.whl (40 kB)

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.3/40.3 kB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m

[?25hCollecting sphinx-argparse (from indic-nlp-library)

  Downloading sphinx_argparse-0.4.0-py3-none-any.whl (12 kB)


Collecting morfessor (from indic-nlp-library)

  Downloading Morfessor-2.0.6-py3-none-any.whl (35 kB)






Collecting sphinx>=1.2.0 (from sphinx-argparse->indic-nlp-library)

  Downloading sphinx-7.2.6-py3-none-any.whl (3.2 MB)

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.2/3.2 MB[0m [31m33.8 MB/s[0m eta [36m0:00:00[0m


Collecting sphinxcontrib-applehelp (from sphinx>=1.2.0->sphinx-argparse->indic-nlp-library)

  Downloading sphinxcontrib_applehelp-1.0.7-py3-none-any.whl (120 kB)

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m120.0/120.0 kB[0m [31m10.7 MB/s[0m eta [36m0:00:00[0m

[?25hCollecting sphinxcont

 ### Training data - Harivansh Rai Bachchan's poems
 
We are leveraging poems by [Shree Harivansh Rai Bachchan](https://en.wikipedia.org/wiki/Harivansh_Rai_Bachchan) who is a well known Indian poet, his work in Hindi present a rich tapestry of emotion and meaning. Using the power of neural networks and natural language processing (NLP) let's craft a unique and interactive experience – a Hindi poetry chatbot inspired by the timeless verses of Harivansh Rai Bachchan. We are using 32 of his best poems as the dataset

In [None]:
# pip install stopwordsiso

Importing the required libraries

In [None]:
import csv
import itertools
import operator
import numpy as np
import nltk
import sys
from datetime import datetime
# from nltk.tokenize import word_tokenize
# from nltk.corpus import stopwords
from indicnlp.tokenize import sentence_tokenize, indic_tokenize
# from indicnlp.corpus import stopwords
# from stopwordsiso import stopwords

import matplotlib.pyplot as plt
%matplotlib inline

### Reading the text file

We created the data by manually scrapping the poems from the internet, the accumulated data is uploaded on kaggle and can be found [here](https://www.kaggle.com/datasets/shubh1596/hindi-poems) 

In [None]:
file_path = "/kaggle/input/hindi-poems/hindipoems.txt"

# Open the file with encoding set to UTF-8 (for Hindi text)
with open(file_path, 'r', encoding='utf-8') as file:
    hindi_text = file.read()

# The data has heading which indicate the source of the data. We will remove the heading.

start = len('Poems by Harivansh Rai Bachchan source<https://hindionlinejankari.com/harivansh-rai-bachchan-poems/> & <https://hindi-kavita.com/HindiPoetryHarivanshRaiBachchan.php>')
hindi_text = hindi_text[start+1:]
hindi_text

"\n\n•• अग्निपथ कविता ••\nवृक्ष हों भले खड़े,\nहों घने हों बड़े,\nएक पत्र छाँह भी,\nमाँग मत, माँग मत, माँग मत,\nअग्निपथ अग्निपथ अग्निपथ।\n\nतू न थकेगा कभी, तू न रुकेगा कभी,\nतू न मुड़ेगा कभी,\nकर शपथ, कर शपथ, कर शपथ,\nअग्निपथ अग्निपथ अग्निपथ।\n\nयह महान दृश्य है,\nचल रहा मनुष्य है,\nअश्रु श्वेत रक्त से,\nलथपथ लथपथ लथपथ,\nअग्निपथ अग्निपथ अग्निपथ।\n\n••• नीड़ का निर्माण •••\nनीड़ का निर्माण फिर-फिर,\nनेह का आह्णान फिर-फिर।\n\nवह उठी आँधी कि नभ में\nछा गया सहसा अँधेरा,\nधूलि धूसर बादलों ने\nभूमि को इस भाँति घेरा,\n\nरात-सा दिन हो गया, फिर\nरात आ\u200cई और काली,\nलग रहा था अब न होगा\nइस निशा का फिर सवेरा,\n\nरात के उत्पात-भय से\nभीत जन-जन, भीत कण-कण\nकिंतु प्राची से उषा की\nमोहिनी मुस्कान फिर-फिर\n\nनीड़ का निर्माण फिर-फिर,\nनेह का आह्णान फिर-फिर।\n\nवह चले झोंके कि काँपे\nभीम कायावान भूधर,\nजड़ समेत उखड़-पुखड़कर\nगिर पड़े, टूटे विटप वर,\n\nहाय, तिनकों से विनिर्मित\nघोंसलो पर क्या न बीती,\nडगमगा\u200cए जबकि कंकड़,\nईंट, पत्थर के महल-घर\n\nबोल आशा के विहंगम,\nकिस जगह पर तू छिपा था,\nजो गगन 

### Pre-processing the data

We start of basic text processing like removing punctuation, some redundant symbols and the headings of each poems which is rapped in '•' pattern

In [None]:
import re

hindi_text = hindi_text.replace(",","")
hindi_text = hindi_text.replace(":","")
hindi_text = hindi_text.replace(";","")
hindi_text = hindi_text.replace('\n', ' ')
hindi_text = hindi_text.replace('\u200c', '')
hindi_text = hindi_text.replace('\u200d', '')
pattern = r"•{2,3}.*?•{2,3}"
hindi_text = re.sub(pattern, "", hindi_text)
hindi_text

"   वृक्ष हों भले खड़े हों घने हों बड़े एक पत्र छाँह भी माँग मत माँग मत माँग मत अग्निपथ अग्निपथ अग्निपथ।  तू न थकेगा कभी तू न रुकेगा कभी तू न मुड़ेगा कभी कर शपथ कर शपथ कर शपथ अग्निपथ अग्निपथ अग्निपथ।  यह महान दृश्य है चल रहा मनुष्य है अश्रु श्वेत रक्त से लथपथ लथपथ लथपथ अग्निपथ अग्निपथ अग्निपथ।   नीड़ का निर्माण फिर-फिर नेह का आह्णान फिर-फिर।  वह उठी आँधी कि नभ में छा गया सहसा अँधेरा धूलि धूसर बादलों ने भूमि को इस भाँति घेरा  रात-सा दिन हो गया फिर रात आई और काली लग रहा था अब न होगा इस निशा का फिर सवेरा  रात के उत्पात-भय से भीत जन-जन भीत कण-कण किंतु प्राची से उषा की मोहिनी मुस्कान फिर-फिर  नीड़ का निर्माण फिर-फिर नेह का आह्णान फिर-फिर।  वह चले झोंके कि काँपे भीम कायावान भूधर जड़ समेत उखड़-पुखड़कर गिर पड़े टूटे विटप वर  हाय तिनकों से विनिर्मित घोंसलो पर क्या न बीती डगमगाए जबकि कंकड़ ईंट पत्थर के महल-घर  बोल आशा के विहंगम किस जगह पर तू छिपा था जो गगन पर चढ़ उठाता गर्व से निज तान फिर-फिर  नीड़ का निर्माण फिर-फिर नेह का आह्णान फिर-फिर।  क्रुद्ध नभ के वज्र दंतों में उषा है मुसकराती घोर गर्जनम

### Sentence and word tokenization

Once we get the cleaned data, we start with tokenization. First of the sentences, and then we'll tokenize each word in the sentences. As we will see going forward `Indic-nlp` library does a fantanstic job in correctly identifying the sentences.

In [None]:
# Tokenizing the sentences
sentences = sentence_tokenize.sentence_split(hindi_text, lang='hi')
sentences[0:10]

['वृक्ष हों भले खड़े हों घने हों बड़े एक पत्र छाँह भी माँग मत माँग मत माँग मत अग्निपथ अग्निपथ अग्निपथ।',
 'तू न थकेगा कभी तू न रुकेगा कभी तू न मुड़ेगा कभी कर शपथ कर शपथ कर शपथ अग्निपथ अग्निपथ अग्निपथ।',
 'यह महान दृश्य है चल रहा मनुष्य है अश्रु श्वेत रक्त से लथपथ लथपथ लथपथ अग्निपथ अग्निपथ अग्निपथ।',
 'नीड़ का निर्माण फिर-फिर नेह का आह्णान फिर-फिर।',
 'वह उठी आँधी कि नभ में छा गया सहसा अँधेरा धूलि धूसर बादलों ने भूमि को इस भाँति घेरा  रात-सा दिन हो गया फिर रात आई और काली लग रहा था अब न होगा इस निशा का फिर सवेरा  रात के उत्पात-भय से भीत जन-जन भीत कण-कण किंतु प्राची से उषा की मोहिनी मुस्कान फिर-फिर  नीड़ का निर्माण फिर-फिर नेह का आह्णान फिर-फिर।',
 'वह चले झोंके कि काँपे भीम कायावान भूधर जड़ समेत उखड़-पुखड़कर गिर पड़े टूटे विटप वर  हाय तिनकों से विनिर्मित घोंसलो पर क्या न बीती डगमगाए जबकि कंकड़ ईंट पत्थर के महल-घर  बोल आशा के विहंगम किस जगह पर तू छिपा था जो गगन पर चढ़ उठाता गर्व से निज तान फिर-फिर  नीड़ का निर्माण फिर-फिर नेह का आह्णान फिर-फिर।',
 'क्रुद्ध नभ के वज्र दंतों में उषा है मुसक

We earlier tried to identify stopwords, but later dropped the idea of dropping them instead (the irony). The reason was simple, we thought that the stopwords in next word prediction would also hold an importance 

In [None]:
# # List of common Hindi stopwords
# hindi_stopwords = {
#     'के', 'का', 'एक', 'में', 'की', 'है', 'हैं', 'और', 'से', 'हो', 'को', 'पर', 'इस', 'होते', 'कि',
#     'जो', 'कर', 'मे', 'गया', 'करने', 'किया', 'लिये', 'अपने', 'ने', 'बनी', 'नहीं', 'तो', 'ही',
#     'होती', 'अभी', 'जैसे', 'सभी', 'उनका', 'यही', 'थी', 'जब', 'हम', 'ना', 'इसका', 'था', 'जबकि',
#     'इसी', 'साथ', 'करते', 'कहा', 'ज़रा', 'आप', 'कुछ', 'किसी', 'ये', 'इसके', 'सबसे', 'इसमें',
#     'थे', 'दो', 'होने', 'वाले', 'कोई', 'व', 'अगर', 'उनकी', 'तरह', 'उस', 'आदि', 'कौन', 'सा'
# }

In [None]:
# stop_words_hindi = set(stopwords("hi"))
# len(stop_words_hindi)

We apped "SENTENCE_START" at the start of each sentence so that it is clearer to understand which are the words that come in the start. And for recognizing the end of sentence we rely on the symbols like  "।" (poornviram - full stop in hindi), '?' and '!'

In [None]:
# Remove stopwords from the sentence-tokenized text
filtered_sentences = []
for sentence in sentences:
    words = indic_tokenize.trivial_tokenize(sentence, lang='hi')
    # filtered_words = [word for word in words if word not in stop_words_hindi]
    words_with_markers = ['SENTENCE_START'] + words
    filtered_sentence = ' '.join(words_with_markers)
    filtered_sentences.append(filtered_sentence)

filtered_sentences[0:10]

['SENTENCE_START वृक्ष हों भले खड़े हों घने हों बड़े एक पत्र छाँह भी माँग मत माँग मत माँग मत अग्निपथ अग्निपथ अग्निपथ ।',
 'SENTENCE_START तू न थकेगा कभी तू न रुकेगा कभी तू न मुड़ेगा कभी कर शपथ कर शपथ कर शपथ अग्निपथ अग्निपथ अग्निपथ ।',
 'SENTENCE_START यह महान दृश्य है चल रहा मनुष्य है अश्रु श्वेत रक्त से लथपथ लथपथ लथपथ अग्निपथ अग्निपथ अग्निपथ ।',
 'SENTENCE_START नीड़ का निर्माण फिर - फिर नेह का आह्णान फिर - फिर ।',
 'SENTENCE_START वह उठी आँधी कि नभ में छा गया सहसा अँधेरा धूलि धूसर बादलों ने भूमि को इस भाँति घेरा रात - सा दिन हो गया फिर रात आई और काली लग रहा था अब न होगा इस निशा का फिर सवेरा रात के उत्पात - भय से भीत जन - जन भीत कण - कण किंतु प्राची से उषा की मोहिनी मुस्कान फिर - फिर नीड़ का निर्माण फिर - फिर नेह का आह्णान फिर - फिर ।',
 'SENTENCE_START वह चले झोंके कि काँपे भीम कायावान भूधर जड़ समेत उखड़ - पुखड़कर गिर पड़े टूटे विटप वर हाय तिनकों से विनिर्मित घोंसलो पर क्या न बीती डगमगाए जबकि कंकड़ ईंट पत्थर के महल - घर बोल आशा के विहंगम किस जगह पर तू छिपा था जो गगन पर चढ़ उठाता गर्व

### Tokenizing each word in the sentences

As we'll see the `indic-nlp` again does a great job in identifying the word tokens correctly, which is super impressive

In [None]:
# Tokenize words in each sentence and remove SENTENCE_START
tokenized_sentences = []
for sentence in filtered_sentences:
    words = indic_tokenize.trivial_tokenize(sentence, lang='hi')
    # Remove SENTENCE_START tokens
    words_without_markers = [word for word in words if word not in
                             ['SENTENCE_START', 'SENTENCE_END', 'SENTENCE', '_', 'START', 'END']]
    tokenized_sentences.append(words_without_markers)

# Print tokenized sentences without markers
for sentence in tokenized_sentences:
    print(sentence)

['वृक्ष', 'हों', 'भले', 'खड़े', 'हों', 'घने', 'हों', 'बड़े', 'एक', 'पत्र', 'छाँह', 'भी', 'माँग', 'मत', 'माँग', 'मत', 'माँग', 'मत', 'अग्निपथ', 'अग्निपथ', 'अग्निपथ', '।']

['तू', 'न', 'थकेगा', 'कभी', 'तू', 'न', 'रुकेगा', 'कभी', 'तू', 'न', 'मुड़ेगा', 'कभी', 'कर', 'शपथ', 'कर', 'शपथ', 'कर', 'शपथ', 'अग्निपथ', 'अग्निपथ', 'अग्निपथ', '।']

['यह', 'महान', 'दृश्य', 'है', 'चल', 'रहा', 'मनुष्य', 'है', 'अश्रु', 'श्वेत', 'रक्त', 'से', 'लथपथ', 'लथपथ', 'लथपथ', 'अग्निपथ', 'अग्निपथ', 'अग्निपथ', '।']

['नीड़', 'का', 'निर्माण', 'फिर', '-', 'फिर', 'नेह', 'का', 'आह्णान', 'फिर', '-', 'फिर', '।']

['वह', 'उठी', 'आँधी', 'कि', 'नभ', 'में', 'छा', 'गया', 'सहसा', 'अँधेरा', 'धूलि', 'धूसर', 'बादलों', 'ने', 'भूमि', 'को', 'इस', 'भाँति', 'घेरा', 'रात', '-', 'सा', 'दिन', 'हो', 'गया', 'फिर', 'रात', 'आई', 'और', 'काली', 'लग', 'रहा', 'था', 'अब', 'न', 'होगा', 'इस', 'निशा', 'का', 'फिर', 'सवेरा', 'रात', 'के', 'उत्पात', '-', 'भय', 'से', 'भीत', 'जन', '-', 'जन', 'भीत', 'कण', '-', 'कण', 'किंतु', 'प्राची', 'से', 'उषा', 'की', 'मोहिनी

## Using Ngrams to find the most frequent sequence of words used

As we start on to create the training data set for our model, in sequence modeling ngrams play an important role. As we might be aware, the data we input is sequence of words and the predicting label would be the next word of the sequence, and we'll create similar pairs of two words, three words and so on. In short, this is what ngrams are

In [None]:
%%time
from collections import Counter
from nltk import ngrams
bigram_counts = Counter(ngrams(hindi_text.split(), 2))
bigram_counts.most_common(10)

CPU times: user 5.9 ms, sys: 783 µs, total: 6.68 ms

Wall time: 6.65 ms


[(('लिए', 'फिरता'), 14),
 (('कर', 'ले।'), 10),
 (('फिरता', 'हूँ'), 10),
 (('उसकी', 'विकलता'), 9),
 (('क्या', 'करूँ'), 8),
 (('आ', 'रही'), 8),
 (('रही', 'रवि'), 8),
 (('रवि', 'की'), 8),
 (('की', 'सवारी।'), 8),
 (('ले', 'कर'), 7)]

In [None]:
trigram_counts = Counter(ngrams(hindi_text.split(), 3))
trigram_counts.most_common(10)

[(('लिए', 'फिरता', 'हूँ'), 9),
 (('आ', 'रही', 'रवि'), 8),
 (('रही', 'रवि', 'की'), 8),
 (('रवि', 'की', 'सवारी।'), 8),
 (('पूर्व', 'चलने', 'के'), 6),
 (('चलने', 'के', 'बटोही'), 6),
 (('के', 'बटोही', 'बाट'), 6),
 (('बटोही', 'बाट', 'की'), 6),
 (('बाट', 'की', 'पहचान'), 6),
 (('की', 'पहचान', 'कर'), 6)]

In [None]:
quadgram_counts = Counter(ngrams(hindi_text.replace('&', 'i').split(), 4))
quadgram_counts.most_common(10)

[(('आ', 'रही', 'रवि', 'की'), 8),
 (('रही', 'रवि', 'की', 'सवारी।'), 8),
 (('पूर्व', 'चलने', 'के', 'बटोही'), 6),
 (('चलने', 'के', 'बटोही', 'बाट'), 6),
 (('के', 'बटोही', 'बाट', 'की'), 6),
 (('बटोही', 'बाट', 'की', 'पहचान'), 6),
 (('बाट', 'की', 'पहचान', 'कर'), 6),
 (('कर', 'ले।', 'पूर्व', 'चलने'), 5),
 (('ले।', 'पूर्व', 'चलने', 'के'), 5),
 (('की', 'पहचान', 'कर', 'ले।'), 5)]

In [None]:
pentagram_counts = Counter(ngrams(hindi_text.split(), 5))
pentagram_counts.most_common(10)

[(('आ', 'रही', 'रवि', 'की', 'सवारी।'), 8),
 (('पूर्व', 'चलने', 'के', 'बटोही', 'बाट'), 6),
 (('चलने', 'के', 'बटोही', 'बाट', 'की'), 6),
 (('के', 'बटोही', 'बाट', 'की', 'पहचान'), 6),
 (('बटोही', 'बाट', 'की', 'पहचान', 'कर'), 6),
 (('कर', 'ले।', 'पूर्व', 'चलने', 'के'), 5),
 (('ले।', 'पूर्व', 'चलने', 'के', 'बटोही'), 5),
 (('बाट', 'की', 'पहचान', 'कर', 'ले।'), 5),
 (('क्या', 'करूँ', 'संवेदना', 'ले', 'कर'), 5),
 (('प्राण', 'कह', 'दो', 'आज', 'तुम'), 5)]

In [None]:
hexagram_counts = Counter(ngrams(hindi_text.split(), 6))
hexagram_counts.most_common(10)

[(('पूर्व', 'चलने', 'के', 'बटोही', 'बाट', 'की'), 6),
 (('चलने', 'के', 'बटोही', 'बाट', 'की', 'पहचान'), 6),
 (('के', 'बटोही', 'बाट', 'की', 'पहचान', 'कर'), 6),
 (('कर', 'ले।', 'पूर्व', 'चलने', 'के', 'बटोही'), 5),
 (('ले।', 'पूर्व', 'चलने', 'के', 'बटोही', 'बाट'), 5),
 (('बटोही', 'बाट', 'की', 'पहचान', 'कर', 'ले।'), 5),
 (('प्राण', 'कह', 'दो', 'आज', 'तुम', 'मेरे'), 5),
 (('कह', 'दो', 'आज', 'तुम', 'मेरे', 'लिए'), 5),
 (('दो', 'आज', 'तुम', 'मेरे', 'लिए', 'हो।'), 5),
 (('गीत', 'मेरे', 'देहरी', 'का', 'दीप-सा', 'बन।'), 5)]

In [None]:
heptagram_counts = Counter(ngrams(hindi_text.split(), 7))
heptagram_counts.most_common(10)

[(('पूर्व', 'चलने', 'के', 'बटोही', 'बाट', 'की', 'पहचान'), 6),
 (('चलने', 'के', 'बटोही', 'बाट', 'की', 'पहचान', 'कर'), 6),
 (('कर', 'ले।', 'पूर्व', 'चलने', 'के', 'बटोही', 'बाट'), 5),
 (('ले।', 'पूर्व', 'चलने', 'के', 'बटोही', 'बाट', 'की'), 5),
 (('के', 'बटोही', 'बाट', 'की', 'पहचान', 'कर', 'ले।'), 5),
 (('प्राण', 'कह', 'दो', 'आज', 'तुम', 'मेरे', 'लिए'), 5),
 (('कह', 'दो', 'आज', 'तुम', 'मेरे', 'लिए', 'हो।'), 5),
 (('नीड़', 'का', 'निर्माण', 'फिर-फिर', 'नेह', 'का', 'आह्णान'), 4),
 (('का', 'निर्माण', 'फिर-फिर', 'नेह', 'का', 'आह्णान', 'फिर-फिर।'), 4),
 (('चल', 'मरदाने', 'सीना', 'ताने', 'हाथ', 'हिलाते', 'पांव'), 4)]

In [None]:
# decagram_counts = Counter(ngrams(hindi_text.split(), 10))
# decagram_counts.most_common(10)

# Creating the training dataset

Starting off with creating the training set, we'll follow the following approach :-  <div>

- First we'll add the words which occur at the start of the sentence
- Then we'll start adding ngrams starting from bigram going upto dodecagrams
- Then we'll focus on addding EOS (End of sentence) words

Finding the first word of each sentence.

In [None]:
first_words = []

for sentence in sentences:
    words = indic_tokenize.trivial_tokenize(sentence, lang='hi')
    first_words.append(words[0])

# filtered_sentences[0:10]
first_words[0:10]

['वृक्ष', 'तू', 'यह', 'नीड़', 'वह', 'वह', 'क्रुद्ध', 'पूर्व', 'पूर्व', 'है']

In [None]:
first_word_counts = Counter(first_words)
first_word_counts.most_common(10)

[('मैं', 14),
 ('क्या', 9),
 ('आ', 7),
 ('पूर्व', 6),
 ('स्वप्न', 5),
 ('एक', 5),
 ('जीवन', 5),
 ("'", 5),
 ('गर्म', 5),
 ('वह', 4)]

In [None]:
len(first_word_counts)

123

Adding first words in training dataset

In [None]:
sentence_start_token='SENTENCE_START'

X_train = [[sentence_start_token]*c for sent,c in first_word_counts.items()]
y_train = [[sent]*c for sent,c in first_word_counts.items()]

In [None]:
# X_train,y_train

In [None]:
len(X_train), len(y_train)

(123, 123)

Fisher_yates is a function to randomize the words in the sample

In [None]:
import random

def fisher_yates (arr1, arr2):

    # We will Start from the last element
    # and swap one by one.
    n = len(arr1)
    if n != len(arr2):
        return None

    for i in range(n - 1, 0, -1):

        # Pick a random index from 0 to i
        j = random.randint(0, i)
        #print(i, j)

        # Swap arr[i] with the element at random index
        arr1[i], arr1[j] = arr1[j], arr1[i]
        arr2[i], arr2[j] = arr2[j], arr2[i]

    return arr1, arr2

In [None]:
X_train, y_train = fisher_yates(X_train, y_train)
len(X_train), len(y_train)

(123, 123)

In [None]:
print(y_train)

[['गीत'], ['व्यर्थ'], ['क्षीण'], ['–'], ['मधुर'], ['क्यों', 'क्यों'], ['स्वप्न', 'स्वप्न', 'स्वप्न', 'स्वप्न', 'स्वप्न'], ['मेहंदी'], ['गर्म', 'गर्म', 'गर्म', 'गर्म', 'गर्म'], ['कूक'], ['उस'], ['लालायित'], ['नहीं'], ['त्याग'], ['दिन', 'दिन', 'दिन'], ['गुण'], ['पर', 'पर'], ['त्राहि'], ['क्षण', 'क्षण', 'क्षण'], ['आग'], ['जगती', 'जगती'], ['साथी', 'साथी', 'साथी'], ['प्राण', 'प्राण'], ['गली'], ['हाथों'], ['बने'], ['’'], ['इसी', 'इसी'], ['मुख'], ['पूर्व', 'पूर्व', 'पूर्व', 'पूर्व', 'पूर्व', 'पूर्व'], ['हटो'], ['अर्द्ध'], ['दुखी', 'दुखी', 'दुखी', 'दुखी'], ['‘यह'], ['धार'], ['था', 'था', 'था', 'था'], ['गन्ध'], ['बूँद'], ['ऐसे', 'ऐसे', 'ऐसे'], ['नादन'], ['हो'], ['केवल'], ['जलतरंग'], ['मेला'], ['क्रुद्ध'], ['प्यास'], ['सत्य'], ['कौन', 'कौन', 'कौन'], ['काँधा'], ['वृक्ष'], ['डुबकियां'], ['प्रियतम'], ['लहरों'], ['जीवन', 'जीवन', 'जीवन', 'जीवन', 'जीवन'], ['‘दूर'], ['मृत'], ['बजी'], ['बादल'], ['बहती'], ['मुझसे'], ['असफलता'], ['चल', 'चल', 'चल', 'चल'], ['पाठकगण'], ['है'], ['यह', 'यह'], ['सख्त'], ['भेद'],

### Creating Vocabulory, and word indexing

Here we create a vocabulary based on the words we have in our corpus, and then create a indexing for each in it

In [None]:
# Append SENTENCE_START and SENTENCE_END

# sentence_start_token = "SENTENCE_START"

# sentences = ["%s %s" % (sentence_start_token, x[:-1].replace("&","")) for x in filtered_sentences]
# print(  "Parsed %d sentences." % (len(sentences)))

# Tokenize the sentences into words, making sure to remove end-of-sentence period
tokenized_sentences = [nltk.word_tokenize(sent.replace('.', '')) for sent in filtered_sentences]
unknown_token = "UNKNOWN_TOKEN"

# sentence_end_token = "SENTENCE_END"
# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print(  "Found %d unique words tokens." % len(word_freq.items()))


vocabulary_size = len(word_freq.items())

# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])
print("The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1]))

# Replace all words not in our vocabulary with the unknown token
#for i, sent in enumerate(tokenized_sentences):
#    tokenized_sentences[i] = [w if w in word_to_index else unknown_token for w in sent]
vocabulary_size = len(word_freq.items())
print("Using vocabulary size %d." % vocabulary_size)

print(  "\nExample sentence: '%s'" % filtered_sentences[0])
print(  "\nExample sentence after Pre-processing: '%s'" % tokenized_sentences[0])

Found 1557 unique words tokens.

The least frequent word in our vocabulary is 'पल' and appeared 1 times.

Using vocabulary size 1557.



Example sentence: 'SENTENCE_START वृक्ष हों भले खड़े हों घने हों बड़े एक पत्र छाँह भी माँग मत माँग मत माँग मत अग्निपथ अग्निपथ अग्निपथ ।'



Example sentence after Pre-processing: '['SENTENCE_START', 'वृक्ष', 'हों', 'भले', 'खड़े', 'हों', 'घने', 'हों', 'बड़े', 'एक', 'पत्र', 'छाँह', 'भी', 'माँग', 'मत', 'माँग', 'मत', 'माँग', 'मत', 'अग्निपथ', 'अग्निपथ', 'अग्निपथ', '।']'


In [None]:
# tokenized_sentences

In [None]:
X_train = [item for sublist in X_train for item in sublist]
y_train = [item for sublist in y_train for item in sublist]

Encoding to tokens based on the word_to_index

In [None]:
X_tokens = [[word_to_index[symbol]] for symbol,word in zip(X_train, y_train) if word in word_to_index]
y_tokens = [[word_to_index[word]] for symbol,word in zip(X_train, y_train) if word in word_to_index]

In [None]:
X_train = X_tokens
y_train = y_tokens

In [None]:
len(X_train), len(y_train)

(213, 213)

In [None]:
X_train[0:5], y_train[0:5]

([[0], [0], [0], [0], [0]], [[62], [138], [1320], [104], [338]])

Finding all ngrams upto n =20

In [None]:
ngrams_up_to_20 = []
for i in range(2, 21):
    ngram_counts = Counter(ngrams(hindi_text.split(), i))
    print('ngram-', i, 'length:', len(ngram_counts))
    ngrams_up_to_20.append(ngram_counts)

ngram- 2 length: 3830

ngram- 3 length: 4150

ngram- 4 length: 4273

ngram- 5 length: 4359

ngram- 6 length: 4427

ngram- 7 length: 4478

ngram- 8 length: 4520

ngram- 9 length: 4558

ngram- 10 length: 4589

ngram- 11 length: 4606

ngram- 12 length: 4619

ngram- 13 length: 4627

ngram- 14 length: 4633

ngram- 15 length: 4638

ngram- 16 length: 4641

ngram- 17 length: 4644

ngram- 18 length: 4647

ngram- 19 length: 4650

ngram- 20 length: 4652


In [None]:
ngrams_up_to_20[0].most_common(10)

[(('लिए', 'फिरता'), 14),
 (('कर', 'ले।'), 10),
 (('फिरता', 'हूँ'), 10),
 (('उसकी', 'विकलता'), 9),
 (('क्या', 'करूँ'), 8),
 (('आ', 'रही'), 8),
 (('रही', 'रवि'), 8),
 (('रवि', 'की'), 8),
 (('की', 'सवारी।'), 8),
 (('ले', 'कर'), 7)]

In [None]:
word_to_index['।']

2

In [None]:
bigrams_to_learn = ngrams_up_to_20[0]
X_train_2 = [[word_to_index[sent[0][0]]] for sent in (bigrams_to_learn.most_common())
                  if sent[0][0] in word_to_index and sent[0][1] in word_to_index]
y_train_2 = [[word_to_index[sent[0][1]]] for sent in (bigrams_to_learn.most_common())
                  if sent[0][0] in word_to_index and sent[0][1] in word_to_index]
X_train_2, y_train_2 = fisher_yates(X_train_2, y_train_2)

In [None]:
len(X_train_2), len(y_train_2)

(3392, 3392)

In [None]:
X_train_2[0:10], y_train_2[0:10]

([[89], [1025], [11], [25], [71], [5], [15], [19], [13], [13]],
 [[72], [4], [1352], [475], [673], [15], [139], [107], [31], [151]])

In [None]:
X_train[0:10], y_train[0:10], len(X_train), len(y_train)

([[0], [0], [0], [0], [0], [0], [0], [0], [0], [0]],
 [[62], [138], [1320], [104], [338], [90], [90], [75], [75], [75]],
 213,
 213)

Appending bigrams to training dataset

In [None]:
X_train.extend(X_train_2)
y_train.extend(y_train_2)

In [None]:
len(X_train), len(y_train)

(3605, 3605)

**So far our training dataset has 3605 instances**

Now working with trigrams

In [None]:
ngrams_to_learn = ngrams_up_to_20[1]
ngrams_to_learn.most_common(10)

[(('लिए', 'फिरता', 'हूँ'), 9),
 (('आ', 'रही', 'रवि'), 8),
 (('रही', 'रवि', 'की'), 8),
 (('रवि', 'की', 'सवारी।'), 8),
 (('पूर्व', 'चलने', 'के'), 6),
 (('चलने', 'के', 'बटोही'), 6),
 (('के', 'बटोही', 'बाट'), 6),
 (('बटोही', 'बाट', 'की'), 6),
 (('बाट', 'की', 'पहचान'), 6),
 (('की', 'पहचान', 'कर'), 6)]

In [None]:
[sent[0] for sent in (ngrams_to_learn.most_common(10))]

[('लिए', 'फिरता', 'हूँ'),
 ('आ', 'रही', 'रवि'),
 ('रही', 'रवि', 'की'),
 ('रवि', 'की', 'सवारी।'),
 ('पूर्व', 'चलने', 'के'),
 ('चलने', 'के', 'बटोही'),
 ('के', 'बटोही', 'बाट'),
 ('बटोही', 'बाट', 'की'),
 ('बाट', 'की', 'पहचान'),
 ('की', 'पहचान', 'कर')]

In [None]:
[sent[0][:-1] for sent in (ngrams_to_learn.most_common(10))]

[('लिए', 'फिरता'),
 ('आ', 'रही'),
 ('रही', 'रवि'),
 ('रवि', 'की'),
 ('पूर्व', 'चलने'),
 ('चलने', 'के'),
 ('के', 'बटोही'),
 ('बटोही', 'बाट'),
 ('बाट', 'की'),
 ('की', 'पहचान')]

In [None]:
[sent[0][1:] for sent in (ngrams_to_learn.most_common(10))]

[('फिरता', 'हूँ'),
 ('रही', 'रवि'),
 ('रवि', 'की'),
 ('की', 'सवारी।'),
 ('चलने', 'के'),
 ('के', 'बटोही'),
 ('बटोही', 'बाट'),
 ('बाट', 'की'),
 ('की', 'पहचान'),
 ('पहचान', 'कर')]

In [None]:
[[word_to_index[w] for w in sent[0]] for sent in (ngrams_to_learn.most_common(10))
    if all([w in word_to_index for w in sent[0]])]

[[35, 49, 16],
 [61, 51, 108],
 [51, 108, 4],
 [132, 98, 10],
 [98, 10, 133],
 [10, 133, 114],
 [133, 114, 4],
 [114, 4, 134],
 [4, 134, 12]]

In [None]:
X_train_2 = [[word_to_index[w] for w in sent[0][:-1]] for sent in (ngrams_to_learn.most_common(10))
    if all([w in word_to_index for w in sent[0]])]
X_train_2

[[35, 49],
 [61, 51],
 [51, 108],
 [132, 98],
 [98, 10],
 [10, 133],
 [133, 114],
 [114, 4],
 [4, 134]]

In [None]:
y_train_2 = [[word_to_index[w] for w in sent[0][1:]] for sent in (ngrams_to_learn.most_common(10))
    if all([w in word_to_index for w in sent[0]])]
y_train_2

[[49, 16],
 [51, 108],
 [108, 4],
 [98, 10],
 [10, 133],
 [133, 114],
 [114, 4],
 [4, 134],
 [134, 12]]

Create a list for trigrams, we restrict the data only to 2000 to have a smaller dataset (limiting due to high training time)

In [None]:
X_train_2 = [[word_to_index[w] for w in sent[0][:-1]] for sent in (ngrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])]
y_train_2 = [[word_to_index[w] for w in sent[0][1:]] for sent in (ngrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])]
X_train_2 = X_train_2[:2000]
y_train_2 = y_train_2[:2000]
X_train_2, y_train_2 = fisher_yates(X_train_2, y_train_2)
X_train_2[0:5], y_train_2[0:5], len(X_train_2), len(y_train_2)

([[4, 55], [879, 20], [4, 632], [1, 824], [52, 699]],
 [[55, 481], [20, 30], [632, 633], [824, 445], [699, 52]],
 2000,
 2000)

In [None]:
word_to_index['?']

19

Checking for end of sentence as we rely on symbols signifying end, we have the following function for it

In [None]:
def check_eos(trigram):
    if trigram[1] == word_to_index['।'] or trigram[1] == word_to_index['!'] or trigram[1] == word_to_index['?']:
          return True
    return False

trigrams_eos = list(filter(check_eos, y_train_2))
len(trigrams_eos), trigrams_eos[0:5]

(13, [[158, 9], [206, 19], [145, 19], [1028, 9], [70, 19]])

Now adding ngrams upto 20

In [None]:
from tqdm import tqdm
for i in tqdm(range(1, len(ngrams_up_to_20))):
    ngrams_to_learn = ngrams_up_to_20[i]
    X_train_2 = [[word_to_index[w] for w in sent[0][:-1]] for sent in (ngrams_to_learn.most_common())
                   if all([w in word_to_index for w in sent[0]])]
    y_train_2 = [[word_to_index[w] for w in sent[0][1:]] for sent in (ngrams_to_learn.most_common())
                   if all([w in word_to_index for w in sent[0]])]
    X_train_2 = X_train_2[:2000]
    y_train_2 = y_train_2[:2000]
    X_train_2, y_train_2 = fisher_yates(X_train_2, y_train_2)
    X_train.extend(X_train_2)
    y_train.extend(y_train_2)

100%|██████████| 18/18 [00:00<00:00, 24.73it/s]


In [None]:
len(X_train), len(y_train)

(37528, 37528)

**Now our training dataset includes 37528 instances after including the trigrams(also includes end of sentence trigrams)**

In [None]:
print(random.sample(list(zip(X_train, y_train)), 10))

[([775, 14, 776, 777, 11, 778], [14, 776, 777, 11, 778, 779]), ([18, 252, 60, 253, 153, 60, 472], [252, 60, 253, 153, 60, 472, 187]), ([8, 52, 461, 884, 885, 1, 18, 252, 60, 253, 153, 60, 34], [52, 461, 884, 885, 1, 18, 252, 60, 253, 153, 60, 34, 6]), ([11, 805, 437, 43, 438, 24, 806, 11, 807, 101, 43, 438, 808, 439, 90, 249], [805, 437, 43, 438, 24, 806, 11, 807, 101, 43, 438, 808, 439, 90, 249, 809]), ([282, 31], [31, 686]), ([685, 17, 8, 282, 31, 686, 42], [17, 8, 282, 31, 686, 42, 8]), ([48, 437, 66, 559, 77, 5, 1137, 1138, 1139, 1, 42, 196, 5, 1140, 1141, 82, 37], [437, 66, 559, 77, 5, 1137, 1138, 1139, 1, 42, 196, 5, 1140, 1141, 82, 37, 1142]), ([880, 14, 152, 181, 462, 881, 219, 181, 462, 882, 883, 18, 389, 282, 22], [14, 152, 181, 462, 881, 219, 181, 462, 882, 883, 18, 389, 282, 22, 180]), ([282, 22, 180, 463, 8, 314, 219, 464, 8, 52], [22, 180, 463, 8, 314, 219, 464, 8, 52, 461]), ([8, 282, 31, 686, 42, 8, 224, 687, 4, 383], [282, 31, 686, 42, 8, 224, 687, 4, 383, 21])]


In [None]:
tokenized_sentences[100]

['SENTENCE_START', 'आ', 'रही', 'रवि', 'की', 'सवारी', '।']

In [None]:
len(tokenized_sentences)

215

#### Now lets learn how to end the sentence...
as the training dataset includes less end of sentence ngrams and its very important to learn how to end the sentence
Symbols to recognize end of sentence are: '!', '।' and '?'

In [None]:
[[word_to_index[w] for w in sent] for sent in tokenized_sentences if all([w in word_to_index for w in sent])][100]

[0, 61, 51, 108, 4, 109, 2]

In [None]:
X_train_full_sentences = [[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sentences
                         if all([w in word_to_index for w in sent])]
y_train_full_sentences = [[word_to_index[w] for w in sent[1:]] for sent in tokenized_sentences
                         if all([w in word_to_index for w in sent])]

In [None]:
print(X_train_full_sentences[0:5], y_train_full_sentences[0:5])

[[0, 607, 129, 608, 609, 129, 610, 129, 611, 25, 612, 361, 28, 212, 48, 212, 48, 212, 48, 83, 83, 83], [0, 29, 13, 613, 71, 29, 13, 273, 71, 29, 13, 614, 71, 12, 274, 12, 274, 12, 274, 83, 83, 83], [0, 21, 362, 615, 1, 84, 96, 616, 1, 275, 617, 363, 7, 276, 276, 276, 83, 83, 83], [0, 213, 5, 166, 22, 3, 22, 214, 5, 215, 22, 3, 22], [0, 30, 618, 619, 130, 216, 6, 364, 57, 167, 620, 621, 622, 217, 72, 623, 14, 42, 624, 625, 73, 3, 58, 85, 24, 57, 22, 73, 626, 32, 365, 366, 96, 20, 68, 13, 97, 42, 627, 5, 22, 628, 73, 10, 629, 3, 630, 7, 367, 368, 3, 368, 367, 168, 3, 168, 218, 631, 7, 369, 4, 632, 633, 22, 3, 22, 213, 5, 166, 22, 3, 22, 214, 5, 215, 22, 3, 22]] [[607, 129, 608, 609, 129, 610, 129, 611, 25, 612, 361, 28, 212, 48, 212, 48, 212, 48, 83, 83, 83, 2], [29, 13, 613, 71, 29, 13, 273, 71, 29, 13, 614, 71, 12, 274, 12, 274, 12, 274, 83, 83, 83, 2], [21, 362, 615, 1, 84, 96, 616, 1, 275, 617, 363, 7, 276, 276, 276, 83, 83, 83, 2], [213, 5, 166, 22, 3, 22, 214, 5, 215, 22, 3, 22, 2]

In [None]:
print(tokenized_sentences[100], tokenized_sentences[100][::-1])

['SENTENCE_START', 'आ', 'रही', 'रवि', 'की', 'सवारी', '।'] ['।', 'सवारी', 'की', 'रवि', 'रही', 'आ', 'SENTENCE_START']


In [None]:
len(tokenized_sentences)

215

In [None]:
print(random.sample(tokenized_sentences, 10))

[['SENTENCE_START', 'उसके', 'नयनों', 'का', 'जल', 'खारा', 'है', 'गंगा', 'की', 'निर्मल', 'धारा', 'पावन', 'कर', 'देगी', 'तन', '-', 'मन', 'को', 'क्षण', 'भर', 'साथ', 'बहो', '!'], ['SENTENCE_START', 'स्वप्न', 'था', 'मेरा', 'भयंकर', '!'], ['SENTENCE_START', 'मैं', 'निज', 'उर', 'के', 'उद्गार', 'लिए', 'फिरता', 'हूँ', 'मैं', 'निज', 'उर', 'के', 'उपहार', 'लिए', 'फिरता', 'हूँ', 'है', 'यह', 'अपूर्ण', 'संसार', 'ने', 'मुझको', 'भाता', 'मैं', 'स्वप्नों', 'का', 'संसार', 'लिए', 'फिरता', 'हूँ', '!'], ['SENTENCE_START', 'चल', 'मरदाने', 'सीना', 'ताने', 'हाथ', 'हिलाते', 'पांव', 'बढाते', 'मन', 'मुस्काते', 'गाते', 'गीत', '।'], ['SENTENCE_START', 'है', 'अनिश्चित', 'किस', 'जगह', 'पर', 'सरित', 'गिरि', 'गह्वर', 'मिलेंगे', 'है', 'अनिश्चित', 'किस', 'जगह', 'पर', 'बाग', 'वन', 'सुंदर', 'मिलेंगे', 'किस', 'जगह', 'यात्रा', 'खतम', 'हो', 'जाएगी', 'यह', 'भी', 'अनिश्चित', 'है', 'अनिश्चित', 'कब', 'सुमन', 'कब', 'कंटकों', 'के', 'शर', 'मिलेंगे', 'कौन', 'सहसा', 'छूट', 'जाएँगे', 'मिलेंगे', 'कौन', 'सहसा', 'आ', 'पड़े', 'कुछ', 'भी', 'र

In [None]:
len(tokenized_sentences)

215

In [None]:
import random
last_n_words = []
for i in range(3, 20):
    tokenized_sentences_200 = random.sample(list(tokenized_sentences), 200)
    for s in tokenized_sentences_200:
        last_n_words.append(s[::-1][:i][::-1])

print(random.sample(last_n_words, 10))

[['SENTENCE_START', 'तुम', 'तूफान', 'समझ', 'पाओगे', '?'], ['मैं', 'साकी', 'पीनेवाला', 'मधुशाला', '।'], ['हृदय', 'पाने', 'की', 'आशा', 'व्यर्थ', 'लगाना', 'क्या', '।'], ['SENTENCE_START', 'पूर्व', 'चलने', 'के', 'बटोही', 'बाट', 'की', 'पहचान', 'कर', 'ले', '।'], ['मुबारक', 'पीनेवाले', 'खुली', 'रहे', 'यह', 'मधुशाला', '।'], ['इस', 'पर', 'बढ़ा', 'है', 'तू', 'इसी', 'पर', 'आज', 'अपने', 'चित्त', 'का', 'अवधान', 'कर', 'ले', '।'], ['अधरों', 'पर', 'निज', 'अधरों', 'का', 'तुमने', 'रख', 'भार', 'दिया', 'था', '!'], ['मन', 'वार', 'दिया', 'था', '?'], ['SENTENCE_START', 'इतने', 'मत', 'संतप्त', 'बनो', '।'], ['आँखों', 'को', 'निद्रित', 'चकाचौध', 'करते', 'हों', 'छिद्रित', 'मुझे', 'बुझा', 'दे', 'बुझ', 'जाने', 'से', 'मुझे', 'नहीं', 'इंकार', '।']]


In [None]:
last_n_words

[['कर', 'ले', '।'],
 ['की', 'सवारी', '।'],
 ['भी', 'सीखे', '?'],
 ['फ़ौज', 'सारी', '।'],
 ['मिलेगी', 'मधुशाला', '।'],
 ['किससे', 'भयभीत', '।'],
 ['मैंने', 'रुलाया', '!'],
 ['-', 'फिर', '।'],
 ['भीड़', 'में', '।'],
 ['बोल', 'बादल', '!'],
 ['दिया', 'था', '!'],
 ['दूसरी', 'बार', '।'],
 ['की', 'सवारी', '।'],
 ['नहलाता', 'हूँ', '!'],
 ['सहलाता', 'हूँ', '!'],
 ['को', 'तैयार', '।'],
 ['कर', 'तुम्हारी', '?'],
 ['दे', 'सकेगा', '?'],
 ['नहीं', 'होती', '।'],
 ['ने', 'जाना', '?'],
 ['करेगी', 'मधुशाला', '।'],
 ['मधुमय', 'मधुशाला', '।'],
 ['ढलता', 'है', '!'],
 ['उन्मत्त', 'बनो', '!'],
 ['चिता', 'पर', '!'],
 ['किया', 'था', '?'],
 ['पोशाक', 'धारी', '।'],
 ['की', 'सवारी', '।'],
 ['सा', 'बन', '।'],
 ['फिरता', 'हूँ', '!'],
 ['सा', 'बन', '।'],
 ['रंगीली', 'मधुशाला', '।'],
 ['न', 'कहो', '!'],
 ['ज्ञान', 'भूलना', '!'],
 ['दिशा', 'को', '!'],
 ['मेरी', 'मधुशाला', '।'],
 ['संतप्त', 'बनो', '।'],
 ['बुलाती', 'मधुशाला', '।'],
 ['मेरा', 'भयंकर', '!'],
 ['क्या', 'करूँ', '?'],
 ['बहलाता', 'हूँ', '!'],
 ['दिन', 'तुम्

In [None]:
len(last_n_words)

3400

In [None]:
#EOS - End of Sentence

X_train_eos = [[word_to_index[w] for w in sent[:-1]] for sent in last_n_words
                         if all([w in word_to_index for w in sent])]
y_train_eos = [[word_to_index[w] for w in sent[1:]] for sent in last_n_words
                         if all([w in word_to_index for w in sent])]

In [None]:
X_train_eos

[[12, 26],
 [4, 109],
 [28, 793],
 [516, 517],
 [1088, 36],
 [736, 413],
 [106, 312],
 [3, 22],
 [1446, 6],
 [169, 111],
 [178, 20],
 [1244, 349],
 [4, 109],
 [1348, 16],
 [1341, 16],
 [14, 1227],
 [12, 145],
 [87, 450],
 [15, 94],
 [72, 299],
 [1135, 36],
 [1172, 36],
 [309, 1],
 [587, 165],
 [586, 8],
 [64, 20],
 [507, 508],
 [4, 109],
 [58, 93],
 [49, 16],
 [58, 93],
 [1126, 36],
 [13, 158],
 [286, 795],
 [1367, 14],
 [54, 36],
 [588, 165],
 [489, 36],
 [80, 269],
 [17, 70],
 [270, 16],
 [85, 446],
 [178, 20],
 [203, 63],
 [859, 860],
 [50, 16],
 [12, 145],
 [4, 109],
 [95, 1],
 [12, 145],
 [4, 36],
 [1452, 255],
 [57, 20],
 [1353, 16],
 [12, 145],
 [176, 62],
 [101, 1],
 [876, 151],
 [452, 36],
 [309, 1],
 [1518, 1],
 [54, 36],
 [80, 269],
 [270, 16],
 [12, 26],
 [60, 55],
 [17, 70],
 [7, 476],
 [21, 36],
 [4, 109],
 [15, 94],
 [1268, 17],
 [157, 34],
 [589, 165],
 [203, 63],
 [587, 165],
 [54, 36],
 [15, 94],
 [421, 97],
 [58, 93],
 [1, 36],
 [1251, 17],
 [203, 63],
 [14, 801],
 [

In [None]:
len(X_train_eos), len(y_train_eos)

(3399, 3399)

In [None]:
print(X_train_eos[0:10], y_train_eos[0:10] )

[[12, 26], [4, 109], [28, 793], [516, 517], [1088, 36], [736, 413], [106, 312], [3, 22], [1446, 6], [169, 111]] [[26, 2], [109, 2], [793, 19], [517, 2], [36, 2], [413, 2], [312, 9], [22, 2], [6, 2], [111, 9]]


In [None]:
len(X_train), len(y_train)

(37528, 37528)

In [None]:
X_train.extend(X_train_eos)
y_train.extend(y_train_eos)

In [None]:
len(X_train), len(y_train)

(40927, 40927)

After appening the end of sentence tokens, our training dataset is of about 40000 instances

#### Pickle our training dataset

In [None]:
import pickle

# Specify the file path where you want to save the pickle file
file_path = 'X_train.pickle'

# Open the file in binary write mode and save the array
with open(file_path, 'wb') as file:
    pickle.dump(X_train, file)

In [None]:
file_path = 'y_train.pickle'

# Open the file in binary write mode and save the array
with open(file_path, 'wb') as file:
    pickle.dump(y_train, file)

# Starting with embeddings

## Loading the embeddings

We'll use the indicnlp fast text embeddings to convert our text to numbers

In [None]:
import os
import numpy as  np

indicft = "/kaggle/input/indicnlp-embeddings/indicnlp.ft.hi.300.vec"

embeddings_index = {} #initialize dictionary
f = open(indicft, encoding='utf8')
try:
    for line in f:
        values = line.split()
        # print(values)
        word = values[0]
        # print(word)
        if isinstance(values[1],float):
            coefs = np.asarray(values[1:], dtype='float32')
        else:
            coefs = np.asarray(values[2:], dtype='float32')
        embeddings_index[word] = coefs
except:
    print(line)
f.close()

print('Found %s word vectors.' % len(embeddings_index))

Found 327219 word vectors.


In [None]:
len(embeddings_index["तेरे"])

299

In [None]:
embeddings_index["है"][:299]

array([-5.15648305e-01,  7.88255513e-01, -1.65350348e-01,  5.87728620e-01,
       -4.35722142e-01,  8.15439224e-01,  4.75072384e-01, -1.89977348e-01,
        1.35939598e-01,  6.83160126e-01,  4.62868251e-02,  1.23072505e-01,
        2.99156308e-01,  2.33677983e-01, -3.07736725e-01,  8.68594795e-02,
        3.70740980e-01,  2.40825966e-01, -3.41060936e-01,  5.79155803e-01,
        1.33662358e-01,  4.38812554e-01, -2.23768085e-01,  5.56727052e-01,
        1.04942329e-01, -1.38754537e-02,  3.02238554e-01,  6.51752949e-02,
       -2.13642091e-01,  8.54419470e-02, -2.05526352e-01,  1.30182505e-01,
       -2.05939263e-01, -1.66696906e-01,  1.60899073e-01,  3.28032225e-01,
        1.40551552e-01,  2.00337589e-01,  3.86216789e-02, -8.47968832e-02,
        7.60270376e-03,  2.74946749e-01,  3.08333278e-01, -1.79145318e-02,
        2.89040267e-01,  4.71396685e-01, -5.05464494e-01,  5.11636913e-01,
        6.18044287e-02, -1.34717897e-01,  2.39699498e-01, -6.95517287e-02,
       -1.01792082e-01,  

### Embedding matrix

Our embedding matrix has a dimension of Vocab size,299 (size of embeddings for each word)

In [None]:
import numpy as np
embedding_dim = 299

embedding_matrix = np.zeros((vocabulary_size, embedding_dim))

In [None]:
# embedding_dim = 100

# embedding_matrix = np.zeros((vocabulary_size, embedding_dim))
for word, i in vocab:
    embedding_vector = embeddings_index.get(word)
    if i < vocabulary_size:
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector[:299]

In [None]:
embedding_matrix.shape

(1557, 299)

In [None]:
vocab[200]

('हाथों', 5)

In [None]:
embedding_matrix[12]

array([-0.06928822,  0.05009553, -0.11586142, -0.03235678,  0.14768928,
       -0.08971786, -0.18063322,  0.20427841,  0.2664938 , -0.13803518,
        0.17435841, -0.24402934, -0.07851491, -0.01653523, -0.0032394 ,
       -0.10603179, -0.10626386, -0.17562509, -0.13812228,  0.1461287 ,
       -0.08641782,  0.62626761, -0.00425611, -0.00724298,  0.04772782,
        0.27170616,  0.25756037, -0.0016789 ,  0.28737858, -0.36272228,
       -0.08560076, -0.01432239, -0.20565511,  0.09724188,  0.13814092,
        0.09732114, -0.07468031,  0.1026254 ,  0.1394937 , -0.14822337,
       -0.00935543, -0.01739961, -0.10031028, -0.23436184,  0.21072122,
       -0.23744115, -0.22060609,  0.10076727,  0.11414731, -0.21049462,
       -0.03252088, -0.13589272, -0.25620744,  0.21658556, -0.10349192,
        0.05402563, -0.22766803,  0.3065345 ,  0.03349132, -0.13482744,
       -0.33638939,  0.2952016 ,  0.13935527, -0.22325584, -0.02498981,
        0.23622085,  0.01745104,  0.14976056, -0.04586727,  0.02

In [None]:
# embedding_vector.shape, len(vocab)

### Building the RNN model

In [None]:
class RNN:
    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        # Assign instance variables
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate

        # Randomly initialize the network parameters
        #self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))
        #self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
        self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, embedding_dim))
        self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))
        self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))

        # Set GLOVE embeddings matrix
        self.G = embedding_matrix

Implementimg forward propagation

In [None]:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    # sometimes, may want to do this first:
    #x = np.vectorize(round)(x)

    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

In [None]:
def forward_propagation(self, x):
    # The total number of time steps
    T = len(x)

    # During forward propagation we save all hidden states in s because need them later.
    # We add one additional element for the initial hidden, which we set to 0
    s = np.zeros((T + 1, self.hidden_dim))
    s[-1] = np.zeros(self.hidden_dim)

    # The outputs at each time step. Again, we save them for later.
    o = np.zeros((T, self.word_dim))

    # For each time step...
    for t in np.arange(T):
        # embedding of x[t]:
        e_t = self.G[x[t]]

        # Note that we are indxing U by x[t]. This is the same as multiplying U with a one-hot vector.
        #s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
        s[t] = np.tanh(self.U.dot(e_t) + self.W.dot(s[t-1]))
        o[t] = softmax(self.V.dot(s[t]))

    return [o, s]

RNN.forward_propagation = forward_propagation

In [None]:
embedding_matrix.shape

(1557, 299)

In [None]:
embedding_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.22095755,  0.14144683, -0.13005191, ...,  0.05247998,
         0.1325019 ,  0.24468982],
       [ 0.4305422 ,  0.44622406,  0.3509393 , ..., -0.14660411,
        -0.25909096, -0.0171802 ],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])

In [None]:
vocabulary_size

1557

Making sure our dimensions are right!

In [None]:
word_dim = vocabulary_size
hidden_dim = 299
embedding_dim = 299
U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, embedding_dim))
W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))
V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
x = np.random.randint(0, high=word_dim, size=word_dim)
T = len(x)
s = np.zeros((T + 1, hidden_dim))
s_m1 = np.zeros(hidden_dim)
o = np.zeros((T, word_dim))
e_0 = embedding_matrix[x[0]]
s_0 = np.tanh(U.dot(e_0) + W.dot(s_m1))
print(s_0.shape, V.shape)
o_0 = softmax(V.dot(s_0))
o_0.shape, o_0

(299,) (1557, 299)


((1557,),
 array([0.00064226, 0.00064226, 0.00064226, ..., 0.00064226, 0.00064226,
        0.00064226]))

### Writing the predict function - to return the next word with the highest probability

In [None]:
def predict(self, x):
    # Perform forward propagation and return index of the highest score
    o, s = self.forward_propagation(x)
    return np.argmax(o[-1], axis=1)

RNN.predict = predict

If we feed only a standalone sequence, predicting all words of the output sequence

In [None]:
def predict(self, x):
    # Perform forward propagation and return index of the highest score
    o, s = self.forward_propagation(x)
    return np.argmax(o, axis=1)

RNN.predict = predict

### Sample outputs

In [None]:
print ("x:\n%s\n%s" % (" ".join([index_to_word[x] for x in X_train[1000]]), X_train[1000]))

x:

उतनी

[480]


In [None]:
print ("x:\n%s\n%s" % (" ".join([index_to_word[x] for x in X_train[10000]]), X_train[10000]))

x:

सूख गया मधुबन की छाती

[466, 57, 467, 4, 232]


In [None]:
print ("x:\n%s\n%s" % (" ".join([index_to_word[x] for x in X_train[20000]]), X_train[20000]))

x:

उसका लिया हर मैं रिझा जिसको न पाया गा सरल

[243, 529, 117, 11, 1040, 183, 13, 151, 197, 404]


In [None]:
print ("x:\n%s\n%s" % (" ".join([index_to_word[x] for x in X_train[30000]]), X_train[30000]))

x:

जहाँ खड़ा था कल उस थल पर आज नहीं कल इसी जगह पर पाना मुझको

[432, 193, 20, 313, 120, 978, 8, 27, 15, 313, 140, 170, 8, 979, 65]


Trying the model, and predict for X_train[10000]

In [None]:
vocabulary_size, X_train[10000]

(1557, [466, 57, 467, 4, 232])

In [None]:
np.random.seed(17)
model = RNN(vocabulary_size)
o, s = model.forward_propagation(X_train[10000])
print (o.shape, o)

(5, 1557) [[0.00064226 0.00064226 0.00064226 ... 0.00064226 0.00064226 0.00064226]

 [0.00064226 0.00064226 0.00064226 ... 0.00064226 0.00064226 0.00064226]

 [0.00064226 0.00064226 0.00064226 ... 0.00064226 0.00064226 0.00064226]

 [0.00064699 0.00064088 0.00062739 ... 0.00064385 0.00066688 0.00061502]

 [0.0006396  0.00064023 0.00066004 ... 0.00063608 0.00064732 0.00063783]]


In [None]:
np.argmax(o[-1], axis=0)

30

The words of the predicted sequence are as follows:

In [None]:
predictions = model.predict(X_train[10000])
print(predictions.shape, predictions)

(5,) [  0   0   0 879  30]


In [None]:
print ("x:\n%s" % (" ".join([index_to_word[x] for x in predictions])))

x:

SENTENCE_START SENTENCE_START SENTENCE_START प्यारा वह


The prediction seems very bad, but the model is not to blame, we didn't even train it yet

### Calculating the loss function

In [None]:
def calculate_total_loss(self, x, y):
    L = 0
    # For each sentence...
    for i in np.arange(len(y)):
        o, s = self.forward_propagation(x[i])
        # We only care about our prediction of the "correct" words
        correct_word_predictions = o[np.arange(len(y[i])), y[i]]
        # Add to the loss based on how off we were
        L += -1 * np.sum(np.log(correct_word_predictions))
    return L

def calculate_loss(self, x, y):
    # Divide the total loss by the number of training examples
    N = np.sum((len(y_i) for y_i in y))
    return self.calculate_total_loss(x,y)/N

RNN.calculate_total_loss = calculate_total_loss
RNN.calculate_loss = calculate_loss

In [None]:
# Limit to 1000 examples to save time
print ("Expected Loss for random predictions: %f" % np.log(vocabulary_size))
print ("Actual loss: %f" % model.calculate_loss(X_train[:1000], y_train[:1000]))

Expected Loss for random predictions: 7.350516



  N = np.sum((len(y_i) for y_i in y))


Actual loss: 7.351044


### Training backpropagation through time

In [None]:
def bptt(self, x, y):
    T = len(y)

    # Perform forward propagation
    o, s = self.forward_propagation(x)

    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.

    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)

        # Initial delta calculation
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))

        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:

            # print "Backpropagation step t=%d bptt step=%d " % (t, bptt_step)
            dLdW += np.outer(delta_t, s[bptt_step-1])
            #dLdU[:,x[bptt_step]] += delta_t
            dLdU += np.outer(delta_t, self.G[x[bptt_step]])

            # Update delta for next step
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)

    return [dLdU, dLdV, dLdW]

RNN.bptt = bptt

### Gradient Checking function

In [None]:
def gradient_check(self, x, y, h=0.001, error_threshold=0.01):

    # Calculate the gradients using backpropagation. We want to checker if these are correct.
    bptt_gradients = model.bptt(x, y)

    # List of all parameters we want to check.
    model_parameters = ['U', 'V', 'W']

    # Gradient check for each parameter
    for pidx, pname in enumerate(model_parameters):

        # Get the actual parameter value from the mode, e.g. model.W
        parameter = operator.attrgetter(pname)(self)
        print("Performing gradient check for parameter %s with size %d." % (pname, np.prod(parameter.shape)))

        # Iterate over each element of the parameter matrix, e.g. (0,0), (0,1), ...
        it = np.nditer(parameter, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            ix = it.multi_index

            # Save the original value so we can reset it later
            original_value = parameter[ix]

            # Estimate the gradient using (f(x+h) - f(x-h))/(2*h)
            parameter[ix] = original_value + h
            gradplus = model.calculate_total_loss([x],[y])
            parameter[ix] = original_value - h
            gradminus = model.calculate_total_loss([x],[y])
            estimated_gradient = (gradplus - gradminus)/(2*h)

            # Reset parameter to original value
            parameter[ix] = original_value

            # The gradient for this parameter calculated using backpropagation
            backprop_gradient = bptt_gradients[pidx][ix]

            # calculate The relative error: (|x - y|/(|x| + |y|))
            relative_error = np.abs(backprop_gradient - estimated_gradient) / (
                                np.abs(backprop_gradient) + np.abs(estimated_gradient))

               # If the error is to large fail the gradient check
            if relative_error > error_threshold:
                print( "Gradient Check ERROR: parameter=%s ix=%s" % (pname, ix))
                print( "+h Loss: %f" % gradplus)
                print( "-h Loss: %f" % gradminus)
                print( "Estimated_gradient: %f" % estimated_gradient)
                print( "Backpropagation gradient: %f" % backprop_gradient)
                print( "Relative Error: %f" % relative_error)
                return
            it.iternext()

        print( "Gradient check for parameter %s passed." % (pname))

RNN.gradient_check = gradient_check

In [None]:
grad_check_vocab_size = 100
np.random.seed(10)
model = RNN(grad_check_vocab_size, 10, bptt_truncate=1000)
model.gradient_check([0,1,2,3], [1,2,3,4])

Performing gradient check for parameter U with size 2990.

Gradient check for parameter U passed.

Performing gradient check for parameter V with size 1000.

Gradient check for parameter V passed.

Performing gradient check for parameter W with size 100.

Gradient check for parameter W passed.


### Stocastic Gradient Descent

In [None]:
# Performs one step of SGD.
def numpy_sdg_step(self, x, y, learning_rate):
    # Calculate the gradients
    dLdU, dLdV, dLdW = self.bptt(x, y)

    # Change parameters according to gradients and learning rate
    self.U -= learning_rate * dLdU
    self.V -= learning_rate * dLdV
    self.W -= learning_rate * dLdW

RNN.sgd_step = numpy_sdg_step

In [None]:
# Outer SGD Loop
# - model: The RNN model instance
# - X_train: The training data set
# - y_train: The training data labels
# - learning_rate: Initial learning rate for SGD
# - nepoch: Number of times to iterate through the complete dataset
# - evaluate_loss_after: Evaluate the loss after this many epochs

def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
    # We keep track of the losses so we can plot them later
    losses = []
    num_examples_seen = 0

    for epoch in range(nepoch):

        # Optionally evaluate the loss
        if (epoch % evaluate_loss_after == 0):
            loss = model.calculate_loss(X_train, y_train)
            losses.append((num_examples_seen, loss))
            time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print ("%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss))

            # Adjust the learning rate if loss increases
            if (len(losses) > 1 and losses[-1][1] > losses[-2][1]):
                learning_rate = learning_rate * 0.5
                print ("Setting learning rate to %f" % learning_rate)
            sys.stdout.flush()

        # For each training example...
        for i in range(len(y_train)):

            # One SGD step
            model.sgd_step(X_train[i], y_train[i], learning_rate)
            num_examples_seen += 1

In [None]:
vocabulary_size

1557

In [None]:
np.random.seed(17)
model = RNN(vocabulary_size)
%timeit model.sgd_step(X_train[1000], y_train[1000], 0.005)

2.78 ms ± 812 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


#### Running for 100 rows data, for 10 Epochs (Just trying out)

In [None]:
np.random.seed(17)

# Train on a small subset of the data to see what happens
model = RNN(vocabulary_size)
losses = train_with_sgd(model, X_train[2000:2100], y_train[2000:2100], nepoch=10, evaluate_loss_after=1)

2023-10-26 01:26:15: Loss after num_examples_seen=0 epoch=0: 7.350558



  N = np.sum((len(y_i) for y_i in y))


2023-10-26 01:26:15: Loss after num_examples_seen=100 epoch=1: 7.342038

2023-10-26 01:26:15: Loss after num_examples_seen=200 epoch=2: 7.333494

2023-10-26 01:26:15: Loss after num_examples_seen=300 epoch=3: 7.324880

2023-10-26 01:26:16: Loss after num_examples_seen=400 epoch=4: 7.316149

2023-10-26 01:26:16: Loss after num_examples_seen=500 epoch=5: 7.307256

2023-10-26 01:26:16: Loss after num_examples_seen=600 epoch=6: 7.298153

2023-10-26 01:26:16: Loss after num_examples_seen=700 epoch=7: 7.288795

2023-10-26 01:26:17: Loss after num_examples_seen=800 epoch=8: 7.279133

2023-10-26 01:26:17: Loss after num_examples_seen=900 epoch=9: 7.269123


In [None]:
len(index_to_word)

1557

### Generating Text
We have the model ready. Its time to predict the next sentence

In [None]:
def generate_sentence(model, senten_max_length):
    # We start the sentence with the start token
    new_sentence = [word_to_index[sentence_start_token]]

    # Repeat until we get an end token and keep our sentences to less than senten_max_length words for now
    while (not new_sentence[-1] == word_to_index['।']) and (not new_sentence[-1] == word_to_index['?']) and (not new_sentence[-1] == word_to_index['!'])  and len(new_sentence) < senten_max_length:
        next_word_probs = model.forward_propagation(new_sentence)
        sampled_word = word_to_index[unknown_token]

        # We don't want to sample unknown words
        while sampled_word == word_to_index[unknown_token]:

            # correcting for abnormalities
            #abs_v = [-i if i <0 else i for i in next_word_probs[-1][0]]
            #nrm_v = [i/sum(abs_v) for i in abs_v]
            #abs_v = [0 if i <0 else i for i in next_word_probs[-1][0]]
            #abs_v = [0 if i <0 else i for i in next_word_probs[0][-1]]
            #nrm_v = [i/sum(abs_v) for i in abs_v]
            #samples = np.random.multinomial(1, nrm_v)
            #sampled_word = np.argmax(samples)

            # the secret sauce of creativity
            samples = np.random.multinomial(1, next_word_probs[0][-1])

            sampled_word = np.argmax(samples)

        new_sentence.append(sampled_word)

    print(new_sentence)
    sentence_str = [index_to_word[x] for x in new_sentence[1:-1]]
    #print(sentence_str)
    return sentence_str

In [None]:
senten_max_length = 20
generate_sentence(model, senten_max_length)

[0, 1303, 297, 295, 1477, 868, 1146, 597, 696, 300, 264, 1102, 656, 198, 43, 17, 660, 1262, 1, 530]


['मानता',
 'अग्नि',
 'धारा',
 'गोपाल',
 'माँगा',
 'जलते',
 'छुई',
 'वन',
 'दाना',
 'पीनेवाले',
 'चुंबन',
 'तान',
 'बनकर',
 'तुम',
 'क्या',
 'मुसकराती',
 'अंक',
 'है']

In [None]:
num_sentences = 10
senten_min_length = 7
senten_max_length = 20

for i in range(num_sentences):
    sent = generate_sentence(model, senten_max_length)
    print (" ".join(sent))

[0, 1073, 614, 762, 426, 404, 27, 815, 1473, 1230, 450, 969, 281, 676, 1296, 1540, 1481, 261, 1292, 1231]

बतलाने मुड़ेगा रहता छूकर सरल आज झूम बाद चकाचौध सकेगा चली होता सृष्टि अनिवार गढ बे मुझ ठोक

[0, 1072, 834, 1114, 412, 0, 989, 79, 152, 1478, 345, 199, 1243, 854, 1420, 391, 200, 947, 190, 1537]

बिता जारी अंगूरी मंज़िल SENTENCE_START बेरोक हुआ देखो पुराने मंदिर तब सकेगी चंचलता गोताखोर स्वप्नों हाथों क्रय कहा

[0, 543, 365, 1516, 411, 630, 1418, 1004, 782, 121, 1151, 994, 1463, 1150, 1320, 1434, 701, 1451, 0, 1390]

सुमधुर काली बड़ा हमारी भय डुबकियां आधे उन्माद मान छूते पलक झडी उठाती क्षीण रह जाएँगे पुट्ठे SENTENCE_START

[0, 400, 220, 230, 1090, 613, 1439, 1553, 367, 1172, 803, 739, 400, 9]

बूँद घर आने छलछल़ थकेगा संघर्ष मिलें भीत मधुमय खंडर गंग बूँद

[0, 1472, 4, 1239, 464, 99, 1423, 102, 1425, 1091, 1075, 830, 878, 1388, 937, 705, 1006, 474, 150, 1102]

बरस की खूब तारों कौन मिलते हाथ बढ़ता मधुघट बढूँ पिटारी बेहद पीडा संचित उमर दिए वाला मुझसे

[0, 278, 1386, 853, 28, 1365, 1154, 

#### Running for 30 epochs

Okay so we start our serious training of the model now, we did our training on kaggle for it to be able to run overnight. It took ~6 hrs 44 mins for 30 epochs to complete

In [None]:
np.random.seed(17)

# Train on a small subset of the data to see what happens
model = RNN(vocabulary_size)
losses = train_with_sgd(model, X_train, y_train, nepoch=30, evaluate_loss_after=1)


  N = np.sum((len(y_i) for y_i in y))


2023-10-26 01:27:38: Loss after num_examples_seen=0 epoch=0: 7.350857

2023-10-26 01:40:53: Loss after num_examples_seen=40927 epoch=1: 4.194901

2023-10-26 01:54:17: Loss after num_examples_seen=81854 epoch=2: 3.859026

2023-10-26 02:08:12: Loss after num_examples_seen=122781 epoch=3: 3.911279

Setting learning rate to 0.002500

2023-10-26 02:21:32: Loss after num_examples_seen=163708 epoch=4: 3.180601

2023-10-26 02:34:59: Loss after num_examples_seen=204635 epoch=5: 3.062741

2023-10-26 02:48:25: Loss after num_examples_seen=245562 epoch=6: 2.966441

2023-10-26 03:02:04: Loss after num_examples_seen=286489 epoch=7: 2.996370

Setting learning rate to 0.001250

2023-10-26 03:15:34: Loss after num_examples_seen=327416 epoch=8: 2.879501

2023-10-26 03:28:55: Loss after num_examples_seen=368343 epoch=9: 2.804133

2023-10-26 03:42:20: Loss after num_examples_seen=409270 epoch=10: 2.808129

Setting learning rate to 0.000625

2023-10-26 03:55:48: Loss after num_examples_seen=450197 epoch=11

In [None]:
import pickle


# Specify the filename where you want to save the model
filename = 'model_100.pkl'

# Open the file in binary write mode
with open(filename, 'wb') as file:
    # Use pickle to dump the model into the file
    pickle.dump(model, file)

In [None]:
num_sentences = 20
senten_min_length = 7
senten_max_length = 20

for i in range(num_sentences):
    sent = []
    # We want long sentences, not sentences with one or two words
    while len(sent) < senten_min_length:
        sent = generate_sentence(model, senten_max_length)
    print (" ".join(sent))

[0, 1303, 297, 295, 1477, 868, 1146, 597, 696, 300, 264, 1102, 656, 198, 43, 78, 326, 206, 90, 16]

मानता अग्नि धारा गोपाल माँगा जलते छुई वन दाना पीनेवाले चुंबन तान बनकर तुम मुझे अधिकार पाओगे क्यों

[0, 1220, 1, 131, 15, 378, 1351, 6, 284, 11, 1352, 1353, 16, 9]

गायक है हाय नहीं सहज देवों में यदि मैं दुर्बल कहलाता हूँ

[0, 481, 815, 1473, 1230, 450, 969, 281, 676, 1296, 1540, 1481, 261, 1292, 1231, 1072, 834, 1114, 412, 0]

अभिलाषा झूम बाद चकाचौध सकेगा चली होता सृष्टि अनिवार गढ बे मुझ ठोक छिद्रित बिता जारी अंगूरी मंज़िल

[0, 989, 79, 152, 1478, 345, 199, 1243, 854, 1420, 391, 200, 947, 190, 1537, 543, 365, 1516, 411, 630]

बेरोक हुआ देखो पुराने मंदिर तब सकेगी चंचलता गोताखोर स्वप्नों हाथों क्रय कहा तौले सुमधुर काली बड़ा हमारी

[0, 1418, 1004, 782, 121, 1151, 994, 1463, 1150, 1320, 1434, 701, 1451, 0, 1390, 400, 220, 230, 1090, 613]

डुबकियां आधे उन्माद मान छूते पलक झडी उठाती क्षीण रह जाएँगे पुट्ठे SENTENCE_START चुप बूँद घर आने छलछल़

[0, 1439, 1553, 367, 1172, 803, 739, 400, 9]

संघर्

To be honest, the results looks good, they don't make much sense in Hindi but to certain extent they are not wrong either. One thing commendable for the model is that it is able to pick gender specific words in a sentence correctly and follows it through the sentence ie the gender doesn't change mid sentence <div>

The second sentence in the example, is grammatically correct which makes me proud (atleast in an AI generated world ;) )

[0, 1220, 1, 131, 15, 378, 1351, 6, 284, 11, 1352, 1353, 16, 9]<div>
गायक है हाय नहीं सहज देवों में यदि मैं दुर्बल कहलाता हूँ

#### Running for 60 epochs

We were daring enough to run it for 60 epochs as well, the error for sure keep on reducing but the sentence looked pretty much similar we'll analyse them going forward

In [131]:
np.random.seed(17)

# Train on a small subset of the data to see what happens
model = RNN(vocabulary_size)
losses = train_with_sgd(model, X_train, y_train, nepoch=60, evaluate_loss_after=1)


  N = np.sum((len(y_i) for y_i in y))


2023-10-26 19:29:18: Loss after num_examples_seen=0 epoch=0: 7.350692

2023-10-26 19:34:07: Loss after num_examples_seen=40927 epoch=1: 4.303587

2023-10-26 19:38:53: Loss after num_examples_seen=81854 epoch=2: 4.049519

2023-10-26 19:43:40: Loss after num_examples_seen=122781 epoch=3: 3.772869

2023-10-26 19:48:28: Loss after num_examples_seen=163708 epoch=4: 3.935226

Setting learning rate to 0.002500

2023-10-26 19:53:18: Loss after num_examples_seen=204635 epoch=5: 3.142704

2023-10-26 19:58:08: Loss after num_examples_seen=245562 epoch=6: 3.158575

Setting learning rate to 0.001250

2023-10-26 20:02:59: Loss after num_examples_seen=286489 epoch=7: 2.850078

2023-10-26 20:07:49: Loss after num_examples_seen=327416 epoch=8: 2.850826

Setting learning rate to 0.000625

2023-10-26 20:12:41: Loss after num_examples_seen=368343 epoch=9: 2.722673

2023-10-26 20:17:35: Loss after num_examples_seen=409270 epoch=10: 2.709067

2023-10-26 20:22:28: Loss after num_examples_seen=450197 epoch=11

We have this and the previous model pickeled, in case we want to try it again, we can!

In [136]:
import pickle


# Specify the filename where you want to save the model
filename = 'model_100.pkl'

# Open the file in binary write mode
with open(filename, 'wb') as file:
    # Use pickle to dump the model into the file
    pickle.dump(model, file)

In [133]:
num_sentences = 20
senten_min_length = 7
senten_max_length = 20

for i in range(num_sentences):
    sent = []
    # We want long sentences, not sentences with one or two words
    while len(sent) < senten_min_length:
        sent = generate_sentence(model, senten_max_length)
    print (" ".join(sent))

[0, 1498, 1319, 1278, 855, 1161, 1483, 666, 311, 358, 1208, 1081, 1522, 1130, 578, 588, 460, 1505, 1320, 1326]

चुस्त हीन झूककर मिलने मोमिन कष्टों चिड़िया तुम्हें शहीद मृत्तिका अनुभव सही दिखाएगी पत्ते संतप्त अभी पसीने क्षीण

[0, 1062, 772, 295, 1477, 868, 1146, 597, 696, 300, 264, 1102, 656, 198, 43, 78, 100, 206, 19]

लता दहा धारा गोपाल माँगा जलते छुई वन दाना पीनेवाले चुंबन तान बनकर तुम मुझे समझ पाओगे

[0, 1480, 53, 1, 147, 307, 1, 769, 4, 632, 295, 57, 36, 101, 12, 2]

बुढिया दो है अपना वेदना है उनको की मोहिनी धारा गया मधुशाला पड़ा कर

[0, 607, 426, 404, 27, 815, 148, 598, 1225, 431, 1048, 45, 900, 75, 763, 61, 1094, 326, 3, 33]

वृक्ष छूकर सरल आज झूम विश्व औ चाहती भीतर लूँ प्याला फ़िर स्वप्न तरह आ पहुंचे अधिकार -

[0, 1387, 1410, 103, 1292, 1231, 1072, 834, 1114, 412, 0, 989, 79, 152, 1478, 345, 199, 1243, 854, 1420]

व्याकुल चढ़कर कहीं ठोक छिद्रित बिता जारी अंगूरी मंज़िल SENTENCE_START बेरोक हुआ देखो पुराने मंदिर तब सकेगी चंचलता

[0, 391, 200, 947, 190, 1537, 543, 365, 1516, 411, 6

As we saw for 30 epochs, the gender related observation still remains same for 60 epochs which is a job well done. 

Another example that i liked was the third from above - 

[0, 1480, 53, 1, 147, 307, 1, 769, 4, 632, 295, 57, 36, 101, 12, 2] <div>
बुढिया दो है अपना वेदना है उनको की मोहिनी धारा गया मधुशाला पड़ा कर

<div>
This sentence also looks completely correct from a poetic viewpoint, and we can give a pat on the back to this poetic model

### Conclusion

On a concluding note, our model certainly is able to pick a gist of Shree Harivansh Rai Bachchan Ji's magical poetry. Not a perfect ofcourse, but it is to a certain extent able to predict right words to form sentences. It has been a great learning experience for us as well, and we're able to create the training data, and eventually train the model as well.

Lastly, the famous poem •• कोशिश करने वालों की हार नहीं होती •• in voice of Amitabh bachchan [here](https://www.youtube.com/watch?v=BPpqEW31sMg&ab_channel=CAMayankKothari). It means the one who keeps persistent on trying `never` fails 