### LGM VIRTUAL INTERNSHIP PROGRAM 
### DATA SCIENCE 
### ADVANCED LEVEL TASK 2: Next Word Prediction

In [1]:
#importing libraries
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM,Dense,Embedding,Bidirectional,BatchNormalization,Dropout
from keras.preprocessing.text import Tokenizer,text_to_word_sequence
from keras.utils import pad_sequences,to_categorical
import re
from bs4 import BeautifulSoup 
import os

In [2]:
#finding the input files
file_name=[i for i in os.listdir() if i.endswith('.txt')][0]

In [3]:
#reading the text (read-only mode is enough, the file is never written)
with open(file_name,'r',encoding='utf-8') as f:
    text_string=f.read()

In [4]:
#data cleaning: split on '.', replace every non-letter with a space and lowercase
text=[re.sub("[^a-zA-Z]"," ",i).lower().replace("  "," ") for i in text_string.split('.')]

Tokenization is a necessary first step in many natural language processing tasks, such as word counting, parsing, spell checking, corpus generation, and statistical analysis of text.

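The Keras `Tokenizer` used below builds a vocabulary from the corpus. `fit_on_texts` assigns every word an integer index, with more frequent words getting lower indices, and `texts_to_sequences` converts each sentence into a list of those indices. By default it lowercases the text and strips punctuation. Index 0 is never assigned to a word, so it is free to be used for padding.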

In [5]:
token=Tokenizer()

In [6]:
token.fit_on_texts(text)

In [7]:
len(token.index_word)

8082

In [8]:
seq=token.texts_to_sequences(text)
len(seq)

6434
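To make the word-to-index step concrete, here is a minimal pure-Python sketch of a frequency-ranked word index. It is only an illustration of the idea, not the Keras implementation: the helper names `build_word_index` and `to_sequence` and the two demo sentences are made up for this example.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for s in sentences for word in s.split())
    # sorted() is stable, so ties keep first-seen order; 0 is left free for padding
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}

def to_sequence(sentence, word_index):
    """Replace every known word by its index and skip unknown words."""
    return [word_index[w] for w in sentence.split() if w in word_index]

demo = ["the cat sat", "the dog sat down"]
word_index = build_word_index(demo)
print(word_index)
print(to_sequence("the dog ran", word_index))
```

The word "ran" never appeared in the demo sentences, so it is silently dropped from the sequence.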

### Creating the dataset using the corpus

#### Example: This is a demo text

| input          | target |
|----------------|--------|
| this           | is     |
| this is        | a      |
| this is a      | demo   |
| this is a demo | text   |

In [9]:
#every prefix of length >= 2 becomes one sample; its last token is the target
input_seq=[]
for sentence in text:
    token_sentence=token.texts_to_sequences([sentence])[0]
    for end in range(1,len(token_sentence)):
        input_seq.append(token_sentence[:end+1])

In [10]:
#dataset length
len(input_seq)

102628
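The prefix construction can be checked on the demo sentence from the table. This is a small sketch working on plain words instead of token indices; `prefix_sequences` is a hypothetical helper name, not part of the notebook.

```python
def prefix_sequences(tokens):
    """Return every prefix of length >= 2; the last item of each prefix is the target."""
    return [tokens[:end] for end in range(2, len(tokens) + 1)]

demo_tokens = "this is a demo text".split()
demo_prefixes = prefix_sequences(demo_tokens)
for prefix in demo_prefixes:
    print(" ".join(prefix[:-1]), "->", prefix[-1])
```

A sentence with n tokens contributes n-1 samples, and a one-word sentence contributes none.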

In [11]:
#finding the max len for a given input 
max_len=max([ len(i) for i in input_seq])
max_len

105

In [12]:
#padding the input to make every sample of equal length
pad_seq=pad_sequences(input_seq,maxlen=max_len,padding='pre')

In [13]:
len(pad_seq),pad_seq

(102628,
 array([[   0,    0,    0, ...,    0,  143,  129],
        [   0,    0,    0, ...,  143,  129,   42],
        [   0,    0,    0, ...,  129,   42,    1],
        ...,
        [   0,    0,    0, ...,    4,  354,   80],
        [   0,    0,    0, ...,  354,   80,  352],
        [   0,    0,    0, ...,   80,  352, 1601]]))
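As a rough illustration of what pre-padding does, the sketch below left-pads two short sequences with zeros using NumPy and then splits off the last column as the target. `pre_pad` is a hypothetical helper written for this example; it assumes over-long sequences keep their most recent tokens.

```python
import numpy as np

def pre_pad(sequences, maxlen):
    """Left-pad integer sequences with zeros to a common length."""
    padded = np.zeros((len(sequences), maxlen), dtype=int)
    for row, seq in enumerate(sequences):
        seq = seq[-maxlen:]              # keep the most recent tokens if too long
        if seq:                          # an empty sequence stays all zeros
            padded[row, -len(seq):] = seq
    return padded

demo_padded = pre_pad([[143, 129], [143, 129, 42]], maxlen=4)
print(demo_padded)

# the last column is the word to predict, everything before it is the input
demo_features, demo_target = demo_padded[:, :-1], demo_padded[:, -1]
print(demo_features.shape, demo_target)
```

Padding on the left keeps the target in the last column of every row, which is why a single slice is enough to separate features and target.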

In [14]:
#separating target and features
X=pad_seq[:,:-1]
Y=pad_seq[:,-1]

In [15]:
Y[0]

129

In [16]:
Y=to_categorical(Y)

In [17]:
Y[0],np.argmax(Y[0])

(array([0., 0., 0., ..., 0., 0., 0.], dtype=float32), 129)

In [18]:
Y.shape

(102628, 8083)
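The second dimension is 8083 rather than 8082 because index 0 is reserved for padding, so the one-hot vectors need one column per word plus one. A minimal NumPy sketch of one-hot encoding, with a hypothetical `one_hot` helper and made-up labels:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    encoded = np.zeros((len(labels), num_classes), dtype=np.float32)
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

demo_labels = np.array([2, 0, 3])
demo_one_hot = one_hot(demo_labels, num_classes=4)
print(demo_one_hot.shape)
print(np.argmax(demo_one_hot, axis=1))
```

Taking `argmax` along each row recovers the original labels, which is the same check done on `Y[0]` above.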

### Word2Vec

Word2Vec learns a dense numerical vector for every word in the vocabulary. The values in a word's vector are learned from its context, that is, from the words that appear around it in the corpus.

As a result, words that occur in similar contexts end up with similar vectors, which lets us measure how closely two words are related.

#### Example: This is a sample

#### this-->[0.4,2.2,0.3,2.2]  is--> [0.2,2.1,2.2,2.3]  a--> [0.4,2.2,0.1,4.2]  sample-->[0.4,2.2,5.0,1.2]


In [31]:
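sentences=[i.split() for i in text]
#passing sentences to the constructor already trains the model once
model_word=Word2Vec(sentences=sentences,window=10,min_count=2,vector_size=300)
#continue training for additional epochs
model_word.train(sentences,total_examples=model_word.corpus_count,epochs=model_word.epochs+100)

#word -> vector lookup for every word kept by Word2Vec
word_dict={}
for word in model_word.wv.index_to_key:
    word_dict[word]= model_word.wv.get_vector(word)

#one row per tokenizer index; row 0 (padding) stays all zeros
vocab_size=len(token.word_index)+1
emb_matrix=np.zeros((vocab_size,300))

#words dropped by min_count=2 have no vector and keep a zero row
for word,index in token.word_index.items():
    emb_vector=word_dict.get(word)
    if emb_vector is not None:
        emb_matrix[index]=emb_vector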

In [32]:
model_word.wv.most_similar('crime')

[('committed', 0.5659366250038147),
 ('unique', 0.4662627875804901),
 ('nature', 0.4382766783237457),
 ('charm', 0.43705448508262634),
 ('reward', 0.4160635769367218),
 ('bizarre', 0.41171836853027344),
 ('legal', 0.41091188788414),
 ('literature', 0.40374037623405457),
 ('reasoner', 0.3944343626499176),
 ('sensational', 0.39289799332618713)]

In [44]:
model_word.wv.most_similar('mystery')

[('homely', 0.37771669030189514),
 ('bakers', 0.33959320187568665),
 ('suggestive', 0.33076027035713196),
 ('quest', 0.3159787952899933),
 ('grounds', 0.3155442476272583),
 ('compared', 0.3085801303386688),
 ('loop', 0.2920632064342499),
 ('investigation', 0.2888488173484802),
 ('fire', 0.27340424060821533),
 ('absolute', 0.2708062529563904)]
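The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces that ranking on three made-up 3-dimensional vectors; the words, the vectors and the helper name `rank_by_cosine` are illustrative only and have nothing to do with the trained model.

```python
import numpy as np

def rank_by_cosine(query, vectors, topn=2):
    """Rank the other words by cosine similarity to `query`."""
    q = vectors[query] / np.linalg.norm(vectors[query])
    scores = {
        word: float(np.dot(q, vec / np.linalg.norm(vec)))
        for word, vec in vectors.items() if word != query
    }
    return sorted(scores.items(), key=lambda item: -item[1])[:topn]

# made-up vectors, for illustration only
toy_vectors = {
    "cat":    np.array([1.0, 0.0, 0.0]),
    "kitten": np.array([0.9, 0.1, 0.0]),
    "car":    np.array([0.0, 1.0, 0.0]),
}
toy_ranking = rank_by_cosine("cat", toy_vectors)
print(toy_ranking)
```

A score near 1 means the two vectors point in almost the same direction, and a score near 0 means they are unrelated.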

### Visualizing Word Embedding

In [21]:
#reducing the word vectors to 3 dimensions using PCA
from sklearn.decomposition import PCA

In [45]:
pca=PCA(n_components=3)

In [54]:
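#store the projection in its own variable so the training features X are not overwritten
X_pca = pca.fit_transform(model_word.wv.get_normed_vectors())

In [50]:
import plotly.express as px

In [75]:
#plotting 100 words (indices 400-499) in 3-d space
fig=px.scatter_3d(X_pca[400:500],x=0,y=1,z=2,color=model_word.wv.index_to_key[400:500])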
fig.show()
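For intuition about what PCA does to the vectors, here is a minimal NumPy sketch that centres the data and projects it onto the top three principal directions found by SVD. It runs on random stand-in data rather than the trained embeddings, `pca_project` is a hypothetical helper, and the signs of individual components may differ from what scikit-learn returns.

```python
import numpy as np

def pca_project(vectors, n_components=3):
    """Project rows onto the top principal components (centre, then SVD)."""
    centred = vectors - vectors.mean(axis=0)
    # rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T

rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(50, 300))   # random stand-in for word vectors
demo_points = pca_project(demo_vectors)
print(demo_points.shape)
```

Each 300-dimensional vector is reduced to three coordinates, which is what makes the 3-d scatter plot possible.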

## Building LSTM model to predict next word


A Long Short-Term Memory (LSTM) network is a recurrent neural network for sequential data whose cell state lets information persist across time steps.

It is a special type of Recurrent Neural Network (RNN) designed to mitigate the vanishing gradient problem that plain RNNs face on long sequences.

LSTM was introduced by Hochreiter and Schmidhuber to address this limitation of traditional RNNs.

LSTM can be implemented in Python using the Keras library.

In [62]:
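vocab_size=len(token.word_index)+1  #8083: vocabulary plus index 0 reserved for padding
model=Sequential([
    #frozen embedding layer initialised with the Word2Vec vectors
    Embedding(vocab_size,output_dim=300,input_length=max_len-1,weights=[emb_matrix],trainable=False),
    LSTM(150),
    Dense(vocab_size,activation='relu'),
    #softmax over the vocabulary gives the next-word probabilities
    Dense(vocab_size,activation='softmax')
])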
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])

In [72]:
#layers in our LSTM model
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 104, 300)          2424900   
                                                                 
 lstm (LSTM)                 (None, 150)               270600    
                                                                 
 dense (Dense)               (None, 8083)              1220533   
                                                                 
 dense_1 (Dense)             (None, 8083)              65342972  
                                                                 
Total params: 69,259,005
Trainable params: 66,834,105
Non-trainable params: 2,424,900
_________________________________________________________________
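The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers with the sizes from this model (vocabulary 8083, embedding dimension 300, 150 LSTM units); the variable names are just for this example.

```python
vocab_size, emb_dim, lstm_units = 8083, 300, 150

embedding_params = vocab_size * emb_dim
# an LSTM has four weight blocks (input, forget and output gates plus the
# candidate cell state), each with input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (emb_dim + lstm_units) + lstm_units)
dense_params = lstm_units * vocab_size + vocab_size
dense_1_params = vocab_size * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params + dense_1_params
trainable_params = total_params - embedding_params   # the embedding is frozen
print(embedding_params, lstm_params, dense_params, dense_1_params)
print(total_params, trainable_params)
```

Almost all of the parameters sit in the second dense layer, because it maps 8083 units onto 8083 units.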


#### Note: Training an LSTM model on a large dataset takes a long time, so the model was trained for 300 epochs on Google Colab's free GPU and the saved weights are loaded here.

In [65]:

#model.fit(X,Y,epochs=300)
model.load_weights('model2_1.h5')

### Predicting the next 20 words using our Model

In [77]:
#seed text to continue
predict_text='a man entered '
for i in range(20):
    #encode the text generated so far and pad it to the model's input length
    pred_token=token.texts_to_sequences([predict_text])[0]
    pred_X=pad_sequences([pred_token],maxlen=max_len-1,padding='pre')
    #greedy decoding: pick the most probable next word
    index=np.argmax(model.predict(pred_X,verbose=0))
    #index 0 is padding and has no word, so fall back to an empty string
    predict_text=predict_text+" "+token.index_word.get(index,'')
    print(predict_text)

a man entered  who
a man entered  who could
a man entered  who could hardly
a man entered  who could hardly have
a man entered  who could hardly have been
a man entered  who could hardly have been less
a man entered  who could hardly have been less than
a man entered  who could hardly have been less than six
a man entered  who could hardly have been less than six feet
a man entered  who could hardly have been less than six feet six
a man entered  who could hardly have been less than six feet six inches
a man entered  who could hardly have been less than six feet six inches in
a man entered  who could hardly have been less than six feet six inches in height
a man entered  who could hardly have been less than six feet six inches in height with
a man entered  who could hardly have been less than six feet six inches in height with the
a man entered  who could hardly have been less than six feet six inches in height with the chest
a man entered  who could hardly have been less than six feet
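The loop above is greedy decoding: at every step the single most probable word is appended and the longer text is fed back in. The sketch below shows the same loop in isolation, assuming a hypothetical stand-in for `model.predict` (a fixed probability table indexed by the last token) so that it runs without the trained weights.

```python
import numpy as np

def greedy_generate(seed_ids, next_word_probs, steps):
    """Repeatedly append the most probable next token id."""
    ids = list(seed_ids)
    for _ in range(steps):
        probs = next_word_probs(ids)
        ids.append(int(np.argmax(probs)))
    return ids

# hypothetical stand-in for model.predict: a fixed table indexed by the last token
toy_table = np.array([
    [0.1, 0.7, 0.2],   # after token 0, token 1 is most likely
    [0.2, 0.1, 0.7],   # after token 1, token 2 is most likely
    [0.8, 0.1, 0.1],   # after token 2, token 0 is most likely
])
generated = greedy_generate([0], lambda ids: toy_table[ids[-1]], steps=4)
print(generated)
```

Because the choice at each step is deterministic, the same seed always produces the same continuation. Sampling from the probabilities instead of taking the argmax would be one way to get more varied text.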

In [None]:
#A man entered
#Our visitor glanced