<a href="https://colab.research.google.com/github/ritwiks9635/Chatbot_uasing_HF/blob/main/ChatBot_using_LSTM_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**ChatBot Model**

[Dataset](https://www.kaggle.com/datasets/kausr25/chatterbotenglish)

In [2]:
from zipfile import ZipFile
data = "/content/https:/www.kaggle.com/datasets/kausr25/chatterbotenglish/chatterbotenglish.zip"
with ZipFile(data, "r") as archive:  # avoid shadowing the built-in zip()
  archive.extractall("Chatbot_data/data")
  print("the data has been extracted ")

the data has been extracted 


In [3]:
import os

import yaml

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [30]:
dir_path = '/content/Chatbot_data/data'
files_list = os.listdir(dir_path + os.sep)

questions = []
answers = []
for filepath in files_list:
    # Open each YAML file in a context manager so the handle is always closed
    with open(os.path.join(dir_path, filepath), "rb") as stream:
        docs = yaml.safe_load(stream)
    conversations = docs['conversations']
    for con in conversations:
        if len(con) > 2 :
            questions.append(con[0])
            replies = con[1 :]
            ans = ''
            for rep in replies:
                ans += ' ' + rep
            answers.append(ans)
        elif len(con) > 1:
            questions.append(con[0])
            answers.append(con[1])

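# Keep only the pairs whose answer is a string, dropping the matching question too.
# (Popping from `questions` by index inside the loop would shift the later indices.)
pairs = [(q, a) for q, a in zip(questions, answers) if isinstance(a, str)]
questions = [q for q, _ in pairs]
answers_with_tags = [a for _, a in pairs]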

answers = []
for i in range(len(answers_with_tags)) :
    answers.append('<start> ' + answers_with_tags[i] + ' <end>')

tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(questions + answers)
VOCAB_SIZE = len(tokenizer.word_index)+1
print('VOCAB SIZE : {}'.format(VOCAB_SIZE))

VOCAB SIZE : 1894
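
The pairing rule used in the cell above can be tried on a toy input without any of the dataset files. The following is a minimal sketch, not the notebook's exact code: `build_pairs` and the `toy` conversations are hypothetical names introduced here for illustration. The first utterance is treated as the question, the remaining utterances are joined into one answer, and the answer is wrapped in the `<start>` and `<end>` tags.

```python
def build_pairs(conversations):
    """Turn raw conversations into (question, tagged answer) pairs.

    Each conversation is a list whose first item is the question and whose
    remaining items are replies. Conversations with non-string parts are skipped.
    """
    pairs = []
    for con in conversations:
        if len(con) < 2:
            continue  # a question without a reply is not usable
        question, replies = con[0], con[1:]
        if not isinstance(question, str) or not all(isinstance(r, str) for r in replies):
            continue  # drop the question together with its answer
        answer = "<start> " + " ".join(replies) + " <end>"
        pairs.append((question, answer))
    return pairs


toy = [
    ["Hi", "Hello"],
    ["How are you?", "Fine.", "And you?"],
    ["Lonely question"],
    ["Is it true?", True],  # YAML can parse bare words such as `true` as booleans
]
for q, a in build_pairs(toy):
    print(q, "->", a)
```

Filtering the question and the answer in the same step keeps the two lists aligned, which matters because the model is trained on them position by position.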


In [31]:
from gensim.models import Word2Vec
import re

In [32]:
vocab = []
for word in tokenizer.word_index:
    vocab.append(word)


q_tokenized = tokenizer.texts_to_sequences(questions)
q_max_length = max([len(x) for x in q_tokenized])
q_padded = keras.preprocessing.sequence.pad_sequences(q_tokenized, maxlen = q_max_length, padding  = "post")
encoder_input_data = np.array(q_padded)
print(encoder_input_data.shape, q_max_length)


a_tokenized = tokenizer.texts_to_sequences(answers)
a_max_length = max([len(x) for x in a_tokenized])
a_padded = keras.preprocessing.sequence.pad_sequences(a_tokenized, maxlen = a_max_length, padding  = "post")
decoder_input_data = np.array(a_padded)
print(decoder_input_data.shape, a_max_length)


a_tokenized = tokenizer.texts_to_sequences(answers)
for i in range(len(a_tokenized)):
    a_tokenized[i] = a_tokenized[i][1:]
#a_max_length = max([len(x) for x in a_tokenized])
a_padded = keras.preprocessing.sequence.pad_sequences(a_tokenized, maxlen = a_max_length, padding  = "post")
one_hot_a = keras.utils.to_categorical(a_padded, VOCAB_SIZE)
decoder_output_data = np.array(one_hot_a)
print(decoder_output_data.shape)

(564, 22) 22
(564, 74) 74
(564, 74, 1894)
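
The decoder target built above is the decoder input shifted one step to the left, so that at every position the model is asked to predict the next token. A small sketch with made-up token ids (1 for the start tag, 2 for the end tag, 0 for padding; `pad_post` is a hypothetical helper that only mimics post-padding) shows the alignment:

```python
def pad_post(seq, maxlen, pad=0):
    """Right-pad a token list to maxlen (assumes len(seq) <= maxlen)."""
    return seq + [pad] * (maxlen - len(seq))


# Hypothetical token ids: 1 = start, 2 = end, 0 = padding.
answer = [1, 7, 8, 9, 2]
maxlen = 7

decoder_input = pad_post(answer, maxlen)       # keeps the start token
decoder_target = pad_post(answer[1:], maxlen)  # start token removed

# At each step the decoder sees one token and must predict the next one.
for step, (seen, expected) in enumerate(zip(decoder_input, decoder_target)):
    print(step, seen, "->", expected)
```

In the notebook the target is additionally one-hot encoded with `to_categorical`, which is why its shape gains a final axis of size `VOCAB_SIZE`.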


##**Defining the Encoder-Decoder model**
The model will have Embedding, LSTM and Dense layers. The basic configuration is as follows.
- 2 Input Layers : One for encoder_input_data and another for decoder_input_data.
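- Embedding layer : Converts token indices to fixed-size dense vectors. ( Note : Don't forget the mask_zero=True argument here, so that the padding index 0 is ignored. )
- LSTM layer : Provides the Long Short-Term Memory cells.
- Dense layer : Maps each decoder LSTM output to a softmax distribution over the vocabulary.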
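
As a quick sanity check on this configuration, the parameter count of each layer type can be worked out by hand. The sketch below assumes the values used in this notebook (a vocabulary of 1894 tokens, 200-dimensional embeddings, 200 LSTM units) and the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias. The constant names are illustrative.

```python
VOCAB_SIZE = 1894   # printed by the tokenizer cell
EMBED_DIM = 200     # output dimension of each Embedding layer
UNITS = 200         # LSTM units

# One dense vector per token id
embedding_params = VOCAB_SIZE * EMBED_DIM
# 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (UNITS * (EMBED_DIM + UNITS) + UNITS)
# Weight matrix plus bias of the softmax output layer
dense_params = UNITS * VOCAB_SIZE + VOCAB_SIZE

print(embedding_params, lstm_params, dense_params)  # 378800 320800 380694
```

The embedding figure, 378800, agrees with the embedding rows of the `model.summary()` output.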

Working :

1. The encoder_input_data is passed through the Embedding layer ( encoder_embedding ).
2. The output of the Embedding layer goes to the LSTM cell, which produces 2 state vectors ( h and c, which together form encoder_states ).
3. These states are set as the initial state of the decoder's LSTM cell.
4. The decoder_input_data is passed through its own Embedding layer.
5. The embeddings go into the decoder LSTM cell ( initialised with the encoder states ) to produce the output sequences.
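
The hand-off of states in the steps above can be illustrated without Keras. The sketch below is a toy, single-unit LSTM written in plain Python with random weights. It is not the model trained in this notebook, and every name in it (`lstm_step`, `encode`, `decode`, the weight dictionaries) is hypothetical. It only shows that the encoder keeps its final h and c, and that the decoder's outputs depend on the states it starts from.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, w):
    """One step of a single-unit LSTM; w maps gate name -> (w_x, w_h, b)."""
    def gate(name, squash):
        w_x, w_h, b = w[name]
        return squash(w_x * x + w_h * h + b)

    i = gate("input", sigmoid)
    f = gate("forget", sigmoid)
    o = gate("output", sigmoid)
    g = gate("candidate", math.tanh)
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new


def random_weights(rng):
    names = ("input", "forget", "output", "candidate")
    return {n: (rng.uniform(-1, 1), rng.uniform(-1, 1), 0.0) for n in names}


rng = random.Random(0)
enc_w, dec_w = random_weights(rng), random_weights(rng)


def encode(inputs):
    h = c = 0.0
    for x in inputs:
        h, c = lstm_step(x, h, c, enc_w)
    return h, c  # only the final states are kept


def decode(inputs, h, c):
    outputs = []
    for x in inputs:
        h, c = lstm_step(x, h, c, dec_w)
        outputs.append(h)  # one output per time step
    return outputs


question = [0.5, -0.2, 0.9]  # stand-ins for embedded question tokens
answer_in = [0.1, 0.4]       # stand-ins for embedded answer tokens

h, c = encode(question)
print(decode(answer_in, h, c))      # decoder conditioned on the question
print(decode(answer_in, 0.0, 0.0))  # same decoder without the encoder states
```

The two printed lists differ, which is the whole point of step 3: the question reaches the decoder only through the initial states.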

In [33]:
encoder_inputs = layers.Input(shape=(q_max_length, ))
encoder_embedding = layers.Embedding(VOCAB_SIZE, 200, mask_zero=True)(encoder_inputs)
encoder_outputs ,state_h ,state_c = layers.LSTM(200 , return_state=True)(encoder_embedding)
encoder_states = [state_h, state_c]

decoder_inputs = layers.Input(shape=(a_max_length, ))
decoder_embedding = layers.Embedding(VOCAB_SIZE, 200, mask_zero=True)(decoder_inputs)
decoder_lstm = layers.LSTM(200 , return_state=True , return_sequences=True)
decoder_outputs, _ , _ = decoder_lstm (decoder_embedding , initial_state=encoder_states)
decoder_dense = layers.Dense(VOCAB_SIZE , activation= "softmax")
output = decoder_dense(decoder_outputs)

model = keras.Model([encoder_inputs, decoder_inputs], output)
model.compile(optimizer=keras.optimizers.RMSprop(), loss='categorical_crossentropy')

model.summary()

Model: "model_13"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_15 (InputLayer)       [(None, 22)]                 0         []                            
                                                                                                  
 input_16 (InputLayer)       [(None, 74)]                 0         []                            
                                                                                                  
 embedding_2 (Embedding)     (None, 22, 200)              378800    ['input_15[0][0]']            
                                                                                                  
 embedding_3 (Embedding)     (None, 74, 200)              378800    ['input_16[0][0]']            
                                                                                           

In [None]:
model.fit([encoder_input_data , decoder_input_data], decoder_output_data, batch_size=50, epochs=150 )
model.save( 'model.h5' )

##**Defining inference models**
We create inference models which help in predicting answers.

**Encoder inference model** : Takes the question as input and outputs LSTM states ( h and c ).

**Decoder inference model** : Takes 2 inputs: the LSTM states ( the output of the encoder model on the first step, the decoder's own previous states afterwards ) and the answer input sequence, which is fed one token at a time and begins with the `<start>` tag. It outputs the predicted next word of the answer to the question we fed to the encoder model, together with its updated state values.

In [35]:
def make_inference_models():

    encoder_model = keras.Model(encoder_inputs, encoder_states)

    decoder_state_input_h = layers.Input(shape=(200 ,))
    decoder_state_input_c = layers.Input(shape=(200 ,))

    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

    decoder_outputs, state_h, state_c = decoder_lstm(decoder_embedding , initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = keras.Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs] + decoder_states)

    return encoder_model , decoder_model

In [36]:
def str_to_tokens(sentence : str):
    # Reuse the fitted tokenizer so that lower-casing and punctuation filtering
    # match training; words outside the vocabulary are skipped instead of
    # raising a KeyError.
    tokens_list = tokenizer.texts_to_sequences([sentence])[0]
    return keras.preprocessing.sequence.pad_sequences([tokens_list], maxlen = q_max_length, padding='post')

1. First, we take a question as input and predict the state values using enc_model.
2. We set the state values in the decoder's LSTM.
3. Then, we create a target sequence that contains only the `<start>` element.
4. We input this sequence into the dec_model.
5. We replace the `<start>` element with the element predicted by the dec_model and update the state values.
6. We repeat steps 4 and 5 until we hit the `<end>` tag or the maximum answer length.
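
The loop described in these steps is greedy decoding: at each step the single most likely token is kept and fed back in. A framework-free sketch is given below. `greedy_decode`, `fake_decode_step` and the token ids are hypothetical stand-ins so that the control flow can be run on its own; in the notebook the same roles are played by enc_model, dec_model and the tokenizer.

```python
def greedy_decode(encode, decode_step, start_id, end_id, max_len):
    """Greedy decoding loop.

    encode() returns the initial decoder state; decode_step(token, state)
    returns (scores, new_state), where scores[i] is the score of token id i.
    """
    state = encode()
    token = start_id
    result = []
    while len(result) < max_len:
        scores, state = decode_step(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == end_id:
            break
        result.append(token)
    return result


# Hypothetical stand-in for the trained decoder: it always emits the token
# that follows the current one in a fixed script and ignores the state.
SCRIPT = {1: 5, 5: 6, 6: 2}  # start (1) -> 5 -> 6 -> end (2)


def fake_decode_step(token, state):
    scores = [0.0] * 8
    scores[SCRIPT[token]] = 1.0
    return scores, state


print(greedy_decode(lambda: None, fake_decode_step, start_id=1, end_id=2, max_len=10))  # [5, 6]
```

Capping the loop at `max_len` guards against a decoder that never produces the end token.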

In [None]:
enc_model , dec_model = make_inference_models()

for _ in range(10):
    states_values = enc_model.predict( str_to_tokens( input( 'Enter question : ' ) ) )
    empty_target_seq = np.zeros( ( 1 , 1 ) )
    empty_target_seq[0, 0] = tokenizer.word_index['start']
    stop_condition = False
    decoded_translation = ''
    while not stop_condition :
        dec_outputs , h , c = dec_model.predict([ empty_target_seq ] + states_values )
        sampled_word_index = np.argmax( dec_outputs[0, -1, :] )
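        # index_word is the inverse of word_index; index 0 ( padding ) has no entry
        sampled_word = tokenizer.index_word.get( int( sampled_word_index ) )
        if sampled_word is not None and sampled_word != 'end' :
            # keep the end marker itself out of the printed reply
            decoded_translation += ' {}'.format( sampled_word )

        if sampled_word == 'end' or len(decoded_translation.split()) > a_max_length:
            stop_condition = True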

        empty_target_seq = np.zeros( ( 1 , 1 ) )
        empty_target_seq[ 0 , 0 ] = sampled_word_index
        states_values = [ h , c ]

    print( decoded_translation )