# Build a Simple Chatbot from Scratch in Python using NLTK

Create a very basic chatbot using Python's NLTK library. It is a very simple bot with hardly any cognitive skills, but it is a good way to get into NLP and learn about chatbots.

## Natural Language Processing (NLP)
Natural language processing is a way for computers to analyze, understand and derive meaning from human language in a smart and useful way. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition and so on.

## Import necessary libraries

In [1]:
import io
import random
import string #to process standard python strings
import warnings
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
warnings.filterwarnings('ignore') #silence library warnings in the notebook output

## Download and Install Natural Language Toolkit (NLTK)
Natural Language Toolkit is a leading platform for building Python programs to work with human language data.

In [2]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [3]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('popular', quiet=True) # for downloading packages
#nltk.download('punkt') # first-time use only
#nltk.download('wordnet') # first-time use only

[nltk_data] Downloading package punkt to /Users/okguser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/okguser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Reading in a collection of written texts
Use the Wikipedia page for chatbots as the corpus. Copy the contents of the page into a text file named 'chatbot.txt'.

In [6]:
with open('chatbot.txt', 'r', errors='ignore') as f: #'r' is read mode; undecodable bytes are skipped
    plaintxt = f.read()
plaintxt = plaintxt.lower() #converts to lowercase

The issue with text data is that it is in string format, whereas machine learning algorithms need some kind of numerical feature vector to perform their task.

Hence, the text must be pre-processed before any NLP work can start.

Basic text pre-processing includes:
- Converting the entire text to lowercase, so that the algorithm does not treat the same word in different cases as different words
- Tokenization, the process of converting normal text strings into a list of tokens (the words that we want). The NLTK data package includes a pre-trained tokenizer for English.
  - Sentence tokenizer, to find the list of sentences
  - Word tokenizer, to find the list of words
- Removing noise, i.e. everything that is not a standard letter or number
- Removing stop words, extremely common words that are of little value in selecting documents matching a user's need and are therefore excluded from the vocabulary entirely
- Stemming, the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. For example, if we were to stem the words "Stems", "Stemming", "Stemmed" and "Stemtization", the result would be the single word "stem".
- Lemmatization, a slight variant of stemming. The major difference between the two is that stemming can often create non-existent words, whereas lemmas are actual words.
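
The first few steps in the list above can be sketched with the standard library alone. This is a minimal illustration, not what NLTK does internally: `STOP_WORDS` is a deliberately tiny, hypothetical stop-word list, and `split_sentences` and `preprocess` are made-up helper names that use a naive regular expression and whitespace splitting in place of NLTK's pre-trained tokenizers.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}

def split_sentences(text):
    """Naive sentence tokenizer: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    lowered = text.lower()
    # Noise removal: delete every ASCII punctuation character.
    cleaned = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

sample = "A chatbot is a program. The Program answers questions!"
print(split_sentences(sample))  # ['A chatbot is a program.', 'The Program answers questions!']
print(preprocess(sample))       # ['chatbot', 'program', 'program', 'answers', 'questions']
```

Because of lowercasing, "program" and "Program" end up as the same token, which is the point of the first step in the list.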

## Tokenisation

In [7]:
sent_tokens = nltk.sent_tokenize(plaintxt) #converts to list of sentences; response() below uses this name
word_tokens = nltk.word_tokenize(plaintxt) #converts to list of words

## Pre-processing
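Define a function called LemTokens that takes the tokens as input and returns their lemmatized forms, and a function called LemNormalize that lowercases the text, strips punctuation, tokenizes it and lemmatizes the result.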

In [8]:
lemmer = nltk.stem.WordNetLemmatizer() #WordNet is a semantically-oriented dictionary of English included in NLTK.

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

#Translation table that maps every punctuation character to None, i.e. deletes it
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

#Overall, these functions preprocess text by converting it to lowercase, removing punctuation,
#tokenizing, and lemmatizing the tokens using NLTK's WordNetLemmatizer.
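
To see what the punctuation table does on its own, the following sketch applies it to a sample string. The sample text is made up, and a plain whitespace split stands in for `nltk.word_tokenize` so that the snippet runs without any NLTK data.

```python
import string

# Same construction as remove_punct_dict above: map each punctuation
# character's code point to None so that str.translate() deletes it.
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

raw = "Hello, World! What's a chat-bot?"
cleaned = raw.lower().translate(remove_punct_dict)
print(cleaned)          # hello world whats a chatbot
print(cleaned.split())  # whitespace split stands in for nltk.word_tokenize here
```

Note that apostrophes and hyphens are deleted too, so "what's" becomes "whats" and "chat-bot" becomes "chatbot".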

## Keyword Matching

Define a function for a greeting by the bot. If the user's input contains a greeting, the bot will return a greeting response.

In [9]:
greeting_inputs = ("hello", "hi", "greetings", "hey")
greeting_responses = ("hello", "hi", "hey", "hi there!", "hi, how may i help you?")

def greeting(sentence): #define function, take sentence as input
    for word in sentence.split(): #use .split() to split the sentence into words
        if word.lower() in greeting_inputs: #check whether the lowercase word is one of the greeting inputs
            return random.choice(greeting_responses) #if found, randomly select one of the greeting responses
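
A few sample calls show how this keyword matching behaves. The sketch below repeats the definitions so that it runs on its own; the only change is that the fall-through `return None` is written out, whereas the function above returns `None` implicitly. The test sentences are made up.

```python
import random

greeting_inputs = ("hello", "hi", "greetings", "hey")
greeting_responses = ("hello", "hi", "hey", "hi there!", "hi, how may i help you?")

def greeting(sentence):
    for word in sentence.split():
        if word.lower() in greeting_inputs:
            return random.choice(greeting_responses)
    return None  # made explicit here; no word matched a known greeting

print(greeting("Hey bot"))            # one of greeting_responses, chosen at random
print(greeting("what is a chatbot"))  # None
print(greeting("hi!"))                # None: split() leaves the "!" attached to "hi"
```

The last call suggests one limitation of plain `split()`: a greeting followed directly by punctuation is not recognized unless the punctuation is stripped first.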

## Generating Response

### Bag of Words
After preprocessing, the text must be transformed into a meaningful vector (or array) of numbers. Bag of words is a representation of text that describes the occurrence of words within a document. It involves two things:
- a vocabulary of known words
- a measure of the presence of known words

It is called a "bag" of words because any information about the order or structure of words in the document is discarded and the model is only concerned with whether the known words occur in the document, not where they occur in the document.
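
As a minimal sketch of the idea, the snippet below builds a vocabulary from two made-up documents and counts word occurrences with `collections.Counter`. The documents and the helper name `bag_of_words` are assumptions for illustration only.

```python
from collections import Counter

docs = ["the bot answers the user", "the user asks the bot"]

# Vocabulary of known words, sorted so that every vector uses the same columns.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return the count of each vocabulary word in doc, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

print(vocabulary)             # ['answers', 'asks', 'bot', 'the', 'user']
print(bag_of_words(docs[0]))  # [1, 0, 1, 2, 1]
print(bag_of_words(docs[1]))  # [0, 1, 1, 2, 1]

# Word order is discarded: a shuffled document gives the same vector.
print(bag_of_words("user the answers bot the") == bag_of_words(docs[0]))  # True
```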

### TF-IDF Approach
The problem with bag of words is that highly frequent words start to dominate the document's vector, even though they may not carry much "information content".

One approach is to rescale the frequency of words by how often they appear in all documents, so that the scores of words that are frequent across all documents are penalized. This approach to scoring is called Term Frequency-Inverse Document Frequency (TF-IDF).
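
The rescaling can be sketched with the textbook formula, where a word's weight is its term frequency multiplied by `log(N / df)`, with `N` the number of documents and `df` the number of documents containing the word. The three documents and the helper name `tf_idf` are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed variant of the IDF term by default and also normalizes each row, so its numbers will differ from the ones computed here.

```python
import math
from collections import Counter

docs = [
    "the bot answers the user",
    "the user asks the bot",
    "the weather is sunny",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word.
df = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Textbook TF-IDF: (count / document length) * log(N / df)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / df[word])
        for word, count in counts.items()
    }

weights = tf_idf(tokenized[0])
for word, weight in sorted(weights.items()):
    print(f"{word}: {weight:.4f}")
```

In this toy corpus "the" appears in every document, so its weight is zero even though it is the most frequent word in the first document, while "answers", which appears in only one document, receives the highest weight.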



To generate a response from our bot for input questions, the concept of document similarity will be used. Define a function response that compares the user's utterance with every sentence in the corpus and returns the most similar sentence. If the input shares no keywords with any sentence, it returns the response "I am sorry! I don't understand you".

In [10]:
def response(user_response):
    robo_response = ''
    sent_tokens.append(user_response) #add the query so that it is vectorized together with the corpus
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    vals = cosine_similarity(tfidf[-1], tfidf) #similarity of the query to every sentence
    idx = vals.argsort()[0][-2] #second-best match; the best match is the query itself
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2] #similarity score of that second-best match
    if req_tfidf == 0:
        robo_response = robo_response + "I am sorry! I don't understand you"
        return robo_response
    else:
        robo_response = robo_response + sent_tokens[idx]
        return robo_response
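
The reason for taking the second-highest score (index `-2`) can be seen in a small standard-library sketch. It is a simplification under stated assumptions: raw word counts replace the TF-IDF weights, whitespace splitting replaces `LemNormalize`, and the corpus sentences, the query and the helper name `cosine` are made up.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors (Counters)."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

corpus = [
    "a chatbot is a software application",
    "chatbots are used in customer service",
    "the weather is sunny today",
]
query = "what is a chatbot"

sentences = corpus + [query]  # the query goes last, as in response()
vectors = [Counter(s.split()) for s in sentences]
scores = [cosine(vectors[-1], v) for v in vectors]

# Indices sorted by ascending score, like vals.argsort() in response().
ranked = sorted(range(len(scores)), key=scores.__getitem__)
best = ranked[-2]  # ranked[-1] is the query itself, with similarity 1.0
print(scores[-1])       # 1.0
print(sentences[best])  # a chatbot is a software application
```

The second corpus sentence scores 0 here because exact token matching finds no shared word between "chatbots" and "chatbot". When every corpus sentence scores 0, the second-highest score is 0 as well, which is the case `response` handles with its fallback message.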

Finally, define the lines that the bot says when starting and ending a conversation, depending on the user's input.

In [11]:
flag = True
print("ROBOT: Hi, my name is ROBOT. I will answer your queries about ChatBots. When you want to exit, input Bye.")
while flag:
    user_response = input()
    user_response = user_response.lower()
    if user_response != 'bye':
        if user_response == 'thanks' or user_response == 'thank you':
            flag = False #thanking the bot also ends the conversation
            print("ROBOT: You're Welcome!")
        else:
            greeting_reply = greeting(user_response) #call once; every call picks a new random reply
            if greeting_reply is not None:
                print("ROBOT: " + greeting_reply)
            else:
                print("ROBOT: ", end="")
                print(response(user_response))
                sent_tokens.remove(user_response) #undo the append made inside response()
    else:
        flag = False
        print("ROBOT: Bye, have a nice day!")

ROBOT: Hi, my name is ROBOT. I will answer your queries about ChatBots. When you want to exit, input Bye.
hi
ROBOT: hey
hi
ROBOT: hi, how may i help you?
bye
ROBOT: Bye, have a nice day!
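
The branching of the conversation loop can also be sketched as a small function that only classifies a turn, which makes the order of the checks easier to follow and to test without `input()`. The function name `route` and its return labels are hypothetical and not part of the notebook.

```python
GREETINGS = ("hello", "hi", "greetings", "hey")

def route(user_text):
    """Classify one user turn in the same order as the loop above checks it."""
    text = user_text.lower()
    if text == "bye":
        return "exit"
    if text in ("thanks", "thank you"):
        return "thanks"    # the loop also stops after this reply
    if any(word in GREETINGS for word in text.split()):
        return "greeting"
    return "lookup"        # falls through to the TF-IDF response() step

for turn in ["Hi", "what is a chatbot", "Thank you", "Bye"]:
    print(turn, "->", route(turn))
```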
