# Chat bot based on Text Classification using NLP Algorithm.

Human language is surprisingly complex and varied. When we write, we often misspell or abbreviate words or omit punctuation, so much of the text around us is unstructured data.

Natural language processing (NLP) helps computers communicate with humans in their own language and scales other language-related tasks. For instance, NLP enables computers to read text, interpret it, determine sentiment, and decide which parts are significant.

Understanding these ideas will let you build the core component of any conversational chatbot: its main engine.

A core aspect of natural language processing is identifying patterns. Words ending in -ed tend to be past-tense verbs. Frequent use of "will" is indicative of news text. These measurable patterns, word structure and word frequency, correlate with specific aspects of meaning, such as tense and subject matter.
But how do we know where to start looking, and which elements of form to associate with which elements of meaning? In this series we will learn to build the core engine of a chatbot, using natural language processing techniques to learn text classification.


More generally, we want to take a predetermined body of text and perform some basic analysis and transformations on it, so that we are left with objects that are far more useful for a subsequent, more important analytical task. That subsequent task is our main text mining or natural language processing job.

Text preprocessing has three main components:

--tokenization
--normalization
--substitution

As we lay out a framework for approaching preprocessing, we should keep these concepts in mind.
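
As a rough illustration of how the three components fit together, here is a minimal sketch using only the standard library. The function names and the `ABBREVIATIONS` table are hypothetical and chosen for illustration; substitution is read here as replacing abbreviations with their expanded form, which is only one possible interpretation.

```python
import re

# Hypothetical substitution table, chosen only for illustration.
ABBREVIATIONS = {"prof": "professor", "dept": "department"}

def tokenize(text):
    # Tokenization: split the text into runs of word characters.
    return re.findall(r"\w+", text)

def normalize(tokens):
    # Normalization: put every token into one canonical form (here, lower case).
    return [t.lower() for t in tokens]

def substitute(tokens):
    # Substitution: replace known abbreviations with their expanded form.
    return [ABBREVIATIONS.get(t, t) for t in tokens]

tokens = substitute(normalize(tokenize("Contact Prof Rangan, ECE Dept.")))
print(tokens)  # ['contact', 'professor', 'rangan', 'ece', 'department']
```

The rest of this notebook uses NLTK for the same kind of steps rather than hand-written helpers like these.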

# Import useful libraries

In [1]:
import nltk

### Install NLTK components:
    
nltk.download_gui()

The command above opens a GUI. Select the following packages:

    stopwords from Corpora
    averaged_perceptron_tagger from All Packages
    wordnet
    
Or download all the NLTK components with:
    nltk.download()
    
Please note: downloading all the NLTK components takes a long time (20-60 minutes, depending on Internet speed).

In [2]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [3]:
#Importing required dependencies and files for the project
import re
import os
import csv
from nltk.stem.snowball import SnowballStemmer
import random
from nltk.classify import SklearnClassifier
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import numpy as np

import pandas as pd

In [4]:
## Get multiple outputs in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" 

## Ignore all warnings 
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

In [5]:
## Display all rows and columns of a dataframe instead of a truncated version
from IPython.display import display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Preprocess

Instead of using the entire dataset, we use two example sentences to illustrate preprocessing:

In [6]:
sentence = "The Big brown fox jumped over a lazy dog."
sentence2 = "What couses does Prof Sundeep Rangan teaches and how can I contact him ?"

In [7]:
#convert sentence to lower case
'This' == 'this'
print('AbcdEFgH'.lower())
sentence.lower()
sentence2.lower()

False

abcdefgh


'the big brown fox jumped over a lazy dog.'

'what couses does prof sundeep rangan teaches and how can i contact him ?'

### Tokenize - extract individual words

Tokenization converts a chunk of text into smaller units called tokens. Tokens can be words, subwords, or individual characters. The set of distinct tokens forms the vocabulary.

In [8]:
tokenizer = RegexpTokenizer(r'\w+')  # each token is a run of word characters (letters, digits, underscore)
tokens = tokenizer.tokenize(sentence)
tokens
tokens2 = tokenizer.tokenize(sentence2)
tokens2

['The', 'Big', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

['What',
 'couses',
 'does',
 'Prof',
 'Sundeep',
 'Rangan',
 'teaches',
 'and',
 'how',
 'can',
 'I',
 'contact',
 'him']
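
To make the difference between word-level and character-level tokens concrete, the sketch below uses only the standard library. It assumes that `re.findall` with the pattern `\w+` gives the same word tokens as the `RegexpTokenizer` call above; the sample text is made up.

```python
import re

text = "Office hours?"

# Word-level tokens: runs of letters, digits and underscores.
word_tokens = re.findall(r"\w+", text)

# Character-level tokens: every non-space character is its own token.
char_tokens = [ch for ch in text if not ch.isspace()]

# The vocabulary is the set of distinct tokens.
vocabulary = sorted(set(word_tokens))

print(word_tokens)      # ['Office', 'hours']
print(char_tokens[:6])  # ['O', 'f', 'f', 'i', 'c', 'e']
print(vocabulary)       # ['Office', 'hours']
```

Note that the punctuation mark disappears from the word tokens but survives as a character token.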

### Stopwords : Filter words to remove non-useful words

In [9]:
# NLTK's stop word list is lowercase, so capitalized tokens such as 'The' are not filtered here
filtered_words = [w for w in tokens if w not in stopwords.words('english')]
filtered_words

['The', 'Big', 'brown', 'fox', 'jumped', 'lazy', 'dog']

In [10]:
filtered_words = [w for w in tokens2 if w not in stopwords.words('english')]
filtered_words

['What', 'couses', 'Prof', 'Sundeep', 'Rangan', 'teaches', 'I', 'contact']
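
The outputs above still contain 'The', 'What' and 'I' because the comparison against the stop word list is case-sensitive. The sketch below shows the effect with a tiny, hypothetical stop word set (NLTK's English list is much longer) and no NLTK dependency.

```python
# A tiny, hypothetical stop word list used only for this illustration.
STOP_WORDS = {"the", "a", "over", "what", "i"}

tokens = ["The", "Big", "brown", "fox", "jumped", "over", "a", "lazy", "dog"]

# Case-sensitive comparison: "The" does not match "the" and is kept.
case_sensitive = [w for w in tokens if w not in STOP_WORDS]

# Lowercasing before the comparison catches capitalized stop words as well.
case_insensitive = [w for w in tokens if w.lower() not in STOP_WORDS]

print(case_sensitive)    # ['The', 'Big', 'brown', 'fox', 'jumped', 'lazy', 'dog']
print(case_insensitive)  # ['Big', 'brown', 'fox', 'jumped', 'lazy', 'dog']
```

This is why the preprocess function that follows lowercases the sentence before filtering.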

In [11]:
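def preprocess(sentence):
    # Lowercase first so that capitalized stop words ('The', 'What') are removed too
    sentence = sentence.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(sentence)
    # Build the stop word set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    filtered_words = [w for w in tokens if w not in stop_words]
    return filtered_words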

In [12]:
preprocessed_sentence = preprocess(sentence)
print(preprocessed_sentence)

['big', 'brown', 'fox', 'jumped', 'lazy', 'dog']


In [13]:
preprocess(sentence2)

['couses', 'prof', 'sundeep', 'rangan', 'teaches', 'contact']

## Tagging

POS tagging is the process of classifying words into their parts of speech (POS) and labelling them accordingly. NLTK provides the nltk.pos_tag function for this. The tagged output for both example sentences is shown below:

In [14]:
tags = nltk.pos_tag(preprocessed_sentence)
print(tags)

[('big', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumped', 'VBD'), ('lazy', 'JJ'), ('dog', 'NN')]


In [15]:
tags = nltk.pos_tag(preprocess(sentence2))
print(tags)

[('couses', 'NNS'), ('prof', 'VBP'), ('sundeep', 'JJ'), ('rangan', 'NN'), ('teaches', 'NNS'), ('contact', 'NN')]


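## Extracting only nouns, verbs, and other selected tags

Only words whose tag appears in the list inside the function below are kept as features. VBD (past-tense verb) is not among them, so a word tagged VBD, such as 'jumped' in the first example sentence, is dropped.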

In [16]:
def extract_tagged(sentences):
    features = []
    for tagged_word in sentences:
        word, tag = tagged_word
        if tag in {'NN', 'VBN', 'NNS', 'VBP', 'RB', 'VBZ', 'VBG', 'PRP', 'JJ'}:
            features.append(word)
    return features

In [17]:
extract_tagged(tags)

['couses', 'prof', 'sundeep', 'rangan', 'teaches', 'contact']
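
To see which words the tag filter removes, the sketch below applies the same set of tags to the tagged output of the first example sentence, copied from the tagging step above. It needs no NLTK data; `KEPT_TAGS` is just an illustrative name for the tags listed in extract_tagged.

```python
# Tags kept by extract_tagged, collected into a set for readability.
KEPT_TAGS = {"NN", "VBN", "NNS", "VBP", "RB", "VBZ", "VBG", "PRP", "JJ"}

# Tagged tokens copied from the output for the first example sentence.
tagged = [("big", "JJ"), ("brown", "NN"), ("fox", "NN"),
          ("jumped", "VBD"), ("lazy", "JJ"), ("dog", "NN")]

kept = [word for word, tag in tagged if tag in KEPT_TAGS]
dropped = [word for word, tag in tagged if tag not in KEPT_TAGS]

print(kept)     # ['big', 'brown', 'fox', 'lazy', 'dog']
print(dropped)  # ['jumped']
```

Whether past-tense verbs should be dropped depends on the dataset; adding 'VBD' to the set would keep them.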

## Lemmatize words

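Lemmatization groups together the different inflected forms of a word so that they can be analysed as a single item, the word's dictionary form (lemma). By default, WordNetLemmatizer treats each word as a noun, which is why 'willing' and 'stemmed' come back unchanged below.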

In [18]:
lmtzr = WordNetLemmatizer()
print(lmtzr.lemmatize('cacti'))
print(lmtzr.lemmatize('willing'))
print(lmtzr.lemmatize('feet'))
print(lmtzr.lemmatize('stemmed'))

print(lmtzr.lemmatize('cactus'))

cactus
willing
foot
stemmed
cactus


## Stem words

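Stemming is a kind of word normalization: variants of a word are reduced to a common stem by stripping suffixes, so that words with similar meaning are treated alike. Unlike a lemma, a stem need not be a dictionary word (for example, 'lazy' becomes 'lazi' later in this notebook).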

In [19]:
words_for_stemming = ['stem', 'stemming', 'stemmed', 'stemmer', 'stems','feet','willing']

In [20]:
stemmer = SnowballStemmer("english")
[stemmer.stem(x) for x in words_for_stemming]

['stem', 'stem', 'stem', 'stemmer', 'stem', 'feet', 'will']

## Putting it all together

We now create a function that ties everything together and gives us our final output. The print statements are commented out because we only want the final output; uncomment them to see the result of each step.

In [21]:
def extract_feature(text):
    words = preprocess(text)
#     print('words: ',words)
    tags = nltk.pos_tag(words)
#     print('tags: ',tags)
    extracted_features = extract_tagged(tags)
#     print('Extracted features: ',extracted_features)
    stemmed_words = [stemmer.stem(x) for x in extracted_features]
#     print(stemmed_words)

    result = [lmtzr.lemmatize(x) for x in stemmed_words]
   
    return result

In [22]:
words = extract_feature(sentence)
print(words)

['big', 'brown', 'fox', 'lazi', 'dog']


In [23]:
words = extract_feature(sentence2)
print(words)

['cous', 'prof', 'sundeep', 'rangan', 'teach', 'contact']


## Implementing bag of words

In simple terms, it’s a collection of words to represent a sentence, disregarding the order in which they appear.

In [24]:
def word_feats(words):
    return {word: True for word in words}

In [25]:
word_feats(words)

{'cous': True,
 'prof': True,
 'sundeep': True,
 'rangan': True,
 'teach': True,
 'contact': True}
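
The sketch below illustrates two consequences of this representation: word order is ignored, and repeated words collapse into a single key. The `Counter` comparison is only an aside showing what a count-based bag of words would keep; the notebook itself uses the presence-only form.

```python
from collections import Counter

def word_feats(words):
    # Presence-only bag of words, equivalent to the definition above.
    return {word: True for word in words}

a = word_feats(["contact", "prof", "rangan"])
b = word_feats(["rangan", "contact", "prof"])
print(a == b)  # True: the two orderings give the same features

# Repeated words collapse to one key; a Counter would keep the counts instead.
presence = word_feats(["help", "help", "want"])
counts = Counter(["help", "help", "want"])
print(presence)  # {'help': True, 'want': True}
print(counts)    # Counter({'help': 2, 'want': 1})
```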

## Parsing the whole document

We now parse the whole document for our purpose, a chatbot specific to the NYU ECE department. The chatbot engine needs data curated in a particular manner. Because it is a chatbot, the data is in question-answer format, with the questions categorised.
Hence each line has this format:
this is the input text from the user|category|answer to give
You can check the data.txt file to see the data format. The fields are separated by the pipe symbol '|'.

In [26]:
def extract_feature_from_doc(data):
    result = []
    corpus = []
    # The responses of the chat bot
    answers = {}
    for (text,category,answer) in data:

        features = extract_feature(text)

        corpus.append(features)
        result.append((word_feats(features), category))
        answers[category] = answer

    return (result, sum(corpus,[]), answers)

In [27]:
extract_feature_from_doc([['this is the input text from the user','category','answer to give']])

([({'input': True, 'user': True}, 'category')],
 ['input', 'user'],
 {'category': 'answer to give'})

In [28]:
def get_content(filename):
    # Read pipe-separated rows and keep only well-formed ones (text, category, answer)
    with open(filename, 'r') as content_file:
        lines = csv.reader(content_file, delimiter='|')
        data = [x for x in lines if len(x) == 3]
        return data
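
The parsing step can be tried without a file on disk. The sketch below feeds a made-up sample in the pipe-separated layout to `csv.reader` through `io.StringIO` and applies the same three-field filter as get_content; the sample rows are hypothetical and not taken from data.txt.

```python
import csv
import io

# Hypothetical sample in the pipe-separated layout described above.
sample = (
    "hi|Greetings|Hello. I am Bobcat.\n"
    "need help|Help|How may I help you?\n"
    "a malformed line without separators\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="|")
rows = [row for row in reader if len(row) == 3]  # same filter as get_content

for text, category, answer in rows:
    print(category, "<-", repr(text))
```

The malformed third line has only one field, so the filter silently skips it.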

Chatbot name: BobCat, the official mascot of NYU.
For demonstration purposes, only certain NYU-ECE department data has been added. It has this basic structure:

--Greetings Remarks
--Morning
--Afternoon
--Evening
--Opening Remarks
--Help
--No Help
--Closing Remarks
--Location of the department
--Timing/working hours of the department
--Contact department
--Head/Chair of the department
--Details of professors:
    Prof Ivan Selesnick
    Prof Sundeep Rangan
    Prof Yury Dvorkin
    Prof Ludovic Righetti

In [29]:
filename = 'data.txt'
data = get_content(filename)

In [30]:
data

[['Hello',
  'Greetings',
  'Hello. Greeting For the day! I am Bobcat. I will serve your NYU-Tandon ECE department enquiries.'],
 ['hi hello',
  'Greetings',
  'Hello. I am Bobcat. Greeting For the day! I will serve your NYU-Tandon ECE department enquiries.'],
 ['hi ',
  'Greetings',
  'Hello. I am Bobcat. Greeting For the day! I will serve your NYU-Tandon ECE department enquiries.'],
 ['hi',
  'Greetings',
  'Hello. I am Bobcat. Greeting For the day! I will serve your NYU-Tandon ECE department enquiries.'],
 ['hi',
  'Greetings',
  'Hello. I am Bobcat. Greeting For the day! I will serve your NYU-Tandon ECE department enquiries.'],
 ['hey',
  'Greetings',
  'Hello. I am Bobcat. Greeting For the day! I will serve your NYU-Tandon ECE department enquiries.'],
 ['hello, hi',
  'Greetings',
  'Hello. I am Bobcat. Greeting For the day! I will serve your NYU-Tandon ECE department enquiries.'],
 ['hey',
  'Greetings',
  'Hello. I am Bobcat. Greeting For the day! I will serve your NYU-Tandon EC

In [31]:
features_data, corpus, answers = extract_feature_from_doc(data)

In [32]:
print(features_data[50])

({'professor': True, 'ivan': True, 'selesnick': True}, 'Department-Head')


In [33]:
corpus

['hello',
 'hi',
 'hello',
 'hi',
 'hi',
 'hi',
 'hey',
 'hello',
 'hi',
 'hey',
 'hey',
 'hi',
 'hey',
 'hello',
 'good',
 'morn',
 'good',
 'afternoon',
 'good',
 'even',
 'good',
 'night',
 'today',
 'want',
 'help',
 'need',
 'help',
 'help',
 'want',
 'help',
 'want',
 'assist',
 'help',
 'great',
 'talk',
 'great',
 'thank',
 'help',
 'thank',
 'thank',
 'much',
 'thank',
 'thank',
 'much',
 'ece',
 'depart',
 'locat',
 'ece',
 'depart',
 'ece',
 'depart',
 'ece',
 'ece',
 'oper',
 'hour',
 'ece',
 'contact',
 'ece',
 'depart',
 'contact',
 'depart',
 'call',
 'depart',
 'call',
 'depart',
 'phone',
 'number',
 'depart',
 'contact',
 'contact',
 'phone',
 'number',
 'depart',
 'phone',
 'phone',
 'number',
 'depart',
 'depart',
 'head',
 'depart',
 'head',
 'h',
 'depart',
 'chair',
 'depart',
 'ivan',
 'selesnick',
 'professor',
 'ivan',
 'selesnick',
 'ivan',
 'selesnick',
 'prof',
 'ivan',
 'room',
 'professor',
 'ivan',
 'occupi',
 'professor',
 'ivan',
 'offic',
 'hour',
 'm

In [34]:
answers

{'Greetings': 'Hello. I am Bobcat. Greeting For the day! I will serve your NYU-Tandon ECE department enquiries.',
 'Morning': 'Good Morning. I am Bobcat. Greeting For the day! I will serve your NYU-Tandon ECE department enquiries.',
 'Afternoon': 'Good afternoon. I am Bobcat. Greeting For the day! I will serve your NYU-Tandon ECE department enquiries.',
 'Evening': 'Good evening. I am Bobcat. Greeting For the day! I will serve your NYU-Tandon ECE department enquiries.',
 'Goodbye': 'Good night. Take care.',
 'Opening': "I'm fine! Thank you. How may I help you?",
 'Help': 'How may I help you?',
 'No-Help': 'Ok sir/madam. No problem. Have a nice day.',
 'Closing': "It's glad to know that I have been helpful. Have a good day!",
 'Location': 'The ECE Department located is on the 8th Floor at 370 Jay Street & 2nd Floor at 5 Metrotech Center.',
 'timings': 'There is always someone to attend you from 9am till 5pm.',
 'Contact-Department': 'You can contact department virtually via by calling -

# Train a model using these features

In [35]:
## split data into train and test sets
split_ratio = 0.75

In [36]:
def split_dataset(data, split_ratio):
    random.shuffle(data)  # shuffles the list in place; the split differs on every run
    data_length = len(data)
    train_split = int(data_length * split_ratio)
    return (data[:train_split]), (data[train_split:])

In [37]:
training_data, test_data = split_dataset(features_data, split_ratio)
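
Because split_dataset shuffles without a fixed seed, the accuracies reported later can change from run to run. If reproducible splits are wanted, one option is sketched below; `split_dataset_seeded` and its `seed` parameter are hypothetical additions, not part of the original notebook.

```python
import random

def split_dataset_seeded(data, split_ratio, seed=0):
    # Hypothetical variant of split_dataset: shuffles a copy with a fixed seed,
    # so the caller's list is left untouched and the split is repeatable.
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]

items = list(range(8))
train, test = split_dataset_seeded(items, 0.75)
print(len(train), len(test))  # 6 2
print(items)                  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Calling the function again with the same seed returns the same split.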

In [38]:
training_data

[({'want': True, 'help': True}, 'No-Help'),
 ({'hello': True, 'hi': True}, 'Greetings'),
 ({'call': True, 'sundeep': True}, 'Sundeep-Rangan'),
 ({'meet': True, 'professor': True, 'sundeep': True}, 'Sundeep-Rangan'),
 ({'great': True}, 'Closing'),
 ({'email': True, 'prof': True, 'ivan': True}, 'Department-Head'),
 ({'want': True, 'help': True}, 'Help'),
 ({'ludov': True}, 'Ludovic-Righetti'),
 ({'email': True, 'prof': True, 'yuri': True}, 'Yury-Dvorkin'),
 ({'ece': True, 'depart': True}, 'Location'),
 ({'prof': True, 'ludov': True}, 'Ludovic-Righetti'),
 ({'meet': True, 'prof': True, 'yuri': True}, 'Yury-Dvorkin'),
 ({'help': True}, 'Help'),
 ({'sundeep': True, 'rangan': True}, 'Sundeep-Rangan'),
 ({'need': True, 'help': True}, 'Help'),
 ({'rangan': True}, 'Sundeep-Rangan'),
 ({'prof': True, 'dvorkin': True}, 'Yury-Dvorkin'),
 ({'depart': True, 'head': True}, 'Department-Head'),
 ({'help': True}, 'No-Help'),
 ({'professor': True, 'yuri': True, 'dvorkin': True}, 'Yury-Dvorkin'),
 ({'meet

In [39]:
# save the data
np.save('training_data', training_data)
np.save('test_data', test_data)

## Classification using Decision tree

A decision tree is a structure in which each node corresponds to a feature name and each branch to a feature value. Tracing down the branches leads to the leaves of the tree, which are the classification labels.

In [40]:
training_data = np.load('training_data.npy',allow_pickle = True )
test_data = np.load('test_data.npy',allow_pickle = True)

In [41]:
def train_using_decision_tree(training_data, test_data):
    
    classifier = nltk.classify.DecisionTreeClassifier.train(training_data, entropy_cutoff=0.6, support_cutoff=6)
    classifier_name = type(classifier).__name__
    training_set_accuracy = nltk.classify.accuracy(classifier, training_data)
    print('training set accuracy: ', training_set_accuracy)
    test_set_accuracy = nltk.classify.accuracy(classifier, test_data)
    print('test set accuracy: ', test_set_accuracy)
    return classifier, classifier_name, test_set_accuracy, training_set_accuracy

In [42]:
dtclassifier, classifier_name, test_set_accuracy, training_set_accuracy = train_using_decision_tree(training_data, test_data)

training set accuracy:  0.9642857142857143
test set accuracy:  0.7142857142857143
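
To picture the node, branch and leaf structure, here is a hand-written toy tree and a function that walks it. This is not the tree NLTK learned from the data and not NLTK's internal representation; the features and labels are picked by hand for illustration, with `None` standing for a word that is absent from the input.

```python
# A hand-written toy tree: internal nodes name a feature, branches are keyed
# by the feature's value, and leaves are plain string labels.
TOY_TREE = {
    "feature": "help",
    "branches": {
        True: "Help",
        None: {
            "feature": "hello",
            "branches": {True: "Greetings", None: "Closing"},
        },
    },
}

def classify(tree, featureset):
    # Walk down the tree until a leaf (a string label) is reached.
    while isinstance(tree, dict):
        value = featureset.get(tree["feature"])  # None when the word is absent
        tree = tree["branches"][value]
    return tree

print(classify(TOY_TREE, {"help": True, "need": True}))  # Help
print(classify(TOY_TREE, {"hello": True}))               # Greetings
print(classify(TOY_TREE, {"thank": True}))               # Closing
```

A learned tree has the same shape, but the feature tested at each node is chosen from the training data.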


## Classification using Naive Bayes

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features. They are among the simplest Bayesian network models, yet on this dataset Naive Bayes reaches a higher test set accuracy than the decision tree above (about 0.82 versus 0.71 in the run shown here).

In [43]:
def train_using_naive_bayes(training_data, test_data):
    classifier = nltk.NaiveBayesClassifier.train(training_data)
    classifier_name = type(classifier).__name__
    training_set_accuracy = nltk.classify.accuracy(classifier, training_data)
    test_set_accuracy = nltk.classify.accuracy(classifier, test_data)
    return classifier, classifier_name, test_set_accuracy, training_set_accuracy

In [44]:
classifier, classifier_name, test_set_accuracy, training_set_accuracy = train_using_naive_bayes(training_data, test_data)
print(training_set_accuracy)
print(test_set_accuracy)
print(len(classifier.most_informative_features()))
classifier.show_most_informative_features()

0.9761904761904762
0.8214285714285714
80
Most Informative Features
                 contact = True           timing : Depart =      7.5 : 1.0
                    yuri = None           Ludovi : Yury-D =      5.4 : 1.0
                   ludov = None           Depart : Ludovi =      4.4 : 1.0
                    ivan = None           Ludovi : Depart =      4.2 : 1.0
                    help = True             Help : Closin =      4.1 : 1.0
                 sundeep = None           Ludovi : Sundee =      3.3 : 1.0
                  depart = None           Ludovi : Locati =      3.2 : 1.0
                  depart = True           timing : Depart =      3.2 : 1.0
                   hello = None           Ludovi : Greeti =      2.7 : 1.0
                   thank = None           Ludovi : Closin =      2.7 : 1.0


In [45]:
classifier.classify(({'head': True, 'depart': True, 'seles': True}))

'Department-Head'

In [46]:
extract_feature("hello")

['hello']

In [47]:
word_feats(extract_feature("hello"))

{'hello': True}

In [48]:
input_sentence = "Contact professor Sundeep"
classifier.classify(word_feats(extract_feature(input_sentence)))

'Sundeep-Rangan'

In [49]:
def reply(input_sentence):
    # Classify with the decision tree trained above; use classifier instead for Naive Bayes
    category = dtclassifier.classify(word_feats(extract_feature(input_sentence)))
    return answers[category]
    
    

# Test:

In [50]:
reply('Hi')

'Hello. I am Bobcat. Greeting For the day! I will serve your NYU-Tandon ECE department enquiries.'

In [51]:
reply('Contact professor Sundeep')

'Prof. Sundeep Rangan is an Associate Professor and director at NYU Wireless. You can find more details on him at: https://wireless.engineering.nyu.edu/sundeep-rangan/'

In [52]:
reply('Which room does professor Ludovic occupy')

'Prof.Ludovic Righetti is an Associate Professor in the ECE & MAE Department and, a Senior Researcher at the Max-Planck Institute for Intelligent Systems (MPI-IS).  You can find more details on him at: https://wp.nyu.edu/machinesinmotion/'

In [53]:
reply('What is location of ECE department?')

'The ECE Department located is on the 8th Floor at 370 Jay Street & 2nd Floor at 5 Metrotech Center.'

In [54]:
reply('how to contact the department?')

'You can contact department virtually via by calling - 646-997-3878'

In [55]:
reply('Who is the department head?')

'Prof. Ivan Selesnick is the Head of ECE Department. You can find more details on him at: https://eeweb.engineering.nyu.edu/iselesni/'

In [56]:
reply('when can I contact ECE department?')

'There is always someone to attend you from 9am till 5pm.'

In [57]:
reply('Thanks!')

"It's glad to know that I have been helpful. Have a good day!"

In [58]:
reply('What couses does Prof Sundeep Rangan teaches and how can I contact him ?')

'Prof. Sundeep Rangan is an Associate Professor and director at NYU Wireless. You can find more details on him at: https://wireless.engineering.nyu.edu/sundeep-rangan/'

# Conclusion:

Once a model has been trained with an algorithm that gives acceptable accuracy, it can be called from any chatbot UI framework, after a well-curated dataset has been developed.