

---

# Import necessary libraries

In [3]:
import io
import random
import string # to process standard python strings
import warnings
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
warnings.filterwarnings('ignore')  # silence library warnings in the notebook output

# Downloading and installing NLTK
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. NLTK has been used successfully as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems.

In [4]:
pip install nltk



# Installing NLTK Packages

In [5]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('popular', quiet=True) # for downloading packages
#nltk.download('punkt') # first-time use only
#nltk.download('wordnet') # first-time use only

True

# Reading in the corpus

In [7]:
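# Read the whole corpus; the context manager closes the file afterwards,
# and errors='ignore' skips bytes that cannot be decoded.
with open('college website.txt', 'r', errors='ignore') as corpus_file:
    raw = corpus_file.read()
raw = raw.upper()  # converts to upper case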

Text data arrives as strings, but machine learning algorithms need numerical feature vectors to work with. So before starting any NLP project, we pre-process the text to get it into a suitable form. Basic text pre-processing includes:

* Converting the entire text to uppercase or lowercase, so that the algorithm does not treat the same word in different cases as different words.
* Tokenization: converting normal text strings into a list of tokens, i.e., the words that we actually want. A sentence tokenizer finds the list of sentences, and a word tokenizer finds the list of words in a string. **The NLTK data package includes a pre-trained Punkt tokenizer for English.**
* Removing noise, i.e., everything that isn’t a standard number or letter.
* Removing stop words. Extremely common words that appear to be of little value in helping select documents matching a user's need are sometimes excluded from the vocabulary entirely. These words are called stop words.

**Stemming**: Stemming is the process of reducing inflected or derived words to their stem, or root form. Programs that do this are commonly referred to as stemming algorithms or stemmers. The aim is for a stemmer to reduce words such as “chocolates”, “chocolatey” and “choco” to the root word “chocolate”; in practice, stemmers strip suffixes by rule, so the result is not always a dictionary word.

**Lemmatization**: Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. It is similar to stemming but takes context into account, so it maps related word forms to one dictionary word (the lemma).
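
The first pre-processing steps listed above (case conversion, noise removal, stop-word removal) can be sketched in plain Python. This is only an illustration under stated assumptions: `preprocess` and `STOP_WORDS` are hypothetical names, the stop-word set is deliberately tiny, and a whitespace split stands in for a real tokenizer such as Punkt.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only;
# real lists (such as the one shipped with NLTK) are much longer.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "for"}

def preprocess(text):
    """Lower-case, replace non-alphanumeric noise with spaces, split, drop stop words."""
    lowered = text.lower()
    # Keep letters, digits and whitespace; everything else counts as noise
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in lowered)
    tokens = cleaned.split()  # whitespace split stands in for a real tokenizer
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("The hostel is open for boys and girls!")
print(tokens)  # ['hostel', 'open', 'boys', 'girls']
```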



# Tokenisation

In [8]:
senttokens = nltk.sent_tokenize(raw)# converts to list of sentences 
wordtokens = nltk.word_tokenize(raw)# converts to list of words

# Preprocessing

We shall now define a function called LemTokens which will take as input the tokens and return normalized tokens.

In [9]:
lemmer = nltk.stem.WordNetLemmatizer()
#WordNet is a semantically-oriented dictionary of English included in NLTK.
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
dict_remove_punct = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(dict_remove_punct)))
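
To see what the punctuation table does on its own, here is a small standard-library sketch that applies the same `dict_remove_punct` construction to a made-up sample string, without the tokenizing and lemmatizing steps. Note that apostrophes are removed too, so “what's” becomes “whats”.

```python
import string

# Same construction as in the cell above: map every punctuation code point to None,
# which tells str.translate to delete that character.
dict_remove_punct = dict((ord(punct), None) for punct in string.punctuation)

sample = "Hello, World! What's the fee?"  # made-up input for illustration
cleaned = sample.lower().translate(dict_remove_punct)
print(cleaned)  # hello world whats the fee
```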

# Keyword matching
Here we define a function for a greeting by our ATROBOT, i.e., if the user’s input is a greeting, ATROBOT returns a greeting response.

In [2]:
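# Inputs are compared in lower case, so every entry here must be lower case.
GREETING_INPUTS = ("hello", "hi", "namaste", "greetings", "sup", "what's up", "hey")
GREETING_RESPONSES = ["hi", "hey", "hello", "Namaste", "hi how can i help you"]

def greeting(sentence):
    """Return a random greeting if the input contains a greeting, else None."""
    # Check the whole input first so multi-word greetings such as "what's up" can match
    if sentence.lower().strip() in GREETING_INPUTS:
        return random.choice(GREETING_RESPONSES)
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
    return None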

# Generating Response
**Bag of Words**

After the initial preprocessing phase, we need to transform text into a meaningful vector (or array) of numbers. A bag of words is a representation of text that describes the occurrence of words within a document.

 The intuition behind the Bag of Words is that documents are similar if they have similar content. Also, we can learn something about the meaning of the document from its content alone.

For example, if our dictionary contains the words {Learning, is, the, not, great}, and we want to vectorize the text “Learning is great”, we would have the following vector: (1, 1, 0, 0, 1).
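
The example above can be reproduced with a few lines of plain Python. This is a minimal sketch: `bag_of_words` is a hypothetical helper, it splits on whitespace only, and it compares words case-insensitively.

```python
def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    tokens = text.lower().split()
    return [tokens.count(word.lower()) for word in vocabulary]

vocabulary = ["Learning", "is", "the", "not", "great"]
vector = bag_of_words("Learning is great", vocabulary)
print(vector)  # [1, 1, 0, 0, 1]
```
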
# TF-IDF Approach
A problem with the Bag of Words approach is that highly frequent words start to dominate the document representation (they get larger scores), even though they may not carry much “informational content”. It also gives more weight to longer documents than to shorter ones.

One approach is to rescale the frequency of words by how often they appear in all documents, so that the scores for frequent words like “the”, which are also frequent across all documents, are penalized. This approach to scoring is called Term Frequency-Inverse Document Frequency, or TF-IDF for short, where:

**Term Frequency** is a scoring of the frequency of the word in the current document.

TF = (Number of times term t appears in a document) / (Number of terms in the document)

**Inverse Document Frequency** is a scoring of how rare the word is across documents.

IDF = 1 + log(N/n), where N is the number of documents and n is the number of documents in which term t appears.
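
As a worked example, the two formulas can be applied by hand to a toy corpus. The sketch below assumes a natural logarithm and already-tokenized documents; the function names and the three documents are made up for illustration. scikit-learn's `TfidfVectorizer`, used later in this notebook, by default applies a smoothed IDF and normalizes each vector, so its numbers will differ from these.

```python
import math

def tf(term, document):
    """Share of the document's tokens that are `term`."""
    return document.count(term) / len(document)

def idf(term, documents):
    """1 + log(N / n); assumes `term` occurs in at least one document."""
    n_containing = sum(1 for doc in documents if term in doc)
    return 1 + math.log(len(documents) / n_containing)

def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)

# Toy corpus, already tokenized (made up for illustration)
documents = [
    ["the", "college", "has", "a", "hostel"],
    ["the", "college", "has", "placements"],
    ["the", "library", "is", "open"],
]

common = tf_idf("the", documents[0], documents)   # appears in every document
rare = tf_idf("hostel", documents[0], documents)  # appears in one document only
print(round(common, 3), round(rare, 3))  # 0.2 0.42
```

Both words occur once in the first document, yet “hostel” scores about twice as high as “the” because it is rarer across the corpus.
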
# Cosine Similarity
The TF-IDF weight is widely used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus. Once every document is represented as a TF-IDF vector, the similarity between two documents can be measured by the cosine of the angle between their vectors:

Cosine Similarity (d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)

where d1 and d2 are two non-zero vectors.
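
The formula translates directly into plain Python. This is a minimal sketch for intuition only; `cosine_sim` is a hypothetical name chosen so that it does not shadow the `cosine_similarity` function imported from scikit-learn, which is what the notebook actually uses.

```python
import math

def cosine_sim(d1, d2):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

same = cosine_sim([1, 1, 0, 0, 1], [1, 1, 0, 0, 1])      # identical vectors, close to 1
partial = cosine_sim([1, 1, 0, 0, 1], [1, 1, 0, 1, 1])   # three shared words
disjoint = cosine_sim([1, 0], [0, 1])                     # no shared words, exactly 0
print(round(same, 3), round(partial, 3), disjoint)
```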

To generate a response from ATROBOT for input questions, the concept of document similarity is used. We define a function response that adds the user’s utterance to the list of corpus sentences, converts them all to TF-IDF vectors, and returns the corpus sentence most similar to the utterance. If no sentence has any similarity to the input, it returns the response: “I am sorry! I don’t understand you”


In [10]:
def response(user_response):
    # Temporarily add the query so it is vectorized together with the corpus;
    # the caller removes it again after printing the reply.
    senttokens.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(senttokens)
    # Similarity of the query (last row) to every sentence, itself included
    vals = cosine_similarity(tfidf[-1], tfidf)
    # The top score is the query matched with itself, so take the runner-up
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if req_tfidf == 0:
        return "I am sorry! I don't understand you"
    return senttokens[idx]
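
The `[-2]` indexing in `response` is easy to misread, so here is a pure-Python sketch of the same selection rule. It assumes the last similarity score belongs to the query itself (which is always the best match), so the answer is the runner-up; `pick_best_match` is a hypothetical helper and the scores are made up.

```python
def pick_best_match(similarities):
    """Return the index of the second-highest score, or None if that score is 0.

    similarities[i] is the query's similarity to sentence i; the last entry
    is the query compared with itself.
    """
    order = sorted(range(len(similarities)), key=lambda i: similarities[i])
    runner_up = order[-2]
    if similarities[runner_up] == 0:
        return None  # nothing in the corpus shares a term with the query
    return runner_up

match = pick_best_match([0.0, 0.31, 0.12, 1.0])  # sentence 1 is the closest
no_match = pick_best_match([0.0, 0.0, 1.0])      # only the query matches itself
print(match, no_match)  # 1 None
```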

Finally, we supply the lines that ATROBOT should say when starting and ending a conversation, depending on the user’s input.

In [None]:
flag = True
print("ATROBOT: Hi..Welcome to ATRIA INSTITUTE OF TECHNOLOGY..I am your virtual assistant ATROBOT. Let me answer your queries regarding this institution. If you want to exit, type Bye!")
while flag:
    user_response = input().lower()
    if user_response != 'bye':
        if user_response in ('thanks', 'thank you'):
            flag = False
            print("ATROBOT: You are welcome..")
        else:
            # Call greeting() once and reuse the result
            greeting_reply = greeting(user_response)
            if greeting_reply is not None:
                print("ATROBOT: " + greeting_reply)
            else:
                print("ATROBOT: ", end="")
                print(response(user_response))
                # response() appended the query to senttokens; undo that here
                senttokens.remove(user_response)
    else:
        flag = False
        print("ATROBOT: Bye! take care..")

ATROBOT: Hi..Welcome to ATRIA INSTITUTE OF TECHNOLOGY..I am your virtual assistant ATROBOT. Let me answer your queries regarding this institution. If you want to exit, type Bye!
hi
ATROBOT: hi
what is the name of the college
ATROBOT: NAME OF THE COLLEGE
ATRIA INSTITUTE OF TECHNOLOGY BANGLORE.
hostel facility
ATROBOT: HOSTEL FACILITIES
WE HAVE SEPERATE HOSTEL FOR BOYS AND GIRLS WITH SEMI FURNISHED ROOM,STUDENT FRIENDLY WARDENS,WI-FI CONECTIVITY,CCTV CAMERAS,INDOOR GAMES,24/7 SECURITY.
tell me about placements
ATROBOT: PLACEMENTS
AT ATRIA,WE HAVE PLACEMENT AS AN INTEGRAL PART OF THE EDUCATION PROCESS OF A STUDENT;PLACEMENT PREPARATION & READINESS STARTS SOON AFTER ADMISSIONS - FOR EACH STUDENT, IDENTIFYING THE BASIC SKILLS AND IMPROVEMENT AREAS,CONDUCTING FOUNDATION, ADD-ON, AND ADVANTAGE COURSES,MONITORING THE PROGRESS AND ENHANCING THE READINESS TO FACE THE PLACEMENT SEASON WITH CONFIDENCE,IS AN INTERDISCIPLINARY ACTIVITY WITH OVERSIGHT BY THE ACADEMIC HEAD AND THE PLACEMENT HEAD OF TH