# Let's program a chatbot! 

## What is a chatbot?

A **chatbot** is a software application:
- used to conduct an on-line chat conversation via text or text-to-speech
- supposed to replace a live human agent
- designed to convincingly simulate the way a human would behave as a conversational partner
- requiring continuous tuning and testing




**Applications**:
- customer service
- request routing
- information gathering


![Picture title](img/image-20200917-165530.png)

**Types**:
- chatbots that use extensive word-classification processes, natural language processors, and sophisticated AI
- chatbots that simply scan for general keywords and generate responses using common phrases obtained from an associated library or database

1- **HOMEWORK QUESTION** (do your own research and answer): What is the difference between a chatbot, a socialbot and a virtual assistant?

<p style="color:green;"><b>Type your response here</b></p>

## Turing test

The **Turing test**, originally called **the imitation game** by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human.

**IF YOU FANCY**: 
- Watch a great movie about Alan Turing and the Enigma code: https://www.imdb.com/title/tt2084970/
- Food for thought: would an AI purposefully hide its superintelligence for fear of being destroyed? Maybe it is intentionally failing the Turing test? (If you are interested in this topic, this question belongs to the general field of "Philosophy of AI".)


**Please watch a video about the Turing test**: https://www.youtube.com/watch?v=3wLqsRLvV-c

**Please watch a short video about Eliza**: https://www.youtube.com/watch?v=RMK9AphfLco


**IF YOU FANCY**: Chat with Eliza if you feel like it: https://web.njit.edu/~ronkowit/eliza.html

![Picture title](image-20200917-152808.png)

# Edit distance

**Edit distance** is a way of quantifying how dissimilar two strings (e.g. words) are to one another by counting the minimum number of operations required to transform one string into the other. **Levenshtein distance** is the most common edit distance metric, and the two terms are often used interchangeably.

![Picture title](img/image-20200917-174435.png)
Vladimir Levenshtein, the Russian scientist who developed the Levenshtein distance algorithm in 1965.

![Picture title](img/image-20200917-174109.png)
(Image source (and a good read): https://medium.com/@ethannam/understanding-the-levenshtein-distance-equation-for-beginners-c4285a5604f0)
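
The definition above can be turned into code with a small dynamic-programming table. The following is a minimal sketch in plain Python, not the NLTK implementation; the function name `levenshtein` and the variable names are chosen here for illustration only. It assumes the three classic operations (insertion, deletion, substitution), each with cost 1.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `source` into `target`."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of the source prefix
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            substitution_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                      # deletion
                dist[i][j - 1] + 1,                      # insertion
                dist[i - 1][j - 1] + substitution_cost,  # substitution (or match)
            )
    return dist[-1][-1]


print(levenshtein("kitten", "sitting"))  # 3: k->s, e->i, append g
print(levenshtein("Linch", "Lynch"))     # 1: i->y
```

The table has one row per prefix of the source and one column per prefix of the target, so the running time grows with the product of the two string lengths.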

**How do we use edit distance?**

- computational biology/bioinformatics (comparing DNA sequences)
- correction of spelling mistakes or OCR errors
- approximate string matching, where the objective is to find matches for short strings in many longer texts, in situations where a small number of differences is to be expected (e.g. encryption).
- auto suggestions of words
- information retrieval
- machine translation



### Spell Check example
The following code looks up the name of a director in the **IMDb dataset** and shows the **titles of their movies**.
If the user misspells the name, the code tries to match it to the directors it already knows, using the Levenshtein distance implemented in NLTK:

https://python.gotrained.com/nltk-edit-distance-jaccard-distance/



In [None]:
!pip install nltk
import pandas as pd #import Pandas library for data manipulation and analysis
import nltk

movies = pd.read_csv("IMDb movies.csv")#read the file containing movies data

# Keep only directors with more than 10 movies (to reduce search time)
# and build an array of the unique values in the 'director' column.
directors = movies[movies.groupby('director')['director'].transform('size') > 10]['director'].unique()

done = False  # named `done` rather than `quit` to avoid shadowing the built-in quit()
while not done:
    input_name = input("Enter a director's full name (example: David Lynch), or press q to quit: ")
    done = (input_name == "q")
    if not done:
        if input_name in directors:
            # Exact match: find the movies of the specified director.
            movie_names = movies.loc[movies['director'] == input_name]['original_title']
            print("Here are the names of the movies I know directed by " + input_name + ":")
            display(movie_names.reset_index(drop=True))
        else:
            # Handle a misspelled name: compute the edit distance between the input and every known director.
            distances = [nltk.edit_distance(input_name, director) for director in directors]
            # Pick the director whose name has the smallest distance to the input.
            guessed = directors[distances.index(min(distances))]
            # Ask the user whether the guess is correct.
            answer = input("Did you mean " + guessed + "? (y/n) ")
            if answer.lower() in ["yes", "y"]:
                # The guess was correct: look up the movies of the guessed director.
                movie_names = movies.loc[movies['director'] == guessed]['original_title']
                print("Here are the names of the movies I know directed by " + guessed + ":")
                display(movie_names.reset_index(drop=True))
            elif answer == "q":
                # The user wants to exit the program.
                done = True
            else:
                # Give up after the first wrong guess.
                print("I might not know this director! Sorry.")


2- **Observe and Reflect**: What happens if we only enter the last name of a director? For example: _Linch_ instead of _Lynch_

- How do you suggest we fix the problem?

<p style="color:green;"><b>Type your response here</b></p>


We can look for names and last names separately and match them to our dataset.
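
One possible way to do this is sketched below. It is only an illustration under stated assumptions: `edit_distance` is a compact plain-Python stand-in for `nltk.edit_distance`, `closest_director` is a hypothetical helper, and `known_directors` is a small made-up list standing in for the `directors` array of the notebook.

```python
def edit_distance(a: str, b: str) -> int:
    """Compact Levenshtein distance (stand-in for nltk.edit_distance)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def closest_director(query: str, directors: list[str]) -> str:
    """Match the query against each director word by word.

    Every word of the query is compared with every word of a director's name
    (first name, last name); the score is the sum of the best per-word distances.
    """
    def score(director: str) -> int:
        parts = director.lower().split()
        return sum(
            min(edit_distance(word, part) for part in parts)
            for word in query.lower().split()
        )
    return min(directors, key=score)


# Hypothetical sample; in the notebook this would be the `directors` array.
known_directors = ["David Lynch", "Ken Loach", "Woody Allen", "David Fincher"]

print(closest_director("Linch", known_directors))        # David Lynch
print(closest_director("Linch David", known_directors))  # David Lynch
```

Because each word is scored separately, a missing first name or a swapped word order no longer inflates the distance.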

3- **Observe and Reflect**: 
Read about Jaccard distance [here](https://python.gotrained.com/nltk-edit-distance-jaccard-distance/)

Can you think of an example where **Jaccard distance** might give you the correct guess but **Levenshtein distance** doesn't? 
Explain why.

(Go ahead and try it in code)


<p style="color:green;"><b>Type your response here</b></p>

Jaccard distance does not consider the position of the characters when computing the distance. If all the characters are present but in the wrong order, Jaccard distance still finds the closest match. Levenshtein distance, however, only allows insertions, deletions and substitutions, so if the characters are all in the wrong positions the edit distance is very high and the correct match is no longer the closest one.


Consider this pattern, for example: Chyn DaLdiv

```python
# Compute the Jaccard distance between the character set of the input name and that of every known director
distances = [nltk.jaccard_distance(set(input_name), set(director)) for director in directors]
```
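
To see the position independence in isolation, here is a small sketch that computes the Jaccard distance on character sets by hand. The function below is written for illustration and uses the usual definition, 1 - |A ∩ B| / |A ∪ B|; the names in the example are a made-up sample rather than the notebook's `directors` array.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance between two sets: 1 - |A ∩ B| / |A ∪ B|."""
    return 1 - len(a & b) / len(a | b)


target = "David Lynch"

# Same characters in a different order: the character sets are identical.
print(jaccard_distance(set("Lynch David"), set(target)))  # 0.0

# The pattern from the text: close to 0, but not exactly 0,
# because 'C' and 'c' are different characters.
print(jaccard_distance(set("Chyn DaLdiv"), set(target)))

# A different director shares fewer characters, so the distance is larger.
print(jaccard_distance(set("Lynch David"), set("Ken Loach")))
```

Note that the comparison is case-sensitive; lowercasing both strings first would make the scrambled pattern match exactly.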

---


## Regex Warm up

There are platforms to test regex such as:
https://regexr.com/




### regex.findall() 
The **regex.findall** function returns a list of all non-overlapping substrings that match the pattern.


See an example of tokenization using regex in the following code cell:

In [None]:
import regex
text = "Hello, I am studying computer science and this is my NLP course material."
regex.findall(r"\w+", text)  # \w+ matches one or more word characters; the raw string keeps the backslash intact

### regex.sub
The **regex.sub** function replaces every part of the text that matches the pattern with a replacement string.

See an example in the following code cell:

In [None]:
import regex
text = "Hello, I am studying computer science and this is my NLP course material. Yay!! "
text = regex.sub(r"[\W]", " ", text)  # replace every non-word character (anything other than a letter, digit or underscore) with a space
text = regex.sub(r"\s+", " ", text)  # replace every sequence of whitespace with a single space

print(text)


### regex.match 
The **regex.match** function tries to match the pattern at the beginning of the string. The parts of the pattern enclosed in parentheses are captured as separate groups.
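
The sketch below illustrates the "beginning of the string" behaviour and the numbering of groups. It uses Python's standard `re` module instead of `regex`, on the assumption that both behave the same way for these simple patterns; the `date` string is just a sample.

```python
import re

date = "Tuesday, September 01, 2020"
pattern = r"(\w+) (\d{2}), (\d{4})"  # month name, two-digit day, four-digit year

# match() only succeeds if the pattern matches at the very start of the string.
print(re.match(pattern, date))  # None: the string starts with "Tuesday,"

# search() scans the string for the first position where the pattern matches.
found = re.search(pattern, date)
print(found[0])                      # September 01, 2020  (group 0 is the whole match)
print(found[1], found[2], found[3])  # September 01 2020
```

Group 0 always holds the full match, and groups 1, 2, ... follow the order of the opening parentheses in the pattern.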

See an example identifying a specific time pattern in the following code cell:

In [None]:
time = "12:05 PM Tuesday, September 01, 2020 (GMT+2)"

groups = regex.match(r"(\d{2}):(\d{2})", time)  # \d matches a digit; {} sets the number of repetitions, e.g. \w{2,5} matches at least 2 and at most 5 word characters
matched = groups[0]
hour = groups[1]
minutes = groups[2]
print(hour,"h",minutes)

4- **Observe and Reflect** Please give an example of a string that the regex in the example above (when extracting the time) would accept even though it is not a valid time.


<p style="color:green;"><b>Type your response here</b></p>


5- **CODE IT** Edit the regex in the above code cell to fix the problem you found with the pattern.

In [None]:

time = "12:05 PM Tuesday, September 01, 2020 (GMT+2)"

groups = regex.match(r"([0-1][0-9]|[2][0-3]):([0-5][0-9])", time)  # hours 00-23, minutes 00-59

matched = groups[0]
hour = groups[1]
minutes = groups[2]
print(hour,"h",minutes)
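
A quick way to check such a pattern is to run it against a few valid and invalid candidates. The sketch below does that with Python's standard `re` module (assumed to behave like `regex` for this pattern); the list of candidates is made up for illustration.

```python
import re

# Same pattern as in the cell above: hours 00-23, minutes 00-59.
TIME_PATTERN = re.compile(r"([0-1][0-9]|[2][0-3]):([0-5][0-9])")

for candidate in ["12:05 PM", "23:59", "25:10", "12:75", "99:99"]:
    result = TIME_PATTERN.match(candidate)
    print(candidate, "->", "valid" if result else "rejected")
```

The pattern still accepts anything after the minutes (for example `12:059`), because `match` only anchors the start of the string; a word boundary `\b` after the minutes, or `fullmatch`, would be one way to tighten it.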


---

6- **CODE IT** In the following text from "Alice in Wonderland", find where someone is quoted and extract both the speaker and their quote (as best as you can) by filling in the regex in the match function.

 Note: make sure you use the correct quotation characters (the text uses curly quotes “ ” rather than straight quotes).

In [None]:
text ='''“Ugh!” said the Lory, with a shiver.

“I beg your pardon!” said the Mouse, frowning, but very politely: “Did
you speak?”

“Not I!” said the Lory hastily.

“I thought you did,” said the Mouse. “—I proceed. ‘Edwin and Morcar,
the earls of Mercia and Northumbria, declared for him: and even
Stigand, the patriotic archbishop of Canterbury, found it advisable—’”

“Found _what_?” said the Duck.

“Found _it_,” the Mouse replied rather crossly: “of course you know
what ‘it’ means.”

“I know what ‘it’ means well enough, when _I_ find a thing,” said the
Duck: “it’s generally a frog or a worm. The question is, what did the
archbishop find?”

The Mouse did not notice this question, but hurriedly went on, “‘—found
it advisable to go with Edgar Atheling to meet William and offer him
the crown. William’s conduct at first was moderate. But the insolence
of his Normans—’ How are you getting on now, my dear?” it continued,
turning to Alice as it spoke.

“As wet as ever,” said Alice in a melancholy tone: “it doesn’t seem to
dry me at all.”

“In that case,” said the Dodo solemnly, rising to its feet, “I move
that the meeting adjourn, for the immediate adoption of more energetic
remedies—”

“Speak English!” said the Eaglet. “I don’t know the meaning of half
those long words, and, what’s more, I don’t believe you do either!” And
the Eaglet bent down its head to hide a smile: some of the other birds
tittered audibly.

“What I was going to say,” said the Dodo in an offended tone, “was,
that the best thing to get us dry would be a Caucus-race.”

“What _is_ a Caucus-race?” said Alice; not that she wanted much to
know, but the Dodo had paused as if it thought that _somebody_ ought to
speak, and no one else seemed inclined to say anything.

“Why,” said the Dodo, “the best way to explain it is to do it.” (And,
as you might like to try the thing yourself, some winter day, I will
tell you how the Dodo managed it.)'''


import regex


def findSpeakerAndQuote(sentence):
    """Return (speaker, quote) for a paragraph, or None if no quote is found."""
    print(sentence)
    #CODEIT: insert regex in match function below to find the speaker and the quotes in the above text as best as you can
    grouped = regex.match(r"(.*?) said ((the\s)?(\w+))", sentence)
    if grouped is None:
        # The paragraph does not start with "<quote> said <speaker>"
        return None
    speaker = grouped[2]
    quote = grouped[1]
    print(speaker + " : " + quote)
    print("_" * 20)
    return speaker, quote


paragraphs = text.split("\n\n")
speaker_quotes = [findSpeakerAndQuote(paragraph) for paragraph in paragraphs]
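
As one possible refinement, the pattern can name the curly quotation marks explicitly so that only the quoted part is captured. This is a sketch, not the expected solution: it uses the standard `re` module (assumed to behave like `regex` here), it only handles the "said" form, and `sample` is a shortened excerpt adapted from the text above.

```python
import re

# A quote in curly double quotes, then "said", then the speaker ("the Lory" or "Alice").
QUOTE_PATTERN = re.compile(r"“([^”]+)”\s+said\s+((?:the\s+)?\w+)")

sample = '''“Ugh!” said the Lory, with a shiver.

“As wet as ever,” said Alice in a melancholy tone: “it doesn’t seem to
dry me at all.”

“Found _it_,” the Mouse replied rather crossly.'''

for paragraph in sample.split("\n\n"):
    found = QUOTE_PATTERN.search(paragraph)
    if found:
        print(found[2], ":", found[1])
    else:
        print("no match:", paragraph)
```

The last paragraph is not matched because the speaker is introduced with "replied" rather than "said"; covering such verbs would need an alternation such as `(?:said|replied)`.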

7- **CODE IT** The following code is supposed to extract hashtags from a tweet. Fill in the regex in the `findHashtags` function so that it finds all hashtags in the tweet and returns them as a list.




In [None]:
import regex
import pandas as pd #importing pandas to work with dataframes

def findHashtags(tweet): 
    #CODEIT: Insert Regex pattern to return the list of hashtags in tweet
    
    hashtags = regex.findall(r"#\w+", tweet)  # '#' followed by one or more word characters
    
    return hashtags
    
tweets = pd.read_csv("Tweets.csv")['Tweet'].tolist()
text = tweets[0]
print(text)
hashtags=findHashtags(text)
hashtags
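
The same pattern can be tried without the dataset. The sketch below uses the standard `re` module (assumed equivalent to `regex` for this pattern) on a few made-up tweets and counts how often each hashtag occurs; `sample_tweets` and `find_hashtags` are hypothetical names for illustration.

```python
import re
from collections import Counter

# Hypothetical tweets, made up for illustration (not taken from Tweets.csv).
sample_tweets = [
    "Proud to support #CleanEnergy and #Jobs in our district!",
    "Town hall tonight at 7pm. #Jobs",
    "No hashtags in this one.",
]


def find_hashtags(tweet: str) -> list[str]:
    # '#' followed by one or more word characters (letters, digits, underscore)
    return re.findall(r"#\w+", tweet)


counts = Counter(tag for tweet in sample_tweets for tag in find_hashtags(tweet))
print(counts.most_common())  # [('#Jobs', 2), ('#CleanEnergy', 1)]
```

Counting the hashtags is essentially what a word cloud visualises: more frequent tags are drawn larger.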

If you completed the exercise above correctly, the following code will create two word clouds of the hashtags used by Republicans and Democrats, one for each party. Run the code below and wait.

In [None]:
!pip install wordcloud==1.8.0

In [None]:
import pandas as pd #importing pandas to work with dataframes


tweets = pd.read_csv("Tweets.csv")
democrats_tweets = tweets.loc[tweets['Party'] == "Democrat"]['Tweet'].tolist()
republicans_tweets = tweets.loc[tweets['Party'] == "Republican"]['Tweet'].tolist()

import regex

all_democrat_hashtags=[]
for tweet in democrats_tweets:
    all_democrat_hashtags.extend(findHashtags(tweet))
    

all_republican_hashtags=[]
for tweet in republicans_tweets:
    all_republican_hashtags.extend(findHashtags(tweet))
    


from wordcloud import WordCloud
import matplotlib.pyplot as plt

wordcloud_rep = WordCloud().generate(" ".join(all_republican_hashtags))
wordcloud_dem = WordCloud().generate(" ".join(all_democrat_hashtags))

# Display the generated images: Republicans on top, Democrats below
fig = plt.figure()
ax = fig.add_subplot(2, 1, 1)
ax.imshow(wordcloud_rep, interpolation='bilinear')
ax.axis("off")  # hide the axes of the first subplot
ax = fig.add_subplot(2, 1, 2)
ax.imshow(wordcloud_dem, interpolation='bilinear')
ax.axis("off")  # hide the axes of the second subplot
plt.margins(x=5, y=5)
plt.show()

## Create your own chatbot with Python


The following code creates a chatbot using the nltk library.

Chat with Chatty a bit:

In [None]:
from nltk.chat.util import Chat, reflections

# Each pair is [regex pattern, list of possible responses]; %1, %2 refer to the captured groups.
pairs = [
    [r"my name is (.*)", ["Hello %1, How are you today ?"]],
    [r"what is your name ?", ["My name is Chatty and I'm a chatbot ?"]],
    [r"how are you ?", ["I'm doing good\nHow about You ?"]],
    [r"sorry (.*)", ["It's alright", "It's OK, never mind"]],
    [r"i'm not( .*)? doing good", ["I'm sorry"]],
    [r"i'm (.*) doing good", ["Nice to hear that", "Alright :)"]],
    [r"(.*) (hungry|sleepy)", ["Are you saying %1 %2 ?"]],
    [r"hi|hey|hello", ["Hello", "Hey there"]],
    [r"(.*) age?", ["I'm a computer program dude\nSeriously you are asking me this?"]],
    [r"what (.*) want ?", ["Make me an offer I can't refuse"]],
    [r"how is weather in (.*)?", ["Weather in %1 is awesome like always", "Too hot man here in %1", "Too cold man here in %1", "Never even heard about %1"]],
    [r"i work in (.*)?", ["%1 is an Amazing company, I have heard about it. But they are in huge loss these days."]],
    [r"(.*)raining in (.*)", ["No rain since last week here in %2", "Damn its raining too much here in %2"]],
    [r"how (.*) health(.*)", ["I'm a computer program, so I'm always healthy "]],
    [r"quit", ["BBye take care. See you soon :) ", "It was nice talking to you. See you soon :)"]],
]

# Custom reflections (replacing the imported defaults): they swap first- and second-person words.
reflections = {
    "i am": "you are", "i was": "you were",
    "i": "you", "i'm": "you are",
    "i'd": "you would", "i've": "you have",
    "i'll": "you will", "my": "your",
    "you are": "I am", "you were": "I was",
    "you've": "I have", "you'll": "I will",
    "your": "my", "yours": "mine",
    "you": "me", "me": "you",
}


def chatty():
    # Default message at the start
    print("Hi, I'm Chatty and I chat a lot ;)\nPlease type lowercase English language to start a conversation. Type quit to leave ")
    chat = Chat(pairs, reflections)
    chat.converse()


chatty()

If you tell Chatty "i'm not doing good", there is a chance it will respond with "Nice to hear that". Try it!

8- **CODE IT** Add a pair to the `pairs` list of Chatty in the above code to fix this, so that Chatty says it is sorry when someone is not doing good.
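
The position of the new pair in the list matters. The sketch below is a deliberately simplified stand-in for the chatbot loop, not nltk's actual implementation: it assumes that the pairs are tried in list order and that the first matching pattern wins, and it always returns the first response instead of a random one. The helper `respond` and the two small pair lists are hypothetical.

```python
import re


def respond(message, pairs):
    """Return the first response of the first pair whose pattern matches.

    Simplified on purpose: no random choice of response and no reflections.
    """
    for pattern, responses in pairs:
        found = re.match(pattern, message, re.IGNORECASE)
        if found:
            reply = responses[0]
            # Fill %1, %2, ... with the text captured by the groups
            for index, group in enumerate(found.groups(), start=1):
                reply = reply.replace(f"%{index}", group or "")
            return reply
    return None


general_first = [
    [r"i'm (.*) doing good", ["Nice to hear that"]],
    [r"i'm not( .*)? doing good", ["I'm sorry"]],
]
specific_first = list(reversed(general_first))

print(respond("i'm not doing good", general_first))   # Nice to hear that
print(respond("i'm not doing good", specific_first))  # I'm sorry
print(respond("my name is Ada", [[r"my name is (.*)", ["Hello %1, How are you today ?"]]]))
```

Under this assumption, the more specific "not doing good" pattern has to come before the general one, otherwise `(.*)` captures the word "not" and the general pair answers first.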

---

9- **Homework Exercise:**
Choose one of the following chatbot scenarios and create it using the nltk library in Python.

- Movie recommendation chatbot: your chatbot should know at least 100 movies from the IMDb Movie dataset in different genres.
> Sample conversation for the Movie chatbot:

```
User: Hello chatty
Chatty: Hi there
User: How are you?
Chatty: I am doing fine, How are you?
User: Fine, Fine.
Chatty: Glad to hear.
User: Can you recommend me a movie to watch? (I want to watch a film.)(Can you recommend a movie?)(I am looking for a good movie)
Chatty: Of course, what genre do you fancy? (Do you have a specific genre in mind?)
User: Comdey?
Chatty: How about Sweet and Lowdown? It's a comedy movie directed by Woody Allen in 1999
User: I have seen this one, do you no any other comedy movies?
Chatty: How about The Angels' Share? It's a comedy movie directed by Ken Loach in 2012
User: Thanks Chatty
Chatty: You are very welcome.
User: Bye
```




- Travel Agency chatbot: use a list of cities, each with a short description of the city.
> Sample conversation for the Travel Agency chatbot:

```
User: Hello chatty
Chatty: Hi there
User: How are you?
Chatty: I am doing fine, How are you?
User: Fine, Fine.
Chatty: Glad to hear.
User: I want to go on a trip. (I am thinking of traveling.)
Chatty: How nice, where are you going?
User: I don't know yet. Where do you recommend? (Where do you suggest?)
Chatty: How about Paris? You can spend a lot of time in museums.
User: I have been there before. Anywhere else? (where else?) (where else do you know about?)
Chatty: How about Montreal? It's a beautiful city in Canada with night life.
User: Sounds good/fine/super.
Chatty: I'm happy you like my recommendation.
User: I want to leave on Wednesday 12th of August. (on Sunday, next weekend, 12/08/2020, ... (any other time format that comes to your mind...))
Chatty: Ok I'll contact the agency for a ticket on 12th of August and get back to you (fake promise!).
User: Bye
```

