## Simple retrieval-based chatbot. <br>

Start by loading the required packages.


In [49]:
#import libraries
from newspaper import Article
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
import numpy as np
import random
from nltk.tokenize import sent_tokenize, word_tokenize


In [50]:
#Download the Punkt sentence tokenizer models that nltk.sent_tokenize relies on
nltk.download('punkt', quiet=True) # Download the punkt package

True

## Getting the text source for retrieval base
<br>
Here you can add your own file or article. Once your text is a string (or an array of strings that you join or tokenize one by one), pass it to nltk.sent_tokenize

In [51]:
# Get the text-source
text = Article('https://www.metoffice.gov.uk/weather/climate-change/causes-of-climate-change')
text.download() # Download the page HTML from the URL
text.parse() # Parse the HTML and extract the article text
text.nlp() # Extract keywords and a summary (not needed for text.text below)
corpus = text.text #Store the article text into corpus

#some filtering and additional pre-processing. Only specific to that article
corpus = corpus.replace('\n\n','. ')

sentence_list = nltk.sent_tokenize(str(corpus))# txt to a list of sentences 

Parsing articles from the web can produce many short fragments, such as headlines, so we keep only the sentences that are longer than 50 characters

In [52]:
sentence_list = [x for x in sentence_list if len(x) > 50]
sentence_list[:5]

['The climate on Earth has been changing since it formed 4.5 billion years ago.',
 'Until recently, natural factors have been the cause of these changes.',
 "Natural influences on the climate include volcanic eruptions, changes in the orbit of the Earth, and shifts in the Earth's crust (known as plate tectonics).. Over the past one million years, the Earth has experienced a series of ice-ages ('glacial periods') and warmer periods ('interglacial').",
 "Glacial and interglacial periods cycle roughly every 100,000 years, caused by changes in Earth's orbit around the sun.",
 'For the past few thousand years, Earth has been in an interglacial period with a constant temperature..']
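
The doubled periods ('..') in the output above appear because every blank line is replaced with '. ', even where the paragraph already ends in a period; the third item also shows two sentences that were not split apart. Below is a minimal sketch of one way to avoid this, assuming paragraphs are separated by blank lines. `join_paragraphs` is a hypothetical helper, not part of the notebook.

```python
import re

def join_paragraphs(raw_text):
    """Join blank-line separated paragraphs with a single sentence break."""
    def replace_break(match):
        # Reuse the paragraph's own end punctuation if present, else add a period
        return (match.group(1) or ".") + " "
    return re.sub(r"([.!?])?\s*\n\s*\n\s*", replace_break, raw_text)

sample = "A full sentence.\n\nA headline without a period\n\nIs this a question?\n\nThe end."
print(join_paragraphs(sample))
```

In the notebook this would stand in for `corpus.replace('\n\n','. ')`; whether it suits other articles depends on how their paragraphs are laid out.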

In [53]:
#Function to return a random greeting response to a user's greeting
def greeting_response(text):
    #Convert the text to lowercase and trim surrounding whitespace
    text = text.lower().strip()
    #Greeting responses back to the user from the bot
    bot_greetings = ["Hi there", "Hi, ask me something", "hey", "hi", "hello"]
    #Greeting input from the user
    user_greetings = ["hi", "hello", "start", "let's go", "what's up", "hey bot"]

    #Keyword matching: multi-word greetings can only match the whole input
    if text in user_greetings:
        return random.choice(bot_greetings)
    #Single-word greetings can match any word of the input
    for word in text.split():
        if word in user_greetings:
            return random.choice(bot_greetings)
    #No greeting found
    return None
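
Because the function compares whitespace-separated words, input such as "Hi!" does not match "hi". The sketch below shows one possible variant that ignores punctuation and matches greetings on word boundaries. `is_greeting` and `GREETINGS` are hypothetical names used only for illustration, and the greeting list is a shortened sample.

```python
import string

GREETINGS = ("hi", "hello", "hey bot", "what's up")  # hypothetical sample list

def is_greeting(text):
    # Lowercase, then drop punctuation except apostrophes so "what's up?" still matches
    drop = string.punctuation.replace("'", "")
    cleaned = text.lower().translate(str.maketrans("", "", drop))
    # Pad with spaces so phrases only match whole words ("hi" must not match "this")
    padded = " " + " ".join(cleaned.split()) + " "
    return any(" " + phrase + " " in padded for phrase in GREETINGS)

for candidate in ["Hi!", "What's up?", "this is a question"]:
    print(candidate, "->", is_greeting(candidate))
```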


Edit the similarity calculation here if you want to use a different measure

In [54]:
# Generate the response
def bot_response(user_input):

    user_input = user_input.lower() #User input to lower case

    #Vectorize the known sentences plus the user's sentence (last row) without modifying sentence_list
    count_matrix = CountVectorizer(ngram_range = (1,2)).fit_transform(sentence_list + [user_input]) #Unigram and bigram counts

    #similarity function change here
    similarity_scores = cosine_similarity(count_matrix[-1], count_matrix[:-1]) #How similar the user input is to each base sentence

    flattened = similarity_scores.flatten() #from 2d array to 1d array of values
    best_index = int(np.argmax(flattened)) #index of the most similar base sentence

    #The user input is excluded from the comparison, so the bot can never echo it back
    if flattened[best_index] > 0:
        response = sentence_list[best_index]
    else:
        response = "I don't know what you mean, sorry"

    return response
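
To make the scoring step concrete, here is a minimal pure-Python sketch of what unigram and bigram counts combined with cosine similarity compute. It is an illustration, not the scikit-learn implementation: `ngram_counts` and `cosine` are hypothetical helpers, the tokenizer only approximates CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters), and the two sentences are sample data.

```python
import math
import re
from collections import Counter

def ngram_counts(text, max_n=2):
    """Count all 1..max_n word n-grams of a text."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def cosine(a, b):
    """Cosine similarity of two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

sentences = [
    "Forests remove and store carbon dioxide from the atmosphere.",
    "Glacial and interglacial periods cycle roughly every 100,000 years.",
]
query = ngram_counts("tell me about carbon dioxide")
scores = [cosine(query, ngram_counts(s)) for s in sentences]
best = max(range(len(sentences)), key=scores.__getitem__)
print(sentences[best])
```

The query shares "carbon", "dioxide" and the bigram "carbon dioxide" with the first sentence and nothing with the second, so the first sentence is selected. Swapping in another measure means replacing `cosine` here, or the `cosine_similarity` call in the notebook.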

## Bot. 
<br>
To stop it, type 'bye' or any other phrase from exit_list.
<br>
Each user input is scored for similarity against the sentences from the article, and the best-matching sentence is returned as the answer

In [58]:
#Start the chat
print("Climate TalkBot: Hi there. I will try to answer your questions about Climate Change. If you want to stop chatting just type 'bye'")
exit_list = ['exit', 'see you later','bye', 'quit', 'abort','stop']
while(True):
    user_input = input("You:")

    #strip surrounding whitespace so that 'bye ' also ends the chat
    if(user_input.lower().strip() in exit_list):
        print("Climate TalkBot: Chat with you later !")
        break

    #call greeting_response once and reuse the result
    greeting = greeting_response(user_input)
    if(greeting is not None):
        print("TalkBot: " + greeting)
    else:
        print("TalkBot: " + bot_response(user_input))
    print('++++++++++++++++++++++++++++++++++++++++++')


Climate TalkBot: Hi there. I will try to answer your questions about Climate Change. If you want to stop chatting just type 'bye'
You:Hi
TalkBot: hi
++++++++++++++++++++++++++++++++++++++++++
You:tell me about deforestation 
TalkBot: Deforestation – Forests remove and store carbon dioxide from the atmosphere.
++++++++++++++++++++++++++++++++++++++++++
You:what will happen when trees burn? 
TalkBot: Not only that, trees release the carbon they stored when we burn them.
++++++++++++++++++++++++++++++++++++++++++
You:kjcldjnsjnvjdnjo
TalkBot: I don't know what you mean, sorry
++++++++++++++++++++++++++++++++++++++++++
You:Choo choo choo
TalkBot: I don't know what you mean, sorry
++++++++++++++++++++++++++++++++++++++++++
You:exit
Climate TalkBot: Chat with you later !
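
The loop above depends on `input()`, so it is awkward to check without typing. One option is to drive the same logic from a scripted list of inputs, as in the rough sketch below. `run_chat` and `echo` are hypothetical names, and `echo` is a stand-in responder; in the notebook you would pass `bot_response` instead.

```python
def run_chat(inputs, respond, exit_words=("exit", "bye", "quit")):
    """Feed scripted inputs to a responder and collect the lines the bot would print."""
    transcript = []
    for user_input in inputs:
        # Same exit check as the interactive loop, ignoring case and surrounding spaces
        if user_input.lower().strip() in exit_words:
            transcript.append("Climate TalkBot: Chat with you later !")
            break
        transcript.append("TalkBot: " + respond(user_input))
    return transcript

def echo(text):
    # Stand-in responder used only for this illustration
    return "you said " + text

lines = run_chat(["hello", "bye", "never reached"], echo)
print("\n".join(lines))
```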
