# Corona Chatbot

### Table of contents
1. [Imports](#Imports)
2. [Input: speech recording and STT](#Input_speech_recording_and_STT)
3. [Semantic parsing](#Semantic_parsing)
4. [Data binding](#Data_binding)
5. [Output and TTS](#Output_and_TTS)
6. [Dialog manager](#Dialog_manager)

### <a id="Imports">Imports</a>

In [1]:
import requests
import eliza
import os
import sounddevice as sd
import soundfile as sf
import numpy as np
from scipy.io.wavfile import write
import json
from wendel_util import file_update #creates / updates a json file with data from Corona API
from google.cloud import speech
import io
import random
import pickle
import pyttsx3
import nltk #used for word tokenization

import spacy #used for NER - named entity recognition
nlp = spacy.load("de_core_news_lg")

from HanTa import HanoverTagger as ht #used for lemmatization; performs better than spaCy for German
hannover = ht.HanoverTagger('morphmodel_ger.pgz')

from tensorflow.keras.models import load_model

### <a id="Input_speech_recording_and_STT">Input: speech recording and STT</a>

In [2]:
#record the input question (3s) and save it as a wav file
def record_file():
    filename = 'myfile.wav'
    sr = 16000  #sample rate
    seconds = 3  #duration of recording
    data = sd.rec(int(seconds * sr), samplerate=sr, channels=1)
    sd.wait()  #wait until recording is finished
    #normalize and convert `data` to 16-bit integers (guard against an all-zero recording):
    peak = np.abs(data).max() or 1.0
    y = (np.iinfo(np.int16).max * (data / peak)).astype(np.int16)
    write(filename, sr, y)

In [3]:
credentials='dazzling-trail-316220-8e991214f1c8.json'
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]=credentials

In [4]:
#transcribe the audio file and return the transcription
def transcribe():
    filename = 'myfile.wav'
    client = speech.SpeechClient()
    with io.open(filename, "rb") as audio_file:
        content = audio_file.read()
    audio = speech.RecognitionAudio(content = content)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        language_code="de-DE",
    )
    response = client.recognize(config=config, audio=audio)
    #return the top alternative of the first result (None if nothing was recognized)
    for result in response.results:
        for alternative in result.alternatives:
            print("User: {}".format(alternative.transcript))
            return alternative.transcript

In [5]:
#record a query and return a transcription
def speech_input():
    record_file()
    text = transcribe()
    return text

### <a id="Semantic_parsing">Semantic parsing</a>

`intents.json` contains a list of dictionaries that serves as training data for recognizing the type of a query. There is one dictionary for each of the five supported topics (new cases, incidence, deaths, vaccinations and recovered). Each dictionary has three keys: <b>tag</b> - the topic of the question; <b>patterns</b> - example questions; <b>responses</b> - a unique id that is later used to select the matching data from the API.
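A minimal sketch of how such a file might be laid out, written as a Python dictionary. The tag names and example patterns are hypothetical; only the three keys, the top-level `intents` list and the string ids `"0"` to `"4"` are taken from how the notebook reads the file.

```python
import json

# Hypothetical excerpt of intents.json; tag names and patterns are illustrative only.
example_intents = {
    "intents": [
        {
            "tag": "incidence",
            "patterns": ["Wie hoch ist die Inzidenz?", "aktuelle Inzidenz für Hamburg"],
            "responses": ["1"],
        },
        {
            "tag": "deaths",
            "patterns": ["Wie viele Menschen sind gestorben?"],
            "responses": ["2"],
        },
    ]
}

def response_id(tag, intents_json):
    """Return the id stored under `responses` for the given tag, or None."""
    for intent in intents_json["intents"]:
        if intent["tag"] == tag:
            return intent["responses"][0]
    return None

# Show one entry and look up the id that belongs to a tag.
print(json.dumps(example_intents["intents"][0], ensure_ascii=False, indent=2))
print(response_id("incidence", example_intents))
```

Keeping the id as a string mirrors the comparisons such as `res == "1"` used later in the notebook.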

The bot is trained to predict the query type by `training.py`; detailed comments are in that file.
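`training.py` itself is not shown here. The following sketch illustrates, under assumptions, how example patterns are commonly turned into bag-of-words vectors and one-hot labels before a classifier is fitted. The example sentences, the tag names and the whitespace tokenizer are hypothetical stand-ins, and the model training step is left out.

```python
# Hypothetical training examples: (pattern, tag) pairs as they might come from intents.json.
examples = [
    ("wie hoch ist die inzidenz", "incidence"),
    ("wie viele menschen sind gestorben", "deaths"),
]

def tokenize(text):
    # Simplified stand-in for the notebook's tokenization and lemmatization.
    return text.lower().split()

# Vocabulary and class list, analogous to what words.pkl and classes.pkl hold.
vocabulary = sorted({token for pattern, _ in examples for token in tokenize(pattern)})
class_names = sorted({tag for _, tag in examples})

def encode(pattern, tag):
    """Return (bag-of-words vector, one-hot label) for one training example."""
    tokens = set(tokenize(pattern))
    bag = [1 if word in tokens else 0 for word in vocabulary]
    label = [1 if name == tag else 0 for name in class_names]
    return bag, label

training_data = [encode(pattern, tag) for pattern, tag in examples]
for bag, label in training_data:
    print(bag, label)
```

Each vector has one position per vocabulary word, so a query is encoded at prediction time against the same vocabulary that was used for training.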

In [6]:
intents = json.loads(open("intents.json").read())
#load words, classes and model created via training.py
words = pickle.load(open("words.pkl", "rb"))
classes = pickle.load(open("classes.pkl", "rb"))
model = load_model("chatbot_model.h5")

In [7]:
#tokenize and lemmatize input
def clean_up_sentence(text):
    sentence_words = nltk.word_tokenize(text)
    sentence_words = [hannover.analyze(word)[0] for word in sentence_words]
    return sentence_words

In [8]:
#convert input into a bag of words
def bag_of_words(text): 
    sentence_words = clean_up_sentence(text)
    bag = [0] * len(words)
    for w in sentence_words:
        for i, word in enumerate(words):
            if word == w:
                bag[i] = 1
    return np.array(bag)

In [9]:
#predict class based on the input text
def predict_class(text):
    bow = bag_of_words(text) #create a bag of words
    res = model.predict(np.array([bow]))[0] #predict the result based on the bag of words
    ERROR_THRESHOLD = 0.85 #minimum probability a class needs in order to be accepted
    results = [[i, r] for i, r in enumerate(res) if r > ERROR_THRESHOLD] #keep [class index, probability] pairs above the threshold

    results.sort(key=lambda x: x[1], reverse=True) #sort by probability in reverse order
    return_list = []
    for r in results:
        return_list.append({"intent": classes[r[0]], "probability": str(r[1])}) #return a list with categories and their probabilities
        print("Prediction accuracy: " + str(r[1]))
    return return_list

In [10]:
#get a unique id number of the predicted class
def get_response(intents_list, intents_json):
    tag = intents_list[0]["intent"]
    list_of_intents = intents_json["intents"]
    for i in list_of_intents:
        if i["tag"] == tag:
            result = i["responses"][0]
            break
    return result

In [11]:
#combine the two previous functions: query to predicted class id
def query(text):
    ints = predict_class(text)
    res = get_response(ints, intents)
    return res

In [12]:
#search for location names in the input text using spaCy
def ner(text):
    doc = nlp(text)
    if doc.ents:
        for ent in doc.ents:
            if ent.label_ == "LOC":
                ner = str(ent.text)
                return ner

In [13]:
phrases = {'hello':'Willkommen bei der Corona Impfauskunft. Was möchtest du wissen?', 
    'goodbye':'Vielen Dank für das nette Gespräch. Bis zum nächsten Mal!'}

continue_phrases = ['Hast du weitere Fragen?', 'Kann ich dir noch behilflich sein?', 'Wie kann ich dir noch helfen']

farewells = {"fertig":"done", "tschüss":"done", "danke":"done", "auf wiedersehen":"done", "nein":"done"}

states_shortened = {"Baden-Württemberg":"BW","Bayern":"BY","Berlin":"BE","Brandenburg":"BB",
                   "Bremen":"HB","Hamburg":"HH","Hessen":"HE","Mecklenburg-Vorpommern":"MV",
                   "Niedersachsen":"NI","Nordrhein-Westfalen":"NW","Rheinland-Pfalz":"RP",
                   "Saarland":"SL","Sachsen":"SN","Sachsen-Anhalt":"ST","Schleswig-Holstein":"SH",
                   "Thüringen":"TH","Deutschland":"DE"}

states_full = {"BW":"Baden-Württemberg","BY":"Bayern","BE":"Berlin","BB":"Brandenburg",
              "HB":"Bremen","HH":"Hamburg","HE":"Hessen","MV":"Mecklenburg-Vorpommern",
              "NI":"Niedersachsen","NW":"Nordrhein-Westfalen","RP":"Rheinland-Pfalz",
              "SL":"Saarland","SN":"Sachsen","ST":"Sachsen-Anhalt","SH":"Schleswig-Holstein",
              "TH":"Thüringen","DE":"Deutschland"}

vaccines_d = {'biontech':'biontech', 'bajonczak':'biontech', 
              'moderna':'moderna', 'moderner':'moderna', "moderne":"moderna",
              'janssen':'janssen', 'johnson':'janssen',
              'astraZeneca':'astraZeneca', 'astra':'astraZeneca', 'zeneca':'astraZeneca'}

vaccine_names = {'biontech':'Biontech', 'moderna':'Moderna', 'janssen':'Janssen',
              'astraZeneca':'AstraZeneca'}

In [14]:
#check if the sentence contains any location name or vaccine name
def semantic(input_s, ner):
    semantics = {'state':'', "unknown_location":"", 'vaccine':'', 'answer':0}
    if ner in states_shortened: #known state (or "Deutschland")
        semantics['state'] = states_shortened[ner]
    elif ner: #a location was found, but it is not a German state
        semantics['unknown_location'] = ner
    text = input_s.lower() #match vaccine names case-insensitively
    for key in vaccines_d.keys():
        if key.lower() in text:
            semantics['vaccine'] = vaccines_d[key]
    return semantics

In [15]:
#initialize Eliza in German
def init_eliza():
    elz = eliza.Eliza()
    elz.load("deutsch.txt")
    return elz

### <a id="Data_binding">Data binding</a>

`wendel_util.py` was updated so that it can create / update a json file with data for any endpoint from the Corona API.
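`wendel_util.py` is not reproduced here, so the following is only a sketch of what a create-or-update helper of this kind might look like. The function name `update_json_file`, the `fetch` callable, the one-hour freshness window and the returned status strings are all assumptions made for illustration, not the actual implementation of `file_update`.

```python
import json
import os
import tempfile
import time

def update_json_file(module, fetch, directory=".", max_age_seconds=3600):
    """Create or refresh `<module>.json`; skip the download if the file is fresh.

    `fetch` is a hypothetical callable that takes the module name and returns
    the decoded JSON payload (for example by calling an HTTP API).
    """
    path = os.path.join(directory, module + ".json")
    if os.path.exists(path):
        age = time.time() - os.path.getmtime(path)
        if age < max_age_seconds:
            return "Up To Date"
    payload = fetch(module)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f)
    return "Updated"

def fake_fetch(module):
    # Stand-in for a real API request so the sketch runs offline.
    return {"module": module, "weekIncidence": 0.0}

with tempfile.TemporaryDirectory() as tmp:
    print(update_json_file("germany", fake_fetch, directory=tmp))  # file is created
    print(update_json_file("germany", fake_fetch, directory=tmp))  # file is still fresh
```

Passing the download function in as a parameter keeps the file handling testable without network access.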

In [16]:
#create/update a file with data for Germany
module = 'germany'
file_update(module)
with open(module + '.json') as f:
    data_germany = json.load(f)

Up To Date


In [17]:
#create/update a file with data regarding vaccinations
module = 'vaccinations'
file_update(module)
with open(module + '.json') as f:
    data_vaccinations = json.load(f)

Up To Date


In [18]:
#create/update a file with data for single German states
module = 'states'
file_update(module)
with open(module + '.json') as f:
    data_states = json.load(f)

Up To Date


In [19]:
#get data from the API based on the predicted class and given information (location, vaccine)
def data(semantics, res):
    s = semantics['state']
    u = semantics["unknown_location"]
    v = semantics['vaccine']
    if s: #state given
        if s != "DE": #and state does not equal "Germany"
            if res == "0": #new cases
                semantics['answer'] = data_states["data"][s]["delta"]["cases"]
            if res == "1": #incidence
                semantics['answer'] = data_states["data"][s]["weekIncidence"]
            if res == "2": #deaths
                semantics['answer'] = data_states["data"][s]["deaths"]
            if res == "3":  #vaccinated
                if v: #vaccine given
                    semantics['answer'] = data_vaccinations["data"]["states"][s]['vaccination'][v]
                else: #no vaccine given
                    semantics['answer'] = data_vaccinations["data"]["states"][s]["vaccinated"]
            if res == "4": #recovered
                semantics['answer'] = data_states["data"][s]["recovered"]
        if s == "DE": #and equals "Germany"
            if res == "0": #new cases
                semantics['answer'] = data_germany["delta"]["cases"]
            if res == "1": #incidence
                semantics['answer'] = data_germany["weekIncidence"]
            if res == "2": #deaths
                semantics['answer'] = data_germany["deaths"]
            if res == "3":
                if v: #vaccine given
                    semantics['answer'] = data_vaccinations["data"]['vaccination'][v]
                else: # no vaccine given
                    semantics['answer'] = data_vaccinations["data"]["vaccinated"]
            if res == "4": #recovered
                semantics['answer'] = data_germany["recovered"]
    elif u: #query about unknown location
        semantics['answer'] = 0
    else: #no state given
        if res == "0": #new cases
            semantics['answer'] = data_germany["delta"]["cases"]
        if res == "1": #incidence
            semantics['answer'] = data_germany["weekIncidence"]
        if res == "2": #deaths
            semantics['answer'] = data_germany["deaths"]
        if res == "3":
            if v: #vaccine given
                semantics['answer'] = data_vaccinations["data"]['vaccination'][v]
            else: # no vaccine given
                semantics['answer'] = data_vaccinations["data"]["vaccinated"]
        if res == "4": #recovered
            semantics['answer'] = data_germany["recovered"]
    return semantics

### <a id="Output_and_TTS">Output and TTS</a>

In [20]:
#prepare a final answer
def output(semantics, res):
    ret = ''
    s = semantics['state']
    u = semantics["unknown_location"]
    v = semantics['vaccine']
    a = semantics['answer']
    if s: #state given
        s = states_full[s]
        if res == "0": #new cases
            ret = 'Die Anzahl von Neuinfektionen von gestern für {} ist {}'.format(s, a)
        if res == "1": #incidence
            ret = 'Die aktuelle Sieben-Tage-Inzidenz für {} beträgt {:.2f}'.format(s, a)
        if res == "2": #deaths
            ret = 'Die aktuelle Todesanzahl für {} beträgt {}'.format(s, a)
        if res == "3": #vaccinated
            if v: #vaccine given
                v = vaccine_names[v]
                ret = 'Die Impfungen für {} mit {} sind {}'.format(s, v, a)
            else: #no vaccine given
                ret = 'Die Impfungen für {} sind {}'.format(s, a)
        if res == "4": #recovered
            ret = 'Die Anzahl von Genesenen für {} beträgt {}'.format(s, a)
    elif u: #query about unknown location
        ret = 'Die Informationen über {} habe ich leider nicht. Besuche die Seite von der Weltgesundheitsorganisation unter www.who.int oder suche nach der Antwort in Google. Viel Erfolg!'.format(u)
    else: # no state
        if res == "0": #new cases
            ret = 'Die Anzahl von Neuinfektionen von gestern für Deutschland ist {}'.format(a)
        if res == "1": #incidence
            ret = 'Die aktuelle Sieben-Tage-Inzidenz für Deutschland beträgt {:.2f}'.format(a)
        if res == "2": #deaths
            ret = 'Die aktuelle Todesanzahl für Deutschland beträgt {}'.format(a)
        if res == "3": #vaccinated
            if v: #vaccine given
                v = vaccine_names[v]
                ret = 'Die Impfungen für Deutschland mit {} sind {}'.format(v, a)
            else: #no vaccine given
                ret = 'Die Impfungen für Deutschland sind {}'.format(a)
        if res == "4": #recovered
            ret = 'Die Anzahl von Genesenen für Deutschland beträgt {}'.format(a)
    return ret

In [21]:
#convert text answer into speech and print it
def tts(text):
    engine = pyttsx3.init()
    engine.setProperty('voice', 'german')
    engine.setProperty('rate', 300)
    engine.say(text)
    engine.runAndWait()
    print("Bot: " + text)

### <a id="Dialog_manager">Dialog manager</a>

In [22]:
#this function manages the whole chatbot
#if the bot recognizes one of the five query types with a predicted probability above 85%, it gives an answer
#otherwise Eliza is switched on
def dialogmanager():
    tts(phrases['hello'])
    input_s = speech_input()
    while input_s and input_s.lower() not in farewells.keys():
        try:
            res = query(input_s) # predict the type of query
            location = ner(input_s) # searching for location names in the input string
            semantics = semantic(input_s, location)
            semantics = data(semantics, res)
            out_string = output(semantics, res)
            tts(out_string)
            tts(random.choice(continue_phrases))
            input_s = speech_input()
        except Exception: #any failure (e.g. no intent above the threshold) falls back to Eliza
            tts(elz.respond(input_s))
            tts(random.choice(continue_phrases))
            input_s = speech_input()
    tts(phrases['goodbye'])

In [44]:
#initialize dialogmanager and eliza
elz = init_eliza()
dialogmanager()

Bot: Willkommen bei der Corona Impfauskunft. Was möchtest du wissen?
User: aktuelle Inzidenz für Hamburg
Prediction accuracy: 0.999938
Bot: Die aktuelle Sieben-Tage-Inzidenz für Hamburg beträgt 82.32
Bot: Kann ich dir noch behilflich sein?
User: wie viele Menschen haben Corona überstanden
Prediction accuracy: 0.8507149
Bot: Die Anzahl von Genesenen für Deutschland beträgt 3731886
Bot: Kann ich dir noch behilflich sein?
User: wie viele Menschen sind in Bremen gestorben
Prediction accuracy: 0.9998779
Bot: Die aktuelle Todesanzahl für Bremen beträgt 498
Bot: Hast du weitere Fragen?
User: Menschen sind in Berlin mit biontech geimpft
Prediction accuracy: 0.99999976
Bot: Die Impfungen für Berlin mit Biontech sind 1663647
Bot: Wie kann ich dir noch helfen
User: neue Fälle wurden in Niedersachsen gemeldet
Prediction accuracy: 0.99999905
Bot: Die Anzahl von Neuinfektionen von gestern für Niedersachsen ist 308
Bot: Kann ich dir noch behilflich sein?
User: ist die aktuelle Inzidenz für London
Pre