<a href="https://colab.research.google.com/github/shrutikamokashi/Covid19_Projects/blob/master/Covid_19_Chatbot_updt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic Covid-19 Chatbot & Spread of the Coronavirus in the United States

* This notebook tracks and analyses the spread of the coronavirus worldwide, with a focus on the US.


* Further down, we build a chatbot on top of several data sources.



Contents:
*  Data preprocessing
*  Data visualization with Plotly and folium to track and analyze the spread of the virus
*  A chatbot that draws on data from various sources to respond to the user.

## Spread of the coronavirus across the United States

Install the packages below to use them in this notebook.

In [None]:
#!pip install folium --user
#!pip install keras==2.3.0

# On Windows, install the two packages below exactly as pinned.
#!pip install tensorflow==1.15
#!pip install tensorflow-gpu==1.15

# On any other OS, the installation below is sufficient.
#!pip install tensorflow

Import the libraries needed for the analysis and computation:


*   pandas: to work with dataframes.
*   numpy: for scientific computing with Python.
*   KMeans: an unsupervised ML algorithm that groups data into clusters without needing labeled training data.
*   MinMaxScaler: to transform features by scaling each feature to a given range.
*   folium: to visualize geospatial data using latitude and longitude (provided in world-countries.json).
*   graph_objects: to create interactive web-based visualizations with Plotly.
*   make_subplots: returns an instance of plotly.graph_objs.Figure with the subplots domain set in 'layout'.
*   warnings: to suppress unwanted warnings.

In [None]:
import pandas as pd 
import numpy as np 
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
import folium
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import warnings
warnings.filterwarnings("ignore")
mms = MinMaxScaler()

Data Sources:


*   world-countries.json is taken from : https://www.kaggle.com/ktochylin/world-countries

This dataset contains countries with their latitude and longitude.

*   Countries_usefulFeatures.csv is taken from : https://www.kaggle.com/ishivinal/covid19-useful-features-by-country

This dataset contains various features of countries, such as population, tourism, date of first death, date of first confirmed case, mean age, type of lockdown, etc.

*   train_w4.csv is taken from : https://www.kaggle.com/c/covid19-global-forecasting-week-4

This dataset contains per-date features of countries, such as confirmed cases and deaths.

*   train_w5.csv is taken from : https://www.kaggle.com/c/covid19-global-forecasting-week-5

This dataset contains per-date features of countries, such as population, weight, and a target (ConfirmedCases or Fatalities) with its value.

*   tested_worldwide.csv is taken from : https://www.kaggle.com/lin0li/covid19testing

This dataset contains per-date features of countries, such as positive cases, active cases, recovered cases, deaths, daily tested, etc.


Read all the files mentioned above into dataframes.
train_w5 is the most recent file; it was updated on 10th May 2020.

Create a dataframe named 'max_d_test' that contains the maximum daily tested value for every country.

In [None]:
Cntry_uf = pd.read_csv("Countries_usefulFeatures.csv")
train_w4 = pd.read_csv("train_w4.csv")
train_w5 = pd.read_csv("train_w5.csv")
country_geo = "world-countries.json"
testing = pd.read_csv("tested_worldwide.csv")
max_d_test=testing.groupby(["Country_Region"]).agg({"daily_tested":"max"}).reset_index()


*   The dictionary 'r' below lists replacement names for a few countries.
*   Apply 'r' to the Country_Region column of max_d_test.



In [None]:
r = {'Czech Republic': 'Czechia','DR Congo': 'Congo (Brazzaville)','Democratic Republic of the Congo': 'Congo (Kinshasa)','Ivory Coast': "Cote d'Ivoire",'Palestine': 'West Bank and Gaza','South Korea': 'Korea, South','Taiwan': 'Taiwan*','United States': 'US',}
max_d_test.Country_Region=max_d_test.Country_Region.replace(to_replace=r)
Cntry_uf.columns

Index(['Country_Region', 'Population_Size', 'Tourism', 'Date_FirstFatality',
       'Date_FirstConfirmedCase', 'Latitude', 'Longtitude', 'Mean_Age',
       'Lockdown_Date', 'Lockdown_Type', 'Country_Code'],
      dtype='object')

* Copy the Countries_usefulFeatures.csv data (Cntry_uf) to a dataframe named 'df_cluster'.
* Keep only 'Country_Region' and these 7 feature columns in df_cluster: 'Population_Size', 'Tourism', 'Date_FirstFatality', 'Date_FirstConfirmedCase', 'Latitude', 'Longtitude', 'Mean_Age'.
* Check for null values, if any.

In [None]:
df_cluster = Cntry_uf.copy()
df_cluster = df_cluster[["Country_Region","Population_Size","Tourism","Date_FirstFatality","Date_FirstConfirmedCase","Latitude","Longtitude","Mean_Age"]]
df_cluster.isnull().sum()

Country_Region              0
Population_Size             0
Tourism                     0
Date_FirstFatality         28
Date_FirstConfirmedCase     0
Latitude                    0
Longtitude                  0
Mean_Age                    0
dtype: int64

* Replace null values in the Date_FirstFatality column of df_cluster with '2222-11-11'. Then convert the column to a pandas datetime, and convert that datetime to an integer.
* Convert the Date_FirstConfirmedCase column to a pandas datetime, and then to an integer.
* Drop the column named "Country_Region" from df_cluster.
* Fit the scaler to compute the per-feature minimum and maximum used for scaling.
* Scale the features of df_cluster to the scaler's feature_range and store the result in data_transformed.

In [None]:
df_cluster.Date_FirstFatality.fillna("2222-11-11",inplace=True)
df_cluster.Date_FirstFatality=pd.to_datetime(df_cluster.Date_FirstFatality)
df_cluster.Date_FirstFatality = df_cluster.Date_FirstFatality.astype(np.int64)
df_cluster.Date_FirstConfirmedCase=pd.to_datetime(df_cluster.Date_FirstConfirmedCase)
df_cluster.Date_FirstConfirmedCase = df_cluster.Date_FirstConfirmedCase.astype(np.int64)
df_cluster.drop(["Country_Region"],axis=1,inplace=True)
mms.fit(df_cluster)
data_transformed = mms.transform(df_cluster)
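
As a rough illustration of what the scaling step does, the sketch below applies the min-max formula x' = (x - min) / (max - min) column by column to a small made-up array. The array values and the helper name `min_max_scale` are assumptions for illustration only; `MinMaxScaler` with its default `feature_range=(0, 1)` should give the same result on this input.

```python
import numpy as np

def min_max_scale(data):
    """Scale each column of a 2-D array to the range [0, 1]."""
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    return (data - col_min) / (col_max - col_min)

# Hypothetical features: population size and mean age for three countries
toy = np.array([[1_000_000.0, 20.0],
                [5_000_000.0, 30.0],
                [9_000_000.0, 40.0]])
scaled = min_max_scale(toy)
print(scaled)
```

Scaling matters here because KMeans works on distances: without it, a large-valued column such as population (or the integer dates) would dominate the clustering.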

* Set the range of 'K' from 1 to 89 (range(1,90)).
* For every 'K' in this range:
* Form clusters, with the number of clusters equal to 'K' at that iteration.
* Fit a model and store the value of its inertia_ property (the within-cluster sum of squared distances) in Sum_of_sd.

In [None]:
Sum_of_sd = []
K = range(1,90)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(data_transformed)
    Sum_of_sd.append(km.inertia_)
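
The notebook does not show how a cluster count is read off Sum_of_sd. One common approach is the elbow heuristic: look for the 'K' after which adding a cluster barely lowers the inertia. The sketch below shows the idea on made-up data with three obvious groups; the data, the range of 'K' and the variable names are assumptions, not the notebook's own procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three well-separated 2-D blobs of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(km.inertia_)

# Drop in inertia gained by each extra cluster; it shrinks sharply after the elbow
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
print(inertias)
print(drops)
```

On this toy data the drop from 2 to 3 clusters is large and the drop from 3 to 4 is tiny, which points to 3 clusters. Real data rarely shows such a clean elbow, so the choice usually involves some judgment.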

* Next, we categorize the data using the number of clusters (55) chosen from the last step.
* Fit a single KMeans model on data_transformed.
* Save the label of each point in the "cluster" column of the dataframe "Cntry_uf".
* Drop the "Province_State" column of the train_w4 dataframe.
* Then merge train_w4 and Cntry_uf on the column "Country_Region" and save the result in train_w4 itself.



In [None]:
km = KMeans(n_clusters=55,random_state=1995)
km = km.fit(data_transformed)
Cntry_uf["cluster"] = km.labels_
train_w4.drop("Province_State",axis=1,inplace=True)
train_w4 = pd.merge(train_w4,Cntry_uf,on='Country_Region',how="left")

* Create a dataframe named sel_data from train_w4, grouped by the columns "Country_Region" & "Date", holding the sums of "ConfirmedCases" & "Fatalities".
* Give column names to the columns of sel_data.
* Drop rows of train_w4 that duplicate a "Country_Region" & "Date" pair.
* Then merge train_w4 and sel_data on the columns "Country_Region" & "Date" and save the result in train_w4 itself.
* Drop the original "ConfirmedCases" & "Fatalities" columns of train_w4.
* Then rename the columns given below:
> * "ConfirmedCases_i":"ConfirmedCases"
> * "Fatalities_i":"Fatalities"
> * "clusters":"cluster"



In [None]:
sel_data = train_w4.groupby(["Country_Region","Date"]).agg({"ConfirmedCases":"sum","Fatalities":"sum"}).reset_index()
sel_data.columns = ["Country_Region","Date","ConfirmedCases_i","Fatalities_i"]
train_w4.drop_duplicates(["Country_Region","Date"],inplace=True)
train_w4 = pd.merge(train_w4,sel_data,on=['Country_Region',"Date"],how="left")
train_w4.drop(["ConfirmedCases","Fatalities"],axis=1,inplace=True)
train_w4.rename(columns={"ConfirmedCases_i":"ConfirmedCases","Fatalities_i":"Fatalities","clusters":"cluster"},inplace=True)
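
The effect of the steps above is easier to see on a tiny example. The sketch below uses a made-up frame in which country "A" has two province rows for the same date; the frame and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical province-level rows: country "A" appears twice for the same date
toy = pd.DataFrame({
    "Country_Region": ["A", "A", "B"],
    "Date": ["2020-05-01", "2020-05-01", "2020-05-01"],
    "ConfirmedCases": [10.0, 5.0, 7.0],
    "Fatalities": [1.0, 0.0, 2.0],
})

# Country-level totals per date
totals = (toy.groupby(["Country_Region", "Date"])
             .agg({"ConfirmedCases": "sum", "Fatalities": "sum"})
             .reset_index())
totals.columns = ["Country_Region", "Date", "ConfirmedCases_i", "Fatalities_i"]

# Keep one row per country and date, then swap in the totals
dedup = toy.drop_duplicates(["Country_Region", "Date"])
merged = pd.merge(dedup, totals, on=["Country_Region", "Date"], how="left")
merged = merged.drop(["ConfirmedCases", "Fatalities"], axis=1)
merged = merged.rename(columns={"ConfirmedCases_i": "ConfirmedCases",
                                "Fatalities_i": "Fatalities"})
print(merged)
```

The result has one row per country and date, and country "A" carries the sum of its two province rows (15 confirmed cases, 1 fatality).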

* Replace null values in the Date_FirstFatality column of train_w4 with '2222-11-11'.
* Then check train_w4 for any remaining null values.

In [None]:
train_w4.Date_FirstFatality.fillna("2222-11-11",inplace=True)
train_w4.isnull().sum()

Id                            0
Country_Region                0
Date                          0
Population_Size               0
Tourism                       0
Date_FirstFatality            0
Date_FirstConfirmedCase       0
Latitude                      0
Longtitude                    0
Mean_Age                      0
Lockdown_Date              3630
Lockdown_Type              3630
Country_Code                  0
cluster                       0
ConfirmedCases                0
Fatalities                    0
dtype: int64

* In the cell below, calculate 'Fatalities_rate' by dividing the Fatalities column of train_w4 by the ConfirmedCases column and multiplying by 100.
* Calculate the difference between the lockdown date and the first confirmed case date by subtracting the Date_FirstConfirmedCase column of train_w4 from the Lockdown_Date column. Save it to the column "diff_FC_LD".

In [None]:
train_w4["Fatalities_rate"]=(train_w4.Fatalities * 100) / train_w4.ConfirmedCases
#Difference between Lockdown Date and First Confirmed Case Date
train_w4["diff_FC_LD"]=(train_w4.Lockdown_Date.astype('datetime64') - train_w4.Date_FirstConfirmedCase.astype('datetime64'))
#Difference between Lockdown Date and First Confirmed Fatality
train_w4["diff_FF_LD"]=(train_w4.Lockdown_Date.astype('datetime64') - train_w4.Date_FirstFatality.astype('datetime64'))
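
A small sketch of the two calculations, on made-up rows, shows where missing rates come from: a row with zero confirmed cases divides 0 by 0 and yields NaN. The second row reuses the figures of one Zimbabwe row from train_w4 (34 cases, 4 deaths); the frame itself is an assumption for illustration.

```python
import pandas as pd

# Hypothetical rows: no cases yet, then 34 cases with 4 deaths
toy = pd.DataFrame({
    "ConfirmedCases": [0.0, 34.0],
    "Fatalities": [0.0, 4.0],
    "Lockdown_Date": ["2020-03-30", "2020-03-30"],
    "Date_FirstConfirmedCase": ["2020-03-21", "2020-03-21"],
})

# Fatalities per 100 confirmed cases; 0/0 gives NaN
toy["Fatalities_rate"] = (toy.Fatalities * 100) / toy.ConfirmedCases
# Days between first confirmed case and lockdown, as a timedelta
toy["diff_FC_LD"] = (pd.to_datetime(toy.Lockdown_Date)
                     - pd.to_datetime(toy.Date_FirstConfirmedCase))
print(toy[["Fatalities_rate", "diff_FC_LD"]])
```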

* The cell above also calculates the difference between the lockdown date and the first fatality date by subtracting the Date_FirstFatality column of train_w4 from the Lockdown_Date column. It is saved to the column "diff_FF_LD".

In the cell below, we work with train_w5.

* Create a dataframe named tempd2 from train_w5, using the columns "Country_Region" & "Weight" of the rows whose Target is "ConfirmedCases".
* Create a column named "Weight_F" holding the "Weight" values of the train_w5 rows whose Target is "Fatalities".
* Give column names to the columns of tempd2.
* Drop duplicates in the "Country_Region" column of tempd2.
* Then merge train_w4 with the columns "Country_Region", "Weight_C" & "Weight_F" of tempd2 on "Country_Region" and save the result in train_w4 itself.
* Then merge train_w4 with the columns "Country_Region" & "daily_tested" of max_d_test on "Country_Region" and save the result in train_w4 itself.
* Replace null values in the daily_tested column of train_w4 with the smallest daily_tested value that is greater than 100.

In [None]:
tempd2=train_w5.loc[(train_w5.Target=="ConfirmedCases"),["Country_Region","Weight"]]
tempd2["Weight_F"] =  train_w5.loc[(train_w5.Target=="Fatalities"),["Weight"]].values
tempd2.columns = ["Country_Region","Weight_C","Weight_F"]
tempd2=tempd2.drop_duplicates(["Country_Region"])
train_w4 = pd.merge(train_w4,tempd2[["Country_Region","Weight_C","Weight_F"]],on=['Country_Region'],how="left")
train_w4 = pd.merge(train_w4,max_d_test[["Country_Region","daily_tested"]],on=['Country_Region'],how="left")
train_w4.daily_tested.fillna(train_w4.loc[train_w4.daily_tested>100,"daily_tested"].min(),inplace=True)
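
The cell above pairs the two weights by position (through `.values`), which relies on the ConfirmedCases and Fatalities rows of train_w5 appearing in the same order. As a hedged alternative, the sketch below pairs them by key with `pivot_table` on a made-up long-format frame; the frame and the choice of `aggfunc="first"` are assumptions, not the notebook's method.

```python
import pandas as pd

# Hypothetical long-format rows in the style of train_w5
long_df = pd.DataFrame({
    "Country_Region": ["A", "A", "B", "B"],
    "Target": ["ConfirmedCases", "Fatalities", "ConfirmedCases", "Fatalities"],
    "Weight": [0.05, 0.5, 0.06, 0.6],
})

# One row per country, one weight column per target
wide = (long_df.pivot_table(index="Country_Region", columns="Target",
                            values="Weight", aggfunc="first")
               .rename(columns={"ConfirmedCases": "Weight_C",
                                "Fatalities": "Weight_F"})
               .reset_index())
print(wide)
```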

In [None]:
train_w4

Unnamed: 0,Id,Country_Region,Date,Population_Size,Tourism,Date_FirstFatality,Date_FirstConfirmedCase,Latitude,Longtitude,Mean_Age,Lockdown_Date,Lockdown_Type,Country_Code,cluster,ConfirmedCases,Fatalities,Fatalities_rate,diff_FC_LD,diff_FF_LD,Weight_C,Weight_F,daily_tested
0,1,Afghanistan,2020-01-22,37172386,14000,2020-03-23,2020-02-25,33.939110,67.709953,17.3,2020-03-24,Full,AFG,31,0.0,0.0,,28 days,1 days,0.058359,0.583587,104.0
1,2,Afghanistan,2020-01-23,37172386,14000,2020-03-23,2020-02-25,33.939110,67.709953,17.3,2020-03-24,Full,AFG,31,0.0,0.0,,28 days,1 days,0.058359,0.583587,104.0
2,3,Afghanistan,2020-01-24,37172386,14000,2020-03-23,2020-02-25,33.939110,67.709953,17.3,2020-03-24,Full,AFG,31,0.0,0.0,,28 days,1 days,0.058359,0.583587,104.0
3,4,Afghanistan,2020-01-25,37172386,14000,2020-03-23,2020-02-25,33.939110,67.709953,17.3,2020-03-24,Full,AFG,31,0.0,0.0,,28 days,1 days,0.058359,0.583587,104.0
4,5,Afghanistan,2020-01-26,37172386,14000,2020-03-23,2020-02-25,33.939110,67.709953,17.3,2020-03-24,Full,AFG,31,0.0,0.0,,28 days,1 days,0.058359,0.583587,104.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20345,35674,Zimbabwe,2020-05-06,14439018,2580000,2020-03-24,2020-03-21,-19.015438,29.154857,19.0,2020-03-30,Full,ZWE,23,34.0,4.0,11.764706,9 days,6 days,0.060711,0.607106,104.0
20346,35675,Zimbabwe,2020-05-07,14439018,2580000,2020-03-24,2020-03-21,-19.015438,29.154857,19.0,2020-03-30,Full,ZWE,23,34.0,4.0,11.764706,9 days,6 days,0.060711,0.607106,104.0
20347,35676,Zimbabwe,2020-05-08,14439018,2580000,2020-03-24,2020-03-21,-19.015438,29.154857,19.0,2020-03-30,Full,ZWE,23,34.0,4.0,11.764706,9 days,6 days,0.060711,0.607106,104.0
20348,35677,Zimbabwe,2020-05-09,14439018,2580000,2020-03-24,2020-03-21,-19.015438,29.154857,19.0,2020-03-30,Full,ZWE,23,35.0,4.0,11.428571,9 days,6 days,0.060711,0.607106,104.0


* Drop the columns County & Province_State from train_w5 and set its index to the column 'Id'.
* Create two datasets, df1 & df2. df1 has all the rows of train_w5 where Target = 'ConfirmedCases', and df2 has all the rows where Target is not 'ConfirmedCases'.
* Set the index of df1 & df2 to the column 'Date'.
* Create df3 by concatenating df1 & df2 side by side. Drop the columns with indices 3, 5, 6, 7, 8.
* Rename the columns of df3.


In [None]:
train_w5.drop('County',axis=1,inplace=True)
train_w5.drop('Province_State',axis=1,inplace=True)
train_w5.set_index('Id',inplace=True)

#ANALYSING TRENDS IN US
df1=train_w5[train_w5['Target']=='ConfirmedCases']
df2=train_w5[train_w5['Target']!='ConfirmedCases']
df1.set_index('Date',inplace=True)
df2.set_index('Date',inplace=True)
df3=pd.concat([df1,df2],axis=1,ignore_index=True)
df3.drop([3,5,6,7,8],inplace=True,axis=1)
df3.rename(columns={0:'Country_Region',1:'Population',2:'Weight',4:'Confirmed',9:'Fatalities'},inplace=True)


* Create the dataframe df_US from df3 with the data for the United States only, and then reset the index.
* Create the dataframe final_US from df_US by grouping on 'Date' and summing the columns 'Population','Weight','Confirmed','Fatalities', and then reset the index.

In the cell below, we plot the spread of the coronavirus over time in the US.
Using make_subplots, we divide the figure into two plots.
The first plot shows confirmed cases vs. date in the US.

The second plot shows deaths vs. date in the US.

In [None]:
df_US = df3[df3['Country_Region'] == "US"].reset_index()
# Select the numeric columns with a list; 'Date' comes back as a column via reset_index()
final_US = df_US.groupby('Date')[['Population','Weight','Confirmed','Fatalities']].sum().reset_index()
figure = make_subplots(rows = 1, cols = 2, subplot_titles = ("Confirmed","Fatalities"))
a1 = go.Scatter(x=final_US['Date'],y=final_US['Confirmed'], name = "Confirmed", line_color = 'firebrick', mode = 'lines+markers')
a2 = go.Scatter(x=final_US['Date'],y=final_US['Fatalities'], name = "Deaths", line_color = 'green', mode = 'lines+markers')
figure.append_trace(a1, 1, 1)
figure.append_trace(a2, 1, 2)
figure.update_layout(template="plotly",title_text = 'Spread of Corona Virus over time in US')
figure.show()

In the cell below, we extract the list of dates in the US after sorting final_US on the 'Fatalities' column, which gives us the dates with the fewest and the most deaths in the United States.

We also create a dataframe named 'tempdf1' holding the rows of train_w4 for the latest date in its Date column.

In [None]:
fatalities_US = []
for x in final_US.sort_values('Fatalities')['Date']:
    fatalities_US.append(x)
tempdf1 =train_w4[(train_w4.Date == max(train_w4.Date)) ]

In the map below, we use folium to plot Fatalities_rate. Every country is shown as a circle, and hovering over it displays the following:
* Country name
* Confirmed cases
* Fatality rate
* Fatalities
* Lockdown date
* Date of first confirmed case
* Mean age

The radius of the circle depends on Fatalities_rate.

In [None]:
m = folium.Map(location=[0, 0], tiles='cartodbpositron',min_zoom=1, max_zoom=8, zoom_start=1.5)

for i in range(0, len(tempdf1)):
    folium.Circle(
        location=[tempdf1.iloc[i]['Latitude'], tempdf1.iloc[i]['Longtitude']],
        color='blue', fill='blue',
        tooltip =   '<li><bold>Country : '+str(tempdf1.iloc[i]['Country_Region'])+
                    '<li><bold>Confirmed : '+str(tempdf1.iloc[i]['ConfirmedCases'])+
                    '<li><bold>Death_rate : '+str(tempdf1.iloc[i]['Fatalities_rate'])+
                    '<li><bold>Deaths : '+str(tempdf1.iloc[i]['Fatalities'])+
                    '<li><bold>lockdown date : '+str(tempdf1.iloc[i]['Lockdown_Date'])+
                    '<li><bold>first case date : '+str(tempdf1.iloc[i]['Date_FirstConfirmedCase'])+
                    '<li><bold>mean age : '+str(tempdf1.iloc[i]['Mean_Age'])
        ,
        radius=int(tempdf1.iloc[i]['Fatalities_rate']*10000)).add_to(m)

m

In the map below, we use folium to plot ConfirmedCases. Every country is shown as a circle, and hovering over it displays the following:
* Country name
* Confirmed cases
* Fatality rate
* Fatalities
* Lockdown date
* Date of first confirmed case
* Mean age

The radius of the circle depends on ConfirmedCases.

For example, the US had more cases than most countries and hence has a bigger circle.

In [None]:
tempdf1.daily_tested = tempdf1.daily_tested.astype("float")
m = folium.Map(location=[0, 0], tiles='cartodbpositron',
               min_zoom=1, max_zoom=8, zoom_start=1.5)

for i in range(0, len(tempdf1)):
    folium.Circle(
        location=[tempdf1.iloc[i]['Latitude'], tempdf1.iloc[i]['Longtitude']],
        color='green', fill='green',
        tooltip =   '<li><bold>Country : '+str(tempdf1.iloc[i]['Country_Region'])+
                    '<li><bold>Confirmed : '+str(tempdf1.iloc[i]['ConfirmedCases'])+
                    '<li><bold>Death_rate : '+str(tempdf1.iloc[i]['Fatalities_rate'])+
                    '<li><bold>Deaths : '+str(tempdf1.iloc[i]['Fatalities'])+
                    '<li><bold>lockdown date : '+str(tempdf1.iloc[i]['Lockdown_Date'])+
                    '<li><bold>first case date : '+str(tempdf1.iloc[i]['Date_FirstConfirmedCase'])+
                    '<li><bold>mean age : '+str(tempdf1.iloc[i]['Mean_Age'])
        ,
        radius=int(tempdf1.iloc[i]['ConfirmedCases']*1.1)).add_to(m)

m

* In the cell below, all the NaN values of diff_FC_LD are replaced by 62 and all the NaN values of diff_FF_LD are replaced by 42.
* Then both columns are converted to strings, cut to their first two characters, and converted to integers.

In [None]:
tempdf1.diff_FC_LD.replace({np.NAN:"62"},inplace=True)
tempdf1.diff_FF_LD.replace({np.NAN:"42"},inplace=True)
tempdf1.diff_FC_LD=tempdf1.diff_FC_LD.astype(str)
tempdf1.diff_FF_LD=tempdf1.diff_FF_LD.astype(str)
tempdf1.diff_FC_LD=tempdf1.diff_FC_LD.str[:2]
tempdf1.diff_FF_LD=tempdf1.diff_FF_LD.str[:2]
tempdf1.diff_FC_LD = tempdf1.diff_FC_LD.astype(int)
tempdf1.diff_FF_LD = tempdf1.diff_FF_LD.astype(int)
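
Keeping only the first two characters works for gaps such as "28 days" or "9 days", but it would cut a gap of 100 days or more, or a negative two-digit gap, short. A possible alternative, sketched below on made-up values, reads the day count directly from the timedelta with `.dt.days`. The series is an assumption for illustration; the fill value 62 is the one used in the notebook.

```python
import pandas as pd

# Hypothetical gaps between first confirmed case and lockdown; one is missing
gaps = pd.Series(pd.to_timedelta(["28 days", "9 days", None, "120 days"]))

# Whole days as integers, with the notebook's fill value of 62 for missing gaps
days = gaps.dt.days.fillna(62).astype(int)
print(days.tolist())
```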

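Build the lists used by the chatbot.
* deathrate : countries ordered by death rate, used to display the country with the highest death rate.
* lockdown : the ten countries with the most confirmed cases, ordered by the gap between first confirmed case and lockdown.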


In [None]:
deathrate = []
lockdown = []
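# Countries ordered by ascending fatality rate
for x in tempdf1.sort_values('Fatalities_rate')['Country_Region']:
    deathrate.append(x)
# Ten countries with the most confirmed cases, ordered by days from first case to lockdown
temp_L = tempdf1.sort_values('ConfirmedCases').tail(10)
for x in temp_L.sort_values('diff_FC_LD')['Country_Region']:
    lockdown.append(x)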

* Sort train_w4 by country.
* Create tmp_df from the unique country names in train_w4.
* Merge tmp_df with the "ConfirmedCases" & "Country_Region" columns of the train_w4 rows for the latest Date, and save the result in tmp_df.

In the cell below, we find the top 5 countries by confirmed cases.

In [None]:
train_w4.sort_values(["Country_Region"],inplace=True)
tmp_df = pd.DataFrame([train_w4.Country_Region.unique()]).T
tmp_df.columns =["Country_Region"]
tmp_df=pd.merge(tmp_df, train_w4.loc[(train_w4.Date == max(train_w4.Date)),["ConfirmedCases","Country_Region"]],how="left",on="Country_Region")
tmp_df[tmp_df.Country_Region.isin(["Morocco","Egypt","Algeria","Tunisia","France","Spain","Italy","Korea, South"])]
tmp_df1 = tmp_df.sort_values(by=['ConfirmedCases'])
positions = []
for x in tmp_df1['Country_Region'].tail():
    positions.append(x)
listToStr = ','.join([str(elem) for elem in positions]) 
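
The sort-then-tail pattern used above can be checked on a small made-up frame: after an ascending sort, `tail()` returns the last five rows, so the list ends with the largest value. The frame and names below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical latest-date confirmed cases for six countries
toy = pd.DataFrame({
    "Country_Region": ["A", "B", "C", "D", "E", "F"],
    "ConfirmedCases": [50.0, 10.0, 40.0, 60.0, 20.0, 30.0],
})

# Ascending sort + tail() keeps the five largest, with the largest last
top5 = list(toy.sort_values(by=["ConfirmedCases"])["Country_Region"].tail())
top5_str = ",".join(top5)
print(top5_str)   # E,F,C,A,D
print(top5[-1])   # D
```

This is why the notebook later uses index `[-1]` of such lists for the "highest" country.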

In the cell below, we find the top 5 countries by daily tests.

In [None]:
temp1 = tempdf1.sort_values(by=['daily_tested'])
most_test = []
for x in temp1['Country_Region'].tail():
    most_test.append(x)
listToStrC = ','.join([str(elem) for elem in most_test]) 
most_test[-1]
listToStrC

'Germany,India,Spain,Russia,US'

In the cell below, we find the top 5 countries by deaths.

In [None]:
fatalities = []
for x in tempdf1.sort_values('Fatalities')['Country_Region'].tail():
    fatalities.append(x)
listToStrD = ','.join([str(elem) for elem in fatalities]) 

## Covid-19 Chatbot 

## NLTK Tool kit:

The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English, written in the Python programming language.

* **nltk** is a library for natural language processing. We supply general questions and responses as intents, so that the chatbot can respond to the relevant questions.

In [None]:
!pip install nltk
import nltk



## Installing h5py library:
* As we create the chatbot model and save it to an .h5 file, h5py lets us see what type of data is present in the file.

In [None]:
!pip install h5py
import h5py



In [None]:
!python --version

Python 3.6.9


## Installing "punkt" and "wordnet"

## PUNKT:
Punkt is a sentence tokenizer. It divides a text into a list of sentences, using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.
## WORDNET:

WordNet is a lexical database of semantic relations between words in more than 200 languages. WordNet links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into synsets with short definitions and usage examples.

In [None]:
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import json
import pickle
import random

* We import several libraries: nltk (Natural Language Toolkit), which contains tools for cleaning up text and preparing it for deep learning algorithms; json, which loads JSON files directly into Python; pickle, which loads and saves pickle files; numpy, which performs linear algebra operations efficiently; and keras, the deep learning framework we’ll be using.

## Installation of Keras and Tensorflow packages

### Keras: 
Keras is an open-source neural-network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible.

### Tensorflow:
It is an open source artificial intelligence library, using data flow graphs to build models. It allows developers to create large-scale neural networks with many layers. TensorFlow is mainly used for: Classification, Perception, Understanding, Discovering, Prediction and Creation.

For now, the install commands are commented out, as the libraries are already installed.


In [None]:
#!pip install keras==2.3.0
#!pip install tensorflow
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD

Using TensorFlow backend.


In [None]:
from IPython.display import JSON
JSON('intents.json')

<IPython.core.display.JSON object>

In [None]:
words=[]
classes = []
documents = []
ignore_words = ['?', '!']
data_file = open('intents.json').read()

Replace the placeholder keywords in the JSON text with the values calculated above.

In [None]:
data_file = data_file.replace('top_affected_countries', listToStr)
data_file = data_file.replace('highest_cases_country', positions[-1])
data_file = data_file.replace('top_testing_countries', listToStrC)
data_file = data_file.replace('highest_testing_countries', most_test[-1])
data_file = data_file.replace('top_death_countries', listToStrD)
data_file = data_file.replace('highest_death_country', fatalities[-1])
data_file = data_file.replace('max_death_us', fatalities_US[-1])
data_file = data_file.replace('min_death_us', fatalities_US[1])
data_file = data_file.replace('max_deathrate', deathrate[-1])
data_file = data_file.replace('min_deathrate', deathrate[1])
data_file = data_file.replace('good_lockdown_counrty', lockdown[1])
data_file = data_file.replace('poor_lockdown_counrty', lockdown[-1])
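
Because `str.replace` runs over the whole file text, it also rewrites any tag or pattern that happens to contain a placeholder (the classes printed further down include 'Germany,India,Spain,Russia,US' as a tag). A possible alternative, sketched below, parses the JSON first and fills placeholders only inside the responses. The sample intent, the `"responses"` key and the `values` mapping are assumptions about the file layout, not taken from intents.json.

```python
import json

# Hypothetical intents text; the "responses" key is an assumption about the file layout
raw = json.dumps({"intents": [{
    "tag": "top_testing_countries",
    "patterns": ["Which countries test the most?"],
    "responses": ["Most tests: top_testing_countries"],
}]})

# Placeholder -> computed value (value copied from the notebook's listToStrC output)
values = {"top_testing_countries": "Germany,India,Spain,Russia,US"}

# Fill placeholders only inside responses, leaving tags and patterns untouched
parsed = json.loads(raw)
for intent in parsed["intents"]:
    filled = []
    for response in intent["responses"]:
        for placeholder, value in values.items():
            response = response.replace(placeholder, value)
        filled.append(response)
    intent["responses"] = filled

print(parsed["intents"][0]["tag"])
print(parsed["intents"][0]["responses"][0])
```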

* We use the json module to parse the file contents and save them in the variable intents.
* We then use a nested for loop, as the intents file has sub-objects within objects. For example, patterns is an attribute of each intent. The nested loop takes all the words within the patterns and adds them to the words list.
* It also adds each intent's tag to classes.

In [None]:
intents = json.loads(data_file)
with open('intents_up.json', 'w') as json_file:
    json.dump(intents, json_file)

In [None]:
for intent in intents['intents']:
    for pattern in intent['patterns']:

        # tokenize each pattern into words
        w = nltk.word_tokenize(pattern)
        words.extend(w)
        # adding documents
        documents.append((w, intent['tag']))

        # adding classes to our class list
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

* Next, we take the words list, then lemmatize and lowercase all the words in it. To lemmatize a word is to reduce it to its base form, or lemma. For example, the words “walking”, “walked”, “walks” all have the same lemma, “walk”. The purpose of lemmatizing is to reduce every word to its simplest form, which saves time and avoids unnecessary errors when we process these words for machine learning. It is very similar to stemming, which reduces an inflected word to its base or root form.

In [None]:
words = [lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_words]
words = sorted(list(set(words)))

classes = sorted(list(set(classes)))

print (len(documents), "documents")

print (len(classes), "classes", classes)

print (len(words), "unique lemmatized words", words)


pickle.dump(words,open('words.pkl','wb'))
pickle.dump(classes,open('classes.pkl','wb'))

320 documents
56 classes ['Germany,India,Spain,Russia,US', 'adverse_drug', 'animals_corona', 'ans_admitted', 'antibiotics_for_corona', 'any_other_querie', 'ask_and_get_info', 'blood_pressure', 'blood_pressure_search', 'cause_corona', 'corona_and_other_flu', 'corona_infected_ans', 'corona_infected_que', 'corona_synonym', 'defination_corona', 'emergency_symptoms', 'faq_symptoms_question', 'first_symptom_quetion', 'good_lockdown', 'goodbye', 'greeting', 'handle_answer', 'highest_death_US', 'highest_deathrate_US', 'hospital_search', 'incubation_corona', 'location_identifier', 'lowest_death_US', 'lowest_deathrate_US', 'mask_and_corona', 'matching_symptoms', 'options', 'patient_age', 'patient_background', 'person_admitted_q', 'pharmacy_search', 'poor_lockdown', 'prevention_corona', 'query', 'query_newborn', 'query_pregnancy', 'query_public', 'query_smoking', 'reason_behind_corona', 'risk_catching_corona', 'sars_and_corona', 'send_the_correct_response', 'spread_corona', 'starting_place_corona

* Hence the three files intents_up.json, words.pkl and classes.pkl are created with all the data written to them, ready for building the deep learning model.

# Creating the Deep Learning Model for Chatbot

In [None]:
# initializing training data
training = []
output_empty = [0] * len(classes)
for doc in documents:
    # initializing bag of words
    bag = []
    # list of tokenized words for the pattern
    pattern_words = doc[0]
    # lemmatize each word - create base word, in attempt to represent related words
    pattern_words = [lemmatizer.lemmatize(word.lower()) for word in pattern_words]
    # create our bag of words array with 1, if word match found in current pattern
    for w in words:
        bag.append(1) if w in pattern_words else bag.append(0)

    # output is a '0' for each tag and '1' for current tag (for each pattern)
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1

    training.append([bag, output_row])
# shuffle our features and turn into np.array
random.shuffle(training)
# dtype=object because each row pairs two lists of different lengths
training = np.array(training, dtype=object)
# create the training lists. X - patterns, Y - intents
train_x = list(training[:,0])
train_y = list(training[:,1])
print("Training data created")

Training data created
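
To make the encoding concrete, the sketch below builds the bag-of-words vector and the one-hot output row for a single pattern. The vocabulary, tags and tokens are made up for illustration; in the notebook they come from `words`, `classes` and `documents`.

```python
# Hypothetical vocabulary, classes and one tokenized pattern
vocab = ["corona", "hello", "is", "what"]
tags = ["definition", "greeting"]
pattern_tokens = ["what", "is", "corona"]
pattern_tag = "definition"

# 1 where a vocabulary word occurs in the pattern, else 0
bag = [1 if w in pattern_tokens else 0 for w in vocab]

# One-hot row: 1 at the position of the pattern's tag
output_row = [0] * len(tags)
output_row[tags.index(pattern_tag)] = 1

print(bag)         # [1, 0, 1, 1]
print(output_row)  # [1, 0]
```

Each training example is therefore a fixed-length vector as long as the vocabulary, paired with a target vector as long as the list of classes. Word order is discarded.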


In [None]:
# Create model - 3 layers. First layer 128 neurons, second layer 64 neurons and 3rd output layer contains number of neurons
# equal to number of intents to predict output intent with softmax
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))

# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

# fit the model, then save it (the training history is kept in hist, not in the file)
hist = model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)
model.save('chatbot_model.h5')

print("model created")

Epoch 1/200
...
Epoch 78
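
The output layer uses softmax and the model is trained with categorical cross-entropy. The following is a minimal sketch of both in plain Python, independent of Keras; the function names and the example scores are illustrative assumptions, not part of the notebook.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs, eps=1e-12):
    """Loss for one sample: -sum(y_i * log(p_i))."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))

# Hypothetical scores for three intents; the true intent is the first one
probs = softmax([2.0, 1.0, 0.1])
loss = categorical_crossentropy([1, 0, 0], probs)
print(probs, loss)
```

Because the target row is one-hot, the loss reduces to the negative log of the probability assigned to the correct intent, so it shrinks toward zero as that probability approaches 1.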

In [None]:
with h5py.File("chatbot_model.h5", "r") as hdf:
    ls = list(hdf.keys())
    print("The data present in the file: \n", ls)
    # np.array() over an HDF5 group yields the names of its members
    data1 = hdf.get("model_weights")
    data2 = hdf.get("optimizer_weights")
    dataset1 = np.array(data1)
    dataset2 = np.array(data2)

The data present in the file: 
 ['model_weights', 'optimizer_weights']


In [None]:
dataset1

array(['dense_1', 'dense_2', 'dense_3', 'dropout_1', 'dropout_2'],
      dtype='<U9')

In [None]:
dataset2

array(['SGD', 'moment_0:0', 'moment_1:0', 'moment_2:0', 'moment_3:0',
       'moment_4:0', 'moment_5:0'], dtype='<U10')

## Chatbot Graphical User Interface creation

* To build the GUI, we first reload the model and the supporting files that we created and saved earlier.

In [None]:
from keras.models import load_model
model = load_model('chatbot_model.h5')
intents = json.loads(open('intents_up.json').read())
words = pickle.load(open('words.pkl','rb'))
classes = pickle.load(open('classes.pkl','rb'))

In [None]:
def clean_up_sentence(sentence):
    sentence_words = nltk.word_tokenize(sentence)
    sentence_words = [lemmatizer.lemmatize(word.lower()) for word in sentence_words]
    return sentence_words

# return bag of words array: 0 or 1 for each word in the bag that exists in the sentence

def bow(sentence, words, show_details=True):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words - matrix of N words, vocabulary matrix
    bag = [0]*len(words)
    for s in sentence_words:
        for i,w in enumerate(words):
            if w == s:
                # assign 1 if current word is in the vocabulary position
                bag[i] = 1
                if show_details:
                    print ("found in bag: %s" % w)
    return(np.array(bag))

def predict_class(sentence, model):
    # filter out predictions below a threshold
    p = bow(sentence, words,show_details=False)
    res = model.predict(np.array([p]))[0]
    ERROR_THRESHOLD = 0.25
    results = [[i,r] for i,r in enumerate(res) if r>ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append({"intent": classes[r[0]], "probability": str(r[1])})
    return return_list

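def getResponse(ints, intents_json):
    # fallback used when no intent passes the threshold or no tag matches
    result = "Sorry, I did not understand that."
    if not ints:
        return result
    tag = ints[0]['intent']
    list_of_intents = intents_json['intents']
    for i in list_of_intents:
        if(i['tag']== tag):
            result = random.choice(i['responses'])
            break
    if result.startswith('http'):
        # NOTE: this branch returns a tk.Label instead of a string, and it assumes
        # change_case, red_text and black_text are defined elsewhere in the notebook
        root = tk.Tk()
        # "hand2" is the portable pointing-hand cursor name in Tk
        result = tk.Label(root, text=result,fg="blue", cursor="hand2")
        result.bind("<Button-1>",change_case)
        result.bind("<Enter>",red_text)
        result.bind("<Leave>",black_text)

        result.grid()
    return result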

def chatbot_response(msg):
    ints = predict_class(msg, model)
    res = getResponse(ints, intents)
    return res


* The clean_up_sentence() function tokenizes and lemmatizes each input sentence. The bow() function uses it to build a bag of words: a 0/1 vector over the vocabulary saved during training, which is the input the model uses to predict a class.
* In predict_class(), a threshold of 0.25 (ERROR_THRESHOLD) filters out low-probability predictions. The function returns a list of intents with their probabilities, sorted from most to least likely. getResponse() takes that list, looks up the top-ranked intent in the JSON file, and returns one of that intent's responses chosen at random.
* Finally, chatbot_response() takes a message (entered through the chatbot GUI), predicts its class with predict_class(), passes the resulting list to getResponse(), and returns the response. This is the foundation of our chatbot: we can tell the bot something, and it responds.
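
The flow described above can be sketched end to end without the trained model. This is a simplified illustration, not the notebook's implementation: the vocabulary, intents, fake_predict() stand-in for model.predict, and the whitespace tokenizer (in place of nltk tokenizing and lemmatizing) are all hypothetical.

```python
import random

# Hypothetical vocabulary, classes and intents, for illustration only
words = ["covid", "hello", "symptom"]
classes = ["greeting", "symptoms"]
intents = {"intents": [
    {"tag": "greeting", "responses": ["Hello!"]},
    {"tag": "symptoms", "responses": ["Common symptoms include fever and cough."]},
]}

def bag_of_words(sentence, vocabulary):
    # whitespace split stands in for nltk tokenizing + lemmatizing
    tokens = sentence.lower().split()
    return [1 if w in tokens else 0 for w in vocabulary]

def fake_predict(bag):
    # stand-in for model.predict: one probability per class
    return [0.9, 0.1] if bag[1] else [0.2, 0.8]

def predict_class(sentence, threshold=0.25):
    probs = fake_predict(bag_of_words(sentence, words))
    # keep predictions above the threshold, most likely first
    ranked = sorted(((p, i) for i, p in enumerate(probs) if p > threshold), reverse=True)
    return [{"intent": classes[i], "probability": p} for p, i in ranked]

def get_response(predictions):
    fallback = "Sorry, I did not understand that."
    if not predictions:
        return fallback
    tag = predictions[0]["intent"]
    for intent in intents["intents"]:
        if intent["tag"] == tag:
            return random.choice(intent["responses"])
    return fallback

print(get_response(predict_class("hello there")))  # Hello!
```

Note that the threshold only decides which intents are kept; the reply itself is picked at random from the responses of the top-ranked intent.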

# Building Chatbot GUI using tkinter
### What is tkinter?

* Tkinter is Python's binding to the Tk GUI toolkit and its de facto standard GUI library. It is included with standard Linux, Microsoft Windows and macOS installs of Python. The name Tkinter comes from "Tk interface".
* We will use tkinter to build the chatbot window.
* First, we import the required libraries and functions.

In [None]:
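#Creating GUI with tkinter
import tkinter as tk
import webbrowser
from tkinter import *


def send(event=None):
    # event is supplied when triggered by a key binding, and is None for the button
    msg = EntryBox.get("1.0",'end-1c').strip()
    EntryBox.delete("1.0",END)

    if msg != '':
        ChatLog.config(state=NORMAL)
        ChatLog.insert(END, "You: " + msg + '\n\n')
        ChatLog.config(foreground="#442265", font=("Verdana", 12 ))

        res = chatbot_response(msg)
        # getResponse() may return a Label for links, so fall back to its text
        if not isinstance(res, str):
            res = res.cget("text")
        ChatLog.insert(END, "Assistant: " + res + '\n\n')

        ChatLog.config(state=DISABLED)
        ChatLog.yview(END)
    # stop the default <Return> handler from adding a newline to the entry box
    return "break"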


base = Tk()
base.title("Covid-19 Assistant")
base.geometry("400x500")
base.resizable(width=FALSE, height=FALSE)

#Create the chat window with the required dimensions
ChatLog = Text(base, bd=0, bg="grey", fg="white", height="8", width="50", font="Arial",)

ChatLog.config(state=DISABLED)

#Bind scrollbar to Chat window to check the previous and next responses
scrollbar = Scrollbar(base, command=ChatLog.yview, cursor="heart")
ChatLog['yscrollcommand'] = scrollbar.set


#Create the box where the user types a query for the bot
EntryBox = Text(base, bd=0, bg="white",width="50", height="8", font="Arial")
# bind() passes an event argument to its callback, so wrap send() in a lambda
EntryBox.bind("<Return>", lambda event: send())

#Create the button that sends the user's query to the bot
SendButton = Button(base, font=("Verdana",12,'bold'), text="Send", width="12", height=5,
                    bd=0, bg="#32de97", activebackground="#3c9d9b",fg='black',
                    command= send )

#Place all components on the screen
scrollbar.place(x=376,y=6, height=386)
ChatLog.place(x=6,y=6, height=386, width=370)
EntryBox.place(x=128, y=401, height=90, width=300)
SendButton.place(x=6, y=401, height=90)

base.mainloop()

TclError: ignored