# Project 1

DOMAIN: Digital content management

CONTEXT: Classification is probably the most popular task that you would deal with in real life. Text in the form of blogs, posts, articles, etc. is written every second. It is a challenge to predict the information about the writer without knowing about him/her. We are going to create a classifier that predicts multiple features of the author of a given text. We have designed it as a Multi label classification problem.

DATA DESCRIPTION: Over 600,000 posts from more than 19 thousand bloggers The Blog Authorship Corpus consists of the collected
posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age but for many,
industry and/or sign is marked as unknown.) All bloggers included in the corpus fall into one of three age groups:
• 8240 "10s" blogs (ages 13-17),
• 8086 "20s" blogs(ages 23-27) and
• 2994 "30s" blogs (ages 33-47)
For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions.
Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label url
link. 

Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus

PROJECT OBJECTIVE: The need is to build a NLP classifier which can use input text parameters to determine the label/s of the blog.

Steps and tasks:
1. Import and analyse the data set.
2. Perform data pre-processing on the data:
• Data cleansing by removing unwanted characters, spaces, stop words etc. Convert text to lowercase.
• Target/label merger and transformation
• Train and test split
• Vectorisation, etc.
3. Design, train, tune and test the best text classifier.
4. Display and explain detail the classification report
5. Print the true vs predicted labels for any 5 entries from the dataset.

Hint: The aim here Is to import the text, process it such a way that it can be taken as an inout to the ML/NN classifiers. Be analytical and experimental here in trying new
approaches to design the best model.

In [237]:
#Set tensorflow version
%tensorflow_version 2.x
import tensorflow
tensorflow.__version__

'2.4.1'

In [238]:
# Initialize the random number generator
import random
random.seed(0)

# Ignore the warnings
import warnings
warnings.filterwarnings("ignore")

In [239]:
import pandas as pd
import matplotlib.pyplot as plt # data visualization library
%matplotlib inline
import seaborn as sns

In [240]:
import re
import nltk

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, average_precision_score, recall_score


from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer #word stemmer class
lemma = WordNetLemmatizer()
from wordcloud import WordCloud, STOPWORDS
from nltk import FreqDist

In [241]:
#Let us load the data
from google.colab import drive
drive.mount("/content/gdrive")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [242]:
#Import the data and analyse the dataset
blog_data = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/NLP/blogtext.csv')
blog_data.head(10)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",I had an interesting conversation...
6,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Somehow Coca-Cola has a way of su...
7,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004","If anything, Korea is a country o..."
8,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Take a read of this news article ...
9,3581210,male,33,InvestmentBanking,Aquarius,"09,June,2004",I surf the English news sites a l...


In [243]:
blog_data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,681284.0,2397802.0,1247723.0,5114.0,1239610.0,2607577.0,3525660.0,4337650.0
age,681284.0,23.93233,7.786009,13.0,17.0,24.0,26.0,48.0


In [244]:
#Analyze the dataset
blog_data.shape

(681284, 7)

We can see that the dataset has 681284 rows and 7 columns

Since the dataset is huge we will perform analysis on a subset of the entire data, say 10000 records.

In [245]:
blog_data_sample=blog_data.head(10000)

In [246]:
blog_data.isna().any()

id        False
gender    False
age       False
topic     False
sign      False
date      False
text      False
dtype: bool

Step 2 - Perform data pre-processing on the data: • Data cleansing by removing unwanted characters, spaces, stop words etc. Convert text to lowercase. • Target/label merger and transformation • Train and test split • Vectorisation, etc.

We can see that there is no missing values in the dataset

In [247]:
blog_data_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      10000 non-null  int64 
 1   gender  10000 non-null  object
 2   age     10000 non-null  int64 
 3   topic   10000 non-null  object
 4   sign    10000 non-null  object
 5   date    10000 non-null  object
 6   text    10000 non-null  object
dtypes: int64(2), object(5)
memory usage: 547.0+ KB


Let us drop id and date columns from the dataset as they do not add much value

In [248]:
blog_data_sample.drop(['id','date'], axis=1, inplace=True)

In [249]:
blog_data_sample.head()

Unnamed: 0,gender,age,topic,sign,text
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,..."
1,male,15,Student,Leo,These are the team members: Drewe...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...
3,male,15,Student,Leo,testing!!! testing!!!
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...


In [250]:
blog_data_sample['age']=blog_data_sample['age'].astype('object')

In [251]:
blog_data_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   gender  10000 non-null  object
 1   age     10000 non-null  object
 2   topic   10000 non-null  object
 3   sign    10000 non-null  object
 4   text    10000 non-null  object
dtypes: object(5)
memory usage: 390.8+ KB


Now all columns are of type object

Let us now remove all unwanted text from the text column

In [252]:
#remove unwanted characters
blog_data_sample['processed_data']=blog_data_sample['text'].apply(lambda x: re.sub(r'[^A-Za-z]+',' ',x))

In [253]:
#change everything to lowercase
blog_data_sample['processed_data']=blog_data_sample['processed_data'].apply(lambda x: x.lower())

In [254]:
#strip spaces
blog_data_sample['processed_data']=blog_data_sample['processed_data'].apply(lambda x: x.strip())

In [255]:
print("Actual data => {}".format(blog_data_sample['text'][1]))

Actual data =>            These are the team members:   Drewes van der Laag           urlLink mail  Ruiyu Xie                     urlLink mail  Bryan Aaldering (me)          urlLink mail          


In [256]:
print("Processed data => {}".format(blog_data_sample['processed_data'][1]))

Processed data => these are the team members drewes van der laag urllink mail ruiyu xie urllink mail bryan aaldering me urllink mail


Let us now remove all stop words

In [257]:
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to /root/nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to /root/nltk_data...
[nltk_data]    |   Package cess_cat is already up-to-date!
[nltk_data]    | Downloading package cess_esp to /root/nltk_data...
[nltk_data]    |   Package cess_esp is already up-to-date!
[nltk_data]    | Downloading packag

True

In [258]:
from nltk.corpus import stopwords
stopwords=set(stopwords.words('english'))

In [259]:
blog_data_sample['processed_data']=blog_data_sample['processed_data'].apply(lambda x: ' '.join([words for words in x.split() if words not in stopwords]))

In [260]:
blog_data_sample['processed_data'][7]

'anything korea country extremes everything seems fad based think may come korea history invaded reported times years time got independence imagine move quickly get next level next war occupation lately well really lately japanese occupation ended korean war occurred turmoil park chung hee took dictator president elections everyone encouraged vote still dictator assassination next leaders basically ilk president park amazing things time however took incredibly backward country set road industrialization japan stripped korea resources people even language culture many buildings palaces razed japanese official language president park determined change orchestrated han river miracle han river hangang main river seoul korea korea made terrific strides expense civil liberties fastforward present point see korea world wired nation canada finland way beyond u craze pc pc bangs rooms everywhere country well instead playstation like games players go computer one two people korean gamers always 

Let us merge all the other columns into labels columns

In [261]:
blog_data_sample['labels']=blog_data_sample.apply(lambda col: [col['gender'],str(col['age']),col['topic'],col['sign']], axis=1)

In [262]:
blog_data_sample.head()

Unnamed: 0,gender,age,topic,sign,text,processed_data,labels
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,male,15,Student,Leo,These are the team members: Drewe...,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,male,15,Student,Leo,testing!!! testing!!!,testing testing,"[male, 15, Student, Leo]"
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"


In [263]:
blog_data_sample=blog_data_sample[['processed_data','labels']]

In [264]:
blog_data_sample.head()

Unnamed: 0,processed_data,labels
0,info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,testing testing,"[male, 15, Student, Leo]"
4,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"


In [265]:
blog_data_sample.shape

(10000, 2)

Train and test split

In [266]:
X=blog_data_sample['processed_data']

In [267]:
Y=blog_data_sample['labels']

Let us now perform count vectorizer with bi-gram and tri-gram models to get count vectors of X

In [268]:
from sklearn.feature_extraction.text import CountVectorizer

In [269]:
vectorizer=CountVectorizer(binary=True, ngram_range=(1,2))

In [270]:
X=vectorizer.fit_transform(X)

In [271]:
vectorizer.get_feature_names()[:10]

['aa',
 'aa amazing',
 'aa anger',
 'aa compared',
 'aa keeps',
 'aa nice',
 'aa sd',
 'aaa',
 'aaa come',
 'aaa discount']

In [272]:
label_counts=dict()

for labels in blog_data_sample.labels.values:
    for label in labels:
        if label in label_counts:
            label_counts[label]+=1
        else:
            label_counts[label]=1

In [273]:
label_counts

{'13': 42,
 '14': 212,
 '15': 602,
 '16': 440,
 '17': 1185,
 '23': 253,
 '24': 655,
 '25': 386,
 '26': 234,
 '27': 1054,
 '33': 136,
 '34': 553,
 '35': 2315,
 '36': 1708,
 '37': 33,
 '38': 46,
 '39': 79,
 '40': 1,
 '41': 20,
 '42': 14,
 '43': 6,
 '44': 3,
 '45': 16,
 '46': 7,
 'Accounting': 4,
 'Aquarius': 571,
 'Aries': 4198,
 'Arts': 45,
 'Automotive': 14,
 'Banking': 16,
 'BusinessServices': 91,
 'Cancer': 504,
 'Capricorn': 215,
 'Communications-Media': 99,
 'Consulting': 21,
 'Education': 270,
 'Engineering': 127,
 'Fashion': 1622,
 'Gemini': 150,
 'HumanResources': 2,
 'Internet': 118,
 'InvestmentBanking': 70,
 'Law': 11,
 'LawEnforcement-Security': 10,
 'Leo': 301,
 'Libra': 491,
 'Marketing': 156,
 'Museums-Libraries': 17,
 'Non-Profit': 71,
 'Pisces': 454,
 'Publishing': 4,
 'Religion': 9,
 'Sagittarius': 1097,
 'Science': 63,
 'Scorpio': 971,
 'Sports-Recreation': 80,
 'Student': 1137,
 'Taurus': 812,
 'Technology': 2654,
 'Telecommunications': 2,
 'Virgo': 236,
 'female': 4

In [274]:
#Let us preprocess the labels now
from sklearn.preprocessing import MultiLabelBinarizer
binarizer=MultiLabelBinarizer(classes=sorted(label_counts.keys()))

In [275]:
Y=binarizer.fit_transform(blog_data_sample.labels)

Let us now split the data into training and test

In [276]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2)

Step 3 Design, train, tune and test the best text classifier

1. Logistic regression

In [277]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

In [278]:
model=LogisticRegression(solver='lbfgs')

In [279]:
model=OneVsRestClassifier(model)

In [280]:
model.fit(X_train,Y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [281]:
Y_pred=model.predict(X_test)

In [282]:
Y_pred_inversed = binarizer.inverse_transform(Y_pred)
y_test_inversed = binarizer.inverse_transform(Y_test)

Print the true vs predicted labels for any 5 entries from the dataset.

In [283]:
for i in range(5):
    print('Text:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test[i],
        ','.join(y_test_inversed[i]),
        ','.join(Y_pred_inversed[i])
    ))

Text:	  (0, 312628)	1
  (0, 525350)	1
  (0, 611860)	1
  (0, 375481)	1
  (0, 170072)	1
  (0, 72612)	1
  (0, 562615)	1
  (0, 374716)	1
  (0, 294992)	1
  (0, 447643)	1
  (0, 314469)	1
  (0, 556303)	1
  (0, 127378)	1
  (0, 357996)	1
  (0, 431617)	1
  (0, 426777)	1
  (0, 23711)	1
  (0, 15856)	1
  (0, 171062)	1
  (0, 527604)	1
  (0, 377796)	1
  (0, 61125)	1
  (0, 166113)	1
  (0, 302921)	1
  (0, 431705)	1
  :	:
  (0, 612148)	1
  (0, 167243)	1
  (0, 166156)	1
  (0, 525557)	1
  (0, 302923)	1
  (0, 18515)	1
  (0, 378394)	1
  (0, 377921)	1
  (0, 153480)	1
  (0, 532672)	1
  (0, 61130)	1
  (0, 15996)	1
  (0, 134048)	1
  (0, 23730)	1
  (0, 72794)	1
  (0, 401862)	1
  (0, 147909)	1
  (0, 12621)	1
  (0, 489058)	1
  (0, 527770)	1
  (0, 448064)	1
  (0, 444983)	1
  (0, 128588)	1
  (0, 556595)	1
  (0, 491223)	1
True labels:	35,Aries,Technology,male
Predicted labels:	male


Text:	  (0, 312628)	1
  (0, 319058)	1
  (0, 449060)	1
  (0, 383735)	1
  (0, 541965)	1
  (0, 131554)	1
  (0, 192891)	1
  (0, 564667)	1
 

2. Naive Bayes Classifier

In [284]:
from sklearn.model_selection import train_test_split
X_train_n,X_test_n,Y_train_n,Y_test_n=train_test_split(X,Y,test_size=0.2)

In [285]:
X_train_n.shape

(8000, 643302)

In [286]:
Y_train.shape

(8000, 64)

In [287]:
Y_train_n.shape

(8000, 64)

In [288]:
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_model.fit(X_train_n, Y_train_n)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='distance')

In [289]:
Y_pred_n=knn_model.predict(X_test_n)

In [290]:
Y_pred_n_inversed = binarizer.inverse_transform(Y_pred_n)
y_test_n_inversed = binarizer.inverse_transform(Y_test_n)

In [291]:
for i in range(5):
    print('Text:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test[i],
        ','.join(y_test_n_inversed[i]),
        ','.join(Y_pred_n_inversed[i])
    ))

Text:	  (0, 312628)	1
  (0, 525350)	1
  (0, 611860)	1
  (0, 375481)	1
  (0, 170072)	1
  (0, 72612)	1
  (0, 562615)	1
  (0, 374716)	1
  (0, 294992)	1
  (0, 447643)	1
  (0, 314469)	1
  (0, 556303)	1
  (0, 127378)	1
  (0, 357996)	1
  (0, 431617)	1
  (0, 426777)	1
  (0, 23711)	1
  (0, 15856)	1
  (0, 171062)	1
  (0, 527604)	1
  (0, 377796)	1
  (0, 61125)	1
  (0, 166113)	1
  (0, 302921)	1
  (0, 431705)	1
  :	:
  (0, 612148)	1
  (0, 167243)	1
  (0, 166156)	1
  (0, 525557)	1
  (0, 302923)	1
  (0, 18515)	1
  (0, 378394)	1
  (0, 377921)	1
  (0, 153480)	1
  (0, 532672)	1
  (0, 61130)	1
  (0, 15996)	1
  (0, 134048)	1
  (0, 23730)	1
  (0, 72794)	1
  (0, 401862)	1
  (0, 147909)	1
  (0, 12621)	1
  (0, 489058)	1
  (0, 527770)	1
  (0, 448064)	1
  (0, 444983)	1
  (0, 128588)	1
  (0, 556595)	1
  (0, 491223)	1
True labels:	24,Capricorn,Education,female
Predicted labels:	Aries,male


Text:	  (0, 312628)	1
  (0, 319058)	1
  (0, 449060)	1
  (0, 383735)	1
  (0, 541965)	1
  (0, 131554)	1
  (0, 192891)	1
  (0, 

In [292]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

def print_evaluation_scores(Ytest, Ypred):
    print('Accuracy score: ', accuracy_score(Ytest, Ypred))
    print('F1 score: ', f1_score(Ytest, Ypred, average='micro'))
    print('Average precision score: ', average_precision_score(Ytest, Ypred, average='micro'))
    print('Average recall score: ', recall_score(Ytest, Ypred, average='micro'))

In [293]:
print_evaluation_scores(Y_test, Y_pred)

Accuracy score:  0.3115
F1 score:  0.6348340154615735
Average precision score:  0.451887141413169
Average recall score:  0.5235


In [294]:
print_evaluation_scores(Y_test_n, Y_pred_n)

Accuracy score:  0.051
F1 score:  0.2885366047956664
Average precision score:  0.138394418858445
Average recall score:  0.226375


# **Conclusion**
We have used two classifiers Logistic regression and KNN Classifier. Comparing the accuracies of both, we can clearly see that logistic regression classifier has a much better accuracy of 32% than that of  KNN of only 5%.

# Project 2

DOMAIN: Customer support

CONTEXT: Great Learning has a an academic support department which receives numerous support requests every day throughout the
year. Teams are spread across geographies and try to provide support round the year. Sometimes there are circumstances where due to
heavy workload certain request resolutions are delayed, impacting company’s business. Some of the requests are very generic where a
proper resolution procedure delivered to the user can solve the problem. Company is looking forward to design an automation which can
interact with the user, understand the problem and display the resolution procedure [ if found as a generic request ] or redirect the request
to an actual human support executive if the request is complex or not in it’s database.

DATA DESCRIPTION: A sample corpus is attached for your reference. Please enhance/add more data to the corpus using your linguistics
skills.

PROJECT OBJECTIVE: Design a python based interactive semi - rule based chatbot which can do the following:
1. Start chat session with greetings and ask what the user is looking for.
2. Accept dynamic text based questions from the user. Reply back with relevant answer from the designed corpus.
3. End the chat session only if the user requests to end else ask what the user is looking for. Loop continues till the user asks to end it.
Please use the sample chatbot demo video for reference.

EVALUATION: GL evaluator will use linguistics to twist and turn sentences to ask questions on the topics described in DATA DESCRIPTION
and check if the bot is giving relevant replies.

In [295]:
import nltk
import random
import string
import re, string, unicodedata
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from collections import defaultdict
import warnings
warnings.filterwarnings("ignore")
nltk.download('punkt') 
nltk.download('wordnet')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [296]:
import json
with open('/content/gdrive/MyDrive/Colab Notebooks/NLP/GL Bot.json') as file:
    bot_data = json.load(file)

In [297]:
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import json
import pickle

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
import random

words=[]
classes = []
documents = []
ignore_words = ['?', '!']
with open('/content/gdrive/MyDrive/Colab Notebooks/NLP/GL Bot.json') as file:
    bot_data = json.load(file)

Preprocess data

Here we iterate through the patterns and tokenize the sentence using nltk.word_tokenize() function and append each word in the words list. We also create a list of classes for our tags.

In [298]:
for intent in bot_data['intents']:
    for pattern in intent['patterns']:

        #tokenize each word
        w = nltk.word_tokenize(pattern)
        words.extend(w)
        #add documents in the corpus
        documents.append((w, intent['tag']))

        # add to our classes list
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

Now we will lemmatize each word and remove duplicate words from the list.

In [299]:
# lemmatize, lower each word and remove duplicates
words = [lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_words]
words = sorted(list(set(words)))
# sort classes
classes = sorted(list(set(classes)))
# documents = combination between patterns and intents
print (len(documents), "documents")
# classes = intents
print (len(classes), "classes", classes)
# words = all words, vocabulary
print (len(words), "unique lemmatized words", words)

pickle.dump(words,open('words.pkl','wb'))
pickle.dump(classes,open('classes.pkl','wb'))

128 documents
8 classes ['Bot', 'Exit', 'Intro', 'NN', 'Olympus', 'Profane', 'SL', 'Ticket']
158 unique lemmatized words ['a', 'able', 'access', 'activation', 'ada', 'adam', 'aifl', 'aiml', 'am', 'an', 'ann', 'anyone', 'are', 'artificial', 'backward', 'bad', 'bagging', 'batch', 'bayes', 'belong', 'best', 'blended', 'bloody', 'boosting', 'bot', 'buddy', 'classification', 'contact', 'create', 'cross', 'cya', 'day', 'deep', 'did', 'diffult', 'do', 'ensemble', 'epoch', 'explain', 'first', 'for', 'forest', 'forward', 'from', 'function', 'good', 'goodbye', 'gradient', 'great', 'hate', 'have', 'hell', 'hello', 'help', 'helped', 'hey', 'hi', 'hidden', 'hour', 'how', 'hyper', 'i', 'imputer', 'in', 'intelligence', 'is', 'jerk', 'joke', 'knn', 'later', 'layer', 'learner', 'learning', 'leaving', 'link', 'listen', 'logistic', 'lot', 'machine', 'me', 'ml', 'my', 'naive', 'name', 'nb', 'net', 'network', 'neural', 'no', 'not', 'of', 'olympus', 'olypus', 'on', 'online', 'operation', 'opertions', 'otimi

We have created pickle dumps for words and classes for this data

Create train and test data

Here, we will create the training data in which we will provide the input and the output. Our input will be the pattern and output will be the class our input pattern belongs to. We will convert text into numbers so that the computer can understand.

In [300]:
# create our training data
training = []
# create an empty array for our output
output_empty = [0] * len(classes)
# training set, bag of words for each sentence
for doc in documents:
    # initialize our bag of words
    bag = []
    # list of tokenized words for the pattern
    pattern_words = doc[0]
    # lemmatize each word - create base word, in attempt to represent related words
    pattern_words = [lemmatizer.lemmatize(word.lower()) for word in pattern_words]
    # create our bag of words array with 1, if word match found in current pattern
    for w in words:
        bag.append(1) if w in pattern_words else bag.append(0)

    # output is a '0' for each tag and '1' for current tag (for each pattern)
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1

    training.append([bag, output_row])
# shuffle our features and turn into np.array
random.shuffle(training)
training = np.array(training)
# create train and test lists. X - patterns, Y - intents
train_x = list(training[:,0])
train_y = list(training[:,1])
print("Training data created")

Training data created


Let us now build the neural network for the chatbot

In [301]:
# Create model - 3 layers. First layer 128 neurons, second layer 64 neurons and 3rd output layer contains number of neurons
# equal to number of intents to predict output intent with softmax
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))

# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

#fitting and saving the model 
hist = model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)
model.save('gl_chatbot_model.h5', hist)

print("model created")

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/200
Epoch 77/200
Epoch 78

We see that we have got a very good accuracy on the training data of 100%. Let us predict using test data now.

1. We will create a new file test_chatbot.py
2. Load the ‘words.pkl’ and ‘classes.pkl’ pickle files which we have created when we trained our model

In [302]:
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import pickle
import numpy as np

from keras.models import load_model
model = load_model('gl_chatbot_model.h5')
import json
import random
intents = json.loads(open('/content/gdrive/MyDrive/Colab Notebooks/NLP/GL Bot.json').read())
words = pickle.load(open('words.pkl','rb'))
classes = pickle.load(open('classes.pkl','rb'))

In [309]:
#Let us now perform some predictions
def clean_up_sentence(sentence):
    # tokenize the pattern - split words into array
    sentence_words = nltk.word_tokenize(sentence)
    # stem each word - create short form for word
    sentence_words = [lemmatizer.lemmatize(word.lower()) for word in sentence_words]
    return sentence_words
# return bag of words array: 0 or 1 for each word in the bag that exists in the sentence

def bow(sentence, words, show_details=True):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words - matrix of N words, vocabulary matrix
    bag = [0]*len(words) 
    for s in sentence_words:
        for i,w in enumerate(words):
            if w == s: 
                # assign 1 if current word is in the vocabulary position
                bag[i] = 1
                if show_details:
                    print ("found in bag: %s" % w)
    return(np.array(bag))

def predict_class(sentence, model):
    # filter out predictions below a threshold
    p = bow(sentence, words,show_details=False)
    res = model.predict(np.array([p]))[0]
    ERROR_THRESHOLD = 0.25
    results = [[i,r] for i,r in enumerate(res) if r>ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append({"intent": classes[r[0]], "probability": str(r[1])})
    return return_list

In [310]:
def getResponse(ints, intents_json):
    tag = ints[0]['intent']
    list_of_intents = intents_json['intents']
    for i in list_of_intents:
        if(i['tag']== tag):
            result = random.choice(i['responses'])
            break
    return result

def chatbot_response(text):
    ints = predict_class(text, model)
    res = getResponse(ints, intents)
    return res

Creating a GUI for the bot


In [311]:
#Creating GUI with tkinter
import tkinter
from tkinter import *


def send():
    msg = EntryBox.get("1.0",'end-1c').strip()
    EntryBox.delete("0.0",END)

    if msg != '':
        ChatLog.config(state=NORMAL)
        ChatLog.insert(END, "You: " + msg + '\n\n')
        ChatLog.config(foreground="#442265", font=("Verdana", 12 ))

        res = chatbot_response(msg)
        ChatLog.insert(END, "Bot: " + res + '\n\n')

        ChatLog.config(state=DISABLED)
        ChatLog.yview(END)

base = Tk()
base.title("Hello")
base.geometry("400x500")
base.resizable(width=FALSE, height=FALSE)

#Create Chat window
ChatLog = Text(base, bd=0, bg="white", height="8", width="50", font="Arial",)

ChatLog.config(state=DISABLED)

#Bind scrollbar to Chat window
scrollbar = Scrollbar(base, command=ChatLog.yview, cursor="heart")
ChatLog['yscrollcommand'] = scrollbar.set

#Create Button to send message
SendButton = Button(base, font=("Verdana",12,'bold'), text="Send", width="12", height=5,
                    bd=0, bg="#32de97", activebackground="#3c9d9b",fg='#ffffff',
                    command= send )

#Create the box to enter message
EntryBox = Text(base, bd=0, bg="white",width="29", height="5", font="Arial")
#EntryBox.bind("<Return>", send)


#Place all components on the screen
scrollbar.place(x=376,y=6, height=386)
ChatLog.place(x=6,y=6, height=386, width=370)
EntryBox.place(x=128, y=401, height=90, width=265)
SendButton.place(x=6, y=401, height=90)

base.mainloop()

TclError: ignored

In [312]:
#to run the chatbot we run the test_chatbot.py file
python test_chatbot.py

SyntaxError: ignored