# Part A

**• DOMAIN:** Digital content management

**CONTEXT:** Classification is probably the most popular task that you would deal with in real life. Text in the form of blogs, posts, articles, etc.
are written every second. It is a challenge to predict the information about the writer without knowing about him/her. We are going to create a
classifier that predicts multiple features of the author of a given text. We have designed it as a Multi label classification problem.

**• DATA DESCRIPTION:** Over 600,000 posts from more than 19 thousand bloggers. The Blog Authorship Corpus consists of the collected posts of
19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or
approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and
the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age but for many, industry and/or sign is
marked as unknown.) All bloggers included in the corpus fall into one of three age groups:

• 8240 "10s" blogs (ages 13-17),

• 8086 "20s" blogs (ages 23-27) and

• 2994 "30s" blogs (ages 33-47).


 For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of
common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the
date of the following post, and links within a post are denoted by the label `urllink`.

**• PROJECT OBJECTIVE:** To build an NLP classifier which can use input text parameters to determine the label/s of the blog. Specific to this case
study, you can consider the text of the blog: ‘text’ feature as independent variable and ‘topic’ as dependent variable.



**Steps and tasks:**

# 1. Read and Analyse Dataset.


In [20]:
import warnings
warnings.filterwarnings('ignore')
import nltk
nltk.download("stopwords")
from google.colab import drive
drive.mount("/content/drive")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [21]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from zipfile import ZipFile

file_name = "/content/drive/My Drive/Colab Notebooks/blogs.zip"
  
with ZipFile(file_name, 'r') as zip:
    zip.printdir()
    print('Extracting all the files now...')
    zip.extractall()
    print('Done!')


File Name                                             Modified             Size
blogtext.csv                                   2019-09-20 22:33:20    800419647
Extracting all the files now...
Done!


In [22]:
blog_data = pd.read_csv('/content/blogtext.csv')

In [23]:
print("Dimensions of data : ", blog_data.shape)

Dimensions of data :  (681284, 7)


In [24]:
#selecting subset of the data due to memory issues and notebook crashing
blog_data = pd.read_csv('/content/blogtext.csv',nrows = 10000,index_col=False) 
blog_data.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [25]:
blog_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      10000 non-null  int64 
 1   gender  10000 non-null  object
 2   age     10000 non-null  int64 
 3   topic   10000 non-null  object
 4   sign    10000 non-null  object
 5   date    10000 non-null  object
 6   text    10000 non-null  object
dtypes: int64(2), object(5)
memory usage: 547.0+ KB


In [26]:
blog_data.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

In [27]:
blog_data['topic'].value_counts()

indUnk                     3287
Technology                 2654
Fashion                    1622
Student                    1137
Education                   270
Marketing                   156
Engineering                 127
Internet                    118
Communications-Media         99
BusinessServices             91
Sports-Recreation            80
Non-Profit                   71
InvestmentBanking            70
Science                      63
Arts                         45
Consulting                   21
Museums-Libraries            17
Banking                      16
Automotive                   14
Law                          11
LawEnforcement-Security      10
Religion                      9
Accounting                    4
Publishing                    4
HumanResources                2
Telecommunications            2
Name: topic, dtype: int64

In [28]:
blog_data['gender'].value_counts()

male      5916
female    4084
Name: gender, dtype: int64

**A. Clearly write outcome of data analysis**

- The dataset is large: 681,284 records and 7 attributes.

- No null values are present in the 10,000-row subset examined.

- The `id` and `date` columns can be dropped since they are not used for classification.

- Data types can be changed as required, e.g. the remaining integer column, `age`, from int to object.

- Since the dataset is large, we work with a subset of the first 10,000 rows.
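
Reading only the first rows with `nrows` keeps consecutive posts by the same bloggers, so the subset may not reflect the topic mix of the full corpus. A minimal sketch of one alternative, assuming a uniform random sample is wanted: reservoir sampling over the CSV using only the standard library. The name `sample_rows`, the seed and the in-memory demo file are illustrative and not part of the notebook.

```python
import csv
import io
import random

def sample_rows(lines, k, seed=0):
    """Return the header and k rows drawn uniformly at random (reservoir sampling)."""
    rng = random.Random(seed)
    reader = csv.reader(lines)
    header = next(reader)
    reservoir = []
    for i, row in enumerate(reader):
        if i < k:
            reservoir.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)          # inclusive on both ends
            if j < k:
                reservoir[j] = row         # replace with probability k / (i + 1)
    return header, reservoir

# Tiny in-memory stand-in for the real CSV (hypothetical rows).
demo = io.StringIO("id,topic,text\n" + "".join(f"{i},Student,post {i}\n" for i in range(100)))
header, rows = sample_rows(demo, k=10)
print(header, len(rows))
```

With the real file, the same function could be fed an open file handle, for example `open('/content/blogtext.csv', newline='', encoding='utf-8')`; the encoding is an assumption.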



Dropping date and ID column

In [29]:
blog_data.drop(labels=['id','date'], axis=1,inplace=True)

In [30]:
blog_data['age']=blog_data['age'].astype('object') #changing dtype to object for age column

**B. Clean the Structured Data**


i. Missing value analysis and imputation

In [31]:
print('Missing/Null values:',blog_data.isnull().sum())

Missing/Null values: gender    0
age       0
topic     0
sign      0
text      0
dtype: int64


In [32]:
pip install langdetect

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993243 sha256=a24e70b87ce7e29a5bca1ad794ef8e123df8f20506c8eac59d16e4c154dbf1b1
  Stored in directory: /root/.cache/pip/wheels/d1/c1/d9/7e068de779d863bc8f8fc9467d85e25cfe47fa5051fff1a1bb
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


ii. Eliminate Non-English textual data.

In [33]:
from langdetect import detect_langs

for text in blog_data['text']:
    try:
        lang = detect_langs(text)[0].lang
        if lang == 'en':
      
            pass
        else:
        
            blog_data['text'].remove(text)
    except:
        pass

In [34]:
blog_data.shape

(10000, 5)
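
The shape is unchanged because the loop above never drops a row: a pandas `Series` has no `remove` method, and the bare `except` silently swallows the resulting `AttributeError`. Below is a sketch of a filter that does mark rows for removal. It assumes a detector callable that returns a language code such as `'en'` (for example `langdetect.detect`); the helper `english_mask` and the stand-in detector `fake_detect` are hypothetical.

```python
def english_mask(texts, detect):
    """Return one boolean per text: True when `detect` reports English."""
    mask = []
    for text in texts:
        try:
            mask.append(detect(text) == 'en')
        except Exception:
            # Detectors can fail on empty or symbol-only strings;
            # this sketch treats such rows as non-English.
            mask.append(False)
    return mask

def fake_detect(text):
    # Stand-in detector for illustration only, not a real language model.
    if not text.strip():
        raise ValueError("no features in text")
    return 'nl' if ' het ' in f' {text.lower()} ' else 'en'

posts = ["testing!!! testing!!!", "In het kader van kernfusie op aarde", "   "]
mask = english_mask(posts, fake_detect)
print(mask)  # [True, False, False]
```

On the notebook's frame this could be applied as `blog_data = blog_data[english_mask(blog_data['text'], detect)].reset_index(drop=True)` after `from langdetect import detect`. Whether undetectable rows should be kept or dropped is a design choice.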

# 2. Preprocess unstructured data to make it consumable for model training.

A. Eliminate All special Characters and Numbers 

In [35]:
import re
blog_data['clean_text'] = blog_data['text'].apply(lambda x: re.sub(r'[^A-Za-z]+',' ',x))

In [36]:
blog_data.head()

Unnamed: 0,gender,age,topic,sign,text,clean_text
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",Info has been found pages and MB of pdf files...
1,male,15,Student,Leo,These are the team members: Drewe...,These are the team members Drewes van der Laa...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,In het kader van kernfusie op aarde MAAK JE E...
3,male,15,Student,Leo,testing!!! testing!!!,testing testing
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,Thanks to Yahoo s Toolbar I can now capture t...


B. Lowercase all textual data

In [37]:
blog_data['clean_text'] = blog_data['clean_text'].apply(lambda x: x.lower())
blog_data.head()

Unnamed: 0,gender,age,topic,sign,text,clean_text
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",info has been found pages and mb of pdf files...
1,male,15,Student,Leo,These are the team members: Drewe...,these are the team members drewes van der laa...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,in het kader van kernfusie op aarde maak je e...
3,male,15,Student,Leo,testing!!! testing!!!,testing testing
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks to yahoo s toolbar i can now capture t...


C. Remove all Stopwords

In [38]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # separate name so the nltk module is not shadowed on re-runs

blog_data['clean_text']=blog_data['clean_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
blog_data.head()

Unnamed: 0,gender,age,topic,sign,text,clean_text
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",info found pages mb pdf files wait untill team...
1,male,15,Student,Leo,These are the team members: Drewe...,team members drewes van der laag urllink mail ...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,het kader van kernfusie op aarde maak je eigen...
3,male,15,Student,Leo,testing!!! testing!!!,testing testing
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks yahoo toolbar capture urls popups means...


D. Remove all extra white spaces

In [39]:
blog_data['clean_text']=blog_data['clean_text'].apply(lambda x: x.strip())
blog_data.head()

Unnamed: 0,gender,age,topic,sign,text,clean_text
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",info found pages mb pdf files wait untill team...
1,male,15,Student,Leo,These are the team members: Drewe...,team members drewes van der laag urllink mail ...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,het kader van kernfusie op aarde maak je eigen...
3,male,15,Student,Leo,testing!!! testing!!!,testing testing
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks yahoo toolbar capture urls popups means...
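
Steps A to D can be folded into one function so that exactly the same cleaning is applied at training and at prediction time. A minimal sketch: the name `clean` and the tiny stop list are illustrative, and in the notebook the stop list would be NLTK's English stopwords.

```python
import re

def clean(text, stop_words):
    """Keep letters only, lowercase, drop stopwords, collapse whitespace."""
    letters_only = re.sub(r'[^A-Za-z]+', ' ', text)
    tokens = letters_only.lower().split()   # split() also collapses repeated spaces
    return ' '.join(t for t in tokens if t not in stop_words)

demo_stop_words = {'to', 'i', 'can', 's', 'now'}   # tiny illustrative list
cleaned = clean("Thanks to Yahoo!'s Toolbar I can now capture URLs", demo_stop_words)
print(cleaned)  # thanks yahoo toolbar capture urls
```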


# 3. Build a base Classification model

A. Create dependent and independent variables

In [40]:
data = blog_data[['clean_text','topic']].copy()  # copy so the column added below is not written to a view of blog_data
data.head()

Unnamed: 0,clean_text,topic
0,info found pages mb pdf files wait untill team...,Student
1,team members drewes van der laag urllink mail ...,Student
2,het kader van kernfusie op aarde maak je eigen...,Student
3,testing testing,Student
4,thanks yahoo toolbar capture urls popups means...,InvestmentBanking


In [41]:
data['CategoryId'] = data['topic'].factorize()[0]
data.head()

Unnamed: 0,clean_text,topic,CategoryId
0,info found pages mb pdf files wait untill team...,Student,0
1,team members drewes van der laag urllink mail ...,Student,0
2,het kader van kernfusie op aarde maak je eigen...,Student,0
3,testing testing,Student,0
4,thanks yahoo toolbar capture urls popups means...,InvestmentBanking,1


In [42]:
x = data['clean_text']
y = data['CategoryId']

In [43]:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit_transform(y)

array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]])

In [44]:
x.head()

0    info found pages mb pdf files wait untill team...
1    team members drewes van der laag urllink mail ...
2    het kader van kernfusie op aarde maak je eigen...
3                                      testing testing
4    thanks yahoo toolbar capture urls popups means...
Name: clean_text, dtype: object

B. Split data into train and test.

In [45]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.2) #splitting into 80(train) and 20(test)

C. Vectorize data using any one vectorizer

In [46]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}', 
                      ngram_range=(1, 3), stop_words = 'english')

corpus = list(X_train)+list(X_test)

In [47]:
count_vect.fit(corpus)

vectXtrain = count_vect.transform(X_train)
vectXtest = count_vect.transform(X_test)

In [48]:
count_vect.get_feature_names_out()[:10]

array(['aa', 'aa amazing', 'aa amazing things', 'aa anger',
       'aa anger management', 'aa compared', 'aa compared tougher',
       'aa keeps', 'aa keeps saying', 'aa nice'], dtype=object)

In [49]:
label_counts = {}

for label in data.topic:
    if label in label_counts:
        label_counts[label] += 1
    else:
        label_counts[label] = 1

label_counts

{'Student': 1137,
 'InvestmentBanking': 70,
 'indUnk': 3287,
 'Non-Profit': 71,
 'Banking': 16,
 'Education': 270,
 'Engineering': 127,
 'Science': 63,
 'Communications-Media': 99,
 'BusinessServices': 91,
 'Sports-Recreation': 80,
 'Arts': 45,
 'Internet': 118,
 'Museums-Libraries': 17,
 'Accounting': 4,
 'Technology': 2654,
 'Law': 11,
 'Consulting': 21,
 'Automotive': 14,
 'Religion': 9,
 'Fashion': 1622,
 'Publishing': 4,
 'Marketing': 156,
 'LawEnforcement-Security': 10,
 'HumanResources': 2,
 'Telecommunications': 2}

D. Build a base model for Supervised Learning - Classification.

In [50]:
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
rfc.fit(vectXtrain, y_train)
pred = rfc.predict(vectXtest)

E. Clearly print Performance Metrics. 

In [51]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support as score

precision, recall, f1score, support = score(y_test, pred, average='micro')

print('Accuracy score: ', accuracy_score(y_test, pred))
print('Precision score:', precision)
print('F1 score: ', f1score)
print('Recall score: ',recall )

Accuracy score:  0.5015
Precision score: 0.5015
F1 score:  0.5015
Recall score:  0.5015
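
All four numbers are identical because, for single-label multiclass predictions, micro-averaged precision, recall and F1 all reduce to accuracy: every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class). With classes as imbalanced as these, a macro average is more revealing. A pure-Python sketch on made-up labels (the data is illustrative only):

```python
from collections import Counter

def micro_macro(y_true, y_pred):
    """Return accuracy, micro precision, micro recall and macro F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # wrong prediction: false positive for the predicted class
            fn[t] += 1   # and false negative for the true class
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_p = TP / (TP + FP)
    micro_r = TP / (TP + FN)
    f1s = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1s.append(2 * tp[c] / denom if denom else 0.0)
    macro_f1 = sum(f1s) / len(f1s)
    accuracy = TP / len(y_true)
    return accuracy, micro_p, micro_r, macro_f1

y_true = ['indUnk'] * 6 + ['Law'] * 2 + ['Arts'] * 2
y_pred = ['indUnk'] * 10            # always predict the majority class
acc, mp, mr, mf1 = micro_macro(y_true, y_pred)
print(acc, mp, mr, round(mf1, 3))   # 0.6 0.6 0.6 0.25
```

In scikit-learn the macro figures come from passing `average='macro'` to `precision_recall_fscore_support`, and `classification_report` lists the per-class values.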


# 4. Improve Performance of model.

A. Experiment with other vectorisers. 

In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf = TfidfVectorizer(min_df=3,  max_features=None, 
             strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
             ngram_range=(1, 3), use_idf=True,smooth_idf=True,sublinear_tf=True,  # booleans, as recent scikit-learn versions validate parameter types
             stop_words = 'english')



tf_idf.fit(list(X_train) + list(X_test))
Xtrain_tf =  tf_idf.transform(X_train) 
Xtest_tf = tf_idf.transform(X_test)

B. Build classifier Models using other algorithms than base model

In [53]:
from sklearn.linear_model import LogisticRegression

lrmodel=LogisticRegression(solver='lbfgs')

lrmodel.fit(Xtrain_tf, y_train)

lr_pred = lrmodel.predict(Xtest_tf) 

In [54]:
precision, recall, f1score, support = score(y_test, lr_pred, average='micro')
print('Accuracy score: ', accuracy_score(y_test, lr_pred))
print('Precision score:', precision)
print('F1 score: ', f1score)
print('Recall score: ',recall )

Accuracy score:  0.627
Precision score: 0.627
F1 score:  0.627
Recall score:  0.627


C. Tune Parameters/Hyperparameters of the model/s.


In [55]:
from sklearn.model_selection import RandomizedSearchCV


model = LogisticRegression()
solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l1','l2']
c_values = [100, 10, 1.0, 0.1]
# define the search space; RandomizedSearchCV samples 10 combinations from it by default
grid = dict(solver=solvers,penalty=penalty,C=c_values)
random_search = RandomizedSearchCV(model, grid, scoring='accuracy')
grid_result = random_search.fit(Xtrain_tf, y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.659750 using {'solver': 'lbfgs', 'penalty': 'l2', 'C': 10}
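
Not every solver/penalty pair in this search space is valid: in scikit-learn's `LogisticRegression`, `newton-cg` and `lbfgs` support only the `l2` penalty (or none), while `liblinear` supports both `l1` and `l2`. Sampled combinations that are invalid fail to fit, which goes unnoticed here because warnings are suppressed. A sketch of a search space restricted to valid pairs, relying on the fact that `RandomizedSearchCV` accepts a list of dicts for `param_distributions`; the helper `expand` exists only to show what the space contains.

```python
from itertools import product

c_values = [100, 10, 1.0, 0.1]

# One dict per group of solvers that share the same supported penalties.
search_space = [
    {'solver': ['newton-cg', 'lbfgs'], 'penalty': ['l2'], 'C': c_values},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': c_values},
]

def expand(space):
    """List every concrete (solver, penalty, C) combination in the space."""
    combos = []
    for group in space:
        combos.extend(product(group['solver'], group['penalty'], group['C']))
    return combos

combos = expand(search_space)
print(len(combos))  # 16
```

It would be passed in place of `grid`, e.g. `RandomizedSearchCV(LogisticRegression(), search_space, scoring='accuracy')`.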


In [56]:
lrmodel2=LogisticRegression(solver='lbfgs', penalty = 'l2', C = 100)

lrmodel2.fit(Xtrain_tf, y_train)

lr_pred2 = lrmodel2.predict(Xtest_tf) 

D. Clearly print Performance Metrics.
Hint: Accuracy, Precision, Recall, ROC-AUC

In [57]:
precision, recall, f1score, support = score(y_test, lr_pred2, average='micro')
print('Accuracy score: ', accuracy_score(y_test, lr_pred2))
print('Precision score:', precision)
print('F1 score: ', f1score)
print('Recall score: ',recall )

Accuracy score:  0.6885
Precision score: 0.6885
F1 score:  0.6885
Recall score:  0.6885


# 5. Share insights on relative performance comparison.

**A. Which vectorizer performed better? Probable reason?**

Answer: The TF-IDF vectorizer performed better in the runs above: accuracy improved from 50.15% (count vectorizer) to 62.7%, and to 68.85% after hyperparameter tuning. A probable reason is that, unlike the count vectorizer, TF-IDF weights each term by how informative it is across the corpus rather than by raw count alone, so very common terms contribute less. In addition, `min_df=3` drops n-grams that occur in fewer than three documents, which reduces the input dimensionality and gives a less complex model than the one built on the count vectorizer. Note that the classifier also changed between the two runs (random forest vs. logistic regression), so the gain cannot be attributed to the vectorizer alone.
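
To make the difference from raw counts concrete, here is a pure-Python sketch of the weighting selected by `sublinear_tf` and `smooth_idf` in the TF-IDF cell, before scikit-learn's L2 row normalisation: tf is replaced by 1 + ln(tf), and idf = ln((1 + n) / (1 + df)) + 1. The three toy documents are made up for illustration.

```python
import math
from collections import Counter

docs = [
    "urllink urllink urllink python code",
    "urllink fashion week",
    "urllink python tutorial",
]
tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in tokenised for term in set(doc))

def tfidf(doc_tokens):
    """Sublinear tf times smoothed idf for one document (no normalisation)."""
    counts = Counter(doc_tokens)
    weights = {}
    for term, tf in counts.items():
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1
        weights[term] = (1 + math.log(tf)) * idf
    return weights

w = tfidf(tokenised[0])
ratio = w['urllink'] / w['code']
# Raw counts favour 'urllink' 3:1 over 'code'; the weighted ratio is about 1.24.
print(round(ratio, 2))
```

The term that appears in every document keeps some weight, but far less of an advantage than its raw count suggests.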

**B. Which model outperformed? Probable reason?**

Answer: The logistic regression model performed better than the random forest. The change of vectorizer probably contributed, since this model was built on TF-IDF features instead of the count-vectorizer features used for the initial random forest model.

**C. Which parameter/hyperparameter significantly helped to improve performance?Probable reason?**

Answer: Performance improved with the hyperparameters 'solver': 'lbfgs', 'penalty': 'l2', 'C': 100. The improvement most likely comes from C: the solver is the same as before tuning and L2 is the default penalty for logistic regression, so the only changed value is C, 100 instead of the default 1.0 (a larger C means weaker regularization). Note that the randomized search itself reported C = 10 as best (cross-validated accuracy 0.65975), whereas the final model above was fit with C = 100.

**D. According to you, which performance metric should be given most importance, why?.**

Answer: The choice of metric should depend on the type of problem and data. Accuracy is a reasonable default for classification problems, but with imbalanced classes, as here, precision, recall, F1 score or ROC-AUC deserve more weight. For regression problems, MAE, MSE or RMSE are the usual choices.

*****************************************

# Part B

**• DOMAIN:** Customer support

**• CONTEXT:** Great Learning has an academic support department which receives numerous support requests every day throughout the year.
Teams are spread across geographies and try to provide support round the year. Sometimes, due to heavy
workload, certain request resolutions are delayed, impacting the company’s business. Some of the requests are very generic, where a proper
resolution procedure delivered to the user can solve the problem. The company is looking to design an automation which can interact with
the user, understand the problem and display the resolution procedure [if found as a generic request] or redirect the request to an actual human
support executive if the request is complex or not in its database.

**• DATA DESCRIPTION:** A sample corpus is attached for your reference. Please enhance/add more data to the corpus using your linguistics skills.


**• PROJECT OBJECTIVE:** Design a python based interactive semi - rule based chatbot which can do the following:
1. Start chat session with greetings and ask what the user is looking for. 
2. Accept dynamic text based questions from the user. Reply back with relevant answer from the designed corpus.
3. End the chat session only if the user requests to end else ask what the user is looking for. Loop continues till the user asks to end it.

In [58]:
import json
from google.colab import drive
drive.mount("/content/drive")
# use a context manager so the file handle is closed after loading
with open('/content/drive/My Drive/Colab Notebooks/GL Bot.json') as f:
    data = json.load(f)


#Display corpus
print(data)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
{'intents': [{'tag': 'Intro', 'patterns': ['hi', 'how are you', 'is anyone there', 'hello', 'whats up', 'hey', 'yo', 'listen', 'please help me', 'i am learner from', 'i belong to', 'aiml batch', 'aifl batch', 'i am from', 'my pm is', 'blended', 'online', 'i am from', 'hey ya', 'talking to you for first time'], 'responses': ['Hello! how can i help you ?'], 'context_set': ''}, {'tag': 'Exit', 'patterns': ['thank you', 'thanks', 'cya', 'see you', 'later', 'see you later', 'goodbye', 'i am leaving', 'have a Good day', 'you helped me', 'thanks a lot', 'thanks a ton', 'you are the best', 'great help', 'too good', 'you are a good learning buddy'], 'responses': ['I hope I was able to assist you, Good Bye'], 'context_set': ''}, {'tag': 'Olympus', 'patterns': ['olympus', 'explain me how olympus works', 'I am not able to understand olympus', 'olympus window not working'

In [59]:

!pip install -q --upgrade ipython
!pip install -q --upgrade ipykernel
     

!pip install nltk --quiet
     

!pip install scikit-multilearn

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires ipython~=7.34.0, but you have ipython 8.12.0 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires ipykernel~=5.5.6, but you have ipykernel 6.22.0 which is incompatible.
google-colab 1.0.0 requires ipython~=7.34.0, but you have ipyth

In [60]:

#Importing all the necessary libraries
import os
import json 
import string
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from skmultilearn.problem_transform import ClassifierChain
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

import re
import nltk
nltk.download('stopwords')  #downloading stopwords
nltk.download("punkt")  #downloading sentence tokenizer
nltk.download("wordnet") #downloading english dictionary corpus
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from nltk.stem.snowball import SnowballStemmer

import tensorflow as tf 
from tensorflow.keras import Sequential 
from tensorflow.keras.layers import Dense, Dropout

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [61]:
# initializing lemmatizer to get stem of words
lemmatizer = WordNetLemmatizer()
# Each list to create
words = []
classes = []
doc_X = []
doc_y = []
# Loop through all the intents
# tokenize each pattern and append tokens to words, the patterns and
# the associated tag to their associated list
for intent in data["intents"]:
    for pattern in intent["patterns"]:
        tokens = nltk.word_tokenize(pattern)
        words.extend(tokens)
        doc_X.append(pattern)
        doc_y.append(intent["tag"])
    
    # add the tag to the classes if it's not there already 
    if intent["tag"] not in classes:
        classes.append(intent["tag"])
# lemmatize all the words in the vocab and convert them to lowercase
# if the words don't appear in punctuation
words = [lemmatizer.lemmatize(word.lower()) for word in words if word not in string.punctuation]
# sort the vocab and classes alphabetically; set() ensures no duplicates occur
words = sorted(set(words))
classes = sorted(set(classes))

In [62]:
print("Words:\n", words)

Words:
 ['a', 'able', 'access', 'activation', 'ada', 'adam', 'aifl', 'aiml', 'am', 'an', 'ann', 'anyone', 'are', 'artificial', 'backward', 'bad', 'bagging', 'batch', 'bayes', 'belong', 'best', 'blended', 'bloody', 'boosting', 'bot', 'buddy', 'classification', 'contact', 'create', 'cross', 'cya', 'day', 'deep', 'did', 'diffult', 'do', 'ensemble', 'epoch', 'explain', 'first', 'for', 'forest', 'forward', 'from', 'function', 'good', 'goodbye', 'gradient', 'great', 'hate', 'have', 'hell', 'hello', 'help', 'helped', 'hey', 'hi', 'hidden', 'hour', 'how', 'hyper', 'i', 'imputer', 'in', 'intelligence', 'is', 'jerk', 'joke', 'knn', 'later', 'layer', 'learner', 'learning', 'leaving', 'link', 'listen', 'logistic', 'lot', 'machine', 'me', 'ml', 'my', 'naive', 'name', 'nb', 'net', 'network', 'neural', 'no', 'not', 'of', 'olympus', 'olypus', 'on', 'online', 'operation', 'opertions', 'otimizer', 'parameter', 'piece', 'please', 'pm', 'problem', 'propagation', 'random', 'regression', 'relu', 'screw', 's

In [63]:
print("Labels:\n", classes)

Labels:
 ['Bot', 'Exit', 'Intro', 'NN', 'Olympus', 'Profane', 'SL', 'Ticket']


In [64]:
print(doc_X)

['hi', 'how are you', 'is anyone there', 'hello', 'whats up', 'hey', 'yo', 'listen', 'please help me', 'i am learner from', 'i belong to', 'aiml batch', 'aifl batch', 'i am from', 'my pm is', 'blended', 'online', 'i am from', 'hey ya', 'talking to you for first time', 'thank you', 'thanks', 'cya', 'see you', 'later', 'see you later', 'goodbye', 'i am leaving', 'have a Good day', 'you helped me', 'thanks a lot', 'thanks a ton', 'you are the best', 'great help', 'too good', 'you are a good learning buddy', 'olympus', 'explain me how olympus works', 'I am not able to understand olympus', 'olympus window not working', 'no access to olympus', 'unable to see link in olympus', 'no link visible on olympus', 'whom to contact for olympus', 'lot of problem with olympus', 'olypus is not a good tool', 'lot of problems with olympus', 'how to use olympus', 'teach me olympus', 'i am not able to understand svm', 'explain me how machine learning works', 'i am not able to understand naive bayes', 'i am n

In [65]:
print(doc_y)

['Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'Bot', 'Bot', 'Bot', 'Bot', 'Bot', 'Bot', 'Bot', 'Bot', 'Profane', 'Profane', 'Profane', 'Profane', 'Profane', 'Profane', 'Profane', 'Profane', 'Profane', 'Ticket', 'Ticket', 'Ticket', 'Ticket', 'Ticket', 'Ticket', 'Ticket', 'Tick

In [66]:
# list for training data
training = []
out_empty = [0] * len(classes)
# creating the bag of words model
for idx, doc in enumerate(doc_X):
    bow = []
    text = lemmatizer.lemmatize(doc.lower())
    for word in words:
        bow.append(1) if word in text else bow.append(0)
    # mark the index of class that the current pattern is associated
    # to
    output_row = list(out_empty)
    output_row[classes.index(doc_y[idx])] = 1
    # add the one hot encoded BoW and associated classes to training 
    training.append([bow, output_row])
# shuffle the data and convert it to an array
random.shuffle(training)
training = np.array(training, dtype=object)
# split the features and target labels
train_X = np.array(list(training[:, 0]))
train_y = np.array(list(training[:, 1]))
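
Note that `word in text` in the cell above is a substring test on the whole pattern, so a short vocabulary entry such as `'a'` is switched on by any pattern that merely contains that letter. The `bag_of_words` helper used at prediction time, defined further down, compares whole tokens instead. A sketch of a token-based encoder that both stages could share: it splits on whitespace for simplicity, where the notebook would use `nltk.word_tokenize` plus the lemmatizer, and `encode` is a hypothetical name.

```python
def encode(text, vocab):
    """Binary bag of words: 1 where a vocabulary word equals a token of the text."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocab]

vocab = ['a', 'i', 'olympus', 'thanks']
pattern = 'thanks'

# Substring test, as in the training cell: 'a' matches inside 'thanks'.
substring_bow = [1 if word in pattern else 0 for word in vocab]
# Token test, as in bag_of_words: only the whole word matches.
token_bow = encode(pattern, vocab)

print(substring_bow)  # [1, 0, 0, 1]
print(token_bow)      # [0, 0, 0, 1]
```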

In [67]:
# defining some parameters
input_shape = (len(train_X[0]),)
output_shape = len(train_y[0])
epochs = 200

#Clear any existing model in memory
tf.keras.backend.clear_session()

# the deep learning model

#Initialize model
model = Sequential()

# Input layer 
model.add(Dense(128, input_shape=input_shape, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(64, activation="relu"))
model.add(Dropout(0.3))

#Output layer
model.add(Dense(output_shape, activation = "softmax"))

#Defining optimizer
adam = tf.keras.optimizers.legacy.Adam(learning_rate=0.01, decay=1e-6)
#Configuring the model for training
model.compile(loss='categorical_crossentropy',
              optimizer=adam,
              metrics=["accuracy"])

# Model summary
print(model.summary())

# Training the model
model.fit(x=train_X, y=train_y, epochs=epochs, verbose=1)


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 128)               20352     
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_2 (Dense)             (None, 8)                 520       
                                                                 
Total params: 29,128
Trainable params: 29,128
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200

<keras.callbacks.History at 0x7fa4baf98c10>

In [68]:
#Functions for Chatbot Sessions
def clean_text(text): 
  tokens = nltk.word_tokenize(text)
  # lowercase before lemmatizing so tokens match the lowercase vocabulary
  tokens = [lemmatizer.lemmatize(word.lower()) for word in tokens]
  return tokens

def bag_of_words(text, vocab): 
  tokens = clean_text(text)
  bow = [0] * len(vocab)
  for w in tokens: 
    for idx, word in enumerate(vocab):
      if word == w: 
        bow[idx] = 1
  return np.array(bow)

def pred_class(text, vocab, labels): 
  bow = bag_of_words(text, vocab)
  result = model.predict(np.array([bow]))[0]
  thresh = 0.2
  y_pred = [[idx, res] for idx, res in enumerate(result) if res > thresh]

  y_pred.sort(key=lambda x: x[1], reverse=True)
  return_list = []
  for r in y_pred:
    return_list.append(labels[r[0]])
  return return_list

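def get_response(intents_list, intents_json): 
  # fallback when no intent clears the threshold or the tag is missing from the corpus
  result = "Sorry, I did not understand that. Please rephrase your question."
  if not intents_list:
    return result
  tag = intents_list[0]
  list_of_intents = intents_json["intents"]
  for i in list_of_intents: 
    if i["tag"] == tag:
      result = random.choice(i["responses"])
      break
  return result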

In [69]:
# Running the chatbot
print("BOT : Chat with the bot[Type 'quit' to stop] !")
print("\nBOT : If answer is not  right[Type '*'] !")
while True:
  #Reading Input
  message = input("\n\nYou: ")
  #Correcting chat
  if message.strip() == "*":
    print("\nBOT:Please rephrase your question and try again")
    continue  # do not pass '*' on to the classifier
  #Stopping Chat
  if message.lower() == "quit":
    break
  #Predicting and printing response  
  intents = pred_class(message, words, classes)
  result = get_response(intents, data)
  print("\nBOT : ", result)

BOT : Chat with the bot[Type 'quit' to stop] !

BOT : If answer is not  right[Type '*'] !


You: hi

BOT :  Hello! how can i help you ?


You: can you help me to access olympus

BOT :  Link: Olympus wiki


You: what is neural network

BOT :  Link: Neural Nets wiki


You: see you

BOT :  I hope I was able to assist you, Good Bye


You: bye

BOT :  Hello! how can i help you ?


You: quit
