# NLP Project

# PART A

## Domain:
Digital content management

## Context:
Classification is probably the most popular task that you would deal with in real life. Text in the form of blogs, posts, articles, etc.
are written every second. It is a challenge to predict the information about the writer without knowing about him/her. We are going to create a
classifier that predicts multiple features of the author of a given text. We have designed it as a Multi label classification problem.

## Data Description:
Over 600,000 posts from more than 19 thousand bloggers The Blog Authorship Corpus consists of the collected posts of
19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or
approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and
the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age but for many, industry and/or sign is
marked as unknown.) All bloggers included in the corpus fall into one of three age groups:
• 8240 "10s" blogs (ages 13-17),
• 8086 "20s" blogs(ages 23-27) and
• 2994 "30s" blogs (ages 33-47)
• For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of
common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the
date of the following post and links within a post are denoted by the label url link.

## Project Objective:
To build a NLP classifier which can use input text parameters to determine the label/s of the blog. Specific to this case
study, you can consider the text of the blog: ‘text’ feature as independent variable and ‘topic’ as dependent variable.

# Digital content management

## 1. Read and Analyse Dataset.

### A. Clearly write outcome of data analysis

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
cd "/content/drive/MyDrive/AIML"

/content/drive/MyDrive/AIML


In [3]:
project_path = '/content/drive/MyDrive/AIML/'

**Importing the Libraries**

In [4]:
!pip install langdetect
import pandas as pd
import numpy as np 
import re
from nltk.corpus import stopwords
from langdetect import detect
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from nltk.stem.snowball import SnowballStemmer
import warnings
warnings.filterwarnings('ignore')
import pandas_profiling as pp
import seaborn as sns
import matplotlib as plt
%matplotlib inline

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
[K     |████████████████████████████████| 981 kB 5.4 MB/s 
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... [?25l[?25hdone
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993242 sha256=f71bbe8ef7473c6ff0db0ac3efa72b34cd1b3d6a7006a4e304f69645bf7cbaa5
  Stored in directory: /root/.cache/pip/wheels/c5/96/8a/f90c59ed25d75e50a8c10a1b1c2d4c402e4dacfa87f3aff36a
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


**Extract the contents of zip file**

In [5]:
from zipfile import ZipFile

# specifying the zip file name
file_name = project_path + "blogs.zip"
  
# opening the zip file in READ mode
with ZipFile(file_name, 'r') as zip:
    # printing all the contents of the zip file
    # zip.printdir()
  
    # extracting all the files
    print('Extracting all the files now...')
    zip.extractall()
    print('Done!')

Extracting all the files now...
Done!


**Read the csv using pandas**

In [6]:
filepath = project_path + "blogtext.csv"

In [7]:
blog_df = pd.read_csv(filepath)

**Get the names of the columns**

In [8]:
blog_df.columns

Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')

In [9]:
# check the shape of the data frame by using the shape attribute of the data frame
blog_df.shape

(681284, 7)

In [10]:
#check if the data frame is properly loaded using the sample() method
blog_df.sample(5)

Unnamed: 0,id,gender,age,topic,sign,date,text
74649,3105869,female,23,Student,Capricorn,"12,June,2004",urlLink Somewhere in Vatican&nbsp; ...
411179,1461401,female,16,Student,Scorpio,"20,October,2003",Today was another bad day. There ...
380520,1417798,female,35,indUnk,Scorpio,"22,September,2003",Hey Gals! I feel SOOOOOOO ter...
239898,449628,male,34,indUnk,Aries,"07,February,2004",2004 Reading Jamboree Keisha a...
324273,3172762,male,16,Student,Scorpio,"22,June,2004",urlLink flowers&nbsp; urlLink


In [11]:
#Tip: As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly.
#Limiting the data and using fewer rows as the data size is large
#blog_df = blog_df.head(10000)

#blog_df = pd.read_csv(filepath,nrows=100000)
blog_df = pd.read_csv(filepath,nrows=3000)

In [12]:
blog_df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [13]:
blog_df.tail()

Unnamed: 0,id,gender,age,topic,sign,date,text
2995,589736,male,35,Technology,Aries,"05,August,2004",but that zoo exhibit thing was much...
2996,589736,male,35,Technology,Aries,"05,August,2004",my fave song for the day: Aimee Man...
2997,589736,male,35,Technology,Aries,"05,August,2004",urlLink America's Best Zoo Exhibit...
2998,589736,male,35,Technology,Aries,"05,August,2004",'The less one makes declaritive sta...
2999,589736,male,35,Technology,Aries,"05,August,2004",While his status as a media persona...


In [14]:
blog_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      3000 non-null   int64 
 1   gender  3000 non-null   object
 2   age     3000 non-null   int64 
 3   topic   3000 non-null   object
 4   sign    3000 non-null   object
 5   date    3000 non-null   object
 6   text    3000 non-null   object
dtypes: int64(2), object(5)
memory usage: 164.2+ KB


In [15]:
blog_df.gender.value_counts()

male      2272
female     728
Name: gender, dtype: int64

In [16]:
blog_df.topic.value_counts()

Technology              1607
indUnk                   452
Student                  403
Engineering              119
Education                118
Sports-Recreation         75
InvestmentBanking         70
Non-Profit                46
Science                   33
BusinessServices          21
Internet                  20
Banking                   16
Communications-Media      14
Arts                       2
Museums-Libraries          2
Accounting                 2
Name: topic, dtype: int64

### B. Clean the Structured Data

**i. Missing value analysis and imputation.**

In [17]:
#chceck for na values
blog_df.isna().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

In [18]:
#chceck for null values
blog_df.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

**ii. Eliminate Non-English textual data.**

In [19]:
def detect_english(text):
  try:
    return detect(text) == 'en'
  except:
    return False

In [20]:
blog_df = blog_df[blog_df['text'].apply(detect_english)]


In [21]:
blog_df.shape

(2820, 7)

In [22]:
blog_df.gender.value_counts()

male      2113
female     707
Name: gender, dtype: int64

In [23]:
blog_df

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...
...,...,...,...,...,...,...,...
2994,589736,male,35,Technology,Aries,"05,August,2004","hey, how is everyone doing? i want..."
2995,589736,male,35,Technology,Aries,"05,August,2004",but that zoo exhibit thing was much...
2996,589736,male,35,Technology,Aries,"05,August,2004",my fave song for the day: Aimee Man...
2998,589736,male,35,Technology,Aries,"05,August,2004",'The less one makes declaritive sta...


## 2. Preprocess unstructured data to make it consumable for model training.

### A. Eliminate All special Characters and Numbers

In [24]:
# Select only alphabets
import re
blog_df.text = blog_df.text.apply(lambda x: re.sub('[^A-Za-z]+', ' ', x))

### B. Lowercase all textual data

In [25]:
blog_df.text = blog_df.text.apply(lambda x: x.lower())

### C. Remove all Stopwords

In [26]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords=set(stopwords.words('english'))
blog_df.text = blog_df.text.apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords]))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### D. Remove all extra white spaces

In [27]:
blog_df.text = blog_df.text.apply(lambda s: s.strip())

In [28]:
blog_df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004",info found pages mb pdf files wait untill team...
1,2059027,male,15,Student,Leo,"13,May,2004",team members drewes van der laag urllink mail ...
2,2059027,male,15,Student,Leo,"12,May,2004",het kader van kernfusie op aarde maak je eigen...
3,2059027,male,15,Student,Leo,"12,May,2004",testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks yahoo toolbar capture urls popups means...


In [29]:
blog_df.reset_index(inplace= True, drop= True)

In [30]:
blog_df

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004",info found pages mb pdf files wait untill team...
1,2059027,male,15,Student,Leo,"13,May,2004",team members drewes van der laag urllink mail ...
2,2059027,male,15,Student,Leo,"12,May,2004",het kader van kernfusie op aarde maak je eigen...
3,2059027,male,15,Student,Leo,"12,May,2004",testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks yahoo toolbar capture urls popups means...
...,...,...,...,...,...,...,...
2815,589736,male,35,Technology,Aries,"05,August,2004",hey everyone want go game still looking job st...
2816,589736,male,35,Technology,Aries,"05,August,2004",zoo exhibit thing mucho mucho funny
2817,589736,male,35,Technology,Aries,"05,August,2004",fave song day aimee mann pavlov bell album los...
2818,589736,male,35,Technology,Aries,"05,August,2004",less one makes declaritive statements less apt...


**Drop unnecessary columns**

In [31]:
# drop id and date columns
blog_df.drop(labels=['id','date'], axis=1,inplace=True)

In [32]:
blog_df.head()

Unnamed: 0,gender,age,topic,sign,text
0,male,15,Student,Leo,info found pages mb pdf files wait untill team...
1,male,15,Student,Leo,team members drewes van der laag urllink mail ...
2,male,15,Student,Leo,het kader van kernfusie op aarde maak je eigen...
3,male,15,Student,Leo,testing testing
4,male,33,InvestmentBanking,Aquarius,thanks yahoo toolbar capture urls popups means...


## 3. Build a base Classification model

### A. Create dependent and independent variables

In [33]:
blog_df.head()

Unnamed: 0,gender,age,topic,sign,text
0,male,15,Student,Leo,info found pages mb pdf files wait untill team...
1,male,15,Student,Leo,team members drewes van der laag urllink mail ...
2,male,15,Student,Leo,het kader van kernfusie op aarde maak je eigen...
3,male,15,Student,Leo,testing testing
4,male,33,InvestmentBanking,Aquarius,thanks yahoo toolbar capture urls popups means...


In [34]:
#drop  gender,age,topic & sign as they are already merged to labels column
blog_df.drop(columns=['gender','age','sign'], axis=1, inplace=True)

In [35]:
blog_df.head()

Unnamed: 0,topic,text
0,Student,info found pages mb pdf files wait untill team...
1,Student,team members drewes van der laag urllink mail ...
2,Student,het kader van kernfusie op aarde maak je eigen...
3,Student,testing testing
4,InvestmentBanking,thanks yahoo toolbar capture urls popups means...


In [36]:
X= blog_df.text

In [37]:
y = blog_df.topic

### B. Split data into train and test.

In [38]:
# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=2,test_size = 0.2)

In [39]:
print(X_train.shape)
print(y_train.shape)

(2256,)
(2256,)


In [40]:
print(X_test.shape)
print(y_test.shape)

(564,)
(564,)


In [41]:
X_test

1181    cant believe im actually making another online...
2020                ees thees say ah yes karl maggie show
2628    oh jesus pulled livejournal friend ex boyfr oh...
221     urllink church bells rang failures rows empty ...
2122    must take issue sweeping statement one cultura...
                              ...                        
503     remember name remember first time met autism c...
1413    going move tree save space sofa going write ma...
156     toaster let nature hard work nice crispiness b...
807     yeah days since updated lots stuff going thoug...
2589    times become hyper aware color wondered colors...
Name: text, Length: 564, dtype: object

In [42]:
y_test

1181       Student
2020    Technology
2628    Technology
221        Student
2122    Technology
           ...    
503      Education
1413    Technology
156     Non-Profit
807        Student
2589    Technology
Name: topic, Length: 564, dtype: object

### C. Vectorize data using any one vectorizer.

 **Create a Bag of Words using count vectorizer**

**i. Use ngram_range=(1, 2)**

**ii. Vectorize training and testing features**

In [43]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

**Have a look at some feature names**

In [44]:
vectorizer.get_feature_names()[:5]

['aa', 'aa anger', 'aa compared', 'aaa', 'aaa take']

**View term-document matrix**

In [45]:
X_train_bow.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

**Create a dictionary to get label counts**

In [46]:
label_counts=dict()

for labels in blog_df.topic.values:
    for label in labels:
        if label in label_counts:
            label_counts[str(label)]+=1
        else:
            label_counts[str(label)]=1

**Print the dictionary**

In [47]:
label_counts

{'-': 136,
 'A': 4,
 'B': 95,
 'C': 14,
 'E': 235,
 'I': 90,
 'L': 2,
 'M': 16,
 'N': 45,
 'P': 45,
 'R': 75,
 'S': 503,
 'T': 1474,
 'U': 437,
 'a': 307,
 'b': 2,
 'c': 1754,
 'd': 955,
 'e': 2537,
 'f': 45,
 'g': 1800,
 'h': 1474,
 'i': 1094,
 'k': 523,
 'l': 1474,
 'm': 100,
 'n': 3753,
 'o': 3334,
 'p': 75,
 'r': 349,
 's': 203,
 't': 1285,
 'u': 533,
 'v': 79,
 'y': 1474}

**Multi label binarizer**

**Load a multilabel binarizer and fit it on the labels.**

In [48]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(classes=sorted(label_counts.keys()))
y_train = mlb.fit_transform(y_train)
y_test = mlb.transform(y_test)

###Build a base model for Supervised Learning - Classification.

**Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label.**

In [49]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs')
clf = OneVsRestClassifier(clf)

**Fit the classifier**

In [50]:
clf.fit(X_train_bow, y_train)

OneVsRestClassifier(estimator=LogisticRegression())

**Make predictions**
**- Get predicted labels and scores**

In [51]:
predicted_labels = clf.predict(X_test_bow)
predicted_scores = clf.decision_function(X_test_bow)

**Get inverse transform for predicted labels and test labels**

In [52]:
pred_inversed = mlb.inverse_transform(predicted_labels)
y_test_inversed = mlb.inverse_transform(y_test)

**Print some samples**

In [53]:
for i in range(5):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test_bow[i],
        ','.join(y_test_inversed[i]),
        ','.join(pred_inversed[i])
    ))

Title:	  (0, 1086)	1
  (0, 1208)	1
  (0, 1223)	1
  (0, 1495)	1
  (0, 2130)	1
  (0, 4255)	1
  (0, 5081)	1
  (0, 5481)	1
  (0, 5556)	1
  (0, 5581)	1
  (0, 5972)	1
  (0, 6039)	1
  (0, 6171)	1
  (0, 6347)	1
  (0, 6605)	1
  (0, 7521)	1
  (0, 8075)	1
  (0, 8221)	1
  (0, 9785)	1
  (0, 9817)	1
  (0, 10275)	1
  (0, 10330)	1
  (0, 10499)	1
  (0, 10511)	1
  (0, 10667)	1
  :	:
  (0, 167010)	1
  (0, 168311)	1
  (0, 168424)	1
  (0, 168426)	1
  (0, 168442)	1
  (0, 168755)	1
  (0, 168775)	1
  (0, 168873)	1
  (0, 169289)	1
  (0, 169405)	1
  (0, 171134)	1
  (0, 172810)	1
  (0, 172896)	1
  (0, 172963)	1
  (0, 173421)	1
  (0, 173562)	1
  (0, 173670)	1
  (0, 173676)	1
  (0, 173738)	1
  (0, 173880)	1
  (0, 174357)	1
  (0, 174513)	1
  (0, 174650)	1
  (0, 174732)	1
  (0, 175345)	1
True labels:	S,d,e,n,t,u
Predicted labels:	S,U,d,i,k,n,t


Title:	  (0, 2473)	1
  (0, 2505)	1
  (0, 42646)	1
  (0, 78199)	1
  (0, 78201)	1
  (0, 90722)	1
  (0, 90723)	1
  (0, 130136)	1
  (0, 135786)	1
  (0, 175034)	1
True labels:	T,

### Clearly print Performance Metrics.
**- Accuracy**

**- F1-score**

**- Precision**

**- Recall**

**- ROC-AUC**

In [54]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score

def print_evaluation_scores(y_val, predicted):
    print('Accuracy score: ', accuracy_score(y_val, predicted))
    print('F1 score: ', f1_score(y_val, predicted, average='micro'))
    print('Average precision score: ', average_precision_score(y_val, predicted, average='micro'))
    print('Average recall score: ', recall_score(y_val, predicted, average='micro'))
    print('Average ROC-AUC score: ', roc_auc_score(y_val, predicted, average='micro'))

In [55]:
print('Bag-of-words')
print_evaluation_scores(y_test, predicted_labels)

Bag-of-words
Accuracy score:  0.5709219858156028
F1 score:  0.8326229697039511
Average precision score:  0.741377468362899
Average recall score:  0.79673721340388
Average ROC-AUC score:  0.8809060968361154


## 4. Improve Performance of model.

### A. Experiment with other vectorisers.

**TFIDF Vectorizer**

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer (max_features=2500, min_df=7, max_df=0.8)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

In [57]:
vectorizer

TfidfVectorizer(max_df=0.8, max_features=2500, min_df=7)

In [58]:
vectorizer.get_feature_names()[:5]

['ability', 'able', 'absolutely', 'accent', 'accept']

### B. Build classifier Models using other algorithms than base model.

**Use a linear classifier (LinearSVC is used in the following) of your choice, wrap it up in OneVsRestClassifier to train it on every label**


In [59]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score

def display_metrics_micro(Ytest, Ypred):
    print('Accuracy score: ', accuracy_score(Ytest, Ypred))
    print('F1 score: Micro', f1_score(Ytest, Ypred, average='micro'))
    print('Average precision score: Micro', average_precision_score(Ytest, Ypred, average='micro'))
    print('Average recall score: Micro', recall_score(Ytest, Ypred, average='micro'))
    print('Average ROC-AUC score: ', roc_auc_score(Ytest, Ypred, average='micro'))
    
    
def display_metrics_macro(Ytest, Ypred):
    print('Accuracy score: ', accuracy_score(Ytest, Ypred))
    print('F1 score: Macro', f1_score(Ytest, Ypred, average='macro'))
    print('Average recall score: MAcro', recall_score(Ytest, Ypred, average='macro'))
    
def display_metrics_weighted(Ytest, Ypred):
    print('Accuracy score: ', accuracy_score(Ytest, Ypred))
    print('F1 score: weighted', f1_score(Ytest, Ypred, average='weighted'))
    print('Average precision score: weighted', average_precision_score(Ytest, Ypred, average='weighted'))
    print('Average recall score: weighted', recall_score(Ytest, Ypred, average='weighted'))

In [60]:
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

def build_model_train(X_train, y_train, X_valid=None, y_valid=None, C=1.0, model='lr'):
    if model=='lr':
        model = LogisticRegression(C=C, penalty='l1', dual=False, solver='liblinear')
        model = OneVsRestClassifier(model)
        model.fit(X_train, y_train)
    
    elif model=='svm':
        model = LinearSVC(C=C, penalty='l1', dual=False, loss='squared_hinge')
        model = OneVsRestClassifier(model)
        model.fit(X_train, y_train)
    
    elif model=='nbayes':
        model = MultinomialNB(alpha=1.0)
        model = OneVsRestClassifier(model)
        model.fit(X_train, y_train)
        
    elif model=='lda':
        model = LinearDiscriminantAnalysis(solver='svd')
        model = OneVsRestClassifier(model)
        #model.fit(X_train.toarray(), y_train)

    return model

In [62]:
models = ['lr','svm','nbayes','lda']
for model in models:
    model1 = build_model_train(X_train_bow,y_train,model=model)
    if model == 'lda':
      model1.fit(X_train_bow.toarray(),y_train)
      Ypred=model1.predict(X_test_bow)
    else:
      model1.fit(X_train_bow,y_train)
      Ypred=model1.predict(X_test_bow)
    print("\n")
    print(f"**displaying  metrics for the mode {model1}\n")
    display_metrics_micro(y_test,Ypred)
    print("\n")
    print("\n")
    display_metrics_macro(y_test,Ypred)
    print("\n")
    print("\n")
    display_metrics_weighted(y_test,Ypred)
    print("\n")
    print("\n")
    



**displaying  metrics for the mode OneVsRestClassifier(estimator=LogisticRegression(penalty='l1',
                                                 solver='liblinear'))

Accuracy score:  0.4929078014184397
F1 score: Micro 0.805445659762625
Average precision score: Micro 0.7052138469933719
Average recall score: Micro 0.7630070546737213
Average ROC-AUC score:  0.8618705360187865




Accuracy score:  0.4929078014184397
F1 score: Macro 0.46357084699015605
Average recall score: MAcro 0.4051653973772904




Accuracy score:  0.4929078014184397
F1 score: weighted 0.7722506498612003
Average precision score: weighted 0.7214388463156779
Average recall score: weighted 0.7630070546737213






**displaying  metrics for the mode OneVsRestClassifier(estimator=LinearSVC(dual=False, penalty='l1'))

Accuracy score:  0.5319148936170213
F1 score: Micro 0.8365097588978186
Average precision score: Micro 0.7461974755118821
Average recall score: Micro 0.8031305114638448
Average ROC-AUC score:  0.884102745866

### C. Tune Parameters/Hyperparameters of the model/s.

**Using Grid Search**

**For LR Model**

In [63]:
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

tfidf_transformer = TfidfTransformer(smooth_idf=True)

log_reg_clf = OneVsRestClassifier(
  estimator=LogisticRegression(
    intercept_scaling=1,
    class_weight='balanced',
    random_state=0
  )
)

# Create regularization hyperparameter space
C = np.logspace(0, 4, 10)

param_grid = [{
  'vect__use_idf': (True, False),
  'clf__estimator__C': C,
  'clf__estimator__penalty': ['l1','l2']
}]

log_reg_clf_tfidf = Pipeline([
  ('vect', tfidf_transformer),
  ('clf', log_reg_clf)
])

print(log_reg_clf_tfidf.get_params().keys())

gs_logReg_tfidf = GridSearchCV(
  log_reg_clf_tfidf,
  param_grid,
  scoring='accuracy',
  cv=5,
  verbose=1,
  n_jobs=-1
)
gs_logReg_tfidf.fit(X_train_bow, y_train)
print("The best parameters: \n", gs_logReg_tfidf.best_params_)
print("The best score: \n", gs_logReg_tfidf.best_score_)

df_test_predicted_idf = gs_logReg_tfidf.predict(X_test_bow)

display_metrics_micro(y_test,df_test_predicted_idf)
print("\n")
print("\n")
display_metrics_macro(y_test,df_test_predicted_idf)
print("\n")
print("\n")
display_metrics_weighted(y_test,df_test_predicted_idf)
print("\n")
print("\n")

dict_keys(['memory', 'steps', 'verbose', 'vect', 'clf', 'vect__norm', 'vect__smooth_idf', 'vect__sublinear_tf', 'vect__use_idf', 'clf__estimator__C', 'clf__estimator__class_weight', 'clf__estimator__dual', 'clf__estimator__fit_intercept', 'clf__estimator__intercept_scaling', 'clf__estimator__l1_ratio', 'clf__estimator__max_iter', 'clf__estimator__multi_class', 'clf__estimator__n_jobs', 'clf__estimator__penalty', 'clf__estimator__random_state', 'clf__estimator__solver', 'clf__estimator__tol', 'clf__estimator__verbose', 'clf__estimator__warm_start', 'clf__estimator', 'clf__n_jobs'])
Fitting 5 folds for each of 40 candidates, totalling 200 fits
The best parameters: 
 {'clf__estimator__C': 1.0, 'clf__estimator__penalty': 'l2', 'vect__use_idf': True}
The best score: 
 0.5589633655789494
Accuracy score:  0.5726950354609929
F1 score: Micro 0.8538343893379944
Average precision score: Micro 0.7609052452425132
Average recall score: Micro 0.8615520282186949
Average ROC-AUC score:  0.9074268954563

**For SVM Model**

In [64]:
tfidf_transformer = TfidfTransformer(smooth_idf=True)

svm_reg_clf = OneVsRestClassifier(
  estimator=LinearSVC(verbose=True,class_weight='balanced'))

param_grid = [{
  'tfidf__use_idf': (True, False),
  'clf__estimator__C': [0.1, 1, 10, 100],
  'clf__estimator__loss': ['hinge', 'squared_hinge'],
  'clf__estimator__penalty': ['l1', 'l2']
}]

svm_reg_clf_tfidf = Pipeline([
  ('tfidf', tfidf_transformer),
  ('clf', svm_reg_clf)
])

gs_svmReg_tfidf = GridSearchCV(
  svm_reg_clf_tfidf,
  param_grid,
  scoring='accuracy',
  cv=5,
  verbose=1,
  n_jobs=-1
)

print(svm_reg_clf_tfidf.get_params().keys())

gs_svmReg_tfidf.fit(X_train_bow, y_train)
print("The best parameters: \n", gs_svmReg_tfidf.best_params_)
print("The best score: \n", gs_svmReg_tfidf.best_score_)

df_test_predicted_idf = gs_svmReg_tfidf.predict(X_test_bow)

display_metrics_micro(y_test,df_test_predicted_idf)
print("\n")
print("\n")
display_metrics_macro(y_test,df_test_predicted_idf)
print("\n")
print("\n")
display_metrics_weighted(y_test,df_test_predicted_idf)
print("\n")
print("\n")

dict_keys(['memory', 'steps', 'verbose', 'tfidf', 'clf', 'tfidf__norm', 'tfidf__smooth_idf', 'tfidf__sublinear_tf', 'tfidf__use_idf', 'clf__estimator__C', 'clf__estimator__class_weight', 'clf__estimator__dual', 'clf__estimator__fit_intercept', 'clf__estimator__intercept_scaling', 'clf__estimator__loss', 'clf__estimator__max_iter', 'clf__estimator__multi_class', 'clf__estimator__penalty', 'clf__estimator__random_state', 'clf__estimator__tol', 'clf__estimator__verbose', 'clf__estimator', 'clf__n_jobs'])
Fitting 5 folds for each of 32 candidates, totalling 160 fits
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]The best parameters: 
 {'clf__estimator__C': 0

**For Naive Bayes Model**

In [65]:
tfidf_transformer = TfidfTransformer(smooth_idf=True)

nvb_reg_clf = OneVsRestClassifier(
  estimator=MultinomialNB(alpha=1.0,fit_prior=True))

param_grid = [{
  'tfidf__use_idf': (True, False),
  'clf__estimator__alpha': [0.0001, 0.001, 0.1, 1.0]
}]

nvb_reg_clf_tfidf = Pipeline([
  ('tfidf', tfidf_transformer),
  ('clf', nvb_reg_clf)
])

gs_nvbReg_tfidf = GridSearchCV(
  nvb_reg_clf_tfidf,
  param_grid,
  scoring='accuracy',
  cv=5,
  verbose=1,
  n_jobs=-1
)

print(nvb_reg_clf_tfidf.get_params().keys())

gs_nvbReg_tfidf.fit(X_train_bow, y_train)
print("The best parameters: \n", gs_nvbReg_tfidf.best_params_)
print("The best score: \n", gs_nvbReg_tfidf.best_score_)

df_test_predicted_idf = gs_nvbReg_tfidf.predict(X_test_bow)

display_metrics_micro(y_test,df_test_predicted_idf)
print("\n")
print("\n")
display_metrics_macro(y_test,df_test_predicted_idf)
print("\n")
print("\n")
display_metrics_weighted(y_test,df_test_predicted_idf)
print("\n")
print("\n")

dict_keys(['memory', 'steps', 'verbose', 'tfidf', 'clf', 'tfidf__norm', 'tfidf__smooth_idf', 'tfidf__sublinear_tf', 'tfidf__use_idf', 'clf__estimator__alpha', 'clf__estimator__class_prior', 'clf__estimator__fit_prior', 'clf__estimator', 'clf__n_jobs'])
Fitting 5 folds for each of 8 candidates, totalling 40 fits
The best parameters: 
 {'clf__estimator__alpha': 0.1, 'tfidf__use_idf': True}
The best score: 
 0.48937562545376057
Accuracy score:  0.5053191489361702
F1 score: Micro 0.8067084942084942
Average precision score: Micro 0.7170900922587494
Average recall score: Micro 0.7369929453262787
Average ROC-AUC score:  0.855046064875715




Accuracy score:  0.5053191489361702
F1 score: Macro 0.4889800242137758
Average recall score: MAcro 0.4131615323157131




Accuracy score:  0.5053191489361702
F1 score: weighted 0.7819449675431898
Average precision score: weighted 0.7477479997848753
Average recall score: weighted 0.7369929453262787






**For LDA Model**

**Hyperparameter tuning for LDA takes around 25-30 mins as we have to use a Sparse Matrix converting the X_train_bow to X_train_bow.toarray() while fitting the model. Please be patient :)**

In [66]:
tfidf_transformer = TfidfTransformer(smooth_idf=True)

lda_reg_clf = OneVsRestClassifier(
  estimator=LinearDiscriminantAnalysis(solver='svd'))

param_grid = [{
  #'tfidf__use_idf': (True, False),
  'clf__estimator__solver': ['svd', 'lsqr']
  #'clf__estimator__store_covariance': (True, False)
  #'clf__estimator__shrinkage': np.arange(0, 1, 0.01)
}]

lda_reg_clf_tfidf = Pipeline([
  #('tfidf', tfidf_transformer),
  ('clf', lda_reg_clf)
])

gs_ldaReg_tfidf = GridSearchCV(
  lda_reg_clf_tfidf,
  param_grid,
  scoring='accuracy',
  cv=5,
  verbose=1,
  n_jobs=-1
)

print(lda_reg_clf_tfidf.get_params().keys())

gs_ldaReg_tfidf.fit(X_train_bow.toarray(), y_train)
print("The best parameters: \n", gs_ldaReg_tfidf.best_params_)
print("The best score: \n", gs_ldaReg_tfidf.best_score_)

df_test_predicted_idf = gs_ldaReg_tfidf.predict(X_test_bow)

display_metrics_micro(y_test,df_test_predicted_idf)
print("\n")
print("\n")
display_metrics_macro(y_test,df_test_predicted_idf)
print("\n")
print("\n")
display_metrics_weighted(y_test,df_test_predicted_idf)
print("\n")
print("\n")


dict_keys(['memory', 'steps', 'verbose', 'clf', 'clf__estimator__covariance_estimator', 'clf__estimator__n_components', 'clf__estimator__priors', 'clf__estimator__shrinkage', 'clf__estimator__solver', 'clf__estimator__store_covariance', 'clf__estimator__tol', 'clf__estimator', 'clf__n_jobs'])
Fitting 5 folds for each of 2 candidates, totalling 10 fits
The best parameters: 
 {'clf__estimator__solver': 'svd'}
The best score: 
 0.10062692541647862
Accuracy score:  0.09219858156028368
F1 score: Micro 0.5870320339637205
Average precision score: Micro 0.4257304637709579
Average recall score: Micro 0.6706349206349206
Average ROC-AUC score:  0.7436968341664474




Accuracy score:  0.09219858156028368
F1 score: Macro 0.33878876224990356
Average recall score: MAcro 0.45758325840223657




Accuracy score:  0.09219858156028368
F1 score: weighted 0.6459536756819231
Average precision score: weighted 0.6129115279768472
Average recall score: weighted 0.6706349206349206






## 5. Share insights on relative performance comparison

### A. Which vectorizer performed better? Probable reason?

***The TDIF vectorizer performed better as in Bag of Words(Count Vectorizer), we witnessed how vectorization was just concerned with the frequency of vocabulary words in a given document. As a result, articles, prepositions, and conjunctions which don’t contribute a lot to the meaning get as much importance as, say, adjectives. TF-IDF helps us to overcome this issue. Words that get repeated too often don’t overpower less frequent but important words.***

### B. Which model outperformed? Probable reason?

***SVM model outperformed. Accuracy scores are the highest as compared to other models. Even Logistic regression did well, but SVM has slightly better accuracy. Also ROC-AUC score and Recall are high for SVM. LDA model was the worst as it gave lowest accuracy score.***

### C. Which parameter/hyperparameter significantly helped to improve performance?Probable reason?

***As SVM Model model was the best, we used tweaked the 'loss', 'C' and 'penalty' hyperparameters with Grid Search to get good statistics. These hyperparameters ensured that the model is generalised and not overfitting***

### D. According to you, which performance metric should be given most importance, why?.

***According to me Accuracy and F1 Score should be given most importance because the Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish between classes and Accuracy simply measures how often the classifier correctly predicts. We can define accuracy as the ratio of the number of correct predictions and the total number of predictions. In this buisness scenario we see these two parameters should be given most importance.***

# Part B

## Domain:
Customer support

## Context:
Great Learning has a an academic support department which receives numerous support requests every day throughout the year.
Teams are spread across geographies and try to provide support round the year. Sometimes there are circumstances where due to heavy
workload certain request resolutions are delayed, impacting company’s business. Some of the requests are very generic where a proper
resolution procedure delivered to the user can solve the problem. Company is looking forward to design an automation which can interact with
the user, understand the problem and display the resolution procedure [ if found as a generic request ] or redirect the request to an actual human
support executive if the request is complex or not in it’s database.

## Data Description:
A sample corpus is attached for your reference. Please enhance/add more data to the corpus using your linguistics skills.

## Project Objective:
Design a python based interactive semi - rule based chatbot which can do the following:
1. Start chat session with greetings and ask what the user is looking for.
2. Accept dynamic text based questions from the user. Reply back with relevant answer from the designed corpus.
3. End the chat session only if the user requests to end else ask what the user is looking for. Loop continues till the user asks to end it.
Hint: There are a lot of techniques using which one can clean and prepare the data which can be used to train a ML/DL classifier. Hence, it might
require you to experiment, research, self learn and implement the above classifier. There might be many iterations between hand building the
corpus and designing the best fit text classifier. As the quality and quantity of corpus increases the model’s performance i.e. ability to answer
right questions also increases.
Reference: https://www.mygreatlearning.com/blog/basics-of-building-an-artificial-intelligence-chatbot/

## Evaluation: 
Evaluator will use linguistics to twist and turn sentences to ask questions on the topics described in DATA DESCRIPTION and check if
the bot is giving relevant replies.


# Customer support

## Import the required Libraries

In [67]:
!pip install tflearn
import nltk
import numpy
import tensorflow
import tflearn
import random
import json
from nltk.chat.util import Chat, reflections
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()

Collecting tflearn
  Downloading tflearn-0.5.0.tar.gz (107 kB)
[?25l[K     |███                             | 10 kB 24.9 MB/s eta 0:00:01[K     |██████                          | 20 kB 11.3 MB/s eta 0:00:01[K     |█████████▏                      | 30 kB 7.5 MB/s eta 0:00:01[K     |████████████▏                   | 40 kB 7.3 MB/s eta 0:00:01[K     |███████████████▎                | 51 kB 4.3 MB/s eta 0:00:01[K     |██████████████████▎             | 61 kB 5.1 MB/s eta 0:00:01[K     |█████████████████████▍          | 71 kB 5.2 MB/s eta 0:00:01[K     |████████████████████████▍       | 81 kB 5.8 MB/s eta 0:00:01[K     |███████████████████████████▌    | 92 kB 5.7 MB/s eta 0:00:01[K     |██████████████████████████████▌ | 102 kB 5.1 MB/s eta 0:00:01[K     |████████████████████████████████| 107 kB 5.1 MB/s 
Building wheels for collected packages: tflearn
  Building wheel for tflearn (setup.py) ... [?25l[?25hdone
  Created wheel for tflearn: filename=tflearn-0.5.0-py3-no

## Algorithm for this text-based chatbot

### Input the corpus

In [68]:
#importing corpus
import json

#importing corpus file
with open(project_path + 'GL Bot.json') as file:
    Corpus=json.load(file)

#Display corpus
print(Corpus)

{'intents': [{'tag': 'Intro', 'patterns': ['hi', 'how are you', 'is anyone there', 'hello', 'whats up', 'hey', 'yo', 'listen', 'please help me', 'i am learner from', 'i belong to', 'aiml batch', 'aifl batch', 'i am from', 'my pm is', 'blended', 'online', 'i am from', 'hey ya', 'talking to you for first time'], 'responses': ['Hello! how can i help you ?'], 'context_set': ''}, {'tag': 'Exit', 'patterns': ['thank you', 'thanks', 'cya', 'see you', 'later', 'see you later', 'goodbye', 'i am leaving', 'have a Good day', 'you helped me', 'thanks a lot', 'thanks a ton', 'you are the best', 'great help', 'too good', 'you are a good learning buddy'], 'responses': ['I hope I was able to assist you, Good Bye'], 'context_set': ''}, {'tag': 'Olympus', 'patterns': ['olympus', 'explain me how olympus works', 'I am not able to understand olympus', 'olympus window not working', 'no access to olympus', 'unable to see link in olympus', 'no link visible on olympus', 'whom to contact for olympus', 'lot of p

### Perform data pre-processing on corpus:

### Text case [upper or lower] handling 

In [69]:
def recursion_lower(x):
    if type(x) is str:
        return x.lower()
    elif type(x) is list:
        return [recursion_lower(i) for i in x]
    elif type(x) is dict:
        return {recursion_lower(k):recursion_lower(v) for k,v in x.items()}
    else:
        return x

In [70]:
Corpus = recursion_lower(Corpus)
print(Corpus)

{'intents': [{'tag': 'intro', 'patterns': ['hi', 'how are you', 'is anyone there', 'hello', 'whats up', 'hey', 'yo', 'listen', 'please help me', 'i am learner from', 'i belong to', 'aiml batch', 'aifl batch', 'i am from', 'my pm is', 'blended', 'online', 'i am from', 'hey ya', 'talking to you for first time'], 'responses': ['hello! how can i help you ?'], 'context_set': ''}, {'tag': 'exit', 'patterns': ['thank you', 'thanks', 'cya', 'see you', 'later', 'see you later', 'goodbye', 'i am leaving', 'have a good day', 'you helped me', 'thanks a lot', 'thanks a ton', 'you are the best', 'great help', 'too good', 'you are a good learning buddy'], 'responses': ['i hope i was able to assist you, good bye'], 'context_set': ''}, {'tag': 'olympus', 'patterns': ['olympus', 'explain me how olympus works', 'i am not able to understand olympus', 'olympus window not working', 'no access to olympus', 'unable to see link in olympus', 'no link visible on olympus', 'whom to contact for olympus', 'lot of p

### Tokenisation

In [71]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [72]:
#Extract data
W = [] #Tokens
L = [] #Identified Tags or Labels
doc_x = [] #Tokenised words
doc_y = [] #Tags or labels

for intent in Corpus['intents']:
    for pattern in intent['patterns']:
        w_temp = nltk.word_tokenize(pattern)
        W.extend(w_temp)
        doc_x.append(w_temp)
        doc_y.append(intent["tag"])
        
    #Add the missing tag if any
    if intent['tag'] not in L:
        L.append(intent['tag'])

### Stemming

In [73]:
# Stemming

W = [stemmer.stem(w.lower()) for w in W if w!= "?"] #Stemming or learning the root word
W = sorted(list(set(W))) #Sorted words
L = sorted(L) #Sorted list of tags or labels
Train = []
Target = []

### Generate BOW [Bag of Words]

In [74]:
out_empty = [0 for _ in range(len(L))]

#Loop to create bag of words and put the frequency count on each word
for x,doc in enumerate(doc_x):
    bag = []
    
    w_temp = [stemmer.stem(w.lower()) for w in doc]

    for w in W:
        if w in w_temp:
            bag.append(1)
        else:
            bag.append(0)
           
    output_row = out_empty[:]
    output_row[L.index(doc_y[x])] = 1
        
    Train.append(bag) #List
    Target.append(output_row) #List

### Generate one hot encoding for the target column

In [75]:
Train = numpy.array(Train)
Target = numpy.array(Target)

In [76]:
len(Target[0])

8

In [77]:
len(Train[0])

150

In [78]:
Target

array([[0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1]])

In [79]:
Train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])

### Design a neural network to classify the words with TAGS as target outputs

In [80]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import SGD
import random
model = Sequential()
model.add(Dense(64, input_shape=(len(Train[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(Target[0]), activation='softmax'))
# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', metrics=['accuracy'])
#fitting and saving the model 
hist = model.fit(Train, Target, epochs=200, batch_size=5, verbose=1)
model.save('chatbot_model.h5', hist)
print("model created")

Train on 128 samples
Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/200


### Defination of Bag Of Words

In [81]:
def bag_of_words(inp, W):
    bag = [0 for _ in range(len(W))]

    inp_W = nltk.word_tokenize(inp)
    inp_W = [stemmer.stem(W.lower()) for W in inp_W]

    for t in inp_W:
        for i, w in enumerate(W):
            if w == t:
                bag[i] = 1
            
    return numpy.array(bag)

### Design a chat utility as a function to interact with the user till the user calls a “quit”

### If the user does not understand or finds the bot’s answer irrelevant, the user calls a “*” asking the bot to re-evaluate what the user has asked

In [82]:
#Text chat utility function
import random
def chat():
    print("Chat with Surajit(type:stop to quit)")
    print("If answer is not right(type:*)")
    while True:
        inp=input("\n\nYou:")
        if inp.lower()=="*":
            print("BOT:Please rephrase your question and try again")
        if inp.lower()=="quit":
            break
            
        results=model.predict([[bag_of_words(inp,W)]])
        results_index=numpy.argmax(results)
        tag=L[results_index]
        
        for tg in Corpus["intents"]:
            if tg['tag']==tag:
                responses=tg['responses']
        print(random.choice(responses))

### Run the chat utility function

In [83]:
chat()

Chat with Surajit(type:stop to quit)
If answer is not right(type:*)


You:hello
hello! how can i help you ?


You:olympus
link: olympus wiki


You:*
BOT:Please rephrase your question and try again
hello! how can i help you ?


You:not good
tarnsferring the request to your pm


You:stupid bot
please use respectful words


You:you are good
i hope i was able to assist you, good bye


You:machine learning
link: machine learning wiki 


You:Surajit
hello! how can i help you ?


You:what is your name?
i am your virtual learning assistant


You:thanks
i hope i was able to assist you, good bye


You:bye
hello! how can i help you ?


You:quit


## Simple Text-based Chatbot using NLTK with Python

In [137]:
#create a variable named pairs
pairs =[
    [
        r"(.*)my name is (.*)", #request
        ["Hello %2, How are you today ?",] #response
    ],
        [
        r"(.*)help(.*) ",
        ["I can help you ",]
    ],
         [
        r"(.*) your name ?",
        ["My name is Suri, but you can just call me Robo Sur and I'm a chatbot .",]
    ],
        [
        r"(.*) are you ?",
        ["My name is Suri, but you can just call me Robo Sur and I'm a chatbot .",]
    ],
        [
        r"how are you (.*) ?",
        ["I'm doing very well", "i am great !"]
    ],
        [
        r"sorry (.*)",
        ["Its alright","Its OK, never mind that",]
    ],
        [
        r"i'm (.*) (good|well|okay|ok)",
        ["Nice to hear that","Alright, great !",]
    ],
        [
        r"(hi|hey|hello|hola|holla)(.*)",
        ["Hello", "Hey there",]
    ],
        [
        r"what (.*) want ?",
        ["Make me an offer I can't refuse",]
    ],
        [
        r"(.*)created(.*)",
        ["Surajit Pal created me using Python's NLTK library ","top secret ;)",]
    ],
        [
        r"(.*) (location|city) ?",
        ['New Delhi, India',]
    ],
        [
        r"stay(.*) (location|city) ?",
        ['New Delhi, India',]
    ],
        [
        r"(.*)raining in (.*)",
        ["No rain in the past 4 days here in %2","In %2 there is a 50% chance of rain",]
    ],
        [
        r"(.*)(music|songs|hobby)(.*)",
        ["I love listening to Music",]
    ],
        [
        r"(.*)(artist|band) ?",
        ["Jimmy Hendrix"]
    ],
        [
        r"quit",
        ["Bye for now. See you soon :) ","It was nice talking to you. See you soon :)"]
    ],
        [
        r"(.*)",
        ['That is nice to hear']
    ],
]

In [138]:
def chatbot():
    print("Hello,\tMy name is Surajit \n\n\tI am your virtual assistant \n\tNote: I do understand ENGLISH if written in lowercase")
    print("\n\tPlease let me know your user name\n")
    chat=Chat(pairs,reflections)
    chat.converse()

In [139]:
#Function to keep the chat window live and looped ready to take inputs from user
if __name__=="__main__":

#Calling the chat bot utlity function
            chatbot()

Hello,	My name is Surajit 

	I am your virtual assistant 
	Note: I do understand ENGLISH if written in lowercase

	Please let me know your user name

>surajitpal21
That is nice to hear
>What are you?
My name is Suri, but you can just call me Robo Sur and I'm a chatbot .
>Who are you?
My name is Suri, but you can just call me Robo Sur and I'm a chatbot .
>Who created you?
Surajit Pal created me using Python's NLTK library 
>What is your hobby?
I love listening to Music
>Who is favorite artist?
Jimmy Hendrix
>What is your location?
New Delhi, India
>what do you want?
Make me an offer I can't refuse
>is it raining in New Delhi?
No rain in the past 4 days here in new delhi?
>bye
That is nice to hear
>quit
It was nice talking to you. See you soon :)
