# Part One - Blog Text Classification

## About the dataset 

#### Context:

“A blog (a truncation of the expression "weblog") is a discussion or informational website published on the World Wide Web consisting of discrete, often informal diary-style text entries ("posts"). Posts are typically displayed in reverse chronological order, so that the most recent post appears first, at the top of the web page. Until 2009, blogs were usually the work of a single individual, occasionally of a small group, and often covered a single subject or topic.” -- Wikipedia article “Blog”

[This](https://www.kaggle.com/rtatman/blog-authorship-corpus) dataset contains text from blogs written on or before 2004, with each blog being the work of a single user.

#### Content:

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.

Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:

    8240 "10s" blogs (ages 13-17),
    8086 "20s" blogs (ages 23-27),
    2994 "30s" blogs (ages 33-47).

For each age group there are an equal number of male and female bloggers.

Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.

## Imports

In [None]:
# importing all the necessary libraries
import warnings
warnings.filterwarnings('ignore')
seed = 10

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import re
import nltk
from nltk.corpus import stopwords
import string
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

## Loading the dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
data_path = '/content/drive/MyDrive/4. Statistical NLP/Project NLP/Stats NLP Project - Dataset /Dataset - blogtext.csv'
df = pd.read_csv(data_path, nrows= 25000)
# since it is a huge dataset, we'll try out things with a subset of it

df.head(5)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [None]:
df.info()
# overview of the subset data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      25000 non-null  int64 
 1   gender  25000 non-null  object
 2   age     25000 non-null  int64 
 3   topic   25000 non-null  object
 4   sign    25000 non-null  object
 5   date    25000 non-null  object
 6   text    25000 non-null  object
dtypes: int64(2), object(5)
memory usage: 1.3+ MB


## Data Pre-processing

**We'll apply the following pre-processing steps on the data:**

* remove unwanted spaces
* remove unwanted characters
* remove non-English words
* remove stopwords
* convert to lowercase
* stemming
* target / label merger
* train and test split
* vectorisation

In [None]:
df['text'] = df['text'].astype('str')

### Removing unwanted spaces

In [None]:
df['text_stripped'] = df['text'].str.strip()
# applying strip to remove any leading or trailing spaces

In [None]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,text_stripped
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,...","Info has been found (+/- 100 pages, and 4.5 MB..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...,These are the team members: Drewes van der L...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...,In het kader van kernfusie op aarde: MAAK JE ...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...,Thanks to Yahoo!'s Toolbar I can now 'capture'...


### Removing unwanted characters

In [None]:
def remove_punctuation(text):
  return text.translate(str.maketrans('','',string.punctuation))
df['text_wo_punct'] = df['text_stripped'].apply(remove_punctuation)
# applying transformation to remove punctuations like !"#$%&\'()*+,-./:;<=>?@[\\]^_{|}~`

In [None]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,text_stripped,text_wo_punct
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,...","Info has been found (+/- 100 pages, and 4.5 MB...",Info has been found 100 pages and 45 MB of pd...
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...,These are the team members: Drewes van der L...,These are the team members Drewes van der La...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...,In het kader van kernfusie op aarde: MAAK JE ...,In het kader van kernfusie op aarde MAAK JE E...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,testing!!! testing!!!,testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...,Thanks to Yahoo!'s Toolbar I can now 'capture'...,Thanks to Yahoos Toolbar I can now capture the...


In [None]:
def remove_char(text):
  return re.sub(pattern='[^a-zA-Z]',repl=' ',string=str(text))
df['text_wo_chars'] = df['text_wo_punct'].apply(remove_char)
# applying regex to filter out all characters that are not alphabetical

In [None]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,text_stripped,text_wo_punct,text_wo_chars
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,...","Info has been found (+/- 100 pages, and 4.5 MB...",Info has been found 100 pages and 45 MB of pd...,Info has been found pages and MB of pd...
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...,These are the team members: Drewes van der L...,These are the team members Drewes van der La...,These are the team members Drewes van der La...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...,In het kader van kernfusie op aarde: MAAK JE ...,In het kader van kernfusie op aarde MAAK JE E...,In het kader van kernfusie op aarde MAAK JE E...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,testing!!! testing!!!,testing testing,testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...,Thanks to Yahoo!'s Toolbar I can now 'capture'...,Thanks to Yahoos Toolbar I can now capture the...,Thanks to Yahoos Toolbar I can now capture the...


### Removing non-English words and stopwords

In [None]:
nltk.download('words')
words = set(nltk.corpus.words.words())
def remove_non_english_words(text):
  return " ".join(w for w in nltk.wordpunct_tokenize(text) if w.lower() in words or not w.isalpha())
df['text_english'] = df['text_wo_chars'].apply(remove_non_english_words)
# removing non-English words

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


In [None]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,text_stripped,text_wo_punct,text_wo_chars,text_english
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,...","Info has been found (+/- 100 pages, and 4.5 MB...",Info has been found 100 pages and 45 MB of pd...,Info has been found pages and MB of pd...,been found and of Now i have to wait untill ou...
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...,These are the team members: Drewes van der L...,These are the team members Drewes van der La...,These are the team members Drewes van der La...,These are the team van mail mail me mail
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...,In het kader van kernfusie op aarde: MAAK JE ...,In het kader van kernfusie op aarde MAAK JE E...,In het kader van kernfusie op aarde MAAK JE E...,In het van How to build an From Subject How To...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,testing!!! testing!!!,testing testing,testing testing,testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...,Thanks to Yahoo!'s Toolbar I can now 'capture'...,Thanks to Yahoos Toolbar I can now capture the...,Thanks to Yahoos Toolbar I can now capture the...,Thanks to I can now capture the of now I can s...


In [None]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
def remove_stopwords(text):
  return " ".join(w for w in str(text).split() if w not in stop_words)
df['text_wo_stopwords'] = df['text_english'].apply(remove_stopwords)
# removing stopwords

In [None]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,text_stripped,text_wo_punct,text_wo_chars,text_english,text_wo_stopwords
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,...","Info has been found (+/- 100 pages, and 4.5 MB...",Info has been found 100 pages and 45 MB of pd...,Info has been found pages and MB of pd...,been found and of Now i have to wait untill ou...,found Now wait untill team leader
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...,These are the team members: Drewes van der L...,These are the team members Drewes van der La...,These are the team members Drewes van der La...,These are the team van mail mail me mail,These team van mail mail mail
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...,In het kader van kernfusie op aarde: MAAK JE ...,In het kader van kernfusie op aarde MAAK JE E...,In het kader van kernfusie op aarde MAAK JE E...,In het van How to build an From Subject How To...,In het van How build From Subject How To Build...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,testing!!! testing!!!,testing testing,testing testing,testing testing,testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...,Thanks to Yahoo!'s Toolbar I can now 'capture'...,Thanks to Yahoos Toolbar I can now capture the...,Thanks to Yahoos Toolbar I can now capture the...,Thanks to I can now capture the of now I can s...,Thanks I capture I show cool links Pop audio v...
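
One detail is visible in the output above: capitalised tokens such as `Now`, `These` and `I` survive the stop-word filter. NLTK's English stop-word list is lowercase and the membership test in `remove_stopwords` is case-sensitive, so only lowercase tokens are dropped at this stage. A minimal sketch of a case-insensitive variant follows. It assumes a tiny hand-written stop-word set so that it runs without downloads; `DEMO_STOP_WORDS` and `remove_stopwords_ci` are illustrative names, not part of the notebook.

```python
# Hypothetical, deliberately tiny stop-word set (assumption for this sketch);
# the notebook itself uses NLTK's full English list.
DEMO_STOP_WORDS = {"i", "now", "these", "the", "to", "can"}

def remove_stopwords_ci(text, stop_words=DEMO_STOP_WORDS):
    """Drop every token whose lowercase form is a stop word."""
    return " ".join(w for w in str(text).split() if w.lower() not in stop_words)

example = "Thanks to Yahoos Toolbar I can now capture"
cleaned = remove_stopwords_ci(example)
print(cleaned)
```

Lowercasing the text before the stop-word step would have the same effect; which of the two is preferable depends on whether later steps still need the original casing.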


### Lower Casing

In [None]:
df['text_lower'] = df['text_wo_stopwords'].str.lower()
# converting to lowercase

In [None]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,text_stripped,text_wo_punct,text_wo_chars,text_english,text_wo_stopwords,text_lower
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,...","Info has been found (+/- 100 pages, and 4.5 MB...",Info has been found 100 pages and 45 MB of pd...,Info has been found pages and MB of pd...,been found and of Now i have to wait untill ou...,found Now wait untill team leader,found now wait untill team leader
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...,These are the team members: Drewes van der L...,These are the team members Drewes van der La...,These are the team members Drewes van der La...,These are the team van mail mail me mail,These team van mail mail mail,these team van mail mail mail
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...,In het kader van kernfusie op aarde: MAAK JE ...,In het kader van kernfusie op aarde MAAK JE E...,In het kader van kernfusie op aarde MAAK JE E...,In het van How to build an From Subject How To...,In het van How build From Subject How To Build...,in het van how build from subject how to build...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,testing!!! testing!!!,testing testing,testing testing,testing testing,testing testing,testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...,Thanks to Yahoo!'s Toolbar I can now 'capture'...,Thanks to Yahoos Toolbar I can now capture the...,Thanks to Yahoos Toolbar I can now capture the...,Thanks to I can now capture the of now I can s...,Thanks I capture I show cool links Pop audio v...,thanks i capture i show cool links pop audio v...


### Stemming

In [None]:
stemmer = SnowballStemmer("english")
def stemming(sentence):
    stemSentence = ""
    for word in sentence.split():
        stem = stemmer.stem(word)
        stemSentence += stem
        stemSentence += " "
    stemSentence = stemSentence.strip()
    return stemSentence
df['text_stemmed'] = df['text_lower'].apply(stemming)

In [None]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,text_stripped,text_wo_punct,text_wo_chars,text_english,text_wo_stopwords,text_lower,text_stemmed
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,...","Info has been found (+/- 100 pages, and 4.5 MB...",Info has been found 100 pages and 45 MB of pd...,Info has been found pages and MB of pd...,been found and of Now i have to wait untill ou...,found Now wait untill team leader,found now wait untill team leader,found now wait until team leader
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...,These are the team members: Drewes van der L...,These are the team members Drewes van der La...,These are the team members Drewes van der La...,These are the team van mail mail me mail,These team van mail mail mail,these team van mail mail mail,these team van mail mail mail
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...,In het kader van kernfusie op aarde: MAAK JE ...,In het kader van kernfusie op aarde MAAK JE E...,In het kader van kernfusie op aarde MAAK JE E...,In het van How to build an From Subject How To...,In het van How build From Subject How To Build...,in het van how build from subject how to build...,in het van how build from subject how to build...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,testing!!! testing!!!,testing testing,testing testing,testing testing,testing testing,testing testing,test test
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...,Thanks to Yahoo!'s Toolbar I can now 'capture'...,Thanks to Yahoos Toolbar I can now capture the...,Thanks to Yahoos Toolbar I can now capture the...,Thanks to I can now capture the of now I can s...,Thanks I capture I show cool links Pop audio v...,thanks i capture i show cool links pop audio v...,thank i captur i show cool link pop audio vide...


### Pipelining pre-processing steps

In [None]:
# defining a pipeline for the pre-processing steps
def preprocess(text):
  text = text.strip()
  text = remove_punctuation(text)
  text = remove_char(text)
  text = remove_non_english_words(text)
  text = remove_stopwords(text)
  text = text.lower()
  text = stemming(text)
  return text

In [None]:
# applying preprocessing pipeline on subset data
df['text'] = df['text'].map(lambda text : preprocess(text))
df['text'][:5]

0                     found now wait until team leader
1                        these team van mail mail mail
2    in het van how build from subject how to build...
3                                            test test
4    thank i captur i show cool link pop audio vide...
Name: text, dtype: object

### Target / Label merger

In [None]:
df_merged = pd.DataFrame(columns=['labels','text'])
df_merged['text'] = df['text']
df_merged['labels'] = df.apply(lambda row: [row['gender'],row['age'],row['topic'],row['sign']], axis=1)
# merging all the label columns into a single column into a new DataFrame

In [None]:
df_merged.head()

Unnamed: 0,labels,text
0,"[male, 15, Student, Leo]",found now wait until team leader
1,"[male, 15, Student, Leo]",these team van mail mail mail
2,"[male, 15, Student, Leo]",in het van how build from subject how to build...
3,"[male, 15, Student, Leo]",test test
4,"[male, 33, InvestmentBanking, Aquarius]",thank i captur i show cool link pop audio vide...


### Creating train and test datasets

In [None]:
X = df_merged.text
y = df_merged.labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True, random_state=seed)

In [None]:
print('Shape of training dataset: {}'.format(X_train.shape))
print('Shape of testing dataset: {}'.format(X_test.shape))

Shape of training dataset: (18750,)
Shape of testing dataset: (6250,)


### Vectorisation

In [None]:
# defining a vectorizer that keeps only terms (1- to 3-grams) appearing in at least 15% and at most 80% of the documents
ctv = CountVectorizer(analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1,3),stop_words='english',min_df=0.15,max_df=0.8)
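
The effect of `min_df` and `max_df` can be sketched in plain Python. When both are given as floats, they are read as proportions of documents, and a term is kept only if its document frequency lies within those bounds. The helper below is a simplified illustration under that assumption: it splits on whitespace, ignores n-grams, and `document_frequency_filter` is a hypothetical name rather than a scikit-learn function.

```python
from collections import Counter

def document_frequency_filter(docs, min_df=0.15, max_df=0.8):
    """Return the terms whose share of documents lies in [min_df, max_df]."""
    n_docs = len(docs)
    doc_counts = Counter()
    for doc in docs:
        # set() so that a term is counted once per document, however often it repeats
        doc_counts.update(set(doc.split()))
    return sorted(
        term for term, count in doc_counts.items()
        if min_df <= count / n_docs <= max_df
    )

toy_docs = ["work day work", "day time", "day like", "day work", "day"]
kept_terms = document_frequency_filter(toy_docs, min_df=0.4, max_df=0.8)
print(kept_terms)
```

In this toy corpus `day` is dropped for being in every document and `time` and `like` for being too rare, which mirrors why a 15% lower bound leaves the notebook with a very small vocabulary.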

In [None]:
# fitting the vocabulary on the training text only, then transforming train and test text into document-term matrices
X_train_dtm = ctv.fit_transform(X_train)
X_test_dtm = ctv.transform(X_test)

In [None]:
# observing the Vocabulary and Document Term Matrix together for train data
# (scikit-learn 1.0 and later name this method get_feature_names_out())
pd.DataFrame(data=X_train_dtm.toarray(),columns=ctv.get_feature_names())

Unnamed: 0,come,day,dont,feel,good,got,know,life,like,littl,look,love,make,new,peopl,realli,right,say,someth,thing,think,time,today,tri,want,way,work
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,3
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18745,1,1,2,0,0,0,2,0,1,1,0,0,1,0,0,1,0,1,0,0,0,1,1,1,0,0,0
18746,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
18747,0,1,1,0,0,0,0,0,3,0,1,0,0,0,0,2,0,0,2,0,0,1,0,1,0,4,0
18748,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0


In [None]:
# observing the Vocabulary and Document Term Matrix together for test data
# (scikit-learn 1.0 and later name this method get_feature_names_out())
pd.DataFrame(data=X_test_dtm.toarray(),columns=ctv.get_feature_names())

Unnamed: 0,come,day,dont,feel,good,got,know,life,like,littl,look,love,make,new,peopl,realli,right,say,someth,thing,think,time,today,tri,want,way,work
0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,0,0,0,0,0,0,0,0,0,0
1,0,0,2,0,0,1,0,0,2,0,0,0,0,0,0,2,0,1,0,0,0,0,1,0,1,0,1
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0
3,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,1,0,0,1,0,2,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6245,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6246,1,1,1,0,1,3,0,0,2,0,0,0,1,0,1,6,1,0,1,0,0,1,1,1,1,1,0
6247,0,1,1,0,0,0,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6248,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### Creating a dictionary to capture the count of every label

In [None]:
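# copying the label columns so that the type conversion below modifies an independent DataFrame
dfT = df[['gender', 'age', 'topic', 'sign']].copy()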

In [None]:
dfT['age'] = dfT['age'].astype('str')

In [None]:
keys = []
values = []

for col in dfT.columns: # iterate through all the columns
    counts = dfT[col].value_counts() # count each distinct label once per column
    keys.extend(counts.index.tolist())
    values.extend(counts.tolist())

In [None]:
dictionary = dict(zip(keys,values))

In [None]:
print(dictionary)

{'male': 13568, 'female': 11432, '23': 2814, '27': 2729, '24': 2630, '17': 2583, '35': 2503, '36': 1753, '16': 1702, '26': 1297, '25': 1278, '15': 1240, '14': 1022, '33': 931, '34': 890, '13': 365, '48': 244, '46': 204, '37': 166, '38': 142, '39': 132, '47': 105, '41': 95, '45': 72, '42': 48, '43': 24, '40': 21, '44': 10, 'indUnk': 10308, 'Student': 3558, 'Technology': 3249, 'Fashion': 1622, 'Internet': 1008, 'Education': 990, 'Engineering': 720, 'Arts': 553, 'Communications-Media': 440, 'Marketing': 351, 'Non-Profit': 206, 'BusinessServices': 204, 'Government': 187, 'Religion': 182, 'Consulting': 170, 'Sports-Recreation': 120, 'Automotive': 116, 'Banking': 109, 'Science': 100, 'Manufacturing': 93, 'LawEnforcement-Security': 90, 'Museums-Libraries': 72, 'InvestmentBanking': 71, 'Publishing': 70, 'Advertising': 56, 'Accounting': 53, 'Law': 47, 'Transportation': 46, 'Agriculture': 46, 'Architecture': 45, 'Biotech': 36, 'Construction': 21, 'Military': 19, 'HumanResources': 15, 'RealEstate
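
The same kind of label-to-count dictionary can also be built in one pass with `collections.Counter`. This is only a sketch on made-up rows (the tuples in `toy_rows` are hypothetical, not taken from the dataset), with ages already stored as strings as in `dfT`.

```python
from collections import Counter

# Hypothetical label rows in the order (gender, age, topic, sign)
toy_rows = [
    ("male", "15", "Student", "Leo"),
    ("male", "15", "Student", "Leo"),
    ("female", "24", "indUnk", "Aries"),
]

# Flatten all label values and count how often each one occurs
label_counts = Counter(label for row in toy_rows for label in row)
print(dict(label_counts))
```

One difference to keep in mind: if the same value ever appeared in two different columns, `Counter` would add the counts together, whereas `dict(zip(keys, values))` would keep only the last one.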

### Transforming the labels for classification

In [None]:
# first let's try to visualise how the MultiLabelBinarizer will transform our labels
mlb = MultiLabelBinarizer(classes=sorted(dictionary.keys())).fit(y_train)
pd.DataFrame(mlb.fit_transform(y_train), columns=mlb.classes_)

Unnamed: 0,13,14,15,16,17,23,24,25,26,27,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,Accounting,Advertising,Agriculture,Aquarius,Architecture,Aries,Arts,Automotive,Banking,Biotech,BusinessServices,Cancer,Capricorn,Chemicals,Communications-Media,Construction,Consulting,Education,Engineering,Fashion,Gemini,Government,HumanResources,Internet,InvestmentBanking,Law,LawEnforcement-Security,Leo,Libra,Manufacturing,Marketing,Military,Museums-Libraries,Non-Profit,Pisces,Publishing,RealEstate,Religion,Sagittarius,Science,Scorpio,Sports-Recreation,Student,Taurus,Technology,Telecommunications,Transportation,Virgo,female,indUnk,male
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18745,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1
18746,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
18747,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1
18748,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [None]:
y_train_mlb = mlb.transform(y_train)
y_test_mlb = mlb.transform(y_test)
# transforming the labels in a binary form

In [None]:
y_train_mlb[99]
# transformed y_train

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

In [None]:
y_test_mlb[99]
# transformed y_test

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

In [None]:
y_train.iloc[99]

['male', 23, 'Internet', 'Aquarius']

In [None]:
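# verifying the MLB conversion: gender, topic and sign round-trip, but the age is lost,
# because y_train holds ages as int (23) while the binarizer classes are str ('23')
mlb.inverse_transform(y_train_mlb)[99]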

('Aquarius', 'Internet', 'male')
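
The missing age in the round trip above comes down to a type mismatch: the integer `23` never equals the string class `'23'`, so no age column is set. The sketch below shows this with a simplified stand-in for multi-label binarization (`binarize` is a hypothetical helper, not scikit-learn's implementation) and shows that casting every label to `str` first keeps the age.

```python
def binarize(label_rows, classes):
    """Simplified stand-in: one 0/1 column per class, 1 if the row contains it."""
    return [[1 if c in set(row) else 0 for c in classes] for row in label_rows]

demo_classes = ["23", "Aquarius", "Internet", "male"]   # classes stored as strings
row_mixed = ["male", 23, "Internet", "Aquarius"]          # age is an int, as in y_train
row_str = [str(v) for v in row_mixed]                     # every label cast to str

lost = binarize([row_mixed], demo_classes)[0]
kept = binarize([row_str], demo_classes)[0]
print(lost, kept)
```

Applied to the notebook, this would mean building the merged label list with `str(row['age'])`. That is a suggestion based on the output shown here, not something the original run did.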

## Classification

In [None]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Using pipeline for applying logistic regression and one vs rest classifier
LogReg_pipeline = Pipeline([('clf', 
                             OneVsRestClassifier(LogisticRegression(solver='sag'),
                                                 n_jobs=-1)),])
LogReg_pipeline.fit(X_train_dtm, y_train_mlb)

Pipeline(memory=None,
         steps=[('clf',
                 OneVsRestClassifier(estimator=LogisticRegression(C=1.0,
                                                                  class_weight=None,
                                                                  dual=False,
                                                                  fit_intercept=True,
                                                                  intercept_scaling=1,
                                                                  l1_ratio=None,
                                                                  max_iter=100,
                                                                  multi_class='auto',
                                                                  n_jobs=None,
                                                                  penalty='l2',
                                                                  random_state=None,
                                                      

In [None]:
y_train_pred = LogReg_pipeline.predict(X_train_dtm)
y_train_pred

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [None]:
y_test_pred = LogReg_pipeline.predict(X_test_dtm)
y_test_pred

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

## Accuracy and Classification Report

In [None]:
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, average_precision_score, recall_score

In [None]:
def print_scores(actual, predicted, averaging_type):
    print('\nAVERAGING TYPE==> ',averaging_type)
    print('F1 score: ',f1_score(actual,predicted, average=averaging_type))
    print('Average Precision Score: ',average_precision_score(actual,predicted, average=averaging_type))
    print('Average Recall Score: ',recall_score(actual,predicted, average=averaging_type))
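
A note on the metric choice: `average_precision_score` summarises a precision-recall curve and expects scores or probabilities rather than hard 0/1 predictions. The direct counterpart of `recall_score` in scikit-learn is `precision_score`. To make the micro-averaged quantities concrete, here is a plain-Python sketch on toy data; `micro_scores` and the toy arrays are illustrative assumptions, not part of the notebook.

```python
def micro_scores(actual, predicted):
    """Micro-averaged precision, recall and F1 over all (row, label) cells."""
    tp = fp = fn = 0
    for actual_row, predicted_row in zip(actual, predicted):
        for a, p in zip(actual_row, predicted_row):
            if a == 1 and p == 1:
                tp += 1          # label present and predicted
            elif a == 0 and p == 1:
                fp += 1          # label predicted but absent
            elif a == 1 and p == 0:
                fn += 1          # label present but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

toy_actual = [[1, 0, 1], [0, 1, 1]]
toy_predicted = [[1, 0, 0], [0, 0, 0]]
precision, recall, f1 = micro_scores(toy_actual, toy_predicted)
print(precision, recall, f1)
```

The toy prediction makes a single correct positive call and misses three labels, so precision is perfect while recall is low. A model that predicts very few labels produces this pattern.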

In [None]:
print('--------------------------TRAIN SCORES--------------------------------')
print('Accuracy score: ',accuracy_score(y_train_mlb, y_train_pred))
print_scores(y_train_mlb, y_train_pred, 'micro')
print_scores(y_train_mlb, y_train_pred, 'macro')
print_scores(y_train_mlb, y_train_pred, 'weighted')

--------------------------TRAIN SCORES--------------------------------
Accuracy score:  5.333333333333333e-05

AVERAGING TYPE==>  micro
F1 score:  0.037002422983731395
Average Precision Score:  0.05149887476067868
Average Recall Score:  0.019004444444444445

AVERAGING TYPE==>  macro
F1 score:  0.00284365044200865
Average Precision Score:  nan
Average Recall Score:  0.0015787888658166925

AVERAGING TYPE==>  weighted
F1 score:  0.03428941685208409
Average Precision Score:  0.28467794887335407
Average Recall Score:  0.019004444444444445


In [None]:
print('--------------------------TEST SCORES--------------------------------')
print('Accuracy score: ',accuracy_score(y_test_mlb, y_test_pred))
print_scores(y_test_mlb, y_test_pred, 'micro')
print_scores(y_test_mlb, y_test_pred, 'macro')
print_scores(y_test_mlb, y_test_pred, 'weighted')

--------------------------TEST SCORES--------------------------------
Accuracy score:  0.0

AVERAGING TYPE==>  micro
F1 score:  0.03603884301812328
Average Precision Score:  0.05090629848783695
Average Recall Score:  0.018506666666666668

AVERAGING TYPE==>  macro
F1 score:  0.002751152479333052
Average Precision Score:  nan
Average Recall Score:  0.0015267744335050848

AVERAGING TYPE==>  weighted
F1 score:  0.03341944607144317
Average Precision Score:  0.28467980764201245
Average Recall Score:  0.018506666666666668
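
For multi-label targets, `accuracy_score` reports subset accuracy: a row counts as correct only if every label column matches. That explains the near-zero values above. The train accuracy of 5.33e-05 corresponds to a single fully matched row out of 18,750. A small sketch of the computation, using a hypothetical helper `subset_accuracy` and toy data:

```python
def subset_accuracy(actual, predicted):
    """Share of rows whose predicted label vector equals the actual one exactly."""
    exact_matches = sum(list(a) == list(p) for a, p in zip(actual, predicted))
    return exact_matches / len(actual)

toy_actual = [[1, 0, 1], [0, 1, 0]]
toy_predicted = [[1, 0, 1], [0, 0, 0]]   # second row misses one label
toy_accuracy = subset_accuracy(toy_actual, toy_predicted)
print(toy_accuracy)
```

Because one missed label is enough to fail a row, subset accuracy is a strict measure here, and the micro-averaged scores give a finer view of partial matches.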


In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test_mlb,y_test_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.00      0.00      0.00         0
           2       0.00      0.00      0.00         0
           3       0.00      0.00      0.00         0
           4       0.00      0.00      0.00         0
           5       0.00      0.00      0.00         0
           6       0.00      0.00      0.00         0
           7       0.00      0.00      0.00         0
           8       0.00      0.00      0.00         0
           9       0.00      0.00      0.00         0
          10       0.00      0.00      0.00         0
          11       0.00      0.00      0.00         0
          12       0.00      0.00      0.00         0
          13       0.00      0.00      0.00         0
          14       0.00      0.00      0.00         0
          15       0.00      0.00      0.00         0
          16       0.00      0.00      0.00         0
          17       0.00    

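The scores show that the model performs poorly in this setup: subset accuracy is effectively zero on both the train and the test data, micro-averaged F1 is about 0.04, and the predictions are almost entirely zeros. The visible part of the classification report shows zero precision, recall and support for the first label columns, which means no test row carried those labels after binarization. Two limitations of the setup are worth noting. The vocabulary contains only 27 terms after the `min_df`/`max_df` filter, and the model was trained on a 25,000-row subset because Google Colab crashes on the full dataset. Training on the whole data may improve the scores, but that remains to be verified.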

## Actual & Predicted Labels of 5 examples

In [None]:
five_actual = y_test_mlb[:5]
five_pred = y_test_pred[:5]

In [None]:
five_actual = mlb.inverse_transform(five_actual)
five_actual

[('Libra', 'female', 'indUnk'),
 ('Aries', 'female', 'indUnk'),
 ('Internet', 'Virgo', 'female'),
 ('Aries', 'Fashion', 'male'),
 ('Sagittarius', 'indUnk', 'male')]

In [None]:
five_pred = mlb.inverse_transform(five_pred)
five_pred

[(), (), (), (), ()]

# Part Two - Customer Support Chatbot

## Imports

In [1]:
import nltk
nltk.download('popular',quiet=True)
nltk.download('punkt',quiet=True)
nltk.download('wordnet',quiet=True)
from nltk.corpus import wordnet

In [2]:
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import re
import nltk
import spacy
import string
import seaborn as sns
from nltk.stem.snowball import SnowballStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## Loading the corpus data

In [3]:
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


In [4]:
corpus_path = '/content/drive/MyDrive/4. Statistical NLP/Project NLP/Stats NLP Project - Dataset /GL Bot.json'

In [5]:
# loading data
corpus_df = pd.read_json(corpus_path)

In [6]:
# eyeballing data
corpus_df.head()

Unnamed: 0,intents
0,"{'tag': 'Intro', 'patterns': ['hi', 'how are y..."
1,"{'tag': 'Exit', 'patterns': ['thank you', 'tha..."
2,"{'tag': 'Olympus', 'patterns': ['olympus', 'ex..."
3,"{'tag': 'SL', 'patterns': ['i am not able to u..."
4,"{'tag': 'NN', 'patterns': ['what is deep learn..."


In [7]:
# checking shape
corpus_df.shape

(8, 1)

In [8]:
for i in corpus_df['intents']:
  print(i)

{'tag': 'Intro', 'patterns': ['hi', 'how are you', 'is anyone there', 'hello', 'whats up', 'hey', 'yo', 'listen', 'please help me', 'i am learner from', 'i belong to', 'aiml batch', 'aifl batch', 'i am from', 'my pm is', 'blended', 'online', 'i am from', 'hey ya', 'talking to you for first time'], 'responses': ['Hello! how can i help you ?'], 'context_set': ''}
{'tag': 'Exit', 'patterns': ['thank you', 'thanks', 'cya', 'see you', 'later', 'see you later', 'goodbye', 'i am leaving', 'have a Good day', 'you helped me', 'thanks a lot', 'thanks a ton', 'you are the best', 'great help', 'too good', 'you are a good learning buddy'], 'responses': ['I hope I was able to assist you, Good Bye'], 'context_set': ''}
{'tag': 'Olympus', 'patterns': ['olympus', 'explain me how olympus works', 'I am not able to understand olympus', 'olympus window not working', 'no access to olympus', 'unable to see link in olympus', 'no link visible on olympus', 'whom to contact for olympus', 'lot of problem with oly

In [9]:
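# Separating patterns, responses and tag and saving in new dataframe
# (DataFrame.append was removed in pandas 2.0, so the frame is built from the list of dicts;
#  sort_index(axis=1) keeps the alphabetical column order shown in the output)
data = pd.DataFrame(list(corpus_df['intents'])).sort_index(axis=1)

data.head(8)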

Unnamed: 0,context_set,patterns,responses,tag
0,,"[hi, how are you, is anyone there, hello, what...",[Hello! how can i help you ?],Intro
1,,"[thank you, thanks, cya, see you, later, see y...","[I hope I was able to assist you, Good Bye]",Exit
2,,"[olympus, explain me how olympus works, I am n...",[Link: Olympus wiki],Olympus
3,,"[i am not able to understand svm, explain me h...",[Link: Machine Learning wiki ],SL
4,,"[what is deep learning, unable to understand d...",[Link: Neural Nets wiki],NN
5,,"[what is your name, who are you, name please, ...",[I am your virtual learning assistant],Bot
6,,"[what the hell, bloody stupid bot, do you thin...",[Please use respectful words],Profane
7,,"[my problem is not solved, you did not help me...",[Tarnsferring the request to your PM],Ticket


In [10]:
# Building and enhancing patterns for wider coverage

final_list = []
for row in data['patterns']:
  list_words = row.copy()

  for word in row:
    # only single-word patterns are expanded with WordNet synonyms
    if len(word.split()) != 1:
      continue

    for syn in wordnet.synsets(word):
      lemsArray = syn.lemmas()
      if len(lemsArray) > 0:
        # replace underscores and other special characters in lemma names with spaces
        synonyms = [re.sub(r'[^a-zA-Z0-9 \n.]', ' ', lem.name()) for lem in lemsArray]
      else:
        synonyms = [syn.name()]

      # drop duplicates within a synset before adding them to the pattern list
      list_words.extend(set(synonyms))

  final_list.append(list_words)

# .. and creating a new column with enhanced keywords
data['patterns_and_synonyms'] = final_list

In [11]:
data.head()

Unnamed: 0,context_set,patterns,responses,tag,patterns_and_synonyms
0,,"[hi, how are you, is anyone there, hello, what...",[Hello! how can i help you ?],Intro,"[hi, how are you, is anyone there, hello, what..."
1,,"[thank you, thanks, cya, see you, later, see y...","[I hope I was able to assist you, Good Bye]",Exit,"[thank you, thanks, cya, see you, later, see y..."
2,,"[olympus, explain me how olympus works, I am n...",[Link: Olympus wiki],Olympus,"[olympus, explain me how olympus works, I am n..."
3,,"[i am not able to understand svm, explain me h...",[Link: Machine Learning wiki ],SL,"[i am not able to understand svm, explain me h..."
4,,"[what is deep learning, unable to understand d...",[Link: Neural Nets wiki],NN,"[what is deep learning, unable to understand d..."


In [12]:
#Printing enhanced patterns along with existing patterns

for index, row in data.iterrows():
  print(row['patterns'])
  print(row['patterns_and_synonyms'])


['hi', 'how are you', 'is anyone there', 'hello', 'whats up', 'hey', 'yo', 'listen', 'please help me', 'i am learner from', 'i belong to', 'aiml batch', 'aifl batch', 'i am from', 'my pm is', 'blended', 'online', 'i am from', 'hey ya', 'talking to you for first time']
['hi', 'how are you', 'is anyone there', 'hello', 'whats up', 'hey', 'yo', 'listen', 'please help me', 'i am learner from', 'i belong to', 'aiml batch', 'aifl batch', 'i am from', 'my pm is', 'blended', 'online', 'i am from', 'hey ya', 'talking to you for first time', 'hello', 'how do you do', 'hi', 'hullo', 'howdy', 'Hawai i', 'Hawaii', 'Aloha State', 'HI', 'hello', 'how do you do', 'hi', 'hullo', 'howdy', 'listen', 'hear', 'take heed', 'listen', 'heed', 'listen', 'mind', 'blend', 'intermingle', 'intermix', 'immingle', 'blend', 'go', 'blend in', 'fuse', 'blend', 'commingle', 'meld', 'combine', 'conflate', 'mix', 'flux', 'merge', 'immix', 'coalesce', 'blended', 'online', 'on line', 'online', 'on line', 'online', 'on line'
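
The printed list above contains repeated entries (for example `hello`, `hi` and `listen` appear several times), because duplicates are only removed within each synset. A minimal sketch of an order-preserving clean-up follows; the helper `dedupe_keep_order` and the shortened word list are assumptions for illustration.

```python
def dedupe_keep_order(words):
    """Drop repeated entries while keeping the first occurrence of each."""
    # dict preserves insertion order, so the original ordering survives
    return list(dict.fromkeys(words))

expanded = ['hi', 'hello', 'listen', 'hello', 'how do you do',
            'hi', 'hullo', 'listen', 'hear']
unique_words = dedupe_keep_order(expanded)
print(unique_words)
```

Applied per row, something like `data['patterns_and_synonyms'].apply(dedupe_keep_order)` would shrink the lists without changing which words are covered.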

In [13]:
# convert all text to lower case

final_list = []
for list_of_words in data['patterns_and_synonyms']:
  new_list_of_words = []
  for item in list_of_words:
    item_lowercase = item.lower()
    new_list_of_words.append(item_lowercase)

  final_list.append(new_list_of_words)

data['patterns_and_synonyms_lowercase'] = final_list


In [32]:
data.head(10)

Unnamed: 0,context_set,patterns,responses,tag,patterns_and_synonyms,patterns_and_synonyms_lowercase,patterns_and_synonyms_lowercase_lammatized
0,,"[hi, how are you, is anyone there, hello, what...",[Hello! how can i help you ?],Intro,"[hi, how are you, is anyone there, hello, what...","[hi, how are you, is anyone there, hello, what...","[hi, how are you, is anyone there, hello, what..."
1,,"[thank you, thanks, cya, see you, later, see y...","[I hope I was able to assist you, Good Bye]",Exit,"[thank you, thanks, cya, see you, later, see y...","[thank you, thanks, cya, see you, later, see y...","[thank you, thanks, cya, see you, later, see y..."
2,,"[olympus, explain me how olympus works, I am n...",[Link: Olympus wiki],Olympus,"[olympus, explain me how olympus works, I am n...","[olympus, explain me how olympus works, i am n...","[olympus, explain me how olympus work, i am no..."
3,,"[i am not able to understand svm, explain me h...",[Link: Machine Learning wiki ],SL,"[i am not able to understand svm, explain me h...","[i am not able to understand svm, explain me h...","[i am not able to understand svm, explain me h..."
4,,"[what is deep learning, unable to understand d...",[Link: Neural Nets wiki],NN,"[what is deep learning, unable to understand d...","[what is deep learning, unable to understand d...","[what is deep learning, unable to understand d..."
5,,"[what is your name, who are you, name please, ...",[I am your virtual learning assistant],Bot,"[what is your name, who are you, name please, ...","[what is your name, who are you, name please, ...","[what is your name, who are you, name please, ..."
6,,"[what the hell, bloody stupid bot, do you thin...",[Please use respectful words],Profane,"[what the hell, bloody stupid bot, do you thin...","[what the hell, bloody stupid bot, do you thin...","[what the hell, bloody stupid bot, do you thin..."
7,,"[my problem is not solved, you did not help me...",[Tarnsferring the request to your PM],Ticket,"[my problem is not solved, you did not help me...","[my problem is not solved, you did not help me...","[my problem is not solved, you did not help me..."


In [15]:
for index, row in data.iterrows():
  print(row['patterns_and_synonyms'])
  print(row['patterns_and_synonyms_lowercase'])

['hi', 'how are you', 'is anyone there', 'hello', 'whats up', 'hey', 'yo', 'listen', 'please help me', 'i am learner from', 'i belong to', 'aiml batch', 'aifl batch', 'i am from', 'my pm is', 'blended', 'online', 'i am from', 'hey ya', 'talking to you for first time', 'hello', 'how do you do', 'hi', 'hullo', 'howdy', 'Hawai i', 'Hawaii', 'Aloha State', 'HI', 'hello', 'how do you do', 'hi', 'hullo', 'howdy', 'listen', 'hear', 'take heed', 'listen', 'heed', 'listen', 'mind', 'blend', 'intermingle', 'intermix', 'immingle', 'blend', 'go', 'blend in', 'fuse', 'blend', 'commingle', 'meld', 'combine', 'conflate', 'mix', 'flux', 'merge', 'immix', 'coalesce', 'blended', 'online', 'on line', 'online', 'on line', 'online', 'on line']
['hi', 'how are you', 'is anyone there', 'hello', 'whats up', 'hey', 'yo', 'listen', 'please help me', 'i am learner from', 'i belong to', 'aiml batch', 'aifl batch', 'i am from', 'my pm is', 'blended', 'online', 'i am from', 'hey ya', 'talking to you for first time'

In [16]:
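from nltk.stem import WordNetLemmatizer

# create the lemmatizer once instead of on every call
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    """Lemmatize every whitespace-separated word of `text` and join them back into one string."""
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])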

In [17]:
final_list = []
for list_of_words in data['patterns_and_synonyms_lowercase']:
  new_list_of_words = []
  for item in list_of_words:
    item_lammatized = lemmatize_text(item)
    new_list_of_words.append(item_lammatized)

  final_list.append(new_list_of_words)

data['patterns_and_synonyms_lowercase_lammatized'] = final_list

In [18]:
for index, row in data.iterrows():
  print(row['patterns_and_synonyms_lowercase'])
  print(row['patterns_and_synonyms_lowercase_lammatized'])

['hi', 'how are you', 'is anyone there', 'hello', 'whats up', 'hey', 'yo', 'listen', 'please help me', 'i am learner from', 'i belong to', 'aiml batch', 'aifl batch', 'i am from', 'my pm is', 'blended', 'online', 'i am from', 'hey ya', 'talking to you for first time', 'hello', 'how do you do', 'hi', 'hullo', 'howdy', 'hawai i', 'hawaii', 'aloha state', 'hi', 'hello', 'how do you do', 'hi', 'hullo', 'howdy', 'listen', 'hear', 'take heed', 'listen', 'heed', 'listen', 'mind', 'blend', 'intermingle', 'intermix', 'immingle', 'blend', 'go', 'blend in', 'fuse', 'blend', 'commingle', 'meld', 'combine', 'conflate', 'mix', 'flux', 'merge', 'immix', 'coalesce', 'blended', 'online', 'on line', 'online', 'on line', 'online', 'on line']
['hi', 'how are you', 'is anyone there', 'hello', 'whats up', 'hey', 'yo', 'listen', 'please help me', 'i am learner from', 'i belong to', 'aiml batch', 'aifl batch', 'i am from', 'my pm is', 'blended', 'online', 'i am from', 'hey ya', 'talking to you for first time'

In [19]:
# creating separate data frame for responses.
data_1 = pd.DataFrame(data['responses'])

In [20]:
data_1

Unnamed: 0,responses
0,[Hello! how can i help you ?]
1,"[I hope I was able to assist you, Good Bye]"
2,[Link: Olympus wiki]
3,[Link: Machine Learning wiki ]
4,[Link: Neural Nets wiki]
5,[I am your virtual learning assistant]
6,[Please use respectful words]
7,[Tarnsferring the request to your PM]


In [21]:
data_1['responses'] = data_1['responses'].astype(str)

In [22]:
data_1.dtypes

responses    object
dtype: object

In [23]:
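# cleansing the data in responses: strip the list brackets and quotes left over from astype(str)
# (regex=False makes sure "[" and "]" are treated as literal characters)
data_1['responses'] = data_1['responses'].str.replace("[", "", regex=False)
data_1['responses'] = data_1['responses'].str.replace("]", "", regex=False)
data_1['responses'] = data_1['responses'].str.replace("'", "", regex=False)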

In [24]:
data_1.head(10)

Unnamed: 0,responses
0,Hello! how can i help you ?
1,"I hope I was able to assist you, Good Bye"
2,Link: Olympus wiki
3,Link: Machine Learning wiki
4,Link: Neural Nets wiki
5,I am your virtual learning assistant
6,Please use respectful words
7,Tarnsferring the request to your PM
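
Converting the list to a string and then stripping brackets and quotes works for this corpus, but it would also remove apostrophes that belong to the response text. A hedged alternative is to join the list elements directly. The helper `response_to_text` and the second sample response are hypothetical and not part of the corpus.

```python
def response_to_text(responses):
    """Join a list of response strings into one string, leaving their characters untouched."""
    return ' '.join(responses).strip()

raw = [
    ['Hello! how can i help you ?'],
    ["I'm not sure, let's try again"],  # hypothetical response containing apostrophes
]
joined = [response_to_text(r) for r in raw]

# the stringify-and-strip approach, for comparison
stripped = [str(r).replace("[", "").replace("]", "").replace("'", "") for r in raw]

print(joined[1])
print(stripped[1])
```

With a frame like `data_1`, the same idea could be applied as `data['responses'].apply(response_to_text)` before any `astype(str)` call.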


In [25]:
# Putting each pattern and the index of its intent in a separate dataframe.
# The approach is to match user input against the patterns (column 'y'),
# then use the intent index (column 'x') to get the correct response from the responses dataframe.

class Response:
  def __init__(self, x, y):
    self.x = x  # row index of the intent in `data` / `data_1`
    self.y = y  # one lemmatized pattern or synonym

list_1 = []
for index, row in data.iterrows():
  for item in row['patterns_and_synonyms_lowercase_lammatized']:
    list_1.append(Response(index, item))

data_2 = pd.DataFrame([t.__dict__ for t in list_1])

In [26]:
data_2.head(10)

Unnamed: 0,x,y
0,0,hi
1,0,how are you
2,0,is anyone there
3,0,hello
4,0,whats up
5,0,hey
6,0,yo
7,0,listen
8,0,please help me
9,0,i am learner from


In [27]:

wnlemmatizer = nltk.stem.WordNetLemmatizer()

def perform_lemmatization(tokens):
    return [wnlemmatizer.lemmatize(token) for token in tokens]

punctuation_removal = dict((ord(punctuation), None) for punctuation in string.punctuation)

def get_processed_text(document):
    return perform_lemmatization(nltk.word_tokenize(document.lower().translate(punctuation_removal)))
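
As a quick illustration of the punctuation step, the translation table maps every ASCII punctuation code point to `None`, which `str.translate` interprets as "delete this character". The sketch below isolates that step; the helper name `strip_punctuation` and the sample sentence are assumptions, and tokenization and lemmatization are left out.

```python
import string

# same construction as in the cell above: punctuation code point -> None (delete)
punctuation_removal = {ord(ch): None for ch in string.punctuation}

def strip_punctuation(text):
    """Lowercase the text and delete ASCII punctuation."""
    return text.lower().translate(punctuation_removal)

cleaned = strip_punctuation("What's up? I don't know about SVM!")
print(cleaned)  # whats up i dont know about svm
```

Note that apostrophes are removed as well, so `don't` becomes `dont` before it reaches the tokenizer.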

In [30]:
requests = data_2['y']
def generate_response(user_input):
    # vectorize the known patterns together with the user input (which becomes the last row);
    # Series.append was removed in pandas 2.0, so pd.concat is used instead
    requests_1 = pd.concat([requests, pd.Series([user_input])], ignore_index=True)
    word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text)
    all_word_vectors = word_vectorizer.fit_transform(requests_1)

    # similarity of the user input against every known pattern (the input row itself is excluded)
    similar_vector_values = cosine_similarity(all_word_vectors[-1], all_word_vectors[:-1]).flatten()
    similar_sentence_number = similar_vector_values.argmax()
    vector_matched = similar_vector_values[similar_sentence_number]

    if vector_matched == 0:
        return "I am sorry, I could not understand you"

    # map the best-matching pattern back to its intent index, then to the response
    idx = data_2['x'].iloc[similar_sentence_number]
    return data_1.iloc[idx]['responses']

In [34]:
continue_dialogue = True
print("Virtual Assistant : Hello, I am your virtual assistant. Let me know how can I help you today.")
while(continue_dialogue == True):
    human_text = input()
    human_text = human_text.lower()
    if human_text != 'bye':
      print("Virtual Assistant : ", generate_response(human_text))
    else:
      continue_dialogue = False
      print("Virtual Assistant : Good bye and take care of yourself. Have a good day!")

print("Ciao!")

Virtual Assistant : Hello, I am your virtual assistant. Let me know how can I help you today.
I don't know about machine learning
Virtual Assistant :  Link: Machine Learning wiki 
what is svm
Virtual Assistant :  Link: Neural Nets wiki
how does neural networks work
Virtual Assistant :  Link: Neural Nets wiki
I need to raise a ticket
Virtual Assistant :  Tarnsferring the request to your PM
help me with certification
Virtual Assistant :  Hello! how can i help you ?
what courses do you offer
Virtual Assistant :  Hello! how can i help you ?
hows does lstm
Virtual Assistant :  I am sorry, I could not understand you
ok bye
Virtual Assistant :  I hope I was able to assist you, Good Bye
bye
Virtual Assistant : Good bye and take care of yourself. Have a good day!
Ciao!
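
In the session above, `ok bye` was sent to the matcher because the loop only ends on the exact input `bye`. One possible refinement is sketched below; the `EXIT_WORDS` set and the helper `wants_to_exit` are assumptions for illustration, not part of the original bot.

```python
EXIT_WORDS = {'bye', 'goodbye', 'quit', 'exit'}  # assumed set, adjust as needed

def wants_to_exit(text):
    """True if any whitespace-separated token, stripped of basic punctuation, is an exit word."""
    tokens = (tok.strip('.,!?') for tok in text.lower().split())
    return any(tok in EXIT_WORDS for tok in tokens)

print(wants_to_exit('ok bye'))
print(wants_to_exit('what is svm'))
```

The dialogue loop could then test `wants_to_exit(human_text)` instead of comparing against the single string `'bye'`.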


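We can clearly see that the bot does not perform very well. It can answer basic questions, but it is not a robust model: in the session above, "what is svm" was answered with the Neural Nets link, and "help me with certification" and "what courses do you offer" both fell back to the greeting. With a larger corpus and a minimum threshold on the cosine similarity score, we can build a better-performing bot.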
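
A minimum-similarity cutoff could look roughly like the sketch below. It uses plain token counts instead of TF-IDF to stay dependency-free, and the function names, the three sample patterns and the `min_similarity=0.5` value are all assumptions that would need tuning on real conversations.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_match(user_input, patterns, min_similarity=0.5):
    """Return (pattern, score) for the closest pattern, or (None, score) below the cutoff."""
    query = Counter(user_input.lower().split())
    scored = [(cosine(query, Counter(p.split())), p) for p in patterns]
    score, pattern = max(scored)
    return (pattern, score) if score >= min_similarity else (None, score)

patterns = ['what is deep learning', 'i am from', 'my problem is not solved']
hit = best_match('what is deep learning exactly', patterns)
miss = best_match('help me with certification', patterns)
print(hit)
print(miss)
```

With a cutoff in place, weak matches produce the "could not understand" reply instead of the response of whichever intent happens to share a common word with the input.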