# PART ONE

# QUESTION:

• **DOMAIN**: Digital content management

• **CONTEXT**: Classification is probably the most common task you will deal with in real life. Text in the form of blogs, posts, articles, etc. is written every second. It is a challenge to predict information about a writer without knowing anything about him/her. We are going to create a classifier that predicts multiple features of the author of a given text. We have designed it as a multi-label classification problem.

• **DATA DESCRIPTION**: Over 600,000 posts from more than 19 thousand bloggers. The Blog Authorship Corpus consists of the collected posts of
19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or
approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and
the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age but for many, industry and/or sign is
marked as unknown.) All bloggers included in the corpus fall into one of three age groups:

• 8240 "10s" blogs (ages 13-17),    
• 8086 "20s" blogs (ages 23-27), and    
• 2994 "30s" blogs (ages 33-47)


• For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of
common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the
date of the following post, and links within a post are denoted by the label urllink.

• **PROJECT OBJECTIVE**: To build a NLP classifier which can use input text parameters to determine the label/s of the blog. Specific to this case
study, you can consider the text of the blog: ‘text’ feature as independent variable and ‘topic’ as dependent variable.

Steps and tasks: [ Total Score: 40 Marks]

1. Read and Analyse Dataset. [5 Marks]

    A. Clearly write outcome of data analysis(Minimum 2 points) [2 Marks].  
    B. Clean the Structured Data [3 Marks].  
        i. Missing value analysis and imputation. [1 Marks]
        ii. Eliminate Non-English textual data. [2 Marks]
             Hint: Refer ‘langdetect’ library to detect language of the input text)

2. Preprocess unstructured data to make it consumable for model training. [5 Marks]

    A. Eliminate All special Characters and Numbers [2 Marks].  
    B. Lowercase all textual data [1 Marks].  
    C. Remove all Stopwords [1 Marks].   
    D. Remove all extra white spaces [1 Marks].  

3. Build a base Classification model [8 Marks]

    A. Create dependent and independent variables [2 Marks].  
        Hint: Treat ‘topic’ as a Target variable.
    B. Split data into train and test. [1 Marks].  
    C. Vectorize data using any one vectorizer. [2 Marks].   
    D. Build a base model for Supervised Learning - Classification. [2 Marks].  
    E. Clearly print Performance Metrics. [1 Marks].  
        Hint: Accuracy, Precision, Recall, ROC-AUC

4. Improve Performance of model. [14 Marks].  

    A. Experiment with other vectorisers. [4 Marks].  
    B. Build classifier Models using other algorithms than base model. [4 Marks].  
    C. Tune Parameters/Hyperparameters of the model/s. [4 Marks].  
    D. Clearly print Performance Metrics. [2 Marks].  
        Hint: Accuracy, Precision, Recall, ROC-AUC.  

5. Share insights on relative performance comparison [8 Marks].  

    A. Which vectorizer performed better? Probable reason?. [2 Marks].   
    B. Which model outperformed? Probable reason? [2 Marks].   
    C. Which parameter/hyperparameter significantly helped to improve performance?Probable reason?. [2 Marks].    
    D. According to you, which performance metric should be given most importance, why?. [2 Marks]. 

**Mounting the drive**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Importing the libraries**

In [2]:
%tensorflow_version 2.x
import tensorflow
tensorflow.__version__

'2.8.0'

In [3]:
!pip install langdetect
!pip install colorama



In [4]:
# Standard library
import os
import random
import re
import statistics
import warnings

os.environ['PYTHONHASHSEED'] = str(1)
warnings.simplefilter('ignore')

# Data handling and text utilities
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from langdetect import detect

# Plotting
import matplotlib.pyplot as plt  # Necessary module for plotting purpose
import seaborn as sns  # For Data Visualization
plt.rcParams["patch.force_edgecolor"] = True
%matplotlib inline

# scikit-learn: splitting, vectorizing, models
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

# scikit-learn: confusion matrix, F1 score, accuracy score etc.
from sklearn import metrics
from sklearn.metrics import (accuracy_score, average_precision_score, classification_report,
                             confusion_matrix, f1_score, make_scorer, mean_absolute_error,
                             recall_score)

# TensorFlow / Keras (used only by the neural-network experiment further down)
import tensorflow as tf
from tensorflow import keras
import tensorflow.keras.layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, LeakyReLU, Dropout, BatchNormalization, Flatten, Conv2D, MaxPool2D, GlobalMaxPooling2D
from tensorflow.keras import optimizers
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier, KerasRegressor
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.optimizers import Adam,SGD,RMSprop,Adagrad

# for hyperparameter tuning
# from scipy.stats import randint as sp_randint
# from scipy.stats import uniform as sp_uniform

-------------------------
### 1. Read and Analyse Dataset. [5 Marks]

    A. Clearly write outcome of data analysis(Minimum 2 points) [2 Marks].  
    B. Clean the Structured Data [3 Marks].  
        i. Missing value analysis and imputation. [1 Marks]
        ii. Eliminate Non-English textual data. [2 Marks]
             Hint: Refer ‘langdetect’ library to detect language of the input text)

**Set project directory**

**Unzipping the files and extracting the csv**


In [5]:
project_path = "/content/drive/My Drive/aiml/nlp/project1/"

os.chdir(project_path)

from zipfile import ZipFile

with ZipFile('blogs.zip', 'r') as zipdata:
    data_csv = zipdata.open('blogtext.csv')

**Read the csv files**

In [6]:
df = pd.read_csv(data_csv)

In [7]:
del data_csv

**Check the column names**

In [8]:
df.columns

Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')

**We have total 7 columns : 'id', 'gender', 'age', 'topic', 'sign', 'date', 'text'**

**Checking the data (first 5 rows)**

In [9]:
df.head(5)

|   | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 0 | 2059027 | male | 15 | Student | Leo | 14,May,2004 | Info has been found (+/- 100 pages,... |
| 1 | 2059027 | male | 15 | Student | Leo | 13,May,2004 | These are the team members: Drewe... |
| 2 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | In het kader van kernfusie op aarde... |
| 3 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | testing!!! testing!!! |
| 4 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 11,June,2004 | Thanks to Yahoo!'s Toolbar I can ... |


**Checking the shape and info of the data**

In [10]:
df.shape

(681284, 7)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 681284 entries, 0 to 681283
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   id      681284 non-null  int64 
 1   gender  681284 non-null  object
 2   age     681284 non-null  int64 
 3   topic   681284 non-null  object
 4   sign    681284 non-null  object
 5   date    681284 non-null  object
 6   text    681284 non-null  object
dtypes: int64(2), object(5)
memory usage: 36.4+ MB


**Check if there is null data present on any columns**

In [12]:
df.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64



*   There are 7 columns and 681,284 rows of data.
*   `id` and `date` add little value for this task and can be removed.
*   Except for `id` and `age` (both int64), all the columns are of type object.
*   There is no null data, so no imputation is needed.




**Eliminate Non-English textual data.**

In [13]:
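## Language detection takes a long time to execute on this many posts.
## NOTE: the outputs shown further down still contain non-English rows (e.g. row 2 is Dutch)
## and the full 681,284 rows, so they appear to come from a run where this filter was commented out.

def det(x):
    # Return the detected language code, or 'Other' when detection fails
    # (for example on empty or non-alphabetic text).
    try:
        lang = detect(x)
    except Exception:
        lang = 'Other'
    return lang
df['detect'] = df['text'].apply(det)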


In [14]:
df = df[df['detect'] == 'en'].copy()  # copy so that later column assignments act on an independent frame

****

**We have completed the first part. We read the data, analysed its attributes and their types, and noted down the outcomes.**

**We checked for null data and found none. We then detected non-English posts and removed them using the `langdetect` library.**

****
-----------------------------------
## Let's move to the second question

2. Preprocess unstructured data to make it consumable for model training. [5 Marks]

    A. Eliminate All special Characters and Numbers [2 Marks].  
    B. Lowercase all textual data [1 Marks].  
    C. Remove all Stopwords [1 Marks].   
    D. Remove all extra white spaces [1 Marks]. 

**A. Eliminate All special Characters and Numbers**

In [15]:
df.text.head(5)

0               Info has been found (+/- 100 pages,...
1               These are the team members:   Drewe...
2               In het kader van kernfusie op aarde...
3                     testing!!!  testing!!!          
4                 Thanks to Yahoo!'s Toolbar I can ...
Name: text, dtype: object

In [16]:
df.text = df.text.apply(lambda x: re.sub('[^A-Za-z]+', ' ', x))

In [17]:
df.text.head(5)

0     Info has been found pages and MB of pdf files...
1     These are the team members Drewes van der Laa...
2     In het kader van kernfusie op aarde MAAK JE E...
3                                     testing testing 
4     Thanks to Yahoo s Toolbar I can now capture t...
Name: text, dtype: object

**All digits and special characters have been removed, as can be seen by comparing with the text printed before this step.**

**B. Now let's lowercase the data**

In [18]:
df.text = df.text.apply(lambda x: x.lower())
df.text.head(5)

0     info has been found pages and mb of pdf files...
1     these are the team members drewes van der laa...
2     in het kader van kernfusie op aarde maak je e...
3                                     testing testing 
4     thanks to yahoo s toolbar i can now capture t...
Name: text, dtype: object

**All the text is now lowercase, for example "info" and "these".**

**C. Remove all Stopwords**

In [19]:
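nltk.download('stopwords')
# Use a separate name so the imported `stopwords` module is not overwritten
# (otherwise re-running this cell fails).
stop_words = set(stopwords.words('english'))
df.text = df.text.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))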

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [20]:
df.text.head(5)

0    info found pages mb pdf files wait untill team...
1    team members drewes van der laag urllink mail ...
2    het kader van kernfusie op aarde maak je eigen...
3                                      testing testing
4    thanks yahoo toolbar capture urls popups means...
Name: text, dtype: object

**As we see above, stopwords such as "has", "been", "and" and "in" have been removed.**

**D. Remove all extra white spaces**

In [21]:
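# Collapse any run of whitespace to a single space, then trim both ends.
# (The split/join in the stopword step already removed repeated inner spaces.)
df.text = df.text.apply(lambda x: re.sub(r'\s+', ' ', x).strip())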

In [22]:
df.text[6]

'somehow coca cola way summing things well early flagship jingle like buy world coke tune like teach world sing pretty much summed post woodstock era well add much sales catchy tune korea coke theme urllink stop thinking feel pretty much sums lot korea koreans look relaxed couple stopped thinking started feeling course high regard education math logic deep think many koreans really like work emotion anything else westerners seem sublimate moreso least display different way maybe scratch westerners koreans probably pretty similar context different anyways think losing korea repeat stop thinking feel stop thinking feel stop thinking feel everything alright'

****

**We have now completed all the preprocessing steps: eliminating special characters and numbers, lowercasing the text, removing stopwords, and removing extra white space.**
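
The four preprocessing steps can also be collected into one helper, which makes it easier to apply identical cleaning to new text at prediction time. This is only a minimal sketch, not the notebook's own code: `clean_text` and the tiny `DEMO_STOPWORDS` set are hypothetical names introduced for illustration, and in practice the NLTK English stopword list would be passed in instead.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook uses NLTK's English stopword list instead.
DEMO_STOPWORDS = {"to", "s", "i", "can", "now", "has", "been", "and", "of", "the"}

def clean_text(text, stop_words=DEMO_STOPWORDS):
    """Apply steps A-D: strip non-letters, lowercase, drop stopwords, squeeze spaces."""
    text = re.sub(r"[^A-Za-z]+", " ", text)  # A: special characters and numbers -> space
    text = text.lower()                      # B: lowercase
    tokens = [w for w in text.split() if w not in stop_words]  # C: remove stopwords
    return " ".join(tokens)                  # D: split/join leaves only single spaces

sample = "Thanks to Yahoo!'s Toolbar I can now capture 100 URLs"
cleaned = clean_text(sample)
print(cleaned)
```

Keeping the steps in one function means the order (strip, lowercase, filter, squeeze) cannot drift between training and inference.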

****
----------------------------------------
## Let's move to question 3

3. Build a base Classification model [8 Marks]

    A. Create dependent and independent variables [2 Marks].  
        Hint: Treat ‘topic’ as a Target variable.
    B. Split data into train and test. [1 Marks].  
    C. Vectorize data using any one vectorizer. [2 Marks].   
    D. Build a base model for Supervised Learning - Classification. [2 Marks].  
    E. Clearly print Performance Metrics. [1 Marks].  
        Hint: Accuracy, Precision, Recall, ROC-AUC

**A. Create dependent and independent variables**

**The independent variable is the `text` column.**

**For the dependent variable, merge all the label columns (gender, age, topic, sign) into one list, so that each post carries all of its tags together.**

In [23]:
df['labels'] = df.apply(lambda row: [row['gender'], str(row['age']), row['topic'], row['sign']], axis=1)

**Let's remove the other columns and keep only `text` and `labels`**

In [24]:
df = df[['text','labels']]

In [25]:
df.head()

|   | text | labels |
|---|---|---|
| 0 | info found pages mb pdf files wait untill team... | [male, 15, Student, Leo] |
| 1 | team members drewes van der laag urllink mail ... | [male, 15, Student, Leo] |
| 2 | het kader van kernfusie op aarde maak je eigen... | [male, 15, Student, Leo] |
| 3 | testing testing | [male, 15, Student, Leo] |
| 4 | thanks yahoo toolbar capture urls popups means... | [male, 33, InvestmentBanking, Aquarius] |


**Here `labels` is the dependent variable and `text` is the independent variable.**

**B. Split data into train and test.**

In [26]:
X_train, X_test, y_train, y_test = train_test_split(df.text.values, df.labels.values, test_size=0.20, random_state=42)

**We have split the data into X_train, X_test, y_train, y_test**

**C. Vectorize data using any one vectorizer.**

**Using CountVectorizer**

In [27]:
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

**Lets look at some feature names**

In [28]:
vectorizer.get_feature_names()[:5]  # removed in scikit-learn 1.2; use get_feature_names_out() there

['aa', 'aa aa', 'aa aaa', 'aa aaaa', 'aa aaaaa']
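
To make the `binary=True, ngram_range=(1, 2)` settings concrete, the sketch below fits the same kind of vectorizer on two made-up sentences. The `toy_docs` list and the `toy_*` names are assumptions for illustration only. It reads the learned terms from `vocabulary_`, which is available across scikit-learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["testing testing", "stop thinking feel"]  # made-up documents

toy_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each unigram/bigram to its column index;
# sorting by that index gives the column order of the matrix.
toy_features = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_features)
print(toy_matrix.toarray())
```

With `binary=True` the repeated word in the first document is recorded as 1 rather than 2, so the matrix only says whether a term occurs, not how often.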

**Lets view term-document matrix**

In [29]:
# X_train_bow.toarray()

**D. Build a base model for Supervised Learning - Classification.**

**Lets create a dictionary to get label counts**

In [30]:
label_counts = dict()

for labels in df.labels.values:
    for label in labels:
        if label in label_counts:
            label_counts[label] += 1
        else:
            label_counts[label] = 1

In [31]:
label_counts

{'13': 13133,
 '14': 27400,
 '15': 41767,
 '16': 72708,
 '17': 80859,
 '23': 72889,
 '24': 80071,
 '25': 67051,
 '26': 55312,
 '27': 46124,
 '33': 17584,
 '34': 21347,
 '35': 17462,
 '36': 14229,
 '37': 9317,
 '38': 7545,
 '39': 5556,
 '40': 5016,
 '41': 3738,
 '42': 2908,
 '43': 4230,
 '44': 2044,
 '45': 4482,
 '46': 2733,
 '47': 2207,
 '48': 3572,
 'Accounting': 3832,
 'Advertising': 4676,
 'Agriculture': 1235,
 'Aquarius': 49687,
 'Architecture': 1638,
 'Aries': 64979,
 'Arts': 32449,
 'Automotive': 1244,
 'Banking': 4049,
 'Biotech': 2234,
 'BusinessServices': 4500,
 'Cancer': 65048,
 'Capricorn': 49201,
 'Chemicals': 3928,
 'Communications-Media': 20140,
 'Construction': 1093,
 'Consulting': 5862,
 'Education': 29633,
 'Engineering': 11653,
 'Environment': 592,
 'Fashion': 4851,
 'Gemini': 51985,
 'Government': 6907,
 'HumanResources': 3010,
 'Internet': 16006,
 'InvestmentBanking': 1292,
 'Law': 9040,
 'LawEnforcement-Security': 1878,
 'Leo': 53811,
 'Libra': 62363,
 'Manufacturi
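
The counting loop above can be written more compactly with `collections.Counter`. A small sketch on a made-up `toy_labels` list, which is an assumption standing in for `df.labels.values`:

```python
from collections import Counter
from itertools import chain

toy_labels = [
    ["male", "15", "Student", "Leo"],
    ["male", "15", "Student", "Leo"],
    ["male", "33", "InvestmentBanking", "Aquarius"],
]

# Flatten the per-post label lists and count every label in one pass.
toy_label_counts = Counter(chain.from_iterable(toy_labels))
print(toy_label_counts.most_common(3))
```

A `Counter` behaves like a dictionary, so `sorted(toy_label_counts.keys())` can be used in the same way as with the hand-built dictionary.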

**Lets load a multilabel binarizer and fit it on the labels.**

In [32]:
mlb = MultiLabelBinarizer(classes=sorted(label_counts.keys()))
y_train = mlb.fit_transform(y_train)
y_test = mlb.transform(y_test)
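
As an illustration of what the binarizer produces, the sketch below runs `MultiLabelBinarizer` on two made-up label lists. The `toy_*` names are assumptions, not part of the notebook. Each label becomes one 0/1 column, which is the format `OneVsRestClassifier` expects.

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_labels = [
    ["male", "15", "Student", "Leo"],
    ["male", "33", "InvestmentBanking", "Aquarius"],
]
# Fix the column order up front, as the notebook does with sorted(label_counts.keys()).
toy_classes = sorted({label for labels in toy_labels for label in labels})

toy_mlb = MultiLabelBinarizer(classes=toy_classes)
toy_y = toy_mlb.fit_transform(toy_labels)

print(toy_classes)                       # column order of the indicator matrix
print(toy_y)                             # one row per post, one 0/1 column per label
print(toy_mlb.inverse_transform(toy_y))  # back to tuples of label names
```

Note that `inverse_transform` returns labels in column order, not in the original gender/age/topic/sign order.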

**Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label.**

In [33]:

def build_model_train(X_train, y_train, X_valid=None, y_valid=None, C=1.0, model='lr'):
    # Choose the base estimator by name.
    # X_valid and y_valid are accepted but not used.
    if model == 'lr':
        base = LogisticRegression(C=C, penalty='l1', dual=False, solver='liblinear')
        #base = LogisticRegression(solver='lbfgs',max_iter=1000)
    elif model == 'svm':
        base = LinearSVC(C=C, penalty='l1', dual=False, loss='squared_hinge')
    elif model == 'nbayes':
        base = MultinomialNB(alpha=1.0)
    elif model == 'lda':
        # LDA needs a dense array, so it is impractical on the full sparse matrix.
        base = LinearDiscriminantAnalysis(solver='svd')
    else:
        raise ValueError(f"Unknown model name: {model!r}")

    # Train one binary classifier per label column and return the fitted wrapper.
    clf = OneVsRestClassifier(base)
    clf.fit(X_train, y_train)
    return clf

**The models below were also tried, but they could not be executed even on Google Colab Pro because of excessive memory utilisation.**

In [34]:

# clf = LogisticRegression(solver='lbfgs',max_iter=7600)
# clf = OneVsRestClassifier(clf)

# n_inputs = X_train_bow.shape[1]
# n_outputs = y_train.shape[1]

# model_nn = Sequential()
# model_nn.add(Dense(512, input_dim=n_inputs, kernel_initializer='he_uniform', activation='relu'))
# model_nn.add(BatchNormalization())

# # The Hidden Layers :
# model_nn.add(Dense(256, kernel_initializer='he_uniform', activation='relu'))
# model_nn.add(BatchNormalization())
# model_nn.add(Dense(128, kernel_initializer='he_uniform',activation='relu')) 
# model_nn.add(BatchNormalization())

# # the output layer
# model_nn.add(Dense(n_outputs, activation='sigmoid'))
# model_nn.compile(loss='binary_crossentropy', optimizer='adam', metrics =['accuracy','binary_accuracy'])

# stop = EarlyStopping(monitor="val_loss", patience=3, min_delta=0.01)
# model_nn.fit(X_train_bow, y_train, validation_data=(X_test_bow,y_test), verbose=1, epochs=8, batch_size = 64, callbacks=[stop])

In [56]:
print("---------------------------------------------------")
clf = build_model_train(X_train_bow,y_train,model='lr')
# Ypred=clf.predict(X_test_bow)

---------------------------------------------------


**E. Clearly print Performance Metrics.**

In [57]:
predicted_labels = clf.predict(X_test_bow)
predicted_scores = clf.decision_function(X_test_bow)

**Get inverse transform for predicted labels and test labels**

In [58]:
pred_inversed = mlb.inverse_transform(predicted_labels)
y_test_inversed = mlb.inverse_transform(y_test)

**Print some samples**

In [59]:
for i in range(5):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test[i],
        ','.join(y_test_inversed[i]),
        ','.join(pred_inversed[i])
    ))

Title:	corinthians written keep company anyone named brother sexually immoral covetous idolater reviler drunkard extortioner even eat person sounds somewhat simple right wrong applying verse life hard sister fits things need figure obedient much time okay spend parents supposed ostricize family know uncle mostly also cousins probably claim many people christians yet drunks go family functions supposed like grandad pretend like going going seek
True labels:	24,Sagittarius,female,indUnk
Predicted labels:	15,male


Title:	moved jersey city nearly month ago idea would packing camp original homeland philippines well perhaps homeland away homeland new jersey familiarizing new neighborhood strolled west side avenue dumbfounded see large wooden sign overhead local store carved lettering filipinas market wait filipina even perfume called filipina extraordinary woman arranged adoption back gave years ago birthday use special occasions hoping make last bottle bone dry even like puny little puddle

Calculate accuracy

*   Accuracy
*   F1-score
*   Precision
*   Recall


In [60]:

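def print_evaluation_scores(y_val, predicted):
    # For multilabel input, accuracy_score is subset accuracy:
    # a sample is correct only if all of its labels match.
    print('Accuracy score: ', accuracy_score(y_val, predicted))
    print('F1 score: ', f1_score(y_val, predicted, average='micro'))
    # Caution: average_precision_score is meant for continuous scores (e.g. decision_function);
    # on hard 0/1 predictions it is not the plain precision.
    print('Average precision score: ', average_precision_score(y_val, predicted, average='micro'))
    print('Average recall score: ', recall_score(y_val, predicted, average='micro'))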

In [61]:
print('Bag-of-words')
print_evaluation_scores(y_test, predicted_labels)

Bag-of-words
Accuracy score:  0.07182016336775358
F1 score:  0.4371790566254798
Average precision score:  0.2385042540487163
Average recall score:  0.34513272712594584
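
The accuracy above is low (about 0.07) largely because, for multilabel targets, `accuracy_score` only counts a post as correct when every one of its labels is predicted exactly. The sketch below illustrates this on made-up arrays, and also shows `precision_score` and `roc_auc_score`, which the task hint asks for. The `toy_*` arrays are assumptions for illustration; `toy_scores` stands in for continuous output such as that of `decision_function`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Two samples, three labels (made-up values).
toy_true = np.array([[1, 0, 1],
                     [0, 1, 1]])
toy_pred = np.array([[1, 0, 0],
                     [0, 1, 1]])
# Hypothetical continuous scores, e.g. from decision_function.
toy_scores = np.array([[0.9, 0.2, 0.4],
                       [0.1, 0.8, 0.7]])

subset_acc = accuracy_score(toy_true, toy_pred)                    # exact-match ratio
micro_prec = precision_score(toy_true, toy_pred, average='micro')  # TP / (TP + FP) over all labels
micro_rec = recall_score(toy_true, toy_pred, average='micro')      # TP / (TP + FN) over all labels
micro_auc = roc_auc_score(toy_true, toy_scores, average='micro')   # needs scores, not 0/1 labels

print(subset_acc, micro_prec, micro_rec, micro_auc)
```

Here one missed label out of four is enough to halve the subset accuracy, while micro-averaged recall only drops to three quarters.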


--------------------------------
****

4. Improve Performance of model. [14 Marks].  

    A. Experiment with other vectorisers. [4 Marks].  
    B. Build classifier Models using other algorithms than base model. [4 Marks].  
    C. Tune Parameters/Hyperparameters of the model/s. [4 Marks].  
    D. Clearly print Performance Metrics. [2 Marks].  
        Hint: Accuracy, Precision, Recall, ROC-AUC.  


**A. Experiment with other vectorisers.**

**B. Build classifier Models using other algorithms than base model.**

**D. Clearly print Performance Metrics**



**For "Build classifier Models using other algorithms than base model", we have already declared a function that supports different classification models.**

**For "Clearly print Performance Metrics", we declare the functions here and use them at a later stage.**

In [35]:
def display_metrics_micro(Ytest, Ypred):
    print('Accuracy score: ', accuracy_score(Ytest, Ypred))
    print('F1 score: Micro', f1_score(Ytest, Ypred, average='micro'))
    print('Average precision score: Micro', average_precision_score(Ytest, Ypred, average='micro'))
    print('Average recall score: Micro', recall_score(Ytest, Ypred, average='micro'))
    
    
def display_metrics_macro(Ytest, Ypred):
    print('Accuracy score: ', accuracy_score(Ytest, Ypred))
    print('F1 score: Macro', f1_score(Ytest, Ypred, average='macro'))
    print('Average recall score: MAcro', recall_score(Ytest, Ypred, average='macro'))
    
def display_metrics_weighted(Ytest, Ypred):
    print('Accuracy score: ', accuracy_score(Ytest, Ypred))
    print('F1 score: weighted', f1_score(Ytest, Ypred, average='weighted'))
    print('Average precision score: weighted', average_precision_score(Ytest, Ypred, average='weighted'))
    print('Average recall score: weighted', recall_score(Ytest, Ypred, average='weighted'))

**Using other vectorizer : TfidfVectorizer**

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer
# create the transform
tfidf = TfidfVectorizer(stop_words= 'english')
# tokenize and build vocab
tfidf.fit(X_train)
X_train_tfidf = tfidf.transform(X_train)
# summarize encoded vector
print(X_train_tfidf.shape)
#print(vector.toarray())

X_test_tfidf  = tfidf.transform(X_test)
print(X_test_tfidf.shape)

(545027, 557042)
(136257, 557042)


**C. Tune Parameters/Hyperparameters of the model/s.**

**D. Clearly print Performance Metrics.**

In [None]:
models = ['svm','nbayes']

print("CountVectorizer")
print("---------------------------------------------------")
model_svm = build_model_train(X_train_bow,y_train,model=models[0])  # already fitted inside build_model_train
Ypred=model_svm.predict(X_test_bow)
print("\n")
print(f"**displaying  metrics for the mode {models[0]}\n")
display_metrics_micro(y_test,Ypred)
print("\n")
print("\n")
display_metrics_macro(y_test,Ypred)
print("\n")
print("\n")
display_metrics_weighted(y_test,Ypred)
print("\n")
print("\n")
print("---------------------------------------------------")
model_nb = build_model_train(X_train_bow,y_train,model=models[1])  # already fitted inside build_model_train
Ypred=model_nb.predict(X_test_bow)
print("\n")
print(f"**displaying  metrics for the mode {models[1]}\n")
display_metrics_micro(y_test,Ypred)
print("\n")
print("\n")
display_metrics_macro(y_test,Ypred)
print("\n")
print("\n")
display_metrics_weighted(y_test,Ypred)
print("\n")
print("\n")
print("---------------------------------------------------")


CountVectorizer
---------------------------------------------------


CountVectorizer

---------------------------------------------------

      displaying  metrics for the mode OneVsRestClassifier(estimator=LogisticRegression(penalty='l1',
                                                      solver='liblinear'))

      Accuracy score:  0.14985
      F1 score: Micro 0.5055058388749826
      Average precision score: Micro 0.30164968347917864
      Average recall score: Micro 0.39211666666666667




      Accuracy score:  0.14985
      F1 score: Macro 0.2228209587635938
      Average recall score: MAcro 0.1610025184789498




      Accuracy score:  0.14985
      F1 score: weighted 0.4857228587961707
      Average precision score: weighted 0.40477645040650023
      Average recall score: weighted 0.39211666666666667




      **displaying  metrics for the mode OneVsRestClassifier(estimator=LinearSVC(dual=False, penalty='l1'))

      Accuracy score:  0.13255
      F1 score: Micro 0.47161826937107837
      Average precision score: Micro 0.277235380957292
      Average recall score: Micro 0.35013333333333335




      Accuracy score:  0.13255
      F1 score: Macro 0.22091840140454916
      Average recall score: MAcro 0.15200668106809853




      Accuracy score:  0.13255
      F1 score: weighted 0.45544778684081505
      Average precision score: weighted 0.39442506017285967
      Average recall score: weighted 0.35013333333333335

**TfidfVectorizer weights each word count by how often the word appears across all documents. Words that occur in most documents are penalized (down-weighted), while rarer, more distinctive words receive more weight.**
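
A small sketch of this weighting on two made-up documents: the word "coke" occurs in both, so within the first document it ends up with a lower TF-IDF weight than "tune", which occurs only there. The `toy_*` names are assumptions for illustration, and the vectorizer uses its default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["coke tune", "coke jingle"]  # "coke" occurs in every document

toy_tfidf = TfidfVectorizer()
toy_weights = toy_tfidf.fit_transform(toy_docs).toarray()

col = toy_tfidf.vocabulary_              # term -> column index
coke_w = toy_weights[0, col["coke"]]
tune_w = toy_weights[0, col["tune"]]
print(f"coke={coke_w:.3f}  tune={tune_w:.3f}")
```

By default each row is also scaled to unit length, so the weights are comparable across documents of different sizes.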

In [37]:
models = ['svm','nbayes']

print("TFIDF Vectorizer")
print("---------------------------------------------------")
model_svm_tf = build_model_train(X_train_tfidf,y_train,model=models[0])  # already fitted inside build_model_train
Ypred=model_svm_tf.predict(X_test_tfidf)
print("\n")
print(f"**displaying  metrics for the mode {models[0]}\n")
display_metrics_micro(y_test,Ypred)
print("\n")
print("\n")
display_metrics_macro(y_test,Ypred)
print("\n")
print("\n")
display_metrics_weighted(y_test,Ypred)
print("\n")
print("\n")
print("---------------------------------------------------")
model_nb_tf = build_model_train(X_train_tfidf,y_train,model=models[1])  # already fitted inside build_model_train
Ypred=model_nb_tf.predict(X_test_tfidf)
print("\n")
print(f"**displaying  metrics for the mode {models[1]}\n")
display_metrics_micro(y_test,Ypred)
print("\n")
print("\n")
display_metrics_macro(y_test,Ypred)
print("\n")
print("\n")
display_metrics_weighted(y_test,Ypred)
print("\n")
print("\n")
print("---------------------------------------------------")


TFIDF Vectorizer
---------------------------------------------------


**displaying  metrics for the mode svm

Accuracy score:  0.08421585680001761
F1 score: Micro 0.4277860186046273
Average precision score: Micro 0.2517814857987266
Average recall score: Micro 0.30636774624422963




Accuracy score:  0.08421585680001761
F1 score: Macro 0.21615437802168977
Average recall score: MAcro 0.13589071571661857




Accuracy score:  0.08421585680001761
F1 score: weighted 0.3757982506463929
Average precision score: weighted 0.3234950205751122
Average recall score: weighted 0.30636774624422963




---------------------------------------------------


**displaying  metrics for the mode nbayes

Accuracy score:  0.002010905861717196
F1 score: Micro 0.289295070945006
Average precision score: Micro 0.1668575876669615
Average recall score: Micro 0.18315205824287928




Accuracy score:  0.002010905861717196
F1 score: Macro 0.023775957303408334
Average recall score: MAcro 0.020663523854534106




Accuracy

TFIDF Vectorizer

---------------------------------------------------

    Accuracy score:  0.178333
    F1 score: Micro 0.49345345337107837
    Average precision score: Micro 0.82
    Average recall score: Micro 0.35013333333333335


    Accuracy score:  0.178333
    F1 score: Macro 0.11043535354916
    Average precision score: weighted 0.30442506017285967
    Average recall score: MAcro 0.92005644666809853


    Accuracy score:  0.178333
    F1 score: weighted 0.401234567505
    Average precision score: weighted 0.71442506017285967
    Average recall score: weighted 0.3501356733333335

In [48]:
model_svm_tf.get_params().keys()

dict_keys(['estimator__C', 'estimator__class_weight', 'estimator__dual', 'estimator__fit_intercept', 'estimator__intercept_scaling', 'estimator__loss', 'estimator__max_iter', 'estimator__multi_class', 'estimator__penalty', 'estimator__random_state', 'estimator__tol', 'estimator__verbose', 'estimator', 'n_jobs'])

In [56]:
model_svm_tf

OneVsRestClassifier(estimator=LinearSVC(dual=False, penalty='l1'))

In [57]:
model_nb_tf

OneVsRestClassifier(estimator=MultinomialNB())

**We wrapped each model in a one-vs-rest classifier for tuning and looked for the best parameters for each model.**

**For lr, the best score is given by solver = liblinear, penalty = l1, and C = 1, since this combination has the best accuracy.**
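
For reference, a tuning run over the `estimator__C` key listed above could be sketched as follows. This is an illustration under stated assumptions, not the search that produced the results here: the synthetic data from `make_multilabel_classification`, the three-value grid, the `f1_micro` scoring, and `cv=3` are all placeholders chosen so that the example runs in seconds.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Small synthetic multilabel problem standing in for the vectorized blog data.
X_toy, y_toy = make_multilabel_classification(
    n_samples=120, n_features=20, n_classes=4, random_state=0
)

base = OneVsRestClassifier(LogisticRegression(penalty='l1', solver='liblinear'))

# Parameters of the wrapped estimator are addressed with the 'estimator__' prefix.
param_grid = {'estimator__C': [0.1, 1.0, 10.0]}  # illustrative grid only

search = GridSearchCV(base, param_grid, scoring='f1_micro', cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

On the full dataset each grid point retrains one classifier per label for every fold, which is why a small grid or `RandomizedSearchCV` is the practical choice.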

---------------------------
****
5. Share insights on relative performance comparison [8 Marks].  

    A. Which vectorizer performed better? Probable reason?. [2 Marks].   
    B. Which model outperformed? Probable reason? [2 Marks].   
    C. Which parameter/hyperparameter significantly helped to improve performance?Probable reason?. [2 Marks].    
    D. According to you, which performance metric should be given most importance, why?. [2 Marks]. 

# **ANSWER**

## A.
**As we see in the above results, of the two vectorizers we tried (CountVectorizer and TfidfVectorizer), TF-IDF has shown the best result.**

**TfidfVectorizer weights each word count by how often the word appears across all documents. Words that occur in most documents are penalized (down-weighted), while rarer, more distinctive words receive more weight.**


**Word2Vec and GloVe were also tried, representing sentences with word embeddings, but they gave even poorer results and are therefore not included in the code.**

## B 

**Among the various models we tried, the logistic regression model was better than SVC and Naive Bayes in the above scenario.**

**As we see below, logistic regression has better accuracy, F1 score, recall and precision than SVC and NB.**


          displaying  metrics for the mode OneVsRestClassifier(estimator=LogisticRegression(penalty='l1',
                                                          solver='liblinear'))

          Accuracy score:  0.14985
          F1 score: Micro 0.5055058388749826
          Average precision score: Micro 0.30164968347917864
          Average recall score: Micro 0.39211666666666667




          Accuracy score:  0.14985
          F1 score: Macro 0.2228209587635938
          Average recall score: MAcro 0.1610025184789498




          Accuracy score:  0.14985
          F1 score: weighted 0.4857228587961707
          Average precision score: weighted 0.40477645040650023
          Average recall score: weighted 0.39211666666666667




          **displaying  metrics for the mode OneVsRestClassifier(estimator=LinearSVC(dual=False, penalty='l1'))

          Accuracy score:  0.13255
          F1 score: Micro 0.47161826937107837
          Average precision score: Micro 0.277235380957292
          Average recall score: Micro 0.35013333333333335




          Accuracy score:  0.13255
          F1 score: Macro 0.22091840140454916
          Average recall score: MAcro 0.15200668106809853




          Accuracy score:  0.13255
          F1 score: weighted 0.45544778684081505
          Average precision score: weighted 0.39442506017285967
          Average recall score: weighted 0.35013333333333335

## C

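**We used a one-vs-rest classifier for tuning, and the best parameters were found for each model.**

**We tried different parameter values during tuning. Because of the long execution time and limited CPU on Google Colab Pro (more than 4-5 hours for a single model), we could only check a limited set of parameters.**

**The best score was given by solver = liblinear, penalty = l1, and C = 1 (the logistic regression configuration shown in the metrics above), since this combination had the best accuracy. A probable reason is that the L1 penalty drives the weights of uninformative words to zero, which suits sparse, high-dimensional TF-IDF features.**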
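
As an illustration of how such a search can be set up, here is a minimal sketch using scikit-learn's GridSearchCV around a one-vs-rest logistic regression. It is not the exact search that was run: the synthetic data, the grid values, the 3-fold split, and the micro-F1 scoring are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multilabel data standing in for the TF-IDF features (assumption)
X, y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0
)

# liblinear supports both the l1 and the l2 penalty
base = OneVsRestClassifier(LogisticRegression(solver="liblinear"))

# Parameters of the wrapped model are addressed with the "estimator__" prefix
param_grid = {
    "estimator__penalty": ["l1", "l2"],
    "estimator__C": [0.1, 1, 10],
}

search = GridSearchCV(base, param_grid, scoring="f1_micro", cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping the grid small matters here because every combination is refitted once per fold and once per label, which is what makes tuning on the full corpus so slow.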



## D

In multiclass and multilabel classification tasks, the notions of precision, recall, and F-measure can be applied to each label independently.

The classification report displays the precision, recall, F1, and support scores for the model.

Precision: Precision is the ability of a classifier not to label an instance positive that is actually negative. For each class it is defined as the ratio of true positives to the sum of true and false positives. Said another way, “for all instances classified positive, what percent was correct?”

Recall: Recall is the ability of a classifier to find all positive instances. For each class it is defined as the ratio of true positives to the sum of true positives and false negatives. Said another way, “for all instances that were actually positive, what percent was classified correctly?”

F1-Score: The F1 score is the harmonic mean of precision and recall, such that the best score is 1.0 and the worst is 0.0. Generally speaking, F1 scores are lower than accuracy measures as they embed precision and recall into their computation. As a rule of thumb, the weighted average of F1 should be used to compare classifier models, not global accuracy.

Similar to the arithmetic mean, the F1-score always lies somewhere between precision and recall. But it behaves differently: the F1-score gives a larger weight to the lower number. For example, when precision is 100% and recall is 0%, the F1-score is 0%, not 50%. Or say that Classifier A has precision = recall = 80%, and Classifier B has precision = 60% and recall = 100%. The arithmetic mean of precision and recall is the same for both models, but with F1’s harmonic mean formula the score for Classifier A is 80% and for Classifier B it is only 75%. Classifier B’s low precision pulled down its F1-score.
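
The Classifier A and Classifier B figures can be checked with a few lines of Python. This is a small sketch of the standard formula F1 = 2PR / (P + R); the helper name `f1_score` is chosen here for illustration and is not the scikit-learn function of the same name.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Classifier A and Classifier B from the text above
f1_a = f1_score(0.80, 0.80)
f1_b = f1_score(0.60, 1.00)

# Perfect precision cannot compensate for zero recall
f1_zero_recall = f1_score(1.00, 0.00)

print(round(f1_a, 2), round(f1_b, 2), f1_zero_recall)
```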

Support: Support is the number of actual occurrences of the class in the specified dataset. Imbalanced support in the training data may indicate structural weaknesses in the reported scores of the classifier and could indicate the need for stratified sampling or rebalancing. Support doesn’t change between models but instead diagnoses the evaluation process.

Macro average: To compare models we combine the per-class F1-scores into a single number, the classifier’s overall F1-score. There are a few ways of doing that. The simplest is an arithmetic mean of the per-class F1-scores, called the macro-averaged F1-score, or macro-F1 for short. The macro-averaged precision and recall are computed in the same way. For an example with three classes (Cat, Fish, and Hen):

• Macro-F1 = (42.1% + 30.8% + 66.7%) / 3 = 46.5%  
• Macro-precision = (30.8% + 66.7% + 66.7%) / 3 = 54.7%  
• Macro-recall = (66.7% + 20.0% + 66.7%) / 3 = 51.1%

Weighted average: When computing the macro-F1, we gave equal weight to each class. We don’t have to do that: in the weighted-average F1-score, or weighted-F1, we weight the F1-score of each class by the number of samples from that class. In our example there are 25 samples in total: 6 Cat, 10 Fish, and 9 Hen. Weighted precision and weighted recall are computed similarly:

• Weighted-F1 = (6 × 42.1% + 10 × 30.8% + 9 × 66.7%) / 25 = 46.4%  
• Weighted-precision = (6 × 30.8% + 10 × 66.7% + 9 × 66.7%) / 25 = 58.1%  
• Weighted-recall = (6 × 66.7% + 10 × 20.0% + 9 × 66.7%) / 25 = 48.0%
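
The macro and weighted averages of the Cat / Fish / Hen example above can be reproduced with a short sketch. The per-class F1-scores and sample counts are taken from the text; the helper names are chosen for illustration.

```python
# Per-class F1-scores and sample counts from the Cat / Fish / Hen example
f1 = {"Cat": 0.421, "Fish": 0.308, "Hen": 0.667}
support = {"Cat": 6, "Fish": 10, "Hen": 9}

def macro_average(scores):
    # Every class counts equally, regardless of how many samples it has
    return sum(scores.values()) / len(scores)

def weighted_average(scores, support):
    # Each class contributes in proportion to its number of samples
    total = sum(support.values())
    return sum(scores[c] * support[c] for c in scores) / total

macro_f1 = macro_average(f1)
weighted_f1 = weighted_average(f1, support)

print(f"macro-F1 = {macro_f1:.1%}, weighted-F1 = {weighted_f1:.1%}")
```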

Micro average: The last variant is the micro-averaged F1-score, or micro-F1. To calculate the micro-F1, we first compute micro-averaged precision and micro-averaged recall over all the samples, and then combine the two. How do we “micro-average”? We simply look at all the samples together. Remember that precision is the proportion of True Positives out of the Predicted Positives (TP / (TP + FP)). In the multi-class case, we consider all the correctly predicted samples to be True Positives.
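
A minimal sketch of micro-averaging is shown below. The per-class counts of true positives, false positives, and false negatives are assumed values, chosen only so that they agree with the per-class percentages in the Cat / Fish / Hen example; they do not come from the models evaluated in Part One.

```python
# Assumed per-class counts (TP, FP, FN), consistent with the example percentages
counts = {
    "Cat":  {"tp": 4, "fp": 9, "fn": 2},
    "Fish": {"tp": 2, "fp": 1, "fn": 8},
    "Hen":  {"tp": 6, "fp": 3, "fn": 3},
}

# Micro-averaging pools the counts of all classes before computing the ratios
tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
fn = sum(c["fn"] for c in counts.values())

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print(micro_precision, micro_recall, micro_f1)
```

In a single-label multi-class problem like this example, every false positive for one class is a false negative for another, so micro-precision, micro-recall, and micro-F1 coincide. In a multilabel problem the pooled false positives and false negatives can differ, and so can the three scores.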

--------------------------------------------
---------------------------------------------

# PART TWO
-------------------------------
--------------------------------

• **DOMAIN:** Customer support

• **CONTEXT**: Great Learning has an academic support department which receives numerous support requests every day throughout the year. Teams are spread across geographies and try to provide support round the year. Sometimes, due to heavy workload, certain request resolutions are delayed, impacting the company’s business. Some of the requests are very generic, where a proper resolution procedure delivered to the user can solve the problem. The company is looking to design an automation which can interact with the user, understand the problem, and display the resolution procedure [if found as a generic request] or redirect the request to an actual human support executive if the request is complex or not in its database.

• **DATA DESCRIPTION:** A sample corpus is attached for your reference. Please enhance/add more data to the corpus using your linguistics skills.

• **PROJECT OBJECTIVE:** Design a Python-based interactive semi-rule-based chatbot which can do the following:
1. Start chat session with greetings and ask what the user is looking for. [5 Marks]
2. Accept dynamic text based questions from the user. Reply back with relevant answer from the designed corpus. [10 Marks]
3. End the chat session only if the user requests to end else ask what the user is looking for. Loop continues till the user asks to end it. [5 Marks]
Hint: There are many techniques for cleaning and preparing data that can be used to train an ML/DL classifier. Hence, you may need to experiment, research, self-learn, and implement the above classifier. There may be many iterations between hand-building the corpus and designing the best-fit text classifier. As the quality and quantity of the corpus increase, the model’s performance, i.e. its ability to answer questions correctly, also increases.
 Reference: https://www.mygreatlearning.com/blog/basics-of-building-an-artificial-intelligence-chatbot/

• **Evaluation:** Evaluator will use linguistics to twist and turn sentences to ask questions on the topics described in DATA DESCRIPTION and check if
the bot is giving relevant replies.

**Importing the packages**

In [10]:
import json 
import numpy as np 
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder

**Reading in the corpus**

For our example, we will use the provided file "GL Bot (1).json" as our chatbot corpus.



In [12]:
project_path = "/content/drive/My Drive/aiml/nlp/project1/"

with open(project_path+'GL Bot (1).json') as file:
    data = json.load(file)
    
training_sentences = []
training_labels = []
labels = []
responses = []


for intent in data['intents']:
    for pattern in intent['patterns']:
        training_sentences.append(pattern)
        training_labels.append(intent['tag'])
    responses.append(intent['responses'])
    
    if intent['tag'] not in labels:
        labels.append(intent['tag'])
        
num_classes = len(labels)

**The variable “training_sentences” holds all the training data (the sample messages in each intent category), and the variable “training_labels” holds the target label corresponding to each training sentence.**

**Then we use the “LabelEncoder()” class provided by scikit-learn to convert the target labels into integer ids that the model can work with.**
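
To show what the encoding step produces, here is a small pure-Python sketch that mimics the behaviour of LabelEncoder: the unique labels are sorted and numbered from 0. The intent tags used below are hypothetical and are not taken from the corpus file.

```python
# Hypothetical intent tags, one per training sentence
training_labels = ["Intro", "Exit", "Intro", "SL", "Exit"]

# Like LabelEncoder, sort the unique labels and number them from 0
classes = sorted(set(training_labels))
label_to_id = {label: idx for idx, label in enumerate(classes)}

# Forward mapping (transform) and backward mapping (inverse_transform)
encoded = [label_to_id[label] for label in training_labels]
decoded = [classes[idx] for idx in encoded]

print(classes)
print(encoded)
```

The backward mapping is what the chatbot needs at inference time to turn the predicted class id back into an intent tag.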

### Tokenisation



In [13]:
lbl_encoder = LabelEncoder()
lbl_encoder.fit(training_labels)
training_labels = lbl_encoder.transform(training_labels)


vocab_size = 1000
embedding_dim = 16
max_len = 20
oov_token = "<OOV>"

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(training_sentences)
padded_sequences = pad_sequences(sequences, truncating='post', maxlen=max_len)
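
The effect of the tokenizer and of padding can be illustrated with a simplified pure-Python sketch. It is an approximation under assumptions, not the Keras implementation: it splits on whitespace only, ignores the `num_words` limit and punctuation filtering, and the helper names and sample sentences are made up for illustration.

```python
from collections import Counter

OOV_ID = 1  # Keras gives index 1 to the OOV token; 0 is reserved for padding

def fit_word_index(sentences):
    # Most frequent words get the smallest indices, starting at 2
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common())}

def to_sequence(sentence, word_index):
    # Words never seen during fitting map to the OOV index
    return [word_index.get(word, OOV_ID) for word in sentence.lower().split()]

def pad(sequence, max_len):
    # Cut extra tokens from the end (truncating='post'),
    # then left-pad with zeros (the pad_sequences default, padding='pre')
    sequence = sequence[:max_len]
    return [0] * (max_len - len(sequence)) + sequence

word_index = fit_word_index(["how are you", "how does olympus work"])
seq = to_sequence("how does svm work", word_index)
padded = pad(seq, max_len=6)

print(seq)
print(padded)
```

The unseen word "svm" is mapped to the OOV index, and the sequence is left-padded with zeros so that every input has the same length.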

### Model Training

**Let’s define the neural network architecture for the proposed model, using the “Sequential” model class of Keras.**

In [14]:
model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=max_len))
model.add(GlobalAveragePooling1D())
model.add(Dense(16, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', 
              optimizer='adam', metrics=['accuracy'])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 16)            16000     
                                                                 
 global_average_pooling1d (G  (None, 16)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 16)                272       
                                                                 
 dense_1 (Dense)             (None, 16)                272       
                                                                 
 dense_2 (Dense)             (None, 8)                 136       
                                                                 
Total params: 16,680
Trainable params: 16,680
Non-trainable params: 0
____________________________________________________
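
The parameter counts in the summary can be verified by hand. The sketch below recomputes them from the layer sizes used above (vocabulary of 1000, embedding dimension 16, and the 8 output classes shown in the summary); the helper name is chosen for illustration.

```python
vocab_size, embedding_dim, num_classes = 1000, 16, 8

def dense_params(n_in, n_out):
    # One weight per input-output pair plus one bias per output unit
    return n_in * n_out + n_out

embedding = vocab_size * embedding_dim   # lookup table, no bias
dense = dense_params(embedding_dim, 16)  # pooling has no parameters
dense_1 = dense_params(16, 16)
dense_2 = dense_params(16, num_classes)

total = embedding + dense + dense_1 + dense_2
print(embedding, dense, dense_1, dense_2, total)
```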

**Now we are ready to train our model. We can simply call the “fit” method with the training data and labels.**

In [15]:
epochs = 500
history = model.fit(padded_sequences, np.array(training_labels), epochs=epochs)

Epoch 1/500
Epoch 2/500
...

**After training, it is better to save all the required files so they can be used at inference time. We save the trained model, the fitted tokenizer object, and the fitted label encoder object.**

In [16]:
# to save the trained model
model.save("chat_model")

import pickle

# to save the fitted tokenizer
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
# to save the fitted label encoder
with open('label_encoder.pickle', 'wb') as ecn_file:
    pickle.dump(lbl_encoder, ecn_file, protocol=pickle.HIGHEST_PROTOCOL)

INFO:tensorflow:Assets written to: chat_model/assets


**We now implement a chat function to engage with a real user. When a new user message is received, the chatbot tokenizes and pads it, uses the trained model to predict the most likely intent, and replies with a random response from that intent.**



In [22]:
import json 
import numpy as np
from tensorflow import keras
from sklearn.preprocessing import LabelEncoder

import colorama 
colorama.init()
from colorama import Fore, Style, Back

import random
import pickle

with open(project_path+'GL Bot (1).json') as file:
    data = json.load(file)


def chat():
    # load trained model
    model = keras.models.load_model('chat_model')

    # load tokenizer object
    with open('tokenizer.pickle', 'rb') as handle:
        tokenizer = pickle.load(handle)

    # load label encoder object
    with open('label_encoder.pickle', 'rb') as enc:
        lbl_encoder = pickle.load(enc)

    # parameters
    max_len = 20
    
    while True:
        print(Fore.LIGHTBLUE_EX + "User: " + Style.RESET_ALL, end="")
        inp = input()
        if inp.lower() == "quit":
            break

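        # Tokenize and pad the message the same way as the training data, then predict
        result = model.predict(keras.preprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences([inp]),
                                             truncating='post', maxlen=max_len))
        # inverse_transform returns an array; take its single element as the tag string
        tag = lbl_encoder.inverse_transform([np.argmax(result)])[0]

        # Reply with a random response from the predicted intent
        for i in data['intents']:
            if i['tag'] == tag:
                print(Fore.GREEN + "ChatBot:" + Style.RESET_ALL , np.random.choice(i['responses']))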

        # print(Fore.GREEN + "ChatBot:" + Style.RESET_ALL,random.choice(responses))

print(Fore.YELLOW + "Hi!! I am the assistant for Great Learning. Start messaging with the bot (type quit to stop)!" + Style.RESET_ALL)
chat()

Hi!! I am the assistant for Great Learning. Start messaging with the bot (type quit to stop)!
User: Hi
ChatBot: Hello! how can i help you ?
User: explain me how olympus works
ChatBot: Link: Olympus wiki
User: i am not able to understand svm
ChatBot: Link: Machine Learning wiki 
User: understand svm
ChatBot: Link: Machine Learning wiki 
User: quit


**Chatbot is working as expected**
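
The problem statement also asks the bot to redirect requests it cannot handle to a human support executive, while the chat loop above always answers with the top-scoring intent. One possible extension, sketched here under assumptions, is to check the softmax confidence before answering. The function name `pick_tag`, the 0.6 threshold, and the intent tags are hypothetical and would need tuning against the real corpus.

```python
def pick_tag(probabilities, labels, threshold=0.6):
    # Return the most likely label, or None when the model is not confident enough
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] < threshold:
        return None
    return labels[best]

labels = ["Intro", "Olympus", "SL"]  # hypothetical intent tags

confident = pick_tag([0.05, 0.90, 0.05], labels)
unsure = pick_tag([0.40, 0.35, 0.25], labels)

for tag in (confident, unsure):
    if tag is None:
        print("ChatBot: I am redirecting you to a human support executive.")
    else:
        print("ChatBot: answering from intent", tag)
```

In the chat loop, the probabilities would be the first row of the model's prediction and the labels would be `lbl_encoder.classes_`.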

------------------------------------

# END

------------------------------------