# Assignment P2: Response Type Classification in Discussions
This notebook illustrates the Assignment P2 of CSC 791 Natural Language Processing Fall 2020. In this assignment, you will learn feature engineering through a classification task. Background Interactions through question-answering play an important role in discussions. Through questioning, askers may want to elicit information (e.g., wh-questions), clarify situations (e.g., closed-ended questions), or even make a point (e.g., rhetorical questions). However, how a question is responded does not necessarily align with the intent of the asker.


In [None]:
#Download Spacy 
!python -m spacy download en_core_web_md
!python3 -m spacy download en_core_web_sm
!python -m spacy.en.download all

Collecting en_core_web_md==2.2.5
[?25l  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4MB)
[K     |████████████████████████████████| 96.4MB 1.2MB/s 
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... [?25l[?25hdone
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-cp36-none-any.whl size=98051305 sha256=c3acd99ec4c16404e352843d12b4630754defb9f9eaa1ade5b80e3bb722ad14d
  Stored in directory: /tmp/pip-ephem-wheel-cache-vq2l8ms3/wheels/df/94/ad/f5cf59224cea6b5686ac4fd1ad19c8a07bc026e13c36502d81
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_md')
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')
/us

In [3]:
# Upload Testing and Training datasets
# Please select both the files: p2_train and p2_test

from google.colab import files
uploaded = files.upload()

Saving p2_test.csv to p2_test.csv
Saving p2_train.csv to p2_train.csv


In [4]:
import pandas as pd
import numpy as np
import io

In [5]:
# Load Training and Testing data
df_train = pd.read_csv(io.BytesIO(uploaded['p2_train.csv']))
df_test  = pd.read_csv(io.BytesIO(uploaded['p2_test.csv']))

In [6]:
print(df_train.shape)
print(df_test.shape)

(1640, 10)
(410, 10)


In [7]:
df_test.head()

Unnamed: 0,thread_id,question_id,response_id,no_turn_q_id,quoted_q_id,precedent,question,subsequent,response,type
0,232fcl,232fcl,cgsrxd3,,q_27319,,Now if Julie was underage (let's say you and ...,,"\nNo. She's happy, he's happy, who am I to end...",irrelevant
1,221bir,cgikp7a,cginbgy,n_228942,q_28801,"And Egypt...there was a lot of chaos, a lot of...",Did the protests motivate the military to act?,Undoubtedly. But it was still the military tha...,Seems a weird example to use in defence of an ...,attacked
2,20s1x1,cg6ay9r,cg6f18i,n_244708,q_30570,&gt; Hello! Both my major AND my minor are due...,Can you illustrate how a single class in CS w...,Your statement that way more happiness is ach...,"Oh gods yes. First of all, most students aren'...",answered
3,1zyf3k,cfzfpej,cfzg0hy,n_254761,q_31795,"Oh, come on, now you're just trying to misunde...","Didn't I even say, that I am completely and 10...",Let me make it very clear: I am an universal ...,"Sure, but you still obviously have some precon...",irrelevant
4,1yfntu,cfka1pw,cfkar3s,,q_34113,,However if we forget that debate for a minute...,,I'd be willing to guess that this question is ...,attacked


In [8]:
# Analyzing the dataset
df_train.head()
classification = {'agreed':0, 'answered':1, 'attacked':2, 'irrelevant':3}

df_train['class'] = df_train.apply(lambda i : classification[i['type']], axis=1)
df_test['class'] = df_test.apply(lambda i : classification[i['type']], axis=1)

df_train['class'] = df_train['type']
df_train.groupby(by = ["type"]).count()["question"]

type
agreed         61
answered      994
attacked      299
irrelevant    286
Name: question, dtype: int64

In [9]:
# Cleaning Text

def init():

  import html 
  esc = ['&amp;', '&lt;', '&quot;', '&#x27;', '&gt;', '<>']

  for index, row in df_train.iterrows():
    for e in esc:
      if(type(row['precedent']) == str):
        row['precedent']  = row['precedent'].replace(e, "")
      if(type(row['question']) == str):
        row['question']   = row['question'].replace(e, "")
      if(type(row['subsequent']) == str):
        row['subsequent'] = row['subsequent'].replace(e, "")
      if(type(row['response']) == str):
        row['response']   = row['response'].replace(e, "")

  for index, row in df_test.iterrows():
    for e in esc:
      if(type(row['precedent']) == str):
        row['precedent']  = row['precedent'].replace(e, "")
      if(type(row['question']) == str):
        row['question']   = row['question'].replace(e, "")
      if(type(row['subsequent']) == str):
        row['subsequent'] = row['subsequent'].replace(e, "")
      if(type(row['response']) == str):
        row['response']   = row['response'].replace(e, "")

  df_train["clean_quest"] = df_train["precedent"] + df_train["question"] + df_train["subsequent"] + df_train['response']
  df_test["clean_quest"]  = df_test["precedent"] + df_test["question"] + df_test["subsequent"] + df_train['response']

  X_train, Y_train = df_train['clean_quest'], df_train['type']
  X_test,  Y_test  = df_test['clean_quest'], df_test['type']

  X_train = list(X_train)
  X_test  = list(X_test)
  Y_train = list(Y_train)
  Y_test  = list(Y_test)

In [10]:
# Cleaning Text


import html 
esc = ['&amp;', '&lt;', '&quot;', '&#x27;', '&gt;', '<>']

for index, row in df_train.iterrows():
  for e in esc:
    if(type(row['precedent']) == str):
      row['precedent']  = row['precedent'].replace(e, "")
    if(type(row['question']) == str):
      row['question']   = row['question'].replace(e, "")
    if(type(row['subsequent']) == str):
      row['subsequent'] = row['subsequent'].replace(e, "")
    if(type(row['response']) == str):
      row['response']   = row['response'].replace(e, "")

for index, row in df_test.iterrows():
  for e in esc:
    if(type(row['precedent']) == str):
      row['precedent']  = row['precedent'].replace(e, "")
    if(type(row['question']) == str):
      row['question']   = row['question'].replace(e, "")
    if(type(row['subsequent']) == str):
      row['subsequent'] = row['subsequent'].replace(e, "")
    if(type(row['response']) == str):
      row['response']   = row['response'].replace(e, "")

df_train["clean_quest"] = df_train["precedent"] + df_train["question"] + df_train["subsequent"] + df_train['response']
df_test["clean_quest"]  = df_test["precedent"] + df_test["question"] + df_test["subsequent"] + df_train['response']

X_train, Y_train = df_train['clean_quest'], df_train['type']
X_test,  Y_test  = df_test['clean_quest'], df_test['type']

X_train = list(X_train)
X_test  = list(X_test)
Y_train = list(Y_train)
Y_test  = list(Y_test)

# Feature Extraction

In [11]:
init()
# Sentence Embedding

# Load Spacy model
import spacy.cli

spacy.cli.download("en_core_web_lg")
nlp = spacy.load('en_core_web_lg')

[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_lg')


In [12]:
# For Spacy word2vec
def createVector(sent):
  vect = nlp(sent)
  avg = np.zeros(300)
  # print(vect)
  
  # Calculate average vector
  for token in vect: 
    avg += token.vector
  vect_avg = avg/len(vect)

  return vect_avg.tolist()

In [13]:
# Apply Transformation
df_train['vector'] = df_train.apply(lambda i : createVector(str(i['clean_quest'])), axis=1)
df_test['vector']  = df_test.apply(lambda i  : createVector(str(i['clean_quest'])), axis=1)

In [14]:
# Trasforming Vector info

X_train = np.array(df_train['vector'].values.tolist())
X_test  = np.array(df_test['vector'].values.tolist())


In [15]:
from sklearn.svm import LinearSVC
classifier = LinearSVC()
classifier.fit(X_train, Y_train)

prediction1 = classifier.predict(X_test)



In [16]:
def accuracy(Y_test, prediction):
  same = 0 
  for i,j in zip(Y_test, prediction):
    if i==j:
      same += 1
  return (same/len(Y_test))

In [17]:
# Temporary splitting of testing and training data
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(Y_test, prediction1))
print(classification_report(Y_test, prediction1))
print("Accuracy: ", accuracy(Y_test, prediction1))

[[  0  13   0   0]
 [  0 310   7   3]
 [  0  37   2   0]
 [  0  37   1   0]]
              precision    recall  f1-score   support

      agreed       0.00      0.00      0.00        13
    answered       0.78      0.97      0.86       320
    attacked       0.20      0.05      0.08        39
  irrelevant       0.00      0.00      0.00        38

    accuracy                           0.76       410
   macro avg       0.25      0.26      0.24       410
weighted avg       0.63      0.76      0.68       410

Accuracy:  0.7609756097560976


  _warn_prf(average, modifier, msg_start, len(result))


In [18]:
# Install vaderSentiment
!pip install vaderSentiment

Collecting vaderSentiment
[?25l  Downloading https://files.pythonhosted.org/packages/76/fc/310e16254683c1ed35eeb97386986d6c00bc29df17ce280aed64d55537e9/vaderSentiment-3.3.2-py2.py3-none-any.whl (125kB)
[K     |██▋                             | 10kB 12.9MB/s eta 0:00:01[K     |█████▏                          | 20kB 1.7MB/s eta 0:00:01[K     |███████▉                        | 30kB 2.2MB/s eta 0:00:01[K     |██████████▍                     | 40kB 2.5MB/s eta 0:00:01[K     |█████████████                   | 51kB 2.0MB/s eta 0:00:01[K     |███████████████▋                | 61kB 2.2MB/s eta 0:00:01[K     |██████████████████▏             | 71kB 2.5MB/s eta 0:00:01[K     |████████████████████▉           | 81kB 2.7MB/s eta 0:00:01[K     |███████████████████████▍        | 92kB 2.9MB/s eta 0:00:01[K     |██████████████████████████      | 102kB 2.7MB/s eta 0:00:01[K     |████████████████████████████▋   | 112kB 2.7MB/s eta 0:00:01[K     |███████████████████████████████▏| 12

In [19]:
# Sentiment Analysis

init()

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import re
import string

analyzer = SentimentIntensityAnalyzer()

def sentiment_analyzer_scores(text):
    score = analyzer.polarity_scores(text)
    return score['compound']

# {'neg': 0.197, 'neu': 0.754, 'pos': 0.049, 'compound': -0.9764}: Example Output

In [20]:
X_train, X_test = [], []

import six

for i in df_train['clean_quest']:
  if isinstance(i, six.string_types):
    X_train.append([sentiment_analyzer_scores(i)])
  else:
    X_train.append([0])

for i in df_test['clean_quest']:
  if isinstance(i, six.string_types):
    X_test.append([sentiment_analyzer_scores(i)])
  else:
    X_test.append([0])


X_train = np.array(X_train)
X_test = np.array(X_test)

In [21]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, Y_train)
prediction2 = classifier.predict(X_test)


In [22]:
# Temporary splitting of testing and training data
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(Y_test, prediction2))
print(classification_report(Y_test, prediction2))
print("Accuracy: ", accuracy(Y_test, prediction2))

[[  0  12   1   0]
 [  0 304  14   2]
 [  1  35   2   1]
 [  1  35   2   0]]
              precision    recall  f1-score   support

      agreed       0.00      0.00      0.00        13
    answered       0.79      0.95      0.86       320
    attacked       0.11      0.05      0.07        39
  irrelevant       0.00      0.00      0.00        38

    accuracy                           0.75       410
   macro avg       0.22      0.25      0.23       410
weighted avg       0.62      0.75      0.68       410

Accuracy:  0.7463414634146341


In [23]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
from nltk import word_tokenize
from nltk.data import load


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


In [24]:
# POS Tagging
# Creating a one-hot encoder vector to represent the information

init()

tagdict = load('help/tagsets/upenn_tagset.pickle')
pos_tags_list = list(tagdict.keys())
pos_dict = dict.fromkeys(pos_tags_list, 0)

X_train, X_test = [], []

def tagger(df):
  res = []
  for sen in df['clean_quest']:
    tags = nltk.pos_tag(word_tokenize(str(sen)))
    sen_dict = pos_dict.copy()
    for w,t in tags:
      if t in sen_dict:
        sen_dict[t] += 1
    res.append(list(sen_dict.values()))
  return res

X_train, X_test = np.array(tagger(df_train)), np.array(tagger(df_test))


In [25]:
classifier = LinearSVC()
classifier.fit(X_train, Y_train)

prediction3 = classifier.predict(X_test)



In [26]:
# Temporary splitting of testing and training data
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(Y_test, prediction3))
print(classification_report(Y_test, prediction3))
print("Accuracy: ", accuracy(Y_test, prediction3))

[[  0  13   0   0]
 [  0 315   1   4]
 [  0  39   0   0]
 [  0  37   0   1]]
              precision    recall  f1-score   support

      agreed       0.00      0.00      0.00        13
    answered       0.78      0.98      0.87       320
    attacked       0.00      0.00      0.00        39
  irrelevant       0.20      0.03      0.05        38

    accuracy                           0.77       410
   macro avg       0.24      0.25      0.23       410
weighted avg       0.63      0.77      0.68       410

Accuracy:  0.7707317073170732


  _warn_prf(average, modifier, msg_start, len(result))


In [40]:
# Self defined Decission Tree Classifier

# Using the sentiment score to check for discussions which lead classify as 'attacked'
# Using lexicon based approach after reviewing elements in the training data set for the other 3 classes

init()

sentiment_score, prediction4 = [], []
agreed = ['yup', 'yes','correct','rational','sure','absolutely','agree']
answered = ['read','actually','fact','think']

for i in df_test['clean_quest']:
  if isinstance(i, six.string_types):
    sentiment_score.append([sentiment_analyzer_scores(i)])
  else:
    sentiment_score.append([0])

def check_if_exists( sentence, lst):
  for i in lst:
    if isinstance(sentence, six.string_types):
      if i in sentence:
        return True
  return False

i=0
for index, row in df_test.iterrows():
  if sentiment_score[i][0] < (-0.5):
    prediction4.append('attacked')
  elif check_if_exists( row['clean_quest'], agreed):
    prediction4.append('agreed') 
  elif check_if_exists( row['clean_quest'], answered):
    prediction4.append('answered')
  else:
    prediction4.append('irrelevant')
  i += 1


In [41]:
print(confusion_matrix(Y_test, prediction4))
print(classification_report(Y_test, prediction4))
print("Accuracy: ", accuracy(Y_test, prediction4))

[[  1   3   8   1]
 [ 37  55  73 155]
 [  6   3  13  17]
 [  2   6   7  23]]
              precision    recall  f1-score   support

      agreed       0.02      0.08      0.03        13
    answered       0.82      0.17      0.28       320
    attacked       0.13      0.33      0.19        39
  irrelevant       0.12      0.61      0.20        38

    accuracy                           0.22       410
   macro avg       0.27      0.30      0.18       410
weighted avg       0.66      0.22      0.26       410

Accuracy:  0.22439024390243903


# Ensemble Voting Classifier

In [51]:
import operator
from sklearn.ensemble import VotingClassifier

classification = {'agreed':0, 'answered':1, 'attacked':2, 'irrelevant':3}
predictionA, predictionB = [], []

for i in range(len(prediction1)):
  result1 = {'agreed':0, 'answered':0, 'attacked':0, 'irrelevant':0}
  result2 = {'agreed':0, 'answered':0, 'attacked':0, 'irrelevant':0}

  result1[prediction1[i]] += 0.5
  result1[prediction2[i]] += 0.6
  result1[prediction3[i]] += 0.3
  result1[prediction4[i]] += 0.15

  max_value = max(result1.values())
  for cls in result1.keys():
    if( result1[cls] == max_value):
      predictionA.append(cls)

  result2[prediction1[i]] += 0.5
  result2[prediction3[i]] += 0.4

  max_value = max(result2.values())
  for cls in result2.keys():
    if( result2[cls] == max_value):
      predictionB.append(cls)


In [52]:
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(Y_test, predictionB))
print(classification_report(Y_test, predictionB))
print("Accuracy: ", accuracy(Y_test, predictionB))

[[  0  13   0   0]
 [  0 310   7   3]
 [  0  37   2   0]
 [  0  37   1   0]]
              precision    recall  f1-score   support

      agreed       0.00      0.00      0.00        13
    answered       0.78      0.97      0.86       320
    attacked       0.20      0.05      0.08        39
  irrelevant       0.00      0.00      0.00        38

    accuracy                           0.76       410
   macro avg       0.25      0.26      0.24       410
weighted avg       0.63      0.76      0.68       410

Accuracy:  0.7609756097560976


  _warn_prf(average, modifier, msg_start, len(result))


In [53]:
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(Y_test, predictionA))
print(classification_report(Y_test, predictionA))
print("Accuracy: ", accuracy(Y_test, predictionA))

[[  0  13   0   0]
 [  0 319   1   0]
 [  0  39   0   0]
 [  0  37   1   0]]
              precision    recall  f1-score   support

      agreed       0.00      0.00      0.00        13
    answered       0.78      1.00      0.88       320
    attacked       0.00      0.00      0.00        39
  irrelevant       0.00      0.00      0.00        38

    accuracy                           0.78       410
   macro avg       0.20      0.25      0.22       410
weighted avg       0.61      0.78      0.68       410

Accuracy:  0.7780487804878049


  _warn_prf(average, modifier, msg_start, len(result))


# References:


*   https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/
*   https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/
*   https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/
*   https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html  
