**Author: Sherly Sherly**

# Research Objectives:
- Applications of different preprocessing methods
    - lemmatize/non-lemmatize
    - POS tags
    - unigram/bigram/skip-gram
- Application of different techniques
    - The objective is to evaluate what is the best way to do the word representations in order to calculate the similarity
    - Examples:
        - Bag-of-words
        - TF-IDF
        - Latent Semantic Indexing
        - Latent Dirichlet Allocation
        - Word2Vec
        - FastText
        - FastText pre-trained

- Application of same technique on different problems 
    - Question - Answer
    - Question - Question

References:
- https://staff.fnwi.uva.nl/m.derijke/wp-content/papercite-data/pdf/kenter-short-2015.pdf
- https://pdfs.semanticscholar.org/d632/3544c5c103c8c0094202f38922c12db50d65.pdf
- https://arxiv.org/pdf/1802.05667.pdf
- https://github.com/tbmihailov/semeval2016-task3-cqa

## Learning points from the different papers

<u>Question Condensing Networks for Answer Selection in Community Question Answering</u>  
Wei Wu, Xu Sun, Houfeng Wang

- Specifically trained GloVe vectors can model word interactions more precisely +  Character embedding has proven to be very useful for out-of-vocabulary (OOV) words, so it is especially suitable for noisy web text in CQA. We concatenate these two embedding vectors for every word to generate word-level embeddings
-  We propose to treat the question subject and the question body separately in community question answering. We treat the question subject as the primary part of the question, and aggregate the question body information based on similarity and disparity with the question subject.
- We introduce a new method that uses the multi-dimensional attention mechanism to align question-answer pair. With this attention mechanism, the interaction between questions and answers can be learned more accurately.
- Our proposed Question Condensing Networks (QCN) achieves the state-of-the-art performance on two SemEval CQA datasets, outperforming all exisiting SOTA models by a large margin, which demonstrates the effectiveness of our model.
- We propose to cheat the question subject as the primary part of the question representation, and aggregate question body information from two perspectives: similarity and disparity with the question subject.


<u>KeLP at SemEval-2017 Task 3: Learning Pairwise Patterns in Community Question Answering</u>  
Simone Filice, Giovanni Da San Martino and Alessandro Moschitti
- We modeled the three subtasks as binary classification problems: kernel-based classifiers are trained and the classification score is used to sort the instances and produce the final ranking.

# Tasks
 
## Subtask B: Question-Question Similarity
**Given**  
- a new question (aka original question) and
- the set of the first 10 related questions (retrieved by a search engine), 

rerank the related questions according to their similarity with respect to the original question. In this case, we will consider the "PerfectMatch" and "Relevant" questions both as good (i.e., we will not distinguish between them and we will consider them both "Relevant"), and they should be ranked above the "Irrelevant" questions. The gold labels for this subtask are contained in the RELQ_RELEVANCE2ORGQ field of the XML file. See the README file for a detailed explanation of their meaning. Again, this is not a classification task; it is a ranking task.
 
**Evaluation:**  
As in subtask A, the official scorer will provide a number of evaluation measures to assess the quality of the output of a system (see the tools page), but the official evaluation measure towards which all systems will be evaluated and ranked is MAP using the 10 ranked questions.


<a id='data_format'></a>

## 1. Data Formatting

The given dataset for the challenge is in XML format. We will first need to parse the relevant information from the XML format.

In [1]:
datapathprefix = "v3.2/train/"
data_paths = ['SemEval2016-Task3-CQA-QL-train-part1-with-multiline.xml',
             'SemEval2016-Task3-CQA-QL-train-part2-with-multiline.xml']

In [2]:
import xml.etree.ElementTree as ElementTree

def XMLParser(filepath):
    # construct the Element Tree and get the root
    tree = ElementTree.parse(filepath)
    root = tree.getroot()
    question_list = []

    for org_question in root.findall('OrgQuestion'):
        question_dict = {}

        question_dict['ORGQ_ID'] = org_question.attrib["ORGQ_ID"]
        question_dict['org_subject'] = org_question.find("OrgQSubject").text
        question_dict['org_question'] = org_question.find("OrgQBody").text

        thread = org_question.find('Thread')
        rel_question = thread.find('RelQuestion')

        question_dict['threadId'] = rel_question.attrib['RELQ_ID']
        question_dict['subject'] = rel_question.find('RelQSubject').text
        question_dict['question'] = rel_question.find('RelQBody').text

        # if there are no question body, use subject
        if(question_dict['question'] is None):
            question_dict['question'] = question_dict['subject']

        question_dict['relevance'] = rel_question.attrib['RELQ_RELEVANCE2ORGQ']

        question_list.append(question_dict)

    return question_list

In [3]:
data = []
for p in data_paths:
    data += XMLParser(datapathprefix + p)

<a id='data_clean'></a>


## 2. Data Cleaning

In [4]:
from bs4 import BeautifulSoup
# suppress warning from bs4
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

# Clean extraction of the information
def stripHtml(txt):
    if txt is None:
        return
    else:
        soup = BeautifulSoup(txt, "lxml")
        return ''.join(BeautifulSoup(soup.get_text(), "lxml").findAll(text=True))

In [5]:
import re
import pandas as pd

from nltk.corpus import stopwords
stopw = stopwords.words('english')
stopw += ['hi', 'hello']

def merge_acronym(s):
    r = re.compile(r'(?:(?<=\.|\s)[A-Z]\.)+')
    acronyms = r.findall(s)
    
    for w in acronyms:
        s = s.replace(w, w.replace('.', ''))
        
    return s

def clean_text(doc):

    # determine which ones are required for this task
    doc = stripHtml(doc)
    doc = re.sub('[^A-Za-z ]+', " ", doc)

    doc = doc.replace("-", "")
    doc = doc.lower()
    
    # Some other potential cleaning to do
    # doc = doc.replace("...", "")
    # doc = doc.replace("Mr.", "Mr").replace("Mrs.", "Mrs")    
    # doc = merge_acronym(doc)
    
    # Remove whitespace
    doc = ' '.join(doc.split())
    return doc

def remove_stop_words(doc):
    doc = ' '.join([i for i in doc.split() if i not in stopw])
    return doc

def clean(doc):
    doc = clean_text(doc)
    return remove_stop_words(doc)

In [6]:
df = pd.DataFrame(data)
df.head()

Unnamed: 0,ORGQ_ID,org_subject,org_question,threadId,subject,question,relevance
0,Q1,Massage oil,Where I can buy good oil for massage?,Q1_R1,massage oil,is there any place i can find scented massage ...,PerfectMatch
1,Q1,Massage oil,Where I can buy good oil for massage?,Q1_R6,Philipino Massage center,"Hi,Can any one tell me a place where i can hav...",Relevant
2,Q1,Massage oil,Where I can buy good oil for massage?,Q1_R8,Best place for massage,"&lt;p&gt;\nTell me, where is the best place to...",Irrelevant
3,Q1,Massage oil,Where I can buy good oil for massage?,Q1_R10,body massage,"hi there, i can see a lot of massage center he...",Relevant
4,Q1,Massage oil,Where I can buy good oil for massage?,Q1_R22,What attracts you more ?,What attracts you more ?,Irrelevant


In [7]:
df['question'] = df.question.apply(lambda x: clean(x))
df['org_question'] = df.org_question.apply(lambda x: clean(x))
df['subject'] = df.subject.apply(lambda x: clean(x))
df['org_subject'] = df.org_subject.apply(lambda x: clean(x))

In [8]:
df.head()

Unnamed: 0,ORGQ_ID,org_subject,org_question,threadId,subject,question,relevance
0,Q1,massage oil,buy good oil massage,Q1_R1,massage oil,place find scented massage oils qatar,PerfectMatch
1,Q1,massage oil,buy good oil massage,Q1_R6,philipino massage center,one tell place good massage drom philipinies y...,Relevant
2,Q1,massage oil,buy good oil massage,Q1_R8,best place massage,tell best place go massage mind want spend qr ...,Irrelevant
3,Q1,massage oil,buy good oil massage,Q1_R10,body massage,see lot massage center dont one better someone...,Relevant
4,Q1,massage oil,buy good oil massage,Q1_R22,attracts,attracts,Irrelevant


<a id='methodologies'></a>

## 3. Methodologies

<a id='tdidf'></a>
### 3.1 TF-IDF

In [9]:
import math
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def fit_tfidf(data):
    count_vect = CountVectorizer()
    count_vect = count_vect.fit(data)

    freq_term_matrix = count_vect.transform(data)

    feature_names = count_vect.get_feature_names()

    tfidf = TfidfTransformer()
    tfidf.fit(freq_term_matrix)
    
    return count_vect, tfidf, feature_names


def retrieve_doc_matrix(doc, count_vect, tfidf):
    # get the freq term matrix of the doc
    freq_term_matrix = count_vect.transform([doc])
    # get the tfidf matrix
    tfidf_matrix = tfidf.transform(freq_term_matrix)
    # dense form of the matrix
    dense_mat = tfidf_matrix.todense()
    # return doc_matrix
    return dense_mat.tolist()[0]


def cosine_similarity(vector1, vector2):
    dot_product = sum(p*q for p,q in zip(vector1, vector2))
    magnitude = math.sqrt(
            sum([val**2 for val in vector1])) * \
            math.sqrt(sum([val**2 for val in vector2])
        )
    if not magnitude:
        return 0
    return dot_product/magnitude


def rank_similarity(doc, other_docs, count_vect, tfidf):
    doc_sim_scores = []
    doc_mat = retrieve_doc_matrix(doc, count_vect, tfidf)
    
    for d in other_docs:
        d_mat = retrieve_doc_matrix(d, count_vect, tfidf)
        doc_sim_scores.append((d, cosine_similarity(doc_mat, d_mat)))
        
    sorted_docs = sorted(doc_sim_scores,
                         key=lambda tup: tup[1],
                         reverse=True)
    
    return sorted_docs

In [10]:
question_train = df.org_question.values + df.question.values
cnt_vect, tfidf_fit, f_names = fit_tfidf(question_train)

In [11]:
def gen_sim_scores(sent1, sent2, count_vect, tfidf):
    sent1_mat = retrieve_doc_matrix(sent1, count_vect, tfidf)
    sent2_mat = retrieve_doc_matrix(sent2, count_vect, tfidf)
    return cosine_similarity(sent1_mat, sent2_mat)

df['score'] = df.apply(
        lambda x: gen_sim_scores(x['org_question'], x['question'],
               cnt_vect, tfidf_fit), axis=1)

In [12]:
df.head()

Unnamed: 0,ORGQ_ID,org_subject,org_question,threadId,subject,question,relevance,score
0,Q1,massage oil,buy good oil massage,Q1_R1,massage oil,place find scented massage oils qatar,PerfectMatch,0.320963
1,Q1,massage oil,buy good oil massage,Q1_R6,philipino massage center,one tell place good massage drom philipinies y...,Relevant,0.318438
2,Q1,massage oil,buy good oil massage,Q1_R8,best place massage,tell best place go massage mind want spend qr ...,Irrelevant,0.234103
3,Q1,massage oil,buy good oil massage,Q1_R10,body massage,see lot massage center dont one better someone...,Relevant,0.483658
4,Q1,massage oil,buy good oil massage,Q1_R22,attracts,attracts,Irrelevant,0.0


<a id='word2vec'></a>
### 3.2 Word2Vec

In [13]:
import logging
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [14]:
w2v_format = [x.split() for x in question_train]

In [15]:
w2v_model = word2vec.Word2Vec(w2v_format, min_count=2, size=300)

2019-11-13 18:40:55,850 : INFO : collecting all words and their counts
2019-11-13 18:40:55,859 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-11-13 18:40:56,198 : INFO : collected 9434 word types from a corpus of 104931 raw words and 2669 sentences
2019-11-13 18:40:56,201 : INFO : Loading a fresh vocabulary
2019-11-13 18:40:56,352 : INFO : min_count=2 retains 4707 unique words (49% of original 9434, drops 4727)
2019-11-13 18:40:56,364 : INFO : min_count=2 leaves 100204 word corpus (95% of original 104931, drops 4727)
2019-11-13 18:40:56,484 : INFO : deleting the raw counts dictionary of 9434 items
2019-11-13 18:40:56,527 : INFO : sample=0.001 downsamples 43 most-common words
2019-11-13 18:40:56,550 : INFO : downsampling leaves estimated 90805 word corpus (90.6% of prior 100204)
2019-11-13 18:40:56,600 : INFO : estimated required memory for 4707 words and 300 dimensions: 13650300 bytes
2019-11-13 18:40:56,602 : INFO : resetting layer weights
2019-11-13 1

In [16]:
w2v_model.most_similar(['beach'])

  """Entry point for launching an IPython kernel.
2019-11-13 18:40:59,862 : INFO : precomputing L2-norms of word weight vectors


[('away', 0.9999431371688843),
 ('im', 0.9999403953552246),
 ('well', 0.9999372959136963),
 ('bit', 0.9999366402626038),
 ('baggage', 0.9999358654022217),
 ('world', 0.9999357461929321),
 ('quite', 0.9999347925186157),
 ('customer', 0.9999324083328247),
 ('found', 0.9999318718910217),
 ('little', 0.9999315738677979)]

In [17]:
import numpy as np
from scipy import spatial

index2word_set = set(w2v_model.wv.index2word)

## check what is the difference between the two methods

def avg_feature_vector(sentence, model, num_features, index2word_set):
    words = sentence.split()
    feature_vec = np.zeros((num_features, ), dtype='float32')
    n_words = 0
    for word in words:
        if word in index2word_set:
            n_words += 1
            feature_vec = np.add(feature_vec, model[word])
    if (n_words > 0):
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec

s1_afv = avg_feature_vector('this is a sentence a a a one', model=w2v_model, num_features=300, index2word_set=index2word_set)
s2_afv = avg_feature_vector('this is also sentence', model=w2v_model, num_features=300, index2word_set=index2word_set)
sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
print(sim)

0.9995837211608887


  from ipykernel import kernelapp as app


In [18]:
def get_w2v_sim_score(sent1, sent2):
    sent1_mat = avg_feature_vector(sent1, model=w2v_model,
                                   num_features=300,
                                   index2word_set=index2word_set)
    sent2_mat = avg_feature_vector(sent2, model=w2v_model,
                                   num_features=300,
                                   index2word_set=index2word_set)
    
    return cosine_similarity(sent1_mat, sent2_mat)

In [19]:
df['w2v_sim'] = df.apply(lambda x: get_w2v_sim_score(
    x['org_question'], x['question']), axis=1)

  from ipykernel import kernelapp as app


In [20]:
df.head(n=10)

Unnamed: 0,ORGQ_ID,org_subject,org_question,threadId,subject,question,relevance,score,w2v_sim
0,Q1,massage oil,buy good oil massage,Q1_R1,massage oil,place find scented massage oils qatar,PerfectMatch,0.320963,0.999116
1,Q1,massage oil,buy good oil massage,Q1_R6,philipino massage center,one tell place good massage drom philipinies y...,Relevant,0.318438,0.999398
2,Q1,massage oil,buy good oil massage,Q1_R8,best place massage,tell best place go massage mind want spend qr ...,Irrelevant,0.234103,0.999021
3,Q1,massage oil,buy good oil massage,Q1_R10,body massage,see lot massage center dont one better someone...,Relevant,0.483658,0.999651
4,Q1,massage oil,buy good oil massage,Q1_R22,attracts,attracts,Irrelevant,0.0,0.0
5,Q1,massage oil,buy good oil massage,Q1_R25,got joking seen shop downtown manama,img assist nid title placenta cream desc link ...,Irrelevant,0.0,0.998877
6,Q1,massage oil,buy good oil massage,Q1_R27,blackheads,suggestions get rid,Irrelevant,0.0,0.997664
7,Q1,massage oil,buy good oil massage,Q1_R32,get tea tree oil,someone please advise husband wants get tea tr...,PerfectMatch,0.184031,0.998489
8,Q1,massage oil,buy good oil massage,Q1_R43,strong migraine pain,plz help living hell days tried kind medicine ...,Irrelevant,0.0,0.998939
9,Q1,massage oil,buy good oil massage,Q1_R46,garlic oil,someone please tell find garlic oil qatar hear...,Irrelevant,0.218388,0.999413


### Word2Vec

In [None]:
import gensim
from gensim.models import Word2Vec

DIM = 600
WORKERS = 8
WINDOW = 10
NEGATIVE = 10

id2word = gensim.corpora.Dictionary(w2v_format)
word2id = dict((v,k) for k,v in id2word.iteritems())
corpus = lambda: ([word.lower() for word in question if word in word2id] for question in w2v_format)
model = Word2Vec(size=DIM, window=WINDOW, workers=WORKERS,hs=0,negative=NEGATIVE)
model.build_vocab(corpus())
model.train(corpus(), total_words=model.corpus_count, epochs=model.epochs)
#Done training the model
model.init_sims(replace=True)
# pickle.dump(model, open("tmp/w2v1_model.p", "wb"))

In [22]:
def generateVec(model, sent, numFeatures):
    featureVec = np.zeros((numFeatures,), dtype="float32")
    num_words = 0
    index2word_set = set(model.wv.index2word)
    for word in sent:
        if word in index2word_set:
            num_words += 1
            featureVec = np.add(featureVec, model[word])
    featureVec = np.divide(featureVec, num_words)
    return featureVec

In [23]:
df['w2v_score'] = df.apply(lambda x: cosine_similarity(
    generateVec(model, x['org_question'], DIM),
    generateVec(model, x['question'], DIM)
), axis=1)

  
  if __name__ == '__main__':


In [24]:
df['w2v_sub_score'] = df.apply(lambda x: cosine_similarity(
    generateVec(model, x['org_subject'], DIM),
    generateVec(model, x['subject'], DIM)
), axis=1)

  
  if __name__ == '__main__':


In [25]:
df.head(n=10)

Unnamed: 0,ORGQ_ID,org_subject,org_question,threadId,subject,question,relevance,score,w2v_sim,w2v_score,w2v_sub_score
0,Q1,massage oil,buy good oil massage,Q1_R1,massage oil,place find scented massage oils qatar,PerfectMatch,0.320963,0.999116,0.845747,1.0
1,Q1,massage oil,buy good oil massage,Q1_R6,philipino massage center,one tell place good massage drom philipinies y...,Relevant,0.318438,0.999398,0.8526,0.706746
2,Q1,massage oil,buy good oil massage,Q1_R8,best place massage,tell best place go massage mind want spend qr ...,Irrelevant,0.234103,0.999021,0.885455,0.775217
3,Q1,massage oil,buy good oil massage,Q1_R10,body massage,see lot massage center dont one better someone...,Relevant,0.483658,0.999651,0.834709,0.760214
4,Q1,massage oil,buy good oil massage,Q1_R22,attracts,attracts,Irrelevant,0.0,0.0,0.598539,0.412908
5,Q1,massage oil,buy good oil massage,Q1_R25,got joking seen shop downtown manama,img assist nid title placenta cream desc link ...,Irrelevant,0.0,0.998877,0.855783,0.668123
6,Q1,massage oil,buy good oil massage,Q1_R27,blackheads,suggestions get rid,Irrelevant,0.0,0.997664,0.929298,0.68477
7,Q1,massage oil,buy good oil massage,Q1_R32,get tea tree oil,someone please advise husband wants get tea tr...,PerfectMatch,0.184031,0.998489,0.808488,0.799709
8,Q1,massage oil,buy good oil massage,Q1_R43,strong migraine pain,plz help living hell days tried kind medicine ...,Irrelevant,0.0,0.998939,0.820462,0.672715
9,Q1,massage oil,buy good oil massage,Q1_R46,garlic oil,someone please tell find garlic oil qatar hear...,Irrelevant,0.218388,0.999413,0.82369,0.850235


<a id='lsi'></a>
### 3.5 Latent Semantic Indexing (LSI)

We will be utilising the library gensim to perform LSI

In [26]:
from gensim import corpora, models, similarities

from collections import defaultdict

frequency = defaultdict(int)
for txt in w2v_format:
    for t in txt:
        frequency[t] += 1
        
texts = [[token for token in text if frequency[token]>1]
        for text in w2v_format]

In [27]:
dictionary = corpora.Dictionary(texts)
dictionary.save('tmp/question.dict')

2019-11-13 18:43:42,858 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-11-13 18:43:43,575 : INFO : built Dictionary(4707 unique tokens: ['buy', 'find', 'good', 'massage', 'oil']...) from 2669 documents (total 100204 corpus positions)
2019-11-13 18:43:43,583 : INFO : saving Dictionary object under tmp/question.dict, separately None
2019-11-13 18:43:43,664 : INFO : saved tmp/question.dict


In [28]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('tmp/question.mm', corpus)

2019-11-13 18:43:44,273 : INFO : storing corpus in Matrix Market format to tmp/question.mm
2019-11-13 18:43:44,293 : INFO : saving sparse matrix to tmp/question.mm
2019-11-13 18:43:44,303 : INFO : PROGRESS: saving document #0
2019-11-13 18:43:44,885 : INFO : PROGRESS: saving document #1000
2019-11-13 18:43:45,365 : INFO : PROGRESS: saving document #2000
2019-11-13 18:43:45,558 : INFO : saved 2669x4707 matrix, density=0.691% (86861/12562983)
2019-11-13 18:43:45,641 : INFO : saving MmCorpus index to tmp/question.mm.index


In [29]:
# you can load back the dict and corpus too
# dictionary = corpora.Dictionary.load('tmp/question.dict')
# corpus = corpora.MmCorpus('tmp/question.mm')

In [30]:
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=200)

2019-11-13 18:43:45,787 : INFO : using serial LSI version on this node
2019-11-13 18:43:45,819 : INFO : updating model with new documents
2019-11-13 18:43:45,828 : INFO : preparing a new chunk of documents
2019-11-13 18:43:46,098 : INFO : using 100 extra samples and 2 power iterations
2019-11-13 18:43:46,101 : INFO : 1st phase: constructing (4707, 300) action matrix
2019-11-13 18:43:46,515 : INFO : orthonormalizing (4707, 300) action matrix
2019-11-13 18:43:49,501 : INFO : 2nd phase: running dense svd on (300, 2669) matrix
2019-11-13 18:43:50,967 : INFO : computing the final decomposition
2019-11-13 18:43:51,000 : INFO : keeping 200 factors (discarding 9.575% of energy spectrum)
2019-11-13 18:43:51,077 : INFO : processed documents up to #2669
2019-11-13 18:43:51,103 : INFO : topic #0(106.660): 0.425*"visa" + 0.406*"qatar" + 0.225*"doha" + 0.211*"know" + 0.207*"get" + 0.169*"visit" + 0.168*"please" + 0.145*"would" + 0.137*"anyone" + 0.137*"help"
2019-11-13 18:43:51,116 : INFO : topic #1

In [31]:
index = similarities.MatrixSimilarity(lsi[corpus])

2019-11-13 18:43:52,137 : INFO : creating matrix with 2669 documents and 200 features


In [32]:
doc = df.loc[0].org_question
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
sims = index[vec_lsi]
enum_sims = sorted(enumerate(sims), key=lambda item: -item[1])

print(enum_sims[:10])

[(4, 0.9941629), (5, 0.9146293), (6, 0.79458326), (1, 0.739517), (0, 0.71737975), (3, 0.61968863), (9, 0.5743849), (7, 0.5676053), (814, 0.5496509), (1021, 0.5422623)]


In [33]:
def gen_sim(x):
    vec_bow = dictionary.doc2bow(x['org_question'].lower().split())
    vec_lsi = lsi[vec_bow]
    sims = index[vec_lsi]
    for idx, score in enumerate(sims):
        if idx == x['row_index']:
            return score

In [34]:
df['row_index'] = df.index
df['lsi_score'] = df.apply(lambda x: gen_sim(x), axis=1)

In [35]:
df.head(n=10)

Unnamed: 0,ORGQ_ID,org_subject,org_question,threadId,subject,question,relevance,score,w2v_sim,w2v_score,w2v_sub_score,row_index,lsi_score
0,Q1,massage oil,buy good oil massage,Q1_R1,massage oil,place find scented massage oils qatar,PerfectMatch,0.320963,0.999116,0.845747,1.0,0,0.71738
1,Q1,massage oil,buy good oil massage,Q1_R6,philipino massage center,one tell place good massage drom philipinies y...,Relevant,0.318438,0.999398,0.8526,0.706746,1,0.739517
2,Q1,massage oil,buy good oil massage,Q1_R8,best place massage,tell best place go massage mind want spend qr ...,Irrelevant,0.234103,0.999021,0.885455,0.775217,2,0.494614
3,Q1,massage oil,buy good oil massage,Q1_R10,body massage,see lot massage center dont one better someone...,Relevant,0.483658,0.999651,0.834709,0.760214,3,0.619689
4,Q1,massage oil,buy good oil massage,Q1_R22,attracts,attracts,Irrelevant,0.0,0.0,0.598539,0.412908,4,0.994163
5,Q1,massage oil,buy good oil massage,Q1_R25,got joking seen shop downtown manama,img assist nid title placenta cream desc link ...,Irrelevant,0.0,0.998877,0.855783,0.668123,5,0.914629
6,Q1,massage oil,buy good oil massage,Q1_R27,blackheads,suggestions get rid,Irrelevant,0.0,0.997664,0.929298,0.68477,6,0.794583
7,Q1,massage oil,buy good oil massage,Q1_R32,get tea tree oil,someone please advise husband wants get tea tr...,PerfectMatch,0.184031,0.998489,0.808488,0.799709,7,0.567605
8,Q1,massage oil,buy good oil massage,Q1_R43,strong migraine pain,plz help living hell days tried kind medicine ...,Irrelevant,0.0,0.998939,0.820462,0.672715,8,0.426919
9,Q1,massage oil,buy good oil massage,Q1_R46,garlic oil,someone please tell find garlic oil qatar hear...,Irrelevant,0.218388,0.999413,0.82369,0.850235,9,0.574385


<a id='fasttext'></a>
### 3.6 FastText

In [36]:
# !pip install fasttext
import fastText

# Skipgram model
model = fastText.train_unsupervised('train.txt', model='skipgram')

In [37]:
model.get_words()[:10]

['</s>',
 'qatar',
 'visa',
 'doha',
 'know',
 'please',
 'get',
 'anyone',
 'would',
 'one']

In [38]:
model.get_words

<bound method _FastText.get_words of <fastText.FastText._FastText object at 0x1056499b0>>

In [39]:
# Getting the tokens 
words = []
for word in model.get_words():
    words.append(word)

In [40]:
# Printing out number of tokens available
print("Number of Tokens: {}".format(len(words)))

# Printing out the dimension of a word vector 
print("Dimension of a word vector: {}".format(
    len(model.get_word_vector(words[0]))
))

# Print out the vector of a word 
print("Vector components of a word: {}".format(
    model.get_word_vector(words[0])
))

Number of Tokens: 2676
Dimension of a word vector: 100
Vector components of a word: [-0.0319672   0.1319892   0.31514543  0.410413   -0.1447973   0.04152701
  0.09371077  0.01562968  0.1512818   0.14957687 -0.01333168  0.08096377
 -0.07832588  0.24669628 -0.04287419 -0.07149091 -0.1387505   0.2956999
  0.01943246  0.03656456  0.07951707 -0.11457002 -0.06516531 -0.16427673
 -0.04190664 -0.02716281 -0.10165581 -0.13693957 -0.08325977 -0.19285879
 -0.08619766  0.13417538 -0.2542087  -0.13881095  0.0810195   0.11718099
  0.0253346  -0.30254847 -0.10877774 -0.07203186  0.11552175  0.06549048
  0.36798212  0.06593114 -0.12190998  0.23032668 -0.3667291  -0.14014255
  0.19403766 -0.06437854 -0.11028724  0.00516628  0.11386647  0.00657951
 -0.13765365 -0.0285877  -0.18273328  0.2647317   0.08529811 -0.01537095
 -0.23491588  0.06719963  0.17836581 -0.02758135 -0.03991823 -0.23159184
 -0.20494325 -0.05576655  0.12461594 -0.07800097  0.01072029 -0.02262745
 -0.23066637  0.04385775 -0.06628027 -0.0

In [41]:
def get_fasttext_sim(model, x):
    return cosine_similarity(model.get_word_vector(x['org_question']),
                            model.get_word_vector(x['question']))
df['ft_sim'] = df.apply(lambda x: get_fasttext_sim(model, x), axis=1)

In [42]:
df[["ORGQ_ID", "relevance", "score", "w2v_score",
    "w2v_sub_score", "lsi_score", "ft_sim"]].head(n=20)

Unnamed: 0,ORGQ_ID,relevance,score,w2v_score,w2v_sub_score,lsi_score,ft_sim
0,Q1,PerfectMatch,0.320963,0.845747,1.0,0.71738,0.946557
1,Q1,Relevant,0.318438,0.8526,0.706746,0.739517,0.877368
2,Q1,Irrelevant,0.234103,0.885455,0.775217,0.494614,0.910362
3,Q1,Relevant,0.483658,0.834709,0.760214,0.619689,0.929131
4,Q1,Irrelevant,0.0,0.598539,0.412908,0.994163,0.866588
5,Q1,Irrelevant,0.0,0.855783,0.668123,0.914629,0.946325
6,Q1,Irrelevant,0.0,0.929298,0.68477,0.794583,0.720323
7,Q1,PerfectMatch,0.184031,0.808488,0.799709,0.567605,0.701124
8,Q1,Irrelevant,0.0,0.820462,0.672715,0.426919,0.793625
9,Q1,Irrelevant,0.218388,0.82369,0.850235,0.574385,0.918123


<a id='evaluation'></a>
## 4. Evaluation

In [43]:
import numpy as np

def apk(actual, predicted, k=10):
    """
    Computes the average precision at k.
    This function computes the average prescision at k between two lists of
    items.
    Parameters
    ----------
    actual : list
             A list of elements that are to be predicted (order doesn't matter)
    predicted : list
                A list of predicted elements (order does matter)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The average precision at k over the input lists
    """
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)

def mapk(actual, predicted, k=10):
    """
    Computes the mean average precision at k.
    This function computes the mean average prescision at k between two lists
    of lists of items.
    Parameters
    ----------
    actual : list
             A list of lists of elements that are to be predicted 
             (order doesn't matter in the lists)
    predicted : list
                A list of lists of predicted elements
                (order matters in the lists)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The mean average precision at k over the input lists
    """
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])

In [44]:
eval_df = df[["ORGQ_ID", "relevance", "score", "w2v_score",
    "w2v_sub_score", "lsi_score", "ft_sim"]]

In [45]:
eval_df['rel'] = eval_df.relevance.apply(
    lambda x: 1 if x in ['PerfectMatch', 'Relevant'] else 0)
eval_df['w2vsimclass'] = eval_df.w2v_score.apply(lambda x: 1 if x > 0.85 else 0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [46]:
# lets say 0.8 sim is relevant
eval_df.head(n=10)

Unnamed: 0,ORGQ_ID,relevance,score,w2v_score,w2v_sub_score,lsi_score,ft_sim,rel,w2vsimclass
0,Q1,PerfectMatch,0.320963,0.845747,1.0,0.71738,0.946557,1,0
1,Q1,Relevant,0.318438,0.8526,0.706746,0.739517,0.877368,1,1
2,Q1,Irrelevant,0.234103,0.885455,0.775217,0.494614,0.910362,0,1
3,Q1,Relevant,0.483658,0.834709,0.760214,0.619689,0.929131,1,0
4,Q1,Irrelevant,0.0,0.598539,0.412908,0.994163,0.866588,0,0
5,Q1,Irrelevant,0.0,0.855783,0.668123,0.914629,0.946325,0,1
6,Q1,Irrelevant,0.0,0.929298,0.68477,0.794583,0.720323,0,1
7,Q1,PerfectMatch,0.184031,0.808488,0.799709,0.567605,0.701124,1,0
8,Q1,Irrelevant,0.0,0.820462,0.672715,0.426919,0.793625,0,0
9,Q1,Irrelevant,0.218388,0.82369,0.850235,0.574385,0.918123,0,0


In [47]:
_mapval = mapk(eval_df.groupby(["ORGQ_ID"])['rel'].apply(list).values,
    eval_df.groupby(["ORGQ_ID"])['w2vsimclass'].apply(list).values, 10)

In [48]:
_mapval

0.09102565245823672

## FEATURE ENGINEERING

In [49]:
from fuzzywuzzy import fuzz
import numpy as np
from tqdm import tqdm
from scipy.stats import skew, kurtosis
from scipy.spatial.distance import cosine, cityblock, jaccard, canberra, euclidean, minkowski, braycurtis

from gensim import models
gensim_model = models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
norm_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
norm_model.init_sims(replace=True)

def wmd(s1, s2):
    s1 = str(s1).lower().split()
    s2 = str(s2).lower().split()
    stop_words = stopwords.words('english')
    s1 = [w for w in s1 if w not in stop_words]
    s2 = [w for w in s2 if w not in stop_words]
    return gensim_model.wmdistance(s1, s2)


def norm_wmd(s1, s2):
    s1 = str(s1).lower().split()
    s2 = str(s2).lower().split()
    stop_words = stopwords.words('english')
    s1 = [w for w in s1 if w not in stop_words]
    s2 = [w for w in s2 if w not in stop_words]
    return norm_model.wmdistance(s1, s2)

2019-11-13 18:46:50,690 : INFO : loading projection weights from GoogleNews-vectors-negative300.bin.gz
2019-11-13 18:55:07,423 : INFO : loaded (3000000, 300) matrix from GoogleNews-vectors-negative300.bin.gz
2019-11-13 18:55:07,472 : INFO : loading projection weights from GoogleNews-vectors-negative300.bin.gz
2019-11-13 19:01:41,789 : INFO : loaded (3000000, 300) matrix from GoogleNews-vectors-negative300.bin.gz
2019-11-13 19:01:41,852 : INFO : precomputing L2-norms of word weight vectors


In [50]:
from nltk import word_tokenize
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

def sentence2vector(s):
    words = str(s).lower()
#     .decode('utf-8')
    words = word_tokenize(words)
    words = [w for w in words if not w in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(gensim_model[w])
        except:
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    # normalized vector
    return v / np.sqrt((v ** 2).sum())

def gen_features(data):
    data['len_q1'] = data.question1.apply(lambda x: len(str(x)))
    data['len_q2'] = data.question2.apply(lambda x: len(str(x)))
    data['diff_len'] = data.len_q1 - data.len_q2
    
    data['len_char_q1'] = data.question1.apply(
        lambda x: len(''.join(set(str(x).replace(' ', '')))))
    data['len_char_q2'] = data.question2.apply(
        lambda x: len(''.join(set(str(x).replace(' ', '')))))
    data['len_word_q1'] = data.question1.apply(
        lambda x: len(str(x).split()))
    data['len_word_q2'] = data.question2.apply(
        lambda x: len(str(x).split()))
    data['common_words'] = data.apply(
        lambda x: len(set(str(x['question1']).lower().split()) \
                      .intersection(set(str(x['question2']).lower() \
                      .split()))), axis=1)
    
    data['fuzz_qratio'] = data.apply(
        lambda x: fuzz.QRatio(str(x['question1']),
                              str(x['question2'])), axis=1)
    data['fuzz_WRatio'] = data.apply(
        lambda x: fuzz.WRatio(str(x['question1']),
                              str(x['question2'])), axis=1)
    data['fuzz_partial_ratio'] = data.apply(
        lambda x: fuzz.partial_ratio(str(x['question1']),
                                     str(x['question2'])), axis=1)
    data['fuzz_partial_token_set_ratio'] = data.apply(
        lambda x: fuzz.partial_token_set_ratio(str(x['question1']),
                                               str(x['question2'])), axis=1)
    
    data['fuzz_partial_token_sort_ratio'] = data.apply(
        lambda x: fuzz.partial_token_sort_ratio(str(x['question1']),
                                                str(x['question2'])), axis=1)
    data['fuzz_token_set_ratio'] = data.apply(
        lambda x: fuzz.token_set_ratio(str(x['question1']),
                                       str(x['question2'])), axis=1)
    data['fuzz_token_sort_ratio'] = data.apply(
        lambda x: fuzz.token_sort_ratio(str(x['question1']),
                                        str(x['question2'])), axis=1)
    
    data['wmd'] = data.apply(lambda x: wmd(x['question1'], x['question2']), axis=1)


    data['norm_wmd'] = data.apply(lambda x: norm_wmd(x['question1'],
                                                     x['question2']), axis=1)

    # generate question vectors
    question1_vectors = np.zeros((data.shape[0], 300))
    error_count = 0

    for i, q in tqdm(enumerate(data.question1.values)):
        question1_vectors[i, :] = sentence2vector(q)

    question2_vectors  = np.zeros((data.shape[0], 300))
    for i, q in tqdm(enumerate(data.question2.values)):
        question2_vectors[i, :] = sentence2vector(q)

    data['cosine_distance'] = [cosine(x, y) for (x, y) in zip(
        np.nan_to_num(question1_vectors), np.nan_to_num(question2_vectors))]

    data['cityblock_distance'] = [cityblock(x, y) for (x, y) in zip(
        np.nan_to_num(question1_vectors), np.nan_to_num(question2_vectors))]

    data['jaccard_distance'] = [jaccard(x, y) for (x, y) in zip(
        np.nan_to_num(question1_vectors), np.nan_to_num(question2_vectors))]

    data['canberra_distance'] = [canberra(x, y) for (x, y) in zip(
        np.nan_to_num(question1_vectors), np.nan_to_num(question2_vectors))]

    data['euclidean_distance'] = [euclidean(x, y) for (x, y) in zip(
        np.nan_to_num(question1_vectors), np.nan_to_num(question2_vectors))]

    data['minkowski_distance'] = [minkowski(x, y, 3) for (x, y) in zip(
        np.nan_to_num(question1_vectors), np.nan_to_num(question2_vectors))]

    data['braycurtis_distance'] = [braycurtis(x, y) for (x, y) in zip(
        np.nan_to_num(question1_vectors), np.nan_to_num(question2_vectors))]

    data['skew_q1vec'] = [skew(x) for x in np.nan_to_num(question1_vectors)]
    data['skew_q2vec'] = [skew(x) for x in np.nan_to_num(question2_vectors)]
    data['kur_q1vec'] = [kurtosis(x) for x in np.nan_to_num(question1_vectors)]
    data['kur_q2vec'] = [kurtosis(x) for x in np.nan_to_num(question2_vectors)]

    return data

In [51]:
df_copy = df.copy()

In [52]:
df_copy['question1'] = df_copy['org_question']
df_copy['question2'] = df_copy['question']
df_copy = df_copy[['question1', 'question2', 'relevance']]

In [None]:
global PYEMD_EXT
import pyemd
PYEMD_EXT = True

features_df = gen_features(df_copy)

In [54]:
features_df['q1vec'] = features_df.question1.apply(
    lambda q: sentence2vector(q))
features_df['q2vec'] = features_df.question2.apply(
    lambda q: sentence2vector(q))

### XGBoost

In [55]:
import xgboost as xgb
from sklearn.model_selection import train_test_split

In [56]:
features_df.columns

Index(['question1', 'question2', 'relevance', 'len_q1', 'len_q2', 'diff_len',
       'len_char_q1', 'len_char_q2', 'len_word_q1', 'len_word_q2',
       'common_words', 'fuzz_qratio', 'fuzz_WRatio', 'fuzz_partial_ratio',
       'fuzz_partial_token_set_ratio', 'fuzz_partial_token_sort_ratio',
       'fuzz_token_set_ratio', 'fuzz_token_sort_ratio', 'wmd', 'norm_wmd',
       'cosine_distance', 'cityblock_distance', 'jaccard_distance',
       'canberra_distance', 'euclidean_distance', 'minkowski_distance',
       'braycurtis_distance', 'skew_q1vec', 'skew_q2vec', 'kur_q1vec',
       'kur_q2vec', 'q1vec', 'q2vec'],
      dtype='object')

In [57]:
X_data = features_df[['len_q1', 'len_q2', 'diff_len',
       'len_char_q1', 'len_char_q2', 'len_word_q1', 'len_word_q2',
       'common_words', 'fuzz_qratio', 'fuzz_WRatio', 'fuzz_partial_ratio',
       'fuzz_partial_token_set_ratio', 'fuzz_partial_token_sort_ratio',
       'fuzz_token_set_ratio', 'fuzz_token_sort_ratio', 'wmd', 'norm_wmd',
       'cosine_distance', 'cityblock_distance', 'jaccard_distance',
       'canberra_distance', 'euclidean_distance', 'minkowski_distance',
       'braycurtis_distance', 'skew_q1vec', 'skew_q2vec', 'kur_q1vec',
       'kur_q2vec']].values
features_df['rel'] = features_df.relevance.apply(
    lambda x: 1 if x in ['PerfectMatch', 'Relevant'] else 0)

y = features_df.rel.values

In [58]:
seed = 7
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(
    X_data, y, test_size=test_size, random_state=seed)

In [59]:
from sklearn.metrics import classification_report

model1 = xgb.XGBClassifier()
train_model1 = model1.fit(X_train, y_train)
pred1 = train_model1.predict(X_test)
print('Model 1 XGboost Report\n',(classification_report(y_test, pred1)))

Model 1 XGboost Report
               precision    recall  f1-score   support

           0       0.70      0.86      0.77       469
           1       0.71      0.49      0.58       332

   micro avg       0.70      0.70      0.70       801
   macro avg       0.70      0.67      0.68       801
weighted avg       0.70      0.70      0.69       801



In [60]:
model2 = xgb.XGBClassifier(n_estimators=100, max_depth=8, learning_rate=0.1, subsample=0.5)

train_model2 = model2.fit(X_train, y_train)
pred2 = train_model2.predict(X_test)

print('Model 2 XGboost Report\n', (classification_report(y_test, pred2)))


Model 2 XGboost Report
               precision    recall  f1-score   support

           0       0.71      0.80      0.75       469
           1       0.65      0.55      0.59       332

   micro avg       0.69      0.69      0.69       801
   macro avg       0.68      0.67      0.67       801
weighted avg       0.69      0.69      0.69       801



In [61]:
from sklearn.metrics import accuracy_score

print("Accuracy for model 1: %.2f" % (accuracy_score(y_test, pred1) * 100))
print("Accuracy for model 2: %.2f" % (accuracy_score(y_test, pred2) * 100))

Accuracy for model 1: 70.41
Accuracy for model 2: 69.16


In [62]:
model3 = xgb.XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

train_model3 = model3.fit(X_train, y_train)
pred3 = train_model3.predict(X_test)
print("Accuracy for model 3: %.2f" % (accuracy_score(y_test, pred3) * 100))

Accuracy for model 3: 69.29


In [64]:
xgb2 = xgb.XGBClassifier(
 learning_rate =0.7,
 n_estimators=1000,
 max_depth=4,
 min_child_weight=6,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

train_model4 = xgb2.fit(X_train, y_train)
pred4 = train_model4.predict(X_test)
print("Accuracy for model 4: %.2f" % (accuracy_score(y_test, pred4) * 100))

Accuracy for model 4: 66.92


In [65]:
dropnafea = features_df.dropna()

In [66]:
Xf_train, Xf_test, yf_train, yf_test = train_test_split(
    dropnafea[['len_q1', 'len_q2', 'diff_len',
       'len_char_q1', 'len_char_q2', 'len_word_q1', 'len_word_q2',
       'common_words', 'fuzz_qratio', 'fuzz_WRatio', 'fuzz_partial_ratio',
       'fuzz_partial_token_set_ratio', 'fuzz_partial_token_sort_ratio',
       'fuzz_token_set_ratio', 'fuzz_token_sort_ratio', 'wmd', 'norm_wmd',
       'cosine_distance', 'cityblock_distance', 'jaccard_distance',
       'canberra_distance', 'euclidean_distance', 'minkowski_distance',
       'braycurtis_distance', 'skew_q1vec', 'skew_q2vec', 'kur_q1vec',
       'kur_q2vec']].values, dropnafea.rel.values, test_size=test_size, random_state=seed)

### RandomForest

In [67]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc_model = rfc.fit(Xf_train, yf_train)
pred8 = rfc_model.predict(Xf_test)
print("Accuracy for Random Forest Model: %.2f" % (accuracy_score(yf_test, pred8) * 100))



Accuracy for Random Forest Model: 69.26


https://www.kaggle.com/babatee/intro-xgboost-classification

## Conclusion
Write up and discussions can be found in the `NLP_project.pdf` file