# Finding Similar Stack Overflow Questions: Comparing Centroid Method with Doc2Vec
As you may know, Stack Overflow is one of the most popular Q&A forums for developers. There are around 17.2 million questions on the platform, with 8K–10K new questions asked every day. While many of these questions are unique, I believe that many others are repeated, so it would be interesting to see which similar questions are asked most frequently. 

The topic of frequent questions extraction is somewhat connected to my Master's Thesis which I'm currently writing, so these kernels will reflect my progress in terms of research in this area. 

This notebook will attempt to answer the question __How can we embed sentences in such a way that similar questions will appear closer to each other as measured by cosine distance?__ by comparing two embedding methods: the __Centroid Method__ and __Doc2Vec__. 



## Imports and Data

In [None]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
import gensim
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from gensim.models import FastText
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from sklearn.neighbors import NearestNeighbors

In [None]:
df = pd.read_csv('../input/Questions.csv', encoding = "ISO-8859-1", nrows=30000, usecols=['Id', 'Title', 'Body'])
df.shape

In [None]:
#Let's take a look at some of the questions
print('Question1: ', df.iloc[0, 2])
print('Question2: ', df.iloc[1, 2])
print('Question3: ', df.iloc[2, 2])

As we can see, some questions can be quite long and elaborate while others are just asking for recommendations. It seems that most of the question text will be inside paragraph tags. There are also a lot of unnecessary tags which will have to be cleaned. The general strategy for initial cleaning will be the following:

* Get text which is inside paragraph tags
* Tokenize
* Remove stopwords
* Make frequency thresholds
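
The last step, frequency thresholds, is not spelled out in this notebook beyond the `min_count` argument passed to the embedding models later on. Here is a minimal sketch of what such a filter might look like, assuming questions are already tokenized into lists of strings. The names `filter_by_frequency` and `max_doc_ratio`, and the toy data, are hypothetical and only for illustration.

```python
from collections import Counter

def filter_by_frequency(tokenized_docs, min_count=2, max_doc_ratio=0.9):
    """Drop tokens that are too rare overall or that appear in too many documents."""
    # How often each token occurs in the whole corpus
    total_counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # In how many documents each token occurs at least once
    doc_counts = Counter(tok for doc in tokenized_docs for tok in set(doc))
    max_docs = max_doc_ratio * len(tokenized_docs)
    return [
        [tok for tok in doc
         if total_counts[tok] >= min_count and doc_counts[tok] <= max_docs]
        for doc in tokenized_docs
    ]

# Made-up toy corpus (assumption, not taken from the dataset)
toy_docs = [
    ['python', 'list', 'sort'],
    ['python', 'list', 'slice'],
    ['python', 'dict', 'sort'],
]
print(filter_by_frequency(toy_docs, min_count=2, max_doc_ratio=0.9))
```

In this toy corpus 'python' is dropped because it appears in every document, while 'slice' and 'dict' are dropped because they appear only once.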

## Text Pre-processing

In [None]:
#Using beautiful soup to grab text inside 'p' tags and concatenate it
def get_question(html_text):
  soup = BeautifulSoup(html_text, 'lxml')
  question = ' '.join([t.text for t in soup.find_all('p')]) #concatenating all p tags
  return question

#Transforming questions to list for ease of processing
question_list = df['Body'].apply(get_question).values.tolist()

In [None]:
question_list[0]

In [None]:
#Tokenizing with gensim's simple_preprocess
def sent_to_words(sentences):
    for sentence in sentences:
        yield(simple_preprocess(str(sentence), deacc=True)) # returns lowercase tokens, ignoring tokens that are too short or too long

question_words = list(sent_to_words(question_list))

In [None]:
question_words[0][0:5] #first 5 question tokens

Now, let's take a look at the questions word count before and after stop words removal.

In [None]:
lengths = [len(question) for question in question_words]
plt.hist(lengths, bins = 25)
plt.show()

print('Mean word count of questions is %s' % np.mean(lengths))

In [None]:
#Getting rid of stopwords
stop_words = stopwords.words('english')

def remove_stopwords(sentence):
  filtered_words = [word for word in sentence if word not in stop_words]
  return filtered_words

filtered_questions = [remove_stopwords(question) for question in question_words]

In [None]:
#Examining word counts after removal of stop words

lengths = [len(question) for question in filtered_questions]
plt.hist(lengths, bins = 25)
plt.show()

print('Mean word count of questions is %s' % np.mean(lengths))

In [None]:
len(filtered_questions)

The mean word count fell from 80 to 44 words. There's still enough information for the models to learn context, and the noise is reduced. 

## Sentence Embedding 
As a first approach, I will be using a so-called __centroid method__ to derive the sentence embeddings (taken from this research paper http://www2.aueb.gr/users/ion/docs/BioNLP_2016.pdf). It derives a sentence embedding as the sum of the individual word embeddings in a sentence weighted by their tf-idf scores, divided by the sum of these tf-idf scores. For the sake of simplicity, I'm going to compare just two alternatives for word embeddings: Word2Vec and FastText. I'll be using the gensim implementations of both.
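
The weighted average described above can be written out in a few lines. This is only a dependency-free sketch on made-up 2-dimensional vectors and made-up tf-idf weights. `weighted_centroid` is a hypothetical helper, not the function used later in this notebook, and returning `None` when the weights sum to zero is my own convention.

```python
def weighted_centroid(word_vectors, weights):
    """Weighted average of equal-length vectors; returns None if the weights sum to zero."""
    total = sum(weights)
    if total == 0:
        return None
    dim = len(word_vectors[0])
    return [
        sum(w * vec[i] for vec, w in zip(word_vectors, weights)) / total
        for i in range(dim)
    ]

# Toy 2-dimensional "embeddings" and made-up tf-idf weights, for illustration only
vectors = [[1.0, 0.0], [0.0, 1.0]]
tfidf_weights = [3.0, 1.0]
print(weighted_centroid(vectors, tfidf_weights))  # [0.75, 0.25]
```

The word with the higher tf-idf weight pulls the sentence vector towards its own embedding, which is the whole point of weighting instead of taking a plain mean.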

## Word2Vec
Word2Vec learns a vector representation of a word by either predicting the context around it (skip-gram) or predicting the word from its context (CBoW). The most important parameters to specify here are the size of the embedding vector and the size of the context window. The number of dimensions is usually between 100 and 300, with 128 being a standard choice for a lot of applications. The context window depends on the nature of the text and the embeddings that you want to get. We'll start with a smaller 50-dimensional vector and a context window of 8, and see if the embeddings make sense. 
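
To make the role of the context window more concrete, here is a small sketch of how (center word, context word) training pairs could be generated for skip-gram. It only illustrates the idea: `skipgram_pairs` is a hypothetical helper, and gensim's actual training includes details (such as negative sampling) that are left out here.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Made-up toy sentence
tokens = ['sort', 'array', 'python']
print(skipgram_pairs(tokens, window=1))
```

A larger window produces more pairs per word and tends to capture broader, more topical relations, while a smaller window focuses on immediate neighbours.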


In [None]:
#Instantiating the model; passing the corpus builds the vocabulary and runs an initial round of training
n = 50
model = Word2Vec(filtered_questions, size = n, window = 8)

#Training for 10 more epochs on the questions corpus
model.train(filtered_questions, total_examples=len(filtered_questions), epochs=10)

Let's inspect the results by looking at the words (vectors) most similar to 'array' and 'database'.

In [None]:
#Let's see how it worked
word_vectors = model.wv

print('Words similar to "array" are: ', word_vectors.most_similar(positive='array'))
print('Words similar to "database" are: ', word_vectors.most_similar(positive='database'))

Looks like Word2Vec knows that array is related to list and matrix, and is commonly used in the context of slicing. For 'database', we can see that the model has learned the abbreviation db and related topics like tables and sqlite. In general, these results should be good enough to construct the sentence embeddings out of them.

## FastText
The main difference between FastText and Word2Vec is that FastText uses sub-word information (i.e., character n-grams). While it brings additional utility to the embeddings, it also considerably slows down the process. 
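
To get a feeling for what "character n-grams" means, here is a small sketch that splits a word into n-grams after wrapping it in `<` and `>` boundary markers. `char_ngrams` is a hypothetical helper, and the 3 to 6 character range is my assumption about FastText's usual defaults, not something verified against the model trained in this notebook.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = '<' + word + '>'
    ngrams = []
    for n_chars in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n_chars + 1):
            ngrams.append(wrapped[start:start + n_chars])
    return ngrams

print(char_ngrams('array', min_n=3, max_n=3))

# Words with a shared stem share many n-grams
shared = set(char_ngrams('array', 3, 3)) & set(char_ngrams('arrays', 3, 3))
print(sorted(shared))
```

Because a word vector is built from the vectors of its n-grams, words that share many n-grams end up with related vectors, and rare or unseen words can still get a reasonable embedding.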

In [None]:
ft_model = FastText(filtered_questions, size=n, window=8, min_count=5, workers=2,sg=1)

In [None]:
print('Words similar to "array" are: ', ft_model.wv.most_similar('array'))
print('Words similar to "database" are: ', ft_model.wv.most_similar('database'))

Here we can see that FastText has produced different vector embeddings. 'Array' is now close to words which also contain the n-gram 'array', and 'database' is close to different n-grams of the word database plus some variations of database tools. 

We can clearly see the difference between the embedding methods: Word2Vec puts words which occur in the same context closer in the vector space, while FastText does the same but also allows less frequent words to be incorporated into this vector space. The use of n-grams really does play a key role in word embeddings and hence **I will proceed with using FastText embeddings** as a basis for sentence embeddings. 

### TF-IDF

In [None]:
#dct = Dictionary(filtered_questions)  # fit dictionary
#corpus = [dct.doc2bow(line) for line in filtered_questions]  # convert corpus to BoW format
#tfidf_model = TfidfModel(corpus)  # fit model

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(question_list)
print(X.shape)

In [None]:
#To properly work with scikit's vectorizer
merged_questions = [' '.join(question) for question in filtered_questions]
document_names = ['Doc {:d}'.format(i) for i in range(len(merged_questions))]

def get_tfidf(docs, ngram_range=(1,1), index=None):
    vect = TfidfVectorizer(stop_words='english', ngram_range=ngram_range)
    tfidf = vect.fit_transform(docs).todense()
    return pd.DataFrame(tfidf, columns=vect.get_feature_names(), index=index).T

tfidf = get_tfidf(merged_questions, ngram_range=(1,1), index=document_names)

### Centroid Function

In [None]:
def get_sent_embs(emb_model):
    """Tf-idf weighted centroid of the word vectors of every question in filtered_questions."""
    sent_embs = []
    for desc in range(len(filtered_questions)):
        sent_emb = np.zeros((1, n))
        div = 0 #running sum of tf-idf weights, reset for every question
        for word in filtered_questions[desc]:
            if word in emb_model.wv.vocab and word in tfidf.index:
                word_emb = emb_model.wv[word]
                weight = tfidf.loc[word, 'Doc {:d}'.format(desc)]
                sent_emb = np.add(sent_emb, word_emb * weight)
                div += weight
            else:
                div += 1e-13 #to avoid dividing by 0
        if div == 0:
            print(desc) #empty question: its embedding stays a zero vector
        else:
            sent_emb = np.divide(sent_emb, div)
        sent_embs.append(sent_emb.flatten())
    return sent_embs

In [None]:
ft_sent = get_sent_embs(emb_model = ft_model) 

## Finding Similar Questions
Now we have sentence embeddings which, in theory, should reflect the similarity of questions. To check if this assumption is valid, let's pick a question and find its top 5 most similar questions (k-nearest neighbours) as measured by cosine distance.
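
As a reminder of what is being measured, cosine distance is 1 minus the cosine of the angle between two vectors, so it ignores vector length and looks only at direction. Below is a dependency-free sketch of the distance and a brute-force nearest-neighbour lookup on made-up vectors. `cosine_distance` and `nearest` are hypothetical helpers, and treating a zero vector as being at distance 1.0 from everything is my own convention.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def nearest(query_index, vectors, k):
    """Indices of the k vectors closest to vectors[query_index], excluding itself."""
    others = [i for i in range(len(vectors)) if i != query_index]
    others.sort(key=lambda i: cosine_distance(vectors[query_index], vectors[i]))
    return others[:k]

# Made-up 2-dimensional vectors: index 1 points almost the same way as index 0
toy = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(nearest(0, toy, k=2))
```

Note that the second vector is twice as long as the first one but is still its nearest neighbour, since only the direction matters.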

In [None]:
def get_n_most_similar(interest_index, embeddings, n):
    """
    Takes the index of the question of interest, the list with all embeddings, and the number of similar questions to 
    retrieve.
    Outputs the indices of the similar questions and their cosine distances
    """
    #Asking for n + 1 neighbours because the closest match is the question itself
    nbrs = NearestNeighbors(n_neighbors=n + 1, metric='cosine').fit(embeddings)
    #Querying only the question of interest instead of the whole corpus
    distances, indices = nbrs.kneighbors([embeddings[interest_index]])
    similar_indices = indices[0][1:]
    similar_distances = distances[0][1:]
    return similar_indices, similar_distances

def print_similar(interest_index, embeddings, n):
    """
    Convenience function for visual analysis
    """
    closest_ind, closest_dist = get_n_most_similar(interest_index, embeddings, n)
    print('Question %s \n \n is most similar to these %s questions: \n' % (question_list[interest_index], n))
    for question in closest_ind:
        print('ID ', question, ': ',question_list[question])

In [None]:
print_similar(42, ft_sent, 5)

Results are quite interesting. All of the questions are about some kind of text processing. They are not exact duplicates, but we are definitely onto something. A possible explanation for the weak performance is that the questions are too long and the final embedding is influenced by too much noise. My hope was that the tf-idf weighting would counteract this, but apparently this is not the case. However, for shorter texts, this method works quite well. 

The next approach will be a more complicated (in terms of theory, not implementation) model called __Doc2Vec__. 

## Doc2Vec
Doc2Vec improves on the simple averaging method by training a 'document' vector alongside the word vectors. As in Word2Vec, there are two algorithms available to train the model, but I will be using 'distributed memory' (that's why dm=1 in my model). It trains a model which predicts a word based on its context by averaging the context word vectors and the paragraph ID vector.  
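
The averaging step of distributed memory can be illustrated on toy numbers. The sketch below only shows how a paragraph vector and the context word vectors could be combined into the input used to predict the target word. The training loop and the prediction layer are left out, `dm_input` is a hypothetical helper, and all vectors are made up.

```python
def dm_input(paragraph_vector, context_vectors):
    """Average the paragraph vector with the context word vectors, element-wise."""
    all_vectors = [paragraph_vector] + list(context_vectors)
    dim = len(paragraph_vector)
    return [sum(vec[i] for vec in all_vectors) / len(all_vectors) for i in range(dim)]

# Made-up 2-dimensional vectors, for illustration only
paragraph_vec = [0.0, 3.0]
context_vecs = [[3.0, 0.0], [3.0, 3.0]]
print(dm_input(paragraph_vec, context_vecs))  # [2.0, 2.0]
```

Since the same paragraph vector takes part in every prediction made inside its document, it is pushed to encode what the context words alone cannot, which is roughly the topic of the document.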

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(filtered_questions)]
model = Doc2Vec(documents, vector_size=n, window=8, min_count=5, workers=2, dm = 1, epochs=20)

In [None]:
print(question_list[42], ' \nis similar to \n')
print([question_list[similar[0]] for similar in model.docvecs.most_similar(42)])

Results are less than impressive. Some results are about string manipulations or SQL, but Doc2Vec has failed to capture the main meaning of the reference question. 

From the current analysis I can conclude that, with the current parameters, the __Centroid Method outperforms Doc2Vec__. Here is another example of similar questions being close to each other under the Centroid Method embedding.

In [None]:
print_similar(101, ft_sent, 5)

Next steps to improve embeddings would be to:
* Add more tags to Doc2Vec which, in theory, would push questions with similar tags closer together
* Concatenate question headers and code parts with question text 
* Experiment with more questions (now we are training on a limited dataset)