**This is a fork of an original kernel on the StackOverflow questions, applied to the Visa Questions database (which you can find in my profile)**

Use LSA to identify related questions

Resources on LSA: http://mccormickml.com/2016/03/25/lsa-for-text-classification-tutorial/

**What is LSA?**

*Latent Semantic Analysis is a **technique for creating a vector representation of a document.** Having a vector representation of a document gives you a way to compare documents for their similarity by calculating the distance between the vectors. This in turn means you can do handy things like classifying documents to determine which of a set of known topics they most likely belong to.*

**What is tf-idf?**

*term frequency-inverse document frequency, or tf-idf for short.
tf-idf is pretty simple and I won’t go into it here, but the gist of it is that **each position in the vector corresponds to a different word, and you represent a document by counting the number of times each word appears.** Additionally, you normalize each of the word counts by the frequency of that word in your overall document collection, to give less frequent terms more weight.*
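
As a small illustration of the weighting described above, the sketch below fits scikit-learn's `TfidfVectorizer` on a three-sentence toy corpus (the sentences are made up for this example) and compares the idf of a word that appears in every document with one that appears only once.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus, invented for illustration only
toy_docs = [
    "visa application for work",
    "visa renewal for students",
    "visa interview for tourists",
]

toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix: documents x vocabulary

# idf_ is aligned with the column indices stored in vocabulary_
idf_of = {word: toy_vectorizer.idf_[col] for word, col in toy_vectorizer.vocabulary_.items()}

print(toy_tfidf.shape)
print("visa:", idf_of["visa"], "renewal:", idf_of["renewal"])
```

The word "visa" occurs in every document, so it gets the lowest idf; "renewal" occurs in only one and is weighted more heavily.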

**How does LSA work?**

*LSA is quite simple, you just use SVD to perform **dimensionality reduction on the tf-idf vectors**–that’s really all there is to it!*
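
A minimal sketch of that pipeline on a toy corpus, assuming scikit-learn's `TfidfVectorizer` and `TruncatedSVD` (the same classes this notebook imports). The documents and the choice of 2 components are arbitrary and only meant to show the shapes involved.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up documents, illustration only
toy_docs = [
    "how to renew a work visa",
    "work visa renewal documents",
    "tourist visa extension",
    "extend a tourist visa",
]

toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)   # 4 documents x vocabulary size
toy_svd = TruncatedSVD(n_components=2, random_state=0)  # 2 is an arbitrary choice here
toy_lsa = toy_svd.fit_transform(toy_tfidf)              # 4 documents x 2 latent dimensions

print(toy_tfidf.shape, "->", toy_lsa.shape)
```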



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from multiprocessing import Pool
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import re
from itertools import chain
from collections import Counter
import pickle
import scipy.io as scio
from sklearn.decomposition import TruncatedSVD
import scipy.spatial.distance as distance
import scipy.cluster.hierarchy as hierarchy
from scipy.stats import pearsonr

In [None]:
"""this is the data from Python Questions from StackOverflow"""
dat = pd.read_csv("../input/pythonquestions/Questions.csv", encoding='latin1')
# assign back instead of calling fillna(inplace=True) on a column selection
dat['Title'] = dat['Title'].fillna("None")
dat['Score'] = dat['Score'].fillna(0)

In [None]:
"""this is the data from Visa Questions"""
visa_question_data = pd.read_table("../input/visa-questions-by-expat-in-china/visaQuestions.txt",header=None)

**Data look like this**

	Id	OwnerUserId	CreationDate	Score	Title	Body


In [None]:
visa_question_data.iloc[0]

In [None]:
dat.iloc[0]

In [None]:
# select a sample - results will improve without sampling in tf-idf calculations, but due to
# the Kaggle kernel memory limit we have to make a compromise here.
selected_ids = np.random.choice(range(dat.shape[0]), 10000, replace=False)
sample = dat.loc[selected_ids, :]
sample.shape

In [None]:
sample.head()

**DATA CLEANING**
 - purify strings
 - combine title and body

In [None]:
def purify_string(html):
    """
    Strip HTML tags, then replace runs of line breaks with a single space.
    Applied to every question body in the sample.
    """
    return re.sub(r'(\r\n)+|\r+|\n+', " ", re.sub(r'<[^<]+?>', '', html))
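
A quick check of what `purify_string` does, on a made-up HTML snippet (the function is repeated here so the example runs on its own):

```python
import re

def purify_string(html):
    # remove HTML tags, then turn runs of line breaks into a single space
    return re.sub(r'(\r\n)+|\r+|\n+', " ", re.sub(r'<[^<]+?>', '', html))

example_html = "<p>How do I extend\r\nmy <b>visa</b>?</p>"
print(purify_string(example_html))  # How do I extend my visa?
```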

In [None]:
corpus = sample.loc[:, 'Body'].apply(purify_string)  # .ix was removed from pandas; .loc is the label-based equivalent

In [None]:
visa_questions = visa_question_data.loc[:, 0].apply(purify_string)

In [None]:
def combine_title_body(tnb):
    return tnb[0] + " " + tnb[1]

*Pool(8)* comes from the multiprocessing module; see the [multiprocessing docs](https://docs.python.org/2/library/multiprocessing.html):

> multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.

> The multiprocessing module also introduces APIs which do not have analogs in the threading module. A prime example of this is the Pool object which offers a convenient means of parallelizing the execution of a function across multiple input values, distributing the input data across processes (data parallelism). The following example demonstrates the common practice of defining such functions in a module so that child processes can successfully import that module. This basic example of data parallelism using Pool,
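
The pattern described in the quote can be sketched as follows. `square` and `parallel_squares` are hypothetical names used only for this illustration; the worker function lives at module level so child processes can import it, and the pool is only started under a `__main__` guard.

```python
from multiprocessing import Pool

def square(x):
    # worker functions must be defined at module level so they can be pickled
    return x * x

def parallel_squares(values, processes=2):
    # the context manager shuts the pool down once the results are collected
    with Pool(processes) as pool:
        return pool.map(square, values)

if __name__ == "__main__":
    print(parallel_squares([1, 2, 3]))
```

`pool.map` returns results in the same order as the inputs, which is why the cells below can treat the output as aligned with the original corpus.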

In [None]:
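p = Pool(8)
# pair each sampled body with its own title (sample, not the full dat, so rows stay aligned)
combined_corpus = p.map(combine_title_body, zip(sample['Title'], corpus))
p.close()
p.join()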

In [None]:
visa_questions_list = list(visa_questions)

**Next step of cleaning is Stemming and Lemmatizing**

> Stemming and lemmatization
> For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
> 
> The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:
> 
> - am, are, is -> be
> - car, cars, car's, cars' -> car

> The result of this mapping of text will be something like:
> 
> the boy's cars are different colors -> the boy car be differ color

[source Stanford NLP](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

**Stemming and lemmatizing are applied to tokens, after tokenizing the corpus**
> Tokenization
> Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.

[source Stanford NLP](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)

In [None]:
lem = WordNetLemmatizer()
def cond_tokenize(t):
    if t is None:
        return []
    else:
        return [lem.lemmatize(w.lower()) for w in word_tokenize(t)]

In [None]:
p = Pool(8)
tokens = list(p.imap(cond_tokenize, combined_corpus))
p.close()

In [None]:
p = Pool(8)
visa_tokens = list(p.imap(cond_tokenize, visa_questions_list))
p.close()

In [None]:
# stops = stopwords.words('english')
pure_tokens = [" ".join(sent) for sent in tokens]

In [None]:
pure_visa_tokens = [" ".join(sent) for sent in visa_tokens]

In [None]:
i = 7
print(visa_tokens[i]) # the individual lowercased, lemmatized tokens (no stemming is applied)
print("\n")
print(pure_visa_tokens[i]) # the same tokens joined back into a single space-separated string

**TFIDF section**

> In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, **is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.** It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. Tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf.

> Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields, including text summarization and classification.
> 
> One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.

[from wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

In [None]:
# ngram_range is passed as a tuple, which recent scikit-learn versions require
vectorizer = TfidfVectorizer(min_df=1, max_features=2000, stop_words='english', ngram_range=(1, 1), sublinear_tf=True)
tfidf = vectorizer.fit_transform(pure_visa_tokens) # this is the vector matrix of the tfidf

In [None]:
idfs = pd.DataFrame([[v, k] for k, v in vectorizer.vocabulary_.items()], columns=['id', 'word']).sort_values('id')
idfs['idf'] = vectorizer.idf_

In [None]:
# this is the idf table, which can be used to examine how the tf-idf weighting worked
print(idfs.sort_values('idf').head(40))

**Compress using SVD**

> SVD is used to get rid of redundant data, that is, for **dimensionality reduction.** For example, if you have two variables, one is humidity index and another one is probability of rain, then their correlation is so high, that the second one does not contribute with any additional information useful for a classification or regression task. The eigenvalues in SVD help you determine what variables are most informative, and which ones you can do without.

[stackoverflow](https://stackoverflow.com/questions/9590114/importance-of-pca-or-svd-in-machine-learning)

[Data Mining algorithms](https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Dimensionality_Reduction/Singular_Value_Decomposition)

In [None]:
tsvd = TruncatedSVD(n_components=500) # TODO this n_components=500 is a hyperparameter, look into it
transformed = tsvd.fit_transform(tfidf)

In [None]:
np.sum(tsvd.explained_variance_ratio_)
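
One way to look into the `n_components` hyperparameter is to ask how many components are needed to reach a target share of explained variance. The helper below is only a sketch: the name `components_for_variance` and the 0.85 target are assumptions, not part of the original kernel. It takes an array such as `tsvd.explained_variance_ratio_`.

```python
import numpy as np

def components_for_variance(ratios, target):
    """Smallest number of leading components whose cumulative ratio reaches target, or None."""
    cumulative = np.cumsum(ratios)
    reached = np.nonzero(cumulative >= target)[0]
    return int(reached[0]) + 1 if reached.size else None

# made-up ratios standing in for tsvd.explained_variance_ratio_
example_ratios = np.array([0.5, 0.25, 0.125, 0.0625])
print(components_for_variance(example_ratios, 0.85))
```

If the function returns `None`, the fitted components never reach the target and a larger `n_components` would be needed to find out where it is reached.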

In [None]:
transformed.shape

**Choosing a metric**

Now that we have the SVD-reduced tf-idf matrix, we need to choose a distance metric to score how similar each pair of entries is.

**Cosine similarity**

> Cosine similarity calculates similarity by measuring the cosine of angle between two vectors. 
> With cosine similarity, we need to convert sentences into vectors. One way to do that is to use bag of words with either TF (term frequency) or TF-IDF (term frequency- inverse document frequency). The choice of TF or TF-IDF depends on application and is immaterial to how cosine similarity is actually performed — which just needs vectors. TF is good for text similarity in general, but TF-IDF is good for search query relevance.

https://towardsdatascience.com/overview-of-text-similarity-metrics-3397c4601f50
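
A worked example of the definition above, on three made-up vectors. `cosine_similarity` is a small helper written for this illustration; note that `scipy.spatial.distance.cosine` returns the cosine *distance*, i.e. one minus the similarity, which is what the `pdist` call below works with.

```python
import numpy as np
from scipy.spatial import distance

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # 0: no shared components
print(distance.cosine(a, b))    # distance = 1 - similarity, so close to 0
```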

In [None]:
# calculate pairwise cosine distance
D = distance.pdist(transformed, 'cosine')

**Calculate clustering with scipy.cluster**
> `scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean', optimal_ordering=False)`
> 
> Perform hierarchical/agglomerative clustering.
> 
> The input y may be either a 1d condensed distance matrix or a 2d array of observation vectors.

[scipy cluster docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage)
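
A small sketch of how a condensed distance matrix flows through `linkage` and `fcluster`, using six made-up 2-D points. The `'maxclust'` criterion is used here only because it makes the toy result easy to read; it is not the criterion this notebook uses.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance

# six made-up 2-D points forming two well-separated groups
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

condensed = distance.pdist(points, 'euclidean')   # 1-D condensed distance matrix
tree = hierarchy.linkage(condensed)               # default method='single'
labels = hierarchy.fcluster(tree, t=2, criterion='maxclust')  # cut into at most 2 clusters

print(tree.shape)   # one row per merge: (n - 1, 4)
print(labels)
```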

In [None]:
# hierarchical clustering - tree calculation
L = hierarchy.linkage(D)
# TODO: look into "ValueError: The condensed distance matrix must contain only finite values."
# Cosine distance is NaN for all-zero rows, so any such row in `transformed` makes D non-finite.
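
One likely cause of the `ValueError` mentioned above is rows that are entirely zero (for example questions left with no in-vocabulary terms), because cosine distance divides by the vector norm. Below is a sketch of one workaround, assuming it is acceptable to drop such rows before clustering; `drop_zero_rows` is a hypothetical helper, and the toy matrix stands in for `transformed`.

```python
import numpy as np
from scipy.spatial import distance

def drop_zero_rows(X):
    """Return the rows of X with a non-zero norm, plus the boolean mask that selected them."""
    keep = np.linalg.norm(X, axis=1) > 0
    return X[keep], keep

# toy matrix standing in for `transformed`; the second row is all zeros
X = np.array([[1.0, 0.0], [0.0, 0.0], [0.5, 0.5], [0.0, 2.0]])

X_clean, keep = drop_zero_rows(X)
D_clean = distance.pdist(X_clean, 'cosine')

print(keep)
print(np.isfinite(D_clean).all())
```

The `keep` mask is returned so that cluster labels computed on the filtered rows can be mapped back to the original questions.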

In [None]:
# mean pairwise cosine distance between documents
np.mean(D)

In [None]:
# split clusters by criterion. Here 0.71 is used as the threshold for the inconsistency
# criterion. Adjust the number to change cluster sizes
# TODO: this is the second hyperparameter, look into it
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster
cls = hierarchy.fcluster(L, 0.71, criterion='inconsistent')
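
To get a feel for the threshold passed to `fcluster` with `criterion='inconsistent'`, it can help to inspect the inconsistency coefficients themselves. The sketch below does this on random made-up points with `scipy.cluster.hierarchy.inconsistent`; the numbers it prints say nothing about the real data.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance

rng = np.random.default_rng(0)
toy_points = rng.random((20, 3))   # 20 random points, illustration only

toy_tree = hierarchy.linkage(distance.pdist(toy_points, 'euclidean'))
# one row per merge: mean link height, std of heights, number of links, inconsistency coefficient
R = hierarchy.inconsistent(toy_tree)

threshold = 0.71                   # the value used in this notebook
toy_labels = hierarchy.fcluster(toy_tree, threshold, criterion='inconsistent')

print(R[:, 3].min(), R[:, 3].max())
print(len(np.unique(toy_labels)), "clusters at threshold", threshold)
```

Comparing the range of the last column of `R` with the chosen threshold shows how aggressively the tree is being cut.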

In [None]:
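# Pos must index the same rows that were clustered: selected_ids only lines up with cls
# when the tf-idf matrix was built from the StackOverflow sample (pure_tokens)
df_cls = pd.DataFrame({'Pos': selected_ids, 'Cluster': cls})
cnts = df_cls.groupby('Cluster').size().sort_values(ascending=False)
cnts.head()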

In [None]:
# add clusters to question data
bc = pd.concat([sample, df_cls.set_index('Pos')], axis=1)
bc.head()

In [None]:
# calculate cluster stats
stats = bc.groupby('Cluster')['Score'].describe().unstack()

In [None]:
stats.sort_values(ascending=False).head(10)

In [None]:
plt.figure(figsize=(12, 8))
plt.hlines([0], xmin=0, xmax=np.max(stats['count']) + 5, alpha=0.5)
plt.vlines([1], ymin=0, ymax=np.max(stats['mean']) + 50, alpha=0.5)
plt.scatter(stats['count'], stats['mean'], alpha=0.3)
plt.title("cluster mean score vs cluster size")
plt.xlabel("cluster size")
plt.ylabel("mean score")
plt.show()

### Check if clusters make sense

In [None]:
bc.loc[bc['Cluster'] == cnts.index[0]][['Score', 'Title', 'Body']]

In [None]:
bc.loc[bc['Cluster'] == cnts.index[1]][['Score', 'Title', 'Body']]

In [None]:
bc.loc[bc['Cluster'] == cnts.index[2]][['Score', 'Title', 'Body']]

We can improve our clusters by increasing the sample size, using the entire dataset to calculate tf-idf, adjusting the cluster splitting criterion, using non-exclusive clustering techniques, etc.

Next steps:

 1. Use clusters and most significant words in questions to generate question tags automatically
 2. Use an autoencoder to perform semantic hashing for better estimates of question relatedness