## TF-IDF based retrieval using gensim

This notebook defines a **gensim-based document retrieval method that ranks corpus documents by their tf-idf similarity score** against the query string.

1. Cleanup / preprocess 
2. Define dictionary
3. Transform corpus - Bag of Words
4. Learn tfidf vectors for corpus
5. Sparse matrix indexing for similarity scoring
6. Retrieve top N documents for the given query string

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os, sys

In [None]:
from sklearn.datasets import fetch_20newsgroups
from gensim import corpora
from gensim.parsing import strip_tags, strip_numeric, \
    strip_multiple_whitespaces, stem_text, strip_punctuation, \
    remove_stopwords, preprocess_string
import pprint
import re

### Get the dataset as text corpus

In [None]:
# get all the news group docs
data = fetch_20newsgroups(subset='all')

In [None]:
# collect all text documents as list
text_docs = data['data']

### Preprocess the text corpus

In [None]:
# preprocess using gensim.parsing
# ref: https://www.kaggle.com/venkatkrishnan/gensim-text-mining-techniques
transform_to_lower = lambda s: s.lower()

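# Drop single-character tokens; the lookahead keeps the trailing whitespace
# so that the neighbouring words are not glued together
remove_single_char = lambda s: re.sub(r'\s+\w(?=\s)', '', s)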

# Filters to be executed in pipeline
CLEAN_FILTERS = [strip_tags,
                strip_numeric,
                strip_punctuation, 
                strip_multiple_whitespaces, 
                transform_to_lower,
                remove_stopwords,
                remove_single_char]

# Apply the filters to strip irrelevant text elements and tokenize the document
def cleaning_pipe(document):
    # Invoking gensim.parsing.preprocess_string method with set of filters
    processed_words = preprocess_string(document, CLEAN_FILTERS)
    
    return processed_words
print(cleaning_pipe(text_docs[0]))
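
To make the filter order easier to reason about, here is a minimal sketch of how a filter pipeline of this kind can work: each filter maps a string to a string, the filters run in list order, and the result is split on whitespace. The three filters and the `apply_filters` helper are hypothetical stand-ins written for this illustration, not gensim functions.

```python
import re

# Hypothetical stand-in filters: each takes a string and returns a string.
def to_lower(s):
    return s.lower()

def strip_digits(s):
    return re.sub(r"[0-9]+", "", s)

def collapse_whitespace(s):
    return re.sub(r"\s+", " ", s).strip()

def apply_filters(text, filters):
    """Apply each filter in order, then split the result on whitespace."""
    for f in filters:
        text = f(text)
    return text.split()

tokens = apply_filters(
    "Selling 2 Bikes   in 1994",
    [to_lower, strip_digits, collapse_whitespace],
)
print(tokens)
```

Because every filter sees the output of the previous one, the order in `CLEAN_FILTERS` matters: for example, lowercasing before stopword removal lets capitalized stopwords match a lowercase stopword list.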

### Define corpus dictionary

In [None]:
def create_dictionary(docs):
    '''Preprocess the corpus, build a dictionary of its words and save it to disk.
    Returns the dictionary and the preprocessed (tokenized) documents.'''
    pdocs = [cleaning_pipe(doc) for doc in docs]
    dictionary = corpora.Dictionary(pdocs)
    dictionary.save('newsgroup.dict')
    return dictionary, pdocs

In [None]:
dictionary, pdocs = create_dictionary(text_docs)

In [None]:
len(dictionary)

- The dictionary is large (about 177k unique words, i.e. 177k dimensions), but gensim represents documents as sparse vectors, so it can handle this size efficiently.

### Transform any sample document as per the known dictionary

In [None]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(cleaning_pipe(new_doc))
print(new_vec)
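
The printed vector is a sparse bag-of-words: a list of `(token_id, count)` pairs covering only the words that occur in the document. A rough pure-Python sketch of the same idea follows; the `token2id` mapping and the `to_bow` helper are hypothetical and only mimic the shape of the output.

```python
from collections import Counter

# Hypothetical toy vocabulary; the notebook's real mapping is held by `dictionary`.
token2id = {"computer": 0, "human": 1, "interaction": 2}

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

bow = to_bow(["human", "computer", "human", "robot"], token2id)
print(bow)
```

Words that are not in the vocabulary ("robot" here) simply do not appear in the result, which is why a query made only of unseen words yields an empty vector.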

### Transform complete corpus as BoW

In [None]:
bow_corpus = [dictionary.doc2bow(text) for text in pdocs]

### Fit the tf-idf model (a.k.a. tf-idf vectorizer)

In [None]:
from gensim import models

# train the model
tfidf = models.TfidfModel(bow_corpus)

In [None]:
# transform any new document as tfidf vector
words = cleaning_pipe("want to sell bike")
print(tfidf[dictionary.doc2bow(words)])
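
As a sanity check on the printed weights, the sketch below computes tf-idf by hand, assuming the weighting is term count times `log2(N / df)` followed by L2 normalization (to my understanding this matches gensim's default `TfidfModel` settings, but treat it as an assumption). The `tfidf_vector` helper and the document-frequency table are hypothetical.

```python
import math

def tfidf_vector(bow, doc_freq, num_docs):
    """Weight each (token_id, count) pair by count * log2(num_docs / df), then L2-normalize."""
    weighted = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in bow]
    # Terms that occur in every document get weight 0 and are dropped.
    weighted = [(tid, w) for tid, w in weighted if w > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(tid, w / norm) for tid, w in weighted] if norm else []

# Hypothetical statistics: 4 documents; token 0 occurs in 1, token 1 in 2, token 2 in all 4.
doc_freq = {0: 1, 1: 2, 2: 4}
vec = tfidf_vector([(0, 1), (1, 1), (2, 3)], doc_freq, num_docs=4)
print(vec)
```

Token 2 appears three times in the document but is dropped entirely, because a word present in every document carries no information for ranking.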

### Sparse matrix indexing for similarity scoring

In [None]:
# index the tfidf vector of corpus as sparse matrix
from gensim import similarities
index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))
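
The index scores a query by cosine similarity against every document. When the vectors are already unit length, cosine similarity reduces to a dot product over the token ids the two vectors share. The sketch below illustrates this with dictionaries; `sparse_dot` and the toy vectors are hypothetical, and this is not how gensim implements the index internally.

```python
def sparse_dot(u, v):
    """Dot product of two sparse vectors given as {token_id: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[tid] for tid, w in u.items() if tid in v)

# Hypothetical unit-length tf-idf vectors for two documents and one query.
docs = [
    {0: 0.6, 1: 0.8},
    {2: 1.0},
]
query = {1: 1.0}

sims = [sparse_dot(query, d) for d in docs]
print(sims)
```

A document that shares no tokens with the query scores exactly zero, regardless of how large the vocabulary is.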

### Retrieve top N documents for the given query string

In [None]:
def get_closest_n(query, n):
    '''Return the n documents with the highest cosine similarity
    between the tf-idf vector of the query and that of each document.'''
    query_document = cleaning_pipe(query)
    query_bow = dictionary.doc2bow(query_document)
    sims = index[tfidf[query_bow]]
    # sort descending, then take the first n (n=0 correctly yields no documents)
    top_idx = sims.argsort()[::-1][:n]
    return [text_docs[i] for i in top_idx]

In [None]:
for d in get_closest_n("how to sell my broken aeroplane",2):
    print(d)
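
When inspecting results it can help to see the scores next to the document indices, and to skip documents that score zero (which happens whenever a document shares no words with the query). The following is a minimal sketch of that variant on a toy score array; `top_n_with_scores` is a hypothetical helper, not part of the notebook or of gensim.

```python
import numpy as np

def top_n_with_scores(sims, n):
    """Return (doc_index, score) pairs for the n highest scores, best first, skipping zero scores."""
    order = np.argsort(-sims)[:n]
    return [(int(i), float(sims[i])) for i in order if sims[i] > 0]

# Hypothetical similarity scores for five documents.
sims = np.array([0.10, 0.00, 0.75, 0.30, 0.00])
print(top_n_with_scores(sims, 3))
```

Asking for more results than there are matching documents returns only the matches, so the caller can tell a weak query apart from a short corpus.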