# Introduction
For this topic visualization project I will be using the Gensim library (https://radimrehurek.com/gensim/) and the D3-based Hierarchie visualization (http://dpoetry.com/theplains/Hierarchie-gh-pages/). I will use TFIDF to calculate how important a particular word is in a document, and I will store the TFIDF value against each word and document. Given an input word, I will find all the documents that contain it and then retrieve the important words present in those documents. I will then accumulate the TFIDF value of each of those words across the documents and retrieve the top 5 words among them. These 5 words represent the words most relevant to the input word.
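
To make the TFIDF idea concrete, here is a minimal sketch on a toy corpus using the textbook weighting, term count times log2(N / document frequency). The function name `tfidf_weights` and the toy documents are assumptions made for illustration. Gensim's `TfidfModel` also normalizes each document vector when `normalize=True`, so its numbers will differ from these.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({w: tf[w] * math.log2(n_docs / df[w]) for w in tf})
    return weights

# Hypothetical toy corpus, already tokenized
toy_docs = [
    ["gene", "protein", "cell"],
    ["gene", "cell", "cell"],
    ["protein", "assay", "cell"],
]
weights = tfidf_weights(toy_docs)
```

A word such as "cell" that occurs in every document gets a weight of 0, while "assay", which occurs in only one document, gets the highest weight in its document.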

## Steps
The project can be implemented in 4 steps:
1. Preprocess the data
2. Create a TFIDF matrix in Matrix Market format
3. Generate the JSON based on the input term
4. Create the visualization app

## Step 1
I downloaded the PMC data in plain text format.
I read the data from the files into an array of documents.
I also removed special characters from each document using a regex.

In [None]:
    import glob
    import os
    import re

    documents = []
    # Each sub-directory of Data/ holds plain-text articles
    directoryNames = [path for path in glob.glob(os.path.join("Data", "*"))
                      if os.path.isdir(path)]
    numberOfDocuments = 0

    for folder in directoryNames:
        for dirPath, _, fileNames in os.walk(folder):
            for fileName in fileNames:
                if not fileName.endswith(".txt"):
                    continue
                nameFileDocument = os.path.join(dirPath, fileName)
                with open(nameFileDocument, 'r') as doc:
                    # Replace newlines with spaces so that words on adjacent lines are not glued together
                    doc_text = doc.read().replace('\n', ' ')
                # Replace every non-alphanumeric character with a space
                processed_doc_text = re.sub('[^a-zA-Z0-9]', ' ', doc_text)
                documents.append(processed_doc_text)
                numberOfDocuments += 1
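
To show what the cleaning step does to a single string, here is a small sketch. The helper name `clean_text` and the sample sentence are assumptions for illustration, and collapsing repeated spaces is an extra step that the cell above does not perform.

```python
import re

# Matches every character that is not an ASCII letter or digit
NON_ALNUM = re.compile(r"[^a-zA-Z0-9]")

def clean_text(raw):
    """Replace newlines and special characters with single spaces."""
    spaced = NON_ALNUM.sub(" ", raw)
    # Collapse runs of whitespace left behind by the substitution
    return " ".join(spaced.split())

sample = "IL-6 (interleukin)\nlevels rose by 12%."
cleaned = clean_text(sample)
```

Note that hyphenated terms such as "IL-6" are split into two tokens by this approach, which may matter for biomedical text.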

I then tokenized each document and removed all the English stop words.
I got the list of stop words from the stop-words package.

In [None]:
    # remove common words and tokenize
    from stop_words import get_stop_words
    # A set makes the membership test in the comprehension below much faster than a list
    stop_words = set(get_stop_words('english'))
    texts = [[word for word in document.lower().split() if word not in stop_words]
             for document in documents]
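
As a quick illustration of the same filter on one sentence, here is a sketch that does not need the stop-words package. The tiny stop list `TOY_STOP_WORDS` and the helper `tokenize` are assumptions for illustration. The real list is much longer.

```python
# Tiny stand-in stop list, for illustration only
TOY_STOP_WORDS = frozenset({"the", "of", "in", "and", "a", "is"})

def tokenize(document, stop_words=TOY_STOP_WORDS):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in document.lower().split() if w not in stop_words]

tokens = tokenize("The role of IL6 in the immune response")
```

Lowercasing happens before the stop-word check, so "The" at the start of the sentence is removed as well.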

I then removed all the words that appeared only once across the whole corpus.

In [None]:
    # remove words that appear only once
    from collections import defaultdict
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1

    texts = [[token for token in text if frequency[token] > 1]
             for text in texts]
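
The same corpus-wide count can also be written with `collections.Counter`. This is only an alternative sketch, and the helper name `drop_singletons` and the toy input are assumptions for illustration.

```python
from collections import Counter
from itertools import chain

def drop_singletons(texts):
    """Remove tokens whose total count over all documents is 1."""
    counts = Counter(chain.from_iterable(texts))
    return [[tok for tok in text if counts[tok] > 1] for text in texts]

toy_texts = [["gene", "cell", "rare"], ["gene", "cell", "cell"]]
filtered = drop_singletons(toy_texts)
```

Here "rare" is dropped because it occurs once in the whole toy corpus, not because it occurs once in its own document.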

## Step 2
Since I had decided to use the Gensim library for this project, I need a dictionary and a corpus (bag of words) to create the TFIDF matrix, which I store in Matrix Market format.
I will create a dictionary from the list of documents using the Gensim API.
I will also use the "filter_extremes" function of the Gensim dictionary object to filter out the tokens that appear in
1. fewer than no_below documents (absolute number) or
2. more than no_above documents (fraction of total corpus size, not absolute number).

After (1) and (2), it keeps only the keep_n most frequent tokens (or all of them if keep_n is None).

In [None]:
    from gensim import corpora
    dictionary = corpora.Dictionary(texts)
    dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=1000000)
    dictionary.save('files/pmc-data.dict') # store the dictionary, for future reference
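
To see how the three parameters interact, here is a pure-Python approximation of the filtering rules described above. It is a sketch of the idea and not Gensim's implementation. The function name `filter_by_doc_frequency` and the toy corpus are assumptions for illustration, and details such as rounding of the no_above threshold may differ in Gensim.

```python
from collections import Counter

def filter_by_doc_frequency(texts, no_below, no_above, keep_n=None):
    """Return the tokens that survive the document-frequency filter."""
    n_docs = len(texts)
    # Count each token once per document it appears in
    df = Counter(tok for text in texts for tok in set(text))
    max_docs = no_above * n_docs  # turn the fraction into a document count
    kept = [tok for tok, n in df.items() if no_below <= n <= max_docs]
    # Most frequent first; ties broken alphabetically for a stable result
    kept.sort(key=lambda tok: (-df[tok], tok))
    if keep_n is not None:
        kept = kept[:keep_n]
    return kept

toy_texts = [
    ["cell", "gene"],
    ["cell", "gene", "assay"],
    ["cell", "protein"],
    ["cell", "protein", "rare"],
]
vocab = filter_by_doc_frequency(toy_texts, no_below=2, no_above=0.5)
```

In this toy corpus "cell" is removed for being too common (4 of 4 documents), while "assay" and "rare" are removed for being too rare (1 document each).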

I will now create the corpus object. 

In [None]:
    corpus = [dictionary.doc2bow(text) for text in texts]
    corpora.MmCorpus.serialize('files/pmc-data.mm', corpus) # store to disk, for later use
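
A bag-of-words vector is a list of (word id, count) pairs for one document. The sketch below mimics that idea in plain Python. The helpers `build_token2id` and `to_bow` are assumptions for illustration, and the ids Gensim assigns will generally differ.

```python
from collections import Counter

def build_token2id(texts):
    """Assign consecutive integer ids to tokens in order of first appearance."""
    token2id = {}
    for text in texts:
        for tok in text:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(text, token2id):
    """Return sorted (token id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[tok] for tok in text if tok in token2id)
    return sorted(counts.items())

toy_texts = [["gene", "cell", "cell"], ["protein", "cell"]]
token2id = build_token2id(toy_texts)
bow = to_bow(["cell", "cell", "gene", "unknown"], token2id)
```

Tokens that are missing from the dictionary, such as those removed by the frequency filter, simply do not appear in the vector.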

I will now create the TFIDF matrix and store it in Matrix Market format.

In [None]:
    from gensim.corpora import MmCorpus
    mm = MmCorpus('files/pmc-data.mm')
    from gensim.models import TfidfModel
    tfidf = TfidfModel(mm, id2word=dictionary, normalize=True)
    MmCorpus.serialize('files/pmc-data-tfidf.mm', tfidf[mm], progress_cnt=10000)

## Step 3
Now that the TFIDF matrix is ready, I can take the user input and create a JSON response for visualizing the relevant words.

In [None]:
    # Ask user to input the term
    term = input("Please input the term you want to visualize: ")
    if term != "":
        # generate the json tree of relevant words for the input term
        response = generateJSON(term, 1, [])
        import json
        with open('/var/www/Hierarchie/app/data/pmc-data.json', 'w') as outfile:
            json.dump(response, outfile)

I will also load the previously saved dictionary and TFIDF data.
I will create id2word and word2id lookup tables from the dictionary object.

In [None]:
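# Load the dictionary from the file. Create id2word and word2id variables
from gensim import corpora
dictionary = corpora.Dictionary.load('files/pmc-data.dict')
# Note: despite its name, id2word maps word -> id (it is dictionary.token2id),
# and word2id is the inverse mapping id -> word. The later cells rely on these names.
id2word = dictionary.token2id
word2id = {v: k for k, v in id2word.items()}

# Load the tfidf data as a sparse matrix in coordinate format:
# file.row holds document ids, file.col holds word ids, file.data holds TFIDF values
from scipy.io import mmread
file = mmread('files/pmc-data-tfidf.mm')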

I am now ready to create the JSON.
I will create a separate function that does this job.
To find the relevant words for the input term, I first need to do 2 things:
1. Find all the documents that contain the input word
2. Get all the words in those documents

I wrote this logic in 2 utility functions.

In [None]:
# Retrieves the IDs of all documents in the TFIDF matrix that contain the word
def getDocumentsWithWord(term):
    # id2word is dictionary.token2id, so this lookup returns the word's column id
    wordId = id2word[term]
    return [row for row, col in zip(file.row, file.col) if col == wordId]

# Retrieves all the word ids along with their TFIDF values in the document with id docId
def getWordsAndTFIDF(docId):
    return [[col, value]
            for row, col, value in zip(file.row, file.col, file.data)
            if row == docId]
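
Both helpers scan every non-zero entry of the matrix on each call, which gets slow once they are called repeatedly during recursion. One possible alternative, sketched here under the assumption that the matrix is available as parallel row, column and value sequences, is to build two lookup tables in a single pass. The function name `build_indexes` and the toy triples are hypothetical.

```python
from collections import defaultdict

def build_indexes(rows, cols, values):
    """Build word -> documents and document -> (word, value) lookups in one pass."""
    docs_by_word = defaultdict(list)
    words_by_doc = defaultdict(list)
    for doc_id, word_id, value in zip(rows, cols, values):
        docs_by_word[word_id].append(doc_id)
        words_by_doc[doc_id].append((word_id, value))
    return docs_by_word, words_by_doc

# Toy coordinate triples: (document row, word column, TFIDF value)
rows = [0, 0, 1, 1, 2]
cols = [0, 1, 1, 2, 0]
vals = [0.6, 0.8, 0.5, 0.9, 1.0]
docs_by_word, words_by_doc = build_indexes(rows, cols, vals)
```

After this one-time pass, each lookup is a dictionary access instead of a full scan.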

I can now find all the relevant words. After finding all the words in the documents that contain the input term, I sum the TFIDF values of each word and sort the resulting dictionary by those sums in descending order.

In [None]:
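    # First part of the body of generateJSON(inputWord, level, parents)
    docs = getDocumentsWithWord(inputWord)
    parents.append(inputWord)

    # Maps word id -> accumulated TFIDF value over all matching documents
    relevant_words = {}

    for doc in docs:
        data = getWordsAndTFIDF(doc)
        for d in data:
            # word2id maps id -> word here, so this skips words already used higher in the tree
            if word2id[d[0]] in parents:
                continue
            if d[0] not in relevant_words:
                relevant_words[d[0]] = 0.0
            relevant_words[d[0]] += d[1]

    from operator import itemgetter
    sorted_relevant_words = sorted(relevant_words.items(), key=itemgetter(1), reverse=True)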
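
The accumulate-and-rank step can be tried in isolation on toy data. This is a sketch using `collections.Counter`. The function name `rank_related` and the sample vectors are assumptions for illustration, and words are used directly instead of ids to keep it short.

```python
from collections import Counter

def rank_related(doc_vectors, exclude=()):
    """Sum weights per word over all documents and rank them, highest first."""
    totals = Counter()
    for vector in doc_vectors:
        for word, weight in vector:
            if word not in exclude:
                totals[word] += weight
    return totals.most_common()

# Hypothetical (word, TFIDF) vectors of the documents containing the input term
vectors = [
    [("gene", 0.25), ("cell", 0.5)],
    [("gene", 0.5), ("assay", 0.125), ("cell", 0.25)],
]
ranked = rank_related(vectors, exclude={"cell"})
```

With "cell" excluded as the input term, "gene" ranks first because its weights from both documents add up.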

To find the second level of relevant words, I will apply the same logic to each of the top "n" relevant words. That is, I will find the top "n" relevant words for each of the top "n" relevant words of the input term. This can continue up to the maximum number of levels (depth) of relevant words. For now I will calculate 4 levels of relevant words and the top 10 words in each level.
To implement this logic and create the JSON response, I will use recursion.
The JSON is structured in the format required by the chosen visualization framework.

In [None]:
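    # Continuation of the generateJSON body. max_words and max_level are expected
    # to be defined beforehand (10 and 4 according to the description above).
    topic = {}
    topic['name'] = inputWord
    topic['words'] = []
    topic['children'] = []

    # Record the top words and mark them as used so that deeper levels skip them
    for w in sorted_relevant_words[:max_words]:
        topic['words'].append(word2id[w[0]])
        parents.append(word2id[w[0]])

    # Recurse into each top word until the maximum depth is reached
    if level < max_level:
        for w in sorted_relevant_words[:max_words]:
            topic['children'].append(generateJSON(word2id[w[0]], level=level+1, parents=parents))
    return topic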
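
The shape of the recursion and of the resulting JSON can be seen on a toy example. In this sketch a hypothetical dictionary `toy_related` stands in for the TFIDF ranking, and the function name `build_tree` and its parameters are assumptions for illustration. The `name`, `words` and `children` keys are the ones used above.

```python
import json

def build_tree(word, related, level=1, max_level=2, max_words=2, seen=None):
    """Recursively build a name/words/children tree, never repeating a word."""
    seen = set() if seen is None else seen
    seen.add(word)
    # Keep only candidates that were not used earlier in the tree
    top = [w for w in related.get(word, []) if w not in seen][:max_words]
    seen.update(top)
    node = {"name": word, "words": top, "children": []}
    if level < max_level:
        node["children"] = [
            build_tree(w, related, level + 1, max_level, max_words, seen)
            for w in top
        ]
    return node

# Hypothetical ranked neighbours for each word
toy_related = {
    "gene": ["cell", "protein", "assay"],
    "cell": ["gene", "membrane"],
    "protein": ["cell", "enzyme"],
}
tree = build_tree("gene", toy_related)
tree_json = json.dumps(tree)
```

Because the set of seen words is shared across the whole recursion, "gene" and "cell" do not reappear under the children, which mirrors how the `parents` list is passed along above.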

## Step 4
Now that the JSON is ready, I can visualize it. The visualization app is based on the open source repository provided by the framework. It is a simple AngularJS app with one controller and one route.

<img src="img1.png">

## Reflection
This was a great project to learn about text mining and topic modelling. 
I gained a lot of knowledge by reading several papers and online content about topic modelling. 

Finding ways to optimize the dictionary and corpus was a big learning process. These included:
1. removing the common words (stop words),
2. removing words that appear only once,
3. removing words that appear in fewer than 20 documents,
4. removing words that appear in more than 10% of documents,
5. lemmatizing the dictionary (finding the lemma of each word and keeping only the lemma).

Experimenting with LDA and LSA and finally choosing TFIDF was again a learning process. The difficulty of implementing LDA and LSA and of understanding their output was the main reason I chose a simple TFIDF implementation.

Visualization libraries were new to me and good to learn about. They provide a wonderful way to visualize topic models.