# Welcome to GRC ML Hackathon Challenge A!

This challenge is a collaborative effort to perform data science using a collection of publications.  During this challenge, we will focus on pin-fin heat exchangers. The approach used is nearly identical to an analysis involving the NASA Taxonomy and USAF SBIR. The notebook for that analysis has been graciously shared with us by Charles Liles (charles.a.liles@nasa.gov), and can be read here: https://visualization.larc.nasa.gov/dash/notebooks/SBIR_LDA.

The data used for Challenge A is open source, and it was graciously provided by Karsten Look (karsten.look@nasa.gov).

This notebook performs the same latent Dirichlet allocation (LDA) topic modeling on the heat exchanger dataset as in the previously posted example. Some participants will leverage inexpensive Google Cloud Storage for hosting the data, while others may use Google CoLab (via Google Drive) or storage available on a local computer such as a laptop. If using Google Cloud Storage, you can skip directly to that section by clicking this link: <a href='#cloud_storage'>Cloud Storage Read/Write Demo</a>. If using Google CoLab, refer to the section on reading data from Google CoLab further down in this notebook. If running the notebook locally, refer to the section on reading data from your local hard drive.

We will also perform t-SNE on the vectorized heat exchanger data and build an interactive visualization using the Python bokeh library.  Click this link to skip directly to that section of the notebook: <a href='#tsne_bokeh'>t-SNE and bokeh Visualization</a>.

## Import Software Dependencies and Define Functions

Next, we will import all Python libraries needed for model generation.  We will also build a function to handle entity extraction from text, stop word removal, and Snowball stemming.  Stop words are common words such as "a" and "the" which are not useful for NLP machine learning; you can read more about stop word removal here: https://en.wikipedia.org/wiki/Stop_words. Stemming is a means of transforming words to their root form. For example, the terms "radiate", "radiates", and "radiation" would all transform to a root form of "radiat" after stemming.  More information about Snowball stemming is available here: https://en.wikipedia.org/wiki/Snowball_(programming_language).

In [1]:
# Uncomment the below lines if these dependencies are not already installed
# !pip install --upgrade scispacy
# !pip install --upgrade spacy
# !pip install --upgrade gensim
# !pip install --upgrade pyLDAvis

# Uncomment the below line if en_core_sci_sm is not loaded on your system.
# !pip install --upgrade https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz

#uncomment if you need to read the data files from google cloud storage
#from google.cloud import storage

#uncomment if you need to read data in from Google Drive that's mounted to a CoLab notebook
#from google.colab import drive

import time
import bokeh
from sklearn.manifold import TSNE
import pandas as pd
import scispacy
import spacy
import string
from spacy.lang.en import English
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis
import pyLDAvis.sklearn
from gensim import similarities
from sklearn.neighbors import NearestNeighbors
from nltk.corpus import stopwords  
from nltk.stem.snowball import SnowballStemmer
from scipy.spatial.distance import jensenshannon
import numpy as np
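# Optional: only needed for the Cloud Storage section, so do not fail
# for participants who read the data from Google Drive or a local file.
try:
    from google.cloud import storage
except ImportError:
    storage = None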

nlp = spacy.load('en_core_sci_sm')

import nltk
# Uncomment the below line if nltk's stopwords are not already on your system.
#nltk.download('stopwords')
stop_words = set(stopwords.words('english')) 
stemmer = SnowballStemmer('english')

from bokeh.models import HoverTool, CustomJS, ColumnDataSource
from bokeh.palettes import viridis, Viridis256, magma, Turbo256, linear_palette
from bokeh.transform import factor_cmap
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()



In [2]:
def get_entities(temp_article):
    '''
    A function for preprocessing an article's text.  spaCy's en_core_sci_sm model is used to identify scientific domain terms.
    Stop word removal and Snowball stemming are then used to clean up each term.
    Arg:
        temp_article: the string input for preprocessing
    Returns:
        temp_entities: a list of preprocessed terms
    '''
    if len(temp_article) > nlp.max_length:
        nlp.max_length = len(temp_article)
    doc = nlp(temp_article)
    temp_entities = []
    for n in doc.ents:
        temp_str = ''
        for w in n:
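            # Compare the token's lowercased text; a spaCy Token object never equals a str in stop_words
            if w.text.lower() not in stop_words: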
                if str(w).isalpha():
                    temp_str += stemmer.stem(str(w)) + ' '
                else:
                    temp_str += str(w) + ' '
        temp_entities.append(temp_str.strip())
    return temp_entities
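
To make the stop word and stemming steps concrete without loading spaCy or nltk, here is a minimal sketch of the same idea. Everything in it is a hypothetical stand-in: `TOY_STOP_WORDS` is a tiny hand-picked set, and `toy_stem` is a crude suffix stripper, not the Snowball algorithm. It only mimics the "radiate"/"radiates"/"radiation" example given earlier.

```python
# Hypothetical stand-ins: a tiny stop word set and a toy suffix stripper.
# The notebook itself uses nltk's English stop words and SnowballStemmer.
TOY_STOP_WORDS = {"a", "an", "the", "of", "and"}
TOY_SUFFIXES = ("ion", "es", "e", "s")  # checked in this order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop stop words, and stem purely alphabetic tokens."""
    cleaned = []
    for token in text.lower().split():
        if token in TOY_STOP_WORDS:
            continue
        # Mirror get_entities: only alphabetic tokens are stemmed.
        cleaned.append(toy_stem(token) if token.isalpha() else token)
    return cleaned


print(preprocess("The fin radiates heat"))
print([toy_stem(w) for w in ("radiate", "radiates", "radiation")])
```

As in `get_entities`, tokens containing digits or punctuation pass through unstemmed, which keeps identifiers such as part numbers intact.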

We will now move on to preprocessing and the LDA visualizations.  If you are already familiar with this work from the previous notebook, you can skip to t-SNE/visualization by following this link: <a href='#tsne_bokeh'>t-SNE and bokeh Visualization</a>.

The team has already cleaned up the PDF articles, with the formatted version saved to `Combined_Output.txt`. Articles in the file are newline delimited.

# Reading the data
For this challenge, there are three different ways you can read the data.
1. From GCP Cloud storage
2. From Google Drive
3. From your local hard drive.

## From GCP Cloud storage
To read from GCP Cloud storage, we'll use the bucket ID variable and `google.cloud.storage` to connect to the bucket and download the file as a string. The method name `download_as_string` is a little misleading: what's returned is a bytes object, so we need to decode it using `decode("utf-8")` to convert it to a Python string. All the articles are contained within a single string, so to get a list of articles we use `split("\n")` to split the string using newline characters as the delimiter.

To preview the data, we only show the first 500 characters for the first few articles. Be careful when printing the downloaded bytes object, string, and list within JupyterLab and on Google Cloud. Trying to print everything might result in warnings, errors, or the notebook locking up.

In [4]:
BUCKET_ID="grc-ml-hackathon"
client = storage.Client()
bucket = client.get_bucket(BUCKET_ID)
blob = bucket.get_blob("challenge-ab/Combined_Output.txt")
data = blob.download_as_string().decode("utf-8")
data = data.split("\n")
print("\n\n".join((data[0][:500], data[1][:500], data[2][:500])))

AttributeError: 'NoneType' object has no attribute 'download_as_string'
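
The `AttributeError` above is what you see when `bucket.get_blob` finds no object at the given path: it returns `None` instead of raising. A minimal sketch of a guard around that case follows. The helper name `blob_to_articles` is hypothetical, it assumes a `google-cloud-storage` release that provides `download_as_bytes` (the newer name for `download_as_string`), and skipping blank lines is a choice made here, not something the original cell does.

```python
def blob_to_articles(blob, encoding="utf-8"):
    """Turn a downloaded blob into a list of newline-delimited articles.

    `blob` is whatever bucket.get_blob(...) returned, which is None
    when no object exists at the requested path.
    """
    if blob is None:
        raise FileNotFoundError(
            "No object found at that path; check the bucket ID and blob name."
        )
    # download_as_bytes returns bytes, so decode before splitting.
    text = blob.download_as_bytes().decode(encoding)
    # Assumption: blank lines are not articles, so they are dropped.
    return [line for line in text.split("\n") if line.strip()]
```

With a helper like this, the cell would read `data = blob_to_articles(bucket.get_blob("challenge-ab/Combined_Output.txt"))`, and a wrong path fails with a message that names the likely cause.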

## From Google CoLab
Before you can follow these directions (https://colab.research.google.com/notebooks/io.ipynb#scrollTo=u22w3BFiOveA) for mounting and reading data from Google Drive, some manual setup is required.

1. Contact Calvin and give him your personal Google Account. The account will be used to grant you access to the data via Google Drive by sharing it with you.
2. Log into Google Drive (drive.google.com) using your personal Google account, and navigate to the area titled "Shared with me". Once there, you'll be able to see the folder of data which Calvin shared. Right click on that folder and select the option titled "Add shortcut to Drive."
3. While still logged into Google Drive, navigate to "My Drive" and confirm that the folder appears there. The icon should have an arrow (very similar to shortcut icons on Windows).
4. If successful, you should be able to follow the directions from the above URL.

Below is an example of using the above instructions to read data into a Google Colab notebook from Google Drive. This will likely only work if you're running the notebook in Colab. If running from AI Platform Notebooks, refer to the section on reading data from Cloud Storage. If running locally, skip ahead to the section on reading data locally.

In [None]:
with open("/content/drive/My Drive/Combined_Output.txt", "r") as infile:
    data = infile.readlines()
    
print("\n\n".join((data[0][:500], data[1][:500], data[2][:500])))

## From local harddrive

If running JupyterLab on your own laptop or computer, we can read the data directly from the local hard drive. For this, we don't need third-party modules to read the file, and can use Python's built-in `open` function or `pathlib.Path.open`. The example below shows how to read the data using `open`. Just remember to change the path to point to where the file is saved.

In [None]:
with open("Combined_Output.txt") as infile:
    data = infile.readlines()

print("\n\n".join((data[0][:500], data[1][:500], data[2][:500])))
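
For comparison, here is a minimal sketch of the `pathlib` route mentioned above. The function name `read_articles` is hypothetical. Unlike `readlines`, it strips the trailing newline from each article and skips blank lines, which is a choice made here, not part of the original cell.

```python
from pathlib import Path


def read_articles(path, encoding="utf-8"):
    """Read a newline-delimited article file into a list of strings."""
    with Path(path).open(encoding=encoding) as infile:
        # Iterating over the file yields one line (one article) at a time.
        return [line.rstrip("\n") for line in infile if line.strip()]
```

Calling `data = read_articles("Combined_Output.txt")` would then give a list that can be previewed in the same way as above.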

## Fit Heat Exchanger Articles with a Bag of Words Vectorizer

We will set up a bag of words vectorizer below.  This is a means of representing an article or document via the number of word occurrences within it.  We will then fit this vectorizer to our heat exchanger data. 

In [6]:
bow_vector = CountVectorizer(tokenizer=get_entities, min_df=2, max_df=5, ngram_range=(1,3))
data_vectorized = bow_vector.fit_transform(data)
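
To illustrate what the vectorizer produces, here is a minimal pure-Python sketch of a bag of words with document-frequency filtering. It is not scikit-learn's implementation, and the function name `bag_of_words` is hypothetical. It mirrors how `CountVectorizer` treats integer `min_df` and `max_df` values: as absolute document counts, so `min_df=2, max_df=5` keeps only terms that appear in at least 2 and at most 5 documents.

```python
from collections import Counter


def bag_of_words(tokenized_docs, min_df=1, max_df=None):
    """Return (vocabulary, count rows) for a list of token lists.

    min_df and max_df are absolute document counts (inclusive bounds).
    """
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    upper = len(tokenized_docs) if max_df is None else max_df
    vocab = sorted(t for t, n in doc_freq.items() if min_df <= n <= upper)
    index = {term: i for i, term in enumerate(vocab)}

    rows = []
    for doc in tokenized_docs:
        row = [0] * len(vocab)
        for term in doc:
            if term in index:
                row[index[term]] += 1
        rows.append(row)
    return vocab, rows


docs = [["fin", "heat", "fin"], ["heat", "flow"], ["fin", "heat"]]
vocab, rows = bag_of_words(docs, min_df=2)
print(vocab)
print(rows)
```

Here "flow" appears in only one document, so `min_df=2` removes it from the vocabulary. Note that an integer `max_df=5` is a strict cap on a larger corpus: any term found in more than five articles is discarded.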

## Define the LDA Model and Fit to the Vectorized Data

We can now define our LDA model and fit it to our vectorized heat exchanger data.  LDA requires a user to pre-select the number of topic clusters to which the document corpus will be fit.  In this case, we will select 10 clusters for topic modeling.

In [7]:
cluster_count = 10

# Build LDA Model
lda_model = LatentDirichletAllocation(n_components=cluster_count,               # Number of topics
                                      #doc_topic_prior=1.0,
                                      max_iter=100,               # Max learning iterations
                                      learning_method='online',   
                                      random_state=100,          # Random state
                                      batch_size=128,            # n docs in each learning iter
                                      evaluate_every = -1,       # compute perplexity every n iters, default: Don't
                                      n_jobs = -1,               # Use all available CPUs
                                     )

In [8]:
lda_output = lda_model.fit_transform(data_vectorized)

## View Fitted Topics

Now that we have fit our LDA model, let's interactively view the fitted model's topics in a 2D projection below.  A user can hover over topics on the left-hand side and see the 30 most relevant terms for the topic on the right-hand side.  Users can also click on words listed on the right-hand side and see their usage throughout all of the fitted model clusters.

The interactive user-interface below helps a user intuitively understand goodness of fit from the original document corpus.

In [9]:
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, data_vectorized, bow_vector, mds='tsne', sort_topics=False)
panel

# t-SNE and bokeh Visualization<a id='tsne_bokeh'></a>

We will now perform t-distributed stochastic neighbor embedding (t-SNE) on our vectorized heat exchanger articles from the previous step.  t-SNE is a non-linear means of reducing multiple dimensions to two or three dimensions for visualization.  After using t-SNE to project our vectorized heat exchanger documents into 2D, we will plot the documents and assign colors to them based on the cluster assigned by the previous LDA topic modeling.

First, we need to determine to which categories LDA assigned each document.

In [10]:
top_cluster = []
for n in range(lda_output.shape[0]):
    categ = np.argwhere(lda_output[n,:] == np.amax(lda_output[n,:]))
    top_cluster.append(str(categ[0][0]+1))
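
As a small illustration of the assignment above, here is a sketch on a made-up document-topic matrix. The values in `toy_output` are invented for the example, and the helper `top_topic_labels` is hypothetical. Each row holds one document's topic weights, and the label is the 1-based index of the largest weight, stored as a string so that bokeh can treat it as a categorical factor later.

```python
def top_topic_labels(doc_topic_rows):
    """Return the 1-based index of the largest weight in each row, as strings."""
    labels = []
    for row in doc_topic_rows:
        # max() returns the first index on ties, matching categ[0][0] above.
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(str(best + 1))
    return labels


# Made-up weights for three documents and three topics.
toy_output = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
]
print(top_topic_labels(toy_output))
```

On the real NumPy array, `lda_output.argmax(axis=1) + 1` gives the same 1-based indices without an explicit loop.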

Next, we will convert our vectorized heat exchanger data from a sparse matrix to a dense one and then perform t-SNE.

In [11]:
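# toarray() returns a plain ndarray; todense() returns np.matrix, which recent scikit-learn versions reject
test = data_vectorized.toarray()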

In [12]:
tsne = TSNE(random_state=2017, perplexity=30, n_iter=5000)#, early_exaggeration=120)
embedding = tsne.fit_transform(test)

Let's see how many unique categories were assigned to our documents during LDA topic modeling.  We originally told the algorithm to cluster the heat exchanger documents into 10 topics.  However, some of these topics were fairly small per our visualization above.  Some topics may not be assigned to any document as the primary category.

In [13]:
unique_cats = list(set(top_cluster))
unique_cats_count = len(unique_cats)
print('There are ' + str(unique_cats_count) + ' unique categories in our dataset.')

There are 10 unique categories in our dataset.
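
The count above shows how many categories are in use, but not how large each one is. A minimal sketch for tallying documents per category follows. The helper name `category_sizes` is hypothetical, and it assumes labels are 1-based strings like those in `top_cluster`.

```python
from collections import Counter


def category_sizes(labels, n_topics):
    """Count documents per category, including categories with no documents."""
    counts = Counter(labels)
    return {str(k): counts.get(str(k), 0) for k in range(1, n_topics + 1)}


# Made-up labels for five documents and four topics.
sizes = category_sizes(["1", "3", "1", "4", "3"], n_topics=4)
print(sizes)
```

In the notebook this would be called as `category_sizes(top_cluster, cluster_count)`; any category with a count of zero was never the primary topic for a document.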


In this run, all 10 of our categories were assigned as the primary topic for at least one document in the heat exchanger collection.  If some categories had gone unassigned, a second iteration of LDA could be performed with fewer categories.  We will skip this step for now and use our unique category count for building an appropriate bokeh color palette. 

In [14]:
viridis_spec = viridis(unique_cats_count)
magma_spec = magma(unique_cats_count)
turbo_spec = linear_palette(Turbo256, unique_cats_count)

Next, let's visualize our t-SNE reduced dimensions and also use the original LDA cluster assignment for a color palette.  We will also use a hover functionality to show the index of the heat exchanger document, the document's title, and also the category # assigned to the document by the LDA clustering algorithm above.

In [15]:
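source = ColumnDataSource(data=dict(
    x=embedding[:,0],
    y=embedding[:,1],
    # Assumption: no separate titles are loaded in this notebook, so the first
    # 80 characters of each article stand in for the title shown by '@desc'.
    desc=[article[:80] for article in data],
    top_cluster=top_cluster
))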

TOOLTIPS = [
    ("index", '$index'),
    ("Title", '@desc'),
    ('Category', '@top_cluster')
]

In [16]:
p = figure(title="t-SNE Projection of Vectorized Heat Exchangers", tooltips=TOOLTIPS, plot_height=800, plot_width=800)
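# factors must list each category once, in palette order, not once per document
p.circle('x', 'y', size=10, source=source, fill_alpha=0.75, fill_color=factor_cmap('top_cluster', palette=magma_spec, factors=sorted(set(top_cluster), key=int)))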
show(p)



## Summary

This is a nice visualization of the original vectorized heat exchanger documents.  It is possible to hover and see where similar documents are clustering together.  The t-SNE algorithm has projected the vectorized documents into a 2D space where similar documents should be grouped together.  For example, at roughly coordinates (7, 3), a user can hover over a darker blue grouping of heat exchanger documents, all of which were assigned to category 4 by the original LDA clustering algorithm.  

This plot has several interactive features.  Users can zoom in and out by scrolling, and can draw boxes to zoom in on a region.  The number of colors is a bit much.  It would be desirable to rerun this notebook in the future with fewer color categories (and fewer LDA clusters).

This has been an initial exploration of leveraging scikit-learn for analysis, applying t-SNE to vectorized heat exchanger documents, and visualizing the results using bokeh in Python.  For any questions on this notebook, please reach out to Charles Liles (charles.a.liles@nasa.gov) or Karsten Look (karsten.look@nasa.gov).