In [42]:
import os 
import numpy as np
import pandas as pd
import nltk

# Data Retrieval

In [30]:
import urllib.request

url = 'https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'
filename = 'nips12raw_str602'
urllib.request.urlretrieve(url, filename)

('nips12raw_str602', <http.client.HTTPMessage at 0x1c411e8c6a0>)

In [31]:
!tar -xzf nips12raw_str602
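
The `!tar` line relies on a shell with `tar` being available to the notebook. Where that is not the case, a rough equivalent using Python's standard `tarfile` module might look like the sketch below. The helper name `extract_tgz` is made up for illustration and is not part of the original notebook.

```python
import tarfile

def extract_tgz(archive_path, dest_dir='.'):
    """Extract a gzip-compressed tar archive into dest_dir (hypothetical helper)."""
    # 'r:gz' opens the archive for reading with gzip decompression
    with tarfile.open(archive_path, mode='r:gz') as archive:
        archive.extractall(path=dest_dir)

# Intended usage with the file downloaded above:
# extract_tgz('nips12raw_str602')
```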

In [33]:
DATA_PATH = 'nipstxt/'
print(os.listdir(DATA_PATH))

['idx', 'MATLAB_NOTES', 'nips00', 'nips01', 'nips02', 'nips03', 'nips04', 'nips05', 'nips06', 'nips07', 'nips08', 'nips09', 'nips10', 'nips11', 'nips12', 'orig', 'RAW_DATA_NOTES', 'README_yann']


# Load and View Dataset

In [34]:
folders = ['nips{0:02}'.format(i) for i in range(0, 13)]
# Read all texts into a list
papers = []
for folder in folders:
    file_names = os.listdir(DATA_PATH + folder)
    for file_name in file_names:
        # Join folder and file name with '/'; read-only mode is enough here
        with open(DATA_PATH + folder + '/' + file_name, encoding='utf-8', errors='ignore', mode='r') as f:
            data = f.read()
        papers.append(data)
len(papers)        

1740

It looks like the OCR hasn’t worked perfectly, and we have some missing characters
here and there. This is expected, but it also makes this task more challenging!

In [41]:
print(papers[0][:1000])

1 
CONNECTIVITY VERSUS ENTROPY 
Yaser S. Abu-Mostafa 
California Institute of Technology 
Pasadena, CA 91125 
ABSTRACT 
How does the connectivity of a neural network (number of synapses per 
neuron) relate to the complexity of the problems it can handle (measured by 
the entropy)? Switching theory would suggest no relation at all, since all Boolean 
functions can be implemented using a circuit with very low connectivity (e.g., 
using two-input NAND gates). However, for a network that learns a problem 
from examples using a local learning rule, we prove that the entropy of the 
problem becomes a lower bound for the connectivity of the network. 
INTRODUCTION 
The most distinguishing feature of neural networks is their ability to spon- 
taneously learn the desired function from 'training' samples, i.e., their ability 
to program themselves. Clearly, a given neural network cannot just learn any 
function, there must be some restrictions on which networks can learn which 
functions. One obv

# Basic Text Wrangling

In [44]:
# Requires the NLTK 'stopwords' and 'wordnet' resources (e.g. via nltk.download)
stop_words = nltk.corpus.stopwords.words('english')
wtk = nltk.tokenize.RegexpTokenizer(r'\w+')  # match runs of word characters
wnl = nltk.stem.wordnet.WordNetLemmatizer()

def normalize_corpus(papers):
    norm_papers = []
    for paper in papers:
        paper = paper.lower()
        # Word tokenization
        paper_tokens = [token.strip() for token in wtk.tokenize(paper)]
        # Lemmatize, dropping purely numeric tokens
        paper_tokens = [wnl.lemmatize(token) for token in paper_tokens if not token.isnumeric()]
        # Drop single-character tokens
        paper_tokens = [token for token in paper_tokens if len(token) > 1]
        # Drop stop words
        paper_tokens = [token for token in paper_tokens if token not in stop_words]
        # Drop any empty tokens that remain
        paper_tokens = list(filter(None, paper_tokens))
        # Keep only papers that still have tokens
        if paper_tokens:
            norm_papers.append(paper_tokens)

    return norm_papers
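
To see what the filtering steps in `normalize_corpus` do without downloading any NLTK resources, here is a minimal standard-library sketch. It is an approximation only: `toy_normalize` and `TOY_STOP_WORDS` are made-up names, the stop word list is a tiny stand-in for NLTK's, and lemmatization is left out entirely.

```python
import re

# Tiny stand-in for NLTK's English stop word list (assumption for illustration)
TOY_STOP_WORDS = {"the", "of", "a", "is", "it", "can", "by", "to"}

def toy_normalize(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, tokenize on word characters, then filter tokens."""
    tokens = re.findall(r"\w+", text.lower())           # word tokenization
    tokens = [t for t in tokens if not t.isnumeric()]   # drop pure numbers
    tokens = [t for t in tokens if len(t) > 1]          # drop single characters
    return [t for t in tokens if t not in stop_words]   # drop stop words

sample = "The entropy of a problem is 1 lower bound (Pasadena, CA 91125)."
print(toy_normalize(sample))
# ['entropy', 'problem', 'lower', 'bound', 'pasadena', 'ca']
```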

In [45]:
norm_papers = normalize_corpus(papers)
print(len(norm_papers))

1740


In [46]:
# Viewing a processed paper
print(norm_papers[0][:50])

['connectivity', 'versus', 'entropy', 'yaser', 'abu', 'mostafa', 'california', 'institute', 'technology', 'pasadena', 'ca', 'abstract', 'doe', 'connectivity', 'neural', 'network', 'number', 'synapsis', 'per', 'neuron', 'relate', 'complexity', 'problem', 'handle', 'measured', 'entropy', 'switching', 'theory', 'would', 'suggest', 'relation', 'since', 'boolean', 'function', 'implemented', 'using', 'circuit', 'low', 'connectivity', 'using', 'two', 'input', 'nand', 'gate', 'however', 'network', 'learns', 'problem', 'example', 'using']


We are now ready to start building topic models.

Let’s get started by looking at ways to generate phrases with influential bi-grams
and remove some terms that may not be useful before feature engineering.

# Text Representation with Feature Engineering

Before feature engineering and vectorization, we want to extract some useful bi-gram
based phrases from our research papers and remove some unnecessary terms.

In [48]:
import gensim

# A higher threshold yields fewer phrases.
# Note: gensim 3.x takes a bytes delimiter as below; gensim 4.x expects a str ('_').
bigram = gensim.models.Phrases(norm_papers, min_count=20, threshold=20, delimiter=b'_')
bigram_model = gensim.models.phrases.Phraser(bigram)

print(bigram_model[norm_papers[0]][:50])

['connectivity', 'versus', 'entropy', 'yaser', 'abu_mostafa', 'california_institute', 'technology_pasadena', 'ca_abstract', 'doe', 'connectivity', 'neural_network', 'number', 'synapsis', 'per', 'neuron', 'relate', 'complexity', 'problem', 'handle', 'measured', 'entropy', 'switching', 'theory', 'would', 'suggest', 'relation', 'since', 'boolean_function', 'implemented', 'using', 'circuit', 'low', 'connectivity', 'using', 'two', 'input', 'nand', 'gate', 'however', 'network', 'learns', 'problem', 'example', 'using', 'local', 'learning', 'rule', 'prove', 'entropy', 'problem']
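
To give a feel for how `min_count` and `threshold` interact, the sketch below scores adjacent word pairs with a rule of the form `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))` and keeps pairs scoring above the threshold. This follows the general shape of the default scorer described in the gensim documentation, but it is a simplified stand-in: the function name `score_bigrams` is hypothetical, and gensim's own bookkeeping (for example, what counts towards the vocabulary size) differs in details.

```python
from collections import Counter

def score_bigrams(docs, min_count, threshold):
    """Return {(a, b): score} for adjacent pairs scoring above threshold."""
    unigram_counts = Counter(tok for doc in docs for tok in doc)
    bigram_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigram_counts)  # simplification: unigrams only
    accepted = {}
    for (a, b), n_ab in bigram_counts.items():
        score = (n_ab - min_count) * vocab_size / (unigram_counts[a] * unigram_counts[b])
        if score > threshold:
            accepted[(a, b)] = score
    return accepted

toy_docs = [
    ['neural', 'network', 'model'],
    ['neural', 'network', 'training'],
    ['neural', 'network', 'learns'],
    ['deep', 'model'],
]
phrases = score_bigrams(toy_docs, min_count=1, threshold=1.0)
print(phrases)  # only ('neural', 'network') passes on this toy data
```

Raising `threshold` or `min_count` lowers the number of pairs that pass, which matches the "higher threshold, fewer phrases" behaviour noted in the cell above.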


Let’s generate phrases for all our tokenized research papers and build a vocabulary
that will help us obtain a unique term/phrase to number mapping.

In [49]:
norm_corpus_bigrams = [bigram_model[doc] for doc in norm_papers]

# Create a dictionary representation of the documents:
dictionary = gensim.corpora.Dictionary(norm_corpus_bigrams)
print('Sample word to number mappings: ', list(dictionary.items())[:15])
print('Total Vocabulary Size: ', len(dictionary))

Sample word to number mappings:  [(0, '0a'), (1, '2h'), (2, '2h2'), (3, '2he'), (4, '2n'), (5, '__c'), (6, '_c'), (7, '_k'), (8, 'a2'), (9, 'ability'), (10, 'abu_mostafa'), (11, 'access'), (12, 'accommodate'), (13, 'according'), (14, 'accumulated')]
Total Vocabulary Size:  78892


We have a lot of unique phrases in our corpus of research papers, based on the
preceding output. Several of these terms are not very useful since they are
specific to a paper or even a paragraph in a research paper.

Hence, it is time to prune our vocabulary and start removing terms. Leveraging
document frequency is a great way to achieve this.

In [50]:
# Filter out words that occur in fewer than 20 documents or in more than 60% of the documents
dictionary.filter_extremes(no_below=20, no_above=0.6)
print('Total Vocabulary Size: ', len(dictionary))

Total Vocabulary Size:  7756
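
The pruning rule can be illustrated on toy data with plain Python. This is only a sketch of the document-frequency idea behind `filter_extremes`, under the assumption that a term is kept when its document count lies between `no_below` and `no_above` times the number of documents. The function name `prune_by_document_frequency` is made up, and gensim additionally caps the vocabulary size through its `keep_n` parameter, which is ignored here.

```python
from collections import Counter

def prune_by_document_frequency(docs, no_below, no_above):
    """Keep terms found in at least no_below docs and at most no_above (fraction) of docs."""
    num_docs = len(docs)
    # Count each term once per document, not once per occurrence
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * num_docs
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ['network', 'analog', 'chip'],
    ['network', 'spike'],
    ['network', 'analog'],
    ['network', 'kernel'],
]
kept = prune_by_document_frequency(toy_docs, no_below=2, no_above=0.6)
print(sorted(kept))
# ['analog']
```

Here `network` is dropped for appearing in every document, while `chip`, `spike` and `kernel` are dropped for appearing in only one.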


Dropping terms that appear in most documents suits our scenario well: we are
interested in finding distinct themes and topics, not themes that recur across
the whole corpus.

**We can now perform feature engineering by leveraging a simple Bag of Words
model.**

In [51]:
# Transforming corpus into bag of words vectors
bow_corpus = [dictionary.doc2bow(text) for text in norm_corpus_bigrams]
print(bow_corpus[1][:50])

[(3, 1), (12, 3), (14, 1), (15, 1), (16, 1), (17, 16), (20, 1), (24, 1), (26, 1), (31, 3), (35, 1), (36, 1), (40, 3), (41, 5), (42, 1), (48, 1), (53, 3), (55, 1), (56, 2), (58, 1), (60, 3), (63, 5), (64, 4), (65, 2), (73, 1), (74, 1), (75, 1), (76, 1), (77, 3), (82, 1), (83, 4), (84, 1), (85, 1), (86, 2), (94, 1), (96, 2), (97, 3), (106, 1), (110, 1), (119, 2), (120, 4), (121, 2), (124, 2), (127, 1), (128, 1), (132, 1), (133, 1), (135, 6), (136, 1), (144, 1)]


In [52]:
# Viewing actual terms and their counts
print([(dictionary[idx], freq) for idx, freq in bow_corpus[1][:50]])

[('ability', 1), ('aip', 3), ('although', 1), ('american_institute', 1), ('amount', 1), ('analog', 16), ('appears', 1), ('architecture', 1), ('aspect', 1), ('available', 3), ('become', 1), ('becomes', 1), ('binary', 3), ('biological', 5), ('bit', 1), ('cannot', 1), ('circuit', 3), ('collective', 1), ('compare', 2), ('complex', 1), ('computing', 3), ('conference', 5), ('connected', 4), ('connectivity', 2), ('define', 1), ('defined', 1), ('defines', 1), ('definition', 1), ('denker', 3), ('designed', 1), ('desired', 4), ('diagonal', 1), ('difference', 1), ('directly', 2), ('ed', 1), ('el', 2), ('element', 3), ('equivalent', 1), ('eventually', 1), ('feature', 2), ('final', 4), ('find', 2), ('fixed', 2), ('frequency', 1), ('furthermore', 1), ('generating', 1), ('get', 1), ('global', 6), ('go', 1), ('hence', 1)]
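
The `(id, count)` pairs above can be reproduced in miniature with the standard library. The sketch below is not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and ids are assigned in first-seen order, which is an assumption of this example rather than how `Dictionary` numbers its terms.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id in first-seen order."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

toy_docs = [['analog', 'circuit', 'analog'], ['circuit', 'chip']]
vocab = build_vocab(toy_docs)
print(to_bow(toy_docs[0], vocab))
# [(0, 2), (1, 1)]
print(to_bow(['chip', 'unknown', 'chip', 'analog'], vocab))
# [(0, 1), (2, 2)]
```

Word order is discarded and only counts survive, which is the defining property of a Bag of Words representation.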


In [53]:
# total papers in the corpus
print('Total number of papers: ', len(bow_corpus))

Total number of papers:  1740


**Our documents are now processed and have a good enough representation with the
Bag of Words model to begin modeling.**

# Building LDA

In [61]:
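# TOTAL_TOPICS is not defined in the cells shown; 10 is assumed from print_topics(num_topics=10) below
TOTAL_TOPICS = 10
lda_model = gensim.models.LdaModel(corpus=bow_corpus, id2word=dictionary, chunksize=1740, alpha='auto', eta='auto', 
                                   random_state=42, iterations=500, num_topics=TOTAL_TOPICS, passes=20, eval_every=None)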

Viewing the topics in our trained topic model is quite easy, and we can generate them
with the following code.

In [62]:
for topic_id, topic in lda_model.print_topics(num_topics=10, num_words=20):
    print('Topic #'+str(topic_id+1)+':')
    print(topic)
    print()

Topic #1:
0.013*"circuit" + 0.012*"chip" + 0.008*"neuron" + 0.008*"analog" + 0.007*"current" + 0.007*"bit" + 0.006*"voltage" + 0.005*"node" + 0.005*"word" + 0.005*"vector" + 0.005*"processor" + 0.004*"implementation" + 0.004*"threshold" + 0.004*"computation" + 0.004*"element" + 0.004*"signal" + 0.004*"pattern" + 0.004*"design" + 0.004*"memory" + 0.004*"parallel"

Topic #2:
0.030*"image" + 0.012*"object" + 0.011*"feature" + 0.006*"pixel" + 0.006*"visual" + 0.005*"representation" + 0.005*"recognition" + 0.005*"unit" + 0.005*"motion" + 0.005*"face" + 0.005*"task" + 0.004*"view" + 0.004*"layer" + 0.004*"human" + 0.004*"training" + 0.004*"position" + 0.004*"location" + 0.004*"region" + 0.004*"character" + 0.003*"vector"

Topic #3:
0.020*"neuron" + 0.017*"cell" + 0.012*"response" + 0.010*"stimulus" + 0.007*"spike" + 0.007*"signal" + 0.006*"activity" + 0.006*"synaptic" + 0.005*"firing" + 0.005*"frequency" + 0.005*"pattern" + 0.004*"current" + 0.004*"effect" + 0.004*"neural" + 0.004*"change" +

We can also view the overall mean coherence score of the model.

In [63]:
topics_coherences = lda_model.top_topics(bow_corpus, topn=20)
avg_coherence_score = np.mean([item[1] for item in topics_coherences])
print('Avg. Coherence Score:', avg_coherence_score)

Avg. Coherence Score: -0.9858031202745918
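
To build some intuition for why this kind of score is negative, here is a small sketch in the spirit of the UMass coherence measure: for each pair of top terms, it takes the log of the smoothed co-document count divided by the document count of the higher-ranked term, and sums over pairs. The function name `umass_style_coherence` is hypothetical, the sketch assumes every top term occurs in at least one document, and gensim's implementation may differ in details such as smoothing and averaging, so the numbers are not comparable with the output above.

```python
import math
from itertools import combinations

def umass_style_coherence(top_terms, docs):
    """Sum of log((D(later, earlier) + 1) / D(earlier)) over ranked term pairs."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*terms):
        # Number of documents containing all the given terms
        return sum(1 for d in doc_sets if all(t in d for t in terms))

    total = 0.0
    # combinations() preserves ranking order: 'earlier' is the higher-ranked term
    for earlier, later in combinations(top_terms, 2):
        total += math.log((doc_count(later, earlier) + 1) / doc_count(earlier))
    return total

toy_docs = [
    ['neuron', 'cell'],
    ['neuron', 'cell', 'spike'],
    ['neuron'],
    ['image', 'pixel'],
]
coherent = umass_style_coherence(['neuron', 'cell'], toy_docs)
mixed = umass_style_coherence(['neuron', 'image'], toy_docs)
print(coherent, mixed)
```

Terms that tend to share documents score closer to zero, while terms that never co-occur pull the score further below zero.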


Let’s now look at the output of our LDA topic model in an easier-to-understand format.

One way is to visualize the topics as tuples of terms and weights.

In [64]:
topics_with_wts = [item[0] for item in topics_coherences]
print('LDA Topics with Weights')
print('='*50)
for idx, topic in enumerate(topics_with_wts):
    print('Topic #'+str(idx+1)+':')
    print([(term, round(wt, 3)) for wt, term in topic])
    print()

LDA Topics with Weights
Topic #1:
[('vector', 0.007), ('equation', 0.006), ('let', 0.005), ('linear', 0.005), ('distribution', 0.005), ('approximation', 0.005), ('matrix', 0.005), ('theorem', 0.004), ('convergence', 0.004), ('bound', 0.004), ('class', 0.004), ('training', 0.004), ('optimal', 0.004), ('theory', 0.004), ('consider', 0.004), ('solution', 0.004), ('probability', 0.004), ('estimate', 0.004), ('noise', 0.004), ('rate', 0.003)]

Topic #2:
[('training', 0.017), ('classifier', 0.01), ('classification', 0.008), ('class', 0.008), ('pattern', 0.006), ('feature', 0.006), ('test', 0.006), ('training_set', 0.006), ('vector', 0.005), ('prediction', 0.005), ('kernel', 0.004), ('experiment', 0.004), ('trained', 0.004), ('linear', 0.004), ('technique', 0.003), ('rbf', 0.003), ('task', 0.003), ('size', 0.003), ('table', 0.003), ('sample', 0.003)]

Topic #3:
[('neuron', 0.02), ('cell', 0.017), ('response', 0.012), ('stimulus', 0.01), ('spike', 0.007), ('signal', 0.007), ('activity', 0.006)

We can also view the topics as a list of terms without the weights when we want to
understand the context or theme conveyed by each topic.

In [65]:
print('LDA Topics without Weights')
print('='*50)
for idx, topic in enumerate(topics_with_wts):
    print('Topic #'+str(idx+1)+':')
    print([term for wt, term in topic])
    print()

LDA Topics without Weights
Topic #1:
['vector', 'equation', 'let', 'linear', 'distribution', 'approximation', 'matrix', 'theorem', 'convergence', 'bound', 'class', 'training', 'optimal', 'theory', 'consider', 'solution', 'probability', 'estimate', 'noise', 'rate']

Topic #2:
['training', 'classifier', 'classification', 'class', 'pattern', 'feature', 'test', 'training_set', 'vector', 'prediction', 'kernel', 'experiment', 'trained', 'linear', 'technique', 'rbf', 'task', 'size', 'table', 'sample']

Topic #3:
['neuron', 'cell', 'response', 'stimulus', 'spike', 'signal', 'activity', 'synaptic', 'firing', 'frequency', 'pattern', 'current', 'effect', 'neural', 'change', 'et_al', 'channel', 'synapsis', 'motion', 'unit']

Topic #4:
['unit', 'state', 'training', 'rule', 'net', 'word', 'pattern', 'sequence', 'node', 'layer', 'hidden_unit', 'activation', 'architecture', 'recurrent', 'recognition', 'task', 'vector', 'trained', 'context', 'connection']

Topic #5:
['circuit', 'chip', 'neuron', 'analo

In [66]:
cv_coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, corpus=bow_corpus, 
                                                      texts=norm_corpus_bigrams,
                                                      dictionary=dictionary, 
                                                      coherence='c_v')
avg_coherence_cv = cv_coherence_model_lda.get_coherence()

umass_coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, corpus=bow_corpus, 
                                                         texts=norm_corpus_bigrams,
                                                         dictionary=dictionary, 
                                                         coherence='u_mass')
avg_coherence_umass = umass_coherence_model_lda.get_coherence()

perplexity = lda_model.log_perplexity(bow_corpus)

print('Avg. Coherence Score (Cv):', avg_coherence_cv)
print('Avg. Coherence Score (UMass):', avg_coherence_umass)
print('Model Perplexity:', perplexity)

Avg. Coherence Score (Cv): 0.4930044902277785
Avg. Coherence Score (UMass): -0.9858031202745918
Model Perplexity: -7.787864245063152
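
The value labelled "Model Perplexity" is negative because `log_perplexity` returns a per-word likelihood bound rather than a perplexity. As far as we understand gensim's own logging, the corresponding perplexity estimate is 2 raised to the negative of that bound. A minimal sketch of the conversion, using the value printed above (the variable names are ours):

```python
import math

# Per-word bound as printed above by lda_model.log_perplexity(bow_corpus)
per_word_bound = -7.787864245063152

# Assumed conversion: perplexity = 2 ** (-bound)
perplexity_estimate = 2 ** (-per_word_bound)
print('Perplexity estimate:', round(perplexity_estimate, 1))
```

Lower perplexity is generally better, which corresponds to a per-word bound closer to zero.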
