In [42]:
import os 
import numpy as np
import pandas as pd
import nltk

# Data Retrieval

In [30]:
import urllib.request

url = 'https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'
filename = 'nips12raw_str602'
urllib.request.urlretrieve(url, filename)

('nips12raw_str602', <http.client.HTTPMessage at 0x1c411e8c6a0>)

In [31]:
!tar -xzf nips12raw_str602

In [33]:
DATA_PATH = 'nipstxt/'
print(os.listdir(DATA_PATH))

['idx', 'MATLAB_NOTES', 'nips00', 'nips01', 'nips02', 'nips03', 'nips04', 'nips05', 'nips06', 'nips07', 'nips08', 'nips09', 'nips10', 'nips11', 'nips12', 'orig', 'RAW_DATA_NOTES', 'README_yann']


# Load and View Dataset

In [34]:
folders = ['nips{0:02}'.format(i) for i in range(0, 13)]
# Read all texts into a list
papers = []
for folder in folders:
    file_names = os.listdir(DATA_PATH + folder)
    for file_name in file_names:
        # Open read-only; skip bytes that cannot be decoded as UTF-8
        with open(os.path.join(DATA_PATH, folder, file_name), encoding='utf-8', errors='ignore') as f:
            data = f.read()
        papers.append(data)
len(papers)        

1740

The corpus contains 1,740 papers. However, it looks like the OCR hasn’t worked perfectly,
and some characters are missing here and there. This is expected, but it also makes the task more
challenging!

In [41]:
print(papers[0][:1000])

1 
CONNECTIVITY VERSUS ENTROPY 
Yaser S. Abu-Mostafa 
California Institute of Technology 
Pasadena, CA 91125 
ABSTRACT 
How does the connectivity of a neural network (number of synapses per 
neuron) relate to the complexity of the problems it can handle (measured by 
the entropy)? Switching theory would suggest no relation at all, since all Boolean 
functions can be implemented using a circuit with very low connectivity (e.g., 
using two-input NAND gates). However, for a network that learns a problem 
from examples using a local learning rule, we prove that the entropy of the 
problem becomes a lower bound for the connectivity of the network. 
INTRODUCTION 
The most distinguishing feature of neural networks is their ability to spon- 
taneously learn the desired function from 'training' samples, i.e., their ability 
to program themselves. Clearly, a given neural network cannot just learn any 
function, there must be some restrictions on which networks can learn which 
functions. One obv

# Basic Text Wrangling

In [44]:
stop_words = nltk.corpus.stopwords.words('english')
wtk = nltk.tokenize.RegexpTokenizer(r'\w+')  # match runs of word characters
wnl = nltk.stem.wordnet.WordNetLemmatizer()

def normalize_corpus(papers):
    norm_papers = []
    for paper in papers:
        paper = paper.lower()
        # Word tokenization
        paper_tokens = [token.strip() for token in wtk.tokenize(paper)]
        # Lemmatize, dropping purely numeric tokens
        paper_tokens = [wnl.lemmatize(token) for token in paper_tokens if not token.isnumeric()]
        # Drop single-character tokens and stop words
        paper_tokens = [token for token in paper_tokens if len(token) > 1]
        paper_tokens = [token for token in paper_tokens if token not in stop_words]
        paper_tokens = list(filter(None, paper_tokens))
        if paper_tokens:
            norm_papers.append(paper_tokens)
            
    return norm_papers

In [45]:
norm_papers = normalize_corpus(papers)
print(len(norm_papers))

1740


In [46]:
# Viewing a processed paper
print(norm_papers[0][:50])

['connectivity', 'versus', 'entropy', 'yaser', 'abu', 'mostafa', 'california', 'institute', 'technology', 'pasadena', 'ca', 'abstract', 'doe', 'connectivity', 'neural', 'network', 'number', 'synapsis', 'per', 'neuron', 'relate', 'complexity', 'problem', 'handle', 'measured', 'entropy', 'switching', 'theory', 'would', 'suggest', 'relation', 'since', 'boolean', 'function', 'implemented', 'using', 'circuit', 'low', 'connectivity', 'using', 'two', 'input', 'nand', 'gate', 'however', 'network', 'learns', 'problem', 'example', 'using']


We are now ready to start building topic models.

First, we generate phrases from influential bi-grams and remove some terms
that may not be useful, before moving on to feature engineering.

# Text Representation with Feature Engineering

Before feature engineering and vectorization, we want to extract some useful bi-gram
based phrases from our research papers and remove some unnecessary terms.

In [48]:
import gensim

# A higher threshold yields fewer phrases. Note: gensim 4.x expects a str delimiter ('_') instead of bytes.
bigram = gensim.models.Phrases(norm_papers, min_count=20, threshold=20, delimiter=b'_')
bigram_model = gensim.models.phrases.Phraser(bigram)

print(bigram_model[norm_papers[0]][:50])

['connectivity', 'versus', 'entropy', 'yaser', 'abu_mostafa', 'california_institute', 'technology_pasadena', 'ca_abstract', 'doe', 'connectivity', 'neural_network', 'number', 'synapsis', 'per', 'neuron', 'relate', 'complexity', 'problem', 'handle', 'measured', 'entropy', 'switching', 'theory', 'would', 'suggest', 'relation', 'since', 'boolean_function', 'implemented', 'using', 'circuit', 'low', 'connectivity', 'using', 'two', 'input', 'nand', 'gate', 'however', 'network', 'learns', 'problem', 'example', 'using', 'local', 'learning', 'rule', 'prove', 'entropy', 'problem']
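
To build some intuition for the `min_count` and `threshold` arguments, here is a minimal pure-Python sketch of a collocation score of the form `(count(a, b) - min_count) * N / (count(a) * count(b))`. It illustrates the idea under simplifying assumptions and is not gensim's implementation: `score_bigrams` and the toy documents are hypothetical, and `N` here counts unique single tokens only.

```python
from collections import Counter

def score_bigrams(docs, min_count=1, threshold=1.0):
    """Return adjacent token pairs whose collocation score exceeds threshold."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # assumption: unique single tokens only
    scores = {}
    for (a, b), n_ab in bigrams.items():
        # Pairs seen only min_count times score 0; frequent pairs of rare words score high.
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores

# Hypothetical toy corpus of already-normalized documents
toy_docs = [
    ['neural', 'network', 'learns', 'rule'],
    ['neural', 'network', 'model'],
    ['network', 'model', 'learns'],
    ['neural', 'network', 'rule'],
]

phrase_scores = score_bigrams(toy_docs, min_count=1, threshold=0.7)
print(phrase_scores)
```

With these settings only the pair `('neural', 'network')` clears the threshold. Raising `threshold` or `min_count` makes the filter stricter, which is why higher values produce fewer phrases.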


Let’s generate phrases for all our tokenized research papers and build a vocabulary
that maps each unique term or phrase to a number.

In [49]:
norm_corpus_bigrams = [bigram_model[doc] for doc in norm_papers]

# Create a dictionary representation of the documents:
dictionary = gensim.corpora.Dictionary(norm_corpus_bigrams)
print('Sample word to number mappings: ', list(dictionary.items())[:15])
print('Total Vocabulary Size: ', len(dictionary))

Sample word to number mappings:  [(0, '0a'), (1, '2h'), (2, '2h2'), (3, '2he'), (4, '2n'), (5, '__c'), (6, '_c'), (7, '_k'), (8, 'a2'), (9, 'ability'), (10, 'abu_mostafa'), (11, 'access'), (12, 'accommodate'), (13, 'according'), (14, 'accumulated')]
Total Vocabulary Size:  78892


Based on the preceding output, we have a lot of unique phrases in our corpus of research papers.
Several of these terms are not very useful, since they are specific to a single paper or even
a single paragraph, and some, such as '2h2' and '_c', look like OCR fragments.

Hence, it is time to prune our vocabulary and start removing terms. Leveraging document
frequency is a great way to achieve this.

In [50]:
# Filter out terms that occur in fewer than 20 documents or in more than 60% of the documents
dictionary.filter_extremes(no_below=20, no_above=0.6)
print('Total Vocabulary Size: ', len(dictionary))

Total Vocabulary Size:  7756


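Pruning reduces the vocabulary from 78,892 to 7,756 terms. Because we are interested in
themes that distinguish papers from one another, not in terms that recur across most of
the corpus, this filter suits our scenario well.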
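
The document-frequency pruning described above can be sketched in plain Python. This is a simplified illustration of the idea behind `no_below` and `no_above`, not gensim's implementation; `filter_by_document_frequency` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least no_below docs and at most no_above (a fraction) of docs."""
    n_docs = len(docs)
    # Count each token once per document, however often it repeats inside it
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * n_docs
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Hypothetical toy corpus: 'network' is everywhere, 'abu_mostafa' appears once
toy_docs = [
    ['network', 'spike', 'abu_mostafa'],
    ['network', 'spike'],
    ['network', 'policy'],
    ['network', 'policy'],
    ['network'],
]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.6)
print(sorted(kept))
```

Here `'network'` is dropped for being too common and `'abu_mostafa'` for being too rare, leaving the mid-frequency terms that are most likely to separate one theme from another.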

**We can now perform feature engineering by leveraging a simple Bag of Words
model.**

In [51]:
# Transforming corpus into bag of words vectors
bow_corpus = [dictionary.doc2bow(text) for text in norm_corpus_bigrams]
print(bow_corpus[1][:50])

[(3, 1), (12, 3), (14, 1), (15, 1), (16, 1), (17, 16), (20, 1), (24, 1), (26, 1), (31, 3), (35, 1), (36, 1), (40, 3), (41, 5), (42, 1), (48, 1), (53, 3), (55, 1), (56, 2), (58, 1), (60, 3), (63, 5), (64, 4), (65, 2), (73, 1), (74, 1), (75, 1), (76, 1), (77, 3), (82, 1), (83, 4), (84, 1), (85, 1), (86, 2), (94, 1), (96, 2), (97, 3), (106, 1), (110, 1), (119, 2), (120, 4), (121, 2), (124, 2), (127, 1), (128, 1), (132, 1), (133, 1), (135, 6), (136, 1), (144, 1)]


In [52]:
# Viewing actual terms and their counts
print([(dictionary[idx], freq) for idx, freq in bow_corpus[1][:50]])

[('ability', 1), ('aip', 3), ('although', 1), ('american_institute', 1), ('amount', 1), ('analog', 16), ('appears', 1), ('architecture', 1), ('aspect', 1), ('available', 3), ('become', 1), ('becomes', 1), ('binary', 3), ('biological', 5), ('bit', 1), ('cannot', 1), ('circuit', 3), ('collective', 1), ('compare', 2), ('complex', 1), ('computing', 3), ('conference', 5), ('connected', 4), ('connectivity', 2), ('define', 1), ('defined', 1), ('defines', 1), ('definition', 1), ('denker', 3), ('designed', 1), ('desired', 4), ('diagonal', 1), ('difference', 1), ('directly', 2), ('ed', 1), ('el', 2), ('element', 3), ('equivalent', 1), ('eventually', 1), ('feature', 2), ('final', 4), ('find', 2), ('fixed', 2), ('frequency', 1), ('furthermore', 1), ('generating', 1), ('get', 1), ('global', 6), ('go', 1), ('hence', 1)]
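
For readers who want to see what a Bag of Words vector is without gensim, here is a minimal sketch. The helpers `build_vocabulary` and `to_bow` and the toy documents are hypothetical, and the alphabetical id assignment is an assumption made for readability; gensim assigns ids in its own order.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each unique token to an integer id (alphabetical order, by assumption)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(vocab)}

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens outside the vocabulary."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

toy_docs = [
    ['analog', 'circuit', 'analog'],
    ['circuit', 'neuron', 'signal', 'neuron'],
]

token2id = build_vocabulary(toy_docs)
bow_first = to_bow(toy_docs[0], token2id)
bow_unseen = to_bow(['analog', 'unknown', 'signal'], token2id)
print(bow_first, bow_unseen)
```

Word order is discarded and only counts remain, which is why the representation is sparse: each document stores pairs only for the terms it actually contains.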


In [53]:
# total papers in the corpus
print('Total number of papers: ', len(bow_corpus))

Total number of papers:  1740


**Our documents are now processed and have a good enough representation with the
Bag of Words model to begin modeling.**

# Building LSI

In [54]:
TOTAL_TOPICS = 10
lsi_bow = gensim.models.LsiModel(bow_corpus, id2word=dictionary, num_topics=TOTAL_TOPICS,
                                 onepass=True, chunksize=1740, power_iters=1000)

We can now view the major topics or themes in our corpus, each expressed as a weighted
combination of its 20 most influential terms:

In [56]:
for topic_id, topic in lsi_bow.print_topics(num_topics=10,num_words=20):
    print('Topic #'+str(topic_id+1)+":")
    print(topic)
    print()

Topic #1:
0.215*"unit" + 0.212*"state" + 0.187*"training" + 0.177*"neuron" + 0.162*"pattern" + 0.145*"image" + 0.140*"vector" + 0.125*"feature" + 0.122*"cell" + 0.110*"layer" + 0.101*"task" + 0.097*"class" + 0.091*"probability" + 0.089*"signal" + 0.087*"step" + 0.086*"response" + 0.085*"representation" + 0.083*"noise" + 0.082*"rule" + 0.081*"distribution"

Topic #2:
0.487*"neuron" + 0.396*"cell" + -0.257*"state" + 0.191*"response" + -0.187*"training" + 0.170*"stimulus" + 0.117*"activity" + -0.109*"class" + 0.099*"spike" + 0.097*"pattern" + 0.096*"circuit" + 0.096*"synaptic" + -0.095*"vector" + 0.090*"signal" + 0.090*"firing" + 0.088*"visual" + -0.084*"classifier" + -0.083*"action" + -0.078*"word" + 0.078*"cortical"

Topic #3:
0.627*"state" + -0.395*"image" + 0.219*"neuron" + -0.209*"feature" + 0.188*"action" + -0.137*"unit" + -0.131*"object" + 0.130*"control" + -0.129*"training" + 0.109*"policy" + -0.103*"classifier" + -0.090*"class" + 0.081*"step" + 0.081*"dynamic" + -0.080*"classific

Each topic mixes terms with positive and negative weights. Let’s separate the terms by
sign and try to interpret the topics again.

In [57]:
for n in range(TOTAL_TOPICS):
    print('Topic #'+str(n+1)+':')
    print(' ='*50)
    d1 = []
    d2 = []
    for term, wt in lsi_bow.show_topic(n, topn=20):
        if wt>=0:
            d1.append((term, round(wt, 3)))
        else:
            d2.append((term, round(wt, 3)))
    print('Direction 1: ', d1)
    print('-'*50)
    print('Direction 2: ', d2)
    print('-'*50)
    print()

Topic #1:
 = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
Direction 1:  [('unit', 0.215), ('state', 0.212), ('training', 0.187), ('neuron', 0.177), ('pattern', 0.162), ('image', 0.145), ('vector', 0.14), ('feature', 0.125), ('cell', 0.122), ('layer', 0.11), ('task', 0.101), ('class', 0.097), ('probability', 0.091), ('signal', 0.089), ('step', 0.087), ('response', 0.086), ('representation', 0.085), ('noise', 0.083), ('rule', 0.082), ('distribution', 0.081)]
--------------------------------------------------
Direction 2:  []
--------------------------------------------------

Topic #2:
 = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
Direction 1:  [('neuron', 0.487), ('cell', 0.396), ('response', 0.191), ('stimulus', 0.17), ('activity', 0.117), ('spike', 0.099), ('pattern', 0.097), ('circuit', 0.096), ('synaptic', 0.096), ('signal', 0.09), ('firing', 0.09), ('visual', 0.088), ('cortic

Let’s try to get the three major matrices (U, S, and VT) from our topic model, which
uses SVD.

In [58]:
term_topic = lsi_bow.projection.u 
singular_values = lsi_bow.projection.s
# Convert corpus into a dense numpy array (documents will be columns).
topic_document = (gensim.matutils.corpus2dense(lsi_bow[bow_corpus], len(singular_values)).T / singular_values).T
term_topic.shape, singular_values.shape, topic_document.shape

((7756, 10), (10,), (10, 1740))

These are the shapes of the term-topic matrix (U), the singular values (S), and the topic-document matrix (VT): 7,756 terms, 10 topics, and 1,740 documents.
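
The scaling step in the cell above (projecting the documents onto the topics, then dividing by the singular values) can be checked on a small example. This is a minimal NumPy sketch, assuming a hypothetical 5x4 toy term-document matrix `X` and plain `np.linalg.svd` in place of gensim's incremental algorithm.

```python
import numpy as np

# Hypothetical toy term-document count matrix: 5 terms (rows) x 4 documents (columns)
X = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
    [1, 1, 1, 1],
], dtype=float)

k = 2  # number of topics to keep
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Projecting the documents onto the topics gives diag(s_k) @ Vt_k ...
projected = U_k.T @ X
# ... so dividing each topic row by its singular value recovers Vt_k
recovered_Vt = (projected.T / s_k).T

print(U_k.shape, s_k.shape, Vt_k.shape)
print(np.allclose(recovered_Vt, Vt_k))
```

The three shapes play the same roles as in the notebook: terms by topics, one singular value per topic, and topics by documents.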

We can transpose the topic-document matrix to form a document-topic matrix, which helps us see how strongly each topic is expressed in each document.

**Document-topic matrix from our LSI model**

In [59]:
document_topics = pd.DataFrame(np.round(topic_document.T, 3), columns=['T'+str(i) for i in range(1, TOTAL_TOPICS+1)])
document_topics.head(5)

| | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | T10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.016 | 0.017 | 0.013 | 0.008 | -0.024 | 0.028 | -0.0 | 0.019 | 0.008 | -0.006 |
| 1 | 0.041 | 0.03 | 0.019 | -0.021 | -0.019 | 0.056 | -0.018 | -0.009 | -0.018 | -0.011 |
| 2 | 0.022 | -0.0 | 0.022 | 0.008 | -0.011 | 0.016 | -0.013 | 0.017 | 0.001 | 0.007 |
| 3 | 0.032 | 0.036 | 0.011 | -0.014 | -0.035 | 0.052 | 0.016 | 0.043 | 0.01 | -0.029 |
| 4 | 0.035 | -0.002 | 0.017 | -0.008 | -0.016 | 0.017 | -0.032 | 0.022 | -0.05 | 0.029 |


Ignoring the sign, we can try to find the most important topics for a few sample
papers and see if they make sense.
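
To make the sign-agnostic ranking concrete, here is a small pure-Python sketch that ranks topics by absolute weight. The weights are taken from the first row (document 0) of the document-topic output above; the variable names are illustrative.

```python
topic_names = ['T' + str(i) for i in range(1, 11)]

# Topic weights for document 0, copied from the document-topic matrix output
weights = [0.016, 0.017, 0.013, 0.008, -0.024, 0.028, -0.0, 0.019, 0.008, -0.006]

# Sort by magnitude so that strongly negative weights count as much as positive ones
ranked = sorted(zip(topic_names, weights), key=lambda pair: abs(pair[1]), reverse=True)
top_topics = [name for name, _ in ranked[:3]]

print(top_topics)  # ['T6', 'T5', 'T8']
```

Note that T5 ranks second even though its weight is negative: in LSI the sign only indicates a direction along the topic axis, so the magnitude is what signals relevance.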

In [60]:
document_numbers = [13, 250, 500]

for document_number in document_numbers:
    top_topics = list(document_topics.columns[np.argsort(-np.absolute(document_topics.iloc[document_number].values))[:3]])
    print('Document #'+str(document_number)+':')
    print('Dominant Topics (top 3):', top_topics)
    print('Paper Summary:')
    print()
    print(papers[document_number][:500])
    print()

Document #13:
Dominant Topics (top 3): ['T3', 'T8', 'T9']
Paper Summary:

137 
On the 
Power of Neural Networks for 
Solving Hard Problems 
Jehoshua Bruck 
Joseph W. Goodman 
Information Systems Laboratory 
Department of Electrical Engineering 
Stanford University 
Stanford, CA 94305 
Abstract 
This paper deals with a neural network model in which each neuron 
performs a threshold logic function. An important property of the model 
is that it always converges to a stable state when operating in a serial 
mode [2,5]. This property is the basis of the potential applicat

Document #250:
Dominant Topics (top 3): ['T9', 'T8', 'T1']
Paper Summary:

542 Kassebaum, Tenorio and Schaefers 
The Cocktail Party Problem: 
Speech/Data Signal Separation Comparison 
between Backpropagation and SONN 
John Kassebaum 
jakec.ecn.purdue.edu 
Manoel Fernando Tenorio 
tenorioee.ecn.purdue.edu 
Chrlstoph Schaefers 
Parallel Distributed Structures Laboratory 
School of Electrical Engineering 
Purdue Unive

Clearly, SVD, the decomposition underlying LSI, is a very powerful
mathematical operation.