Homework 8 <br>
Sam Odle

# LDA Model

In [3]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [4]:
import io
import os.path
import re
import tarfile

import smart_open

def extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):
    with smart_open.open(url, "rb") as file:
        with tarfile.open(fileobj=file) as tar:
            for member in tar.getmembers():
                if member.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', member.name):
                    member_bytes = tar.extractfile(member).read()
                    yield member_bytes.decode('utf-8', errors='replace')

docs = list(extract_documents())

print(len(docs))
print(docs[0][:500])

1740
387 
Neural Net and Traditional Classifiers  
William Y. Huang and Richard P. Lippmann 
MIT Lincoln Laboratory 
Lexington, MA 02173, USA 
Abstract
Previous work on nets with continuous-valued inputs led to generative 
procedures to construct convex decision regions with two-layer percepttons (one hidden 
layer) and arbitrary decision regions with three-layer percepttons (two hidden layers). 
Here we demonstrate that two-layer perceptton classifiers trained with back propagation 
can form both c


In [5]:
# Tokenize the documents.
from nltk.tokenize import RegexpTokenizer

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    #docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]


In [8]:
import nltk
nltk.download('wordnet')
# Lemmatize the documents.
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]


[nltk_data] Downloading package wordnet to /Users/samodle/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [9]:
# Compute bigrams.
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)

2021-11-10 21:41:52,306 : INFO : collecting all words and their counts
2021-11-10 21:41:52,306 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2021-11-10 21:41:57,264 : INFO : collected 1275394 token types (unigram + bigrams) from a corpus of 4629808 words and 1740 sentences
2021-11-10 21:41:57,265 : INFO : merged Phrases<1275394 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000>
2021-11-10 21:41:57,265 : INFO : Phrases lifecycle event {'msg': 'built Phrases<1275394 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000> in 4.96s', 'datetime': '2021-11-10T21:41:57.265898', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  4 2020, 02:22:02) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'created'}
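
The log above shows Phrases built with min_count=20 and threshold=10.0. As a rough illustration of how those two settings interact, the sketch below reimplements what appears to be gensim's default scoring formula, (count(a b) - min_count) * vocab_size / (count(a) * count(b)), in plain Python. The helper name `bigram_score` and every count in the example are made up for illustration; none of them come from the NIPS corpus.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count=20):
    """Score a candidate bigram "a b" (hypothetical helper, not a gensim function).

    Pairs that co-occur far more often than their individual frequencies
    suggest get a high score; pairs seen fewer than min_count times score
    below zero.
    """
    denom = count_a * count_b
    if denom == 0:
        return float("-inf")
    return (count_ab - min_count) * vocab_size / denom


# Made-up counts for illustration only.
vocab_size = 5000
threshold = 10.0

# A pair whose words mostly occur together.
strong = bigram_score(count_a=50, count_b=40, count_ab=35, vocab_size=vocab_size)
# A pair of frequent words that are rarely adjacent.
weak = bigram_score(count_a=900, count_b=300, count_ab=25, vocab_size=vocab_size)

print(strong > threshold, weak > threshold)
```

Under these assumptions only the first pair would be joined with an underscore and appended to the document.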


In [11]:
# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

2021-11-11 09:21:17,418 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-11-11 09:21:19,529 : INFO : built Dictionary(101581 unique tokens: ['1OOOOO', '1st', '25OO', '2O00', '4OOO']...) from 1740 documents (total 4979481 corpus positions)
2021-11-11 09:21:19,530 : INFO : Dictionary lifecycle event {'msg': "built Dictionary(101581 unique tokens: ['1OOOOO', '1st', '25OO', '2O00', '4OOO']...) from 1740 documents (total 4979481 corpus positions)", 'datetime': '2021-11-11T09:21:19.530032', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  4 2020, 02:22:02) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'created'}
2021-11-11 09:21:19,623 : INFO : discarding 91331 tokens: [('1OOOOO', 1), ('25OO', 2), ('2O00', 4), ('4OOO', 2), ('5OO', 17), ('5oo', 4), ('64K', 5), ('ALTERNATIVE', 5), ('ASSOCIATIVE', 17), ('Abstract', 1447)]...
2021-11-11 09:21:19,624 : INFO : keeping 10250 tokens which were in no less than 20 and no more than 870 (=50.0%) documents

Number of unique tokens: 10250
Number of documents: 1740
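
To make the filtering rule concrete, here is a minimal plain-Python sketch of a document-frequency filter like the one the log describes ("no less than 20 and no more than 870 (=50.0%) documents"). It is a simplification: the function name `filter_by_document_frequency` and the toy corpus are invented for illustration, and gensim's filter_extremes also applies a keep_n cap that is ignored here.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below=20, no_above=0.5):
    """Return the set of tokens kept by a document-frequency filter.

    Hypothetical helper: keeps tokens that appear in at least `no_below`
    documents and in at most `no_above` (a fraction) of all documents.
    """
    # Count each token once per document, however often it repeats there.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


# Toy corpus of four tokenized documents (made up for illustration).
toy_docs = [
    ["neural", "network", "the"],
    ["neural", "cell", "the"],
    ["image", "cell", "the"],
    ["image", "rare", "the"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

In the toy corpus, "the" is dropped for appearing in every document, while "network" and "rare" are dropped for appearing in only one.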


In [14]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 8
chunksize = 2000
passes = 30
iterations = 400
eval_every = 1  # Evaluate perplexity after every update; set to None to skip it (evaluation is slow).

# Make an index-to-word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

2021-11-11 10:02:36,986 : INFO : using autotuned alpha, starting with [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125]
2021-11-11 10:02:36,989 : INFO : using serial LDA version on this node
2021-11-11 10:02:36,999 : INFO : running online (multi-pass) LDA training, 8 topics, 30 passes over the supplied corpus of 1740 documents, updating model once every 1740 documents, evaluating perplexity every 1740 documents, iterating 400x with a convergence threshold of 0.001000
2021-11-11 10:02:48,197 : INFO : -9.781 per-word bound, 879.8 perplexity estimate based on a held-out corpus of 1740 documents with 2135113 words
2021-11-11 10:02:48,198 : INFO : PROGRESS: pass 0, at document #1740/1740
2021-11-11 10:02:57,180 : INFO : optimized alpha [0.0709496, 0.09195595, 0.12683836, 0.14087811, 0.08282176, 0.078641735, 0.12634164, 0.074467935]
2021-11-11 10:02:57,185 : INFO : topic #0 (0.071): 0.004*"image" + 0.003*"class" + 0.002*"noise" + 0.002*"prediction" + 0.002*"classifier" + 0.002*"optima
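
The training log reports a per-word bound of -9.781 next to a perplexity estimate of 879.8. The two numbers appear to be related by perplexity = 2^(-bound). The short check below uses only the bound printed in the log; the variable names are illustrative.

```python
# Per-word likelihood bound copied from the training log above.
per_word_bound = -9.781

# Perplexity is 2 raised to the negative per-word bound, so a bound
# closer to zero means a lower (better) perplexity.
perplexity = 2 ** (-per_word_bound)

print(f"{perplexity:.1f}")
```

Because the logged bound is rounded to three decimals, the result should only be expected to match the logged estimate approximately.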

In [15]:
top_topics = model.top_topics(corpus)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

2021-11-11 10:23:22,029 : INFO : CorpusAccumulator accumulated stats from 1000 documents


Average topic coherence: -1.1390.
[([(0.018553942, 'image'),
   (0.008650569, 'object'),
   (0.0078098476, 'representation'),
   (0.0055411872, 'layer'),
   (0.0051581096, 'hidden'),
   (0.0050204527, 'recognition'),
   (0.004121713, 'pixel'),
   (0.004076152, 'distance'),
   (0.0035091918, 'face'),
   (0.0032248527, 'hidden_unit'),
   (0.003061289, 'map'),
   (0.0030149561, 'net'),
   (0.003007786, 'view'),
   (0.0027138127, 'node'),
   (0.002429571, 'position'),
   (0.0023849183, 'human'),
   (0.0022844067, 'part'),
   (0.0022767077, 'trained'),
   (0.0022683963, 'region'),
   (0.0021967283, 'dimensional')],
  -0.9766582047815727),
 ([(0.01575232, 'cell'),
   (0.008879799, 'response'),
   (0.008612073, 'stimulus'),
   (0.008312414, 'visual'),
   (0.006856329, 'neuron'),
   (0.0065025156, 'field'),
   (0.0058385157, 'motion'),
   (0.0056922724, 'activity'),
   (0.005299649, 'direction'),
   (0.0046893796, 'cortex'),
   (0.004329598, 'orientation'),
   (0.0042747026, 'eye'),
   (0.0041

### Paragraph on Next Steps

An interesting next project for this type of model would be detecting bias in sports commentary. Anecdotally, commentators of American football have used different phrases to describe athletes and coaches of different races. For example, a black coach is more likely to be called a "player's coach" than an "Xs and Os coach". Similarly, a white wide receiver is more likely to be associated with reliability and consistent route running, while a black wide receiver is more likely to be associated with athleticism. Similar examples are found all over the field, and a few are discussed in an NFL.com article (https://www.nfl.com/news/sidelines/does-race-remain-a-factor-in-the-evaluation-of-nfl-quarterbacks). I wonder whether you could model the topics discussed on panel-style analysis shows and compare how the topics change as the athlete's race changes. A quick Google search suggests there may be substantial up-front work to establish such a corpus. <br>
I'd also like to experiment with different data preparation and NLP techniques, such as using a stemmer instead of a lemmatizer and removing fewer or different stopwords, to see how those changes alter the output. Removing words that appear in more than 50% of the documents, as the tutorial did, seems like it may discard valuable information. <br>
I played around a little with the eval_every parameter, as mentioned in the tutorial, and saw how the training log lets you check whether most of the documents have converged. Tuning this type of model with a grid search, as we did for the classifiers, would be interesting too.
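
As a starting point for the grid-search idea, here is a minimal sketch of a search loop over LDA settings. It is an assumption-heavy outline and not something run for this homework: `grid_search` and `toy_score` are hypothetical names, and the scorer is a stand-in so the loop runs without training a model.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every parameter combination; return (best_params, best_score, results).

    `score_fn` takes a dict of parameters and returns a number where higher
    is better, e.g. the average topic coherence of a model trained with them.
    """
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, score_fn(params)))
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results


def toy_score(params):
    """Stand-in scorer (made up) so the sketch runs without training anything."""
    return -abs(params["num_topics"] - 10) - 0.01 * params["passes"]


grid = {"num_topics": [5, 8, 10, 15], "passes": [10, 30]}
best_params, best_score, results = grid_search(grid, toy_score)
print(best_params, best_score)
```

With the notebook's objects, the scorer could instead train LdaModel(corpus=corpus, id2word=id2word, **params) and return the average of the coherence values from top_topics(corpus). Each fit took several minutes in the run above, so a small grid is probably the practical choice.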