<a href="https://colab.research.google.com/github/todnewman/coe_training/blob/master/Topic_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling Example - NeurIPS Paper Topic Modeling

**Author**: W. Tod Newman

**Updates**: Updated data input from Brown Corpus to NIPS Papers

## Learning Objectives

*   Learn how to import data and format it using NLTK
*   Use scikit-learn's CountVectorizer and LDA algorithms to do topic modeling
*   Demonstrate LDA visualization using pyLDAvis

In this exercise we're going to bring in data from a corpus of Neural Information Processing Systems (NeurIPS) papers and uncover a specified number of topics embedded in this set of documents.

NOTE:  The corpus contains 1740 documents, and not particularly long ones. So keep in mind that this tutorial is not geared towards efficiency, and be careful before applying the code to a large dataset.

In [None]:
!pip install pyLDAvis
!pip install --upgrade pandas==1.3.1
#!pip install --upgrade pandas==1.2
import pandas as pd
import io
import os.path
import re
import tarfile

import smart_open
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


True

## First download the NeurIPS papers

Right now these are available from Dr. Sam Roweis' website at NYU.  The collection holds papers from NeurIPS conferences 1-12 that were OCR'ed by Yann LeCun and organized by Sam Roweis.

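We use smart_open (https://pypi.org/project/smart-open/) to fetch the file from the NYU server. Unlike Python's built-in 'open()', it can read directly from a URL. The archive is copied to local storage in buffer-sized chunks and then read with 'tarfile'.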
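The chunked copy used for the download can be sketched without any network access. This is a minimal illustration, not the notebook's download code: `copy_in_chunks` is a hypothetical helper name, and the `io.BytesIO` objects are in-memory stand-ins for the remote file and the local file.

```python
import io

def copy_in_chunks(fin, fout, chunk_size=io.DEFAULT_BUFFER_SIZE):
    """Copy a binary stream to another stream one chunk at a time.

    Hypothetical helper for illustration; returns the number of bytes copied.
    """
    total = 0
    while True:
        buf = fin.read(chunk_size)
        if not buf:  # an empty bytes object signals the end of the stream
            break
        fout.write(buf)
        total += len(buf)
    return total

# In-memory stand-ins for the remote archive and the local file.
source = io.BytesIO(b"x" * 20000)
target = io.BytesIO()
copied = copy_in_chunks(source, target)
print(copied)  # 20000
```

Reading a fixed-size buffer at a time keeps memory use flat regardless of how large the archive is.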

Then we lowercase the documents and use the NLTK RegexpTokenizer to split them into tokens, where each token is a run of word characters (the pattern `\w+`).
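To see what the `\w+` pattern does to a sentence, here is a small sketch that uses only the standard library. It assumes that `RegexpTokenizer` in its default mode returns the pattern's matches, which behaves like `re.findall`; the `tokenize` function and the sample sentence are made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return every run of word characters."""
    return re.findall(r'\w+', text.lower())

tokens = tokenize("Back-propagation, in 1987: the network's error was 0.05!")
print(tokens)
# ['back', 'propagation', 'in', '1987', 'the', 'network', 's', 'error', 'was', '0', '05']
```

Note that hyphens, apostrophes, and decimal points all split tokens, which is why stray single characters and bare numbers appear in the output.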

Finally we remove purely numeric and single-character tokens and lemmatize the remaining words to improve the effectiveness of the LDA topic modeling algorithm.
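The two filtering steps can be sketched on a toy token list. This is only an illustration: `clean_tokens` is a hypothetical helper, the sample tokens are invented, and lemmatization is left out because it needs the WordNet data.

```python
def clean_tokens(tokens):
    """Drop purely numeric tokens, then drop single-character tokens."""
    kept = [t for t in tokens if not t.isnumeric()]
    return [t for t in kept if len(t) > 1]

sample = ['back', 'propagation', 'in', '1987', 'the', 'network',
          's', 'error', 'was', '0', '05', 'nips12']
print(clean_tokens(sample))
# ['back', 'propagation', 'in', 'the', 'network', 'error', 'was', 'nips12']
```

A token such as `nips12` survives because `isnumeric()` is only true when every character is a digit.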

TIME:  This normally takes about 25 seconds.

In [None]:
%%time

def extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):
    fname = url.split('/')[-1]

    # Download the file to local storage first.
    # We can't read it on the fly because of
    # https://github.com/RaRe-Technologies/smart_open/issues/331
    if not os.path.isfile(fname):
        with smart_open.open(url, "rb") as fin:
            with smart_open.open(fname, 'wb') as fout:
                while True:
                    buf = fin.read(io.DEFAULT_BUFFER_SIZE)
                    if not buf:
                        break
                    fout.write(buf)

    with tarfile.open(fname, mode='r:gz') as tar:
        # Ignore directory entries, as well as files like README, etc.
        files = [
            m for m in tar.getmembers()
            if m.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', m.name)
        ]
        for member in sorted(files, key=lambda x: x.name):
            member_bytes = tar.extractfile(member).read()
            yield member_bytes.decode('utf-8', errors='replace')

docs = list(extract_documents())

# Tokenize the documents.
from nltk.tokenize import RegexpTokenizer

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]

# Lemmatize the documents.
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

CPU times: user 27.6 s, sys: 358 ms, total: 27.9 s
Wall time: 28 s


In [None]:
%%time
# Join each document's tokens back into a single string,
# which is the input format CountVectorizer expects.
data = [' '.join(tokens) for tokens in docs]

CPU times: user 193 ms, sys: 0 ns, total: 193 ms
Wall time: 192 ms


## Count Vectorization and Latent Dirichlet Allocation

First we vectorize the data using scikit-learn's CountVectorizer (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

CountVectorizer implements both tokenization and occurrence counting in a single class.  
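To make the counting step concrete, here is a plain-Python sketch of the idea. It is not scikit-learn's implementation: `count_vectorize` is a hypothetical function, it splits on whitespace only, and it ignores stop words and token patterns. It does mimic the meaning of an integer `min_df` (minimum number of documents) and a fractional `max_df` (maximum share of documents).

```python
from collections import Counter

def count_vectorize(docs, min_df=1, max_df=1.0):
    """Return (vocabulary, matrix) of raw term counts for whitespace-split docs.

    min_df is an absolute document count; max_df is a fraction of all documents.
    """
    tokenized = [doc.split() for doc in docs]
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vocab = sorted(
        term for term, df in doc_freq.items()
        if min_df <= df <= max_df * n_docs
    )
    index = {term: i for i, term in enumerate(vocab)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for term in tokens:
            if term in index:
                row[index[term]] += 1
        matrix.append(row)
    return vocab, matrix

toy_docs = [
    "model neural network learning",
    "model network training network",
    "model bayesian learning",
    "model kernel learning",
]
vocab, matrix = count_vectorize(toy_docs, min_df=2, max_df=0.75)
print(vocab)   # ['learning', 'network']
print(matrix)  # [[1, 1], [0, 2], [1, 0], [1, 0]]
```

Here `model` is dropped for appearing in every document, and the terms that appear in only one document are dropped by `min_df`.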

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. It is also a topic model: it is used to discover abstract topics in a collection of documents.

Here's a great description of LDA from the scikit-learn user guide:  https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation
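The "generative" part of LDA can be illustrated by sampling a tiny synthetic corpus. The sketch below is an assumption-laden toy, not how scikit-learn fits the model: the function names, the vocabulary, and the prior values `alpha` and `eta` are all chosen for illustration. Each topic is a distribution over words, each document is a mixture of topics, and each word is drawn by first picking a topic and then picking a word from that topic.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha=0.5, eta=0.5, seed=0):
    """Sample topics and documents following the LDA generative story."""
    rng = random.Random(seed)
    # One word distribution per topic.
    topics = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # One topic mixture per document.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]  # pick a topic
            w = rng.choices(vocab, weights=topics[z])[0]        # pick a word from it
            words.append(w)
        corpus.append(words)
    return topics, corpus

toy_vocab = ['network', 'neuron', 'bayesian', 'prior', 'kernel', 'margin']
topics, corpus = generate_corpus(toy_vocab, n_topics=2, n_docs=3, doc_len=8)
for doc in corpus:
    print(' '.join(doc))
```

Fitting LDA runs this story in reverse: given only the documents, it estimates the topic-word distributions and the per-document topic mixtures.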

TIME:  This block runs in roughly 30 to 60 seconds (the run recorded below took about 55 seconds).


In [None]:
%%time
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
 
NUM_TOPICS = 10
 
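# Keep terms that appear in at least 20 documents and in at most 50% of them.
# lowercase=False because the text was already lowercased during tokenization.
# The raw string avoids an invalid-escape warning for '\-'.
vectorizer = CountVectorizer(min_df=20, max_df=0.5,
                             stop_words='english', lowercase=False,
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')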

data_vectorized = vectorizer.fit_transform(data)
 
# Build a Latent Dirichlet Allocation Model
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=20, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)

CPU times: user 55 s, sys: 2.36 s, total: 57.4 s
Wall time: 54.9 s
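Before moving to the visualization, it can help to look at the highest-weighted words in each topic. The sketch below is a hypothetical helper, `top_words_per_topic`, shown on toy data so that it runs on its own. With the fitted model, we would expect to pass `lda_model.components_` and `vectorizer.get_feature_names_out()` (or `get_feature_names()` on scikit-learn versions before 1.0).

```python
def top_words_per_topic(components, feature_names, n_top=3):
    """For each topic row, return the n_top terms with the largest weights."""
    result = []
    for row in components:
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result.append([feature_names[i] for i in ranked[:n_top]])
    return result

# Toy stand-ins for a fitted model's topic-word weights and vocabulary.
toy_components = [
    [5.0, 0.1, 3.0, 0.2],
    [0.3, 4.0, 0.1, 6.0],
]
toy_features = ['network', 'bayesian', 'neuron', 'prior']
top = top_words_per_topic(toy_components, toy_features, n_top=2)
print(top)  # [['network', 'neuron'], ['prior', 'bayesian']]
```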


In [None]:
print(pd.__version__)

1.3.1


## Visualization using pyLDAvis

**pyLDAvis** is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

To read about the methodology behind pyLDAvis, see the original paper, "LDAvis: A method for visualizing and interpreting topics" by Carson Sievert and Kenneth Shirley, which was presented at the 2014 ACL Workshop on Interactive Language Learning, Visualization, and Interfaces in Baltimore on June 27, 2014.

In [None]:
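import pyLDAvis
import pyLDAvis.sklearn  # newer pyLDAvis releases rename this module to pyLDAvis.lda_model
pyLDAvis.enable_notebook()

# Project the topics to 2-D with t-SNE and build the interactive panel.
panel = pyLDAvis.sklearn.prepare(lda_model, data_vectorized, vectorizer, mds='tsne')
panel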

