# Topic Modelling on NeurIPS Papers

In this notebook, we will explore a dataset of more than 9,000 documents: papers from <br>
the Conference and Workshop on Neural Information Processing Systems (abbreviated as NeurIPS, formerly NIPS), <br>
a machine learning and computational neuroscience conference. <br>
<br>
We will run LDA topic modelling on these papers and explore the resulting groups in an interactive manner.

Table of Contents
* Environment Setup
* Load and Preprocess Data
* Word Cloud
* LDA Topic Modelling
* Result Visualisation
* Further Study

<a id="#section-1"></a>
# Environment Setup

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a id="#section-2"></a>
# Load and Preprocess Data

In [None]:
df = pd.read_csv('../input/nips-papers-1987-2019-updated/papers.csv')
df

A quick look at missing data. <br>
In this notebook we will focus on topic modelling of the `full_text` column only, which has no missing data. <br>
Good to go.

In [None]:
import seaborn as sns

sns.heatmap(df.isna())
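
The heatmap gives a visual impression; a per-column count makes the same check explicit. Below is a minimal sketch on a made-up three-row frame. The column names other than `full_text` are hypothetical and the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the papers table; values are invented for illustration.
toy = pd.DataFrame({
    'title': ['Paper A', 'Paper B', 'Paper C'],
    'abstract': ['An abstract.', None, None],
    'full_text': ['Some text.', 'More text.', 'Even more text.'],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On the real data, `df.isna().sum()` would give the same kind of summary.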

Preprocess the `full_text` column with a series of filter functions. <br>
Each function's name describes what it does, so the steps are not repeated here. <br>
Lemmatization is used instead of the Porter stemmer to preserve more meaningful full words from the documents. <br>
Nouns are used as the part of speech in lemmatization.

In [None]:
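%%time
# 2min 30s in the original run (word-by-word lemmatization will add to this)
import nltk
from gensim.parsing.preprocessing import strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short, preprocess_string

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()

def lemmatize_words(text):
    # preprocess_string hands each filter the whole document as one string,
    # so split it and lemmatize word by word (default part of speech: noun).
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())

df['full_text'] = df['full_text'].astype('str')
df['full_text_tokenized'] = df['full_text'].apply(lambda text: preprocess_string(text, [
    lambda x: x.lower(),  # lowercase first so capitalised stopwords are removed too
    strip_tags,
    strip_punctuation,
    strip_multiple_whitespaces,
    strip_numeric,
    remove_stopwords,
    strip_short,
    lemmatize_words,
]))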
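
`preprocess_string` passes the whole document through each filter in turn and splits the result on whitespace only at the end. The sketch below mimics that flow in plain Python. The filters and the stopword list are simplified stand-ins written for illustration, not gensim's implementations.

```python
import re

STOPWORDS = {'the', 'a', 'of', 'is', 'for', 'and'}  # tiny made-up list

def lowercase(text):
    return text.lower()

def drop_punctuation(text):
    # Replace anything that is not a word character or whitespace with a space.
    return re.sub(r'[^\w\s]', ' ', text)

def drop_stopwords(text):
    return ' '.join(word for word in text.split() if word not in STOPWORDS)

def drop_short(text, min_len=3):
    return ' '.join(word for word in text.split() if len(word) >= min_len)

def run_filters(text, filters):
    # Each filter maps a string to a string; tokens appear only at the end.
    for apply_filter in filters:
        text = apply_filter(text)
    return text.split()

tokens = run_filters(
    'The loss function of a neural network is minimised.',
    [lowercase, drop_punctuation, drop_stopwords, drop_short],
)
print(tokens)
```

Because every filter receives the full string, the order matters: a stopword filter that compares against lowercase words only catches "The" if lowercasing has already happened.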

In [None]:
df['full_text_tokenized'].sample(n=20)

A quick look at the lengths of the tokenized documents.

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))
df['full_text_tokenized_len'] = df['full_text_tokenized'].apply(len)
sns.histplot(df[df['full_text_tokenized_len'] > 0]['full_text_tokenized_len'], log_scale=True)
plt.show()

<a id="#section-3"></a>
# Word Cloud
Next, the typical word cloud. <br>
Not surprisingly, terms such as machine learning, processing system, neural network, information processing, reinforcement learning and loss function are common.

In [None]:
%%time
# 2min 50s
from wordcloud import WordCloud

long_string = ' '.join([' '.join(words) for words in df['full_text_tokenized'].values])
wordcloud = WordCloud(width=800, height=400)
wordcloud.generate(long_string)
wordcloud.to_image()

<a id="#section-4"></a>
# LDA Topic Modelling

We will go through the following steps:
* Create a "dictionary" containing all unique words in all documents
* Create a "corpus". Each document will be converted into a bag of words, e.g. [(0, 1), (1, 1), (4, 2), ...]. <br> 
Each tuple means (word index, word occurrence in the document)
* Train the LDA model. Tune the hyperparameters `passes` and `iterations` until most documents have "converged"
* Visualise the result, exploring different groups of documents
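
The first two steps can be sketched in plain Python. This is a simplified stand-in for gensim's `Dictionary` and `doc2bow`, using two invented documents; the integer ids assigned here may differ from the ones gensim would assign.

```python
from collections import Counter

toy_docs = [
    ['neural', 'network', 'training'],
    ['network', 'graph', 'graph'],
]

# "Dictionary": map every unique word to an integer id,
# in order of first appearance.
token2id = {}
for doc in toy_docs:
    for word in doc:
        token2id.setdefault(word, len(token2id))

def to_bow(doc):
    # Count each word id, then emit sorted (word id, count) pairs.
    counts = Counter(token2id[word] for word in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

The second document contains "graph" twice, so its bag of words holds the pair `(3, 2)`: word id 3, two occurrences. Word order within a document is discarded.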

In [None]:
%%time
# 30s
import gensim

dictionary = gensim.corpora.Dictionary(df['full_text_tokenized'].values)
dictionary.filter_extremes(no_below=20, no_above=0.5)
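
`filter_extremes(no_below=20, no_above=0.5)` keeps a token only if it appears in at least 20 documents and in no more than 50% of all documents (gensim additionally caps the vocabulary size through `keep_n`, left at its default here). Below is a plain-Python sketch of the document-frequency rule, with thresholds scaled down to a four-document toy corpus. The function name is hypothetical.

```python
from collections import Counter

toy_docs = [
    ['model', 'learning', 'kernel'],
    ['model', 'learning', 'graph'],
    ['model', 'kernel', 'bandit'],
    ['model', 'learning', 'policy'],
]

def filter_by_document_frequency(docs, no_below, no_above):
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * len(docs))  # fraction -> absolute document count
    return {
        token for token, freq in doc_freq.items()
        if no_below <= freq <= max_docs
    }

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "model" is dropped because it occurs in every document, and "graph", "bandit" and "policy" are dropped because each occurs in only one. Very common and very rare tokens carry little information for separating topics.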

In [None]:
%%time
# 13.2s
corpus = [dictionary.doc2bow(doc) for doc in df['full_text_tokenized'].values]

In [None]:
print(f'Number of unique tokens: {len(dictionary):,}')
print(f'Number of documents: {len(corpus):,}')

In [None]:
%%time 
# 13mins
import logging
from gensim.models.ldamulticore import LdaMulticore

# Debug logging takes too long for a Kaggle "Save Version" run. Set it to True only during development.
enable_debug = False

if enable_debug:
    for handler in logging.root.handlers[:]:
        logging.root.removeHandler(handler)
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)

model = LdaMulticore(corpus, num_topics=10, id2word=dictionary, passes=40, iterations=100)

if enable_debug:
    logger.setLevel(logging.WARNING)    

`passes` and `iterations` are set high enough that most documents have converged by the last pass.
Below are some log lines from running the cell with DEBUG logging enabled:

> 2021-03-15 15:28:37,074 : DEBUG : 1666/2000 documents converged within 100 iterations <br>
2021-03-15 15:28:40,471 : DEBUG : 1509/2000 documents converged within 100 iterations <br>
2021-03-15 15:28:42,532 : DEBUG : 1454/2000 documents converged within 100 iterations <br>
2021-03-15 15:28:46,568 : DEBUG : 1489/2000 documents converged within 100 iterations <br>
2021-03-15 15:28:49,296 : DEBUG : 1265/1680 documents converged within 100 iterations <br>
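
To turn such log lines into a single number, the converged and total counts can be summed across chunks. Below is a small sketch that parses the five lines quoted above with a regular expression. The pattern is written for this message format and is an assumption about how stable that format is across gensim versions.

```python
import re

log_lines = [
    '2021-03-15 15:28:37,074 : DEBUG : 1666/2000 documents converged within 100 iterations',
    '2021-03-15 15:28:40,471 : DEBUG : 1509/2000 documents converged within 100 iterations',
    '2021-03-15 15:28:42,532 : DEBUG : 1454/2000 documents converged within 100 iterations',
    '2021-03-15 15:28:46,568 : DEBUG : 1489/2000 documents converged within 100 iterations',
    '2021-03-15 15:28:49,296 : DEBUG : 1265/1680 documents converged within 100 iterations',
]

pattern = re.compile(r'(\d+)/(\d+) documents converged')

converged = 0
total = 0
for line in log_lines:
    match = pattern.search(line)
    if match:
        converged += int(match.group(1))
        total += int(match.group(2))

share = converged / total
print(f'{converged:,}/{total:,} documents converged ({share:.1%})')
```

For the five chunks quoted above, this comes to 7,383 of 9,680 documents, roughly 76%.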

<a id="#section-5"></a>
# Result Visualisation

Here comes the fruit. <br>
We will use the gensim model's `print_topics` method to see the most prominent terms in each group. <br>
pyLDAvis will then be used to explore the groups in an interactive chart. <br>

In [None]:
model.print_topics(num_topics=10)

In [None]:
%%time
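import pyLDAvis
try:
    import pyLDAvis.gensim_models as gensimvis  # module name in newer pyLDAvis releases
except ImportError:
    import pyLDAvis.gensim as gensimvis  # module name in older releases

prep_display = gensimvis.prepare(model, corpus, dictionary)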
pyLDAvis.display(prep_display)
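
To see which papers fall into which group, each document can be assigned its most probable topic. gensim's `get_document_topics` returns a list of `(topic id, probability)` pairs for a bag-of-words document. The sketch below works on hand-written pairs of that shape (the numbers are invented) rather than on the trained model.

```python
from collections import defaultdict

# Invented topic distributions, shaped like the output of
# model.get_document_topics(bow) for three documents.
doc_topics = [
    [(0, 0.70), (3, 0.30)],
    [(1, 0.55), (0, 0.45)],
    [(3, 0.90), (1, 0.10)],
]

def dominant_topic(topic_probs):
    # Pick the topic id of the pair with the highest probability.
    return max(topic_probs, key=lambda pair: pair[1])[0]

# Group document indices by their dominant topic.
groups = defaultdict(list)
for doc_index, topic_probs in enumerate(doc_topics):
    groups[dominant_topic(topic_probs)].append(doc_index)

print(dict(groups))
```

With the trained model, pairs of the same shape could be obtained with `[model.get_document_topics(bow) for bow in corpus]`, and the document indices map back to rows of `df`.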

<a id="#section-6"></a>
# Further Study

Here are some possible directions to study the dataset further:
* Change `num_topics` from 10 to 20, 50, etc. to explore more fine-grained groupings of papers
* Compare and contrast the resulting groups using other similarity algorithms such as TF-IDF or LSA
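
As a starting point for the TF-IDF comparison, the sketch below computes TF-IDF vectors and cosine similarity in plain Python on three invented documents. It uses one common weighting (raw term count times log(N / document frequency)). Libraries such as gensim or scikit-learn use their own variants with different smoothing and normalisation, so their numbers will differ.

```python
import math
from collections import Counter

toy_docs = [
    ['neural', 'network', 'training'],
    ['neural', 'network', 'pruning'],
    ['bandit', 'regret', 'bound'],
]

def tfidf_vectors(docs):
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            token: count * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return vectors

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = tfidf_vectors(toy_docs)
print(cosine(vectors[0], vectors[1]))  # documents sharing 'neural' and 'network'
print(cosine(vectors[0], vectors[2]))  # documents with no shared tokens
```

Pairs of papers that score high here but land in different LDA topics (or the reverse) would be natural cases to inspect by hand.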

# Thank you for reading

Let me know your thoughts in the comments below :D 