### 1)  Data Pre-Processing

This section contains code for pre-processing the data used in section 6 of [Hierarchical Dirichlet Processes](https://people.eecs.berkeley.edu/~jordan/papers/hdp.pdf).

The data used comes from [this website](https://web.archive.org/web/20040328153507/http://elegans.swmed.edu/wli/cgcbib). 

In [1]:
import urllib.request
import string

url = 'https://raw.githubusercontent.com/tdhopper/topic-modeling-datasets/master/data/raw/Nematode%20biology%20abstracts/cgcbib.txt'
file = urllib.request.urlopen(url)
data = file.read().decode("ISO-8859-1")

In [2]:
# Lowercase the text and replace '\n' and '\r' with spaces
data = data.lower().translate(str.maketrans('\n\r', '  '))
# Remove punctuation except for '-' so we can split after each abstract
data = data.translate(str.maketrans('','', '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~'))
# Remove numbers
data = data.translate(str.maketrans('','', string.digits))

In [3]:
data.count('abstract')

6224

In [4]:
len(data.split('-------------------'))

6216

Splitting on the separator gives 6216 chunks, but the paper states that there were only 5838 abstracts. Some entries have no real abstract and only say "in French", so we need to remove those.
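
To make the splitting and filtering logic concrete, here is a minimal sketch on a made-up three-record string. The sample text and the names `SEPARATOR`, `sample`, `records`, `bodies`, and `kept` are illustrative assumptions, not taken from the real file. The sketch also passes `maxsplit=1` so that everything after the first `abstract` marker is kept, which is a choice made for this toy example.

```python
# Toy stand-in for the cleaned corpus: records separated by a run of dashes.
# The sample text is invented for illustration only.
SEPARATOR = '-------------------'
sample = (
    'key a abstract worms move ' + SEPARATOR +
    ' key b abstract in french ' + SEPARATOR +
    ' key c abstract genes matter ' + SEPARATOR
)

# Splitting on the separator leaves one trailing chunk with no record in it.
records = sample.split(SEPARATOR)[:-1]

# Keep only the text after the first 'abstract' marker in each record.
bodies = [r.split('abstract', 1)[1].strip() for r in records]

# Drop records whose body is just the "in french" placeholder.
kept = [b for b in bodies if 'in french' not in b]
print(kept)
```

On this toy input, two of the three records survive the filter.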

In [5]:
tmp = data.split('-------------------')
# Remove the '-' now
tmp = [abstract.translate(str.maketrans('-', ' ')) for abstract in tmp]

In [6]:
# Drop the last chunk, which follows the final separator and does not contain the word "abstract"
tmp = tmp[:-1]

# Only keep the words after 'abstract'
tmp = [abstract.split('abstract')[1] for abstract in tmp]

In [7]:
from itertools import compress

# Remove French Abstracts
not_french = ['in french' not in i for i in tmp]
tmp = list(compress(tmp, not_french))
len(tmp)

6189

That still leaves 6189 abstracts, 351 more than the 5838 reported in the paper. We are therefore not working with exactly the same dataset, but it is a close representation and should hopefully deliver similar results.

Now, we need to remove stop words and words appearing fewer than 10 times from our abstracts.
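
As a small illustration of this two-stage filter, the sketch below applies it to invented data using only the standard library. The stop-word set, the documents, and the threshold of 2 are hypothetical stand-ins for NLTK's English stop words, the real abstracts, and the threshold of 10.

```python
from collections import Counter

# Hypothetical miniature inputs; the notebook uses NLTK's English stop words
# and a minimum count of 10.
toy_stop_words = {'the', 'of', 'in'}
toy_docs = ['the gene of the worm', 'gene expression in the worm', 'the rare mutant']
min_count = 2

# Tokenize the whole corpus and drop stop words.
tokens = [w for doc in toy_docs for w in doc.split() if w not in toy_stop_words]

# Count corpus-wide frequencies and keep words seen at least min_count times.
counts = Counter(tokens)
vocab = {w for w, c in counts.items() if c >= min_count}

# Restrict each document to the surviving vocabulary.
filtered = [[w for w in doc.split() if w in vocab] for doc in toy_docs]
print(sorted(vocab))
print(filtered)
```

Because stop words never enter `vocab`, filtering each document against `vocab` removes stop words and rare words in a single pass. A document can end up empty, as the third toy document does.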

In [8]:
from nltk.corpus import stopwords 
import pandas as pd

stop_words = set(stopwords.words('english')) 

# Remove stop words
words = ''.join(tmp).split()
words = [i for i in words if i not in stop_words]

# Keep only words that appear at least 10 times
s_words = pd.Series(words)
word_freq = s_words.value_counts()
ten_ = list(word_freq[word_freq > 9].index)

In [9]:
# Keep only the retained words in each abstract
vocab = set(ten_)  # set membership is much faster than scanning a list
lists_of_words = [i.split() for i in tmp]
final_ = [[w for w in doc if w in vocab] for doc in lists_of_words]

In [10]:
# Find number of words and number of distinct words
print('Number of words:', len([i for sub in final_ for i in sub]))
print('Number of distinct words:', len(set([i for sub in final_ for i in sub])))

Number of words: 550318
Number of distinct words: 5911


For comparison, the data used in **Hierarchical Dirichlet Processes** has 476,441 words and 5,699 distinct words, so our corpus has about 15.5% more words and about 3.7% more distinct words.

The last step is to transform the data into a matrix holding the count of each word in each document. The filtered abstracts, as lists of words, are stored in `final_`.
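
To illustrate the target layout (one row per word, one column per document), here is a minimal sketch on made-up data. It uses `collections.Counter` so that each document is scanned only once. The names `toy_vocab`, `toy_docs`, `doc_counts`, and `matrix` are hypothetical stand-ins, and this is one possible way to build the counts rather than the method used in this notebook.

```python
from collections import Counter

# Hypothetical toy inputs standing in for the vocabulary and tokenized abstracts.
toy_vocab = ['gene', 'worm']
toy_docs = [['gene', 'worm', 'gene'], ['worm'], []]

# One Counter per document; a missing word simply counts as 0.
doc_counts = [Counter(doc) for doc in toy_docs]

# Rows are words, columns are documents.
matrix = [[c[word] for c in doc_counts] for word in toy_vocab]

for word, row in zip(toy_vocab, matrix):
    print(word, row)
```

Passing `matrix` to `pd.DataFrame` with `index=toy_vocab` would give a labelled table in the same orientation.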

In [11]:
unique_words = ten_

In [13]:
import numpy as np

# Rows are words, columns are abstracts
word_counts = np.zeros((len(unique_words), len(final_)))

for i, word in enumerate(unique_words):
    for j, doc in enumerate(final_):
        word_counts[i, j] = doc.count(word)

df = pd.DataFrame(word_counts, index=unique_words)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6179,6180,6181,6182,6183,6184,6185,6186,6187,6188
elegans,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,2.0,1.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0
c,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
cell,0.0,1.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
gene,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
caenorhabditis,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,3.0,1.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
operating,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
feminize,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
index,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
reproducibly,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
df.to_csv('final_project_data.csv')
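
When the saved file is read back later, the word labels written as the first CSV column have to be restored as the index. The sketch below shows this on a tiny made-up frame, with an in-memory buffer standing in for `final_project_data.csv`. The names `toy_df`, `naive`, and `restored` are hypothetical.

```python
import io

import pandas as pd

# Tiny made-up word-by-document frame, shaped like the one saved above.
toy_df = pd.DataFrame([[2.0, 0.0], [1.0, 1.0]], index=['gene', 'worm'])

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_df.to_csv(buffer)

# Without index_col, the saved index comes back as a column named 'Unnamed: 0'.
buffer.seek(0)
naive = pd.read_csv(buffer)

# With index_col=0, the words become the row labels again.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(naive.columns[0])
print(list(restored.index))
```

Note that the column labels come back as strings (`'0'`, `'1'`) rather than integers, since CSV headers carry no type information.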