# Latent Dirichlet Allocation (LDA)

This Colab is adapted from [this notebook](https://colab.research.google.com/github/dudaspm/LDA_Bias_Data/blob/main/Latent%20Dirichlet%20Allocation%20(LDA).ipynb).

<center>
<img src="https://www.gutenberg.org/files/55/55-h/images/cover.jpg"  width="300"></img>
</center>

The Wonderful Wizard of Oz via https://www.gutenberg.org/ebooks/55

* What are topics? 
    * The topics are X sets of terms (we choose X) whose terms will, or at least could, share a common theme. 
* How are they defined? 
    * This is what we will get to in this notebook.     
* Do we define or does the computer? 
    * LDA is unsupervised, so we define the number of topics. The computer provides the topics themselves. 
* What is a large corpus, and how many documents do we need? 
    * This is somewhat subjective. Tang et al. {cite:p}`tang2014understanding` give a *great* discussion of the question. If you have a chance, read all of their points, but to sum up:
        * The number of documents does matter, but past some point, adding more does not guarantee better results. Sampling 1,000 papers from 1,000,000 could give the same results as using all 1,000,000, if not better. 
        * Document length also plays a role, so documents should not be too short. Longer documents can be sampled and still give the desired output. 



For more information about LDA, please read [this high-level article](https://www.cs.columbia.edu/~blei/papers/Blei2012.pdf). If you are interested, come discuss it at my office hours, and consider taking CS6120: Natural Language Processing. In this course, we will apply LDA with only a basic understanding of the approach.

<center>
<img src="http://deliveryimages.acm.org/10.1145/2140000/2133826/figs/f1.jpg"  width="600"></img>
</center>
Figure 1. The intuitions behind latent Dirichlet allocation. We assume that some number of "topics," which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. The topics and topic assignments in this figure are illustrative—they are not fit from real data. (Page 3)
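
The generative story in the caption above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two topics and their word probabilities are made up (not fit from any data), the helper names `sample_dirichlet` and `generate_document` are hypothetical, and a Dirichlet draw is obtained by normalizing Gamma draws.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document by following the LDA generative story."""
    # 1. Choose this document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    assignments, words = [], []
    for _ in range(n_words):
        # 2. For each word slot, choose a topic assignment.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Choose the word from the corresponding topic's word distribution.
        candidates = list(topics[z].keys())
        probs = list(topics[z].values())
        w = rng.choices(candidates, weights=probs)[0]
        assignments.append(z)
        words.append(w)
    return theta, assignments, words

# Hypothetical topics: each one is a distribution over words (illustrative only).
topics = [
    {"dorothy": 0.5, "scarecrow": 0.3, "wizard": 0.2},
    {"college": 0.4, "students": 0.4, "vassar": 0.2},
]

rng = random.Random(0)
theta, assignments, words = generate_document(topics, alpha=[1.0, 1.0], n_words=8, rng=rng)
print("topic proportions:", [round(p, 2) for p in theta])
print("(topic, word) pairs:", list(zip(assignments, words)))
```

Fitting LDA runs this story in reverse: given only the words, it infers topics and per-document topic proportions that could plausibly have produced them.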


<center>
<img src="https://deliveryimages.acm.org/10.1145/2140000/2133826/figs/f2.jpg"  width="600"></img>
</center>
Figure 2. Real inference with LDA. We fit a 100-topic LDA model to 17,000 articles from the journal Science. At left are the inferred topic proportions for the example article in Figure 1. At right are the top 15 most frequent words from the most frequent topics found in this article.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# %%capture
# The PyPI package is named scikit-learn; it is imported as `sklearn`.
!pip install scikit-learn


## Let's Try an Example

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Project Gutenberg HTML editions: five Oz books plus two non-Oz books
book_urls = [
    "https://www.gutenberg.org/files/55/55-h/55-h.htm",  # The Wonderful Wizard of Oz
    "https://www.gutenberg.org/files/54/54-h/54-h.htm",  # The Marvelous Land of Oz
    "https://www.gutenberg.org/files/33361/33361-h/33361-h.htm",  # Ozma of Oz
    "https://www.gutenberg.org/files/22566/22566-h/22566-h.htm",  # Dorothy and the Wizard of Oz
    "https://www.gutenberg.org/files/26624/26624-h/26624-h.htm",  # The Road to Oz
    "https://www.gutenberg.org/cache/epub/46080/pg46080-images.html",  # Earliest Years at Vassar
    "https://www.gutenberg.org/cache/epub/38649/pg38649-images.html",  # Days in Queensland
]

def fetch_text(url):
    """Download one book page and return its visible text."""
    html = urlopen(url).read()
    soup = BeautifulSoup(html, features="html.parser")
    # Remove CSS (style) and JavaScript (script) elements
    for tag in soup(["script", "style"]):
        tag.extract()
    return soup.get_text()

# One entry per book, in the same order as book_urls
documents = [fetch_text(url) for url in book_urls]

### Create Tokens and Vocabulary

Now that we have our books, we need to tokenize the stories by word and then create a vocabulary out of these tokens. Note that we eliminate extremely common words that do not contribute much to the meaning of a document and topic (like `the`, `and`, `or`, etc.). These are called *stop words*.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
df = cv.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = cv.get_feature_names_out()
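
To make the idea of tokens, vocabulary, and counts concrete, here is a minimal pure-Python sketch of a bag-of-words count. It is a simplification and not how `CountVectorizer` is implemented: the two documents, the tiny stop-word set, and the whitespace tokenizer are all assumptions made for illustration.

```python
from collections import Counter

# Two tiny made-up documents (not taken from the books above).
toy_docs = ["the lion and the scarecrow", "the scarecrow met dorothy"]
toy_stop_words = {"the", "and"}  # hand-picked stop list for this sketch only

# Tokenize: lowercase, split on whitespace, drop stop words.
tokenized = [
    [w for w in doc.lower().split() if w not in toy_stop_words]
    for doc in toy_docs
]

# Vocabulary: sorted unique tokens; a token's position is its column index.
toy_vocab = sorted({w for doc in tokenized for w in doc})

# Document-term matrix: one row per document, one column per vocabulary word.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[w] for w in toy_vocab])

print(toy_vocab)
print("shape:", (len(matrix), len(toy_vocab)))

# List only the nonzero entries as (document index, token index), count, word.
for d, row in enumerate(matrix):
    for t, count in enumerate(row):
        if count:
            print((d, t), count, toy_vocab[t])
```

The real matrix has the same layout, just with one row per book and one column per word in the full vocabulary, and it is stored sparsely so that only nonzero counts are kept.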



Let's take a look at the tokens and the number of occurrences of each token. 

Question: What do the dimensions in df mean?

In [None]:
# PLAY AROUND WITH `df` HERE
print(df[0])

  (0, 9720)	89
  (0, 5998)	99
  (0, 4342)	14
  (0, 13956)	28
  (0, 13933)	44
  (0, 8868)	169
  (0, 5434)	5
  (0, 1431)	5
  (0, 13349)	28
  (0, 13234)	15
  (0, 11900)	19
  (0, 8980)	2
  (0, 13999)	19
  (0, 3170)	4
  (0, 10458)	2
  (0, 13772)	2
  (0, 3122)	12
  (0, 1278)	71
  (0, 12552)	22
  (0, 7428)	18
  (0, 6639)	3
  (0, 8693)	4
  (0, 14057)	10
  (0, 8741)	10
  (0, 7540)	7
  :	:
  (0, 8888)	1
  (0, 4095)	1
  (0, 8013)	1
  (0, 639)	1
  (0, 13690)	1
  (0, 3279)	1
  (0, 2231)	1
  (0, 9701)	1
  (0, 8018)	1
  (0, 6119)	1
  (0, 8758)	1
  (0, 11172)	1
  (0, 9691)	1
  (0, 7584)	1
  (0, 8443)	1
  (0, 13543)	1
  (0, 2938)	1
  (0, 8397)	1
  (0, 4360)	1
  (0, 7711)	1
  (0, 9201)	1
  (0, 4918)	1
  (0, 9690)	1
  (0, 12159)	1
  (0, 8454)	1


In the printout of `df`, each entry has the form `(document index, token index)` followed by a count. The token index points into the `vocab` list, which we use to look up the actual word. For example, take an entry such as:

```python
(0, 8074) 3198
```
This says that token 8074 was used 3198 times in document 0. The 8074 token is:

Question: What word/vocab does token 8074 correspond to? How many times is it used? Is this surprising?

In [None]:
# YOUR CODE HERE
vocab[8074]

'mins'

From here, we are actually at the point where we can run LDA.

LDA accepts many more than two inputs, but we will focus on two. 
> Here are the other inputs: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

The two we will focus on are:

* `n_components` - the number of topics; again, we need to specify this
* `doc_topic_prior` - the prior on each document's topic distribution, which relates to the Dirichlet distribution (the next notebook goes into full detail about the Dirichlet distribution and how it relates to LDA). 
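
As a rough intuition for `doc_topic_prior`, the sketch below draws topic proportions from a symmetric Dirichlet distribution at a few concentration values. It uses only the standard library and normalizes Gamma draws to get each sample. The helper names `symmetric_dirichlet` and `average_peak` and the chosen values are assumptions for illustration; this is not what scikit-learn runs internally.

```python
import random

def symmetric_dirichlet(alpha, n_topics, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over n_topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def average_peak(alpha, n_topics, n_samples, rng):
    """Average size of the largest topic proportion across many samples."""
    peaks = [max(symmetric_dirichlet(alpha, n_topics, rng)) for _ in range(n_samples)]
    return sum(peaks) / n_samples

rng = random.Random(42)
n_topics = 5
for alpha in (0.1, 1.0, 10.0):
    sample = symmetric_dirichlet(alpha, n_topics, rng)
    peak = average_peak(alpha, n_topics, 500, rng)
    print(f"alpha={alpha}: one sample={[round(p, 2) for p in sample]}, "
          f"average largest proportion={peak:.2f}")
```

Smaller values tend to produce documents dominated by one or two topics, while larger values tend to spread each document more evenly across all topics.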


In [None]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components = 5, doc_topic_prior=1)

# YOUR CODE HERE
lda.fit(df)

LatentDirichletAllocation(doc_topic_prior=1, n_components=5)

To print out the top 10 words per topic, we use a solution adapted from StackOverflow.

In [None]:
import numpy as np

topic_words = {}
n_top_words = 10

try:
    for topic, comp in enumerate(lda.components_):
        # np.argsort(arr) returns the indices that would sort arr in ASCENDING order:
        # position i of the result holds the index of the i-th smallest element.
        # ex. arr = [3, 7, 1, 0, 3, 6]
        # np.argsort(arr) -> [3, 2, 0, 4, 5, 1]
        # Reversing with [::-1] gives descending order, and [:n_top_words] keeps
        # the indices of the n_top_words highest-weighted words for this topic.
        word_idx = np.argsort(comp)[::-1][:n_top_words]

        # store the words most relevant to the topic
        topic_words[topic] = [vocab[i] for i in word_idx]

    for topic, words in topic_words.items():
        print('Topic: %d' % topic)
        print('  %s' % ', '.join(words))
except AttributeError:
    # lda.components_ only exists after lda.fit(...) has been called
    print("Did you fit the data?")

Topic: 0
  gutenberg, college, project, miss, work, day, time, mitchell, students, vassar
Topic: 1
  said, dorothy, scarecrow, man, woodman, tin, little, asked, oz, tip
Topic: 2
  river, country, pg, mr, new, queensland, great, north, stock, water
Topic: 3
  pumpkinhead, husband, lesson, impression, tastes, depths, series, holiday, suited, lamb
Topic: 4
  dorothy, said, pg, little, wizard, king, ozma, girl, asked, gutenberg
