# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** K

**Names:**

* Kim Lan Phan Hoang
* Robin Lang

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl
from collections import defaultdict
from pyspark.mllib.linalg import Vector, Vectors
from pyspark.mllib.clustering import LDA, LDAModel
import re

courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')
commonWords = ["student","students","learning","course","courses"]

In [19]:
# get RDD of courses
data = sc.parallelize(courses)

In [None]:
# take a (list of words, document id) pair and return a tuple (id, sparse vector of word occurrences)
def documentToVector(d):
    
    wordOccurrencies = defaultdict(int) #initialize dict
    for w in d[0]: 
        wordOccurrencies[wordsList[w]] += 1 # add one at word_id position
            
    wordOccurrencies = sorted(wordOccurrencies.items()) # in order to obtain a list for each element
    wordOcc_0 = [x[0] for x in wordOccurrencies]
    wordOcc_1 = [x[1] for x in wordOccurrencies]
    
    return (d[1], Vectors.sparse(distinctNbWords, wordOcc_0, wordOcc_1))
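To make the output of `documentToVector` concrete, here is a minimal plain-Python sketch of the same counting logic on a toy vocabulary. The name `to_sparse_counts` and the three-word vocabulary are made up for illustration; the returned `(indices, values)` pair plays the role of the two lists passed to `Vectors.sparse`.

```python
from collections import Counter


def to_sparse_counts(tokens, vocabulary):
    """Return (sorted word ids, matching counts) for one tokenized document."""
    # count how often each word id appears in the document
    counts = Counter(vocabulary[t] for t in tokens)
    # sort by word id, as a sparse vector expects increasing indices
    pairs = sorted(counts.items())
    indices = [word_id for word_id, _ in pairs]
    values = [count for _, count in pairs]
    return indices, values


# hypothetical vocabulary: word -> id
vocabulary = {"methods": 0, "energy": 1, "design": 2}
indices, values = to_sparse_counts(["energy", "methods", "energy"], vocabulary)
print(indices, values)  # prints [0, 1] [1, 2]
```

The word "design" never occurs in the toy document, so its id is absent from the sparse representation.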

In [55]:
# Usage of pyspark LDA is highly inspired by the following: http://seanlane.net/blog/2016/PySpark_and_LDA

#transform documents into bags of words
    #take only the description into account
    #split depending on whitespace and punctuation
    #only keep letter words
    #remove words whose length is 3 or less
    #lowercase words
    #remove stopwords and non-relevant words
wordsInDocument = data.map(lambda x: x["description"])\
.map(lambda x: re.split(r"[\s\.\?\!\,\;\:\-\(\)\[\]\{\}\"\/]", x))\
.map(lambda w: [x for x in w if x.isalpha()])\
.map(lambda w: [x for x in w if len(x)>3])\
.map(lambda w: [x.lower() for x in w])\
.map(lambda w: [x for x in w if (x not in stopwords) and (x not in commonWords)])


#Group words in order to obtain a word list
    #add a counter
    #reducebykey to group words
    #descending sort on word recurrences
mostReccurentWords = wordsInDocument.flatMap(lambda x: x)\
.map(lambda word: (word, 1))\
.reduceByKey(lambda x,y: x + y)\
.map(lambda x: (x[1], x[0]))\
.sortByKey(False)

    
# get all possible words and map them to an id
wordsList = mostReccurentWords.map(lambda x: x[1]).zipWithIndex().collectAsMap()
distinctNbWords = len(wordsList)


# get correct format to submit to lda
    # zipWithIndex to obtain an id for the document
    # get a vector of used words
documents = wordsInDocument.zipWithIndex().map(documentToVector).map(list)
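The cleaning steps applied to each description can be tried without Spark. The sketch below is a plain-Python equivalent under stated assumptions: `tokenize`, the sample sentence and the tiny stopword sets are illustrative only and are not part of the lab data.

```python
import re

# same separators as in the Spark pipeline: whitespace and common punctuation
SPLIT_PATTERN = re.compile(r'[\s.?!,;:\-()\[\]{}"/]')


def tokenize(text, stopwords, common_words):
    """Apply the notebook's cleaning steps to a single description."""
    tokens = SPLIT_PATTERN.split(text)
    tokens = [t for t in tokens if t.isalpha()]   # only keep letter words
    tokens = [t for t in tokens if len(t) > 3]    # drop words of length 3 or less
    tokens = [t.lower() for t in tokens]          # lowercase
    return [t for t in tokens if t not in stopwords and t not in common_words]


# hypothetical inputs for illustration
sample = "This course covers Markov chains, in 3D!"
print(tokenize(sample, {"this"}, {"course"}))  # prints ['covers', 'markov', 'chains']
```

Note that "3D" is dropped by the `isalpha` filter and "in" by the length filter, before stopwords are even considered.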

In [61]:
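# list of words ordered by id, used to map term indices back to words
wordsList_helper = mostReccurentWords.map(lambda x: x[1]).collect()

# for each LDA run, print the top words of its first 10 topics
def displayTopicsWords(lda_topics, alphaValues, betaValues):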
    for i in range(len(lda_topics)):
        print("alpha = ",alphaValues[i],"& beta =",betaValues[i])
        for j in range(10):
            print("- Topic ",j+1,": ",end="")
            for w in lda_topics[i][j][0]:
                print(wordsList_helper[w],"-", end="")
            print()
        print()

## Exercise 4.8: Topics extraction

### Using your pre-processed courses dataset, extract topics using LDA. Print k = 10 topics extracted using LDA and give them labels.

In [60]:
# create lda model
lda = LDA.train(documents, k = 10)

# get topics
topic_indices = lda.describeTopics(maxTermsPerTopic = 10) # 10 words to display

In [62]:
displayTopicsWords([topic_indices], ["default"],["default"])

alpha =  default & beta = default
- Topic  1 : methods -models -model -stochastic -theory -time -introduction -exam -financial -risk -
- Topic  2 : methods -case -energy -management -content -work -assessment -outcomes -business -evaluate -
- Topic  3 : methods -energy -equations -content -processes -transfer -applications -lecture -chemical -organic -
- Topic  4 : optical -optics -microscopy -imaging -epfl -electron -engineering -methods -laser -light -
- Topic  5 : methods -materials -content -assessment -paper -teaching -work -activities -theory -outcomes -
- Topic  6 : methods -cell -protein -molecular -flow -content -chemical -research -biology -structure -
- Topic  7 : design -methods -data -analysis -structures -mechanics -materials -content -work -engineering -
- Topic  8 : project -plan -skills -methods -data -programming -scientific -systems -content -software -
- Topic  9 : systems -design -system -modeling -methods -circuits -processing -digital -techniques -concepts -
- To

alpha =  default & beta = default
- **financial risk** : methods -models -model -stochastic -theory -time -introduction -exam -financial -risk -
- **energy management** : methods -case -energy -management -content -work -assessment -outcomes -business -evaluate -
- **energy applications** : methods -energy -equations -content -processes -transfer -applications -lecture -chemical -organic -
- **optical microscopy** : optical -optics -microscopy -imaging -epfl -electron -engineering -methods -laser -light -
- **materials theory** : methods -materials -content -assessment -paper -teaching -work -activities -theory -outcomes -
- **molecular research** : methods -cell -protein -molecular -flow -content -chemical -research -biology -structure -
- **mechanics engineering** : design -methods -data -analysis -structures -mechanics -materials -content -work -engineering -
- **software project** : project -plan -skills -methods -data -programming -scientific -systems -content -software -
- **digital circuits modeling** : systems -design -system -modeling -methods -circuits -processing -digital -techniques -concepts
- **linear algorithms** : models -linear -analysis -methods -control -data -algorithms -basic -statistical -statistics -

### How does it compare with LSI?

In LSI, nearly every topic was related to either finance (with words such as finance, market, data, price, ...) or pharmacology (with words such as drug, disease, bioprocess, kinetics, ...). The only clearly different topic was centered on projects, with words such as report and research.

In LDA the topics are more varied: finance, projects and biology in a broad sense are still present, but there are also topics about energy, microscopy, software, circuits and similar.

## Exercise 4.9: Dirichlet hyperparameters

### Analyse the effects of α and β.

From https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda we have the following definitions:

docConcentration ($\alpha$): Dirichlet parameter for prior over documents’ distributions over topics. Larger values encourage smoother inferred distributions.

topicConcentration ($\beta$): Dirichlet parameter for prior over topics’ distributions over terms (words). Larger values encourage smoother inferred distributions.
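What "smoother" means can be illustrated by sampling from a symmetric Dirichlet prior directly. The sketch below uses only the standard library and is independent of Spark; the helper names, the dimension of 10 and the number of samples are arbitrary choices for illustration. It measures the average entropy of the sampled distributions: the closer it is to $\log 10$, the closer the samples are to uniform.

```python
import math
import random


def sample_dirichlet(concentration, dimension, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dimension)]
    total = sum(draws)
    return [d / total for d in draws]


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def mean_entropy(concentration, dimension=10, n_samples=2000, seed=0):
    """Average entropy of n_samples Dirichlet draws with the given concentration."""
    rng = random.Random(seed)
    total = sum(entropy(sample_dirichlet(concentration, dimension, rng))
                for _ in range(n_samples))
    return total / n_samples


# concentration values taken from the experiments in this notebook
for c in (1.01, 6.0, 20.0):
    print("concentration =", c, "mean entropy =", round(mean_entropy(c), 3))
print("uniform entropy =", round(math.log(10), 3))
```

The mean entropy grows with the concentration parameter, which matches the definitions above: larger values of $\alpha$ (respectively $\beta$) favor documents spread over many topics (respectively topics spread over many words).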

### Fix k = 10 and β = 1.01, and vary α. How does it impact the topics?

When $\alpha$ is very high, each document tends to be a mixture of many topics (its probabilities of belonging to each topic approach a uniform distribution). With $\alpha$ = 20, the retrieved topics are very similar to each other: they are dominated by words that are frequent in the whole corpus, such as "methods". On the other hand, with $\alpha$ = 1.01, the retrieved topics are much more specific.

In [40]:
def createLdaTopics(alphaValues, betaValues):
    topics_indices = []
    for i in range(len(alphaValues)):
        lda = LDA.train(documents, k = 10,  docConcentration = alphaValues[i], topicConcentration = betaValues[i])
        topics_indices.append(lda.describeTopics(maxTermsPerTopic = 10))
    return topics_indices

In [41]:
# create lda model
alphaValues1 = [20., 10., 6., 3., 1.01]
betaValues1 = [1.01, 1.01, 1.01, 1.01, 1.01]
lda_topics1 = createLdaTopics(alphaValues1, betaValues1)

In [42]:
displayTopicsWords(lda_topics1, alphaValues1, betaValues1)

alpha =  20.0 & beta = 1.01
Topic  1 : methods -content -management -project -work -solid -energy -analysis -assessment -teaching -
Topic  2 : methods -design -content -data -project -analysis -systems -assessment -basic -outcomes -
Topic  3 : methods -data -content -analysis -report -project -skills -outcomes -assessment -prerequisites -
Topic  4 : methods -content -analysis -data -models -outcomes -control -assessment -basic -prerequisites -
Topic  5 : design -methods -systems -analysis -content -concepts -week -system -assessment -teaching -
Topic  6 : methods -risk -content -theory -basic -model -probability -introduction -assessment -concepts -
Topic  7 : methods -content -outcomes -concepts -energy -basic -design -prerequisites -assessment -analysis -
Topic  8 : methods -architecture -content -project -research -work -design -assessment -semester -teaching -
Topic  9 : methods -materials -systems -content -organic -devices -exercises -basic -analysis -assessment -
Topic  10 : met

### Fix k = 10 and α = 6, and vary β. How does it impact the topics?

The larger $\beta$ is, the more homogeneous the topics become. This is clearly the case when $\beta$ = 20: almost all topics contain the same words. It does not happen when $\beta$ is small (e.g. 1.01), where the topics are more specific.

In [83]:
# create lda model
alphaValues2 = [6.,6.,6.,6.,6.]
betaValues2 = [20., 10., 6., 3., 1.01]
lda_topics2 = createLdaTopics(alphaValues2, betaValues2)

In [84]:
displayTopicsWords(lda_topics2, alphaValues2, betaValues2)

alpha =  6.0  & beta =  20.0 : 
Topic  1 : methods -content -design -systems -analysis -end -assessment -outcomes -project -concepts -
Topic  2 : methods -content -design -systems -analysis -end -assessment -data -outcomes -teaching -
Topic  3 : methods -content -design -systems -analysis -end -assessment -outcomes -basic -teaching -
Topic  4 : methods -content -design -analysis -end -systems -assessment -outcomes -basic -prerequisites -
Topic  5 : methods -content -design -analysis -end -systems -assessment -outcomes -prerequisites -keywords -
Topic  6 : methods -content -design -analysis -systems -end -assessment -outcomes -basic -data -
Topic  7 : methods -content -systems -design -analysis -end -assessment -outcomes -basic -teaching -
Topic  8 : methods -content -design -analysis -systems -end -assessment -outcomes -keywords -prerequisites -
Topic  9 : methods -content -design -systems -analysis -end -assessment -outcomes -keywords -prerequisites -
Topic  10 : methods -content -des

## Exercise 4.10: EPFL's taught subjects

### List the subjects of EPFL’s classes.

### Find the combination of k, α and β that gives most interpretable topics. Explain why you chose these values.

Large values of $\alpha, \beta$ (well above 1) push the posterior towards uniform distributions, which means the topics will be very similar to each other.

Small values of $\alpha, \beta$ (< 1) tend towards a small set of dominant topics, which means the topics will lose their relation and meaning.

Therefore, we chose $\alpha = \beta = 1.01$.

Concerning the k parameter, EPFL courses cover many subjects, so k needs to be high enough to produce a sufficient number of topics. We used k = 15.

In [46]:
# create lda model
lda3 = LDA.train(documents, k = 15, docConcentration = 1.01, topicConcentration = 1.01)

# get topics
lda_topics3 = lda3.describeTopics(maxTermsPerTopic = 10) # 10 words to display
displayTopicsWords([lda_topics3], [1.01],[1.01])

alpha =  1.01 & beta = 1.01
Topic  1 : design -data -systems -system -methods -programming -modeling -digital -tools -teaching -
Topic  2 : magnetic -methods -cell -materials -content -drug -cells -note -molecular -introduction -
Topic  3 : materials -chemical -chemistry -methods -properties -protein -structure -molecular -reaction -content -
Topic  4 : methods -microscopy -electron -design -content -business -analysis -keywords -assessment -class -
Topic  5 : methods -skills -transversal -work -content -concepts -outcomes -assessment -presentation -physical -
Topic  6 : models -methods -theory -model -time -stochastic -financial -risk -finance -heat -
Topic  7 : quantum -methods -content -theory -properties -basic -outcomes -systems -prerequisites -snow -
Topic  8 : energy -project -methods -plan -skills -process -design -systems -conversion -outcomes -
Topic  9 : methods -circuits -content -design -systems -noise -basic -devices -exercises -organic -
Topic  10 : data -project -method

### Report the values of these hyperparameters that you used and your labels for the topics.

k = 15 & $\alpha$ =  1.01  & $\beta$ =  1.01 : 
- **digital tools** : design -data -systems -system -methods -programming -modeling -digital -tools -teaching -
- **magnetic cells** : magnetic -methods -cell -materials -content -drug -cells -note -molecular -introduction -
- **chemical properties of protein** : materials -chemical -chemistry -methods -properties -protein -structure -molecular -reaction -content -
- **electron analysis** : methods -microscopy -electron -design -content -business -analysis -keywords -assessment -class -
- **transversal skills** : methods -skills -transversal -work -content -concepts -outcomes -assessment -presentation -physical -
- **stochastic models in finance** : models -methods -theory -model -time -stochastic -financial -risk -finance -heat -
- **quantum methods** : quantum -methods -content -theory -properties -basic -outcomes -systems -prerequisites -snow -
- **energy systems** : energy -project -methods -plan -skills -process -design -systems -conversion -outcomes -
- **circuits and systems noise** : methods -circuits -content -design -systems -noise -basic -devices -exercises -organic -
- **project group** : data -project -methods -paper -assessment -work -group -content -research -activities -

## Exercise 4.11: Wikipedia structure

### Extract the structure in terms of topics from the wikipedia-for-school dataset. Use your intuition about how many topics might be covered by the articles and how they are distributed. Report the values for k, α and β that you chose a priori and why you picked them.

Following the same reasoning as for the EPFL dataset, we set $\alpha$ and $\beta$ to 1.01, the smallest value we tried (the EM optimizer that Spark uses by default expects values greater than 1).

We also increased k to 100, since Wikipedia covers a very large set of topics.

In [2]:
import json
wikipedia = sc.textFile("/ix/wikipedia-for-schools.txt").map(json.loads)

In [3]:
#Group words in order to obtain a word list
#even if the wikipedia set is preprocessed, we need to clean it a bit more
wikiProcessedWords = wikipedia.map(lambda x: x["tokens"])\
.map(lambda w : [x for x in w if x.isalpha()])\
.map(lambda w : [x for x in w if len(x)>3])

mostReccurentWords1 = wikiProcessedWords.flatMap(lambda x: x)\
.map(lambda word: (word, 1))\
.reduceByKey(lambda x,y: x + y)\
.map(lambda x: (x[1], x[0]))\
.sortByKey(False)

    
# get all possible words and map them to an id
wordsList1 = mostReccurentWords1.map(lambda x: x[1]).zipWithIndex().collectAsMap()
distinctNbWords1 = len(wordsList1)


# transform indices into corresponding words
wordsList_helper1 = mostReccurentWords1.map(lambda x: x[1]).collect()


# get a document and return a tuple (id, vector of word occurrences)
# (defined before it is used in the map below)
def documentToVector1(d):
    
    wordOccurrencies = defaultdict(int) #initialize dict
    for w in d[0]: 
        wordOccurrencies[wordsList1[w]] += 1 # add one at word_id position
            
    wordOccurrencies = sorted(wordOccurrencies.items()) # in order to obtain a list for each element
    wordOcc_0 = [x[0] for x in wordOccurrencies]
    wordOcc_1 = [x[1] for x in wordOccurrencies]
    
    return (d[1], Vectors.sparse(distinctNbWords1, wordOcc_0, wordOcc_1))


# get correct format to submit to lda
    # zipWithIndex to obtain an id for the document
    # get a vector of used words
documents1 = wikiProcessedWords.zipWithIndex().map(documentToVector1).map(list)

In [7]:
def displayTopicsWords(lda_topics, alphaValues, betaValues):
    for i in range(len(lda_topics)):
        print("alpha = ",alphaValues[i],"& beta =",betaValues[i])
        for j in range(10):
            print("- Topic ",j+1,": ",end="")
            for w in lda_topics[i][j][0]:
                print(wordsList_helper1[w],"-", end="")
            print()
        print()

In [17]:
# create lda model
alphaValue_wiki = 1.01
betaValue_wiki = 1.01
k_wiki = 100

lda_wiki = LDA.train(documents1, k = k_wiki, maxIterations=10, docConcentration = alphaValue_wiki, topicConcentration = betaValue_wiki)

# get topics
topic_indices_wiki = lda_wiki.describeTopics(maxTermsPerTopic = 10) # 10 words to display
displayTopicsWords([topic_indices_wiki], [alphaValue_wiki], [betaValue_wiki])

alpha =  1.01 & beta = 1.01
- Topic  1 : time -world -united -years -american -states -city -british -people -century -
- Topic  2 : city -time -government -years -world -states -united -century -state -american -
- Topic  3 : time -years -american -world -united -city -states -number -century -made -
- Topic  4 : time -world -years -states -united -number -century -city -called -government -
- Topic  5 : american -time -years -world -united -city -states -century -government -british -
- Topic  6 : american -time -years -city -world -united -states -century -french -made -
- Topic  7 : time -years -world -century -united -number -states -city -called -american -
- Topic  8 : american -time -united -years -world -british -states -government -city -century -
- Topic  9 : error -time -world -years -american -city -states -century -united -unexpected -
- Topic  10 : time -years -world -rajah -century -united -states -american -city -called -



### Are you convinced by the results? Give labels to the topics if possible.

Even though we used the smallest values we tried for $\alpha$ and $\beta$, the results are quite bad: the topics are very alike (their top words are nearly the same), so giving them distinct labels would be difficult.
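
One way to put a number on "very alike" is the Jaccard overlap between the top-word sets of two topics. This is a rough sketch and not part of the original analysis: the helper names are made up, and only the first three topics printed above are used as input.

```python
from itertools import combinations


def jaccard(words_a, words_b):
    """Jaccard similarity between two lists of top words."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)


def mean_pairwise_overlap(topics):
    """Average Jaccard similarity over all pairs of topics."""
    pairs = list(combinations(topics, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# top words of topics 1 to 3, copied from the output above
wiki_topics = [
    "time world united years american states city british people century".split(),
    "city time government years world states united century state american".split(),
    "time years american world united city states number century made".split(),
]

print(round(jaccard(wiki_topics[0], wiki_topics[1]), 3))  # prints 0.667
print(round(mean_pairwise_overlap(wiki_topics), 3))       # prints 0.667
```

Each pair of these topics shares 8 of its 10 top words, which supports the observation that they cannot be given distinct labels. A score close to 0 would instead indicate well-separated topics.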