# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** K

**Names:**

* Kim Lan Phan Hoang
* Robin Lang

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [111]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl
from collections import defaultdict
from pyspark.mllib.linalg import Vector, Vectors
from pyspark.mllib.clustering import LDA, LDAModel
import re

courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')
commonWords = ["student","students","learning","course","courses"]

In [112]:
# get RDD of courses
data = sc.parallelize(courses)

In [113]:
# Transform documents into bags of words:
#   - take only the description into account
#   - split on whitespace and punctuation
#   - keep only purely alphabetic tokens
#   - lowercase the tokens
#   - remove stopwords
#   - remove overly common, non-informative words (commonWords)
wordsInDocument = data\
.map(lambda x: x["description"])\
.map(lambda x: re.split(r'[\s.?!,;:\-()\[\]{}"/]', x))\
.map(lambda w: [x for x in w if x.isalpha()])\
.map(lambda w: [x.lower() for x in w])\
.map(lambda w: [x for x in w if (x not in stopwords) and (x not in commonWords)])


# Count the occurrences of each word over all documents:
#   - emit a (word, 1) pair per occurrence
#   - reduceByKey to sum the counts per word
#   - swap to (count, word) and sort by descending count
mostReccurentWords = wordsInDocument\
.flatMap(lambda x: x)\
.map(lambda word: (word, 1))\
.reduceByKey(lambda x,y: x + y)\
.map(lambda x: (x[1], x[0]))\
.sortByKey(False)

    
# get all possible words and map them to an id
wordsList = mostReccurentWords\
.map(lambda x: x[1])\
.zipWithIndex()\
.collectAsMap()

distinctNbWords = len(wordsList)

    
# Take a (tokens, document id) pair and return (document id, sparse vector of word counts)
def documentToVector(d):
    
    wordOccurrencies = defaultdict(int) # word id -> count
    for w in d[0]: 
        wordOccurrencies[wordsList[w]] += 1 # increment the count stored at this word's id
            
    wordOccurrencies = sorted(wordOccurrencies.items()) # sort by word id, since Vectors.sparse expects increasing indices
    wordOcc_0 = [x[0] for x in wordOccurrencies]
    wordOcc_1 = [x[1] for x in wordOccurrencies]
    
    return (d[1], Vectors.sparse(distinctNbWords, wordOcc_0, wordOcc_1))
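
The sparse encoding built by `documentToVector` can be checked on a toy example without Spark. The snippet below is only a minimal sketch: `to_sparse_counts` and the three-word `vocab` are hypothetical names introduced for illustration, and the function returns the two parallel lists (word ids and counts) that the cell above passes to `Vectors.sparse`.

```python
from collections import Counter

def to_sparse_counts(tokens, vocab):
    """Return (sorted word ids, matching counts) for one tokenised document."""
    counts = Counter(vocab[t] for t in tokens)   # word id -> number of occurrences
    indices = sorted(counts)                     # word ids in increasing order
    values = [counts[i] for i in indices]        # counts aligned with the ids
    return indices, values

# Toy vocabulary (hypothetical): word -> id
vocab = {"data": 0, "methods": 1, "analysis": 2}

indices, values = to_sparse_counts(["methods", "data", "methods"], vocab)
print(indices, values)  # [0, 1] [1, 2]
```

Words that do not occur in the document ("analysis" here) are simply absent from both lists, which is what keeps the representation compact for a large vocabulary.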


## Exercise 4.8: Topics extraction

### Using your pre-processed courses dataset, extract topics using LDA. Print k = 10 topics extracted using LDA and give them labels.

In [119]:
def displayTopicsWords(topic_indices):
    for topic in topic_indices:
        for w in topic[0]:
            print(wordsList_helper[w])
        print("\n")

In [114]:
# list used to map a word id back to its word (inverse of wordsList)
wordsList_helper = mostReccurentWords.map(lambda x: x[1]).collect()

# build the input format expected by LDA.train:
#   - zipWithIndex to give each document an id
#   - documentToVector to get [id, sparse vector of word counts]
documents = wordsInDocument\
.zipWithIndex()\
.map(documentToVector)\
.map(list)

In [120]:
# create lda model
lda = LDA.train(documents, k = 10)

# get topics
topic_indices = lda.describeTopics(maxTermsPerTopic = 10) # 10 words to display
displayTopicsWords(topic_indices)

methods
chemical
chemistry
processes
properties
equations
concepts
transfer
heat
basic


project
methods
end
evaluate
content
work
outcomes
report
skills
systems


energy
cell
biology
methods
development
teaching
cells
information
content
presentation


project
report
scientific
research
based
skills
data
plan
methods
analysis


methods
linear
analysis
data
control
theory
algorithms
models
problems
optimization


models
methods
design
data
assessment
work
basic
time
tools
lectures


optical
optics
content
methods
applications
electron
microscopy
imaging
note
introduction


materials
design
methods
quantum
content
mechanical
assessment
class
keywords
material


systems
design
circuits
modeling
methods
exercises
signal
power
system
content


methods
energy
risk
content
space
applications
outcomes
flow
introduction
end




### How does it compare with LSI?

## Exercise 4.9: Dirichlet hyperparameters

### Analyse the effects of α and β.

The Spark MLlib documentation (https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda) gives the following definitions:

docConcentration (alpha): Dirichlet parameter for prior over documents’ distributions over topics. Larger values encourage smoother inferred distributions.

topicConcentration (beta): Dirichlet parameter for prior over topics’ distributions over terms (words). Larger values encourage smoother inferred distributions.
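
To make "smoother" concrete, the following sketch draws samples from a symmetric Dirichlet distribution with NumPy and reports their average entropy. It is an illustration of the prior only, not of Spark's inference: `mean_entropy` is a hypothetical helper, and the concentration values are arbitrary and not the ones used in the exercises. A higher average entropy means the sampled distributions are closer to uniform, i.e. smoother.

```python
import numpy as np

def mean_entropy(concentration, k=10, n_samples=1000, seed=0):
    """Average entropy of samples from a symmetric Dirichlet over k outcomes."""
    rng = np.random.default_rng(seed)
    samples = rng.dirichlet([concentration] * k, size=n_samples)
    samples = np.clip(samples, 1e-12, 1.0)   # avoid log(0) for near-empty components
    return float(-(samples * np.log(samples)).sum(axis=1).mean())

# Illustrative concentration values; the upper bound on the entropy is log(k).
for c in (0.1, 1.0, 10.0):
    print(c, round(mean_entropy(c), 2))
```

With a small concentration most of the mass of each sample falls on a few components (a document dominated by a few topics, or a topic dominated by a few words), while a large concentration spreads the mass almost evenly.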

### Fix k = 10 and β = 1.01, and vary α. How does it impact the topics?

In [None]:
# TODO (Robin): make a graph
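
One possible way to organise this comparison is sketched below, under stated assumptions. The helper `sweep_doc_concentration` is a hypothetical name; it takes the training function as an argument so that the loop itself does not depend on Spark. In this notebook the training function would be `LDA.train` and the data would be the `documents` RDD built earlier. The keyword names `docConcentration`, `topicConcentration` and `seed` are those of `pyspark.mllib.clustering.LDA.train`; the α values in the usage comment are placeholders.

```python
def sweep_doc_concentration(train_fn, documents, alphas, k=10, beta=1.01, seed=0):
    """Train one model per alpha, holding k, beta and the seed fixed.

    train_fn is assumed to accept the keyword arguments of
    pyspark.mllib.clustering.LDA.train.
    """
    models = {}
    for alpha in alphas:
        models[alpha] = train_fn(
            documents,
            k=k,
            docConcentration=alpha,
            topicConcentration=beta,
            seed=seed,
        )
    return models

# Hypothetical usage in this notebook (alpha values are illustrative only):
# models = sweep_doc_concentration(LDA.train, documents, [1.1, 6.0, 50.0])
# for alpha, model in models.items():
#     print("alpha =", alpha)
#     displayTopicsWords(model.describeTopics(maxTermsPerTopic=10))
```

Fixing the seed keeps the runs comparable, so that differences between the printed topics are more likely to come from α than from random initialisation. The same loop could be adapted to vary β with α fixed.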

### Fix k = 10 and α = 6, and vary β. How does it impact the topics?

## Exercise 4.10: EPFL's taught subjects

### List the subjects of EPFL’s classes.

### Find the combination of k, α and β that gives most interpretable topics.

### Explain why you chose these values.

### Report the values of these hyperparameters that you used and your labels for the topics.

## Exercise 4.11: Wikipedia structure

### Extract the structure in terms of topics from the wikipedia-for-school dataset. Use your intuition about how many topics might be covered by the articles and how they are distributed.

### Report the values for k, α and β that you chose a priori and why you picked them.

### Are you convinced by the results? Give labels to the topics if possible.