# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** R

**Names:**

* Raphael Strebel
* Raphaël Barman
* Thierry Bossy

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long code cells in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
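import json
from collections import defaultdict

import numpy as np
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors

# The .npy files store pickled dictionaries, so recent NumPy versions need
# allow_pickle=True; .item() unwraps the 0-d object array into the dict.
id2name = np.load('id2name.npy', allow_pickle=True).item()
name2id = np.load('name2id.npy', allow_pickle=True).item()
idx2Term = np.load('idx2Term.npy', allow_pickle=True).item()
term2Idx = np.load('term2Idx.npy', allow_pickle=True).item()
idx2Course = np.load('idx2Course.npy', allow_pickle=True).item()
course2Idx = np.load('course2Idx.npy', allow_pickle=True).item()

numTerms = len(term2Idx)
numCourses = len(course2Idx)

# bagOfWords maps each course to its {term: count} dictionary.
bagOfWords = np.load('bagOfWords.npy', allow_pickle=True).item()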

In [2]:
# Term-frequency matrix: one row per term, one column per course.
tf = np.zeros((numTerms, numCourses))
for courseIdx, course in idx2Course.items():
    # Courses with an empty bag of words keep an all-zero column.
    for term, count in bagOfWords[course].items():
        tf[term2Idx[term], courseIdx] = count

In [3]:
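# `sc` is the SparkContext provided by the notebook environment.
courses = sc.parallelize(course2Idx.keys())

def course_vector(course):
    """Return (course index, sparse term-count vector) for one course."""
    courseIdx = course2Idx[course]
    counts = {}
    for term in bagOfWords[course]:
        counts[term2Idx[term]] = bagOfWords[course][term]
    # Vectors.sparse expects indices in increasing order.
    counts = sorted(counts.items())
    keys = [x[0] for x in counts]
    values = [x[1] for x in counts]
    return (courseIdx, Vectors.sparse(numTerms, keys, values))

# Each course is passed to LDA.train as a [course index, term-count vector] list.
courses = courses.map(course_vector).map(list)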
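
The sparse representation built by `course_vector` can be illustrated without Spark. The following is a minimal sketch, assuming a toy vocabulary and a toy bag of words (`sparse_pairs`, `toy_term2idx` and `toy_bag` are made up for illustration and are not part of the lab data). It produces the index and value lists, sorted by index, in the form passed to `Vectors.sparse`.

```python
def sparse_pairs(bag, term_to_idx):
    """Return (indices, values) sorted by index, ready for a sparse vector."""
    # Translate each term into its column index.
    counts = {term_to_idx[term]: count for term, count in bag.items()}
    # Sort by index so that both lists are in increasing index order.
    ordered = sorted(counts.items())
    indices = [idx for idx, _ in ordered]
    values = [val for _, val in ordered]
    return indices, values

# Hypothetical data, for illustration only.
toy_term2idx = {"algorithm": 0, "cell": 1, "energy": 2, "image": 3}
toy_bag = {"image": 4, "algorithm": 1}

indices, values = sparse_pairs(toy_bag, toy_term2idx)
print(indices, values)
```

Only the terms that occur in the document are stored, which keeps the vectors small when the vocabulary is much larger than a single course description.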

## Exercise 4.8: Topics extraction

In [25]:
model = LDA.train(courses,k=10,seed=1)

In [31]:
for idx, topics in enumerate(model.describeTopics(5)):
    print('Topic #%d'%(idx+1))
    for termIdx, term in enumerate(topics[0]):
        print('   - %s\t%.3f'%(idx2Term[term],topics[1][termIdx]))

Topic #1
   - theory	0.015
   - processing	0.014
   - image	0.014
   - technique	0.013
   - basic	0.012
Topic #2
   - exercise	0.009
   - hour	0.008
   - concept	0.008
   - risk	0.007
   - control	0.007
Topic #3
   - project	0.029
   - report	0.026
   - data	0.019
   - research	0.016
   - scientific	0.016
Topic #4
   - flow	0.013
   - microscopy	0.010
   - cell	0.009
   - process	0.009
   - electron	0.009
Topic #5
   - problem	0.015
   - linear	0.015
   - basic	0.011
   - algorithm	0.011
   - process	0.011
Topic #6
   - material	0.035
   - presentation	0.011
   - technology	0.011
   - property	0.011
   - structure	0.010
Topic #7
   - chemical	0.014
   - chemistry	0.014
   - molecular	0.013
   - cell	0.013
   - reaction	0.012
Topic #8
   - energy	0.024
   - project	0.014
   - work	0.009
   - architecture	0.009
   - concept	0.008
Topic #9
   - circuit	0.012
   - optical	0.012
   - basic	0.011
   - device	0.011
   - application	0.010
Topic #10
   - lecture	0.012
   - class	0.011
   - data

## Exercise 4.9: Dirichlet hyperparameters
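
To build intuition for what a Dirichlet concentration parameter does, the sketch below samples from symmetric Dirichlet distributions with NumPy (not with MLlib) and reports the average weight of the largest component. It is an illustration under assumptions: `mean_max_weight`, the number of samples and the seed are hypothetical choices, and only the concentration values are among those tried in this exercise.

```python
import numpy as np

def mean_max_weight(concentration, k=10, n_samples=2000, seed=0):
    """Average weight of the largest component over symmetric Dirichlet samples."""
    rng = np.random.default_rng(seed)
    # Each row is one sample: a probability vector over k components.
    samples = rng.dirichlet(np.full(k, concentration), size=n_samples)
    return float(samples.max(axis=1).mean())

for concentration in [1.01, 10.0, 100.0]:
    print('concentration = %.2f -> mean max weight = %.3f'
          % (concentration, mean_max_weight(concentration)))
```

A larger average maximum means that the probability mass concentrates on fewer components. As the concentration grows, the samples move towards the uniform distribution, where every component has weight 1/k.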

In [47]:
for alpha in [1.01,1.1,2.0,5.0,10.0,100.0]:
    print('For alpha = %.2f'%alpha)
    model2 = LDA.train(courses,k=10,docConcentration=alpha,topicConcentration=1.01,seed=1)
    for idx, topics in enumerate(model2.describeTopics(5)):
        print('   Topic #%d'%(idx+1))
        for termIdx, term in enumerate(topics[0]):
            print('      - %s\t%.3f'%(idx2Term[term],topics[1][termIdx]))

For alpha = 1.01
   Topic #1
      - image	0.017
      - technique	0.010
      - processing	0.010
      - basic	0.010
      - theory	0.009
   Topic #2
      - risk	0.014
      - market	0.009
      - concept	0.008
      - business	0.008
      - evaluate	0.007
   Topic #3
      - project	0.025
      - report	0.020
      - data	0.017
      - scientific	0.014
      - skill	0.014
   Topic #4
      - cell	0.014
      - process	0.014
      - flow	0.013
      - equation	0.011
      - energy	0.010
   Topic #5
      - algorithm	0.013
      - linear	0.010
      - basic	0.010
      - control	0.010
      - signal	0.010
   Topic #6
      - material	0.023
      - presentation	0.011
      - technology	0.009
      - group	0.009
      - policy	0.007
   Topic #7
      - chemical	0.014
      - molecular	0.012
      - reaction	0.012
      - chemistry	0.011
      - property	0.011
   Topic #8
      - project	0.015
      - energy	0.015
      - concept	0.009
      - work	0.009
      - information	0.007
   Topi

In [48]:
for beta in [1.01,1.1,2.0,5.0,10.0,100.0]:
    print('For beta = %.2f'%beta)
    model2 = LDA.train(courses,k=10,docConcentration=6.0,topicConcentration=beta,seed=1)
    for idx, topics in enumerate(model2.describeTopics(5)):
        print('   Topic #%d'%(idx+1))
        for termIdx, term in enumerate(topics[0]):
            print('      - %s\t%.3f'%(idx2Term[term],topics[1][termIdx]))

For beta = 1.01
   Topic #1
      - image	0.014
      - technique	0.013
      - architecture	0.012
      - processing	0.011
      - theory	0.010
   Topic #2
      - risk	0.013
      - concept	0.008
      - market	0.008
      - exam	0.007
      - presentation	0.007
   Topic #3
      - project	0.025
      - report	0.025
      - data	0.018
      - scientific	0.016
      - research	0.015
   Topic #4
      - cell	0.016
      - process	0.014
      - flow	0.013
      - transfer	0.011
      - equation	0.010
   Topic #5
      - linear	0.015
      - algorithm	0.014
      - problem	0.013
      - signal	0.012
      - basic	0.012
   Topic #6
      - material	0.034
      - presentation	0.011
      - technology	0.011
      - property	0.010
      - structure	0.010
   Topic #7
      - chemical	0.013
      - dynamic	0.011
      - structure	0.010
      - molecular	0.010
      - interaction	0.009
   Topic #8
      - project	0.019
      - energy	0.017
      - concept	0.011
      - work	0.010
      - end	0.

## Exercise 4.10: EPFL's taught subjects

## Exercise 4.11: Wikipedia structure

In [4]:
data = sc.textFile("/ix/wikipedia-for-schools.txt").map(json.loads)

In [5]:
data.count()

5554

We assume that the pages have already been preprocessed: each record provides a `page_id` and a list of `tokens`.

In [6]:
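# Map every distinct token to a unique index.
# distinct() keeps the indices contiguous and below len(vocabulary).
vocabulary = data \
    .flatMap(lambda page: page['tokens']) \
    .distinct() \
    .zipWithIndex() \
    .collectAsMap()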

In [7]:
id2voc = {v: k for k,v in vocabulary.items()}
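
The token-to-index mapping needs every index to be smaller than the vocabulary size, since that size is later used as the dimension of the sparse vectors. A minimal plain-Python sketch of such a mapping, on hypothetical pages (`build_vocabulary` and `toy_pages` are illustrative names, not part of the lab code):

```python
def build_vocabulary(pages):
    """Map each distinct token to a contiguous index, in first-seen order."""
    vocab = {}
    for page in pages:
        for token in page["tokens"]:
            if token not in vocab:
                # The next free index equals the current vocabulary size.
                vocab[token] = len(vocab)
    return vocab

# Hypothetical pages, for illustration only.
toy_pages = [
    {"page_id": 1, "tokens": ["city", "river", "city"]},
    {"page_id": 2, "tokens": ["species", "river"]},
]

toy_vocab = build_vocabulary(toy_pages)
# Inverse mapping, used to turn term indices back into words.
toy_id2voc = {idx: token for token, idx in toy_vocab.items()}
print(toy_vocab)
```

Repeated tokens such as `city` and `river` receive a single index, so the largest index is always one less than the number of distinct tokens.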

In [8]:
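def page_vector(page):
    """Return (zero-based page index, sparse token-count vector) for one page."""
    pageIdx = page['page_id'] - 1
    counts = defaultdict(int)
    for token in page['tokens']:
        counts[vocabulary[token]] += 1
    # Vectors.sparse expects indices in increasing order.
    counts = sorted(counts.items())
    keys = [x[0] for x in counts]
    values = [x[1] for x in counts]
    return (pageIdx, Vectors.sparse(len(vocabulary), keys, values))

# Each page is passed to LDA.train as a [page index, token-count vector] list.
pages = data.map(page_vector).map(list)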

In [None]:
modelWiki = LDA.train(pages,k=5000,docConcentration=1.01,topicConcentration=5.0,seed=1)

In [17]:
for idx, topics in enumerate(modelWiki.describeTopics(1)):
    if(idx > 10):
        break
    print('Topic #%d'%(idx+1))
    for termIdx, term in enumerate(topics[0]):
        print('  - %s\t%.3f'%(id2voc[term],topics[1][termIdx]))

Topic #1
  - species	0.005
Topic #2
  - time	0.003
Topic #3
  - city	0.007
Topic #4
  - number	0.005
Topic #5
  - american	0.008
