# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** R

**Names:**

* Raphael Strebel
* Raphaël Barman
* Thierry Bossy

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [2]:
import numpy as np
from utils import load_pkl
from collections import defaultdict
import json
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors

id2name = np.load('id2name.npy').item()
name2id = np.load('name2id.npy').item()
idx2Term = np.load('idx2Term.npy').item()
term2Idx = np.load('term2Idx.npy').item()
idx2Course = np.load('idx2Course.npy').item()
course2Idx = np.load('course2Idx.npy').item()

numTerms = len(term2Idx.keys())
numCourses = len(course2Idx.keys())

bagOfWords = np.load('bagOfWords.npy').item()

stopwords = load_pkl('data/stopwords.pkl')

In [3]:
# We construct a matrix of sparse vectors with the column being the index of the course
# and the row being the term, the value is the count of the term in the course.
courses = sc.parallelize(course2Idx.keys())
def course_vector(course):
    id = course2Idx[course]
    counts = {}
    for term in bagOfWords[course]:
        counts[term2Idx[term]] = bagOfWords[course][term]
    counts = sorted(counts.items())
    keys = [x[0] for x in counts]
    values = [x[1] for x in counts]
    return (id, Vectors.sparse(numTerms, keys, values))
courses = courses.map(course_vector).map(list)

## Exercise 4.8: Topics extraction

In [11]:
model = LDA.train(courses,k=10,seed=1)
for idx, topics in enumerate(model.describeTopics(10)):
    print('Topic #%d'%(idx+1))
    for termIdx, term in enumerate(topics[0]):
        print('   - %s\t%.3f'%(idx2Term[term],topics[1][termIdx]))

Topic #1
   - algorithm	0.017
   - programming	0.015
   - problem	0.012
   - structure	0.011
   - computer	0.010
   - software	0.010
   - material	0.009
   - quantum	0.009
   - tool	0.009
   - communication	0.009
Topic #2
   - optical	0.015
   - optic	0.014
   - image	0.013
   - material	0.013
   - microscopy	0.012
   - imaging	0.011
   - electron	0.010
   - processing	0.009
   - principle	0.007
   - technique	0.006
Topic #3
   - chemical	0.017
   - biology	0.015
   - molecular	0.014
   - protein	0.012
   - engineering	0.011
   - reaction	0.010
   - interaction	0.008
   - process	0.008
   - cell	0.008
   - biological	0.007
Topic #4
   - modeling	0.013
   - information	0.010
   - presentation	0.009
   - innovation	0.008
   - strategy	0.008
   - work	0.008
   - business	0.008
   - tool	0.007
   - plan	0.007
   - class	0.007
Topic #5
   - report	0.029
   - skill	0.018
   - scientific	0.017
   - plan	0.016
   - research	0.015
   - written	0.013
   - paper	0.013
   - ass	0.011
   - risk	0.0

1. Algorithmics
2. Optics
3. Bio-chemistry
4. Statistical finances
5. Research
6. Probability and statistics
7. Electrical engineering
8. Materials Science and Engineering
9. Environmental science
10. Projects

Reminder of LSI:
1. Research
2. Laboratory
3. Finances
4. Architecture
5. Bio economy 
6. Microscopy
7. Life science
8. Micro technology
9. Biomechanics
10. Cultural Heritage

It looks similar for some, but most of the time, the terms in the topics are more precise and related to their subject.

## Exercise 4.9: Dirichlet hyperparameters

In [18]:
for alpha in [1.01,5.0,50.0]:
    print('For alpha = %.2f'%alpha)
    model2 = LDA.train(courses,k=10,docConcentration=alpha,topicConcentration=1.01,seed=1)
    for idx, topics in enumerate(model2.describeTopics(5)):
        print('   Topic #%d'%(idx+1))
        for termIdx, term in enumerate(topics[0]):
            print('      - %s\t%.3f'%(idx2Term[term],topics[1][termIdx]))

For alpha = 1.01
   Topic #1
      - algorithm	0.016
      - programming	0.015
      - problem	0.013
      - theory	0.011
      - optimization	0.010
   Topic #2
      - material	0.015
      - optical	0.013
      - optic	0.013
      - image	0.012
      - microscopy	0.011
   Topic #3
      - chemical	0.015
      - biology	0.013
      - molecular	0.013
      - protein	0.012
      - cell	0.012
   Topic #4
      - research	0.009
      - presentation	0.009
      - innovation	0.008
      - work	0.008
      - plan	0.008
   Topic #5
      - report	0.024
      - skill	0.015
      - risk	0.014
      - plan	0.014
      - scientific	0.013
   Topic #6
      - linear	0.016
      - theory	0.015
      - problem	0.011
      - control	0.010
      - probability	0.010
   Topic #7
      - energy	0.022
      - circuit	0.017
      - sensor	0.012
      - power	0.011
      - device	0.010
   Topic #8
      - flow	0.013
      - cell	0.013
      - heat	0.011
      - mass	0.010
      - equation	0.010
   Topic #9
  

With a big value of alpha, we should have a uniform distribution of topics over documents. It means that all documents are very similar.

It becomes hard to extract relevant topics of such a set as each topic will have the same distribution of words, because we try to analyse the topics over similars documents. The words per topics will simply be the most popular words over the whole documents.

In [16]:
for beta in [1.01,2.5,10.0]:
    print('For beta = %.2f'%beta)
    model2 = LDA.train(courses,k=10,docConcentration=6.0,topicConcentration=beta,seed=1)
    for idx, topics in enumerate(model2.describeTopics(5)):
        print('   Topic #%d'%(idx+1))
        for termIdx, term in enumerate(topics[0]):
            print('      - %s\t%.3f'%(idx2Term[term],topics[1][termIdx]))

For beta = 1.01
   Topic #1
      - algorithm	0.018
      - programming	0.014
      - problem	0.012
      - structure	0.012
      - communication	0.011
   Topic #2
      - optical	0.014
      - optic	0.014
      - image	0.014
      - material	0.013
      - microscopy	0.012
   Topic #3
      - chemical	0.017
      - biology	0.015
      - molecular	0.015
      - protein	0.012
      - reaction	0.011
   Topic #4
      - modeling	0.013
      - research	0.010
      - information	0.008
      - presentation	0.008
      - tool	0.008
   Topic #5
      - report	0.029
      - skill	0.019
      - plan	0.017
      - scientific	0.017
      - research	0.014
   Topic #6
      - theory	0.019
      - linear	0.019
      - problem	0.013
      - probability	0.011
      - exam	0.011
   Topic #7
      - energy	0.021
      - circuit	0.017
      - sensor	0.013
      - power	0.012
      - technology	0.009
   Topic #8
      - flow	0.015
      - application	0.014
      - cell	0.013
      - mass	0.011
      - heat	

With a big value of beta, we can see that, as expected, the term-distribution of each topic is very likely to be uniform, so the topics are all similar.

With a smaller value, the distribution is more sparse and random, but it quickly increase to a uniform distribution.

## Exercise 4.10: EPFL's taught subjects

In [28]:
model = LDA.train(courses,k=20,docConcentration=1.0001,topicConcentration=3.0,seed=1)
for idx, topics in enumerate(model.describeTopics(5)):
    print('Topic #%d'%(idx+1))
    for termIdx, term in enumerate(topics[0]):
        print('   - %s\t%.3f'%(idx2Term[term],topics[1][termIdx]))

Topic #1
   - molecular	0.024
   - protein	0.017
   - biology	0.017
   - paper	0.016
   - reaction	0.016
Topic #2
   - programming	0.017
   - digital	0.015
   - language	0.012
   - modeling	0.010
   - signal	0.010
Topic #3
   - control	0.039
   - stability	0.012
   - circuit	0.009
   - work	0.009
   - session	0.008
Topic #4
   - brain	0.011
   - scientific	0.009
   - field	0.009
   - neuroscience	0.008
   - architecture	0.008
Topic #5
   - cell	0.015
   - structure	0.012
   - week	0.012
   - electrochemical	0.011
   - problem	0.010
Topic #6
   - skill	0.013
   - plan	0.012
   - presentation	0.012
   - class	0.011
   - evaluate	0.011
Topic #7
   - technology	0.021
   - policy	0.019
   - communication	0.010
   - development	0.009
   - engineering	0.008
Topic #8
   - optic	0.024
   - image	0.022
   - imaging	0.022
   - optical	0.015
   - microscopy	0.012
Topic #9
   - risk	0.020
   - report	0.015
   - skill	0.012
   - plan	0.011
   - information	0.010
Topic #10
   - device	0.025
   - mate

## 2.
- $k$ = 20, since there is a lot of subjects being thaught at EPFL
- $\alpha = 1.0001$, smallest value possible, since a course rarely teaches more than on topic
- $\beta = 3.0$, we wanted the word per document to be quite uniform, but not too much in since a lot of terms repeat from one course to another since they are often linked

## 3.
1. Biology
2. IC
3. Micro-engineering
4. Neurosciences
5. Cells
6. Professional skills
7. Role of engineer
8. Imagery
9. Statistics
10. Hardware
11. Fiber networks
12. Projects
13. Mathematics
14. Physics
15. Sound
16. Electricity
17. Chemistry
18. Medical engineering
19. Ecology
20. Measurement

## Exercise 4.11: Wikipedia structure

In [3]:
data = sc.textFile("/ix/wikipedia-for-schools.txt").map(json.loads)
data.count()

We assume that the data has already been mostly preprocessed, we only remove all words which have length smaller than 2, remove numbers and stopwords.

In [29]:
# We filter the words, make a distinct list with all words and indices.
vocabulary = data \
    .map(lambda page: [word.lower() for word in page['tokens'] if (word.isalpha() and len(word) > 2 and word.lower() not in stopwords)]) \
    .flatMap(lambda x: x) \
    .distinct() \
    .zipWithIndex() \
    .collectAsMap()
    
id2voc = {v: k for k,v in vocabulary.items()}

# We construct the vectors by giving them as indices the page id
# Then we make the count of each term in the page
def page_vector(page):
    id = page['page_id']-1
    counts = defaultdict(int)
    for token in page['tokens']:
        if token in vocabulary:
            counts[vocabulary[token]] += 1
    counts = sorted(counts.items())
    keys = [x[0] for x in counts]
    values = [x[1] for x in counts]
    return (id, Vectors.sparse(len(vocabulary), keys, values))
pages = data.map(page_vector).map(list)

For the value of $\alpha$, what makes most sense is to have a small value, since there should rarely be more than one or two topics per wikipedia page, thus we chose $\alpha = 0.1$

For the value of $\beta$, we felt that the word term distribution per topic should be quite uniform, since wikipedia covers of wide range of subjet on the topics, we chose $\beta = 2.0$

For $k$, we wanted the value to be quite big, however, the cluster was not able to run LDA with a value bigger than 10. It would have made more sense to choose a $k = 50$ or even more (since we have more than 5000 pages)

In [8]:
modelWiki = LDA.train(pages,k=10,docConcentration=0.1,topicConcentration=2.0,seed=1,optimizer='online')

for idx, topics in enumerate(modelWiki.describeTopics(5)):
    if(idx > 10):
        break
    print('Topic #%d'%(idx+1))
    for termIdx, term in enumerate(topics[0]):
        print('  - %s\t%f'%(id2voc[term],topics[1][termIdx]))

Topic #1
  - water	0.000128
  - acid	0.000110
  - coupler	0.000107
  - love	0.000082
  - system	0.000080
Topic #2
  - tolkien	0.000067
  - bush	0.000059
  - war	0.000057
  - states	0.000051
  - united	0.000051
Topic #3
  - ilex	0.000173
  - bass	0.000145
  - galaxy	0.000113
  - cormorant	0.000112
  - phalacrocorax	0.000111
Topic #4
  - time	0.002039
  - years	0.001808
  - world	0.001788
  - american	0.001661
  - war	0.001625
Topic #5
  - hippos	0.000108
  - hippopotamus	0.000085
  - water	0.000054
  - nematodes	0.000047
  - chinese	0.000041
Topic #6
  - pluto	0.000156
  - set	0.000126
  - theory	0.000110
  - time	0.000085
  - string	0.000079
Topic #7
  - american	0.000064
  - states	0.000061
  - united	0.000058
  - hamilton	0.000056
  - war	0.000047
Topic #8
  - tubman	0.000143
  - theatre	0.000109
  - city	0.000100
  - house	0.000081
  - hänsel	0.000081
Topic #9
  - shinto	0.000190
  - god	0.000140
  - sheep	0.000096
  - theory	0.000087
  - kami	0.000087
Topic #10
  - acid	0.000082
  

We are not really conviced by the results, the number of topic is certainly way too small...

Attempt of giving labels:
1. ?
2. ?
3. ?
4. World war
5. Hippopotamus
6. Planet theory ?
7. American Revolutionary War
8. ?
9. Religion
10. ?