# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** J

**Names:**

* Rafael Bischof
* Jeniffer Lima Graf
* Alexander Sanchez

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please comment your code properly. Code readability will be considered for grading. To avoid long code cells in the notebook, you can also put long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [171]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
from collections import Counter
import sklearn.cluster as cl

In [1]:
import pickle

with open('data/preprocessedcourses.pickle', 'rb') as handle:
    preprocessedcourses = pickle.load(handle)
with open('data/wordsIdx.pickle', 'rb') as handle:
    wordsIdx = pickle.load(handle)
with open('data/coursesIdx.pickle', 'rb') as handle:
    coursesIdx = pickle.load(handle)
with open('data/idxWords.pickle', 'rb') as handle:
    idxWords = pickle.load(handle)
with open('data/idxCourses.pickle', 'rb') as handle:
    idxCourses = pickle.load(handle)

In [173]:
def vectorize(course):
    """Map a course id to [course index, sparse bag-of-words count vector]."""
    wordcounts = Counter(preprocessedcourses[course])
    # Vectors.sparse expects strictly increasing indices, so sort by word index
    idxToCount = sorted((wordsIdx[w], count) for w, count in wordcounts.items())
    keys = [idx for idx, _ in idxToCount]
    values = [count for _, count in idxToCount]
    return [coursesIdx[course], Vectors.sparse(len(wordsIdx), keys, values)]
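
To make the input format concrete, here is a minimal sketch of the same bag-of-words step without Spark. The names `to_sparse_counts`, `toy_vocab` and `toy_tokens` are hypothetical and only serve the illustration; the index and value lists play the role of the `keys` and `values` passed to `Vectors.sparse` above.

```python
from collections import Counter


def to_sparse_counts(tokens, vocab_index):
    """Return (indices, values) of a sparse count vector, sorted by word index."""
    counts = Counter(tokens)
    # Pair every distinct word's vocabulary index with its count, then sort by index
    pairs = sorted((vocab_index[w], c) for w, c in counts.items())
    indices = [i for i, _ in pairs]
    values = [float(c) for _, c in pairs]
    return indices, values


# Hypothetical toy vocabulary and document (not taken from the course data)
toy_vocab = {"data": 0, "model": 1, "signal": 2, "theory": 3}
toy_tokens = ["model", "data", "model", "theory"]

indices, values = to_sparse_counts(toy_tokens, toy_vocab)
print(indices, values)
```

Words that do not occur in the document (here `signal`) are simply absent from both lists, which is what keeps the vectors small for a large vocabulary.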

In [174]:
ppc = sc.parallelize(preprocessedcourses.items()).map(lambda x: vectorize(x[0]))

In [175]:
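def printModel(model, num=10):
    """Print the `num` highest-weighted words of each topic, numbered from 1."""
    for i, topic in enumerate(model.describeTopics(num), start=1):
        # topic is (term indices, term weights); map the indices back to words
        print(i, [idxWords[idx] for idx in topic[0]])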

## Exercise 4.8: Topics extraction

In [176]:
model = LDA.train(ppc, k=10)
printModel(model)

1 ['energy', 'property', 'structure', 'basic', 'concept', 'state', 'principle', 'quantum', 'device', 'material']
2 ['model', 'analysis', 'linear', 'data', 'algorithm', 'theory', 'exercise', 'modeling', 'problem', 'signal']
3 ['chemistry', 'chemical', 'field', 'application', 'material', 'organic', 'engineering', 'process', 'biological', 'interaction']
4 ['design', 'material', 'control', 'work', 'electronics', 'end', 'device', 'technique', 'sensor', 'teaching']
5 ['model', 'cell', 'paper', 'process', 'lecture', 'theory', 'stochastic', 'risk', 'time', 'financial']
6 ['flow', 'exercise', 'numerical', 'image', 'basic', 'concept', 'processing', 'equation', 'lecture', 'teaching method']
7 ['teaching', 'skill', 'cell', 'lecture', 'plan', 'evaluate', 'magnetic', 'process', 'outcome', 'structure']
8 ['report', 'week', 'skill', 'laboratory', 'written', 'data', 'scientific', 'problem', 'plan', 'analysis']
9 ['research', 'technology', 'design', 'management', 'innovation', 'information', 'hour', 'de

## Exercise 4.9: Dirichlet hyperparameters

In [182]:
for a in [1.01, 5.0, 100.0]:
    print('alpha = ', a)
    modelD = LDA.train(ppc, k=10, topicConcentration=1.01, docConcentration=a)
    printModel(modelD)

alpha =  1.01
1 ['design', 'image', 'work', 'activity', 'skill', 'research', 'information', 'science', 'end', 'plan']
2 ['control', 'chemical', 'biology', 'reaction', 'design', 'activity', 'exercise', 'teaching', 'protein', 'model']
3 ['material', 'cell', 'application', 'lecture', 'design', 'concept', 'process', 'teaching', 'sensor', 'device']
4 ['network', 'lecture', 'exercise', 'communication', 'problem', 'theory', 'concept', 'basic', 'analysis', 'end']
5 ['model', 'theory', 'probability', 'analysis', 'linear', 'stochastic', 'problem', 'space', 'application', 'time']
6 ['optical', 'material', 'property', 'optic', 'electron', 'laser', 'basic', 'microscopy', 'structure', 'application']
7 ['report', 'paper', 'research', 'scientific', 'data', 'field', 'written', 'molecular', 'skill', 'biology']
8 ['data', 'processing', 'model', 'programming', 'analysis', 'algorithm', 'signal', 'basic', 'exercise', 'end']
9 ['energy', 'process', 'heat', 'flow', 'concept', 'quantum', 'basic', 'transfer', '

As $\alpha$ grows, the words describing the topics carry less and less information. With $\alpha = 100$, the Dirichlet prior pushes every document towards a near-uniform mixture of all topics, whereas $\alpha = 1.01$ lets each document concentrate on a few specific topics. When every document draws on all topics about equally, the topics can no longer specialise, so they become more and more similar to each other.
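
The effect of $\alpha$ on the document-topic mixtures can be checked directly on the prior. The sketch below is an illustration with NumPy only, not part of the Spark pipeline: it draws mixtures from a symmetric Dirichlet and reports their average entropy. The function name `mean_topic_entropy`, the number of sampled documents and the seed are our own assumptions.

```python
import numpy as np


def mean_topic_entropy(alpha, k=10, n_docs=2000, seed=0):
    """Average entropy (in nats) of topic mixtures drawn from a symmetric Dirichlet(alpha)."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(k, alpha), size=n_docs)  # shape (n_docs, k)
    theta = np.clip(theta, 1e-12, 1.0)  # guard against log(0)
    return float(-(theta * np.log(theta)).sum(axis=1).mean())


for a in [1.01, 5.0, 100.0]:
    print("alpha =", a, "mean entropy =", round(mean_topic_entropy(a), 3))
print("entropy of a uniform mixture =", round(float(np.log(10)), 3))
```

The average entropy increases with $\alpha$ and approaches $\log 10$, the entropy of a uniform mixture over 10 topics. This only describes the prior; the fitted model also depends on the data.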

In [184]:
for b in [1.01, 5.0, 100.0]:
    print('beta = ', b)
    modelD = LDA.train(ppc, k=10, topicConcentration=b, docConcentration=6.0)
    printModel(modelD)

beta =  1.01
1 ['design', 'analysis', 'lecture', 'concept', 'engineering', 'exercise', 'space', 'signal', 'fourier', 'flow']
2 ['class', 'cell', 'presentation', 'study', 'hour', 'activity', 'lecture', 'note', 'analysis', 'case']
3 ['paper', 'field', 'chemical', 'physical', 'microscopy', 'end', 'keywords', 'lecture', 'discussion', 'research']
4 ['material', 'introduction', 'structure', 'application', 'magnetic', 'property', 'lecture', 'technique', 'organic', 'device']
5 ['basic', 'design', 'reaction', 'application', 'model', 'chemistry', 'property', 'circuit', 'surface', 'simulation']
6 ['model', 'programming', 'time', 'basic', 'probability', 'algorithm', 'network', 'theory', 'data', 'exercise']
7 ['group', 'material', 'management', 'design', 'work', 'energy', 'concept', 'week', 'teaching', 'exercise']
8 ['data', 'report', 'research', 'image', 'scientific', 'skill', 'design', 'work', 'plan', 'laboratory']
9 ['energy', 'process', 'cell', 'protein', 'molecular', 'chemical', 'biology', 'bi

With large $\beta$, each topic's distribution over words is pushed towards uniform over the vocabulary, so the probability of a given word is nearly the same in every topic. All topics therefore end up containing more or less the same words.
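
A small numerical sketch illustrates why the topics blur together. It uses the textbook smoothed estimate $(n_{kw} + \beta) / (n_k + V\beta)$ of a topic's word distribution, as in collapsed Gibbs sampling for LDA; this is an assumption for illustration and not necessarily what the MLlib optimizer computes. The counts in `toy_counts` are made up, and `smoothed_topics` and `total_variation` are hypothetical helper names.

```python
import numpy as np


def smoothed_topics(counts, beta):
    """Smoothed topic-word distributions: (n_kw + beta) / (n_k + V * beta)."""
    counts = np.asarray(counts, dtype=float)
    vocab_size = counts.shape[1]
    return (counts + beta) / (counts.sum(axis=1, keepdims=True) + vocab_size * beta)


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical)."""
    return 0.5 * float(np.abs(p - q).sum())


# Made-up word counts for two clearly different topics over a 5-word vocabulary
toy_counts = [[30, 20, 5, 0, 0],
              [0, 0, 5, 25, 25]]

for b in [1.01, 5.0, 100.0]:
    phi = smoothed_topics(toy_counts, b)
    print("beta =", b, "distance between topics =", round(total_variation(phi[0], phi[1]), 3))
```

The distance between the two topics shrinks as $\beta$ grows, because the pseudo-counts added to every word eventually outweigh the observed counts.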

## Exercise 4.10: EPFL's taught subjects
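We expect the entire corpus to cover many different topics, so we increase the number of topics to $k = 40$ and keep both concentration parameters just above 1 so that the topics stay specific.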

In [177]:
model = LDA.train(ppc, k=40, docConcentration=1.0000000000001, topicConcentration=1.0000000000001)
printModel(model)

1 ['skill', 'data', 'plan', 'work', 'optimal', 'specific', 'problem', 'scientific', 'experiment', 'report']
2 ['circuit', 'design', 'device', 'integrated', 'analog', 'sensor', 'electrical', 'noise', 'transistor', 'basic']
3 ['communication', 'exercise', 'algebra', 'basic', 'design', 'numerical', 'computer', 'programming', 'linear', 'digital']
4 ['engineering', 'chemical', 'concept', 'hour', 'lecture', 'biology', 'chemistry', 'stem', 'class', 'system engineering']
5 ['design', 'process', 'film', 'technique', 'device', 'thin', 'thin film', 'lecture', 'electronics', 'application']
6 ['quantum', 'industry', 'technology', 'firm', 'introduction', 'corporate', 'finance', 'theory', 'business', 'policy']
7 ['processing', 'signal', 'control', 'image', 'material', 'signal processing', 'video', 'basic', 'skill', 'introduction']
8 ['paper', 'carlo', 'monte', 'monte carlo', 'discussion', 'written', 'model', 'presentation', 'material', 'analysis']
9 ['cell', 'skill', 'engineering', 'scientific', 'des

## Exercise 4.11: Wikipedia structure

In [189]:
import json
wikipedia = sc.textFile("/ix/wikipedia-for-schools.txt").map(json.loads)

In [187]:
wikipedia = wikipedia.map(lambda x: vectorize(x[0]))

In [190]:
wikipedia.take(1)

[{'page_id': 1,
  'title': 'Áedán mac Gabráin',
  'tokens': ['áedán',
   'mac',
   'gabráinschools',
   'wikipedia',
   'selection',
   'related',
   'subjects',
   'british',
   'history',
   'including',
   'roman',
   'britain',
   'historical',
   'figures',
   'satellite',
   'image',
   'northern',
   'britain',
   'ireland',
   'showing',
   'approximate',
   'area',
   'dál',
   'riata',
   'shaded',
   'áedán',
   'mac',
   'gabráin',
   'irish',
   'pronunciation',
   'aiðaːn',
   'mak',
   'gavraːnʲ',
   'king',
   'dál',
   'riata',
   'circa',
   'onwards',
   'kingdom',
   'dál',
   'riata',
   'situated',
   'modern',
   'argyll',
   'bute',
   'scotland',
   'parts',
   'county',
   'antrim',
   'ireland',
   'genealogies',
   'record',
   'áedán',
   'son',
   'gabrán',
   'mac',
   'domangairthe',
   'contemporary',
   'saint',
   'columba',
   'recorded',
   'life',
   'career',
   'hagiography',
   'adomnán',
   'ionas',
   'life',
   'saint',
   'columba',
   'áedá