# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** R

**Names:**

* Raphael Strebel
* Raphaël Barman
* Thierry Bossy

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [144]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds

id2name = np.load('id2name.npy').item()
name2id = np.load('name2id.npy').item()
idx2Term = np.load('idx2Term.npy').item()
term2Idx = np.load('term2Idx.npy').item()
idx2Course = np.load('idx2Course.npy').item()
course2Idx = np.load('course2Idx.npy').item()

X = np.load('X.npy').item()

## Exercise 4.4: Latent semantic indexing

In [145]:
k = 300
U, S, Vt = svds(X,k)

- U relates the terms to the topic
- S gives the importance of the topic
- Vt relates the course to the topic

In [146]:
list(S[-20:][::-1])

[39.618057844047669,
 24.758720723538737,
 23.448809699379229,
 21.962264474910715,
 21.706140430739843,
 21.466850777239362,
 20.848467973352697,
 20.501338803726238,
 20.306806488528316,
 19.413737146608653,
 19.222148129678974,
 18.848295806639435,
 18.657737275327246,
 18.433197650275194,
 18.192954137546717,
 17.973093088904104,
 17.884078387317338,
 17.534507263359178,
 17.287439622275365,
 17.278103009242049]

## Exercise 4.5: Topic extraction

In [147]:
for i in reversed(range(k-10,k)):
    print('Topic %d top 10 terms:'%(k-i))
    topTerms = np.argsort(-np.abs(U.T[i]))[:10]
    for term in topTerms:
        print('  - %s\t%.3f'%(idx2Term[term],U.T[i][term]))
    print('Topic %d top 10 courses:'%(k-i))
    topCourses = np.argsort(-np.abs(Vt[i]))[:10]
    for course in topCourses:
        print('  - %s\t%.3f'%(id2name[idx2Course[course]],Vt[i][course]))

Topic 1 top 10 terms:
  - data	0.096
  - problem	0.090
  - research	0.088
  - material	0.086
  - plan	0.082
  - presentation	0.081
  - process	0.081
  - report	0.081
  - cell	0.076
  - work	0.075
Topic 1 top 10 courses:
  - Bioprocesses and downstream processing	0.131
  - Renewable energy and solar architecture in Davos	0.119
  - Lab immersion III	0.117
  - Nanobiotechnology and biophysics	0.110
  - General physics (English) II	0.097
  - Lab immersion I	0.092
  - History of globalization I	0.083
  - Project in bioengineering and biosciences	0.083
  - Biochemical engineering	0.081
  - Cognitive psychology B	0.072
Topic 2 top 10 terms:
  - electron	0.151
  - microscopy	0.120
  - optical	0.118
  - data	-0.108
  - semiconductor	0.103
  - research	-0.100
  - experimentation	-0.099
  - plan	-0.099
  - wetlab	-0.097
  - spectroscopy	0.094
Topic 2 top 10 courses:
  - Lab immersion III	-0.249
  - Nanofabrication with focused electron and ion beams	0.235
  - Lab immersion I	-0.192
  - Renewable 

In [134]:
topicCourses = np.diag(S) @ Vt
topicTerms = np.diag(S) @ U.T
topTopics = (np.sum(topicCourses,axis=1)+np.sum(topicTerms, axis=1)).argsort()[-10:][::-1]
for idx, topic in enumerate(topTopics):
    print("Topic %d top 5 terms:"%(idx+1))
    for term in map(lambda x: idx2Term[x], topicTerms[topic].argsort()[-5:][::-1]):
        print('  - %s'%term)
    print("Topic %d top 5 courses:"%(idx+1))
    for course in map(lambda x: id2name[idx2Course[x]], topicCourses[topic].argsort()[-5:][::-1]):
        print('  - %s'%course)

Topic 1 top 5 terms:
  - problem
  - report
  - research
  - presentation
  - plan
Topic 1 top 5 courses:
  - Lab immersion III
  - Lab immersion I
  - Bioprocesses and downstream processing
  - Lab immersion II
  - Renewable energy and solar architecture in Davos
Topic 2 top 5 terms:
  - electron
  - equation
  - optical
  - property
  - microscopy
Topic 2 top 5 courses:
  - Nanofabrication with focused electron and ion beams
  - Bioprocesses and downstream processing
  - Biochemical engineering
  - Nanobiotechnology and biophysics
  - Solid state physics II
Topic 3 top 5 terms:
  - architecture
  - circuit
  - studio
  - architectural
  - digital
Topic 3 top 5 courses:
  - Théorie et critique du projet MA2 (Gugger)
  - Théorie et critique du projet MA1 (Gugger)
  - Théorie et critique du projet MA2 (Huang)
  - Nanocomputing: Devices, Circuits and Architectures
  - Théorie et critique du projet MA1 (Huang)
Topic 4 top 5 terms:
  - concise
  - stress
  - micro-engineering
  - encounter

## Exercise 4.6: Document similarity search in concept-space

In [135]:
def sim(t,d):
    Sig = np.diag(S)
    SVtd = np.dot(Sig,Vt[:,d])
    SVd = np.dot(Sig,Vt.T[d])
    return np.dot(U[t],SVtd)/(np.sqrt(np.dot(U[t],U[t]))*np.sqrt(np.dot(SVd,SVd)))
def sim2Terms(t1,t2,d):
    Ut = U[t1]+U[t2]
    Sig = np.diag(S)
    SVtd = np.dot(Sig,Vt[:,d])
    SVd = np.dot(Sig,Vt.T[d])
    return np.dot(Ut,SVtd)/(np.sqrt(np.dot(Ut,Ut))*np.sqrt(np.dot(SVd,SVd)))

In [136]:
top = np.argsort(-np.array(list(map(lambda d: sim(term2Idx['facebook'],d), idx2Course.keys()))))[:5]
lsi = list(map(lambda x: (id2name[idx2Course[x]],sim(term2Idx['facebook'],x)), top))
print('Top courses for vsm:')
print("""   - Computational Social Media 	 3.908""")
print('Top courses for lsi:')
for course in lsi:
    print('   - %s\t%.3f'%(course))

Top courses for vsm:
   - Computational Social Media 	 3.908
Top courses for lsi:
   - Computational Social Media	0.918
   - Social media	0.678
   - How people learn I	0.437
   - How people learn II	0.353
   - Media security	0.298


In [137]:
top = np.argsort(-np.array(list(map(lambda d: sim2Terms(term2Idx['markov'],term2Idx['chain'],d), idx2Course.keys()))))[:5]
lsi = list(map(lambda x: (id2name[idx2Course[x]],sim2Terms(term2Idx['markov'],term2Idx['chain'],x)), top))
print('Top courses for vsm:')
print("""   - Applied probability & stochastic processes 	6.342
   - Markov chains and algorithmic applications 	5.990
   - Applied stochastic processes 	5.159
   - Internet analytics 	4.193
   - Stochastic calculus I 	4.193
""")
print('Top courses for lsi:')
for course in lsi:
    print('   - %s\t%.3f'%course)

Top courses for vsm:
   - Applied probability & stochastic processes 	6.342
   - Markov chains and algorithmic applications 	5.990
   - Applied stochastic processes 	5.159
   - Internet analytics 	4.193
   - Stochastic calculus I 	4.193

Top courses for lsi:
   - Applied probability & stochastic processes	0.757
   - Applied stochastic processes	0.708
   - Markov chains and algorithmic applications	0.672
   - Supply chain management	0.657
   - Mathematical models in supply chain management	0.591


## Exercise 4.7: Document-document similarity

We can simply to a cosine similarity between $S\mathbf{v}^T_{d1}$ and $S\mathbf{v}^T_{d2}$ 

In [138]:
def docSim(d1,d2):
    Sig = np.diag(S)
    SVtd1 = np.dot(Sig,Vt[:,d1])
    SVtd2 = np.dot(Sig,Vt[:,d2])
    return np.dot(SVtd1,SVtd2)/(np.sqrt(np.dot(SVtd1,SVtd1))*np.sqrt(np.dot(SVtd2,SVtd2)))

In [139]:
com308 = course2Idx['COM-308']
# All courses without com 308
coursesFiltered = [x for x in idx2Course.keys() if not x == com308]
top = np.argsort(-np.array(list(map(lambda d: docSim(com308,d), coursesFiltered))))[:5]
list(map(lambda x: id2name[idx2Course[x]], top))

['Martingales in financial mathematics',
 'Risk management',
 'D. Thinking: real problems, human-focused solutions',
 'Neurosciences II : cellular mechanisms of brain function',
 'Working group in Topology II']