# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** R

**Names:**

* Raphael Strebel
* Raphaël Barman
* Thierry Bossy

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [70]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds

id2name = np.load('id2name.npy').item()
name2id = np.load('name2id.npy').item()
idx2Term = np.load('idx2Term.npy').item()
term2Idx = np.load('term2Idx.npy').item()
idx2Course = np.load('idx2Course.npy').item()
course2Idx = np.load('course2Idx.npy').item()

X = np.load('X.npy')


## Exercise 4.4: Latent semantic indexing

In [3]:
U, S, Vt = svds(X,300)

- U relates the terms to the topic
- S gives the importance of the topic
- Vt relates the course to the topic

In [10]:
list(reversed(S[-20:]))

[112.89121295811678,
 62.134533391836847,
 61.07624544574157,
 55.509001415866798,
 53.724520126932838,
 50.969248496464914,
 50.138053964750121,
 49.202610571777953,
 45.002340726377092,
 44.834905521568743,
 43.999509819940528,
 42.685815856573846,
 41.472874750293798,
 40.313796399951215,
 38.551404968418041,
 38.039367733532963,
 37.864860149138543,
 37.631913076279936,
 37.150842242320614,
 36.643986662341959]

## Exercise 4.5: Topic extraction

In [77]:
topicCourses = np.diag(S) @ Vt
topicTerms = np.diag(S) @ U.T
topTopics = (np.sum(topicCourses,axis=1)+np.sum(topicTerms, axis=1)).argsort()[-10:][::-1]
for idx, topic in enumerate(topTopics):
    print("Topic %d top 5 terms:"%(idx+1))
    for term in map(lambda x: idx2Term[x], topicTerms[topic].argsort()[-5:][::-1]):
        print('  - %s'%term)
    print("Topic %d top 5 courses:"%(idx+1))
    for course in map(lambda x: id2name[idx2Course[x]], topicCourses[topic].argsort()[-5:][::-1]):
        print('  - %s'%course)

Topic 1 top 5 terms:
  - city
  - architectural
  - architecture
  - narrative
  - urban
Topic 1 top 5 courses:
  - Théorie et critique du projet MA2 (Geers)
  - Théorie et critique du projet MA1 (Geers)
  - Théorie et critique du projet MA1 (Gugger)
  - Théorie et critique du projet MA2 (Gugger)
  - Théorie et critique du projet MA1 (Huang)
Topic 2 top 5 terms:
  - market
  - financial
  - pricing
  - finance
  - asset
Topic 2 top 5 courses:
  - Quantitative methods in finance
  - Fundamentals in Systems Engineering
  - Derivatives
  - Investments
  - 2D Layered Materials: Synthesis, Properties and Applications
Topic 3 top 5 terms:
  - cell
  - chemical
  - molecular
  - energy
  - thermal
Topic 3 top 5 courses:
  - Bioprocesses and downstream processing
  - Biochemical engineering
  - Théorie et critique du projet MA1 (Gugger)
  - Théorie et critique du projet MA2 (Gugger)
  - 2D Layered Materials: Synthesis, Properties and Applications
Topic 4 top 5 terms:
  - optical
  - device
  -

## Exercise 4.6: Document similarity search in concept-space

In [101]:
def sim(t,d):
    Sig = np.diag(S)
    SVtd = np.dot(Sig,Vt[:,d])
    SVd = np.dot(Sig,Vt.T[d])
    return np.dot(U[t],SVtd)/(np.sqrt(np.dot(U[t],U[t]))*np.sqrt(np.dot(SVd,SVd)))
def sim2Terms(t1,t2,d):
    Ut = U[t1]+U[t2]
    Sig = np.diag(S)
    SVtd = np.dot(Sig,Vt[:,d])
    SVd = np.dot(Sig,Vt.T[d])
    return np.dot(Ut,SVtd)/(np.sqrt(np.dot(Ut,Ut))*np.sqrt(np.dot(SVd,SVd)))

In [99]:
vsm = list(map(lambda x: id2name[idx2Course[x]], X[term2Idx['facebook']].argsort()[-5:][::-1]))
lsi = list(map(lambda x: id2name[idx2Course[x]], np.array(list(map(lambda d: sim(term2Idx['facebook'],d), idx2Course.keys()))).argsort()[-5:][::-1]))
print('Top courses for vsm:')
for course in vsm:
    print('   - %s'%course)
print('Top courses for lsi:')
for course in lsi:
    print('   - %s'%course)

Top courses for vsm:
   - Computational Social Media
   - Hydrogeophysics
   - Electronic properties of solids and superconductivity
   - CCMX Advanced Course - Instrumented Nanoindentation
   - Molecular and cellular biophysic II
Top courses for lsi:
   - Computational Social Media
   - Transport phenomena II
   - Media security
   - History of globalization II
   - Mobile networks


In [103]:
vsm = list(map(lambda x: id2name[idx2Course[x]], (X[term2Idx['markov']]+X[term2Idx['chain']]).argsort()[-5:][::-1]))
lsi = list(map(lambda x: id2name[idx2Course[x]], np.array(list(map(lambda d: sim2Terms(term2Idx['markov'],term2Idx['chain'],d), idx2Course.keys()))).argsort()[-5:][::-1]))
print('Top courses for vsm:')
for course in vsm:
    print('   - %s'%course)
print('Top courses for lsi:')
for course in lsi:
    print('   - %s'%course)

Top courses for vsm:
   - Markov chains and algorithmic applications
   - Applied probability & stochastic processes
   - Applied stochastic processes
   - Internet analytics
   - Optimization and simulation
Top courses for lsi:
   - Markov chains and algorithmic applications
   - Applied stochastic processes
   - Applied probability & stochastic processes
   - Optimization and simulation
   - Networks out of control


## Exercise 4.7: Document-document similarity

We can simply to a cosine similarity between $S\vec{v}^T_{d1}$ and $S\vec{v}^T_{d2}$ 

In [104]:
def docSim(d1,d2):
    Sig = np.diag(S)
    SVtd1 = np.dot(Sig,Vt[:,d1])
    SVtd2 = np.dot(Sig,Vt[:,d2])
    return np.dot(SVtd1,SVtd2)/(np.sqrt(np.dot(SVtd1,SVtd1))*np.sqrt(np.dot(SVtd2,SVtd2)))

In [109]:
com308 = course2Idx['COM-308']
list(map(lambda x: id2name[idx2Course[x]], np.array(list(map(lambda d: docSim(com308,d), [x for x in idx2Course.keys() if not x == com308]))).argsort()[-5:][::-1]))

['Risk management',
 'Working group in Topology II',
 'Neurosciences II : cellular mechanisms of brain function',
 'D. Thinking: real problems, human-focused solutions',
 'Introduction to chemical engineering']