# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** R

**Names:**

* Raphael Strebel
* Raphaël Barman
* Thierry Bossy

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [7]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds

id2name = np.load('id2name.npy').item()
name2id = np.load('name2id.npy').item()
idx2Term = np.load('idx2Term.npy').item()
term2Idx = np.load('term2Idx.npy').item()
idx2Course = np.load('idx2Course.npy').item()
course2Idx = np.load('course2Idx.npy').item()

X = np.load('X.npy').item()

## Exercise 4.4: Latent semantic indexing

In [8]:
U, S, Vt = svds(X,300)

- U relates the terms to the topic
- S gives the importance of the topic
- Vt relates the course to the topic

In [9]:
list(reversed(S[-20:]))

[112.87395448064626,
 63.335399830964342,
 61.433731136378157,
 55.410643913787901,
 53.629186727736126,
 51.043791399150237,
 50.07348261608756,
 49.265475696268851,
 45.001911051494922,
 44.730573826720061,
 44.052742858966162,
 42.85248368957312,
 41.494005207412471,
 40.579484784396023,
 39.499945744535836,
 38.499074253856108,
 37.916223620402342,
 37.68186675547404,
 37.575266560806938,
 37.195664009496099]

## Exercise 4.5: Topic extraction

In [10]:
topicCourses = np.diag(S) @ Vt
topicTerms = np.diag(S) @ U.T
topTopics = (np.sum(topicCourses,axis=1)+np.sum(topicTerms, axis=1)).argsort()[-10:][::-1]
for idx, topic in enumerate(topTopics):
    print("Topic %d top 5 terms:"%(idx+1))
    for term in map(lambda x: idx2Term[x], topicTerms[topic].argsort()[-5:][::-1]):
        print('  - %s'%term)
    print("Topic %d top 5 courses:"%(idx+1))
    for course in map(lambda x: id2name[idx2Course[x]], topicCourses[topic].argsort()[-5:][::-1]):
        print('  - %s'%course)

Topic 1 top 5 terms:
  - plan
  - report
  - presentation
  - data
  - work
Topic 1 top 5 courses:
  - Fundamentals in Systems Engineering
  - Solid waste engineering
  - Bioprocesses and downstream processing
  - Lab immersion in industry B
  - Biochemical engineering
Topic 2 top 5 terms:
  - market
  - financial
  - pricing
  - finance
  - asset
Topic 2 top 5 courses:
  - Quantitative methods in finance
  - Fundamentals in Systems Engineering
  - Derivatives
  - Investments
  - Lab immersion in industry A
Topic 3 top 5 terms:
  - cell
  - chemical
  - molecular
  - thermal
  - energy
Topic 3 top 5 courses:
  - Bioprocesses and downstream processing
  - Biochemical engineering
  - Théorie et critique du projet MA1 (Gugger)
  - Théorie et critique du projet MA2 (Gugger)
  - 2D Layered Materials: Synthesis, Properties and Applications
Topic 4 top 5 terms:
  - optical
  - device
  - optic
  - semiconductor
  - sensor
Topic 4 top 5 courses:
  - Théorie et critique du projet MA1 (Huang)
  

## Exercise 4.6: Document similarity search in concept-space

In [11]:
def sim(t,d):
    Sig = np.diag(S)
    SVtd = np.dot(Sig,Vt[:,d])
    SVd = np.dot(Sig,Vt.T[d])
    return np.dot(U[t],SVtd)/(np.sqrt(np.dot(U[t],U[t]))*np.sqrt(np.dot(SVd,SVd)))
def sim2Terms(t1,t2,d):
    Ut = U[t1]+U[t2]
    Sig = np.diag(S)
    SVtd = np.dot(Sig,Vt[:,d])
    SVd = np.dot(Sig,Vt.T[d])
    return np.dot(Ut,SVtd)/(np.sqrt(np.dot(Ut,Ut))*np.sqrt(np.dot(SVd,SVd)))

In [31]:
top = np.argsort(-np.array(list(map(lambda d: sim(term2Idx['facebook'],d), idx2Course.keys()))))[:5]
lsi = list(map(lambda x: (id2name[idx2Course[x]],sim(term2Idx['facebook'],x)), top))
print('Top courses for vsm:')
print("""   - Computational Social Media 	 3.908""")
print('Top courses for lsi:')
for course in lsi:
    print('   - %s\t%.3f'%(course))

Top courses for vsm:
   - Computational Social Media 	 3.908
Top courses for lsi:
   - Computational Social Media	0.953
   - Transport phenomena II	0.203
   - History of globalization II	0.181
   - Games for Crowds and Networks	0.178
   - Media security	0.178


In [29]:
top = np.argsort(-np.array(list(map(lambda d: sim2Terms(term2Idx['markov'],term2Idx['chain'],d), idx2Course.keys()))))[:5]
lsi = list(map(lambda x: (id2name[idx2Course[x]],sim2Terms(term2Idx['markov'],term2Idx['chain'],x)), top))
print('Top courses for vsm:')
print("""   - Applied probability & stochastic processes 	6.342
   - Markov chains and algorithmic applications 	5.990
   - Applied stochastic processes 	5.159
   - Internet analytics 	4.193
   - Stochastic calculus I 	4.193
""")
print('Top courses for lsi:')
for course in lsi:
    print('   - %s\t%.3f'%course)

Top courses for vsm:
   - Applied probability & stochastic processes 	6.342
   - Markov chains and algorithmic applications 	5.990
   - Applied stochastic processes 	5.159
   - Internet analytics 	4.193
   - Stochastic calculus I 	4.193

Top courses for lsi:
   - Applied stochastic processes	0.561
   - Markov chains and algorithmic applications	0.551
   - Optimization and simulation	0.498
   - Applied probability & stochastic processes	0.497
   - Networks out of control	0.413


## Exercise 4.7: Document-document similarity

We can simply to a cosine similarity between $S\vec{v}^T_{d1}$ and $S\vec{v}^T_{d2}$ 

In [32]:
def docSim(d1,d2):
    Sig = np.diag(S)
    SVtd1 = np.dot(Sig,Vt[:,d1])
    SVtd2 = np.dot(Sig,Vt[:,d2])
    return np.dot(SVtd1,SVtd2)/(np.sqrt(np.dot(SVtd1,SVtd1))*np.sqrt(np.dot(SVtd2,SVtd2)))

In [36]:
com308 = course2Idx['COM-308']
# All courses without com 308
coursesFiltered = [x for x in idx2Course.keys() if not x == com308]
top = np.argsort(-np.array(list(map(lambda d: docSim(com308,d), coursesFiltered))))[:5]
list(map(lambda x: id2name[idx2Course[x]], top))

['Risk management',
 'Working group in Topology II',
 'D. Thinking: real problems, human-focused solutions',
 'Neurosciences II : cellular mechanisms of brain function',
 'Foundations of software']