# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** R

**Names:**

* Raphael Strebel
* Raphaël Barman
* Thierry Bossy

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [24]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds

id2name = np.load('id2name.npy').item()
name2id = np.load('name2id.npy').item()
idx2Term = np.load('idx2Term.npy').item()
term2Idx = np.load('term2Idx.npy').item()
idx2Course = np.load('idx2Course.npy').item()
course2Idx = np.load('course2Idx.npy').item()
numTerms = len(term2Idx)
numCourses = len(course2Idx)

idf = np.load('idf.npy')

X = np.load('X.npy').item()

## Exercise 4.4: Latent semantic indexing

In [15]:
k = 300
U, S, Vt = svds(X,k)

1 loop, best of 3: 2.17 s per loop


- U relates the terms to the topic
- S gives the importance of the topic
- Vt relates the course to the topic

In [16]:
print("The top-20 eigenvalues of X : ")
list(S[-20:][::-1])

The top-20 eigenvalues of X : 


[39.08928548649714,
 24.73264670937084,
 23.51299042422734,
 21.934011231985618,
 21.700080365620373,
 21.610618584800157,
 20.862342250939097,
 20.476572836438603,
 20.260554697447549,
 19.292993564037143,
 19.021476260915772,
 18.789530688592901,
 18.621696302435758,
 18.351176418004744,
 18.081469334249938,
 17.866904877203627,
 17.807177689815024,
 17.519950430817588,
 17.38204739640604,
 17.197000806977691]

## Exercise 4.5: Topic extraction

In [17]:
for i in reversed(range(k-10,k)):
    print('Topic %d top 10 terms:'%(k-i))
    topTerms = np.argsort(-np.abs(U.T[i]))[:10]
    for term in topTerms:
        print('  - %s\t%.3f'%(idx2Term[term],U.T[i][term]))
    print('Topic %d top 10 courses:'%(k-i))
    topCourses = np.argsort(-np.abs(Vt[i]))[:10]
    for course in topCourses:
        print('  - %s\t%.3f'%(id2name[idx2Course[course]],Vt[i][course]))

Topic 1 top 10 terms:
  - data	0.097
  - problem	0.092
  - research	0.090
  - material	0.087
  - plan	0.083
  - process	0.083
  - report	0.082
  - presentation	0.082
  - engineering	0.077
  - work	0.077
Topic 1 top 10 courses:
  - Bioprocesses and downstream processing	0.129
  - Lab immersion III	0.118
  - Renewable energy and solar architecture in Davos	0.116
  - Nanobiotechnology and biophysics	0.109
  - General physics (English) II	0.097
  - Lab immersion I	0.093
  - Principles of finance	0.091
  - Project in bioengineering and biosciences	0.083
  - History of globalization I	0.083
  - Biochemical engineering	0.081
Topic 2 top 10 terms:
  - electron	-0.152
  - microscopy	-0.124
  - optical	-0.120
  - data	0.105
  - semiconductor	-0.104
  - cell	-0.098
  - financial	0.098
  - spectroscopy	-0.097
  - plan	0.095
  - sem	-0.094
Topic 2 top 10 courses:
  - Nanofabrication with focused electron and ion beams	-0.236
  - Lab immersion III	0.225
  - Lab immersion I	0.175
  - Principles of fi

In [5]:
topicCourses = np.diag(S) @ Vt
topicTerms = np.diag(S) @ U.T
topTopics = (np.sum(topicCourses,axis=1)+np.sum(topicTerms, axis=1)).argsort()[-10:][::-1]
for idx, topic in enumerate(topTopics):
    print("Topic %d top 5 terms:"%(idx+1))
    for term in map(lambda x: idx2Term[x], topicTerms[topic].argsort()[-5:][::-1]):
        print('  - %s'%term)
    print("Topic %d top 5 courses:"%(idx+1))
    for course in map(lambda x: id2name[idx2Course[x]], topicCourses[topic].argsort()[-5:][::-1]):
        print('  - %s'%course)

Topic 1 top 5 terms:
  - architectural
  - architecture
  - studio
  - urban
  - laba
Topic 1 top 5 courses:
  - Théorie et critique du projet MA1 (Gugger)
  - Studio MA2 (Escher et GuneWardena)
  - Théorie et critique du projet MA2 (Gugger)
  - Théorie et critique du projet MA2 (Huang)
  - Théorie et critique du projet MA1 (Huang)
Topic 2 top 5 terms:
  - cell
  - protein
  - edms
  - doctoral
  - note
Topic 2 top 5 courses:
  - Practical - Constam Lab
  - Nanobiotechnology and biophysics
  - Machine Learning for Engineers
  - Practical - Trono Lab
  - Practical - Fellay Lab
Topic 3 top 5 terms:
  - reactor
  - reaction
  - 8:15-12:00
  - optic
  - chemical
Topic 3 top 5 courses:
  - Lab methods : biosafety
  - Structural mechanics (for SV)
  - Inorganic chemistry "Applications and spin-offs"
  - Frederic Joliot/Otto Hahn Summer School on nuclear reactors Physics, fuels and systems
  - Neutronics
Topic 4 top 5 terms:
  - edms
  - doctoral
  - architecture
  - circuit
  - note
Topic 4 

## Exercise 4.6: Document similarity search in concept-space

In [37]:
def sim(Ut,d):
    Sig = np.diag(S)
    SVtd = np.dot(Sig,Vt[:,d])
    SVd = np.dot(Sig,Vt.T[d])
    return np.dot(Ut,SVtd)/(np.sqrt(np.dot(Ut,Ut))*np.sqrt(np.dot(SVd,SVd)))

def query(terms):
    terms = terms.split(' ')
    Ut = U[term2Idx[terms[0]]]
    for term in terms[1:]:
        Ut += U[term2Idx[term]]
    results = np.array([sim(Ut,d) for d in range(numCourses)])
    top = np.argsort(-results)[:5]
    # We filter all courses with null scores
    top = [course for course in top if results[course] != 0.0]
    
    for course in top:
        print('   -',id2name[idx2Course[course]],'\t%.3f'%results[course])
query('facebook')    

   - Computational Social Media 	0.929
   - Social media 	0.663
   - How people learn I 	0.400
   - How people learn II 	0.320
   - Internet analytics 	0.298


In [34]:
def sim(t,d):
    Sig = np.diag(S)
    SVtd = np.dot(Sig,Vt[:,d])
    SVd = np.dot(Sig,Vt.T[d])
    return np.dot(U[t],SVtd)/(np.sqrt(np.dot(U[t],U[t]))*np.sqrt(np.dot(SVd,SVd)))
def sim2Terms(t1,t2,d):
    Ut = U[t1]+U[t2]
    Sig = np.diag(S)
    SVtd = np.dot(Sig,Vt[:,d])
    SVd = np.dot(Sig,Vt.T[d])
    return np.dot(Ut,SVtd)/(np.sqrt(np.dot(Ut,Ut))*np.sqrt(np.dot(SVd,SVd)))

In [136]:
top = np.argsort(-np.array(list(map(lambda d: sim(term2Idx['facebook'],d), idx2Course.keys()))))[:5]
lsi = list(map(lambda x: (id2name[idx2Course[x]],sim(term2Idx['facebook'],x)), top))
print('Top courses for vsm:')
print("""   - Computational Social Media 	 3.908""")
print('Top courses for lsi:')
for course in lsi:
    print('   - %s\t%.3f'%(course))

Top courses for vsm:
   - Computational Social Media 	 3.908
Top courses for lsi:
   - Computational Social Media	0.918
   - Social media	0.678
   - How people learn I	0.437
   - How people learn II	0.353
   - Media security	0.298


In [35]:
top = np.argsort(-np.array(list(map(lambda d: sim2Terms(term2Idx['markov'],term2Idx['chain'],d), idx2Course.keys()))))[:5]
lsi = list(map(lambda x: (id2name[idx2Course[x]],sim2Terms(term2Idx['markov'],term2Idx['chain'],x)), top))
print('Top courses for vsm:')
print("""   - Applied probability & stochastic processes 	6.342
   - Markov chains and algorithmic applications 	5.990
   - Applied stochastic processes 	5.159
   - Internet analytics 	4.193
   - Stochastic calculus I 	4.193
""")
print('Top courses for lsi:')
for course in lsi:
    print('   - %s\t%.3f'%course)

Top courses for vsm:
   - Applied probability & stochastic processes 	6.342
   - Markov chains and algorithmic applications 	5.990
   - Applied stochastic processes 	5.159
   - Internet analytics 	4.193
   - Stochastic calculus I 	4.193

Top courses for lsi:
   - Supply chain management	0.843
   - Mathematical models in supply chain management	0.780
   - Applied probability & stochastic processes	0.607
   - Markov chains and algorithmic applications	0.580
   - Applied stochastic processes	0.571


## Exercise 4.7: Document-document similarity

We can simply to a cosine similarity between $S\mathbf{v}^T_{d1}$ and $S\mathbf{v}^T_{d2}$ 

In [38]:
def docSim(d1,d2):
    Sig = np.diag(S)
    SVtd1 = np.dot(Sig,Vt[:,d1])
    SVtd2 = np.dot(Sig,Vt[:,d2])
    return np.dot(SVtd1,SVtd2)/(np.sqrt(np.dot(SVtd1,SVtd1))*np.sqrt(np.dot(SVtd2,SVtd2)))

In [39]:
com308 = course2Idx['COM-308']
# All courses without com 308
coursesFiltered = [x for x in idx2Course.keys() if not x == com308]
top = np.argsort(-np.array(list(map(lambda d: docSim(com308,d), coursesFiltered))))[:5]
print("Top 5 classes most similar to com-308 :")
list(map(lambda x: id2name[idx2Course[x]], top))

Top 5 classes most similar to com-308 :


['Risk management',
 'Soft Microsystems Processing and Devices',
 'Martingales in financial mathematics',
 'Statistical thermodynamics',
 'D. Thinking: real problems, human-focused solutions']