# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** R

**Names:**

* Raphael Strebel
* Raphaël Barman
* Thierry Bossy

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds

id2name = np.load('id2name.npy').item()
name2id = np.load('name2id.npy').item()
idx2Term = np.load('idx2Term.npy').item()
term2Idx = np.load('term2Idx.npy').item()
idx2Course = np.load('idx2Course.npy').item()
course2Idx = np.load('course2Idx.npy').item()
numTerms = len(term2Idx)
numCourses = len(course2Idx)

idf = np.load('idf.npy')

X = np.load('X.npy').item()

## Exercise 4.4: Latent semantic indexing

In [2]:
k = 300
U, S, Vt = svds(X,k)

- S is the scaling of the matrix, thus it represents the importance of the topic
- U in SVD represents the row space, so in our case it relates the terms to the topics
- Vt is the column space, so it relates the course to the topics

In [3]:
print("The top-20 eigenvalues of X : ")
for val in S[-20:][::-1]:
    print(val)

The top-20 eigenvalues of X : 
39.4021191763
25.9185495559
24.2083222301
22.5214587075
21.8281447463
21.6825048095
20.9470019875
20.345246589
20.165944542
19.405631305
18.9725390588
18.7981133945
18.7352250225
18.4012226199
17.9634010338
17.735473807
17.7320502856
17.4728501849
17.4167092251
17.1811733303


## Exercise 4.5: Topic extraction

In [4]:
for i in reversed(range(k-10,k)):
    print('Topic %d top 10 terms:'%(k-i))
    topTerms = np.argsort(-np.abs(U.T[i]))[:10]
    for term in topTerms:
        print('  - %s\t%.3f'%(idx2Term[term],U.T[i][term]))
    print('Topic %d top 10 courses:'%(k-i))
    topCourses = np.argsort(-np.abs(Vt[i]))[:10]
    for course in topCourses:
        print('  - %s\t%.3f'%(id2name[idx2Course[course]],Vt[i][course]))

Topic 1 top 10 terms:
  - problem	0.099
  - report	0.094
  - research	0.093
  - presentation	0.090
  - plan	0.089
  - material	0.085
  - process	0.083
  - scientific	0.078
  - engineering	0.078
  - work	0.077
Topic 1 top 10 courses:
  - Lab immersion III	0.131
  - Lab immersion I	0.128
  - Bioprocesses and downstream processing	0.124
  - Lab immersion II	0.113
  - Renewable energy and solar architecture in Davos	0.109
  - Nanobiotechnology and biophysics	0.106
  - Semester project in Bioengineering	0.106
  - General physics (English) II	0.093
  - Principles of finance	0.089
  - Project in bioengineering and biosciences	0.087
Topic 2 top 10 terms:
  - wetlab	0.222
  - experimentation	0.205
  - obtained	0.173
  - experiment	0.157
  - laboratory-based	0.144
  - ou	0.126
  - semaines	0.126
  - plan	0.115
  - research	0.105
  - neuroscience	0.105
Topic 2 top 10 courses:
  - Lab immersion III	0.409
  - Lab immersion I	0.393
  - Lab immersion II	0.356
  - Semester project in Bioengineering	0.

Labels :
1. Research
2. Laboratory
3. Finances
4. Architecture
5. Bio economy 
6. Microscopy
7. Life science
8. Micro technology
9. Biomechanics
10. Cultural Heritage

## Exercise 4.6: Document similarity search in concept-space

In [5]:
def sim(Ut,d):
    Sig = np.diag(S)
    Vd = Vt.T[d]
    SVd = np.dot(Sig,Vd)
    return np.dot(Ut,np.dot(Sig,Vd.T))/(np.sqrt(np.dot(Ut,Ut))*np.sqrt(np.dot(SVd,SVd)))

def query(terms):
    terms = terms.split(' ')
    
    Ut = U[term2Idx[terms[0]]]
    for term in terms[1:]:
        Ut += U[term2Idx[term]]
    results = np.array([sim(Ut,d) for d in range(numCourses)])
    top = np.argsort(-results)[:5]
    # We filter all courses with null scores
    top = [course for course in top if results[course] != 0.0]
    
    for course in top:
        print('   -',id2name[idx2Course[course]],'\t%.3f'%results[course])

In [6]:
print('Top courses for vsm:')
print("""   - Computational Social Media 	0.186""")
print('Top courses for lsi:')
query('facebook')

Top courses for vsm:
   - Computational Social Media 	0.186
Top courses for lsi:
   - Computational Social Media 	0.918
   - Social media 	0.679
   - How people learn I 	0.435
   - How people learn II 	0.353
   - Privacy Protection 	0.297


We have more course suggestions than previously. This is due to the dimensionality reduction. Before, the exact term 'facebook' only appeared in Computational Social Media but now some other terms have been "assimilated" in the same topic as 'facebook', thus giving us more results. The course Computational Social Media is still at the top, since it has the term 'facebook'.

In [7]:
print('Top courses for vsm:')
print("""   - Applied probability & stochastic processes 	0.568
   - Applied stochastic processes 	0.566
   - Markov chains and algorithmic applications 	0.358
   - Supply chain management 	0.357
   - Statistical Sequence Processing 	0.299""")
print('Top courses for lsi:')
query('markov chain')

Top courses for vsm:
   - Applied probability & stochastic processes 	0.568
   - Applied stochastic processes 	0.566
   - Markov chains and algorithmic applications 	0.358
   - Supply chain management 	0.357
   - Statistical Sequence Processing 	0.299
Top courses for lsi:
   - Applied probability & stochastic processes 	0.758
   - Applied stochastic processes 	0.708
   - Markov chains and algorithmic applications 	0.671
   - Supply chain management 	0.656
   - Mathematical models in supply chain management 	0.590


The similarity formula tries to 'reconstruct' the term/document value of the intial matrix $X$, but it approximates it with only 300 topics.

Hence it is likely that 'markov' has been associated with 'chain', and thus has less weight than previously. As a result, the course with a high tf-idf for 'chain' is the most relevant to the query.

## Exercise 4.7: Document-document similarity

We can simply do a cosine similarity between $S\mathbf{v}^T_{d1}$ and $S\mathbf{v}^T_{d2}$.
The idea is to compare the two documents to topic similarity, but also weight them by the importance of the topic.

For example, let's say we have 3 documents $d_{1,2,3}$, with $2$ topics $t_{1,2}$ with the importance singular value of $t_1$ bigger than the one of $t_2$ and that $d_1$ and $d_2$ have a very similar topic $t_1$ and a dissimalr topic $t_2$ and that $d_1$ and $d_3$ have the inverse (similar $t_2$, dissimilar $t_1$)

Then it makes more sense that $d_1$ is rated as more similar to $d_2$ than to $d_3$, since the $t_1$ is more important that $t_2$.

In [8]:
def docSim(d1,d2):
    Sig = np.diag(S)
    SVtd1 = np.dot(Sig,Vt[:,d1])
    SVtd2 = np.dot(Sig,Vt[:,d2])
    return np.dot(SVtd1,SVtd2)/(np.sqrt(np.dot(SVtd1,SVtd1))*np.sqrt(np.dot(SVtd2,SVtd2)))

In [9]:
com308 = course2Idx['COM-308']

# All courses without com 308
coursesFiltered = [x for x in idx2Course.keys() if not x == com308]
# Indexes of the 5 most similar courses
top = np.argsort(-np.array(list(map(lambda d: docSim(com308,d), coursesFiltered))))[:5]

print("Top 5 classes most similar to com-308 :")
for course in map(lambda x: id2name[idx2Course[x]], top):
    print("   -",course)

Top 5 classes most similar to com-308 :
   - Parallel and high-performance computing
   - Scientific literature analysis in cell and developmental biology
   - Experimental biochemistry and biophysics
   - Numerical Analysis and Computational Mathematics
   - Seminar series on advances in materials (autumn)
