# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** K

**Names:**

* Kim Lan Phan Hoang
* Robin Lang

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds

In [2]:
X = np.load("data/TFIDF.npy")
X = X.T  # term-document matrix: M terms (rows) x N courses (columns)
words = np.load("data/words.npy")
courses = np.load("data/courses.npy")

M = len(words)
N = len(courses)
K = 300

## Exercise 4.4: Latent semantic indexing

### Apply SVD with K = 300 to your term-document matrix X from the previous exercise.

In [3]:
# Truncated SVD keeping K singular triplets: U is (M, K), S is (K,), VT is (K, N).
U, S, VT = svds(X, k=K)
V = VT.T

### Describe the rows and columns of U and V, and the values of S.

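The columns of U are the left singular vectors of X. Each row of U corresponds to a term and each column to one of the K latent topics, so `U[t, k]` is the weight of term t in topic k.

The columns of V are the right singular vectors of X. Each row of V corresponds to a document (course) and each column to a topic, so `V[d, k]` is the weight of document d in topic k.

The values in S are the singular values of X, i.e. the square roots of the eigenvalues of $X^\top X$. Each one measures the importance of the corresponding topic. Here `svds` returns them ordered from smallest to largest.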
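A quick way to check these shapes and the ordering of S is to run `svds` on a small matrix. The sketch below uses a random toy matrix (an assumption for illustration, not the course data) and sorts the singular values explicitly instead of relying on the order in which `svds` returns them.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy term-document matrix (illustrative only): 40 terms x 15 documents.
rng = np.random.default_rng(0)
X_toy = rng.random((40, 15))
k = 5

U_toy, S_toy, VT_toy = svds(X_toy, k=k)

# Sort topics by decreasing singular value rather than assuming an order.
order = np.argsort(-S_toy)
U_toy, S_toy, VT_toy = U_toy[:, order], S_toy[order], VT_toy[order, :]

print(U_toy.shape)   # (40, 5): one row per term, one column per topic
print(S_toy.shape)   # (5,)
print(VT_toy.shape)  # (5, 15): one row per topic, one column per document

# Compare with the k largest singular values of a full (dense) SVD.
full = np.linalg.svd(X_toy, compute_uv=False)
print(np.allclose(S_toy, full[:k]))
```

Sorting once after the decomposition means later cells can index topic 0 as the most important one without reversed slices.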

### Print the top-20 singular values of X

In [5]:
print("LARGEST 20 SINGULAR VALUES:")
# S is in ascending order, so take the last 20 entries in reverse.
print(S[:-21:-1])

LARGEST 20 SINGULAR VALUES:
[ 61.4911109   41.11869589  38.06116804  36.89655454  35.64803951
  34.57357261  33.17108239  32.71968738  32.20601846  31.4253953
  31.10123689  30.36687001  29.97372375  29.73694418  28.73211126
  28.65421312  28.46731951  28.27977773  28.22180541  27.95464965]


## Exercise 4.5: Topic extraction

### Extract the topics from the term-document matrix X using the low-rank approximation.

In [6]:
topics = []

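# Walk the singular triplets from the largest to the smallest singular value
# (S is in ascending order), including index 0.
for i in range(K-1, -1, -1):
    topics.append([S[i], U.T[i], VT[i]])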

### Print the top-10 topics as a combination of terms and a combination of documents.

For each of the top-10 topics, we print the 10 terms and the 10 documents with the largest weights.

In [7]:
for i in range(10):
    # curr = [singular value, term weights (length M), document weights (length N)]
    curr = topics[i]
    print("topic:", i+1, "( singular value:", curr[0], ")")
    print("  top10 terms:")
    # Term weights come from U, so they index into words.
    for t in np.argsort(-curr[1])[:10]:
        print("   ", curr[1][t], ":", words[t])
    print("  top10 documents:")
    # Document weights come from VT, so they index into courses.
    for t in np.argsort(-curr[2])[:10]:
        print("   ", curr[2][t], ":", courses[t])
    print("")
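Singular vectors are only defined up to a sign flip, so the strongest terms or documents of a topic may all carry negative weights. One option is to rank by absolute weight. The following is a minimal sketch with a hypothetical helper `top_entries` and toy data (neither is part of the lab code); the length check guards against pairing a term vector with document labels or vice versa.

```python
import numpy as np

def top_entries(weights, labels, n=10):
    """Return the n (weight, label) pairs with the largest absolute weight."""
    weights = np.asarray(weights)
    # Fail early if the vector and its labels come from different axes.
    if len(weights) != len(labels):
        raise ValueError("weights and labels must have the same length")
    idx = np.argsort(-np.abs(weights))[:n]
    return [(float(weights[i]), labels[i]) for i in idx]

# Toy data (illustrative only).
toy_weights = np.array([-0.01, -0.60, 0.05, -0.30])
toy_labels = ["future", "markov chain", "critique", "random walk"]

for weight, label in top_entries(toy_weights, toy_labels, n=2):
    print(weight, ":", label)
```

With the notebook's variables, the same helper could be called once with a column of `U` and `words`, and once with a row of `VT` and `courses`.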

### Give a label to each of them.

In [None]:
# TODO

## Exercise 4.6: Document similarity search in concept-space

In [8]:
def docsim(t, d):
    # Cosine similarity between term t (row of U) and document d (row of V scaled by S).
    doc = S * V[d]
    return np.dot(U[t], doc) / (np.linalg.norm(U[t]) * np.linalg.norm(doc))

### Implement a search function using LSI concept-space.

In [9]:
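# The dictionaries are stored as pickled objects; recent NumPy versions require allow_pickle=True.
words_dict = np.load("data/words_dict.npy", allow_pickle=True).item()
courses_dict = np.load("data/courses_dict.npy", allow_pickle=True).item()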

In [10]:
words_dict['markov chain']

7043

In [11]:
V.shape

(854, 300)

### Search for "facebook" in both VSM and LSI. How does it compare? Why?

As before, the term "facebook" does not exist in the filtered list of terms, so it has no row in U and cannot be used as a query in concept space.

### Search for "markov chains" and compare with the previous section.

In [13]:
markov_index = words_dict['markov chain']
markov_sim = []

# docsim's second argument indexes documents, so iterate over the N courses, not the M terms.
for i in range(N):
    markov_sim.append(docsim(markov_index, i))
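The loop over documents can also be written as a single matrix product. The sketch below assumes the same conventions as `docsim` (rows of `U` are terms, rows of `V` are documents); the function name `search_lsi` and the toy factors are hypothetical and only serve as an illustration.

```python
import numpy as np

def search_lsi(term_index, U, S, V, top_n=5):
    """Rank documents by cosine similarity to one term in concept space."""
    term_vec = U[term_index]   # (K,) concept-space vector of the query term
    docs = V * S               # (N, K): each document row scaled by S
    scores = docs @ term_vec
    scores = scores / (np.linalg.norm(docs, axis=1) * np.linalg.norm(term_vec))
    best = np.argsort(-scores)[:top_n]
    return [(int(d), float(scores[d])) for d in best]

# Toy factors (illustrative only): 6 terms, 4 documents, 2 concepts.
rng = np.random.default_rng(1)
U_toy = rng.normal(size=(6, 2))
S_toy = np.array([3.0, 1.0])
V_toy = rng.normal(size=(4, 2))

print(search_lsi(0, U_toy, S_toy, V_toy, top_n=3))
```

The returned document indices can then be mapped to course names through `courses`.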

## Exercise 4.7: Document-document similarity

### Find the classes that are the most similar to Internet Analytics.

### Write down the equation to efficiently compute the similarity between documents.
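One possible formulation, by analogy with `docsim` (an assumption, not a result stated in the lab): represent document $d_i$ by $S v_i$, where $v_i$ is the $i$-th row of $V$ written as a column vector and $S$ is the diagonal matrix of singular values, and take the cosine between two such vectors:

$$\mathrm{sim}(d_i, d_j) = \frac{(S v_i)^\top (S v_j)}{\lVert S v_i \rVert \, \lVert S v_j \rVert} = \frac{v_i^\top S^2 v_j}{\lVert S v_i \rVert \, \lVert S v_j \rVert}$$

Each $S v_i$ has only $K = 300$ entries, so one comparison costs $O(K)$ operations instead of $O(M)$ on the raw TF-IDF columns.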

### Print the top 5 classes most similar to COM-308.