# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** K

**Names:**

* Kim Lan Phan Hoang
* Robin Lang

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds

In [2]:
# load the arrays from the first part
X = np.load("data/TFIDF.npy")
X = X.T
words = np.load("data/words.npy")
courses = np.load("data/courses.npy")

# initialize the array sizes for SVD
M = len(words)
N = len(courses)
K = 300

## Exercise 4.4: Latent semantic indexing

### Apply SVD with K = 300 to your term-document matrix X from the previous exercise.

In [3]:
# compute the SVD of X
U, S, VT = svds(X, k=K)
V = VT.T

### Describe the rows and columns of U and V, and the values of S.

The columns of U are the left singular vectors of X. Each row of U is the topic distribution of a term.

The columns of V are the right singular vectors of X. Each row of V is the topic distribution of a course description.

The values in S are the singular values of X (the square roots of the eigenvalues of $X^T X$), here ordered from smallest to largest. Each one gives an "importance" to its topic: a high value in S means that the topic explains a larger part of X, that is, it carries more weight within the corpus.
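
As a sanity check on these descriptions, the shapes and the ordering of the `svds` output can be inspected on a small random matrix. This is only a minimal sketch: the toy matrix `A` and its sizes are assumptions for illustration, not the lab data, and the result is sorted explicitly rather than relying on the order in which `svds` returns the values.

```python
import numpy as np
from scipy.sparse.linalg import svds

# toy term-document matrix (assumption: 8 terms, 6 documents)
rng = np.random.default_rng(0)
A = rng.random((8, 6))
k = 3

u, s, vt = svds(A, k=k)

# sort the topics from largest to smallest singular value
order = np.argsort(-s)
u, s, vt = u[:, order], s[order], vt[order]

print(u.shape)   # one row per term, one column per topic
print(s.shape)   # one singular value per topic
print(vt.shape)  # one row per topic, one column per document

# the k values agree with the k largest values of a full SVD
s_full = np.linalg.svd(A, compute_uv=False)
print(np.allclose(s, s_full[:k]))
```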

### Print the top-20 eigenvalues of X

In [4]:
print("LARGEST 20 EIGENVALUES:")
print(S[:-21:-1])

LARGEST 20 EIGENVALUES:
[ 61.4911109   41.11869589  38.06116804  36.89655454  35.64803951
  34.57357261  33.17108239  32.71968738  32.20601846  31.4253953
  31.10123689  30.36687001  29.97372375  29.73694418  28.73211126
  28.65421312  28.46731951  28.27977773  28.22180541  27.95464965]


## Exercise 4.5: Topic extraction

### Extract the topics from the term-document matrix X using the low-rank approximation.

In [5]:
# extract the topics: each entry contains the singular value
# of the topic, followed by the weights of the terms and of
# the courses within this topic
topics = []

# note: S is iterated in reverse, so the topics are
# sorted from largest to smallest singular value
# (the stop value is -1 so that index 0 is included)
for i in range(K-1, -1, -1):
    topics.append([S[i], U.T[i], VT[i]])

### Print the top-10 topics as a combination of terms and a combination of documents.

For this, we print the top 10 terms and the top 10 documents of each of the top-10 topics.

In [6]:
# the first 10 entries of topics are the top 10 topics
for i in range(10):
    curr = topics[i]
    
    # title of each topic with its singular value (labelled "eigenvalue")
    print("topic:", i+1, "( eigenvalue:", curr[0], ")")
    
    # print the top 10 terms within that topic
    print("  top10 terms:")
    for t in np.argsort(-curr[1])[:10]:
        print("   ", curr[1][t], ":", words[t])
    
    # print the top 10 courses within that topic
    print("  top10 courses:")
    for t in np.argsort(-curr[2])[:10]:
        print("   ", curr[2][t], ":", courses[t])
    print("")

topic: 1 ( eigenvalue: 61.4911108973 )
  top10 terms:
    0.108270211004 : project
    0.0816817895981 : data
    0.0756358641883 : report
    0.0755159062971 : materials
    0.0726962641445 : research
    0.0726730141339 : work
    0.0705602992579 : engineer
    0.0701417892484 : basic
    0.0681158492668 : study
    0.0676815304476 : theory
  top10 courses:
    0.110097689162 : ChE-437
    0.105206639795 : CH-413
    0.10155141235 : PENS-210
    0.0980084587656 : BIO-487
    0.0850466515802 : PHYS-106(en)
    0.0806883548355 : MATH-468
    0.0801779660628 : ChE-311
    0.0788014289508 : BIO-105
    0.0780009478045 : ME-475
    0.0779429329881 : CH-492

topic: 2 ( eigenvalue: 41.1186958882 )
  top10 terms:
    0.0500541127238 : financial
    0.0439373949129 : price
    0.0412431684583 : finance
    0.0380250945786 : linear
    0.0348625575018 : numerical
    0.0337247795024 : probability
    0.0332250663039 : stochastic
    0.0329423109591 : algorithms
    0.0312462318759 : optimizati

### Give a label to each of them.

**topic 1: semester project** <br>
project, data, report, materials, research, work, engineer, basic, study, theory

**topic 2: finance analysis** <br>
financial, price, finance, linear, numerical, probability, stochastic, algorithms, optimization, market

**topic 3: financial market** <br>
financial, project, finance, price, data, market, edms, doctoral students, risk, corporate

**topic 4: probabilistic finance analysis** <br>
stochastic, linear, probability, financial, price, numerical, algorithms, equations, finance, optimization

**topic 5: body physics** <br>
energy, heat, kinetics, transport, cell, bioprocess, chemistry, fluid, molecular, properties

**topic 6: corporate finance** <br>
financial, finance, price, risk, corporate finance, corporate, market, valuation, asset, derivatives

**topic 7: pharmacology** <br>
drug, molecular, protein, dna, pharmacology, biophysical, scientific, literature, disease, medicine

**topic 8: sensorial pharmacology** <br>
drug, circuit, optical, pharmacology, sensors, signal, devices, disease, nanobiotechnological, digital

**topic 9: biokinetics** <br>
bioprocess, studio, architectural, downstream, kinetics, urban, drug, pharmacology, adsorption, precipitation

**topic 10: reactionary bioprocesses** <br>
circuit, bioprocess, downstream, analog, kinetics, signal, integrate, adsorption, bioreactors, dsp

## Exercise 4.6: Document similarity search in concept-space

In [7]:
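# cosine similarity in concept space between term t and document d,
# given their indexes (the document vector is scaled by S)
def docsim(t, d):
    term_vec = U[t]
    doc_vec = S * V[d]
    return np.dot(term_vec, doc_vec) / ( np.linalg.norm(term_vec) * np.linalg.norm(doc_vec) )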

### Implement a search function using LSI concept-space.

In [8]:
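# load the dictionaries that map the name of a
# term / course to its index
# (allow_pickle=True is required by recent NumPy versions to load a pickled dict)
words_dict = np.load("data/words_dict.npy", allow_pickle=True).item()
courses_dict = np.load("data/courses_dict.npy", allow_pickle=True).item()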

### Search for "facebook" in both VSM and LSI. How does it compare? Why?

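As before, the term "facebook" does not exist in the filtered list of terms. It therefore has no row in U, so LSI, like VSM, has nothing to compare the documents against: LSI can only relate terms that are part of the vocabulary of X.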
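
A search function in concept space has to handle such out-of-vocabulary queries explicitly. The following is a minimal sketch, not the implementation used in this lab: `lsi_search`, the toy vocabulary and the random matrix are hypothetical, and a plain `np.linalg.svd` stands in for `svds` to keep the example small. It averages the rows of U of the known query terms (one possible choice among others), reuses the cosine similarity of `docsim`, and returns an empty list when no query term is in the vocabulary.

```python
import numpy as np

def lsi_search(query_terms, words_dict, U, S, V, courses, top=5):
    """Rank courses by cosine similarity to the query in concept space."""
    # keep only the query terms that are in the vocabulary
    idx = [words_dict[t] for t in query_terms if t in words_dict]
    if not idx:
        return []  # e.g. "facebook": no row of U to search with

    # query vector: mean of the term vectors (rows of U)
    q = U[idx].mean(axis=0)
    # document vectors scaled by the singular values, as in docsim
    D = V * S
    scores = D @ q / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
    best = np.argsort(-scores)[:top]
    return [(courses[i], scores[i]) for i in best]

# toy data (assumption): 4 terms, 3 courses, 2 topics
rng = np.random.default_rng(0)
X_toy = rng.random((4, 3))
U_full, S_full, VT_full = np.linalg.svd(X_toy, full_matrices=False)
U_toy, S_toy, V_toy = U_full[:, :2], S_full[:2], VT_full[:2].T
words_toy = {"markov chain": 0, "data": 1, "project": 2, "theory": 3}
courses_toy = ["C-1", "C-2", "C-3"]

print(lsi_search(["facebook"], words_toy, U_toy, S_toy, V_toy, courses_toy))
print(lsi_search(["markov chain"], words_toy, U_toy, S_toy, V_toy, courses_toy))
```

With a single known term, the scores are the same values that `docsim` computes for that term against every course.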

### Search for "markov chains" and compare with the previous section.

In [9]:
# compute the docsim of each course with the term 'markov chain'
markov_index = words_dict['markov chain']
markov_sim = []

for i in range(N):
    markov_sim.append(docsim(markov_index, i))

In [10]:
# find and print the best 10 similarity scores
LSI_markov_top10 = np.argsort(-np.array(markov_sim))[:10]
top_markov_names = []

print("TOP 10 COURSES FOR 'MARKOV CHAIN'")
for i in LSI_markov_top10:
    top_markov_names.append([i, courses[i]])
    print(" ", markov_sim[i], ":", courses[i])

TOP 10 COURSES FOR 'MARKOV CHAIN'
  0.857005393917 : MATH-332
  0.767029694205 : COM-516
  0.725089702301 : MGT-484
  0.304445132341 : MATH-600
  0.268045333444 : COM-512
  0.209648482257 : FIN-408
  0.173026424099 : COM-308
  0.148197356075 : COM-417
  0.130716391711 : FIN-409
  0.119442811712 : EE-554


|         | top 1    | top 2   | top 3   | top 4    | top 5   |
|--------:|:--------:|:-------:|:-------:|:--------:|:-------:|
| **VSM** | MATH-332 | COM-516 | MGT-484 | FIN-408  | COM-308 |
| **LSI** | MATH-332 | COM-516 | MGT-484 | MATH-600 | COM-512 |

The top three courses are identical in both rankings. Ranks 4 and 5 differ, but the LSI results (MATH-600 and COM-512) come from sections that already appear in the top three, namely mathematics ("MATH") and communication systems ("COM"). The VSM courses at ranks 4 and 5 (FIN-408 and COM-308) appear at ranks 6 and 7 in LSI, so the two rankings remain close.
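
The same comparison can be made mechanically. This small sketch copies the VSM top 5 from the table and the LSI top 10 from the printed output, then looks up where each VSM result lands in the LSI ranking; the variable names are ours and only serve the illustration.

```python
# rankings copied from the table and the printed output
vsm_top5 = ["MATH-332", "COM-516", "MGT-484", "FIN-408", "COM-308"]
lsi_top10 = ["MATH-332", "COM-516", "MGT-484", "MATH-600", "COM-512",
             "FIN-408", "COM-308", "COM-417", "FIN-409", "EE-554"]

# 1-based rank of every VSM result in the LSI list (None if absent)
lsi_rank = {c: (lsi_top10.index(c) + 1 if c in lsi_top10 else None)
            for c in vsm_top5}

for vsm_pos, course in enumerate(vsm_top5, start=1):
    print(course, ": VSM rank", vsm_pos, "-> LSI rank", lsi_rank[course])
```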

## Exercise 4.7: Document-document similarity

### Find the classes that are the most similar to Internet Analytics.

In [11]:
# index of the internet analytics course
ia_index = courses_dict['COM-308']

### Write down the equation to efficiently compute the similarity between documents.

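Since the topic distribution of a document is entirely stored in V, we only have to compare the rows of V: row $V_i$ holds the coordinates of document $i$ along the K right singular vectors of X.

We use the cosine similarity between these rows, which is efficient because it only needs a dot product and two norms over K = 300 values instead of over all M terms:

$$\text{sim}(d_i, d_j) = \frac{V_i \cdot V_j}{\lVert V_i \rVert \, \lVert V_j \rVert}$$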
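
Because only the rows of V are compared, all similarities to one course can be obtained with a single matrix-vector product once the rows are normalized. A minimal sketch on a random stand-in for V (the name `V_toy`, its size and the helper `most_similar` are assumptions for illustration):

```python
import numpy as np

def most_similar(V, i, top=5):
    """Indexes and cosine similarities of the rows of V closest to row i."""
    # normalize every row once, so a dot product is a cosine similarity
    V_unit = V / np.linalg.norm(V, axis=1, keepdims=True)
    scores = V_unit @ V_unit[i]
    scores[i] = -1  # exclude the document itself
    best = np.argsort(-scores)[:top]
    return best, scores[best]

rng = np.random.default_rng(0)
V_toy = rng.standard_normal((6, 3))  # 6 documents, 3 topics
best, scores = most_similar(V_toy, 0, top=3)
print(best, scores)
```

This gives the same scores as calling a pairwise function in a loop, but avoids recomputing the norms for every pair.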

In [12]:
# similarity between two courses given their indexes
def sim(i, j):
    di = V[i]
    dj = V[j]
    return np.dot(di, dj) / ( np.linalg.norm(di) * np.linalg.norm(dj) )

### Print the top 5 classes most similar to COM-308.

In [13]:
# compute the similarity scores between
# the courses and 'COM-308'
ia_sim = []

for v in range(len(V)):
    if v == ia_index:
        # set the similarity of COM-308 with itself
        # to -1, to avoid it being one of the best
        ia_sim.append(-1)
    else:
        ia_sim.append(sim(v, ia_index))

In [14]:
# compute the best 5 similarity scores
ia_sim_top5 = np.argsort(-np.array(ia_sim))[:5]

print("TOP 5 COURSES SIMILAR TO 'COM-308'")
for i in ia_sim_top5:
    print(" ", ia_sim[i], ":", courses[i])

TOP 5 COURSES SIMILAR TO 'COM-308'
  0.501552676399 : CS-423
  0.394303998676 : EE-558
  0.353265127305 : CS-401
  0.332978132722 : EE-727
  0.305013338653 : FIN-525
