# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** K

**Names:**

* Xavier Jeanmonod
* Adrian Baudat
* Simon Wicky

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds
from utils import load_json, load_pkl
courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')
all_words = load_pkl('all_words.pkl')
course2index = load_pkl('course2index.pkl')
word2index = load_pkl('word2index.pkl')

## Exercise 4.4: Latent semantic indexing

Before starting, note that in this part we use the version of the matrix that keeps the words appearing only once (such as "facebook"), since they are needed in a later sub-exercise.

In [2]:
%store -r

In [3]:
matrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [4]:
matrix.shape

(854, 15084)

In [5]:
k = 300
U, S, V_t = svds(matrix,k)

#### U :

In [6]:
print(U.shape)
U

(854, 300)


array([[-0.02268918,  0.02296011,  0.08074571, ..., -0.00615914,
        -0.0111635 , -0.01853261],
       [ 0.00644516, -0.001569  ,  0.02231831, ...,  0.00590896,
        -0.00927564, -0.01547253],
       [ 0.02474889,  0.0073882 ,  0.03458832, ..., -0.02814478,
        -0.01518194, -0.03296627],
       ...,
       [ 0.00025759,  0.03449281,  0.01753479, ..., -0.11729665,
        -0.03337629, -0.05017098],
       [ 0.00443657,  0.01049228, -0.00738276, ...,  0.05652301,
        -0.01425889, -0.02263149],
       [ 0.00844462,  0.00439351, -0.00815017, ..., -0.01893227,
        -0.0351451 , -0.03766727]])

U maps the rows (here, the courses) to topics. Each value in a row represents the strength of the relation between that course and a topic.

#### V_t :

In [7]:
print(V_t.shape)
V_t

(300, 15084)


array([[ 2.67285860e-03,  1.23772981e-02, -3.82495177e-04, ...,
        -1.35336654e-03, -2.89373096e-02,  2.94617282e-03],
       [-5.23307048e-04, -1.39375297e-04,  8.74696329e-06, ...,
         1.29707445e-04,  4.59221938e-04,  7.08964128e-04],
       [ 2.78861709e-03,  2.43489122e-03,  6.48166365e-04, ...,
         2.89141047e-05,  1.49177048e-02,  1.80106420e-03],
       ...,
       [-7.27293897e-04,  1.67416621e-03,  3.60799701e-03, ...,
         5.58226583e-04,  2.43202958e-02,  8.32244498e-05],
       [-2.00508609e-03, -1.09775088e-03, -9.06381074e-04, ...,
        -3.61922682e-04, -1.33296148e-02, -1.81499829e-04],
       [-2.29073425e-03, -8.12133706e-04, -1.02802441e-03, ...,
        -4.43996313e-04, -1.19143717e-02, -1.37813739e-04]])

V_t maps the columns (the terms) to topics. Each value in a column represents the strength of the relation between that term and a topic.

#### S :

In [8]:
print(S.shape)
S

(300,)


array([14.58908517, 14.61303118, 14.62856051, 14.64281564, 14.68347555,
       14.71609496, 14.72414101, 14.7410776 , 14.7555014 , 14.77397972,
       14.79343359, 14.81247579, 14.84991953, 14.88279825, 14.8952375 ,
       14.9164187 , 14.95366958, 14.96127146, 15.00082855, 15.01803266,
       15.0403938 , 15.05462628, 15.08299704, 15.11826093, 15.13239177,
       15.14351906, 15.173214  , 15.18727269, 15.22265818, 15.23907516,
       15.2719911 , 15.29163237, 15.31346592, 15.32439113, 15.34173508,
       15.36811342, 15.38872544, 15.43358504, 15.45997672, 15.47191087,
       15.49025909, 15.53155289, 15.55252486, 15.55538985, 15.59463359,
       15.60288057, 15.64006408, 15.67274239, 15.69356033, 15.70964046,
       15.72402955, 15.77028582, 15.79309517, 15.81296436, 15.84243488,
       15.86848064, 15.9022868 , 15.94148775, 15.96034776, 15.99118629,
       16.00632682, 16.02124652, 16.05902298, 16.08720169, 16.13572849,
       16.1390365 , 16.16263213, 16.20007044, 16.2203879 , 16.24

S holds the singular values (the diagonal of $\Sigma$, returned as a vector). Each value shows how "strong" a topic is: a larger value means a stronger topic.
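
The roles of the three factors can be checked on a small example. The sketch below uses a made-up 4×5 matrix (`toy`, not the course data) and only assumes that `svds` returns the `k` largest singular triplets. Since the return order is best not relied upon, the triplets are sorted explicitly.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy document-term matrix (4 documents x 5 terms); values are illustrative only.
toy = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
])

k = 2  # svds requires k < min(toy.shape)
U_toy, S_toy, Vt_toy = svds(toy, k=k)

# Sort the triplets from the strongest to the weakest topic.
order = np.argsort(S_toy)[::-1]
U_toy, S_toy, Vt_toy = U_toy[:, order], S_toy[order], Vt_toy[order, :]

# Rank-k approximation of the original matrix.
toy_approx = U_toy @ np.diag(S_toy) @ Vt_toy

print(U_toy.shape, S_toy.shape, Vt_toy.shape)  # (4, 2) (2,) (2, 5)
```

Each row of `U_toy` describes one document in topic space, each column of `Vt_toy` describes one term, and `S_toy` weights the topics.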

#### The top-20 eigenvalues of X : 

In [9]:
for singular_value in S[-20:][::-1]:
    print(singular_value*singular_value)

3489.809424671651
2043.7942798555757
1401.6218034267595
1360.0512253679583
1238.354481013834
1192.1456925719333
1146.9299796083835
1090.3571629562127
1077.3117669864378
1002.158658295746
935.7633478290616
935.2457879985973
905.7352957376381
883.6427459159333
876.6208789055595
856.2455539745112
825.9660303413124
812.3418842576767
803.7758108674012
777.775404544573
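
Since $X$ is rectangular ($854 \times 15084$), the values printed above are best read as the largest eigenvalues of $X^\top X$ (equivalently, the largest eigenvalues of $X X^\top$), assuming this is what the exercise means by "eigenvalues of X". Their link with the singular values follows from the decomposition, using $U^\top U = I$:

```latex
X = U \Sigma V^\top
\quad\Longrightarrow\quad
X^\top X = V \Sigma U^\top U \Sigma V^\top = V \Sigma^2 V^\top ,
\qquad \lambda_i = \sigma_i^2 .
```

For instance, the largest printed value, 3489.81, corresponds to a singular value of about 59.07.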


## Exercise 4.5: Topic extraction

Since the singular values in S are sorted in ascending order, the 10 most important topics correspond to the last 10 indices.

#### 10 most important terms for the 10 most important topics

In [10]:
for v in range(1, 11):
    terms = []
    indexes = np.argsort(V_t[-v,:])[-10:]
    for i in indexes:
        for word, index in word2index.items():
            if index == i:
                terms.append(word)
    print('For topic ', v, ':')
    print(terms)
    print('\n')

For topic  1 :
['snowpackformulate', 'snowatmosphere', 'eng272', 'transmits', 'vegetation', 'coverclimate', 'snowairground', 'pack', 'avalanches', 'metamorphism']


For topic  2 :
['reports', 'projects', 'host', 'head', 'supervising', 'laboratorybased', 'experiments', 'obtained', 'wetlab', 'experimentation']


For topic  3 :
['spectroscopy', 'electron', 'note', 'thin', 'chemical', 'optical', 'cell', 'protein', 'molecular', 'microscopy']


For topic  4 :
['expound', 'compose', 'assesses', 'audience', 'coherently', 'applies', 'subject', 'form', 'laboratories', 'acquired']


For topic  5 :
['territorial', 'urban', 'doctoral', 'architectural', 'cell', 'edms', 'laba', 'architecture', 'development', 'studio']


For topic  6 :
['note', 'drug', 'molecular', 'cells', 'cellular', 'biology', 'doctoral', 'edms', 'protein', 'cell']


For topic  7 :
['growth', 'alloys', 'asset', 'corporate', 'retreat', 'markets', 'finance', 'pricing', 'risk', 'financial']


For topic  8 :
['recrystallization', 'prio
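
The cell above finds each word by scanning `word2index` once per selected index. A possible alternative, sketched here on made-up data (`toy_word2index`, `toy_V_t` and `top_terms` are hypothetical names, not part of the lab code), builds the inverse dictionary once and lists the strongest term first instead of last.

```python
import numpy as np

# Hypothetical toy vocabulary and topic-term matrix (2 topics x 5 terms).
toy_word2index = {"snow": 0, "cell": 1, "risk": 2, "urban": 3, "protein": 4}
toy_V_t = np.array([
    [0.1, 0.7, 0.0, 0.2, 0.6],  # weaker topic
    [0.9, 0.0, 0.1, 0.3, 0.0],  # stronger topic (last row, as with ascending singular values)
])

# Build the inverse mapping once instead of scanning the dictionary per term.
toy_index2word = {index: word for word, index in toy_word2index.items()}

def top_terms(V_t, index2word, topic_rank, n_terms):
    """Return the n_terms highest-weighted words of the topic_rank-th strongest topic."""
    row = V_t[-topic_rank, :]
    best = np.argsort(row)[-n_terms:][::-1]  # strongest term first
    return [index2word[i] for i in best]

print(top_terms(toy_V_t, toy_index2word, topic_rank=1, n_terms=2))  # ['snow', 'urban']
print(top_terms(toy_V_t, toy_index2word, topic_rank=2, n_terms=2))  # ['cell', 'protein']
```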

#### 10 most important courses for 10 most important topics

In [11]:
for v in range(1, 11):
    top_courses = []
    indexes = np.argsort(U[:,-v])[-10:]
    for i in indexes:
        for courseId, index in course2index.items():
            if index == i:
                top_courses.append((courseId,courses[index]["name"]))
    print('For topic ', v, ':')
    print(top_courses)
    print('\n')

For topic  1 :
[('MICRO-600', 'Emerging Nanopatterning Methods'), ('COM-414', 'Satellite communications  systems and networks'), ('CH-404', 'Laboratory information management systems (LIMS)'), ('ENV-525', 'Physics and hydrology of snow'), ('EE-466', 'Energy storage in power systems: technologies, applications and future needs'), ('AR-202(c)', 'Studio BA4 (De Vylder & Taillieu)'), ('BIO-430', 'Multidisciplinary organization of medtechs/biotechs'), ('BIO-699(n)', 'Training Rotation (EDNE)'), ('MGT-690(A)', 'Field Research Project A'), ('MGT-690(B)', 'Field Research Project B')]


For topic  2 :
[('MSE-490(b)', 'Research project in materials II'), ('BIOENG-489', 'Semester project in Bioengineering'), ('BIO-502', 'Lab immersion II'), ('BIO-505', 'Lab immersion academic (outside EPFL) B'), ('BIO-504', 'Lab immersion academic (outside EPFL) A'), ('BIO-506', 'Lab immersion in industry A'), ('BIO-501', 'Lab immersion I'), ('BIO-507', 'Lab immersion in industry B'), ('BIO-503', 'Lab immersion I

We can then "label" the 10 topics using the information taken from U and V_t (courses and terms):

Topic 1 : Climate and energy

Topic 2 : Lab immersion / Bioengineering

Topic 3 : Biotechnology

Topic 4 : Projects

Topic 5 : Architecture

Topic 6 : Biology

Topic 7 : Finance

Topic 8 : Machine Learning - Artificial Intelligence

Topic 9 : Finance theory

Topic 10 : Bioelectronics

## Exercise 4.6: Document similarity search in concept-space

In [12]:
def sim(U_t,V_d):
    Sig = np.diag(S)
    return ((np.dot(U_t, np.dot(Sig, V_d)))/(np.linalg.norm(U_t) * np.linalg.norm(np.dot(S, V_d))))
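
In `sim`, `S` is a one-dimensional array, so `np.dot(S, V_d)` evaluates to a single number and the denominator is not a product of two vector norms. The resulting scores are therefore not confined to $[-1, 1]$. A possible variant, given only as a sketch on made-up vectors (`sim_normalized`, `toy_S`, `u` and `v` are illustrative names), normalizes by the norm of the scaled term vector instead:

```python
import numpy as np

def sim_normalized(u_doc, v_term, singular_values):
    """Cosine between a document vector and a term vector scaled by the singular values."""
    scaled_term = singular_values * v_term  # same as np.diag(singular_values) @ v_term
    denom = np.linalg.norm(u_doc) * np.linalg.norm(scaled_term)
    return float(np.dot(u_doc, scaled_term) / denom) if denom > 0 else 0.0

# Made-up values, for illustration only.
toy_S = np.array([1.0, 2.0, 4.0])
u = np.array([0.2, -0.1, 0.5])
v = np.array([0.3, 0.4, 0.1])

score = sim_normalized(u, v, toy_S)
print(score)
```

With this normalization each query word contributes a value between -1 and 1, so a two-word query yields a total between -2 and 2.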

In [27]:
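def search(term):
    # Accumulate, for every course, its similarity to each word of the query.
    results = np.zeros(len(courses))
    for word in term.split():
        t = word2index[word]
        V_d = V_t[:, t]  # concept-space vector of the query word
        for i in course2index.values():
            results[i] += sim(U[i], V_d)
    # Map row indices back to course IDs once, then print the 5 best matches.
    index2course = {index: course for course, index in course2index.items()}
    for top in np.argsort(results)[::-1][:5]:
        print(index2course[top], courses[top]["name"], results[top])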

In [28]:
search('markov chains')

COM-516 Markov chains and algorithmic applications 3.244641440942266
MGT-484 Applied probability & stochastic processes 3.221051698608126
MATH-332 Applied stochastic processes 3.1718456268851747
EE-605 Statistical Sequence Processing 2.0175325733666796
COM-512 Networks out of control 1.7203296404840267


All five courses seem relevant to the query, and most of them also appeared in the results of the previous section.

In [29]:
search('facebook')

EE-727 Computational Social Media 1.125913852505279
EE-593 Social media 0.8663649477410287
CS-622 Privacy Protection 0.4288917563976208
EE-552 Media security 0.3963143712998568
CS-423 Distributed information systems 0.3364855446809766


Here we get 5 courses even though "facebook" appears only once in the corpus. This method therefore returns more results, but the similarity scores are much lower than in the previous section. The top-ranked course is the same in both cases.

## Exercise 4.7: Document-document similarity

To compute the document-document similarity, we use the cosine similarity between the courses' concept-space vectors:

In [16]:
def cosine_similarity(doc1, doc2):
    Sig = np.diag(S)
    d1 = np.dot(Sig,U[doc1])
    d2 = np.dot(Sig,U[doc2])
    return (np.dot(d1, d2)/(np.linalg.norm(d1) * np.linalg.norm(d2)))
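
Written out, with $u_i$ denoting row $i$ of $U$ and $\Sigma = \operatorname{diag}(S)$, the function above computes:

```latex
\operatorname{sim}(d_i, d_j)
= \frac{(\Sigma u_i)^\top (\Sigma u_j)}
       {\lVert \Sigma u_i \rVert \, \lVert \Sigma u_j \rVert}
```

Scaling by $\Sigma$ gives stronger topics more weight in the comparison.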

In [17]:
COM308 = course2index['COM-308']

In [18]:
all_other_courses_index = [index for course, index in course2index.items() if not index == COM308]

In [19]:
similar = np.array(list(map(lambda d: cosine_similarity(COM308,d), all_other_courses_index)))

In [21]:
top = np.argsort(-similar)[:5]

In [22]:
print("Top 5 courses similar to COM-308 :")
for i in top:
        for course, index in course2index.items():
            if index == i:
                print(course, courses[index]["name"])

Top 5 courses similar to COM-308 :
ENG-430 Risk management
MATH-470 Martingales in financial mathematics
MGT-409 D. Thinking: real problems, human-focused solutions
MATH-726(2) Working group in Topology II
CS-452 Foundations of software
