# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** K

**Names:**

* Xavier Jeanmonod
* Adrian Baudat
* Simon Wicky

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [2]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds
from utils import load_json, load_pkl
courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')
all_words = load_pkl('all_words.pkl')
course2index = load_pkl('course2index.pkl')
word2index = load_pkl('word2index.pkl')

## Exercise 4.4: Latent semantic indexing

In [6]:
%store -r

In [7]:
matrix

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.46401394, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [8]:
k = 300
U, S, V_t = svds(matrix,k)

#### U :

In [18]:
print(U.shape)
U

(854, 300)


array([[ 5.35456348e-02,  8.16702956e-04,  2.47378324e-02, ...,
         6.78886542e-03, -1.18039680e-02,  2.00308552e-02],
       [ 1.13462061e-02,  3.31666387e-02,  1.05037204e-02, ...,
        -7.35668577e-03, -1.00297396e-02,  1.58853730e-02],
       [-5.00459598e-02, -5.47267600e-02,  5.37818466e-02, ...,
         3.39894021e-02, -1.48901372e-02,  3.40809218e-02],
       ...,
       [-3.76463192e-03, -2.32400307e-02, -1.35407911e-02, ...,
         1.48076862e-01, -3.25653381e-02,  5.09026304e-02],
       [-1.96057837e-02, -1.82307035e-02, -8.63681070e-03, ...,
        -7.18049937e-02, -1.40234591e-02,  2.23131307e-02],
       [ 2.08457052e-03,  8.89393867e-05,  4.14205207e-02, ...,
         3.68519900e-02, -3.66485676e-02,  3.85971342e-02]])

U is mapping the rows, in this case terms, to topics. Each value in the row represent the force of the relation between term and topic.

#### V_t :

In [10]:
print(V_t.shape)
V_t

(300, 2088)


array([[ 0.00500376, -0.00164216,  0.01067023, ..., -0.03164159,
        -0.00160419,  0.01626721],
       [ 0.01693939,  0.00936552, -0.02908367, ...,  0.01430534,
        -0.0050507 , -0.00552539],
       [ 0.00404197, -0.0059113 , -0.00096147, ..., -0.01470767,
         0.00543182, -0.0342661 ],
       ...,
       [ 0.00031406,  0.00294958,  0.00400119, ..., -0.00345994,
         0.00106087, -0.00713993],
       [-0.00761473, -0.00391726, -0.00314732, ..., -0.00434616,
        -0.00200593, -0.00714724],
       [ 0.00889954,  0.00640221,  0.0044029 , ...,  0.00826982,
         0.00252748,  0.00718987]])

Vt is mapping the column(courses) to the topics. Each value in the column represent the force of the relation between course and topic.

#### S :

In [11]:
print(S.shape)
S

(300,)


array([10.61783528, 10.62073574, 10.62915801, 10.66096399, 10.69496227,
       10.70380934, 10.7274979 , 10.75216881, 10.76308428, 10.80473429,
       10.81304268, 10.82345581, 10.8581238 , 10.87180348, 10.8807253 ,
       10.91553557, 10.92234272, 10.94709278, 10.97747027, 11.0013864 ,
       11.00978029, 11.04099556, 11.07121323, 11.09325526, 11.11449681,
       11.1237999 , 11.16404964, 11.19058686, 11.19563277, 11.20418916,
       11.2282125 , 11.24903995, 11.27808652, 11.32324127, 11.3375831 ,
       11.34916399, 11.37095357, 11.38628902, 11.42183748, 11.43118986,
       11.45704221, 11.4834273 , 11.48948351, 11.51019888, 11.52611872,
       11.54315391, 11.5717567 , 11.58902016, 11.62509098, 11.63364049,
       11.66669042, 11.69997386, 11.72661119, 11.7572766 , 11.76670464,
       11.79572542, 11.82821968, 11.84465247, 11.85422317, 11.87517831,
       11.88279721, 11.91927028, 11.93756703, 11.95902396, 11.98171641,
       12.04623006, 12.06790702, 12.10833621, 12.1270061 , 12.13

S is the matrix of singular values. Thoses values shows how "strong" a topic is. A big value implies a "stronger" topic.

#### The top-20 eigenvalues of X : 

In [16]:
for singular_value in S[-20:][::-1]:
    print(singular_value*singular_value)

3331.826342633921
1782.5840172912424
1202.8285615102295
1162.7114948217918
1022.6540433309004
927.3071625849542
915.8785920753626
864.4198459047226
806.9675313841718
751.4447932612719
725.3521959560571
688.8953553677352
661.4565085954019
658.4175238842671
652.1209015146225
615.5041201009582
601.4880901488509
585.2491121648318
563.0992365677002
554.6357229370088


## Exercise 4.5: Topic extraction

As the singular values are ordered, the indexes that we will focus on are the last 10.

#### 10 most importants terms for 10 most important topics

In [26]:
for v in range(1, 11):
    terms = []
    indexes = np.argsort(U[:,-v])[-10:]
    for i in indexes:
        for word, index in word2index.items():
            if index == i:
                terms.append(word)
    print('For topic ', v, ':')
    print(terms)
    print('\n')

For topic  1 :
['success', 'earth', 'discovery', 'categorize', 'adaptive', 'resolution', 'printing', 'order', 'broad', 'direct']


For topic  2 :
['success', 'earth', 'categorize', 'order', 'broad', 'discovery', 'adaptive', 'resolution', 'printing', 'direct']


For topic  3 :
['controlled', 'manipulation', 'security', 'manufacturing', 'integrals', 'economic', 'success', 'concept', 'geometry', 'reporting']


For topic  4 :
['bring', 'navigation', 'microfluidics', 'steady', 'legal', 'long', 'controlled', 'economic', 'integrals', 'success']


For topic  5 :
['plant', 'characterize', 'test', 'present', 'options', 'manipulation', 'signals', 'language', 'geometry', 'reporting']


For topic  6 :
['manipulation', 'multivariate', 'impacts', 'makes', 'randomized', 'geometry', 'continuum', 'reporting', 'converters', 'matched']


For topic  7 :
['taste', 'additional', 'investment', 'plant', 'themes', 'figures', 'matched', 'loading', 'actual', 'day']


For topic  8 :
['mission', 'efficient', 'manuf

#### 10 most important courses for 10 most important topics

In [33]:
for v in range(1, 11):
    top_courses = []
    indexes = np.argsort(V_t[:,-v])[-10:]
    for i in indexes:
        for courseId, index in course2index.items():
            if index == i:
                top_courses.append((courseId,courses[index]["name"]))
    print('For topic ', v, ':')
    print(top_courses)
    print('\n')

For topic  1 :
[('MSE-465', 'Thin film fabrication processes'), ('BIOENG-433', 'Biotechnology lab (for CGC)'), ('CH-711', 'Inorganic chemistry "Applications and spin-offs"'), ('MGT-453', 'Industry dynamics, models & trends'), ('HUM-429(a)', 'Philosophy of life sciences I'), ('ChE-421', 'Advanced principles and applications of systems biology'), ('PENS-306', 'Mapping urban history'), ('MATH-332', 'Applied stochastic processes'), ('CIVIL-429', 'Reservoir mechanics for geo-energy and the environment'), ('CS-422', 'Database systems')]


For topic  2 :
[('CH-442', 'Photochemistry I'), ('MGT-641(b)', 'Technology and Public Policy - (b) Technology, policy and regulation'), ('BIO-479', 'Immunology'), ('CS-622', 'Privacy Protection'), ('MGT-602', 'Mathematical models in supply chain management'), ('ChE-421', 'Advanced principles and applications of systems biology'), ('CH-711', 'Inorganic chemistry "Applications and spin-offs"'), ('ME-484', 'Numerical methods in biomechanics'), ('BIO-617', 'Pra

We can then "label" the 10 topics from the informations taken from U and V_t (terms and courses) :

Topic 1 : Science of life

Topic 2 : Chemistry/Biology

Topic 3 : Geology/Finance

Topic 4 : Ecology/Transport

Topic 5 : Biological Computer Science

Topic 6 : Markets/Financial Engineering

Topic 7 : Applied Physics

Topic 8 : "Solutions for the Future"

Topic 9 : Molecular Structures

Topic 10 : Theory of computation

## Exercise 4.6: Document similarity search in concept-space

In [18]:
def sim(t,d):
    Sig = np.diag(S)
    V_d = V_t.T[d]
    SV_d = np.dot(Sig,V_d)
    return np.dot(U[t],np.dot(Sig,V_d.T))/(np.sqrt(np.dot(U[t],U[t]))*np.sqrt(np.dot(SV_d,SV_d)))

In [None]:
def search(term):
    sim(term, d) for d in range(courses)

## Exercise 4.7: Document-document similarity