# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** J

**Names:**

* Rafael Bischof
* Jeniffer Lima Graf
* Alexander Sanchez

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [185]:
import pickle
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds
from utils import load_json

In [186]:
# Load the index mappings and the course descriptions
with open('data/wordsIdx.pickle', 'rb') as handle:
    wordsIdx = pickle.load(handle)
with open('data/coursesIdx.pickle', 'rb') as handle:
    coursesIdx = pickle.load(handle)
with open('data/idxWords.pickle', 'rb') as handle:
    idxWords = pickle.load(handle)
with open('data/idxCourses.pickle', 'rb') as handle:
    idxCourses = pickle.load(handle)
courses = load_json('data/courses.txt')

In [187]:
TFIDF = sparse.load_npz("data/TFIDF.npz")

## Exercise 4.4: Latent semantic indexing

In [188]:
U, S, Vt = svds(TFIDF, k=300)

In [189]:
Vt.shape

(300, 854)

Each row of U gives the topic weights of one word.<br>
Each row of V (i.e. each column of Vt) gives the topic weights of one course.<br>
The singular values in S indicate how important each topic is within the corpus.
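
As a minimal sketch of the three factors U, S and Vt, the snippet below decomposes a tiny made-up term-document matrix. It uses `np.linalg.svd` rather than `svds` only to keep the example small; the matrix `X`, its values and the choice `k = 2` are illustrative assumptions, not lab data. Note that `np.linalg.svd` sorts the values in S in descending order, whereas the cells of this notebook read `S` from the end.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents); values are made up.
X = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
])

# Thin SVD: X = U @ diag(S) @ Vt, with the values in S sorted in descending order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent topics to keep (arbitrary choice for this sketch)
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

# Rank-k approximation of X built from the k strongest topics.
X_k = U_k @ np.diag(S_k) @ Vt_k

print("U_k shape (terms x topics):", U_k.shape)
print("Vt_k shape (topics x documents):", Vt_k.shape)
print("Relative reconstruction error:", np.linalg.norm(X - X_k) / np.linalg.norm(X))
```

The reconstruction error depends only on the values of S that were dropped, which is one way to judge how many topics are worth keeping.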

In [190]:
# S is read from the end so that the largest values are printed first
print('Top 20 singular values:')
print(S[:-21:-1])

Top 20 singular values:
[ 61.35243133  51.44264116  48.31807753  39.9855274   37.95878698
  36.03210315  35.17925363  34.7430648   34.32952171  33.49450045
  32.97041157  32.68043096  32.2246354   31.5077868   31.21793785
  30.42262433  30.32133598  29.83256487  29.75958445  29.23623875]


## Exercise 4.5: Topic extraction

In [191]:
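# The 10 strongest topics are the last 10 columns of U (rows of Vt), read in reverse.
# For each of them, keep the indices of the 10 words / courses with the largest weights.
topWordsPerTopic = np.argsort(-U[:, :-11:-1].T)[:,:10]
topCoursesPerTopic = np.argsort(-Vt[:-11:-1])[:,:10]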
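
The slicing above can be hard to read, so here is a small sketch on a made-up matrix (the names `U_toy`, `n_topics` and `n_top` are illustrative assumptions). It assumes, as in this notebook, that the strongest topics sit in the last columns. `[:, :-3:-1]` selects the last two columns in reverse order, and `np.argsort` on the negated values ranks the rows of each selected column from largest to smallest weight.

```python
import numpy as np

# Made-up "U" with 4 words (rows) and 3 topics (columns);
# the strongest topic is assumed to be the last column.
U_toy = np.array([
    [0.1, 0.9, 0.2],
    [0.4, 0.1, 0.7],
    [0.8, 0.3, 0.1],
    [0.2, 0.5, 0.6],
])

n_topics, n_top = 2, 3

# Last n_topics columns, reversed: column 2 first, then column 1.
strongest = U_toy[:, :-(n_topics + 1):-1]

# One row per topic; each row lists word indices by decreasing weight.
top_words = np.argsort(-strongest.T)[:, :n_top]
print(top_words)
```

Because the singular vectors of an SVD are only defined up to sign, ranking by the largest positive weight is a convention; ranking by absolute value would be an alternative.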

In [197]:
labels = ['Semester project', 'Research project', 'Bio labs', 'Research project', 'Nanobiology', 'PhD', 'Chemistry', 'Microscopy', 'Conference', 'Electrical engineering']

In [198]:
dash = '-' * 100
for t in range(10):
    # Topic t is stored in column -t-1 of U and in row -t-1 of Vt
    print('\nTopic', t+1, 'Singular value:', S[-t-1])
    print('{:<40s}{:<40s}'.format('Top words', 'Top courses'))
    for i in range(10):
        w = topWordsPerTopic[t][i]
        c = topCoursesPerTopic[t][i]
        print('{:<10f}{:<30s}{:<10f}{:<30s}'.format(U[w, -t-1], idxWords[w],
                                                    Vt[-t-1, c], courses[c]['name'][:50]))
    print('\nLabel:', labels[t])
    print(dash)


Topic 1 Singular value: 61.3524313251
Top words                               Top courses                             
0.095087  report                        0.257114  Project in computer science II
0.093745  research                      0.231480  Lab immersion III             
0.092791  data                          0.180830  Lab immersion I               
0.088504  scientific                    0.152012  Optional project in computer science
0.084590  presentation                  0.139024  Project in bioengineering and biosciences
0.079975  problem                       0.132726  Lab immersion II              
0.074883  oral                          0.119601  Semester project in Bioengineering
0.074724  literature                    0.108973  Optional project in communication systems
0.072443  plan                          0.104406  Renewable energy and solar architecture in Davos
0.072113  oral presentation             0.099614  Nanobiotechnology and biophysics

Label: Semester project

## Exercise 4.6: Document similarity search in concept-space

In [213]:
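def sim(t, d):
    """Cosine similarity between term t and document d in concept space."""
    doc = S * Vt.T[d]  # document vector scaled by the singular values
    return (U[t] @ doc) / (np.linalg.norm(U[t]) * np.linalg.norm(doc))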
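
To illustrate what this similarity measures, here is a minimal sketch on a made-up term-document matrix (the matrix `X` and the helper name `term_doc_similarity` are assumptions for illustration, and `np.linalg.svd` stands in for `svds`). When all singular values are kept, the numerator of the cosine is exactly the original matrix entry; truncating to fewer topics replaces it with a smoothed estimate.

```python
import numpy as np

# Made-up term-document matrix (rows = terms, columns = documents).
X = np.array([
    [3.0, 0.0, 1.0],
    [0.0, 2.0, 0.0],
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
])

U, S, Vt = np.linalg.svd(X, full_matrices=False)

def term_doc_similarity(t, d):
    """Cosine similarity between term t and document d in concept space."""
    term_vec = U[t]           # term t expressed over the latent topics
    doc_vec = S * Vt[:, d]    # document d, scaled by the singular values
    return (term_vec @ doc_vec) / (np.linalg.norm(term_vec) * np.linalg.norm(doc_vec))

# With every topic kept, the unnormalised score reproduces the entry X[0, 0].
print(U[0] @ (S * Vt[:, 0]))

# Term 1 only occurs in document 1, so that document should score highest.
scores = [term_doc_similarity(1, d) for d in range(X.shape[1])]
print(int(np.argmax(scores)))
```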

In [216]:
markov = [sim(wordsIdx['markov chain'], d) for d in idxCourses]
facebook = [sim(wordsIdx['facebook'], d) for d in idxCourses]

In [228]:
topMarkov = [(markov[c], courses[c]['name']) for c in np.argsort(markov)[:-6:-1]]
topFacebook = [(facebook[c], courses[c]['name']) for c in np.argsort(facebook)[:-6:-1]]
print('Best courses for markov chain:')
for x, y in topMarkov:
    print('%.3f %s' % (x, y))

print('\nBest courses for facebook:')
for x, y in topFacebook:
    print('%.3f %s' % (x, y))


Best courses for markov chain:
0.893 Markov chains and algorithmic applications
0.781 Applied probability & stochastic processes
0.707 Applied stochastic processes
0.357 Networks out of control
0.331 Advanced probability and applications

Best courses for facebook:
0.941 Computational Social Media
0.693 Social media
0.375 How people learn I
0.285 Internet analytics
0.281 Media security

## Exercise 4.7: Document-document similarity

In [229]:
def simDocs(i, j):
    """Cosine similarity between documents i and j in concept space."""
    di = Vt.T[i]
    dj = Vt.T[j]
    return (di @ dj) / (np.linalg.norm(di) * np.linalg.norm(dj))
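
As a sketch of a possible alternative, all pairwise document similarities can be computed at once by normalising the document vectors and taking a single matrix product. The matrix `V_toy` below is made up (one row per document, one column per topic) and the helper `cosine` is only there to cross-check one entry; neither is part of the lab code.

```python
import numpy as np

# Made-up concept-space matrix: one row per document, one column per topic.
V_toy = np.array([
    [0.9, 0.1],
    [0.8, 0.3],
    [0.1, 0.7],
])

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalise each row to unit length; the Gram matrix then holds all cosines.
V_unit = V_toy / np.linalg.norm(V_toy, axis=1, keepdims=True)
all_sims = V_unit @ V_unit.T

# Nearest neighbours of document 0, excluding document 0 itself.
order = np.argsort(-all_sims[0])
neighbours = [int(j) for j in order if j != 0]
print(neighbours)
```

Excluding the query by index, rather than dropping the first result, avoids relying on the query always being ranked first.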

In [231]:
ix = [simDocs(coursesIdx['COM-308'], c) for c in idxCourses]
# Take the 6 best matches; the first one is expected to be COM-308 itself and is skipped below
topIx = [(ix[c], courses[c]['name']) for c in np.argsort(ix)[:-7:-1]]
print('Most similar courses to Internet Analytics:')
for x, y in topIx[1:]:
    print('%.3f %s' % (x, y))

Most similar courses to Internet Analytics:
0.541 Distributed information systems
0.331 Financial big data
0.329 Applied data analysis
0.291 Computational Social Media
0.264 A Network Tour of Data Science