# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** K

**Names:**

* Kim Lan Phan Hoang
* Robin Lang

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl

from nltk.stem.wordnet import WordNetLemmatizer

# load the files and initialize the punctuation to remove
courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')
punctuation = '.?!,;:-–()[]{}"/\'0123456789%*+='

## Exercise 4.1: Pre-processing

### Pre-process the corpus to create bag-of-words representations of each document

In [2]:
# initialize the minimum and maximum times any term may appear
# terms that occur MIN times or fewer,
# or MAX times or more, get filtered
MIN = 10
MAX = 2000

# note: downloading wordnet is required
# command: python -m nltk.downloader
#   in the console
lemmatizer = WordNetLemmatizer()

In [3]:
# function to construct n-grams of any length n
# based on the solution found on
# http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/
def find_ngrams(input_list, n):
    # join all n tokens of each window (not only the first two),
    # so that the function also works for n other than 2
    return [" ".join(x) for x in zip(*[input_list[i:] for i in range(n)])]
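
As a quick illustration of the shingling idea, here is a minimal, self-contained sketch on a toy token list. The helper name `ngrams` and the tokens are assumptions made for this example only; they are not part of the lab code.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a single space."""
    # zip over n shifted views of the list; zip stops at the shortest view
    return [" ".join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]


# hypothetical tokens, as they might look after stopword removal
tokens = ["markov", "chain", "monte", "carlo"]

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
too_long = ngrams(tokens, 5)  # no window of 5 fits into 4 tokens

print(bigrams)
print(trigrams)
print(too_long)
```

A list of k tokens yields k - n + 1 n-grams, and none at all if n is larger than k.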

In [4]:
courses_proc = {}
words = {}

# construct the dictionaries for the courses and terms
for c in courses:
    cid = c['courseId']
    
    # transfer all to lowercase, remove punctuation and numbers
    desc = c['description'].lower().translate(str.maketrans('', '', punctuation))
    # remove stopwords and lemmatize the remaining words (as verbs)
    desc_proc = [lemmatizer.lemmatize(word, pos='v') for word in desc.split() if word not in stopwords]
    
    desc_2grams = find_ngrams(desc_proc, 2)
    desc_proc.extend(desc_2grams)
    
    # create a dict of all words
    for w in desc_proc:
        if w in words:
            words[w] += 1
        else:
            words[w] = 1
    
    if cid in courses_proc:
        # some courseIds appear multiple times
        # for those, we decided to append the descriptions to each other
        courses_proc[cid].extend(desc_proc)
    else:
        courses_proc[cid] = desc_proc

In [5]:
words_filtered = []

# filter most and least frequent words
for w in words:
    if words[w] > MIN and words[w] < MAX:
        words_filtered.append(w)

In [6]:
courses_proc2 = {}

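# remove all the words that have been filtered out from the descriptions
# (a set makes each membership test O(1) instead of a scan of the list)
words_kept = set(words_filtered)
for c in courses_proc:
    courses_proc2[c] = [word for word in courses_proc[c] if word in words_kept]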

### Explain which kinds of cleaning you implemented and why

* Make sure every courseId appears only once. If it appears more than once, the descriptions are concatenated.
* Change all characters to lowercase, so that the same word is not counted as two different terms. Additionally, all words in stopwords.pkl are lowercase, so this ensures that stopwords are matched and removed correctly.
* Remove punctuation, numbers and other symbols, for the same reason as above. Only the words are important, not percentages or numbers.
* Remove stopwords, as these don't carry any information about the content of the document.
* Filter the most and least frequent terms, as they distort the results when looking for similarities between documents: common terms appear in almost every document, rare ones in almost none.
* Lemmatization, so that different forms of the same word count as one term. Inflected forms such as "is" and "are" are both transformed to "be", and the same holds for plurals ("documents" -> "document"). Since we lemmatize every word as a verb (`pos='v'`), some plural nouns are left unchanged (e.g. "activities" in the output below).
* n-grams: 2-shingles (pairs of consecutive words), to detect frequent combinations of words. Infrequent shingles are filtered just like infrequent words. We did not use higher orders for the sake of performance.
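
The cleaning steps listed above can be sketched on a toy corpus as follows. All names and data in this block (`toy_docs`, `toy_stopwords`, `TOY_MIN`, `TOY_MAX`, ...) are hypothetical stand-ins for illustration, and lemmatization is left out on purpose so that the sketch runs without the WordNet download.

```python
from collections import Counter

# hypothetical toy data; the notebook loads courses and stopwords from disk
toy_docs = {
    "A": "Markov chains, and more Markov chains!",
    "B": "The Markov property of a chain.",
}
toy_stopwords = {"and", "more", "the", "of", "a"}
toy_punctuation = ".,!"


def tokenize(text):
    """Lowercase, strip punctuation, drop stopwords, then append bigrams."""
    cleaned = text.lower().translate(str.maketrans("", "", toy_punctuation))
    unigrams = [w for w in cleaned.split() if w not in toy_stopwords]
    bigrams = [" ".join(pair) for pair in zip(unigrams, unigrams[1:])]
    return unigrams + bigrams


tokens = {doc_id: tokenize(text) for doc_id, text in toy_docs.items()}

# corpus-wide count of every term (unigrams and bigrams alike)
counts = Counter(t for doc in tokens.values() for t in doc)

# keep terms whose corpus count lies strictly between the two thresholds
TOY_MIN, TOY_MAX = 1, 3
vocabulary = {t for t, c in counts.items() if TOY_MIN < c < TOY_MAX}
filtered = {d: [t for t in doc if t in vocabulary] for d, doc in tokens.items()}

print(sorted(vocabulary))
print(filtered)
```

In this toy corpus "markov" is dropped for being too frequent, and "chain" (singular) is dropped for being too rare because, without lemmatization, it is a different term from "chains".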

### Print the terms in the pre-processed description of the $9^{th}$ class in alphabetical order.

In [7]:
# print the full description of the 9th course
print(sorted(courses_proc2[courses[9]['courseId']]))

['abstract', 'abstract', 'abstract', 'activities', 'activities', 'activities', 'activities attend', 'activities attend', 'activities attend', 'al', 'al', 'al', 'algorithmic', 'algorithmic', 'algorithmic', 'architecture', 'architecture', 'architecture', 'assessment', 'assessment', 'assessment', 'assessment methods', 'assessment methods', 'assessment methods', 'assistants', 'assistants', 'assistants', 'assistants forum', 'assistants forum', 'assistants forum', 'attend', 'attend', 'attend', 'attend lecture', 'attend lecture', 'attend lecture', 'automation', 'automation', 'automation', 'basic', 'basic', 'basic', 'bibliography', 'bibliography', 'bibliography', 'bibliothèque', 'bibliothèque', 'bibliothèque', 'build', 'build', 'build', 'challenge', 'challenge', 'challenge', 'cod', 'cod', 'cod', 'combinational', 'combinational', 'combinational', 'comment', 'comment', 'comment', 'complete', 'complete', 'complete', 'complete exercise', 'complete exercise', 'complete exercise', 'complex', 'comple

## Exercise 4.2: Term-document matrix

### Construct an M×N term-document matrix X, where M is the number of terms and N is the number of documents. The matrix X should be sparse. You are not allowed to use libraries for this task.

In [8]:
# construct mappings to retrieve a term from its index
# as well as an index given a term
words_index = {}
i = 0
words_list = []

for w in words_filtered:
    words_index[w] = i
    i += 1
    words_list.append(w)

In [9]:
# construct mappings to retrieve a courseId from its index
# as well as an index given a courseId
courses_index = {}
i = 0
courses_list = []

for w in courses_proc2:
    courses_index[w] = i
    i += 1
    courses_list.append(w)

In [10]:
# parameters required for constructing the matrix
M = len(words_index)
N = len(courses_index)

In [11]:
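# matrix containing how many times each word
# appears in each document
# (np.zeros, because np.ndarray leaves the memory
# uninitialized, so the counts would start from garbage)
X = np.zeros((M, N))

for c in courses_proc2:
    for t in courses_proc2[c]:
        X[words_index[t], courses_index[c]] += 1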
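
The array above is dense. One way to keep only the non-zero counts without any library is a dictionary keyed by `(term index, document index)`. The following is a minimal sketch under that assumption; `toy_docs`, `term_index`, `doc_index` and `X_sparse` are names invented for this example.

```python
from collections import defaultdict

# hypothetical pre-processed documents (lists of terms)
toy_docs = {
    "C1": ["markov", "chain", "markov"],
    "C2": ["chain", "graph"],
}

# index terms and documents in a fixed (sorted) order
all_terms = sorted({t for terms in toy_docs.values() for t in terms})
term_index = {t: i for i, t in enumerate(all_terms)}
doc_index = {d: j for j, d in enumerate(sorted(toy_docs))}

# dictionary-of-keys storage: only non-zero counts are ever created
X_sparse = defaultdict(int)
for doc_id, terms in toy_docs.items():
    for term in terms:
        X_sparse[(term_index[term], doc_index[doc_id])] += 1

n_terms, n_docs = len(term_index), len(doc_index)
density = len(X_sparse) / (n_terms * n_docs)

print(dict(X_sparse))
print(density)
```

Memory then grows with the number of non-zero entries rather than with M×N, which matters when most terms are absent from most descriptions.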

### Print the 15 terms in the description of the $9^{th}$ class with the highest TF-IDF scores.

The TFIDF matrix is constructed transposed (courses as rows, terms as columns), to allow for easier access to the courses, since these are what the following exercises mainly access.

In [12]:
# construct the TF matrix: each row of X.T is one
# document, and its term counts are normalized by
# the count of the most frequent term in that document
TF = np.array([x / x.max() for x in X.T])

In [13]:
# construct the IDF array, containing the inverse
# document frequency of each term:
# log2(N / number of documents containing the term)
doc_count = (X > 0).sum(axis=1)
IDF = np.log2(N / doc_count)

In [14]:
# construct the TFIDF matrix from the
# TF and IDF matrices
TFIDF = np.array([x * IDF for x in TF])

In [15]:
# select only the row related to the 9th course
TFIDF_9 = TFIDF[courses_index[courses[9]['courseId']]]

In [16]:
# retrieve the 15 highest entries in TFIDF_9
TFIDF_9_top15 = np.argsort(-TFIDF_9)[:15]

In [17]:
print("TOP 15 TERMS FROM TFIDF")
for i in TFIDF_9_top15:
    print(" ", TFIDF_9[i], ":", words_list[i])

TOP 15 TERMS FROM TFIDF
  6.03765254148 : verification
  3.89523690385 : systemverilog
  3.89523690385 : systemc
  3.89523690385 : functional verification
  3.81500253808 : vhdl
  3.26125190356 : rtl
  2.59682460257 : systemverilog systemc
  2.49469492386 : hardware
  2.3301579359 : socs
  2.3301579359 : design vhdl
  1.87141350903 : functional
  1.74761845192 : digital hardware
  1.56208089818 : digital
  1.54761845192 : system level
  1.08708396785 : systemsonchip


### Explain where the difference between the large scores and the small ones comes from.

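TF gives the importance of a term within a document, between 0 and 1. The most frequent term in the description has TF 1, and a term that never appears has TF 0.

IDF measures how rare a term is within the corpus and is never negative. 0 means the term is in every description; the value grows as the term appears in fewer descriptions, up to $\log_2 N$ for a term that occurs in a single one.

TFIDF is then simply the product of the two. A high TFIDF means that the term is frequent within the description but rare within the corpus; a low TFIDF means that the term is either infrequent within the description or common across the corpus.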
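
To make the contrast concrete, here is a small worked example using the same weighting as above, $\text{TFIDF}(t, d) = \frac{n_{t,d}}{\max_{t'} n_{t',d}} \cdot \log_2 \frac{N}{\text{df}(t)}$. The counts in `toy_counts` and the function names are made up for illustration and are not taken from the course data.

```python
import math

# hypothetical term counts per document (term -> count)
toy_counts = {
    "D1": {"verification": 4, "design": 2},
    "D2": {"design": 3, "markov": 1},
    "D3": {"design": 1},
    "D4": {"design": 2, "markov": 2},
}
n_docs = len(toy_counts)


def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for counts in toy_counts.values() if term in counts)


def tf_idf(term, doc_id):
    """Max-normalized term frequency times log2 inverse document frequency."""
    counts = toy_counts[doc_id]
    tf = counts.get(term, 0) / max(counts.values())
    idf = math.log2(n_docs / document_frequency(term))
    return tf * idf


rare_score = tf_idf("verification", "D1")  # frequent in D1, found only in D1
common_score = tf_idf("design", "D1")      # found in every document

print(rare_score, common_score)
```

"verification" gets TF 1 and IDF $\log_2 4 = 2$, whereas "design" gets IDF $\log_2 1 = 0$, so its score vanishes no matter how often it occurs in D1.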

## Exercise 4.3: Document similarity search

### Search for "markov chains" and "facebook".

In [18]:
# the word "chains" has been replaced by "chain" by lemmatization
markov_chain_index = words_index['markov chain']

The term "facebook" appeared only 9 times, all within a single description, and was therefore filtered out earlier. Thus, that term has no score and no similarity can be computed.

In [19]:
words_index['facebook']

KeyError: 'facebook'

### Display the top five courses together with their similarity score for each query.

In [20]:
# cosine similarity function, given
# the indices of two courses
def sim(i, j):
    di = TFIDF[i]
    dj = TFIDF[j]
    return np.dot(di, dj) / ( np.linalg.norm(di) * np.linalg.norm(dj) )

In [21]:
# compute the scores of all courses for
# the term 'markov chain' within the TFIDF matrix
markov_chain_score = []

for i in TFIDF:
    markov_chain_score.append(i[markov_chain_index])

In [22]:
# find the 5 highest scores in markov_chain_score
TFIDF_markov_top5 = np.argsort(-np.array(markov_chain_score))[:5]
top_markov_names = []

print("TOP 5 COURSES FOR 'MARKOV CHAIN'")
for i in TFIDF_markov_top5:
    top_markov_names.append([i, courses_list[i]])
    print(" ", markov_chain_score[i], ":", courses_list[i])

TOP 5 COURSES FOR 'MARKOV CHAIN'
  5.94063200363 : MATH-332
  4.62049155838 : COM-516
  3.46536866878 : MGT-484
  1.26013406138 : FIN-408
  1.15512288959 : COM-308


In [23]:
# print the similarity scores between the 5 courses
# this results in a symmetric matrix with 1s on the diagonal
# as the similarity score of a course with itself is 1.00
print("SIMILARITIES BETWEEN TOP 5 MARKOV CHAINS COURSES")
print("         ", top_markov_names[0][1], top_markov_names[1][1], top_markov_names[2][1],
      top_markov_names[3][1], top_markov_names[4][1])
for i in top_markov_names:
    line = "" + i[1].rjust(8)
    for j in top_markov_names:
        line += "    " + str("%.2f" % round(sim(i[0], j[0]), 2))
    print(line)

SIMILARITIES BETWEEN TOP 5 MARKOV CHAINS COURSES
          MATH-332 COM-516 MGT-484 FIN-408 COM-308
MATH-332    1.00    0.34    0.33    0.18    0.07
 COM-516    0.34    1.00    0.32    0.21    0.10
 MGT-484    0.33    0.32    1.00    0.18    0.08
 FIN-408    0.18    0.21    0.18    1.00    0.04
 COM-308    0.07    0.10    0.08    0.04    1.00


### What do you think of the results? Give your intuition on what is happening.

Since all we looked for were the descriptions in which "markov chain(s)" appears the most, we expected relatively low similarity scores between the courses. It is likely that courses dealing with Markov chains cover somewhat similar topics, as it is a mathematical concept at its base, but that is not at all required.
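
An alternative to ranking by the raw TFIDF weight of the query term is to treat the query itself as a vector and rank the courses by cosine similarity to it. The sketch below shows this on made-up data; the course names, weights and the binary query weighting are assumptions for illustration, not results from the lab.

```python
import math

# hypothetical TFIDF rows, one sparse dict (term -> weight) per course
toy_tfidf = {
    "MATH-X": {"markov chain": 3.0, "probability": 4.0},
    "COM-Y": {"markov chain": 1.0, "network": 1.0},
    "FIN-Z": {"option": 2.0},
}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def search(query_terms, top_k=5):
    """Rank courses by cosine similarity to a binary query vector."""
    query = {t: 1.0 for t in query_terms}
    scored = [(cosine(query, vec), cid) for cid, vec in toy_tfidf.items()]
    return sorted(scored, reverse=True)[:top_k]


ranking = search(["markov chain"])
unknown = search(["facebook"])  # term absent from every course

print(ranking)
print(unknown)
```

Because the cosine normalizes by vector length, a short description focused on the query term can outrank a longer one with a higher raw weight, and a term missing from the vocabulary simply yields a score of 0 for every course instead of raising an error.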

## Save all necessary dictionaries and arrays to disk

In [24]:
np.save("data/TFIDF.npy", TFIDF)
np.save("data/words.npy", words_list)
np.save("data/words_dict.npy", words_index)
np.save("data/courses.npy", courses_list)
np.save("data/courses_dict.npy", courses_index)
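
When a dictionary is passed to `np.save`, NumPy wraps it in a zero-dimensional object array and pickles it. Reading it back therefore needs `allow_pickle=True` and a call to `.item()`. A minimal round-trip sketch, using a temporary directory and a made-up `toy_index` in place of the real files:

```python
import os
import tempfile

import numpy as np

# hypothetical stand-in for words_index
toy_index = {"markov chain": 0, "verification": 1}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "words_dict.npy")
    np.save(path, toy_index)  # stored as a 0-d object array via pickle
    # object arrays are only loaded if pickling is explicitly allowed;
    # .item() unwraps the 0-d array into the original dict
    restored = np.load(path, allow_pickle=True).item()

print(restored)
```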