# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** K

**Names:**

* Kim Lan Phan Hoang
* Robin Lang

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl

courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')
punctuation = '.?!,;:-–()[]{}"/'

## Exercise 4.1: Pre-processing

### Pre-process the corpus to create bag-of-words representations of each document

In [2]:
MIN = 10
MAX = 2000

In [3]:
courses_proc = {}
words = {}
cids = []

for c in courses:
    cid = c['courseId']
    cids.append(cid)
    
    # transfer all to lowercase, remove punctuation
    desc = c['description'].lower().translate(str.maketrans('', '', punctuation))
    # remove stopwords
    desc_proc = [word for word in desc.split() if word not in stopwords]
    
    # create a dict of all words
    for w in desc_proc:
        if w in words:
            words[w] += 1
        else:
            words[w] = 1
    
    # not needed ?
    #name = c['name']
    #courses_proc[cid] = desc_proc
    
    if cid in courses_proc:
        # some courseIds appear multiple times
        # for those, we decided to append the desctiptions to each other
        courses_proc[cid].extend(desc_proc)
    else:
        courses_proc[cid] = desc_proc

In [4]:
words_filtered = []

# most and least frequent words to filter
for w in words:
    if words[w] > MIN or words[w] < MAX:
        words_filtered.append(w)

In [6]:
courses_proc2 = {}

for c in courses_proc:
    courses_proc2[c] = [word for word in courses_proc[c] if word in words_filtered]

### Explain which kinds of cleaning you implemented and why

* make sure every courseId only appears once. If it appears more than once, append the two descriptions
* Put all cheacters to lowercase, to avoid the same word being considered different. Additionally, all words in stopwords.pkl are in lowercase, so this ensures all words are removed correctly
* Remove punctuation, for the same reason as above. Only the words are important, not which one is before a comma.
* Remove stopwords, as these don't carry any information about the content of the document
* filter most and least frequent words, ???

### Print the terms in the pre-processed description of the $9^{th}$ class in alphabetical order.

In [8]:
print(sorted(courses_proc2[courses[9]['courseId']]))

['10%', '10%', '10%', '2002', '2002', '2002', '2004', '2004', '2004', '2006', '2006', '2006', '2012', '2012', '2012', '3rd', '3rd', '3rd', '40%', '40%', '40%', '50%', '50%', '50%', 'abstract', 'abstract', 'abstract', 'abstractionexploit', 'abstractionexploit', 'abstractionexploit', 'activities', 'activities', 'activities', 'al', 'al', 'al', 'algorithmic', 'algorithmic', 'algorithmic', 'architecture', 'architecture', 'architecture', 'assessment', 'assessment', 'assessment', 'assistants', 'assistants', 'assistants', 'attending', 'attending', 'attending', 'automation', 'automation', 'automation', 'basic', 'basic', 'basic', 'bibliography', 'bibliography', 'bibliography', 'bibliothèque', 'bibliothèque', 'bibliothèque', 'building', 'building', 'building', 'challenges', 'challenges', 'challenges', 'chu', 'chu', 'chu', 'chusystem', 'chusystem', 'chusystem', 'coding', 'coding', 'coding', 'combinational', 'combinational', 'combinational', 'comments', 'comments', 'comments', 'completing', 'comple

## Exercise 4.2: Term-document matrix

### Construct an M×N term-document matrix X, where M is the number of terms and N is the number of documents. The matrix X should be sparse. You are not allowed to use libraries for this task.

In [5]:
# mapping from word to its index
words_index = {}
i = 0

for w in words_filtered:
    words_index[w] = i
    i += 1

In [7]:
# mapping from courseId to its index
courses_index = {}
i = 0

for w in courses_proc2:
    courses_index[w] = i
    i += 1

In [28]:
M = len(words_filtered)
N = len(courses_proc2)

In [30]:
X = np.ndarray((M, N))

for c in courses_proc2:
    for t in courses_proc2[c]:
        X[words_index[t]][courses_index[c]] += 1
        if courses_index[c]==0:
            i += 1

### Print the 15 terms in the description of the IX class with the highest TF-IDF scores.

### Explain where the difference between the large scores and the small ones comes from.

## Exercise 4.3: Document similarity search

### Search for "markov chains" and "facebook".

In [31]:
markov_index = words_index['markov']
chains_index = words_index['chains']
facebook_index = words_index['facebook']

### Display the top five courses together with their similarity score for each query.

### What do you think of the results? Give your intuition on what is happening.