# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** K

**Names:**

* Xavier Jeanmonod
* Adrian Baudat
* Simon Wicky

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [4]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl
import string
import math
courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')

In [5]:
from collections import Counter

## Exercise 4.1: Pre-processing

In [157]:
def remove_punct(description):
    return description.translate(str.maketrans('','', string.punctuation))

def remove_XA0(description):
    return description.replace(u'\xa0',u' ')

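def split_words(description):
    # Insert a space wherever a lowercase letter is directly followed by an
    # uppercase one, e.g. "materialsConstituents" -> "materials Constituents".
    # Also safe for empty and single-character strings.
    pieces = []
    for i, char in enumerate(description):
        pieces.append(char)
        # The slice is '' past the end of the string, and ''.isupper() is False
        if char.islower() and description[i + 1:i + 2].isupper():
            pieces.append(' ')
    return ''.join(pieces)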

def remove_words(wordList, to_remove):
    # Convert to a set once so that each membership test is O(1)
    to_remove = set(to_remove)
    return [word for word in wordList if word not in to_remove]

def remove_number(wordlist):
    return [word for word in wordlist if not word.isdigit()]

def extreme_words(rare,frequent):
    # Count every word over the whole corpus after basic cleaning
    list_all = []
    for course in courses:
        list_all += clean(course["description"])
    hist = list(Counter(list_all).items())
    # Words occurring fewer than `rare` times or more than `frequent` times
    rare_words = [word for (word, occ) in hist if occ < rare]
    frequent_words = [word for (word, occ) in hist if occ > frequent]
    return rare_words, frequent_words

def clean(text):
    final = remove_XA0(text)
    final = remove_punct(final)
    final = split_words(final)
    final_list = remove_words(final.lower().split(' '), stopwords)
    final_list = remove_number(final_list)
    return final_list


rare_words, frequent_words = extreme_words(10,300)

def full_clean(text):
    final_list = clean(text)
    final_list = remove_words(final_list,rare_words)
    final_list = remove_words(final_list,frequent_words)
    return final_list


Pre-processing operations:
* Remove punctuation, to avoid words ending with "," or ".".
* Replace the \xa0 character (non-breaking space) with a regular space.
* Remove stopwords, because they carry essentially no meaning.
* Remove rare words (fewer than 10 occurrences in the corpus), because their TF-IDF scores would be inflated.
* Remove frequent words (more than 300 occurrences in the corpus): words like "method" or "student" aren't really relevant.
* Split words artificially wherever a lowercase letter is followed by a capital letter. Example from MSE-440: "materialsConstituentsProcessing" is split into 3 words. This is probably dirty data.
* Remove numbers. Having "20" or "30" in the corpus is meaningless.
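
As an illustration of the lowercase/uppercase split described in the list above, here is a minimal sketch using a regular expression. The name `split_camel_case` is hypothetical and not part of the lab code, and the pattern only covers ASCII letters, whereas `islower()`/`isupper()` also handle accented characters.

```python
import re

def split_camel_case(text):
    """Insert a space between a lowercase letter and a directly following uppercase one."""
    # Zero-width match: preceded by a-z (lookbehind) and followed by A-Z (lookahead)
    return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', text)

print(split_camel_case("materialsConstituentsProcessing"))
# materials Constituents Processing
```

The lookbehind/lookahead pair matches the position between the two letters without consuming them, so no character is lost in the substitution.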

In [158]:
cleaned_courses = []
all_courses_description = []
for course in courses:
    new_description = full_clean(course['description'])
    all_courses_description.append(new_description)
    cleaned_courses.append({'courseId': course['courseId'], 'name': course['name'],'description': new_description})

In [159]:
ix = cleaned_courses[43]["description"]
ix.sort()
print(ix)

['acquired', 'ad', 'ad', 'algebra', 'algebra', 'algorithms', 'algorithms', 'analytics', 'analytics', 'analyze', 'balance', 'based', 'based', 'cathedra', 'chains', 'class', 'class', 'class', 'clustering', 'clustering', 'collection', 'combination', 'communication', 'community', 'community', 'computing', 'computing', 'concrete', 'current', 'dedicated', 'designed', 'detection', 'detection', 'develop', 'draw', 'efficiency', 'explore', 'explore', 'explore', 'explore', 'fields', 'final', 'functions', 'fundamental', 'good', 'graph', 'graphs', 'handson', 'homeworks', 'homeworks', 'infrastructure', 'internet', 'internet', 'java', 'key', 'knowledge', 'lab', 'laboratory', 'labs', 'labs', 'largescale', 'largescale', 'largescale', 'linear', 'linear', 'machine', 'machine', 'main', 'markov', 'material', 'media', 'midterm', 'mining', 'mining', 'mining', 'modeling', 'networking', 'networking', 'networking', 'networks', 'number', 'number', 'online', 'online', 'online', 'online', 'online', 'past', 'practi

## Exercise 4.2: Term-document matrix

In [160]:
def tf(descriptionlist):
    # Term frequency, normalized by the count of the most frequent word in the document
    c = Counter(descriptionlist)
    if not c:
        return {}
    maxOcc = max(c.values())
    return {word: occ / maxOcc for word, occ in c.items()}

def idf(descriptionlist):
    # Inverse document frequency: -log2 of the fraction of documents containing the word
    N = len(all_courses_description)
    out = {}
    # Iterate over distinct words only, to avoid recounting repeated words
    for word in set(descriptionlist):
        counter = 0
        for course_word in all_courses_description:
            if word in course_word:
                counter += 1
        out[word] = - math.log(counter/N) / math.log(2)
    return out

def tf_idf(descriptionlist):
    # TF-IDF score of every distinct word in the document
    tf_out = tf(descriptionlist)
    idf_out = idf(descriptionlist)
    return {word: tf_out[word] * idf_out[word] for word in tf_out}

In [165]:
all_words = list(set([word for course in all_courses_description for word in course]))
# Mapping from a word to its column index
word2index = dict([(all_words[i],i) for i in range(len(all_words))])
# Mapping from a course id to its row index
course2index = dict([(courses[i]["courseId"],i) for i in range(len(courses))])

In [166]:
matrix = np.zeros(shape=(len(courses),len(all_words)))

In [167]:
# Expensive calculation: prints the row index every 100 courses to show progress
for course in cleaned_courses:
    if course2index[course["courseId"]] % 100 == 0:
        print(course2index[course["courseId"]])
    tf_idf_dict = tf_idf(course["description"])
    for word in tf_idf_dict.keys():
        matrix[course2index[course["courseId"]]][word2index[word]] = tf_idf_dict[word]

0
100
200
300
400
500
600
700
800
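
The loop above recounts, for each word of each course, how many descriptions contain that word, which is most likely what makes it slow. A minimal sketch of one way to avoid this is to count document frequencies once and reuse them. The names `document_frequencies` and `tf_idf_fast` and the toy corpus are hypothetical and not part of the lab code; the sketch assumes each document is a list of already cleaned words.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Map each word to the number of documents that contain it at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set(): count each word once per document
    return df

def tf_idf_fast(doc, df, n_docs):
    """TF-IDF of every distinct word in `doc`, using precomputed document frequencies.

    Assumes every word of `doc` appears in `df`, i.e. `doc` belongs to the corpus.
    """
    counts = Counter(doc)
    if not counts:
        return {}
    max_occ = max(counts.values())
    return {word: (occ / max_occ) * math.log2(n_docs / df[word])
            for word, occ in counts.items()}

# Toy corpus, made up for illustration
toy_docs = [["markov", "chain", "chain"], ["markov", "model"], ["graph"]]
toy_df = document_frequencies(toy_docs)
toy_scores = tf_idf_fast(toy_docs[0], toy_df, len(toy_docs))
print(toy_scores)
```

With this approach the corpus is scanned once to build the counts, and each later lookup is a dictionary access instead of a pass over all descriptions.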


In [179]:
%store matrix

Stored 'matrix' (ndarray)


In [176]:
tf_idf_ix = list(tf_idf(ix).items())
tf_idf_ix.sort(key = lambda x: -x[1])
print("Top result of IX course :")
tf_idf_ix[:15]

Top result of IX course :


[('services', 5.650629418370151),
 ('realworld', 4.922503807119468),
 ('online', 4.8801112644929185),
 ('social', 4.608809242675524),
 ('mining', 4.042855355772294),
 ('explore', 3.864961331209578),
 ('networking', 3.767196384589916),
 ('largescale', 3.4987209984071828),
 ('internet', 2.7722949350251547),
 ('stream', 2.7722949350251547),
 ('ad', 2.6272669032712717),
 ('community', 2.6272669032712717),
 ('clustering', 2.511464256393278),
 ('analytics', 2.461251903559734),
 ('labs', 2.2272669032712713)]

The TF factor is normalized by the count of the most frequent word in the document, so it lies between 0 and 1, while the IDF factor grows as the word appears in fewer documents. A high TF-IDF score therefore marks a word that is frequent in this document and present in few others, i.e. a word that is specific to this course. A low score is due either to a low count of the word in the document or to a high presence in the whole corpus.
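
As a compact restatement of what the `tf`, `idf` and `tf_idf` functions compute, write $n_{w,d}$ for the number of occurrences of word $w$ in description $d$ and $N$ for the number of course descriptions (this notation is ours, introduced only for this summary):

$$
\mathrm{tf}(w,d) = \frac{n_{w,d}}{\max_{w'} n_{w',d}}, \qquad
\mathrm{idf}(w) = -\log_2 \frac{\left|\{d' : w \in d'\}\right|}{N}, \qquad
\text{tf-idf}(w,d) = \mathrm{tf}(w,d)\,\mathrm{idf}(w)
$$

Since $\mathrm{tf}(w,d) \le 1$, a score can never exceed $\mathrm{idf}(w)$, which is itself at most $\log_2 N$ (reached by a word that occurs in a single description).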

## Exercise 4.3: Document similarity search

In [171]:
def sim(doc1, doc2):
    # Cosine similarity; defined as 0 when either vector is all zeros
    num = np.dot(doc1,doc2)
    denom = np.linalg.norm(doc1) * np.linalg.norm(doc2)
    if denom == 0:
        return 0.0
    return num / denom

def sim_list(doc):
    # Courses with a positive similarity to doc, most similar first
    result = []
    for i in range(len(matrix)):
        value = sim(doc,matrix[i])
        if value > 0:
            # Row i of the matrix corresponds to courses[i]
            result.append((courses[i]["courseId"],courses[i]["name"],value))
    result.sort(key=lambda x: -x[2])
    return result

In [175]:
markov_chain_vector = np.zeros(shape=len(all_words))
markov_chain_vector[word2index["markov"]] = 1
markov_chain_top5 = sim_list(markov_chain_vector)[:5]
print(markov_chain_top5)

[('MGT-484', 'Applied probability & stochastic processes', 0.622617931178696), ('MATH-332', 'Applied stochastic processes', 0.5341955705371184), ('EE-605', 'Statistical Sequence Processing', 0.4423097584999178), ('COM-516', 'Markov chains and algorithmic applications', 0.38026569817022693), ('EE-516', 'Data analysis and model classification', 0.14350135980025172)]


The first four courses seem relevant, since they all deal with probability or stochastic processes.
The fifth one is arguable: its similarity score (0.14) is significantly lower than those of the others (0.38 to 0.62).
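
To help interpret these scores: for a one-hot query vector, the cosine similarity with a course reduces to that course's TF-IDF weight for the query term divided by the norm of its row. The following is a minimal sketch of this shortcut on a made-up matrix; `one_hot_similarities` and `toy` are hypothetical names, not part of the lab code.

```python
import numpy as np

def one_hot_similarities(matrix, column):
    """Cosine similarity between every row of `matrix` and a one-hot query on `column`."""
    norms = np.linalg.norm(matrix, axis=1)
    scores = np.zeros(matrix.shape[0])
    nonzero = norms > 0  # all-zero rows keep a similarity of 0
    scores[nonzero] = matrix[nonzero, column] / norms[nonzero]
    return scores

# Toy term-document matrix: 3 documents, 2 terms
toy = np.array([[3.0, 4.0],
                [0.0, 2.0],
                [0.0, 0.0]])
print(one_hot_similarities(toy, 0))
```

Under this view, a score measures how much of a description's total weight is carried by the query term, so a course can rank lower simply because its description puts weight on many other terms as well.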

#### Important note
"Facebook" occurs only once in the whole corpus, so it was removed as a rare word during preprocessing.
For the needs of this lab, we also ran the preprocessing without removing words, but the matrix construction then takes too long to be rerun here.
We were still able to extract a result from that run. The code below is commented out because, with the current preprocessing, "facebook" is no longer a word in the corpus.
##### Output we got for facebook:
[('EE-727', 'Computational Social Media', 0.17572585110979233)]

In [174]:
#facebook_vector = np.zeros(shape=len(all_words))
#facebook_vector[word2index["facebook"]] = 1
#facebook_top = sim_list(facebook_vector)
#print(facebook_top)

As said in the note above, "Facebook" occurs only once, in a single course description. This is why we only get this one result.
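
One possible way to make such a query fail gracefully, instead of raising a `KeyError` when the term was removed from the vocabulary, is sketched below. The helper `query_vector` and the toy index are hypothetical and not part of the lab code.

```python
import numpy as np

def query_vector(term, word2index):
    """Return a one-hot vector for `term`, or None when the term is not in the vocabulary."""
    index = word2index.get(term)
    if index is None:
        return None
    vector = np.zeros(len(word2index))
    vector[index] = 1
    return vector

# Toy vocabulary, made up for illustration
toy_index = {"markov": 0, "chain": 1}
print(query_vector("facebook", toy_index))  # None
```

A caller can then check for `None` and report that the term has no match before computing any similarity.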