# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *Your group letter.*

**Names:**

* Xavier Jeanmonod
* Adrian Baudat
* Simon Wicky

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [65]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl
import string
import math
courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')

In [8]:
from collections import Counter

## Exercise 4.1: Pre-processing

In [31]:
def remove_punct(description):
    return description.translate(str.maketrans('','', string.punctuation))

def remove_XA0(description):
    return description.replace(u'\xa0',u' ')

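def split_words(description):
    # Insert a space wherever a lowercase letter is directly followed by an
    # uppercase one, e.g. "materialsConstituents" -> "materials Constituents".
    # Also safe for empty and one-character strings.
    pieces = []
    for i, char in enumerate(description):
        pieces.append(char)
        if char.islower() and i + 1 < len(description) and description[i + 1].isupper():
            pieces.append(' ')
    return ''.join(pieces)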

def remove_words(wordList, to_remove):
    return [word for word in wordList if word not in to_remove]

def remove_number(wordlist):
    # Drop tokens made only of digits
    return [word for word in wordlist if not word.isdigit()]

def extreme_words(rare, frequent):
    # Count every word over all course descriptions (after basic cleaning)
    list_all = []
    for course in courses:
        list_all += clean(course["description"])
    hist = list(Counter(list_all).items())
    # Words seen fewer than `rare` times, and words seen more than `frequent` times
    rare_words = [word for (word, occ) in hist if occ < rare]
    frequent_words = [word for (word, occ) in hist if occ > frequent]
    return rare_words, frequent_words

def clean(text):
    final = remove_XA0(text)
    final = remove_punct(final)
    final = split_words(final)
    # split() without an argument collapses runs of whitespace, so no empty tokens are produced
    final_list = remove_words(final.lower().split(), stopwords)
    final_list = remove_number(final_list)
    return final_list


rare_words, frequent_words = extreme_words(10,300)

def full_clean(text):
    final_list = clean(text)
    final_list = remove_words(final_list,rare_words)
    return remove_words(final_list,frequent_words)


Pre-processing operations:
* Remove punctuation, so that tokens do not end with "," or ".".
* Replace the `\xa0` character (non-breaking space) with a regular space.
* Remove stopwords, because they carry little meaning on their own.
* Remove rare words (fewer than 10 occurrences in the corpus), because their high IDF would give them a disproportionate TF-IDF weight.
* Remove frequent words (more than 300 occurrences in the corpus): words like "method" or "student" appear in many descriptions and do not help to tell courses apart.
* Split a word wherever a lowercase letter is directly followed by a capital letter. Example from MSE-440: "materialsConstituentsProcessing" is split into 3 words. This is probably an artifact of dirty data.
* Remove tokens that consist only of digits (work in progress).
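
The steps listed above can be sketched on a toy input as follows. This is a minimal, self-contained illustration, not the notebook's actual pipeline: `toy_stopwords`, `toy_text`, `toy_clean` and `frequency_filter` are hypothetical names, the stopword set is made up, and a regular expression stands in for the character loop used in `split_words`.

```python
import re
import string
from collections import Counter

# Hypothetical toy inputs, for illustration only.
toy_stopwords = {"the", "of", "and", "in", "to"}
toy_text = "Introduction to the materialsConstituentsProcessing of metals,\xa0alloys and 3 ceramics."

def toy_clean(text, stopwords):
    text = text.replace("\xa0", " ")                                   # special space -> plain space
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)                   # split fused lower/Upper words
    words = text.lower().split()
    # drop stopwords and tokens made only of digits
    return [w for w in words if w not in stopwords and not w.isdigit()]

def frequency_filter(docs, rare, frequent):
    # Keep only words whose corpus-wide count lies in [rare, frequent]
    counts = Counter(w for doc in docs for w in doc)
    return [[w for w in doc if rare <= counts[w] <= frequent] for doc in docs]

tokens = toy_clean(toy_text, toy_stopwords)
print(tokens)
```

The frequency filter mirrors the thresholds passed to `extreme_words`: a word is dropped when its count is below `rare` or above `frequent`.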

In [95]:
cleaned_courses = []
all_courses_description = []
for course in courses:
    new_description = full_clean(course['description'])
    all_courses_description.append(new_description)
    cleaned_courses.append({'courseId': course['courseId'], 'name': course['name'],'description': new_description})

In [33]:
ix = cleaned_courses[43]
# sorted() returns a new list, so the stored description is not reordered in place
print(sorted(ix["description"]))

['acquired', 'ad', 'ad', 'algebra', 'algebra', 'algorithms', 'algorithms', 'analytics', 'analytics', 'analyze', 'balance', 'based', 'based', 'cathedra', 'chains', 'class', 'class', 'class', 'clustering', 'clustering', 'collection', 'combination', 'communication', 'community', 'community', 'computing', 'computing', 'concrete', 'current', 'dedicated', 'designed', 'detection', 'detection', 'develop', 'draw', 'efficiency', 'explore', 'explore', 'explore', 'explore', 'fields', 'final', 'functions', 'fundamental', 'good', 'graph', 'graphs', 'handson', 'homeworks', 'homeworks', 'infrastructure', 'internet', 'internet', 'java', 'key', 'knowledge', 'lab', 'laboratory', 'labs', 'labs', 'largescale', 'largescale', 'largescale', 'linear', 'linear', 'machine', 'machine', 'main', 'markov', 'material', 'media', 'midterm', 'mining', 'mining', 'mining', 'modeling', 'networking', 'networking', 'networking', 'networks', 'number', 'number', 'online', 'online', 'online', 'online', 'online', 'past', 'practi

## Exercise 4.2: Term-document matrix

In [93]:
def tf(descriptionlist):
    # Term frequency, normalized by the count of the most frequent word in the description
    c = Counter(descriptionlist)
    if not c:
        return {}
    maxOcc = max(c.values())
    return {word: occ / maxOcc for word, occ in c.items()}

def idf(descriptionlist):
    # Inverse document frequency: log2(N / number of descriptions containing the word).
    # Assumes every word occurs in at least one description of the corpus.
    N = len(all_courses_description)
    out = {}
    for word in set(descriptionlist):
        counter = sum(1 for course_words in all_courses_description if word in course_words)
        out[word] = math.log2(N / counter)
    return out

def tf_idf(descriptionlist):
    # TF-IDF weight of every distinct word in the description
    tf_out = tf(descriptionlist)
    idf_out = idf(descriptionlist)
    return {word: tf_out[word] * idf_out[word] for word in tf_out}
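
For reference, the weights computed by `tf`, `idf` and `tf_idf` above can be written as follows. The symbols are notation introduced here for illustration: $n_{w,d}$ is the number of occurrences of word $w$ in description $d$, $N$ is the number of course descriptions, and $\mathrm{df}(w)$ is the number of descriptions that contain $w$.

$$
\mathrm{tf}(w, d) = \frac{n_{w,d}}{\max_{w'} n_{w',d}}, \qquad
\mathrm{idf}(w) = -\log_2 \frac{\mathrm{df}(w)}{N} = \log_2 \frac{N}{\mathrm{df}(w)}, \qquad
\text{tf-idf}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)
$$

As a worked example with made-up numbers: if a word occurs $2$ times in a description whose most frequent word occurs $4$ times, and it appears in $8$ of $N = 64$ descriptions, then $\mathrm{tf} = 2/4 = 0.5$, $\mathrm{idf} = \log_2(64/8) = 3$, and the TF-IDF weight is $0.5 \cdot 3 = 1.5$.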

In [87]:
des = ix["description"]

In [119]:
# sorted() makes the word order (and therefore the indices) reproducible across runs
all_words = sorted(set(word for course in all_courses_description for word in course))
# mapping from a word to an index
word2index = {word: i for i, word in enumerate(all_words)}
# mapping from a course id to an index
course2index = {course["courseId"]: i for i, course in enumerate(courses)}
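
One possible way to go from these index mappings to a term-document matrix is sketched below on a toy corpus. This is an assumption about the next step, not the notebook's own code: `toy_docs`, `toy_word2index` and the other names are hypothetical, rows are taken to be terms and columns documents, and the weights follow the TF-IDF definition used in `tf_idf`.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-cleaned descriptions (illustration only).
toy_docs = [
    ["graph", "mining", "graph"],
    ["mining", "markov"],
    ["markov", "chains", "markov", "graph"],
]

toy_words = sorted({w for doc in toy_docs for w in doc})
toy_word2index = {w: i for i, w in enumerate(toy_words)}

# Number of documents in which each word appears.
doc_freq = Counter(w for doc in toy_docs for w in set(doc))
n_docs = len(toy_docs)

# Collect the non-zero entries as (row, column, value) triplets.
rows, cols, values = [], [], []
for doc_idx, doc in enumerate(toy_docs):
    counts = Counter(doc)
    if not counts:
        continue  # an empty description contributes no entries
    max_occ = max(counts.values())
    for word, occ in counts.items():
        weight = (occ / max_occ) * math.log2(n_docs / doc_freq[word])
        rows.append(toy_word2index[word])  # row = term
        cols.append(doc_idx)               # column = document
        values.append(weight)

# Dense view, only for inspecting the toy example.
dense = [[0.0] * n_docs for _ in toy_words]
for r, c, v in zip(rows, cols, values):
    dense[r][c] = v
```

With SciPy available, the same triplets can be passed to `csr_matrix((values, (rows, cols)), shape=(len(toy_words), n_docs))` to obtain a sparse matrix, which avoids storing the many zero entries of a real corpus.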

## Exercise 4.3: Document similarity search

In [118]:
course2index

{'MSE-440': 0,
 'BIO-695': 1,
 'FIN-523': 2,
 'MICRO-614': 3,
 'ME-231(a)': 4,
 'AR-402(v)': 5,
 'ChE-421': 6,
 'CH-403': 7,
 'COM-302': 8,
 'EE-432': 9,
 'MGT-430': 10,
 'PHYS-455': 11,
 'EE-517': 12,
 'MSE-613': 13,
 'MSE-423': 14,
 'MGT-621': 15,
 'MSE-437': 16,
 'MSE-431': 17,
 'BIOENG-448': 18,
 'BIOENG-450': 19,
 'MSE-474': 20,
 'MICRO-424': 21,
 'ME-432': 22,
 'ENV-400': 23,
 'HUM-429(a)': 24,
 'BIOENG-517': 25,
 'MSE-420': 26,
 'ENV-501': 27,
 'MATH-111(en)': 28,
 'ChE-302': 29,
 'MICRO-505': 30,
 'CS-352': 31,
 'HUM-417(a)': 32,
 'CIVIL-429': 33,
 'CIVIL-449': 34,
 'FIN-405': 35,
 'CS-699(1)': 36,
 'ME-551': 37,
 'MSE-463': 38,
 'COM-500': 39,
 'MATH-106(en)': 40,
 'MGT-414': 41,
 'BIO-501': 42,
 'COM-308': 43,
 'MGT-526': 44,
 'CH-332': 45,
 'ME-476': 46,
 'EE-605': 47,
 'ENG-603': 48,
 'MSE-425': 49,
 'MATH-408': 50,
 'CS-322': 51,
 'ME-453': 52,
 'MSE-629': 53,
 'CS-490': 54,
 'ENV-715': 55,
 'CS-699(2)': 56,
 'MATH-625': 57,
 'ME-231(b)': 58,
 'AR-402(w)': 59,
 'ME-499': 6