# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** R
**Names:**

* Raphael Strebel
* Raphaël Barman
* Thierry Bossy

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [63]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl
import string
import re
from operator import itemgetter
import nltk
import math
from collections import defaultdict

from bokeh.plotting import figure, output_notebook,show, ColumnDataSource
from bokeh.models.widgets import DataTable, DateFormatter, TableColumn
from bokeh.layouts import widgetbox


from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

output_notebook()

courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')
stopwords.add('-')

## Exercise 4.1: Pre-processing

In [17]:
id2name = dict(map(itemgetter('courseId', 'name'),courses))
name2id = {v: k for k,v in id2name.items()}
np.save('id2name', id2name)
np.save('name2id', name2id)

In [111]:
lmtzr = WordNetLemmatizer()
stemmer = PorterStemmer()

# Returns the bag of words of a text as a dictionary, with the distinct words as keys and their number of occurrences as values
def getCourseBagOfWords(text):
    # We create a default dict which returns 0 when a key is missing
    bow = defaultdict(lambda: 0)
    
    # Replace non-breaking spaces with normal spaces
    text = text.replace('\xa0', ' ')
    # Remove punctuation, don't remove '-'
    punctuation = string.punctuation.replace('-','')
    text = text.translate(str.maketrans('','',punctuation))
    # Build a new list instead of inserting into the list being iterated,
    # so that every word is examined exactly once
    words = []
    for word in text.split(' '):
        # Separate glued words such that "MyNameIsChristian"
        # becomes "My" "Name" "Is" "Christian",
        # while making sure that words such as "USA" and "imageJ" are untouched
        res = re.findall('[a-zA-Z][^A-Z]*', word)
        # Split only if we matched several parts and none is a one-letter word
        if len(res) > 1 and len(min(res, key=len)) != 1:
            words.extend(res)
        else:
            words.append(word)
    for word in words:
        # Only lowercase and lemmatize words that are not fully uppercase
        # (we don't want IT to become it)
        if not word.isupper():
            word = word.lower()
            word = lmtzr.lemmatize(word)
        # Skip stopwords and digits
        if word not in stopwords and not word.isdigit():
            bow[word] += 1
    return bow

# Computes the bag of words for each course
# and the global bag of words
def getBagOfWords():
    globalBagOfWords = defaultdict(lambda: 0)
    bagOfWords = {}
    for course in courses:
        bow = getCourseBagOfWords(course['description'])
        bagOfWords[course['courseId']] = bow
        for k,v in bow.items():
            globalBagOfWords[k] += v
    
    occurences = sorted(globalBagOfWords.items(), key=itemgetter(1))
    # We keep only the words whose count is strictly between minBound and maxBound,
    # where minBound is the count of the 10th least frequent term
    # and maxBound the count of the 9th most frequent term
    minBound = occurences[9][1]
    maxBound = occurences[-9][1]
    globalBagOfWords = {k: v for k,v in globalBagOfWords.items() if v > minBound and v < maxBound}
    for course in bagOfWords.keys():
        bagOfWords[course] = {k: v for k,v in bagOfWords[course].items() if k in globalBagOfWords}
    return globalBagOfWords, bagOfWords

In [101]:
globalBagOfWords, bagOfWords = getBagOfWords()
print(sum(globalBagOfWords.values()))
print(len(globalBagOfWords.keys()))
#getCourseBagOfWords(courses[1]['description'])
#for course in courses:
#    bow = getBagOfWords(course['description'])
#    bagOfWords[course['courseId']] = bow
#    mergeBow(globalBagOfWord,bow)
#test_course = course.copy()

378393
7120


## 1. Explain which ones you implemented and why.
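We remove all punctuation (except hyphens, so that compounds such as "e-commerce" stay intact) and all stopwords, since neither carries information about the topic of a course.

We also lemmatize the words using the nltk library, so that inflected forms of the same word are counted together and the word occurrence counts are more accurate.

In addition, words written entirely in uppercase (such as "IT") are left untouched, purely numeric tokens are dropped, words glued together in camel case are split into their parts, and the rarest and most frequent terms are removed from the vocabulary.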
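The word-splitting step inside `getCourseBagOfWords` can be tried in isolation. The sketch below reuses the same regular expression and the same "no one-letter part" rule; the helper name `split_camel_case` and the sample words are only illustrative and are not part of the notebook.

```python
import re

def split_camel_case(word):
    """Return the camel-case parts of a word, or [word] if it should stay intact."""
    parts = re.findall('[a-zA-Z][^A-Z]*', word)
    # Keep the word as is when there is a single part
    # or when any part would be a one-letter word (e.g. "USA", "imageJ")
    if len(parts) > 1 and len(min(parts, key=len)) != 1:
        return parts
    return [word]

for sample in ["MyNameIsChristian", "USA", "imageJ", "e-commerce"]:
    print(sample, "->", split_camel_case(sample))
```

Only the first sample is split; the acronym, the mixed-case name and the hyphenated compound are returned unchanged.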

## 2. Print the terms in the pre-processed description of the IX class in alphabetical order.


In [113]:
ixBow = getCourseBagOfWords([course for course in courses if course['name'] == 'Internet analytics'][0]['description'])
print('Words for Internet analytics course:')
for words in sorted(ixBow.keys(),key=lambda v: v.upper()):
    print('   -',words)

Words for Internet analytics course:
   - acquired
   - activity
   - ad
   - advertisement
   - algebra
   - algorithm
   - analysis
   - analytics
   - analyze
   - apache
   - application
   - assessment
   - auction
   - balance
   - based
   - basic
   - cathedra
   - chain
   - class
   - cloud
   - clustering
   - collection
   - COM-300
   - combination
   - communication
   - community
   - computing
   - concept
   - concrete
   - content
   - coverage
   - curated
   - current
   - data
   - datasets
   - decade
   - dedicated
   - designed
   - detection
   - develop
   - dimensionality
   - draw
   - e-commerce
   - effectiveness
   - efficiency
   - end
   - exam
   - expected
   - explore
   - explores
   - field
   - final
   - foundational
   - framework
   - function
   - fundamental
   - good
   - graph
   - hadoop
   - hands-on
   - homework
   - important
   - information
   - infrastructure
   - inspired
   - internet
   - java
   - key
   - keywords
   - knowledg

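From the slides:

$N$ = total number of documents

$f_{ij}$ = number of occurrences of word $i$ in document $j$, i.e. `bagOfWords[j][i]`

$\mathrm{tf}_{ij} = \dfrac{f_{ij}}{\max_k f_{kj}}$

$\mathrm{idf}_i = -\log_2\left(\dfrac{n_i}{N}\right)$, where $n_i$ is the number of documents in which word $i$ occurs at least once

$\mathrm{tfidf}_{ij} = \mathrm{tf}_{ij} \cdot \mathrm{idf}_i$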
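As a sanity check, the formulas above can be applied by hand to a tiny corpus. The following is a minimal sketch in plain Python; the corpus `toy_bags` and the function name `tf_idf_from_slides` are made up for illustration, and every document is assumed to be non-empty.

```python
import math

# Hypothetical toy corpus: document name -> bag of words (term -> count)
toy_bags = {
    "doc_a": {"markov": 2, "chain": 1},
    "doc_b": {"chain": 1, "graph": 4},
    "doc_c": {"graph": 1},
}

def tf_idf_from_slides(bags):
    """Return {doc: {term: tf-idf}} with tf = f / max_k f and idf = -log2(n_i / N)."""
    n_docs = len(bags)
    # n_i: number of documents in which each term occurs at least once
    doc_freq = {}
    for bag in bags.values():
        for term in bag:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    scores = {}
    for doc, bag in bags.items():
        max_count = max(bag.values())  # assumes the bag is not empty
        scores[doc] = {
            term: (count / max_count) * -math.log2(doc_freq[term] / n_docs)
            for term, count in bag.items()
        }
    return scores

toy_scores = tf_idf_from_slides(toy_bags)
for doc, terms in toy_scores.items():
    print(doc, terms)
```

In `doc_a`, "markov" gets the highest weight because it is both the most frequent term of the document and present in only one of the three documents. Note that the code cell in Exercise 4.2 uses a variant of these formulas (double normalization 0.5 for the tf and a natural logarithm for the idf).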

## Exercise 4.2: Term-document matrix

In [121]:
globalBagOfWords, bagOfWords = getBagOfWords()

In [122]:
numTerms = len(globalBagOfWords.keys())
numCourses = len(bagOfWords.keys())
termsOrder = list(enumerate(globalBagOfWords.keys()))
coursesOrder = list(enumerate(bagOfWords.keys()))
idx2Term = {i[0]: i[1] for i in termsOrder}
term2Idx = {v: k for k,v in idx2Term.items()}
idx2Course = {i[0]: i[1] for i in coursesOrder}
course2Idx = {v: k for k,v in idx2Course.items()}

np.save('idx2Term', idx2Term)
np.save('term2Idx', term2Idx)
np.save('idx2Course', idx2Course)
np.save('course2Idx', course2Idx)

In [164]:
overallFreq = np.zeros(numTerms)
row = []
col = []
data  = []

# We construct the term document matrix
for courseIdx, course in coursesOrder:
    if(len(bagOfWords[course]) == 0):
        continue
    docMax = max(bagOfWords[course].values())
    for termIdx, term in termsOrder:
        if(term not in bagOfWords[course]):
            continue
        col.append(courseIdx)
        row.append(termIdx)
        # We use the double normalization 0.5 for the tf
        data.append(0.5+0.5*bagOfWords[course][term]/docMax)
        # We construct at the same time what we need for the idf
        overallFreq[termIdx] += 1

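# We construct the sparse matrix, with an explicit shape so that
# courses skipped above (empty bag of words) still get a column
tf = csr_matrix((data,(row,col)), shape=(numTerms,numCourses))
# We compute the idf (natural logarithm)
overallFreq = np.log(numCourses/overallFreq)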

In [206]:
tf_idf = tf.copy()
# We multiply each row by the corresponding idf
tf_idf.data *= overallFreq.repeat(np.diff(tf_idf.indptr))
np.save('X',tf_idf)

tf_idf = tf_idf.toarray()
for term in map(lambda x: idx2Term[x],tf_idf[:,course2Idx[name2id['Internet analytics']]].argsort()[-15:][::-1]):
    print(term)

e-commerce
hadoop
recommender
ad
auction
real-world
self-contained
advertisement
map-reduce
mining
service
internet
foundational
seek
spark
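The row scaling `tf_idf.data *= overallFreq.repeat(np.diff(tf_idf.indptr))` used above relies on the CSR layout: `indptr` marks where each row starts in `data`, so `np.diff(indptr)` is the number of stored entries per row. The sketch below reproduces the idea with hand-written arrays and NumPy only; the values of `data`, `indptr` and `row_factors` are invented for illustration.

```python
import numpy as np

# Hand-written CSR pieces for a 3-row matrix:
# row 0 stores 2 values, row 1 stores none, row 2 stores 2 values
data = np.array([1.0, 2.0, 3.0, 4.0])
indptr = np.array([0, 2, 2, 4])

# One factor per row (plays the role of the idf of each term)
row_factors = np.array([10.0, 100.0, 1000.0])

# Number of stored entries per row
entries_per_row = np.diff(indptr)

# Repeat each row factor once per stored entry, so it lines up with `data`
scaled = data * row_factors.repeat(entries_per_row)
print(scaled)
```

The factor of the empty row is repeated zero times, so it simply does not appear in the product.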


## Exercise 4.3: Document similarity search

In [204]:
def docSimilarity(di,dj):
    return np.dot(di,dj)/(np.sqrt(np.dot(di,di))*np.sqrt(np.dot(dj,dj)))
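`docSimilarity` is the cosine similarity between two document vectors. One caveat: if a document has an all-zero column (for instance a course whose bag of words is empty), the denominator is zero. A minimal plain-Python sketch of a guarded version follows; the name `cosine_similarity` and the choice to return 0.0 for zero vectors are assumptions made for this illustration, not part of the notebook.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: an all-zero document is treated as similar to nothing
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 2], [2, 4]))   # parallel vectors
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal vectors
print(cosine_similarity([0, 0], [1, 1]))   # zero vector, guarded case
```

Parallel vectors give a similarity close to 1 and orthogonal vectors give 0, which matches the diagonal of ones in the similarity matrix printed below.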

In [205]:
top = (tf_idf[term2Idx['facebook']]+tf_idf[term2Idx['markov']]+tf_idf[term2Idx['chain']]).argsort()[-5:][::-1]
topCol = list(map(lambda x: tf_idf[:,x],top))
topCourses = list(map(lambda x: id2name[idx2Course[x]],top))
cmp = np.zeros((5,5))
for i in range(5):
    for j in range(5):
        cmp[i][j] = docSimilarity(topCol[i],topCol[j])
print(topCourses)
print(cmp)

data = dict(
    courses=topCourses,
    course0=cmp[0],
    course1=cmp[1],
    course2=cmp[2],
    course3=cmp[3],
    course4=cmp[4]
)
source = ColumnDataSource(data)
columns = [TableColumn(field='courses', title='Similarity')]
columns = columns + list(map(lambda x: TableColumn(field='course'+str(x[0]), title=x[1]),enumerate(topCourses)))
data_table = DataTable(source=source,columns=columns)
show(widgetbox(data_table))

['Applied probability & stochastic processes', 'Markov chains and algorithmic applications', 'Applied stochastic processes', 'Internet analytics', 'Stochastic calculus I']
[[ 1.          0.21214307  0.13333225  0.07074898  0.21040076]
 [ 0.21214307  1.          0.14587678  0.14180858  0.18254738]
 [ 0.13333225  0.14587678  1.          0.05299842  0.10828555]
 [ 0.07074898  0.14180858  0.05299842  1.          0.05242471]
 [ 0.21040076  0.18254738  0.10828555  0.05242471  1.        ]]
