# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** R
**Names:**

* Raphael Strebel
* Raphaël Barman
* Thierry Bossy

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [212]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import csc_matrix
from utils import load_json, load_pkl
import string
import re
from operator import itemgetter
import nltk
import math
from collections import defaultdict

from bokeh.plotting import figure, output_notebook,show, ColumnDataSource
from bokeh.models.widgets import DataTable, DateFormatter, TableColumn
from bokeh.layouts import widgetbox


from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

from nltk import word_tokenize
from nltk.util import ngrams


output_notebook()

courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')

## Exercise 4.1: Pre-processing

In [87]:
id2name = dict(map(itemgetter('courseId', 'name'),courses))
name2id = {v: k for k,v in id2name.items()}
np.save('id2name', id2name)
np.save('name2id', name2id)

In [288]:
lmtzr = WordNetLemmatizer()
stemmer = PorterStemmer()

def isValidWord(word):
    return not word.lower() in stopwords and not word.isdigit() and not word in string.punctuation
def processWord(word):
    return lmtzr.lemmatize(word.lower())

# Returns the bag of words of a text as a dictionary, so the different words as keys and their number of occurence as value
def getCourseBagOfWords(text):
    # We create a default dict which returns 0 when an item is not in
    bow = defaultdict(lambda: 0)
    text = nltk.word_tokenize(text)

    for idx in range(len(text)):
        word = text[idx]
        # separate words such that "MyNameIsChristian"
        # becomes "My" "Name" "Is" "Christian"
        # however makes sure that word such as "USA" and "imageJ" are untouched
        res = re.findall('[a-zA-Z][^A-Z]*',word)
        # We check that we have a match of several words
        if len(res) > 1:
            # Check if than we dont have a one letter word
            if len(min(res,key=len)) > 1:
                # delete initial world
                del text[idx]
                # Add the new words
                for offset, match in enumerate(res):
                    text.insert(idx+offset,match)
                            
    for word in text:
        if isValidWord(word):
            bow[processWord(word)] += 1

    return bow

# Computes the bag of word for each course
# and the global bag of words
def getBagOfWords():
    globalBagOfWords = defaultdict(lambda: 0)
    bagOfWords = {}
    for course in courses:
        bow = getCourseBagOfWords(course['description'])
        bagOfWords[course['courseId']] = bow
        for k,v in bow.items():
            globalBagOfWords[k] += v
    
    occurences = sorted(globalBagOfWords.items(), key=itemgetter(1))
    # We remove all words with occurences < minBound and > maxBound
    # where minBound is the occurence of the lowest 20th term
    # and maxBound the occurence of the highest 20th term
    minBound = occurences[20][1]
    maxBound = occurences[-20][1]
    globalBagOfWords = {k: v for k,v in globalBagOfWords.items() if v > minBound and v < maxBound}
    for course in bagOfWords.keys():
        bagOfWords[course] = {k: v for k,v in bagOfWords[course].items() if k in globalBagOfWords}
    return globalBagOfWords, bagOfWords

In [176]:
globalBagOfWords, bagOfWords = getBagOfWords()
print(sum(globalBagOfWords.values()))
print(len(globalBagOfWords.keys()))

118469
7205


## 1. Explain which ones you implemented and why.
We chose to remove all punctuation and all stopwords since there really is no interest in keeping them.

We also lemmatize the words using the nltk library, to keep track of similar words and have a more accurate word occurence count.

We remove every words that was as present as the 20 most/least present words.

## 2.Print the terms in the pre-processed description of the IX class in alphabetical order.


In [177]:
ixBow = bagOfWords[name2id['Internet analytics']]
print('Words for Internet analytics course:')
for words in sorted(ixBow.keys(),key=lambda v: v.upper()):
    print('   -',words)

Words for Internet analytics course:
   - acquired
   - activity
   - ad
   - advertisement
   - algebra
   - algorithm
   - analytics
   - analyze
   - application
   - auction
   - balance
   - based
   - cathedra
   - chain
   - class
   - cloud
   - clustering
   - collection
   - com-300
   - combination
   - communication
   - community
   - computing
   - concrete
   - coverage
   - current
   - data
   - datasets
   - decade
   - dedicated
   - designed
   - detection
   - develop
   - dimensionality
   - draw
   - e-commerce
   - effectiveness
   - efficiency
   - exam
   - expected
   - explore
   - explores
   - field
   - final
   - foundational
   - framework
   - function
   - fundamental
   - good
   - graph
   - hadoop
   - hands-on
   - homework
   - important
   - information
   - infrastructure
   - inspired
   - internet
   - java
   - key
   - knowledge
   - lab
   - laboratory
   - large-scale
   - linear
   - machine
   - main
   - map-reduce
   - markov
   - mat

## Exercise 4.2: Term-document matrix

In [290]:
globalBagOfWords, bagOfWords = getBagOfWords()

# Som helpers variable
numTerms = len(globalBagOfWords.keys())
numCourses = len(bagOfWords.keys())
termsOrder = list(enumerate(globalBagOfWords.keys()))
coursesOrder = list(enumerate(bagOfWords.keys()))
idx2Term = {i[0]: i[1] for i in termsOrder}
term2Idx = {v: k for k,v in idx2Term.items()}
idx2Course = {i[0]: i[1] for i in coursesOrder}
course2Idx = {v: k for k,v in idx2Course.items()}

np.save('idx2Term', idx2Term)
np.save('term2Idx', term2Idx)
np.save('idx2Course', idx2Course)
np.save('course2Idx', course2Idx)

overallFreq = np.zeros(numTerms)
row = []
col = []
data  = []

# We construct the term document matrix
for courseIdx, course in coursesOrder:
    if(len(bagOfWords[course]) == 0):
        continue
    docMax = max(bagOfWords[course].values())
    for termIdx, term in termsOrder:
        if(term not in bagOfWords[course]):
            continue
        
        row.append(termIdx)
        col.append(courseIdx)
        # We use the formual seen in class
        data.append(bagOfWords[course][term]/docMax)
        # We construct at the same time what we need for the idf
        overallFreq[termIdx] += 1

# We construct the sparse matrix
tf_idf = csr_matrix((data,(row,col)))

# We compute the idf
overallFreq = np.log(numCourses/overallFreq)

np.save('idf',overallFreq)

# tf_idf = tf.copy()
# We multiply each row by the corresponding idf
tf_idf.data *= overallFreq.repeat(np.diff(tf_idf.indptr))

np.save('X',tf_idf)

# We convert the Compressed Sparse Row matrix to Compressed Sparse Column matrix
# Since we need a particular course
csc = tf_idf.tocsc()
# Get the index of the course
idx = course2Idx[name2id['Internet analytics']]

# Get the offsets
offset = csc.indptr[idx]
offsetEnd = csc.indptr[idx+1]

# We find the top 15 values' indices in the data and find the term corresponding to those
topTerms = [(idx2Term[csc.indices[offset+index]],tf_idf.data[offset:offsetEnd][index])  for index in np.argsort(-tf_idf.data[offset:offsetEnd])[:15]]
for term, score in topTerms:
    print(term,'\t',score)

analytics 	 2.01373274539
acquired 	 0.863028319455
develop 	 0.863028319455
main 	 0.839055310581
application 	 0.805493098158
graph 	 0.671244248465
linear 	 0.671244248465
algebra 	 0.671244248465
field 	 0.671244248465
inspired 	 0.671244248465
specifically 	 0.671244248465
seek 	 0.604119823618
statistic 	 0.604119823618
lab 	 0.549199839653
work 	 0.503433186349


## Exercise 4.3: Document similarity search

In [284]:
def docSim(di,dj):
    # Norm is done using underlying data vector, much faster than using transpose
    return (di.dot(dj)/(np.sqrt((di.data ** 2).sum())*np.sqrt((dj.data ** 2).sum()))).A[0][0]

In [287]:
# Gives the top 5 course for given terms
# Takes a space separated list of terms as argument
def query(terms):
    terms = terms.split(' ')
    
    # We get the tf-idf for the terms, tf is always 1 and idf was already computed
    data = [overallFreq[term2Idx[term]] for term in terms]
    cols = [term2Idx[term] for term in terms]
    rows = [0] * len(terms)
    
    # We create the query of shape 1 x M
    query = csc_matrix((data,(rows,cols)),shape=(1,numTerms))
    
    # We run document similarity with the query for all the documents
    results = np.array([docSim(query,tf_idf.getcol(doc)) for doc in range(numCourses)])
    top = np.argsort(-results)[:5]
    # We filter all courses with null scores
    top = [course for course in top if results[course] != 0.0]
    
    print('Top courses with query "'+" ".join(terms)+'":')
    for course in top:
        print('   -',id2name[idx2Course[course]],'\t%.3f'%results[course])
query('markov chain')
query('facebook')

Top courses with query "markov chain":
   - Applied probability & stochastic processes 	0.564
   - Applied stochastic processes 	0.563
   - Markov chains and algorithmic applications 	0.357
   - Supply chain management 	0.355
   - Statistical Sequence Processing 	0.299
Top courses with query "facebook":
   - Computational Social Media 	0.186
