# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** R
**Names:**

* Raphael Strebel
* Raphaël Barman
* Thierry Bossy

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl
import string
import re
import operator
import nltk

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')

## Exercise 4.1: Pre-processing

In [2]:
lmtzr = WordNetLemmatizer()
stemmer = PorterStemmer()

def add2bow(word, bow):
    if word not in bow:
        bow[word] = 0
    bow[word] += 1

def mergeBow(bow1, bow2):
    for word, occ in bow2.items():
        if word not in bow1:
            bow1[word] = 0
        bow1[word] += occ

def bagOfWord(text):
    """Return a dict mapping each lemmatized, non-stopword token of `text` to its count."""
    bow = {}
    # Treat non-breaking spaces as regular spaces
    text = text.replace('\xa0', ' ')
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.split(' ')
    for idx in range(len(text)):
        word = text[idx]
        # Separate fused words such that "MyNameIsChristian" becomes "My" "Name" "Is" "Christian"
        res = re.findall('[a-zA-Z][^A-Z]*', word)
        # Only split when no fragment is a single character, so acronyms such as "TEM" stay whole
        if res and len(min(res, key=len)) != 1:
            text[idx] = ''
            text.extend(res)
    text = [x for x in text if x != '']
    for word in text:
        # Keep all-uppercase words as such (we don't want IT to become it); lowercase all others
        if not word.isupper():
            word = word.lower()
        # Lemmatize and count every word that is neither a number nor a stopword
        if not word.isdigit() and word not in stopwords:
            add2bow(lmtzr.lemmatize(word), bow)
    return bow

In [3]:
# Bag of words of each course, keyed by course id
bagOfWords = {}
# Word counts accumulated over the whole corpus
globalBagOfWord = {}
for course in courses:
    bow = bagOfWord(course['description'])
    bagOfWords[course['courseId']] = bow
    mergeBow(globalBagOfWord, bow)
test_course = course.copy()

We chose to remove all punctuation and all stopwords, since they carry little information about the topic of a course.
We also lemmatize the words using the nltk library, so that inflected forms of the same word (for example "decades" and "decade") are counted together, which gives a more accurate word occurrence count.
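
The punctuation removal, word splitting and stopword filtering described above can be tried in isolation with the minimal sketch below. It is only an illustration under stated assumptions: `DEMO_STOPWORDS` and `tokenize` are hypothetical names, the stopword set is a tiny stand-in for `data/stopwords.pkl`, and the lemmatization step is left out so that the example runs without nltk.

```python
import re
import string

# Hypothetical, tiny stopword set used only for this illustration
DEMO_STOPWORDS = {"the", "of", "is", "my"}

def tokenize(text):
    """Strip punctuation, split fused words, normalize case, drop stopwords and numbers."""
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = []
    for word in text.split():
        # "MyNameIsChristian" -> ["My", "Name", "Is", "Christian"]
        parts = re.findall('[a-zA-Z][^A-Z]*', word)
        if parts and min(len(p) for p in parts) > 1:
            tokens.extend(parts)
        else:
            # Acronyms ("IT") and pure numbers are kept as a single token
            tokens.append(word)
    # Lowercase everything except all-uppercase words
    tokens = [t if t.isupper() else t.lower() for t in tokens]
    return [t for t in tokens if t not in DEMO_STOPWORDS and not t.isdigit()]

demo_tokens = tokenize("MyNameIsChristian, the IT expert!")
print(demo_tokens)
```

Keeping all-uppercase tokens untouched is what prevents an acronym such as "IT" from being lowercased and then discarded as the stopword "it".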

In [4]:
test_course = courses[1]
bagOfWords = {}
#globalBagOfWord = {}
#bagOfWord(courses[1]['description'])
print(test_course)
bow = bagOfWord(test_course['description'])
bagOfWords[test_course['courseId']] = bow
#mergeBow(globalBagOfWord,bow)
print(bow)
dict(sorted(bagOfWords.items(), key=operator.itemgetter(1), reverse=True)[:5])


{'name': 'Image Processing for Life Science', 'description': 'This course intends to teach image processing with a strong emphasis of applications in life sciences. The idea is to enable the participants to solve image processing questions via workflows independently. Content Over the last decades, the images arising from microscopes in Life Sciences went from being a qualitative support of scientific evidence to a quantitative resource. To obtain good quality data from digital images, be it from a photograph of a Western blot, a TEM slice or a multi-channel confocal time-lapse stack, scientists must understand the underlying processes leading to the extracted information. Of similar importance is the software used to obtain the data. This course makes use of the ImageJ (FIJI package) as well as other open-source tools to ensure maximum reproducibility and protocol transfer of the analysis pipelines. The course will span 14 weeks with 1h30 of lecture per week, as well as exercises to c

{'BIO-695': {'FIJI': 2,
  'TEM': 1,
  'aim': 1,
  'analysis': 3,
  'application': 1,
  'arising': 1,
  'assessment': 1,
  'autonomous': 1,
  'autonomously': 1,
  'biology': 1,
  'blot': 1,
  'complete': 1,
  'completed': 1,
  'confocal': 1,
  'content': 1,
  'context': 1,
  'continuous': 1,
  'cover': 1,
  'creation': 2,
  'data': 5,
  'decade': 1,
  'deconvolution': 1,
  'defined': 1,
  'denoising': 1,
  'digital': 3,
  'emphasis': 2,
  'enable': 2,
  'ensure': 1,
  'establish': 1,
  'evidence': 1,
  'exercise': 4,
  'extracted': 1,
  'extraction': 1,
  'filtering': 2,
  'format': 1,
  'goal': 1,
  'good': 2,
  'h30': 1,
  'handed': 1,
  'homework': 1,
  'hour': 1,
  'idea': 1,
  'image': 12,
  'imagej': 2,
  'importance': 1,
  'independently': 1,
  'information': 1,
  'intends': 1,
  'interest': 1,
  'introduce': 1,
  'involve': 1,
  'keywords': 1,
  'leading': 1,
  'learning': 1,
  'lecture': 2,
  'life': 3,
  'linear': 1,
  'machine': 1,
  'macro': 3,
  'make': 1,
  'manipulation':

In [5]:
min(globalBagOfWord,key=globalBagOfWord.get)
dict(sorted(globalBagOfWord.items(), key=operator.itemgetter(1), reverse=False)[:5000])
# Maybe discard only the 3 most used words? since system and design are more specific than learning, student and system

{'interconnect': 3,
 'postprimary': 3,
 'inaction': 3,
 'others6': 3,
 'mechanosensory': 3,
 'hyperelliptic': 3,
 'FDD': 3,
 'WINGS': 3,
 'FRP': 3,
 'valentine': 3,
 'curable': 3,
 'epifourier': 3,
 'applications10': 3,
 'emblematic': 3,
 'microdispersed': 3,
 'propagator': 3,
 'photopolymers': 3,
 'equivallent': 3,
 'adaption': 3,
 'aidistributed': 3,
 'techniquesformulate': 3,
 'gyromagnetic': 3,
 'perl': 3,
 'brief': 3,
 'refreshment': 3,
 'poroelasticity': 3,
 'metaloxide': 3,
 'ah20identify': 3,
 'kirchoffs': 3,
 'lawson': 3,
 'EE332': 3,
 'BBC': 3,
 'skundin': 3,
 'oksana': 3,
 'decorrelation': 3,
 'FINAL': 3,
 'NEM': 3,
 'transcriptase': 3,
 '2B': 3,
 'processmicrostructure': 3,
 'pip3signaling': 3,
 'sectoral': 3,
 'nozzle': 3,
 'vestibular': 3,
 'EPMAWDX': 3,
 'light42': 3,
 'dijkstras': 3,
 'T3': 3,
 'spontenaous': 3,
 'URL': 3,
 'multitemporal': 3,
 'principles3': 3,
 'hamming': 3,
 'httpvectorcom': 3,
 'orthopaedic': 3,
 'subunit': 3,
 'materialsapplications': 3,
 'firouzeh

In [48]:
courses

[{'courseId': 'MSE-440',
  'description': "The latest developments in processing and the novel generations of organic composites are discussed. Nanocomposites, adaptive composites and biocomposites are presented. Product development, cost analysis and study of new markets are practiced in team work. Content Basics of composite materialsConstituentsProcessing of compositesDesign of composite structures\xa0Current developmentNanocomposites Textile compositesBiocompositesAdaptive composites\xa0ApplicationsDriving forces and marketsCost analysisAerospaceAutomotiveSport Keywords Composites - Applications - Nanocomposites - Biocomposites - Adaptive composites - Design - Cost Learning Prerequisites Required courses Notion of polymers Recommended courses Polymer Composites Learning Outcomes By the end of the course, the student must be able to: Propose suitable design, production and performance criteria for the production of a composite partApply the basic equations for process and mechanical

In [9]:
# Term-document matrix: one row per term, one column per document
terms = list(globalBagOfWord.keys())
termDocMatrix = np.zeros((len(terms), len(courses)), dtype=int)

for j, document in enumerate(courses):  # j: document index
    bow = bagOfWord(document['description'])
    for i, word in enumerate(terms):  # i: term index
        # bow.get(word) returns None when the word does not occur in this
        # document, so fall back to a count of 0
        termDocMatrix[i][j] = bow.get(word, 0)
print(termDocMatrix)
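
Since most words occur in only a few course descriptions, the term-document matrix is mostly zeros. The sketch below shows one possible way to build it directly as a `scipy.sparse.csr_matrix` from per-document bags of words. The helper name `sparse_term_doc` and the toy input are assumptions made for illustration, not part of the lab code.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_term_doc(bows):
    """Build a sparse term-document matrix.

    bows: list of {term: count} dicts, one per document.
    Returns (matrix, vocab) where matrix[i, j] is the count of vocab[i] in document j.
    """
    vocab = sorted({term for bow in bows for term in bow})
    term_index = {term: i for i, term in enumerate(vocab)}
    rows, cols, vals = [], [], []
    for j, bow in enumerate(bows):
        for term, count in bow.items():
            rows.append(term_index[term])  # row: term index
            cols.append(j)                 # column: document index
            vals.append(count)
    matrix = csr_matrix((vals, (rows, cols)),
                        shape=(len(vocab), len(bows)), dtype=np.int64)
    return matrix, vocab

# Toy input: two documents
toy_bows = [{'image': 2, 'data': 1}, {'data': 3}]
toy_matrix, toy_vocab = sparse_term_doc(toy_bows)
print(toy_vocab)
print(toy_matrix.toarray())
```

Only the non-zero counts are stored, so memory use grows with the number of distinct (term, document) pairs rather than with the full matrix size.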

In [87]:
test_text = "I DIDN'T come here to suffer OKAY?!!!?!?!?"
# The corpus is a list of documents; iterating over a bare string would yield single characters
test_docs = [test_text]
test_bow = bagOfWord(test_text)
print(test_bow)
total = sum(test_bow.values())
# One row per document, one column per distinct word
f = np.zeros((len(test_docs), len(test_bow)))

print(total)
for i, doc in enumerate(test_docs):
    doc_bow = bagOfWord(doc)
    # enumerate restarts the column index at 0 for every document
    for j, word in enumerate(doc_bow):
        f[i][j] = doc_bow[word] / total
        print(f[i][j])

{'OKAY': 1, 'I': 1, 'suffer': 1, 'DIDNT': 1}
4
0.25
0.25
0.25
0.25

From the slides:

$N$ = total number of documents

$f_{ij}$ = number of occurrences of word $i$ in document $j$, i.e. `bagOfWord(j)[i]`

$n_i$ = number of documents in which word $i$ occurs at least once

$$\mathrm{tf}_{ij} = \frac{f_{ij}}{\max_k f_{kj}}$$

$$\mathrm{idf}_i = -\log_2\left(\frac{n_i}{N}\right)$$

$$\text{tf-idf}_{ij} = \mathrm{tf}_{ij} \cdot \mathrm{idf}_i$$
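
The definitions above translate almost line by line into NumPy. The following is a minimal sketch, assuming a dense count matrix with one row per word and one column per document, and assuming every word occurs in at least one document. The function name `tf_idf` and the toy counts are made up for illustration.

```python
import numpy as np

def tf_idf(counts):
    """Compute TF-IDF weights from counts[i, j] = occurrences of word i in document j."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    # tf: divide each column by the largest count in that document
    max_per_doc = counts.max(axis=0)
    max_per_doc[max_per_doc == 0] = 1.0  # avoid dividing by zero for empty documents
    tf = counts / max_per_doc
    # idf: -log2 of the fraction of documents that contain the word
    doc_freq = np.count_nonzero(counts, axis=1)
    idf = -np.log2(doc_freq / n_docs)
    # Scale every row of tf by the idf of its word
    return tf * idf[:, np.newaxis]

# Toy counts: 3 words (rows) in 3 documents (columns)
counts = [[2, 0, 1],
          [1, 1, 1],
          [0, 4, 0]]
weights = tf_idf(counts)
print(weights)
```

A word that appears in every document (the second row here) gets an idf of 0, so its weight vanishes regardless of how often it occurs.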

## Exercise 4.2: Term-document matrix

## Exercise 4.3: Document similarity search