#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2020


# Homework 4:  Word Embeddings for Information Retrieval and Query Expansion

### 100 points [5% of your final grade]

### Due: April 28, 2020 by 11:59pm

*Goals of this homework:* In this homework you will improve your information retrieval engine in homework 1 by word embeddings to: (i) directly match the query and the document in the latent semantic space of word embeddings; (ii) expand the original query via word embeddings.

*Submission instructions (eCampus):* To submit your homework, rename this notebook as `UIN_hw4.ipynb`. For example, my homework submission would be something like `555001234_hw4.ipynb`. Submit this notebook via eCampus (look for the homework 1 assignment there). Your notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so). So please run all the cells for us, and then submit.

*Late submission policy:* For this homework, you may use as many late days as you like (up to the total allotted to you).

*Collaboration policy:* You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and team work are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by **filling out the Collaboration Declarations at the bottom of this notebook**. 

*Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.*

The basic rule is that no student should explicitly share a solution with another student (and thereby circumvent the basic learning process), but it is okay to share general approaches, directions, and so on. If you feel like you have an issue that needs clarification, feel free to contact either me or the TA.

## Part 0. Dataset and Parsing (The same as Homework 1)

The dataset is collected from Quizlet (https://quizlet.com), a website where users can generated their own flashcards. Each flashcard generated by a user is made up of an entity on the front and a definition describing or explaining the entity correspondingly on the back. We treat entities on each flashcard's front as the queries and the definitions on the back of flashcards as the documents. Definitions (documents) are relevant to an entity (query) if the definitions are from the back of the entity's flashcard; otherwise definitions are not relevant. **In this homework, queries and entities are interchangeable as well as documents and definitions.**

The format of the dataset is like this:

**query \t document id \t document**

Examples:

decision tree	\t 27946 \t	show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers).

where "decision tree" is the entity in the front of a flashcard and "show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers)." is the definition on the flashcard's back and "27946" is the id of the definition. Naturally, this document is relevant to the query.

false positive rate	\t 686	\t fall-out; probability of a false alarm

where document 686 is not relevant to query "decision tree" because the entity of "fall-out; probability of a false alarm" is "false positive rate".

For parsing this dataset, you could also just copy your code from homework 1 to complete the following tasks:
* Tokenize documents (definitions) using **whitespaces and punctuations as delimiters**.
* Remove stop words: use nltk stop words list (from nltk.corpus import stopwords)
* Stemming: use [nltk Porter stemmer](http://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter)
* Remove any other strings that you think are less informative or nosiy.

In [235]:
import re
import numpy as np
import pandas as pd
import math

from nltk.corpus import stopwords
from nltk.stem.porter import *
from pandas import DataFrame, Series
from collections import defaultdict
from sklearn.preprocessing import normalize

In [2]:
# configuration options
remove_stopwords = True  # or false
use_stemming = True # or false
remove_otherNoise = True # or false

In [3]:
stop_words = set(stopwords.words('english'))

# load data
filename = "homework_4_data.txt"
with open(filename) as f:
    content = f.readlines()

# split data into 3 columns and convert it to a dataframe
documents = np.array([re.split("\t", x) for x in content])
documents_df = pd.DataFrame(documents, columns=["entity", "id", "definition"])

In [4]:
def count_unique_words(definitions: Series):
    distinct_set = set()
    
    for index, value in definitions.items():
        distinct_set.update(value)
    
    return len(distinct_set)

def stem_tokens(tokens):
    stemmer = PorterStemmer()
    return np.array([stemmer.stem(token) for token in tokens])

def parse(definitions: Series, remove_stopwords: bool = True, 
          use_stemming: bool = True, remove_otherNoise: bool = True,
         show_log = True):
    definitions = definitions.apply(lambda x: re.split("\W+", x.strip()))
    
    if show_log:        
        print("None of pre-processing options =", count_unique_words(definitions))
    
    # remove stopwords
    if remove_stopwords:
        def rm_stopwords(tokens):
            removed = []
            for w in tokens:
                if w not in stop_words:
                    removed.append(w)
            return removed
        
        definitions = definitions.apply(rm_stopwords)
        if show_log:
            print("After removing stop words =", count_unique_words(definitions))

    # do stemming
    if use_stemming:
        stemmer = PorterStemmer()
        
        def stem(tokens):
            return [stemmer.stem(token) for token in tokens]
        
        definitions = definitions.apply(stem)
        if show_log:
            print("After removing stop words + stemming =", count_unique_words(definitions))
        
    # remove noise        
    if remove_otherNoise:
        def rm_noise(tokens):
            removed = []
            # to lower case, then remove digits and empty strings
            for w in tokens:
                w = w.lower()
                if not w.isdigit() and w.strip():
                    removed.append(w)
            return removed
                
        definitions = definitions.apply(rm_noise)
        if show_log:
            print("After removing stop words + stemming + removing other noise = %i \n" % count_unique_words(definitions))
    
    return definitions

In [5]:
# parse definitions
documents_df = documents_df.assign(tokens=parse(documents_df['definition'], 
                                                remove_stopwords=remove_stopwords, 
                                                use_stemming=use_stemming, 
                                                remove_otherNoise=remove_otherNoise))

None of pre-processing options = 15939
After removing stop words = 15799
After removing stop words + stemming = 9921
After removing stop words + stemming + removing other noise = 9654 



In [6]:
# show parsed documents dataframe
documents_df

Unnamed: 0,entity,id,definition,tokens
0,bottom-up approach,0,estimates of duration and cost are made for ea...,"[estim, durat, cost, made, compon, separ, comb..."
1,bottom-up approach,1,pattern from process deductive approach use ...,"[pattern, process, deduct, approach, use, unde..."
2,bottom-up approach,2,normalization; 3nf\n,"[normal, 3nf]"
3,bottom-up approach,3,client factor standpoint\n,"[client, factor, standpoint]"
4,bottom-up approach,4,"approach for dynamic programming problems, whi...","[approach, dynam, program, problem, problem, s..."
...,...,...,...,...
30912,sensitivity analysis,30912,understand changes in the analysis outcome if ...,"[understand, chang, analysi, outcom, case, ass..."
30913,memory controller,30913,the chip that manages access to the ram and a ...,"[chip, manag, access, ram, other, comm, direct..."
30914,memory controller,30914,chip that manages access to ram.\n,"[chip, manag, access, ram]"
30915,memory controller,30915,this handles dram accesses by converting reque...,"[handl, dram, access, convert, request, proces..."


# Part 1: Word2Vec (30 points)

In this part you will use the Word2Vec algorithm to generate word embeddings for tokens in the dataset. You can just use a package like https://radimrehurek.com/gensim/models/word2vec.html. Let's set the size of word embeddings to be 20. Please print the word embeddings for the tokens: 
* relational
* database
* garbage
* collection
* retrieval 
* model

In [7]:
# code here.
# how do you generate the word embeddings

In [8]:
from gensim.models import word2vec as w2v

In [9]:
tokens_array = documents_df['tokens']

In [10]:
# build the model and the vocabulary
model = w2v.Word2Vec(min_count=1, window=6, size=20, alpha=0.03, min_alpha=0.0007, negative=20, workers=8)
model.build_vocab(tokens_array)

In [11]:
# train the Word2Vec model
model.train(tokens_array, total_examples=model.corpus_count, epochs=30)

(10034490, 11106000)

In [12]:
original_tokens = ['relational', 'database', 'garbage', 'collection', 'retrieval', 'model']
stemmed_tokens = stem_tokens(original_tokens)
stemmed_tokens

array(['relat', 'databas', 'garbag', 'collect', 'retriev', 'model'],
      dtype='<U7')

In [13]:
# print the word embeddings of the six tokens
for i in range(len(original_tokens)):
    print("The word embedding of '%s': \n%s\n" % (original_tokens[i], model.wv[stemmed_tokens[i]]))

The word embedding of 'relational': 
[ 3.987224    0.99150014 -0.2030409   5.1891437   0.03617263  2.2877784
 -1.498411    1.7737788   1.658147    0.44948825  2.339248   -4.111205
 -3.6118011   3.4931428  -1.1895815   2.4489295  -3.6807873   1.4104651
 -0.97014517  1.9691614 ]

The word embedding of 'database': 
[ 1.0722051  -2.3072646   1.2980088   0.13472337 -2.1487749   1.5096837
 -5.710013    5.101156    3.853077    0.73120254 -1.1705676  -4.029453
 -3.2424946   1.9834421  -1.2421125   1.5238795  -5.9608893   0.08506732
 -2.6960568   0.04874504]

The word embedding of 'garbage': 
[ 0.5175103  -0.19416364 -0.38053524  0.27035043  0.382551   -1.7212878
 -1.775694   -0.5677808   0.38113728  0.93155485 -0.57317287  0.08332677
  2.1163151  -0.6121957  -1.3398107  -0.15714733 -0.3113945   1.4265146
 -2.238204   -0.33334285]

The word embedding of 'collection': 
[ 2.5483363   2.5248249   0.61709976  0.3526393  -0.30008832  0.4942185
 -2.6464262   2.7307284   6.9969535   2.3407018   2.1837

# Part 2: Vector Space Model via Word Embeddings (40 points) 

In this part, your job is to match the query and the document via the cosine similarity between the embeddings of them.

Since there are not just one token in a query or a document, the first challenge is how to aggregate many word embeddings into one embedding of a query or a document. There are many ways to do so: 
* Max pooling: return the maximum value along each dimension of a bunch of word embeddings. For example, [1, 3, 4], [2, 1, 5] -> [2, 3, 5].
* Min pooling: return the minimum value along each dimension of a bunch of word embeddings
* Mean pooling: return the mean value along each dimension of a bunch of word embeddings
* Sum: element-wise add a bunch of word embeddings together
* Weighted sum: assign weights to word embeddings and then add them together. Weights could be TF, IDF or TF-IDF.

In [226]:
# your code here
import math

from sklearn.preprocessing import normalize
from IPython.core.display import HTML


embedding_size = 20

In [227]:
def max_pooling_df(model, documents_df):
    def max_pooling(tokens):
        aggregated_embedding = np.zeros((embedding_size, ))
        for token in tokens:
            aggregated_embedding = np.maximum(aggregated_embedding, model.wv[token])
        return aggregated_embedding
    
    return documents_df["tokens"].apply(max_pooling)

def min_pooling_df(model, documents_df):
    def min_pooling(tokens):
        aggregated_embedding = np.zeros((embedding_size, ))
        for token in tokens:
            aggregated_embedding = np.minimum(aggregated_embedding, model.wv[token])
        return aggregated_embedding
    
    return documents_df["tokens"].apply(min_pooling)

def avg_pooling_df(model, documents_df):
    def avg_pooling(tokens):
        aggregated_embedding = np.zeros((embedding_size, ))
        print(tokens)
        if len(tokens) > 0:
            for token in tokens:
                aggregated_embedding = aggregated_embedding + model.wv[token]
            aggregated_embedding /= len(tokens)
        return aggregated_embedding
    
    return documents_df["tokens"].apply(avg_pooling)

def sum_pooling_df(model, documents_df):
    def sum_pooling(tokens):
        aggregated_embedding = np.zeros((embedding_size, ))
        for token in tokens:
            aggregated_embedding = aggregated_embedding + model.wv[token]
        return aggregated_embedding
    
    return documents_df["tokens"].apply(sum_pooling)

def get_df_dict(documents_df):
    # get vocabulary set
    vocabulary_set = set()
    for index, value in documents_df["tokens"].items():
        vocabulary_set.update(value)
    
    # get df dictionary for calculating tf-idf
    df_dict = dict()
    for token in vocabulary_set:
        df_dict[token] = documents_df["tokens"].apply(lambda tokens: 1 if token in tokens else 0).sum()
        
    return df_dict

def tf_idf_weighted_pooling_df(model, documents_df, df_dict):    
    N = documents_df.shape[0]
    def tf_idf_weighted_pooling(tokens):
        aggregated_embedding = np.zeros((embedding_size, ))
        for token in tokens:
            tf = tokens.count(token)
            df = df_dict[token]
            tf_idf = (1 + math.log10(tf)) * math.log10(N / df)
            aggregated_embedding += tf_idf * model.wv[token]
        return aggregated_embedding
    
    return documents_df["tokens"].apply(tf_idf_weighted_pooling)

In [228]:
# add columns of different pooling approaches
df_dict = get_df_dict(documents_df)
documents_df = documents_df.assign(max_pooling_embedding=max_pooling_df(model, documents_df))
documents_df = documents_df.assign(min_pooling_embedding=min_pooling_df(model, documents_df))
documents_df = documents_df.assign(avg_pooling_embedding=avg_pooling_df(model, documents_df))
documents_df = documents_df.assign(sum_pooling_embedding=sum_pooling_df(model, documents_df))
documents_df = documents_df.assign(tf_idf_weighted_pooling_embedding=tf_idf_weighted_pooling_df(model, documents_df, df_dict))
documents_df

['estim', 'durat', 'cost', 'made', 'compon', 'separ', 'combin', 'provid', 'overal', 'figur']
['pattern', 'process', 'deduct', 'approach', 'use', 'understand', 'process', 'predict', 'pattern', 'emerg', 'individu', 'agent', 'base', 'model', 'macroscop', 'emerg', 'pattern', 'micro', 'level', 'behavior']
['normal', '3nf']
['client', 'factor', 'standpoint']
['approach', 'dynam', 'program', 'problem', 'problem', 'solut', 'compos', 'solut', 'problem', 'smaller', 'input']
['piec', 'togeth', 'system', 'give', 'rise', 'complex', 'system']
['piec', 'togeth', 'system', 'give', 'rise', 'grander', 'system']
['begin', 'look', 'process', 'directli', 'activ', 'level', 'aggreg', 'identifi', 'process', 'across', 'organ']
['begin', 'level', 'attribut', 'normal', 'bottom', 'use', 'small', 'databas', 'attribut']
['often', 'intuit', 'approach', 'use', 'recurs', 'dynam', 'program', 'start', 'know', 'solv', 'problem', 'simpl', 'case', 'like', 'list', 'one', 'element', 'figur', 'solv', 'problem', 'two', 'elemen

['process', 'use', 'previous', 'test', 'code']
['write', 'code', 'perform', 'task', 'reus', 'time', 'need', 'perform', 'task', 'old', 'code', 'secur', 'vulner', 'reus', 'code', 'spread', 'applic']
['use', 'old', 'code', 'build', 'new', 'applic', 'copi', 'past', 'known']
['attack', 'execut', 'code', 'meant', 'purpos']
['benefit', 'use', 'modul', 'write', 'code', 'perform', 'task', 'reus', 'time', 'need', 'perform', 'task']
['code', 'reus', 'involv', 'write', 'code', 'use', 'without', 'make', 'chang', 'function', 'import', 'tool', 'code', 'reus']
['reus', 'someon', 'els', 'code', 'save', 'develop', 'time', 'money', 'add', 'secur', 'howev', 'attack', 'exploit', 'reus', 'code']
['snippet', 'code', 'previous', 'use', 'get', 'put', 'new', 'code', 'function', 'speed', 'process', 'may', 'enter', 'bring', 'error', 'new', 'code']
['write', 'modul', 'use', 'place', 'want']
['local', 'wrt', 'time', 'move', 'recent', 'use', 'data', 'higher', 'level', 'memori']
['recent', 'access', 'data', 'element'

['individu', 'measur', 'e', 'peak', 'area', 'chromatogram']
['collect', 'ir', 'patient', 'exposur']
['data', 'record', 'directli', 'logbook']
['train', 'algorithm', 'test', 'train', 'algorithm']
['data', 'collect', 'measur', 'count', 'field', 'laboratori']
['data', 'collect', 'cannot', 'compar', 'across', 'dissimilar', 'variabl']
['data', 'collect', 'field', 'laboratori', 'experi']
['data', 'collect', 'origin', 'form']
['number', 'charact', 'requir', 'sort', 'group', 'aggreg']
['yet', 'process', 'reveal', 'mean']
['data', 'collect', 'studi', 'organis']
['origin', 'data', 'collect', 'without', 'modif', 'compress', 'etc']
['ungroup', 'data', 'organ', 'frequenc', 'distribut']
['inform', 'found', 'test', 'usual', 'number', 'color', 'format']
['fact', 'figur']
['actual', 'observ']
['individu', 'valu', 'measur', 'quantiti', 'peak', 'area', 'chromatogram', 'volum', 'buret']
['unorgan', 'data']
['result', 'preprocess', 'measur', 'data', 'store', 'retriev', 'need']
['origin', 'data', 'collect',

['process', 'sever', 'instruct', 'simultan', 'alway', 'possibl', 'instruct', 'must', 'carri', 'sequenti']
['process', 'sever', 'aspect', 'stimulu', 'simultan', 'brain', 'divid', 'visual', 'scene', 'subdivison', 'color', 'depth', 'form', 'movement']
['process', 'mani', 'aspect', 'problem', 'simultan', 'brain', 'natur', 'mode', 'inform', 'process', 'mani', 'function', 'contrast', 'step', 'step', 'serial', 'process', 'comput', 'consciou', 'problem', 'solv']
['two', 'stimuli', 'enter', 'system', 'process', 'togeth', 'without', 'interfer']
['process', 'multipl', 'type', 'inform', 'time']
['simultan', 'across', 'entir', 'visual', 'field', 'pre', 'attent', 'pop', 'singl', 'featur']
['process', 'mani', 'aspect', 'problem', 'simultan', 'brain', 'natur', 'mode', 'inform', 'process', 'mani', 'function', 'contrast', 'step', 'step', 'process', 'comput', 'consciou', 'problem', 'solv']
['turnov', 'strategi', 'organ', 'continu', 'process', 'manual', 'form', 'well', 'electron', 'form']
['process', 'man

['provid', 'essenti', 'inform', 'busi', 'perform', 'across', 'function', 'area', 'significantli', 'improv', 'manag', 'abil', 'make', 'better', 'time', 'decis']
['support', 'semistructur', 'decis', 'use', 'mathemat', 'analyt', 'model', 'allow', 'vari', 'type', 'analysi', 'analysi', 'sensit', 'analysi', 'backward', 'sensit', 'analysi', 'multidimension', 'analysi', 'olap', 'exampl', 'pivot', 'tabl']
['tool', 'built', 'ehr', 'provid', 'staff', 'result', 'research', 'best', 'practic', 'enhanc', 'patient', 'care']
['give', 'strateg', 'mean', 'oper', 'data', 'data', 'long', 'term', 'present', 'differ', 'level', 'aggreg', 'highli', 'summar', 'nearli', 'atom', 'includ', 'mani', 'data', 'dimens', 'relat', 'dimens', 'tabl', 'snapshot', 'oper', 'data', 'given', 'time']
['follow', 'depart', 'might', 'best', 'use', 'data', 'warehous']
['system', 'take', 'data', 'collect', 'healthcar', 'facil', 'turn', 'inform', 'use', 'inform', 'make', 'decis', 'healthcar', 'facil']
['system', 'design', 'support', '

['allow', 'multipl', 'transact', 'run', 'concurr', 'ensur', 'data', 'access', 'control', 'overal', 'effect', 'run', 'serial', 'order']
['ensur', 'updat', 'queri', 'done', 'sever', 'user', 'access', 'data', 'time', 'done', 'correctli']
['simultan', 'process', 'behav', 'properli']
['allow', 'sever', 'transact', 'execut', 'simultan', 'collect', 'manipul', 'data', 'item', 'left', 'consist', 'state']
['coordin', 'simultan', 'transact', 'multius', 'db', 'ensur', 'serializ', 'problem', 'lost', 'updat', 'uncommit', 'data', 'inconsist', 'retriev']
['process', 'manag', 'simultan', 'oper', 'databas', 'without', 'interfer', 'one', 'anoth', 'prevent', 'interfer', 'two', 'user', 'access', 'databas', 'simultan', 'least', 'one', 'updat', 'data', 'although', 'two', 'transact', 'may', 'correct', 'interleav', 'oper', 'may', 'produc', 'incorrect', 'result']
['problem', 'occur', 'two', 'transact', 'user', 'attempt', 'updat', 'piec', 'data', 'simultan']
['concurr', 'control', 'ensur', 'one', 'user', 'action

['confid', 'associ', 'rule']
['mani', 'specif', 'predict', 'possibl']
['deriv', 'real', 'world', 'constraint', 'use', 'key', 'defin', 'normal', 'form', 'relat']
['repres', 'constraint', 'valu', 'attribut', 'relat', 'use', 'normal', 'statement', 'relationship', 'attribut', 'relat', 'say', 'set', 'attribut', 'x', 'function', 'determin', 'attribut', 'given', 'valu', 'x', 'alway', 'know', 'possibl', 'valu', 'x', 'gt', 'properti', 'domain', 'model', 'data', 'instanc', 'current', 'databas']
['use', 'identifi', 'schema', 'problem', 'suggest', 'schema', 'refin', 'relat', 'depend', 'primari', 'key', 'sinc', 'primari', 'key', 'identifi', 'valu', 'attribut', 'attribut', 'depend', 'anoth', 'includ', 'primari', 'key', 'redund']
['often', 'creat', 'data', 'redund', 'direction', 'matter', 'full', 'depend', 'valu', 'non', 'key', 'column', 'determin', 'exclus', 'primari', 'key', 'partial', 'depend', 'valu', 'non', 'key', 'column', 'function', 'depend', 'part', 'composit', 'primari', 'key', 'transit', '

['neural', 'network', 'use', 'solv', 'complex', 'poorli', 'understood', 'problem', 'larg', 'amount', 'data', 'collect', 'find', 'pattern', 'relationship', 'massiv', 'amount', 'data', 'would', 'complic', 'difficult', 'human', 'analys', 'neural', 'network', 'discov', 'knowledg', 'use', 'hardwar', 'softwar', 'parallel', 'process', 'pattern', 'biolog', 'human', 'brain', 'wherea', 'expert', 'system', 'seek', 'emul', 'model', 'human', 'expert', 'way', 'solv', 'problem', 'neural', 'network', 'builder', 'claim', 'program', 'solut', 'aim', 'solv', 'specif', 'problem', 'instead', 'neural', 'network', 'design', 'seek', 'put', 'intellig', 'hardwar', 'form', 'generalis', 'capabl', 'learn', 'contrast', 'expert', 'system', 'highli', 'specif', 'given', 'problem', 'cannot', 'retrain', 'easili', 'neural', 'network', 'applic', 'medicin', 'scienc', 'busi', 'address', 'problem', 'pattern', 'classif', 'predict', 'financi', 'analysi', 'control', 'optimis', 'medicin', 'neural', 'network', 'applic', 'use', 'sc

['uncov', 'new', 'knowledg', 'pattern', 'trend', 'rule', 'data', 'store', 'data', 'warehous']
['process', 'extract', 'specif', 'data', 'inform', 'knowledg', 'previous', 'unknown']
['busi', 'analyt', 'focus', 'better', 'understand', 'characterist', 'pattern', 'among', 'variabl', 'larg', 'data', 'set', 'descript', 'prescript', 'mostli']
['enabl', 'user', 'analyz', 'data', 'variou', 'dimens', 'angl', 'categor', 'find', 'correl', 'pattern', 'among', 'field', 'data', 'warehous']
['discov', 'pattern', 'relationship', 'data', 'help', 'make', 'better', 'busi', 'decis']
['draw', 'larg', 'dataset', 'data', 'miner', 'develop', 'analyt', 'framework', 'help', 'understand', 'pattern', 'data', 'predict', 'treatment', 'outcom', 'forecast', 'futur', 'medic', 'cost', 'util']
['repres', 'converg', 'disciplin', 'techniqu', 'emerg', 'variou', 'field']
['concern', 'gener', 'novel', 'insight', 'larg', 'databas', 'systemat', 'hunt', 'nugget', 'action', 'intellig']
['process', 'search', 'valuabl', 'busi', 'inf

['osi', 'layer', 'coordin', 'represent', 'bit', 'function', 'requir', 'carri', 'bit', 'stream', 'physic', 'medium']
['layer', 'consist', 'actual', 'physic', 'connect', 'sender', 'receiv', 'layer', 'transfer', 'electr', 'radio', 'light', 'signal', 'circuit']
['tcp', 'ip', 'unit', 'bit', 'b', 'actual', 'physic', 'connect', 'c', 'pin', 'connect', 'wire', 'cabl', 'transmiss', 'move', 'physic', 'world']
['defin', 'physic', 'characterist', 'transmiss', 'medium', 'connector', 'pin', 'electr', 'current', 'encod', 'light', 'modul', 'etc', 'rj', 'ethernet', 'ieee', 'lan', 'hub', 'lan', 'repeat', 'cabl']
['describ', 'data', 'store', 'deal', 'file', 'data', 'structur']
['cabl', 'wireless', 'ie', 'hub']
['first', 'layer', 'osi', 'model', 'real', 'transmiss', 'data', 'bit', 'take', 'place', 'medium', 'layer', 'name', 'suggest', 'physic', 'stuff', 'connect', 'comput', 'togeth', 'medium']
['transmit', 'bit', 'medium', 'provid', 'mechan', 'electr', 'specif', 'defin', 'physic', 'characterist', 'represen

['contain', 'instruct', 'variabl', 'specif', 'name', 'instead', 'binari', 'number']
['machin', 'languag', 'easier', 'read']
['low', 'level', 'program', 'languag', 'difficult', 'human', 'easi', 'machin', 'use', 'mnemon', 'opcod', 'mov', 'sto', 'load', 'interact', 'directli', 'comput', 'cpu', 'regist', 'use', 'produc', 'highli', 'effici', 'fast', 'program']
['low', 'level', 'languag', 'use', 'symbol', 'name', 'rather', 'binari', 'sequenc', '0s', '1s', 'repres', 'machin', 'languag', 'instruct']
['although', 'comput', 'cpu', 'understand', 'machin', 'languag', 'impract', 'write', 'program', 'languag', 'creat', 'earli', 'altern', 'instead', 'use', 'binari', 'number', 'instruct', 'languag', 'use', 'short', 'word', 'known', 'mnemon', 'program', 'cannot', 'execut', 'cpu', 'howev', 'special', 'program', 'use', 'translat', 'sort', 'program', 'machin', 'languag', 'program']
['low', 'level', 'languag', 'written', 'use', 'mnemon']
['first', 'symbol', 'languag', 'acronym', 'assembl', 'string', 'use',

['mean', 'sourc', 'code', 'program', 'avail', 'public']
['softwar', 'creat', 'free', 'use', 'everyon', 'make', 'commod', 'collabor', 'effect', 'programm', 'develop', 'improv', 'share', 'softwar', 'code', 'within', 'commun', 'user', 'abil', 'access', 'softwar', 'code', 'edit', 'code', 'publish', 'ex', 'linux', 'oper', 'system', 'firefox', 'android', 'licens', 'cheaper', 'requir', 'addit', 'licens', 'organ', 'scale', 'transpar', 'secur', 'less', 'bug', 'reinvent', 'wheel', 'open', 'sourc', 'move', 'erp', 'crm']
['free', 'distribut', 'program', 'public']
['share', 'care', 'often', 'time', 'oblig', 'user', 'agreement', 'term', 'share', 'chang', 'make', 'whether', 'profit', 'use', 'product']
['softwar', 'origin', 'sourc', 'code', 'made', 'freeli', 'avail', 'may', 'redistribut', 'modifi', 'open', 'sourc', 'file', 'type', 'often', 'abl', 'open', 'mani', 'differ', 'type', 'applic']
['softwar', 'design', 'freeli', 'avail', 'access', 'distribut', 'modifi', 'anyon']
['softwar', 'freeli', 'use', '

['histori', 'anthropolog', 'sociolog', 'psycholog']
['record', 'type']
['file', 'store', 'type', 'data']
['store', 'data', 'inform', 'file', 'conceptu', 'level', 'file', 'collect', 'record', 'may', 'may', 'order']
['put', 'valu', 'solut']
['defin', 'requir', 'success', 'enough', 'detail', 'guid', 'action']
['solut', 'system', 'equat', 'ax', 'system', 'ax', 'b']
['solut', 'gener', 'solut', 'famili', 'solut']
['analysi', 'ask', 'much', 'traffic', 'must', 'flow', 'network', 'mani', 'individu', 'transmiss', 'link']
['attempt', 'gather', 'inform', 'somebodi', 'internet', 'connect', 'analyz', 'time', 'length', 'destin', 'packet']
['examin', 'flow', 'network', 'traffic', 'pattern', 'except', 'pattern']
['gt', 'gt', 'traffic', 'analysi', 'involv', 'determin', 'locat', 'ident', 'commun', 'host', 'could', 'observ', 'frequenc', 'length', 'messag', 'exchang']
['protocol', 'packet', 'flow']
['attack', 'monitor', 'transmiss', 'via', 'wireless', 'network', 'identifi', 'commun', 'pattern', 'particip']

['project', 'dataset', 'onto', 'lower', 'dimension', 'subspac', 'tri', 'maintain', 'class', 'discriminatori', 'inform', 'supervis', 'learn', 'techniqu']
['anoth', 'way', 'find', 'linear', 'transform', 'reduc', 'number', 'dimens', 'supervis', 'techniqu', 'classif']
['classif', 'predict', 'quant', 'input', 'qual', 'output', 'ie', 'credit', 'offic', 'det', 'good', 'risk', 'base', 'incom', 'credit', 'score', 'disqual', 'also', 'qual', 'ind', 'var']
['pca', 'unsupervis', 'pick', 'new', 'dimens', 'give', 'maximum', 'separ', 'mean', 'project', 'class', 'minimum', 'varianc', 'within', 'project', 'class', 'solut', 'eigenvector', 'base', 'class', 'covari', 'matric']
['similar', 'euclidean', 'classifi', 'better', 'assumpt', 'class', 'ident', 'priori', 'probabl', 'class', 'exhibit', 'gaussian', 'distribut', 'mean', 'lt', 'vec', 'x', 'gt', 'lt', 'vec', 'x', 'gt', 'covari', 'matric', 'lambda', 'lambda', 'covari', 'matric', 'assum', 'ident', 'class', 'lambda', 'lambda', 'lambda', 'r', 'vec', 'x', 've

['well', 'describ', 'solut', 'commonli', 'encount', 'problem', 'occur', 'softwar', 'develop']
['go', 'beyond', 'style', 'visual', 'repetit', 'interact', 'design']
['way', 'reus', 'abstract', 'knowledg', 'problem', 'solut', 'pattern', 'descript', 'problem', 'essenc', 'solut', 'help', 'inherit', 'polymorph']
['good', 'solut', 'common', 'problem', 'deriv', 'success', 'accept', 'design', 'solut']
['standard', 'design', 'techniqu', 'templat', 'wide', 'recogn', 'good', 'practic']
['make', 'good', 'inform', 'system', 'design']
['gener', 'design', 'pattern', 'tidwel', 'salaasko', 'yahoo', 'weli', 'type', 'structur', 'navig', 'widget']
['customiz', 'solut', 'design', 'problem']
['pattern', 'level', 'class', 'modul']
['reus', 'past', 'solut', 'design', 'problem']
['element', 'reusabl', 'object', 'orient', 'softwar']
['standard', 'solut', 'common', 'program', 'problem', 'techniqu', 'make', 'code', 'flexibl', 'make', 'meet', 'certain', 'criteria', 'design', 'implement', 'structur', 'achiev', 'part

['longest', 'path', 'determin', 'expect', 'project', 'durat']
['sequenc', 'schedul', 'activit', 'determin', 'durat', 'project']
['time', 'path', 'longest', 'delay']
['longest', 'sequenc', 'order', 'step', 'need', 'complet', 'comput']
['describ', 'shortest', 'amount', 'time', 'requir', 'complet', 'project', 'take', 'account', 'project', 'task', 'relationship']
['follow', 'process', 'descript', 'tool', 'decis', 'tree', 'b', 'modular', 'design', 'c', 'critic', 'path']
['gener', 'alway', 'sequenc', 'schedul', 'activ', 'determin', 'durat', 'project', 'longest', 'path', 'project', 'see', 'also', 'critic', 'path', 'methodolog']
['link', 'project', 'task', 'determin', 'long', 'project', 'take', 'complet']
['longest', 'path', 'process', 'flow', 'chart']
['longest', 'amount', 'time', 'complet', 'project']
['use', 'determin', 'earliest', 'data', 'project', 'finish', 'longest', 'path', 'network', 'activ']
['longest', 'path', 'sequenc', 'task', 'term', 'task', 'durat', 'network', 'diagram', 'delay'

['dataflow', 'arrow', 'repres']
['singl', 'piec', 'data', 'logic', 'collect', 'sever', 'piec', 'inform', 'name', 'noun', 'descript', 'list', 'exactli', 'data', 'element', 'flow', 'contain', 'hold', 'process', 'togeth', 'show', 'input', 'go', 'process', 'output', 'process', 'produc']
['flow', 'inform', 'call', 'code', 'function', 'function', 'back', 'call', 'code']
['singl', 'piec', 'data', 'logic', 'collect', 'data', 'name', 'describ', 'content', 'data', 'flow', 'implement', 'alway', 'start', 'end', 'process', 'name', 'descript', 'one', 'connect', 'process']
['compon', 'filter', 'connector', 'pipe', 'constraint', 'filter', 'independ']
['depict', 'data', 'motion', 'move', 'unit', 'one', 'place', 'anoth', 'system']
['movement', 'data', 'subprogram', 'implement', 'use', 'paramet']
['movement', 'data', 'among', 'process', 'store', 'sourc', 'destin']
['show', 'movement', 'fo', 'data', 'among', 'activ', 'repositori']
['data', 'input', 'output', 'process', 'data', 'flow', 'data', 'motion']
['

['measur', 'spread', 'dispers', 'set', 'data', 'mean']
['comput', 'measur', 'much', 'score', 'vari', 'around', 'mean', 'score', 'use', 'interv', 'ratio', 'scale', 'data']
['measur', 'variat', 'score', 'mean']
['statist', 'formula', 'sd', 'less', 'formal', 'place', 'squar', 'root', 'varianc', 'measur', 'variabl', 'data', 'around', 'mean', 'unit', 'data', 'standard', 'deviat', 'approxim', 'averag', 'deviat']
['squar', 'root', 'varianc', 'lower', 'sd', 'mean', 'data', 'distribut', 'around', 'mean', 'sd', 'varianc']
['measur', 'dispers', 'collect', 'number']
['measur', 'much', 'individu', 'observ', 'spread', 'around', 'mean']
['number', 'tell', 'spread', 'data', 'tend', 'measur', 'far', 'number', 'tend', 'averag']
['spread', 'valu', 'around', 'mean', 'affect', 'outlier', 'smaller', 'consist']
['squar', 'root', 'varianc', 'larger', 'standard', 'deviat', 'vari', 'possibl', 'skew', 'data']
['measur', 'spread', 'data']
['averag', 'deviat', 'score', 'mean']
['spread', 'data']
['measur', 'varat'

['expert', 'system', 'model', 'human', 'knowledg', 'set', 'rule']
['use', 'es', 'collect', 'fact', 'relationship', 'among', 'built', 'seri', 'rule']
['specif', 'info', 'programm', 'give', 'system']
['fact', 'rule', 'incorpor', 'knowledg', 'base', 'store', 'comput', 'file', 'manipul', 'softwar', 'call', 'infer', 'engin']
['store', 'inform', 'rule', 'use', 'solv', 'problem', 'knowledg', 'base', 'store', 'comput', 'file', 'utilis', 'infer', 'engin']
['area', 'expert', 'system', 'fact', 'knowledg', 'domain', 'store']
['use', 'expert', 'system', 'collect', 'fact', 'relationship', 'among', 'use', 'infer', 'engin', 'built', 'seri', 'rule']
['larg', 'databas', 'allow', 'user', 'find', 'inform', 'enter', 'keyword', 'question', 'normal', 'english', 'phrase', 'system', 'use', 'util', 'infer', 'rule', 'order', 'return', 'result']
['collect', 'rule', 'inform', 'structur', 'deriv', 'human', 'expert', 'rule', 'typic', 'structur', 'statement']
['knowledg', 'topic']
['person', 'know']
['store', 'expert

['type', 'inform', 'store', 'column']
['data', 'type', 'set', 'valu', 'set', 'oper', 'valu', 'far', 'discuss', 'detail', 'java', 'primit', 'data', 'type', 'exampl', 'valu', 'primit', 'data', 'type', 'int', 'integ', 'oper', 'int', 'includ', 'lt', 'gt', 'principl', 'could', 'write', 'program', 'use', 'built', 'primit', 'type', 'much', 'conveni', 'write', 'program', 'higher', 'level', 'abstract']
['attribut', 'size', 'rang', 'piec', 'data', 'store', 'rdbm']
['way', 'field', 'recogn', 'system', 'integ', 'string', 'real', 'boolean']
['specif', 'categori', 'inform', 'variabl', 'contain', 'numer', 'boolean', 'string']
['mutual', 'recurs']
['defin', 'valu', 'cell']
['indic', 'type', 'data', 'text', 'date', 'boolean', 'number', 'store', 'field']
['defin', 'kind', 'valu', 'use', 'store', 'also', 'use', 'program', 'languag', 'databas', 'system', 'determin', 'oper', 'appli', 'data']
['indic', 'type', 'data', 'store', 'field']
['term', 'describ', 'valu', 'held', 'item', 'item', 'store', 'comput', '

['determin', 'generaliz', 'model']
['inform', 'involv', 'observ', 'individu', 'well', 'control', 'environ']
['data', 'use', 'unit', 'test', 'test', 'data', 'contain', 'correct', 'data', 'erron', 'data', 'test', 'possibl', 'situat', 'could', 'occur']
['sampl', 'data', 'use', 'provid', 'unbias', 'evalu', 'final', 'model', 'fit', 'train', 'dataset']
['set', 'valu', 'use', 'programm', 'test', 'program', 'work', 'correctli']
['test', 'classifi', 'data']
['data', 'use', 'assess', 'perform', 'algorithm']
['repres', 'set', 'data', 'point', 'w', 'valu', 'part', 'domain', 'btw', 'endpoint', 'valu', 'rang', 'valu']
['portion', 'data', 'use', 'end', 'model', 'build', 'select', 'process', 'assess', 'well', 'final', 'model', 'might', 'perform', 'addit', 'data']
['data', 'use', 'refer', 'creat', 'predict', 'model']
['input', 'test', 'program']
['languag', 'model', 'test', 'data', 'set', 'compar', 'expect', 'result', 'actual', 'result']
['data', 'exist', 'test', 'execut', 'affect', 'affect', 'compon',

['consist', 'anyth', 'document']
['consist', 'anyth', 'document', 'achiev', 'codifi', 'often', 'help']
['object', 'ration', 'technic', 'type', 'knowledg', 'fact', 'read', 'like', 'book', 'ex', 'polici', 'procedur', 'guid', 'report', 'product', 'strategi', 'goal', 'core', 'compet']
['knowledg', 'easili', 'commun', 'other', 'readili', 'captur', 'store', 'type', 'document', 'databas']
['object', 'theoret', 'codifi', 'knowledg', 'transmiss', 'formal', 'systemat', 'method', 'use', 'grammar', 'syntax', 'print', 'word', 'tradit', 'focus']
['knowledg', 'document', 'distribut', 'other', 'polici', 'procedur', 'goal']
['document', 'written', 'record']
['consist', 'anyth', 'document', 'archiv', 'codifi', 'often', 'help']
['includ', 'inform', 'store', 'document', 'form', 'media']
['knowledg', 'articul', 'written', 'know']
['fact', 'measur', 'document', 'report', 'rule']
['primari', 'focu', 'formal', 'learn', 'well', 'document', 'easili', 'transfer', 'person', 'person']
['knowledg', 'easili', 'commu

['unit', 'gather', 'inform', 'transform', 'inform', 'seri', 'electron', 'signal', 'comput', 'recogn']
['techniqu', 'includ', 'speech', 'recognit', 'system', 'translat', 'user', 'spoken', 'word', 'comput', 'instruct', 'gestur', 'recognit', 'system', 'interpret', 'user', 'bodi', 'movement', 'visual', 'detect', 'sensor', 'embed', 'peripher', 'devic', 'wand', 'stylu', 'pointer', 'glove', 'bodi', 'wear']
['touchscreen', 'mic', 'transfer', 'inform', 'comput', 'memori']
['keyboard', 'mous', 'special', 'screen', 'touch', 'multitouch', 'etc', 'joystick', 'stylu', 'scan', 'devic', 'optic', 'card', 'reader', 'bar', 'code', 'reader', 'etc', 'audio', 'input', 'imag', 'captur', 'digit', 'camera', 'video', 'etc']
['allow', 'user', 'feed', 'data', 'comput']
['devic', 'accept', 'inform', 'transfer', 'inform', 'memori']
['put', 'inform', 'comput']
['keyboard', 'mous', 'scanner', 'digit', 'camera', 'webcam']
['enter', 'data', 'instruct', 'comput', 'process']
['translat', 'analog', 'inform', 'enter', 'bin

['show', 'us', 'histor', 'percentag', 'purchas', 'group', 'custom', 'defin', 'signific', 'attribut']
['hypothesi', 'form', 'tree', 'use', 'classif', 'instanc', 'repres', 'featur', 'vector']
['diagram', 'lay', 'differ', 'branch', 'result', 'differ', 'decis', 'made', 'result', 'differ', 'econom', 'situat']
['graph', 'decis', 'possibl', 'consequ', 'use', 'creat', 'plan', 'reach', 'goal']
['class', 'base', 'logic', 'argument']
['allow', 'classifi', 'data', 'accord', 'pre', 'defin', 'outcom', 'depend', 'characterist', 'data', 'give', 'insight', 'whether', 'person', 'obtain', 'loan', 'pay', 'invest', 'credit', 'card', 'fraud']
['repres', 'function', 'take', 'input', 'vector', 'attribut', 'valu', 'return', 'singl', 'output', 'valu', 'decis']
['classifi', 'data', 'accord', 'pre', 'defin', 'outcom', 'depend', 'characterist', 'data']
['graph', 'decis', 'possibl', 'consequ', 'use', 'creat', 'plan', 'reach', 'goal', 'squar', 'node', 'decis', 'chanc', 'node', 'circl', 'state', 'natur', 'outcom', 'b

['access', 'real', 'time', 'inform', 'custom', 'social', 'media', 'view', 'demo', 'servic', 'journey', 'build', 'email', 'market', 'data', 'manag', 'mobil', 'market', 'social', 'media', 'market', 'market', 'autom', 'omni', 'channel', 'interact', 'campaign', 'manag']
['def', 'set', 'relationship', 'often', 'defin', 'set', 'node', 'individu', 'tie', 'edg']
['social', 'structur', 'individu', 'organ']
['onlin', 'commun', 'peopl', 'common', 'interest', 'use', 'websit', 'technolog', 'commun', 'share', 'inform', 'resourc', 'etc']
['collect', 'connect', 'tie', 'within', 'particular', 'group', 'collect', 'group']
['social', 'structur', 'made', 'node', 'gener', 'individu', 'organ', 'social', 'network', 'repres', 'relationship', 'flow', 'peopl', 'group', 'organ', 'anim', 'comput', 'inform', 'knowledg', 'process', 'entiti']
['observ', 'pattern', 'social', 'relationship', 'among', 'individu', 'group']
['web', 'base', 'meet', 'place', 'friend', 'famili', 'co', 'worker', 'peer', 'let', 'user', 'creat

['statist', 'method', 'use', 'one', 'predictor', 'variabl', 'discrimin', 'categori', 'depend', 'variabl', 'categor']
['method', 'use', 'find', 'linear', 'combin', 'variabl', 'characteris', 'depend', 'variabl', 'one', 'class', 'depend', 'criterion', 'variabl', 'categor']
['abil', 'predictor', 'variabl', 'discrimin', 'categori', 'gener', 'perceptu', 'map', 'depict', 'brand']
['segment', 'descript', 'segment', 'predict']
['predict', 'discrimin', 'across', 'two', 'level', 'iv', 'group', 'categori', 'membership', 'base', 'measur', 'two', 'dv']
['object', 'discrimin', 'analysi', 'develop', 'discrimin', 'function', 'noth', 'linear', 'combin', 'independ', 'variabl', 'discrimin', 'categori', 'depend', 'variabl', 'perfect', 'manner']
['techniqu', 'analyz', 'market', 'research', 'data', 'criterion', 'depend', 'variabl', 'categor', 'predictor', 'independ', 'variabl', 'interv', 'natur']
['statist', 'analysi', 'take', 'advantag', 'known', 'group', 'cluster', 'data', 'deriv', 'classif', 'rule', 'invo

Unnamed: 0,entity,id,definition,tokens,max_pooling_embedding,min_pooling_embedding,avg_pooling_embedding,sum_pooling_embedding,tf_idf_weighted_pooling_embedding,max_pooling_score,min_pooling_score,sum_pooling_score,tf_idf_weighted_pooling_score,avg_pooling_score
0,bottom-up approach,0,estimates of duration and cost are made for ea...,"[estim, durat, cost, made, compon, separ, comb...","[1.6786679029464722, 1.5756455659866333, 6.669...","[-1.1568663120269775, -4.001165866851807, -1.9...","[0.08707362283021211, -1.1707010015845298, 1.7...","[0.8707362283021212, -11.707010015845299, 17.4...","[3.9182003177702427, -24.176996111869812, 34.6...",0.679641,0.670485,0.385186,-0.362722,0.385186
1,bottom-up approach,1,pattern from process deductive approach use ...,"[pattern, process, deduct, approach, use, unde...","[5.954471588134766, 2.4786300659179688, 2.4872...","[-1.6020854711532593, -8.456828117370605, -1.7...","[0.9007936220616102, -0.8057965405285359, 0.17...","[18.015872441232204, -16.115930810570717, 3.55...","[34.25498013943434, -15.956364393234253, -1.82...",0.773321,0.667227,0.441355,-0.418230,0.441355
2,bottom-up approach,2,normalization; 3nf\n,"[normal, 3nf]","[3.3570172786712646, 0.0, 1.9689255952835083, ...","[0.0, -0.34316229820251465, 0.0, 0.0, 0.0, 0.0...","[1.950080782175064, -0.2777104079723358, 1.916...","[3.900161564350128, -0.5554208159446716, 3.832...","[9.084224224090576, -1.457794964313507, 10.542...",0.430628,0.435267,0.039069,-0.061027,0.039069
3,bottom-up approach,3,client factor standpoint\n,"[client, factor, standpoint]","[3.3282482624053955, 0.5294300317764282, 0.971...","[-1.0139529705047607, -1.449976921081543, -2.8...","[0.7105877747138342, -0.7486754258473715, -0.5...","[2.1317633241415024, -2.2460262775421143, -1.7...","[5.053221046924591, -3.8104820251464844, -2.73...",0.320402,0.554570,-0.338807,0.353212,-0.338807
4,bottom-up approach,4,"approach for dynamic programming problems, whi...","[approach, dynam, program, problem, problem, s...","[1.5540236234664917, 0.0, 3.4091858863830566, ...","[-4.258579254150391, -5.072784900665283, -2.92...","[-1.2708764157511971, -3.20437846400521, 0.014...","[-13.979640573263168, -35.24816310405731, 0.16...","[-29.39676159620285, -74.54962289333344, -2.52...",0.620124,0.709553,0.296569,-0.319942,0.296569
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30912,sensitivity analysis,30912,understand changes in the analysis outcome if ...,"[understand, chang, analysi, outcom, case, ass...","[4.280294418334961, 0.022109635174274445, 2.57...","[-1.1831191778182983, -5.760196685791016, -1.0...","[1.661587978047984, -2.772448811147894, 1.2598...","[11.631115846335888, -19.40714167803526, 8.819...","[23.913735181093216, -40.287494637072086, 18.4...",0.607334,0.817563,0.228332,-0.246967,0.228332
30913,memory controller,30913,the chip that manages access to the ram and a ...,"[chip, manag, access, ram, other, comm, direct...","[4.733766078948975, 3.496218681335449, 3.64465...","[-4.841139316558838, -5.148653984069824, -5.63...","[-0.1898236069828272, -0.025680669292341918, -...","[-1.5185888558626175, -0.20544535433873534, -1...","[-3.093780666589737, -0.8318992350250483, -32....",0.652578,0.678437,0.211754,-0.078216,0.211754
30914,memory controller,30914,chip that manages access to ram.\n,"[chip, manag, access, ram]","[0.0, 2.3875834941864014, 3.6446595191955566, ...","[-4.841139316558838, -1.8792915344238281, -5.6...","[-2.1461488604545593, 0.564283549785614, -0.87...","[-8.584595441818237, 2.257134199142456, -3.488...","[-18.939855098724365, 3.819213330745697, -10.6...",0.613702,0.622675,0.330219,-0.198130,0.330219
30915,memory controller,30915,this handles dram accesses by converting reque...,"[handl, dram, access, convert, request, proces...","[9.151537895202637, 7.432768821716309, 8.33754...","[-6.727926731109619, -5.148653984069824, -4.23...","[0.45393201150000095, -0.22090421944785005, -0...","[23.60446459800005, -11.487019411288202, -9.74...","[27.690755650401115, -34.910686533898115, -27....",0.661478,0.733340,0.070410,0.030531,0.070410


In [229]:
# define queries and create dataframe
queries = np.array(["relational database", "garbage collection", "retrieval model"])
query_df = pd.DataFrame(queries, columns = ["query"])

In [230]:
# parse queries
query_df = query_df.assign(tokens=parse(query_df['query'], 
                                                remove_stopwords=remove_stopwords, 
                                                use_stemming=use_stemming, 
                                                remove_otherNoise=remove_otherNoise))
query_df

None of pre-processing options = 6
After removing stop words = 6
After removing stop words + stemming = 6
After removing stop words + stemming + removing other noise = 6 



Unnamed: 0,query,tokens
0,relational database,"[relat, databas]"
1,garbage collection,"[garbag, collect]"
2,retrieval model,"[retriev, model]"


In [231]:
# add columns of different pooling approaches
query_df = query_df.assign(max_pooling_embedding=max_pooling_df(model, query_df))
query_df = query_df.assign(min_pooling_embedding=min_pooling_df(model, query_df))
query_df = query_df.assign(avg_pooling_embedding=avg_pooling_df(model, query_df))
query_df = query_df.assign(sum_pooling_embedding=sum_pooling_df(model, query_df))
query_df = query_df.assign(tf_idf_weighted_pooling_embedding=tf_idf_weighted_pooling_df(model, query_df, df_dict))
query_df

['relat', 'databas']
['garbag', 'collect']
['retriev', 'model']


Unnamed: 0,query,tokens,max_pooling_embedding,min_pooling_embedding,avg_pooling_embedding,sum_pooling_embedding,tf_idf_weighted_pooling_embedding
0,relational database,"[relat, databas]","[3.9872241020202637, 0.9915001392364502, 1.298...","[0.0, -2.307264566421509, -0.20304089784622192...","[2.529714584350586, -0.6578822135925293, 0.547...","[5.059429168701172, -1.3157644271850586, 1.094...","[-13.113665103912354, 3.8408167362213135, -3.0..."
1,garbage collection,"[garbag, collect]","[2.5483362674713135, 2.524824857711792, 0.6170...","[0.0, -0.19416363537311554, -0.380535244941711...","[1.5329232811927795, 1.1653306111693382, 0.118...","[3.065846562385559, 2.3306612223386765, 0.2365...","[-6.994352877140045, -6.560199096798897, -1.42..."
2,retrieval model,"[retriev, model]","[0.0, 3.0120248794555664, 2.4329323768615723, ...","[-1.9308950901031494, -3.3003108501434326, 0.0...","[-1.7664902806282043, -0.1441429853439331, 1.8...","[-3.5329805612564087, -0.2882859706878662, 3.6...","[8.184017658233643, 2.5812721252441406, -8.060..."


In [237]:
def cosine(vector_1, vector_2):
    return (normalize(vector_1[:,np.newaxis], axis=0).ravel() * normalize(vector_2[:,np.newaxis], axis=0).ravel()).sum()
    
def cal_cosine_scores(query_df, query_idx, documents_df, col_name):
    query_embedding = query_df[col_name][query_idx]
    return documents_df[col_name].apply(lambda v: cosine(query_embedding, v))

In [238]:
def precision_10(query_df, idx, results_df):
    query = query_df["query"][idx]
    count = 0
    for row in range(10):
        if results_df["entity"].iloc[row] == query:
            count += 1
    return count / 10

## Max Pooling

In [239]:
for idx in range(3):
    # calculate and rank by max_pooling_score
    print("Query: %s" % query_df["query"][idx])
    documents_df = documents_df.assign(max_pooling_score=cal_cosine_scores(query_df, idx, documents_df, 'max_pooling_embedding'))
    max_pooling_sorted_results = documents_df.sort_values("max_pooling_score", ascending=False)
    
    print("Top 10 documents after calculating max pooling score:")
    display(HTML(max_pooling_sorted_results.head(10)[['entity', 'id', 'definition', 'tokens', 'max_pooling_score']].to_html()))
        
    max_pooling_precision_10 = precision_10(query_df, idx, max_pooling_sorted_results)
    print("Max pooling precision: %f \r\n\r\n" % max_pooling_precision_10)


Query: relational database
Top 10 documents after calculating max pooling score:


Unnamed: 0,entity,id,definition,tokens,max_pooling_score
28290,relational database,28290,database that contains two or more tables of related information\n,"[databas, contain, two, tabl, relat, inform]",0.981985
28177,relational database,28177,relational database schema with data\n,"[relat, databas, schema, data]",0.980311
2129,data structure,2129,how to organize data (data are organized using relations).\n,"[organ, data, data, organ, use, relat]",0.976495
4648,data warehouse,4648,-set of related databases -hierarchy of data\n,"[set, relat, databas, hierarchi, data]",0.970704
773,relational model,773,a table of data where the relation between the entities can be seen.\n,"[tabl, data, relat, entiti, seen]",0.966453
10773,adjacency matrix,10773,a table containing 0 or 1 in each posistion\n,"[tabl, contain, posist]",0.962811
19673,database schema,19673,set of schemas for the relations of a database\n,"[set, schema, relat, databas]",0.962662
19701,database schema,19701,set of all relation schemas in the database\n,"[set, relat, schema, databas]",0.962662
5148,relational databases,5148,organize data into 2-d tables\n,"[organ, data, tabl]",0.961089
2138,data structure,2138,data organized into tables\n,"[data, organ, tabl]",0.961089


Max pooling precision: 0.200000 


Query: garbage collection
Top 10 documents after calculating max pooling score:


Unnamed: 0,entity,id,definition,tokens,max_pooling_score
15732,data set,15732,information collected\n,"[inform, collect]",0.973514
17402,data gathering,17402,collections of data is imperative.\n,"[collect, data, imper]",0.970608
15725,data set,15725,is a collection of data.\n,"[collect, data]",0.966789
8758,data collection,8758,the process by which data are collected\n,"[process, data, collect]",0.966432
8802,data collection,8802,the processes by which data are collected\n,"[process, data, collect]",0.966432
17405,data gathering,17405,process of collecting data.\n,"[process, collect, data]",0.966432
15782,data set,15782,collection of information or data.\n,"[collect, inform, data]",0.963831
12409,application domain,12409,purpose of the data collection\n,"[purpos, data, collect]",0.959557
2606,raw data,2606,"data that have been collected, but has not been processed for use\n","[data, collect, process, use]",0.956864
17410,data gathering,17410,the process of collecting facts and figures\n,"[process, collect, fact, figur]",0.956461


Max pooling precision: 0.000000 


Query: retrieval model
Top 10 documents after calculating max pooling score:


Unnamed: 0,entity,id,definition,tokens,max_pooling_score
9085,query language,9085,used to update and retrieve data that is stored in a data model\n,"[use, updat, retriev, data, store, data, model]",0.939912
9707,data mining,9707,a technique used to analyze and extract information from massive databases\n,"[techniqu, use, analyz, extract, inform, massiv, databas]",0.915039
23571,knowledge workers,23571,-use technology to acquire and apply information\n,"[use, technolog, acquir, appli, inform]",0.914696
19727,online analytical processing,19727,"tools for retrieving, processing, and modelling data from the data warehouse\n","[tool, retriev, process, model, data, data, warehous]",0.907326
19708,online analytical processing,19708,"olap - tools for retrieving, processing, and modeling data from the data warehouse\n","[olap, tool, retriev, process, model, data, data, warehous]",0.90563
6725,hot spots,6725,the hawaiian islands are made from what structures.\n,"[hawaiian, island, made, structur]",0.902652
10638,information system,10638,"a method of gathering, storing, and analyzing data for the purpose of making business decisions\n","[method, gather, store, analyz, data, purpos, make, busi, decis]",0.902096
27657,database system,27657,"collectively, the database model, the dbms, and the database itself\n","[collect, databas, model, dbm, databas]",0.901114
4565,data warehouse,4565,logical collection of information gathered from many different databases that supports business analysis\n,"[logic, collect, inform, gather, mani, differ, databas, support, busi, analysi]",0.897147
13256,data transformation,13256,transforms the data for storing it in the proper format or structure for the purposes of querying and analysis.\n,"[transform, data, store, proper, format, structur, purpos, queri, analysi]",0.895686


Max pooling precision: 0.000000 




## Min Pooling

In [240]:
for idx in range(3):
    # calculate and rank by min_pooling_score
    print("Query: %s" % query_df["query"][idx])
    documents_df = documents_df.assign(min_pooling_score=cal_cosine_scores(query_df, idx, documents_df, 'min_pooling_embedding'))
    min_pooling_sorted_results = documents_df.sort_values("min_pooling_score", ascending=False)
    
    print("Top 10 documents after calculating min pooling score:")
    display(HTML(min_pooling_sorted_results.head(10)[['entity', 'id', 'definition', 'tokens', 'min_pooling_score']].to_html()))
        
    min_pooling_precision_10 = precision_10(query_df, idx, min_pooling_sorted_results)
    print("Min pooling precision: %f \r\n\r\n" % min_pooling_precision_10)

Query: relational database
Top 10 documents after calculating min pooling score:


Unnamed: 0,entity,id,definition,tokens,min_pooling_score
4613,data warehouse,4613,a collection of databases\n,"[collect, databas]",0.999127
4501,data warehouse,4501,a collection of organized databases\n,"[collect, organ, databas]",0.991413
7097,data model,7097,the structure of a database.\n,"[structur, databas]",0.989016
19641,database schema,19641,structure of a database\n,"[structur, databas]",0.989016
11840,data warehousing,11840,collection of multiple databases\n,"[collect, multipl, databas]",0.986152
27689,database system,27689,database + dbms\n,"[databas, dbm]",0.978514
486,database design,486,the actual structure of the database\n,"[actual, structur, databas]",0.967048
19676,database schema,19676,the description of a database.\n,"[descript, databas]",0.961531
27663,database system,27663,the combination of database and a dbms.\n,"[combin, databas, dbm]",0.960101
19702,database schema,19702,the overall description of the database\n,"[overal, descript, databas]",0.959386


Min pooling precision: 0.000000 


Query: garbage collection
Top 10 documents after calculating min pooling score:


Unnamed: 0,entity,id,definition,tokens,min_pooling_score
12680,data integrity,12680,when an organization has inconsistent duplicated data\n,"[organ, inconsist, duplic, data]",0.948432
11816,data warehousing,11816,combines data from multiple databases and data sources\n,"[combin, data, multipl, databas, data, sourc]",0.931623
15116,data acquisition,15116,"obtaining, cleaning, organizing, relating, and cataloging source data\n","[obtain, clean, organ, relat, catalog, sourc, data]",0.930939
4064,data extraction,4064,get data from multiple heterogeneous external sources\n,"[get, data, multipl, heterogen, extern, sourc]",0.930898
5093,relational databases,5093,organize data into 2-d-rdms\n,"[organ, data, rdm]",0.930323
2597,raw data,2597,data tha is not organized\n,"[data, tha, organ]",0.929264
15729,data set,15729,an organized collection of data.\n,"[organ, collect, data]",0.928937
2549,raw data,2549,data that is not organized\n,"[data, organ]",0.928937
12559,data integrity,12559,ensuring that all data in a database is complete\n,"[ensur, data, databas, complet]",0.924772
7834,data integration,7834,integrate multiple sources of data\n,"[integr, multipl, sourc, data]",0.923666


Min pooling precision: 0.000000 


Query: retrieval model
Top 10 documents after calculating min pooling score:


Unnamed: 0,entity,id,definition,tokens,min_pooling_score
7132,relational algebra,7132,a collection of operations that are used to retrieve data from relations\n,"[collect, oper, use, retriev, data, relat]",0.959149
7117,relational algebra,7117,set of relational operations for retrieving data\n,"[set, relat, oper, retriev, data]",0.956718
14759,quantitative analysis,14759,analysis using objective data\n,"[analysi, use, object, data]",0.953796
9085,query language,9085,used to update and retrieve data that is stored in a data model\n,"[use, updat, retriev, data, store, data, model]",0.942148
9064,query language,9064,the part of the dml that involves data retrieval\n,"[part, dml, involv, data, retriev]",0.939803
27657,database system,27657,"collectively, the database model, the dbms, and the database itself\n","[collect, databas, model, dbm, databas]",0.938249
27708,input data,27708,data that you enter into a database\n,"[data, enter, databas]",0.933132
7242,deadlock detection,7242,the dbms periodically tests the database for deadlocks\n,"[dbm, period, test, databas, deadlock]",0.933004
1619,design goals,1619,* all pertinent data must be in database * no redundant data * size of database must be reasonable * relations should be such that queries are reasonable to state * eliminate update and deletion anomalies\n,"[pertin, data, must, databas, redund, data, size, databas, must, reason, relat, queri, reason, state, elimin, updat, delet, anomali]",0.932356
28227,relational database,28227,a database using the relational data model.\n,"[databas, use, relat, data, model]",0.932212


Min pooling precision: 0.000000 




## Average Pooling

In [241]:
for idx in range(3):
    # calculate and rank by avg_pooling_score
    print("Query: %s" % query_df["query"][idx])
    documents_df = documents_df.assign(avg_pooling_score=cal_cosine_scores(query_df, idx, documents_df, 'avg_pooling_embedding'))
    avg_pooling_sorted_results = documents_df.sort_values("avg_pooling_score", ascending=False)
    
    print("Top 10 documents after calculating average pooling score:")
    display(HTML(avg_pooling_sorted_results.head(10)[['entity', 'id', 'definition', 'tokens', 'avg_pooling_score']].to_html()))
        
    avg_pooling_precision_10 = precision_10(query_df, idx, avg_pooling_sorted_results)
    print("Min pooling precision: %f \r\n\r\n" % avg_pooling_precision_10)

Query: relational database
Top 10 documents after calculating average pooling score:


Unnamed: 0,entity,id,definition,tokens,avg_pooling_score
28134,relational database,28134,a database including tables that are related to each other\n,"[databas, includ, tabl, relat]",0.977955
28210,relational database,28210,a collection of related database tables\n,"[collect, relat, databas, tabl]",0.970409
771,relational model,771,a database is a collection of relations or tables.\n,"[databas, collect, relat, tabl]",0.970409
28177,relational database,28177,relational database schema with data\n,"[relat, databas, schema, data]",0.966726
28242,relational database,28242,"a database in which data is separated into tables of related records. in access, a database includes a collection of objects-tables, queries, reports, forms, and other objects\n","[databas, data, separ, tabl, relat, record, access, databas, includ, collect, object, tabl, queri, report, form, object]",0.965342
4648,data warehouse,4648,-set of related databases -hierarchy of data\n,"[set, relat, databas, hierarchi, data]",0.964128
19673,database schema,19673,set of schemas for the relations of a database\n,"[set, schema, relat, databas]",0.961921
19701,database schema,19701,set of all relation schemas in the database\n,"[set, relat, schema, databas]",0.961921
28164,relational database,28164,table-based organization for a database in which queries can be specified using relational database operators\n,"[tabl, base, organ, databas, queri, specifi, use, relat, databas, oper]",0.960825
30329,database server,30329,database management system (dbms) tables relationships metadata\n,"[databas, manag, system, dbm, tabl, relationship, metadata]",0.959668


Min pooling precision: 0.500000 


Query: garbage collection
Top 10 documents after calculating average pooling score:


Unnamed: 0,entity,id,definition,tokens,avg_pooling_score
315,data privacy,315,"- refers to how data is collected, shared and used - data collection against consent is a violation of data privacy - makes sure data is collected in the correct manner\n","[refer, data, collect, share, use, data, collect, consent, violat, data, privaci, make, sure, data, collect, correct, manner]",0.904959
2567,raw data,2567,"is primary data, collected from the source, has not been processed for use\n","[primari, data, collect, sourc, process, use]",0.90149
326,data privacy,326,"safeguarding personal data, collected by any organization, from being inappropriately (without consent) accessed and used by the organization that collected the data or by a third party\n","[safeguard, person, data, collect, organ, inappropri, without, consent, access, use, organ, collect, data, third, parti]",0.897495
17402,data gathering,17402,collections of data is imperative.\n,"[collect, data, imper]",0.895451
335,data privacy,335,"safeguarding personal data, collected by any organization, from being inappropriately accessed and used by the organization that collected the data or by a third party\n","[safeguard, person, data, collect, organ, inappropri, access, use, organ, collect, data, third, parti]",0.89527
15725,data set,15725,is a collection of data.\n,"[collect, data]",0.89251
2550,raw data,2550,aka primary data; collected from the source and has not been processed for use; usually never published in scientific papers; normally subjected to processing your data by calculating the rate of reaction\n,"[aka, primari, data, collect, sourc, process, use, usual, never, publish, scientif, paper, normal, subject, process, data, calcul, rate, reaction]",0.891304
4335,data warehouse,4335,a central aggregation of data (that can be distributed physically) that starts from an analysis of what data already exists & how it can be collected and later used.\n,"[central, aggreg, data, distribut, physic, start, analysi, data, alreadi, exist, collect, later, use]",0.878652
2536,raw data,2536,the data collected for statisicsl studies are called\n,"[data, collect, statisicsl, studi, call]",0.878312
13250,data storage,13250,"once collected, data is further categorized and centralized.\n","[collect, data, categor, central]",0.876179


Min pooling precision: 0.000000 


Query: retrieval model
Top 10 documents after calculating average pooling score:


Unnamed: 0,entity,id,definition,tokens,avg_pooling_score
4490,data warehouse,4490,is a specialized database that stores data in a format optimized for decision suppor\n,"[special, databas, store, data, format, optim, decis, suppor]",0.901824
4338,data warehouse,4338,specialized db that stores data in a format optimized for decision support\n,"[special, db, store, data, format, optim, decis, support]",0.895641
15641,data cube,15641,the multidimensional data structure used to store and manipulate data in a multidimensional dbms.\n,"[multidimension, data, structur, use, store, manipul, data, multidimension, dbm]",0.891614
13256,data transformation,13256,transforms the data for storing it in the proper format or structure for the purposes of querying and analysis.\n,"[transform, data, store, proper, format, structur, purpos, queri, analysi]",0.890154
4370,data warehouse,4370,a specialized database that stores data in a format optimized for decision support\n,"[special, databas, store, data, format, optim, decis, support]",0.889069
9061,query language,9061,you use a specialized language called structured query language to retrieve and manipulate information in a database.\n,"[use, special, languag, call, structur, queri, languag, retriev, manipul, inform, databas]",0.883169
10109,data mining,10109,with structured data\n,"[structur, data]",0.87914
19727,online analytical processing,19727,"tools for retrieving, processing, and modelling data from the data warehouse\n","[tool, retriev, process, model, data, data, warehous]",0.878449
15633,data cube,15633,a special database used to store data in olap reporting.\n,"[special, databas, use, store, data, olap, report]",0.877279
29082,data cubes,29082,special database used to store data in olap reporting\n,"[special, databas, use, store, data, olap, report]",0.877279


Min pooling precision: 0.000000 




## Sum Pooling

In [242]:
for idx in range(3):
    # calculate and rank by sum_pooling_score
    print("Query: %s" % query_df["query"][idx])
    documents_df = documents_df.assign(sum_pooling_score=cal_cosine_scores(query_df, idx, documents_df, 'sum_pooling_embedding'))
    sum_pooling_sorted_results = documents_df.sort_values("sum_pooling_score", ascending=False)
    
    print("Top 10 documents after calculating sum pooling score:")
    display(HTML(sum_pooling_sorted_results.head(10)[['entity', 'id', 'definition', 'tokens', 'sum_pooling_score']].to_html()))
        
    sum_pooling_precision_10 = precision_10(query_df, idx, sum_pooling_sorted_results)
    print("Min pooling precision: %f \r\n\r\n" % sum_pooling_precision_10)

Query: relational database
Top 10 documents after calculating sum pooling score:


Unnamed: 0,entity,id,definition,tokens,sum_pooling_score
28134,relational database,28134,a database including tables that are related to each other\n,"[databas, includ, tabl, relat]",0.977955
28210,relational database,28210,a collection of related database tables\n,"[collect, relat, databas, tabl]",0.970409
771,relational model,771,a database is a collection of relations or tables.\n,"[databas, collect, relat, tabl]",0.970409
28177,relational database,28177,relational database schema with data\n,"[relat, databas, schema, data]",0.966726
28242,relational database,28242,"a database in which data is separated into tables of related records. in access, a database includes a collection of objects-tables, queries, reports, forms, and other objects\n","[databas, data, separ, tabl, relat, record, access, databas, includ, collect, object, tabl, queri, report, form, object]",0.965342
4648,data warehouse,4648,-set of related databases -hierarchy of data\n,"[set, relat, databas, hierarchi, data]",0.964128
19673,database schema,19673,set of schemas for the relations of a database\n,"[set, schema, relat, databas]",0.961921
19701,database schema,19701,set of all relation schemas in the database\n,"[set, relat, schema, databas]",0.961921
28164,relational database,28164,table-based organization for a database in which queries can be specified using relational database operators\n,"[tabl, base, organ, databas, queri, specifi, use, relat, databas, oper]",0.960825
30329,database server,30329,database management system (dbms) tables relationships metadata\n,"[databas, manag, system, dbm, tabl, relationship, metadata]",0.959668


Min pooling precision: 0.500000 


Query: garbage collection
Top 10 documents after calculating sum pooling score:


Unnamed: 0,entity,id,definition,tokens,sum_pooling_score
315,data privacy,315,"- refers to how data is collected, shared and used - data collection against consent is a violation of data privacy - makes sure data is collected in the correct manner\n","[refer, data, collect, share, use, data, collect, consent, violat, data, privaci, make, sure, data, collect, correct, manner]",0.904959
2567,raw data,2567,"is primary data, collected from the source, has not been processed for use\n","[primari, data, collect, sourc, process, use]",0.90149
326,data privacy,326,"safeguarding personal data, collected by any organization, from being inappropriately (without consent) accessed and used by the organization that collected the data or by a third party\n","[safeguard, person, data, collect, organ, inappropri, without, consent, access, use, organ, collect, data, third, parti]",0.897495
17402,data gathering,17402,collections of data is imperative.\n,"[collect, data, imper]",0.895451
335,data privacy,335,"safeguarding personal data, collected by any organization, from being inappropriately accessed and used by the organization that collected the data or by a third party\n","[safeguard, person, data, collect, organ, inappropri, access, use, organ, collect, data, third, parti]",0.89527
15725,data set,15725,is a collection of data.\n,"[collect, data]",0.89251
2550,raw data,2550,aka primary data; collected from the source and has not been processed for use; usually never published in scientific papers; normally subjected to processing your data by calculating the rate of reaction\n,"[aka, primari, data, collect, sourc, process, use, usual, never, publish, scientif, paper, normal, subject, process, data, calcul, rate, reaction]",0.891304
4335,data warehouse,4335,a central aggregation of data (that can be distributed physically) that starts from an analysis of what data already exists & how it can be collected and later used.\n,"[central, aggreg, data, distribut, physic, start, analysi, data, alreadi, exist, collect, later, use]",0.878652
2536,raw data,2536,the data collected for statisicsl studies are called\n,"[data, collect, statisicsl, studi, call]",0.878312
13250,data storage,13250,"once collected, data is further categorized and centralized.\n","[collect, data, categor, central]",0.876179


Min pooling precision: 0.000000 


Query: retrieval model
Top 10 documents after calculating sum pooling score:


Unnamed: 0,entity,id,definition,tokens,sum_pooling_score
4490,data warehouse,4490,is a specialized database that stores data in a format optimized for decision suppor\n,"[special, databas, store, data, format, optim, decis, suppor]",0.901824
4338,data warehouse,4338,specialized db that stores data in a format optimized for decision support\n,"[special, db, store, data, format, optim, decis, support]",0.895641
15641,data cube,15641,the multidimensional data structure used to store and manipulate data in a multidimensional dbms.\n,"[multidimension, data, structur, use, store, manipul, data, multidimension, dbm]",0.891614
13256,data transformation,13256,transforms the data for storing it in the proper format or structure for the purposes of querying and analysis.\n,"[transform, data, store, proper, format, structur, purpos, queri, analysi]",0.890154
4370,data warehouse,4370,a specialized database that stores data in a format optimized for decision support\n,"[special, databas, store, data, format, optim, decis, support]",0.889069
9061,query language,9061,you use a specialized language called structured query language to retrieve and manipulate information in a database.\n,"[use, special, languag, call, structur, queri, languag, retriev, manipul, inform, databas]",0.883169
10109,data mining,10109,with structured data\n,"[structur, data]",0.87914
19727,online analytical processing,19727,"tools for retrieving, processing, and modelling data from the data warehouse\n","[tool, retriev, process, model, data, data, warehous]",0.878449
15633,data cube,15633,a special database used to store data in olap reporting.\n,"[special, databas, use, store, data, olap, report]",0.877279
29082,data cubes,29082,special database used to store data in olap reporting\n,"[special, databas, use, store, data, olap, report]",0.877279


Min pooling precision: 0.000000 




## TF-IDF Weighted Pooling

In [243]:
for idx in range(3):
    # calculate and rank by tf_idf_weighted_pooling_score
    print("Query: %s" % query_df["query"][idx])
    documents_df = documents_df.assign(tf_idf_weighted_pooling_score=cal_cosine_scores(query_df, idx, documents_df, 'tf_idf_weighted_pooling_embedding'))
    tf_idf_weighted_pooling_sorted_results = documents_df.sort_values("tf_idf_weighted_pooling_score", ascending=False)
    
    print("Top 10 documents after calculating TF-IDF weighted pooling score:")
    display(HTML(tf_idf_weighted_pooling_sorted_results.head(10)[['entity', 'id', 'definition', 'tokens', 'tf_idf_weighted_pooling_score']].to_html()))
        
    tf_idf_weighted_pooling_precision_10 = precision_10(query_df, idx, tf_idf_weighted_pooling_sorted_results)
    print("Min pooling precision: %f \r\n\r\n" % tf_idf_weighted_pooling_precision_10)

Query: relational database
Top 10 documents after calculating TF-IDF weighted pooling score:


Unnamed: 0,entity,id,definition,tokens,tf_idf_weighted_pooling_score
11208,current trends,11208,-imprisonment rate of black men has dropped more than 24% -imprisonment rate of black women has dropped over 50% -imprisonment rate of white women has risen by 53% -incarceration overall is declining -war on drugs shift from marijuana/crack to meth/opioid -white people declining socioeconomic prospects -criminal justice reform\n,"[imprison, rate, black, men, drop, imprison, rate, black, women, drop, imprison, rate, white, women, risen, incarcer, overal, declin, war, drug, shift, marijuana, crack, meth, opioid, white, peopl, declin, socioeconom, prospect, crimin, justic, reform]",0.770818
11206,current trends,11206,1. imprisonment rate dropped 24% black men 2. imprisonment rate of black women dropped over 50% 3. imprisonment rate of white women risen by 53% 4. incarceration overall declining 5. war on drugs shift from marijuana and crack to meth and opiod 6. white people declining socioeconomic prospects narrowing significantly 7. criminal justice reform first step act signed december of last year\n,"[imprison, rate, drop, black, men, imprison, rate, black, women, drop, imprison, rate, white, women, risen, incarcer, overal, declin, war, drug, shift, marijuana, crack, meth, opiod, white, peopl, declin, socioeconom, prospect, narrow, significantli, crimin, justic, reform, first, step, act, sign, decemb, last, year]",0.759508
228,black box,228,orange! in the tail and sealed\n,"[orang, tail, seal]",0.752008
6324,direct manipulation,6324,coined by shneiderman (1983) due to his fascination with computer games at the time\n,"[coin, shneiderman, due, fascin, comput, game, time]",0.697502
11701,false positives,11701,lack of false positives are a red flag. false positives occur close to the threshold. patients faking hearing loss lack false responses\n,"[lack, fals, posit, red, flag, fals, posit, occur, close, threshold, patient, fake, hear, loss, lack, fals, respons]",0.681711
2002,video games,2002,"more addictive than drugs; less time with school, work, family, and friends; leads to obesity; lack of sleep; more harmful than tv; kids dropping out of school\n","[addict, drug, less, time, school, work, famili, friend, lead, obes, lack, sleep, harm, tv, kid, drop, school]",0.681457
11697,false positives,11697,"- a lack of false positives is a red flag. - often, patients will indicate that a sound was heard, even if a sound was not presented (false positive) - false positives tend to occur close to threshold - patients attempting to fake a hearing loss tend to lack false responses because the sound is actually always easy for them to hear.\n","[lack, fals, posit, red, flag, often, patient, indic, sound, heard, even, sound, present, fals, posit, fals, posit, tend, occur, close, threshold, patient, attempt, fake, hear, loss, tend, lack, fals, respons, sound, actual, alway, easi, hear]",0.67448
684,false positive rate,684,"fp / (fp + tn), false alarm rate\n","[fp, fp, tn, fals, alarm, rate]",0.673031
4236,comprehensive evaluation,4236,"case history, otoscopy, speech reception threshold, air conduction pure tone threshold, word recognition testing, and bone conduction pure tone threshold\n","[case, histori, otoscopi, speech, recept, threshold, air, conduct, pure, tone, threshold, word, recognit, test, bone, conduct, pure, tone, threshold]",0.652922
11709,false positives,11709,"often, patients will indicate that a sound was heard, even if a sound was not presented tend to occur close to threshold, patients attempting to fake a hearing loss tend to lack because the sound is actually easy for them to hear\n","[often, patient, indic, sound, heard, even, sound, present, tend, occur, close, threshold, patient, attempt, fake, hear, loss, tend, lack, sound, actual, easi, hear]",0.651725


Min pooling precision: 0.000000 


Query: garbage collection
Top 10 documents after calculating TF-IDF weighted pooling score:


Unnamed: 0,entity,id,definition,tokens,tf_idf_weighted_pooling_score
228,black box,228,orange! in the tail and sealed\n,"[orang, tail, seal]",0.692132
9661,new technology,9661,1.machine gun 2 flame thrower 3.mustard gas 4.tanks 5.airplanes\n,"[machin, gun, flame, thrower, mustard, ga, tank, airplan]",0.680664
16711,total cost,16711,(q/2) (h) + (d/q) s\n,"[q, h, q]",0.640698
16737,total cost,16737,(q/2) h+(d/q) s\n,"[q, h, q]",0.640698
14457,new technologies,14457,"tanks, machine guns, poison gas and trench warfare\n","[tank, machin, gun, poison, ga, trench, warfar]",0.640505
24293,semantic analysis,24293,"- &""the dog is in the pen.&"" vs. &""the ink is in the pen.&"" - &""i put the plant in the window&"" vs. &""ford put the plant in mexico&""\n","[dog, pen, vs, ink, pen, put, plant, window, vs, ford, put, plant, mexico]",0.628617
24516,greedy algorithm,24516,"dijkstra's, prim's, kruskal's\n","[dijkstra, prim, kruskal]",0.624458
8402,metric space,8402,"a metric space is a pair (x, d) where x is a set and d:xxx-&gt;r is a map satisfying the following properties: a. for x1, x2 in x, d(x1, x2)≥0 and d(x1, x2)=0 if and only if x1=x2 (positive definite) b. for any x1, x2 in x, we have d(x1, x2)=d(x2, x1) (symmetric) c. for any x1, x2, x3 in x, we have d(x1, x2)≤d(x1, x3)+d(x3, x2) (triangle inequality)\n","[metric, space, pair, x, x, set, xxx, gt, r, map, satisfi, follow, properti, x1, x2, x, x1, x2, x1, x2, x1, x2, posit, definit, b, x1, x2, x, x1, x2, x2, x1, symmetr, c, x1, x2, x3, x, x1, x2, x1, x3, x3, x2, triangl, inequ]",0.613814
4168,triangle inequality,4168,|z1 + z2| &lt;= |z1| + |z2| |z1 + z2| &gt;= ||z1| - |z2||\n,"[z1, z2, lt, z1, z2, z1, z2, gt, z1, z2]",0.60952
29572,classification problem,29572,"målet er å forutsjå ein class label, osm er eit valg fra ei predefinert liste med muligheter. outputen blir sortert som classes. for eksempel dersom ein skal finne ut arten til ein plante, så blir arten (output) ein klasse.\n","[målet, er, å, forutsjå, ein, class, label, osm, er, eit, valg, fra, ei, predefinert, list, med, mulighet, outputen, blir, sortert, som, class, eksempel, dersom, ein, skal, finn, ut, arten, til, ein, plant, så, blir, arten, output, ein, klass]",0.607189


Min pooling precision: 0.000000 


Query: retrieval model
Top 10 documents after calculating TF-IDF weighted pooling score:


Unnamed: 0,entity,id,definition,tokens,tf_idf_weighted_pooling_score
2002,video games,2002,"more addictive than drugs; less time with school, work, family, and friends; leads to obesity; lack of sleep; more harmful than tv; kids dropping out of school\n","[addict, drug, less, time, school, work, famili, friend, lead, obes, lack, sleep, harm, tv, kid, drop, school]",0.761038
23522,peer-to-peer networks,23522,all computers have equal status\n,"[comput, equal, statu]",0.692104
15589,service providers,15589,- regional authorities - counties - special districts - cities - schools -conservancies and friends groups financed primarily by taxes with user fees as a supplement source of income\n,"[region, author, counti, special, district, citi, school, conserv, friend, group, financ, primarili, tax, user, fee, supplement, sourc, incom]",0.676906
27628,categorical data,27628,"gender, race, bmi, presence of disease (y/n)\n","[gender, race, bmi, presenc, diseas, n]",0.65454
11701,false positives,11701,lack of false positives are a red flag. false positives occur close to the threshold. patients faking hearing loss lack false responses\n,"[lack, fals, posit, red, flag, fals, posit, occur, close, threshold, patient, fake, hear, loss, lack, fals, respons]",0.652843
11206,current trends,11206,1. imprisonment rate dropped 24% black men 2. imprisonment rate of black women dropped over 50% 3. imprisonment rate of white women risen by 53% 4. incarceration overall declining 5. war on drugs shift from marijuana and crack to meth and opiod 6. white people declining socioeconomic prospects narrowing significantly 7. criminal justice reform first step act signed december of last year\n,"[imprison, rate, drop, black, men, imprison, rate, black, women, drop, imprison, rate, white, women, risen, incarcer, overal, declin, war, drug, shift, marijuana, crack, meth, opiod, white, peopl, declin, socioeconom, prospect, narrow, significantli, crimin, justic, reform, first, step, act, sign, decemb, last, year]",0.637839
11208,current trends,11208,-imprisonment rate of black men has dropped more than 24% -imprisonment rate of black women has dropped over 50% -imprisonment rate of white women has risen by 53% -incarceration overall is declining -war on drugs shift from marijuana/crack to meth/opioid -white people declining socioeconomic prospects -criminal justice reform\n,"[imprison, rate, black, men, drop, imprison, rate, black, women, drop, imprison, rate, white, women, risen, incarcer, overal, declin, war, drug, shift, marijuana, crack, meth, opioid, white, peopl, declin, socioeconom, prospect, crimin, justic, reform]",0.635282
5337,social welfare,5337,funded by taxes looks out for the poorest members of society\n,"[fund, tax, look, poorest, member, societi]",0.629812
26160,broad spectrum,26160,kills a lot of bacteria\n,"[kill, lot, bacteria]",0.626813
9371,social networks,9371,"what we do -post opinions, gossip, pictures, &""away from home&"" status what they do -new services with unexpected privacy settings\n","[post, opinion, gossip, pictur, away, home, statu, new, servic, unexpect, privaci, set]",0.619492


Min pooling precision: 0.000000 




Try different aggregation methods and report the precision@10 for these queries:
* query: relational database
* query: garbage collection
* query: retrieval model

### Discussion
Among these aggregation methods, which one is the best and which one is the worst?

We can find that the precision@10 of these methods are all not that good.

For query "relational database", only `Max Pooling`, `Sum Pooling` and `Average Pooling` have hit some target documents. And the precision@10 are 0.2, 0.5 and 0.5 respectively. The other methods all don't have related hits.

For the other two queries "garbage collection" and "retrieval model", all methods don't have related hits.

To sum up, based on the above results, all the methods are not that good, but among these methods, `Sum Pooling` and `Average Pooling` are the best.

# Part 3: Query Expansion via Word Embeddings (30 points) 
Remember the hardest query "retrieval model" in homework 1? Because there is no document containing "retrieval model" in the dataset, you cannot retrieve any documents by Boolean matching. Now, it is the time of your "revenge" via query expansion.

In this part, your job is to expand the original query like "retrieval model" by adding semantically similar words (e.g., "search"), which are selected from all tokens in the dataset.

There are many ways to do so. For this part, we want you to calculate the cosine similarity between each of the original query tokens and the other tokens based on their word embeddings.

First, please find the top 3 similar tokens for:
* relational
* database
* garbage
* collection
* retrieval 
* model

In [244]:
from collections import defaultdict

In [245]:
# your code here
# get the vocabulary dataframe
vocabulary = np.array([k for k in df_dict.keys()])
vocabulary_df = pd.DataFrame(vocabulary, columns = ["parsed_term"])
vocabulary_df

Unnamed: 0,parsed_term
0,lift
1,ie
2,intersect
3,expirament
4,overwhel
...,...
9649,till
9650,feeback
9651,likelihood
9652,stepwis


In [246]:
# parse the six terms and create a dataframe
terms = np.array(["relational", "database", "garbage", "collection", "retrieval", "model"])
term_df = pd.DataFrame(terms, columns = ["term"])
term_df = term_df.assign(parsed_term=parse(term_df['term'], 
                                                remove_stopwords=remove_stopwords, 
                                                use_stemming=use_stemming, 
                                                remove_otherNoise=remove_otherNoise))
term_df

None of pre-processing options = 6
After removing stop words = 6
After removing stop words + stemming = 6
After removing stop words + stemming + removing other noise = 6 



Unnamed: 0,term,parsed_term
0,relational,[relat]
1,database,[databas]
2,garbage,[garbag]
3,collection,[collect]
4,retrieval,[retriev]
5,model,[model]


In [247]:
def find_similar_word(model, term, vocabulary_df, top_k):
    scored_vocabulary_df = vocabulary_df.assign(cos=vocabulary_df["parsed_term"].apply(lambda row: cosine(model.wv[term], model.wv[row])))
    return scored_vocabulary_df.sort_values("cos", ascending=False).head(top_k+1)

In [248]:
similar_word_dict = defaultdict(list)

for index, row in term_df.iterrows():
    term = row["term"]
    parsed_term = row["parsed_term"][0]
    similar_word_df = find_similar_word(model, parsed_term, vocabulary_df, 3)
    for i, r in similar_word_df.iterrows():
        if r["parsed_term"] != parsed_term:
            similar_word_dict[term].append(r["parsed_term"])
            
similar_word_dict

defaultdict(list,
            {'relational': ['entiti', 'predefin', 'tabl'],
             'database': ['dbm', 'data', 'structur'],
             'garbage': ['collector', 'longer', 'reclaim'],
             'collection': ['gather', 'organ', 'repositori'],
             'retrieval': ['store', 'metadata', 'updat'],
             'model': ['breakthrough', 'mathemat', 'conceptu']})

In [249]:
for key, value in similar_word_dict.items():
    print("Top 3 similar words for '%s': %s" % (key, value))

Top 3 similar words for 'relational': ['entiti', 'predefin', 'tabl']
Top 3 similar words for 'database': ['dbm', 'data', 'structur']
Top 3 similar words for 'garbage': ['collector', 'longer', 'reclaim']
Top 3 similar words for 'collection': ['gather', 'organ', 'repositori']
Top 3 similar words for 'retrieval': ['store', 'metadata', 'updat']
Top 3 similar words for 'model': ['breakthrough', 'mathemat', 'conceptu']


Second, please add these similar tokens to the orignal query and redo the **vector space model** in part 2. 
* query: relational database
* query: garbage collection
* query: retrieval model

In [250]:
def recall_10(query_df, idx, results_df):
    query = query_df["query"][idx]
    count = 0
    
    # count related docs
    num_related_docs = results_df["entity"].apply(lambda entity: 1 if entity == query else 0).sum()
    
    for row in range(10):
        if results_df["entity"].iloc[row] == query:
            count += 1
    return count / num_related_docs

## Report recall@10 before the query expansion:

In [251]:
# Report recall@10 before the query expansion
for idx in range(3):
    print("Query: %s" % query_df["query"][idx])
    max_pooling_recall_10 = recall_10(query_df, idx, max_pooling_sorted_results)
    print("Max pooling recall@10 before the query expansion: %f \r\n" % max_pooling_recall_10)

Query: relational database
Max pooling recall@10 before the query expansion: 0.000000 

Query: garbage collection
Max pooling recall@10 before the query expansion: 0.000000 

Query: retrieval model
Max pooling recall@10 before the query expansion: 0.000000 



In [252]:
for idx in range(3):
    print("Query: %s" % query_df["query"][idx])
    min_pooling_recall_10 = recall_10(query_df, idx, min_pooling_sorted_results)
    print("Min pooling recall@10 before the query expansion: %f \r\n" % min_pooling_recall_10)

Query: relational database
Min pooling recall@10 before the query expansion: 0.003521 

Query: garbage collection
Min pooling recall@10 before the query expansion: 0.000000 

Query: retrieval model
Min pooling recall@10 before the query expansion: 0.000000 



In [253]:
for idx in range(3):
    print("Query: %s" % query_df["query"][idx])
    avg_pooling_recall_10 = recall_10(query_df, idx, avg_pooling_sorted_results)
    print("Average pooling recall@10 before the query expansion: %f \r\n" % avg_pooling_recall_10)

Query: relational database
Average pooling recall@10 before the query expansion: 0.000000 

Query: garbage collection
Average pooling recall@10 before the query expansion: 0.000000 

Query: retrieval model
Average pooling recall@10 before the query expansion: 0.000000 



In [254]:
for idx in range(3):
    print("Query: %s" % query_df["query"][idx])
    sum_pooling_recall_10 = recall_10(query_df, idx, sum_pooling_sorted_results)
    print("Sum pooling recall@10 before the query expansion: %f \r\n" % sum_pooling_recall_10)

Query: relational database
Sum pooling recall@10 before the query expansion: 0.000000 

Query: garbage collection
Sum pooling recall@10 before the query expansion: 0.000000 

Query: retrieval model
Sum pooling recall@10 before the query expansion: 0.000000 



In [255]:
for idx in range(3):
    print("Query: %s" % query_df["query"][idx])
    tf_idf_weighted_pooling_recall_10 = recall_10(query_df, idx, tf_idf_weighted_pooling_sorted_results)
    print("TF-IDF weighted pooling recall@10 before the query expansion: %f \r\n" % tf_idf_weighted_pooling_recall_10)

Query: relational database
TF-IDF weighted pooling recall@10 before the query expansion: 0.000000 

Query: garbage collection
TF-IDF weighted pooling recall@10 before the query expansion: 0.000000 

Query: retrieval model
TF-IDF weighted pooling recall@10 before the query expansion: 0.000000 



## Report recall@10 after the query expansion:

In [None]:
# add query expansions
query_df["tokens"][0].extend(similar_word_dict['relational'])
query_df["tokens"][0].extend(similar_word_dict['database'])
query_df["tokens"][1].extend(similar_word_dict['garbage'])
query_df["tokens"][1].extend(similar_word_dict['collection'])
query_df["tokens"][2].extend(similar_word_dict['retrieval'])
query_df["tokens"][2].extend(similar_word_dict['model'])
query_df[["query", "tokens"]]

In [256]:
for idx in range(3):
    # calculate and rank by max_pooling_score
    print("Query: %s" % query_df["query"][idx])

    documents_df = documents_df.assign(max_pooling_score=cal_cosine_scores(query_df, idx, documents_df, 'max_pooling_embedding'))
    max_pooling_sorted_results = documents_df.sort_values("max_pooling_score", ascending=False)
        
    max_pooling_recall_10 = recall_10(query_df, idx, max_pooling_sorted_results)
    print("Max pooling recall@10 after query expansion: %f \r\n\r\n" % max_pooling_recall_10)

Query: relational database
Max pooling recall@10 after query expansion: 0.007042 


Query: garbage collection
Max pooling recall@10 after query expansion: 0.000000 


Query: retrieval model
Max pooling recall@10 after query expansion: 0.000000 




In [257]:
for idx in range(3):
    # calculate and rank by max_pooling_score
    print("Query: %s" % query_df["query"][idx])

    documents_df = documents_df.assign(min_pooling_score=cal_cosine_scores(query_df, idx, documents_df, 'min_pooling_embedding'))
    min_pooling_sorted_results = documents_df.sort_values("min_pooling_score", ascending=False)
        
    min_pooling_recall_10 = recall_10(query_df, idx, min_pooling_sorted_results)
    print("Min pooling recall@10 after query expansion: %f \r\n\r\n" % min_pooling_recall_10)

Query: relational database
Min pooling recall@10 after query expansion: 0.000000 


Query: garbage collection
Min pooling recall@10 after query expansion: 0.000000 


Query: retrieval model
Min pooling recall@10 after query expansion: 0.000000 




In [258]:
for idx in range(3):
    # calculate and rank by avg_pooling_score
    print("Query: %s" % query_df["query"][idx])

    documents_df = documents_df.assign(avg_pooling_score=cal_cosine_scores(query_df, idx, documents_df, 'avg_pooling_embedding'))
    avg_pooling_sorted_results = documents_df.sort_values("avg_pooling_score", ascending=False)
        
    avg_pooling_recall_10 = recall_10(query_df, idx, avg_pooling_sorted_results)
    print("Average pooling recall@10 after query expansion: %f \r\n\r\n" % avg_pooling_recall_10)

Query: relational database
Average pooling recall@10 after query expansion: 0.017606 


Query: garbage collection
Average pooling recall@10 after query expansion: 0.000000 


Query: retrieval model
Average pooling recall@10 after query expansion: 0.000000 




In [259]:
for idx in range(3):
    # calculate and rank by sum_pooling_score
    print("Query: %s" % query_df["query"][idx])

    documents_df = documents_df.assign(sum_pooling_score=cal_cosine_scores(query_df, idx, documents_df, 'sum_pooling_embedding'))
    sum_pooling_sorted_results = documents_df.sort_values("sum_pooling_score", ascending=False)
        
    sum_pooling_recall_10 = recall_10(query_df, idx, sum_pooling_sorted_results)
    print("Sum pooling recall@10 after query expansion: %f \r\n\r\n" % sum_pooling_recall_10)

Query: relational database
Sum pooling recall@10 after query expansion: 0.017606 


Query: garbage collection
Sum pooling recall@10 after query expansion: 0.000000 


Query: retrieval model
Sum pooling recall@10 after query expansion: 0.000000 




In [260]:
for idx in range(3):
    # calculate and rank by tf_idf_weighted_pooling_score
    print("Query: %s" % query_df["query"][idx])

    documents_df = documents_df.assign(tf_idf_weighted_pooling_score=cal_cosine_scores(query_df, idx, documents_df, 'tf_idf_weighted_pooling_embedding'))
    tf_idf_weighted_pooling_sorted_results = documents_df.sort_values("tf_idf_weighted_pooling_score", ascending=False)
        
    tf_idf_weighted_pooling_recall_10 = recall_10(query_df, idx, tf_idf_weighted_pooling_sorted_results)
    print("TF-IDF weighted pooling recall@10 after query expansion: %f \r\n\r\n" % tf_idf_weighted_pooling_recall_10)

Query: relational database
TF-IDF weighted pooling recall@10 after query expansion: 0.000000 


Query: garbage collection
TF-IDF weighted pooling recall@10 after query expansion: 0.000000 


Query: retrieval model
TF-IDF weighted pooling recall@10 after query expansion: 0.000000 




### Discussion
Why we measure recall here instead of precision or NDCG?

**Answer:** 

Because different queries have different numbers of related documents. The query with more related documents may have a larger precision or NDCG. So in order to eliminate the influence of this factor when comparing the results before query expansion and after query expansion, we'd better use recall instead of precision or NDCG.

Should the tokens added for expansion have the same importance as the original query tokens? If not, how to improve the query expansion in this part?

**Answer:** 

Maybe not. We can reduce the weights of the query expansion words in aggregating the query embedding.