# Feature extraction from text

This notebook walks through the classical way to describe text documents: extract the text, split it into tokens, reduce the tokens to stems, and remove stopwords. The next chapter improves on these descriptors.

In [None]:
%pip install --quiet -r requirements.txt
%load_ext autoreload
%autoreload 2

In [5]:
from IPython.display import display, Markdown, JSON
import urllib.request
import os, re
from PyPDF2 import PdfReader
from unidecode import unidecode
from collections import Counter, defaultdict
from helpers import *  # course helper module; expected to provide print_table and stopwords used below
import math
import random
from nltk.stem import PorterStemmer

## Extracting text from file formats

In [6]:
uri = "https://dmi.unibas.ch/fileadmin/user_upload/dmi/Studium/Computer_Science/Vorlesungen_HS23/Multimedia_Retrieval/HS24/03_ClassicalTextRetrieval.pdf"
local_filename = 'example.pdf'

# unless local file already exists, download the file
if not os.path.exists(local_filename):
    urllib.request.urlretrieve(uri, local_filename)
    print(f"File downloaded and saved as {local_filename}")

File downloaded and saved as example.pdf


In [8]:
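def extract_text_from_pdf(file_name: str) -> list[str]:
    """Return the cleaned text of each PDF page, one string per page."""
    pages = []

    def visitor_text(text, cm, tm, fontDict, fontSize):
        # tm is the text matrix; tm[5] holds the vertical position of the text block
        y = tm[5]
        # keep only non-empty blocks above y = 20 (skips text near the bottom edge of the page)
        if y > 20 and len(text) > 0:
            # replace line breaks with spaces, then drop bracketed numbers such as [12] and bullet symbols
            text = text.replace("\n", " ")
            text = re.sub(r'\[\d+\]|➢|•', '', text)
            parts.append(text)

    # read the PDF; the visitor collects the text blocks of the current page in parts
    reader = PdfReader(file_name)
    for page in reader.pages:
        parts = []
        page.extract_text(visitor_text=visitor_text)
        # merge the blocks of this page and collapse runs of whitespace into a single space
        pages.append(re.sub(r'\s+',' ', " ".join(parts)).strip())

    return pages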

pages = extract_text_from_pdf(local_filename)
text = re.sub(r'\s+',' ', " ".join(pages))
display(Markdown(text[0:4000]+"..."))

Computer Science / 15731 - 01 / 2024 Multimedia Retrieval Chapter 3: Classical T ext Retrieval Dr. Roger Weber, roger.weber@gmail.com 3.1 Introduction 3.2 Fundamentals 3.3 Text Retrieval Models 3.4 Indexing Structures 3.5 Lucene - Open Source Text Search 3.6 Literature and Links 3.1 Introduction 3.2 Fundamentals 3.3 T ext Retrieval Models 3.4 Indexing Structures 3.5 Lucene - Open Source T ext Search 3.6 Literature and Links 3.1 Introduction Text retrieval originated in the 1950s and 1960s through pioneering research by Gerard Salton, Karen Spärck Jones, and others. It became popular due to its wide range of applications, simplicity, and user - friendly interface. As discussed earlier, text retrieval is less affected by the semantic gap compared to other media types (although this will be further discussed in upcoming chapters). Users input text queries against unstructured documents, and the systems can easily match the query with the document, as they share the same representation. Additionally, textual metadata enables any media type to be searchable using the same approach. This allowed the relatively basic computer systems back then to offer efficient and effective search for expert users. As early computers had limitations in terms of storage and compute, models progressed from simple Boolean matching to more complex vector space and probabilistic models as technology improved. The first generation primarily focused on "Retriever - only" models. – Boolean Retrieval Systems hold a significant advantage as they can determine document relevance while scanning the data, without the need for post - processing to sort and rank documents. Additional filters, such as publication date or author, can be easily integrated into the Boolean model. This builds a robust foundation still observed in today's systems like when searching for files on a local drive – The Boolean Model uses set theory and Boolean algebra. 
Documents are represented as a set of terms, without considering the number of occurrences. The query is formulated as a Boolean expression using operators like AND and OR to combine term match atomic queries. If a document satisfies the Boolean expression (and other filter conditions on its metadata), it is included in the result set; otherwise, it is excluded – Boolean models do not use scoring or ranking, so they can return results as soon as they find the first matching document while scanning the data (consider the example of to searching through a local hard drive). In addition, they can utilize a simple index structure called inverted file which makes the search process very efficient by considering only a small fraction of the data. This method is still used in modern algorithms today. Retriever query doc 1 doc 2 doc 3 … index As collections grew larger, the Boolean model needed an extension for better result organization and exploration. When there are hundreds of hits, users want a more efficient way to browse through search results. A post - processing step was introduced, enabling query - independent filtering and sorting, such as sorting by publication date or filtering for a specific language. Unlike the retriever step, users can add or remove filters and change sorting while exploring the results and without re - submitting the search. In other words, this post - processing does not impact the set of relevant documents and is often implemented in the interface directly: The above method works well for scenarios where exploration is mostly focused on metadata, as in shop or library searches. However, a key drawback is that sorting does not consider how well an object fits the query. Early extensions of the Boolean model ( Extended Boolean Model ) addressed this limitation by studying the impact of the query terms' presence in documents and their relevance assessment. 
For example, consider the query "cat AND dog" and the three documents: 1) "A cat walked down the street." 2) "The dog chased the cat." 3) "The cat...
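
The clean-up steps used during extraction can be tried in isolation on a plain string. The following is a minimal sketch using only `re`; the sample string `raw` is made up for illustration and does not come from the PDF.

```python
import re

# Made-up sample with a bullet symbol, a bracketed reference, a line break and extra spaces
raw = "➢ Boolean models [12] do not\nuse   scoring • or ranking"

# Same three steps as in the extraction cell:
cleaned = raw.replace("\n", " ")                 # line breaks become spaces
cleaned = re.sub(r"\[\d+\]|➢|•", "", cleaned)    # drop bracketed numbers and bullet symbols
cleaned = re.sub(r"\s+", " ", cleaned).strip()   # collapse runs of whitespace

print(cleaned)  # Boolean models do not use scoring or ranking
```

Removing the markers leaves double spaces behind, which is why the whitespace collapsing runs last.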

## A simple tokenizer

In [9]:
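def tokenize(text: str) -> list[str]:
    # replace every run of characters that is neither a word character nor a hyphen with a space
    text = re.sub(r'[^\w\-]+', ' ', text)
    tokens = []
    for token in text.split():
        # lowercase and transliterate to ASCII (e.g. "spärck" becomes "sparck")
        token = unidecode(token.lower())
        # skip single characters and tokens that do not start with a letter
        if len(token) < 2:
            continue
        if not re.match(r'^[a-zA-Z][\w\-\.]*$', token):
            continue
        tokens.append(token)
    return tokens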

tokens = tokenize(text)
print("\n".join(tokens[0:20]))
print(f'...\n\nextracted {len(tokens)} tokens from text with {len(set(tokens))} unique tokens')

computer
science
multimedia
retrieval
chapter
classical
ext
retrieval
dr
roger
weber
roger
weber
gmail
com
introduction
fundamentals
text
retrieval
models
...

extracted 21597 tokens from text with 2755 unique tokens
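
To see which tokens survive the filtering rules, it helps to run them on a single short sentence. The sketch below is a standard-library approximation of the tokenizer above: `tokenize_sketch` and `fold_to_ascii` are hypothetical names, and `fold_to_ascii` uses `unicodedata` as a rough stand-in for `unidecode`, which covers far more characters. The sample sentence is made up.

```python
import re
import unicodedata

def fold_to_ascii(token: str) -> str:
    # Rough stand-in for unidecode: decompose accented characters and drop the accents
    return unicodedata.normalize("NFKD", token).encode("ascii", "ignore").decode("ascii")

def tokenize_sketch(text: str) -> list[str]:
    # Replace runs of characters that are neither word characters nor hyphens with a space
    text = re.sub(r"[^\w\-]+", " ", text)
    tokens = []
    for raw in text.split():
        token = fold_to_ascii(raw.lower())
        # Drop single characters and tokens that do not start with a letter
        if len(token) < 2 or not re.match(r"^[a-z][\w\-]*$", token):
            continue
        tokens.append(token)
    return tokens

sample = "Karen Spärck Jones, 1960s: a user-friendly BM25 query!"
print(tokenize_sketch(sample))
# ['karen', 'sparck', 'jones', 'user-friendly', 'bm25', 'query']
```

The token `1960s` is dropped because it starts with a digit, `a` because it is a single character, while the hyphenated `user-friendly` and the alphanumeric `bm25` are kept as single tokens.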


### Let's see which terms appear most often

In [10]:
tokens = tokenize(text)
print_table(Counter(tokens).most_common(20),['token', 'frequency'])

| token     |   frequency |
|:----------|------------:|
| the       |        1424 |
| and       |         652 |
| to        |         492 |
| of        |         484 |
| in        |         447 |
| for       |         335 |
| we        |         282 |
| document  |         250 |
| term      |         241 |
| query     |         226 |
| with      |         225 |
| is        |         214 |
| as        |         210 |
| documents |         204 |
| terms     |         175 |
| on        |         149 |
| that      |         132 |
| not       |         129 |
| it        |         126 |
| search    |         117 |
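
How strongly a few tokens dominate the text can be checked with simple arithmetic on the numbers printed above. The sketch below copies the 20 frequencies from the table together with the token totals reported by the tokenizer cell; the variable names are chosen for illustration only.

```python
# Frequencies copied from the table above
top20 = {
    "the": 1424, "and": 652, "to": 492, "of": 484, "in": 447,
    "for": 335, "we": 282, "document": 250, "term": 241, "query": 226,
    "with": 225, "is": 214, "as": 210, "documents": 204, "terms": 175,
    "on": 149, "that": 132, "not": 129, "it": 126, "search": 117,
}
total_tokens = 21597   # tokens reported by the tokenizer cell
unique_tokens = 2755   # unique tokens reported by the tokenizer cell

covered = sum(top20.values())
share = covered / total_tokens
print(f"{len(top20)} of {unique_tokens} unique tokens cover "
      f"{covered} of {total_tokens} occurrences ({share:.1%})")
# 20 of 2755 unique tokens cover 6514 of 21597 occurrences (30.2%)
```

Most of these 20 tokens are function words such as "the" or "of", which say little about the content of the document.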

### Apply Porter stemming to reduce words to a common stem

In [11]:
porter_stemmer = PorterStemmer()

def reduce_to_stems(tokens):
    return [porter_stemmer.stem(token) for token in tokens]

tokens = reduce_to_stems(tokenize(text))
print_table(Counter(tokens).most_common(20),['token', 'frequency'])

| token    |   frequency |
|:---------|------------:|
| the      |        1424 |
| and      |         653 |
| to       |         492 |
| of       |         484 |
| document |         455 |
| in       |         447 |
| term     |         416 |
| for      |         335 |
| we       |         282 |
| queri    |         275 |
| with     |         225 |
| is       |         214 |
| as       |         210 |
| it       |         155 |
| on       |         149 |
| search   |         142 |
| retriev  |         133 |
| use      |         132 |
| that     |         132 |
| not      |         129 |
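
The effect of stemming on the counts can be reproduced with a few numbers from the two tables. The sketch below merges surface forms under a shared stem; `toy_stem` is a hypothetical helper that only strips a trailing "s" and is not the Porter algorithm, it merely illustrates how frequencies add up.

```python
from collections import Counter

# Frequencies of the unstemmed tokens, copied from the earlier table
surface_counts = Counter({"document": 250, "documents": 204, "term": 241, "terms": 175})

def toy_stem(token: str) -> str:
    # Hypothetical stand-in for a stemmer: strip one trailing "s" from longer words
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

stem_counts = Counter()
for token, freq in surface_counts.items():
    stem_counts[toy_stem(token)] += freq

print(stem_counts.most_common())
# [('document', 454), ('term', 416)]
```

The result for `term` matches the stemmed table exactly (416). For `document` the stemmed table shows 455, so the Porter stemmer maps at least one further word form, which is not in the top 20, to the same stem.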

### Eliminate stopwords, as they do not describe the content of the document

In [12]:
def eliminate_stopwords(tokens):
    return [token for token in tokens if token not in stopwords['english']]

tokens = tokenize(text)
count = len(tokens)
tokens = reduce_to_stems(eliminate_stopwords(tokens))
print(f'{count-len(tokens)} stopwords removed ({(count-len(tokens))/count*100:.2f}%)')
print(f'{len(tokens)} non-stopword tokens remain with {len(set(tokens))} unique tokens')
print_table(Counter(tokens).most_common(20),['token', 'frequency'])

8005 stopwords removed (37.07%)
13592 non-stopword tokens remain with 1838 unique tokens


| token    |   frequency |
|:---------|------------:|
| document |         455 |
| term     |         416 |
| queri    |         275 |
| search   |         142 |
| retriev  |         133 |
| use      |         132 |
| relev    |         117 |
| model    |         115 |
| vector   |          95 |
| score    |          92 |
| index    |          91 |
| lucen    |          84 |
| valu     |          84 |
| text     |          78 |
| result   |          76 |
| cat      |          74 |
| word     |          71 |
| post     |          69 |
| tf       |          68 |
| idf      |          63 |
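
Stopword filtering itself is a plain membership test. The following minimal sketch uses a tiny hand-picked stopword set instead of the `stopwords['english']` list used above; `STOPWORDS` and `remove_stopwords` are hypothetical names, and the sentence is one of the example documents from the slides.

```python
# Tiny hand-picked list for illustration; real stopword lists are much longer
STOPWORDS = frozenset({"the", "a", "and", "to", "of", "in", "is"})

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Keep only tokens that are not in the stopword set (set lookup is O(1) on average)
    return [token for token in tokens if token not in stopwords]

tokens = "the dog chased the cat".split()
kept = remove_stopwords(tokens)
removed = len(tokens) - len(kept)
print(kept, f"{removed / len(tokens):.0%} removed")
# ['dog', 'chased', 'cat'] 40% removed
```

The notebook removes stopwords before stemming. A stopword list normally holds full word forms, so a stemmed token might no longer match its entry, which is presumably why the filter runs first.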

### Describe each page separately and treat it as a mini-document

In [13]:
collection = [reduce_to_stems(eliminate_stopwords(tokenize(page_text))) for page_text in pages]

n = 10
print_table(
    [
        [
            i+1,
            " ".join(collection[i])
        ] for i in range(n)
    ],
    ["page", "tokens"]
)

|   page | tokens |
|-------:|:-------|
|      1 | comput scienc multimedia retriev chapter classic ext retriev dr roger weber roger weber gmail com introduct fundament text retriev model index structur lucen open sourc text search literatur link introduct fundament ext retriev model index structur lucen open sourc ext search literatur link |
|      2 | introduct text retriev origin pioneer research gerard salton karen sparck jone other becam popular due wide rang applic simplic user friendli interfac discuss earlier text retriev less affect semant gap compar media type although discuss upcom chapter user input text queri unstructur document system easili match queri document share represent addit textual metadata enabl media type searchabl use approach allow rel basic comput system back offer effici effect search expert user earli comput limit term storag comput model progress simpl boolean match complex vector space probabilist model technolog improv first gener primarili focus retriev model boolean retriev system hold signific advantag determin document relev scan data without need post process sort rank document addit filter public date author easili integr boolean model build robust foundat still observ today system like search file local drive boolean model use set theori boolean algebra document repres set term without consid number occurr queri formul boolean express use oper like combin term match atom queri document satisfi boolean express filter condit metadata includ result set otherwis exclud boolean model use score rank return result soon find first match document scan data consid exampl search local hard drive addit util simpl index structur call invert file make search process effici consid small fraction data method still use modern algorithm today retriev queri doc doc doc index |
|      3 | collect grew larger boolean model need extens better result organ explor hundr hit user want effici way brows search result post process step introduc enabl queri independ filter sort sort public date filter specif languag unlik retriev step user add remov filter chang sort explor result without submit search word post process impact set relev document often implement interfac directli method work well scenario explor mostli focus metadata shop librari search howev key drawback sort consid well object fit queri earli extens boolean model extend boolean model address limit studi impact queri term presenc document relev assess exampl consid queri cat dog three document cat walk street dog chase cat cat play dog anoth cat dog approach document meet condit cat dog document dismiss boolean logic although appear partial relev queri furthermor document contain queri term frequent seem better fit queri boolean express classifi document extend boolean model chang foundat model two way allow partial match queri like document assign lower relev score consid often queri term appear document calcul relev score use relev score sort document collect present result even condit met word instead use hard condit appli penalti meet condit retriev queri doc doc doc index filter sort meta data criteria |
|      4 | classic vector space probabilist retriev model emerg method establish relev model document queri match vector space model document queri repres high dimension vector heurist method compar vector obtain notion relev probabilist retriev model assum document gener randomli probabilist model relev determin probabl document relev queri newer model like bm25 combin vector space probabilist retriev techniqu extend boolean model vector space retriev probabilist retriev follow similar approach retriev gather larger set candid document base queri term rank model assess relev produc sort result list filter condit appli explor result collect languag filter year public chapter delv classic text retriev model detail begin explor document descript perform simpl linguist oper reduc word term form vocabulari search next studi classic model like standard extend boolean model vector space retriev model probabilist model modern bm25 model use popular softwar packag examin index method notabl invert file simpl implement use relat databas acceler search process final conclud chapter discuss apach lucen popular softwar packag offer state art text retriev variou platform chapter follow one explor natur languag process advanc techniqu gener vector text represent web retriev uniqu search challeng modern ai support classif search method retriev queri doc doc doc index filter ranker rank model |
|      5 | offlin phase docid doc10 dog word word cat word home word word index featur extract new document insert fundament mani search system like search file local drive scan data queri howev approach effici larg text collect instead search divid two part offlin index phase depict left onlin queri phase see next page offlin phase extract meaning featur text document store along metadata index futur queri use featur provid concis represent document content typic repres high dimension vector offlin mode follow step take place add new document find one scan crawl addit trigger featur extract updat search index extract featur best describ content analyz context includ higher level featur pass featur index acceler search queri main challeng lie extract concis represent document chapter use simpl method creat vector represent chapter follow one explor advanc techniqu |
|      6 | onlin queri transform invert file dog doc3 doc4 doc10 cat doc10 home doc1 doc7 doc10 index dog home dog dog hound home retriev relev rank sim doc1 sim doc4 sim doc10 result doc10 doc4 doc1 onlin mode user search document use index data offlin phase queri analyz similarli document addit process correct spell mistak includ synonym broader search retriev involv compar featur two document similar featur consid similar content thu document consid good match queri featur close queri onlin mode follow step take place user enter queri speech handwrit recognit extract featur queri similar process document transform queri need correct spell mistak use queri featur search index document similar featur rank document base retriev statu valu rsv return best match document primari challeng relev rank goal accur assess document relev base sole featur represent given featur queri subsequ chapter explor sophist method includ gener ai howev chapter use simpl yet effici effect method suitabl mani use case |
|      7 | rest chapter explor fundament step extract featur sourc document offlin phase mention earlier overal process detail pictur discuss index later section focus four fundament step featur extract extract split token summar outcom includ vocabulari contain term found document also use queri analysi addit obtain document chunk split step featur represent store index along metadata sourc document split rang start end coordin sourc document html doctor medicin univers london proceed netley go cours prescrib surgeon armi complet studi duli attach fifth northumberland fusili assist surgeon regiment station india time could join second afghan war broken land bombay learn corp advanc pass alreadi deep enemi countri proceed netley go cours prescrib surgeon armi complet studi duli attach fifth northumberland fusili assist surgeon regiment station india time could join second afghan war broken land bombay learn corp advanc pass alreadi deep enemi countri year took degre doctor medicin univers london proceed netley go cours prescrib surgeon armi complet studi duli attach fifth northumberland fusili assist surgeon regiment station india time could join second afghan war broken land bombay learn corp advanc pass alreadi deep enemi countri token year took degre doctor medicin univers london proceed netley go cours prescrib surgeon armi complet studi duli attach fifth northumberland fusili assist surgeon regiment station india time could join second afghan war broken year took degre doctor medicin univers london proceed netley go cours prescrib surgeon armi complet studi duli attach fifth northumberland fusili assist surgeon regiment station india time could join second afghan war broken year took degre doctor medicin univers london proceed netley go cours prescrib surgeon armi complet studi duli attach fifth northumberland fusili assist surgeon regiment station india time could join second afghan war broken vocabulari extract year took degre doctor medicin univers london proceed netley go cours prescrib surgeon armi complet studi duli attach fifth northumberland fusili assist surgeon regiment station india time could join second afghan war broken land bombay learn corp advanc pass alreadi deep enemi countri summar year medicin holm surgeon london attach univers duli year medicin holm surgeon london attach univers duli year medicin holm surgeon london attach univers duli index term metadata featur split rang |
|      8 | step extract exampl html text document avail differ format html pdf epub metadata plain text first step involv extract meta inform sequenc charact form text stream without control sequenc format inform present sourc document may includ structur analysi document encod adjust identifi relev inform featur extract case may appli text extract imag consid simpl exampl html follow snippet repres web page structur initi task identifi use bit inform within header typic hold rich meta inform bodi contain main text part although html follow well defin standard extract inform known scrape requir analyz data structur use page contrast web search engin consid everyth present page point must decid charact encod use term index convert sourc text utf wide use limit abil support differ languag html extract year took degre doctor medicin univers london proceed netley go cours prescrib surgeon armi complet studi duli attach fifth northumberland fusili assist surgeon regiment station india time could join second afghan war broken land bombay learn corp advanc pass alreadi deep enemi countri html head titl mmir titl meta name keyword content multimedia retriev cours head bodi bodi html header contain meta inform document util inform add relev metadata document chunk bodi contain main content enrich markup document flow alway obviou may appear differ screen file                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|      9 | let use exampl html illustr aspect metadata gener uri page metadata content may serv concis key word retriev titl document metadata content may serv concis key word retriev meta inform header section enrich inform provid author discuss metadata section must cautiou reliabl might includ fals inform describ aspect differ observ document nevertheless mani case brief natur metadata allow us assign high weight text part web page contain link handl effect link describ relationship document enhanc current document descript importantli also describ referenc document sinc web page author often use concis anchor text keyword anchor text serv excel sourc addit term referenc document usual link text associ embed link document howev typic give much higher weight keyword referenc document essenti consid approach effect especi deal click bait promis referenc document reveal navig hint like click back main page keyword add addit content referenc document bodi includ text block use tag control render page flow may exactli match order html file usual good enough approxim certain tag offer valuabl addit inform follow text piec exampl assign higher weight term occurr headlin bold text text emphas render page html includ escap sequenc special charact need translat target encod format http dmi uniba ch de studium comput scienc informatik lehrangebot hs23 lectur multimedia retriev titl multimedia retriev homepag titl meta name keyword content mmir inform retriev meta name descript content chang life nbsp space uuml                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|     10 | illustr show anchor text surround provid relev term describ target page imag emphas need caution human metadata howev anchor text come divers sourc simplifi identif use term across mention filter outlier obvious incorrect inform subsequ chapter delv use link network assess page import object relev pagerank unlock creativ potenti multimedia retriev outstand grade multimedia retriev chapter great illustr lectur multimedia retriev roger weber fri multimedia cours uni basel absolut marvel text retriev search engin                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                                                                                                                                                           |

### Set-of-words summary

In [14]:
def set_of_words(tokens):
    """Summarize a token list as its set of distinct terms (order and counts are dropped)."""
    return set(tokens)

n = 10
print_table(
    [
        [
            f"{i+1}",
            ", ".join(sorted(set_of_words(collection[i])))
        ] for i in range(n)
    ],
    ["page", "set of words"]
)

|   page | set of words                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|-------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|      1 | chapter, classic, com, comput, dr, ext, fundament, gmail, index, introduct, link, literatur, lucen, model, multimedia, open, retriev, roger, scienc, search, sourc, structur, text, weber                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|      2 | addit, advantag, affect, algebra, algorithm, allow, although, applic, approach, atom, author, back, basic, becam, boolean, build, call, chapter, combin, compar, complex, comput, condit, consid, data, date, determin, discuss, doc, document, drive, due, earli, earlier, easili, effect, effici, enabl, exampl, exclud, expert, express, file, filter, find, first, focus, formul, foundat, fraction, friendli, gap, gener, gerard, hard, hold, improv, includ, index, input, integr, interfac, introduct, invert, jone, karen, less, like, limit, local, make, match, media, metadata, method, model, modern, need, number, observ, occurr, offer, oper, origin, other, otherwis, pioneer, popular, post, primarili, probabilist, process, progress, public, queri, rang, rank, rel, relev, repres, represent, research, result, retriev, return, robust, salton, satisfi, scan, score, search, searchabl, semant, set, share, signific, simpl, simplic, small, soon, sort, space, sparck, still, storag, structur, system, technolog, term, text, textual, theori, today, type, unstructur, upcom, use, user, util, vector, wide, without |
|      3 | add, address, allow, although, anoth, appear, appli, approach, assess, assign, better, boolean, brows, calcul, cat, chang, chase, classifi, collect, condit, consid, contain, criteria, data, date, directli, dismiss, doc, document, dog, drawback, earli, effici, enabl, even, exampl, explor, express, extend, extens, filter, fit, focus, foundat, frequent, furthermor, grew, hard, hit, howev, hundr, impact, implement, independ, index, instead, interfac, introduc, key, languag, larger, librari, like, limit, logic, lower, match, meet, met, meta, metadata, method, model, mostli, need, object, often, organ, partial, penalti, play, post, presenc, present, process, public, queri, relev, remov, result, retriev, scenario, score, search, seem, set, shop, sort, specif, step, street, studi, submit, term, three, two, unlik, use, user, walk, want, way, well, without, word, work                                                                                                                                                                                                                                         |
|      4 | acceler, advanc, ai, apach, appli, approach, art, assess, assum, base, begin, bm25, boolean, candid, challeng, chapter, classic, classif, collect, combin, compar, conclud, condit, databas, delv, descript, detail, determin, dimension, discuss, doc, document, emerg, establish, examin, explor, extend, file, filter, final, follow, form, gather, gener, heurist, high, implement, index, invert, languag, larger, like, linguist, list, lucen, match, method, model, modern, natur, newer, next, notabl, notion, obtain, offer, one, oper, packag, perform, platform, popular, probabilist, probabl, process, produc, public, queri, randomli, rank, ranker, reduc, relat, relev, repres, represent, result, retriev, search, set, similar, simpl, softwar, sort, space, standard, state, studi, support, techniqu, term, text, uniqu, use, variou, vector, vocabulari, web, word, year                                                                                                                                                                                                                                                  |
|      5 | acceler, add, addit, advanc, along, analyz, approach, best, cat, challeng, chapter, collect, concis, content, context, crawl, creat, data, depict, describ, dimension, divid, doc10, docid, document, dog, drive, effici, explor, extract, featur, file, find, follow, fundament, futur, high, higher, home, howev, includ, index, insert, instead, larg, left, level, lie, like, local, main, mani, meaning, metadata, method, mode, new, next, offlin, one, onlin, page, part, pass, phase, place, provid, queri, repres, represent, scan, search, see, simpl, step, store, system, take, techniqu, text, trigger, two, typic, updat, use, vector, word                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|      6 | accur, addit, ai, analyz, assess, base, best, broader, case, cat, challeng, chapter, close, compar, consid, content, correct, data, doc1, doc10, doc3, doc4, doc7, document, dog, effect, effici, enter, explor, extract, featur, file, follow, gener, given, goal, good, handwrit, home, hound, howev, includ, index, invert, involv, mani, match, method, mistak, mode, need, offlin, onlin, phase, place, primari, process, queri, rank, recognit, relev, represent, result, retriev, return, rsv, search, sim, similar, similarli, simpl, sole, sophist, speech, spell, statu, step, subsequ, suitabl, synonym, take, thu, transform, two, use, user, valu, yet                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|      7 | addit, advanc, afghan, along, alreadi, also, analysi, armi, assist, attach, bombay, broken, chapter, chunk, complet, contain, coordin, corp, could, countri, cours, deep, degre, detail, discuss, doctor, document, duli, earlier, end, enemi, explor, extract, featur, fifth, focus, found, four, fundament, fusili, go, holm, html, includ, index, india, join, land, later, learn, london, medicin, mention, metadata, netley, northumberland, obtain, offlin, outcom, overal, pass, phase, pictur, prescrib, proceed, process, queri, rang, regiment, represent, rest, second, section, sourc, split, start, station, step, store, studi, summar, surgeon, term, time, token, took, univers, use, vocabulari, war, year                                                                                                                                                                                                                                                                                                                                                                                                                    |
|      8 | abil, add, adjust, advanc, afghan, alreadi, although, alway, analysi, analyz, appear, appli, armi, assist, attach, avail, bit, bodi, bombay, broken, case, charact, chunk, complet, consid, contain, content, contrast, control, convert, corp, could, countri, cours, data, decid, deep, defin, degre, differ, doctor, document, duli, encod, enemi, engin, enrich, epub, everyth, exampl, extract, featur, fifth, file, first, flow, follow, form, format, fusili, go, head, header, hold, html, identifi, imag, includ, index, india, inform, initi, involv, join, keyword, known, land, languag, learn, limit, london, main, markup, may, medicin, meta, metadata, mmir, multimedia, must, name, netley, northumberland, obviou, page, part, pass, pdf, plain, point, prescrib, present, proceed, regiment, relev, repres, requir, retriev, rich, scrape, screen, search, second, sequenc, simpl, snippet, sourc, standard, station, step, stream, structur, studi, support, surgeon, task, term, text, time, titl, took, typic, univers, use, utf, util, war, web, well, wide, within, without, year                                      |
|      9 | add, addit, allow, also, anchor, approach, approxim, aspect, assign, associ, author, back, bait, block, bodi, bold, brief, case, cautiou, certain, ch, chang, charact, click, comput, concis, consid, contain, content, control, current, de, deal, describ, descript, differ, discuss, dmi, document, effect, embed, emphas, encod, enhanc, enough, enrich, escap, especi, essenti, exactli, exampl, excel, fals, file, flow, follow, format, gener, give, good, handl, header, headlin, high, higher, hint, homepag, howev, hs23, html, http, illustr, importantli, includ, inform, informatik, key, keyword, lectur, lehrangebot, let, life, like, link, main, mani, match, may, meta, metadata, might, mmir, much, multimedia, must, name, natur, navig, nbsp, need, nevertheless, observ, occurr, offer, often, order, page, part, piec, promis, provid, referenc, relationship, reliabl, render, retriev, reveal, scienc, section, sequenc, serv, sinc, sourc, space, special, studium, tag, target, term, text, titl, translat, typic, uniba, uri, us, use, usual, uuml, valuabl, web, weight, word                                     |
|     10 | absolut, across, anchor, assess, basel, caution, chapter, come, cours, creativ, delv, describ, divers, emphas, engin, filter, fri, grade, great, howev, human, identif, illustr, imag, import, incorrect, inform, lectur, link, marvel, mention, metadata, multimedia, need, network, object, obvious, outlier, outstand, page, pagerank, potenti, provid, relev, retriev, roger, search, show, simplifi, sourc, subsequ, surround, target, term, text, uni, unlock, use, weber                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
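
A set-of-words summary only records which terms are present, which is all a Boolean query needs. The following is a minimal sketch on hypothetical, already-stemmed toy documents (the names `toy_docs`, `toy_sets`, and `boolean_and` are illustrative assumptions, not part of `helpers` or the course material). It shows that duplicates and word order disappear and that an AND query reduces to a subset test.

```python
def set_of_words(tokens):
    # same definition as in the cell above, repeated to keep the sketch self-contained
    return set(tokens)

# hypothetical toy documents, assumed to be tokenized and stemmed already
toy_docs = {
    "doc1": ["cat", "chase", "dog", "dog", "chase", "cat"],
    "doc2": ["dog", "walk", "street", "dog"],
    "doc3": ["cat", "sleep", "home"],
}
toy_sets = {doc_id: set_of_words(tokens) for doc_id, tokens in toy_docs.items()}

def boolean_and(query_terms, doc_sets):
    """Return the ids of documents whose term set contains every query term."""
    query = set(query_terms)
    return sorted(doc_id for doc_id, terms in doc_sets.items() if query <= terms)

print(sorted(toy_sets["doc1"]))               # ['cat', 'chase', 'dog']
print(boolean_and(["cat", "dog"], toy_sets))  # ['doc1']
print(boolean_and(["dog"], toy_sets))         # ['doc1', 'doc2']
```

Because the summary keeps no counts, `doc1` and a document mentioning each term only once would look identical to this kind of query.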

### Bag-of-words summary
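
Unlike a set, a bag-of-words summary also keeps how often each term occurs, while still discarding word order. As a minimal sketch on a hypothetical toy token list (`toy_tokens` is an illustrative assumption, not data from the PDF), the counts add up to the document length and the keys are exactly the distinct terms:

```python
from collections import Counter

def bag_of_words(tokens):
    # term -> number of occurrences; word order is not preserved
    return dict(Counter(tokens))

# hypothetical toy document, assumed to be tokenized and stemmed already
toy_tokens = ["cat", "chase", "dog", "dog", "chase", "cat", "cat"]
bag = bag_of_words(toy_tokens)

print(sorted(bag.items()))                   # [('cat', 3), ('chase', 2), ('dog', 2)]
print(sum(bag.values()) == len(toy_tokens))  # True: counts add up to the document length
print(set(bag) == set(toy_tokens))           # True: the keys are the distinct terms
```

The term counts are the extra information that frequency-based weighting can use later on.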

In [15]:
def bag_of_words(tokens):
    """Summarize a token list as a dict mapping each distinct term to its occurrence count."""
    return dict(Counter(tokens))

n = 10
print_table(
    [
        [
            f"{i+1}",
            ", ".join([f'{x[0]}:{x[1]}' for x in sorted(bag_of_words(collection[i]).items())])
        ] for i in range(n)
    ],
    ["page", "bag of words"]
)

|   page | bag of words                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|-------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|      1 | chapter:1, classic:1, com:1, comput:1, dr:1, ext:3, fundament:2, gmail:1, index:2, introduct:2, link:2, literatur:2, lucen:2, model:2, multimedia:1, open:2, retriev:4, roger:2, scienc:1, search:2, sourc:2, structur:2, text:2, weber:2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|      2 | addit:3, advantag:1, affect:1, algebra:1, algorithm:1, allow:1, although:1, applic:1, approach:1, atom:1, author:1, back:1, basic:1, becam:1, boolean:8, build:1, call:1, chapter:1, combin:1, compar:1, complex:1, comput:3, condit:1, consid:3, data:3, date:1, determin:1, discuss:2, doc:3, document:7, drive:2, due:1, earli:1, earlier:1, easili:2, effect:1, effici:2, enabl:1, exampl:1, exclud:1, expert:1, express:2, file:2, filter:2, find:1, first:2, focus:1, formul:1, foundat:1, fraction:1, friendli:1, gap:1, gener:1, gerard:1, hard:1, hold:1, improv:1, includ:1, index:2, input:1, integr:1, interfac:1, introduct:1, invert:1, jone:1, karen:1, less:1, like:2, limit:1, local:2, make:1, match:4, media:2, metadata:2, method:1, model:6, modern:1, need:1, number:1, observ:1, occurr:1, offer:1, oper:1, origin:1, other:1, otherwis:1, pioneer:1, popular:1, post:1, primarili:1, probabilist:1, process:2, progress:1, public:1, queri:5, rang:1, rank:2, rel:1, relev:1, repres:1, represent:1, research:1, result:2, retriev:5, return:1, robust:1, salton:1, satisfi:1, scan:2, score:1, search:4, searchabl:1, semant:1, set:3, share:1, signific:1, simpl:2, simplic:1, small:1, soon:1, sort:1, space:1, sparck:1, still:2, storag:1, structur:1, system:4, technolog:1, term:3, text:3, textual:1, theori:1, today:2, type:2, unstructur:1, upcom:1, use:5, user:3, util:1, vector:1, wide:1, without:2 |
|      3 | add:1, address:1, allow:1, although:1, anoth:1, appear:2, appli:1, approach:1, assess:1, assign:1, better:2, boolean:6, brows:1, calcul:1, cat:6, chang:2, chase:1, classifi:1, collect:2, condit:4, consid:3, contain:1, criteria:1, data:1, date:1, directli:1, dismiss:1, doc:3, document:10, dog:5, drawback:1, earli:1, effici:1, enabl:1, even:1, exampl:1, explor:3, express:1, extend:2, extens:2, filter:4, fit:2, focus:1, foundat:1, frequent:1, furthermor:1, grew:1, hard:1, hit:1, howev:1, hundr:1, impact:2, implement:1, independ:1, index:1, instead:1, interfac:1, introduc:1, key:1, languag:1, larger:1, librari:1, like:1, limit:1, logic:1, lower:1, match:1, meet:2, met:1, meta:1, metadata:1, method:1, model:5, mostli:1, need:1, object:1, often:2, organ:1, partial:2, penalti:1, play:1, post:2, presenc:1, present:1, process:2, public:1, queri:10, relev:6, remov:1, result:4, retriev:2, scenario:1, score:3, search:3, seem:1, set:1, shop:1, sort:6, specif:1, step:2, street:1, studi:1, submit:1, term:3, three:1, two:1, unlik:1, use:2, user:2, walk:1, want:1, way:2, well:2, without:1, word:2, work:1                                                                                                                                                                                                                                                                                           |
|      4 | acceler:1, advanc:1, ai:1, apach:1, appli:1, approach:1, art:1, assess:1, assum:1, base:1, begin:1, bm25:2, boolean:2, candid:1, challeng:1, chapter:3, classic:3, classif:1, collect:1, combin:1, compar:1, conclud:1, condit:1, databas:1, delv:1, descript:1, detail:1, determin:1, dimension:1, discuss:1, doc:3, document:6, emerg:1, establish:1, examin:1, explor:3, extend:2, file:1, filter:3, final:1, follow:2, form:1, gather:1, gener:2, heurist:1, high:1, implement:1, index:2, invert:1, languag:2, larger:1, like:2, linguist:1, list:1, lucen:1, match:1, method:4, model:15, modern:2, natur:1, newer:1, next:1, notabl:1, notion:1, obtain:1, offer:1, one:1, oper:1, packag:2, perform:1, platform:1, popular:2, probabilist:6, probabl:1, process:2, produc:1, public:1, queri:5, randomli:1, rank:2, ranker:1, reduc:1, relat:1, relev:5, repres:1, represent:1, result:2, retriev:11, search:4, set:1, similar:1, simpl:2, softwar:2, sort:1, space:5, standard:1, state:1, studi:1, support:1, techniqu:2, term:2, text:3, uniqu:1, use:2, variou:1, vector:8, vocabulari:1, web:1, word:1, year:1                                                                                                                                                                                                                                                                                                                |
|      5 | acceler:1, add:1, addit:1, advanc:1, along:1, analyz:1, approach:1, best:1, cat:1, challeng:1, chapter:2, collect:1, concis:2, content:2, context:1, crawl:1, creat:1, data:1, depict:1, describ:1, dimension:1, divid:1, doc10:1, docid:1, document:5, dog:1, drive:1, effici:1, explor:1, extract:5, featur:7, file:1, find:1, follow:2, fundament:1, futur:1, high:1, higher:1, home:1, howev:1, includ:1, index:5, insert:1, instead:1, larg:1, left:1, level:1, lie:1, like:1, local:1, main:1, mani:1, meaning:1, metadata:1, method:1, mode:1, new:2, next:1, offlin:4, one:2, onlin:1, page:1, part:1, pass:1, phase:4, place:1, provid:1, queri:4, repres:1, represent:3, scan:2, search:5, see:1, simpl:1, step:1, store:1, system:1, take:1, techniqu:1, text:2, trigger:1, two:1, typic:1, updat:1, use:2, vector:2, word:5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|      6 | accur:1, addit:1, ai:1, analyz:1, assess:1, base:2, best:1, broader:1, case:1, cat:1, challeng:1, chapter:2, close:1, compar:1, consid:2, content:1, correct:2, data:1, doc1:3, doc10:5, doc3:1, doc4:3, doc7:1, document:9, dog:4, effect:1, effici:1, enter:1, explor:1, extract:1, featur:8, file:1, follow:1, gener:1, given:1, goal:1, good:1, handwrit:1, home:3, hound:1, howev:1, includ:2, index:3, invert:1, involv:1, mani:1, match:2, method:2, mistak:2, mode:2, need:1, offlin:1, onlin:3, phase:1, place:1, primari:1, process:2, queri:9, rank:3, recognit:1, relev:3, represent:1, result:1, retriev:3, return:1, rsv:1, search:3, sim:3, similar:4, similarli:1, simpl:1, sole:1, sophist:1, speech:1, spell:2, statu:1, step:1, subsequ:1, suitabl:1, synonym:1, take:1, thu:1, transform:2, two:1, use:4, user:2, valu:1, yet:1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|      7 | addit:1, advanc:4, afghan:7, along:1, alreadi:4, also:1, analysi:1, armi:7, assist:7, attach:10, bombay:4, broken:7, chapter:1, chunk:1, complet:7, contain:1, coordin:1, corp:4, could:7, countri:4, cours:7, deep:4, degre:5, detail:1, discuss:1, doctor:6, document:5, duli:10, earlier:1, end:1, enemi:4, explor:1, extract:4, featur:4, fifth:7, focus:1, found:1, four:1, fundament:2, fusili:7, go:7, holm:3, html:1, includ:1, index:3, india:7, join:7, land:4, later:1, learn:4, london:9, medicin:9, mention:1, metadata:2, netley:7, northumberland:7, obtain:1, offlin:1, outcom:1, overal:1, pass:4, phase:1, pictur:1, prescrib:7, proceed:7, process:1, queri:1, rang:2, regiment:7, represent:1, rest:1, second:7, section:1, sourc:3, split:4, start:1, station:7, step:3, store:1, studi:7, summar:2, surgeon:17, term:2, time:7, token:2, took:5, univers:9, use:1, vocabulari:2, war:7, year:8                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|      8 | abil:1, add:1, adjust:1, advanc:1, afghan:1, alreadi:1, although:1, alway:1, analysi:1, analyz:1, appear:1, appli:1, armi:1, assist:1, attach:1, avail:1, bit:1, bodi:4, bombay:1, broken:1, case:1, charact:2, chunk:1, complet:1, consid:2, contain:3, content:2, contrast:1, control:1, convert:1, corp:1, could:1, countri:1, cours:2, data:1, decid:1, deep:1, defin:1, degre:1, differ:3, doctor:1, document:6, duli:1, encod:2, enemi:1, engin:1, enrich:1, epub:1, everyth:1, exampl:2, extract:6, featur:1, fifth:1, file:1, first:1, flow:1, follow:2, form:1, format:2, fusili:1, go:1, head:2, header:2, hold:1, html:7, identifi:2, imag:1, includ:1, index:1, india:1, inform:8, initi:1, involv:1, join:1, keyword:1, known:1, land:1, languag:1, learn:1, limit:1, london:1, main:2, markup:1, may:3, medicin:1, meta:4, metadata:2, mmir:1, multimedia:1, must:1, name:1, netley:1, northumberland:1, obviou:1, page:3, part:1, pass:1, pdf:1, plain:1, point:1, prescrib:1, present:2, proceed:1, regiment:1, relev:2, repres:1, requir:1, retriev:1, rich:1, scrape:1, screen:1, search:1, second:1, sequenc:2, simpl:1, snippet:1, sourc:2, standard:1, station:1, step:2, stream:1, structur:3, studi:1, support:1, surgeon:2, task:1, term:1, text:6, time:1, titl:2, took:1, typic:1, univers:1, use:4, utf:1, util:1, war:1, web:2, well:1, wide:1, within:1, without:1, year:1                                    |
|      9 | add:1, addit:3, allow:1, also:1, anchor:2, approach:1, approxim:1, aspect:2, assign:2, associ:1, author:2, back:1, bait:1, block:1, bodi:1, bold:1, brief:1, case:1, cautiou:1, certain:1, ch:1, chang:1, charact:1, click:2, comput:1, concis:3, consid:1, contain:1, content:5, control:1, current:1, de:1, deal:1, describ:3, descript:2, differ:1, discuss:1, dmi:1, document:10, effect:2, embed:1, emphas:1, encod:1, enhanc:1, enough:1, enrich:1, escap:1, especi:1, essenti:1, exactli:1, exampl:2, excel:1, fals:1, file:1, flow:1, follow:1, format:1, gener:1, give:1, good:1, handl:1, header:1, headlin:1, high:1, higher:2, hint:1, homepag:1, howev:1, hs23:1, html:3, http:1, illustr:1, importantli:1, includ:3, inform:5, informatik:1, key:2, keyword:4, lectur:1, lehrangebot:1, let:1, life:1, like:1, link:4, main:1, mani:1, match:1, may:3, meta:3, metadata:5, might:1, mmir:1, much:1, multimedia:2, must:1, name:2, natur:1, navig:1, nbsp:1, need:1, nevertheless:1, observ:1, occurr:1, offer:1, often:1, order:1, page:6, part:1, piec:1, promis:1, provid:1, referenc:5, relationship:1, reliabl:1, render:2, retriev:5, reveal:1, scienc:1, section:2, sequenc:1, serv:3, sinc:1, sourc:1, space:1, special:1, studium:1, tag:2, target:1, term:2, text:8, titl:3, translat:1, typic:1, uniba:1, uri:1, us:1, use:3, usual:2, uuml:1, valuabl:1, web:2, weight:3, word:2                                  |
|     10 | absolut:1, across:1, anchor:2, assess:1, basel:1, caution:1, chapter:2, come:1, cours:1, creativ:1, delv:1, describ:1, divers:1, emphas:1, engin:1, filter:1, fri:1, grade:1, great:1, howev:1, human:1, identif:1, illustr:2, imag:1, import:1, incorrect:1, inform:1, lectur:1, link:1, marvel:1, mention:1, metadata:1, multimedia:4, need:1, network:1, object:1, obvious:1, outlier:1, outstand:1, page:2, pagerank:1, potenti:1, provid:1, relev:2, retriev:4, roger:1, search:1, show:1, simplifi:1, sourc:1, subsequ:1, surround:1, target:1, term:2, text:3, uni:1, unlock:1, use:2, weber:1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |

### Bag-of-words requires document frequency and idf weights for each term

In [16]:
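def idf(N, df):
    # smoothed inverse document frequency:
    # N = number of pages in the collection, df = number of pages containing the term
    return math.log10((N + 1) / (df + 1))

# document frequency: count the pages on which each term occurs at least once
terms = defaultdict(int)
for page in collection:
    # go through each distinct term on this page
    for term in set(page):
        terms[term] += 1
# vocabulary maps each term to its document frequency and idf weight
vocabulary = {term: {"df": count, "idf": idf(len(collection), count)} for term, count in terms.items()}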

n = 20
sample = sorted(random.sample(list(vocabulary.items()), n), key=lambda x: x[1]["idf"], reverse=True)
print_table(
    [
        [x[0], f'{x[1]["df"]} / {len(collection)}', f'{x[1]["idf"]:.3f}'] for x in sample
    ],
    ["Term", "df", "idf"]
)

| Term           | df      |   idf |
|:---------------|:--------|------:|
| handwrit       | 1 / 71  | 1.556 |
| hall           | 1 / 71  | 1.556 |
| consumpt       | 1 / 71  | 1.556 |
| slow           | 1 / 71  | 1.556 |
| plot           | 1 / 71  | 1.556 |
| getindexsearch | 1 / 71  | 1.556 |
| is_relev       | 2 / 71  | 1.38  |
| logstash       | 2 / 71  | 1.38  |
| trade          | 2 / 71  | 1.38  |
| attempt        | 2 / 71  | 1.38  |
| minim          | 4 / 71  | 1.158 |
| place          | 5 / 71  | 1.079 |
| corp           | 6 / 71  | 1.012 |
| foundat        | 6 / 71  | 1.012 |
| iter           | 6 / 71  | 1.012 |
| increas        | 7 / 71  | 0.954 |
| even           | 10 / 71 | 0.816 |
| find           | 12 / 71 | 0.743 |
| structur       | 12 / 71 | 0.743 |
| list           | 18 / 71 | 0.579 |
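
As a quick sanity check, the idf column above can be reproduced by hand from the formula in `idf`. The sketch below re-declares the same function so that it runs on its own, and assumes N = 71 pages, as shown in the df column; the chosen df values are simply ones that appear in the sample.

```python
import math

def idf(N, df):
    # same smoothed formula as in the cell above
    return math.log10((N + 1) / (df + 1))

N = 71  # number of pages, taken from the "df" column of the table
for df in (1, 2, 4, 18):
    # e.g. df = 1 gives log10(72 / 2) = log10(36)
    print(f"df = {df:2d} / {N}  ->  idf = {idf(N, df):.3f}")
```

This prints 1.556, 1.380, 1.158 and 0.579, matching the rows for `handwrit`, `logstash`, `minim` and `list`. With this smoothing, a term that occurs on every page gets idf = log10(72 / 72) = 0, so it contributes nothing to the weighted representation.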

### Now we can compute the bag-of-words representation for vector space retrieval

In [17]:
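from collections import Counter

def bag_of_words_idf(tokens, vocabulary):
    # term frequency (tf) of each distinct term on the page
    tf = Counter(tokens)
    # weight each term frequency with the term's idf from the vocabulary
    return [(term, count * vocabulary[term]["idf"]) for term, count in tf.items()]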

n = 10
print_table(
    [
        [
            f"{i+1}",
            ", ".join([f'{x[0]}:{x[1]:.2f}' for x in sorted(bag_of_words_idf(collection[i], vocabulary))])
        ] for i in range(n)
    ],
    ["page", "bag of words (vector space retrieval)"]
)

|   page | bag of words (vector space retrieval)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|-------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|      1 | chapter:0.58, classic:0.95, com:1.16, comput:0.65, dr:1.56, ext:4.67, fundament:1.81, gmail:1.56, index:0.65, introduct:2.51, link:1.81, literatur:2.32, lucen:1.25, model:0.58, multimedia:1.16, open:1.91, retriev:0.86, roger:2.76, scienc:1.26, search:0.49, sourc:1.56, structur:1.49, text:0.92, weber:2.76                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|      2 | addit:1.38, advantag:1.01, affect:1.01, algebra:1.38, algorithm:0.95, allow:0.74, although:0.90, applic:0.90, approach:0.44, atom:1.01, author:1.26, back:1.08, basic:1.08, becam:1.16, boolean:4.45, build:0.82, call:1.26, chapter:0.58, combin:0.90, compar:0.78, complex:0.90, comput:1.96, condit:0.82, consid:1.28, data:1.49, date:1.26, determin:0.71, discuss:1.31, doc:2.57, document:0.45, drive:2.51, due:0.78, earli:1.38, earlier:1.16, easili:2.51, effect:1.01, effici:1.31, enabl:0.74, exampl:0.43, exclud:1.16, expert:1.56, express:1.49, file:1.07, filter:1.31, find:0.74, first:1.36, focus:1.08, formul:1.38, foundat:1.01, fraction:1.56, friendli:1.16, gap:1.56, gener:0.63, gerard:1.38, hard:1.26, hold:1.08, improv:0.74, includ:0.43, index:0.65, input:1.38, integr:0.95, interfac:1.26, introduct:1.26, invert:0.60, jone:1.16, karen:1.16, less:1.08, like:0.92, limit:0.71, local:2.76, make:1.08, match:1.52, media:3.11, metadata:1.49, method:0.30, model:1.73, modern:1.16, need:0.43, number:0.44, observ:0.86, occurr:0.60, offer:0.65, oper:0.78, origin:1.01, other:1.56, otherwis:1.08, pioneer:1.56, popular:1.16, post:0.65, primarili:1.38, probabilist:0.95, process:1.25, progress:1.26, public:1.26, queri:0.67, rang:0.82, rank:1.07, rel:1.16, relev:0.38, repres:0.78, represent:0.78, research:1.38, result:0.58, retriev:1.07, return:0.68, robust:1.38, salton:1.08, satisfi:1.08, scan:2.32, score:0.43, search:0.98, searchabl:1.38, semant:0.82, set:1.43, share:1.16, signific:0.74, simpl:1.20, simplic:1.38, small:1.01, soon:1.38, sort:0.78, space:0.50, sparck:1.16, still:1.42, storag:0.86, structur:0.74, system:2.97, technolog:1.56, term:0.26, text:1.38, textual:1.56, theori:1.56, today:2.76, type:2.32, unstructur:1.56, upcom:1.26, use:0.58, user:1.61, util:0.65, vector:0.41, wide:0.95, without:1.71 |
|      3 | add:0.60, address:0.90, allow:0.74, although:0.90, anoth:1.16, appear:1.49, appli:0.63, approach:0.44, assess:0.68, assign:0.71, better:1.91, boolean:3.34, brows:1.26, calcul:0.56, cat:3.92, chang:2.02, chase:1.56, classifi:1.56, collect:0.95, condit:3.26, consid:1.28, contain:0.39, criteria:1.38, data:0.50, date:1.26, directli:1.26, dismiss:1.56, doc:2.57, document:0.65, dog:4.29, drawback:1.26, earli:1.38, effici:0.65, enabl:0.74, even:0.82, exampl:0.43, explor:2.13, express:0.74, extend:1.71, extens:2.02, filter:2.61, fit:2.51, focus:1.08, foundat:1.01, frequent:0.82, furthermor:1.38, grew:1.56, hard:1.26, hit:1.56, howev:0.35, hundr:1.16, impact:1.49, implement:0.51, independ:0.82, index:0.33, instead:0.90, interfac:1.26, introduc:0.86, key:0.82, languag:0.74, larger:0.82, librari:1.16, like:0.46, limit:0.71, logic:1.56, lower:0.86, match:0.38, meet:2.76, met:1.26, meta:1.26, metadata:0.74, method:0.30, model:1.45, mostli:1.26, need:0.43, object:1.01, often:1.81, organ:1.56, partial:1.63, penalti:1.56, play:1.38, post:1.31, presenc:1.01, present:0.78, process:1.25, public:1.26, queri:1.33, relev:2.28, remov:0.90, result:1.16, retriev:0.43, scenario:0.86, score:1.28, search:0.73, seem:1.38, set:0.48, shop:1.56, sort:4.67, specif:0.90, step:0.99, street:1.38, studi:0.71, submit:1.38, term:0.26, three:0.95, two:0.82, unlik:0.95, use:0.23, user:1.07, walk:1.26, want:1.38, way:2.32, well:1.81, without:0.86, word:0.76, work:1.01                                                                                                                                                                                                                                                                                                                                                                           |
|      4 | acceler:1.26, advanc:0.71, ai:1.26, apach:1.01, appli:0.63, approach:0.44, art:1.38, assess:0.68, assum:0.86, base:0.38, begin:1.38, bm25:1.42, boolean:1.11, candid:1.01, challeng:1.08, chapter:1.74, classic:2.86, classif:1.38, collect:0.48, combin:0.90, compar:0.78, conclud:1.16, condit:0.82, databas:1.01, delv:1.16, descript:1.01, detail:1.08, determin:0.71, dimension:1.08, discuss:0.65, doc:2.57, document:0.39, emerg:1.38, establish:0.90, examin:0.95, explor:2.13, extend:1.71, file:0.54, filter:1.96, final:0.65, follow:0.99, form:0.63, gather:1.38, gener:1.25, heurist:1.01, high:0.74, implement:0.51, index:0.65, invert:0.60, languag:1.49, larger:0.82, like:0.92, linguist:1.08, list:0.58, lucen:0.63, match:0.38, method:1.20, model:4.34, modern:2.32, natur:0.95, newer:1.26, next:0.58, notabl:0.95, notion:1.38, obtain:0.95, offer:0.65, one:0.46, oper:0.78, packag:2.76, perform:0.71, platform:1.26, popular:2.32, probabilist:5.73, probabl:0.90, process:1.25, produc:1.08, public:1.26, queri:0.67, randomli:1.16, rank:1.07, ranker:1.38, reduc:0.78, relat:0.95, relev:1.90, repres:0.78, represent:0.78, result:0.58, retriev:2.35, search:0.98, set:0.48, similar:0.44, simpl:1.20, softwar:2.76, sort:0.78, space:2.48, standard:0.90, state:1.38, studi:0.71, support:0.71, techniqu:1.81, term:0.17, text:1.38, uniqu:1.38, use:0.23, variou:0.78, vector:3.28, vocabulari:0.68, web:1.16, word:0.38, year:0.60                                                                                                                                                                                                                                                                                                                                                                                                                  |
|      5 | acceler:1.26, add:0.60, addit:0.46, advanc:0.71, along:1.08, analyz:0.82, approach:0.44, best:0.90, cat:0.65, challeng:1.08, chapter:1.16, collect:0.48, concis:2.32, content:1.71, context:0.90, crawl:1.56, creat:0.65, data:0.50, depict:1.16, describ:0.90, dimension:1.08, divid:0.95, doc10:1.38, docid:1.16, document:0.32, dog:0.86, drive:1.26, effici:0.65, explor:0.71, extract:4.52, featur:4.98, file:0.54, find:0.74, follow:0.99, fundament:0.90, futur:1.38, high:0.74, higher:0.74, home:1.26, howev:0.35, includ:0.43, index:1.63, insert:1.01, instead:0.90, larg:0.90, left:1.38, level:0.86, lie:1.38, like:0.46, local:1.38, main:0.86, mani:0.63, meaning:1.16, metadata:0.74, method:0.30, mode:1.38, new:0.99, next:0.58, offlin:5.02, one:0.92, onlin:1.38, page:0.78, part:0.86, pass:0.90, phase:5.02, place:1.08, provid:0.46, queri:0.53, repres:0.78, represent:2.33, scan:2.32, search:1.22, see:0.95, simpl:0.60, step:0.50, store:0.78, system:0.74, take:0.82, techniqu:0.90, text:0.92, trigger:1.38, two:0.82, typic:0.82, updat:0.90, use:0.23, vector:0.82, word:1.90                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|      6 | accur:1.26, addit:0.46, ai:1.26, analyz:0.82, assess:0.68, base:0.76, best:0.90, broader:1.38, case:0.82, cat:0.65, challeng:1.08, chapter:1.16, close:0.95, compar:0.78, consid:0.85, content:0.86, correct:2.51, data:0.50, doc1:4.67, doc10:6.90, doc3:1.56, doc4:4.67, doc7:1.56, document:0.58, dog:3.43, effect:1.01, effici:0.65, enter:1.26, explor:0.71, extract:0.90, featur:5.69, file:0.54, follow:0.50, gener:0.63, given:0.78, goal:1.38, good:1.16, handwrit:1.56, home:3.77, hound:1.56, howev:0.35, includ:0.85, index:0.98, invert:0.60, involv:1.08, mani:0.63, match:0.76, method:0.60, mistak:2.76, mode:2.76, need:0.43, offlin:1.26, onlin:4.14, phase:1.26, place:1.08, primari:1.01, process:1.25, queri:1.20, rank:1.61, recognit:1.56, relev:1.14, represent:0.78, result:0.29, retriev:0.64, return:0.68, rsv:1.38, search:0.73, sim:2.33, similar:1.77, similarli:1.16, simpl:0.60, sole:1.16, sophist:1.56, speech:1.56, spell:2.32, statu:1.38, step:0.50, subsequ:0.86, suitabl:1.38, synonym:1.26, take:0.82, thu:1.16, transform:2.16, two:0.82, use:0.47, user:1.07, valu:0.33, yet:1.08                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|      7 | addit:0.46, advanc:2.84, afghan:6.68, along:1.08, alreadi:3.61, also:0.56, analysi:1.08, armi:6.68, assist:6.32, attach:9.54, bombay:4.05, broken:6.68, chapter:0.58, chunk:1.08, complet:5.71, contain:0.39, coordin:1.16, corp:4.05, could:5.20, countri:3.82, cours:6.00, deep:4.05, degre:4.77, detail:1.08, discuss:0.65, doctor:5.73, document:0.32, duli:9.54, earlier:1.16, end:1.01, enemi:4.05, explor:0.71, extract:3.61, featur:2.84, fifth:6.68, focus:1.08, found:1.08, four:1.56, fundament:1.81, fusili:6.68, go:5.45, holm:4.14, html:1.01, includ:0.43, index:0.98, india:6.68, join:6.00, land:4.05, later:0.90, learn:3.82, london:8.13, medicin:8.59, mention:0.90, metadata:1.49, netley:6.68, northumberland:6.68, obtain:0.95, offlin:1.26, outcom:1.01, overal:0.95, pass:3.61, phase:1.26, pictur:1.56, prescrib:6.68, proceed:6.32, process:0.63, queri:0.13, rang:1.63, regiment:6.68, represent:0.78, rest:1.38, second:5.20, section:0.95, sourc:2.33, split:4.05, start:0.90, station:6.68, step:1.49, store:0.78, studi:4.98, summar:2.76, surgeon:16.22, term:0.17, time:3.60, token:1.56, took:4.52, univers:7.34, use:0.12, vocabulari:1.36, war:6.00, year:4.82                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|      8 | abil:1.16, add:0.60, adjust:0.78, advanc:0.71, afghan:0.95, alreadi:0.90, although:0.90, alway:1.16, analysi:1.08, analyz:0.82, appear:0.74, appli:0.63, armi:0.95, assist:0.90, attach:0.95, avail:0.90, bit:1.38, bodi:3.82, bombay:1.01, broken:0.95, case:0.82, charact:1.91, chunk:1.08, complet:0.82, consid:0.85, contain:1.18, content:1.71, contrast:1.38, control:0.82, convert:1.38, corp:1.01, could:0.74, countri:0.95, cours:1.71, data:0.50, decid:1.26, deep:1.01, defin:0.71, degre:0.95, differ:1.61, doctor:0.95, document:0.39, duli:0.95, encod:2.32, enemi:1.01, engin:0.90, enrich:1.38, epub:1.56, everyth:1.56, exampl:0.85, extract:5.42, featur:0.71, fifth:0.95, file:0.54, first:0.68, flow:1.26, follow:0.99, form:0.63, format:2.51, fusili:0.95, go:0.78, head:3.11, header:2.76, hold:1.08, html:7.09, identifi:2.32, imag:1.38, includ:0.43, index:0.33, india:0.95, inform:6.23, initi:0.74, involv:1.08, join:0.86, keyword:1.16, known:1.38, land:1.01, languag:0.74, learn:0.95, limit:0.71, london:0.90, main:1.71, markup:1.56, may:1.96, medicin:0.95, meta:5.02, metadata:1.49, mmir:1.38, multimedia:1.16, must:0.68, name:1.01, netley:0.95, northumberland:0.95, obviou:1.38, page:2.33, part:0.86, pass:0.90, pdf:1.56, plain:1.38, point:1.26, prescrib:0.95, present:1.56, proceed:0.90, regiment:0.95, relev:0.76, repres:0.78, requir:0.65, retriev:0.21, rich:1.38, scrape:1.56, screen:1.56, search:0.24, second:0.74, sequenc:2.32, simpl:0.60, snippet:1.56, sourc:1.56, standard:0.90, station:0.95, step:0.99, stream:0.90, structur:2.23, studi:0.71, support:0.71, surgeon:1.91, task:1.26, term:0.09, text:2.76, time:0.51, titl:1.81, took:0.90, typic:0.82, univers:0.82, use:0.47, utf:1.56, util:0.65, war:0.86, web:2.32, well:0.90, wide:0.95, within:0.63, without:0.86, year:0.60                                 |
|      9 | add:0.60, addit:1.38, allow:0.74, also:0.56, anchor:2.76, approach:0.44, approxim:1.26, aspect:2.76, assign:1.42, associ:1.56, author:2.51, back:1.08, bait:1.56, block:1.56, bodi:0.95, bold:1.56, brief:1.38, case:0.82, cautiou:1.56, certain:1.08, ch:1.56, chang:1.01, charact:0.95, click:3.11, comput:0.65, concis:3.48, consid:0.43, contain:0.39, content:4.29, control:0.82, current:1.16, de:1.56, deal:1.56, describ:2.71, descript:2.02, differ:0.54, discuss:0.65, dmi:1.56, document:0.65, effect:2.02, embed:1.16, emphas:1.38, encod:1.16, enhanc:0.71, enough:1.56, enrich:1.38, escap:1.56, especi:1.26, essenti:1.01, exactli:1.56, exampl:0.85, excel:1.56, fals:1.16, file:0.54, flow:1.26, follow:0.50, format:1.26, gener:0.63, give:1.56, good:1.16, handl:1.08, header:1.38, headlin:1.56, high:0.74, higher:1.49, hint:1.56, homepag:1.56, howev:0.35, hs23:1.56, html:3.04, http:0.95, illustr:1.08, importantli:1.26, includ:1.28, inform:3.89, informatik:1.56, key:1.63, keyword:4.63, lectur:1.38, lehrangebot:1.56, let:0.65, life:1.26, like:0.46, link:3.61, main:0.86, mani:0.63, match:0.38, may:1.96, meta:3.77, metadata:3.72, might:0.65, mmir:1.38, much:1.16, multimedia:2.32, must:0.68, name:2.02, natur:0.95, navig:1.56, nbsp:1.56, need:0.43, nevertheless:1.16, observ:0.86, occurr:0.60, offer:0.65, often:0.90, order:0.78, page:4.67, part:0.86, piec:1.26, promis:1.56, provid:0.46, referenc:7.78, relationship:1.56, reliabl:1.56, render:3.11, retriev:1.07, reveal:1.38, scienc:1.26, section:1.91, sequenc:1.16, serv:2.86, sinc:0.74, sourc:0.78, space:0.50, special:1.38, studium:1.56, tag:3.11, target:1.26, term:0.17, text:3.68, titl:2.71, translat:1.38, typic:0.82, uniba:1.56, uri:1.56, us:0.90, use:0.35, usual:2.32, uuml:1.56, valuabl:1.38, web:2.32, weight:1.74, word:0.76                                |
|     10 | absolut:1.56, across:0.71, anchor:2.76, assess:0.68, basel:1.56, caution:1.56, chapter:1.16, come:1.38, cours:0.86, creativ:1.56, delv:1.16, describ:0.90, divers:1.56, emphas:1.38, engin:0.90, filter:0.65, fri:1.56, grade:1.56, great:1.26, howev:0.35, human:1.56, identif:1.56, illustr:2.16, imag:1.38, import:0.86, incorrect:1.38, inform:0.78, lectur:1.38, link:0.90, marvel:1.56, mention:0.90, metadata:0.74, multimedia:4.63, need:0.43, network:1.56, object:1.01, obvious:1.56, outlier:1.56, outstand:1.56, page:1.56, pagerank:1.56, potenti:1.38, provid:0.46, relev:0.76, retriev:0.86, roger:1.38, search:0.24, show:1.01, simplifi:0.90, sourc:0.78, subsequ:0.86, surround:1.56, target:1.26, term:0.17, text:1.38, uni:1.56, unlock:1.56, use:0.23, weber:1.38                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
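
Each weight in this table is the term frequency on the page multiplied by the term's idf. The sketch below checks two entries against the term counts in the previous table: `surgeon` occurs 17 times on page 7 and `doc10` occurs 5 times on page 6. The df values used here (7 and 2) are assumptions inferred from the printed weights, not values read from the notebook's `vocabulary`.

```python
import math

def idf(N, df):
    # same smoothed formula as used to build the vocabulary
    return math.log10((N + 1) / (df + 1))

N = 71  # number of pages in the collection

# (term, tf on the page, assumed df); the df values are inferred, not given
checks = [("surgeon", 17, 7), ("doc10", 5, 2)]
for term, tf, df in checks:
    weight = tf * idf(N, df)
    print(f"{term}: {tf} * {idf(N, df):.3f} = {weight:.2f}")
```

Under these assumptions the sketch prints 16.22 for `surgeon` and 6.90 for `doc10`, which agrees with the entries for pages 7 and 6 above.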

----