## INET4031 Introduction to Systems

### Introduction to Document Search Engines (how Google works)

This Jupyter Notebook contains code that demonstrates the most basic functionality of a document search engine.

It is intended as a simple introduction to the core concepts of a search engine:

* Document Repository
* Indexing Documents
* Vector Mathematics
* Matching Search Terms to Document Matches

This Notebook contains functionality and concepts that will be required for each student's Final Project in INET4031.


### Import the NumPy Module

To learn more about NumPy, visit the **[NumPy website](https://numpy.org)**.

In [3]:
# Install the numpy (num-pie) module on your computer
#
# The numpy module adds advanced mathematics capabilities to the base install of Python
#
# Use the "pip" installer:
# Windows: pip install numpy
# macOS: pip3 install numpy

import numpy

### File and Text Processing

In this section of code, the sample documents in the document "repository" are opened.  The contents of the files are accessible via the variables file1, file2, file3.

Basic text processing operations are done to prep the text inside each of the documents.  Here the only operation is to strip "newline" characters from the documents.  When indexing documents so that they can be searched, special characters such as newline characters are not necessary and should be removed.
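
The open, strip, and close steps described above can also be sketched as one small helper. This is only an illustrative alternative, not the notebook's own method: the function name `load_documents` is hypothetical, and it assumes each path points to a readable text file.

```python
def load_documents(paths):
    """Read each file and return one list of newline-stripped lines per document.

    load_documents is a hypothetical helper name used only for this sketch.
    """
    documents = []
    for path in paths:
        # "with" closes the file automatically, even if reading raises an error
        with open(path, "r") as handle:
            documents.append([line.rstrip("\n") for line in handle])
    return documents
```

Assuming the three sample files exist in the working directory, `terms1, terms2, terms3 = load_documents(["document1.txt", "document2.txt", "document3.txt"])` would give the same lists as the cell that follows, with no explicit `close()` calls needed.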


In [4]:
#open up our "document library" of three documents
file1 = open("document1.txt", "r")
file2 = open("document2.txt", "r")
file3 = open("document3.txt", "r")

#prompt to enter some search terms
query = input("Please enter your terms: ")

#read each document line by line, stripping the trailing newline from every line
terms1 = [line.rstrip('\n') for line in file1 ]
terms2 = [line.rstrip('\n') for line in file2 ]
terms3 = [line.rstrip('\n') for line in file3 ]

#close the files when done
file1.close()
file2.close()
file3.close()

In [5]:
#not much here so let's print everything out so we can see it at this point
print(terms1)
print(terms2)
print(terms3)
print(query)

['quickly jumping daft zebras vex']
['Waltz nymph for quick jigs vex Bud']
['How quickly daft jumping zebras vex']
waltz nymph


In [6]:
# case doesn't matter here, so convert everything to lowercase.
terms1[0] = terms1[0].lower()
terms2[0] = terms2[0].lower()
terms3[0] = terms3[0].lower()

In [7]:
#split docs and the query into lists (will split on space by default)
terms1 = terms1[0].split()
terms2 = terms2[0].split()
terms3 = terms3[0].split()
#lowercase the query as well, so its terms match the lowercased document terms
query = query.lower().split()

In [8]:
#print the lists out to see what we got
print(terms1)
print(terms2)
print(terms3)
print(query)

['quickly', 'jumping', 'daft', 'zebras', 'vex']
['waltz', 'nymph', 'for', 'quick', 'jigs', 'vex', 'bud']
['how', 'quickly', 'daft', 'jumping', 'zebras', 'vex']
['waltz', 'nymph']
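
The lowercase and split steps above can be combined into a single function that is applied the same way to documents and to the query. This is a minimal sketch; the name `tokenize` is hypothetical and does not come from the notebook.

```python
def tokenize(text):
    """Lowercase a string and split it on whitespace into a list of terms.

    tokenize is a hypothetical helper name used only for this sketch.
    """
    return text.lower().split()


# The same function handles a document line and a query string
print(tokenize("Waltz nymph for quick jigs vex Bud"))
print(tokenize("Waltz Nymph"))
```

Running documents and the query through one function keeps their terms comparable, since any difference in case would otherwise make identical words look like different terms.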


In [9]:
#establish our vocabulary, create a set of all the unique terms
#start by taking the list of terms and creating sets
#having Python turn lists into sets is an easy way to get rid of duplicates...unsure of performance.
vocab1 = set(terms1)
vocab2 = set(terms2)
vocab3 = set(terms3)
vocab4 = set(query)

In [10]:
#union them all together
vocab1 = vocab1.union(vocab2)
vocab = vocab1.union(vocab3)
vocab = vocab.union(vocab4)

In [11]:
#turn the vocab set back into a list
vocab = list(vocab)

In [12]:
#print our vocabulary out
print(vocab)

['nymph', 'waltz', 'jigs', 'quick', 'bud', 'vex', 'jumping', 'daft', 'quickly', 'how', 'for', 'zebras']
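
The set and union steps above can be sketched as one function. The name `build_vocabulary` is hypothetical. Unlike the cells above, this sketch sorts the result: Python sets have no guaranteed order, so the printed vocabulary can differ from run to run, and sorting is one way (an assumption of this sketch, not something the notebook does) to get the same vector layout every time.

```python
def build_vocabulary(*term_lists):
    """Return a sorted list of the unique terms found across all term lists.

    build_vocabulary is a hypothetical helper name used only for this sketch.
    """
    unique_terms = set()
    for terms in term_lists:
        # update() adds every term; duplicates are dropped automatically
        unique_terms.update(terms)
    # sorting fixes the term order, so vector positions are repeatable
    return sorted(unique_terms)


terms1 = ['quickly', 'jumping', 'daft', 'zebras', 'vex']
terms2 = ['waltz', 'nymph', 'for', 'quick', 'jigs', 'vex', 'bud']
terms3 = ['how', 'quickly', 'daft', 'jumping', 'zebras', 'vex']
query = ['waltz', 'nymph']

vocab = build_vocabulary(terms1, terms2, terms3, query)
print(vocab)
print(len(vocab))  # 12 unique terms, the same count as the vocabulary above
```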


In [13]:
#create numpy arrays to create the vectors from the docs, query, and vocab.  put zeroes in by default
v1 = numpy.zeros(len(vocab), dtype=float)
v2 = numpy.zeros(len(vocab), dtype=float)
v3 = numpy.zeros(len(vocab), dtype=float)
#variable s used below since this is the "search" aka query
s  = numpy.zeros(len(vocab), dtype=float)

In [14]:
#print them out.  so now we have multidimensional vectors (aka an array)
#in the next step, each time a doc/query contains a term, we add 1 at that term's position; terms it does not contain stay at 0
print(v1)
print(v2)
print(v3)
print(s)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


In [15]:
#create vector for terms in doc 1
for w in terms1:
    i = vocab.index(w)
    v1[i] += 1

In [16]:
#print it
print(v1)

[0. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 1.]


In [17]:
#create vector for terms in doc 2
for w in terms2:
    i = vocab.index(w)
    v2[i] += 1

In [18]:
print(v2)

[1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 1. 0.]


In [19]:
#create vector for terms in doc 3
for w in terms3:
    i = vocab.index(w)
    v3[i] += 1

In [20]:
print(v3)

[0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 1.]


In [21]:
#create vector for terms in the query
for w in query:
    i = vocab.index(w)
    s[i] += 1

In [22]:
print(s)

[1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
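
The four loops above all follow the same pattern, so they can be sketched as a single function. The name `to_vector` is hypothetical. As an assumed optimization that is not in the notebook, this sketch builds a dictionary of term positions once rather than calling `vocab.index()` for every word.

```python
import numpy


def to_vector(terms, vocab):
    """Count how many times each vocabulary term occurs in terms.

    to_vector is a hypothetical helper name used only for this sketch.
    """
    # map each term to its position in the vocabulary once, up front
    position = {term: i for i, term in enumerate(vocab)}
    vector = numpy.zeros(len(vocab), dtype=float)
    for term in terms:
        vector[position[term]] += 1
    return vector


# vocabulary in the same order as printed earlier in this notebook
vocab = ['nymph', 'waltz', 'jigs', 'quick', 'bud', 'vex',
         'jumping', 'daft', 'quickly', 'how', 'for', 'zebras']

v2 = to_vector(['waltz', 'nymph', 'for', 'quick', 'jigs', 'vex', 'bud'], vocab)
s = to_vector(['waltz', 'nymph'], vocab)
print(v2)
print(s)
```

With the vocabulary in the order shown, `v2` and `s` hold the same values as the vectors printed above. A term that is missing from the vocabulary raises a `KeyError` here, just as `vocab.index()` raises a `ValueError` in the original loops.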


In [23]:
#compute the cosine between the query (s) and doc 1
cos = numpy.dot(s, v1) / (numpy.sqrt(numpy.dot(s,s))*numpy.sqrt(numpy.dot(v1,v1)))

In [24]:
print(cos)

0.0


In [25]:
#compute the cosine between the query (s) and doc 2
cos = numpy.dot(s, v2) / (numpy.sqrt(numpy.dot(s,s))*numpy.sqrt(numpy.dot(v2,v2)))

In [26]:
print(cos)

0.5345224838248487


In [27]:
#compute the cosine between the query (s) and doc 3
cos = numpy.dot(s, v3) / (numpy.sqrt(numpy.dot(s,s))*numpy.sqrt(numpy.dot(v3,v3)))

In [28]:
print(cos)

0.0
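
The three cosine computations above use the same formula, the dot product of the two vectors divided by the product of their lengths. A minimal sketch of that formula as a reusable function, followed by a ranking of the documents, might look like the following. The names `cosine_similarity`, `documents`, `scores`, and `ranked` are hypothetical, and the zero-vector guard is an added assumption that the notebook does not include.

```python
import numpy


def cosine_similarity(a, b):
    """Cosine of the angle between a and b, or 0.0 if either vector is all zeros.

    cosine_similarity is a hypothetical helper name used only for this sketch.
    """
    norm_product = numpy.linalg.norm(a) * numpy.linalg.norm(b)
    if norm_product == 0:
        # an all-zero vector would otherwise cause a division by zero
        return 0.0
    return float(numpy.dot(a, b) / norm_product)


# vectors copied from the output printed earlier in this notebook
s = numpy.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
documents = {
    "document1.txt": numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1], dtype=float),
    "document2.txt": numpy.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0], dtype=float),
    "document3.txt": numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1], dtype=float),
}

# score every document against the query, then list the best match first
scores = {name: cosine_similarity(s, v) for name, v in documents.items()}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(name, round(score, 4))
```

Sorting by score puts document2.txt first, since it is the only document that contains the query terms; the other two documents share no terms with the query and score 0.0.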
