# INFO 4271 - Exercise 1 - Web Crawling

Issued: April 16, 2024

Due: April 22, 2024

Please submit this filled sheet via Ilias by the due date.

---

# 1. Duplicate Detection
When crawling large numbers of Web pages, we are likely to encounter a considerable number of duplicate documents. To avoid flooding our index with replicas of the same documents, we need a duplicate detection scheme.

a) Using Python's built-in hash() function, process the following documents in order of appearance and flag any exact duplicates.

- **D1** "This is just some document"
- **D2** "This is another piece of text"
- **D3** "This is another piece of text"
- **D4** "This is just some documents"
- **D5** "Totally different stuff"

In [1]:
#Check a single document against an existing collection of previously seen documents for exact duplicates.
def check_exct(doc, docs):
    # doc is an [id, text] pair; docs is a list of such pairs
    seen_hashes = {hash(d[1]) for d in docs}
    return hash(doc[1]) in seen_hashes
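
The check above rebuilds the hashes of all stored documents on every call. A minimal sketch of an alternative, assuming documents arrive as (id, text) pairs, keeps one set of hashes for the whole crawl. `find_exact_duplicates` is a hypothetical helper name and not part of the exercise template.

```python
def find_exact_duplicates(crawl):
    """Return the IDs of documents whose text hash was already seen earlier."""
    seen_hashes = set()
    duplicates = []
    for doc_id, text in crawl:
        h = hash(text)
        if h in seen_hashes:
            duplicates.append(doc_id)
        else:
            seen_hashes.add(h)
    return duplicates


example_crawl = [
    ("D1", "This is just some document"),
    ("D2", "This is another piece of text"),
    ("D3", "This is another piece of text"),
    ("D4", "This is just some documents"),
    ("D5", "Totally different stuff"),
]
print(find_exact_duplicates(example_crawl))
```

Only D3 should be flagged, since D4 differs from D1 by one character. Note that Python salts `hash()` for strings per interpreter process unless `PYTHONHASHSEED` is set, so these hash values are only comparable within a single run. A crawler that persists hashes to disk would likely need a stable digest, for example from `hashlib`.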

b) Going beyond exact duplicates, we want to also identify any near-duplicates that are very similar but not identical to previously seen content. Implement the SimHash method discussed in class and again process the five documents, this time flagging up exact and near duplicates.

In [9]:
import hashlib

# example set of stopwords 
# (instead of using nltk stopwords list for simplicity)
stopwords = set(['.', ',', 'i', 'the', 'of', '...']) 

def create_simhash(doc) -> list:
    # determine document word frequency
    word_freq = {} #dictionary of word frequency
    for word in doc.split():
        if word in stopwords:
            continue
        if word in word_freq:
            word_freq[word] += 1
        else:
            word_freq[word] = 1
    # create an 8-bit hash value for each word, map 0 bits to -1, and weight by word frequency
    weighted_hash_values = []
    for word in word_freq:
        hash_object = hashlib.md5(word.encode('utf-8')) 
        hash_value = int(hash_object.hexdigest(), 16) % 256
        hash_value = [int(b) for b in bin(hash_value)[2:].zfill(8)]
        # turn each 0 into -1 
        hash_value = [x if x > 0 else -1 for x in hash_value]
        # multiply hash value by word frequency
        weighted_hash_value = [hv * word_freq[word] for hv in hash_value] 
        weighted_hash_values.append(weighted_hash_value)
    
    # sum the weighted bit vectors position by position
    summed_hash_values = [sum(x) for x in zip(*weighted_hash_values)]
    # positive sums become 1, all others 0
    summed_hash_values = [1 if x > 0 else 0 for x in summed_hash_values]
    # the resulting 8-bit list is the document's SimHash fingerprint
    return summed_hash_values

#Check a single document against an existing collection of previously seen documents for near duplicates
def check_simhash(doc, docs):
    docs = [d[1] for d in docs]
    doc = doc[1]
    print(doc, docs)
    simhash_doc = create_simhash(doc)
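    for d in docs:
        simhash_d = create_simhash(d)
        # Hamming distance: number of bit positions in which the two fingerprints differ
        hamming_distance = sum(a != b for a, b in zip(simhash_doc, simhash_d))
        # fewer than 3 differing bits (out of 8) counts as a near duplicate
        if hamming_distance < 3:
            return True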

    return False
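
check_simhash() recomputes the fingerprint of every stored document each time a new document arrives. A possible refinement, sketched here under the assumption that fingerprints are 8-element lists of 0/1 as returned by create_simhash(), is to store each fingerprint once and compare only the fingerprints. The helper names and the example fingerprints below are made up for illustration and are not actual SimHash values of D1 to D5.

```python
def hamming_distance(bits_a, bits_b):
    """Count the positions in which two equal-length bit lists differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))


def is_near_duplicate(fingerprint, seen_fingerprints, threshold=3):
    """True if any stored fingerprint differs in fewer than `threshold` bits."""
    return any(
        hamming_distance(fingerprint, seen) < threshold
        for seen in seen_fingerprints
    )


# Made-up fingerprints, for illustration only
seen = [[1, 0, 1, 1, 0, 0, 1, 0]]
one_bit_off = [1, 0, 1, 1, 0, 1, 1, 0]   # differs from the stored one in 1 position
all_bits_off = [0, 1, 0, 0, 1, 1, 0, 1]  # differs from the stored one in all 8 positions

print(is_near_duplicate(one_bit_off, seen))   # True
print(is_near_duplicate(all_bits_off, seen))  # False
```

The threshold of 3 mirrors the cut-off used in check_simhash(). With only 8 bits, unrelated documents can land within that distance by chance, so a longer fingerprint would probably be preferable in practice.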



In [10]:
crawl = [['D1', 'This is just some document'], ['D2', 'This is another piece of text'], ['D3', 'This is another piece of text'], ['D4', 'This is just some documents'], ['D5', 'Totally different stuff']]

#Process raw crawled website content
def process(crawl):
    docs = []
    for doc in crawl:
        if check_simhash(doc, docs): #Can be exchanged for check_exct()
            print('DUPLICATE: '+doc[0])
        else:
            docs.append(doc)
    print(docs)
process(crawl)

This is just some document []
This is another piece of text ['This is just some document']
This is another piece of text ['This is just some document', 'This is another piece of text']
DUPLICATE: D3
This is just some documents ['This is just some document', 'This is another piece of text']
DUPLICATE: D4
Totally different stuff ['This is just some document', 'This is another piece of text']
[['D1', 'This is just some document'], ['D2', 'This is another piece of text'], ['D5', 'Totally different stuff']]


# 2. Focused Search Engines
Suppose you were to build a COVID-19 Web search engine for which you want to collect and eventually serve only COVID-19 information. The general web crawling process follows this scheme:

1. Create a seed set of known URLs (a.k.a. the frontier)
2. Pull a URL from the frontier and visit it
3. Save the page content for our search engine (indexing)
4. Once on the page, note down all URLs linked there
5. Put all encountered URLs in the queue (the frontier)
6. Repeat from Step 2 until the queue is empty
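
The six steps can be sketched as a short loop. This is only an illustration under stated assumptions: `TOY_WEB` is a hypothetical in-memory stand-in for the Web, `fetch` replaces a real HTTP request, and `max_pages` is an added safety budget that is not part of the scheme.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links)
TOY_WEB = {
    "a.example/start": ("start page", ["a.example/x", "a.example/y"]),
    "a.example/x": ("page x", ["a.example/y"]),
    "a.example/y": ("page y", []),
}


def fetch(url):
    """Stand-in for an HTTP request: return (text, links) for a URL."""
    return TOY_WEB.get(url, ("", []))


def basic_crawl(seeds, max_pages=100):
    frontier = deque(seeds)                      # Step 1: seed the frontier
    index = []
    while frontier and len(index) < max_pages:   # Step 6: repeat until empty
        url = frontier.popleft()                 # Step 2: pull a URL and visit it
        text, links = fetch(url)
        index.append((url, text))                # Step 3: save the page content
        frontier.extend(links)                   # Steps 4 and 5: queue all links
    return index


visited = [url for url, _ in basic_crawl(["a.example/start"])]
print(visited)
```

In this toy graph, `a.example/y` is linked from two pages and therefore appears twice in `visited`. With cyclic links the plain scheme would only stop because of the `max_pages` budget.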

In this particular setting, how should the generic step-by-step crawling process be modified/extended? Discuss all relevant considerations:

### General considerations:
1. Do not revisit the same pages
2. Do not crawl pages with almost identical content
3. Be polite: do not request the same page (or pages from the same host) too often within a short time frame
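
Points 1 and 3 can be sketched with a set of already seen URLs and a per-host timestamp. This is a rough sketch: `HostPoliteness`, `canonical`, the 2-second delay, and the example URLs are all assumptions made for illustration. Point 2 corresponds to the near-duplicate check from Exercise 1.

```python
from urllib.parse import urldefrag, urlsplit


class HostPoliteness:
    """Track the last request time per host and report how long to wait."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self.last_access = {}  # host -> time of the last request

    def wait_time(self, url, now):
        last = self.last_access.get(urlsplit(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record(self, url, now):
        self.last_access[urlsplit(url).netloc] = now


def canonical(url):
    """Drop the #fragment so links to the same page compare equal."""
    return urldefrag(url).url


seen_urls = set()
politeness = HostPoliteness(min_delay_seconds=2.0)

# Pretend the first page was fetched at time 0.0 (times are passed in explicitly)
seen_urls.add(canonical("https://a.example/page"))
politeness.record("https://a.example/page", now=0.0)

already_seen = canonical("https://a.example/page#section") in seen_urls
wait_soon = politeness.wait_time("https://a.example/other", now=0.5)
wait_later = politeness.wait_time("https://a.example/other", now=5.0)
print(already_seen, wait_soon, wait_later)  # True 1.5 0.0
```

Passing `now` explicitly keeps the sketch deterministic. A real crawler would presumably read a clock such as `time.monotonic()` and sleep for the returned wait time.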

### COVID-19-specific considerations:
1. Content saving: When saving the page content, a filter should be applied to ensure the content is relevant to COVID-19. This could involve checking for COVID-19 related keywords, and similarly...
2. Link extraction: Not all URLs from a website should be added to the queue. Only URLs that are likely to lead to COVID-19 related information should be added.
3. Repetition/Freshness: To keep the information fresh, the crawler should periodically revisit the URLs in the seed set and any other URLs that have been identified as important sources of COVID-19 information, for example by flagging such URLs as important during the crawling process.
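
The three points could be approximated with simple heuristics. The sketch below is one possible reading, not a definitive design: the keyword set, the 5% relevance threshold, the revisit intervals, and all function names and URLs are assumptions chosen for illustration.

```python
import re

# Illustrative keyword list (assumption); a real system would need a richer vocabulary
COVID_KEYWORDS = {"covid", "covid-19", "coronavirus", "sars-cov-2"}


def relevance(text):
    """Fraction of tokens in `text` that are COVID-19 keywords."""
    tokens = re.findall(r"[a-z0-9-]+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in COVID_KEYWORDS) / len(tokens)


def is_relevant(text, threshold=0.05):
    """Point 1: only save pages whose keyword share reaches the threshold."""
    return relevance(text) >= threshold


def promising_links(links):
    """Point 2: keep links whose URL or anchor text mentions a keyword.

    `links` is a list of (url, anchor_text) pairs.
    """
    return [
        url for url, anchor in links
        if relevance(url) > 0 or relevance(anchor) > 0
    ]


def due_for_revisit(last_crawled, now, important,
                    base_interval=7 * 24 * 3600,
                    important_interval=24 * 3600):
    """Point 3: important sources are revisited more often (times in seconds)."""
    interval = important_interval if important else base_interval
    return now - last_crawled >= interval


links = [
    ("https://news.example/covid-19/update", "Latest numbers"),
    ("https://news.example/sports", "Football"),
    ("https://news.example/health", "Coronavirus FAQ"),
]
print(is_relevant("COVID-19 vaccination rates are rising"))  # True
print(promising_links(links))
```

Keyword matching is a crude filter. A trained text classifier would likely separate relevant from irrelevant pages more reliably, at the cost of needing labelled examples.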
