# Self study 1

Self studies should be solved individually, or in small groups of 2-3 students. There is no hand-in of your solutions to the self studies. However, you can bring your solutions to the exam and use them as the basis for your answers to the exam questions.

In this self study we construct a simple crawler. Concretely, you should:

* Select about 5 seed urls, e.g. homepages of universities, e-commerce sites, or similar

* Start crawling from these seeds. Define a strategy for selecting the next url to be crawled. What kind of prioritization (if any) is embodied in your strategy?

* Make sure you obey the robots.txt file, and ensure that at least 2 seconds elapse between requests to the same host

* Stop when you have crawled approx. 1000 pages

* For each crawled page, save the url and the text string contained in the 'title' element of the document (we do not want to handle the full text of the pages at this point).

* You can repeat this several times, using different seed sets and/or prioritization strategies.
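The steps above can be sketched as a single loop. This is a minimal sketch, not a reference solution: `crawl_bfs` and `fetch` are hypothetical names, where `fetch` is a callable you supply that downloads a page and returns its title and outgoing links. The robots.txt check and the per-host delay are left out here and would go inside or around `fetch`. A FIFO queue gives breadth-first order, so the implied prioritization is "pages close to the seeds first".

```python
from collections import deque

def crawl_bfs(seeds, fetch, max_pages=1000):
    """Breadth-first crawl; returns a list of (url, title) pairs.

    `fetch` is a hypothetical callable: fetch(url) -> (title, links).
    """
    frontier = deque(seeds)   # FIFO queue: pages near the seeds come first
    seen = set(seeds)         # URLs already queued, to avoid duplicates
    results = []

    while frontier and len(results) < max_pages:
        url = frontier.popleft()
        title, links = fetch(url)
        results.append((url, title))
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return results


# Usage with a tiny in-memory "web" instead of real HTTP requests
fake_web = {
    "a": ("Page A", ["b", "c"]),
    "b": ("Page B", ["a", "d"]),
    "c": ("Page C", []),
    "d": ("Page D", []),
}
pages = crawl_bfs(["a"], lambda u: fake_web[u], max_pages=3)
print(pages)
```

Swapping the `deque` for a priority queue or a random choice changes the prioritization strategy without touching the rest of the loop.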

The following two self studies will extend the work that you do in this self study.

The following introduces a few helpful libraries and essential functions. You can use these methods, or use other tools that you are already familiar with and/or prefer to work with. 

A simple crawler implementation can be based on the 'requests' package [https://requests.readthedocs.io/en/master/](https://requests.readthedocs.io/en/master/) for retrieving html documents, and the BeautifulSoup parser [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for parsing the html.

In [1]:
import requests
from bs4 import BeautifulSoup
from time import sleep
from urllib.robotparser import RobotFileParser
import random
import time

Let's start crawling at https://www.aau.dk/ . We first retrieve the robots.txt file and check whether we are allowed to crawl the top-level url:

In [3]:
rp=RobotFileParser()
rp.set_url("https://www.aau.dk/robots.txt") # set_url expects the location of the robots.txt file itself
rp.read()
print(rp.can_fetch("*","https://www.aau.dk"))

True
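Since robots.txt is defined per host, a crawler that visits many hosts needs one parser per host. The following is a minimal sketch under stated assumptions: `robots_url`, `allowed` and the `cache` dictionary are hypothetical helpers, not part of any library. To keep the example runnable without network access, the demonstration fills the cache with rules parsed from a list of lines via `RobotFileParser.parse`.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def robots_url(url):
    """Return the robots.txt URL for the host that serves `url`."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

def allowed(url, cache, agent="*"):
    """Check `url` against its host's robots.txt, fetching the file once per host."""
    host = urlsplit(url).netloc
    if host not in cache:
        rp = RobotFileParser(robots_url(url))
        rp.read()              # network access: downloads robots.txt
        cache[host] = rp
    return cache[host].can_fetch(agent, url)

# Offline demonstration: parse rules from a list of lines instead of downloading them
rp_demo = RobotFileParser()
rp_demo.parse(["User-agent: *", "Disallow: /private/"])
cache = {"example.org": rp_demo}   # pre-filled cache, so no request is made

print(robots_url("https://example.org/a/b.html"))
print(allowed("https://example.org/index.html", cache))
print(allowed("https://example.org/private/x.html", cache))
```

Caching the parser avoids downloading robots.txt again for every page on the same host.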


We can now get the html using the requests package, which returns a response object:

In [4]:
r=requests.get('https://www.aau.dk/')
print(type(r))

<class 'requests.models.Response'>


The raw response body (as bytes) is accessible via the content attribute:

In [11]:
r.content

b'<!DOCTYPE html><html><head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width"/><title>AAU - Viden for verden - Aalborg Universitet</title><meta name="description" content="Aalborg Universitet - Problem- og projektbaseret forskning og uddannelse, der i samspil mellem AAU og omverdenen skaber viden, der forandrer verden."/><meta name="robots" content="follow, index"/><link rel="canonical" href="https://www.aau.dk/"/><meta property="og:type" content="website"/><meta property="og:site_name" content="Aalborg Universitet"/><meta property="og:url" content="https://www.aau.dk/"/><meta property="og:title" content="AAU - Viden for verden"/><meta property="og:description" content="Aalborg Universitet - Problem- og projektbaseret forskning og uddannelse, der i samspil mellem AAU og omverdenen skaber viden, der forandrer verden."/><meta property="og:image" content="https://prod-aaudxp-cms-001-app.azurewebsites.net/media/qughn2e5/billeder-38.jpg"/><meta name="twitter:card" c

For serious parsing, we can use the BeautifulSoup html parser:

In [12]:
r_parse = BeautifulSoup(r.text, 'html.parser')
print(r_parse.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width" name="viewport"/>
  <title>
   AAU - Viden for verden - Aalborg Universitet
  </title>
  <meta content="Aalborg Universitet - Problem- og projektbaseret forskning og uddannelse, der i samspil mellem AAU og omverdenen skaber viden, der forandrer verden." name="description"/>
  <meta content="follow, index" name="robots"/>
  <link href="https://www.aau.dk/" rel="canonical"/>
  <meta content="website" property="og:type"/>
  <meta content="Aalborg Universitet" property="og:site_name"/>
  <meta content="https://www.aau.dk/" property="og:url"/>
  <meta content="AAU - Viden for verden" property="og:title"/>
  <meta content="Aalborg Universitet - Problem- og projektbaseret forskning og uddannelse, der i samspil mellem AAU og omverdenen skaber viden, der forandrer verden." property="og:description"/>
  <meta content="https://prod-aaudxp-cms-001-app.azurewebsites.net/media/qughn2e5/billeder-38.jpg" prop

We can get the title:

In [13]:
print(r_parse.find('title'))
print(r_parse.find('title').string)

<title>AAU - Viden for verden - Aalborg Universitet</title>
AAU - Viden for verden - Aalborg Universitet


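Importantly, we can get all the links on the page. Note that many of them are relative (e.g. /uddannelser) and must be resolved against the page url before they can be crawled. The loop also illustrates the sleep() function for implementing time delays (it will take a while to complete; use the "interrupt kernel" button to terminate early):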

In [14]:
for a in r_parse.find_all('a', href=True): # only anchors that actually carry an href attribute
    sleep(1)
    print(a['href'])

#main
https://www.aau.dk/
/uddannelser
/uddannelser/kandidat
/uddannelser/sidefag-tilvalgsfag
/uddannelser/studiebyer
/uddannelser/su
/uddannelser/sps
/forskning
/forskning/forskningsnyt
/forskning/phd
/forskning/tvaervidenskabelig-forskning
/samarbejde
/om-aau
/om-aau/kontakt
/nyheder
/nyheder/pressen
/nyheder/podcasts-fra-aau
https://www.en.aau.dk/
https://www.search.aau.dk?site=www.aau.dk&locale=da&mobile=false
https://www.aau.dk/nyheder
https://www.aau.dk/arrangementer
https://www.aau.dk/nyheder/pressen
https://www.aau.dk/om-aau/profil/baeredygtighed
https://www.aau.dk/om-aau/profil/ranking
https://www.aau.dk/alumni
https://www.aau.dk/mads-pagh-nielsen-kares-som-arets-underviser-af-uddannelses-og-forskningsministeriet-n90863
https://www.aau.dk/mads-pagh-nielsen-kares-som-arets-underviser-af-uddannelses-og-forskningsministeriet-n90863
https://www.aau.dk/uddannelser/bachelor/liste
https://www.aau.dk/uddannelser/kandidat/liste
https://www.aau.dk/uddannelser/stem-uddannelser/ingeniorud
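The output above mixes absolute urls, root-relative paths and a fragment (`#main`). Before such links go into a frontier they can be normalized with the standard library. This is a minimal sketch; `normalize_links` is a hypothetical helper name, and the `mailto:` entry in the sample list is an invented example of a link that should be skipped.

```python
from urllib.parse import urljoin, urldefrag, urlsplit

def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, drop fragments, keep unique http(s) URLs."""
    result = []
    seen = set()
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlsplit(absolute).scheme not in ("http", "https"):
            continue           # skip mailto:, javascript:, tel:, ...
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

hrefs = ["#main", "https://www.aau.dk/", "/uddannelser", "/forskning/phd",
         "https://www.en.aau.dk/", "mailto:someone@example.org"]
links = normalize_links("https://www.aau.dk/", hrefs)
print(links)
```

With BeautifulSoup, the `hrefs` list could be built as `[a['href'] for a in r_parse.find_all('a', href=True)]`.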

In [118]:
from urllib.parse import urljoin

def remHeap(heap, name):
    # build a new list; deleting from a list while iterating over it skips elements
    return [el for el in heap if el['hostname'] != name]

def crawl2(frontier, heap, rp): # periodic (general picture of the web): a recently visited url has lower priority
    while len(frontier) > 0:
        url = frontier.pop(0) # use the frontier as a queue (BFS) and dequeue its head
        rp.set_url(urljoin(url, '/robots.txt')) # robots.txt lives at the root of the host
        rp.read()

        if not rp.can_fetch("*", url): continue # obey the robots.txt file

        r = requests.get(url)
        r_parse = BeautifulSoup(r.text, 'html.parser')

        for i, link in enumerate(r_parse.find_all('a', href=True)): # only anchors that carry an href
            if i == 30: break # look at no more than 30 links per page
            _link = link['href']

            if _link.startswith('#'): continue # skip in-page anchors such as '#main'
            _link = urljoin(url, _link) # resolve relative links against the current page

            # save the link and the anchor's title attribute in a dict
            savedUrl.update({_link: link.get('title')})

            frontier.append(_link)
            #sleep(1)

        heap = remHeap(heap, url)

        # move a randomly chosen url to the head of the queue
        if frontier:
            choice = random.choice(frontier)
            frontier.remove(choice)
            frontier.insert(0, choice)
        print(frontier)
        print(heap)
        print(savedUrl)
        break # stop after one page while testing


seeds = ['https://www.aau.dk/', 'https://www.instagram.com/', 'https://www.facebook.com/', 'https://www.bilka.dk/', 'https://www.python.org/']
savedUrl = {}
frontier = list(seeds) # copy, so that changes to the frontier do not modify the seed list
rp=RobotFileParser()
startSec = round(time.time())

frontQueue = []
nextIndex = 0
backQueue = []
heap = [dict(hostname=x, backqueue=0) for x in seeds]

def initFrontQueue(arr, nextIndex):
    for i,x in enumerate(arr):
        frontQueue.append([{x:i}])
        nextIndex = i+1
    return nextIndex

def initBackQueue(arr):
    for x in arr:
        backQueue.append([x])

def useFrontQueue(chosen=""): # does not update priority numbers; picks a random queue unless `chosen` is given
    idx = -1
    if(chosen):
        # look for the front queue whose head element is `chosen`
        for i, headFQ in enumerate(getHeadFrontQueue()):
            if(headFQ == chosen):
                idx = i
                print("found", idx, chosen)
                break
        # if no head matches, idx stays -1 and the last front queue is used;
        # with an empty frontQueue the lookup below raises IndexError
    else:
        idx = random.randint(0,len(frontQueue)-1)

    selectedFQ = frontQueue[idx]
    selectedUrl = list(selectedFQ[0].keys())[0]

    frontQueue[idx].pop(0) #deletes head element in selected FQ
    if len(selectedFQ) == 0: del frontQueue[idx] #deletes list if empty

    return selectedUrl

def useBackQueue(request):
    for i,entry in enumerate(heap):
        if entry['hostname'] == request:
            if entry['backqueue'] != 0: return False
            if entry['backqueue'] == 0:
                entry['backqueue'] = 2
                headElemIdx = -1
                for i,lst in enumerate(backQueue):
                    if entry['hostname'] == lst[0]:
                        headElemIdx = i
                        break
                backQueue[headElemIdx].pop(0) #deletes head element in selected BQ
                if len(backQueue[headElemIdx]) == 0: 
                    del backQueue[headElemIdx]
                    nextUrl = useFrontQueue(entry['hostname'])
                    print("NEXtURL:", nextUrl)
                    print(backQueue)
                    for e in heap:
                        if (e['hostname'] == nextUrl):
                            request = useBackQueue(nextUrl)
                        else:
                            pass 
                            #unsure what enqueue url in the empty queue
                            #or update heap and host dictionary means #slide21 lecture 1
                return request
    return False

def getHeadFrontQueue():
    _keys = []
    for lst in frontQueue:
        _keys.append(list(lst[0].keys())[0])
    return _keys

def getHeadDicts(dicts):
    _keys = []
    for lst in dicts:
        _keys.append(list(lst.keys())[0])
    return _keys

from urllib.parse import urljoin

def crawl(url, nextIndex, rp):
    rp.set_url(urljoin(url, '/robots.txt')) # robots.txt lives at the root of the host
    rp.read()

    if not rp.can_fetch("*", url): return nextIndex # has to obey robots.txt file

    r = requests.get(url)
    r_parse = BeautifulSoup(r.text, 'html.parser')

    for i, link in enumerate(r_parse.find_all('a', href=True)): # only anchors that carry an href
        if i == 10: break # look at no more than 10 links per page
        _link = link['href']

        if _link.startswith('#'): continue # skip in-page anchors such as '#main'
        _link = urljoin(url, _link) # resolve relative links against the current page

        headFQ = getHeadFrontQueue()

        if url in headFQ:
            _idx = headFQ.index(url)
            if _link not in getHeadDicts(frontQueue[_idx]):
                frontQueue[_idx].append({_link:nextIndex})
                nextIndex+=1
        savedUrl.update({_link: link.get('title')}) # save the link and the anchor's title attribute in a dict
    return nextIndex

response = seeds[0]

nextIndex = initFrontQueue(seeds, nextIndex)
initBackQueue(seeds)
for i in range(10):
    #print(frontQueue)
    #print(savedUrl)
    selectedUrl = useFrontQueue()
    response = useBackQueue(selectedUrl)
    if response == False: continue
    nextIndex = crawl(response, nextIndex, rp)

#TODO: frontqueue need integration if backqueue calls for next link from FQ and same head

print("Done!")



NEXtURL: https://www.python.org/
[['https://www.aau.dk/'], ['https://www.facebook.com/'], ['https://www.bilka.dk/'], ['https://www.python.org/']]
NEXtURL: https://www.bilka.dk/
[['https://www.aau.dk/'], ['https://www.facebook.com/'], ['https://www.bilka.dk/']]
NEXtURL: https://www.facebook.com/
[['https://www.aau.dk/'], ['https://www.facebook.com/']]
NEXtURL: https://www.aau.dk/
[['https://www.aau.dk/']]


IndexError: list index out of range
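The back-queue part of the code above aims at a frontier with one queue per host plus a heap that records when each host may be contacted again. The following is a minimal sketch of that idea under stated assumptions, not the lecture's reference design: `PoliteFrontier`, its method names and the injectable `clock` are hypothetical, and the front queues (prioritization) are omitted. Keeping a `ready_at` time per host is one possible reading of "update heap and host dictionary".

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class PoliteFrontier:
    """One FIFO queue per host, plus a heap of (earliest fetch time, host)."""

    def __init__(self, delay=2.0, clock=time.monotonic):
        self.delay = delay                 # minimum seconds between requests to one host
        self.clock = clock                 # injectable, so the class can be tested without waiting
        self.queues = defaultdict(deque)   # host -> URLs waiting for that host
        self.ready_at = {}                 # host -> earliest time of the next request
        self.heap = []                     # (ready_at, host) for hosts with waiting URLs

    def add(self, url):
        host = urlsplit(url).netloc
        if not self.queues[host]:
            # the host currently has no heap entry: (re)insert it
            heapq.heappush(self.heap, (self.ready_at.get(host, self.clock()), host))
        self.queues[host].append(url)

    def pop(self):
        """Return (url, seconds_to_wait), or None when the frontier is empty."""
        if not self.heap:
            return None
        ready, host = heapq.heappop(self.heap)
        now = self.clock()
        fetch_time = max(ready, now)       # the request cannot happen before `ready`
        url = self.queues[host].popleft()
        self.ready_at[host] = fetch_time + self.delay
        if self.queues[host]:
            heapq.heappush(self.heap, (self.ready_at[host], host))
        return url, fetch_time - now


frontier_demo = PoliteFrontier(delay=2.0)
for u in ["https://a.example/1", "https://a.example/2", "https://b.example/1"]:
    frontier_demo.add(u)

while (item := frontier_demo.pop()) is not None:
    url, wait = item
    # a real crawler would call time.sleep(wait) before requesting `url`
    print(f"wait {wait:.1f}s, then fetch {url}")
```

Because an empty frontier is reported as `None` instead of indexing into an empty list, the caller can end the crawl loop cleanly.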

In [91]:
def updHeap(heap):
    for el in heap:
        if(el['backqueue'] > 0):
            el['backqueue'] -= 1
    return heap

heap = []

for i in range(10):
    heap.append(dict(hostname=f"test{i}", backqueue=random.randint(1,2)))

for i, el in enumerate(heap):
    if el['hostname'] == 'test5':
        del heap[i]
print(heap)
#heap = updHeap(heap)




[{'hostname': 'test0', 'backqueue': 1}, {'hostname': 'test1', 'backqueue': 1}, {'hostname': 'test2', 'backqueue': 2}, {'hostname': 'test3', 'backqueue': 2}, {'hostname': 'test4', 'backqueue': 2}, {'hostname': 'test6', 'backqueue': 1}, {'hostname': 'test7', 'backqueue': 1}, {'hostname': 'test8', 'backqueue': 1}, {'hostname': 'test9', 'backqueue': 2}]
