## 02 Exercise: Writing a Simple Web Crawler

1. Write a simple web crawler that captures all links in a document (for example https://www.cphbusiness.dk/) and all links in each linked document, so that two levels of documents are scraped. Use threads if helpful.


If a page returns a status code other than `200`, we simply disregard that page. See https://en.wikipedia.org/wiki/List_of_HTTP_status_codes for more details on the various HTTP status codes.

In [13]:
import bs4
import requests


def get_links(url):
  """Return all absolute links on a page, or [] if the page cannot be used."""
  try:
    r = requests.get(url, timeout=10)
  except requests.RequestException:
    return []
  if r.status_code != 200:  # disregard pages that do not answer 200 OK
    return []
  soup = bs4.BeautifulSoup(r.text, 'html.parser')
  return [a.get('href') for a in soup.select('a')
          if a.get('href') and a.get('href').startswith('http')]


url = 'https://www.cphbusiness.dk/'
links = get_links(url)

# Second level: collect the links of every linked document
nested_links = {link: get_links(link) for link in links}

nested_links


{'https://intra.cphbusiness.dk/': [],
 'https://cphbusiness.mrooms.net/': ['https://cphbusiness.mrooms.net',
  'https://cphbusiness.mrooms.net/login/index.php',
  'https://cphbusiness.mrooms.net/mahara/auth/xmlrpc/jump.php?hostwwwroot=https%3A%2F%2Fcphbusiness.mrooms.net&wantsurl=%2F&remoteurl=1',
  'https://cphbusiness.mrooms.net/login/index.php',
  'https://cphbusiness.mrooms.net/login/index.php',
  'https://cphbusiness.mrooms.net/calendar/view.php?view=month&time=1646992973',
  'https://cphbusiness.mrooms.net/calendar/view.php?view=day&time=1646953200',
  'https://cphbusiness.mrooms.net/course/view.php?id=3329',
  'https://cphbusiness.mrooms.net/login/index.php?saml=off',
  'https://selvbetjening.cphbusiness.dk/',
  'http://skema.cphbusiness.dk',
  'https://outlook.office.com/',
  'https://europe.wiseflow.net/login',
  'https://www.linkedin.com/learning/',
  'https://efif-my.sharepoint.com/_layouts/15/MySite.aspx?MySiteRedirect=AllDocuments',
  'https://cphbusiness.padlet.org',
  'h
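
The exercise suggests using threads. A minimal sketch of how the second level could be fetched concurrently is shown here, assuming a helper `get_links(url)` that returns the list of links on a page. The function name `crawl_two_levels`, the `max_workers` value and the in-memory `fake_web` are illustrative assumptions, not part of the original solution.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_two_levels(start_url, get_links, max_workers=8):
    """Return {first-level link: [second-level links]} using a thread pool.

    get_links is any callable that maps a URL to a list of links
    (hypothetical here, e.g. a requests + BeautifulSoup helper).
    """
    # Remove duplicates but keep the order in which links appear on the page
    first_level = list(dict.fromkeys(get_links(start_url)))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map calls get_links concurrently and yields results in input order
        second_level = list(pool.map(get_links, first_level))
    return dict(zip(first_level, second_level))


# Tiny in-memory "web" so the sketch runs without network access
fake_web = {
    'http://a.example/': ['http://b.example/', 'http://c.example/', 'http://b.example/'],
    'http://b.example/': ['http://a.example/'],
    'http://c.example/': [],
}

result = crawl_two_levels('http://a.example/', lambda url: fake_web.get(url, []))
print(result)
# {'http://b.example/': ['http://a.example/'], 'http://c.example/': []}
```

Threads fit this task because most of the time is spent waiting for HTTP responses, so several requests can be in flight at once.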

## 01 Exercise with findall()
In the following text, find the family names of everyone whose first name is Peter:

"Peter Hansen was meeting up with Jacob Fransen for a quick lunch, but first he had to go by Peter Beier to pick up some chocolate for his wife. Meanwhile Pastor Peter Jensen was going to church to give his sermon for the same 3 people in his parish. Those were Peter Kold and Henrik Halberg plus a third person who had recently moved here from Norway called Peter Harold."

In [20]:
import re

text = 'Peter Hansen was meeting up with Jacob Fransen for a quick lunch, but first he had to go by Peter Beier to pick up some chocolate for his wife. Meanwhile Pastor Peter Jensen was going to church to give his sermon for the same 3 people in his parish. Those were Peter Kold and Henrik Halberg plus a third person who had recently moved here from Norway called Peter Harold'

# The capture group makes findall() return only the family name
search_pattern = re.compile(r'Peter\s(\w+)')

family_names = search_pattern.findall(text)

for name in family_names:
  print(name)

Hansen
Beier
Jensen
Kold
Harold
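
A small sketch of why a capture group with a word boundary can be more robust than slicing the match. The sample sentence and the names in it are made up for illustration. When a pattern contains exactly one group, `findall()` returns the group contents instead of the whole match.

```python
import re

# Hypothetical sample: a first name that merely starts with "Peter",
# and a name separated by two spaces
sample = 'Peter Hansen met Peterine Holm before Peter  Kold arrived.'

# Loose pattern plus slicing: also matches "Peterine" and breaks on double spaces
loose = [m[6:] for m in re.findall(r'Peter.\w*', sample)]
print(loose)   # ['Hansen', 'ne', '']

# \b anchors at a word boundary, \s+ requires whitespace after "Peter",
# and the group captures only the family name
strict = re.findall(r'\bPeter\s+(\w+)', sample)
print(strict)  # ['Hansen', 'Kold']
```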


## 02 Exercise

We will work with the addresses from data/addresses.txt and regex patterns.

Write a regular expression that you can use to create 5 lists with:

  * all names in the address list
  * all telephone numbers
  * all zip codes
  * all city names with corresponding zip code
  * all street names

In [39]:
import re

with open('../../data/addresses.txt','r', encoding='latin1') as f:
  addresses = f.read()


# Every run of letters, including the Danish æ, ø and å (this matches all words, not only names)
all_names = re.compile(r'[a-zA-ZæøåÆØÅ]+')
all_names.findall(addresses)

['A',
 'Henning',
 'Gamborg',
 'Møller',
 'Klostergade',
 'Ribe',
 'A',
 'K',
 'Møller',
 'Bregnerødvej',
 'st',
 'Birkerød',
 'A',
 'Møller',
 'Violvej',
 'Ø',
 'Bjerregrav',
 'Randers',
 'NV',
 'A',
 'Møller',
 'Hyrdevej',
 'A',
 'Fredericia',
 'A',
 'Møller',
 'Brammersgade',
 'Aarhus',
 'C',
 'A
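
One way the five lists could be built from a single pattern is sketched here with named groups. The layout of data/addresses.txt is not shown, so the sketch assumes one entry per line in the form `name, street number, zip city, phone`. The sample entries, the pattern `entry` and all variable names are hypothetical.

```python
import re

# Hypothetical sample in an ASSUMED layout: name, street + number, zip + city, phone
sample = (
    'Anna Berg, Parkvej 12, 8000 Aarhus C, 12 34 56 78\n'
    'Bo Holm, Ny Østergade 3 st, 3460 Birkerød, 87654321'
)

entry = re.compile(
    r'^(?P<name>[^,\n]+),[ \t]*'                      # everything before the first comma
    r'(?P<street>[^,\d\n]+?)[ \t]+\d[^,\n]*,[ \t]*'   # street name, then house number and floor
    r'(?P<zip>\d{4})[ \t]+(?P<city>[^,\n]+),[ \t]*'   # four-digit zip code and city
    r'(?P<phone>\d{2}(?: ?\d{2}){3})[ \t]*$',         # eight digits, optionally in pairs
    re.MULTILINE,
)

rows = [m.groupdict() for m in entry.finditer(sample)]

names = [row['name'] for row in rows]
phones = [row['phone'] for row in rows]
zips = [row['zip'] for row in rows]
cities = [(row['zip'], row['city']) for row in rows]
streets = [row['street'] for row in rows]

print(names)    # ['Anna Berg', 'Bo Holm']
print(cities)   # [('8000', 'Aarhus C'), ('3460', 'Birkerød')]
print(streets)  # ['Parkvej', 'Ny Østergade']
```

If the real file uses a different separator or field order, the groups would need to be adjusted accordingly.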

## Class exercise
Find a web site to interact with and fill out a form to get some information back.  
Examples could be https://www.jobindex.dk/,    
https://google.com or   
https://www.ikea.com/dk/da/

In [2]:
import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options


def search_ikea(search_param):
  url = 'https://www.ikea.com/dk/da/'
  options = Options()
  options.add_argument('-headless')  # run Firefox without a visible window
  browser = webdriver.Firefox(options=options)
  try:
    browser.get(url)
    browser.implicitly_wait(3)  # applies to every later find_element call

    try:
      # Cookie approval popup: wait up to 20 seconds for the reject button to become clickable
      WebDriverWait(browser,20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#onetrust-reject-all-handler'))).click()
    except Exception as e:
      print('EXCEPTION:',e)

    # Type the search term into the search field in the page header
    search_field = browser.find_element(By.CSS_SELECTOR, 'div.hnf-header__search > div > div > form > div.search-wrapper > div > input')
    search_field.send_keys(search_param)

    # Click the first list in the suggestion dropdown
    link_to_click = browser.find_element(By.CSS_SELECTOR, 'div.hnf-header__search > div > div > form > div.dropdown > div > div > ol:first-child')
    link_to_click.click()

    soup = bs4.BeautifulSoup(browser.page_source, 'html.parser')

    # Keep only absolute links
    links = [link.get('href') for link in soup.find_all('a')
              if link.get('href') and link.get('href').startswith('http')]
    return links
  finally:
    browser.quit()  # always close the browser, even if a step fails

search_ikea('stole')

['https://www.ikea.com/dk/da/customer-service/services/click-collect/',
 'https://www.ikea.com/dk/da/customer-service/returns-claims/return-policy/',
 'https://www.ikea.com/dk/da/customer-service/stock-availability/',
 'https://www.ikea.com/dk/da/',
 'https://www.ikea.com/dk/da/profile/login/',
 'https://www.ikea.com/dk/da/favourites/',
 'https://www.ikea.com/dk/da/shoppingcart/',
 'https://www.ikea.com/dk/da/cat/produkter-products/',
 'https://www.ikea.com/dk/da/rooms/',
 'https://www.ikea.com/dk/da/',
 'https://www.ikea.com/dk/da/cat/mobler-fu001/',
 'https://www.ikea.com/dk/da/cat/kokken-og-hvidevarer-ka001/',
 'https://www.ikea.com/dk/da/cat/senge-og-madrasser-bm001/',
 'https://www.ikea.com/dk/da/cat/opbevaring-st001/',
 'https://www.ikea.com/dk/da/cat/kontormobler-700291/',
 'https://www.ikea.com/dk/da/cat/tekstiler-tl001/',
 'https://www.ikea.com/dk/da/cat/dekoration-de001/',
 'https://www.ikea.com/dk/da/cat/badevaerelsesmobler-og-tilbehor-ba001/',
 'https://www.ikea.com/dk/da/c
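
The returned list contains repeated entries (for example `https://www.ikea.com/dk/da/` appears more than once). A possible post-processing step is sketched here. The helper name `unique_links`, its `host` parameter and the sample list are assumptions made for illustration.

```python
from urllib.parse import urlparse


def unique_links(links, host=None):
    """Drop duplicate links (keeping first-seen order) and optionally keep one host only."""
    seen = dict.fromkeys(links)  # dicts preserve insertion order
    if host is None:
        return list(seen)
    return [link for link in seen if urlparse(link).netloc == host]


# Hypothetical sample resembling the scraped output
sample_links = [
    'https://www.ikea.com/dk/da/',
    'https://www.ikea.com/dk/da/profile/login/',
    'https://www.ikea.com/dk/da/',
    'https://outlook.office.com/',
]

all_unique = unique_links(sample_links)
ikea_only = unique_links(sample_links, host='www.ikea.com')
print(all_unique)
print(ikea_only)
```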