# Web Scraping: Exercises



---


<font color='violet'>
Hints are written in white, so you do not see them immediately. If you highlight them (or double-click on them), they will appear! 
<font color='white'> I am a hint! :-)


---


## 1. Basic exercises

### Exercise 1.1

Import the ``requests`` library, the ``BeautifulSoup`` library and the ``pandas`` library.

In [None]:
import pandas
import requests
from bs4 import BeautifulSoup

Using the ``requests`` library, retrieve the example page (http://repec.sowi.unibe.ch/varia/example-page.html) and assign the response object to a variable named ``exR``. Print out the status code. You will get a number. What does it mean?

In [None]:
exR = requests.get("http://repec.sowi.unibe.ch/varia/example-page.html")
exR.status_code # Status code of 200 means that everything went well 
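
To look up what a given status code means, the `http.HTTPStatus` enum from Python's standard library can help. The following is only a small sketch; the helper name `describe_status` is an assumption made for illustration and is not part of the exercise.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the standard reason phrase for an HTTP status code."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        # The code is not defined in the standard library's enum
        return "Unknown status code"

print(describe_status(200))
print(describe_status(404))
```

With a real response object, `exR.raise_for_status()` is an alternative: it raises an exception for 4xx and 5xx codes instead of letting the script continue silently.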

Now print out the text of the response.

In [None]:
exR.text

### Exercise 1.2

Using ``BeautifulSoup``, parse the text of your response object and assign the result to a variable called ``mySoup``.

In [None]:
mySoup = BeautifulSoup(exR.text, "html.parser") # Naming the parser explicitly avoids a GuessedAtParserWarning

Print the content of your soup object.

In [None]:
print(mySoup.prettify())

Now, try to access the following:
1. The ``thead`` element
2. All ``p`` elements
3. The last ``p`` element
4. The text of the ``h1`` element
5. The first URL in the document (only the URL!)
6. All ``a`` elements within the ``table`` element
7. All ``table`` elements of class "cat_table"  <font color='violet'> Hint: <font color='white'>  Remember to use class_ instead of class to find elements based on the value of the class attribute!<font color='black'> 
8. The text of all ``p`` elements (as a list) <font color='violet'> Hint: <font color='white'>  Use a list comprehension! <font color='black'> 



In [None]:
# 1: The thead element
mySoup.find("thead") # Or just: mySoup.thead

In [None]:
# 2: All p elements
mySoup.find_all("p")

In [None]:
# 3: The last p element
mySoup.find_all("p")[-1]

In [None]:
# 4: The text of the h1 element
mySoup.find("h1").get_text() # Or: mySoup.h1.text

In [None]:
# 5: The first URL in the document (only the URL!) 
mySoup.a["href"]

In [None]:
# 6: All a elements within the table element
mySoup.find("table").find_all("a") # Or just: mySoup.table("a")

In [None]:
# 7: All table elements of class "cat_table" 
mySoup.find_all("table", class_="cat_table") # Or:
mySoup.find_all("table", attrs = {"class" : "cat_table"})

In [None]:
# 8: The text of all p elements (as a list)
pElems = mySoup.find_all("p")
texts = [elem.get_text() for elem in pElems]
texts
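
The same lookups can be tried offline on a small inline document, which makes it easier to see what each call returns. This is a minimal sketch: the HTML snippet and the variable names are made up for illustration and do not reproduce the actual example page.

```python
from bs4 import BeautifulSoup

# Hypothetical mini page, invented for illustration
html = """
<html><body>
  <h1>Cats</h1>
  <p>First paragraph.</p>
  <table class="cat_table">
    <tr><td><a href="https://example.com/a">A</a></td></tr>
  </table>
  <table class="other"></table>
  <p>Last paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ (with trailing underscore) filters on the class attribute
cat_tables = soup.find_all("table", class_="cat_table")

# A list comprehension collects the text of every p element
p_texts = [p.get_text() for p in soup.find_all("p")]

# soup.a is the first a element; ["href"] reads its attribute
first_url = soup.a["href"]

print(len(cat_tables))
print(p_texts)
print(first_url)
```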

### Exercise 1.3

Most websites are a bit more complicated than our example page. In this exercise, we will retrieve the Wikipedia page on cats: https://en.wikipedia.org/wiki/Cat

Retrieve the page, get the text and convert it to a BeautifulSoup object called ``cats``.

In [None]:
res = requests.get("https://en.wikipedia.org/wiki/Cat")
cats = BeautifulSoup(res.text)

Go to the page and inspect it (right-click on an element and select "Inspect", or "Element untersuchen" in a German-language browser). Then, try to retrieve the following elements from the page:

1. The title of the page (only the text)
2. The title header of the page (Cat)
3. All the main headers of the text on the page (Etymology and naming, Taxonomy...)
4. All the headers in the text (Etymology and naming, Taxonomy, Evolution, Domestication, Characteristics, Size...) <font color='violet'> Hint: <font color='white'> These headers are all of the same class.<font color='black'>
5. The opening paragraph ("The cat (Felis catus) is a ...")
6. All the links in the infobox table on the right 
7. The number of images on the page <font color='violet'> Hint: <font color='white'> You can use the len() function

In [None]:
# 1
cats.find("title").get_text() # Or: cats.title.text

In [None]:
# 2
cats.find("h1").get_text() # Or: cats.h1.text

In [None]:
# 3
headers = [header.text for header in cats.find_all("h2")]
headers 

In [None]:
# 4
[header.text for header in cats.find_all(class_="mw-headline")] 
# Or: [header.text for header in cats.find_all(["h1", "h2", "h3"])]

In [None]:
# 5
cats.find_all("p")[1].text

In [None]:
# 6
links = cats.find("table", class_="infobox biota").find_all("a")
links = [link["href"] for link in links] # Or:["https://en.wikipedia.org" + link["href"] for link in links]
links

In [None]:
# 7
len(cats.find_all("img"))

Now try to retrieve the first table on the page and convert it to a pandas DataFrame.

In [None]:
import pandas as pd
cat_table = pd.read_html("https://en.wikipedia.org/wiki/Cat")[0]
cat_table

### Exercise 1.4

Consider the following list of links to Wikipedia pages on animals:

In [None]:
animals_wiki = ["https://en.wikipedia.org/wiki/Cat",
                "https://en.wikipedia.org/wiki/Dog",
                "https://en.wikipedia.org/wiki/Tiger",
                "https://en.wikipedia.org/wiki/Panda"]

Write a simple loop that fetches each of these pages and writes the response into a list (it should look like this: ``[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]``)

In [None]:
L = []
for link in animals_wiki:
  r = requests.get(link)
  L.append(r)

L

You would like to have (1) the title header and (2) the number of images of each of these pages. Revise your loop so that this information is retrieved from the source code of each page and written into a (nested) list. Which animal page has the most images?

In [None]:
L = []
for link in animals_wiki:
  r = requests.get(link)
  soup = BeautifulSoup(r.text)

  header = soup.h1.text
  nr_images = len(soup.find_all("img"))

  L.append([header, nr_images])

L

In [None]:
# Tiger page has the most images
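
Rather than reading the answer off the printed list, `max()` with a `key` function can pick the row with the highest count. The sketch below uses placeholder rows invented for illustration, since the real counts depend on the live pages.

```python
# Hypothetical [header, nr_images] rows, invented for illustration
pages = [["Page A", 3], ["Page B", 7], ["Page C", 5]]

# max() compares the rows by their second element (the image count)
top_page = max(pages, key=lambda row: row[1])
print(top_page)
```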

Repetition task: Convert your list into a pandas dataframe.

In [None]:
animal_data = pd.DataFrame(L, columns=["Animal", "Nr. of images"])
animal_data

## 2. Advanced exercises*

---


<font color='red'>
*Feel free to skip the advanced exercises if you feel overwhelmed or if trying to solve the basic exercises already took you a lot of time! 


---



### Exercise 2.1

Suppose your list of animal links also contains a link to a website that does not exist: 

In [None]:
animals_wiki = ["https://en.wikipedia.org/wiki/Cat",
                "https://en.wikipedia.org/wiki/Dog",
                "https://no-such-link-exists.com",
                "https://en.wikipedia.org/wiki/Panda"]

Add a ``try-except`` block to the loop from Exercise 1.4 to prevent your web scraper from crashing when a URL cannot be retrieved. <font color='violet'> Hint: <font color='white'> Use ``continue`` within the ``except`` block to jump back to the beginning of the loop!




In [None]:
L = []
for link in animals_wiki:
  try:
    r = requests.get(link)
  except requests.exceptions.RequestException: # Catch request errors only, not e.g. KeyboardInterrupt
    print("Problem loading the site:", link)
    continue # Go back to beginning of the for loop

  soup = BeautifulSoup(r.text)

  header = soup.h1.text
  nr_images = len(soup.find_all("img"))

  L.append([link, header, nr_images]) # append url as well so you know which 
                                      # site corresponds to which url

L
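
The `try`/`except`/`continue` pattern can also be tested without any network access. In the sketch below, `fake_fetch` is a hypothetical stand-in for `requests.get` that fails for one URL; the URLs are placeholders. It also keeps a list of the failures, which can be handy for retrying them later.

```python
def fake_fetch(url):
    """Hypothetical stand-in for requests.get that fails for one URL."""
    if "no-such-link" in url:
        raise ConnectionError("could not resolve host")
    return "<h1>ok</h1>"

urls = [
    "https://example.com/a",
    "https://no-such-link.example",
    "https://example.com/b",
]

results, failed = [], []
for url in urls:
    try:
        page = fake_fetch(url)
    except ConnectionError as err:
        failed.append((url, str(err)))
        continue  # Skip the rest of this iteration
    # Only reached when the fetch succeeded
    results.append((url, page))

print(len(results), "fetched;", len(failed), "failed")
```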

### Exercise 2.2

The Wikipedia page https://en.wikipedia.org/wiki/List_of_cat_breeds contains a list of all cat breeds and the links to the respective Wikipedia pages. You would like to create a dataset about the different cat breeds with information from their Wikipedia pages.

In a first step, you will have to retrieve all the links to the respective Wikipedia pages. Retrieve them from the first table on the website and write them into a list. *Note that the table also contains some links you do not want to have included (you only want those to the pages for the different cat breeds). You can use ``CSS`` selectors to specify what links you want to extract. First have a look at the source code of the page to find out how the relevant links can be addressed.*

In [None]:
# Request page and make soup object
cat_res = requests.get("https://en.wikipedia.org/wiki/List_of_cat_breeds")
cat_soup = BeautifulSoup(cat_res.text)

# Select the table
table = cat_soup.find_all("table")[0]

# Select a elements that are direct children of a header cell (th) within a table row (tr)
cat_links = table.select("tr th > a")

# Extract URLs
cat_links = [link["href"] for link in cat_links] 
cat_links
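
To see how the child combinator (`>`) differs from the descendant combinator (a space), the selectors can be tried on a small inline table. This is a sketch under assumptions: the table below is made up and only loosely mimics the layout of a breed list.

```python
from bs4 import BeautifulSoup

# Hypothetical table, invented for illustration
html = """
<table>
  <tr><th><a href="/wiki/Breed_A">Breed A</a></th>
      <td><a href="/wiki/Country_X">Country X</a></td></tr>
  <tr><th><a href="/wiki/Breed_B">Breed B</a></th>
      <td><span><a href="/wiki/Country_Y">Country Y</a></span></td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser")

# "tr th > a": a elements that are direct children of a th inside a tr
breed_links = [a["href"] for a in table.select("tr th > a")]

# "td a": a elements nested at any depth inside a td
other_links = [a["href"] for a in table.select("td a")]

print(breed_links)
print(other_links)
```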

If you inspect your list, you may notice that there are some external URLs (i.e. URLs that do not point to Wikipedia pages). Try to remove them! <font color='violet'> Hint: <font color='white'> Note that you can specify an if-condition within a list comprehension. In the very first tutorial you learned how to check if a string is contained within another string.

In [None]:
# Remove non-wiki links
cat_links = [link for link in cat_links if "wiki" in link] 
cat_links
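
A substring check such as `"wiki" in link` would also keep an external URL that happens to contain "wiki". A stricter alternative, sketched below, uses `urllib.parse.urlparse` from the standard library. The sample links and the helper name `is_internal_article` are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical hrefs, invented for illustration
links = [
    "/wiki/Breed_A",
    "https://example.org/wiki-of-cats",
    "/wiki/Breed_B",
    "#cite_note-1",
]

def is_internal_article(href):
    """True for relative links whose path starts with /wiki/."""
    parts = urlparse(href)
    # Relative links have no network location (host) component
    return parts.netloc == "" and parts.path.startswith("/wiki/")

internal = [href for href in links if is_internal_article(href)]
print(internal)
```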

With your cleaned list you can now start to scrape the pages. You would like to retrieve (1) the title header, (2) the number of images, (3) the number of characters of the text of each page, and (4) the text of the introductory paragraph. Try to do so for the **first page only** to develop your code. Note that the links in your list are relative, so you will have to prepend the base URL (https://en.wikipedia.org) to each of them!

In [None]:
link = "https://en.wikipedia.org" + cat_links[0]

r = requests.get(link)
soup = BeautifulSoup(r.text)

header = soup.h1.text
nr_images = len(soup.find_all("img"))
chars = len(soup.body.text)
par1 = soup.find_all("p")[1].text

[header, nr_images, chars, par1]
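
As an alternative to string concatenation, `urllib.parse.urljoin` from the standard library builds the absolute URL and leaves links that are already absolute untouched. This is a small sketch; the second URL is a placeholder.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org"

# A relative path is resolved against the base URL
absolute = urljoin(base, "/wiki/Cat")

# An already absolute URL is returned as it is
unchanged = urljoin(base, "https://example.org/page")

print(absolute)
print(unchanged)
```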

Now you are ready to build your web scraper. Write a loop that fetches the information from all the pages and writes it into a list. Tip: Before you loop through the entire list, try looping over the first few elements to check if everything works (running the loop across the whole list may take a while).

In [None]:
L = []

for subdom in cat_links:

  link = "https://en.wikipedia.org" + subdom
  r = requests.get(link)

  soup = BeautifulSoup(r.text)

  header = soup.h1.text
  nr_images = len(soup.find_all("img"))
  chars = len(soup.body.text)
  par1 = soup.find_all("p")[1].text
  
  L.append([header, nr_images, chars, par1])

In [None]:
# More robust and slower version (not necessary in this case)
import time 
L = []

for subdom in cat_links:
  time.sleep(1)
  link = "https://en.wikipedia.org" + subdom

  try:
    r = requests.get(link)
  except Exception as e:
    print("Error with:", link)
    print(e)
    continue # Skip this page; otherwise r would be undefined or left over from the previous page

  soup = BeautifulSoup(r.text)

  header = soup.h1.text
  nr_images = len(soup.find_all("img"))
  chars = len(soup.body.text)
  par1 = soup.find_all("p")[1].text

  L.append([header, nr_images, chars, par1])

In [None]:
# Look at some example results
L[:3]

Create a pandas DataFrame with the information you gathered and inspect it. Which cat has the longest article? Which one has the most images?

In [None]:
cat_data = pd.DataFrame(L, columns=["Animal", "Nr. of images", "Nr. of chars", "Summary"])
cat_data

In [None]:
cat_data.sort_values("Nr. of chars", ascending=False).head(4)

In [None]:
cat_data.sort_values("Nr. of images", ascending=False).head(4)

In [None]:
# Manx cat has longest article
# Persian cat has the most images 
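
Instead of sorting and reading off the top row, `idxmax()` returns the index label of the maximum, which can then be used with `.loc`. The sketch below uses placeholder values invented for illustration, not the scraped results.

```python
import pandas as pd

# Hypothetical values, invented for illustration
df = pd.DataFrame(
    [["Page A", 3, 1200], ["Page B", 7, 800], ["Page C", 5, 2500]],
    columns=["Animal", "Nr. of images", "Nr. of chars"],
)

# idxmax() gives the row label of the largest value in the column
longest = df.loc[df["Nr. of chars"].idxmax(), "Animal"]
most_images = df.loc[df["Nr. of images"].idxmax(), "Animal"]

print(longest)
print(most_images)
```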