# Web scraping


## Webpages

Webpages are built mainly with three technologies:
 * HTML: content (structure, headings, paragraphs, tables...)
 * CSS: style (color, shape, size...)
 * JavaScript: logic (clicks, popups, dynamic banners...)

## HTML

### Basics

HTML code consists of `<tagged>` content.

HTML has a hierarchical structure, with parent tags, child tags and sibling tags:
```
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
</html>
```
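
To make the hierarchy concrete, the following sketch walks the sample page with Python's standard-library `html.parser` and records how deeply each tag is nested. The class name `OutlineParser` is made up for this illustration, and the depth counter assumes every tag is explicitly closed, which holds for this sample but not for all real pages.

```python
from html.parser import HTMLParser

HTML_DOC = """
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
</html>
"""


class OutlineParser(HTMLParser):
    """Records each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag_name) pairs

    def handle_starttag(self, tag, attrs):
        # A new tag opens: store it at the current depth, then go one level down
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # A tag closes: go back up one level
        self.depth -= 1


parser = OutlineParser()
parser.feed(HTML_DOC)

# Print an indented outline: children are indented under their parent
for depth, tag in parser.outline:
    print("    " * depth + tag)
```

In the printed outline, `head` and `body` are siblings (same depth, same parent `html`), while `h1` and `p` are children of `body`.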

### Tags

Tags may be classified into different groups, depending on the type of content they are expected to hold:
 * heading: `<h1>`, `<h2>`, `<h3>`, `<hgroup>`...
 * phrasing: `<b>`, `<img>`, `<sub>`...
 * embedded: `<audio>`, `<img>`, `<video>`...
 * tabulated: `<table>`, `<tr>`, `<tbody>`...
 * sections: `<header>`, `<section>`, `<article>`...
 * metadata: `<meta>`, `<title>`, `<script>`...

### Attributes

Tags may have attributes. Here, the `div` tag has:
 * a `class` attribute with value `price-item`
 * an `id` attribute with value `offer`  

`<div class="price-item" id="offer"> Zapas Marca Joma X54 </div>`

The `id` attribute should be unique within a page (no two tags should share the same `id`).

The `class` attribute is not intended to be unique; it usually groups tags with similar styling or behavior.

Other frequently used attributes are:
 * `dir`
 * `lang`
 * `style` (not to be confused with the `<style>` tag)
 * `title` (not to be confused with the `<title>` tag)
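
As a small sketch of how attributes look from Python, the `div` from the example can be parsed with Beautiful Soup (the library used later in this notebook) and its attributes read like dictionary entries. The variable names are chosen for illustration only.

```python
from bs4 import BeautifulSoup

html = '<div class="price-item" id="offer"> Zapas Marca Joma X54 </div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

print(div["id"])        # single-valued attribute: a plain string
print(div["class"])     # class can hold several values, so it comes back as a list
print(div.get("lang"))  # missing attribute: .get() returns None instead of raising
print(div.attrs)        # all attributes of the tag as a dict
```

Because `class` may contain several space-separated names, Beautiful Soup returns it as a list even when there is only one.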

## Web scraping

### Intro

When scraping, we want to filter tags by:
 * tag name
 * class
 * id
 * other attribute

Our browser's developer tools are very useful for this: we can **Inspect** content on the page and find the corresponding piece of HTML code.

We use the `requests` library to bring the HTML content into our Python script.

We use the `Beautiful Soup` library to easily navigate through the HTML in Python.

In [1]:
!pip install beautifulsoup4



In [2]:
import requests

In [3]:
from bs4 import BeautifulSoup
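
Before moving to a real page, here is a minimal sketch of the four kinds of filter listed in the intro (tag name, class, id, other attribute), applied to a tiny HTML snippet. The snippet and variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this example
html = """
<div id="offer" class="price-item">Shoes</div>
<div class="price-item">Socks</div>
<p class="price-item" lang="es">Camiseta</p>
<a href="/cart">Cart</a>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("div")                  # filter by tag name
by_class = soup.find_all(class_="price-item")   # filter by class (note the trailing underscore)
by_id = soup.find(id="offer")                   # filter by id: at most one match expected
by_attr = soup.find_all(attrs={"lang": "es"})   # filter by any other attribute
by_name_and_class = soup.find_all("div", class_="price-item")  # filters can be combined

print(len(by_name), len(by_class), by_id.text, len(by_attr), len(by_name_and_class))
```

`find_all` always returns a list (possibly empty), while `find` returns the first match or `None`.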

### Example: Amazon books

Let's scrape Python books sold on Amazon!

In [3]:
url = "https://www.amazon.es/s?k=python+books"

Let's try to get the book titles and their prices!

#### Research

In [4]:
response = requests.get(url)

When we requested API endpoints, the response content was generally JSON.

Let's see what happens in this case.

In [43]:
try:
    response.json()
except Exception:
    print("The content is not JSONeable")

The content is not JSONeable


But we just requested a non-API URL, so we receive raw HTML content!

In [45]:
type(response.content)

bytes

In [46]:
type(response.text)

str
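
The difference between the two can be reproduced without any network access: `response.content` holds the raw bytes as received, while `response.text` is the same payload decoded into a `str`. The sketch below uses a hand-written byte string as a stand-in for a response body and assumes UTF-8 encoding.

```python
# Stand-in for response.content: raw bytes as they travel over the network
raw = "<p>sábado, 7 de noviembre</p>".encode("utf-8")

# Stand-in for response.text: the same payload decoded into a string
text = raw.decode("utf-8")

print(type(raw), type(text))
# The accented character takes two bytes in UTF-8 but is a single character in the str
print(len(raw), len(text))
```

Beautiful Soup accepts either form; when given bytes it tries to detect the encoding itself.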

`Beautiful Soup` helps us with the task of accessing this info

In [86]:
soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly to avoid a warning

Let's find the book names and prices by opening the Chrome developer tools and inspecting the page...

**HINT**: execute in the console `document.querySelectorAll('a').forEach(elm => elm.style.background = 'red')` to check how your guesses are working.  

Replace `a` with the tag you are testing, or with `.class_name` to select by class name.

#### Get book tags

In [87]:
possible_books = soup.find_all(name="div", class_="sg-col-inner")

In [88]:
len(possible_books)

53

In [89]:
possible_books[10].find_all("span", class_="a-size-base-plus a-color-base a-text-normal")

[<span class="a-size-base-plus a-color-base a-text-normal" dir="auto">Python Programming: A Step By Step Guide from Beginner to Advanced (Beginner &amp; Advanced)</span>]

In [90]:
possible_books[0].find_all("span", class_="a-size-base-plus a-color-base a-text-normal")

[]

In [91]:
possible_books[1].find_all("span", class_="a-size-base-plus a-color-base a-text-normal")

[]

In [92]:
possible_books[2].find_all("span", class_="a-size-base-plus a-color-base a-text-normal")

[]

In [93]:
possible_books[3].find_all("span", class_="a-size-base-plus a-color-base a-text-normal")

[<span class="a-size-base-plus a-color-base a-text-normal" dir="auto">Python Crash Course: A Hands-On, Project-Based Introduction to Programming</span>,
 <span class="a-size-base-plus a-color-base a-text-normal" dir="auto">Let Us Python: Python Is Future, Embrace It Fast</span>,
 <span class="a-size-base-plus a-color-base a-text-normal" dir="auto">Python: 6 Books in 1: 
 The Ultimate Bible to Learn Python Programming for a Career in Machine Learning, Data Science &amp; 
 Web Development. (English Edition)</span>,
 <span class="a-size-base-plus a-color-base a-text-normal" dir="auto">LEARNING PYTHON: 3 Books in 1: Ultimate Beginners guide Including Data Analysis and 50 Step-By-Step Coding Projects in Games, Art and More</span>,
 <span class="a-size-base-plus a-color-base a-text-normal" dir="auto">Fluent Python: Clear, Concise, and Effective Programming</span>,
 <span class="a-size-base-plus a-color-base a-text-normal" dir="auto">PYTHON: 2 books in 1 : Learn python programming for beginne

In [96]:
possible_books[4].find_all("span", class_="a-size-base-plus a-color-base a-text-normal")

[<span class="a-size-base-plus a-color-base a-text-normal" dir="auto">Python Crash Course: A Hands-On, Project-Based Introduction to Programming</span>]

In [97]:
possible_books[5].find_all("span", class_="a-size-base-plus a-color-base a-text-normal")

[<span class="a-size-base-plus a-color-base a-text-normal" dir="auto">Let Us Python: Python Is Future, Embrace It Fast</span>]

In [98]:
possible_books[30].find_all("span", class_="a-size-base-plus a-color-base a-text-normal")

[<span class="a-size-base-plus a-color-base a-text-normal" dir="auto">Python for Data Science: Step-by-Step Crash Course On How To Come Up Easily With Your First Data Science Projects From Scratch In Less Than 7 Days. Includes Practical Exercises</span>]

It seems that real books contain exactly one occurrence of a `span` tag with that class value.

In [99]:
type(possible_books[10])

bs4.element.Tag

In [100]:
def is_book_tag(tag):
    """
    Decides whether an Amazon div tag corresponds to a book or to other useless information
    Args:
        tag (bs4.element.Tag)
    Returns:
        bool: True if book
    """
    list_of_spans = tag.find_all("span", class_="a-size-base-plus a-color-base a-text-normal")
    
    # A real book contains exactly one title span
    return len(list_of_spans) == 1

We filter `possible_books` into `books`. I do it in two equivalent ways (I personally prefer list comprehension):
 * list comprehension
 * filter function

In [123]:
books = [b for b in possible_books if is_book_tag(b)]

In [121]:
# equivalently
books2 = filter(is_book_tag, possible_books)

In [122]:
list(books2) == books

True

In [124]:
len(books)

48
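
Why "exactly one" rather than "at least one"? The sketch below reproduces the situation on a heavily simplified, made-up page: a wrapper `div` shares the class of the book `div`s and contains all of them, so it holds several title spans. The `id` values and the markup are hypothetical; only the class names come from the Amazon page above.

```python
from bs4 import BeautifulSoup

TITLE_CLASS = "a-size-base-plus a-color-base a-text-normal"

# Hypothetical, heavily simplified markup
html = f"""
<div class="sg-col-inner" id="wrapper">
  <div class="sg-col-inner" id="book-a"><span class="{TITLE_CLASS}">Book A</span></div>
  <div class="sg-col-inner" id="book-b"><span class="{TITLE_CLASS}">Book B</span></div>
</div>
<div class="sg-col-inner" id="banner">No title here</div>
"""
soup = BeautifulSoup(html, "html.parser")


def is_book_tag(tag):
    """Same rule as above: a book div contains exactly one title span."""
    return len(tag.find_all("span", class_=TITLE_CLASS)) == 1


possible_books = soup.find_all("div", class_="sg-col-inner")
books = [tag for tag in possible_books if is_book_tag(tag)]

print([tag["id"] for tag in possible_books])  # all four divs, wrapper included
print([tag["id"] for tag in books])           # only the two real books
```

The wrapper (two title spans) and the banner (none) are both filtered out, which mirrors the drop from 53 candidates to 48 books.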

#### Book titles

In [125]:
example_book = books[0]

In [126]:
example_book

<div class="sg-col-inner">
<span cel_widget_id="MAIN-SEARCH_RESULTS-0" class="celwidget slot=MAIN template=SEARCH_RESULTS widgetId=search-results">
<div class="s-expand-height s-include-content-margin s-border-bottom s-latency-cf-section">
<div class="a-section a-spacing-medium">
<div class="a-section a-spacing-micro s-grid-status-badge-container">
</div>
<span class="rush-component" data-component-type="s-product-image">
<a class="a-link-normal s-no-outline" href="/Python-Crash-Course-Eric-Matthes/dp/1593279280?dchild=1">
<div class="a-section aok-relative s-image-square-aspect">
<img alt="Python Crash Course: A Hands-On, Project-Based Introduction to Programming" class="s-image" data-image-index="0" data-image-latency="s-product-image" data-image-load="" data-image-source-density="1" onload="window.uet &amp;&amp; uet('cf')" src="https://m.media-amazon.com/images/I/81vmJCNCm6L._AC_UL320_.jpg" srcset="https://m.media-amazon.com/images/I/81vmJCNCm6L._AC_UL320_.jpg 1x, https://m.media-am

In [127]:
type(example_book)

bs4.element.Tag

After some **Inspection** using Chrome, we see the title here:  
`<span dir="auto" class="a-size-base-plus a-color-base a-text-normal">Python Made Simple: Learn Python programming in easy steps with examples</span>`

In [128]:
example_book.find("span", class_="a-size-base-plus a-color-base a-text-normal")

<span class="a-size-base-plus a-color-base a-text-normal" dir="auto">Python Crash Course: A Hands-On, Project-Based Introduction to Programming</span>

In [143]:
example_book.find("span", class_="a-size-base-plus a-color-base a-text-normal").text

'Python Crash Course: A Hands-On, Project-Based Introduction to Programming'

We got it! Let's create a function for this!

In [130]:
def get_book_name(book_tag):
    """
    Extracts book name
    Args:
        book_tag (bs4.element.Tag): corresponding to an Amazon book
    Returns:
        str: book title
    """
    book_title_tag = book_tag.find("span", class_="a-size-base-plus a-color-base a-text-normal")
    book_title = book_title_tag.text
    
    return book_title

In [131]:
get_book_name(books[5])

'PYTHON: 2 books in 1 : Learn python programming for beginners and machine learning (English Edition)'

In [132]:
get_book_name(books[10])

'Python for Data Science: 2 Books in 1. A Practical Beginner’s Guide to learn Python Programming, introducing into Data Analytics, Machine learning, Web Development, with Hands-on Projects'

In [133]:
get_book_name(books[0])

'Python Crash Course: A Hands-On, Project-Based Introduction to Programming'

In [134]:
get_book_name(books[-1])

'Deep Learning Models explored with help of Python Programming'

#### Book prices

Let's inspect!

After some **Inspection** using Chrome, we see the price here:  
`<span class="a-price-whole">31,01</span>`

In [136]:
def get_book_price(book_tag):
    """
    Extracts book price
    Args:
        book_tag (bs4.element.Tag): corresponding to an Amazon book
    Returns:
        str: book price as displayed on the page, e.g. '31,01'
    """
    book_price_tag = book_tag.find("span", class_="a-price-whole")
    book_price = book_price_tag.text
    
    return book_price
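
For numeric analysis, the price text (such as `31,01`) has to be converted to a number. Below is a minimal sketch of such a conversion. The helper name `parse_price` is hypothetical, and it assumes European formatting (`.` as thousands separator, `,` as decimal mark) with no currency symbol in the text; real pages may need extra cleaning.

```python
def parse_price(price_text):
    """Convert a price such as '31,01' or '1.234,56' into a float.

    Assumes European formatting: '.' separates thousands, ',' marks decimals.
    """
    cleaned = price_text.strip()
    cleaned = cleaned.replace(".", "")   # drop thousands separators
    cleaned = cleaned.replace(",", ".")  # turn the decimal comma into a decimal point
    return float(cleaned)


print(parse_price("31,01"))
print(parse_price("1.234,56"))
```

Keeping the conversion in its own function leaves the scraping step simple and makes the formatting assumption easy to change.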

#### Similar with date

In [156]:
def get_book_date(book_tag):
    """
    Extracts book date
    Args:
        book_tag (bs4.element.Tag): corresponding to an Amazon book
    Returns:
        str or None: book date, or None if the book has no date tag
    """
    book_date_tag = book_tag.find("span", class_="a-text-bold")
    try:
        book_date = book_date_tag.text
    except AttributeError:
        # find() returned None: this book has no date tag
        return None
    else:
        return book_date

In [158]:
get_book_date(books[1])

'domingo, 8 de noviembre'

In [159]:
get_book_date(books[2])

In [160]:
get_book_date(books[3])

'domingo, 8 de noviembre'

#### Creating a DataFrame with the data

In [161]:
books_info_list = []

for book in books:
    book_dict = {
        "title": get_book_name(book),
        "price": get_book_price(book),
        "date": get_book_date(book)
    }

    books_info_list.append(book_dict)

AttributeError: 'NoneType' object has no attribute 'text'

In [162]:
books_info_list = []

for book in books:
    try:
        book_dict = {
            "title": get_book_name(book),
            "price": get_book_price(book),
            "date": get_book_date(book)
        }

        books_info_list.append(book_dict)
    except AttributeError:  # a tag was missing, so find() returned None
        print('there was a problem')

there was a problem
there was a problem
there was a problem


In [163]:
len(books_info_list)

45
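
The `try`/`except` loop drops three books entirely because one field is missing. An alternative, sketched here on made-up HTML, is to keep every book and store `None` for whatever is missing. The helper name `text_or_none` and the simplified markup (class names `book` and `title`) are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second book has no price
html = """
<div class="book"><span class="title">Book A</span><span class="a-price-whole">31,01</span></div>
<div class="book"><span class="title">Book B</span></div>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_none(tag, name, class_):
    """Return the text of the first matching child tag, or None if there is none."""
    found = tag.find(name, class_=class_)
    return found.text if found is not None else None


rows = [
    {
        "title": text_or_none(book, "span", "title"),
        "price": text_or_none(book, "span", "a-price-whole"),
    }
    for book in soup.find_all("div", class_="book")
]

print(rows)
```

Whether to drop incomplete books or keep them with missing values depends on the analysis; keeping them at least makes the gaps visible in the DataFrame.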

In [164]:
import pandas as pd

In [165]:
pd.DataFrame(books_info_list)

|   | title | price | date |
|---|-------|-------|------|
| 0 | Python Crash Course: A Hands-On, Project-Based... | 3101 | domingo, 8 de noviembre |
| 1 | Let Us Python: Python Is Future, Embrace It Fast | 1871 | domingo, 8 de noviembre |
| 2 | Python: 6 Books in 1: The Ultimate Bible to L... | 0 | |
| 3 | LEARNING PYTHON: 3 Books in 1: Ultimate Beginn... | 1945 | domingo, 8 de noviembre |
| 4 | Fluent Python: Clear, Concise, and Effective P... | 4043 | sábado, 7 de noviembre |
| 5 | PYTHON: 2 books in 1 : Learn python programmin... | 0 | |
| 6 | Python: 3 books in 1: Beginner’s guide, Data s... | 0 | |
| 7 | Python: 3 Manuscripts in 1 book: - Python Prog... | 875 | |
| 8 | Python for Data Science: 2 Books in 1. A Pract... | 2495 | domingo, 8 de noviembre |
| 9 | Learning Python: Powerful Object-Oriented Prog... | 4908 | sábado, 7 de noviembre |


### Advanced

We can use CSS selectors to obtain the same results as before.

For this we use the `select` method, which is more general than `find` or `find_all`.

The following two calls are equivalent:

In [169]:
possible_books = soup.find_all(name="div", class_="sg-col-inner")

In [170]:
possible_books2 = soup.select("div.sg-col-inner")

In [171]:
possible_books == possible_books2

True

We can use CSS selectors to find in a more specific way:
 * descendant selectors
 * combined selectors
 * siblings
 * has attribute
 * ...

`soup.select("tagname1 tagname2")`: selects every `tagname2` nested anywhere inside a `tagname1`

In [184]:
len(soup.select("a"))

526

In [186]:
len(soup.select("span a"))

424

In [190]:
len(soup.select("span span span a"))

184

In [194]:
len(soup.select("span span span span a"))

43
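
The counts above shrink because each extra `span` in the selector demands one more level of nesting. The sketch below shows the same effect on a tiny, made-up snippet, and contrasts the descendant selector (a space) with the child selector (`>`), which only matches direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = """
<span id="outer">
  <a href="/direct">direct child</a>
  <span id="inner"><a href="/nested">nested deeper</a></span>
</span>
<a href="/outside">not inside any span</a>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.select("a")              # every link on the page
descendants = soup.select("span a")       # links anywhere inside a span
two_levels = soup.select("span span a")   # links inside a span that is inside a span
children = soup.select("span#outer > a")  # links that are direct children of #outer

for name, result in [("a", all_links), ("span a", descendants),
                     ("span span a", two_levels), ("span#outer > a", children)]:
    print(name, [a["href"] for a in result])
```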

`soup.select(".classname")`: selects any tag with class `classname`

In [198]:
len(soup.select(".a-price-whole"))

45

`soup.select("tagname.classname")`: selects only `tagname` tags with class `classname`

In [199]:
len(soup.select("div.a-price-whole"))

0

In [200]:
len(soup.select("span.a-price-whole"))

45

[Beautiful Soup selectors](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
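
The remaining selector kinds from the list above (combined selectors, siblings, has attribute) can be sketched on a small made-up page. The markup is hypothetical; the selector syntax is standard CSS as supported by `select`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = """
<h1>Offers</h1>
<p class="intro">First paragraph</p>
<p>Second paragraph</p>
<div class="price-item" id="offer">Shoes</div>
<div class="price-item">Socks</div>
<a href="/cart">Cart</a>
<a>No link target</a>
"""
soup = BeautifulSoup(html, "html.parser")

combined = soup.select("div.price-item#offer")  # tag name + class + id at once
adjacent = soup.select("h1 + p")                # the p immediately following an h1
siblings = soup.select("h1 ~ p")                # every p sibling that comes after an h1
with_href = soup.select("a[href]")              # links that have an href attribute
exact_attr = soup.select('a[href="/cart"]')     # links whose href has this exact value
first_item = soup.select_one("div.price-item")  # first match only (or None)

print([t.text for t in combined], [t.text for t in adjacent], [t.text for t in siblings])
print(len(with_href), len(exact_attr), first_item.text)
```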

### Exercise

Can we simply change the search term in the URL "https://www.amazon.es/s?k=python+books" and reuse the same code to get analogous results?  
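
As a starting point, the search URL can be built from any search term with the standard library. The function name `build_search_url` is hypothetical, and the sketch only assumes that the `k` query parameter carries the search term, as in the URL above. Whether the same class names apply to other searches is something to verify by inspecting the page.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.amazon.es/s"


def build_search_url(search_terms):
    """Build a search URL, encoding spaces and special characters in the query."""
    query = urlencode({"k": search_terms})
    return f"{BASE_URL}?{query}"


print(build_search_url("python books"))
print(build_search_url("machine learning"))
```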

## Comments

Always check whether an **API** is available before scraping, because an API is usually:
 * much easier to use
 * well documented
 * preferred by the server

[Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

## Summary

 * Webpages are built with HTML, CSS and JavaScript
 * HTML has the content. We scrape HTML
 * `requests` to `get` the HTML
 * `Beautiful Soup` to programmatically analyse the HTML

 * HTML is hierarchical
 * HTML uses tags
 * HTML tags have attributes
 * We find tags by tag name, class, id, or other attributes
 * We can use CSS selectors to select in very complex ways

 * Check your guesses in the browser console with `document.querySelectorAll('a').forEach(elm => elm.style.background = 'red')` or similar

## Further materials

[Web archive](http://web.archive.org/): browse past versions of webpages.