# Web scraping tutorial
[Le Wagon tutorial](https://www.youtube.com/watch?v=JzwO8Y_3zKw&ab_channel=LeWagon)

Web scraping means extracting data from websites automatically, here by parsing the HTML of their pages.

The site used in this tutorial is [Books to Scrape](https://books.toscrape.com/).

## Livecode

In [2]:
import requests
from bs4 import BeautifulSoup

In [3]:
url = 'https://books.toscrape.com/'
response = requests.get(url)
html = response.content
scraped = BeautifulSoup(html, 'html.parser')

In [4]:
scraped.title.text.strip()

'All products | Books to Scrape - Sandbox'
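The cells above assume the request succeeds. As a minimal sketch, the same fetch-and-parse step could be wrapped in two small helpers that fail loudly on HTTP errors. The names `fetch_html` and `parse_html` and the 10-second timeout are assumptions made for illustration; they are not part of the original livecode.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.content


def parse_html(html):
    """Parse raw HTML with Python's built-in 'html.parser'."""
    return BeautifulSoup(html, "html.parser")


# Usage (needs network access):
# scraped = parse_html(fetch_html("https://books.toscrape.com/"))
# print(scraped.title.text.strip())
```

Keeping the download separate from the parsing also makes it possible to try the parsing code on a saved or hand-written HTML string.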

To access the title of each book, we first have to check the page structure. Each book is an `article` like this one:
```html
<article class="product_pod">
    <div class="image_container">
        <a href="catalogue/a-light-in-the-attic_1000/index.html">
            <img src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg" alt="A Light in the Attic" class="thumbnail">
        </a> 
    </div>
    <p class="star-rating Three">
        <i class="icon-star"></i>
        <i class="icon-star"></i>
        <i class="icon-star"></i>
        <i class="icon-star"></i>
        <i class="icon-star"></i>
    </p>
    <h3>
        <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>
    </h3>
    <div class="product_price">
        <p class="price_color">£51.77</p>
        <p class="instock availability">
            <i class="icon-ok"></i>
            In stock
        </p>
        <form>
            <button type="submit" class="btn btn-primary btn-block" data-loading-text="Adding...">Add to basket</button>
        </form>
    </div>
</article>
```

### Find
Let's find the first book by reading the `title` attribute of the `a` inside the `h3` inside the first `article`. `find` returns only the first match.

In [8]:
scraped.find('article', class_="product_pod").h3.a["title"]

'A Light in the Attic'
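`find` returns `None` when nothing matches, so chaining `.h3.a["title"]` raises an error on a page without books. The sketch below shows one way to guard against that. It runs on a small hand-written HTML snippet instead of the live site, and the helper name `first_title` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written snippet that mimics one book from the page
html = """
<article class="product_pod">
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html"
         title="A Light in the Attic">A Light in the ...</a></h3>
</article>
"""
soup = BeautifulSoup(html, "html.parser")


def first_title(soup):
    """Return the first book title, or None if the expected tags are missing."""
    article = soup.find("article", class_="product_pod")
    if article is None or article.h3 is None or article.h3.a is None:
        return None
    # .get returns None instead of raising KeyError when the attribute is absent
    return article.h3.a.get("title")


print(first_title(soup))
print(first_title(BeautifulSoup("<p>no books here</p>", "html.parser")))
```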

### Find_all
Now let's get the titles of all the books, one per `article`. `find_all` returns every match instead of only the first.

In [12]:
articles = scraped.find_all("article", class_="product_pod")
type(articles)

bs4.element.ResultSet

In [13]:
for article in articles:
    print(article.h3.a["title"])

A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas
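The loop reads the `title` attribute rather than the link text because, as the HTML above shows, the visible text is shortened for long titles (`A Light in the ...`). A small sketch on a hand-written snippet makes the difference visible; the variable names `full_titles` and `link_texts` are chosen for illustration.

```python
from bs4 import BeautifulSoup

# Two books copied from the page structure, trimmed to the relevant tags
html = """
<article class="product_pod">
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
<article class="product_pod">
  <h3><a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
</article>
"""
soup = BeautifulSoup(html, "html.parser")
articles = soup.find_all("article", class_="product_pod")

# The attribute holds the full title
full_titles = [article.h3.a["title"] for article in articles]
# The visible link text may be truncated by the site
link_texts = [article.h3.a.get_text(strip=True) for article in articles]

print(full_titles)
print(link_texts)
```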


Next, we get all the prices by searching for `p` tags with the class `price_color`

In [20]:
prices = scraped.find_all("p", class_="price_color")
for price in prices:
    print(price.text)

£51.77
£53.74
£50.10
£47.82
£54.23
£22.65
£33.34
£17.93
£22.60
£52.15
£13.99
£20.66
£17.46
£52.29
£35.02
£57.25
£23.88
£37.59
£51.33
£45.17


In [21]:
type(price.text)

str

For calculations it is better to convert the prices to floats, removing the pound sign.

In [27]:
for price in prices:
    price = float(price.text[1:])
    print(price)

51.77
53.74
50.1
47.82
54.23
22.65
33.34
17.93
22.6
52.15
13.99
20.66
17.46
52.29
35.02
57.25
23.88
37.59
51.33
45.17


In [28]:
type(price)

float
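Slicing with `[1:]` assumes the price always starts with exactly one currency character. As a slightly more tolerant sketch, a regular expression can pull out the number wherever it sits in the string. The helper name `parse_price` is hypothetical, and it does not handle thousands separators.

```python
import re


def parse_price(text):
    """Extract the first decimal number from a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no numeric price found in {text!r}")
    return float(match.group())


# Works regardless of surrounding whitespace or currency symbol
prices = [parse_price(t) for t in ["£51.77", "£50.10", " £22.60 "]]
print(prices)
```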

Now we create a list of dicts, each with a single entry that uses the title as key and the price as value

In [30]:
results = []
for article in articles:
    title = article.h3.a["title"]
    price = article.find('p', class_='price_color')
    price = float(price.text.lstrip('£'))
    results.append({title: price})
print(results)

[{'A Light in the Attic': 51.77}, {'Tipping the Velvet': 53.74}, {'Soumission': 50.1}, {'Sharp Objects': 47.82}, {'Sapiens: A Brief History of Humankind': 54.23}, {'The Requiem Red': 22.65}, {'The Dirty Little Secrets of Getting Your Dream Job': 33.34}, {'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull': 17.93}, {'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics': 22.6}, {'The Black Maria': 52.15}, {'Starving Hearts (Triangular Trade Trilogy, #1)': 13.99}, {"Shakespeare's Sonnets": 20.66}, {'Set Me Free': 17.46}, {"Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)": 52.29}, {'Rip it Up and Start Again': 35.02}, {'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991': 57.25}, {'Olio': 23.88}, {'Mesaerion: The Best Science Fiction Stories 1800-1849': 37.59}, {'Libertarianism for Beginners': 51.33}, {"It's Only the Himalayas": 45.17}]
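With the title as the key, every dict has a different key, which makes the list awkward to sort or aggregate. One alternative, shown here only as a sketch, is to give each record the same fixed keys. The keys `"title"` and `"price"` and the two-book snippet are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written snippet with two books from the page
html = """
<article class="product_pod">
  <h3><a title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a title="Tipping the Velvet">Tipping the Velvet</a></h3>
  <p class="price_color">£53.74</p>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

books = []
for article in soup.find_all("article", class_="product_pod"):
    price_text = article.find("p", class_="price_color").text
    books.append({
        "title": article.h3.a["title"],
        "price": float(price_text.lstrip("£")),
    })

# Fixed keys make aggregation straightforward
most_expensive = max(books, key=lambda book: book["price"])
average_price = sum(book["price"] for book in books) / len(books)
print(most_expensive["title"])
```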


### Select
Another useful method is `select`, which takes a CSS selector.
It lets you combine tags, classes and ids in a single search.
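As a quick orientation before the cells on the live page, the sketch below runs the main selector forms on a small hand-written snippet. The `#sidebar` id is invented for this snippet and does not come from the real site.

```python
from bs4 import BeautifulSoup

html = """
<div id="sidebar"><a href="index.html">Home</a></div>
<article class="product_pod">
  <h3><a href="catalogue/soumission_998/index.html" title="Soumission">Soumission</a></h3>
  <p class="instock availability">In stock</p>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

# Space: descendant, every <a> inside an <h3> inside an <article>
descendants = soup.select("article h3 a")
# Comma: union, every element that is an <article>, an <a> or an <h3>
either = soup.select("article, a, h3")
# Chained classes: <p> tags that carry both classes
both_classes = soup.select("p.instock.availability")
# Hash: id, every <a> inside the element with id "sidebar"
by_id = soup.select("#sidebar a")
# select_one returns only the first match, or None
first_match = soup.select_one("article h3 a")

print(len(descendants), len(either), len(both_classes), len(by_id))
print(first_match["title"])
```

`select` always returns a list, even for a single match, while `select_one` plays the role that `find` plays for `find_all`.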

The code below selects all `a` inside `h3` inside `article`

In [36]:
scraped.select("article h3 a")

[<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>,
 <a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a>,
 <a href="catalogue/soumission_998/index.html" title="Soumission">Soumission</a>,
 <a href="catalogue/sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a>,
 <a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a>,
 <a href="catalogue/the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a>,
 <a href="catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">The Dirty Little Secrets ...</a>,
 <a href="catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html" title="The Coming Woman: A Novel Based on the Life of the 

The code below selects every element that is an `article`, an `a` or an `h3`; the comma works as "or"

In [35]:
scraped.select("article, a, h3")

[<a href="index.html">Books to Scrape</a>,
 <a href="index.html">Home</a>,
 <a href="catalogue/category/books_1/index.html">
                             
                                 Books
                             
                         </a>,
 <a href="catalogue/category/books/travel_2/index.html">
                             
                                 Travel
                             
                         </a>,
 <a href="catalogue/category/books/mystery_3/index.html">
                             
                                 Mystery
                             
                         </a>,
 <a href="catalogue/category/books/historical-fiction_4/index.html">
                             
                                 Historical Fiction
                             
                         </a>,
 <a href="catalogue/category/books/sequential-art_5/index.html">
                             
                                 Sequential Art
            

The code below selects the tags that have both the `instock` and the `availability` class

In [37]:
scraped.select(".instock.availability")

[<p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i cl