# Scraping

The plan is to scrape the [Books to Scrape](https://books.toscrape.com) website.

<img src="books_screenshot.png" width=700 height=100 alt="books.toscrape.com">

The website lists 1000 books split across 50 pages.

The detailed plan:

1. Get the links for all books
2. Using the links obtained in step 1, extract for each book:
    * book id
    * book title
    * book category
    * book price
    * book rating
    * book product type
    * price including and excluding tax
    * tax
    * availability (number in stock)
    * number of reviews
    * image URL
    * book description
    
Libraries: [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), [requests](https://requests.readthedocs.io/en/latest/), and [pandas](https://pandas.pydata.org/docs/) 
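
The fields listed in step 2 can be pictured as one record per book. The sketch below is only an illustration: `Book` is a hypothetical dataclass that does not appear in the notebook (which stores plain dictionaries), the field names mirror the dictionary keys used later on, and the sample values are made up.

```python
from dataclasses import dataclass, asdict


@dataclass
class Book:
    """Hypothetical record type: one instance per scraped book."""
    id: str
    title: str
    category: str
    price: float
    rating: int
    product_type: str
    price_excl_tax: float
    price_incl_tax: float
    tax: float
    availability: int
    num_reviews: int
    img_url: str
    description: str


# made-up values, only to show the shape of a record
example = Book(
    id="abc123",
    title="Example Title",
    category="Example Category",
    price=51.77,
    rating=3,
    product_type="Books",
    price_excl_tax=51.77,
    price_incl_tax=51.77,
    tax=0.0,
    availability=22,
    num_reviews=0,
    img_url="https://example.com/cover.jpg",
    description="no description",
)

# asdict() yields a plain dictionary, the form pandas.DataFrame accepts in a list
record = asdict(example)
print(list(record))
```

A list of such dictionaries can be passed straight to `pd.DataFrame`, which is how the notebook builds its final table.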

---

## Setup

In [1]:
# imports
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [2]:
#base url
base_url = 'https://books.toscrape.com/catalogue/'

---

## 1. Get links for all books

In [4]:
# empty link list to be populated
links = []

# while loop parameters
fetching = True
current = 1

# begin while loop
while fetching:
    # get the current url and set up BeautifulSoup
    url = f'{base_url}page-{current}.html'
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    articles = soup.find_all('article', class_='product_pod')
    
    print(f'fetching links in page {current}...')
    
    # each article has an image anchor and a title anchor with the same href;
    # take the title anchor inside <h3> and build the absolute link
    for article in articles:
        href = article.find('h3').find('a').get('href')
        links.append(f'{base_url}{href}')
    
    # go to the next page, or terminate the loop when there is no "next" button
    next_page = soup.find('li', class_='next')
    if next_page:
        current += 1
    else:
        print('fetching complete')
        fetching = False

fetching links in page 1...
fetching links in page 2...
fetching links in page 3...
fetching links in page 4...
fetching links in page 5...
fetching links in page 6...
fetching links in page 7...
fetching links in page 8...
fetching links in page 9...
fetching links in page 10...
fetching links in page 11...
fetching links in page 12...
fetching links in page 13...
fetching links in page 14...
fetching links in page 15...
fetching links in page 16...
fetching links in page 17...
fetching links in page 18...
fetching links in page 19...
fetching links in page 20...
fetching links in page 21...
fetching links in page 22...
fetching links in page 23...
fetching links in page 24...
fetching links in page 25...
fetching links in page 26...
fetching links in page 27...
fetching links in page 28...
fetching links in page 29...
fetching links in page 30...
fetching links in page 31...
fetching links in page 32...
fetching links in page 33...
fetching links in page 34...
fetching links in page 

In [5]:
#display links
links[0:5]

['https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'https://books.toscrape.com/catalogue/soumission_998/index.html',
 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html']

In [6]:
#check links length
len(links)

1000
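
The link-collection step can also be tried offline. The sketch below runs on a small hand-written HTML fragment that only imitates the listing page. The fragment and the helpers `extract_links` and `next_page_url` are assumptions made for illustration, not part of the notebook. It uses CSS selectors and `urljoin` instead of string concatenation, so relative links resolve against the page they came from.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# hand-written sample imitating the listing page markup (an assumption)
SAMPLE_HTML = """
<ol>
  <li><article class="product_pod">
    <div class="image_container"><a href="book-one_2/index.html"><img src="one.jpg"></a></div>
    <h3><a href="book-one_2/index.html" title="Book One">Book One</a></h3>
  </article></li>
  <li><article class="product_pod">
    <div class="image_container"><a href="book-two_1/index.html"><img src="two.jpg"></a></div>
    <h3><a href="book-two_1/index.html" title="Book Two">Book Two</a></h3>
  </article></li>
</ol>
<ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>
"""


def extract_links(html, page_url):
    """Return absolute book links found on one listing page."""
    soup = BeautifulSoup(html, "html.parser")
    # one title anchor per article, so no duplicates from the image anchor
    anchors = soup.select("article.product_pod h3 a")
    return [urljoin(page_url, a["href"]) for a in anchors]


def next_page_url(html, page_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    anchor = soup.select_one("li.next a")
    return urljoin(page_url, anchor["href"]) if anchor else None


page_url = "https://books.toscrape.com/catalogue/page-1.html"
sample_links = extract_links(SAMPLE_HTML, page_url)
following = next_page_url(SAMPLE_HTML, page_url)
print(sample_links)
print(following)
```

Following the "next" anchor rather than counting pages means the loop stops on its own when the pager no longer offers a next page.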

---

## 2. Extract individual book details

In [169]:
# helper functions
import re

# format prices
def format_price(price):
    """
    prices come in the format => Â£51.77 (a mis-decoded £51.77)
    we want to remove the currency symbol and convert to float
    """
    # strip both the stray "Â" and the pound sign, whichever is present
    return float(price.replace('Â', '').replace('£', ''))

# format availability
def format_availabilty(availability):
    """
    availability takes the form => In stock (22 available)
    we want to extract the number using regex
    """
    match = re.findall(r'\d+', availability)[0]
    return int(match)

# format ratings
def format_rating(rating):
    """
    ratings come in the format => One, Two, Three...
    we want to convert to a number using the map below
    """
    rating_map = {
        "Zero": 0,
        "One": 1,
        "Two": 2,
        "Three": 3,
        "Four": 4,
        "Five": 5,
    }
    
    return rating_map[rating]
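
The `Â£` prefix mentioned in `format_price` is what a UTF-8 encoded `£` looks like when its bytes are decoded as Latin-1. The sketch below reproduces the effect with plain strings. It assumes, without having checked the server's headers, that this is what happens when `page.text` is read with the encoding `requests` guessed.

```python
# "£" is two bytes in UTF-8; decoding those bytes as Latin-1 yields two characters
pound_bytes = "£".encode("utf-8")
garbled = pound_bytes.decode("latin-1")
print(garbled)

# the damage is reversible: re-encode as Latin-1, then decode as UTF-8
repaired = "Â£51.77".encode("latin-1").decode("utf-8")
print(repaired)
```

With `requests`, a common way to avoid the garbling is to set `page.encoding = 'utf-8'` before reading `page.text`, or to pass `page.content` (raw bytes) to BeautifulSoup. In either case the price text would start with `£` instead of `Â£`.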

In [None]:
from urllib.parse import urljoin

book_list = []
for i, link in enumerate(links):
    # setup
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'html.parser')
    # the category is the breadcrumb entry just before the active (current) one
    active_book = soup.find('li', class_='active').find_previous_sibling('li')
    article = soup.find('article', class_='product_page')
    table = article.find('table', class_='table-striped')
    # read the table cells once instead of searching the table for every field
    cells = [td.text for td in table.find_all('td')]
    
    print(f'fetching book {i+1}...')
    
    # extract book information
    book_id = cells[0]
    title = article.find('h1').text
    category = active_book.text.strip()
    price = format_price(article.find('p', class_='price_color').text)
    rating = format_rating(article.find('p', class_='star-rating')['class'][1])
    product_type = cells[1]
    price_excl_tax = format_price(cells[2])
    price_incl_tax = format_price(cells[3])
    tax = format_price(cells[4])
    availability = format_availabilty(cells[5])
    num_reviews = int(cells[6])
    # the image src is relative to the book page, so resolve it against the link
    img_url = urljoin(link, article.find('img')['src'])
    description_header = article.find('div', id='product_description')
    description = (description_header.find_next_sibling('p').text
                   if description_header
                   else "no description")
    
    # store information in a dictionary
    book = {'id': book_id, 'title': title, 'category': category, 'price': price,
            'rating': rating, 'product_type': product_type,
            'price_excl_tax': price_excl_tax, 'price_incl_tax': price_incl_tax,
            'tax': tax, 'availability': availability, 'num_reviews': num_reviews,
            'img_url': img_url, 'description': description}
    
    # add book to books list
    book_list.append(book)
    
# create books dataframe
books = pd.DataFrame(book_list)

fetching book 1...
fetching book 2...
fetching book 3...
fetching book 4...
fetching book 5...
fetching book 6...
fetching book 7...
fetching book 8...
fetching book 9...
fetching book 10...
fetching book 11...
fetching book 12...
fetching book 13...
fetching book 14...
fetching book 15...
fetching book 16...
fetching book 17...
fetching book 18...
fetching book 19...
fetching book 20...
fetching book 21...
fetching book 22...
fetching book 23...
fetching book 24...
fetching book 25...
fetching book 26...
fetching book 27...
fetching book 28...
fetching book 29...
fetching book 30...
fetching book 31...
fetching book 32...
fetching book 33...
fetching book 34...
fetching book 35...
fetching book 36...
fetching book 37...
fetching book 38...
fetching book 39...
fetching book 40...
fetching book 41...
fetching book 42...
fetching book 43...
fetching book 44...
fetching book 45...
fetching book 46...
fetching book 47...
fetching book 48...
fetching book 49...
fetching book 50...
fetching 
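
The extraction loop reads the product table by cell position. A possible alternative is to key each value by its row header, so the lookup does not depend on row order. It is sketched here on a hand-written HTML fragment that only imitates the product table. The fragment, its header labels and values, and the `table_to_dict` helper are assumptions, not part of the notebook.

```python
from bs4 import BeautifulSoup

# hand-written sample imitating the product information table (an assumption)
SAMPLE_TABLE = """
<table class="table table-striped">
  <tr><th>UPC</th><td>abc123</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Price (excl. tax)</th><td>£51.77</td></tr>
  <tr><th>Price (incl. tax)</th><td>£51.77</td></tr>
  <tr><th>Tax</th><td>£0.00</td></tr>
  <tr><th>Availability</th><td>In stock (22 available)</td></tr>
  <tr><th>Number of reviews</th><td>0</td></tr>
</table>
"""


def table_to_dict(html):
    """Map each row's header text to its cell text."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select("table.table-striped tr")
    return {
        row.th.get_text(strip=True): row.td.get_text(strip=True)
        for row in rows
    }


info = table_to_dict(SAMPLE_TABLE)
print(info["Availability"])
print(info["Number of reviews"])
```

Looking values up by label costs one extra dictionary per book, but a missing or reordered row then raises a `KeyError` for that label instead of silently shifting every later field.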