# Scraping

The plan is to scrape the [Books to Scrape](https://books.toscrape.com) website.

<img src="books_screenshot.png" width=700 height=100 alt="books.toscrape.com">

The website contains 1000 books split across 50 pages.

The detailed plan:

1. Get links for all books
2. Using the links from step 1, extract for each book:
    * book id
    * book title
    * book category
    * book price
    * book rating
    * book product type
    * price including and excluding tax
    * tax
    * availability (number in stock)
    * number of reviews
    * image URL
    * book description
    
Libraries: [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), [requests](https://requests.readthedocs.io/en/latest/), and [pandas](https://pandas.pydata.org/docs/) 

---

## Setup

In [1]:
# imports
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [2]:
#base url
base_url = 'https://books.toscrape.com/catalogue/'

---

## 1. Get links for all books

In [4]:
# empty list of links, populated below
links = []

# while-loop state
fetching = True
current = 1

while fetching:
    # fetch the current listing page and parse it
    url = f'{base_url}page-{current}.html'
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    articles = soup.find_all('article', class_='product_pod')

    print(f'fetching links in page {current}...')

    # each article's title link points to the book page (href is relative to base_url)
    for article in articles:
        href = article.find('h3').find('a')['href']
        links.append(f'{base_url}{href}')

    # move on to the next page, or stop when there is no "next" link
    next_page = soup.find('li', class_='next')
    if next_page:
        current += 1
    else:
        print('fetching complete')
        fetching = False

fetching links in page 1...
fetching links in page 2...
fetching links in page 3...
fetching links in page 4...
fetching links in page 5...
fetching links in page 6...
fetching links in page 7...
fetching links in page 8...
fetching links in page 9...
fetching links in page 10...
fetching links in page 11...
fetching links in page 12...
fetching links in page 13...
fetching links in page 14...
fetching links in page 15...
fetching links in page 16...
fetching links in page 17...
fetching links in page 18...
fetching links in page 19...
fetching links in page 20...
fetching links in page 21...
fetching links in page 22...
fetching links in page 23...
fetching links in page 24...
fetching links in page 25...
fetching links in page 26...
fetching links in page 27...
fetching links in page 28...
fetching links in page 29...
fetching links in page 30...
fetching links in page 31...
fetching links in page 32...
fetching links in page 33...
fetching links in page 34...
fetching links in page 

In [5]:
#display links
links[0:5]

['https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'https://books.toscrape.com/catalogue/soumission_998/index.html',
 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html']

In [6]:
#check links length
len(links)

1000
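
Since the site is stated to hold 1000 books across 50 pages, the listing URLs could also be generated up front instead of following the "next" link. A minimal sketch, assuming the page count stays at 50 and the `page-N.html` pattern used above holds; `TOTAL_BOOKS`, `TOTAL_PAGES` and `page_urls` are names introduced here for illustration only.

```python
# Sketch only: builds the listing-page URLs without any network access.
base_url = 'https://books.toscrape.com/catalogue/'
TOTAL_BOOKS = 1000  # from the site description at the top
TOTAL_PAGES = 50    # assumption: hard-coded page count

page_urls = [f'{base_url}page-{n}.html' for n in range(1, TOTAL_PAGES + 1)]
books_per_page = TOTAL_BOOKS // TOTAL_PAGES

print(page_urls[0])
print(page_urls[-1])
print(books_per_page)
```

The while loop is likely the safer choice if pages are ever added or removed, because it stops only when the "next" link disappears.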

---

## 2. Extract individual book details

In [169]:
# helper functions
import re

# format prices
def format_price(price):
    """
    Prices arrive as a currency symbol followed by the amount => Â£51.77
    (the pound sign appears as Â£ when UTF-8 text is decoded as Latin-1).
    Strip everything except digits and the decimal point, then convert to float.
    """
    return float(re.sub(r'[^\d.]', '', price))

# format availability
def format_availabilty(availability):
    """
    Availability takes the form => In stock (22 available)
    Extract the first run of digits with a regex and convert it to int.
    """
    match = re.search(r'\d+', availability)
    return int(match.group())

# format ratings
def format_rating(rating):
    """
    Ratings come as capitalised words => One, Two, Three...
    Convert them to integers using the map below.
    """
    rating_map = {
        "Zero": 0,
        "One": 1,
        "Two": 2,
        "Three": 3,
        "Four": 4,
        "Five": 5,
    }

    return rating_map[rating]
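
The `Â£` in the price strings is what a UTF-8 pound sign looks like when its bytes are decoded as Latin-1. The sketch below reproduces this with plain strings (no network involved). `repair_mojibake` is a hypothetical helper name, and whether it applies to a given response depends on how that response was decoded.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1.

    Returns the input unchanged if it cannot be round-tripped.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

raw_bytes = '£51.77'.encode('utf-8')   # bytes as a UTF-8 page would send them
garbled = raw_bytes.decode('latin-1')  # wrong codec gives 'Â£51.77'
repaired = repair_mojibake(garbled)    # back to '£51.77'

print(garbled)
print(repaired)
```

With requests, setting `page.encoding = 'utf-8'` before reading `page.text` is one way to avoid the problem at the source, assuming the pages really are UTF-8.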

In [181]:
book_list = []
for i, link in enumerate(links):
    # setup
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'html.parser')
    # the category is the breadcrumb entry just before the active (title) entry
    active_book = soup.find('li', class_='active').find_previous_sibling('li')
    article = soup.find('article', class_='product_page')
    table = article.find('table', class_='table-striped')
    cells = table.find_all('td')  # product information table, read by position
    description_header = article.find('div', id='product_description')

    print(f'fetching book {i+1}...')

    # extract book information
    book_id = cells[0].text
    title = article.find('h1').text
    category = active_book.text.strip()
    price = format_price(article.find('p', class_='price_color').text)
    rating = format_rating(article.find('p', class_='star-rating')['class'][1])
    product_type = cells[1].text
    price_excl_tax = format_price(cells[2].text)
    price_incl_tax = format_price(cells[3].text)
    tax = format_price(cells[4].text)
    availability = format_availabilty(cells[5].text)
    num_reviews = int(cells[6].text)
    img_url = base_url + article.find('img')['src']
    # the description is the first <p> after its header; some books have none
    description = (description_header.find_next_sibling('p').text
                   if description_header
                   else "no description")

    # store information in a dictionary
    book = {
        'id': book_id,
        'title': title,
        'category': category,
        'price': price,
        'rating': rating,
        'product_type': product_type,
        'price_excl_tax': price_excl_tax,
        'price_incl_tax': price_incl_tax,
        'tax': tax,
        'availability': availability,
        'num_reviews': num_reviews,
        'img_url': img_url,
        'description': description,
    }

    # add book to the book list
    book_list.append(book)

# create books dataframe
books = pd.DataFrame(book_list)

fetching book 1...
fetching book 2...
fetching book 3...
fetching book 4...
fetching book 5...
fetching book 6...
fetching book 7...
fetching book 8...
fetching book 9...
fetching book 10...
fetching book 11...
fetching book 12...
fetching book 13...
fetching book 14...
fetching book 15...
fetching book 16...
fetching book 17...
fetching book 18...
fetching book 19...
fetching book 20...
fetching book 21...
fetching book 22...
fetching book 23...
fetching book 24...
fetching book 25...
fetching book 26...
fetching book 27...
fetching book 28...
fetching book 29...
fetching book 30...
fetching book 31...
fetching book 32...
fetching book 33...
fetching book 34...
fetching book 35...
fetching book 36...
fetching book 37...
fetching book 38...
fetching book 39...
fetching book 40...
fetching book 41...
fetching book 42...
fetching book 43...
fetching book 44...
fetching book 45...
fetching book 46...
fetching book 47...
fetching book 48...
fetching book 49...
fetching book 50...
fetching 

fetching book 397...
fetching book 398...
fetching book 399...
fetching book 400...
fetching book 401...
fetching book 402...
fetching book 403...
fetching book 404...
fetching book 405...
fetching book 406...
fetching book 407...
fetching book 408...
fetching book 409...
fetching book 410...
fetching book 411...
fetching book 412...
fetching book 413...
fetching book 414...
fetching book 415...
fetching book 416...
fetching book 417...
fetching book 418...
fetching book 419...
fetching book 420...
fetching book 421...
fetching book 422...
fetching book 423...
fetching book 424...
fetching book 425...
fetching book 426...
fetching book 427...
fetching book 428...
fetching book 429...
fetching book 430...
fetching book 431...
fetching book 432...
fetching book 433...
fetching book 434...
fetching book 435...
fetching book 436...
fetching book 437...
fetching book 438...
fetching book 439...
fetching book 440...
fetching book 441...
fetching book 442...
fetching book 443...
fetching book

fetching book 788...
fetching book 789...
fetching book 790...
fetching book 791...
fetching book 792...
fetching book 793...
fetching book 794...
fetching book 795...
fetching book 796...
fetching book 797...
fetching book 798...
fetching book 799...
fetching book 800...
fetching book 801...
fetching book 802...
fetching book 803...
fetching book 804...
fetching book 805...
fetching book 806...
fetching book 807...
fetching book 808...
fetching book 809...
fetching book 810...
fetching book 811...
fetching book 812...
fetching book 813...
fetching book 814...
fetching book 815...
fetching book 816...
fetching book 817...
fetching book 818...
fetching book 819...
fetching book 820...
fetching book 821...
fetching book 822...
fetching book 823...
fetching book 824...
fetching book 825...
fetching book 826...
fetching book 827...
fetching book 828...
fetching book 829...
fetching book 830...
fetching book 831...
fetching book 832...
fetching book 833...
fetching book 834...
fetching book
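
The loop above reads the product table by cell position. A sketch of an alternative that keys each value by its row header, so a reordered table would not silently shift columns. The HTML fragment and its header labels are hand-written assumptions for illustration, not copied from the site, and `table_to_dict` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hand-written sample; the real page's labels may differ (assumption).
sample_html = """
<table class="table table-striped">
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Number of reviews</th><td>0</td></tr>
</table>
"""

def table_to_dict(table):
    """Map each row's <th> text to its <td> text."""
    info = {}
    for row in table.find_all('tr'):
        header, value = row.find('th'), row.find('td')
        if header and value:
            info[header.text.strip()] = value.text.strip()
    return info

soup = BeautifulSoup(sample_html, 'html.parser')
product_info = table_to_dict(soup.find('table', class_='table-striped'))
print(product_info)
```

Values come back as strings, so the same formatting helpers would still be needed for prices and counts.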

In [183]:
books.head()

| | id | title | category | price | rating | product_type | price_excl_tax | price_incl_tax | tax | availability | num_reviews | img_url | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a897fe39b1053632 | A Light in the Attic | Poetry | 51.77 | 3 | Books | 51.77 | 51.77 | 0.0 | 22 | 0 | https://books.toscrape.com/catalogue/../../med... | It's hard to imagine a world without A Light i... |
| 1 | 90fa61229261140a | Tipping the Velvet | Historical Fiction | 53.74 | 1 | Books | 53.74 | 53.74 | 0.0 | 20 | 0 | https://books.toscrape.com/catalogue/../../med... | "Erotic and absorbing...Written with starling ... |
| 2 | 6957f44c3847a760 | Soumission | Fiction | 50.1 | 1 | Books | 50.1 | 50.1 | 0.0 | 20 | 0 | https://books.toscrape.com/catalogue/../../med... | Dans une France assez proche de la nÃ´tre, un ... |
| 3 | e00eb4fd7b871a48 | Sharp Objects | Mystery | 47.82 | 4 | Books | 47.82 | 47.82 | 0.0 | 20 | 0 | https://books.toscrape.com/catalogue/../../med... | WICKED above her hipbone, GIRL across her hear... |
| 4 | 4165285e1663650f | Sapiens: A Brief History of Humankind | History | 54.23 | 5 | Books | 54.23 | 54.23 | 0.0 | 20 | 0 | https://books.toscrape.com/catalogue/../../med... | From a renowned historian comes a groundbreaki... |
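
The `img_url` values above still contain `../../` because the relative `src` is appended to `base_url` as plain text. A sketch of resolving it with the standard library's `urljoin`, using the book page URL as the base. The image path below is a made-up placeholder, not a real file on the site.

```python
from urllib.parse import urljoin

# URL of a book page (taken from the links list shown earlier)
page_url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
# hypothetical relative src, shaped like the truncated values in the table
relative_src = '../../media/cache/xx/yy/example.jpg'

# urljoin resolves the ../ segments against the page's directory
img_url = urljoin(page_url, relative_src)
print(img_url)
```

In the scraping loop this would correspond to joining against `link` rather than `base_url`.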


In [197]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              1000 non-null   object 
 1   title           1000 non-null   object 
 2   category        1000 non-null   object 
 3   price           1000 non-null   float64
 4   rating          1000 non-null   int64  
 5   product_type    1000 non-null   object 
 6   price_excl_tax  1000 non-null   float64
 7   price_incl_tax  1000 non-null   float64
 8   tax             1000 non-null   float64
 9   availability    1000 non-null   int64  
 10  num_reviews     1000 non-null   int64  
 11  img_url         1000 non-null   object 
 12  description     1000 non-null   object 
dtypes: float64(4), int64(3), object(6)
memory usage: 101.7+ KB


In [182]:
#save csv
books.to_csv('books.csv', index=False)
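
When the saved file is read back, pandas infers column types again, so an id made only of digits would lose its leading zeros unless the column is read as text. A small sketch using an in-memory buffer rather than `books.csv`; the two rows are invented sample data, and the ids shown earlier contain letters, so this may never trigger on the real file.

```python
import io
import pandas as pd

# Invented rows with all-digit ids to show the effect
sample = pd.DataFrame([
    {'id': '0012345678901234', 'title': 'Example A', 'price': 10.5},
    {'id': '0099999999999999', 'title': 'Example B', 'price': 20.0},
])

buffer = io.StringIO()
sample.to_csv(buffer, index=False)

buffer.seek(0)
inferred = pd.read_csv(buffer)                    # id parsed as integers
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={'id': str})  # id kept as written

print(inferred['id'].tolist())
print(as_text['id'].tolist())
```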