## Objective

We will be scraping data from an online book store: http://books.toscrape.com/

The website is a fictional store built as a scraping sandbox, so we can scrape it as much as we want.

We will focus on gathering the following information about every product on the website:

- book title
- price
- availability
- image
- category
- rating

<img src="images/books.toscrape.png">

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Getting the content of the main page

We will use the requests module to get the HTML of the website's main page.

In [2]:
url = "http://books.toscrape.com/index.html"

In [3]:
source=requests.get(url).text
soup=BeautifulSoup(source,'lxml')
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

## Get and parse a page

In [4]:
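def getandParseURL(url):
    """Download a page and return it as a BeautifulSoup object (None on failure)."""
    try:
        source = requests.get(url, timeout=10).text
        return BeautifulSoup(source, 'lxml')
    except requests.RequestException:
        # only network-related errors are caught; other exceptions still surface
        print("URL not accessible:", url)
        return None

url = "http://books.toscrape.com/index.html"
getandParseURL(url)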

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="static

## Find book URLs on the main page

<img src="images/inspect.png">

In [5]:
article=soup.find('article',class_="product_pod")
article

<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">Â£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>
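
The `Â£` in the price above is most likely an encoding artifact rather than what the site actually serves: it is what a UTF-8 encoded pound sign looks like when its two bytes are decoded as ISO-8859-1 (Latin-1). The sketch below reproduces the effect with plain strings. That this is what happens to the response in this notebook is an assumption based on the symptom, not something verified here.

```python
# A UTF-8 pound sign is stored as two bytes
price_bytes = "£51.77".encode("utf-8")

# Decoding those bytes with the wrong codec yields the garbled text seen above
garbled = price_bytes.decode("latin-1")
print(garbled)

# Decoding with the right codec restores the original string
correct = price_bytes.decode("utf-8")
print(correct)

# The garbled text can be repaired by reversing the wrong decoding step
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

If the same thing happens with a live response, one option is to pass `response.content` (bytes) to BeautifulSoup instead of `response.text`, so that the parser can pick up the encoding declared in the page's `<meta>` tag. Any code that removes the currency symbol by character position would then need adjusting.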

In [6]:
# collect the product link of every book on the main page
mainpage_prod_urls = []
for product in soup.find_all('article', class_="product_pod"):
    prod_link = product.find('div', class_='image_container').a["href"]
    mainpage_prod_urls.append(prod_link)


In [7]:
mainpage_prod_urls

['catalogue/a-light-in-the-attic_1000/index.html',
 'catalogue/tipping-the-velvet_999/index.html',
 'catalogue/soumission_998/index.html',
 'catalogue/sharp-objects_997/index.html',
 'catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'catalogue/the-requiem-red_995/index.html',
 'catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'catalogue/the-black-maria_991/index.html',
 'catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html',
 'catalogue/shakespeares-sonnets_989/index.html',
 'catalogue/set-me-free_988/index.html',
 'catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html',
 'catalogue/rip-it-up-and-start-again_986/index.html',
 'catalogue/our-band-could-be-your-life-scene

In [8]:
'/'.join(url.split('/')[:-1])

'http://books.toscrape.com'
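
Joining strings with `'/'` works here because the links on each page are relative to the folder that page lives in. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves a relative `href` against the page it was found on, including `../` segments. The URLs below are taken from this notebook's outputs; the product/category pairing in the second call is only illustrative.

```python
from urllib.parse import urljoin

# Page on which the relative link was found
page_url = "http://books.toscrape.com/index.html"

# Relative link as it appears in the main page's HTML
relative_link = "catalogue/a-light-in-the-attic_1000/index.html"

# urljoin drops the last path segment of the page (index.html) and appends the link
absolute_link = urljoin(page_url, relative_link)
print(absolute_link)

# It also resolves "../" segments, which plain string joining would leave in place
product_page = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
category_link = urljoin(product_page, "../category/books/travel_2/index.html")
print(category_link)
```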

## Get all book URLs on a page

In [9]:
def getbookurls(url):
    soup = getandParseURL(url)
    # remove the last path segment (e.g. index.html) to get the base of the page URL
    base_url = "/".join(url.split("/")[:-1])
    return [base_url + "/" + x.div.a.get('href') for x in soup.find_all("article", class_="product_pod")]

getbookurls(url)

['http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'http://books.toscrape.com/catalogue/soumission_998/index.html',
 'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'http://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
 'http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'http://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'http://books.toscrape.com/catalogue/the-black-maria_991/index.html',
 'http://books.toscrape.com/catalogue/starving-hearts-triangular-trade-trilogy-1_990

In [10]:
# To fetch all product URLs, we could first fetch every category URL, then every page URL
# within each category, and finally the product URLs from each page
article



In [11]:
soup.find('div',class_='side_categories').ul.ul.li.a['href']

'catalogue/category/books/travel_2/index.html'

## Find book category URLs on the main page

Now let's try retrieving the URLs corresponding to the different product categories:

<img src="images/inspect2.png">

In [12]:
# Find all category links with a regular expression: their href all contain "catalogue/category/books"
import re

def getallcategories(url):
    # parse the page passed as argument rather than relying on the global soup
    soup = getandParseURL(url)
    base_url = '/'.join(url.split('/')[:-1])
    categories_url = []
    for category in soup.find('div', class_='side_categories').find_all('a', href=re.compile("catalogue/category/books")):
        categories_url.append(base_url + '/' + category['href'])
    # skip the first link (the top-level "Books" entry) and keep only the sub-categories
    return categories_url[1:]

In [13]:
getallcategories(url)

['http://books.toscrape.com/catalogue/category/books/travel_2/index.html',
 'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html',
 'http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html',
 'http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html',
 'http://books.toscrape.com/catalogue/category/books/classics_6/index.html',
 'http://books.toscrape.com/catalogue/category/books/philosophy_7/index.html',
 'http://books.toscrape.com/catalogue/category/books/romance_8/index.html',
 'http://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html',
 'http://books.toscrape.com/catalogue/category/books/fiction_10/index.html',
 'http://books.toscrape.com/catalogue/category/books/childrens_11/index.html',
 'http://books.toscrape.com/catalogue/category/books/religion_12/index.html',
 'http://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html',
 'http://books.toscrape.com/catalogue/category/boo
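
Category names can be kept alongside the URLs. The text of a link usually comes with surrounding newlines and indentation, so it needs normalising first. The sketch below is a minimal illustration on hard-coded data: the raw strings are assumed examples of what `.text` returns for a category link, the URLs come from the output above, and the variable names are hypothetical.

```python
# (raw link text, absolute URL) pairs; the raw text is an assumed example
raw_links = [
    ("\n        Travel\n    ",
     "http://books.toscrape.com/catalogue/category/books/travel_2/index.html"),
    ("\n        Historical Fiction\n    ",
     "http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html"),
]

# " ".join(text.split()) collapses every run of whitespace and trims both ends
category_urls_by_name = {" ".join(text.split()): link for text, link in raw_links}

for name, link in category_urls_by_name.items():
    print(name, "->", link)
```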

## Scrape all books data

<img src="images/next.png">

On the following pages there is also a 'previous' button that returns to the preceding page of products.

<img src="images/previous_next.png">

### Get all page URLs

<img src="images/next_inspect.png">

In [14]:
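# Collect the URL of every catalogue page by following the "next" link
pages_urls = [url]
soup = getandParseURL(pages_urls[0])
# Middle pages have two pagination links (previous and next). The first page has only "next",
# hence the extra condition; the last page has only "previous", which ends the loop.
while len(soup.find_all("a", href=re.compile("page"))) == 2 or len(pages_urls) == 1:
    # the last pagination link is "next"; it is relative to the folder of the current page
    new_url = '/'.join(pages_urls[-1].split('/')[:-1]) + '/' + soup.find_all("a", href=re.compile("page"))[-1]['href']
    # add the URL to the list
    pages_urls.append(new_url)
    # get and parse the next page
    soup = getandParseURL(new_url)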

In [15]:
pages_urls

['http://books.toscrape.com/index.html',
 'http://books.toscrape.com/catalogue/page-2.html',
 'http://books.toscrape.com/catalogue/page-3.html',
 'http://books.toscrape.com/catalogue/page-4.html',
 'http://books.toscrape.com/catalogue/page-5.html',
 'http://books.toscrape.com/catalogue/page-6.html',
 'http://books.toscrape.com/catalogue/page-7.html',
 'http://books.toscrape.com/catalogue/page-8.html',
 'http://books.toscrape.com/catalogue/page-9.html',
 'http://books.toscrape.com/catalogue/page-10.html',
 'http://books.toscrape.com/catalogue/page-11.html',
 'http://books.toscrape.com/catalogue/page-12.html',
 'http://books.toscrape.com/catalogue/page-13.html',
 'http://books.toscrape.com/catalogue/page-14.html',
 'http://books.toscrape.com/catalogue/page-15.html',
 'http://books.toscrape.com/catalogue/page-16.html',
 'http://books.toscrape.com/catalogue/page-17.html',
 'http://books.toscrape.com/catalogue/page-18.html',
 'http://books.toscrape.com/catalogue/page-19.html',
 'http://book

In [16]:
result=requests.get(pages_urls[-1])
print(result.status_code)

200


We successfully retrieved the URLs of all 50 pages. Interestingly, these URLs are highly predictable: we could have built the list simply by incrementing X in 'page-X.html' up to 50.
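
As a minimal sketch of that shortcut, the list can be built directly from the URL pattern. The page count of 50 is taken from the result above and is hard-coded here, which is exactly the assumption that makes this approach fragile.

```python
# Number of catalogue pages observed above (hard-coded assumption)
NB_PAGES = 50

# Build page-1.html ... page-50.html from the predictable URL pattern
base = "http://books.toscrape.com/catalogue/page-{}.html"
generated_urls = [base.format(i) for i in range(1, NB_PAGES + 1)]

print(len(generated_urls))
print(generated_urls[0])
print(generated_urls[-1])
```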

This approach works for this exact example, but it would break if the number of pages changed (e.g. if the website displayed more products per page, or if the catalog changed).

One solution is to increment the page number until we reach a 404 page.

<img src="images/404.png">

In [17]:
pages_urls = []

new_page = "http://books.toscrape.com/catalogue/page-1.html"
# keep requesting pages until one does not answer with status 200 (e.g. a 404)
while requests.get(new_page).status_code == 200:
    pages_urls.append(new_page)
    # split the last URL at the hyphen:
    # "http://books.toscrape.com/catalogue/page-1.html" -> "http://books.toscrape.com/catalogue/page" and "1.html"
    prefix, remainder = pages_urls[-1].rsplit("-", 1)
    # extract the page number (1), add 1 and append ".html" again
    page_number = int(remainder.split(".")[0])
    # the next URL becomes "http://books.toscrape.com/catalogue/page-2.html"
    new_page = prefix + "-" + str(page_number + 1) + ".html"

In [18]:
print(str(len(pages_urls)) + " fetched URLs")
print("Some examples:")
pages_urls[:5]

50 fetched URLs
Some examples:


['http://books.toscrape.com/catalogue/page-1.html',
 'http://books.toscrape.com/catalogue/page-2.html',
 'http://books.toscrape.com/catalogue/page-3.html',
 'http://books.toscrape.com/catalogue/page-4.html',
 'http://books.toscrape.com/catalogue/page-5.html']
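
Building the next URL with nested `split` calls is compact but hard to read. One possible alternative, sketched here with a hypothetical helper named `next_page_url`, uses a regular expression to increment the page number. It assumes URLs of the form `.../page-N.html` as shown above and raises an error for anything else.

```python
import re

def next_page_url(page_url):
    """Return the URL of the following catalogue page (hypothetical helper)."""
    # Capture the prefix, the page number, and the ".html" suffix
    match = re.fullmatch(r"(.*/page-)(\d+)(\.html)", page_url)
    if match is None:
        raise ValueError(f"Unexpected page URL format: {page_url}")
    prefix, number, suffix = match.groups()
    return f"{prefix}{int(number) + 1}{suffix}"

print(next_page_url("http://books.toscrape.com/catalogue/page-1.html"))
print(next_page_url("http://books.toscrape.com/catalogue/page-49.html"))
```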

### Get all product URLs

In [19]:
# get the product URLs from every catalogue page
booksURLs = []
for page in pages_urls:
    booksURLs.extend(getbookurls(page))


In [20]:
booksURLs

['http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'http://books.toscrape.com/catalogue/soumission_998/index.html',
 'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'http://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
 'http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'http://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'http://books.toscrape.com/catalogue/the-black-maria_991/index.html',
 'http://books.toscrape.com/catalogue/starving-hearts-triangular-trade-trilogy-1_990

### Get product data

<img src="images/product_inspect.png">

We can easily retrieve a lot of information for every book:
- book title
- price
- availability
- image
- category
- rating

In [21]:
names = []
prices = []
nb_in_stock = []
img_urls = []
categories = []
ratings = []

# scrape data for every book URL: this may take some time
for url in booksURLs:
    soup = getandParseURL(url)

    # product name
    names.append(soup.find("div", class_=re.compile("product_main")).h1.text)
    # product price: keep only digits and the decimal point (drops the currency symbol)
    prices.append(re.sub(r"[^0-9.]", "", soup.find("p", class_="price_color").text))
    # number of available products: get rid of non-numerical characters
    nb_in_stock.append(re.sub("[^0-9]", "", soup.find("p", class_="instock availability").text))
    # image URL: site root + image path without its two leading relative segments
    img_urls.append('/'.join(url.split('/')[:3]) + '/' + '/'.join(soup.find("img").get("src").split('/')[2:]))
    # product category: slug of the category link, letters only
    categories.append(re.sub("[^a-zA-Z]", "", soup.find("a", href=re.compile(r"\.\./category/books/")).get("href").split("/")[3]))
    # rating: second CSS class of the star-rating paragraph, e.g. "Three"
    ratings.append(soup.find("p", class_=re.compile("star-rating")).get("class")[1])
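
The cleaning steps in this loop can be tried out without any network access. The sketch below applies the same ideas to hard-coded strings. The helper names (`parse_price`, `parse_stock`, `parse_category`) are hypothetical, and the availability text is an assumed example of what a product page shows. The price helper uses a regular expression rather than a fixed slice so that it works whether or not the pound sign was decoded correctly.

```python
import re

def parse_price(text):
    """Keep only digits and the decimal point."""
    return re.sub(r"[^0-9.]", "", text)

def parse_stock(text):
    """Keep only digits."""
    return re.sub(r"[^0-9]", "", text)

def parse_category(href):
    """Extract the category slug from a relative category link and drop non-letters."""
    slug = href.split("/")[3]  # fourth path segment, e.g. 'historical-fiction_4'
    return re.sub(r"[^a-zA-Z]", "", slug)

print(parse_price("Â£51.77"))
print(parse_stock("\n    In stock (22 available)\n"))
print(parse_category("../category/books/historical-fiction_4/index.html"))
```

Note that dropping every non-letter also removes the hyphen, which is why the category appears as `historicalfiction` in the resulting table.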

In [22]:
# add the data to a pandas DataFrame (pandas was imported as pd in the first cell)
scraped_data = pd.DataFrame({
    'name': names,
    'price': prices,
    'nb_in_stock': nb_in_stock,
    'url_img': img_urls,
    'product_category': categories,
    'rating': ratings,
})
scraped_data.head()

| | name | price | nb_in_stock | url_img | product_category | rating |
|---|---|---|---|---|---|---|
| 0 | A Light in the Attic | 51.77 | 22 | http://books.toscrape.com/media/cache/fe/72/fe... | poetry | Three |
| 1 | Tipping the Velvet | 53.74 | 20 | http://books.toscrape.com/media/cache/08/e9/08... | historicalfiction | One |
| 2 | Soumission | 50.1 | 20 | http://books.toscrape.com/media/cache/ee/cf/ee... | fiction | One |
| 3 | Sharp Objects | 47.82 | 20 | http://books.toscrape.com/media/cache/c0/59/c0... | mystery | Four |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | 20 | http://books.toscrape.com/media/cache/ce/5f/ce... | history | Five |


In [23]:
scraped_data.shape

(1000, 6)

### Export the data to an Excel/CSV file

Before exporting, we also need to transform the ratings into numerical values.

In [24]:
scraped_data['rating'].value_counts()

One      226
Three    203
Two      196
Five     196
Four     179
Name: rating, dtype: int64

In [25]:
# map each rating word to its numerical value in one vectorised step
# (avoids assigning into a column element by element)
rating_values = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
scraped_data['rating'] = scraped_data['rating'].map(rating_values)

In [26]:
scraped_data['rating'].value_counts()

1    226
3    203
5    196
2    196
4    179
Name: rating, dtype: int64

In [27]:
scraped_data.to_csv('bookScraped.csv',index=False)

<img src="images/finaldata.png">

In [28]:
scraped_data.head()

| | name | price | nb_in_stock | url_img | product_category | rating |
|---|---|---|---|---|---|---|
| 0 | A Light in the Attic | 51.77 | 22 | http://books.toscrape.com/media/cache/fe/72/fe... | poetry | 3 |
| 1 | Tipping the Velvet | 53.74 | 20 | http://books.toscrape.com/media/cache/08/e9/08... | historicalfiction | 1 |
| 2 | Soumission | 50.1 | 20 | http://books.toscrape.com/media/cache/ee/cf/ee... | fiction | 1 |
| 3 | Sharp Objects | 47.82 | 20 | http://books.toscrape.com/media/cache/c0/59/c0... | mystery | 4 |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | 20 | http://books.toscrape.com/media/cache/ce/5f/ce... | history | 5 |
