[Books to Scrape](http://books.toscrape.com/) is a site built for the sole purpose of scraping practice. It contains a list of 1000 books.

Task: Create a scraper that crawls the website and collects details about all 1000 books.

For each book, collect the following:

 - Name
 - Image URL
 - Price
 - Rating

Store these details in a pandas DataFrame.


In [40]:
#import required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [41]:
#load the webpage content
r = requests.get('http://books.toscrape.com/')

#create a BeautifulSoup object from the page content
soup = BeautifulSoup(r.content, 'html.parser')

print(soup.prettify())


<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

In [42]:
books = soup.find_all(class_='product_pod')

### Extract data for one book

In [43]:
book_one = books[0]

In [44]:
book_one

<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

In [45]:
name = book_one.h3.a['title']
print(name)

A Light in the Attic


In [46]:
url = book_one.a['href']
print(url)

catalogue/a-light-in-the-attic_1000/index.html
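
Note that `book_one.a['href']` is the link to the book's detail page, not the cover image. The thumbnail's address sits in the `src` attribute of the `<img>` nested inside that link, and it is relative to the page it was scraped from. Below is a minimal sketch, assuming the goal is an absolute image URL. The `snippet` string is trimmed from the markup above, and `base_url` is the listing page requested earlier.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Trimmed copy of the first book's markup shown above
snippet = '''
<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
</article>
'''
book = BeautifulSoup(snippet, 'html.parser').find(class_='product_pod')

# URL of the page the markup came from
base_url = 'http://books.toscrape.com/'

# src of the thumbnail, relative to that page
relative_src = book.find('img')['src']

# resolve it against the page URL to get an absolute address
image_url = urljoin(base_url, relative_src)
print(image_url)
```

Resolving against the URL of the page that was actually requested matters, because relative paths on this site differ between the home page and the `catalogue/page-N.html` pages.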


In [47]:
price = book_one.find('p', class_='price_color').text
print(price)

£51.77


In [48]:
rating = book_one.find('p')['class']
print(rating[1])

Three
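
`book_one.find('p')` works here only because the rating paragraph happens to be the first `<p>` in the article. Below is a slightly more defensive sketch, assuming the rating is always carried by a `<p>` with the `star-rating` class plus one word class, as in the markup above. The `snippet` string is a trimmed stand-in for that markup.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for one book's markup
snippet = '''
<article class="product_pod">
<p class="star-rating Three"></p>
<div class="product_price"><p class="price_color">£51.77</p></div>
</article>
'''
book = BeautifulSoup(snippet, 'html.parser').find(class_='product_pod')

# select the paragraph by its class rather than by its position
rating_tag = book.find('p', class_='star-rating')

# 'class' is multi-valued, so bs4 returns a list such as ['star-rating', 'Three']
rating = [c for c in rating_tag['class'] if c != 'star-rating'][0]
print(rating)
```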


### Extract information from one page

In [49]:
#lists to store scraped data in
names = []
image_urls = []
prices = []
ratings = []

#extract data from each book on the page
for book in books:
    #the name
    name = book.h3.a['title']
    names.append(name)
    
    #the URL: this href is the link to the book's page; the cover thumbnail is book.a.img['src']
    url = book.a['href']
    image_urls.append(url)
    
    #the price
    price = book.find('p', class_='price_color').text
    prices.append(price)
    
    #the rating
    rating = book.find('p')['class']
    ratings.append(rating[1])

In [50]:
books_df = pd.DataFrame({'Name':names, 'Image URL':image_urls, 'Price':prices, 'Rating':ratings})
books_df

| | Name | Image URL | Price | Rating |
|---|---|---|---|---|
| 0 | A Light in the Attic | catalogue/a-light-in-the-attic_1000/index.html | £51.77 | Three |
| 1 | Tipping the Velvet | catalogue/tipping-the-velvet_999/index.html | £53.74 | One |
| 2 | Soumission | catalogue/soumission_998/index.html | £50.10 | One |
| 3 | Sharp Objects | catalogue/sharp-objects_997/index.html | £47.82 | Four |
| 4 | Sapiens: A Brief History of Humankind | catalogue/sapiens-a-brief-history-of-humankind... | £54.23 | Five |
| 5 | The Requiem Red | catalogue/the-requiem-red_995/index.html | £22.65 | One |
| 6 | The Dirty Little Secrets of Getting Your Dream... | catalogue/the-dirty-little-secrets-of-getting-... | £33.34 | Four |
| 7 | The Coming Woman: A Novel Based on the Life of... | catalogue/the-coming-woman-a-novel-based-on-th... | £17.93 | Three |
| 8 | The Boys in the Boat: Nine Americans and Their... | catalogue/the-boys-in-the-boat-nine-americans-... | £22.60 | Four |
| 9 | The Black Maria | catalogue/the-black-maria_991/index.html | £52.15 | One |


### Extract data from all pages

In [51]:
pages = [str(i) for i in range(1,51)]

In [52]:
#lists to store scraped data in
names_all = []
image_urls_all = []
prices_all = []
ratings_all = []


for page in pages:
    response = requests.get('http://books.toscrape.com/catalogue/page-' + page + '.html')
    
    #the page declares charset=utf-8; set it explicitly so the £ sign decodes correctly
    response.encoding = 'utf-8'
    
    #parse the content with BeautifulSoup
    page_html = BeautifulSoup(response.text, 'html.parser')
    
    #select all the books on the page
    books_all = page_html.find_all(class_='product_pod')
    
    #for each book on the page
    for book_all in books_all:
        
        #the name
        name = book_all.h3.a['title']
        names_all.append(name)

        #the URL: this href is the link to the book's page; the cover thumbnail is book_all.a.img['src']
        url = book_all.a['href']
        image_urls_all.append(url)

        #the price
        price = book_all.find('p', class_='price_color').text
        prices_all.append(price)

        #the rating
        rating = book_all.find('p')['class']
        ratings_all.append(rating[1])
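
The loop above hard-codes 50 pages and repeats the per-book logic from the single-page cell. One possible refactor, offered as a sketch rather than as this notebook's method, uses two hypothetical helpers: `parse_books`, which turns one parsed page into a list of dicts, and `crawl`, which follows the pager's "next" link instead of relying on a fixed page count. The pager markup (`<a>` inside `<li class="next">`) and the `timeout` value are assumptions.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def parse_books(soup):
    """Return one dict per book found in a parsed listing page."""
    rows = []
    for book in soup.find_all(class_='product_pod'):
        rows.append({
            'Name': book.h3.a['title'],
            'Image URL': book.a['href'],  # same field the loop above collects
            'Price': book.find('p', class_='price_color').text,
            'Rating': book.find('p', class_='star-rating')['class'][1],
        })
    return rows


def crawl(start_url):
    """Follow 'next' links from start_url and collect every book."""
    rows, url = [], start_url
    while url:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text, 'html.parser')
        rows.extend(parse_books(soup))

        # assumed pager markup: <li class="next"><a href="page-2.html">next</a></li>
        next_link = soup.select_one('li.next a')
        url = urljoin(url, next_link['href']) if next_link else None
    return rows
```

With these helpers, `pd.DataFrame(crawl('http://books.toscrape.com/catalogue/page-1.html'))` would build the frame in one step, since pandas accepts a list of dicts.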

In [53]:
#the final dataframe
books_final_df = pd.DataFrame({'Name':names_all, 'Image URL':image_urls_all, 'Price':prices_all, 'Rating':ratings_all})

In [54]:
books_final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Name       1000 non-null   object
 1   Image URL  1000 non-null   object
 2   Price      1000 non-null   object
 3   Rating     1000 non-null   object
dtypes: object(4)
memory usage: 31.4+ KB


In [55]:
books_final_df.head()

| | Name | Image URL | Price | Rating |
|---|---|---|---|---|
| 0 | A Light in the Attic | a-light-in-the-attic_1000/index.html | £51.77 | Three |
| 1 | Tipping the Velvet | tipping-the-velvet_999/index.html | £53.74 | One |
| 2 | Soumission | soumission_998/index.html | £50.10 | One |
| 3 | Sharp Objects | sharp-objects_997/index.html | £47.82 | Four |
| 4 | Sapiens: A Brief History of Humankind | sapiens-a-brief-history-of-humankind_996/index... | £54.23 | Five |


In [56]:
#map the rating words to integers
numbers = {'One':1, 'Two':2, 'Three':3, 'Four':4, 'Five':5}
books_final_df['Rating'] = books_final_df['Rating'].map(numbers)

In [57]:
books_final_df.head()

| | Name | Image URL | Price | Rating |
|---|---|---|---|---|
| 0 | A Light in the Attic | a-light-in-the-attic_1000/index.html | £51.77 | 3 |
| 1 | Tipping the Velvet | tipping-the-velvet_999/index.html | £53.74 | 1 |
| 2 | Soumission | soumission_998/index.html | £50.10 | 1 |
| 3 | Sharp Objects | sharp-objects_997/index.html | £47.82 | 4 |
| 4 | Sapiens: A Brief History of Humankind | sapiens-a-brief-history-of-humankind_996/index... | £54.23 | 5 |
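
The `Price` column is still stored as text (`object` dtype in the `info()` output above), so it cannot be averaged or sorted numerically. Below is a minimal sketch of one way to convert it, shown on a small stand-in frame (`sample_df`, built from the first rows above) rather than on `books_final_df` itself. It assumes every price is a `£` sign followed by a plain decimal number.

```python
import pandas as pd

# stand-in for books_final_df, using the first three rows shown above
sample_df = pd.DataFrame({
    'Name': ['A Light in the Attic', 'Tipping the Velvet', 'Soumission'],
    'Price': ['£51.77', '£53.74', '£50.10'],
})

# drop the currency sign, then convert the remaining text to float
sample_df['Price'] = (
    sample_df['Price']
    .str.replace('£', '', regex=False)
    .astype(float)
)

print(sample_df['Price'].dtype)
print(sample_df['Price'].max())
```

If some prices could be malformed, `pd.to_numeric(..., errors='coerce')` is an alternative that yields `NaN` instead of raising.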
