# Webscraping
We have been tasked with scraping https://books.toscrape.com/, a website designed to be scraped for educational purposes.

We will first collect the book entries on page 1, then extract the details of a single book, then scrape a full page, and finally the entire website.

In [1]:
from bs4 import BeautifulSoup as bs          
import requests 
import pandas as pd

In [2]:
html_code = requests.get('http://books.toscrape.com/') 
soup = bs(html_code.content, 'html.parser') 

In [3]:
soup

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s

## Scraping the Chunk of Books on Page 1

On inspection, each book and its details (rating, price, and so on) sits inside an `<li>` element with the following class attribute:

`<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">`

As a first step, we find the elements that contain these chunks.

In [4]:
soup.find_all('li', {'class': 'col-xs-6 col-sm-4 col-md-3 col-lg-3'})

# find_all returns every matching element
# find returns only the first matching element

[<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_container">
 <a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>
 </li>,
 <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_

In [5]:
# storing the books on page 1 (first 20 books) into a variable

page1 = soup.find_all('li', {'class': 'col-xs-6 col-sm-4 col-md-3 col-lg-3'})
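The difference between `find` and `find_all` can be checked offline on a small inline snippet. The following is a minimal sketch, not part of the original notebook: the two-book HTML string and the names `demo_soup`, `first`, and `every` are made up for illustration. It also shows that the `class_` keyword matches a tag when any one of its classes equals the given value, so a single class token is enough here.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the listing markup, trimmed to two books.
html = """
<ol>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3"><h3><a title="Book A">Book A</a></h3></li>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3"><h3><a title="Book B">Book B</a></h3></li>
</ol>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# find returns the first matching Tag (or None when nothing matches).
first = demo_soup.find("li", class_="col-xs-6")

# find_all returns a list-like ResultSet with every matching Tag.
every = demo_soup.find_all("li", class_="col-xs-6")

first_title = first.find("a")["title"]
all_titles = [li.find("a")["title"] for li in every]
print(first_title)
print(all_titles)
```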

## Scraping One Book's Details

Our second step is to take a single book's chunk and work out how to extract its details, before applying the same method to the whole page.

In [6]:
# viewing the first book's html
page1[0]

<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>
</li>

In [7]:
# storing the first book into a variable
book1 = page1[0]


In [8]:
#Finding the book title
book1.find('h3').find('a')['title']

'A Light in the Attic'

In [9]:
#finding the book price
book1.find('p', {'class': 'price_color'})

<p class="price_color">£51.77</p>

In [10]:
#extracting the book price text
book1.find('p', {'class': 'price_color'}).text

'£51.77'
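The price comes back as a string with a currency symbol, which is awkward for sorting or arithmetic later on. A minimal sketch of one way to turn it into a number follows; the helper name `parse_price` is hypothetical and is not used elsewhere in this notebook.

```python
import re

def parse_price(text):
    """Return the first decimal number found in text as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())

price = parse_price("£51.77")
print(price)
```

Searching for the digits, rather than stripping a fixed `£` prefix, keeps the helper working if the symbol is missing or has been mangled by an encoding problem.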

In [11]:
# getting the link of the book
book1.find('a')['href']


'catalogue/a-light-in-the-attic_1000/index.html'

In [12]:
# using regex to get the star ratings of the book

import re
regex = re.compile("star-rating (.*)")
book1.find('p', {'class': regex})

<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>

In [13]:
book1.find('p', {'class': regex})['class']

['star-rating', 'Three']
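The rating is encoded as a word in the class list. If a numeric rating is preferred, the word can be mapped to an integer. This is a sketch under the assumption that the site only uses the words One to Five (the word Two does not appear in the output shown in this notebook); `RATING_WORDS` and `rating_to_int` are hypothetical names.

```python
# Assumed vocabulary of rating words used in the class attribute.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def rating_to_int(class_list):
    """Return the numeric rating from a class list such as ['star-rating', 'Three']."""
    for token in class_list:
        if token in RATING_WORDS:
            return RATING_WORDS[token]
    return None  # no recognised rating word present

rating = rating_to_int(['star-rating', 'Three'])
print(rating)
```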

## Scraping one page's books

In [14]:
# creating a user-defined function that scrapes one book and returns its details as a dictionary

def book_scrape(book):
    info = {}   
    
    info['title'] = book.find('h3').find('a')['title']
        
    info['rating'] = book.find('p', {'class': regex})['class'][-1]
    
    info['price'] = book.find('p', {'class': 'price_color'}).text
    
    info['link'] = 'http://books.toscrape.com/' + book.find('a')['href']
    
    return info

In [19]:
# running the function to scrape the first 20 books stored in the page 1 variable

page1_dict = [book_scrape(book) for book in page1]

page1_dict[0:3]

[{'title': 'A Light in the Attic',
  'rating': 'Three',
  'price': '£51.77',
  'link': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'},
 {'title': 'Tipping the Velvet',
  'rating': 'One',
  'price': '£53.74',
  'link': 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'},
 {'title': 'Soumission',
  'rating': 'One',
  'price': '£50.10',
  'link': 'http://books.toscrape.com/catalogue/soumission_998/index.html'}]

In [20]:
# cross-checking the number of dictionaries in the list (one per book)
len(page1_dict)

20

In [21]:
# viewing the list of dictionaries as a dataframe
pd.DataFrame(page1_dict)

| | title | rating | price | link |
|---|---|---|---|---|
| 0 | A Light in the Attic | Three | £51.77 | http://books.toscrape.com/catalogue/a-light-in... |
| 1 | Tipping the Velvet | One | £53.74 | http://books.toscrape.com/catalogue/tipping-th... |
| 2 | Soumission | One | £50.10 | http://books.toscrape.com/catalogue/soumission... |
| 3 | Sharp Objects | Four | £47.82 | http://books.toscrape.com/catalogue/sharp-obje... |
| 4 | Sapiens: A Brief History of Humankind | Five | £54.23 | http://books.toscrape.com/catalogue/sapiens-a-... |
| 5 | The Requiem Red | One | £22.65 | http://books.toscrape.com/catalogue/the-requie... |
| 6 | The Dirty Little Secrets of Getting Your Dream... | Four | £33.34 | http://books.toscrape.com/catalogue/the-dirty-... |
| 7 | The Coming Woman: A Novel Based on the Life of... | Three | £17.93 | http://books.toscrape.com/catalogue/the-coming... |
| 8 | The Boys in the Boat: Nine Americans and Their... | Four | £22.60 | http://books.toscrape.com/catalogue/the-boys-i... |
| 9 | The Black Maria | One | £52.15 | http://books.toscrape.com/catalogue/the-black-... |


## Scraping the entire website (all pages)

In [22]:
url = 'http://books.toscrape.com/catalogue/page-1.html'

In [23]:
# constructing a list of all 50 pages on the website that contain book information

urls = ['http://books.toscrape.com/catalogue/page-{}.html'.format(i) for i in range(1, 51)]
urls

['http://books.toscrape.com/catalogue/page-1.html',
 'http://books.toscrape.com/catalogue/page-2.html',
 'http://books.toscrape.com/catalogue/page-3.html',
 'http://books.toscrape.com/catalogue/page-4.html',
 'http://books.toscrape.com/catalogue/page-5.html',
 'http://books.toscrape.com/catalogue/page-6.html',
 'http://books.toscrape.com/catalogue/page-7.html',
 'http://books.toscrape.com/catalogue/page-8.html',
 'http://books.toscrape.com/catalogue/page-9.html',
 'http://books.toscrape.com/catalogue/page-10.html',
 'http://books.toscrape.com/catalogue/page-11.html',
 'http://books.toscrape.com/catalogue/page-12.html',
 'http://books.toscrape.com/catalogue/page-13.html',
 'http://books.toscrape.com/catalogue/page-14.html',
 'http://books.toscrape.com/catalogue/page-15.html',
 'http://books.toscrape.com/catalogue/page-16.html',
 'http://books.toscrape.com/catalogue/page-17.html',
 'http://books.toscrape.com/catalogue/page-18.html',
 'http://books.toscrape.com/catalogue/page-19.html',
 '

In [24]:
# A helper that uses book_scrape to collect the 20 books listed at each url

def get_page_books(url):     # called once per url
    
    html_page = requests.get(url)                 # request the url and parse the response with BeautifulSoup
    soup = bs(html_page.content, 'html.parser')         
    
    # retrieve the raw chunks for the 20 books on the page
    raw_data = soup.find_all('li', {'class': 'col-xs-6 col-sm-4 col-md-3 col-lg-3'}) 
    
    # extract each chunk into a dictionary, as done for page1, using the book_scrape function above
    to_dict = [book_scrape(book) for book in raw_data]
    
    return to_dict
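The list of 50 page URLs is hard-coded, which works as long as the page count does not change. An alternative is to follow each page's "next" link until there is none. The sketch below assumes the pager markup is an `<li class="next">` element wrapping an `<a>` tag; that markup is not shown in this notebook, so treat it as an assumption to verify in the browser. The name `next_page_url` is hypothetical, and the sample HTML string stands in for a downloaded page.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    item = soup.find("li", class_="next")   # assumed pager markup
    link = item.find("a") if item else None
    if link is None or not link.get("href"):
        return None
    # Resolve the relative href against the page it was found on.
    return urljoin(current_url, link["href"])

sample = '<ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>'
following = next_page_url(sample, "http://books.toscrape.com/catalogue/page-1.html")
print(following)
```

A scraping loop could then start at page 1 and keep requesting the returned URL until the function gives back `None`.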

In [26]:
# scraping the 20 books at each url and collecting them all in a single list of dictionaries

all_books_dicts = []

for url in urls:
    all_books_dicts.extend(get_page_books(url))

print(len(all_books_dicts))
all_books_dicts

1000


[{'title': 'A Light in the Attic',
  'rating': 'Three',
  'price': '£51.77',
  'link': 'http://books.toscrape.com/a-light-in-the-attic_1000/index.html'},
 {'title': 'Tipping the Velvet',
  'rating': 'One',
  'price': '£53.74',
  'link': 'http://books.toscrape.com/tipping-the-velvet_999/index.html'},
 {'title': 'Soumission',
  'rating': 'One',
  'price': '£50.10',
  'link': 'http://books.toscrape.com/soumission_998/index.html'},
 {'title': 'Sharp Objects',
  'rating': 'Four',
  'price': '£47.82',
  'link': 'http://books.toscrape.com/sharp-objects_997/index.html'},
 {'title': 'Sapiens: A Brief History of Humankind',
  'rating': 'Five',
  'price': '£54.23',
  'link': 'http://books.toscrape.com/sapiens-a-brief-history-of-humankind_996/index.html'},
 {'title': 'The Requiem Red',
  'rating': 'One',
  'price': '£22.65',
  'link': 'http://books.toscrape.com/the-requiem-red_995/index.html'},
 {'title': 'The Dirty Little Secrets of Getting Your Dream Job',
  'rating': 'Four',
  'price': '£33.34'

In [27]:
df = pd.DataFrame(all_books_dicts)
df

| | title | rating | price | link |
|---|---|---|---|---|
| 0 | A Light in the Attic | Three | £51.77 | http://books.toscrape.com/a-light-in-the-attic... |
| 1 | Tipping the Velvet | One | £53.74 | http://books.toscrape.com/tipping-the-velvet_9... |
| 2 | Soumission | One | £50.10 | http://books.toscrape.com/soumission_998/index... |
| 3 | Sharp Objects | Four | £47.82 | http://books.toscrape.com/sharp-objects_997/in... |
| 4 | Sapiens: A Brief History of Humankind | Five | £54.23 | http://books.toscrape.com/sapiens-a-brief-hist... |
| ... | ... | ... | ... | ... |
| 995 | Alice in Wonderland (Alice's Adventures in Won... | One | £55.53 | http://books.toscrape.com/alice-in-wonderland-... |
| 996 | Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1) | Four | £57.06 | http://books.toscrape.com/ajin-demi-human-volu... |
| 997 | A Spy's Devotion (The Regency Spies of London #1) | Five | £16.97 | http://books.toscrape.com/a-spys-devotion-the-... |
| 998 | 1st to Die (Women's Murder Club #1) | One | £53.98 | http://books.toscrape.com/1st-to-die-womens-mu... |
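Note that the links from the full-site run lack the `catalogue/` segment that the page 1 links had. The pages under `catalogue/` appear to use hrefs relative to that folder, so prefixing the site root produces a different (and likely broken) address. A minimal sketch of one way around this is to resolve each href against the URL of the page it was scraped from, using `urljoin` from the standard library. The href on the listing page is inferred from the output above rather than copied from the page source, and the variable names are hypothetical.

```python
from urllib.parse import urljoin

home = "http://books.toscrape.com/"
listing = "http://books.toscrape.com/catalogue/page-1.html"

# href as it appears on the home page (seen in the page 1 HTML above)
from_home = urljoin(home, "catalogue/a-light-in-the-attic_1000/index.html")

# href as it appears on a catalogue page (inferred from the scraped links)
from_listing = urljoin(listing, "a-light-in-the-attic_1000/index.html")

print(from_home)
print(from_listing)
print(from_home == from_listing)
```

Both calls resolve to the same book URL, so a version of `book_scrape` that received the page URL and called `urljoin(page_url, href)` would give consistent links for either starting point.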


### Exporting the scraped books as a CSV

In [28]:
# converting the df to a csv file
df.to_csv('Scrapped_Books.csv', index=False)
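For reference, the same kind of export can be sketched without pandas using the standard library's `csv` module, which makes the resulting file layout easy to inspect. This is an alternative illustration rather than the method used above; the one-book `books` list is a stand-in for `all_books_dicts`, and the in-memory buffer stands in for a file on disk.

```python
import csv
import io

# Stand-in for the scraped list of dictionaries.
books = [
    {"title": "A Light in the Attic", "rating": "Three", "price": "£51.77",
     "link": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"},
]

buffer = io.StringIO()  # in-memory stand-in for an open file
writer = csv.DictWriter(buffer, fieldnames=["title", "rating", "price", "link"])
writer.writeheader()      # column names on the first line, no index column
writer.writerows(books)   # one line per book

csv_text = buffer.getvalue()
print(csv_text)
```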