## Bookstore Web Scraper
This notebook is a practice exercise in web scraping. It can scrape data from selected pages of http://books.toscrape.com/catalogue/.

The user provides a base URL (http://books.toscrape.com/catalogue/ in this case) and the first and last page numbers to scrape.

The main function, `bookstore_scraper`, returns a pandas DataFrame of the scraped data (rating, book name, price, and stock status).

In [5]:
# Importing necessary modules
import requests
from bs4 import BeautifulSoup
from collections import defaultdict
from pandas import DataFrame

### Main Function

In [10]:
def bookstore_scraper(base, first_page, last_page):
    '''
    This is the main function. You provide it with the base URL for the website
    (http://books.toscrape.com/catalogue/ in this case) and the first and last
    page numbers you'd like to scrape (both inclusive). It returns the scraped
    data in the form of a pandas DataFrame.
    '''
    # Column name -> list of values, accumulated across all pages
    columns = defaultdict(list)
    for page in range(first_page, last_page + 1):
        url = generate_url(base, page)
        data = connect_to_page(url)
        rating_list, header_list, price_list, in_stock_list = scrape_page(data)
        columns['rating_out_of_5'].extend(rating_list)
        columns['name'].extend(header_list)
        columns['price'].extend(price_list)
        columns['in_stock'].extend(in_stock_list)
    return DataFrame(columns)

### Helper Functions

In [11]:
def generate_url(base, page_number):
    '''
    This function appends page-(page_number).html to the base URL.
    The base URL must already end with '/'.
    Eg: generate_url('www.hello.com/', 1) gives 'www.hello.com/page-1.html'
    '''
    return base + f'page-{page_number}.html'
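
Because `generate_url` relies on plain string concatenation, a base URL without a trailing slash produces a broken address. The sketch below shows one possible way to tolerate both forms using the standard library's `urljoin`. The name `generate_url_safe` is hypothetical and is not part of the notebook.

```python
from urllib.parse import urljoin


def generate_url_safe(base, page_number):
    # Hypothetical variant of generate_url, not part of the original notebook.
    # Normalise the base so it ends with exactly one '/'. Without the trailing
    # slash, urljoin would replace the last path segment instead of appending.
    normalized = base.rstrip('/') + '/'
    return urljoin(normalized, f'page-{page_number}.html')


print(generate_url_safe('http://books.toscrape.com/catalogue', 2))
print(generate_url_safe('http://books.toscrape.com/catalogue/', 2))
```

Both calls should yield the same URL, so the caller no longer has to remember the trailing slash.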


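def connect_to_page(url):
    '''
    This function requests a URL and returns its HTML parsed into a BeautifulSoup object.
    '''
    response = requests.get(url)
    # Fail early on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    # Name the parser explicitly; otherwise bs4 guesses one and emits a warning
    return BeautifulSoup(response.content, 'html.parser')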


def scrape_page(page):
    '''
    This function takes parsed HTML content as input "page" and returns 4 lists
    of data (ratings, names, prices, in stock).
    '''
    rating_list = []
    header_list = []
    price_list = []
    in_stock_list = []
    # Each book is an <li> inside an <ol> on the page
    for ol in page.find_all('ol'):
        for li in ol.find_all('li'):
            # The second class on the star-rating <p> is the rating word, e.g. 'Three'
            rating = li.find('p', {'class': 'star-rating'})['class'][1]
            rating_list.append(rating)
            # The second <a> in each item carries the full title
            header = li.find_all('a')[1]['title']
            header_list.append(header)
            price = li.find('p', {'class': 'price_color'}).text
            price_list.append(price)
            in_stock = li.find('p', {'class': 'instock availability'}).text.strip()
            in_stock_list.append(in_stock)
    return rating_list, header_list, price_list, in_stock_list
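
To make the selectors in `scrape_page` easier to follow, here is a minimal offline sketch that applies the same lookups to a hand-written HTML snippet. The snippet is a simplified assumption about the site's markup, not a copy of it, and the helper name `extract_book` is hypothetical.

```python
from bs4 import BeautifulSoup

# Simplified, assumed markup for one book entry (not copied from the site)
SAMPLE_HTML = '''
<ol class="row">
  <li>
    <article class="product_pod">
      <a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic"></a>
      <p class="star-rating Three"></p>
      <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
      <p class="price_color">£51.77</p>
      <p class="instock availability">
        In stock
      </p>
    </article>
  </li>
</ol>
'''


def extract_book(li):
    # Hypothetical helper mirroring the per-item lookups in scrape_page
    rating = li.find('p', class_='star-rating')['class'][1]  # second class is the rating word
    title = li.find_all('a')[1]['title']                     # second link holds the full title
    price = li.find('p', class_='price_color').text
    in_stock = li.find('p', class_='instock').text.strip()
    return rating, title, price, in_stock


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
books = [extract_book(li) for li in soup.find_all('li')]
print(books)
```

Running the lookups against a fixed string like this is a convenient way to check the selectors without sending any requests.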

### Sample Usage
Below, I've scraped data for pages 1 to 3 using the functions above.

In [12]:
data = bookstore_scraper('http://books.toscrape.com/catalogue/',1,3)
data.head()

| | rating_out_of_5 | name | price | in_stock |
|---|---|---|---|---|
| 0 | Three | A Light in the Attic | £51.77 | In stock |
| 1 | One | Tipping the Velvet | £53.74 | In stock |
| 2 | One | Soumission | £50.10 | In stock |
| 3 | Four | Sharp Objects | £47.82 | In stock |
| 4 | Five | Sapiens: A Brief History of Humankind | £54.23 | In stock |
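
The scraped ratings are words and the prices are strings with a currency symbol, so a common follow-up step is converting them to numbers. The sketch below is one possible approach in plain Python. The names `RATING_WORDS` and `clean_row` are hypothetical, and it assumes ratings are always one of the five words "One" to "Five".

```python
import re

# Assumed mapping from the rating words seen in the output to integers
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def clean_row(rating_word, price_text):
    # Hypothetical helper: turn ('Three', '£51.77') into (3, 51.77)
    rating = RATING_WORDS[rating_word]
    # Drop everything except digits and the decimal point (e.g. the '£' sign)
    price = float(re.sub(r'[^\d.]', '', price_text))
    return rating, price


print(clean_row('Three', '£51.77'))
```

With a DataFrame, the same idea could be applied column by column, for example with `data['rating_out_of_5'].map(RATING_WORDS)`.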
