### Week 2 Assessment
#### Scraping Brief
`Website`:  [All products | Books to Scrape - Sandbox](http://books.toscrape.com/)

`Detail`: Books to Scrape is a site built for the sole purpose of scraping practice. It contains a list of 1000 books.

`Task`: Create a scraper that crawls through the website and scrapes details about all 1000 books. For each book, collect the:
- Name
- Image URL
- Price
- Rating

These details are to be stored in a pandas dataframe.

In [1]:
# import packages
import requests
from bs4 import BeautifulSoup

import pandas as pd

In [2]:
# initialise necessary variables
titles = []
prices = []
ratings = []
imageURLs = []
page = ''
page_count = 1


# crawl through all pages
while True:    
    base_site = f'http://books.toscrape.com/{page}'
    
    # Making a GET request; stop early on a timeout or an HTTP error status
    response = requests.get(base_site, timeout=30)
    response.raise_for_status()

    # Extracting the HTML
    html = response.content
    soup = BeautifulSoup(html, "html.parser")

    parent = soup.find_all('article')
    titles.extend([title.find_all('a')[1].get('title') for title in parent])                # extract book title
    prices.extend([price.text for price in soup.select('.price_color')])                    # extract prices
    ratings.extend([rating.attrs['class'][1] for rating in soup.select('.star-rating')])    # extract rating
    # or: ratings.extend([rating.get('class')[1] for rating in soup.select('.star-rating')])    # extract rating
    imageURLs.extend([image.find_all('a')[0].find('img').get('src') for image in parent])   # extract image URL
    
    
    # if there is a next-page button, extract its relative URL; otherwise end the scraping process
    if soup.select('.next'):
        page = 'catalogue/{}'.format(soup.select('.next')[0].select('a')[0].get('href').split('/')[-1])
        print('Done scraping page {} ... '.format(page_count))
        page_count += 1
    else:
        print('Done scraping the last page!')
        break

Done scraping page 1 ... 
Done scraping page 2 ... 
Done scraping page 3 ... 
Done scraping page 4 ... 
Done scraping page 5 ... 
Done scraping page 6 ... 
Done scraping page 7 ... 
Done scraping page 8 ... 
Done scraping page 9 ... 
Done scraping page 10 ... 
Done scraping page 11 ... 
Done scraping page 12 ... 
Done scraping page 13 ... 
Done scraping page 14 ... 
Done scraping page 15 ... 
Done scraping page 16 ... 
Done scraping page 17 ... 
Done scraping page 18 ... 
Done scraping page 19 ... 
Done scraping page 20 ... 
Done scraping page 21 ... 
Done scraping page 22 ... 
Done scraping page 23 ... 
Done scraping page 24 ... 
Done scraping page 25 ... 
Done scraping page 26 ... 
Done scraping page 27 ... 
Done scraping page 28 ... 
Done scraping page 29 ... 
Done scraping page 30 ... 
Done scraping page 31 ... 
Done scraping page 32 ... 
Done scraping page 33 ... 
Done scraping page 34 ... 
Done scraping page 35 ... 
Done scraping page 36 ... 
Done scraping page 37 ... 
Done scrap
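
The loop above fills four parallel lists, so one missing field on any book card would put the lists out of step. A minimal sketch of an alternative, assuming the same selectors, is to parse each `article` into one record. The `SAMPLE_HTML` snippet and the `parse_article` helper are hypothetical: the snippet is a reduced stand-in inferred from the selectors used above, not the site's full markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, reduced stand-in for one book card (not the site's full markup)
SAMPLE_HTML = """
<article class="product_pod">
  <div class="image_container">
    <a href="example-book/index.html"><img src="media/example.jpg" alt="Example Book"></a>
  </div>
  <p class="star-rating Three"></p>
  <h3><a href="example-book/index.html" title="Example Book">Example ...</a></h3>
  <p class="price_color">£10.00</p>
</article>
"""


def parse_article(article):
    """Collect the four fields of a single <article> tag into one dict."""
    links = article.find_all('a')   # first link wraps the image, second carries the title
    return {
        'Name': links[1].get('title'),
        'Price': article.select_one('.price_color').text,
        'Rating': article.select_one('.star-rating')['class'][1],   # e.g. 'Three'
        'ImageURL': links[0].find('img').get('src'),
    }


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
records = [parse_article(article) for article in soup.find_all('article')]
print(records)
```

A list of such records can be passed straight to `pd.DataFrame(records)`, which keeps every book's fields on the same row.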

In [3]:
books = pd.DataFrame({'Name': titles, 'Price': prices, 'Rating': ratings, 'ImageURL': imageURLs})
print(books.shape, end='')
books.head()

(1000, 4)

| | Name | Price | Rating | ImageURL |
|---|---|---|---|---|
| 0 | A Light in the Attic | £51.77 | Three | media/cache/2c/da/2cdad67c44b002e7ead0cc35693c... |
| 1 | Tipping the Velvet | £53.74 | One | media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f... |
| 2 | Soumission | £50.10 | One | media/cache/3e/ef/3eef99c9d9adef34639f51066202... |
| 3 | Sharp Objects | £47.82 | Four | media/cache/32/51/3251cf3a3412f53f339e42cac213... |
| 4 | Sapiens: A Brief History of Humankind | £54.23 | Five | media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c... |
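
The `ImageURL` values are relative paths, so they only resolve against the page they were scraped from. A small sketch of turning them into absolute URLs with the standard library's `urljoin` follows. The `absolute_url` helper and the `example.jpg` file name are hypothetical, and the `../media/...` form is an assumption about how paths may be written on pages inside `catalogue/`.

```python
from urllib.parse import urljoin

BASE = 'http://books.toscrape.com/'


def absolute_url(page_url, relative_path):
    """Resolve a path found on page_url into a full URL."""
    return urljoin(page_url, relative_path)


# Path as it appears on the home page
home_img = absolute_url(BASE, 'media/cache/example.jpg')

# Assumed form of the same path on a page inside catalogue/
page_img = absolute_url(BASE + 'catalogue/page-2.html', '../media/cache/example.jpg')

# The same call also resolves a relative next-page link
next_page = absolute_url(BASE + 'catalogue/page-2.html', 'page-3.html')

print(home_img)    # http://books.toscrape.com/media/cache/example.jpg
print(page_img)    # http://books.toscrape.com/media/cache/example.jpg
print(next_page)   # http://books.toscrape.com/catalogue/page-3.html
```

Resolving against the URL of the page that was just requested avoids hand-building the `catalogue/` prefix.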


In [4]:
# convert rating from words to digits (separate name so the scraped `ratings` list is not overwritten)
rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
books['Rating'] = books['Rating'].map(rating_map)

# save to csv
books.to_csv('dsc_web_scraping.csv', encoding='utf-8-sig', index=False) 
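
The `Price` column is still text at this point. If numeric prices are needed, one possible approach is to pull the digits out with a regular expression and convert them. This is a sketch on a small sample frame that reuses prices from the output above; the `PriceGBP` column name is hypothetical, and the third value imitates the stray `Â` seen in a mis-decoded CSV.

```python
import pandas as pd

# Small sample; the last value imitates a mis-decoded pound sign
sample = pd.DataFrame({'Price': ['£51.77', '£53.74', 'Â£50.10']})

# Keep only the decimal number, then convert the strings to floats
sample['PriceGBP'] = pd.to_numeric(
    sample['Price'].str.extract(r'(\d+\.\d+)', expand=False)
)

print(sample)
print(sample['PriceGBP'].dtype)   # float64
```

Extracting the number rather than stripping `£` means the conversion does not depend on which currency characters precede it.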

`jottings:`
<!-- # less dynamic:
# titles = []
# prices = []
# ratings = []
# imageURLs = []

# # crawl through all 50 pages
# for i in range(1, 51):
#     base_site = f'http://books.toscrape.com/catalogue/page-{i}.html'
    
#     # Making a get request
#     response = requests.get(base_site)
    
#     # Extracting the HTML
#     html = response.content
#     soup = BeautifulSoup(html, "html.parser")
    
#     parent = soup.find_all('article')
#     titles.extend([title.find_all('a')[1].get('title') for title in parent])                # extract book title
#     prices.extend([price.text for price in soup.select('.price_color')])                    # extract prices
#     ratings.extend([rating.attrs['class'][1] for rating in soup.select('.star-rating')])    # extract rating
#     imageURLs.extend([image.find_all('a')[0].find('img').get('src') for image in parent])   # extract image URL


<!-- # Setting the encoding to utf-8-sig removes the stray 'Â' character that otherwise shows before '£' in the Price column when the CSV is saved without it.

# books.to_excel('dsc_web_scraping.xlsx', index=False)  # no Â character when you save to excel -->
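
The note on `utf-8-sig` can be checked directly. The sketch below writes the same one-row frame with both encodings into a temporary folder and inspects the first bytes; the file names are hypothetical. It also shows where `Â` comes from: the UTF-8 bytes of `£` read as Windows-1252, which is an assumption about what the spreadsheet program does when no byte order mark is present.

```python
import os
import tempfile

import pandas as pd

sample = pd.DataFrame({'Name': ['A Light in the Attic'], 'Price': ['£51.77']})

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, 'plain.csv')      # hypothetical file names
    bom_path = os.path.join(tmp, 'with_bom.csv')

    sample.to_csv(plain_path, encoding='utf-8', index=False)
    sample.to_csv(bom_path, encoding='utf-8-sig', index=False)

    with open(plain_path, 'rb') as f:
        plain_bytes = f.read()
    with open(bom_path, 'rb') as f:
        bom_bytes = f.read()

BOM = b'\xef\xbb\xbf'   # UTF-8 byte order mark
print(plain_bytes.startswith(BOM))   # False
print(bom_bytes.startswith(BOM))     # True

# UTF-8 bytes of the pound sign, decoded as Windows-1252
misread = '£'.encode('utf-8').decode('cp1252')
print(misread)   # Â£
```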