<div align="center"> <h1>Web Scraping Code</h1> </div>

<img src="https://images.unsplash.com/photo-1492515114975-b062d1a270ae?q=80&w=2940&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" style="width:100%; height:200px; object-fit:cover;" />

In [1]:
# Import our libraries

from bs4 import BeautifulSoup  # HTML parsing, from the bs4 package
import requests  # fetches each page over HTTP
import pandas as pd
import numpy as np
import re  # regular expressions for pattern matching

In [3]:
# Build the URL of every catalogue listing page (pages 1 to 50)
def get_sites():
  sites = []
  for i in range(1, 51):
    sites.append(f'https://books.toscrape.com/catalogue/category/books_1/page-{i}.html')
  return sites

In [5]:
# The function and the list have different names, so re-running this cell still works
sites = get_sites()

In [7]:
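# Collect the link to every book from the listing pages
def get_book_links(sites):
  book_links = []
  for site in sites:
    resp = requests.get(site)
    soup = BeautifulSoup(resp.text, 'html.parser')
    # The hrefs are relative; [6:] drops the leading '../../' before the catalogue URL is prepended
    book_links_page = ['https://books.toscrape.com/catalogue/' + link.get('href')[6:] for link in soup.find('ol').find_all('a')]
    book_links.extend(book_links_page)
  # Each book is linked twice per page (cover image and title), so a set removes the duplicates
  book_links = set(book_links)
  return book_links

In [9]:
book_links = get_book_links(sites)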
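
The link-building step above slices each `href` with `[6:]`, which depends on every link starting with exactly `../../`. As a hedged alternative, the standard library's `urljoin` can resolve a relative link against the page it was found on. The sketch below is illustrative only: the `href` value is a made-up example shaped like the links on the listing pages, not one taken from the notebook's output.

```python
from urllib.parse import urljoin

# Assumed example inputs: a listing-page URL and a relative href shaped like those on it
page_url = 'https://books.toscrape.com/catalogue/category/books_1/page-1.html'
href = '../../soft-apocalypse_833/index.html'  # hypothetical slug, for illustration

# urljoin walks the '../' segments relative to the page URL
absolute = urljoin(page_url, href)

# The slicing approach used in the notebook, for comparison
sliced = 'https://books.toscrape.com/catalogue/' + href[6:]

print(absolute)
print(absolute == sliced)
```

With `urljoin`, the code keeps working if the depth of the relative path changes, whereas the slice would silently produce a wrong URL.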

In [10]:
len(book_links)  # there are 1,000 books on this site

1000

In [11]:
# Fetch each book page and collect the book's title and category
def book_info(book_links):
  book_name = []
  category = []
  for link in book_links:
    resp = requests.get(link)
    soup = BeautifulSoup(resp.text, 'html.parser')
    book_name.append(soup.h1.text)
    # Breadcrumb links run Home > Books > category, so index 2 is the category
    category.append(soup.find('ul', class_='breadcrumb').find_all('a')[2].text)
  return book_name, category

In [12]:
book_title, category = book_info(book_links)
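
This cell makes one request per book, so a single dropped connection partway through would abort the whole run. One possible safeguard, sketched here under assumptions, is a small retry wrapper. The helper name `with_retries` and its `attempts` and `delay` parameters are hypothetical and not part of the original notebook; it accepts any callable, so it could wrap `requests.get`.

```python
import time

def with_retries(fetch, url, attempts=3, delay=0.5):
    """Call fetch(url), retrying after any exception, up to `attempts` calls in total."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)  # brief pause before the next try
```

Inside `book_info`, the request line could then read `resp = with_retries(requests.get, link)`. Note that `requests.get` does not raise on HTTP error statuses by itself, so this sketch only covers connection-level failures.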

In [13]:
df_books = pd.DataFrame(list(zip(book_title, category)),
              columns=['Title', 'Category'])

In [14]:
df_books

| | Title | Category |
|---|---|---|
| 0 | Soft Apocalypse | Science Fiction |
| 1 | The Bear and the Piano | Childrens |
| 2 | Anonymous | Default |
| 3 | Dear Mr. Knightley | Fiction |
| 4 | Proofs of God: Classical Arguments from Tertul... | Philosophy |
| ... | ... | ... |
| 995 | Where'd You Go, Bernadette | Default |
| 996 | The E-Myth Revisited: Why Most Small Businesse... | Business |
| 997 | The Power Greens Cookbook: 140 Delicious Super... | Food and Drink |
| 998 | Out of Print: City Lights Spotlight No. 14 | Poetry |


Exercise 3 (final exercise): Scrape the website! We want a DataFrame with the following columns:

Book_name || Book_link || Category || Category_link || Book_stock_availability || Book_price || Book_UPC || Book_Tax || Book_number_of_reviews || Book_description

Notes:

The book name should be in full, not abbreviated

The links should be working as they are (not just the end extension)

Stock availability should include the number of products available, e.g. In stock (8 available)

Price and Tax should be consistent (either both have the £ sign, or neither do)
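
As a starting point for the notes on stock availability and on consistent Price and Tax values, the sketch below shows one way to tidy the scraped strings. The helper names (`clean_stock`, `stock_count`, `to_amount`) are hypothetical, and the input formats are assumed from the example in the notes (`In stock (8 available)`) and from prices written with a leading £ sign. It takes the "neither has the £ sign" option by converting both values to numbers.

```python
import re

def clean_stock(raw):
    """Collapse surrounding and repeated whitespace in the availability text."""
    return ' '.join(raw.split())

def stock_count(raw):
    """Return the integer inside '(N available)', or None if the pattern is absent."""
    match = re.search(r'\((\d+) available\)', raw)
    return int(match.group(1)) if match else None

def to_amount(raw):
    """Drop any currency symbol or stray characters and return the number as a float."""
    match = re.search(r'\d+(?:\.\d+)?', raw)
    return float(match.group()) if match else None

# Assumed sample strings, shaped like the text on a book page
print(clean_stock('\n    In stock (8 available)\n'))
print(stock_count('In stock (8 available)'))
print(to_amount('£51.77'), to_amount('£0.00'))
```

Applying `to_amount` to both the price and the tax column keeps the two consistent, whatever symbol or encoding debris precedes the digits.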