# Upwork Samples
<font size=4 color='blue'>Web Scrape - Book Store</font>   
***  

**Project Summary:**   
This project contains a variety of samples used in my Upwork portfolio.

**Notebook Scope:**  
This notebook includes code to scrape data from a [Web Scraping Sandbox](https://toscrape.com/) site.

**Output:**  
The resulting data will be saved to a CSV file for further analysis.
***  

***
# Notebook Setup
***

In [1]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

In [2]:
# Set pandas display settings
pd.set_option('display.max_colwidth', 30)

***  
# Load Book Categories
***

In [3]:
# Scrape data from the Books to Scrape web site
url = 'https://books.toscrape.com/'
web_data = requests.get(url).text

In [4]:
# Create a BeautifulSoup object
soup = BeautifulSoup(web_data, 'html5lib')

In [5]:
# Create a dictionary of book categories and the link to the index page for each
# This information is available in the left side navigation
side_nav = soup.find('div', {'class': 'side_categories'})
side_nav_links_list = side_nav.find_all('a', href=True)
book_cats = {}

for link in side_nav_links_list:
    book_cats[link.contents[0].strip()] = link['href']

In [6]:
# Delete the side navigation heading from the book category dictionary
del book_cats['Books']

In [7]:
# Preview book categories
list(book_cats.items())[:5]

[('Travel', 'catalogue/category/books/travel_2/index.html'),
 ('Mystery', 'catalogue/category/books/mystery_3/index.html'),
 ('Historical Fiction',
  'catalogue/category/books/historical-fiction_4/index.html'),
 ('Sequential Art', 'catalogue/category/books/sequential-art_5/index.html'),
 ('Classics', 'catalogue/category/books/classics_6/index.html')]

***  
# Load Book Quantity by Category
***

In [8]:
# Load the index page for each category and read the total results
# Save totals to a quantity dictionary
cat_qty = {}
for cat in book_cats:
    url = 'https://books.toscrape.com/' + book_cats[cat]
    web_data = requests.get(url).text
    soup = BeautifulSoup(web_data, 'html5lib')
    cat_qty[cat] = soup.find('form').find('strong').contents[0]

In [9]:
# Preview book category quantities
list(cat_qty.items())[:5]

[('Travel', '11'),
 ('Mystery', '32'),
 ('Historical Fiction', '26'),
 ('Sequential Art', '75'),
 ('Classics', '19')]

***
# Load All Book Data
***

In [10]:
# Loop through the summary pages to pull the title and URL for every book and store them in a dataframe
# Set the URL for the first page of results
base_url = 'https://books.toscrape.com/catalogue/category/books_1/'
url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
books_df = pd.DataFrame(columns=['Title', 'url'])

# Loop through each page
while url is not None:
    web_data = requests.get(url).text
    soup = BeautifulSoup(web_data, 'html5lib')
    books = soup.find_all('article', {'class': 'product_pod'})

    # Loop through each book on the page
    for book in books:
        book_heading = book.find('h3').contents[0]
        book_title = book_heading['title']
        book_url = 'http://books.toscrape.com/catalogue/' + book_heading['href'][6:]
        books_df = pd.concat([books_df, pd.DataFrame({'Title': [book_title], 'url': [book_url]})])

    # Get the next page
    next_class = soup.find('li', {'class': 'next'})
    if next_class is not None:
        url = base_url + next_class.find('a').attrs['href']
    else:
        url = None
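
The loop above builds URLs by string concatenation and by slicing the leading `../../` off each book link. As an alternative sketch, `urllib.parse.urljoin` from the standard library resolves a relative `href` against the page it was found on. The two `href` values below are illustrative stand-ins for what the listing page returns, not values captured from this run.

```python
from urllib.parse import urljoin

# URL of the listing page being parsed (taken from the cell above)
page_url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'

# Relative hrefs as they might appear in the page (illustrative values)
next_href = 'page-2.html'
book_href = '../../a-light-in-the-attic_1000/index.html'

# urljoin resolves each href relative to the page that contained it
next_url = urljoin(page_url, next_href)
book_url = urljoin(page_url, book_href)

print(next_url)
print(book_url)
```

Resolving links this way avoids relying on the number of `../` segments staying fixed.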

In [11]:
# Validate length of dataframe - we should have 1000 books
len(books_df)

1000

In [12]:
# Set dataframe index and preview data
books_df.set_index('Title', drop=True, inplace=True)
books_df.head()

| Title | url |
|---|---|
| A Light in the Attic | http://books.toscrape.com/... |
| Tipping the Velvet | http://books.toscrape.com/... |
| Soumission | http://books.toscrape.com/... |
| Sharp Objects | http://books.toscrape.com/... |
| Sapiens: A Brief History of Humankind | http://books.toscrape.com/... |


In [13]:
# Loop through each book detail page and scrape the remaining data
ordinal_words = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4 , 'Five': 5}

for row in books_df.iterrows():
    # Read in the web page and create a soup object
    url = row[1]['url']
    web_data = requests.get(url).text
    soup = BeautifulSoup(web_data, 'html5lib')
    
    # Find elements of interest and add them to the books dataframe
    rating_str = soup.find('p', {'class': re.compile('star-rating')}).attrs['class'][1]
    books_df.at[row[0], 'Rating'] = ordinal_words[rating_str]
    books_df.at[row[0], 'UPC'] = soup.find('th', string='UPC').next_element.next_element.contents[0]
    books_df.at[row[0], 'Price'] = soup.find('th', string='Price (incl. tax)').next_element.next_element.contents[0]
    books_df.at[row[0], 'Tax'] = soup.find('th', string='Tax').next_element.next_element.contents[0]
    avail = soup.find('th', string='Availability').next_element.next_element.next_element.next_element
    if avail.startswith('In stock'):
        books_df.at[row[0], 'Qty on Hand'] = avail.split('(')[1].split()[0]
        books_df.at[row[0], 'Availability'] = 'In Stock'
    else:
        books_df.at[row[0], 'Availability'] = 'Out of Stock'
    books_df.at[row[0], 'Reviews'] = soup.find('th', string='Number of reviews').next_element.next_element.next_element.next_element
    if soup.find('h2', string='Product Description') is not None:
        books_df.at[row[0], 'Description'] = soup.find('h2', string='Product Description').next_element.next_element.next_element.next_element.contents[0]
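
The availability parsing above relies on chained `split` calls. A minimal sketch of the same idea using a regular expression is shown below. The helper name `parse_availability` is hypothetical, and the input format `In stock (22 available)` is assumed from the parsing logic in the cell above.

```python
import re

def parse_availability(text):
    """Return (status, quantity) parsed from an availability string."""
    # Assumed format: 'In stock (<number> available)'
    match = re.search(r'In stock \((\d+) available\)', text)
    if match:
        return 'In Stock', int(match.group(1))
    # Anything else is treated as out of stock, with no quantity
    return 'Out of Stock', None

print(parse_availability('In stock (22 available)'))
print(parse_availability('Out of stock'))
```

Returning the quantity as an `int` also means later range checks compare numbers rather than strings.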

In [14]:
# Preview books dataframe
books_df.head()

| Title | url | Rating | UPC | Price | Tax | Qty on Hand | Availability | Reviews | Description |
|---|---|---|---|---|---|---|---|---|---|
| A Light in the Attic | http://books.toscrape.com/... | 3.0 | a897fe39b1053632 | Â£51.77 | Â£0.00 | 22 | In Stock | 0 | It's hard to imagine a wor... |
| Tipping the Velvet | http://books.toscrape.com/... | 1.0 | 90fa61229261140a | Â£53.74 | Â£0.00 | 20 | In Stock | 0 | "Erotic and absorbing...Wr... |
| Soumission | http://books.toscrape.com/... | 1.0 | 6957f44c3847a760 | Â£50.10 | Â£0.00 | 20 | In Stock | 0 | Dans une France assez proc... |
| Sharp Objects | http://books.toscrape.com/... | 4.0 | e00eb4fd7b871a48 | Â£47.82 | Â£0.00 | 20 | In Stock | 0 | WICKED above her hipbone, ... |
| Sapiens: A Brief History of Humankind | http://books.toscrape.com/... | 5.0 | 4165285e1663650f | Â£54.23 | Â£0.00 | 20 | In Stock | 0 | From a renowned historian ... |
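
The `Â£` in the Price and Tax columns is the pattern produced when the UTF-8 bytes for `£` are decoded as ISO-8859-1 (Latin-1). The sketch below reproduces and reverses the effect with plain strings. That the site serves UTF-8 is an assumption inferred from this pattern.

```python
# The pound sign is two bytes in UTF-8
pound = '£'
utf8_bytes = pound.encode('utf-8')

# Decoding those bytes as Latin-1 yields two characters instead of one
garbled = utf8_bytes.decode('latin-1')
print(garbled)

# Reversing the mistake: re-encode as Latin-1, then decode as UTF-8
repaired = 'Â£51.77'.encode('latin-1').decode('utf-8')
print(repaired)
```

With `requests`, setting `response.encoding = 'utf-8'` before reading `response.text` would likely avoid the stray character at the source. The later cells that slice the first characters off the price would then need adjusting.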


***
# Validate Data
***

In [15]:
# Verify no duplicate book titles
book_cts = pd.DataFrame(books_df.index.value_counts())
print('The following books have more than 1 entry in our dataframe:')
print(book_cts[book_cts['count'] > 1].to_string(header=False, index_names=False))

The following books have more than 1 entry in our dataframe:
The Star-Touched Queen  2


In [16]:
# Look for missing data
pd.DataFrame(books_df.isna().sum(), columns=['NaN Count'])

| | NaN Count |
|---|---|
| url | 0 |
| Rating | 0 |
| UPC | 0 |
| Price | 0 |
| Tax | 0 |
| Qty on Hand | 0 |
| Availability | 0 |
| Reviews | 0 |
| Description | 2 |


In [17]:
# List books that are missing a description
print('Books without a description:')
for title in books_df[books_df['Description'].isna()].index:
    print(f'   {title}')

Books without a description:
   The Bridge to Consciousness: I'm Writing the Bridge Between Science and Our Old and New Beliefs.
   Alice in Wonderland (Alice's Adventures in Wonderland #1)


In [18]:
# Review ratings data
books_df['Rating'].value_counts()

Rating
1.0    226
3.0    203
5.0    196
2.0    196
4.0    179
Name: count, dtype: int64

In [19]:
# Review price range
print(f"Lowest price: {books_df['Price'].str[2:].min()}")
print(f"Highest price: {books_df['Price'].str[2:].max()}")

Lowest price: 10.00
Highest price: 59.99


In [20]:
# Review tax range
print(f"Lowest tax: {books_df['Tax'].str[2:].min()}")
print(f"Highest tax: {books_df['Tax'].str[2:].max()}")

Lowest tax: 0.00
Highest tax: 0.00


In [21]:
# Review qty on hand range
# Note: the values are still strings at this point, so min/max compare them lexicographically
print(f"Lowest qty on hand: {books_df['Qty on Hand'].min()}")
print(f"Highest qty on hand: {books_df['Qty on Hand'].max()}")

Lowest qty on hand: 1
Highest qty on hand: 9
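
The reported maximum of 9 does not match the dataframe preview, where the first book has 22 on hand. The likely cause is that the quantities were scraped as strings, and strings compare character by character. The sketch below shows the difference using illustrative values rather than the full column.

```python
# Illustrative quantities stored as strings, as scraped
qty_strings = ['22', '20', '9', '1']

# String comparison is lexicographic: '9' sorts after '22'
print(min(qty_strings), max(qty_strings))

# Converting to int first gives the numeric range
qty_numbers = [int(q) for q in qty_strings]
print(min(qty_numbers), max(qty_numbers))
```

In pandas, the equivalent step would be converting the column with `astype(int)` before calling `min()` and `max()`.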


In [22]:
# Review the range of review counts
print(f"Lowest number of reviews: {books_df['Reviews'].min()}")
print(f"Highest number of reviews: {books_df['Reviews'].max()}")

Lowest number of reviews: 0
Highest number of reviews: 0


***
# Data Cleanup
***

In [23]:
# Review duplicate rows
books_df[books_df.duplicated()]

| Title | url | Rating | UPC | Price | Tax | Qty on Hand | Availability | Reviews | Description |
|---|---|---|---|---|---|---|---|---|---|


***
<font color='blue'>**Note:**</font>
There are no duplicate rows, but there is one duplicated title. More investigation would be needed to identify the differences between the two entries and to decide which row is the "correct" one.
***
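
One way to start that investigation is to group rows by title and list the fields whose values differ within each group. The sketch below uses plain Python and hypothetical records (the titles, UPCs, and prices are made up for illustration). The helper name `differing_fields` is also hypothetical.

```python
from collections import defaultdict

# Hypothetical records standing in for rows of the books dataframe
records = [
    {'Title': 'Book A', 'UPC': 'aaa111', 'Price': '10.00'},
    {'Title': 'Book B', 'UPC': 'bbb222', 'Price': '12.50'},
    {'Title': 'Book A', 'UPC': 'ccc333', 'Price': '15.25'},
]

def differing_fields(rows):
    """Return the field names whose values are not identical across rows."""
    return [key for key in rows[0] if len({row[key] for row in rows}) > 1]

# Group rows by title
rows_by_title = defaultdict(list)
for record in records:
    rows_by_title[record['Title']].append(record)

# Keep only the titles that occur more than once
duplicates = {t: rows for t, rows in rows_by_title.items() if len(rows) > 1}

for title, rows in duplicates.items():
    print(title, 'differs in:', differing_fields(rows))
```

In pandas, `books_df[books_df.index.duplicated(keep=False)]` would select both rows for a side-by-side comparison.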

In [24]:
# Convert ratings to int
books_df['Rating'] = books_df['Rating'].astype('int')

In [25]:
# Clean up the price and tax fields by dropping the stray leading 'Â' left by the encoding mismatch
books_df['Price'] = books_df['Price'].str[1:]
books_df['Tax'] = books_df['Tax'].str[1:]
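
The slice above removes only the first character, so the columns still hold strings such as `£51.77`. If numeric prices are needed for analysis, one possible approach is to strip everything except digits and the decimal point before converting. The helper name `to_amount` is hypothetical.

```python
import re

def to_amount(raw):
    """Strip everything except digits and the decimal point, then convert to float."""
    return float(re.sub(r'[^0-9.]', '', raw))

# Works whether or not the stray encoding character is still present
print(to_amount('£51.77'))
print(to_amount('Â£0.00'))
```

This sketch assumes prices always use a period as the decimal separator and contain no thousands separators.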

***
# Final Preview
***

In [26]:
books_df.head()

| Title | url | Rating | UPC | Price | Tax | Qty on Hand | Availability | Reviews | Description |
|---|---|---|---|---|---|---|---|---|---|
| A Light in the Attic | http://books.toscrape.com/... | 3 | a897fe39b1053632 | £51.77 | £0.00 | 22 | In Stock | 0 | It's hard to imagine a wor... |
| Tipping the Velvet | http://books.toscrape.com/... | 1 | 90fa61229261140a | £53.74 | £0.00 | 20 | In Stock | 0 | "Erotic and absorbing...Wr... |
| Soumission | http://books.toscrape.com/... | 1 | 6957f44c3847a760 | £50.10 | £0.00 | 20 | In Stock | 0 | Dans une France assez proc... |
| Sharp Objects | http://books.toscrape.com/... | 4 | e00eb4fd7b871a48 | £47.82 | £0.00 | 20 | In Stock | 0 | WICKED above her hipbone, ... |
| Sapiens: A Brief History of Humankind | http://books.toscrape.com/... | 5 | 4165285e1663650f | £54.23 | £0.00 | 20 | In Stock | 0 | From a renowned historian ... |


***
# Save to CSV
***

In [None]:
# Save the books dataframe with its Title index (the output file name is an assumption)
books_df.to_csv('../data/books.csv')

***
**End**
***