# Web Scraping Goodreads: Exploring the World of Books

Welcome to this web scraping project where extract data from [Goodreads](https://www.goodreads.com/). Welcome into the world of books, data, and insights. If you're a book lover like me, you're in for a treat! And if you're not, well, I believe this project might just inspire you to delve into the captivating realm of literature.

In this project, we'll be harnessing the power of web scraping to extract a wealth of information from Goodreads, a treasure trove of book-related data. From book titles and authors to ratings, and more, Goodreads offers a vast reservoir of knowledge waiting to be explored.

This introductory guide will walk you through the process of setting up your environment, sending HTTP requests, and navigating the structure of web pages to gather data. It's a journey that promises exciting possibilities for data analysis and uncovering hidden insights about the world of books.

So, let's dive in and start exploring the fascinating world of literature through the lens of data! Get ready to scrape, analyze, and discover the stories that await us.


In [1]:
#imports necessary libraries (Pandas, requests, BeautifulSoup) for web scraping and data manipulation.
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import re

In [2]:
# Define a user-agent header to identify your scraper
user_agent = "MyWebScraper/1.0"

In [3]:
book_titles = []
authors = []
avg_ratings = []
ratings = []
published_years = []
editions = []

In [7]:
pages_to_scrape = 1

# Specify the delay between requests in seconds (e.g., 2 seconds)
request_delay = 3

# Loop through the pages to scrape
for page in range(1, pages_to_scrape + 1):
    
    # Construct the URL for the current page
    url = "https://www.goodreads.com/search?page=" + str(page) + "&q=fiction&search_type=books"
   
    try:
        # Send an HTTP GET request to the URL with the user-agent header
        headers = {"User-Agent": user_agent}
        response = requests.get(url, headers=headers).text

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response, "html.parser")
    
        # Check for server errors or maintenance
        if soup.title and "service unavailable" in soup.title.text.lower():
            print(f"Server error on page {page}. Skipping...")
            continue

        # Select the table containing the list of books
        table = soup.find_all("table")[0]

        # Loop through the rows of the table
        for row in table.find_all("tr"):
            cells = row.find_all("td")[1]

            # Extract book title
            title = cells.find("a").find("span").text
            book_titles.append(title)

            # Extract author's name
            author = cells.find("a", class_="authorName").text
            authors.append(author)
            

            #rating
            all_ratings = cells.find_all('span', class_ = 'minirating')
            all_ratings_text = all_ratings[0].text.strip()
            pattern_2 = re.compile(r"(\d\.?\d*)\savg")
            avg_ratings.append(pattern_2.search(all_ratings_text).group(1))

            #n_ratings
            pattern_4 = re.compile(r"(\d\,?\d*) rating")
            ratings_matches = pattern_4.search(all_ratings_text)
            ratings.append(ratings_matches.group(1) if ratings_matches else 0)  


#             # Extract average rating
#             avg_rating = cells.find("span", class_="greyText smallText uitext").text.split()[0]
#             avg_ratings.append(avg_rating)

#             # Extract rating
#             rating = cells.find("span", class_="greyText smallText uitext").text.split()[4]
#             ratings.append(rating)

            # Extract published year, handling cases where it may not be in the expected format
            year_info = cells.find("span", class_="greyText smallText uitext").text.split()
            year = None
            for item in year_info:
                if item.isdigit() and len(item) == 4:
                    year = item
                    break
            if year:
                published_years.append(year)
            else:
                published_years.append(0)  # Handle cases where year is not found

            # Extract edition information
            edition = cells.find("span", class_="greyText smallText uitext").text.split()[-2]
            editions.append(edition)
            print(title,author,pattern_2.search(all_ratings_text).group(1),ratings_matches.group(1) if ratings_matches else 0,year,edition)
            print(f"{page}/{pages_to_scrape} scraped.  Titles found: {len(book_titles)}", end='\r')

        # Sleep to add a delay between requests
        time.sleep(request_delay)
    
    except requests.exceptions.RequestException as e:
        # Handle HTTP request errors (e.g., connection issues)
        print(f"Error on page {page}: {e}")

    except IndexError as e:
        # Handle "list index out of range" error
        print(f"Index error on page {page}: {e}")

    except Exception as e:
        # Handle other unexpected errors
        print(f"Unexpected error on page {page}: {e}")

Smoke and Mirrors: Short Fiction and Illusions Neil Gaiman 4.02 1,896 1998 99
Fragile Things: Short Fictions and Wonders Neil Gaiman 3.96 9,824 2006 44
What She Left Behind: A Haunting and Heartbreaking Story of 1920s Historical Fiction Ellen Marie Wiseman 3.99 7,397 2015 54
Collected Fictions Jorge Luis Borges 4.57 3,960 1998 63
Better Than Fiction Alexa  Martin 3.42 0,766 2022 8
True Fiction (Ian Ludlow Thrillers #1) Lee Goldberg 3.99 7,267 2018 8
The Red Badge of Courage and Selected Short Fiction Stephen Crane 3.66 8,012 1895 9
Non-Fiction Chuck Palahniuk 3.57 2,584 2004 72
Paperbacks from Hell: The Twisted History of '70s and '80s Horror Fiction Grady Hendrix 4.27 0,010 2017 10
1/1 scraped.  Titles found: 1009

In [5]:
# After scraping all pages, we create a DataFrame from the collected data
data = {
    "Title": book_titles,
    "Author": authors,
    "Average Rating": avg_ratings,
    "Rating": ratings,
    "Year Published": published_years,
    "Edition": editions
}

In [6]:
goodreads = pd.DataFrame(data)

# Display the first five rows of the dataframe
goodreads

Unnamed: 0,Title,Author,Average Rating,Rating,Year Published,Edition
0,Trigger Warning: Short Fictions and Disturbances,Neil Gaiman,3.92,1193,2015,64
1,Smoke and Mirrors: Short Fiction and Illusions,Neil Gaiman,4.02,1896,1998,99
2,What She Left Behind: A Haunting and Heartbrea...,Ellen Marie Wiseman,3.99,7397,2015,54
3,Fragile Things: Short Fictions and Wonders,Neil Gaiman,3.96,9824,2006,44
4,Collected Fictions,Jorge Luis Borges,4.57,3960,1998,63
...,...,...,...,...,...,...
994,Sudden Fiction International: 60 Short-Short S...,Robert Shapard,3.75,348,1989,11
995,The Secrets to Creating Character Arcs: A Fict...,John S. Warner,4.13,108,0,3
996,Imperium: A Fiction of the South Seas,Christian Kracht,3.69,3639,2012,45
997,The Mark on the Wall and Other Short Fiction,Virginia Woolf,3.61,710,2009,12


In [30]:
len(goodreads)

1099

In [8]:
goodreads.to_csv("Goodreads_books.csv", index=False)