# Web Scraping Goodreads: Exploring the World of Books

Welcome to this web scraping project, where we extract data from [Goodreads](https://www.goodreads.com/) and step into the world of books, data, and insights. If you're a book lover like me, you're in for a treat! And if you're not, I believe this project might just inspire you to delve into the captivating realm of literature.

In this project, we'll use web scraping to collect book data from Goodreads, a treasure trove of book-related information. For each book in the search results we gather the title, author, average rating, number of ratings, publication year, and number of editions.

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import re
import random

In [None]:
# Define a user-agent header to identify your scraper
user_agent = "MyWebScraper/1.0"

In [48]:
book_titles = []
authors = []
avg_ratings = []
ratings = []
published_years = []
editions = []

In [49]:
pages_to_scrape = 150

# Delay between requests in seconds: drawn once at random from 2 to 6 and reused for every page
request_delay = random.randint(2,6)

# Loop through the pages to scrape
for page in range(1, pages_to_scrape + 1):
    
    # Construct the URL for the current page
    url = "https://www.goodreads.com/search?page=" + str(page) + "&q=fiction&search_type=books"
   
    try:
        # Send an HTTP GET request to the URL with the user-agent header
        headers = {"User-Agent": user_agent}
        response = requests.get(url, headers=headers, timeout=30).text

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response, "html.parser")
    
        # Check for server errors or maintenance
        if soup.title and "service unavailable" in soup.title.text.lower():
            print(f"Server error on page {page}. Skipping...")
            continue

        # Select the table containing the list of books
        table = soup.find_all("table")[0]

        # Loop through the rows of the table
        for row in table.find_all("tr"):
            cells = row.find_all("td")[1]

            # Extract book title
            title = cells.find("a").find("span").text
            book_titles.append(title)

            # Extract author's name
            author = cells.find("a", class_="authorName").text
            authors.append(author)
            

            # Extract average rating from the minirating text
            all_ratings = cells.find_all("span", class_="minirating")
            all_ratings_text = all_ratings[0].text.strip()
            pattern_2 = re.compile(r"(\d\.?\d*)\savg")
            avg_ratings.append(pattern_2.search(all_ratings_text).group(1))

            # Extract number of ratings; fall back to "0" when no count is found
            pattern_4 = re.compile(r"(?<=— )([\d,]+)(?= ratings?)")
            ratings_matches = pattern_4.search(all_ratings_text)
            ratings.append(ratings_matches.group(1) if ratings_matches else "0")


#             # Extract average rating
#             avg_rating = cells.find("span", class_="greyText smallText uitext").text.split()[0]
#             avg_ratings.append(avg_rating)

#             # Extract rating
#             rating = cells.find("span", class_="greyText smallText uitext").text.split()[4]
#             ratings.append(rating)

            # Extract published year, handling cases where it may not be in the expected format
            year_info = cells.find("span", class_="greyText smallText uitext").text.split()
            year = None
            for item in year_info:
                if item.isdigit() and len(item) == 4:
                    year = item
                    break
            if year:
                published_years.append(year)
            else:
                published_years.append(0)  # Handle cases where year is not found

            # Extract edition information
            edition = cells.find("span", class_="greyText smallText uitext").text.split()[-2]
            editions.append(edition)
            print(f"{page}/{pages_to_scrape} scraped.  Titles found: {len(book_titles)}", end='\r')

        # Sleep to add a delay between requests
        time.sleep(request_delay)
    
    except requests.exceptions.RequestException as e:
        # Handle HTTP request errors (e.g., connection issues)
        print(f"Error on page {page}: {e}")

    except IndexError as e:
        # Handle "list index out of range" error
        print(f"Index error on page {page}: {e}")

    except Exception as e:
        # Handle other unexpected errors
        print(f"Unexpected error on page {page}: {e}")

Error on page 97: HTTPSConnectionPool(host='www.goodreads.com', port=443): Max retries exceeded with url: /search?page=97&q=fiction&search_type=books (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x1480e60d0>: Failed to establish a new connection: [Errno 12] Cannot allocate memory'))
Error on page 131: HTTPSConnectionPool(host='www.goodreads.com', port=443): Max retries exceeded with url: /search?page=131&q=fiction&search_type=books (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x1488504d0>: Failed to establish a new connection: [Errno 12] Cannot allocate memory'))
Error on page 135: HTTPSConnectionPool(host='www.goodreads.com', port=443): Max retries exceeded with url: /search?page=135&q=fiction&search_type=books (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x148a6fb50>: Failed to establish a new connection: [Errno 12] Cannot allocate memory'))
Error on page 143: HTTPSConnectionPool(host='www
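
The connection errors above cause whole pages to be skipped. One possible mitigation, shown here only as a sketch, is to retry a failed request a few times with a growing delay. The helper name `fetch_with_retries` and its parameters are assumptions for illustration, and retrying may not help if the underlying cause is local memory exhaustion.

```python
import random
import time


def fetch_with_retries(fetch, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is any zero-argument callable. Failures are expected to surface as
    OSError, which is the base class of requests' RequestException.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: let the caller report the page as failed
            # Exponential backoff plus up to one second of random jitter
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 1))
```

Inside the scraping loop, the request could then be wrapped as `fetch_with_retries(lambda: requests.get(url, headers=headers).text)`, leaving the existing `except` clauses to handle a page that still fails.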

In [50]:
# After scraping all pages, collect the lists into a dictionary with one key per column
data = {
    "Title": book_titles,
    "Author": authors,
    "Average Rating": avg_ratings,
    "Rating": ratings,
    "Year Published": published_years,
    "Edition": editions
}

In [51]:
goodreads = pd.DataFrame(data)

# Display the DataFrame (pandas shows the first and last rows and truncates the middle)
goodreads

| | Title | Author | Average Rating | Rating | Year Published | Edition |
|---|---|---|---|---|---|---|
| 0 | Trigger Warning: Short Fictions and Disturbances | Neil Gaiman | 3.92 | 61194 | 2015 | 23 |
| 1 | Smoke and Mirrors: Short Fiction and Illusions | Neil Gaiman | 4.02 | 71896 | 1998 | 99 |
| 2 | Fragile Things: Short Fictions and Wonders | Neil Gaiman | 3.96 | 69823 | 2006 | 44 |
| 3 | What She Left Behind: A Haunting and Heartbrea... | Ellen Marie Wiseman | 3.99 | 57408 | 2015 | 54 |
| 4 | Collected Fictions | Jorge Luis Borges | 4.57 | 23960 | 1998 | 63 |
| ... | ... | ... | ... | ... | ... | ... |
| 1434 | Desire and Domestic Fiction: A Political Histo... | Nancy Armstrong | 3.89 | 186 | 1987 | 11 |
| 1435 | Wandering Stars: An Anthology of Jewish Fantas... | Jack Dann | 3.72 | 222 | 1974 | 15 |
| 1436 | The Book of Ramallah: A City in Short Fiction | Maya Abu al-Hayat | 3.85 | 100 | 0 | 2 |
| 1437 | Sudden Fiction International: 60 Short-Short S... | Robert Shapard | 3.75 | 348 | 1989 | 11 |
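
The scraping loop stores the average rating and the rating count as the strings captured by its regexes. A minimal sketch of converting them to numbers at parse time follows. The function name `parse_minirating` and the sample text are assumptions, with the text format inferred from the regexes in the loop.

```python
import re

AVG_RE = re.compile(r"(\d\.?\d*)\savg")          # same pattern as in the scraping loop
COUNT_RE = re.compile(r"([\d,]+)\s+ratings?\b")  # also accepts the singular "1 rating"


def parse_minirating(text):
    """Return (average_rating, ratings_count); None and 0 when a part is missing."""
    avg_match = AVG_RE.search(text)
    count_match = COUNT_RE.search(text)
    avg = float(avg_match.group(1)) if avg_match else None
    count = int(count_match.group(1).replace(",", "")) if count_match else 0
    return avg, count


sample = "3.92 avg rating — 61,194 ratings"  # assumed format of the minirating text
print(parse_minirating(sample))  # (3.92, 61194)
```

Storing numbers rather than strings means the columns can be sorted and averaged without a separate cleaning step.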


In [52]:
goodreads.to_csv("goodreads_books.csv", index=False)
# goodreads.to_csv("goodreads_books.csv", mode='a', index=False, header=False)