# Data Collection and Cleaning


In [None]:
import pandas as pd
import numpy as np
import requests
import time
from bs4 import BeautifulSoup
from tqdm import tqdm
import pickle

# Data Collection: Scraping Web Data

We will be scraping data off of the website [librarything.com](https://www.librarything.com/), which contains over 155 million books catalogued from Amazon, the Library of Congress and 4,941 other libraries. It also contains metadata such as: 
- Book title 
- Blurb 
- Author
- Publication Date 
- First words 
- Publishing house 
- Average rating
- Number of reviews (we will filter for books with more than 10 reviews)
- Book covers

However, the books are stored under random IDs in their url strings. For example, George Orwell's '1984' is under: https://www.librarything.com/work/1472. So first, we needed to scrape author pages and extract the urls of all their works. Author pages are accessible under: https://www.librarything.com/author/orwellgeorge. 

We first scraped a list of author names from LibraryThing's author gallery, Wikipedia's list of best-selling fiction authors, and The Guardian's list of authors. This resulted in 1,613 authors.

We then put the names into the correct format to loop through the 1,613 author pages and access all of their works.

**Using this method, our raw dataset contains 25,108 books.**

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/gdrive')
data_dir = "/content/gdrive/MyDrive/Data Science Final Project"

Mounted at /content/gdrive


## Functions for scraping & saving web data

The function below converts an author's name into the format used in LibraryThing author URLs (LastnameFirstname).

In [None]:
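#turn author names into the format used in LibraryThing URLs (LastnameFirstname)
import re
def process_name(name):
  '''
  Take a "Firstname Lastname" string and return it as "LastnameFirstname".
  Punctuation is removed, and the last word is moved to the front,
  so "Diana Wynne Jones" becomes "JonesDianaWynne".
  '''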
  author_name = name.split()
  for i in range(len(author_name)):
    author_name[i] = re.sub(r'[^\w\s]', '', author_name[i])
  return author_name[-1]+"".join(author_name[:-1])

In [None]:
#function to add the correct "/author/" prefix to loop through the author page URLs on LibraryThing
def add_prefix(authors):
    return ["/author/" + author.lower() for author in authors]
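As a quick check of the two helpers above, the sketch below re-states them and runs them on three names. The expected paths are ones that appear elsewhere in this notebook (`/author/orwellgeorge`, `/author/rowlingjk`, `/author/jonesdianawynne`). This does not guarantee that every formatted name points to a real LibraryThing page, since some authors carry a numeric suffix such as `kingstephen-1`.

```python
import re

def process_name(name):
    """Turn 'Firstname Lastname' into 'LastnameFirstname', dropping punctuation."""
    parts = [re.sub(r"[^\w\s]", "", part) for part in name.split()]
    return parts[-1] + "".join(parts[:-1])

def add_prefix(authors):
    """Lower-case each formatted name and prepend the '/author/' path."""
    return ["/author/" + author.lower() for author in authors]

names = ["George Orwell", "J. K. Rowling", "Diana Wynne Jones"]
slugs = add_prefix([process_name(n) for n in names])
print(slugs)
# ['/author/orwellgeorge', '/author/rowlingjk', '/author/jonesdianawynne']
```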

In [None]:
# define load and save functions -- to store scraped data in a local pickle file
def save(list_to_save: list, key: str):
  with open(key, 'wb') as fp:
    pickle.dump(list_to_save, fp)

def load(key: str):
  with open(key, 'rb') as fp:
    itemlist = pickle.load(fp)
    return itemlist
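A small round-trip check can confirm that whatever is passed to `save` comes back unchanged from `load`. The sketch below re-states both helpers and writes to a temporary directory; the sample list and the temporary path are hypothetical and only serve this check.

```python
import os
import pickle
import tempfile

def save(list_to_save, key):
    """Pickle an object to the file path `key`."""
    with open(key, "wb") as fp:
        pickle.dump(list_to_save, fp)

def load(key):
    """Read back an object that was pickled by `save`."""
    with open(key, "rb") as fp:
        return pickle.load(fp)

# Hypothetical sample data, written to a throwaway directory
sample = ["/author/orwellgeorge", "/author/rowlingjk"]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "author_urls")
    save(sample, path)
    restored = load(path)

print(restored == sample)
```

Pickle files should only be loaded from sources we trust, because unpickling can execute arbitrary code.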

## Collecting author names to scrape book metadata

#### Step 1: Scrape author names from various sources (LibraryThing, The Guardian, Wikipedia)

LibraryThing already provides the author names in the correct format for URL processing, so we do not need to call the process_name function.

In [None]:
#scrape 1000 authors from Librarything
url = 'https://www.librarything.com/zeitgeist/authorgallery'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

authors_LT = []
divs = soup.find_all('div', {'class': ['picture', 'lt2_columnar_item']})
for div in divs:
    for a_tag in div.find_all('a'):
        href = a_tag.get('href')  # None when the <a> tag has no href attribute
        if href and "/author/" in href and "http" not in href:
            authors_LT.append(href)

print(f"{len(authors_LT)} authors from LibraryThing's Author Gallery")

1000 authors from LibraryThing's Author Gallery


In [None]:
#print examples of the LibraryThing author scrape
authors_LT[:5]

['/author/rowlingjk',
 '/author/kingstephen-1',
 '/author/pratchettterry',
 '/author/lewiscs',
 '/author/tolkienjrr']

In [None]:
#scrape author names from Wikipedia's bestseller author list, got 57 additional authors
response = requests.get(
    "https://en.wikipedia.org/wiki/List_of_best-selling_fiction_authors")
soup = BeautifulSoup(response.text, "html.parser")
table_0=soup.find("table")
table_0_body=table_0.find("tbody")
authors_wiki = []
for row in table_0_body.find_all("tr")[1:]:
  value=row.find_all("td")
  name=value[0].find("a").text
  #names with more than two words keep every part: last word first, then the rest
  formatted = process_name(name)
  if f"/author/{formatted.lower()}" not in authors_LT:
    authors_wiki.append(formatted)
print(f"{len(authors_wiki)} authors from Wikipedia's Bestseller List")

57 authors from Wikipedia's Bestseller List


In [None]:
#print example of the Wiki author scrape
authors_wiki[:5]

['CartlandBarbara',
 'RobbinsHarold',
 'OdaEiichiro',
 'PattenGilbert',
 'ToriyamaAkira']

In [None]:
# scrape author names from the Guardian's author page, got 556 additional author names
response = requests.get(
    "https://www.theguardian.com/books/list/authorsaz")
soup = BeautifulSoup(response.text, "html.parser")
initials=soup.find_all("div", attrs={
            "class":"countries dir-first"})

authors_guardian = []

for initial in initials:
  names = initial.find_all('a')
  for name in names:
    processed = process_name(name.text)
    #only checked against the Wikipedia list, not the LibraryThing list
    if processed not in authors_wiki:
      authors_guardian.append(processed)
print(f"{len(authors_guardian)} authors from The Guardian's Bestseller List")

556 authors from The Guardian's Bestseller List


In [None]:
authors_final = authors_LT + add_prefix(authors_wiki) + add_prefix(authors_guardian)
print(f"{len(authors_final)} total authors")
# print example of the author data
print(authors_final[-5:])

1613 total authors
['/author/jonesdianawynne', '/author/xinran', '/author/yeatswb', '/author/zizekslavoj', '/author/zolaemile']
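The cells above check Wikipedia names against the LibraryThing list and Guardian names against the Wikipedia list only, so a Guardian author who is also in the LibraryThing gallery could appear twice in `authors_final`. As a sketch, assuming an order-preserving de-duplication is acceptable, the combined list could be filtered as below. The helper name `dedupe_keep_order` and the sample lists are hypothetical, and applying this to the real lists may lower the total of 1613.

```python
def dedupe_keep_order(paths):
    """Drop repeated entries while keeping the first occurrence of each."""
    return list(dict.fromkeys(paths))

# Hypothetical sample lists standing in for the three scraped sources
sample_lt = ["/author/rowlingjk", "/author/lewiscs"]
sample_wiki = ["/author/cartlandbarbara"]
sample_guardian = ["/author/lewiscs", "/author/zolaemile"]

combined = dedupe_keep_order(sample_lt + sample_wiki + sample_guardian)
print(combined)
```

`dict.fromkeys` keeps insertion order, so LibraryThing's own paths stay ahead of the names we formatted ourselves.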


In [None]:
#use author names to get author urls to prepare for scraping later
author_urls = [f"https://www.librarything.com{author}" for author in authors_final]
# We are saving the URLs of 1613 authors in total.
save(author_urls, "author_urls")

#### Step 2: Scrape LibraryThing's author pages to collect the URLs of individual book pages

Using our list of authors, we scraped each author page for links to individual book pages and saved them under book_urls. This yielded 119,215 URLs in total.

For a more manageable dataset, we kept a maximum of 30 books per author. This left 46,271 webpages to scrape.
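With this many pages to request, a single failed request can interrupt a long run. The sketch below shows one way to wrap a request in a small retry loop. It is not what we ran: `fetch_with_retry` is a hypothetical helper, `get` stands for any callable that returns a response or raises (for example a wrapper around `requests.get`), and `flaky_get` is a stand-in that fails twice so the helper can be exercised without network access.

```python
import time

def fetch_with_retry(get, url, retries=3, delay=0.5):
    """Call get(url) up to `retries` times, pausing `delay` seconds after each failure.

    Returns the response, or None if every attempt failed.
    """
    for attempt in range(retries):
        try:
            return get(url)
        except Exception as error:  # broad on purpose: this is only a sketch
            print(f"attempt {attempt + 1} failed for {url}: {error}")
            time.sleep(delay)
    return None

# Hypothetical fetcher that fails twice before succeeding
calls = {"count": 0}

def flaky_get(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"<html>page for {url}</html>"

page = fetch_with_retry(flaky_get, "https://www.librarything.com/author/orwellgeorge", delay=0)
print(page is not None, calls["count"])
```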

In [None]:
#collect list of book page urls through author pages
book_urls = []

for url in author_urls:
  time.sleep(0.5)
  response = requests.get(url)
  soup = BeautifulSoup(response.text, 'html.parser')
  div_tags = soup.find_all('div', {'class': 'li_donthave'})

  # Keep at most 30 works per author, then extract the 'href' value
  # (url of the individual book) from the 'a' element within each 'div'
  for div_tag in div_tags[:30]:
    book_id = div_tag.find('a')
    if book_id:
      # the href already starts with "/", so the saved URLs contain "//work/..."
      book_urls.append(f"https://www.librarything.com/{book_id['href']}")

save(book_urls, "book_urls")
print(f"{len(book_urls)} books to scrape data from")

46271
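The book URLs saved by this cell contain a doubled slash (for example `https://www.librarything.com//work/2742` further down in this notebook), because each `href` already begins with `/`. The pages evidently still loaded for us, but a cleaner URL can be built with `urljoin` from the standard library. The sketch below is an alternative we did not use; `BASE_URL` and `build_book_url` are hypothetical names.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.librarything.com"

def build_book_url(href):
    """Join the site root with a relative href such as '/work/1472'."""
    return urljoin(BASE_URL, href)

# hrefs taken from book URLs that appear in this notebook
for href in ["/work/1472", "/work/2742"]:
    print(build_book_url(href))
```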

## Scrape book metadata

Step 3: Now that we have all the book page URLs, we can start scraping the book metadata to build a dataframe with information about individual books. The author names came from three sources (Wikipedia's best-selling authors list, The Guardian's authors page, and LibraryThing's author gallery), while the book metadata itself comes from LibraryThing's book pages.

Since we had 46,271 URLs to scrape, we ran the scraping in Visual Studio Code so that the runtime would not exceed Colab's maximum. We divided this task between us and concatenated our dataframes into a final df_books.
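The notebook does not record exactly how the URL list was divided. One straightforward way, sketched here under that caveat, is to cut the list into contiguous parts of nearly equal size so that each person scrapes one part. The helper name `split_into_chunks`, the chunk count, and the sample URLs are all hypothetical.

```python
def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # the first `extra` chunks take one additional item each
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

# Hypothetical stand-in for the saved list of book URLs
sample_urls = [f"https://www.librarything.com/work/{i}" for i in range(10)]
parts = split_into_chunks(sample_urls, 3)
print([len(part) for part in parts])
# [4, 3, 3]
```

Keeping the parts contiguous means the original order can be restored by concatenating the results in chunk order.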

We will scrape data on the following features:
1. title
2. author
3. book index
4. average rating of the book
5. genres associated with the book
6. short description of the book
7. book's publication year
8. the first sentence of the book
9. the number of reviews the book has
10. a url to the book's cover


In [None]:
#ran this code on VS Code
#scrape individual book pages for details on book title, author, rating, blurb, book cover image url
#drop books with 10 or fewer reviews or with no rating

import numpy as np
import pandas as pd

add_book_info = pd.DataFrame(columns=["title", 'author', 'book_index', 'book_url', 'avg_rating', 'genre', 'description', 'publication_year','first_sentence',"number_of_reviews",'image_url', 'srcset'])

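# add_book_urls is assumed to hold the share of book_urls assigned to this run
for i, url in tqdm(list(enumerate(add_book_urls))):
  if i % 100 == 0:
    save(add_book_info, 'add_book_info')  # checkpoint progress every 100 pages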
  response = requests.get(url)
  html_content = response.content

  # Parse the HTML content using BeautifulSoup
  try:
    soup = BeautifulSoup(html_content, 'html.parser')
  except:
    continue

  # Find the 'h1' tag element
  try:
    h1_tag = soup.find('div', {'class': 'headsummary'}).find('h1')
    h2_tag = soup.find('div', {'class': 'headsummary'}).find('h2')
  except AttributeError:
    print(f"Error in {i}: Could not find 'div' element with class 'headsummary' or 'h1' or 'h2' tags.")
    continue  # skip the current iteration and move on to the next one
      

  # Extract the title from the 'h1' tag and the author from the link inside the 'h2' tag
  if h1_tag is None or h2_tag is None: continue
  title = h1_tag.text.strip()
  if not title: continue  # strip() never returns None, so test for an empty string
  #year = soup.find('div', {'class': 'fwikiAtomicValue'}).find('a').text.strip()
  author_tag = h2_tag.find('a')
  if author_tag is None: continue
  author = author_tag.text.strip()
  if not author: continue

  #not all books have a rating, need to filter
  if soup.find('span', {'class': 'dark_hint'}):
    avg_rating = float(soup.find('span', {'class': 'dark_hint'}).text.strip('()'))
  else:
    continue
  # number of reviews
  reviewnum = soup.find("table", attrs={
              "cellpadding":"0",'cellspacing':'0','class':'wsltable'})
  if reviewnum:
    review = reviewnum.find('tr', attrs={'class':'wslcontent'})
    if review:
      cells = review.find_all("td")
      if len(cells) > 1:
        reviewnumber = cells[1].text.strip()
        # skip missing or non-numeric counts and books with 10 or fewer reviews
        if not reviewnumber.isdigit() or int(reviewnumber) <= 10: continue
      else:
        continue
    else:
      continue
  else:
    continue

  # genre
  genre = soup.find('div', {'id': 'genregreenbox'})
  if genre: #need to change the empty lists into NaN or empty string
    tags = [x.text.strip() for x in genre.find_all('div',{'class': 'genreline'})]
    if len(tags) == 0:
      continue
  else:
    continue

  # description
  book_description = soup.find('tr', {'class': 'wslcontent wslsummary'})
  if book_description:
    description = book_description.find('div').text.strip()
    if description:
      pass
    else:
      continue
  else:
    continue

  #publication year
  year = soup.find("div", attrs={
            "class":"fwikiItem divoriginalpublicationdate"})
  if year:
    publication = year.find('a')
    if publication:
      publication = publication.text
    else:
      continue
  else: 
    continue

  #first words
  firstword = soup.find("div", attrs={
            "class":"fwikiItem divfirstwords"})
  if firstword:
    firstwords = firstword.find("div", attrs={
            "class":"fwikiAtomicValue", 'style':'min-height:12px;'})
    if firstwords:
      firstwords = firstwords.text
    else:
      continue
  else:
    continue

  #image url 
  img = soup.find("div", attrs={
            "id":"maincover"})
  if img:
    imgs = img.find_all('img')
    if len(imgs) > 1:
      # use the second <img> in the cover div; srcset is None when missing or empty
      img = imgs[1]
      srcset = img.get('srcset') or None
      img = img.get('src')
    else:
      continue
  else:
    continue

  if title:
    add_book_info = pd.concat([add_book_info, pd.DataFrame.from_records([{ 
      "title":title,
      "author":author, 
      "book_index": i,
      "book_url": url,
      "avg_rating":avg_rating, 
      "genre":tags,
      "description":description,
      "publication_year": publication,
      "first_sentence": firstwords,
      "number_of_reviews": reviewnumber,
      'image_url': img,
      'srcset': srcset
     }]
    )], ignore_index=True)
  else:
    continue
save(add_book_info, 'add_book_info')
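The review-count check in the cell above is easier to test when it lives in its own small function. The sketch below mirrors the cell's rule (skip the book when the count is missing or is 10 or fewer). The helper names `parse_review_count` and `has_enough_reviews` are hypothetical, and the handling of thousands separators such as `1,024` is an assumption about how counts might be displayed, not something we verified on the site.

```python
def parse_review_count(text):
    """Return the review count as an int, or None when it is missing or not numeric."""
    if text is None:
        return None
    cleaned = text.strip().replace(",", "")  # assumed thousands separator
    if not cleaned.isdecimal():
        return None
    return int(cleaned)

def has_enough_reviews(text, minimum=10):
    """True when the parsed count is strictly greater than `minimum`."""
    count = parse_review_count(text)
    return count is not None and count > minimum

for raw in ["500", " 13 ", "None", "", "1,024", "10"]:
    print(repr(raw), parse_review_count(raw), has_enough_reviews(raw))
```

The same pattern (parse, return `None` on failure, decide in one place) could replace the nested `if`/`else: continue` ladders used for the other fields.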

## Converting to DataFrame

Because we split the URLs across multiple runtimes, we now need to concatenate all of our book_info files to compile a single large dataframe with all of our books and book metadata. Finally, we saved this as a csv.

In [None]:
import os
folder_path = '/content/gdrive/MyDrive/Data Science Final Project/book_info'  # Drive is mounted at /content/gdrive

# Create an empty list to store the data frames
df_list = []

# Loop through each file in the folder
for file_name in os.listdir(folder_path):
  file_path = os.path.join(folder_path, file_name)
  df = load(file_path)
  df_list.append(df)

# Concatenate all the data frames 
df_books = pd.concat(df_list, axis=0).reset_index(drop=True)

In [None]:
print(f"Our raw dataset contains: {len(df_books)} books")

Our raw dataset contains: 25108 books


## Data Cleaning & Converting to .csv

We clean the book dataframe to prepare it for analysis and ML. To do this, we needed to standardize all of our columns. For example, some books had a full publication date, whereas others only had a year, so we standardized to years only. Additionally, as we'll be analyzing trends in book titles, we needed to remove the publication years from book titles.
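The two rules can be illustrated on plain strings before applying them to a whole column. This is a sketch using only the standard library: `clean_title` and `year_only` are hypothetical helper names, the date string is a made-up example, and taking the first four characters assumes the year comes first in a full date.

```python
import re

YEAR_SUFFIX = re.compile(r"\s*\(\d{4}\)")

def clean_title(title):
    """Remove a parenthesised four-digit year and any surrounding whitespace."""
    return YEAR_SUFFIX.sub("", title).strip()

def year_only(publication):
    """Keep the first four characters, assumed to hold the year."""
    return str(publication)[:4]

print(clean_title("And Then There Were None (1939)"))  # And Then There Were None
print(year_only("1939-11-06"))  # 1939 (hypothetical full date)
print(year_only(1934))          # 1934
```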


In [None]:
#load pre-saved csv of book data

df_books = pd.read_csv("/content/gdrive/MyDrive/Data Science Final Project/books.csv")

In [None]:
#remove year from book titles (e.g "And Then There Were None (1939)")
#regex=True is needed: pandas 2.0 and later treat the pattern as a literal string by default
df_books['title'] = df_books['title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()

#change publication date to just the year (first four characters), kept as a string
df_books['publication_year'] = df_books['publication_year'].astype(str)
df_books['publication_year'] = df_books['publication_year'].str[:4]

df_books


Unnamed: 0,title,author,book_index,book_url,avg_rating,genre,description,publication_year,first_sentence,number_of_reviews,image_url,srcset,genre_str
0,And Then There Were None,Agatha Christie,0,https://www.librarything.com//work/7962202,4.14,"['Fiction and Literature', 'Mystery']","Ten houseguests, trapped on an isolated island...",1939,In the corner of a first-class smoking carriag...,500,https://images-na.ssl-images-amazon.com/images...,https://images-na.ssl-images-amazon.com/images...,"'Fiction and Literature', 'Mystery"
1,Murder on the Orient Express,Agatha Christie,1,https://www.librarything.com//work/2742,4.07,"['Fiction and Literature', 'Mystery']","Agatha Christie's most famous murder mystery, ...",1934,It was five o'clock on a winter's morning in S...,394,https://images-na.ssl-images-amazon.com/images...,https://images-na.ssl-images-amazon.com/images...,"'Fiction and Literature', 'Mystery"
2,The Murder of Roger Ackroyd,Agatha Christie,2,https://www.librarything.com//work/3011,4.06,"['Fiction and Literature', 'Mystery']",Agatha Christie's most daring crime mystery - ...,1926,Mrs Ferrars died on the night of the 16th-17th...,291,https://images-na.ssl-images-amazon.com/images...,https://images-na.ssl-images-amazon.com/images...,"'Fiction and Literature', 'Mystery"
3,The Mysterious Affair at Styles,Agatha Christie,3,https://www.librarything.com//work/2921950,3.75,"['Fiction and Literature', 'Mystery']","Set in the summer of 1917, the story follows t...",1920,The intense interest aroused in the public by ...,261,https://images-na.ssl-images-amazon.com/images...,https://images-na.ssl-images-amazon.com/images...,"'Fiction and Literature', 'Mystery"
4,Death on the Nile,Agatha Christie,4,https://www.librarything.com//work/29995,3.93,"['Fiction and Literature', 'Mystery']","Linnet Doyle is young, beautiful, and rich. Sh...",1937,'Linnet Ridgeway!',168,https://images-na.ssl-images-amazon.com/images...,https://images-na.ssl-images-amazon.com/images...,"'Fiction and Literature', 'Mystery"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
24966,Cadillac Jukebox,James Lee Burke,18095,https://www.librarything.com//work/82967,3.80,"['Fiction and Literature', 'Mystery']","A Louisiana farmer is jailed for the murder, 3...",1996,Aaron Crown should not have come back into our...,13,https://images-na.ssl-images-amazon.com/images...,https://images-na.ssl-images-amazon.com/images...,"'Fiction and Literature', 'Mystery"
24967,Sunset Limited,James Lee Burke,18096,https://www.librarything.com//work/16445,3.75,"['Fiction and Literature', 'Mystery']",Detective Dave Robicheaux returns to center st...,1998,I had seen a dawn like this one only twice in ...,10,https://images-na.ssl-images-amazon.com/images...,https://images-na.ssl-images-amazon.com/images...,"'Fiction and Literature', 'Mystery"
24968,Crusader's Cross,James Lee Burke,18097,https://www.librarything.com//work/32025,3.97,"['Fiction and Literature', 'Mystery']",A conversation between Robicheaux and a dying ...,2005,"It was the end of an era, one that I suspect h...",28,https://images-na.ssl-images-amazon.com/images...,https://images-na.ssl-images-amazon.com/images...,"'Fiction and Literature', 'Mystery"
24969,Burning Angel,James Lee Burke,18098,https://www.librarything.com//work/70226,3.83,"['Fiction and Literature', 'Mystery']","Dave Robicheaux, New Orleans detective, is puz...",1995,The Giacano family had locked up the action in...,16,https://images-na.ssl-images-amazon.com/images...,https://images-na.ssl-images-amazon.com/images...,"'Fiction and Literature', 'Mystery"


In [None]:
print(f"Our cleaned dataset contains {len(df_books)} books")

Our cleaned dataset contains 24971 books


In [None]:
df_books.to_csv('books.csv', index=False)

# Prepping data for machine learning

We need to clean the data further so that all columns can be used as features for ML. This means re-formatting the genre column into an easily accessible list and converting publication years to strings (a categorical variable). Because we filtered for books with more than 10 reviews, not all scraped URLs made it into the dataframe, so we also reset the book index to match each book's row position in the dataframe.
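After a round trip through CSV, each genre list is stored as a string such as `"['Fiction and Literature', 'Mystery']"`. Splitting that string on commas works as long as no genre label itself contains a comma. An alternative, sketched here and not the approach used in the next cell, is to parse the string with `ast.literal_eval` from the standard library. `parse_genres` is a hypothetical helper name, and the comma-containing label is a made-up example.

```python
import ast

def parse_genres(raw):
    """Convert a stringified list such as "['A', 'B']" back into a Python list."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # unparsable or missing value
    if not isinstance(value, list):
        return []
    return [str(item).strip() for item in value]

genres = parse_genres("['Fiction and Literature', 'Mystery']")
print(genres)
# ['Fiction and Literature', 'Mystery']

# A label containing a comma stays in one piece (made-up label)
print(parse_genres("['Science Fiction, Fantasy']"))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings read from a file.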

In [None]:
df_books = pd.read_csv("/content/gdrive/MyDrive/Data Science Final Project/books.csv")

In [None]:
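df_clean = df_books.copy()
# turn the stringified genre list ("['A', 'B']") back into a Python list
df_clean['genre'] = df_clean['genre'].apply(lambda x: [genre.strip().strip("'") for genre in x[2:-2].split(',')])
# re-number books so that book_index matches the row position in the dataframe
df_clean['book_index'] = df_books.index
df_clean = df_clean.drop(['genre_str', 'srcset'], axis=1)
# fix two impossible publication years
df_clean.loc[df_clean['publication_year'] == 2918, 'publication_year'] = 2018
df_clean.loc[df_clean['publication_year'] == 2029, 'publication_year'] = 2020
# treat publication year as a categorical (string) feature
df_clean["publication_year"] = df_clean["publication_year"].astype(str)
# replace missing review counts with 0 and store the column as integers
df_clean.loc[df_clean["number_of_reviews"] == "None", "number_of_reviews"] = np.nan
df_clean["number_of_reviews"] = df_clean["number_of_reviews"].fillna(0).astype(int)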

df_clean

Unnamed: 0,title,author,book_index,book_url,avg_rating,genre,description,publication_year,first_sentence,number_of_reviews,image_url
0,And Then There Were None,Agatha Christie,0,https://www.librarything.com//work/7962202,4.14,"[Fiction and Literature, Mystery]","Ten houseguests, trapped on an isolated island...",1939,In the corner of a first-class smoking carriag...,500,https://images-na.ssl-images-amazon.com/images...
1,Murder on the Orient Express,Agatha Christie,1,https://www.librarything.com//work/2742,4.07,"[Fiction and Literature, Mystery]","Agatha Christie's most famous murder mystery, ...",1934,It was five o'clock on a winter's morning in S...,394,https://images-na.ssl-images-amazon.com/images...
2,The Murder of Roger Ackroyd,Agatha Christie,2,https://www.librarything.com//work/3011,4.06,"[Fiction and Literature, Mystery]",Agatha Christie's most daring crime mystery - ...,1926,Mrs Ferrars died on the night of the 16th-17th...,291,https://images-na.ssl-images-amazon.com/images...
3,The Mysterious Affair at Styles,Agatha Christie,3,https://www.librarything.com//work/2921950,3.75,"[Fiction and Literature, Mystery]","Set in the summer of 1917, the story follows t...",1920,The intense interest aroused in the public by ...,261,https://images-na.ssl-images-amazon.com/images...
4,Death on the Nile,Agatha Christie,4,https://www.librarything.com//work/29995,3.93,"[Fiction and Literature, Mystery]","Linnet Doyle is young, beautiful, and rich. Sh...",1937,'Linnet Ridgeway!',168,https://images-na.ssl-images-amazon.com/images...
...,...,...,...,...,...,...,...,...,...,...,...
24966,Cadillac Jukebox,James Lee Burke,24966,https://www.librarything.com//work/82967,3.80,"[Fiction and Literature, Mystery]","A Louisiana farmer is jailed for the murder, 3...",1996,Aaron Crown should not have come back into our...,13,https://images-na.ssl-images-amazon.com/images...
24967,Sunset Limited,James Lee Burke,24967,https://www.librarything.com//work/16445,3.75,"[Fiction and Literature, Mystery]",Detective Dave Robicheaux returns to center st...,1998,I had seen a dawn like this one only twice in ...,10,https://images-na.ssl-images-amazon.com/images...
24968,Crusader's Cross,James Lee Burke,24968,https://www.librarything.com//work/32025,3.97,"[Fiction and Literature, Mystery]",A conversation between Robicheaux and a dying ...,2005,"It was the end of an era, one that I suspect h...",28,https://images-na.ssl-images-amazon.com/images...
24969,Burning Angel,James Lee Burke,24969,https://www.librarything.com//work/70226,3.83,"[Fiction and Literature, Mystery]","Dave Robicheaux, New Orleans detective, is puz...",1995,The Giacano family had locked up the action in...,16,https://images-na.ssl-images-amazon.com/images...


Save ML data to .csv

In [None]:
df_clean.to_csv('/content/gdrive/MyDrive/Data Science Final Project/books_clean.csv', index=False)