# Scraping Open Library

This notebook is part of a larger project to dig into the choice of color for book covers.

This notebook in particular is the source of the csv file 'library.csv' in this repository. The information in that file was scraped from openlibrary.org on March 18th and 19th, 2023.

#### Imports

In [1]:
from bs4 import BeautifulSoup
import requests

import numpy as np # for checking unique index values
import pandas as pd

import json # for storing & loading manually retrieved URLs
import time # for avoiding overloading server with requests

import caffeine # keeps the computer awake during the long scraping run



#### Scraping Script

On the "subjects" page of Open Library there is, conveniently, a list of broad genres and subgenres that we can use to categorize the books we scrape, and later the books we use for our book cover analysis. Here is the note from Open Library about the genres / subgenres provided on the subjects page at the time of this scraping:

> What's a subject heading?
    
> As the wise Wikipedia says: "The Library of Congress Subject Headings (LCSH) comprise a thesaurus (in the information science sense, a controlled vocabulary) of subject headings, maintained by the United States Library of Congress, for use in bibliographic records. LC Subject Headings are an integral part of bibliographic control, which is the function by which libraries collect, organize, and disseminate documents.... Subject headings are normally applied to every item within a library's collection and facilitate a user's access to items in the catalog that pertain to similar subject matter."

Send a GET request to the "subjects" page of openlibrary.org, then turn the response text into soup (beautiful soup!).

In [2]:
URL = 'https://openlibrary.org/subjects'
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'html.parser')

At the time of scraping, the overarching subject headings were stored as level-3 headings (h3 tags). The slice [:-1] below drops the final h3 on the page.

In [3]:
genres = [header.text for header in soup.find_all('h3')][:-1]

In [26]:
genres

['Arts',
 'Animals',
 'Fiction',
 'Science & Mathematics',
 'Business & Finance',
 "Children's",
 'History',
 'Health & Wellness',
 'Biography',
 'Social Sciences',
 'Places',
 'Textbooks',
 'Books by Language']

The subgenres within these genres were bulleted items (li elements inside an unordered list, ul) beneath each h3.

In [5]:
subgenres = []

for genre in soup.find(id='subjectsPage').find_all('ul'):
    subgenres.append([element.text.strip() for element in genre.find_all('li')])

In [27]:
subgenres

[['Architecture',
  'Art Instruction',
  'Art History',
  'Dance',
  'Design',
  'Fashion',
  'Film',
  'Graphic Design',
  'Music',
  'Music Theory',
  'Painting',
  'Photography'],
 ['Bears', 'Cats', 'Kittens', 'Dogs', 'Puppies'],
 ['Fantasy',
  'Historical Fiction',
  'Horror',
  'Humor',
  'Literature',
  'Magic',
  'Mystery and detective stories',
  'Plays',
  'Poetry',
  'Romance',
  'Science Fiction',
  'Short Stories',
  'Thriller',
  'Young Adult'],
 ['Biology', 'Chemistry', 'Mathematics', 'Physics', 'Programming'],
 ['Management',
  'Entrepreneurship',
  'Business Economics',
  'Business Success',
  'Finance'],
 ['Kids Books',
  'Stories in Rhyme',
  'Baby Books',
  'Bedtime Books',
  'Picture Books'],
 ['Ancient Civilization',
  'Archaeology',
  'Anthropology',
  'World War II',
  'Social Life and Customs'],
 ['Cooking',
  'Cookbooks',
  'Mental Health',
  'Exercise',
  'Nutrition',
  'Self-help'],
 ['Autobiographies',
  'History',
  'Politics and Government',
  'World War I

I had some issues scraping the URLs for the searches for each subject (even with Selenium), so I ended up copying them manually and saving them in a text file which is loaded below.

In [8]:
with open('genre_urls.txt', 'r') as f:
    genre_urls = json.load(f)
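
The file itself is not shown here, but since the loops further down index it as `genre_urls[i][j]`, it presumably holds a list of lists that lines up with `genres` and `subgenres`. Below is a minimal sketch of that layout plus a quick sanity check. The URLs are made-up placeholders and `check_alignment` is a hypothetical helper, neither comes from the original file.

```python
import json

# Hypothetical miniature of genre_urls.txt: one inner list per genre,
# one URL per subgenre, in the same order as `subgenres`.
example_text = json.dumps([
    ["https://example.org/search?subject=Architecture",
     "https://example.org/search?subject=Dance"],
    ["https://example.org/search?subject=Cats"],
])

def check_alignment(subgenres, genre_urls):
    """Return (genre_index, n_subgenres, n_urls) for every genre whose counts differ."""
    mismatches = []
    for i, names in enumerate(subgenres):
        n_urls = len(genre_urls[i]) if i < len(genre_urls) else 0
        if len(names) != n_urls:
            mismatches.append((i, len(names), n_urls))
    return mismatches

example_subgenres = [["Architecture", "Dance"], ["Cats", "Kittens"]]
example_urls = json.loads(example_text)

# The second genre has two subgenres but only one URL
print(check_alignment(example_subgenres, example_urls))
```

Running a check like this before a long scrape would flag any genre whose URL list is shorter or longer than its subgenre list.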

Here the real work begins: scraping the book results for each genre / subgenre search.

Unfortunately, I didn't plan for the execution to run overnight. I had to keyboard-interrupt the process after the Fiction: Literature genre / subgenre and save the progress so far to file. I then installed a package called caffeine, which keeps my computer awake while the interpreter is running (though the screen can turn off), copied the entire loop below to a cell further down, and added some restrictions on i and j to force the process to pick back up where it left off. (The helper get_books_in_subgenre called in the loop is defined in the Helper Function section near the end of this notebook.)

In [85]:
# instantiate an empty library to hold all book data
library = pd.DataFrame({'title': [], 'first_published': [], 'authors': [], 'cover_img_url': [],
                        'languages_available': [], 'subgenre': [], 'genre': []})

for i, genre in enumerate(genres):
    print(f'Scraping data for the following genre: {genre}')
    for j, subgenre in enumerate(subgenres[i]):
        print(f'Scraping data for the following subgenre: {subgenre}')
        # set base URL for this genre / subgenre
        base_URL = genre_urls[i][j]
        
        # scrape data in dictionary format
        book_data = get_books_in_subgenre(genre, subgenre, base_URL)
        
        # update library dataframe
        library = pd.concat([library, pd.DataFrame(book_data)], axis=0)
    print(f'Number of books currently in library: {library.shape[0]}')
    print('\n')

Scraping data for the following genre: Arts
Scraping data for the following subgenre: Architecture
Successful Request
...
Scraping data for the following subgenre: Music Theory
Successful Request
...
Scraping data for the following subgenre: Puppies
Successful Request
...


KeyboardInterrupt: 

In [86]:
library.tail()

Unnamed: 0,title,first_published,authors,cover_img_url,languages_available,subgenre,genre
995,Abridgment of Murray's English Grammar,1800.0,[Lindley Murray],archive.org/services/img/anabridgementlm00murr...,1.0,Literature,Fiction
996,The Evil Shepherd,1922.0,[Edward Phillips Oppenheim],archive.org/services/img/evilshepherd00oppeuoft,2.0,Literature,Fiction
997,Paradiso,1595.0,"[Dante Alighieri, John Ciardi]",archive.org/services/img/ladivinacommedi00witt...,8.0,Literature,Fiction
998,The Well at the World's End,1896.0,[William Morris],archive.org/services/img/lasourceauboutdu0000morr,3.0,Literature,Fiction
999,Heretics,1905.0,[Gilbert Keith Chesterton],archive.org/services/img/hereticschesrich,3.0,Literature,Fiction


In [87]:
library.shape

(21260, 7)

In [90]:
len(np.unique(library.index))

1000

The progress so far was saved to a csv file called "library_to_literature.csv", which I then loaded into the variable library after installing caffeine.
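
In hindsight, saving and resuming could be made less manual by recording each finished genre / subgenre pair in a small checkpoint file and skipping those pairs on the next run. The sketch below is one way to do that, not what this notebook actually ran. The function names, the checkpoint file name and the demo data are all hypothetical.

```python
import json
import os
import tempfile

def load_done(path):
    """Return the set of (genre, subgenre) pairs already recorded at `path`."""
    if not os.path.exists(path):
        return set()
    with open(path, 'r') as f:
        return {tuple(pair) for pair in json.load(f)}

def mark_done(path, done, genre, subgenre):
    """Add one finished pair to `done` and rewrite the checkpoint file."""
    done.add((genre, subgenre))
    with open(path, 'w') as f:
        json.dump(sorted(done), f)

def remaining_pairs(genres, subgenres, done):
    """Yield (i, j, genre, subgenre) for every pair not yet in `done`."""
    for i, genre in enumerate(genres):
        for j, subgenre in enumerate(subgenres[i]):
            if (genre, subgenre) not in done:
                yield i, j, genre, subgenre

# Tiny demonstration with placeholder data
demo_genres = ['Arts', 'Fiction']
demo_subgenres = [['Dance', 'Film'], ['Horror', 'Humor']]
checkpoint = os.path.join(tempfile.mkdtemp(), 'done.json')

done = load_done(checkpoint)
mark_done(checkpoint, done, 'Arts', 'Dance')
mark_done(checkpoint, done, 'Arts', 'Film')

# A fresh session would rebuild its state from the file
resumed = load_done(checkpoint)
todo = [(g, s) for _, _, g, s in remaining_pairs(demo_genres, demo_subgenres, resumed)]
print(todo)
```

In the real loop, the idea would be to call `mark_done` (and write the dataframe to csv) right after each subgenre finishes, so an interruption costs at most one subgenre of work.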

In [9]:
# open library so far
library = pd.read_csv('library_to_literature.csv', index_col=0)

Now we use caffeine to keep the computer on while the notebook is running, and start from the genre Fiction at the subgenre after Literature, concatenating new results with the dataframe library (which already contains the previous scraping results).

In [14]:
# continue where we left off; keep power on until "interpreter" is closed
caffeine.on(display=False)

for i, genre in enumerate(genres):
    if i < 2:
        continue
    print(f'Scraping data for the following genre: {genre}')
    for j, subgenre in enumerate(subgenres[i]):
        if i == 2:
            if j < 5:
                continue
        print(f'Scraping data for the following subgenre: {subgenre}')
        # set base URL for this genre / subgenre
        base_URL = genre_urls[i][j]

        # scrape data in dictionary format
        book_data = get_books_in_subgenre(genre, subgenre, base_URL)

        # update library dataframe
        library = pd.concat([library, pd.DataFrame(book_data)], axis=0)
    print(f'Number of books currently in library: {library.shape[0]}')
    print('\n')

Scraping data for the following genre: Fiction
Scraping data for the following subgenre: Magic
Successful Request
...
Scraping data for the following subgenre: Short Stories
Successful Request
...
Scraping data for the following subgenre: Physics
Successful Request
...
Scraping data for the following subgenre: Baby Books
Successful Request
...
Scraping data for the following subgenre: World War II
Successful Request
...
Scraping data for the following subgenre: Self-help
Successful Request
...
Scraping data for the following subgenre: Women
Successful Request
...
Number of books currently in library: 67440


Scraping data for the following genre: Places
Scraping data for the following subgenre: Brazil
Successful Request
...
Scraping data for the following subgenre: Education
Successful Request
...
Scraping data for the following subgenre: Russian
Successful Request
Successful Request
Successful Request
Successful Request
Successful Request


MissingSchema: Invalid URL '': No scheme supplied. Perhaps you meant http://?

I forgot to leave out the last "subgenre", "See more...", which has no URL associated with it. That is the only reason for the error message; the actual data seems to have been scraped successfully.
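
A guard like the following would have avoided the error by skipping any entry without a usable URL. This is only a sketch: `valid_targets` is a hypothetical helper and the demo URL is a placeholder.

```python
def valid_targets(genres, subgenres, genre_urls):
    """Yield (genre, subgenre, url) triples, skipping entries with no usable URL."""
    for genre, names, urls in zip(genres, subgenres, genre_urls):
        for subgenre, url in zip(names, urls):
            if not url or not url.startswith(('http://', 'https://')):
                print(f'Skipping {genre} / {subgenre}: no URL')
                continue
            yield genre, subgenre, url

# Placeholder data mimicking a list that ends in a "See more..." entry
demo_genres = ['Books by Language']
demo_subgenres = [['Japanese', 'See more...']]
demo_urls = [['https://example.org/search?language=jpn', '']]

targets = list(valid_targets(demo_genres, demo_subgenres, demo_urls))
print(targets)
```

Note that `zip` stops at the shorter of its inputs, so a subgenre with no matching URL entry at all is silently dropped rather than reported.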

In [16]:
library.to_csv('library.csv')

In [18]:
pd.read_csv('library.csv', index_col=0)

Unnamed: 0,title,first_published,authors,cover_img_url,languages_available,subgenre,genre
0,The Stones of Venice,1851.0,['John Ruskin'],archive.org/services/img/stonesofvenice01ruskiala,2.0,Architecture,Arts
1,Memórias póstumas de Brás Cubas,1900.0,['Machado de Assis'],archive.org/services/img/architetturamedi01arat,7.0,Architecture,Arts
2,The Alhambra,1800.0,['Washington Irving'],archive.org/services/img/alhambra13irvi,5.0,Architecture,Arts
3,Coriolanus,1734.0,['William Shakespeare'],archive.org/services/img/templeshakespear03shak,14.0,Architecture,Arts
4,A Child's History of England,1800.0,['Charles Dickens'],archive.org/services/img/childshistoryofe00dic...,2.0,Architecture,Arts
...,...,...,...,...,...,...,...
995,Bēkon zuisōshū,1618.0,['Francis Bacon'],archive.org/services/img/baconsadvancemen00bac...,7.0,Japanese,Books by Language
996,Janguru,1905.0,['Upton Sinclair'],archive.org/services/img/jungle0000sinc,16.0,Japanese,Books by Language
997,Le comte de Monte-Cristo,1830.0,"['Alexandre Dumas', 'Hollybooks', 'Luis José S...",archive.org/services/img/countofmontecris00duma_7,10.0,Japanese,Books by Language
998,Max Ernst.,1956.0,"['Max Ernst', 'Werner Spies', 'Fabrice Hergott...",archive.org/services/img/maxernstlifework0000erns,10.0,Japanese,Books by Language


In [28]:
caffeine.off()

---

#### Helper Function

A note on time: for reference, running an earlier version of the function below for ONE genre / subgenre (and with NO wait time between requests!) had a "wall time" of:

Wall time: 2min 39s
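
To get a feel for why the full run took all night, here is a back-of-the-envelope estimate. It assumes the measured 2min 39s (159 seconds) is typical for one subgenre and adds the 3-second wait for each of the (at most) 50 pages. The count of 67 subgenres is only illustrative: 67,440 rows were reported after the Social Sciences genre, at up to 1,000 rows per subgenre.

```python
def estimate_seconds(n_subgenres, request_seconds=159, max_pages=50, wait_seconds=3):
    """Rough estimate: measured request time plus the fixed waits, per subgenre."""
    return n_subgenres * (request_seconds + max_pages * wait_seconds)

per_subgenre = estimate_seconds(1)
print(per_subgenre)                             # 309 seconds, a little over 5 minutes

hours_for_67 = round(estimate_seconds(67) / 3600, 2)
print(hours_for_67)                             # 5.75 hours
```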

In [13]:
def get_books_in_subgenre(genre, subgenre, URL):
    '''Takes in a genre (string), subgenre (string) and URL (string); returns the book data from that URL in
    dictionary format.'''

    # start with an empty dictionary to store information from scraping
    book_data = {'title': [], 'first_published': [], 'authors': [], 'cover_img_url': [], 'languages_available': [],
                 'subgenre': [], 'genre': []}

    # set initial value for page number
    page_no = 1

    # set flag to end while loop
    flag = 1

    # check whether there are still books to be scraped
    while flag == 1:
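        # set URL of web page to scrape (the query parameter needs the '=' sign)
        page_URL = URL + f'&page={page_no}'
        
        # wait 3 seconds, to avoid overloading the server with requests
        time.sleep(3)
        
        # request the paginated URL rather than the base URL, so each iteration fetches a new page
        response = requests.get(page_URL)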
        
        # print update to command line (to monitor progress)
        if response.status_code==200:
            print('Successful Request')
        else:
            print(f'Unsuccessful Request: Status Code {response.status_code}')
            
        # make soup
        soup = BeautifulSoup(response.text, 'html.parser')

        # if there are no books on this page, exit loop
        books_in_soup = soup.find_all('li', class_='searchResultItem')

        if books_in_soup:    
            for book in books_in_soup:
                # append title, year of first publication, list of authors, cover image URL and no. of languages
                # available; catch Exception (not a bare except) so a keyboard interrupt still stops the loop
                try:
                    book_data['title'].append(book.find('h3', class_='booktitle').find('a').text)
                except Exception:
                    print('Warning: title null')
                    book_data['title'].append(None)

                try:
                    book_data['first_published'].append(
                        int(book.find('span', class_='publishedYear').text.strip().split()[-1]))
                except Exception:
                    print('Warning: year of first publication null')
                    book_data['first_published'].append(None)

                try:
                    book_data['authors'].append(
                        [a_tag.text for a_tag in book.find('span', class_='bookauthor').find_all('a', class_='results')])
                except Exception:
                    print('Warning: authors null')
                    book_data['authors'].append(None)

                try:
                    book_data['cover_img_url'].append(book.find_all('img')[-1]['src'].strip('/'))
                except Exception:
                    print('Warning: cover image URL null')
                    book_data['cover_img_url'].append(None)

                try:
                    book_data['languages_available'].append(
                        int(book.find('span', class_='languages').find('a').text.split()[0]))
                except Exception:
                    print('Warning: number of languages available null')
                    book_data['languages_available'].append(None)

                # also append subgenre and genre
                book_data['subgenre'].append(subgenre)
                book_data['genre'].append(genre)

            # when finished scraping books in soup, increment page number
            page_no += 1

            # set a limit on number of pages to scrape
            if page_no <= 50:
                flag = 1
            else:
                flag = 0
                break

        else:
            flag = 0
            break
            
    return book_data
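
The function above prints a message when a request is unsuccessful but then parses the response anyway. A possible refinement, sketched here under my own assumptions rather than taken from the scraping run, is to retry a failed request a few times first. `fetch_with_retry` is a hypothetical helper; `get` can be any callable that returns an object with a `status_code` attribute, such as `requests.get`. The fake classes below exist only so the example runs without network access.

```python
import time

def fetch_with_retry(get, url, retries=3, wait_seconds=3, sleep=time.sleep):
    """Call get(url) until it returns status 200 or the retries run out.

    Returns the last response received (None if retries is 0).
    """
    response = None
    for attempt in range(1, retries + 1):
        response = get(url)
        if response.status_code == 200:
            return response
        print(f'Attempt {attempt} failed: status code {response.status_code}')
        if attempt < retries:
            sleep(wait_seconds * attempt)  # wait a little longer after each failure
    return response

# --- stand-ins for demonstration only ---
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

def make_flaky_get(codes):
    """Return a fake `get` that answers with the given status codes in order."""
    codes = iter(codes)
    return lambda url: FakeResponse(next(codes))

no_sleep = lambda seconds: None  # skip real waiting in the demo

result = fetch_with_retry(make_flaky_get([503, 200]), 'https://example.org', sleep=no_sleep)
print(result.status_code)
```

Inside the scraping loop this would be called as `fetch_with_retry(requests.get, page_URL)`, and the loop could stop early if the returned status code is still not 200.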

---

#### 'Documentation'

I don't foresee ever needing to scrape openlibrary.org again now that we have the file 'library.csv' saved to this repository and backed up on GitHub.

However, in the event that it is necessary or desirable to do so, I am leaving the cells below for my future self to easily troubleshoot any changes in the format of openlibrary search results.

<u>'Walkthrough' of Scraping Information from Search Results on OpenLibrary.org</u>

1. Building the URL using subject and page number.

In [19]:
base_URL = 'https://openlibrary.org/search?'

In [20]:
subject = 'Fantasy'

In [21]:
page_no = 1

In [24]:
complete_URL = base_URL + f'subject={subject}' + f'&page={page_no}'
print(complete_URL)

https://openlibrary.org/search?subject=Fantasy&page=1
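
String concatenation works for a one-word subject like Fantasy, but subjects containing spaces or symbols need escaping. A small sketch using the standard library's `urlencode` follows; `build_search_url` is a hypothetical helper, and I have not checked how Open Library treats the escaped multi-word form (the URLs used for the actual scrape were copied by hand).

```python
from urllib.parse import urlencode

def build_search_url(subject, page_no, base_url='https://openlibrary.org/search?'):
    """Build a search URL with properly escaped query parameters."""
    return base_url + urlencode({'subject': subject, 'page': page_no})

fantasy_url = build_search_url('Fantasy', 1)
scifi_url = build_search_url('Science Fiction', 2)

print(fantasy_url)  # https://openlibrary.org/search?subject=Fantasy&page=1
print(scifi_url)    # https://openlibrary.org/search?subject=Science+Fiction&page=2
```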


2. The usual GET request, checking the status code and turning the response text into soup.

In [25]:
response = requests.get(complete_URL)
print(response.status_code)

200


In [77]:
soup = BeautifulSoup(response.text, 'html.parser')

How many books show up per page in search results?

In [78]:
len(soup.find_all('li', class_='searchResultItem'))

20

3. Test out extracting information on an Example Book:

In [79]:
example_book = soup.find_all('li', class_='searchResultItem')[0]
print(example_book)

<li class="searchResultItem" itemscope="" itemtype="https://schema.org/Book">
<span class="bookcover">
<a href="/works/OL262385W?edition=ia%3Acihm_78964"><img alt="Cover of: Sky Island: being the further exciting adventures of Trot and Cap'n Bill after their visit to the sea fairies" itemprop="image" src="//covers.openlibrary.org/b/olid/OL19285157M-M.jpg" title="Cover of: Sky Island: being the further exciting adventures of Trot and Cap'n Bill after their visit to the sea fairies"/></a>
</span>
<div class="details">
<div class="resultTitle">
<h3 class="booktitle" itemprop="name">
<a class="results" href="/works/OL262385W?edition=ia%3Acihm_78964" itemprop="url">Sky Island: being the further exciting adventures of Trot and Cap'n Bill after their visit to the sea fairies</a>
</h3>
</div>
<span class="bookauthor" itemprop="author" itemscope="" itemtype="https://schema.org/Organization">
        
by <a class="results" href="/authors/OL9348793A/L._Frank_Baum">L. Frank Baum</a>, <a class="res

**Find the title of the book** // string

In [80]:
example_book.find('h3', class_='booktitle').find('a').text

"Sky Island: being the further exciting adventures of Trot and Cap'n Bill after their visit to the sea fairies"

**Find the year of (first) publication** // integer

In [54]:
int(example_book.find('span', class_='publishedYear').text.strip().split()[-1])

1912

**Find the link to cover image of the book** // string

In [40]:
example_book.find_all('img')[-1]['src'].strip('/')

'archive.org/services/img/skyisland00baum'
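
Because the leading slashes are stripped, the stored value has no scheme and cannot be requested as-is (an empty URL produced the MissingSchema error earlier in this notebook). When it is time to download the covers, something like the hypothetical helper below could restore a full URL. It assumes the image hosts serve https.

```python
def to_absolute_url(src, scheme='https'):
    """Turn a scheme-relative or scheme-less image path into a full URL."""
    if src.startswith(('http://', 'https://')):
        return src
    return f'{scheme}://' + src.lstrip('/')

stored = to_absolute_url('archive.org/services/img/skyisland00baum')
relative = to_absolute_url('//covers.openlibrary.org/b/olid/OL19285157M-M.jpg')

print(stored)    # https://archive.org/services/img/skyisland00baum
print(relative)  # https://covers.openlibrary.org/b/olid/OL19285157M-M.jpg
```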

**Find the author name(s)** // list of strings

In [46]:
[a_tag.text for a_tag in example_book.find('span', class_='bookauthor').find_all('a', class_='results')]

['L. Frank Baum', 'Mint Editions', 'John R. (John Rea) Neill']

**Find the number of languages in which this book is available** // int

In [52]:
int(example_book.find('span', class_='languages').find('a').text.split()[0])

1

4. Test out the above on all books on the page (then expand to iterate over all pages!)

Store data in dictionaries, to transform into a Pandas DataFrame of fantasy books.

In [62]:
fantasy = {'title': [], 'first_published': [], 'authors': [], 'cover_img_url': [], 'languages_available': [],
           'genre': []}

for book in soup.find_all('li', class_='searchResultItem'):
    # append title, year of first publication, list of authors, cover image URL and number of languages available
    try:
        fantasy['title'].append(book.find('h3', class_='booktitle').find('a').text)
    except:
        print('Warning: title null')
        fantasy['title'].append(None)
        
    try:
        fantasy['first_published'].append(
            int(book.find('span', class_='publishedYear').text.strip().split()[-1]))
    except:
        print('Warning: year of first publication null')
        fantasy['first_published'].append(None)
        
    try:
        fantasy['authors'].append(
            [a_tag.text for a_tag in book.find('span', class_='bookauthor').find_all('a', class_='results')])
    except:
        print('Warning: authors null')
        fantasy['authors'].append(None)
        
    try:
        fantasy['cover_img_url'].append(book.find_all('img')[-1]['src'].strip('/'))
    except:
        print('Warning: cover image URL null')
        fantasy['cover_img_url'].append(None)
        
    try:
        fantasy['languages_available'].append(
            int(book.find('span', class_='languages').find('a').text.split()[0]))
    except:
        print('Warning: number of languages available null')
        fantasy['languages_available'].append(None)
        
    fantasy['genre'].append('fantasy')
        
library = pd.DataFrame(fantasy)

library.head()

Unnamed: 0,title,first_published,authors,cover_img_url,languages_available,genre
0,Sky Island: being the further exciting adventu...,1912,"[L. Frank Baum, Mint Editions, John R. (John R...",archive.org/services/img/skyisland00baum,1,fantasy
1,The Well at the World's End,1896,[William Morris],archive.org/services/img/lasourceauboutdu0000morr,3,fantasy
2,Phantastes: a faerie romance,1850,[George MacDonald],archive.org/services/img/phantastes00geor,1,fantasy
3,Wet Magic (Books of Wonder),1937,[Edith Nesbit],archive.org/services/img/wetmagic0000nesb,1,fantasy
4,The Magic City,1910,[Edith Nesbit],archive.org/services/img/magiccity0000nesb,1,fantasy
