The goal of this code is to gather books tagged with "fantasy" on Goodreads and explore how users have tagged those books (or "shelved" them, on the Goodreads site).

The Goodreads API allows users with a developer key to retrieve information about a book given a Goodreads book ID. You can request a developer key and read the documentation for the Goodreads API here: https://www.goodreads.com/api

The Python wrapper for the Goodreads API, along with its documentation, is available at https://github.com/sefakilic/goodreads

In [1]:
# Import libraries

from goodreads import client
from lxml import html
import requests
from collections import Counter

gc = client.GoodreadsClient("Ha1oKI3R0fqeApxCJIcQ", "wbHQF5APtAgwKrY0kQ9gFSxyVEqt0kEqWwi3HUf0t7A")

# Thank you internet for helping with scraping
# http://docs.python-guide.org/en/latest/scenarios/scrape/

The loop below uses an XPath query to pull the ID numbers from the first 100 pages of results that come up when I search "fantasy" in the genre tag on Goodreads. The full search returns approximately 50k results, as below:

<img src="goodreads-search-big.png">

In [2]:
# set up the search URL template (with a slot for the page number) and a list container for book ids
search_url = 'http://www.goodreads.com/search?page={}&q=fantasy&search%5Bfield%5D=genre&search_type=books&tab=books&utf8=%E2%9C%93'
book_IDs = []
for page_num in range(1, 101): # loop over the first 100 pages of search results
    # get page using requests:
    fantasy_books = requests.get(search_url.format(page_num))
    # read into a tree object:
    tree = html.fromstring(fantasy_books.content)
    # use XPath query to get each book ID number and add it to the list.
    book_IDs.extend(tree.xpath("//form[@class='hiddenShelfForm']/input[@name='book_id']/@value"))
    
# print(book_IDs)

I revised the loop above from my earlier code (in the notebook "goodreads fantasy books") to produce a single list using .extend instead of creating a list of lists, since each page will return a list of ID numbers. 
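
As a small illustration of the difference, here is a sketch with made-up ID strings (not real Goodreads IDs) standing in for two pages of results. The variable names are hypothetical and only used for this example.

```python
# Made-up ID strings standing in for two pages of search results
page_one = ["101", "102"]
page_two = ["103", "104"]

nested = []
nested.append(page_one)   # append adds each page's list as a single item
nested.append(page_two)

flat = []
flat.extend(page_one)     # extend adds the individual items from each page
flat.extend(page_two)

print(nested)  # a list of lists, one inner list per page
print(flat)    # a single list of ID strings
```

With the flat list, a later loop can go straight over the ID numbers without a second, inner loop.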

Next, I want to loop through the list and collect the title and shelf data for each ID number using the Goodreads API, and put that information into a dictionary. Then, I can use that data to identify titles tagged ("shelved") as "diverse." (This takes about half an hour to run.)

In [3]:
books_and_shelves = {} # key = book title, value = list of popular shelves

for booknum in book_IDs:
    # get the book using the ID number
    book = gc.book(booknum)
    # Use the .title and .popular_shelves attributes to get the data and put it in the dictionary
    books_and_shelves[book.title] = book.popular_shelves
    
# print(books_and_shelves)

Now I have my book titles and their popular shelves in a dictionary. I discovered in my initial coding that the list of shelves returned by the API isn't a list of strings, so before I move on I want to make all the values in the dictionary (the shelf lists) into lists of strings so I can work with them more easily.

In [4]:
# change the shelflists into list of strings

for title, shelves in books_and_shelves.items():
    # create a list from the goodreads shelf item for each dictionary entry:
    list_of_shelves = list(shelves)
    # Make an empty list container for the shelf names as strings:
    shelves_str = []
    # iterate over the list of shelf names:
    for shelf in list_of_shelves:
        # change the shelf names into strings and add them to the list
        shelves_str.append(str(shelf))
    # replace the value with the list of strings
    books_and_shelves[title] = shelves_str
    
# print(books_and_shelves)
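
Since the API loop takes about half an hour, it may be worth saving the dictionary once the shelves are plain strings, so it does not have to be rebuilt every session. The sketch below is one possible way to do that with the standard `json` module. The sample dictionary, the file name, and the use of a temporary directory are all assumptions made for the example, not part of my original workflow.

```python
import json
import os
import tempfile

# Hypothetical stand-in for books_and_shelves after the shelves become strings
sample_books = {
    "Example Book A": ["fantasy", "diverse-reads"],
    "Example Book B": ["fantasy", "young-adult"],
}

# The file name is an assumption; a temporary directory keeps the sketch tidy
cache_path = os.path.join(tempfile.mkdtemp(), "books_and_shelves.json")

# Write the dictionary to disk as JSON
with open(cache_path, "w") as f:
    json.dump(sample_books, f)

# Read it back in a later session instead of calling the API again
with open(cache_path) as f:
    reloaded = json.load(f)

print(reloaded == sample_books)
```

This only works after the conversion to strings, because `json.dump` cannot serialize arbitrary Python objects such as the shelf items returned by the wrapper.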

Now that all the items in the shelf lists are strings, I can put them all into one list and create a list of unique shelves using the Counter.

In [13]:
# get just the values from the dictionary and create a counter object

allshelves = list(books_and_shelves.values())

# loop over the list of lists to get a single list of shelves

oneshelflist = []
for shelf_list in allshelves:
    oneshelflist.extend(shelf_list)
    
# Use the Counter to get a list of unique shelves
unique_shelves = list(Counter(oneshelflist).keys())

# print(unique_shelves)
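
To show what the cell above does on a small scale, here is a sketch using a made-up dictionary (the titles and shelves are invented for illustration). It also shows that the `Counter` keeps a count for each shelf, which could be useful later even though I only use the keys here.

```python
from collections import Counter

# Invented sample data in the same shape as books_and_shelves
sample_books = {
    "Example Book A": ["fantasy", "diverse-reads"],
    "Example Book B": ["fantasy", "young-adult"],
}

# Flatten the lists of shelves into one list
all_sample_shelves = []
for shelf_list in sample_books.values():
    all_sample_shelves.extend(shelf_list)

# Counter maps each shelf name to the number of times it appears
shelf_counts = Counter(all_sample_shelves)
unique_sample_shelves = list(shelf_counts.keys())

print(unique_sample_shelves)
print(shelf_counts["fantasy"])
```

If only the unique names are needed, `set(all_sample_shelves)` would give the same names without the counts.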


Now that I have a list of unique shelves, I want to find the ones that are related to diversity. To simplify matters, I'm going to look for shelves with "diverse" and "diversity" in them using the string "divers". This gets me a list of six shelves to work with.

In [14]:
diverse_shelves = []
for shelf in unique_shelves:
    if "divers" in shelf:
        diverse_shelves.append(shelf)
        
# print(diverse_shelves)
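
One thing to keep in mind is that `in` does a plain substring match, so "divers" will also match any shelf name that happens to contain those letters for another reason. The sketch below uses invented shelf names to show this; it is worth printing the real matches and checking them by eye.

```python
# Invented shelf names; "scuba-divers" is here only to show an unintended match
sample_shelves = ["diverse", "diversity-in-sff", "young-adult", "scuba-divers"]

# Same test as the loop above, written as a list comprehension
matches = [shelf for shelf in sample_shelves if "divers" in shelf]

print(matches)
```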

Now I'll use the list of shelves I got above to loop through my dictionary and find book titles that are tagged with one or more of these shelves.

In [15]:
diverse_titles = []
for title, shelves in books_and_shelves.items():
    for shelf in shelves:
        for tag in diverse_shelves:
            if tag in shelf:
                diverse_titles.append(title)
                
# print(diverse_titles)
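
The nested loop above can add the same title more than once when a book sits on several matching shelves, which is why the next cell reduces the list to unique titles. As a sketch of an alternative, with invented data and hypothetical variable names, `any()` adds each title at most once. It checks for exact shelf names rather than substrings, which assumes the list of diverse shelves was built from the same shelf names, as it was here.

```python
# Invented sample data in the same shape as books_and_shelves
sample_books = {
    "Example Book A": ["fantasy", "diverse", "diverse-reads"],
    "Example Book B": ["fantasy", "young-adult"],
}
sample_diverse_shelves = ["diverse", "diverse-reads"]

titles_once = []
for title, shelves in sample_books.items():
    # any() is True as soon as one shelf matches, so a title is added at most once
    if any(shelf in sample_diverse_shelves for shelf in shelves):
        titles_once.append(title)

print(titles_once)
```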

In [16]:
# First, get the list of unique titles
unique_titles = list(Counter(diverse_titles).keys())

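# Loop through and put the shelf lists of the titles above into their own list
# (a new name is used here so the earlier list of "divers" shelves isn't overwritten)
shelves_of_diverse_titles = []
for title, shelves in books_and_shelves.items():
    if title in unique_titles:
        shelves_of_diverse_titles.extend(shelves)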


I've stopped here due to time, but from here I could visualize this data to see the most frequent intersections of shelves with "diverse" and "diversity" labels, look at specific intersections (what's the intersection of books labeled "young adult" and "diverse"?) or perhaps go back and add other terms to my list of "diverse" shelves.
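
As a rough sketch of what that next step could look like, the code below counts which other shelves appear alongside a "divers" shelf and picks out titles shelved as both "young-adult" and diverse. The data is invented and the variable names are hypothetical; the real shelf names on Goodreads may be spelled differently.

```python
from collections import Counter

# Invented sample data in the same shape as books_and_shelves
sample_books = {
    "Example Book A": ["fantasy", "diverse", "young-adult"],
    "Example Book B": ["fantasy", "diversity", "young-adult"],
    "Example Book C": ["fantasy", "magic"],
}

co_shelf_counts = Counter()
ya_and_diverse = []
for title, shelves in sample_books.items():
    # only look at books with at least one "divers" shelf
    if any("divers" in shelf for shelf in shelves):
        # count every other shelf that appears alongside a "divers" shelf
        co_shelf_counts.update(s for s in shelves if "divers" not in s)
        if "young-adult" in shelves:
            ya_and_diverse.append(title)

print(co_shelf_counts.most_common(2))
print(ya_and_diverse)
```

The counts from `most_common` could then be fed into a bar chart to see the most frequent intersections.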