## Finding the most recommended books on Reddit

Amazon links to books in Reddit comments are often used as a proxy for which books are mentioned most on Reddit, because the links are easy to parse (see regex) and look up (see Amazon PA-API). However, this proxy is not necessarily accurate: countless comments mention a book by title and author alone (e.g. *The Intelligent Investor by Benjamin Graham*).
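
As a rough illustration of why Amazon links are attractive to parse, the sketch below pulls product IDs out of comment text with a single regular expression. The URL shapes it accepts (`/dp/<id>` and `/gp/product/<id>`) and the 10-character ID format are assumptions about common Amazon links, not an exhaustive specification, and `extract_asins` is a hypothetical helper name.

```python
import re

# Assumption: product links look like .../dp/<ID> or .../gp/product/<ID>,
# where <ID> is a 10-character uppercase alphanumeric identifier.
AMAZON_ASIN_RE = re.compile(
    r'https?://(?:www\.)?amazon\.[a-z.]+/'   # host, e.g. amazon.com or amazon.co.uk
    r'(?:[^\s/]+/)*?'                        # optional path segments (product slug)
    r'(?:dp|gp/product)/([A-Z0-9]{10})'      # the product identifier itself
)


def extract_asins(text):
    """Return the product IDs of all Amazon links found in text (hypothetical helper)."""
    return AMAZON_ASIN_RE.findall(text)
```

A comment that only names the title and author contains nothing for this pattern to match, which is exactly the gap described above.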

Many of these book mentions come from submissions asking questions such as these:
* [Reddit, what are some "MUST read" books?](https://www.reddit.com/r/AskReddit/comments/34m5n6/reddit_what_are_some_must_read_books/)
* [What are /r/investing's favorite books? - Future side bar link.](https://www.reddit.com/r/investing/comments/166ha8/what_are_rinvestings_favorite_books_future_side/)
* [What is a good cook book for a beginner?](https://www.reddit.com/r/Cooking/comments/6m5enh/what_is_a_good_cook_book_for_a_beginner/)

A brief look at these posts shows almost no Amazon links, so current link-based scrapers will not pick up these book recommendations. Moreover, these posts are highly targeted and draw attention from the entire community, often yielding hundreds of book recommendations with in-depth discussion of each one. Missing them would be very detrimental to a recommendation service that strives to be accurate.

**Our goal in this notebook is to find a reliable method for determining which books are mentioned in a comment.** Here are some observations that may lead to such an algorithm:
- Books are almost always mentioned in the top-level comments (in the kind of submissions mentioned above)
- Most people capitalize the book title
- Most people mention the author
    - e.g. The Intelligent Investor **by Benjamin Graham**
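
The last two observations can be combined into a simple heuristic. The sketch below is one possible way to do that, not the approach this notebook ends up using: it looks for a run of capitalized words, the word "by", and then a capitalized name. The connector-word list and the helper name `find_title_author_pairs` are assumptions made for illustration, and the input is assumed to be plain text with Markdown markup already removed.

```python
import re

# Assumption: a small hand-picked set of lowercase words allowed inside a title.
CONNECTORS = r"(?:of|the|and|in|on|a|an|to|for)\b"

TITLE_BY_AUTHOR_RE = re.compile(
    # title: a capitalized word followed by capitalized words or connectors
    r"\b([A-Z][\w']*(?:\s+(?:[A-Z][\w']*|" + CONNECTORS + r"))*)"
    r"\s+by\s+"
    # author: one or more capitalized name parts (initials such as "J." allowed)
    r"([A-Z][\w.']*(?:\s+[A-Z][\w.']*)*)"
)


def find_title_author_pairs(text):
    """Return (title, author) tuples for every 'Title by Author' match in text."""
    return [(title, author.rstrip('.'))
            for title, author in TITLE_BY_AUTHOR_RE.findall(text)]


print(find_title_author_pairs('I loved The Intelligent Investor by Benjamin Graham'))
```

A heuristic like this misses mentions that omit the author, and a capitalized sentence opener directly before the title (for example "Also") would be absorbed into it.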




In [62]:
# base imports
import praw
import requests
import time
from urllib.error import HTTPError, URLError
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from pathlib import Path

### Getting some sample comments to work with
To build and test our parser, we will first gather sample comments from two subreddits dedicated solely to suggesting books:
* [r/SuggestMeABook](https://reddit.com/r/suggestmeabook)
* [r/booksuggestions](https://reddit.com/r/booksuggestions)

We will further refine our sample by only gathering comments from submissions that ask a question (i.e. the title ends with a '?').


In [2]:
def get_reddit_client():
    api_creds = {}
    with open('../puller-api-creds.env') as f:
        for line in f:
            line = line.strip()
            # skip blank lines and comments
            if not line or line.startswith('#'):
                continue
            # split on the first '=' only, since values may contain '='
            k, v = line.split('=', 1)
            api_creds[k] = v
    return praw.Reddit(user_agent='book submission parser',
                       client_id=api_creds['CLIENT_ID'],
                       client_secret=api_creds['CLIENT_SECRET'],
                       username=api_creds['USERNAME'],
                       password=api_creds['PASSWORD'])


def sub_exists(reddit, subreddit):
    from prawcore import NotFound
    exists = True
    try:
        reddit.subreddits.search_by_name(subreddit, exact=True)
    except NotFound:
        exists = False
    return exists


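def get_subreddit_sample_comments(reddit, subreddit_name):
    """Return the top-level comments of the first question-style top submission of the week."""
    if not sub_exists(reddit, subreddit_name):
        raise ValueError("please enter a valid subreddit name")
    comments = []
    subreddit = reddit.subreddit(subreddit_name)
    for submission in subreddit.top(time_filter='week'):
        # only keep submissions that ask a question
        if not submission.title.endswith('?'):
            continue
        # drop "load more comments" placeholders, which have no body
        submission.comments.replace_more(limit=0)
        comments.extend(submission.comments)
        break  # one submission is enough for a sample
    return comments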

In [3]:
# instantiate the client
reddit = get_reddit_client()
# gather some arbitrary sample comments
comments = get_subreddit_sample_comments(reddit, 'suggestmeabook')

In [4]:
print(f'Gathered {len(comments)} sample comments')
for comment in comments[:3]:
    print('-' * 80)
    print(comment.body)

Gathered 68 sample comments
--------------------------------------------------------------------------------
*The Painted Bird* by Kozinski should make you lose faith in humanity.
--------------------------------------------------------------------------------
The Collector by John Fowles

It was unsettling how 'normal' the characters thoughts and actions became as you kept reading 
--------------------------------------------------------------------------------
The Conspiracy Against The Human Race by Thomas Ligotti. I swear to god this book is dangerous, I read it for the first time when I was already in a really nihilistic, existentially distraught place and it pretty much confirmed and enforced the way I was feeling. I am still and always will be a big Ligotti fan but I do wonder if the last 5 years would have been different if I didn’t go into such a downward spiral at such a crucial time in my life.


### Building a preliminary parsing pipeline
**Important Note**: Reddit comments are written in Markdown, so the first step is to recover the plain text of the rendered comment.

Parsing book titles out of a comment is broken into the following steps:

1. Get the text-representation of the rendered markdown 
2. Tokenize the comment text into sentences
3. Tokenize each sentence into words
4. Find consecutive sequences of capitalized non-stopword words
    * e.g. I thought that *The Intelligent Investor: The Definitive Book on Value Investing* was a great book

In [56]:
from bs4 import BeautifulSoup
from markdown import markdown

DEBUG = False
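# Really arbitrary minimum title length, counted in words
MIN_TITLE_LEN = 4
# Global stopwords set (for quick lookup); words() without a language
# argument loads the stopword lists of every language NLTK provides
STOPWORDS = set(stopwords.words())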

def markdown_to_text(md):
    html = markdown(md)
    return ''.join(BeautifulSoup(html, 'lxml').find_all(string=True))


def _trim_trailing_stopwords(words):
    while words and words[-1] in STOPWORDS:
        words.pop()
    return words


def _match_titles(sentence):
    """
    Return title(s) found in a sentence, where a title is defined as:
        a consecutive sequence of words that starts with a capitalized word
        and continues through capitalized words and (lowercase) stopwords
    """
    titles, seq = [], []
    # filter out the special chars, e.g. ''', '"', ',', etc.
    words = filter(lambda w: w.isalnum(), word_tokenize(sentence))
    for word in words:
        # title 'ends' on a non-stopword non-capitalized word
        if seq and word not in STOPWORDS and not word[0].isupper():
            titles.append(seq[:])
            seq = []
        elif seq and word in STOPWORDS:
            seq.append(word)
        elif word[0].isupper():
            seq.append(word)
            
    titles.append(seq)
    trimmed = map(_trim_trailing_stopwords, titles)
    filtered = filter(lambda l: len(l) >= MIN_TITLE_LEN, trimmed)
    return [' '.join(title) for title in filtered if title]
        

def extract_titles_from_comment(comment):
    """ 
    Extracts all book titles found in a comment body
    See the pipeline steps mentioned in the cell above.
    """
    titles = set()
    # avoid dealing with all the special markup characters
    text = markdown_to_text(comment.body)
    # for each sentence, extract the title(s)
    for sentence in sent_tokenize(text):
        
        titles_found = _match_titles(sentence)
        if DEBUG:
            print(f'sentence: {sentence}')
            print('titles found:')
            print("\n".join(titles_found))
            print('-' * 80)
        titles.update(titles_found)
    return titles


def bulk_extract(comments):
    """Extract the unique book titles found across all of the given comments."""
    all_titles = set()
    for comment in comments:
        titles = extract_titles_from_comment(comment)
        all_titles.update(titles)
    return list(all_titles)
    

In [57]:
sample_extracted_titles = bulk_extract(comments)
sample_extracted_titles[:10]

['The Laws of Nature by Ashley Franz Holzmann',
 'The Surgeon Tess Gerritsen',
 'Blood Meridian by Cormac McCarthy',
 'Flowers In The Attic VC Andrews',
 'The Tsar of Love and Techno by Anthony Marra',
 'Eleven Twenty Three by Jason Hornsby Preta Realm by J Thorn A',
 'Also Lull by Kelly Link',
 'Twenty Days of Turin The Water Knife',
 'The Stuff of Nightmares by Malorie Blackman',
 'Slade House by David Mitchell']

### Leveraging the Google Books API to get book metadata from just the title
Google's API is the simplest option; however, it is capped at 1000 requests per day, so it must be used sparingly.
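
One way to use the quota sparingly is to cache lookups so that a title seen in many comments only costs one request. The sketch below is a minimal illustration under stated assumptions: `make_cached_search` is a hypothetical helper, the budget is a simple per-process counter rather than a true daily reset, and `search_fn` stands for any one-argument lookup function.

```python
def make_cached_search(search_fn, budget=1000):
    """Wrap search_fn with an in-memory cache and a request budget (hypothetical helper)."""
    cache = {}
    remaining = budget

    def search(title):
        nonlocal remaining
        # normalize case and whitespace so near-identical titles share one entry
        key = ' '.join(title.lower().split())
        if key not in cache:
            if remaining <= 0:
                raise RuntimeError('request budget exhausted')
            remaining -= 1
            cache[key] = search_fn(title)
        return cache[key]

    return search


# Offline demonstration with a stand-in for the real API call.
calls = []

def fake_search(title):
    calls.append(title)
    return {'query': title}

search = make_cached_search(fake_search, budget=2)
search('Blood Meridian')
search('  blood   MERIDIAN ')  # served from the cache, no second request
print(len(calls))
```

Failed lookups are cached as well, which avoids spending quota on the same unmatched title twice.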

In [77]:
import os

GOOGLE_BOOKS_API_URL = 'https://www.googleapis.com/books/v1/volumes'
# Read the API key from the environment rather than hard-coding it.
# The variable name GOOGLE_BOOKS_API_KEY is our own choice.
GOOGLE_BOOKS_API_KEY = os.environ.get('GOOGLE_BOOKS_API_KEY')

def google_book_search(title):
    """ https://developers.google.com/books/docs/v1/using#PerformingSearch """
    params = {
        'q': title,
        'key': GOOGLE_BOOKS_API_KEY,
        'maxResults': 1,
    }
    try:
        resp = requests.get(GOOGLE_BOOKS_API_URL, params=params, timeout=10).json()
    except (requests.exceptions.RequestException, ValueError) as e:
        # network failure, or a response body that is not valid JSON
        print(e)
        return None
    if 'items' not in resp:
        print(f'no google search results found for title: {title}')
        return None  # nothing really to work with when 0 items are returned
    return resp

def google_metadata_from_title(title):
    """
    Get the most relevant search result from the Google Books API and return a dict of metadata:
    {'isbn': <isbn10>, 'title': <title>, 'authors': [<author>, ...]}
    Returns None when there is no result or the result has no ISBN-10.
    """
    resp = google_book_search(title)
    if not resp:
        return None
    metadata = resp['items'][0]['volumeInfo']
    # isbn10, isbn13, etc.; not every volume carries identifiers
    ids = metadata.get('industryIdentifiers', [])
    isbn = next((d['identifier'] for d in ids if d['type'] == 'ISBN_10'), None)
    if isbn is None:
        print(f'incompatible google search result format for title: {title}')
        return None
    return {
        'isbn': isbn,
        'title': metadata.get('title', ''),
        'authors': metadata.get('authors', []),
    }

#### Testing the Google API results

In [81]:
google_results = []
for title in sample_extracted_titles[:10]:
    print(f'Extracted title: {title}')
    result = google_metadata_from_title(title)
    google_results.append(result if result else {})
    if result:
        print(f'Matched title: {result["title"]} by {", ".join(result["authors"])}')
    print('-' * 80)
    time.sleep(.5)  # throttle every request, including failed lookups

Extracted title: The Laws of Nature by Ashley Franz Holzmann
Matched title: The Laws of Nature by Ashley Franz Holzmann
--------------------------------------------------------------------------------
Extracted title: The Surgeon Tess Gerritsen
Matched title: The Surgeon by Tess Gerritsen
--------------------------------------------------------------------------------
Extracted title: Blood Meridian by Cormac McCarthy
Matched title: Blood Meridian by Cormac McCarthy
--------------------------------------------------------------------------------
Extracted title: Flowers In The Attic VC Andrews
Matched title: Flowers In The Attic by V.C. Andrews
--------------------------------------------------------------------------------
Extracted title: The Tsar of Love and Techno by Anthony Marra
Matched title: The Tsar of Love and Techno by Anthony Marra
--------------------------------------------------------------------------------
Extracted title: Eleven Twenty Three by Jason Hornsby Preta Rea

### Leveraging the Goodreads API to get book metadata from just the title
The Goodreads API is slightly more involved: a title search only returns Goodreads' internal book id, which must then be translated to an ISBN.

In [48]:
import os

# Read the API key from the environment rather than hard-coding it.
# The variable name GOODREADS_API_KEY is our own choice.
GOODREADS_API_KEY = os.environ.get('GOODREADS_API_KEY')

GOODREADS_SEARCH_API_URL = 'https://www.goodreads.com/search/index.xml'
def goodreads_title_search(title):
    """ https://www.goodreads.com/api/index#search.books """
    params = {
        'q': title,
        'key': GOODREADS_API_KEY,
    }
    try:
        resp = requests.get(GOODREADS_SEARCH_API_URL, params=params, timeout=10)
    except requests.exceptions.RequestException as e:
        print(e)
        return None
    xml = BeautifulSoup(resp.text, 'xml')
    results = xml.find('results')
    if not results or not results.find('work'):
        return None  # a response with no results is essentially useless
    return xml

GOODREADS_SHOW_API_URL = 'https://www.goodreads.com/book/show'
def goodreads_show_by_id(book_id):
    """
    https://www.goodreads.com/api/index#book.show
    Look up book reviews and metadata by Goodreads' internal book id
    """
    endpoint = f'{book_id}.xml'  # the id is part of the path, not a query parameter
    params = {
        'key': GOODREADS_API_KEY,
    }
    try:
        resp = requests.get(f'{GOODREADS_SHOW_API_URL}/{endpoint}', params=params, timeout=10)
    except requests.exceptions.RequestException as e:
        print(e)
        return None
    xml = BeautifulSoup(resp.text, 'xml')
    if not xml.find('book'):
        return None
    return xml


    
def goodreads_id_from_title(title):
    """Return the Goodreads id of the best-matching book for a title, or None."""
    resp = goodreads_title_search(title)
    if not resp:
        print(f'no goodreads book search results returned for title: {title}')
        return
    try:
        work = resp.find('results').find('work')
        if work.find('best_book_id'):
            return work.find('best_book_id').text
        else:
            return work.find('best_book').find('id').text
    except AttributeError:
        print(f'incomplete goodreads book API response for title: {title}')
        return
    

def goodreads_isbn_from_id(book_id):
    """Return the ISBN for a Goodreads book id, or None if unavailable."""
    resp = goodreads_show_by_id(book_id)
    if not resp:
        print(f'no goodreads books returned for goodreads book id: {book_id}')
        return None
    try:
        isbn = resp.find('isbn').text
    except AttributeError:
        print(f'no isbn contained in response for goodreads book id: {book_id}')
        return None
    # a value containing whitespace is not a usable ISBN
    if ' ' in isbn:
        return None
    return isbn
        

def goodreads_isbn_from_title(title):
    goodreads_id = goodreads_id_from_title(title)
    if not goodreads_id:
        return
    isbn = goodreads_isbn_from_id(goodreads_id)
    return isbn


## Parsing problems
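* A comment may list several books separated by commas, which the current pipeline cannot handle: punctuation tokens are dropped before matching, so adjacent titles run together (compare `'Twenty Days of Turin The Water Knife'` in the sample output).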
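
One possible mitigation, sketched here as an assumption rather than a tested fix, is to split each sentence on list separators before matching titles. The helper name `split_book_list` and the choice of separators (commas and semicolons, plus a leading lowercase "and"/"or") are illustrative.

```python
import re

# Assumption: list items are separated by commas or semicolons.
LIST_SEPARATOR_RE = re.compile(r'\s*[,;]\s*')
# A lowercase conjunction left at the start of the final item ("..., and X").
LEADING_CONJUNCTION_RE = re.compile(r'^(?:and|or)\s+')


def split_book_list(sentence):
    """Split a sentence into candidate book mentions (hypothetical helper)."""
    parts = LIST_SEPARATOR_RE.split(sentence)
    parts = (LEADING_CONJUNCTION_RE.sub('', part).strip() for part in parts)
    return [part for part in parts if part]


sentence = ('Blood Meridian by Cormac McCarthy, Slade House by David Mitchell, '
            'and The Collector by John Fowles')
for candidate in split_book_list(sentence):
    print(candidate)
```

Each candidate could then be passed to `_match_titles` on its own. The trade-off is that titles which themselves contain a comma would be split incorrectly.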