## Finding the most recommended books on Reddit

Amazon links in Reddit comments are often used as a proxy for which books Reddit mentions most, because the links are easy to parse and look up. The proxy is unreliable, though: countless comments refer to a book by title and author alone (e.g. *The Intelligent Investor by Benjamin Graham*).
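
To make "easy to parse" concrete, here is a minimal sketch of pulling product identifiers out of Amazon links in a comment body. It assumes product links carry a 10-character ASIN after `/dp/` or `/gp/product/`; the names `ASIN_PATTERN` and `extract_asins` are hypothetical and not part of this notebook.

```python
import re

# Assumption: Amazon product links look like .../dp/<ASIN> or .../gp/product/<ASIN>,
# where the ASIN is 10 uppercase letters or digits.
ASIN_PATTERN = re.compile(
    r"amazon\.[a-z.]+/(?:[^\s/]+/)?(?:dp|gp/product)/([A-Z0-9]{10})"
)


def extract_asins(comment_body):
    """Return the ASIN of every Amazon product link found in a comment."""
    return ASIN_PATTERN.findall(comment_body)


# Illustrative comment text, not taken from a real thread.
example = "Start here: https://www.amazon.com/Some-Book-Title/dp/0060555661 (great read)"
print(extract_asins(example))
```

A comment that names a book only by title and author returns an empty list here, which is exactly the gap described above.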

Many Reddit submissions ask questions such as:
* [Reddit, what are some "MUST read" books?](https://www.reddit.com/r/AskReddit/comments/34m5n6/reddit_what_are_some_must_read_books/)
* [What are /r/investing's favorite books? - Future side bar link.](https://www.reddit.com/r/investing/comments/166ha8/what_are_rinvestings_favorite_books_future_side/)
* [What is a good cook book for a beginner?](https://www.reddit.com/r/Cooking/comments/6m5enh/what_is_a_good_cook_book_for_a_beginner/)

A brief look at these posts shows almost no Amazon links, so scrapers that rely on Amazon links will miss these book recommendations. Moreover, these posts are highly targeted and draw attention from the entire community, often yielding hundreds of book recommendations along with in-depth discussion and feedback on each one. Missing them would seriously hurt a recommendation service that strives to be accurate.

So the main problem is parsing comments to determine exactly which books were mentioned. Here are some observations that may lead to a reliable method:
- Books are almost always mentioned in the top-level comments
- Most people capitalize the book title
- Most people mention the author
    - e.g. The Intelligent Investor **by Benjamin Graham**
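
The capitalization and "by Author" observations can be combined into one rough pattern. The sketch below is only an assumption about how such a heuristic might look: it treats a title as a run of capitalized words (with a few small connecting words allowed in between) followed by `by` and a capitalized author name. `TITLE_BY_AUTHOR` and `find_title_author_pairs` are hypothetical names.

```python
import re

# Assumption: small connecting words may appear lowercase inside a title.
SMALL = r"(?:a|an|and|at|for|in|of|on|or|the|to)\b"
# A capitalized word, allowing apostrophes and hyphens inside it.
WORD = r"[A-Z][\w'-]*"

TITLE_BY_AUTHOR = re.compile(
    rf"({WORD}(?:\s+(?:{WORD}|{SMALL}))*)\s+by\s+({WORD}(?:\s+{WORD})*)"
)


def find_title_author_pairs(text):
    """Return (title, author) tuples for every 'Title by Author' match."""
    return TITLE_BY_AUTHOR.findall(text)


print(find_title_author_pairs(
    "My pick is The Intelligent Investor by Benjamin Graham."
))
```

This is deliberately crude: a capitalized word directly before the title (such as the first word of a sentence) is absorbed into the match, and author initials like "J.D." are cut short, so the results would still need cleaning.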


In [25]:
import praw
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from pathlib import Path

In [13]:
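def get_reddit_client():
    # read KEY=VALUE pairs from the credentials file
    api_creds = {}
    with open('../puller-api-creds.env') as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue  # skip blank lines and comments
            # split on the first '=' only, so values may contain '='
            k, v = line.split('=', 1)
            api_creds[k] = v
    return praw.Reddit(
        user_agent='book submission parser',
        client_id=api_creds['CLIENT_ID'],
        client_secret=api_creds['CLIENT_SECRET'],
        username=api_creds['USERNAME'],
        password=api_creds['PASSWORD'],
    )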

In [17]:
# instantiate the client
reddit = get_reddit_client()

In [20]:
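# get a test submission's top-level comments
test_submission_link = 'https://www.reddit.com/r/AskReddit/comments/34m5n6/reddit_what_are_some_must_read_books/'
test_submission = reddit.submission(url=test_submission_link)
# drop "load more comments" placeholders so only Comment objects remain
test_submission.comments.replace_more(limit=0)
top_level_comments = list(test_submission.comments)
top_level_comments[:10]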

[Comment(id='cqw2x18'),
 Comment(id='cqw67zs'),
 Comment(id='cqvzjfi'),
 Comment(id='cqvzdne'),
 Comment(id='cqw2tes'),
 Comment(id='cqw060u'),
 Comment(id='cqw3hpx'),
 Comment(id='cqvyudr'),
 Comment(id='cqw8wgi'),
 Comment(id='cqw1hkz')]

In [23]:
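def extract_titles_from_comment(body):
    # Placeholder heuristic (an assumption, not a finished parser): keep the
    # sentences that contain " by ", since most recommendations name the author.
    candidates = []
    for sentence in sent_tokenize(body):
        if ' by ' in sentence:
            candidates.append(sentence.strip())
    return candidates


def extract_titles(reddit, submission_url):
    # the first positional argument of reddit.submission is an id, so pass url=
    submission = reddit.submission(url=submission_url)
    # drop "load more comments" placeholders so only Comment objects remain
    submission.comments.replace_more(limit=0)
    titles = []
    for comment in submission.comments:
        titles.extend(extract_titles_from_comment(comment.body))
    return titles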

In [30]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]
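
One way the stopword list could support the capitalization observation: titles such as "The Catcher in the Rye" contain lowercase stopwords, so a run of capitalized words should be allowed to continue across them. The sketch below is a minimal illustration under that assumption. `STOPWORDS` is a small hard-coded stand-in for `stopwords.words('english')` so the example runs without NLTK data, and `capitalized_spans` is a hypothetical helper.

```python
# Stand-in for stopwords.words('english'); an assumption made so the sketch
# runs without downloading NLTK data.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for", "with"}


def capitalized_spans(sentence, stop_words=STOPWORDS):
    """Return runs of capitalized words, letting stopwords continue a run."""
    spans, current = [], []

    def flush():
        # drop lowercase stopwords left dangling at the end of a run
        while current and current[-1][0].islower() and current[-1] in stop_words:
            current.pop()
        # keep only multi-word runs; single capitalized words are too noisy
        if len(current) >= 2:
            spans.append(" ".join(current))
        current.clear()

    for token in sentence.split():
        word = token.strip(".,;:!?\"'()")
        if word[:1].isupper():
            current.append(word)
        elif current and word.lower() in stop_words:
            current.append(word)
        else:
            flush()
    flush()
    return spans


print(capitalized_spans("I think The Catcher in the Rye is overrated"))
```

Requiring at least two words drops single-word titles, and a capitalized word right before a title still gets merged into the run, so this is a starting point for candidate generation rather than a reliable extractor.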

In [29]:
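import nltk

# With no arguments this opens the interactive downloader. The cells above need
# the `stopwords` corpus and the `punkt` tokenizer models used by sent_tokenize
# (`punkt_tab` on newer NLTK releases).
nltk.download()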

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True
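
Once titles can be extracted per comment, finding the most recommended books becomes a counting problem. The following is a minimal sketch, assuming each comment has already been reduced to a list of title strings; `rank_titles` is a hypothetical helper, and the lowercase-and-collapse-whitespace normalization is only one possible choice.

```python
from collections import Counter


def rank_titles(titles_per_comment):
    """Count how many comments mention each title, ignoring case and spacing."""
    counts = Counter()
    for titles in titles_per_comment:
        # a set, so a title repeated within one comment is counted once
        normalized = {" ".join(title.lower().split()) for title in titles}
        counts.update(normalized)
    # most mentioned first
    return counts.most_common()


# Illustrative input, not real thread data.
print(rank_titles([["Dune"], ["dune ", "Dune"], ["Emma"]]))
```

Counting comments rather than raw mentions keeps one enthusiastic commenter from inflating a title's rank.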