# Reddit Part One: Getting Data

You're going to scrape the front page of https://www.reddit.com! Reddit is a magic land made of many, many semi-independent kingdoms, called subreddits. We need to find out which are the most powerful.

You are going to scrape the front page of reddit every 4 hours, saving a CSV file that includes:
* The title of the post
* The number of votes it has (the number between the up and down arrows)
* The number of comments it has
* What subreddit it is from (e.g. /r/AskReddit, /r/todayilearned)
* When it was posted (get a TIMESTAMP, e.g. 2016-06-22T12:33:58+00:00, not "4 hours ago")
* The URL to the post itself
* The URL of the thumbnail image associated with the post
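
One way to save those fields is with `csv.DictWriter`. This is a minimal sketch, not the required solution: the column names match the dictionary keys used in the notebook below, while `csv_filename`, `save_posts`, and the `reddit-YYYY-MM-DD-HH.csv` naming pattern are made up for illustration.

```python
import csv
from datetime import datetime, timezone

# Column order for the CSV; one column per item in the list above.
FIELDS = ['title', 'votes', 'comments', 'subreddit',
          'timestamp', 'url', 'thumb_url']


def csv_filename(moment):
    """Build a filename that is specific down to the hour (hypothetical pattern)."""
    return f"reddit-{moment:%Y-%m-%d-%H}.csv"


def save_posts(posts, path):
    """Write a list of post dictionaries to a CSV file with a header row."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(posts)


if __name__ == '__main__':
    example = [{
        'title': '"Two clowns in the same circus" 16 x 12s oil on linen',
        'votes': 4246,
        'comments': 372,
        'subreddit': '/r/Art',
        'timestamp': '2016-06-22T12:33:58+00:00',
        'url': 'https://www.reddit.com/r/Art/comments/4pbvk5/two_clowns_in_the_same_circus_16_x_12s_oil_on/',
        'thumb_url': 'https://b.thumbs.redditmedia.com/p32PnbLD9t9hqvw9Q5X7eZS2tI7Ygqnh5K5MTxOERSE.jpg',
    }]
    save_posts(example, csv_filename(datetime.now(timezone.utc)))
```

Since the scrape runs every 4 hours, a stamp that goes down to the hour is enough to keep the files from overwriting each other.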

Note:

<p>Ugh, reddit is horrible when it hasn't been customized to your tastes. If you would like something more exciting/less idiotic, try scraping a multireddit page - https://www.reddit.com/r/multihub/top/?sort=top&t=year - they're subreddits clustered by topics.

<p>For example, you could scrape https://www.reddit.com/user/CrownReserve/m/improveyoself which is all self-improvement subreddits. You can follow the links at https://www.reddit.com/r/multihub/top/?sort=top&t=year or use the "Find Multireddits" link on the Multireddit page to find more.

In [83]:
from bs4 import BeautifulSoup
import requests

user_agent = {'User-agent': 'Mozilla/5.0'}
html_str = requests.get('https://www.reddit.com/', headers = user_agent).text

In [84]:
html_str

'<!doctype html><html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head><title>reddit: the front page of the internet</title><meta name="keywords" content=" reddit, reddit.com, vote, comment, submit " /><meta name="description" content="reddit: the front page of the internet" /><meta name="referrer" content="always"><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><link rel="canonical" href="https://www.reddit.com/" /><link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.reddit.com/" /><meta name="viewport" content="width=1024"><link rel=\'icon\' href="//www.redditstatic.com/icon.png" sizes="256x256" type="image/png" /><link rel=\'shortcut icon\' href="//www.redditstatic.com/favicon.ico" type="image/x-icon" /><link rel=\'apple-touch-icon-precomposed\' href="//www.redditstatic.com/icon-touch.png" /><link rel="alternate" type="application/atom+xml" title="RSS" href="https://www.reddit.com/.rss" /><link rel="stylesheet" type=

In [85]:
document = BeautifulSoup(html_str, 'html.parser')

In [86]:
# The title of the post
    # The whole post is under `<div>` class = ' thing id-t3_4 ....'
        # <div> class = 'entry unvoted'
        # <p> class = 'title'
        # `<a>` class = 'title may-blank '
# The number of votes it has (the number between the up and down arrows)
    # The number of votes is in <div> class = 'score unvoted'
    # sometimes this is &bull;
# The number of comments it has
    # It's nested like this:
        # <div> class = 'entry unvoted'
        # <ul> class = 'flat-list buttons'
        # <li> class = 'first'
        # <a> class = 'bylink comments may-blank'
# What subreddit it is from (e.g. /r/AskReddit, /r/todayilearned)
    # <div> class = 'entry unvoted'
    # <p> class='tagline'
    # <a> class = 'subreddit hover may-blank'
# When it was posted (get a TIMESTAMP, e.g. 2016-06-22T12:33:58+00:00, not "4 hours ago")
    # <div> class = 'entry unvoted'
    # <p> class='tagline'
    # <time> it's actually in the tag
# The URL to the post itself
    # This is in two places. Both inside the main <div> tag and in the same tag with the title.
# The URL of the thumbnail image associated with the post
    # There are two thumbnail URLs: the one it originally comes from (I guess) and the reddit thumbnail. Here's how to get the reddit thumbnail:
        # <a> class = 'thumbnail may-blank'
        # <img> it's actually in the tag
# What I eventually want: a list holding one dictionary per post
posts_today = [
    {'title': '"Two clowns in the same circus" 16 x 12s oil on linen',
     'votes': 4246,
     'comments': 372,
     'subreddit': '/r/Art',
     'timestamp': '2016-06-22T12:33:58+00:00',
     'url': 'https://www.reddit.com/r/Art/comments/4pbvk5/two_clowns_in_the_same_circus_16_x_12s_oil_on/',
     'thumb_url': 'https://b.thumbs.redditmedia.com/p32PnbLD9t9hqvw9Q5X7eZS2tI7Ygqnh5K5MTxOERSE.jpg'}
]

In [87]:
import re

In [88]:
one_sibling_up = document.find_all('div', {'class': 'clearleft'})

In [89]:
# troubleshooting
document

<!DOCTYPE doctype html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"><head><title>reddit: the front page of the internet</title><meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/><meta content="reddit: the front page of the internet" name="description"/><meta content="always" name="referrer"><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><link href="https://www.reddit.com/" rel="canonical"/><link href="https://m.reddit.com/" media="only screen and (max-width: 640px)" rel="alternate"/><meta content="width=1024" name="viewport"><link href="//www.redditstatic.com/icon.png" rel="icon" sizes="256x256" type="image/png"/><link href="//www.redditstatic.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/><link href="//www.redditstatic.com/icon-touch.png" rel="apple-touch-icon-precomposed"/><link href="https://www.reddit.com/.rss" rel="alternate" title="RSS" type="application/atom+xml"/><link href="//www.redditstatic.com/

In [90]:
# because only every other clearleft has a post in it:
posts = [tag.find_next_sibling('div') for tag in one_sibling_up if tag.find_next_sibling('div')]

In [96]:
# posts is a list
len(posts)
# There are 10 more posts than show up on the homepage. Seems like the first 9 and last one aren't actual posts.

35
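
Rather than counting which entries at the start and end are not real posts, one option is to keep only the blocks that contain a title link. This is a sketch under the assumption that every real post has an `<a>` with the class string used in the notes above; `has_title` and `real_posts` are hypothetical helper names.

```python
def has_title(post):
    """Return True when the block contains a post title link."""
    return post.find('a', {'class': 'title may-blank '}) is not None


def real_posts(candidates):
    """Keep only the blocks that look like actual posts."""
    return [post for post in candidates if has_title(post)]
```

With the `posts` list from the cell above, `real_posts(posts)` would replace a hard-coded slice, which matters if the number of non-post blocks changes between scrapes.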

In [97]:
def title(post):
    if post.find('a', {'class': 'title may-blank '}):
        return post.find('a', {'class': 'title may-blank '}).string
    else:
        return 'NO TITLE'

In [120]:
def votes(post):
    if post.find('div', {'class': 'score unvoted'}):
        return post.find('div', {'class': 'score unvoted'}).string
    else:
        return 'NO INFO'
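
The remaining fields could follow the same pattern as `title` and `votes`. The sketch below leans on the notes from the planning cell and on two assumptions that are not confirmed anywhere in this notebook: that the `<time>` tag carries the timestamp in a `datetime` attribute, and that the thumbnail `<img>` carries its URL in `src`. The function names are hypothetical.

```python
def subreddit(post):
    """Text of the subreddit link in the tagline, e.g. '/r/Art'."""
    link = post.find('a', {'class': 'subreddit hover may-blank'})
    return link.get_text() if link else 'NO INFO'


def timestamp(post):
    """Machine-readable post time (assumed to live in the datetime attribute)."""
    tag = post.find('time')
    return tag.get('datetime', 'NO INFO') if tag else 'NO INFO'


def thumb_url(post):
    """URL of the reddit thumbnail (assumed to live in the img src attribute)."""
    link = post.find('a', {'class': 'thumbnail may-blank'})
    img = link.find('img') if link else None
    return img.get('src', 'NO INFO') if img else 'NO INFO'
```

Each helper takes one element of `posts` and falls back to `'NO INFO'`, the same placeholder `votes` uses, so a post without a thumbnail does not stop the loop.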

In [135]:
# The number of comments it has
    # It's nested like this:
        # <div> class = 'entry unvoted'
        # <ul> class = 'flat-list buttons'
        # <li> class = 'first'
        # <a> class = 'bylink comments may-blank'

for num, post in enumerate(posts, start=1):
    link = post.find('a', {'class': 'bylink comments may-blank'})
    if link:
        # re.findall takes the pattern first, then the string to search
        print(re.findall(r'\d+', link.text))
    else:
        print(0)
    print(num)
    print('')
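
To turn the comment link into a number, the digits can be pulled out of its text with a regular expression. This is a small sketch in the style of `title` and `votes`; it assumes the link text looks like `372 comments` and has no digits at all when a post has no comments yet. `comment_count` is a hypothetical name. Note that `re.findall` takes the pattern first and the string to search second.

```python
import re


def comment_count(link_text):
    """Return the first number found in the comment link text, or 0 if none."""
    # Drop thousands separators in case the count is written like '1,234'.
    digits = re.findall(r'\d+', link_text.replace(',', ''))
    return int(digits[0]) if digits else 0
```

It would be called with the text of the `bylink comments may-blank` link, for example `comment_count(link.text)`.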

In [128]:
posts_today = []
for post in posts[9:34]:
    score = votes(post)
    posts_today.append({
        'title': title(post),
        # only convert real numbers; 'NO INFO' or a bullet stays as text
        'votes': int(score) if score and score.isdigit() else score,
    })

print(len(posts_today))
posts_today

25


[{'title': 'NO TITLE', 'votes': 'NO INFO'},
 {'title': '"Two clowns in the same circus" 16 x 12s oil on linen',
  'votes': 4195},
 {'title': 'German government agrees to ban fracking indefinitely',
  'votes': 6305},
 {'title': 'This fucking guy.', 'votes': 4997},
 {'title': "Irish fans fixing a dent in somebody's car", 'votes': 7509},
 {'title': 'For men, the importance of safe sex depends on how hot their partner is',
  'votes': 5208},
 {'title': 'PsBattle: Peter Dinklage riding a scooter', 'votes': 5294},
 {'title': 'Ireland fans dent the roof of a French car, fix it straight away',
  'votes': 3223},
 {'title': 'Ill bet this was a drunk idea gone right.', 'votes': 5991},
 {'title': 'LPT: Make a distinct bend in your business card before dropping it in a bowl used for a drawing. The drawer is more likely to pick a card that sticks out rather than lays flat.',
  'votes': 3704},
 {'title': 'Albert Einstein and Charlie Chaplain, 1931', 'votes': 5439},
 {'title': 'Married With Children ca

# Reddit Part Two: Sending data

You'd like to get something in your inbox about what's happening on reddit every morning at 8:30AM. Using a mailgun.com account and their API, send an email to your email address with the CSV you saved at 8AM attached. The subject of the email should be something like "Reddit this morning: January 1, 1970"
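
A rough sketch of the sending step, assuming Mailgun's HTTP messages endpoint with basic auth (`api` plus your key) and a multipart `attachment` field, as I understand their documentation. The domain, API key, and sender address are placeholders you would swap for your own, and `subject_for` and `send_csv` are hypothetical names.

```python
from datetime import date
from pathlib import Path


def subject_for(day):
    """Build a subject line like 'Reddit this morning: January 1, 1970'."""
    return f"Reddit this morning: {day:%B} {day.day}, {day.year}"


def send_csv(csv_path, to_address, domain, api_key, day=None):
    """Email the CSV as an attachment through Mailgun (placeholders throughout)."""
    # Imported here so subject_for can be used without requests installed.
    import requests

    day = day or date.today()
    with open(csv_path, 'rb') as f:
        response = requests.post(
            f"https://api.mailgun.net/v3/{domain}/messages",
            auth=("api", api_key),
            files=[("attachment", (Path(csv_path).name, f.read()))],
            data={
                "from": f"Reddit scraper <mailgun@{domain}>",  # placeholder sender
                "to": [to_address],
                "subject": subject_for(day),
                "text": "The latest front page scrape is attached.",
            },
        )
    response.raise_for_status()  # surface rejected requests instead of failing silently
    return response
```

Keeping the subject in its own function makes the date format easy to check without sending anything.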

<p>TIP: How are you going to find that csv file? Well, think about how specific the datetime stamp in the filename really needs to be.
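
If the scraper stamps each file only down to the hour, the 8AM file can be located by rebuilding its name instead of searching the folder. A minimal sketch, assuming a hypothetical `reddit-YYYY-MM-DD-HH.csv` naming pattern; `morning_csv_path` is an illustrative name, not part of the assignment.

```python
from datetime import date, datetime, time
from pathlib import Path


def morning_csv_path(day, folder='.', hour=8):
    """Return the path the scraper would have used for the given day and hour."""
    stamp = datetime.combine(day, time(hour))
    return Path(folder) / f"reddit-{stamp:%Y-%m-%d-%H}.csv"


if __name__ == '__main__':
    path = morning_csv_path(date.today())
    print(path, 'exists' if path.exists() else 'is missing')
```

A stamp with minutes or seconds in it would make this guess fail, since the scrape never starts at exactly the same second.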