# Web Scraping for Reddit & Predicting Comments

Please read the readme carefully!

Your method for acquiring the data will be scraping the 'hot' threads as listed on the [Reddit homepage](https://www.reddit.com/). You'll acquire _AT LEAST FOUR_ pieces of information about each thread:
1. The title of the thread
2. The subreddit that the thread corresponds to
3. The length of time it has been up on Reddit
4. The number of comments on the thread

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts whether or not a given Reddit post will have above or below the _median_ number of comments.

**BONUS PROBLEMS**
1. Scrape the actual text of the threads using Selenium 
2. Write the actual article that you're pitching and turn it into a blog post that you host on your personal website.

# This starter code is just meant to be a guide to help frame the problem.  You DO NOT need to follow it, as long as you address the problem requirements set out in the readme.

You will need to use the `sleep` functionality so that you don't make too many calls to the reddit server (see the reddit API documentation for details).  Below is an example of how the timer works.

In [1]:
import time
print "hello"
time.sleep(3)
print "bye"

hello
bye


### Scraping Thread Info from Reddit.com

#### Set up a request (using requests or chromedriver) to the URL below. Use BeautifulSoup to parse the page and extract all results

In [2]:
import requests
from bs4 import BeautifulSoup
import urllib

In [3]:
url = "http://www.reddit.com"
r = requests.get(url)
#request gets the HTML (in this case), gets stuff from websites

In [4]:
HTML = r.text

In [5]:
#lxml is the parser of HTML for python
soup = BeautifulSoup(HTML, 'lxml')

In [6]:
from selenium import webdriver
from selenium.webdriver import Chrome

driver = webdriver.Chrome(executable_path = './chromedriver 2')

In [7]:
driver.get(url)

In [8]:
from time import sleep

In [9]:
time.sleep(5)
content = driver.page_source
time.sleep(5)
print content

<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" class=" js cssanimations csstransforms"><head><title>reddit: the front page of the internet</title><meta name="keywords" content=" reddit, reddit.com, vote, comment, submit " /><meta name="description" content="reddit: the front page of the internet" /><meta name="referrer" content="always" /><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><link type="application/opensearchdescription+xml" rel="search" href="/static/opensearch.xml" /><link rel="canonical" href="https://www.reddit.com/" /><meta name="viewport" content="width=1024" /><link rel="dns-prefetch" href="//out.reddit.com" /><link rel="preconnect" href="//out.reddit.com" /><link rel="apple-touch-icon" sizes="57x57" href="//www.redditstatic.com/desktop2x/img/favicon/apple-icon-57x57.png" /><link rel="apple-touch-icon" sizes="60x60" href="//www.redditstatic.com/desktop2x/img/favicon/apple-icon-60x60.png" /><link rel="apple-touch-

In [10]:
soup = BeautifulSoup(content, 'lxml')

In [11]:
driver.close()

In [12]:
#how to show the html prettily = use prettify
print soup.prettify()
time.sleep(15)

<!DOCTYPE html>
<html class=" js cssanimations csstransforms" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>
   reddit: the front page of the internet
  </title>
  <meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/>
  <meta content="reddit: the front page of the internet" name="description"/>
  <meta content="always" name="referrer"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <link href="/static/opensearch.xml" rel="search" type="application/opensearchdescription+xml"/>
  <link href="https://www.reddit.com/" rel="canonical"/>
  <meta content="width=1024" name="viewport"/>
  <link href="//out.reddit.com" rel="dns-prefetch"/>
  <link href="//out.reddit.com" rel="preconnect"/>
  <link href="//www.redditstatic.com/desktop2x/img/favicon/apple-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
  <link href="//www.redditstatic.com/desktop2x/img/favicon/apple-icon-60x60.png" rel="apple-touch-icon" siz

In [13]:
print soup.find_all('a', {'data-event-action': 'title'})

[<a class="title may-blank " data-event-action="title" href="//pixel.redditmedia.com/click?url=https%3A%2F%2Fengine.a.redditmedia.com%2Fr%3Fe%3DeyJhdiI6MjMyNjA1LCJhdCI6NCwiYnQiOjAsImNtIjo1ODUzNDksImNoIjo3OTg0LCJjayI6e30sImNyIjo0MTI0Njc2LCJkaSI6IjIxZTkzMjEyNzRmMTQyZDhiMTBmNDdlYjA4N2JlNTVmIiwiZGoiOjAsImlpIjoiODg2YmEyMjFjYjUzNDE1MTllOGJhM2FjNzFkOTQyYzQiLCJkbSI6MywiZmMiOjY4MDA1OTQsImZsIjo2NjEyMDU0LCJpcCI6IjEwMC4xMi4xNjAuNDIiLCJudyI6NTE0NiwicGMiOjEsImRwIjowLjc5LCJkbiI6MC43OSwiZWMiOjEsInByIjo4NDUzNiwicnQiOjIsInJzIjo1MDAsInNhIjoiMTEiLCJzYiI6ImktMGVjYzk3OGJjOGI1NmMyZDciLCJzcCI6MjU4MTMsInN0IjoyNDk1MCwidWsiOiJ0ZXdpODl1IiwidHMiOjE1MTY3NDIwMjgxMjcsInBuIjoiZGl2X2ZlZWQiLCJ1ciI6Imh0dHBzOi8vd3d3LmFtYXpvbi5jb20vZ3AvZ29sZGJveC8_dGFnPXJhMGEwLTIwJmFzY3N1YnRhZz1wdHItRFNTLTEtOS0xNTEyMzk5NDk4MTgxRTgmcmVmXz1wdHJfRFNTXzFfOV8xNTEyMzk5NDk4MTgxRTgifQ%26s%3DAlmzocUn-4MZm1XKVR_5T-OnZ7o&amp;unused=reddit.com&amp;hash=6b6fcc9f2444499d8e93f402bc46b9124bd28baf&amp;id=t3_7m4404-__REDDIT_AD_SERVER__" rel="" tabindex="1">

While this has some more verbose elements removed, we can see that there is some structure to the above:
- The thread title is within an `<a>` tag with the attribute `data-event-action="title"`.
- The time since the thread was created is within a `<time>` tag with attribute `class="live-timestamp"`.
- The subreddit is within an `<a>` tag with the attribute `class="subreddit hover may-blank"`.
- The number of comments is within an `<a>` tag with the attribute data-event-action="comments"`.

In [14]:
#print all of the things classified under "a"
print soup.find_all('a')

[<a href="#content" id="jumpToContent" tabindex="1">jump to content</a>, <a class="bottom-option choice" href="https://www.reddit.com/subreddits/">edit subscriptions</a>, <a class="choice" href="https://www.reddit.com/r/popular/">popular</a>, <a class="choice" href="https://www.reddit.com/r/all/">all</a>, <a class="random choice" href="https://www.reddit.com/r/random/">random</a>, <a class="choice" href="https://www.reddit.com/users/">users</a>, <a class="choice" href="https://www.reddit.com/r/AskReddit/">AskReddit</a>, <a class="choice" href="https://www.reddit.com/r/worldnews/">worldnews</a>, <a class="choice" href="https://www.reddit.com/r/videos/">videos</a>, <a class="choice" href="https://www.reddit.com/r/funny/">funny</a>, <a class="choice" href="https://www.reddit.com/r/todayilearned/">todayilearned</a>, <a class="choice" href="https://www.reddit.com/r/pics/">pics</a>, <a class="choice" href="https://www.reddit.com/r/gaming/">gaming</a>, <a class="choice" href="https://www.redd

In [15]:
#find the html code classified under the tag "a"
print soup.find('a',{'data-event-action':'title'})

<a class="title may-blank " data-event-action="title" href="//pixel.redditmedia.com/click?url=https%3A%2F%2Fengine.a.redditmedia.com%2Fr%3Fe%3DeyJhdiI6MjMyNjA1LCJhdCI6NCwiYnQiOjAsImNtIjo1ODUzNDksImNoIjo3OTg0LCJjayI6e30sImNyIjo0MTI0Njc2LCJkaSI6IjIxZTkzMjEyNzRmMTQyZDhiMTBmNDdlYjA4N2JlNTVmIiwiZGoiOjAsImlpIjoiODg2YmEyMjFjYjUzNDE1MTllOGJhM2FjNzFkOTQyYzQiLCJkbSI6MywiZmMiOjY4MDA1OTQsImZsIjo2NjEyMDU0LCJpcCI6IjEwMC4xMi4xNjAuNDIiLCJudyI6NTE0NiwicGMiOjEsImRwIjowLjc5LCJkbiI6MC43OSwiZWMiOjEsInByIjo4NDUzNiwicnQiOjIsInJzIjo1MDAsInNhIjoiMTEiLCJzYiI6ImktMGVjYzk3OGJjOGI1NmMyZDciLCJzcCI6MjU4MTMsInN0IjoyNDk1MCwidWsiOiJ0ZXdpODl1IiwidHMiOjE1MTY3NDIwMjgxMjcsInBuIjoiZGl2X2ZlZWQiLCJ1ciI6Imh0dHBzOi8vd3d3LmFtYXpvbi5jb20vZ3AvZ29sZGJveC8_dGFnPXJhMGEwLTIwJmFzY3N1YnRhZz1wdHItRFNTLTEtOS0xNTEyMzk5NDk4MTgxRTgmcmVmXz1wdHJfRFNTXzFfOV8xNTEyMzk5NDk4MTgxRTgifQ%26s%3DAlmzocUn-4MZm1XKVR_5T-OnZ7o&amp;unused=reddit.com&amp;hash=6b6fcc9f2444499d8e93f402bc46b9124bd28baf&amp;id=t3_7m4404-__REDDIT_AD_SERVER__" rel="" tabindex="1">W

In [16]:
#start printing only the text with the tag of "a
print soup.find('a',{'data-event-action':'title'}).get_text()

We haven't run out of deals at Amazon! Don't miss out on today's Deal of the Day and other major savings before they're gone!


In [17]:
print soup.find('time',{'class':'live-timestamp'}).get_text()

2 hours ago


In [18]:
print soup.find('a',{'class':'subreddit hover may-blank'}).get_text()

r/gifs


In [19]:
print soup.find('a',{'data-event-action':'comments'}).get_text()

439 comments


## Write 4 functions to extract these items (one function for each): title, time, subreddit, and number of comments.¶
Example
```python
def extract_title_from_result(result):
    return result.find ...
```

##### - Make sure these functions are robust and can handle cases where the data/field may not be available.
>- Remember to check if a field is empty or None for attempting to call methods on it
>- Remember to use try/except if you anticipate errors.

- **Test** the functions on the results above and simple examples

In [20]:
def get_title(html):
    #html is NOT the "HTML" that we found, it is the "soup"
    title_list =[]
    for x in html.findAll('p', {'class':'title'})[1:]:
        title_list.append(x.text)
    return title_list

In [21]:
titles = get_title(soup)
titles
#this is a list, in our df it will be as a dataframe

[u'Dad prevents crash. (i.imgur.com)',
 u'Dad reflexes prevent crash.\u2605\u2605\u2605\u2605\u2605 Dad Reflex (i.imgur.com)',
 u'Synced videos of the Eagles fan running into the pillar (v.redd.it)',
 u'Legit the best place for your squad to stay woke...True FellowKids (i.imgur.com)',
 u'My friend playing Mario Odyssey during class (i.redd.it)',
 u'Quite literally choosing beggar (i.imgur.com)',
 u'One of the best feelings ever (i.imgur.com)',
 u'When your cat crashes his bicycle in his dream... (v.redd.it)',
 u'A tower of giraffes out for a run/r/ALL (i.imgur.com)',
 u'Heatmap of numbers found at the end of Reddit usernames [OC]OC (i.redd.it)',
 u'Tesla\u2019s giant battery in Australia made around $1 million in just a few daysEnergy (electrek.co)',
 u'Hre you OPEN TODAY (i.redd.it)',
 u'Trying to pacifist run SuperHot is insane. (gfycat.com)',
 u'Sartre Day Night Live (i.imgur.com)',
 u'Greece to legalize Medical Marijuana in February (cnednews.com)',
 u'Psychedelic mushrooms reduce 

In [22]:
titles[5]
#trying to pull one title from the list

u'Quite literally choosing beggar (i.imgur.com)'

In [23]:
def get_time(html):
    #html is NOT the "HTML" that we found, it is the "soup"
    time_list =[]
    for x in html.findAll('time',{'class':'live-timestamp'}):
        time_list.append(x.text)
    return time_list

In [24]:
times = get_time(soup)
times
len(times)

25

In [25]:
def get_subreddit(html_soup):
    #html is NOT the "HTML" that we found, it is the "soup"
    subreddit_list =[]
    for x in html_soup.findAll('a',{'class':'subreddit hover may-blank'}):
        try:
            subreddit_list.append(x.text)
        except:
            subreddit_list.append('ERROR')
    return subreddit_list

In [26]:
subreddits = get_subreddit(soup)
subreddits

[u'r/gifs',
 u'r/DadReflexes',
 u'r/sports',
 u'r/FellowKids',
 u'r/gaming',
 u'r/ChoosingBeggars',
 u'r/wholesomebpt',
 u'r/funny',
 u'r/interestingasfuck',
 u'r/dataisbeautiful',
 u'r/Futurology',
 u'r/oldpeoplefacebook',
 u'r/gaming',
 u'r/standupshots',
 u'r/worldnews',
 u'r/science',
 u'r/oddlysatisfying',
 u'r/fakehistoryporn',
 u'r/instant_regret',
 u'r/WeWantPlates',
 u'r/evilbuildings',
 u'r/PoliticalHumor',
 u'r/photoshopbattles',
 u'r/pics',
 u'r/pics']

In [27]:
def get_comments(html_soup):
    #html is NOT the "HTML" that we found, it is the "soup"
    comment_list =[]  
    for x in html_soup.findAll('a',{'data-event-action':'comments'}):
        try:
            comment_list.append(x.text)
        except:
            comment_list.append('ERROR')
    return comment_list

In [28]:
number_comments = get_comments(soup)
number_comments

[u'439 comments',
 u'472 comments',
 u'153 comments',
 u'265 comments',
 u'1035 comments',
 u'753 comments',
 u'139 comments',
 u'746 comments',
 u'564 comments',
 u'2896 comments',
 u'331 comments',
 u'421 comments',
 u'1662 comments',
 u'273 comments',
 u'350 comments',
 u'1032 comments',
 u'797 comments',
 u'56 comments',
 u'341 comments',
 u'612 comments',
 u'151 comments',
 u'2221 comments',
 u'348 comments',
 u'228 comments',
 u'1095 comments']

In [29]:
# def get_upvotes(soup):
#     up_list= []
#     c = 0
#     if soup.find('span', {'class':'promoted-span'}) != None:
#         c = c +1
#     else:
#         c = c
#     for voted in soup.findAll('div', {'class':'score unvoted'})[c:]:
#         try: 
#             up_list.append(int(voted['title']))
#         except:
#             up_list.append(0)
#     print up_list

In [30]:
# upvoted = get_upvotes(soup)
# print upvoted
# len(upvoted)

In [31]:
len(titles)

25

In [32]:
len(times)

25

In [33]:
len(subreddits)

25

In [34]:
len(number_comments)

25

In [35]:
import pandas as pd

In [36]:
dic = {'titles':titles, 'posted':times, 'subreddit':subreddits, 'comments':number_comments}
df_firstpage = pd.DataFrame(dic)

In [37]:
df_firstpage

Unnamed: 0,comments,posted,subreddit,titles
0,439 comments,2 hours ago,r/gifs,Dad prevents crash. (i.imgur.com)
1,472 comments,3 hours ago,r/DadReflexes,Dad reflexes prevent crash.★★★★★ Dad Reflex (i...
2,153 comments,2 hours ago,r/sports,Synced videos of the Eagles fan running into t...
3,265 comments,3 hours ago,r/FellowKids,Legit the best place for your squad to stay wo...
4,1035 comments,5 hours ago,r/gaming,My friend playing Mario Odyssey during class (...
5,753 comments,5 hours ago,r/ChoosingBeggars,Quite literally choosing beggar (i.imgur.com)
6,139 comments,4 hours ago,r/wholesomebpt,One of the best feelings ever (i.imgur.com)
7,746 comments,5 hours ago,r/funny,When your cat crashes his bicycle in his dream...
8,564 comments,6 hours ago,r/interestingasfuck,A tower of giraffes out for a run/r/ALL (i.img...
9,2896 comments,6 hours ago,r/dataisbeautiful,Heatmap of numbers found at the end of Reddit ...


In [101]:
df_firstpage.to_csv('reddit_onepage.csv',encoding= 'utf-8', index=False)

In [39]:
#df get_upvotes()

In [102]:
def reddit_function(soup):
    title = get_title(soup)
    times = get_time(soup)
    subreddit = get_subreddit(soup)
    comments = get_comments(soup)
    #upvotes = get_upvotes(soup)
    dic = {'titles':title, 'posted':times, 'subreddit':subreddit, 'comments':comments}
    if len(title) == len(times) == len(subreddit) == len(comments):
        df = pd.DataFrame(dic)
        return df

In [41]:
reddit_function(soup)

Unnamed: 0,comments,posted,subreddit,titles
0,439 comments,2 hours ago,r/gifs,Dad prevents crash. (i.imgur.com)
1,472 comments,3 hours ago,r/DadReflexes,Dad reflexes prevent crash.★★★★★ Dad Reflex (i...
2,153 comments,2 hours ago,r/sports,Synced videos of the Eagles fan running into t...
3,265 comments,3 hours ago,r/FellowKids,Legit the best place for your squad to stay wo...
4,1035 comments,5 hours ago,r/gaming,My friend playing Mario Odyssey during class (...
5,753 comments,5 hours ago,r/ChoosingBeggars,Quite literally choosing beggar (i.imgur.com)
6,139 comments,4 hours ago,r/wholesomebpt,One of the best feelings ever (i.imgur.com)
7,746 comments,5 hours ago,r/funny,When your cat crashes his bicycle in his dream...
8,564 comments,6 hours ago,r/interestingasfuck,A tower of giraffes out for a run/r/ALL (i.img...
9,2896 comments,6 hours ago,r/dataisbeautiful,Heatmap of numbers found at the end of Reddit ...


Now, to scale up our scraping, we need to accumulate more results.

First, look at the source of a Reddit.com page: (https://www.reddit.com/).
Try manually changing the page by clicking the 'next' button on the bottom. Look at how the url changes.

After leaving the Reddit homepage, the URLs should look something like this:
```
https://www.reddit.com/?count=25&after=t3_787ptc
```

The URL here has two query parameters
- count is the result number that the page starts with
- after is the unique id of the last result on the _previous_ page

In order to scrape lots of pages from Reddit, we'll have to change these parameters every time we make a new request so that we're not just scraping the same page over and over again. Incrementing the count by 25 every time will be easy, but the bizarre code after `after` is a bit trickier.

To start off, let's look at a block of HTML from a Reddit page to see how we might solve this problem:
```html
<div class=" thing id-t3_788tye odd gilded link " data-author="LordSneaux" data-author-fullname="t2_j3pty" data-comments-count="1548" data-context="listing" data-domain="v.redd.it" data-fullname="t3_788tye" data-kind="video" data-num-crossposts="0" data-permalink="/r/funny/comments/788tye/not_all_heroes_wear_capes/" data-rank="25" data-score="51468" data-subreddit="funny" data-subreddit-fullname="t5_2qh33" data-timestamp="1508775581000" data-type="link" data-url="https://v.redd.it/ush0rh2tultz" data-whitelist-status="all_ads" id="thing_t3_788tye" onclick="click_thing(this)">
      <p class="parent">
      </p>
      <span class="rank">
       25
      </span>
      <div class="midcol unvoted">
       <div aria-label="upvote" class="arrow up login-required access-required" data-event-action="upvote" role="button" tabindex="0">
       </div>
       <div class="score dislikes" title="53288">
        53.3k
       </div>
       <div class="score unvoted" title="53289">
        53.3k
       </div>
       <div class="score likes" title="53290">
        53.3k
       </div>
       <div aria-label="downvote" class="arrow down login-required access-required" data-event-action="downvote" role="button" tabindex="0">
       </div>
      </div>
```

Notice that within the `div` tag there is an attribute called `id` and it is set to `"thing_t3_788tye"`. By finding the last ID on your scraped page, you can tell your _next_ request where to start (pass everything after "thing_").

For more info on this, you can take a look at the [Reddit API docs](https://github.com/reddit/reddit/wiki/JSON)

## Write one more function that finds the last `id` on the page, and stores it.

In [42]:
#load up 300 pages on one page, need to use selenium to combine the pages
from selenium import webdriver

In [43]:
url = 'http://www.reddit.com/'

In [44]:
driver = webdriver.Chrome('./chromedriver 2')
driver.get(url)

In [45]:
#all of the HTML for all of the pages we designate
html= driver.page_source
html

u'<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" class=" js cssanimations csstransforms"><head><title>reddit: the front page of the internet</title><meta name="keywords" content=" reddit, reddit.com, vote, comment, submit " /><meta name="description" content="reddit: the front page of the internet" /><meta name="referrer" content="always" /><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><link type="application/opensearchdescription+xml" rel="search" href="/static/opensearch.xml" /><link rel="canonical" href="https://www.reddit.com/" /><meta name="viewport" content="width=1024" /><link rel="dns-prefetch" href="//out.reddit.com" /><link rel="preconnect" href="//out.reddit.com" /><link rel="apple-touch-icon" sizes="57x57" href="//www.redditstatic.com/desktop2x/img/favicon/apple-icon-57x57.png" /><link rel="apple-touch-icon" sizes="60x60" href="//www.redditstatic.com/desktop2x/img/favicon/apple-icon-60x60.png" /><link rel="apple-touc

In [46]:
full_soup= BeautifulSoup(html, 'lxml')
full_soup.prettify

<bound method BeautifulSoup.prettify of <!DOCTYPE html>\n<html class=" js cssanimations csstransforms" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"><head><title>reddit: the front page of the internet</title><meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/><meta content="reddit: the front page of the internet" name="description"/><meta content="always" name="referrer"/><meta content="text/html; charset=unicode-escape" http-equiv="Content-Type"/><link href="/static/opensearch.xml" rel="search" type="application/opensearchdescription+xml"/><link href="https://www.reddit.com/" rel="canonical"/><meta content="width=1024" name="viewport"/><link href="//out.reddit.com" rel="dns-prefetch"/><link href="//out.reddit.com" rel="preconnect"/><link href="//www.redditstatic.com/desktop2x/img/favicon/apple-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/><link href="//www.redditstatic.com/desktop2x/img/favicon/apple-icon-60x60.png" rel="apple-touch-icon

In [47]:
driver.close()

In [48]:
import re
#result.find(id=re.compile("thing"))

In [49]:
full_soup.find(id= re.compile('thing'))['id'][6:]

't3_7rkmae'

In [50]:
import re
def get_lastID(mysoup):
    return mysoup.find(id= re.compile('thing'))['id'][6:]

In [51]:
get_lastID(full_soup)

't3_7rkmae'

## (Optional) Collect more information

While we only require you to collect four features, there may be other info that you can find on the results page that might be useful. Feel free to write more functions so that you have more interesting and useful data.

## Now, let's put it all together.

Use the functions you wrote above to parse out the 4 fields - title, time, subreddit, and number of comments. Create a dataframe from the results with those 4 columns.

In [52]:
url_template = "http://www.reddit.com/?count={}&after={}"
max_results = 100 # Set this to a high-value (5000) to generate more results. 
# Crawling more results, will also take much longer. First test your code on a small number of results and then expand.

results = []

for start in range(0, max_results, 25):
    # Grab the results from the request (as above)
    # Append to the full set of results
    pass

In [53]:
# attempt with Will and group
# import numpy as np

# url_template = "http://www.reddit.com/?count{}=&after={}"
# last_id = get_lastID(soup)

# for i in np.arange(25,5000,25):
    
#     import time
#     time.sleep(5) #sleep 5 seconds in between requests
    
#     driver = webdriver.Chrome('./chromedriver 2')
#     driver.get(url_template.format(i+25, last_id))
#     html = driver.page_source
#     driver.close()
    
#     soup = BeautifulSoup(html,'lxml')
#     new_id = get_lastID(soup)
#     print url

#     time.sleep(5)

In [99]:
def reddit_scrapper(website):
    
    driver = webdriver.Chrome(executable_path="./chromedriver 2")
    driver.get(website)
    
    time.sleep(1)
    
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    
    
    ids=[]
    for x in soup.find('div', {'class': 'thing'}):
        ids.append(x['data-fullname'])
    
    new_list = zip(ids,range(25,900,25))

    for i, n in new_list:
        
        url_template = "http://www.reddit.com/?count={}&after={}".format(n,i)
        driver.get(url_template)
        time.sleep(3)
        
        html = driver.page_source
        soup = BeautifulSoup(html, 'lxml')
        #return reddit_function(soup)
        
        time.sleep(3)
        
    driver.close()
    return ids
    #return reddit_function(soup)
    

In [122]:
title = []
subreddit = []
times = []
comment = []
domain = []

def all_elements(x):
  
    for i in soup.findAll('a',{'data-event-action': 'title'}):
        title.append(i.text)
    for i in soup.findAll('a',{'class': 'subreddit hover may-blank'}):
        subreddit.append(i.text)
    for i in soup.findAll('time',{'class': 'live-timestamp'}):
        times.append(i.text)
    for i in soup.findAll('a',{'class':'bylink comments may-blank'}):
        comment.append(i.text)
    for i in soup.findAll('span',{'class':'domain'}):
        domain.append(i.text.replace("(","").replace(")",""))
       
    return pd.DataFrame(zip(title[1:], subreddit, times, comment, domain),
                       columns=['title', 'subreddit','times', 'comment', 'domain',])

In [125]:
# mike's code became my final code because I was able to scrape the largest data set from it. 
pages = range(1,30)
# starting url 
url = 'https://www.reddit.com/'    
for page in pages: 
    # Instantiate a new driver every loop
    driver = webdriver.Chrome(executable_path = './chromedriver 2')
    driver.get(url)
    html = driver.page_source
    # Put the page HTML in a soup object
    soup = BeautifulSoup(html, 'lxml')
    # overwrite the url with the url that the "Next" link points to.
    url = soup.find('span', {'class':'next-button'}).a['href']
    print url
    # Close out the driver
    df = pd.concat([df,reddit_function(soup)])
    driver.close()
    # Sleeping 
    sleep(5)

https://www.reddit.com/?count=25&after=t3_7sfa6j
https://www.reddit.com/?count=50&after=t3_7se7rc
https://www.reddit.com/?count=75&after=t3_7sdxt5
https://www.reddit.com/?count=100&after=t3_7sexoz
https://www.reddit.com/?count=125&after=t3_7sffhi
https://www.reddit.com/?count=150&after=t3_7sfba0
https://www.reddit.com/?count=175&after=t3_7sd5y1
https://www.reddit.com/?count=200&after=t3_7seal2
https://www.reddit.com/?count=225&after=t3_7sedya
https://www.reddit.com/?count=250&after=t3_7sf43k
https://www.reddit.com/?count=275&after=t3_7se6um
https://www.reddit.com/?count=300&after=t3_7sgkwo
https://www.reddit.com/?count=325&after=t3_7sez3f
https://www.reddit.com/?count=350&after=t3_7sfbd3
https://www.reddit.com/?count=375&after=t3_7sf4uo
https://www.reddit.com/?count=400&after=t3_7seg1i
https://www.reddit.com/?count=425&after=t3_7sh1az


AttributeError: 'NoneType' object has no attribute 'a'

In [131]:
df.to_csv('reddit_michael_30.csvcat on that', encoding = 'utf-8',index=False)
df = df.drop_duplicates()
df.to_csv('reddit_michael_30.csv', encoding = 'utf-8',index=False)

In [124]:
df_20.to_csv('reddit_michael_20.csv',encoding= 'utf-8', index=False)

In [87]:
# ids = []
# for x in soup.findAll('div', {'class': 'thing'}):
#     ids.append(x['data-fullname'])

# full_list = zip(ids,range(25,1000,25))

# for i, n in full_list:
#     url_template = "http://www.reddit.com/?count={}&after={}".format(n,i)
#     #print url_template
#     driver = webdriver.Chrome(executable_path="./chromedriver 2")
#     driver.get(url_template)
#     time.sleep(3)
#     html_new = driver.page_source
#     newsoup = BeautifulSoup(html_new,'lxml')
#     for x in newsoup.findAll('div', {'class': 'thing'}):
#         ids.append(x['data-fullname'])
# print ids
# driver.close()

In [88]:
ids

Unnamed: 0,title,subreddit,times,comment,domain


In [89]:
len(ids)

0

In [114]:
def get_endings(website):
    driver = webdriver.Chrome(executable_path="./chromedriver 2")
    driver.get(website)
    
    time.sleep(1)
    
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')

    ids = []
    for x in soup.findAll('div', {'class': 'thing'}):
        ids.append(x['data-fullname'])
    print ids
    

In [116]:
get_endings('http://www.reddit.com')

['t3_7m4404', 't3_7sgrbd', 't3_7sgdzm', 't3_7sgaut', 't3_7sgbfd', 't3_7sfd5v', 't3_7sfi5n', 't3_7sfa7x', 't3_7sfhqk', 't3_7sfdza', 't3_7sf1ra', 't3_7seqkz', 't3_7sewjx', 't3_7sf3oc', 't3_7sexi5', 't3_7selso', 't3_7sel79', 't3_7sfl6j', 't3_7sf5n5', 't3_7sefzn', 't3_7segn4', 't3_7sfrng', 't3_7sebyn', 't3_7selkl', 't3_7seiu5', 't3_7secmb']


In [108]:
list_ids = get_endings('http://www.reddit.com')
print list_ids

['t3_7rkmae', 't3_7sgaut', 't3_7sgdzm', 't3_7sgrbd', 't3_7sgbfd', 't3_7sfhqk', 't3_7sfd5v', 't3_7sfa7x', 't3_7sf3oc', 't3_7sf1ra', 't3_7sewjx', 't3_7sfdza', 't3_7sfi5n', 't3_7sexi5', 't3_7seqkz', 't3_7sf5n5', 't3_7sepwz', 't3_7sfl6j', 't3_7segn4', 't3_7sfrng', 't3_7sel79', 't3_7selkl', 't3_7seiu5', 't3_7sebyn', 't3_7sefzn', 't3_7sf8fp']
None


In [109]:
#full_list

In [110]:
len(range(25,4000,25))

159

In [117]:
ids = get_endings('http://www.reddit.com')
df = pd.DataFrame()

full_list = zip(range(25,4000,25),ids)

for i, n in full_list:
        url_template = "http://www.reddit.com/?count={}&after={}".format(n,i)
        print url_template
        driver = webdriver.Chrome(executable_path="./chromedriver 2")
        driver.get(url_template)
        time.sleep(3)
        
        html = driver.page_source
        soup = BeautifulSoup(html, 'lxml')
        time.sleep(3)
        df = pd.concat([df,reddit_function(soup)])
        #df.to_csv('file_name', encoding='utf-8', index=False)
        
        driver.close()
df

['t3_7m4404', 't3_7sgrbd', 't3_7sgdzm', 't3_7sgaut', 't3_7sgbfd', 't3_7sfd5v', 't3_7sfi5n', 't3_7sfa7x', 't3_7sfhqk', 't3_7sfdza', 't3_7sf1ra', 't3_7seqkz', 't3_7sewjx', 't3_7sf3oc', 't3_7sexi5', 't3_7selso', 't3_7sel79', 't3_7sfl6j', 't3_7sf5n5', 't3_7sefzn', 't3_7segn4', 't3_7sfrng', 't3_7sebyn', 't3_7selkl', 't3_7seiu5', 't3_7secmb']


TypeError: zip argument #2 must support iteration

In [None]:
df

In [None]:
url_template = '...?count{}&after={}'

In [None]:
#tried a different findall, didn't work as well, sticking to the "compile" thing
#full_soup.findAll({'div':'data-fullname'})

In [None]:
MANUAL

### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or if your computer crashes, you don't lose all your data.

In [None]:
# Export to csv
import pandas as pd
sophie = pd.read_csv('./scraping_results.csv')
sophie.head()

In [None]:
sophie.info()

In [None]:
df = sophie.drop(sophie[['Unnamed: 0', 'created_at','time_now']],axis=1)

In [None]:
df.head()

In [None]:
df.time = pd.to_datetime(df.time_delta)

In [None]:
days = df.time_delta.days
# hours, remainder = divmod(td.seconds, 3600)
# minutes, seconds = divmod(remainder, 60)
# # If you want to take into account fractions of a second
# seconds += td.microseconds / 1e6

## Predicting comments using Random Forests + Another Classifier

#### Load in the the data of scraped results

In [None]:
## YOUR CODE HERE

#### We want to predict a binary variable - whether the number of comments was low or high. Compute the median number of comments and create a new binary variable that is true when the number of comments is high (above the median)

We could also perform Linear Regression (or any regression) to predict the number of comments here. Instead, we are going to convert this into a _binary_ classification problem, by predicting two classes, HIGH vs LOW number of comments.

While performing regression may be better, performing classification may help remove some of the noise of the extremely popular threads. We don't _have_ to choose the `median` as the splitting point - we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict many levels of comment numbers. 

In [None]:
## YOUR CODE HERE

#### What is the baseline accuracy for this model?

In [None]:
## YOUR CODE HERE

#### Create a Random Forest model to predict High/Low number of comments using Sklearn. Start by ONLY using the subreddit as a feature. 

In [None]:
## YOUR CODE HERE

#### Create a few new variables in your dataframe to represent interesting features of a thread title.
- For example, create a feature that represents whether 'cat' is in the title or whether 'funny' is in the title. 
- Then build a new Random Forest with these features. Do they add any value?
- After creating these variables, use count-vectorizer to create features based on the words in the thread titles.
- Build a new random forest model with subreddit and these new features included.

In [None]:
## YOUR CODE HERE

#### Repeat the model-building process with a non-tree-based method.

In [None]:
## YOUR CODE HERE

#### Use Count Vectorizer from scikit-learn to create features from the thread titles. 
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 

In [None]:
## YOUR CODE HERE

# Executive Summary
---
Put your executive summary in a Markdown cell below.

### BONUS
Refer to the README for the bonus parts

In [None]:
## YOUR CODE HERE