<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Reading-data-from-the-FiveThirtyEight.com" data-toc-modified-id="Reading-data-from-the-FiveThirtyEight.com-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Reading data from the FiveThirtyEight.com</a></span><ul class="toc-item"><li><span><a href="#Class-objects" data-toc-modified-id="Class-objects-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Class objects</a></span></li><li><span><a href="#Data" data-toc-modified-id="Data-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Data</a></span></li><li><span><a href="#References" data-toc-modified-id="References-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>References</a></span></li></ul></li><li><span><a href="#Establish-programming-components" data-toc-modified-id="Establish-programming-components-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Establish programming components</a></span><ul class="toc-item"><li><span><a href="#Import-libraries" data-toc-modified-id="Import-libraries-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Import libraries</a></span></li><li><span><a href="#Establish-classes" data-toc-modified-id="Establish-classes-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Establish classes</a></span></li></ul></li><li><span><a href="#Scrape-data" data-toc-modified-id="Scrape-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Scrape data</a></span><ul class="toc-item"><li><span><a href="#Instantiate-objects" data-toc-modified-id="Instantiate-objects-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Instantiate objects</a></span></li><li><span><a href="#Get-candidates,-poll-data-and-create-a-DataFrame" data-toc-modified-id="Get-candidates,-poll-data-and-create-a-DataFrame-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Get candidates, poll data and create a DataFrame</a></span></li><li><span><a href="#Save-to-a-CSV-file" data-toc-modified-id="Save-to-a-CSV-file-3.3"><span 
class="toc-item-num">3.3&nbsp;&nbsp;</span>Save to a CSV file</a></span></li></ul></li></ul></div>

# Reading data from the FiveThirtyEight.com

This module runs a script to scrape poll data from a specific page at FiveThirtyEight.com. It stores the data in a DataFrame and writes it to a comma-separated values (CSV) file.


## Class objects

To retrieve the poll data, this code uses a class designed to interact with the FiveThirtyEight [page](https://projects.fivethirtyeight.com/election-2016/national-primary-polls/republican/) that provides national polling data for the 2016 Republican presidential primary. The class, Scrape538PollData, scrapes the poll results and stores them in memory. It is built on the [Selenium](https://www.seleniumhq.org/) WebDriver application programming interface (API), which can handle JavaScript-driven web pages, and the Beautiful Soup API, which extracts data from HTML objects.

   * Scrape538PollData attributes:
       * url - the URL of the FiveThirtyEight page
       * chrome_driver - the location on the local machine of the Chrome driver needed by Selenium
       * status_code_ - set to None unless there is an error in retrieving the page
       * html_ - the HTML format of the web page
       * soup_ - a Beautiful Soup object representation of the page (parsed with 'lxml')
       * candidates_ - a list of candidates
       * polls_ - a list of dictionaries containing individual poll data

   * Scrape538PollData methods:
       * collect_page_data() - Collect page data and update the html_ and soup_ attributes
       * extract_polls() - Extract poll data from the page
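
As a rough illustration of the `polls_` structure, the sketch below builds one record by hand, using values from the first row of the output shown later in this notebook. The key names follow `extract_polls()`. The `pollster_url` key is left out because its value is truncated in the displayed output, and `top_two_margin` is a hypothetical helper, not part of the class.

```python
# Candidate names in the order reported by the candidates_ attribute
candidates = ['Trump', 'Kasich', 'Cruz', 'Rubio', 'Carson', 'Bush',
              'Christie', 'Fiorina', 'Santorum', 'Paul', 'Huckabee']

# One hand-built record shaped like an entry of polls_.
# Text fields are strings because they are read from the page's HTML text.
poll = {
    'dates': 'Apr. 25-May 1',
    'pollster': 'SurveyMonkey',
    'sample': '3479',
    'weight': '1.3',
    'leader': 'Trump +34',
}
# Candidates without a reported value are stored as 0
poll.update({name: 0 for name in candidates})
poll.update({'Trump': 0.56, 'Cruz': 0.22, 'Kasich': 0.14})


def top_two_margin(record, names):
    """Return the leading candidate and the lead over the runner-up in points."""
    ranked = sorted(names, key=lambda n: record[n], reverse=True)
    first, second = ranked[0], ranked[1]
    return first, round((record[first] - record[second]) * 100)


polls = [poll]  # polls_ holds one such dictionary per poll
print(top_two_margin(polls[0], candidates))
```

The computed margin agrees with the record's own `leader` field, which is a quick consistency check on a scraped row.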
    


## Data

The data below represent the FiveThirtyEight polling data as of April 28, 2019.  

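|Column        |Description    |
|-----------------|--------------------|
|Bush| Poll result for the candidate, as a fraction (0 if no value is reported)|
|Carson| Poll result for the candidate, as a fraction (0 if no value is reported)|
|Christie| Poll result for the candidate, as a fraction (0 if no value is reported)|
|Cruz| Poll result for the candidate, as a fraction (0 if no value is reported)|
|Fiorina| Poll result for the candidate, as a fraction (0 if no value is reported)|
|Huckabee| Poll result for the candidate, as a fraction (0 if no value is reported)|
|Kasich| Poll result for the candidate, as a fraction (0 if no value is reported)|
|Paul| Poll result for the candidate, as a fraction (0 if no value is reported)|
|Rubio| Poll result for the candidate, as a fraction (0 if no value is reported)|
|Santorum| Poll result for the candidate, as a fraction (0 if no value is reported)|
|Trump| Poll result for the candidate, as a fraction (0 if no value is reported)|
|dates| Dates the poll was conducted|
|leader| Poll leader and margin|
|pollster| Polling organization|
|pollster_url| Link associated with the poll, stored as an HTML anchor tag|
|sample| Sample size of the poll|
|weight| FiveThirtyEight weighting score|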
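
The `leader` column combines a name and a margin, as in `Trump +34` or `Bush +3*` in the output later in this notebook. A minimal sketch for splitting such strings follows. `parse_leader` is a hypothetical helper, the pattern is assumed from the values visible here, and the meaning of the trailing asterisk is not documented in this notebook, so it is only recorded as a flag.

```python
import re

# Pattern assumed from the visible values: "<name> +<points>" with an optional "*"
LEADER_RE = re.compile(r'^(?P<name>.+?)\s*\+(?P<points>\d+)(?P<star>\*?)$')


def parse_leader(text):
    """Split a leader string into (name, points, starred); return None if it does not match."""
    match = LEADER_RE.match(text.strip())
    if match is None:
        return None
    return match['name'], int(match['points']), match['star'] == '*'


print(parse_leader('Trump +34'))
print(parse_leader('Bush +3*'))
```

Returning `None` for non-matching text keeps empty `leader` values, which the scraper stores as `''`, from raising an error.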



## References

- https://stackoverflow.com/questions/48477688/scrape-page-with-load-more-results-button
- https://projects.fivethirtyeight.com/election-2016/national-primary-polls/republican/
 

# Establish programming components

## Import libraries

In [107]:
# Import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import time


## Establish classes

In [108]:
class Scrape538PollData:
    # Default attributes of the data retrieval
    url = 'https://projects.fivethirtyeight.com/election-2016/national-primary-polls/republican/'
    chrome_driver = '/Users/stephengodfrey/anaconda3/envs/dsi/bin/chromedriver'
    status_code_ = None
    html_ = None
    soup_ = None
    
    # Initialization method
    def __init__(self, url = None, chrome_driver = None):
        # Override the default url if one is supplied
        if url:
            self.url = url
        # Override the default chrome_driver path if one is supplied
        if chrome_driver:
            self.chrome_driver = chrome_driver
        # Create the result lists per instance; lists defined at class level
        # would be shared by every instance of the class
        self.candidates_ = []
        self.polls_ = []
    
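    # Method to load the page with Selenium and store its HTML and soup
    def collect_page_data(self):
        driver = None
        # Start Chrome through Selenium and load the page
        try:
            driver = webdriver.Chrome(self.chrome_driver)
            driver.get(self.url)
        except Exception as err:
            # Record the error text; driver may not exist if Chrome failed to start
            self.status_code_ = str(err)
            if driver is not None:
                driver.quit()
            raise ValueError('Error retrieving the web page; see status_code_ for details.') from err

        try:
            # Find the "more polls" button and click it
            # (Selenium 3 API; newer Selenium releases use find_element(By.CSS_SELECTOR, ...))
            driver.find_element_by_css_selector('.more-polls').click()

            # Store the html of the expanded page
            self.html_ = driver.page_source.encode('utf-8')
        finally:
            # Always close the browser, even if the button is not found
            driver.quit()

        # Store the soup version of the page
        self.soup_ = bs(self.html_, 'lxml')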

    # Method to extract the candidate list and poll data from the page
    def extract_polls(self):
        # Check to see that self.soup_ has data, if not run collect_page_data
        if self.soup_ is None or len(self.soup_) == 0:
            self.collect_page_data()

        # Start from an empty list so repeated calls do not duplicate polls
        self.polls_ = []

        # Find the list of candidates
        table = self.soup_.find('table')
        self.candidates_ = [c.text for c in table.find_all('th', {'class':'th th-rotate'})]

        # Find the poll data
        body = self.soup_.find('tbody')
        polls = body.find_all('tr',{'class': 't-row'})

        # For each poll extract information related to the poll
        for poll in polls:
            pl_d = {}
            pl_d['dates'] = poll.find('td', {'class':'t-dates'}).text
            # This stores the whole <a> tag; index it with ['href'] to keep only the URL
            pl_d['pollster_url'] = poll.find('a', href = True)
            pl_d['pollster'] = poll.find('td', {'class': 't-pollster t-left-margin'}).text
            pl_d['sample'] = poll.find('td', {'class': 't-sample t-left-margin t-right-align only-full'}).text
            pl_d['weight'] = poll.find('td',
                                       {'class': 't-weight t-left-margin t-right-margin double-l-margin t-right-border-dark'}).text
            # The leader cell is missing in some rows; find() then returns None
            try:
                pl_d['leader'] = poll.find('td',
                                           {'class':'t-leader t-left-margin t-right-margin only-full color-text-rep'}).text
            except AttributeError:
                pl_d['leader'] = ''

            # Get the odds for each candidate except for the last candidate in the table
            for i, odds in enumerate(poll.find_all('td', {'class':'t-center-align td-cand-odds td-block t-right-border'})):
                # This tag is present if a value exists in the poll for that candidate
                if odds.find('div', {'class':'t-cand-odds heat-map-blocks'}):
                    pl_d[self.candidates_[i]] = float(odds.text.replace('%','').strip())/100
                else:
                    pl_d[self.candidates_[i]] = 0

            # Get the odds for the last candidate, whose cell lacks the right-border class
            odds = poll.find('td', {'class':'t-center-align td-cand-odds td-block'})
            # This tag is present if a value exists in the poll for that candidate
            if odds.find('div', {'class':'t-cand-odds heat-map-blocks'}):
                pl_d[self.candidates_[-1]] = float(odds.text.replace('%','').strip())/100
            else:
                pl_d[self.candidates_[-1]] = 0

            self.polls_.append(pl_d)
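
Because `extract_polls()` appends to `polls_` in place, where that list is created matters. A list assigned in the class body is shared by every instance, while one assigned in `__init__` belongs to a single instance. The sketch below shows the difference with two hypothetical toy classes, not the scraper itself.

```python
class SharedList:
    items = []  # created once, in the class body

    def add(self, value):
        self.items.append(value)  # mutates the single shared list


class OwnList:
    def __init__(self):
        self.items = []  # created anew for each instance

    def add(self, value):
        self.items.append(value)


a, b = SharedList(), SharedList()
a.add('poll 1')
print(b.items)  # b sees a's entry because both names refer to the same list

c, d = OwnList(), OwnList()
c.add('poll 1')
print(d.items)  # d's list is untouched
```

For a scraper, the shared form would mean that a second object, for example one pointed at a different URL, starts with the first object's polls already in its list.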
  
        

# Scrape data

## Instantiate objects

In [109]:
# Instantiate a poll object and get poll data
rep_2016_polls = Scrape538PollData()
rep_2016_polls.extract_polls()


## Get candidates, poll data and create a DataFrame

In [110]:
# Print the candidate list and put the poll data in a DataFrame
print(rep_2016_polls.candidates_)
df = pd.DataFrame(rep_2016_polls.polls_)
df.head()


['Trump', 'Kasich', 'Cruz', 'Rubio', 'Carson', 'Bush', 'Christie', 'Fiorina', 'Santorum', 'Paul', 'Huckabee']


| |Bush|Carson|Christie|Cruz|Fiorina|Huckabee|Kasich|Paul|Rubio|Santorum|Trump|dates|leader|pollster|pollster_url|sample|weight|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0.0|0.0|0.0|0.22|0.0|0.0|0.14|0.0|0.0|0.0|0.56|Apr. 25-May 1|Trump +34|SurveyMonkey|`<a href="http://www.nbcnews.com/politics/2016-...`|3479|1.3|
|1|0.0|0.0|0.0|0.28|0.0|0.0|0.19|0.0|0.0|0.0|0.49|Apr. 22-26|Trump +21|YouGov|`<a href="https://today.yougov.com/news/2016/04...`|499|0.99|
|2|0.0|0.0|0.0|0.25|0.0|0.0|0.19|0.0|0.0|0.0|0.49|Apr. 28-May 1|Trump +24|Opinion Research Corporation|`<a href="https://assets.documentcloud.org/docu...`|406|0.98|
|3|0.0|0.0|0.0|0.2|0.0|0.0|0.13|0.0|0.0|0.0|0.56|Apr. 29-May 2|Trump +36|Morning Consult|`<a href="https://morningconsult.com/2016/05/do...`|723|0.96|
|4|0.0|0.0|0.0|0.26|0.0|0.0|0.17|0.0|0.0|0.0|0.56|Apr. 29-May 3|Trump +30|Ipsos, online|`<a href="http://polling.reuters.com/#poll/TR13...`|244|0.94|
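
The `pollster_url` column above holds whole anchor tags rather than bare URLs. One way to pull out the address is sketched below, using the standard library's `html.parser` as a stand-in for Beautiful Soup. `first_href` is a hypothetical helper, and the sample tag uses a placeholder URL because the real ones are truncated in the display.

```python
from html.parser import HTMLParser


class _HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def first_href(html_text):
    """Return the first href found in html_text, or None if there is none."""
    collector = _HrefCollector()
    collector.feed(str(html_text))  # str() also accepts non-string tag objects
    collector.close()
    return collector.hrefs[0] if collector.hrefs else None


# Placeholder tag for illustration only
print(first_href('<a href="https://example.com/poll">SurveyMonkey</a>'))
```

On the DataFrame this could be applied with `df['pollster_url'].map(first_href)`. Inside the scraper, indexing the Beautiful Soup tag with `['href']` would give the same value without a second parse.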


In [79]:
df.tail()

| |Bush|Carson|Christie|Cruz|Fiorina|Huckabee|Kasich|Paul|Rubio|Santorum|Trump|dates|leader|pollster|pollster_url|sample|weight|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|665|0.16|0.09|0.07|0.04|0.0|0.1|0.02|0.12|0.07|0.01|0.0|Mar. 13-15|Bush +3*|Opinion Research Corporation|`<a href="http://i2.cdn.turner.com/cnn/2015/ima...`|450|`\n\n0.00\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...`|
|666|0.16|0.07|0.08|0.06|0.0|0.08|0.01|0.06|0.05|0.02|0.0|Feb. 26-Mar. 2|Walker +2*|Quinnipiac University|`<a href="http://www.quinnipiac.edu/news-and-ev...`|554|`\n\n0.00\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...`|
|667|0.16|0.06|0.07|0.05|0.0|0.1|0.01|0.1|0.08|0.02|0.0|Feb. 21-23|Walker +2*|YouGov|`<a href="http://d25d2506sfb94s.cloudfront.net/...`|255|`\n\n0.00\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...`|
|668|0.12|0.09|0.07|0.03|0.01|0.17|0.02|0.11|0.06|0.02|0.0|Feb. 12-15|Huckabee +5|Opinion Research Corporation|`<a href="http://www.realclearpolitics.com/docs...`|385|`\n\n0.00\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...`|
|669|0.15|0.1|0.06|0.04|0.0|0.13|0.02|0.13|0.05|0.02|0.0|Jan. 25-27|Bush +2|Fox News|`<a href="http://www.foxnews.com/politics/inter...`|394|`\n\n0.00\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...`|
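
In the older rows above, `weight` comes back as text padded with newlines (and cut off in the display), so it needs cleaning before any numeric use. A minimal sketch follows, assuming the first whitespace-separated token is the weight. `clean_weight` is a hypothetical helper, and whatever follows the newlines in the real cells is not visible here, so it is simply ignored.

```python
def clean_weight(raw):
    """Return the first whitespace-separated token of raw as a float, or None."""
    tokens = str(raw).split()
    if not tokens:
        return None
    try:
        return float(tokens[0])
    except ValueError:
        # The first token was not numeric
        return None


print(clean_weight('\n\n0.00\n\n\n'))  # padded value like those in the older rows
print(clean_weight('1.3'))             # already clean value like those in the newer rows
```

With pandas, the same helper could be applied column-wide with `df['weight'].map(clean_weight)`.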


## Save to a CSV file

In [111]:
df.to_csv('../data/poll_data.csv', index = False)