# Data gathering

For each article listed on [fivethirtyeight/politics/features](https://fivethirtyeight.com/politics/features/), both the lead article at the top and those under "Latest Politics", we store the title, the URL, the author(s), the date and time posted, the list of tags that 538 assigns to the article, and the number of comments.  The number of comments is the hardest part to scrape, since 538 uses the Facebook comments plugin, which has to be rendered with JavaScript.  First we import the necessary Python modules.

In [1]:
# Import requests to get the html
import requests

# Import BeautifulSoup to parse the html
from bs4 import BeautifulSoup

# Use selenium to render JavaScript to scrape the comments
from selenium import webdriver
from selenium.webdriver.common.by import By

# Import the time module to time the execution of the code
import time

Since scraping the comments is the hardest part, we write a function that will do it.  It only works for articles from [fivethirtyeight.com/features](https://fivethirtyeight.com/features).

In [2]:
def num_comments_538_post(url):
    # Start the timer to time the execution of each iteration of this function
    start = time.time()
    # Function only works when the input is a features article from fivethirtyeight.com
    print("Current url:", url) # for debugging
    # Create a webdriver object with selenium that will get the required html    
    # Here Chrome is used, but the code can be modified to work with other browsers
    driver = webdriver.Chrome()
    # Open the 538 webpage
    driver.get(url)
    # Click the expand comments button
    driver.find_element(By.CLASS_NAME, "fte-expandable-icon").click()
    # Execute the JavaScript after clicking the button
    article_html = driver.execute_script("return document.documentElement.outerHTML;")
    # Close the 538 webpage
    driver.quit()
    # Parse the html
    article_soup = BeautifulSoup(article_html, "html.parser")
    # Find the iframe corresponding to the comments
    comments_frame = article_soup.find('iframe', attrs = {'data-testid':"fb:comments Facebook Social Plugin"})
    # Get the source attribute in the iframe 
    comments_url = comments_frame['src']
    # Redefine the webdriver object (needed to avoid errors)
    driver = webdriver.Chrome()
    # Open the Facebook comments plugin url
    driver.get(comments_url)
    # Execute the JavaScript on that page
    comments_html = driver.execute_script("return document.documentElement.outerHTML;")
    # Close the comments page
    driver.quit()
    # Parse the rendered code
    comments_soup = BeautifulSoup(comments_html, "html.parser")
    # Find the element that contains the number of comments
    number = comments_soup.find('span',  attrs = {'class':"_50f7"}).text.strip(" comments")
    # End the timer
    end = time.time()
    print("Time elapsed:", end-start, "seconds\n") 
    return number
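
The function above returns the count as a string and relies on `str.strip(" comments")`, which removes any of those characters from both ends rather than the literal suffix. It works here because digits are not in that character set. As a minimal sketch of a stricter alternative, the helper below pulls the first run of digits out of the label and returns an integer. The name `parse_comment_count` is hypothetical, and the thousands separator handling assumes labels such as "1,204 comments", which the output in this notebook does not confirm.

```python
import re

def parse_comment_count(label):
    """Return the first integer found in a label such as '7 comments', or None."""
    # Match a digit followed by any mix of digits and commas (assumed thousands separators)
    match = re.search(r"\d[\d,]*", label)
    if match is None:
        return None
    return int(match.group().replace(",", ""))

print(parse_comment_count("7 comments"))      # 7
print(parse_comment_count("1,204 comments"))  # 1204
print(parse_comment_count("comments"))        # None
```

Returning `None` when no number is present makes a missing count easy to tell apart from a count of zero.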

Now we extract the desired data from each headline on the fivethirtyeight.com/politics/features page(s), including the main article and those under "Latest Politics".  The check that would throw out the headlines for podcasts and videos is commented out in this version of the code, so those posts are scraped as well and labeled with their post type.
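
A check on the URL path is one way to tell print articles from videos and podcasts before requesting each post. The sketch below is only an illustration: `is_feature_url` is a hypothetical helper, and it assumes that print articles always have `features` as the first path segment, as the URLs printed by this notebook do.

```python
from urllib.parse import urlparse

def is_feature_url(url):
    """Return True when the first path segment of the URL is 'features'."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    return bool(segments) and segments[0] == "features"

urls = [
    "https://fivethirtyeight.com/features/taylor-swift-eras-tour-polling/",
    "https://fivethirtyeight.com/videos/will-voters-care-if-trump-gets-indicted/",
]
print([is_feature_url(u) for u in urls])  # [True, False]
```

Comparing the path segment rather than testing `"features" in url` avoids matching a video whose slug happens to contain the word "features".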

In [3]:
# Set timer for full execution
start_full = time.time()

# How many pages of features to extract data from -- 14 gives all articles for this year (2023)
features_num_pages = 1 #input("How many pages to scrape?  ")
print("This code will scrape data from", features_num_pages, "page(s) worth of posts in 538's politics/features section.\n")

# Here is where all the data will go
posts = []
# Get the data for each article
for i in range(features_num_pages): 
    print("Calling the number of comments from each post in features page "+str(i+1)+" takes a bit:\n") # for debugging
    # Get the html for each headline
    features_url = "https://fivethirtyeight.com/politics/features/page/"+str(i+1)
    features_html = requests.get(features_url)
    # Parse the html
    features_soup = BeautifulSoup(features_html.content, "html.parser")
    # Gather the data for each of the articles
    features = features_soup.find_all('h2', attrs = {'class':"article-title entry-title"})
    for post in features:
        # Get post title
        title = post.a.text.strip('\n\t')
        # Get post url
        url = post.find('a').get('href')
        # Check the url is for an article
        #if "features" in url:
        # Go to the url to get more data
        post_code = requests.get(url)
        post_soup = BeautifulSoup(post_code.content, "html.parser")
        # Get author(s); find_all returns an empty list (never None) when there are no matches
        authors = [author.text for author in post_soup.find_all('a', attrs = {'rel':"author"})]
        # Get date and time of post
        date = post_soup.find('time').text.strip('\n\t')
        # Get tags
        tags = []
        for tag in post_soup.find_all('a', attrs = {'class':"tag"}):
            tags.append(tag.text.split(" (")[0])
        # Use the tags to get the post type
        if "Politics Podcast" in tags:
            post_type = "podcast"
        else:    
            post_type = post.find('a').get('data-content-type') 
        if post_type is None:
            post_type = "feature"
        # Get number of comments
        num_comments = num_comments_538_post(url)    
        # Add all attributes to list
        posts.append([post_type, title, url, authors, date, tags, num_comments])

# End the timer for the full execution
end_full = time.time()

# Compute time elapsed in seconds
total_time_seconds = end_full-start_full
# In minutes 
total_time_minutes = total_time_seconds/60
if total_time_minutes < 60:
    print("Total time elapsed =", total_time_minutes, "minutes")
else: 
    # In hours
    total_time_hours = total_time_minutes/60
    # Print the time elapsed in hours
    print("Total time elapsed =", total_time_hours, "hours")

# The data
print("Number of posts scraped:", len(posts))
posts

This code will scrape data from 1 page(s) worth of posts in 538's politics/features section.

Calling the number of comments from each post in features page 1 takes a bit:

Current url: https://fivethirtyeight.com/features/unions-have-been-under-attack-for-decades-but-michigan-just-gave-them-a-big-win/
Time elapsed: 13.868533372879028 seconds

Current url: https://fivethirtyeight.com/features/taylor-swift-eras-tour-polling/
Time elapsed: 23.412522554397583 seconds

Current url: https://fivethirtyeight.com/videos/will-voters-care-if-trump-gets-indicted/
Time elapsed: 21.280160665512085 seconds

Current url: https://fivethirtyeight.com/videos/why-is-biden-moving-to-the-political-center/
Time elapsed: 21.82349991798401 seconds

Current url: https://fivethirtyeight.com/features/recess-is-good-for-kids-why-dont-more-states-require-it/
Time elapsed: 28.982764720916748 seconds

Current url: https://fivethirtyeight.com/features/what-we-know-about-trumps-legal-troubles/
Time elapsed: 30.1549375

[['feature',
  'Unions Have Been Under Attack For Decades, But Michigan Just Gave Them A Big Win',
  'https://fivethirtyeight.com/features/unions-have-been-under-attack-for-decades-but-michigan-just-gave-them-a-big-win/',
  ['Monica Potts'],
  'Mar. 24, 2023, at 3:37 PM',
  ['Partisanship', 'State Legislatures', 'Labor', 'Unions'],
  '7'],
 ['feature',
  'Which Taylor Swift Album Is The Most Popular?',
  'https://fivethirtyeight.com/features/taylor-swift-eras-tour-polling/',
  ['Nathaniel Rakich'],
  'Mar. 24, 2023, at 6:00 AM',
  ['Polling',
   'Polls',
   'Pollapalooza',
   'Pollsters',
   'Music',
   'Pop Music',
   'Margin Of Error'],
  '4'],
 ['podcast',
  'Will Voters Care If Trump Gets Indicted?',
  'https://fivethirtyeight.com/videos/will-voters-care-if-trump-gets-indicted/',
  ['Galen Druke',
   'Amelia Thomson-DeVeaux',
   'Nathaniel Rakich',
   'Galen Druke',
   'Amelia Thomson-DeVeaux',
   'Nathaniel Rakich'],
  'Mar. 23, 2023',
  ['Donald Trump',
   'Politics Podcast',
   

In [4]:
# file = open('posts_this_year.txt', 'w')
# for article in posts:
#     file.write(str(article)+"\n")
# file.close()               
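
The commented-out cell above would write each post as the string form of a Python list, which is awkward to read back. As one possible alternative, the sketch below stores one JSON object per line. The field names in `FIELDS` are labels chosen here to match the order used in `posts.append(...)`, the file name `posts.jsonl` is hypothetical, and the sample row is copied from the output above.

```python
import json
import os
import tempfile

# Labels for the seven values stored per post, in the order they are appended
FIELDS = ["post_type", "title", "url", "authors", "date", "tags", "num_comments"]

def save_posts(posts, path):
    """Write each post as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for post in posts:
            f.write(json.dumps(dict(zip(FIELDS, post))) + "\n")

def load_posts(path):
    """Read the posts back as a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

sample = [[
    "feature",
    "Unions Have Been Under Attack For Decades, But Michigan Just Gave Them A Big Win",
    "https://fivethirtyeight.com/features/unions-have-been-under-attack-for-decades-but-michigan-just-gave-them-a-big-win/",
    ["Monica Potts"],
    "Mar. 24, 2023, at 3:37 PM",
    ["Partisanship", "State Legislatures", "Labor", "Unions"],
    "7",
]]

# Write to a temporary directory so the example leaves the working directory untouched
path = os.path.join(tempfile.mkdtemp(), "posts.jsonl")
save_posts(sample, path)
restored = load_posts(path)
print(restored[0]["authors"], restored[0]["num_comments"])
```

Using the `with` statement closes the file even if writing fails partway through.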

# Problem statement

## Question
How can fivethirtyeight.com/politics get more traffic to its articles without compromising its political neutrality and reputation for factually correct content?

## Background

# Stakeholders
News has become more polarized and sensationalized in recent years, all in the name of more clicks. This data analysis could provide some insight into what kind of articles get more traffic, without news organizations having to sacrifice their integrity.

# Key performance indicators (KPIs)
The number of comments an article gets, relative to how long it has been posted.
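
One way this KPI could be computed is as comments per day online. The sketch below is a rough illustration under several assumptions: the date strings look like the two formats in the scraped output ("Mar. 24, 2023, at 3:37 PM" and "Mar. 23, 2023"), month names can be identified by their first three letters, and the time of scraping is recorded separately. The names `parse_post_date`, `comments_per_day` and `scraped_at` are hypothetical, and the scrape time used in the example is made up for illustration.

```python
import re
from datetime import datetime

MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# Month name, optional period, day, year, and an optional ", at H:MM AM/PM" part
DATE_RE = re.compile(
    r"^([A-Za-z]+)\.? (\d{1,2}), (\d{4})(?:, at (\d{1,2}):(\d{2}) ([AP]M))?$"
)

def parse_post_date(text):
    """Parse a date string in one of the two formats seen in the scraped output."""
    match = DATE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Unrecognized date: {text!r}")
    month_name, day, year, hour, minute, meridiem = match.groups()
    month = MONTHS[month_name[:3].lower()]
    if hour is None:
        # No time of day given: fall back to midnight
        return datetime(int(year), month, int(day))
    hour24 = int(hour) % 12 + (12 if meridiem == "PM" else 0)
    return datetime(int(year), month, int(day), hour24, int(minute))

def comments_per_day(num_comments, posted, scraped_at):
    """Divide the comment count by the number of days the post has been online."""
    days_online = (scraped_at - posted).total_seconds() / 86400
    if days_online <= 0:
        raise ValueError("scraped_at must be later than the post date")
    return int(num_comments) / days_online

posted = parse_post_date("Mar. 24, 2023, at 3:37 PM")
scraped_at = datetime(2023, 3, 26, 15, 37)  # hypothetical scrape time, exactly two days later
print(comments_per_day("7", posted, scraped_at))  # 3.5
```

A per-day rate puts a post scraped hours after publication on the same scale as one that has been online for weeks. Very new posts will still give noisy values.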