# Data gathering, problem statement, stakeholders, KPIs

## Data gathering

For each headline listed on [fivethirtyeight/politics/features](https://fivethirtyeight.com/politics/features/), both the main headline at the top and those under "Latest Politics", we store the type of post, its title, its url, the author(s), the date and time posted, the list of tags 538 assigns to the post, and the number of comments.  We scrape all headlines from the features pages except the live blogs, which don't have any comments.  The hardest part to scrape is the number of comments, since 538 uses the Facebook comments plugin, which loads the comments in an iframe rendered by JavaScript.  First we import the necessary Python modules.

In [1]:
# Get the html
import requests

# Parse the html
from bs4 import BeautifulSoup

# Render JavaScript to scrape the comments
from selenium import webdriver
from selenium.webdriver.common.by import By

# Delays and time for execution of the code
import time # for debugging

# Get the date and time
from datetime import datetime

# For splitting using more than one delimiter
import re

# Makes a csv file quickly
import pandas as pd

Since scraping the comments is the hardest part, we write a dedicated function for it.  It only works for posts from [fivethirtyeight.com/features](https://fivethirtyeight.com/features).  Each call takes roughly 20 seconds, so the function prints debugging messages to track its progress.  Any line in the code with the comment "for debugging" can be commented out.

In [2]:
# Input is the url of one of the features posts on fivethirtyeight.com/politics/features pages.
# Output is the number of comments on the post.
def num_comments_538_post(url):
    # Start the timer to time the execution of each iteration of this function
    start = time.time() # for debugging
    # Function only works when the input is a features article from fivethirtyeight.com
    print("Comments scraping current url:", url) # for debugging
    # Create a webdriver object with selenium that will get the required html    
    # Here Chrome is used; the code can be adapted to other browsers
    driver = webdriver.Chrome()
    # Open the 538 webpage after 10 seconds
    time.sleep(10)
    driver.get(url)
    # Click the expand comments button
    driver.find_element(By.CLASS_NAME, "fte-expandable-icon").click()
    # Execute the JavaScript after clicking the button
    article_html = driver.execute_script("return document.documentElement.outerHTML;")
    # Close the 538 webpage
    driver.quit()
    # Parse the html
    article_soup = BeautifulSoup(article_html, "lxml")
    # Find the iframe corresponding to the comments
    comments_frame = article_soup.find('iframe', attrs = {'data-testid':"fb:comments Facebook Social Plugin"})
    # Get the source attribute in the iframe 
    comments_url = comments_frame['src']
    # Redefine the webdriver object (needed to avoid errors)
    driver = webdriver.Chrome()
    # Open the Facebook comments plugin url
    driver.get(comments_url)
    # Execute the JavaScript on that page
    comments_html = driver.execute_script("return document.documentElement.outerHTML;")
    # Close the comments page
    driver.quit()
    # Parse the rendered code
    comments_soup = BeautifulSoup(comments_html,"lxml")
    # Find the element that contains the number of comments
    number = comments_soup.find('span',  attrs = {'class':"_50f7"}).text.strip(" comments")
    print("The number of comments is "+str(number)+".") # for debugging
    # End the timer
    end = time.time() # for debugging
    print("Time elapsed:", end-start, "seconds\n") # for debugging 
    return number
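
Note that the function returns the count as a string, and that `str.strip(" comments")` removes any of those characters from both ends of the text rather than the literal suffix. The following is a minimal sketch of a stricter parser, assuming the label reads like "5 comments" or "1 comment". The helper name `parse_comment_count` and the handling of thousands separators are assumptions, not part of the original notebook.

```python
import re

def parse_comment_count(label):
    """Return the first integer found in a label such as '5 comments'.

    Returns None when the label contains no digits, so the caller can
    decide how to treat a missing count.
    """
    # Match a run of digits, allowing commas as thousands separators
    match = re.search(r"\d[\d,]*", label)
    if match is None:
        return None
    return int(match.group().replace(",", ""))

print(parse_comment_count("5 comments"))
print(parse_comment_count("1 comment"))
```

Returning an integer here would let the "No. of comments" column be numeric from the start instead of being converted later.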

Now we extract the desired data from each headline under "Latest Politics", including the main article, on the [538 features](https://www.fivethirtyeight.com/politics/features) page(s).  In the following code, the authors and tags are originally stored as lists.  However, when we convert all the data into a data frame later, we will need the data to have the right shape -- we need it to be a list of lists, with no additional nested lists.  For authors and tags we turn the list into a string where the items are separated by semicolons instead of commas.  This will make it possible to create a `.csv` file with the data.
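
As an illustration of the flattening step, here is a minimal sketch that uses `str.join` instead of editing the string form of the list. Unlike replacing every comma and apostrophe, it leaves commas and apostrophes inside individual names or tags untouched. The helper `join_items` and all sample values are hypothetical.

```python
def join_items(items, sep="; "):
    """Join a list of strings into a single semicolon-separated field."""
    return sep.join(item.strip() for item in items)

# Hypothetical values standing in for scraped authors and tags
authors_list = ["Author One", "Author Two"]
tags_list = ["2024 Election", "Polls"]

authors = join_items(authors_list)
tags = join_items(tags_list)

# Each post becomes one flat row: a list of strings with no nested lists
row = ["article", "Example title", "https://example.com/post",
       authors, "Example date", tags, "5"]
print(row)
```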

In [3]:
# Get the date and time to put in the name of the output file
now = datetime.now()

# Set timer for full execution
start_full = time.time() # for debugging

# How many pages of features to extract data from
features_num_pages = 51 #input("How many features pages to scrape?  Each has about 10 posts.  ")
#print("This code will scrape data from", features_num_pages, "page(s) worth of posts in 538's politics/features section.\n") # for debugging

# Here is where all the data will go
posts = []
# Get the data for each post
for i in range(features_num_pages): 
    print("\nPage "+str(i+1)+"...\n") # for debugging
    # Get the html for each headline
    features_url = "https://fivethirtyeight.com/politics/features/page/"+str(i+1)
    features_html = requests.get(features_url)
    # Parse the html
    features_soup = BeautifulSoup(features_html.content, "lxml")
    # Gather the data for each of the articles
    features = features_soup.find_all('h2', attrs = {'class':["article-title entry-title", "title entry-title"]})
    for post in features:
        # Get post title from the features page
        title = post.a.text.strip('\n''\t')
        # Get post url from the features page
        url = post.find('a').get('href')
        # Screen for live blogs, which don't have comments
        if "live-blog" in url:
            continue
        # Go to the url to get more data
        post_code = requests.get(url)
        post_soup = BeautifulSoup(post_code.content, "lxml")
        # Get author(s)
        author_bios = post_soup.find_all('div', attrs = {'class':"mini-bio"})
        if author_bios == []:
            authors = "None/All"
        else:    
            authors_list = []
            for author in author_bios:
                # Extract the author name
                to_extract = author.p.text
                to_extract_list = re.split(" is | reports", to_extract)
                authors_list.append(to_extract_list[0])
            authors = str(authors_list).replace(",", ";").strip("[" "]").replace("\'", "")    
        # Get date and time of post
        date = post_soup.find('time').text.strip('\n''\t')
        # Get tags
        tags_list = []
        for tag in post_soup.find_all('a', attrs = {'class':"tag"}):
            tags_list.append(tag.text.split(" (")[0])
        tags = str(tags_list).replace(",", ";").strip("[" "]").replace("\'", "")      
        # Use the tags to get the post type
        if "Politics Podcast" in tags:
            post_type = "podcast"
        else:    
            post_type = post.find('a').get('data-content-type') 
        if post_type is None:
            post_type = "feature"
        # Change the name "feature" to "article"    
        if post_type == "feature":
            post_type = "article"
        # Get number of comments
        num_comments = num_comments_538_post(url)    
        # Add all attributes to list
        posts.append([post_type, title, url, authors, date, tags, num_comments])
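        if len(posts) >= 500:
            break
    # The inner break only leaves the current page, so stop the outer loop too
    if len(posts) >= 500:
        break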

# End the timer for the full execution
end_full = time.time() # for debugging

# Compute time elapsed in seconds
total_time_seconds = end_full-start_full # for debugging
# In minutes 
total_time_minutes = total_time_seconds/60 # for debugging
if total_time_minutes < 60: # for debugging
    print("Total time elapsed =", total_time_minutes, "minutes") # for debugging
else: # for debugging
    # In hours
    total_time_hours = total_time_minutes/60 # for debugging
    # Print the time elapsed in hours
    print("Total time elapsed =", total_time_hours, "hours") # for debugging

# The data
print("Number of posts scraped:", len(posts)) # for debugging
#posts # for debugging


Page 1...

Comments scraping current url: https://fivethirtyeight.com/features/trump-indictment-2024-election/
The number of comments is 5.
Time elapsed: 21.31774663925171 seconds

Comments scraping current url: https://fivethirtyeight.com/features/tiktok-ban-polls/
The number of comments is 5.
Time elapsed: 21.084059715270996 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/new-laws-are-driving-red-and-blue-states-further-apart/
The number of comments is 8.
Time elapsed: 21.445701599121094 seconds

Comments scraping current url: https://fivethirtyeight.com/features/politics-podcast-what-the-laboratories-of-democracy-are-cooking-up/
The number of comments is 2.
Time elapsed: 22.08466100692749 seconds

Comments scraping current url: https://fivethirtyeight.com/features/fighting-inflation-could-hurt-black-workers/
The number of comments is 5.
Time elapsed: 23.80029797554016 seconds

Comments scraping current url: https://fivethirtyeight.com/features/politicians

Comments scraping current url: https://fivethirtyeight.com/features/gop-legislators-trying-pro-life-not-just-anti-abortion/
The number of comments is 8.
Time elapsed: 28.93886685371399 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/why-is-2-percent-the-federal-reserves-magic-number-for-inflation/
The number of comments is 8.
Time elapsed: 22.49292302131653 seconds

Comments scraping current url: https://fivethirtyeight.com/features/why-the-feds-want-2-percent-inflation/
The number of comments is 7.
Time elapsed: 21.50448966026306 seconds

Comments scraping current url: https://fivethirtyeight.com/features/cpac-irrelevant/
The number of comments is 14.
Time elapsed: 22.126647233963013 seconds

Comments scraping current url: https://fivethirtyeight.com/features/covid-lab-leak/
The number of comments is 17.
Time elapsed: 21.40687108039856 seconds

Comments scraping current url: https://fivethirtyeight.com/features/house-435-members-pretty-rare/
[output truncated]

The number of comments is 5.
Time elapsed: 20.953136682510376 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/will-ufos-push-u-s-china-relations-to-the-brink/
The number of comments is 3.
Time elapsed: 23.04699158668518 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/biden-says-he-doesnt-believe-the-polls-he-should/
The number of comments is 5.
Time elapsed: 21.363701343536377 seconds

Comments scraping current url: https://fivethirtyeight.com/features/politics-podcast-american-opinion-of-china-has-plummeted/
The number of comments is 4.
Time elapsed: 21.28281879425049 seconds


Page 10...

Comments scraping current url: https://fivethirtyeight.com/features/biden-cabinet/
The number of comments is 10.
Time elapsed: 21.741394996643066 seconds

Comments scraping current url: https://fivethirtyeight.com/features/are-americans-ready-for-some-football/
The number of comments is 8.
Time elapsed: 21.561134576797485 seconds

[output truncated]

Comments scraping current url: https://fivethirtyeight.com/features/the-freedom-caucus-was-designed-to-disrupt/
The number of comments is 8.
Time elapsed: 20.885664224624634 seconds


Page 14...

Comments scraping current url: https://fivethirtyeight.com/features/bidens-approval-rating-is-up-will-his-misplaced-classified-documents-bring-it-down/
The number of comments is 18.
Time elapsed: 21.34200167655945 seconds

Comments scraping current url: https://fivethirtyeight.com/features/universal-school-vouches-education-culture-wars/
The number of comments is 15.
Time elapsed: 21.117494106292725 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/californias-senate-primary-is-going-to-be-a-doozy/
The number of comments is 8.
Time elapsed: 20.539355278015137 seconds

Comments scraping current url: https://fivethirtyeight.com/features/politics-podcast-why-gas-stoves-became-a-casualty-in-the-culture-war/
The number of comments is 8.
Time elapsed: 20.796862840652466 seconds

Comments scraping current url: https://fivethirtyeight.com/features/politics-podcast-is-there-a-political-realignment-among-latino-voters/
The number of comments is 7.
Time elapsed: 20.935256004333496 seconds

Comments scraping current url: https://fivethirtyeight.com/features/how-gen-z-could-transform-american-politics/
The number of comments is 21.
Time elapsed: 21.266273975372314 seconds

Comments scraping current url: https://fivethirtyeight.com/features/democrats-want-to-put-abortion-on-the-ballot-but-many-states-wont-let-them/
The number of comments is 21.
Time elapsed: 21.378711700439453 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/is-this-the-last-time-this-decade-democrats-will-control-the-senate/
The number of comments is 8.
Time elapsed: 21.66794466972351 seconds

Comments scraping current url: https://fivethirtyeight.com/features/politics-podcast-we-answer-your-lingering-questions-about-2022/
The number of comments is 7.
Time elapsed: 22.711464

Comments scraping current url: https://fivethirtyeight.com/features/house-control-republicans/
The number of comments is 19.
Time elapsed: 20.818581342697144 seconds

Comments scraping current url: https://fivethirtyeight.com/features/emergency-podcast-will-trump-win-the-gop-nomination/
The number of comments is 4.
Time elapsed: 20.728351354599 seconds

Comments scraping current url: https://fivethirtyeight.com/features/a-historic-number-of-women-will-be-governors-next-year/
The number of comments is 4.
Time elapsed: 21.306257486343384 seconds

Comments scraping current url: https://fivethirtyeight.com/features/why-desantis-is-a-major-threat-to-trumps-reelection/
The number of comments is 28.
Time elapsed: 21.351527214050293 seconds

Comments scraping current url: https://fivethirtyeight.com/features/trump-2024-president/
The number of comments is 23.
Time elapsed: 21.233996152877808 seconds


Page 23...

Comments scraping current url: https://fivethirtyeight.com/features/turnout-was-h


Page 27...

Comments scraping current url: https://fivethirtyeight.com/features/what-happens-if-georgias-senate-race-goes-to-a-runoff-again/
The number of comments is 7.
Time elapsed: 21.632493257522583 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/how-do-all-these-republican-polls-affect-the-model/
The number of comments is 9.
Time elapsed: 21.74884819984436 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/all-you-need-to-know-about-georgias-senate-race-through-tv-ads/
The number of comments is 3.
Time elapsed: 21.33098840713501 seconds

Comments scraping current url: https://fivethirtyeight.com/features/politics-podcast-there-are-at-least-seven-incredibly-close-senate-races/
The number of comments is 4.
Time elapsed: 20.870085954666138 seconds

Comments scraping current url: https://fivethirtyeight.com/features/the-case-for-a-republican-sweep-on-election-night/
The number of comments is 50.
Time elapsed: 21.25664758682251 seconds

The number of comments is 8.
Time elapsed: 21.23958730697632 seconds

Comments scraping current url: https://fivethirtyeight.com/features/how-5-asian-american-voters-are-thinking-about-the-midterms/
The number of comments is 10.
Time elapsed: 20.88383674621582 seconds

Comments scraping current url: https://fivethirtyeight.com/features/why-candidates-are-debating-less-often-this-election-cycle/
The number of comments is 13.
Time elapsed: 21.21355438232422 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/want-to-know-who-will-win-the-house-watch-these-4-districts/
The number of comments is 6.
Time elapsed: 21.052101612091064 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/do-americans-actually-get-fed-up-and-move-to-canada/
The number of comments is 8.
Time elapsed: 21.292076587677002 seconds

Comments scraping current url: https://fivethirtyeight.com/features/nevada-election-officials-election-deniers/
The number of comments is 8.
[output truncated]

The number of comments is 5.
Time elapsed: 21.576908111572266 seconds

Comments scraping current url: https://fivethirtyeight.com/features/herschel-walker-scandal/
The number of comments is 33.
Time elapsed: 21.77240800857544 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/georgias-governor-race-will-have-major-implications-for-abortion-access/
The number of comments is 3.
Time elapsed: 22.540541887283325 seconds

Comments scraping current url: https://fivethirtyeight.com/features/young-womens-views-on-abortion-could-reshape-the-midterms-and-the-future-of-politics/
The number of comments is 11.
Time elapsed: 21.514310121536255 seconds

Comments scraping current url: https://fivethirtyeight.com/features/how-5-latino-voters-are-thinking-about-the-midterms/
The number of comments is 9.
Time elapsed: 21.13744854927063 seconds

Comments scraping current url: https://fivethirtyeight.com/features/politics-podcast-what-would-two-more-senators-do-for-democrats/
[output truncated]

Comments scraping current url: https://fivethirtyeight.com/videos/the-midterms-will-determine-if-wisconsins-abortion-laws-stay-in-the-1800s/
The number of comments is 7.
Time elapsed: 20.913244485855103 seconds

Comments scraping current url: https://fivethirtyeight.com/features/this-candidate-thinks-the-2020-election-was-illegitimate-but-hed-rather-you-didnt-know-that/
The number of comments is 20.
Time elapsed: 20.85335659980774 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/republicans-cant-agree-on-new-abortion-restrictions/
The number of comments is 13.
Time elapsed: 21.128437995910645 seconds


Page 40...

Comments scraping current url: https://fivethirtyeight.com/videos/why-are-senate-candidates-avoiding-debates/
The number of comments is 4.
Time elapsed: 21.492499351501465 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/king-charles-iii-isnt-nearly-as-popular-as-the-queen-was-does-that-matter/
The number of comments is 7.
[output truncated]

The number of comments is 2.
Time elapsed: 21.243118047714233 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/why-do-americans-political-opinions-change/
The number of comments is 8.
Time elapsed: 21.140882968902588 seconds


Page 44...

Comments scraping current url: https://fivethirtyeight.com/features/13-races-to-watch-in-florida-and-oklahoma/
The number of comments is 8.
Time elapsed: 21.401195764541626 seconds

Comments scraping current url: https://fivethirtyeight.com/features/politics-podcast-the-trump-investigations-and-what-americans-think-about-them/
The number of comments is 7.
Time elapsed: 20.972690105438232 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/the-election-deniers-running-in-florida-and-new-york/
The number of comments is 12.
Time elapsed: 20.83375883102417 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/do-you-buy-that-democrats-have-a-chance-of-winning-senate-races-in-ohio-and-no

Comments scraping current url: https://fivethirtyeight.com/features/whats-behind-senate-republicans-hesitancy-toward-same-sex-marriage/
The number of comments is 22.
Time elapsed: 21.13559341430664 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/how-successful-would-a-new-third-party-be/
The number of comments is 9.
Time elapsed: 21.787772178649902 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/a-majority-of-americans-think-we-are-in-a-recession/
The number of comments is 9.
Time elapsed: 21.502696752548218 seconds

Comments scraping current url: https://fivethirtyeight.com/videos/bipartisan-legislation-has-an-unexpected-comeback-in-congress/
The number of comments is 4.
Time elapsed: 21.36817979812622 seconds

Comments scraping current url: https://fivethirtyeight.com/features/will-3-pro-impeachment-house-republicans-survive-tuesdays-primaries/
The number of comments is 6.
Time elapsed: 21.237686157226562 seconds

[output truncated]

Now we convert the scraped data to a data frame and save it to a `.csv` file to use in the data exploration phase.

In [4]:
# Use pandas to make a data frame 
df = pd.DataFrame(posts)
df.columns = ["Post type", "Title", "Post url", "Author(s)", "Date and time posted", "Tags", "No. of comments"]
# Then save it as a .csv file, with the index column removed
df.to_csv("ProblemStatementOutputs/"+str(len(posts))+"_"+now.strftime("%d-%m-%Y_%H-%M-%S")+".csv", index = False)

The name of the file has the form (number of posts)\_(day)-(month)-(year in 4 digits)\_(hour in 24-hour time)-(minute)-(second).
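
As a small worked example of this naming scheme, the sketch below builds the file name for a fixed, made-up timestamp and post count. The helper `output_filename` is hypothetical and only mirrors the format string used in the cell above.

```python
from datetime import datetime

def output_filename(num_posts, timestamp):
    """Build '<posts>_<dd-mm-YYYY>_<HH-MM-SS>.csv' from a count and a datetime."""
    return str(num_posts) + "_" + timestamp.strftime("%d-%m-%Y_%H-%M-%S") + ".csv"

# Made-up example: 500 posts scraped on 1 April 2023 at 13:05:09
example = output_filename(500, datetime(2023, 4, 1, 13, 5, 9))
print(example)  # 500_01-04-2023_13-05-09.csv
```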

## Problem statement

Which 538 features posts get the most traffic?

## Stakeholders

News has become more polarized and sensationalized in recent years, all in the name of more clicks. This data analysis could give news organizations some insight into what kinds of articles and other content (podcasts and videos) get more traffic, so that they need not compromise their neutrality or factual correctness to attract it.

## Key performance indicators (KPIs)

- Number of comments a post gets
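
One way this KPI might be summarized in the exploration phase is the mean number of comments per post type. The sketch below is only an illustration under stated assumptions: the rows are made-up (post type, number of comments) pairs rather than scraped results, and the helper `mean_comments_by_type` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up (post type, number of comments) pairs, not real scraped data
rows = [("article", 5), ("video", 8), ("podcast", 2), ("article", 7)]

def mean_comments_by_type(rows):
    """Group comment counts by post type and return the mean for each type."""
    grouped = defaultdict(list)
    for post_type, num_comments in rows:
        # Counts may arrive as strings from the .csv file, so convert them
        grouped[post_type].append(int(num_comments))
    return {post_type: mean(counts) for post_type, counts in grouped.items()}

summary = mean_comments_by_type(rows)
print(summary)
```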