Temporary Outline
- Intro
- Training sentiment analysis on tweets
- Making sure model generalizes
- Scraping tweets from Twitter
- Ajax requests with Selenium
- Sentiment analysis on tweets
- Swing counties in the US
- Predicting swing county outcomes with tweet sentiment as features
- Optional: Better sentiment analysis by clustering of manual initial labels

In [54]:
import os
import sys
import time
import json
import requests
import urllib.request
from bs4 import BeautifulSoup as BS

try:
    from selenium import webdriver
except ImportError:
    # this will install selenium in your environment
    !{sys.executable} -m pip install --user selenium
    from selenium import webdriver

from selenium.webdriver.common.keys import Keys

try:
    import tweepy
except ImportError:
    # this will install tweepy in your environment
    !{sys.executable} -m pip install --user tweepy
    import tweepy


## Twitter scraping

We build the search URL with our parameters encoded in the query string: the search terms, the nearby location and the date range. We also define a user agent so that the website treats us as a Chrome browser and sends us the real webpage. This is not sufficient, however. The actual data we want is loaded with AJAX requests after the HTML is sent, so using the requests library alone won't work. This is where Selenium comes into play: it drives a real browser, so the page's scripts run and the tweets load as we scroll.
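As a rough illustration of how such a search URL can be assembled, the sketch below percent-encodes a raw query string with `urllib.parse.quote`. The helper name `make_search_url` and its parameters are hypothetical and not part of this notebook; the search operators (`until:`, `since:`, `near:`) and the `src` and `f` parameters are copied from the URLs used later on.

```python
from urllib.parse import quote


def make_search_url(terms, place, since="2016-08-08", until="2016-11-08"):
    """Build a search URL in the same shape as the hardcoded ones (hypothetical helper)."""
    # Join the search terms with OR, e.g. "(donald OR trump)"
    joined_terms = " OR ".join(terms)
    raw_query = f'({joined_terms}) until:{until} since:{since} near:"{place}"'
    # Percent-encode spaces, colons and quotes; keep parentheses and commas readable
    encoded_query = quote(raw_query, safe="(),")
    return f"https://twitter.com/search?q={encoded_query}&src=typed_query&f=live"


print(make_search_url(["donald", "trump"], "philadelphia county"))
```

Keeping the raw query human-readable and encoding it in one place makes it easier to change the date range or location without hand-editing `%20` and `%3A` sequences.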

In [55]:
def get_data_from_internet(urls):
    all_data = []
    
    # Present ourselves as a regular Chrome browser so the site serves the full page
    user_agent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"

    # to run this code on your machine you need to specify your Chrome browser's binary location
    options = webdriver.ChromeOptions()
    options.add_argument(f"--user-agent={user_agent}")
    # raw string so the backslashes in the Windows path are not treated as escapes
    options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
    chrome_driver_binary = "required_files/chromedriver.exe"
    browser = webdriver.Chrome(chrome_driver_binary, options=options)


    for i, url in enumerate(urls):
        
        browser.get(url)
        time.sleep(1)

        body = browser.find_element_by_tag_name("body")

        for _ in range(100):
            body.send_keys(Keys.PAGE_DOWN)
            time.sleep(0.2)

        print(f"Formatting and cleaning data {i}")
        tweets = browser.find_elements_by_class_name("js-tweet-text-container")
        tweets = [t.text for t in tweets]
        tweets = [t.split("https://")[0].lower() for t in tweets]
    
        print(f"{len(tweets)} tweets found")
        all_data += tweets
        
    browser.quit()
    all_data = list(set(all_data))
    return all_data
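
The cleaning and de-duplication step inside the scraper can be pulled out into a pure function, which makes it testable without launching a browser. This is only a sketch: `clean_tweets` is a hypothetical name, and trimming whitespace and dropping empty strings are assumptions added here, not behavior of the function above. It also swaps `set` for `dict.fromkeys` so that the first-seen order of tweets is preserved.

```python
def clean_tweets(raw_texts):
    """Lowercase tweets, cut them at the first link and drop duplicates (hypothetical helper)."""
    cleaned = []
    for text in raw_texts:
        # Keep only the part before the first link, as the scraper does
        body = text.split("https://")[0].lower()
        # Assumption: trim whitespace so near-identical tweets collapse into one
        body = body.strip()
        # Assumption: skip tweets that are empty after cleaning
        if body:
            cleaned.append(body)
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(cleaned))


sample = ["Make it great https://t.co/abc", "make it great", "Vote today!"]
print(clean_tweets(sample))
```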

To actually get our data, we first check our cache. This prevents unnecessary requests. If we don't have the data for a specific query, we call the function above to fetch it. To get the URLs to pass as input, we call the get_urls function with our query and location.

In [56]:
#TODO: move this function to python file outside project 
#(we're allowed to have 1 python file with all our code that we don't need the reader to see)

#We figured out optimal urls to get the most possible data
def get_urls(query, location):
    #this function is currently hardcoded
    #return list of urls for current query and location
    urls = []
    if (query == "trump" and location == "philadelphia"):
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22philadelphia%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22philadelphia,%20PA%22&src=typed_query&f=live")
        
    elif (query == "clinton" and location == "philadelphia"):
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22philadelphia%20county%22&src=typeahead_click&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22philadelphia,%20PA%22&src=typed_query&f=live")
    
    elif (query == "trump" and location == "chester"):
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22chester%20county%22&src=typeahead_click&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22west%20chester,%20PA%22&src=typeahead_click&f=live")
        
    elif (query == "clinton" and location == "chester"):
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22chester%20county%22&src=typeahead_click&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22west%20chester,%20PA%22&src=typeahead_click&f=live")
        
    elif (query == "trump" and location == "belmont"):
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22belmont%20county%22&src=typeahead_click&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22martins%20ferry,%20OH%22&src=typeahead_click&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22st%20clairsville,%20OH%22&src=typeahead_click&f=live")
        
    elif (query == "clinton" and location == "belmont"):
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22belmont%20county%22&src=typeahead_click&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22martins%20ferry,%20OH%22&src=typeahead_click&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22st%20clairsville,%20OH%22&src=typeahead_click&f=live")
        
    elif (query == "trump" and location == "hamilton"):
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22hamilton%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22cincinnati,%20OH%22&src=typed_query&f=live")
    
    elif (query == "clinton" and location == "hamilton"):
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22hamilton%20county%22&src=typeahead_click&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22cincinnati,%20OH%22&src=typed_query&f=live")
    
    elif (query == "trump" and location == "newhanover"):
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22new%20hanover%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22wilmington,%20NC%22&src=typed_query&f=live")
    
    elif (query == "clinton" and location == "newhanover"):
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22new%20hanover%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22wilmington,%20NC%22&src=typed_query&f=live")
        
    elif (query == "trump" and location == "wake"):
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22wake%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22raleigh,%20NC%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22cary,%20NC%22&src=typed_query&f=live")
    
    elif (query == "clinton" and location == "wake"):
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22wake%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22raleigh,%20NC%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22cary,%20NC%22&src=typed_query&f=live")
    
    elif (query == "trump" and location == "watauga"):
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22watauga%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22boone,%20NC%22&src=typed_query&f=live")        
    
    elif (query == "clinton" and location == "watauga"):
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22watauga%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22boone,%20NC%22&src=typed_query&f=live")
    
    elif (query == "trump" and location == "duval"):
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22duval%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22jacksonville,%20FL%22&src=typed_query&f=live")
    
    elif (query == "clinton" and location == "duval"):
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22duval%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22jacksonville,%20FL%22&src=typed_query&f=live")
    
    elif (query == "trump" and location == "hillsborough"):
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22hillsborough%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22tampa,%20FL%22&src=typed_query&f=live")

    elif (query == "clinton" and location == "hillsborough"):
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22hillsborough%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22tampa,%20FL%22&src=typed_query&f=live")
    
    elif (query == "trump" and location == "miamidade"):
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22miami-dade%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22miami,%20FL%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22aventura,%20FL%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22homestead,%20FL%22&src=typed_query&f=live")
            
    elif (query == "clinton" and location == "miamidade"):
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22miami-dade%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22miami,%20FL%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22aventura,%20FL%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22homestead,%20FL%22&src=typed_query&f=live")
        
    elif (query == "trump" and location == "allegheny"):
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22allegheny%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22pittsburgh,%20PA%22&src=typed_query&f=live")
    
    elif (query == "clinton" and location == "allegheny"):
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22allegheny%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22pittsburgh,%20PA%22&src=typed_query&f=live")
    
    elif (query == "trump" and location == "atlantic"):
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22atlantic%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22atlantic%20city,%20NJ%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22hammonton,%20NJ%22&src=typed_query&f=live")
        
    elif (query == "clinton" and location == "atlantic"):
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22atlantic%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22atlantic%20city,%20NJ%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22hammonton,%20NJ%22&src=typed_query&f=live")

    elif (query == "trump" and location == "maricopa"):
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22maricopa%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22surprise,%20AZ%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22phoenix,%20AZ%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22scottsdale,%20AZ%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(donald%20OR%20trump)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22chandler,%20AZ%22&src=typed_query&f=live")
        
    elif (query == "clinton" and location == "maricopa"):
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22maricopa%20county%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22surprise,%20AZ%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22phoenix,%20AZ%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22scottsdale,%20AZ%22&src=typed_query&f=live")
        urls.append("https://twitter.com/search?q=(hillary%20OR%20clinton)%20until%3A2016-11-08%20since%3A2016-08-08%20near%3A%22chandler,%20AZ%22&src=typed_query&f=live")
    
    return urls
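
Since the hardcoded URLs differ only in the search terms and the place name, the if/elif chain could be replaced by a lookup table. The sketch below is one possible shape and not the notebook's actual code: `URL_TEMPLATE`, `SEARCH_TERMS`, `PLACES` and `get_urls_from_table` are hypothetical names, only two locations are filled in, and every URL uses `src=typed_query`, whereas some of the original URLs use `src=typeahead_click`.

```python
from urllib.parse import quote

# Template copied from the hardcoded URLs; only the terms and the place vary
URL_TEMPLATE = (
    "https://twitter.com/search?q=({terms})%20until%3A2016-11-08"
    "%20since%3A2016-08-08%20near%3A%22{place}%22&src=typed_query&f=live"
)

# Already percent-encoded search terms per candidate
SEARCH_TERMS = {
    "trump": "donald%20OR%20trump",
    "clinton": "hillary%20OR%20clinton",
}

# Place names to search near, per location key (subset shown for brevity)
PLACES = {
    "philadelphia": ["philadelphia county", "philadelphia, PA"],
    "belmont": ["belmont county", "martins ferry, OH", "st clairsville, OH"],
}


def get_urls_from_table(query, location):
    """Return the search URLs for a query and location, or [] if either is unknown."""
    terms = SEARCH_TERMS.get(query)
    if terms is None:
        return []
    places = PLACES.get(location, [])
    # Encode spaces in the place name but keep the comma, matching the original URLs
    return [URL_TEMPLATE.format(terms=terms, place=quote(place, safe=",")) for place in places]


for url in get_urls_from_table("clinton", "belmont"):
    print(url)
```

Adding a county then means adding one entry to `PLACES` instead of two new branches.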


In [57]:
def get_data(query, location):
    print()
    print(query, location)
    # Get the data, either from cache or from the internet
    urls = get_urls(query, location)
    cache_file = "required_files/" + query + "_" + location + "_cache.txt"
    try:
        with open(cache_file, "r") as f:
            data = json.loads(f.read())
        print("Found cache")
    except (FileNotFoundError, json.JSONDecodeError):
        # No cache yet (or an empty/corrupt one), so scrape the data
        print("Collecting data from the internet")
        data = get_data_from_internet(urls)
        # Only write the cache when we actually got data, so an empty result is retried next time
        if data:
            with open(cache_file, "w") as f:
                f.write(json.dumps(data))
    print(f"Collected {len(data)} unique tweets for this query and location")
    return data

In [59]:
#Collect tweets for every candidate query and every county
#The search urls in get_urls cover tweets between the dates 2016-08-08 and 2016-11-08
queries = ["trump", "clinton"]
locations = ["philadelphia", "chester", "belmont", "hamilton", "newhanover", 
             "wake", "watauga", "duval", "hillsborough", "miamidade", 
             "allegheny", "atlantic", "maricopa"]

for query in queries:
    for location in locations:
        get_data(query, location)


trump philadelphia
Found cache
Collected 147 unique tweets for this query and location

trump chester
Found cache
Collected 56 unique tweets for this query and location

trump belmont
Found cache
Collected 7 unique tweets for this query and location

trump hamilton
Found cache
Collected 58 unique tweets for this query and location

trump newhanover
Found cache
Collected 15 unique tweets for this query and location

trump wake
Found cache
Collected 47 unique tweets for this query and location

trump watauga
Found cache
Collected 238 unique tweets for this query and location

trump duval
Found cache
Collected 72 unique tweets for this query and location

trump hillsborough
Found cache
Collected 72 unique tweets for this query and location

trump miamidade
Found cache
Collected 142 unique tweets for this query and location

trump allegheny
Found cache
Collected 76 unique tweets for this query and location

trump atlantic
Found cache
Collected 47 unique tweets for this query and location
