## For Next Time

- Add a date column to the regression.
- The goal is essentially: poll company + date = spread of that company's next poll.
- Look at the pollsters' white papers to see their polling schedules, which would supply the date.
- Fit the model every day, generate a prediction for every pollster, and check it against any poll that comes out the next day.
- Report real-time predictions with error estimates.
- Start with just Biden vs. Trump: print predictions for the most common pollsters, maybe write them to a .txt file, and then compare them with the actual results.
- Use RealClearPolling as the data source for that.
- Explore sentiment analysis and trending search terms.
- Plot a daily polling aggregate that accounts for some form of bias by polling company.
- Fit a standard time series model using all of the polling data.
- For a more flexible model, consider a Gaussian process, GAM, splines, time series methods, or polynomial regression.
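
One way the bias-adjusted daily aggregate from the list above might look is sketched here. This is a minimal illustration, not a settled method: it assumes a pollster's "house effect" can be approximated by that pollster's mean spread minus the overall mean spread, and the `polls` data and function names are made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical polls: (pollster, day index, Trump-minus-Biden spread in points)
polls = [
    ("A", 1, 4.0), ("B", 1, 0.0),
    ("A", 2, 6.0), ("B", 2, 2.0),
]

def house_effects(polls):
    """Assumed definition: pollster mean spread minus the overall mean spread."""
    overall = mean(spread for _, _, spread in polls)
    by_pollster = defaultdict(list)
    for pollster, _, spread in polls:
        by_pollster[pollster].append(spread)
    return {p: mean(v) - overall for p, v in by_pollster.items()}

def adjusted_daily_average(polls):
    """Subtract each pollster's house effect, then average the spreads per day."""
    effects = house_effects(polls)
    by_day = defaultdict(list)
    for pollster, day, spread in polls:
        by_day[day].append(spread - effects[pollster])
    return {day: mean(v) for day, v in sorted(by_day.items())}

effects = house_effects(polls)
daily = adjusted_daily_average(polls)
print(effects)
print(daily)
```

This naive estimate mixes house effects with timing when pollsters survey on different days, so a regression with both pollster and time terms is probably the better long-term route.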

Poll_outcome = poll company + time_since + ...?
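
The formula above can be sketched as an ordinary least-squares fit with pollster dummy variables plus powers of time. This is only an illustration under assumptions: the data are synthetic, `design_matrix` is a hypothetical helper, and `degree=1` gives a linear trend while higher degrees give the polynomial regression mentioned in the notes.

```python
import numpy as np

# Hypothetical data: pollster label, days since a reference date, observed spread
pollsters = np.array(["A", "B", "A", "B", "A", "B"])
days = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
# Synthetic spreads: 1.0 + 0.5 * days, with pollster B running 2 points higher
spread = 1.0 + 0.5 * days + np.where(pollsters == "B", 2.0, 0.0)

def design_matrix(pollsters, days, levels, degree=1):
    """Intercept, pollster dummies (first level is the baseline), powers of time."""
    cols = [np.ones(len(days))]
    cols += [(pollsters == lvl).astype(float) for lvl in levels[1:]]
    cols += [days ** k for k in range(1, degree + 1)]
    return np.column_stack(cols)

levels = sorted(set(pollsters))
X = design_matrix(pollsters, days, levels, degree=1)
coef, *_ = np.linalg.lstsq(X, spread, rcond=None)

# Predict the spread each pollster would report on day 6, baseline included
next_day = np.full(len(levels), 6.0)
pred = design_matrix(np.array(levels), next_day, levels, degree=1) @ coef
print(dict(zip(levels, pred)))
```

Building the prediction rows from the same helper keeps the baseline pollster (all dummies zero) in the output. For higher degrees, rescaling `days` first would likely help, since raw day counts raised to large powers make the fit ill-conditioned.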

Keep thinking about how to bring in Google Trends data.

In [52]:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import plotly.express as px

# Suppress a specific warning
warnings.simplefilter("ignore", category=FutureWarning)

In [53]:
base_url = "https://www.realclearpolling.com/polls/"
sample_url = "president/general/2024/trump-vs-biden"
url = base_url + sample_url

In [54]:
# Create a webdriver instance and get the page source
driver = webdriver.Chrome()
driver.get(url)

# Allow time for dynamic content to load (you may need to adjust the sleep duration)
time.sleep(3)

# Get the page source after dynamic content has loaded
html_content = driver.page_source

# Close the webdriver
driver.quit()

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, 'html5lib')

# Extract the polling table from the parsed page
table = soup.find_all('table')

# When the page renders two tables, use the second one; otherwise use the first
if len(table) == 2:
    table = table[1]
else:
    table = table[0]

table_data = []
for row in table.find_all('tr'):
    row_data = [cell.text.strip() for cell in row.find_all(['td','th'])]
    table_data.append(row_data)

In [55]:
# Convert the table data to a DataFrame.
# Row 0 holds the column names and row 1 is the RCP summary row, which we don't want to use.

current_year = str(datetime.now().year)
prev_year = str(datetime.now().year - 1)

df = pd.DataFrame(table_data[2:], columns=table_data[0])
df["Difference"] = df["Trump (R)"].astype(float) - df["Biden (D)"].astype(float)
df["Type of Voter"] = df["sample"].str.split(" ").str[1]
df["Sample Size"] = pd.to_numeric(df["sample"].str.split(" ").str[0], errors="coerce").fillna(0).astype(int)
# The table lists dates as month/day only, so we need to add the year to build a datetime.
# The year must be the year the poll was taken, not necessarily the current year.
df["End Date"] = df["date"].str.split("-").str[1].str.strip()
df["Poll Month"] = df["End Date"].str.split("/").str[0].astype(int)
# Assumes rows run newest to oldest and span at most two calendar years:
# rows before the first December poll get the current year, the rest the previous year.
dec_rows = df.index[df["Poll Month"] == 12]
first_dec = dec_rows[0] if len(dec_rows) > 0 else len(df)  # No December rows: all current year
df["Year"] = [current_year] * first_dec + [prev_year] * (len(df) - first_dec)
df["End Date"] = df["End Date"] + "/" + df["Year"]
df["End Date"] = pd.to_datetime(df["End Date"], format="mixed")
# Copy the filtered frame so the column assignments below don't trigger SettingWithCopyWarning
df = df[df["Type of Voter"].isin(["RV", "LV"])].copy()
df["Biden (D)"] = df["Biden (D)"].astype(float)
df["Trump (R)"] = df["Trump (R)"].astype(float)
df.head()

Unnamed: 0,pollster,date,sample,moe,Trump (R),Biden (D),spread,Difference,Type of Voter,Sample Size,End Date,Poll Month,Year
0,Rasmussen Reports,2/13 - 2/15,868 LV,3.0,47.0,41.0,Trump+6,6.0,LV,868,2024-02-15,2,2024
1,Emerson,2/13 - 2/14,1225 RV,2.7,45.0,44.0,Trump+1,1.0,RV,1225,2024-02-14,2,2024
2,Morning Consult,2/17 - 2/19,6321 RV,1.0,45.0,41.0,Trump+4,4.0,RV,6321,2024-02-19,2,2024
3,Economist/YouGov,2/11 - 2/13,1470 RV,3.1,44.0,44.0,Tie,0.0,RV,1470,2024-02-13,2,2024
4,Morning Consult,2/9 - 2/11,6164 RV,1.0,43.0,42.0,Trump+1,1.0,RV,6164,2024-02-11,2,2024
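
The year-assignment step in the cell above relies on finding the first December row. A slightly more general approach might walk down the newest-first list and step the year whenever the month jumps by more than half a year. The sketch below is a standalone illustration with made-up dates; `assign_years` is a hypothetical helper, not part of the notebook, and it assumes the rows are in roughly reverse chronological order.

```python
from datetime import date

def assign_years(month_days, latest_year):
    """Attach a year to 'M/D' strings listed newest-first."""
    dates = []
    year = latest_year
    prev_month = None
    for md in month_days:
        month, day = (int(part) for part in md.strip().split("/"))
        if prev_month is not None:
            if month - prev_month > 6:
                # e.g. 1 -> 12: moved back across New Year into the previous year
                year -= 1
            elif prev_month - month > 6:
                # e.g. 12 -> 1: a slightly out-of-order row from the later year
                year += 1
        dates.append(date(year, month, day))
        prev_month = month
    return dates

# Hypothetical end dates, including one out-of-order row around New Year
end_dates = assign_years(["2/15", "1/3", "12/30", "1/2", "12/20", "11/5"], latest_year=2024)
for d in end_dates:
    print(d.isoformat())
```

Unlike the December lookup, this handles a table with no December polls and one that spans more than two calendar years, though it still depends on the rows being close to sorted.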


In [56]:
#Create a time series plot of the candidates' polling numbers using plotly
fig = px.scatter(df, x="End Date", y=["Trump (R)", "Biden (D)"], title="Trump vs. Biden Polling Numbers",  color_discrete_map={"Trump (R)":"red", "Biden (D)":"blue"})
fig.show()

In [57]:
# Use an unambiguous ISO date for the reference day (January 1, 2023)
df["Days Since 01-01-23"] = (df["End Date"] - pd.to_datetime("2023-01-01")).dt.days

df.head()

Unnamed: 0,pollster,date,sample,moe,Trump (R),Biden (D),spread,Difference,Type of Voter,Sample Size,End Date,Poll Month,Year,Days Since 01-01-23
0,Rasmussen Reports,2/13 - 2/15,868 LV,3.0,47.0,41.0,Trump+6,6.0,LV,868,2024-02-15,2,2024,410
1,Emerson,2/13 - 2/14,1225 RV,2.7,45.0,44.0,Trump+1,1.0,RV,1225,2024-02-14,2,2024,409
2,Morning Consult,2/17 - 2/19,6321 RV,1.0,45.0,41.0,Trump+4,4.0,RV,6321,2024-02-19,2,2024,414
3,Economist/YouGov,2/11 - 2/13,1470 RV,3.1,44.0,44.0,Tie,0.0,RV,1470,2024-02-13,2,2024,408
4,Morning Consult,2/9 - 2/11,6164 RV,1.0,43.0,42.0,Trump+1,1.0,RV,6164,2024-02-11,2,2024,406


In [58]:
ndf = pd.get_dummies(df[["pollster"]], drop_first=True)
ndf["Days Since 01-01-23"] = df["Days Since 01-01-23"]
ndf.head()

Unnamed: 0,pollster_CBS News,pollster_CNN,pollster_Daily Kos/Civiqs,pollster_Data for Progress (D)**,pollster_Economist/YouGov,pollster_Emerson,pollster_FOX News,pollster_Federalist/Susquehanna,pollster_Grinnell/Selzer,pollster_Harvard-Harris,...,pollster_Reuters/Ipsos,pollster_SurveyUSA,pollster_Susquehanna,pollster_The Messenger/HarrisX,pollster_Trafalgar Group (R),pollster_USA Today/Suffolk,pollster_Wall Street Journal,pollster_Yahoo News,pollster_Yahoo News**,Days Since 01-01-23
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,410
1,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,409
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,414
3,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,408
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,406


In [59]:
# Build a linear regression on pollster dummy variables plus a linear time trend

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = ndf
y = df[["Difference"]]

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

model = LinearRegression()
model.fit(X, y)


In [60]:
today_date = datetime.now().strftime("%m-%d-%Y")  # Today's date, also used as the Excel sheet name
days_since_2023 = (pd.to_datetime(today_date, format="%m-%d-%Y") - pd.to_datetime("2023-01-01")).days
cols = ndf.columns[:-1]  # Pollster dummy columns (everything except the day count)

# One row per pollster: an identity matrix switches on exactly one dummy per row.
# The baseline pollster dropped by drop_first=True gets no row here.
tomorrow_df = pd.DataFrame(np.eye(len(cols)), columns=cols)
tomorrow_df["Days Since 01-01-23"] = days_since_2023 + 1  # Add 1 to the day count to make it the next day

tomorrow_pred = model.predict(tomorrow_df)


In [61]:
# Format the predictions into a readable table, indexed by pollster column name
tomorrow_pred_df = pd.DataFrame(tomorrow_pred, columns=["Predicted Difference"], index=cols)
tomorrow_pred_df.head()

Unnamed: 0,Predicted Difference
pollster_CBS News,2.540727
pollster_CNN,3.535495
pollster_Daily Kos/Civiqs,0.115123
pollster_Data for Progress (D)**,-1.717426
pollster_Economist/YouGov,0.491706


In [62]:
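# Keep an Excel workbook in which each sheet holds one day's predictions,
# adding today's sheet to the existing file.

path_name = "Datasets/Predictions.xlsx"

# Open the existing file in append mode and add the new sheet.
# With the default if_sheet_exists="error", pandas raises ValueError if the sheet already exists.
try:
    with pd.ExcelWriter(path_name, engine='openpyxl', mode='a') as writer:
        tomorrow_pred_df.to_excel(writer, sheet_name=today_date)
except ValueError:
    print("File was already added in a past run-through")
except FileNotFoundError:
    # First run: the workbook does not exist yet, so create it
    tomorrow_pred_df.to_excel(path_name, sheet_name=today_date)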




File was already added in a past run-through


In [63]:
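# Look specifically at the most recently released polls; we will want to compare these to yesterday's predictions.
# Note: this selects every poll sharing the end date of the table's second row, which is not necessarily the latest end date.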

today_df = df[df["End Date"].iloc[1] == df["End Date"]].copy()
today_df["pollster"] = "pollster_" + today_df["pollster"] 
today_df.head()

Unnamed: 0,pollster,date,sample,moe,Trump (R),Biden (D),spread,Difference,Type of Voter,Sample Size,End Date,Poll Month,Year,Days Since 01-01-23
1,pollster_Emerson,2/13 - 2/14,1225 RV,2.7,45.0,44.0,Trump+1,1.0,RV,1225,2024-02-14,2,2024,409


In [64]:
# Check to see if any of the polls from today are in yesterday's predictions.
# If so, we will compare the predictions to the actual results.

yesterday_date = (pd.to_datetime(today_date, format="%m-%d-%Y") - pd.Timedelta(days=1)).strftime("%m-%d-%Y")
try:
    yesterday_df = pd.read_excel(path_name, sheet_name=yesterday_date)
except (ValueError, KeyError):
    # No sheet for yesterday: fall back to the last sheet in the workbook.
    # Note that this may be today's sheet if it was just appended above.
    yesterday_df = pd.read_excel(path_name, sheet_name=-1)
yesterday_df.rename(columns={"Unnamed: 0": "Pollster"}, inplace=True)
yesterday_df.head()

Unnamed: 0,Pollster,Predicted Difference
0,pollster_CBS News,2.514775
1,pollster_CNN,3.509744
2,pollster_Daily Kos/Civiqs,0.105638
3,pollster_Data for Progress (D)**,-1.73339
4,pollster_Economist/YouGov,0.454308


In [65]:
# Match yesterday's predictions to today's polls by pollster name.
# Merging on the name keeps each prediction aligned with its own poll,
# and yields an empty frame (instead of an error) when nothing matches.
today_pred = yesterday_df.merge(
    today_df[["pollster", "Difference"]].rename(
        columns={"pollster": "Pollster", "Difference": "Actual Difference"}
    ),
    on="Pollster",
)

# Calculate the root mean squared error
if len(today_pred) == 0:
    rmse = "No polls from today were in yesterday's predictions"
else:
    rmse = np.sqrt(mean_squared_error(today_pred["Actual Difference"], today_pred["Predicted Difference"]))
print("RMSE IS: ", rmse)

RMSE IS:  1.2342759762844309
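
For reference, the comparison above boils down to matching predictions and actual results by pollster name and taking the square root of the mean squared difference. The standalone sketch below shows that arithmetic with made-up numbers; `rmse_by_pollster` is a hypothetical helper, and it assumes at most one poll per pollster per day.

```python
import math

def rmse_by_pollster(predicted, actual):
    """RMSE over pollsters present in both mappings; None if there is no overlap."""
    shared = sorted(set(predicted) & set(actual))
    if not shared:
        return None
    squared_errors = [(actual[name] - predicted[name]) ** 2 for name in shared]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predicted and actual spreads keyed by pollster name
predicted = {"pollster_X": 2.0, "pollster_Y": -1.0, "pollster_Z": 0.5}
actual = {"pollster_X": 5.0, "pollster_Y": 3.0}

result = rmse_by_pollster(predicted, actual)
print(result)
```

Returning `None` when no pollster overlaps keeps the "nothing to compare" case separate from a genuine error of zero.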


In [66]:
# #Look into google trends data to see if there is a correlation between the number of searches for a candidate and their polling numbers?

# from pytrends.request import TrendReq

# pytrends = TrendReq(hl='en-US', tz=360)

# kw_list = ["Donald Trump", "Joe Biden"]
# pytrends.build_payload(kw_list, cat=0, timeframe='today 5-y', geo='', gprop='')
# interest_over_time_df = pytrends.interest_over_time()
# interest_over_time_df.head()


In [67]:
# #Plot the google trends data
# fig, ax = plt.subplots()
# ax.plot(interest_over_time_df["Donald Trump"], label="Donald Trump")
# ax.plot(interest_over_time_df["Joe Biden"], label="Joe Biden")
# ax.set_title("Google Trends Data for Trump and Biden")
# ax.legend()
# plt.show()


In [68]:
# X["Date"] = df["End Date"]


# #Merge the polling data with the google trends data
# merged_df = pd.merge(X, interest_over_time_df, left_on="Date",right_on = "date")
#merged_df.head()