# COGS 108 - Final Project (change this to your project's title)

## Permissions

Place an `X` in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that student names will be included, but PIDs will be scraped from any groups who include their PIDs.)

* [X] YES - make available
* [  ] NO - keep private

# Overview

*Fill in your overview here*

# Names

- Samuel
- Matthew
- Caitlin
- Darren
- Nick

<a id='research_question'></a>
# Research Question

Do large online communities (of retail traders) have influence over the stock market?

*Sentiment analysis of positivity on the Reddit subreddit r/WallStreetBets and how this correlates to the performance of the S&P 500 from January 31, 2012 to the present.*

<a id='background'></a>

## Background & Prior Work

### Introduction to Topic

Reddit was considered a good community for us to home in on for our topic not just because of its size, but also because it has a dedicated system that lets us more easily scrape semi-structured data from the site. We are able to do this with the Python Reddit API Wrapper (PRAW). For the past few days the subreddit r/WallStreetBets has been the center of the trading world with its short squeeze of the GameStop (GME), AMC Theatres (AMC), and BlackBerry (BB) stocks and more. Its members' history of trading in a reckless manner goes back much further than just the past few weeks. In this project we look at the trading done in the months leading up to and during the COVID-19 pandemic and how it correlates to several securities, including the S&P 500 index, TSLA, GME, and more.

### Summary of Prior Work

Keith Gill, also known as Roaring Kitty on Reddit, TikTok, and YouTube, invested $53,000 in GameStop (3). Recently, he and his followers invested in GameStop against hedge funds and went against Wall Street norms. He and his fellow investors drove up the price of GameStop, which is still climbing to this day (4). Because Gill's base was mainly on Reddit, we were curious whether there was any correlation between the positivity of the subreddit he frequented and the S&P 500. We did some research to see if scraping data from Reddit was possible, and according to (2), it would be quite simple to create our own dataset using Reddit's API. We also looked into Kaggle to see what our ideal dataset would look like (1). Using these resources, we feel confident in our ability to webscrape our own Reddit dataset. We also looked into downloading a dataset from websites that have the history and time periods of stocks. Yahoo Finance provides an easy way to download a dataset with the variables we need in .csv format (4).


References (include links):
- 1) https://www.kaggle.com/shergreen/wallstreetbets-subreddit-submissions
- 2) https://towardsdatascience.com/scraping-reddit-data-1c0af3040768
- 3) https://www.nytimes.com/2021/01/29/technology/roaring-kitty-reddit-gamestop-markets.html
- 4) https://finance.yahoo.com/quote/GME

# Hypothesis


We hypothesize that there is a correlation between the positivity on the subreddit and the performance of the S&P 500 due to global and local events. Global events, especially the COVID-19 pandemic, may hold an influence over the subreddit and the market.
- We also hypothesize that there will be a positive linear relationship between the performance of the S&P 500 and positivity on the popular subreddit r/WallStreetBets.
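
One way a positive linear relationship could be checked is with a Pearson correlation coefficient between a daily positivity score and the daily close. The sketch below is only an illustration under stated assumptions: the values are synthetic, and the `sentiment` column is a hypothetical placeholder for whatever positivity measure the analysis eventually produces.

```python
import pandas as pd

# Synthetic, illustrative values only (not project data).
daily = pd.DataFrame({
    "Date": pd.to_datetime(["2021-02-01", "2021-02-02", "2021-02-03", "2021-02-04"]),
    "sentiment": [0.10, 0.25, 0.20, 0.40],  # hypothetical mean positivity per day
    "Close": [100.0, 102.0, 101.0, 105.0],  # hypothetical index close
})

# Pearson correlation coefficient between the two daily series.
r = daily["sentiment"].corr(daily["Close"], method="pearson")
print(round(r, 3))
```

A coefficient near +1 would be consistent with the hypothesis, although correlation alone would not show that the subreddit influences the market.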

# Dataset(s)

#### Stock Dataset

- Dataset Name: Stock Market S&P 500 History 
- Link to the dataset: https://finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC
- Number of observations: 2,273

This dataset contains the date, open, high, low, close, adjusted close, and volume of the S&P 500 from January 31, 2012 to February 10, 2021. We got this dataset from Yahoo Finance, which allows us to easily download the history of the S&P 500 into a CSV file. We chose this time period as the subreddit r/wallstreetbets was created on January 31, 2012. In our data cleaning code, we will only keep the date and closing columns.



#### Reddit Dataset

- Dataset Name: Wallstreetbets Subs Full
- Link to the dataset: https://drive.google.com/file/d/1l3NuVbJtf9mdMdvLsKRnj0rcfYYp7o28/view?usp=sharing
- Number of observations: 1,317,200

Our team found a dataset on Kaggle that gave us the submissions on r/wallstreetbets in a dataset that went up to August 2020. To ensure we covered the entire period within the scope of our question we elected to acquire our own data using wrappers for the Reddit API. We have included our webscraping code below. We webscraped the Reddit API from January 31, 2012 (when the subreddit was created) to the present. This dataset contains submissions to the subreddit. Other columns include features in Reddit (awards, removals, etc.) as well as links to posts and authors. We will be cleaning this up to better organize the data by date and post content.

*Our stock dataset is included in the GitHub folder.*

*Our Reddit dataset is provided in a Google Drive link as the file is 1.7 GB.*

# Setup

In [2]:
# Setup imports
#!pip install pmaw
%matplotlib inline
import pandas as pd
from pmaw import PushshiftAPI
import os

# Data Cleaning

### Stock Market S&P 500 History 

Our question concerns how the positivity of the subreddit r/wallstreetbets correlates with the performance of the S&P 500 from January 31, 2012 to the present. This dataset is already very clean before any processing; we only need to remove the columns that are not required to answer our research question, keeping just the date and the closing price of the S&P 500. We do not need the adjusted close because we are working with the S&P 500 index and do not need to account for out-of-hours activity as we would with an individual stock. We do not need the opening price because it closely follows the previous day's closing price. Since we are only considering the performance of the S&P 500, we do not need the volume either.

In [3]:
stocks = pd.read_csv("Stock_Market_S&P_500_History.csv")
cleaned_stocks = stocks[['Date', 'Close']]
cleaned_stocks

Unnamed: 0,Date,Close
0,1/31/2012,1312.410034
1,2/1/2012,1324.089966
2,2/2/2012,1325.540039
3,2/3/2012,1344.900024
4,2/6/2012,1344.329956
...,...,...
2268,2/4/2021,3871.739990
2269,2/5/2021,3886.830078
2270,2/8/2021,3915.590088
2271,2/9/2021,3911.229980
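
The `Date` values above are month/day/year strings, so they would need to be parsed before they can be compared with dates derived from Reddit timestamps. A minimal sketch, assuming the file keeps this format (the two rows are copied from the output above and read from memory instead of from the CSV file):

```python
import io
import pandas as pd

# Two rows copied from the output above, used as a stand-in for the CSV file.
sample_csv = io.StringIO("Date,Close\n1/31/2012,1312.410034\n2/1/2012,1324.089966\n")
sample = pd.read_csv(sample_csv)

# The export uses month/day/year strings; parse them with an explicit format.
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y")
print(sample.dtypes)
```

If the Reddit dates are stored as plain Python `date` objects, appending `.dt.date` to the parsed column would put both datasets in the same type before any merge.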


### Reddit Dataset

#### Code to webscrape Reddit API

This was used outside of our notebook in order to webscrape the Reddit API for the subreddit r/wallstreetbets from January 31, 2012 to February 12, 2021.

import pandas as pd
from pmaw import PushshiftAPI
import os

"""
outname = 'wallstreetbets_subs_full.csv'

outdir = './data'
if not os.path.exists(outdir):
    os.mkdir(outdir)

fullname = os.path.join(outdir, outname)

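api = PushshiftAPI()
# Epoch seconds (UTC): 1327968000 is January 31, 2012; 1613160000 is February 12, 2021
submissions = api.search_submissions(subreddit="wallstreetbets", after=1327968000, before=1613160000)

# One row per submission; write every column to CSV without the index
sub_df = pd.DataFrame(submissions)
sub_df.to_csv(fullname, header=True, index=False)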
"""
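
The `after` and `before` arguments in the scraping code are Unix epoch seconds. As a minimal sketch, the two values can be reproduced from calendar dates in UTC. The helper name `to_epoch` is ours and is not part of pmaw, and the 20:00 hour of the end bound is inferred from the number itself.

```python
from datetime import datetime, timezone


def to_epoch(year, month, day, hour=0):
    """Return Unix epoch seconds for the given UTC date and hour."""
    return int(datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp())


after = to_epoch(2012, 1, 31)       # start bound: the day the subreddit was created
before = to_epoch(2021, 2, 12, 20)  # end bound: February 12, 2021 at 20:00 UTC
print(after, before)
```

Deriving the bounds this way makes it easier to extend the scrape to a later date without hand-computing timestamps.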

### Wallstreetbets Subs Full

We scraped the subreddit r/wallstreetbets ourselves to obtain a consistent time period, because the Kaggle dataset we found stopped at August 2020 and we wanted data spanning from the subreddit's creation (January 31, 2012) to the present. The freshly scraped dataset has many unnecessary columns. Since our research question only asks about the positivity of the subreddit, columns such as 'subreddit' and 'event_is_live' are unneeded, and several others are associated with Reddit features that do not pertain to our question. We will focus on the following columns: selftext, author_fullname, title, url, total_awards_received, upvote_ratio, category, and created_utc. We also renamed some columns for clarity, for example author_fullname to Author ID and title to Post Title. We avoided removing too many columns at this stage in case some of them become relevant later in our data exploration.

In [17]:
# The below code was run on a local machine due to size constraints of our CSV

reddit = pd.read_csv("../wallstreetbets_subs_full.csv", low_memory = False)
reddit.head(10)

Unnamed: 0,author,author_created_utc,author_flair_css_class,author_flair_text,author_fullname,created_utc,domain,full_link,gilded,id,...,content_categories,hidden,quarantine,removal_reason,subreddit_name_prefixed,event_end,event_is_live,event_start,collections,top_awarded_type
0,svb688,1302398000.0,,,t2_52yit,1356455353,self.wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,0.0,15fc9y,...,,,,,,,,,,
1,Dasweb,1279150000.0,,,t2_46mmt,1356378910,finance.yahoo.com,https://www.reddit.com/r/wallstreetbets/commen...,0.0,15dyf8,...,,,,,,,,,,
2,GroundhogExpert,1292783000.0,,,t2_4mwkh,1356330888,seekingalpha.com,https://www.reddit.com/r/wallstreetbets/commen...,0.0,15d3ig,...,,,,,,,,,,
3,StockTrader8,1350310000.0,,,t2_9b4e5,1356222842,keeneonthemarket.com,https://www.reddit.com/r/wallstreetbets/commen...,0.0,15ay8i,...,,,,,,,,,,
4,StockTrader8,1350310000.0,,,t2_9b4e5,1356043510,keeneonthemarket.com,https://www.reddit.com/r/wallstreetbets/commen...,0.0,156y2e,...,,,,,,,,,,
5,GroundhogExpert,1292783000.0,,,t2_4mwkh,1356041481,self.wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,0.0,156vsr,...,,,,,,,,,,
6,mkipper,1263318000.0,,,t2_3tlyc,1356016701,self.wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,0.0,1564oi,...,,,,,,,,,,
7,kdonn,1323061000.0,,,t2_6djdk,1355964582,self.wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,0.0,1551zx,...,,,,,,,,,,
8,GroundhogExpert,1292783000.0,,,t2_4mwkh,1355955444,self.wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,0.0,154s0h,...,,,,,,,,,,
9,Dasweb,1279150000.0,,,t2_46mmt,1355850121,self.wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,0.0,15241a,...,,,,,,,,,,


In [18]:
# Print columns to get a better sense of what data is where, and what we know we don't need

collist = list(reddit)
print(collist)

['author', 'author_created_utc', 'author_flair_css_class', 'author_flair_text', 'author_fullname', 'created_utc', 'domain', 'full_link', 'gilded', 'id', 'is_self', 'media_embed', 'mod_reports', 'num_comments', 'over_18', 'permalink', 'retrieved_on', 'score', 'secure_media_embed', 'selftext', 'stickied', 'subreddit', 'subreddit_id', 'thumbnail', 'title', 'url', 'user_reports', 'edited', 'media', 'secure_media', 'banned_by', 'locked', 'post_hint', 'preview', 'link_flair_css_class', 'link_flair_text', 'approved_at_utc', 'banned_at_utc', 'brand_safe', 'can_mod_post', 'contest_mode', 'is_video', 'spoiler', 'suggested_sort', 'thumbnail_height', 'thumbnail_width', 'author_flair_richtext', 'author_flair_type', 'is_crosspostable', 'is_original_content', 'is_reddit_media_domain', 'link_flair_richtext', 'link_flair_text_color', 'link_flair_type', 'media_only', 'no_follow', 'num_crossposts', 'parent_whitelist_status', 'pinned', 'pwls', 'rte_mode', 'send_replies', 'subreddit_subscribers', 'subreddi

In [19]:
# Checking all posts are from one subreddit

allinsubreddit = sum(reddit['subreddit'] != 'wallstreetbets')
print(allinsubreddit)

# Yes they are, dropping the redundant column
reddit.drop('subreddit', axis=1, inplace=True)

# Also dropping 3 columns with information unrelated to our scope
#reddit.drop(['event_end','event_is_live','event_start'], axis=1)

# Drop by name, then by position; each positional slice is evaluated after the preceding drops
reddit.drop(columns=['spoiler', 'author_patreon_flair', 'author_premium'], inplace=True)
reddit.drop(columns=reddit.columns[103:107], inplace=True)
reddit.drop(columns=reddit.columns[109:114], inplace=True)

0


In [20]:
# Renaming a few columns for clarity

reddit.rename(columns={'author':'Author', 'author_fullname':'Author ID', 'title':'Post Title', 'upvote_ratio':'Upvote Ratio'}, inplace=True)
reddit.head(10)

Unnamed: 0,Author,author_created_utc,author_flair_css_class,author_flair_text,Author ID,created_utc,domain,full_link,gilded,id,...,poll_data,archived,can_gild,category,content_categories,hidden,quarantine,event_start,collections,top_awarded_type
0,svb688,1302398000.0,,,t2_52yit,1356455353,self.wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,0.0,15fc9y,...,,,,,,,,,,
1,Dasweb,1279150000.0,,,t2_46mmt,1356378910,finance.yahoo.com,https://www.reddit.com/r/wallstreetbets/commen...,0.0,15dyf8,...,,,,,,,,,,
2,GroundhogExpert,1292783000.0,,,t2_4mwkh,1356330888,seekingalpha.com,https://www.reddit.com/r/wallstreetbets/commen...,0.0,15d3ig,...,,,,,,,,,,
3,StockTrader8,1350310000.0,,,t2_9b4e5,1356222842,keeneonthemarket.com,https://www.reddit.com/r/wallstreetbets/commen...,0.0,15ay8i,...,,,,,,,,,,
4,StockTrader8,1350310000.0,,,t2_9b4e5,1356043510,keeneonthemarket.com,https://www.reddit.com/r/wallstreetbets/commen...,0.0,156y2e,...,,,,,,,,,,
5,GroundhogExpert,1292783000.0,,,t2_4mwkh,1356041481,self.wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,0.0,156vsr,...,,,,,,,,,,
6,mkipper,1263318000.0,,,t2_3tlyc,1356016701,self.wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,0.0,1564oi,...,,,,,,,,,,
7,kdonn,1323061000.0,,,t2_6djdk,1355964582,self.wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,0.0,1551zx,...,,,,,,,,,,
8,GroundhogExpert,1292783000.0,,,t2_4mwkh,1355955444,self.wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,0.0,154s0h,...,,,,,,,,,,
9,Dasweb,1279150000.0,,,t2_46mmt,1355850121,self.wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,0.0,15241a,...,,,,,,,,,,


In [21]:
# Barebones filtered dataset for us to do initial work. Important that we could keep it to a manageable size for github
filtered = reddit.filter(['Author ID','Post Title', 'Upvote Ratio', 'created_utc', 'category', 'total_awards_received', 'score', 'selftext'], axis=1)
filtered.to_csv("../filteredout.csv")

### Wallstreetbets Subs Filtered 1

These two sections are separated because the section above imports the full 1.7 GB dataset, which only two of our team members could load on their local computers without crashing. Even on systems that could handle it, the cells above took a long time to run. We therefore ran them only once to produce filteredout.csv, which we import here to continue working.
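
One possible way to reduce the memory pressure described above is to read only the needed columns, in chunks. This is a sketch rather than the method we used: the rows are synthetic, the column names come from the column list printed earlier, and the chunk size is an arbitrary choice.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the large CSV (synthetic rows).
raw = io.StringIO(
    "author_fullname,title,score,created_utc,domain\n"
    "t2_a,First post,3,1356455353,self.wallstreetbets\n"
    "t2_b,Second post,7,1356378910,finance.yahoo.com\n"
    "t2_c,Third post,1,1356330888,seekingalpha.com\n"
)

wanted = ["author_fullname", "title", "score", "created_utc"]

# usecols skips unwanted columns at parse time; chunksize bounds the rows held per read.
chunks = pd.read_csv(raw, usecols=wanted, chunksize=2)
small = pd.concat(chunks, ignore_index=True)
print(small.shape)
```

With the real file, the in-memory buffer would be replaced by the CSV path and a much larger `chunksize`.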

In [5]:
filtered = pd.read_csv("../filteredout.csv", low_memory = False)

In [6]:
# Many of these are NaN, so we will just leverage Score instead
filtered['Upvote Ratio'].isna().sum()

384480

In [7]:
# All values are NaN, so dropping this column
filtered['category'].isna().sum()

1317200

In [8]:
# Enough to keep, so we won't drop this column
filtered['total_awards_received'].isna().sum()

218435

We realized after looking at the filtered dataset that Upvote Ratio and Category had many NaNs, mainly because the database did not collect Upvote Ratio until after a certain date and because few users use Reddit's tags to categorize their posts. We decided to drop these columns from the dataset.
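
From the counts above, roughly 29% of Upvote Ratio values and about 17% of total_awards_received values are missing, while category is missing in every row. The same check can be run for all columns at once. The sketch below uses a small synthetic frame, and the rule of flagging only fully empty columns is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Synthetic frame standing in for the filtered dataset.
demo = pd.DataFrame({
    "Upvote Ratio": [np.nan, 0.9, np.nan, 0.8],
    "category": [np.nan, np.nan, np.nan, np.nan],
    "score": [1, 7, 0, 2],
})

# Fraction of missing values in each column.
missing_share = demo.isna().mean()

# Illustrative rule: flag columns that are entirely empty.
all_missing = missing_share[missing_share == 1.0].index.tolist()
print(missing_share)
print(all_missing)
```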

In [9]:
filtered.drop(['Upvote Ratio', 'category'], axis=1, inplace=True)

In [10]:
# Rename columns for consistency + clarity
filtered.rename(columns={'created_utc':'Time Created UTC', 'total_awards_received':'Num Awards', 'score': 'Score', 'selftext':'Post Body'}, inplace=True)

In [11]:
filtered.head()

Unnamed: 0.1,Unnamed: 0,Author ID,Post Title,Time Created UTC,Num Awards,Score,Post Body
0,0,t2_52yit,A question about Netflix and Amazon come Wedne...,1356455353,,0,With part of Netflix going down last night and...
1,1,t2_46mmt,Rosen Law Firm Announces Filing of Securities ...,1356378910,,7,
2,2,t2_4mwkh,ZGNX insiders are expanding their position. Le...,1356330888,,0,
3,3,t2_9b4e5,KeeneOnTheMarket.com - Short the Russian Drou...,1356222842,,1,
4,4,t2_9b4e5,KeeneOnTheMarket.com - Pregame Earnings: Optio...,1356043510,,1,


In [13]:
filtered['Date'] = pd.to_datetime(filtered['Time Created UTC'], unit='s').dt.date
filtered = filtered.sort_values(by='Time Created UTC')
filtered = filtered.reset_index(drop=True)
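# Drop by name rather than position so that re-running this cell cannot remove further columns
filtered = filtered.drop(columns=['Unnamed: 0', 'Author ID'], errors='ignore')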
filtered

Unnamed: 0,Post Title,Time Created UTC,Num Awards,Score,Post Body,Date
0,Earnings season is here. Place your bets.,1334162440,,13,I know that /r/investing is a great place for ...,2012-04-11
1,Earnings season is here. Place your bets.,1334162440,,13,I know that /r/investing is a great place for ...,2012-04-11
2,"GOOG - beat estimates, price barely rises.",1334263051,,2,,2012-04-12
3,"GOOG - beat estimates, price barely rises.",1334263051,,2,,2012-04-12
4,My poorly timed opening position for AAPL earn...,1334615377,,12,"So I missed out on GOOG, which is probably a g...",2012-04-16
...,...,...,...,...,...,...
1317195,"Complete DD on HYLN. No battery patent, no IP,...",1613142121,0.0,1,Hyliion became a thing in the WSB world once a...,2021-02-12
1317196,🚀,1613142124,0.0,1,[removed],2021-02-12
1317197,Buying tilray and nokia,1613142128,0.0,1,[removed],2021-02-12
1317198,Please take sndl at $3 !!,1613142129,0.0,1,[removed],2021-02-12


# Data Analysis & Results

Include cells that describe the steps in your data analysis.

In [22]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

# Ethics & Privacy

There is a potential bias in how we scrape the Reddit API. It may come from how we build our dataset, since selecting only certain threads could skew it. Another bias could come from the culture of Reddit, in which some threads have particular patterns and trends in commenting and posting. We do not believe our group introduces a personal bias, as none of us are involved with Reddit; however, if we gravitate toward scraping certain threads, we may still introduce one. We will try to detect this by reviewing the threads we scrape and making sure they are not limited to ones we are already familiar with.

Privacy is another concern. While Reddit data is publicly available, it includes IDs and usernames. It is extremely easy to look up Reddit usernames, posts, and comments to find a particular post on Reddit. We are looking into anonymizing the data by removing names, usernames, IDs, posts, and titles. We hope to publish only our sentiment analysis rather than our dataset, to prevent people from looking up posts and usernames; otherwise, people could look up these posts and target specific users.
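
As a sketch of what publishing only the sentiment analysis could look like, per-post records can be aggregated to one row per day so that no author, title, or post text leaves the notebook. The data below is synthetic and the `sentiment` column is hypothetical.

```python
import pandas as pd

# Synthetic per-post records; "sentiment" is a hypothetical column.
posts = pd.DataFrame({
    "Author ID": ["t2_x", "t2_y", "t2_x"],
    "Post Title": ["title a", "title b", "title c"],
    "Date": ["2021-02-11", "2021-02-11", "2021-02-12"],
    "sentiment": [0.2, 0.6, -0.1],
})

# Aggregate to one row per day; identifying columns are not carried over.
public = posts.groupby("Date", as_index=False)["sentiment"].mean()
print(list(public.columns))
```

Days with very few posts could still hint at individual submissions, so a minimum post count per day may be worth considering before release.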

# Conclusion & Discussion

*Fill in your discussion information here*

# Team Contributions

*Specify who in your group worked on which parts of the project.*