## Jacob Linder
<p>I have a solid background in stocks and options, which will make it easier for our group to decide which information is important and which can be discarded. I also bring relevant Python experience, having previously worked with options data scraped from APIs, which will let us retrieve the relevant options data for each post. Finally, I have a lot of personal, project-based experience with Linux, which will be handy if we want to run this on a server to continually collect data and make observations.</p>

<p> ADD IN: What areas/skills/domains does the team member presently identify with?
    Into which areas/skills/domains would the team member like to grow? </p>


## Logan Stout
<p>I am a hobby investor who has traded options profitably for the last four years. That experience has given me a solid grasp of the mechanisms by which securities and their option prices develop in response to market movements, statistics, and media. As a controls engineer by trade, I have a handle on control systems and on managing system outputs based on constantly updating data. My strengths include high-level project planning and file processing. Because this project's data is mixed with other text in a forum, I hope to gain experience in web scraping and text parsing.</p>

### Who might be interested?
<p>The purpose of our data is to show whether the community at WallStreetBets (WSB), in aggregate, actually has any impact on the stocks it suggests. Additionally, we intend to quantify the accuracy of the community's suggestions as a whole by comparing the price of the underlying equity at the expiry of the contract to the strike of the suggested contract. We will make this measure more precise by including the price of the equity at the time of the post.</p>
<p>If the predictions on WSB show some measurable effect (or even accuracy), any trader or algorithmic trading firm would be eager to incorporate that information into their programs.</p>
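
To make the accuracy measure described in this section concrete, here is a minimal sketch of how a single suggestion might be scored. The function name `suggestion_outcome`, its parameters, and the returned keys are all hypothetical and chosen only for illustration; the prices would come from whatever quote source the project settles on.

```python
def suggestion_outcome(option_type, strike, price_at_post, price_at_expiry):
    """Summarize how one suggested contract turned out at expiry."""
    if option_type not in ("call", "put"):
        raise ValueError("option_type must be 'call' or 'put'")
    # Calls profit from rising prices, puts from falling prices.
    sign = 1 if option_type == "call" else -1
    # Intrinsic value per share at expiry:
    # max(S - K, 0) for calls, max(K - S, 0) for puts.
    intrinsic = max(sign * (price_at_expiry - strike), 0.0)
    # Did the underlying move in the suggested direction after the post?
    moved_as_suggested = sign * (price_at_expiry - price_at_post) > 0
    return {
        "in_the_money": intrinsic > 0,
        "intrinsic_value": intrinsic,
        "moved_as_suggested": moved_as_suggested,
    }
```

A call with a strike of 20, posted when the stock traded at 18 and expiring with the stock at 19, finishes out of the money even though the stock moved in the suggested direction. This is why recording the price at the time of the post adds information that the strike comparison alone would miss.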

### How is the data limited?
<p>The core of the data collection method is scraping text from posts and comments on Reddit in search of stock market positions. This is simple in the ideal case, where a comment contains a string giving the stock ticker, expiration date, strike price, and option type (put or call) in that order. PSTH 3/19/21 20c, for example, is relatively simple to parse because it provides all of the required information in sequence. Unfortunately, no standardized format is enforced, so the same information could be communicated by any permutation of the four items, or even in full sentences with context elsewhere in the post. Without sufficiently sophisticated parsing, we will probably collect significantly less data than is actually available.
</p>
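
As an illustration of the ideal case only, the sketch below parses strings shaped like `PSTH 3/19/21 20c` with a regular expression. The names `POSITION_RE` and `parse_positions` are hypothetical, and the pattern bakes in several assumptions: tickers are one to five uppercase letters, dates are written month/day/year, and two-digit years mean 20YY. It does not handle the permuted or full-sentence forms described above.

```python
import re
from datetime import date

# Hypothetical pattern for the ideal-case format: TICKER M/D/YY STRIKE(c|p)
POSITION_RE = re.compile(
    r"\b(?P<ticker>[A-Z]{1,5})\s+"
    r"(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4}|\d{2})\s+"
    r"\$?(?P<strike>\d+(?:\.\d+)?)(?P<kind>[cCpP])\b"
)

def parse_positions(text):
    """Return a list of dicts, one per ideal-format position found in text."""
    positions = []
    for m in POSITION_RE.finditer(text):
        year = int(m.group("year"))
        if year < 100:  # two-digit year, assumed to mean 20YY
            year += 2000
        try:
            expiry = date(year, int(m.group("month")), int(m.group("day")))
        except ValueError:
            continue  # not a real calendar date, so skip this match
        positions.append({
            "ticker": m.group("ticker"),
            "expiry": expiry,
            "strike": float(m.group("strike")),
            "type": "call" if m.group("kind").lower() == "c" else "put",
        })
    return positions
```

Calling `parse_positions("PSTH 3/19/21 20c")` yields one position, while a reordered string such as `20c PSTH 3/19` yields none, which is exactly the data loss this section anticipates.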

### How was the data created?
<p>All of the position data came from anonymously submitted posts and comments on Reddit.com/r/wallstreetbets. Individual stock data is generated by investors making trades on the stock market, most commonly on the New York Stock Exchange and the Nasdaq Stock Market. The specific data we use is taken from Yahoo Finance quotes; Yahoo receives it from a company called ICE Data Services, which takes it from the exchanges themselves.</p>

### What access rights exist on your data? How will you make your data available?
<p> Currently, there are no obvious restrictions on data usage or collection. Anyone with an internet connection can access both Reddit and Yahoo Finance without a registered account. However, Reddit may retain some control over the content posted on its website, and Yahoo may restrict its data from being used for commercial purposes (we should verify both).</p>
<p> We will initially make the data available as a simple bulk JSON file of all the suggestions made on WSB, along with the post text connected to each suggestion. Additionally, we intend to compute an aggregated "sentiment" of WSB, as measured by the number and direction of suggestions, combined with the number of upvotes and comments on the posts themselves. Further down the line, we would like to present the data by ticker, rating current WSB sentiment as well as the historic amount of sway WSB has had over a particular stock, if available. </p>
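
One possible shape for the aggregated sentiment is sketched below. The weighting formula is purely an assumption made for illustration, not a settled design: each suggestion counts once, plus its upvotes and comment count. The function name `aggregate_sentiment` is hypothetical, and the `score` and `num_comments` keys are assumed to be copied from the post each suggestion came from.

```python
def aggregate_sentiment(suggestions):
    """Net bullish/bearish score in [-1, 1], weighted by post engagement."""
    weighted_sum = 0.0
    total_weight = 0.0
    for s in suggestions:
        # Calls are treated as bullish (+1), puts as bearish (-1).
        direction = 1 if s["type"] == "call" else -1
        # Assumed weighting: one for the suggestion itself, plus engagement.
        weight = 1 + max(s.get("score", 0), 0) + max(s.get("num_comments", 0), 0)
        weighted_sum += direction * weight
        total_weight += weight
    # With no suggestions there is no sentiment to report, so return neutral.
    return weighted_sum / total_weight if total_weight else 0.0
```

A value near 1 would mean heavily engaged posts are almost all suggesting calls, a value near -1 the same for puts, and a value near 0 a roughly even split.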

In [160]:
import requests
import json
import datetime
import time

In [131]:
after_time = datetime.datetime(2020,1,15,0,0).timestamp()
after_time = int(after_time)
before_time = datetime.datetime(2020,1,16,0,0).timestamp()
before_time = int(before_time)
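
One caveat about the timestamps above: a `datetime` built without a time zone is interpreted in the local time zone of the machine running the notebook, so the collection window shifts depending on where the code runs. A minimal sketch of a UTC-pinned alternative follows, where `utc_epoch` is a hypothetical helper name and the assumption is that the search API treats `after` as UTC epoch seconds.

```python
import datetime

def utc_epoch(year, month, day):
    """Epoch seconds for midnight UTC on the given date."""
    dt = datetime.datetime(year, month, day, tzinfo=datetime.timezone.utc)
    return int(dt.timestamp())

# The same one-day window as above, pinned to UTC.
after_time_utc = utc_epoch(2020, 1, 15)
before_time_utc = utc_epoch(2020, 1, 16)
```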

In [133]:
all_posts = requests.get('https://api.pushshift.io/reddit/submission/search/?after={}&sort_type=created_utc&sort=asc&subreddit=wallstreetbets&size=150'.format(after_time))

In [132]:
all_posts

<Response [200]>

In [126]:
data = all_posts.json()

In [127]:
# Count posts per flair; posts without a flair are tallied under 'None'.
flair_dict = {'None': 0}
for post_dict in data['data']:
    flair = post_dict.get('link_flair_text', 'None')
    flair_dict[flair] = flair_dict.get(flair, 0) + 1

sum(flair_dict.values())

100

In [128]:
flair_dict

{'None': 26,
 'Shitpost': 18,
 'Technicals': 1,
 'Options': 5,
 'Fundamentals': 3,
 'Satire': 4,
 'YOLO': 3,
 'Meme': 3,
 'Discussion': 16,
 'DD': 7,
 'Gain': 5,
 'Daily Discussion': 1,
 'Storytime': 1,
 'Stocks': 4,
 'Loss': 3}

In [146]:
good_flairs = ['DD','Options','Stocks','Fundamentals','Technicals','YOLO','Discussion']
good_posts = []
max_time = 0

In [147]:
# Keep only posts whose flair is in good_flairs, and track the newest
# timestamp seen among them. Posts without a flair are skipped.
for post_dict in data['data']:
    if post_dict.get('link_flair_text') in good_flairs:
        good_posts.append(post_dict)
        max_time = max(max_time, post_dict['created_utc'])
len(good_posts)

39

In [149]:
one_day = 86400
max_time - after_time

37499

In [164]:
after_time = datetime.datetime(2020,1,15,0,0).timestamp()
after_time = int(after_time)
stop_time = int(datetime.datetime(2020,1,20,0,0).timestamp())

all_posts = requests.get('https://api.pushshift.io/reddit/submission/search/?after={}&sort_type=created_utc&sort=asc&subreddit=wallstreetbets&size=150'.format(after_time))
data = all_posts.json()

good_flairs = ['DD','Options','Stocks','Fundamentals','Technicals','YOLO','Discussion']
good_posts = []
max_time = 0
all_posts

<Response [200]>

In [165]:
while after_time <= stop_time:
    batch = data['data']
    if not batch:
        # the API returned nothing newer, so stop instead of looping forever
        break
    for post_dict in batch:
        if post_dict.get('link_flair_text') in good_flairs:
            good_posts.append(post_dict)
    # advance past the newest post in this batch, whether or not it had a
    # wanted flair, so a batch with no matches cannot stall the loop
    max_time = max(post_dict['created_utc'] for post_dict in batch)
    after_time = max_time
    time.sleep(2)  # pause between requests
    all_posts = requests.get('https://api.pushshift.io/reddit/submission/search/?after={}&sort_type=created_utc&sort=asc&subreddit=wallstreetbets&size=150'.format(after_time))
    data = all_posts.json()

    # write to file
    # record max time

In [166]:
len(good_posts)

641
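
The collection loop above still leaves "write to file" and "record max time" as open steps. The sketch below shows one way they might fit together, under stated assumptions: `collect_posts` and `save_posts` are hypothetical names, and `fetch_page` stands in for the HTTP request, so the paging logic can be exercised without touching the network. Unlike the loop above, this version also drops posts created after `stop_time`.

```python
import json
import time

def collect_posts(fetch_page, after_time, stop_time, wanted_flairs, pause=0.0):
    """Page through posts in ascending time order and keep the wanted flairs.

    fetch_page(after) is a caller-supplied function returning a list of post
    dicts created after the given epoch time, oldest first.
    """
    kept = []
    while after_time <= stop_time:
        batch = fetch_page(after_time)
        if not batch:
            break  # nothing newer is available
        for post in batch:
            if post["created_utc"] > stop_time:
                return kept  # past the end of the collection window
            if post.get("link_flair_text") in wanted_flairs:
                kept.append(post)
        newest = max(post["created_utc"] for post in batch)
        if newest <= after_time:
            break  # no progress was made, so avoid an endless loop
        after_time = newest
        time.sleep(pause)  # pause between requests
    return kept

def save_posts(posts, path):
    """Write the posts and the newest timestamp seen, so a later run can resume."""
    newest = max((p["created_utc"] for p in posts), default=None)
    with open(path, "w") as f:
        json.dump({"max_time": newest, "posts": posts}, f)
```

In the notebook, `fetch_page` could wrap the existing `requests.get(...)` call and return `response.json()['data']`, with `pause=2` to match the delay used above.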