## SparkStreaming Hackathon
### Course: Real-time Data Analysis
### Authors: Ruben Tak, Nils Jennissen, David Landeo
This task sets up a data streaming pipeline that extracts and processes posts and comments from Reddit. One process structures the data and sends it through a socket; another process receives and processes it. References to users, posts, and external sites are extracted and counted, and the 10 most important words are identified using TF-IDF. Optional features include sentiment analysis, additional metrics, saving results to a database, a Jupyter Notebook dashboard, and a web page that visualizes the results. The deliverables are Python code, instructions, output data files, and an optional Docker setup.
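The reference-counting step described above could be sketched roughly as follows. This is only an illustration, not the notebook's actual implementation: the regular expressions, the helper names `extract_references` and `count_references`, and the rule that a Reddit URL containing `/comments/<id>` counts as a post reference are all assumptions made for this example.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical patterns; Reddit's real mention and URL syntax allows more variants.
USER_RE = re.compile(r"(?<![\w/])/?u/([A-Za-z0-9_-]+)")
URL_RE = re.compile(r"https?://[^\s)\]>]+")
POST_RE = re.compile(r"/comments/([A-Za-z0-9]+)")


def is_reddit(domain):
    """Return True for reddit.com and its subdomains."""
    return domain == "reddit.com" or domain.endswith(".reddit.com")


def extract_references(text):
    """Split the references in one text into users, posts, and external sites."""
    refs = {"users": [], "posts": [], "sites": []}
    for url in URL_RE.findall(text):
        parsed = urlparse(url)
        domain = parsed.netloc.lower()
        post = POST_RE.search(parsed.path)
        if is_reddit(domain):
            # Reddit URLs without a /comments/<id> segment are ignored here.
            if post:
                refs["posts"].append(post.group(1))
        else:
            refs["sites"].append(domain)
    # Remove URLs first so that path segments are not mistaken for mentions.
    stripped = URL_RE.sub(" ", text)
    refs["users"] = [u.lower() for u in USER_RE.findall(stripped)]
    return refs


def count_references(texts):
    """Accumulate reference counts over an iterable of texts."""
    counts = {"users": Counter(), "posts": Counter(), "sites": Counter()}
    for text in texts:
        for kind, found in extract_references(text).items():
            counts[kind].update(found)
    return counts
```

Calling `count_references` on a batch of comment bodies returns one `Counter` per reference type, so `counts["users"].most_common(10)` would list the most frequently mentioned users in that batch.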

In [1]:
pip install praw

Collecting praw
  Downloading praw-7.7.0-py3-none-any.whl (189 kB)
Collecting prawcore<3,>=2.1 (from praw)
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Collecting update-checker>=0.18 (from praw)
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: update-checker, prawcore, praw
Successfully installed praw-7.7.0 prawcore-2.3.0 update-checker-0.18.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
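# Note: this script itself listens on port 9999, so a client must connect to it
# (e.g. `nc 127.0.0.1 9999`) to receive data; `nc -lk 9999` would compete for the same port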
import socket
import json
import praw
import logging
logging.basicConfig(filename='stream_json_error.log', level=logging.ERROR)
#from credentials import CLIENT_ID, CLIENT_SECRET

In [3]:
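import os

# Credentials should not be hard-coded in the notebook.
# Assumption: they are supplied through these (hypothetically named) environment variables.
CLIENT_ID = os.environ["REDDIT_CLIENT_ID"]
SECRET_TOKEN = os.environ["REDDIT_CLIENT_SECRET"]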

In [4]:
# CLIENT_ID = CLIENT_ID
# SECRET_TOKEN = CLIENT_SECRET
USER_AGENT = 'MyBot/0.0.1'

host = "127.0.0.1"
port = 9999

In [None]:
subred_name = "reddit"

def create_socket(host, port):
    """
    Create a socket and bind it to the specified host and port.
    """
    s = socket.socket()
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind((host, port))
    print(f"Listening on port: {port}")
    s.listen()
    return s

def stream_json(reddit, subreddit, server_socket):
    """
    Stream comments from the specified subreddit and send each one as a JSON line through the socket.
    """
    for comment in subreddit.stream.comments():
        try:
            post = comment.submission
            parent = comment.parent()
            # The parent of a top-level comment is the submission itself, which has no body
            is_reply = isinstance(parent, praw.models.Comment)
            my_object = {
                "comment": comment.body,
                "prev_comment": parent.body if is_reply else "",
                "post": post.selftext,
                # Note: this is the comment's creation time, not the post's
                "post_date": comment.created_utc,
            }
            # Send data: blocks until a client connects, then sends one newline-terminated record
            c, addr = server_socket.accept()
            c.sendall((json.dumps(my_object) + "\n").encode('utf-8'))
            c.close()
            print(f'Sent data: {my_object}')
        except praw.exceptions.PRAWException as ex:
            logging.error(f"Error while streaming comments: {ex}")

def main():
    # Set up Reddit API
    reddit = praw.Reddit(client_id=CLIENT_ID,
                         client_secret=SECRET_TOKEN,
                         user_agent=USER_AGENT)

    subreddit = reddit.subreddit(subred_name)

    # Set up socket
    s = create_socket(host, port)

    # Stream comments and send them through the socket
    stream_json(reddit, subreddit, s)

if __name__ == "__main__":
    main()

Listening on port: 9999
Sent data: {'comment': 'Thanks!', 'prev_comment': 'Go into settings and turn off the Followers button.  👍', 'post': "Dear redditors,\n\nFor those of you who don’t know me, I’m Steve aka u/spez. I am one of the founders of Reddit, and I’ve been CEO since 2015. On Wednesday, I celebrated my 18th cake-day, which is about 17 years and 9 months longer than I thought this project would last. To be with you here today on Reddit—even in a heated moment like this—is an honor.\n\nI want to talk with you today about what’s happening within the community and frustration stemming from changes we are making to access our API. I spoke to a number of moderators on Wednesday and yesterday afternoon and our product and community teams have had further  conversations with mods as well.\n\nFirst, let me share the background on this topic as well as some clarifying details. On 4/18, we [shared](https://www.reddit.com/r/reddit/comments/12qwagm/an_update_regarding_reddits_api/) that w