## SparkStreaming Hackathon
### Course: Real-time Data Analysis
### Authors: Ruben Tak, Nils Jennissen, David Landeo
This task sets up a data streaming pipeline that extracts and processes posts and comments from Reddit. One process structures the data and sends it through a socket; a second process receives and processes it. References to users, posts, and external sites are extracted and counted, and the 10 most important words are identified using TF-IDF. Optional features include sentiment analysis, additional metrics, saving results to a database, a Jupyter Notebook dashboard, and visualizing the results on a web page. The deliverables are Python code, instructions, output data files, and an optional Docker setup.
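
The two processing steps named above (counting references and ranking words by TF-IDF) can be sketched in plain Python. This is an illustration only, not the Spark job from this project: the regular expressions, the rule that any `reddit.com` link counts as a post reference, the choice to sum TF-IDF scores across documents, and the function names `extract_references` and `top_tfidf_words` are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed patterns: user mentions look like u/name or /u/name,
# and anything starting with http:// or https:// is treated as a URL.
USER_RE = re.compile(r"(?<![\w/])/?u/([A-Za-z0-9_-]+)")
URL_RE = re.compile(r"https?://[^\s)\]]+")


def extract_references(text):
    """Count user mentions, Reddit post links, and external site domains."""
    users = Counter(name.lower() for name in USER_RE.findall(text))
    posts, sites = Counter(), Counter()
    for url in URL_RE.findall(text):
        host = urlparse(url).netloc.lower()
        # Assumption: links to reddit.com (or a subdomain) reference posts;
        # every other host is counted as an external site.
        if host == "reddit.com" or host.endswith(".reddit.com"):
            posts[url] += 1
        else:
            sites[host] += 1
    return users, posts, sites


def top_tfidf_words(documents, k=10):
    """Return the k words with the highest TF-IDF score summed over documents."""
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = Counter()
    for tokens in tokenized:
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] += tf * idf
    return [word for word, _ in scores.most_common(k)]
```

Calling `extract_references(comment_body)` returns three `Counter` objects (users, posts, sites), and `top_tfidf_words(list_of_comment_bodies)` returns up to ten words. A word that appears in every document gets an IDF of zero under this formula, so it ranks last.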

In [None]:
import socket
import json
import praw
import logging

# Set up logging for errors
logging.basicConfig(filename='stream_json_error.log', level=logging.ERROR)

# Import Reddit API credentials
from credentials import CLIENT_ID, CLIENT_SECRET

# Define user agent for Reddit API
USER_AGENT = 'MyBot/0.0.1'

# Define socket host and port
host = "127.0.0.1"
port = 9999

# Define the subreddit to stream
subreddit_name = "reddit"

# Prepare the socket
socket_instance = socket.socket()
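# Allow the port to be rebound immediately after a restart (SO_REUSEPORT is not available on every platform)
socket_instance.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)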
socket_instance.bind((host, port))
print(f"Listening on port: {port}")
socket_instance.listen()

def stream_reddit_data(subreddit):
    '''Stream comments from the chosen subreddit and send each one as JSON through the socket.'''
    # Iterate through comments in the subreddit stream
    for comment in subreddit.stream.comments():
        try:
            # Extract relevant data from the comment and its parent
            post = comment.submission
            parent = comment.parent()

            # A top-level comment's parent is the submission itself, which has no body
            if isinstance(parent, praw.models.Comment):
                previous_comment_body = parent.body
            else:
                previous_comment_body = ""
            comment_body = comment.body

            # Create a dictionary with the extracted data
            comment_data = {
                "comment": comment_body,
                "previous_comment": previous_comment_body,
                "post": post.selftext,
                "author": str(comment.author),
                "link_url": comment.link_url,
                "link_permalink": comment.link_permalink,
                "post_date": comment.created_utc,
                "ups": comment.ups,
                "likes": comment.likes,
            }

            # Send the data through the socket
            connection, address = socket_instance.accept()
            # sendall keeps writing until the whole payload has been sent
            connection.sendall(json.dumps(comment_data).encode('utf-8'))
            connection.close()
            print(f'Sent data: {comment_data}')
        except praw.exceptions.PRAWException as ex:
            # Log PRAW errors to stream_json_error.log and continue with the next comment
            logging.error("Failed to process comment: %s", ex)

# Initialize Reddit API instance
reddit = praw.Reddit(client_id=CLIENT_ID,
                     client_secret=CLIENT_SECRET,
                     user_agent=USER_AGENT)

# Get the specified subreddit
subreddit = reddit.subreddit(subreddit_name)

# Start streaming JSON data from the subreddit
stream_reddit_data(subreddit)

Listening on port: 9999
Sent data: {'comment': "You're a complete moron dude", 'prev_comment': 'We can always do more, and better and faster, but we shared the progress we made over the past 24 months in a few posts (see [here](https://new.reddit.com/r/reddit/comments/weiqc0/better_faster_stronger_recent_improvements_to/), [here](https://new.reddit.com/r/reddit/comments/107orxe/ringing_in_2023_with_a_2022_reflection_on_mod/), and [here](https://www.reddit.com/r/modnews/comments/12kxfd4/mobile_moderation_on_reddit/)). That list includes but is not limited to the following features:  \n[Mod Notes](https://www.reddit.com/r/modnews/comments/t8vafc/announcing_mod_notes/), [User Mod Log](https://new.reddit.com/r/modnews/comments/t8vafc/announcing_mod_notes/), [Mobile Removal Reasons](https://new.reddit.com/r/modnews/comments/vnmpgo/mobile_removal_reasons_mod_queue_improvements/), [Mod Queue sort improvements](https://www.reddit.com/r/modnews/comments/vnmpgo/mobile_removal_reasons_mod_queue_i