# Best Comment Award
## What was the most upvoted comment on Election Day 2016?

In the site_analysis.ipynb notebook, we answered this same question for Election Day 2008 using the Orion cluster and found a common theme among the top comments that day. Now that we are on an Azure Spark cluster, let's look at Election Day 2016.

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, LongType
from datetime import datetime
import time 

In [7]:
# Data was uploaded to Azure block storage

file_path = 'abfs://dda-2022-12-15t21-04-18-212z@ddasta.dfs.core.windows.net/reddit/2016/RC_2016-11.bz2'

# We are only interested in these columns

schema = StructType([
    StructField('author', StringType(), nullable=True),
    StructField('body', StringType(), nullable=True),
    StructField('score', LongType(), nullable=True),
    StructField('created_utc', StringType(), nullable=True)
])

df_nov_16 = spark.read.schema(schema).json(file_path)
df_nov_16.count()

71022319

In [8]:
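# Election Day was on November 8, 2016
from datetime import timezone

# created_utc is in UTC, so build the day boundaries in UTC explicitly
# instead of relying on the cluster's local time zone (as time.mktime does)
start_datetime = datetime(2016, 11, 8, tzinfo=timezone.utc)
end_datetime = datetime(2016, 11, 9, tzinfo=timezone.utc)

# Convert to Unix timestamp strings, which is how timestamps are represented in the schema.
# Comparing them as strings is safe here because every value has 10 digits.
start_datetime_unix = str(int(start_datetime.timestamp()))
end_datetime_unix = str(int(end_datetime.timestamp()))

# Keep comments with start <= created_utc < end
df_nov_8_16 = df_nov_16.filter(
    (df_nov_16['created_utc'] >= start_datetime_unix)
    & (df_nov_16['created_utc'] < end_datetime_unix)
).cache()

df_nov_8_16.count()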

2575570
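
As a sanity check on the filter bounds, the following minimal sketch computes the two day boundaries in UTC using only the standard library and confirms that a `created_utc` value taken from the sample rows in this notebook lies between them. It also illustrates an assumption the string comparison relies on: the values compare correctly as strings only because they all have the same number of digits, so casting to an integer is the safer choice in general. The variable names are illustrative and not part of the notebook.

```python
from datetime import datetime, timezone

# Day boundaries for Election Day 2016, expressed in UTC
start_unix = int(datetime(2016, 11, 8, tzinfo=timezone.utc).timestamp())
end_unix = int(datetime(2016, 11, 9, tzinfo=timezone.utc).timestamp())

# A created_utc value as stored in the data (a string of digits)
sample_created_utc = "1478640324"

# Numeric comparison: correct regardless of how many digits the value has
in_range_numeric = start_unix <= int(sample_created_utc) < end_unix

# String comparison: lexicographic, so it only agrees with the numeric
# result when both strings have the same length
in_range_string = str(start_unix) <= sample_created_utc < str(end_unix)

# Hypothetical 9-digit timestamp (from 2001) to show where strings go wrong
short_value = "999999999"
wrongly_after_start = short_value >= str(start_unix)  # lexicographic: "9" > "1"
correctly_before_start = int(short_value) < start_unix

print(start_unix, end_unix, in_range_numeric, in_range_string)
```

In Spark, the equivalent safeguard would be to cast `created_utc` to a long before comparing, or to declare it as `LongType` in the schema.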

It's interesting that of the 71,022,319 comments in the November 2016 file, 2,575,570 were posted on Election Day. That is about 3.6% of the month's comments on a single day. For reference, an average day in a 30-day month would account for 1/30, or about 3.3%, so Election Day was only modestly busier than a typical November day.
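
The comparison can be worked through with the two counts reported by the cells above. Because the 71,022,319 total comes from the single file RC_2016-11.bz2, this sketch uses 1/30 (one day of a 30-day month) as the baseline; the variable names are illustrative.

```python
# Counts reported by the two count() calls above
total_comments = 71_022_319        # all comments in RC_2016-11.bz2
election_day_comments = 2_575_570  # comments created on November 8, 2016

# Share of the file's comments that fall on Election Day
share = election_day_comments / total_comments

# Share an average day would have in a 30-day month
baseline = 1 / 30

print(f"Election Day share: {share:.2%}")
print(f"Average-day share:  {baseline:.2%}")
print(f"Ratio to baseline:  {share / baseline:.2f}")
```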

In [11]:
df_nov_8_16.take(5)

[Row(author='theoriginalharbinger', body="Get reduced-cap magazines and you're gtg.\n\nWould also suggest buying as much ammo as you're planning on shooting out of state. Cali is implementing a mandate that all ammo go through a background check. [Here's some detail on that](http://www.sacbee.com/news/politics-government/capitol-alert/article88521977.html). My own back-of-the-envelope math indicates this is probably going to cause most ammo sales to be done in bulk and impose anywhere between a 10 to 20% surcharge on the price of ammo.", score=2, created_utc='1478640324', ups=None), Row(author='wwindexx', body="Yes. Yes yes yes. Tax em all. I can't figure out for the life of me why someone who isn't directly affiliated with a church would come up with flimsy excuses on why we shouldn't tax churches. /u/d1rron care to explain why you feel we need to monitor sunday sermons instead of taxing them like everyone else?", score=5, created_utc='1478640324', ups=None), Row(author='Shanti_Ananda

In [18]:
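# Import only what is used; a wildcard import would shadow built-ins such as sum, min, and max
from pyspark.sql.functions import col, desc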

# Sort comments on election day in descending order by score

df_nov_8_16 = df_nov_8_16.sort(desc(col("score")))

df_nov_8_16.take(5)

[Row(author='RumHam12', body="Jesus I'm fuckin retarded \n\nEdit: gold for being retarded? So this is what it's like to be in a special Ed program.", score=20396, created_utc='1478591529', ups=None), Row(author='qvulture', body="It's called the Top 40 because that's how many songs early jukeboxes could hold.", score=17104, created_utc='1478609988', ups=None), Row(author='TooShiftyForYou', body='"Babe, don\'t copy Michelle with this one."', score=17058, created_utc='1478635133', ups=None), Row(author='st0neh', body='Spacebar.', score=16256, created_utc='1478591418', ups=None), Row(author='Alpha-Trion', body='The Rock and Terry Crews should both star in a buddy cop movie together.', score=15043, created_utc='1478617557', ups=None)]

## Answer:

In [28]:
award_winner = df_nov_8_16.take(1)[0]
print(award_winner.author + " wins for their comment: \n\n" + award_winner.body + "\n\nwith a score of " + str(award_winner.score))

RumHam12 wins for their comment: 

Jesus I'm fuckin retarded 

Edit: gold for being retarded? So this is what it's like to be in a special Ed program.

with a score of 20396

Context: https://www.reddit.com/r/pcmasterrace/comments/5brzzu/where_in_the_fuck_is_this_key/d9qte7o/?context=3

I was curious about the context in which this comment was made, so I did some searching on reddit.com and found this. RumHam had made a post to r/pcmasterrace with a screenshot from a game, asking which keyboard key a symbol in the game represented. A helpful Reddit user told RumHam it was the spacebar, and this top comment was RumHam's response to that user.

It's interesting that the top comment on Election Day had nothing to do with the election. Contrast this with Election Day 2008, analyzed in the site_analysis.ipynb notebook in this same GitHub repo, where the top 3 comments of the day were all election related.

One way to interpret this is that as Reddit grew in users, it no longer focused as much on news and current events and each community took on its own life. So despite the 2016 election being massive news in the United States and across the world, Reddit is a world of its own. Another possible interpretation is just that Reddit users like to upvote funny comments.

# Top Comments
## What are the five most-upvoted comments made by u/RumHam12 across the entire dataset? Do they post highly upvoted comments often, or are they a “one-hit wonder”?

In [1]:
from pyspark.sql.functions import col, desc
from pyspark.sql.types import StructType, StructField, StringType, LongType

Starting Spark application




SparkSession available as 'spark'.


In [2]:
winning_author = "RumHam12"

months = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]

# No trailing slash: the year and file name are appended with a leading "/"
reddit_directory = 'abfs://dda-2022-12-15t21-04-18-212z@ddasta.dfs.core.windows.net/reddit'

schema = StructType([
    StructField('author', StringType(), nullable=True),
    StructField('body', StringType(), nullable=True),
    StructField('score', LongType(), nullable=True)
])

## 2016 top comments

In [3]:
df_2016 = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

# Create a data frame with all the 2016 comments

for month in months:
    file_path = reddit_directory + "/2016" + "/RC_2016-" + month + ".bz2"
    month_df = spark.read.schema(schema).json(file_path)
    df_2016 = df_2016.union(month_df)
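
The same monthly loop is repeated for each year in this notebook, with only the year, the month range, and the file extension changing. A minimal sketch of a helper that builds the list of paths is shown here; `monthly_paths` and its parameters are hypothetical names, not part of the notebook. Since `spark.read.json` accepts a list of paths, the resulting list could be passed to a single read in place of the union loop, though that variant was not run against this dataset.

```python
def monthly_paths(base_dir, year, months, extension="bz2"):
    """Build the path of each monthly comment file for one year.

    base_dir, year, months, and extension are hypothetical parameters that
    mirror the pieces concatenated in the loops of this notebook.
    """
    base = base_dir.rstrip("/")  # avoid a double slash when joining
    return [f"{base}/{year}/RC_{year}-{month}.{extension}" for month in months]


# Example with a placeholder base directory (not the real storage account)
example_paths = monthly_paths(
    "abfs://container@account.dfs.core.windows.net/reddit/",
    "2017",
    ["01", "02", "03"],
)
for path in example_paths:
    print(path)

# On the cluster, the list could then be read in one call, for example:
# df_2017 = spark.read.schema(schema).json(monthly_paths(reddit_directory, "2017", months[:3]))
```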

In [4]:
df_2016 = df_2016.filter(col("author") == winning_author)
df_2016 = df_2016.sort(desc(col("score")))
df_2016.take(10)

[Row(author='RumHam12', body="Jesus I'm fuckin retarded \n\nEdit: gold for being retarded? So this is what it's like to be in a special Ed program.", score=20396), Row(author='RumHam12', body="Plz no\n\nEdit: some of you are alright. Don't go on Reddit tomorrow.", score=2772), Row(author='RumHam12', body="I'm going to bed fam. If I wake up and this is on r/all I'm going to run into oncoming traffic.\n\n\nEdit: fuck", score=2337), Row(author='RumHam12', body='I would just love to be reminded of my undying autism for longer ', score=2197), Row(author='RumHam12', body="Good lord I'm so dumb you thought I was kidding ", score=1094), Row(author='RumHam12', body='I spent way too long looking for it ', score=521), Row(author='RumHam12', body="Holy shit that's even worse! This thing is a mess.", score=245), Row(author='RumHam12', body='Lol instead of helping them fix it I just posted it on the Internet and exposed it to the world ', score=127), Row(author='RumHam12', body='They sound like shit

### Analysis

RumHam had three comments with a score between 2,000 and 3,000 and another comment with a score just above 1,000, but their best comment in 2016 by far was the top comment shown above, with a score of 20,396. I would consider RumHam to be a one-hit wonder in 2016 because the score of their top comment was roughly seven times the score of their second-best comment (2,772).
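
To put a number on the gap, this short sketch uses the scores visible in the output above and computes how many times larger the top score is than the runner-up. The list is copied by hand from that output and covers only the rows that are not truncated.

```python
# Scores of RumHam12's 2016 comments, copied from the output above
# (only the rows visible before the output is cut off)
scores_2016 = [20396, 2772, 2337, 2197, 1094, 521, 245, 127]

# Take the two highest scores and compare them
top_score, runner_up = sorted(scores_2016, reverse=True)[:2]
ratio = top_score / runner_up

print(f"Top score is {ratio:.1f}x the second-best score")
```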

Now we will check 2015 to see their comment score performance in the previous year.

## 2015 top comments

In [5]:
year = "2015"
df_2015 = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

for month in months:
    file_path = reddit_directory + "/" + year + "/RC_" + year + "-" + month + ".bz2"
    month_df = spark.read.schema(schema).json(file_path)
    df_2015 = df_2015.union(month_df)
    
df_2015 = df_2015.filter(col("author") == winning_author)

In [6]:
df_2015.count()

0

### Analysis

u/RumHam12 did not make a single comment in 2015. Given this fact, we will assume their account didn't exist in 2015. Now we will check their comment performance in 2017.

## 2017 top comments

In [7]:
year = "2017"
df_2017 = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

# Note that the data set only has the first three months of 2017
for month in months[:3]:
    file_path = reddit_directory + "/" + year + "/RC_" + year + "-" + month + ".bz2"
    month_df = spark.read.schema(schema).json(file_path)
    df_2017 = df_2017.union(month_df)

df_2017 = df_2017.filter(col("author") == winning_author)

In [8]:
df_2017.count()

17

In [9]:
df_2017 = df_2017.sort(desc(col("score")))
df_2017.take(10)

[Row(author='RumHam12', body='Makes sense. Thanks dad!', score=44), Row(author='RumHam12', body="That requires me to take more time away from this game and that's not happening take what you get ", score=13), Row(author='RumHam12', body="Is there a big skill ceiling? Like it's pretty competitive?", score=8), Row(author='RumHam12', body='Breath of the wild ', score=5), Row(author='RumHam12', body="That's good to know! Thanks!", score=5), Row(author='RumHam12', body='I can take all the time in the world if the costume is good enough ', score=4), Row(author='RumHam12', body='Cocapoo ', score=2), Row(author='RumHam12', body='Ok that makes way more sense lol. I thought everything I posted had it and I looked like a retard everywhere I went ', score=1), Row(author='RumHam12', body="Lol as much as I'd love to submit something Zelda related I don't think this would fly", score=1), Row(author='RumHam12', body='Oh I can use this! Thanks fam!', score=1)]

### Analysis

RumHam's comment scores in 2017 were much lower than in 2016, although the dataset only covers the first three months of that year. Their highest-scoring comment had a score of only 44. This further supports RumHam's status as a one-hit wonder commenter.

## 2018 top comments

In [10]:
year = "2018"
df_2018 = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

for month in months:
    # Note that we decompressed the files for 2018 and later
    file_path = reddit_directory + "/" + year + "/RC_" + year + "-" + month + ".json"
    month_df = spark.read.schema(schema).json(file_path)
    df_2018 = df_2018.union(month_df)
    
df_2018 = df_2018.filter(col("author") == winning_author)

In [11]:
df_2018.count()

17

In [12]:
df_2018 = df_2018.sort(desc(col("score")))
df_2018.take(10)

[Row(author='RumHam12', body='Can we just make a separate subreddit for this creepy garbage?', score=3), Row(author='RumHam12', body='Indeed', score=3), Row(author='RumHam12', body='This guy gets it ', score=3), Row(author='RumHam12', body='The law ', score=3), Row(author='RumHam12', body='Are you saying zerk is bs?', score=3), Row(author='RumHam12', body="If the single motherhood rate explodes right as a government incentive is given out for being a single mother I think it's pretty safe to assume there's a connection there ", score=2), Row(author='RumHam12', body="That's like exactly what I was thinking lol", score=2), Row(author='RumHam12', body='Blackbeard would be dope ', score=2), Row(author='RumHam12', body="Wendigo: bound by blood is genuinely the worst movie I have ever seen it's hilarious and I've never heard anyone else even mention it ", score=2), Row(author='RumHam12', body="Yea I tired that it'll still like 2.5 gigs ", score=1)]

### Analysis

2018 was an even weaker year for RumHam's comment scores. Their highest scoring comment only had a score of 3. At this point we can confidently say that RumHam is a one-hit wonder commenter.

## 2019 top comments

In [13]:
year = "2019"
df_2019 = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

for month in months:
    file_path = reddit_directory + "/" + year + "/RC_" + year + "-" + month + ".json"
    month_df = spark.read.schema(schema).json(file_path)
    df_2019 = df_2019.union(month_df)
    
df_2019 = df_2019.filter(col("author") == winning_author)

In [14]:
df_2019.count()

0

### Analysis

RumHam did not make a single comment in 2019. We will assume that they are now inactive and will not analyze the 2020 data for comments.

## Answer

RumHam is a one-hit wonder. Their top 5 comments across the entire dataset are simply their top 5 comments in 2016, which are printed above (I would print them again here, but the notebook was closed and the data was not cached).

Not only was RumHam a one-hit wonder in that their highest-scoring comment scored roughly seven times higher than their second-highest, they were also a one-year wonder: they had several relatively high-scoring comments in 2016 but much lower-scoring comments in other years.
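
The claim that the overall top five all come from 2016 can be checked against the scores printed in the yearly outputs above. This sketch copies the visible scores by hand (2015 and 2019 had no comments) and picks the five largest with `heapq.nlargest`; it is an offline check on the printed numbers, not a query against the full dataset.

```python
import heapq

# Visible scores per year, copied from the outputs above
scores_by_year = {
    "2016": [20396, 2772, 2337, 2197, 1094, 521, 245, 127],
    "2017": [44, 13, 8, 5, 5, 4, 2, 1, 1, 1],
    "2018": [3, 3, 3, 3, 3, 2, 2, 2, 2, 1],
}

# Flatten to (score, year) pairs and keep the five highest scores
pairs = [(score, year) for year, scores in scores_by_year.items() for score in scores]
top_five = heapq.nlargest(5, pairs)

for score, year in top_five:
    print(year, score)
```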