## Hacker News Data

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

This particular dataset is a subset of the data available [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). All posts with no comments were removed, and a random sample of the remaining posts was chosen, bringing the number of rows down to around 20,000.

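We're specifically interested in posts with titles that begin with either `Ask HN` or `Show HN`. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the community a project, a product, or something else they find interesting.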

We'll compare these two types of posts to determine the following:
 - Do Ask HN or Show HN posts receive more comments on average?
 - Do posts created at a certain time receive more comments on average?

Let's start by reading the data from a CSV file.

In [1]:
from csv import reader

# Read every row into a list; the context manager closes the file afterwards.
with open('hacker_news.csv', newline='') as opened_file:
    hn = list(reader(opened_file))
print(hn[:4])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]


Since the first row is a header, let's separate it from the data.

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


Now let's separate the posts into the two types we are interested in, keeping everything else in a third list.

In [5]:
ask_posts = []
show_posts = []
other_posts = []

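for row in hn:
    title = row[1].lower()
    # Note: `in` matches the phrase anywhere in the title, not only at the
    # start; title.startswith('ask hn') would be the stricter check.
    if 'ask hn' in title:
        ask_posts.append(row)
    elif 'show hn' in title:
        show_posts.append(row)
    else:
        other_posts.append(row)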

print('Out of',len(hn),'posts, there are', len(ask_posts), 
      "'Ask HN' posts and", len(show_posts),"'Show HN' posts.")

Out of 20100 posts, there are 1745 'Ask HN' posts and 1165 'Show HN' posts.
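
The loop above counts a post as `Ask HN` or `Show HN` if the phrase appears anywhere in its title. A stricter variant, matching only titles that begin with the phrase, could look roughly like the sketch below. The `classify` helper and the `sample_rows` data are made up for illustration and are not part of the dataset; the counts on the real data may differ slightly from those printed above.

```python
# Hypothetical sample rows in the same column order as the dataset:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ['1', 'Ask HN: How do you learn Rust?', '', '10', '4', 'alice', '1/1/2016 9:00'],
    ['2', 'Show HN: My weekend project', 'http://example.com', '5', '2', 'bob', '1/2/2016 10:30'],
    ['3', 'Why I never post Ask HN questions', 'http://example.com/a', '3', '1', 'carol', '1/3/2016 11:45'],
]

def classify(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

labels = [classify(row[1]) for row in sample_rows]
print(labels)
```

The third sample title contains the phrase but does not start with it, so this version files it under `other`, whereas the `in` check would count it as an `Ask HN` post.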


Now let's find out whether `Ask HN` or `Show HN` posts receive more comments on average.

In [6]:
total_ask_comments = 0
for row in ask_posts:
    comment_count = int(row[4])
    total_ask_comments += comment_count
avg_ask_comments = total_ask_comments/len(ask_posts)

total_show_comments = 0
for row in show_posts:
    comment_count = int(row[4])
    total_show_comments += comment_count
avg_show_comments = total_show_comments/len(show_posts)

print("The average number of comments for 'Ask HN' posts is", 
      avg_ask_comments)
print("The average number of comments for 'Show HN' posts is", 
      avg_show_comments)


The average number of comments for 'Ask HN' posts is 14.031518624641834
The average number of comments for 'Show HN' posts is 10.302145922746782
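
The two loops above are identical apart from the list they iterate over, so the calculation could be factored into a small helper. The sketch below is one possible way to do that; `average_comments` and `demo_posts` are hypothetical names, and the rows are invented for the example.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment-count column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows; only the num_comments column (index 4) matters here.
demo_posts = [
    ['1', 'Ask HN: A', '', '3', '6', 'alice', '1/1/2016 9:00'],
    ['2', 'Ask HN: B', '', '1', '10', 'bob', '1/1/2016 10:00'],
]
print(average_comments(demo_posts))
```

With a helper like this, the averages for `ask_posts` and `show_posts` would each take a single call, and an empty list would no longer raise a `ZeroDivisionError`.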


On average, `Ask HN` posts receive more comments (about 14.0 versus 10.3), so we'll focus on them.

Next, let's see whether `Ask HN` posts created at a certain hour of the day receive more comments.

In [35]:
import datetime as dt

# Pair each post's creation timestamp with its comment count.
result_list = []
for row in ask_posts:
    result_list.append((row[6], int(row[4])))

counts_by_hour = {}    # hour -> number of posts created in that hour
comments_by_hour = {}  # hour -> total comments on those posts
for created_at, num_comments in result_list:
    created_dt = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = created_dt.strftime('%H')
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments

# Average number of comments per post for each hour.
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.898550724637682], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


This list is hard to read. Let's swap the two columns so the average comes first, which makes the list easy to sort, and then print the five hours with the highest averages.

In [37]:
swap_avg_by_hour = []
for item in avg_by_hour:
    swap_avg_by_hour.append([item[1],item[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('Top 5 Hours for Ask Posts Comments:')
for item in sorted_swap[:5]:
    text = '{hour}:00: {avg_comment:.2f}'
    print(text.format(hour = item[1], avg_comment = item[0]))

Top 5 Hours for Ask Posts Comments:
15:00: 38.59
02:00: 23.81
20:00: 21.52
16:00: 16.80
21:00: 16.01
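
Swapping the columns is not the only way to sort by the average. As an alternative sketch, `sorted` accepts a `key` function that picks the value to sort on, so the original `[hour, average]` pairs can be ordered directly. The sample below uses a few of the pairs printed earlier, rounded for brevity; `avg_by_hour_sample` and `top_hours` are names introduced only for this example.

```python
# A few of the [hour, average] pairs printed earlier, rounded for brevity.
avg_by_hour_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort by the average (second element of each pair), largest first.
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print(f'{hour}:00: {avg:.2f}')
```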


The timestamps are in US Eastern Time and we are in Mountain Time, which is two hours behind. In our local time, the five hours with the highest average number of comments on `Ask HN` posts are therefore 1 pm, midnight, 6 pm, 2 pm, and 7 pm.
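
The conversion can be checked with a few lines of arithmetic. The sketch below assumes a fixed two-hour offset between Eastern and Mountain time all year, which holds where both zones change their clocks together; `eastern_to_mountain` is a hypothetical helper written for this example.

```python
def eastern_to_mountain(hour_str, offset_hours=2):
    """Convert a 24-hour Eastern hour string like '15' to a Mountain 12-hour label."""
    # Assumption: Mountain time is a constant `offset_hours` behind Eastern.
    hour = (int(hour_str) - offset_hours) % 24
    suffix = 'am' if hour < 12 else 'pm'
    display = hour % 12 or 12  # hours 0 and 12 both display as 12
    return f'{display} {suffix}'

top_eastern_hours = ['15', '02', '20', '16', '21']
mountain_labels = [eastern_to_mountain(h) for h in top_eastern_hours]
print(mountain_labels)
```

Here `12 am` corresponds to midnight, matching the hours listed above.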