# Exploring Hacker News Posts

In this project, we'll compare two types of posts on Hacker News, a popular site where technology stories (or "posts") are voted and commented on. The two types we'll examine are `Ask HN` and `Show HN`.

Users submit `Ask HN` posts to pose a specific question to the Hacker News community, such as "What's the best online course you've ever taken?" Similarly, users submit `Show HN` posts to show the community a project, a product, or just something interesting.

We're going to compare these two types of posts in particular to determine the following:

* Do `Ask HN` or `Show HN` posts get more comments on average?
* Do posts created at a particular time get more comments on average?

Note that the dataset we're working with was reduced from nearly 300,000 rows to about 20,000 rows by removing all posts that received no comments and then randomly sampling from the remaining posts.
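
The reduction described in the note can be pictured with a small sketch. This is only an illustration under assumptions: the exact procedure used to build the dataset is not given here, and the `all_posts` list, the sample size, and the seed below are hypothetical.

```python
import random

# Hypothetical miniature dataset of [title, num_comments] pairs.
all_posts = [
    ["Post A", 0],
    ["Post B", 3],
    ["Post C", 0],
    ["Post D", 12],
    ["Post E", 1],
    ["Post F", 7],
]

# Step 1: drop posts that received no comments.
commented = [post for post in all_posts if post[1] > 0]

# Step 2: draw a random sample (without replacement) from what remains.
rng = random.Random(0)  # fixed seed (an assumption) so the sketch is repeatable
sample = rng.sample(commented, k=2)

print(len(commented))
print(len(sample))
```

Because zero-comment posts are gone before sampling, every average computed later describes only posts that received at least one comment.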

## Introduction

We'll start by reading the data and removing the headers.

In [1]:
import csv

# Read the whole CSV into a list of rows; the `with` block closes the file for us.
with open('hacker_news.csv') as opened_file:
    hn = list(csv.reader(opened_file))

hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

## Removing Headers from a List of Lists

In [2]:
headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Above, you can see that the dataset contains each post's title, its number of comments, and its creation date. First, let's look at how many comments each post type receives.
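
To make the column positions explicit, here is a minimal sketch that pairs each header with the first data row shown above. The variable names `first_row` and `comments_index` are introduced only for this illustration.

```python
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
first_row = ['12224879', 'Interactive Dynamic Video',
             'http://www.interactivedynamicvideo.com/',
             '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Pair each column name with its position and its value in the first row.
for index, (name, value) in enumerate(zip(headers, first_row)):
    print(index, name, value)

# Look up a column position by name instead of hard-coding it.
comments_index = headers.index('num_comments')
print(comments_index)  # 4
```

With seven columns, index 4 and index -3 refer to the same `num_comments` column, and `created_at` sits at index 6.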

## Extracting Ask HN and Show HN Posts

Let's start by identifying posts whose titles begin with `Ask HN` or `Show HN` and splitting them into separate lists. Separating the data now will make the following steps easier.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for rows in hn:
    title = rows[1]
    if (title.lower()).startswith('ask hn'):
        ask_posts.append(rows)
    elif (title.lower()).startswith('show hn'):
        show_posts.append(rows)
    else:
        other_posts.append(rows)
        

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
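
The prefix check used above can be isolated in a small helper to see how it behaves on edge cases. This is a sketch, not part of the original analysis: the function name `post_type` and the example titles are hypothetical.

```python
def post_type(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # make the comparison case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles, not taken from the dataset.
titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'A story about compilers',
    'Why I ask HN for advice',
]
for title in titles:
    print(post_type(title), '-', title)
```

The last title contains "ask HN" but does not start with it, so it is classified as `other`.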


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Since we've split the Ask and Show posts into different lists, let's calculate the average number of comments each type of post gets.

In [4]:
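def average_comments(posts):
    # Sum the num_comments column (index 4, which is also rows[-3]) over all posts.
    count = 0
    for rows in posts:
        comments = int(rows[-3])
        count += comments
    # Divide the total by the number of posts to get the mean.
    return count / len(posts)

print(average_comments(ask_posts))
print(average_comments(show_posts))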

14.038417431192661
10.31669535283993


In our sample, ask posts average about 14 comments, while show posts average about 10. Since ask posts receive more comments on average, we'll focus the remaining analysis on them.
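
The averaging step can also be written more compactly with `sum` and a generator expression. The sketch below is one possible variant, not the notebook's own function: the name `mean_comments`, the `comments_index` parameter, and the empty-list guard are assumptions added for illustration. The two sample rows are copied from the preview of the data shown earlier.

```python
def mean_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0  # avoid dividing by zero when there are no posts
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Two rows from the data preview (52 comments and 1 comment).
sample_rows = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise',
     'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
     '3', '1', 'hswarna', '6/17/2016 0:01'],
]

print(mean_comments(sample_rows))  # 26.5
print(mean_comments([]))           # 0.0
```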

## Finding the Number of Ask Posts and Comments by Hour Created

Next, we want to determine whether creating an ask post at a certain time of day increases the number of comments it receives. We'll start by counting how many ask posts were created, and how many comments they received, during each hour of the day. Then we'll calculate the average number of comments per ask post for each hour.

In [5]:
import datetime as dt

result_list = []
for rows in ask_posts:
    c_time = rows[6]
    comments = int(rows[-3])
    result_list.append([c_time, comments])
    
date_format = '%m/%d/%Y %H:%M'
counts_by_hour = {}
comments_by_hour = {}

for rows in result_list:
    
    date = rows[0]
    comment = rows[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment
    else:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
        
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
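
The same per-hour tally can be sketched with `collections.defaultdict`, which removes the need for the `if`/`else` branch. This is an alternative shown for illustration, not the notebook's own cell, and the three `[created_at, num_comments]` pairs below are a hypothetical sample in the dataset's date format.

```python
import datetime as dt
from collections import defaultdict

date_format = '%m/%d/%Y %H:%M'

# Hypothetical [created_at, num_comments] pairs.
result_list = [
    ['8/4/2016 11:52', 52],
    ['1/26/2016 19:30', 10],
    ['6/17/2016 11:01', 4],
]

# Missing keys start at 0, so no membership check is needed.
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, comments in result_list:
    # Parse the timestamp and keep only the zero-padded hour, e.g. '11'.
    hour = dt.datetime.strptime(created_at, date_format).strftime('%H')
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += comments

print(dict(counts_by_hour))    # {'11': 2, '19': 1}
print(dict(comments_by_hour))  # {'11': 56, '19': 10}
```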

## Calculating the Average Number of Comments for Ask HN Posts by Hour

In [6]:
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

## Sorting and Printing Values from a List of Lists

In [7]:
swap_avg_by_hour = []

for rows in avg_by_hour:
    swap_avg_by_hour.append([rows[1], rows[0]])

print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

sorted_swap

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
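
Swapping the columns is one way to sort by the average. As a sketch of an alternative, `sorted` also accepts a `key` function, which sorts on the average without building a second list. The four pairs below are a subset of the values computed above, and the name `sorted_by_avg` is introduced only for this illustration.

```python
# A few of the [hour, average] pairs computed above.
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort by the average (index 1), highest first, keeping the [hour, average] order.
sorted_by_avg = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print(hour, round(avg, 2))
```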

In [8]:
print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The hour with the most comments per post is 15:00, with an average of 38.59. That is about 60% more than the second-highest hour, 02:00, which averages 23.81 comments per post.
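
The size of the gap between the top two hours can be checked with a short calculation. The two averages are taken from the results above; the variable names are introduced only for this sketch.

```python
highest = 38.5948275862069           # average for 15:00
second_highest = 23.810344827586206  # average for 02:00

# Relative increase from the second-highest hour to the highest.
relative_increase = (highest - second_highest) / second_highest
print("{:.0%}".format(relative_increase))  # 62%
```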

According to the dataset [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts), the timestamps are in US Eastern Time, so 15:00 can also be written as 3:00 pm EST.
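
Converting a 24-hour value to 12-hour notation can be sketched with `strftime`. This only reformats the hour; it does not perform any time zone conversion. The `%p` marker is locale-dependent, so the sketch assumes the default (English) locale.

```python
import datetime as dt

hour = "15"  # the top hour from the ranking above

# %I is the zero-padded 12-hour clock; %p is the AM/PM marker.
twelve_hour = dt.datetime.strptime(hour, "%H").strftime("%I:%M %p")
print(twelve_hour)
```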

## Conclusion

This analysis looked at which type of post receives the most comments and at which times of day posts receive the most comments. Based on the results, to maximize the number of comments a post receives, we recommend writing it as a question (`Ask HN`) post and creating it between 15:00 and 16:00 (3:00 pm - 4:00 pm EST).

Note, however, that the dataset we analyzed excludes posts that received no comments. Given that, it's more accurate to say that, of the posts that received comments, ask posts received more comments on average, and ask posts created between 15:00 and 16:00 (3:00 pm - 4:00 pm EST) received the most comments on average.