
# Analysis of Hacker News Posts

## Introduction

The aim of this project is to analyze two sections of the Hacker News website, Ask and Show, and compare them by the number of comments their posts receive. In particular, we are interested in posts whose titles begin with either **Ask HN** (submitted to ask the Hacker News community a specific question) or **Show HN** (submitted to show a project, a product, or just something interesting). We'll compare these two types of posts to determine the following:

* Do Ask HN or Show HN posts receive more comments/points on average?
* Do posts created at a certain time receive more comments/points on average?

The original data set for our analysis was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. For descriptions of the columns please consult the data set documentation.

Let's start by opening the data set and reading it into a list of lists.

## 1. Retrieving the data

In [2]:
from csv import reader

# Read the whole CSV into a list of lists; the with-block closes the file
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

for row in hn[:5]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


In [3]:
headers = hn[0]
print(headers)

hn = hn[1:]
for row in hn[:6]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
['10482257', 'Title II

## 2. Extracting Ask HN Posts and Show HN Posts

Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

In [6]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    # elif keeps the three groups mutually exclusive
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("The number of ask posts:", len(ask_posts))
print("The number of show posts:", len(show_posts))
print("The number of other posts:", len(other_posts))

The number of ask posts: 1744
The number of show posts: 1162
The number of other posts: 17194
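
The prefix test can also be written as a small helper that returns exactly one label per title, which makes it easy to check on a few examples. This is only a sketch: the function name `classify_title` and the sample titles are made up for illustration and are not part of the data set.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    elif lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles, used only to illustrate the three outcomes
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'Show HN: A tiny static site generator',
    'SHOW HN: Uppercase prefixes match too',
    'Why I ask HN for advice',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Because each title gets a single label, the sizes of the three groups always add up to the total number of rows.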


## 3. Analyzing Comments for Ask HN and Show HN Posts

### 3.1. Calculating the Average Number of Comments

Let's determine if ask posts or show posts receive more comments on average.

In [8]:
total_ask_comments = 0

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
    
avg_ask_comments = (total_ask_comments / len(ask_posts))
print("The average number of comments on ask posts:", avg_ask_comments)

total_show_comments = 0

for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments

avg_show_comments = (total_show_comments / len(show_posts))
print("The average number of comments on show posts:", avg_show_comments)


The average number of comments on ask posts: 14.038417431192661
The average number of comments on show posts: 10.31669535283993


We can see that, on average, ask posts receive about 1.4 times as many comments as show posts (roughly 14 versus 10).
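
The ratio can be checked directly from the two averages printed above. The snippet below is a minimal sketch that hard-codes those two values rather than recomputing them from the data set.

```python
# Averages copied from the output of the previous cell
avg_ask_comments = 14.038417431192661
avg_show_comments = 10.31669535283993

ratio = avg_ask_comments / avg_show_comments
print(f"Ask posts receive {ratio:.2f} times as many comments as show posts")
```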

Since ask posts receive more comments on average, we'll focus the rest of our analysis on these posts.

Let's determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

* Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
* Calculate the average number of comments ask posts receive by hour created.
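
The two steps above can be sketched on a handful of made-up rows. This version uses `collections.defaultdict` instead of explicit membership checks, which is a stylistic alternative rather than the approach taken in the notebook; the sample values are invented for illustration.

```python
from collections import defaultdict
import datetime as dt

# Made-up (created_at, num_comments) pairs in the data set's date format
sample = [
    ("8/4/2016 11:52", 6),
    ("1/26/2016 11:05", 2),
    ("6/23/2016 22:20", 10),
]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # comments received by those posts
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1
    comments[hour] += num_comments

# Step two: average comments per post for each hour
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```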


### 3.2. Finding the Number of Ask HN Posts and Comments by Hour Created

In [6]:
# Count ask posts and total their comments for each hour of the day

import datetime as dt

result_list = []

for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}
format_string = "%m/%d/%Y %H:%M"

# Aggregate once, after result_list has been fully built
for row in result_list:
    time = dt.datetime.strptime(row[0], format_string)
    hour = time.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

# Print the hours in the order they were first encountered
for hour in comments_by_hour:
    print(hour)

9
13
10
14
16
23
12
17
15
21
20
2
18
3
5
19
1
22
8
4
0
6
7
11



### 3.3. Calculating the Average Number of Comments for Ask HN Posts by Hour

Now we'll use these dictionaries to calculate the average number of comments for posts created during each hour of the day.


In [19]:
# the average number of comments per post 
# for posts created during each hour of the day

avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, (comments_by_hour[hour]/counts_by_hour[hour])])

for row in avg_by_hour:
    print(row)
                        


[9, 5.5777777777777775]
[13, 14.741176470588234]
[10, 13.440677966101696]
[14, 13.233644859813085]
[16, 16.796296296296298]
[23, 7.985294117647059]
[12, 9.41095890410959]
[17, 11.46]
[15, 38.5948275862069]
[21, 16.009174311926607]
[20, 21.525]
[2, 23.810344827586206]
[18, 13.20183486238532]
[3, 7.796296296296297]
[5, 10.08695652173913]
[19, 10.8]
[1, 11.383333333333333]
[22, 6.746478873239437]
[8, 10.25]
[4, 7.170212765957447]
[0, 8.127272727272727]
[6, 9.022727272727273]
[7, 7.852941176470588]
[11, 11.051724137931034]


This format makes it hard to identify the hours with the highest values. Let's sort the list of lists and print the 5 highest values in a format that is easier to read.

In [44]:
# Put the average first so that sorting orders the rows by average
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
for row in sorted_swap:
    print(row)
print("\nTop 5 Hours for ask Posts Comments")

# Format each of the five best hours as HH:MM
for row in sorted_swap[:5]:
    time_obj = dt.datetime.strptime(str(row[1]), "%H")
    time = time_obj.strftime("%H:%M")
    print("{time}: {comments:.2f} average comments per post".format(time=time, comments=row[0]))

[38.5948275862069, 15]
[23.810344827586206, 2]
[21.525, 20]
[16.796296296296298, 16]
[16.009174311926607, 21]
[14.741176470588234, 13]
[13.440677966101696, 10]
[13.233644859813085, 14]
[13.20183486238532, 18]
[11.46, 17]
[11.383333333333333, 1]
[11.051724137931034, 11]
[10.8, 19]
[10.25, 8]
[10.08695652173913, 5]
[9.41095890410959, 12]
[9.022727272727273, 6]
[8.127272727272727, 0]
[7.985294117647059, 23]
[7.852941176470588, 7]
[7.796296296296297, 3]
[7.170212765957447, 4]
[6.746478873239437, 22]
[5.5777777777777775, 9]

Top 5 Hours for ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
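
Swapping the columns is one way to sort by the average. An alternative sketch, not the method used in this notebook, passes a `key` function to `sorted` and leaves the `[hour, average]` order unchanged. The variable `avg_by_hour_sample` is a hypothetical name holding a few pairs copied from the output above.

```python
# A few [hour, average] pairs copied from the output above
avg_by_hour_sample = [
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [2, 23.810344827586206],
    [20, 21.525],
]

# Sort on the second element (the average), highest first
top = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)
for hour, avg in top[:3]:
    print(f"{hour:02d}:00: {avg:.2f} average comments per post")
```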


Thus, among all ask posts that received comments, the most commented ones were created in the following time ranges: 15:00-17:00, 2:00-3:00, and 20:00-22:00. The most favorable range, by a wide margin over the runner-up, is 15:00-16:00. According to the data set documentation, the times are given in US Eastern Time. Rome is normally six hours ahead of Eastern Time, so in our time zone (Europe/Rome) we should create an Ask HN post between 21:00 and 22:00 to have a higher chance of receiving comments.

## Conclusion

To sum up, ask posts stimulate more discussion and receive more comments on average than show posts, while show posts, which often present something new, receive more points on average. To have a higher chance of receiving comments on an ask post, we should submit it between 21:00 and 22:00 (Europe/Rome time zone). For a show post to receive more points, the best time to submit it is from 6:00 to 7:00 or from 19:00 to 20:00 (Europe/Rome time zone).
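
An hour given in Eastern Time can be converted to Rome time with the standard `datetime` module. The sketch below assumes fixed summer offsets (EDT at UTC-4 and CEST at UTC+2) and an arbitrary date in August 2016. The helper name `eastern_hour_to_rome` is made up for illustration, and a DST-aware conversion would use named time zones instead of fixed offsets.

```python
import datetime as dt

# Assumed fixed offsets for a summer date: EDT is UTC-4, CEST is UTC+2
EASTERN = dt.timezone(dt.timedelta(hours=-4))
ROME = dt.timezone(dt.timedelta(hours=2))

def eastern_hour_to_rome(hour):
    """Convert an hour of the day in Eastern Time to the hour in Rome."""
    eastern_time = dt.datetime(2016, 8, 4, hour, 0, tzinfo=EASTERN)
    return eastern_time.astimezone(ROME).hour

for hour in (15, 2, 20):
    print(f"{hour:02d}:00 Eastern -> {eastern_hour_to_rome(hour):02d}:00 Rome")
```

Under these assumptions the difference is six hours, so a late-evening Eastern hour such as 20:00 falls on the following day in Rome.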