# Exploring Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

The dataset comes from `hacker_news.csv`.
Below are the descriptions of the columns:

* id: The unique identifier from Hacker News for the post
* title: The title of the post
* url: The URL that the post links to, if the post has a URL
* num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* num_comments: The number of comments that were made on the post
* author: The username of the person who submitted the post
* created_at: The date and time at which the post was submitted
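
Every field arrives from the CSV reader as a string, so the numeric and date columns need converting before they can be used in calculations. The following is a minimal sketch of that conversion, assuming `created_at` uses a month/day/year date with a 24-hour time, as in the rows printed in this notebook. `parse_row` is a hypothetical helper introduced only for illustration; it is not part of the analysis itself.

```python
import datetime as dt

def parse_row(row):
    """Convert one raw CSV row (a list of strings) into typed values."""
    post_id, title, url, num_points, num_comments, author, created_at = row
    return {
        "id": int(post_id),
        "title": title,
        "url": url,
        "num_points": int(num_points),
        "num_comments": int(num_comments),
        "author": author,
        # Assumed format: month/day/year hour:minute (24-hour clock)
        "created_at": dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M"),
    }

# One row taken from the dataset preview in this notebook
sample = ['12224879', 'Interactive Dynamic Video',
          'http://www.interactivedynamicvideo.com/', '386', '52',
          'ne0phyte', '8/4/2016 11:52']
parsed = parse_row(sample)
print(parsed["num_comments"], parsed["created_at"].hour)
```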

In [3]:
from csv import reader
import datetime as dt

In [4]:
open_file = open("datasets/hacker_news.csv")
read_file = reader(open_file)

In [5]:
hn = list(read_file)
open_file.close()  # release the file handle once all rows are read

In [6]:
print(*hn[:6], sep='\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


In [7]:
headers = hn[0]
hn = hn[1:]

In [8]:
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [9]:
print(*hn[:5], sep='\n')

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


### Extracting "Ask HN" and "Show HN" posts

In [10]:
ask_posts = []
show_posts = []
other_posts = []

In [11]:
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [12]:
print("No. of ask_posts:", len(ask_posts))
print("No. of show posts:", len(show_posts))
print("No. of other_posts:", len(other_posts))

No. of ask_posts: 1744
No. of show posts: 1162
No. of other_posts: 17194
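
The prefix check can also be wrapped in a small function, which makes the case-insensitive matching easy to test on its own. This is only a sketch: `classify` is a hypothetical helper, and the first two titles are made-up examples rather than rows from the dataset.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# The first two titles are invented for illustration
titles = [
    "Ask HN: How do you learn?",
    "SHOW HN: My side project",
    "Interactive Dynamic Video",
]
labels = [classify(t) for t in titles]
print(labels)
```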


### Calculating the average number of comments for "Ask HN" and "Show HN" posts

In [13]:
total_ask_comments = sum([int(row[4]) for row in ask_posts])

In [14]:
avg_ask_comments = total_ask_comments / len(ask_posts)

In [15]:
total_show_comments = sum([int(row[4]) for row in show_posts])
avg_show_comments = total_show_comments / len(show_posts)  # divide by the number of "Show HN" posts

In [16]:
print("Average no. of comments on 'Ask HN' posts: %.4f" % avg_ask_comments)
print("Average no. of comments on 'Show HN' posts: %.4f" % avg_show_comments)

Average no. of comments on 'Ask HN' posts: 14.0384
Average no. of comments on 'Show HN' posts: 10.3167


Thus, "Ask HN" posts receive more comments on average than "Show HN" posts.
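
Computing the total and the count from the same list inside one function removes the risk of dividing by the length of the wrong list. A minimal sketch, assuming the comment count sits at index 4 as in this dataset; `average_comments` and the two sample rows are hypothetical and used only for illustration.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments across posts; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Invented rows in the same column order as the dataset
sample_posts = [
    ['1', 'Ask HN: example A', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: example B', '', '3', '4', 'user_b', '1/2/2016 10:00'],
]
print(average_comments(sample_posts))  # (10 + 4) / 2 = 7.0
```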

### Finding the number of "Ask HN" posts and comments by hour created

In [17]:
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

In [18]:
counts_by_hour = {}
comments_by_hour = {}

In [19]:
print(*[row[0] for row in result_list[:50]], sep='\n')

8/16/2016 9:55
11/22/2015 13:43
5/2/2016 10:14
8/2/2016 14:20
10/15/2015 16:38
9/26/2015 23:23
4/22/2016 12:24
11/16/2015 9:22
2/24/2016 17:57
6/4/2016 17:17
9/19/2015 17:04
9/22/2015 13:16
6/21/2016 15:45
1/13/2016 21:17
10/4/2015 21:27
1/25/2016 20:27
10/27/2015 2:47
1/19/2016 12:01
3/22/2016 2:05
9/8/2015 14:04
8/28/2016 18:06
7/20/2016 13:44
9/12/2016 16:52
2/29/2016 17:52
4/18/2016 15:28
12/28/2015 14:38
4/4/2016 3:34
1/15/2016 21:47
11/19/2015 5:33
12/20/2015 3:59
10/15/2015 21:34
2/26/2016 19:20
8/2/2016 18:00
2/28/2016 1:24
1/13/2016 9:12
5/6/2016 1:14
6/23/2016 13:59
4/30/2016 17:21
10/20/2015 19:21
10/25/2015 15:09
5/4/2016 14:14
12/23/2015 20:48
6/1/2016 16:19
7/10/2016 22:19
6/21/2016 23:40
4/24/2016 19:36
5/20/2016 1:26
6/24/2016 17:02
7/12/2016 19:34
10/16/2015 16:36


In [20]:
for row in result_list:
    dateobj = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = dateobj.strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]

In [21]:
for k, v in counts_by_hour.items():
    print(k, v)

09 45
13 85
10 59
14 107
16 108
23 68
12 73
17 100
15 116
21 109
20 80
02 58
18 109
03 54
05 46
19 110
01 60
22 71
08 48
04 47
00 55
06 44
07 34
11 58


In [22]:
for k, v in comments_by_hour.items():
    print(k, v)

09 251
13 1253
10 793
14 1416
16 1814
23 543
12 687
17 1146
15 4477
21 1745
20 1722
02 1381
18 1439
03 421
05 464
19 1188
01 683
22 479
08 492
04 337
00 447
06 397
07 267
11 641


Thus, `counts_by_hour` holds the number of "Ask HN" posts created during each hour of the day, and `comments_by_hour` holds the total number of comments those posts received.
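
The same tally can be written without the explicit `if`/`else` branch by using `collections.defaultdict`, which starts every missing key at zero. The sketch below uses three timestamps from the list above; the comment counts attached to them are invented for illustration.

```python
from collections import defaultdict
import datetime as dt

# Timestamps come from the notebook output; comment counts are hypothetical
sample = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["9/22/2015 13:16", 1],
]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # total comments on posts created in each hour

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```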

### Calculating the average number of comments per "Ask HN" post for each hour

In [23]:
avg_by_hour = []

In [24]:
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

In [25]:
print(*avg_by_hour, sep='\n')

['09', 5.5777777777777775]
['13', 14.741176470588234]
['10', 13.440677966101696]
['14', 13.233644859813085]
['16', 16.796296296296298]
['23', 7.985294117647059]
['12', 9.41095890410959]
['17', 11.46]
['15', 38.5948275862069]
['21', 16.009174311926607]
['20', 21.525]
['02', 23.810344827586206]
['18', 13.20183486238532]
['03', 7.796296296296297]
['05', 10.08695652173913]
['19', 10.8]
['01', 11.383333333333333]
['22', 6.746478873239437]
['08', 10.25]
['04', 7.170212765957447]
['00', 8.127272727272727]
['06', 9.022727272727273]
['07', 7.852941176470588]
['11', 11.051724137931034]


Sorting `avg_by_hour` in descending order, using the average number of comments for each hour (the second element of each sublist) as the key:

In [26]:
avg_by_hour.sort(reverse=True, key=lambda row:row[1])

In [27]:
print(*avg_by_hour, sep='\n')

['15', 38.5948275862069]
['02', 23.810344827586206]
['20', 21.525]
['16', 16.796296296296298]
['21', 16.009174311926607]
['13', 14.741176470588234]
['10', 13.440677966101696]
['14', 13.233644859813085]
['18', 13.20183486238532]
['17', 11.46]
['01', 11.383333333333333]
['11', 11.051724137931034]
['19', 10.8]
['08', 10.25]
['05', 10.08695652173913]
['12', 9.41095890410959]
['06', 9.022727272727273]
['00', 8.127272727272727]
['23', 7.985294117647059]
['07', 7.852941176470588]
['03', 7.796296296296297]
['04', 7.170212765957447]
['22', 6.746478873239437]
['09', 5.5777777777777775]
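
Note that `list.sort` reorders the list in place. If the original order is still needed later, `sorted` returns a new list and leaves the input untouched. A small sketch, using three of the averages above rounded to two decimals:

```python
# Three [hour, average] pairs, rounded from the output above
avg_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81]]

# sorted() builds a new list; avg_sample keeps its original order
ranked = sorted(avg_sample, key=lambda row: row[1], reverse=True)

print(ranked[0])      # highest average first
print(avg_sample[0])  # original list is unchanged
```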


## Conclusion

The five hours with the highest average number of comments per "Ask HN" post:

In [28]:
for row in avg_by_hour[:5]:
    dateobj = dt.datetime.strptime(row[0], "%H")
    hour = dateobj.strftime("%H:%M")
    avg_comments = row[1]
    print("{}: {:.2f} average comments per post".format(hour, avg_comments))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


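Thus, "Ask HN" posts created during these hours received the most comments on average in this dataset, which suggests that posting at these times gives a higher chance of receiving comments.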