# Analysing Comment Distribution of Hacker News Posts Depending on Time and Title

Hacker News is a very useful platform that many programmers and analysts use to get their questions answered or to show their projects and ideas. Depending on the type of post and the time of posting, the number of comments each post receives may change.

In this project, we analyse how the average number of comments changes with the post type (as indicated by the title) and the submission time of the post.

The code below reads the dataset into a list of lists so that we can work on it.

In [2]:
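from csv import reader

# Open the CSV file; the with block closes it once reading is done
with open("hacker_news.csv") as open_file:
    read_file = reader(open_file)
    hn = list(read_file)  # every row becomes a list of strings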

Here are the first 5 rows of hn:

In [3]:
for row in hn[:5]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']




Now we will separate the first row from the rest of the data, since it contains the column names.

In [4]:
header = hn[0]
del hn[0]
print(header,'\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Now we would like to create lists for different types of posts such as ask posts, show posts, and other posts.

In [25]:
ask_posts = list()
show_posts = list()
other_posts = list()

We need to check the title of each row and append the row to the corresponding list.

In [26]:
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
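
The prefix check above can also be factored into a small helper, which makes the rule easy to test on its own. This is only a minimal sketch: `classify_title` and the sample titles are hypothetical and are not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical sample titles, used only for illustration
sample_titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowercasing the title before the comparison keeps the check case-insensitive, as in the loop above.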

The title categories are now ready, and we can check the number of posts in each category.

In [27]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


Next, we calculate the total and average number of comments for each category.

## Ask Post Comments

In [31]:
# Running total of comments across all ask posts
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])  # num_comments column
    total_ask_comments += num_comments

# Average number of comments per ask post
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


## Show Post Comments

In [32]:
# Running total of comments across all show posts
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])  # num_comments column
    total_show_comments += num_comments

# Average number of comments per show post
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993
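
Since the ask and show calculations follow the same steps, they could share one function. The sketch below is an assumption-laden illustration: `average_comments`, its `comments_index` parameter, and the two sample rows are hypothetical, and returning `0.0` for an empty list is a design choice made here, not something the notebook specifies.

```python
def average_comments(rows, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Hypothetical rows laid out like the dataset:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ['1', 'Ask HN: A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B', '', '3', '4', 'user_b', '1/26/2016 19:30'],
]
sample_avg = average_comments(sample_rows)
print(sample_avg)
```

With the notebook's lists, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.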


Since ask posts receive more comments on average than show posts, we are going to concentrate on ask posts and analyse the number of posts and comments by hour.

In [35]:
import datetime as dt
result_list = list()
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
# result_list now holds [created_at, num_comments] for each ask post

Then we create two dictionaries: one for the number of posts per hour and one for the number of comments per hour.

In [37]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:   
    
    raw_date = row[0]
    comment = row[1]
    dt_obj = dt.datetime.strptime(raw_date,'%m/%d/%Y %H:%M')
    hour = dt_obj.strftime('%H')
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment

In [39]:
print(counts_by_hour)
print(comments_by_hour)

{'14': 107, '00': 55, '10': 59, '12': 73, '09': 45, '03': 54, '17': 100, '07': 34, '21': 109, '05': 46, '16': 108, '15': 116, '04': 47, '22': 71, '02': 58, '13': 85, '19': 110, '01': 60, '18': 109, '23': 68, '11': 58, '06': 44, '08': 48, '20': 80}
{'14': 1416, '00': 447, '10': 793, '12': 687, '09': 251, '03': 421, '17': 1146, '07': 267, '21': 1745, '05': 464, '16': 1814, '15': 4477, '04': 337, '22': 479, '02': 1381, '13': 1253, '19': 1188, '01': 683, '18': 1439, '23': 543, '11': 641, '06': 397, '08': 492, '20': 1722}
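
The `if hour not in ...` branch can be avoided with `collections.defaultdict`, which starts every missing key at zero. The following is a minimal sketch on three made-up rows (the `sample` list is hypothetical); it uses the same date format string as the cell above.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs, for illustration only
sample = [
    ['8/4/2016 11:52', 52],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 11:05', 1],
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # comments per hour

for raw_date, num_comments in sample:
    hour = dt.datetime.strptime(raw_date, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```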


Now we can calculate the average number of comments per post for each hour by using these dictionaries.

In [52]:
avg_by_hour = list()
for hour in counts_by_hour:
    avg_by_hour.append([hour,comments_by_hour[hour] / counts_by_hour[hour]])
    
for row in avg_by_hour:
    print(row,'\n')

['14', 13.233644859813085] 

['00', 8.127272727272727] 

['10', 13.440677966101696] 

['12', 9.41095890410959] 

['09', 5.5777777777777775] 

['03', 7.796296296296297] 

['17', 11.46] 

['07', 7.852941176470588] 

['21', 16.009174311926607] 

['05', 10.08695652173913] 

['16', 16.796296296296298] 

['15', 38.5948275862069] 

['04', 7.170212765957447] 

['22', 6.746478873239437] 

['02', 23.810344827586206] 

['13', 14.741176470588234] 

['19', 10.8] 

['01', 11.383333333333333] 

['18', 13.20183486238532] 

['23', 7.985294117647059] 

['11', 11.051724137931034] 

['06', 9.022727272727273] 

['08', 10.25] 

['20', 21.525] 



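We have found the average number of comments per post for each hour, but the list is unordered and hard to read. To sort it by average, we first swap the two columns so that the average comes first.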

In [53]:
swap_avg_by_hour = list()
for row in avg_by_hour:
    first = row[1]
    second = row[0]
    swap_avg_by_hour.append([first,second])

print(swap_avg_by_hour)

[[13.233644859813085, '14'], [8.127272727272727, '00'], [13.440677966101696, '10'], [9.41095890410959, '12'], [5.5777777777777775, '09'], [7.796296296296297, '03'], [11.46, '17'], [7.852941176470588, '07'], [16.009174311926607, '21'], [10.08695652173913, '05'], [16.796296296296298, '16'], [38.5948275862069, '15'], [7.170212765957447, '04'], [6.746478873239437, '22'], [23.810344827586206, '02'], [14.741176470588234, '13'], [10.8, '19'], [11.383333333333333, '01'], [13.20183486238532, '18'], [7.985294117647059, '23'], [11.051724137931034, '11'], [9.022727272727273, '06'], [10.25, '08'], [21.525, '20']]


In [56]:
sorted_swap = sorted(swap_avg_by_hour,reverse = True)
print('Top 5 Hours for Ask Posts Comments')
for avg,hour in sorted_swap[:5]:
    dt_hour = dt.datetime.strptime(hour,'%H')
    dt_formated = dt_hour.strftime('%H:%M')
    main_str = "{H} {AVG:.2f} average comments per post"
    print_str = main_str.format(H = dt_formated, AVG = avg) 
    print(print_str)
    

Top 5 Hours for Ask Posts Comments
15:00 38.59 average comments per post
02:00 23.81 average comments per post
20:00 21.52 average comments per post
16:00 16.80 average comments per post
21:00 16.01 average comments per post
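
As an alternative to swapping the columns, `sorted` accepts a `key` function that picks the value to sort on. The sketch below assumes a list of `[hour, average]` pairs like `avg_by_hour`; `sample_avg_by_hour` is a shortened, rounded excerpt of the output above, used only for illustration.

```python
# Rounded excerpt of the [hour, average] pairs printed earlier
sample_avg_by_hour = [['14', 13.23], ['15', 38.59], ['02', 23.81], ['09', 5.58]]

# Sort on the second element (the average), highest first, without swapping
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:2]:
    print("{H}:00 {AVG:.2f} average comments per post".format(H=hour, AVG=avg))
```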


We can now see that the hours at which ask posts receive the most comments on average are:
* 15:00
* 02:00
* 20:00
* 16:00
* 21:00

However, since I live in Turkey, I need to convert these hours from UTC-05:00 to UTC+03:00, which is a difference of 8 hours.
To do that, I will add a time delta of 8 hours to each hour:

In [59]:
print('Top 5 Hours for Ask Posts Comments for Turkey')
for avg,hour in sorted_swap[:5]:
    dt_hour = dt.datetime.strptime(hour,'%H') + dt.timedelta(hours = 8)
    # Shift the hour by 8 hours to convert from UTC-05:00 to UTC+03:00
    
    dt_formated = dt_hour.strftime('%H:%M')
    main_str = "{H} {AVG:.2f} average comments per post"
    print_str = main_str.format(H = dt_formated , AVG = avg) 
    print(print_str)

Top 5 Hours for Ask Posts Comments for Turkey
23:00 38.59 average comments per post
10:00 23.81 average comments per post
04:00 21.52 average comments per post
00:00 16.80 average comments per post
05:00 16.01 average comments per post
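
The 8-hour shift can also be derived from the two offsets instead of being hard-coded. This is a minimal sketch, assuming the dataset hours are in UTC-05:00 and the target is UTC+03:00 as stated above; `convert_hour`, `SOURCE_TZ`, and `TARGET_TZ` are hypothetical names, and the calendar date inside the function is an arbitrary placeholder. Fixed offsets do not account for daylight saving time.

```python
import datetime as dt

# Assumption: source hours are in UTC-05:00 and the target is UTC+03:00
SOURCE_TZ = dt.timezone(dt.timedelta(hours=-5))
TARGET_TZ = dt.timezone(dt.timedelta(hours=3))

def convert_hour(hour, source=SOURCE_TZ, target=TARGET_TZ):
    """Convert an 'HH' string from the source offset to the target offset."""
    # The date is an arbitrary placeholder; only the hour matters here
    moment = dt.datetime(2016, 1, 1, int(hour), tzinfo=source)
    return moment.astimezone(target).strftime('%H:%M')

# The top five hours found above
converted = [convert_hour(hour) for hour in ['15', '02', '20', '16', '21']]
print(converted)
```

Changing `TARGET_TZ` is enough to produce the same table for another fixed offset.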
