# The Best Time of the Day to Post on Hacker News



This project aims to analyze a dataset of submissions to the popular technology site Hacker News. According to [Wikipedia](https://en.wikipedia.org/wiki/Hacker_News), Hacker News is a social news website focused on computer science and entrepreneurship and it is run by Y Combinator, Paul Graham's investment fund and startup incubator.


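The dataset we'll use is available [here](https://www.kaggle.com/hacker-news/hacker-news-posts). The full dataset has almost 300,000 rows, each row representing a post; the `hacker_news.csv` file analyzed below contains 20,100 posts, as the counts in Part 1 show. It includes the following columns: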

* title: title of the post (self explanatory)

* url: the url of the item being linked to

* num_points: the number of upvotes the post received

* num_comments: the number of comments the post received

* author: the name of the account that made the post

* created_at: the date and time the post was made (the time zone is Eastern Time in the US)

For this project, we are particularly interested in posts whose titles begin with Ask HN or Show HN. Ask HN posts ask the community a question, while Show HN posts show the community something: a project, a product, or just something the author finds interesting enough to share. Our goal is to determine whether posts created at a particular time of day receive more interaction than posts created at other times. In other words, we are interested in answering the question: is there a best time of day to post on Hacker News?

## Introduction

First, we'll read the data and remove the header row.

In [1]:
# Import the csv reader to load the data
from csv import reader
opened_file = open('hacker_news.csv')
reader_file = reader(opened_file)
hn = list(reader_file)
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

## Removing Headers from a List of Lists

In [2]:
# Remove headers
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


We can see above that the data set contains the title of the posts, the number of comments for each post, and the date the post was created.
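As a minimal sketch of the same read-and-split step, the block below uses a `with` statement so the file handle is closed automatically. It reads from an in-memory string instead of `hacker_news.csv`; the helper name `read_rows` and the constant `SAMPLE_CSV` are assumptions introduced only for illustration, and the two sample rows are copied from the outputs in this notebook.

```python
import csv
import io

# Hypothetical two-row sample standing in for hacker_news.csv
SAMPLE_CSV = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55\n"
)


def read_rows(file_obj):
    """Return (headers, rows) from an open CSV file object."""
    all_rows = list(csv.reader(file_obj))
    return all_rows[0], all_rows[1:]


# The with block closes the file object when the read is finished
with io.StringIO(SAMPLE_CSV) as f:
    headers, rows = read_rows(f)

print(headers)
print(len(rows))
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by `open('hacker_news.csv')`.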

## Part 1 - Display Number of Posts

Now that the headers are removed we can start filtering our data - a process called Data Cleaning.

To make proper use of our data, we have to separate the Ask posts from the Show posts. Posts that fall under neither category are listed as Other posts.

The next cell creates three lists: ask_posts, show_posts and other_posts.

To find these posts, we lowercase each title and use the string method startswith() to check whether it begins with ask hn or show hn.

The number of posts under each category is listed below.

In [3]:
# creating 3 empty lists
ask_posts = []
show_posts = []
other_posts = []

# Iterate through the rows; the title is at index 1 of each row
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
        
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
        
    else:
        other_posts.append(row)

# Output for each category

print('No of ask posts is', len(ask_posts))
print('No of show posts is', len(show_posts))
print('No of other posts is', len(other_posts))

No of ask posts is 1744
No of show posts is 1162
No of other posts is 17194
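The classification used above can also be sketched as a small helper that returns a label for a single title. The function name `classify_title` is an assumption made for this example, and the sample titles are taken from the outputs in this notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in titles]
print(labels)
```

Keeping the check in one function makes it easy to test the prefix logic separately from the loop that fills the three lists.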


In [4]:
# Display first five rows of each list:
print(ask_posts[:5])
print('\n')
print(show_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 

## Part 2 - Which category of posts receives the most comments?

To do this, we're going to compute the average number of comments for each category.

We loop over each category to total its comments, divide by the number of posts, and compare the two averages.

In [5]:
# Find the total and average number of comments for Ask posts:

total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)


# Find the total and average number of comments for Show posts:

total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments  = (total_show_comments)/len(show_posts)

print('Average ask comments is', (avg_ask_comments))

print('Average show comments is', (avg_show_comments))

Average ask comments is 14.038417431192661
Average show comments is 10.31669535283993


The results above show that, on average, Ask posts receive more comments (about 14 per post) than Show posts (about 10 per post). We'll focus the comment analysis that follows on Ask posts.
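The averaging step can be sketched as one reusable function, which avoids repeating the loop for each category. The name `average_comments` and its `comments_index` parameter are assumptions for illustration; the two sample rows are copied from the Ask posts printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Average of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0  # avoid dividing by zero when a category is empty
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two Ask HN rows copied from the output above
sample = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
sample_average = average_comments(sample)
print(sample_average)
```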

## Part 3 - Does time affect the amount of feedback posts receive?

The next interesting thing is to check whether the time at which the post is uploaded has any effect on the level of feedback that the post might receive, be it by comments or upvotes.

In this part, we use the datetime module to parse the created_at column of our dataset, which is at index 6 of each row.

We also create two dictionaries keyed by the hour in 24-hour format: one for the total number of comments and the other for the number of posts.
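As a minimal sketch of the parsing step, the block below turns one created_at string into its two-digit hour. The helper name `hour_of` is an assumption for this example; the format string matches the dates shown in the data, and both sample dates appear in the rows printed earlier.

```python
import datetime as dt

DATE_FORMAT = "%m/%d/%Y %H:%M"  # e.g. 8/4/2016 11:52


def hour_of(created_at):
    """Return the hour of a created_at string as a two-digit string."""
    parsed = dt.datetime.strptime(created_at, DATE_FORMAT)
    return parsed.strftime("%H")


late_morning = hour_of('8/4/2016 11:52')
after_midnight = hour_of('6/17/2016 0:01')
print(late_morning, after_midnight)
```

Formatting the hour with `%H` pads it to two digits, so the dictionary keys sort correctly as strings.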

In [6]:
# Import the datetime module to parse the creation times
import datetime as dt

In [7]:
# Collect [created_at, num_comments] for every Ask post
result_list = []

for post in ask_posts:
    result_list.append(
        [post[6], int(post[4])]
    )

comments_by_hour = {}
counts_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1

print('The no of comments on ask posts by the hour are:')
comments_by_hour

The no of comments on ask posts by the hour are:


{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

The output above shows that **Ask Posts** created in the 15:00 hour received the most comments in total (4477). In general, posts created between 13:00 and 21:00 collected more comments than posts created at other hours.

One possible explanation, which this data cannot confirm, is that many users are busy with work or school during the morning and have more time for leisure activities from the early afternoon onwards.

There is one oddity: a large number of comments (1381) on posts made in the early hours of the day, at 02:00. Total comments also depend on how many posts were made in each hour, so the next step is to calculate the average number of comments per post by hour.

In [8]:
# Calculating the average number of comments for Ask HN posts per hour

avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

avg_by_hour

[['20', 21.525],
 ['06', 9.022727272727273],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['01', 11.383333333333333],
 ['21', 16.009174311926607],
 ['12', 9.41095890410959],
 ['23', 7.985294117647059],
 ['22', 6.746478873239437],
 ['17', 11.46],
 ['02', 23.810344827586206],
 ['14', 13.233644859813085],
 ['00', 8.127272727272727],
 ['19', 10.8],
 ['11', 11.051724137931034],
 ['08', 10.25],
 ['09', 5.5777777777777775],
 ['04', 7.170212765957447],
 ['16', 16.796296296296298],
 ['10', 13.440677966101696],
 ['13', 14.741176470588234],
 ['15', 38.5948275862069],
 ['18', 13.20183486238532],
 ['07', 7.852941176470588]]

The result above further supports the earlier finding. For example, **Ask Posts** made at 15:00 receive about 39 comments per post on average: 4477 comments spread over 116 posts. The hours from 13:00 to 21:00 also tend to have higher averages than the morning hours.

The 02:00 oddity identified earlier remains: posts made at that hour average about 24 comments each, the second-highest value.
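The averages discussed above can be reproduced for a few hours with a dictionary comprehension. This is only a sketch: the comment totals are copied from the totals printed earlier, and the post counts are the ones printed for Ask posts in Part 5. Only three hours are included to keep the example short.

```python
# Total comments per hour (subset of the totals printed earlier)
comments_by_hour = {'15': 4477, '02': 1381, '09': 251}
# Number of Ask posts per hour (subset of the counts printed in Part 5)
counts_by_hour = {'15': 116, '02': 58, '09': 45}

# Average comments per post = total comments / number of posts
avg_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in comments_by_hour
}

for hour, avg in avg_by_hour.items():
    print(hour, round(avg, 2))
```

A dictionary keeps each hour attached to its average, which avoids the list-of-lists bookkeeping when only a lookup is needed.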

In [9]:
# Swap each [hour, average] pair so the list can be sorted by average
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap

[[21.525, '20'], [9.022727272727273, '06'], [7.796296296296297, '03'], [10.08695652173913, '05'], [11.383333333333333, '01'], [16.009174311926607, '21'], [9.41095890410959, '12'], [7.985294117647059, '23'], [6.746478873239437, '22'], [11.46, '17'], [23.810344827586206, '02'], [13.233644859813085, '14'], [8.127272727272727, '00'], [10.8, '19'], [11.051724137931034, '11'], [10.25, '08'], [5.5777777777777775, '09'], [7.170212765957447, '04'], [16.796296296296298, '16'], [13.440677966101696, '10'], [14.741176470588234, '13'], [38.5948275862069, '15'], [13.20183486238532, '18'], [7.852941176470588, '07']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]

In [10]:
# Print the 5 hours with the highest average comments.

print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The results above show that, to maximize the number of comments, an Ask post should preferably be made at 15:00. Other suitable hours are 02:00, 20:00, 16:00 and 21:00. All hours are in US Eastern Time, the time zone of the created_at column.
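The swap-and-sort step is one way to rank the hours. An alternative sketch, not the approach used in the cells above, passes a `key` function to `sorted()` so the pairs can stay in [hour, average] order. The four pairs below are copied from the averages printed earlier.

```python
# A few [hour, average] pairs copied from the averages above
avg_by_hour = [
    ['20', 21.525],
    ['02', 23.810344827586206],
    ['15', 38.5948275862069],
    ['09', 5.5777777777777775],
]

# Sort by the average (index 1), highest first, without swapping columns
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```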

## Part 4 - Calculating the average number of points Ask posts and Show posts get

In [11]:
# Finding the average number of points (called counts here) for Ask Posts

total_ask_count = 0
for row in ask_posts:
    count = row[3]
    if count != '':
        count = int(row[3])
        total_ask_count += count
    
    
avg_ask_counts = total_ask_count/len(ask_posts)
print('The average number of counts for Ask Posts is {:.2f}'.format(avg_ask_counts))

The average number of counts for Ask Posts is 15.06


In [12]:
# Finding the average number of counts for Show Posts

total_show_count = 0
for row in show_posts:
    show_count = row[3]
    if show_count != '':
        show_count = int(row[3])
        total_show_count += show_count
    
    
avg_show_counts = total_show_count/len(show_posts)
print('The average number of counts for Show Posts is {:.2f}'.format(avg_show_counts))


The average number of counts for Show Posts is 27.56


The values above imply that **Show posts** receive more points (upvotes) on average than **Ask posts**: 27.56 versus 15.06. This suggests that the community values contributions such as projects, products or other interesting things shared with the platform more than questions asked through **Ask posts**.

## Part 5 - Are posts created at a certain time upvoted more than others?

It could be that a post uploaded at a certain time gets more attention or upvotes than posts uploaded at a different time. For example, we already saw that Ask posts uploaded at 3pm get more comments on average than those uploaded at any other time of day.

In this section, we will see whether time has any effect on the upvotes a certain post may get.

We will do this in the following order:

* Ask Posts
* Show Posts

In [13]:
ask_list_counts_vs_time = []

# Collect [created_at, num_points] for every Ask post
for posts in ask_posts:
    created = posts[6]
    counts = int(posts[3])
    ask_list_counts_vs_time.append([created, counts])

ask_counts_by_hour = {}   # number of Ask posts created in each hour
ask_points_by_hour = {}   # total points of Ask posts created in each hour
date_format = '%m/%d/%Y %H:%M'

for row in ask_list_counts_vs_time:
    date = row[0]
    count = row[1]
    created_obj = dt.datetime.strptime(date, date_format).strftime('%H')
    if created_obj not in ask_counts_by_hour:
        ask_counts_by_hour[created_obj] = 1
        ask_points_by_hour[created_obj] = count
    else:
        ask_counts_by_hour[created_obj] += 1
        ask_points_by_hour[created_obj] += count

# Display the number of Ask posts per hour (not the points)
print('the number of counts on ask posts by the hour are:')
ask_counts_by_hour

the number of counts on ask posts by the hour are:


{'00': 55,
 '01': 60,
 '02': 58,
 '03': 54,
 '04': 47,
 '05': 46,
 '06': 44,
 '07': 34,
 '08': 48,
 '09': 45,
 '10': 59,
 '11': 58,
 '12': 73,
 '13': 85,
 '14': 107,
 '15': 116,
 '16': 108,
 '17': 100,
 '18': 109,
 '19': 110,
 '20': 80,
 '21': 109,
 '22': 71,
 '23': 68}

In [14]:
show_list_counts_vs_time = []

# Collect [created_at, num_points] for every Show post
for post in show_posts:
    created = post[6]
    counts = int(post[3])
    show_list_counts_vs_time.append([created, counts])

show_counts_by_hour = {}   # number of Show posts created in each hour
show_points_by_hour = {}   # total points of Show posts created in each hour
date_format = '%m/%d/%Y %H:%M'

for row in show_list_counts_vs_time:
    date = row[0]
    count = row[1]
    show_created_obj = dt.datetime.strptime(date, date_format).strftime('%H')
    if show_created_obj not in show_counts_by_hour:
        show_counts_by_hour[show_created_obj] = 1
        show_points_by_hour[show_created_obj] = count
    else:
        show_counts_by_hour[show_created_obj] += 1
        show_points_by_hour[show_created_obj] += count

# Display the number of Show posts per hour (not the points)
print('the number of counts on show posts by the hour are:')
show_counts_by_hour

the number of counts on show posts by the hour are:


{'00': 31,
 '01': 28,
 '02': 30,
 '03': 27,
 '04': 26,
 '05': 19,
 '06': 16,
 '07': 26,
 '08': 34,
 '09': 30,
 '10': 36,
 '11': 44,
 '12': 61,
 '13': 99,
 '14': 86,
 '15': 78,
 '16': 93,
 '17': 93,
 '18': 61,
 '19': 55,
 '20': 60,
 '21': 47,
 '22': 46,
 '23': 36}
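The two tables above count posts per hour. To compare upvotes across hours, one option is to divide the total points by the number of posts for each hour, as was done for comments in Part 3. The block below is a sketch of that calculation and not part of the original analysis: the function name `average_points_by_hour` is an assumption, the first two sample rows are copied from the Show posts printed earlier, and the third row is hypothetical.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"


def average_points_by_hour(posts, points_index=3, created_index=6):
    """Return {hour: average points per post} for the given rows."""
    totals = defaultdict(int)  # total points per hour
    counts = defaultdict(int)  # number of posts per hour
    for row in posts:
        hour = dt.datetime.strptime(row[created_index], DATE_FORMAT).strftime("%H")
        totals[hour] += int(row[points_index])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


sample = [
    ['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
     'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'],
    ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/',
     '747', '102', 'dhotson', '11/29/2015 22:46'],
    # Hypothetical row added only to show averaging within one hour
    ['0', 'Show HN: Example post', '', '4', '0', 'example_user', '1/1/2016 14:30'],
]
sample_averages = average_points_by_hour(sample)
print(sample_averages)
```

Applied to `ask_posts` and `show_posts`, the same function would give the average points per post for each hour, which could then be sorted as in Part 3.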

## Conclusion
From the results above we can conclude the following points:

* Show Posts are submitted most often between 13:00 and 17:00.
* Ask Posts are also submitted in large numbers, especially between 13:00 and 21:00.

The tables in Part 5 count posts per hour rather than points, so they show when people post, not when posts are upvoted most. The busy window for Ask Posts (13:00 to 21:00) is longer than the one for Show Posts (13:00 to 17:00). For comments, Part 3 showed that Ask Posts created at 15:00 (US Eastern Time) receive the most comments per post.