# Exploring Hacker News Posts

This project explores posts from Hacker News, a popular technology site, to gain insight into the types of posts its users make. The data set comes from [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts), where a description of the data set, including column details, can be found. The original data set contains about 300,000 rows; it has been trimmed to about 20,000 by removing submissions without comments and then randomly sampling from the remainder.

In [1]:
# Importing libraries

from csv import reader
import datetime as dt
import pytz

## Exploring "Ask HN" and "Show HN" Posts

## Average Number of Comments Per Post

In [2]:
# Reading in the csv file into a list of lists: hn

# The with block closes the file once the rows have been read
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [3]:
# Separating out the headers and the rest of the rows data

headers = hn[0]
hn = hn[1:]
print(headers)
print('\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [4]:
# Creating three empty lists to contain posts with
# 'Ask HN', 'Show HN' and others

ask_posts = []
show_posts = []
other_posts = []

# Separating out the three different type of posts

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Printing out the number of 'Ask HN', 'Show HN' and other posts        
print(len(ask_posts))     
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


In [5]:
# Printing out the first five rows of posts that start with
# 'Ask HN' and 'Show HN'
print(ask_posts[:5])
print('\n')
print(show_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 

In [6]:
# Determining whether 'Ask HN' or 'Show HN' posts
# receive more comments on average

total_ask_comments = 0
for row in ask_posts:
    comment = int(row[4])
    total_ask_comments += comment
avg_ask_comments = total_ask_comments / len(ask_posts)    

total_show_comments = 0
for row in show_posts:
    comment = int(row[4])
    total_show_comments += comment
avg_show_comments = total_show_comments / len(show_posts) 

print('Average Number of Comments per "Ask HN" post: {}'.format(avg_ask_comments))
print('Average Number of Comments per "Show HN" post: {}'.format(avg_show_comments))

Average Number of Comments per "Ask HN" post: 14.038417431192661
Average Number of Comments per "Show HN" post: 10.31669535283993


On average, "Ask HN" posts receive more comments per post than "Show HN" posts: about 14.0 versus 10.3.

**Because "Ask HN" posts receive more comments on average, the remaining analysis focuses on "Ask HN" posts.**
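The two loops in the cell above differ only in the list they iterate over. A minimal sketch of one way to factor them into a helper follows; `average_comments` and `sample_ask_posts` are hypothetical names introduced here for illustration, and the rows are copied from the `ask_posts` output shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored at comments_index."""
    if not posts:
        # Assumption: an empty group is reported as 0.0 rather than raising
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Three rows copied from the ask_posts output above
sample_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]

print(average_comments(sample_ask_posts))
```

With the full lists, the same helper could be called once for `ask_posts` and once for `show_posts`.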

## Average Number of Comments Per Ask Post by Hour

In [7]:
# Creating an empty list of lists which would contain the creation time and
# number of comments

result_list = []

# Extracting creation time and number of comments by iterating over the ask posts

for row in ask_posts:
    creation_time = row[6]
    comment = int(row[4])
    result_list.append([creation_time, comment])
    
# Checking first 5 rows
print(result_list[:5])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]


In [8]:
# Creating two empty dictionaries which would serve as frequency tables
counts_by_hour = {}
comments_by_hour = {}

# Looping over the result_list to obtain two frequency tables:
# number of ask posts by hour and number of comments on ask posts by hour
for row in result_list:
    comment = row[1]
    hour = row[0]
    fmt_hour = dt.datetime.strptime(hour,'%m/%d/%Y %H:%M')
    str_hour = fmt_hour.strftime('%H')
    if str_hour not in counts_by_hour:
        counts_by_hour[str_hour] = 1
        comments_by_hour[str_hour] = comment
    else:
        counts_by_hour[str_hour] += 1
        comments_by_hour[str_hour] += comment
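The `if`/`else` branch above is needed only because a plain dictionary raises a `KeyError` for a missing key. A minimal sketch of the same tallying with `collections.defaultdict` follows; the names `sample_rows`, `sample_counts` and `sample_comments` are introduced here for illustration, and the rows are copied from the `result_list` output above.

```python
import datetime as dt
from collections import defaultdict

# Sample [creation_time, comments] rows copied from the result_list output
sample_rows = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
]

# defaultdict(int) starts every missing key at 0, so no membership check is needed
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for created_at, n_comments in sample_rows:
    # Parse the timestamp and keep only the zero-padded hour, e.g. '09'
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    sample_counts[hour] += 1
    sample_comments[hour] += n_comments

print(dict(sample_counts))
print(dict(sample_comments))
```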

In [9]:
# Creating an empty list to show the average number of
# comments per ask post by hour

avg_by_hour = []

# Looping over the dictionaries to calculate out the average

for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
    
print(avg_by_hour)    

[['08', 10.25], ['13', 14.741176470588234], ['19', 10.8], ['01', 11.383333333333333], ['02', 23.810344827586206], ['00', 8.127272727272727], ['21', 16.009174311926607], ['03', 7.796296296296297], ['23', 7.985294117647059], ['05', 10.08695652173913], ['09', 5.5777777777777775], ['18', 13.20183486238532], ['22', 6.746478873239437], ['20', 21.525], ['11', 11.051724137931034], ['14', 13.233644859813085], ['06', 9.022727272727273], ['16', 16.796296296296298], ['17', 11.46], ['04', 7.170212765957447], ['07', 7.852941176470588], ['12', 9.41095890410959], ['15', 38.5948275862069], ['10', 13.440677966101696]]


In [10]:
# Creating a swapped column list of the avg_by_hour list

swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

[[10.25, '08'], [14.741176470588234, '13'], [10.8, '19'], [11.383333333333333, '01'], [23.810344827586206, '02'], [8.127272727272727, '00'], [16.009174311926607, '21'], [7.796296296296297, '03'], [7.985294117647059, '23'], [10.08695652173913, '05'], [5.5777777777777775, '09'], [13.20183486238532, '18'], [6.746478873239437, '22'], [21.525, '20'], [11.051724137931034, '11'], [13.233644859813085, '14'], [9.022727272727273, '06'], [16.796296296296298, '16'], [11.46, '17'], [7.170212765957447, '04'], [7.852941176470588, '07'], [9.41095890410959, '12'], [38.5948275862069, '15'], [13.440677966101696, '10']]


In [11]:
# Sorting by average number of comments (descending) and
# printing the top five hours

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('Top 5 Hours for Ask Posts Comments')
for row in sorted_swap[:5]:
    avg = row[0]
    hour = row[1]
    fmt_hour = dt.datetime.strptime(hour, '%H')
    str_hour = fmt_hour.strftime('%H:%M')
    template = '{}: {:.2f} average comments per post'
    print(template.format(str_hour, avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
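The column swap exists only so that `sorted` compares the averages first. A minimal sketch of an alternative, assuming the same `[hour, average]` layout as `avg_by_hour`, passes a `key` function instead; `avg_by_hour_sample` is a hypothetical name holding a few rows copied from the output above.

```python
# A few [hour, average] rows copied from the avg_by_hour output
avg_by_hour_sample = [
    ['08', 10.25],
    ['02', 23.810344827586206],
    ['20', 21.525],
    ['15', 38.5948275862069],
    ['21', 16.009174311926607],
]

# Sort on the second column (the average) without building a swapped copy
sorted_by_avg = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

for hour, avg in sorted_by_avg[:3]:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```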


## Conclusion

For a user who wants to create an "Ask HN" post, the recommended time is the 15:00 hour (3:00 PM, US/Eastern Time), which has the highest average number of comments per post (38.59).

In [12]:
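# pytz time zones must be attached with localize(); passing one as tzinfo=
# to the datetime constructor picks up the zone's historical local mean
# time offset (-04:56) instead of -05:00
recommended = pytz.timezone('US/Eastern').localize(dt.datetime(2020, 1, 1, 15))
print(recommended)

2020-01-01 15:00:00-05:00


In [13]:
print(' '.join(pytz.country_timezones('sg')))

Asia/Singapore


In [14]:
recommended_sg = recommended.astimezone(pytz.timezone('Asia/Singapore'))
recommended_sg = recommended_sg.strftime('%H:%M')
print(recommended_sg)

04:00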


Converted to Singapore Time (GMT+8), the recommended hour is 4:00 AM on the following day.
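The conversion can be sanity-checked with fixed UTC offsets from the standard library. This is only a sketch under stated assumptions: it hard-codes Eastern Standard Time as UTC-5 (which applies on 1 January) and Singapore as UTC+8, and the names `eastern_standard`, `singapore`, `recommended_est` and `recommended_sgt` are introduced here for illustration.

```python
import datetime as dt

# Assumption: fixed offsets, EST = UTC-5 (valid for 1 January) and SGT = UTC+8
eastern_standard = dt.timezone(dt.timedelta(hours=-5), 'EST')
singapore = dt.timezone(dt.timedelta(hours=8), 'SGT')

# 15:00 Eastern on 1 January 2020, expressed in Singapore time
recommended_est = dt.datetime(2020, 1, 1, 15, tzinfo=eastern_standard)
recommended_sgt = recommended_est.astimezone(singapore)

print(recommended_sgt.strftime('%Y-%m-%d %H:%M'))
```

During US daylight saving time the Eastern offset is UTC-4, so the same 15:00 slot would correspond to 3:00 AM in Singapore.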