# Exploring Hacker News Posts

In this project, I want to analyse submissions made to the Hacker News community. Specifically, I want to better understand two kinds of posts -- "Ask HN" and "Show HN" -- by exploring two questions: 

1) Do Ask HN or Show HN posts receive more comments on average?
2) Do posts created at a certain time of day receive more comments on average?

In [6]:
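# Import the data set
from csv import reader

# Open the file in a context manager so it is closed once it has been read
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

# Preview the header row and the first four data rows
hn[:5]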

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]
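
As a minimal sketch of the loading step, the helper below reads a CSV file object and returns the header row separately from the data rows. The function name `read_rows` is hypothetical, and the in-memory sample stands in for `hacker_news.csv`; its single data row is copied from the output above.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv (one data row taken from the preview above)
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)


def read_rows(file_obj):
    """Return (header, rows) from an open CSV file object (hypothetical helper)."""
    all_rows = list(csv.reader(file_obj))
    return all_rows[0], all_rows[1:]


header, rows = read_rows(sample_csv)
print(header)
print(rows[0][1], rows[0][4])
```

With the real file, the same helper could be called inside `with open('hacker_news.csv', newline='') as f:`. Note that every field, including `num_points` and `num_comments`, comes back as a string.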

## Extracting Headers

In [8]:
# Separate the header row from the data rows
headers = hn[0]
hn = hn[1:]

# Preview the first five data rows
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## Filtering for Ask and Show posts

In [12]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
# Number of posts in each category
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

# Preview the first five Show HN posts
show_posts[:5]

1744
1162
17194


[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05'],
 ['12178806',
  'Show HN: Webscope  Easy way for web developers to communicate with Clients',
  'http://webscopeapp.com',
  '3',
  '3',
  'fastbrick',
  '7/28/2016 7:11'],
 ['10872799',
  'Show HN: GeoScreenshot  Easily test Geo-IP based web pages',
  'https://www.geoscreenshot.com/',
  '1',
  '9',
  'kpsychwave',
  '1/9/2016 20:45']]
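
The filtering rule used above can be isolated in a small function, which makes the case-insensitive prefix match easy to check on its own. This is only a sketch: the name `classify` is hypothetical, and the first two sample titles are made up for illustration.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# The first two titles are invented examples; the third appears in the data set
titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My project',
    'Interactive Dynamic Video',
]
labels = [classify(title) for title in titles]
print(labels)  # ['ask', 'show', 'other']
```

Lowercasing the title first means that variants such as `ASK HN` or `ask hn` land in the same bucket.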

## Calculating the average number of comments for each post type

In [15]:
# Average number of comments on Ask HN posts
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [16]:
# Average number of comments on Show HN posts
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993
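
The two averaging cells above repeat the same loop, so one option is to fold them into a single function. The sketch below assumes the comment count sits at index 4, as in this data set; the name `average_comments` and the two sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments across posts; the count is stored as a string."""
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Invented rows in the same column order as the data set
sample_posts = [
    ['1', 'Ask HN: A', '', '5', '10', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: B', '', '3', '20', 'user_b', '11/22/2015 13:43'],
]
print(average_comments(sample_posts))  # 15.0
```

Calling it as `average_comments(ask_posts)` and `average_comments(show_posts)` would reproduce the two results above. An empty list raises `ZeroDivisionError`, so a guard would be needed if a category could be empty.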


Ask posts receive more comments on average: about 14.04 per post, versus 10.32 for show posts.

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

We'll determine if ask posts created at a certain time are more likely to attract comments by performing the following steps:

1) Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

2) Calculate the average number of comments ask posts receive by hour created.

## 1) Calculating the number of ask posts created in each hour of the day, along with the number of comments received

In [30]:
import datetime as dt

result_list = []

# Collect each post's creation time and number of comments
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])
    
result_list[:5]

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17]]

In [47]:
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for post in result_list:
    num_comments = post[1]
    created_hour = dt.datetime.strptime(post[0], date_format).strftime("%H")
    if created_hour in counts_by_hour:
        counts_by_hour[created_hour] += 1
        comments_by_hour[created_hour] += num_comments
    else:
        counts_by_hour[created_hour] = 1
        comments_by_hour[created_hour] = num_comments

print("Number of Ask posts by hour")
print(counts_by_hour)
print("\n")
print("Number of corresponding comments by hour")
print(comments_by_hour)

Number of Ask posts by hour
{'21': 109, '04': 47, '11': 58, '23': 68, '19': 110, '09': 45, '22': 71, '07': 34, '13': 85, '05': 46, '16': 108, '08': 48, '18': 109, '10': 59, '17': 100, '15': 116, '12': 73, '03': 54, '06': 44, '14': 107, '20': 80, '01': 60, '00': 55, '02': 58}


Number of corresponding comments by hour
{'21': 1745, '04': 337, '11': 641, '23': 543, '19': 1188, '09': 251, '22': 479, '07': 267, '13': 1253, '05': 464, '16': 1814, '08': 492, '18': 1439, '10': 793, '17': 1146, '15': 4477, '12': 687, '03': 421, '06': 397, '14': 1416, '20': 1722, '01': 683, '00': 447, '02': 1381}
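
An alternative to the `if`/`else` bookkeeping above is `collections.defaultdict`, which starts every missing key at zero. The sketch below is one possible variant, not the notebook's method: `tally_by_hour` is a hypothetical name, it keys on the integer `hour` attribute rather than a zero-padded string, and the last sample row is invented so that one hour accumulates two posts.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"


def tally_by_hour(pairs):
    """Return (post counts, comment totals) keyed by integer hour of creation."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).hour
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)


sample = [
    ['8/16/2016 9:55', 6],    # first five rows come from result_list[:5]
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
    ['8/17/2016 9:05', 4],    # invented row, added to show accumulation
]
counts, comments = tally_by_hour(sample)
print(counts[9], comments[9])  # 2 10
```

Integer keys sort numerically without zero-padding, at the cost of needing formatting when the hour is printed.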


## 2) Calculating the average number of comments ask posts receive by hour created

In [51]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

avg_by_hour

[['21', 16.009174311926607],
 ['04', 7.170212765957447],
 ['11', 11.051724137931034],
 ['23', 7.985294117647059],
 ['19', 10.8],
 ['09', 5.5777777777777775],
 ['22', 6.746478873239437],
 ['07', 7.852941176470588],
 ['13', 14.741176470588234],
 ['05', 10.08695652173913],
 ['16', 16.796296296296298],
 ['08', 10.25],
 ['18', 13.20183486238532],
 ['10', 13.440677966101696],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['12', 9.41095890410959],
 ['03', 7.796296296296297],
 ['06', 9.022727272727273],
 ['14', 13.233644859813085],
 ['20', 21.525],
 ['01', 11.383333333333333],
 ['00', 8.127272727272727],
 ['02', 23.810344827586206]]

## Ranking hours by average number of comments per post

In [54]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
swap_avg_by_hour

[[16.009174311926607, '21'],
 [7.170212765957447, '04'],
 [11.051724137931034, '11'],
 [7.985294117647059, '23'],
 [10.8, '19'],
 [5.5777777777777775, '09'],
 [6.746478873239437, '22'],
 [7.852941176470588, '07'],
 [14.741176470588234, '13'],
 [10.08695652173913, '05'],
 [16.796296296296298, '16'],
 [10.25, '08'],
 [13.20183486238532, '18'],
 [13.440677966101696, '10'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [9.41095890410959, '12'],
 [7.796296296296297, '03'],
 [9.022727272727273, '06'],
 [13.233644859813085, '14'],
 [21.525, '20'],
 [11.383333333333333, '01'],
 [8.127272727272727, '00'],
 [23.810344827586206, '02']]

In [57]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
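
The swap-then-sort approach above works because lists are compared element by element, so placing the average first makes it the sort key. The same ranking can be obtained without building a swapped copy by passing a `key` function to `sorted`. This is a sketch on four rows copied from `avg_by_hour`; the names `ranked` and `top_hours` are hypothetical.

```python
# Four [hour, average] rows taken from avg_by_hour above
avg_by_hour = [
    ['21', 16.009174311926607],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['09', 5.5777777777777775],
]

# Sort on the average (index 1), highest first, leaving the column order intact
ranked = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)
top_hours = [hour for hour, _ in ranked[:3]]
print(top_hours)  # ['15', '02', '21']
```

Because the rows keep their `[hour, average]` order, later code would unpack them as `for hour, avg in ranked`.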

In [58]:
print("Top 5 Hours for Ask Posts Comments")

for avg, hour in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg
        )
    )
    

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


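Conclusion: Ask posts created between 3 and 5 PM, between 8 and 10 PM, or in the 2 AM hour receive the most comments on average. The 3 PM hour stands out with 38.59 comments per post, well ahead of the 2 AM hour in second place with 23.81.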