# Analyzing Hacker News Posts

Questions to explore: 
* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

### Read in the data and print the first five rows.

In [3]:
from csv import reader

# Use a context manager so the file is closed after reading
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

In [36]:
for row in hn[0:5]:
    print(row, '\n')

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'] 



### Extract the header row from the dataset.

In [None]:
headers = hn[0]
hn = hn[1:]

In [27]:
print('Header:')
print(headers, '\n')
print('Data:')
for row in hn[0:5]:
    print(row, '\n')
    
hn_index = "[ id=0, title=1, url=2, num_points=3, num_comments=4, author=5, created_at=6 ]"

Header:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

Data:
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'] 

### Separate "Ask HN" posts, "Show HN" posts, and other posts.

In [None]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [18]:
for row in ask_posts[0:2]: 
    print(row,'\n')
for row in show_posts[0:2]: 
    print(row,'\n')
for row in other_posts[0:2]: 
    print(row,'\n')

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'] 

['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'] 

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'] 

['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 
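
The prefix check used above can also be wrapped in a small function, which makes the rule easier to test on its own. This is a minimal sketch rather than the notebook's code: `categorize_title`, `sample_rows`, and `posts_by_category` are hypothetical names, and the sample rows are made-up stand-ins that follow the dataset's column order.

```python
def categorize_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical sample rows in the same column order as the dataset
sample_rows = [
    ['1', 'Ask HN: How to improve my personal website?', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Show HN: Something pointless I made', 'http://example.com/', '747', '102', 'user_b', '11/29/2015 22:46'],
    ['3', 'Interactive Dynamic Video', 'http://example.com/', '386', '52', 'user_c', '8/4/2016 11:52'],
]

# Group rows by category; the title is at index 1
posts_by_category = {'ask': [], 'show': [], 'other': []}
for row in sample_rows:
    posts_by_category[categorize_title(row[1])].append(row)

for category, rows in posts_by_category.items():
    print(category, len(rows))
```

Keeping the three groups in one dictionary is only one option; the separate `ask_posts`, `show_posts`, and `other_posts` lists used in this notebook work equally well.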



### Determine if "Ask HN" posts or "Show HN" posts get more comments on average.

In [49]:
# Find avg comments per "ask hn" post
total_ask_com = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_com += num_comments
    
avg_ask_comments = round(total_ask_com / len(ask_posts), 1)

# Find avg comments per "show hn" post
total_show_com = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_com += num_comments
    
avg_show_comments = round(total_show_com / len(show_posts),1)

# Find avg comments on all other posts
total_other_com = 0
for row in other_posts:
    num_comments = int(row[4])
    total_other_com += num_comments
    
avg_other_comments = round(total_other_com / len(other_posts),1)

#Print averages
print('Average comments per "ask hn" post:', avg_ask_comments)
print('Average comments per "show hn" post:', avg_show_comments)
print('Average comments per other posts:', avg_other_comments)

Average comments per "ask hn" post: 14.0
Average comments per "show hn" post: 10.3
Average comments per other posts: 26.9


On average, "Ask HN" posts receive about 3.7 more comments per post than "Show HN" posts (14.0 vs. 10.3). Posts in neither category receive the most of all, averaging 26.9 comments per post.
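
The three averaging loops above repeat the same steps, so they could be folded into one helper. The sketch below assumes the comment count sits at index 4, as in this dataset; `average_comments` and `sample_posts` are hypothetical names, and the rows are illustrative rather than taken from the full results.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts in `posts`, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Illustrative rows only; comment counts are the fifth field
sample_posts = [
    ['1', 'Ask HN: A', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: B', '', '28', '29', 'user_b', '11/22/2015 13:43'],
]

print(round(average_comments(sample_posts), 1))  # (6 + 29) / 2 = 17.5
```

Returning 0.0 for an empty list is a design choice made here to avoid a `ZeroDivisionError`; raising an error instead would also be reasonable.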

### Determine what hours during the day "Ask HN" posts receive the most comments.

In [51]:
### INDEX GUIDE ###

sample_row = hn[0]  # Define the sample row before printing it
print(headers)
print(sample_row)

print('\n')

# enumerate() pairs each value with its index
for i, value in enumerate(headers):
    print(i, ':', value)

print('\n')

for i, value in enumerate(sample_row):
    print(i, ':', value)

# ask_posts = []
# show_posts = []
# other_posts = []

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


0 : id
1 : title
2 : url
3 : num_points
4 : num_comments
5 : author
6 : created_at


0 : 12224879
1 : Interactive Dynamic Video
2 : http://www.interactivedynamicvideo.com/
3 : 386
4 : 52
5 : ne0phyte
6 : 8/4/2016 11:52


In [50]:
import datetime as dt

In [52]:
## Add creation date and number of comments to a list of lists
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

In [67]:
## Find total numbers of posts and comments each hour

counts_by_hour = {}
comments_by_hour = {}

for post in result_list:
    date = dt.datetime.strptime(post[0],'%m/%d/%Y %H:%M') # Convert date to datetime object
    time = dt.datetime.strftime(date, '%H') # Extract hour
    
    # Count num of posts each hour
    if time in counts_by_hour:
        counts_by_hour[time] += 1
    else:
        counts_by_hour[time] = 1
        
    # Count num of comments each hour
    if time in comments_by_hour:
        comments_by_hour[time] += post[1]
    else:
        comments_by_hour[time] = post[1]

In [105]:
## Calculate the average number of comments during each hour
avg_by_hour = []

for hour in counts_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])

In [106]:
## Print averages sorted by hour
avg_by_hour.sort()
print('Sorted by Hour:')
for avg in avg_by_hour:
    print(avg)

Sorted by Hour:
['00', 8.127272727272727]
['01', 11.383333333333333]
['02', 23.810344827586206]
['03', 7.796296296296297]
['04', 7.170212765957447]
['05', 10.08695652173913]
['06', 9.022727272727273]
['07', 7.852941176470588]
['08', 10.25]
['09', 5.5777777777777775]
['10', 13.440677966101696]
['11', 11.051724137931034]
['12', 9.41095890410959]
['13', 14.741176470588234]
['14', 13.233644859813085]
['15', 38.5948275862069]
['16', 16.796296296296298]
['17', 11.46]
['18', 13.20183486238532]
['19', 10.8]
['20', 21.525]
['21', 16.009174311926607]
['22', 6.746478873239437]
['23', 7.985294117647059]
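
The counting and averaging steps above can be combined into a single pass. The following is a sketch under a few assumptions: it uses `collections.defaultdict` instead of the explicit `if`/`else` checks, reads the hour from the `datetime` object's `.hour` attribute (an integer) rather than formatting it as a string, and the names `average_comments_by_hour` and `sample` are hypothetical.

```python
import datetime as dt
from collections import defaultdict

def average_comments_by_hour(created_and_comments, date_format='%m/%d/%Y %H:%M'):
    """Map each posting hour (0-23) to the average number of comments."""
    counts = defaultdict(int)    # number of posts per hour
    comments = defaultdict(int)  # total comments per hour
    for created_at, num_comments in created_and_comments:
        hour = dt.datetime.strptime(created_at, date_format).hour
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}

# Illustrative [created_at, num_comments] pairs, shaped like result_list
sample = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['1/2/2016 9:05', 4]]

averages = average_comments_by_hour(sample)
for hour in sorted(averages):
    print('{:02d}: {:.2f}'.format(hour, averages[hour]))
```

Integer hours sort numerically without relying on zero-padded strings, and the padding is added back only when printing.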


In [107]:
# Print averages sorted by highest average
swap_avg_by_hour = []
for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1],hour[0]])
swap_avg_by_hour.sort(reverse=True)

print('Sorted by Highest Average')
for avg in swap_avg_by_hour:
    print(avg)

Sorted by Highest Average
[38.5948275862069, '15']
[23.810344827586206, '02']
[21.525, '20']
[16.796296296296298, '16']
[16.009174311926607, '21']
[14.741176470588234, '13']
[13.440677966101696, '10']
[13.233644859813085, '14']
[13.20183486238532, '18']
[11.46, '17']
[11.383333333333333, '01']
[11.051724137931034, '11']
[10.8, '19']
[10.25, '08']
[10.08695652173913, '05']
[9.41095890410959, '12']
[9.022727272727273, '06']
[8.127272727272727, '00']
[7.985294117647059, '23']
[7.852941176470588, '07']
[7.796296296296297, '03']
[7.170212765957447, '04']
[6.746478873239437, '22']
[5.5777777777777775, '09']
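
Swapping the columns is one way to sort by average. Another is to pass a `key` function to `sorted()`, which leaves each `[hour, average]` pair in its original order. This is a minimal sketch; `avg_by_hour_sample` and `ranked` are hypothetical names, and the values are rounded stand-ins rather than the full results.

```python
# Illustrative [hour, average] pairs in the same shape as avg_by_hour
avg_by_hour_sample = [['00', 8.1], ['02', 23.8], ['15', 38.6], ['09', 5.6]]

# Sort by the average (second element), highest first
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:3]:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```

One difference from the swap approach: when two hours have the same average, `sorted()` with a key keeps them in their input order, whereas sorting the swapped lists breaks ties by the hour string.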


In [122]:
print('Top 5 Hours for "Ask HN" Comments (in Eastern Time):')
for hour in swap_avg_by_hour[0:5]:
    # hour[1] is already a zero-padded 24-hour string such as '15'
    print('{hour}:00: {num:.2f} average comments per post'.format(hour=hour[1], num=hour[0]))

Top 5 Hours for "Ask HN" Comments (in Eastern Time):
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


## Conclusion

Returning to the exploration questions from the intro:

**Do Ask HN or Show HN receive more comments on average?**
- "Ask HN" posts receive more comments on average than "Show HN" posts (14.0 vs. 10.3 per post).

**Do posts created at a certain time receive more comments on average?**
- For "Ask HN" posts, those created in the mid-afternoon (3-5 PM) and in the evening (8-10 PM) Eastern Time receive the most comments on average. Posts created in the 2-3 AM hour also average a high number of comments.