# Exploring Hacker News Posts

In this project, we'll work with a data set of submissions to the popular technology site Hacker News and try to answer the following questions:

1. Do `Ask HN` or `Show HN` posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

In [None]:
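# Read the data set; the first row holds the column headers
from csv import reader

# Use a context manager so the file is closed after reading
with open('../input/HN_posts_year_to_Sep_26_2016.csv') as opened:
    rows = list(reader(opened))
headers = rows[0]
hn = rows[1:]

print(headers)
print(hn[:5])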

Next, we separate the posts into three types based on their titles (case-insensitive):

1. ask posts: any post whose title starts with *ask hn*
2. show posts: any post whose title starts with *show hn*
3. other posts: all remaining posts

In [None]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    # The title is the second column; compare in lowercase
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('There are', len(ask_posts), 'posts in ask_posts.')
print('There are', len(show_posts), 'posts in show_posts.')
print('There are', len(other_posts), 'posts in other_posts.')

Now we can start answering our questions.

## 1. Do `Ask HN` or `Show HN` posts receive more comments on average?

In [None]:
# Find the average number of comments on Ask HN posts
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments/len(ask_posts)

In [None]:
# Find the average number of comments on Show HN posts
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments/len(show_posts)

In [None]:
print('The average number of comments on Ask HN posts is {:.2f}.'.format(avg_ask_comments))
print('The average number of comments on Show HN posts is {:.2f}.'.format(avg_show_comments))

On average, *Ask HN* posts receive more than twice as many comments as *Show HN* posts. One possible reason is that an ask post usually draws at least one comment in answer. When the question is tricky and has no clear answer, it can start a discussion and attract even more comments. That kind of response is not guaranteed for *Show HN* posts.
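The two averaging loops above follow the same pattern, so they could be folded into one helper. The sketch below is only an illustration: `average_column` is a hypothetical name, the sample rows are made up, and the column order (with the comment count at index 4) is an assumption that should be checked against the printed `headers`.

```python
def average_column(rows, index):
    """Return the mean of the integer values stored at `index` in each row."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[index]) for row in rows)
    return total / len(rows)


# Hypothetical sample rows; assumed column order:
# id, title, url, num_points, num_comments, author, created_at
sample_ask = [
    ['1', 'Ask HN: Example A', '', '10', '6', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: Example B', '', '3', '2', 'user_b', '9/26/2016 1:17'],
]
sample_show = [
    ['3', 'Show HN: Example C', 'http://example.com', '5', '1', 'user_c', '9/25/2016 23:44'],
    ['4', 'Show HN: Example D', 'http://example.com', '8', '3', 'user_d', '9/25/2016 22:10'],
]

avg_ask = average_column(sample_ask, 4)
avg_show = average_column(sample_show, 4)
print('ask: {:.2f}, show: {:.2f}'.format(avg_ask, avg_show))
```

With the real data, the same helper could be called as `average_column(ask_posts, 4)` and `average_column(show_posts, 4)`.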

## 2. Do posts created at a certain time receive more comments on average?

Since `Ask HN` posts receive more comments, we will focus on these posts for the second question.

In [None]:
import datetime as dt

# extract time created and the number of comments in ask HN posts
result_list = []

for row in ask_posts:
    created_at = row[-1]
    num_comment = int(row[4])
    result_list.append([created_at, num_comment])

In [None]:
# create frequency table by hour with total comments
counts_by_hours = {}
comments_by_hours = {}

for item in result_list:
    hour = dt.datetime.strptime(item[0], '%m/%d/%Y %H:%M').hour
    if hour in counts_by_hours: 
        counts_by_hours[hour] += 1
        comments_by_hours[hour] += item[1]
    else: 
        counts_by_hours[hour] = 1
        comments_by_hours[hour] = item[1]

In [None]:
# Find the average number of comments per post for each hour
avg_by_hour = []
for hour in counts_by_hours:
    avg_by_hour.append([hour, comments_by_hours[hour] / counts_by_hours[hour]])

In [None]:
# Swap columns so the average comes first and sorting orders by it
swap_avg_by_hour = []
for item in avg_by_hour:
    swap_avg_by_hour.append([item[1], item[0]])

In [None]:
# Sort by average (highest first) and show the top five hours
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('Top 5 Hours for Ask Post Comments:', '\n')
for avg, hour in sorted_swap[:5]:
    print('{hour}:00: {num:.2f} average comments per post.'.format(hour=hour, num=avg))

The top hours for comments cluster around midday, between 10:00 and 15:00. Because these are `Ask HN` posts, readers need some time to digest the question and respond. Posts created late in the day may therefore receive fewer comments, since users pay less attention to HN once the day is over. The period after lunch is also a time when users tend to browse in their free time, so more of them may be active.
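The swap-then-sort steps used for this ranking could also be written as a single sort on dictionary items. This is a minimal sketch; `avg_by_hour_sample` and its values are hypothetical stand-ins for the real hourly averages.

```python
# Hypothetical averages keyed by hour; the real values come from the analysis above.
avg_by_hour_sample = {9: 5.5, 13: 14.0, 15: 30.25, 2: 11.0, 21: 8.0, 10: 12.5}

# Sort (hour, average) pairs by the average, highest first, and keep the top 5.
top_hours = sorted(
    avg_by_hour_sample.items(),
    key=lambda pair: pair[1],
    reverse=True,
)[:5]

for hour, avg in top_hours:
    print('{:02d}:00: {:.2f} average comments per post.'.format(hour, avg))
```

Sorting with `key=` avoids building a second list just to change the column order.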

Questions left for follow-up work:

- Determine if show or ask posts receive more points on average.
- Determine if posts created at a certain time are more likely to receive more points.
- Compare results to the average number of comments and points other posts receive.
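As a possible starting point for the first follow-up question, the comment-averaging approach can be reused for points. The sketch below assumes the points column is named `num_points` in the header row, which should be verified against the printed `headers`; `average_points_by_prefix` and the sample rows are hypothetical.

```python
# Assumed header row; verify against the `headers` printed when loading the data.
sample_headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical sample rows in the assumed column order.
sample_posts = [
    ['1', 'Ask HN: Example A', '', '10', '6', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: Example B', '', '4', '2', 'user_b', '9/26/2016 1:17'],
    ['3', 'Show HN: Example C', 'http://example.com', '9', '1', 'user_c', '9/25/2016 23:44'],
    ['4', 'Show HN: Example D', 'http://example.com', '21', '3', 'user_d', '9/25/2016 22:10'],
]

# Look the column up by name instead of hard-coding its position.
points_index = sample_headers.index('num_points')


def average_points_by_prefix(posts, prefix):
    """Average the points of posts whose title starts with `prefix` (case-insensitive)."""
    points = [int(row[points_index]) for row in posts
              if row[1].lower().startswith(prefix)]
    return sum(points) / len(points) if points else 0.0


avg_ask_points = average_points_by_prefix(sample_posts, 'ask hn')
avg_show_points = average_points_by_prefix(sample_posts, 'show hn')
print('ask: {:.2f}, show: {:.2f}'.format(avg_ask_points, avg_show_points))
```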
