# Exploring Hacker News Posts

## Introduction
---
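Hacker News is a site where user-submitted stories are voted on and commented on. The data comes from a dataset available on [kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts). A reduced version of this dataset exists, cut from about 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. The file loaded in this notebook, `HN_posts_year_to_Sep_26_2016.csv`, is not reduced: as the outputs below show, it holds roughly 293,000 submissions, including ones with no comments.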

## Objective
---
We are specifically interested in posts whose titles begin with either **Ask HN** or **Show HN**. Users submit **Ask HN** posts to ask the community a specific question, while **Show HN** posts show the community a project, a product or generally something interesting. In this project, we compare these two types of posts to determine the following:

1) Do **Ask HN** or **Show HN** posts receive more comments on average?

2) Do posts created at a certain time receive more comments on average?

## Exploring the data
---
We first print the first 5 rows of the dataset, which is stored as a list of lists under the variable name `hn`.

In [2]:
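from csv import reader

# Read the whole CSV into a list of lists; the with-block closes the file afterwards
with open("HN_posts_year_to_Sep_26_2016.csv", encoding='utf8') as opened_file:
    hn = list(reader(opened_file))

# Print the header row and the first four posts
for rows in hn[:5]:
    print(rows)
    print('\n')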

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']




As we can see, the first row of the data contains the column names. We are going to remove this row from `hn` and store it under the variable `headers`.

In [3]:
headers = hn[0]
del hn[0]
print(headers)
print('\n')
for rows in hn[:5]:
    print(rows)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


Now that we have removed the headers, we can start to filter our data.

## Filtering of data
---
We are only concerned with posts whose titles begin with "Ask HN" or "Show HN", so we split the rows into three lists of lists: `ask_posts`, `show_posts` and `other_posts` for everything else.

In [4]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    # Lower-case the title so that the prefix match is case-insensitive
    title1 = title.lower()
    if title1.startswith('ask hn'):
        ask_posts.append(row)
    elif title1.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

        

9139
10158
273822


We can see that there are 9139 ask posts, 10158 show posts and 273822 other posts.

## Analysis of data
---
### Average number of comments for both types of posts

To find out whether **Ask HN** posts or **Show HN** posts have more comments, we will first need to find the total number of comments for each type of post and then divide by the number of posts.



In [5]:
# Finding the average number of comments on an ask post
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
# Divide by the actual number of ask posts rather than a hard-coded count
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of comments on an ask post = " + str(round(avg_ask_comments, 2)))

# Finding the average number of comments on a show post
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
# Divide by the actual number of show posts rather than a hard-coded count
avg_show_comments = total_show_comments / len(show_posts)
print("Average number of comments on a show post = " + str(round(avg_show_comments, 2)))


Average number of comments on an ask post = 10.39
Average number of comments on a show post = 4.89


As seen above, ask posts receive an average of 10.39 comments per post while show posts receive an average of 4.89 comments per post. Hence, ask posts receive more comments on average.
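
The averaging step can also be wrapped in a small helper so that the divisor always comes from the list being averaged. The sketch below is illustrative only: the function name `average_comments` and the sample rows are made up for this example, and it assumes rows follow the same column order as `hn`, with `num_comments` at index 4.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for rows shaped like the `hn` rows."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty group
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the same column order as the dataset (not real posts)
sample_posts = [
    ['1', 'Ask HN: Example question?', '', '3', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: Another question?', '', '5', '4', 'user_b', '9/26/2016 15:02'],
]
print(round(average_comments(sample_posts), 2))  # 7.0
```

Calling `average_comments(ask_posts)` and `average_comments(show_posts)` would give the two averages without typing in either post count by hand.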

### Determining whether ask posts created at a certain time attract more comments



In [6]:
import datetime as dt

# Pair each ask post's creation timestamp with its comment count
result_list = []
for post in ask_posts:
    datetime_comments = [post[6], int(post[4])]
    result_list.append(datetime_comments)

counts_by_hour = {}    # hour -> number of ask posts created in that hour
comments_by_hour = {}  # hour -> total comments on those posts
for post in result_list:
    date_time = post[0]
    # Timestamps look like '9/26/2016 3:26'; parse them and keep the hour as an int
    hour = dt.datetime.strptime(date_time, '%m/%d/%Y %H:%M').hour
    # Alternative: date_time.split()[1].split(':')[0] gives the hour as a string
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = post[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += post[1]
print('The number of ask posts by hour = ', counts_by_hour)
print('\n')
print('The number of comments on ask posts by hour = ', comments_by_hour)


    

The number of ask posts by hour =  {2: 269, 1: 282, 22: 383, 21: 518, 19: 552, 17: 587, 15: 646, 14: 513, 13: 444, 11: 312, 10: 282, 9: 222, 7: 226, 3: 271, 23: 343, 20: 510, 16: 579, 8: 257, 0: 301, 18: 614, 12: 342, 4: 243, 6: 234, 5: 209}


The number of comments on ask posts by hour =  {2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18525, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 23: 2297, 20: 4462, 16: 4466, 8: 2362, 0: 2277, 18: 4877, 12: 4234, 4: 2360, 6: 1587, 5: 1838}


From the above, the most popular time for creating ask posts is 15:00 - 16:00, with 646 posts created during that hour of the day.

Ask posts created from 15:00 - 16:00 also have the highest total number of comments (18525 comments).

However, we cannot tell whether the high number of comments from 15:00 to 16:00 is due to the high number of ask posts or to the time of day. Thus, we will calculate the average number of comments per ask post for each hour.

In [7]:
avg_by_hour = []
for hour in counts_by_hour:
    avg_comments = comments_by_hour[hour] / counts_by_hour[hour]
    # Put the average first so that sorting orders the pairs by it
    avg_by_hour.append([round(avg_comments, 2), hour])

# Highest average first
avg_by_hour = sorted(avg_by_hour, reverse=True)

for avg_comments, hour in avg_by_hour:
    string = '{}:00 : {} average comments per post'.format(hour, avg_comments)
    print(string)


15:00 : 28.68 average comments per post
13:00 : 16.32 average comments per post
12:00 : 12.38 average comments per post
2:00 : 11.14 average comments per post
10:00 : 10.68 average comments per post
4:00 : 9.71 average comments per post
14:00 : 9.69 average comments per post
17:00 : 9.45 average comments per post
8:00 : 9.19 average comments per post
11:00 : 8.96 average comments per post
22:00 : 8.8 average comments per post
5:00 : 8.79 average comments per post
20:00 : 8.75 average comments per post
21:00 : 8.69 average comments per post
3:00 : 7.95 average comments per post
18:00 : 7.94 average comments per post
16:00 : 7.71 average comments per post
0:00 : 7.56 average comments per post
1:00 : 7.41 average comments per post
19:00 : 7.16 average comments per post
7:00 : 7.01 average comments per post
6:00 : 6.78 average comments per post
23:00 : 6.7 average comments per post
9:00 : 6.65 average comments per post
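
The counting and averaging steps can also be combined into one pass. The following is a minimal sketch rather than the notebook's own method: the function name `average_comments_by_hour` and the sample rows are hypothetical, and it assumes the `created_at` format seen above (`'%m/%d/%Y %H:%M'`) with `num_comments` at index 4 and `created_at` at index 6.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(posts, created_index=6, comments_index=4):
    """Map each hour of the day to the mean comment count of posts created then."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [post count, comment total]
    for row in posts:
        hour = dt.datetime.strptime(row[created_index], '%m/%d/%Y %H:%M').hour
        totals[hour][0] += 1
        totals[hour][1] += int(row[comments_index])
    return {hour: comments / count for hour, (count, comments) in totals.items()}


# Made-up rows in the same column order as the dataset (not real posts)
sample_posts = [
    ['1', 'Ask HN: First example?', '', '1', '10', 'user_a', '9/26/2016 15:02'],
    ['2', 'Ask HN: Second example?', '', '1', '20', 'user_b', '9/25/2016 15:40'],
    ['3', 'Ask HN: Third example?', '', '1', '3', 'user_c', '9/25/2016 3:10'],
]
averages = average_comments_by_hour(sample_posts)

# Print the hours from highest to lowest average
for hour, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print('{}:00 : {:.2f} average comments per post'.format(hour, avg))
```

Keeping the count and the total side by side under one key means the two can never drift out of step, which is the risk when two separate dictionaries are updated by hand.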


## Conclusion
---
The average number of comments per post is highest (28.68 comments) for ask posts created between 15:00 and 16:00. The timestamps in this dataset are in US Eastern Time (GMT-4 during daylight saving time). To convert to local Singapore time (GMT+8), we add 12 hours to the time above, which means that 03:00 - 04:00 is the best time for Singaporeans to create an ask post that will receive the most comments.
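
The 12-hour shift can be checked with timezone-aware datetimes. This is a sketch under the assumptions stated in the text: it uses the fixed offsets GMT-4 and GMT+8, so it ignores the switch to GMT-5 outside daylight saving time, and the helper name `eastern_hour_to_singapore` and the sample date are chosen only for illustration.

```python
import datetime as dt

# Fixed offsets taken from the text; Eastern Time is GMT-5 outside daylight saving time
EASTERN = dt.timezone(dt.timedelta(hours=-4))
SINGAPORE = dt.timezone(dt.timedelta(hours=8))


def eastern_hour_to_singapore(hour):
    """Convert an hour on 26 Sep 2016 (Eastern, GMT-4) to Singapore time."""
    eastern_time = dt.datetime(2016, 9, 26, hour, 0, tzinfo=EASTERN)
    return eastern_time.astimezone(SINGAPORE)


converted = eastern_hour_to_singapore(15)
print(converted.strftime('%Y-%m-%d %H:%M'))  # 2016-09-27 03:00
```

The result also shows that the Singapore time falls on the following calendar day.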

Some things to work on:

1) Determine if show or ask posts receive more points on average.

2) Determine if posts created at a certain time are more likely to receive more points.

3) Compare these results to the average number of comments and points other posts receive.