# Analysis of Hacker News Posts

In this project, we'll work with a data set of submissions to the popular technology site Hacker News. Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit Ask HN posts to ask the Hacker News community a specific question, and Show HN posts to show the community a project, product, or just generally something interesting. We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

It should be noted that the data set we're working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.
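
The reduction described above can be sketched roughly as follows. This is only an illustration of "drop posts without comments, then sample at random": the function name `filter_and_sample`, the seed, and the sample rows are assumptions, not the procedure actually used to build the data set.

```python
import random


def filter_and_sample(rows, sample_size, comments_index=4, seed=0):
    """Keep rows with at least one comment, then draw a random sample."""
    with_comments = [row for row in rows if int(row[comments_index]) > 0]
    # Never ask for more rows than are available.
    k = min(sample_size, len(with_comments))
    return random.Random(seed).sample(with_comments, k)


# Made-up rows using the data set's column order:
# id, title, url, num_points, num_comments, author, created_at
raw_rows = [
    ["1", "Post A", "", "3", "0", "alice", "1/1/2016 10:00"],
    ["2", "Post B", "", "5", "2", "bob", "1/2/2016 11:00"],
    ["3", "Post C", "", "1", "7", "carol", "1/3/2016 12:00"],
    ["4", "Post D", "", "9", "0", "dave", "1/4/2016 13:00"],
    ["5", "Post E", "", "2", "1", "erin", "1/5/2016 14:00"],
]
sample = filter_and_sample(raw_rows, sample_size=2)
print(len(sample))
```

Because posts with zero comments never reach the sample, any average computed later describes only posts that received at least one comment.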


# Introduction


Let's start by importing the libraries we need and reading in the CSV file.

In [1]:
from csv import reader

In [2]:
import datetime as dt

In [14]:
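opened_file = open('hacker_news.csv', encoding='utf8')
read_file = reader(opened_file)
hn = list(read_file)  # each row becomes a list of strings
opened_file.close()   # the rows are already in memory, so the file can be closed
print(hn[:5])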

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [15]:
len(hn)

20101

# Removing headers from dataset

In [17]:
headers=hn[0]
hn.pop(0)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [18]:
len(hn)

20100

In [19]:
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


# Extracting Ask HN Posts and Show HN Posts 

Next, we'll extract the posts whose titles start with `Ask HN` or `Show HN` into separate lists. Titles are lowercased before the check, so the match is case-insensitive.

In [72]:
ask_posts=[]
show_posts=[]
other_posts=[]

for post in hn:
    title = post[1].lower()  # the title is the second column
    if title.startswith("ask hn"):
        ask_posts.append(post)
    elif title.startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)


In [73]:
print("No of ask_posts:",len(ask_posts),"\n", ask_posts[:3],"\n")
print("No of show_posts:",len(show_posts),"\n", show_posts[:3],"\n")
print("No of other_posts:",len(other_posts),"\n", other_posts[:3],"\n")

No of ask_posts: 1744 
 [['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']] 

No of show_posts: 1162 
 [['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']] 

No of other_posts: 17194 
 [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['1097
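
As a sketch, the same split can be wrapped in a small reusable helper and checked on a few hand-written rows. The function name `split_by_prefix` and the sample rows are illustrative and not part of the original notebook; the helper assumes the title sits at index 1 of each row, as it does in `hn`.

```python
def split_by_prefix(rows, title_index=1):
    """Split rows into (ask, show, other) by case-insensitive title prefix."""
    ask, show, other = [], [], []
    for row in rows:
        title = row[title_index].lower()
        if title.startswith("ask hn"):
            ask.append(row)
        elif title.startswith("show hn"):
            show.append(row)
        else:
            other.append(row)
    return ask, show, other


# Made-up rows in the same column order as the data set.
sample_rows = [
    ["1", "Ask HN: How do you learn?", "", "2", "6", "a", "8/16/2016 9:55"],
    ["2", "show hn: My project", "http://example.com", "5", "3", "b", "11/25/2015 14:03"],
    ["3", "Interactive Dynamic Video", "http://example.com", "386", "52", "c", "8/4/2016 11:52"],
]
ask, show, other = split_by_prefix(sample_rows)
print(len(ask), len(show), len(other))
```

The second sample title is deliberately lowercase to show that the prefix check does not depend on capitalization.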

# Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now that we have separated the ask and show posts, let's find out which type of post receives more comments on average.

In [74]:
ask_posts_total = len(ask_posts)
show_posts_total = len(show_posts)
ask_total_comments = 0
show_total_comments = 0

# num_comments is the fifth column and is stored as a string
for post in ask_posts:
    ask_total_comments += int(post[4])

for post in show_posts:
    show_total_comments += int(post[4])

avg_ask_comments = ask_total_comments / ask_posts_total
avg_show_comments = show_total_comments / show_posts_total

print("Avg# of Ask Post Comments:", avg_ask_comments)
print("Avg# of Show Post Comments:", avg_show_comments)


    


Avg# of Ask Post Comments: 14.038417431192661
Avg# of Show Post Comments: 10.31669535283993


On average, ask posts in our sample receive approximately 14 comments, whereas show posts receive approximately 10. Since ask posts receive more comments on average, we'll focus our remaining analysis just on these posts.
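
The averaging step can also be written as one helper that works for either list. This is a minimal sketch: `average_comments` is a hypothetical name, the rows are made up, and returning `0.0` for an empty list is a design choice made here to avoid dividing by zero.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up ask posts with 6, 29, and 1 comments.
sample_posts = [
    ["1", "Ask HN: A", "", "2", "6", "a", "8/16/2016 9:55"],
    ["2", "Ask HN: B", "", "28", "29", "b", "11/22/2015 13:43"],
    ["3", "Ask HN: C", "", "1", "1", "c", "5/2/2016 10:14"],
]
print(average_comments(sample_posts))
print(average_comments([]))
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages printed above.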

# Finding the Number of Ask Posts and Comments by Hour

Next, we'll count the ask posts created in each hour of the day and compute the average number of comments per post for each hour. This is the first step in finding out whether posts created at certain times are likely to attract more comments.

In [75]:
counts_by_hour = {}    # hour -> number of ask posts created in that hour
comments_by_hour = {}  # hour -> total comments on those posts

for post in ask_posts:
    created_at = post[6]
    num_of_comments = int(post[4])
    hour_part = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    if hour_part in counts_by_hour:
        counts_by_hour[hour_part] += 1
        comments_by_hour[hour_part] += num_of_comments
    else:
        counts_by_hour[hour_part] = 1
        comments_by_hour[hour_part] = num_of_comments

# Build [hour, average comments per post] pairs
avg_comments_by_hour = []
for hour in counts_by_hour:
    avg_comments_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
print(avg_comments_by_hour)
        
        
        
  
        


[[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.20183486238532], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.127272727272727], [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034]]


In [76]:
print(comments_by_hour)

{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
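
The per-hour bookkeeping above can be condensed with `collections.defaultdict`, which removes the need for the `if`/`else` branch. The sketch below uses a hypothetical function name, `average_comments_by_hour`, and made-up rows; it assumes the same column layout and date format as the data set.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(posts, date_format="%m/%d/%Y %H:%M"):
    """Map each creation hour to the average number of comments per post."""
    counts = defaultdict(int)    # hour -> number of posts
    comments = defaultdict(int)  # hour -> total comments
    for post in posts:
        hour = dt.datetime.strptime(post[6], date_format).hour
        counts[hour] += 1
        comments[hour] += int(post[4])
    return {hour: comments[hour] / counts[hour] for hour in counts}


# Made-up ask posts: two created in hour 9, one in hour 15.
sample_posts = [
    ["1", "Ask HN: A", "", "2", "6", "a", "8/16/2016 9:55"],
    ["2", "Ask HN: B", "", "3", "4", "b", "8/17/2016 9:10"],
    ["3", "Ask HN: C", "", "9", "30", "c", "8/18/2016 15:40"],
]
print(average_comments_by_hour(sample_posts))
```

Returning a dictionary keeps the hour and its average together, so no parallel lists have to stay in sync.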


In [81]:
# Swap each pair to [average, hour] so that sorted() orders by the average
to_sort_prep = []
for hour, avg in avg_comments_by_hour:
    to_sort_prep.append([avg, hour])
sorted_list = sorted(to_sort_prep, reverse=True)
print(sorted_list[:5])
    

[[38.5948275862069, 15], [23.810344827586206, 2], [21.525, 20], [16.796296296296298, 16], [16.009174311926607, 21]]
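
An alternative to swapping the columns is to pass a `key` function to `sorted`, which leaves the `[hour, average]` pairs in their original order. The sketch below uses four of the pairs printed earlier; the variable names `pairs` and `top` are illustrative.

```python
# A few [hour, average comments] pairs taken from the output above.
pairs = [
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [2, 23.810344827586206],
    [20, 21.525],
]

# Sort by the second element (the average), highest first.
top = sorted(pairs, key=lambda pair: pair[1], reverse=True)
print(top[:3])
```

One behavioral difference: when two hours have the same average, the swap approach falls back to comparing the hour, whereas a `key`-based sort keeps tied pairs in their input order.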


In [121]:
for avg, hour in sorted_list[:5]:
    # Convert Eastern to Central by subtracting one hour; % 24 wraps midnight (0 -> 23)
    time_formatted = dt.datetime.strptime(str((hour - 1) % 24), "%H")
    print("{time} CST: {comments:.2f} average comments per post".format(time=time_formatted.strftime("%I:%M %p"), comments=avg))


02:00 PM CST: 38.59 average comments per post
01:00 AM CST: 23.81 average comments per post
07:00 PM CST: 21.52 average comments per post
03:00 PM CST: 16.80 average comments per post
08:00 PM CST: 16.01 average comments per post


In Central Time, the hour that receives the most comments per post on average is 14:00 (15:00 Eastern), with an average of 38.59 comments per post. That is roughly 62% more than the hour with the second-highest average (23.81 comments per post).
According to the [dataset documentation](https://www.kaggle.com/hacker-news/hacker-news-posts), the timestamps are in US Eastern Time. Because the project creators live in Austin, TX, which follows the Central Time Zone, the hours have been converted to CST by subtracting one hour.
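
A one-hour shift has to wrap around midnight: hour 0 in Eastern Time is hour 23 in Central Time, not hour -1. The sketch below shows one way to handle that with modular arithmetic. The helper names `eastern_to_central_hour` and `format_hour` are hypothetical, and the fixed one-hour offset is an assumption that holds because both zones switch to daylight time together.

```python
import datetime as dt


def eastern_to_central_hour(hour):
    """Shift an hour of the day back by one, wrapping 0 to 23."""
    return (hour - 1) % 24


def format_hour(hour):
    """Format an hour of the day as a 12-hour clock string."""
    return dt.time(hour=hour).strftime("%I:%M %p")


for eastern_hour in (15, 2, 0):
    central_hour = eastern_to_central_hour(eastern_hour)
    print(eastern_hour, "->", format_hour(central_hour), "CST")
```

The AM/PM text produced by `%p` depends on the active locale; the default C locale yields `AM` and `PM`.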

# Conclusion

In this project, we analyzed ask posts and show posts to determine which type of post and which time of day receive the most comments on average. Based on our analysis, to maximize the number of comments a post receives, we'd recommend that the post be an ask post created between 14:00 and 15:00 CST (2:00 PM to 3:00 PM CST, which is 3:00 PM to 4:00 PM Eastern).

However, the data set we analyzed excluded posts without any comments. Given that, it's more accurate to say that *of the posts that received comments*, ask posts received more comments on average, and ask posts created between 14:00 and 15:00 CST (2:00 PM to 3:00 PM CST) received the most comments on average.