# Exploring Hacker News Posts

Hacker News is a site where users submit posts that are voted and commented upon. It is very popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

There are two types of posts that are important for this project: "Ask HN" and "Show HN". People submit "Ask HN" posts to ask the Hacker News community a specific question. Likewise, users submit "Show HN" posts to show a project, product, or something interesting.

The purpose of this project is to compare these two types of posts and answer the following questions:
1. Do "Ask HN" or "Show HN" posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

# Introduction

In [1]:
import csv

# Read the whole dataset into a list of rows; the with block closes the file afterwards.
with open("HN_posts_year_to_Sep_26_2016.csv") as opened_file:
    hn = list(csv.reader(opened_file))
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


Remove the header row from hn and keep it in a separate variable:

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-\xc3\x82\xc2\x93the-data-vault\xc3\x82\xc2\x94', '1', '0', 'markgainor1', '9/26/2016 3:

# Extracting Ask HN and Show HN Posts

Split ask posts and show posts into two different lists:

In [3]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822
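
The prefix check above can also be wrapped in a small helper, which makes the rule easy to try on individual titles. The sketch below is only an illustration: `classify_post` and the sample titles are hypothetical names and data, not part of the original notebook.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # make the comparison case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Hypothetical sample titles (assumed for illustration, not taken from the dataset).
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "algorithmic music",
]
labels = [classify_post(title) for title in sample_titles]
print(labels)
```

Because the title is lowercased before the comparison, "SHOW HN" and "Show HN" fall into the same group, just as in the loop above.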


# Calculating the Average Number of Comments for Ask HN and Show HN Posts

In [4]:
total_ask_comments = 0.0
for row in ask_posts:
    num_comments = float(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

10.3934784987


In [5]:
total_show_comments = 0.0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

4.88609962591


On average, ask posts receive more comments than show posts: about 10.39 comments per post versus about 4.89.
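
Since the two cells above repeat the same calculation, it could be factored into one function. This is a minimal sketch under stated assumptions: `average_comments` is a hypothetical helper, the comment count is assumed to sit at index 4 as in the dataset header, and the rows are made-up examples.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(float(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the same column order as the dataset:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ["1", "Ask HN: first example", "", "1", "4", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: second example", "", "1", "6", "user_b", "9/26/2016 4:00"],
]
sample_average = average_comments(sample_posts)
print(sample_average)
```

Calling `average_comments(ask_posts)` and `average_comments(show_posts)` should then reproduce the two averages computed above.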

# Finding the Number of Ask Posts and Comments by Hour Created

Next, we will determine whether ask posts created at a certain time are more likely to attract comments. There are two steps to this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments they received.
2. Calculate the average number of comments ask posts receive by hour created.

In [6]:
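import datetime as dt

# Collect the creation time and comment count for each ask post.
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = float(row[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}    # number of ask posts created in each hour
comments_by_hour = {}  # total comments on ask posts created in each hour

for created_at, num_comments in result_list:
    # Parse the timestamp and keep only the zero-padded hour, e.g. '03'.
    date_time = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = date_time.strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments

comments_by_hour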

{'00': 2277.0,
 '01': 2089.0,
 '02': 2996.0,
 '03': 2154.0,
 '04': 2360.0,
 '05': 1838.0,
 '06': 1587.0,
 '07': 1585.0,
 '08': 2362.0,
 '09': 1477.0,
 '10': 3013.0,
 '11': 2797.0,
 '12': 4234.0,
 '13': 7245.0,
 '14': 4972.0,
 '15': 18525.0,
 '16': 4466.0,
 '17': 5547.0,
 '18': 4877.0,
 '19': 3954.0,
 '20': 4462.0,
 '21': 4500.0,
 '22': 3372.0,
 '23': 2297.0}
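
The same per-hour bookkeeping can be written without the `if`/`else` branch by using `collections.defaultdict`. This is an alternative sketch, not the approach used in the cell above, and the three sample rows are hypothetical.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs in the dataset's timestamp format.
sample_results = [
    ["9/26/2016 3:26", 4.0],
    ["9/25/2016 3:05", 6.0],
    ["8/16/2016 15:40", 10.0],
]

sample_counts = defaultdict(int)      # missing hours start at 0
sample_comments = defaultdict(float)  # missing hours start at 0.0

for created_at, num_comments in sample_results:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded hour, e.g. '03'
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))
print(dict(sample_comments))
```

Both posts created during the 3 o'clock hour end up under the key `'03'`, so counts and comment totals accumulate per hour regardless of the date.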

# Calculating the Average Number of Comments for Ask HN Posts by Hour

In [7]:
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
avg_by_hour

[['02', 11.137546468401487],
 ['03', 7.948339483394834],
 ['00', 7.5647840531561465],
 ['01', 7.407801418439717],
 ['20', 8.749019607843136],
 ['21', 8.687258687258687],
 ['22', 8.804177545691905],
 ['23', 6.696793002915452],
 ['08', 9.190661478599221],
 ['09', 6.653153153153153],
 ['14', 9.692007797270955],
 ['06', 6.782051282051282],
 ['07', 7.013274336283186],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['13', 16.31756756756757],
 ['12', 12.380116959064328],
 ['15', 28.676470588235293],
 ['04', 9.7119341563786],
 ['17', 9.449744463373083],
 ['16', 7.713298791018998],
 ['19', 7.163043478260869],
 ['18', 7.94299674267101],
 ['05', 8.794258373205741]]

Sort the list to identify the highest values:

In [8]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap

[[11.137546468401487, '02'], [7.948339483394834, '03'], [7.5647840531561465, '00'], [7.407801418439717, '01'], [8.749019607843136, '20'], [8.687258687258687, '21'], [8.804177545691905, '22'], [6.696793002915452, '23'], [9.190661478599221, '08'], [6.653153153153153, '09'], [9.692007797270955, '14'], [6.782051282051282, '06'], [7.013274336283186, '07'], [8.96474358974359, '11'], [10.684397163120567, '10'], [16.31756756756757, '13'], [12.380116959064328, '12'], [28.676470588235293, '15'], [9.7119341563786, '04'], [9.449744463373083, '17'], [7.713298791018998, '16'], [7.163043478260869, '19'], [7.94299674267101, '18'], [8.794258373205741, '05']]


[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]
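
Swapping the columns is one way to sort by the average. Another option is to pass a `key` function to `sorted`, which leaves each pair in its original `[hour, average]` order. The sketch below uses four of the pairs from the output above; `sample_avg_by_hour` and `top_hours` are hypothetical names.

```python
# A subset of the [hour, average] pairs computed earlier.
sample_avg_by_hour = [
    ["02", 11.137546468401487],
    ["15", 28.676470588235293],
    ["13", 16.31756756756757],
    ["09", 6.653153153153153],
]

# Sort by the second element (the average), largest first.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)
top_hours = [hour for hour, avg in sorted_by_avg]
print(top_hours)
```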

Print the five hours with the highest average number of comments per post:

In [9]:
print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sorted_swap[:5]:
    # Show the hour as HH:MM, followed by the average rounded to two decimals.
    hour_label = dt.datetime.strptime(hr, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_label, avg))

Top 5 Hours for 'Ask HN' Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


The hour that receives the most comments per post on average is 15:00, with an average of 28.68 comments per post. That is about 76% higher than the second-highest average, 16.32 comments per post at 13:00.
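
The relative gap between the top two hours can be checked directly from the averages in the sorted output. This is a small worked calculation; the variable names are hypothetical.

```python
highest_avg = 28.676470588235293  # 15:00, from the sorted output
second_avg = 16.31756756756757    # 13:00, from the sorted output

# Percentage increase from the second-highest average to the highest.
pct_increase = (highest_avg - second_avg) / second_avg * 100
print("{:.2f}%".format(pct_increase))
```

The result is a little under 76%, so posts created in the 15:00 hour average roughly three quarters more comments than those created in the 13:00 hour.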

# Conclusion

In this project, we analyzed ask posts and show posts and determined that ask posts receive more comments than show posts on average.

To maximize the number of comments, we recommend submitting a post as an ask post and creating it between 15:00 and 16:00 (3:00 pm to 4:00 pm EST).