# Exploring Hacker News Posts

The purpose of this analysis is to explore posts from the Hacker News website and discover interesting patterns in the data, for example which types of posts receive more comments on average and whether posts created at certain times receive more comments.

* The data set used can be found here: [Link](https://www.kaggle.com/hacker-news/hacker-news-posts)

* This analysis does not use the full data set: submissions that did not receive any comments are removed, and a random sample is taken from the remaining posts.

* Below you can find the descriptions of the columns:
    * id: The unique identifier from Hacker News for the post
    * title: The title of the post
    * url: The URL that the post links to, if the post has a URL
    * num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
    * num_comments: The number of comments that were made on the post
    * author: The username of the person who submitted the post
    * created_at: The date and time at which the post was submitted

In [2]:
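from csv import reader

# Read the whole file into a list of lists; the `with` block closes the file afterwards
with open('hacker_news.csv') as open_hacker:
    read_hacker = reader(open_hacker)
    hn = list(read_hacker)

# Display the header row and the first four data rows
hn[:5]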

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

Next, we remove the first row, which contains the column headers, so that the remaining rows contain only post data.

In [3]:
# Keep the header row separately, then drop it from the data
headers = hn[0]
del hn[0]

hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]
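
The header can also be separated while the file is being read, which avoids deleting from the list afterwards. The following is a minimal sketch, not the approach used in this notebook: `read_rows` is a hypothetical helper, and an in-memory `io.StringIO` object holding two rows copied from the output above stands in for `hacker_news.csv`.

```python
import csv
import io


def read_rows(file_obj):
    """Return (header, data_rows) from an open CSV file object."""
    rows = csv.reader(file_obj)
    header = next(rows)        # the first row holds the column names
    return header, list(rows)  # everything after it is post data


# Stand-in for the real file: two rows copied from the output above.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)

header, rows = read_rows(sample_csv)
print(header)
print(len(rows))
```

With the real file, the same helper could be called inside a `with open('hacker_news.csv') as f:` block.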

Next, we separate the posts whose titles begin with "Ask HN" or "Show HN" and store each group in its own list of lists to enable analysis. All remaining posts go into a third list.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(hn))
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

20100
1744
1162
17194


In [5]:
# Calculate the average number of comments `Ask HN` posts receive.
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)
    

14.038417431192661


In [6]:
# Calculate the average number of comments `Show HN` posts receive.
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


As the results above show, Ask HN posts receive more comments on average than Show HN posts: approximately 14 versus 10. Since Ask HN posts are more likely to receive comments, the rest of the analysis focuses on them.
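
The two averaging loops above follow the same pattern, so they could be folded into one function. The sketch below assumes the same column order as the data set (comment count at index 4). `average_comments` and the sample rows are hypothetical and only illustrate the calculation.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count of a list of post rows."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty group
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as the data set.
sample_posts = [
    ['1', 'Ask HN: Example question?', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Another question?', '', '3', '4', 'user_b', '1/26/2016 19:30'],
]

print(average_comments(sample_posts))
```

Calling `average_comments(ask_posts)` and `average_comments(show_posts)` would then reproduce the two figures compared above.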

In [12]:
# Calculate the number of `Ask HN` posts created during each hour of the day and the number of comments they received.
import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append(
        [post[6], int(post[4])]
    )

comments_by_hour = {}
counts_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1

comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

In [13]:
# Calculate the average number of comments that `Ask HN` posts created at each hour of the day receive.
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

avg_by_hour

[['15', 38.5948275862069],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['05', 10.08695652173913],
 ['08', 10.25],
 ['07', 7.852941176470588],
 ['19', 10.8],
 ['13', 14.741176470588234],
 ['11', 11.051724137931034],
 ['03', 7.796296296296297],
 ['09', 5.5777777777777775],
 ['22', 6.746478873239437],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['02', 23.810344827586206],
 ['23', 7.985294117647059],
 ['00', 8.127272727272727],
 ['04', 7.170212765957447],
 ['01', 11.383333333333333],
 ['18', 13.20183486238532],
 ['10', 13.440677966101696],
 ['06', 9.022727272727273],
 ['20', 21.525],
 ['21', 16.009174311926607]]
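
The counting and averaging steps above can also be combined using `collections.defaultdict`, which removes the need for the `if`/`else` branch that initialises each hour. This is a sketch under the assumption that each input row is a `(created_at, num_comments)` pair like the entries of `result_list`. The function name and sample rows are hypothetical.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(rows, date_format="%m/%d/%Y %H:%M"):
    """Map each hour string ('00'..'23') to the mean comment count."""
    totals = defaultdict(int)  # hour -> total comments
    counts = defaultdict(int)  # hour -> number of posts
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        totals[hour] += num_comments
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


# Hypothetical (created_at, num_comments) pairs.
sample_rows = [
    ("8/4/2016 11:52", 52),
    ("1/26/2016 19:30", 10),
    ("6/23/2016 11:05", 8),
]

print(average_comments_by_hour(sample_rows))
```

The result is a dictionary rather than a list of lists, so it would need `.items()` to be iterated in the same way as `avg_by_hour`.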

In [16]:
# Put the average first in each pair so that sorted() orders the rows by average
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

sorted_swap

[[38.5948275862069, '15'], [13.233644859813085, '14'], [16.796296296296298, '16'], [10.08695652173913, '05'], [10.25, '08'], [7.852941176470588, '07'], [10.8, '19'], [14.741176470588234, '13'], [11.051724137931034, '11'], [7.796296296296297, '03'], [5.5777777777777775, '09'], [6.746478873239437, '22'], [9.41095890410959, '12'], [11.46, '17'], [23.810344827586206, '02'], [7.985294117647059, '23'], [8.127272727272727, '00'], [7.170212765957447, '04'], [11.383333333333333, '01'], [13.20183486238532, '18'], [13.440677966101696, '10'], [9.022727272727273, '06'], [21.525, '20'], [16.009174311926607, '21']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
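
Swapping the columns is one way to sort by average. Another is to pass a `key` function to `sorted()`, which leaves each `[hour, average]` pair in its original order. The sketch below uses a few pairs copied from the results above, rounded to two decimals for brevity.

```python
# A few [hour, average] pairs taken from the results above (rounded).
avg_by_hour_sample = [
    ['15', 38.59],
    ['14', 13.23],
    ['02', 23.81],
    ['20', 21.53],
]

# Sort by the second element of each pair, highest average first.
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hr, avg in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post".format(hr, avg))
```

Sorting by key also avoids comparing the hour strings, which the swapped-list approach does whenever two averages are equal.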

In [None]:
# Print the 5 hours with the highest average number of comments per post.

print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Ask HN posts created during the 15:00 hour receive the most comments on average: 38.59 comments per post.

## Summary

In this project, we analyzed Ask HN and Show HN posts to determine which type of post and which time of day receive the most comments on average. Based on our analysis, to maximize the number of comments a post receives, we recommend submitting it as an Ask HN post between 15:00 and 16:00 (3:00 pm to 4:00 pm EST).

However, note that the data set we analyzed excludes posts without any comments. It is therefore more accurate to say that, among the posts that received comments, Ask HN posts received more comments on average, and Ask HN posts created between 15:00 and 16:00 (3:00 pm to 4:00 pm EST) received the most comments on average.