# HackerNews Post Interactions and Popularity

This project explores a dataset of HackerNews (HN) posts and compares the popularity of two types of posts on the website: "Ask HN" and "Show HN". Ask HN posts are user submissions that ask the community a specific question, on topics including personal projects, technical issues, and current events. Show HN posts are submissions that show a project, a product, or another interesting link.

Specifically, this project compares these two types to determine:

* Do Ask HN or Show HN posts receive more comments on average?

* Do posts created at certain times receive more comments on average?

The data analysed here are a reduced version of a larger data set of 300,000 posts. The reduced version contains approximately 20,000 rows, obtained by first removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.
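The two reduction steps described above can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the data set: the function name `sample_commented_posts`, the fixed seed, and the three example rows are all hypothetical, and the only thing taken from the data set is the column layout (number of comments at index 4).

```python
import random


def sample_commented_posts(rows, sample_size, seed=0):
    """Drop rows with zero comments, then draw a random sample of the rest."""
    # Column 4 holds the number of comments as a string
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(commented, min(sample_size, len(commented)))


# Hypothetical rows in the same column order as hacker_news.csv
rows = [
    ['1', 'Ask HN: example question', '', '5', '3', 'user_a', '1/1/2016 10:00'],
    ['2', 'Example link', 'http://example.com', '2', '0', 'user_b', '1/1/2016 11:00'],
    ['3', 'Show HN: example project', 'http://example.org', '9', '7', 'user_c', '1/2/2016 12:00'],
]

sample = sample_commented_posts(rows, sample_size=2)
print(len(sample))
```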

In [1]:
# Read in the hacker_news.csv dataset
import csv

# the with block closes the file once the rows have been read
with open('hacker_news.csv', newline='') as opened_file:
    hn = list(csv.reader(opened_file))

# Display first 5 rows of hn
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

The variables collected in these data, with their index in column order, are:

* id [0]: The unique post identifier

* title [1]: The post's written title on HN

* url [2]: The URL that the post links to, if applicable

* num_points [3]: The number of points acquired, measured as the total number of upvotes / likes minus the total number of downvotes / dislikes

* num_comments [4]: The total number of comments on the post

* author [5]: The username of the original poster / submitter of the post

* created_at [6]: The date and time that the post was originally submitted.
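The column indices listed above can also be looked up by name rather than hard-coded. The sketch below is one possible way to do that; the dictionary name `col` is an assumption introduced here for illustration and is not used elsewhere in this notebook.

```python
# Header row as it appears in hacker_news.csv
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position (hypothetical helper, not part of the notebook)
col = {name: index for index, name in enumerate(header)}

# First data row from the output shown above
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/',
       '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Values are read as strings, so numeric columns need converting
num_comments = int(row[col['num_comments']])
print(col['title'], col['num_comments'], num_comments)
```

Looking columns up by name makes later cells easier to read, at the cost of one extra dictionary.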

In [2]:
# extract the header row (as a single list) and remove it from the hn data set
headers = hn[0]
hn = hn[1:]
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## Filter data to retain only Show HN and Ask HN posts

To address our research questions, we need to split the HN data set into three separate subsets: ask_posts, show_posts, and other_posts for all remaining posts.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

# loop through hn and copy each post to exactly one list
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    # elif (rather than a second if) keeps Ask HN posts out of other_posts
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# check length of each new data set to determine number of posts
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
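A quick sanity check on this kind of split is that every post lands in exactly one list, so the list lengths should add up to the number of rows. The sketch below shows the idea on a handful of made-up titles; the `classify` function and the `groups` dictionary are hypothetical names used only for this illustration.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles for illustration only
titles = [
    'Ask HN: How do you learn a new language?',
    'Show HN: A tiny static site generator',
    'Interactive Dynamic Video',
    'ASK HN: Favourite debugging tools?',
]

groups = {'ask': [], 'show': [], 'other': []}
for title in titles:
    # each title is appended to one and only one group
    groups[classify(title)].append(title)

total = sum(len(group) for group in groups.values())
print(len(groups['ask']), len(groups['show']), len(groups['other']), total)
```

Because `classify` returns a single label per title, the groups cannot overlap, and the total always matches the number of input titles.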


## Determine which of Ask HN or Show HN posts receive more comments on average

With the data filtered into separate lists to identify the relevant post-types for our research questions, we can determine the average comments by post-type.

In [4]:
total_ask_comments = 0
total_show_comments = 0

# iterate over both data sets and count the total number of comments
for post in ask_posts:
    total_ask_comments += int(post[4])
    
for post in show_posts:
    total_show_comments += int(post[4])
    
avg_ask_comments = (total_ask_comments / len(ask_posts))
print("Average Ask HN comments: ",round(avg_ask_comments,4))
avg_show_comments = (total_show_comments / len(show_posts))
print("Average Show HN comments: ",round(avg_show_comments,4))

Average Ask HN comments:  14.0384
Average Show HN comments:  10.3167


On average, Ask HN posts receive about 14 comments per post, while Show HN posts receive about 10 comments per post.
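Since the same calculation is applied to both lists, it could be wrapped in a small helper. The following is a minimal sketch under the assumption that each post is a list with the comment count stored as a string at index 4; `average_comments` is a hypothetical name, and the sample posts are made up.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments per post; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up posts in the same column order as hacker_news.csv
sample_posts = [
    ['1', 'Ask HN: first example', '', '4', '3', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: second example', '', '6', '7', 'user_b', '1/1/2016 11:00'],
]

avg = average_comments(sample_posts)
print(round(avg, 4))
```

The empty-list guard avoids a ZeroDivisionError if a filter ever returns no posts.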

## Determine if posts submitted at certain times receive more comments than others

Focusing first on Ask HN posts, we will use the associated data set to determine if posts made during certain times of day receive more comments on average than others.

In [5]:
import datetime as dt

# create empty list to store post information
result_list_ask = []

# iterate over ask_posts to isolate submission time and number of comments
for post in ask_posts:
    submitted_time = post[6]
    comments = int(post[4])
    result_list_ask.append([submitted_time, comments])

Calculate the number of posts per hour as well as the comments per hour, in order to get the average comments per post by hour.

In [6]:
# create dictionaries to store pertinent data
counts_by_hour_ask = {}
comments_by_hour_ask = {}
# standard date format for timestamps on posts
date_format = '%m/%d/%Y %H:%M'

# iterate through result_list_ask and assign values to relevant dictionary
for post in result_list_ask:
    timestamp = post[0]
    comments = post[1]
    hour = dt.datetime.strptime(timestamp,date_format).strftime('%H')
    if hour not in counts_by_hour_ask:
        counts_by_hour_ask[hour] = 1
        comments_by_hour_ask[hour] = comments
    else:
        counts_by_hour_ask[hour] += 1
        comments_by_hour_ask[hour] += comments
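The same per-hour tallies can be built without the explicit "first time seen" branch by using `collections.defaultdict`. This is an alternative sketch, not the approach used in the cell above: the function name `aggregate_by_hour` is hypothetical, the sample pairs are made up, and the hour is kept as an integer (0 to 23) rather than a zero-padded string.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = '%m/%d/%Y %H:%M'


def aggregate_by_hour(pairs):
    """Return (post counts, comment totals) keyed by hour of submission."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for timestamp, n_comments in pairs:
        # .hour gives the hour as an int, e.g. 11 for '8/4/2016 11:52'
        hour = dt.datetime.strptime(timestamp, DATE_FORMAT).hour
        counts[hour] += 1
        comments[hour] += n_comments
    return dict(counts), dict(comments)


# Made-up [timestamp, comments] pairs in the same shape as result_list_ask
pairs = [['8/4/2016 11:52', 52], ['1/26/2016 11:05', 10], ['6/17/2016 0:01', 1]]
counts, comments = aggregate_by_hour(pairs)
print(counts, comments)
```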

Calculate average comments per post by hour for Ask HN posts

In [7]:
avg_by_hour_ask = []
for hour in comments_by_hour_ask:
    avg_by_hour_ask.append([hour,
                            comments_by_hour_ask[hour]/counts_by_hour_ask[hour]])

# avg_by_hour_ask

In [8]:
swap_avg_by_hour_ask = []
for hour in avg_by_hour_ask:
    swap_avg_by_hour_ask.append([hour[1], hour[0]])
sorted_avg_ask = sorted(swap_avg_by_hour_ask, reverse = True)

print('Top 5 Hours for Ask HN Posts Comments')

for avg, hr in sorted_avg_ask[:5]:
    print('{}: {:.2f} average comments per post.'.format(
        dt.datetime.strptime(hr, '%H').strftime('%H:%M'), avg))

Top 5 Hours for Ask HN Posts Comments
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


For Ask HN posts, the three hours with the highest average number of comments per post are 15:00 (3PM EST), 02:00 (2AM EST), and 20:00 (8PM EST), so posting at these times offers the best chance of engaging other users through comments.
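The ranking step can also be written without building a swapped copy of the list, by passing a `key` function to `sorted`. The sketch below uses made-up averages purely to show the mechanics; the values are not taken from the data set.

```python
# Made-up [hour, average] pairs in the same shape as avg_by_hour_ask
avg_by_hour = [['09', 5.5], ['15', 12.0], ['02', 8.25]]

# Sort by the average (second element), highest first
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:2]:
    print('{}:00: {:.2f} average comments per post.'.format(hour, avg))
```

One difference from the swap approach is tie handling: with a `key`, hours with equal averages keep their original relative order instead of being ordered by the hour string.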

We now repeat this analysis with Show HN posts to see whether overall user engagement is highest at the same times, or whether there is an interaction between post type and hour to explore.

In [9]:
# isolate submission time and number of comments for each Show HN post
result_list_show = []
for post in show_posts:
    submitted_time = post[6]
    comments = int(post[4])
    result_list_show.append([submitted_time, comments])

# count posts and total comments per hour of submission
counts_by_hour_show = {}
comments_by_hour_show = {}
for post in result_list_show:
    timestamp = post[0]
    comments = post[1]
    hour = dt.datetime.strptime(timestamp, date_format).strftime('%H')
    if hour not in counts_by_hour_show:
        counts_by_hour_show[hour] = 1
        comments_by_hour_show[hour] = comments
    else:
        counts_by_hour_show[hour] += 1
        comments_by_hour_show[hour] += comments

# average comments per post for each hour
avg_by_hour_show = []
for hour in comments_by_hour_show:
    avg_by_hour_show.append([hour,
                            comments_by_hour_show[hour]/counts_by_hour_show[hour]])

# swap to [average, hour] so that sorting orders by average, highest first
swap_avg_by_hour_show = []
for hour in avg_by_hour_show:
    swap_avg_by_hour_show.append([hour[1], hour[0]])
sorted_avg_show = sorted(swap_avg_by_hour_show, reverse = True)

print('Top 5 Hours for Show HN Posts Comments')

for avg, hr in sorted_avg_show[:5]:
    print('{}: {:.2f} average comments per post.'.format(
        dt.datetime.strptime(hr, '%H').strftime('%H:%M'), avg))

Top 5 Hours for Show HN Posts Comments
18:00: 15.77 average comments per post.
00:00: 15.71 average comments per post.
14:00: 13.44 average comments per post.
23:00: 12.42 average comments per post.
22:00: 12.39 average comments per post.


For Show HN posts, the hours with the highest average number of comments per post were 18:00 (6PM EST), 00:00 (12AM EST), and 14:00 (2PM EST).
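Because the Ask HN and Show HN analyses follow identical steps, they could share one function. The sketch below is one possible refactoring, not the code used above: `top_hours_by_avg_comments` is a hypothetical name, the sample posts are made up, and the function also returns the number of posts per hour, which helps judge whether a high average rests on only a few posts.

```python
import datetime as dt


def top_hours_by_avg_comments(posts, n=5, date_format='%m/%d/%Y %H:%M'):
    """Return up to n (hour, average comments, post count) tuples, best first."""
    counts = {}
    totals = {}
    for post in posts:
        # created_at is at index 6, num_comments at index 4
        hour = dt.datetime.strptime(post[6], date_format).hour
        counts[hour] = counts.get(hour, 0) + 1
        totals[hour] = totals.get(hour, 0) + int(post[4])
    averages = [(hour, totals[hour] / counts[hour], counts[hour])
                for hour in counts]
    return sorted(averages, key=lambda item: item[1], reverse=True)[:n]


# Made-up posts in the same column order as hacker_news.csv
sample_posts = [
    ['1', 'Show HN: a', '', '1', '4', 'user_a', '1/1/2016 18:10'],
    ['2', 'Show HN: b', '', '1', '8', 'user_b', '1/2/2016 18:45'],
    ['3', 'Show HN: c', '', '1', '3', 'user_c', '1/3/2016 9:00'],
]

result = top_hours_by_avg_comments(sample_posts)
for hour, avg, count in result:
    print('{:02d}:00: {:.2f} average comments per post ({} posts).'.format(
        hour, avg, count))
```

Calling the function once with `ask_posts` and once with `show_posts` would replace the two near-identical cells.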

## Conclusions, Insights, and Recommendations

Ask HN posts were the more popular of the two post types in terms of average comments per post (about 14, compared with about 10 for Show HN posts). For Ask HN posts, the best times to submit were 3PM EST, 2AM EST, and 8PM EST, ordered from highest to lowest average comments per post. Show HN posts had a different set of top hours, with the top three being 6PM EST, 12AM EST, and 2PM EST.

Taken together, the best hours for comment interaction appear to be 2-4PM EST, with Ask HN posts most likely to generate comments at 3PM. A second window of relatively high interaction falls between 6PM and 9PM EST, where Ask HN posts again have the highest average comments per post, at around 8PM EST.

This might be related to the nature of Ask HN posts, which engage the HN community directly and source content from the users themselves. Show HN posts, on the other hand, provide content to the HN community and then seek responses to it. Authors who want the most interaction through comments may benefit from posing Ask HN questions that allow many users to contribute their knowledge, experience, or opinions, rather than asking more technical questions that require expertise, specialized skills, or niche experience. Alternatively, highly commented posts could be generating interest because of the unique questions they pose; an analysis of the content of the most-commented Ask HN posts could investigate this further.