# Hacker News: Analysis of posts

In this project, we'll work with a data set of submissions to the popular technology site Hacker News. Hacker News is a site where user-submitted "posts" are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

## Goal:

Our goals in this project are:

1. To compare the Ask HN and Show HN posts in the Hacker News data set and see which of the two receives more comments on average.
2. To determine whether the time at which a post was created has a bearing on the number of comments it receives.

## Dataset:

The dataset used in this notebook can be found here: https://www.kaggle.com/hacker-news/hacker-news-posts

The dataset has been reduced to exclude submissions that did not receive any comments. The columns are described below:

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted

Let's take a look at a sample of the data:

In [None]:
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
df_hn = pd.read_csv('../input/hacker-news-posts/HN_posts_year_to_Sep_26_2016.csv')
df_hn.head()

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. Below are a few examples:

Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

In [None]:
# Turn dataframe into list of lists
hn = df_hn.values.tolist()
hn[:5]

Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

In [None]:
# Create 3 lists to separate posts
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Number of Ask HN posts:',len(ask_posts))
print('Number of Show HN posts:',len(show_posts))
print('Number of other posts:',len(other_posts))

Next, let's determine if ask posts or show posts receive more comments on average.

In [None]:
total_ask_comments = 0
for post in ask_posts:
    com = post[4]
    total_ask_comments += com
ask_avg = total_ask_comments/len(ask_posts)
    
total_show_comments = 0
for post in show_posts:
    com = post[4]
    total_show_comments += com
show_avg = total_show_comments/len(show_posts)

print('Ask post comments on average: {:.2f}'.format(ask_avg))
print('Show post comments on average: {:.2f}'.format(show_avg))

On average, ask posts receive more comments than show posts. Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
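
The two averaging loops follow the same pattern, so they could be folded into one function. This is only a sketch: `average_comments` is a hypothetical helper, and the rows below are made up, laid out in the same column order as the dataset (with `num_comments` at index 4).

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts stored at comments_index in each row."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    return sum(row[comments_index] for row in posts) / len(posts)


# Made-up rows: id, title, url, num_points, num_comments, author, created_at
toy_posts = [
    [1, "Ask HN: first", "", 3, 10, "alice", "1/1/2016 9:00"],
    [2, "Ask HN: second", "", 5, 4, "bob", "1/1/2016 10:30"],
]

toy_avg = average_comments(toy_posts)
print("Average comments: {:.2f}".format(toy_avg))
```

With the notebook's own lists, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.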

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. First, let's calculate the number of ask posts created in each hour of the day, along with the number of comments received.

In [None]:
import datetime as dt
result_list = []
for post in ask_posts:
    result_list.append([post[6],int(post[4])])

counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    hour = row[0]
    hour = dt.datetime.strptime(hour,'%m/%d/%Y %H:%M')
    hour = dt.datetime.strftime(hour,'%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

print('Number of posts made at each hour:')
counts_by_hour
print('Number of comments made at each hour:')
comments_by_hour

We can see that the first dictionary contains the number of ask posts created during each hour of the day. The second dictionary contains the corresponding number of comments received by ask posts created at each hour.
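
To make the hour extraction concrete, here is a small sketch of the same tally on made-up rows. The timestamps are illustrative values written in the `%m/%d/%Y %H:%M` format that the cell above parses, and `dict.get` is used as a shorter alternative to the `if`/`else` branch.

```python
import datetime as dt

# Made-up (created_at, num_comments) pairs in the dataset's timestamp format
toy_rows = [
    ("8/16/2016 9:55", 6),
    ("8/16/2016 9:05", 2),
    ("9/1/2016 15:30", 10),
]

toy_counts = {}
toy_comments = {}
for created_at, num_comments in toy_rows:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded hour, e.g. "09"
    # get() returns 0 the first time an hour is seen
    toy_counts[hour] = toy_counts.get(hour, 0) + 1
    toy_comments[hour] = toy_comments.get(hour, 0) + num_comments

print(toy_counts)
print(toy_comments)
```

Because the hour is formatted with `%H`, the dictionary keys are two-character strings such as `"09"` and `"15"`, not integers.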

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [None]:
avg_by_hour = []
for hr in counts_by_hour:
    avg = round(comments_by_hour[hr]/counts_by_hour[hr],2)
    avg_by_hour.append([hr,avg])
avg_by_hour

Now we need to swap the columns and sort the table to find the best hour to post in order to get the most comments:

In [None]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
sorted_swap = sorted(swap_avg_by_hour,reverse = True)
sorted_swap
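
As an alternative to swapping the columns, `sorted` accepts a `key` function, so the pairs can be ranked by their second element directly. A minimal sketch on made-up values (these are not results from the dataset):

```python
# Made-up [hour, average] pairs, in the same shape as avg_by_hour
toy_avg_by_hour = [["09", 5.5], ["15", 28.5], ["13", 16.25]]

# Sort on the average (index 1), highest first, without reordering columns
ranked = sorted(toy_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print("{}:00 : {:.1f} average comments per post".format(hour, avg))
```

This keeps each row in `[hour, average]` order, so any code that consumes the result would unpack `hr, avg` instead of `avg, hr`.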

In [None]:
print('Top 5 Hours for Ask Posts Comments')
for avg,hr in sorted_swap[:5]:
    hr = dt.datetime.strptime(hr, '%H')
    hr = dt.datetime.strftime(hr,'%H:%M')
    print('{} : {:.1f} average comments per post'.format(hr,avg))

Now we can see the five best hours at which to post on Hacker News. 3PM is the best time to post: it received the most comments by far, with an average of ~29 comments per post. 1PM was the next best time, followed by 12PM, with averages of about 16 and 12 comments per post respectively.