# Hacker News Analytics

Hacker News (HN) is an online forum for voting and commenting on user-submitted articles.

Hacker News is highly regarded in technology and start-up circles, so posts that are listed prominently there can attract hundreds of thousands of visitors.

The goal of this project is to find out whether Ask HN posts (which ask the Hacker News community a specific question) or Show HN posts (which show the community a project, product, or just generally something interesting) receive more comments on average.

We would also like to understand whether posts created at a certain time of day receive more comments on average.

In [3]:
from csv import reader

# Load the Hacker News posts data set; the with-block closes the file after reading
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


## Data pre-processing

It is good practice to clean and format the data before starting the analysis.

In [6]:
# Remove headers from data set
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[0:4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
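
In a notebook, re-running a cell that executes `hn = hn[1:]` strips one more row on every run. The following is a minimal sketch of a header split that is safe to re-run. It assumes the header row always starts with `'id'`, as it does in this data set; the helper name `split_header` and the sample rows are hypothetical.

```python
def split_header(rows, first_column="id"):
    """Return (header, data) without dropping data rows on repeated calls."""
    if rows and rows[0] and rows[0][0] == first_column:
        return rows[0], rows[1:]
    # The header was already removed, so leave the rows untouched
    return None, rows


# Hypothetical sample rows, not taken from hacker_news.csv
sample = [
    ["id", "title", "num_comments"],
    ["1", "Ask HN: Example question?", "3"],
    ["2", "Show HN: Example project", "5"],
]

headers, data = split_header(sample)
_, data_again = split_header(data)  # a second call changes nothing

print(headers)
print(len(data), len(data_again))
```

Checking the first cell of the first row is only one possible guard; reloading the file at the top of the cell would work as well.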

In [19]:
# Segregate 'Ask HN' and 'Show HN' threads from data set
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Number of posts (Ask):', len(ask_posts))
print('Number of posts (Show):', len(show_posts))
print('Number of posts (Other):', len(other_posts))


Number of posts (Ask): 1744
Number of posts (Show): 1162
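
The prefix test used for the split can also be written as a small function, which makes it easy to check on a few titles before running it over the whole data set. This is only a sketch: the function name `classify_title` and the sample titles are hypothetical.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # makes the prefix test case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles, not taken from the data set
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "ASK HN: Best books of the year?",
    "Show HN: A tiny CSV viewer",
    "Interactive Dynamic Video",
]

counts = {"ask": 0, "show": 0, "other": 0}
for sample_title in sample_titles:
    counts[classify_title(sample_title)] += 1

print(counts)
```

Because the test only looks at the start of the title, a post that mentions "Ask HN" later in its title is counted as an other post.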


## Data analysis

Now let's extract some insights from the data.

In [31]:
# determine if ask posts or show posts receive more comments on average
total_ask_comments = 0
for row in ask_posts:
    ncomm = row[4]
    ncomm = int(ncomm)
    total_ask_comments = total_ask_comments + ncomm
    
avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    ncomm = row[4]
    ncomm = int(ncomm)
    total_show_comments = total_show_comments + ncomm
    
avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993


As the results above show, Ask posts receive more comments on average (about 14.04) than Show posts (about 10.32).

Since Ask posts receive more comments on average, we'll focus the remaining analysis on these posts.
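
The two averaging loops follow the same pattern, so they could be folded into one helper. The sketch below assumes, as in this data set, that the comment count is stored as a string in column index 4; the name `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as the data set:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ["1", "Ask HN: Example A", "", "10", "6", "user_a", "1/1/2016 10:00"],
    ["2", "Ask HN: Example B", "", "3", "2", "user_b", "1/2/2016 11:30"],
]

print(average_comments(sample_posts))
```

The empty-list guard avoids a `ZeroDivisionError` if one of the groups happens to contain no posts.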

In [89]:
import datetime as dt

# Count the number of Ask posts and comments by the hour each post was created
result_list = []

for row in ask_posts:
    result_list.append([row[6],int(row[4])])

posts_by_hour = {}  # number of Ask posts created during each hour of the day                  
comments_by_hour = {} # number of comments of Ask posts created at each hour

for row in result_list:
    date_str = row[0]
    ncomm = row[1]
    date = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")
    time = dt.datetime.strftime(date,"%H")
    if time not in posts_by_hour:
        posts_by_hour[time] = 1
        comments_by_hour[time] = ncomm
    else:
        posts_by_hour[time] += 1
        comments_by_hour[time] += ncomm
    


In [91]:
# calculate the average number of comments per post for posts created during each hour of the day
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/posts_by_hour[hour]])
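
The hourly counting and averaging steps can also be combined in a single pass. The sketch below uses `collections.defaultdict` in place of the explicit `if time not in posts_by_hour` check; it is an alternative formulation, not the approach used in the cells above. The function name and the sample rows are hypothetical.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(rows):
    """Map each hour ('00' to '23') to the mean number of comments per post."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [post count, comment count]
    for created_at, num_comments in rows:
        parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
        hour = parsed.strftime("%H")
        totals[hour][0] += 1
        totals[hour][1] += num_comments
    return {hour: comments / posts for hour, (posts, comments) in totals.items()}


# Hypothetical (created_at, num_comments) pairs in the data set's date format
sample_rows = [
    ("8/4/2016 11:52", 52),
    ("1/26/2016 11:30", 10),
    ("6/23/2016 22:20", 1),
]

hourly_averages = average_comments_by_hour(sample_rows)
print(hourly_averages)
```

Keeping both totals in one dictionary means the post count and the comment count for an hour cannot drift out of sync.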
    


In [98]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )
    
    

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
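
Swapping the columns before sorting is one way to rank the hours. As a sketch of an alternative, `sorted` accepts a `key` function, so the `[hour, average]` pairs can be sorted directly. The averages below are the rounded values from the output above, plus one hypothetical entry for hour `09` to show the cut-off at five.

```python
# [hour, average] pairs: rounded values from the output above,
# plus one hypothetical entry ("09") that should fall outside the top five
avg_by_hour = [
    ["16", 16.80],
    ["09", 5.00],
    ["15", 38.59],
    ["21", 16.01],
    ["02", 23.81],
    ["20", 21.52],
]

# Sort on the average (index 1) in descending order and keep the first five
top_five = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

print("Top 5 Hours for Ask Posts Comments")
for hour, avg in top_five:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

Sorting with a key leaves the pairs in their original `[hour, average]` order, so no second list is needed.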


## Conclusion

In this project we wanted to find out whether Ask HN posts (which ask the Hacker News community a specific question) or Show HN posts (which show the community a project, product, or just generally something interesting) receive more comments on average.

To answer this question, we used a data set that was reduced from roughly 300,000 rows to roughly 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.

First, we split the data set into Ask posts, Show posts, and other posts. We then counted the posts of each type and compared the average number of comments for Ask and Show posts.

The analysis showed that Ask posts receive more comments on average, so we focused on them. We then calculated the number of Ask posts and comments by the hour in which each post was created. On average, posts created at 15:00 (3:00 pm) receive the most comments per post (38.59). 16:00 (4:00 pm) ranks fourth in the top five, which suggests that the 3:00 pm and 4:00 pm hours are the most productive period of the day.

This analysis is not exhaustive: it would also be interesting to understand which topic or area is the most upvoted and commented on, in order to produce more traffic by