# Exploring Hacker News Posts

Hacker News is a forum website hosted by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is an extremely popular online destination for technology enthusiasts and budding founders, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

With a base of highly technical users, Hacker News often becomes a home for eye-opening discussion on a range of topics. Users who post original content can benefit greatly from posting at a time when the most users are online and the most eyeballs are on the page.

We're specifically interested in posts with titles that begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just something interesting.

**We'll compare these two types of posts to determine the following:**
- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

We will be working with a dataset that has Hacker News posts for 12 months up to September 2016, found [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts), and placed in the local directory as `hacker_news.csv`.

Below are descriptions of the columns:
- `id`: the unique identifier from Hacker News for the post
- `title`: the title of the post
- `url`: the URL that the post links to, if the post has a URL
- `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: the number of comments on the post
- `author`: the username of the person who submitted the post
- `created_at`: the date and time of the post's submission

***
We'll begin by reading `hacker_news.csv` in as a list of lists.

In [76]:
from csv import reader

# Open the file in a context manager so it is closed once it has been read
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hacker_news = list(read_file)

print(hacker_news[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


We will separate the column headers from the data:

In [77]:
hn_headers = hacker_news[0]
hn_data = hacker_news[1:]

print(hn_headers)
print(hn_data[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
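
The later steps refer to columns by position (for example, index `4` for `num_comments`). As a minimal sketch, and not part of the original analysis, the header row can be turned into a name-to-position lookup so those positions are easier to verify. The name `column_index` is a hypothetical helper introduced here for illustration; the header and sample row are copied from the output above.

```python
# Header row as printed above
hn_headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row
column_index = {name: position for position, name in enumerate(hn_headers)}

# One row copied from the dataset preview
sample_row = ['12579008',
              'You have two days to comment if you want stem cells to be classified as your own',
              'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
              '1', '0', 'altstar', '9/26/2016 3:26']

# Values are read from the CSV as strings, so convert before doing arithmetic
num_comments = int(sample_row[column_index['num_comments']])
print(column_index['num_comments'], num_comments)
```

Looking columns up by name this way is optional; the cells that follow keep using the plain numeric positions.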


***
## Extracting `Ask HN` or `Show HN` Posts

Since we're only concerned with post titles beginning with `Ask HN` or `Show HN`, we'll create new lists of lists containing just the data for those titles.

In [78]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn_data:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
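
Because each title is lowercased before the prefix check, the match is case-insensitive. The sketch below illustrates this on three titles; `classify_title` is a hypothetical helper written for this example, and the second title has been deliberately upper-cased (it does not appear that way in the dataset).

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    normalized = title.lower()
    if normalized.startswith('ask hn'):
        return 'ask'
    if normalized.startswith('show hn'):
        return 'show'
    return 'other'

sample_titles = [
    'Ask HN: What TLD do you use for local development?',
    'SHOW HN: Finding puns computationally',  # case altered on purpose for the example
    'algorithmic music',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```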

In [79]:
print(ask_posts[:5])

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]


In [80]:
print(show_posts[:5])

[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44'], ['12577991', 'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules', 'https://github.com/jakebian/zeal', '2', '0', 'dbranes', '9/25/2016 23:17'], ['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']]


***
## Calculating the Average Number of Comments

Next, we want to determine the average number of comments each type of post receives on Hacker News.

In [81]:
def find_average(dataset, index, n_digits = 2):
    """Return the mean of the integer values in column `index`, rounded to `n_digits` places."""
    total = 0
    length = len(dataset)
    for row in dataset:
        value = int(row[index])
        total += value
    return round(total / length, n_digits)

In [82]:
avg_ask_comments = find_average(ask_posts, 4)
print(avg_ask_comments)

10.39


In [83]:
avg_show_comments = find_average(show_posts, 4)
print(avg_show_comments)

4.89


In [84]:
avg_other_comments = find_average(other_posts, 4)
print(avg_other_comments)

6.46


**On average, `Ask HN` posts receive more than twice as many comments as `Show HN` posts.**
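
As a quick arithmetic check of that comparison, the ratio can be computed from the two averages printed above. The variable `ratio` is introduced here only for illustration.

```python
# Averages copied from the outputs above
avg_ask_comments = 10.39
avg_show_comments = 4.89

# How many times more comments an Ask HN post receives on average
ratio = round(avg_ask_comments / avg_show_comments, 2)
print(ratio)
```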

***
## Comments by Hour of Post Creation

Next, we'll determine if posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
- Calculate the number of posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments posts receive by hour created.
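
Both steps depend on pulling the hour out of the `created_at` string. A minimal sketch of that parsing step on a single value from the dataset is shown here; the names `parsed` and `hour_label` are illustrative only.

```python
import datetime as dt

# A `created_at` value copied from the dataset preview
created_at = '9/26/2016 2:53'

# Parse the month/day/year hour:minute layout into a datetime object
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

# `%H` formats the hour as a zero-padded, 24-hour string
hour_label = parsed.strftime("%H")
print(parsed.hour, hour_label)
```

Using the zero-padded string as a dictionary key keeps the hours aligned when they are printed, which is why the tables below show keys such as `'02'` rather than `2`.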

In [85]:
import datetime as dt

def create_post_by_hour_ft(posts_dataset, index_date):
    posts_per_hour = {}
    comments_per_hour = {}
    for row in posts_dataset:
        comments = int(row[4])
        date = row[index_date]
        date_dt = dt.datetime.strptime(date, "%m/%d/%Y %H:%M") # Create a datetime object from `created_at`
        hour = date_dt.strftime("%H") # Extract the hour from the datetime object
        if hour in posts_per_hour:
            posts_per_hour[hour] += 1
        else:
            posts_per_hour[hour] = 1
        if hour in comments_per_hour:
            comments_per_hour[hour] += comments
        else:
            comments_per_hour[hour] = comments
    
    return posts_per_hour, comments_per_hour

In [86]:
ask_posts_ft, ask_comments_ft = create_post_by_hour_ft(ask_posts, -1)
print(ask_posts_ft)
print('\n')
print(ask_comments_ft)

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}


{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


In [87]:
show_posts_ft, show_comments_ft = create_post_by_hour_ft(show_posts, -1)
print(show_posts_ft)
print('\n')
print(show_comments_ft)

{'00': 276, '23': 319, '20': 525, '19': 556, '18': 656, '16': 801, '14': 696, '10': 323, '09': 302, '08': 316, '06': 192, '03': 206, '21': 430, '17': 761, '15': 836, '11': 402, '07': 236, '04': 194, '13': 610, '12': 516, '01': 247, '22': 377, '02': 209, '05': 172}


{'00': 1283, '23': 1444, '20': 2183, '19': 2791, '18': 3242, '16': 3769, '14': 3839, '10': 1228, '09': 1411, '08': 1771, '06': 904, '03': 934, '21': 1759, '17': 3236, '15': 3824, '11': 2413, '07': 1577, '04': 978, '13': 3314, '12': 3609, '01': 1006, '22': 1450, '02': 1076, '05': 592}


We can use the following function to sort our frequency tables and make them easier to read:

In [88]:
def sort_table(table, index_end = None):
    if index_end is None:
        index_end = len(table)  # Display every entry by default
        
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted[:index_end]:
        print(entry[1], ':', entry[0])
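
An alternative sketch, not the approach used in this notebook, sorts the dictionary's items directly with a `key` function and returns the result instead of printing it. The name `top_entries` is hypothetical, and the sample values are a subset of the `Ask HN` post counts shown above.

```python
def top_entries(table, n=5):
    """Return the `n` (key, value) pairs with the largest values."""
    ranked = sorted(table.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# A few hour counts copied from the Ask HN frequency table above
sample_table = {'15': 646, '18': 614, '17': 587, '16': 579, '19': 552, '05': 209}

for hour, count in top_entries(sample_table, n=3):
    print(hour, ':', count)
```

Returning a list makes the ranking reusable in later steps, whereas printing inside the function is convenient for quick inspection.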

In [89]:
sort_table(ask_posts_ft, index_end = 5)

15 : 646
18 : 614
17 : 587
16 : 579
19 : 552


In [90]:
sort_table(ask_comments_ft, index_end = 5)

15 : 18525
13 : 7245
17 : 5547
14 : 4972
18 : 4877


In [91]:
sort_table(show_posts_ft, index_end = 5)

15 : 836
16 : 801
17 : 761
14 : 696
18 : 656


In [92]:
sort_table(show_comments_ft, index_end = 5)

14 : 3839
15 : 3824
16 : 3769
12 : 3609
13 : 3314


***
## Calculating the Average Number of Comments Per Hour

We now know the hours of the day in which the most posts are submitted and the hours whose posts collect the most comments in total. Next, we want the average number of comments per post for each hour of creation.

This should provide us with insights regarding how active posts are at a given time.

In [93]:
def find_average_comments_per_hour(posts_ft, comments_ft):
    average_comments_per_hour = {}
    for hour in posts_ft:
        posts = posts_ft[hour]
        comments = comments_ft[hour]
        average_comments_per_hour[hour] = round(comments / posts, 2)
    return average_comments_per_hour
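
As a sanity check on the division this function performs, the same calculation can be done by hand for two hours, using the `Ask HN` totals from the frequency tables above. The variable names are illustrative only.

```python
# Ask HN totals for posts created in the 15:00 hour (from the tables above)
posts_at_15 = 646
comments_at_15 = 18525

# Ask HN totals for posts created in the 13:00 hour
posts_at_13 = 444
comments_at_13 = 7245

# Average comments per post = total comments / number of posts
avg_at_15 = round(comments_at_15 / posts_at_15, 2)
avg_at_13 = round(comments_at_13 / posts_at_13, 2)
print(avg_at_15, avg_at_13)
```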

In [94]:
ask_comments_per_hour = find_average_comments_per_hour(ask_posts_ft, ask_comments_ft)
sort_table(ask_comments_per_hour, index_end = 5)

15 : 28.68
13 : 16.32
12 : 12.38
02 : 11.14
10 : 10.68


In [95]:
show_comments_per_hour = find_average_comments_per_hour(show_posts_ft, show_comments_ft)
sort_table(show_comments_per_hour, index_end = 5)

12 : 6.99
07 : 6.68
11 : 6.0
08 : 5.6
14 : 5.52


***
## Insights and Analysis

We can now recall our initial questions when working with this dataset:
1. **Do `Ask HN` or `Show HN` posts receive more comments on average?**

We discovered earlier that `Ask HN` posts average more than twice as many comments as `Show HN` posts, with ~10 comments/post for `Ask HN` and ~5 comments/post for `Show HN`.

This stands to reason: `Ask HN` posts inherently invite discussion and feedback, so we should expect users to respond with comments. `Show HN` posts, on the other hand, do not explicitly ask for feedback and only receive comments from users who are inclined to leave them.

2. **Do posts created at a certain time receive more comments on average?**

In the last steps of this project, we determined the average number of comments per post for each hour of the day in which posts were created.

For `Ask HN` posts, the highest averages belong to posts created in the early afternoon (the 12pm, 1pm, and 3pm ET hours), with a surprising outlier at around 2am ET. Posts created in the 3pm ET hour receive the most comments by a wide margin, at 28.68 per post on average. This could be attributed to technology professionals on the East Coast reaching a slowdown in their workday at the same time that users on the West Coast are taking their lunch breaks.

For `Show HN` posts, the averages are spread more evenly. The highest average number of comments per post belongs to posts submitted in the 11am and 12pm ET hours, and posts submitted in the morning hours of 7am and 8am ET also rank near the top.