# Exploring Hacker News Posts

The data consists of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/), started by the start-up incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted and commented on.

The dataset can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). [Dataquest](https://www.dataquest.io) reduced it from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. Here are descriptions of the columns:
- `id` - the unique identifier from Hacker News for the post
- `title` - the title of the post
- `url` - the URL that the post links to, if the post has a URL
- `num_points` - the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments` - the number of comments that were made on the post
- `author` - the username of the person who submitted the post
- `created_at` - the date and time at which the post was submitted
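
As a minimal sketch of how these columns can be accessed by name rather than by position, the standard library's `csv.DictReader` turns each row into a dictionary keyed by the header. The two rows are taken from the dataset, while `sample_csv` and `sample_rows` are illustrative names and `io.StringIO` stands in for the real file. The rest of this notebook uses plain lists and numeric indices instead.

```python
import csv
import io

# Two rows from the dataset; io.StringIO stands in for the real CSV file.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)

# DictReader uses the first row as keys, so every later row becomes a dict.
sample_rows = list(csv.DictReader(sample_csv))

for row in sample_rows:
    # Values are still strings, so numeric columns need an explicit int().
    print(row["title"], int(row["num_points"]), int(row["num_comments"]))
```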

We're interested in post titles that begin with Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question and Show HN posts to show the Hacker News community a project, product or something of interest.

We want to compare the two types of posts to find out:
- do Ask HN or Show HN posts receive more comments on average?
- do posts created at a certain time receive more comments on average?

To read in and preview the data:

In [1]:
import csv

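# Read every row into a list of lists; the with block closes the file afterwards.
with open('hacker_news.csv', newline='') as opened_file:
    reader = csv.reader(opened_file)
    hn = list(reader)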

In [2]:
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


### Removing headers

To remove the row containing the column headers:

In [3]:
headers = hn[0]
hn = hn[1:]

In [4]:
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [5]:
print(hn[0:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Extracting Ask HN and Show HN posts

To find out how many posts start with Ask HN and Show HN:

In [6]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title_lower = title.lower()  
    if title_lower.startswith('ask hn'):
        ask_posts.append(row)
    elif title_lower.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [7]:
print(len(ask_posts))

1744


In [8]:
print(len(show_posts))

1162


In [9]:
print(len(other_posts))

17194


There are 1744 posts that start with Ask HN, 1162 posts that start with Show HN and 17194 other posts.
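
The prefix check used above can also be written as a small function, which makes it easy to test on a few titles in isolation. This is only a sketch: `classify_title` and `sample_titles` are illustrative names, and the titles are taken from the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix (case-insensitive)."""
    title_lower = title.lower()
    if title_lower.startswith("ask hn"):
        return "ask"
    if title_lower.startswith("show hn"):
        return "show"
    return "other"


# Titles from the dataset, used here only as a quick check.
sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]

labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```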

To check, let's print out the first five rows of the `ask_posts` list and the first five rows of the `show_posts` list:

In [10]:
print(ask_posts[0:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


In [11]:
print(show_posts[0:5])

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]


### Calculating the average number of comments for Ask HN and Show HN posts

Next, we want to determine if Ask HN posts or Show HN posts receive more comments on average:

In [12]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = row[4]
    numcomm = int(num_comments)
    total_ask_comments += numcomm

avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [13]:
total_show_comments = 0

for row in show_posts:
    num_comments = row[4]
    numcomm = int(num_comments)
    total_show_comments += numcomm
    
avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

10.31669535283993


As the results show, Ask HN posts receive more comments on average than Show HN posts (about 14.04 versus 10.32). This is probably because Ask HN posts pose questions that invite users to answer in the comments.

Since Ask HN posts are more likely to receive comments, we'll focus on them in the next steps.
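
The two loops above differ only in the list they iterate over, so the calculation could be wrapped in a helper. The sketch below assumes a hypothetical function name, `average_column`, and uses two Ask HN rows from the dataset as sample input. Because the list is passed in once, the total and the count always come from the same rows.

```python
def average_column(rows, index):
    """Return the mean of the integer values stored at position `index` in each row."""
    if not rows:
        raise ValueError("cannot average an empty list of rows")
    total = sum(int(row[index]) for row in rows)
    return total / len(rows)


# Two Ask HN rows from the dataset; index 4 is num_comments.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]

sample_avg = average_column(sample_ask, 4)
print(sample_avg)  # (6 + 29) / 2 = 17.5
```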

### Finding the number of Ask HN posts and comments by hour

We also want to determine if Ask HN posts created at a certain time are more likely to attract comments. We'll use the following steps:
- calculate the number of Ask HN posts created in each hour of the day, along with the number of comments received
- calculate the average number of comments Ask HN posts receive by hour created

In [14]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    numcomm = int(row[4])
    result_list.append([created_at, numcomm])
    
counts_by_hour = {}
comments_by_hour =  {}

for row in result_list:
    date_str = row[0]
    comments = row[1]
    date_dt = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")
    hour = date_dt.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

In [15]:
print(counts_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


In [16]:
print(comments_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


We have two dictionaries:
- `counts_by_hour`: contains the number of Ask HN posts created during each hour of the day
- `comments_by_hour`: contains the total number of comments received by the Ask HN posts created during each hour
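
As an alternative sketch, `collections.defaultdict` from the standard library removes the need for the `if hour not in ...` branch, because missing keys start at `0`. The names `sample_results`, `sample_counts` and `sample_comments` are illustrative, and the three rows are taken from the Ask HN posts shown earlier.

```python
import datetime as dt
from collections import defaultdict

# Three Ask HN posts reduced to [created_at, num_comments]; values come from the dataset.
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
]

# defaultdict(int) returns 0 for a missing key, so += works on first sight of an hour.
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for date_str, comments in sample_results:
    hour = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += comments

print(dict(sample_counts))    # {'09': 1, '13': 1, '10': 1}
print(dict(sample_comments))  # {'09': 6, '13': 29, '10': 1}
```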


### Calculating the average comment number for Ask HN posts by hour

To calculate the average number of comments for Ask HN posts created during each hour of the day:

In [17]:
avg_by_hour = []

for hour in counts_by_hour:
    avg_comm = comments_by_hour[hour]/counts_by_hour[hour]
    avg_by_hour.append([hour, avg_comm])
    
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


This format makes it hard to identify the hours with the highest values. To sort the list of lists by average, we first swap the two columns so that the average comes first:

In [18]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[-1], row[0]])

print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


Top 5 hours for comments in Ask HN Posts:

In [19]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

for i in sorted_swap[0:5]:
    string1= "{}h: {:.2f} average comments per post"
    hours = dt.datetime.strptime(i[-1], '%H')
    hours_format = hours.strftime("%H:%M")
    average = i[0]
    print(string1.format(hours_format, average))  

15:00h: 38.59 average comments per post
02:00h: 23.81 average comments per post
20:00h: 21.52 average comments per post
16:00h: 16.80 average comments per post
21:00h: 16.01 average comments per post


The top hour for comments is clearly 15:00h (UTC -5), with an average of 38.59 comments per post, well ahead of the second-ranked hour (23.81).

If we take all five top-ranked hours into consideration, the best times to create a post (and have a higher chance of receiving comments) are 15-17h, 20-22h and 2-3h US Eastern Time (UTC -5).
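
The column swap used above is one way to sort by average. Another option, sketched here, is to pass a `key` function to `sorted()` so the original `[hour, average]` order is kept. The name `sample_avg_by_hour` is illustrative, and the values are four of the averages above rounded to two decimals.

```python
# A few [hour, average] pairs from avg_by_hour, rounded to two decimals for brevity.
sample_avg_by_hour = [
    ['09', 5.58],
    ['15', 38.59],
    ['02', 23.81],
    ['16', 16.80],
]

# key picks the average (second element); reverse=True puts the largest first.
sample_sorted = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in sample_sorted:
    print("{}:00h: {:.2f} average comments per post".format(hour, average))
```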

### Calculating the average number of points for Ask HN and Show HN posts

We want to find out if Show HN or Ask HN posts receive more points on average:

In [20]:
total_ask_points = 0

for row in ask_posts:
    num_points = row[3]
    numpt = int(num_points)
    total_ask_points += numpt

avg_ask_points = total_ask_points/len(ask_posts)
print(avg_ask_points)

15.061926605504587


In [21]:
total_show_points = 0

for row in show_posts:
    num_points = row[3]
    numpt = int(num_points)
    total_show_points += numpt

avg_show_points = total_show_points/len(show_posts)
print(avg_show_points)

27.555077452667813


On average, Show HN posts receive more points than Ask HN posts, probably because projects and products that show something new or interesting attract upvotes.

Since Show HN posts receive more points, it makes sense to focus on them in the next steps.

### Calculating the average points for Show HN posts by hour

We want to determine if Show HN posts created at a certain time are more likely to receive more points, so we can use the same method as in the previous steps:

In [22]:
result_show_points = []

for row in show_posts:
    created_at = row[6]
    numpt = int(row[3])
    result_show_points.append([created_at, numpt])
    
counts_by_hour = {}
points_by_hour =  {}

for row in result_show_points:
    date_str = row[0]
    points = row[1]
    date_dt = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")
    hour = date_dt.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        points_by_hour[hour] = points
    else:
        counts_by_hour[hour] += 1
        points_by_hour[hour] += points
        
avg_points_by_hour = []

for hour in counts_by_hour:
    avg_pt = points_by_hour[hour]/counts_by_hour[hour]
    avg_points_by_hour.append([hour, avg_pt])
    
swap_avg_points_by_hour = []

for row in avg_points_by_hour:
    swap_avg_points_by_hour.append([row[-1], row[0]])

sorted_swap2 = sorted(swap_avg_points_by_hour, reverse = True)

for i in sorted_swap2[0:5]:
    string2 = "{}h: {:.2f} average points per post"
    hours = dt.datetime.strptime(i[-1], '%H')
    hours_format = hours.strftime("%H:%M")
    average = i[0]
    print(string2.format(hours_format, average)) 

23:00h: 42.39 average points per post
12:00h: 41.69 average points per post
22:00h: 40.35 average points per post
00:00h: 37.84 average points per post
18:00h: 36.31 average points per post


It seems that the average points for Show HN posts per hour are relatively close for the top 5 hours. We can check the top 10 hours:

In [23]:
for i in sorted_swap2[0:10]:
    string2 = "{}h: {:.2f} average points per post"
    hours = dt.datetime.strptime(i[-1], '%H')
    hours_format = hours.strftime("%H:%M")
    average = i[0]
    print(string2.format(hours_format, average)) 

23:00h: 42.39 average points per post
12:00h: 41.69 average points per post
22:00h: 40.35 average points per post
00:00h: 37.84 average points per post
18:00h: 36.31 average points per post
11:00h: 33.64 average points per post
19:00h: 30.95 average points per post
20:00h: 30.32 average points per post
15:00h: 28.56 average points per post
16:00h: 28.32 average points per post


Show HN posts created at 12-13h, 18-19h and 22-01h US Eastern Time (UTC -5) receive more than 35 points per post on average.
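
For readers in other time zones, the hours can be converted with the standard `datetime` module. The sketch below assumes a fixed UTC-5 offset, as stated in the text, and therefore ignores daylight saving time. `EASTERN_FIXED` and `eastern_to_utc` are illustrative names, and the timestamp is the `created_at` value of one Show HN post in the dataset.

```python
import datetime as dt

# Assumption: a fixed UTC-5 offset, as stated in the text; daylight saving time is ignored.
EASTERN_FIXED = dt.timezone(dt.timedelta(hours=-5))


def eastern_to_utc(date_str):
    """Parse a created_at string as UTC-5 and return the equivalent UTC datetime."""
    naive = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")
    aware = naive.replace(tzinfo=EASTERN_FIXED)
    return aware.astimezone(dt.timezone.utc)


converted = eastern_to_utc("11/25/2015 14:03")
print(converted.strftime("%Y-%m-%d %H:%M %Z"))  # 2015-11-25 19:03 UTC
```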

It is also evident that the best hours for comments on Ask HN posts differ from the best hours for points on Show HN posts.
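
Since the by-hour steps were the same for comments and for points, they could be collected in one function. This is a sketch under that assumption, not a replacement for the cells above: `average_by_hour` is a hypothetical helper, and `sample_show` holds three Show HN rows from the dataset.

```python
import datetime as dt


def average_by_hour(rows, value_index, date_index=6):
    """Return {hour: average of the integer column `value_index`} for the given rows."""
    counts = {}
    totals = {}
    for row in rows:
        hour = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M").strftime("%H")
        counts[hour] = counts.get(hour, 0) + 1
        totals[hour] = totals.get(hour, 0) + int(row[value_index])
    return {hour: totals[hour] / counts[hour] for hour in counts}


# Three Show HN rows from the dataset; index 3 is num_points, index 4 is num_comments.
sample_show = [
    ['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
     'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'],
    ['10646440', 'Show HN: Something pointless I made',
     'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'],
    ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm',
     'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'],
]

sample_points = average_by_hour(sample_show, 3)
sample_comments_avg = average_by_hour(sample_show, 4)
print(sample_points)        # {'14': 26.0, '22': 747.0, '18': 1.0}
print(sample_comments_avg)  # {'14': 22.0, '22': 102.0, '18': 1.0}
```

Calling the function with index `3` or `4` switches between points and comments without repeating the loop.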

### Comparing average number of comments and points for other posts

To find the average number of comments and points for other posts:

In [24]:
print(other_posts[0:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [25]:
total_other_comments = 0

for row in other_posts:
    num_comments = row[4]
    numcomm = int(num_comments)
    total_other_comments += numcomm

avg_other_comments = total_other_comments/len(other_posts)
print(avg_other_comments)

26.8730371059672


In [26]:
total_other_points = 0

for row in other_posts:
    num_points = row[3]
    numpt = int(num_points)
    total_other_points += numpt

avg_other_points = total_other_points/len(other_posts)
print(avg_other_points)

55.4067698034198


To compare the results for Ask HN, Show HN and other posts, we can use a table:

| Posts | Ask HN | Show HN | Other |
|---|---|---|---|
| Average comments | 14.04 | 10.32 | 26.87 |
| Average points | 15.06 | 27.56 | 55.41 |

Interestingly, the average number of comments and points for other posts is roughly 2 to 4 times higher than for Ask HN or Show HN posts. This could be an interesting topic for further analysis.