# Achieving success with a Hacker News post

Some Hacker News posts are more popular than others. Why is that? Is it the content, the post type, or simply the time of day the post is published?

*Let's ask someone who knows... let's ask our Hacker News data set.*

**We'll open our dataset.**

In [1]:
open_file = open("hacker_news.csv")

In [2]:
from csv import reader
read_file = reader(open_file)
hn = list(read_file)
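
An alternative sketch of the loading step that uses a `with` block, so the file is closed automatically once the rows are read. The helper name `load_rows` and the two-line sample file are hypothetical; they are only there so the snippet runs without `hacker_news.csv`.

```python
import csv
import os
import tempfile


def load_rows(path):
    """Return every row of a CSV file as a list of lists of strings."""
    # newline="" is the setting the csv module documentation recommends
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


# Hypothetical two-line sample standing in for hacker_news.csv
sample = "id,title\n1,Ask HN: Example question?\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write(sample)

rows = load_rows(tmp.name)
os.remove(tmp.name)  # clean up the temporary sample file
print(rows)
```

With the real data set, `load_rows("hacker_news.csv")` would return the same list of rows that `hn` holds in this notebook.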

**Now we'll print the first five rows to see what our dataset looks like.**

In [3]:
for row in hn[:5]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


This is what each column means:
* `id`: The unique identifier from Hacker News for the post
* `title`: The title of the post
* `url`: The URL that the post links to, if the post has a URL
* `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: The number of comments that were made on the post
* `author`: The username of the person who submitted the post
* `created_at`: The date and time at which the post was submitted
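
Because each row is a plain list, the cells in this notebook refer to columns by position (`row[1]`, `row[4]`, `row[6]`). A small sketch of how the column names could be turned into a name-to-index lookup instead; `column_names` and `col_index` are hypothetical names that are not used elsewhere in this notebook.

```python
column_names = ['id', 'title', 'url', 'num_points', 'num_comments',
                'author', 'created_at']

# Map each column name to its position in a row
col_index = {name: i for i, name in enumerate(column_names)}

# First data row of the data set, copied from the output above
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52',
       'ne0phyte', '8/4/2016 11:52']

title = row[col_index['title']]
num_comments = int(row[col_index['num_comments']])  # values are read as strings
print(title, num_comments)
```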

**Let's remove the header row.**

In [4]:
headers = hn[0]

In [5]:
hn = hn[1:]

In [6]:
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [7]:
for row in hn[:5]:
    print(row)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


**We are particularly interested in Hacker News posts whose titles start with "Ask HN" or "Show HN". We will therefore split the data set by post type to see whether the post type influences popularity.**

In [8]:
ask_posts = []
show_posts = []
other_posts = []

In [9]:
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
    

**There are 1,744 "Ask HN" posts and 1,162 "Show HN" posts:**

In [10]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
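
The prefix check used in the loop above can also be written as a small function, which makes it easy to try on individual titles. This is only a sketch: `classify_title` is a hypothetical helper, and the three titles are made up for illustration.

```python
from collections import Counter


def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # lower-case once so the check ignores case
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles for illustration only
titles = ['Ask HN: What are you reading?',
          'SHOW HN: My side project',
          'Interactive Dynamic Video']
counts = Counter(classify_title(t) for t in titles)
print(counts)
```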


**Now we are going to determine whether ask posts or show posts receive more comments on average.**

In [11]:
total_ask_comments = 0

In [12]:
for row in ask_posts:
    total_ask_comments += int(row[4])

In [13]:
avg_ask_comments = total_ask_comments / len(ask_posts)

In [14]:
total_show_comments = 0

In [15]:
for row in show_posts:
    total_show_comments += int(row[4])

In [16]:
avg_show_comments = total_show_comments / len(show_posts)

*Average number of comments on ask posts:*

In [17]:
print(avg_ask_comments)

14.038417431192661


*Average number of comments on show posts:*

In [18]:
print(avg_show_comments)

10.31669535283993


**On average, "Ask HN" posts generate about 36% more comments than "Show HN" posts (14.04 vs. 10.32 comments per post).**
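
The relative difference between the two averages can be checked with one line of arithmetic. The two values below are copied from the printed outputs above.

```python
avg_ask_comments = 14.038417431192661
avg_show_comments = 10.31669535283993

# Relative difference: how much larger the ask average is than the show average
relative_diff = avg_ask_comments / avg_show_comments - 1
print('{:.1%}'.format(relative_diff))
```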

Let's now analyze whether ask posts created at a certain hour are more likely to attract comments. We'll calculate the average number of comments per post for each hour of the day.

In [19]:
import datetime as dt

In [20]:
result_list = []

In [21]:
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

In [22]:
counts_by_hour = {}
comments_by_hour = {}

In [23]:
for row in result_list:
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

In [24]:
avg_by_hour = []

In [25]:
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

In [26]:
print(avg_by_hour)

[['20', 21.525], ['23', 7.985294117647059], ['22', 6.746478873239437], ['13', 14.741176470588234], ['11', 11.051724137931034], ['15', 38.5948275862069], ['06', 9.022727272727273], ['10', 13.440677966101696], ['03', 7.796296296296297], ['05', 10.08695652173913], ['21', 16.009174311926607], ['01', 11.383333333333333], ['02', 23.810344827586206], ['09', 5.5777777777777775], ['19', 10.8], ['18', 13.20183486238532], ['12', 9.41095890410959], ['14', 13.233644859813085], ['00', 8.127272727272727], ['07', 7.852941176470588], ['08', 10.25], ['04', 7.170212765957447], ['16', 16.796296296296298], ['17', 11.46]]
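
The per-hour tally above can be written more compactly with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. A minimal sketch, run on three made-up `[created_at, num_comments]` pairs in the same format as `result_list`:

```python
import datetime as dt
from collections import defaultdict

# Made-up pairs for illustration; the real input would be result_list
sample = [['8/4/2016 11:52', 6], ['9/30/2015 11:05', 4], ['6/23/2016 22:20', 3]]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # total comments per hour
for created_at, n_comments in sample:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += n_comments

# Average comments per post for each hour seen in the sample
avg = {hour: comments[hour] / counts[hour] for hour in counts}
print(avg)
```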


In [27]:
swap_avg_by_hour = []

In [28]:
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

In [29]:
print(swap_avg_by_hour)

[[21.525, '20'], [7.985294117647059, '23'], [6.746478873239437, '22'], [14.741176470588234, '13'], [11.051724137931034, '11'], [38.5948275862069, '15'], [9.022727272727273, '06'], [13.440677966101696, '10'], [7.796296296296297, '03'], [10.08695652173913, '05'], [16.009174311926607, '21'], [11.383333333333333, '01'], [23.810344827586206, '02'], [5.5777777777777775, '09'], [10.8, '19'], [13.20183486238532, '18'], [9.41095890410959, '12'], [13.233644859813085, '14'], [8.127272727272727, '00'], [7.852941176470588, '07'], [10.25, '08'], [7.170212765957447, '04'], [16.796296296296298, '16'], [11.46, '17']]


In [30]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
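
Swapping the columns is one way to sort by the average. Another is to pass a `key` function to `sorted`, which leaves the `[hour, average]` order untouched. A short sketch using three pairs copied from the `avg_by_hour` output above; `top_hours` is a hypothetical name.

```python
# A few [hour, average] pairs copied from the avg_by_hour output
avg_by_hour = [['20', 21.525],
               ['15', 38.5948275862069],
               ['02', 23.810344827586206]]

# Sort by the second element (the average), highest first
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
for hour, average in top_hours:
    print('{}:00 {:.2f}'.format(hour, average))
```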

**Top 5 Hours for Ask Post Comments**

In [31]:
for row in sorted_swap[:5]:
    datehour = dt.datetime.strptime(row[1], "%H")
    hour = datehour.strftime("%H:00:")
    average = '{:.2f}'.format(row[0])
    print(hour, average, "average comments per post")

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


## Conclusion

*To maximize the chances of receiving comments on a Hacker News post, it is recommended to publish an "Ask HN" post at around 3 pm Eastern Time (4 pm Argentina Time).*
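
The time zone conversion in the conclusion can be sanity-checked with fixed UTC offsets. This sketch assumes Eastern Daylight Time (UTC-4) and Argentina Time (UTC-3); while Eastern Standard Time (UTC-5) is in effect, the Argentine equivalent would be one hour later. The date is arbitrary and the offsets are hard-coded rather than looked up in a time zone database.

```python
import datetime as dt

# Assumed fixed offsets (not looked up from a time zone database)
EDT = dt.timezone(dt.timedelta(hours=-4), 'EDT')
ART = dt.timezone(dt.timedelta(hours=-3), 'ART')

# 3 pm Eastern on an arbitrary summer date
posted_eastern = dt.datetime(2016, 8, 4, 15, 0, tzinfo=EDT)
posted_argentina = posted_eastern.astimezone(ART)
print(posted_argentina.strftime('%H:%M %Z'))
```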