# Finding the best time and type of posts to receive more comments on Hacker News

In this project, we'll aim to find which type of post receives the most comments on average, and at what time of day it is best to post on Hacker News. Hacker News is a website where user-submitted stories (known as 'posts') are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

# Introduction
We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit "Ask HN" posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.
Our goal for this project is to analyze data to understand what kinds of posts are likely to produce more comments.

The dataset has been reduced from almost 300,000 posts to approximately 20,000 posts by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- **id** - The unique identifier from Hacker News for the post
- **title** - The title of the post
- **url** - The URL that the post links to, if the post has a URL
- **num_points** - The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- **num_comments** - The number of comments that were made on the post
- **author** - The username of the person who submitted the post
- **created_at** - The date and time at which the post was submitted

# Importing Libraries and Reading data

Let's start by importing the libraries we need and reading the data set into a list of lists.


In [33]:
import csv

# Open the file in a with-block so it is closed automatically after reading
with open('hacker_news.csv') as f:
    hn = list(csv.reader(f))

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


The first inner list contains the column headers. In order to analyze our data, we first need to remove the row containing the column headers:

In [34]:
headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
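The same header/data split can also be written with iterable unpacking. The following is only a minimal sketch: the in-memory CSV text and its three columns are made up for illustration and stand in for `hacker_news.csv`.

```python
import csv
import io

# Hypothetical in-memory sample standing in for hacker_news.csv
sample_csv = io.StringIO(
    "id,title,num_comments\n"
    "1,Ask HN: Example question?,6\n"
    "2,Show HN: Example project,3\n"
)

# Unpack the first row into headers and collect the remaining rows as data
headers, *rows = csv.reader(sample_csv)

print(headers)  # ['id', 'title', 'num_comments']
print(rows)
```

Unpacking avoids indexing the list twice, but slicing as done in the cell above works just as well.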


# Filtering Data

Now that we have removed the headers from *hn*, we're ready to filter our data. Since we're only concerned with posts whose titles begin with *Ask HN* or *Show HN*, we'll create new lists of lists containing just the data for those titles. We'll use the string method *startswith* to find which titles begin with "ask hn" or "show hn", and the method *lower* to control for case.

In [35]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    # Lowercase the title so the prefix check ignores case
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))


1744
1162
17194
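To see how *lower* and *startswith* work together on individual titles, here is a small sketch. The helper name `classify_title` and the sample titles are hypothetical and not part of the original analysis.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles for illustration only
print(classify_title('Ask HN: How do you learn?'))  # ask
print(classify_title('SHOW HN: My side project'))   # show
print(classify_title('Interactive Dynamic Video'))  # other
```

Because the title is lowercased first, variations such as "ASK HN" or "ask hn" end up in the same group.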


# Analyzing Data

Next, let's determine if ask posts or show posts receive more comments on average:

In [36]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993
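The two loops above do the same work on different lists, so they could be folded into one function. This is a sketch only: `average_comments` is a hypothetical helper, and the sample rows are made up while following the dataset's column order.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts stored as strings at comments_index (hypothetical helper)."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows following the column order of the dataset
sample_posts = [
    ['1', 'Ask HN: A?', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '2', '4', 'user_b', '8/4/2016 12:10'],
]

print(average_comments(sample_posts))  # 7.0
```

Note that the function raises a `ZeroDivisionError` for an empty list, which is not an issue here because both groups contain posts.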


On average, ask posts receive about 14 comments, roughly four more than show posts (about 10). One possible explanation is that people are more inclined to respond to a question because they want to help others. Show posts that have a lot of comments also often seem to get them because someone asked a question in the comments.

Since ask posts seem to receive more comments, we'll focus our remaining analysis on these posts. Next, we'll determine if *ask posts* created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created. For that, we'll use the *datetime* module.
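Before processing every row, it may help to see how a single `created_at` value is parsed. The sketch below uses one timestamp taken from the sample rows printed earlier.

```python
import datetime as dt

# A timestamp in the same format as the created_at column
created_at = '8/4/2016 11:52'

# %m/%d/%Y matches month/day/year, %H:%M matches a 24-hour time
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

print(parsed)                 # 2016-08-04 11:52:00
print(parsed.strftime("%H"))  # 11 (zero-padded string)
print(parsed.hour)            # 11 (integer)
```

Using `strftime("%H")` gives a zero-padded string such as `'02'`, while the `hour` attribute gives an integer; either can serve as a dictionary key.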

In [37]:
import datetime as dt
result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result = [created_at, num_comments]
    result_list.append(result)

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_1_str = row[0]
    comments = row[1]
    date_1_dt = dt.datetime.strptime(date_1_str, "%m/%d/%Y %H:%M")
    hour = date_1_dt.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

ordered_counts = []


for key in counts_by_hour:
    # Put the count first so that sorting orders the tuples by count
    key_val_as_tuple = (counts_by_hour[key], key)
    ordered_counts.append(key_val_as_tuple)

table_sorted = sorted(ordered_counts, reverse=True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])

print(comments_by_hour)


15 : 116
19 : 110
21 : 109
18 : 109
16 : 108
14 : 107
17 : 100
13 : 85
20 : 80
12 : 73
22 : 71
23 : 68
01 : 60
10 : 59
11 : 58
02 : 58
00 : 55
03 : 54
08 : 48
04 : 47
05 : 46
09 : 45
06 : 44
07 : 34
{'19': 1188, '20': 1722, '23': 543, '16': 1814, '07': 267, '11': 641, '03': 421, '15': 4477, '01': 683, '18': 1439, '17': 1146, '10': 793, '04': 337, '21': 1745, '02': 1381, '06': 397, '08': 492, '14': 1416, '09': 251, '22': 479, '00': 447, '12': 687, '05': 464, '13': 1253}
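The per-hour accumulation can be written more compactly with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. This is an alternative sketch rather than the approach used in the cell above, and the sample pairs are made up.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs for illustration
sample = [('8/4/2016 11:52', 6), ('1/26/2016 11:05', 2), ('6/23/2016 22:20', 1)]

counts = defaultdict(int)    # missing hours start at 0
comments = defaultdict(int)

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {'11': 2, '22': 1}
print(dict(comments))  # {'11': 8, '22': 1}
```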


Most ask posts are created at 15:00 (116 posts), and this hour also has the highest total number of comments (4,477). Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [41]:
avg_by_hour = []

for key in comments_by_hour:
    avg_by_hour.append([key, comments_by_hour[key] / counts_by_hour[key]])
        
print(sorted(avg_by_hour, key=lambda x: x[1], reverse= True))

[['15', 38.5948275862069], ['02', 23.810344827586206], ['20', 21.525], ['16', 16.796296296296298], ['21', 16.009174311926607], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['18', 13.20183486238532], ['17', 11.46], ['01', 11.383333333333333], ['11', 11.051724137931034], ['19', 10.8], ['08', 10.25], ['05', 10.08695652173913], ['12', 9.41095890410959], ['06', 9.022727272727273], ['00', 8.127272727272727], ['23', 7.985294117647059], ['07', 7.852941176470588], ['03', 7.796296296296297], ['04', 7.170212765957447], ['22', 6.746478873239437], ['09', 5.5777777777777775]]
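The averaging step is a division of one dictionary by the other, key by key. As a sketch, here it is as a dictionary comprehension on three entries copied from the dictionaries printed above; the `_sample` variable names are only for illustration.

```python
# A few entries copied from counts_by_hour and comments_by_hour
counts_sample = {'15': 116, '02': 58, '09': 45}
comments_sample = {'15': 4477, '02': 1381, '09': 251}

# Average comments per post = total comments / number of posts, per hour
avg_sample = {
    hour: comments_sample[hour] / counts_sample[hour]
    for hour in counts_sample
}

for hour, avg in avg_sample.items():
    print(hour, round(avg, 2))
```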


Let's finish by sorting the list of lists, printing the five highest values, and then listing every hour in a format that's easier to read.

In [48]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap[:5])

for row in sorted_swap:
    # Turn the hour string (e.g. '15') into a label such as '15:00'
    hour_label = dt.datetime.strptime(row[1], "%H").strftime("%H:%M")
    avg_comments = row[0]
    print("{}:   {:.2f} average comments per post".format(hour_label, avg_comments))

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]
15:00:   38.59 average comments per post
02:00:   23.81 average comments per post
20:00:   21.52 average comments per post
16:00:   16.80 average comments per post
21:00:   16.01 average comments per post
13:00:   14.74 average comments per post
10:00:   13.44 average comments per post
14:00:   13.23 average comments per post
18:00:   13.20 average comments per post
17:00:   11.46 average comments per post
01:00:   11.38 average comments per post
11:00:   11.05 average comments per post
19:00:   10.80 average comments per post
08:00:   10.25 average comments per post
05:00:   10.09 average comments per post
12:00:   9.41 average comments per post
06:00:   9.02 average comments per post
00:00:   8.13 average comments per post
23:00:   7.99 average comments per post
07:00:   7.85 average comments per post
03:00:   7.80 average comments per post
04:00:   7.17 average comments per post
22:00:   6.75 average comments per post
09:00:   5.58 average comments per post
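Swapping the columns is one way to sort by the average. Another option, sketched below, is to pass a `key` function to `sorted` and slice the result. The sample values are rounded averages taken from the results above.

```python
# Rounded averages copied from the results above, in arbitrary order
avg_sample = [['15', 38.59], ['09', 5.58], ['02', 23.81], ['20', 21.52]]

# Sort by the second element (the average), highest first, and keep the top three
top = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)[:3]

for hour, avg in top:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

This keeps each pair in its original `[hour, average]` order, so no second list is needed.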

Based on this analysis, the best time to create an ask post is 15:00 in the US Eastern time zone (EST), which corresponds to between 9 pm and 10 pm in Switzerland.
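The conversion to Swiss time can be checked with the *datetime* module. This sketch assumes fixed offsets (EST = UTC-5, CET = UTC+1) and therefore ignores daylight saving time; the date is arbitrary and only the time of day matters.

```python
import datetime as dt

# Assumption: fixed offsets, EST = UTC-5 and CET = UTC+1 (no daylight saving time)
est = dt.timezone(dt.timedelta(hours=-5), 'EST')
cet = dt.timezone(dt.timedelta(hours=1), 'CET')

# Arbitrary winter date; 15:00 is the hour found in the analysis
post_time = dt.datetime(2016, 1, 26, 15, 0, tzinfo=est)
swiss_time = post_time.astimezone(cet)

print(swiss_time.strftime("%H:%M"))  # 21:00
```

For the few weeks each year when the US and Europe switch clocks on different dates, the offset differs by an hour, so a time zone database would be needed for exact results.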