In [1]:
from csv import reader
import datetime as dt

# Exploring Hacker News Posts

Hacker News, a social news website, allows users to submit stories or posts. Users can vote and comment on posts in a style similar to Reddit. The more votes a post receives, the more likely it is to be visible to other users.

The purpose of this project is to compare two types of posts on Hacker News: _Ask HN_ and _Show HN_. The former is self-explanatory: users post questions seeking guidance or information. For example, which companies are hiring, and where? The latter is intended for posts that show off projects or other interesting phenomena.

More importantly, we will use this comparison to answer two questions:

- Do _Ask HN_ or _Show HN_ posts receive more comments on average?

- Do posts created at a certain time receive more comments on average?

We'll analyze only a sample of roughly 20,000 rows from the 300,000 rows of data available. The reduced data set consists of posts that received comments. To learn more about how the data was gathered and what the columns contain, click [here](http://www.kaggle.com/hacker-news/hacker-news-posts).

# Part One: Reading in the Data

Let us begin by importing the data. We assign the data to the variable `hn`.

In [2]:
#read in the data

#the with block closes the file once it has been read
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


The first entry contains the column headers, so we will remove it and keep only the data rows. We then verify that the header row was removed correctly.

In [3]:
#do not run twice
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
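
The header split above can also be done while reading the file, which sidesteps the "do not run twice" hazard. The following is a minimal sketch rather than the approach used in this project; `sample_csv`, `sample_headers`, and `sample_hn` are made-up names, and the in-memory sample stands in for `hacker_news.csv`.

```python
from csv import reader
import io

# Hypothetical one-row sample standing in for hacker_news.csv
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

rows = reader(sample_csv)
sample_headers = next(rows)  # consume the first row as the header
sample_hn = list(rows)       # everything that remains is data

print(sample_headers)
print(sample_hn)
```

Because `next` consumes the header before the remaining rows are collected, re-running the cell always produces the same result.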


# Part Two: Filtering the Data

Since we are only comparing _Ask HN_ and _Show HN_ posts, we can separate them into their own lists. This simplifies the analysis, because these two categories make up only a small share of the data, as we see below.


In [11]:
#initialize empty lists
ask_posts = []
show_posts = []
other_posts = []

#filters and separates posts based on category
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Total number of ask posts:", len(ask_posts))
print("Total number of show posts:", len(show_posts))
print("Total number of other posts:", len(other_posts))

Total number of ask posts: 1744
Total number of show posts: 1162
Total number of other posts: 17194
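
The prefix test in the loop above can be isolated in a small helper, which makes the case-insensitive matching easy to check on its own. This is only an illustrative sketch; `classify_title` and the sample titles are hypothetical and not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # lowercase once so the match ignores case
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles for illustration only
sample_titles = [
    'Ask HN: Who is hiring?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```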


# Part Three: Determining the Average Number of Comments

From above, we notice that the number of ask posts totals 1,744, while there are 1,162 show posts. Now, let's attempt to answer the first question we posed.

- Do _Ask HN_ or _Show HN_ posts receive more comments on average?

In [5]:
#function calculates the avg number of comments for posts
def avg_cal(a_list, index):
    total_comments = 0
    for post in a_list:
        total_comments += int(post[index])
    avg_comments = total_comments/len(a_list)
    return avg_comments

In [6]:
avg_ask_comments = avg_cal(ask_posts,4)
avg_show_comments = avg_cal(show_posts,4)

print('The average number of comments on ask posts is: %1.2f' %avg_ask_comments)
print('The average number of comments on show posts is: %1.2f' %avg_show_comments)

The average number of comments on ask posts is: 14.04
The average number of comments on show posts is: 10.32


From the output above, we see that ask posts average about 14 comments per post, while show posts average roughly 10 comments per post, about four fewer (14.04 versus 10.32).
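
As a cross-check on a hand-written averaging function, the standard library's `statistics.mean` computes the same quantity. The sketch below assumes rows with the same column layout as `hn` (index 4 holds `num_comments` as a string); the three rows in `sample_posts` are invented for illustration and are not taken from the data set.

```python
from statistics import mean

# Made-up rows in the same column layout as hn (index 4 is num_comments)
sample_posts = [
    ['1', 'Ask HN: A', '', '5', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: B', '', '2', '29', 'user_b', '11/22/2015 13:43'],
    ['3', 'Ask HN: C', '', '1', '1', 'user_c', '5/2/2016 10:14'],
]

# Convert each comment count to int before averaging: (6 + 29 + 1) / 3
sample_avg = mean(int(post[4]) for post in sample_posts)
print(sample_avg)
```

Note that `mean` raises `StatisticsError` on an empty input, whereas dividing by `len(a_list)` raises `ZeroDivisionError`; either way an empty category needs to be handled before averaging.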

# Part Four: Number of Comments by Time Created

In Part Three, we determined that ask posts receive more comments on average, so we will focus only on these posts. In this section, we will attempt to answer the second question we posed at the beginning:


- Do posts created at a certain time receive more comments on average?

In [15]:
result_list = []
counts_by_hour = {} #initialize two empty dictionaries 
comments_by_hour = {}

#iterates over each post in ask posts
for post in ask_posts:
    created_at = post[6] #stores time post was created
    num_comments = int(post[4]) #stores number of comments of post
    result_list.append([created_at, num_comments])
    #appends information as a two-item list to result_list

#constructs two frequency tables 
for result in result_list:
    time = result[0]
    num_comments = result[1]
    hr = dt.datetime.strptime(time, "%m/%d/%Y %H:%M").strftime("%H")
    #stores hr as a zero-padded string such as '09'
    
    if hr in counts_by_hour:
        counts_by_hour[hr] += 1
        comments_by_hour[hr] += num_comments
    else:
        counts_by_hour[hr] = 1 #first post seen for this hour
        comments_by_hour[hr] = num_comments
        

print(comments_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


We have created two frequency tables and assigned them to `counts_by_hour` and `comments_by_hour`. The hours are displayed in 24-hour time as zero-padded strings (for example, `'09'` for 9:00 a.m.).

- `comments_by_hour`: contains the total number of comments received by ask posts created during each hour of the day

- `counts_by_hour`: contains the total number of ask posts created during each hour of the day
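
The two tables can also be built without the `if`/`else` branch by using `collections.defaultdict`, which starts every missing key at zero. This is a sketch of an alternative, not the method used in the cell above; the `sample_*` names and the three `[created_at, num_comments]` pairs are made up for illustration.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs for illustration
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 13:14', 1],
]

sample_counts = defaultdict(int)    # posts created per hour
sample_comments = defaultdict(int)  # comments received per hour

for created_at, num_comments in sample_results:
    # Parse the timestamp, then keep only the zero-padded hour string
    hr = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hr] += 1
    sample_comments[hr] += num_comments

print(dict(sample_counts))
print(dict(sample_comments))
```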

The next step is to use these two dictionaries to calculate the average number of comments for posts created during each hour of the day. To do this, we will create a list of lists, each containing an hour and the average number of comments received by posts created during that hour.

In [8]:
avg_by_hour = [] #initialize an empty list

#iterate each hr in counts_by_hour
for hr in counts_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr]/counts_by_hour[hr]])
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

Above, we calculated the average number of comments for posts created during each hour of the day and stored the results in `avg_by_hour`. In this format, however, the best hour to create a post is hard to spot because the values are not sorted. After all, we want our post to receive some shine. We'll remedy this by sorting the results in the next cell.

In [24]:
swap_avg_by_hour = [] #initialize empty list

#iterate over each row in data displayed above
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    #swap so the average comes first and sorted() orders by it
#print(swap_avg_by_hour)


sorted_swap = sorted(swap_avg_by_hour,reverse = True)
#arrange data in descending order of avg number of comments

#formats the data to be easily read
print('Top 5 Hours for Ask Posts Comments')
for avg, hr in sorted_swap[:5]:
    template = "{}: {:.2f} average comments per post"
    time = dt.datetime.strptime(hr, "%H").strftime("%H:%M")
    print(template.format(time,avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Based on the output above, ask posts created during the 15:00 hour (3:00 p.m. EST) receive the most comments, averaging 38.59 comments per post.
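
The swap step is only needed because `sorted` compares lists element by element, starting with the first item. Passing a `key` function sorts on the average directly and leaves each `[hour, average]` pair in its original order. A minimal sketch, using four pairs copied from the `avg_by_hour` output above (`sample_avg_by_hour` and `top` are illustrative names):

```python
# A few [hour, average] pairs copied from the avg_by_hour output above
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the second item (the average), largest first
top = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

for hr, avg in top[:3]:
    print("{}:00: {:.2f} average comments per post".format(hr, avg))
```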

# Conclusion

In this project, we analyzed ask posts and show posts to determine which type of post and which creation time receive the most comments on average. Our analysis leads us to conclude that the recommended time to post is between 15:00 and 16:00 (3:00 p.m. to 4:00 p.m. EST), and that the post should be an ask post.


It should be noted that the data set we analyzed excluded posts without any comments. Given that, it's more accurate to say that, of the posts that received comments, ask posts received more comments on average, and ask posts created between 15:00 and 16:00 (3:00 p.m. to 4:00 p.m. EST) received the most comments on average.
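
Readers in other time zones may want to translate the recommended hour. The sketch below assumes the timestamps sit at a fixed UTC-5 offset, as the "EST" label suggests; if the data actually follows US Eastern time with daylight saving, the offset would be UTC-4 for part of the year. The date is arbitrary, and `EST`, `best_hour_est`, and `best_hour_utc` are illustrative names.

```python
import datetime as dt

# Assumption: timestamps are at a fixed UTC-5 offset (EST, no daylight saving)
EST = dt.timezone(dt.timedelta(hours=-5), "EST")

# The date is arbitrary; only the hour matters for this illustration
best_hour_est = dt.datetime(2016, 1, 26, 15, 0, tzinfo=EST)
best_hour_utc = best_hour_est.astimezone(dt.timezone.utc)

print(best_hour_est.strftime("%H:%M %Z"))
print(best_hour_utc.strftime("%H:%M %Z"))
```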