# Mission Statement

Hacker News is a site where users submit stories that other users vote and comment on, similar to Reddit. It is popular in technology and start-up circles, and posts that reach the top of its listings can receive hundreds of thousands of visitors.

We will explore a data set of Hacker News posts. The data set has been reduced by removing all submissions that received no comments.

Specifically, we will explore posts whose titles begin with "Ask HN" or "Show HN".

Ask HN posts ask the Hacker News community a specific question.
Show HN posts are intended to show the Hacker News community a project, product, or something interesting the user wants to share.

We will compare these two types of posts to answer the following:

1. Do "Ask HN" or "Show HN" posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

# Access the Data

Our data set is stored in the CSV file "hacker_news.csv". We will read it into a list of lists. The first row contains the column headers, so we will store it separately and remove it from the data.

In [1]:
from csv import reader

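# Open the file in a context manager so it is closed once the rows are read
with open("hacker_news.csv", encoding='utf-8') as opened_file:
    hn = list(reader(opened_file))

# Separate the header row from the data rows
headers = hn[0]
hn = hn[1:]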

display(headers)
display(hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

# Filtering Data: Ask, Show, and Others

Now that we have read the data and removed the headers, we are ready to filter it. We are only concerned with posts whose titles begin with "Ask HN" or "Show HN", so we will create new lists of lists containing just the rows for those titles.

To find posts that begin with either "Ask HN" or "Show HN", we'll use the string method *startswith*. Because *startswith* is case-sensitive, we will first convert each title to lowercase with the *lower* method so that variations in capitalization still match.

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Number of Ask Posts: ", len(ask_posts))
print("Number of Show Posts: ", len(show_posts))
print("Number of Other Posts: ", len(other_posts))

Number of Ask Posts:  1744
Number of Show Posts:  1162
Number of Other Posts:  17194
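
The prefix check above can also be wrapped in a small helper, which makes the case-insensitive matching easy to try on individual titles. This is only a sketch: `classify_title` is a hypothetical name that is not part of the notebook, and the first two sample titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # startswith is case-sensitive, so normalize first
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# The first two titles are hypothetical; the third appears in the sample rows above.
sample_titles = [
    "Ask HN: How do you learn?",
    "SHOW HN: My side project",
    "Interactive Dynamic Video",
]
labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```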


# Calculating the Average Number of Comments for Ask HN and Show HN Posts

After separating the Ask and Show posts, we will calculate the average number of comments each type of post receives.

## Ask Posts

In [3]:
total_ask_comments = 0

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments/len(ask_posts)

print("Average Number of Comments on Ask HN Posts: ",
      round(avg_ask_comments,2), "comments/post")
    

Average Number of Comments on Ask HN Posts:  14.04 comments/post


## Show Posts

In [4]:
total_show_comments = 0

for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments/len(show_posts)

print("Average Number of Comments on Show HN Posts: ",
     round(avg_show_comments,2), "comments/post")

Average Number of Comments on Show HN Posts:  10.32 comments/post


Our analysis shows that on average, *"Ask HN"* posts receive 14.04 comments/post, while *"Show HN"* posts receive 10.32 comments/post.

*"Ask HN"* posts receive more comments on average.

Since *"Ask HN"* posts receive more comments, we will focus our remaining analysis just on these posts.
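
The two averaging cells above differ only in the list they loop over, so the calculation could be factored into one function. The sketch below assumes the column layout shown in `headers`; `average_column` is a hypothetical helper name, and returning 0.0 for an empty list is a design choice made here to avoid a division by zero.

```python
def average_column(rows, index):
    """Average the integer values stored as strings in column `index`."""
    if not rows:
        return 0.0  # assumption: treat an empty list as an average of zero
    total = sum(int(row[index]) for row in rows)
    return total / len(rows)


# The first three rows from the sample output at the top of the notebook.
sample_rows = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time',
     'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
     '39', '10', 'josep2', '1/26/2016 19:30'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke",
     'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
     '2', '1', 'vezycash', '6/23/2016 22:20'],
]

# Column 4 is num_comments: (52 + 10 + 1) / 3
print(average_column(sample_rows, 4))  # 21.0
```

With the notebook's lists, the same call would be `average_column(ask_posts, 4)` for comments or `average_column(show_posts, 3)` for points.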

# Amount of Ask Posts and Comments by Hour Created

Next, we'll determine whether posting at a certain time of day maximizes the number of comments an Ask post receives. First, we will calculate the number of Ask posts created in each hour of the day, along with the number of comments those posts received. Then we will calculate the average number of comments Ask posts receive by hour created.

In [5]:
import datetime as dt

result_list = []

for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    time_comment = [created_at,num_comments]
    result_list.append(time_comment)
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    created_at_str = row[0]
    comments = row[1]
    created_at_dt = dt.datetime.strptime(created_at_str,"%m/%d/%Y %H:%M")
    created_at_hour = created_at_dt.strftime("%H")
    if created_at_hour not in counts_by_hour:
        counts_by_hour[created_at_hour] = 1
        comments_by_hour[created_at_hour] = comments
    else:
        counts_by_hour[created_at_hour] += 1
        comments_by_hour[created_at_hour] += comments
    
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour,comments_by_hour[hour]/counts_by_hour[hour]])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
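
The counting and summing steps above can be condensed with `collections.defaultdict`, which removes the need to check whether an hour has been seen before. This is a sketch rather than the notebook's method: `average_by_hour` is a hypothetical function name, and the sample rows are invented for illustration (only the `num_comments` and `created_at` columns matter here).

```python
import datetime as dt
from collections import defaultdict


def average_by_hour(rows, value_index, date_index=6):
    """Map each hour ('00'..'23') to the average of column `value_index`."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        created = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


# Hypothetical rows that follow the column order in `headers`.
sample_posts = [
    ['1', 'Ask HN: a', '', '3', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: b', '', '5', '20', 'user_b', '8/5/2016 11:05'],
    ['3', 'Ask HN: c', '', '1', '4', 'user_c', '8/5/2016 0:01'],
]

hourly = average_by_hour(sample_posts, value_index=4)
print(hourly)  # {'11': 15.0, '00': 4.0}
```

Passing `value_index=3` instead would average points rather than comments.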

# Sorting the Results

Now that we have calculated the average number of comments per post by hour, we will sort the results in descending order to find the top 5 best hours to write Ask HN posts.

In [6]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [7]:
sorted_swap = sorted(swap_avg_by_hour,reverse = True)

print("Top 5 Hours for Ask Posts Comments")
for avg,hour in sorted_swap[:5]:
    print("{hr}: {average:.2f} average comments per post".format(hr = dt.datetime.strptime(hour, '%H').strftime("%H:%M"),average = avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
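
Swapping the columns is one way to sort by average. An alternative sketch is to pass a `key` function to `sorted`, which sorts the original `[hour, average]` pairs directly. The pairs below are copied from the per-hour averages computed earlier; `avg_by_hour_sample` is a name introduced only for this example.

```python
# A few [hour, average] pairs copied from the per-hour averages above.
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
]

# Sort by the second element (the average), largest first.
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

One difference to keep in mind: when two hours have the same average, the swapped-list approach breaks the tie by hour, while the `key` approach keeps tied pairs in their original order.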


According to the data set documentation, the timezone used is EST (Eastern Standard Time). 
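
Readers in other time zones may want to convert these hours. The sketch below converts one `created_at` value from the sample rows to UTC, assuming a fixed UTC-5 offset for EST. That offset is an assumption made for illustration: it ignores daylight saving time, which the data set documentation cited above does not address.

```python
import datetime as dt

# Assumption: EST is a fixed UTC-5 offset; daylight saving time is ignored.
EST = dt.timezone(dt.timedelta(hours=-5), "EST")

# A created_at value taken from the sample rows at the top of the notebook.
created_at = dt.datetime.strptime("1/26/2016 19:30", "%m/%d/%Y %H:%M")

# Attach the assumed time zone, then convert to UTC.
created_at_est = created_at.replace(tzinfo=EST)
created_at_utc = created_at_est.astimezone(dt.timezone.utc)

print(created_at_utc.strftime("%m/%d/%Y %H:%M %Z"))  # 01/27/2016 00:30 UTC
```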

The hour with the most comments per post on average is 3:00 PM EST, at 38.59 comments per post. That is roughly 60% more than the second-best hour, 2:00 AM EST, at 23.81 comments per post.
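
The relative gap between the two best hours can be checked with a short calculation, using the unrounded averages from the results above. The variable names are introduced only for this example.

```python
best = 38.5948275862069       # 15:00 average, from the results above
second = 23.810344827586206   # 02:00 average, from the results above

# Relative increase of the best hour over the second-best hour, in percent.
increase = (best - second) / second * 100
print(f"{increase:.1f}% more comments per post")
```

The result is about 62%, which is consistent with the rough 60% figure quoted in the text.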

In [8]:
print("Bottom 5 Hours for Ask Posts Comments")
for avg,hour in sorted_swap[-5:]:
    print("{hr}: {average:.2f} average comments per post".format(hr = dt.datetime.strptime(hour, '%H').strftime("%H:%M"),average = avg))

Bottom 5 Hours for Ask Posts Comments
07:00: 7.85 average comments per post
03:00: 7.80 average comments per post
04:00: 7.17 average comments per post
22:00: 6.75 average comments per post
09:00: 5.58 average comments per post


The worst time to create an "Ask HN" post is 9:00 AM EST, when posts receive only 5.58 comments on average.

# Calculating Average Number of Points per Show or Ask Post

We previously calculated the average number of comments that Ask and Show posts receive. Now we will explore the average number of points these posts receive. Points are the difference between upvotes and downvotes. The higher this value, the more positive the sentiment toward the post, which may also serve as a proxy for traffic to the post.

We will use the lists ask_posts and show_posts for this analysis as well.

## Ask Posts

In [9]:
total_ask_points = 0

for post in ask_posts:
    num_points = int(post[3])
    total_ask_points += num_points

avg_ask_points = total_ask_points/len(ask_posts)

print("Average Number of Points on Ask HN Posts: ",
      round(avg_ask_points,2), "points/post")

Average Number of Points on Ask HN Posts:  15.06 points/post


## Show Posts

In [10]:
total_show_points = 0

for post in show_posts:
    num_points = int(post[3])
    total_show_points += num_points

avg_show_points = total_show_points/len(show_posts)

print("Average Number of Points on Show HN Posts: ",
      round(avg_show_points,2), "points/post")

Average Number of Points on Show HN Posts:  27.56 points/post


Show posts on average receive 27.56 points/post, while Ask posts receive an average of 15.06 points.

In contrast to the comment analysis, the points analysis shows that *"Show HN"* posts receive more points on average than *"Ask HN"* posts.

The number of points highlights a different quality of a post than the number of comments does. The number of comments shows how many users may be actively engaging with the post. The number of points, on the other hand, shows the overall sentiment towards the post. A higher average number of points for Show posts means that people responded more positively to these posts than to Ask posts. However, points may not reflect the traffic a post received. If a post was polarizing (many people felt positively about it and many felt negatively), it may have a points value close to zero even if it was highly trafficked.

We will continue our analysis using *"Show HN"* posts because this type of post received more points/post on average.

# Amount of Show Posts by Hour Created and Number of Points

Next, we'll determine whether posting at a certain time of day maximizes the number of points a Show post receives. First, we will calculate the number of Show posts created in each hour of the day, along with the number of points those posts received. Then we will calculate the average number of points Show posts receive by hour created.

In [11]:
points_result_list = []

for post in show_posts:
    created_at = post[6]
    num_points = int(post[3])
    time_point = [created_at,num_points]
    points_result_list.append(time_point)
    
counts_by_hour = {}
points_by_hour = {}

for row in points_result_list:
    created_at_str = row[0]
    points = row[1]
    created_at_dt = dt.datetime.strptime(created_at_str,"%m/%d/%Y %H:%M")
    created_at_hour = created_at_dt.strftime("%H")
    if created_at_hour not in counts_by_hour:
        counts_by_hour[created_at_hour] = 1
        points_by_hour[created_at_hour] = points
    else:
        counts_by_hour[created_at_hour] += 1
        points_by_hour[created_at_hour] += points
    
points_avg_by_hour = []

for hour in points_by_hour:
    points_avg_by_hour.append([hour,points_by_hour[hour]/counts_by_hour[hour]])
    
points_swap_avg_by_hour = []

for row in points_avg_by_hour:
    points_swap_avg_by_hour.append([row[1],row[0]])
    
points_sorted_swap = sorted(points_swap_avg_by_hour,reverse = True)

print("Top 5 Hours for Show Posts Points")
for avg,hour in points_sorted_swap[:5]:
    print("{hr}: {average:.2f} average points per post".format(hr = dt.datetime.strptime(hour, '%H').strftime("%H:%M"),average = avg))

Top 5 Hours for Show Posts Points
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post


The hour with the highest average points per post is 11:00 PM EST - 12:00 AM EST, at 42.39 points/post. Two other hours come close: posts created at 12:00 PM EST - 1:00 PM EST receive 41.69 points/post on average, and posts created at 10:00 PM EST - 11:00 PM EST receive 40.35 points/post on average.

In [12]:
print("Bottom 5 Hours for Show Posts Points")
for avg,hour in points_sorted_swap[-5:]:
    print("{hr}: {average:.2f} average points per post".format(hr = dt.datetime.strptime(hour, '%H').strftime("%H:%M"),average = avg))

Bottom 5 Hours for Show Posts Points
21:00: 18.43 average points per post
08:00: 15.26 average points per post
04:00: 14.85 average points per post
02:00: 11.33 average points per post
05:00: 5.47 average points per post


The worst time to create a post if you are looking to maximize your points is 5:00 AM EST - 6:00 AM EST. Posts created in this hour receive only 5.47 points on average.

# Comparison of Number of Comments and Number of Points by Hour

We will take a look at the average number of comments and number of points by hour for Show HN posts.

In [13]:
result_list = []

for post in show_posts:
    created_at = post[6]
    num_comments = int(post[4])
    time_comment = [created_at,num_comments]
    result_list.append(time_comment)
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    created_at_str = row[0]
    comments = row[1]
    created_at_dt = dt.datetime.strptime(created_at_str,"%m/%d/%Y %H:%M")
    created_at_hour = created_at_dt.strftime("%H")
    if created_at_hour not in counts_by_hour:
        counts_by_hour[created_at_hour] = 1
        comments_by_hour[created_at_hour] = comments
    else:
        counts_by_hour[created_at_hour] += 1
        comments_by_hour[created_at_hour] += comments
    
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour,comments_by_hour[hour]/counts_by_hour[hour]])
    
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
sorted_swap = sorted(swap_avg_by_hour,reverse = True)

### Show HN Number of Comments per Hour

In [14]:
print("Top 5 Hours for Show Posts Comments")
for avg,hour in sorted_swap[:5]:
    print("{hr}: {average:.2f} average comments per post".format(hr = dt.datetime.strptime(hour, '%H').strftime("%H:%M"),average = avg))

Top 5 Hours for Show Posts Comments
18:00: 15.77 average comments per post
00:00: 15.71 average comments per post
14:00: 13.44 average comments per post
23:00: 12.42 average comments per post
22:00: 12.39 average comments per post


### Show HN Number of Points per Hour

In [15]:
print("Top 5 Hours for Show Posts Points")
for avg,hour in points_sorted_swap[:5]:
    print("{hr}: {average:.2f} average points per post".format(hr = dt.datetime.strptime(hour, '%H').strftime("%H:%M"),average = avg))

Top 5 Hours for Show Posts Points
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post


Examining the top 5 hours for comments and points, we find that four hours appear in both tables: 6:00 PM EST - 7:00 PM EST, and the three consecutive hours from 10:00 PM EST to 1:00 AM EST.
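
The overlap between the two tables can be confirmed with a set intersection. The hours below are copied from the two top 5 outputs above; the variable names are introduced only for this example.

```python
# Hours copied from the "Top 5 Hours for Show Posts Comments" output.
top_comment_hours = {"18", "00", "14", "23", "22"}
# Hours copied from the "Top 5 Hours for Show Posts Points" output.
top_point_hours = {"23", "12", "22", "00", "18"}

# Hours that appear in both top 5 tables.
shared_hours = sorted(top_comment_hours & top_point_hours)
print(shared_hours)  # ['00', '18', '22', '23']
```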

If you care more about points, create your Show HN post between 11:00 PM EST and 12:00 AM EST. If you care more about comments, create your Show HN post between 6:00 PM EST and 7:00 PM EST.

In general, a Show HN post created during one of the hours listed above has a better chance of receiving attention, regardless of which metric you use to measure its performance (comments or points).

# Conclusion

In this project, we examined the Hacker News data set to determine which type of post, and which posting time, receives the most comments per post on average.

Based on our analysis, *"Ask HN"* posts receive more comments on average than *"Show HN"* posts on Hacker News. *"Ask HN"* posts were then analyzed to determine the best hour to create a post to generate the most comments per post on average. Based on our analysis, the best time to create an *"Ask HN"* post is between 3:00 PM EST and 4:00 PM EST.

Additionally, we examined the data set to determine which type of post and at what time gets the highest number of points per post on average. Number of points shows the overall sentiment towards the post.

*"Show HN"* posts received a higher number of points on average compared to *"Ask HN"* posts, indicating these posts receive more positive sentiment. *"Show HN"* posts were then analyzed to determine the best time to create a post to receive the highest number of points per post on average. The hour that receives the most average points per post is at 11:00 PM EST-12:00 AM EST at 42.39 points/post. 

When comparing the best hours to receive the highest number of comments to the best hours to receive the highest number of points for *"Show HN"* posts, we found that for both metrics the same 4 out of 5 hours were found in the Top 5 hours.

