# EXPLORING HACKER NEWS POSTS

In this project, we'll work with a dataset of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

Some posts attract far more views and comments than others. In this study, we'll explore two factors that may affect the number of comments a post receives.

*Post title*: when creating a post, users can optionally start the title with `Ask HN` or `Show HN` to explicitly ask the Hacker News community a question or show it something. We'll analyze which of these two tags receives more comments on average.

*Post timing*: we'll also explore whether posts published at certain hours of the day receive more comments on average.

## About The Dataset

You can find the original dataset [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). Note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions. You can download the reduced dataset [here](https://app.dataquest.io/m/356/guided-project%3A-exploring-hacker-news-posts/1/introduction).
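
The exact procedure used to reduce the dataset is not documented here. As a rough sketch of the two steps described above (filter, then sample), the following uses hypothetical two-column rows, an arbitrary seed, and an arbitrary sample size; none of these come from the original data preparation.

```python
import random

# Hypothetical rows of [id, num_comments]; the real dataset has seven columns.
all_rows = [[str(i), str(i % 4)] for i in range(1, 21)]

# Step 1: keep only submissions that received at least one comment.
commented = [row for row in all_rows if int(row[1]) > 0]

# Step 2: draw a random sample without replacement.
# The seed and the sample size are arbitrary choices for this sketch.
rng = random.Random(0)
sampled = rng.sample(commented, k=5)

print(len(all_rows), len(commented), len(sampled))  # 20 15 5
```

Because zero-comment submissions are removed first, every average computed later in this project describes only posts that received at least one comment.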

Let's start by reading the dataset and displaying the header row and the first three data rows.

In [1]:
from csv import reader 
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)

for row in hn[:4]:
    print(row, '\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 
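
The cell above leaves the file open after reading it. A minimal sketch of the same read using a `with` block, which closes the file automatically, might look like this. It uses an in-memory sample in place of `hacker_news.csv` so that it runs on its own; the helper name `load_rows` and the `utf-8` encoding in the final comment are assumptions, not part of the original notebook.

```python
import csv
import io

def load_rows(file_obj):
    """Return every CSV row from an open text file as a list of lists."""
    return list(csv.reader(file_obj))

# In-memory stand-in for hacker_news.csv: the header plus one data row.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

with sample_csv as f:  # the file object is closed when the block ends
    rows = load_rows(f)

for row in rows:
    print(row, '\n')

# With the real file, the same pattern would be (encoding is an assumption):
# with open('hacker_news.csv', newline='', encoding='utf-8') as f:
#     hn = load_rows(f)
```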



## Removing Headers From A List Of Lists

The first list above contains the column headers, and each list after it contains the data for one row. To analyze the data, we first need to remove the row containing the column headers. Let's do that next.

In [2]:
headers = hn[0]
print('Number of records before removing the headers: ', len(hn))
hn = hn[1:]
print('Number of records after removing the headers: ', len(hn))

Number of records before removing the headers:  20101
Number of records after removing the headers:  20100
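
As an aside, the header row can also be split off with sequence unpacking and turned into a name-to-index lookup, which avoids hard-coded positions such as `row[4]`. This is only a sketch on two sample rows from the dataset; the names `sample`, `data`, and `col` are illustrative and are not used in the rest of the notebook.

```python
# Same shape as `hn` before the header is removed: header first, then data rows.
sample = [
    ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte',
     '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise',
     'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
     '3', '1', 'hswarna', '6/17/2016 0:01'],
]

# Unpack the first row into `headers` and everything else into `data`.
headers, *data = sample

# Map each column name to its position in a row.
col = {name: index for index, name in enumerate(headers)}

print(len(sample), len(data))         # 3 2
print(col['num_comments'])            # 4
print(data[0][col['num_comments']])   # 52
```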


Now, let's display the first five rows to verify that the header row was removed properly.

In [3]:
for row in hn[:5]:
    print(row, '\n')

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'] 



## Extracting Ask HN and Show HN Posts

Now that we've removed the headers from `hn`, we're ready to filter our data. Since we're only concerned with post titles beginning with `Ask HN` or `Show HN`, we'll create new lists of lists containing just the data for those titles.

To find the posts that begin with either `Ask HN` or `Show HN`, we'll lowercase each title and then use the string method `startswith`, so that the match is case-insensitive.

Let's split the data into three new lists:
* `ask_posts` (titles that start with `ask hn`, in any letter case)
* `show_posts` (titles that start with `show hn`, in any letter case)
* `other_posts` (the remainder)

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
# Check the number of posts in the three lists

print("Number of posts in the hn list is ",len(hn))
print("Number of posts in the ask_posts list is ",len(ask_posts))
print("Number of posts in the show_posts list is ",len(show_posts))
print("Number of posts in the other_posts list is ",len(other_posts))

# Check whether the sum of the new three lists is equal to the original list
print("The sum of the new three lists is ",len(ask_posts)+len(show_posts)+len(other_posts))

Number of posts in the hn list is  20100
Number of posts in the ask_posts list is  1744
Number of posts in the show_posts list is  1162
Number of posts in the other_posts list is  17194
The sum of the new three lists is  20100
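
To make the matching rule concrete, here is a small sketch of the same test wrapped in a function and applied to a few titles. The function name `classify` is an assumption for illustration, and the last two titles are hypothetical; they show that letter case does not matter but the position of the tag does.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
    'ask hn: lowercase still matches',   # hypothetical title
    'Why I Ask HN before launching',     # hypothetical: tag is not at the start
]

for title in titles:
    print(classify(title), '|', title)
```

Only the first and fourth titles count as ask posts: `startswith` checks the beginning of the string, so a tag that appears later in the title is ignored.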


## Calculating The Average Number Of Comments For Ask HN And Show HN Posts

We separated the "ask posts" and the "show posts" into two lists of lists named ask_posts and show_posts. Below are the first five rows in the ask_posts list of lists:

In [5]:
for row in ask_posts[:5]:
    print(row, '\n')

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'] 

['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'] 

['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'] 

['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'] 

['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38'] 



Here are the first five rows in the show_posts list of lists:

In [6]:
for row in show_posts[:5]:
    print(row, '\n')

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'] 

['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'] 

['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'] 

['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'] 

['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45'] 



Now, let's determine if ask posts or show posts receive more comments on average.

In [7]:
total_ask_comments = 0

for row in ask_posts:
    post = int(row[4])
    total_ask_comments += post
avg_ask_comments = total_ask_comments/len(ask_posts)
print('The average number of comments for "ask posts" is {:.2f} '.format(avg_ask_comments))

total_show_comments = 0
for row in show_posts:
    post = int(row[4])
    total_show_comments += post
avg_show_comments = total_show_comments/len(show_posts)
print('The average number of comments for "show posts" is {:.2f} '.format(avg_show_comments))

The average number of comments for "ask posts" is 14.04 
The average number of comments for "show posts" is 10.32 
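
The two loops above differ only in the list they read. One way to remove the duplication is a small helper, sketched below. The name `average_comments` and its `comments_index` parameter are assumptions for illustration, and the sample is just the five ask posts printed earlier, so the result differs from the full-list average.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts in a non-empty list of rows."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# The first five ask posts shown earlier (comment counts: 6, 29, 1, 3, 17).
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1',
     'polskibus', '5/2/2016 10:14'],
    ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3',
     'sph130', '8/2/2016 14:20'],
    ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?',
     '', '28', '17', 'roykolak', '10/15/2015 16:38'],
]

print('{:.2f}'.format(average_comments(sample_ask)))  # 11.20
```

With the full lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops.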


It appears that 'ask' posts receive more comments on average than 'show' posts.

Since ask posts receive more comments on average, we'll focus our remaining analysis on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

We'll start with the first step: calculating the number of ask posts and comments by hour created.

## Finding The Number Of Ask Posts And Comments By Hour Created

In [11]:
import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6]
    comments_num = int(row[4])
    result_list.append([created_at, comments_num])
    
# Build frequency tables for the number of posts and for the number of comments, per hour of the day
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    created_at = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    hour = created_at.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
        
print(counts_by_hour)
print('\n')
print(comments_by_hour)

{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}


{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
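
To make the parsing and counting steps concrete, the sketch below runs them on a handful of rows. It uses `collections.defaultdict` in place of the `if hour not in ...` check, which is a swapped-in alternative rather than what the cell above does. The first five pairs come from the ask posts printed earlier; the last pair is hypothetical and is added so that one hour has two posts.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs.
sample = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
    ['8/17/2016 9:10', 4],   # hypothetical row: a second post in hour 9
]

counts = defaultdict(int)    # missing hours start at 0
comments = defaultdict(int)

for created_at, num_comments in sample:
    # '%m/%d/%Y %H:%M' matches strings such as '8/16/2016 9:55'.
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {9: 2, 13: 1, 10: 1, 14: 1, 16: 1}
print(dict(comments))  # {9: 10, 13: 29, 10: 1, 14: 3, 16: 17}
```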


Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

## Calculating The Average Number Of Comments For Ask HN Posts By Hour

In [14]:
avg_by_hour = []
for hour in counts_by_hour:
    num_posts = counts_by_hour[hour]
    num_comments = comments_by_hour[hour]
    avg = round((num_comments/num_posts),2)
    avg_by_hour.append([hour, avg])
    
# Sort the list (on its first element, being the hour of day)
avg_by_hour.sort()    
    
# Print the result
output = "For hour {:02d} the average number of comments per post is {:.2f}"
for row in avg_by_hour:
    print (output.format(row[0], row[1])) 

For hour 00 the average number of comments per post is 8.13
For hour 01 the average number of comments per post is 11.38
For hour 02 the average number of comments per post is 23.81
For hour 03 the average number of comments per post is 7.80
For hour 04 the average number of comments per post is 7.17
For hour 05 the average number of comments per post is 10.09
For hour 06 the average number of comments per post is 9.02
For hour 07 the average number of comments per post is 7.85
For hour 08 the average number of comments per post is 10.25
For hour 09 the average number of comments per post is 5.58
For hour 10 the average number of comments per post is 13.44
For hour 11 the average number of comments per post is 11.05
For hour 12 the average number of comments per post is 9.41
For hour 13 the average number of comments per post is 14.74
For hour 14 the average number of comments per post is 13.23
For hour 15 the average number of comments per post is 38.59
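For hour 16 the average number of comments per post is 16.80
For hour 17 the average number of comments per post is 11.46
For hour 18 the average number of comments per post is 13.20
For hour 19 the average number of comments per post is 10.80
For hour 20 the average number of comments per post is 21.52
For hour 21 the average number of comments per post is 16.01
For hour 22 the average number of comments per post is 6.75
For hour 23 the average number of comments per post is 7.99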
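
The same averages can be reproduced directly from the two dictionaries printed in the previous section. The sketch below copies those dictionaries as literals and uses a dict comprehension; the name `averages` is an assumption for illustration.

```python
# Copied from the output of the previous section.
counts_by_hour = {9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73,
                  17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54,
                  5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44,
                  7: 34, 11: 58}
comments_by_hour = {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543,
                    12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381,
                    18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479,
                    8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}

# Average comments per post, keyed by hour and built in hour order.
averages = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in sorted(counts_by_hour)
}

for hour in (2, 15, 16):
    print('{:02d}: {:.2f}'.format(hour, averages[hour]))
```

Keeping the unrounded values in `averages` and rounding only when printing avoids carrying rounding error into any later calculation.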

Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

## Sorting And Printing Values From A List Of Lists

In [15]:
# Create a list that is sorted on the average number of comments instead
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

# Create a sorted version of this list (highest average first)
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Display the results
print ('Top 5 Hours for Ask Posts Comments', '\n')
output = "{}: {:.2f} average comments per post"
for row in sorted_swap[:5]:
    thetime = dt.datetime.strptime(str(row[1]), '%H')
    thetime = thetime.strftime('%H:%M')
    print (output.format(thetime,row[0]))

Top 5 Hours for Ask Posts Comments 

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
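
Swapping the columns is one way to sort on the average. An alternative sketch uses the `key` argument of `sorted`, which sorts on the second element without building a second list. The `[hour, average]` pairs below are a subset of the values printed above, and the variable names are illustrative.

```python
# A subset of the [hour, average] pairs computed earlier.
avg_by_hour = [
    [0, 8.13], [2, 23.81], [9, 5.58], [15, 38.59],
    [16, 16.80], [20, 21.52], [21, 16.01],
]

# Sort on the average (index 1), highest first.
by_average = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in by_average[:5]:
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, avg))
```

The `{:02d}` format pads the hour to two digits, so the round trip through `strptime` and `strftime` is not needed for this display.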


Therefore, the hours listed above appear to be the best times to post if your goal is to generate comments. Interestingly, the five highest-activity hours are spread throughout the day, which may indicate that commenters live in various time zones, so these hours might correspond to peak times in different regions of the world. This hypothesis would need additional investigation to be confirmed.

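Note that the times above are in US Eastern Time (as per the [dataset documentation](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts)).

Vietnam (UTC+7) is 11 hours ahead of US Eastern Time while daylight saving time is in effect there (UTC-4) and 12 hours ahead otherwise (UTC-5). To convert to our time zone (Vietnam), add 11 or 12 hours accordingly.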
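
As a sketch of that conversion, the code below attaches a fixed UTC offset to a `created_at` string and converts it to Vietnam time. The fixed offsets and the helper name `to_vietnam` are assumptions for illustration: the caller has to say whether the timestamp falls in daylight saving time, which a real time zone database (for example the standard `zoneinfo` module) would determine automatically.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets; no daylight-saving lookup is performed.
EDT = timezone(timedelta(hours=-4), 'EDT')  # US Eastern Daylight Time
EST = timezone(timedelta(hours=-5), 'EST')  # US Eastern Standard Time
ICT = timezone(timedelta(hours=7), 'ICT')   # Indochina Time (Vietnam)

def to_vietnam(created_at, eastern_tz):
    """Convert a US Eastern 'created_at' string to a Vietnam datetime."""
    naive = datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    return naive.replace(tzinfo=eastern_tz).astimezone(ICT)

summer = to_vietnam('8/16/2016 15:00', EDT)
winter = to_vietnam('1/26/2016 15:00', EST)

print(summer.strftime('%m/%d/%Y %H:%M'))  # 08/17/2016 02:00
print(winter.strftime('%m/%d/%Y %H:%M'))  # 01/27/2016 03:00
```

A post created at 15:00 Eastern therefore appears at 02:00 or 03:00 the next day in Vietnam, depending on the season.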

## Conclusion

Referring back to the goals of this study, let's summarize the conclusions.

*Post title*: adding `Ask HN` to a post title attracts more comments on average than adding `Show HN`:

* `Ask HN`: 14.04 average comments per post
* `Show HN`: 10.32 average comments per post

(We did not compare these with posts that have no tag at all.)

*Post timing*: the hour of posting appears to have a noticeable effect on the number of comments a post attracts. Based on the analysis of the `Ask HN` posts, the top hours are listed below in US Eastern Time. The Vietnam equivalents in parentheses assume US daylight saving time (11 hours ahead); add one more hour when the US is on standard time.

* 15:00 - 16:00 (02:00 - 03:00 the next day in Vietnam): 38.59 average comments per post
* 02:00 - 03:00 (13:00 - 14:00 in Vietnam): 23.81 average comments per post
* 20:00 - 21:00 (07:00 - 08:00 the next day in Vietnam): 21.52 average comments per post