# Exploring Hacker News Posts

In this project, we'll work with a dataset of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).

Hacker News is a popular site for tech and startup news where users submit stories and vote on them, similar to Reddit. It was started by [Y Combinator](https://www.ycombinator.com/), and top posts can receive hundreds of thousands of visitors.

We're specifically interested in posts with titles that begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the community a specific question, and `Show HN` posts to share a project, product, or something interesting with the community.

The goal of this project is to answer two questions:
* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

Let's explore our dataset.

In [1]:
from csv import reader

# Open the file in a with-block so it is closed once the rows are read
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

headers = hn[0]
hn = hn[1:] # Dataset without headers

print(headers)
print('\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


The dataset consists of the following columns:
* `id`: the unique identifier from Hacker News for the post
* `title`: the title of the post
* `url`: the URL that the post links to, if the post has a URL
* `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: the number of comments on the post
* `author`: the username of the person who submitted the post
* `created_at`: the date and time of the post's submission
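
Because every row is a plain list of strings, positions such as `row[4]` are easy to mix up. As a minimal sketch (the helper name `row_to_record` is hypothetical and is not used elsewhere in this notebook), a row can be paired with the header names and its count columns converted to integers:

```python
def row_to_record(headers, row):
    """Pair each column name with its value and convert the count columns to int."""
    record = dict(zip(headers, row))
    record['num_points'] = int(record['num_points'])
    record['num_comments'] = int(record['num_comments'])
    return record


headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# First row of the dataset, as printed above
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

record = row_to_record(headers, sample_row)
print(record['title'], record['num_comments'])
```

The rest of the notebook keeps using list indexes, so this is only an alternative way to read a row.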

## Extracting Ask HN and Show HN Posts

Since our goal concerns only posts whose titles start with `Ask HN` or `Show HN`, we first need to filter the dataset. We can do this with the string method `startswith`, lowercasing each title first so that the comparison is case-insensitive.

In [2]:
ask_posts = [] # Ask HN
show_posts = [] # Show HN
other_posts = []

for row in hn:
    title = row[1].lower() # To lower case
    
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Ask HN posts:', len(ask_posts))
print('Show HN posts:', len(show_posts))
print('Other posts:', len(other_posts))

Ask HN posts: 1744
Show HN posts: 1162
Other posts: 17194
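
The filter above relies on two details of `str.startswith`: it is case-sensitive, which is why each title is lowercased first, and it only matches at the start of the string. It also accepts a tuple of prefixes. A small sketch with made-up titles (the function name `post_type` is illustrative only):

```python
def post_type(title):
    """Return 'ask', 'show', or 'other' based on the start of the title."""
    lowered = title.lower()  # startswith is case-sensitive, so normalize first
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles for illustration, not rows from the dataset
titles = ['Ask HN: How do you learn?', 'SHOW HN: My side project', 'Why we ask HN']

print([post_type(t) for t in titles])

# A tuple of prefixes checks several options in one call
matches_either = 'show hn: demo'.startswith(('ask hn', 'show hn'))
print(matches_either)
```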


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Next, let's determine if ask posts or show posts receive more comments on average.

In [3]:
total_ask_comments = 0

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0

for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)

print('Ask HN average comments:', avg_ask_comments)
print('Show HN average comments:', avg_show_comments)


Ask HN average comments: 14.038417431192661
Show HN average comments: 10.31669535283993


On average, `Ask HN` posts get more comments than `Show HN` posts (about 14.0 versus 10.3 per post).
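
The two loops in the cell above do the same work, so the calculation could be factored into one function. This is a sketch rather than the code used in this notebook; the name `average_comments` and the empty-list behavior are assumptions:

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up rows in the dataset's column order, for illustration only
sample_posts = [
    ['1', 'Ask HN: A?', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: B?', '', '3', '4', 'user_b', '1/1/2016 11:00'],
]

print(average_comments(sample_posts))  # (10 + 4) / 2
```

With the real data, the same function could be called once with `ask_posts` and once with `show_posts`.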

## Finding the Number of Ask Posts and Comments by Hour Created

We'll limit the rest of the analysis to ask posts, since they receive more comments on average than show posts. First, we'll count the ask posts created during each hour of the day and total the comments those posts received.

In [4]:
import datetime as dt

result_list = []

for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])
    
counts_by_hour = {} # number of ask posts created in each hour of the day
comments_by_hour = {} # number of comments on ask posts created in each hour of the day
date_format = '%m/%d/%Y %H:%M' # format of the created_at column in our dataset

# This works like building a frequency table, keyed by the hour of creation
for row in result_list:
    created_at, num_comments = row
    created_at_dt = dt.datetime.strptime(created_at, date_format)
    hour = created_at_dt.strftime('%H')
    
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
        
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

Posts created during the **15:00** hour received the most comments in total (4,477). Every hour from **13:00** to **21:00** has more than 1,000 total comments, and the only other hour above that mark is **02:00**. One plausible explanation is that people are most active on the site during the afternoon and evening, although these totals also depend on how many posts were created in each hour.
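
The `if`/`else` bookkeeping in the cell above can also be written with `collections.defaultdict`, which starts every missing key at zero. The sketch below is an alternative, not the code that produced the output above, and it runs on made-up `[created_at, num_comments]` pairs:

```python
import datetime as dt
from collections import defaultdict

date_format = '%m/%d/%Y %H:%M'

# Made-up [created_at, num_comments] pairs, for illustration only
sample = [['8/4/2016 11:52', 52], ['1/26/2016 19:30', 10], ['9/30/2015 11:05', 2]]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # comments received by posts created in each hour

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, date_format).strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```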

## Calculating the Average Number of Comments for Ask HN Posts by Hour

Here, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [5]:
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

Posts created at **15:00** receive the most comments on average (about 38.6 per post), and posts created at **09:00** receive the fewest (about 5.6 per post).
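
These extremes can be read off programmatically with `max` and `min` and a `key` function that looks at the average rather than the hour. A minimal sketch, using three of the `[hour, average]` pairs from the output above:

```python
# A few [hour, average] pairs copied from the output above
avg_sample = [['09', 5.5777777777777775],
              ['15', 38.5948275862069],
              ['02', 23.810344827586206]]

# Compare pairs by their second element (the average)
busiest = max(avg_sample, key=lambda pair: pair[1])
quietest = min(avg_sample, key=lambda pair: pair[1])

print(busiest[0], round(busiest[1], 1))
print(quietest[0], round(quietest[1], 1))
```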

## Sorting and Printing Values from a List of Lists

Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [6]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print('Top 5 Hours for Ask Posts Comments')

for row in sorted_swap[:5]:
    print('{h}:00: {avg:.2f} average comments per post'.format(h=row[1], avg=row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
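
Swapping the columns is one way to sort by the average. Another is to leave the pairs as they are and pass a `key` function to `sorted`. The sketch below shows that alternative on a subset of the `[hour, average]` pairs from `avg_by_hour`:

```python
# A subset of the [hour, average] pairs from avg_by_hour
avg_sample = [['09', 5.5777777777777775], ['15', 38.5948275862069],
              ['20', 21.525], ['02', 23.810344827586206]]

# Sort by the average (second element), highest first
top = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print(f'{hour}:00: {avg:.2f} average comments per post')
```

This avoids building the intermediate swapped list, and the hour stays in the first position throughout.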


According to the [data set documentation](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts), the timezone used is Eastern Time in the US.
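
To relate these hours to other regions, an hour can be shifted by a UTC offset. The sketch below assumes fixed offsets (UTC-5 for Eastern Standard Time and UTC+1 for Central European Time) and ignores daylight saving time, so treat the result as approximate. The standard library's `zoneinfo` module handles daylight saving if exact conversions are needed. The date is arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets; daylight saving time is ignored
EST = timezone(timedelta(hours=-5), 'EST')
CET = timezone(timedelta(hours=1), 'CET')

# 15:00 Eastern on an arbitrary winter date
eastern_time = datetime(2016, 1, 15, 15, 0, tzinfo=EST)

utc_time = eastern_time.astimezone(timezone.utc)
cet_time = eastern_time.astimezone(CET)

print(utc_time.strftime('%H:%M'))
print(cet_time.strftime('%H:%M'))
```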

## Conclusion

According to the results of the analysis:
* On average, `Ask HN` posts get more comments than `Show HN` posts.
* The best time to create an `Ask HN` post is **15:00** US Eastern Time (roughly **20:00** in London or **21:00** in Central Europe). Posts created during that hour receive the highest average number of comments.
* `Ask HN` posts created between **13:00** and **21:00** Eastern Time also tend to receive a good number of comments.