# Exploring Posts on Hacker News 

In this project, I am going to explore submissions to the site [Hacker News](https://news.ycombinator.com/). The dataset can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

This analysis focuses on two categories of submission: questions posed to the community (`Ask HN`) and posts that showcase something (`Show HN`). I'll examine which type of post receives more comments on average. I'll also explore whether posts created at certain times of day receive more comments.

## Setup

In [None]:
import csv
import datetime as dt

## Load data

First I load the dataset as a list of lists:

In [None]:
# open the file in a with-block so the handle is closed after reading
with open('hacker_news.csv', newline='') as f:
    hn = list(csv.reader(f))

In [None]:
# separate header row
headers = hn[0]
hn = hn[1:]

# display headers
print(headers)

# display first 5 rows
print(hn[:5])
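As an aside, the same rows can be read by column name rather than by position. The sketch below uses `csv.DictReader` on a small made-up sample held in memory. The column names (`title`, `num_comments`, `created_at`, and so on) are an assumption chosen to match the index positions used in this notebook (1, 4, and 6); check them against the printed `headers` before relying on them.

```python
import csv
import io

# Made-up sample in the assumed column layout (not real Hacker News data).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,10,6,alice,8/16/2016 9:55\n"
    "2,Show HN: Example project,http://example.com,25,3,bob,11/22/2015 13:43\n"
)

# DictReader uses the first row as keys, so no manual header split is needed.
rows = list(csv.DictReader(sample_csv))

for row in rows:
    print(row['title'], int(row['num_comments']), row['created_at'])
```

Named access makes later cells easier to read (`row['num_comments']` instead of `row[4]`), at the cost of depending on the exact header spelling.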

## Filter data

Next I filter the data to keep only posts in the `Ask HN` and `Show HN` categories, collecting each type in its own list (all remaining posts go into `other_posts`):

In [None]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# display data for one post in each list
print(ask_posts[:1])
print(show_posts[:1])

In [None]:
# how many posts in each list?
print("We have {0} 'Ask HN' posts and {1} 'Show HN' posts!".format(len(ask_posts), len(show_posts)))
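The prefix test used above can also be pulled into a small helper, which makes the rule easy to check on its own. This is a minimal sketch: `classify_title` is a hypothetical name, and the titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles for illustration only.
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: A tiny static site generator',
    'Why people ask HN for advice',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```

Note that the match is on the start of the title only, so a title that merely mentions "ask HN" further along is classed as `other`.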

## Which type of post receives more comments on average?

Next, I'll determine whether `Ask HN` or `Show HN` posts receive more comments on average:

In [None]:
total_ask_comments = 0
for row in ask_posts:
    # get number of comments for this entry
    num_comments = int(row[4])
    # add to tally of total comments for Ask HN posts
    total_ask_comments += num_comments

# compute average number of comments on Ask HN posts
avg_ask_comments = total_ask_comments/len(ask_posts)

In [None]:
total_show_comments = 0
for row in show_posts:
    # get number of comments for this entry
    num_comments = int(row[4])
    # add to tally of total comments for Show HN posts
    total_show_comments += num_comments

# compute average number of comments on Show HN posts
avg_show_comments = total_show_comments/len(show_posts)

In [None]:
# compare the results
print('Average comments on Ask HN posts: {:.2f}'.format(avg_ask_comments))
print('Average comments on Show HN posts: {:.2f}'.format(avg_show_comments))

On average, `Ask HN` posts receive more comments than `Show HN` posts -- not surprising since `Ask HN` posts are usually queries to the Hacker News community.
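Since the two averaging cells differ only in the list they loop over, the calculation could be written once as a function. The sketch below assumes the same row layout as above (comment count at index 4); `average_comments` is a hypothetical helper, and the rows are made up for illustration.

```python
def average_comments(posts, comments_index=4):
    """Mean of the integer comment counts; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows in the 7-column layout used in this notebook.
sample_ask = [
    ['1', 'Ask HN: A?', '', '5', '10', 'alice', '8/16/2016 9:55'],
    ['2', 'Ask HN: B?', '', '2', '4', 'bob', '8/16/2016 15:10'],
]
sample_show = [
    ['3', 'Show HN: C', 'http://example.com', '7', '3', 'carol', '8/17/2016 18:02'],
]

print('{:.2f}'.format(average_comments(sample_ask)))   # 7.00
print('{:.2f}'.format(average_comments(sample_show)))  # 3.00
```

The empty-list guard avoids a `ZeroDivisionError` if a filter ever matches no posts.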

## At what times of day do 'Ask HN' posts receive the most comments?

Next, I'll examine the times of day at which `Ask HN` posts receive the most comments. To start, I iterate over the `ask_posts` data and build a list in which each element holds a post's creation date/time and its number of comments:

In [None]:
# iterate over ask_posts data
# and create a list where each element has post date/time and number of comments
result_list = []
for row in ask_posts:
    created_at = row[6]
    result_list.append([created_at, int(row[4])])
    
# show first entry in list
result_list[0]

Next, I create dictionaries to store the number of posts and number of total post comments per hour of the day:

In [None]:
# empty dicts to store # posts and comments by hour
counts_by_hour = {}
comments_by_hour = {}

# iterate over result_list and extract hour of each post
# then build dictionaries based on counts and comments per hour
for element in result_list:
    
    # create date/time object
    post_dt = dt.datetime.strptime(element[0], '%m/%d/%Y %H:%M')
    # extract hour of post
    post_hour = post_dt.strftime('%H')
    
    if post_hour not in counts_by_hour:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = element[1]
    else:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += element[1]

# show each dictionary
print(counts_by_hour)    
print(comments_by_hour)
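The "initialise on first sight, otherwise increment" pattern above can be shortened with `collections.defaultdict`, which supplies a starting value of 0 for any new key. A minimal sketch, using made-up `[created_at, num_comments]` pairs in the same shape as `result_list`:

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs for illustration only.
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 9:14', 1],
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {'09': 2, '13': 1}
print(dict(comments))  # {'09': 7, '13': 29}
```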

I can use these dictionaries to calculate, for each hour of the day, the average number of comments per post in a list of lists:

In [None]:
# calculate average number of comments per post for each hour
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
    
print(avg_by_hour)

This isn't very nice to read or easy to gain insights from, so I'll sort the list of lists in descending order of average comments per post:

In [None]:
# create a list equivalent to avg_by_hour, but swap the columns (for sorting)
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

In [None]:
# use the sorted() function to sort swap_avg_by_hour in descending order
# (this sorts by average number of comments)
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap)

In [None]:
print('Top 5 Hours for Ask Post Comments:')
for row in sorted_swap[0:5]:
    hr = dt.datetime.strptime(row[1], '%H').strftime('%H:%M')
    print('{hr}: {avg:.2f} average comments per post'.format(hr=hr, avg=row[0]))

We can see that `Ask HN` posts created later in the day receive the most comments per post on average: those created at 3pm, 2pm, 8pm, 4pm, and 9pm.
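Swapping the columns is one way to sort by the average. Another is to pass a `key` function to `sorted()`, which sorts on the second element directly and leaves the pairs in their original `[hour, average]` order. A minimal sketch with made-up values in the same shape as `avg_by_hour`:

```python
# Made-up [hour, average] pairs for illustration only.
sample_avg_by_hour = [['09', 2.0], ['13', 8.5], ['15', 12.25], ['02', 4.0]]

# Sort on the second element (the average), largest first.
top = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```

One difference to be aware of: with the swapped lists, ties on the average are broken by the hour string, whereas the `key` version keeps tied entries in their original relative order.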

## At what times of day do 'Show HN' posts receive the most comments?

Next, I'll do the same thing for `Show HN` posts, finding the creation hours with the highest average number of comments per post:

In [None]:
# iterate over show_posts data
# and create a list where each element has post date/time and number of comments
result_list = []
for row in show_posts:
    created_at = row[6]
    result_list.append([created_at, int(row[4])])

In [None]:
# empty dicts to store # posts and comments by hour
counts_by_hour = {}
comments_by_hour = {}

# iterate over result_list and extract hour of each post
# then build dictionaries based on counts and comments per hour
for element in result_list:
    
    # create date/time object
    post_dt = dt.datetime.strptime(element[0], '%m/%d/%Y %H:%M')
    # extract hour of post
    post_hour = post_dt.strftime('%H')
    
    if post_hour not in counts_by_hour:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = element[1]
    else:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += element[1]

In [None]:
# calculate average number of comments per post for each hour
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

In [None]:
# create a list equivalent to avg_by_hour, but swap the columns (for sorting)
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

# use the sorted() function to sort swap_avg_by_hour in descending order
# (this sorts by average number of comments)
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

In [None]:
print('Top 5 Hours for Show Post Comments:')
for row in sorted_swap[0:5]:
    hr = dt.datetime.strptime(row[1], '%H').strftime('%H:%M')
    print('{hr}: {avg:.2f} average comments per post'.format(hr=hr, avg=row[0]))

Similar to `Ask HN` posts, `Show HN` posts created later in the day receive the highest average number of comments: those created at 6pm, midnight, 2pm, 11pm, and 10pm.
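The `Show HN` cells repeat the `Ask HN` steps almost line for line, so the whole pipeline could be wrapped in one function and called once per list. The sketch below is one possible shape for that, not the code used to produce the results above: `top_hours` is a hypothetical helper, the index defaults assume the row layout used in this notebook, and the rows are made up.

```python
import datetime as dt

def top_hours(posts, n=5, created_index=6, comments_index=4):
    """Return up to n (hour, average comments per post) pairs, highest average first."""
    counts = {}
    comments = {}
    for row in posts:
        hour = dt.datetime.strptime(row[created_index], '%m/%d/%Y %H:%M').hour
        counts[hour] = counts.get(hour, 0) + 1
        comments[hour] = comments.get(hour, 0) + int(row[comments_index])
    averages = [(hour, comments[hour] / counts[hour]) for hour in counts]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)[:n]

# Made-up rows for illustration only.
sample_posts = [
    ['1', 'Show HN: A', '', '5', '10', 'alice', '8/16/2016 18:05'],
    ['2', 'Show HN: B', '', '2', '4', 'bob', '8/17/2016 18:40'],
    ['3', 'Show HN: C', '', '9', '3', 'carol', '8/17/2016 9:12'],
]

for hour, avg in top_hours(sample_posts):
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, avg))
```

With the real data, `top_hours(ask_posts)` and `top_hours(show_posts)` would replace the two duplicated sequences of cells.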

## Conclusion

In this report, I explored data on posts to the site Hacker News and discovered several things:

- There are more posts made in the `Ask HN` category (questions) than in the `Show HN` category (posts that showcase something).
- On average, `Ask HN` posts receive more comments than `Show HN` posts.
- `Ask HN` and `Show HN` posts created later in the day, particularly in the evening hours, receive the most comments on average.