# Exploring Hacker News Posts (Guided Project)

**Description:** [Hacker News](https://news.ycombinator.com) is a website by the startup incubator Y Combinator, where user-submitted stories (also known as "posts") receive votes and comments, similar to Reddit.  It is extremely popular in technology and startup circles, and posts that make it to the top of the listing can get hundreds of thousands of views.  

We are specifically interested in posts whose titles begin with `Ask HN` or `Show HN`.  `Ask HN` posts put a specific question to the Hacker News community.  `Show HN` posts share a project, product, or anything else interesting with the Hacker News community.

We will work with the dataset described below to answer two questions:  

1. Do `Ask HN` or `Show HN` posts receive more comments (on average)?
2. Do posts created at a certain time receive more comments (on average)?

**Dataset:** The dataset consists of information related to submissions (posts) to Hacker News.  It originally had ~300,000 rows, but it was reduced to ~20,000 rows by removing all submissions that did not receive any comments and then randomly sampling the remaining submissions.  There are 7 fields (columns) in total, including `id`, `num_points`, `num_comments`, and `created_at`.  More details can be found [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts).

In [None]:
# Open the csv file and read it in as a list of lists
from csv import reader

# Use a context manager so the file is closed once it has been read
with open('_data/HN_posts_year_to_Sep_26_2016.csv', newline='') as opened_file:
    hn = list(reader(opened_file))

# Display the first five rows of the dataset
print("Below are the first five rows of the data:\n")
print(hn[:5])

To analyze the data, we first separate the header row from the data rows.

In [None]:
# Field names (i.e. headers) are split out from the rest of the dataset
headers = hn[0]
hn = hn[1:]

# Print headers and then two rows of data to confirm
print("Below are the headers:\n")
print(headers)
print("\nBelow are the first two rows of the data:\n")
print(hn[:2])
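
The cells that follow pick out columns by position (for example, `row[1]` for the title). As a minimal sketch, those positions could instead be looked up from the header row. The `sample_headers` list and `sample_row` here are assumed values used only for illustration, so the column order should be checked against the `headers` printed above.

```python
# Hypothetical header row; the real one is whatever `headers` contains
sample_headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position so later code can avoid hard-coded indexes
col_index = {name: position for position, name in enumerate(sample_headers)}

# Made-up row in the same shape as the dataset rows
sample_row = ['1', 'Ask HN: Example question?', '', '3', '7', 'someone', '9/26/2016 2:53']

title = sample_row[col_index['title']]
num_comments = int(sample_row[col_index['num_comments']])
print(title, num_comments)
```

With a lookup like this, a change in column order would only require updating the header row, not every index in the analysis.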

We are only interested in posts that begin with `Ask HN` or `Show HN` (as mentioned at the beginning of this project).  Thus, the rows are split into three lists: `Ask HN` posts, `Show HN` posts, and all other posts.

In [None]:
ask_posts = []
show_posts = []
other_posts = []

# Append rows of the posts that start with 'ask hn' to the ask_posts list, and append rows
# of the posts that start with 'show hn' to the show_posts list.
# Use the `lower` method to convert string to all lowercase.
for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
# Check the number of posts of each type
print("The number of Ask HN posts is:", len(ask_posts))
print("The number of Show HN posts is:", len(show_posts))
print("The number of other posts is:", len(other_posts))
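
The prefix test in the loop above can also be wrapped in a small function, which makes it easy to check against a few titles. This is a sketch only: `classify_title` and the sample titles are hypothetical and not part of the original analysis.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles for illustration
sample_titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'A post that mentions Ask HN later',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Note that only the start of the title is checked, so a title that mentions `Ask HN` later in the text is still counted as an other post.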

In [None]:
# Calculate average number of comments in Ask HN posts
total_ask_comments = 0

for ask in ask_posts:
    num_comments = ask[4]
    total_ask_comments += int(num_comments)
    
avg_ask_comments = total_ask_comments / len(ask_posts)
    
print("The average number of comments in Ask HN posts is:", round(avg_ask_comments, 1))

# Calculate average number of comments in Show HN posts
total_show_comments = 0

for show in show_posts:
    num_comments = show[4]
    total_show_comments += int(num_comments)
    
avg_show_comments = total_show_comments / len(show_posts)

print("The average number of comments in Show HN posts is:", round(avg_show_comments, 1))

**Ask HN posts receive more comments** per post on average (about 10) than Show HN posts (about 5).  This makes sense, since Ask HN posts come from users seeking answers to specific questions, whereas Show HN posts do not necessarily require input from others.  However, this does not necessarily mean that Ask HN posts are more popular than Show HN posts.
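
The two averaging loops above follow the same pattern, so they could be folded into one helper. The sketch below assumes the comment count sits at index 4 as in the cells above; `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Average of the comment counts stored as strings at `comments_index`."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up rows in the same shape as the dataset rows
sample_posts = [
    ['1', 'Ask HN: A?', '', '2', '4', 'a', '9/26/2016 2:53'],
    ['2', 'Ask HN: B?', '', '5', '10', 'b', '9/26/2016 3:10'],
]
sample_avg = average_comments(sample_posts)
print(round(sample_avg, 1))
```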

Next, we focus on `Ask HN` posts only, since they receive more comments on average than `Show HN` posts. In particular, we count the posts and the comments by hour of day, and calculate the average number of comments per post for each hour.

In [None]:
import datetime as dt

result_list = []

for ask in ask_posts:
    created_at = ask[6]
    num_comments = int(ask[4])
    
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}

for result in result_list:
    date_time_str = result[0]
    # Turn datetime string into datetime object
    date_time_obj = dt.datetime.strptime(date_time_str, "%m/%d/%Y %H:%M")
    # Extract just the hour of datetime object
    hour = date_time_obj.strftime("%H")
    # Get number of comments in the post for the following if statements
    num_comments = int(result[1])
    
    # Count the number of Ask HN posts and the number of Ask HN comments, in separate dictionaries, by hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
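
The counting step above can also be written with `collections.defaultdict`, which starts every missing key at zero so no membership check is needed. This is a minimal sketch on made-up rows; the `sample_*` names are illustrative only.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the same "%m/%d/%Y %H:%M" format
sample_results = [
    ['9/26/2016 2:53', 7],
    ['9/25/2016 2:10', 3],
    ['8/16/2016 15:05', 20],
]

sample_counts = defaultdict(int)    # posts per hour
sample_comments = defaultdict(int)  # comments per hour

for created_at, num_comments in sample_results:
    # Parse the timestamp and keep only the zero-padded hour, e.g. "02"
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))
print(dict(sample_comments))
```

Because each post passes through exactly one update, every post is counted once, including the first post seen for a given hour.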

In [None]:
avg_by_hour = []

# Calculate the average number of comments per Ask HN post by hour
for hour in counts_by_hour:
    comments_per_post = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, round(comments_per_post, 1)])
    
print("Below is the average number of comments per Ask HN post by hour: \n")
print(avg_by_hour)

To summarize the output above, the code below prints the five hours with the highest average number of comments per `Ask HN` post, along with each average.

In [None]:
swap_avg_by_hour = []

# sorted() compares lists by their first element, so put the average first and the hour second
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
# Sort in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Print top 5 hours of average comments per post
print("Top 5 Hours for Ask HN Posts Comments:\n")

for row in sorted_swap[:5]:
    hour_str = row[1]
    hour_dt = dt.datetime.strptime(hour_str, "%H")
    hour_label = hour_dt.strftime("%H:%M")
    print(hour_label, "-", row[0], "average comments per post")
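
As an alternative to swapping the columns before sorting, `sorted` accepts a `key` function that picks the value to sort by. The sketch below uses made-up `[hour, average]` pairs in the same shape as `avg_by_hour`; the numbers are arbitrary and are not results from the dataset.

```python
# Made-up [hour, average] pairs; the values are arbitrary
sample_avg_by_hour = [['09', 4.0], ['15', 9.5], ['13', 7.25], ['02', 1.0]]

# Sort by the average (second element), largest first, without swapping columns
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

for hour, avg in top_hours:
    print(f"{hour}:00 - {avg} average comments per post")
```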

As noted in the [documentation](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts), post times are given in Eastern Time, so **3pm EST** and **1pm EST** are the hours when `Ask HN` posts received the **most** comments on average.  In particular, posts created between 3pm and 4pm EST averaged ~29 comments, and posts created between 1pm and 2pm EST averaged ~16.  A plausible explanation is that afternoons are when many people take breaks from work and browse their favorite forums.
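
To read these hours in another time zone, the hour can be shifted by a fixed offset. The sketch below assumes the timestamps sit at a constant UTC-5 (EST) offset, which ignores daylight saving time; `est_hour_to_utc` is a hypothetical helper and the calendar date it uses is arbitrary.

```python
import datetime as dt

# Assumption: timestamps use a fixed UTC-5 offset (EST), with no daylight saving adjustment
EST = dt.timezone(dt.timedelta(hours=-5), name="EST")


def est_hour_to_utc(hour_str):
    """Convert an hour string such as '15' (EST) to an 'HH:MM' string in UTC."""
    # The calendar date is arbitrary; only the hour matters here
    local = dt.datetime(2016, 1, 1, int(hour_str), 0, tzinfo=EST)
    return local.astimezone(dt.timezone.utc).strftime("%H:%M")


utc_15 = est_hour_to_utc("15")
utc_13 = est_hour_to_utc("13")
print(utc_15, utc_13)
```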

## Conclusion
In this guided project, we worked with the Hacker News submissions dataset to answer the two questions below:  
* Do `Ask HN` or `Show HN` posts receive more comments (on average)?  
    *Answer*: We found that on average, `Ask HN` posts receive ~10 comments per post, whereas `Show HN` posts receive about half of that (around 5 comments per post).  
    
    
* Do `Ask HN` posts created at a certain time receive more comments (on average)?  
    *Answer*: We found that afternoons are when `Ask HN` posts receive the most comments, on average.  Specifically, the average number of comments per post between 3pm and 4pm EST is about 29, and the average number of comments per post between 1pm and 2pm EST is about 16.  The remaining hours in the day have fewer than 12 comments per post (again, on average).