## Analyzing Hacker News Posts

This project aims to compare two distinct post types on [Hacker News](https://news.ycombinator.com/), a platform where users vote and comment on tech-related content. We'll focus on `Ask HN` and `Show HN` posts.

`Ask HN` posts invite the community to answer a specific question, such as "How to improve my personal website?" Conversely, `Show HN` posts showcase projects, products, or other interesting items.

Our analysis will determine:

* Which post type, `Ask HN` or `Show HN`, typically generates more comments?
* Does post timing influence the volume of comments?
  
It's important to note that our dataset, originally comprising nearly 300,000 submissions, has been reduced to approximately 20,000 entries. This reduction involved eliminating posts with no comments and randomly sampling from the remaining data.
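The reduction described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `reduce_dataset`, the fixed seed, and the sample rows are hypothetical, and this is not the procedure actually used to prepare `hacker_news.csv`.

```python
import random

def reduce_dataset(rows, sample_size, comments_index=4, seed=0):
    """Drop rows with zero comments, then draw a random sample.

    Hypothetical helper: assumes the comment count is a numeric string
    stored at `comments_index`, as in the dataset's column order.
    """
    commented = [row for row in rows if int(row[comments_index]) > 0]
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return rng.sample(commented, min(sample_size, len(commented)))

# Hypothetical rows in the same column order as the dataset.
sample_rows = [
    ['1', 'Post A', '', '3', '0', 'user_a', '1/1/2016 10:00'],
    ['2', 'Post B', '', '5', '4', 'user_b', '1/2/2016 11:00'],
    ['3', 'Post C', '', '1', '2', 'user_c', '1/3/2016 12:00'],
    ['4', 'Post D', '', '9', '0', 'user_d', '1/4/2016 13:00'],
    ['5', 'Post E', '', '2', '7', 'user_e', '1/5/2016 14:00'],
]
reduced = reduce_dataset(sample_rows, sample_size=2)
print(len(reduced))
```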

## Introduction
We'll begin by importing the dataset and removing the header row.

In [12]:
# Read in the data
import csv

# Use a context manager so the file is closed once it has been read.
with open('hacker_news.csv', newline='') as f:
    hn = list(csv.reader(f))
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


## Removing Headers from a List of Lists

In [18]:
# Remove the headers.
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Each row holds the post's ID, title, URL, points, comment count, author, and creation time. Let's start by comparing comment counts between `Ask HN` and `Show HN` posts.

## Extracting Ask HN and Show HN Posts
Next, we'll separate the data into two groups based on post type: `Ask HN` and `Show HN`. This will simplify our analysis.

In [24]:
# Identify posts that begin with either `Ask HN` or `Show HN` and separate the data into different lists.
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


## Analyzing Average Comment Engagement for "Ask HN" and "Show HN" Posts
With the data separated, we'll calculate the average comment count for each post type.

In [32]:
# Calculate the average number of comments `Ask HN` posts receive.
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [36]:
# Calculate the average number of comments `Show HN` posts receive.
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


`Ask HN` posts average around 14 comments, compared to 10 for `Show HN` posts. Given the higher engagement, we'll focus our analysis on `Ask HN` posts.
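Since the same averaging logic is applied to both groups, it could be factored into a small helper. The sketch below is one possible way to do that; the function name `average_comments` and the sample rows are hypothetical and not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for a list of rows, or 0.0 if empty."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical sample rows in the same column order as the dataset.
sample_ask = [
    ['1', 'Ask HN: Example question?', '', '5', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: Another question?', '', '2', '2', 'user_b', '11/22/2015 13:43'],
]
print(average_comments(sample_ask))  # 4.0
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops above.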

## Determining "Ask HN" Post and Comment Volume by Creation Hour
To identify potential optimal posting times, we'll analyze `Ask HN` posts by creation hour. This involves calculating post and comment counts for each hour, followed by determining average comment counts per hour.

In [44]:
# Count the `Ask HN` posts created during each hour of the day and total the comments they received.

import datetime as dt

result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = dt.datetime.strptime(row[0], date_format)
    hour = date.strftime('%H')  # zero-padded hour, e.g. '09'
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
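The same hourly tally can be written without the explicit "first time seen" branch by using `collections.defaultdict`. This is a sketch of an alternative, not the notebook's own code; the function name `tally_by_hour` and the sample rows are hypothetical.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"

def tally_by_hour(rows, date_index=6, comments_index=4):
    """Return (post counts, comment totals) keyed by zero-padded hour."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for row in rows:
        created = dt.datetime.strptime(row[date_index], DATE_FORMAT)
        hour = created.strftime("%H")
        counts[hour] += 1
        comments[hour] += int(row[comments_index])
    return dict(counts), dict(comments)

# Hypothetical rows in the same column order as the dataset.
sample_rows = [
    ['1', 'Ask HN: First?', '', '5', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: Second?', '', '2', '29', 'user_b', '11/22/2015 13:43'],
    ['3', 'Ask HN: Third?', '', '1', '4', 'user_c', '5/2/2016 9:14'],
]
counts, comments = tally_by_hour(sample_rows)
print(counts)    # {'09': 2, '13': 1}
print(comments)  # {'09': 10, '13': 29}
```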

## Determining Average Comment Counts for "Ask HN" Posts by Hour

In [53]:
# Calculate the average number of comments received by `Ask HN` posts created at each hour of the day.
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, round((comments_by_hour[hr] / counts_by_hour[hr]))])

avg_by_hour

[['09', 6],
 ['13', 15],
 ['10', 13],
 ['14', 13],
 ['16', 17],
 ['23', 8],
 ['12', 9],
 ['17', 11],
 ['15', 39],
 ['21', 16],
 ['20', 22],
 ['02', 24],
 ['18', 13],
 ['03', 8],
 ['05', 10],
 ['19', 11],
 ['01', 11],
 ['22', 7],
 ['08', 10],
 ['04', 7],
 ['00', 8],
 ['06', 9],
 ['07', 8],
 ['11', 11]]

## Sorting and Printing Values from a List of Lists

In [61]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

sorted_swap

[[6, '09'], [15, '13'], [13, '10'], [13, '14'], [17, '16'], [8, '23'], [9, '12'], [11, '17'], [39, '15'], [16, '21'], [22, '20'], [24, '02'], [13, '18'], [8, '03'], [10, '05'], [11, '19'], [11, '01'], [7, '22'], [10, '08'], [7, '04'], [8, '00'], [9, '06'], [8, '07'], [11, '11']]


[[39, '15'],
 [24, '02'],
 [22, '20'],
 [17, '16'],
 [16, '21'],
 [15, '13'],
 [13, '18'],
 [13, '14'],
 [13, '10'],
 [11, '19'],
 [11, '17'],
 [11, '11'],
 [11, '01'],
 [10, '08'],
 [10, '05'],
 [9, '12'],
 [9, '06'],
 [8, '23'],
 [8, '07'],
 [8, '03'],
 [8, '00'],
 [7, '22'],
 [7, '04'],
 [6, '09']]

In [65]:
# Print the 5 hours with the highest average comments from the sorted list.
print("Top 5 Hours for Ask Posts Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for Ask Posts Comments
15:00: 39 average comments per post
02:00: 24 average comments per post
20:00: 22 average comments per post
16:00: 17 average comments per post
21:00: 16 average comments per post
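Swapping the columns before sorting works, but the ranking can also be produced directly with the `key` argument of `sorted`. The sketch below uses a small hypothetical subset of the averages shown earlier; note that hours with equal averages may appear in a different order than in the swapped-list approach, because `sorted` keeps tied items in their original order.

```python
# Hypothetical subset of [hour, average] pairs, in the original column order.
avg_by_hour_sample = [['09', 6], ['15', 39], ['02', 24], ['20', 22]]

# Sort by the average (second element), highest first, without swapping columns.
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hr, avg in ranked[:3]:
    print("{}:00: {} average comments per post".format(hr, avg))
```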


The optimal posting time for maximizing comment engagement is 15:00, averaging 39 comments per post. This time yields approximately 60% more comments than the second-best hour.

Please note that the [dataset](https://www.kaggle.com/hacker-news/hacker-news-posts/home) uses Eastern Time (EST), making the optimal time 3:00 PM EST.
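Readers in other time zones may want to convert this hour. The sketch below converts an Eastern hour to UTC, assuming a fixed UTC-5 offset (standard time) and ignoring daylight saving time; the function name `eastern_hour_to_utc` and the arbitrary January date are hypothetical choices made for illustration.

```python
import datetime as dt

# Assumption: fixed UTC-5 offset (EST); daylight saving time is ignored.
EASTERN_STANDARD = dt.timezone(dt.timedelta(hours=-5), "EST")

def eastern_hour_to_utc(hour):
    """Convert an hour of the day in EST to an HH:MM string in UTC."""
    # The date is arbitrary; only the hour matters for a fixed offset.
    eastern = dt.datetime(2016, 1, 15, hour, 0, tzinfo=EASTERN_STANDARD)
    return eastern.astimezone(dt.timezone.utc).strftime("%H:%M")

print(eastern_hour_to_utc(15))  # 20:00
```

For conversions that must respect daylight saving time, the standard-library `zoneinfo` module (Python 3.9+) would be a better fit than a fixed offset.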

## Conclusion
Our analysis of Hacker News posts indicates that `Ask HN` posts generally receive more comments than `Show HN` posts. Furthermore, `Ask HN` posts created between `15:00` and `16:00` tend to generate the highest average number of comments.

Note that our findings are based on a dataset that excludes posts with zero comments. Consequently, the results describe only posts that received at least one comment.

These results offer a starting point for maximizing comment engagement on Hacker News. Repeating the analysis on a more comprehensive dataset could confirm or refine them.