# Exploration of Hacker News Posts

## Purpose

This short project is an analysis of a sampling of posts from the Hacker News website. The goal is to assess whether `Show HN` or `Ask HN` posts get more comments and points, and to investigate whether the timing of a post has any effect on the number of comments made. 

The purpose of conducting this exercise in analysis is demonstrate the use date and time information in a data analysis project. 

## Data

### Importing the data

The data I will use in the analysis is from the Hacker News site, run by [Y Combinator](https://www.ycombinator.com/) and the full dataset can be accessed [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). For the purposes of this analysis of commentary on posts, the dataset has been reduced from 300,000 rows to 20,100 rows by excluding posts without comments and randomly sampling the remainder. 

In [1]:
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[0:3])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]


### Organizing and cleaning the data

The next step in filtering the data is separating out the posts of interest for this analysis - the `Ask HN` and `Show HN` posts. 

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Number of ask posts:', len(ask_posts))
print('Number of show posts:', len(show_posts))
print('Number of other posts:', len(other_posts))

Number of ask posts: 1744
Number of show posts: 1162
Number of other posts: 17194


## Analysis

### Analysis of average number of comments per post by post _type_

Now I will sift through the posts and track the number of comments for each kind of post of interest.

In [3]:
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    total_ask_comments +=int(row[4])
    
for row in show_posts:
    total_show_comments += int(row[4])
    
avg_ask_comments = total_ask_comments/len(ask_posts)
avg_show_comments = total_show_comments/len(show_posts)

print('Average number or ask comments:', '{:.2f}'.format(avg_ask_comments))
print('Average number of show comments:', '{:.2f}'.format(avg_show_comments))

Average number or ask comments: 14.04
Average number of show comments: 10.32


On average `Ask HN` posts recieve almost 4 more comments per post than `Show HN` posts. This is not surprising, as the format of an `Ask HN` post literally invites response - that is its purpose. 

### Analysis of average number of comments per post by post _time_

As `Ask` posts are the more commented-on, on average, I will explore them further by finding if there is any relationship between the number of comments and the timing of the post. 

In [4]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    post_time = dt.datetime.strptime(row[0],'%m/%d/%Y %H:%M')
    hour = post_time.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour]+=1
        comments_by_hour[hour] += row[1]

avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([comments_by_hour[hour]/counts_by_hour[hour], hour])

sorted_avg = sorted(avg_by_hour, reverse = True)
print('Top 5 Hours for Ask Post Comments')
for row in sorted_avg[0:5]:
    #format the time and convert it to the local time (EST to PST) using the dt.timedelta function
    time = dt.datetime.strptime(row[1],'%H')-dt.timedelta(hours = 3)
    print(time.strftime('%H:%M'), ':', '{:.2f}'.format(row[0]), 'average comments per post')

Top 5 Hours for Ask Post Comments
12:00 : 38.59 average comments per post
23:00 : 23.81 average comments per post
17:00 : 21.52 average comments per post
13:00 : 16.80 average comments per post
18:00 : 16.01 average comments per post


The above result indicates that the best time to create an `Ask HN` post, if you want the maximum number of comments is 12 pm PST. The top five times are 12 pm, 11 pm, 5 pm, 1 pm, and 6 pm, PST. A poster can get more than twice the number of comments on a post at noon than a post an hour later at 1 pm. Additionally, if you want to maximize the number of comments on your post, you're better off posting in the afternoon rather than in the morning. 

### Analyzing the average number of points per post type

The first evaluation metric we looked at was comments - how much engagement did a post generate by type and time. A second way to gauge the success of a post is by the number of points in gets. Points are equal to the number of upvotes a particular post gets, or how much other users of the website enjoy the post. 

In [6]:
total_ask_points = 0
total_show_points = 0
total_other_points = 0

for row in ask_posts:
    total_ask_points +=int(row[4])
    
for row in show_posts:
    total_show_points += int(row[4])
    
for row in other_posts:
    total_other_points += int(row[4])
    
avg_ask_points = total_ask_points/len(ask_posts)
avg_show_points = total_show_points/len(show_posts)
avg_other_points = total_other_points/len(other_posts)

print('Average number or ask points:', '{:.2f}'.format(avg_ask_points))
print('Average number of show points:', '{:.2f}'.format(avg_show_points))
print('Average number of other points:','{:.2f}'.format(avg_other_points))

Average number or ask points: 14.04
Average number of show points: 10.32
Average number of other points: 26.87


## Discussion and Recommendations

This short survery of posts from the Hacker News site demonstrates trends that make intuitive sense. Posts _asking_ for response (`Ask HN`) posts recieve more comments than posts simply want to display a skill or information (`Show HN` posts). Additionally, while `Ask HN` posts garner more points than `Show HN` posts, other posts - including those sharing useful information and resources and exciting news - get more points than either. 

One of the purposes of this analysis was to practice and demonstrate the use of datetime information in analyses. To that end, the best time to create an `Ask  HN` post in order to recieve the most responses was determined. The data suggest that posts make in the afternoon (Pacific Time) get more responses than morning posts, and that 12 pm, 11 pm, and 5 pm were the top three times to make a post. 