# Exploring Hacker News Posts

Guided project from dataquest.io

This project will explore posts from Hacker News. Specifically Ask HN and Show HN posts, where users either ask the community a question or show the community something.
We will assess which type of posts receive more comments, and assess where the time of posts affects the number of comments.

### Import Data

In [2]:
# create list of lists from file
from csv import reader
opened_file = open('data/hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [3]:
# extract header row
headers = hn[0]
hn = hn[1:]
print(headers)
print("\n")
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Filter data to Show and Ask posts

In [4]:
# loop through the data to seperate posts into 3 lists, ask, show and other
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
# show number of each type of post        
print('ask posts: ' + str(len(ask_posts)))
print('show posts: ' + str(len(show_posts)))
print('other posts: ' + str(len(other_posts)))

ask posts: 1744
show posts: 1162
other posts: 17194


### Determining average number of comments per post type

In [5]:
# get total comments for ask posts
total_ask_comments = 0

for row in ask_posts:
    num_comments = row[4]
    total_ask_comments += int(num_comments)
    
# calculate the average
avg_ask_comments = total_ask_comments / len(ask_posts)
print("average comments for ask posts: {:.2f}".format(avg_ask_comments))

average comments for ask posts: 14.04


In [6]:
# get total comments for show posts
total_show_comments = 0

for row in show_posts:
    num_comments = row[4]
    total_show_comments += int(num_comments)
    
# calculate the average
avg_show_comments = total_show_comments / len(show_posts)
print("average comments for show posts: {:.2f}".format(avg_show_comments))

average comments for show posts: 10.32


From the previous cells we can see Ask HN posts receive on average four more comments than Show HN posts.

### Finding number of ask posts and comments by time

The analysis will focus on only the Ask posts, as they received more comments on average.

In [7]:
# import datetime module
import datetime as dt

# create new list of created dates and number of comments
result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at,num_comments])
    
# create frequency tables as dictionaries for posts by hour and comments
counts_by_hour = {}
comments_by_hour = {}

# loop through new list
for row in result_list:
    str_date = row[0]
    num_comments = row[1]
    dt_date = dt.datetime.strptime(str_date,"%m/%d/%Y %H:%M") # parse date field as datetime
    hour = dt_date.strftime("%H") # extract only hour
    if hour not in counts_by_hour:
        # if new hour set count to 1 and comments
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    if hour in counts_by_hour:
        # if hour already in dictionary increment count by 1 and comments
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
        
print(counts_by_hour)
print("\n")
print(comments_by_hour)

{'09': 46, '13': 86, '10': 60, '14': 108, '16': 109, '23': 69, '12': 74, '17': 101, '15': 117, '21': 110, '20': 81, '02': 59, '18': 110, '03': 55, '05': 47, '19': 111, '01': 61, '22': 72, '08': 49, '04': 48, '00': 56, '06': 45, '07': 35, '11': 59}


{'09': 257, '13': 1282, '10': 794, '14': 1419, '16': 1831, '23': 544, '12': 691, '17': 1147, '15': 4478, '21': 1749, '20': 1724, '02': 1384, '18': 1441, '03': 422, '05': 493, '19': 1191, '01': 716, '22': 481, '08': 497, '04': 340, '00': 457, '06': 398, '07': 269, '11': 643}


Average comments per post for each hour

In [8]:
avg_by_hour = []

for row in comments_by_hour:
    avg_by_hour.append([row,comments_by_hour[row]/counts_by_hour[row]])
    
print(avg_by_hour)

[['09', 5.586956521739131], ['13', 14.906976744186046], ['10', 13.233333333333333], ['14', 13.13888888888889], ['16', 16.798165137614678], ['23', 7.884057971014493], ['12', 9.337837837837839], ['17', 11.356435643564357], ['15', 38.27350427350427], ['21', 15.9], ['20', 21.28395061728395], ['02', 23.45762711864407], ['18', 13.1], ['03', 7.672727272727273], ['05', 10.48936170212766], ['19', 10.72972972972973], ['01', 11.737704918032787], ['22', 6.680555555555555], ['08', 10.142857142857142], ['04', 7.083333333333333], ['00', 8.160714285714286], ['06', 8.844444444444445], ['07', 7.685714285714286], ['11', 10.898305084745763]]


In [9]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
print(swap_avg_by_hour)

[[5.586956521739131, '09'], [14.906976744186046, '13'], [13.233333333333333, '10'], [13.13888888888889, '14'], [16.798165137614678, '16'], [7.884057971014493, '23'], [9.337837837837839, '12'], [11.356435643564357, '17'], [38.27350427350427, '15'], [15.9, '21'], [21.28395061728395, '20'], [23.45762711864407, '02'], [13.1, '18'], [7.672727272727273, '03'], [10.48936170212766, '05'], [10.72972972972973, '19'], [11.737704918032787, '01'], [6.680555555555555, '22'], [10.142857142857142, '08'], [7.083333333333333, '04'], [8.160714285714286, '00'], [8.844444444444445, '06'], [7.685714285714286, '07'], [10.898305084745763, '11']]


In [10]:
sorted_swap = sorted(swap_avg_by_hour,reverse=True)

In [11]:
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hours = dt.datetime.strptime(row[1],"%H")
    hours = hours.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hours,row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.27 average comments per post
02:00: 23.46 average comments per post
20:00: 21.28 average comments per post
16:00: 16.80 average comments per post
21:00: 15.90 average comments per post


In Eastern US time posts created from 15:00 to 16:00 received the most comments.
The ranges 15:00 to 17:00 and 20:00 to 22:00 are both the best windows to post for comments.

The 20:00 to 22:00 Eastern US window is 02:00 in the UK, and not a reasonable time to be posting online.

It would be best posting Ask HN posts from the UK from 20:00 to 22:00 to fall within the 15:00 to 16:00 Eastern US window. This will increase the likelihood of receiving comments from the Ask HN posts