# Hacker News Dataquest Project

In this project we determine whether `Ask HN` or `Show HN` posts on Hacker News receive more comments on average, and whether posts created at a certain time of day receive more comments on average. The whole dataset is available [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but we will be using a subset: a random sample of approximately 20,000 submissions, drawn after removing submissions that did not receive any comments.

In [3]:
#read hacker_news.csv in as a list of lists assigned to variable hn
from csv import reader

#open the file in a with block so it is closed after reading
with open("hacker_news.csv", newline="") as f:
    hn = list(reader(f))

#display the first five rows of hn
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [4]:
#extract the first row of data, assign it to variable headers
headers = hn[0]
#remove the first row of hn
hn = hn[1:]
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Separate posts whose titles begin with `Ask HN` or `Show HN` (in any letter case) into two lists; all remaining posts go into a third list.

In [7]:
#create empty lists
ask_posts = []
show_posts = []
other_posts = []

#loop through each row in hn
for row in hn:
    title = row[1]
    #make title lowercase
    title = title.lower()
    #if the title starts with "ask hn", append row to ask_posts
    if title.startswith("ask hn"):
        ask_posts.append(row)
    #if the title begins with "show hn", append row to show_posts
    elif title.startswith("show hn"):
        show_posts.append(row)
    #else append to other_posts
    else:
        other_posts.append(row)

#check number of posts
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
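
The prefix check above can also be factored into a small function, which makes the case-insensitive matching easy to test on its own. The following is only a sketch: the name `classify_title` and the sample rows are hypothetical and are not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical sample rows, in the same column order as hn
sample_rows = [
    ["1", "Ask HN: How do you learn?", "", "5", "3", "a", "8/4/2016 11:52"],
    ["2", "SHOW HN: My side project", "", "7", "2", "b", "1/26/2016 19:30"],
    ["3", "Interactive Dynamic Video", "", "386", "52", "c", "8/4/2016 11:52"],
]

# Group the rows by the label that classify_title assigns to each title
groups = {"ask": [], "show": [], "other": []}
for row in sample_rows:
    groups[classify_title(row[1])].append(row)

print({kind: len(rows) for kind, rows in groups.items()})
# {'ask': 1, 'show': 1, 'other': 1}
```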


Determine whether ask posts or show posts receive more comments on average.

In [8]:
#find the total number of comments in ask posts
total_ask_comments = 0

for row in ask_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_ask_comments += num_comments
    
#compute the average number of comments on ask posts
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

#find the total number of comments in show posts
total_show_comments = 0

for row in show_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_show_comments += num_comments
    
#compute the average number of comments on show posts
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993


Ask posts receive about 14 comments on average (14.04), and show posts receive about 10 (10.32). On average, ask posts receive roughly 4 more comments than show posts.
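
The two loops above follow the same pattern, so the averaging could be written once as a helper. This is a minimal sketch under the assumption that the comment count sits at index 4 of each row, as it does in `hn`; the function name `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Mean of the integer comment counts stored at comments_index in each row."""
    if not posts:
        raise ValueError("cannot average an empty list of posts")
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical sample rows, in the same column order as hn
sample_posts = [
    ["1", "Ask HN: A", "", "5", "3", "a", "8/4/2016 11:52"],
    ["2", "Ask HN: B", "", "7", "9", "b", "1/26/2016 19:30"],
]

print(average_comments(sample_posts))  # 6.0
```

With the notebook's data, `average_comments(ask_posts)` and `average_comments(show_posts)` would take the place of the two explicit loops.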

We will focus the remainder of our analysis on ask posts.

Below, we calculate the number of ask posts created during each hour of the day, along with the total number of comments those posts received.

In [10]:
import datetime as dt

result_list = [] #this will be a list of lists

for row in ask_posts:
    created_at = row[6]
    num_comments = row[4]
    num_comments = int(num_comments)
    #append to result_list a list with two elements: 
    #created_at and num_comments
    result_list.append([created_at, num_comments])
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    #extract hour from each date
    created_at = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = created_at.strftime("%H")
    
    #create frequency tables
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
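
To illustrate how the hour is extracted and tallied, here is a small self-contained sketch of the same idea. It uses `collections.defaultdict` in place of the explicit `if`/`else` (a swapped-in technique, not the notebook's approach), and the sample pairs are made up for illustration.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs in the dataset's date format
samples = [
    ["8/4/2016 11:52", 52],
    ["6/17/2016 0:01", 1],
    ["9/30/2015 11:05", 2],
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in samples:
    # Parse the timestamp, then format only the hour as a zero-padded string
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {'11': 2, '00': 1}
print(dict(comments))  # {'11': 54, '00': 1}
```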

In [12]:
#average number of comments per post for each hour
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

In [13]:
#create a list that equals avg_by_hour with swapped columns
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [15]:
#sort swap_avg_by_hour in descending order (number of comments)
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")

for avg, hr in sorted_swap[:5]:
    print(
    "{}: {:.2f} average comments per post".format(
        dt.datetime.strptime(hr,"%H").strftime("%H:%M"), avg)
    )

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
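
The column swap is needed only because `sorted` compares the first element of each row by default. As an alternative sketch, the `key` argument of `sorted` can rank the `[hour, average]` pairs directly. The three sample pairs below are rounded values copied from the output above, and `sample_avg_by_hour` is a hypothetical name.

```python
# A small subset of [hour, average] pairs, rounded from the output above
sample_avg_by_hour = [["09", 5.58], ["15", 38.59], ["02", 23.81]]

# Sort on the second element (the average) without building a swapped list
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:2]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
# 15:00: 38.59 average comments per post
# 02:00: 23.81 average comments per post
```

One difference to keep in mind: when two hours have the same average, the swapped-list approach breaks the tie by the hour string, whereas sorting with `key` keeps tied rows in their original order.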


In US Eastern Time, the best time to create an ask post is 15:00 (3:00 p.m.), when ask posts average 38.59 comments per post.