# Exploring Hacker News Posts

* We are interested in analyzing posts on Hacker News that received comments.

* We are mainly interested in two types of posts: Ask HN and Show HN.

* We want to compare the two to determine which type of post receives more comments on average.

* We also want to find out whether posts created at a certain time of day receive more comments on average.

# 1 & 2 - Opening the 'hacker_news.csv' file and reading it in

* We open the dataset file hacker_news.csv and read it into a list of lists named hn, keeping the header row separately in headers.

In [1]:
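from csv import reader

# Open the file in a with-block so it is closed once reading is done
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

# Separate the header row from the data rows
headers = hn[0]
hn = hn[1:]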

print(headers)
print('\n')
for row in hn[:5]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




# 3 - Extracting Ask HN and Show HN posts:

* We will filter posts for Ask HN and Show HN. These 2 types are the only ones we are interested in.

* We will loop through the main dataset and extract the Ask HN and Show HN posts into their own separate lists.

* After this is done, the length of each list will be checked.

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('ask_posts size: ' + str(len(ask_posts)))
print('show_posts size: ' + str(len(show_posts)))
print('other_posts size: ' + str(len(other_posts)))   

ask_posts size: 1744
show_posts size: 1162
other_posts size: 17194


# 4 - Calculating the Average Number of Comments for Ask and Show posts

In [5]:
#Finding average # of comments on ask HN posts.
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = total_ask_comments/len(ask_posts)

print("Average number of comments for Ask HN posts is: " + str(avg_ask_comments))
print('\n')

#Finding average # of comments on show HN posts.
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments/len(show_posts)

print("Average number of comments for the Show HN posts is: " + str(avg_show_comments))
print('\n')

Average number of comments for Ask HN posts is: 14.038417431192661


Average number of comments for the Show HN posts is: 10.31669535283993




* We can see from the results of our analysis that Ask HN posts receive roughly 36% more comments on average than Show HN posts (14.04 versus 10.32). This is in addition to the fact that there are about 50% more Ask HN posts than Show HN posts (1744 versus 1162).

* This could be due to the very nature of Ask HN posts: they are by definition "asking" for a response, and the questions tend to be more urgent or engaging.
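
* The relative differences quoted above can be checked with a short calculation. This is a minimal sketch that copies the averages and list sizes from the printed output; the helper name percent_more is our own and is not part of the original notebook.

```python
# Values copied from the output printed above
avg_ask_comments = 14.038417431192661
avg_show_comments = 10.31669535283993
ask_count = 1744
show_count = 1162


def percent_more(a, b):
    """Return how much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100


print('{:.1f}% more comments per post'.format(
    percent_more(avg_ask_comments, avg_show_comments)))
print('{:.1f}% more posts'.format(percent_more(ask_count, show_count)))
```

* The first figure compares averages and the second compares list sizes, so they describe different things: how engaging a typical post is versus how often that type of post is made.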

# 5 - Calculating the number of Ask HN posts created in each hour, as well as the total number of comments they received

In [8]:
import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append([row[6] ,int(row[4])])
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    p_date = dt.datetime.strptime(date, '%m/%d/%Y %H:%M')
    hour = p_date.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

* Examining the contents of our newly created counts_by_hour and comments_by_hour dictionaries. Both are keyed by the hour of the day, from 0 (12 AM) to 23 (11 PM): the first holds the number of Ask HN posts created in that hour, the second the total number of comments those posts received.
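
* The same tally can be written more compactly with collections.defaultdict, which starts every missing key at 0 so no membership check is needed. This is a minimal sketch, assuming a few made-up sample rows in the same [created_at, num_comments] shape as result_list; the name sample_rows is hypothetical.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical sample rows (not taken from the dataset), shaped like result_list
sample_rows = [
    ['8/4/2016 11:52', 52],
    ['8/5/2016 11:10', 8],
    ['1/26/2016 19:30', 10],
]

counts_by_hour = defaultdict(int)    # number of posts created in each hour
comments_by_hour = defaultdict(int)  # total comments on those posts

for created_at, num_comments in sample_rows:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```

* Two of the sample rows fall in hour 11, so that hour ends up with a count of 2 and a combined comment total, which is the accumulation step the loop in the cell above performs with an explicit if/else.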

In [10]:
print(counts_by_hour)
print('\n')
print(comments_by_hour)

{0: 55, 1: 60, 2: 58, 3: 54, 4: 47, 5: 46, 6: 44, 7: 34, 8: 48, 9: 45, 10: 59, 11: 58, 12: 73, 13: 85, 14: 107, 15: 116, 16: 108, 17: 100, 18: 109, 19: 110, 20: 80, 21: 109, 22: 71, 23: 68}


{0: 447, 1: 683, 2: 1381, 3: 421, 4: 337, 5: 464, 6: 397, 7: 267, 8: 492, 9: 251, 10: 793, 11: 641, 12: 687, 13: 1253, 14: 1416, 15: 4477, 16: 1814, 17: 1146, 18: 1439, 19: 1188, 20: 1722, 21: 1745, 22: 479, 23: 543}


# 6 - Calculating the Average Number of Comments for Ask HN posts by Hour.

* Using the data contained in our two newly created dictionaries, we can now calculate the average number of comments received by Ask HN posts for each hour.

In [11]:
avg_by_hour = []
for hour in comments_by_hour:
    average = comments_by_hour[hour]/counts_by_hour[hour]
    avg_by_hour.append([hour, average])

print(avg_by_hour)

[[0, 8.127272727272727], [1, 11.383333333333333], [2, 23.810344827586206], [3, 7.796296296296297], [4, 7.170212765957447], [5, 10.08695652173913], [6, 9.022727272727273], [7, 7.852941176470588], [8, 10.25], [9, 5.5777777777777775], [10, 13.440677966101696], [11, 11.051724137931034], [12, 9.41095890410959], [13, 14.741176470588234], [14, 13.233644859813085], [15, 38.5948275862069], [16, 16.796296296296298], [17, 11.46], [18, 13.20183486238532], [19, 10.8], [20, 21.525], [21, 16.009174311926607], [22, 6.746478873239437], [23, 7.985294117647059]]


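* We can see from the data above that the average number of comments spikes for posts created during the 3 PM hour (about 38.6 comments per post, from 4477 comments across 116 posts). Smaller peaks appear at 2 AM (about 23.8) and 8 PM (about 21.5), while the other hours between 1 PM and 5 PM stay between roughly 11 and 17 comments per post.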
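
* The hour with the highest average can also be located programmatically rather than by reading the list. This is a minimal sketch that copies the two dictionaries from the output printed earlier; the names avg_lookup and peak_hour are our own.

```python
# Dictionaries copied from the output printed earlier in this notebook
counts_by_hour = {0: 55, 1: 60, 2: 58, 3: 54, 4: 47, 5: 46, 6: 44, 7: 34,
                  8: 48, 9: 45, 10: 59, 11: 58, 12: 73, 13: 85, 14: 107,
                  15: 116, 16: 108, 17: 100, 18: 109, 19: 110, 20: 80,
                  21: 109, 22: 71, 23: 68}
comments_by_hour = {0: 447, 1: 683, 2: 1381, 3: 421, 4: 337, 5: 464, 6: 397,
                    7: 267, 8: 492, 9: 251, 10: 793, 11: 641, 12: 687,
                    13: 1253, 14: 1416, 15: 4477, 16: 1814, 17: 1146,
                    18: 1439, 19: 1188, 20: 1722, 21: 1745, 22: 479, 23: 543}

# Average comments per post for each hour, stored as hour -> average
avg_lookup = {hour: comments_by_hour[hour] / counts_by_hour[hour]
              for hour in counts_by_hour}

# max() with key=avg_lookup.get returns the hour whose average is largest
peak_hour = max(avg_lookup, key=avg_lookup.get)
print(peak_hour, round(avg_lookup[peak_hour], 2))
```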

# 7 - Sorting and Printing Values from a List of Lists:

In [13]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)
print('\n')

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print(sorted_swap[:5])


[[8.127272727272727, 0], [11.383333333333333, 1], [23.810344827586206, 2], [7.796296296296297, 3], [7.170212765957447, 4], [10.08695652173913, 5], [9.022727272727273, 6], [7.852941176470588, 7], [10.25, 8], [5.5777777777777775, 9], [13.440677966101696, 10], [11.051724137931034, 11], [9.41095890410959, 12], [14.741176470588234, 13], [13.233644859813085, 14], [38.5948275862069, 15], [16.796296296296298, 16], [11.46, 17], [13.20183486238532, 18], [10.8, 19], [21.525, 20], [16.009174311926607, 21], [6.746478873239437, 22], [7.985294117647059, 23]]


[[38.5948275862069, 15], [23.810344827586206, 2], [21.525, 20], [16.796296296296298, 16], [16.009174311926607, 21]]
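
* Swapping the columns is one way to sort by the average. An alternative, sketched here under the assumption that only the ordering matters, is to pass a key function to sorted() so the original [hour, average] layout is kept. The name sorted_by_avg is our own, and only the first four pairs from avg_by_hour are copied to keep the example short.

```python
# First four [hour, average] pairs copied from avg_by_hour above
avg_by_hour = [
    [0, 8.127272727272727],
    [1, 11.383333333333333],
    [2, 23.810344827586206],
    [3, 7.796296296296297],
]

# Sort on the second element (the average) without building a swapped copy
sorted_by_avg = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in sorted_by_avg:
    print('{}:00: {:.2f} average comments per post'.format(hour, average))
```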


* Printing every entry of sorted_swap (a list of lists, like swap_avg_by_hour) in descending order of average comments per post.


In [14]:
for row in sorted_swap:
    print('{}:00: {:.2f} average comments per post'.format(row[1], row[0]))

15:00: 38.59 average comments per post
2:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
18:00: 13.20 average comments per post
17:00: 11.46 average comments per post
1:00: 11.38 average comments per post
11:00: 11.05 average comments per post
19:00: 10.80 average comments per post
8:00: 10.25 average comments per post
5:00: 10.09 average comments per post
12:00: 9.41 average comments per post
6:00: 9.02 average comments per post
0:00: 8.13 average comments per post
23:00: 7.99 average comments per post
7:00: 7.85 average comments per post
3:00: 7.80 average comments per post
4:00: 7.17 average comments per post
22:00: 6.75 average comments per post
9:00: 5.58 average comments per post


# Conclusion:

* According to our results, the best hour to create an Ask HN post is between 3 PM and 4 PM, with an average of 38.59 comments per post. The next best hour is between 2 AM and 3 AM (23.81), followed closely by the hour between 8 PM and 9 PM (21.52).
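
* The 12-hour windows named above can be derived from the 24-hour values in the sorted output. This is a minimal sketch; the helper names to_12h and hour_window are hypothetical and not part of the original notebook.

```python
def to_12h(hour):
    """Format an hour in 0..23 as a 12-hour clock label such as '3 PM'."""
    suffix = 'AM' if hour < 12 else 'PM'
    # hour % 12 is 0 for midnight and noon, which are written as 12
    return '{} {}'.format(hour % 12 or 12, suffix)


def hour_window(hour):
    """Describe the one-hour window that starts at the given hour."""
    return '{} to {}'.format(to_12h(hour), to_12h((hour + 1) % 24))


# Top three hours taken from the sorted output above
for hour in [15, 2, 20]:
    print(hour_window(hour))
```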