# Exploring Hacker News Posts

This project identifies the types of Hacker News posts, the time each was posted, and the number of comments each received, using the response as a measure of a post's importance and effectiveness. By analyzing this data, we can find out which type of post draws the most feedback and what time of day is the most reasonable to post.

## Reading The Data

To start, we need to see what we are working with. Using reader from the csv module, we open the file, read it into a list of lists, and look at it. Here we show the first 5 rows, header included.

In [1]:
from csv import reader

# Read the whole CSV into a list of rows; the with block closes the file afterwards.
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


## Extracting the Header

A header row mixed in with the data would break the calculations that follow. Next we separate the header from the data and assign each to its own variable.

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Extracting Ask HN and Show HN Posts

Let's divide the posts by type to drill down into more usable, refined data.

We will give each type its own list and use those lists to perform further actions. Here we split the data into "Ask HN" and "Show HN" posts, with everything else going into a third list. A for loop with an if/elif/else chain will do nicely for this task.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


Above is the size of each list. Other posts dominate, but there are about 50% more "Ask HN" posts than "Show HN" posts (1744 versus 1162). Clearly, "Ask HN" is the more frequently used of the two post types.
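
As a quick sanity check on the 50% figure, the ratio can be computed directly from the three counts printed above. This is only a sketch: the variable names are illustrative, and the numbers are copied from the output instead of being recomputed from the CSV.

```python
# Counts copied from the printed output above (not recomputed from the CSV).
ask_count = 1744
show_count = 1162
other_count = 17194

total = ask_count + show_count + other_count
ratio = ask_count / show_count

print(total)            # total number of posts across the three lists
print(round(ratio, 2))  # Ask HN posts per Show HN post
```

A ratio of about 1.5 means roughly three "Ask HN" posts for every two "Show HN" posts.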

Below we will examine the first 5 rows of "Ask HN" and "Show HN" posts.

In [22]:
print(ask_posts[:5])
print('')
print(show_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', '

## Calculating the Average Number of Comments for Ask HN and Show HN Posts

To start, we will define a function that computes the average number of comments, to determine which post type receives more comments on average.

Then we will display the averages.

In [5]:
def avg_comments(posts):
    total_comments = 0
    for row in posts:
        num_comments = int(row[4])  # num_comments column
        total_comments += num_comments
    # Divide by the number of posts in the list that was passed in.
    avg = total_comments / len(posts)
    return avg

avg_ask_comments = avg_comments(ask_posts)
avg_show_comments = avg_comments(show_posts)

print(round(avg_ask_comments, 1))
print(round(avg_show_comments, 1))

14.0
10.3


It would seem that "Ask HN" posts receive roughly a third more comments on average than "Show HN" posts (about 14 versus about 10.3). This is likely due to the fact that a question is being asked in "Ask HN" posts, which is far more likely to elicit a response. This is in line with our discovery that there are more "Ask HN" posts to begin with.
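
A reusable average helper should divide by the length of whichever list it is given, so that the same function works for both post types. Below is a minimal sketch with made-up rows; the helper name mean_comments and the sample data are hypothetical and are not taken from the dataset.

```python
def mean_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows in the same column layout as the dataset:
# id, title, url, num_points, num_comments, author, created_at
sample_ask = [
    ['1', 'Ask HN: A?', '', '3', '10', 'a', '1/1/2016 10:00'],
    ['2', 'Ask HN: B?', '', '5', '20', 'b', '1/1/2016 11:00'],
]
sample_show = [
    ['3', 'Show HN: C', 'http://example.com', '7', '6', 'c', '1/2/2016 9:00'],
]

print(mean_comments(sample_ask))   # 15.0
print(mean_comments(sample_show))  # 6.0
```

Because the divisor comes from the argument itself, the two lists can have different lengths without skewing either average.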

## Finding the Amount of Ask Posts and Comments by Hour Created

Using Python's datetime module, let's find out at what hour "Ask HN" posts are created and how many comments they receive.

Two for loops and some parsing of the time data in the dataset are all we need to achieve this. Below is the work.

In [6]:
import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append([row[6],int(row[4])])
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    hour = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M').strftime("%H")
    if hour in comments_by_hour:
        comments_by_hour[hour] += row[1]
        counts_by_hour[hour] += 1
    else:
        comments_by_hour[hour] = row[1]
        counts_by_hour[hour] = 1
        
print(counts_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


Above we see that most posts are made in the afternoon and early evening, roughly between 14:00 and 21:00. The dictionary is unordered and hard to scan, so later on we will sort the per-hour results.
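
To make the parsing step concrete, here is what strptime does to a single created_at value. The sketch uses the timestamp from the first data row shown earlier; the variable names are illustrative.

```python
import datetime as dt

created_at = '8/4/2016 11:52'  # created_at value of the first data row

# '%m/%d/%Y %H:%M' matches month/day/year followed by a 24-hour time.
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

hour_as_text = parsed.strftime('%H')  # zero-padded string, as used for the dictionary keys
hour_as_int = parsed.hour             # the same hour as an integer

print(hour_as_text)  # 11
print(hour_as_int)   # 11
```

Keeping the zero-padded string means keys such as '09' sort correctly as text, while the integer form is handier for arithmetic.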

In [7]:
print(comments_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


Again, we have a similar result, with a bias toward the afternoon and a pronounced peak at 15:00.

## Calculating the Average Number of Comments for Ask HN posts by Hour

With the two dictionaries above giving us the number of posts and the number of comments, the next step is to establish exactly what the average number of comments per post is for each hour. It could be that hours with few posts have a large number of comments attached to them. Let's find out.

In [8]:
avg_by_hour = []

for hour in comments_by_hour:
    # Average comments per post for this hour.
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
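
Each average is simply that hour's comment total divided by its post count. As a sketch, the same result can be built with a dictionary comprehension. The two small dictionaries below copy three entries from the printed output above, and the names sample_comments, sample_counts, and avg_sample are illustrative.

```python
# Three entries copied from the printed dictionaries above.
sample_comments = {'15': 4477, '02': 1381, '09': 251}
sample_counts = {'15': 116, '02': 58, '09': 45}

# Divide each hour's comment total by that hour's post count.
avg_sample = {
    hour: sample_comments[hour] / sample_counts[hour]
    for hour in sample_comments
}

for hour, avg in avg_sample.items():
    print(hour, round(avg, 2))
```

The 15:00 entry works out to 4477 / 116, which is about 38.59 comments per post.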


## Sorting the List

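With 24 hours in a day and no order to the averages, we will sort the list by average number of comments.

By swapping the two columns so that the average comes first, we can sort the list with sorted and then print the five hours with the highest averages.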

In [21]:
swap_avg_by_hour = []

for row in avg_by_hour:
    avg = row[1]
    hour = row[0]
    swap_avg_by_hour.append([avg, hour])
    
print(swap_avg_by_hour)
print('')

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")

for avg, hour in sorted_swap[:5]:
    # hour is already a zero-padded string such as '15'.
    print("{hour}:00: {avg:.2f} average comments per post".format(hour=hour, avg=avg))

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
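
The swap is one way to sort by average. As an alternative sketch, sorted also accepts a key function, which lets the [hour, average] pairs be sorted directly without building a swapped copy. The sample list below copies five pairs from the output above, rounded to two decimals; the name sample_avg_by_hour is illustrative.

```python
# Five [hour, average] pairs copied from the output above (rounded).
sample_avg_by_hour = [
    ['09', 5.58],
    ['16', 16.80],
    ['15', 38.59],
    ['20', 21.52],
    ['02', 23.81],
]

# Sort on the second element (the average), highest first.
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:3]:
    print("{hour}:00: {avg:.2f} average comments per post".format(hour=hour, avg=avg))
```

With the key function, the original [hour, average] order is preserved in the result, so no second list is needed.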


## Conclusion

Since the timestamps in this dataset are in the time zone I live in (EST), it makes the most sense for me to post at 3 p.m., when "Ask HN" posts average 38.59 comments, or failing that at 4 p.m. (16.80), to receive the most feedback and help.