# Exploring Hacker News Posts

In this project, we explore posts submitted to Hacker News. Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

## Data

The data can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). It contains almost 300,000 rows, each row representing a post. However, we use a version that has been reduced to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.
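
As a rough illustration of that kind of reduction, here is a minimal sketch that filters out rows without comments and then draws a random sample. It is not the procedure actually used to build the reduced file; the function name `reduce_posts`, the fixed seed, and the toy rows are all assumptions made for the example. It assumes the comment count sits in column 4 as a string, as in the column list of this dataset.

```python
import random

def reduce_posts(rows, sample_size, seed=0):
    """Keep only rows with at least one comment, then draw a random sample.

    `rows` is a list of lists without the header row; column 4 is
    assumed to hold `num_comments` as a string.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    # Never ask for more rows than are available.
    return rng.sample(commented, min(sample_size, len(commented)))

# Made-up rows, for illustration only (not taken from the dataset).
toy_rows = [
    ['1', 'A', '', '5', '0', 'u1', '1/1/2016 10:00'],
    ['2', 'B', '', '3', '2', 'u2', '1/1/2016 11:00'],
    ['3', 'C', '', '9', '7', 'u3', '1/1/2016 12:00'],
]
reduced = reduce_posts(toy_rows, sample_size=2)
print(len(reduced))  # 2
```

Because posts without comments are dropped before sampling, every average computed on the reduced data describes only posts that received at least one comment.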

## Descriptions of the columns:

- `id`: The unique identifier from Hacker News for the post
- `title`: The title of the post
- `url`: The URL that the post links to, if the post has a URL
- `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the post
- `author`: The username of the person who submitted the post
- `created_at`: The date and time at which the post was submitted
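
Since the rows are handled as plain lists, each column is reached by its position. A small sketch of how the positions could be looked up by name instead of hard-coding them; the names `column_names`, `col`, and `row` are introduced here only for illustration:

```python
# Column names in the order listed above.
column_names = ['id', 'title', 'url', 'num_points', 'num_comments',
                'author', 'created_at']

# Map each column name to its position in a row.
col = {name: index for index, name in enumerate(column_names)}

# One row from the dataset, as a list of strings.
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52',
       'ne0phyte', '8/4/2016 11:52']

print(col['num_comments'])            # 4
print(int(row[col['num_comments']]))  # 52
print(row[col['created_at']])         # 8/4/2016 11:52
```

The rest of this project uses the literal indices (`1` for the title, `4` for the comment count, `6` for the creation time), which correspond to this mapping.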

In this project, we are specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a question. Below are some examples:

    Ask HN: How to improve my personal website?
    Ask HN: Am I the only one outraged by Twitter shutting down share counts?
    Ask HN: Aby recent changes to CSS that broke mobile?

Users submit Show HN posts to show the community a project, product, or something interesting. Below are some examples:

    Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
    Show HN: Something pointless I made
    Show HN: Shanhu.io, a programming playground powered by e8vm

Our goal is to compare the two types of posts to determine:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?


## Read data and print first five rows

In [40]:
import pprint
pp = pprint.PrettyPrinter()
from csv import reader
with open('hacker_news.csv') as f:
    read_file = reader(f)
    hn = list(read_file)
    pp.pprint(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]


## Removing Headers from a List of Lists

In [41]:
headers = hn[0]
hn = hn[1:]
pp.pprint(headers)
pp.pprint(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '

## Extracting Ask HN and Show HN Posts

In [42]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
print("Number of ask hn post {}".format(len(ask_posts)))
print("Number of show hn post {}".format(len(show_posts)))
print("Number of other post {}".format(len(other_posts)))
        

Number of ask hn post 1744
Number of show hn post 1162
Number of other post 17194
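
The title check used in the loop above can also be wrapped in a small function, which makes the case-insensitive prefix match easier to test on its own. This is only a sketch; `classify_title` is a hypothetical helper, not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

print(classify_title('Ask HN: How to improve my personal website?'))  # ask
print(classify_title('Show HN: Something pointless I made'))          # show
print(classify_title('Interactive Dynamic Video'))                    # other
```

Lowercasing first means that variants such as `ASK HN:` or `ask hn:` land in the same group as `Ask HN:`.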


We separated the posts into three lists of lists: `ask_posts`, `show_posts`, and `other_posts`. There are 1744 Ask HN posts, 1162 Show HN posts, and 17194 other posts. The first five rows of each list are shown below.

In [43]:
print('ASK POSTS\n=====================')
pp.pprint(ask_posts[:5])
print('SHOW POSTS\n=====================')
pp.pprint(show_posts[:5])
print('OTHER POSTS\n=====================')
pp.pprint(other_posts[:5])

ASK POSTS
[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20'],
 ['10394168',
  'Ask HN: Someone offered to buy my browser extension from me. What now?',
  '',
  '28',
  '17',
  'roykolak',
  '10/15/2015 16:38']]
SHOW POSTS
[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development '
  'Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:4

## Calculating the Average Number of Comments for Ask HN and Show HN Posts

In [44]:
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
avg_ask_comments = total_ask_comments/len(ask_posts)
print ('Average number of comments for ask posts: {:.2f}'.format(avg_ask_comments))

total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
avg_show_comments = total_show_comments/len(show_posts)
print ('Average number of comments for show posts: {:.2f}'.format(avg_show_comments))


Average number of comments for ask posts: 14.04
Average number of comments for show posts: 10.32


On average, Ask HN posts receive more comments than Show HN posts: about 14 comments per post versus about 10.

One plausible explanation is that people are more likely to answer a question than to comment on a project showcase, which would make Ask HN posts more likely to attract comments.
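
The two averaging loops above are identical apart from the list they read, so the calculation could be factored into one function. A minimal sketch, assuming the comment count is in column 4; `average_comments` and `sample_posts` are names made up for this example:

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Two Ask HN rows from the dataset, with 6 and 29 comments.
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020',
     'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
print(average_comments(sample_posts))  # 17.5
```

In the notebook this would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`. The empty-list guard avoids a `ZeroDivisionError` if a group has no posts.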

## Finding the Number of Ask Posts and Comments by Hour Created

In [45]:
import datetime as dt

result_list = []
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}
date_format = '%m/%d/%Y %H:%M'
for row in result_list:
    created_at = dt.datetime.strptime(row[0], date_format)
    hour = created_at.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
print('Posts created by hour:')
pp.pprint(counts_by_hour)
print('======================================')
print('Comments posted by hour:')
pp.pprint(comments_by_hour)

Posts created by hour:
{'00': 55,
 '01': 60,
 '02': 58,
 '03': 54,
 '04': 47,
 '05': 46,
 '06': 44,
 '07': 34,
 '08': 48,
 '09': 45,
 '10': 59,
 '11': 58,
 '12': 73,
 '13': 85,
 '14': 107,
 '15': 116,
 '16': 108,
 '17': 100,
 '18': 109,
 '19': 110,
 '20': 80,
 '21': 109,
 '22': 71,
 '23': 68}
Comments posted by hour:
{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}


Above, we created two dictionaries: `counts_by_hour` holds the number of Ask HN posts created during each hour, and `comments_by_hour` holds the total number of comments received by the posts created during that hour. The hours are in 24-hour format. For example, `100` posts were created at hour `17` (5 pm), and those posts received `1146` comments in total.
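
The same tally can be written without the `if`/`else` branch by using `collections.defaultdict`, which starts every missing key at zero. The sketch below is an alternative formulation, not the code used above; `tally_by_hour` and `sample_rows` are hypothetical names, and the three sample pairs are taken from Ask HN rows in the dataset.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = '%m/%d/%Y %H:%M'

def tally_by_hour(rows):
    """Return (counts, comments) dictionaries keyed by zero-padded hour."""
    counts = defaultdict(int)    # posts created in each hour
    comments = defaultdict(int)  # comments received by those posts
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime('%H')
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)

# (created_at, num_comments) pairs from three Ask HN posts.
sample_rows = [('8/16/2016 9:55', 6),
               ('11/22/2015 13:43', 29),
               ('5/2/2016 10:14', 1)]
counts, comments = tally_by_hour(sample_rows)
print(counts)    # {'09': 1, '13': 1, '10': 1}
print(comments)  # {'09': 6, '13': 29, '10': 1}
```

Note that `strptime` accepts the unpadded hour `9:55`, while `strftime('%H')` always returns a two-digit string such as `'09'`, which is why the dictionary keys are zero-padded.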

## Calculating the Average Number of Comments for Ask HN Posts by Hour

Now let's calculate the average number of comments for posts created during each hour of the day. We'll use the `counts_by_hour` and `comments_by_hour` dictionaries.

In [46]:
avg_by_hour = []
# Divide each hour's total comments by the number of posts created in that hour
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
print("Average no's of comments per post:")
pp.pprint(avg_by_hour)

Average no's of comments per post:
[['00', 8.127272727272727],
 ['11', 11.051724137931034],
 ['22', 6.746478873239437],
 ['06', 9.022727272727273],
 ['18', 13.20183486238532],
 ['14', 13.233644859813085],
 ['05', 10.08695652173913],
 ['07', 7.852941176470588],
 ['15', 38.5948275862069],
 ['23', 7.985294117647059],
 ['04', 7.170212765957447],
 ['20', 21.525],
 ['19', 10.8],
 ['16', 16.796296296296298],
 ['01', 11.383333333333333],
 ['12', 9.41095890410959],
 ['10', 13.440677966101696],
 ['02', 23.810344827586206],
 ['21', 16.009174311926607],
 ['03', 7.796296296296297],
 ['17', 11.46],
 ['08', 10.25],
 ['13', 14.741176470588234],
 ['09', 5.5777777777777775]]
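
As a worked check of the division above, the sketch below recomputes three of the averages from the counts and totals printed earlier (hours 15, 02, and 09), using a dictionary comprehension instead of a loop. The `_sample` names are introduced here so the example does not overwrite the notebook's dictionaries.

```python
# Values copied from the per-hour output above for three hours.
counts_sample = {'15': 116, '02': 58, '09': 45}
comments_sample = {'15': 4477, '02': 1381, '09': 251}

# Average comments per post = total comments / number of posts, per hour.
avg_sample = {hour: comments_sample[hour] / counts_sample[hour]
              for hour in counts_sample}

for hour, avg in sorted(avg_sample.items()):
    print('{}: {:.2f}'.format(hour, avg))
# 02: 23.81
# 09: 5.58
# 15: 38.59
```

Hour 15 has both a large number of posts (116) and a much larger comment total (4477), which is what pushes its average so far above the others.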


## Sorting and Printing Values from a List of Lists

In [47]:
swap_avg_by_hour = []
for h, c in avg_by_hour:
    swap_avg_by_hour.append([c,h])
pp.pprint(swap_avg_by_hour)

[[8.127272727272727, '00'],
 [11.051724137931034, '11'],
 [6.746478873239437, '22'],
 [9.022727272727273, '06'],
 [13.20183486238532, '18'],
 [13.233644859813085, '14'],
 [10.08695652173913, '05'],
 [7.852941176470588, '07'],
 [38.5948275862069, '15'],
 [7.985294117647059, '23'],
 [7.170212765957447, '04'],
 [21.525, '20'],
 [10.8, '19'],
 [16.796296296296298, '16'],
 [11.383333333333333, '01'],
 [9.41095890410959, '12'],
 [13.440677966101696, '10'],
 [23.810344827586206, '02'],
 [16.009174311926607, '21'],
 [7.796296296296297, '03'],
 [11.46, '17'],
 [10.25, '08'],
 [14.741176470588234, '13'],
 [5.5777777777777775, '09']]


In [48]:
# sort by the average number of comments
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
pp.pprint(sorted_swap[:5])

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21']]


As shown above, we sorted the swapped list in descending order and printed the five hours with the highest average number of comments per Ask HN post. Hour 15 (3 pm) has the highest average with 38.59 comments per post, followed by hour 02 (2 am) with 23.81.
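
The swap is needed only because `sorted` compares lists element by element, starting with the first. As an alternative sketch, the `key` argument of `sorted` can point at the second element directly, which keeps the `[hour, average]` order. The four sample pairs below are copied from the averages computed earlier; `sample_avg` and `top_hours` are names made up for the example.

```python
# [hour, average] pairs copied from the averages computed earlier.
sample_avg = [['00', 8.127272727272727],
              ['15', 38.5948275862069],
              ['02', 23.810344827586206],
              ['20', 21.525]]

# Sort by the average (index 1), highest first, without swapping columns.
top_hours = sorted(sample_avg, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print(hour, round(avg, 1))
# 15 38.6
# 02 23.8
# 20 21.5
```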

In [49]:
print ('Top 5 Hours for Ask Posts Comments', '\n')
for comment, hour in sorted_swap[:5]:
    each_hour = dt.datetime.strptime(hour, '%H').strftime('%H:%M')
    comment_per_hour = '{h}: {c:.2f} average comments per post'.format(h = each_hour, c = comment)
    print(comment_per_hour)

Top 5 Hours for Ask Posts Comments 

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


## Conclusions

Let's summarize the project.

**Post title:** in this dataset, Ask HN posts attract more comments on average than Show HN posts:

    Ask HN: 14.04 average comments per post
    Show HN: 10.32 average comments per post

**Post timing:** the hour at which a post is created appears to affect the number of comments it attracts. Based on an analysis of the Ask HN posts, the top hours by average comments per post are:

    15:00: 38.59 average comments per post
    02:00: 23.81 average comments per post
    20:00: 21.52 average comments per post
    16:00: 16.80 average comments per post
    21:00: 16.01 average comments per post