# Interesting or Not? Exploring Hacker News Posts
In this project we'll explore posts on Hacker News to see which type draws more engagement: `Ask HN` or `Show HN`. `Ask HN` posts ask the community a question, while `Show HN` posts share a project, product, or other work that the community may find interesting. We'll compare the two types by answering the following questions:
* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

In [1]:
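# Import the reader for parsing CSV files
from csv import reader

# Open the CSV, read it, and turn it into a list of lists;
# the with-block closes the file once reading is done
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

# Separate the header row from the data rows
hn_header, hn = hn[0], hn[1:]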

In [2]:
# Display header
hn_header

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
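
Because every row is a plain list, columns are accessed by position. As a minimal sketch (not part of the original analysis), the positions can be looked up by name from the header shown above instead of being hard-coded; `column_index` is a hypothetical helper name introduced here for illustration.

```python
# Header row as displayed above
hn_header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row
column_index = {name: position for position, name in enumerate(hn_header)}

print(column_index['title'])
print(column_index['num_comments'])
print(column_index['created_at'])
```

With this mapping, `row[column_index['num_comments']]` reads the same value as `row[4]`, but stays correct if the column order ever changes.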

In [3]:
# Display first 5 rows
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## Data Filtering
In this section, we separate posts whose titles begin with `Ask HN` or `Show HN` into their own lists. Every other post goes into a third list.

In [4]:
# Initialize one list per post type
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("ask_posts has {} posts".format(len(ask_posts)))
print("show_posts has {} posts".format(len(show_posts)))
print("other_posts has {} posts".format(len(other_posts)))

ask_posts has 1744 posts
show_posts has 1162 posts
other_posts has 17194 posts
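
The same prefix-based split can be written as a reusable function. The sketch below is one possible generalization, assuming rows keep the title in column 1; `split_by_prefix` and the sample rows are hypothetical (the last sample title is made up to show case-insensitive matching).

```python
def split_by_prefix(rows, prefixes, title_index=1):
    """Group rows by the first matching title prefix (case-insensitive).

    `prefixes` must be given in lower case. Rows that match no prefix
    are collected under the key "other".
    """
    groups = {prefix: [] for prefix in prefixes}
    groups["other"] = []
    for row in rows:
        title = row[title_index].lower()
        for prefix in prefixes:
            if title.startswith(prefix):
                groups[prefix].append(row)
                break
        else:
            # No prefix matched this title
            groups["other"].append(row)
    return groups


# Small hand-written sample in [id, title] form
sample_rows = [
    ["1", "Ask HN: How to improve my personal website?"],
    ["2", "Show HN: Something pointless I made"],
    ["3", "Interactive Dynamic Video"],
    ["4", "ASK HN: made-up title in upper case"],
]

groups = split_by_prefix(sample_rows, ["ask hn", "show hn"])
counts = {name: len(rows) for name, rows in groups.items()}
print(counts)
```

Lower-casing the title before comparing is what lets `Ask HN`, `ASK HN`, and `ask hn` all land in the same group.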


In [5]:
# Head of ask posts
ask_posts[:5]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20'],
 ['10394168',
  'Ask HN: Someone offered to buy my browser extension from me. What now?',
  '',
  '28',
  '17',
  'roykolak',
  '10/15/2015 16:38']]

In [6]:
# Head of show posts
show_posts[:5]

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05'],
 ['12178806',
  'Show HN: Webscope  Easy way for web developers to communicate with Clients',
  'http://webscopeapp.com',
  '3',
  '3',
  'fastbrick',
  '7/28/2016 7:11'],
 ['10872799',
  'Show HN: GeoScreenshot  Easily test Geo-IP based web pages',
  'https://www.geoscreenshot.com/',
  '1',
  '9',
  'kpsychwave',
  '1/9/2016 20:45']]

In [7]:
# Head of other posts
other_posts[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## Data Analysis
Now that we've separated the posts and looked at a few examples of each type, let's tackle our first question: which of the two receives more comments on average? To answer it, we need the total number of comments and the average number of comments per post for both the `Ask HN` and the `Show HN` posts.

In [8]:
def compute_avg_comments(posts, num_comments_index):
    total_comments = 0
    for post in posts:
        # get num_comment column and add it to total
        total_comments += int(post[num_comments_index])
    
    # compute average:
    avg_num_comments = total_comments / len(posts)
        
    return avg_num_comments, total_comments

In [9]:
# ask_posts' number of comments
avg_ask_comments, total_ask_comments = compute_avg_comments(ask_posts, num_comments_index=4)
print("Ask HN:")
print("There are a total of {:,} comments in Ask HN posts".format(total_ask_comments))
print("On average, an Ask HN post receives {:,.2f} comments".format(avg_ask_comments))

Ask HN:
There are a total of 24,483 comments in Ask HN posts
On average, an Ask HN post receives 14.04 comments


In [10]:
# show_posts' number of comments
avg_show_comments, total_show_comments = compute_avg_comments(show_posts, num_comments_index=4)
print("Show HN:")
print("There are a total of {:,} comments in Show HN posts".format(total_show_comments))
print("On average, a Show HN post receives {:,.2f} comments".format(avg_show_comments))

Show HN:
There are a total of 11,988 comments in Show HN posts
On average, a Show HN post receives 10.32 comments


Setting aside the posts that are neither `Ask HN` nor `Show HN`, our findings suggest that Hacker News users engage more with `Ask HN` posts than with `Show HN` posts. Possibly this is because many of us find it satisfying to answer confused or curious people about things we have experienced or know well.

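In total, `Ask HN` posts received about twice as many comments as `Show HN` posts (`24,483` versus `11,988`), but there are also more `Ask HN` posts (`1,744` versus `1,162`). The per-post averages are the fairer comparison: `14.04` comments for `Ask HN` versus `10.32` for `Show HN`. On average, then, `Ask HN` posts receive more comments than `Show HN` posts.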
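
As a quick arithmetic check, the sketch below recomputes both comparisons from the totals and post counts printed earlier. The variable names are introduced here for illustration only.

```python
# Totals and post counts as printed in the cells above
ask_total, ask_count = 24483, 1744
show_total, show_count = 11988, 1162

# Average comments per post for each type
ask_avg = ask_total / ask_count
show_avg = show_total / show_count

# How many times larger Ask HN is, by total and by per-post average
total_ratio = ask_total / show_total
avg_ratio = ask_avg / show_avg

print("Ask HN average:  {:.2f}".format(ask_avg))
print("Show HN average: {:.2f}".format(show_avg))
print("Ratio of totals:   {:.2f}".format(total_ratio))
print("Ratio of averages: {:.2f}".format(avg_ratio))
```

The gap in totals is larger than the gap in averages because part of it comes from there simply being more `Ask HN` posts.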

---

Now that we have determined that `Ask HN` posts draw more comments than `Show HN` posts, we'll focus on `Ask HN` posts for the rest of the analysis.

Next, we tackle the second question: do posts created at a certain time of day receive more comments than others?

To answer it, we need to count the ask posts created in each hour of the day, along with the total number of comments those posts received. From these two figures we can then compute the average number of comments per ask post for each hour.

In [11]:
# we need datetime module for handling dates
import datetime as dt

# retrieve created_at and num_comments columns in data
result_list = []
for post in ask_posts:
    created_at = post[-1]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])

# create freq table (dictionary) for posts 
# and its total comments per hour
counts_by_hour = {}
comments_by_hour = {}
for result in result_list:
    # extract list
    created_at, num_comments = result
    
    # convert string to datetime and extract hour as string
    date = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    
    # create freq tables for counts and comment by hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments

The code above builds two frequency tables (as dictionaries): the number of ask posts created in each hour of the day, and the total number of comments received by the posts created in that hour.
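
The same two tables can also be built with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. This is an alternative sketch rather than the approach used in this notebook; the sample rows reuse timestamps and comment counts from the ask posts shown earlier, plus one made-up row so that one hour appears twice.

```python
from collections import defaultdict
import datetime as dt

# [created_at, num_comments] pairs; the last row is made up
sample = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 10:14", 1],
    ["8/2/2016 14:20", 3],
    ["10/15/2015 16:38", 17],
    ["1/1/2016 9:05", 4],
]

# Missing keys start at 0, so no explicit membership check is needed
counts = defaultdict(int)
comments = defaultdict(int)

for created_at, num_comments in sample:
    # Parse the timestamp and keep the zero-padded hour as a string
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```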

In [12]:
print("Number of ask posts created per hour of the day:")
counts_by_hour

Number of ask posts created per hour of the day:


{'00': 55,
 '01': 60,
 '02': 58,
 '03': 54,
 '04': 47,
 '05': 46,
 '06': 44,
 '07': 34,
 '08': 48,
 '09': 45,
 '10': 59,
 '11': 58,
 '12': 73,
 '13': 85,
 '14': 107,
 '15': 116,
 '16': 108,
 '17': 100,
 '18': 109,
 '19': 110,
 '20': 80,
 '21': 109,
 '22': 71,
 '23': 68}

In [38]:
print("Total number of comments of all ask posts per hour of the day")
comments_by_hour

Total number of comments of all ask posts per hour of the day


{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

Next, we compute the average number of comments per post for each hour of the day.

In [22]:
avg_by_hour = []
for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])
avg_by_hour.sort()

In [23]:
avg_by_hour

[['00', 8.127272727272727],
 ['01', 11.383333333333333],
 ['02', 23.810344827586206],
 ['03', 7.796296296296297],
 ['04', 7.170212765957447],
 ['05', 10.08695652173913],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['08', 10.25],
 ['09', 5.5777777777777775],
 ['10', 13.440677966101696],
 ['11', 11.051724137931034],
 ['12', 9.41095890410959],
 ['13', 14.741176470588234],
 ['14', 13.233644859813085],
 ['15', 38.5948275862069],
 ['16', 16.796296296296298],
 ['17', 11.46],
 ['18', 13.20183486238532],
 ['19', 10.8],
 ['20', 21.525],
 ['21', 16.009174311926607],
 ['22', 6.746478873239437],
 ['23', 7.985294117647059]]

In [32]:
# swap placement of hour and avg in list to sort by avg
swap_avg_by_hour = []
for hour, val in avg_by_hour:
    swap_avg_by_hour.append([val, hour])
swap_avg_by_hour.sort(reverse=True)

In [36]:
swap_avg_by_hour

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]

In [37]:
# Display top 5 hours of the day for average comment per post
for avg, hour in swap_avg_by_hour[:5]:
    hour = dt.datetime.strptime(hour, "%H")
    hour_str = hour.strftime("%H:%M")
    msg = "{}: {:.2f} average comments per post.".format(hour_str, avg)
    print(msg)

15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


We calculated the average number of comments per post for each hour of the day and stored the results as a list of hour and average pairs. We then sorted that list by average in descending order.
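
Swapping the columns is one way to sort by average. As an alternative sketch, `sorted` with a `key` function orders the pairs by their second element without building a swapped copy. The sample below uses a few of the averages from above, rounded to two decimals; `sample_avg_by_hour`, `ranked`, and `top_hours` are names introduced here.

```python
# [hour, average] pairs, rounded from the results above
sample_avg_by_hour = [
    ["02", 23.81],
    ["09", 5.58],
    ["15", 38.59],
    ["20", 21.52],
]

# Sort by the average (second element), highest first
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

# Keep only the hours of the top three entries
top_hours = [hour for hour, _ in ranked[:3]]
print(top_hours)
```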

From the results, ask posts created at around 15:00 (3 pm) receive the most comments on average, so that looks like the best time to ask a question on Hacker News (all times are as recorded in the dataset's `created_at` column). It is not the only good time, though. Interestingly, posts created at 02:00 come second, so plenty of night owls respond to `Ask HN` posts too.

In summary, post your questions from 3 pm onwards to give them the best chance of attracting comments! :D