# Introduction

This project explores a data set of posts submitted to the [Hacker News](https://news.ycombinator.com/) website over the 12 months up to September 2016.

The two questions we are interested in answering are:
* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

Note that the data set has been reduced from the original 300,000 rows (approx.) to 20,000 rows by removing submissions that received no comments and then randomly sampling from the remainder.

Further documentation for the data set is available [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

## Open and read data set

In [1]:
import csv

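# Open the file in a context manager so it is closed once the rows are read
with open(r'C:\Users\tsa19\Dataquest Projects\Datasets\HN_posts_year_to_Sep_26_2016.csv', encoding="utf8") as opened_file:
    read_file = csv.reader(opened_file)
    hn_raw = list(read_file)

# Display the header row plus the first four data rows, then the total row count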
print(hn_raw[:5])
print(len(hn_raw))

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
293120


## Preparing data set

In [2]:
# Extract the first row as headers
headers = hn_raw[0]
hn_data = hn_raw[1:]

# Removing data with no comments
hn = []

for row in hn_data:
    num_comments = row[4]
    if num_comments != '0': # Only adding entry to list if number of comments is not 0.
        hn.append(row)

# Import the random module
import random

# Select 20,000 random entries from the data set
random.seed(1) # seed the random number generator so the sample is reproducible
hn = random.sample(hn, 20000)

# Display the headers and the first five rows to check the extraction
print(headers)
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12000858',
  'How Amazon Triggered a Robot Arms Race',
  'http://www.bloomberg.com/news/articles/2016-06-29/how-amazon-triggered-a-robot-arms-race',
  '164',
  '101',
  'petethomas',
  '6/29/2016 11:22'],
 ['10329616',
  'Girls in Tech employee fired for misogynist email rant',
  'http://recode.net/2015/10/04/girls-in-tech-employee-fired-for-misogynist-email-rant/',
  '5',
  '8',
  'pavornyoh',
  '10/5/2015 2:07'],
 ['12295633',
  'How PayPal Scaled to Billions of Transactions Daily Using Just 8VMs',
  'http://highscalability.com/blog/2016/8/15/how-paypal-scaled-to-billions-of-transactions-daily-using-ju.html',
  '7',
  '1',
  'yarapavan',
  '8/16/2016 5:01'],
 ['11488759',
  'Calculating PI with bc',
  'http://alien.slackbook.org/blog/calculating-pi/',
  '3',
  '2',
  'vmorgulis',
  '4/13/2016 15:05'],
 ['12064749',
  'Ask HN: What salary will make you move to bay area?',
  '',
  '2',
  '1',
  'RestlessMind',
  '7/10/2016 6:18']]

## Extracting Ask HN and Show HN posts

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Ask HN entries:', len(ask_posts))
print('Show HN entries:', len(show_posts))
print('Other entries:', len(other_posts))

Ask HN entries: 1765
Show HN entries: 1194
Other entries: 17041


## Average number of comments for Ask HN and Show HN entries

In [4]:
total_ask_comments = 0

# Finding sum of comments for Ask HN posts
for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average Ask HN comments:', avg_ask_comments)

# Finding sum of comments for Show HN posts
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print('Average Show HN comments:', avg_show_comments)

Average Ask HN comments: 13.120679886685553
Average Show HN comments: 9.403685092127303


On average, Ask HN posts receive 39.53% more comments (about 13 per post) than Show HN posts (about 9 per post).
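
The percentage is the relative difference between the two averages. As a quick check, the sketch below recomputes it from the averages printed above; the name `pct_more` is introduced here for illustration only.

```python
# Averages copied from the output of the previous cell
avg_ask_comments = 13.120679886685553
avg_show_comments = 9.403685092127303

# Relative difference: how much larger the Ask HN average is,
# expressed as a percentage of the Show HN average
pct_more = (avg_ask_comments - avg_show_comments) / avg_show_comments * 100
print("Ask HN posts receive {:.2f}% more comments on average".format(pct_more))
```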

## Finding the number of Ask HN posts and comments by hour created

In [5]:
# import datetime module
import datetime as dt

# Build a list of [created_at, num_comments] pairs, one for each Ask HN post
result_list = []
for row in ask_posts:
    create_at = row[6]
    num_comments = int(row[4])
    result_list.append([create_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    dt_object = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M") # parse first element into datetime object
    dt_string = dt_object.strftime("%H") # select hour of each entry and rewrite as string
    
    if dt_string in counts_by_hour:
        counts_by_hour[dt_string] += 1
        comments_by_hour[dt_string] += row[1]
    else:
        counts_by_hour[dt_string] = 1
        comments_by_hour[dt_string] = row[1]
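
The same tally can be written without the `if`/`else` branch by using `collections.defaultdict`, which starts every missing key at zero. This is only an alternative sketch, not the approach used in this notebook, and `sample_rows` is a hypothetical list made up for illustration rather than data from the set.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs, made up for illustration
sample_rows = [
    ["7/10/2016 6:18", 1],
    ["7/10/2016 6:45", 3],
    ["7/11/2016 15:02", 10],
]

counts_by_hour = defaultdict(int)    # number of posts per hour; missing hours start at 0
comments_by_hour = defaultdict(int)  # total comments per hour

for created_at, num_comments in sample_rows:
    # Parse the timestamp and keep only the zero-padded hour, e.g. "06"
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```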

## Calculating average number of comments per Ask HN post by hour

In [6]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour

[['06', 10.701754385964913],
 ['14', 13.195402298850574],
 ['13', 20.773809523809526],
 ['22', 10.708333333333334],
 ['20', 9.770642201834862],
 ['09', 10.218181818181819],
 ['07', 7.658536585365853],
 ['21', 11.242990654205608],
 ['15', 43.206896551724135],
 ['10', 10.948275862068966],
 ['12', 19.873015873015873],
 ['02', 7.603448275862069],
 ['03', 13.777777777777779],
 ['04', 7.644444444444445],
 ['18', 9.858333333333333],
 ['16', 10.363636363636363],
 ['11', 6.974683544303797],
 ['17', 10.288659793814434],
 ['05', 8.51063829787234],
 ['19', 10.14018691588785],
 ['00', 9.017543859649123],
 ['23', 6.258620689655173],
 ['01', 8.73076923076923],
 ['08', 16.576923076923077]]

## Sorting and printing values from list of lists

In [7]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]]) # reverse the order of elements in list

sorted_swap = sorted(swap_avg_by_hour, reverse=True) # sort in descending order by average comments per post

print("Top 5 Hours for Ask Posts Comments")

for avg, hour in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg
        )
    ) # printing results in specified format

Top 5 Hours for Ask Posts Comments
15:00: 43.21 average comments per post
13:00: 20.77 average comments per post
12:00: 19.87 average comments per post
08:00: 16.58 average comments per post
03:00: 13.78 average comments per post


We see that Ask HN posts created between 15:00 and 16:00 receive the most replies, at 43.21 comments per post, which is more than three times the overall Ask HN average of 13.12 comments per post.
The second busiest hour is 13:00-14:00, with 20.77 comments per post on average, roughly 58% above the overall average.
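
These comparisons against the overall Ask HN average can be reproduced from the figures printed earlier. The sketch below copies those figures by hand; the names `overall_avg`, `top_hours` and `ratios` are introduced here for illustration.

```python
# Values copied from the outputs above
overall_avg = 13.120679886685553  # average comments across all Ask HN posts
top_hours = {"15": 43.206896551724135, "13": 20.773809523809526}

# Ratio of each hour's average to the overall average
ratios = {hour: avg / overall_avg for hour, avg in top_hours.items()}

for hour, ratio in ratios.items():
    print("{}:00 is {:.2f}x the overall average ({:+.0f}%)".format(
        hour, ratio, (ratio - 1) * 100))
```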

## Converting the top hours to local time

The timestamps in the data set are in US Eastern Time (UTC -5) and my local time zone is UTC +8, so I need to add 13 hours to the hours printed above to get a localised answer.
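
The 13-hour shift is the difference between the two UTC offsets, and hours that pass midnight wrap around to the next day. A minimal sketch of that arithmetic, assuming the fixed offsets stated above (the helper `to_local_hour` and the constants are hypothetical names, not part of the notebook):

```python
UTC_OFFSET_EASTERN = -5  # offset stated in the text above
UTC_OFFSET_LOCAL = 8     # author's local time zone

def to_local_hour(eastern_hour):
    """Shift an hour of the day (0-23) from Eastern Time to UTC +8, wrapping past midnight."""
    shift = UTC_OFFSET_LOCAL - UTC_OFFSET_EASTERN  # 8 - (-5) = 13 hours
    return (eastern_hour + shift) % 24

# The five top hours from the ranking above
for hour in [15, 13, 12, 8, 3]:
    print("{:02d}:00 Eastern -> {:02d}:00 local".format(hour, to_local_hour(hour)))
```

Using the modulo operator avoids building a full `datetime` object when only the hour of the day matters.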

In [8]:
local_sorted_swap = [row.copy() for row in sorted_swap] # copy each inner list so sorted_swap is left unchanged

for row in local_sorted_swap:
    dt_object = dt.datetime.strptime(row[1], "%H") # parse the hour string (second element) into a datetime object
    dt_local = dt_object + dt.timedelta(hours=13) # add 13 hours to the original time
    dt_string = dt_local.strftime("%H") # extract the local hour as a string
    row[1] = dt_string
    
print("Using the logic above, the top 5 hours with most comments to post in my local time (UTC +8) are:")

for avg, hour in local_sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg
        )
    )

Using the logic above, the top 5 hours with most comments to post in my local time (UTC +8) are:
04:00: 43.21 average comments per post
02:00: 20.77 average comments per post
01:00: 19.87 average comments per post
21:00: 16.58 average comments per post
16:00: 13.78 average comments per post


## Conclusion

If one wishes to maximise the number of comments on an Ask HN post, they should post between 3pm and 4pm EST, which is 4am to 5am in UTC +8.

Possible further questions to investigate:
* Determine if show or ask posts receive more points on average
* Determine if posts created at a certain time are more likely to receive more points
* Compare your results to the average number of comments and points other posts receive
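
As a starting point for the points-based questions, the averaging step could be factored into a small helper that works on any numeric column. The sketch below is one possible shape, not part of the analysis above: `average_of_column` is a hypothetical name, and `sample_rows` reuses the five rows displayed earlier with the URLs left out for brevity.

```python
def average_of_column(rows, index):
    """Return the mean of the integer values stored as strings in column `index`."""
    total = sum(int(row[index]) for row in rows)
    return total / len(rows)

# The five rows displayed earlier in the notebook, with URLs omitted for brevity
sample_rows = [
    ['12000858', 'How Amazon Triggered a Robot Arms Race', '', '164', '101', 'petethomas', '6/29/2016 11:22'],
    ['10329616', 'Girls in Tech employee fired for misogynist email rant', '', '5', '8', 'pavornyoh', '10/5/2015 2:07'],
    ['12295633', 'How PayPal Scaled to Billions of Transactions Daily Using Just 8VMs', '', '7', '1', 'yarapavan', '8/16/2016 5:01'],
    ['11488759', 'Calculating PI with bc', '', '3', '2', 'vmorgulis', '4/13/2016 15:05'],
    ['12064749', 'Ask HN: What salary will make you move to bay area?', '', '2', '1', 'RestlessMind', '7/10/2016 6:18'],
]

avg_points = average_of_column(sample_rows, 3)    # column 3 is num_points
avg_comments = average_of_column(sample_rows, 4)  # column 4 is num_comments
print("Average points:", avg_points)
print("Average comments:", avg_comments)
```

Calling the helper on `ask_posts` and `show_posts` with index 3 would give the per-category point averages for the first question.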