# Dataquest Guided Project 2: Exploring Hacker News Posts
---

Hacker News is a site started by startup incubator Y Combinator. It consists of user-generated posts that are voted and commented on, and is extremely popular in technology and startup circles.

Two specific types of posts are of interest to us in this project: `Ask HN` and `Show HN`. `Ask HN` posts consist of users asking the Hacker News community questions, whereas `Show HN` posts are created to share something with the community, for example a project, product or interesting news article.

This Python analysis draws on recently acquired skills to answer the following questions:
1. Do `Ask HN` or `Show HN` posts receive more comments on average?
2. Do posts created at a certain time generate more comments on average?


## Importing data
The data set that will be used for this analysis can be found [here][1], and features almost 300,000 Hacker News posts in the 12 months up to 26/09/2016.

Firstly, the data is imported and saved as a list-of-lists variable `hn`. The header row is then saved in `headers`, and `hn` is updated to remove the header row so that it contains only data.

`headers` and the first 3 entries in the data set `hn` are printed below to show the data set structure.

[1]: https://www.kaggle.com/hacker-news/hacker-news-posts


In [7]:
from csv import reader
opened_file = open(r"C:\Users\tomel\Documents\Programming\Tutorials\Dataquest\Guided Project 2\HN_posts_year_to_Sep_26_2016.csv", encoding="utf8")
read_file = reader(opened_file)
hn = list(read_file)

headers = hn[0]
hn = hn[1:]

print(headers)
for i in range(3):  # 'i' avoids shadowing the built-in iter()
    print(""" 
     """)
    print(hn[i])


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
 
     
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
 
     
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
 
     
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
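
The import step can also be written with a context manager, so the file is closed as soon as the rows have been read. The block below is a minimal sketch and not the code used in this project: `read_posts` and `SAMPLE_CSV` are hypothetical names, and an in-memory string stands in for the real CSV file so that the example runs on its own.

```python
import csv
import io

# Assumption: a one-row stand-in for the real CSV, built from the first
# record printed above.
SAMPLE_CSV = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579008,You have two days to comment if you want stem cells to be "
    "classified as your own,"
    "http://www.regulations.gov/document?D=FDA-2015-D-3719-0018,"
    "1,0,altstar,9/26/2016 3:26\n"
)


def read_posts(file_obj):
    """Return (header row, data rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# The with-block closes the file object once the rows are in memory.
with io.StringIO(SAMPLE_CSV) as sample_file:
    sample_headers, sample_rows = read_posts(sample_file)

print(sample_headers)
print(len(sample_rows))
```

With the real data set, the `io.StringIO(...)` call would be replaced by `open(path, encoding="utf8", newline="")`, where `path` points to the downloaded file.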


## Data set structure
An overview of the structure of the `hn` data set is given in the table below:

Index Number|Header|Example Data
:---|:---|:---
0|'id'|'12579008'
1|'title'|'You have two days to comment if you want stem cells to be classified as your own'
2|'url'|'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018'
3|'num_points'|'1'
4|'num_comments'|'0'
5|'author'|'altstar'
6|'created_at'|'9/26/2016 3:26'
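
As a small illustration of the table above, the index of each column can be derived from the header row instead of being remembered. This is only a sketch: `column_index` and `record` are hypothetical names that are not used elsewhere in this project.

```python
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
first_row = [
    '12579008',
    'You have two days to comment if you want stem cells to be classified as your own',
    'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
    '1', '0', 'altstar', '9/26/2016 3:26',
]

# Map each header name to its position in a row.
column_index = {name: position for position, name in enumerate(headers)}
print(column_index['title'])
print(column_index['num_comments'])

# Pair each header with the matching value from one row.
record = dict(zip(headers, first_row))
print(record['created_at'])
```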


## Create `ask_posts` and `show_posts` data sets

The lists `ask_posts`, `show_posts` and `other_posts` are created below and populated with the relevant entries from the `hn` data set by:
- converting the post title (index `1`) to lower case using the `str.lower` method
- identifying if the title starts with the relevant string (i.e. `ask hn` or `show hn`) using the `str.startswith` method

In [18]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Number of 'Ask HN' posts : ", len(ask_posts))
print("Number of 'Show HN' posts : ", len(show_posts))
print("Number of other posts : ", len(other_posts))

Number of 'Ask HN' posts :  9139
Number of 'Show HN' posts :  10158
Number of other posts :  273822
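
The same test can be wrapped in a small function, which makes it easy to check against a few titles by hand. This is a sketch under assumptions: `classify_title` is a hypothetical helper, and the first two sample titles are invented for illustration (only the third comes from the data set).

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


sample_titles = [
    "Ask HN: How do you organise your notes?",  # invented example
    "SHOW HN: A tiny static site generator",    # invented example
    "SQLAR  the SQLite Archiver",               # from the data set
]

for title in sample_titles:
    print(classify_title(title), "|", title)
```

Lower-casing first means that titles such as `ASK HN` or `Show hn` are still matched. A title that only mentions "ask hn" somewhere after the start is classed as "other".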


## Calculate average number of comments by post type

The average number of comments for "Ask HN" and "Show HN" posts is calculated below by summing the comments (index `4`) across each data set (`ask_posts` and `show_posts`). Each total is then divided by the length of the corresponding data set (i.e. the number of posts) to find the average number of comments.

In [21]:
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = round(total_ask_comments/len(ask_posts), 2)


total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = round(total_show_comments/len(show_posts), 2)

print ("Average 'Ask HN' comments :", avg_ask_comments)
print ("Average 'Show HN' comments :", avg_show_comments)

Average 'Ask HN' comments : 10.39
Average 'Show HN' comments : 4.89


We can see from this simple analysis that "Ask HN" posts receive, on average, more than double the number of comments of "Show HN" posts. It could therefore be argued that "Ask HN" posts were more likely to feature in the top Hacker News posts than "Show HN" posts in the 12 months considered here.
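
Because the same calculation is repeated for both post types, it could be factored into one function. The sketch below shows one possible way: `average_comments` is a hypothetical helper, and the three sample rows are invented to keep the arithmetic easy to follow.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments over a list of post rows."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Invented rows in the same column order as the data set.
sample_posts = [
    ['1', 'Ask HN: first', '', '3', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: second', '', '1', '4', 'user_b', '9/26/2016 3:24'],
    ['3', 'Ask HN: third', '', '2', '1', 'user_c', '9/26/2016 3:19'],
]

# (10 + 4 + 1) / 3
print(round(average_comments(sample_posts), 2))
```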

## Determine best time to create "Ask HN" posts

In order for a Hacker News post to be featured and rank highly on the Hacker News main billboard it must generate a high number of comments. The intention of the below analysis is to identify at what time users should create "Ask HN" posts in order to generate a high number of comments, and thus propel the post up the Hacker News billboard.

To do this, the dictionaries `counts_by_hour` and `comments_by_hour` are created. They record, respectively, the number of posts created in each hour of the day and the total number of comments on the posts created in that hour.

This requires parsing the `created_at` entry into a `datetime` object using the `datetime.strptime` method. Once parsed, a new variable (`create_hour`) is created for each post, using the `datetime.strftime` method to record the hour of the day when the post was created, as a string.

The `counts_by_hour` and `comments_by_hour` dictionaries are essentially frequency tables, where data is stored against each hour of the day. This data is then used to calculate the mean number of comments for posts created in each hour of the day. The result of this calculation is stored in `avg_by_hour`.

The data stored in `avg_by_hour` is then sorted and saved in the variable `sorted_swap`, which allows clear formatting of the hours of the day in which posts generate the most comments on average.
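
As a small illustration of the parsing step, the snippet below converts the `created_at` value of the first record in the data set into a `datetime` object and extracts the hour. The variable names are chosen for this sketch only.

```python
import datetime as dt

created_at = "9/26/2016 3:26"

# strptime turns the string into a datetime object.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

# strftime("%H") gives the hour as a zero-padded string,
# while the .hour attribute gives it as an integer.
hour_label = parsed.strftime("%H")
print(hour_label)
print(parsed.hour)
```

Using the zero-padded string as a dictionary key keeps the keys a fixed width, so "03" and "15" sort in clock order.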

In [49]:
import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    create_dt = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    create_hour = dt.datetime.strftime(create_dt, "%H")
    
    if create_hour in counts_by_hour:
        counts_by_hour[create_hour] += 1
        comments_by_hour[create_hour] += row[1]
    else:
        counts_by_hour[create_hour] = 1
        comments_by_hour[create_hour] = row[1]

avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

swap_avg_by_hour = []
for entry in avg_by_hour:
    swap_avg_by_hour.append([entry[1], entry[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print ("Top 5 Hours for Ask Posts Comments (Time Zone = US Eastern Time):")

for entry in sorted_swap[0:5]:
    entry_datetime = dt.datetime.strptime(entry[1], "%H")
    entry_datetime = dt.datetime.strftime(entry_datetime, "%H:00")
    
    top_avgs = "{}: {:.2f} average comments per post".format(entry_datetime, entry[0])
    print(top_avgs)

Top 5 Hours for Ask Posts Comments (Time Zone = US Eastern Time):
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
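
The swap step above exists only so that `sorted` compares the averages first. An alternative is to pass a `key` function to `sorted`, which avoids reordering the columns. The sketch below is not the code used in this project; it applies the idea to the five averages printed above, entered by hand in an arbitrary order.

```python
# [hour, average comments] pairs copied from the output above.
avg_by_hour_sample = [
    ["10", 10.68],
    ["15", 28.68],
    ["02", 11.14],
    ["13", 16.32],
    ["12", 12.38],
]

# Sort on the second element (the average), largest first.
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in ranked:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```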


Since the input data was normalised so that the `'created_at'` values are in US Eastern Time, it may be more useful for a user in the UK to have this data in GMT.

To convert, a `timedelta` object created with `hours=4` is added to the time data stored in `sorted_swap`. GMT is 4 hours ahead of US Eastern Daylight Time; during Eastern Standard Time the difference is 5 hours, so the fixed 4-hour offset is an approximation over a full year.

In [54]:
print ("Top 5 Hours for Ask Posts Comments (Time Zone = GMT):")

for entry in sorted_swap[0:5]:
    entry_datetime = dt.datetime.strptime(entry[1], "%H")
    time_delta = dt.timedelta(hours = 4)
    entry_datetime = entry_datetime + time_delta
    entry_datetime = dt.datetime.strftime(entry_datetime, "%H:00")
    top_avgs = "{}: {:.2f} average comments per post".format(entry_datetime, entry[0])
    print(top_avgs)

Top 5 Hours for Ask Posts Comments (Time Zone = GMT):
19:00: 28.68 average comments per post
17:00: 16.32 average comments per post
16:00: 12.38 average comments per post
06:00: 11.14 average comments per post
14:00: 10.68 average comments per post
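
The conversion can also be expressed with timezone-aware `datetime` objects, which handles hours that roll over midnight in the same way as the `timedelta` addition. The sketch below makes a loud assumption: it treats every timestamp as being at a fixed UTC-4 offset, which is only correct while daylight saving time applies in the US. `EASTERN_FIXED` and `eastern_to_gmt_hour` are hypothetical names.

```python
import datetime as dt

# Assumption: a fixed UTC-4 offset, not a full Eastern Time definition.
EASTERN_FIXED = dt.timezone(dt.timedelta(hours=-4), name="UTC-4")


def eastern_to_gmt_hour(created_at):
    """Return the GMT hour label for a 'created_at' string."""
    naive = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    aware = naive.replace(tzinfo=EASTERN_FIXED)
    return aware.astimezone(dt.timezone.utc).strftime("%H:00")


print(eastern_to_gmt_hour("9/26/2016 15:26"))
print(eastern_to_gmt_hour("9/26/2016 22:10"))  # rolls over past midnight
```

A conversion that follows the daylight saving rules of both regions would need a time zone database, for example through the standard library `zoneinfo` module in Python 3.9 and later.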


## Conclusions

- Of the "Ask HN" and "Show HN" post types, "Ask HN" posts generated more than double the average number of comments of "Show HN" posts (10.39 versus 4.89). Therefore, users aiming for their posts to be promoted to the top ranked Hacker News posts should favour "Ask HN" type posts.

- The average number of comments on a post varies with the hour in which the post was created. Therefore, users hoping for their posts to be featured should consider posting at specific times.
- For users posting from the UK, posts created between 19:00 and 19:59 GMT attracted by far the most comments of any hourly interval, generating on average 28.68 comments per post. This is over 12 comments more than the next most favourable hourly interval (17:00 to 17:59, with 16.32 average comments per post).
