# Exploring Engagement When Posting on HackerNews

This project works with a dataset provided by [DataQuest](http://www.dataquest.io). The original data can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but the version used here was filtered in the following manner:

* All posts that did not receive comments were removed
* A random sample of approximately 20,000 rows was used for the Dataquest dataset

In this project, we explore two specific types of posts:
* Posts where questions are asked to HackerNews
* Posts where work/information is shown to readers of HackerNews

We compare engagement, measured as the number of comments, for these two types of posts. From there, we recommend which of the two gets more engagement and, based on when posts are created, which times are best for posting, expressed in Singapore's time zone.


In [1]:
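import datetime as dt
from csv import reader

# Read the whole CSV into a list of lists; the with-block closes the file
with open('hacker_news.csv') as file:
    hn = list(reader(file))

# Preview the header row and the first four data rows
for row in hn[0:5]:
    print(row)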

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


The cell above loads the CSV file into a list of lists so that we can start working with the data.

In [2]:
header = hn[0]
hn = hn[1:]
for row in hn[:5]:
    print(row[1])
print(header)

Interactive Dynamic Video
How to Use Open Source and Shut the Fuck Up at the Same Time
Florida DJs May Face Felony for April Fools' Water Joke
Technology ventures: From Idea to Enterprise
Note by Note: The Making of Steinway L1037 (2007)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


Above, we stored the header row in its own variable, **header**, and then sliced it off the dataset, **hn**.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    string_lower = row[1].lower().strip()
    if string_lower.startswith('ask hn'):
        ask_posts.append(row)
    elif string_lower.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts), len(show_posts), len(other_posts))
for row in ask_posts[:5]:
    print(row)
print('\n')
for row in show_posts[:5]:
    print(row)
print('\n')
for row in other_posts[:5]:
    print(row)


1744 1162 17194
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']
['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']
['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']


['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']
['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1'

Above, we've separated all the posts into different lists:

* ask_posts
* show_posts
* other_posts

Specifically, we are concerned with the **ask_posts** and **show_posts** lists, because these are the posts in which posters either ask HackerNews questions or show their work to HackerNews.
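The prefix test used for the split can also be written as a small helper. This is only a sketch: the function name `classify_title` and the sample titles are made up for illustration and are not part of the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    # Same normalisation as the cell above: lowercase, then strip whitespace
    normalized = title.lower().strip()
    if normalized.startswith('ask hn'):
        return 'ask'
    if normalized.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical titles, used only to exercise the three branches
sample_titles = [
    'Ask HN: Example question?',
    '  SHOW HN: Example project',
    'Example news story',
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Because the title is lowercased and stripped first, differences in capitalisation or stray leading spaces do not change which list a post lands in.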

In [4]:
counter = 0
index_header = []
for index in header: 
    index_header.append(str(counter) + ': ' + index)
    counter += 1
print(index_header)

    

['0: id', '1: title', '2: url', '3: num_points', '4: num_comments', '5: author', '6: created_at']


We number the header columns so that the index of each column is easy to look up when working with the **hn** dataset.
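As a possible alternative to reading the indexes off a printed list, the same lookup can be kept in a dictionary. The sketch below assumes the header row shown earlier; the name `column_index` is our own.

```python
# Header row as printed earlier in the notebook
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row
column_index = {name: position for position, name in enumerate(header)}

print(column_index['num_comments'])
print(column_index['created_at'])
```

With this mapping, `row[column_index['num_comments']]` reads the same value as `row[4]`, at the cost of a slightly longer expression.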

In [5]:
def get_average(posts):
    # Mean of num_comments (index 4) across the given rows
    total_comments = 0
    for row in posts:
        total_comments += int(row[4])
    avr_comments = total_comments / len(posts)
    return avr_comments

avg_ask = get_average(ask_posts)
avg_show = get_average(show_posts)
avg_other = get_average(other_posts)

print("average ask:" + str(avg_ask),"average show: " + str(avg_show),"average other:" + str(avg_other))

average ask:14.038417431192661 average show: 10.31669535283993 average other:26.8730371059672


On average, posts in which posters ask HackerNews questions get more comments than posts in which posters show their work (about 14.0 versus 10.3 comments per post). A possible reason is that questions invite more interaction than posts where people are simply showing their work. However, neither type averages as many comments as other posts (about 26.9).
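The averaging step can be sketched more compactly with `sum` and a generator expression. The function name `average_comments` and the three sample rows are illustrative assumptions; only the column layout (comments at index 4, stored as strings) comes from the dataset.

```python
def average_comments(rows, comments_index=4):
    """Mean number of comments per row; 0.0 for an empty list."""
    if not rows:
        return 0.0
    return sum(int(row[comments_index]) for row in rows) / len(rows)


# Hypothetical rows in the dataset's column order
sample_rows = [
    ['1', 'Ask HN: First example?', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: Second example?', '', '28', '29', 'user_b', '11/22/2015 13:43'],
    ['3', 'Ask HN: Third example?', '', '1', '1', 'user_c', '5/2/2016 10:14'],
]
sample_average = average_comments(sample_rows)
print(sample_average)
```

The guard for an empty list avoids a `ZeroDivisionError` if one of the post lists ever turns out to be empty.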

Since **ask posts** get more comments than show posts, we now look at whether the time at which a post is created affects the number of comments it receives.

In [6]:
for row in ask_posts[:5]:
    print(row[6])
    print(type(row[6]))

8/16/2016 9:55
<class 'str'>
11/22/2015 13:43
<class 'str'>
5/2/2016 10:14
<class 'str'>
8/2/2016 14:20
<class 'str'>
10/15/2015 16:38
<class 'str'>


In [7]:
trial = dt.datetime.strptime(ask_posts[0][6], '%m/%d/%Y %H:%M')
for row in ask_posts:
    datetime = dt.datetime.strptime(row[6],'%m/%d/%Y %H:%M')
    row[6] = datetime

for row in ask_posts[:2]:
    print(type(row[6]))


<class 'datetime.datetime'>
<class 'datetime.datetime'>


Above, we've converted all the values at index 6 of ask_posts from strings to datetime objects so that they can be manipulated as such.
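To see what the format string does on a single value, here is a minimal sketch that parses one `created_at` string from the output above. The constant name `CREATED_AT_FORMAT` is our own.

```python
import datetime as dt

# %m/%d/%Y %H:%M = month/day/four-digit year, then 24-hour hour:minute
CREATED_AT_FORMAT = '%m/%d/%Y %H:%M'

created = dt.datetime.strptime('8/16/2016 9:55', CREATED_AT_FORMAT)
print(created)
print(created.hour, created.minute)
```

Once parsed, parts of the timestamp such as the hour are available as attributes of the datetime object.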

In [8]:

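ask_posts_by_hour = []
for row in ask_posts:
    # Copy each row so that ask_posts keeps its datetime objects
    ask_posts_by_hour.append(list(row))
test_hour = dt.datetime.strftime(ask_posts[2][6], "%H")
for row in ask_posts_by_hour:
    # Replace the datetime at index 6 with its zero-padded hour string
    hour = dt.datetime.strftime(row[6], "%H")
    row[6] = hour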
    

At this point, we have a new list, **ask_posts_by_hour**, in which index 6 holds the hour in which each post was created. Note that both **num_comments (index 4)** and **hour created (index 6)** are strings, so we have to work with the values as such.
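The difference between the string and numeric forms of these values can be illustrated with a short sketch. The variable names are our own, and the timestamp is the first ask post's creation time.

```python
import datetime as dt

created = dt.datetime(2016, 8, 16, 9, 55)

# strftime('%H') gives a zero-padded string; the .hour attribute gives an int
hour_as_text = created.strftime('%H')
hour_as_int = created.hour

# num_comments is read from the CSV as text and must be converted before adding
comments_as_text = '6'
comments = int(comments_as_text)

print(repr(hour_as_text), hour_as_int, comments + 1)
```

Keeping the hour as a zero-padded string works well as a dictionary key, while the comment counts need `int()` before any arithmetic.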

In [9]:
for row in ask_posts_by_hour[:3]:
    print(row[4], type(row[4]), "comments")
    print(row[6], type(row[6]), 'hour created')

6 <class 'str'> comments
09 <class 'str'> hour created
29 <class 'str'> comments
13 <class 'str'> hour created
1 <class 'str'> comments
10 <class 'str'> hour created


In [10]:
ask_row_dict = {}
for row in ask_posts_by_hour:
    if row[6] not in ask_row_dict:
        ask_row_dict[row[6]] = [1, int(row[4])]
    else:
        ask_row_dict[row[6]][0] += 1
        ask_row_dict[row[6]][1] += int(row[4])
print(ask_row_dict)


{'09': [45, 251], '13': [85, 1253], '10': [59, 793], '14': [107, 1416], '16': [108, 1814], '23': [68, 543], '12': [73, 687], '17': [100, 1146], '15': [116, 4477], '21': [109, 1745], '20': [80, 1722], '02': [58, 1381], '18': [109, 1439], '03': [54, 421], '05': [46, 464], '19': [110, 1188], '01': [60, 683], '22': [71, 479], '08': [48, 492], '04': [47, 337], '00': [55, 447], '06': [44, 397], '07': [34, 267], '11': [58, 641]}


In [11]:
ask_row_avg = {}

for key, value in ask_row_dict.items():
    ask_row_avg[key] = round(value[1] / value[0], 2)
print(ask_row_avg)

{'09': 5.58, '13': 14.74, '10': 13.44, '14': 13.23, '16': 16.8, '23': 7.99, '12': 9.41, '17': 11.46, '15': 38.59, '21': 16.01, '20': 21.52, '02': 23.81, '18': 13.2, '03': 7.8, '05': 10.09, '19': 10.8, '01': 11.38, '22': 6.75, '08': 10.25, '04': 7.17, '00': 8.13, '06': 9.02, '07': 7.85, '11': 11.05}


In [12]:
ask_row_avg_list = []
for key, value in ask_row_avg.items():
    ask_row_avg_list.append([key, value])
print(ask_row_avg_list)

[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]


In [20]:
ask_row_avg_swapped = []
for row in ask_row_avg_list:
    ask_row_avg_swapped.append([row[1], row[0]])
ask_row_avg_sorted = sorted(ask_row_avg_swapped, reverse = True)
print("Top 5 Hours for Ask Posts Comments")
for row in ask_row_avg_sorted[:5]:
    date_time = dt.datetime.strptime(row[1], "%H")
    hour = date_time.strftime('%H')
    print(hour + ':' + "00 -", row[0], "average comments per post")


Top 5 Hours for Ask Posts Comments
15:00 - 38.59 average comments per post
02:00 - 23.81 average comments per post
20:00 - 21.52 average comments per post
16:00 - 16.8 average comments per post
21:00 - 16.01 average comments per post
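The ranking above swaps the columns so that `sorted` orders the rows by average. One possible alternative, sketched here on four of the `[hour, average]` pairs printed earlier, is to leave the pairs as they are and pass a `key` function; the variable names are our own.

```python
# A few [hour, average] pairs taken from the output of the earlier cell
hour_averages = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort by the average (second element), highest first, without swapping columns
ranked = sorted(hour_averages, key=lambda pair: pair[1], reverse=True)

for hour, average in ranked:
    print(hour + ':00 -', average)
```

Both approaches give the same order here; the `key` version simply avoids building an intermediate swapped list.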


The hours above are in the US Eastern time zone, so they are the best Eastern-time hours for posting Ask posts to get the most comments. Singapore is 13 hours ahead of Eastern Standard Time (12 hours ahead while daylight saving time is in effect in the US), so we shift the hours by 13 to make them relevant to us.


In [29]:
timedelta = dt.timedelta(hours = 13)
for row in ask_row_avg_sorted[:5]:
    date_time = dt.datetime.strptime(row[1], "%H")
    date_time = date_time + timedelta
    hour = date_time.strftime('%H')
    print(hour + ':' + "00 -", row[0], "average comments per post")


04:00 - 38.59 average comments per post
15:00 - 23.81 average comments per post
09:00 - 21.52 average comments per post
05:00 - 16.8 average comments per post
10:00 - 16.01 average comments per post


After conversion to Singapore time, the best times to post Ask posts on HackerNews to attract comments, based on the data and in descending order of average comments, are:
* 4:00 am
* 3:00 pm
* 9:00 am
* 5:00 am
* 10:00 am

Our recommendation for Singaporeans who want maximum engagement on their Ask posts is therefore to post at around these times.
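The converted hours printed above can be cross-checked with modular arithmetic. This is a sketch under the same assumption as the notebook, namely a fixed 13-hour offset between US Eastern time and Singapore; the function name `to_singapore_hour` is our own.

```python
# Assumption carried over from the notebook: Singapore = Eastern + 13 hours
SINGAPORE_OFFSET_HOURS = 13


def to_singapore_hour(eastern_hour, offset=SINGAPORE_OFFSET_HOURS):
    """Shift an hour of the day and wrap around midnight."""
    return (int(eastern_hour) + offset) % 24


# Top five Eastern-time hours from the ranking above
top_eastern_hours = ['15', '02', '20', '16', '21']
singapore_hours = [to_singapore_hour(hour) for hour in top_eastern_hours]

for hour in singapore_hours:
    print(f'{hour:02d}:00')
```

The `% 24` step is what wraps 28 back to 04:00, which is easy to get wrong when converting by hand.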

