# Exploring Hacker News Posts
 
The goal of this project is to compare 'Ask HN' and 'Show HN' posts. 'Ask HN' posts are from members asking the community a question, while 'Show HN' posts are from members bringing attention to something they want to share. Using this data, we will determine which type of post receives more interaction, measured by comments, and whether posts created at a particular time of day receive more comments. Keep in mind that this project only looks at posts that <i>did</i> receive comments.

Start off by loading the data into a list:

In [1]:
from csv import reader

# Open the file in a context manager so it is closed automatically
with open('hacker_news.csv') as opened_file:
    read_file=reader(opened_file)
    hn=list(read_file)

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


To make the data easier to work with, we will separate the header row from the rest of the data:

In [2]:
headers=hn[0]
hn=hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
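The header row can also be used to look up column positions by name instead of hard-coding indexes such as 1 or 4. The following is a minimal sketch, not part of the original notebook: `col_index` is a hypothetical helper name, and the single sample row is copied from the output above.

```python
# Header row as printed above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row.
col_index = {name: position for position, name in enumerate(headers)}

# One sample row copied from the output above.
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte',
       '8/4/2016 11:52']

# Access fields by column name; num_comments is stored as a string in the CSV.
title = row[col_index['title']]
num_comments = int(row[col_index['num_comments']])
print(title, num_comments)
```

Looking columns up by name keeps the code readable if the column order ever changes.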


We will start by sorting the posts into three categories: Ask HN posts, Show HN posts, and other posts. The titles are converted to lower case before the comparison to avoid problems arising from inconsistent capitalization:

In [3]:
ask_posts=[]
show_posts=[]
other_posts=[]

for row in hn:
    title=row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts), len(show_posts),len(other_posts))

1744 1162 17194
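The same prefix check can be isolated in a small function, which makes the case-insensitive matching easy to try on individual titles. This is only a sketch: `classify_title` is a hypothetical name, and the second sample title has been upper-cased on purpose to show that capitalization does not matter.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # normalize case before comparing prefixes
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Sample titles; the 'SHOW HN' capitalization is altered for illustration.
sample_titles = [
    'Ask HN: How to improve my personal website?',
    'SHOW HN: Something pointless I made',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```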


Printed below are the first few rows of the 'Ask HN' and 'Show HN' posts, respectively, to get an idea of what the lists look like.

Ask HN posts:

In [4]:
print(ask_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


Show HN posts:

In [5]:
print(show_posts[:5])

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]


Next, we determine which type of post gets more comments on average. We will find the total number of comments for each type of post and then divide by the number of posts of that type.

In [6]:
total_ask_comments=0
total_show_comments=0

for post in ask_posts:
    comments=int(post[4])
    total_ask_comments+=comments
for post in show_posts:
    comments=int(post[4])
    total_show_comments+=comments
    
avg_ask_comments=total_ask_comments/len(ask_posts)
avg_show_comments=total_show_comments/len(show_posts)

print('Average Ask: ', avg_ask_comments)
print('Average Show: ', avg_show_comments)

Average Ask:  14.038417431192661
Average Show:  10.31669535283993
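The averaging step can also be written as one reusable function, so the same logic serves both post types. The sketch below assumes the row layout shown earlier (comment count at index 4); `average_comments` is a hypothetical name, and the three sample rows are copied from the Ask HN output above.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Three sample rows copied from the Ask HN output above.
sample_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1',
     'polskibus', '5/2/2016 10:14'],
]

sample_average = average_comments(sample_ask_posts)
print(sample_average)  # (6 + 29 + 1) / 3
```

The guard for an empty list avoids a `ZeroDivisionError` if a category happens to contain no posts.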


It seems that, on average, posts asking the community a question receive more comments than posts showcasing something. Since ask posts receive more engagement, we will focus on them to determine whether there is a time of day at which posts receive more comments.

We will begin by finding, for each hour of the day, the number of ask posts created and the total number of comments they received:

In [7]:
import datetime as dt

results_list=[]

for post in ask_posts:
    results_list.append([post[6], int(post[4])])
    
posts_by_hour={}
comments_by_hour={}
date_format='%m/%d/%Y %H:%M' # Format used to parse the created_at string into a datetime object

for result in results_list:
    date=result[0]
    comment=result[1]
    hour=dt.datetime.strptime(date, date_format).strftime('%H')
    if hour in posts_by_hour:
        posts_by_hour[hour]+=1
        comments_by_hour[hour]+=comment
    else:
        posts_by_hour[hour]=1
        comments_by_hour[hour]=comment
print(posts_by_hour)
comments_by_hour

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
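To make the parsing step concrete, here is a small sketch that converts a single `created_at` value (copied from the Ask HN output above) into the two-digit hour label used as the dictionary key. The variable names `created_at`, `parsed`, and `hour_label` are illustrative.

```python
import datetime as dt

date_format = '%m/%d/%Y %H:%M'
created_at = '8/16/2016 9:55'  # sample value from the data above

# strptime parses the string into a datetime object ...
parsed = dt.datetime.strptime(created_at, date_format)

# ... and strftime formats that object back into a zero-padded hour string.
hour_label = parsed.strftime('%H')
print(parsed.hour, hour_label)
```

Because `strftime('%H')` zero-pads the hour, the keys are strings such as `'09'` rather than the integer 9.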

We have found the total number of posts and comments per hour. Now we can calculate the average number of comments per post for each hour:

In [8]:
avg_by_hour=[]
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/posts_by_hour[hour]])
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
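As a worked check of the arithmetic, the sketch below recomputes the averages for three of the hours using the counts printed earlier. It stores the result in a dictionary rather than a list of lists, which is an alternative layout and not the one used in the notebook; `sample_avg_by_hour` is a hypothetical name.

```python
# Counts for three hours, copied from the outputs above.
posts_by_hour = {'15': 116, '02': 58, '20': 80}
comments_by_hour = {'15': 4477, '02': 1381, '20': 1722}

# Average comments per post = total comments / number of posts, per hour.
sample_avg_by_hour = {
    hour: comments_by_hour[hour] / posts_by_hour[hour]
    for hour in comments_by_hour
}

for hour, avg in sample_avg_by_hour.items():
    print(hour, round(avg, 2))
```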

The averages have been grouped into a list of lists, but the result is hard to read because it is in no particular order. To sort by average, we will create a copy in which the two elements of each inner list are swapped, so that the average comes first, and then sort that copy. The next cell builds the swapped list:

In [9]:
swap_avg_by_hour=[]

for row in avg_by_hour:
    row1=row[1]
    row2=row[0]
    swap_avg_by_hour.append([row1, row2])
swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

Now to sort the swapped list:

In [10]:
sorted_swap=sorted(swap_avg_by_hour,reverse=True)
sorted_swap[:5]

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21']]
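An alternative to swapping the elements is to pass a `key` function to `sorted`, which sorts the original `[hour, average]` pairs directly. This is a sketch of that variant, not the approach taken in the notebook; the five sample pairs are copied from the output above, and `sorted_by_avg` and `top_hours` are illustrative names.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
    ['16', 16.796296296296298],
]

# Sort by the second element (the average), highest first.
sorted_by_avg = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

top_hours = [hour for hour, _ in sorted_by_avg]
print(top_hours)
```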

Here are the top 5 hours in which ask posts receive the most comments per post on average:

In [11]:
print('Top 5 Hours for Ask Posts Comments:')
for avg, hour in sorted_swap[:5]:
    print('{} : {:.2f} average comments per post '.format(dt.datetime.strptime(hour, '%H').strftime('%H:%M'), avg))

Top 5 Hours for Ask Posts Comments:
15:00 : 38.59 average comments per post 
02:00 : 23.81 average comments per post 
20:00 : 21.52 average comments per post 
16:00 : 16.80 average comments per post 
21:00 : 16.01 average comments per post 


According to the __[documentation](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts)__ on the dataset, the times are in EST, so the top 5 times, in descending order of average comments per post, are: <br>
3pm <br>
2am <br>
8pm <br>
4pm <br>
9pm <br>
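
The 12-hour labels above can also be produced in code rather than by hand. The following sketch assumes the two-digit hour strings from the earlier cells as input; `to_12_hour` is a hypothetical helper and does no time zone conversion.

```python
def to_12_hour(hour_label):
    """Convert a 24-hour label such as '15' into a 12-hour label such as '3pm'."""
    hour = int(hour_label)
    suffix = 'am' if hour < 12 else 'pm'
    # hour % 12 is 0 for midnight and noon, which are written as 12.
    return '{}{}'.format(hour % 12 or 12, suffix)

# Top 5 hours from the sorted results above.
top_hours = ['15', '02', '20', '16', '21']
twelve_hour_labels = [to_12_hour(hour) for hour in top_hours]
print(twelve_hour_labels)
```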