#### Exploring Hacker News Posts

##### Objectives

The project aims to answer the following questions from the data:

1. Which type of post, 'Ask HN' or 'Show HN', receives more comments on average?
2. Do posts created at a certain time receive more comments on average?

##### Data

The data set can be found [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). It has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.

We shall begin by opening and exploring the data.

In [1]:
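from csv import reader

# Read every row into a list; the context manager closes the file afterwards.
# newline='' is the setting the csv module documentation recommends.
with open('hacker_news.csv', newline='') as opened_file:
    hn = list(reader(opened_file))

headers = hn[0]  # the first row holds the column names
hn = hn[1:]      # the remaining rows are the posts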

In [2]:
print(headers)
print(hn[:4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


We shall count the posts whose titles start with 'Ask HN' or 'Show HN', and those that belong to neither category.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [4]:
print('Ask Posts:', len(ask_posts))
print('Show Posts:', len(show_posts))
print('Other Posts:', len(other_posts))

Ask Posts: 1744
Show Posts: 1162
Other Posts: 17194


Hence, there are 1744 'Ask HN' posts as compared to 1162 'Show HN' posts.
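
The prefix check used above can also be wrapped in a small function so that it is easy to test on its own. This is only a sketch: `classify_title` and the sample titles are hypothetical names and values introduced for illustration, not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles used only to illustrate the three outcomes.
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Interactive Dynamic Video',
]
sample_labels = [classify_title(title) for title in sample_titles]
print(sample_labels)  # ['ask', 'show', 'other']
```

Keeping the rule in one place means a change to the matching logic only has to be made once.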

We shall proceed to calculate the average number of comments for each category of posts.

In [5]:
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])  # column 4 is num_comments

# Compute the average once, after the loop has finished.
avg_ask_comments = total_ask_comments / len(ask_posts)

In [6]:
print(avg_ask_comments)

14.038417431192661


In [7]:
total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])  # column 4 is num_comments

# Compute the average once, after the loop has finished.
avg_show_comments = total_show_comments / len(show_posts)

In [8]:
print(avg_show_comments)

10.31669535283993


Ans to Q1: 'Ask HN' posts.

The average number of comments received by 'Ask HN' posts (about 14.0) is higher than the average received by 'Show HN' posts (about 10.3).
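
Since the same averaging steps are applied to both categories, they could be factored into one helper. The sketch below assumes the column layout shown earlier (the comment count at index 4). `average_comments` and the two sample rows are hypothetical and exist only to illustrate the idea.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments across posts, or 0.0 if there are none."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty category
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as the data set.
sample_posts = [
    ['1', 'Ask HN: Example question one?', '', '5', '6', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Example question two?', '', '3', '10', 'user_b', '1/26/2016 19:30'],
]
sample_average = average_comments(sample_posts)
print(sample_average)  # 8.0
```

With the notebook's own lists, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.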

Moving forward, we will focus the remaining analysis on 'Ask HN' posts for Q2.

We shall begin by creating a list that pairs each post's creation time with its number of comments.

In [9]:
import datetime as dt
result_list=[]
for post in ask_posts:
    result_list.append([post[6], int(post[4])])

We shall create two dictionaries keyed by hour of the day: one counting the posts created in each hour and one totalling the comments those posts received.

In [10]:
counts_by_hour={}
comments_by_hour={}
date_format= "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1

comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
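
As an alternative to the explicit `if`/`else` branches, the same hourly tally can be sketched with `collections.defaultdict`, which starts every missing key at zero. The function name `tally_by_hour` and the sample rows are hypothetical; the date format is the one used in the cell above.

```python
from collections import defaultdict
import datetime as dt

DATE_FORMAT = "%m/%d/%Y %H:%M"


def tally_by_hour(rows):
    """Return (post counts, comment totals), both keyed by two-digit hour."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)


# Hypothetical [created_at, num_comments] pairs shaped like result_list.
sample_rows = [
    ["8/4/2016 11:52", 6],
    ["1/26/2016 11:05", 10],
    ["6/23/2016 22:20", 1],
]
sample_counts, sample_comments = tally_by_hour(sample_rows)
print(sample_counts)    # {'11': 2, '22': 1}
print(sample_comments)  # {'11': 16, '22': 1}
```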

We shall convert the dictionaries into a list that holds the average number of comments per post for each hour.

In [11]:
avg_by_hour=[]

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

We shall swap the first and second elements of each pair so that the list can be sorted by the average.

In [12]:
swap_avg_by_hour=[]

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [13]:
sorted_swap=sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
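
The swap step is not strictly required: `sorted` accepts a `key` function, so the `[hour, average]` pairs can be ordered by their second element directly. The sketch below uses three values from the output above, rounded to two decimals; the names `sample_avg_by_hour` and `sorted_by_avg` are hypothetical.

```python
# Three [hour, average] pairs taken from avg_by_hour, rounded for brevity.
sample_avg_by_hour = [['09', 5.58], ['15', 38.59], ['02', 23.81]]

# Sort by the average (index 1), highest first, without swapping columns.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(sorted_by_avg)  # [['15', 38.59], ['02', 23.81], ['09', 5.58]]
```

One difference to be aware of: if two hours had exactly the same average, the swapped version would break the tie by the hour string, whereas the `key` version keeps such pairs in their original order.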

In [14]:
print("Top 5 Hours for 'Ask HN' Comments:")
for avg, hr in sorted_swap[:5]:
    print(
        "{} : {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg
        )
    )

Top 5 Hours for 'Ask HN' Comments:
15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post
21:00 : 16.01 average comments per post


Ans to Q2: The top 5 hours in which 'Ask HN' posts receive the most comments on average are 15:00, 02:00, 20:00, 16:00 and 21:00.