# Hacker News analysis
## author: Tomás N. P. Barros
### email: samotbnp@gmail.com
This notebook is a mini project for practicing the basics of Python, using a dataset of Hacker News posts available at: https://www.kaggle.com/datasets/hacker-news/hacker-news-posts

In [22]:
import csv
import numpy as np
import datetime as dt

In [None]:
# read the file inside a context manager so the handle is closed afterwards
with open('hacker_news.csv', 'r', newline='') as f:
    hn = list(csv.reader(f))

In [4]:
print(hn[:5]) # show the first five rows of the dataset

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [10]:
# extracting the headers from the dataset
headers = hn[0]
hn = hn[1:]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [11]:
# split the posts into three groups: titles starting with 'Ask HN',
# titles starting with 'Show HN', and everything else
ask_posts, show_posts, other_posts = [],[],[]
for row in hn:
    if row[1].lower().startswith('ask hn'):
        ask_posts.append(row)
    elif row[1].lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [12]:
print(f"number of 'Ask HN' posts: {len(ask_posts)}")
print(f"number of 'Show HN' posts: {len(show_posts)}")
print(f"number of neither type of posts: {len(other_posts)}")

number of 'Ask HN' posts: 1744
number of 'Show HN' posts: 1162
number of neither type of posts: 17194


In [15]:
show_comments_per_post = [int(row[4]) for row in show_posts]
ask_comments_per_post = [int(row[4]) for row in ask_posts]

In [17]:
show_arr = np.array(show_comments_per_post, np.int32)
show_mean = show_arr.mean()
show_median = np.median(show_arr)
show_std = show_arr.std()

In [18]:
ask_arr = np.array(ask_comments_per_post, np.int32)
ask_mean = ask_arr.mean()
ask_median = np.median(ask_arr)
ask_std = ask_arr.std()

In [20]:
print(f"Total number of 'Show HN' comments: {show_arr.sum()}")
print(f"Average number of 'Show HN' comments per post: {show_mean}")
print(f"Median number of 'Show HN' comments per post: {show_median}")
print(f"Standard Deviation of 'Show HN' comments per post: {show_std}")

Total number of 'Show HN' comments: 11988
Average number of 'Show HN' comments per post: 10.31669535283993
Median number of 'Show HN' comments per post: 3.0
Standard Deviation of 'Show HN' comments per post: 23.31582636603177


In [21]:
print(f"Total number of 'Ask HN' comments: {ask_arr.sum()}")
print(f"Average number of 'Ask HN' comments per post: {ask_mean}")
print(f"Median number of 'Ask HN' comments per post: {ask_median}")
print(f"Standard Deviation of 'Ask HN' comments per post: {ask_std}")

Total number of 'Ask HN' comments: 24483
Average number of 'Ask HN' comments per post: 14.038417431192661
Median number of 'Ask HN' comments per post: 4.0
Standard Deviation of 'Ask HN' comments per post: 52.932211852745354


## Comparison between 'Ask HN' and 'Show HN' number of comments per post
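On average, 'Ask HN' posts get more comments than 'Show HN' posts (about 14.0 versus 10.3 per post). Interestingly, although 'Ask HN' posts receive more comments on average, their comment counts are also less predictable: the standard deviation is about 52.9 for 'Ask HN' against 23.3 for 'Show HN'. In both groups the median (4 and 3 comments) sits well below the mean, so a few heavily commented posts pull the averages up.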
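
One way to compare the spread of the two groups on an equal footing is the coefficient of variation (standard deviation divided by mean). The sketch below is only an illustration: it reuses the means and standard deviations printed above, and the helper name `coefficient_of_variation` is hypothetical, not part of the original analysis.

```python
# Summary statistics copied from the notebook output above.
stats = {
    "Ask HN": {"mean": 14.038417431192661, "std": 52.932211852745354},
    "Show HN": {"mean": 10.31669535283993, "std": 23.31582636603177},
}


def coefficient_of_variation(mean, std):
    """Return std / mean, a unitless measure of relative spread (hypothetical helper)."""
    return std / mean


# Relative spread for each label.
cv = {
    label: coefficient_of_variation(s["mean"], s["std"])
    for label, s in stats.items()
}

for label, value in cv.items():
    print(f"{label}: coefficient of variation = {value:.2f}")
```

A larger value means the comment counts vary more relative to their own average, which is a fairer comparison than the raw standard deviations when the means differ.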

In [27]:
result_list = []
for row in ask_posts:
    created_at = dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M')
    number_of_comments = int(row[4])
    result_list.append((created_at, number_of_comments))

In [23]:
print(ask_posts[0])

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']


In [28]:
counts_by_hour, comments_by_hour = {}, {}
for _date, num_com in result_list:
    hour = _date.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_com
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_com

In [32]:
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
sorted(avg_by_hour)

[[0, 8.127272727272727],
 [1, 11.383333333333333],
 [2, 23.810344827586206],
 [3, 7.796296296296297],
 [4, 7.170212765957447],
 [5, 10.08695652173913],
 [6, 9.022727272727273],
 [7, 7.852941176470588],
 [8, 10.25],
 [9, 5.5777777777777775],
 [10, 13.440677966101696],
 [11, 11.051724137931034],
 [12, 9.41095890410959],
 [13, 14.741176470588234],
 [14, 13.233644859813085],
 [15, 38.5948275862069],
 [16, 16.796296296296298],
 [17, 11.46],
 [18, 13.20183486238532],
 [19, 10.8],
 [20, 21.525],
 [21, 16.009174311926607],
 [22, 6.746478873239437],
 [23, 7.985294117647059]]
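
The tally and the averaging in the two cells above could also be written as a single function using `collections.defaultdict`. This is a minimal sketch, not the notebook's own method: the function name `average_comments_by_hour` and the three sample rows are made up for illustration, and they only mimic the `(created_at, number_of_comments)` shape of `result_list`.

```python
from collections import defaultdict
import datetime as dt

# Hypothetical sample rows in the same shape as result_list.
sample_results = [
    (dt.datetime(2016, 8, 16, 9, 55), 6),
    (dt.datetime(2016, 8, 17, 9, 10), 2),
    (dt.datetime(2016, 8, 17, 15, 30), 10),
]


def average_comments_by_hour(results):
    """Map each posting hour to the mean number of comments for that hour."""
    counts = defaultdict(int)    # posts created in each hour
    comments = defaultdict(int)  # total comments for each hour
    for created_at, num_comments in results:
        counts[created_at.hour] += 1
        comments[created_at.hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}


sample_avg = average_comments_by_hour(sample_results)
print(sample_avg)
```

With `defaultdict(int)` a missing hour starts at zero, so the explicit `if hour not in counts_by_hour` branch is no longer needed.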

In [38]:
# sort by average number of comments, highest first
swap_avg_by_hour = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
swap_avg_by_hour

[[15, 38.5948275862069],
 [2, 23.810344827586206],
 [20, 21.525],
 [16, 16.796296296296298],
 [21, 16.009174311926607],
 [13, 14.741176470588234],
 [10, 13.440677966101696],
 [14, 13.233644859813085],
 [18, 13.20183486238532],
 [17, 11.46],
 [1, 11.383333333333333],
 [11, 11.051724137931034],
 [19, 10.8],
 [8, 10.25],
 [5, 10.08695652173913],
 [12, 9.41095890410959],
 [6, 9.022727272727273],
 [0, 8.127272727272727],
 [23, 7.985294117647059],
 [7, 7.852941176470588],
 [3, 7.796296296296297],
 [4, 7.170212765957447],
 [22, 6.746478873239437],
 [9, 5.5777777777777775]]

In [45]:
print("Top 5 Hours for Ask Posts Comments")
for hr, avg in swap_avg_by_hour[:5]:
    _hour = dt.time(hour = hr)
    _hour = _hour.strftime("%H:%M")
    print(f"{_hour}: {avg:.2f} comments per post")

Top 5 Hours for Ask Posts Comments
15:00: 38.59 comments per post
02:00: 23.81 comments per post
20:00: 21.52 comments per post
16:00: 16.80 comments per post
21:00: 16.01 comments per post


In [55]:
# shift the top five hours by +2 hours to express them in GMT-3
# (assumes the created_at timestamps are two hours behind GMT-3;
# the time zone of the dataset is not stated in this notebook)
increase_2_hr = dt.timedelta(hours=2)
top5_gmtmin3_hours_avg = []
for hr, avg in swap_avg_by_hour[:5]:
    _hour = dt.datetime(1, 1, 1, hour=hr)
    top5_gmtmin3_hours_avg.append([_hour + increase_2_hr, avg])

In [56]:
print("Top 5 Hours for Ask Posts Comments in GMT-3")
for date, avg in top5_gmtmin3_hours_avg:
    _hour = date.strftime("%H:%M")
    print(f"{_hour}: {avg:.2f} comments per post")

Top 5 Hours for Ask Posts Comments in GMT-3
17:00: 38.59 comments per post
04:00: 23.81 comments per post
22:00: 21.52 comments per post
18:00: 16.80 comments per post
23:00: 16.01 comments per post
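
The same two-hour shift can be sketched with plain modular arithmetic, which makes the wrap-around at midnight explicit. The helper `shift_hour` is hypothetical, and the +2 offset simply mirrors the one used above; it rests on the same assumption about the time zone of the source timestamps.

```python
def shift_hour(hour, offset_hours):
    """Shift an hour of day (0-23) by a whole number of hours, wrapping at midnight."""
    return (hour + offset_hours) % 24


# Top five hours taken from the output above, shifted by the same +2 hours.
top_hours = [15, 2, 20, 16, 21]
shifted = [shift_hour(h, 2) for h in top_hours]
print([f"{h:02d}:00" for h in shifted])
```

An hour such as 23 would wrap to 01 rather than overflow, which is the case the `datetime` plus `timedelta` approach handles by rolling over to the next day.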


## Conclusion
Posts labeled 'Ask HN' receive more comments per post on average than posts labeled 'Show HN'. For 'Ask HN' posts, the five posting hours with the highest average number of comments are, in GMT-3: 17:00, 04:00, 22:00, 18:00 and 23:00.