# Hacker News Article Analysis

## Introduction

This project uses the Hacker News articles dataset. 
It compares two types of posts, <code>Ask HN</code> and <code>Show HN</code>, to determine which receives more comments on average, and whether posts created at a certain hour of the day receive more comments on average. 

## Preliminaries

We begin by reading and exploring the <code>hacker_news.csv</code> file.

In [2]:
from csv import reader

# Read every row of the CSV into a list of lists; the file is closed on exit
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

In [3]:
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


We need to remove the header row of the <code>hn</code> dataset.

In [4]:
headers = hn[0]
del hn[0]

The header is displayed below.

In [5]:
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
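
The cells that follow pick out columns by position (`r[1]` for the title, `p[4]` for the comment count, `row[6]` for the creation time). As a minimal sketch, not part of the original notebook, the header row can be turned into a name-to-index lookup so that those positions do not have to be memorized. The name `col` is an assumption introduced for illustration; the sample row is copied from the preview above.

```python
# Header row as printed above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical lookup: column name -> position within each row.
col = {name: index for index, name in enumerate(headers)}

# First data row from the preview above.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Positions used by the later cells.
print(col['title'], col['num_comments'], col['created_at'])

# Access by name instead of by a bare index.
print(sample_row[col['title']], int(sample_row[col['num_comments']]))
```

With such a lookup, `r[col['title']]` would be equivalent to `r[1]`.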


The first five rows of the dataset, now without the header, are shown below.

In [6]:
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Filtering the data

We separate the posts whose titles start with <code>Ask HN</code> or <code>Show HN</code> from all other posts. The comparison is case-insensitive.

In [7]:
ask_posts = []
show_posts = []
other_posts = []

for r in hn:
    title = r[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(r)
    elif title.lower().startswith('show hn'):
        show_posts.append(r)
    else:
        other_posts.append(r)

In [8]:
print('The number of Ask HN posts is '+str(len(ask_posts)))
print('The number of Show HN posts is '+str(len(show_posts)))
print('The number of other posts is '+str(len(other_posts)))

The number of Ask HN posts is 1744
The number of Show HN posts is 1162
The number of other posts is 17194
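
The prefix test in the loop above can also be written as a small function, which makes it easy to try on individual titles. This is only a sketch: `classify_title` is a hypothetical name, and the titles are invented for illustration rather than taken from the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()  # case-insensitive, as in the loop above
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Invented titles, for illustration only.
examples = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Why I ask HN for advice',
]
for title in examples:
    print(classify_title(title), '<-', title)
```

The third title contains "ask HN" but does not start with it, so it falls into the `other` group, which matches the behavior of `startswith` in the original loop.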


## Analysis

### Comparison of Number of Comments on Ask vs. Show Posts

We shall determine whether <code>Ask HN</code> or <code>Show HN</code> posts receive a greater number of comments on average.

In [9]:
total_ask_comments = 0

for p in ask_posts:
    num_comments = int(p[4])
    total_ask_comments+=num_comments
    
avg_ask_comments = round(total_ask_comments/len(ask_posts))
print("The average number of comments on Ask HN posts is "+str(avg_ask_comments))

The average number of comments on Ask HN posts is 14


In [10]:
total_show_comments = 0

for p in show_posts:
    num_comments = int(p[4])
    total_show_comments+=num_comments
    
avg_show_comments = round(total_show_comments/len(show_posts))
print("The average number of comments on Show HN posts is "+str(avg_show_comments))

The average number of comments on Show HN posts is 10
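
Because `round` is called without a second argument, the two averages above are reported as whole numbers. If more precision is wanted, a helper along these lines could be used instead. It is a sketch only: `average_comments` is a hypothetical name, the rows are invented, and the comment count is assumed to sit at index 4 as in the dataset.

```python
def average_comments(posts, ndigits=2):
    """Mean of the num_comments column (index 4), rounded to ndigits places."""
    if not posts:
        raise ValueError("cannot average an empty list of posts")
    total = sum(int(post[4]) for post in posts)
    return round(total / len(posts), ndigits)


# Invented rows in the dataset's column order; only index 4 matters here.
toy_posts = [
    ['1', 'Ask HN: a', '', '5', '3', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: b', '', '2', '4', 'user_b', '1/1/2016 11:00'],
    ['3', 'Ask HN: c', '', '9', '6', 'user_c', '1/1/2016 12:00'],
]
print(average_comments(toy_posts))
```

The explicit check for an empty list avoids a `ZeroDivisionError` if a filter ever returns no posts.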


#### Conclusion

From the above outputs, it can be concluded that <code>Ask HN</code> posts receive more comments on average (about 14, rounded to the nearest whole number) than <code>Show HN</code> posts (about 10).

Since <code>Ask HN</code> posts receive more comments on average, our remaining analyses will focus solely on these posts.

### Investigating the relationship between post creation time and number of comments

We aim to find whether posts created at a certain time receive more comments on average.

For this purpose, we extract the creation time and number of comments of each post. Using these, we build two dictionaries keyed by hour of the day: one counting the posts created during each hour, and one summing the comments those posts received.

In [25]:
import datetime as dt

result_list = []
for row in ask_posts:
    to_add = []
    to_add.append(row[6]) #created_at 
    to_add.append(row[4]) #num_comments
    result_list.append(to_add)

counts_by_hour = {}
comments_by_hour = {}

for result in result_list:
    date_hour_str = result[0] 
    comments = int(result[1])
    
    date_hour = dt.datetime.strptime(date_hour_str,"%m/%d/%Y %H:%M")
    date_hour = date_hour.time().hour
    
    if date_hour not in counts_by_hour:
        counts_by_hour[date_hour] = 1
        comments_by_hour[date_hour] = comments
    else:
        counts_by_hour[date_hour] += 1
        comments_by_hour[date_hour] += comments
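
The same per-hour tallies can be built without the explicit `if`/`else` by using `collections.defaultdict`. The following is a sketch under assumptions: `tally_by_hour` is a hypothetical name and the `(created_at, num_comments)` pairs are invented, though they follow the dataset's timestamp format.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(pairs):
    """Return (posts per hour, comments per hour) for (created_at, num_comments) pairs."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
        counts[hour] += 1
        comments[hour] += int(num_comments)
    return dict(counts), dict(comments)


# Invented pairs, for illustration only.
toy_pairs = [
    ('8/4/2016 11:52', '6'),
    ('8/5/2016 11:05', '2'),
    ('9/30/2015 4:12', '9'),
]
counts, comments = tally_by_hour(toy_pairs)
print(counts)
print(comments)
```

Here `.hour` is read directly from the parsed `datetime`, which gives the same value as `.time().hour` in the cell above.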

We will now find the average number of comments per post for posts created during each hour of the day.

In [31]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour,(comments_by_hour[hour]/counts_by_hour[hour])])

print("Read as [Hour, Average Number of Comments]")    
print(avg_by_hour)

Read as [Hour, Average Number of Comments]
[[0, 8.127272727272727], [1, 11.383333333333333], [2, 23.810344827586206], [3, 7.796296296296297], [4, 7.170212765957447], [5, 10.08695652173913], [6, 9.022727272727273], [7, 7.852941176470588], [8, 10.25], [9, 5.5777777777777775], [10, 13.440677966101696], [11, 11.051724137931034], [12, 9.41095890410959], [13, 14.741176470588234], [14, 13.233644859813085], [15, 38.5948275862069], [16, 16.796296296296298], [17, 11.46], [18, 13.20183486238532], [19, 10.8], [20, 21.525], [21, 16.009174311926607], [22, 6.746478873239437], [23, 7.985294117647059]]


To make the above result easier to interpret, we swap the two elements of each entry in the <code>avg_by_hour</code> list so that the average comes first, sort the swapped list in descending order, and display the five hours with the highest averages.

In [33]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

print(swap_avg_by_hour)

[[8.127272727272727, 0], [11.383333333333333, 1], [23.810344827586206, 2], [7.796296296296297, 3], [7.170212765957447, 4], [10.08695652173913, 5], [9.022727272727273, 6], [7.852941176470588, 7], [10.25, 8], [5.5777777777777775, 9], [13.440677966101696, 10], [11.051724137931034, 11], [9.41095890410959, 12], [14.741176470588234, 13], [13.233644859813085, 14], [38.5948275862069, 15], [16.796296296296298, 16], [11.46, 17], [13.20183486238532, 18], [10.8, 19], [21.525, 20], [16.009174311926607, 21], [6.746478873239437, 22], [7.985294117647059, 23]]


In [34]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [41]:
print("Top 5 Hours for Ask Posts Comments\n")

for each in sorted_swap[:5]:
    creation_hour = dt.datetime.strptime(str(each[1]),"%H")
    template = "{t}:\t{a:.2f} average comments per post"
    output = template.format(t=creation_hour.strftime("%H:%M"),a=each[0])
    print(output)

Top 5 Hours for Ask Posts Comments

15:00:	38.59 average comments per post
02:00:	23.81 average comments per post
20:00:	21.52 average comments per post
16:00:	16.80 average comments per post
21:00:	16.01 average comments per post
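
The swap step is needed only because `sorted` compares lists by their first element. As an alternative sketch, the `key` argument of `sorted` can rank the original `[hour, average]` pairs directly. The list below is a subset of the `avg_by_hour` output shown earlier; `top_hours` is a name introduced here for illustration.

```python
# Subset of the [hour, average] pairs printed earlier.
avg_by_hour_subset = [
    [0, 8.127272727272727],
    [2, 23.810344827586206],
    [15, 38.5948275862069],
    [16, 16.796296296296298],
    [20, 21.525],
    [21, 16.009174311926607],
]

# Rank by the second element (the average), highest first, and keep five.
top_hours = sorted(avg_by_hour_subset, key=lambda pair: pair[1], reverse=True)[:5]

for hour, average in top_hours:
    print(f"{hour:02d}:00:\t{average:.2f} average comments per post")
```

This keeps the hour as an integer throughout, so no round trip through `strptime` is needed for formatting.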


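The hours above are in US Eastern Time. To determine the best time for us to create a post, we convert them to Mauritius time (UTC+4). The 8-hour offset used below holds while Eastern Daylight Time (UTC-4) is in effect; under Eastern Standard Time (UTC-5) the difference is 9 hours.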

In [44]:
mtius_time_offset = dt.timedelta(hours=8)

print("Top 5 Hours for Ask Posts Comments (Mauritius Time)\n")

for each in sorted_swap[:5]:
    creation_hour = dt.datetime.strptime(str(each[1]),"%H")
    creation_hour += mtius_time_offset
    template = "{t}:\t{a:.2f} average comments per post"
    output = template.format(t=creation_hour.strftime("%H:%M"),a=each[0])
    print(output)

Top 5 Hours for Ask Posts Comments (Mauritius Time)

23:00:	38.59 average comments per post
10:00:	23.81 average comments per post
04:00:	21.52 average comments per post
00:00:	16.80 average comments per post
05:00:	16.01 average comments per post
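
Adding a fixed `timedelta` works for display purposes, but it leaves the assumed source offset implicit. A sketch using timezone-aware datetimes makes that assumption visible. It assumes the source hours are in Eastern Daylight Time (UTC-4), which is what an 8-hour difference to Mauritius (UTC+4) implies; the function name, the constants, and the placeholder date are all introduced here for illustration.

```python
import datetime as dt

# Fixed offsets (assumptions): Eastern Daylight Time, Eastern Standard Time, Mauritius Time.
EDT = dt.timezone(dt.timedelta(hours=-4), "EDT")
EST = dt.timezone(dt.timedelta(hours=-5), "EST")
MUT = dt.timezone(dt.timedelta(hours=4), "MUT")


def eastern_hour_to_mauritius(hour, source_tz=EDT):
    """Convert an hour of the day in source_tz to an HH:MM string in Mauritius time."""
    # The date is an arbitrary placeholder; only the hour matters here.
    moment = dt.datetime(2016, 8, 4, hour, 0, tzinfo=source_tz)
    return moment.astimezone(MUT).strftime("%H:%M")


for hour in [15, 2, 20, 16, 21]:
    print(f"{hour:02d}:00 Eastern ->", eastern_hour_to_mauritius(hour))
```

Passing `source_tz=EST` shifts every result one hour later, which shows how much the ranking's local times depend on which Eastern offset applies to a given post.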


###### A guided project by sharjs, learning Data Science at DataQuest.io