# Hacker News: Analysis of posts

## About:
Hacker News is a website that is extremely popular in technology and startup circles. Visitors vote and comment on user-submitted stories (known as "posts").

We're specifically interested in posts whose titles begin with either **`Ask HN`** or **`Show HN`**. 

* Users submit **`Ask HN`** posts to ask the Hacker News community a specific question.
* Users submit **`Show HN`** posts to show the Hacker News community a project, product, or just generally something interesting.

## Goal:
Our goals in this project are:
   1. To compare the *Ask* and *Show* posts in the Hacker News data set and see which of the two receives more comments on average.
   2. To determine whether the time at which a post was created has a bearing on the number of comments it receives.


## Data set:

The data set for this project can be downloaded from [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

**Note**: The data set has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

Below are descriptions of the columns:

* `id`: The unique identifier from Hacker News for the post
* `title`: The title of the post
* `url`: The URL that the post links to, if the post has a URL
* `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: The number of comments that were made on the post
* `author`: The username of the person who submitted the post
* `created_at`: The date and time at which the post was submitted
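
As a minimal sketch of how these columns can be addressed by name rather than by position, the snippet below uses `csv.DictReader` from the standard library. This is an alternative to the index-based access used in the rest of this project, not the approach the analysis itself takes. The sample text holds the header line and the first data row of the data set, embedded as a string so the example runs without the CSV file.

```python
import csv
import io

# Header line plus the first data row of the data set, embedded as a
# string so that the example does not need hacker_news.csv on disk.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

# DictReader maps every row to {column name: value} using the header line.
rows = list(csv.DictReader(sample_csv))
first = rows[0]

# Every value is read as a string, so numeric columns need converting.
print(first["title"], "-", int(first["num_comments"]), "comments")
```

Accessing `first["num_comments"]` is easier to read than `post[4]`, at the cost of building one dictionary per row.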

Below, let's look at a few rows from the data set to understand how it is presented.

In [1]:
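from csv import reader

# Read every row into a list; the with-block closes the file afterwards.
with open('hacker_news.csv') as hn_file:
    hn = list(reader(hn_file))
headers = hn[0]  # the first row holds the column names
hn = hn[1:]      # the remaining rows are the posts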

print('Total number of posts:',len(hn),'\n')
print(headers)
print('-' * 75,'\n')
for post in hn[0:5]:
    print(post,'\n')

Total number of posts: 20100 

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
--------------------------------------------------------------------------- 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://w

## Data filtering:

As explained at the beginning of the project, we're only concerned with posts whose titles begin with **Ask HN** or **Show HN**.

In the section below, we will separate the posts in the data set on that basis.

* `ask_posts`: List of data containing only the **Ask HN** posts
* `show_posts`: List of data containing only the **Show HN** posts
* `other_posts`: List of data containing posts other than the above two

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    if (title.lower()).startswith('ask hn'):
        ask_posts.append(post)
    elif (title.lower()).startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)

print('Number of Ask HN posts:',len(ask_posts))
print('Number of Show HN posts:',len(show_posts))
print('Number of other posts:',len(other_posts))

Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of other posts: 17194


## Data analysis:

### Step 1: Comments - Ask HN vs Show HN posts

In [3]:
total_ask_comments = 0
avg_ask_comments = 0
total_ask_posts = len(ask_posts)
ask_freq_table = {}

def get_range(num_comments):
    """Return the label of the comment-count bucket for a post."""
    if 1 <= num_comments <= 5:
        comments_range = 'a: 1 - 5 comments'
    elif 5 < num_comments <= 10:
        comments_range = 'b: 6 - 10 comments'
    else:
        # More than 10 comments (posts without comments were removed from the data set)
        comments_range = 'c: 10+ comments'
    return comments_range

def display_comment_range(freq_table):    
    print('\n','Number of posts by Comment ranges')
    for bucket in sorted(freq_table):
        print(bucket,':',freq_table[bucket],'posts')

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
    comments_range = get_range(num_comments)    
    if comments_range not in ask_freq_table:
        ask_freq_table[comments_range] = 1
    else:
        ask_freq_table[comments_range] += 1        
avg_ask_comments = total_ask_comments/total_ask_posts

total_show_comments = 0
avg_show_comments = 0
total_show_posts = len(show_posts)
show_freq_table = {}
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    comments_range = get_range(num_comments)    
    if comments_range not in show_freq_table:
        show_freq_table[comments_range] = 1
    else:
        show_freq_table[comments_range] += 1
avg_show_comments = total_show_comments/total_show_posts

print('*' * 5,'Ask Posts', '*'*5)
print('Total posts = ',total_ask_posts)
print('Total comments = ',total_ask_comments)
print('Average comments = ',avg_ask_comments)
display_comment_range(ask_freq_table)
print('\n')
print('*' * 5,'Show Posts', '*'*5)
print('Total posts = ',total_show_posts)
print('Total comments = ',total_show_comments)
print('Average comments = ',avg_show_comments)
display_comment_range(show_freq_table)

***** Ask Posts *****
Total posts =  1744
Total comments =  24483
Average comments =  14.038417431192661

 Number of posts by Comment ranges
a: 1 - 5 comments : 1083 posts
b: 6 - 10 comments : 305 posts
c: 10+ comments : 356 posts


***** Show Posts *****
Total posts =  1162
Total comments =  11988
Average comments =  10.31669535283993

 Number of posts by Comment ranges
a: 1 - 5 comments : 820 posts
b: 6 - 10 comments : 110 posts
c: 10+ comments : 232 posts


From the results above, we see that:
* After our data filtering, there are 1,744 "`Ask HN`" posts and 1,162 "`Show HN`" posts
* Across their respective posts, there are 24,483 comments for "`Ask HN`" posts and 11,988 comments for "`Show HN`" posts
* On average, "`Ask HN`" posts have ~14 comments and "`Show HN`" posts have ~10 comments

We conclude that:

> On average, **Ask HN** posts receive more comments than **Show HN** posts
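
The frequency tables above show that most posts of both types fall in the 1 - 5 comments range, so a handful of heavily discussed posts can pull the average upwards. As a hedged cross-check, the sketch below computes the median next to the mean. The helper `summarize_comments` and the three demo rows are hypothetical and only illustrate the idea; they are not part of the data set or of the original analysis.

```python
from statistics import mean, median

def summarize_comments(posts, comments_index=4):
    """Return (post count, total, mean, median) of the comment counts."""
    counts = [int(post[comments_index]) for post in posts]
    return len(counts), sum(counts), mean(counts), median(counts)

# Hypothetical rows in the data set's column order; only index 4
# (num_comments) matters for this example.
demo_posts = [
    ['1', 'Ask HN: demo a', '', '1', '2', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: demo b', '', '1', '3', 'user_b', '1/1/2016 11:00'],
    ['3', 'Ask HN: demo c', '', '1', '100', 'user_c', '1/1/2016 12:00'],
]

n_posts, total, avg, mid = summarize_comments(demo_posts)
print('posts:', n_posts, 'total:', total, 'mean:', avg, 'median:', mid)
```

Calling `summarize_comments(ask_posts)` and `summarize_comments(show_posts)` would give the same totals and averages as the cell above, plus a median that is less sensitive to outliers.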

### Step 2: Analyse the number of posts and total comments by the hour created

We will now see the number of posts made and total number of comments received based on the hour of the day the post was created (Based on the `created_at` column). We will do this on both `Ask HN` and `Show HN` posts.

**Note**: According to the data set [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts/home), the timezone used is Eastern Time in the US.

For each hour of the day in which posts were created, we check (sorted, highest first):
1. The total number of posts created in that hour
2. The total number of comments received across all posts created in that hour

Before we display the data, we will prepare our reusable code unit in the next block.

In [4]:
import datetime as dt

# Slice by the time hour - To get number of posts and total comments in that hour
def freq_table_time(input_list):
    result_dict = {}
    date_format = "%m/%d/%Y %H:%M"
    for post in input_list:
        created = dt.datetime.strptime(post[-1],date_format)
        created_hour = created.strftime("%H")
        num_comments = int(post[4])
        if created_hour not in result_dict:
            result_dict[created_hour] = [num_comments]
        else:
            result_dict[created_hour].append(num_comments)
    return result_dict

def display_by_hour(input_list, post_type):
    input_dict = freq_table_time(input_list)
    posts_by_hour = {}     
    comments_by_hour = {}
    avg_comments_by_hour = {}
    for k,v in input_dict.items():
        total_comments = sum(v)
        total_posts = len(v)
        posts_by_hour[k] = total_posts
        comments_by_hour[k] = total_comments
        avg_comments_by_hour[k] = (total_comments/total_posts)
    
    print('*' * 5,post_type,'Posts by hour - Highest first','*' * 5,'\n')        
    output_str_frmt = "{}: {} posts"
    for hour in sorted(posts_by_hour, key=posts_by_hour.get, reverse = True):    
        print(output_str_frmt.format(dt.datetime.strptime(hour, "%H").strftime("%H:%M"),posts_by_hour[hour]))
    print('\n')
    print('*' * 5,post_type,'Comments by hour - Highest first','*' * 5,'\n')
    output_str_frmt = "{}: {} comments"
    for hour in sorted(comments_by_hour, key=comments_by_hour.get, reverse = True):
        print(output_str_frmt.format(dt.datetime.strptime(hour, "%H").strftime("%H:%M"),comments_by_hour[hour]))
    return avg_comments_by_hour

#### Ask HN Posts:

In [5]:
ask_avg_comments_by_hour = display_by_hour(ask_posts, 'Ask HN')

***** Ask HN Posts by hour - Highest first ***** 

15:00: 116 posts
19:00: 110 posts
18:00: 109 posts
21:00: 109 posts
16:00: 108 posts
14:00: 107 posts
17:00: 100 posts
13:00: 85 posts
20:00: 80 posts
12:00: 73 posts
22:00: 71 posts
23:00: 68 posts
01:00: 60 posts
10:00: 59 posts
11:00: 58 posts
02:00: 58 posts
00:00: 55 posts
03:00: 54 posts
08:00: 48 posts
04:00: 47 posts
05:00: 46 posts
09:00: 45 posts
06:00: 44 posts
07:00: 34 posts


***** Ask HN Comments by hour - Highest first ***** 

15:00: 4477 comments
16:00: 1814 comments
21:00: 1745 comments
20:00: 1722 comments
18:00: 1439 comments
14:00: 1416 comments
02:00: 1381 comments
13:00: 1253 comments
19:00: 1188 comments
17:00: 1146 comments
10:00: 793 comments
12:00: 687 comments
01:00: 683 comments
11:00: 641 comments
23:00: 543 comments
08:00: 492 comments
22:00: 479 comments
05:00: 464 comments
00:00: 447 comments
03:00: 421 comments
06:00: 397 comments
04:00: 337 comments
07:00: 267 comments
09:00: 251 comments


#### Show HN Posts:

In [6]:
show_avg_comments_by_hour = display_by_hour(show_posts, 'Show HN')

***** Show HN Posts by hour - Highest first ***** 

13:00: 99 posts
17:00: 93 posts
16:00: 93 posts
14:00: 86 posts
15:00: 78 posts
12:00: 61 posts
18:00: 61 posts
20:00: 60 posts
19:00: 55 posts
21:00: 47 posts
22:00: 46 posts
11:00: 44 posts
10:00: 36 posts
23:00: 36 posts
08:00: 34 posts
00:00: 31 posts
09:00: 30 posts
02:00: 30 posts
01:00: 28 posts
03:00: 27 posts
07:00: 26 posts
04:00: 26 posts
05:00: 19 posts
06:00: 16 posts


***** Show HN Comments by hour - Highest first ***** 

14:00: 1156 comments
16:00: 1084 comments
18:00: 962 comments
13:00: 946 comments
17:00: 911 comments
12:00: 720 comments
15:00: 632 comments
20:00: 612 comments
22:00: 570 comments
19:00: 539 comments
11:00: 491 comments
00:00: 487 comments
23:00: 447 comments
07:00: 299 comments
10:00: 297 comments
09:00: 291 comments
03:00: 287 comments
21:00: 272 comments
04:00: 247 comments
01:00: 246 comments
08:00: 165 comments
06:00: 142 comments
02:00: 127 comments
05:00: 58 comments


### Step 3: Average number of comments per post by hour

In the previous step, we displayed the number of posts and total comments by the hour for both `Ask HN` and `Show HN` posts.

The code block above also calculated the average number of comments per post by hour for both `Ask HN` and `Show HN` posts and stored them in the dictionaries `ask_avg_comments_by_hour` and `show_avg_comments_by_hour`.

Let's now review that data.

#### Ask HN Posts:

In [7]:
print('*' * 5,'Ask HN Posts: Average number of comments by hour - Highest first','*' * 5,'\n')
output_str_frmt = "{}: {:.2f} average comments per post"
for hour in sorted(ask_avg_comments_by_hour, key=ask_avg_comments_by_hour.get, reverse = True):    
    print(output_str_frmt.format(dt.datetime.strptime(hour, "%H").strftime("%H:%M"),ask_avg_comments_by_hour[hour]))
print('\n')

***** Ask HN Posts: Average number of comments by hour - Highest first ***** 

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
18:00: 13.20 average comments per post
17:00: 11.46 average comments per post
01:00: 11.38 average comments per post
11:00: 11.05 average comments per post
19:00: 10.80 average comments per post
08:00: 10.25 average comments per post
05:00: 10.09 average comments per post
12:00: 9.41 average comments per post
06:00: 9.02 average comments per post
00:00: 8.13 average comments per post
23:00: 7.99 average comments per post
07:00: 7.85 average comments per post
03:00: 7.80 average comments per post
04:00: 7.17 average comments per post
22:00: 6.75 average comments per post
09:00: 5.58 average comments per post

From the above analysis, the hour\(*) that receives the most comments per post on average is 15:00 (3 pm Eastern Time), with an average of 38.59 comments per post.

The second-best hour\(*) is 02:00 (2 am Eastern Time), with an average of 23.81 comments per post.

**\(*)** The timezone used is Eastern Time in the US, so 15:00 is 3 pm Eastern Time.
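
For readers in another timezone, the hours above need shifting. The sketch below converts an Eastern hour of the day to UTC. It assumes a fixed UTC-5 offset (Eastern Standard Time) and deliberately ignores daylight saving time, during which the offset is UTC-4. The names `EASTERN_STANDARD` and `eastern_hour_to_utc` are illustrative and not part of the original analysis.

```python
import datetime as dt

# Assumption: fixed UTC-5 offset (Eastern Standard Time). Daylight saving
# time is ignored to keep the sketch free of extra dependencies.
EASTERN_STANDARD = dt.timezone(dt.timedelta(hours=-5), "EST")

def eastern_hour_to_utc(hour):
    """Convert an hour of the day (0-23) in EST to the matching UTC hour."""
    # The date is arbitrary; only the hour component matters here.
    eastern = dt.datetime(2016, 1, 1, hour, tzinfo=EASTERN_STANDARD)
    return eastern.astimezone(dt.timezone.utc).hour

print(eastern_hour_to_utc(15))
```

Under this assumption, 15:00 in the tables corresponds to 20:00 UTC. A conversion that tracks daylight saving would need a timezone database instead of a fixed offset.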

#### Show HN Posts:

In [8]:
print('*' * 5,'Show HN Posts: Average number of comments by hour - Highest first','*' * 5,'\n')
output_str_frmt = "{}: {:.2f} average comments per post"
for hour in sorted(show_avg_comments_by_hour, key=show_avg_comments_by_hour.get, reverse = True):    
    print(output_str_frmt.format(dt.datetime.strptime(hour, "%H").strftime("%H:%M"),show_avg_comments_by_hour[hour]))
print('\n')

***** Show HN Posts: Average number of comments by hour - Highest first ***** 

18:00: 15.77 average comments per post
00:00: 15.71 average comments per post
14:00: 13.44 average comments per post
23:00: 12.42 average comments per post
22:00: 12.39 average comments per post
12:00: 11.80 average comments per post
16:00: 11.66 average comments per post
07:00: 11.50 average comments per post
11:00: 11.16 average comments per post
03:00: 10.63 average comments per post
20:00: 10.20 average comments per post
19:00: 9.80 average comments per post
17:00: 9.80 average comments per post
09:00: 9.70 average comments per post
13:00: 9.56 average comments per post
04:00: 9.50 average comments per post
06:00: 8.88 average comments per post
01:00: 8.79 average comments per post
10:00: 8.25 average comments per post
15:00: 8.10 average comments per post
21:00: 5.79 average comments per post
08:00: 4.85 average comments per post
02:00: 4.23 average comments per post
05:00: 3.05 average comments per post

For `Show HN` posts, however, the average number of comments per post is not very conclusive on its own: the top averages are close together (15.77 at 18:00 and 15.71 at 00:00), and an average built on few posts, such as the 31 posts at 00:00, is easily skewed by one or two heavily discussed posts.

So we read the average together with the total number of posts and the total number of comments by hour. On that basis, the best hour\(*) for `Show HN` posts is 14:00 (2 pm Eastern Time): its average of 13.44 comments per post is the third highest rather than the highest, but it received the most comments overall (1,156 comments over 86 posts).

**\(*)** The timezone used is Eastern Time in the US, so 14:00 is 2 pm Eastern Time.
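
One way to make this kind of reasoning explicit is to rank hours by their average only when they have enough posts behind them. The sketch below does this for five `Show HN` hours, using the post and comment counts printed in the output above. The function `rank_hours` and the cut-off `min_posts=80` are illustrative assumptions, not part of the original analysis.

```python
# (posts, total comments) per hour, copied from the Show HN output above.
show_hour_stats = {
    "00": (31, 487),
    "13": (99, 946),
    "14": (86, 1156),
    "16": (93, 1084),
    "18": (61, 962),
}

def rank_hours(hour_stats, min_posts):
    """Rank hours by average comments, skipping hours with few posts."""
    averages = {
        hour: comments / posts
        for hour, (posts, comments) in hour_stats.items()
        if posts >= min_posts
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# min_posts=80 is an arbitrary, illustrative cut-off.
for hour, avg in rank_hours(show_hour_stats, min_posts=80):
    print(f"{hour}:00 -> {avg:.2f} average comments per post")
```

With this cut-off, 14:00 ranks first among the remaining hours, while 18:00 and 00:00 drop out because they have fewer than 80 posts. A different cut-off can change the ranking, so the threshold should be chosen with the data in mind.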

## Conclusion:

From the above data analysis, we conclude the following about the goals we set out for this project:

* Ask HN vs Show HN posts - Popularity by comment engagement

   > **Ask HN** posts attract more comment engagement, with an average of ~14 comments per post compared with ~10 for **Show HN** posts

* Best hour for posts based on comments received
     
   > **For Ask HN**: Between 15:00 and 16:00 (3-4 pm Eastern Time), with an average of 38.59 comments per post
   
   > **For Show HN**: Between 14:00 and 15:00 (2-3 pm Eastern Time), with an average of 13.44 comments per post (a total of 1,156 comments over 86 posts)