# Exploring Hacker News Posts using Python

* This notebook analyzes a sample of the Hacker News Posts dataset from Kaggle. The original dataset is available here: [link](https://www.kaggle.com/hacker-news/hacker-news-posts)

* The dataset contains 7 variables:

| Variable | Description |
| -------- | ----------- |
| id | The unique identifier from Hacker News for the post |
| title | The title of the post |
| url | The URL that the post links to, if the post has a URL |
| num_points | The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
| num_comments | The number of comments that were made on the post |
| author | The username of the person who submitted the post |
| created_at | The date and time at which the post was submitted |

* The objective of our analysis is to determine which type of post (Ask HN or Show HN) draws more points and comments. Is the time of posting a significant contributor to the number of comments received?

In [1]:
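from csv import reader

# Read the whole CSV into a list of rows; the context manager closes the file.
with open('hacker_news.csv') as op_file:
    h_news = list(reader(op_file))

header = h_news[0]   # column names
h_news = h_news[1:]  # data rows only
print(header, "\n")
print(h_news[:5])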

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Step 1: Isolating "Ask" and "Show" posts
- We split the sample dataset into three separate lists of lists: "Ask" posts, "Show" posts, and "Other" posts.
- This enables us to compare the categories directly using a performance metric.
- For this analysis, we use the average number of comments received per post as our measure of performance.
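
As a minimal sketch of the metric described above, the average can be computed directly from rows shaped like those in `h_news`. The helper name `avg_comments` and the three sample rows are illustrative assumptions, not part of the dataset.

```python
def avg_comments(rows, comments_idx=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not rows:
        return 0.0
    return sum(int(row[comments_idx]) for row in rows) / len(rows)


# Made-up rows that follow the column order of the dataset header.
sample_rows = [
    ['1', 'Ask HN: Example question?', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Another question?', '', '2', '4', 'user_b', '8/4/2016 12:10'],
    ['3', 'Ask HN: A third question?', '', '1', '1', 'user_c', '8/5/2016 9:00'],
]

print(avg_comments(sample_rows))  # (10 + 4 + 1) / 3 = 5.0
```

Guarding against an empty list avoids a `ZeroDivisionError` if a category happens to contain no posts.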

In [2]:
ask_hn = []
show_hn = []
other_posts = []

for row in h_news:
    title = row[1]
    if (title.lower()).startswith('ask hn'):
        ask_hn.append(row)
    elif (title.lower()).startswith('show hn'):
        show_hn.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_hn))
print(len(show_hn))
print(len(other_posts))

1744
1162
17194


In [3]:
def total_comments(temp_list):
    total_comment_count = 0
    for each in temp_list:
        count = each[4]
        total_comment_count += int(count)
        
    return total_comment_count

In [4]:
total_ask_comments = total_comments(ask_hn)
print(total_ask_comments,"\n")
total_show_comments = total_comments(show_hn)
print(total_show_comments,"\n")
total_other_comments = total_comments(other_posts)
#print(total_other_comments,"\n")

24483 

11988 



In [5]:
avg_ask_comments = total_ask_comments/len(ask_hn)
print("Average no. of comments per 'Ask' post: ", avg_ask_comments,"\n")
avg_show_comments = total_show_comments/len(show_hn)
print("Average no. of comments per 'Show' post: ", avg_show_comments,"\n")
avg_other_comments = total_other_comments/len(other_posts)
#print("Average no. of comments per 'Other' post: ", avg_other_comments,"\n")

Average no. of comments per 'Ask' post:  14.038417431192661 

Average no. of comments per 'Show' post:  10.31669535283993 



We can see that, between "Ask" and "Show" posts, the former receive about 3.7 more comments per post on average (14.04 versus 10.32).

The question-based nature of "Ask" posts could be the driving force behind more commenters being drawn to them.
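
The size of the gap between the two averages can be checked from the totals and post counts printed earlier. This is only a small arithmetic sketch; the variable names are chosen for illustration.

```python
# Totals and post counts as printed by the cells above.
total_ask, n_ask = 24483, 1744
total_show, n_show = 11988, 1162

avg_ask = total_ask / n_ask
avg_show = total_show / n_show

gap = avg_ask - avg_show    # absolute difference in comments per post
ratio = avg_ask / avg_show  # relative difference

print(f"gap: {gap:.2f} comments per post")
print(f"ratio: {ratio:.2f}")
```

The ratio gives a sense of scale that the absolute gap alone does not: "Ask" posts draw roughly a third more comments per post than "Show" posts in this sample.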

## Step 2: Determining the influence of time
* Does the time when a post is created affect the number of comments it receives?

For each hour of the day, we count the posts created in that hour and total their comments, then divide the two to get the average number of comments per post for that hour.

This should provide more insight into the activity of the Hacker News community.
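
The per-hour averaging described above can be sketched compactly with `collections.defaultdict`. This is an illustrative alternative, not the approach used in the cells that follow; the function name `avg_comments_by_hour` and the sample rows are assumptions.

```python
from collections import defaultdict
from datetime import datetime

DT_FORMAT = "%m/%d/%Y %H:%M"  # matches values such as '8/4/2016 11:52'


def avg_comments_by_hour(rows, time_idx=6, comments_idx=4):
    """Map each hour string ('00' to '23') to its average comments per post."""
    posts = defaultdict(int)
    comments = defaultdict(int)
    for row in rows:
        hour = datetime.strptime(row[time_idx], DT_FORMAT).strftime("%H")
        posts[hour] += 1
        comments[hour] += int(row[comments_idx])
    return {hour: comments[hour] / posts[hour] for hour in posts}


# Made-up rows that follow the column order of the dataset header.
sample_rows = [
    ['1', 'Ask HN: A?', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '2', '4', 'user_b', '8/5/2016 11:05'],
    ['3', 'Ask HN: C?', '', '1', '1', 'user_c', '8/5/2016 9:00'],
]

hourly = avg_comments_by_hour(sample_rows)
print(hourly)
```

Two posts fall in hour `11` with 14 comments between them, and one post falls in hour `09` with a single comment, so the averages are 7.0 and 1.0.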

In [6]:
import datetime as dtime
dt_format = "%m/%d/%Y %H:%M"

In [7]:
def gen_dict(temp_list):
    
    temp_results = []

    for row in temp_list:
        count = int(row[4])
        create_time = row[6]
        temp = [create_time, count]
        temp_results.append(temp)
    
    hourly_post_count = {}
    comments_per_hour = {}

    for each in temp_results:
        timestamp = dtime.datetime.strptime(each[0],dt_format)
        t_stamp = timestamp.strftime('%H')
    
        if t_stamp not in hourly_post_count:
            hourly_post_count[t_stamp] = 1
            comments_per_hour[t_stamp] = each[1]
        else:
            hourly_post_count[t_stamp] += 1
            comments_per_hour[t_stamp] += each[1]

    return (hourly_post_count, comments_per_hour)

In [8]:
counts_ask_hour, comments_ask_hour = gen_dict(ask_hn)
counts_show_hour, comments_show_hour = gen_dict(show_hn)

In [9]:
def avg_hourly(dict1, dict2):
    result_list = []
    
    for key in dict1:
        result_list.append([key, (dict2[key]/dict1[key])])
    
    return result_list

In [10]:
avg_asks_hour = avg_hourly(counts_ask_hour, comments_ask_hour)
print("Average no. of comments each hour for 'Ask' posts: \n",avg_asks_hour,'\n')
avg_show_hour = avg_hourly(counts_show_hour, comments_show_hour)
print("Average no. of comments each hour for 'Show' posts: \n",avg_show_hour,'\n')

Average no. of comments each hour for 'Ask' posts: 
 [['12', 9.41095890410959], ['07', 7.852941176470588], ['01', 11.383333333333333], ['17', 11.46], ['20', 21.525], ['06', 9.022727272727273], ['10', 13.440677966101696], ['08', 10.25], ['03', 7.796296296296297], ['19', 10.8], ['09', 5.5777777777777775], ['11', 11.051724137931034], ['02', 23.810344827586206], ['16', 16.796296296296298], ['13', 14.741176470588234], ['14', 13.233644859813085], ['18', 13.20183486238532], ['00', 8.127272727272727], ['23', 7.985294117647059], ['04', 7.170212765957447], ['22', 6.746478873239437], ['21', 16.009174311926607], ['05', 10.08695652173913], ['15', 38.5948275862069]] 

Average no. of comments each hour for 'Show' posts: 
 [['12', 11.80327868852459], ['07', 11.5], ['20', 10.2], ['10', 8.25], ['06', 8.875], ['17', 9.795698924731182], ['08', 4.852941176470588], ['03', 10.62962962962963], ['19', 9.8], ['09', 9.7], ['11', 11.159090909090908], ['01', 8.785714285714286], ['05', 3.0526315789473686], ['16', 1

In [11]:
def swapper(temp_list):
    swapped = []
    
    for each in temp_list:
        swapped.append([each[1], each[0]])
        
    sort_swap = sorted(swapped,reverse=True)
    
    return sort_swap

In [12]:
swapped_avg_ask = swapper(avg_asks_hour)
swapped_avg_show = swapper(avg_show_hour)

In [13]:
print('\n The Top Hours for "Ask" Post Comments \n')

frmt = "%H"
for each in swapped_avg_ask:
    timestring = dtime.datetime.strptime(each[1],frmt)
    ts = timestring.strftime("%H:%M")
    output = "{}: {:.2f} average comments per post"
    print(output.format(ts, each[0]),'\n')


 The Top Hours for "Ask" Post Comments 

15:00: 38.59 average comments per post 

02:00: 23.81 average comments per post 

20:00: 21.52 average comments per post 

16:00: 16.80 average comments per post 

21:00: 16.01 average comments per post 

13:00: 14.74 average comments per post 

10:00: 13.44 average comments per post 

14:00: 13.23 average comments per post 

18:00: 13.20 average comments per post 

17:00: 11.46 average comments per post 

01:00: 11.38 average comments per post 

11:00: 11.05 average comments per post 

19:00: 10.80 average comments per post 

08:00: 10.25 average comments per post 

05:00: 10.09 average comments per post 

12:00: 9.41 average comments per post 

06:00: 9.02 average comments per post 

00:00: 8.13 average comments per post 

23:00: 7.99 average comments per post 

07:00: 7.85 average comments per post 

03:00: 7.80 average comments per post 

04:00: 7.17 average comments per post 

22:00: 6.75 average comments per post 

09:00: 5.58 average comments per post 

In [14]:
print('\n The Top Hours for "Show" Post Comments \n')

frmt = "%H"
for each in swapped_avg_show:
    timestring = dtime.datetime.strptime(each[1],frmt)
    ts = timestring.strftime("%H:%M")
    output = "{}: {:.2f} average comments per post"
    print(output.format(ts, each[0]),'\n')


 The Top Hours for "Show" Post Comments 

18:00: 15.77 average comments per post 

00:00: 15.71 average comments per post 

14:00: 13.44 average comments per post 

23:00: 12.42 average comments per post 

22:00: 12.39 average comments per post 

12:00: 11.80 average comments per post 

16:00: 11.66 average comments per post 

07:00: 11.50 average comments per post 

11:00: 11.16 average comments per post 

03:00: 10.63 average comments per post 

20:00: 10.20 average comments per post 

19:00: 9.80 average comments per post 

17:00: 9.80 average comments per post 

09:00: 9.70 average comments per post 

13:00: 9.56 average comments per post 

04:00: 9.50 average comments per post 

06:00: 8.88 average comments per post 

01:00: 8.79 average comments per post 

10:00: 8.25 average comments per post 

15:00: 8.10 average comments per post 

21:00: 5.79 average comments per post 

08:00: 4.85 average comments per post 

02:00: 4.23 average comments per post 

05:00: 3.05 average comments per post 

## Conclusions

We can see that "Ask HN" posts receive the most comments per post when they are created at 3:00 PM ET, which is 12:00 PM PT. This coincides with the lunch hour on the West Coast of the USA, which could indicate that daytime commenters are skewed toward the West Coast. The IP addresses of the commenters, or the timestamps of their comments, would shed more light on where they are commenting from.
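
The Eastern-to-Pacific conversion used above can be sketched with fixed UTC offsets. This assumes, as the paragraph does, that the dataset's timestamps are in Eastern Time; the helper name `eastern_hour_to_pacific` and the fixed standard-time offsets are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Fixed standard-time offsets (an assumption). Daylight saving shifts both
# zones together, so the 3-hour difference is what matters here.
EASTERN = timezone(timedelta(hours=-5), "ET")
PACIFIC = timezone(timedelta(hours=-8), "PT")


def eastern_hour_to_pacific(hour):
    """Convert an hour of the day (0-23) from Eastern to Pacific time."""
    # The date is arbitrary; only the hour is of interest.
    eastern = datetime(2016, 1, 15, hour, 0, tzinfo=EASTERN)
    return eastern.astimezone(PACIFIC).strftime("%H:%M")


print(eastern_hour_to_pacific(15))  # 12:00
```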

"Ask HN" posts seem to attract more than double the number of comments "Show HN" posts gather during their respective peak hours. Even at their lowest, "Ask HN" posts draw more attention. This could be a reflection of the helpful and/or inquisitive nature of the Hacker News community.

The next step in the analysis would be to look at the "Genre" of the posts that draw the most attention during peak hours to diagnose what interests the community.
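
The peak-hour and lowest-hour comparisons made in these conclusions can be checked against the rounded averages printed in the two rankings. The sketch below simply copies those rounded values; the variable names are illustrative.

```python
# Rounded per-hour averages copied from the printed rankings.
ask_peak, ask_low = 38.59, 5.58    # "Ask HN" at 15:00 and 09:00
show_peak, show_low = 15.77, 3.05  # "Show HN" at 18:00 and 05:00

peak_ratio = ask_peak / show_peak
low_ratio = ask_low / show_low

print(f"peak ratio: {peak_ratio:.2f}")
print(f"low ratio: {low_ratio:.2f}")
```

A peak ratio above 2 supports the "more than double" claim, and a low ratio above 1 supports the claim that "Ask HN" posts draw more comments even in their weakest hour.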