# Exploring Hacker News Posts

In this project, we will be working with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/), which was started by the startup incubator [Y Combinator](https://www.ycombinator.com/). On this site, user-submitted stories, or posts, are voted and commented on by readers. Hacker News is extremely popular in technology and startup circles, and posts which make it to the top of the listings often get hundreds of thousands of visitors.

The data set which will be used in this project can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). Please note that the data set has been reduced from almost 300,000 entries to approximately 20,000 by removing any posts that didn't receive any comments, and then randomly sampling the remaining entries.

The columns for each post are described as follows:

- `id` : The unique identifier from Hacker News for the post
- `title` : The title of the post
- `url` : The URL that the post links to, if it has a URL
- `num_points` : The number of points the post acquired, calculated as the total number of upvotes, minus the total number of downvotes
- `num_comments` : The number of comments that were made on the post
- `author` : The username of the person who submitted the post
- `created_at` : The date and time at which the post was submitted

We will be analysing posts whose titles begin with either `Ask HN` or `Show HN`.  Users submit `Ask HN` posts to ask the Hacker News community a specific question, for example:

- `Ask HN: How to improve my personal website?`
- `Ask HN: Am I the only one outraged by Twitter shutting down share counts?`
- `Ask HN: Any recent changes to CSS that broke mobile?`

`Show HN` posts are submitted to show the Hacker News community a project, product, or just generally something interesting, for example:

- `Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform`
- `Show HN: Something pointless I made`
- `Show HN: Shanhu.io, a programming playground powered by e8vm`

We will compare the two types of posts to determine the following:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

We'll start by importing the libraries we need and reading the data set into a list of lists.

# Opening and Exploring the Data

In [1]:
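from csv import reader

# Read the whole CSV into a list of lists; the context manager closes the file
with open('hacker_news.csv') as opened_file:
    hn_data = list(reader(opened_file))

headers = hn_data[0]  # the first row holds the column names
hn = hn_data[1:]      # the remaining rows are the posts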

We'll create a function to simplify exploring the data set, called **explore_data()**, which will print the rows in an easily-readable format and also has the option to show the number of rows and columns for the data set.

In [2]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(hn_data, 0, 5, rows_and_columns = True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


Number of rows: 20101
Number of columns: 7


# Extracting Ask HN and Show HN Posts

Now we can filter our data to find only the post titles beginning with `Ask HN` or `Show HN` and create a new list of lists containing the data for only these titles.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
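The filtering rule above can also be wrapped in a small helper, which makes it easy to check on a few titles before running it over the full data set. This is only a sketch: `classify_title` and `sample_titles` are names introduced here for illustration and are not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()  # compare case-insensitively, as in the loop above
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Illustrative titles; the first two are examples quoted earlier in this project
sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Keeping the rule in one function means the `Ask HN` and `Show HN` prefixes are defined in a single place if they ever need to change.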


# Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now, let's determine whether ask posts or show posts receive more comments on average.

In [5]:
total_ask_comments = 0

for row in ask_posts:
    comments = row[4]
    total_ask_comments += int(comments)

avg_ask_comments = total_ask_comments/len(ask_posts)

print('Average comments for ask posts:', avg_ask_comments)

Average comments for ask posts: 14.038417431192661


In [6]:
total_show_comments = 0

for row in show_posts:
    comments = row[4]
    total_show_comments += int(comments)
    
avg_show_comments = total_show_comments/len(show_posts)

print('Average comments for show posts:', avg_show_comments)

Average comments for show posts: 10.31669535283993


The calculations above show that `Ask HN` posts receive an average of 14 comments per post, and `Show HN` posts receive an average of 10 comments per post.  
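The two loops above follow the same pattern, so the calculation could be sketched as one reusable function. The helper name `average_comments`, its `comments_index` parameter, and the two sample rows are assumptions made for this illustration; the sample rows are made up and are not taken from the data set.

```python
def average_comments(posts, comments_index=4):
    """Average of the integer comment counts stored at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the same column order as the data set
sample_posts = [
    ['1', 'Ask HN: first example', '', '3', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: second example', '', '5', '20', 'user_b', '1/26/2016 19:30'],
]

sample_avg = average_comments(sample_posts)
print(sample_avg)
```

With the real lists, the same call would be `average_comments(ask_posts)` and `average_comments(show_posts)`.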

As we've found that, on average, `Ask HN` posts receive more comments than `Show HN` posts, we will focus our remaining analysis on `Ask HN` posts only.

Next, we'll determine whether ask posts created at certain times are likely to attract more comments than others.

To do this, we will:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

# Finding the Number of Ask Posts and Comments by Hour Created

First, we will create a list called `result_list`, with each row consisting of the time that the post was created plus the number of comments the post received.

In [7]:
import datetime as dt

result_list = []

for row in ask_posts:
    created = row[6]  # the time at which the post was created
    comments = int(row[4])  # the number of comments the post received
    result_list.append([created, comments])

Then we will create two dictionaries: 

- `counts_by_hour`: shows the number of ask posts created during each hour of the day
- `comments_by_hour`: contains the corresponding number of comments that ask posts created at each hour received

In [8]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    comment = row[1]
    date_dt = dt.datetime.strptime(date, '%m/%d/%Y %H:%M')
    hour = date_dt.strftime('%H')
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment    
        
comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
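The `if`/`else` bookkeeping in the loop above is one way to build the two dictionaries. As an alternative sketch, `collections.defaultdict` starts every missing key at zero, so the branch is not needed. The `sample_results` rows and the names `counts` and `comments` are made up for this illustration.

```python
from collections import defaultdict
import datetime as dt

# Made-up [created_at, num_comments] pairs in the same shape as result_list
sample_results = [
    ['8/4/2016 11:52', 52],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 11:05', 4],
]

counts = defaultdict(int)    # posts per hour; missing hours start at 0
comments = defaultdict(int)  # comments per hour; missing hours start at 0

for created, n_comments in sample_results:
    # Parse the timestamp and keep only the zero-padded hour, e.g. '11'
    hour = dt.datetime.strptime(created, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += n_comments

print(dict(counts))
print(dict(comments))
```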

# Calculating the Average Number of Comments for Ask HN Posts by Hour

Now, we will use the two dictionaries created above to calculate the average number of comments for `Ask HN` posts created during each hour of the day.

In [9]:
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr]/counts_by_hour[hr]])
        
avg_by_hour

[['22', 6.746478873239437],
 ['17', 11.46],
 ['12', 9.41095890410959],
 ['23', 7.985294117647059],
 ['08', 10.25],
 ['07', 7.852941176470588],
 ['13', 14.741176470588234],
 ['19', 10.8],
 ['11', 11.051724137931034],
 ['04', 7.170212765957447],
 ['10', 13.440677966101696],
 ['02', 23.810344827586206],
 ['16', 16.796296296296298],
 ['01', 11.383333333333333],
 ['03', 7.796296296296297],
 ['09', 5.5777777777777775],
 ['21', 16.009174311926607],
 ['14', 13.233644859813085],
 ['06', 9.022727272727273],
 ['18', 13.20183486238532],
 ['20', 21.525],
 ['15', 38.5948275862069],
 ['00', 8.127272727272727],
 ['05', 10.08695652173913]]

# Sorting and Printing Values from a List of Lists

We have now obtained the results that we need, but they are in a format which is difficult to make sense of.  We can finish off by sorting the `avg_by_hour` list of lists and printing the five highest values in order.

First, we will swap the columns in `avg_by_hour` so that the first element becomes the second element, and vice versa; then, we can arrange them in descending order, so that the hours with the highest average number of comments per post are at the top.

In [10]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

sorted_swap

[[6.746478873239437, '22'], [11.46, '17'], [9.41095890410959, '12'], [7.985294117647059, '23'], [10.25, '08'], [7.852941176470588, '07'], [14.741176470588234, '13'], [10.8, '19'], [11.051724137931034, '11'], [7.170212765957447, '04'], [13.440677966101696, '10'], [23.810344827586206, '02'], [16.796296296296298, '16'], [11.383333333333333, '01'], [7.796296296296297, '03'], [5.5777777777777775, '09'], [16.009174311926607, '21'], [13.233644859813085, '14'], [9.022727272727273, '06'], [13.20183486238532, '18'], [21.525, '20'], [38.5948275862069, '15'], [8.127272727272727, '00'], [10.08695652173913, '05']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
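Swapping the columns works because `sorted()` compares lists element by element. As an alternative sketch, the `key` argument of `sorted()` can point at the average directly, which leaves each row in its original `[hour, average]` order. The `sample_avg_by_hour` list below uses four values from the output above, rounded to two decimal places, and the variable names are chosen for this illustration only.

```python
# A few [hour, average] pairs, rounded from the results above
sample_avg_by_hour = [
    ['22', 6.75],
    ['15', 38.59],
    ['02', 23.81],
    ['20', 21.52],
]

# Sort on the second element (the average), highest first
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

top_hours = [hour for hour, _ in sorted_by_avg[:3]]
print(top_hours)
```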

Finally, we will print out the top 5 hours for comments on `Ask HN` posts.

In [11]:
print('Top 5 Hours for Ask Posts Comments')

for avg, hr in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(dt.datetime.strptime(hr, '%H').strftime('%H:%M'),avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The results show us that posts created at 15:00 received the highest average number of comments, at 38.59 comments per post. The next hour down the list is 02:00, which has an average of 23.81 comments per post, roughly 38% lower than the highest.
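As a quick check, the size of the gap between the top two hours can be computed from the averages in the sorted output above. The variable names are chosen for this sketch only.

```python
top_avg = 38.5948275862069       # 15:00, from the sorted output above
second_avg = 23.810344827586206  # 02:00, from the sorted output above

# Relative drop from the highest average to the second highest
relative_drop = (top_avg - second_avg) / top_avg
print('{:.1%}'.format(relative_drop))
```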

When we look at the data set [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts/), it tells us that the time zone for each post is Eastern Standard Time in the US, so we could write the time as 3pm EST so that readers can work out what the equivalent time in their time zone would be.
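For readers in other time zones, the conversion could be sketched with the standard `datetime` module. This example assumes the timestamps are in Eastern Standard Time as a fixed UTC-5 offset, as described above, and it ignores daylight saving time. The date is arbitrary, and the Central European Time reader is a hypothetical example.

```python
import datetime as dt

# Assumption: EST treated as a fixed UTC-5 offset, with no daylight saving
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
# Hypothetical reader in Central European Time, fixed UTC+1 offset
CET = dt.timezone(dt.timedelta(hours=1), 'CET')

# 15:00 EST on an arbitrary date
peak_est = dt.datetime(2016, 1, 26, 15, 0, tzinfo=EST)

peak_utc = peak_est.astimezone(dt.timezone.utc)
peak_cet = peak_est.astimezone(CET)

print(peak_utc.strftime('%H:%M %Z'))
print(peak_cet.strftime('%H:%M %Z'))
```

A fixed offset keeps the sketch dependency-free. Handling daylight saving time correctly would need a full time zone database, which this example does not use.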

# Conclusion

In this project, we were looking at posts on the `Hacker News` website, with the aim of determining which posts gained the highest number of comments on average.  

We analysed the data to find out whether, *of the posts which received any comments*:

- `Ask HN` or `Show HN` posts received the most comments
- posts posted at a particular time of day received more comments than others

We have come to the conclusion that, on average, the most commented-on posts are `Ask HN` posts created between 3pm and 4pm EST.