# Exploring Hacker News Posts

This project explores a data set of submissions to Hacker News.

[Hacker News](https://news.ycombinator.com/) is a website started by the startup incubator [YCombinator](https://www.ycombinator.com/). Users submit stories (known as "posts"), which other users vote on (upvotes and downvotes) and comment on. Hacker News is extremely popular among startups, and the top posts get viewed by thousands of people.

The data set is available [here](https://www.kaggle.com/hacker-news/hacker-news-posts). It includes the following columns:

- `id`: the identifier of the post

- `title`: the title of the post

- `url`: the URL the post links to, if it has one

- `num_points`: the number of points the post received (calculated as upvotes minus downvotes)

- `num_comments`: the number of comments the post received

- `author`: the name of the account that made the post

- `created_at`: the date and time the post was made (the time zone is Eastern Time in the US)

Note: The data has been slightly cleaned to remove posts with no comments.
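
As a minimal sketch of how these columns line up with a row of the file, the example below pairs column names with values. The header and row are copied from the first lines of `hacker_news.csv` as printed later in this notebook, including its leading `id` column; the names `sample_header`, `sample_row`, and `post` are illustrative only and are not used in the analysis.

```python
# Illustrative only: pair one row with the column names.
sample_header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

# zip() matches each column name with the value in the same position.
post = dict(zip(sample_header, sample_row))

# Every value read from a CSV file is a string, so numeric columns
# need an explicit conversion before doing arithmetic on them.
num_comments = int(post['num_comments'])
print(post['title'], num_comments)
```

The same positional layout is why the analysis can read the title with `row[1]`, the comment count with `row[4]`, and the creation time with `row[6]`.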

In this data set, we are particularly interested in two types of posts:
1. Posts whose titles begin with __`Ask HN`__. Users put this tag at the beginning of the title to ask the Hacker News community a specific question. Below are some examples:

```
Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?
```

2. Posts whose titles begin with __`Show HN`__. Users put this tag at the beginning of the title to show a project demo, a product, or something interesting. Below are some examples:

```
Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform'
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm
```

#### Goals of the project
1. Do __`Ask HN`__ or __`Show HN`__ posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

In [1]:
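from csv import reader

# Read the whole file into a list of rows; the `with` block closes the file afterwards.
with open("hacker_news.csv", "r", newline="") as f:
    hn = list(reader(f))
print(hn[:5])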

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


### Separating the header from the data rows

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Extracting `ask posts` and `show posts` from the data set

In [3]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Number of 'asks': ", len(ask_posts))
print("Number of 'shows': ", len(show_posts))
print("Number of 'others': ", len(other_posts))

Number of 'asks':  1744
Number of 'shows':  1162
Number of 'others':  17194


### Evaluating _`average comments`_ for `Ask HN` and `Show HN` posts

In [4]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average 'Ask HN' comments: ", round(avg_ask_comments, 2))

total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print("Average 'Show HN' comments: ", round(avg_show_comments, 2))

Average 'Ask HN' comments:  14.04
Average 'Show HN' comments:  10.32


On average, '__ask__' posts receive about 14 comments, while '__show__' posts receive about 10. Because `ask` posts attract more comments than `show` posts, we will focus on `ask` posts for the next goal.
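
The two averaging loops above follow the same pattern, so one way to avoid the repetition is a small helper. This is only a sketch: the function name `average_comments`, its `comments_index` parameter, and the toy rows are assumptions made for illustration, not part of the original analysis.

```python
def average_comments(rows, comments_index=4):
    """Return the mean comment count of the rows, or 0.0 for an empty list."""
    if not rows:
        return 0.0
    # Comment counts are stored as strings, so convert before summing.
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Toy rows in the same column order as the data set (values are made up).
toy_rows = [
    ['1', 'Ask HN: Example A', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: Example B', '', '3', '4', 'user_b', '1/2/2016 15:30'],
]
print(average_comments(toy_rows))  # (10 + 4) / 2 = 7.0
```

With such a helper, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages printed above.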

Next, we will find out whether `ask` posts created at a certain time attract more comments.

We'll carry out this analysis in two steps:

1. Calculate the number of ask posts created in each hour, along with the number of comments they received.
2. Calculate the average number of comments per post for each hour of creation.

### Part 1: Finding the number of posts created and number of comments received by `hour created`

In [5]:
import datetime as dt
from collections import Counter

def get_hour_from_date(datestring):
    result = dt.datetime.strptime(datestring, "%m/%d/%Y %H:%M")
    return result.hour

counts_by_hour = Counter()
comments_by_hour = Counter()

for row in ask_posts:
    created_at = row[6]
    hour = get_hour_from_date(created_at)
    num_comments = int(row[4])
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments
    
print(counts_by_hour)
print("\n")
print(comments_by_hour)


Counter({15: 116, 19: 110, 21: 109, 18: 109, 16: 108, 14: 107, 17: 100, 13: 85, 20: 80, 12: 73, 22: 71, 23: 68, 1: 60, 10: 59, 2: 58, 11: 58, 0: 55, 3: 54, 8: 48, 4: 47, 5: 46, 9: 45, 6: 44, 7: 34})


Counter({15: 4477, 16: 1814, 21: 1745, 20: 1722, 18: 1439, 14: 1416, 2: 1381, 13: 1253, 19: 1188, 17: 1146, 10: 793, 12: 687, 1: 683, 11: 641, 23: 543, 8: 492, 22: 479, 5: 464, 0: 447, 3: 421, 6: 397, 4: 337, 7: 267, 9: 251})


### Part 2: Finding the average number of comments for posts by `hour created`

In [6]:
avg_by_hour = Counter()
for hour, count in counts_by_hour.items():
    num_comments = comments_by_hour[hour]
    avg_by_hour[hour] = round(num_comments / count, 2)
    
print(avg_by_hour)

Counter({15: 38.59, 2: 23.81, 20: 21.52, 16: 16.8, 21: 16.01, 13: 14.74, 10: 13.44, 14: 13.23, 18: 13.2, 17: 11.46, 1: 11.38, 11: 11.05, 19: 10.8, 8: 10.25, 5: 10.09, 12: 9.41, 6: 9.02, 0: 8.13, 23: 7.99, 7: 7.85, 3: 7.8, 4: 7.17, 22: 6.75, 9: 5.58})


Next, we get the five hours with the highest average number of comments. `Counter.most_common` orders entries by value, so here it returns the hours with the largest averages.

In [7]:
print(avg_by_hour.most_common(5))

[(15, 38.59), (2, 23.81), (20, 21.52), (16, 16.8), (21, 16.01)]
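
`avg_by_hour` holds averages rather than counts, so the same ranking can also be written with the built-in `sorted`, which makes the ordering explicit. The sketch below uses a few of the averages printed above; the names `sample_avg_by_hour` and `top_three` are illustrative only.

```python
# A few of the hourly averages from the output above, stored in a plain dict.
sample_avg_by_hour = {15: 38.59, 2: 23.81, 20: 21.52, 16: 16.8, 21: 16.01, 9: 5.58}

# Sort (hour, average) pairs by the average, largest first, and keep the top 3.
top_three = sorted(
    sample_avg_by_hour.items(),
    key=lambda pair: pair[1],
    reverse=True,
)[:3]
print(top_three)
```

Both approaches give the same order here; `sorted` simply does not rely on the values being stored in a `Counter`.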


Let us print the five hours with the highest average number of comments in a more readable format.

In [9]:
print("Top 5 hours for Ask HN posts by average number of comments:")
for hour, avg_comments in avg_by_hour.most_common(5):
    # Format the integer hour (e.g. 15) as a clock time (e.g. "15:00").
    hour_str = dt.datetime.strptime(str(hour), "%H").strftime("%H:%M")
    print("{}: {} average comments per post.".format(hour_str, avg_comments))

Top 5 hours for Ask HN posts by average number of comments:
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.8 average comments per post.
21:00: 16.01 average comments per post.


__The above analysis shows that `Ask HN` posts created between 15:00 and 16:00 (US Eastern Time) receive the most comments on average.__ Posts created in that hour average 38.59 comments, about 62% more than the second-best hour (02:00, with 23.81).
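
The relative difference between the two best hours can be checked directly from the printed averages. The variable names below are illustrative; the two numbers are taken from the output above.

```python
highest = 38.59  # average comments for posts created at 15:00
second = 23.81   # average comments for posts created at 02:00

# Relative increase of the best hour over the second-best hour.
relative_increase = (highest - second) / second
print(f"{relative_increase:.1%}")
```

Note that the 15:00 average comes from 116 posts, so a few very popular threads could account for much of the gap.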

# Conclusion

In this project, we analyzed Ask HN and Show HN posts to see which category receives more comments on average. We also analyzed the average number of comments on Ask HN posts by the hour in which they were created. The analysis shows that Ask HN posts receive more comments on average than Show HN posts (14.04 versus 10.32). We can also conclude that, on average, Ask HN posts created between 15:00 and 16:00 US Eastern Time receive more comments than posts created at other hours.

### Future Work:

> We can convert these times, which are in US Eastern Time, into our local time zone.
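
A possible starting point for that conversion is sketched below, assuming Python 3.9 or later (for the standard-library `zoneinfo` module) and an available time zone database. The function name `eastern_to_zone` and the default target zone `"UTC"` are assumptions for illustration; any IANA zone name, such as `"Asia/Kolkata"`, could be passed instead.

```python
import datetime as dt
from zoneinfo import ZoneInfo  # standard library since Python 3.9

def eastern_to_zone(datestring, target_zone="UTC"):
    """Interpret a `created_at` string as US Eastern Time and convert it."""
    naive = dt.datetime.strptime(datestring, "%m/%d/%Y %H:%M")
    # Attach the Eastern time zone; daylight saving time is handled by zoneinfo.
    eastern = naive.replace(tzinfo=ZoneInfo("America/New_York"))
    return eastern.astimezone(ZoneInfo(target_zone))

# 15:00 on 4 August 2016 falls in daylight saving time (UTC-4).
converted = eastern_to_zone("8/4/2016 15:00")
print(converted.strftime("%m/%d/%Y %H:%M"))
```

Because the offset between Eastern Time and other zones changes with daylight saving time, converting each full timestamp is safer than shifting the hour by a fixed amount.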