# Exploring Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

The dataset used is taken from `hacker_news.csv`.
Below are the descriptions of the columns:

* id: The unique identifier from Hacker News for the post
* title: The title of the post
* url: The URL that the post links to, if the post has a URL
* num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* num_comments: The number of comments that were made on the post
* author: The username of the person who submitted the post
* created_at: The date and time at which the post was submitted
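
The cells below access columns by numeric position (for example, `row[4]` for `num_comments`). As a minimal sketch, assuming the header row lists the columns in the order given above, the positions can be looked up by name instead. The names `col_index` and `sample_row` are hypothetical, and the row is made up for illustration; it is not taken from the dataset.

```python
# Header row, assumed to match the column order described above.
headers = ["id", "title", "url", "num_points", "num_comments", "author", "created_at"]

# Map each column name to its position in a row.
col_index = {name: position for position, name in enumerate(headers)}

# A made-up row used only to demonstrate the lookup.
sample_row = ["1", "Ask HN: Example question?", "", "10", "4", "example_user", "8/16/2016 9:55"]

# Look up values by column name rather than by a hard-coded index.
num_comments = int(sample_row[col_index["num_comments"]])
created_at = sample_row[col_index["created_at"]]
print(num_comments, created_at)
```

Looking up positions by name keeps the later cells readable if the column order ever changes.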

In [1]:
from csv import reader
import datetime as dt

In [2]:
open_file = open("datasets/hacker_news.csv")
read_file = reader(open_file)

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/hacker_news.csv'

In [None]:
hn = list(read_file)

In [None]:
print(*hn[:6], sep='\n')

In [None]:
headers = hn[0]
hn = hn[1:]

In [None]:
print(headers)

In [None]:
print(*hn[:5], sep='\n')

### Extracting "Ask HN" and "Show HN" posts

In [None]:
ask_posts = []
show_posts = []
other_posts = []

In [None]:
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [None]:
print("No. of ask_posts:", len(ask_posts))
print("No. of show_posts:", len(show_posts))
print("No. of other_posts:", len(other_posts))
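
The prefix test used in the loop above can also be wrapped in a small function, which makes the case-insensitive matching easy to check on its own. This is only a sketch: `classify_post` and the sample titles are hypothetical and are not part of the original notebook.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Made-up titles used only to exercise the three branches.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: A tiny CSV viewer",
    "Why startups fail",
]
labels = [classify_post(title) for title in sample_titles]
print(labels)
```

Lowercasing the title once before both checks avoids repeating the conversion in each branch.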

### Calculating the average number of comments for "Ask HN" and "Show HN" posts

In [None]:
total_ask_comments = sum([int(row[4]) for row in ask_posts])

In [None]:
avg_ask_comments = total_ask_comments / len(ask_posts)

In [None]:
total_show_comments = sum([int(row[4]) for row in show_posts])
# Divide by the number of "Show HN" posts, not the number of "Ask HN" posts
avg_show_comments = total_show_comments / len(show_posts)

In [None]:
print("Average no. of comments on 'Ask HN' posts: %.4f" % avg_ask_comments)
print("Average no. of comments on 'Show HN' posts: %.4f" % avg_show_comments)

Thus, we observe that "Ask HN" posts receive more comments on average than "Show HN" posts.
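
Because the same average is computed for both groups, the calculation can be expressed once as a helper. The following is a minimal sketch rather than the notebook's own method: `average_comments` is a hypothetical name, the rows are made up, and the default index `4` assumes `num_comments` is the fifth column as in the cells above.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count of `posts`, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows in the same column layout as the dataset.
sample_ask = [
    ["1", "Ask HN: A?", "", "5", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: B?", "", "3", "2", "user_b", "5/2/2016 9:14"],
]
sample_show = [
    ["3", "Show HN: C", "http://example.com/c", "7", "3", "user_c", "11/22/2015 13:43"],
    ["4", "Show HN: D", "http://example.com/d", "1", "1", "user_d", "1/5/2016 20:10"],
    ["5", "Show HN: E", "http://example.com/e", "2", "2", "user_e", "3/9/2016 7:30"],
]

print("Ask HN:  %.4f" % average_comments(sample_ask))
print("Show HN: %.4f" % average_comments(sample_show))
```

Passing each list to the same function means each total is always divided by the length of its own list.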

### Finding the number of "Ask HN" posts and comments by the hour created

In [None]:
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

In [None]:
counts_by_hour = {}
comments_by_hour = {}

In [None]:
print(*[row[0] for row in result_list[:50]], sep='\n')

In [None]:
for row in result_list:
    dateobj = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = dateobj.strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]

In [None]:
for k, v in counts_by_hour.items():
    print(k, v)

In [None]:
for k, v in comments_by_hour.items():
    print(k, v)

Thus, `counts_by_hour` contains the number of "Ask HN" posts created during each hour of the day, and `comments_by_hour` contains the total number of comments received by the "Ask HN" posts created during each hour of the day.
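
The same two tallies can be built without the explicit `if`/`else` branch by using `collections.defaultdict`, which starts every missing key at zero. This is an alternative sketch, not the approach used in the cells above; the sample data is made up and assumes the same `"%m/%d/%Y %H:%M"` timestamp format.

```python
from collections import defaultdict
import datetime as dt

# Made-up [created_at, num_comments] pairs in the assumed timestamp format.
sample_results = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 2],
]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # total comments on posts created in each hour

for created_at, num_comments in sample_results:
    # Parse the timestamp and keep only the zero-padded hour, e.g. "09".
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

for hour in sorted(counts):
    print(hour, counts[hour], comments[hour])
```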

### Calculating the average number of comments per "Ask HN" post for each hour

In [None]:
avg_by_hour = []

In [None]:
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

In [None]:
print(*avg_by_hour, sep='\n')

Next, we sort `avg_by_hour` in descending order, using the average number of comments for each hour (i.e. the second element of each sublist in `avg_by_hour`) as the sort key:

In [None]:
avg_by_hour.sort(reverse=True, key=lambda row:row[1])

In [None]:
print(*avg_by_hour, sep='\n')
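
Note that `list.sort` reorders `avg_by_hour` in place. When the original order is still needed, the built-in `sorted` returns a new list instead. The sketch below uses made-up averages and `operator.itemgetter` as the key; the names `sample_avg` and `ranked` are hypothetical.

```python
from operator import itemgetter

# Made-up [hour, average comments] pairs.
sample_avg = [["09", 4.0], ["13", 29.0], ["15", 38.5]]

# sorted() leaves sample_avg untouched and returns a new, ranked list.
ranked = sorted(sample_avg, key=itemgetter(1), reverse=True)

print(*ranked, sep='\n')
```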

## Conclusion

Displaying the top 5 hours for "Ask HN" post comments, along with the corresponding average number of comments per post:

In [None]:
for row in avg_by_hour[:5]:
    dateobj = dt.datetime.strptime(row[0], "%H")
    hour = dateobj.strftime("%H:%M")
    avg_comments = row[1]
    print("{}: {:.2f} average comments per post".format(hour, avg_comments))

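Thus, "Ask HN" posts created during these hours received the most comments on average in this dataset, so posting at these hours seems to offer a higher chance of receiving comments.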