# Exploring Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

The [dataset](https://www.kaggle.com/hacker-news/hacker-news-posts) can be downloaded from Kaggle. Note that the version used here has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted
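
To make the column layout concrete, here is a minimal sketch that reads a single row with `csv.DictReader`, so that fields can be accessed by name rather than by position. The in-memory `sample_csv` string is an assumption that stands in for `hacker_news.csv` (it reuses the first row displayed later in this notebook); the notebook itself uses `csv.reader` and positional indexes.

```python
import csv
import io

# A two-line stand-in for hacker_news.csv: the header plus one sample row.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

# DictReader uses the header row as keys, so each row is a dict of strings.
rows = list(csv.DictReader(io.StringIO(sample_csv)))
first = rows[0]

# Every value is read as a string; numeric columns need an explicit int().
print(first["title"], int(first["num_comments"]))
```

Either approach works for this analysis; positional indexing is kept in the rest of the notebook.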

### Objective
The site has many kinds of posts, but in this project we compare and analyse two types: posts whose titles start with Ask HN and posts whose titles start with Show HN.
Ask HN posts ask the Hacker News community a question, while Show HN posts show the community a project.

More specifically, we want to know:
- Which type of post receives more comments on average
- At what hour of the day posts receive the most comments on average


In [1]:
# importing the dependencies

from csv import reader
import datetime as dt

## Reading the data.

First, let's read the dataset and display the first 5 rows.

In [2]:
# Reading the dataset
# (reader was imported from the csv module above)

# the with block closes the file once it has been read
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

## Removing Headers From List of Lists

In [3]:
hn_headers = hn[0]
hn_data = hn[1:]

In [4]:
# displaying the headers
hn_headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [5]:
# displaying first 5 rows
hn_data[0:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## Extracting Ask HN and Show HN Posts

As stated at the beginning, one of our goals is to compare Ask HN and Show HN posts to see which of the two types receives more comments.
It is therefore easier to first separate the data into these two types of posts.

In [6]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn_data:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Number Of Ask Posts : {}'.format(len(ask_posts)))
print('Number Of Show Posts : {}'.format(len(show_posts)))
print('Number Of Other Posts : {}'.format(len(other_posts)))

Number Of Ask Posts : 1744
Number Of Show Posts : 1162
Number Of Other Posts : 17194


We can see that there are 1744 Ask HN posts and 1162 Show HN posts.

Now that we have separated the posts, it is easier to calculate the total and average number of comments for each type.

## Calculating the Average Number of Comments for Ask HN and Show HN Posts

In [7]:
hn_headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [8]:
# ask hn posts

# total comments
total_ask_comments = 0

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments

# average comments
average_ask_comments = total_ask_comments/len(ask_posts)

print('Total Comments On Ask Hn Posts : {:,}'.format(total_ask_comments))
print('Average Comments On Ask Hn Posts : {:.2f}'.format(average_ask_comments))

Total Comments On Ask Hn Posts : 24,483
Average Comments On Ask Hn Posts : 14.04


In [9]:
# show hn posts

# total comments
total_show_comments = 0

for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments

# average comments
average_show_comments = total_show_comments/len(show_posts)

print('Total Comments On Show Hn Posts : {:,}'.format(total_show_comments))
print('Average Comments On Show Hn Posts : {:.2f}'.format(average_show_comments))

Total Comments On Show Hn Posts : 11,988
Average Comments On Show Hn Posts : 10.32
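
The two cells above repeat the same loop for each post type. As a minimal sketch, the calculation could be wrapped in a small helper. The function name `average_comments`, its `comments_index` parameter, and the two sample rows are hypothetical and are not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column for a list of rows."""
    if not posts:
        return 0.0  # avoid ZeroDivisionError on an empty list
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as hn_headers.
sample_posts = [
    ['1', 'Ask HN: Example A', '', '5', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: Example B', '', '3', '29', 'user_b', '11/22/2015 13:43'],
]

print('{:.2f}'.format(average_comments(sample_posts)))  # prints 17.50
```

Called as `average_comments(ask_posts)` and `average_comments(show_posts)`, it should give the same averages as the cells above.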


Now we know that, on average, an Ask HN post receives more comments than a Show HN post (14.04 versus 10.32).

Let's do further analysis on the Ask HN posts.

## Finding the Amount of Ask Posts and Comments by Hour Created

Our focus now is to determine the hour of the day at which an Ask HN post is likely to receive the most comments.

Let's find the number of posts created during each hour of the day, along with the number of comments those posts received.

In [10]:
hn_headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [11]:
# list of hour with comments for each post
comments_by_hour = []

for post in ask_posts:
    created_at = post[-1]
    num_comments = int(post[4])
    comments_by_hour.append([created_at, num_comments])

print(comments_by_hour[0:10])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17], ['9/26/2015 23:23', 1], ['4/22/2016 12:24', 4], ['11/16/2015 9:22', 1], ['2/24/2016 17:57', 1], ['6/4/2016 17:17', 2]]
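
Each `created_at` value is a string such as `'8/16/2016 9:55'`. The short sketch below shows how one of these strings can be parsed with `datetime.strptime` and how the zero-padded hour can be extracted with `strftime`, using the first timestamp printed above.

```python
import datetime as dt

created_at = '8/16/2016 9:55'

# %m/%d/%Y %H:%M matches month/day/year followed by a 24-hour time.
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

# %H formats the hour as a two-digit string, which makes a convenient key.
hour = parsed.strftime('%H')

print(parsed, hour)
```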


In [12]:
# count of posts by hour
# count of comments by hour

# (datetime was imported as dt above)

count_post_by_hour = {}
count_comments_by_hour = {}

# the loop variable is named row to avoid shadowing the built-in list
for row in comments_by_hour:
    date = row[0]
    comment = row[1]
    date = dt.datetime.strptime(date, '%m/%d/%Y %H:%M')
    hour = date.strftime('%H')
    if hour not in count_post_by_hour:
        count_post_by_hour[hour] = 1
        count_comments_by_hour[hour] = comment
    else:
        count_post_by_hour[hour] += 1
        count_comments_by_hour[hour] += comment

In [13]:
# displaying number of posts created by hour

count_post_by_hour

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}

In [14]:
# displaying count of comments by hour

count_comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
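
As an alternative to the explicit "first time seen" check in the counting loop above, the same tallies could be built with `collections.defaultdict`. This is only a sketch: the names `posts_per_hour` and `comments_per_hour` are hypothetical, and `sample` holds four of the `[created_at, num_comments]` pairs printed earlier rather than the full list.

```python
from collections import defaultdict
import datetime as dt

# Four [created_at, num_comments] pairs taken from comments_by_hour.
sample = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['11/16/2015 9:22', 1],
]

# defaultdict(int) starts every missing key at 0, so no membership test is needed.
posts_per_hour = defaultdict(int)
comments_per_hour = defaultdict(int)

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += num_comments

print(dict(posts_per_hour))
print(dict(comments_per_hour))
```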

## Calculating the Average Number of Comments for Ask HN Posts by Hour

Now that we have one dictionary with the number of posts created in each hour and another with the number of comments those posts received, let's calculate the average number of comments per post for each hour.

In [15]:
avg_comments_by_hour = []

for hour in count_comments_by_hour:
    avg_comments_by_hour.append([hour, count_comments_by_hour[hour]/count_post_by_hour[hour]])

In [16]:
# displaying the average number of comments per post for each hour

avg_comments_by_hour


[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

## Sorting and Printing Values from a List of Lists

Because the averages are stored as a list of lists, we can swap the positions of the hour and the average so that sorting in descending order puts the hours with the highest average number of comments first.

In [17]:
# Swapping the positions of the hour and average comments
swap_avg = []
for hour, average in avg_comments_by_hour:
    swap_avg.append([average, hour])

# sorting the average comments in descending order (once, after the loop)
sorted_swap_avg = sorted(swap_avg, reverse=True)

print(sorted_swap_avg)

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
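
Swapping the columns is one way to sort by the average. Another option, sketched here, is to leave the `[hour, average]` order as it is and pass a `key` function to `sorted`. The names `sample_avg` and `by_average` are hypothetical, and the three pairs are copied from `avg_comments_by_hour` above.

```python
# Three [hour, average] pairs copied from avg_comments_by_hour.
sample_avg = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# key picks the value to sort on (the average); reverse=True sorts descending.
by_average = sorted(sample_avg, key=lambda pair: pair[1], reverse=True)

for hour, average in by_average:
    print('{}:00 {:.2f}'.format(hour, average))
```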


In [18]:
# displaying the top 5 hours for getting comments on an Ask HN post

print('Best Hours To Get More Comments On Ask Post','\n')

for average, hour in sorted_swap_avg[0:5]:
    print('{h} :  {a:.2f} average comments per post'.format(h= dt.datetime.strptime(hour, '%H').strftime('%H:%M'), a=average))

Best Hours To Get More Comments On Ask Post 

15:00 :  38.59 average comments per post
02:00 :  23.81 average comments per post
20:00 :  21.52 average comments per post
16:00 :  16.80 average comments per post
21:00 :  16.01 average comments per post


We can see that Ask HN posts created during the 15:00 hour (3:00 pm) received the most comments, with an average of 38.59 comments per post.

## Conclusion


In this project we analyzed the Ask HN and Show HN posts in the Hacker News posts dataset to determine which of the two types receives more comments on average, and to find the best time to create a post in order to get the most comments.

Based on our analysis, Ask HN posts received more comments on average than Show HN posts (14.04 versus 10.32), and Ask HN posts created between 15:00 and 16:00 (3:00 pm and 4:00 pm) received the most comments on average. Because the dataset excludes submissions that received no comments, these averages describe only posts that received at least one comment.

