# Exploring Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit.\
Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

This project focuses on two types of posts: **Ask HN** posts (asking the HN community specific questions) and **Show HN** posts (showing projects to the HN community).
This project aims to determine the following:
- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Other than comments, we will also consider the following:
- Do Show HN or Ask HN posts receive more points on average?
- Are posts created at a certain time more likely to receive more points?
- Compare the results to the average number of comments and points other posts receive.

The data set can be accessed [here](https://www.kaggle.com/hacker-news/hacker-news-posts).\
*The data set used in this project has been reduced from 300,000 rows to approximately 20,000 rows.\
Submissions without comments were removed and the remaining submissions were randomly sampled.*

In [1]:
#Reading csv file
from csv import reader

with open('hacker_news.csv') as f:
    f_read = reader(f)
    hn = list(f_read)

#Removing first row of data which contains column names
headers = hn[0]
hn = hn[1:]

for index, column in enumerate(headers):
    print(index, column)

0 id
1 title
2 url
3 num_points
4 num_comments
5 author
6 created_at


In [1]:
#Displaying the first four rows
hn[:4]


Before we start our analysis, we will filter our data based on the types of posts: Ask, Show, and others. 

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    
    if title.startswith('show hn'):
        show_posts.append(row)
        
    elif title.startswith('ask hn'):
        ask_posts.append(row)
        
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


In [4]:
#First four ask posts
ask_posts[:4]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20']]

In [5]:
#First four show posts
show_posts[:4]

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05'],
 ['12178806',
  'Show HN: Webscope  Easy way for web developers to communicate with Clients',
  'http://webscopeapp.com',
  '3',
  '3',
  'fastbrick',
  '7/28/2016 7:11']]

## Analyzing Hacker News Posts

Having categorized the posts by type, we start our analysis by comparing the number of comments on Ask HN and Show HN posts. We will also analyze how the posts differ in the number of points received.

### Comparison Based on Comments

Referring back to the introduction, our first goal is to determine whether Ask HN posts receive more comments on average than Show HN posts. Below we calculate the total and the average number of comments for each type of post.

In [6]:
total_ask_comments = 0

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments = total_ask_comments + num_comments

avg_ask_comments = total_ask_comments/len(ask_posts)
print(f"Average number of comments for Ask HN posts: {avg_ask_comments}")

Average number of comments for Ask HN posts: 14.038417431192661


In [7]:
total_show_comments = 0

for post in show_posts:
    num_comments = int(post[4])
    total_show_comments = total_show_comments + num_comments

avg_show_comments = total_show_comments/len(show_posts)
print(f"Average number of comments for Show HN posts: {avg_show_comments}")

Average number of comments for Show HN posts: 10.31669535283993


**Result:**\
Based on the results above, Ask HN posts receive more comments on average (about 14 per post) than Show HN posts (about 10 per post).

Since the Ask HN posts receive more comments, the remaining analysis will be focused on these posts. 
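
The two cells above repeat the same loop for each post type. As a minimal sketch, the calculation could be wrapped in a reusable helper. The name `average_column` is hypothetical and not part of the original notebook, and the sample rows are copied from the first four Ask HN posts shown earlier.

```python
def average_column(posts, column_index):
    """Return the mean of an integer column across a list of rows."""
    if not posts:
        # Avoid a ZeroDivisionError when a category has no posts.
        return 0.0
    total = sum(int(row[column_index]) for row in posts)
    return total / len(posts)


# Sample rows copied from the first four Ask HN posts shown earlier.
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
    ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'],
]

# Column 4 is num_comments and column 3 is num_points.
sample_avg_comments = average_column(sample_posts, 4)
sample_avg_points = average_column(sample_posts, 3)
print(sample_avg_comments, sample_avg_points)
```

With the full data, `average_column(ask_posts, 4)` and `average_column(show_posts, 4)` would be expected to reproduce the two averages printed above.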

### Most Common Time for Comments Posted

The previous cells showed that Ask HN posts receive more comments on average. Next, we will calculate the following:
1. The number of posts created, along with the number of comments received, in each hour of the day.
2. The average number of comments per post for each hour of creation.

In [8]:
#importing datetime library
import datetime as dt

In [9]:
#empty list to isolate created_at and num_comments
result_list = []

for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    new_list = [created_at, num_comments]
    result_list.append(new_list)
    
result_list[:4]

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3]]

In [10]:
#create empty dictionary to count number of posts created at each hour
counts_by_hour = {}

#create empty dictionary for total number of comments at certain hour
comments_by_hour = {}

for result in result_list:
    date = dt.datetime.strptime(result[0], '%m/%d/%Y %H:%M')
    hour = dt.datetime.strftime(date, '%H')
    
    if hour not in counts_by_hour:
        counts_by_hour[hour]  = 1
        comments_by_hour[hour] = result[1]
        
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += result[1]

In [11]:
counts_by_hour

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}

In [12]:
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

In [13]:
avg_by_hour = []

#calculate average number of comments per post for each hour
for hour in comments_by_hour:
    avg_comments = comments_by_hour[hour]/counts_by_hour[hour]
    avg_by_hour.append([hour, avg_comments])
    
avg_by_hour[:4]

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085]]

In [14]:
#swapping number of comments and hour positions
swap_avg_by_hour = []

for result in avg_by_hour:
    swap_result = [result[1], result[0]]
    swap_avg_by_hour.append(swap_result)
    
swap_avg_by_hour[:4]

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14']]

In [15]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap[:4]

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16']]
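
The swap in the cells above exists only so that `sorted` compares the averages first. As a sketch of an alternative, the `[hour, average]` pairs can be sorted directly with a `key` function. The sample values are copied from the `avg_by_hour` output above, and `sample_avg_by_hour` is an illustrative name that does not appear in the notebook.

```python
# [hour, average] pairs copied from the avg_by_hour output above.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['13', 14.741176470588234],
    ['10', 13.440677966101696],
    ['14', 13.233644859813085],
]

# Sort by the average (index 1), highest first, without reordering each pair.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

One difference to keep in mind: with a `key` function, hours with identical averages keep their original relative order, whereas the swapped lists break ties by comparing the hour strings.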

In [25]:
print("Top 5 Hours for Ask Posts Comments")

#printing top 5 hours for ask posts comments
for result in sorted_swap[:5]:
    hour = dt.datetime.strptime(result[1], '%H')
    hour_format = dt.datetime.strftime(hour, '%H:%M')
    template = "{hour}: {avg:.2f} average comments per post"
    top_hours = template.format(hour = hour_format, avg = result[0])
    print(top_hours)

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


At 15:00, the average number of comments per post is 38.59, the highest of any hour. As the hours are recorded in the Eastern US time zone (UTC-5), the best time to post in my time zone (UTC+7) is 03:00.
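
The conversion can be checked with `datetime`. This is a sketch under the same assumption as the text, namely fixed offsets of UTC-5 and UTC+7 with daylight saving time ignored. The calendar date is arbitrary, and the variable names are illustrative.

```python
import datetime as dt

# Fixed offsets taken from the text: UTC-5 for the data, UTC+7 locally.
# Assumption: daylight saving time is ignored.
eastern = dt.timezone(dt.timedelta(hours=-5))
local = dt.timezone(dt.timedelta(hours=7))

# The calendar date is arbitrary; only the hour matters here.
best_hour_eastern = dt.datetime(2016, 1, 1, 15, 0, tzinfo=eastern)
best_hour_local = best_hour_eastern.astimezone(local)

print(best_hour_local.strftime('%H:%M'))  # 03:00 on the following day
```

The 12-hour difference between the two offsets means the local time falls on the next calendar day.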

### Comparison Based on Points

As shown above, Ask HN posts receive more comments than Show HN posts. The data set contains another measure that we can consider: the number of points received. Points are calculated as the number of upvotes a post has received minus the number of downvotes.

In [17]:
#Calculating total ask posts points
total_ask_points = 0

for post in ask_posts:
    num_points = int(post[3])
    total_ask_points = total_ask_points + num_points

#Calculating average ask posts points    
avg_ask_points = total_ask_points / len(ask_posts)
print(f"Average number of points for Ask HN posts: {avg_ask_points}")

Average number of points for Ask HN posts: 15.061926605504587


In [18]:
#Calculating total show posts points
total_show_points = 0

for post in show_posts:
    num_points = int(post[3])
    total_show_points = total_show_points + num_points
    
avg_show_points = total_show_points / len(show_posts)
print(f"Average number of points for Show HN posts: {avg_show_points}")

Average number of points for Show HN posts: 27.555077452667813




**Result:**\
Based on the average number of points, Show HN posts (27.56 points on average) receive more points than Ask HN posts (15.06 points on average). As a result, the following analysis will focus on the Show HN posts.

### Most Common Time for Points Given

In [19]:
#create empty list to isolate created_at and num_points
result_list = []

for post in show_posts:
    created_at = post[6]
    num_points = int(post[3])
    new_list = [created_at, num_points]
    result_list.append(new_list)
    
result_list[:2]

[['11/25/2015 14:03', 26], ['11/29/2015 22:46', 747]]

In [20]:
#create empty dictionary to count number of posts created at each hour
counts_by_hour = {}

#create empty dictionary for total number of points at each hour
points_by_hour = {}

for result in result_list:
    date = dt.datetime.strptime(result[0], '%m/%d/%Y %H:%M')
    hour = dt.datetime.strftime(date, '%H')
    
    if hour not in counts_by_hour:
        counts_by_hour[hour]  = 1
        points_by_hour[hour] = result[1]
        
    else:
        counts_by_hour[hour] += 1
        points_by_hour[hour] += result[1]

In [26]:
#create empty list to calculate average points per hour
avg_by_hour = []

for hour in points_by_hour:
    avg_points = points_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([avg_points, hour])

#sort average in descending order
sorted_avg = sorted(avg_by_hour, reverse = True)


sorted_avg[:5]

[[42.388888888888886, '23'],
 [41.68852459016394, '12'],
 [40.34782608695652, '22'],
 [37.83870967741935, '00'],
 [36.31147540983606, '18']]

In [34]:
#Printing top 5 hours for most number of points
print("Top 5 Hours for Show Posts Points")
print('-' * 40)

#Looping through top 5 show posts
for post in sorted_avg[:5]:
    hour = dt.datetime.strptime(post[1], '%H')
    hour_format = dt.datetime.strftime(hour, '%H:%M')
    template = "{hour}: {avg:.2f} average points per post"
    top_hours = template.format(hour = hour_format, avg = post[0])
    print(top_hours)
    

Top 5 Hours for Show Posts Points
----------------------------------------
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post
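
The hourly aggregation was written twice in this notebook, once for comments and once for points. As a sketch, both could share one helper. `average_by_hour` is a hypothetical name, the first two sample rows are copied from the `result_list` output above, and the third row is made up so that one hour contains two posts.

```python
import datetime as dt
from collections import defaultdict


def average_by_hour(pairs):
    """Map each hour ('00' to '23') to the mean value of posts created in it.

    `pairs` is a list of [created_at, value] lists, like result_list.
    """
    counts = defaultdict(int)
    totals = defaultdict(int)
    for created_at, value in pairs:
        hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
        counts[hour] += 1
        totals[hour] += value
    return {hour: totals[hour] / counts[hour] for hour in counts}


sample_pairs = [
    ['11/25/2015 14:03', 26],   # copied from the result_list output
    ['11/29/2015 22:46', 747],  # copied from the result_list output
    ['1/1/2016 14:30', 4],      # hypothetical row, for illustration only
]

hourly_avg = average_by_hour(sample_pairs)
print(hourly_avg)
```

Passing the comment pairs or the point pairs to the same function would avoid reusing and overwriting `counts_by_hour` between the two analyses.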


## Conclusion

We found that **Ask HN** posts receive more **comments** than Show HN posts.
- Ask HN posts have an average of 14 comments, while Show HN posts average only 10.
- Ask HN posts created at 15:00 receive the most comments, with an average of 38.59 comments per post.
- According to the documentation of the data set, the hours are recorded in the Eastern US time zone (UTC-5). To receive the highest number of comments, I need to create a post at 03:00 in my time zone (UTC+7).

Additionally, we compared the posts by the number of points and discovered that, unlike with comments, Show HN posts receive more **points** than Ask HN posts.
- Show HN posts have an average of about 28 points, while Ask HN posts average only 15 points.
- Show HN posts created at 23:00 receive the most points, with an average of 42.39 points per post.
- In my time zone (UTC+7), I need to post at 11:00 to receive the highest number of points.