# The Best Way to Post on Hacker News

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that reach the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

You can find the data set [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions.

The aim of this project is to explore data from the [Hacker News website](https://news.ycombinator.com/) and apply some data analysis to find a good way to get engagement on the platform.

I am specifically interested in posts with titles that begin with either **Ask HN** or **Show HN**. Users submit **Ask HN** posts to ask the Hacker News community a specific question and **Show HN** posts to show something they learned.

I compare these two types of posts to answer the following questions:

- Do Ask HN or Show HN receive more comments on average?
- Do posts created at a certain time receive more comments on average?

## Basic Data Reading

I start by reading the data and separating the header row from the rest of the dataset:

In [5]:
import datetime as dt
import csv
import pytz

# Read every row into a list of lists; the `with` block closes the file afterwards.
with open('hacker_news.csv', encoding='utf8') as data:
    hn = list(csv.reader(data))
hn[:5]


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [6]:
headers = hn[:1]
hn = hn[1:]
headers

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
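
As an alternative sketch, the header can be peeled off while reading by calling `next()` on the reader before collecting the remaining rows. The example below uses a two-row in-memory sample (a hypothetical stand-in for `hacker_news.csv`, built from rows shown in this notebook) so it runs on its own; `read_rows` is an illustrative helper, not part of the original notebook.

```python
import csv
import io

# In-memory sample standing in for hacker_news.csv (assumption for illustration).
SAMPLE_CSV = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55\n"
)


def read_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    reader = csv.reader(file_obj)
    header = next(reader)  # the first row holds the column names
    rows = list(reader)    # everything after it is data
    return header, rows


header, rows = read_rows(io.StringIO(SAMPLE_CSV))
print(header)
print(len(rows))
```

With this shape, `header` is a flat list of column names rather than a list nested inside another list.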

In [7]:
hn[7:8]  # An example of an Ask HN post

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55']]

## Calculations

I separate the data into three new lists for the analysis:

- `ask_posts`
- `show_posts`
- `other_posts`

In [10]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))

1744
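
The prefix rule used in the loop above can be isolated into a small function, which makes it easy to check on individual titles. This is a minimal sketch: `classify_title` is a hypothetical helper name, and the second and third example titles are invented for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()  # lowercase so the match ignores capitalization
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Example titles (the last two are made up) used only to illustrate the rule.
examples = [
    'Ask HN: How to improve my personal website?',
    'SHOW HN: A tiny CSV viewer',
    'Why I stopped reading Ask HN threads',
]

for title in examples:
    print(classify_title(title), '|', title)
```

Note that a title which merely mentions "Ask HN" somewhere in the middle is classified as `other`, because only the start of the title is checked.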


In [11]:
# Calculate the average number of comments `Ask HN` posts receive.
total_ask_comments = 0

for i in ask_posts:
    total_ask_comments += int(i[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [12]:
# Calculate the average number of comments `Show HN` posts receive.
total_show_comments = 0

for i in show_posts:
    total_show_comments += int(i[4])

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


On **average**, Ask HN posts receive about 36% more comments than Show HN posts (14.04 versus 10.32). So I continue my analysis with this type of post only, since my objective is to find the easiest way to go viral on Hacker News.
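
The size of the gap can be made precise by computing the relative difference between the two averages. The snippet below is a small worked calculation that reuses the values printed by the two cells above.

```python
# Averages copied from the outputs of the two cells above.
avg_ask_comments = 14.038417431192661
avg_show_comments = 10.31669535283993

# Relative increase of Ask HN over Show HN, as a fraction of the Show HN average.
relative_increase = (avg_ask_comments - avg_show_comments) / avg_show_comments
print('{:.1%}'.format(relative_increase))
```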

Next, I create a list that contains only the creation time and number of comments of each Ask HN post.

Then I build two frequency tables: the number of posts created in each hour, and the total number of comments those posts received.

## Filtering by Time

In [15]:


result_list = []

for i in ask_posts:
    created_at = i[6]
    post_comments = int(i[4])
    result_list.append([created_at,post_comments])
    

counts_by_hour = {}
comments_by_hour = {}
date_format = '%m/%d/%Y %H:%M'

for i in result_list:
    date = i[0]
    comment = i[1]
    time = dt.datetime.strptime(date, date_format).strftime('%H')
    
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment

    else: 
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1

comments_by_hour



{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

The total number of comments per hour is not useful on its own, because some hours simply have more posts. So I calculate the average number of comments per post for each hour, which is more informative.

In [17]:
avg_by_hour = []

for post in comments_by_hour:
    avg_by_hour.append([post, (comments_by_hour[post]/counts_by_hour[post])])

avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
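
The counting and averaging steps can also be done in a single pass. The sketch below uses `collections.defaultdict` on a tiny hypothetical sample in the dataset's date format; the sample values and the name `avg_comments_by_hour` are assumptions for illustration, not results from the real data.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs in the dataset's date format.
sample = [
    ['8/16/2016 9:55', 6],
    ['8/17/2016 9:10', 4],
    ['8/16/2016 15:30', 30],
]

date_format = '%m/%d/%Y %H:%M'
totals = defaultdict(lambda: [0, 0])  # hour -> [post count, comment total]

for created_at, comments in sample:
    hour = dt.datetime.strptime(created_at, date_format).strftime('%H')
    totals[hour][0] += 1
    totals[hour][1] += comments

# Divide each hour's comment total by its post count.
avg_comments_by_hour = {
    hour: total / count for hour, (count, total) in totals.items()
}
print(avg_comments_by_hour)
```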

In [18]:
# Swap the order of the 'avg_by_hour' list to make it easier to sort for the highest values below.
swap_avg_by_hour= []

for i in avg_by_hour:
    swap_avg_by_hour.append([i[1], i[0]])
print(swap_avg_by_hour)   

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [19]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print ('Top 5 Hours for Ask Posts Comments')

Top 5 Hours for Ask Posts Comments


In [20]:
# Filtering the top 5 periods of the day to post

date_format = '%H'

for i in sorted_swap[:5]:
    avg = i[0]
    time1 = i[1]
    time2 = dt.datetime.strptime(time1, date_format).strftime('%H:%M')
    print('{time}: {avg:.2f} average comments per post'.format(time = time2, avg = avg))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
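
The swap step is not strictly required: `sorted` accepts a `key` function, so the list can be ordered by its second element directly. A minimal sketch, using a handful of pairs copied from the `avg_by_hour` output above:

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
    ['21', 16.009174311926607],
]

# Sort on the second element (the average), highest first, and keep the top 5.
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_hours:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```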


### Extra Analysis

It is important to note that this analysis treats the Hacker News timestamps as Western European time (`Europe/Lisbon` in the code below). As I live in Brazil and would hypothetically post from Brazil, I want to find the top 5 hours to post in the Brazilian timezone.

Of course, I could just look up the difference between the timezones and calculate it mentally, but doing it in code makes the analysis easy to adapt to any timezone.

In [23]:



date_format = '%H'

# Defining time zones
tz_europe = pytz.timezone('Europe/Lisbon')  # Western European timezone
tz_brazil = pytz.timezone('America/Sao_Paulo')  # Brazilian timezone

for i in sorted_swap[:5]:
    avg = i[0]
    time1 = i[1]
    
    # localize() attaches a timezone to a naive datetime (one with no timezone information).
    # Note: parsing with '%H' alone leaves the date at strptime's default of 1900-01-01.
    dt_europe = dt.datetime.strptime(time1, date_format)
    dt_europe = tz_europe.localize(dt_europe)
    
    # Convert the localized time to the Brazilian timezone
    dt_brazil = dt_europe.astimezone(tz_brazil)
    
    # Format the time for output
    time_brazil = dt_brazil.strftime('%H')
    
    print('{time}: {avg:.2f} average comments per post'.format(time=time_brazil, avg=avg))

12: 38.59 average comments per post
23: 23.81 average comments per post
17: 21.52 average comments per post
13: 16.80 average comments per post
18: 16.01 average comments per post
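
A reusable version of the conversion might look like the sketch below. It uses fixed UTC offsets from the standard library (`datetime.timezone`) instead of named `pytz` zones, and it attaches an explicit date to each hour. The offsets (UTC+0 for the source, UTC-3 for São Paulo), the chosen date, and the helper name `convert_hour` are all assumptions for illustration; fixed offsets ignore daylight saving changes.

```python
import datetime as dt

# Assumed fixed offsets, for illustration only (no daylight saving handling).
SOURCE_TZ = dt.timezone(dt.timedelta(hours=0))
TARGET_TZ = dt.timezone(dt.timedelta(hours=-3))


def convert_hour(hour, source_tz=SOURCE_TZ, target_tz=TARGET_TZ):
    """Convert an hour string such as '15' from source_tz to target_tz."""
    # Attach an explicit date rather than relying on strptime's default of 1900-01-01.
    moment = dt.datetime(2016, 8, 16, int(hour), tzinfo=source_tz)
    return moment.astimezone(target_tz).strftime('%H')


for hour in ['15', '02', '20', '16', '21']:
    print(hour, '->', convert_hour(hour))
```

Passing different `source_tz` and `target_tz` values adapts the helper to another pair of locations.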


Interestingly, my initial premise (that 'Ask HN' and 'Show HN' posts receive more attention) was wrong: **posts without these tags receive more comments on average**, as the calculation below shows.

In [25]:
# Using the 'other_posts' list, which I set aside at the beginning of the code
total_other_comments = 0

for i in other_posts:
    total_other_comments += int(i[4])

avg_other_comments = total_other_comments / len(other_posts)
print(avg_other_comments)

26.8730371059672
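
The same averaging loop appears three times in this notebook (for `ask_posts`, `show_posts`, and `other_posts`). One way to avoid the repetition is a small helper, sketched below; `average_comments` is a hypothetical name, and the two sample rows are copied from the dataset preview at the top of the notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0  # avoid dividing by zero when there are no posts
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two rows copied from the dataset preview (52 and 1 comments).
sample_posts = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte',
     '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise',
     'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
     '3', '1', 'hswarna', '6/17/2016 0:01'],
]

print(average_comments(sample_posts))
```

In the notebook itself, the helper would be called as `average_comments(ask_posts)`, `average_comments(show_posts)`, and `average_comments(other_posts)`.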


## Conclusion

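The basic conclusion is fairly clear. Between the 'Ask HN' and 'Show HN' tags on Hacker News, 'Ask HN' gets more engagement: about 14 comments per post on average, compared with about 10 for 'Show HN'.

For Ask HN posts, the best time to post from Brazil is around midday: posts created in the 12:00 hour (São Paulo time) average 38.59 comments each.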