# Exploring Hacker News Posts - the Best Time to Submit a Post that Draws Attention
## Introduction
[Hacker News](https://news.ycombinator.com/) is a social news website run by the startup incubator [Y Combinator](https://www.ycombinator.com/), with a focus on computer science and entrepreneurship. It is hugely popular in technology and startup communities. On this site, users can submit any posts that "*gratify one's intellectual curiosity*" ([Ref: Hacker News Guidelines](https://news.ycombinator.com/newsguidelines.html)). Posts are voted and commented on, and the top-ranked posts can attract hundreds of thousands of visitors.

You can find the original dataset of Hacker News posts (a 12-month period ending on 26 September 2016) [here](https://www.kaggle.com/hacker-news/hacker-news-posts). For this project, we use `hacker_news.csv`, a modified version in which the approximately 300,000 rows have been trimmed down to approximately 20,000 rows by:
- Deleting all the posts without any comments
- Sampling randomly from the remaining posts after the deletion
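
The two reduction steps above could look roughly like the sketch below. The helper name `trim_posts`, the sample size, and the seed are illustrative assumptions; the source does not say how the sampling was actually performed.

```python
import random

def trim_posts(rows, sample_size, seed=0):
    """Drop posts without comments, then draw a random sample (hypothetical helper)."""
    # Column 4 holds `num_comments`, stored as a string in the CSV.
    with_comments = [row for row in rows if int(row[4]) > 0]
    # Never ask for more rows than are available.
    k = min(sample_size, len(with_comments))
    return random.Random(seed).sample(with_comments, k)

# Tiny made-up rows in the same column order as `hacker_news.csv`.
demo_rows = [
    ['1', 'Post A', '', '5', '0', 'alice', '1/1/2016 10:00'],
    ['2', 'Post B', '', '3', '2', 'bob', '1/2/2016 11:00'],
    ['3', 'Post C', '', '9', '7', 'carol', '1/3/2016 12:00'],
]
trimmed = trim_posts(demo_rows, sample_size=2)
print(len(trimmed))
```

Fixing the seed only makes the sketch repeatable; any seed would do.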

Here are the explanations for the columns of the `hacker_news.csv` dataset:
- `id`: The unique identifier for the post
- `title`: The title of the post
- `url`: The URL that the posts link to if the post has a URL
- `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the post
- `author`: The username of the person who submitted the post
- `created_at`: The date and time at which the post was submitted (time zone - Eastern Time in the US)

### Goal of the Project
Our project focuses on posts whose titles start with:
- `Ask HN`: The posts for **asking** Hacker News community **a specific question**.
- `Show HN`: The posts for **sharing** with the community a project, product, or just generally something interesting.

The goal of our project is to compare these two types of posts to **study whether the number of comments and points (upvotes minus downvotes) is influenced by**:
- **The type of posts** — either `Ask HN` or `Show HN`?
- **The submission time of the posts** — what time?

### Summary of Results
Based on our data analysis, we concluded that **`Ask HN` posts have a slightly higher average number of comments**, and the best time to submit one for high attention is **22:00 Eastern European Time (EET, our time zone), or 15:00 Eastern Time (EST)**. On the other hand, **`Show HN` posts have a higher average number of points, and their top hour is 06:00 EET, or 23:00 EST**.

Please check out the details below for the full data analysis.

## Opening and Preparing the Data
We open and read `hacker_news.csv` as a list of lists and assign it to the variable `hn`. For the analysis, we separate the **header** row (`hn[0]`) from the dataset and keep only the rows (`hn[1:]`) that contain data.

In [1]:
from csv import reader

# Open the file in a `with` block so it is closed once the rows are read
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

headers = hn[0]    # Column names
hn = hn[1:]        # Data rows only

Here, you can see the `headers` and the first five rows (`hn[:5]`) of the data after removing the header row:

In [2]:
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
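
The later cells refer to columns by position (for example, index 4 for `num_comments`). As a small sketch, those positions can be looked up from the header row instead of being counted by hand. The dictionary name `col` is an illustrative assumption and is not used elsewhere in this notebook.

```python
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row.
col = {name: position for position, name in enumerate(headers)}

print(col['num_points'])    # 3
print(col['num_comments'])  # 4
print(col['created_at'])    # 6
```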

In [3]:
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## Which Post has a Higher Average Number of Comments — 'Ask HN' or 'Show HN'?

Since we focus only on posts whose titles start with `Ask HN` or `Show HN`, we split the data into three new lists of lists: `ask_posts`, `show_posts` and `other_posts`.

In [4]:
ask_posts = []    # For `Ask HN`
show_posts = []    # For `Show HN`
other_posts = []    # Posts that are neither `Ask HN` nor `Show HN`

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [5]:
print('The number of posts for:')
print('    ask_posts:     ', len(ask_posts))
print('    show_posts:    ', len(show_posts))
print('    other_posts:   ', len(other_posts))

The number of posts for:
    ask_posts:      1744
    show_posts:     1162
    other_posts:    17194
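
The three counts add up to 20,100 rows (1744 + 1162 + 17194), so every post lands in exactly one list. The matching rule itself can be checked on a few titles with the sketch below; `classify_title` is a hypothetical helper that mirrors the loop above and is not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' for a post title (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

print(classify_title('Ask HN: How to improve my personal website?'))  # ask
print(classify_title('SHOW HN: Something pointless I made'))          # show
print(classify_title('Interactive Dynamic Video'))                    # other
print(classify_title('Why I ask HN for feedback'))                    # other
```

Lowercasing first makes the match case-insensitive, and `startswith` ignores titles that merely mention "ask hn" somewhere in the middle.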


Let's inspect the first five rows of `ask_posts` and `show_posts` respectively below:

In [6]:
print('ask_posts:')
print(ask_posts[:5])

print('\n')

print('show_posts:')
print(show_posts[:5])

ask_posts:
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


show_posts:
[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https:/

Next, we will calculate the average number of comments for `ask_posts` and `show_posts`. The `num_comments` column has an `index` of *4*.

In [7]:
def average_comments(posts_type):
    total_comments = 0

    for row in posts_type:
        num_comments = int(row[4])
        total_comments += num_comments

    avg_comments = round(total_comments/len(posts_type), 2)
    return avg_comments

In [8]:
print('Average number of comments for ask_posts:   ', average_comments(ask_posts))
print('Average number of comments for show_posts:   ', average_comments(show_posts))

Average number of comments for ask_posts:    14.04
Average number of comments for show_posts:    10.32


Based on our results, **`ask_posts` has a slightly higher average number of comments** than `show_posts` (14.04 versus 10.32, a difference of about 3.7 comments per post). We have not run a statistical test, so we cannot say whether this difference is statistically significant.
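
One way to check whether a difference in averages could be due to chance is a permutation test. The sketch below is illustrative only: the function name `permutation_p_value` is an assumption, and the two short lists are made-up numbers rather than values from the dataset, so the printed p-value says nothing about the real posts.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference in means (illustrative sketch)."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Reassign the group labels at random and recompute the difference.
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

# Made-up comment counts, NOT taken from the dataset.
demo_ask = [6, 29, 1, 3, 17, 40, 2, 9]
demo_show = [22, 102, 1, 4, 3, 8, 2, 5]
p_value = permutation_p_value(demo_ask, demo_show, n_permutations=2000)
print(p_value)
```

With the real data, the two inputs would be the integer comment counts taken from column 4 of `ask_posts` and `show_posts`.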

## Determining the Number of 'Ask Posts' and Comments by Hour Created
Since `ask_posts` has slightly more comments than `show_posts`, we decided to focus on the analysis of `ask_posts` starting from this step.

To investigate whether the creation time of `ask_posts` influences the number of comments received, we apply the following approach:
1. Determine the number of `ask_posts` created in each hour of the day, as well as the corresponding total number of comments received.
2. Determine the average number of comments received by `ask_posts` for each hour of creation.

To perform the analysis for part 1, we will apply the `datetime` module to work with the data in the `created_at` column (*index: -1*).

In [9]:
import datetime as dt

result_list = []
counts_by_hour = {}    # The number of ask_posts created every hour
comments_by_hour = {}  # The number of comments obtained by the ask_posts 

for row in ask_posts:
    created_at = row[-1]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
    
for result in result_list:
    comments_result = result[1]
    creation_time_str = result[0]
    creation_time_dt = dt.datetime.strptime(creation_time_str, '%m/%d/%Y %H:%M')
    creation_hour = creation_time_dt.strftime('%H')

    if creation_hour not in counts_by_hour:
        counts_by_hour[creation_hour] = 1
        comments_by_hour[creation_hour] = comments_result
    else:
        counts_by_hour[creation_hour] += 1
        comments_by_hour[creation_hour] += comments_result

print('counts_by_hour:')
print(counts_by_hour)
print('\n')
print('comments_by_hour:')
print(comments_by_hour)

counts_by_hour:
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


comments_by_hour:
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
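
The same tally can be written more compactly with `collections.defaultdict`, which removes the need for the `if`/`else` branch. This is an alternative sketch, not the code that produced the output above, and the three rows are made-up values in the dataset's date format.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the dataset's date format.
demo_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 9:05', 4],
    ['5/2/2016 15:14', 10],
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # comments per hour

for created_at, num_comments in demo_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {'09': 2, '15': 1}
print(dict(comments))  # {'09': 10, '15': 10}
```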


Next, we will use `counts_by_hour` and `comments_by_hour` dictionaries to determine the average number of comments for posts created during each hour of the day.

In [10]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_comments_per_posts = round((comments_by_hour[hour]/counts_by_hour[hour]),2)
#     print(avg_comments_per_posts)
    avg_by_hour.append([hour, avg_comments_per_posts])
print(avg_by_hour)

[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]


## Identifying the Top Five Hours where 'Ask Post' got Most Comments
To make the results easier to read, we sort the average number of comments per post in descending order. The dates and times in this dataset follow **Eastern Time (EST)** in the United States. We therefore convert from **EST** (`UTC-05`) to **our time zone, Eastern European Time** (**EET**, `UTC+02`) to make the results more relevant to us.

In [11]:
# Swap the columns in `avg_by_hour` and assign it to a list named `swap_avg_by_hour`.
swap_avg_by_hour = []

for a in avg_by_hour:
    swap_avg_by_hour.append((a[1], a[0]))    # a[1] = avg_comments_per_posts; a[0] = hour

print('swap_avg_by_hour:')
print(swap_avg_by_hour)
print('\n')

sorted_swap = sorted(swap_avg_by_hour, reverse=True)    # Sorting the average number of comments

print("The Top Five Hours where 'Ask Post' got Most Comments:")

for item in sorted_swap[:5]:
    # US/Eastern timezone (EST) - UTC-05
    est_hour_dt = dt.datetime.strptime(item[1], '%H')
    est_hour_str = est_hour_dt.strftime('%H:%M')
    
    # Our timezone (EET) - UTC+02: 7 hours ahead of EST
    # Converting the `Hour` from EST to EET
    our_hour_dt = dt.datetime.strptime(item[1], '%H') + dt.timedelta(hours=7)
    our_hour_str = our_hour_dt.strftime('%H:%M')
    
    print('   ', '{est_time} EST (UTC-05) or {our_time} EET/our time (UTC+02):    {avg:.2f} average comments per post'.format(est_time=est_hour_str, our_time=our_hour_str, avg=item[0]))    # Use two decimal places to format avg

swap_avg_by_hour:
[(5.58, '09'), (14.74, '13'), (13.44, '10'), (13.23, '14'), (16.8, '16'), (7.99, '23'), (9.41, '12'), (11.46, '17'), (38.59, '15'), (16.01, '21'), (21.52, '20'), (23.81, '02'), (13.2, '18'), (7.8, '03'), (10.09, '05'), (10.8, '19'), (11.38, '01'), (6.75, '22'), (10.25, '08'), (7.17, '04'), (8.13, '00'), (9.02, '06'), (7.85, '07'), (11.05, '11')]


The Top Five Hours where 'Ask Post' got Most Comments:
    15:00 EST (UTC-05) or 22:00 EET/our time (UTC+02):    38.59 average comments per post
    02:00 EST (UTC-05) or 09:00 EET/our time (UTC+02):    23.81 average comments per post
    20:00 EST (UTC-05) or 03:00 EET/our time (UTC+02):    21.52 average comments per post
    16:00 EST (UTC-05) or 23:00 EET/our time (UTC+02):    16.80 average comments per post
    21:00 EST (UTC-05) or 04:00 EET/our time (UTC+02):    16.01 average comments per post


Our results show that **posts created at 15:00 EST receive the most comments on average**, well ahead of any other hour, and 16:00 EST is also in the top five. One possible explanation is that 15:00 EST is a time when users in both North America and Europe are active. This is based on our assumption that most Hacker News users are from these two continents. For practical reasons, **the best time for us to submit a post in our time zone is 22:00, followed by 09:00 and 23:00 EET** (03:00 EET also ranks in the top five but is an impractical hour for us).
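
The EST-to-EET conversion can also be expressed with timezone-aware `datetime` objects instead of adding seven hours by hand. The sketch below uses fixed offsets, matching the `UTC-05` and `UTC+02` values stated above; it deliberately ignores daylight saving time, and the helper name `est_hour_to_eet` is an illustrative assumption.

```python
import datetime as dt

EST = dt.timezone(dt.timedelta(hours=-5), 'EST')  # UTC-05, fixed offset
EET = dt.timezone(dt.timedelta(hours=2), 'EET')   # UTC+02, fixed offset

def est_hour_to_eet(hour_str):
    """Convert an hour string such as '15' from EST to 'HH:MM' in EET."""
    # The calendar date is arbitrary; only the hour matters here.
    est_time = dt.datetime(2016, 1, 1, int(hour_str), tzinfo=EST)
    return est_time.astimezone(EET).strftime('%H:%M')

print(est_hour_to_eet('15'))  # 22:00
print(est_hour_to_eet('20'))  # 03:00 (the following day in EET)
```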

## Which Post has More Points on Average — 'Ask HN' or 'Show HN'?

The next question that we ask is: *what type of post has more points on average*? The number of points is the total number of upvotes minus the total number of downvotes.

To answer this question, we apply a similar approach as before to compute the average number of points for `ask_posts` and `show_posts`. The `index` for `num_points` is *3*.

In [12]:
def average_points(posts_type):
    total_points = 0

    for row in posts_type:
        num_points = int(row[3])
        total_points += num_points

    avg_points = round(total_points/len(posts_type), 2)
    return avg_points

In [13]:
print('Average number of points for ask_posts:   ', average_points(ask_posts))

Average number of points for ask_posts:    15.06


In [14]:
print('Average number of points for show_posts:   ', average_points(show_posts))

Average number of points for show_posts:    27.56


In contrast to the number of comments, `show_posts` has a higher average number of points than `ask_posts`. Given that, we will focus on determining which creation times are associated with more points for `show_posts`.
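
Since `average_comments` and `average_points` differ only in the column they read, they could be folded into a single helper. The sketch below is one possible form; the name `average_column` and the choice to return `0.0` for an empty list are assumptions, and the two rows are made-up values.

```python
def average_column(posts, index):
    """Average of the integer values in column `index`, rounded to two decimals."""
    if not posts:
        return 0.0  # assumption: report 0.0 for an empty list instead of raising
    total = sum(int(row[index]) for row in posts)
    return round(total / len(posts), 2)

# Made-up rows in the dataset's column order.
demo_posts = [
    ['1', 'Show HN: A', '', '26', '22', 'u1', '11/25/2015 14:03'],
    ['2', 'Show HN: B', '', '4', '1', 'u2', '11/29/2015 22:46'],
]
print(average_column(demo_posts, 3))  # 15.0  (points)
print(average_column(demo_posts, 4))  # 11.5  (comments)
```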

## Determining whether the Number of Points are Influenced by Post Creation Time

We wonder whether posts created at a certain time are more likely to obtain more points.

First, we examine the number of `show_posts` created per hour along with the total points obtained:

In [15]:
result_list_show_posts = []
counts_by_hour_show_posts = {}    # The number of show_posts created every hour
points_by_hour_show_posts = {}  # The number of points obtained by the show_posts

for row in show_posts:
    created_at = row[-1]
    num_points = int(row[3])
    result_list_show_posts.append([created_at, num_points])
    
for result in result_list_show_posts:
    points_result = result[1]
    creation_time_str = result[0]
    creation_time_dt = dt.datetime.strptime(creation_time_str, '%m/%d/%Y %H:%M')
    creation_hour = creation_time_dt.strftime('%H')

    if creation_hour not in counts_by_hour_show_posts:
        counts_by_hour_show_posts[creation_hour] = 1
        points_by_hour_show_posts[creation_hour] = points_result
    else:
        counts_by_hour_show_posts[creation_hour] += 1
        points_by_hour_show_posts[creation_hour] += points_result

print('counts_by_hour_show_posts:')
print(counts_by_hour_show_posts)
print('\n')
print('points_by_hour_show_posts:')
print(points_by_hour_show_posts)

counts_by_hour_show_posts:
{'14': 86, '22': 46, '18': 61, '07': 26, '20': 60, '05': 19, '16': 93, '19': 55, '15': 78, '03': 27, '17': 93, '06': 16, '02': 30, '13': 99, '08': 34, '21': 47, '04': 26, '11': 44, '12': 61, '23': 36, '09': 30, '01': 28, '10': 36, '00': 31}


points_by_hour_show_posts:
{'14': 2187, '22': 1856, '18': 2215, '07': 494, '20': 1819, '05': 104, '16': 2634, '19': 1702, '15': 2228, '03': 679, '17': 2521, '06': 375, '02': 340, '13': 2438, '08': 519, '21': 866, '04': 386, '11': 1480, '12': 2543, '23': 1526, '09': 553, '01': 700, '10': 681, '00': 1173}


The code above yields the `counts_by_hour_show_posts` and `points_by_hour_show_posts` dictionaries. Using these dictionaries, we will determine the average number of points for `show_posts` created during each hour of the day:

In [16]:
avg_points_by_hour_show_posts = []

for hour in points_by_hour_show_posts:
    avg_points_per_posts = round((points_by_hour_show_posts[hour]/counts_by_hour_show_posts[hour]),2)
#     print(avg_points_per_posts)
    avg_points_by_hour_show_posts.append([hour, avg_points_per_posts])
print(avg_points_by_hour_show_posts)

[['14', 25.43], ['22', 40.35], ['18', 36.31], ['07', 19.0], ['20', 30.32], ['05', 5.47], ['16', 28.32], ['19', 30.95], ['15', 28.56], ['03', 25.15], ['17', 27.11], ['06', 23.44], ['02', 11.33], ['13', 24.63], ['08', 15.26], ['21', 18.43], ['04', 14.85], ['11', 33.64], ['12', 41.69], ['23', 42.39], ['09', 18.43], ['01', 25.0], ['10', 18.92], ['00', 37.84]]


Next, we sort the average number of points per post in descending order:

In [17]:
# Swap the columns in `avg_points_by_hour_show_posts` and assign it to a list named `swap_avg_points_by_hour_show_posts`.
swap_avg_points_by_hour_show_posts = []

for a in avg_points_by_hour_show_posts:
    swap_avg_points_by_hour_show_posts.append((a[1], a[0]))    # a[1] = avg_points_per_posts; a[0] = hour

print('swap_avg_points_by_hour_show_posts:')
print(swap_avg_points_by_hour_show_posts)
print('\n')

sorted_swap_points = sorted(swap_avg_points_by_hour_show_posts, reverse=True)    # Sorting the average number of points

print("The Top Five Hours where 'Show Post' got Most Points:")

for item in sorted_swap_points[:5]:
    # US/Eastern timezone (EST) - UTC-05
    est_hour_dt = dt.datetime.strptime(item[1], '%H')
    est_hour_str = est_hour_dt.strftime('%H:%M')
    
    # Our timezone (EET) - UTC+02: 7 hours ahead of EST
    # Converting the `Hour` from EST to EET
    our_hour_dt = dt.datetime.strptime(item[1], '%H') + dt.timedelta(hours=7)
    our_hour_str = our_hour_dt.strftime('%H:%M')
    
    print('   ', '{est_time} EST (UTC-05) or {our_time} EET/our time (UTC+02):    {avg:.2f} average points per post'.format(est_time=est_hour_str, our_time=our_hour_str, avg=item[0]))    # Use two decimal places to format avg

swap_avg_points_by_hour_show_posts:
[(25.43, '14'), (40.35, '22'), (36.31, '18'), (19.0, '07'), (30.32, '20'), (5.47, '05'), (28.32, '16'), (30.95, '19'), (28.56, '15'), (25.15, '03'), (27.11, '17'), (23.44, '06'), (11.33, '02'), (24.63, '13'), (15.26, '08'), (18.43, '21'), (14.85, '04'), (33.64, '11'), (41.69, '12'), (42.39, '23'), (18.43, '09'), (25.0, '01'), (18.92, '10'), (37.84, '00')]


The Top Five Hours where 'Show Post' got Most Points:
    23:00 EST (UTC-05) or 06:00 EET/our time (UTC+02):    42.39 average points per post
    12:00 EST (UTC-05) or 19:00 EET/our time (UTC+02):    41.69 average points per post
    22:00 EST (UTC-05) or 05:00 EET/our time (UTC+02):    40.35 average points per post
    00:00 EST (UTC-05) or 07:00 EET/our time (UTC+02):    37.84 average points per post
    18:00 EST (UTC-05) or 01:00 EET/our time (UTC+02):    36.31 average points per post


The results show that four of the top five hours (06:00, 05:00, 07:00 and 01:00 EET) fall at night or in the early morning in our time zone, corresponding to 18:00 - 00:00 EST. This **suggests that `show_posts` are more popular among users in North America** and have a **higher chance of earning points when submitted between the evening and midnight EST**. We also noticed that the differences in average points among the top five hours are smaller than the differences in average comments.

Unlike the average number of points, **the average number of comments seems to be contributed more equally by users in both North America and Europe** during their active hours. Additionally, users on these two continents appear to share an equal interest in `ask_posts`.
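
Both rankings in this notebook were produced by swapping the columns and then sorting. As an alternative sketch, `sorted` accepts a `key` function, so the `[hour, average]` pairs can be ranked directly. The list below holds a few pairs copied from `avg_by_hour`; the name `demo_avg_by_hour` is an illustrative assumption.

```python
# A few [hour, average] pairs copied from `avg_by_hour` above.
demo_avg_by_hour = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort on the second element (the average) without swapping columns.
ranked = sorted(demo_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:3]:
    print('{}:00  {:.2f}'.format(hour, avg))
```

One difference to be aware of: when two hours have the same average, the swapped-tuple approach breaks the tie by the hour string, whereas sorting with a `key` keeps tied entries in their original order.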

## Conclusion
In this project, we analyzed Hacker News posts to study which type of post, and which submission time, attracts more comments and points on average. We found that **`Ask HN` posts have a slightly higher average number of comments** and receive the **most comments when submitted at 15:00 EST**, which **is equivalent to 22:00 EET in our time zone**. On the other hand, **`Show HN` posts have a higher average number of points**, with the top hour for gaining points at **23:00 EST or 06:00 EET**.

This result serves as a useful guide for submitting our own posts, either `Ask HN` or `Show HN`, on Hacker News with a better chance of attracting attention.