# Exploring Hacker News Posts

For this project, we will explore a dataset of Hacker News posts submitted up to September 26, 2016. The [original dataset](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts) contains ~300,000 rows, but we have reduced it to approximately 20,000 rows by removing submissions with no comments and then randomly sampling from the remaining submissions.

Columns are as follows:
* `id`: the unique identifier from Hacker News for the post
* `title`: the title of the post
* `url`: the URL that the post links to, if the post has a URL
* `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: the number of comments on the post
* `author`: the username of the person who submitted the post
* `created_at`: the date and time of the post's submission
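
Because the file is read as plain lists rather than as a table, each column is reached by its position. A minimal sketch of how the names above map to indexes, assuming the CSV header lists the columns in this order (`COLUMNS` and `COLUMN_INDEX` are illustrative names, not part of the dataset):

```python
# Column names in the order listed above (assumed to match the CSV header).
COLUMNS = ['id', 'title', 'url', 'num_points',
           'num_comments', 'author', 'created_at']

# Map each column name to its position within a row.
COLUMN_INDEX = {name: position for position, name in enumerate(COLUMNS)}

print(COLUMN_INDEX['title'])         # 1
print(COLUMN_INDEX['num_comments'])  # 4
print(COLUMN_INDEX['created_at'])    # 6
```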

We will focus mostly on posts whose titles begin with either `Ask HN` or `Show HN`. `Ask HN` posts are users' questions to the HN community. We can see some examples below:

```
Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?
```

`Show HN` posts showcase users' projects, products, ideas, or simply something interesting. We can see some examples below:

```
Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm
```

Some of the questions we will try to answer are as follows:

* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

## Importing the dataset

We will import `reader` from the `csv` module, then read `hacker_news.csv` as a list of lists assigned to the variable `hn`. We will then separate the header row into `hn_header` and keep only the data rows in `hn`.

In [1]:
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
hn_header = hn[0]
hn = hn[1:]

In [2]:
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [3]:
print(hn_header)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
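
The import cell in this section leaves the file handle open. One possible alternative is sketched below on a small in-memory sample, so that it runs without `hacker_news.csv`. It wraps the reading in a helper that accepts any file-like object; `read_rows` and `sample` are hypothetical names introduced only for illustration.

```python
from csv import reader
from io import StringIO

def read_rows(file_obj):
    """Return (header, rows) from an open CSV file-like object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]

# Two-line stand-in for hacker_news.csv
# (the data row is copied from the output printed above).
sample = StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)
header, rows = read_rows(sample)
print(header)
print(rows[0][1])  # Interactive Dynamic Video
```

With the real file, the same helper could be called inside `with open('hacker_news.csv', encoding='utf-8', newline='') as f:`, which closes the file automatically. The `utf-8` encoding is an assumption about the file, not something the dataset description states.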


## Isolating the data with post titles Ask HN & Show HN

For this purpose we will use the string method `.startswith()`, which returns `True` if the string starts with the given prefix and `False` otherwise.

The method is case-sensitive, so to make sure we capture all matching posts we will first convert each title to lower case using the `.lower()` method.
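
As a small illustration of why the lower-casing matters, the sketch below compares a case-sensitive check with a lower-cased one. The second title is a made-up example, not a row from the dataset.

```python
titles = [
    'Ask HN: How to improve my personal website?',
    'ASK HN: hypothetical upper-case title',  # made-up example
    'Show HN: Something pointless I made',
]

# For each title: (case-sensitive match, match after lower-casing)
results = [
    (title.startswith('Ask HN'), title.lower().startswith('ask hn'))
    for title in titles
]
print(results)
# [(True, True), (False, True), (False, False)]
```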

We will first create 3 empty lists:
* `ask_posts` - will contain all post records starting with `Ask HN`
* `show_posts` - will contain all post records starting with `Show HN`
* `other_posts` - will contain all other post records

In [4]:
ask_posts = []
show_posts = []
other_posts = []

Then we will loop through the `hn` dataset, assign the title of each row to a variable `title`, and append the row to the matching list:

In [5]:
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [6]:
print(ask_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


In [7]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))


1744
1162
17194


We can see that our lists have been filled as follows:
* 1744 records in `ask_posts`
* 1162 records in `show_posts`
* 17194 records in `other_posts`

In the next step, we will sum the number of comments in `ask_posts`, which is stored in the column with index 4:

In [8]:
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
    


In [9]:
print(total_ask_comments)

24483


We can now calculate the average number of comments per ask post:

In [10]:
avg_ask_comments = round(total_ask_comments / len(ask_posts), 2)
print(avg_ask_comments)

14.04


We do the same for `show_posts`:

In [11]:
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])

In [12]:
print(total_show_comments)

11988


In [13]:
avg_show_comments = round(total_show_comments / len(show_posts), 2)
print(avg_show_comments)

10.32


From the above, we can see that on average `ask_posts` receive more comments (14.04) per post than `show_posts` (10.32).
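
The total and average were computed twice with the same steps, so they could be folded into one helper. The sketch below is one way to do that; `average_comments` and `sample_posts` are hypothetical names, and the two sample rows are copied from the `ask_posts` output shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column, rounded to 2 decimals."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(post[comments_index]) for post in posts)
    return round(total / len(posts), 2)

sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '',
     '28', '29', 'tkfx', '11/22/2015 13:43'],
]
print(average_comments(sample_posts))  # 17.5
```

Called as `average_comments(ask_posts)` and `average_comments(show_posts)`, it should reproduce the two averages above.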

Since ask posts receive more comments on average, we will focus only on that post category in our further analysis.

## Finding the relation between number of comments and hour of post creation

In our next steps, we will try to determine if ask posts created at a certain time of day attract more comments.

In [14]:
print(hn_header)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [15]:
print(ask_posts[0])

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']


We can see that the post creation time is stored in the `created_at` column, at index 6. Let's check the type of the data stored there:

In [16]:
type(ask_posts[0][6])

str

In [17]:
print(ask_posts[0][6])

8/16/2016 9:55


Since the information is stored as a string, we will use the `datetime` module to parse it. More specifically, we will use the `datetime.strptime()` method.

In [18]:
import datetime as dt

In [19]:
result_list = []

In [20]:
for post in ask_posts:
    result_list.append(
        [post[6], int(post[4])])

In the steps above, we created the list `result_list` and filled it with each post's creation time and its corresponding number of comments.

In [21]:
print(result_list[:3])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]


Before we can parse the date and time, we need to check which format is used. From the rows printed above, we can see that the date is formatted as `%m/%d/%Y` and the time as `%H:%M`.
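
A quick check of that format on the first creation time printed above might look like the sketch below. The variable `parsed` is an illustrative name.

```python
import datetime as dt

date_format = '%m/%d/%Y %H:%M'
parsed = dt.datetime.strptime('8/16/2016 9:55', date_format)

print(parsed)                 # 2016-08-16 09:55:00
print(parsed.hour)            # 9  (integer)
print(parsed.strftime('%H'))  # 09 (zero-padded string)
```

Note that `strptime()` accepts values without leading zeros, such as `8` for the month and `9` for the hour, even though `strftime()` always writes them zero-padded.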

We will now create two empty dictionaries:
* `counts_by_hour` - will store the number of posts created in each hour
* `comments_by_hour` - will store the total number of comments received by posts created in each hour

In [22]:
counts_by_hour = {}
comments_by_hour = {}
date_format = '%m/%d/%Y %H:%M'

In [23]:
for row in result_list:
    creation_time = row[0]
    n_comments = row[1]
    dt_creation_time = dt.datetime.strptime(
        creation_time, date_format) # convert to datetime object
    hour = dt_creation_time.strftime('%H') #convert hour to string
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments
    else:
        counts_by_hour[hour] += 1 #increment by 1 for each post creation in the hour
        comments_by_hour[hour] += n_comments # increment by number of comments in the hour
    
    

In [24]:
print(counts_by_hour)
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
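
The same tally can be written without the `if`/`else` branch by using `collections.defaultdict`, which starts every missing key at zero. This is an alternative sketch rather than the approach used above; the three sample pairs are copied from the `result_list` output, and `counts`, `comments` and `sample` are illustrative names.

```python
import datetime as dt
from collections import defaultdict

# Three [created_at, num_comments] pairs copied from result_list.
sample = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]

counts = defaultdict(int)    # posts created per hour
comments = defaultdict(int)  # total comments per hour
for created_at, n_comments in sample:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += n_comments

print(dict(counts))    # {'09': 1, '13': 1, '10': 1}
print(dict(comments))  # {'09': 6, '13': 29, '10': 1}
```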


The next step is to calculate the average number of comments per post for each hour of the day. We will store the results as a list of lists, `avg_by_hour`:

In [25]:
avg_by_hour = []

In [26]:
for hour in comments_by_hour:
    avg = round(comments_by_hour[hour]/counts_by_hour[hour], 2)
    avg_by_hour.append([hour, avg])

In [27]:
print(avg_by_hour)

[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]


To sort the list by average comments with `sorted()`, we will first swap the two values in each row so that the average comes first:

In [28]:
swap_avg_by_hour = []

In [29]:
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

In [30]:
print(swap_avg_by_hour)

[[5.58, '09'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [16.8, '16'], [7.99, '23'], [9.41, '12'], [11.46, '17'], [38.59, '15'], [16.01, '21'], [21.52, '20'], [23.81, '02'], [13.2, '18'], [7.8, '03'], [10.09, '05'], [10.8, '19'], [11.38, '01'], [6.75, '22'], [10.25, '08'], [7.17, '04'], [8.13, '00'], [9.02, '06'], [7.85, '07'], [11.05, '11']]


In [31]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

In [32]:
print(sorted_swap)

[[38.59, '15'], [23.81, '02'], [21.52, '20'], [16.8, '16'], [16.01, '21'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [13.2, '18'], [11.46, '17'], [11.38, '01'], [11.05, '11'], [10.8, '19'], [10.25, '08'], [10.09, '05'], [9.41, '12'], [9.02, '06'], [8.13, '00'], [7.99, '23'], [7.85, '07'], [7.8, '03'], [7.17, '04'], [6.75, '22'], [5.58, '09']]
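
The swap is not strictly necessary: `sorted()` accepts a `key` function that picks the value to sort by. A minimal sketch on four pairs copied from `avg_by_hour` (`avg_sample` and `by_average` are illustrative names):

```python
# A few [hour, average] pairs copied from avg_by_hour.
avg_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort by the second element (the average), highest first.
by_average = sorted(avg_sample, key=lambda row: row[1], reverse=True)
print(by_average)
# [['15', 38.59], ['02', 23.81], ['20', 21.52], ['09', 5.58]]
```

This keeps each row in its original `[hour, average]` order, at the cost of a slightly less obvious call.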


We will now loop through the `sorted_swap` list and print the top 5 results in an easy-to-read format:

In [33]:
print("Top 5 Hours for 'Ask HN' Comments")

for row in sorted_swap[:5]:
    hr = row[1]
    avg = row[0]
    dt_hour = dt.datetime.strptime(hr, '%H') #parse dt object
    str_hour = dt_hour.strftime('%H:%M') #format back into string
    print(
        "{}: {:.2f} average comments per post".format(str_hour, avg)
    )
    

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


We can now convert these times to our local time zone, which is GMT+2 (6 hours ahead of the Eastern Time used in the original dataset):

In [54]:
my_timezone = []
new_hr = 0

for row in sorted_swap:
    hr = row[1]
    avg = row[0]
    dt_hr = dt.datetime.strptime(hr, '%H')
    new_hr = dt_hr + dt.timedelta(hours = 6)
    str_hr = new_hr.strftime('%H:%M')
    my_timezone.append([avg, str_hr])

We parsed each hour value from `sorted_swap` into a `datetime` object, then added 6 hours with a `dt.timedelta` object. Once the hours were added, we converted the result back to a string using the `%H:%M` format.
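
The conversion of a single hour can be isolated in a small helper to show how hours late in the day wrap past midnight. `shift_hour` is a hypothetical name, and the fixed 6-hour offset is the assumption stated above.

```python
import datetime as dt

def shift_hour(hour_str, offset_hours=6):
    """Shift an 'HH' string by offset_hours and return it as 'HH:MM'."""
    shifted = dt.datetime.strptime(hour_str, '%H') + dt.timedelta(hours=offset_hours)
    return shifted.strftime('%H:%M')

print(shift_hour('15'))  # 21:00
print(shift_hour('20'))  # 02:00 (wraps past midnight)
```

Only the time of day is kept, so the change of date caused by the wrap-around is ignored.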

In [55]:
print(my_timezone)

[[38.59, '21:00'], [23.81, '08:00'], [21.52, '02:00'], [16.8, '22:00'], [16.01, '03:00'], [14.74, '19:00'], [13.44, '16:00'], [13.23, '20:00'], [13.2, '00:00'], [11.46, '23:00'], [11.38, '07:00'], [11.05, '17:00'], [10.8, '01:00'], [10.25, '14:00'], [10.09, '11:00'], [9.41, '18:00'], [9.02, '12:00'], [8.13, '06:00'], [7.99, '05:00'], [7.85, '13:00'], [7.8, '09:00'], [7.17, '10:00'], [6.75, '04:00'], [5.58, '15:00']]


In [57]:
print("Top 5 Hours for 'Ask HN' Comments")

for row in my_timezone[:5]:
    hr = row[1]
    avg = row[0]
    print(
        "{}: {:.2f} average comments per post".format(hr, avg)
    )
    

Top 5 Hours for 'Ask HN' Comments
21:00: 38.59 average comments per post
08:00: 23.81 average comments per post
02:00: 21.52 average comments per post
22:00: 16.80 average comments per post
03:00: 16.01 average comments per post


## Conclusion

From this simple analysis, we first concluded that ask posts receive more comments on average than show posts. Going further, we calculated the top 5 post creation hours that receive the most comments on average. Ask posts created at 3 p.m. Eastern Time (21:00 in GMT+2) receive considerably more comments per post: about 62% more than the runner-up hour, and more still compared with every other hour.
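
The 62% figure follows from the two highest averages in the top 5 list. A short check of the arithmetic, with illustrative variable names:

```python
top = 38.59        # average comments per post at 15:00 Eastern Time
runner_up = 23.81  # average comments per post at 02:00 Eastern Time

relative_gain = (top - runner_up) / runner_up
print(f"{relative_gain:.0%}")  # 62%
```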

The runner-up is 2 a.m. Eastern Time (08:00 in GMT+2).

Determining why would require more data, but a preliminary explanation is that these times coincide with morning and evening in the EU, when people are not at work or are travelling to or from it. In GMT+2, all of the top 5 hours fall outside or at the edge of common working hours; in Eastern Time, however, 3 p.m. and 4 p.m. fall within the US working day. The results are therefore only partly consistent with our assumption that posts created when people are not working receive more comments on average.