# Hacker News Data Analysis 

Hacker News (HN) is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.



The dataset used for this project contains user-submitted posts (stories) together with their comment counts.

We will focus on:
- posts in which users ask the community a specific question. The titles of such posts begin with **"Ask HN"**.
- posts that show a user's project, idea, or anything interesting that the user wants to share with the community. The titles of these posts begin with **"Show HN"**.

**The analysis is done in almost pure Python, without libraries such as Pandas or NumPy, to practice handling fundamental Python data structures.**

## Main questions: 

#### 1) Do "Ask HN" or "Show HN" posts receive more comments on average?
#### 2) Do posts created at a certain time receive more comments on average?

## Dataset

The dataset is available on Kaggle:
https://www.kaggle.com/hacker-news/hacker-news-posts

Now let's load the file.

In [1]:
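import csv

# Read the whole CSV into a list of rows; the context manager closes the file.
with open('hacker_news.csv', encoding="utf8") as file:
    hn = list(csv.reader(file))

# Separate the header row from the data rows.
header, hn = hn[0], hn[1:]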


The dataset has 7 columns in total. A short description of each:
- **id**: Unique identifier of the post
- **title**: The title of the post
- **url**: The URL the post links to (if the post has one)
- **num_points**: The number of points of the post (upvotes minus downvotes)
- **num_comments**: The number of comments made on the post
- **author**: Username of the author
- **created_at**: The date and time the post was published

In [2]:
header

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
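
Because the rows are plain lists, the analysis cells refer to columns by position (for example `row[4]` for `num_comments`). As a small sketch, a name-to-index mapping built from the header can make such lookups easier to read. The `col` dictionary and `sample_row` variable are illustrative names, not part of the original notebook.

```python
# Header as printed by the notebook.
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row.
col = {name: index for index, name in enumerate(header)}

# One row taken from the dataset's first rows.
sample_row = ['12578989', 'algorithmic music',
              'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
              '1', '0', 'poindontcare', '9/26/2016 3:16']

# Values are stored as strings, so numeric columns need an explicit conversion.
n_comments = int(sample_row[col['num_comments']])
print(col['num_comments'], n_comments)
```

The rest of the analysis keeps the positional indices; the mapping is only a readability aid.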

First few rows of the table:


In [3]:
display(hn[:5])

[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]

### 1) Do "Ask HN" or "Show HN" posts receive more comments on average?

The first step is filtering. We split the rows into posts whose title starts with "Ask HN", posts whose title starts with "Show HN" (both checked case-insensitively), and everything else. The output below the code shows the number of posts in each category.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    if row[1].lower().startswith('ask hn'):
        ask_posts.append(row)
    elif row[1].lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

for count, post_type in zip([len(ask_posts), len(show_posts), len(other_posts)],['Ask posts', 'Show posts', 'Other posts']):
    string = f'{post_type} : {count} in total.'
    print(string)

Ask posts : 9139 in total.
Show posts : 10158 in total.
Other posts : 273822 in total.
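
The same split can be wrapped in a small function, which makes the rule (a case-insensitive check of how the title starts) easy to try on a few titles. This is only a sketch: `classify_title` and the sample titles are made up for illustration and do not come from the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles, used only to exercise the function.
samples = [
    'Ask HN: How do you learn a new language?',
    'Show HN: My weekend project',
    'algorithmic music',
    'Why I ask HN for advice',  # contains "ask HN" but does not start with it
]

labels = [classify_title(title) for title in samples]
print(labels)
```

The last sample illustrates why the check uses `startswith` rather than a substring test: a title that merely mentions "ask HN" lands in the "other" group.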


**Which category receives more comments on average?** We can now derive this easily: all we need is the average of the 5th column (index 4), `num_comments`, for each category.

In [5]:
for category_name, category in zip(['Ask_posts', 'Show_posts', 'Other_posts'], [ask_posts, show_posts, other_posts]):
    
    com_total = 0
    length = len(category)
    
    for row in category:
        
        com_total += int(row[4])
    
    average = com_total / length
    
    string = f'Category "{category_name}" has {average} comments per post on average.'
    print(string)

Category "Ask_posts" has 10.393478498741656 comments per post on average.
Category "Show_posts" has 4.886099625910612 comments per post on average.
Category "Other_posts" has 6.4572678601427205 comments per post on average.
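
The averaging step can also be written as a reusable function. The sketch below assumes rows in the dataset's column order; `average_comments` and the toy rows are hypothetical, and returning `0.0` for an empty list is a choice made here to avoid a division by zero, not something the notebook does.

```python
def average_comments(rows, comments_index=4):
    """Average of the num_comments column (index 4) over a list of rows."""
    if not rows:
        # Assumption: report 0.0 for an empty category instead of dividing by zero.
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Made-up rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
toy_rows = [
    ['1', 'Ask HN: first', '', '3', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: second', '', '1', '4', 'user_b', '9/26/2016 4:10'],
]

print(average_comments(toy_rows))
print(average_comments([]))
```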


As the result shows, "Ask HN" posts receive about twice as many comments on average as "Show HN" posts (roughly 10.4 versus 4.9). Good news for common Googlers!
The "Other_posts" average of about 6.5 is a reminder that HN hosts a wider spectrum of post types; "Ask HN" and "Show HN" posts are just a specific subset.

### 2) Do posts created at a certain time receive more comments on average?

Now to the next point. We'll try to determine whether posts created at a certain time are more likely to attract comments. First, we calculate the number of posts created during each hour of the day, along with the total number of comments those posts received. The publication time is taken from the last column of the dataset, `created_at`.

In [12]:
import datetime as dt

hours_post_count = {}
hours_comment_count = {}

for category_name, category in zip(['Ask_posts', 'Show_posts', 'Other_posts'], [ask_posts, show_posts, other_posts]):
    
    hours_post_count[category_name] = {}
    hours_comment_count[category_name] = {}
    
    for row in category:
        n_of_comments = int(row[4])
        time_info = dt.datetime.strptime(row[-1], '%m/%d/%Y %H:%M')
        hour = time_info.strftime('%H')  # zero-padded hour, e.g. '03'
        
        # Count the post and add its comments to the running totals for this hour.
        if hour not in hours_post_count[category_name]:
            hours_post_count[category_name][hour] = 1
            hours_comment_count[category_name][hour] = n_of_comments
        else:
            hours_post_count[category_name][hour] += 1
            hours_comment_count[category_name][hour] += n_of_comments
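
As an alternative to the explicit if/else bookkeeping, the standard library's `collections.defaultdict` can supply the initial zero for each hour. The sketch below shows the idea on a few made-up rows for a single category; `toy_rows`, `post_count`, and `comment_count` are illustrative names.

```python
import datetime as dt
from collections import defaultdict

# Made-up rows; only index 4 (num_comments) and the last column (created_at) matter here.
toy_rows = [
    ['1', 'Ask HN: a', '', '1', '5', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: b', '', '1', '7', 'user_b', '9/26/2016 3:59'],
    ['3', 'Ask HN: c', '', '1', '2', 'user_c', '9/26/2016 15:04'],
]

# Missing keys start at 0, so no membership check is needed.
post_count = defaultdict(int)
comment_count = defaultdict(int)

for row in toy_rows:
    created = dt.datetime.strptime(row[-1], '%m/%d/%Y %H:%M')
    hour = created.strftime('%H')  # zero-padded hour string, e.g. '03'
    post_count[hour] += 1
    comment_count[hour] += int(row[4])

print(dict(post_count))
print(dict(comment_count))
```

Both approaches produce the same dictionaries; the `defaultdict` version just needs fewer branches.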




In [23]:
print('POSTS FOR EACH HOUR:')
for posts_type in hours_post_count:
    
    string = f'{posts_type} for each hour in total : '
    print(string,'\n', hours_post_count[posts_type],'\n')

# Print this header once, after the post-count loop, not once per category.
print('\n', 'COMMENTS IN TOTAL FOR EACH HOUR')
for posts_type in hours_comment_count:
    string = f'{posts_type} total comments each hour : '
    print(string,'\n', hours_comment_count[posts_type],'\n')

POSTS FOR EACH HOUR:
Ask_posts for each hour in total :  
 {'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209} 

Show_posts for each hour in total :  
 {'00': 276, '23': 319, '20': 525, '19': 556, '18': 656, '16': 801, '14': 696, '10': 323, '09': 302, '08': 316, '06': 192, '03': 206, '21': 430, '17': 761, '15': 836, '11': 402, '07': 236, '04': 194, '13': 610, '12': 516, '01': 247, '22': 377, '02': 209, '05': 172} 

Other_posts for each hour in total :  
 {'03': 6649, '02': 6977, '01': 7391, '00': 8391, '23': 9720, '22': 11657, '21': 13568, '20': 14920, '19': 15929, '18': 17406, '17': 18363, '16': 18790, '15': 18043, '14': 16929, '13': 14874, '12': 11876, '11': 9638, '10': 9130, '09': 8528, '08': 7930, '07': 73


With the data above, we can calculate the average number of comments for posts created during each hour of the day. For each category of posts, only the six hours with the highest average comment count are displayed.

In [64]:
avg_by_hour = {}
for category in hours_post_count:
    
    avg_by_hour[category] = []
    
    for hour in hours_post_count[category]:
        # Average comments per post for this hour.
        average = hours_comment_count[category][hour] / hours_post_count[category][hour]
        avg_by_hour[category].append([hour, average])
    
    # Keep the six hours with the highest average.
    result = sorted(avg_by_hour[category], key=lambda x: x[1], reverse=True)[:6]
    print(category,'\n',result,'\n')

Ask_posts 
 [['15', 28.676470588235293], ['13', 16.31756756756757], ['12', 12.380116959064328], ['02', 11.137546468401487], ['10', 10.684397163120567], ['04', 9.7119341563786]] 

Show_posts 
 [['12', 6.994186046511628], ['07', 6.682203389830509], ['11', 6.002487562189055], ['08', 5.6044303797468356], ['14', 5.515804597701149], ['13', 5.432786885245902]] 

Other_posts 
 [['12', 7.58521387672617], ['11', 7.374144013280763], ['02', 7.1807367063207685], ['13', 7.146833400564744], ['05', 6.786839967506093], ['00', 6.613156953879156]] 
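
The raw lists are precise but hard to scan. As a sketch, the top entries can be printed in a friendlier form with an f-string; the three values below are copied from the "Ask_posts" result, while the `ask_top` and `lines` names and the output format are choices made here for illustration.

```python
# Top Ask_posts hours copied from the notebook output: [hour, average comments per post].
ask_top = [['15', 28.676470588235293], ['13', 16.31756756756757], ['12', 12.380116959064328]]

# Round each average to two decimals and show the hour as HH:00.
lines = [f'{hour}:00 - {average:.2f} average comments per post' for hour, average in ask_top]
for line in lines:
    print(line)
```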



## Conclusion

These results suggest that a question-type ("Ask HN") post attracts the most comments when published in the early-to-mid afternoon: posts created in the 15:00 hour average about 28.7 comments, followed by 13:00 (about 16.3) and 12:00 (about 12.4). The other categories show a similar but weaker trend, with the highest averages around noon (about 7.0 for "Show HN" and 7.6 for other posts at 12:00). Note that the time zone is Eastern Time in the US, so posts created right after midnight could come from another part of the world. Or it could be just that geeks are up until the morning.
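
To relate an Eastern Time hour to another time zone, the timestamp can be given an explicit offset and then converted. The sketch below is a simplification under stated assumptions: it uses a fixed UTC-4 offset (Eastern Daylight Time, which applies to a late-September date), so it ignores daylight saving transitions, and a real conversion across the whole dataset would need a proper time zone database. The timestamp itself is made up.

```python
import datetime as dt

# Assumption: the sample timestamp falls in US daylight saving time, so Eastern is UTC-4 here.
EASTERN_DAYLIGHT = dt.timezone(dt.timedelta(hours=-4), 'EDT')

# Made-up timestamp in the dataset's created_at format.
created = dt.datetime.strptime('9/26/2016 15:00', '%m/%d/%Y %H:%M')

# Attach the assumed offset, then convert to UTC.
created_eastern = created.replace(tzinfo=EASTERN_DAYLIGHT)
created_utc = created_eastern.astimezone(dt.timezone.utc)

print(created_utc.strftime('%m/%d/%Y %H:%M'))
```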