# Exploring Hacker News Posts

In this project we analyze posts on Hacker News and ask two questions:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

The data can be found here: [Hacker News Posts](https://www.kaggle.com/hacker-news/hacker-news-posts)

In [1]:
# Read in data
from csv import reader

# The with block closes the file once all rows have been loaded
with open("C:\\Users\\ASUS\\Downloads\\Hacker_News_Dataset.csv", encoding='utf8') as opened_file:
    hn = list(reader(opened_file))

In [2]:
# Look at the first 5 rows of the data (the header row plus 4 posts)
hn[0:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16']]

In [3]:
# The first row holds the column names, which we don't want in the data we analyze
headers = hn[0]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [4]:
# Here we get rid of the header but keep the rest of the data
hn = hn[1:]
hn[:5]

[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]

In [5]:
# Glance at data
hn[10:15]

[['12578908',
  'Ask HN: What TLD do you use for local development?',
  '',
  '4',
  '7',
  'Sevrene',
  '9/26/2016 2:53'],
 ['12578893',
  'Muroc Maru',
  'http://www.weirdca.com/location.php?location=511',
  '1',
  '0',
  'x43b',
  '9/26/2016 2:46'],
 ['12578879',
  'Why companies make their products worse',
  'https://www.1843magazine.com/ideas/the-daily/why-companies-make-their-products-worse',
  '4',
  '0',
  'RachelF',
  '9/26/2016 2:40'],
 ['12578866',
  'Tuning AWS SQS Queues',
  'http://blog.simontaranto.com/post/2016-09-25-tuning-aws-sqs.html/',
  '3',
  '0',
  'srt32',
  '9/26/2016 2:37'],
 ['12578857',
  'The Promise of GitHub',
  'http://constantbetasoftware.com/2016/09/26/github.html',
  '2',
  '0',
  'ttam',
  '9/26/2016 2:34']]

## Data Slicing

Since we only want to analyze "Show HN" and "Ask HN" posts, we now split the data into three lists: Ask HN posts, Show HN posts, and everything else.

In [6]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    lower_title = title.lower()
    if lower_title.startswith('ask hn'):
        ask_posts.append(row)
    elif lower_title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))


9139
10158
273822


As we can see, the data set contains 9,139 "Ask HN" posts, 10,158 "Show HN" posts, and 273,822 other posts.

Let's look at the first five lines of our Ask and Show datasets:

In [7]:
# Ask
print(ask_posts[:5])
print('')
# Show
print(show_posts[:5])

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]

[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/2

# Data Exploration

## Part One

To see which posts get more user interaction, we will look at how many comments each type of post gets on average.

In [8]:
# "Ask HN" posts
total_ask_comments = 0
row_num = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    row_num += 1
    
avg_ask_comments = total_ask_comments / row_num
print(avg_ask_comments)

10.393478498741656


In [9]:
# "Show HN" posts
total_show_comments = 0
row_num = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    row_num += 1
    
avg_show_comments = total_show_comments / row_num
print(avg_show_comments)

4.886099625910612


On average, "Ask HN" posts receive roughly twice as many comments as "Show HN" posts (about 10.4 versus 4.9). One possible explanation is that users are more inclined to answer questions than to comment on newly shared projects.

Because "Ask HN" posts attract more comments, the remainder of this analysis focuses on them.
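
The two averaging cells above repeat the same loop, so the calculation could be wrapped in a small helper. The sketch below is only one way to do it: `average_comments` and `sample_ask` are names introduced here for illustration, and the two sample rows are copied from the Ask HN output shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two rows copied from the Ask HN output above (hypothetical sample, not the full data)
sample_ask = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
]

print(average_comments(sample_ask))  # (7 + 3) / 2 = 5.0
```

With the full lists, `average_comments(ask_posts)` and `average_comments(show_posts)` should give the same two averages as the cells above.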

## Part Two

Is there a time of day at which "Ask HN" posts tend to receive more comments?


In [10]:
# datetime is used in the next cell to parse the created_at strings
import datetime as dt

# Pair each post's creation time with its number of comments
result_list = []
for row in ask_posts:
    time_col = row[6]
    num_comments = int(row[4])
    result_list.append([time_col, num_comments])

counts_by_hour = {}
comments_by_hour = {}

result_list[:5]

[['9/26/2016 2:53', 7],
 ['9/26/2016 1:17', 3],
 ['9/25/2016 22:57', 0],
 ['9/25/2016 22:48', 3],
 ['9/25/2016 21:50', 2]]

In [12]:
# The first element of each row (index 0) is the date and time of the post;
# the second element (index 1) is the number of comments on that post

for row in result_list:
    date_and_time = row[0]
    date_and_time = dt.datetime.strptime(date_and_time, "%m/%d/%Y %H:%M")
    post_time = date_and_time.strftime("%H")
    if post_time not in counts_by_hour:
        counts_by_hour[post_time] = 1
        comments_by_hour[post_time] = int(row[1])
    else:
        counts_by_hour[post_time] += 1
        comments_by_hour[post_time] += int(row[1])

Above, we created two dictionaries, both keyed by the hour of the day as a two-digit string (for example, '02' or '15').

The first, *counts_by_hour*, contains the number of ask posts created during each hour of the day.

The second, *comments_by_hour*, contains the total number of comments received by the ask posts created during each hour.
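
To make the bookkeeping concrete, here is a small sketch that builds the same kind of dictionaries from the five sample rows printed earlier. It uses `collections.defaultdict` in place of the explicit `if`/`else` check, which is a swapped-in alternative rather than what the cell above does; `sample_rows`, `sample_counts`, and `sample_comments` are illustrative names.

```python
import datetime as dt
from collections import defaultdict

# The first five [created_at, num_comments] pairs from result_list
sample_rows = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 1:17', 3],
    ['9/25/2016 22:57', 0],
    ['9/25/2016 22:48', 3],
    ['9/25/2016 21:50', 2],
]

# Missing keys start at 0, so no membership check is needed
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for created_at, num_comments in sample_rows:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))    # {'02': 1, '01': 1, '22': 2, '21': 1}
print(dict(sample_comments))  # {'02': 7, '01': 3, '22': 3, '21': 2}
```

The two posts created during hour 22 are counted together: two posts with three comments in total.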

### Average Number of Comments per Post for Each Hour

In [19]:
avg_list = []
for each in counts_by_hour:
    avg_list.append([each, comments_by_hour[each] / counts_by_hour[each]])
    
print(avg_list)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


In [20]:
swap_avg_list = []
for val in avg_list:
    swap_avg_list.append([val[1], val[0]])
    
swap_avg_list    

[[11.137546468401487, '02'],
 [7.407801418439717, '01'],
 [8.804177545691905, '22'],
 [8.687258687258687, '21'],
 [7.163043478260869, '19'],
 [9.449744463373083, '17'],
 [28.676470588235293, '15'],
 [9.692007797270955, '14'],
 [16.31756756756757, '13'],
 [8.96474358974359, '11'],
 [10.684397163120567, '10'],
 [6.653153153153153, '09'],
 [7.013274336283186, '07'],
 [7.948339483394834, '03'],
 [6.696793002915452, '23'],
 [8.749019607843136, '20'],
 [7.713298791018998, '16'],
 [9.190661478599221, '08'],
 [7.5647840531561465, '00'],
 [7.94299674267101, '18'],
 [12.380116959064328, '12'],
 [9.7119341563786, '04'],
 [6.782051282051282, '06'],
 [8.794258373205741, '05']]

In [22]:
sorted_swap = sorted(swap_avg_list, reverse = True)
sorted_swap

Average comments per post by hour, highest first:


[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]
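
Swapping the columns works because lists are sorted by their first element. As an alternative sketch, `sorted` also accepts a `key` function, so the `[hour, average]` pairs can be sorted by average directly without building a swapped copy. `sample_avg` below holds four pairs copied from `avg_list`, and `sorted_by_avg` is an illustrative name.

```python
# Four [hour, average] pairs copied from avg_list above
sample_avg = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['09', 6.653153153153153],
]

# Sort on the second element (the average), largest first
sorted_by_avg = sorted(sample_avg, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

This prints the hours in the order 15, 13, 02, 09, matching their relative order in the sorted list above.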

In [23]:
print('Top 5 Hours for Ask Posts Comments')
for val in sorted_swap[:5]:
    my_hr = val[1]
    my_hr = dt.datetime.strptime(my_hr, "%H")
    post_time = my_hr.strftime("%H")
    print('{hr}: {num:.2f} average comments per post'.format(hr = post_time, num = val[0]))

Top 5 Hours for Ask Posts Comments
15: 28.68 average comments per post
13: 16.32 average comments per post
12: 12.38 average comments per post
02: 11.14 average comments per post
10: 10.68 average comments per post


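# Conclusion

On average, "Ask HN" posts receive more comments than "Show HN" posts (about 10.4 versus 4.9 per post).

Four of the five hours with the highest average comment counts for "Ask HN" posts (10, 12, 13, and 15) fall between 10 AM and 4 PM, and the 15:00 hour stands out at 28.68 comments per post. The exception is 02:00, with 11.14. If the timestamps reflect readers' local time, these daytime hours are common work hours, so perhaps people checking Hacker News are taking a break at work.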

# Further Research

- Determine whether "Show HN" or "Ask HN" posts receive more points on average.
- Determine whether posts created at a certain time are more likely to receive more points.
- Compare these results to the average number of comments and points that other posts receive.
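
As a starting point for the points questions, the hourly grouping used for comments can be generalized to any numeric column. The sketch below is an assumption about how one might approach it, not part of the analysis above: `average_by_hour` and `sample_posts` are hypothetical names, and the column indices follow the header row (`num_points` at index 3, `created_at` at index 6).

```python
import datetime as dt


def average_by_hour(posts, value_index, time_index=6):
    """Return {hour: average of the chosen numeric column} for the given posts."""
    totals = {}
    counts = {}
    for row in posts:
        hour = dt.datetime.strptime(row[time_index], "%m/%d/%Y %H:%M").strftime("%H")
        totals[hour] = totals.get(hour, 0) + int(row[value_index])
        counts[hour] = counts.get(hour, 0) + 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


# Three rows copied from the Ask HN output earlier in the notebook
sample_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
    ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'],
]

# Average points (index 3) per hour for the sample rows
print(average_by_hour(sample_posts, value_index=3))  # {'02': 4.0, '22': 1.0}
```

Passing `value_index=4` instead would reproduce the average-comments calculation for whichever list of posts is supplied.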