# Exploring Hacker News Posts

In this project we will explore posts from the Hacker News site, where user-submitted posts are voted and commented on. Users submit posts whose titles begin with Ask HN to ask the community a specific question, or with Show HN to show the community a project or something interesting. We will compare these two types of post to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?

- Do posts created at a certain time receive more comments on average?

We will begin by reading in the data as a list of lists and displaying the header row and the first five data rows.

In [1]:
from csv import reader

# Use a context manager so the file is closed once the rows are read.
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

In [2]:
hn[:6]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 

Create a separate header variable and remove the header row from the data set.

In [3]:
headers = hn[:1]
hn = hn[1:]
print(headers)
hn[:6]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12'],
 ['10482257',
  'Title II kil

We are now ready to filter our data. We are only concerned with posts whose titles begin with Ask HN or Show HN, so we will create a separate list of lists for each of those two types and a third for all other posts.

In [4]:
ask_posts = []
show_posts = []
other_posts = []
n_ask = 0
n_show = 0
n_other = 0

for row in hn:
    # Lowercase the title so the prefix check is case-insensitive.
    title = row[1].lower()
    if title.startswith('ask hn'):
        n_ask += 1
        ask_posts.append(row)
    elif title.startswith('show hn'):
        n_show += 1
        show_posts.append(row)
    else:
        n_other += 1
        other_posts.append(row)

In [5]:
print('Number of Ask posts:',n_ask)
print('Number of Show posts:',n_show)
print('Number of other posts:',n_other)

Number of Ask posts: 1744
Number of Show posts: 1162
Number of other posts: 17194


Next we will determine if ask posts or show posts receive more comments on average.

In [6]:
total_ask_comments = 0

for row in ask_posts:
    n_comments = int(row[4])
    total_ask_comments += n_comments
    
avg_ask_comments = total_ask_comments / n_ask

total_show_comments = 0

for row in show_posts:
    n_comments = int(row[4])
    total_show_comments += n_comments
    
avg_show_comments = total_show_comments / n_show

print('Average amount of comments on Ask posts:', avg_ask_comments)
print('Average amount of comments on Show posts:', avg_show_comments)

Average amount of comments on Ask posts: 14.038417431192661
Average amount of comments on Show posts: 10.31669535283993


Ask posts receive more comments on average (14) than show posts (10).
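
The two loops above repeat the same logic, so the calculation could be factored into a small helper. The following is a minimal sketch, not the method used in this notebook: the function name `average_comments` and the sample rows are made up for illustration, and the rows only mimic the column order of the data set.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up sample rows (hypothetical data) in the same column order as the data set:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: Example question?', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Another question?', '', '3', '4', 'user_b', '1/26/2016 19:30'],
]

sample_avg = average_comments(sample_posts)
print('Average comments in the sample:', sample_avg)
```

With a helper like this, the same call could be applied to `ask_posts` and `show_posts` in turn, and the empty-list check avoids a division by zero if a filter returns no rows.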

Since ask posts receive more comments than show posts, we'll focus the remaining analysis on ask posts.

Next we will determine if ask posts created at a certain time are more likely to attract comments.

In [7]:
import datetime as dt

In [8]:
result_list = []

for row in ask_posts:
    created = row[6]
    comments = int(row[4])
    result_list.append([created, comments])

In [12]:
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comments = row[1]
    dt_date = dt.datetime.strptime(date, date_format)
    hour = dt_date.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
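
As an alternative to checking whether each hour is already a key, the same tally could be written with `collections.defaultdict`, which supplies a starting value of 0 for unseen keys. This is a sketch under assumptions: the function name `tally_by_hour` and the sample pairs are hypothetical, and only the date format is taken from the cell above.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(pairs, date_format="%m/%d/%Y %H:%M"):
    """Return (posts per hour, comments per hour) keyed by a two-digit hour string."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for created, n_comments in pairs:
        hour = dt.datetime.strptime(created, date_format).strftime("%H")
        counts[hour] += 1
        comments[hour] += n_comments
    # Convert back to plain dicts so later lookups of missing hours raise KeyError.
    return dict(counts), dict(comments)


# Made-up [created_at, num_comments] pairs (hypothetical data).
sample_pairs = [
    ["8/4/2016 11:52", 52],
    ["1/26/2016 19:30", 10],
    ["6/23/2016 11:05", 1],
]

sample_counts, sample_comments = tally_by_hour(sample_pairs)
print(sample_counts)
print(sample_comments)
```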

Next we'll calculate the average number of comments per post for each hour, using the two dictionaries built above. Each element of the avg_by_hour list of lists holds the hour first and the average number of comments second.

In [15]:
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr]/counts_by_hour[hr]])
    
avg_by_hour

[['21', 16.009174311926607],
 ['13', 14.741176470588234],
 ['09', 5.5777777777777775],
 ['00', 8.127272727272727],
 ['04', 7.170212765957447],
 ['10', 13.440677966101696],
 ['18', 13.20183486238532],
 ['11', 11.051724137931034],
 ['15', 38.5948275862069],
 ['17', 11.46],
 ['14', 13.233644859813085],
 ['01', 11.383333333333333],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['03', 7.796296296296297],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['16', 16.796296296296298],
 ['08', 10.25],
 ['19', 10.8],
 ['23', 7.985294117647059],
 ['22', 6.746478873239437],
 ['05', 10.08695652173913],
 ['12', 9.41095890410959]]

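Next we will sort this list by average, from highest to lowest, and print the top values in an easier-to-read format. To sort on the average, we first swap the two columns so that the average comes first.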

In [20]:
swap_avg_hour = []
for row in avg_by_hour:
    swap_avg_hour.append([row[1], row[0]])

print(swap_avg_hour)

sorted_swap = sorted(swap_avg_hour, reverse = True)

sorted_swap

[[16.009174311926607, '21'], [14.741176470588234, '13'], [5.5777777777777775, '09'], [8.127272727272727, '00'], [7.170212765957447, '04'], [13.440677966101696, '10'], [13.20183486238532, '18'], [11.051724137931034, '11'], [38.5948275862069, '15'], [11.46, '17'], [13.233644859813085, '14'], [11.383333333333333, '01'], [9.022727272727273, '06'], [7.852941176470588, '07'], [7.796296296296297, '03'], [21.525, '20'], [23.810344827586206, '02'], [16.796296296296298, '16'], [10.25, '08'], [10.8, '19'], [7.985294117647059, '23'], [6.746478873239437, '22'], [10.08695652173913, '05'], [9.41095890410959, '12']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
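
Swapping the columns is one way to sort by the average. Another is to leave the list as it is and pass a `key` function to `sorted`, which tells it which element to compare. The sketch below uses a few made-up `[hour, average]` pairs rather than the full results; the variable names are illustrative only.

```python
# Made-up [hour, average] pairs in the same layout as avg_by_hour (hypothetical data).
sample_avg_by_hour = [['21', 16.0], ['09', 5.5], ['15', 38.5]]

# Sort on the second element (the average), highest first, without swapping columns.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```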

In [22]:
print("Top 5 Hours for Ask Posts Comments")

for avg,hr in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Ask posts created during the 15:00 hour receive the most comments on average, at 38.59 comments per post.

# Conclusion

In this project, we analyzed ask and show posts to determine which type receives more comments on average, and found that ask posts do (about 14 comments per post versus about 10). Among ask posts, those created between 15:00 and 16:00 received the most comments on average (38.59 per post).

Possible next steps include:
- Determine if show or ask posts receive more points on average.
- Determine if posts created at a certain time are more likely to receive more points.
- Compare your results to the average number of comments and points other posts receive.
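
As a starting point for the first next step, the sketch below averages points per post type in a single pass. It is only an illustration under assumptions: the function name `average_points_by_type` and the sample rows are hypothetical, and the points are assumed to sit in column index 3 (`num_points`), as in the header shown earlier.

```python
def average_points_by_type(rows, title_index=1, points_index=3):
    """Return the mean points per post for each post type that has at least one post."""
    totals = {'ask': 0, 'show': 0, 'other': 0}
    counts = {'ask': 0, 'show': 0, 'other': 0}
    for row in rows:
        title = row[title_index].lower()
        if title.startswith('ask hn'):
            kind = 'ask'
        elif title.startswith('show hn'):
            kind = 'show'
        else:
            kind = 'other'
        totals[kind] += int(row[points_index])
        counts[kind] += 1
    # Leave out types with no posts to avoid dividing by zero.
    return {kind: totals[kind] / counts[kind] for kind in totals if counts[kind]}


# Made-up sample rows (hypothetical data) in the same column order as the data set.
sample_rows = [
    ['1', 'Ask HN: Example question?', '', '4', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Another question?', '', '6', '2', 'user_b', '1/26/2016 19:30'],
    ['3', 'Show HN: Example project', 'http://example.com', '9', '3', 'user_c', '6/23/2016 22:20'],
]

sample_points = average_points_by_type(sample_rows)
print(sample_points)
```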