# Exploring Hacker News
**About Hacker News**

Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Paul Graham's investment fund and startup incubator, Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity".

Let's open the file and explore the data a little.

In [2]:
from csv import reader

# "utf-8" is the valid codec name; "utf8-" is not a recognized encoding
opened_file = open("hacker_news.csv", encoding="utf-8")
read_file = reader(opened_file)
data = list(read_file)
hn_header = data[0]
hn = data[1:]

print(hn_header)
print('\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
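
The cell above never closes `opened_file`. A minimal sketch of the same loading step with a context manager follows. The helper name `split_header` and the two in-memory sample rows are illustrative assumptions, not part of the dataset.

```python
import csv
import io


def split_header(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# In-memory stand-in for hacker_news.csv (made-up rows, same columns).
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,3,2,alice,9/26/2016 3:26\n"
    '2,"Show HN: Example, with a comma",http://example.com,5,0,bob,9/26/2016 4:10\n'
)
sample_header, sample_rows = split_header(sample)
print(sample_header)
print(len(sample_rows))

# With the real file, the context manager closes the handle automatically:
# with open("hacker_news.csv", encoding="utf-8", newline="") as f:
#     hn_header, hn = split_header(f)
```

Passing an open file object to the helper keeps it testable without the CSV file on disk.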


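**We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question, and Show HN posts to show the community a project, product, or something else they find interesting.**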

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Next, we'll use the string methods *lower()* and *startswith()* to separate posts whose titles begin with Ask HN or Show HN (in any capitalization) into two lists. All remaining posts go into a third list.

In [3]:
asks_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith("ask hn"):
        asks_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(asks_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822


As shown above, the number of asks_posts = 9139 while the number of show_posts = 10158
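
The case-insensitive prefix test can be checked on a few hand-written titles. This is only a sketch: the function name `classify_title` and the example titles are assumptions made for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Made-up titles covering each branch, including a mid-title "show HN".
examples = ["Ask HN: How do you learn?", "SHOW HN: My project", "Why I show HN my work"]
labels = [classify_title(t) for t in examples]
print(labels)  # ['ask', 'show', 'other']
```

Because `startswith()` only inspects the beginning of the string, a title that merely mentions "show HN" later on is counted as an ordinary post.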

Now we'll find the average number of comments for both types of posts and determine which type receives more.

In [4]:
total_ask_comments = 0
total_show_comments = 0
for row in asks_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_ask_comments = total_ask_comments/len(asks_posts)
avg_show_comments = total_show_comments/len(show_posts)
print("Average number of ASK comments is", avg_ask_comments)
print("Average number of SHOW comments is", avg_show_comments)

Average number of ASK comments is 10.393478498741656
Average number of SHOW comments is 4.886099625910612


**In the previous cell, we calculated the average number of comments for both post types. ASK posts receive more comments on average (about 10.4 per post) than SHOW posts (about 4.9 per post).**

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.

In [5]:
import datetime as dt
result_lists = []

for post in asks_posts:
    date = post[6]
    num_comments = int(post[4])
    result_lists.append([date, num_comments])
counts_by_hour = {}
comments_by_hour = {}
for result in result_lists:
    date = result[0]
    num_comments = result[1]
    date_object = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date_object.strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments

print(counts_by_hour)
print(comments_by_hour)

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
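
The same two tallies can also be built without the `if`/`else` branch by using `collections.defaultdict`. The sketch below runs on three made-up `(created_at, num_comments)` pairs in the dataset's date format. The names `sample`, `sample_counts`, and `sample_comments` are assumptions for illustration.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs in the dataset's date format.
sample = [
    ("9/26/2016 3:26", 4),
    ("9/26/2016 3:50", 6),
    ("8/16/2016 15:05", 10),
]

sample_counts = defaultdict(int)    # posts per hour
sample_comments = defaultdict(int)  # comments per hour
for created_at, num_comments in sample:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded hour, e.g. "03"
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))    # {'03': 2, '15': 1}
print(dict(sample_comments))  # {'03': 10, '15': 10}
```

A missing key in a `defaultdict(int)` starts at 0, so the first post seen for an hour needs no special case.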


**We have calculated the number of ask posts created in each hour and the number of comments those posts received. Next, we will find the average number of comments per post for each hour.**

In [6]:
result = []
for hour in counts_by_hour:
    avg_comments_by_hour = comments_by_hour[hour]/counts_by_hour[hour]
    result.append([hour, avg_comments_by_hour])
print(sorted(result))

[['00', 7.5647840531561465], ['01', 7.407801418439717], ['02', 11.137546468401487], ['03', 7.948339483394834], ['04', 9.7119341563786], ['05', 8.794258373205741], ['06', 6.782051282051282], ['07', 7.013274336283186], ['08', 9.190661478599221], ['09', 6.653153153153153], ['10', 10.684397163120567], ['11', 8.96474358974359], ['12', 12.380116959064328], ['13', 16.31756756756757], ['14', 9.692007797270955], ['15', 28.676470588235293], ['16', 7.713298791018998], ['17', 9.449744463373083], ['18', 7.94299674267101], ['19', 7.163043478260869], ['20', 8.749019607843136], ['21', 8.687258687258687], ['22', 8.804177545691905], ['23', 6.696793002915452]]


In [7]:
swap_avg_by_hour = []
for res in result:
    swapped = [res[1], res[0]]
    swap_avg_by_hour.append(swapped)
print(swap_avg_by_hour)

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


In [8]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap[:5])

[[28.676470588235293, '15'], [16.31756756756757, '13'], [12.380116959064328, '12'], [11.137546468401487, '02'], [10.684397163120567, '10']]
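
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted()` accepts a `key` function, so the `[hour, average]` pairs can be ranked in place. The list `avg_sample` below holds four pairs taken from the results above and rounded to two decimals. Its name is an assumption for illustration.

```python
# Four [hour, average] pairs taken from the results above, rounded.
avg_sample = [["02", 11.14], ["15", 28.68], ["13", 16.32], ["09", 6.65]]

# Sort by the second element (the average), highest first.
top_hours = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)
print(top_hours[:2])  # [['15', 28.68], ['13', 16.32]]
```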


In [9]:
for avg, hour in sorted_swap:
    date_pattern = "%H"
    date_object = dt.datetime.strptime(hour, date_pattern)
    hour_final = date_object.strftime("%H:%M")  # %M is minutes; %S would print seconds
    str_pattern = "{hour}: {average:.2f} average comments per post"
    print(str_pattern.format(hour=hour_final, average=avg))

15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
04:00: 9.71 average comments per post
14:00: 9.69 average comments per post
17:00: 9.45 average comments per post
08:00: 9.19 average comments per post
11:00: 8.96 average comments per post
22:00: 8.80 average comments per post
05:00: 8.79 average comments per post
20:00: 8.75 average comments per post
21:00: 8.69 average comments per post
03:00: 7.95 average comments per post
18:00: 7.94 average comments per post
16:00: 7.71 average comments per post
00:00: 7.56 average comments per post
01:00: 7.41 average comments per post
19:00: 7.16 average comments per post
07:00: 7.01 average comments per post
06:00: 6.78 average comments per post
23:00: 6.70 average comments per post
09:00: 6.65 average comments per post


## The goal is achieved.
**We found the hour of the day at which ASK posts receive the highest average number of comments: 15:00, with about 28.68 comments per post.**

However, we will broaden our goals and:
- Determine if show or ask posts receive more points on average.
- Determine if posts created at a certain time are more likely to receive more points.
- Compare our results to the average number of comments and points other posts receive.
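
Since these goals repeat the per-hour averaging for a different column, one possible approach is a small helper that averages any numeric column by hour. This is a sketch under stated assumptions: the function `average_by_hour`, its parameters, and the three sample rows are hypothetical. Only the column order (`num_points` at index 3, `num_comments` at index 4, `created_at` at index 6) comes from the dataset header.

```python
import datetime as dt


def average_by_hour(rows, value_index, date_index=6):
    """Average the integer column `value_index` per hour of `created_at`."""
    totals = {}
    counts = {}
    for row in rows:
        hour = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M").hour
        totals[hour] = totals.get(hour, 0) + int(row[value_index])
        counts[hour] = counts.get(hour, 0) + 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


# Made-up rows following the dataset's column order.
demo_rows = [
    ["1", "Ask HN: a?", "", "4", "2", "alice", "9/26/2016 3:26"],
    ["2", "Ask HN: b?", "", "6", "0", "bob", "9/26/2016 3:50"],
    ["3", "Ask HN: c?", "", "9", "5", "carol", "8/16/2016 15:05"],
]
demo_points = average_by_hour(demo_rows, value_index=3)
demo_comments = average_by_hour(demo_rows, value_index=4)
print(demo_points)    # {3: 5.0, 15: 9.0}
print(demo_comments)  # {3: 1.0, 15: 5.0}
```

Here the hour is kept as an integer rather than a zero-padded string, which is a design choice of this sketch and differs from the cells above.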

In [10]:
ask_points = 0
show_points = 0
for row in asks_posts:
    num_points = int(row[3])
    ask_points += num_points
for row in show_posts:
    num_points = int(row[3])
    show_points += num_points

In the cell above, we summed the points of ASK and SHOW posts into separate variables.
Now we'll compute the average number of points each type receives.

In [12]:
str_pattern = "The average number of points for {topic} posts is {number}"
print(str_pattern.format(number=ask_points/len(asks_posts), topic="ASK"))
print(str_pattern.format(number=show_points/len(show_posts), topic="SHOW"))

The average number of points for ASK posts is 11.31174089068826
The average number of points for SHOW posts is 14.843571569206537


Let's compare these averages with the average comment counts we computed earlier.

In [13]:
print("Average number of ASK comments is", avg_ask_comments)
print("Average number of SHOW comments is", avg_show_comments)

Average number of ASK comments is 10.393478498741656
Average number of SHOW comments is 4.886099625910612


**From these results, we can see that although SHOW posts receive more points on average (about 14.8 versus 11.3), they attract fewer comments. ASK posts earn fewer points on average but receive considerably more comments (about 10.4 versus 4.9 per post).**

Let's check whether posts created at a certain time are more likely to receive more points. Since we are interested in both types of posts, we'll analyze each in turn.

In [14]:
ask_points_by_hour = {}
show_points_by_hour = {}