** EXPLORING HACKER NEWS **

In this project, we'll compare two different types of posts from Hacker News, a popular site where technology-related stories (or 'posts') are voted on and commented on. The two types of posts we'll explore begin with either Ask HN or Show HN.

Users submit Ask HN posts to ask the Hacker News community a specific question, such as "What is the best online course you've ever taken for Data Science?" Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll specifically compare these two types of posts to determine the following:

Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?

In [44]:
## Reading the data from the CSV file ##
from csv import reader

# Use a context manager so the file is closed once it has been read
with open('hacker_news.csv', encoding='utf-8') as opened_file:
    hn = list(reader(opened_file))

## Displaying the first five rows (the header plus four posts) ##
hn[:5]



[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16']]

** Removing the Header and Printing the First Five Rows **

In [46]:
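# Strip the header only if it is still present, so that re-running
# this cell does not drop a data row
if hn[0][0] == 'id':
    header = hn[0]
    hn = hn[1:]
print(hn[:5])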

[['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14'], ['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']]


** Extracting Ask HN, Show HN, and Other Posts from the Dataset **

First, we'll identify posts that begin with either Ask HN or Show HN and separate the data for those two types of posts into different lists. Separating the data makes it easier to analyze in the following steps.

In [48]:
## Identifying the different types of posts ##
askPost = []
showPost = []
otherPost = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        askPost.append(row)
    elif title.lower().startswith('show hn'):
        showPost.append(row)
    else:
        otherPost.append(row)

print("Number of Ask HN posts:", len(askPost))
print("Number of Show HN posts:", len(showPost))
print("Number of other posts:", len(otherPost))

Number of Ask HN posts: 9139
Number of Show HN posts: 10158
Number of other posts: 273821
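
The prefix check above can also be wrapped in a small helper, which makes the rule easy to try on individual titles. The following is only a minimal sketch: the function name `classify_title` and the sample titles are hypothetical and are not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # lowercase once so the check is case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical sample titles, used only to illustrate the rule.
samples = [
    'Ask HN: What is the best online course?',
    'SHOW HN: My weekend project',
    'algorithmic music',
]
labels = [classify_title(t) for t in samples]
print(labels)  # ['ask', 'show', 'other']
```

Lowercasing before the comparison is what lets titles such as 'SHOW HN: ...' fall into the same bucket as 'Show HN: ...'.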


** Calculating the Average Comments Received for Ask HN and Show HN Posts **

In [54]:
## Average comments received for Ask HN posts ##
totalAskComment = 0

for row in askPost:
    co = int(row[4])
    totalAskComment += co
avg_ask = totalAskComment/len(askPost)
print("Average comment for Ask HN Post : ",avg_ask)

Average comment for Ask HN Post :  10.393478498741656


In [59]:
## Average comments received for Show HN posts ##
total_show_comments = 0

for post in showPost:
    total_show_comments += int(post[4])
    
avg_show_comments = total_show_comments / len(showPost)
print("Average comment for SHOW HN post:", avg_show_comments)

Average comment for SHOW HN post: 4.886099625910612


On average, Ask HN posts in our sample receive approximately 10.4 comments, whereas Show HN posts receive approximately 4.9. Since ask posts receive more comments on average, we'll focus our remaining analysis on these posts only.
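
The two averaging cells above follow the same pattern, so the computation can be factored into one helper. This is a minimal sketch under stated assumptions: the name `average_comments`, its `comment_index` parameter, and the sample rows are hypothetical. Only the column position (index 4 for `num_comments`) comes from the header shown earlier.

```python
def average_comments(rows, comment_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not rows:
        return 0.0  # avoid dividing by zero when no posts match
    total = sum(int(row[comment_index]) for row in rows)
    return total / len(rows)

# Hypothetical rows in the same column layout as the dataset.
sample_rows = [
    ['1', 'Ask HN: example A', '', '3', '6', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: example B', '', '1', '2', 'user_b', '9/26/2016 3:24'],
]
print(average_comments(sample_rows))  # 4.0
```

With the notebook's lists, the same helper could be called as `average_comments(askPost)` and `average_comments(showPost)`.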

In [62]:
### Calculate the number of ask posts created during each hour of the day and the number of comments received ###
import datetime as dt

result = []

for row in askPost:
    result.append([row[6],int(row[4])])
    
commentsByHour = {}
countsByHour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date,date_format).strftime("%H")
    if time in countsByHour:
        commentsByHour[time]  += comment
        countsByHour[time] +=1
    else:
        commentsByHour[time] = comment
        countsByHour[time] = 1
commentsByHour

{'00': 2277,
 '01': 2089,
 '02': 2996,
 '03': 2154,
 '04': 2360,
 '05': 1838,
 '06': 1587,
 '07': 1585,
 '08': 2362,
 '09': 1477,
 '10': 3013,
 '11': 2797,
 '12': 4234,
 '13': 7245,
 '14': 4972,
 '15': 18525,
 '16': 4466,
 '17': 5547,
 '18': 4877,
 '19': 3954,
 '20': 4462,
 '21': 4500,
 '22': 3372,
 '23': 2297}
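
To make the hour extraction concrete, the short sketch below parses a single `created_at` value with the same format string used in the cell above. The timestamp is the one from the first data row displayed earlier. The variable names `created_at`, `parsed`, and `hour_key` are illustrative only.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"
created_at = "9/26/2016 3:26"  # created_at value of the first data row

# strptime turns the string into a datetime object;
# strftime("%H") then yields the zero-padded hour used as the dict key
parsed = dt.datetime.strptime(created_at, date_format)
hour_key = parsed.strftime("%H")

print(parsed)    # 2016-09-26 03:26:00
print(hour_key)  # 03
```

Because the key is a zero-padded string, '03' and '15' sort in chronological order when the dictionary keys are sorted.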

In [64]:
## Calculate the average number of comments that `Ask HN` posts created at each hour of the day receive. ##
avg_by_hour = []

for hr in commentsByHour:
    avg_by_hour.append([hr, commentsByHour[hr] / countsByHour[hr]])

avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

In [65]:
## Sorting the hours by average number of comments, highest first ##
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]
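
Swapping the columns is one way to sort by the average. An alternative, sketched below, is to pass a `key` function to `sorted` so that the `[hour, average]` pairs keep their original order. The list `avg_by_hour_sample` is a hypothetical name holding a few pairs copied from the `avg_by_hour` output above.

```python
# A few [hour, average] pairs copied from the avg_by_hour output.
avg_by_hour_sample = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['12', 12.380116959064328],
]

# Sort on the second element (the average), highest first.
sorted_by_avg = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)
top_hours = [hr for hr, avg in sorted_by_avg]
print(top_hours)  # ['15', '13', '12', '02']
```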

In [67]:
## Print the 5 hours with the highest average number of comments. ##

print("Top 5 Hours for 'Ask HN' Comments in the dataset")
for avg, hr in sorted_swap[:5]:
    print(
        "{}  {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for 'Ask HN' Comments in the dataset
15:00  28.68 average comments per post
13:00  16.32 average comments per post
12:00  12.38 average comments per post
02:00  11.14 average comments per post
10:00  10.68 average comments per post


The hour that receives the most comments per post on average is 15:00, with an average of 28.68 comments per post. That is about 76% more than the average for the second-highest hour, 13:00 (16.32 comments per post).
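
The relative gap between the top two hours can be checked with a short calculation. The values are taken from the sorted output above. The variable names are illustrative only.

```python
highest = 28.676470588235293  # average for 15:00
second = 16.31756756756757    # average for 13:00

# Relative increase from the second-highest to the highest average, in percent
pct_increase = (highest - second) / second * 100
print(round(pct_increase, 1))  # 75.7
```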

** CONCLUSION **

In this project, we analyzed Ask HN and Show HN posts to determine which type of post and which posting time receive the most comments on average. Based on our analysis, to maximize the number of comments a post receives, we'd recommend that it be an Ask HN post created between 15:00 and 16:00 (3:00 pm EST - 4:00 pm EST). These results describe only the posts in this dataset, so it's more accurate to say that, within this sample, ask posts received more comments on average, and ask posts created between 15:00 and 16:00 (3:00 pm EST - 4:00 pm EST) received the most comments on average.