# HACKER NEWS - Analyzing Ask and Show Posts 

Hackers News is a website operated by the startup incubator, Y Combinator. Users submit stories (posts) on to the website and these stories are voted and commented on. Hacker News is propular in technology and startup networks. The data set we are working with initially contained 300,000 rows but has since been reduced to 20,000 rows. The rows that were removed contained null values in the comment column or where not chosen in the random sampling. 

In the dataset, we are interested in posts whose titles begin with either Ask HN or Show HN. These are posts submitted to ask the community a specific question or show the community something, respectively. 

In [61]:
opened_file = open('HN_posts_year_to_Sep_26_2016.csv')
from csv import reader 
read_file = reader(opened_file)
HN = list(read_file)
print(HN[0])
print('\n')
print(HN[1])
print('\n')
print(HN[2])
print('\n')
print(HN[3])
print('\n')
print(HN[4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


# Cleaning the Data

As a part of the next step, we will proceed to clean the data. In the next few lines of code, we can see that we have nearly 300,000 rows in the dataset. We will first want to separate the header row from the sample rows. Then, we will remove the rows that do not contain any comments. Following that, we will sample the first 20,000 rows in the dataset for use in our analysis.

In [62]:
headers = HN[0] #Separate the header row
HN = HN[1:] #Reassign the dataset without the header row
print(headers)
print('\n')
print(HN[0:4])
print('\n')
print('Number of rows: ', len(HN))

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


Number of rows:  293119


In [63]:
HN_refined = [] 
for row in HN: #for loop that will create a new list with only the rows that have at least one comment
    comment = int(row[4])
    if comment > 0:
        HN_refined.append(row)
print(HN_refined[:4])
print('Count HN Refined: ', len(HN_refined))

[['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13'], ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26'], ['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54']]
Count HN Refined:  80401


In [64]:
HN_refined = HN_refined[0:20000] #Reassign the dataset to only contain the first 20,000 records
print('Count HN Refined: ', len(HN_refined))

Count HN Refined:  20000


In [65]:
ask_posts = [] 
show_posts = []
other_posts = []
for row in HN_refined: #this for loop will partition our dataset into three lists: ask posts, show posts, and others
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(title)
print('Count Ask Posts: ', len(ask_posts))
print('Count Show Posts: ', len(show_posts))
print('Count Other Posts: ', len(other_posts))

Count Ask Posts:  1987
Count Show Posts:  1260
Count Other Posts:  16753


In [66]:
print(headers)
print('\n')
print(ask_posts[:4])
print('\n')
print(show_posts[:4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]


[['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06'], ['12576813', 'Show HN: Learn Japanese Vocab via multiple choice questions', 'http://japanese.vul.io/', '1', '1', 'soulchild37', '9/25/2016 19:06'], ['12576090', 'Show HN: Markov chain Twitter bot. Trained on comments left on Pornhub', 'https://twitter.com/botsonasty',

In [67]:
total_ask_comments = 0
for row in ask_posts: #this for loop will count the number of comments in the ask posts
    comments = int(row[4])
    total_ask_comments += comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments) #display the average amount of comments in the ask posts

total_show_comments = 0
for row in show_posts: #this for loop will count the number of comments in the show posts
    comments = int(row[4])
    total_show_comments += comments
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments) #display the average amount of comments in the show posts
    

16.52088575742325
9.765079365079366


# Findings - Average Comment Count

After conducting our analysis, we can see that the average number of comments in the Ask Posts (Average of 16.52 comments) is greater than the average number of comments in the Show Posts (Average of 9.76 comments). We will continue to use the Ask Posts only since it has a greater amount of comments on average.

In [68]:
import datetime as dt #import datetime module

In [73]:
result_list = []
for row in ask_posts: #iterate over ask_posts to create a new list containing the "created at" date and the number of comments
    created_at = row[6]
    comments = int(row[4])
    result_list.append([created_at, comments])
counts_by_hour = {} #create empty dictionary that will contain the number of ask posts created during each day
comments_by_hour = {} # create empty dictionary that will contain the corresponding number of ask post comments created at each hour received
for row in result_list: # iterate over result_list
    comments = row[1]
    date_dt = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M') #convert "created at" date to datetime format
    hour = dt.datetime.strftime(date_dt, '%H') #extract hour from datetime as a string
    if hour not in counts_by_hour: #append post and comment count by hour to each respective dictionary
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
print('Counts by Hour: ', counts_by_hour)
print('\n')
print('Comments by Hour: ', comments_by_hour)


Counts by Hour:  {'02': 61, '01': 70, '22': 67, '21': 121, '19': 107, '17': 130, '15': 140, '14': 116, '13': 102, '11': 86, '10': 76, '09': 53, '07': 57, '03': 48, '16': 105, '08': 67, '00': 57, '23': 77, '20': 113, '18': 121, '12': 101, '04': 42, '06': 54, '05': 40}


Comments by Hour:  {'02': 606, '01': 497, '22': 852, '21': 1415, '19': 1442, '17': 2573, '15': 5136, '14': 1898, '13': 3326, '11': 1038, '10': 1215, '09': 673, '07': 902, '03': 818, '16': 980, '08': 1028, '00': 755, '23': 619, '20': 2343, '18': 1264, '12': 1707, '04': 907, '06': 509, '05': 517}


In [83]:
avg_by_hour = []
for hour in comments_by_hour: #Calculate the average number of comments per post for posts created during each hour
    total_counts = counts_by_hour[hour]
    total_comments = comments_by_hour[hour]
    avg_by_hour.append([hour, round(total_comments / total_counts, 2)])

In [85]:
print(avg_by_hour)
print('\n')
print(len(avg_by_hour))

[['02', 9.93], ['01', 7.1], ['22', 12.72], ['21', 11.69], ['19', 13.48], ['17', 19.79], ['15', 36.69], ['14', 16.36], ['13', 32.61], ['11', 12.07], ['10', 15.99], ['09', 12.7], ['07', 15.82], ['03', 17.04], ['16', 9.33], ['08', 15.34], ['00', 13.25], ['23', 8.04], ['20', 20.73], ['18', 10.45], ['12', 16.9], ['04', 21.6], ['06', 9.43], ['05', 12.93]]


24


In [87]:
swap_avg_by_hour = [] #prep to sort the results
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)

[[9.93, '02'], [7.1, '01'], [12.72, '22'], [11.69, '21'], [13.48, '19'], [19.79, '17'], [36.69, '15'], [16.36, '14'], [32.61, '13'], [12.07, '11'], [15.99, '10'], [12.7, '09'], [15.82, '07'], [17.04, '03'], [9.33, '16'], [15.34, '08'], [13.25, '00'], [8.04, '23'], [20.73, '20'], [10.45, '18'], [16.9, '12'], [21.6, '04'], [9.43, '06'], [12.93, '05']]


In [88]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True) #sort the results
print(sorted_swap)

[[36.69, '15'], [32.61, '13'], [21.6, '04'], [20.73, '20'], [19.79, '17'], [17.04, '03'], [16.9, '12'], [16.36, '14'], [15.99, '10'], [15.82, '07'], [15.34, '08'], [13.48, '19'], [13.25, '00'], [12.93, '05'], [12.72, '22'], [12.7, '09'], [12.07, '11'], [11.69, '21'], [10.45, '18'], [9.93, '02'], [9.43, '06'], [9.33, '16'], [8.04, '23'], [7.1, '01']]


In [99]:
print('Top 5 Hours for Ask Posts Comments')
for row in sorted_swap[0:5]: #for loop to print the top 5 hours for Ask Posts comments
    average = row[0]
    hour_dt = dt.datetime.strptime(row[1], '%H')
    hour = hour_dt.strftime('%H:%M')
    output = "{time}: {number} average comments per post"
    output = output.format(time = hour, number = average)
    print(output)

Top 5 Hours for Ask Posts Comments
15:00: 36.69 average comments per post
13:00: 32.61 average comments per post
04:00: 21.6 average comments per post
20:00: 20.73 average comments per post
17:00: 19.79 average comments per post


# Conclusion

In conclusion, through our data analysis we have determined that 3PM, 1PM, 4AM, 8PM, and 5PM (All EST) are the best times to create a post to have a higher chance of receiving comments. Posts created during these hours receives anywhere from approximately 20 to 36 comments on average. 