
# Exploring Hacker News Posts

In this project, we'll compare two different types of posts from [Hacker News](https://news.ycombinator.com/), a popular site where technology-related stories (or 'posts') are voted on and commented on. The two types of posts we'll explore have titles that begin with either Ask HN or Show HN.

Users submit Ask HN posts to ask the Hacker News community a specific question, such as "What is the best online course you've ever taken?" Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll specifically compare these two types of posts to determine the following:

  - Do Ask HN or Show HN posts receive more comments on average?
  - Do posts created at a certain time receive more comments on average?
  - Do Ask HN or Show HN posts receive more points on average?
  - Do posts created at a certain time receive more points on average?
  

Note that the data set we're working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.
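
The reduction itself isn't shown in this notebook. The following is only a minimal sketch of how such a step could look, assuming the comment count sits at index 4 of each row; the function name `reduce_posts`, the `seed` value, and the toy rows are hypothetical and not part of the original data preparation.

```python
import random


def reduce_posts(rows, comments_index=4, sample_size=20000, seed=1):
    """Keep rows with at least one comment, then draw a random sample."""
    commented = [row for row in rows if int(row[comments_index]) > 0]
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    # Asking for more rows than exist would raise ValueError, so cap the size.
    return rng.sample(commented, min(sample_size, len(commented)))


# Hypothetical toy rows in the same column order as the real file.
toy_rows = [
    ['1', 'Ask HN: A', '', '5', '0', 'a', '8/4/2016 11:52'],
    ['2', 'Show HN: B', '', '9', '3', 'b', '8/4/2016 12:10'],
    ['3', 'Some story', '', '2', '1', 'c', '8/5/2016 9:30'],
    ['4', 'Ask HN: D', '', '7', '12', 'd', '8/6/2016 15:45'],
]
reduced = reduce_posts(toy_rows, sample_size=2)
print(len(reduced))
```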

# Introduction

In [11]:
# Read the Hacker News file. Assign the first row to header and the remaining rows to hn_lines
import csv 

with open('hacker_news.csv') as f:
    hn_lines = list(csv.reader(f))

header = hn_lines[0]
hn_lines = hn_lines[1:]   
print(header)
hn_lines[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]


# Extracting Ask HN and Show HN Posts

First, we'll identify posts that begin with either Ask HN or Show HN and separate the data for those two types of posts into different lists. Separating the data makes it easier to analyze in the following steps.


In [13]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn_lines:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
        
print("Total no of ask posts {} out of {} total posts".format(len(ask_posts), len(hn_lines)))
print("Total no of show posts {} out of {} total posts".format( len(show_posts), len(hn_lines)))
print("Total no of other posts {} out of {} total posts".format( len(other_posts), len(hn_lines)))
        


Total no of ask posts 1744 out of 20100 total posts
Total no of show posts 1162 out of 20100 total posts
Total no of other posts 17194 out of 20100 total posts
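
The same title check can be pulled into a small helper, which makes the matching rule easy to test on its own. This is a sketch only; the name `classify_post` is hypothetical and does not appear in the cell above.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


print(classify_post('Ask HN: What is the best online course?'))
print(classify_post('SHOW HN: My weekend project'))
print(classify_post('Why I ask HN for advice'))
```

Only the start of the title is checked, so a title that merely mentions "ask HN" later in the text is classified as `other`.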




# <font color='green'> Calculating the Average Number of Comments for Ask HN and Show HN Posts </font>

Now that we separated ask posts and show posts into different lists, we'll calculate the average number of comments each type of post receives.




In [15]:
# Calculate the average number of comments for ask posts
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
avg_ask_comments = total_ask_comments / len(ask_posts)

# Calculate the average number of comments for show posts
total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
avg_show_comments = total_show_comments / len(show_posts)


print("Average no of ask comments are ", avg_ask_comments)
print("Average no of show comments are ", avg_show_comments)

Average no of ask comments are  14.038417431192661
Average no of show comments are  10.31669535283993




On average, ask posts in our sample receive approximately 14 comments, whereas show posts receive approximately 10. Since ask posts receive more comments on average, we'll focus our remaining comment analysis on these posts.
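
Because the same total-then-divide pattern is needed for any numeric column, it can be wrapped in one function. The sketch below assumes rows shaped like those in `hn_lines`; the name `average_column` and the toy rows are hypothetical.

```python
def average_column(rows, index):
    """Average the integer values stored as strings in one column."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty list
    return sum(int(row[index]) for row in rows) / len(rows)


# Hypothetical toy rows; index 4 holds the comment count.
toy_posts = [
    ['1', 'Ask HN: first', '', '3', '10', 'a', '8/4/2016 15:05'],
    ['2', 'Ask HN: second', '', '5', '20', 'b', '8/5/2016 15:40'],
]
print(average_column(toy_posts, 4))
```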



# <font color='green'> Finding the Number of Ask Posts and Comments by Hour Created </font>

Next, we'll determine whether an ask post can receive more comments by being created at a certain time. First, we'll find the number of ask posts created during each hour of the day, along with the number of comments those posts received. Then, we'll calculate the average number of comments per ask post for each hour of the day.


In [45]:
import datetime as dt 

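comments_by_hour = {}  # hour -> total comments on ask posts created in that hour
counts_by_hour = {}    # hour -> number of ask posts created in that hour
for post in ask_posts:
    # created_at is the last column, e.g. '8/4/2016 11:52'
    post_date = dt.datetime.strptime(post[-1], '%m/%d/%Y %H:%M')
    post_hour = post_date.hour
    if post_hour not in comments_by_hour:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = int(post[-3])  # num_comments column
    else:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += int(post[-3])

# Average number of comments per ask post for each hour
avg_by_hour = {}
for hr in comments_by_hour:
    avg_by_hour[hr] = comments_by_hour[hr] / counts_by_hour[hr]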
   

print(comments_by_hour)
print(counts_by_hour)
print(avg_by_hour)
 

{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
{9: 5.5777777777777775, 13: 14.741176470588234, 10: 13.440677966101696, 14: 13.233644859813085, 16: 16.796296296296298, 23: 7.985294117647059, 12: 9.41095890410959, 17: 11.46, 15: 38.5948275862069, 21: 16.009174311926607, 20: 21.525, 2: 23.810344827586206, 18: 13.20183486238532, 3: 7.796296296296297, 5: 10.08695652173913, 19: 10.8, 1: 11.383333333333333, 22: 6.746478873239437, 8: 10.25, 4: 7.170212765957447, 0: 8.127272727272727, 6: 9.022727272727273, 7: 7.852941176470588, 11: 11.051724137931034}
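
The hour-by-hour averaging can also be written as one reusable function that works for any numeric column and any list of posts. This is a sketch under the assumption that the creation time is the last column and uses the same date format as above; the name `average_by_hour` and the toy rows are hypothetical.

```python
import datetime as dt
from collections import defaultdict


def average_by_hour(rows, value_index, date_index=-1,
                    date_format='%m/%d/%Y %H:%M'):
    """Map each creation hour to the average of one numeric column."""
    totals = defaultdict(int)  # hour -> summed values
    counts = defaultdict(int)  # hour -> number of rows
    for row in rows:
        hour = dt.datetime.strptime(row[date_index], date_format).hour
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


# Hypothetical toy rows: two posts at 15:xx and one at 02:xx.
toy_rows = [
    ['1', 'Ask HN: first', '', '3', '10', 'a', '8/4/2016 15:05'],
    ['2', 'Ask HN: second', '', '5', '20', 'b', '8/5/2016 15:40'],
    ['3', 'Ask HN: third', '', '7', '4', 'c', '8/6/2016 2:10'],
]
toy_averages = average_by_hour(toy_rows, value_index=4)
print(toy_averages)
```

Using `defaultdict(int)` removes the need for the "first time we see this hour" branch, and passing `value_index=3` would average points instead of comments.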


## Sort the values and print the top 5 hours with the highest average comments

In [49]:
avg_by_hours = sorted(avg_by_hour.items(), key = lambda kv:(kv[1],kv[0]),reverse = True)

for row in avg_by_hours[:5]:
    print("At hour {} average number of comments per post are {}".format(
        dt.datetime.strptime(str(row[0]),"%H").strftime("%H:%M") ,row[1]))
    

At hour 15:00 average number of comments per post are 38.5948275862069
At hour 02:00 average number of comments per post are 23.810344827586206
At hour 20:00 average number of comments per post are 21.525
At hour 16:00 average number of comments per post are 16.796296296296298
At hour 21:00 average number of comments per post are 16.009174311926607
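
Ranking and formatting can be done without round-tripping the hour through `strptime`. The sketch below sorts by average only (the cell above also breaks ties by hour) and formats the hour with an f-string; the name `top_hours` is hypothetical, and the sample dictionary holds rounded copies of a few values from the output above.

```python
from operator import itemgetter


def top_hours(averages, n=5):
    """Return the n hours with the highest average as ('HH:00', avg) pairs."""
    ranked = sorted(averages.items(), key=itemgetter(1), reverse=True)
    return [(f"{hour:02d}:00", round(avg, 2)) for hour, avg in ranked[:n]]


# Rounded copies of four entries from avg_by_hour, for illustration only.
sample_averages = {9: 5.58, 15: 38.59, 2: 23.81, 20: 21.52}
for label, avg in top_hours(sample_averages, n=3):
    print(f"{label}: {avg} comments per post on average")
```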




The hour that receives the most comments per post on average is 15:00, with an average of 38.59 comments per post. That is about 62% more than the hour with the second-highest average, 02:00, which averages 23.81 comments per post.
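
The gap between the top two hours can be checked directly from the averages printed above. The helper name `percent_increase` is hypothetical; the two inputs are copied from the output.

```python
def percent_increase(new, old):
    """Relative increase of new over old, expressed as a percentage."""
    return (new - old) / old * 100


# Averages for 15:00 and 02:00, copied from the printed results.
top_two_gap = percent_increase(38.5948275862069, 23.810344827586206)
print(round(top_two_gap, 1))
```

The result comes out at roughly 62%.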

According to the data set documentation, the timezone used is Eastern Time in the US. So, we could also write 15:00 as 3:00 pm ET.
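
For readers in other timezones, the hour can be converted with the standard library. The sketch below assumes fixed standard-time offsets (UTC-5 for Eastern, UTC-8 for Pacific) and so ignores daylight saving time, which would shift the results by an hour for part of the year; the constant names and `convert_hour` are hypothetical.

```python
import datetime as dt

# Fixed offsets are a simplifying assumption; they ignore daylight saving.
EASTERN_STANDARD = dt.timezone(dt.timedelta(hours=-5), 'EST')
PACIFIC_STANDARD = dt.timezone(dt.timedelta(hours=-8), 'PST')


def convert_hour(hour, source_tz, target_tz):
    """Convert a whole hour from one timezone to another as 'HH:MM'."""
    # The calendar date is arbitrary because the offsets are fixed.
    moment = dt.datetime(2016, 1, 26, hour, 0, tzinfo=source_tz)
    return moment.astimezone(target_tz).strftime('%H:%M')


utc_time = convert_hour(15, EASTERN_STANDARD, dt.timezone.utc)
pacific_time = convert_hour(15, EASTERN_STANDARD, PACIFIC_STANDARD)
print(utc_time, pacific_time)
```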




# <font color='green'>  Calculating the Average Number of Points for Ask HN and Show HN Posts  </font>


In [53]:
# Calculate the average number of points for ask posts
total_ask_points = 0
for post in ask_posts:
    total_ask_points += int(post[3])
avg_ask_points = total_ask_points / len(ask_posts)

# Calculate the average number of points for show posts
total_show_points = 0
for post in show_posts:
    total_show_points += int(post[3])
avg_show_points = total_show_points / len(show_posts)


print("Average no of ask points are ", avg_ask_points)
print("Average no of show points are ", avg_show_points)

Average no of ask points are  15.061926605504587
Average no of show points are  27.555077452667813


On average, ask posts in our sample receive approximately 15 points, whereas show posts receive approximately 28. Show posts therefore receive more points on average.

# <font color='green'> Finding the Number of Show Posts and Points by Hour Created </font>

Next, we'll determine whether a post can receive more points by being created at a certain time. First, we'll find the number of posts created during each hour of the day, along with the number of points those posts received. Then, we'll calculate the average number of points per post for each hour of the day. Note that the cell below iterates over `ask_posts`, so the averages it prints describe ask posts rather than show posts.



In [58]:
import datetime as dt 

total_points_by_hour = {}
points_counts_by_hour = {}
# Iterating over show_posts here instead would produce the show-post figures.
for post in ask_posts:
    post_date = dt.datetime.strptime(post[-1], '%m/%d/%Y %H:%M')
    post_hour = post_date.hour
    if post_hour not in total_points_by_hour:
        points_counts_by_hour[post_hour]= 1
        total_points_by_hour[post_hour] =int(post[3])
    else: 
        points_counts_by_hour[post_hour]+=1
        total_points_by_hour[post_hour]+=int(post[3])
        
avg_points_by_hour = {}
for hr in total_points_by_hour:
   
    avg_points_by_hour[hr]= total_points_by_hour[hr]/points_counts_by_hour[hr]
   

# print(total_points_by_hour)
# print(points_counts_by_hour)
print(avg_points_by_hour)

{9: 7.311111111111111, 13: 24.258823529411764, 10: 18.677966101694917, 14: 11.981308411214954, 16: 23.35185185185185, 23: 8.544117647058824, 12: 10.712328767123287, 17: 19.41, 15: 29.99137931034483, 21: 15.788990825688073, 20: 14.3875, 2: 13.672413793103448, 18: 15.972477064220184, 3: 6.925925925925926, 5: 12.0, 19: 13.754545454545454, 1: 11.666666666666666, 22: 7.197183098591549, 8: 10.729166666666666, 4: 8.27659574468085, 0: 8.2, 6: 13.431818181818182, 7: 10.617647058823529, 11: 14.224137931034482}


## Sort the values and print the top 5 hours with the highest average points

In [61]:
avg_points_by_hours = sorted(avg_points_by_hour.items(), key = lambda kv:(kv[1],kv[0]),reverse = True)

for row in avg_points_by_hours[:5]:
    print("At hour {} average number of points per post are {}".format(
        dt.datetime.strptime(str(row[0]),"%H").strftime("%H:%M") ,row[1]))

At hour 15:00 average number of points per post are 29.99137931034483
At hour 13:00 average number of points per post are 24.258823529411764
At hour 16:00 average number of points per post are 23.35185185185185
At hour 17:00 average number of points per post are 19.41
At hour 10:00 average number of points per post are 18.677966101694917



Among the ask posts analyzed above, the hour that receives the most points per post on average is 15:00, with an average of 29.99 points per post. That is about 24% more than the hour with the second-highest average, 13:00, which averages 24.26 points per post.

As noted earlier, the data set uses US Eastern Time, so 15:00 corresponds to 3:00 pm ET.

# Conclusion

In this project, we analyzed ask posts and show posts to determine which type of post and which posting time receive the most comments and points on average. Based on our analysis, to maximize the number of comments a post receives, we'd recommend that it be an ask post created between 15:00 and 16:00 (3:00 pm to 4:00 pm ET).

To maximize the number of points a post receives, we'd recommend that it be a show post, since show posts average about 28 points versus about 15 for ask posts. The by-hour point averages were computed from ask posts, where 15:00 to 16:00 (3:00 pm to 4:00 pm ET) was again the strongest hour; whether the same window holds for show posts was not measured here.

However, the data set we analyzed excluded posts without any comments. Given that, it's more accurate to say that, of the posts that received comments, ask posts received more comments on average, and ask posts created between 15:00 and 16:00 (3:00 pm to 4:00 pm ET) received the most comments on average.
