
### Exploring Hacker News Posts

In this project, we'll compare two different types of posts from Hacker News, a popular site where technology-related stories (or 'posts') are voted on and commented on. The two types of posts we'll explore begin with either Ask HN or Show HN.

Users submit Ask HN posts to ask the Hacker News community a specific question, such as "What is the best online course you've ever taken?" Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll specifically compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

It should be noted that the data set we're working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.
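
The reduction described above could look roughly like the sketch below. This is not the procedure actually used to build the data set: the rows, the seed, and the sample size are all made up for illustration, and only the column order is taken from the data set.

```python
import random

# Hypothetical rows in the same column order as the data set:
# id, title, url, num_points, num_comments, author, created_at
rows = [
    ["1", "Ask HN: Example question?", "", "5", "3", "alice", "1/1/2016 9:00"],
    ["2", "Show HN: Example project", "http://example.com", "2", "0", "bob", "1/2/2016 10:00"],
    ["3", "Some article", "http://example.org", "10", "7", "carol", "1/3/2016 11:00"],
    ["4", "Another article", "http://example.net", "1", "0", "dave", "1/4/2016 12:00"],
]

# Keep only submissions with at least one comment (num_comments is column 4).
commented = [row for row in rows if int(row[4]) > 0]

# Randomly sample from what remains; the seed and sample size are arbitrary.
random.seed(0)
sample = random.sample(commented, k=min(2, len(commented)))
print(len(commented), len(sample))
```

Because every post with zero comments is dropped before sampling, any average computed later describes only posts that received at least one comment.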

### Introduction

First, we'll read in the data and remove the headers.


In [1]:
# Open and read the CSV file as a list of lists
import csv

with open('hacker_news.csv') as f:  # the file is closed automatically when the block ends
    hn = list(csv.reader(f))  # convert the reader object into a list of lists

hn[:5]  # display the first five rows (the first is the header row)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

### Removing Headers from a List of Lists

In [2]:
headers = hn[0]  # store the header row
hn = hn[1:]  # remove the header row from the data
print(headers)  # display the header row
print()  # print an empty line for readability
print(hn[:5])  # display the first five rows to verify the header row was removed

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


We can see above that the data set contains the title of the posts, the number of comments for each post, and the date the post was created. Let's start by exploring the number of comments for each type of post.

### Extracting Ask HN and Show HN Posts

First, we'll identify posts that begin with either Ask HN or Show HN and separate the data for those two types of posts into different lists. Separating the data makes it easier to analyze in the following steps.


In [3]:
# Identify posts that begin with either `Ask HN` or `Show HN` and separate the data into different lists.
# Create three empty lists
ask_posts = []
show_posts = []
other_posts = []

for post in hn:  # loop through each row in hn
    title = post[1]  # the title is the second column
    if title.lower().startswith("ask hn"):  # lowercase first so the match is case-insensitive
        ask_posts.append(post)
    elif title.lower().startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)

# Check the number of posts in each list
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))


1744
1162
17194
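
To see why the titles are lowercased before the prefix check, here is a small sketch. The helper name `classify_title` and the example titles are hypothetical and not part of the data set.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()  # makes the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Made-up titles, chosen to exercise each branch.
examples = [
    "Ask HN: Best course?",
    "ASK HN: Shouting",
    "show hn: my app",
    "I want to ask HN something",
]
labels = [classify_title(title) for title in examples]
print(labels)  # ['ask', 'ask', 'show', 'other']
```

The last title contains "ask HN" but does not start with it, so it falls into the "other" group, which matches how the loop above treats such posts.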


### Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now that we've separated ask posts and show posts into different lists, we'll calculate the average number of comments each type of post receives.


In [4]:
# Calculate the average number of comments `Ask HN` posts receive:

total_ask_comments = 0

for post in ask_posts:  # iterate over the ask posts
    total_ask_comments += int(post[4])  # convert the comment count to an integer before summing

# Compute the average number of comments and display the result:
avg_ask_comments = total_ask_comments / len(ask_posts)
print("avg_ask_comments:", avg_ask_comments)

#===================

# Calculate the average number of comments `Show HN` posts receive:

total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])

avg_show_comments = total_show_comments / len(show_posts)
print("avg_show_comments:", avg_show_comments)


avg_ask_comments: 14.038417431192661
avg_show_comments: 10.31669535283993


On average, ask posts in our sample receive approximately 14 comments, whereas show posts receive approximately 10. Since ask posts receive more comments on average, we'll focus our remaining analysis just on these posts.
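
Since the same calculation is applied to both lists, it could be wrapped in a small helper. The sketch below assumes the comment count sits at index 4, as in this data set. The function name `average_comments` and the toy rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments across a list of post rows."""
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)  # raises ZeroDivisionError for an empty list

# Toy rows in the data set's column order, with invented values.
toy_posts = [
    ["1", "Ask HN: A?", "", "1", "10", "u1", "1/1/2016 9:00"],
    ["2", "Ask HN: B?", "", "1", "20", "u2", "1/1/2016 10:00"],
]
print(average_comments(toy_posts))  # 15.0
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops.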

### Finding the Number of Ask Posts and Comments by Hour Created

Next, we'll determine if we can maximize the number of comments an ask post receives by creating it at a certain time. First, we'll find the number of ask posts created during each hour of the day, along with the number of comments those posts received. Then, we'll calculate the average number of comments received by ask posts created at each hour of the day.


In [5]:
# Calculate the number of ask posts created during each hour of the day,
# and the number of comments those posts received.

import datetime as dt  # import the datetime module under the alias dt

result_list = []  # will hold [created_at, num_comments] pairs

for post in ask_posts:  # build result_list from the ask posts
    result_list.append(
        [post[6], int(post[4])]
    )
# first element: creation time (index 6); second element: number of comments (index 4)

comments_by_hour = {}  # maps each hour to the total number of comments on posts created in that hour
counts_by_hour = {}  # maps each hour to the number of posts created in that hour
date_format = "%m/%d/%Y %H:%M"  # date format used in the created_at column

# Total the ask post comments and count the ask posts for each hour of creation
for each_row in result_list:
    date = each_row[0]  # creation date and time, as a string
    comment = each_row[1]  # number of comments
    time = dt.datetime.strptime(date, date_format).strftime("%H")  # extract the two-digit hour
    if time in counts_by_hour:  # hour already seen
        comments_by_hour[time] += comment  # add this post's comments to the hour's total
        counts_by_hour[time] += 1  # increment the hour's post count
    else:  # first post seen for this hour
        comments_by_hour[time] = comment  # start the hour's comment total
        counts_by_hour[time] = 1  # start the hour's post count at 1

comments_by_hour  # display the total number of comments received by ask posts created in each hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
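
As an alternative to the if/else branches above, the same tallies can be kept with `collections.defaultdict`, which starts missing keys at zero. This is only a sketch on invented rows. The `toy_` names are hypothetical, so they do not overwrite the dictionaries built above.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"  # same format as the created_at column

# Invented [created_at, num_comments] pairs, shaped like result_list.
toy_results = [
    ["8/4/2016 11:52", 52],
    ["1/26/2016 19:30", 10],
    ["6/23/2016 11:05", 4],
]

toy_comments_by_hour = defaultdict(int)  # missing hours start at 0
toy_counts_by_hour = defaultdict(int)

for created_at, n_comments in toy_results:
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    toy_comments_by_hour[hour] += n_comments
    toy_counts_by_hour[hour] += 1

print(dict(toy_comments_by_hour))  # {'11': 56, '19': 10}
print(dict(toy_counts_by_hour))    # {'11': 2, '19': 1}
```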

### Calculating the Average Number of Comments for Ask HN Posts by Hour

In [6]:
# Calculate the average number of comments received by `Ask HN` posts created at each hour of the day.

avg_by_hour = []  # will hold [hour, average comments] pairs

for hr in comments_by_hour:  # iterate over the hour keys ('00' to '23')
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])  # total comments / number of posts

avg_by_hour  # display the resulting list of lists

[['11', 11.051724137931034],
 ['23', 7.985294117647059],
 ['10', 13.440677966101696],
 ['07', 7.852941176470588],
 ['05', 10.08695652173913],
 ['08', 10.25],
 ['15', 38.5948275862069],
 ['04', 7.170212765957447],
 ['19', 10.8],
 ['21', 16.009174311926607],
 ['17', 11.46],
 ['03', 7.796296296296297],
 ['00', 8.127272727272727],
 ['20', 21.525],
 ['06', 9.022727272727273],
 ['18', 13.20183486238532],
 ['14', 13.233644859813085],
 ['12', 9.41095890410959],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['02', 23.810344827586206],
 ['16', 16.796296296296298]]

### Sorting and Printing Values from a List of Lists

In [7]:
# Sort the avg_by_hour list so the hours with the highest averages come first.

swap_avg_by_hour = []  # will hold [average comments, hour] pairs

for row in avg_by_hour:  # iterate over the rows of avg_by_hour
    swap_avg_by_hour.append([row[1], row[0]])  # swap the elements so the average comes first

print(swap_avg_by_hour)  # display the swapped list

sorted_swap = sorted(swap_avg_by_hour, reverse=True)  # sort by average, highest first

sorted_swap  # display the average number of comments per Ask HN post by hour, from high to low



[[11.051724137931034, '11'], [7.985294117647059, '23'], [13.440677966101696, '10'], [7.852941176470588, '07'], [10.08695652173913, '05'], [10.25, '08'], [38.5948275862069, '15'], [7.170212765957447, '04'], [10.8, '19'], [16.009174311926607, '21'], [11.46, '17'], [7.796296296296297, '03'], [8.127272727272727, '00'], [21.525, '20'], [9.022727272727273, '06'], [13.20183486238532, '18'], [13.233644859813085, '14'], [9.41095890410959, '12'], [11.383333333333333, '01'], [6.746478873239437, '22'], [5.5777777777777775, '09'], [14.741176470588234, '13'], [23.810344827586206, '02'], [16.796296296296298, '16']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
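
The swap step is one way to sort by the second column. Another option, sketched below on invented values, is to pass a `key` function to `sorted()` so the rows keep their original `[hour, average]` order. The `toy_avg_by_hour` name and its values are hypothetical.

```python
# Invented [hour, average comments] pairs, shaped like avg_by_hour.
toy_avg_by_hour = [["11", 11.05], ["15", 38.59], ["02", 23.81]]

# Sort on the second element of each row, highest average first.
sorted_by_avg = sorted(toy_avg_by_hour, key=lambda row: row[1], reverse=True)
print(sorted_by_avg)  # [['15', 38.59], ['02', 23.81], ['11', 11.05]]
```

With this approach, a later loop would unpack each row as `for hr, avg in ...` rather than `for avg, hr in ...`.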

In [8]:
# Print the 5 hours with the highest average number of comments.

print("Top 5 Hours for 'Ask HN' Comments")

# Each row of sorted_swap is [average comments, two-digit hour string],
# so the loop unpacks the average into avg and the hour into hr.

for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg
        )
    )

# datetime.strptime() parses the hour string into a datetime object, and
# strftime("%H:%M") formats that object as hours and minutes, e.g. 15:00.
# In the str.format() template, the first pair of braces receives the formatted
# hour, and {:.2f} formats the average as a float with exactly two decimal places,
# giving lines such as "15:00: 38.59 average comments per post".

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The hour that receives the most comments per post on average is 15:00, with an average of 38.59 comments per post. There's about a 62% increase in the number of comments between the hours with the highest and second highest average number of comments.
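
The 62% figure can be checked directly from the two highest averages in the output above. The variable names in this short sketch are illustrative.

```python
highest = 38.5948275862069    # average for 15:00
second = 23.810344827586206   # average for 02:00

# Relative increase of the highest average over the second highest.
pct_increase = (highest - second) / second * 100
print(f"{pct_increase:.1f}%")  # 62.1%
```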

According to the data set documentation, the timezone used is Eastern Time in the US. So, we could also write 15:00 as 3:00 pm EST.
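
Readers in other time zones may want to convert the peak hour. The sketch below assumes a fixed UTC-5 offset for EST and ignores daylight saving time, so it is only approximate for posts created in summer months. The date used is arbitrary.

```python
import datetime as dt

# Assumption: Eastern Standard Time as a fixed UTC-5 offset (no daylight saving).
EST = dt.timezone(dt.timedelta(hours=-5), name="EST")

# An arbitrary winter date at the peak hour, 15:00 EST.
peak = dt.datetime(2016, 1, 15, 15, 0, tzinfo=EST)

# Convert to UTC; any other tzinfo object could be passed instead.
peak_utc = peak.astimezone(dt.timezone.utc)
print(peak_utc.strftime("%H:%M %Z"))  # 20:00 UTC
```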

### Conclusion

In this project, we analyzed ask posts and show posts to determine which type of post and which time of day receive the most comments on average. Based on our analysis, to maximize the number of comments a post receives, we'd recommend that the post be an ask post created between 15:00 and 16:00 (3:00 pm EST - 4:00 pm EST).

However, it should be noted that the data set we analyzed excluded posts without any comments. Given that, it's more accurate to say that, of the posts that received comments, ask posts received more comments on average, and ask posts created between 15:00 and 16:00 (3:00 pm EST - 4:00 pm EST) received the most comments on average.
