# Exploring Hacker News Posts 

* Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

Below are descriptions of the columns: 
1. id: The unique identifier from Hacker News for the post
2. title: The title of the post
3. url: The URL that the post links to, if the post has a URL
4. num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
5. num_comments: The number of comments that were made on the post
6. author: The username of the person who submitted the post
7. created_at: The date and time at which the post was submitted
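
Because the analysis below indexes each row by position, it may help to see how the column names map to list indices. The following is a minimal sketch, assuming the columns appear in the file in the order listed above; the single data row is made up for illustration and is not taken from the dataset.

```python
from csv import reader
from io import StringIO

# A tiny stand-in for hacker_news.csv; the data row is invented for illustration.
sample_csv = StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,3,example_user,8/16/2016 9:55\n"
)

rows = list(reader(sample_csv))
headers, first_row = rows[0], rows[1]

# Map each column name to its position in a row.
column_index = {name: position for position, name in enumerate(headers)}

print(column_index["title"])         # 1
print(column_index["num_points"])    # 3
print(column_index["num_comments"])  # 4
print(first_row[-1])                 # 8/16/2016 9:55
```

Under that assumption, index 4 is num_comments and index -1 is created_at, which are the positions used in the cells that follow.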

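* We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question, and Show HN posts to show the community a project, product, or something else they find interesting. Below is one example of each: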

"Ask HN: How to improve my personal website?"

"Show HN: Something pointless I made"


My analysis revolves around answering two questions:

1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?


We start by importing reader from the csv module to read our file and convert it to a list of lists. We then separate the header row from the data for easier analysis.

In [None]:
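from csv import reader

# Read the whole CSV into a list of lists; the with block closes the file for us
with open('hacker_news.csv', encoding="utf8") as opened_file:
    list_file = list(reader(opened_file))

headers = list_file[0]  # column names
hn = list_file[1:]      # data rows without the header
print (headers)
print (len(hn))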

Now we want to separate the posts into three lists: those whose titles begin with Ask HN, those that begin with Show HN, and all others. To do that, we first create the three empty lists, then loop through the Hacker News list of lists and for every row we:
1. assign the title (index 1) to the post variable
2. convert it to lower case so that differences in capitalization don't matter
3. if the title starts with ask hn, append the row to the ask list
4. else if the title starts with show hn, append the row to the show list
5. else append the row to the other list

Now that we have our three lists, we check how long each one is to compare the volume of posts for each type.

In [None]:
ask_posts = []
show_posts = []
other_posts = []

for every_row in hn:
    post = every_row [1]
    post = post.lower()
    if post.startswith('ask hn'):
        ask_posts.append(every_row)
    elif post.startswith('show hn'):
        show_posts.append(every_row)
    else: 
        other_posts.append(every_row)
        
print ('There are {} posts that start with Ask HN.'.format(len(ask_posts)))
print ('\n')
print ('There are {} posts that start with Show HN.'.format(len(show_posts)))
print ('\n')
print ('There are {} other posts.'.format(len(other_posts)))
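
The prefix check can also be written as a small function, which makes it easy to try on individual titles. This is only a sketch: `post_type` is a hypothetical helper name, not part of the notebook, and the third title is made up to show that "ask hn" must appear at the start of the title to count.

```python
def post_type(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()  # ignore capitalization differences
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

print(post_type("Ask HN: How to improve my personal website?"))  # ask
print(post_type("SHOW HN: Something pointless I made"))          # show
print(post_type("Why I ask HN readers for feedback"))            # other
```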

Now we want the average number of comments per post type to see which one gets more comments.
We start by creating the variables total_ask_comments and total_show_comments and setting both to 0.
We then iterate over the ask_posts list, convert the value at index 4 (num_comments) to an integer, and add it to the running total. We do the same for show_posts, then divide each total by the number of posts in its list.


In [None]:
total_ask_comments = 0
total_show_comments = 0

for every_row in ask_posts:
    num_ask_comments = int(every_row[4])
    total_ask_comments += num_ask_comments 
    
for every_row in show_posts:
    num_show_comments = int(every_row[4])
    total_show_comments += num_show_comments 
    
avg_ask_comments = round(total_ask_comments / len(ask_posts), 2)
avg_show_comments = round(total_show_comments / len(show_posts), 2)

print ('Ask HN posts have an average of {} comments per post.'.format(avg_ask_comments))
print ('Show HN posts have an average of {} comments per post.'.format(avg_show_comments))
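
Since the same calculation runs twice, it could be wrapped in a function. The sketch below assumes the comment count sits at index 4, as in the cell above; `average_comments` is a hypothetical helper and the sample rows are invented. It also returns 0.0 for an empty list, a choice made here to avoid a division by zero.

```python
def average_comments(rows, comments_index=4):
    """Mean of the integer values at comments_index, rounded to 2 places."""
    if not rows:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(row[comments_index]) for row in rows)
    return round(total / len(rows), 2)

# Made-up rows in the dataset's column order.
sample_rows = [
    ["1", "Ask HN: A", "", "2", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: B", "", "4", "29", "user_b", "8/16/2016 10:05"],
    ["3", "Ask HN: C", "", "1", "1", "user_c", "8/17/2016 14:30"],
]

print(average_comments(sample_rows))  # 12.0
print(average_comments([]))           # 0.0
```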

As you can see, Ask HN posts receive on average 4 more comments than Show HN posts.
For that reason, we will only deal with Ask HN posts in the rest of the comment analysis.
Next, we're going to see which hour of the day gets the most comments.
1. Calculate the number of Ask HN posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments Ask HN posts receive by hour created.

First we import the datetime module as dt for easier use.
* To isolate the creation time and number of comments for each post, and then tally them by hour:
1. loop through the ask_posts list
2. for every row, convert the value at index 4 to an integer and assign it to the number of comments
3. assign the value at index -1 (created_at) to the date
4. append the date and number of comments to result_list
5. create two empty dictionaries, counts_by_hour and comments_by_hour
6. loop through result_list and for every row
7. parse the date string into a datetime object using the .strptime method
8. convert the datetime object to an hour string using the .strftime method
9. if the hour is not yet in counts_by_hour, set its count to 1 and its comments to the number of comments
10. else, increment the count by 1 and the comments by the number of comments

In [None]:
import datetime as dt

result_list = []
 
for every_row in ask_posts:
    num_comments = int(every_row[4])
    date = every_row[-1]
    result_list.append([date, num_comments])
    
    
counts_by_hour = {}
comments_by_hour = {}

for every_row in result_list:
    time = dt.datetime.strptime(every_row[0], '%m/%d/%Y %H:%M')
    hour = time.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = every_row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += every_row[1]
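
To make the parsing step concrete, here is a minimal sketch that runs strptime and strftime on a single timestamp. The timestamp is a made-up value written in the same format string used in the cell above.

```python
import datetime as dt

sample_timestamp = "8/16/2016 9:55"  # made-up value in the dataset's format

# strptime turns the string into a datetime object
parsed = dt.datetime.strptime(sample_timestamp, "%m/%d/%Y %H:%M")
# strftime turns the datetime back into a string, here just the zero-padded hour
hour = parsed.strftime("%H")

print(parsed)  # 2016-08-16 09:55:00
print(hour)    # 09
```

Note that the hour comes back as a zero-padded string ('09', not 9), which is why the dictionary keys above are strings.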


Now we create a new empty list, avg_by_hour. We loop through the comments_by_hour dictionary, and for every hour we append a two-element list holding the average number of comments for that hour and the hour itself.
The result is a list of lists.

In [34]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([round(comments_by_hour[hour] / counts_by_hour[hour], 2), hour])
    
avg_by_hour

[[10.09, '05'],
 [7.17, '04'],
 [16.01, '21'],
 [7.85, '07'],
 [13.44, '10'],
 [16.8, '16'],
 [9.41, '12'],
 [14.74, '13'],
 [38.59, '15'],
 [8.13, '00'],
 [10.25, '08'],
 [6.75, '22'],
 [9.02, '06'],
 [13.2, '18'],
 [13.23, '14'],
 [11.46, '17'],
 [5.58, '09'],
 [10.8, '19'],
 [11.38, '01'],
 [11.05, '11'],
 [21.52, '20'],
 [23.81, '02'],
 [7.8, '03'],
 [7.99, '23']]

Now we sort that list in descending order to see which hours received the most comments on average, and print the top 5 times to post. The times are in Eastern Time; since we live in Central Time, we subtract an hour from each to get our local time.

In [None]:
sorted_avg = sorted(avg_by_hour, reverse = True)

print ('Top 5 Hours for Ask HN Posts Comments')


for every_row in sorted_avg[:5]:  # keep only the five hours with the highest averages
    hour = dt.datetime.strptime(every_row[1], "%H")
    hour = hour.strftime('%H:%M')
    avg_comments = every_row[0]
    print ('{}: {} average comments per post'.format(hour, avg_comments))
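
The Eastern-to-Central adjustment mentioned above could be sketched as follows. This assumes a fixed one-hour offset between the two zones; `eastern_to_central` is a hypothetical helper, not part of the notebook. The modulo keeps midnight from turning into hour -1.

```python
def eastern_to_central(hour_string):
    """Shift a zero-padded Eastern hour string back one hour, wrapping at midnight."""
    central_hour = (int(hour_string) - 1) % 24
    return "{:02d}:00".format(central_hour)

print(eastern_to_central("15"))  # 14:00
print(eastern_to_central("00"))  # 23:00
```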
    

Now we can figure out whether Ask HN or Show HN posts receive more points on average.

In [None]:
total_ask_points = 0
total_show_points = 0

for every_row in ask_posts:
    num_points = every_row[3]
    total_ask_points += int(num_points)

for every_row in show_posts:
    num_points = every_row[3]
    total_show_points += int(num_points)

avg_ask_points = round(total_ask_points / len (ask_posts),2) 

avg_show_points = round(total_show_points / len (show_posts),2) 

print ("Ask HN received {} points on average".format(avg_ask_points))
print ('\n')
print ("Show HN received {} points on average".format(avg_show_points))
print (avg_ask_points - avg_show_points)

We can see that Show HN posts received 12.5 more points on average than Ask HN posts.

Next we determine whether posts created at a certain time are more likely to receive more points. Since Show HN posts received more points on average, I'll only be using those.

In [None]:
show_points_time = {}

for every_row in show_posts:
    num_points = int(every_row[3])
    date = every_row[-1]
    created = dt.datetime.strptime(date,"%m/%d/%Y %H:%M")  # avoid reusing the name "datetime"
    hour = created.strftime('%H')
    if hour not in show_points_time:
        show_points_time[hour] = num_points
    else:
        show_points_time[hour] += num_points
show_points_time

In [None]:
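# Count Show HN posts per hour. counts_by_hour holds Ask HN counts,
# so it would give the wrong denominator for Show HN averages.
show_counts_by_hour = {}
for every_row in show_posts:
    hour = dt.datetime.strptime(every_row[-1], "%m/%d/%Y %H:%M").strftime('%H')
    show_counts_by_hour[hour] = show_counts_by_hour.get(hour, 0) + 1

sorted_list = []

for every_key in show_points_time:
    hour = every_key
    points = show_points_time[every_key]
    sorted_list.append([round(points / show_counts_by_hour[every_key],2), hour])
    
sorted_list = sorted(sorted_list, reverse = True)
sorted_list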
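
One way to keep totals and counts from drifting apart is to compute both from the same list of rows in a single function. The sketch below is an assumption about how that could look, not the notebook's own method; `average_by_hour` is a hypothetical helper and the sample rows are invented.

```python
import datetime as dt

def average_by_hour(rows, value_index, date_index=-1):
    """Return [[average, hour], ...] sorted from highest to lowest average."""
    totals = {}
    counts = {}
    for row in rows:
        parsed = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        hour = parsed.strftime("%H")
        # Totals and counts come from the same rows, so they always match.
        totals[hour] = totals.get(hour, 0) + int(row[value_index])
        counts[hour] = counts.get(hour, 0) + 1
    averages = [[round(totals[hour] / counts[hour], 2), hour] for hour in totals]
    return sorted(averages, reverse=True)

# Made-up rows in the dataset's column order.
sample_rows = [
    ["1", "Show HN: A", "", "10", "0", "user_a", "8/16/2016 9:55"],
    ["2", "Show HN: B", "", "20", "0", "user_b", "8/17/2016 9:05"],
    ["3", "Show HN: C", "", "6", "0", "user_c", "8/17/2016 14:30"],
]

print(average_by_hour(sample_rows, value_index=3))  # [[15.0, '09'], [6.0, '14']]
```

Passing value_index=3 averages points and value_index=4 averages comments, so the same function would serve both parts of the analysis.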

By my analysis, Show HN posts created at 12 pm, 1 pm, 10 pm, 11 am, and 5 pm receive the most points on average.

By the earlier analysis, Ask HN posts created at 3 pm, 2 am, 8 pm, 4 pm, and 9 pm (Eastern Time) receive the most comments on average.