# Exploring Hacker News Posts

## Introduction
In this project, I work with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/), a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories ("posts") are voted and commented on, similar to Reddit. It is extremely popular in tech and startup circles.

We use this [dataset](https://www.kaggle.com/hacker-news/hacker-news-posts), which has approximately 20,000 rows and 7 columns. The columns are:
* id: The unique identifier from Hacker News for the post
* title: The title of the post
* url: The URL that the post links to, if the post has a URL
* num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* num_comments: The number of comments that were made on the post
* author: The username of the person who submitted the post
* created_at: The date and time at which the post was submitted
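
The column names above can also be used directly when reading the file. The following is a minimal sketch, not the approach used in the rest of this notebook: it assumes the CSV starts with the header row listed above, and `sample` is a hypothetical in-memory stand-in for `hacker_news.csv` holding one record from the data set.

```python
import csv
import io

# Hypothetical stand-in for hacker_news.csv: the header plus one record.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

# DictReader uses the header row as keys, so columns are accessed by name
# instead of by position.
rows = list(csv.DictReader(sample))
first = rows[0]
print(first["title"], int(first["num_points"]), int(first["num_comments"]))
```

Values are still read as strings, so numeric columns such as `num_points` and `num_comments` need an explicit `int()` conversion.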

## Objective

We are specifically interested in posts whose titles begin with **Ask HN** or **Show HN**. Both types of posts invite a response from the Hacker News community, so our goal is to compare them and determine:
* Do **Ask HN** or **Show HN** posts receive more comments (and thus engagement) on average?
* Do posts created at a *certain time* receive more comments on average?






### Conclusion

After analyzing the data, we can say first that **Ask HN** posts receive more comments on average than **Show HN** posts. Second, the time of posting has an impact on the average number of comments per post. Within the **Ask HN** category, posts created at 15:00 (3:00 pm ET) receive the highest average number of comments, and thus the most engagement.

## Exploring the data

In [1]:
#First let's read the hacker_news.csv file and display the first five rows

from csv import reader

# Use a context manager so the file is closed once it has been read
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

for row in hn[:5]:
    print(row)
    print("\n")

print('The data set has ' + str(len(hn)) + ' rows.')
print('Each row has ' + str(len(hn[0]))+ ' columns.')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


The data set has 20101 rows.
Each row has 7 columns.


In [2]:
#Next let's remove the headers to properly analyze our data
headers = hn[0]
hn = hn[1:]

print(f"Header: {headers}")
print("\n")
print(f"First 5 Rows without Header : {hn[0:5]}")

Header: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


First 5 Rows without Header : [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbe

Now that we have removed the headers from our dataset, let's filter the data. Since we are only concerned with post titles beginning with **Ask HN** or **Show HN**, we'll create new lists containing just the rows with those titles.

To find those posts, we'll use the string method *startswith* to check whether a title begins with either prefix, together with the *lower* method to account for case variations (lowercase or uppercase).
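
As a quick illustration of how *lower* and *startswith* combine, here is a small sketch. The `classify` helper and the lowercase title are hypothetical and used only for this example; the other titles appear in the data set.

```python
# Titles used only for illustration; the second one is hypothetical.
titles = [
    "Ask HN: How to improve my personal website?",
    "ask hn: any tips?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]


def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # normalize case before comparing
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


labels = [classify(t) for t in titles]
print(labels)  # ['ask', 'ask', 'show', 'other']
```

Lowercasing first means that "Ask HN", "ask hn", and "ASK HN" are all treated the same way.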

## Data Analysis

In [3]:
#First I will create three empty lists
ask_posts = []
show_posts = []
other_posts = []

#Then we will loop through each row in our dataset
#to separate the different posts

for row in hn:
    
    title = row[1]
    
    #to account for variance in case-use, let's convert
    #the titles to lowercase for consistency when checking
    
    title = title.lower() 
    
    #Finally we will separate the posts into our three different lists
    #for analysis
    
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        

Printing a sample of each new list confirms that the split worked.

In [4]:
print("First Five Rows of Ask Posts")
print(ask_posts[0:5])
print("\n")
print(f"The list is {len(ask_posts):,} rows long")
print("\n")
print("First Five Rows of Show Posts")
print(show_posts[0:5])
print("\n")
print(f"The list is {len(show_posts):,} rows long")
print("\n")
print("First Five Rows of Other Posts")
print(other_posts[0:5])
print("\n")
print(f"The list is {len(other_posts):,} rows long")

First Five Rows of Ask Posts
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


The list is 1,744 rows long


First Five Rows of Show Posts
[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show 

### Next, let's determine if ask posts or show posts receive more comments on average.

In [5]:
#First let's find the total number of comments in ask posts 
#and assign it to a variable

total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
print(f"Total number of ask comments is {total_ask_comments:,}")

#Then compute the average number of comments on ask posts

avg_ask_comments = total_ask_comments / len(ask_posts)
print(f"{avg_ask_comments:.2f} comments per post")
    

Total number of ask comments is 24,483
14.04 comments per post


Across all Ask HN posts, each post receives about 14 comments on average.

In [6]:
#Now we will follow the same process to calculate the number of
# show comments
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

print(f"Total number of show comments is {total_show_comments:,}")

#Then compute average number of comments on show posts
avg_show_comments = total_show_comments / len(show_posts)
print(f"{avg_show_comments:.2f} comments per post")

Total number of show comments is 11,988
10.32 comments per post


Across all Show HN posts, each post receives about 10 comments on average.

|Post Type|Total Comments|Average Comments|
|---------|--------------|----------------|
|Ask HN   | 24,483       | 14.04          |
|Show HN  | 11,988       | 10.32          |
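
The two cells above repeat the same loop with a different list. One possible way to avoid the duplication is a small helper; this is only a sketch, and `total_and_average` is a hypothetical name not used elsewhere in this notebook. The sample rows are the first three ask posts printed earlier.

```python
def total_and_average(rows, column_index):
    """Sum an integer column and return (total, average) for a list of rows."""
    total = sum(int(row[column_index]) for row in rows)
    average = total / len(rows) if rows else 0.0  # avoid dividing by zero
    return total, average


# First three ask posts from the sample output; index 4 is num_comments.
sample_rows = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]

total, average = total_and_average(sample_rows, 4)
print(f"{total} comments, {average:.2f} per post")
```

Passing index 3 instead of 4 would give the same statistics for `num_points`.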

Comparing the number of comments on Ask HN posts and Show HN posts, and thus the level of engagement, **Ask HN posts receive, on average, more comments per post and therefore a higher level of engagement.**

This isn't too surprising given the purpose and format of each type of post. Ask HN posts come from a place of seeking knowledge, advice, or critique, and are made for exchanging ideas. They actively seek engagement.

Show HN posts are a bit more passive. They work as a kind of show and tell, where posters share news, articles, and projects. While they may not seek advice or engagement outright, that doesn't preclude users from engaging with them to further the conversation.

Regardless, since **ask posts** are more likely to receive comments, we'll focus the remaining analysis on just these posts.

We will now determine whether ask posts created at a certain *time* are more likely to attract comments. The analysis has two steps:
* Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
* Calculate the average number of comments ask posts receive by hour created.
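
Both steps can be sketched on a handful of rows before running them on the full list. This is a compact illustration using `collections.defaultdict`, not the exact code used in the cells that follow; the last pair is a hypothetical row added so that one hour contains two posts.

```python
from collections import defaultdict
import datetime as dt

# (created_at, num_comments) pairs from the ask posts printed earlier.
pairs = [
    ("8/16/2016 9:55", 6),
    ("11/22/2015 13:43", 29),
    ("5/2/2016 10:14", 1),
    ("1/1/2016 9:10", 4),  # hypothetical row, not from the data set
]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # comments received by posts from each hour

for created_at, num_comments in pairs:
    # Parse the timestamp and keep only the zero-padded hour, e.g. '09'.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

# Step two: average comments per post for each hour.
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```

Using `defaultdict(int)` removes the need for the `if hour in ...` check, since missing keys start at zero.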

In [7]:
#First I will tackle calculating the amount of ask posts and comments
#by the hour they were created

import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

In [8]:
print(result_list[0:5])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]


### Count posts and comments by hour in separate dictionaries

Next I will create two dictionaries to store the following data:
* Counts by Hour: Will contain the number of ask posts created during each hour of the day
* Comments by Hour: Will contain the corresponding number of comments ask posts created at each hour received 

In [9]:
counts_by_hour = {}
comments_by_hour = {}

date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    
    comments = row[1]
    time = row[0]
    time = dt.datetime.strptime(time, date_format).strftime('%H')
    
    if time in counts_by_hour:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comments
        
    else:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comments

print(f"Counts by Hour : {counts_by_hour}")
print("\n")
print(f"Comments by Hour : {comments_by_hour}")

Counts by Hour : {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


Comments by Hour : {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


### Show the average number of comments by hour

Now, let's calculate the average number of comments for posts created during each hour of the day. I will do this by creating a list of lists containing the hours during which posts were created and the average number of comments those posts received.

In [10]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

print(avg_by_hour)

#The below list is a list of lists in which the first element is the hour
# and the second element is the average number of comments per post.


[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


Now that we have the average number of comments for posts created during each hour of the day, let's finish by sorting the lists of lists and printing the five highest performing hours.

In [11]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)
    

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


Now that I have created a list with the columns swapped, I will sort the list in descending order by the average number of comments.
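
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` also accepts a `key` function, which sorts the original `[hour, average]` pairs directly. The sample values below are copied from `avg_by_hour`; `avg_by_hour_sample` is a name introduced only for this example.

```python
# A few [hour, average] pairs copied from avg_by_hour.
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort by the second element (the average), highest first.
top = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in top[:2]:
    print(f"{hour}:00 {average:.2f} average comments per post")
```

With this approach the hour stays in the first position, so no second list is needed.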

### Show the top 5 hours 

In [12]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments :", sorted_swap[:5])

Top 5 Hours for Ask Posts Comments : [[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]


### Top 5 Hours for Ask Posts Comments

In [13]:
for row in sorted_swap[:5]:
    average = row[0]
    time = row[1]
    # Use 24-hour time only; combining %H with %p prints values like "15:00 PM"
    time = dt.datetime.strptime(time, '%H').strftime("%H:%M")
    string = "{}: {:.2f} average comments per post".format(time, average)
    print(string)

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


According to the data set documentation, the timestamps are in US Eastern Time (ET). 3 pm ET is the best time to create an Ask HN post if the goal is the highest average number of comments. It is followed by 2 am ET, possibly because late-night hours are popular in the tech and hacker community, and then by 8 pm, 4 pm, and 9 pm. Engagement appears to be highest in the afternoon lull and in the later hours, presumably after the regular work day.
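
Readers in other time zones may want the same hours in local time. The sketch below shifts the top hours by a fixed offset; `shift_hour` and `ET_TO_PT` are hypothetical names, and the three-hour difference between Eastern and Pacific time is an assumption that ignores the brief windows around daylight saving transitions.

```python
def shift_hour(hour, offset):
    """Shift a two-digit hour string by `offset` hours, wrapping at midnight."""
    return f"{(int(hour) + offset) % 24:02d}"


ET_TO_PT = -3  # assumption: Pacific time is three hours behind Eastern time

top_hours_et = ['15', '02', '20', '16', '21']
top_hours_pt = [shift_hour(hour, ET_TO_PT) for hour in top_hours_et]
print(top_hours_pt)  # ['12', '23', '17', '13', '18']
```

Note that 02:00 ET wraps around to 23:00 PT on the previous day.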

### Determine average points per post

In [14]:
# calculate total and average ask_post points
#Column index for points is i = 3

total_ask_points = 0 

for row in ask_posts:
    num_points = int(row[3])
    total_ask_points += num_points
    
print(f"Total number of ask post points is {total_ask_points:,}")

#Then compute the average number of points on ask posts

avg_ask_points = total_ask_points / len(ask_posts)
print(f"{avg_ask_points:.2f} points per post")


Total number of ask post points is 26,268
15.06 points per post


In [15]:
# calculate total and average show_post points
total_show_points = 0 

for row in show_posts:
    num_points = int(row[3])
    total_show_points += num_points
    
print(f"Total number of show post points is {total_show_points:,}")

#Then compute the average number of points on show posts

avg_show_points = total_show_points / len(show_posts)
print(f"{avg_show_points:.2f} points per post")

Total number of show post points is 32,019
27.56 points per post


Recall that points are the total number of upvotes minus the total number of downvotes. Show HN posts receive a higher number of points on average than Ask HN posts, which suggests that readers tend to respond favorably to them.

|Post Type|Total Points|Average Points|
|---------|------------|--------------|
|Ask HN   | 26,268     | 15.06|
|Show HN | 32,019       | 27.56|

### Count Show HN posts and points by hour in separate dictionaries
Since Show HN posts earn more points on average, we will explore how their points vary by the hour of posting.

In [18]:
#First I will tackle calculating the amount of show posts and points
#by the hour they were created

result_list_point = []

for row in show_posts:
    created_at = row[6]
    num_points = int(row[3])
    result_list_point.append([created_at, num_points])
    
points_by_hour = {}
p_counts_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list_point:
    
    points = row[1]
    time = row[0]
    time = dt.datetime.strptime(time, date_format).strftime('%H')
    
    if time in p_counts_by_hour:
        p_counts_by_hour[time] += 1
        points_by_hour[time] += points
        
    else:
        p_counts_by_hour[time] = 1
        points_by_hour[time] = points

print(f"Post Counts by Hour : {p_counts_by_hour}")
print("\n")
print(f"Points by Hour : {points_by_hour}")
    

Post Counts by Hour : {'14': 86, '22': 46, '18': 61, '07': 26, '20': 60, '05': 19, '16': 93, '19': 55, '15': 78, '03': 27, '17': 93, '06': 16, '02': 30, '13': 99, '08': 34, '21': 47, '04': 26, '11': 44, '12': 61, '23': 36, '09': 30, '01': 28, '10': 36, '00': 31}


Points by Hour : {'14': 2187, '22': 1856, '18': 2215, '07': 494, '20': 1819, '05': 104, '16': 2634, '19': 1702, '15': 2228, '03': 679, '17': 2521, '06': 375, '02': 340, '13': 2438, '08': 519, '21': 866, '04': 386, '11': 1480, '12': 2543, '23': 1526, '09': 553, '01': 700, '10': 681, '00': 1173}


### Average Points By Hour

In [20]:
# Create an empty list for average points by hour

avg_by_hour = []

for hour in points_by_hour:
    avg_by_hour.append([hour, points_by_hour[hour]/p_counts_by_hour[hour]])

print(avg_by_hour)

#The list below is a list of lists in which the first element is the hour
# and the second element is the average number of points per post.
    

[['14', 25.430232558139537], ['22', 40.34782608695652], ['18', 36.31147540983606], ['07', 19.0], ['20', 30.316666666666666], ['05', 5.473684210526316], ['16', 28.322580645161292], ['19', 30.945454545454545], ['15', 28.564102564102566], ['03', 25.14814814814815], ['17', 27.107526881720432], ['06', 23.4375], ['02', 11.333333333333334], ['13', 24.626262626262626], ['08', 15.264705882352942], ['21', 18.425531914893618], ['04', 14.846153846153847], ['11', 33.63636363636363], ['12', 41.68852459016394], ['23', 42.388888888888886], ['09', 18.433333333333334], ['01', 25.0], ['10', 18.916666666666668], ['00', 37.83870967741935]]


In [25]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Show Posts Points :", sorted_swap[:5])


Top 5 Hours for Show Posts Points : [[42.388888888888886, '23'], [41.68852459016394, '12'], [40.34782608695652, '22'], [37.83870967741935, '00'], [36.31147540983606, '18']]


The hour with the highest average number of points is 23:00 (11 pm ET).

### Top 5 Hours for Show Posts Points

In [29]:
print("Top 5 Hours for Points on 'Show HN' posts")
for row in sorted_swap[:5]:
    average = row[0]
    time = row[1]
    # Use 24-hour time only; combining %H with %p prints values like "23:00 PM"
    time = dt.datetime.strptime(time, '%H').strftime("%H:%M")
    string = "{}: {:.2f} average points per post".format(time, average)
    print(string)

Top 5 Hours for Points on 'Show HN' posts
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post


The top 5 hours for average points on show posts are 11 pm, 12 pm, 10 pm, 12 am, and 6 pm ET.

## Result Summary

This project analyzed ask posts and show posts on the Hacker News platform to determine which type of post and which time period received the most comments and points, on average.

Based on this analysis, to improve the chance of receiving comments and engagement from the community, we recommend that users title their post with 'Ask HN' and create it between 15:00 and 16:00 ET (12:00 pm to 1:00 pm PT).

Furthermore, when looking at both types of posts based on points or upvotes received, the data shows that there are more points on average for show posts than there are for ask posts. This suggests that while 'ask hn' posts are more likely to receive more comments from the community, 'show hn' posts tend to receive more points on average. 

Summary as follows:

|Post Type|Total Points| Avg Points | Total Comments| Avg Comments|
|---------|------------|------------|---------------|-------------|
|Ask HN   | 26,268     | 15.06      |     24,483    |      14.04  |
| Show HN | 32,019     | 27.56      |      11,988   |   10.32     |