# Exploring Hacker News Posts

[Dataset and Documentation](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts)

This project uses the Hacker News Posts dataset available on Kaggle. 
The aim of this project is to: 
1) Analyze whether posts tagged with 'Ask HN' (posts asking the Hacker News community a specific question) or 'Show HN' (posts showing the Hacker News community a project, product, or something interesting) receive more comments on average

2) Analyze whether posts created at a certain time receive more comments on average

In [1]:
import csv

with open('hacker_news.csv') as file: 
    reader = csv.reader(file)
    hn = list(reader)
    
headers = hn[0]
hn = hn[1:]

In [2]:
# Print the headers
print(headers)
print('\n')

# Print the first six rows (indices 0 through 5)
for row in hn[:6]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


['10482257

In [3]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn: 
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else: 
        other_posts.append(row)
        
print(f"The length of posts starting with 'ask hn' is: {len(ask_posts)}.")
print(f"The length of posts starting with 'show hn' is: {len(show_posts)}.")
print(f"The length of other post types is: {len(other_posts)}.")
        
    
    

The length of posts starting with 'ask hn' is: 1744.
The length of posts starting with 'show hn' is: 1162.
The length of other post types is: 17194.


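As the counts above show, there are more 'Ask HN' posts (1744) than 'Show HN' posts (1162). Post counts alone do not show which type receives more comments on average; that requires dividing each group's total comments by its number of posts.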
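
The first aim calls for comparing average comments per post for the two post types. Here is a minimal sketch of that calculation. The helper name `average_comments` and the sample rows are hypothetical and only there so the snippet runs on its own; the rows follow the dataset's column order, with `num_comments` at index 4.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows for illustration only (not taken from the dataset).
# Column order: id, title, url, num_points, num_comments, author, created_at
sample_ask = [
    ['1', 'Ask HN: example question', '', '5', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: another question', '', '2', '10', 'user_b', '8/16/2016 10:05'],
]
sample_show = [
    ['3', 'Show HN: example project', 'http://example.com', '7', '4', 'user_c', '8/16/2016 11:00'],
]

avg_ask = average_comments(sample_ask)
avg_show = average_comments(sample_show)
print(f"Average comments on sample ask posts: {avg_ask:.2f}")
print(f"Average comments on sample show posts: {avg_show:.2f}")
```

With the real data, calling `average_comments(ask_posts)` and `average_comments(show_posts)` would give the two figures to compare.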

Now, let's determine whether ask posts created at a certain time are more likely to attract comments. 
1) Calculate the number of ask posts created in each hour of the day, along with the number of comments received. 

2) Calculate the average number of comments ask posts receive by hour created. 

In [4]:
import datetime as dt

# Pair each ask post's creation time (column 6) with its comment count (column 4)
result_list = []
for row in ask_posts: 
    result_list.append([row[6], int(row[4])])

# Example timestamp format: '8/16/2016 9:55'
counts_by_hour = {}    # hour -> number of ask posts created in that hour
comments_by_hour = {}  # hour -> total comments on those posts

for created_at, comments_no in result_list: 
    hour_check = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    if hour_check not in counts_by_hour: 
        counts_by_hour[hour_check] = 1
        comments_by_hour[hour_check] = comments_no
    else: 
        counts_by_hour[hour_check] += 1
        comments_by_hour[hour_check] += comments_no

print('Checking counts by hour: ')
print(counts_by_hour)
print('\n')
print('Checking comments by hour: ')
print(comments_by_hour)
    
    
    

Checking counts by hour: 
{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}


Checking comments by hour: 
{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
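
As a quick sanity check (a sketch, not part of the original notebook), the per-hour counts should add up to the number of ask posts found earlier, and the two totals together give the overall average for ask posts. The dictionaries below are copied from the output above so the snippet runs on its own.

```python
# Values copied from the printed output above
counts_by_hour = {9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73,
                  17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54,
                  5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44,
                  7: 34, 11: 58}
comments_by_hour = {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543,
                    12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381,
                    18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479,
                    8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}

total_posts = sum(counts_by_hour.values())
total_comments = sum(comments_by_hour.values())
overall_avg = total_comments / total_posts

print(f"Ask posts counted across all hours: {total_posts}")
print(f"Total comments on ask posts: {total_comments}")
print(f"Overall average comments per ask post: {overall_avg:.2f}")
```

The post total comes to 1744, matching the number of ask posts counted earlier, so no rows were dropped while grouping by hour.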


Creating a list of [hour, average] pairs that holds the average number of comments per post for ask posts created during each hour of the day:

In [5]:
avg_by_hour = []

for hour in comments_by_hour: 
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
    
avg_by_hour

[[9, 5.5777777777777775],
 [13, 14.741176470588234],
 [10, 13.440677966101696],
 [14, 13.233644859813085],
 [16, 16.796296296296298],
 [23, 7.985294117647059],
 [12, 9.41095890410959],
 [17, 11.46],
 [15, 38.5948275862069],
 [21, 16.009174311926607],
 [20, 21.525],
 [2, 23.810344827586206],
 [18, 13.20183486238532],
 [3, 7.796296296296297],
 [5, 10.08695652173913],
 [19, 10.8],
 [1, 11.383333333333333],
 [22, 6.746478873239437],
 [8, 10.25],
 [4, 7.170212765957447],
 [0, 8.127272727272727],
 [6, 9.022727272727273],
 [7, 7.852941176470588],
 [11, 11.051724137931034]]

In [7]:
swap_avg_by_hour = []

for val in avg_by_hour: 
    swap_avg_by_hour.append([val[1], val[0]])

print(swap_avg_by_hour)

[[5.5777777777777775, 9], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [16.796296296296298, 16], [7.985294117647059, 23], [9.41095890410959, 12], [11.46, 17], [38.5948275862069, 15], [16.009174311926607, 21], [21.525, 20], [23.810344827586206, 2], [13.20183486238532, 18], [7.796296296296297, 3], [10.08695652173913, 5], [10.8, 19], [11.383333333333333, 1], [6.746478873239437, 22], [10.25, 8], [7.170212765957447, 4], [8.127272727272727, 0], [9.022727272727273, 6], [7.852941176470588, 7], [11.051724137931034, 11]]


Sorting the swap_avg_by_hour list so that the highest average appears first:

In [8]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[38.5948275862069, 15],
 [23.810344827586206, 2],
 [21.525, 20],
 [16.796296296296298, 16],
 [16.009174311926607, 21],
 [14.741176470588234, 13],
 [13.440677966101696, 10],
 [13.233644859813085, 14],
 [13.20183486238532, 18],
 [11.46, 17],
 [11.383333333333333, 1],
 [11.051724137931034, 11],
 [10.8, 19],
 [10.25, 8],
 [10.08695652173913, 5],
 [9.41095890410959, 12],
 [9.022727272727273, 6],
 [8.127272727272727, 0],
 [7.985294117647059, 23],
 [7.852941176470588, 7],
 [7.796296296296297, 3],
 [7.170212765957447, 4],
 [6.746478873239437, 22],
 [5.5777777777777775, 9]]
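
Swapping the columns is one way to sort by average. As an alternative sketch, `sorted` accepts a `key` function, so the [hour, average] pairs can be sorted directly without building a swapped list. The three pairs below are taken from the `avg_by_hour` output above, rounded to two decimals for brevity; the variable names are illustrative.

```python
# A few [hour, average] pairs from avg_by_hour, rounded for brevity
avg_by_hour_sample = [[9, 5.58], [15, 38.59], [2, 23.81]]

# Sort on the second element (the average), highest first
sorted_by_avg = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print(f"{hour:02d}:00: {avg:.2f} average comments per post")
```

One difference worth noting: when two hours have exactly the same average, this version keeps them in their original order, while the swapped-list approach breaks the tie by hour.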

In [14]:
print("Top 5 Hours for Ask Posts Comments:")

for avg, hr in sorted_swap[:5]:
    # Format the hour as HH:MM and show the average with two decimals
    hour_label = dt.datetime.strptime(str(hr), '%H').strftime('%H:%M')
    print(f"{hour_label}: {avg:.2f} average comments per post")

Top 5 Hours for Ask Posts Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The timestamps in the dataset are in US Eastern Time. 

Pacific Time is three hours behind Eastern Time, so subtracting three hours (02:00 Eastern becomes 23:00 the previous day) gives the hours with the highest average comments per post in Pacific Time: 

12:00: 38.59 average comments per post
23:00: 23.81 average comments per post
17:00: 21.52 average comments per post
13:00: 16.80 average comments per post
18:00: 16.01 average comments per post

So the hours with the most comments per post, in 12-hour format and in order, are:
12:00 pm, 11:00 pm, 5:00 pm, 1:00 pm, and 6:00 pm Pacific Time.
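
The conversion above can be sketched in code as follows. The function names are illustrative, and the fixed three-hour offset assumes that Eastern and Pacific Time switch to and from daylight saving time together, which keeps the gap constant throughout the year.

```python
EASTERN_TO_PACIFIC_OFFSET = -3  # assumption: constant three-hour gap


def eastern_to_pacific_hour(hour):
    """Shift an hour of the day (0-23) from Eastern to Pacific Time."""
    # The modulo wraps early-morning hours back to the previous evening
    return (hour + EASTERN_TO_PACIFIC_OFFSET) % 24


def to_12_hour(hour):
    """Format an hour of the day (0-23) as a 12-hour clock label."""
    suffix = "am" if hour < 12 else "pm"
    display = hour % 12 or 12  # 0 and 12 both display as 12
    return f"{display}:00 {suffix}"


# Top five Eastern Time hours from the ranking above
top_eastern_hours = [15, 2, 20, 16, 21]
pacific_labels = [to_12_hour(eastern_to_pacific_hour(h)) for h in top_eastern_hours]
print(pacific_labels)
```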