## Analyzing Hacker News Postings 
#hacker news dataset project; Comparing two types of posts on Hacker News site, Ask HN or Show HN. 
I will be comparing these types of posts to see if whether (1) Ask HN or Show HN receive more comments on average? and (2) to see if posts during a certain time receive more comments? 

*dataset was cleaned to remove posts that did not receive any comments 


# Importing dataset 

In [49]:
import pandas as pd 
import csv
f = open('hacker_news.csv')
hn = list(csv.reader(f))
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

# Removing Headers

In [50]:
#assigning headers to var headers
headers = hn[0]
#removing headers from list of lists 
hn = hn[1:]
print(headers)
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

# Parsing posts into Ask_Posts/Show_Posts/Other_Posts

In [51]:
#creating empty list to store values given below criteria
ask_posts = [] 
show_posts = []
other_posts = []

for row in hn:
    #assigning title column index to var title
    title = row[1]
    #checks if the title is lowercase and starts with ___
    if title.lower().startswith("ash hn"):
        ask_posts.append(row) 
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))        

2
1162
18936


In [52]:
total_ask_comments = 0 

for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

3.0


In [53]:
total_show_comments = 0 
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


In [54]:
import datetime as dt 
result_list = []
for post in show_posts:
    result_list.append([post[6], int(post[4])])
#creating empty dict for storing num comments per hr and 
#num comments per ask_posts per hr
comments_by_hour = {}
counts_by_hour = {}
#specifing template for date
date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    #extracting the hr from the date field
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1

comments_by_hour

{'00': 487,
 '01': 246,
 '02': 127,
 '03': 287,
 '04': 247,
 '05': 58,
 '06': 142,
 '07': 299,
 '08': 165,
 '09': 291,
 '10': 297,
 '11': 491,
 '12': 720,
 '13': 946,
 '14': 1156,
 '15': 632,
 '16': 1084,
 '17': 911,
 '18': 962,
 '19': 539,
 '20': 612,
 '21': 272,
 '22': 570,
 '23': 447}

In [55]:
#create list of lists, avg_by_hour
avg_by_hour = []
for hr in comments_by_hour:
    #setting hr to key in dict and value as avg num of comments per post
    avg_by_hour.append([hr, comments_by_hour[hr]/
                        counts_by_hour[hr]])
avg_by_hour

[['23', 12.416666666666666],
 ['01', 8.785714285714286],
 ['20', 10.2],
 ['10', 8.25],
 ['09', 9.7],
 ['02', 4.233333333333333],
 ['12', 11.80327868852459],
 ['15', 8.102564102564102],
 ['11', 11.159090909090908],
 ['05', 3.0526315789473686],
 ['18', 15.770491803278688],
 ['04', 9.5],
 ['14', 13.44186046511628],
 ['00', 15.709677419354838],
 ['07', 11.5],
 ['19', 9.8],
 ['08', 4.852941176470588],
 ['03', 10.62962962962963],
 ['13', 9.555555555555555],
 ['16', 11.655913978494624],
 ['17', 9.795698924731182],
 ['21', 5.787234042553192],
 ['22', 12.391304347826088],
 ['06', 8.875]]

In [56]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
swap_avg_by_hour

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

sorted_swap

[[15.770491803278688, '18'],
 [15.709677419354838, '00'],
 [13.44186046511628, '14'],
 [12.416666666666666, '23'],
 [12.391304347826088, '22'],
 [11.80327868852459, '12'],
 [11.655913978494624, '16'],
 [11.5, '07'],
 [11.159090909090908, '11'],
 [10.62962962962963, '03'],
 [10.2, '20'],
 [9.8, '19'],
 [9.795698924731182, '17'],
 [9.7, '09'],
 [9.555555555555555, '13'],
 [9.5, '04'],
 [8.875, '06'],
 [8.785714285714286, '01'],
 [8.25, '10'],
 [8.102564102564102, '15'],
 [5.787234042553192, '21'],
 [4.852941176470588, '08'],
 [4.233333333333333, '02'],
 [3.0526315789473686, '05']]

# Top 5 Hours for Show Posts Comments

In [57]:
print("Top 5 Hours for Show Posts Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )


Top 5 Hours for Show Posts Comments
18:00: 15.77 average comments per post
00:00: 15.71 average comments per post
14:00: 13.44 average comments per post
23:00: 12.42 average comments per post
22:00: 12.39 average comments per post


For Show Posts there were no hours where there the avg number of comments fell below 3, indicating that regardless of when you post a show post it will receive about 3 comments. However, since the dataset was cleaned to encompass only posts that received comments what can be said is that of the shows posts with comments the hour where you'd receive the most comments would be 8 PM EST with about 16. 