# Hacker News data analysis

This analysis examines a dataset obtained from Hacker News and compares different kinds of posts and the comments they receive. The project exercises string handling, dates and times, and OOP.

In [2]:
# First, import the csv reader and load the dataset
from csv import reader

data_file = "hacker_news.csv"
# The with-statement closes the file once it has been read
with open(data_file) as opened_file:
    hn = list(reader(opened_file))
del hn[0]  # drop the header row
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

Next, we check the title of each post and separate the posts whose titles start with "ask hn" or "show hn". Because a title may begin in either lower or upper case, we convert each title to lower case before comparing, so that the match is case-insensitive.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts)) 

1744
1162
17194
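
The prefix check used above can be tried in isolation. This is a minimal sketch: the `classify` helper and the first three titles are hypothetical and are not part of the dataset or the original notebook.

```python
# Hypothetical titles (except the last one, taken from the sample rows),
# used only to illustrate the case-insensitive prefix check
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "ask hn: best laptop for development?",
    "Show HN: A tiny static site generator",
    "Interactive Dynamic Video",
]

def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

labels = [classify(title) for title in sample_titles]
print(labels)  # ['ask', 'ask', 'show', 'other']
```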


In [66]:
total_ask_comments = 0
for data in ask_posts:
    num_comments = int(data[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments/len(ask_posts)
avg_ask_comments

14.038417431192661

In [67]:
total_show_comments = 0
for data in show_posts:
    num_comments = int(data[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments/len(show_posts)
avg_show_comments

10.31669535283993
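
The two cells above repeat the same loop, so the calculation could be factored into a small helper. The sketch below assumes the comment count sits at index 4 of each row, as in the cells above; the name `average_comments` is hypothetical and does not appear in the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Average of the comment counts, which are stored as strings in each row."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Two rows copied from the sample output at the top of the notebook
sample_rows = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)',
     'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
     '8', '2', 'walterbell', '9/30/2015 4:12'],
]

print(average_comments(sample_rows))  # (52 + 2) / 2 = 27.0
```

With such a helper, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops.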

These results show that ask posts receive more comments on average than show posts (about 14.04 versus 10.32). Next, we look at the ask posts by hour: how many posts are created and how many comments are received in each hour. We also calculate the average number of comments per post for each hour.

In [68]:
import datetime as dt 

result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at,num_comments])

counts_by_hour = {}
comments_by_hour = {}
for result in result_list:
    time_ = result[0]
    comments = result[1]
    date_time = dt.datetime.strptime(time_,"%m/%d/%Y %H:%M")
    time_hour = date_time.strftime("%H")
    if time_hour not in counts_by_hour:
        counts_by_hour[time_hour] = 1
        comments_by_hour[time_hour] = comments
    else:
        counts_by_hour[time_hour] += 1
        comments_by_hour[time_hour] += comments
        
# counts_by_hour
# comments_by_hour
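
To make the parsing step concrete, the sketch below applies the same format string to three timestamps copied from the sample rows shown earlier. It illustrates that `strptime` accepts hours and days without a leading zero, while `strftime("%H")` always returns a zero-padded hour string.

```python
import datetime as dt

# Timestamps copied from the sample rows at the top of the notebook
samples = ["8/4/2016 11:52", "6/17/2016 0:01", "9/30/2015 4:12"]

hours = []
for stamp in samples:
    parsed = dt.datetime.strptime(stamp, "%m/%d/%Y %H:%M")
    hours.append(parsed.strftime("%H"))  # zero-padded hour, e.g. '04'

print(hours)  # ['11', '00', '04']
```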

In [73]:
avg_by_hour = []  # average number of comments per post for each hour
for hour in counts_by_hour:
    avg_by_hour.append([hour,comments_by_hour[hour]/counts_by_hour[hour]])
avg_by_hour

[['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['17', 11.46],
 ['12', 9.41095890410959],
 ['22', 6.746478873239437],
 ['13', 14.741176470588234],
 ['07', 7.852941176470588],
 ['16', 16.796296296296298],
 ['09', 5.5777777777777775],
 ['00', 8.127272727272727],
 ['23', 7.985294117647059],
 ['15', 38.5948275862069],
 ['06', 9.022727272727273],
 ['19', 10.8],
 ['11', 11.051724137931034],
 ['18', 13.20183486238532],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['03', 7.796296296296297],
 ['02', 23.810344827586206],
 ['04', 7.170212765957447],
 ['01', 11.383333333333333],
 ['05', 10.08695652173913],
 ['08', 10.25]]

Sorting the results makes them easier to read, so we sort the hours by average comments per post, in descending order.

In [80]:
# Swap each pair to [average, hour] so that sorting orders by the average
swap_avg_by_hour = []
for row in avg_by_hour:  # `row` avoids shadowing the built-in `list`
    swap_avg_by_hour.append([row[1], row[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
for avg, hour in sorted_swap[:5]:
    time_hour = dt.datetime.strptime(hour, "%H")
    time_hour = time_hour.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(time_hour, avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
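
As an alternative to swapping the columns before sorting, `sorted` accepts a `key` function that selects the value to sort by. The sketch below uses four `[hour, average]` pairs copied from the `avg_by_hour` output above; the names `sample_avg_by_hour` and `ranked` are only for illustration.

```python
# A few [hour, average] pairs copied from the avg_by_hour output
sample_avg_by_hour = [
    ['10', 13.440677966101696],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort by the second element (the average), largest first
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

One difference from the swap approach: when two hours have the same average, the `key` version keeps them in their original order rather than breaking the tie on the hour string.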


From this data, we conclude that 15:00 (3:00 pm) has the highest average number of comments per ask post, so it can be considered the best hour to post. However, the second-best hour is 02:00, which suggests that posting time is not the only factor that determines how many comments a post receives.