# Kaggle Dataset Analysis: Exploring Hacker News Posts

In this project, I worked with a dataset of submissions to Hacker News, a popular technology site. The dataset is available on Kaggle: https://www.kaggle.com/hacker-news/hacker-news-posts (about 20,000 rows).

My goal was to determine which conditions maximize the number of interactions a post receives on Hacker News.

Two categories of posts on Hacker News are of particular interest here:
- Ask HN: users submit this kind of post to ask the Hacker News community a specific question.
- Show HN: users submit this kind of post to show the Hacker News community a project, a product, or just something interesting.

I therefore decided to compare these two types of posts to determine the following:
- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

## **Read in the data**

In [1]:
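from csv import reader

# Open the local copy of the Kaggle CSV; the `with` block closes the file automatically.
with open('/Users/tangigouez/Desktop/DATAQUEST/MyDataSets/Project_2/hacker_news.csv', newline='') as open_file:
    hn = list(reader(open_file))

header = hn[0]    # first row holds the column names
hn_data = hn[1:]  # remaining rows hold the posts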

In [68]:
# Header
header

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [70]:
# Data
hn_data[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## **Extracting Ask HN and Show HN Posts**

First, I identified the Ask HN and Show HN posts and separated them into different lists, which makes the data easier to analyze in the following steps. Titles are lowercased before the prefix check, so the match is case-insensitive.

In [71]:
ask_posts = []
show_posts = []
other_posts = []

for rows in hn_data:
    title = rows[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(rows)
    elif title.startswith('show hn'):
        show_posts.append(rows)
    else:
        other_posts.append(rows)

total_ask_posts = len(ask_posts)
total_show_posts = len(show_posts)
total_other_posts = len(other_posts)
print(total_ask_posts)
print(total_show_posts)
print(total_other_posts)

1744
1162
17194
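
The prefix check above could also be wrapped in a small function so it can be tried on a few titles in isolation. This is only a sketch: `classify_post` and the sample titles are hypothetical names and values introduced for illustration, not part of the original notebook or the dataset.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # lowercase so the match is case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical titles, used only to exercise the three branches.
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'A post about something else',
]
labels = [classify_post(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```

Because the check uses `startswith`, a title that merely contains "ask hn" somewhere in the middle is still counted as an other post.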


## **Calculating the Average Number of Comments for Ask HN and Show HN Posts**

Now that the ask posts and show posts are in separate lists, I calculated the average number of comments each type of post receives.

In [72]:
# Calculate the average number of comments `Ask HN` posts receive.

total_ask_comments = 0
for rows in ask_posts:
    num_comments = int(rows[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments/total_ask_posts
print(round(avg_ask_comments,2))

14.04


In [73]:
# Calculate the average number of comments `Show HN` posts receive.

total_show_comments = 0
for rows in show_posts:
    num_comments = int(rows[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments/total_show_posts
print(round(avg_show_comments,2))

10.32
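
The two cells above repeat the same loop, so the calculation could be factored into one helper. The sketch below assumes rows laid out like the header shown earlier (comment count as a string at index 4); `average_comments` and the sample rows are hypothetical and only illustrate the idea.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts stored as strings at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as the header:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: A?', '', '5', '10', 'alice', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '3', '4', 'bob', '8/4/2016 12:10'],
]
print(round(average_comments(sample_posts), 2))  # 7.0
```

With such a helper, the two averages would come from `average_comments(ask_posts)` and `average_comments(show_posts)`.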


On average, ask posts receive approximately 14 comments per post, compared with approximately 10 for show posts. Because ask posts tend to attract more comments, I focused the rest of the analysis on them.

## **Finding the Amount of Ask Posts and Comments by Hour Created**

Next, my goal was to determine whether ask posts created at a particular hour of the day have historically received more comments on average. First, I counted the ask posts created during each hour of the day, along with the number of comments those posts received. Then, I calculated the average number of comments per post for each hour.

In [77]:
import datetime as dt

result_list = []
for posts in ask_posts:
    result_list.append([posts[6],int(posts[4])])

count_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for rows in result_list:
    date = rows[0]
    comments = rows[1]
    hour = dt.datetime.strptime(date, date_format).strftime("%H")
    if hour in count_by_hour:
        count_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    else:
        count_by_hour[hour] = 1
        comments_by_hour[hour] = comments

comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
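
As an alternative to the `if`/`else` branches, the same tally could be written with `collections.defaultdict`, which starts missing keys at zero. This is a sketch under the assumption that dates use the `%m/%d/%Y %H:%M` format seen in the data; `tally_by_hour` and the sample comment counts are hypothetical.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"


def tally_by_hour(pairs):
    """Count posts and sum comments per hour from [created_at, num_comments] pairs."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)


# Hypothetical pairs; the comment counts are made up for illustration.
sample_pairs = [
    ["8/4/2016 11:52", 6],
    ["9/30/2015 4:12", 2],
    ["1/26/2016 11:05", 4],
]
counts, comments = tally_by_hour(sample_pairs)
print(counts)    # {'11': 2, '04': 1}
print(comments)  # {'11': 10, '04': 2}
```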

## **Calculating the Average Number of Comments for Ask HN Posts by Hour**

In [78]:
avg_by_hour = []

for hour in count_by_hour:
    avg_by_hour.append([hour,comments_by_hour[hour]/count_by_hour[hour]])

avg_by_hour   

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
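
The per-hour average is simply the comment total divided by the post count for the same hour. A dictionary comprehension is one compact way to express that; the two sample dictionaries below are hypothetical and much smaller than the real `count_by_hour` and `comments_by_hour`.

```python
# Hypothetical tallies for two hours, used only to show the calculation.
count_by_hour_sample = {'10': 2, '15': 4}
comments_by_hour_sample = {'10': 9, '15': 40}

# Divide each hour's comment total by its number of posts.
avg_by_hour_sample = {
    hour: comments_by_hour_sample[hour] / count_by_hour_sample[hour]
    for hour in count_by_hour_sample
}
print(avg_by_hour_sample)  # {'10': 4.5, '15': 10.0}
```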

## **Sorting and Printing Values from a List of Lists**

In [79]:
swap_avg_by_hour = []

for element in avg_by_hour:
    swap_avg_by_hour.append([element[1],element[0]])

sorted_swap = sorted(swap_avg_by_hour,reverse = True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]

In [80]:
# Print the 5 hours with the highest average number of comments (sorted_swap is already sorted).

for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
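
The swap step above is one way to sort by the average; another option is to keep the `[hour, average]` order and pass a `key` function to `sorted`. The sketch below uses a few pairs copied from the `avg_by_hour` output; `avg_by_hour_sample` and `top` are names introduced here for illustration.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the second element (the average), highest first, without swapping columns.
top = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```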


The hour that receives the most comments per post on average is 15:00, with an average of 38.59 comments per post. That is about 62% more than the second-highest hour, 02:00, which averages 23.81 comments per post.
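
The gap between the top two hours can be checked directly from the averages printed above. The variable names below are introduced for this check only.

```python
highest = 38.5948275862069    # 15:00 average, from the output above
second = 23.810344827586206   # 02:00 average, from the output above

# Relative increase of the highest average over the second highest, in percent.
pct_increase = (highest - second) / second * 100
print(round(pct_increase, 1))  # 62.1
```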

## **Conclusion**

Overall, my goal was to analyze ask posts and show posts to determine which type of post and which posting time received the most comments on average. If you want to post on Hacker News and maximize your odds of getting interactions with your content, I'd recommend framing it as an Ask HN post and creating it between 15:00 and 16:00. In this dataset, ask posts created during that hour received about 39 comments on average (as of September 2019).