# Data Science Project 2: Exploring Hacker News Posts

1. In this project, we'll work with a data set of submissions to the popular technology site Hacker News.

2. Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

3. Description of the dataset columns:
    1. id: The unique identifier from Hacker News for the post
    2. title: The title of the post
    3. url: The URL that the post links to, if the post has a URL
    4. num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
    5. num_comments: The number of comments that were made on the post
    6. author: The username of the person who submitted the post
    7. created_at: The date and time at which the post was submitted
    
4. In this project, we are particularly interested in the posts whose titles begin with either *Ask HN* or *Show HN*. 
    1. Users submit Ask HN posts to ask the Hacker News community a specific question.
    2. Users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.



First, we read the dataset file, hacker_news.csv, into a list of lists.

In [5]:
from csv import reader

# Open the file in a with block so it is closed once the rows are read
with open('hacker_news.csv') as open_file:
    hn = list(reader(open_file))

# Show the header row and the first four data rows
print(hn[0:5])


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [6]:
#Extract the dataset header
headers = hn[0]
print('Hacker News headers: \n',headers)

#Remove the headers from the dataset and show it
hn = hn[1:]
print('First five rows: \n', hn[0:5])

Hacker News headers: 
 ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
First five rows: 
 [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walter
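The later cells refer to columns by position (`row[1]`, `row[4]`, `row[6]`). As a minimal sketch, and not part of the original notebook, a name-to-index lookup built from the header row can make those positions easier to read. The name `col_index` is hypothetical, and the sample row is copied from the output above.

```python
# Header row as printed above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row.
col_index = {name: position for position, name in enumerate(headers)}

# One data row copied from the output above.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

print(sample_row[col_index['title']])              # same as sample_row[1]
print(int(sample_row[col_index['num_comments']]))  # same as int(sample_row[4])
```

With a lookup like this, `row[col_index['created_at']]` could stand in for `row[6]` in the cells that follow.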

Now that the dataset is loaded and we have inspected its first rows, we want to separate the Ask HN posts and the Show HN posts from the rest.

In [7]:
# Create three empty lists to hold each type of post
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# After sorting every row into one of the three lists, print how many posts each list holds
print('ASK Posts numbers are: ', len(ask_posts))
print('SHOW Posts numbers are: ',len(show_posts))
print('Other type of posts numbers are: ', len(other_posts))


ASK Posts numbers are:  1744
SHOW Posts numbers are:  1162
Other type of posts numbers are:  17194


In [11]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments/len(ask_posts)
print('The average number of ask comments is: ', avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print('The average number of show comments is: ', avg_show_comments)


The average number of ask comments is:  14.038417431192661
The average number of show comments is:  10.31669535283993


As the result above shows, ask posts receive more comments on average than show posts (about 14 versus about 10). Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
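The two loops in the cell above differ only in the list they iterate over. A minimal sketch of a shared helper follows; `average_comments` is a hypothetical name, and the rows are made-up placeholders rather than rows from the dataset.

```python
def average_comments(posts, comments_col=4):
    """Return the mean of the comment counts stored as strings in comments_col."""
    if not posts:
        # Assumption: report 0.0 for an empty list instead of dividing by zero.
        return 0.0
    total = sum(int(row[comments_col]) for row in posts)
    return total / len(posts)

# Made-up rows with the same seven-column layout as the dataset.
demo_posts = [
    ['1', 'Ask HN: example one', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: example two', '', '3', '20', 'user_b', '1/2/2016 15:30'],
]

print(average_comments(demo_posts))
```

Calling `average_comments(ask_posts)` and `average_comments(show_posts)` would then replace both loops.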

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.
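Both steps depend on pulling the hour out of the `created_at` string. The sketch below shows that one parsing step on two timestamps copied from the rows printed earlier; the variable names are illustrative.

```python
import datetime as dt

# Month/day/year followed by a 24-hour clock time.
date_format = "%m/%d/%Y %H:%M"

for created_at in ['8/4/2016 11:52', '6/17/2016 0:01']:
    parsed = dt.datetime.strptime(created_at, date_format)
    # strftime('%H') gives a zero-padded string; .hour gives an int.
    print(created_at, '->', parsed.strftime('%H'), parsed.hour)
```

Using the zero-padded string as a dictionary key keeps hours such as `'09'` and `'10'` in clock order when the keys are sorted as text.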

In [21]:
import datetime as dt

result_list = []
for row in ask_posts:
    date = row[6]
    num_comments = int(row[4])
    result_list.append([date, num_comments])

counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    time = row[0]
    comment = row[1]
    hour = dt.datetime.strptime(time, date_format).strftime('%H')
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] +=1
        comments_by_hour[hour] += comment 

comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

1. counts_by_hour: contains the number of ask posts created during each hour of the day.

2. comments_by_hour: contains the total number of comments received by ask posts created during each hour of the day.
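The `if`/`else` bookkeeping that builds these two dictionaries can be shortened. As a sketch of an alternative, not the approach used in the notebook, `collections.defaultdict` supplies a starting value of 0 for any hour not seen yet. The `demo_` names and the timestamp pairs are made-up placeholders.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Made-up [created_at, num_comments] pairs in the same shape as result_list.
demo_result_list = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 9:14', 4],
]

demo_counts = defaultdict(int)    # missing hours start at 0
demo_comments = defaultdict(int)

for created_at, num_comments in demo_result_list:
    hour = dt.datetime.strptime(created_at, date_format).strftime('%H')
    demo_counts[hour] += 1
    demo_comments[hour] += num_comments

print(dict(demo_counts))
print(dict(demo_comments))
```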

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [29]:
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

avg_by_hour

[['02', 23.810344827586206],
 ['06', 9.022727272727273],
 ['17', 11.46],
 ['19', 10.8],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['01', 11.383333333333333],
 ['00', 8.127272727272727],
 ['13', 14.741176470588234],
 ['12', 9.41095890410959],
 ['15', 38.5948275862069],
 ['08', 10.25],
 ['14', 13.233644859813085],
 ['09', 5.5777777777777775],
 ['05', 10.08695652173913],
 ['22', 6.746478873239437],
 ['04', 7.170212765957447],
 ['10', 13.440677966101696],
 ['16', 16.796296296296298],
 ['07', 7.852941176470588],
 ['23', 7.985294117647059],
 ['11', 11.051724137931034]]

We have now calculated the average number of comments for posts created during each hour of the day and stored the results in a list of lists named avg_by_hour. To find the best hours, we sort this list by average, from highest to lowest.

In [32]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print('Top 5 Hours for Ask Posts Comments')
template = '{0}: {1:.2f} average comments per post'

for avg, hr in sorted_swap[:5]:
    print(template.format(dt.datetime.strptime(hr, '%H').strftime('%H:%M'), 
                         avg))



[[23.810344827586206, '02'], [9.022727272727273, '06'], [11.46, '17'], [10.8, '19'], [13.20183486238532, '18'], [7.796296296296297, '03'], [16.009174311926607, '21'], [21.525, '20'], [11.383333333333333, '01'], [8.127272727272727, '00'], [14.741176470588234, '13'], [9.41095890410959, '12'], [38.5948275862069, '15'], [10.25, '08'], [13.233644859813085, '14'], [5.5777777777777775, '09'], [10.08695652173913, '05'], [6.746478873239437, '22'], [7.170212765957447, '04'], [13.440677966101696, '10'], [16.796296296296298, '16'], [7.852941176470588, '07'], [7.985294117647059, '23'], [11.051724137931034, '11']]
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
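Swapping the columns is one way to sort by the average. A sketch of an alternative is to pass a `key` function to `sorted`, which leaves each pair in its original `[hour, average]` order. The list below is a five-row subset of the `avg_by_hour` output shown earlier, chosen for illustration.

```python
# Subset of avg_by_hour copied from the earlier output.
avg_subset = [
    ['02', 23.810344827586206],
    ['06', 9.022727272727273],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['09', 5.5777777777777775],
]

# Sort by the second element (the average), largest first.
ranked = sorted(avg_subset, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:3]:
    print('{0}:00: {1:.2f} average comments per post'.format(hour, avg))
```

Because `sorted` returns a new list, `avg_subset` keeps its original order.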


Based on the result, I should create a post during 