# Hacker News Data Analysis

In this project, we'll work with a dataset of submissions to the popular technology site Hacker News.

The dataset has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions.

Below are descriptions of the columns:

    - id: the unique identifier from Hacker News for the post
    - title: the title of the post
    - url: the URL that the post links to, if the post has a URL
    - num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
    - num_comments: the number of comments on the post
    - author: the username of the person who submitted the post
    - created_at: the date and time of the post's submission



In [1]:
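from csv import reader

# Open the CSV file and read every row into a list of lists;
# the with statement closes the file once reading is done
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)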

The first five rows of hn, including the header row, are:

In [2]:
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


We can see that the first element of the list is the header row. After storing it in a separate variable, hn contains only the data rows.

In [3]:
headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Since we are only concerned with post titles beginning with ```Ask HN``` or ```Show HN```, we will create new lists of lists containing just the data for those titles. Titles are lowercased before the comparison so that the match is case-insensitive.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
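
The prefix rule used in the loop above can also be written as a small helper, which makes it easy to check on its own. This is only a minimal sketch: the function name `classify_title` and the sample titles are hypothetical and not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()  # compare case-insensitively, as in the loop above
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up sample titles, for illustration only
sample_titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```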

Next, we will determine if ask posts or show posts received more comments on average.

In [5]:
total_ask_comments = 0
total_show_comments = 0


for post in ask_posts:
    comments = post[4]
    comments = int(comments)
    total_ask_comments += comments
    
for post in show_posts:
    comments = post[4]
    comments = int(comments)
    total_show_comments += comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)    
avg_show_comments = total_show_comments / len(show_posts)


Based on the analysis above, ask posts receive about 14 comments on average, while show posts receive about 10.
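
Because the same averaging steps are repeated for both lists, they could be factored into one function. The sketch below assumes rows in the same column order as the dataset; the name `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Average the integer comment counts in a list of rows (hypothetical helper)."""
    if not posts:  # avoid dividing by zero on an empty list
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the same column order as the dataset
sample_posts = [
    ['1', 'Ask HN: A?', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: B?', '', '3', '4', 'user_b', '1/2/2016 11:00'],
]
sample_avg = average_comments(sample_posts)
print(sample_avg)
```

With the real data, the same function could be called once with ask_posts and once with show_posts.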

We will now determine if ask posts created at a *certain time* are more likely to attract comments. 
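
Grouping posts by hour depends on parsing the created_at string into a datetime object and then extracting the hour. A minimal sketch of that single step, using the timestamp from the first data row shown earlier (the variable names are illustrative):

```python
import datetime as dt

sample_created_at = '8/4/2016 11:52'  # timestamp of the first data row shown earlier

# The format string mirrors the month/day/year hour:minute layout of the column
parsed = dt.datetime.strptime(sample_created_at, '%m/%d/%Y %H:%M')

# strftime('%H') returns the hour as a zero-padded string
hour = parsed.strftime('%H')
print(hour)
```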

In [6]:
import datetime as dt

result_list = []
counts_by_hour = {}
comments_by_hour = {}

for post in ask_posts:
    created_time = post[6]
    comments = post[4]
    comments = int(comments)
    result_list.append([created_time,comments])
    
 
for result in result_list:
    time = result[0]
    comment = result[1]
    time_dt = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
    hour = time_dt.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
        

Above, we created two dictionaries:

    - counts_by_hour: contains the number of ask posts created during each hour of the day
    - comments_by_hour: contains the total number of comments received by ask posts created during each hour

We will use these two dictionaries to calculate the average number of comments for posts created during each hour of the day. 


In [7]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, round(comments_by_hour[hour]/counts_by_hour[hour],1)])


Above, we created a list of lists in which each inner list holds an hour and the average number of comments for ask posts created during that hour.
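
The same calculation can be written as a list comprehension. The sketch below uses made-up totals rather than the real dictionaries, so the numbers are illustrative only.

```python
# Made-up totals for illustration; the real dictionaries are built from ask_posts
sample_counts_by_hour = {'09': 2, '15': 4}
sample_comments_by_hour = {'09': 11, '15': 154}

# One [hour, average] pair per hour, rounded to one decimal place
sample_avg_by_hour = [
    [hour, round(sample_comments_by_hour[hour] / sample_counts_by_hour[hour], 1)]
    for hour in sample_comments_by_hour
]
print(sample_avg_by_hour)
```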

To identify the hours with the highest average number of comments, we will swap the two columns so that the average comes first, sort the resulting list of lists in descending order, and print the five highest values.

In [8]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

[[5.6, '09'], [14.7, '13'], [13.4, '10'], [13.2, '14'], [16.8, '16'], [8.0, '23'], [9.4, '12'], [11.5, '17'], [38.6, '15'], [16.0, '21'], [21.5, '20'], [23.8, '02'], [13.2, '18'], [7.8, '03'], [10.1, '05'], [10.8, '19'], [11.4, '01'], [6.7, '22'], [10.2, '08'], [7.2, '04'], [8.1, '00'], [9.0, '06'], [7.9, '07'], [11.1, '11']]


In [9]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    comments = row[0]
    hour = dt.datetime.strptime(row[1], "%H")
    hour = hour.strftime("%H:%M")
    
    print("{hour}: {comments} average comments per post".format(hour = hour, comments = comments))

Top 5 Hours for Ask Posts Comments
15:00: 38.6 average comments per post
02:00: 23.8 average comments per post
20:00: 21.5 average comments per post
16:00: 16.8 average comments per post
21:00: 16.0 average comments per post
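
The column swap can be avoided by passing a key function to `sorted`. This is an alternative to the approach above rather than the notebook's own method; the sketch uses four [hour, average] pairs copied from the output shown earlier, and the variable names are illustrative.

```python
# A few [hour, average] pairs taken from the output shown earlier
sample_avg_by_hour = [['09', 5.6], ['15', 38.6], ['02', 23.8], ['20', 21.5]]

# Sort by the average (the second element of each pair), highest first
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print(f"{hour}:00: {avg} average comments per post")
```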
