# Exploring Hacker News Posts

### Company Background:
Hacker News is a site where users submit stories (or posts) that receive votes and comments. Top posts on Hacker News may receive hundreds of thousands of visitors.

### **Project Goal:**

My goal as the Data Analyst is to answer two questions:
* Does <mark>Ask HN or Show HN</mark> receive more comments on average?
* Do posts created at certain times receive more comments on average?

> **This project follows six steps of the data analysis process to answer the two questions above:**
> * 1.) Ask Question
> * 2.) Get Data
> * 3.) Explore Data
> * 4.) Clean Data
> * 5.) Analyze Data
> * 6.) Conclusion

##  Ask Question:

My goal as the Data Analyst is to answer two questions:
* Does <mark>Ask HN or Show HN</mark> receive more comments on average?
* Do posts created at certain times receive more comments on average?

## Get Data  

The data set that will be used for analysis can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts).
The data set includes submissions to the Hacker News site.

In [1]:
from csv import reader

### The Hacker News data set ###
# open the file in a with block so it is closed once the rows are read
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [2]:
# separate the header row from the data rows
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Data Dictionary
* id: the unique identifier from Hacker News for the post
* title: the title of the post
* url: the URL that the post links to, if the post has a URL
* num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* num_comments: the number of comments on the post
* author: the username of the person who submitted the post
* created_at: the date and time of the post's submission
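
Every value read by `csv.reader` is a string, so the columns in the data dictionary need converting before any arithmetic. The sketch below shows one way to do that for a single row. The helper name `parse_row` is an assumption made for illustration and is not part of the original analysis; the sample row is copied from the preview above.

```python
import datetime as dt

# Sample row copied from the data set preview above.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']


def parse_row(row):
    """Hypothetical helper: return one row as a dict with typed values."""
    return {
        'id': int(row[0]),
        'title': row[1],
        'url': row[2],
        'num_points': int(row[3]),
        'num_comments': int(row[4]),
        'author': row[5],
        # created_at uses month/day/year and a 24-hour clock
        'created_at': dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M'),
    }


parsed = parse_row(sample_row)
print(parsed['num_comments'], parsed['created_at'].hour)
```

The rest of this notebook keeps the rows as lists and converts values with `int()` only where they are needed, which works equally well for a data set of this size.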

### Clean Data
Find posts whose titles begin with <mark>Ask HN</mark> or <mark>Show HN</mark>.

In [27]:
# filter the data
# use the string method startswith to find the posts that begin with Ask HN or Show HN

# create 3 empty lists
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    # if the lowercase version of the title starts with 'ask hn', append the row to ask_posts
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
print("Number of Posts begin with 'Ask HN': ", len(ask_posts), "\n","Number of posts begin with 'Show HN': ", len(show_posts),"\n", "Number of Other posts:", len(other_posts))

print(ask_posts[:3])


1744
1162
17194
Number of Posts begin with 'Ask HN':  1744 
 Number of posts begin with 'Show HN':  1162 
 Number of Other posts: 17194
[['12296411', 'ask hn: how to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'ask hn: am i the only one outraged by twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'ask hn: aby recent changes to css that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']]
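
The case-insensitive prefix check used above can also be isolated in a small function, which makes it easy to try on a few titles. This is a minimal sketch: `classify_title` is a hypothetical helper, and the sample titles are illustrative (the second one is made up).

```python
from collections import Counter


def classify_title(title):
    """Hypothetical helper: label a title as 'ask', 'show', or 'other'."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Illustrative titles; the 'Show HN' title is invented for this example.
sample_titles = [
    'Ask HN: How to improve my personal website?',
    'SHOW HN: A made-up example project',
    'Interactive Dynamic Video',
]

labels = [classify_title(title) for title in sample_titles]
label_counts = Counter(labels)
print(labels)
print(label_counts)
```

Lowercasing only a local copy of the title leaves the original row untouched, so the stored titles keep their original capitalization.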


### Average Comments per Post
Let's determine if ask posts or show posts receive more comments on average.


In [28]:
# we found the posts; now we need to look at the comments
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    # add this value to total_ask_comments
    total_ask_comments += num_comments
# average number of comments per Ask HN post
avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)
    
    


14.038417431192661


In [29]:
# Find the average number of comments per Show HN post
total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print("Average number of comments for Show HN:")
avg_show_comments
 

Average number of comments for Show HN:


10.31669535283993

> On average, <mark>Ask HN</mark> posts receive more comments than <mark>Show HN</mark> posts: around 14 comments per post versus around 10.

Since <mark>Ask HN</mark> posts receive more comments, the analysis will focus on <mark>Ask HN</mark> posts only.
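
The two averaging loops above follow the same pattern, so they could be folded into one function. The sketch below is one possible version, not the code used in this notebook: `average_comments` and its `comments_index` parameter are assumed names, and the three sample rows are the Ask HN rows printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Hypothetical helper: mean of the num_comments column for a list of rows."""
    if not posts:
        # avoid dividing by zero when no posts matched
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# The three Ask HN rows shown in the output above.
sample_ask_posts = [
    ['12296411', 'ask hn: how to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'ask hn: am i the only one outraged by twitter shutting down share counts?', '',
     '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'ask hn: aby recent changes to css that broke mobile?', '',
     '1', '1', 'polskibus', '5/2/2016 10:14'],
]

sample_average = average_comments(sample_ask_posts)
print(sample_average)  # (6 + 29 + 1) / 3 = 12.0
```

With the full lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages computed above.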

## Analyze Ask Post Data
Determine whether <mark>Ask Posts</mark> created at certain times receive more comments:

* calculate the number of <mark>Ask Posts</mark> created each hour, along with the number of comments received

* calculate the average number of comments <mark>Ask Posts</mark> receive by hour created

* identify the five hours with the highest average number of comments per post

* identify the best hours to create a post for a higher chance of receiving comments


> Calculate the number of <mark>Ask Posts</mark> created each hour along with the number of comments received

In [30]:
# import datetime module as dt
import datetime as dt
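
Before grouping posts by hour, it may help to see how a single `created_at` string is parsed. The short example below uses one timestamp taken from the Ask HN rows shown earlier; the variable names are chosen for illustration only.

```python
import datetime as dt

# A created_at value copied from one of the Ask HN rows above.
created_at = '8/16/2016 9:55'

# %m/%d/%Y matches month/day/year; %H:%M matches a 24-hour clock time.
parsed_time = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

print(parsed_time)       # 2016-08-16 09:55:00
print(parsed_time.hour)  # 9
```

`strptime` accepts values without leading zeros (such as `8` for the month or `9` for the hour), which is why the same format string works for every row in this data set's preview.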

In [34]:
# calculate the number of ask posts created each hour along with # comments received
 
result_list = []
for post in ask_posts:
    created_at_col = post[6]
    num_comments_col = post[4]
    result_list.append([created_at_col, num_comments_col]) # creation date and time, number of comments

counts_by_hour = {} 
comments_by_hour = {}

for row in result_list:
    date = row[0]
    comment_num = int(row[1])
    date_formatted_dt = dt.datetime.strptime(date,"%m/%d/%Y %H:%M")
    hour = date_formatted_dt.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment_num
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment_num
        
print("Posts per hour:","\n",counts_by_hour,"\n", "\n","Comments per hour:","\n",comments_by_hour) 

Posts per hour: 
 {9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58} 
 
 Comments per hour: 
 {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
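
The "first time this hour is seen" check in the loop above can be avoided with `collections.defaultdict`, which starts every missing key at zero. This is an alternative sketch under that assumption, not the approach used in this notebook. The names `hour_counts` and `hour_comments` are illustrative, and the sample pairs come from the three Ask HN rows printed earlier.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs from the three Ask HN rows shown earlier.
sample_results = [
    ['8/16/2016 9:55', '6'],
    ['11/22/2015 13:43', '29'],
    ['5/2/2016 10:14', '1'],
]

hour_counts = defaultdict(int)    # posts created in each hour
hour_comments = defaultdict(int)  # comments received by posts created in each hour

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    hour_counts[hour] += 1
    hour_comments[hour] += int(num_comments)

print(dict(hour_counts))
print(dict(hour_comments))
```

Both versions build the same two dictionaries; the `defaultdict` form only removes the explicit branch.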


> Calculate the average number of comments <mark>Ask Posts</mark> receive by the hour created

In [35]:
# example: build a list of lists from a dict
sample_dict = {
                'apple': 2, 
                'banana': 4, 
                'orange': 6
               }

fruits = []

for fruit in sample_dict:
    fruits.append([fruit, 10*sample_dict[fruit]])
print(fruits)
    

[['apple', 20], ['banana', 40], ['orange', 60]]


Use the example above to calculate the average number of comments per post for posts created during each hour of the day.

In [36]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
print('Average number of comments per post by hour:')
avg_by_hour


Average number of comments per post by hour:


[[9, 5.5777777777777775],
 [13, 14.741176470588234],
 [10, 13.440677966101696],
 [14, 13.233644859813085],
 [16, 16.796296296296298],
 [23, 7.985294117647059],
 [12, 9.41095890410959],
 [17, 11.46],
 [15, 38.5948275862069],
 [21, 16.009174311926607],
 [20, 21.525],
 [2, 23.810344827586206],
 [18, 13.20183486238532],
 [3, 7.796296296296297],
 [5, 10.08695652173913],
 [19, 10.8],
 [1, 11.383333333333333],
 [22, 6.746478873239437],
 [8, 10.25],
 [4, 7.170212765957447],
 [0, 8.127272727272727],
 [6, 9.022727272727273],
 [7, 7.852941176470588],
 [11, 11.051724137931034]]
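
The remaining step listed in the plan, identifying the five hours with the highest average, can be sketched by sorting the averages in descending order. The two dictionaries below are copied from the output printed earlier so that the block runs on its own; `top_five` and the other new names are illustrative rather than part of the original notebook.

```python
# Copied from the "Posts per hour" and "Comments per hour" output above.
counts_by_hour = {9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73,
                  17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54,
                  5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44,
                  7: 34, 11: 58}
comments_by_hour = {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543,
                    12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381,
                    18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479,
                    8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}

# Average comments per post for each hour, as [hour, average] pairs.
avg_by_hour = [[hour, comments_by_hour[hour] / counts_by_hour[hour]]
               for hour in comments_by_hour]

# Sort by the average (second element), highest first, and keep five entries.
top_five = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

print('Top 5 hours for Ask Posts comments:')
for hour, average in top_five:
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, average))
```

With these figures, the five hours with the highest averages are 15, 2, 20, 16, and 21, in that order. Hour 15 stands well above the rest at roughly 38.6 comments per post.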