# Analyzing Hacker News Data
In this project, we look at two specific types of posts on the Hacker News site: Ask HN and Show HN. We want to uncover trends in these posts and perform some surface-level analysis so that we can draw conclusions from the data.

***

## The Data Set
Our data set is from [here](https://www.kaggle.com/hacker-news/hacker-news-posts). The original data has about 300,000 rows, but it was reduced to about 20,000 to cut down on computation time.

The rows of data follow this structure:

| Column | Meaning |
| :---: | :---: |
| id | unique identifier for post |
| title | title of post | 
| url | URL that the post links to, if the post has one |
| num_points | number of points the post received (total upvotes minus total downvotes) |
| num_comments | number of comments on post |
| author | username of person who submitted the post |
| created_at | date and time at which post was submitted | 


In [2]:
# reading in and printing out first five rows
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [4]:
headers = hn[0]
del hn[0]

In the above cell, we store the header row in a separate list so that we do not have to skip over it every time we iterate through the data during cleaning and analysis. This cell must be run only once: each additional run would delete the first row of actual data.
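
One way to make the header removal safe to re-run is to check whether the first row still looks like the header before splitting it off. The sketch below is a hypothetical variant, not the cell used in this notebook; the `split_header` helper and the small `sample` list are assumptions made up for illustration.

```python
def split_header(rows, expected_first_column='id'):
    """Return (header, data_rows) without modifying `rows`.

    If the first row does not look like a header, header is None
    and all rows are returned as data.
    """
    if rows and rows[0] and rows[0][0] == expected_first_column:
        return rows[0], rows[1:]
    return None, list(rows)

# Tiny made-up sample in the same shape as the csv rows.
sample = [
    ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
    ['1', 'Ask HN: Example question?', '', '5', '3', 'someone', '8/4/2016 11:52'],
]

header, data = split_header(sample)
# Calling it again on the data rows finds no header and leaves them unchanged.
header_again, data_again = split_header(data)
```

Because the function returns new lists instead of deleting in place, running it twice does not lose a data row.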

In [7]:
# we look at the header and also the first 5 rows of the actual data
print(headers)
print(hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [10]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
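
The prefix check in the cell above could also be pulled into a small function, which makes it easier to try on individual titles. This is only a sketch: `post_type` is a hypothetical helper name and the titles are made up.

```python
def post_type(title):
    """Classify a title as 'ask', 'show', or 'other' by its prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles, used only to illustrate the case-insensitive prefix check.
titles = ['Ask HN: How do you learn?', 'SHOW HN: My side project', 'Interactive Dynamic Video']
labels = [post_type(t) for t in titles]
print(labels)
```

Note that a plain prefix check also matches titles that merely begin with the same letters, such as one starting with "Ask HNers".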


## Comments on posts
We are also interested in determining which type of post receives more comments on average. For each list, we sum the number of comments across all of its posts and divide by the number of posts to get the average for that post type.

In [12]:
total_ask_comments = 0
total_show_comments = 0
for ask in ask_posts:
    n_comments = int(ask[4])
    total_ask_comments += n_comments
for show in show_posts:
    n_comments = int(show[4])
    total_show_comments += n_comments
avg_ask_comments = total_ask_comments/len(ask_posts)
avg_show_comments = total_show_comments/len(show_posts)
print('Average ask comments:',avg_ask_comments)
print('Average show comments:',avg_show_comments)

Average ask comments: 14.038417431192661
Average show comments: 10.31669535283993


Based on our calculations, ask posts receive about 4 more comments on average than show posts (roughly 14.04 versus 10.32). There are also more ask posts than show posts in this data set, so ask posts appear to draw more interaction overall.
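
The averaging step can be sketched as a reusable function so the same logic applies to both lists. The name `average_comments` and the sample rows are assumptions for illustration; the only thing taken from the data set is that `num_comments` sits at index 4.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Made-up rows in the csv layout; only column 4 (num_comments) matters here.
sample_posts = [
    ['1', 'Ask HN: A?', '', '2', '6', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '1', '10', 'user_b', '8/4/2016 12:10'],
]
sample_avg = average_comments(sample_posts)
print(sample_avg)
```

The guard for an empty list avoids a `ZeroDivisionError` if one of the post types happens to have no rows.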

In [19]:
import datetime as dt
result_list = []
for post in ask_posts:
    result_list.append((post[6], int(post[4])))

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    time = dt.datetime.strptime(date, '%m/%d/%Y %H:%M')
    hours = time.hour
    if hours not in counts_by_hour:
        counts_by_hour[hours] = 1
        comments_by_hour[hours] = row[1]
    else:
        counts_by_hour[hours]+=1
        comments_by_hour[hours]+=row[1]
print(comments_by_hour)
print('\n')
print(counts_by_hour)

{0: 447, 1: 683, 2: 1381, 3: 421, 4: 337, 5: 464, 6: 397, 7: 267, 8: 492, 9: 251, 10: 793, 11: 641, 12: 687, 13: 1253, 14: 1416, 15: 4477, 16: 1814, 17: 1146, 18: 1439, 19: 1188, 20: 1722, 21: 1745, 22: 479, 23: 543}


{0: 55, 1: 60, 2: 58, 3: 54, 4: 47, 5: 46, 6: 44, 7: 34, 8: 48, 9: 45, 10: 59, 11: 58, 12: 73, 13: 85, 14: 107, 15: 116, 16: 108, 17: 100, 18: 109, 19: 110, 20: 80, 21: 109, 22: 71, 23: 68}
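
As an alternative to the `if hours not in counts_by_hour` check, the same tally could be written with `collections.defaultdict`, which starts every missing key at zero. This is a sketch under the assumption that the dates use the `%m/%d/%Y %H:%M` format seen in the data; `tally_by_hour` and the sample pairs are hypothetical.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(pairs, date_format='%m/%d/%Y %H:%M'):
    """Return (posts per hour, comments per hour) for (created_at, n_comments) pairs."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, n_comments in pairs:
        hour = dt.datetime.strptime(created_at, date_format).hour
        counts[hour] += 1
        comments[hour] += n_comments
    return dict(counts), dict(comments)

# Made-up (created_at, num_comments) pairs in the same shape as result_list.
sample_pairs = [('8/4/2016 11:52', 6), ('8/5/2016 11:03', 10), ('6/17/2016 0:01', 1)]
sample_counts, sample_comments = tally_by_hour(sample_pairs)
print(sample_counts)
print(sample_comments)
```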


In [20]:
avg_by_hour = []
for time in counts_by_hour:
    avg_by_hour.append((time,comments_by_hour[time]/counts_by_hour[time]))
print(avg_by_hour)


[(0, 8.127272727272727), (1, 11.383333333333333), (2, 23.810344827586206), (3, 7.796296296296297), (4, 7.170212765957447), (5, 10.08695652173913), (6, 9.022727272727273), (7, 7.852941176470588), (8, 10.25), (9, 5.5777777777777775), (10, 13.440677966101696), (11, 11.051724137931034), (12, 9.41095890410959), (13, 14.741176470588234), (14, 13.233644859813085), (15, 38.5948275862069), (16, 16.796296296296298), (17, 11.46), (18, 13.20183486238532), (19, 10.8), (20, 21.525), (21, 16.009174311926607), (22, 6.746478873239437), (23, 7.985294117647059)]


The list of tuples above is hard to read, so below we print the average for each hour on its own line.

In [21]:
for thing in avg_by_hour:
    print('Hour:', thing[0], 'Average:', thing[1])

Hour: 0 Average: 8.127272727272727
Hour: 1 Average: 11.383333333333333
Hour: 2 Average: 23.810344827586206
Hour: 3 Average: 7.796296296296297
Hour: 4 Average: 7.170212765957447
Hour: 5 Average: 10.08695652173913
Hour: 6 Average: 9.022727272727273
Hour: 7 Average: 7.852941176470588
Hour: 8 Average: 10.25
Hour: 9 Average: 5.5777777777777775
Hour: 10 Average: 13.440677966101696
Hour: 11 Average: 11.051724137931034
Hour: 12 Average: 9.41095890410959
Hour: 13 Average: 14.741176470588234
Hour: 14 Average: 13.233644859813085
Hour: 15 Average: 38.5948275862069
Hour: 16 Average: 16.796296296296298
Hour: 17 Average: 11.46
Hour: 18 Average: 13.20183486238532
Hour: 19 Average: 10.8
Hour: 20 Average: 21.525
Hour: 21 Average: 16.009174311926607
Hour: 22 Average: 6.746478873239437
Hour: 23 Average: 7.985294117647059


## Looking at Best Times to Post
We want to find the 5 best hours to post, based on the average number of comments per post in each hour. To do this, we sort the averages in descending order and take the first 5 elements.

In [35]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append((row[1], row[0]))
sorted_swap = sorted(swap_avg_by_hour,reverse=True)
print('Top 5 Hours for Ask Posts Comments')
for thing in sorted_swap[0:5]:
    # format the hour as HH:MM so that 15 prints as 15:00 rather than 15:0
    print('{time}: {avg:.2f} average comments per post'.format(time=dt.datetime.strptime(str(thing[1]), '%H').strftime('%H:%M'), avg=thing[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
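
Swapping the tuple order is one way to sort by average. Another option is to pass a `key` function to `sorted`, which leaves the `(hour, average)` pairs as they are. The sketch below uses a few pairs rounded from the averages printed earlier; `avg_sample` and `top_two` are illustrative names.

```python
# A few (hour, average) pairs, rounded from the averages printed earlier.
avg_sample = [(2, 23.81), (15, 38.59), (9, 5.58), (20, 21.52)]

# Sort by the second element (the average), largest first, and keep the top 2.
top_two = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)[:2]

for hour, avg in top_two:
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, avg))
```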


## Analysis of Information
### Time Zone Difference
Before our analysis, it is important to know the time zone of the acquired data. The documentation notes that the times in the data set are in EST (Eastern Time, US). This means that the data is 3 hours ahead of my local time, PST (Pacific Standard Time).
### Analysis
Based on our information, we must subtract 3 hours to find the corresponding time that is relevant to me. The best hour to create an ask post and get the most comments is 15:00 EST, which is 12 PM PST. This may be because people are on a break around that time, so use this time wisely!
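
The conversion can be sketched with simple modular arithmetic, assuming a fixed 3-hour offset between the two time zones as described above. `shift_hour` is a hypothetical helper, and the list holds the top 5 hours from the previous section.

```python
def shift_hour(hour, offset):
    """Shift an hour of the day by `offset` hours, wrapping around midnight."""
    return (hour + offset) % 24

# Top 5 hours from the previous section, in Eastern time.
top_hours_est = [15, 2, 20, 16, 21]

# Pacific time is assumed to be 3 hours behind Eastern time.
top_hours_pst = [shift_hour(h, -3) for h in top_hours_est]
print(top_hours_pst)
```

The modulo keeps the result within 0 to 23, so 2:00 Eastern maps to 23:00 Pacific on the previous day rather than to a negative hour.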