# Analyzing Hacker News Posts

Hacker News is a site similar to Reddit where users can submit stories, vote on them, and comment on them. Hacker News is popular in tech and startup circles, and posts that make it to the top can get hundreds of thousands of visitors.

The dataset we will be analyzing can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). It has been reduced from 300,000 rows to 20,000 by eliminating any submissions without comments and then randomly sampling from the remaining submissions.

Below are the descriptions of the columns:
* **id**: Identifier from Hacker News for the post
* **title**: The title of the post
* **url**: The URL that the post links to, if the post has a URL
* **num_points**: The number of points the post has, calculated as the number of upvotes minus the number of downvotes
* **num_comments**: The number of comments made on the post
* **author**: The username of the person who submitted the post
* **created_at**: The date and time at which the post was submitted

For this project we are interested in posts with titles that begin with either "Ask HN" or "Show HN".

Ask HN posts are used to ask the community specific questions while Show HN posts are to show a project, product, or just something interesting to the community.

We'll compare these two types of posts to determine:
* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

## Reading the Data
Let's read in the data and view the first five rows.

In [1]:
from csv import reader

# Open the file in a context manager so it is closed after reading
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [2]:
# Assign header row to a variable and remove from the original dataset

headers = hn[0]
hn = hn[1:]
print(headers)
print()
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']



[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## Filtering the Data

Since we are only concerned with post titles beginning with Ask HN or Show HN, we will create new 2D lists that contain just the data for those titles. To do this, we will lowercase each title and use the string method `startswith`, so the match is not case-sensitive.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
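
The same classification can be wrapped in a small helper. The sketch below is illustrative only: the function name `classify_title` and the sample titles are assumptions, not part of the original notebook. It relies on `str.lower()` and `str.startswith()` in the same way as the loop above.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles, used only to demonstrate the helper
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "Interactive Dynamic Video",
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

A helper like this keeps the prefix rules in one place if more post types need to be added later.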


## Determining Average Comments by Post Type

Now that we have separated out the types of posts we can determine whether ask posts or show posts receive more comments on average.

In [4]:
# Find total number of comments in ask posts
total_ask_comments = 0
for post in ask_posts:
    num_comments = post[4]
    num_comments = int(num_comments)
    total_ask_comments += num_comments

# Compute and show the avg (total comments divided by the number of ask posts)
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

# Find total number of comments in show posts
total_show_comments = 0
for post in show_posts:
    num_comments = post[4]
    num_comments = int(num_comments)
    total_show_comments += num_comments

# Compute and show the avg (total comments divided by the number of show posts)
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993


It appears in our sample data that on average, ask posts see more commenter engagement than show posts.
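
Because the two averages are computed the same way, the repeated loop could be factored into one function. This is a minimal sketch under assumptions: `average_comments` is a hypothetical helper name, the rows follow the column order shown earlier (`num_comments` at index 4), and the sample rows are made up for illustration.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)  # divide by the number of posts


# Made-up rows in the same column order as the dataset
sample_posts = [
    ["1", "Ask HN: A", "", "5", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: B", "", "3", "4", "user_b", "1/26/2016 19:30"],
]
sample_avg = average_comments(sample_posts)
print(sample_avg)
```

With the real data, the helper would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.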

## Finding the Number of Ask Posts and Comments by Hour Created

So we have determined through our basic analysis that Ask posts receive more comments on average. Since these kinds of posts are more likely to receive comments we will focus our analysis on them. Let's determine whether Ask posts created at a certain ***time*** are more likely to attract commenters.

To do so we will use the following steps:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

In [6]:
import datetime as dt

# Create an empty list to store the time created and the number of comments
result_list = []

# Iterate over ask_posts and append the time and comments to result_list
for row in ask_posts:
    timecreated = row[6]
    numcomments = int(row[4])
    result_list.append([timecreated, numcomments])

# Create two empty dictionaries to act as frequency tables
# One tracks the number of posts made by hour
# One tracks the number of comments made by hour
counts_by_hour = {}
comments_by_hour = {}
# Specify the date format in our data
date_format = "%m/%d/%Y %H:%M"

# Iterate over the result list, extract the hour from the date
for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")  # Select just the hour
    # If the hour is a key, increment the values
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    # If the hour is not a key, create a key and set to 1, create comment key and set equal to comment
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1

# Display the comment totals (only the last expression in a cell is shown)
comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
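
As an alternative sketch, the two frequency tables can be built in a single pass with `collections.defaultdict`, which removes the need for the `if`/`else` branch. The variable `sample_rows` and its values are hypothetical; the date format string is the one used above. Note that this version keys the dictionaries by the integer `hour` attribute rather than by a zero-padded string.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Hypothetical [created_at, num_comments] pairs
sample_rows = [
    ["8/4/2016 11:52", 6],
    ["1/26/2016 11:30", 4],
    ["6/23/2016 22:20", 3],
]

counts = defaultdict(int)    # hour -> number of posts
comments = defaultdict(int)  # hour -> total comments

for created_at, num_comments in sample_rows:
    # Parse the timestamp and keep only the hour of the day
    hour = dt.datetime.strptime(created_at, date_format).hour
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```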

Now that we have the two dictionaries:
* counts_by_hour: contains the number of ask posts created during each hour of the day
* comments_by_hour: contains the corresponding number of comments received by hour

We can calculate the average number of comments for posts created during each hour of the day.

In [8]:
# Create an empty list to store the results

avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr]/counts_by_hour[hr]])
    
avg_by_hour

[['16', 16.796296296296298],
 ['18', 13.20183486238532],
 ['13', 14.741176470588234],
 ['08', 10.25],
 ['07', 7.852941176470588],
 ['21', 16.009174311926607],
 ['09', 5.5777777777777775],
 ['05', 10.08695652173913],
 ['04', 7.170212765957447],
 ['02', 23.810344827586206],
 ['03', 7.796296296296297],
 ['00', 8.127272727272727],
 ['23', 7.985294117647059],
 ['01', 11.383333333333333],
 ['20', 21.525],
 ['11', 11.051724137931034],
 ['19', 10.8],
 ['10', 13.440677966101696],
 ['15', 38.5948275862069],
 ['06', 9.022727272727273],
 ['22', 6.746478873239437],
 ['17', 11.46],
 ['12', 9.41095890410959],
 ['14', 13.233644859813085]]

## Sorting and Displaying the Values

We now have the results we were looking for but the readability of the list leaves something to be desired. Let's try now to present our results in a format that is easier to read and understand.

In [9]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[16.796296296296298, '16'], [13.20183486238532, '18'], [14.741176470588234, '13'], [10.25, '08'], [7.852941176470588, '07'], [16.009174311926607, '21'], [5.5777777777777775, '09'], [10.08695652173913, '05'], [7.170212765957447, '04'], [23.810344827586206, '02'], [7.796296296296297, '03'], [8.127272727272727, '00'], [7.985294117647059, '23'], [11.383333333333333, '01'], [21.525, '20'], [11.051724137931034, '11'], [10.8, '19'], [13.440677966101696, '10'], [38.5948275862069, '15'], [9.022727272727273, '06'], [6.746478873239437, '22'], [11.46, '17'], [9.41095890410959, '12'], [13.233644859813085, '14']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
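
Swapping the columns is one way to sort by the average. Another option, sketched here, is to pass a `key` function to `sorted()` so that the `[hour, average]` order is preserved. The list `sample_avg_by_hour` is a hypothetical name and holds a few values rounded from the results above.

```python
# A few [hour, average] pairs, rounded from the results above
sample_avg_by_hour = [["16", 16.80], ["15", 38.59], ["02", 23.81], ["09", 5.58]]

# Sort by the second element (the average), highest first
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

This avoids building an intermediate swapped list, at the cost of a slightly less obvious `sorted()` call.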

In [10]:
# Print the top 5 hours with the most comments on avg, using the sorted list

print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


## Conclusion

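By working with strings and datetimes, we determined how many Ask HN posts were created in each hour of the day, how many comments those posts received, and the average number of comments per post by hour. In this sample, Ask HN posts created during the 15:00 hour drew the most comments on average (38.59 per post), followed by 02:00 (23.81) and 20:00 (21.52).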