# Exploring Hacker News Posts

## About: 
In this project, we'll work with a data set of submissions to the popular technology site <a href="https://news.ycombinator.com/">Hacker News</a>.

Hacker News is a site started by the startup incubator <a href="https://www.ycombinator.com/">Y Combinator</a>, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

__Data:__ https://www.kaggle.com/hacker-news/hacker-news-posts

For the purpose of this project, we have removed the rows where there are no comments.

Data descriptions of the columns:

- _id_: The unique identifier from Hacker News for the post
- _title_: The title of the post
- _url_: The URL that the post links to, if the post has a URL
- _num_points_: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- _num_comments_: The number of comments that were made on the post
- _author_: The username of the person who submitted the post
- _created_at_: The date and time at which the post was submitted

## Goal: 
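We're specifically interested in posts whose titles begin with either _Ask HN_ or _Show HN_. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the community a project, product, or something else they find interesting.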

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN receive more comments on average?
- Do posts created at a certain time receive more comments on average?

### Introduction

In [2]:
from csv import reader

# Open the file in a with block so it is closed once the rows are read
with open('data/hacker_news.csv', encoding='utf8') as f:
    hn = list(reader(f))
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['10176908',
  'Dying vets fuck you letter (2013)',
  'http://dangerousminds.net/comments/dying_vets_fuck_you_letter_to_george_bush_dick_cheney_needs_to_be_read',
  '10',
  '2',
  'mycodebreaks',
  '9/6/2015 5:56'],
 ['10176919',
  'Ask HN: What is/are your favorite quote(s)?',
  '',
  '15',
  '20',
  'kumarski',
  '9/6/2015 6:02'],
 ['10176923',
  "Why we aren't tempted to use ACLs on our Unix machines",
  'https://utcc.utoronto.ca/~cks/space/blog/sysadmin/NoACLTemptation',
  '34',
  '23',
  'mjn',
  '9/6/2015 6:03'],
 ['10176974',
  "Google's new logo was created by Russian designer in 2008",
  'http://www.dailytech.com/Exclusive+Googles+New+Search+Icon+Was+Created+in+2008+by+Russian+Designer/article37480.htm',
  '25',
  '12',
  'usaphp',
  '9/6/2015 6:49']]

### Removing Headers from a List of Lists

In [3]:
headers = hn[0]
hn = hn[1:]

print(headers)
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['10176908',
  'Dying vets fuck you letter (2013)',
  'http://dangerousminds.net/comments/dying_vets_fuck_you_letter_to_george_bush_dick_cheney_needs_to_be_read',
  '10',
  '2',
  'mycodebreaks',
  '9/6/2015 5:56'],
 ['10176919',
  'Ask HN: What is/are your favorite quote(s)?',
  '',
  '15',
  '20',
  'kumarski',
  '9/6/2015 6:02'],
 ['10176923',
  "Why we aren't tempted to use ACLs on our Unix machines",
  'https://utcc.utoronto.ca/~cks/space/blog/sysadmin/NoACLTemptation',
  '34',
  '23',
  'mjn',
  '9/6/2015 6:03'],
 ['10176974',
  "Google's new logo was created by Russian designer in 2008",
  'http://www.dailytech.com/Exclusive+Googles+New+Search+Icon+Was+Created+in+2008+by+Russian+Designer/article37480.htm',
  '25',
  '12',
  'usaphp',
  '9/6/2015 6:49'],
 ['10176976',
  'My Keyboard',
  'http://zyghost.com/articles/My-Keyboard.html',
  '144',
  '105',
  'efnx',
  '9/6/2015 6:49']]

### Extracting Ask HN and Show HN Posts

We're ready to filter our data. Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

To find the posts that begin with either Ask HN or Show HN, we'll use the string method startswith.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]  # Index for Title is 1
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

6911
5059
68431
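
As a minimal sketch of the filtering step above, the same `startswith` check can be wrapped in a small helper and tried on a few made-up rows. The function name `classify_title` and the sample titles are illustrative assumptions and are not part of the data set.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()  # lowercase so 'Ask HN' and 'ASK HN' both match
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles used only for illustration
sample_titles = [
    'Ask HN: What is your favorite editor?',
    'SHOW HN: My weekend project',
    'Why I ask HN for advice',
]

labels = [classify_title(t) for t in sample_titles]
print(labels)
```

The third title contains "ask HN" but does not begin with it, so it falls into the "other" group. This is why the check uses `startswith` rather than a substring test.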


### Calculating the Average Number of Comments for Ask HN and Show HN Posts

1. Find the total number of comments in ask posts and assign it to total_ask_comments.
    - Set total_ask_comments to 0.
2. Use a for loop to iterate over the ask posts.
    - Because the num_comments column is the fifth column in ask_posts, you'll need to get the element at index 4 in each row.
    - You'll also need to convert the value to an integer so that we can calculate the sum of all the comments.
    - Add this value to total_ask_comments.
3. Compute the average number of comments on ask posts and assign it to avg_ask_comments.
4. Print avg_ask_comments.
5. Find the total number of comments in show posts and assign it to total_show_comments.
    - Set total_show_comments to 0.
6. Use a for loop to iterate over the show posts.
    - Because the num_comments column is the fifth column in show_posts, you'll need to get the element at index 4 in each row.
    - You'll also need to convert the value to an integer so that we can calculate the sum of all the comments.
    - Add this value to total_show_comments.
7. Compute the average number of comments on show posts and assign it to avg_show_comments.
8. Print avg_show_comments.
9. Do show posts or ask posts receive more comments on average? Write a markdown cell explaining your findings.

In [5]:
"""
Average Number of Comments for the Ask HN Posts
"""
total_ask_comments = 0

# Iterate through the ask_posts list
for row in ask_posts:
    num_comments = int(row[4]) # Index is 4 for number of comments
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

"""
Average Number of Comments for the Show HN Posts
"""
total_show_comments = 0

# Loop through the show_posts list
for row in show_posts:
    num_comments = int(row[4])  # Index is 4 for number of comments
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

13.744175951381855
9.810832180272781


On average, Ask HN posts receive more comments than the Show HN posts.
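
The two loops above repeat the same logic for each list. One possible way to avoid the repetition is a small helper, sketched here under the assumption that every row keeps num_comments at index 4. The name `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the integer values in the num_comments column."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the same column order as the data set
sample_posts = [
    ['1', 'Ask HN: A?', '', '5', '10', 'user_a', '9/6/2015 6:02'],
    ['2', 'Ask HN: B?', '', '3', '20', 'user_b', '9/6/2015 7:15'],
]

sample_avg = average_comments(sample_posts)
print(sample_avg)
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to give the same averages as the loops in the cell above.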

### Finding the Number of Ask Posts and Comments by Hour Created

Since ask posts receive more comments on average, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

In [7]:
import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6] # Index is 6 for post creation date
    num_comments = int(row[4]) # Index is 4 for number of comments
    result_list.append([created_at, num_comments])
    
counts_by_hour = {}
comments_by_hour = {}

# Loop through the result_list list
for row in result_list:
    date_str = row[0]
    comment = row[1]
    date_dt = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")
    hour = date_dt.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
comments_by_hour

{'00': 2277,
 '01': 2089,
 '02': 2996,
 '03': 2154,
 '04': 2360,
 '05': 1838,
 '06': 1587,
 '07': 1585,
 '08': 2362,
 '09': 1477,
 '10': 3013,
 '11': 2797,
 '12': 4234,
 '13': 7245,
 '14': 4972,
 '15': 18525,
 '16': 4466,
 '17': 5547,
 '18': 4877,
 '19': 3954,
 '20': 4462,
 '21': 4500,
 '22': 3372,
 '23': 2297}
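
The hour extraction and the two running tallies above can also be sketched with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. The sample pairs and the names `sample_counts` and `sample_comments` below are made up for illustration. The date format string is the one used in the cell above.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs
sample = [
    ['9/6/2015 6:02', 20],
    ['9/6/2015 6:49', 5],
    ['8/16/2016 15:30', 7],
]

sample_counts = defaultdict(int)    # posts per hour, missing keys start at 0
sample_comments = defaultdict(int)  # comments per hour, missing keys start at 0

for date_str, n_comments in sample:
    parsed = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded hour, e.g. '06'
    sample_counts[hour] += 1
    sample_comments[hour] += n_comments

print(dict(sample_counts))
print(dict(sample_comments))
```

Note that `strptime` accepts the non-padded month, day, and hour found in the data (for example `9/6/2015 6:02`), while `strftime("%H")` always returns a two-digit hour.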

### Calculating the Average Number of Comments for Ask HN Posts by Hour

We'll use the two dictionaries we created to calculate the average number of comments for posts created during each hour of the day.

In [8]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_comment = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg_comment])
    
avg_by_hour

[['03', 10.160377358490566],
 ['22', 11.749128919860627],
 ['14', 13.153439153439153],
 ['07', 10.095541401273886],
 ['10', 13.757990867579908],
 ['15', 39.66809421841542],
 ['04', 12.688172043010752],
 ['23', 8.322463768115941],
 ['19', 9.414285714285715],
 ['08', 12.43157894736842],
 ['13', 22.2239263803681],
 ['05', 11.139393939393939],
 ['11', 11.143426294820717],
 ['18', 10.789823008849558],
 ['06', 9.017045454545455],
 ['12', 15.452554744525548],
 ['17', 13.73019801980198],
 ['21', 11.056511056511056],
 ['16', 10.76144578313253],
 ['00', 9.857142857142858],
 ['02', 13.198237885462555],
 ['01', 9.367713004484305],
 ['09', 8.392045454545455],
 ['20', 11.38265306122449]]
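
As an alternative sketch, the same per-hour division can be written as a dictionary comprehension, which keeps each hour paired with its average. The two small dictionaries below are hypothetical stand-ins for `comments_by_hour` and `counts_by_hour`.

```python
# Hypothetical totals, standing in for comments_by_hour and counts_by_hour
sample_comments_by_hour = {'06': 25, '15': 7}
sample_counts_by_hour = {'06': 2, '15': 1}

# Divide total comments by post count for each hour
sample_avg_by_hour = {
    hour: sample_comments_by_hour[hour] / sample_counts_by_hour[hour]
    for hour in sample_comments_by_hour
}

print(sample_avg_by_hour)
```

This assumes both dictionaries share the same keys, which holds in the cell above because they are always updated together.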

### Sorting and Printing Values from a List of Lists

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [9]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[10.160377358490566, '03'], [11.749128919860627, '22'], [13.153439153439153, '14'], [10.095541401273886, '07'], [13.757990867579908, '10'], [39.66809421841542, '15'], [12.688172043010752, '04'], [8.322463768115941, '23'], [9.414285714285715, '19'], [12.43157894736842, '08'], [22.2239263803681, '13'], [11.139393939393939, '05'], [11.143426294820717, '11'], [10.789823008849558, '18'], [9.017045454545455, '06'], [15.452554744525548, '12'], [13.73019801980198, '17'], [11.056511056511056, '21'], [10.76144578313253, '16'], [9.857142857142858, '00'], [13.198237885462555, '02'], [9.367713004484305, '01'], [8.392045454545455, '09'], [11.38265306122449, '20']]


[[39.66809421841542, '15'],
 [22.2239263803681, '13'],
 [15.452554744525548, '12'],
 [13.757990867579908, '10'],
 [13.73019801980198, '17'],
 [13.198237885462555, '02'],
 [13.153439153439153, '14'],
 [12.688172043010752, '04'],
 [12.43157894736842, '08'],
 [11.749128919860627, '22'],
 [11.38265306122449, '20'],
 [11.143426294820717, '11'],
 [11.139393939393939, '05'],
 [11.056511056511056, '21'],
 [10.789823008849558, '18'],
 [10.76144578313253, '16'],
 [10.160377358490566, '03'],
 [10.095541401273886, '07'],
 [9.857142857142858, '00'],
 [9.414285714285715, '19'],
 [9.367713004484305, '01'],
 [9.017045454545455, '06'],
 [8.392045454545455, '09'],
 [8.322463768115941, '23']]
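
Swapping the columns is one way to sort by the average. Another option, sketched below, is to pass a `key` function to `sorted` so that the original `[hour, average]` order is kept. The three rows are rounded values from the output above, and the name `sorted_by_avg` is an assumption for this example.

```python
# A few [hour, average] pairs, rounded from the results above
sample_avg_by_hour = [['03', 10.16], ['15', 39.67], ['13', 22.22]]

# Sort on the second element (the average), largest first
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

Both approaches should give the same ranking unless two hours share exactly the same average. In that case the swapped-list sort breaks the tie by hour, while the `key` version keeps the original relative order.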

In [10]:
# Print the five highest values from the sorted list
print("Top 5 Hours for Ask Posts Comments")

for avg, hr in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg))

Top 5 Hours for Ask Posts Comments
15:00: 39.67 average comments per post
13:00: 22.22 average comments per post
12:00: 15.45 average comments per post
10:00: 13.76 average comments per post
17:00: 13.73 average comments per post


According to the <a href="https://www.kaggle.com/hacker-news/hacker-news-posts">dataset documentation</a>, the timezone is Eastern Time in the US.

Converted to my timezone (US Central Time), posts created between 11am and 2pm are more likely to receive comments. 3pm Eastern Time (2pm Central) is the peak hour, with an average of 39.67 comments per post.
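
The conversion from Eastern to Central Time can be sketched as a simple one-hour shift. This assumes that Central Time is always one hour behind Eastern Time, which holds while both zones switch to daylight saving time on the same schedule. The helper name `eastern_to_central` is hypothetical.

```python
def eastern_to_central(hour_et):
    """Shift an hour of the day (0-23) from US Eastern to US Central Time."""
    return (hour_et - 1) % 24  # wrap around so 0 ET becomes 23 CT

# Top five hours from the results above, in Eastern Time
top_hours_et = [15, 13, 12, 10, 17]
top_hours_ct = [eastern_to_central(h) for h in top_hours_et]

print(top_hours_ct)
```

For other timezones, a library-based conversion (for example with Python's `zoneinfo` module) would likely be more robust than a fixed offset.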

## Conclusion

In this project, we analyzed the Ask HN and Show HN posts. Users submit Ask HN posts to ask the Hacker News community a specific question.

Note: For the purpose of this project, we have removed the rows where there are no comments.

We compared these two types of posts and found that Ask HN posts receive more comments on average than Show HN posts. We then looked at the Ask HN posts to determine the hours at which posts receive the most comments on average. We found that posts created around 2pm Central Time (3pm Eastern Time) received the most comments on average.