# Exploring Ask HN and Show HN Hacker News Posts

## Introduction
Hacker News (sometimes abbreviated as HN) is a social news website focusing on computer science and entrepreneurship. It is run by the investment fund and startup incubator Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity."[[1]](https://en.wikipedia.org/wiki/Hacker_News) Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.
## Project Description
In this project, we're specifically interested in posts with titles that begin with either `Ask HN` or `Show HN`.

Users submit `Ask HN` posts to ask the Hacker News community a specific question. Below are a few examples:

>Ask HN: How to improve my personal website?
<br>Ask HN: Am I the only one outraged by Twitter shutting down share counts?
<br>Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just something interesting. Below are a few examples:

>Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
<br>Show HN: Something pointless I made
<br>Show HN: Shanhu.io, a programming playground powered by e8vm
## Project Goal
In our analysis, we will compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

---

## Dataset

Our data comes from a CSV file based on the [Hacker News Posts](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts) dataset. We import the `csv` module so we can parse the file with `csv.reader()`, and we import `datetime` under the alias `dt` because we will be working with date and time values. Note that the dataset was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions. This reduction was done by [Dataquest](https://www.dataquest.io/) for the purposes of this guided project.

In [1]:
import csv
import datetime as dt

In [2]:
with open('hacker_news.csv', encoding = 'utf-8') as dataset:
    read_file = csv.reader(dataset)
    hn = list(read_file)
header = hn[0]
hn = hn[1:]
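
One detail worth keeping in mind for the later steps is that `csv.reader` returns every field as a string. The sketch below illustrates this on a small in-memory sample; the sample row and the names `sample_csv`, `sample_header`, and `sample_rows` are made up for illustration and are not part of the dataset.

```python
import csv
import io

# Hypothetical one-row sample using the same column layout as hacker_news.csv.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,2,6,some_user,8/16/2016 9:55\n"
)

# io.StringIO lets csv.reader treat the string like an open file.
rows = list(csv.reader(io.StringIO(sample_csv)))
sample_header, sample_rows = rows[0], rows[1:]

# Every field is a string, so numeric columns need int() before arithmetic.
raw_value = sample_rows[0][4]
num_comments = int(raw_value)

print(sample_header)
print(type(raw_value).__name__, num_comments)  # str 6
```

This is why the analysis cells further down wrap `row[4]` in `int()` before summing.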

To get a first look at our dataset, we are going to print the header and the first five posts.

In [3]:
print('Header:',header, sep = '\n')
print('\nFirst five rows:')
for row in range(5):
    print(hn[row],'\n')

Header:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

First five rows:
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/

As we can see, our dataset has 7 columns:

|Index|Column Name|Description|
|:---|:---|:---|
|0|id|unique identifier of the post|
|1|title|title of the post|
|2|url|URL that the post links to|
|3|num_points|number of points the post acquired (total upvotes minus total downvotes)|
|4|num_comments|number of comments on the post|
|5|author|username of the person who submitted the post|
|6|created_at|date and time of post's submission|


---

## Data Cleaning

Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.
<br><br>To find the posts that begin with either Ask HN or Show HN, we'll use the string method `startswith`. Given a string object, say `string1`, we can check whether it starts with, say, `dq` by inspecting the result of `string1.startswith('dq')`: it returns `True` if `string1` starts with `dq` and `False` otherwise. Note that `startswith` is case sensitive, as the example below shows:
>print('dataquest'.startswith('Data'))
<br>print('dataquest'.startswith('data'))
><br>False<br>True
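
Because `startswith` is case sensitive, one common way to match titles regardless of capitalization is to lowercase them first. The sketch below shows that idea; the helper name `classify` and the all-caps title are hypothetical and only serve as an illustration.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case insensitive)."""
    normalized = title.lower()
    if normalized.startswith('ask hn'):
        return 'ask'
    if normalized.startswith('show hn'):
        return 'show'
    return 'other'

print(classify('Ask HN: How to improve my personal website?'))  # ask
print(classify('SHOW HN: Something pointless I made'))          # show
print(classify('Interactive Dynamic Video'))                    # other

# startswith also accepts a tuple of prefixes to test several at once.
print('show hn: demo'.startswith(('ask hn', 'show hn')))        # True
```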

In the code below, we extract the `Ask HN` and `Show HN` posts. Titles are lowercased before the check so that the match is case insensitive.

In [4]:
ask_posts = list()
show_posts = list()
other_posts = list()

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

Since we want to answer the question *"Do posts created at a certain time receive more comments on average?"*, we know that we are going to work with datetime data. Our dataset has only one date column, `created_at`, which records the date and time each post was created. Because we now have several lists of posts, we are going to define a function that converts the date string into a datetime object in place, and then apply that function to each list.
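
Before converting whole lists, it can help to see what the conversion does to a single value. The minimal sketch below parses one `created_at` string taken from the first rows printed earlier, using the format string `'%m/%d/%Y %H:%M'`; the variable name `parsed` is just for illustration.

```python
import datetime as dt

# Month/day/year followed by 24-hour time, matching values like '8/4/2016 11:52'.
date_format = '%m/%d/%Y %H:%M'
parsed = dt.datetime.strptime('8/4/2016 11:52', date_format)

print(parsed)                 # 2016-08-04 11:52:00
print(parsed.hour)            # 11 (an int)
print(parsed.strftime('%H'))  # 11 (a zero-padded string)
```

Both `.hour` and `.strftime('%H')` expose the hour; the analysis below uses the string form as a dictionary key.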

In [5]:
def convert_date(dataset,index):
    for row in dataset:
        created_at = row[index]
        date_format = '%m/%d/%Y %H:%M'
        row[index] = dt.datetime.strptime(created_at, date_format)

In [6]:
convert_date(ask_posts, 6)
convert_date(show_posts, 6)

In [7]:
print('Ask HN sample post:')
print(ask_posts[0])

print('\nShow HN sample post:')
print(show_posts[0])

Ask HN sample post:
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', datetime.datetime(2016, 8, 16, 9, 55)]

Show HN sample post:
['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', datetime.datetime(2015, 11, 25, 14, 3)]


As we can observe, the last column of our dataset was converted to a datetime object. Now that we have a clean dataset we can start doing our analysis.

---

## Data Analysis

We want to answer the question *"Do Ask HN or Show HN receive more comments on average?"*. First, let's look at the number of posts in each of the lists we extracted earlier.

In [8]:
print('Ask HN number of posts:', '{:,}'.format(len(ask_posts)))
print('Show HN number of posts:', '{:,}'.format(len(show_posts)))
print('Number of other posts:', '{:,}'.format(len(other_posts)))

Ask HN number of posts: 1,744
Show HN number of posts: 1,162
Number of other posts: 17,194


As we can see, `Ask HN` posts are more common than `Show HN` posts in this sample, so users more often ask the Hacker News community a question than show it a project. Next, we will see which type of post gets more engagement. We can measure the engagement of the community by looking at the number of comments that each post receives.

In [9]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = round((total_ask_comments / len(ask_posts)),2)

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = round((total_show_comments / len(show_posts)),2)

print('Average comments of Ask HN posts:', avg_ask_comments)
print('Average comments of Show HN posts:', avg_show_comments)

Average comments of Ask HN posts: 14.04
Average comments of Show HN posts: 10.32


We can observe that, besides being posted more often, `Ask HN` posts also receive more comments on average.
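
Since both averages are computed the same way, the calculation could also be factored into a small helper so that the total and the divisor always come from the same list. The sketch below is one possible version; the function name `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for a list of rows, rounded to 2 places."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return round(total / len(posts), 2)

# Hypothetical rows in the same column layout as the dataset.
sample_posts = [
    ['1', 'Ask HN: A?', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: B?', '', '5', '29', 'user_b', '11/22/2015 13:43'],
    ['3', 'Ask HN: C?', '', '1', '1', 'user_c', '5/2/2016 10:14'],
]

print(average_comments(sample_posts))  # 12.0
```

With the real data, this would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.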

<br>Now we are going to answer the question *"Do posts created at a certain time receive more comments on average?"*.

To answer this question, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

To start with, we are going to create an empty list named `result_list` that will store, for each post, the datetime at which it was created and its number of comments.

In [10]:
result_list = list()

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result = [created_at, num_comments]
    result_list.append(result)
print(result_list[:3])

[[datetime.datetime(2016, 8, 16, 9, 55), 6], [datetime.datetime(2015, 11, 22, 13, 43), 29], [datetime.datetime(2016, 5, 2, 10, 14), 1]]


Now that we have our list of datetimes and comment counts, we are going to build two frequency tables as dictionaries keyed by hour: `counts_by_hour`, which will store the number of posts created in each hour, and `comments_by_hour`, which will store the total number of comments those posts received.

In [11]:
counts_by_hour = dict()
comments_by_hour = dict()

for row in result_list:
    time = row[0].strftime('%H')
    num_comments = row[1]
    if time in counts_by_hour:
        counts_by_hour[time] += 1
        comments_by_hour[time] += num_comments
    else:
        # First post seen for this hour: initialize both tables.
        counts_by_hour[time] = 1
        comments_by_hour[time] = num_comments
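
As an alternative, the same two tables can be built without an explicit membership check by using `collections.defaultdict`. The sketch below shows this on a small sample; the first three rows come from the `result_list` preview above, while the fourth row and the names `sample_results`, `counts`, and `comments` are made up for illustration.

```python
from collections import defaultdict
import datetime as dt

sample_results = [
    [dt.datetime(2016, 8, 16, 9, 55), 6],
    [dt.datetime(2015, 11, 22, 13, 43), 29],
    [dt.datetime(2016, 5, 2, 10, 14), 1],
    [dt.datetime(2016, 5, 3, 9, 5), 4],  # hypothetical extra row
]

# Missing keys start at 0, so no "if hour in ..." branch is needed.
counts = defaultdict(int)
comments = defaultdict(int)

for created_at, num_comments in sample_results:
    hour = created_at.strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {'09': 2, '13': 1, '10': 1}
print(dict(comments))  # {'09': 10, '13': 29, '10': 1}
```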

Let's print the two frequency tables to see what they contain.

In [12]:
print(counts_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


In [13]:
print(comments_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


Now that the two frequency tables are complete, we are going to use them to calculate the average number of comments per post for each hour of the day.

In [14]:
avg_comments_per_hour = list()

for hour, comments in comments_by_hour.items():
    comments_per_hour = round((comments / counts_by_hour[hour]),2)
    hour_comments = comments_per_hour, hour
    avg_comments_per_hour.append(hour_comments)

sorted_avg_comment = sorted(avg_comments_per_hour, reverse = True)
print('Top five hours for Ask HN posts by average comments per post')
for rank, (avg_comments, hour) in enumerate(sorted_avg_comment[:5], start = 1):
    print('{rank}. {hour}:00 averages {avg:.2f} comments per post.'.format(rank = rank, hour = hour, avg = avg_comments))

Top five hours for Ask HN posts by average comments per post
1. 15:00 averages 38.59 comments per post.
2. 02:00 averages 23.81 comments per post.
3. 20:00 averages 21.52 comments per post.
4. 16:00 averages 16.80 comments per post.
5. 21:00 averages 16.01 comments per post.


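From the list above, ask posts created during the 15:00 hour receive by far the most comments per post on average (38.59), followed by those created at 02:00 (23.81) and 20:00 (21.52).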
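
The ranking can also be derived directly from the two frequency tables with a dictionary comprehension and `sorted` with a `key` function, which avoids swapping the tuple order. The sketch below copies the values printed for `counts_by_hour` and `comments_by_hour` above; the names `avg_by_hour` and `top_five` are hypothetical.

```python
# Values copied from the printed frequency tables.
counts_by_hour = {
    '09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68,
    '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58,
    '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71,
    '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58,
}
comments_by_hour = {
    '09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543,
    '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381,
    '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479,
    '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641,
}

# Average comments per post for each hour.
avg_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in counts_by_hour
}

# Sort (hour, average) pairs by the average, highest first, and keep five.
top_five = sorted(avg_by_hour.items(), key=lambda item: item[1], reverse=True)[:5]

for hour, avg in top_five:
    print('{}:00 -> {} comments per post'.format(hour, round(avg, 2)))
```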

---

## Conclusion
<br>In this project, we analyzed ask posts and show posts to determine which type of post and which hour of creation receive the most comments on average. Based on our analysis, to maximize the amount of engagement a post receives, we'd recommend that it be an ask post created between 15:00 and 16:00.

However, it should be noted that the data set we analyzed excluded posts without any comments. Given that, it's more accurate to say that of the posts that received comments, ask posts received more comments on average and ask posts created between 15:00 and 16:00 received the most comments on average.