# Discover posts on Hacker News

Imagine that my company wants to attract attention on [Hacker News](https://news.ycombinator.com/) so that more people talk about its brand and products. In this project, I analyze which type of post and which hour of the day attract the most comments on Hacker News.

We'll use [the Hacker News Posts dataset shared on Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts) for this project. We need to clean the dataset, then compute the average number of comments per post type and per hour.

**Summary of results**

**Ask HN** posts and posts created at **15:00** (Eastern Time) attract the most comments.

## Exploring the dataset

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

In [2]:
# Load dataset
df = pd.read_csv('HN_posts_year_to_Sep_26_2016.csv')
df.head(2)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12579008,You have two days to comment if you want stem ...,http://www.regulations.gov/document?D=FDA-2015...,1,0,altstar,9/26/2016 3:26
1,12579005,SQLAR the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24


In [3]:
df.shape

(293119, 7)

Our dataset has 293,119 rows and 7 columns. The column names are self-explanatory; *num_points* is the number of upvotes minus the number of downvotes.

## Deleting posts with 0 comments

Our purpose is to analyze which posts attract comments, so we delete the posts with 0 comments.

In [4]:
# Delete posts with 0 comments
df = df[df.num_comments > 0]
print(f'Total posts with comments: {df.shape[0]}')

Total posts with comments: 80401


## Extracting Ask HN and Show HN Posts
Two popular types of posts on Hacker News are **Ask HN** and **Show HN**; their titles begin with *Ask HN* or *Show HN*.

In [5]:
# Lowercase the titles so that the prefix filter is case-insensitive
df.title = df.title.str.lower()

# Extract Ask HN Posts
df_ask = df[df.title.str.startswith('ask hn')]
df_ask.head(2)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
10,12578908,ask hn: what tld do you use for local developm...,,4,7,Sevrene,9/26/2016 2:53
42,12578522,ask hn: how do you pass on your work when you ...,,6,3,PascLeRasc,9/26/2016 1:17


In [6]:
# Extract Show HN Posts
df_show = df[df.title.str.startswith('show hn')]
df_show.head(2)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
140,12577142,show hn: jumble essays on the go #paulinyourp...,https://itunes.apple.com/us/app/jumble-find-st...,1,1,ryderj,9/25/2016 20:06
177,12576813,show hn: learn japanese vocab via multiple cho...,http://japanese.vul.io/,1,1,soulchild37,9/25/2016 19:06
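
The two filters above can also be expressed as a single labelled column, which keeps every post in one frame. The following is a minimal sketch on made-up titles; the `classify_title` helper and the `post_type` column are hypothetical names that do not appear in the original notebook.

```python
import pandas as pd

# Toy titles for illustration only; they are not rows from the dataset.
toy = pd.DataFrame({
    'title': [
        'Ask HN: What editor do you use?',
        'Show HN: My weekend project',
        'A story about databases',
        'ask hn: lowercase works too',
    ]
})

def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical column name: post_type
toy['post_type'] = toy['title'].apply(classify_title)
print(toy['post_type'].tolist())
```

Lowercasing inside the helper leaves the original titles untouched, which may be preferable if the titles are displayed later.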


## Calculating the average number of comments for Ask HN and Show HN Posts
We'll determine if Ask posts or Show posts receive more comments on average.

In [7]:
# Calculate the average number of comments for Ask HN and Show HN posts
avg_ask_comments = df_ask.num_comments.sum() / df_ask.shape[0]
avg_show_comments = df_show.num_comments.sum() / df_show.shape[0]

print('Average Ask posts comments:', avg_ask_comments)
print('Average Show posts comments:', avg_show_comments)

Average Ask posts comments: 13.744175951381855
Average Show posts comments: 9.810832180272781


The results show Ask posts attract more comments than Show posts. Since Ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
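
As a cross-check, the same comparison could be computed in one step with `groupby`. This is a sketch on invented numbers, assuming a hypothetical `post_type` column that labels each row; it is not the notebook's own method.

```python
import pandas as pd

# Toy data for illustration only; the numbers are invented.
toy = pd.DataFrame({
    'post_type': ['ask', 'ask', 'show', 'show', 'show'],
    'num_comments': [10, 20, 3, 6, 9],
})

# mean() equals sum() divided by the number of rows in each group.
avg_by_type = toy.groupby('post_type')['num_comments'].mean()
print(avg_by_type)
```

Here the `ask` group averages (10 + 20) / 2 and the `show` group averages (3 + 6 + 9) / 3, the same sum-over-count calculation used in the cell above.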

## Finding the average number of comments on Ask posts by hour created

Next, we'll determine whether Ask posts created at a certain hour are more likely to attract comments. We'll use the following steps to perform this analysis:

- Add an hour-created column.
- Calculate the average number of comments that Ask posts receive for each hour created.

In [8]:
# Add hour created column (copy first so we don't assign into a slice of df)
df_ask = df_ask.copy()
df_ask.created_at = pd.to_datetime(df_ask.created_at)
df_ask['hour'] = df_ask.created_at.dt.hour

In [9]:
df_ask.head(2)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at,hour
10,12578908,ask hn: what tld do you use for local developm...,,4,7,Sevrene,2016-09-26 02:53:00,2
42,12578522,ask hn: how do you pass on your work when you ...,,6,3,PascLeRasc,2016-09-26 01:17:00,1


In [10]:
# Calculate the average number of comments by hour, sorted from highest to lowest
df_avg_comments_by_hour = (
    df_ask.groupby('hour')['num_comments'].mean()
    .reset_index(name='average_comments')
    .sort_values(by='average_comments', ascending=False)
    .reset_index(drop=True)
)
df_avg_comments_by_hour

Unnamed: 0,hour,average_comments
0,15,39.668094
1,13,22.223926
2,12,15.452555
3,10,13.757991
4,17,13.730198
5,2,13.198238
6,14,13.153439
7,4,12.688172
8,8,12.431579
9,22,11.749129


Top 5 hours for comments on Ask posts

In [11]:
df_avg_comments_by_hour.nlargest(5,'average_comments',keep='all')

Unnamed: 0,hour,average_comments
0,15,39.668094
1,13,22.223926
2,12,15.452555
3,10,13.757991
4,17,13.730198


According to the [data set documentation](https://www.kaggle.com/hacker-news/hacker-news-posts), the *created_at* column's time zone is Eastern Time (ET) in the US.
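
Readers outside the US may want these hours in another time zone. The sketch below shows one way to do that on toy timestamps; it assumes that `America/New_York` is the right IANA zone name for the dataset's Eastern Time and that the timestamps follow the `month/day/year hour:minute` pattern seen in the sample rows. The `hour_et` and `hour_utc` columns are hypothetical names.

```python
import pandas as pd

# Toy timestamps in the same text format as created_at; illustration only.
toy = pd.DataFrame({'created_at': ['9/26/2016 15:00', '9/26/2016 13:30']})

# Parse the strings, then attach the Eastern time zone (assumed zone name).
eastern = (
    pd.to_datetime(toy['created_at'], format='%m/%d/%Y %H:%M')
    .dt.tz_localize('America/New_York')
)

# Read off the hour in Eastern Time and in UTC.
toy['hour_et'] = eastern.dt.hour
toy['hour_utc'] = eastern.dt.tz_convert('UTC').dt.hour
print(toy)
```

On the full dataset, timestamps that fall inside a daylight-saving transition may need the `ambiguous` and `nonexistent` arguments of `tz_localize`.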

The best time to post is 15:00 - 16:00 ET, with an average of 39.67 comments per post. The second-best time is 13:00 - 14:00 ET, with an average of 22.22 comments per post.

## Conclusion
To attract attention to my company on Hacker News, the marketing staff should focus on writing posts in the Ask HN category. The best time to post is 15:00 - 16:00 ET, followed by 13:00 - 14:00 ET.

Please note that this analysis excludes posts that received no comments, so every average above is an average over posts with at least one comment.
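
To gauge how much this exclusion matters, recall that the average over all posts equals the average over commented posts multiplied by the share of posts that received a comment. In this dataset, 80,401 of 293,119 posts (about 27%) have at least one comment. The sketch below illustrates the relationship on invented numbers; all variable names are hypothetical.

```python
import pandas as pd

# Toy comment counts for illustration only; the numbers are invented.
toy = pd.DataFrame({'num_comments': [0, 0, 0, 4, 8]})

# Average over every post, including those with no comments.
mean_all = toy['num_comments'].mean()

# Average over posts with at least one comment, and their share of all posts.
commented = toy[toy['num_comments'] > 0]
mean_commented = commented['num_comments'].mean()
share_commented = len(commented) / len(toy)

print(mean_all, mean_commented, share_commented)
```

Because the share is at most 1, averages computed after the filter are never lower than averages over all posts.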