# Discover posts on Hacker News

Imagine that my company is trying to attract attention on [Hacker News](https://news.ycombinator.com/) so that more people talk about the company's brand and products. In this project, I need to find out which type of post and which time of day attract the most comments on Hacker News.

We'll use [the Hacker News Posts dataset shared on Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts) for this project. We need to clean the dataset, then compute the average number of comments per post type and per hour.

**Summary of results**

**Ask HN** posts created around **15:00** (Eastern Time) attract the most comments.

## Exploring the dataset

In [1]:
# Import the CSV file and convert it to a list of lists
from csv import reader

# The with-block closes the file once it has been read
with open('HN_posts_year_to_Sep_26_2016.csv') as open_file:
    hn_posts = list(reader(open_file))

hn_posts_header = hn_posts[0]
hn_posts = hn_posts[1:]

#Explore dataset
def explore_dataset(dataset,start=0,end=5):
    for row in dataset[start:end]:
        print(row)

print('Total posts:',len(hn_posts))
print(hn_posts_header)
explore_dataset(hn_posts)

Total posts: 293119
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


Our dataset has 293,119 rows. The column names are self-explanatory; *num_points* is the number of upvotes minus the number of downvotes.

## Deleting posts with 0 comments

Our goal is to analyze which posts attract comments, so we remove the posts with 0 comments.

In [2]:
# Keep only the posts that have at least one comment
hn_posts_has_comments = []

for row in hn_posts:
    num_comments = int(row[4])
    if num_comments != 0:
        hn_posts_has_comments.append(row)

print('Total posts with comments:',len(hn_posts_has_comments))

Total posts with comments: 80401


## Extracting Ask HN and Show HN Posts
Two popular types of posts on Hacker News are **Ask HN** and **Show HN**; the titles of these posts begin with *Ask HN* or *Show HN*.
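
As a quick illustration of the prefix rule, here is a minimal sketch that labels a title by its prefix. The helper name `classify_title` and the label strings are hypothetical and not part of the original notebook; the sample titles are taken from the dataset output shown in this notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix.

    Hypothetical helper for illustration; the comparison is case-insensitive.
    """
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Titles copied from the dataset output in this notebook
sample_titles = [
    'Ask HN: What TLD do you use for local development?',
    'Show HN: Learn Japanese Vocab via multiple choice questions',
    'algorithmic music',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowercasing the title first means that variants such as *ask hn* or *ASK HN* are treated the same way.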

In [3]:
#Extracting Ask HN and Show HN Posts
ask_posts = []
show_posts = []
other_posts = []

for row in hn_posts_has_comments:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Ask HN:',len(ask_posts))
explore_dataset(ask_posts)
print('Show HN:',len(show_posts))
explore_dataset(show_posts)
print('Other HN:',len(other_posts))
explore_dataset(other_posts)

Ask HN: 6911
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']
['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']
['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']
['12576946', 'Ask HN: How hard would it be to make a cheap, hackable phone?', '', '2', '1', 'hkt', '9/25/2016 19:30']
Show HN: 5059
['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']
['12576813', 'Show HN: Learn Japanese Vocab via multiple choice questions', 'http://japanese.vul.io/', '1', '1', 'soulchild37', '9/25/2016 19:06']
['12576090', 'Show HN: Markov chain Twitter bot. Trained on comm

## Calculating the Average Number of Comments for Ask HN and Show HN Posts
We'll determine if Ask posts or Show posts receive more comments on average.

In [4]:
#Calculating the Average Number of Comments for Ask HN and Show HN Posts
total_ask_posts_comments = 0
total_show_posts_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_posts_comments += num_comments

for row in show_posts:
    num_comments = int(row[4])
    total_show_posts_comments += num_comments

avg_ask_posts_comments = total_ask_posts_comments / len(ask_posts)
avg_show_posts_comments = total_show_posts_comments / len(show_posts)

print('Average Ask posts comments:',avg_ask_posts_comments)
print('Average Show posts comments:',avg_show_posts_comments)

Average Ask posts comments: 13.744175951381855
Average Show posts comments: 9.810832180272781


The results show that, among posts with at least one comment, Ask posts average more comments than Show posts (about 13.74 versus 9.81). We'll therefore focus the remaining analysis on Ask posts.
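
The two summing loops in the cell above follow the same pattern, so they could be folded into one function. This is a minimal sketch, assuming the same row layout as the dataset (comment count as a string at index 4); the name `average_comments` is hypothetical, and the two sample rows are copied from the Ask HN output above.

```python
def average_comments(rows, comments_index=4):
    """Average number of comments over rows; returns 0.0 for an empty list.

    Hypothetical helper: assumes the comment count is stored as a string
    at position `comments_index`, as in the Hacker News dataset.
    """
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# Two rows copied from the Ask HN output (7 and 3 comments)
sample_ask = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
]

sample_average = average_comments(sample_ask)
print(sample_average)
```

The empty-list guard avoids a `ZeroDivisionError` if a category happens to contain no posts.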

## Finding the Number of Ask Posts and Comments by Hour Created

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.
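
The two steps above can be sketched on a handful of rows before running them on the full list. This is only an illustration: it uses `collections.defaultdict` instead of the explicit `if`/`else` used in the notebook, the names `sample_rows`, `posts_per_hour`, `comments_per_hour`, and `average_per_hour` are hypothetical, and the last sample row is made up so that one hour contains two posts.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs; the first five come from the Ask HN output above
sample_rows = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 1:17', 3],
    ['9/25/2016 22:48', 3],
    ['9/25/2016 21:50', 2],
    ['9/25/2016 19:30', 1],
    ['9/24/2016 2:10', 5],  # hypothetical row, added to show aggregation
]

posts_per_hour = defaultdict(int)
comments_per_hour = defaultdict(int)

# Step 1: count posts and sum comments for each hour of the day
for created_at, num_comments in sample_rows:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += num_comments

# Step 2: average comments per post for each hour
average_per_hour = {
    hour: comments_per_hour[hour] / posts_per_hour[hour]
    for hour in posts_per_hour
}
print(average_per_hour)
```

With `defaultdict(int)`, a missing hour starts at 0, so no membership check is needed before incrementing.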

In [5]:
#Import datetime module
import datetime as dt

#date time format of Hacker News
hn_datetime_format = '%m/%d/%Y %H:%M'

#Creating a list of created time and comments
result_list = []

for row in ask_posts:
    created_at = row[-1]
    created_at = dt.datetime.strptime(created_at,hn_datetime_format)
    num_comments = int(row[4])
    result_list.append([created_at,num_comments])

# Create two dictionaries: total comments and total posts by hour
comments_by_hour = {}
counts_by_hour = {}

for row in result_list:
    hour = row[0].strftime('%H')
    comment = row[1]
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment

comments_by_hour

{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '23': 2297,
 '20': 4462,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}

## Calculating the Average Number of Comments for Ask HN Posts by Hour

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [6]:
#calculating the average number of comments by hour
avg_by_hour = []

for key in comments_by_hour:
    avg_comment = comments_by_hour[key] / counts_by_hour[key]
    avg_by_hour.append([avg_comment,int(key)])

avg_by_hour = sorted(avg_by_hour,reverse=True)

avg_by_hour

[[39.66809421841542, 15],
 [22.2239263803681, 13],
 [15.452554744525548, 12],
 [13.757990867579908, 10],
 [13.73019801980198, 17],
 [13.198237885462555, 2],
 [13.153439153439153, 14],
 [12.688172043010752, 4],
 [12.43157894736842, 8],
 [11.749128919860627, 22],
 [11.38265306122449, 20],
 [11.143426294820717, 11],
 [11.139393939393939, 5],
 [11.056511056511056, 21],
 [10.789823008849558, 18],
 [10.76144578313253, 16],
 [10.160377358490566, 3],
 [10.095541401273886, 7],
 [9.857142857142858, 0],
 [9.414285714285715, 19],
 [9.367713004484305, 1],
 [9.017045454545455, 6],
 [8.392045454545455, 9],
 [8.322463768115941, 23]]

I intentionally put the average before the hour in each pair. Python compares lists element by element, starting with the first, so this order lets us sort the list of lists by average directly and then print the values.
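
To illustrate the sorting behavior, here is a minimal sketch using three of the averages printed above. The second half shows an alternative that keeps the hour first and passes a `key` function to `sorted`; the variable names are hypothetical.

```python
# Average first: sorted() compares the first element of each pair
avg_first = [
    [9.857142857142858, 0],
    [39.66809421841542, 15],
    [22.2239263803681, 13],
]
ranked = sorted(avg_first, reverse=True)
print(ranked)

# Hour first: tell sorted() to compare the second element instead
hour_first = [
    [0, 9.857142857142858],
    [15, 39.66809421841542],
    [13, 22.2239263803681],
]
ranked_by_key = sorted(hour_first, key=lambda pair: pair[1], reverse=True)
print(ranked_by_key)
```

Both approaches rank hour 15 first; the `key` version avoids the swap at the cost of a slightly longer call.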

In [7]:
print('Top 5 Hours for Ask Posts Comments')
for row in avg_by_hour[:5]:
    hour = dt.time(row[1]).strftime('%H:%M')
    avg_comment = row[0]
    print('{} has an average of {:,.2f} comments'.format(hour,avg_comment))

Top 5 Hours for Ask Posts Comments
15:00 has an average of 39.67 comments
13:00 has an average of 22.22 comments
12:00 has an average of 15.45 comments
10:00 has an average of 13.76 comments
17:00 has an average of 13.73 comments


According to the [data set documentation](https://www.kaggle.com/hacker-news/hacker-news-posts), the *created_at* column's time zone is Eastern Time (ET) in the US.
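
Readers in other time zones need to convert these hours. The following is a minimal sketch under a stated assumption: it uses a fixed UTC-4 offset (Eastern Daylight Time), which applies to late September 2016 but not to winter dates. A complete conversion would need a time zone database that handles daylight saving changes. The names `EDT`, `best_slot_et`, and `best_slot_utc` are hypothetical.

```python
import datetime as dt

# Assumption: fixed UTC-4 offset (Eastern Daylight Time); not valid in winter
EDT = dt.timezone(dt.timedelta(hours=-4), 'EDT')

# 15:00 Eastern on the most recent date in the dataset
best_slot_et = dt.datetime(2016, 9, 26, 15, 0, tzinfo=EDT)

# Convert the same instant to UTC
best_slot_utc = best_slot_et.astimezone(dt.timezone.utc)
print(best_slot_utc.strftime('%H:%M'))
```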

The best time to post is 15:00 - 16:00 ET, with 39.67 comments per post on average; the second-best is 13:00 - 14:00 ET, with 22.22.

## Conclusion
To attract attention to my company on Hacker News, the marketing staff should focus on writing posts in the Ask HN category. The best time to post is 15:00 - 16:00 ET, followed by 13:00 - 14:00 ET.

Please note that this analysis excludes posts that received no comments.
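
To put that caveat in perspective, this short sketch computes the share of posts that remain after the filter, using the two totals printed earlier in the notebook. The variable names are hypothetical.

```python
# Totals printed earlier in the notebook
total_posts = 293119
posts_with_comments = 80401

# Share of posts that received at least one comment
share_with_comments = posts_with_comments / total_posts
print('{:.1%} of posts have at least one comment'.format(share_with_comments))
```

Roughly a quarter of the posts are kept, so the averages above describe posts that sparked at least some discussion rather than all submissions.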