# Hacker News (Show/Ask)

In this project I'll look at the Show HN and Ask HN sections of Hacker News and try to determine which of the two attracts more comments, both across all posts and among posts that receive at least one comment. I'll also try to answer whether there is an optimal time to post to Hacker News to gain the most traction.

To start, we need to read our CSV file, which is stored in the same directory as this file and named `'Hacker_News2016.csv'`.

In [1]:
from csv import reader

with open('Hacker_News2016.csv', encoding='utf-8') as file:  # the file is closed once it has been read
    hn = list(reader(file))

hn_header, hn_data =  hn[0], hn[1:] # splits data into its header and data portions

for data in hn_data[:3]:
    print(data) # displays the first few rows so we get a feel as to what the data looks like
    print()
    
print('\n')
hn_header

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']

['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']

['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']





['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [2]:
len(hn_data)

293119
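
As an optional alternative, `csv.DictReader` maps each row to the header names, so a column can be looked up as `row["num_comments"]` instead of by index. The following is only a minimal sketch: it reads a two-row in-memory sample (titles and URLs abbreviated from the rows printed above) rather than the real file, and the name `sample_csv` is made up for illustration.

```python
import csv
import io

# In-memory stand-in for the CSV file; the rows are abbreviated copies of the output above.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579008,You have two days to comment,http://www.regulations.gov/,1,0,altstar,9/26/2016 3:26\n"
    "12579005,SQLAR  the SQLite Archiver,https://www.sqlite.org/sqlar/,1,0,blacksqr,9/26/2016 3:24\n"
)

# DictReader consumes the header line and yields one dict per data row.
rows = list(csv.DictReader(sample_csv))
for row in rows:
    print(row["id"], row["num_comments"], row["created_at"])
```

The rest of this notebook keeps the list-of-lists form, so the index-based lookups below still apply.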

Our dataset contains nearly 300,000 posts. Because this analysis only cares about Ask HN and Show HN posts, we'll have to filter it down into three groups:

- `ask_hn_posts` - all posts whose title starts with 'ask hn' (case-insensitive)
- `show_hn_posts` - all posts whose title starts with 'show hn' (case-insensitive)
- `other_posts` - everything else

While we don't need `other_posts` for this analysis, it may be useful later on for exploring the dataset in another way.

We also care about whether a post has at least one comment. Those posts are additionally stored in
`ask_hn_plus_comments` and `show_hn_plus_comments`.

In [3]:
# initialize variables
ask_hn_posts = []
ask_hn_plus_comments = []
show_hn_posts = []
show_hn_plus_comments = []
other_posts = []
other_plus_comments = []

In [4]:
for post in hn_data:
    title = post[1].lower()
    try:
        num_comments = int(post[4])
    except ValueError:  # treat a non-numeric comment count as 0
        num_comments = 0
        
    if title.startswith('ask hn'):
        ask_hn_posts.append(post)
        if num_comments > 0:
            ask_hn_plus_comments.append(post)
    elif title.startswith('show hn'):
        show_hn_posts.append(post)
        if num_comments > 0:
            show_hn_plus_comments.append(post)
    else:
        other_posts.append(post)
        if num_comments > 0:
            other_plus_comments.append(post)
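
The prefix rule used in the loop above can be isolated into a small helper to see how individual titles are bucketed. This is a sketch only: `classify_title` is a hypothetical name, and two of the sample titles are made up to show that a title merely containing 'ask hn' is not counted as an ask post.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the lowercased title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

sample_titles = [
    "Ask HN: What's the best way to learn about the blockchain?",  # from the dataset
    "SQLAR  the SQLite Archiver",                                  # from the dataset
    "Show HN: a hypothetical project",                             # made up for illustration
    "Why I never ask HN for advice",                               # made up: contains, but does not start with, 'ask hn'
]

labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'other', 'show', 'other']
```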

In [5]:
print('ASK:', len(ask_hn_posts))
print('ASK W/ COMMENTS:', len(ask_hn_plus_comments))
print()
print('SHOW:', len(show_hn_posts))
print('SHOW W/ COMMENTS:', len(show_hn_plus_comments))
print()
print('OTHER:', len(other_posts))
print('OTHER W/ COMMENTS:', len(other_plus_comments))

ASK: 9139
ASK W/ COMMENTS: 6911

SHOW: 10158
SHOW W/ COMMENTS: 5059

OTHER: 273822
OTHER W/ COMMENTS: 68431


With this we have reduced our dataset from nearly 300,000 posts to about 20,000 (about 12,000 if we only count posts with comments).
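
The counts printed above also tell us what share of each group received at least one comment. A quick sketch of that arithmetic, with the numbers copied from the output of cell 5 (the `counts` and `shares` names are just for illustration):

```python
# (total posts, posts with at least one comment), copied from the output above
counts = {
    "ask": (9139, 6911),
    "show": (10158, 5059),
    "other": (273822, 68431),
}

shares = {}
for label, (total, with_comments) in counts.items():
    shares[label] = with_comments / total
    print(f"{label}: {shares[label]:.1%} of posts have at least one comment")
```

This works out to roughly 75.6% for ask posts, 49.8% for show posts, and 25.0% for everything else.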

With our newly filtered data we can perform some basic analysis on it. <br> <br>

Let's try to answer the following questions:
- What is the average number of comments on an ask post?
- What is the average number of comments on a show post?
- What is the average for ask posts with one or more comments?
- What is the average for show posts with one or more comments?
<br><br>
To do this, we loop through `ask_hn_posts` and `show_hn_posts`, extract the `num_comments` column (column 5, index 4) from each post, convert it to an integer, and average the values to get `ask_posts_avg` and `show_posts_average`. The same is done for the posts with comments to get `ask_plus_comments_avg` and `show_plus_comments_avg`.
<br><br>
There are many ways to take an average. The approach used here is to collect the values in a list, add them up with the built-in `sum()` function, and then divide the sum by the length of the list.
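
As a small worked example of the sum-over-length approach, here it is on four made-up comment counts (not taken from the dataset), cross-checked against `statistics.mean` from the standard library:

```python
from statistics import mean

comment_counts = [66, 0, 3, 11]  # illustrative values only

# Average over all posts: sum of the values divided by how many there are.
manual_average = sum(comment_counts) / len(comment_counts)
print(manual_average)  # 20.0

# Average over posts with at least one comment: drop the zeros first.
commented = [count for count in comment_counts if count > 0]
commented_average = sum(commented) / len(commented)
print(f"{commented_average:.2f}")  # 26.67

# The standard library gives the same result as the manual calculation.
print(mean(comment_counts) == manual_average)  # True
```

Dropping the zero-comment posts raises the average, which is the same effect we should expect between the plain and the "plus comments" averages.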



In [6]:
# gather values
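def gather_comments_avg(dataset, col, cond=lambda x: True):
    """Average the integer values in column `col` that satisfy `cond`."""
    values = []
    for data in dataset:
        try:
            val = int(data[col])
        except ValueError:
            val = 0  # treat non-numeric entries as 0

        if cond(val):
            values.append(val)

    if not values:  # avoid dividing by zero when nothing matches
        return 0.0
    return sum(values) / len(values)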
        

In [7]:
# averages
ask_posts_avg = gather_comments_avg(ask_hn_posts, 4)
show_posts_average = gather_comments_avg(show_hn_posts, 4)
ask_plus_comments_avg = gather_comments_avg(ask_hn_plus_comments, 4)
show_plus_comments_avg = gather_comments_avg(show_hn_plus_comments, 4)

print(f'ASK AVERAGE {ask_posts_avg:.2f}')
print(f'ASK AVERAGE+ {ask_plus_comments_avg:.2f}')
print()
print(f'SHOW AVERAGE {show_posts_average:.2f}')
print(f'SHOW AVERAGE+ {show_plus_comments_avg:.2f}')

ASK AVERAGE 10.39
ASK AVERAGE+ 13.74

SHOW AVERAGE 4.89
SHOW AVERAGE+ 9.81


Ask posts receive an average of 10.39 comments (13.74 among posts with at least one comment), while show posts receive an average of only 4.89 (9.81 among posts with at least one comment). <br>

This answers our first question: ask posts receive more comments on average than show posts. That makes sense, since an ask post is a direct request for information, whereas a show post presents something that readers may or may not have an opinion about. <br><br>

The next part of the analysis focuses on the hour at which a post is created and the number of comments it receives. Because we are interested in comments, we'll work with the `ask_hn_plus_comments` dataset.<br><br>
We'll first extract the creation date and the number of comments from each post into a list. Then we'll loop over that list and build two dictionaries keyed by hour of the day: `hours_dict`, which counts the posts created during each hour, and `comments_dict`, which sums the comments those posts received.

In [8]:
ask_hn_plus_comments[20] # to avoid scrolling up

['12571287',
 "Ask HN: What's the best way to learn about the blockchain?",
 '',
 '247',
 '66',
 'm52go',
 '9/24/2016 15:29']
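
The `created_at` value in the row above has the form month/day/year hour:minute, which is what the format string `'%m/%d/%Y %H:%M'` describes. A minimal sketch of parsing that one value and pulling out the hour:

```python
from datetime import datetime

created_at = "9/24/2016 15:29"  # value from the row shown above

# %m = month, %d = day, %Y = four-digit year, %H = hour (24-hour clock), %M = minute
parsed = datetime.strptime(created_at, "%m/%d/%Y %H:%M")

print(parsed.hour)               # 15
print(parsed.strftime("%H:00"))  # 15:00
```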

In [9]:
created_numcomments = []
hours_dict = {}
comments_dict = {}

In [10]:
import datetime
for post in ask_hn_plus_comments:
    date = post[6]
    date_time = datetime.datetime.strptime(date, '%m/%d/%Y %H:%M')
    num_comments = int(post[4]) # validated earlier
    created_numcomments.append([date_time, num_comments])

for date, comments in created_numcomments:
    if date.hour in hours_dict:
        hours_dict[date.hour] += 1
    else:
        hours_dict[date.hour] = 1
        
    if date.hour in comments_dict:
        comments_dict[date.hour] += comments
    else:
        comments_dict[date.hour] = comments
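
The same counting pattern can be written without the `if`/`else` branches by using `collections.defaultdict`, which starts every missing key at 0. This is an alternative sketch, not the approach used in this notebook. The first sample pair comes from the row shown earlier, the other two are made up, and the names `posts_per_hour` and `comments_per_hour` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# (created_at, num_comments) pairs; only the first is from the dataset.
sample = [
    ("9/24/2016 15:29", 66),
    ("9/25/2016 15:05", 4),   # made up for illustration
    ("9/26/2016 3:26", 2),    # made up for illustration
]

posts_per_hour = defaultdict(int)     # hour -> number of posts
comments_per_hour = defaultdict(int)  # hour -> total comments

for created_at, num_comments in sample:
    hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += num_comments

# Average comments per post for each hour that appears in the sample.
averages = {hour: comments_per_hour[hour] / posts_per_hour[hour] for hour in posts_per_hour}
print(averages)  # {15: 35.0, 3: 2.0}
```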
        

The following uses the two previously defined dictionaries and creates a list of lists containing the hour and the average number of comments for that hour.

In [20]:
average_posts_per_hour = []

for idx in sorted(hours_dict):  # iterate over the hours actually present in the data
    posts = hours_dict[idx]
    comments = comments_dict[idx]
    average_posts_per_hour.append([idx, comments / posts])  # average comments per post for this hour

In [26]:
average_posts_per_hour = [[val[1], val[0]] for val in average_posts_per_hour] # swap to [average, hour] so that sorting orders by average
average_posts_per_hour = sorted(average_posts_per_hour, reverse=True)
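
Swapping the pairs works, but `sorted()` can also order `[hour, average]` pairs directly through its `key` argument. A sketch of that alternative, using four of the hourly averages from the results in this notebook (the names `hourly_averages` and `ranked` are only for illustration):

```python
# [hour, average comments per post], a subset of the notebook's results
hourly_averages = [[0, 9.86], [15, 39.67], [13, 22.22], [23, 8.32]]

# Sort by the second element (the average), largest first; the pairs keep their order.
ranked = sorted(hourly_averages, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print(f"{hour}:00 - AVG Comments: {avg:.2f}")
```

With this form the loop unpacks `hour, avg` rather than `comments, hour`, so the print loop would need its unpacking order adjusted to match.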

Now that we've computed the average number of comments per post for each hour, let's display it.

In [28]:
print('AVERAGE NUMBER OF COMMENTS FOR EACH HOUR SORTED FROM GREATEST TO LEAST AMOUNT OF COMMENTS.')
for comments, hour in average_posts_per_hour:
    print(f'{hour}:00 - AVG Comments: {comments:.2f}')

AVERAGE NUMBER OF COMMENTS FOR EACH HOUR SORTED FROM GREATEST TO LEAST AMOUNT OF COMMENTS.
15:00 - AVG Comments: 39.67
13:00 - AVG Comments: 22.22
12:00 - AVG Comments: 15.45
10:00 - AVG Comments: 13.76
17:00 - AVG Comments: 13.73
2:00 - AVG Comments: 13.20
14:00 - AVG Comments: 13.15
4:00 - AVG Comments: 12.69
8:00 - AVG Comments: 12.43
22:00 - AVG Comments: 11.75
20:00 - AVG Comments: 11.38
11:00 - AVG Comments: 11.14
5:00 - AVG Comments: 11.14
21:00 - AVG Comments: 11.06
18:00 - AVG Comments: 10.79
16:00 - AVG Comments: 10.76
3:00 - AVG Comments: 10.16
7:00 - AVG Comments: 10.10
0:00 - AVG Comments: 9.86
19:00 - AVG Comments: 9.41
1:00 - AVG Comments: 9.37
6:00 - AVG Comments: 9.02
9:00 - AVG Comments: 8.39
23:00 - AVG Comments: 8.32


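With this we see that, among ask posts with at least one comment, those created during the 15:00 hour (3 pm) receive the most comments on average, about 39.67 per post. To maximize comments, then, we should post an Ask HN at around 15:00.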