## Exploring Hacker News Posts
In this project, I will work with data set of submissions to popular technology site [Hacker News](https://news.ycombinator.com)

I will work with posts whose title begin with either `Ask HN` or `Show HN` and analyze them

### Opening The Data
The data set can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts)

In [1]:
from csv import reader
open_file = open("HN_posts_year_to_Sep_26_2016.csv", encoding='utf-8')
read_file = reader(open_file)
hn = list(read_file)
headers = hn[0]
hn = hn[1:]

In [2]:
print(headers,'\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


### Extracting Ask HN and Show HN Posts
To find the posts that begin with either `Ask HN` or `Show HN`, I'll use the string method `startswith`. 

In [3]:
# testing = 'TesTing'
# testing = testing.lower()
# print(testing)

In [4]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822


In [5]:
print(ask_posts[:3])

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']]


### Calculating the Average Number of Comments for Ask HN and Show HN Posts

In [6]:
## To find total number of comments in ask posts
total_ask_comments = 0
for row in ask_posts:
    num_ask_comments = int(row[4])
    total_ask_comments += num_ask_comments
avg_ask_comments = total_ask_comments/len(ask_posts)
print('Average Ask HN comments: {:.2f}'.format(avg_ask_comments))

## To find total number of comments in ask posts
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments/len(show_posts)
print('Average Show HN comments: {:.2f}'.format(avg_show_comments))

Average Ask HN comments: 10.39
Average Show HN comments: 4.89


We can see that the Ask HN posts receive more comments on average, over two times more than the Show HN posts

### Finding the Amount of Ask Posts and Comments by Hour Created
I'll determine if ask posts created at a certain time are more likely to attract comments. The following steps will be used to perform this analysis:
1. Calculate the amount of ask posts created in each hour of the day, along with th e number of comments received
2. Calculate the average number of comments ask posts receive by hour created

In [7]:
import datetime as dt
result_list = []
for row in ask_posts:
    created_at = row[-1]
    num_comments = int(row[4])
    result_list.append([created_at,num_comments])

# print(result_list[:10])

counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    comment = num_comments
    date_and_time = row[0]
    date_and_time = dt.datetime.strptime(date_and_time,"%m/%d/%Y %H:%M")
    hour = date_and_time.strftime("%H")
    
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment

print(counts_by_hour,'\n')
print(comments_by_hour)

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209} 

{'02': 5380, '01': 5640, '22': 7660, '21': 10360, '19': 11040, '17': 11740, '15': 12920, '14': 10260, '13': 8880, '11': 6240, '10': 5640, '09': 4440, '07': 4520, '03': 5420, '23': 6860, '20': 10200, '16': 11580, '08': 5140, '00': 6020, '18': 12280, '12': 6840, '04': 4860, '06': 4680, '05': 4180}


### Calculating the Average Number of Comments for Ask HN Posts by Hour

In [8]:
avg_by_hour = []
for time in counts_by_hour:
    avg_by_hour.append([time, round((comments_by_hour[hour]/
                                       counts_by_hour[time]),2)])
for row in avg_by_hour:   
    print(row[0],":",row[1])

02 : 17.4
01 : 16.6
22 : 12.22
21 : 9.03
19 : 8.48
17 : 7.97
15 : 7.24
14 : 9.12
13 : 10.54
11 : 15.0
10 : 16.6
09 : 21.08
07 : 20.71
03 : 17.27
23 : 13.64
20 : 9.18
16 : 8.08
08 : 18.21
00 : 15.55
18 : 7.62
12 : 13.68
04 : 19.26
06 : 20.0
05 : 22.39


In [9]:
## Sorting and Printing values from List of Lists
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
                        
print(swap_avg_by_hour)

[[17.4, '02'], [16.6, '01'], [12.22, '22'], [9.03, '21'], [8.48, '19'], [7.97, '17'], [7.24, '15'], [9.12, '14'], [10.54, '13'], [15.0, '11'], [16.6, '10'], [21.08, '09'], [20.71, '07'], [17.27, '03'], [13.64, '23'], [9.18, '20'], [8.08, '16'], [18.21, '08'], [15.55, '00'], [7.62, '18'], [13.68, '12'], [19.26, '04'], [20.0, '06'], [22.39, '05']]


In [11]:
### Sorting the average comments
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
# for row in sorted_swap:
#     print(row[1],':',row[0])
print(sorted_swap,'\n')
print("Top 5 Hours for Ask Posts Comments")
print(sorted_swap[:5])

[[22.39, '05'], [21.08, '09'], [20.71, '07'], [20.0, '06'], [19.26, '04'], [18.21, '08'], [17.4, '02'], [17.27, '03'], [16.6, '10'], [16.6, '01'], [15.55, '00'], [15.0, '11'], [13.68, '12'], [13.64, '23'], [12.22, '22'], [10.54, '13'], [9.18, '20'], [9.12, '14'], [9.03, '21'], [8.48, '19'], [8.08, '16'], [7.97, '17'], [7.62, '18'], [7.24, '15']] 

Top 5 Hours for Ask Posts Comments
[[22.39, '05'], [21.08, '09'], [20.71, '07'], [20.0, '06'], [19.26, '04']]


In [18]:
for row in sorted_swap[:5]:
    time = row[1]
    comment = row[0]
    time = dt.datetime.strptime(time,"%H").strftime("%I%p")
    print("{0:} {1} average comments per post".format(time,comment))

05AM 22.39 average comments per post
09AM 21.08 average comments per post
07AM 20.71 average comments per post
06AM 20.0 average comments per post
04AM 19.26 average comments per post
