# Hacker News Analysis
Download dataset here:
- https://www.kaggle.com/hacker-news/hacker-news-posts

Other:
- [Python's strftime directives](https://strftime.org/)

## Utility Functions

Function **explore_data** prints the rows of a dataset from index `start` up to (but not including) `end`, and optionally the total number of rows and columns

In [10]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # leaves blank lines after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Function **get_csv_list** reads a CSV file and returns its rows as a list of lists

In [11]:
def get_csv_list(filename):
    from csv import reader
    # the with block closes the file once all rows have been read
    with open(filename) as opened_file:
        read_file = reader(opened_file)
        return list(read_file)
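
The reader can be tried without the Kaggle download by writing a tiny CSV to a temporary file. The sketch below is illustrative only: `read_csv_rows` is a hypothetical stand-in for `get_csv_list`, the sample rows are made up, and the UTF-8 encoding is an assumption about the file. It shows why the `csv` module is used rather than splitting on commas: a quoted title that contains a comma stays a single field.

```python
import csv
import os
import tempfile


def read_csv_rows(path):
    """Hypothetical variant of get_csv_list: return every row as a list of strings."""
    # newline='' lets the csv module handle line endings inside quoted fields
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.reader(f))


# Made-up sample data, not taken from the real dataset
sample = (
    'id,title,num_comments\n'
    '1,"Ask HN: tabs, or spaces?",12\n'
    '2,Show HN: my project,3\n'
)

# Write the sample to a temporary file so the function reads from disk
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='', encoding='utf-8') as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

rows = read_csv_rows(tmp_path)
os.remove(tmp_path)

print(rows[1])  # the quoted title is one field despite its comma
```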

## Dataset Description

**hacker_news.csv**

| Column name | Description |
| ----------- | ----------- |
| 'id' | The unique identifier from Hacker News for the post |
| 'title' | The title of the post |
| 'url' | The URL that the post links to, if the post has a URL |
| 'num_points' | The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
| 'num_comments' | The number of comments that were made on the post |
| 'author' | The username of the person who submitted the post |
| 'created_at' | The date and time at which the post was submitted |

In [8]:
hn = get_csv_list('/Users/robdurkin/dev/Dataquest/projects/datasets/hacker_news.csv')

In [13]:
explore_data(hn, 0, 5, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


Number of rows: 293120
Number of columns: 7


In [17]:
headers = hn[0]
hn = hn[1:]

In [19]:
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [20]:
explore_data(hn, 0, 5)

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']




## Filter Data
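Split the posts into three lists: titles that start with **Ask HN**, titles that start with **Show HN** (both matched case-insensitively), and all other posts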

In [22]:
ask_posts, show_posts, other_posts = [], [], []

for post in hn:
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
print('# Ask Posts: {}'.format(len(ask_posts)))
print('# Show Posts: {}'.format(len(show_posts)))
print('# Other Posts: {}'.format(len(other_posts)))


# Ask Posts: 9139
# Show Posts: 10158
# Other Posts: 273822
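
The prefix test can also be isolated in a small function, which makes the case-insensitive matching easy to check on its own. This is a minimal sketch: `classify_title` and the sample titles are hypothetical and not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles for illustration
sample_titles = [
    'Ask HN: How do you learn a new codebase?',
    'SHOW HN: A tiny static site generator',
    'Why I ask HN for advice',  # 'ask hn' is not at the start, so this is 'other'
]

counts = {'ask': 0, 'show': 0, 'other': 0}
for title in sample_titles:
    counts[classify_title(title)] += 1

print(counts)
```

Only the start of the title is tested, so a title that merely mentions "ask HN" later in the text falls into the "other" group.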


## Data Analysis

**Do "Ask HN" or "Show HN" posts receive more comments on average?**

In [26]:
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)

print('Total # ask comments: {:,}'.format(total_ask_comments))
print('Average # ask comments: {:.2f}'.format(avg_ask_comments))

total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)

print('Total # show comments: {:,}'.format(total_show_comments))
print('Average # show comments: {:.2f}'.format(avg_show_comments))

total_other_comments = 0
for post in other_posts:
    num_comments = int(post[4])
    total_other_comments += num_comments
avg_other_comments = total_other_comments / len(other_posts)

print('Total # other comments: {:,}'.format(total_other_comments))
print('Average # other comments: {:.2f}'.format(avg_other_comments))

Total # ask comments: 94,986
Average # ask comments: 10.39
Total # show comments: 49,633
Average # show comments: 4.89
Total # other comments: 1,768,142
Average # other comments: 6.46


**Conclusion**: "Ask HN" posts receive more comments on average (10.39 per post) than "Show HN" posts (4.89 per post)
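
The three near-identical loops could be folded into one helper. The sketch below is one possible shape, not the notebook's own code: `average_comments` is a hypothetical name, the rows are made up, and it assumes the same column layout as the dataset (`num_comments` at index 4).

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: first example', '', '3', '4', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: second example', '', '5', '6', 'user_b', '9/26/2016 15:05'],
]

print('Average # comments: {:.2f}'.format(average_comments(sample_posts)))
```

Guarding against an empty list avoids a `ZeroDivisionError` if a filter ever matches no posts.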

---

**Are "Ask HN" posts created at a certain time more likely to attract comments?**

In [37]:
import datetime as dt

result_list = []
for post in ask_posts:
    item = [post[6], int(post[4])]
    result_list.append(item)
    
    
posts_by_hour = {}
comments_by_hour = {}
for result in result_list:
    created_at = result[0]
    created_dt = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = created_dt.strftime("%H")

    num_comments = result[1]
    if hour not in posts_by_hour:
        posts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        posts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    
print("# Posts and # Comments by Hour\n")    
num_comments_hour = []   
for hour in sorted(posts_by_hour):
    num_comments_hour.append((hour,comments_by_hour[hour])) 
    print("Hour: {}, # Posts: {}, # Comments: {}"
        .format(hour, posts_by_hour[hour], comments_by_hour[hour]))
    
print("\n\nSorted by # Comments Per Hour\n")
for item in sorted(num_comments_hour, key = lambda x: x[1], reverse = True):
    print("Hour: {}, # Comments: {}"
        .format(item[0], item[1]))

# Posts and # Comments by Hour

Hour: 00, # Posts: 301, # Comments: 2277
Hour: 01, # Posts: 282, # Comments: 2089
Hour: 02, # Posts: 269, # Comments: 2996
Hour: 03, # Posts: 271, # Comments: 2154
Hour: 04, # Posts: 243, # Comments: 2360
Hour: 05, # Posts: 209, # Comments: 1838
Hour: 06, # Posts: 234, # Comments: 1587
Hour: 07, # Posts: 226, # Comments: 1585
Hour: 08, # Posts: 257, # Comments: 2362
Hour: 09, # Posts: 222, # Comments: 1477
Hour: 10, # Posts: 282, # Comments: 3013
Hour: 11, # Posts: 312, # Comments: 2797
Hour: 12, # Posts: 342, # Comments: 4234
Hour: 13, # Posts: 444, # Comments: 7245
Hour: 14, # Posts: 513, # Comments: 4972
Hour: 15, # Posts: 646, # Comments: 18525
Hour: 16, # Posts: 579, # Comments: 4466
Hour: 17, # Posts: 587, # Comments: 5547
Hour: 18, # Posts: 614, # Comments: 4877
Hour: 19, # Posts: 552, # Comments: 3954
Hour: 20, # Posts: 510, # Comments: 4462
Hour: 21, # Posts: 518, # Comments: 4500
Hour: 22, # Posts: 383, # Comments: 3372
Hour: 23, # Posts: 343, 

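**Conclusion**: "Ask HN" posts created between 1300 and 1800 collect the most comments in total, with a peak at 1500. These are also the hours with the most posts, so totals alone do not show whether an individual post is more likely to attract comments.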
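
The hour bucketing can be checked on a handful of rows. This is a minimal sketch with made-up `(created_at, num_comments)` pairs in the dataset's date format; it swaps the explicit `if hour not in ...` test for `collections.defaultdict`, which is a substitution and not what the notebook does.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs in the dataset's date format
samples = [
    ('9/26/2016 3:26', 2),
    ('9/26/2016 15:05', 10),
    ('8/16/2016 15:40', 4),
]

posts_by_hour = defaultdict(int)     # missing hours start at 0
comments_by_hour = defaultdict(int)

for created_at, num_comments in samples:
    created_dt = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = created_dt.strftime("%H")  # zero-padded hour string, e.g. '03'
    posts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

for hour in sorted(posts_by_hour):
    print("Hour: {}, # Posts: {}, # Comments: {}"
          .format(hour, posts_by_hour[hour], comments_by_hour[hour]))
```

Because `%H` zero-pads the hour, sorting the string keys gives the same order as sorting the hours numerically.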

In [46]:
avg_by_hour, swap_avg_by_hour = [],[]
for hour in posts_by_hour:
    num_posts = posts_by_hour[hour]
    # average number of comments per post for this hour
    avg_comments = comments_by_hour[hour] / num_posts
    avg_by_hour.append([hour, avg_comments])
    swap_avg_by_hour.append([avg_comments, hour])
    
print("Average number of comments per post by hour (sorted by hour)")
for rec in sorted(avg_by_hour):
    hour = rec[0]
    avg = rec[1]
    print("Hour: {}, Avg # Comments: {:.2f}".format(hour, avg))
    
print("\n\nTop 5 Average number of comments per post by hour (sorted by # comments)")
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
for rec in sorted_swap[:5]:
    print("Hour: {}, Avg # Comments: {:.2f}".format(rec[1], rec[0]))    

Average number of comments per post by hour (sorted by hour)
Hour: 00, Avg # Comments: 7.56
Hour: 01, Avg # Comments: 7.41
Hour: 02, Avg # Comments: 11.14
Hour: 03, Avg # Comments: 7.95
Hour: 04, Avg # Comments: 9.71
Hour: 05, Avg # Comments: 8.79
Hour: 06, Avg # Comments: 6.78
Hour: 07, Avg # Comments: 7.01
Hour: 08, Avg # Comments: 9.19
Hour: 09, Avg # Comments: 6.65
Hour: 10, Avg # Comments: 10.68
Hour: 11, Avg # Comments: 8.96
Hour: 12, Avg # Comments: 12.38
Hour: 13, Avg # Comments: 16.32
Hour: 14, Avg # Comments: 9.69
Hour: 15, Avg # Comments: 28.68
Hour: 16, Avg # Comments: 7.71
Hour: 17, Avg # Comments: 9.45
Hour: 18, Avg # Comments: 7.94
Hour: 19, Avg # Comments: 7.16
Hour: 20, Avg # Comments: 8.75
Hour: 21, Avg # Comments: 8.69
Hour: 22, Avg # Comments: 8.80
Hour: 23, Avg # Comments: 6.70


Top 5 Average number of comments per post by hour (sorted by # comments)
Hour: 15, Avg # Comments: 28.68
Hour: 13, Avg # Comments: 16.32
Hour: 12, Avg # Comments: 12.38
Hour: 02, Avg # Com

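Average comments per post by hour point to the 1000-1500 window: 1500 (28.68), 1300 (16.32), 1200 (12.38), and 1000 (10.68) are four of the five highest hourly averages. The remaining top-five hour, 0200 (11.14), is an outlier outside that window.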
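
The swapped list is one way to sort by average. An alternative sketch passes a `key` function to `sorted`, so the hour and average stay in their natural order. The post and comment counts below are copied from four rows of the hourly output above; the `ranked` name is hypothetical.

```python
# (posts, comments) for a few hours, copied from the hourly output above
posts_by_hour = {'02': 269, '09': 222, '13': 444, '15': 646}
comments_by_hour = {'02': 2996, '09': 1477, '13': 7245, '15': 18525}

# Average comments per post for each hour
avg_by_hour = [
    (hour, comments_by_hour[hour] / posts_by_hour[hour])
    for hour in posts_by_hour
]

# Sort by the average (second element), highest first
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print("Hour: {}, Avg # Comments: {:.2f}".format(hour, avg))
```

Sorting with `key` avoids building a second list whose only purpose is to put the average first.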