# Guided Project: Exploring Hacker News Posts

In this project, we'll work with a data set of submissions to the popular technology site Hacker News.

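We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question, and Show HN posts to show the community a project, product, or something else they find interesting.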

We'll compare these two types of posts to determine the following:

* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

## The Data Set Used in This Project

Resource: https://www.kaggle.com/hacker-news/hacker-news-posts

This data set contains Hacker News posts from the 12 months up to September 26, 2016.

It includes the following columns:

* id: the unique identifier of the post
* title: the title of the post
* url: the url of the item being linked to
* num_points: the number of upvotes the post received
* num_comments: the number of comments the post received
* author: the name of the account that made the post
* created_at: the date and time the post was made (the time zone is Eastern Time in the US)

## Introduction & Removing Headers from a List of Lists

In [1]:
### Assign the result to the variable hn
from csv import reader
opened_file = open("/Users/derekyerkovich/desktop/my_datasets/hacker_news.csv")
read_file = reader(opened_file)
hn = list(read_file)
headers = hn[0]
hn = hn[1:]

### Display headers and the first five rows
print(headers)
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Âthe-data-vaultÂ',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]
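The same header/row split can be sketched with a `with` block, which closes the file automatically once reading is done. This is a minimal sketch, not the notebook's own cell: an in-memory `io.StringIO` stands in for the real CSV file, and the sample rows and the names `sample_csv`, `sample_headers`, and `sample_rows` are hypothetical.

```python
import csv
import io

# Stand-in for the real CSV file: a tiny in-memory sample (hypothetical rows).
sample_csv = io.StringIO(
    "id,title,num_comments\n"
    "1,Ask HN: example,3\n"
    '2,"Show HN: a title, with a comma",0\n'
)

# The with block closes the file-like object when the block ends.
with sample_csv as f:
    rows = list(csv.reader(f))

# Separate the header row from the data rows, as in the cell above.
sample_headers, sample_rows = rows[0], rows[1:]

print(sample_headers)
print(sample_rows)
```

Note that `csv.reader` keeps the quoted title with a comma as a single field, which is one reason to prefer it over splitting lines by hand.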

## Extracting Ask HN and Show HN Posts

Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

Next, let's use the str.startswith() and str.lower() methods to separate posts beginning with Ask HN and Show HN (and case variations) into two different lists.
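As a small illustration of how the two methods combine, the sketch below classifies a few made-up titles. The sample titles and the helper name `classify_title` are hypothetical and not part of the data set or the original notebook.

```python
# Hypothetical sample titles, made up for illustration only.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "ASK HN: Best books of the year?",
    "Show HN: My weekend project",
    "A post that merely mentions Ask HN",
]

def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # normalize case so 'ASK HN' matches 'ask hn'
    if lowered.startswith("ask hn"):
        return "ask"
    elif lowered.startswith("show hn"):
        return "show"
    return "other"

labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Lowercasing first means a single prefix check covers every case variation, and `startswith` only matches at the beginning of the title, so a title that mentions Ask HN later on is not counted.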

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):  # elif keeps ask posts out of other_posts
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Number of total posts:', len(hn))
print('Number of posts in ask_posts:', len(ask_posts))
print('Number of posts in show_posts:', len(show_posts))
print('Number of posts in other_posts:', len(other_posts))

Number of total posts: 293119
Number of posts in ask_posts: 9139
Number of posts in show_posts: 10158
Number of posts in other_posts: 273822


Next, let's determine if ask posts or show posts receive more comments on average.

Because the num_comments column is the fifth column in both ask_posts and show_posts, we'll need to get the element at index 4 in each row.

In [3]:
total_ask_posts = 0 

for row in ask_posts:
    total_ask_posts += int(row[4])
    
ave_ask_posts = total_ask_posts / len(ask_posts)
print("Average number of comments on ask posts:", ave_ask_posts)

total_show_posts = 0
for row in show_posts:
    total_show_posts += int(row[4])
    
ave_show_posts = total_show_posts / len(show_posts)
print("Average number of comments on show posts:", ave_show_posts)


Average number of comments on ask posts: 10.393478498741656
Average number of comments on show posts: 4.886099625910612


The result shows that ask posts receive more comments on average (about 10.4 per post) than show posts (about 4.9 per post).
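Since the same averaging is done for both lists, it could be wrapped in a small function. The following is a minimal sketch under that assumption; the function name `average_comments`, its `comments_index` parameter, and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts stored as strings at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the same column order as the data set.
sample_posts = [
    ["1", "Ask HN: example one", "", "3", "10", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: example two", "", "1", "4", "user_b", "9/26/2016 4:10"],
]

sample_average = average_comments(sample_posts)
print(sample_average)
```

With the real lists, it would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.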

## Finding the Number of Ask Posts and Comments by Hour Created

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received. (We'll use the datetime module to work with the data in the created_at column.)
2. Calculate the average number of comments ask posts receive by hour created.
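The first step can be sketched on a few made-up rows as follows. This is an illustrative variant rather than the notebook's own approach: it uses `collections.defaultdict` to skip the "key not in dictionary" check and reads the hour as an integer from the `.hour` attribute. The sample pairs are hypothetical.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical (created_at, num_comments) pairs in the data set's date format.
sample = [
    ("9/25/2016 22:57", 5),
    ("9/26/2016 3:26", 2),
    ("8/16/2016 22:05", 7),
]

counts = defaultdict(int)    # posts per hour; missing keys start at 0
comments = defaultdict(int)  # comments per hour; missing keys start at 0

for created_at, num_comments in sample:
    # Parse the timestamp string, then take the hour as an int (0-23).
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```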

In [4]:
import datetime as dt
result_list = []
for row in ask_posts:
    created_at = row[6]
    result_list.append([created_at, int(row[4])]) # created_at, num_comments
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    # created_at: '9/25/2016 22:57'
    dt_object = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    dt_hour = dt_object.strftime("%H")
    if dt_hour not in counts_by_hour:
        counts_by_hour[dt_hour] = 1
        comments_by_hour[dt_hour] = row[1]
    else:
        counts_by_hour[dt_hour] += 1
        comments_by_hour[dt_hour] += row[1]
        
print(counts_by_hour)    
print('\n')
print(comments_by_hour)    

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}


{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


We can see that the hour with the most commenting activity is 3pm (hour '15'), which is also the hour with the most ask posts.

In [5]:
print(counts_by_hour['15'])
print(comments_by_hour['15'])

646
18525


## Calculating the Average Number of Comments for Ask HN Posts by Hour

Instructions:

1. Calculate the average number of comments per post for posts created during each hour of the day.
2. The result should be a list of lists in which the first element is the hour and the second element is the average number of comments per post. Assign the result to a variable named avg_by_hour. Display the results.
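As a quick illustration of the calculation, the sketch below builds the same list-of-lists shape with a list comprehension. The two small dictionaries are hypothetical stand-ins for counts_by_hour and comments_by_hour.

```python
# Hypothetical stand-ins for counts_by_hour and comments_by_hour.
sample_counts = {"15": 4, "22": 2}
sample_comments = {"15": 100, "22": 9}

# One [hour, average] pair per hour: total comments divided by post count.
sample_avg_by_hour = [
    [hour, sample_comments[hour] / sample_counts[hour]]
    for hour in sample_counts
]

print(sample_avg_by_hour)
```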

In [6]:
avg_by_hour = list()
for key in counts_by_hour:
    avg_by_hour.append([key, comments_by_hour[key] / counts_by_hour[key]])

avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

## Sorting and Printing Values from a List of Lists

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

First, we create a copy of avg_by_hour with the columns swapped, so that sorting the list orders the rows by average rather than by hour.

In [7]:
swap_avg_by_hour = list()
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


## Next Steps:

1. Use the sorted() function to sort swap_avg_by_hour in descending order, and assign the result to sorted_swap.
2. Print the string "Top 5 Hours for Ask Posts Comments".
3. Loop through each average and each hour (in this order) in the first five lists of sorted_swap.
4. Use the str.format() method to print the hour and average in the following format: 15:00: 28.68 average comments per post.
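As an alternative sketch of these steps, `sorted()` also accepts a `key` function, which sorts by the average directly without a swapped copy of the list. The sample values below are a hypothetical three-row subset, and the names `sample_avg`, `ranked`, and `lines` are made up for illustration.

```python
# Hypothetical [hour, average] rows in the same shape as avg_by_hour.
sample_avg = [["02", 11.14], ["15", 28.68], ["13", 16.32]]

# Sort by the second element (the average), highest first.
ranked = sorted(sample_avg, key=lambda row: row[1], reverse=True)

# Format the top two rows; {:.2f} rounds the average to two decimals.
lines = [
    "{}:00: {:.2f} average comments per post".format(hour, avg)
    for hour, avg in ranked[:2]
]

for line in lines:
    print(line)
```

The notebook's swapped-column approach gives the same ordering; the `key` version just avoids the intermediate list.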

In [8]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print('Top 5 Hours for Ask Posts Comments (Time zone: Eastern Time in the US)')
for row in sorted_swap[:5]:
    # format the hours
    str1 = dt.datetime.strptime(row[1], "%H").strftime("%H:%M")
    output = "{hour}: {avg:.2f} average comments per post".format(hour = str1, avg = row[0])
    print(output)

Top 5 Hours for Ask Posts Comments (Time zone: Eastern Time in the US)
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


According to these results, ask posts created between 3pm and 4pm (US Eastern Time) receive the most comments on average (28.68 per post), so posting during that hour may give a post a higher chance of receiving comments.