# Project Overview

In this project, we'll work with a data set of submissions to the popular technology site **Hacker News**.

## About the site Hacker News

**Hacker News is a site started by the startup incubator Y Combinator**, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

Descriptions of the **columns in the dataset:**

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted

## Objectives of the Project and Its Focus

We're specifically interested in posts whose **titles begin with either Ask HN or Show HN.** Users submit **Ask HN posts** to ask the Hacker News community a **specific question.**

- Ask HN: How to improve my personal website?
- Ask HN: Am I the only one outraged by Twitter shutting down share counts?
- Ask HN: Aby recent changes to CSS that broke mobile?


Users submit **Show HN posts** to show the Hacker News community **a project, product, or just generally something interesting.**


- Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
- Show HN: Something pointless I made
- Show HN: Shanhu.io, a programming playground powered by e8vm

**Primary Objectives:**

1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

In [1]:
from csv import reader

# Read the whole CSV into a list of rows, then release the file handle.
opened_file = open('C:\\Users\\Romit\\Desktop\\Jupyter Notebooks\\HN_posts_year_to_Sep_26_2016.csv', encoding='utf8')
read_file = reader(opened_file)
hn = list(read_file)
opened_file.close()

hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16']]

In [2]:
headers = hn[0]
hn = hn[1:]
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
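
As an aside, the loading and header-splitting steps can also be written with a `with` block, which closes the file automatically once the rows have been read. The sketch below is a minimal illustration, not the cell used in this notebook: the helper name `load_rows` is hypothetical, and a small in-memory CSV (two rows copied from the output above) stands in for the real file so the example runs on its own.

```python
import io
from csv import reader


def load_rows(file_obj):
    """Return (header, data_rows) read from an open CSV file object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]


# In-memory stand-in for the real CSV file (assumption: same column layout).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12578997,What if we just printed a flatscreen television on the side of our boxes?,"
    "https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43,1,0,pavel_lishin,9/26/2016 3:19\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,1,0,poindontcare,9/26/2016 3:16\n"
)

sample_headers, sample_rows = load_rows(sample_csv)
print(sample_headers)
print(len(sample_rows))

# With a real file, the helper would be called inside a with block, e.g.:
# with open(path_to_csv, encoding='utf8', newline='') as f:  # path_to_csv is a placeholder
#     headers, hn = load_rows(f)
```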

**Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.**

We'll use the string method **startswith.** Given a string object, say, string1, we can check whether it starts with, say, dq by inspecting the return value of **string1.startswith('dq').** If string1 starts with dq, the call returns True; otherwise it returns False. Note that the check is case-sensitive.

For example:

In [4]:
print('dataquest'.startswith('Data'))
print('dataquest'.startswith('data'))

False
True


**Next, we'll combine startswith with the lower method to separate posts beginning with Ask HN and Show HN (and case variations) into two different lists.**

1. Create three empty lists called ask_posts, show_posts, and other_posts.

2. Loop through each row in hn.
     - Assign the title in each row to a variable named title.
         - Because the title column is the second column, you'll need to get the element at index 1 in each row.
         
3. Implement the following steps:
   - If the lowercase version of title starts with ask hn, append the row to ask_posts.
   - Else if the lowercase version of title starts with show hn, append the row to show_posts.
   - Else append to other_posts.
   
4. Check the number of posts in ask_posts, show_posts, and other_posts.

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
ask_posts[:5]

[['12578908',
  'Ask HN: What TLD do you use for local development?',
  '',
  '4',
  '7',
  'Sevrene',
  '9/26/2016 2:53'],
 ['12578522',
  'Ask HN: How do you pass on your work when you die?',
  '',
  '6',
  '3',
  'PascLeRasc',
  '9/26/2016 1:17'],
 ['12577908',
  'Ask HN: How a DNS problem can be limited to a geographic region?',
  '',
  '1',
  '0',
  'kuon',
  '9/25/2016 22:57'],
 ['12577870',
  'Ask HN: Why join a fund when you can be an angel?',
  '',
  '1',
  '3',
  'anthony_james',
  '9/25/2016 22:48'],
 ['12577647',
  'Ask HN: Someone uses stock trading as passive income?',
  '',
  '5',
  '2',
  '00taffe',
  '9/25/2016 21:50']]

In [6]:
show_posts[:5]

[['12578335',
  'Show HN: Finding puns computationally',
  'http://puns.samueltaylor.org/',
  '2',
  '0',
  'saamm',
  '9/26/2016 0:36'],
 ['12578182',
  'Show HN: A simple library for complicated animations',
  'https://christinecha.github.io/choreographer-js/',
  '1',
  '0',
  'christinecha',
  '9/26/2016 0:01'],
 ['12578098',
  'Show HN: WebGL visualization of DNA sequences',
  'http://grondilu.github.io/dna.html',
  '1',
  '0',
  'grondilu',
  '9/25/2016 23:44'],
 ['12577991',
  'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules',
  'https://github.com/jakebian/zeal',
  '2',
  '0',
  'dbranes',
  '9/25/2016 23:17'],
 ['12577142',
  'Show HN: Jumble  Essays on the go #PaulInYourPocket',
  'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8',
  '1',
  '1',
  'ryderj',
  '9/25/2016 20:06']]
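
Step 4 of the list above asks for the number of posts in each list, which the cells above do not display. A minimal sketch of that check follows. It wraps the same splitting logic in a hypothetical helper, `split_posts`, and runs it on three rows copied from the outputs above so that it is self-contained; in the notebook, the same `len()` calls would be applied to ask_posts, show_posts, and other_posts.

```python
def split_posts(rows, title_index=1):
    """Split rows into (ask, show, other) lists based on the lowercase title prefix."""
    ask, show, other = [], [], []
    for row in rows:
        title = row[title_index].lower()
        if title.startswith('ask hn'):
            ask.append(row)
        elif title.startswith('show hn'):
            show.append(row)
        else:
            other.append(row)
    return ask, show, other


# Three rows copied from the outputs shown earlier in this notebook.
sample_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '',
     '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/',
     '2', '0', 'saamm', '9/26/2016 0:36'],
    ['12578989', 'algorithmic music',
     'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
     '1', '0', 'poindontcare', '9/26/2016 3:16'],
]

sample_ask, sample_show, sample_other = split_posts(sample_posts)

# Report how many rows landed in each list.
print(len(sample_ask), len(sample_show), len(sample_other))
```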

## Comparison of Average Comments Between ask_posts and show_posts

**Steps on how to proceed:**

1. Find the total number of comments in ask posts and assign it to total_ask_comments.
   - Set total_ask_comments to 0.
   
   
2. Use a for loop to iterate over the ask posts.
   - Because the num_comments column is the fifth column in ask_posts, you'll need to get the element at index 4 in each row.
     - You'll also need to convert the value to an integer so that we can calculate the sum of all the comments.
     - Add this value to total_ask_comments.
     
     
3. Compute the average number of comments on ask posts and assign it to avg_ask_comments.


4. Print avg_ask_comments.


5. Find the total number of comments in show posts and assign it to total_show_comments.
   - Set total_show_comments to 0.
   
   
6. Use a for loop to iterate over the show posts.
   - Because the num_comments column is the fifth column in show_posts, you'll need to get the element at index 4 in each row.
     - You'll also need to convert the value to an integer so that we can calculate the sum of all the comments.
     - Add this value to total_show_comments.
     
     
7. Compute the average number of comments on show posts and assign it to avg_show_comments.

8. (Optional) Print avg_show_comments.


In [7]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = row[4]
    total_ask_comments += int(num_comments)
    
avg_ask_comments = total_ask_comments/len(ask_posts)

total_show_comments = 0

for row in show_posts:
    num_comments = row[4]
    total_show_comments += int(num_comments)
    
avg_show_comments = total_show_comments/len(show_posts)

print(avg_ask_comments)

10.393478498741656


In [8]:
print(avg_show_comments)

4.886099625910612
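
The two loops in In [7] differ only in the list they read, so the calculation could be factored into a single function. The sketch below assumes a hypothetical helper name, `average_comments`, and is run on the five ask_posts rows shown earlier rather than on the full data, so its result is not the notebook's average.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# The first five ask_posts rows, copied from the output above.
sample_ask_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '',
     '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '',
     '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '',
     '1', '0', 'kuon', '9/25/2016 22:57'],
    ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '',
     '1', '3', 'anthony_james', '9/25/2016 22:48'],
    ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '',
     '5', '2', '00taffe', '9/25/2016 21:50'],
]

sample_average = average_comments(sample_ask_posts)
print(sample_average)  # 3.0, since (7 + 3 + 0 + 3 + 2) / 5 = 3.0
```

Returning 0.0 for an empty list is a design choice made here to avoid a ZeroDivisionError; raising an error instead would be equally reasonable.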


We'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

In [9]:
import datetime as dt

# Pair each ask post's creation timestamp with its comment count.
result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}    # hour -> number of ask posts created in that hour
comments_by_hour = {}  # hour -> total comments on those posts
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    # Parse the timestamp and keep only the zero-padded hour, e.g. '15'.
    time = dt.datetime.strptime(date, date_format).strftime("%H")

    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1

comments_by_hour

{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}

**Above we created two dictionaries:**

- **counts_by_hour:** contains the number of ask posts created during each hour of the day.
- **comments_by_hour:** contains the total number of comments received by the ask posts created during each hour.
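
The same two dictionaries could be built a little more compactly with `dict.get`, which supplies a default of 0 for hours not seen yet and removes the need for the if/else branch. This is an alternative sketch, not the cell used above. It keys the dictionaries by the integer `hour` attribute of the parsed datetime rather than by a zero-padded string, uses hypothetical variable names, and runs on four (created_at, num_comments) pairs copied from the ask_posts rows shown earlier.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# (created_at, num_comments) pairs copied from the ask_posts sample output.
sample_results = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 1:17', 3],
    ['9/25/2016 22:57', 0],
    ['9/25/2016 22:48', 3],
]

counts_sketch = {}    # hour (int) -> number of posts
comments_sketch = {}  # hour (int) -> total comments

for created_at, n_comments in sample_results:
    hour = dt.datetime.strptime(created_at, date_format).hour
    # get() returns 0 when the hour has not been seen yet.
    counts_sketch[hour] = counts_sketch.get(hour, 0) + 1
    comments_sketch[hour] = comments_sketch.get(hour, 0) + n_comments

print(counts_sketch)
print(comments_sketch)
```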

In [10]:
sample_dict = {
                'apple': 2, 
                'banana': 4, 
                'orange': 6
               }

fruits = []

for fruit in sample_dict:
    fruits.append([fruit, 10*sample_dict[fruit]])
    
fruits

[['apple', 20], ['banana', 40], ['orange', 60]]

**In the example above, we:**

- Initialized an empty list (of lists) and assigned it to fruits.
- Iterated over the keys of sample_dict and appended to fruits a list whose:
  - First element is the key from sample_dict.
  - Second element is the value corresponding to that key multiplied by ten.

In [11]:
# We calculate the average number of comments per post for posts created during each hour of the day.

avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
    
avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

We now have the results we need, but this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

## Sorting and Printing Values

1. Create a list that equals avg_by_hour with swapped columns.
   - Create an empty list and assign it to swap_avg_by_hour.
   - Iterate over the rows of avg_by_hour and append to swap_avg_by_hour a list whose first element is the second element of the row, and whose second element is the first element of the row.
   
   
2. Print swap_avg_by_hour.


3. Use the sorted() function to sort swap_avg_by_hour in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments.
   - Set the reverse argument to True, so that the highest value in the first column appears first in the list.
   - Assign the result to sorted_swap.
   
   
4. Print the string "Top 5 Hours for Ask Posts Comments".


5. Loop through each average and each hour (in this order) in the first five lists of sorted_swap.


6. Use the str.format() method to print the hour and average in the following format: 15:00: 38.59 average comments per post.
   - To format the hours, use the datetime.strptime() method to parse the hour into a datetime object, and then use the strftime() method to specify the format of the time.
   - To format the average, you can use {:.2f} to indicate that just two decimal places should be used.

In [12]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]

In [13]:
# Sort the values and print the 5 hours with the highest average comments.

print("Top 5 Hours for 'Ask HN' Comments")

for avg, hour in sorted_swap[:5]:
    print('{}: {:.2f} average comments per post'.format(dt.datetime.strptime(hour,'%H').strftime("%H:%M"),avg))


Top 5 Hours for 'Ask HN' Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
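
Swapping the columns is one way to make sorted() order the rows by average. Another option, sketched below, is to pass a `key` function so that the original [hour, average] layout can be kept. The sample rows are four entries copied from the avg_by_hour output above, and the variable names are hypothetical.

```python
# [hour, average] pairs copied from the avg_by_hour output above.
sample_avg_by_hour = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['09', 6.653153153153153],
]

# key=... tells sorted() to compare rows by their second element (the average).
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

# Format the three highest averages in the same style as the cell above.
lines = []
for hour, avg in sorted_by_avg[:3]:
    lines.append('{}:00: {:.2f} average comments per post'.format(hour, avg))

print('\n'.join(lines))
```

With a `key` function, rows with equal averages keep their original relative order, whereas the swapped-list approach would fall back to comparing the hour strings.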


## Conclusion

**The hour that receives the most comments per post on average is 15:00, with an average of 28.68 comments per post.**

In this project, we analyzed ask posts and show posts to determine which type of post and which time of day receive the most comments on average. Based on our analysis, to maximize the number of comments a post receives, we'd recommend that the post be categorized as an ask post and created between 15:00 and 16:00.

However, it should be noted that the **data set we analyzed excluded posts without any comments.** Given that, it's more accurate to say that, of the posts that received comments, **ask posts received more comments on average, and ask posts created between 15:00 and 16:00 (3:00 pm EST - 4:00 pm EST) received the most comments on average.**