# Hacker News Exploration

Hacker News is a popular link-sharing and discussion site started by the startup incubator Y Combinator, where users submit posts that other users can vote and comment on.

This project focuses on two types of posts:
1. Ask HN - a user asks the Hacker News community a question
2. Show HN - a user shows the Hacker News community a project or something interesting



## Goal

There are two goals for this project:
1. Compare the two types of posts and determine which one gets more comments on average
2. Determine whether posts created at a certain time of day receive more comments on average


#### Dataset
The dataset can be found on [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts). It is a subset of data that has the following columns:
* id: the unique identifier from Hacker News for the post
* title: the title of the post
* url: the URL that the post links to, if the post has a URL
* num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* num_comments: the number of comments on the post
* author: the username of the person who submitted the post
* created_at: the date and time of the post's submission

In [2]:
import csv

In [14]:
# Open the file with the csv module (mostly for practice purposes)
with open("./hacker_news.csv", 'r', newline='') as file:
    read_file = csv.reader(file)
    hn = list(read_file)
    

In [15]:
#Display first 5 rows to get an idea of the data
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [16]:
headers = hn[0]
hn = hn[1:]

print(headers)
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]
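
As the output shows, `csv.reader` returns every field as a string, so the numeric and date columns have to be converted before doing any arithmetic on them. A minimal sketch of that conversion, using one row copied from the output above; the helper name `parse_row` is hypothetical and is not used elsewhere in this notebook.

```python
import datetime as dt

# One row copied from the output above; every field is a string.
sample_row = ['12224879',
              'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386',
              '52',
              'ne0phyte',
              '8/4/2016 11:52']


def parse_row(row):
    """Return a dict with typed values (hypothetical helper for illustration)."""
    return {
        'id': int(row[0]),
        'title': row[1],
        'url': row[2],
        'num_points': int(row[3]),
        'num_comments': int(row[4]),
        'author': row[5],
        # The dataset uses month/day/year with a 24-hour clock.
        'created_at': dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M'),
    }


parsed = parse_row(sample_row)
print(parsed['num_comments'], parsed['created_at'].hour)
```

The rest of the notebook converts only the columns it needs (`num_comments` and `created_at`) at the point of use, which works just as well for a small analysis like this one.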

Let's start by counting how many posts of each type exist. The *title* column tells us the type of post. If the title starts with *ask hn*, we consider it an **Ask Post**. If it starts with *show hn*, it is a **Show Post**. If it starts with neither, we consider it an **Other Post**. The comparison is case-insensitive.

In [21]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    # Use if/elif/else so that each post lands in exactly one list
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(f'''Number of Ask Posts: {len(ask_posts)}\nNumber of Show Posts: {len(show_posts)}\nNumber of Other Posts: {len(other_posts)}''')


Number of Ask Posts: 1744
Number of Show Posts: 1162
Number of Other Posts: 17194
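
The classification rule can also be checked in isolation on a few titles. A minimal sketch, assuming a hypothetical helper `classify_title` and made-up sample titles; neither is part of the dataset or the cells above.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    elif lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles for illustration only.
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend side project',
    'Interactive Dynamic Video',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Because the branches are mutually exclusive, every title receives exactly one label, so the three counts always add up to the total number of posts.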


Our first question is which type of post, ask or show, gets more comments on average, so let's compute the average for each type.

In [30]:
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments/len(ask_posts)
print(f'Average Num of Comments on Ask Posts: {round(avg_ask_comments,2)}')


Average Num of Comments on Ask Posts: 14.04


In [31]:
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments/len(show_posts)

print(f'Average Num of Comments on Show Posts: {round(avg_show_comments,2)}')




Average Num of Comments on Show Posts: 10.32
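
The two cells above repeat the same loop, so the calculation could be factored into a small function. A minimal sketch, assuming a hypothetical helper `average_comments` and made-up sample rows that follow the dataset's column order:

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of rows (hypothetical helper).

    Each row is a list of strings; the comment count sits at comments_index.
    """
    if not posts:
        # Avoid dividing by zero when a category has no posts.
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows for illustration only.
sample_posts = [
    ['1', 'Ask HN: first question', '', '3', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: second question', '', '5', '4', 'user_b', '1/1/2016 11:00'],
]

avg = average_comments(sample_posts)
print(f'Average Num of Comments: {avg:.2f}')
```

With this helper, the same call would work for `ask_posts` and `show_posts` alike.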


We can see that **Ask Posts** receive more comments on average than **Show Posts** (14.04 versus 10.32). One plausible explanation is that people are more inclined to share their expertise in answer to a question than to comment on someone else's project, although the data alone cannot confirm this.

Since **Ask Posts** get more comments, we'll see if there's a time of the day that people are more likely to reply.

### Find Number of Ask Posts & Comments by hour

Let's take a look at the number of **Ask Posts** created during each hour of the day. We'll build two frequency tables using dictionaries. The first will count how many **Ask Posts** were created in each hour. The second will total the comments those posts received. Together they let us calculate the average number of comments per **Ask Post** for each hour.

In [32]:
import datetime as dt

In [49]:
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])


In [56]:
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date_created = row[0]
    num_comments = row[1]
    # Parse the timestamp and keep only the zero-padded hour, e.g. '09'
    date = dt.datetime.strptime(date_created, '%m/%d/%Y %H:%M')
    hour = date.strftime('%H')

    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
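
The same two frequency tables can be built without the explicit `if`/`else` by using `collections.defaultdict`, which starts every missing key at zero. A minimal sketch of that alternative, using made-up `[created_at, num_comments]` pairs rather than the real `result_list`:

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs for illustration only.
sample_results = [
    ['8/16/2016 09:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 09:14', 4],
]

counts = defaultdict(int)    # number of posts created in each hour
comments = defaultdict(int)  # total comments on posts created in each hour

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

# Average comments per post for each hour that appears in the data.
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```

This is a stylistic alternative only; the resulting dictionaries hold the same values as the version with the explicit membership check.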


### Calculate the average number of comments for Ask posts per Hour

In [60]:
avg_posts_per_hour = []

for hour in counts_by_hour:
    avg_posts_per_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

print(avg_posts_per_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


Now that we have our list of averages, let's sort it and see which hour of the day has the highest average number of comments per post.

First we need to swap the hour and average columns, because sorting a list of lists compares the first element of each inner list first.

In [75]:
swap_avg_by_hour = [[row[1],row[0]] for row in avg_posts_per_hour]
print(swap_avg_by_hour[:5])

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16']]


With the columns swapped, we can use the *sorted()* function with *reverse=True* to sort in descending order.

In [76]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
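
As an aside, the swap is not strictly required: `sorted()` accepts a `key` argument that selects which element to sort by. A minimal sketch of that alternative, using three `[hour, average]` pairs rounded from the output above; the variable names here are illustrative and not used elsewhere in this notebook.

```python
# Three [hour, average] pairs, rounded from the notebook output for brevity.
avg_by_hour_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81]]

# Sort by the average (index 1) in descending order, leaving the columns in place.
top_hours = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

for hour, avg in top_hours:
    print(f'{hour}:00: {avg:.2f} average comments per post')
```

Either approach yields the same ordering; the `key` version simply keeps the hour in the first column.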

In [82]:
print('Top 5 Hours of Ask Posts Comments')
for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], '%H')
    time_convert = dt.datetime.strftime(hour, '%H:%M')
    print(f'''{time_convert}: {row[0]:.2f} average comments per post''')  
  

Top 5 Hours of Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Well, would you look at that! **Ask Posts** created around 3 PM receive the most comments on average, so that is the most promising time to post a question if you want responses. Funnily enough, 2 AM has the next-highest average, which suggests either that folks in other time zones are particularly active during their afternoon or that Hacker News commenters don't get enough sleep.

Here's a quick summary of all that we did:

* We set a goal for the project.
* We collected and sorted the data.
* We reformatted and cleaned the data to prepare it for analysis.
* We analyzed the data.