# What Makes Hacker News Posts Most Commented? Analyzing Key Factors

## 1. Introduction

Hacker News is a website widely used in technology circles, where user-submitted stories are voted on and commented on by other users.

Each post has several variables associated with it: title, user name, post identifier, date and time, number of positive and negative votes, and number of comments.

**The objective of this work** is to identify which variables influence the net number of votes (positive votes minus negative votes) and the number of comments. Two questions will be answered by the end of this project:

* Do **Ask HN** or **Show HN** posts receive more comments on average?
* Does the time of the post influence the number of comments?

I will use a roughly 20,000-row [data set](://www.kaggle.com/hacker-news/hacker-news-postsfrom) in CSV format from Hacker News, containing posts that have at least one comment associated with them.

## 2. Methodology

### 2.1. Importing Data Set

In this section, the data set is first opened and then converted into a list of lists using the `reader()` function from the _csv_ module.

In [1]:
# Importing the data set
from csv import reader  # Import only the reader function

# Open the file in a with block so it is closed automatically
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)  # Create a reader object over the file
    hn = list(read_file)  # Store the rows as a list of lists

# Display the first few rows of the data set
print(hn[0:4])


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]


In [2]:
# Splitting the data set into the header row and the data rows
headers = hn[0]
hn = hn[1:]

# Displaying the new data sets
print(headers)
print("\n")
print(hn[0:3])


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]
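As a minimal, self-contained sketch of the same read-and-split step, the snippet below parses a small in-memory CSV string instead of `hacker_news.csv`. The names `sample_csv`, `sample_headers` and `sample_rows` are illustrative assumptions; the header and the single data row are copied from the output above.

```python
import io
from csv import reader

# Illustrative one-row sample in the same layout as hacker_news.csv
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

# reader() accepts any iterable of lines, so a StringIO works like a file
rows = list(reader(io.StringIO(sample_csv)))

# Split the header row from the data rows
sample_headers, sample_rows = rows[0], rows[1:]

print(sample_headers)
print(sample_rows[0][4])  # num_comments is still a string at this point
```

Every field comes back as a string, which is why the later cells call `int()` on the comment counts.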


### 2.2. Filtering **Show HN** and **Ask HN** titled posts

In this section, the following code creates separate lists containing the posts whose titles start with **Show HN** or **Ask HN**.

To do this, the `str.startswith()` method is used inside a for loop to check whether the title in the current iteration starts with **Ask HN** or **Show HN**:

* If the title starts with **Ask HN**, append the entire row to the `ask_posts` list.
* If the title starts with **Show HN**, append the entire row to the `show_posts` list.
* Otherwise, append the entire row to the `other_posts` list.

Note: the `str.lower()` method is applied to each title first, to avoid dealing with mixed upper and lower case.

In [3]:
ask_posts  = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('The number of ask posts is',len(ask_posts))
print('\n')
print('The number of show posts is',len(show_posts))
print('\n')
print('The number of other posts is',len(other_posts))

        
    

The number of ask posts is 1744


The number of show posts is 1162


The number of other posts is 17194
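The same three-way rule can be wrapped in a small helper, which makes it easy to check on a handful of titles. This is only a sketch: the function name `classify_title` and the first two sample titles are hypothetical and do not come from the data set.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix."""
    lowered = title.lower()  # ignore capitalization differences
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles used only to illustrate the rule
sample_titles = [
    'Ask HN: Example question?',
    'SHOW HN: Example project',
    'Interactive Dynamic Video',
]
for sample in sample_titles:
    print(sample, '->', classify_title(sample))
```

Because the check uses `startswith()`, a title that merely mentions "Ask HN" somewhere in the middle is classified as `'other'`.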


### 2.3. Computing the average number of comments in ask and show posts

In [4]:
# Calculating the total number of comments in ask posts
total_ask_comments = 0
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
    
#Calculating the average of comments in ask posts
avg_ask_comments = total_ask_comments/len(ask_posts)
print('The average number of comments on ask posts is',avg_ask_comments)
    

The average number of comments on ask posts is 14.038417431192661


In [5]:
# Calculating the total number of comments in show posts
total_show_comments = 0
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments

avg_show_comments = total_show_comments/len(show_posts)
print('The average number of comments on show posts is',avg_show_comments)
    

The average number of comments on show posts is 10.31669535283993


It can be seen that **Show posts** receive fewer comments than **Ask posts** on average.

avg_show = 10.32

avg_ask = 14.04

One plausible explanation is that Ask posts invite people to answer a question directly in the comments. In Show posts, comments are not answers but additions to what was already posted.
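The two averages above follow the same pattern, so they could be computed with one helper. The sketch below assumes the column layout shown in the header row (`num_comments` at index 4); the function name `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Average of the num_comments column over a list of rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the same column layout as the data set
sample_posts = [
    ['1', 'Ask HN: A?', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: B?', '', '3', '20', 'user_b', '1/1/2016 11:00'],
]
print(average_comments(sample_posts))  # 15.0
```

With the notebook's lists, `average_comments(ask_posts)` and `average_comments(show_posts)` should reproduce the two values printed above.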

### 2.4. Analyzing the influence of posting time on the number of comments in Ask posts

This section analyzes whether the time at which a post is created influences the number of comments it receives.

Only the **Ask posts** are considered, since they showed a higher average number of comments in the previous analysis.

The steps to conduct this analysis are:

1. Create a new list `result_list` containing the creation time as a string and the number of comments as an integer.
2. Create two dictionaries, `counts_by_hour` and `comments_by_hour`, that store the number of posts created in each hour and the corresponding total number of comments.
3. Loop over `result_list`:

    3.1. Convert the time string into a datetime object using the `datetime.strptime()` function.
    
    3.2. Extract the hour as a string with the `datetime.strftime()` method.
    
    3.3. Accumulate the post count and the comment total for that hour in the two dictionaries.


In [6]:
## Importing datetime module
import datetime as dt

## Creating a new_list from ask_posts sub data set
result_list = []
for row in ask_posts:
    created_at = row[6]
    comments = int(row[4])
    result_list.append([created_at,comments])

## Creating dictionaries that count the number of posts per hour
## and the total number of comments per hour

counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    comments = row[1]
    unformated_time = row[0] #'8/4/2016 11:52'
    formated_time = dt.datetime.strptime(unformated_time,"%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(formated_time,"%H") # Extracting the hour as a zero-padded string
    if hour not in counts_by_hour:
        counts_by_hour[hour]=1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour]+= comments
        
print(counts_by_hour)


{'19': 110, '14': 107, '11': 58, '01': 60, '20': 80, '22': 71, '03': 54, '05': 46, '06': 44, '10': 59, '21': 109, '15': 116, '23': 68, '16': 108, '09': 45, '13': 85, '07': 34, '08': 48, '18': 109, '02': 58, '17': 100, '04': 47, '12': 73, '00': 55}
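To make the parsing step concrete, here is a small worked example on the timestamp of the first data row, `'8/4/2016 11:52'`. The variable names `sample_timestamp` and `parsed` are illustrative only.

```python
import datetime as dt

sample_timestamp = '8/4/2016 11:52'  # created_at value of the first data row

# strptime turns the string into a datetime object using the given format
parsed = dt.datetime.strptime(sample_timestamp, "%m/%d/%Y %H:%M")

# strftime turns the datetime back into a string; "%H" keeps only the hour
hour = parsed.strftime("%H")

print(parsed)  # 2016-08-04 11:52:00
print(hour)    # 11
```

The hour is returned as a zero-padded string, which is why the dictionary keys above look like `'01'` and `'09'` rather than integers.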


Now that I have two dictionaries with the same keys, I can loop over one of them and store the average number of comments per post for each hour in a list of lists.

In [7]:
avg_by_hour = []
for hour in counts_by_hour:
    comments_hour = comments_by_hour[hour]
    avg_by_hour.append([hour,comments_hour/counts_by_hour[hour]])

print(avg_by_hour)

[['19', 10.8], ['14', 13.233644859813085], ['11', 11.051724137931034], ['01', 11.383333333333333], ['20', 21.525], ['22', 6.746478873239437], ['03', 7.796296296296297], ['05', 10.08695652173913], ['06', 9.022727272727273], ['10', 13.440677966101696], ['21', 16.009174311926607], ['15', 38.5948275862069], ['23', 7.985294117647059], ['16', 16.796296296296298], ['09', 5.5777777777777775], ['13', 14.741176470588234], ['07', 7.852941176470588], ['08', 10.25], ['18', 13.20183486238532], ['02', 23.810344827586206], ['17', 11.46], ['04', 7.170212765957447], ['12', 9.41095890410959], ['00', 8.127272727272727]]
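As an alternative sketch of the same aggregation, `collections.defaultdict` removes the need for the `if hour not in ...` check. This is not the approach used in the cells above, and the sample pairs in `sample_results` are hypothetical values, not rows from the data set.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical (created_at, num_comments) pairs, not taken from the data set
sample_results = [
    ['1/1/2016 15:10', 30],
    ['1/2/2016 15:45', 10],
    ['1/3/2016 09:05', 4],
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # total number of comments per hour
for created_at, n_comments in sample_results:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")
    counts[hour] += 1            # missing keys start at 0
    comments[hour] += n_comments

# Average comments per post for each hour, in first-seen order
sample_avg_by_hour = [[hour, comments[hour] / counts[hour]] for hour in counts]
print(sample_avg_by_hour)  # [['15', 20.0], ['09', 4.0]]
```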


Even though we now have a list with the **average comments per hour**, it is not in an easy-to-read format. To make it more readable, the next code prints the five hours with the most comments per post as formatted text.

In [11]:
# Swapping the entries in avg_by_hour so the list can be sorted by average in descending order
swap_avg_by_hour = []
for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1],hour[0]])
#print(swap_avg_by_hour[0:2])

# Sorting the swap_avg_by_hour list and assigning to sorted_swap variable
sorted_swap = sorted(swap_avg_by_hour,reverse=True)

#Printing the top five hours 
#print(sorted_swap[0:4])

for row in sorted_swap[0:5]:
    avg = row[0]
    hour = row[1]
    unformated_hour = dt.datetime.strptime(hour,"%H")
    formated_hour = dt.datetime.strftime(unformated_hour,"%H:%M")
    print("{time}: {average:.2f} comments per post".format(time=formated_hour,average=avg))
    

    

15:00: 38.59 comments per post
02:00: 23.81 comments per post
20:00: 21.52 comments per post
16:00: 16.80 comments per post
21:00: 16.01 comments per post
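The swap step can be avoided by passing a `key` function to `sorted()`, which sorts on the average directly. The sketch below uses three entries from `avg_by_hour`, rounded to two decimals for readability; the names `sample_avg_by_hour` and `top_hours` are illustrative.

```python
# Three [hour, average] entries from avg_by_hour, rounded to two decimals
sample_avg_by_hour = [['19', 10.8], ['14', 13.23], ['15', 38.59]]

# Sort on the second element (the average) in descending order
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours:
    print("{}:00: {:.2f} comments per post".format(hour, avg))
```

Sorting with a key keeps each entry in its original `[hour, average]` order, so no second list is needed.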


Using the average number of comments per post for each hour, which is a sort of hourly comment density for Ask posts, we can see that posts created between **15:00 and 15:59** receive the most comments on average (38.59 per post), so posting in that window gives the best chance of attracting comments.