# Hacker News Website

[Hacker News](https://news.ycombinator.com/) is a website that is extremely popular in technology and startup circles. User-submitted stories (known as "posts") are voted and commented on by site visitors.

## Aim

The aim is to analyze Hacker News posts to find the most popular post category and the most suitable time to submit a post in order to receive greater comment engagement. Because `Ask HN` and `Show HN` are popular post categories, we will only consider these for further analysis.

* Users submit `Ask HN` posts to ask the Hacker News community a specific question.
* Users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting.

The following questions have to be answered before we arrive at a conclusion.

* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

## Input dataset

The dataset used by this project can be downloaded [here](https://www.kaggle.com/hacker-news/hacker-news-posts). The columns present in the dataset are described below.

* `id`: The unique identifier from Hacker News for the post.
* `title`: The title of the post.
* `url`: The URL that the post links to, if the post has a URL.
* `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes.
* `num_comments`: The number of comments that were made on the post.
* `author`: The username of the person who submitted the post.
* `created_at`: The date and time at which the post was submitted.

Let's start our analysis by importing the libraries we need and reading the data set into a list of lists. This will help us identify the format in which each element is stored in the data set.

In [1]:
# Import reader from the csv module to read the file, then convert it into a list of lists named hn.

from csv import reader

# The with block closes the file once it has been read.
with open("hacker_news.csv", encoding="utf8") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
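Because every row is a plain list, the later cells refer to columns by position. As an aside, the sketch below shows how the same layout could be read by column name with `csv.DictReader`. The two-row in-memory sample is only a stand-in for `hacker_news.csv` and assumes the header shown above; the rest of this notebook keeps using the list-of-lists form.

```python
import io
from csv import DictReader

# Two sample rows copied from the output above; the real file is assumed
# to start with the same header line.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
    "12578989,algorithmic music,http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,1,0,poindontcare,9/26/2016 3:16\n"
)

# Each row becomes a dict keyed by the header names, so no positional index is needed.
rows = list(DictReader(sample_csv))
for row in rows:
    print(row["title"], "->", int(row["num_comments"]), "comments")
```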


Let's find the number of rows present in the data set and separate the header row from the rest for further analysis.

In [2]:
# Separate the header row from the data set. headers contains the header row and hn contains the remaining rows.

headers = hn[0]   
hn = hn[1:]
print("The number of rows in the data set: {}".format(len(hn)),"\n")
print(headers)
print("----------------------------------------------------------------------------","\n")
for row in hn[:5]:
    print(row,"\n")

The number of rows in the data set: 293119 

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
---------------------------------------------------------------------------- 

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'] 

['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'] 

['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'] 

['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'] 

['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://ww

## Data Cleaning

### Removing posts that received no comments.

Our analysis focuses on the number of comments received by each post, which is stored in `num_comments`. It is the fifth column and thus has an index of 4. The five rows listed above received no comments, so we need to remove the rows where the post received zero comments.

In [3]:
# To remove posts with zero comments.

print("The number of rows before deletion of posts with zero comments: ",len(hn))
temp_hn = []
for row in hn:
    num_comments = int(row[4])
    if num_comments != 0:
        temp_hn.append(row)
hn = temp_hn
print("The number of rows after deletion of posts with zero comments: ",len(hn))

The number of rows before deletion of posts with zero comments:  293119
The number of rows after deletion of posts with zero comments:  80401
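The same filter can be written more compactly. The sketch below is an illustration only: it uses a list comprehension and a named constant (`NUM_COMMENTS`, a name introduced here) in place of the literal index 4, and runs on three sample rows rather than the full data set.

```python
# Position of the num_comments column in each row (fifth column, index 4).
NUM_COMMENTS = 4

# Three sample rows in the same layout as hn.
sample_rows = [
    ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'],
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
]

# Keep only the rows whose comment count is greater than zero.
commented_rows = [row for row in sample_rows if int(row[NUM_COMMENTS]) > 0]

print("Rows before filtering:", len(sample_rows))
print("Rows after filtering:", len(commented_rows))
```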


### Filtering Ask HN and Show HN posts from the dataset.

As we are only concerned with `Ask HN` and `Show HN` posts, we will create three new lists to store the following information:

* ask_posts[] to store rows where the title starts with `Ask HN`.
* show_posts[] to store rows where the title starts with `Show HN`.
* other_posts[] to store all other rows.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

# The number of rows present in each list.

print('Number of Ask HN posts:',len(ask_posts))
print('Number of Show HN posts:',len(show_posts))
print('Number of Other posts:',len(other_posts)) 

# Let's look at the first 5 rows from each list to verify our data cleaning process.

print("\n\n")
print(ask_posts[:5])
print("\n")
print(show_posts[:5])
print("\n")
print(other_posts[:5])

Number of Ask HN posts: 6911
Number of Show HN posts: 5059
Number of Other posts: 68431



[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50'], ['12576946', 'Ask HN: How hard would it be to make a cheap, hackable phone?', '', '2', '1', 'hkt', '9/25/2016 19:30']]


[['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06'], ['12576813', 'Show HN: Learn Japanese Vocab via multiple choice questions', 'http://japanese.vul.io/', '1', '1', 'soulchild37', '9/25/201
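The classification rule used above can be isolated into a small helper, which makes the case-insensitive prefix check easy to try on individual titles. This is a minimal sketch; `classify_title` is a name introduced here for illustration and is not part of the notebook's cells.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

print(classify_title("Ask HN: How do you pass on your work when you die?"))           # ask
print(classify_title("Show HN: Learn Japanese Vocab via multiple choice questions"))  # show
print(classify_title("algorithmic music"))                                            # other
```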

## Data Analysis

### Average number of comments for Ask HN and Show HN posts

Let's determine whether ask posts or show posts receive more comments on average. For this, we will compute the average number of comments received by the posts in both ask_posts[] and show_posts[]. The number of comments is stored at index 4 in both lists.

In [5]:
# To compute the average number of comments for ask posts

total_ask_post = len(ask_posts)
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / total_ask_post 

# To compute the average number of comments for show posts

total_show_post = len(show_posts)
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / total_show_post

# To display ask post results

print("*"*10,"Ask HN Post","*"*10)
print("The total number of Ask HN post is ",total_ask_post)
print("The total number of comments for Ask HN post is ",total_ask_comments)
print("The average number of comments for Ask HN post is {:.2f}".format(avg_ask_comments))

# To display show post results

print("\n","*"*10,"Show HN Post","*"*10)
print("The total number of Show HN post is ",total_show_post)
print("The total number of comments for Show HN post is ",total_show_comments)
print("The average number of comments for Show HN post is {:.2f}".format(avg_show_comments))

********** Ask HN Post **********
The total number of Ask HN post is  6911
The total number of comments for Ask HN post is  94986
The average number of comments for Ask HN post is 13.74

 ********** Show HN Post **********
The total number of Show HN post is  5059
The total number of comments for Show HN post is  49633
The average number of comments for Show HN post is 9.81


Let's summarize the above results.

* The total number of `Ask HN` posts is 6911 versus 5059 for `Show HN` posts.
* The total number of comments for `Ask HN` posts is 94986 versus 49633 for `Show HN` posts.
* The average number of comments for `Ask HN` posts is 13.74 versus 9.81 for `Show HN` posts.

> Thus `Ask HN` posts receive more comments on average than `Show HN` posts.
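The two averaging loops above follow the same pattern, so they could be folded into one helper. The sketch below assumes rows shaped like those in `ask_posts`; `average_comments` is a hypothetical name introduced here. It also re-derives the reported averages from the printed totals as a cross-check.

```python
def average_comments(posts, comments_index=4):
    """Mean of the integer comment counts stored at comments_index in each row."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# The five Ask HN rows shown earlier, reduced to the columns that matter here.
sample_posts = [
    ['12578908', '', '', '4', '7'],
    ['12578522', '', '', '6', '3'],
    ['12577870', '', '', '1', '3'],
    ['12577647', '', '', '5', '2'],
    ['12576946', '', '', '2', '1'],
]
print(average_comments(sample_posts))

# Cross-check of the reported averages using the totals printed above.
ask_average = 94986 / 6911
show_average = 49633 / 5059
print("{:.2f} {:.2f}".format(ask_average, show_average))  # 13.74 9.81
```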

### Analyze the average number of comments received by post based on its hour of creation

We'll determine if posts created at a certain time are more likely to attract more comments. We'll use the following steps to perform this analysis:

* Calculate the number of posts created in each hour of the day, along with the number of comments received by them.
* Calculate the average number of comments that posts receive based on the hour of their creation.

We can find the above using the `created_at` and `num_comments` columns in the data set. `created_at` is the seventh column and thus has an index of 6. `num_comments` is the fifth column and thus has an index of 4.

Note: According to the data set [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts/home), the timezone used is Eastern Time in the US.
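Before any grouping by hour, the `created_at` strings have to be parsed. The sketch below shows this for a single value taken from the rows above, assuming every timestamp follows the same month/day/year hour:minute layout.

```python
import datetime as dt

# Layout of the created_at strings in this data set, e.g. '9/26/2016 2:53'.
DATE_FORMAT = "%m/%d/%Y %H:%M"

created_at = "9/26/2016 2:53"
parsed = dt.datetime.strptime(created_at, DATE_FORMAT)

# The result is a naive datetime (no timezone attached), so its hour is
# still the Eastern Time hour recorded in the data set.
print(parsed.hour)  # 2
```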

#### Let's perform the first step to calculate the number of posts created in each hour of the day, along with the number of comments received by them.

In [6]:
# To make a list that contains only the created time and number of comments for each ask post.

time_comment_ask_post = []
for row in ask_posts:
    time_comment_ask_post.append([row[6],int(row[4])])

# To make a list that contains only the created time and number of comments for each show post.

time_comment_show_post = []
for row in show_posts:
    time_comment_show_post.append([row[6],int(row[4])])                       

Let's write a generic function to find the number of posts created in each hour of the day and the total number of comments received by them. We will need to convert the timezone from Eastern Time in the USA to Indian Standard Time (IST).

In [7]:
import pytz 
import datetime as dt

# The counts_by_hour dictionary counts the number of posts created in each hour of the day.
# The comments_by_hour dictionary contains the total number of comments received by posts based on their hour of creation.

def freq_table_time(input_list): 
    
    counts_by_hour = dict()
    comments_by_hour = dict()
    date_format = "%m/%d/%Y %H:%M"
    
    for element in input_list:     
        date_time = dt.datetime.strptime(element[0],date_format)
        timezone = pytz.timezone('America/New_York')
        date_time = timezone.localize(date_time)
    
        # Convert to Indian Standard Time (IST) for further analysis
    
        date_time = date_time.astimezone(pytz.timezone('Asia/Kolkata'))
        hour = date_time.strftime("%H")
    
        if hour in counts_by_hour:
            counts_by_hour[hour] += 1
            comments_by_hour[hour] += element[1]
        else:
            counts_by_hour[hour] = 1
            comments_by_hour[hour] = element[1]  
    
    return  counts_by_hour, comments_by_hour

We created two dictionaries:

* The `counts_by_hour` dictionary counts the number of posts created in each hour of the day.
* The `comments_by_hour` dictionary contains the total number of comments received by posts based on their hour of creation.
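As a variation on `freq_table_time`, the sketch below builds the same two dictionaries with `collections.defaultdict` and the standard-library `zoneinfo` module (Python 3.9 or later, which also needs the system time zone database or the `tzdata` package). This is a substitution made here for illustration, not the notebook's `pytz` approach, and `hourly_totals` is a hypothetical name. The sample pairs are made up to show two posts landing in the same IST hour.

```python
import datetime as dt
from collections import defaultdict
from zoneinfo import ZoneInfo  # standard library alternative to pytz

EASTERN = ZoneInfo("America/New_York")
IST = ZoneInfo("Asia/Kolkata")
DATE_FORMAT = "%m/%d/%Y %H:%M"

def hourly_totals(pairs):
    """Return (counts_by_hour, comments_by_hour) keyed by two-digit IST hour strings."""
    counts_by_hour = defaultdict(int)
    comments_by_hour = defaultdict(int)
    for created_at, num_comments in pairs:
        # Attach Eastern Time to the naive timestamp, then convert to IST.
        eastern_time = dt.datetime.strptime(created_at, DATE_FORMAT).replace(tzinfo=EASTERN)
        hour = eastern_time.astimezone(IST).strftime("%H")
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    return dict(counts_by_hour), dict(comments_by_hour)

# Hypothetical [created_at, num_comments] pairs shaped like time_comment_ask_post.
sample_pairs = [["9/26/2016 2:53", 7], ["9/26/2016 3:26", 5], ["9/26/2016 1:17", 3]]
counts, comments = hourly_totals(sample_pairs)
print(counts)
print(comments)
```

With `defaultdict(int)` a missing hour starts at zero, so the `if hour in counts_by_hour` branch is not needed.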

#### Let's perform the second step to calculate the average number of comments posts receive based on their hour of creation.

To implement this, let's write a `display_by_hour` function that displays the average number of comments posts receive based on their hour of creation, with the help of the `freq_table_time` function.

In [8]:
# To convert from dictionary to a list type for sorting in descending order

def dictionary_to_list(input_dictionary):
    
    to_list = []
    
    for key,val in input_dictionary.items():
        to_list.append([val,key])
    
    to_list = sorted(to_list, reverse = True)

    return to_list

# To display the results in a formatted manner

def display_by_hour(input_list):
    
    counts_by_hour, comments_by_hour = freq_table_time(input_list)
    
    # To output the number of posts created on hourly basis in descending order
    
    counts_by_hour_list = dictionary_to_list(counts_by_hour)
    for row in counts_by_hour_list:
        print("At {}:00 - {} posts where created.".format(row[1],row[0]))
    print("\n")
    
    # To output the number of comments received by posts based on its hour of creation in descending order
    
    comments_by_hour_list = dictionary_to_list(comments_by_hour)
    for row in comments_by_hour_list:
        print("For post created at {}:00 - {} comments was received.".format(row[1],row[0]))
    print("\n")
    
    # To output the average number of comments received by posts based on its hour of creation in descending order
    
    avg_by_hour = []
    for key,val in comments_by_hour.items():
        avg_by_hour.append([(val/counts_by_hour[key]),key])
    avg_by_hour = sorted(avg_by_hour, reverse = True)
    
    for row in avg_by_hour:
        print("For post created at {}:00, an average of {:.2f} comments was received.".format(row[1],row[0]))
    print("\n")
        
    # The combined analysis of hacker news post
    
    for row in avg_by_hour:
        print("At {}:00, an average of {:.2f} comments is received from a total of {} comments and {} posts.".format(
                                                row[1], row[0], comments_by_hour[(row[1])], counts_by_hour[(row[1])] ))  
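The averaging and sorting step can also be expressed without swapping keys and values, by sorting on the average directly. The sketch below is one possible way to do it; `rank_hours_by_average` is a name introduced here. The "00" entry uses the `Ask HN` figures reported in this notebook (410 posts, 12676 comments), while the other two entries are hypothetical values for illustration only.

```python
def rank_hours_by_average(counts_by_hour, comments_by_hour):
    """Return (hour, average_comments) tuples sorted from highest to lowest average."""
    averages = [
        (hour, comments_by_hour[hour] / counts_by_hour[hour])
        for hour in counts_by_hour
    ]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)

# "00" comes from the Ask HN output in this notebook; "09" and "16" are hypothetical.
sample_counts = {"00": 410, "09": 100, "16": 50}
sample_comments = {"00": 12676, "09": 800, "16": 600}

ranked = rank_hours_by_average(sample_counts, sample_comments)
for hour, average in ranked:
    print("{}:00 -> {:.2f}".format(hour, average))
```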

Let's display the analysis of `Ask HN` posts with the help of the above function.

In [9]:
print("\n","*"*20,"Ask HN Post","*"*20,"\n")
display_by_hour(time_comment_ask_post)


 ******************** Ask HN Post ******************** 

At 03:00 - 445 posts where created.
At 05:00 - 426 posts where created.
At 06:00 - 417 posts where created.
At 02:00 - 413 posts where created.
At 01:00 - 412 posts where created.
At 00:00 - 410 posts where created.
At 04:00 - 409 posts where created.
At 07:00 - 342 posts where created.
At 23:00 - 327 posts where created.
At 08:00 - 295 posts where created.
At 22:00 - 289 posts where created.
At 09:00 - 272 posts where created.
At 21:00 - 256 posts where created.
At 10:00 - 249 posts where created.
At 20:00 - 234 posts where created.
At 13:00 - 217 posts where created.
At 11:00 - 212 posts where created.
At 12:00 - 211 posts where created.
At 19:00 - 200 posts where created.
At 15:00 - 192 posts where created.
At 18:00 - 180 posts where created.
At 14:00 - 178 posts where created.
At 17:00 - 168 posts where created.
At 16:00 - 157 posts where created.


For post created at 00:00 - 12676 comments was received.
For post created at

From the above result we can infer that it is better to create an `Ask HN` post between 00:00 and 01:00, 01:00 and 02:00, or 22:00 and 23:00 IST.

> Thus we can conclude that `Ask HN` posts created between 00:00 and 01:00 IST have a better chance of receiving more attention.

In [10]:
print("\n","*"*20,"Show HN Post","*"*20,"\n")
display_by_hour(time_comment_show_post)


 ******************** Show HN Post ******************** 

At 02:00 - 383 posts where created.
At 00:00 - 368 posts where created.
At 01:00 - 362 posts where created.
At 03:00 - 347 posts where created.
At 23:00 - 337 posts where created.
At 04:00 - 323 posts where created.
At 22:00 - 289 posts where created.
At 05:00 - 259 posts where created.
At 21:00 - 248 posts where created.
At 06:00 - 247 posts where created.
At 07:00 - 203 posts where created.
At 08:00 - 186 posts where created.
At 20:00 - 172 posts where created.
At 18:00 - 164 posts where created.
At 19:00 - 154 posts where created.
At 10:00 - 146 posts where created.
At 09:00 - 144 posts where created.
At 17:00 - 123 posts where created.
At 11:00 - 122 posts where created.
At 12:00 - 119 posts where created.
At 15:00 - 97 posts where created.
At 16:00 - 96 posts where created.
At 13:00 - 94 posts where created.
At 14:00 - 76 posts where created.


For post created at 23:00 - 4125 comments was received.
For post created at 01:

For `Show HN` posts, there isn't much difference between the average comment values shown above. Thus the time of post creation is not as critical for `Show HN` posts as it is for `Ask HN` posts. It is still better to create a `Show HN` post between 23:00 and 24:00 IST.

> Thus we can conclude that `Show HN` posts created between 23:00 and 24:00 IST receive marginally more comments than others.

## Conclusion

`Ask HN` posts are more popular in terms of comment engagement, with an average of 13.74 comments per `Ask HN` post versus 9.81 comments per `Show HN` post. The best hour for post creation, based on the average number of comments received, is as follows.

* For `Ask HN` posts: between 00:00 and 01:00 (12 am - 1 am IST), with an average of 30.92 comments per post.
* For `Show HN` posts: between 23:00 and 24:00 (11 pm - 12 am IST), with an average of 12.24 comments per post.
    
Hacker News is more widely used by Americans, while the suitable times for post creation above are presented in IST (Indian Standard Time), which explains the late hours. Eastern Time in the USA is 9 hours 30 minutes behind IST during daylight saving time (EDT) and 10 hours 30 minutes behind during standard time (EST). The best hours therefore correspond to mid-afternoon in the eastern USA.
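The gap between the two time zones can be checked with a short sketch. It uses the standard-library `zoneinfo` module (Python 3.9 or later, with the system time zone database or the `tzdata` package available) rather than the `pytz` calls used earlier, and `ist_lead_over_eastern` is a helper name introduced here for illustration. It prints the gap for one summer and one winter date, then converts midnight IST to Eastern Time.

```python
import datetime as dt
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")
IST = ZoneInfo("Asia/Kolkata")

def ist_lead_over_eastern(year, month, day):
    """Return how far IST is ahead of US Eastern Time at noon UTC on the given date."""
    moment = dt.datetime(year, month, day, 12, tzinfo=dt.timezone.utc)
    return moment.astimezone(IST).utcoffset() - moment.astimezone(EASTERN).utcoffset()

summer_gap = ist_lead_over_eastern(2016, 7, 1)   # daylight saving time in effect
winter_gap = ist_lead_over_eastern(2016, 1, 15)  # standard time in effect
print(summer_gap)  # 9:30:00
print(winter_gap)  # 10:30:00

# Midnight IST on 26 September 2016 expressed in Eastern Time.
midnight_ist = dt.datetime(2016, 9, 26, 0, 0, tzinfo=IST)
midnight_in_eastern = midnight_ist.astimezone(EASTERN).strftime("%m/%d/%Y %H:%M")
print(midnight_in_eastern)  # 09/25/2016 14:30
```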