# Hacker News Project
In this project, we use the [Hacker News posts data set](https://www.kaggle.com/hacker-news/hacker-news-posts) hosted on Kaggle.
We will explore it to find out which kinds of posts, and which posting times, attract the most comments.


# Preparing the data
First, let us prepare the data by reading the CSV file into a list of lists and separating the header row from the data rows.

In [1]:
from csv import reader

# Read the whole CSV file into a list of lists; the with block closes the file afterwards
with open("hacker_news.csv") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

#print the first five rows
for row in hn[:5]:
    print(row,"\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 



In [2]:
headers = hn[0]   # the header row as a flat list, not a list inside a list
hn = hn[1:]       # keep only the data rows
print(headers)
for row in hn[:5]:
    print(row,"\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'] 
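
The loading and header-splitting steps above could also be wrapped in a small helper. This is only a sketch: `load_rows` is a hypothetical name, the in-memory sample stands in for `hacker_news.csv`, and the `utf-8` encoding in the commented usage is an assumption about the file.

```python
import csv
import io

def load_rows(file_obj):
    """Return (header, data_rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# In-memory stand-in for hacker_news.csv: the header plus one row from the data set.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)
header, rows = load_rows(sample)
print(header)
print(rows[0])

# With the real file, the helper would be called inside a with block
# (the encoding is an assumption):
# with open("hacker_news.csv", encoding="utf-8", newline="") as f:
#     header, rows = load_rows(f)
```

Returning the header and the data rows separately avoids mutating the list in place, so the cell can be re-run without dropping another row.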



# Filtering the dataset
Now let us look for the posts that we are interested in: `Ask HN` and `Show HN` posts.
We will lowercase each title with the `lower()` method and then use the `startswith()` method to separate the posts whose titles start with `ask hn` or `show hn` from all other posts.

In [3]:
ask_posts=[]
show_posts=[]
other_posts=[]

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Number of ask posts:",len(ask_posts))
print("Number of show posts:",len(show_posts))
print("Number of other posts:",len(other_posts))


Number of ask posts: 1744
Number of show posts: 1162
Number of other posts: 17194
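
The case-insensitive prefix check used above can be isolated in a small function, which makes it easy to try on individual titles. This is a minimal sketch: `classify` is a hypothetical helper, and the first two titles are made-up examples.

```python
def classify(title):
    """Label a title as 'ask', 'show', or 'other' by its lowercase prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# The first two titles are invented for illustration; the third is from the data set.
titles = [
    "Ask HN: How do you learn?",
    "SHOW HN: My side project",
    "Interactive Dynamic Video",
]
labels = [classify(t) for t in titles]
print(labels)
```

Lowercasing first means that `Ask HN`, `ask hn`, and `ASK HN` all land in the same group.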


# Find the average number of comments on ask/show posts

Let us find out whether `ask posts` or `show posts` receive more comments on average.

In [4]:
total_ask_comments=0
for row in ask_posts:
    num_comments=int(row[4])
    if num_comments!=0:
        total_ask_comments+=num_comments
avg_ask_comments=total_ask_comments/len(ask_posts)
print("The average number of comments on ask posts is: {:.2f}".format(avg_ask_comments))

total_show_comments=0
for row in show_posts:
    num_comments=int(row[4])
    if num_comments!=0:
        total_show_comments+=num_comments

avg_show_comments = total_show_comments/len(show_posts)
print("The average number of comments on show posts is: {:.2f}".format(avg_show_comments))


The average number of comments on ask posts is: 14.04
The average number of comments on show posts is: 10.32


It seems that `ask posts` receive more comments on average (14.04 versus 10.32). One possible explanation is that the Hacker News community is more inclined to respond to people asking questions than to people showing something to the community.
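
The averaging step can be written once and reused for both groups. The sketch below assumes each row keeps the comment count as a string at index 4, as in the data set; `average_comments` is a hypothetical helper. Posts with zero comments still count in the denominator, which is also how the loops above behave, since skipping zeros does not change the total.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments per post; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Two rows taken from the data set preview (url shortened to an empty string).
sample_posts = [
    ['12224879', 'Interactive Dynamic Video', '', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source', '', '39', '10', 'josep2', '1/26/2016 19:30'],
]
print("{:.2f}".format(average_comments(sample_posts)))
```

With the lists built earlier, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two hand-written loops.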

# Is there a prime time of posting to receive the most comments?
Since ask posts receive more comments on average, we will focus our remaining analysis on `ask posts`. Let us now dive deeper to discover whether there is a time of day that attracts more comments.

We can look at the `created_at` column and convert its strings to datetime objects.

In [68]:
import datetime as dt

result_list=[]

for post in ask_posts:
    result_list.append([post[6],int(post[4])])

count_by_hour={}
comments_by_hour={}
for row in result_list:
    comment=row[1]
    #convert string to datetime obj
    time = dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
    #extract hour from datetime obj formatted into string type
    hour= time.strftime("%H")
    #insert into count_by_hour to generate frequency table
    if hour in count_by_hour:
        count_by_hour[hour]+=1
        comments_by_hour[hour]+=comment
    else:
        count_by_hour[hour]=1
        comments_by_hour[hour]=comment
       
print("-------------Posts by hour---------------")
#sort the count_by_hour list
countlist=[]
for key,value in count_by_hour.items():
    key_val_as_tuple=(key,value)
    countlist.append(key_val_as_tuple)
countlist=sorted(countlist)

for item in countlist:
    print("Hour:",item[0],"  "+"Posts:",item[1])
    
#sort the comments_by_hour list
print("-------------Comments by hour---------------")
commentlist=[]
for key,value in comments_by_hour.items():
    key_val_as_tuple=(key,value)
    commentlist.append(key_val_as_tuple)
commentlist=sorted(commentlist)

for item in commentlist:
    print("Hour:",item[0],"  "+"Comments:",item[1])

-------------Posts by hour---------------
Hour: 00   Posts: 55
Hour: 01   Posts: 60
Hour: 02   Posts: 58
Hour: 03   Posts: 54
Hour: 04   Posts: 47
Hour: 05   Posts: 46
Hour: 06   Posts: 44
Hour: 07   Posts: 34
Hour: 08   Posts: 48
Hour: 09   Posts: 45
Hour: 10   Posts: 59
Hour: 11   Posts: 58
Hour: 12   Posts: 73
Hour: 13   Posts: 85
Hour: 14   Posts: 107
Hour: 15   Posts: 116
Hour: 16   Posts: 108
Hour: 17   Posts: 100
Hour: 18   Posts: 109
Hour: 19   Posts: 110
Hour: 20   Posts: 80
Hour: 21   Posts: 109
Hour: 22   Posts: 71
Hour: 23   Posts: 68
-------------Comments by hour---------------
Hour: 00   Comments: 447
Hour: 01   Comments: 683
Hour: 02   Comments: 1381
Hour: 03   Comments: 421
Hour: 04   Comments: 337
Hour: 05   Comments: 464
Hour: 06   Comments: 397
Hour: 07   Comments: 267
Hour: 08   Comments: 492
Hour: 09   Comments: 251
Hour: 10   Comments: 793
Hour: 11   Comments: 641
Hour: 12   Comments: 687
Hour: 13   Comments: 1253
Hour: 14   Comments: 1416
Hour: 15   Comments: 447
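
The two frequency tables can also be built with `collections.defaultdict`, which removes the `if hour in ...` branches. This is a sketch under the same format assumption as the cell above (`%m/%d/%Y %H:%M`); `tally_by_hour` is a hypothetical helper and the third sample timestamp is made up.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(pairs):
    """Return (posts per hour, comments per hour) keyed by a two-digit hour string."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, n_comments in pairs:
        parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
        hour = parsed.strftime("%H")
        counts[hour] += 1
        comments[hour] += n_comments
    return dict(counts), dict(comments)

# (created_at, num_comments) pairs; the last one is invented for illustration.
sample = [("8/4/2016 11:52", 52), ("1/26/2016 19:30", 10), ("6/23/2016 11:05", 1)]
counts, comments = tally_by_hour(sample)
print(counts)
print(comments)
```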

Now let us calculate the average number of comments per post for each hour of the day.

In [54]:
avg_by_hr=[]

for item in countlist:
    avg_list=[]
    hour=item[0]
    posts=item[1]
    for x in range(24):
        if hour==commentlist[x][0]:
            avg= commentlist[x][1]/posts
            avg_list.append(hour)
            avg_list.append(avg)
            break
    avg_by_hr.append(avg_list)
    
print("---------------Average Comments Per Post By Hour---------------")
for item in avg_by_hr:
    print("Hour:",item[0], "   Avg Comments Per Post:", item[1])

---------------Average Comments Per Post By Hour---------------
Hour: 00    Avg Comments Per Post: 8.127272727272727
Hour: 01    Avg Comments Per Post: 11.383333333333333
Hour: 02    Avg Comments Per Post: 23.810344827586206
Hour: 03    Avg Comments Per Post: 7.796296296296297
Hour: 04    Avg Comments Per Post: 7.170212765957447
Hour: 05    Avg Comments Per Post: 10.08695652173913
Hour: 06    Avg Comments Per Post: 9.022727272727273
Hour: 07    Avg Comments Per Post: 7.852941176470588
Hour: 08    Avg Comments Per Post: 10.25
Hour: 09    Avg Comments Per Post: 5.5777777777777775
Hour: 10    Avg Comments Per Post: 13.440677966101696
Hour: 11    Avg Comments Per Post: 11.051724137931034
Hour: 12    Avg Comments Per Post: 9.41095890410959
Hour: 13    Avg Comments Per Post: 14.741176470588234
Hour: 14    Avg Comments Per Post: 13.233644859813085
Hour: 15    Avg Comments Per Post: 38.5948275862069
Hour: 16    Avg Comments Per Post: 16.796296296296298
Hour: 17    Avg Comments Per Post: 11.46
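
Because both tables share the same hour keys, the average can also be computed with a direct dictionary lookup instead of the nested loop. The sketch below uses the first three hours from the output above; `average_by_hour` is a hypothetical helper.

```python
def average_by_hour(counts, comments):
    """Average comments per post for each hour, visited in sorted hour order."""
    return {hour: comments[hour] / counts[hour] for hour in sorted(counts)}

# Posts and comments for hours 00 to 02, copied from the tables above.
counts = {"00": 55, "01": 60, "02": 58}
comments = {"00": 447, "01": 683, "02": 1381}

averages = average_by_hour(counts, comments)
for hour, avg in averages.items():
    print("Hour: {}   Avg Comments Per Post: {:.2f}".format(hour, avg))
```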


Let us sort the results so we can easily identify the top 5 hours with the highest average comments per post.

In [69]:
swap_avg_by_hr=[]
for item in avg_by_hr:
    swapped_list=[item[1],item[0]]
    swap_avg_by_hr.append(swapped_list)

#sort the swap_avg_by_hr list in descending order
sorted_swap = sorted(swap_avg_by_hr, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
for x in range(5):
    #convert str to datetime obj
    hour=dt.datetime.strptime(sorted_swap[x][1], "%H")
    avgcommentsperpost=sorted_swap[x][0]
    #format datetime obj to str
    hourstr = hour.strftime("%H:%M")
    template="{}: {:.2f} average comments per post"
    print(template.format(hourstr, avgcommentsperpost))
    

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
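
Swapping the columns before sorting is one way to rank by average; another is to pass a `key` function to `sorted`. A minimal sketch, assuming the averages are held in a dictionary keyed by hour (`top_hours` is a hypothetical helper, and the values are the rounded figures from the outputs above):

```python
def top_hours(avg_by_hour, n=5):
    """Return the n (hour, average) pairs with the highest averages."""
    ranked = sorted(avg_by_hour.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Rounded averages taken from the results above, plus hour 09 as a low value.
averages = {"09": 5.58, "15": 38.59, "02": 23.81, "20": 21.52, "16": 16.80, "21": 16.01}

for hour, avg in top_hours(averages):
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```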


# Conclusion
According to the results, ask posts created during the 15:00 hour (Eastern Time in the US) receive the most comments on average, at 38.59 per post, so posting then gives the best chance of receiving comments. However, the results may not be sufficient to draw a firm conclusion, because the average number of comments per post can be affected by other factors:
1. Relevance of the question (a misclassified question may receive fewer comments)
2. Topic of the question (new and hot topics tend to receive more comments)
3. The data shows only an association between posting hour and average comments per post, not a direct causal relation (if the data included timestamps for the comments, we could get a better metric by counting the comments made within the first hour of posting)