***Hacker News Project***


The objective of this project is to compare and analyze two types of posts: "Ask HN" posts versus "Show HN" posts. We want to see which type of post gets more comments on average. We also want to see if posts created at a certain time receive more comments on average.

We will start by importing the reader function from the csv module and loading the dataset as a list of lists.


In [18]:
from csv import reader

# Open the file in a with-block so it is closed once the rows are read
with open('hacker_news.csv', encoding='utf8') as opened_file:
    hn = list(reader(opened_file))

# Display the header and the first four rows of data
for row in hn[0:5]:
    print(row)
    print('\n')


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
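
The loading step can also be wrapped in a small function. The sketch below is only an illustration: the name read_rows and the in-memory sample (built from the first data row shown above) are assumptions so that the example runs without hacker_news.csv on disk.

```python
import csv
import io

# Hypothetical in-memory sample standing in for hacker_news.csv.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

def read_rows(file_obj):
    """Return every row of an open CSV file object as a list of lists."""
    return list(csv.reader(file_obj))

# The with-block closes the file object as soon as the rows are read.
with sample_csv as f:
    sample_rows = read_rows(f)

for row in sample_rows:
    print(row)
```

With the real file, the same function could be called as read_rows(open_file) inside a with open('hacker_news.csv', encoding='utf8') block.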




Next, we remove the first row (the header) so that only post data remains for analysis.


In [17]:
headers = hn[0]
hn = hn[1:]

print(headers)

# Quick check that the header is gone: every row printed below should be a post
print(hn[0:5])


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
[['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'], ['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http:
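
One caveat with reassigning hn in place is that running the cell a second time strips a data row as well. A possible alternative, sketched here under the assumption that the full list is kept under a separate name (all_rows is hypothetical), is to unpack the header and the posts without modifying the original list:

```python
# Hypothetical stand-in for the list produced by list(reader(opened_file)).
all_rows = [
    ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte',
     '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise',
     'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
     '3', '1', 'hswarna', '6/17/2016 0:01'],
]

# Starred unpacking leaves all_rows untouched, so this line can be
# re-run any number of times with the same result.
header, *posts = all_rows

print(header)
print(len(posts))
```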

Now we will separate the posts into three lists: posts whose titles start with Ask HN, posts whose titles start with Show HN, and all other posts.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

In [16]:
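# Note: re-running this cell without first re-running the cell above
# appends every post a second time and inflates the counts.
for row in hn:
    title = row[1].lower()  # lowercase once so the match is case-insensitive
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)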

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

3488
2324
34388
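
The same split can be written as a function that builds fresh lists on every call, which keeps the counts stable no matter how often it is run. This is a sketch only: split_posts and the three sample rows are hypothetical, not part of the original notebook.

```python
def split_posts(rows, title_index=1):
    """Split rows into (ask, show, other) by case-insensitive title prefix."""
    ask, show, other = [], [], []
    for row in rows:
        title = row[title_index].lower()
        if title.startswith("ask hn"):
            ask.append(row)
        elif title.startswith("show hn"):
            show.append(row)
        else:
            other.append(row)
    return ask, show, other

# Hypothetical rows: only the id and title columns matter for this example.
sample_posts = [
    ['1', 'Ask HN: How do you learn Python?'],
    ['2', 'SHOW HN: My weekend project'],
    ['3', 'Interactive Dynamic Video'],
]

ask_sample, show_sample, other_sample = split_posts(sample_posts)
print(len(ask_sample), len(show_sample), len(other_sample))
```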


Next, we'll determine if ask posts or show posts receive more comments on average.

In [15]:
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    element = int(row[4])
    total_ask_comments += element

avg_ask_comments = total_ask_comments/len(ask_posts)    
print("Avg number of comments on ask posts is: ", avg_ask_comments)

for row in show_posts:
    element = int(row[4])
    total_show_comments += element
    
avg_show_comments = total_show_comments/len(show_posts)    
print("Avg number of comments on show posts is: ", avg_show_comments)




Avg number of comments on ask posts is:  14.038417431192661
Avg number of comments on show posts is:  10.31669535283993


Ask posts receive more comments on average: about 14.04 per post, versus about 10.32 for show posts. This makes sense. If you are asking the community something, more people will be compelled to comment and answer.
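
Since the two loops above differ only in the list they read, the calculation can be factored into one helper. The function name average_comments and the sample rows are assumptions made for this sketch; the comment count is taken from column index 4, as in the cells above.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the dataset's column order; only index 4 is used.
sample_ask = [
    ['1', 'Ask HN: A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B', '', '3', '20', 'user_b', '8/4/2016 15:10'],
]

print(average_comments(sample_ask))
print(average_comments([]))
```

The empty-list guard avoids a ZeroDivisionError if one of the categories happens to contain no posts.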

Now we will calculate the number of ask posts created per hour, along with the total number of comments each hour. We first collect each post's creation time and comment count in a list of lists, then tally both values into two dictionaries keyed by hour:

In [14]:
import datetime as dt
result_list = [] 

for row in ask_posts:
    temp_list = []
    temp_list.append(row[6])
    temp_list.append(int(row[4]))
    result_list.append(temp_list)   
    
counts_by_hour = {} #contains the number of ask posts created during each hour of the day.
comments_by_hour = {} #contains the corresponding number of comments ask posts created at each hour received.


for row in result_list:
    comment = row[1]
    date = row[0]
    formatted_time = dt.datetime.strptime(date, "%m/%d/%Y %H:%M").strftime("%H")
    if formatted_time in counts_by_hour:
        counts_by_hour[formatted_time] += 1
        comments_by_hour[formatted_time] += comment
    else:
        counts_by_hour[formatted_time] = 1
        comments_by_hour[formatted_time] = comment 
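
The tally above can be shortened with collections.defaultdict, which removes the need for the if/else branch. The sketch below assumes the same column positions (comments at index 4, creation time at index 6) and the same date format; tally_by_hour and the sample rows are hypothetical.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(posts, date_format="%m/%d/%Y %H:%M"):
    """Return (counts, comments) dictionaries keyed by two-digit hour."""
    counts = defaultdict(int)    # posts created in each hour
    comments = defaultdict(int)  # comments received by those posts
    for row in posts:
        hour = dt.datetime.strptime(row[6], date_format).strftime("%H")
        counts[hour] += 1
        comments[hour] += int(row[4])
    return dict(counts), dict(comments)

# Hypothetical rows in the dataset's column order.
sample_ask = [
    ['1', 'Ask HN: A', '', '5', '10', 'user_a', '8/4/2016 15:52'],
    ['2', 'Ask HN: B', '', '3', '20', 'user_b', '1/26/2016 15:05'],
    ['3', 'Ask HN: C', '', '1', '4', 'user_c', '6/17/2016 0:01'],
]

sample_counts, sample_comments = tally_by_hour(sample_ask)
print(sample_counts)
print(sample_comments)
```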
        

    

Now we will create a list of lists in which each sublist holds an hour of the day and the average number of comments per post for ask posts created during that hour:

In [13]:
avg_by_hour = []
for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr]/counts_by_hour[hr]])       

Here are the results:    


In [9]:
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

This format makes it difficult to see the hours with the highest values. To fix that, we first swap the two elements of each sublist so that sorting orders the list by average rather than by hour:

In [12]:
swap_avg_by_hour = []

for sublist in avg_by_hour:
    swap_avg_by_hour.append([sublist[1], sublist[0]])
    
#Here we swapped the order of the sublist elements in avg_by_hour

swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

Now we will sort and print the 5 highest values: 

In [19]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)


print("Top 5 Hours for Ask Posts Comments:")
for row in sorted_swap[:5]:
    row[1] = dt.datetime.strptime(row[1],"%H").strftime("%H:%M")
    row[0] = "{:.2f}".format(row[0])
    
    print(row[1], ":", row[0], "average comments per post")

Top 5 Hours for Ask Posts Comments:
15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post
21:00 : 16.01 average comments per post
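
The swap step is one way to sort by average; another is to pass a key function to sorted and leave the sublists in their original order. The sketch below uses three of the [hour, average] pairs from the results above. The function name top_hours is an assumption made for this example.

```python
def top_hours(avg_by_hour, n=5):
    """Return formatted lines for the n hours with the highest averages."""
    # key picks the average (index 1), so no swapped copy is needed.
    ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
    return [
        "{}:00 : {:.2f} average comments per post".format(hour, avg)
        for hour, avg in ranked[:n]
    ]

# Three of the [hour, average] pairs computed earlier in the notebook.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

for line in top_hours(sample_avg_by_hour, n=2):
    print(line)
```

Unlike the loop above, this version does not overwrite the values in the sorted list, so the numeric averages remain available for later calculations.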


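To get the most comments on an ask post, you should post at around 3 PM (15:00), since posts created during that hour receive the most comments on average: 38.59 per post. That is well above the other four hours in the output above, where the next highest average is 23.81 at 02:00.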