# Analysis of HackerNews Posts

This is a guided project provided by [Dataquest.io](https://dataquest.io/). In this project we will be analysing the posts of [HackerNews](https://news.ycombinator.com/) Platform, a popular site where technology related stories (or 'posts') are voted and commented upon.It will answer the following questions:

-   Do Ask HN or Show HN receive more comments on average?<br>
-   Do posts created at a certain time receive more comments on average?

where, <br>
**Ask HN posts** = The questions on the HackerNews Community <br>
**Show HN posts** = The posts to show community a project, product or something interesting. <br><br>
It should be noted that the data set we're working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

## Apart from what was covered in the guided project, following additional analysis was also performed:

- Determining if show or ask posts receive more points on average
- Finding the blockbuster (most popular) type of Post based on average points 
- Determining if posts created at a certain time are more likely to receive more points


# Introduction
First we'll read and remove the headers from the dataset

In [2]:
from csv import reader
import datetime as dt

#reading the file and converting it into usable list of lists format
read_file = reader(open("hacker_news.csv"))
hn = list(read_file)
print("Data With Headers:\n")
print(hn[:5])
print("\n")

#first row contains headers
headers = hn[0]
print("Headers:\n")
print(headers)
print("\n")


Data With Headers:

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


Headers:

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']




# Removing Headers from the dataset ( list of lists)


In [121]:
#excluding headers from dataset
hn = hn[1:]
print("Data Without Headers:\n")
for i in hn[:5]:
    print(i)
    print("\n")

Data Without Headers:

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutral

### Conclusions drawn from high level view of data

We can see above that the data set contains the title of the posts, the number of comments for each post, and the date the post was created. Let's start by exploring the number of comments for each type of post.<br>

# Extracting Ask HN and Show HN Posts
First we will identify the posts starting with Ask HN or Show HN and categorize the data extracted into two different lists according to the type of post. Separating the list makes it easier to further analyse the data.

In [12]:
ask_posts =[]
show_posts =[]
other_posts =[]

#assign the posts according to their categorization from hn list of lists
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn') == True:
        ask_posts.append(row)
    elif title.lower().startswith('show hn') == True:
        show_posts.append(row)
    else:
        other_posts.append(row)

#Printing number of posts of each types
print("Ask Posts: "+str(len(ask_posts))+"\n")
print("Show Posts: "+str(len(show_posts))+"\n")
print("Other Posts: "+str(len(other_posts))+"\n")

#Sample data in ask_posts and show_posts lists
print("Sample data in Ask Posts list: \n")
for i in ask_posts[:5]:
    print(i)
    print("\n")
print("Sample data in Show Posts list: \n")
for i in show_posts[:5]:
    print(i)
    print("\n")

Ask Posts: 1744

Show Posts: 1162

Other Posts: 17194

Sample data in Ask Posts list: 

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']


['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']


['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']


['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']


['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']


Sample data in Show Posts list: 

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']


['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/2

# Calculating the average number of comments for Ask HN and Show HN Posts
Now that we separated ask posts and show posts into different lists, we'll calculate the average number of comments each type of post receives.

In [9]:
total_ask_comments = 0
total_show_comments = 0

#calculating total number of comments of each category
for row in ask_posts:
    total_ask_comments += int(row[4])    #row[4] contains number of comments in the dataset row
    
for row in show_posts:
    total_show_comments += int(row[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)

print("\nAverage Number of Comments on Ask HN Category: " + '{:0.2f}'.format(avg_ask_comments) +"\n")
print("Average Number of Comments on Show HN Category: " + '{:0.2f}'.format(avg_show_comments) +"\n")

if(avg_ask_comments > avg_show_comments):
    print("Ask Posts recieve more comments on average")

elif(avg_ask_comments == avg_show_comments):
    print("Ask Posts has same comments on average as Show Posts") 
    
else:
    print("Show Posts recieve more comments on average")
    
    


Average Number of Comments on Ask HN Category: 14.04

Average Number of Comments on Show HN Category: 10.32

Ask Posts recieve more comments on average


On average, Ask HN Posts in our sample recieve approximately 14 comments whereas, Show HN Posts recieve only 10. Since Ask posts are mre likely to recieve comments, we will focus our remaining analysis on these type of posts. 


# Finding the Amount of Ask Posts and Comments by Hour Created
Next, we'll determine if we can maximize the amount of comments an ask post receives by creating it at a certain time. First, we'll find the amount of ask posts created during each hour of day, along with the number of comments those posts received. Then, we'll calculate the average amount of comments ask posts created at each hour of the day receive.

In [42]:
result_list = []

for row in ask_posts:
    created_at = dt.datetime.strptime(row[6],"%m/%d/%Y %H:%M")   #row[6] contains created date of comments
    num_comments = int(row[4])  #row[4] contains number of comments
    result_list.append([created_at,num_comments])
    
counts_by_hour ={} #contains the number of ask posts created during each hour of the day.
comments_by_hour ={} #contains the corresponding number of comments ask posts created at each hour received

for row in result_list:
    date = row[0]
    hr = date.hour
    comments = row[1]
    if hr not in counts_by_hour:
        counts_by_hour[hr] = 1
        comments_by_hour[hr] = comments 
    else:
        counts_by_hour[hr] += 1
        comments_by_hour[hr] += comments
print("Counts by Hour:\n")
for i in sorted(counts_by_hour.items()):
    print(i[0],":",i[1])
print("\n Comments by Hour:\n")
for i in sorted(comments_by_hour.items()):
    print(i[0],": ",i[1])


Counts by Hour:

0 : 55
1 : 60
2 : 58
3 : 54
4 : 47
5 : 46
6 : 44
7 : 34
8 : 48
9 : 45
10 : 59
11 : 58
12 : 73
13 : 85
14 : 107
15 : 116
16 : 108
17 : 100
18 : 109
19 : 110
20 : 80
21 : 109
22 : 71
23 : 68

 Comments by Hour:

0 :  447
1 :  683
2 :  1381
3 :  421
4 :  337
5 :  464
6 :  397
7 :  267
8 :  492
9 :  251
10 :  793
11 :  641
12 :  687
13 :  1253
14 :  1416
15 :  4477
16 :  1814
17 :  1146
18 :  1439
19 :  1188
20 :  1722
21 :  1745
22 :  479
23 :  543


# Finding the Average number of Comments on Ask HN posts by Hour

In [32]:
avg_by_hour = []

for key in comments_by_hour:
    avg_by_hour.append([key,(comments_by_hour[key]/counts_by_hour[key])])

avg_by_hour

[[9, 5.5777777777777775],
 [13, 14.741176470588234],
 [10, 13.440677966101696],
 [14, 13.233644859813085],
 [16, 16.796296296296298],
 [23, 7.985294117647059],
 [12, 9.41095890410959],
 [17, 11.46],
 [15, 38.5948275862069],
 [21, 16.009174311926607],
 [20, 21.525],
 [2, 23.810344827586206],
 [18, 13.20183486238532],
 [3, 7.796296296296297],
 [5, 10.08695652173913],
 [19, 10.8],
 [1, 11.383333333333333],
 [22, 6.746478873239437],
 [8, 10.25],
 [4, 7.170212765957447],
 [0, 8.127272727272727],
 [6, 9.022727272727273],
 [7, 7.852941176470588],
 [11, 11.051724137931034]]

# Sorting and Printing values from List of Lists

In [122]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
sorted_swap = sorted(swap_avg_by_hour,reverse=True)
sorted_swap

[[38.5948275862069, 15],
 [23.810344827586206, 2],
 [21.525, 20],
 [16.796296296296298, 16],
 [16.009174311926607, 21],
 [14.741176470588234, 13],
 [13.440677966101696, 10],
 [13.233644859813085, 14],
 [13.20183486238532, 18],
 [11.46, 17],
 [11.383333333333333, 1],
 [11.051724137931034, 11],
 [10.8, 19],
 [10.25, 8],
 [10.08695652173913, 5],
 [9.41095890410959, 12],
 [9.022727272727273, 6],
 [8.127272727272727, 0],
 [7.985294117647059, 23],
 [7.852941176470588, 7],
 [7.796296296296297, 3],
 [7.170212765957447, 4],
 [6.746478873239437, 22],
 [5.5777777777777775, 9]]

In [40]:
# Sort the values and print the the 5 hours with the highest average comments.

print("Top 5 Hours for 'Ask HN' Comments:\n")

for row in sorted_swap[:5]:
    hour = str(row[1])
    hour = dt.datetime.strptime(hour,"%H") #converting hour into datetime object
    hour = dt.datetime.strftime(hour, "%H:%M:%S") #formatting hour as per output requirement
    comments = '{:0.2f}'.format(row[0])
    template = "{hour} -> {comments} average comments per hour".format(hour =hour,comments = comments)
    print(template)

Top 5 Hours for 'Ask HN' Comments:

15:00:00 -> 38.59 average comments per hour
02:00:00 -> 23.81 average comments per hour
20:00:00 -> 21.52 average comments per hour
16:00:00 -> 16.80 average comments per hour
21:00:00 -> 16.01 average comments per hour


The hour that receives the most comments per post on average is 15:00, with an average of 38.59 comments per post. There's about a 60% increase in the number of comments between the hours with the highest and second highest average number of comments.

According to the data set [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts/home), the timezone used is Eastern Time in the US. So, we could also write 15:00 as 3:00 pm est.




# Additional Analysis

## Determining whether ask posts or show posts recieve more points on average.

### Calculating the total number of points accumulated by Ask HN and Show HN posts
We will now find the total number of points corresponsding to Ask HN and Show HN posts. We will also prepare two dictionaries storing the points corresponding to each Ask and Show Post to find the **blockbuster** (post with highest number of points) post of each type.<br>

In [145]:
ask_posts_points ={} #dictionary to contain points corresponding to each ask post
show_posts_points ={} #dictionary to contain points corresponding to each show post
total_ask_points =0
total_show_points = 0

for row in ask_posts:
    total_ask_points += int(row[3]) #row[3] contains number of points of a post
    ask_posts_points[int(row[3])] = row[1] #row[1] contains the name of the post 
    
        
for row in show_posts:
    total_show_points += int(row[3]) #row[3] contains number of points of a post
    show_posts_points[int(row[3])] = row[1]

#print(ask_posts_points)
#print(show_posts_points)
print("Total number of points accumulated by Ask HN Posts:",total_ask_points)
print("Total number of points accumulated by Show HN Posts:",total_show_points)

Total number of points accumulated by Ask HN Posts: 26268
Total number of points accumulated by Show HN Posts: 32019


### Calculating the average number of points corresponding to Ask HN and Show HN Posts

In [146]:
avg_ask_posts_points = total_ask_points/len(ask_posts)
avg_show_posts_points = total_show_points/len(show_posts)

print("Average Points of Ask Posts:{:0.3f}".format(avg_ask_posts_points))
print("Average Points of Show Posts:{:0.3f}".format(avg_show_posts_points))

if avg_ask_posts_points > avg_show_posts_points:
    print("Ask HN Post category has {:0.3f} more points on an average than Show HN Post category".format(avg_ask_posts_points-avg_show_posts_points))
else:
    print("Show HN Post category has {:0.3f} more points on an average than Ask HN Post category".format(avg_show_posts_points-avg_ask_posts_points))

Average Points of Ask Posts:15.062
Average Points of Show Posts:27.555
Show HN Post category has 12.493 more points on an average than Ask HN Post category


## Finding the Blockbuster Ask HN 

As Show HN Posts have more points on an average as compared to Ask HN Posts, we'll focus on the former to complete the points analysis.

In [147]:
from itertools import islice
import operator

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

# print("Sample data in ask_posts_points:\n")
# five_ask_posts =take(5,ask_posts_points.items())
# print(five_ask_posts)

print("Sample data in show_posts_points:\n")
five_show_posts = take(5,show_posts_points.items())
print(five_show_posts)

#Finding the maximum point and post corresponding to it

#max_ask_posts_point, max_point_ask_post = max(ask_posts_points.items(), key=operator.itemgetter(0))
max_show_posts_point, max_point_show_post = max(show_posts_points.items(), key=operator.itemgetter(0))

start = "\033[1m" #bold starts
end = "\033[0;0m" #bold ends
#print("\n\nBlockbuster Post for type Ask HN is "+start+str(max_point_ask_post)+end+" having "+start+str(max_ask_posts_point)+end+" points")
print("\nBlockbuster Post for type Show HN is "+start+str(max_point_show_post)+end+" having "+start+str(max_show_posts_point)+end+" points")

Sample data in show_posts_points:

[(26, 'Show HN: iPipeTo, Yeoman ui as a standalone composable cli tool'), (747, 'Show HN: Something pointless I made'), (1, 'Show HN: Raspberry PI Zero Docker/Swarm on QuickStart'), (3, 'Show HN: Decorating: Animated pulsed for your slow functions in Python'), (4, 'Show HN: ExtractorApp Convert Excel / CSV to API, SQL and Other Formats')]

Blockbuster Post for type Show HN is [1mShow HN: New calendar app idea[0;0m having [1m825[0;0m points


## Determining if posts created at a certain time are more likely to receive more points

Now we'll find out whether posts that are posted on the forum at a perticular time are more likely to recieve ore points or not.
For this, we'll create a frequency dictionary for Show Posts which will map aggregated points to hours of the posts.
### Finding the points recieved by Show Posts by Hour


In [158]:
show_point_by_hour = {}
show_counts_by_hour ={}

for row in show_posts:
    hour = dt.datetime.strptime(row[6],"%m/%d/%Y %H:%M").hour  # 3-> points , 6->datetime
    points = int(row[3])
    if hour in show_point_by_hour:
        show_point_by_hour[hour]+=points
        show_counts_by_hour += 1
    else:
        show_point_by_hour[hour] = points
        show_counts_by_hour = 1
        
print("\nShow Posts points' mapping with hours:")
print(show_point_by_hour)


Show Posts points' mapping with hours:
{14: 2187, 22: 1856, 18: 2215, 7: 494, 20: 1819, 5: 104, 16: 2634, 19: 1702, 15: 2228, 3: 679, 17: 2521, 6: 375, 2: 340, 13: 2438, 8: 519, 21: 866, 4: 386, 11: 1480, 12: 2543, 23: 1526, 9: 553, 1: 700, 10: 681, 0: 1173}


### Calculating Average number of points per hour in a list of lists for Show HN Posts

In [180]:
avg_points_by_hour_show = []

for key in show_point_by_hour:
    avg_points_by_hour_show.append([key,show_point_by_hour[key]/show_counts_by_hour]) 
    
avg_points_by_hour_show

[[14, 2.025],
 [22, 1.7185185185185186],
 [18, 2.050925925925926],
 [7, 0.45740740740740743],
 [20, 1.6842592592592593],
 [5, 0.0962962962962963],
 [16, 2.438888888888889],
 [19, 1.575925925925926],
 [15, 2.062962962962963],
 [3, 0.6287037037037037],
 [17, 2.3342592592592593],
 [6, 0.3472222222222222],
 [2, 0.3148148148148148],
 [13, 2.2574074074074075],
 [8, 0.48055555555555557],
 [21, 0.8018518518518518],
 [4, 0.3574074074074074],
 [11, 1.3703703703703705],
 [12, 2.35462962962963],
 [23, 1.412962962962963],
 [9, 0.5120370370370371],
 [1, 0.6481481481481481],
 [10, 0.6305555555555555],
 [0, 1.086111111111111]]

### Sorting and printing values from list of lists

In [182]:
swapped_list_av_pnt_hr = []
for i in avg_points_by_hour_show:
    swapped_list_av_pnt_hr.append([i[1],i[0]])
    
sorted_swapped_list_av_pnt_hr = sorted(swapped_list_av_pnt_hr, reverse=True)
sorted_swapped_list_av_pnt_hr

[[2.438888888888889, 16],
 [2.35462962962963, 12],
 [2.3342592592592593, 17],
 [2.2574074074074075, 13],
 [2.062962962962963, 15],
 [2.050925925925926, 18],
 [2.025, 14],
 [1.7185185185185186, 22],
 [1.6842592592592593, 20],
 [1.575925925925926, 19],
 [1.412962962962963, 23],
 [1.3703703703703705, 11],
 [1.086111111111111, 0],
 [0.8018518518518518, 21],
 [0.6481481481481481, 1],
 [0.6305555555555555, 10],
 [0.6287037037037037, 3],
 [0.5120370370370371, 9],
 [0.48055555555555557, 8],
 [0.45740740740740743, 7],
 [0.3574074074074074, 4],
 [0.3472222222222222, 6],
 [0.3148148148148148, 2],
 [0.0962962962962963, 5]]

In [185]:
# Sort the values and print the the 5 hours with the highest average comments.

print("Top 5 Hours when Show HN Posts accumulated maximum points:\n")

for row in sorted_swapped_list_av_pnt_hr[:5]:
    hour = str(row[1])
    hour = dt.datetime.strptime(hour,"%H") #converting hour into datetime object
    hour = dt.datetime.strftime(hour, "%H:%M:%S") #formatting hour as per output requirement
    points = '{:0.2f}'.format(row[0])
    template = "{hour} -> {points} average points per hour".format(hour =hour,points = points)
    print(template)

Top 5 Hours when Show HN Posts accumulated maximum points:

16:00:00 -> 2.44 average points per hour
12:00:00 -> 2.35 average points per hour
17:00:00 -> 2.33 average points per hour
13:00:00 -> 2.26 average points per hour
15:00:00 -> 2.06 average points per hour



# Conclusion

In this project, we analyzed ask posts and show posts to determine which type of post and time receive the most comments on average. Based on our analysis, to maximize the amount of comments a post receives, we'd recommend the post be categorized as ask post and created between 15:00 and 16:00 (3:00 pm est - 4:00 pm est).

However, it should be noted that the data set we analyzed excluded posts without any comments. Given that, it's more accurate to say that of the posts that received comments, ask posts received more comments on average and ask posts created between 15:00 and 16:00 (3:00 pm est - 4:00 pm est) received the most comments on average. 

Also, an additional analysis was performed to find which type of post and the time recieves more points. Based on it, to maximise the points collected by posts, we'd recommend the post to be categorized as show post and created between 16:00 and 17:00 (4:00 pm - 5:00 pm est). The blockbuster post was also found using this points analysis.



