# Exploring Trends in Liked/Commented Posts on the Hacker News Website

### This project explores which kinds of posts on Hacker News receive an above-average number of comments.

[Hacker News](https://news.ycombinator.com/)

### We are interested in these two types of posts:

    1). Ask HN
    2). Show HN
    
### Ask HN posts ask the Hacker News community a question. The title starts with this keyword, followed by the question, e.g.

```
Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?
```
### Show HN posts show the community a piece of work, a project, or a product, e.g.

```
Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm
```

### We'll compare these two types of posts to determine the following:

* Do Ask HN or Show HN receive more comments on average?
* Do posts created at a certain time receive more comments on average?


## Opening the dataset "hacker_news.csv" & removing headers

In [48]:
from csv import reader

with open('hacker_news.csv') as file:   # the with block closes the file after reading
    hn = list(reader(file))             # every row, header included, as a list of lists

In [49]:
hn[:2]                        # Preview of first 2 posts

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52']]

In [50]:
# removing headers
headers = hn[0]
hn = hn[1:]   # run this cell only once: re-running it would drop the first data row

print(headers)
print(hn[:2])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]
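
The header split above depends on the cell being run only once. A minimal sketch of an alternative that is safe to re-run is shown here: it takes the header with `next()` while the rows are being read. The names `sample_csv`, `sample_headers` and `sample_hn` are made up for this illustration, and an in-memory string with one row from the preview stands in for `hacker_news.csv`.

```python
import csv
import io

# One data row copied from the preview above; the notebook itself reads hacker_news.csv.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

rows = csv.reader(sample_csv)
sample_headers = next(rows)   # consume the header row exactly once
sample_hn = list(rows)        # everything that is left is data

print(sample_headers)
print(sample_hn)
```

Because the header is removed during reading, running the cell again simply rebuilds both lists instead of slicing off another row.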


## Extracting Ask HN & Show HN posts into separate lists

In [51]:
ask_posts = []                                   # list having Ask posts
show_posts = []                                  # list having Show posts
other_posts = []                                 # list having Other posts

for row in hn:
    title = row[1]
    title = title.lower()                       # changed the strings to lower case.
    if title.startswith("ask hn"):              # used Startswith string method on each post.
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [52]:
print("No. of Ask posts:", len(ask_posts))
print("No. of Show posts:",len(show_posts))
print("No. of posts other than Ask & Show:",len(other_posts))

No. of Ask posts: 1744
No. of Show posts: 1162
No. of posts other than Ask & Show: 17194


## Avg no. of Comments for Ask & Show HN posts

In [53]:
ask_posts[:2]                 # first two Ask posts

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43']]

In [54]:
show_posts[:2]                  # first two Show posts

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46']]

## To find the average, we need the total number of comments (column "num_comments", index 4) and the number of posts

In [55]:
# Total no. of comments in ASK HN posts
total_ask_comments = 0

for post in ask_posts:
    comment = post[4]
    total_ask_comments += int(comment)           # total Ask comments added
    
avg_ask_comments = total_ask_comments/len(ask_posts)

# Total no. of comments in Show HN posts
total_show_comments = 0

for post in show_posts:
    comment = post[4]
    total_show_comments += int(comment)            # total Show comments added
    
avg_show_comments = total_show_comments/len(show_posts)

In [56]:
print("Average no. of comments per Ask post:",int(avg_ask_comments))      # int() drops the decimal part
print("Average no. of comments per Show post:",int(avg_show_comments))

Average no. of comments per Ask post: 14
Average no. of comments per Show post: 10


## Analysis
### On average, Ask HN posts receive more comments per post than Show HN posts (about 14 versus about 10).
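
The two averaging loops differ only in the list they read, so they could be folded into one helper. The sketch below is one possible shape, not the notebook's own code: `average_comments` and `sample_ask` are hypothetical names, and the two rows are the Ask posts previewed earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# The first two Ask posts from the preview above.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]

print(average_comments(sample_ask))   # (6 + 29) / 2 = 17.5
```

With the full lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace both loops. Note that `int()` truncates the result, while `round(value, 2)` keeps two decimals.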

## Next, we analyse whether Ask posts created at a certain time are more likely to attract comments.

### We'll use the following steps to perform this analysis:

    1). Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
    
    2). Calculate the average number of comments ask posts receive by hour created.

## Ask posts created in each hour of the day, along with the number of comments received.

In [57]:
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [58]:
import datetime as dt               # datetime module imported to parse and format dates and times.

### Separate the "created_at" & "num_comments" columns into another list

In [59]:
result_list = []              # [created_at, num_comments] pairs for the Ask posts

for row in ask_posts:
    created_at = row[-1]
    num_comments = int(row[4])
    result_list.append([created_at,num_comments])        

In [60]:
result_list[:2]              # data in this list have date and comments 

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29]]

### Using the list above, we create the following dictionaries:

   - counts_by_hour: contains the number of ask posts created during each hour of the day. 

   - comments_by_hour: contains the total number of comments received by ask posts created during each hour.

In [61]:
counts_by_hour = {}                    # number of ask posts created in each hour
comments_by_hour = {}                  # total comments on ask posts created in each hour

for i in result_list:
    date = dt.datetime.strptime(i[0], "%m/%d/%Y %H:%M")   # parse the created_at string into a datetime object
    hr = date.hour                                        # hour attribute (0-23) of the datetime object
    comment_n = i[1]
    
    if hr in counts_by_hour:                      
        counts_by_hour[hr] += 1                    # one more post for this hour
        comments_by_hour[hr] += comment_n          # add this post's comments to the hour's total
    else:
        counts_by_hour[hr] = 1                     # first post seen for this hour
        comments_by_hour[hr] = comment_n

In [62]:
print("Counts by Hour:\n",counts_by_hour)
print("\n")
print("Comments by Hour:\n",comments_by_hour)

Counts by Hour:
 {0: 55, 1: 60, 2: 58, 3: 54, 4: 47, 5: 46, 6: 44, 7: 34, 8: 48, 9: 45, 10: 59, 11: 58, 12: 73, 13: 85, 14: 107, 15: 116, 16: 108, 17: 100, 18: 109, 19: 110, 20: 80, 21: 109, 22: 71, 23: 68}


Comments by Hour:
 {0: 447, 1: 683, 2: 1381, 3: 421, 4: 337, 5: 464, 6: 397, 7: 267, 8: 492, 9: 251, 10: 793, 11: 641, 12: 687, 13: 1253, 14: 1416, 15: 4477, 16: 1814, 17: 1146, 18: 1439, 19: 1188, 20: 1722, 21: 1745, 22: 479, 23: 543}
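
The `if`/`else` bookkeeping in the counting loop could also be handled by `collections.defaultdict`, which starts every missing hour at 0. This is a sketch of that alternative, not the method used above; `sample_result_list`, `sample_counts` and `sample_comments` are illustrative names, and the two rows come from the `result_list` preview.

```python
import datetime as dt
from collections import defaultdict

# The two [created_at, num_comments] pairs shown in the result_list preview.
sample_result_list = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29]]

sample_counts = defaultdict(int)     # missing hours start at 0
sample_comments = defaultdict(int)

for created_at, num_comments in sample_result_list:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))     # {9: 1, 13: 1}
print(dict(sample_comments))   # {9: 6, 13: 29}
```

The trade-off is that a `defaultdict` silently creates a key on any lookup, so the plain-dictionary version is easier to reason about when a missing hour should be treated as an error.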


## Average number of comments per ask post by hour, computed from the two dictionaries above.

In [63]:
avg_by_hour = []                                        # List will have average no. of comments per hour

for hr in counts_by_hour.keys():                        # iterate over counts by hours dict 
    avg = comments_by_hour[hr]/counts_by_hour[hr]       # calculating Avg per hour using comment and count by hour.
    avg_by_hour.append([hr, avg])

print("average no. of comments per hour : \n",avg_by_hour[:5])         #first five hours

average no. of comments per hour : 
 [[0, 8.127272727272727], [1, 11.383333333333333], [2, 23.810344827586206], [3, 7.796296296296297], [4, 7.170212765957447]]


### We now have a list of lists holding the average number of comments per hour, but it is not sorted. We need to sort it by the average so that the highest average comes first and we can draw a conclusion.

### We will use the sorted function with a key that picks the second element of each pair (index 1) and set reverse=True.

In [64]:
sort_avg_by_hour = sorted(avg_by_hour, key=lambda x:x[1], reverse=True)

print("Sorted List of Average comments by hours:-\n ",sort_avg_by_hour)

Sorted List of Average comments by hours:-
  [[15, 38.5948275862069], [2, 23.810344827586206], [20, 21.525], [16, 16.796296296296298], [21, 16.009174311926607], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [18, 13.20183486238532], [17, 11.46], [1, 11.383333333333333], [11, 11.051724137931034], [19, 10.8], [8, 10.25], [5, 10.08695652173913], [12, 9.41095890410959], [6, 9.022727272727273], [0, 8.127272727272727], [23, 7.985294117647059], [7, 7.852941176470588], [3, 7.796296296296297], [4, 7.170212765957447], [22, 6.746478873239437], [9, 5.5777777777777775]]
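
As a quick check of the sort and of the top entry, the same steps can be replayed on three hours using the counts and totals printed earlier (hour 0: 447 comments over 55 posts, hour 2: 1381 over 58, hour 15: 4477 over 116). The names `sample_avg` and `ranked` are only for this sketch.

```python
# [hour, total comments / number of posts], taken from the two dictionaries printed earlier.
sample_avg = [[0, 447 / 55], [2, 1381 / 58], [15, 4477 / 116]]

# Sort by the second element of each pair (the average), largest first.
ranked = sorted(sample_avg, key=lambda pair: pair[1], reverse=True)

print([hour for hour, _ in ranked])   # [15, 2, 0]
print(round(ranked[0][1], 2))         # 38.59
```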


In [65]:
print("Top 5 Hours for Ask Posts Comments:")
      
for i in sort_avg_by_hour[:5]:                           # five hours with the highest average
    h = dt.datetime.strptime(str(i[0]),"%H")             # parse the hour number into a datetime object
    h = h.strftime("%H:%M")                              # format it as HH:MM
    avg = i[1]
    
    print("{hour}: {average:.2f} average comments per post".format(hour=h , average=avg))

Top 5 Hours for Ask Posts Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
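
The hours above are printed in 24-hour form. If a 12-hour label is preferred, one option is the `%I` and `%p` format codes, sketched below with the five hours from the output. `top_hours` and `labels` are illustrative names, and the AM/PM text produced by `%p` depends on the locale; the default C locale is assumed here.

```python
import datetime as dt

top_hours = [15, 2, 20, 16, 21]   # the five hours listed in the output above

# dt.time(h) is h:00; %I is the zero-padded 12-hour clock, %p the AM/PM marker.
labels = [dt.time(h).strftime("%I:%M %p") for h in top_hours]

print(labels)
```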


## Conclusion

### In this dataset, Ask HN posts created during the 15:00 hour (3:00 PM) received the most comments on average (38.59 per post), so someone who wants the most comments on an Ask post should post at around that time.

### Thanks