# Analysing Hacker News submissions
In this project, we'll work with a data set of submissions to the popular technology site Hacker News. The data set can be found on [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts).
We'll compare two kinds of post submissions, questions ("Ask HN") and shared projects ("Show HN"), to see which is more common and which receives more comments on average. We'll also analyse whether posts created at a certain time receive more comments on average than others.

## Step 1 : Reading data from the dataset

In [4]:
from csv import reader
opened_file = open("hacker_news.csv")
read_file = reader(opened_file)

hn = list(read_file)

print(hn[:3]) #Printing to check

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]


In [5]:
#Removing first(header) row
hn = hn[1:] #hn is from first row to last

print(hn[0]) #Checking again

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
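
The cell above leaves the file handle open. As a minimal sketch of one alternative, the reading and header-splitting steps can be wrapped in a small helper and used inside a `with` block so the file is closed automatically. The helper name `read_rows`, the in-memory `SAMPLE_CSV`, and the `utf-8` encoding are assumptions made for illustration; they are not part of the original notebook.

```python
import csv
import io

# Hypothetical stand-in for hacker_news.csv so the sketch runs without the file.
SAMPLE_CSV = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)


def read_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# With the real data set this would look like (encoding is an assumption):
# with open("hacker_news.csv", encoding="utf-8", newline="") as f:
#     header, hn = read_rows(f)
header, hn_sample = read_rows(io.StringIO(SAMPLE_CSV))

print(header)
print(hn_sample[0])
```

Returning the header separately keeps the column names available for later reference instead of discarding them.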


## Step 2 : Filtering Data
Since we only need post submissions where either a question is asked (title starting with "Ask HN") or something is shared (title starting with "Show HN"), we will filter our data accordingly.

In [10]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):  #converts to lower then checks
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

#Total posts in each list        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
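
The filtering rule can also be isolated in a function, which makes it easy to check on a few titles before running it over the whole data set. This is only a sketch: the function name `classify_title` and the sample titles are hypothetical.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix."""
    lowered = title.lower()  # case-insensitive comparison, as in the loop above
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles, used only to illustrate the rule.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "Interactive Dynamic Video",
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowering the title once and reusing it avoids calling `lower()` for every prefix that is tested.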


## Step 3 : Determining which receives more average comments

In [21]:
total_ask_comments = 0

#Total comments in "Ask HN"
for post in ask_posts:
    comm = int(post[4])
    total_ask_comments += comm
    
#Average comments in "Ask HN"
avg_ask_comments = total_ask_comments/len(ask_posts)    
print("Average 'Ask HN' posts : {:.2f}".format(avg_ask_comments))

total_show_comments = 0

#Total comments in "Show HN"
for post in show_posts:
    comm = int(post[4])
    total_show_comments += comm
    
#Average comments in "Show HN"
avg_show_comments = total_show_comments/len(show_posts)
print("Average 'Show HN' posts : {:.2f}".format(avg_show_comments))


Average 'Ask HN' posts : 14.04
Average 'Show HN' posts : 10.32


> As seen from the above findings, "Ask HN" posts have a total of 24483 comments, averaging 14.04 comments per post, while "Show HN" posts have 11988 comments in total, averaging 10.32 comments per post.
>
> Hence, ***Ask HN*** posts receive more comments on average, i.e., users are more inclined to engage with questions than with shared posts and findings.
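
The two averaging loops differ only in the list they read, so they can be expressed as one helper. The sketch below assumes the column order shown in Step 1 (`num_comments` at index 4); the name `average_comments` and the sample rows are hypothetical. The last two lines recompute the reported averages from the totals quoted above.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows in the data set's column order.
example_posts = [
    ["1", "Ask HN: A", "", "5", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: B", "", "3", "4", "user_b", "1/26/2016 19:30"],
]
example_avg = average_comments(example_posts)
print("{:.2f}".format(example_avg))

# Totals and post counts taken from the results above.
print("{:.2f}".format(24483 / 1744))  # Ask HN
print("{:.2f}".format(11988 / 1162))  # Show HN
```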

## Step 4: Determining time to attract comments
We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.

In [26]:
import datetime as dt

result_list = [] #[ [time, no. of comments] ]

for post in ask_posts:
    result_list.append([post[6],int(post[4])]) 
    
count_by_hour = {} #{hour : no. of ask posts created in that hour}
comments_by_hour = {} #{hour : total comments on posts created in that hour}

for row in result_list:
    date = row[0]
    conv_date = dt.datetime.strptime(date,"%m/%d/%Y %H:%M")
    hr = conv_date.strftime("%H")
    
    if hr in count_by_hour:
        count_by_hour[hr] += 1
        comments_by_hour[hr] += row[1]
    else:
        count_by_hour[hr] = 1
        comments_by_hour[hr] = row[1]
        
print("{Hour : Number of times} ",count_by_hour)
print('\n')
print("{Hour : Total comments} ",comments_by_hour)

{Hour : Number of times}  {'23': 68, '04': 47, '03': 54, '19': 110, '09': 45, '21': 109, '00': 55, '20': 80, '15': 116, '13': 85, '02': 58, '22': 71, '05': 46, '16': 108, '14': 107, '01': 60, '06': 44, '11': 58, '08': 48, '10': 59, '17': 100, '12': 73, '07': 34, '18': 109}


{Hour : Total comments}  {'23': 543, '04': 337, '03': 421, '19': 1188, '09': 251, '21': 1745, '00': 447, '20': 1722, '15': 4477, '13': 1253, '02': 1381, '22': 479, '05': 464, '16': 1814, '14': 1416, '01': 683, '06': 397, '11': 641, '08': 492, '10': 793, '17': 1146, '12': 687, '07': 267, '18': 1439}
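
The same hourly tallies can be built without the `if`/`else` branch by using `collections.defaultdict`, which starts every missing key at zero. This is a sketch under the assumption that timestamps follow the `%m/%d/%Y %H:%M` format used above; the function name `tally_by_hour` and the third sample row are hypothetical.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(rows, fmt="%m/%d/%Y %H:%M"):
    """rows: iterable of [created_at, num_comments] pairs.

    Returns two dicts keyed by zero-padded hour string:
    posts created in that hour, and total comments on those posts.
    """
    posts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, fmt).strftime("%H")
        posts[hour] += 1
        comments[hour] += num_comments
    return dict(posts), dict(comments)


# The first two timestamps come from the rows printed in Step 1;
# the third row is made up for illustration.
sample_rows = [
    ["8/4/2016 11:52", 52],
    ["1/26/2016 19:30", 10],
    ["5/2/2016 11:05", 3],
]
sample_counts, sample_comments = tally_by_hour(sample_rows)
print(sample_counts)
print(sample_comments)
```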


## Step 5 : Calculating the average number of comments per post for each hour of the day

In [41]:
avg_by_hour = [] #[ [time, avg comment]]

#Dividing total comments by the number of posts to determine the average for each hour
for time,comment in comments_by_hour.items():
    ntimes = count_by_hour[time]
    avg_comment = comment/ntimes
    avg_by_hour.append([time, avg_comment])
        
swap_avg_by_hour = [] # [ [avg_comment, time]]

#Swapping avg_by_hour so that sorted() can be used based on average comment
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True) 

print(sorted_swap)


[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
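
Swapping the columns is one way to sort by the average. Another option, sketched here, is to pass a `key` function to `sorted()` so the pairs keep their `[hour, average]` order. The four hours below are a subset of the totals and counts printed in Step 4; the variable names with the `subset_` prefix are hypothetical.

```python
# Subset of the Step 4 results: total comments and post counts per hour.
subset_comments = {"15": 4477, "02": 1381, "20": 1722, "09": 251}
subset_counts = {"15": 116, "02": 58, "20": 80, "09": 45}

# Average comments per post for each hour.
subset_avg = {
    hour: subset_comments[hour] / subset_counts[hour]
    for hour in subset_comments
}

# Sort (hour, average) pairs by the average, highest first.
ranked = sorted(subset_avg.items(), key=lambda item: item[1], reverse=True)

for hour, average in ranked:
    print("{}:00: {:.2f} average comments per post.".format(hour, average))
```

Sorting with a key also avoids comparing the hour strings, which the swapped-list approach falls back on when two averages are equal.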


## Step 6 : Deducing "Top 5 Hours for Ask Posts Comments"


In [44]:
import datetime as dt

for element in sorted_swap[:5]:
    comments = element[0]
    date = dt.datetime.strptime(element[1],"%H")
    hr = date.strftime("%H")
    template = "{}:00: {:.2f} average comments per post.".format(hr,comments)
    print(template)
    
    

15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


> ## Conclusion
As seen from the above results, the best hours (in the data set's time zone) to create an "Ask HN" post on **Hacker News** in order to attract comments are: ***3:00pm***, ***2:00am***, ***8:00pm***, ***4:00pm*** and ***9:00pm***.
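
The 12-hour labels in the conclusion can be derived from the 24-hour strings produced in Step 6. The helper below is a small sketch of that conversion; the name `to_12_hour` is hypothetical, and it works with plain integer arithmetic rather than locale-dependent format codes.

```python
def to_12_hour(hour_str):
    """Convert a 24-hour string such as '15' to a label such as '3:00pm'."""
    hour = int(hour_str)
    suffix = "am" if hour < 12 else "pm"
    display = hour % 12 or 12  # 0 and 12 are both shown as 12
    return "{}:00{}".format(display, suffix)


# Top five hours from Step 6, in ranked order.
top_hours = ["15", "02", "20", "16", "21"]
top_labels = [to_12_hour(hour) for hour in top_hours]
print(top_labels)
```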