In [None]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

We are focusing on the posts aimed at the Hacker News community, identified by the title prefixes Ask HN and Show HN. Ask HN posts are questions to the community, while Show HN posts promote something in the forum, be it a project, a product, or something interesting.

## Questions we hope to answer:
1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?
3. Determine if Show or Ask posts receive more points on average.
4. Determine if posts created at a certain time are more likely to receive more points.
5. Compare results to the average number of comments and points other posts receive.

In [None]:
hn=pd.read_csv('/kaggle/input/hacker-news-posts/HN_posts_year_to_Sep_26_2016.csv', low_memory=True)
hn.head()

In [None]:
hn.info()

## The top 10 Authors are:

In [None]:
hn.author.value_counts().head(10)

To find our posts of interest, we need to select the posts whose titles start with "Ask HN" or "Show HN".

We can do this easily using the string method startswith.

However, strings are notoriously finicky.

In [None]:
print('Data science'.startswith('data'))
print('data science'.startswith('data'))

## Capitalization matters!
As humans, we know that both of the strings above start with the word 'data'; however, a machine compares them character by character, so 'D' and 'd' do not match.

Let's make the title lowercase before filtering.
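
As a minimal sketch of the same idea, the lowercase-then-match step can also be applied to a whole column at once with pandas string methods. The small `sample` DataFrame below is made up for illustration; only the `title` column name comes from the dataset.

```python
import pandas as pd

# Made-up titles for illustration only (not rows from the dataset)
sample = pd.DataFrame({'title': ['Ask HN: How do you learn?',
                                 'ASK HN: Any book tips?',
                                 'Show HN: My weekend project',
                                 'A post about databases']})

# Lowercase every title once, then test the prefix
lower_titles = sample['title'].str.lower()
is_ask = lower_titles.str.startswith('ask hn')
is_show = lower_titles.str.startswith('show hn')

print(is_ask.tolist())
print(is_show.tolist())
```

The boolean Series can then be used as masks, for example `sample[is_ask]`, instead of appending rows to lists one by one.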

In [None]:
# creating empty lists to store the filtered data in
ask_posts=[]
show_posts=[]
other_posts=[]

In [None]:
for i,x in hn.iterrows():
    if x.title.lower().startswith('ask hn'):
        ask_posts.append(x)
    elif x.title.lower().startswith('show hn'):
        show_posts.append(x)
    else:
        other_posts.append(x)

In [None]:
print('Number of Ask HN posts:', len(ask_posts))
print('Number of Show HN posts:', len(show_posts))
print('Number of Other posts:', len(other_posts))

In [None]:
ask_posts[:5]

## Question 1: Do Ask HN or Show HN receive more comments on average?
We need to find the average number of comments on an 'ask' post and on a 'show' post to determine which type draws more discussion.
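
The total-then-divide calculation can be wrapped in a small helper so that both post types go through the same code path. This is only a sketch: `average_of` is a hypothetical helper, the rows are made up, and `num_comments` is an assumed column name.

```python
def average_of(posts, column):
    """Return the mean of `column` across `posts`, or 0.0 if there are none."""
    if not posts:
        return 0.0
    return sum(post[column] for post in posts) / len(posts)

# Made-up rows for illustration; 'num_comments' is an assumed column name
toy_ask = [{'num_comments': 4}, {'num_comments': 10}]
toy_show = [{'num_comments': 3}, {'num_comments': 1}, {'num_comments': 2}]

print(average_of(toy_ask, 'num_comments'))   # 7.0
print(average_of(toy_show, 'num_comments'))  # 2.0
```

Because the divisor always comes from the same list as the total, a helper like this avoids mixing up the two categories when the calculation is repeated.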

In [None]:
total_ask_comments=0
for i in ask_posts:
    total_ask_comments += i.iloc[4]  # position 4 holds the comment count; .iloc keeps the lookup positional
avg_ask_comments= total_ask_comments/ len(ask_posts)
print ('Total comments on Ask HN posts are:', total_ask_comments)
print ('Average comments on Ask HN posts are:', avg_ask_comments)

In [None]:
total_show_comments=0
for i in show_posts:
    total_show_comments += i.iloc[4]  # position 4 holds the comment count; .iloc keeps the lookup positional
avg_show_comments= total_show_comments/ len(show_posts)
print ('Total comments on Show HN posts are:', total_show_comments)
print ('Average comments on Show HN posts are:', avg_show_comments)

So, it seems people are more interested in giving solutions than in hearing others out. Seems true enough!

While the average number of comments on Show posts is close to 5, the average on Ask posts is about twice that, around 10.

## Question 2: Do posts created at a certain time receive more comments on average?
We need to check whether the timing of these posts plays a role in attracting comments. To determine this, we will perform two steps:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

Let's first calculate the number of ask posts and comments generated per hour. There are two steps to this:
- Using the <b>datetime.strptime()</b> method of the datetime module, we can parse dates stored as strings. (NOTE: m is for month and M is for minute)
- Using <b>datetime.strftime()</b>, we can then extract the hour from the parsed date.
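
A minimal sketch of the parsing step, using a hypothetical timestamp in the month/day/year hour:minute layout:

```python
import datetime as dt

# Hypothetical timestamp for illustration (not taken from the dataset)
raw = '9/26/2016 3:26'

# %m is the month and %M is the minute; mixing them up mis-parses the date
parsed = dt.datetime.strptime(raw, '%m/%d/%Y %H:%M')

# strftime turns the parsed datetime back into text, here the zero-padded hour
hour = parsed.strftime('%H')

print(parsed)
print(hour)
```

Note that `hour` is a string such as `'03'`, which is why it works as a dictionary key and sorts in clock order.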


In [None]:
import datetime as dt
result_list=[]
for x in ask_posts:
    result_list.append([x.iloc[6], x.iloc[4]])  # [created-at timestamp, comment count]
result_list[:5]

In [None]:
counts_by_hour={}
comment_by_hour={}
for i in result_list:
    x=dt.datetime.strptime(i[0], "%m/%d/%Y %H:%M")
    hour=dt.datetime.strftime(x, '%H')
    if hour in counts_by_hour.keys():
        counts_by_hour[hour]+=1
        comment_by_hour[hour]+= i[1]
    else:
        counts_by_hour[hour]=1
        comment_by_hour[hour]=i[1]

In [None]:
counts_by_hour

Iterating over the keys of these two dictionaries, we can calculate the average number of comments per post for each hour.
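
The averaging and ranking can be sketched compactly with a dict comprehension and a sort key. The tallies below are made up for illustration, and the `toy_` names are hypothetical.

```python
# Made-up tallies for illustration; keys are zero-padded hour strings
toy_counts = {'15': 2, '02': 4, '20': 5}
toy_comments = {'15': 80, '02': 100, '20': 110}

# Average comments per post for each hour
toy_avg = {hour: round(toy_comments[hour] / toy_counts[hour], 2)
           for hour in toy_counts}

# Sort (hour, average) pairs by the average, highest first
ranked = sorted(toy_avg.items(), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

Sorting with `key=` ranks the pairs by their second element directly, so the hour and the average do not need to be swapped before sorting.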

In [None]:
avg_by_hour=[]
for comment in comment_by_hour:
        avg_by_hour.append([comment, round(comment_by_hour[comment]/counts_by_hour[comment], 2)])

In [None]:
print('The average number of comments per post are:')
sorted(avg_by_hour)

In [None]:
swap_avg_by_hour=[[value, key] for key, value in avg_by_hour]

In [None]:
sorted_swap=sorted(swap_avg_by_hour, reverse=True)

In [None]:
print("Top 5 Hours for Ask Posts Comments")
for i in sorted_swap[:5]:
    time=dt.datetime.strptime(i[1], '%H')
    hour= dt.datetime.strftime(time,'%H:%M')
    print(hour, i[0])

#### Our findings show that the best time to ask questions is 3 pm. As per the description of the dataset, the time zone is Eastern Time in the US.
#### So, for fellow Indians interested in getting a lot of traffic to their questions, you may have to burn the midnight oil and post at around 1:30 AM IST (12:30 AM while US daylight saving time is in effect).
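
The conversion to Indian Standard Time can be checked with a short sketch. It assumes fixed UTC offsets rather than a time zone database, and the two dates are arbitrary examples, one in winter and one in summer.

```python
import datetime as dt

# Fixed UTC offsets (an assumption made here to avoid needing a tz database)
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')  # US Eastern, standard time
EDT = dt.timezone(dt.timedelta(hours=-4), 'EDT')  # US Eastern, daylight time
IST = dt.timezone(dt.timedelta(hours=5, minutes=30), 'IST')  # India

# Arbitrary example dates: 3 pm Eastern in winter and in summer
winter_post = dt.datetime(2016, 1, 15, 15, 0, tzinfo=EST)
summer_post = dt.datetime(2016, 7, 15, 15, 0, tzinfo=EDT)

winter_ist = winter_post.astimezone(IST)
summer_ist = summer_post.astimezone(IST)
print(winter_ist.strftime('%H:%M'))
print(summer_ist.strftime('%H:%M'))
```

In both cases the converted time falls on the following calendar day in India.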



## Question 3: Determine whether Show or Ask posts receive more points on average.

In [None]:
total_ask_votes=0
for i in ask_posts:
    total_ask_votes += i.iloc[3]  # position 3 holds the points (votes); .iloc keeps the lookup positional
avg_ask_votes= total_ask_votes/ len(ask_posts)
print ('Total votes on Ask HN posts are:', total_ask_votes)
print ('Average votes on Ask HN posts are:', avg_ask_votes)

In [None]:
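total_show_votes=0
for i in show_posts:
    total_show_votes += i.iloc[3]  # position 3 holds the points (votes)
# Divide by the number of Show posts (not Ask posts) and keep a separate variable
avg_show_votes= total_show_votes/ len(show_posts)
print ('Total votes on Show HN posts are:', total_show_votes)
print ('Average votes on Show HN posts are:', avg_show_votes)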

### On average, a Show post receives more votes than an Ask post

## Question 4: Determine if posts created at a certain time are more likely to receive more points.

In [None]:
import datetime as dt
result_list=[]
for x in ask_posts:
    result_list.append([x.iloc[6], x.iloc[3]])  # [created-at timestamp, points]
print(result_list[:5])

## Converting to hour
count_by_hour={}
vote_by_hour={}
for i in result_list:
    time=dt.datetime.strptime(i[0], '%m/%d/%Y %H:%M')
    hour=dt.datetime.strftime(time, '%H')
    if hour in count_by_hour.keys():
        count_by_hour[hour]+=1
        vote_by_hour[hour]+=i[1]
    else:
        count_by_hour[hour]=1
        vote_by_hour[hour]=i[1]

In [None]:
vote_by_hour

In [None]:
avgvote_by_hour=[]
for vote in vote_by_hour:
        avgvote_by_hour.append([vote, round(vote_by_hour[vote]/count_by_hour[vote], 2)])

sort_list=[[value, key] for key, value in avgvote_by_hour ]
sorted_swap=sorted(sort_list, reverse=True)

In [None]:
print("Top 5 Hours for Ask Posts Votes")
for i in sorted_swap[:5]:
    time=dt.datetime.strptime(i[1], '%H')
    hour= dt.datetime.strftime(time,'%H:%M')
    print(hour, i[0])

Apparently, the best time to post queries is also the best time to get votes on your posts. In fact, the top three hours, i.e. 3 pm, 1 pm and 12 pm, see the most activity on Ask HN questions.
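
The hourly aggregation used for Questions 2 and 4 can also be sketched in pandas. The rows below are made up, and the column names `created_at` and `num_points` are assumptions for illustration.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumptions
toy = pd.DataFrame({
    'created_at': ['9/26/2016 15:10', '9/25/2016 15:45', '9/24/2016 2:05'],
    'num_points': [10, 20, 6],
})

# Parse the timestamps and pull out the hour of day (0-23)
toy['hour'] = pd.to_datetime(toy['created_at'], format='%m/%d/%Y %H:%M').dt.hour

# Mean points per hour, highest first
avg_points = toy.groupby('hour')['num_points'].mean().sort_values(ascending=False)
print(avg_points)
```

Here the hour is an integer rather than a zero-padded string, and `groupby` replaces the two running dictionaries.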

## Question 5: How does the Other category compare to Ask/Show HN in terms of average number of comments and votes?

In [None]:
total_other_comments=0
total_other_votes=0
for i in other_posts:
    total_other_comments += i.iloc[4]  # position 4 holds the comment count
    total_other_votes += i.iloc[3]     # position 3 holds the points (votes)
avg_other_comments= total_other_comments/ len(other_posts)
avg_other_votes=total_other_votes/ len(other_posts)
print ('Average comments on Other posts are:', avg_other_comments)
print ('Average votes on Other HN posts are:', avg_other_votes)

Earlier we saw that Ask HN posts have a comment average of 10.4, while Show HN posts have an average of 4.88. The Other posts lie between these two values.

Let's see how the votes compare for the three categories.

In [None]:
total_ask_votes=0
total_show_votes=0
for i in ask_posts:
    total_ask_votes += i.iloc[3]  # position 3 holds the points (votes)
for show in show_posts:
    total_show_votes += show.iloc[3]
avg_ask_votes= total_ask_votes/ len(ask_posts)
avg_show_votes=total_show_votes/ len(show_posts)
print ('Average votes on Ask posts are:', avg_ask_votes)
print ('Average votes on Show HN posts are:', avg_show_votes)