# Comments Received per Hacker News Post

In this project, I will compare two types of posts, Ask HN and Show HN, that are commonly found on the popular technology-themed website [Hacker News](https://news.ycombinator.com/). Ask HN posts let users ask the Hacker News community specific questions, whereas Show HN posts let users show the community their projects.

I will be attempting to answer the following questions in my analysis:
- Does either type of post receive more comments on average?
- Does the time at which a post is published affect how many comments it receives?

## 1. First Look at the Dataset

The [dataset](https://www.kaggle.com/hacker-news/hacker-news-posts) I am analyzing comes from Kaggle and can be accessed by clicking the link. The file read in below, hacker_news.csv, contains 293,119 posts (the three category counts reported in section 1b sum to this figure). First, I will import the reader function from the csv module, read in the dataset, and convert it into a list of lists.

In [1]:
from csv import reader
hacker = open("hacker_news.csv")
hacker = reader(hacker)
hn = list(hacker)
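
The cell above never closes the file handle. A minimal sketch of the same loading step with a `with` block, which closes the file automatically, might look like the following. The helper name `read_rows` and the `encoding` default are assumptions for illustration and are not part of the original notebook.

```python
from csv import reader

def read_rows(path, encoding="utf-8"):
    """Return every row of a CSV file as a list of lists of strings."""
    # newline="" is the setting the csv module documentation recommends for reader()
    with open(path, newline="", encoding=encoding) as f:
        return list(reader(f))
```

With this helper, `hn = read_rows("hacker_news.csv")` would produce the same list of lists as the cell above, assuming the file is UTF-8 encoded.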

### a) Determining Important Columns

Next, I will display the header and the first five posts to get a better sense of how the dataset is organized, which will aid the analysis. I will separate the header from the rest of the dataset so that the posts are easier to process in loops.

In [2]:
header = hn[0]
header

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [3]:
hn_body = hn[1:]
hn_body[:5]

[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]

The most important columns to analyze for answering my main questions are:
- title = determines if the post is Ask HN, Show HN, or normal
- num_comments = measures how much discussion the post attracted
- created_at = records the date and time the post was created

### b) Sorting the Data by Type of Post

There are three categories of posts that we will be analyzing: Ask HN, Show HN, and other. In order to analyze the subset of posts individually, we will separate the subsets into individual lists. Note that the titles of all Ask HN posts will start with Ask HN and the titles of all Show HN posts will start with Show HN. This will make it easy to separate the posts from one another.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

In [5]:
for post in hn_body:
    title = post[1].lower() # converts the title to lowercase to make comparison easier
    
    if title[:6] == "ask hn": # checks to see if the first 6 characters of the title are ask hn
        ask_posts.append(post)
    elif title[:7] == "show hn": # checks to see if the first 7 characters of the title are show hn
        show_posts.append(post)
    else:
        other_posts.append(post)
        
print("There are", len(ask_posts), "Ask HN posts in our dataset.")
print("There are", len(show_posts), "Show HN posts in our dataset.")
print("There are", len(other_posts), "other posts in our dataset.")

There are 9139 Ask HN posts in our dataset.
There are 10158 Show HN posts in our dataset.
There are 273822 other posts in our dataset.
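
As an alternative to slicing, the same prefix check can be sketched with `str.startswith`, which avoids counting characters by hand. The function name `classify_title` and the sample titles are hypothetical and only illustrate the idea.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()  # lowercase so the comparison ignores capitalization
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Hypothetical sample titles, not rows taken from the dataset
samples = ["Ask HN: How do you learn?", "SHOW HN: My project", "algorithmic music"]
print([classify_title(t) for t in samples])  # ['ask', 'show', 'other']
```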


## 2. Comparing Ask HN and Show HN Posts

Once again, the two main questions I will be attempting to answer in my analysis are:
- Does either type of post receive more comments on average?
- Does the time at which a post is published affect how many comments it receives?

### a) Do Ask HN or Show HN Posts Receive More Comments on Average?

First, I must determine whether Ask HN or Show HN posts have more comments on average. The function below takes in a dataset (as a list of rows) and a column index and returns the average value of that column across the dataset.

In [6]:
def average_index(dataset, index):
    # Returns the average of the integer values stored in the given column
    index_total = 0
    for row in dataset:
        try:  # ensures the column value can be converted into an integer
            index_total += int(row[index])
        except ValueError:
            # Raises an error instead of returning a message string, which would break the round() calls below
            raise ValueError("Index column must represent an object that can be converted into an integer.")
    return index_total / len(dataset)

avg_ask_comments = average_index(ask_posts, 4)
avg_show_comments = average_index(show_posts, 4)
avg_other_comments = average_index(other_posts, 4)

print("Ask HN posts have", round(avg_ask_comments, 2), "comments on average.")
print("Show HN posts have", round(avg_show_comments, 2), "comments on average.")
print("Other posts have", round(avg_other_comments, 2), "comments on average.")

Ask HN posts have 10.39 comments on average.
Show HN posts have 4.89 comments on average.
Other posts have 6.46 comments on average.


Ask HN posts have over twice as many comments on average as Show HN posts do. Other posts have more comments on average than Show HN posts but fewer than Ask HN posts.
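
The averaging step can also be sketched more compactly with `sum` and a list comprehension. The name `average_column` and the toy rows are assumptions for illustration. Unlike the function above, this version raises an explicit error for an empty dataset.

```python
def average_column(rows, index):
    """Return the mean of the integer values found in one column of `rows`."""
    values = [int(row[index]) for row in rows]  # int() raises ValueError on non-numeric text
    if not values:
        raise ValueError("cannot average an empty dataset")
    return sum(values) / len(values)

# Hypothetical toy rows: [title, num_comments]
toy_rows = [["first post", "3"], ["second post", "5"]]
print(average_column(toy_rows, 1))  # 4.0
```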

### b) Average Number of Comments by Hour

Now that I have determined which type of post receives more comments on average, I can investigate whether the time of day at which a post is created affects the average number of comments it receives. My overall strategy involves a series of functions that build on one another, each breaking the task down into smaller, more manageable steps:
- extract_dates_and_comments = takes in a dataset and returns a list of lists containing the date and number of comments for each data entry
- avg_comments_by_hour = takes in a dataset and returns a dictionary that contains the average number of comments per post, keyed by the hour in which the post was created
- hour_organizer = converts an hour (as a string) from the format '00' - '23' to a 12-hour range such as '10:00 - 10:59 PM'
- sort_by_most_avg_comments = takes in a dataset and uses the above functions to return a list of lists, sorted by average number of comments (descending), that contains each hour and its average number of comments

Note that I will be using the Ask HN data to develop the intermediate functions, but once the final function is completed, I will apply it to both the Ask HN and Show HN subsets. Additionally, I will create another function later that displays the data more neatly. Before I start working on these functions, I need to import the datetime class from the datetime library, as it is needed in the functions that parse dates.

In [7]:
from datetime import datetime as dt

#### i) Extracting the Dates and Number of Comments

This function will take a dataset (so long as it is one of the three subsets ask_posts, show_posts, or other_posts) and return a list of lists containing the date and the number of comments for each data entry within the dataset.

In [8]:
def extract_dates_and_comments(dataset):
    dates_and_comments = []
    
    for data in dataset:
        date_created = data[6] # date and time the post was created
        num_comments = int(data[4]) # number of comments on the data entry
        dates_and_comments.append([date_created, num_comments])
        
    return dates_and_comments
    
print(extract_dates_and_comments(ask_posts)[:5])

[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3], ['9/25/2016 22:57', 0], ['9/25/2016 22:48', 3], ['9/25/2016 21:50', 2]]


#### ii) Average Number of Comments Per Post By Hour

Now I will use the extract_dates_and_comments function within another function to find the average number of comments in each hour a post was created. This function will use the dt.strptime method to select the hour part of the date in the dataset and will use loops and dictionaries to calculate the average number of comments per post per hour.

In [9]:
def avg_comments_by_hour(dataset):
    dates_and_comments = extract_dates_and_comments(dataset)
    posts_by_hour = {} # will contain the total number of posts per hour
    tot_com_by_hour = {} # will contain the total number of comments per hour
    avg_com_by_hour = {} # will use the two dictionaries above to calculate the average comments per post per hour
    
    for data in dates_and_comments:
        date = dt.strptime(data[0], "%m/%d/%Y %H:%M") # will convert string into a datetime object (easier to manipulate)
        comments = data[1] 
        hour = date.strftime("%H") # selects hour from datetime object
        
        if hour not in posts_by_hour:
            posts_by_hour[hour] = 1
            tot_com_by_hour[hour] = comments
        else:
            posts_by_hour[hour] += 1
            tot_com_by_hour[hour] += comments
    
    # Uses dictionaries already created to create a new dictionary
    for key in posts_by_hour:
        avg_com_by_hour[key] = round(tot_com_by_hour[key] / posts_by_hour[key], 2)
        
    return avg_com_by_hour
        
        
print(avg_comments_by_hour(ask_posts))

{'02': 11.14, '01': 7.41, '22': 8.8, '21': 8.69, '19': 7.16, '17': 9.45, '15': 28.68, '14': 9.69, '13': 16.32, '11': 8.96, '10': 10.68, '09': 6.65, '07': 7.01, '03': 7.95, '23': 6.7, '20': 8.75, '16': 7.71, '08': 9.19, '00': 7.56, '18': 7.94, '12': 12.38, '04': 9.71, '06': 6.78, '05': 8.79}
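
The same grouping logic can be sketched with a single `defaultdict` instead of two parallel dictionaries. The function name `hourly_averages` and the toy input are assumptions for illustration. It expects the [created_at, num_comments] pairs that extract_dates_and_comments returns.

```python
from collections import defaultdict
from datetime import datetime

def hourly_averages(pairs):
    """Map each hour ('00'-'23') to the mean comment count of posts created in that hour."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [post count, comment total]
    for created_at, comments in pairs:
        hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
        totals[hour][0] += 1
        totals[hour][1] += comments
    return {hour: round(total / count, 2) for hour, (count, total) in totals.items()}

# Hypothetical toy input in the same shape as the extract_dates_and_comments output
sample = [["9/26/2016 2:53", 7], ["9/25/2016 2:10", 3], ["9/25/2016 22:57", 0]]
print(hourly_averages(sample))  # {'02': 5.0, '22': 0.0}
```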


#### iii) Hour Organizer

Currently, each hour is displayed as a two-digit, 24-hour (military) value. I will create a function that converts this value to a 12-hour range with an AM/PM suffix (for example, '22' becomes '10:00 - 10:59 PM') so that it is more readable.

In [10]:
def hour_organizer(hour):
    if int(hour) >= 22:
        hour_str = str(int(hour) - 12)
    elif int(hour) >= 13:
        hour_str = "0" + str(int(hour) - 12)
    elif int(hour) >= 1:
        hour_str = hour
    else:
        hour_str = "12"
    
    new_hour = hour_str + ":00 - " + hour_str + ":59"
    
    if int(hour) >= 12:
        new_hour += " PM"
    else:
        new_hour += " AM"
        
    return new_hour

print(hour_organizer("22"))

10:00 - 10:59 PM
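
For comparison, the conversion can be sketched by letting `datetime` do the 12-hour arithmetic through the `%I` format code. The name `hour_range_label` is hypothetical. The AM/PM suffix is computed by hand here so the result does not depend on locale settings.

```python
from datetime import datetime

def hour_range_label(hour):
    """Convert an hour string '00'-'23' to a label such as '10:00 - 10:59 PM'."""
    start = datetime.strptime(hour, "%H")
    twelve_hour = start.strftime("%I")  # zero-padded 12-hour value, e.g. '10' or '03'
    suffix = "PM" if start.hour >= 12 else "AM"
    return f"{twelve_hour}:00 - {twelve_hour}:59 {suffix}"

print(hour_range_label("22"))  # 10:00 - 10:59 PM
```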


#### iv) Sorting the Hours by Largest Average Number of Comments

Now that the hour can easily be converted into a more readable format, I will apply this function to the hours in our dictionary. I will sort the hours by highest average number of comments. Unfortunately, dictionaries are not easily sortable, so the function will need to convert the dictionary above into a list of lists and then sort the data accordingly.

In [11]:
def sort_by_most_avg_comments(dataset):
    avg_com_by_hour = avg_comments_by_hour(dataset)
    avg_com_by_hour_sorted = []
    avg_com_by_hour_switched = [] # Necessary to switch the order of the comments and the hour later
    
    for hour in avg_com_by_hour:
        avg_comments = avg_com_by_hour[hour]
        formatted_hour = hour_organizer(hour) # Converts two-digit, military time into a 12-hour range
        
        # Average number of comments is first so that the data can be sorted (order will be switched after sorting)
        avg_com_by_hour_sorted.append([round(avg_comments, 2), formatted_hour])
        
    avg_com_by_hour_sorted.sort(reverse=True) # Sorts the hours by average number of comments from highest to lowest
    
    # Reverses the order of average number of comments and the hour of the day (for readability purposes)
    for row in avg_com_by_hour_sorted:
        avg_com_by_hour_switched.append([row[1], row[0]])
        
    return avg_com_by_hour_switched
    
print(sort_by_most_avg_comments(ask_posts))

[['03:00 - 03:59 PM', 28.68], ['01:00 - 01:59 PM', 16.32], ['12:00 - 12:59 PM', 12.38], ['02:00 - 02:59 AM', 11.14], ['10:00 - 10:59 AM', 10.68], ['04:00 - 04:59 AM', 9.71], ['02:00 - 02:59 PM', 9.69], ['05:00 - 05:59 PM', 9.45], ['08:00 - 08:59 AM', 9.19], ['11:00 - 11:59 AM', 8.96], ['10:00 - 10:59 PM', 8.8], ['05:00 - 05:59 AM', 8.79], ['08:00 - 08:59 PM', 8.75], ['09:00 - 09:59 PM', 8.69], ['03:00 - 03:59 AM', 7.95], ['06:00 - 06:59 PM', 7.94], ['04:00 - 04:59 PM', 7.71], ['12:00 - 12:59 AM', 7.56], ['01:00 - 01:59 AM', 7.41], ['07:00 - 07:59 PM', 7.16], ['07:00 - 07:59 AM', 7.01], ['06:00 - 06:59 AM', 6.78], ['11:00 - 11:59 PM', 6.7], ['09:00 - 09:59 AM', 6.65]]
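
The swap, sort, and swap-back steps can be avoided by passing a `key` function to the built-in `sorted`. The sketch below uses the hypothetical name `sort_by_value`. Note that ties are handled differently: `sorted` keeps tied entries in their original order, whereas the function above orders them by label.

```python
def sort_by_value(averages):
    """Return [key, value] pairs ordered from the highest value to the lowest."""
    ordered = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return [[key, value] for key, value in ordered]

# Three of the Ask HN hourly averages printed earlier in this section
sample = {"02": 11.14, "15": 28.68, "09": 6.65}
print(sort_by_value(sample))  # [['15', 28.68], ['02', 11.14], ['09', 6.65]]
```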


I will examine this data in greater detail (as well as create a function to display the results more effectively) later in my analysis. For now, I will move on to the next step.

### c) Average Number of Comments by Time Period

In order to simplify the results, I will divide the day into eight time periods of equal length (three hours each) and calculate the average number of comments received per post within each period. The time periods are:
- Mid Night = 12:00 - 2:59 AM
- Late Night = 3:00 - 5:59 AM
- Early Morning = 6:00 - 8:59 AM
- Late Morning = 9:00 - 11:59 AM
- Early Afternoon = 12:00 - 2:59 PM
- Late Afternoon = 3:00 - 5:59 PM
- Evening = 6:00 - 8:59 PM
- Early Night = 9:00 - 11:59 PM

I will create a function that takes in an hour (in the format '00' - '23') and returns one of the time periods listed above. Then another function will take in the dataset and return the average number of comments per post per period, using extract_dates_and_comments and this new function. Luckily, the code for this second function will be very similar to the code in the avg_comments_by_hour function, so it shouldn't be difficult to create.

#### i) Time Period Organizer

This function will take a string that represents an hour (in the format '00'-'23') and return the time period that hour belongs to. I will use this function later on to calculate the average number of comments per post per time period.

In [12]:
def time_frame_organizer(hour):
    new_hour = int(hour)
    
    if new_hour in range(0, 3):
        period = "Mid Night"
    elif new_hour in range(3, 6):
        period = "Late Night"
    elif new_hour in range(6, 9):
        period = "Early Morning"
    elif new_hour in range(9, 12):
        period = "Late Morning"
    elif new_hour in range(12, 15):
        period = "Ear. Afternoon"
    elif new_hour in range(15, 18):
        period = "Late Afternoon"
    elif new_hour in range(18, 21):
        period = "Evening"
    else:
        period = "Early Night"
        
    return period

print(time_frame_organizer("14"))

Ear. Afternoon
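
Because every period spans exactly three hours, the lookup can also be sketched with integer division in place of the chain of range checks. The names `PERIODS` and `period_for_hour` are hypothetical. Unlike the function above, this version raises an IndexError for hours above 23 instead of falling through to "Early Night".

```python
# Period labels in chronological order, using the same spellings as time_frame_organizer
PERIODS = ["Mid Night", "Late Night", "Early Morning", "Late Morning",
           "Ear. Afternoon", "Late Afternoon", "Evening", "Early Night"]

def period_for_hour(hour):
    """Return the three-hour period that an hour string '00'-'23' falls into."""
    return PERIODS[int(hour) // 3]  # hours 0-2 -> index 0, hours 3-5 -> index 1, and so on

print(period_for_hour("14"))  # Ear. Afternoon
```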


#### ii) Average Number of Comments Per Post Per Period

I've edited a lot of the other functions from the *Average Number of Comments By Hour* section and merged them into a single function in order to save time. This function will take a dataset and return a list of lists containing the average number of comments per time period, sorted by descending number of comments.

Although this function does not split the work across as many helper functions as the by-hour code does (sorting happens inside it rather than in a separate function such as sort_by_most_avg_comments), it works just as well.

In [13]:
def avg_comments_by_period(dataset):
    dates_and_comments = extract_dates_and_comments(dataset)
    posts_by_period = {} # will contain the total number of posts per period
    tot_com_by_period = {} # will contain the total number of comments per period
    avg_com_by_period = {} # will use the two dictionaries above to calculate the average comments per post per period
    avg_com_by_period_sort = []
    avg_com_by_period_reverse = []
    
    for data in dates_and_comments:
        date = dt.strptime(data[0], "%m/%d/%Y %H:%M") # will convert string into a datetime object (easier to manipulate)
        comments = data[1] 
        hour = date.strftime("%H") # selects hour from datetime object
        period = time_frame_organizer(hour)
        
        if period not in posts_by_period:
            posts_by_period[period] = 1
            tot_com_by_period[period] = comments
        else:
            posts_by_period[period] += 1
            tot_com_by_period[period] += comments
    
    # Uses dictionaries already created to create a new dictionary
    for key in posts_by_period:
        avg_com_by_period[key] = round(tot_com_by_period[key] / posts_by_period[key], 2)
    
    # Converts the dictionary into a list of lists so it can be sorted (switches num_comments, period for same reason)
    for key, value in avg_com_by_period.items():
        avg_com_by_period_sort.append([value, key])
    
    # Sorts the list of lists from largest to smallest
    avg_com_by_period_sort.sort(reverse=True)
    
    # Reverses the sorted list back to [period, num_comments] format
    for period in avg_com_by_period_sort:
        avg_com_by_period_reverse.append([period[1], period[0]])
    
    return avg_com_by_period_reverse

print(avg_comments_by_period(ask_posts))


[['Late Afternoon', 15.75], ['Ear. Afternoon', 12.66], ['Late Morning', 8.93], ['Late Night', 8.79], ['Mid Night', 8.64], ['Early Night', 8.17], ['Evening', 7.93], ['Early Morning', 7.72]]


### d) Displaying the Average Comments Per Post Per Hour/Period Results

This function will take in one of the HN data subsets (ask_posts, show_posts, other_posts) and one of the functions that return a sorted list of averages (sort_by_most_avg_comments for hours, avg_comments_by_period for periods) and print the results in a readable format.

In [14]:
def display_results(dataset, func):
    sorted_list = func(dataset)
    
    for item in sorted_list:
        print(item[0] + ":\t" + str(item[1]))
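
Tab characters line up only when the labels have similar lengths. As a sketch of an alternative, fixed-width format specifiers keep the columns aligned regardless of label length and always show two decimal places. The name `format_results` and the `width` default are assumptions for illustration.

```python
def format_results(rows, width=18):
    """Return one aligned text line per [label, value] pair."""
    lines = []
    for label, value in rows:
        name = label + ":"
        # left-align the label in `width` characters, right-align the value with 2 decimals
        lines.append(f"{name:<{width}}{value:>6.2f}")
    return lines

# Two of the Ask HN period averages computed earlier
for line in format_results([["Late Afternoon", 15.75], ["Evening", 7.93]]):
    print(line)
```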

#### i) Ask HN: Average Number of Comments Per Post Per Period

In [15]:
display_results(ask_posts, avg_comments_by_period)

Late Afternoon:	15.75
Ear. Afternoon:	12.66
Late Morning:	8.93
Late Night:	8.79
Mid Night:	8.64
Early Night:	8.17
Evening:	7.93
Early Morning:	7.72


#### ii) Ask HN: Average Number of Comments Per Post Per Hour

In [16]:
display_results(ask_posts, sort_by_most_avg_comments)

03:00 - 03:59 PM:	28.68
01:00 - 01:59 PM:	16.32
12:00 - 12:59 PM:	12.38
02:00 - 02:59 AM:	11.14
10:00 - 10:59 AM:	10.68
04:00 - 04:59 AM:	9.71
02:00 - 02:59 PM:	9.69
05:00 - 05:59 PM:	9.45
08:00 - 08:59 AM:	9.19
11:00 - 11:59 AM:	8.96
10:00 - 10:59 PM:	8.8
05:00 - 05:59 AM:	8.79
08:00 - 08:59 PM:	8.75
09:00 - 09:59 PM:	8.69
03:00 - 03:59 AM:	7.95
06:00 - 06:59 PM:	7.94
04:00 - 04:59 PM:	7.71
12:00 - 12:59 AM:	7.56
01:00 - 01:59 AM:	7.41
07:00 - 07:59 PM:	7.16
07:00 - 07:59 AM:	7.01
06:00 - 06:59 AM:	6.78
11:00 - 11:59 PM:	6.7
09:00 - 09:59 AM:	6.65


#### iii) Show HN: Average Number of Comments Per Post Per Period

In [17]:
display_results(show_posts, avg_comments_by_period)

Ear. Afternoon:	5.91
Early Morning:	5.72
Late Morning:	4.92
Evening:	4.73
Mid Night:	4.6
Late Afternoon:	4.52
Late Night:	4.38
Early Night:	4.13


#### iv) Show HN: Average Number of Comments Per Post Per Hour

In [18]:
display_results(show_posts, sort_by_most_avg_comments)

12:00 - 12:59 PM:	6.99
07:00 - 07:59 AM:	6.68
11:00 - 11:59 AM:	6.0
08:00 - 08:59 AM:	5.6
02:00 - 02:59 PM:	5.52
01:00 - 01:59 PM:	5.43
02:00 - 02:59 AM:	5.15
04:00 - 04:59 AM:	5.04
07:00 - 07:59 PM:	5.02
06:00 - 06:59 PM:	4.94
06:00 - 06:59 AM:	4.71
04:00 - 04:59 PM:	4.71
09:00 - 09:59 AM:	4.67
12:00 - 12:59 AM:	4.65
03:00 - 03:59 PM:	4.57
11:00 - 11:59 PM:	4.53
03:00 - 03:59 AM:	4.53
05:00 - 05:59 PM:	4.25
08:00 - 08:59 PM:	4.16
09:00 - 09:59 PM:	4.09
01:00 - 01:59 AM:	4.07
10:00 - 10:59 PM:	3.85
10:00 - 10:59 AM:	3.8
05:00 - 05:59 AM:	3.44


## 3. Results

The two most important conclusions from this analysis are that Ask HN posts receive more than twice as many comments on average as Show HN posts (10.39 vs. 4.89) and that the average number of comments varies considerably with the hour at which a post is created, especially for Ask HN posts.

03:00 - 03:59 PM is by far the best hour to post an Ask HN query if you want to receive answers. Ask HN posts created in this hour receive 28.68 comments on average, while the next best hour (01:00 - 01:59 PM) averages only 16.32. The spread is wide: the worst hour to post an Ask HN query (09:00 - 09:59 AM) averages only 6.65 comments, less than a quarter of the 03:00 - 03:59 PM figure. Even so, an Ask HN post created in that worst hour still averages more comments than the overall Show HN average of 4.89, and only the two best Show HN hours (12:00 - 12:59 PM at 6.99 and 07:00 - 07:59 AM at 6.68) exceed it.

In general, Ask HN posts published in the afternoon are most likely to receive comments: the top three hours (03:00 PM, 01:00 PM, and 12:00 PM) all fall in the afternoon, and Late Afternoon and Early Afternoon are the two best periods. This makes some intuitive sense if many people get off work around mid-afternoon and then have several hours of leisure time in which they can respond. The pattern outside the afternoon is less clear-cut: 02:00 - 02:59 AM ranks fourth (11.14), while the lowest averages fall at 09:00 - 09:59 AM, 11:00 - 11:59 PM, and 06:00 - 07:59 AM, which is only partly consistent with fewer people being online late at night and in the early morning.

Although Show HN posts published at 12:00 - 12:59 PM average about twice as many comments as those published at 05:00 - 05:59 AM (6.99 vs. 3.44), the time of day matters much less for Show HN posts than for Ask HN posts. It is still notable that the three periods from Early Morning through Early Afternoon have the three highest averages (5.72, 4.92, and 5.91), ahead of every other period.
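
The ratios quoted in this section can be checked directly from the averages printed earlier. The snippet below is only a quick arithmetic check; the variable names are chosen for illustration.

```python
# Averages copied from the printed results in this notebook
ask_overall, show_overall = 10.39, 4.89   # overall averages from section 2a
ask_best, ask_worst = 28.68, 6.65         # Ask HN: 03:00 PM hour vs. 09:00 AM hour
show_best, show_worst = 6.99, 3.44        # Show HN: 12:00 PM hour vs. 05:00 AM hour

ask_vs_show = round(ask_overall / show_overall, 2)
ask_worst_share = round(ask_worst / ask_best, 2)
show_spread = round(show_best / show_worst, 2)

print(ask_vs_show)      # 2.12 -> Ask HN averages a little over twice the Show HN figure
print(ask_worst_share)  # 0.23 -> the worst Ask HN hour gets under a quarter of the best
print(show_spread)      # 2.03 -> the best Show HN hour gets about twice the worst
```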