## Analyzing posts on the "Reddit-like" website Hacker News - Ryan Lacarne
Based on Hacker News posts from September 2015-2016.

In this project, I analyze Hacker News posts to see what trends I can spot between the posts and the number of comments, ratings, and attention they receive on the website.
I used the datetime module quite extensively, as well as the random module for the first time.

First up, I will read in the data and get an overview of the first few rows.

In [33]:
from csv import reader
opened_file = open('hacker_news.csv',encoding="utf8")
read_file = reader(opened_file)
hn = list(read_file)
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16']]

Next, as customary, I remove the header.

In [34]:
hn_header = hn[0]
hn = hn[1:]
print("The length of the dataset is", len(hn), "rows.")

The length of the dataset is 293119 rows.
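
The load and header steps above can also be written with a `with` block, which closes the file automatically, and `next()` to peel off the header row. This is only a sketch: the helper name `load_rows` is hypothetical, and a one-row in-memory sample stands in for `hacker_news.csv`.

```python
import io
from csv import reader


def load_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = reader(file_obj)
    header = next(rows)          # the first row holds the column names
    return header, list(rows)    # everything after it is data


# In-memory stand-in for hacker_news.csv, using one row from the output above
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
)

header, rows = load_rows(sample)
print(header)
print(len(rows))

# With the real file this would look like:
# with open('hacker_news.csv', encoding="utf8") as f:
#     hn_header, hn = load_rows(f)
```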


## Data Cleaning and Wrangling 

To get a better sample, I will now remove all the submissions that did not receive any comments. 

In [35]:
hn_commented = []
hn_nocomments = []
for row in hn:
    comments = row[4]
    if comments != "0":
        hn_commented.append(row)
    else:
        hn_nocomments.append(row)

hn_commented_len = len(hn_commented)
hn_nocomments_len = len(hn_nocomments)
print (hn_commented_len)
print (hn_nocomments_len)



80401
212718


That is still a heck of a lot of data to work from. I have decided that I would like to work from 20,000 articles. Therefore, I import the random library to create a random sample from my dataset (which now has no articles with 0 comments).

In [36]:
import random
hn_final = random.sample(hn_commented,20000)
hn_final_len = len(hn_final)
print (hn_final_len)


20000


This works. The problem is that every time I run the whole notebook, I draw a new random set of 20,000 articles from the 80,401 articles that received comments, so the numbers below change from run to run. In the future I would like to revisit this project and fix this, because it makes the analysis impossible to reproduce.
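
One way to keep the sample stable between runs is to seed the random number generator before sampling. A minimal sketch, assuming a small stand-in list in place of `hn_commented` and an arbitrary seed value of 1:

```python
import random

# Stand-in for hn_commented; any list works for the demonstration
population = list(range(1000))

random.seed(1)                          # the seed value is arbitrary
first = random.sample(population, 5)

random.seed(1)                          # re-seeding replays the same draw
second = random.sample(population, 5)

print(first == second)                  # True
```

In the notebook, calling `random.seed(...)` in the same cell, just before `random.sample(hn_commented, 20000)`, should give the same 20,000 rows on every run, as long as `hn_commented` itself does not change.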

The two types of posts we'll explore begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question, and users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

First, let's divide these posts up out of our sample of 20000 articles.

In [37]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn_final:
    title = row [1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
ask_posts_len = len(ask_posts)
show_posts_len= len(show_posts)
other_posts_len = len(other_posts)

print("There are", ask_posts_len, "Ask HN posts.")
print("There are", show_posts_len, "Show HN posts.")
print("There are", other_posts_len, "posts that are neither Ask HN nor Show HN.")


There are 1728 Ask HN posts.
There are 1324 Show HN posts.
There are 16948 posts that are neither Ask HN nor Show HN.


## Analysing the Attributes of the Posts

In [38]:
total_ask_comments = 0
for i in ask_posts:
    total_ask_comments += float(i[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print ("The average number of comments on Ask HN posts is ",round(avg_ask_comments,0),".")

The average number of comments on Ask HN posts is  14.0 .


In [39]:
total_show_comments = 0
for i in show_posts:
    total_show_comments += float(i[4])
avg_show_comments = total_show_comments / len(show_posts)
print ("The average number of comments on Show HN posts is ",round(avg_show_comments,0),".")

The average number of comments on Show HN posts is  11.0 .


Ask HN posts therefore attract more comments on average: about 14 per post, against about 11 for Show HN posts (14,105 comments over 1,324 posts). I will nonetheless focus on Show HN posts for the rest of this project.
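
Both averages follow the same recipe, so they can be computed by one small helper instead of two hand-copied cells, which leaves fewer variable names to mix up. A minimal sketch; the function name `average_comments` and the toy rows are hypothetical, and the comment count is assumed to sit at index 4 as in the dataset:

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments over a list of rows."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Toy rows in the same column order as the dataset (values are made up)
toy_ask = [
    ["1", "Ask HN: first question", "", "3", "10", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: second question", "", "5", "20", "user_b", "9/26/2016 4:10"],
]
toy_show = [
    ["3", "Show HN: first project", "", "2", "4", "user_c", "9/26/2016 5:00"],
    ["4", "Show HN: second project", "", "7", "6", "user_d", "9/26/2016 6:30"],
]

print(average_comments(toy_ask))    # 15.0
print(average_comments(toy_show))   # 5.0
```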

I'm going to try to determine whether show posts created at a certain time are more likely to attract comments. To do so, I will:
1. Calculate the number of show posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments show posts receive by hour created.

For this, I use the datetime module.
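
As a quick illustration of the parsing step, here is one `created_at` value from the first rows of the dataset run through `strptime`. The variable name `created` is only for this example:

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# Parse the string into a datetime object
created = dt.datetime.strptime("9/26/2016 3:26", date_format)

print(created.hour)             # 3, as an integer
print(created.strftime("%H"))   # 03, as a zero-padded string
```

Either form can serve as a dictionary key for the hourly tallies; the zero-padded string sorts correctly as text.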

First up, I calculate the number of show posts created in each hour of the day, along with the number of comments received.


In [40]:
import datetime as dt
result_list = []
for row in show_posts:
    creationtime = row[6]
    comments = float(row [4])
    result_list.append([creationtime,comments])

    
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time]+= comment
        counts_by_hour[time] += 1
    else: 
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1
        
comments_by_hour
    

{'04': 357.0,
 '20': 629.0,
 '19': 995.0,
 '21': 345.0,
 '18': 1039.0,
 '12': 1215.0,
 '17': 762.0,
 '05': 329.0,
 '22': 343.0,
 '16': 970.0,
 '14': 967.0,
 '15': 1024.0,
 '07': 302.0,
 '02': 443.0,
 '13': 650.0,
 '09': 374.0,
 '23': 599.0,
 '10': 382.0,
 '11': 764.0,
 '00': 338.0,
 '01': 310.0,
 '08': 440.0,
 '06': 307.0,
 '03': 221.0}
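
The if/else bookkeeping in the cell above could be shortened with `collections.defaultdict`, which starts every missing key at zero. A sketch under that assumption, using three made-up posts rather than the real sample:

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# (created_at, number of comments) pairs; the values are invented for the demo
toy_posts = [
    ("9/26/2016 3:26", 4),
    ("9/25/2016 3:05", 6),
    ("9/26/2016 14:10", 9),
]

counts = defaultdict(int)     # posts per hour
comments = defaultdict(int)   # comments per hour

for created_at, n_comments in toy_posts:
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    counts[hour] += 1
    comments[hour] += n_comments

print(dict(counts))     # {'03': 2, '14': 1}
print(dict(comments))   # {'03': 10, '14': 9}
```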

Next, I calculate the average number of comments show posts receive by hour created.

In [41]:
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr,comments_by_hour[hr]/counts_by_hour[hr]])
    
avg_by_hour

[['04', 17.0],
 ['20', 9.530303030303031],
 ['19', 12.922077922077921],
 ['21', 6.2727272727272725],
 ['18', 12.670731707317072],
 ['12', 13.651685393258427],
 ['17', 7.855670103092783],
 ['05', 11.344827586206897],
 ['22', 6.125],
 ['16', 8.899082568807339],
 ['14', 11.114942528735632],
 ['15', 11.636363636363637],
 ['07', 10.413793103448276],
 ['02', 15.275862068965518],
 ['13', 8.125],
 ['09', 7.333333333333333],
 ['23', 17.114285714285714],
 ['10', 8.681818181818182],
 ['11', 13.89090909090909],
 ['00', 9.135135135135135],
 ['01', 12.4],
 ['08', 12.571428571428571],
 ['06', 10.964285714285714],
 ['03', 11.05]]

This gives me each hour, followed by the average number of comments per post created in that hour. This list of lists is hard to read, so I use the sorted function to put the hours with the highest averages first.

In [42]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[17.114285714285714, '23'],
 [17.0, '04'],
 [15.275862068965518, '02'],
 [13.89090909090909, '11'],
 [13.651685393258427, '12'],
 [12.922077922077921, '19'],
 [12.670731707317072, '18'],
 [12.571428571428571, '08'],
 [12.4, '01'],
 [11.636363636363637, '15'],
 [11.344827586206897, '05'],
 [11.114942528735632, '14'],
 [11.05, '03'],
 [10.964285714285714, '06'],
 [10.413793103448276, '07'],
 [9.530303030303031, '20'],
 [9.135135135135135, '00'],
 [8.899082568807339, '16'],
 [8.681818181818182, '10'],
 [8.125, '13'],
 [7.855670103092783, '17'],
 [7.333333333333333, '09'],
 [6.2727272727272725, '21'],
 [6.125, '22']]
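
The swap step is needed only because `sorted` compares the first element of each inner list by default. An alternative sketch, using three of the pairs from `avg_by_hour` above, passes a `key` function so the original `[hour, average]` order can stay as it is:

```python
# Three [hour, average] pairs taken from the avg_by_hour output above
avg_by_hour = [
    ['04', 17.0],
    ['20', 9.530303030303031],
    ['23', 17.114285714285714],
]

# Sort on the second element (the average), largest first
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

print(top_hours[0])   # ['23', 17.114285714285714]
```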

Finally, I print the top 5 "Prime Time" hours for Show HN posts, meaning the hours with the highest average number of comments per post, in a readable and easy-to-understand format.

In [43]:
print("Top 5 Hours for 'Show HN' Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for 'Show HN' Comments
23:00: 17.11 average comments per post
04:00: 17.00 average comments per post
02:00: 15.28 average comments per post
11:00: 13.89 average comments per post
12:00: 13.65 average comments per post


## Conclusion

The biggest challenge in this project was definitely getting used to the rather complex syntax of the datetime module, as well as trying to figure out how to use the random module in a way that doesn't change the dataset I'm working with every time I run this program in Jupyter. I hope to be able to revisit this project in the future and fix these things. Again, just like in my first project, I believe I would benefit from using numpy and pandas when working with the csv, to save both programmer and machine time, but this will come in future projects.

Beyond that, I think it's pretty cool that I managed to figure out the "Prime Time" of the Hacker News website for a specific type of post. This type of information is definitely something that would be useful in a business and more specifically marketing setting, where a company may want to figure out at what time their audience is most active on a particular platform, when to launch a certain product, discussion, etc. I look forward to using this kind of methodology again.