# Hacker News Posts' Review
In this project we'll analyse data on user posts from the popular tech platform **Hacker News**. Hacker News is a site started by the startup incubator Y Combinator, where users submit stories that the community then reads, comments on, and votes up or down. It serves as a medium for sharing news, developments, and personal experiences relating to tech with like-minded people on the internet.

The data we're going to look at comes from a much larger dataset containing around 300,000 rows. For simplicity, entries that did not receive any comments have been removed and the rest randomly sampled, which leaves our dataset with roughly 20,000 entries.

## Columns & What they mean

Here's a brief description of the columns in our dataset.

| column | description |
|-----------|------------|
| id | a unique numeric value assigned by HN to each post |
| title | the title of the post |
| url | the URL the post links to, if any |
| num_points | the number of points the post received |
| num_comments | the number of comments made on the post |
| author | the username of the poster |
| created_at | the date & time of the post's creation |


### Reading in the dataset

In [8]:
from csv import reader

# Read every row (header included) into a list of lists; the with-block closes the file.
with open("hacker_news.csv") as opened_file:
    hn_org = list(reader(opened_file))



### Length & the first five rows of our dataset

In [9]:
print('no. of entries in our dataset:', len(hn_org), '\n')

for post in hn_org[:5]:
    print(post)

no. of entries in our dataset: 20101 

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


Since the first row of our dataset holds the column headers, we'll store it in a new variable __headers__ and keep the remaining rows in a new variable __hn__.

In [10]:
headers = hn_org[0]
hn = hn_org[1:]

Let's print the first five rows of **hn** to verify that the header has been removed.

In [11]:
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Separating Ask HN & Show HN entries
Now that our data is ready for analysis, we'll move on to the core operations. The first is to separate out the entries whose titles start with Ask HN or Show HN (irrespective of capitalization).

In [12]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
print('no. of ask posts:',len(ask_posts), '\n'
      'no. of show posts:',len(show_posts), '\n'
      'no. of other posts:',len(other_posts))
print('\n','total posts:', len(ask_posts) + len(show_posts) + len(other_posts))

no. of ask posts: 1744 
no. of show posts: 1162 
no. of other posts: 17194

 total posts: 20100
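
The branching above can be wrapped in a small helper, which makes the case-insensitive prefix test easy to check on its own. This is only a sketch: the function name `classify_title` and the sample titles are made up for illustration and do not come from the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical titles, used only to exercise the helper.
sample_titles = ['Ask HN: Which editor do you use?',
                 'SHOW HN: My weekend project',
                 'Why I ask HN for advice']
labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```

Note that the third title contains "ask HN" but does not start with it, so it lands in the other group, just as it would in the loop above.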


### Calculating the average number of comments
Now we'll write a function to calculate the average number of comments for each of the __ask hn__, __show hn__, and __other__ posts.

In [13]:
def avg_comments(dataset):
    """Return the mean of the num_comments column (index 4), rounded to 2 places."""
    total = 0

    for entry in dataset:
        total += int(entry[4])

    # Divide by the row count; an empty dataset would raise ZeroDivisionError.
    return round(total / len(dataset), 2)

In [14]:
avg_ask_comments = avg_comments(ask_posts)
print('avg no. of ask comments:', avg_ask_comments)

avg_show_comments = avg_comments(show_posts)
print('avg no. of show comments:', avg_show_comments)

avg_other_comments = avg_comments(other_posts)
print('avg no. of other comments:', avg_other_comments)



avg no. of ask comments: 14.04
avg no. of show comments: 10.32
avg no. of other comments: 26.87


The averages show that:

- ask posts receive more comments on average than show posts.
- posts other than ask or show receive the most comments on average.
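
As a cross-check on the averaging step, the same figure can be obtained with the standard library's `statistics.mean`. The sketch below uses three made-up rows in the dataset's column layout (they are not real posts), with the comment count stored as a string at index 4.

```python
from statistics import mean

# Hypothetical rows: id, title, url, num_points, num_comments, author, created_at.
sample_rows = [
    ['1', 'Ask HN: example one', '', '5', '3', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: example two', '', '2', '4', 'user_b', '1/1/2016 10:30'],
    ['3', 'Ask HN: example three', '', '7', '9', 'user_c', '1/2/2016 15:45'],
]

# Convert the comment counts to integers before averaging.
sample_avg = round(mean(int(row[4]) for row in sample_rows), 2)
print(sample_avg)  # 5.33
```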

### Time-wise analysis of posts & comments
Since ask posts receive more comments on average than show posts, we'll focus solely on them from here on. Next, we'll see which hours of the day are likely to yield the most posts and comments.


In [15]:
from datetime import datetime

result_list = []

for post in ask_posts:
    created_at = post[6]
    comments = int(post[4])
    temp_list = [created_at, comments]
    result_list.append(temp_list)


counts_by_hour, comments_by_hour = {}, {}

for entry in result_list:
    dt = entry[0]
    comments = entry[1]
    dt_object = datetime.strptime(dt, "%m/%d/%Y %H:%M")
    hour = dt_object.strftime("%H")
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    

In [16]:
print(counts_by_hour)
print(comments_by_hour)

{'11': 58, '12': 73, '10': 59, '13': 85, '08': 48, '21': 109, '04': 47, '22': 71, '20': 80, '09': 45, '05': 46, '07': 34, '19': 110, '00': 55, '02': 58, '06': 44, '15': 116, '23': 68, '14': 107, '03': 54, '01': 60, '16': 108, '17': 100, '18': 109}
{'11': 641, '12': 687, '10': 793, '13': 1253, '08': 492, '21': 1745, '04': 337, '22': 479, '20': 1722, '09': 251, '05': 464, '07': 267, '19': 1188, '00': 447, '02': 1381, '06': 397, '15': 4477, '23': 543, '14': 1416, '03': 421, '01': 683, '16': 1814, '17': 1146, '18': 1439}


In [27]:
avg_by_hour = []

# Both dictionaries share the same keys, so a single lookup per hour is enough.
for hour in counts_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])

print(avg_by_hour)

[['11', 11.051724137931034], ['12', 9.41095890410959], ['10', 13.440677966101696], ['13', 14.741176470588234], ['08', 10.25], ['21', 16.009174311926607], ['04', 7.170212765957447], ['22', 6.746478873239437], ['20', 21.525], ['09', 5.5777777777777775], ['05', 10.08695652173913], ['07', 7.852941176470588], ['19', 10.8], ['00', 8.127272727272727], ['02', 23.810344827586206], ['06', 9.022727272727273], ['15', 38.5948275862069], ['23', 7.985294117647059], ['14', 13.233644859813085], ['03', 7.796296296296297], ['01', 11.383333333333333], ['16', 16.796296296296298], ['17', 11.46], ['18', 13.20183486238532]]
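
The counting and averaging steps can also be done with a single dictionary. The following is a minimal sketch, assuming the same timestamp format as the dataset; the three `[created_at, comments]` pairs and the name `avg_by_hour_sketch` are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical [created_at, comments] pairs in the dataset's timestamp format.
sample = [['8/4/2016 11:52', 6], ['1/26/2016 11:05', 2], ['6/23/2016 22:20', 10]]

totals = defaultdict(lambda: [0, 0])  # hour -> [post count, comment count]
for created_at, comments in sample:
    hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    totals[hour][0] += 1
    totals[hour][1] += comments

avg_by_hour_sketch = {hour: n_comments / n_posts
                      for hour, (n_posts, n_comments) in totals.items()}
print(avg_by_hour_sketch)  # {'11': 4.0, '22': 10.0}
```

Keeping the count and the total side by side means the two values for an hour cannot drift out of step.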


We now have the hour-wise average number of comments per post for the ask posts in our __hn__ dataset.

To make this data more readable, we'll sort the elements of this list in descending order of their average comment values.


In [55]:
print(avg_by_hour[:4])

[['11', 11.051724137931034], ['12', 9.41095890410959], ['10', 13.440677966101696], ['13', 14.741176470588234]]
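
One possible way to order such pairs by their average, without rearranging each pair first, is to pass a `key` function to `sorted`. The sketch below applies it to the four entries printed above; the names `first_four` and `by_avg` are introduced here only for illustration.

```python
# The four [hour, average] pairs shown in the output above.
first_four = [['11', 11.051724137931034], ['12', 9.41095890410959],
              ['10', 13.440677966101696], ['13', 14.741176470588234]]

# Sort on the second element (the average), largest first.
by_avg = sorted(first_four, key=lambda pair: pair[1], reverse=True)
hours_in_order = [hour for hour, _ in by_avg]
print(hours_in_order)  # ['13', '10', '11', '12']
```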


In [63]:
swap_avg_by_hour = []

# Put the average first so that sorted() orders the pairs by it.
for hour, avg in avg_by_hour:
    swap_avg_by_hour.append([avg, hour])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

def top_5(rows):
    """Print the first five [average, hour] pairs of rows."""
    print('Top 5 Hours for Ask Post Comments: ')
    for avg, hour in rows[:5]:
        dt_object = datetime.strptime(hour, "%H")
        dt_string = dt_object.strftime("%H:%M")
        print("{}: {:.2f} average comments per post.".format(dt_string, avg))
    

Going by the sorted list, the hours with the highest average number of comments per post are 15:00, 02:00, 20:00, and so on.
But we're not sure which time zone these figures were recorded in. To find out, we consulted the dataset's documentation, available on Kaggle (https://www.kaggle.com/hacker-news/hacker-news-posts), which states that the time zone is US Eastern.

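But since our point of reference for this analysis is Pakistan, we'll convert these values to Pakistan Standard Time (PKT, UTC+5).

__Pakistan Standard Time = Eastern Time + 9 Hours__

This offset holds while US Eastern is on daylight saving time (UTC-4); during Eastern standard time (UTC-5) the difference is 10 hours.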
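
The +9 rule amounts to adding nine to the hour and wrapping around at midnight. Here is a minimal sketch of that arithmetic, applied to the five Eastern hours with the highest averages; the helper name `shift_hour` is hypothetical, and the fixed nine-hour offset is the assumption stated in the rule.

```python
def shift_hour(hour_string, offset_hours=9):
    """Shift a zero-padded hour string by offset_hours, wrapping past midnight."""
    return "{:02d}".format((int(hour_string) + offset_hours) % 24)


# Top five Eastern Time hours by average comments per ask post.
eastern_hours = ['15', '02', '20', '16', '21']
pakistan_hours = [shift_hour(hour) for hour in eastern_hours]
print(pakistan_hours)  # ['00', '11', '05', '01', '06']
```

The modulo keeps the result within 00 to 23, so 15:00 Eastern becomes 00:00 of the next day in Pakistan.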

In [65]:
sorted_swap[:5]

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21']]

In [66]:
from datetime import datetime, timedelta

# Shift each hour by +9 in place; run this cell only once, or the offset is applied again.
for row in sorted_swap:
    dt_object = datetime.strptime(row[1], "%H")
    corrected_time = dt_object + timedelta(hours=9)
    row[1] = corrected_time.strftime("%H")

In [67]:
print(sorted_swap[:5])

[[38.5948275862069, '00'], [23.810344827586206, '11'], [21.525, '05'], [16.796296296296298, '01'], [16.009174311926607, '06']]


### Top hours with the highest average comments in Pakistan time
We can now call our __top_5__ function to list the hours (in Pakistan Standard Time) in which newly created posts receive the most comments on average.

In [70]:
top_5(sorted_swap)

Top 5 Hours for Ask Post Comments: 
00:00: 38.59 average comments per post.
11:00: 23.81 average comments per post.
05:00: 21.52 average comments per post.
01:00: 16.80 average comments per post.
06:00: 16.01 average comments per post.


The findings from this basic analysis suggest that someone in the Pakistan Standard Time zone who wants to maximize the visibility of their posts and the engagement they receive should:

1. start them with the _ask_ rather than the _show_ keyword.

2. post them just after midnight, at 11 in the morning, 5 in the morning, 1 in the morning, or 6 in the morning, in that order.