# Hacker News Analysis By Popularity

In this project, we'll compare two different types of posts from [Hacker News](https://news.ycombinator.com/), a popular site where technology-related stories (or 'posts') are voted and commented upon. The two types of posts we'll explore begin with either `Ask HN` or `Show HN`.

Users submit `Ask HN` posts to ask the Hacker News community a specific question, such as "What is the best online course you've ever taken?" Likewise, users submit `Show HN` posts to show the Hacker News community a project, a product, or simply something interesting.

We'll specifically compare these two types of posts to determine the following:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

**The following dataset was reduced from 300,000 to 20,000 rows by removing all submissions that did not receive any comments.**

In [2]:
import csv

# Read every row of the CSV into a list of lists; `with` closes the file afterwards.
with open('hacker_news.csv', encoding='utf-8') as file:
    hn = list(csv.reader(file))

print(hn[:2])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']]


As in most data analysis projects, we'll first separate the header row from the rest of the data, which makes it easier to iterate through the rows.

In [3]:
hn_header = hn[0]
hn = hn[1:]

print(hn_header)
print('\n--------------------------------------------------------\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

--------------------------------------------------------

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', 

## Extracting Ask HN and Show HN Posts

Now that the header has been separated from the dataset, we can loop through the rows to find posts whose titles start with `Ask HN` or `Show HN`.

**※ Posts in this dataset can be classified by title: ask posts begin with `Ask HN`, show posts begin with `Show HN`, and the remaining posts have no such prefix. ※**

This can be done with two built-in Python string methods, `lower()` and `startswith()`. Lowercasing the title first makes the prefix check case-insensitive.

In [6]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Ask_Posts Length: ', len(ask_posts))
print('Show_Posts Length: ', len(show_posts))
print('Other_Posts length: ', len(other_posts))


Ask_Posts Length:  1744
Show_Posts Length:  1162
Other_Posts length:  17194

 -------------------------------- 

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
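
The classification step can also be wrapped in a small helper, which makes the case-insensitive prefix check easy to try on individual titles. This is only a minimal sketch: `classify_title` is a hypothetical name introduced for illustration, and the second sample title is made up rather than taken from the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()  # normalize case so 'ASK HN' and 'Ask HN' both match
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Sample titles; the second one is invented to show the case-insensitive match.
samples = [
    'Ask HN: How to improve my personal website?',
    'SHOW HN: A toy project',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in samples]
print(labels)  # ['ask', 'show', 'other']
```

Returning a label instead of appending to a list keeps the rule separate from the bookkeeping, so the same function could be reused wherever a title needs to be categorized.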


Now that the posts have been sorted into `Ask HN`, `Show HN`, and `Other`, we can loop through the first two categories and total the comments in each to determine which type of post receives more comments on average.

In [10]:
total_ask_comments = 0
total_show_comments = 0
avg_ask_comment = 0
avg_show_comment = 0

for row in ask_posts:
    total_ask_comments += int(row[4])
    
for row in show_posts:
    total_show_comments += int(row[4])

avg_ask_comment = total_ask_comments / len(ask_posts)
avg_show_comment = total_show_comments / len(show_posts)

print('In total {tab} has total of {tot} comments and avg of {avg:.2f} comments'.format(tab = 'Ask HN', tot = total_ask_comments, avg = avg_ask_comment))
print('In total {tab} has total of {tot} comments and avg of {avg:.2f} comments'.format(tab = 'Show HN', tot = total_show_comments, avg = avg_show_comment))

In total Ask HN has total of 24483 comments and avg of 14.04 comments
In total Show HN has total of 11988 comments and avg of 10.32 comments
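
Because the same total-then-divide logic is applied to both lists, it could be factored into one function. The sketch below assumes the comment count sits at index 4, as in the header shown earlier. `average_comments` and the sample rows are hypothetical and exist only to exercise the helper.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments across rows; returns 0.0 for an empty list."""
    if not posts:
        return 0.0  # avoid dividing by zero when a category has no posts
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the dataset's column order, used only for demonstration.
sample_posts = [
    ['1', 'Ask HN: example one', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: example two', '', '5', '10', 'user_b', '8/17/2016 10:00'],
]
sample_avg = average_comments(sample_posts)
print('{:.2f}'.format(sample_avg))  # 8.00
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages printed above.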


As the results above show, `Ask HN` posts receive about 4 more comments per post on average than `Show HN` posts (14.04 vs. 10.32, roughly 36% more). As such, we'll focus on the more popular `Ask HN` posts for the remaining analysis. *(Further analysis of the less popular `Show HN` posts will follow.)*
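
The size of the gap can be checked directly from the totals and post counts printed above. This is a minimal sketch of that arithmetic; the variable names are introduced here for illustration only.

```python
# Totals and post counts as reported in the outputs above.
ask_total, ask_count = 24483, 1744
show_total, show_count = 11988, 1162

ask_avg = ask_total / ask_count
show_avg = show_total / show_count

absolute_gap = ask_avg - show_avg             # extra comments per post
relative_gap = absolute_gap / show_avg * 100  # percent more than Show HN

print('Absolute gap: {:.2f} comments per post'.format(absolute_gap))
print('Relative gap: {:.0f}%'.format(relative_gap))
```

Run as written, this reports a gap of about 3.72 comments per post, or roughly 36% more comments for `Ask HN` than for `Show HN`.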