# Hacker News - Post Analysis

In this project, we'll compare two different types of posts from [Hacker News](https://news.ycombinator.com/), a popular site where technology-related stories (or 'posts') are voted on and commented on. The two types of posts we'll explore begin with either **Ask HN** or **Show HN**.

Users submit **Ask HN posts** to ask the Hacker News community a specific question, such as "What is the best online course you've ever taken?" Likewise, users submit **Show HN posts** to show the Hacker News community a project, product, or just generally something interesting.

We'll specifically compare these two types of posts to determine the following:

* Do _Ask HN_ or _Show HN_ posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

It should be noted that the data set we're working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.


## The data set used in this project
Source: https://www.kaggle.com/hacker-news/hacker-news-posts

This data set contains Hacker News posts from September 2015 to September 2016.

It includes the following columns:
- **id**: the identifier of the post (the first column in the file)
- **title**: title of the post (self-explanatory)
- **url**: the url of the item being linked to
- **num_points**: the number of upvotes the post received
- **num_comments**: the number of comments the post received
- **author**: the name of the account that made the post
- **created_at**: the date and time the post was made (the time zone is Eastern Time in the US)
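
As a small illustration of how these columns can be read by name rather than by position, the sketch below uses `csv.DictReader` on an in-memory sample. The sample text reuses the header and one row from the file preview in this notebook; reading from `io.StringIO` instead of `hacker_news.csv` is an assumption made only to keep the example self-contained.

```python
import csv
import io

# Header plus one post, copied from the file preview in this notebook.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)

# DictReader takes the first line as field names, so each row becomes a dict.
rows = list(csv.DictReader(io.StringIO(sample_csv)))
first = rows[0]

print(first["title"])
# Every value is read as a string, so numeric columns need an explicit int().
print(int(first["num_comments"]))
print(first["created_at"])
```

The rest of this notebook uses plain lists and positional indexes (for example, index `1` for the title), which works the same way as long as the column order is known.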



## 1. Introduction
First, we'll read in the data and remove the headers.

In [1]:
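from csv import reader

# Read the whole CSV into a list of rows; each row is a list of strings.
with open("hacker_news.csv", encoding="utf8") as open_file:
    hn = list(reader(open_file))

# Preview the header row and the first four posts.
for i in hn[:5]:
    print(i)
    print("\n")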
    

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']




In [2]:
# Keep the header row separately, then drop it so hn holds only posts.
hn_headers = hn[0]
hn = hn[1:]

print(hn_headers)
for i in hn[:5]:
    print("\n")
    print(i)
    

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


## 2. Extracting Ask HN and Show HN Posts
Since we're only concerned with post titles beginning with ***Ask HN*** or ***Show HN***, we'll create new lists of lists containing just the data for those titles.

Next, let's use the ***str.startswith()*** and ***str.lower()*** methods to separate posts beginning with Ask HN and Show HN (and case variations) into two different lists.

**Instructions:**
1. Create three empty lists called **ask_posts**, **show_posts**, and **other_posts**.
2. Loop through each row in **hn**.
    - Assign the **title** in each row to a variable named **title**.
    - Because the **title** column is the second column, you'll need to get the element at index **1** in each row.
3. Implement the following steps:
    - If the lowercase version of **title** starts with **ask hn**, append the row to **ask_posts**.
    - Else if the lowercase version of **title** starts with **show hn**, append the row to **show_posts**.
    - Else append to **other_posts**.
4. Check the number of posts in **ask_posts**, **show_posts**, and **other_posts**.
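
The prefix check in the steps above can be sketched as a small function. The function name `classify_title` and the sample titles are hypothetical, made up for illustration; only `str.lower()` and `str.startswith()` come from the instructions.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # lowercase once so case variations match
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles, not taken from the data set.
sample_titles = [
    "Ask HN: What is the best online course you've ever taken?",
    "SHOW HN: My weekend project",
    "algorithmic music",
    "Why I ask HN for advice",
]

labels = [classify_title(t) for t in sample_titles]
print(labels)
```

The last sample title contains "ask HN" in the middle of the text, so it is classified as `other`: `startswith` only matches at the beginning of the string.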

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    # Lowercase the title once so the prefix check ignores case variations.
    title = post[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)

print("Number of 'Ask HN' posts:", len(ask_posts))
print("Number of 'Show HN' posts:", len(show_posts))
print("Number of 'Other posts':", len(other_posts))
        

Number of 'Ask HN' posts: 9139
Number of 'Show HN' posts: 10158
Number of 'Other posts': 273822


## 3. Calculating the Average Number of Comments for Ask HN and Show HN Posts


Now that we have separated ask posts and show posts into different lists, we'll calculate the average number of comments each type of post receives.

**SUGGESTION: also add points to the analysis. A post type may receive fewer comments but be more "popular".**

In [4]:
# Index 4 holds num_comments and index 3 holds num_points (both as strings).
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
print("The 'ASK Posts' have, on average, " + str(round(avg_ask_comments, 2)) +
      " comments")

total_ask_points = 0
for row in ask_posts:
    total_ask_points += int(row[3])

avg_ask_points = total_ask_points / len(ask_posts)
print("The 'ASK Posts' have, on average, " + str(round(avg_ask_points, 2)) +
      " points")

print(5*"*")

total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])

avg_show_comments = total_show_comments / len(show_posts)
print("The 'SHOW Posts' have, on average, " + str(round(avg_show_comments, 2)) +
      " comments")

total_show_points = 0
for row in show_posts:
    total_show_points += int(row[3])

# Divide by the number of show posts, not ask posts.
avg_show_points = total_show_points / len(show_posts)
print("The 'SHOW Posts' have, on average, " + str(round(avg_show_points, 2)) +
      " points")


The 'ASK Posts' have, on average, 10.39 comments
The 'ASK Posts' have, on average, 11.31 points
*****
The 'SHOW Posts' have, on average, 4.89 comments
The 'SHOW Posts' have, on average, 14.84 points


Ask HN posts receive more comments on average than Show HN posts (10.39 vs. 4.89) but fewer points (11.31 vs. 14.84). The gap in points is comparatively small, and the goal of the project is to find out which type of post receives more comments, so we will continue to analyze only the **Ask HN posts**.
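
Each average here divides a column total by the number of rows in the same list. One way to keep the total and the divisor tied together is a small helper; the name `average_column` and the sample rows below are hypothetical and serve only as a sketch.

```python
def average_column(rows, index):
    """Mean of the integer values stored at position `index` in each row."""
    if not rows:
        raise ValueError("cannot average an empty list of rows")
    # The same list supplies both the total and the count.
    return sum(int(row[index]) for row in rows) / len(rows)


# Hypothetical rows in the notebook's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ["1", "Ask HN: a", "", "10", "4", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: b", "", "20", "6", "user_b", "9/26/2016 15:10"],
]

avg_comments = average_column(sample_rows, 4)
avg_points = average_column(sample_rows, 3)
print(round(avg_comments, 2), round(avg_points, 2))
```

With the notebook's lists, the calls would look like `average_column(ask_posts, 4)` or `average_column(show_posts, 3)`.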

## 4. Finding the Number of Ask Posts and Comments by Hour Created
Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
    (We'll use the _datetime module_ to work with the data in the created_at column.)
2. Calculate the average number of comments ask posts receive by hour created.

**Instructions:**
1. Import the **datetime module** as **dt**.
2. Create an empty list and assign it to **results_list**. This will be a list of lists.
3. Iterate over **ask_posts** and append to **results_list** a list with two elements:
    - The first element is the value of the **created_at** column.
        - Because the **created_at** column is the seventh column in **ask_posts**, you'll need to get the element at index **6** in each row.
    - The second element is the number of comments of the post.
        - You'll also need to convert the value to an integer.
4. Create two empty dictionaries called **counts_by_hour** and **comments_by_hour**.
5. Loop through each row of **results_list**.
6. Extract the hour from the date, which is the first element of the row:
    - Use the **datetime.strptime()** method to parse the date and create a datetime object. Pass the string to parse as the first argument and a string that specifies the format as the second argument.
    - Use the **datetime.strftime()** method to select just the hour from the datetime object.
7. Update the two dictionaries:
    - **If the hour isn't** a key in counts_by_hour:
        - Create the key in counts_by_hour and set it equal to 1.
        - Create the key in comments_by_hour and set it equal to the comment number.
    - **If the hour is** already a key in counts_by_hour:
        - Increment the value in counts_by_hour by 1.
        - Increment the value in comments_by_hour by the comment number.
In [5]:
import datetime as dt

results_list = []

for post in ask_posts:
    time_post = post[6]
    n_comments = int(post[4])
    results_list.append([time_post,n_comments])

counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for post in results_list:
    date = post[0]
    comment = post[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1


# Sort the hours by total comments, highest first, and print them.
comments_sorted = sorted(comments_by_hour.items(), key=lambda x: x[1], reverse=True)
for hour, total in comments_sorted:
    print(hour + "h:00 -", total, "comments")

15h:00 - 18525 comments
13h:00 - 7245 comments
17h:00 - 5547 comments
14h:00 - 4972 comments
18h:00 - 4877 comments
21h:00 - 4500 comments
16h:00 - 4466 comments
20h:00 - 4462 comments
12h:00 - 4234 comments
19h:00 - 3954 comments
22h:00 - 3372 comments
10h:00 - 3013 comments
02h:00 - 2996 comments
11h:00 - 2797 comments
08h:00 - 2362 comments
04h:00 - 2360 comments
23h:00 - 2297 comments
00h:00 - 2277 comments
03h:00 - 2154 comments
01h:00 - 2089 comments
05h:00 - 1838 comments
06h:00 - 1587 comments
07h:00 - 1585 comments
09h:00 - 1477 comments


With the total number of comments and the number of posts for each hour, we can calculate the average number of comments per post for each hour:

In [6]:
avg_by_hours = []

# Average comments per post for each hour: total comments / number of posts.
for hour in comments_by_hour:
    avg_by_hours.append([hour, round(comments_by_hour[hour] / counts_by_hour[hour], 2)])

avg_by_hours.sort(key=lambda x: x[1], reverse=True)


class color:
    # ANSI escape codes used to print bold text.
    BOLD = '\033[1m'
    END = '\033[0m'

print(color.BOLD + 'TOP 5 HOURS' + color.END)
for hour in avg_by_hours[:5]:
    print(hour[0] + "h:00 - ", hour[1], "comments per post")

print(5*"*")
print(color.BOLD + 'BOTTOM 5 HOURS' + color.END)
for hour in avg_by_hours[-5:]:
    print(hour[0] + "h:00 - ", hour[1], "comments per post")

TOP 5 HOURS
15h:00 -  28.68 comments per post
13h:00 -  16.32 comments per post
12h:00 -  12.38 comments per post
02h:00 -  11.14 comments per post
10h:00 -  10.68 comments per post
*****
BOTTOM 5 HOURS
19h:00 -  7.16 comments per post
07h:00 -  7.01 comments per post
06h:00 -  6.78 comments per post
23h:00 -  6.7 comments per post
09h:00 -  6.65 comments per post


The hour that receives the most comments per post on average is 15:00, with an average of 28.68 comments per post. That is about 76% more than the second-highest hour, 13:00, which averages 16.32 comments per post.
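
The relative gap between the two highest hourly averages can be checked directly from the printed values. The variable names below are chosen for this sketch only.

```python
top_avg = 28.68     # 15:00, from the TOP 5 output
second_avg = 16.32  # 13:00, from the TOP 5 output

# Relative increase of the top hour over the second-highest hour, in percent.
relative_increase = (top_avg - second_avg) / second_avg * 100
print(round(relative_increase, 1))
```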

## 5. Conclusion
In this project, we analyzed ask posts and show posts to determine which type of post and which time of day receive the most comments on average. Based on our analysis, to maximize the number of comments a post receives, we'd recommend that the post be an ask post created between 15:00 and 16:00 (3:00 pm - 4:00 pm EST).

However, it should be noted that the data set we analyzed excluded posts without any comments. Given that, it's more accurate to say that, of the posts that received comments, ask posts received more comments on average, and ask posts created between 15:00 and 16:00 (3:00 pm - 4:00 pm EST) received the most comments on average.