
#  Exploring Hacker News Posts 


In this project, we analyse a data set from [Hacker News](https://news.ycombinator.com/), a popular website where technology-related posts are published.

Our analysis focuses on two major types of posts: Ask HN and Show HN.

* Ask HN posts: Users submit these to ask the Hacker News community technology-related questions. **Any post whose title starts with Ask HN is an Ask HN post.** Example: Ask HN: Which programming language is advisable for beginners?

* Show HN posts: Users submit these to showcase a project they have worked on. **These posts start with Show HN.**

#  Goal of this Analysis

## We seek to answer the following questions
- Which type of post is more frequent on Hacker News: Ask HN or Show HN?
- Which type of post receives more points?
- Do Ask HN posts receive more comments than Show HN posts?
- At what time of day do these posts receive the most comments on average?



The dataset for this analysis was obtained from [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts)



# Getting Started

### Points to note:

The original data obtained from [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts) contained 300,000 rows.

For this analysis, the data has been cleaned and reduced to approximately 20,000 rows. Rows for posts with no comments were removed, and the rows we keep were randomly sampled from those that remained.

### Information about the columns

The data set contains seven columns as listed below.

1. **id**: The unique identifier from Hacker News for the post
2. **title**: The title of the post
3. **url**: The URL that the posts links to, if the post has a URL
4. **num_points**: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
5. **num_comments**: The number of comments that were made on the post
6. **author**: The username of the person who submitted the post
7. **created_at**: The date and time at which the post was submitted
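Because each row will be handled as a plain list, every column is addressed by its position. The sketch below names those positions with constants so that an index such as 3 or 4 is easier to read. The constant names are an assumption made for illustration and are not used elsewhere in this notebook; the sample row follows the seven-column layout listed above.

```python
# Hypothetical constants naming the seven column positions (0-based).
ID, TITLE, URL, NUM_POINTS, NUM_COMMENTS, AUTHOR, CREATED_AT = range(7)

# One row in the same layout as the data set.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Every value is read from the CSV as a string, so numeric columns need int().
title = sample_row[TITLE]
points = int(sample_row[NUM_POINTS])
comments = int(sample_row[NUM_COMMENTS])

print(title, points, comments)  # Interactive Dynamic Video 386 52
```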


Let us start by reading the data, storing it into a list of lists and removing the header from it.

In [11]:
# Import csv and read the data set
import csv

# Read every row into a list of lists; the with-block closes the file afterwards
with open("hacker_news.csv", encoding="utf8") as open_file:
    hn = list(csv.reader(open_file))
header = hn[0]

# Remove the header row
hn = hn[1:]

print(header)


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


Next, we print the first five rows of the data to inspect its structure.

In [12]:
# First five rows
for row in hn[:5]:
    print(row)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


# Separating Ask HN, Show HN, and other posts

We loop through the list of lists hn and check whether each title starts with Ask HN or Show HN, ignoring case. Matching rows are appended to one of two distinct lists, one per post type, and every other row goes into a third list.

In [13]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    if row[1].lower().startswith("ask hn"):
        ask_posts.append(row)
        
    elif row[1].lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(f"There are {len(ask_posts)} Ask HN posts")
print(f"There are {len(show_posts)} Show HN posts")
print(f"There are {len(other_posts)} Other posts")

There are 1744 Ask HN posts
There are 1162 Show HN posts
There are 17194 Other posts


From the result above, there are about 1.5 times as many Ask HN posts as Show HN posts (1744 versus 1162). The difference is not surprising to me; I too spend more time asking questions than showcasing my projects.
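The prefix check in the loop above can also be pulled into a small helper, which makes the rule easy to try on individual titles. This is only a sketch: the function name classify_post and the sample titles are assumptions for illustration. The last title shows why the check uses startswith rather than a substring test.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Hypothetical titles used only to exercise the helper.
sample_titles = [
    "Ask HN: Which programming language is advisable for beginners ?",
    "Show HN: A hypothetical side project",
    "Interactive Dynamic Video",
    "Why I ask HN for advice",  # contains "ask HN" but does not start with it
]

labels = [classify_post(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other', 'other']
```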

## Determining which post type receives more points on average

**Number of Points** = the total number of upvotes minus the total number of downvotes

We will start by calculating the total number of points, then divide it by the total number of comments for each post type. This gives the number of points per comment.

In [16]:
def total_points(data):
    s = 0
    for row in data:
        s += int(row[3])
    return s

def total_comments(data):
    s = 0
    for row in data:
        s += int(row[4])
    return s

total_ask_points = total_points(ask_posts)
total_show_points = total_points(show_posts)
total_ask_comments = total_comments(ask_posts)
total_show_comments = total_comments(show_posts)

# Each total is divided by the number of comments, so these are points per comment
points_per_ask_comment = total_ask_points / total_ask_comments
points_per_show_comment = total_show_points / total_show_comments

print("Ask posts receive {0:.2f} points per comment on average".format(points_per_ask_comment))

print("Show posts receive {0:.2f} points per comment on average".format(points_per_show_comment))

Ask posts receive 1.07 points per comment on average
Show posts receive 2.67 points per comment on average


Relative to the comments they attract, Show posts receive about two and a half times as many points as Ask posts (2.67 versus 1.07 points per comment). This could be because users appreciate Show posts more and give them more upvotes.
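The cell above divides total points by total comments. If points per post is the quantity of interest, the total would instead be divided by the number of posts. The sketch below shows that calculation under stated assumptions: the helper name average_column and the sample rows are hypothetical, and the numbers are not results from the data set.

```python
def average_column(rows, column_index):
    """Mean of an integer column across rows; returns 0.0 for an empty list."""
    if not rows:
        return 0.0
    return sum(int(row[column_index]) for row in rows) / len(rows)

# Hypothetical rows in the same seven-column layout as the data set.
sample_posts = [
    ['1', 'Ask HN: Example A', '', '10', '4', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: Example B', '', '20', '6', 'user_b', '1/2/2016 11:00'],
]

avg_points = average_column(sample_posts, 3)    # num_points is column 3
avg_comments = average_column(sample_posts, 4)  # num_comments is column 4

print(avg_points, avg_comments)  # 15.0 5.0
```

Applied to the real lists, this would be called as average_column(ask_posts, 3) and average_column(show_posts, 3).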

# Calculating the average number of comments per post for Ask HN and Show HN

In [17]:
average_ask_comments = total_ask_comments/len(ask_posts)

average_show_comments = total_show_comments/len(show_posts)

print("There are approximately {} comments per ask post".format(average_ask_comments))
print("There are approximately {} comments per show post".format(average_show_comments))

There are approximately 14.038417431192661 comments per ask post
There are approximately 10.31669535283993 comments per show post


On average, each Ask post receives about 14 comments, while each Show post receives about 10 comments.

The rest of our analysis will focus on Ask posts, since they receive more comments and therefore draw more attention.

# Calculating the number of Ask posts and comments per hour

From our previous analysis, Ask posts receive more comments than Show posts on average, so we will focus on Ask posts.

Next, we'll determine whether we can maximize the number of comments an Ask post receives by creating it at a certain time. First, we'll find the number of Ask posts created during each hour of the day, along with the number of comments those posts received. Then, we'll calculate the average number of comments received by Ask posts created at each hour of the day.

In [18]:
# Import datetime module
import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])
    
comments_by_hour = {}
counts_by_hour = {}

for row in result_list:
    # Extract the time portion (HH:MM) of the created_at string
    time = row[0].split(' ')[1]
    # Convert the time into a datetime object for easy manipulation and formatting
    dateObj = dt.datetime.strptime(time, "%H:%M")
    hour = dateObj.strftime("%H")
    
    # Counting the number of comments by hour
    if hour not in comments_by_hour:
        comments_by_hour[hour] = int(row[1])
    else:
        comments_by_hour[hour] += int(row[1])
    
    # Counting the number of posts by hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
    else:
        counts_by_hour[hour] += 1
    
print(comments_by_hour)
print(counts_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
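As an alternative sketch of the same grouping, the full created_at string can be parsed in one step, and collections.defaultdict removes the need for the "not in" checks. The format string assumes the month/day/year layout seen in the rows printed earlier. The first two sample pairs reuse timestamps and comment counts from those rows; the third is hypothetical and is added so that one hour holds two posts.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs; the last entry is hypothetical.
sample = [
    ("8/4/2016 11:52", 52),
    ("9/30/2015 4:12", 2),
    ("1/5/2016 11:05", 8),
]

comments_by_hour = defaultdict(int)  # missing keys start at 0
counts_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    # Parse the whole timestamp, then keep the zero-padded hour as the key.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    comments_by_hour[hour] += num_comments
    counts_by_hour[hour] += 1

print(dict(comments_by_hour))  # {'11': 60, '04': 2}
print(dict(counts_by_hour))    # {'11': 2, '04': 1}
```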


## Calculating the average number of comments per post using the two dictionaries above

In [19]:
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


Next, we sort the list of lists (avg_by_hour) in descending order and take the five hours of the day with the highest average number of comments per post.

In [None]:
avg_by_hour.sort(key=lambda val: val[1], reverse=True)
five_hours_with_highest_comments = avg_by_hour[:5]

for hrs, avg in five_hours_with_highest_comments:
    hour = dt.datetime.strptime(hrs, "%H").strftime("%H:%M")
    print("{0}: {1:.2f} average comments per post".format(hour, avg))
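To illustrate the sorting step without re-running the whole notebook, the sketch below applies sorted() to a subset of the [hour, average] pairs printed earlier, rounded to four decimals. The subset includes the five largest averages from that printed list, so the ordering of the top five matches what the full list would give. The variable names are assumptions for illustration; unlike list.sort(), sorted() leaves the input list unchanged.

```python
import datetime as dt

# Subset of the averages printed earlier, rounded to four decimals.
avg_by_hour_sample = [
    ["09", 5.5778], ["13", 14.7412], ["10", 13.4407], ["16", 16.7963],
    ["15", 38.5948], ["21", 16.0092], ["20", 21.525], ["02", 23.8103],
]

# Sort by the average (second element), largest first, and keep five entries.
top_five = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)[:5]

for hrs, avg in top_five:
    hour = dt.datetime.strptime(hrs, "%H").strftime("%H:%M")
    print("{0}: {1:.2f} average comments per post".format(hour, avg))

top_hours = [hrs for hrs, _ in top_five]
print(top_hours)  # ['15', '02', '20', '16', '21']
```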