# Hacker News posts analysis.

### Hacker News.

### Goals of the project.

1) Determine whether posts starting with "Ask HN" (posts asking the community a question) or "Show HN" (posts sharing a project, product, etc. with the community) receive more comments than other posts.

2) Determine whether posts published at a certain time of day receive more comments.

### Plan of work.

### Dataset description.

<table>
    <tr>
        <th>Index</th>
        <th>Name</th>
        <th>Description</th>
    </tr>
    <tr>
        <td>0</td>
        <td>id</td>
        <td>Unique identifier of the post on Hacker News</td>
    </tr>
    <tr>
        <td>1</td>
        <td>title</td>
        <td>Title of the post</td>
    </tr>
    <tr>
        <td>2</td>
        <td>url</td>
        <td>URL the post is linked to</td>
    </tr>
    <tr>
        <td>3</td>
        <td>num_points</td>
        <td>The number of points of the post</td>
    </tr>
    <tr>
        <td>4</td>
        <td>num_comments</td>
        <td>The number of comments under the post</td>
    </tr>
    <tr>
        <td>5</td>
        <td>author</td>
        <td>The username of the author of the post</td>
    </tr>
    <tr>
        <td>6</td>
        <td>created_at</td>
        <td>The date and time when the post was created</td>
    </tr>

</table>

#### Preparing the dataset.

In [2]:
# Import the dataset and convert it to a list of rows.
import csv

# The context manager closes the file once it has been read.
with open("hacker_news.csv") as f:
    hn = list(csv.reader(f))
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [3]:
#Removing the heading row.

headers = hn[0] # headers of the dataset
hn = hn[1:] # dataset for work

print("Headers of the dataset: ", headers)
print('\n')
print("Typical row of the dataset: ",hn[0])

Headers of the dataset:  ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


Typical row of the dataset:  ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
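
Because each row is a plain list, fields are accessed by position (for example, index 4 holds `num_comments`). As a minimal sketch, the header row can be zipped with a data row to see which value belongs to which column. The names `sample_headers`, `sample_row`, and `labeled` are illustrative; the values are copied from the output above.

```python
# Illustrative sketch: pair each header with the matching value of one row.
sample_headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

labeled = dict(zip(sample_headers, sample_row))

# Every value is still a string, so numeric fields need an explicit conversion.
num_comments = int(labeled['num_comments'])
print(labeled['title'], num_comments)
```

Note that `csv.reader` returns every field as a string, which is why the later cells call `int()` before doing arithmetic.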


#### Filtering "ask" and "show" posts.

In this part, we split the posts into three separate lists for later analysis: "ask" posts, "show" posts, and all other posts.

In [4]:
ask_posts = list()
show_posts = list()
other_posts = list()

for post in hn:
    title = post[1].lower() # to lowercase
    if (title.startswith("ask hn")):
        ask_posts.append(post)
    elif (title.startswith("show hn")):
        show_posts.append(post)
    else:
        other_posts.append(post)
    

In [5]:
print(f"The number of 'ask' posts: {len(ask_posts)}")
print(f"The number of 'show' posts: {len(show_posts)}")
print(f"The number of other posts: {len(other_posts)}")

The number of 'ask' posts: 1744
The number of 'show' posts: 1162
The number of other posts: 17194
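
The filtering rule can also be written as a small helper, which makes the case-insensitive prefix check easy to try on individual titles. This is only a sketch: `classify_title` and the sample titles are hypothetical and are not taken from the dataset.

```python
def classify_title(title: str) -> str:
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Made-up titles for illustration only.
examples = [
    "Ask HN: How do you learn?",
    "SHOW HN: My side project",
    "I want to ask HN something",  # prefix does not match, so this is 'other'
]
labels = [classify_title(t) for t in examples]
print(labels)
```

Since `startswith` only inspects the beginning of the string, a title that mentions "ask HN" somewhere in the middle still lands in the "other" group.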


#### Finding the average number of comments in posts of each section.

In [6]:
# Calculate the average number of comments per post
# for the given subset of the dataset.
# Takes: a list of rows (the subset).
# Returns: the average number of comments as a float (0.0 for an empty subset).

def average_number_of_comments(data: list) -> float:
    total_posts = len(data)
    if total_posts == 0:
        return 0.0  # avoid division by zero on an empty subset

    total_comments = 0
    for post in data:
        total_comments += int(post[4])  # index 4 is num_comments

    return total_comments / total_posts

In [8]:
average_comments_ask = average_number_of_comments(ask_posts)
print(f"The average number of comments under the 'ask' post is: {round(average_comments_ask, 2)}")

The average number of comments under the 'ask' post is: 14.04


In [9]:
average_comments_show = average_number_of_comments(show_posts)
print(f"The average number of comments under the 'show' post is: {round(average_comments_show, 2)}")

The average number of comments under the 'show' post is: 10.32


In [10]:
average_comments_other = average_number_of_comments(other_posts)
print(f"The average number of comments under the 'regular' post is: {round(average_comments_other, 2)}")

The average number of comments under the 'regular' post is: 26.87
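
As a cross-check, the same kind of average can be computed with the standard library's `statistics.mean`. The sketch below runs on made-up rows: `toy_posts` is hypothetical, and only index 4 (`num_comments`) matters.

```python
from statistics import mean

# Hypothetical rows in the dataset's column order; only num_comments (index 4) is used.
toy_posts = [
    ['1', 'Ask HN: A', '', '10', '4', 'alice', '1/1/2016 10:00'],
    ['2', 'Ask HN: B', '', '3', '10', 'bob', '1/2/2016 11:30'],
    ['3', 'Ask HN: C', '', '7', '1', 'carol', '1/3/2016 12:45'],
]

# Convert each comment count from string to int before averaging.
average = mean(int(post[4]) for post in toy_posts)
print(round(average, 2))
```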


#### Conclusion

* The average number of comments under posts that are neither 'ask' nor 'show' posts (26.87) is considerably higher than under either 'ask' or 'show' posts.

* The average number of comments under 'ask' posts (14.04) is somewhat higher than the average under 'show' posts (10.32).

#### Finding the number of comments by the hour a post was created

In [16]:
# Create a list of [created_at, num_comments] pairs, one per post
import datetime as dt

time_comments_list = list() 

for row in hn:
    time_comments_list.append([row[6], row[4]])

print("There is an entry for each post of the dataset: ", len(time_comments_list) == len(hn))
print("Typical entry is: ", time_comments_list[2])

There is an entry for each post of the dataset:  True
Typical entry is:  ['6/23/2016 22:20', '1']
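
To group posts by hour, the `created_at` string first has to be parsed into a `datetime` object. As a quick sketch, the typical entry shown above can be parsed with `datetime.strptime`. The format string assumes that every timestamp follows the month/day/year hour:minute layout seen in the sample rows.

```python
import datetime as dt

# Entry copied from the output above: [created_at, num_comments].
entry = ['6/23/2016 22:20', '1']

# %m/%d/%Y %H:%M matches month/day/four-digit year and a 24-hour time.
parsed = dt.datetime.strptime(entry[0], "%m/%d/%Y %H:%M")
print(parsed.hour)  # 22
```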


Create a map from each hour to the number of posts created during that hour.

Create a map from each hour to the total number of comments under the posts created during that hour.

In [25]:
counts_by_hour = dict() # hour -> number of posts created during that hour
comments_by_hour = dict() # hour -> total number of comments under those posts

for entry in time_comments_list:
    date_parsed = dt.datetime.strptime(entry[0], "%m/%d/%Y %H:%M")
    hour = date_parsed.hour
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = int(entry[1])
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += int(entry[1])
        
# Correctness checks
# Both dictionaries have the same number of entries:
assert len(counts_by_hour) == len(comments_by_hour)

# Neither dictionary has more than 24 entries (one per hour of the day):
assert len(counts_by_hour) <= 24
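
With both dictionaries in place, the average number of comments per post for a given hour is the ratio of the two values stored under the same key. The following is a sketch on small made-up dictionaries: `toy_counts` and `toy_comments` are hypothetical stand-ins for `counts_by_hour` and `comments_by_hour`, and the numbers are not results from the dataset.

```python
# Hypothetical stand-ins: hour -> number of posts, hour -> total comments.
toy_counts = {9: 2, 15: 4, 22: 1}
toy_comments = {9: 10, 15: 60, 22: 3}

# Average comments per post for each hour.
avg_by_hour = {hour: toy_comments[hour] / toy_counts[hour] for hour in toy_counts}

# Sort hours from the highest to the lowest average.
ranked = sorted(avg_by_hour.items(), key=lambda item: item[1], reverse=True)
for hour, avg in ranked:
    print(f"{hour:02d}:00 -> {avg:.2f} comments per post")
```

Every hour that appears in one dictionary also appears in the other, because both are filled in the same loop, so the lookup `toy_comments[hour]` cannot fail for a key taken from `toy_counts`.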
