# Exploring Hacker News Posts
In this project, we'll work with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

The data set is available [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has almost 300,000 rows. We only need the "full" rows, that is, those that have comments, links, creation dates, etc.

## Cleaning data
The first step in working with the data is cleaning it.
Below are descriptions of the columns:

- **id**: The unique identifier from Hacker News for the post
- **title**: The title of the post
- **url**: The URL that the post links to, if the post has a URL
- **num_points**: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- **num_comments**: The number of comments that were made on the post
- **author**: The username of the person who submitted the post
- **created_at**: The date and time at which the post was submitted
***
First, we need to open our data set.

In [1]:
from csv import reader

# Read the whole CSV file (header row included) into a list of lists
with open('hacker_news.csv', encoding='utf-8', newline='') as opened_file:
    hn = list(reader(opened_file))
len(hn)

293120

Our data set has 293120 rows, including the header row.
***
The second step is to remove the rows for posts that have no comments. The number of comments is stored in the fifth column of the data set (index 4).

In [2]:
# Keep the header row plus every post with at least one comment
hn = [hn[0]] + [row for row in hn[1:] if int(row[4]) != 0]
len(hn)

80402

After removing the posts without comments, we have 80402 rows.
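Positional indexing such as `row[4]` is easy to get wrong. As a minimal sketch, assuming a tiny in-memory sample in place of `hacker_news.csv` (both data rows below are made up for illustration), `csv.DictReader` lets us filter by column name instead:

```python
import csv
import io

# Illustrative sample only: these rows are made up and stand in for hacker_news.csv.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Example post A,http://example.com/a,5,0,alice,9/26/2016 0:31\n"
    "2,Example post B,http://example.com/b,40,13,bob,9/25/2016 22:51\n"
)

# DictReader uses the header row as keys, so no column index is needed.
rows = list(csv.DictReader(sample_csv))
commented = [row for row in rows if int(row["num_comments"]) > 0]

print(len(rows), len(commented))
print(commented[0]["title"])
```

The rest of this notebook keeps the list-of-lists form, so this is only an alternative way to read the same columns.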
***
Now we have only "full" records, but that is still more than we need for our analysis. Let's randomly remove about 71% of the rows: each row is dropped when a random integer from 0 to 99 is at most 70. That leaves roughly 23,000 records, which is enough for our analysis.

In [3]:
from random import randint

# Keep the header; drop each data row when randint(0, 99) <= 70 (about 71% of rows)
hn = [hn[0]] + [row for row in hn[1:] if randint(0, 99) > 70]
len(hn)

23348
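Because `randint` is not seeded, the cell above gives a different row count on every run (23348 in this run). A minimal sketch of a repeatable alternative, assuming a made-up list `data_rows` and an arbitrary seed of 42, draws an exact number of rows with `random.Random.sample`:

```python
import random

# Hypothetical stand-in for the data rows (header excluded).
data_rows = [[str(i), "title {}".format(i)] for i in range(1000)]

rng = random.Random(42)  # fixed seed (an arbitrary choice) makes the sample repeatable
keep_count = round(len(data_rows) * 0.29)  # keep about 29% of the rows
sample = rng.sample(data_rows, keep_count)

print(len(sample))
```

Note that `sample` returns the rows in random order rather than in their original order, which does not matter for the averages computed later.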

Now let's look at the header and the first five rows of **hn**:

In [4]:
hn[0:6]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12578311',
  'Americas Lost Boys: Men who choose video games over work',
  'https://www.firstthings.com/blogs/firstthoughts/2016/08/americas-lost-boys',
  '5',
  '1',
  'jseliger',
  '9/26/2016 0:31'],
 ['12577883',
  'LXQt 0.11 Released',
  'http://lxqt.org/release/2016/09/24/lxqt-011-et-al/',
  '40',
  '13',
  'Tsiolkovsky',
  '9/25/2016 22:51'],
 ['12577787',
  'UnGoogled Chromium: Chromium with enhanced privacy, control and transparency',
  'https://github.com/Eloston/ungoogled-chromium',
  '251',
  '120',
  'kawera',
  '9/25/2016 22:26'],
 ['12577652',
  "Chelsea Manning's 14 days in solitary for suicide attempt is 'cruel and inhuman'",
  'https://www.amnesty.org.uk/press-releases/chelsea-mannings-14-days-solitary-suicide-attempt-cruel-and-inhuman',
  '53',
  '34',
  'robin_reala',
  '9/25/2016 21:51'],
 ['12577647',
  'Ask HN: Someone uses stock trading as passive income?',
  '',
  '5',
  '2',
  '00taffe',
  '9/25/2016 21:50']]

***
Next, we split the data set into the header row and the data rows:

In [5]:
headers = hn[0]
hn = hn[1:]
print('*****head*****\n', headers, '\n*****new hn*****')
hn[0:5]

*****head*****
 ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 
*****new hn*****


[['12578311',
  'Americas Lost Boys: Men who choose video games over work',
  'https://www.firstthings.com/blogs/firstthoughts/2016/08/americas-lost-boys',
  '5',
  '1',
  'jseliger',
  '9/26/2016 0:31'],
 ['12577883',
  'LXQt 0.11 Released',
  'http://lxqt.org/release/2016/09/24/lxqt-011-et-al/',
  '40',
  '13',
  'Tsiolkovsky',
  '9/25/2016 22:51'],
 ['12577787',
  'UnGoogled Chromium: Chromium with enhanced privacy, control and transparency',
  'https://github.com/Eloston/ungoogled-chromium',
  '251',
  '120',
  'kawera',
  '9/25/2016 22:26'],
 ['12577652',
  "Chelsea Manning's 14 days in solitary for suicide attempt is 'cruel and inhuman'",
  'https://www.amnesty.org.uk/press-releases/chelsea-mannings-14-days-solitary-suicide-attempt-cruel-and-inhuman',
  '53',
  '34',
  'robin_reala',
  '9/25/2016 21:51'],
 ['12577647',
  'Ask HN: Someone uses stock trading as passive income?',
  '',
  '5',
  '2',
  '00taffe',
  '9/25/2016 21:50']]

***
## Filter data
Now that we've removed the headers from **hn**, we're ready to filter our data. Since we're only concerned with post titles beginning with *Ask HN* or *Show HN*, we'll create new lists of lists containing just the data for those titles.

To find the posts that begin with either *Ask HN* or *Show HN*, we'll use the string method *startswith*. Given a string object, say, *string1*, we can check whether it starts with, say, *dq* by inspecting the return value of the call **string1.startswith('dq')**. If *string1* starts with *dq*, the call returns *True*; otherwise it returns *False*. Because the check is case-sensitive, we lowercase each title first.
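As a small illustration of this check (the third title below is made up to show lowercase input), note that `startswith` also accepts a tuple of prefixes:

```python
titles = [
    "Ask HN: Someone uses stock trading as passive income?",
    "LXQt 0.11 Released",
    "show hn: an example project",  # made-up title to illustrate lowercase input
]

results = []
for title in titles:
    # Lowercasing first makes the prefix check case-insensitive.
    is_ask = title.lower().startswith("ask hn")
    # startswith also accepts a tuple of prefixes and matches any of them.
    is_ask_or_show = title.lower().startswith(("ask hn", "show hn"))
    results.append((is_ask, is_ask_or_show))

print(results)
```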

In [6]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

Let's check the number of posts in **ask_posts**, **show_posts**, and **other_posts**:

In [7]:
print('ask_posts number: ', len(ask_posts))
print('show_posts number: ', len(show_posts))
print('other_posts number: ', len(other_posts))

ask_posts number:  2064
show_posts number:  1455
other_posts number:  19828


***
## Comments range
Next, let's determine if ask posts or show posts receive more comments on average.

In [8]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print('average ask_posts comments: ', avg_ask_comments)

average ask_posts comments:  13.570251937984496


In [9]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print('average show_posts comments: ', avg_show_comments)

average show_posts comments:  9.094158075601374


As we can see, the average number of comments on **ask_posts** is greater than the average on **show_posts**.

I think this difference arises because posts that ask a question are commented on more often: people want to help the person who created the post, whereas a show post can be read without anyone feeling the need to reply.
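The two averaging cells above repeat the same loop. One way to avoid the duplication is a small helper, sketched here under the assumption that the comment count sits at index 4; the name `average_comments` and the example rows are made up for illustration:

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the same column order as the data set.
example_posts = [
    ["1", "Ask HN: example one", "", "5", "2", "alice", "9/25/2016 21:50"],
    ["2", "Ask HN: example two", "", "3", "10", "bob", "9/25/2016 13:05"],
]
print(average_comments(example_posts))
```

With such a helper, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops.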
***

## Ask posts analysis
Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.

In [10]:
import datetime as dt
result_list = []
for row in ask_posts:
    num_comments = int(row[4])
    result_list.append([row[6], num_comments])
counts_by_hour = {}
comments_by_hour = {}
for each in result_list:
    date_str = each[0]
    date_dt = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(date_dt, "%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = each[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += each[1]

We created two dictionaries:

- **counts_by_hour**: contains the number of ask posts created during each hour of the day.
- **comments_by_hour**: contains the total number of comments received by ask posts created during each hour.
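The same two tallies can be built without the `if`/`else` branch. The sketch below assumes a few made-up `[created_at, num_comments]` pairs in the data set's date format and uses `collections.defaultdict`, with the hour kept as an integer rather than a string:

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the data set's date format.
example = [
    ["9/25/2016 21:50", 2],
    ["9/24/2016 21:05", 6],
    ["9/26/2016 0:31", 1],
]

counts = defaultdict(int)    # missing hours start at 0
comments = defaultdict(int)
for created_at, num_comments in example:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour  # int, 0-23
    counts[hour] += 1
    comments[hour] += num_comments

averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```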
***
Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [11]:
avg_by_hour = []
for each in counts_by_hour:
    avg_by_hour.append([each, comments_by_hour[each]/counts_by_hour[each]])
print('[hour, avg_by_hour]')
avg_by_hour

[hour, avg_by_hour]


[['21', 10.313868613138686],
 ['13', 14.710280373831775],
 ['07', 8.127659574468085],
 ['22', 14.291666666666666],
 ['08', 14.352941176470589],
 ['20', 9.022222222222222],
 ['18', 10.884615384615385],
 ['16', 8.807142857142857],
 ['19', 8.535714285714286],
 ['17', 16.03361344537815],
 ['12', 24.185185185185187],
 ['02', 10.75],
 ['11', 8.5],
 ['06', 9.314814814814815],
 ['23', 9.697368421052632],
 ['15', 42.148936170212764],
 ['14', 15.991452991452991],
 ['09', 4.1063829787234045],
 ['00', 9.7],
 ['03', 7.775862068965517],
 ['10', 12.203125],
 ['05', 9.211538461538462],
 ['01', 10.791044776119403],
 ['04', 10.23404255319149]]

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [12]:
swap_avg_by_hour = []
for each in avg_by_hour:
    swap_avg_by_hour.append([each[1], each[0]])
print('[avg_by_hour, hour]')
swap_avg_by_hour

[avg_by_hour, hour]


[[10.313868613138686, '21'],
 [14.710280373831775, '13'],
 [8.127659574468085, '07'],
 [14.291666666666666, '22'],
 [14.352941176470589, '08'],
 [9.022222222222222, '20'],
 [10.884615384615385, '18'],
 [8.807142857142857, '16'],
 [8.535714285714286, '19'],
 [16.03361344537815, '17'],
 [24.185185185185187, '12'],
 [10.75, '02'],
 [8.5, '11'],
 [9.314814814814815, '06'],
 [9.697368421052632, '23'],
 [42.148936170212764, '15'],
 [15.991452991452991, '14'],
 [4.1063829787234045, '09'],
 [9.7, '00'],
 [7.775862068965517, '03'],
 [12.203125, '10'],
 [9.211538461538462, '05'],
 [10.791044776119403, '01'],
 [10.23404255319149, '04']]

In [14]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print('Top 5 Hours for Ask Posts Comments')
for each in sorted_swap[0:5]:
    date_str = each[1]
    date_dt = dt.datetime.strptime(date_str, "%H")
    time = dt.datetime.strftime(date_dt, "%H:%M")
    print('{}: {:.2f} average comments per post'.format(time, each[0]))

Top 5 Hours for Ask Posts Comments
15:00: 42.15 average comments per post
12:00: 24.19 average comments per post
17:00: 16.03 average comments per post
14:00: 15.99 average comments per post
13:00: 14.71 average comments per post
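The swap step is not strictly required: `sorted` accepts a `key` function that selects the value to sort by. A minimal sketch, using a few pairs copied from the **avg_by_hour** output above (the name `avg_by_hour_sample` is ours):

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour_sample = [
    ["21", 10.313868613138686],
    ["15", 42.148936170212764],
    ["12", 24.185185185185187],
    ["09", 4.1063829787234045],
]

# key picks the average as the sort value, so no swapped copy is required.
top = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)
for hour, avg in top[:2]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```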


## Conclusion
As we can see, to maximize the chance that your ask post is commented on and your question is answered in a timely manner, you should post at around 15:00. If that is not possible, try to post between 12:00 and 17:00, since the ask posts that attract the most comments are published during these hours. Keep in mind that this analysis covers only a random sample of posts that received at least one comment.