#  Exploring Hacker News Posts 
---
__Columns description:__
- `id`: The unique identifier from Hacker News for the post
- `title`: The title of the post
- `url`: The URL that the post links to, if the post has a URL
- `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the post
- `author`: The username of the person who submitted the post
- `created_at`: The date and time at which the post was submitted

## Preparing and exploring data
---

In [1]:
from csv import reader
import datetime as dt

# Read the whole CSV file into a list of rows (each row is a list of strings)
with open('hacker_news.csv', encoding="utf8") as opened_file:
    hn = list(reader(opened_file))

In [2]:
def explore_data(dataset, start, end, rows=False, columns=False):
    '''Print a slice of the dataset and, optionally, its dimensions.
    dataset – the dataset (a list of rows)
    start, end – indices of the first row to print and of the row after the last one
    rows – if True, print the number of rows in the dataset
    columns – if True, print the number of columns in the dataset
    '''
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        
    if rows:
        print('')
        print('Number of rows:', len(dataset))
    if columns:
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(hn, 0, 2, True, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

Number of rows: 20100
Number of columns: 7


In [4]:
# Split the dataset into the header row and the data rows
headers = hn[0]
hn = hn[1:]

In [5]:
explore_data(hn, 0, 2, True, True)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

Number of rows: 20099
Number of columns: 7


## Problem 1: Which posts are more engaging for the community?
---
We know that there are two major categories of posts: 'Ask HN' and 'Show HN'.
We can split our dataset into three categories and create three different lists: 
- __`ask_posts`__ – users submit Ask HN posts to ask the Hacker News community a specific question
- __`show_posts`__ – users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. 
- __`other_posts`__ – other posts

In [6]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

----
#### Checking that the data was split up correctly:

In [7]:
explore_data(ask_posts, 0, 0, True, True)


Number of rows: 1744
Number of columns: 7


In [8]:
explore_data(show_posts, 0, 0, True, True)


Number of rows: 1162
Number of columns: 7


In [9]:
explore_data(other_posts, 0, 0, True, True)


Number of rows: 17193
Number of columns: 7
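
One more sanity check is that the three categories add up to the full dataset. The sketch below uses the row counts printed above rather than the real lists (the `n_*` names are only illustrative); with the actual data, the same comparison would be `len(ask_posts) + len(show_posts) + len(other_posts) == len(hn)`.

```python
# Row counts copied from the outputs above (illustrative variable names)
n_ask, n_show, n_other = 1744, 1162, 17193
n_total = 20099  # rows in hn after the header row was removed

# Every post should land in exactly one of the three categories
split_total = n_ask + n_show + n_other
print(split_total == n_total)  # True
```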


---
#### Analyzing data:

In [10]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_ask_comments += num_comments

print('Total number of comments on ask posts: {: .0f}'.format(total_ask_comments) )

Total number of comments on ask posts:  24483


In [11]:
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average number of comments on ask posts: {: .0f}'.format(avg_ask_comments) )

Average number of comments on ask posts:  14


In [12]:
total_show_comments = 0 

for row in show_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_show_comments += num_comments

In [13]:
print('Total number of comments in show posts: {: .0f}'.format(total_show_comments) )

Total number of comments in show posts:  11988


In [14]:
avg_show_comments = total_show_comments / len(show_posts)
print('Average number of comments in show posts: {: .0f}'.format(avg_show_comments) )

Average number of comments in show posts:  10
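
The two averaging loops above follow the same steps, so they could be folded into one small helper. This is only a sketch: `average_comments` and the sample rows are hypothetical and not part of the original notebook. It assumes the number of comments sits at index 4, as in the dataset.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical sample rows in the same column order as the dataset
sample_posts = [
    ['1', 'Ask HN: example question', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: another question', '', '3', '4', 'user_b', '8/4/2016 12:10'],
]
print(average_comments(sample_posts))  # 7.0
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` should give the same averages as the cells above.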


#### Hypothesis:
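On average, `Ask HN` posts receive more comments than `Show HN` posts (about 14 versus 10 per post), so __`Ask HN` posts are more engaging than `Show HN` posts.__ <br>
One possible explanation is that users like helping others and sharing their experience more than expressing emotions and ideas about something new.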

## Problem 2: What is the best time to post to get more comments?
---

First, we can find the number of comments per hour and the number of posts per hour for `Ask HN` posts.
To do that, I'll create two dictionaries:

#### option 1
---

In [15]:
dt_format_hn = '%m/%d/%Y %H:%M'

result_list =[]

for row in ask_posts:
    
    time = row[6]
    num_com = row[4]
    num_com = str(num_com)
    result_list.append([time, num_com])
    
# dictionaries with the results
counts_by_hour = {}
comments_by_hour = {}

# loop over the prepared list with the data
for i in result_list:
    time = i[0]
    time = dt.datetime.strptime(time, dt_format_hn)
    hour = dt.datetime.strftime(time, '%H')
    
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1    
    else:
        counts_by_hour[hour] = 1
        
    num_comm = int(i[1])
    
    if hour in comments_by_hour:
        comments_by_hour[hour] += num_comm
    else:
        comments_by_hour[hour] = num_comm

#### option 2
---

In [16]:
dt_format_hn = '%m/%d/%Y %H:%M'

# dictionaries with the results

counts_by_hour = {}
comments_by_hour = {}

# loop over the ask_posts dataset directly
for i in ask_posts:
    time = i[6]
    time = dt.datetime.strptime(time, dt_format_hn)
    hour = dt.datetime.strftime(time, '%H')
    
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1    
    else:
        counts_by_hour[hour] = 1
        
    num_comm = int(i[4])
    
    if hour in comments_by_hour:
        comments_by_hour[hour] += num_comm
    else:
        comments_by_hour[hour] = num_comm
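
The `if hour in ... else` bookkeeping in both options can also be delegated to `collections.defaultdict`. The sketch below is one possible variant, not the notebook's method: `tally_by_hour` and the sample rows are hypothetical, and it assumes the same column order and date format as the dataset.

```python
import datetime as dt
from collections import defaultdict

DT_FORMAT_HN = '%m/%d/%Y %H:%M'

def tally_by_hour(posts, time_index=6, comments_index=4):
    """Return (posts per hour, comments per hour) keyed by a two-digit hour string."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for row in posts:
        created = dt.datetime.strptime(row[time_index], DT_FORMAT_HN)
        hour = created.strftime('%H')
        counts[hour] += 1
        comments[hour] += int(row[comments_index])
    return dict(counts), dict(comments)

# Hypothetical sample rows in the same column order as the dataset
sample_posts = [
    ['1', 'Ask HN: a', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: b', '', '3', '4', 'user_b', '8/5/2016 11:05'],
    ['3', 'Ask HN: c', '', '1', '6', 'user_c', '6/23/2016 22:20'],
]
counts_sample, comments_sample = tally_by_hour(sample_posts)
print(counts_sample)    # {'11': 2, '22': 1}
print(comments_sample)  # {'11': 14, '22': 6}
```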

The next step is to find __the average number of comments per post for posts created during each hour of the day__. 

To do that, I'll transfer the data from the dictionaries into one list, `avg_by_hour`, with two columns:
- hour
- average number of comments per post

In [17]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, 
                        comments_by_hour[hour] / counts_by_hour[hour] ])

In [18]:
avg_by_hour[:3]

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696]]

Next, I sort the data in `avg_by_hour` by the average number of comments. I first swap the columns so that the average comes first:

In [19]:
swap_avg_by_hour = []

for i in avg_by_hour:
    first = i[0]
    second = i[1]
    swap_avg_by_hour.append([second, first])

In [20]:
swap_avg_by_hour[:3]

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10']]

In [21]:
sorted_swap = sorted(swap_avg_by_hour, key = lambda comments: comments[0], reverse = True)

In [22]:
sorted_swap[:3]

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20']]
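
Because `sorted` accepts a `key` function, the same ordering can be obtained without building a swapped list. Here is a minimal sketch using the first three rows of `avg_by_hour` printed above (the `_sample` and `sorted_direct` names are illustrative only):

```python
# First three rows of avg_by_hour, copied from the output above
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['13', 14.741176470588234],
    ['10', 13.440677966101696],
]

# Sort by the second column (the average), highest first
sorted_direct = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)
print(sorted_direct[0])  # ['13', 14.741176470588234]
```

The rows keep their original `[hour, average]` layout, so code that reads the result would use index 0 for the hour and index 1 for the average.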

## Top 5 Hours for Ask Posts Comments
---

The chart below shows the average number of comments per `Ask HN` post for each hour of the day. It helps to find the _'ideal hours'_ for users who want to make a highly engaging post. 

In [23]:
sorted_avg = sorted(avg_by_hour)

for i in sorted_avg:
    time = i[0]
    time = dt.datetime.strptime(time, '%H')
    hour = dt.datetime.strftime(time, '%H:%M')
    com = i[1]
    print(hour, '●' * int(com))

00:00 ●●●●●●●●
01:00 ●●●●●●●●●●●
02:00 ●●●●●●●●●●●●●●●●●●●●●●●
03:00 ●●●●●●●
04:00 ●●●●●●●
05:00 ●●●●●●●●●●
06:00 ●●●●●●●●●
07:00 ●●●●●●●
08:00 ●●●●●●●●●●
09:00 ●●●●●
10:00 ●●●●●●●●●●●●●
11:00 ●●●●●●●●●●●
12:00 ●●●●●●●●●
13:00 ●●●●●●●●●●●●●●
14:00 ●●●●●●●●●●●●●
15:00 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
16:00 ●●●●●●●●●●●●●●●●
17:00 ●●●●●●●●●●●
18:00 ●●●●●●●●●●●●●
19:00 ●●●●●●●●●●
20:00 ●●●●●●●●●●●●●●●●●●●●●
21:00 ●●●●●●●●●●●●●●●●
22:00 ●●●●●●
23:00 ●●●●●●●


If we look at the data carefully, we can see three peak hours (15:00, 02:00 and 20:00): posts created at these hours receive the most comments on average. The five hours with the highest averages are:

In [24]:
for row in sorted_swap[:5]:
    time = row[1]
    time = dt.datetime.strptime(time, '%H')
    hour = dt.datetime.strftime(time, '%H:%M')
    com = row[0]
    print(hour, '●' * int(com))

15:00 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
02:00 ●●●●●●●●●●●●●●●●●●●●●●●
20:00 ●●●●●●●●●●●●●●●●●●●●●
16:00 ●●●●●●●●●●●●●●●●
21:00 ●●●●●●●●●●●●●●●●


In [25]:
for row in sorted_swap[:5]:
    time = row[1]
    time = dt.datetime.strptime(time, '%H')
    hour = dt.datetime.strftime(time, '%H:%M')
    com = row[0]
    print('{} : {: .2f}'.format(hour, com))

15:00 :  38.59
02:00 :  23.81
20:00 :  21.52
16:00 :  16.80
21:00 :  16.01


So the ideal time to post is 3 p.m. (in the time zone of the `created_at` column), 
but if you are busy at work, you can also post at around 8 p.m. or 2 a.m.

---