# Exploring the Hacker News dataset (Dataquest exercise)

**TABLE COLUMNS**

* id: The unique identifier from Hacker News for the post.

* title: The title of the post.

* url: The URL that the post links to, if the post has a URL.

* num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes.

* num_comments: The number of comments that were made on the post.

* author: The username of the person who submitted the post.

* created_at: The date and time at which the post was submitted. 
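
The code in this notebook refers to columns by position (for example `row[1]` for the title). As a small sketch, the positions can be derived from a header row instead of being hard-coded. The `sample_headers` list is an assumption that mirrors the column order listed above, and `col_index` is a hypothetical helper name; in the notebook, the real header row would take its place.

```python
# Assumed header row, in the column order listed above.
sample_headers = ["id", "title", "url", "num_points", "num_comments", "author", "created_at"]

# Map each column name to its position in a row.
col_index = {name: position for position, name in enumerate(sample_headers)}

print(col_index["title"])         # position of the title column
print(col_index["num_comments"])  # position of the comment count
print(col_index["created_at"])    # position of the timestamp
```

With such a mapping, `row[col_index["num_comments"]]` reads the same value as `row[4]` while stating which column is meant.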

**Open and read the file:**

In [None]:
from csv import reader
# Read every row into a list; the with block closes the file afterwards.
with open("../input/hacker-news-posts/HN_posts_year_to_Sep_26_2016.csv") as f:
    hn = list(reader(f))

**Return the first five rows:**

In [None]:
for row in hn[0:5]:
    print(row,"\n")

**The header row is part of the dataset, so we are going to remove it and store it separately. Then we will print the header and the first five rows.**

In [None]:
headers = hn[0]
hn=hn[1:]
print(headers,"\n")
for row in hn[:5]:
    print(row,"\n")

### **We're only concerned with post titles beginning with Ask HN or Show HN, so we'll create new lists of lists containing just the data for those titles.**

In [None]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    # Compare in lowercase so that "Ask HN", "ask hn", and "ASK HN" all match.
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
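
The prefix check can be tried on its own before running it over the whole dataset. The sketch below wraps the same rule in a hypothetical helper, `classify_title`, and applies it to a few made-up titles; neither the helper nor the titles come from the dataset.

```python
def classify_title(title):
    """Return "ask", "show", or "other" based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Made-up titles, used only to illustrate the matching.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "A post that mentions Ask HN later in the title",
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Because `startswith` only looks at the beginning of the string, a title that mentions "Ask HN" further along is counted as an other post.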

#### Number of posts

In [None]:
print("total posts: ",len(hn))

In [None]:
print("ask posts:",len(ask_posts),"- show posts:",len(show_posts),"- other posts:",len(other_posts))

### **Next, let's determine if "ask" posts or "show" posts receive more comments on average.**

In [None]:
# First, total the comments on ask posts.
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
# Then compute the average.
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average comments for ask:", avg_ask_comments)

In [None]:
# Now total the comments on show posts.
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
# Then compute the average.
avg_show_comments = total_show_comments / len(show_posts)
print("Average comments for show:", avg_show_comments)

**As we can see, ask posts receive more comments on average than show posts. I think this result is to be expected, because most people use comments to help someone who asked a question rather than simply to leave a remark on something.**
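
The two averaging cells repeat the same loop with different lists. A possible way to avoid the repetition is a small helper; `average_comments` is a hypothetical name, and the sample rows are made up in the dataset's column order so that the sketch runs on its own.

```python
def average_comments(posts, comments_col=4):
    """Average of the integer comment counts stored at index comments_col."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    return sum(int(row[comments_col]) for row in posts) / len(posts)

# Made-up rows in the dataset's column order; only index 4 matters here.
sample_posts = [
    ["1", "Ask HN: A", "", "3", "10", "alice", "8/16/2016 9:55"],
    ["2", "Ask HN: B", "", "5", "4", "bob", "8/16/2016 10:05"],
]
sample_avg = average_comments(sample_posts)
print(sample_avg)
```

In the notebook, the same helper could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.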

### **Next, we'll determine if ask posts created at a certain time are more likely to attract comments.**

In [None]:
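import datetime as dt

# Keep only the creation time and the comment count of each ask post.
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}    # hour -> number of ask posts created in that hour
comments_by_hour = {}  # hour -> total comments on those posts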

In [None]:
for created_at, n_comments in result_list:
    # Parse the timestamp and keep only the zero-padded hour as a string.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comments
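
To see what the loop produces, the same grouping can be run on a handful of rows. This sketch uses `collections.defaultdict` in place of the explicit `if`/`else`, which is a swapped-in variation and not the approach used in the cell above. The sample timestamps and comment counts are made up and only assumed to follow the `"%m/%d/%Y %H:%M"` format.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs in the "%m/%d/%Y %H:%M" format.
sample_rows = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 2],
]

sample_counts = defaultdict(int)    # hour -> number of posts
sample_comments = defaultdict(int)  # hour -> total comments
for created_at, n_comments in sample_rows:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += n_comments

print(dict(sample_counts))
print(dict(sample_comments))
```

Missing keys in a `defaultdict(int)` start at 0, so the first post of an hour needs no special case.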
    

**Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.**


In [None]:
avg_by_hour = []
for hour in counts_by_hour:
    # Each entry is [hour, average comments per post in that hour].
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

In [None]:
for row in avg_by_hour:
    print(row)

**Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.**

In [None]:
swap_avg_by_hour =[]
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
for row in swap_avg_by_hour:
    print(row)

In [None]:
sorted_swap = sorted(swap_avg_by_hour,reverse=True)
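
Swapping the columns works because `sorted` compares lists element by element, so the average in the first position decides the order. An alternative sketch keeps the `[hour, average]` layout and passes a `key` function to `sorted`. The values in `sample_avg_by_hour` are made up for illustration and are not results from the dataset.

```python
# Made-up [hour, average] pairs in the same layout as avg_by_hour.
sample_avg_by_hour = [["09", 5.5], ["15", 38.6], ["02", 23.8], ["20", 21.5]]

# Sort by the average (index 1), highest first, without swapping columns.
sample_sorted = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sample_sorted[:2]:
    print("{}:00 : {:.2f} average comments per post".format(hour, avg))
```

Both approaches give the same ordering when the averages differ; the `key` version simply skips the intermediate swapped list.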

In [None]:
print("Top 5 Hours for Ask Posts Comments")

In [None]:
for avg, hour in sorted_swap[:5]:
    # Turn the hour string, e.g. "15", into a clock label such as "15:00".
    hour_label = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    print("{a} : {b:.2f} average comments per post".format(a=hour_label, b=avg))

**End**