# An Analysis of News Posts on a Popular Website (Hacker News)
This project analyses user posts on the popular technology website "Hacker News". The dataset of posts was downloaded from [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts/kernels). Two types of posts, "Ask HN" and "Show HN", are analysed more thoroughly with regard to content, ratings, timing, etc.

### Opening and Preliminary Exploration of Data
The dataset is opened with the `open` function and read with `reader` from the `csv` module. The header row (the column names) is assigned to a separate variable and then removed from the dataset. A helper function, `explore_dataset`, prints a slice of rows and, optionally, the number of rows and columns.

In [1]:
def explore_dataset(dataset,start,end,rows_and_columns=False):
    # Print the rows in dataset[start:end]; optionally print the dataset's size
    dataset_slice=dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') #adds blank lines after each row
    if rows_and_columns:
        print('Number of rows',len(dataset))
        print('Number of Columns',len(dataset[0]))

In [2]:
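from csv import reader

# Read the whole CSV into a list of rows; the with-block closes the file afterwards
with open('HN_posts.csv',encoding="utf8") as opened_file_hn:
    read_file_hn=reader(opened_file_hn)
    hn_full=list(read_file_hn)
hn_header=hn_full[0]   # column names
hn_full=hn_full[1:]    # data rows only
print('hn full Columns:',hn_header,'\n')
# explore_dataset prints the rows itself and returns None
hn_5rows= explore_dataset(hn_full,0,5,True)
print(hn_5rows)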

hn full Columns: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:

### Cleaning Data-I
- Remove posts that received no comments. This reduces the dataset from 293,119 rows to 80,401.

In [3]:
hn_big=[]
for row in hn_full:
    num_comments=row[4]
    if num_comments!='0':
        hn_big.append(row)
print('hn Columns:',hn_header,'\n')
hn_5rows= explore_dataset(hn_big,0,5,True)
print(hn_5rows)

hn Columns: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']


['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']


['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26']


['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54']


['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37']


Number of rows 80401
Number of Columns 7
None


### Cleaning Data-II
- Use the `sample` function from the `random` module to reduce the dataset to 20,000 randomly selected rows.

In [4]:

from random import sample
hn=sample(hn_big,20000)
print('hn Columns:',hn_header,'\n')
hn_5rows= explore_dataset(hn,0,5,True)
print(hn_5rows)

hn Columns: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12354992', 'Ask HN: What should I know about FDM Group?', '', '1', '1', 'greyostrich', '8/24/2016 20:27']


['11579098', 'Rotonde: the IOT library', 'https://github.com/HackerLoop/rotonde/tree/master', '10', '2', 'ges', '4/27/2016 10:11']


['10692656', 'Story of Diaspora community manager', 'https://medium.com/anti-fiction/planting-a-seed-what-working-at-diaspora-was-like-cde26fa29364#.gyo9c9aoy', '3', '1', 'zlatan_todoric', '12/7/2015 21:00']


['10990621', 'Banking Model of Education', 'https://en.wikipedia.org/wiki/Banking_education', '3', '1', 'Kinnard', '1/28/2016 19:39']


['12091931', 'The Downside to Cord-Cutting', 'http://www.nytimes.com/2016/07/14/technology/personaltech/the-downside-to-cord-cutting.html?smid=fb-nytimes&smtyp=cur&_r=0', '2', '1', 'prostoalex', '7/14/2016 5:59']


Number of rows 20000
Number of Columns 7
None
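Because `sample` draws at random, each run of the notebook selects a different set of 20,000 rows, so the counts and averages that follow will differ slightly between runs. The sketch below shows one way to make the draw repeatable, assuming a fixed seed is acceptable for the analysis. The helper name `sample_rows`, the seed value 1 and the toy `rows` list are illustrative and not part of the original notebook.

```python
import random

# Toy stand-in for hn_big (hypothetical data, not the real dataset)
rows = [[str(i), 'title {}'.format(i)] for i in range(100)]

def sample_rows(dataset, k, seed=1):
    """Return k randomly chosen rows; the same seed always gives the same rows."""
    rng = random.Random(seed)   # private generator, leaves the global one untouched
    return rng.sample(dataset, k)

first = sample_rows(rows, 10)
second = sample_rows(rows, 10)
print(first == second)  # True
```

Using a private `random.Random` instance keeps the seed from affecting any other random calls in the notebook.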


### Cleaning Data-III
- Separate posts whose titles start with "Ask HN" or "Show HN" from all other posts. Titles are converted to lowercase first so that the match is case-insensitive.

In [5]:

ask_posts=[]
show_posts=[]
other_posts=[]
for row in hn:
    title=row[1]
    title=title.lower()
    
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):    
        show_posts.append(row)
    else:
        other_posts.append(row)    
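
The same rule can be wrapped in a small helper, which makes it easy to check against a few titles before running it on the whole dataset. This is a minimal sketch: the function name `classify_title` is hypothetical, and the titles are taken from the sample output in this notebook.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

for title in ['Ask HN: Unknown certificate authority',
              'Show HN: Libconcurrent  Coroutines in C',
              'Banking Model of Education']:
    print(classify_title(title), '-', title)
```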
    

### Verification
To verify the split, we inspect each of the three lists with the function ```explore_dataset()``` defined above.

In [6]:
print('ask posts Columns:',hn_header,'\n')
ask_posts_5rows= explore_dataset(ask_posts,0,5,True)
print(ask_posts_5rows)

ask posts Columns: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12354992', 'Ask HN: What should I know about FDM Group?', '', '1', '1', 'greyostrich', '8/24/2016 20:27']


['11644667', 'Ask HN: Unknown certificate authority', '', '1', '2', 'andrewfromx', '5/6/2016 15:37']


['10595325', 'Ask HN: How can I influence a change in this development process?', '', '5', '4', 'notinreallife', '11/19/2015 15:42']


['11212623', 'Ask HN: Was Mark Zuckerberg wrong to call All Lives Matter employees malicious?', '', '5', '14', 'alllivesmatter', '3/2/2016 20:08']


['12300433', "Ask HN: What's the best way to do lighting in 2016?", '', '2', '2', 'jMyles', '8/16/2016 20:33']


Number of rows 1656
Number of Columns 7
None


In [7]:
print('show posts Columns:',hn_header,'\n')
show_posts_5rows= explore_dataset(show_posts,0,5,True)
print(show_posts_5rows)

show posts Columns: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['11432026', 'Show HN: In-Depth Guide to Choosing a Website Builder', 'http://www.sitebuilderreport.com/', '2', '5', 'steve-benjamins', '4/5/2016 16:50']


['10219223', 'Show HN: Playing with the Google Charts API  Trending News', 'http://techwatching.com/discover.php', '4', '2', 'techwatching', '9/15/2015 6:18']


['10887071', 'Show HN: Libconcurrent  Coroutines in C', 'https://github.com/sharow/libconcurrent', '91', '24', 'mitghi', '1/12/2016 12:50']


['11596408', 'Show HN: Free Website Speed Test with Analytics and Improvement Tips', 'https://loadfocus.com', '1', '2', 'loadfocus', '4/29/2016 15:39']


['12397346', 'Show HN: A Small SHMUP Game, Frequent Flyer', '', '2', '4', 'comrad_gremlin', '8/31/2016 10:55']


Number of rows 1314
Number of Columns 7
None


In [8]:
print('other posts Columns:',hn_header,'\n')
other_posts_5rows= explore_dataset(other_posts,0,5,True)
print(other_posts_5rows)

other posts Columns: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['11579098', 'Rotonde: the IOT library', 'https://github.com/HackerLoop/rotonde/tree/master', '10', '2', 'ges', '4/27/2016 10:11']


['10692656', 'Story of Diaspora community manager', 'https://medium.com/anti-fiction/planting-a-seed-what-working-at-diaspora-was-like-cde26fa29364#.gyo9c9aoy', '3', '1', 'zlatan_todoric', '12/7/2015 21:00']


['10990621', 'Banking Model of Education', 'https://en.wikipedia.org/wiki/Banking_education', '3', '1', 'Kinnard', '1/28/2016 19:39']


['12091931', 'The Downside to Cord-Cutting', 'http://www.nytimes.com/2016/07/14/technology/personaltech/the-downside-to-cord-cutting.html?smid=fb-nytimes&smtyp=cur&_r=0', '2', '1', 'prostoalex', '7/14/2016 5:59']


['11425705', "Millennials' new retirement number? $1.8M (or more)", 'http://college.usatoday.com/2016/03/30/millennials-new-retirement-number-1-8-million-or-more/', '3', '2', 'JSeymourATL', '4/4/2016 21:12'

### Average Number of Comments
We now find the average number of comments for both ask_posts and show_posts. For this we define a function ```avg_entity(dataset,n)```, where n is the index of the column to be averaged in the given dataset. The values in that column are assumed to be stored as strings, so each one is converted to an integer before averaging.


In [9]:
def avg_entity(dataset,n):
    # Average of column n; values are stored as strings, so convert to int first
    entity_list=[]
    for row in dataset:
        entity=row[n]
        n_entity=int(entity)
        entity_list.append(n_entity)
    sum_entity=sum(entity_list)
    length_entity=len(entity_list)
    average=sum_entity/length_entity
    return average
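
As a quick sanity check, the same kind of average can be computed with `statistics.mean` from the standard library. The sketch below uses a small made-up dataset (the rows and the helper name `mean_of_column` are hypothetical) and assumes, as above, that the numeric column is stored as strings.

```python
from statistics import mean

# Hypothetical rows in the same seven-column layout as the dataset
toy_posts = [
    ['1', 'Ask HN: first',  '', '3', '4',  'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: second', '', '1', '10', 'user_b', '1/1/2016 11:00'],
    ['3', 'Ask HN: third',  '', '2', '1',  'user_c', '1/1/2016 12:00'],
]

def mean_of_column(dataset, n):
    """Average of column n, converting each string value to int first."""
    return mean(int(row[n]) for row in dataset)

print(mean_of_column(toy_posts, 4))  # (4 + 10 + 1) / 3 = 5
```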

**We now use the function defined above to find the average number of comments per post for both `ask_posts` and `show_posts`.**

In [10]:
average_ask_comments=avg_entity(ask_posts,4)
average_show_comments=avg_entity(show_posts,4)
print('avg ask comments',average_ask_comments,'\n', 'avg show comments',average_show_comments)

avg ask comments 13.492149758454106 
 avg show comments 10.028158295281584


### Finding number of ask posts and number of comments by hour
***We now create a separate list named ```result_list``` that holds the creation time and the number of comments of each post in ``ask_posts``.***

In [11]:
result_list=[]

for row in ask_posts:
    created_at=row[6]
    num_comments=int(row[4])
    data=[created_at,num_comments]
    result_list.append(data)

print(result_list[:5]) 
print(len(result_list))


[['8/24/2016 20:27', 1], ['5/6/2016 15:37', 2], ['11/19/2015 15:42', 4], ['3/2/2016 20:08', 14], ['8/16/2016 20:33', 2]]
1656


***We now use the datetime module to calculate the total posts and total comments by hour for ask_posts.***
Note: Three ways of extracting the hour are demonstrated: splitting the string manually, reading the `hour` attribute of a parsed `datetime` object, and formatting the parsed object with `strftime`. Only the `strftime` result is used as the dictionary key.

In [12]:
import datetime as dt
counts_by_hour={}
comments_by_hour={}
for row in result_list:
    created_at_str=row[0]
    # Method 1: split the string by hand
    dt_time_str=created_at_str.split(' ')
    date_str=dt_time_str[0]
    time_str=dt_time_str[1]
    time_str=time_str.split(':')
    hour=int(time_str[0])
    # Method 2: parse into a datetime object and read its hour attribute
    date=dt.datetime.strptime(created_at_str,'%m/%d/%Y %H:%M')
    time_object=date.time()
    hour_object=time_object.hour
    # Method 3: format the hour as a zero-padded string (used as the dictionary key)
    hour_str=date.strftime('%H')
    if hour_str in counts_by_hour:
        counts_by_hour[hour_str]+=1
    else:
        counts_by_hour[hour_str]=1
    num_comments=row[1]    
    if hour_str in comments_by_hour: 
        comments_by_hour[hour_str]+=num_comments
    else:
        comments_by_hour[hour_str]=num_comments
    
    
print('Counts by Hour',counts_by_hour,'\n',) 
print('Comments by Hour',comments_by_hour,'\n')
    

Counts by Hour {'20': 97, '15': 102, '23': 62, '14': 85, '17': 118, '21': 85, '13': 67, '11': 65, '22': 74, '07': 32, '18': 99, '19': 95, '03': 49, '12': 84, '06': 38, '09': 42, '10': 57, '02': 43, '01': 52, '16': 120, '00': 57, '08': 45, '05': 46, '04': 42} 

Comments by Hour {'20': 1211, '15': 2655, '23': 610, '14': 1014, '17': 2394, '21': 1192, '13': 1784, '11': 521, '22': 738, '07': 221, '18': 1116, '19': 778, '03': 332, '12': 1260, '06': 331, '09': 548, '10': 650, '02': 573, '01': 362, '16': 1628, '00': 967, '08': 597, '05': 403, '04': 458} 
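
The if/else bookkeeping above can be shortened with `collections.defaultdict`, which supplies a starting value of 0 for any hour not yet seen. The following is a minimal sketch of that alternative, run on the first five entries of `result_list` as printed earlier; the variable names `counts` and `comments` are chosen for illustration.

```python
import datetime as dt
from collections import defaultdict

# First five entries of result_list as printed earlier in this notebook
sample_rows = [['8/24/2016 20:27', 1], ['5/6/2016 15:37', 2],
               ['11/19/2015 15:42', 4], ['3/2/2016 20:08', 14],
               ['8/16/2016 20:33', 2]]

counts = defaultdict(int)     # missing keys start at 0
comments = defaultdict(int)
for created_at_str, num_comments in sample_rows:
    hour = dt.datetime.strptime(created_at_str, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {'20': 3, '15': 2}
print(dict(comments))  # {'20': 17, '15': 6}
```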



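### Finding Average Number of Comments per Ask Post by Hour
We now find the average number of comments per ask post for each hour by dividing the total comments in that hour by the number of posts created in that hour.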

In [13]:
avg_by_hour=[]

for hour_str in counts_by_hour:
    avg_by_hour.append([hour_str,(comments_by_hour[hour_str]/counts_by_hour[hour_str])])
    
print(avg_by_hour)    
    

[['20', 12.484536082474227], ['15', 26.029411764705884], ['23', 9.838709677419354], ['14', 11.929411764705883], ['17', 20.28813559322034], ['21', 14.023529411764706], ['13', 26.62686567164179], ['11', 8.015384615384615], ['22', 9.972972972972974], ['07', 6.90625], ['18', 11.272727272727273], ['19', 8.189473684210526], ['03', 6.775510204081633], ['12', 15.0], ['06', 8.710526315789474], ['09', 13.047619047619047], ['10', 11.403508771929825], ['02', 13.325581395348838], ['01', 6.961538461538462], ['16', 13.566666666666666], ['00', 16.964912280701753], ['08', 13.266666666666667], ['05', 8.76086956521739], ['04', 10.904761904761905]]


### Sorting and Printing Values
We will now sort the above list in descending order.
First we swap the two elements of each entry in ```avg_by_hour``` and store the result in a separate list ```swap_avg_by_hour```, so that the average comes first and `sorted` orders the entries by it.

In [14]:
swap_avg_by_hour=[]
for row in avg_by_hour:
    hour_str=row[0]
    avg_comments=row[1]
    swap_avg_by_hour.append([avg_comments,hour_str])
print(swap_avg_by_hour)    



[[12.484536082474227, '20'], [26.029411764705884, '15'], [9.838709677419354, '23'], [11.929411764705883, '14'], [20.28813559322034, '17'], [14.023529411764706, '21'], [26.62686567164179, '13'], [8.015384615384615, '11'], [9.972972972972974, '22'], [6.90625, '07'], [11.272727272727273, '18'], [8.189473684210526, '19'], [6.775510204081633, '03'], [15.0, '12'], [8.710526315789474, '06'], [13.047619047619047, '09'], [11.403508771929825, '10'], [13.325581395348838, '02'], [6.961538461538462, '01'], [13.566666666666666, '16'], [16.964912280701753, '00'], [13.266666666666667, '08'], [8.76086956521739, '05'], [10.904761904761905, '04']]


### Sorting and Identifying Top Five Hours for Ask Post Comments
Next we sort the list ```swap_avg_by_hour``` in descending order, keep the first five entries, and store them in a list ```sorted_swap```.

In [15]:
sorted_swap=sorted(swap_avg_by_hour,reverse=True)
sorted_swap=sorted_swap[:5]
print(sorted_swap[:5])
print('\n','Top 5 Hours for Ask Posts Comments')
for row in sorted_swap:
    avg=row[0]
    hour=row[1]
    hour_object=dt.datetime.strptime(hour,'%H')
    hour_object_str=hour_object.strftime('%H:%M')
    top_5="{time}: {avg:.2f} average comments per post".format(time=hour_object_str,avg=avg)
    print(top_5)
    

[[26.62686567164179, '13'], [26.029411764705884, '15'], [20.28813559322034, '17'], [16.964912280701753, '00'], [15.0, '12']]

 Top 5 Hours for Ask Posts Comments
13:00: 26.63 average comments per post
15:00: 26.03 average comments per post
17:00: 20.29 average comments per post
00:00: 16.96 average comments per post
12:00: 15.00 average comments per post
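
The swap step is only needed because `sorted` compares lists by their first element. An alternative sketch passes a `key` function so that the `[hour, average]` pairs can be sorted directly. The three pairs are copied from the `avg_by_hour` output above; the names `avg_pairs` and `ranked` are illustrative.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above
avg_pairs = [['20', 12.484536082474227],
             ['15', 26.029411764705884],
             ['13', 26.62686567164179]]

# Sort by the average (index 1), highest first, without swapping columns
ranked = sorted(avg_pairs, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```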


### Time Zones - Best Time to Post
As per the [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts#HN_posts_year_to_Sep_26_2016.csv) of the dataset, the `created_at` column is in US Eastern time, which is 9 hours ahead of our time zone.
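
To express the top hours in another time zone, an offset can be applied with `datetime.timedelta`. This is a rough sketch only: the function name `shift_hour` is hypothetical, the +9 value passed below is an assumed offset whose sign should be checked against the actual time zones involved, and a fixed offset ignores daylight saving changes.

```python
import datetime as dt

def shift_hour(hour_str, offset_hours):
    """Shift an 'HH' hour string by offset_hours and return it as 'HH:MM'."""
    base = dt.datetime.strptime(hour_str, '%H')   # date defaults to 1900-01-01
    shifted = base + dt.timedelta(hours=offset_hours)
    return shifted.strftime('%H:%M')

# Top hours from the analysis above, shifted by an assumed offset of +9 hours
for hour in ['13', '15', '17', '00', '12']:
    print(hour + ':00 ->', shift_hour(hour, 9))
```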