# Assignment - Recency

-by Qi Sun


## Hacker News
 
I downloaded the dataset from Kaggle. 

https://www.kaggle.com/santiagobasulto/all-hacker-news-posts-stories-askshow-hn-polls

The dataset is 503.07 MB and covers posts from October 2006 to September 2020. It has 8 columns:

* Object ID: The unique identifier from Hacker News for the post

* Title: The title of the post 

* Post Type: The type of the post, one of four values: story (regular post), ask_hn, show_hn, poll

* Author: The username of the person who submitted the post

* Created At:  The date and time at which the post was submitted

* URL: The URL that the post links to, if the post has a URL

* Points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

* Number of Comments: The number of comments that were made on the post

 
## Hacker News ranking algorithm

Reference: https://medium.com/hacking-and-gonzo/how-hacker-news-ranking-algorithm-works-1d9b0cf2c08d

The post above explains how the Hacker News ranking algorithm works. It shows the source code (reproduced below) of the function that ranks articles on the Hacker News website.

<img src="https://raw.githubusercontent.com/susanqisun/DAV6300/main/Screen%20Shot%202021-02-07%20at%207.51.14%20PM.png" width="700">


Below is the Hacker News formula:

<img src="https://raw.githubusercontent.com/susanqisun/DAV6300/main/Screen%20Shot%202021-02-07%20at%207.51.21%20PM.png" width="700">


The author also shows that gravity (G) and time (T) have a significant impact on an item's score:

* "*the score decreases as T increases, meaning that older items will get lower and lower scores*"
* "*the score decreases much faster for older items if gravity is increased*"

P is the item's points (upvotes minus downvotes). One is subtracted from P so that the submitter's own upvote is not counted.

T is the time between the posting time and the current time, in hours.

G is the "gravity" constant, 1.8 by default. It controls how quickly the score falls over time: the higher the gravity, the faster the score decreases.
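
The two quoted observations can be checked numerically. The sketch below is a minimal illustration of the formula as described above; the function name `hn_score` and the sample values (100 points, ages of 1 to 72 hours, a gravity of 2.2) are my own choices for illustration and do not come from the article.

```python
def hn_score(points, hours_since_post, gravity=1.8):
    """Hacker News style score: (P - 1) / (T + 2) ** G."""
    return (points - 1) / (hours_since_post + 2) ** gravity

# The same post (100 points) at increasing ages: the score keeps falling.
scores_by_age = [hn_score(100, t) for t in (1, 6, 24, 72)]

# The same post at the same age, with a higher gravity: the score is lower.
default_gravity = hn_score(100, 24)
higher_gravity = hn_score(100, 24, gravity=2.2)

print(scores_by_age)
print(default_gravity, higher_gravity)
```

A post with a single point (only the submitter's own upvote) scores exactly 0 at any age, because P - 1 is 0.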



**Python implementation**

The author rewrote the score function in Python.

<img src="https://raw.githubusercontent.com/susanqisun/DAV6300/main/Screen%20Shot%202021-02-07%20at%208.06.27%20PM.png" width="700">

In this assignment, I'll use this function to compute a ranking score for each post in the dataset. Then I'll use Reddit's hot ranking algorithm to penalize older posts, so that newer stories rank higher than older ones.


In [35]:
import pandas as pd
import numpy as np

df = pd.read_csv('/Users/yangyang/Desktop/hn.csv')
df.head()

Unnamed: 0,Object ID,Title,Post Type,Author,Created At,URL,Points,Number of Comments
0,1,Y Combinator,story,pg,2006-10-09 18:21:51,http://ycombinator.com,61,18.0
1,2,A Student's Guide to Startups,story,phyllis,2006-10-09 18:30:28,http://www.paulgraham.com/mit.html,16,1.0
2,3,Woz Interview: the early days of Apple,story,phyllis,2006-10-09 18:40:33,http://www.foundersatwork.com/stevewozniak.html,7,1.0
3,4,NYC Developer Dilemma,story,onebeerdave,2006-10-09 18:47:42,http://avc.blogs.com/a_vc/2006/10/the_nyc_deve...,5,1.0
4,5,"Google, YouTube acquisition announcement could...",story,perler,2006-10-09 18:51:04,http://www.techcrunch.com/2006/10/09/google-yo...,7,1.0


In [37]:
# Rename the 'Created At' column to 'created'
df.rename(columns = {'Created At':'created'}, inplace = True)
df.tail(2)

Unnamed: 0,Object ID,Title,Post Type,Author,created,URL,Points,Number of Comments
3121799,24517608,A Great Wave of Hokusai Drawings Resurfaced at...,story,bookofjoe,2020-09-18 15:00:12,https://www.atlasobscura.com/articles/lost-hok...,1,0.0
3121800,24517619,Doukutsu-rs: A reimplementation of the Cave St...,story,calibas,2020-09-18 15:01:13,https://github.com/alula/doukutsu-rs,1,0.0


### 1. Create current date time

Since the last post in this dataset is from Sep 18, 2020, I assume the current date and time is Sep 19, 2020 (00:00).

In [38]:
import datetime

df02 = df.copy()

# Assumed "current" time, stored in every row: 2020-09-19 00:00
df02['dates'] = pd.to_datetime('2020-09-19')
df02.head(2)


Unnamed: 0,Object ID,Title,Post Type,Author,created,URL,Points,Number of Comments,dates
0,1,Y Combinator,story,pg,2006-10-09 18:21:51,http://ycombinator.com,61,18.0,2020-09-19
1,2,A Student's Guide to Startups,story,phyllis,2006-10-09 18:30:28,http://www.paulgraham.com/mit.html,16,1.0,2020-09-19
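
As an alternative to hard-coding the date, the same reference time could be derived from the data. The sketch below is only an illustration under that assumption: it uses a two-row stand-in for the `created` column (timestamps copied from the last two rows of the dataset), and the variable name `reference_time` is hypothetical.

```python
import pandas as pd

# Stand-in for the 'created' column, using the two most recent posts.
created = pd.to_datetime(pd.Series(["2020-09-18 15:00:12", "2020-09-18 15:01:13"]))

# Midnight after the most recent post: drop the time of day, then add one day.
reference_time = created.max().normalize() + pd.Timedelta(days=1)
print(reference_time)
```

This gives the same Sep 19, 2020 reference as the assumption above, and would follow the data if a newer snapshot of the dataset were used.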


In [39]:
# Change the data type of the created date to datetime

df02['created'] = pd.to_datetime(df02['created'])
df02.dtypes

Object ID                      int64
Title                         object
Post Type                     object
Author                        object
created               datetime64[ns]
URL                           object
Points                         int64
Number of Comments           float64
dates                 datetime64[ns]
dtype: object

### 2. Calculate time between posting time and the current time (in hours)

In [40]:
df03 = df02.copy()

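# Whole hours between the post time and the assumed current time.
# astype('timedelta64[h]') is not supported in recent pandas versions,
# so the hours are computed from total seconds instead (same floored result).
df03['hours'] = (df03.dates - df03.created).dt.total_seconds() // 3600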
df03.head()


Unnamed: 0,Object ID,Title,Post Type,Author,created,URL,Points,Number of Comments,dates,hours
0,1,Y Combinator,story,pg,2006-10-09 18:21:51,http://ycombinator.com,61,18.0,2020-09-19,122237.0
1,2,A Student's Guide to Startups,story,phyllis,2006-10-09 18:30:28,http://www.paulgraham.com/mit.html,16,1.0,2020-09-19,122237.0
2,3,Woz Interview: the early days of Apple,story,phyllis,2006-10-09 18:40:33,http://www.foundersatwork.com/stevewozniak.html,7,1.0,2020-09-19,122237.0
3,4,NYC Developer Dilemma,story,onebeerdave,2006-10-09 18:47:42,http://avc.blogs.com/a_vc/2006/10/the_nyc_deve...,5,1.0,2020-09-19,122237.0
4,5,"Google, YouTube acquisition announcement could...",story,perler,2006-10-09 18:51:04,http://www.techcrunch.com/2006/10/09/google-yo...,7,1.0,2020-09-19,122237.0
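
The hour values can be reproduced on a small sample. The sketch below assumes a two-row stand-in for the `created` column (the first post in the dataset and one post from Sep 18, 2020) and computes floored hours from the total number of seconds, which is one way to get whole hours in pandas.

```python
import pandas as pd

# Two post times copied from the dataset outputs.
created = pd.to_datetime(pd.Series(["2006-10-09 18:21:51", "2020-09-18 13:17:02"]))
now = pd.Timestamp("2020-09-19")  # assumed current time

# Elapsed time in seconds, floored to whole hours (kept as floats).
hours = (now - created).dt.total_seconds() // 3600
print(hours.tolist())
```

The first post is a little over 122237 hours old, and the second is 10 hours and about 43 minutes old, so it is floored to 10.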


### 3. Apply Hacker News ranking algorithm

Score = (P-1) / (T+2)^G

G = 1.8

In [41]:
df03['score_raw']= (df03['Points']-1)/pow((df03['hours']+2),1.8)
df03.head()


Unnamed: 0,Object ID,Title,Post Type,Author,created,URL,Points,Number of Comments,dates,hours,score_raw
0,1,Y Combinator,story,pg,2006-10-09 18:21:51,http://ycombinator.com,61,18.0,2020-09-19,122237.0,4.179974e-08
1,2,A Student's Guide to Startups,story,phyllis,2006-10-09 18:30:28,http://www.paulgraham.com/mit.html,16,1.0,2020-09-19,122237.0,1.044994e-08
2,3,Woz Interview: the early days of Apple,story,phyllis,2006-10-09 18:40:33,http://www.foundersatwork.com/stevewozniak.html,7,1.0,2020-09-19,122237.0,4.179974e-09
3,4,NYC Developer Dilemma,story,onebeerdave,2006-10-09 18:47:42,http://avc.blogs.com/a_vc/2006/10/the_nyc_deve...,5,1.0,2020-09-19,122237.0,2.78665e-09
4,5,"Google, YouTube acquisition announcement could...",story,perler,2006-10-09 18:51:04,http://www.techcrunch.com/2006/10/09/google-yo...,7,1.0,2020-09-19,122237.0,4.179974e-09
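
As a spot check, the first two values of `score_raw` can be recomputed by hand from Points and hours. This is a minimal sketch; the helper name `hn_score` is my own and is not part of the notebook.

```python
def hn_score(points, hours, gravity=1.8):
    """Score = (P - 1) / (T + 2) ** G."""
    return (points - 1) / (hours + 2) ** gravity

# Rows 0 and 1 of the output above: 61 and 16 points, both 122237 hours old.
row0 = hn_score(61, 122237)
row1 = hn_score(16, 122237)
print(f"{row0:.6e}", f"{row1:.6e}")
```

Because both posts have the same age, their scores differ only through P - 1: 15 / 60 means the second score is one quarter of the first.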


### 4. Top 10 items using Hacker News ranking algorithm

In [42]:
# Sort by Hacker News score, highest first
df04 = df03.sort_values(by='score_raw', ascending=False)
df04.head(10)


Unnamed: 0,Object ID,Title,Post Type,Author,created,URL,Points,Number of Comments,dates,hours,score_raw
3121711,24516345,Trump offered to pardon Assange if he provided...,story,pseudolus,2020-09-18 13:17:02,https://www.reuters.com/article/us-britain-ass...,307,230.0,2020-09-19,10.0,3.492973
3120937,24506303,Tell HN: Never search for domains on Godaddy.com,story,wasteme,2020-09-17 16:02:59,,1572,687.0,2020-09-19,31.0,2.903027
3121461,24513340,FreeCAD: A free and open source multiplatform ...,story,creolabs,2020-09-18 05:41:51,https://github.com/FreeCAD/FreeCAD,405,193.0,2020-09-19,18.0,1.83877
3121596,24515063,What happened to Firefox Send?,story,Techbrunch,2020-09-18 10:36:11,https://support.mozilla.org/en-US/kb/what-happ...,238,146.0,2020-09-19,13.0,1.81044
3121239,24510053,21 years after the request OpenPGP support get...,story,janvdberg,2020-09-17 21:14:48,https://bugzilla.mozilla.org/show_bug.cgi?id=2...,627,228.0,2020-09-19,26.0,1.554855
3120414,24499924,This electrical transmission tower has a problem,story,danso,2020-09-17 00:49:13,https://twitter.com/tubetimeus/status/13063593...,1588,454.0,2020-09-19,47.0,1.439541
3121640,24515540,Commerce Department Prohibits WeChat and TikTo...,story,JacobHenner,2020-09-18 11:49:05,https://www.commerce.gov/news/press-releases/2...,158,139.0,2020-09-19,12.0,1.357904
3121542,24514433,Facebook Accused of Watching Instagram Users T...,story,drewem,2020-09-18 08:43:48,https://www.bloomberg.com/news/articles/2020-0...,198,113.0,2020-09-19,15.0,1.201319
3120747,24504080,"Cloudflare and the Wayback Machine, joining fo...",story,jgrahamc,2020-09-17 13:03:35,http://blog.archive.org/2020/09/17/internet-ar...,667,140.0,2020-09-19,34.0,1.052276
3121658,24515786,"Apple Ending ""Fortnite Save the World"" Updates...",story,tosh,2020-09-18 12:26:09,https://www.epicgames.com/fortnite/en-US/news/...,107,172.0,2020-09-19,11.0,1.04763


## Reddit algorithm - hot ranking


Next, I'll use Reddit's hot ranking algorithm to penalize older posts.

<img src="https://raw.githubusercontent.com/susanqisun/DAV6300/main/Screen%20Shot%202021-02-08%20at%2011.42.35%20AM.png" width="500">

According to the articles listed under References:

* Submission time has a big impact on the ranking, and the algorithm ranks newer stories higher than older ones.

* A post's score does not decrease as time goes by; instead, newer stories start with a higher score than older ones. This differs from the Hacker News algorithm, which lowers a post's score as it ages.

`1134028003` is a fixed time in Unix timestamp format (December 8, 2005, 07:46:43 UTC). It can be understood as roughly the time when Reddit started operating.

The divisor 45000 is in seconds (12.5 hours): a post submitted 12.5 hours later gains 1 in score, the same as a tenfold increase in points. We can decrease 45000 to give the time component more weight.
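
The trade-off between points and submission time can be illustrated with a small sketch. It is not the notebook's implementation: the function name `hot`, the constant name `REDDIT_OFFSET`, the `divisor` parameter, and the two sample posts are assumptions made for this example, and post times are taken as Unix timestamps in UTC.

```python
from datetime import datetime, timedelta, timezone
from math import log10

REDDIT_OFFSET = 1134028003  # fixed Unix timestamp subtracted from the post time

def hot(points, posted_at, divisor=45000):
    """Hot score: sign * log10(|points|) + seconds since the offset / divisor."""
    order = log10(max(abs(points), 1))
    sign = (points > 0) - (points < 0)
    seconds = posted_at.timestamp() - REDDIT_OFFSET
    return round(sign * order + seconds / divisor, 7)

older = datetime(2020, 9, 18, tzinfo=timezone.utc)
newer = older + timedelta(seconds=45000)  # 12.5 hours later

# 100 points on the older post vs. 10 points on the newer one.
print(hot(100, older), hot(10, newer))

# With a smaller divisor, the newer post wins despite having fewer points.
print(hot(100, older, divisor=20000), hot(10, newer, divisor=20000))
```

With the default divisor the two posts tie, since the 12.5-hour head start is worth exactly one factor of ten in points. With a divisor of 20000 the same head start is worth 2.25, so the newer post ranks first.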


References:

https://medium.com/hacking-and-gonzo/how-reddit-ranking-algorithms-work-ef111e33d0d9

https://medium.com/jp-tech/how-are-popular-ranking-algorithms-such-as-reddit-and-hacker-news-working-724e639ed9f7



In [43]:
df05 = df04.copy()

In [57]:
# https://medium.com/hacking-and-gonzo/how-reddit-ranking-algorithms-work-ef111e33d0d9
from datetime import datetime, timedelta
from math import log

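# Note: the reference implementation in the article uses datetime(1970, 1, 1) here.
# Starting from 2005-08-12 shifts every score by the same constant (which is why
# the scores below are negative), so the ranking order is not affected.
epoch = datetime(2005, 8, 12)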

def epoch_seconds(date):
    td = date - epoch
    return td.days * 86400 + td.seconds + (float(td.microseconds) / 1000000)


def calculate_rank_sum(s, date):
    order = log(max(abs(s), 1), 10)
    sign = 1 if s > 0 else -1 if s < 0 else 0
    seconds = epoch_seconds(date) - 1134028003
    long_number = sign * order + seconds / 45000
    return round(long_number, 7)



In [58]:
df05['penalized_score'] = df05.apply(lambda x: calculate_rank_sum(x.Points, x.created), axis=1)
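
Applying a Python function row by row can be slow on roughly 3 million rows. A vectorized version is sketched below as one possible alternative; the helper name `hot_scores` and the three-row `sample` frame are hypothetical, and the epoch is kept at 2005-08-12 on the assumption that the scores should match the cell above.

```python
import numpy as np
import pandas as pd

def hot_scores(points, created, divisor=45000):
    """Vectorized form of the row-wise hot score (same epoch and constants)."""
    epoch = pd.Timestamp(2005, 8, 12)
    seconds = (created - epoch).dt.total_seconds() - 1134028003
    order = np.log10(np.maximum(points.abs(), 1))
    return (np.sign(points) * order + seconds / divisor).round(7)

# Small sample with Points and post times copied from the dataset.
sample = pd.DataFrame({
    "Points": [307, 238, 1],
    "created": pd.to_datetime([
        "2020-09-18 13:17:02",
        "2020-09-18 10:36:11",
        "2020-09-18 15:01:13",
    ]),
})
sample["penalized_score"] = hot_scores(sample["Points"], sample["created"])
print(sample)
```

On the full data, the call would take `df05['Points']` and `df05['created']` in place of the sample columns.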

### Top 10 items using Reddit algorithm for penalizing older information

In [60]:
# Sort by penalized score, highest first
df06 = df05.sort_values(by='penalized_score', ascending=False)
df06.head(10)


Unnamed: 0,Object ID,Title,Post Type,Author,created,URL,Points,Number of Comments,dates,hours,score_raw,penalized_score
3121711,24516345,Trump offered to pardon Assange if he provided...,story,pseudolus,2020-09-18 13:17:02,https://www.reuters.com/article/us-britain-ass...,307,230.0,2020-09-19,10.0,3.492973,-14606.352439
3121596,24515063,What happened to Firefox Send?,story,Techbrunch,2020-09-18 10:36:11,https://support.mozilla.org/en-US/kb/what-happ...,238,146.0,2020-09-19,13.0,1.81044,-14606.677467
3121640,24515540,Commerce Department Prohibits WeChat and TikTo...,story,JacobHenner,2020-09-18 11:49:05,https://www.commerce.gov/news/press-releases/2...,158,139.0,2020-09-19,12.0,1.357904,-14606.758187
3121461,24513340,FreeCAD: A free and open source multiplatform ...,story,creolabs,2020-09-18 05:41:51,https://github.com/FreeCAD/FreeCAD,405,193.0,2020-09-19,18.0,1.83877,-14606.839034
3121658,24515786,"Apple Ending ""Fortnite Save the World"" Updates...",story,tosh,2020-09-18 12:26:09,https://www.epicgames.com/fortnite/en-US/news/...,107,172.0,2020-09-19,11.0,1.04763,-14606.878038
3121542,24514433,Facebook Accused of Watching Instagram Users T...,story,drewem,2020-09-18 08:43:48,https://www.bloomberg.com/news/articles/2020-0...,198,113.0,2020-09-19,15.0,1.201319,-14606.907224
3121635,24515461,TikTok to be banned from US app stores from Su...,story,jsty,2020-09-18 11:37:04,https://www.ft.com/content/c460ce4c-c691-4df5-...,93,113.0,2020-09-19,12.0,0.795715,-14607.004384
3121586,24514978,OpenSCAD - The Programmers Solid 3D CAD Modeller,story,MrsPeaches,2020-09-18 10:20:39,http://www.openscad.org/,100,25.0,2020-09-19,13.0,0.75626,-14607.074756
3121719,24516453,Backdoors and other vulnerabilities in HiSilic...,story,blablablub,2020-09-18 13:26:37,https://kojenov.com/2020-09-15-hisilicon-encod...,53,16.0,2020-09-19,10.0,0.593577,-14607.102524
3121589,24515019,Reinventing virtualization with the AWS Nitro ...,story,manigandham,2020-09-18 10:27:52,https://www.allthingsdistributed.com/2020/09/r...,82,15.0,2020-09-19,13.0,0.618758,-14607.15132
