rumca-js/RSS-Link-Database-2022

RSS links database for year 2022

Suite of projects

Goal

  • Archive purposes
  • Data analysis - possible to verify link rot, etc.
  • Google sucks at providing results for various topics (dead internet)

Inspirations

Data

Daily Data

  • RSS links are captured for each source separately
  • two file formats for each day and source: JSON and Markdown
  • the Markdown file is generated as a preview; the JSON file can be reused and imported
  • links can be bookmarked. A bookmark does not necessarily mean endorsement; it shows particular interest in a topic and indicates importance. Bookmarked links are stored 'forever'
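
A minimal sketch of how the daily JSON files could be imported for reuse. It assumes each daily file holds a JSON list of entry objects and that files sit somewhere below a root directory; the actual layout and schema in this repository may differ. The function name `load_entries` is hypothetical.

```python
import json
from pathlib import Path


def load_entries(root):
    """Collect entries from every *.json file below root.

    Assumption: each daily file is a JSON list of entry objects.
    sources.json is skipped because it describes sources, not entries.
    """
    entries = []
    for path in sorted(Path(root).rglob("*.json")):
        if path.name == "sources.json":
            continue
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Ignore files that do not match the assumed list-of-objects shape.
        if isinstance(data, list):
            entries.extend(item for item in data if isinstance(item, dict))
    return entries
```

Calling `load_entries(".")` from a checkout of the repository would then return one flat list that later analysis steps can work on.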

Sources

  • provided in the sources.json file
  • provides information about sources, like title, URL, language
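
As an illustration, the source list could be grouped by language. This sketch assumes sources.json is a JSON list of objects with `title`, `url` and `language` keys; the key names are guesses based on the description above, not a confirmed schema, and `sources_by_language` is a hypothetical helper.

```python
import json
from collections import defaultdict


def sources_by_language(text):
    """Group source titles by language.

    Assumption: text is a JSON list of objects with "title", "url"
    and "language" keys. Missing keys fall back to placeholders.
    """
    grouped = defaultdict(list)
    for source in json.loads(text):
        language = source.get("language", "unknown")
        grouped[language].append(source.get("title", source.get("url")))
    return dict(grouped)
```

Usage would look like `sources_by_language(open("sources.json", encoding="utf-8").read())`.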

Data analysis

With these data we can perform further analysis:

  • how many old links are no longer valid (link rot test)
  • capture all domains from RSS links (internal, and leading outside?) and analyse which domains are most common
  • which site generates the most entries
  • capture all external links from entries, to see where these sites lead (network effect, etc.)
  • verify who reported first on certain topics
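
The domain analysis from the list above can be sketched with the standard library alone. The input is assumed to be a plain list of link strings already extracted from the entries; how links are stored in the JSON files is not specified here, and `count_domains` is a hypothetical name.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(links):
    """Count how often each domain appears in a list of links."""
    counts = Counter()
    for link in links:
        host = urlparse(link).hostname  # None when the link has no host part
        if not host:
            continue
        # Treat "www.example.com" and "example.com" as the same domain.
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    return counts
```

`count_domains(links).most_common(10)` then lists the ten most frequent domains.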

Problems, notes

  • Internet Archive (archive.org) does not provide snapshots for each and every day for all RSS sources, and it is sometimes pretty slow. We want to be sure that such a snapshot takes place, so we export links to the daily repo ourselves. The Django RSS app also makes requests to the archive to create snapshots
  • Google fails to deliver content from small creators (blogs, private pages, etc.). Google focuses on corporate hosting. The most common links lead to YouTube, Google Maps, Facebook, and Reddit
  • We cannot replace Google search
  • Google provides only 31 pages of news (in the news filter) and around 10 pages for ordinary search. This is a very small number. It is like looking at the Internet through a keyhole
  • Link rot is real. My links may stop working after some time
  • Is the data relevant, or useful for anyone?
  • Either we record data from 'well established sources' or we gather as many links as possible. I think search engines do that? We cannot gather too much data, as it could overwhelm our potato servers
  • there are other RSS solutions like 'feedly', but it is an app, not data. You cannot parse it, you do not own the data, and you can only do what feedly allows you to do

Ending notes

All links belong to us!