rumca-js/RSS-Link-Database-2024

Link database for year 2024

This repository contains link metadata: title, description, publish date, etc.

Suite of projects

Goal

  • Archive purposes
  • Data analysis - possible to verify link rot, etc.

Inspirations

Data

Daily Data

Daily data are stored in the year directory.

  • data are in %Y%M%Y-%M-%D directories, where %Y stands for year, %M for month, %D for day
  • Most links are captured via RSS. Some entries were added manually
  • for each source two files are provided: JSON and markdown
  • markdown file is provided for data preview
  • this repo contains many links captured via an automated process. They are here, but that does not mean I endorse them all
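
As a minimal sketch of reading one day's data, the function below loads every JSON file found in a single daily directory. The function name `load_day_entries` is hypothetical, the directory path is supplied by the caller, and nothing is assumed about the structure inside each JSON file beyond it being valid JSON.

```python
import json
from pathlib import Path


def load_day_entries(day_dir):
    """Load every JSON file in one daily directory.

    Returns a dict mapping each file name (without extension) to its
    parsed JSON content. Markdown preview files are ignored.
    """
    entries = {}
    for json_path in sorted(Path(day_dir).glob("*.json")):
        with json_path.open(encoding="utf-8") as handle:
            entries[json_path.stem] = json.load(handle)
    return entries
```

Call it with the path to one daily directory from a local clone, for example `load_day_entries("path/to/one/daily/directory")`.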

Sources

  • file: sources.json
  • provides information about sources, like title, url, language

Domains

  • file: domains.json
  • provides information about domains, like title, url, language
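
A small sketch of how these two files might be inspected. It assumes, without confirmation from this README, that each file holds a JSON list of objects and that the keys are literally named `title` and `language`; adjust the names to match the real files. The helper names `load_records` and `group_by_language` are hypothetical.

```python
import json
from collections import defaultdict


def load_records(path):
    """Read a JSON file such as sources.json or domains.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def group_by_language(records):
    """Group record titles by language.

    Assumes each record is a dict with "title" and "language" keys
    (an assumption, not a documented schema). Records without a
    language are collected under "unknown".
    """
    groups = defaultdict(list)
    for record in records:
        language = record.get("language") or "unknown"
        groups[language].append(record.get("title"))
    return dict(groups)
```

For example, `group_by_language(load_records("sources.json"))` would list the source titles available for each language.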

Data analysis

With these data we can perform further analysis:

  • analysis of links: how many old links are no longer valid
  • analysis of RSS source: how often it publishes data
  • analysis of RSS source: what kind of data it produces, is it reliable
  • analysis of RSS source: is it a content farm, does it contain many links outside of the domain?
  • analysis of domains: is the domain correctly configured?
  • analysis of topics: who was the first to report on certain topic
  • analysis of topics: which source uses which words? For example, it seems that left-leaning and right-leaning sites have different vocabularies: they use different kinds of words and ideas. With this file history, you can analyze which sites have which ideas
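
A few of the analyses above can be sketched in plain Python. This is an illustration only: the entry keys `source`, `title` and `date_published` are assumptions about the JSON data, the sample entries are made up, and the function names are hypothetical. Publish dates are compared as strings, which is only valid if they are ISO 8601 timestamps in the same time zone.

```python
import re
from collections import Counter, defaultdict


def entries_per_source(entries):
    """Count how many entries each source published."""
    return Counter(entry["source"] for entry in entries)


def vocabulary_per_source(entries, min_length=4):
    """Count the words each source uses in its titles."""
    words = defaultdict(Counter)
    for entry in entries:
        title = (entry.get("title") or "").lower()
        tokens = re.findall(r"[a-z]+", title)
        words[entry["source"]].update(t for t in tokens if len(t) >= min_length)
    return words


def first_to_report(entries, keyword):
    """Return the earliest entry whose title mentions the keyword, or None."""
    matching = [
        entry for entry in entries
        if keyword.lower() in (entry.get("title") or "").lower()
    ]
    return min(matching, key=lambda entry: entry["date_published"], default=None)


# Made-up sample data using the assumed keys.
sample_entries = [
    {"source": "https://a.example/feed", "title": "Budget vote delayed again",
     "date_published": "2024-01-02T08:00:00+00:00"},
    {"source": "https://b.example/feed", "title": "Budget vote: what happens next",
     "date_published": "2024-01-02T06:30:00+00:00"},
    {"source": "https://a.example/feed", "title": "Storm closes roads",
     "date_published": "2024-01-03T09:00:00+00:00"},
]

print(entries_per_source(sample_entries).most_common())
print(vocabulary_per_source(sample_entries)["https://a.example/feed"].most_common(3))
print(first_to_report(sample_entries, "budget")["source"])
```

Comparing the per-source word counters is a crude starting point for the vocabulary question; a real analysis would likely also need stop-word removal and much more data.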

Problems, notes

  • This solution does not replace Internet Archive. We do not store all link data
  • Internet Archive (archive.org) is sometimes slow
  • This solution does not replace Google. This would be futile. However, Google provides only 31 pages of news (in the news filter) and around 10 pages for ordinary search. This is a very small number. It is like looking at the Internet through a keyhole
  • You cannot discover new content using Google. For example write 'blog' into search. You will not be able to find new blogs.
  • You cannot discover new content using YouTube. For example write 'trailer' into search. It shows me a certain number of new trailers from a time span of 10 days, and nothing older, or just a few older titles. I prefer capturing data from a movie trailer channel and continuing to use its data
  • Link rot is real. Many GitHub pages do not work at all. Many pages stop working after a significant amount of time
  • I am using a Raspberry Pi at the moment. With it I cannot track all of the sources in the world. Therefore I track only 'well established' sources, or the ones I am really interested in
  • Is the data relevant, or useful for anyone but me?

Q & A

Q: Why are you using sources like the Daily Mail? This does not make any sense!

A: This project aims to capture a time capsule of a day, or a year. In statistics, it is sometimes too late to capture "new" data after the fact. For analysis, you can still restrict yourself to credible and reliable sources.

Ending notes

All links belong to us!
