A maintained list of news feeds #131

Open
scripting opened this issue Sep 15, 2019 · 15 comments

Comments

@scripting
Owner

commented Sep 15, 2019

I've been posting about the need for a maintained list of stable and usable RSS feeds from news orgs. There is some activity out there, and a discussion is possible, but I can't be in the center of it. I'm willing to facilitate though, so here's a place to post instead of emailing directly to me.

Please be on-topic. No debates.

I have many applications for it. Think of me as a user.

@lemeb


commented Sep 15, 2019

Thanks for writing that up. A few remarks:

  • Media Cloud is a pretty stable project. Not only have we been around since 2009, but we are publishing a significant amount of scholarship based on its data. The most notable piece is probably the Berkman study 1 about the 2016 election, which was made into a book 2 last year. We also have guaranteed funding for the next few years, thanks notably to Knight, the Ford Foundation, and the Gates Foundation. And we have something like 7 to 8 full-time staff.
  • We are very aggressively operating as an open source project. You can bootstrap an instance of Media Cloud on GitHub. 3 You can also already play with the web interface 4 or the API. 5 Almost everything that we have access to can be accessed by anyone, except the contents of the articles that we crawled, which have to be kept on the Media Lab premises for copyright reasons.
  • I understand why making a list of RSS feeds sounds elegant, and that’s essentially what we’re doing right now, but it has become more and more difficult to maintain. RSS feeds on news sites, to put it politely, are not managed with the same care as they were ten years ago. We are progressively moving to using site maps as a backstop instead.
  • A lot of the questions that you raised have been discussed by the team over the past years. For instance, should we have a conservative list of RSS feeds? Well, probably not if we want to make sure our coverage is comprehensive. So what we did instead is to create collections, which are maintained by a human hand, to have cleaner subsets of feeds. 6 That also helps with websites that close and feeds that become silent.
  • Media Cloud is more than just news websites and blogs. We have added a fair amount of Twitter and Facebook data, we will have everything from Reddit indexed shortly, we have some amount of TV transcripts, and the archives of 4ch/8ch are on their way.
@scripting

Owner Author

commented Sep 15, 2019

I'm glad you're doing what you're doing, and I hear good things about it, but as a user, I've tried to describe what I, and I think many other people, need.

We're all kind of fumbling in the dark trying to find stuff worth following. I want to be systematic about it. A process.

Media Cloud seems like something much more comprehensive. I want what I described.

@andysylvester


commented Sep 17, 2019

@mterenzio


commented Sep 18, 2019

@lemeb Does Media Cloud provide a dump of the RSS feeds it does currently use? You say it's hard to maintain. A project like this would get the community involved in maintaining them. It wouldn't be extra work on your part and it might even help you.
If it doesn't make the RSS feed dumps available, can you answer why?
That wouldn't seem to violate any copyright issues.

@scripting

Owner Author

commented Sep 19, 2019

@mterenzio -- it seems like we're not going to hear from them. I did write them a follow-up email suggesting we explore working on this together, but haven't heard back.

I plan to loop back around to this again and again. As I ship the new software I'm working on, it'll become the #1 thing on my list again.

The key is the process, and association with organizations that are long-lived and high-reputation.

@mterenzio


commented Sep 19, 2019

@scripting 100% agree. It's a vital resource that isn't available. For web news, this is as important as archive.org is for web history. I'll try a few avenues myself and let you know if I make any progress.

@anothercookiecrumbles


commented Sep 24, 2019

Thanks for putting this together. Some comments from me, a research fellow at the Tow Center over at Columbia Journalism School.

As part of a different project, we've been whitelisting news organisations and have an automated process that checks their RSS feeds, including whether new ones have been added or old ones removed. We have over 700 legitimate US local / national news organisations as well as a few others like The Guardian and The Financial Times. For some news organisations, we've failed to find any RSS feed, which in itself seems lamentable. Overall, as of now, we have about 2,000 feeds for these 700 or so news organisations.
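The comment does not say how the automated check works. As a rough sketch of one way to detect added and removed feeds, the snippet below compares two snapshots of a feed list. The `diff_feeds` function, the snapshot shape (outlet name mapped to a set of feed URLs), and the example URLs are all assumptions made for illustration, not a description of the Tow Center code.

```python
def diff_feeds(previous, current):
    """Compare two snapshots that map an outlet name to a set of feed URLs.

    Returns two dicts (added, removed), each mapping outlet -> sorted URL list.
    """
    added, removed = {}, {}
    for outlet in sorted(set(previous) | set(current)):
        old = set(previous.get(outlet, ()))
        new = set(current.get(outlet, ()))
        if new - old:
            added[outlet] = sorted(new - old)
        if old - new:
            removed[outlet] = sorted(old - new)
    return added, removed


# Hypothetical snapshots, for illustration only.
yesterday = {
    "Example Gazette": {
        "https://example.org/news.xml",
        "https://example.org/sports.xml",
    }
}
today = {
    "Example Gazette": {
        "https://example.org/news.xml",
        "https://example.org/politics.xml",
    }
}

added, removed = diff_feeds(yesterday, today)
```

An outlet that ends up with no feeds at all shows up in `removed` with every URL it used to have, which is one way to surface the "failed to find any RSS feed" case for review.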

We want to eventually open-source all our code + provide API access, but because the project's in its nascent stage with plenty of moving parts, we've not done so yet. Rest assured, it's high on our list of things to do.

I am happy to share our RSS feeds on a regular basis (a database dump or something once a month? more frequently?), and ensure we're maintaining the quality of our list.

@mterenzio


commented Sep 24, 2019

@anothercookiecrumbles I'm interested and I'd like to learn more about the project

@donpark


commented Sep 24, 2019

I think everyone should share a feed of their 'trustworthy' news sources. Subscribing to a 'source' feed means a) I trust the feed owner's judgements and b) it's added to my feed of trusted news sources. Decentralized House of Cards made out of Turtles all the way down. :-)

@scripting

Owner Author

commented Sep 24, 2019

@anothercookiecrumbles -- bingo! that's exactly what I was looking for.

I think the way to go is for a script to run periodically, ideally daily, pull out the feeds along with any useful metadata you have, format them as an OPML subscription list, and upload it to a GitHub repo. From there, people can deploy the feeds in any number of different applications.
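For readers who want a concrete picture of the formatting step, here is a minimal Python sketch that turns a list of feed records into an OPML 2.0 subscription list. It is not the JavaScript code mentioned in this thread. The `feeds_to_opml` function and the record keys (`title`, `xml_url`, `html_url`) are hypothetical names chosen for the example; the upload to GitHub is left out.

```python
import xml.etree.ElementTree as ET
from email.utils import formatdate


def feeds_to_opml(feeds, title="News feeds"):
    """Build an OPML 2.0 subscription list from a list of feed dicts.

    Each dict needs "title" and "xml_url"; "html_url" is optional.
    (These key names are assumptions made for this sketch.)
    """
    opml = ET.Element("opml", version="2.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    # OPML dates follow the RFC 822 style, which formatdate produces.
    ET.SubElement(head, "dateCreated").text = formatdate(usegmt=True)
    body = ET.SubElement(opml, "body")
    for feed in feeds:
        attrs = {"type": "rss", "text": feed["title"], "xmlUrl": feed["xml_url"]}
        if feed.get("html_url"):
            attrs["htmlUrl"] = feed["html_url"]
        ET.SubElement(body, "outline", attrs)
    declaration = '<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(opml, encoding="unicode")


# Hypothetical record, for illustration only.
sample = [
    {
        "title": "Example Gazette",
        "xml_url": "https://example.org/news.xml",
        "html_url": "https://example.org/",
    }
]
opml_text = feeds_to_opml(sample)
```

The returned string should be written to disk as UTF-8 so that it matches the XML declaration.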

I have JavaScript code that does all that, and am happy to help. The key thing here is an authoritative list of feeds, and I can't imagine a better authority than Tow Center.

Thanks for getting in touch.

@anothercookiecrumbles


commented Sep 24, 2019

@mterenzio, we've got a bunch of efforts around local news, and the news outlet whitelisting/RSS feeds curation is part of a more data-intensive component to the larger project. We're still fine-tuning the research questions and the shape the project will take, but happy to go into more detail if you're curious.

@scripting, is the JavaScript code open-source? Alternatively, do you know of any Python libraries/repos that do that? If so, I can try to sort something out soon-ish based on the data we have.

The one thing worth pointing out: I think we'll struggle to have an authoritative list of feeds, mostly because whitelisting means we'll inevitably miss some news organisations. And, even now, the stuff we have is predominantly English, which means we're not capturing a ton of stuff (I think we only have a handful of Spanish sites, and nothing in Chinese or Bengali, for example). This is something we're aware of and looking to address, but it's worth flagging upfront. We need to be able to crowdsource, if nothing else, legitimate lists of news organisations, and things like INN and LION get us some way there, but what about beyond that?

@scripting

Owner Author

commented Sep 24, 2019

@anothercookiecrumbles -- yes, all of it is open source, but it'll work better if I do the adaptation and make that open source.

Re struggle -- 1. do the best you can now, and 2. try to do better in the future. Software is a process. It sucks today but it'll suck less tomorrow. That's my philosophy. Here the real accomplishment is to flow the good work you're doing in academia into the RSS community, such as it is (we'll find out) in a useful way. And learn from that, and help each other with the next steps.

So let's go back to step 1. Can you make something available through a (possibly private) API that would get me a list of your feeds and the metadata, in any format you find easy to produce? From that, I can take care of porting to OPML and uploading to GitHub on a daily basis.

I already do that for my blog, in the repo we're using right now. Look in the blog section at the top level. That is updated every night as I post new stuff at scripting.com. I would more or less model the interface on what I learned from that (and the working code that does the uploads).

@anothercookiecrumbles


commented Oct 1, 2019

Sorry for the late reply. I think what might be easiest (and quickest) for me is to write a script that uploads a CSV or something (JSON, XML, whatever) to a GitHub repo, and you can pull it from there?
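A CSV dump of this kind could look something like the sketch below, which writes feed records to CSV text and reads them back on the consuming side. The column names in `FIELDS` and both helper functions are hypothetical, since the thread does not fix a schema, and the push to a GitHub repo is not shown.

```python
import csv
import io

# Hypothetical column names; the real dump may carry different metadata.
FIELDS = ["outlet", "feed_url", "language"]


def feeds_to_csv(rows):
    """Serialise feed records to CSV text, sorted so repo diffs stay small."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for row in sorted(rows, key=lambda r: (r["outlet"], r["feed_url"])):
        writer.writerow(row)
    return buffer.getvalue()


def csv_to_feeds(text):
    """Parse the CSV text back into a list of dicts, one per feed."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Hypothetical records, for illustration only.
rows = [
    {"outlet": "Example Gazette", "feed_url": "https://example.org/news.xml", "language": "en"},
    {"outlet": "Daily Example", "feed_url": "https://example.com/rss", "language": "es"},
]
dump = feeds_to_csv(rows)
parsed = csv_to_feeds(dump)
```

Sorting the rows before writing means that a daily commit only shows the feeds that actually changed.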

@scripting

Owner Author

commented Oct 1, 2019

@anothercookiecrumbles -- no worries, this is a very asynchronous thread. ;-)

That would work. Whatever format works best for you.

@PMaynard


commented Oct 7, 2019

@scripting I like the idea. I've tried to operate something similar for a few years. The news is focused on information security, with an industrial control systems bent (since that's my research topic).

I've added an OPML subscription list [1] to my news aggregator [2]. I have been meaning to prune it and add more quality feeds. Some things, like the Reddit feed, do lower the overall signal-to-noise ratio and are not what you'd want to include.

[1] http://port22.co.uk/port22_feeds.opml
[2] https://port22.co.uk/
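For anyone who wants to reuse a subscription list like this, a minimal sketch of pulling the feed URLs out of OPML text follows. The `feed_urls_from_opml` function and the inline `SAMPLE` document are made up for illustration; the sample is not the content of the file linked above, and fetching the file is left to the reader.

```python
import xml.etree.ElementTree as ET


def feed_urls_from_opml(opml_text):
    """Return the unique xmlUrl values found in an OPML document, in order."""
    root = ET.fromstring(opml_text)
    urls = []
    # iter() walks nested outlines too, so category folders are handled.
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if url and url not in urls:
            urls.append(url)
    return urls


# Made-up OPML document, for illustration only.
SAMPLE = """<opml version="2.0">
  <head><title>Sample list</title></head>
  <body>
    <outline text="Security">
      <outline type="rss" text="Example feed" xmlUrl="https://example.org/feed.xml"/>
    </outline>
  </body>
</opml>"""

urls = feed_urls_from_opml(SAMPLE)
```

Outlines without an `xmlUrl` attribute, such as the category folder in the sample, are skipped.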
