These are the source RDF files used to generate the Who's on First (at the New York Times) webpages between 2006 and 2010. As of this writing that link (to the "Who's on First" pages) is broken because I am a space cadet. It will be fixed shortly...
These are not the actual articles as published by the New York Times but instead the metadata about each article (authors, subjects, locations, etc.) along with pointers to the articles themselves.
Comprehensive documentation still needs to be written.
Yes. It's all in scary RDF/XML. It's not how I would do it now but it's what I did then. It looks scarier than it is. Please submit patches, fixes, whatever. That's part of the reason I am putting this all on GitHub.
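For a sense of how un-scary RDF/XML can be, here is a minimal sketch that reads it with nothing but the Python standard library. The sample document is made up for illustration: the Dublin Core `dc:creator` and `dc:subject` properties and the `example.com` URL are assumptions, not the vocabulary these files actually use, so swap in whatever namespaces and elements you find in the real data.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"  # assumed vocabulary, for illustration only

# Hypothetical sample; the real files may be shaped differently.
SAMPLE = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.com/articles/1">
    <dc:creator>Example Author</dc:creator>
    <dc:subject>Example Subject</dc:subject>
    <dc:subject>Another Subject</dc:subject>
  </rdf:Description>
</rdf:RDF>
"""


def summarize(rdf_xml):
    """Return one dict per rdf:Description with its URL, creators and subjects."""
    root = ET.fromstring(rdf_xml.strip())
    records = []
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        records.append({
            # rdf:about carries the pointer to the thing being described
            "about": desc.get(f"{{{RDF_NS}}}about"),
            "creators": [e.text for e in desc.findall(f"{{{DC_NS}}}creator")],
            "subjects": [e.text for e in desc.findall(f"{{{DC_NS}}}subject")],
        })
    return records


if __name__ == "__main__":
    for record in summarize(SAMPLE):
        print(record)
```

A dedicated RDF library would be the sturdier choice for anything beyond poking around, since RDF/XML allows several equivalent ways of writing the same statements and plain XML parsing only handles the one you coded for.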
I have no excuses. Some of the data is probably garbled beyond recognition at this point. I suppose it would be possible to recrawl the New York Times website to fix those mistakes but I haven't done that, ever.
I'm sorry. Luminoso's "Fixing common Unicode mistakes with Python - after they've been made" might fix some of the problems (maybe?) but I have not tried this yet...
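One common kind of garbling is UTF-8 text that got decoded as Windows-1252 somewhere along the way. A minimal sketch of undoing that single mistake, using only the standard library, is below. It is not the approach from the Luminoso post, the `repair_mojibake` name is made up here, and whether this particular mix-up is what actually happened to the data is an assumption that has not been checked.

```python
def repair_mojibake(text):
    """Undo one round of UTF-8 bytes mis-decoded as Windows-1252.

    Returns the text unchanged when the round trip is not possible,
    which usually means the text was not garbled in this particular way.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    # The three characters a-circumflex, euro sign, right double quote
    # are what a UTF-8 encoded U+2014 dash looks like when read as cp1252.
    garbled = "\u00e2\u20ac\u201d"
    print(repr(repair_mojibake(garbled)))
```

Text that is already fine, such as a correctly decoded "café" or a lone curly apostrophe, fails the round trip and comes back untouched. Strings garbled more than once, or through a different encoding, will also come back untouched, so this is a first pass rather than a cure.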