NewsVac Design Document
NewsVac is a plugin for Slash that:
- Mines web sites for news
- Automatically submits news that matches keywords
The practical objective of NewsVac is to average 120 NewsVac submissions per day, with half being postable stories, while taking less than an hour of an editor's time per day (a half hour in the morning and a half hour in the late afternoon). Of course, any improvement in the percentage of postable stories, or decrease in the time taken, is good.
NewsVac is a powerful and complex system. It has several components.
The URL tables store all the URLs, whether they are URLs to be mined for stories, or they are the story URLs themselves.
Miners take URLs assigned to them and mine them, looking for stories, and creating nuggets that describe the stories.
Nuggets are special URLs that contain the URL, title, slug, and source of a mined URL. The source is where the data came from (such as "OSDN"); the slug is some additional text apart from the title (like introtext). (Jamie: so source might be what NewsForge is calling a "Parent URL", such as "osdn.com"?)
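The four nugget fields might be packed into and recovered from a single URL-like string along these lines. This is only a minimal Python sketch: the actual NewsVac encoding, and the `make_nugget`/`parse_nugget_fields` names, are illustrative assumptions, not the real code.

```python
from urllib.parse import urlencode, parse_qs

def make_nugget(url, title, slug, source):
    """Pack the four nugget fields into one URL-style string.
    The real NewsVac encoding may differ; this is only illustrative."""
    return "nugget:?" + urlencode(
        {"url": url, "title": title, "slug": slug, "source": source})

def parse_nugget_fields(nugget):
    """Recover the fields from a string built by make_nugget()."""
    qs = nugget.split("?", 1)[1]
    return {k: v[0] for k, v in parse_qs(qs).items()}
```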
Parsers operate in turn on specific types of data. There are currently three parsers.
- parse_miner
This parser is used on the pages fetched from the miners' URLs. It parses the incoming data (using various regexes to trim the page, and to find multiple stories on the page), and creates nuggets.
- parse_nugget
This parser takes the nuggets and creates data (???) to be passed to the next parser.
- parse_plaintext
This parser takes the stories and pulls out the text, to later be matched against keywords, and submitted.
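The parse_miner step above, trimming a page with regexes and then extracting stories, might look roughly like this sketch. The miner fields and regexes here are invented for illustration; the real ones live in the miner table.

```python
import re

# Hypothetical miner configuration; the real columns live in the miner table.
miner = {
    "pre_trim":  re.compile(r"^.*?<!-- begin stories -->", re.S),
    "post_trim": re.compile(r"<!-- end stories -->.*$", re.S),
    "extract":   re.compile(r'<a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>', re.S),
}

def mine_page(html, miner):
    """Trim the page down to the story region, then pull out (url, title)
    pairs; each pair would become a nugget."""
    body = miner["pre_trim"].sub("", html)
    body = miner["post_trim"].sub("", body)
    return [(m.group("url"), m.group("title"))
            for m in miner["extract"].finditer(body)]
```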
Spiders control the whole flow of processing, executing the miners and calling each parser in order. A spider is a data structure of conditions and instructions.
For each stage of the spidering, a group of URLs is fetched from the database. Each URL is then processed in two steps: it is (1) requested and (2) analyzed.
To request a URL is to fetch it, either from the web, or, in the case of nuggets, by extracting the data from the nugget itself. Here, the url_info and url_content tables are updated.
To analyze the URL is to process its data with the right parsers. Here, the url_analysis table is updated (along with url_content for the plaintext data, and rel to store links (not sure what this means in this context?)).
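The request/analyze loop over each stage's group of URLs can be sketched as follows. All function and field names here are hypothetical stand-ins for the real spider data structure and database calls, not the actual NewsVac API.

```python
def run_spider(stages, select_urls, request, parsers, record):
    """One pass over a spider's stages.  Each stage selects a group of URLs
    (cf. the group_0 selects in the spider table), requests each one, and
    analyzes it with that stage's parser.  All names are illustrative."""
    for stage in stages:
        for url in select_urls(stage["condition"]):
            data = request(url)          # web fetch, or decode the nugget itself
            results = parsers[stage["parser"]](data)
            record(url, stage["parser"], results)  # would update url_analysis/rel
```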
Keywords are words, grouped together by tags, to be weighed against stories parsed by parse_plaintext. robosubmit() goes through and matches the stories against the keywords and, if a story meets a certain threshold, submits it to the submission bin.
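A minimal sketch of the keyword-weighing logic, assuming keyword rows of (regex, weight, tag) as described under newsvac_keywords below. The example keywords, weights, and threshold are invented for illustration.

```python
import re

# Hypothetical rows from newsvac_keywords: (regex, weight, tag)
keywords = [
    (re.compile(r"\blinux\b", re.I), 2.0, "os"),
    (re.compile(r"\bkernel\b", re.I), 1.5, "os"),
    (re.compile(r"\bsports?\b", re.I), -3.0, "offtopic"),
]

def score_story(plaintext, keywords):
    """Sum the weights of every keyword regex that matches the story text."""
    return sum(w for rx, w, tag in keywords if rx.search(plaintext))

def robosubmit(stories, keywords, threshold):
    """Return the stories whose score meets the threshold; the real
    robosubmit() would insert these into the submission bin."""
    return [s for s in stories if score_story(s, keywords) >= threshold]
```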
Is there a relationship between URLs and keywords ... ?
- url_info
The basic information describing a URL, including the URI, timestamps, status code, content type, what miner it belongs to, etc.
- url_content
The response header and cookies for a URL.
- url_analysis
Information about the URL's analysis: what parser was used, what miner was used, when it was analyzed, how long it took, how many nuggets were produced.
- url_message_body
The body of the URL.
- url_plaintext
The plaintext body of the URL.
- rel table
A parser, when creating a new URL, forms a relationship between the new URL and the parsed URL, and stores that relationship here. The table records which parser created the relationship, what type of URL the new URL is, etc.
- miner
Describes each miner, including the various regexes used to trim text and find stories.
- spider
Describes each spider, including the conditions, group_0 selects, and commands.
- robosubmitlock
- spiderlock
- spider_timespec
Something apparently hacked on so certain spiders would run only at certain times. This really should be changed so one spider can run an entire site efficiently, but there might not be time for that. I am going to reevaluate this when I find out more about how it works.
- nugget_sub
I don't know. I thought this just listed URLs that can be submitted (submitworthy), but rows are apparently inserted into it after submissions are created. I don't understand when it is used, or what its purpose is.
- newsvac_keywords
All the keywords, each consisting of a regular expression (which can be literal text), a relative floating-point weight, and the tag (group) it belongs to.
Relative priorities from 1 to 10 are given in parentheses. All of these should get done, but they will be done in order of priority so NewsVac can be "up and running" as soon as possible; other features and fixes can be finished later. Some of these items may already be completed, though I am not aware of any that are.
Add ability to "mine" RSS. Shouldn't be difficult to add; just add a new parse code, which will be called for certain content types. The trick will be making sure that the content type is correct. (5)
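A rough sketch of what such a content-type dispatch and RSS parse code might look like. The dispatch rule, function names, and field handling are assumptions for illustration, not the eventual implementation.

```python
import xml.etree.ElementTree as ET

def parse_rss(text):
    """Pull (url, title) pairs out of an RSS <item> list.  Minimal sketch;
    the real parse code would create nuggets from these pairs."""
    root = ET.fromstring(text)
    return [(item.findtext("link"), item.findtext("title"))
            for item in root.iter("item")]

def pick_parser(content_type, parsers):
    """Choose a parse code by content type, as the TODO item suggests.
    The matching rule here is a guess; content types in the wild vary."""
    if "xml" in content_type or "rss" in content_type:
        return parsers["parse_rss"]
    return parsers["parse_miner"]
```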
Trim actual story (plaintext) so we don't match all of page, just the story. Likely, this will happen in parse_plaintext; the trick is knowing how to trim each individual story, either having generic regexes (not likely) or specific ones per site, like with miners. But while miners have regexes attached to them, these things don't have any such reasonable relationship. I can attach the regex to the miner and climb the tree back up, and pick the first miner that matches, or the most recent one. (9)
Similar to above, also get first paragraph of story. (8)
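The climb-the-tree idea from the trimming item above, walking the rel chain from a story URL back up to the miner that produced it, could be sketched like this. The lookup callables stand in for database queries against the rel and miner tables; all names are hypothetical.

```python
def find_trim_regex(url_id, rel_parent, miner_of, trim_regex_of):
    """Walk the rel chain from a story URL up toward its origin and return
    the first attached miner's story-trim regex, or None.  The 'seen' set
    guards against cycles in the relationship data."""
    seen = set()
    cur = url_id
    while cur is not None and cur not in seen:
        seen.add(cur)
        miner = miner_of(cur)
        if miner is not None:
            return trim_regex_of(miner)
        cur = rel_parent(cur)
    return None
```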
Only check for stories that are new since the last spider run, not all stories. Also, possibly improve the code to avoid duplicates: often, titles change slightly, and so do URLs. Perhaps match site name plus paragraphs/title? Find a degree of matching? Consider using the code from admin.pl (which I don't think will work, since it has many false positives, which is fine for human editors, but this must be automated; perhaps it can be fine-tuned for automation). Check garbage collection and the efficiency of the table (would an index help?). (7)
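One possible shape for the site-plus-title duplicate check suggested above, using simple string similarity. The 0.85 cutoff and the record layout are arbitrary assumptions to be tuned, not NewsVac constants.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

def is_duplicate(a, b, cutoff=0.85):
    """Guess whether two {'url', 'title'} records describe the same story:
    same site, and titles similar past a tunable cutoff."""
    if urlparse(a["url"]).netloc != urlparse(b["url"]).netloc:
        return False
    ratio = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= cutoff
```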
Refine how NewsVac submissions are displayed in the submissions bin. Probably sufficient to make sure submissions are flagged as being from NewsVac, and then displayed separately, perhaps with a horizontal line below user-submitted stories. No need to sort by weight, but have a cutoff for the total score, of course. (6)
Allow different keyword sets to apply to different URLs. Assigned to miners, or spiders? (5)
Abstract out robosubmitting, allow for possibly emailing results, not just creating submissions. Defined per site, per spider, per miner? (3)
Test miners from the interface, somehow. I get the impression this is already working, though, at least to some degree. I don't quite understand what happens when a URL/miner is added/edited; something is going out and fetching URLs, but I don't know what parsers are being called, what is being put into the DB, etc. (8)
When submitting stories, properly populate the URL and Parent URL fields in section_extras (or any other fields they decide on). (8)
Related: perhaps make topical RSS feeds for Slash sites, not just sectional RSS feeds, which would make it so we could have NewsForge/Linux.com be the clearinghouse for NewsVac'd stories, and put those feeds into topics, letting each foundry pick up different applicable topics. Just a thought, but we need to figure out how to populate foundries soon. (7)
Revision 1.2 2002/09/04 20:29:58 pudge Added more information about spiders. Added basic information about the purpose of each DB table. Added TODO list.
Revision 1.1 2002/09/04 17:07:09 pudge Describe basic outline of NewsVac structure.
This document is being maintained by Chris Nandor <pudge@osdn.com>, with aid from Jamie McCarthy, Cliff Wood, Brian Aker, and Robin Miller.