
NAME

NewsVac Design Document

SYNOPSIS

NewsVac is a plugin for Slash that:

  • Mines web sites for news

  • Automatically submits news that matches keywords

The practical objective of NewsVac is an average of 120 NewsVac submissions per day, half of them postable stories, while taking less than an hour of an editor's time per day (a half hour in the morning, a half hour in the late afternoon). Of course, any improvement in the percentage of postable stories, or decrease in the time taken, is good.

DESCRIPTION

NewsVac is a powerful and complex system. It has several components.

URLs

The URL tables store all the URLs, whether they are URLs to be mined for stories, or they are the story URLs themselves.

Miners

Miners take the URLs assigned to them and mine them, looking for stories and creating nuggets that describe those stories.

Nuggets

Nuggets are special URLs that contain the URL, title, slug, and source of a mined story URL. The source is where the data came from (such as "OSDN"); the slug is some additional text apart from the title (like introtext). (Jamie: so source might be what NewsForge is calling a "Parent URL", such as "osdn.com"?)
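
How a nugget is actually serialized is not spelled out here; conceptually it is just these four fields packed into a single URL-shaped string. A minimal sketch of that idea in Perl, assuming a made-up nugget:// scheme and URI-escaped fields (the scheme and field layout are illustrative, not necessarily the real NewsVac encoding):

    use URI::Escape qw(uri_escape uri_unescape);

    # Pack the four nugget fields into one URL-shaped string.
    sub make_nugget {
        my (%field) = @_;    # url, title, slug, source
        return 'nugget://?' . join('&',
            map { "$_=" . uri_escape($field{$_} || '') }
                qw(url title slug source));
    }

    # Unpack a nugget string back into its fields.
    sub parse_nugget_string {
        my ($nugget) = @_;
        (my $query = $nugget) =~ s{^nugget://\?}{};
        return map { my ($k, $v) = split /=/, $_, 2; ($k, uri_unescape($v)) }
               split /&/, $query;
    }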

Parsers

Parsers operate in turn on specific types of data. There are currently three parsers.

parse_miner

This parser is used on the data fetched for the miners' URLs. It parses the incoming data (using various regexes to trim the page, and to find multiple stories on the page), and creates nuggets.
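
A minimal sketch of that flow, assuming each miner carries pre/post trim regexes and a story-extraction regex with two captures (href, then title); the field names, and make_nugget from the sketch above, are illustrative:

    # Trim the page down to the part that lists stories, then pull out
    # each story link and title and turn the pair into a nugget.
    sub mine_page {
        my ($miner, $html) = @_;
        $html =~ s/$miner->{pre_regex}//s  if $miner->{pre_regex};   # cut leading boilerplate
        $html =~ s/$miner->{post_regex}//s if $miner->{post_regex};  # cut trailing boilerplate

        my @nuggets;
        while ($html =~ /$miner->{story_regex}/g) {
            push @nuggets, make_nugget(
                url    => $1,    # captured href
                title  => $2,    # captured link text
                slug   => '',
                source => $miner->{source},
            );
        }
        return @nuggets;
    }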

parse_nugget

This parser takes the nuggets and creates data (???) to be passed to the next parser.

parse_plaintext

This parser takes the stories and pulls out their text, to later be matched against keywords and possibly submitted.
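
A rough sketch of the extraction step, assuming a crude tag-stripping approach (the real parser presumably does better, and the TODO below covers trimming down to just the story):

    # Reduce an HTML story page to whitespace-normalized plain text.
    sub html_to_plaintext {
        my ($html) = @_;
        $html =~ s{<(script|style)\b.*?</\1>}{}gis;  # drop script/style blocks
        $html =~ s{<[^>]*>}{ }gs;                    # strip remaining tags
        $html =~ s/&nbsp;/ /g;                       # a few common entities
        $html =~ s/&amp;/&/g;
        $html =~ s/\s+/ /g;                          # collapse whitespace
        return $html;
    }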

Spiders

Spiders control the whole flow of processing, executing the miners and calling each parser in order. Each spider is a data structure of conditions and instructions.

For each stage of the spidering, a group of URLs is fetched from the database. Each URL is then processed in two steps: (1) it is requested, and (2) it is analyzed.

To request a URL is to fetch it, either from the web, or, in the case of nuggets, by extracting the data from the nugget itself. Here, the url_info and url_content tables are updated.

To analyze a URL is to process its data with the right parsers. Here, the url_analysis table is updated (along with url_content for the plaintext data, and rel to store links, though I am not sure what that means in this context).
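
Putting the two steps together, one stage of a spider run looks roughly like the following sketch; the helper names are placeholders, since the real code is driven by the spider's conditions and instructions:

    # One stage of a spider run: fetch the stage's URL group from the
    # database, then request and analyze each URL in turn.
    for my $url (@{ fetch_url_group($spider, $stage) }) {
        # Step 1: request -- fetch from the web, or unpack the nugget itself.
        my $content = $url->{is_nugget}
            ? content_from_nugget($url)    # data lives in the nugget string
            : fetch_from_web($url);        # HTTP GET; updates url_info, url_content
        # Step 2: analyze -- run the right parser for this stage's data.
        analyze_url($url, $content);       # updates url_analysis (and url_content, rel)
    }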

Keywords

Keywords are words, grouped together by tags, that are weighed against the stories parsed by parse_plaintext. robosubmit() goes through and matches the stories against the keywords and, if a story meets a certain threshold, submits it to the submission bin.
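
A sketch of the scoring idea, assuming an in-memory list of keyword entries shaped like the newsvac_keywords rows described below (regex, weight, tag); the threshold handling and helper names are illustrative:

    # Score a story's plaintext against the weighted keyword regexes
    # and submit it if the total crosses the threshold.
    sub robosubmit_one {
        my ($story, $keywords, $threshold) = @_;
        my $score = 0;
        for my $kw (@$keywords) {
            # count matches of this keyword's regex in the plaintext
            my $hits = () = $story->{plaintext} =~ /$kw->{regex}/gi;
            $score += $hits * $kw->{weight};
        }
        create_submission($story) if $score >= $threshold;
        return $score;
    }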

Is there a relationship between URLs and keywords ... ?

Tables

url_info

The basic information describing a URL, including the URI, timestamps, status code, content type, what miner it belongs to, etc.

url_content

The response header and cookies for a URL.

url_analysis

Information about the URL's analysis: what parser was used, what miner was used, when it was analyzed, how long it took, how many nuggets were produced.

url_message_body

The body of the URL.

url_plaintext

The plaintext body of the URL.

rel table

A parser, when creating a new URL, forms a relationship between the new URL and the parsed URL, and stores that relationship here. It describes what parser created the relationship, what type of URL the new URL is, etc.

miner

Describes each miner, including the various regexes used to trim text and find stories.

spider

Describes each spider, including the conditions, group_0 selects, and commands.

robosubmitlock
spiderlock
spider_timespec

Something apparently hacked on so that certain spiders run only at certain times. This really should be changed so that one spider can run an entire site efficiently, but there might not be time for that. I am going to reevaluate this when I find out more about how it works.

nugget_sub

I don't know. I thought this just listed URLs that can be submitted (submitworthy), but rows are apparently inserted into the table after submissions are created. I don't understand when it is used, or what its purpose is.

newsvac_keywords

All the keywords, each consisting of a regular expression (which can be literal text), a relative weight in floating point, and the tag (group) it belongs to.

TODO

Relative priorities from 1 to 10 are in parentheses. All of these should get done, but they will be done in order of priority so NewsVac can be "up and running" as soon as possible; other features and fixes can be finished later. Some of these items may already be completed, though I am not aware of any that are.

  • Add ability to "mine" RSS. Shouldn't be difficult to add; just add a new parse code, which will be called for certain content types. The trick will be making sure that the content type is correct (see the sketch after this list). (5)

  • Trim actual story (plaintext) so we don't match all of page, just the story. Likely, this will happen in parse_plaintext; the trick is knowing how to trim each individual story, either having generic regexes (not likely) or specific ones per site, like with miners. But while miners have regexes attached to them, these things don't have any such reasonable relationship. I can attach the regex to the miner and climb the tree back up, and pick the first miner that matches, or the most recent one. (9)

  • Similar to above, also get first paragraph of story. (8)

  • Only check for stories that are new since the last spider run, not all stories. Also, possibly improve the code to not get duplicates: often, titles change slightly, and so do URLs. Perhaps match site name + paragraphs/title? Find a degree of matching? Consider using code from admin.pl (which I don't think will work as-is, since it has many, many false positives; that is fine for human editors, but this must be automated; perhaps it can be fine-tuned to be better for automation). Check garbage collection, and the efficiency of the table (would an index help?). (7)

  • Refine how NewsVac submissions are displayed in the submissions bin. Probably sufficient to make sure submissions are flagged as being from NewsVac, and then displayed separately, perhaps with a horizontal line below user-submitted stories. No need to sort by weight, but have a cutoff for the total score, of course. (6)

  • Allow different keyword sets to apply to different URLs. Assigned to miners, or spiders? (5)

  • Abstract out robosubmitting, allow for possibly emailing results, not just creating submissions. Defined per site, per spider, per miner? (3)

  • Test miners from the interface, somehow. I get the impression this is already working, though, at least to some degree. I don't quite understand what happens when a URL/miner is added/edited; something is going out and fetching URLs, but I don't know what parsers are being called, what is being put into the DB, etc. (8)

  • When submitting stories, properly populate the URL and Parent URL fields in section_extras (or any other fields they decide on). (8)

  • Related: perhaps make topical RSS feeds for Slash sites, not just sectional RSS feeds, which would let NewsForge/Linux.com be the clearinghouse for NewsVac'd stories; we could put those feeds into topics, letting each foundry pick up different applicable topics. Just a thought, but we need to figure out how to populate foundries soon. (7)
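
For the RSS item in the list above, a minimal sketch of what the new parse code might do, using the CPAN XML::RSS module; the dispatch on content type is the open question, so this shows only the feed-to-nuggets step (reusing the illustrative make_nugget from earlier):

    use XML::RSS;

    # Turn an RSS feed body into nuggets, one per item.
    sub mine_rss {
        my ($miner, $feed_text) = @_;
        my $rss = XML::RSS->new;
        $rss->parse($feed_text);    # dies on malformed feeds; real code would eval {} this
        return map { make_nugget(
            url    => $_->{link},
            title  => $_->{title},
            slug   => $_->{description} || '',
            source => $miner->{source},
        ) } @{ $rss->{items} };
    }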

CHANGES

$Log$

Revision 1.3  2002/09/06 20:17:04  pudge
Added brief objective. Added more information about miners, nuggets, spidering; checking for duplicates; display in submissions bin. Added to TODO: testing miners from interface; populating URL and URL Parent (section_extra) fields; topical RSS feeds.

Revision 1.2  2002/09/04 20:29:58  pudge
Added more information about spiders. Added basic information about the purpose of each DB table. Added TODO list.

Revision 1.1  2002/09/04 17:07:09  pudge
Describe basic outline of NewsVac structure.

AUTHOR

This document is being maintained by Chris Nandor <pudge@osdn.com>, with aid from Jamie McCarthy, Cliff Wood, Brian Aker, and Robin Miller.

VERSION

$Id$