Added brief objective.
Added more information about miners, nuggets, spidering; checking for
  duplicates; display in submissions bin.
Added to TODO: testing miners from interface; populating URL and URL
  Parent (section_extra) fields; topical RSS feeds.
pudge committed Sep 6, 2002
1 parent cbf02c9 commit bcbdf38
Showing 1 changed file with 30 additions and 7 deletions: plugins/NewsVac/newsvac.pod
Automatically submits news that matches keywords

=back

The practical objective of NewsVac is to average 120 NewsVac submissions per day, with half of them being postable stories, while taking less than an hour of an editor's time per day (a half hour in the morning, a half hour in the late afternoon). Of course, any improvement in the percentage of postable stories, and any decrease in the time required, is good.


=head1 DESCRIPTION

The URL tables store all the URLs, whether they are URLs to be mined for stories, or the URLs of the stories themselves.

=head2 Miners

Miners take the URLs assigned to them and mine them, looking for stories and creating nuggets that describe those stories.
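
As a rough illustration only, a miner boils down to a set of regexes run over a fetched page. Everything below (the function name, the pattern, the nugget fields) is invented for the sketch, not taken from the actual NewsVac code; the real regexes live in the miner table, described below.

  # Hypothetical miner: scan a page for story links and emit nuggets.
  sub mine_page {
      my ($html, $source) = @_;
      my @nuggets;
      # Invented pattern; a real miner's regexes are tuned per site
      while ($html =~ m{<a\s+href="([^"]+)"[^>]*>([^<]+)</a>}gi) {
          push @nuggets, {
              url    => $1,
              title  => $2,
              source => $source,   # e.g. "OSDN"
          };
      }
      return @nuggets;
  }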


=head2 Nuggets

Nuggets are special URLs that contain the URL, title, slug, and source of a mined URL. The source is where the data came from (such as "OSDN"); the slug is some additional text apart from the title (like introtext). (Jamie: so the source might be what NewsForge is calling a "Parent URL", such as "osdn.com"?)
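
The actual nugget encoding is not documented here; purely to illustrate the idea of packing URL, title, slug, and source into a single URL-like string, here is a sketch with an invented C<nugget://> scheme:

  use URI::Escape qw(uri_escape uri_unescape);

  my @FIELDS = qw(url title slug source);

  # Pack a mined story into one nugget "URL" (invented format)
  sub nugget_encode {
      my ($n) = @_;
      return 'nugget://?' . join '&',
          map { "$_=" . uri_escape(defined $n->{$_} ? $n->{$_} : '') }
          @FIELDS;
  }

  # Requesting a nugget just unpacks it; no web fetch is needed
  sub nugget_decode {
      my ($nugget) = @_;
      my %n;
      for my $pair (split /&/, (split /\?/, $nugget)[1]) {
          my ($k, $v) = split /=/, $pair, 2;
          $n{$k} = uri_unescape($v);
      }
      return \%n;
  }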


=head2 Parsers
This parser takes the stories and pulls out the text, to later be matched against keywords.

Spiders control the whole flow of processing, executing the miners and calling each parser in order. A spider is a data structure of conditions and instructions.
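
For flavor only, a spider's overall shape might look something like the following. Every field name here is made up; the real conditions, group_0 selects, and commands live in the spider table described below.

  # Invented shape; see the spider table below for the real fields.
  my %spider = (
      name       => 'sample_spider',
      conditions => [ 'spider_timespec says go', 'spiderlock is free' ],
      commands   => [
          # each stage: select a group of URLs, then process each one
          { select  => 'URLs assigned to this miner group' },
          { process => 'request each URL, then analyze it' },
      ],
  );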

For each stage of the spidering, a group of URLs is fetched from the database. Each URL is then processed, that is: (1) requested, and (2) analyzed.

To request a URL is to fetch it, either from the web, or, in the case of nuggets, by extracting the data from the nugget itself. Here, the url_info and url_content tables are updated.

To analyze the URL is to process its data with the right parsers. Here, the url_analysis table is updated, along with url_content for the plaintext data, and rel to store links (though I am not sure what that means in this context).
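
Putting the two steps together, the per-URL flow is roughly the following. All names are placeholders, and the stubs just stand in for the real request/analyze code:

  my @group = ('http://example.com/news');   # one stage's group of URLs

  sub request_url {                # step 1: fetch from the web, or
      my ($url) = @_;              # unpack a nugget; updates url_info
      return "<html>...</html>";   # and url_content
  }

  sub analyze_url {                # step 2: run the right parsers;
      my ($url, $data) = @_;       # updates url_analysis, url_content,
      print "analyzed $url\n";     # and rel
  }

  for my $url (@group) {
      my $data = request_url($url);
      analyze_url($url, $data);
  }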


=head2 Keywords
Describes each miner, including the various regexes used to trim text and find stories.
Describes each spider, including the conditions, group_0 selects, and commands.

=item robosubmitlock

=item spiderlock

=item spider_timespec

Something apparently hacked on so certain spiders would only run at certain times. This really should be changed so one spider can run an entire site, efficiently. But there might not be time for that. I am going to reevaluate this when I find out more about how it works.
Similar to above, also get first paragraph of story. (8)

=item *

Only check for stories that are new since the last spider run, not all stories. Also, possibly improve the code to not get duplicates: often titles change slightly, and so do URLs. Perhaps match site name plus paragraphs/title? Find a degree of matching (see the sketch after this list)? Consider using the code from admin.pl (which I don't think will work as-is, since it has many false positives, which is fine for human editors, but this must be automated; perhaps it can be fine-tuned for automation). Check garbage collection, and the efficiency of the table (would an index help?). (7)

=item *

Refine how NewsVac submissions are displayed in the submissions bin. Probably sufficient to make sure submissions are flagged as being from NewsVac, and then displayed separately, perhaps with a horizontal line below user-submitted stories. No need to sort by weight, but have a cutoff for the total score, of course. (6)

=item *

Allow different keyword sets to apply to different URLs. Assigned to miners, or

Abstract out robosubmitting, allow for possibly emailing results, not just creating submissions. Defined per site, per spider, per miner? (3)

=item *

Test miners from the interface, somehow. I get the impression this is already working, though, at least to some degree. I don't quite understand what happens when a URL/miner is added/edited; something is going out and fetching URLs, but I don't know what parsers are being called, what is being put into the DB, etc. (8)

=item *

When submitting stories, properly populate the URL and Parent URL fields in section_extras (or any other fields they decide on). (8)

=item *

Related: perhaps make I<topical> RSS feeds for Slash sites, not just sectional RSS feeds, so that NewsForge/Linux.com could be the clearinghouse for NewsVac'd stories; put those feeds into topics, letting each foundry pick up the applicable topics. Just a thought, but we need to figure out how to populate foundries soon. (7)

=back
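
For the duplicate-checking TODO item above, here is one possible shape for a "degree of matching" test. Every name and the threshold are guesses, not anything taken from NewsVac or admin.pl:

  # Crude token-overlap (Jaccard) similarity on site + title/paragraph.
  sub looks_like_dupe {
      my ($old, $new) = @_;
      return 0 unless $old->{site} eq $new->{site};
      my %a = map { lc($_) => 1 } split /\W+/, "$old->{title} $old->{para}";
      my %b = map { lc($_) => 1 } split /\W+/, "$new->{title} $new->{para}";
      my $both = grep { $b{$_} } keys %a;         # tokens in common
      my $all  = keys(%a) + keys(%b) - $both;     # tokens in either
      return $all ? ( $both / $all > 0.6 ) : 0;   # 0.6 cutoff is arbitrary
  }

  # For example, these two would count as duplicates:
  # looks_like_dupe(
  #     { site => 'osdn.com', title => 'Foo 1.0 ships', para => '...' },
  #     { site => 'osdn.com', title => 'Foo ships 1.0', para => '...' },
  # );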

=head1 CHANGES

$Log$
Revision 1.3 2002/09/06 20:17:04 pudge
Added brief objective.
Added more information about miners, nuggets, spidering; checking for
duplicates; display in submissions bin.
Added to TODO: testing miners from interface; populating URL and URL
Parent (section_extra) fields; topical RSS feeds.

Revision 1.2 2002/09/04 20:29:58 pudge
Added more information about spiders.
Added basic information about the purpose of each DB table.
Describe basic outline of NewsVac structure.

=head1 AUTHOR

This document is being maintained by Chris Nandor E<lt>pudge@osdn.comE<gt>, with aid from Jamie McCarthy, Cliff Wood, Brian Aker, and Robin Miller.


=head1 VERSION
