Added brief objective.
Added more information about miners, nuggets, spidering; checking for
  duplicates; display in submissions bin.
Added to TODO: testing miners from interface; populating URL and URL
  Parent (section_extra) fields; topical RSS feeds.
pudge committed Sep 6, 2002
1 parent cbf02c9 commit bcbdf38
Showing 1 changed file with 30 additions and 7 deletions: plugins/NewsVac/newsvac.pod
Automatically submits news that matches keywords

=back

The practical objective of NewsVac is to average 120 NewsVac submissions per day, with half of them being postable stories, while taking less than an hour of an editor's time per day (a half hour in the morning, a half hour in the late afternoon). Of course, any improvement in the percentage of postable stories, and any decrease in the time required, is good.


=head1 DESCRIPTION

The URL tables store all the URLs, whether they are URLs to be mined for stories, or the URLs of the stories themselves.

=head2 Miners

Miners take the URLs assigned to them and mine them, looking for stories and creating nuggets that describe those stories.
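
As a rough illustration only, a miner boils down to a set of regexes run over a fetched page. Everything below (the function name, the pattern, the nugget fields) is invented for the sketch, not taken from the actual NewsVac code; the real regexes live in the miner table, described below.

  # Hypothetical miner: scan a page for story links and emit nuggets.
  sub mine_page {
      my ($html, $source) = @_;
      my @nuggets;
      # Invented pattern; a real miner's regexes are tuned per site
      while ($html =~ m{<a\s+href="([^"]+)"[^>]*>([^<]+)</a>}gi) {
          push @nuggets, {
              url    => $1,
              title  => $2,
              source => $source,   # e.g. "OSDN"
          };
      }
      return @nuggets;
  }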


=head2 Nuggets

Nuggets are special URLs that contain the URL, title, slug, and source of a mined URL. The source is where the data came from (such as "OSDN"); the slug is some additional text apart from the title (like introtext). (Jamie: so the source might be what NewsForge is calling a "Parent URL", such as "osdn.com"?)
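
The actual nugget encoding is not documented here; purely to illustrate the idea of packing URL, title, slug, and source into a single URL-like string, here is a sketch with an invented C<nugget://> scheme:

  use URI::Escape qw(uri_escape uri_unescape);

  my @FIELDS = qw(url title slug source);

  # Pack a mined story into one nugget "URL" (invented format)
  sub nugget_encode {
      my ($n) = @_;
      return 'nugget://?' . join '&',
          map { "$_=" . uri_escape(defined $n->{$_} ? $n->{$_} : '') }
          @FIELDS;
  }

  # Requesting a nugget just unpacks it; no web fetch is needed
  sub nugget_decode {
      my ($nugget) = @_;
      my %n;
      for my $pair (split /&/, (split /\?/, $nugget)[1]) {
          my ($k, $v) = split /=/, $pair, 2;
          $n{$k} = uri_unescape($v);
      }
      return \%n;
  }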


=head2 Parsers
This parser takes the stories and pulls out the text, to later be matched against keywords.

Spiders control the whole flow of processing, executing the miners and calling each parser in order. A spider is a data structure of conditions and instructions.
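
For flavor only, a spider's overall shape might look something like the following. Every field name here is made up; the real conditions, group_0 selects, and commands live in the spider table described below.

  # Invented shape; see the spider table below for the real fields.
  my %spider = (
      name       => 'sample_spider',
      conditions => [ 'spider_timespec says go', 'spiderlock is free' ],
      commands   => [
          # each stage: select a group of URLs, then process each one
          { select  => 'URLs assigned to this miner group' },
          { process => 'request each URL, then analyze it' },
      ],
  );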

For each stage of the spidering, a group of URLs is fetched from the database. Each URL is then processed, that is: (1) requested, and (2) analyzed.

To request a URL is to fetch it, either from the web, or, in the case of nuggets, by extracting the data from the nugget itself. Here, the url_info and url_content tables are updated.

To analyze the URL is to process its data with the right parsers. Here, the url_analysis table is updated, along with url_content for the plaintext data, and rel to store links (though I am not sure what that means in this context).
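
Putting the two steps together, the per-URL flow is roughly the following. All names are placeholders, and the stubs just stand in for the real request/analyze code:

  my @group = ('http://example.com/news');   # one stage's group of URLs

  sub request_url {                # step 1: fetch from the web, or
      my ($url) = @_;              # unpack a nugget; updates url_info
      return "<html>...</html>";   # and url_content
  }

  sub analyze_url {                # step 2: run the right parsers;
      my ($url, $data) = @_;       # updates url_analysis, url_content,
      print "analyzed $url\n";     # and rel
  }

  for my $url (@group) {
      my $data = request_url($url);
      analyze_url($url, $data);
  }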


=head2 Keywords
Describes each miner, including the various regexes used to trim text and find stories.
Describes each spider, including the conditions, group_0 selects, and commands.

=item robosubmitlock

=item spiderlock

=item spider_timespec

Something apparently hacked on so certain spiders would only run at certain times. This really should be changed so one spider can run an entire site, efficiently. But there might not be time for that. I am going to reevaluate this when I find out more about how it works.
Similar to above, also get first paragraph of story. (8)

=item *

Only check for stories that are new since the last spider run, not all stories. Also, possibly improve the code to not get duplicates: often titles change slightly, and so do URLs. Perhaps match site name plus paragraphs/title? Find a degree of matching (see the sketch after this list)? Consider using the code from admin.pl (which I don't think will work as-is, since it has many false positives, which is fine for human editors, but this must be automated; perhaps it can be fine-tuned for automation). Check garbage collection, and the efficiency of the table (would an index help?). (7)

=item *

Refine how NewsVac submissions are displayed in the submissions bin. Probably sufficient to make sure submissions are flagged as being from NewsVac, and then displayed separately, perhaps with a horizontal line below user-submitted stories. No need to sort by weight, but have a cutoff for the total score, of course. (6)

=item *

Allow different keyword sets to apply to different URLs. Assigned to miners, or

Abstract out robosubmitting, allow for possibly emailing results, not just creating submissions. Defined per site, per spider, per miner? (3)

=item *

Test miners from the interface, somehow. I get the impression this is already working, though, at least to some degree. I don't quite understand what happens when a URL/miner is added/edited; something is going out and fetching URLs, but I don't know what parsers are being called, what is being put into the DB, etc. (8)

=item *

When submitting stories, properly populate the URL and Parent URL fields in section_extras (or any other fields they decide on). (8)

=item *

Related: perhaps make I<topical> RSS feeds for Slash sites, not just sectional RSS feeds, so that NewsForge/Linux.com could be the clearinghouse for NewsVac'd stories; put those feeds into topics, letting each foundry pick up the applicable topics. Just a thought, but we need to figure out how to populate foundries soon. (7)

=back
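
For the duplicate-checking TODO item above, here is one possible shape for a "degree of matching" test. Every name and the threshold are guesses, not anything taken from NewsVac or admin.pl:

  # Crude token-overlap (Jaccard) similarity on site + title/paragraph.
  sub looks_like_dupe {
      my ($old, $new) = @_;
      return 0 unless $old->{site} eq $new->{site};
      my %a = map { lc($_) => 1 } split /\W+/, "$old->{title} $old->{para}";
      my %b = map { lc($_) => 1 } split /\W+/, "$new->{title} $new->{para}";
      my $both = grep { $b{$_} } keys %a;         # tokens in common
      my $all  = keys(%a) + keys(%b) - $both;     # tokens in either
      return $all ? ( $both / $all > 0.6 ) : 0;   # 0.6 cutoff is arbitrary
  }

  # For example, these two would count as duplicates:
  # looks_like_dupe(
  #     { site => 'osdn.com', title => 'Foo 1.0 ships', para => '...' },
  #     { site => 'osdn.com', title => 'Foo ships 1.0', para => '...' },
  # );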

=head1 CHANGES

$Log$
Revision 1.3 2002/09/06 20:17:04 pudge
Added brief objective.
Added more information about miners, nuggets, spidering; checking for
duplicates; display in submissions bin.
Added to TODO: testing miners from interface; populating URL and URL
Parent (section_extra) fields; topical RSS feeds.

Revision 1.2 2002/09/04 20:29:58 pudge
Added more information about spiders.
Added basic information about the purpose of each DB table.
Describe basic outline of NewsVac structure.

=head1 AUTHOR

This document is being maintained by Chris Nandor E<lt>pudge@osdn.comE<gt>, with aid from Jamie McCarthy, Cliff Wood, Brian Aker, and Robin Miller.


=head1 VERSION
