Skip to content
This repository has been archived by the owner on Mar 7, 2021. It is now read-only.

Problems with removeFetchCondition & removeDownloadCondition #377

Closed
venning opened this issue Jul 12, 2017 · 1 comment
Closed

Problems with removeFetchCondition & removeDownloadCondition #377

venning opened this issue Jul 12, 2017 · 1 comment

Comments

@venning
Copy link
Contributor

venning commented Jul 12, 2017

There are a few bugs with removeFetchCondition and removeDownloadCondition. I'm submitting a pull request with this issue that resolves those bugs. Feel free to ignore it or change it or ask me to change it.

Typo

First, removeFetchCondition has a typo on this line. The relevant block:

if (index instanceof Function) {
    var itemIndex = crawler._downloadConditions.indexOf(index);
    return Boolean(crawler._fetchConditions.splice(itemIndex, 1));
} else if (typeof index === "number") {
    return Boolean(crawler._fetchConditions.splice(index, 1));
}

throw new Error("Unable to find indexed fetch condition");

That second line should obviously be _fetchConditions not _downloadConditions.

Errors never throw

But that begs the question "Why are the tests passing?" That's due to a few logic errors in that block (and the corresponding block of removeDownloadCondition).

In the test, itemIndex gets a value of -1 from indexOf (since the function is not found in _downloadConditions) which is then passed to splice which interprets it as "one less than the end of the array" which is coincidentally the correct element (since it's a one element array). It then removes it successfully and returns, providing a false positive.

The remove methods' JSDocs both say:

If the removal was successful, the method will return true. Otherwise, it will throw an error.

Assuming that this is the contract you want to keep, the remove methods are not functioning correctly. The methods will never throw an error if passed a function. They will also never throw an error if passed a number. This is because splice will always successfully return an array, even if it's splicing on an empty array or if the index is negative. To keep the contract, index bounds checks need to be added to both the function parameter block and the index parameter block.

I propose the following for removeFetchCondition (and similar for removeDownloadCondition):

if (index instanceof Function) {
    var itemIndex = crawler._fetchConditions.indexOf(index);
    if (itemIndex !== -1) {
        crawler._fetchConditions.splice(itemIndex, 1);
        return true;
    }
} else if (typeof index === "number") {
    if (index >= 0 && index < crawler._fetchConditions.length) {
        crawler._fetchConditions.splice(index, 1);
        return true;
    }
}

throw new Error("Unable to find indexed fetch condition");

Now, if removeFetchCondition is passed a function which is not a fetch condition, or if it is passed a number which is outside of the bounds of _fetchConditions, it will throw an error. (I broke up the index if clause to make it more logically clear what's going on.)

Note that splice will always return an array, which will always evaluate as truthy. There's no need for a Boolean coercion, since that will always be true, which is what the method contract states anyways.

I've added four new mocha tests to check that errors are getting thrown. They fail for the current versions of removeFetchCondition and removeDownloadCondition but pass for these versions. I'm not sure I've styled the tests how you would prefer; feel free to correct them or have me change them.

Volatile conditionIDs

There is a separate logic problem here: Due to the use of splice, the "conditionID" which is returned by addFetchCondition (and addDownloadCondition) is an index number which may or may not remain correct after a call to removeFetchCondition. Calls to removeFetchCondition invalidate existing conditionIDs for conditions that had been added after the one that just got removed had been added. For example, the following code should work logically but won't:

var conditionID1 = myCrawler.addFetchCondition(function() {}); // = 0
var conditionID2 = myCrawler.addFetchCondition(function() {}); // = 1

myCrawler.removeFetchCondition(conditionID1); // SUCCESS: removing @ 0, now length is 1
myCrawler.removeFetchCondition(conditionID2); // ERROR THROWN: removing @ 1, which doesn't exist

If you reversed the calls to removeFetchCondition it would succeed, which may be a counter-intuitive gotcha for users. Additionally, with more than two conditions you could easily receive a success from removeFetchCondition while trying to remove one condition when it actually removes the wrong condition. For example:

var conditionID1 = myCrawler.addFetchCondition(function() {}); // = 0
var conditionID2 = myCrawler.addFetchCondition(function() {}); // = 1
var conditionID3 = myCrawler.addFetchCondition(function() {}); // = 2

myCrawler.removeFetchCondition(conditionID1); // SUCCESS: removing @ 0, now length is 2
myCrawler.removeFetchCondition(conditionID2); // SUCCESS BUT WRONG: removing @ 1, which is actually condition3

// myCrawler._fetchConditions now only contains condition2, which isn't what we wanted

Also notice that a subsequent call to addFetchCondition would return a conditionID of 1, which we already used. This is a common problem which can be easily fixed by using a "removed condition" placeholder object in the underlying arrays to preserve the accuracy of other conditionIDs, and then testing for (and ignoring) those placeholders in the async.every calls. Proposed change to removeFetchCondition:

if (index instanceof Function) {
    var itemIndex = crawler._fetchConditions.indexOf(index);
    if (itemIndex !== -1) {
        crawler._fetchConditions[itemIndex] = REMOVED_CONDITION;
        return true;
    }
} else if (typeof index === "number") {
    if (index >= 0 && index < crawler._fetchConditions.length) {
        crawler._fetchConditions[index] = REMOVED_CONDITION;
        return true;
    }
}

throw new Error("Unable to find indexed fetch condition");

Where REMOVED_CONDITION is defined once in the file to be a unique object for identification purposes. Proposed change to the actual fetch condition calls:

async.every(crawler._fetchConditions, function(fetchCondition, callback) {
    if (fetchCondition === REMOVED_CONDITION) {
        callback(null, true);
    } else if (fetchCondition.length < 3) {
        try {
            callback(null, fetchCondition(queueItem, referrer));
        } catch (error) {
            callback(error);
        }
    } else {
        fetchCondition(queueItem, referrer, callback);
    }
}, function(error, result) {
// ...

And similar for download conditions.

This method does make the async.every calls very slightly slower by adding overhead when conditions have been removed. If you're willing to add two new data structures with each crawler object, you could do this without the overhead while still maintaining a full list of all previously-assigned conditionIDs (which is the underlying problem). I wanted to restrain myself from modifying the codebase too much. If you'd like to see that implementation, I can provide that.

There is a slight complication with fixing it this way: the length of the underlying _fetchConditions does not change, which affects the other mocha tests for removing conditions because they are accessing the underlying array directly. I've changed those tests to no longer rely on the length of the arrays but instead on their actual contents. I've also added four new tests to check that multiple conditions can be removed correctly, both by ID and by reference.

If you eventually want to add a getFetchConditionCount() method, you would want to either keep track of how many removals are called per crawler object and subtract that from _fetchConditions.length or iterate through the array and count REMOVED_CONDITION objects, the first option being O(1) and probably preferable.

Multiple removals

Attempting to remove the same function twice will throw an error, but attempting to remove the same conditionID twice will not. This was not a problem before I made the above changes, since the methods would never error (as long as the input was a function or number). To maintain the original contract, I changed removeFetchCondition to throw if the conditionID has already been removed:

if (index instanceof Function) {
    var itemIndex = crawler._fetchConditions.indexOf(index);
    if (itemIndex !== -1) {
        crawler._fetchConditions[itemIndex] = REMOVED_CONDITION;
        return true;
    }
} else if (typeof index === "number") {
    if (index >= 0 && index < crawler._downloadConditions.length) {
        if (crawler._downloadConditions[index] !== REMOVED_CONDITION) {
            crawler._downloadConditions[index] = REMOVED_CONDITION;
            return true;
        }
    }
}

throw new Error("Unable to find indexed fetch condition");

And similar for removeDownloadCondition.

Again, I nested if clauses because I didn't know how you would want to style what would be a really long line, and it makes it more obvious what is happening. This also gives you space to add line comments explaining why each clause exists. Really, you could replace the var crawler = this; statement at the beginning of the method with var conditions = this._fetchConditions; and shorten the rest of the code in the method, but that's your call.

I added four tests for checking that it throws when removing a condition twice. At this point, I'd added so many tests, I created sub-sections within the tests to make it easier to read the output.

There are more tests you could write, like adding->removing->adding and then testing if the new conditionID is a repeat of the first, but that just seems like overkill.

Documentation

Unrelated to these changes, the JSDoc for the "index" parameters should probably also change to reflect the fact that they can accept either a condition reference or an index. It would also probably make sense to change the argument name from index to something that more accurately reflects its dual purpose; but, really, this is a minor point. I have not done either of these things.

@fredrikekelund
Copy link
Collaborator

Closing this since #378 was merged. Thanks again for your great bug report! 👍

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants