Not thread safe? #42

Closed
intellisense opened this issue Feb 24, 2014 · 15 comments
@intellisense
Contributor

Just for testing I have set the feed max_length=1000 and merge_max_length=1000. Why? Because duplicate aggregated activities are not an option for me in a newsfeed (on the Facebook newsfeed you will never see a duplicate activity). So I ran a program which reads existing user actions from the database and generates feeds using celery (async tasks).

I was expecting that max_length=merge_max_length would never create a duplicate aggregated activity, but the program issued so many celery tasks that in the end there were some duplicate aggregated activities. Why did that happen?

Why does the merge_max_length variable even exist? I thought we were storing activities in redis as key/value pairs where the key is a group, so why is there a need to traverse all aggregated activities to find where a new aggregated activity belongs? A new aggregated activity already has its group defined, so why not do something like a Python dict.get(group)?

@tschellenbach
Owner

Thread safety
The thread safety of aggregated and notification feeds depends on the Redis.lock functionality.
I can think of two ways this could go wrong:

  • Subclassing or other customized functionality not taking locking into account
  • Redis running out of memory (at which point locking stops working)

Merge max length
The problem is that we use a sorted set in Redis. See http://redis.io/topics/data-types#sorted-sets
So the thing you sort by is also what's used for lookups. In Python terms the dict looks more like: {time: value, time2: value2}
That means we can't simply look up the right aggregated activity; we need to loop through the activities. (Another reason is that you can also build aggregation logic which is more complicated than group key -> activity.)
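A minimal sketch of why that forces a scan, simulating the Redis sorted set with a plain Python dict rather than calling redis-py (the data and names here are hypothetical, not feedly's actual serialization):

```python
import json

# Simulated Redis sorted set: member (serialized aggregated activity) -> score
# (time). This mirrors ZADD/ZRANGE semantics; the group lives *inside* the
# serialized member, so it is not usable as a lookup key.
sorted_set = {
    json.dumps({"group": "likes-2014-02-24", "ids": [1, 2]}): 1393200000.0,
    json.dumps({"group": "follows-2014-02-24", "ids": [3]}): 1393201000.0,
}

def find_by_group(zset, group):
    # The only way to find an aggregate by group is to deserialize and
    # inspect every member -- hence the merge_max_length scan depth.
    for member in zset:
        if json.loads(member)["group"] == group:
            return member
    return None

match = find_by_group(sorted_set, "follows-2014-02-24")
```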

Solutions
Redis is really fast; I expect that even with 1000 items you can still have a reasonably performant system.

Alternatively, Cassandra supports secondary indexes (really hard to set up and maintain).

Maybe you could store your own index in Redis so you only need two quick lookups. (Though the 2 sequential lookups might be slower than just looping through 1000 rows; it depends a bit on the size of the row and the latency to Redis.)
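A sketch of that secondary-index idea, simulated with plain dicts instead of real Redis calls (in Redis these would be, say, a hash mapping group -> member kept alongside the sorted set; all names here are hypothetical):

```python
# Hypothetical secondary index: group -> serialized member, maintained next to
# the feed so a group lookup is O(1) instead of a scan.
feed = {}         # member -> score (stands in for the Redis sorted set)
group_index = {}  # group -> member (stands in for a Redis hash)

def add_aggregated(member, group, score):
    stale = group_index.get(group)
    if stale is not None:        # first lookup: find the existing aggregate
        del feed[stale]          # drop it so the merged version replaces it
    feed[member] = score         # second write: store the merged aggregate
    group_index[group] = member  # keep the index in sync

add_aggregated("likes:[1]", "likes-day1", 1.0)
add_aggregated("likes:[1,2]", "likes-day1", 2.0)  # merge, no duplicate group
```

The catch, as noted above, is that the index and the sorted set are two round trips that must be kept consistent, which is exactly the kind of read-then-write sequence that needs the lock.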

@tbarbugli more ideas?

@intellisense
Contributor Author

When I was testing I was watching the memory continuously. I have 14GiB of memory and the operation never exceeded that.
If I regenerate feeds from existing user actions using celery with CELERY_ALWAYS_EAGER=False, duplicate aggregated activities happen. But if I do it with CELERY_ALWAYS_EAGER=True (which obviously takes a lot of time to complete the process of regenerating feeds), duplicates didn't happen. So that's why I thought redis.lock has some issues or may not be implemented properly.

To my knowledge feedly was not created with the guarantee that there can never be duplicate aggregated activity groups, otherwise merge_max_length would never have existed. So maybe somewhere something needs to be improved or looked into?

@tbarbugli
Collaborator

@Thierry where is the locking done for the aggregated feed? I can't find it in feedly and I suspect it's not there at all (if it were there we would need a Cassandra implementation, or at least some configuration for the lock storage, which sounds challenging :))

@intellisense how do you regenerate feeds?

The only problem I can think of with the redis lock (as we use in the notification feed) is that it gets automatically released after 10 seconds; that's a way to prevent deadlocks.


@intellisense
Contributor Author

Re: How do you regenerate feeds?
Basically I am storing every user action in a generic database model called Action, provided by django-activity-stream. So regenerating a feed is just a matter of looping through the queryset, first creating an instance of feedly.activity.Activity from the action object and then adding it with feedly.add_user_activity(action.actor.id, activity).

P.S. I am not spawning any extra celery tasks myself. The celery tasks are those which feedly issues for add_operation.

@tschellenbach
Owner

I'm not sure we're talking about the same problem.

Problem scenario A:
Multiple updates happen on the same aggregated activity. Without locking this results in the activity being duplicated. So if you have the aggregated activity
start [1]
and multiple add operations running, you can end up with
start [1,2]
start [1,3]
instead of start [1,2,3].
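Scenario A is a classic lost-update race. A deterministic sketch (plain Python standing in for the sorted set, no real Redis or threads), showing why both stale writes survive as duplicates:

```python
# The sorted set keys on the serialized member, so two different merged
# versions of the "same" aggregate can coexist as duplicates.
zset = {"start:[1]": 1.0}

snap_a = "start:[1]"  # worker A reads the current aggregate
snap_b = "start:[1]"  # worker B reads the same snapshot (no lock held)

# Each worker removes the member it read and writes back its merged version.
zset.pop(snap_a, None)
zset["start:[1,2]"] = 2.0  # A's write lands

zset.pop(snap_b, None)     # B's snapshot is stale; nothing left to remove
zset["start:[1,3]"] = 3.0  # B's write lands too: duplicate group "start"
```

With a per-feed lock, B's read would have seen A's write and merged into start:[1,2,3] instead.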

Problem scenario B (Search depth)
If there is an aggregated activity past your search depth, you'll end up with 2 distinct aggregated activities with the same value for the group param.
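Scenario B in a sketch (pure Python, hypothetical names): the merge scan only inspects the newest merge_max_length aggregates, so an older aggregate with the same group is never found and a second one gets created.

```python
MERGE_MAX_LENGTH = 2

# Feed ordered newest-first; each entry is (group, activity_ids).
feed = [("b", [4]), ("c", [5]), ("a", [1, 2])]  # "a" sits past the scan depth

def add(feed, group, activity_id):
    # Only the first MERGE_MAX_LENGTH aggregates are inspected for a merge.
    for g, ids in feed[:MERGE_MAX_LENGTH]:
        if g == group:
            ids.append(activity_id)
            return
    # Not found within the scan depth: a new aggregate is created, even
    # though an older one with the same group already exists further down.
    feed.insert(0, (group, [activity_id]))

add(feed, "a", 3)  # produces a duplicate aggregate for group "a"
```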

PS
Tommaso, you are right, the aggregated feed locking seems to be specific to Fashiolista (indeed because of the trouble of combining Redis and Cassandra). Here's the Fashiolista example. (You could also use a different locking mechanism than Redis, but you need some sort of central lock to prevent duplication of aggregated activities.)

class AggregatedFeed(FashiolistaBaseFeed, CassandraAggregatedFeed):
    aggregator_class = FashiolistaAggregator
    timeline_cf_name = 'timeline_aggregated'
    key_format = 'feed:aggregated:%(user_id)s'
    lock_format = 'feed:aggregated:lock:%(user_id)s'
    max_length = 2400
    merge_max_length = 40

    def __init__(self, *args, **kwargs):
        CassandraAggregatedFeed.__init__(self, *args, **kwargs)
        # One lock key per user feed; the lock lives in Redis even though
        # the feed data itself is stored in Cassandra.
        self.lock_key = self.lock_format % {'user_id': self.user_id}
        self.redis = get_redis_connection()

    def add_many(self, *args, **kwargs):
        # Serialize writes per feed; the 30s timeout prevents deadlocks if
        # a worker dies while holding the lock.
        with self.redis.lock(self.lock_key, timeout=30):
            results = super(AggregatedFeed, self).add_many(*args, **kwargs)
            return results

    def remove_many(self, *args, **kwargs):
        with self.redis.lock(self.lock_key, timeout=30):
            results = super(AggregatedFeed, self).remove_many(*args, **kwargs)
            return results

    def trim(self, *args, **kwargs):
        # Trimming is handled elsewhere for this feed.
        pass

@Thierry

Thierry commented Feb 24, 2014

Folks, you have the wrong 'Thierry' among the recipients.

Could you please remove my name from this thread?

Thanks in advance

@tschellenbach
Owner

Hi, sorry about that, they meant to contact me.
These @-mentions are quite spam-prone, though.

@tschellenbach
Owner

I think only you can unsubscribe (there's a button at the top right).

@intellisense
Contributor Author

@tschellenbach I think it's Problem Scenario A. Using redis.lock should fix it. I will give it a try.

@intellisense
Contributor Author

Any chance of a deadlock in the above locking scheme? I can see 20%-30% of celery fanout tasks stuck forever while some succeeded. Using this locking technique is similar to celery's Synchronous Subtasks problem (http://docs.celeryproject.org/en/latest/userguide/tasks.html#avoid-launching-synchronous-subtasks). Any suggestions?

@tschellenbach
Owner

Hi!

The redis lock clears out after the timeout, so it would need to go wrong many times during a fanout for tasks to seemingly hang forever.

Since it locks per feed and processes one feed at a time, deadlocks aren't possible.

It is possible that a lock is set but not cleared. If that happens, a task should hang for the timeout period.
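That auto-expiry behaviour can be sketched with an in-memory TTL lock (a toy stand-in for the Redis lock, not redis-py itself; the explicit `now` parameter keeps the example deterministic):

```python
import time

class TTLLock:
    """Toy stand-in for a Redis lock with a timeout: the lock self-expires,
    so a crashed holder can only block others for `timeout` seconds."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.expires_at = None

    def acquire(self, now=None):
        now = time.monotonic() if now is None else now
        if self.expires_at is not None and now < self.expires_at:
            return False  # still held by someone else
        self.expires_at = now + self.timeout
        return True

lock = TTLLock(timeout=10)
got_first = lock.acquire(now=0.0)   # a worker takes the lock...
                                    # ...and crashes without releasing it.
got_early = lock.acquire(now=5.0)   # another task: blocked, not yet expired
got_late = lock.acquire(now=11.0)   # past the 10s timeout: free again
```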

What Celery broker are you using? We've had many issues running anything other than RabbitMQ as a broker.

@intellisense
Contributor Author

I am using redis as the broker, and the redis comes from AWS ElastiCache. The lock is cleared automatically on exit from the Python with statement.

@tbarbugli
Collaborator

That's partially correct; killing the process that holds the lock will leave the lock acquired until the redis key expires.

Regarding the issue: since locks are per feed, if you have lots of updates on the same feed you will end up with a lot of processes waiting for the lock to become available, and quite some waste. If that's the case I suggest batching updates together to reduce lock contention and read overhead (batched updates are much more efficient than a lot of single updates).
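The batching idea in a sketch (hypothetical names, not the feedly API): group pending activities per feed so each feed's lock is taken once per batch instead of once per activity.

```python
from collections import defaultdict

def fanout_batched(updates, add_many):
    """updates: iterable of (feed_id, activity) pairs.
    Group them per feed so each feed is locked and read once per batch."""
    per_feed = defaultdict(list)
    for feed_id, activity in updates:
        per_feed[feed_id].append(activity)
    for feed_id, activities in per_feed.items():
        add_many(feed_id, activities)  # one lock acquisition per feed

# Record the calls instead of touching a real backend.
calls = []
fanout_batched([(1, "a"), (2, "b"), (1, "c")],
               add_many=lambda fid, acts: calls.append((fid, acts)))
```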


@intellisense
Contributor Author

Hi, as written here (http://feedly.readthedocs.org/en/latest/_modules/feedly/feeds/aggregated_feed/base.html#AggregatedFeed.add_many):

Adds many activities to the feed
Unfortunately we can't support the batch interface.
The writes depend on the reads.
Also subsequent writes will depend on these writes.
So no batching is possible at all.

So I guess there is no workaround to add multiple user activities together to a user's aggregated feed and then fan out to followers?

@tbarbugli
Collaborator

Hi,
that's correct; it is possible to add multiple activities together to one aggregated feed, but unfortunately it is not possible to run add_many on many aggregated feeds in one batch (using the batch operator).
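A simplified sketch of why an aggregated write can't go into a blind cross-feed batch: each write must first read the feed's current state and merge into it, so the commands to issue depend on per-feed reads (hypothetical names, not the feedly implementation):

```python
def add_many_aggregated(feed, new_ids, group="g"):
    """One aggregated feed: the write depends on a read of current state,
    so it cannot be queued blindly alongside other feeds' writes."""
    current = feed.get(group, [])     # read the existing aggregate
    merged = current + list(new_ids)  # merge in memory
    feed[group] = merged              # this write depends on the read above
    return feed

feed = {"g": [1]}
add_many_aggregated(feed, [2, 3])  # merges into the existing group
```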

What backend storage are you using?

