Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data import between Zulip servers doesn't properly handle FillState #13486

Open
4 tasks
timabbott opened this issue Dec 7, 2019 · 7 comments · May be fixed by #13695
Open
4 tasks

Data import between Zulip servers doesn't properly handle FillState #13486

timabbott opened this issue Dec 7, 2019 · 7 comments · May be fixed by #13695

Comments

@timabbott
Copy link
Sponsor Member

If you've exported data from an empty Zulip server and then import it into another empty Zulip server, you'll get the following exception when manage.py update_analytics_counts return: runs:

Traceback (most recent call last):
  File "/home/zulip/deployments/2019-11-22-09-58-06/zulip-py3-venv/lib/python3.6/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/zulip/deployments/2019-11-22-09-58-06/zerver/lib/db.py", line 32, in execute
    return wrapper_execute(self, super().execute, query, vars)
  File "/home/zulip/deployments/2019-11-22-09-58-06/zerver/lib/db.py", line 19, in wrapper_execute
    return action(sql, params)
psycopg2.IntegrityError: duplicate key value violates unique constraint "analytics_usercount_user_id_property_subgrou_0597562a_uniq"
DETAIL:  Key (user_id, property, subgroup, end_time)=(9, messages_sent:is_bot:hour, false, 2015-11-25 11:00:00+00) already exists.

The issue is that the RealmCount, UserCount, etc. analytics tables were imported, but the global FillState table that is supposed to keep track of which CountStat objects from analytics/lib/counts.py have been properly updated cannot be readily imported in the general case, precisely because it is global. (The server we're importing into might be behind or more likely ahead of the server we're importing from, so there's no natural way to just copy the table).

There's a few things we can do to fix this:

  • In the common case where the new server is currently empty, we can just export and then import FillState from the old server, as a quick partial solution.
  • In the more general case, what we probably want to do is export FillState and then write a bit of code at the end of the import_realm to compare the old server's FillState to the new servers'. We can then do some of the following:
    • If the old server is ahead, we can either delete any entries from the imported realm's analytics tables
    • If the new server is ahead, there are a few options. The best is probably to extend the CountStat queries in analytics/lib/counts.py to support passing a Realm argument to process_count_stat, which would cause that logic to run for just the target realm, and then have a little loop catch up the newly imported realm. Effectively, we'd need to support this alternative version of each of the queries we define:
diff --git a/analytics/lib/counts.py b/analytics/lib/counts.py
index 85761a8260..d924300cd8 100644
--- a/analytics/lib/counts.py
+++ b/analytics/lib/counts.py
@@ -296,6 +296,7 @@ count_message_by_user_query = """
     WHERE
         zerver_userprofile.date_joined < %%(time_end)s AND
         zerver_message.date_sent >= %%(time_start)s AND
+        zerver_userprofile.realm_id = %%(realm_id)s AND
         zerver_message.date_sent < %%(time_end)s
     GROUP BY zerver_userprofile.id %(group_by_clause)s
 """

I think the right way to arrange that is to change the structure a bit to make e.g. count_message_by_user_query into a function like this:

def count_message_by_user_query(realm: Optional[Realm]) -> str:
    if realm is None:
        realm_clause = ""
    else: 
        realm_clause = "zerver_userprofile.realm_id = %%(realm_id)s AND"
    return """                                                                    
    INSERT INTO analytics_usercount                                                                  
        (user_id, realm_id, value, property, subgroup, end_time)                                     
    SELECT                                                                                           
        zerver_userprofile.id, zerver_userprofile.realm_id, count(*),                                
        '%(property)s', %(subgroup)s, %%(time_end)s                                                  
    FROM zerver_userprofile                                                                          
    JOIN zerver_message                                                                              
    ON                                                                                               
        zerver_userprofile.id = zerver_message.sender_id                                             
    WHERE                                                                                            
        zerver_userprofile.date_joined < %%(time_end)s AND                                           
        zerver_message.date_sent >= %%(time_start)s AND                                              
        {realm_clause}
        zerver_message.date_sent < %%(time_end)s                                                     
    GROUP BY zerver_userprofile.id %(group_by_clause)s                                               
"""                                                                                           
```.format(realm_clause=realm_clause)

and doing the plumbing to support passing a `realm_id` through the flow.  

I like that option because we may eventually have other uses for running these analytics update operations on a single realm, so the plumbing work we'd be doing would potentially have some future value as well.
@zulipbot
Copy link
Member

zulipbot commented Dec 7, 2019

Hello @zulip/server-misc members, this issue was labeled with the "area: export/import" label, so you may want to check it out!

@arpit551
Copy link
Collaborator

@zulipbot claim

@zulipbot
Copy link
Member

zulipbot commented Dec 26, 2019

Hello @arpit551, you have been unassigned from this issue because you have not updated this issue or any referenced pull requests for over 14 days.

You can reclaim this issue or claim any other issue by commenting @zulipbot claim on that issue.

Thanks for your contributions, and hope to see you again soon!

@arpit551
Copy link
Collaborator

@zulipbot still working

@arpit551
Copy link
Collaborator

@zulipbot claim

@zulipbot
Copy link
Member

Hello @arpit551, you claimed this issue to work on it, but this issue and any referenced pull requests haven't been updated for 10 days. Are you still working on this issue?

If so, please update this issue by leaving a comment on this issue to let me know that you're still working on it. Otherwise, I'll automatically remove you from this issue in 4 days.

If you've decided to work on something else, simply comment @zulipbot abandon so that someone else can claim it and continue from where you left off.

Thank you for your valuable contributions to Zulip!

@arpit551
Copy link
Collaborator

@zulipbot abandon

arpit551 added a commit to arpit551/zulip that referenced this issue Mar 3, 2020
FillState is handled according to the result of comparison between
old server and new server. Also its compatibility with running
update_count_stats is taken care by sharing the settings.ANALYTICS_LOCK_DIR
lock with update_analytics_count.py and wait until it grabs the lock.

Fixes zulip#13486
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants