
Add Initial Modeling with Through Model and Data Migration for RemoteRepository Model #7536

Conversation

Member

@saadmk11 saadmk11 commented Oct 5, 2020

This is the initial modeling and data migration for the RemoteRepository model.
It is a simple implementation meant to start the discussion around the implementation of #7169.

  • The first migration (0011) decouples the RemoteRepository model via a through model, RemoteRelation, using SeparateDatabaseAndState, so that Django does not create a new table for it and instead reuses the table that already exists for the ManyToMany relation.

  • The second migration (0012) adds the required fields to the RemoteRelation model (using JSONField instead of models.TextField for the json field).

  • The third migration (0013) performs the data migration between RemoteRepository and RemoteRelation.

  • The fourth migration (0014) removes the transferred fields from the RemoteRepository model.

Ref: #2759 and #7169
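The SeparateDatabaseAndState trick from the first bullet can be sketched roughly like this. This is an illustrative sketch, not the actual migration from this PR: the dependency name, field list, and db_table are assumptions, and the real join table name depends on the original ManyToManyField.

```python
# Hypothetical sketch of migration 0011: register a new through model in
# Django's migration *state* while emitting no SQL, so the M2M join table
# that already exists keeps being used as the RemoteRelation table.
from django.conf import settings
from django.db import migrations, models
import django.db.models.deletion


class Migration(migrations.Migration):

    dependencies = [
        ('oauth', '0010_previous_migration'),  # placeholder dependency
    ]

    state_operations = [
        migrations.CreateModel(
            name='RemoteRelation',
            fields=[
                ('id', models.AutoField(primary_key=True)),
                ('remoterepository', models.ForeignKey(
                    on_delete=django.db.models.deletion.CASCADE,
                    to='oauth.RemoteRepository',
                )),
                ('user', models.ForeignKey(
                    on_delete=django.db.models.deletion.CASCADE,
                    to=settings.AUTH_USER_MODEL,
                )),
            ],
            # Point the new model at the existing M2M join table (assumed name).
            options={'db_table': 'oauth_remoterepository_users'},
        ),
        migrations.AlterField(
            model_name='remoterepository',
            name='users',
            field=models.ManyToManyField(
                related_name='oauth_repositories',
                through='oauth.RemoteRelation',
                to=settings.AUTH_USER_MODEL,
            ),
        ),
    ]

    operations = [
        # Only the project state changes; nothing runs against the database.
        migrations.SeparateDatabaseAndState(state_operations=state_operations),
    ]
```

Because no database operations are emitted, subsequent migrations (0012 onwards) can then alter the now-managed table normally.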


relations_queryset = RemoteRelation.objects.all().select_related('remoterepository')
remote_relations = remote_relations_generator(relations_queryset)
batch_size = 5000
Member Author

Not sure what the right size is for each batch, but if we do all of them at once it will create a huge SQL query.
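The batching idea can be illustrated in pure Python. This is a sketch: `remote_relations_generator` here is a hypothetical stand-in for the helper used in the migration, and any iterable (here, a range of integers) stands in for the queryset.

```python
from itertools import islice


def remote_relations_generator(relations, batch_size=5000):
    """Yield lists of at most ``batch_size`` items from ``relations``.

    Slicing the work into fixed-size batches keeps each bulk operation
    (and therefore each SQL statement) bounded, instead of building one
    huge query for millions of rows at once.
    """
    iterator = iter(relations)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Plain integers standing in for RemoteRelation rows:
batches = list(remote_relations_generator(range(12000), batch_size=5000))
# 12000 rows -> three batches of 5000, 5000 and 2000
```

Each yielded batch can then be passed to `bulk_create()`/`bulk_update()` so memory use and query size stay proportional to `batch_size`, not to the table size.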

Member

Yeah, this is one of my concerns about these migrations. I don't have a good answer yet and we may need to ask somebody else for advice here.

The RemoteRepository table is incredibly huge. I already had to trigger the creation of an INDEX on this table manually, without blocking things (concurrently), because it took hours.

In this case, we will need to find a way to not block the DB while deploying these changes.

@saadmk11 saadmk11 requested a review from humitos October 5, 2020 17:08

Member

@humitos humitos left a comment

This is going in a good direction IMO. The PR description was super useful to understand what you did 💯

I think there are probably no more big modeling changes needed here. The rest of the work is more about "how to deploy it without breaking things".

We will need a plan for:

  • not killing the DB while running these migrations
    • how can we improve the migration?
    • is it enough to just lower the batch_size?
    • will it block the DB while running the migration?
  • keeping compatibility with all our existing features that use these models

Unfortunately, I don't have the answer to most of those questions yet, but it's definitely something we need to talk about and have a plan for.

I'm not sure if you already tested this locally and ran these queries, but it would be useful if you could connect your social accounts (GitHub, GitLab and Bitbucket) to your local instance and double check that the data migration and structure are all in place.

Thanks for helping us on this, really appreciated! ❤️

@@ -90,14 +91,7 @@ class RemoteRepository(models.Model):
User,
verbose_name=_('Users'),
related_name='oauth_repositories',
)
account = models.ForeignKey(
Member

We should probably add the remote_id field here as well and populate it with the id that's in remoterepository.json.

If we are not doing it in this PR, we should create an issue to keep track of it. We will need it to remove the usage of full_name in many places.

I created a project at https://github.com/orgs/readthedocs/projects/74?add_cards_query=is%3Aopen where we can track all of the related PR and issues. Let me know if you have permissions.

Member Author

Okay sure, maybe we can do the data migration for this and #7536 (comment) first.

This PR will remove the json field from the RemoteRepository model, so I think it's better to do this first in a different PR, or make it the first migration on this PR.
@humitos What do you say?

Member Author

I created a project at https://github.com/orgs/readthedocs/projects/74?add_cards_query=is%3Aopen where we can track all of the related PR and issues. Let me know if you have permissions.

@humitos I can't see some of the cards, maybe they are from a private repo or something; otherwise it's all good 👍

migrations.RemoveField(
model_name='remoterepository',
name='account',
),
Member

Now that we are moving the SocialAccount to the RemoteRelation, shouldn't we have a provider field or similar on the RemoteRepository, so we know which service this repository belongs to?

Member Author

@saadmk11 saadmk11 Oct 8, 2020

Yes! Great idea 💯 It will be better to add a provider field so we don't have to do an extra query to know the provider name, and so that if the user is deleted or disconnected we don't lose that data.

Some thoughts on migration of this data here #7536 (comment)

@saadmk11
Member Author

saadmk11 commented Oct 8, 2020

I'm not sure if you already tested this locally and ran these queries, but it would be useful if you could connect your social accounts (GitHub, GitLab and Bitbucket) to your local instance and double check that the data migration and structure are all in place.

I have connected my GitHub account, imported all my repositories, and ran the migrations. In addition to that, I created ~53k RemoteRelation objects and ran the migrations on them; it took around ~50 seconds.

@saadmk11
Member Author

saadmk11 commented Oct 10, 2020

Updated the PR!

Used TimeStampedModel for the RemoteRelation model, and used the table rename method shown in the Django docs to create the RemoteRelation model. Added some small data migration optimizations and other suggested changes.

@saadmk11 saadmk11 requested a review from humitos October 10, 2020 17:20
@saadmk11
Member Author

saadmk11 commented Oct 11, 2020

Learned something interesting: it seems SQLite has restrictions on the number of variables used in a query, so it needs multiple queries for bulk_update(). In my case it was doing 41 queries per 5000-row update. I switched to PostgreSQL and it was doing only 1 query per 5000-row update (which consumes more memory); it's the same for 10000 rows at a time.

https://docs.djangoproject.com/en/3.0/ref/models/querysets/#bulk-update

The batch_size parameter controls how many objects are saved in a single query. The default is to update all objects in one batch, except for SQLite and Oracle which have restrictions on the number of variables used in a query.
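The 41-queries figure is consistent with SQLite's historical default limit of 999 bound variables per statement, assuming roughly 8 variables per updated row. Both numbers are assumptions for illustration; the actual count depends on the fields being updated and the SQLite build:

```python
import math

SQLITE_MAX_VARS = 999   # default SQLITE_MAX_VARIABLE_NUMBER in older SQLite builds
VARS_PER_ROW = 8        # assumed bound variables per updated row (illustrative)

rows = 5000
rows_per_query = SQLITE_MAX_VARS // VARS_PER_ROW  # rows that fit in one statement
queries = math.ceil(rows / rows_per_query)

print(rows_per_query, queries)  # -> 124 41, matching the 41 queries observed
```

On PostgreSQL there is no comparable variable limit, so Django can emit the whole batch as a single UPDATE, trading memory for fewer round trips.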

@humitos
Member

humitos commented Nov 9, 2020

I got another idea that I wanted to share here after talking about it with @saadmk11 today. Keeping in mind that we can't just run migrations on this table, because it is so huge that it would take down the DB.

In a few words, I'm thinking of:

  • re-sync the data we are using immediately (~700k RemoteRepository)
  • migrate the data not frequently used by chunks (~6.5M RemoteRepository)

To accomplish this, this is the plan we've got:

  1. use this PR to modify the DB schema, decoupling RemoteRepository to create RemoteRelation
  2. modify all the Python code to use the new modeling in .org and .com (this will help us find bugs locally more easily)
  3. QA this locally with test data
  4. write a management command to re-sync (from GitHub) RemoteRepository for active users (logged in during the last 30 days). Active users will have updated data immediately (currently we have ~6k active users and a rate limit of 5k requests/hour to GitHub --we will need to adapt these numbers or run it multiple times at different hours)
  5. enable the Django signal to re-sync RemoteRepository on login, async (we already have this in .com). New active users will have updated data immediately
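Some back-of-the-envelope pacing for the re-sync command in step 4, treating the 5k requests/hour as one shared hourly budget. The requests-per-user figure is an assumption for illustration; the real number depends on how many paginated API calls each user's repository list needs:

```python
import math

active_users = 6000          # ~6k active users (from the discussion above)
rate_limit_per_hour = 5000   # GitHub rate limit mentioned above
requests_per_user = 2        # assumed average paginated API calls per user

total_requests = active_users * requests_per_user
hours_needed = math.ceil(total_requests / rate_limit_per_hour)

print(total_requests, hours_needed)  # -> 12000 3: at least 3 hourly windows
```

This is why the step above notes the command may have to run multiple times at different hours rather than in one pass.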

At this point, we could do the first deploy. Active users and new active users will have updated data immediately after they log in. From here, we can continue migrating infrequently used (or stale) data following these steps:

  6. write a periodic celery task to migrate ~1k records (from the old table to the new modeling) every ~1h so we don't kill the DB (some of this code is already written in a migration on this PR)
    • needs to consider deleting old duplicated RemoteRepository records (the new modeling will have only one RemoteRepository shared by all its users)
    • may need to re-sync the RemoteRepository for all the users that have access to it
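The chunked, deduplicating migration described in step 6 could look roughly like this. This is a pure-Python sketch of the logic only: the real task would work on querysets inside a celery task, and every name here is illustrative.

```python
def migrate_chunk(old_rows, migrated, chunk_size=1000):
    """Migrate up to ``chunk_size`` old per-user RemoteRepository rows.

    ``old_rows`` is a list of (full_name, username) tuples standing in for
    the old table, where the same repository may appear once per user.
    ``migrated`` maps full_name -> set of usernames, standing in for the
    new modeling where a single RemoteRepository is shared and each user
    attaches to it via a RemoteRelation.  Returns rows consumed.
    """
    chunk = old_rows[:chunk_size]
    for full_name, username in chunk:
        # Duplicated RemoteRepository records collapse into one entry;
        # each user becomes a relation on that single record.
        migrated.setdefault(full_name, set()).add(username)
    del old_rows[:chunk_size]
    return len(chunk)


old = [('org/repo', 'alice'), ('org/repo', 'bob'), ('org/other', 'alice')]
migrated = {}
while old:
    # In production each iteration would be one periodic task run (~1k rows/h).
    migrate_chunk(old, migrated, chunk_size=2)
# migrated == {'org/repo': {'alice', 'bob'}, 'org/other': {'alice'}}
```

Bounding each run to a fixed chunk size is what keeps any single task from holding long locks or issuing huge statements against the live table.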

Optionally, we can spin up a test DB in production and a new instance with the new code to do as much QA as we want.

This will give us the best of both worlds: active/new users will have updated data, and old users will be migrated slowly without killing the DB. However, as soon as one of those old users logs in, their data will be automatically re-synced from GitHub via the Django signal.

  7. remove all the extra fields we have in the old table.

I left 7) for the end, and on its own, because I'm not sure if removing these fields will take the DB down as well, considering the amount of data we have in that table.

@ericholscher
Member

@humitos This makes sense to me. I think the main thing we need to worry about is GH rate limits, but there's no reason we couldn't ship the modeling a week in advance, and then migrate the data over a longer period of time, then deploy the code to use it, right?

@humitos
Member

humitos commented Nov 10, 2020

but there's no reason we couldn't ship the modeling a week in advance, and then migrate the data over a longer period of time, then deploy the code to use it, right?

We can't really ship the new modeling and continue using the old one in the application Python code. The new modeling modifies relationships between models and renames a table, so the queries will need to be changed as well. In my proposal, we are deploying the new modeling and application Python code changes to use it at the same time --together with a script to re-sync active users in the same deploy.

I'm still not 100% convinced that we have to migrate all the data at the same time, considering that we can re-fetch it (up to date!) from GitHub when we need it. That's why I left step 6) outside the first deploy. It may not be required.

@saadmk11 saadmk11 marked this pull request as ready for review November 14, 2020 09:07
Member

@humitos humitos left a comment

Changes look good to me. I'm surprised that tests keep passing after changing the RemoteRepository.users field. I'd have expected the code to fail because of this, considering that the field is not valid anymore and has to be accessed via the RemoteRelation.

We need to do exactly the same migration for RemoteOrganization as well.

@@ -206,3 +210,30 @@ def matches(self, user):
},
),
} for project in projects]


class RemoteRelation(TimeStampedModel):
Member

We may need to rename this model to something like RemoteRepositoryRelation, because we will also need a relation model for RemoteOrganization and we would have conflicting names.

Does it make sense to make the remoterepository field nullable and add a remoteorganization field to this RemoteRelation model, so we can use it for both? It seems it's exactly the same data, except that RemoteOrganization does not have an admin field.
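The single-table idea could be sketched like this. This is a plain-Python stand-in for the proposed model, not the actual implementation: in a real Django model these would be nullable ForeignKeys plus a CHECK constraint, and all names here are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemoteRelation:
    """One row per (user, remote object) pair.

    Exactly one of ``remoterepository_id`` / ``remoteorganization_id`` is
    set, so the same table can serve both kinds of relation.  ``admin``
    only makes sense for repository relations.
    """
    user_id: int
    remoterepository_id: Optional[int] = None
    remoteorganization_id: Optional[int] = None
    admin: bool = False

    def __post_init__(self):
        # Mirrors what a DB CHECK constraint would enforce: the relation
        # must point at exactly one remote object, and admin is
        # repository-specific.
        if (self.remoterepository_id is None) == (self.remoteorganization_id is None):
            raise ValueError('Set exactly one of remoterepository/remoteorganization')
        if self.admin and self.remoterepository_id is None:
            raise ValueError('admin only applies to repository relations')


repo_rel = RemoteRelation(user_id=1, remoterepository_id=10, admin=True)
org_rel = RemoteRelation(user_id=1, remoteorganization_id=20)
```

The trade-off versus two separate relation models is one wider, sparser table against duplicated code and a second migration path for RemoteOrganization.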

@humitos humitos changed the base branch from master to humitos/remote-repository-normalization November 16, 2020 16:43
@humitos humitos changed the base branch from humitos/remote-repository-normalization to remote-repository-normalization November 16, 2020 16:44
@humitos humitos merged commit 2c74027 into readthedocs:remote-repository-normalization Nov 16, 2020
3 checks passed
@saadmk11 saadmk11 deleted the remote-repo-normalization branch November 17, 2020 06:56
@humitos humitos mentioned this pull request Feb 23, 2021
3 tasks