Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

import: Import from gitter. #9569

Closed
wants to merge 6 commits into from
Closed

Conversation

rheaparekh
Copy link
Collaborator

@rheaparekh rheaparekh commented May 28, 2018

This successfully imports gitter data to Zulip.

Things done:
Mapping users, stream, recipients, subscriptions, messages, avatars, added Management command convert_gitter_data, added basic tests, basic documentation, support user mentions.

Things to do:
Improve documentation, See how markdown conversion can be improved after feedback.

To Test:

  1. Get the sample dataset in a file gitter.json.
  2. Use ./manage.py convert_gitter_data gitter.json --output gitter_data to get the converted file.
  3. Use ./manage.py import 'test-gitter-import' gitter_data to import. This will create a realm with the name test-gitter-import.

@rheaparekh rheaparekh force-pushed the gitter-import branch 7 times, most recently from 3c37683 to ad7d8c3 Compare June 1, 2018 16:13
@rheaparekh rheaparekh force-pushed the gitter-import branch 4 times, most recently from 895b09c to 9138063 Compare June 7, 2018 18:47
zerver_subscription: List[ZerverFieldsT],
user_map: Dict[str, int]) -> ZerverFieldsT:
user_map: Dict[str, int], chunk_size: int=800) -> None:
Copy link
Collaborator Author

@rheaparekh rheaparekh Jun 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chunk_size used in export.py is 1000. But when I used the chunk_size as 1000 with the dataset of 15493 messages and 273 users, I had a successful gitter to zulip data conversion, but while importing, I received a Memory error. Working with 800 as the chunk_size didn't have any issue. I think it would be good if we decide upon a proper chunk_size.

@rheaparekh rheaparekh force-pushed the gitter-import branch 3 times, most recently from 123f0fc to c8254e4 Compare July 5, 2018 09:18
@rheaparekh rheaparekh changed the title [WIP] Import: Import from gitter. import: Import from gitter. Jul 5, 2018
@rheaparekh rheaparekh force-pushed the gitter-import branch 3 times, most recently from b5873bc to 1950200 Compare July 5, 2018 09:32
@rheaparekh
Copy link
Collaborator Author

rheaparekh commented Jul 5, 2018

The build seems to be failing because of a test flake.

@timabbott
Copy link
Sponsor Member

@rht do you have time to help review this?

@neiljp I'd also appreciate your trying this out with our current sample data set. After doing the import, it's valuable to spend a while clicking around looking for inconsistencies.

@rheaparekh
Copy link
Collaborator Author

rheaparekh commented Jul 5, 2018

This is the current sample dataset: https://s3.amazonaws.com/custodian-gitter/capitalone-cloud-custodian.json

Messages can be bulky, and storing them in a single
data structure can cause a memory error.

In this commit, the messages are written to a file
batch-wise, thus avoiding the memory error.

Similar to commit 6b7b6b3
The gitter mentions are in the format '@usermention'
and the mentions are included in the export data as:

"mentions": [
   {
    "screenName": "usermention",
    "userId": "54d7876c15522ed4b3dbbefb",
    "userIds": []
   }]

We extract this data and map this mention to @**usermention**
for Zulip.
@rht
Copy link
Collaborator

rht commented Jul 6, 2018

I will see if I can do this by the end of the week(end).

@rht
Copy link
Collaborator

rht commented Jul 7, 2018

@rheaparekh
Note: most of the time I had to read the Slack version of each function as a reference point. It would have eased the checking if the well-tested functions in slack_data_to_zulip_data.py are repurposed here, e.g. build_recipient_and_subscription could reuse channels_to_zerver_stream. It would have eased the writing of IRC importer (is this being made?) as well.

There should be additional caveats in the documentation that 1. Gitter markdown 2. issue mentions haven't been mapped yet.

Other than those points, LGTM for the rest of the commits.

@rheaparekh
Copy link
Collaborator Author

I think I will add another PR to add common import functions, so that it'll help both slack and gitter importer (and any other future imports).
I have updated the documentation with the caveats (I'll work on the gitter markdown and issue mention next after this is merged)

@rheaparekh
Copy link
Collaborator Author

@timabbott This should be ready for a final review.

@timabbott
Copy link
Sponsor Member

This is good enough for a preliminary merge (since it does work and is a lot better than nothing), but I don't want to advertise this until we've cleaned it up a bit more. @rheaparekh here's the main things we'll need to adjust here before we advertise this feature:

  • We don't import any organization administrators. I think probably it's good enough to document using manage.py knight to create one, if the Gitter export doesn't contain data on that (though I think the correct concept is that the Git repo owners are org admins? Worth looking up).
  • I don't think we need a lot for formatting conversion, since they are both based on markdown. Probably one of the main Gitter->Zulip markdown conversions we want to do is if a message looks like this:
```code block

end of code block```

(without newlines at start/end), we should treat that as a code block in the import.

  • I'd like to convert a bunch of the code for constructing a user/subscription/etc. object into functions shared between Slack and Gitter import (even if it's just the low-level piece that makes a UserProfile, picking the defaults of ~10 fields that are the same for all imported users). We can put those in zerver/lib/export_util.py if you don't think we want them in the main zerver/lib/export.py. This should help reduce code duplication as we add more import tools in the future. Specifically I'm thinking of blocks like build_subscription, zulip_message = dict(, process_avatars, and the userprofile = UserProfile( one.
  • Can you give an example of what "issue mentions" are? I'm curious how hard it would be to address that, e.g. by generating a RealmFilter.

I think these could probably be burned through pretty quickly, so let's try to focus on them.

@timabbott timabbott closed this Jul 23, 2018
@timabbott
Copy link
Sponsor Member

docs/production/maintain-secure-upgrade.md has docs on manage.py knight that we can link to.

@rheaparekh
Copy link
Collaborator Author

@timabbott @rht thankyou for the reviews! I'll get started on the followups.

@rheaparekh rheaparekh deleted the gitter-import branch July 23, 2018 16:52
@timabbott timabbott mentioned this pull request Jul 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants