import: Import from gitter. #9569

rheaparekh · 2018-05-28T17:28:17Z

This successfully imports Gitter data into Zulip.

Things done:
Mapped users, streams, recipients, subscriptions, messages, and avatars; added the management command convert_gitter_data; added basic tests and basic documentation; added support for user mentions.

Things to do:
Improve the documentation; see how the markdown conversion can be improved after feedback.

To Test:

Get the sample dataset in a file gitter.json.
Use ./manage.py convert_gitter_data gitter.json --output gitter_data to get the converted file.
Use ./manage.py import 'test-gitter-import' gitter_data to import. This will create a realm with the name test-gitter-import.

rheaparekh · 2018-06-07T18:50:52Z

zerver/lib/gitter_import.py

                                      zerver_subscription: List[ZerverFieldsT],
-                                      user_map: Dict[str, int]) -> ZerverFieldsT:
+                                      user_map: Dict[str, int], chunk_size: int=800) -> None:


The chunk_size used in export.py is 1000. When I used a chunk_size of 1000 with a dataset of 15493 messages and 273 users, the Gitter-to-Zulip data conversion succeeded, but the import then failed with a memory error. With a chunk_size of 800 there was no issue. I think it would be good to decide on a proper chunk_size.

rheaparekh · 2018-07-05T09:53:31Z

The build seems to be failing because of a test flake.

timabbott · 2018-07-05T10:00:35Z

@rht do you have time to help review this?

@neiljp I'd also appreciate your trying this out with our current sample data set. After doing the import, it's valuable to spend a while clicking around looking for inconsistencies.

rheaparekh · 2018-07-05T13:28:13Z

This is the current sample dataset: https://s3.amazonaws.com/custodian-gitter/capitalone-cloud-custodian.json

Messages can be bulky, and storing them in a single data structure can cause a memory error. In this commit, the messages are written to a file batch-wise, thus avoiding the memory error. Similar to commit 6b7b6b3
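The batch-wise writing described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the helper names `chunked` and `write_messages_in_batches`, the `messages-%06d.json` file naming, and the `zerver_message` key are assumptions made for illustration. Only the default chunk_size of 800 comes from the discussion above.

```python
import json
import os
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def chunked(items: Iterable[Dict[str, Any]],
            chunk_size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most chunk_size items without materialising the whole input."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, chunk_size))
        if not batch:
            return
        yield batch


def write_messages_in_batches(messages: Iterable[Dict[str, Any]],
                              output_dir: str,
                              chunk_size: int = 800) -> List[str]:
    """Write each batch to its own JSON file and return the paths written.

    The file name pattern and the top-level key are hypothetical.
    """
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for index, batch in enumerate(chunked(messages, chunk_size), start=1):
        path = os.path.join(output_dir, "messages-%06d.json" % index)
        with open(path, "w") as f:
            json.dump({"zerver_message": batch}, f)
        paths.append(path)
    return paths
```

Because only one batch is held in memory at a time, peak memory depends on chunk_size rather than on the total number of messages, provided the caller passes a generator rather than a full list.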

The Gitter mentions are in the format '@usermention', and they are included in the export data as: "mentions": [{"screenName": "usermention", "userId": "54d7876c15522ed4b3dbbefb", "userIds": []}]. We extract this data and map each mention to @**usermention** for Zulip.
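A minimal sketch of that mention mapping, assuming each message carries its text alongside the "mentions" list shown above. The function name `convert_gitter_mentions` and the word-boundary rules are illustrative assumptions, not the PR's actual implementation.

```python
import re
from typing import Any, Dict, List


def convert_gitter_mentions(content: str,
                            mentions: List[Dict[str, Any]]) -> str:
    """Rewrite '@screenName' as '@**screenName**' for every listed mention."""
    for mention in mentions:
        screen_name = mention.get("screenName")
        if not screen_name:
            continue
        # Assumption: a mention is not preceded by a word character (so
        # e-mail addresses are skipped) and is not followed by a word
        # character or hyphen (so '@alice' does not match '@alice-b').
        pattern = r"(?<!\w)@" + re.escape(screen_name) + r"(?![\w-])"
        replacement = "@**" + screen_name + "**"
        # A callable replacement avoids backslash processing in the name.
        content = re.sub(pattern, lambda match: replacement, content)
    return content
```

For example, with a single mention whose screenName is "usermention", the text "thanks @usermention" becomes "thanks @**usermention**".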

rht · 2018-07-06T08:26:05Z

I will see if I can do this by the end of the week(end).

rht · 2018-07-07T09:19:45Z

@rheaparekh
Note: most of the time I had to read the Slack version of each function as a reference point. It would have made checking easier if the well-tested functions in slack_data_to_zulip_data.py had been repurposed here, e.g. build_recipient_and_subscription could reuse channels_to_zerver_stream. It would also have made writing an IRC importer easier (is one being made?).

The documentation should include additional caveats that (1) Gitter markdown and (2) issue mentions haven't been mapped yet.

Other than those points, LGTM for the rest of the commits.

rheaparekh · 2018-07-09T08:38:23Z

I think I will open another PR to add common import functions, so that they help both the Slack and Gitter importers (and any future importers).
I have updated the documentation with the caveats. (I'll work on the Gitter markdown and issue mentions next, after this is merged.)

rheaparekh · 2018-07-09T09:01:10Z

@timabbott This should be ready for a final review.

timabbott · 2018-07-23T16:03:53Z

This is good enough for a preliminary merge (since it does work and is a lot better than nothing), but I don't want to advertise this until we've cleaned it up a bit more. @rheaparekh here are the main things we'll need to adjust before we advertise this feature:

We don't import any organization administrators. I think probably it's good enough to document using manage.py knight to create one, if the Gitter export doesn't contain data on that (though I think the correct concept is that the Git repo owners are org admins? Worth looking up).
I don't think we need a lot for formatting conversion, since they are both based on markdown. Probably one of the main Gitter->Zulip markdown conversions we want to do is if a message looks like this:

```code block

end of code block```

(without newlines at start/end), we should treat that as a code block in the import.
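One way to approach that conversion is sketched below. It is a hedged illustration only: the function name `normalize_code_fences` and the heuristic (rewrite a multi-line fenced span only when its closing fence is not on its own line) are assumptions, and a real converter would need to be checked against actual Gitter messages.

```python
import re

# Built with multiplication so the fence marker never appears literally here.
FENCE = "`" * 3
_FENCED_SPAN = re.compile(re.escape(FENCE) + r"(.*?)" + re.escape(FENCE),
                          re.DOTALL)


def normalize_code_fences(content: str) -> str:
    """Put the opening and closing fences of a multi-line span on their own lines."""

    def fix(match: "re.Match[str]") -> str:
        body = match.group(1)
        # Single-line spans are left untouched.
        if "\n" not in body:
            return match.group(0)
        # Assumption: a closing fence already on its own line means the block
        # is well formed (possibly with a language tag), so leave it alone.
        if body.endswith("\n"):
            return match.group(0)
        return FENCE + "\n" + body.strip("\n") + "\n" + FENCE

    return _FENCED_SPAN.sub(fix, content)
```

Under this heuristic the example message above gains a newline after the opening fence and before the closing one, while blocks that are already well formed pass through unchanged.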

I'd like to convert a bunch of the code for constructing a user/subscription/etc. object into functions shared between Slack and Gitter import (even if it's just the low-level piece that makes a UserProfile, picking the defaults of ~10 fields that are the same for all imported users). We can put those in zerver/lib/export_util.py if you don't think we want them in the main zerver/lib/export.py. This should help reduce code duplication as we add more import tools in the future. Specifically I'm thinking of blocks like build_subscription, zulip_message = dict(, process_avatars, and the userprofile = UserProfile( one.
Can you give an example of what "issue mentions" are? I'm curious how hard it would be to address that, e.g. by generating a RealmFilter.
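The shared low-level helper suggested above, one that fills in the fields common to all imported users, might be shaped roughly like this. Everything here is hypothetical: `build_imported_user`, `IMPORTED_USER_DEFAULTS`, and the field names are placeholders for illustration and are not Zulip's actual UserProfile schema.

```python
from typing import Any, Dict

# Hypothetical defaults shared by every imported user, whichever importer
# (Slack, Gitter, ...) produced it. The field names are illustrative only.
IMPORTED_USER_DEFAULTS = {
    "is_active": True,
    "is_bot": False,
    "is_realm_admin": False,
    "timezone": "UTC",
}  # type: Dict[str, Any]


def build_imported_user(user_id: int, email: str, full_name: str,
                        **overrides: Any) -> Dict[str, Any]:
    """Return a user record: shared defaults, then identity fields, then overrides."""
    user = dict(IMPORTED_USER_DEFAULTS)  # copy so the defaults stay untouched
    user.update(id=user_id, email=email, full_name=full_name)
    user.update(overrides)
    return user
```

Each importer would then pass only the fields that differ for its source, which keeps the defaults in one place as more import tools are added.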

I think these could probably be burned through pretty quickly, so let's try to focus on them.

timabbott · 2018-07-23T16:04:20Z

docs/production/maintain-secure-upgrade.md has docs on manage.py knight that we can link to.

rheaparekh · 2018-07-23T16:52:11Z

@timabbott @rht thank you for the reviews! I'll get started on the follow-ups.

zulipbot added the size: XL label May 28, 2018

rheaparekh force-pushed the gitter-import branch 7 times, most recently from 3c37683 to ad7d8c3 Compare June 1, 2018 16:13

rheaparekh force-pushed the gitter-import branch 4 times, most recently from 895b09c to 9138063 Compare June 7, 2018 18:47

rheaparekh mentioned this pull request Jun 9, 2018

import: Process messages batchwise in slack import. #9723

Closed

rheaparekh force-pushed the gitter-import branch from 9138063 to 495a9ee Compare June 9, 2018 12:34

rheaparekh force-pushed the gitter-import branch 3 times, most recently from 123f0fc to c8254e4 Compare July 5, 2018 09:18

rheaparekh changed the title ~~[WIP] Import: Import from gitter.~~ import: Import from gitter. Jul 5, 2018

rheaparekh force-pushed the gitter-import branch 3 times, most recently from b5873bc to 1950200 Compare July 5, 2018 09:32

rheaparekh force-pushed the gitter-import branch from 1950200 to 116b57e Compare July 5, 2018 13:34

rheaparekh added 4 commits July 5, 2018 19:09

gitter import: Add gitter data conversion script.

7f3391d

gitter import: Add management command.

0489adf

gitter import: Add basic tests.

0a311de

gitter import: Write messages batch-wise.

fae48be

rheaparekh force-pushed the gitter-import branch from 116b57e to 8c62145 Compare July 9, 2018 08:37

gitter import: Add documentation.

8c62145

timabbott closed this Jul 23, 2018

rheaparekh deleted the gitter-import branch July 23, 2018 16:52

timabbott mentioned this pull request Jul 27, 2018

import from gitter #9374

Closed

rheaparekh mentioned this pull request Jul 28, 2018

docs: Update import-from-gitter doc. #10096

Closed
