Add generic CSV export translator #742

Closed
wants to merge 6 commits into from

5 participants
@zuphilip
Collaborator

commented Jun 1, 2014

For some tasks in my work with books I had to search for literature and save some rudimentary bibliographic facts about each item in a list (to work with further). Moreover, I have read somewhere in the forum that the "wish" for a CSV translator exists; see also http://belencruz.com/2014/03/how-to-export-from-zotero-to-excel/ or http://royce.kimmons.me/tutorials/zotero_to_excel . I think that nowadays it is more important to be able to search and analyse larger sets of data easily, maybe also bibliographic data.

Here is a CSV export translator. The approach is to export all data, i.e. fields that belong to specific item types are left empty in the rows of other types. Moreover, multiple fields and nested fields are translated as "paths" and can optionally be included as well.

Please let me know what you think.
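
To illustrate the "paths" idea, here is a minimal sketch that flattens a nested item into path-style keys such as creators/0/lastName. The function name flattenToPaths and the sample item are hypothetical; the translator's own helper is not shown in this thread and may well work differently.

```javascript
// Hypothetical helper: turn nested objects/arrays into "a/0/b" style keys.
function flattenToPaths(value, prefix = "", out = {}) {
  if (value !== null && typeof value === "object") {
    // Object.keys works for arrays too and yields "0", "1", ...
    for (const key of Object.keys(value)) {
      const path = prefix ? prefix + "/" + key : key;
      flattenToPaths(value[key], path, out);
    }
  } else {
    // Leaf value: store it under the accumulated path.
    out[prefix] = value;
  }
  return out;
}

// Made-up sample item, only for illustration.
const sampleItem = {
  title: "Some Book",
  creators: [{ lastName: "Doe", firstName: "Jane" }],
  tags: [{ tag: "csv" }]
};
console.log(flattenToPaths(sampleItem));
```

Each distinct path would then become one column, which is why this approach can produce many columns.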

CSV.js Outdated
"exportCharset": "UTF-8",
"Export Creators" : true,
"exportNotes": false,
"Export Attachements": false,

@adam3smith

adam3smith Jun 1, 2014

Collaborator

I'm not sure about creating new display options like this, @aurimasv would know, but in any case, attachment is spelled without an e after the ch.

@aurimasv

aurimasv Jun 1, 2014

Contributor

Technically speaking, you can add options like that and they will just display whatever you label them as, but I honestly don't think that we would want to allow tons of random customizations. For one, they are not going to be localized (if you want them to be localized, you'd need to come up with a good reason to have this option and then add it to Zotero code). And secondly, I don't think that there is any need to have such customization for a CSV export translator, where removing columns/rows of exported data is very trivial.

@zuphilip

zuphilip Jun 2, 2014

Author Collaborator

Maybe we can delete the options and really export everything (but this can be a lot of columns). I don't know what exactly is stored in seeAlso. And does it make any sense to export the uniqueFields?

Going in this direction (export everything), I suggest fixing some ordering. That way we can put everything we think is less important (e.g. attachments, tags) in later columns. I tried to come up with a good ordering of the columns, but I think my flexible approach does not always give good results...

@aurimasv

Contributor

commented Jun 1, 2014

Although there's currently no way to perform export tests, could you at least provide some sample output of this translator? (Preferably for an item with multiple authors, tags, attachments, and related items) You can post the result in the comments here on the pull request.

@adam3smith

Collaborator

commented Jun 1, 2014

In general, btw., I think having this is great

@zuphilip

Collaborator Author

commented Jun 2, 2014

Here is an example of the export (with everything included):
https://gist.github.com/zuphilip/3a5b7efe318516e4142e

@zuphilip

Collaborator Author

commented Jul 5, 2014

Please give me some feedback on how to proceed...

@scottmckenzie1980


commented Jul 5, 2014

I just gave this a test run for a project I am doing. Not too shabby! Good job.

It might be nice to be able to select the fields a person wants up front. But it is not too hard to delete them in Excel.

Thanks for your hard work, keep it up!

@zuphilip

Collaborator Author

commented Jul 5, 2014

You can export individual items or a collection by selecting it/them and choosing export from the context menu, cf. https://www.zotero.org/support/kb/exporting

Whether to select fields or just export everything is a good question.

@adam3smith

Collaborator

commented Jul 5, 2014

I'm with @aurimasv on this--I'd just export everything. To be honest, I have no idea what goes into "seeAlso," and I'm a little unclear on what the uniqueFields are--if I understand correctly those are fields that only exist for one item type? In that case we'd absolutely want to include them.

Deleting the options, further improvements
* delete options for excluding certain fields
* prepared customizations in the code (e.g. for excluding certain fields)
* further improvements for the ordering of the columns
* improved documentation
@zuphilip

Collaborator Author

commented Jul 6, 2014

The new version will export everything and therefore all options (except the encoding) are deleted. However, I tried to make certain customizations possible by filling out the function excludeFields. Some further improvements have also been made. Please have a look. I am especially interested in whether the CSV is what you would expect, since there are different possibilities for creating a CSV file.

CSV.js Outdated
} else {
Zotero.write(fieldWrapperCharacter);
if (typeof content === 'string') {//replace don't work on numbers e.g. itemID
content = content.replace(new RegExp(recordsDeliminator, 'g') , " ");// /\r?\n|\r/g

@zuphilip

zuphilip Jul 6, 2014

Author Collaborator

I am not sure if this is working and/or if it is needed. recordsDeliminator is the line break but since we put everything in double quotes it should work without any replacement, at least in theory. For example Excel seems to have problems with line breaks inside one field. What do you think?
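
If line breaks do get replaced for the sake of Excel, a sketch along the lines of the commented-out pattern in the snippet above could look as follows. The function name flattenLineBreaks is hypothetical and this is not the code from the pull request.

```javascript
// Hypothetical helper: replace every line break inside a field with one space.
// Handles Windows (\r\n), Unix (\n) and old Mac (\r) line endings.
function flattenLineBreaks(content) {
  // "" + content coerces numbers (e.g. itemID) to strings first.
  return ("" + content).replace(/\r?\n|\r/g, " ");
}

console.log(flattenLineBreaks("first line\r\nsecond line"));
```

Matching \r\n as one unit avoids emitting two spaces for a single Windows line break.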

@aurimasv

aurimasv Jul 18, 2014

Contributor

to avoid issues with numbers, simply do content = "" + content

We should only need to escape the double quotes. Everything else should work as is (e.g. newlines inside quotes are ok). I see you do escape double quotes below, but I'm not sure if the same escaping works with other characters as wrappers. I would suggest to not make this variable (easily) configurable. The code is fairly clean and if someone needs to export in some non-standard format, I think they can easily figure out what to change and how to handle escaping.

@aurimasv

Contributor

commented Jul 7, 2014

Sorry for the delay. Here are my main issues at this point (in no particular order):

  • I don't think there's much point in exporting itemID. It's purely internal and I don't think users will ever encounter it. I would go with item key as the first column.
  • The order of fields in general should be more useful. I would export more useful fields first, then maybe alphabetically with some fields pushed all the way to the back. So for the first few fields, I would do item key, title, date, publication title, creators*, etc.
  • Export only base fields (i.e. uniqueFields), do not export item type-specific variations. E.g. websiteTitle should be exported as publication title. Don't export uniqueFields separately (i.e. no uniqueFields/...)
  • Always export all fields, even if none of the items have them filled in. (actually, it seems like this is already the case, but I would just like to specifically point it out, because it will come into play in the next point)
  • For creators, export first and second creator of each type into their own dedicated column. Export the remainder of the creators into a third column (per creator type), separating them with semicolon. (edit: always export first, second, and other creators even if they're not present. Push less common creator fields to the back, like translators, etc.)
  • One column per creator. Don't export creator ID. Instead of exporting creator type into a column, title the column appropriately. Instead of exporting first and last name into separate columns, export them in RIS format (last, first). Don't export field mode.
  • For notes, only output the contents of the note, don't output itemID, itemType, dateAdded, dateModified, key, sourceItemKey. Output all notes into a single column. Separate by newlines (not entirely sure how this works in CSV. I think quotes are supposed to protect newlines. Talking about quotes, make sure those get escaped in whatever way is appropriate). Nit: strip HTML, converting to newlines where appropriate.
  • For attachments, only output either URL, or zotero URI, nothing else. Split up into PDF, HTML, and Other columns based on mimeType. Separate multiple URLs/URIs with newlines.
  • For tags, output only the actual tag. Output all tags into one column. Separate tags by... semicolon? (not sure though. There may be a chance that a semicolon is actually part of a tag, but I think that chance is very low)

I think that's all for now. If you have a better suggestion for separators (esp. newline) feel free to suggest. Also, I would wait for @adam3smith to pitch in on the above points.
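
As one illustration of the attachment point in the list above, here is a minimal sketch that sorts attachment URLs into PDF, HTML, and Other columns. It assumes each attachment is an object with url and mimeType properties; the function name groupAttachmentUrls and the sample data are hypothetical.

```javascript
// Hypothetical helper: bucket attachment URLs by MIME type.
// Assumes attachments look like { url: "...", mimeType: "..." }.
function groupAttachmentUrls(attachments) {
  const buckets = { pdf: [], html: [], other: [] };
  for (const attachment of attachments) {
    if (!attachment.url) continue; // nothing to export for this attachment
    if (attachment.mimeType === "application/pdf") {
      buckets.pdf.push(attachment.url);
    } else if (attachment.mimeType === "text/html") {
      buckets.html.push(attachment.url);
    } else {
      buckets.other.push(attachment.url);
    }
  }
  // Multiple URLs in one column are separated by newlines.
  return {
    pdf: buckets.pdf.join("\n"),
    html: buckets.html.join("\n"),
    other: buckets.other.join("\n")
  };
}

console.log(groupAttachmentUrls([
  { url: "http://example.org/paper.pdf", mimeType: "application/pdf" },
  { url: "http://example.org/snapshot.html", mimeType: "text/html" }
]));
```

All three columns are always present, even when empty, which keeps the number of columns fixed.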

@zuphilip

Collaborator Author

commented Jul 8, 2014

Because you are suggesting a lot of things that contradict the approach I chose, I would like to respond to a few things and ask for a united path to continue:

  • Excluding some fields from export or including only a subset of fields: You make some good suggestions and they look reasonable, i.e. this looks mostly okay to me
  • Stick to a very rigid schema, e.g., exactly the first two authors should be written out in separate fields and all remaining authors should be captured in another field. This reminds me of the "et al." in citations, but for storing data it is IMO (at least) unusual. What advantage do you see in fixing things like the number of columns beforehand? Moreover, a more flexible approach could mean less work in the future...
  • The "nested CSV structure": You suggested for example to put all tags in one field and separate them by a semicolon. Why is this preferable to putting the tags into several fields? One could avoid this nesting. Actually, it is also easy, for example in Excel, to concatenate several fields into one field. The other direction may (at least in further computations in Excel) be more problematic...
@dstillman

Member

commented Jul 8, 2014

You suggested for example to put all tags in one field and separate them by a semicolon. Why is this preferable to putting the tags into several fields?

If you put them in multiple fields, wouldn't you potentially need, say, dozens of columns that might be empty for the vast majority of items, if a single item had dozens of tags? That seems much messier than having a single column and choosing a delimiter that's unlikely to appear in tags. (But I don't know how well Excel handles further delimiters within columns.)

@aurimasv

Contributor

commented Jul 8, 2014

Excluding some fields from export or including only a subset of fields

Just to sum up, I was suggesting that we omit most of the fields that are internal to Zotero and would probably not be beneficial to the user. I think we can export all of the actual metadata.

Stick to a very rigid schema, e.g., exactly the first two authors should be written out in separate fields and all remaining authors should be captured in another field. This reminds me of the "et al." in citations, but for storing data it is IMO (at least) unusual

The et al style was precisely what I was going for. The reason for sticking to a very rigid schema (i.e. always the same number of columns) is that this would allow users to export items from various collections and then copy and paste them together or whatnot. The variable fields (creators, tags, attachments, etc.) are a big problem for this, so we need to come up with a way to concatenate them. For creators, first author is often useful separately, second author maybe not so much (maybe get rid of that as well). The rest of the creators will offer limited use in a spreadsheet. For tags, they're all equally "important" (maybe split automatic and user tags), so there's no rationale for splitting them. In all of these cases, if we choose a good delimiter, splitting text to columns (see Excel function titled that way) is fairly trivial if the user wishes to.
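
The fixed-schema argument can be sketched roughly like this: every row is built from the same column list, and variable-length values are concatenated with a delimiter. The column names, the "; " delimiter, and the function name buildRow are assumptions for illustration only.

```javascript
// Hypothetical fixed column list; every exported row has exactly these cells.
const COLUMNS = ["title", "date", "creators", "tags"];

// Build one row. Arrays (creators, tags, ...) collapse into a single cell.
function buildRow(item) {
  return COLUMNS.map((name) => {
    const value = item[name];
    if (Array.isArray(value)) return value.join("; ");
    if (value === undefined || value === null) return ""; // keep the column, leave it empty
    return "" + value;
  });
}

console.log(buildRow({ title: "Some Book", tags: ["csv", "export"] }));
```

Because each row has the same length regardless of how many tags or creators an item has, exports from different collections line up when pasted together.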

@zuphilip

Collaborator Author

commented Jul 17, 2014

Sorry, it took me a little longer. I created a new version including most of your comments. I created a new output example: https://gist.github.com/zuphilip/3a5b7efe318516e4142e
Putting all tags into one field etc. is okay. I am not sure about all creator types. I didn't include the mime type you suggested, because there could be a lot more. I guess it would be good to hear some more feedback from you at this step.

@dstillman

Member

commented Jul 17, 2014

  1. attachements should be attachments
  2. There are commas at all the line endings (which probably doesn't matter, but might as well get rid of them).
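
One way to avoid the trailing commas is to join the fields of a record with the delimiter instead of writing a delimiter after every field. This is only a sketch; writeRecord is a hypothetical name and not a function from the translator.

```javascript
// Hypothetical helper: serialize one record without a trailing delimiter.
function writeRecord(fields) {
  const quoted = fields.map((field) => {
    // Coerce to string, double any embedded quotes, wrap in quotes.
    return '"' + ("" + field).replace(/"/g, '""') + '"';
  });
  // join() puts the comma only between fields, never after the last one.
  return quoted.join(",") + "\n";
}

console.log(writeRecord(["Some Book", 2014, ""]));
```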
CSV.js Outdated

var recordsDeliminator = "\n";
var fieldsDeliminator = ",";
var fieldWrapperCharacter = '"';

@aurimasv

aurimasv Jul 18, 2014

Contributor

It may be nice if the user can change this, but I think that most users will not need to. I wouldn't mind leaving this, except that according to CSV spec (RFC4180 section 2), if a field is enclosed in double quotes and it contains a double quote, the double quote in the field needs to be escaped by placing another double quote before it (essentially "foo ""bar""" means foo "bar"). We need to make sure we perform this escape, but we can only do it if we follow the spec, which requires that fields are wrapped in double quotes.

(btw, it should be spelled "delimiter")
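
The escaping rule described above can be sketched as follows, with the wrapper fixed to a double quote. The function name escapeCsvField is hypothetical; the coercion uses the "" + content idiom suggested earlier in this thread.

```javascript
// Hypothetical helper: quote one field following the RFC 4180 rule that an
// embedded double quote is escaped by another double quote.
function escapeCsvField(value) {
  const text = "" + value; // also works for numbers such as itemID
  return '"' + text.replace(/"/g, '""') + '"';
}

console.log(escapeCsvField('foo "bar"'));
```

Newlines and commas inside the field need no further treatment because the whole field is wrapped in quotes.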

@zuphilip

zuphilip Jul 18, 2014

Author Collaborator

The fieldWrapperCharacter is already escaped if it occurs in a field, cf. zuphilip@aa0856e#diff-de73edee40372db6327952c29b0bba63L49


//The export will be stucked if you try to export to a csv-file
//which is already opend with Excel. Thus, close it before or rename
//the new csv-file.

@aurimasv

aurimasv Jul 18, 2014

Contributor

I haven't tested this, but if we get stuck when trying to write to a locked file, we should throw an error, no? @dstillman?

@dstillman

dstillman Jul 18, 2014

Member

We should at least throw the general translator failure dialog, yes.

@aurimasv

aurimasv Jul 18, 2014

Contributor

Are you overhauling the whole disk IO to use OS.File? If so, I'll just wait until you push all of your changes to master.

@dstillman

dstillman Jul 18, 2014

Member

Some, but not all (since a lot of Mozilla functions still take nsIFile and it's less of a performance issue than DB access). But the majority of changes outside of translators now will trigger merge conflicts with my overhaul, so might as well wait.

CSV.js Outdated

//It is possible to disable some multiple fields from the export.
//just set the corresponding flag below to false.
var multipleFieldsForExport = {'creators' : true, 'tags' : true, 'notes' : true, 'attachments' : true};

@aurimasv

aurimasv Jul 18, 2014

Contributor

if you intend users to configure this, I would split these up onto separate lines.

CSV.js Outdated
var contentArray = [ [], [] ];
for (var k=0; k<tagsObject.length; k++) {
var test = evaluate(item, 'tags/' + k + '/type');
contentArray[test].push( evaluate(item, 'tags/' + k + '/tag') );//Is tags/0/tag = tags/0/fields/name ?

CSV.js Outdated
for (var k=0; k<attObject.length; k++) {
contentArray.push( evaluate(item, 'attachments/' + k + '/url') );
}
exportField( contentArray.join(' ') );

@aurimasv

aurimasv Jul 18, 2014

Contributor

we can probably do newlines for this as well. No strong preference from me, but it would be more consistent and maybe less confusing.

@zuphilip

zuphilip Jul 18, 2014

Author Collaborator

Newlines are problematic with Excel, see below.

CSV.js Outdated
var creatorLastName = evaluate(item, 'creators/' + k + '/lastName');
var creatorFirstName = evaluate(item, 'creators/' + k + '/firstName');
var creatorFieldMode = evaluate(item, 'creators/' + k + '/fieldMode');
if (creatorFieldMode == "") {

@aurimasv

aurimasv Jul 18, 2014

Contributor

fieldMode is actually either 1 (for institutional authors) or 0 (for last, first). I would just do if (!creatorFieldMode), but this does work, since "" == 0 == false
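
A sketch of the truthiness check: with !creator.fieldMode, the values 0, "" and undefined all take the "last, first" branch. It assumes that a single-field (institutional) name is stored in lastName; formatCreator is a hypothetical name.

```javascript
// Hypothetical helper: format one creator for a CSV cell.
// Assumption: for fieldMode 1 the full name is kept in lastName.
function formatCreator(creator) {
  if (!creator.fieldMode) {
    // Two-field name: "last, first" (first name may be missing).
    return creator.firstName
      ? creator.lastName + ", " + creator.firstName
      : creator.lastName;
  }
  // Single-field name, e.g. an institution.
  return creator.lastName;
}

console.log(formatCreator({ lastName: "Doe", firstName: "Jane", fieldMode: 0 }));
console.log(formatCreator({ lastName: "World Health Organization", fieldMode: 1 }));
```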

CSV.js Outdated
var creatorType = evaluate(item, 'creators/' + k + '/creatorType');
var creatorLastName = evaluate(item, 'creators/' + k + '/lastName');
var creatorFirstName = evaluate(item, 'creators/' + k + '/firstName');
var creatorFieldMode = evaluate(item, 'creators/' + k + '/fieldMode');

@aurimasv

aurimasv Jul 18, 2014

Contributor

why not just get the creator once and access its properties directly? var creator = creatorsObject[k]; creator.lastName...

CSV.js Outdated
//export
for (var index=0; index<creatorsType.length; index++) {
var contentArray = contentCreators[ creatorsType[index] ];
if (contentArray.length > 0) {

@aurimasv

aurimasv Jul 18, 2014

Contributor

[].join(';') == '' so you don't need the if-else
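
A short sketch of why the branch is unnecessary: grouping creators by type and joining each group yields an empty string for types that have no creators. The function name creatorColumns and the list of types passed in are hypothetical.

```javascript
// Hypothetical helper: one CSV cell per creator type, in the given order.
function creatorColumns(creators, creatorTypes) {
  const byType = new Map(creatorTypes.map((type) => [type, []]));
  for (const creator of creators) {
    const names = byType.get(creator.creatorType);
    if (!names) continue; // type without a dedicated column
    names.push(creator.firstName
      ? creator.lastName + ", " + creator.firstName
      : creator.lastName);
  }
  // join() on an empty array returns "", so no if-else is needed here.
  return creatorTypes.map((type) => byType.get(type).join("; "));
}

console.log(creatorColumns(
  [{ creatorType: "author", lastName: "Doe", firstName: "Jane" }],
  ["author", "editor"]
));
```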

CSV.js Outdated



Z.debug(item);

@aurimasv

aurimasv Jul 18, 2014

Contributor

remove for final version

@aurimasv

Contributor

commented Jul 18, 2014

Most of my comments above are just nits. My two major concerns are the order of the columns and first author column.

As I mentioned before, I do think that we should prioritize the more important metadata (i.e. title, authors, date, container title, URL, DOI, date added. Tags should probably be somewhere towards the top as well).

I also think that we should include first author on its own. Whether that means excluding first author from "other authors" field or not, I'm not sure. I think we should also add Year of publication.

@zuphilip

Collaborator Author

commented Jul 18, 2014

Order of columns: This is certainly not yet optimized. Can you suggest a good ordering? Reordering single fields is trivial, and I guess it is also somehow possible to rearrange multiple fields in between...

First author column: Why do you think this is needed? It should be simple to "calculate" the first author column from the column containing all authors with Excel. Moreover, I can imagine ordering the records according to the first author's name, but still this would also be possible with one field containing all authors. Moreover, I think it may be useful to calculate the number of authors, which is more elegant in the setting with one field containing all authors. Would you include a first creator field just for authors or for every other creator type as well?

Creator Types: These are giving now 29 columns. Was this the way you imagine it? It would be good if you could check the completeness and maybe you want to suggest an ordering here as well.

Newlines: At the moment I replace every newline with spaces, because I couldn't manage to create a CSV file with newlines in a field that Excel reads correctly (although it is possible according to the specification). There are results for googling "excel newline csv" but I couldn't come up with a working solution. Any ideas?
Update: The newlines in Excel seem to work if one uses a semicolon ; as the field separator. The CSV export from Excel also uses \r\n as the record delimiter and \n for newlines inside a field, but this seems not to be essential.

Year of Publication: Well, it is exported as part of uniqueFields/date. Do you suggest to add another column "publicationYear", with just the year? Or is this connected to the first author comment?

@aurimasv

Contributor

commented Jul 26, 2014

First author column: Why do you think this is needed? It should be simple to "calculate" the first author column from the column containing all authors with Excel. Moreover, I can imagine ordering the records according to the first author's name, but still this would also be possible with one field containing all authors. Moreover, I think it may be useful to calculate the number of authors, which is more elegant in the setting with one field containing all authors. Would you include a first creator field just for authors or for every other creator type as well?

Eh, ok. No need for first author. Can't think of a good use case.

Creator Types: These are giving now 29 columns. Was this the way you imagine it? It would be good if you could check the completeness and maybe you want to suggest an ordering here as well.

Yes, sort of. We would still want to use the base name. E.g. "artist" should be under "author". Other than author and editor, i don't think the order matters. Could be alphabetical.

Year of Publication: Well, it is exported as part of uniqueFields/date. Do you suggest to add another column "publicationYear", with just the year? Or is this connected to the first author comment?

Yes, I think this is more necessary than first author, since Date field is very free form and Excel is not going to understand most of them for sorting.
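
A rough sketch of deriving such a year column from a free-form date string, using a plain regular expression. The function name extractYear is hypothetical and this is not necessarily how the translator ended up doing it; a real date parser would be more robust.

```javascript
// Hypothetical helper: pull the first standalone four-digit number
// out of a free-form date string; returns "" when none is found.
function extractYear(dateText) {
  const match = /\b(\d{4})\b/.exec("" + dateText);
  return match ? match[1] : "";
}

console.log(extractYear("July 5, 2014"));
console.log(extractYear("2014-07-05"));
```

Returning an empty string rather than null keeps the cell empty in the exported row.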

Order of columns: This is certainly not yet optimized. Can you suggest a good ordering? Reordering single fields is trivial, and I guess it is also somehow possible to rearrange multiple fields in between...

What I would like to see all the way on the left are itemType, publication year, author, title, publication title, ISSN/ISBN (actually, these are exclusive, so they can be the same column), DOI, URL, abstract, date added, date modified. The rest can be in alphabetical order (or whatever is easier to export). Might have missed something.
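
That ordering rule (a handful of priority columns on the left, everything else alphabetical) could be sketched like this. The field identifiers in the example are assumptions chosen for illustration, and orderColumns is a hypothetical name.

```javascript
// Hypothetical helper: priority columns first (in the given order),
// then all remaining columns alphabetically.
function orderColumns(fieldNames, priority) {
  const present = new Set(fieldNames);
  const first = priority.filter((name) => present.has(name));
  const firstSet = new Set(first);
  const rest = fieldNames.filter((name) => !firstSet.has(name)).sort();
  return first.concat(rest);
}

// Assumed field identifiers, only for illustration.
console.log(orderColumns(
  ["volume", "title", "abstractNote", "itemType", "DOI"],
  ["itemType", "title", "DOI"]
));
```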

Newlines: At the moment I replace every newline with spaces, because I couldn't manage to create a CSV file with newlines in a field that Excel reads correctly (although it is possible according to the specification). There are results for googling "excel newline csv" but I couldn't come up with a working solution. Any ideas?
Update: The newlines in Excel seem to work if one uses a semicolon ; as the field separator. The CSV export from Excel also uses \r\n as the record delimiter and \n for newlines inside a field, but this seems not to be essential.

If I take the CSV file you posted and I add newlines (Unix style, \n) into whatever field I want, they import properly for me. Do you have a specific example of where this is not working? I'm using Excel 2010

New version
* improved the ordering
* delete the last comma in each line
* improved handling for author types
* add publication year as separate field
@zuphilip

Collaborator Author

commented Jul 27, 2014

Please have a look at the new version. I tried to adjust the creators with their different creator types as I understood your comments. If you still want to handle them differently, then please give me some more information on how to continue. The last comma in each line is deleted, and the ordering should be improved, as well as the spelling mistakes.

However, I think it is easier to handle ISBN and ISSN in two separate fields. IMO we should think about what is useful for further computation, because I think these will be the use cases. For example, I want to check the availability of some books in a library (my use case) by copying the ISBN column. If ISBN and ISSN were in one column, then I would first have to filter it or export only books, which could mean an extra step. On the other hand, I cannot think of a use case where we want to calculate something from a unified ISBN/ISSN field. Okay?

The example is updated as well: https://gist.github.com/zuphilip/3a5b7efe318516e4142e

This would also be the example where Excel is not handling the line breaks correctly. It contains line breaks in fields and escapes of the field delimiter ". My impression is that maybe Excel has not implemented everything that would be possible with CSV. There seems to be no problem importing the example in LibreOffice.

@aurimasv

Contributor

commented Jul 31, 2014

Re ISBN/ISSN you're absolutely right. My comment was poorly thought out.

This would also be the example where Excel is not handling the line breaks correctly. It contains line breaks in fields and escapes of the field delimiter ". My impression is that maybe Excel has not implemented everything that would be possible with CSV. There seems to be no problem importing the example in LibreOffice.

I'm not having any issues with newlines in that CSV output. Excel seems to be treating them correctly. Again, I'm using 2010, so maybe it's a difference between versions. OTOH if you are seeing these consistent issues, I'm not too committed to newlines, so we can replace them with spaces.

Otherwise the output looks great! The only remaining issue that probably only affects excel is Unicode. Excel does not recognize the text as Unicode and messes up the encoding. I did find out that we can fix this by adding a BOM ("\uFEFF") to the beginning of the file. Didn't find anything about this in the specs and I'm not sure if there would be issues with BOM in other software (at this point, there really shouldn't). It's also pretty trivial to remove if you're trying to do something with the data programmatically.

If we're ok with the output, I can look over the code one last time and we can merge.

Edit: had the wrong BOM character

Edit2: that was actually right. In any case, there's a better way to set BOM (though that's currently broken, but hopefully will be fixed soon). Use displayOptions: { exportCharset: "UTF-8xBOM" } in the translator info section at the top.
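
For anyone handling the exported text programmatically, adding and removing the BOM by hand can be sketched as below. The function names withBom and stripBom are hypothetical, and this is separate from the exportCharset option mentioned above.

```javascript
// The byte order mark as a JavaScript string.
const BOM = "\uFEFF";

// Hypothetical helper: prepend the BOM unless it is already there.
function withBom(csvText) {
  return csvText.startsWith(BOM) ? csvText : BOM + csvText;
}

// Hypothetical helper: drop a leading BOM before further processing.
function stripBom(text) {
  return text.startsWith(BOM) ? text.slice(1) : text;
}

console.log(stripBom(withBom('"title","date"\n')) === '"title","date"\n');
```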

@aurimasv aurimasv referenced this pull request Aug 1, 2014

Open

Export fixes #517

@zuphilip

Collaborator Author

commented Aug 1, 2014

Thank you for dealing with the encoding stuff.

I am fine with the output in general, but I just corrected a small inconsistency in the exportField function. Moreover, I think we should change the encoding for empty fields. At the moment we export "" if the field is empty, but two consecutive double quotation marks are actually the escape sequence for a double quotation mark. Maybe it is better to export just , in this case. On the other hand, it might be handy if the content of every field is surrounded by quotation marks. What do you think? I haven't found anything about empty fields in the specification.

UPDATE: As far as I understand the grammar in the specification, it is possible to write empty fields either as nothing at all or as an empty string "". Moreover, all tests seem to handle them correctly. Thus, I would leave it as it is now. If you have other suggestions, please let me know.
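
To check that both spellings of an empty field read back the same way, here is a minimal reader sketch. It is not a complete RFC 4180 implementation, it assumes a comma as the field delimiter, and parseCsv is a hypothetical name.

```javascript
// Hypothetical minimal CSV reader: handles quoted fields, doubled quotes,
// and line breaks inside quotes. Returns an array of rows (arrays of strings).
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = "";
  let inQuotes = false;
  let pending = false; // true while the current record has unflushed content
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') { field += '"'; i++; } // escaped quote
      else if (ch === '"') inQuotes = false;                       // closing quote
      else field += ch;                                            // includes newlines
    } else if (ch === '"') {
      inQuotes = true; pending = true;
    } else if (ch === ",") {
      row.push(field); field = ""; pending = true;
    } else if (ch === "\n" || ch === "\r") {
      if (ch === "\r" && text[i + 1] === "\n") i++; // treat \r\n as one break
      row.push(field); rows.push(row);
      row = []; field = ""; pending = false;
    } else {
      field += ch; pending = true;
    }
  }
  if (pending) { row.push(field); rows.push(row); } // last record without line break
  return rows;
}

console.log(parseCsv('"a",,""\n'));
```

In this sketch an unquoted empty field and a quoted "" both come back as the empty string, which matches the reading of the grammar given above.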

I am also using Excel 2010 but still it fails to import CSV files with line breaks. Actually, if I create a CSV file in Excel with line breaks and try to import it, this also fails. Maybe it is a problem for Excel on Windows or maybe the German localization is slightly different (for example we use normally , as decimal delimiter, e.g. 9,49 €). Maybe, the two tests on our computers are too small for a representative sample...

@zuphilip

Collaborator Author

commented Aug 3, 2014

Code is now ready for your review. Let me know what you think about it.

@aurimasv aurimasv closed this in 4e2a735 Aug 14, 2014

@aurimasv

Contributor

commented Aug 14, 2014

Sorry @zuphilip, I rewrote most of it (though I kept you as the author and actually kept your initial commit, I hope you don't mind). I think this makes it easier for users to customize the translator. I don't think I changed anything from what we agreed upon above. Do let me know if I did.

I tested it a bit (not too much though), but I probably overlooked something anyway, so please give it a couple quick tests with your data set.

Thanks for working on this!

@zuphilip

Collaborator Author

commented Aug 14, 2014

Okay, this looks fine to me. You managed to write it much more compactly (145 lines vs. 311 lines) and to improve it with your knowledge and experience. I made some small comments in the commit and I will test it later. I am missing some creator types like 'artist', 'performer', 'sponsor', but maybe they are subsumed somewhere else?

Update: @aurimasv Some creator types are missing, e.g. artist for an artwork. We should either subsume them in the author field (and give it a different (appropriate) label!) or add individual fields.

aurimasv added a commit that referenced this pull request Aug 14, 2014

@aurimasv

Contributor

commented Aug 14, 2014

I think I addressed all of your comments. Thanks for testing!!
