Support for Converting Parsed WOS File Contents Back to Plain Text Format #17

hp0404 · 2023-08-30T12:41:17Z

We currently use the library for parsing WOS file contents, which has been incredibly useful in our data processing pipeline. However, we have encountered a new requirement in our workflow: we need to convert the parsed WOS file contents (think of one long dictionary holding the parsed content of multiple WOS files) back into the original WOS plain-text format (.txt). This is useful, for example, if you want to work with the parsed files in R or CiteSpace.

Currently, the library focuses primarily on reading and parsing WOS files but doesn't provide functionality for converting the parsed data back into a text format.

sdspieg · 2023-08-30T13:23:02Z

Let me chime in here (working with Hryhorii on this). What we are essentially trying to do is merge and deduplicate the exports from various sources (Dimensions, Google Scholar (through scholarly), Lens, OpenAlex, Scopus, Web of Science) into a new file. We use CrossRef and DOIs as the deduplication key and, for every set of duplicates, select the version of each field that has the most characters. We then want to get that file into a 'WoS' format that can be used in CiteSpace, Bibliometrix, etc. We are able to produce such a file, but for some reason Bibliometrix, for instance, throws an error when we try to import it. Do you have any suggestions?
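
For readers following along, the merge step described above could look roughly like the sketch below. It is only an illustration under stated assumptions, not the code used in this thread: records are assumed to be plain dicts keyed by WoS field tags, the DOI is assumed to live under the `DI` tag, `merge_by_doi` is a hypothetical helper name, and no CrossRef lookup is performed.

```python
from collections import defaultdict


def merge_by_doi(records):
    """Merge records that share a DOI, keeping the longest value per field.

    Hypothetical helper: assumes each record is a dict keyed by WoS tags
    and that the DOI is stored under "DI".
    """
    groups = defaultdict(list)
    for index, rec in enumerate(records):
        doi = (rec.get("DI") or "").strip().lower()
        # Records without a DOI cannot be matched, so each gets its own group.
        key = ("doi", doi) if doi else ("no-doi", index)
        groups[key].append(rec)

    merged = []
    for recs in groups.values():
        combined = {}
        for rec in recs:
            for field, value in rec.items():
                if value is None:
                    continue
                # Length of the string form; for list values this is only
                # a rough proxy for "most characters".
                if len(str(value)) > len(str(combined.get(field, ""))):
                    combined[field] = value
        merged.append(combined)
    return merged
```

DOIs are compared case-insensitively here because they are case-insensitive identifiers; on a tie in length, the value from the first record seen is kept.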

rafguns · 2023-08-30T13:49:27Z

Hi both! You're right, this is a use case that the library currently doesn't implement. But I think it would fit in the scope of the package, so I'm open to including writing functionality.

If I understand @sdspieg correctly, you already implemented this yourself but the resulting files cause an error in Bibliometrix? Would it be possible to share your current code and/or an example of such a file?

sdspieg · 2023-08-30T13:52:34Z

Sure - just give us a few minutes. Can we email it to you?

hp0404 · 2023-08-30T19:59:07Z

Here's the first 'working' solution I've developed - it creates a similar enough format that bibliometrix (R package) accepts the file without issues.

I wrote the code based on our own file (there are some weird parts, such as 'cleaning' the year-published field), but in the example below I'm using wos_plaintext.txt:

import wosfile
from wosfile.tags import has_item_per_line, is_splittable

if __name__ == "__main__":

    records = list(wosfile.records_from("data/wos_plaintext.txt"))

    with open("data/output.txt", "w", encoding="utf-8") as f:
        f.write("FN Thomson Reuters Web of Science™\n")
        f.write("VR 1.0\n")

        for record in records:
            # the first key is PT - specific to our case
            if "PT" in record:
                pt = record.pop("PT")
            else:
                # our default value is J=Journal
                pt = "J"
            f.write(f"PT {pt}\n")

            # fixing bad dates - specific to our case
            if "PY" in record:
                possible_py = record.pop("PY")
                if possible_py:
                    py = int(float(possible_py))
                    f.write(f"PY {py}\n")

            for abbr, content in record.items():
                if content is None or str(content) == "" or str(content) == "NA":
                    continue

                # splittable, stays on the same line
                if is_splittable[abbr] and not has_item_per_line[abbr]:
                    if isinstance(content, list):
                        content = "; ".join(content)
                    else:
                        # skip chunks that are empty or whitespace-only
                        content = "; ".join(
                            chunk.strip() for chunk in content.split(";") if chunk.strip()
                        )
                    f.write(f"{abbr} {content}\n")

                # splittable across multiple lines
                elif is_splittable[abbr] and has_item_per_line[abbr]:
                    if isinstance(content, str):
                        content = content.split(";")
                    if not isinstance(content, list):
                        raise TypeError(f"{abbr} must be a list at this point.")
                    # empty
                    if content == []:
                        continue
                    first_mention = content[0]
                    f.write(f"{abbr} {first_mention.strip()}\n")
                    for next_mention in content[1:]:
                        f.write(f"   {next_mention.strip()}\n")

                # C1 field
                elif not is_splittable[abbr] and has_item_per_line[abbr]:
                    if content.startswith("["):
                        first_bracket = content.split("[")[1]
                        f.write(f"{abbr} [{first_bracket.strip()}\n")
                        for next_bracket in content.split("[")[2:]:
                            f.write(f"   [{next_bracket.strip()}\n")
                    else:
                        # in our file, C1 does not start with "["
                        f.write(f"{abbr} {content}\n")

                # regular key-value pairs
                else:
                    if isinstance(content, list):
                        raise TypeError(f"{abbr} must not be a list at this point.")
                    f.write(f"{abbr} {content}\n")

            # the last key is ER
            f.write("ER\n\n")
        f.write("EF")
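
The field-writing logic above could also be factored into small pure functions, which makes it easier to test without touching the file system. The following is a minimal sketch of that idea, not wosfile's API: `format_field`, `format_record` and the `per_line_tags` default are hypothetical names and values chosen for illustration (in the script above that information comes from `has_item_per_line`), and the address-field (C1) handling is left out.

```python
def format_field(tag, value, one_per_line=False):
    """Return the plain-text lines for a single field (hypothetical helper)."""
    if isinstance(value, (list, tuple)):
        items = [str(v).strip() for v in value if str(v).strip()]
    else:
        text = str(value).strip()
        items = [text] if text else []
    if not items:
        return []
    if not one_per_line:
        # Multiple items share one line, separated by "; ".
        return [f"{tag} {'; '.join(items)}"]
    # First item follows the tag; later items go on continuation lines
    # indented by three spaces, as in the script above.
    lines = [f"{tag} {items[0]}"]
    lines.extend(f"   {item}" for item in items[1:])
    return lines


def format_record(record, per_line_tags=frozenset({"AU", "AF", "CR"})):
    """Format one record dict as text, ending with the ER line.

    The default per_line_tags set is an assumption for illustration only.
    Fields are written in the dict's own order, so "PT" should come first.
    """
    lines = []
    for tag, value in record.items():
        if value is None:
            continue
        lines.extend(format_field(tag, value, tag in per_line_tags))
    lines.append("ER")
    return "\n".join(lines) + "\n"
```

A caller would still write the `FN`/`VR` header before the first record and `EF` after the last one, as the script above does.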

rafguns · 2023-09-01T12:33:47Z

I'm reopening this since this functionality might be of interest to other users as well.

Thanks for the example code, @hp0404. Can I base my implementation on yours?

hp0404 · 2023-09-01T17:20:43Z

sure thing!

hp0404 closed this as completed Sep 1, 2023

rafguns reopened this Sep 1, 2023
