Support for Converting Parsed WOS File Contents Back to Plain Text Format #17

Open
hp0404 opened this issue Aug 30, 2023 · 6 comments

Comments

@hp0404

hp0404 commented Aug 30, 2023

We currently use the library to parse WOS file contents, and it has been incredibly useful in our data processing pipeline. However, our workflow now has a new requirement: we need to convert the parsed WOS file contents (think of one long dictionary holding the parsed content of multiple WOS files) back into the original WOS plain-text format (.txt). This is useful, for example, if you want to work with the parsed files in R or CiteSpace.

Currently, the library focuses primarily on reading and parsing WOS files but doesn't provide functionality for converting the parsed data back into a text format.

@sdspieg

sdspieg commented Aug 30, 2023

Let me chime in here (I'm working with Hryhorii on this). Essentially, we are trying to merge and deduplicate the exports from various sources (Dimensions, Google Scholar (through scholarly), Lens, OpenAlex, Scopus, Web of Science). We use CrossRef and DOIs as the deduplication key and, for every set of duplicates, keep the version of each field that has the most characters. We then want to write the result to a new file in a 'WoS' format that can be used in CiteSpace, Bibliometrix, etc. We are able to produce such a file but, for some reason, Bibliometrix, for instance, throws an error when we try to import it. Do you have any suggestions?
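
To make the merge rule concrete, here is a minimal sketch of the deduplication step. It assumes each record is a plain dict keyed by WoS tags, that the DOI sits under the `DI` tag, and that "most characters" is measured on `str(value)`. The function name `merge_by_doi` is made up for illustration, and the CrossRef lookup is left out entirely.

```python
def merge_by_doi(records):
    """Merge records that share a DOI, keeping the longest value per field."""
    merged = {}   # normalised DOI -> merged record
    no_doi = []   # records without a DOI are passed through unmerged

    for record in records:
        # Normalise so that "10.1/ABC" and "10.1/abc" count as the same DOI.
        doi = str(record.get("DI") or "").strip().lower()
        if not doi:
            no_doi.append(dict(record))
            continue

        target = merged.setdefault(doi, {})
        for field, value in record.items():
            if value is None:
                continue
            # Keep whichever version of the field has more characters.
            if len(str(value)) > len(str(target.get(field, ""))):
                target[field] = value

    return list(merged.values()) + no_doi
```

Ties keep the value that was seen first, so the order of the input sources acts as a priority list.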

@rafguns
Owner

rafguns commented Aug 30, 2023

Hi both! You're right, this is a use case that the library currently doesn't implement. But I think it would fit in the scope of the package, so I'm open to including writing functionality.

If I understand @sdspieg correctly, you already implemented this yourself but the resulting files cause an error in Bibliometrix? Would it be possible to share your current code and/or an example of such a file?

@sdspieg

sdspieg commented Aug 30, 2023

Sure - just give us a few minutes. Can we email it to you?

@hp0404
Author

hp0404 commented Aug 30, 2023

Here's the first 'working' solution I've developed - it creates a similar enough format that bibliometrix (R package) accepts the file without issues.

I wrote the code based on our own file (there are some weird parts, such as 'cleaning' the year-published field), but in the example below, I'm using wos_plaintext.txt.

import wosfile
from wosfile.tags import has_item_per_line, is_splittable

if __name__ == "__main__":

    records = [rec for rec in wosfile.records_from("data/wos_plaintext.txt")]

    with open("data/output.txt", "w", encoding="utf-8") as f:
        f.write("FN Thomson Reuters Web of Science™\n")
        f.write("VR 1.0\n")

        for record in records:
            # the first key is PT - specific to our case
            if "PT" in record:
                pt = record.pop("PT")
            else:
                # our default value is J=Journal
                pt = "J"
            f.write(f"PT {pt}\n")

            # fixing bad dates - specific to our case
            if "PY" in record:
                possible_py = record.pop("PY")
                if possible_py:
                    py = int(float(possible_py))
                    f.write(f"PY {py}\n")

            for abbr, content in record.items():
                if content is None or str(content) == "" or str(content) == "NA":
                    continue

                # splittable, stays on the same line
                if is_splittable[abbr] and not has_item_per_line[abbr]:
                    if isinstance(content, list):
                        content = "; ".join(content)
                    else:
                        # skip empty and whitespace-only chunks
                        content = "; ".join(
                            chunk.strip() for chunk in content.split(";") if chunk.strip()
                        )
                    f.write(f"{abbr} {content}\n")

                # splittable across multiple lines
                elif is_splittable[abbr] and has_item_per_line[abbr]:
                    if isinstance(content, str):
                        content = content.split(";")
                    if not isinstance(content, list):
                        raise TypeError(f"{abbr} must be a list at this point.")
                    # empty
                    if content == []:
                        continue
                    first_mention = content[0]
                    f.write(f"{abbr} {first_mention.strip()}\n")
                    for next_mention in content[1:]:
                        f.write(f"   {next_mention.strip()}\n")

                # C1 field
                elif not is_splittable[abbr] and has_item_per_line[abbr]:
                    if content.startswith("["):
                        first_bracket = content.split("[")[1]
                        f.write(f"{abbr} [{first_bracket.strip()}\n")
                        for next_bracket in content.split("[")[2:]:
                            f.write(f"   [{next_bracket.strip()}\n")
                    else:
                        # in our file, C1 does not start with "["
                        f.write(f"{abbr} {content}\n")

                # regular key-value pairs
                else:
                    if isinstance(content, list):
                        raise TypeError(f"{abbr} must not be a list at this point.")
                    f.write(f"{abbr} {content}\n")

            # the last key is ER
            f.write("ER\n\n")
        f.write("EF")
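
The continuation-line rule the script uses for multi-line fields (the tag on the first line, three spaces in front of every further item) can also be pulled out into a small helper. This is only a sketch: `format_field` is a hypothetical name and not part of wosfile, and it ignores the special handling of C1 shown above.

```python
def format_field(abbr, value):
    """Return the plain-text lines for one tag, or "" if there is nothing to write.

    A list is written one item per line; items after the first are
    indented by three spaces instead of repeating the tag.
    """
    items = value if isinstance(value, list) else [value]
    # Drop empty and whitespace-only items before writing anything.
    items = [str(item).strip() for item in items if str(item).strip()]
    if not items:
        return ""

    lines = [f"{abbr} {items[0]}"]
    lines.extend(f"   {item}" for item in items[1:])
    return "\n".join(lines) + "\n"
```

For example, `format_field("AU", ["Doe, J", "Roe, R"])` gives one `AU Doe, J` line followed by an indented `Roe, R` line, and an empty list gives an empty string, so the caller can write the result unconditionally.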

@hp0404 hp0404 closed this as completed Sep 1, 2023
@rafguns
Owner

rafguns commented Sep 1, 2023

I'm reopening this since this functionality might be of interest to other users as well.

Thanks for the example code, @hp0404. Can I base my implementation on yours?

@rafguns rafguns reopened this Sep 1, 2023
@hp0404
Author

hp0404 commented Sep 1, 2023

sure thing!
