Support for Converting Parsed WOS File Contents Back to Plain Text Format #17
Let me chime in here (working with Hryhorii on this). Essentially, we are trying to merge and deduplicate the exports from various sources (Dimensions, Google Scholar (through scholarly), Lens, OpenAlex, Scopus, Web of Science), using CrossRef and DOIs as the deduplication key and selecting, for every set of duplicates, the value for each field with the highest number of characters. We then want to get the merged result into a 'WoS' format that can be used in CiteSpace, Bibliometrix, etc. We are able to produce such a file, but for some reason Bibliometrix, for instance, throws an error when we try to import it. Do you have any suggestions?
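For context, the merge-and-deduplicate step described above could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the `merge_by_doi` name is hypothetical, the per-field "keep the longest value" rule follows the description above, and using the `DI` tag as the DOI key is an assumption.

```python
# Hypothetical sketch: group records (dicts of WoS-style tags) by DOI and,
# for each field, keep the longest value found across duplicates.
from collections import defaultdict


def merge_by_doi(records):
    merged = defaultdict(dict)
    for rec in records:
        doi = rec.get("DI")  # assumption: "DI" holds the DOI in every source
        if not doi:
            continue  # records without a DOI cannot be deduplicated this way
        best = merged[doi.lower()]
        for field, value in rec.items():
            # keep whichever duplicate has the most characters for this field
            if value and len(str(value)) > len(str(best.get(field, ""))):
                best[field] = value
    return list(merged.values())
```

Records from different exports that share a DOI collapse into one dict whose fields are the longest available versions.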
Hi both! You're right, this is a use case that the library currently doesn't implement. But I think it would fit in the scope of the package, so I'm open to including writing functionality. If I understand @sdspieg correctly, you already implemented this yourself but the resulting files cause an error in Bibliometrix? Would it be possible to share your current code and/or an example of such a file?
Sure - just give us a few minutes. Can we email it to you?
Here's the first 'working' solution I've developed. It creates a similar enough format that bibliometrix (the R package) accepts the file without issues. I wrote the code based on our own file (there are some quirks specific to our case, such as 'cleaning' the year-published field), but in the example below I'm using wos_plaintext.txt:

import wosfile
from wosfile.tags import has_item_per_line, is_splittable

if __name__ == "__main__":
    records = [rec for rec in wosfile.records_from("data/wos_plaintext.txt")]
    with open("data/output.txt", "w", encoding="utf-8") as f:
        f.write("FN Thomson Reuters Web of Science™\n")
        f.write("VR 1.0\n")
        for record in records:
            # the first key is PT - specific to our case
            if "PT" in record:
                pt = record.pop("PT")
            else:
                # our default value is J=Journal
                pt = "J"
            f.write(f"PT {pt}\n")
            # fixing bad dates - specific to our case
            if "PY" in record:
                possible_py = record.pop("PY")
                if possible_py:
                    py = int(float(possible_py))
                    f.write(f"PY {py}\n")
            for abbr, content in record.items():
                if content is None or str(content) == "" or str(content) == "NA":
                    continue
                # splittable, stays on the same line
                if is_splittable[abbr] and not has_item_per_line[abbr]:
                    if isinstance(content, list):
                        content = "; ".join(content)
                    else:
                        content = "; ".join(
                            chunk.strip() for chunk in content.split(";") if chunk != ""
                        )
                    f.write(f"{abbr} {content}\n")
                # splittable across multiple lines
                elif is_splittable[abbr] and has_item_per_line[abbr]:
                    if isinstance(content, str):
                        content = content.split(";")
                    if not isinstance(content, list):
                        raise TypeError(f"{abbr} must be a list at this point.")
                    # empty
                    if content == []:
                        continue
                    first_mention = content[0]
                    f.write(f"{abbr} {first_mention.strip()}\n")
                    for next_mention in content[1:]:
                        f.write(f"   {next_mention.strip()}\n")
                # C1 field
                elif not is_splittable[abbr] and has_item_per_line[abbr]:
                    if content.startswith("["):
                        first_bracket = content.split("[")[1]
                        f.write(f"{abbr} [{first_bracket.strip()}\n")
                        for next_bracket in content.split("[")[2:]:
                            f.write(f"   [{next_bracket.strip()}\n")
                    else:
                        # in our file, C1 does not start with "["
                        f.write(f"{abbr} {content}\n")
                # regular key-value pairs
                else:
                    if isinstance(content, list):
                        raise TypeError(f"{abbr} must not be a list at this point.")
                    f.write(f"{abbr} {content}\n")
            # the last key is ER
            f.write("ER\n\n")
        f.write("EF")
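As a sanity check on files produced this way, one can round-trip the output through a minimal parser. The sketch below is not part of wosfile and handles only the basics (tag + value lines, 3-space continuation lines, ER record terminators; the `parse_plaintext` name is hypothetical), but it is enough to confirm the structure is intact.

```python
# Minimal stdlib sketch: parse WoS-style plain text back into record dicts,
# joining 3-space continuation lines onto the preceding tag's value.
def parse_plaintext(text):
    records, current, last_tag = [], {}, None
    for line in text.splitlines():
        if line.startswith("   ") and last_tag:
            # continuation line belongs to the previous tag
            current[last_tag] += "; " + line.strip()
        elif line[:2] in ("FN", "VR", "EF"):
            continue  # file header/footer lines, not record fields
        elif line.startswith("ER"):
            records.append(current)  # end of record
            current, last_tag = {}, None
        elif len(line) > 3 and line[2] == " ":
            tag, value = line[:2], line[3:]
            current[tag] = value
            last_tag = tag
    return records
```

Reading data/output.txt with this and comparing against the original parsed records would quickly reveal where the writer diverges from what bibliometrix expects.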
I'm reopening this since this functionality might be of interest to other users as well. Thanks for the example code, @hp0404. Can I base my implementation on yours?
sure thing! |
We currently use the library for parsing WOS file contents, and it has been incredibly useful in our data processing pipeline. However, we have encountered a new requirement in our workflow: we need to convert the parsed WOS file contents (think of one long dictionary holding the parsed content of multiple WOS files) back into the original WOS plain-text format (.txt). For example, this is useful if you want to work with the parsed files in R or CiteSpace.
Currently, the library focuses primarily on reading and parsing WOS files but doesn't provide functionality for converting the parsed data back into a text format.
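The core of such writing functionality is small: each line is a two-letter tag, a space, and a value, with multi-item fields continued on 3-space-indented lines and each record terminated by ER. A minimal sketch, assuming the record is already a dict of tag strings (the `write_record` name and the default `per_line_tags` selection are illustrative, not wosfile API):

```python
# Hypothetical sketch: serialize one parsed record dict back to WoS plain text.
def write_record(record, per_line_tags=("AU", "AF", "CR")):
    lines = []
    for tag, value in record.items():
        if tag in per_line_tags and isinstance(value, list):
            # first item on the tag line, the rest on 3-space continuation lines
            lines.append(f"{tag} {value[0]}")
            lines.extend(f"   {v}" for v in value[1:])
        else:
            lines.append(f"{tag} {value}")
    lines.append("ER")  # record terminator
    return "\n".join(lines)
```

A full writer would wrap this with the FN/VR header, a blank line between records, and a closing EF, as in the example code above.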