Possible error while parsing structured abstracts. #47
Comments
Thanks for reporting, @RudrakshTuwani! Is there any way you can upload the XML somewhere so that I can try to fix the parser? |
See the updated description. |
That's perfect. I'll fix that by the weekend. Bug me if I forget tho! |
No problem, thanks a lot! Do you think this is the case for all structured abstracts, or is it specific to a few? |
There are a lot of abstracts like this. I haven't calculated the exact number, but I guess it should be roughly a million in the full Medline dataset. |
Okay, thanks! |
I seem to have found the bug.
So, the first if block matches the first tag under the structured abstract and returns it. This can be solved by removing the condition for matching "AbstractText". Can you tell me why we are even matching AbstractText, since just matching on Abstract also returns all the relevant information? |
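A minimal sketch of the behavior described above, using the standard library's `xml.etree.ElementTree` rather than the package's actual parser code. The sample XML, its `Label` values, and the helper names `first_section_only` and `join_abstract_text` are hypothetical and only illustrate why returning the first `AbstractText` match drops the rest of a structured abstract.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured abstract; the real record is not reproduced here.
SAMPLE = """
<Abstract>
  <AbstractText Label="OBJECTIVE" NlmCategory="OBJECTIVE">To examine cytokine levels.</AbstractText>
  <AbstractText Label="DESIGN" NlmCategory="METHODS">Case-control study.</AbstractText>
  <AbstractText Label="SETTING" NlmCategory="METHODS">University hospital.</AbstractText>
</Abstract>
""".strip()


def first_section_only(abstract):
    # Mimics the reported symptom: find() returns only the first matching child.
    node = abstract.find("AbstractText")
    return "".join(node.itertext()).strip() if node is not None else ""


def join_abstract_text(abstract):
    # Collect the text of every AbstractText child, in document order.
    parts = ["".join(n.itertext()).strip() for n in abstract.findall("AbstractText")]
    return " ".join(p for p in parts if p)


abstract = ET.fromstring(SAMPLE)
print(first_section_only(abstract))
print(join_abstract_text(abstract))
```

The second helper walks all sections, so nothing after the first labeled block is lost.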
Yes @RudrakshTuwani, you're right. I'm fixing that part now. |
I fixed it in commit b59e150, let me know if it works for you! |
Yes, this seems to be working. Thanks a lot! One suggestion for the future: maybe we can add the corresponding field for the structured abstract at the beginning of the sentence. If you want, I can work on it and send a PR later. |
Haha, you read my mind! I did that and just pushed it, see commit 34dae58 |
Haha, yes this seems to be working perfectly! Thanks a lot, man! |
@RudrakshTuwani, sorry, I still see the problem in my parser at |
Oh, can you send me the PMID? I'll have a look as well. |
This is OBJECTIVE
To examine interleukin-12 (IL-12), IL-18, IFN-γ, intracellular adhesion molecule-1 (ICAM-1), leukemia inhibitory factor (LIF), and migration inhibitory factor (MIF) levels in precisely-timed blood and endometrial tissue samples from women with idiopathic recurrent pregnancy loss (RPL).
METHODS
Case-control study.
METHODS
University hospital.
There should be a better way to concatenate all sections of the same category into one long string. |
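One possible way to do that concatenation, sketched under the assumption that the parser already yields `(category, text)` pairs in document order. The function name `merge_sections` is hypothetical and is not part of the package.

```python
def merge_sections(sections):
    """Merge texts that share a category, keeping first-seen category order."""
    merged = {}
    for category, text in sections:
        # dict preserves insertion order, so categories stay in document order.
        merged.setdefault(category, []).append(text)
    return [(category, " ".join(texts)) for category, texts in merged.items()]


sections = [
    ("OBJECTIVE", "To examine cytokine levels."),
    ("METHODS", "Case-control study."),
    ("METHODS", "University hospital."),
]
for category, text in merge_sections(sections):
    print(category, text, sep=": ")
```

This trades away the original finer-grained labels for one string per category, which is the dilemma discussed in the following comments.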
We are actually doing it correctly, you can see at: https://www.ncbi.nlm.nih.gov/pubmed/?term=26368793. However, I chose to get |
Yes, there's a dilemma here. Maybe we should simply use Or do something like What do you think? |
So I fixed it in f3fe97f as
dicts_out = pp.parse_medline_xml('data/medline16n0902.xml.gz', year_info_only=False, nlm_category=False)
If you set |
Okay, I fixed it in a slightly different way. The output I get is the following:
Labels that are not equal to the NLM category are added in quotation marks in front of the NLM category. I guess we can close this now. |
Definitely, we can close it for now. I'll discuss this issue a bit more with @daniel-acuna. Let me know if there are any use cases where you want the output to be in a particular format. Thanks again for the issue tho! |
The check for UNASSIGNED is still failing. Replacing 'is not' with != fixes it. Also, we want to remove the content under the UNASSIGNED section, right? In which case, we may want to put No problem! Thanks for being so prompt on this. Let me know if I can contribute in any other way. |
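The `is not` versus `!=` point can be illustrated with a short sketch. In Python, `is not` tests object identity while `!=` tests value, so a string built at runtime (for example, one read from an XML attribute) can equal `"UNASSIGNED"` without being the same object. The names `UNASSIGNED` and `keep_section` are hypothetical, not the package's code.

```python
UNASSIGNED = "UNASSIGNED"


def keep_section(category):
    # Compare by value; an identity check (`is not`) can wrongly report
    # a difference for equal strings that are distinct objects.
    return category != UNASSIGNED


# A string created at runtime, standing in for a value read from XML.
runtime_value = "unassigned".upper()

print(runtime_value == UNASSIGNED)   # value comparison
print(keep_section(runtime_value))   # the section would be dropped
print(keep_section("METHODS"))       # other sections are kept
```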
Fixed I think putting |
Okay, that works too :) |
Hi, first of all big thanks for this life-saver of a package.
I think there is some problem with parsing XML for structured abstracts. Consider the following example:
The parse returned by medline_parser is as follows:
As you can see, it completely misses a major portion of the text. I wonder if this is the case for all structured abstracts or only limited ones. As additional info, the file I'm using is
ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/medline17n0763.xml.gz
and the PMID of the abstract is 23826455. Thanks!