Extract endnotes as well #19

Merged
merged 1 commit into from
Dec 31, 2017

rmzelle
Owner

@rmzelle rmzelle commented Dec 31, 2017

Follow-up of #17, to extract Zotero citations from endnotes as well. (cc @zuphilip)

@rmzelle
Owner Author

rmzelle commented Dec 31, 2017

Generally this just works, but in your https://github.com/rmzelle/ref-extractor/files/1595117/Dok9-endnotes2.docx document one of the endnotes is split over multiple <w:instrText/> elements, which I haven't seen before in my (limited) testing. I'm not sure what the logic behind the split is (a maximum field length?), so it's hard to know what to expect in other Word documents. For now, these split citations are lost during extraction:

			<w:r>
				<w:instrText xml:space="preserve">
					ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"oI6eP5QH","properties":{"formattedCitation":"{\\rtf Paras Mandal u.\\uc0\\u160{}a., \\uc0\\u8222{}A Novel Approach to Forecast Electricity Price for PJM Using Neural Network and Similar Days
					Method\\uc0\\u8220{}, {\\i{}IEEE Transactions on Power Systems} 22, Nr. 4 (November 2007): 5, https://doi.org/10.1109/TPWRS.2007.907386; Safder Alladina, \\uc0\\u8222{}Second Language Teaching through Maths: Learning Maths through a Second
					Language\\uc0\\u8220{}, {\\i{}Educational Studies in Mathematics} 16, Nr. 2 (1. Mai 1985): 215\\uc0\\u8211{}19.}","plainCitation":"Paras Mandal u. a., „A Novel Approach to Forecast Electricity Price for PJM Using Neural Network and Similar Days
					Method“, IEEE Transactions on Power Systems 22, Nr. 4 (November 2007): 5, https://doi.org/10.1109/TPWRS.2007.907386; Safder Alladina, „Second Language Teaching through Maths: Learning Maths through a Second Language“, Educational Studies in
					Mathematics 16, Nr. 2 (1. Mai 1985): 215–19."},"citationItems":[{"id":11835,"uris":["http://zotero.org/users/96641/items/ANPFGCCI"],"uri":["http://zotero.org/users/96641/items/ANPFGCCI"],"itemData":{"id":11835,"type":"article-journal","title":"A
					Novel Approach to Forecast Electricity Price for PJM Using Neural Network and Similar Days Method","container-title":"IEEE Transactions on Power Systems","page":"2058-2065","volume":"22","issue":"4","source":"EBSCOhost","abstract":"Price forecasting
					in competitive electricity markets is critical for consumers and producers in planning their operations and managing their price risk, and it also plays a key role in the economic optimization of the electric energy industry. This paper explores a
					technique of artificial neural network (ANN) model based on similar days (SD) method in order to forecast day-ahead electricity price in the PJM market. To demonstrate the superiority of the proposed model, publicly available data acquired from the
					PJM Interconnection were used for training and testing the ANN. The factors impacting the electricity price forecasting, including time factors, load factors, and historical price factors, are discussed. Comparison of forecasting performance of the
					proposed ANN model with that of forecasts obtained from similar days method is presented. Daily and weekly mean absolu</w:instrText>
			</w:r>
			<w:r w:rsidRPr="00993842">
				<w:rPr><w:lang w:val="en-US"/></w:rPr>
				<w:instrText xml:space="preserve">te percentage error (MAPE) of reasonably small value and forecast mean square error (FMSE) of less than 7$/MWh were obtained for the PJM data, which has correlation coefficient of determination (R²) of 0.6744 between
					load and electricity price. Simulation results show that the proposed ANN model based on similar days method is capable of forecasting locational marginal price (LMP) in the PJM market efficiently and
					accurately.","DOI":"10.1109/TPWRS.2007.907386","ISSN":"08858950","journalAbbreviation":"IEEE Transactions on Power
					Systems","author":[{"family":"Mandal","given":"Paras"},{"family":"Senjyu","given":"Tomonobu"},{"family":"Urasaki","given":"Naomitsu"},{"family":"Funabashi","given":"Toshihisa"},{"family":"Srivastava","given":"Anurag
					K."}],"issued":{"date-parts":[["2007",11]]}},"locator":"5"},{"id":1074,"uris":["http://zotero.org/users/96641/items/5DNR6EWT"],"uri":["http://zotero.org/users/96641/items/5DNR6EWT"],"itemData":{"id":1074,"type":"article-journal","title":"Second
					Language Teaching through Maths: Learning Maths through a Second Language","container-title":"Educational Studies in Mathematics","page":"215-219","volume":"16","issue":"2","source":"JSTOR","ISSN":"0013-1954","shortTitle":"Second Language Teaching
					through Maths","journalAbbreviation":"Educational Studies in
					Mathematics","author":[{"family":"Alladina","given":"Safder"}],"issued":{"date-parts":[["1985",5,1]]}}}],"schema":"https://github.com/citation-style-language/schema/raw/master/csl-citation.json"}
				</w:instrText>
			</w:r>

(in contrast, the other endnotes in the document are stored in a single <w:instrText/> element:

				<w:instrText xml:space="preserve">
					ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"kxO9MnbO","properties":{"formattedCitation":"{\\rtf vgl. {\\i{}Returns to Education in West Germany over Time\\uc0\\u8239{}: Educational Expansion, Occupational Upgrading and the Job Matching Process},
					2013.}","plainCitation":"vgl. Returns to Education in West Germany over Time : Educational Expansion, Occupational Upgrading and the Job Matching Process,
					2013."},"citationItems":[{"id":795,"uris":["http://zotero.org/users/96641/items/2RDF5IIE"],"uri":["http://zotero.org/users/96641/items/2RDF5IIE"],"itemData":{"id":795,"type":"book","title":"Returns to education in West Germany over time : educational
					expansion, occupational upgrading and the job matching process","number-of-pages":"379","source":"Primo","abstract":"Mannheim, Univ., Diss., 2013","shortTitle":"Returns to education in West Germany over
					time","language":"en","author":[{"family":"Klein","given":"Markus"}],"issued":{"date-parts":[["2013"]]}},"suppress-author":true,"prefix":"vgl."}],"schema":"https://github.com/citation-style-language/schema/raw/master/csl-citation.json"}
				</w:instrText>

)

@zuphilip
Contributor

Another test shows that 100 M⊙M⊙M⊙M⊙ in the abstract will be encoded as:

...
  <w:r>
    <w:instrText xml:space="preserve">... 100 M</w:instrText>
  </w:r>
  <w:r w:rsidR="00EE6AF0">
    <w:rPr>
      <w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math" w:cs="Cambria Math"/>
    </w:rPr>
    <w:instrText>&#x2299;</w:instrText>
  </w:r>
  <w:r w:rsidR="00EE6AF0">
    <w:instrText>M</w:instrText>
  </w:r>
  <w:r w:rsidR="00EE6AF0">
    <w:rPr>
      <w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math" w:cs="Cambria Math"/>
    </w:rPr>
    <w:instrText>&#x2299;</w:instrText>
  </w:r>
  <w:r w:rsidR="00EE6AF0">
    <w:instrText>M</w:instrText>
  </w:r>
  <w:r w:rsidR="00EE6AF0">
    <w:rPr>
      <w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math" w:cs="Cambria Math"/>
    </w:rPr>
    <w:instrText>&#x2299;</w:instrText>
  </w:r>
  <w:r w:rsidR="00EE6AF0">
    <w:instrText>M</w:instrText>
  </w:r>
  <w:r w:rsidR="00EE6AF0">
    <w:rPr>
      <w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math" w:cs="Cambria Math"/>
    </w:rPr>
    <w:instrText>&#x2299;</w:instrText>
  </w:r>
  <w:r w:rsidR="00EE6AF0">
    <w:instrText xml:space="preserve">...</w:instrText>
  </w:r>

However, it seems that you can concatenate all <w:instrText> fields together. Right?
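As a rough illustration of that idea, here is a minimal Python sketch that joins the text of consecutive <w:instrText> runs in document order. The fragment is a hypothetical, shortened version of the runs above, and `join_instr_text` is an illustrative helper name, not something from ref-extractor.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical fragment modelled on the runs above; a real document has far more context.
FRAGMENT = """
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:instrText xml:space="preserve">100 M</w:instrText></w:r>
  <w:r>
    <w:rPr><w:rFonts w:ascii="Cambria Math"/></w:rPr>
    <w:instrText>&#x2299;</w:instrText>
  </w:r>
  <w:r><w:instrText>M</w:instrText></w:r>
  <w:r><w:instrText>&#x2299;</w:instrText></w:r>
</w:p>
"""


def join_instr_text(parent):
    """Concatenate the text of every w:instrText under parent, in document order."""
    return "".join(node.text or "" for node in parent.iter(f"{{{W}}}instrText"))


joined = join_instr_text(ET.fromstring(FRAGMENT))
print(joined)
```

The run properties (<w:rPr>) are simply ignored here, which seems acceptable when only the field instruction text is of interest.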

@rmzelle
Owner Author

rmzelle commented Dec 31, 2017

However, it seems that you can concatenate all <w:instrText> fields together. Right?

Sure. I'm just wondering if this also happens for footnote or in-text citations. For endnotes the logic to glue the pieces back together looks reasonably clear, since each endnote is wrapped in its own <w:endnote w:id="2"> element, but I don't know if it's that easy for the other two formats.

Anyway, I'll just merge this to get some endnote support, even if it doesn't work perfectly yet.
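The per-endnote gluing described above could be sketched roughly as follows in Python. This is only an illustration, not the code merged in this pull request: the XML is a hypothetical, heavily shortened endnotes part, `citations_by_endnote` is a made-up helper name, and the sketch assumes each endnote holds at most one Zotero citation field.

```python
import json
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
PREFIX = "ADDIN ZOTERO_ITEM CSL_CITATION"

# Hypothetical, shortened endnotes part; the citation JSON is split across two runs.
ENDNOTES = """
<w:endnotes xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:endnote w:id="2">
    <w:p>
      <w:r><w:instrText xml:space="preserve"> ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"oI6eP5QH","citationItems":[{"id":118</w:instrText></w:r>
      <w:r><w:instrText xml:space="preserve">35}]} </w:instrText></w:r>
    </w:p>
  </w:endnote>
</w:endnotes>
"""


def citations_by_endnote(xml_text):
    """Map each endnote id to its parsed citation (assumes one citation per endnote)."""
    root = ET.fromstring(xml_text)
    result = {}
    for endnote in root.iter(f"{{{W}}}endnote"):
        # Glue all instruction-text pieces inside this endnote back together.
        text = "".join(n.text or "" for n in endnote.iter(f"{{{W}}}instrText")).strip()
        if text.startswith(PREFIX):
            result[endnote.get(f"{{{W}}}id")] = json.loads(text[len(PREFIX):])
    return result


citations = citations_by_endnote(ENDNOTES)
print(citations["2"]["citationID"])
```

An endnote containing two separate citation fields would produce unparseable JSON with this approach, so it would need a finer-grained boundary than the <w:endnote> element.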

@rmzelle rmzelle merged commit bc5f16b into master Dec 31, 2017
@rmzelle rmzelle deleted the endnotes branch December 31, 2017 23:52
@zuphilip
Contributor

zuphilip commented Jan 1, 2018

I'm just wondering if this also happens for footnote or in-text citations.

I tried out both cases. Yes, the same seems to happen for footnotes and in-text citations:

Dok9-footnotes2.docx
Dok9-authordate2.docx

Do you need more examples?

@rmzelle
Owner Author

rmzelle commented Jan 1, 2018

Do you need more examples?

I think this is enough, although I don't know enough about the .docx format to know how to reliably extract these split citations for author-date styles. In your example, it looks like:

		<w:p w:rsidR="005002F6" w:rsidRDefault="00BD3B66">
			<w:r><w:fldChar w:fldCharType="begin"/></w:r>
			<w:r>
				<w:rPr><w:lang w:val="en-US"/></w:rPr>
				<w:instrText xml:space="preserve">
					ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"oI6eP5QH","properties":{"formattedCitation":"{\\rtf (Mandal u.\\uc0\\u160{}a. 2007, 5; Alladina 1985)}","plainCitation":"(Mandal u. a. 2007, 5; Alladina
					1985)"},"citationItems":[{"id":11835,"uris":["http://zotero.org/users/96641/items/ANPFGCCI"],"uri":["http://zotero.org/users/96641/items/ANPFGCCI"],"itemData":{"id":11835,"type":"article-journal","title":"A Novel Approach to Forecast Electricity
					Price for PJM Using Neural Network and Similar Days Method","container-title":"IEEE Transactions on Power Systems","page":"2058-2065","volume":"22","issue":"4","source":"EBSCOhost","abstract":"Price forecasting in competitive electricity markets is
					critical for consumers and producers in planning their operations and managing their price risk, and it also plays a key role in the economic optimization of the electric energy industry. This paper explores a technique of artificial neural network
					(ANN) model based on similar days (SD) method in order to forecast day-ahead electricity price in the PJM market. To demonstrate the superiority of the proposed model, publicly available data acquired from the PJM Interconnection were used for
					training and testing the ANN. The factors impacting the electricity price forecasting, including time factors, load factors, and historical price factors, are discussed. Comparison of forecasting performance of the proposed ANN model with that of
					forecasts obtained from similar days method is presented. Daily and weekly mean absolu</w:instrText>
			</w:r>
			<w:r w:rsidRPr="00BD3B66">
				<w:instrText xml:space="preserve">te percentage error (MAPE) of reasonably small value and forecast mean square error (FMSE) of less than 7$/MWh were obtained for the PJM data, which has correlation coefficient of determination (R²) of 0.6744 between
					load and electricity price. Simulation results show that the proposed ANN model based on similar days method is capable of forecasting locational marginal price (LMP) in the PJM market efficiently and
					accurately.","DOI":"10.1109/TPWRS.2007.907386","ISSN":"08858950","journalAbbreviation":"IEEE Transactions on Power
					Systems","author":[{"family":"Mandal","given":"Paras"},{"family":"Senjyu","given":"Tomonobu"},{"family":"Urasaki","given":"Naomitsu"},{"family":"Funabashi","given":"Toshihisa"},{"family":"Srivastava","given":"Anurag
					K."}],"issued":{"date-parts":[["2007",11]]}},"locator":"5"},{"id":1074,"uris":["http://zotero.org/users/96641/items/5DNR6EWT"],"uri":["http://zotero.org/users/96641/items/5DNR6EWT"],"itemData":{"id":1074,"type":"article-journal","title":"Second
					Language Teaching through Maths: Learning Maths through a Second Language","container-title":"Educational Studies in Mathematics","page":"215-219","volume":"16","issue":"2","source":"JSTOR","ISSN":"0013-1954","shortTitle":"Second Language Teaching
					through Maths","journalAbbreviation":"Educational Studies in
					Mathematics","author":[{"family":"Alladina","given":"Safder"}],"issued":{"date-parts":[["1985",5,1]]}}}],"schema":"https://github.com/citation-style-language/schema/raw/master/csl-citation.json"}
				</w:instrText>
			</w:r>
                        ...
		</w:p>

My main question is whether the structure here is always:

		<w:p>
			<w:r>
				<w:instrText>...</w:instrText>
			</w:r>
			<w:r>
				<w:instrText>...</w:instrText>
			</w:r>
                        ...
		</w:p>

or whether the outer element can ever be something other than w:p.
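One way to sidestep the question of the outer element is to group on the field markers rather than on the parent: the example above opens with a <w:fldChar w:fldCharType="begin"/> run, and in WordprocessingML a complex field's instruction text runs from that marker up to the matching "separate" (or "end") marker. The Python sketch below collects <w:instrText> pieces between those markers. It is an untested idea against real documents rather than an established part of ref-extractor: the XML is a hypothetical, shortened sample, `field_instructions` is an illustrative name, and nested fields are not handled.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
FLD_CHAR = f"{{{W}}}fldChar"
FLD_TYPE = f"{{{W}}}fldCharType"
INSTR_TEXT = f"{{{W}}}instrText"

# Hypothetical, shortened body; the citation JSON is split across two runs.
DOCUMENT = """
<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:p>
    <w:r><w:fldChar w:fldCharType="begin"/></w:r>
    <w:r><w:instrText xml:space="preserve"> ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"oI6</w:instrText></w:r>
    <w:r><w:instrText xml:space="preserve">eP5QH"} </w:instrText></w:r>
    <w:r><w:fldChar w:fldCharType="separate"/></w:r>
    <w:r><w:t>(Mandal u. a. 2007, 5; Alladina 1985)</w:t></w:r>
    <w:r><w:fldChar w:fldCharType="end"/></w:r>
  </w:p>
</w:body>
"""


def field_instructions(root):
    """Return one joined instruction string per field, in document order."""
    fields = []
    parts = None  # None while we are outside a field's instruction part
    for node in root.iter():
        if node.tag == FLD_CHAR:
            kind = node.get(FLD_TYPE)
            if kind == "begin":
                parts = []
            elif kind in ("separate", "end") and parts is not None:
                fields.append("".join(parts).strip())
                parts = None
        elif node.tag == INSTR_TEXT and parts is not None:
            parts.append(node.text or "")
    return fields


instructions = field_instructions(ET.fromstring(DOCUMENT))
print(instructions[0])
```

Because the walk follows document order over all descendants, it does not care whether the runs sit directly in a w:p or inside some other wrapper element.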

@rmzelle
Owner Author

rmzelle commented Jan 1, 2018

(and I created a dedicated ticket for this, per above)

@rmzelle
Owner Author

rmzelle commented Jan 9, 2018

@zuphilip, by the way, would it be okay if I add the Word documents you shared to the repo itself? It would be good to have some test data available. I checked one with https://www.get-metadata.com/ and it doesn't look like it contains anything sensitive from a privacy standpoint.

@zuphilip
Contributor

zuphilip commented Jan 9, 2018

@rmzelle Yes, that should be no problem. As long as you don't look at the referenced data too closely 😄 (it is a random subset of my Zotero library with varying metadata quality)
