# Ash

## Install and import packages

In [8]:
!pip install ash-williams urllib3



In [20]:
from pathlib import Path
from pprint import pprint

import ash
import urllib3

## Load retraction database

In [56]:
DB_PATH = Path("./retractions.csv")

The following cells should download the complete Retraction Watch CSV, per the instructions at https://doi.org/10.13003/c23rw1d9. If the download fails, fetch the file manually from the URL below (with your own email address at the end) and upload it to the file system this notebook can access. Alternatively, try [mounting a Drive or Sheet](https://colab.research.google.com/notebooks/io.ipynb).

In [40]:
# Enter your email address for the download, as Crossref requests
EMAIL = ""  # fill in your own address, e.g. "name@example.org"
CSV_URL = f"https://api.labs.crossref.org/data/retractionwatch?{EMAIL}"
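
Since the URL is only useful once an email address has been filled in, a small guard can catch an empty value before any request is made. The sketch below is one possible way to do that; `build_csv_url` is a hypothetical helper, not part of ash or urllib3, and the `"@"` check is only a rough sanity test rather than real email validation.

```python
def build_csv_url(email: str) -> str:
    """Return the Crossref Labs download URL with the email as the query string.

    Hypothetical helper for illustration; it only checks that the value
    looks roughly like an email address.
    """
    email = email.strip()
    if "@" not in email:
        raise ValueError("Set EMAIL to your own address before downloading")
    return f"https://api.labs.crossref.org/data/retractionwatch?{email}"


# "name@example.org" is a placeholder; substitute your own address.
print(build_csv_url("name@example.org"))
```

With a helper like this, `CSV_URL = build_csv_url(EMAIL)` would fail early with a readable message instead of sending a request with an empty query string.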

In [57]:
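# Download the CSV (the top-level urllib3.request helper needs urllib3 2.x)
resp = urllib3.request("GET", CSV_URL, timeout=20)
print(f"Status: {resp.status}")
if resp.status != 200:
    raise RuntimeError(f"Did not successfully retrieve {CSV_URL}")
# Save the payload locally, then load it into ash
DB_PATH.write_bytes(resp.data)
db = ash.RetractionDatabase(DB_PATH)
db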

Status: 200


RetractionDatabase('retractions.csv')

In [54]:
len(db.dois)

45590
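
To give a rough idea of what a collection of retracted DOIs could look like, here is a minimal standard-library sketch that reads a DOI column from CSV data into a set. It is not how ash builds `db.dois`; the column name `OriginalPaperDOI`, the helper `load_retracted_dois`, and the two-row sample are all assumptions made for illustration.

```python
import csv
import io

# Made-up two-row sample standing in for retractions.csv. The column name
# "OriginalPaperDOI" is an assumption about the Retraction Watch export.
SAMPLE_CSV = """Record ID,Title,OriginalPaperDOI
1,Example retracted paper,10.1016/S0140-6736(97)11096-0
2,Paper with no DOI on record,unavailable
"""


def load_retracted_dois(handle, column="OriginalPaperDOI"):
    """Collect lowercased DOI-like values from one column of a CSV file."""
    dois = set()
    for row in csv.DictReader(handle):
        value = (row.get(column) or "").strip().lower()
        # Keep only values that look like DOIs; skip placeholder text.
        if value.startswith("10."):
            dois.add(value)
    return dois


retracted = load_retracted_dois(io.StringIO(SAMPLE_CSV))
print(len(retracted))
print("10.1016/s0140-6736(97)11096-0" in retracted)
```

Lowercasing before comparison is a deliberate choice here, since DOIs are case-insensitive and reference lists often vary in capitalization.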

## Report on text containing DOIs
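
The report in this section depends on finding DOIs in free text. How ash does this internally is not shown in this notebook; as a rough sketch under that caveat, a regular expression can pull DOI-like strings out of a reference list. The pattern and the `find_dois` helper below are assumptions for illustration and will miss DOIs that are split across a line break.

```python
import re

# Loose DOI-like pattern: "10.", a 4-9 digit registrant code, a slash, then
# any run of characters up to whitespace, a quote, or an angle bracket.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_dois(text: str) -> list[str]:
    """Return DOI-like strings in order of first appearance, without duplicates."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Drop sentence punctuation that often trails a DOI in a citation.
        doi = match.group(0).rstrip(".,;")
        if doi not in found:
            found.append(doi)
    return found


sample = (
    "Scientometrics. 2017 Mar;110(3):1653-61. doi: 10.1007/s11192-016-2227-4\n"
    "The Lancet. 1998 Feb;351(9103):637-41. doi:\n"
    "10.1016/S0140-6736(97)11096-0."
)
print(find_dois(sample))
```

Note that the second DOI sits on its own line after `doi:`, which the pattern handles because it does not rely on the `doi:` label.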

In [69]:
TEXT = """
References
1. Teixeira da Silva JA, Dobránszki J. Highly cited retracted papers. Scientometrics.
2017 Mar;110(3):1653–61. doi: 10.1007/s11192-016-2227-4
2. Barbour V, Kleinert S, Wager E, Yentis S. Guidelines for retracting articles.
Committee on Publication Ethics; 2009 Sep. doi: 10.24318/cope.2019.1.4
3. Budd JM, Sievert M, Schultz TR. Phenomena of Retraction: Reasons for Retraction
and Citations to the Publications. JAMA. 1998 Jul 15;280(3):296. doi:
10.1001/jama.280.3.296
4. Lu SF, Jin GZ, Uzzi B, Jones B. The Retraction Penalty: Evidence from the Web of
Science. Sci Rep. 2013 Dec;3(1):3146. doi: 10.1038/srep03146
5. Azoulay P, Bonatti A, Krieger JL. The career effects of scandal: Evidence from
scientific retractions. Res Policy. 2017 Nov;46(9):1552–69. doi:
10.1016/j.respol.2017.07.003
6. Mongeon P, Larivière V. Costly collaborations: The impact of scientific fraud on co-
authors’ careers: Costly Collaborations: The Impact of Scientific Fraud on Co-
Authors’ Careers. J Assoc Inf Sci Technol. 2016 Mar;67(3):535–42. doi:
10.1002/asi.23421
7. Shuai X, Rollins J, Moulinier I, Custis T, Edmunds M, Schilder F. A
Multidimensional Investigation of the Effects of Publication Retraction on Scholarly
Impact. J Assoc Inf Sci Technol. 2017 Sep;68(9):2225–36. doi: 10.1002/asi.23826
8. Feng L, Yuan J, Yang L. An observation framework for retracted publications in
multiple dimensions. Scientometrics. 2020 Nov;125(2):1445–57. doi:
10.1007/s11192-020-03702-3
9. Bolland MJ, Grey A, Avenell A. Citation of retracted publications: A challenging
problem. Account Res. 2021 Feb 15;1–8. doi: 10.1080/08989621.2021.1886933
10. Bar-Ilan J, Halevi G. Post retraction citations in context: a case study. Scientometrics.
2017 Oct;113(1):547–65. doi: 10.1007/s11192-017-2242-0
11. Jan R, Bano S, Mehraj M, others. Context Analysis of Top Seven Retracted Articles:
Should Retraction Watch Revisit the List? Context [Internet]. 2018; Available from:
https://digitalcommons.unl.edu/libphilprac/2016/
12. Chen C, Leydesdorff L. Patterns of connections and movements in dual-map
overlays: A new method of publication portfolio analysis. J Assoc Inf Sci Technol.
2014 Feb;65(2):334–51. doi: 10.1002/asi.22968
13. Schneider J, Ye D, Hill AM, Whitehorn AS. Continued post-retraction citation of a
fraudulent clinical trial report, 11 years after it was retracted for falsifying data.
Scientometrics. 2020 Dec;125(3):2877–913. doi: 10.1007/s11192-020-03631-1
14. Wakefield A, Murch S, Anthony A, Linnell J, Casson D, Malik M, et al.
RETRACTED: Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and
pervasive developmental disorder in children. The Lancet. 1998 Feb;351(9103):637–
41. doi: 10.1016/S0140-6736(97)11096-0
15. Heibi I, Peroni S, Shotton D. Software review: COCI, the OpenCitations Index of
Crossref open DOI-to-DOI citations. Scientometrics. 2019 Nov;121(2):1213–28. doi:
10.1007/s11192-019-03217-6
16. Suppe F. The structure of a scientific paper. Philos Sci. 1998;65(3):381–405.
17. Peroni S, Shotton D. FaBiO and CiTO: Ontologies for describing bibliographic
resources and citations. J Web Semant. 2012 Dec;17:33–43. doi:
10.1016/j.websem.2012.08.001
18. Bengfort B, Bilbro R, Ojeda T. Applied text analysis with Python: enabling language-
aware data products with machine learning. First edition. Sebastopol, CA: O’Reilly
Media, Inc; 2018. 310 p.
19. Truica C-O, Radulescu F, Boicea A. Comparing Different Term Weighting Schemas
for Topic Modeling. In: 2016 18th International Symposium on Symbolic and
Numeric Algorithms for Scientific Computing (SYNASC) [Internet]. Timisoara,
Romania: IEEE; 2016 [cited 2020 Jul 21]. p. 307–10. doi:
10.1109/SYNASC.2016.055
20. Jelodar H, Wang Y, Yuan C, Feng X, Jiang X, Li Y, et al. Latent Dirichlet allocation
(LDA) and topic modeling: models, applications, a survey. Multimed Tools Appl.
2019 Jun;78(11):15169–211. doi: 10.1007/s11042-018-6894-4
21. Zhao W, Chen JJ, Perkins R, Liu Z, Ge W, Ding Y, et al. A heuristic approach to
determine an appropriate number of topics in topic modeling. BMC Bioinformatics.
2015 Dec;16(S13):S8. doi: 10.1186/1471-2105-16-S13-S8
22. Arun R, Suresh V, Veni Madhavan CE, Narasimha Murthy MN. On Finding the
Natural Number of Topics with Latent Dirichlet Allocation: Some Observations. In:
Zaki MJ, Yu JX, Ravindran B, Pudi V, editors. Advances in Knowledge Discovery
and Data Mining [Internet]. Berlin, Heidelberg: Springer Berlin Heidelberg; 2010
[cited 2021 Jan 12]. p. 391–402. (Hutchison D, Kanade T, Kittler J, Kleinberg JM,
Mattern F, Mitchell JC, et al., editors. Lecture Notes in Computer Science; vol. 6118).
doi: 10.1007/978-3-642-13657-3_43
23. Schmiedel T, Müller O, vom Brocke J. Topic Modeling as a Strategy of Inquiry in
Organizational Research: A Tutorial With an Application Example on Organizational
Culture. Organ Res Methods. 2019 Oct;22(4):941–68. doi:
10.1177/1094428118773858
24. Ferri P, Heibi I, Pareschi L, Peroni S. MITAO: A User Friendly and Modular
Software for Topic Modelling. PuntOorg Int J. 2020;5(2):135–49. doi:
10.19245/25.05.pij.5.2.3
25. Sievert C, Shirley KE. LDAvis: A method for visualizing and interpreting topics.
2014 [cited 2020 Jul 27]; doi: 10.13140/2.1.1394.3043
26. Chuang J, Manning CD, Heer J. Termite: visualization techniques for assessing
textual topic models. In: Proceedings of the International Working Conference on
Advanced Visual Interfaces - AVI ’12 [Internet]. Capri Island, Italy: ACM Press;
2012 [cited 2020 May 21]. p. 74. doi: 10.1145/2254556.2254572
27. Heibi I, Peroni S. A qualitative and quantitative citation analysis toward retracted
articles: a case of study. ArXiv201211475 Cs [Internet]. 2020 Dec 21 [cited 2021 Jan
24]; Available from: http://arxiv.org/abs/2012.11475
28. Wang K, Shen Z, Huang C, Wu CH, Dong Y, Kanakia A. Microsoft academic graph:
When experts are not enough. Quantitative Science Studies. 2020 Feb;1(1):396-413.
doi: 10.1162/qss_a_00021
29. Pentz E. CrossRef: a collaborative linking network. Issues in science and technology
librarianship. 2001;10:F4CR5RBK. doi: 10.1162/qss_a_00022
30. Peroni S, Shotton D. OpenCitations, an infrastructure organization for open
scholarship. Quantitative Science Studies. 2020 Feb;1(1):428-44. doi:
10.1162/qss_a_00023
31. Ramos J. Using tf-idf to determine word relevance in document queries. In
Proceedings of the first instructional conference on machine learning 2003 Dec 3
(Vol. 242, No. 1, pp. 29-48).
32. Brownlee J. A Gentle Introduction to the Bag-of-Words Model. 2017 Oct 09.
Available from: https://machinelearningmastery.com/gentle-introduction-bag-words-
model/
33. Iorio AD, Nuzzolese AG, Peroni S. Towards the automatic identification of the nature
of citations. SePublica. 2013. Available from: http://ceur-ws.org/Vol-994/paper-
06.pdf
34. Ciancarini P, Di Iorio A, Nuzzolese AG, Peroni S, Vitali F. Evaluating Citation
Functions in CiTO: Cognitive Issues. In: Presutti V, d’Amato C, Gandon F, d’Aquin
M, Staab S, Tordai A, editors. The Semantic Web: Trends and Challenges [Internet].
Cham: Springer International Publishing; 2014 [cited 2021 May 7]. p. 580–94.
(Hutchison D, Kanade T, Kittler J, Kleinberg JM, Kobsa A, Mattern F, et al., editors.
Lecture Notes in Computer Science; vol. 8465). doi: 10.1007/978-3-319-07443-6_39
35. Heibi I, Peroni S. LCC and Scimago indexes. Zenodo [Data set]. 2021. doi:
10.5281/zenodo.4767023
"""
paper = ash.Paper(TEXT, mime_type="text/plain")
pprint(paper.report(db), width=120)


{'dois': {'10.1001/jama.280.3.296': {'DOI is valid': True, 'Retracted': False},
          '10.1002/asi.22968': {'DOI is valid': True, 'Retracted': False},
          '10.1002/asi.23421': {'DOI is valid': True, 'Retracted': False},
          '10.1002/asi.23826': {'DOI is valid': True, 'Retracted': False},
          '10.1007/978-3-319-07443-6_39': {'DOI is valid': True, 'Retracted': False},
          '10.1007/978-3-642-13657-3_43': {'DOI is valid': True, 'Retracted': False},
          '10.1007/s11042-018-6894-4': {'DOI is valid': True, 'Retracted': False},
          '10.1007/s11192-016-2227-4': {'DOI is valid': True, 'Retracted': False},
          '10.1007/s11192-017-2242-0': {'DOI is valid': True, 'Retracted': False},
          '10.1007/s11192-019-03217-6': {'DOI is valid': True, 'Retracted': False},
          '10.1007/s11192-020-03631-1': {'DOI is valid': True, 'Retracted': False},
          '10.1007/s11192-020-03702-3': {'DOI is valid': True, 'Retracted': False},
          '10.1016/S01