Parser for Biodiversity checklists

thomvee edited this page Mar 24, 2017 · 13 revisions

Background

Compiling taxonomic checklists from varied data sources is a common task for biodiversity informaticians. Checklist data usually occur as free text, and extracting taxon names from that text into a tabular format takes significant manual effort. Text in sources such as research publications and websites frequently also contains additional attributes like synonyms, common names, higher taxonomy and distribution. A tool that quickly extracts such text into tabular lists would make it easier to aggregate biodiversity data in a structured format, ready for further processing and for upload to data aggregation initiatives.

Related work

R has a few packages, such as httr, rvest and hunspell, that cover basic operations like fetching files and attempting to parse them. A taxonomy-specific package is still needed, because taxonomy has its own unique structure and complexities.

Details of your coding project

  • Functions to search for names of organisms within supplied text
  • Functions to manipulate taxon names, such as assigning ranks to the parts of a name string, e.g. splitting the string ‘Papilio machaon Seyer, 1976’ into Genus = ‘Papilio’, Species = ‘machaon’, Author = ‘Seyer’ and Year = ‘1976’
  • Functions to parse taxonomic lists and return the information in table format
  • Recursive functions to crawl websites
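
The name-splitting step in the second bullet can be sketched with a single regular expression. This is only an illustration under stated assumptions, not the project's design: it is written in Python for brevity although the proposed package targets R, the names `NAME_PATTERN` and `parse_name` are hypothetical, and the pattern assumes a plain "Genus species Author, Year" layout with a single author and no parentheses, subgenus or infraspecific rank.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for "Genus species Author, Year".
# The author/year part is optional so bare binomials also match.
NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+"        # capitalised genus
    r"(?P<species>[a-z-]+)"              # lower-case specific epithet
    r"(?:\s+(?P<author>[^,\d]+?),?\s*"   # author text, optional comma
    r"(?P<year>\d{4}))?$"                # four-digit year
)


def parse_name(name: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a name string into ranked parts, or return None if it does not fit."""
    match = NAME_PATTERN.match(name.strip())
    if match is None:
        return None
    return {
        "Genus": match.group("genus"),
        "Species": match.group("species"),
        "Author": match.group("author"),
        "Year": match.group("year"),
    }


print(parse_name("Papilio machaon Seyer, 1976"))
# {'Genus': 'Papilio', 'Species': 'machaon', 'Author': 'Seyer', 'Year': '1976'}
```

Real checklists contain many layouts this pattern rejects (parenthesised authorities, multiple authors, subspecies), which is part of why a dedicated taxonomy parser is proposed here.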

Expected impact

The biodiversity research community increasingly uses R in its data analysis workflows. This package would add a tool for extracting taxonomic name lists and related data from file formats such as txt, html or pdf, so that checklists can be built quickly.

Mentors

Please contact Vijay Barve (vijay.barve@gmail.com) after solving at least one of the tests below.

Tests

  • Easy: Read the html from URL [http://ftp.funet.fi/pub/sci/bio/life/insecta/lepidoptera/ditrysia/papilionoidea/papilionidae/papilioninae/lamproptera/] and get the genus name
  • Easy: List out all the species from the list [https://www.abdb-africa.org/genus/Papilio]
  • Medium: Read the html from URL [http://ftp.funet.fi/pub/sci/bio/life/insecta/lepidoptera/ditrysia/papilionoidea/papilionidae/papilioninae/lamproptera/] and get all the species names
  • Medium: Convert the above task into a function
  • Hard: Read in the file [https://github.com/vijaybarve/Parser-GSOC2017-idea/blob/master/taxo01.txt] and output the parsed data in the form of .csv file [https://github.com/vijaybarve/Parser-GSOC2017-idea/blob/master/taxo_out01.csv]
  • Hard: Convert the above task into a function

Solutions of tests

Students, please post a link to your test results here.
