Parser for Biodiversity checklists
Compiling taxonomic checklists from varied data sources is a common task for biodiversity informaticians. The data for checklists usually occur in textual formats, and significant manual effort is required to extract taxon names from text into a tabular format. Textual data in sources such as research publications and websites frequently also contain additional attributes such as synonyms, common names, higher taxonomy and distribution. A facility to quickly extract textual data into tabular lists would make it easy to aggregate biodiversity data in a structured format that can be processed further and uploaded to data aggregation initiatives.
R has a few packages, such as httr, rvest and hunspell, that handle basic operations like fetching files and attempting to parse them. However, a taxonomy-specific package is important because taxonomy has its own unique structure and complexities.
Details of your coding project
- Functions to search for names of organisms within supplied text
- Functions to manipulate taxon names, such as assigning ranks to a name string, e.g. splitting the string ‘Papilio machaon Seyer, 1976’ into Genus = ‘Papilio’, Species = ‘machaon’, Author = ‘Seyer’ and Year = ‘1976’
- Functions to parse taxonomic lists and return the information in table format
- Recursive functions to crawl websites
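As a rough illustration of the name-manipulation and table-output ideas in the list above, the following sketch splits a name string of the form "Genus species Author, Year" into its parts and writes the results as CSV. It is written in Python purely to show the logic; the proposed package itself targets R. The pattern `NAME_PATTERN` and the functions `parse_name` and `names_to_csv` are hypothetical names introduced here, and the pattern assumes a simple binomial with at most one authority, which real checklists often violate (subspecies, multiple authors, hybrids).

```python
import csv
import io
import re

# Assumed name layout: "Genus species Author, Year", with the authority
# optional and optionally wrapped in parentheses. This is an illustrative
# simplification, not a full treatment of nomenclatural rules.
NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z-]+)\s+"       # capitalized genus
    r"(?P<species>[a-z-]+)"              # lowercase specific epithet
    r"(?:\s+\(?(?P<author>[^,()\d]+?)"   # optional author name(s)
    r"(?:,\s*(?P<year>\d{4}))?\)?)?$"    # optional four-digit year
)

FIELDS = ["Genus", "Species", "Author", "Year"]


def parse_name(name):
    """Return a dict with Genus, Species, Author and Year, or None if the
    string does not fit the assumed layout."""
    match = NAME_PATTERN.match(name.strip())
    if match is None:
        return None
    parts = match.groupdict()
    return {
        "Genus": parts["genus"],
        "Species": parts["species"],
        "Author": parts["author"].strip() if parts["author"] else None,
        "Year": parts["year"],
    }


def names_to_csv(names):
    """Parse each name and return the parsed rows as CSV text.
    Strings that do not match the assumed layout are skipped."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for name in names:
        row = parse_name(name)
        if row is not None:
            writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    print(parse_name("Papilio machaon Seyer, 1976"))
    print(names_to_csv(["Papilio machaon Seyer, 1976", "Papilio machaon"]))
```

A single regular expression keeps the sketch short, but a real package would probably need a staged parser or a lookup against a reference name list to cope with the irregular strings found in publications.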
A growing part of the biodiversity research community uses R in its data analysis workflows. This package would add a tool to extract taxonomic name lists and related data from file formats such as txt, html or pdf, so that checklists can be built quickly.
- Vijay Barve (http://vijaybarve.net/) firstname.lastname@example.org
- Rohit George email@example.com
- Thomas Vattakaven firstname.lastname@example.org
- Narayani Barve email@example.com
Please contact Vijay Barve firstname.lastname@example.org after solving at least one of the tests below.
- Easy: Read the html from URL [http://ftp.funet.fi/pub/sci/bio/life/insecta/lepidoptera/ditrysia/papilionoidea/papilionidae/papilioninae/lamproptera/] and get the genus name
- Easy: List out all the species from the list [https://www.abdb-africa.org/genus/Papilio]
- Medium: Read the html from URL [http://ftp.funet.fi/pub/sci/bio/life/insecta/lepidoptera/ditrysia/papilionoidea/papilionidae/papilioninae/lamproptera/] and get all the species names
- Medium: Convert the above task into a function
- Hard: Read in the file [https://github.com/vijaybarve/Parser-GSOC2017-idea/blob/master/taxo01.txt] and output the parsed data as a .csv file like [https://github.com/vijaybarve/Parser-GSOC2017-idea/blob/master/taxo_out01.csv]
- Hard: Convert the above task into a function
Solutions of tests
Students, please post a link to your test results here.
- Sumedh Mool (https://github.com/Sumedh04/Praser)
- Vishwajeet shukla (https://github.com/vishwajeet993511/gsoc2017tests)
- Xing Xiong (https://github.com/XingXiong/gsoc2017)
- Qingyue Xu (https://github.com/qingyuexu/Parser-for-Biodiversity-checklists)