# Update profiles for functional groups of the IUCN Global Ecosystem typology

Scripts by José R. Ferrer-Paris

The scripts described in this document are used to:

- Read profile content from a document in `docx` format
- Update the corresponding content in the database

## Set-up
Load all the libraries we will need in this script:

In [1]:
from docx import Document
import os
from pathlib import Path
import re
from datetime import datetime
from configparser import ConfigParser
import psycopg2

When I first ran this code, the paragraph text did not include the hyperlinked references to biomes and ecosystem functional groups (EFGs).
The hack below, taken from https://github.com/python-openxml/python-docx/issues/85, overrides `Paragraph.text` so that plain runs and hyperlink runs are returned together as a single string.

In [2]:
from docx.text.paragraph import Paragraph

Paragraph.text = property(lambda self: GetParagraphText(self))

def GetParagraphText(paragraph):

    def GetTag(element):
        return "%s:%s" % (element.prefix, re.match("{.*}(.*)", element.tag).group(1))

    text = ''
    runCount = 0
    for child in paragraph._p:
        tag = GetTag(child)
        if tag == "w:r":
            text += paragraph.runs[runCount].text
            runCount += 1
        if tag == "w:hyperlink":
            for subChild in child:
                if GetTag(subChild) == "w:r":
                    text += subChild.text
    return text
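
To illustrate why the override is needed, here is a minimal sketch that uses only the standard library instead of `python-docx`. The XML fragment is a hypothetical, simplified paragraph (not taken from the actual document), and `paragraph_text` is an illustrative helper name. It shows that hyperlink runs are nested one level deeper than plain runs, so both levels have to be visited to recover the full text.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside docx files
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical paragraph: a plain run, a hyperlink wrapping a run, another plain run
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t xml:space="preserve">mostly from biomes </w:t></w:r>'
    '<w:hyperlink><w:r><w:t>T1</w:t></w:r></w:hyperlink>'
    '<w:r><w:t>, a few from T5</w:t></w:r>'
    '</w:p>'
)

def paragraph_text(p):
    """Concatenate run text in document order, including runs inside hyperlinks."""
    parts = []
    for child in p:
        if child.tag == f"{{{W}}}r":
            runs = [child]
        elif child.tag == f"{{{W}}}hyperlink":
            runs = child.findall("w:r", NS)
        else:
            runs = []
        for run in runs:
            parts.extend(t.text or "" for t in run.findall("w:t", NS))
    return "".join(parts)

demo_text = paragraph_text(ET.fromstring(SAMPLE))
print(demo_text)
```

Reading only the direct `w:r` children would skip the hyperlinked `T1` in this fragment.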

Read configuration parameters for the connection to the current version of the database:

In [3]:
filename = Path(os.path.expanduser('~')) / ".database.ini"
section = 'psqlaws'

parser = ConfigParser()
parser.read(filename)
db = {}
if parser.has_section(section):
    params = parser.items(section)
    for param in params:
        db[param[0]] = param[1]
else:
    raise Exception('Section {0} not found in the {1} file'.format(section, filename))

params = db
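
The same lookup can be tried without a real credentials file. The sketch below reads a hypothetical configuration from a string; the keys `host`, `dbname` and `user` and their values are assumptions for illustration, since the actual content of `~/.database.ini` is not shown here. `read_section` is an illustrative helper name.

```python
from configparser import ConfigParser

# Hypothetical file content; the real parameters live in ~/.database.ini
SAMPLE_INI = """
[psqlaws]
host = db.example.org
dbname = example_db
user = example_user
"""

def read_section(parser, section):
    """Return the options of one section as a dictionary."""
    if not parser.has_section(section):
        raise KeyError(f"Section {section} not found")
    return dict(parser.items(section))

demo_parser = ConfigParser()
demo_parser.read_string(SAMPLE_INI)
demo_params = read_section(demo_parser, "psqlaws")
print(sorted(demo_params))
```

`dict(parser.items(section))` builds the same dictionary as the explicit loop above.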

Declare the input directory and the file name of the document:

In [4]:
inputdir = Path(os.path.expanduser('~')) / "tmp" / "typology-web-update-content"
filename='Keith_etal_EarthsEcosystems_Appendix S4_3rdrevisions_subm+DK.docx'

## Read content from document

In [5]:
profiles=Document(inputdir / filename)
profiles

<docx.document.Document at 0x107c5cac0>

Check total number of paragraphs in document

In [6]:
len(profiles.paragraphs)

1430

Read list of contributors from the headers of each section

In [7]:
contribs=list()
for section in profiles.sections:
    h = section.header
    for p in h.paragraphs:
        if re.search("^Contributors",p.text):
            m=p.text.split(':')
            contribs.append(m[1])

There should be one element for each profile (110 profiles):

In [8]:
len(contribs)

110
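
As a small worked example of the header parsing, the sketch below splits one hypothetical header line into a clean list of names. The names are placeholders and `parse_contributors` is an illustrative helper, not part of the original script.

```python
def parse_contributors(header_text):
    """Return the names after the first colon, or None if this is not a contributors line."""
    if not header_text.startswith("Contributors"):
        return None
    _, _, names = header_text.partition(":")
    return [name.strip() for name in names.split(",") if name.strip()]

# Hypothetical header line; the names are placeholders
demo_contribs = parse_contributors("Contributors: A. Author, B. Author , C. Author")
print(demo_contribs)
```

Using `partition(":")` keeps everything after the first colon, so a name list that itself contained a colon would not be cut short.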

Proceed to read the content. Each profile has three paragraphs of content starting with a title ('Ecosystem properties','Distribution','Biome','Ecological drivers'); other paragraphs are either references or extra content (photo attributions). We also add the list of authors to each profile by matching the profiles, in order, to the sequence of contributors in the list above (this only works if all profiles have this information in their headers).

In [9]:
item=None
items=list()
nr=0
for k in range(19,len(profiles.paragraphs)):
    p=profiles.paragraphs[k].text
    if re.search(r"^[TMFS]+[0-9]\.[0-9]",p):  # escaped dot: match a literal '.' only
        if item is not None:
            items.append(item)
        m = p.split(" ")
        item={'paragraph':k,'Code':m[0],'Name':p,'References':list(),'Extra':list(),'Contributors':list()}
        addref=False
        authors=contribs[nr].split(',')
        for aut in authors:
            item['Contributors'].append(aut.strip(' '))
        nr=nr+1
    else:
        if item is not None:
            m=p.split(':')
            if m[0] in ('Ecosystem properties','Distribution','Biome','Ecological drivers'):
                #print(m[0])
                item[m[0]]=":".join(m[1:]).strip(" ")
            elif m[0] == 'References':  # ('References') is a plain string, so `in` was a substring test
                addref=True
            else:
                if addref:
                    item['References'].append(p)
                else:
                    item['Extra'].append(p)
items.append(item)
#items[1]
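
The heading test can be checked in isolation. This is a minimal sketch, assuming that profile codes always have the form letters, digits, a literal dot, digits; `split_heading` is an illustrative helper name. The sample strings are taken from the document content shown further down.

```python
import re

# Letters from the realm set, digits, a literal dot, digits, then the profile name
PROFILE_HEADING = re.compile(r"^([TMFS]+[0-9]+\.[0-9]+)\s+(.*)")

def split_heading(text):
    """Return (code, name) for a profile heading, or None for any other paragraph."""
    m = PROFILE_HEADING.match(text)
    return (m.group(1), m.group(2)) if m else None

demo_profile = split_heading("MT2.2 Large seabird and pinniped colonies")
demo_biome = split_heading("MT3. Anthropogenic shorelines biome")
demo_other = split_heading("Exmouth Marina, Devon, UK.")
print(demo_profile, demo_biome, demo_other)
```

Biome headings such as `MT3.` have no digit after the dot, so they are not treated as the start of a new profile.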

We will check that one of the newest profiles is correct:

In [10]:
items[105]

{'paragraph': 1366,
 'Code': 'MT2.2',
 'Name': 'MT2.2 Large seabird and pinniped colonies',
 'References': ['Ellis JC (2005) Marine Birds on land: A review of plant biomass, species richness, and community composition in seabird colonies. Plant Ecology 181, 227–241. ',
  'Otero XL, De La Peña-Lastra S,  Pérez-Alberti A, Ferreira TO, Huerta-Diaz MA (2018) Seabird colonies as important global drivers in the nitrogen and phosphorus cycles. Nature Communications 9, 246.',
  'Riddick SN, Dragosits U, Blackall TD, Daunt F, Wanless S, Sutton MA (2012) The global distribution of ammonia emissions from seabird colonies. Atmospheric Environment 55, 319e327',
  'MT3. Anthropogenic shorelines biome',
  'Exmouth Marina, Devon, UK.',
  'Credit: Red Zeppelin on Unsplash',
  'The Anthropogenic shorelines biome is distributed globally where urbanised and industrial areas adjoin the coast, and includes some more remote structures such as artificial islands. It includes marine interfaces constructed from

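Some content from the biome descriptions has ended up in the 'References' element of the dictionary, because every paragraph after the 'References' title is appended there until the next profile starts. We will ignore this for now.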

Does the paragraph text include references to biomes?

In [11]:
print(items[33]['paragraph'])
items[33]['Ecosystem properties']

555


'Extensive ‘semi-natural’ grasslands and open shrublands exist where woody components of vegetation have been removed or greatly modified for agricultural land uses. Hence they have been ‘derived’ from a range of other ecosystems (mostly from biomes T1, T2, T3, T4, a few from T5). Remaining vegetation includes a substantial component of local indigenous species, as well as an introduced exotic element, providing habitat for a mixed indigenous and non-indigenous fauna. Although structurally simpler at site scales than the systems from which they were derived, spatial complexity may be greater in fragmented landscapes and they often harbour appreciable diversity of native organisms, including some no longer present in ‘natural’ ecosystems. Dominant plant growth forms include tussock or stoloniferous grasses and forbs, with or without non-vascular plants, shrubs and scattered trees. These support microbial decomposers and diverse invertebrate groups that function as detritivores, herbivor

We should add the keyword 'biome' before each biome code, so that the text is consistent with the previous version in the database. We will also remove the map caveats warning if present:

In [28]:
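def add_biome_kwd(x):
    # prefix biome codes such as 'T1,' or 'T5)' with the keyword 'biome'
    y=re.sub('([MFTS]+[0-9][ ,\\)])','biome \\1',x)
    # str.replace returns a new string, so the result has to be assigned
    y=y.replace(' See map caveats (Table S4.1).','')
    return y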

add_biome_kwd(items[33]['Ecosystem properties'])



'Extensive ‘semi-natural’ grasslands and open shrublands exist where woody components of vegetation have been removed or greatly modified for agricultural land uses. Hence they have been ‘derived’ from a range of other ecosystems (mostly from biomes biome T1, biome T2, biome T3, biome T4, a few from biome T5). Remaining vegetation includes a substantial component of local indigenous species, as well as an introduced exotic element, providing habitat for a mixed indigenous and non-indigenous fauna. Although structurally simpler at site scales than the systems from which they were derived, spatial complexity may be greater in fragmented landscapes and they often harbour appreciable diversity of native organisms, including some no longer present in ‘natural’ ecosystems. Dominant plant growth forms include tussock or stoloniferous grasses and forbs, with or without non-vascular plants, shrubs and scattered trees. These support microbial decomposers and diverse invertebrate groups that func
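
A short worked example makes both steps easier to verify than the long paragraph above. This sketch restates the same substitution and removal under an illustrative name, `tag_biomes`, and applies it to a shortened, hypothetical sentence.

```python
import re

BIOME_CODE = re.compile(r"([MFTS]+[0-9][ ,)])")
CAVEAT = " See map caveats (Table S4.1)."

def tag_biomes(text):
    """Prefix biome codes with 'biome' and drop the map caveats sentence."""
    tagged = BIOME_CODE.sub(r"biome \1", text)
    # str.replace returns a new string; the result must be kept
    return tagged.replace(CAVEAT, "")

demo_sample = (
    "Derived from other ecosystems (mostly from T1, T2, a few from T5)."
    " See map caveats (Table S4.1)."
)
demo_tagged = tag_biomes(demo_sample)
print(demo_tagged)
```

The pattern only matches a code followed by a space, a comma or a closing parenthesis, so functional group codes such as `T7.1` and references such as `Table S4.1` are left untouched.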

## Update content to database
We will first open the connection and activate a normal cursor:

In [29]:
conn = psycopg2.connect(**params)
cur = conn.cursor()


This code reads each element in the list `items`, prepares the insert queries for three tables in the database, executes them and counts the number of affected rows. There should be three affected rows for each profile:

In [30]:
qrystr_traits=   "INSERT INTO efg_ecological_traits(code,language,description,contributors,version,update) values %s ON CONFLICT ON CONSTRAINT efg_ecological_traits_pkey DO UPDATE SET description=EXCLUDED.description, contributors=EXCLUDED.contributors, update=CURRENT_TIMESTAMP(0)"
qrystr_drivers=   "INSERT INTO efg_key_ecological_drivers(code,language,description,contributors,version,update) values %s ON CONFLICT ON CONSTRAINT efg_key_ecological_drivers_pkey DO UPDATE SET description=EXCLUDED.description, contributors=EXCLUDED.contributors, update=CURRENT_TIMESTAMP(0)"
qrystr_dist=   "INSERT INTO efg_distribution(code,language,description,contributors,version,update) values %s ON CONFLICT ON CONSTRAINT efg_distribution_pkey DO UPDATE SET description=EXCLUDED.description, contributors=EXCLUDED.contributors, update=CURRENT_TIMESTAMP(0)"

affected_rows = 0
for item in items:
    # insert or update each description only if the profile has that element
    if 'Ecosystem properties' in item.keys():
        values=tuple([item['Code'],'en',add_biome_kwd(item['Ecosystem properties']),item['Contributors'],'v2.1',datetime.today().date()])
        cur.execute(qrystr_traits,(values,))
        affected_rows=affected_rows+cur.rowcount
    if 'Ecological drivers' in item.keys():
        values=tuple([item['Code'],'en',add_biome_kwd(item['Ecological drivers']),item['Contributors'],'v2.1',datetime.today().date()])
        cur.execute(qrystr_drivers,(values,))
        affected_rows=affected_rows+cur.rowcount
    if 'Distribution' in item.keys():
        values=tuple([item['Code'],'en',add_biome_kwd(item['Distribution']),item['Contributors'],'v2.1',datetime.today().date()])
        cur.execute(qrystr_dist,(values,))
        affected_rows=affected_rows+cur.rowcount
print(affected_rows)

330
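
The insert-or-update behaviour of `ON CONFLICT ... DO UPDATE` can be tried without access to the PostgreSQL database. The sketch below swaps in an in-memory SQLite database (which supports a similar upsert clause from SQLite 3.24 onwards) purely for illustration. The table `demo_traits` and its primary key on (code, language, version) are assumptions, not the actual schema.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()

# Hypothetical table; the real tables have more columns
demo_cur.execute(
    "CREATE TABLE demo_traits ("
    " code TEXT, language TEXT, version TEXT, description TEXT,"
    " PRIMARY KEY (code, language, version))"
)

UPSERT = (
    "INSERT INTO demo_traits (code, language, version, description)"
    " VALUES (?, ?, ?, ?)"
    " ON CONFLICT (code, language, version)"
    " DO UPDATE SET description = excluded.description"
)

# The first call inserts a row, the second updates the same row
demo_cur.execute(UPSERT, ("MT2.2", "en", "v2.1", "first draft"))
demo_cur.execute(UPSERT, ("MT2.2", "en", "v2.1", "revised text"))
demo_conn.commit()

demo_upserted = demo_cur.execute(
    "SELECT code, description FROM demo_traits"
).fetchall()
print(demo_upserted)
demo_conn.close()
```

Running the statement twice for the same key leaves a single row with the latest description, which is why the notebook cell can be re-run safely.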


This is a good time to commit the changes:

In [31]:
conn.commit()

Before we close the connection, we will check that all information was correctly loaded:

In [32]:
qry=" ".join(("select code,LENGTH(t.description), LENGTH(k.description), LENGTH(d.description)",
"from efg_ecological_traits t",
"full join efg_key_ecological_drivers k",
"USING(code,version,language)",
"full join efg_distribution d",
"USING(code,version,language)",
"where version='v2.1' AND language='en' order by code;"))
cur.execute(qry)
updated_codes=cur.fetchall()

In [33]:
updated_codes


[('F1.1', 1601, 863, 158),
 ('F1.2', 1597, 1148, 155),
 ('F1.3', 2014, 848, 119),
 ('F1.4', 1280, 726, 84),
 ('F1.5', 1381, 815, 85),
 ('F1.6', 1944, 1079, 100),
 ('F1.7', 1501, 967, 152),
 ('F2.1', 2223, 660, 58),
 ('F2.10', 2011, 718, 152),
 ('F2.2', 2250, 800, 79),
 ('F2.3', 2039, 1111, 135),
 ('F2.4', 1817, 1029, 129),
 ('F2.5', 1419, 955, 89),
 ('F2.6', 1395, 1265, 113),
 ('F2.7', 1962, 832, 85),
 ('F2.8', 1522, 881, 165),
 ('F2.9', 2279, 807, 179),
 ('F3.1', 1053, 1108, 245),
 ('F3.2', 1600, 1040, 121),
 ('F3.3', 1516, 788, 170),
 ('F3.4', 1538, 1032, 179),
 ('F3.5', 1143, 1080, 153),
 ('FM1.1', 2021, 1176, 85),
 ('FM1.2', 2118, 742, 65),
 ('FM1.3', 1458, 671, 324),
 ('M1.1', 1298, 754, 76),
 ('M1.10', 2378, 601, 147),
 ('M1.2', 1956, 807, 190),
 ('M1.3', 1990, 735, 108),
 ('M1.4', 1772, 942, 211),
 ('M1.5', 1779, 734, 185),
 ('M1.6', 2081, 902, 69),
 ('M1.7', 1550, 682, 58),
 ('M1.8', 1910, 904, 80),
 ('M1.9', 1020, 1119, 278),
 ('M2.1', 2324, 664, 70),
 ('M2.2', 1643, 908, 87),
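
Scanning the list by eye is error-prone, so a small check can flag profiles where one of the three descriptions is missing. This is a sketch, assuming that a missing row in the full join shows up as `None` in the fetched tuple; `incomplete_codes` is an illustrative helper and the last sample row is hypothetical.

```python
def incomplete_codes(rows):
    """Return codes where any of the three description lengths is None."""
    return [code for code, *lengths in rows if any(n is None for n in lengths)]

demo_lengths = [
    ("F1.1", 1601, 863, 158),    # taken from the output above
    ("F1.2", 1597, 1148, 155),   # taken from the output above
    ("X9.9", 1200, None, 90),    # hypothetical row with a missing drivers description
]
demo_missing = incomplete_codes(demo_lengths)
print(demo_missing)
```

Calling `incomplete_codes(updated_codes)` on the real query result should return an empty list if all three tables were updated for every profile.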

We can now close the database connection:

In [34]:

cur.close()
        
if conn is not None:
    conn.close()
    print('Database connection closed.')

Database connection closed.


# That's it!