# Update profiles for functional groups of the IUCN Global Ecosystem typology

Scripts by José R. Ferrer-Paris

The scripts described in this document are used to:

- Read from document in `docx` format
- Update content in database

## Set-up
Load all the libraries we will need in this script:

In [205]:
import docx
from docx import Document
import os
from pathlib import Path
import re
from datetime import datetime
from configparser import ConfigParser
import psycopg2
from psycopg2.extensions import AsIs
import docx2txt


Initially, when I ran this code, the text from paragraphs did not include hyperlinks to biomes and efgs.
Here I add a hack from https://github.com/python-openxml/python-docx/issues/85 to extract text and hyperlinks as a single text element.

In [3]:
from docx.text.paragraph import Paragraph

Paragraph.text = property(lambda self: GetParagraphText(self))

def GetParagraphText(paragraph):

    def GetTag(element):
        return "%s:%s" % (element.prefix, re.match("{.*}(.*)", element.tag).group(1))

    text = ''
    runCount = 0
    for child in paragraph._p:
        tag = GetTag(child)
        if tag == "w:r":
            text += paragraph.runs[runCount].text
            runCount += 1
        if tag == "w:hyperlink":
            for subChild in child:
                if GetTag(subChild) == "w:r":
                    text += subChild.text
    return text

Read configuration parameters for the connection to the current version of the database:

In [4]:
filename = Path(os.path.expanduser('~')) / ".database.ini"
section = 'psqlaws'

parser = ConfigParser()
parser.read(filename)
db = {}
if parser.has_section(section):
    params = parser.items(section)
    for param in params:
        db[param[0]] = param[1]
else:
    raise Exception('Section {0} not found in the {1} file'.format(section, filename))

params = db

Declare path for inputdir and filename: 

In [5]:
inputdir = Path(os.path.expanduser('~')) / "tmp" / "typology-web-update-content"
filename='Keith_etal_EarthsEcosystems_Appendix S4_3rdrevisions_subm+DK.docx'

## Read content from document

In [7]:
profiles=Document(inputdir / filename)
profiles

<docx.document.Document at 0x7f756d07b840>

Check total number of paragraphs in document

In [8]:
len(profiles.paragraphs)

1430

### Extract Diagrammatic Assembly Models

This function extracts text and puts all images in an outputdir:

In [26]:


# Extract the images to img_folder/
outputdir=inputdir / 'img_folder'
if not os.path.exists(outputdir):
    os.makedirs(outputdir)
text = docx2txt.process(inputdir / filename, outputdir)


Save all 'rId:filenames' relationships in an dictionary named rels

In [27]:
rels = {}
for r in profiles.part.rels.values():
    if isinstance(r._target, docx.parts.image.ImagePart):
        rels[r.rId] = os.path.basename(r._target.partname)


In [None]:
We can then go through the list of paragraphs and process text or images:

In [36]:


for paragraph in profiles.paragraphs:
    # If you find an image
    if 'Graphic' in paragraph._p.xml:
        # Get the rId of the image
        for rId in rels:
            if rId in paragraph._p.xml:
                # Your image will be in os.path.join(img_path, rels[rId])
                target=paragraph
    #else:
        # It's not an image

Example of xml content of a paragraph

In [88]:
target._p.xml

'<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/d

In [97]:
#target._p.xml.find("name=")
re.findall('descr="(.+)"',target._p.xml)

['MAP/MFT1.3.png']

But our DAMs are not labelled nicely, we can move the DAMs manually to a different directory and then match the EFG code afterwards, see below.

### List of contributors

Read list of contributors from the headers of each section

In [9]:
contribs=list()
for section in profiles.sections:
    h = section.header
    for p in h.paragraphs:
        if re.search("^Contributors",p.text):
            m=p.text.split(':')
            contribs.append(m[1])

There should be one element for each profile (110 profiles):

In [10]:
len(contribs)

110

### Extract profile content

Proceed to read content, each profile has three paragraphs of content starting with a title ('Ecosystem properties','Distribution','Biome','Ecological drivers'), other paragraphs are either references or extra content (photo attributions). We will add the list of author matching each profile to the sequence of contributors in the list above (this only work if all profiles have this information in their headers). 

In [98]:
item=None
items=list()
nr=0
for k in range(19,len(profiles.paragraphs)):
    x=profiles.paragraphs[k]._p.xml            
    p=profiles.paragraphs[k].text
    if re.search("^[TMFS]+[0-9].[0-9]",p):
        if item is not None:
            items.append(item)
        m = p.split(" ")
        item={'paragraph':k,'Code':m[0],'Name':p,'References':list(),'Extra':list(),'Contributors':list(),'Images':list()}
        addref=False
        authors=contribs[nr].split(',')
        for aut in authors:
            item['Contributors'].append(aut.strip(' '))
        nr=nr+1
    else:
        if item is not None:
            if 'Graphic' in x:
                for rId in rels:
                    if rId in x:
                    # Your image will be in os.path.join(img_path, rels[rId])
                        item['Images'].append({'descr':re.findall('descr="(.+)"',x),'file':rels[rId]})

            m=p.split(':')
            if m[0] in ('Ecosystem properties','Distribution','Biome','Ecological drivers'):
                #print(m[0])
                item[m[0]]=":".join(m[1:]).strip(" ")
            elif m[0] in ('References'):
                addref=True
            else:
                if addref:
                    item['References'].append(p)
                else:
                    item['Extra'].append(p)
items.append(item)
#items[1]

We will check that one of the newest profiles is correct:

In [102]:
items[104]

{'paragraph': 1358,
 'Code': 'MT2.1',
 'Name': 'MT2.1 Coastal shrublands and grasslands',
 'References': ['van der Maarel E (2001) Dry coastal ecosystems: General aspects. Ecosystems of the world 2C. Elsevier, Amsterdam.'],
 'Extra': ['Coastal shrubland, Strait of Magellan, Chile.',
  'Credit: David Keith'],
 'Contributors': ['DA Keith', 'J Loidi', 'ATR Acosta'],
 'Images': [{'descr': ['FOTO/mt_2_1.jpg'], 'file': 'image341.jpg'},
  {'descr': [], 'file': 'image342.png'},
  {'descr': ['MAP/MT2.1.png'], 'file': 'image343.png'}],
 'Ecosystem properties': 'Relatively low productivity grasslands, shrublands, and low forests on exposed coastlines are limited by salt influx, water deficit, and recurring disturbances. Diversity is low across taxa and trophic networks are simple, but virtually all plants and animals have strong dispersal traits and most consumers move between adjacent terrestrial and marine ecosystems. Vegetation and substrates are characterised by strong gradients from sea to l

Some content from the Biome descriptions has been added in the 'extra' element of the dictionary, but we will ignore those for now.

Does the text from paragraphs include references to biomes ?

In [13]:
print(items[33]['paragraph'])
items[33]['Ecosystem properties']

555


'Extensive ‘semi-natural’ grasslands and open shrublands exist where woody components of vegetation have been removed or greatly modified for agricultural land uses. Hence they have been ‘derived’ from a range of other ecosystems (mostly from biomes T1, T2, T3, T4, a few from T5). Remaining vegetation includes a substantial component of local indigenous species, as well as an introduced exotic element, providing habitat for a mixed indigenous and non-indigenous fauna. Although structurally simpler at site scales than the systems from which they were derived, spatial complexity may be greater in fragmented landscapes and they often harbour appreciable diversity of native organisms, including some no longer present in ‘natural’ ecosystems. Dominant plant growth forms include tussock or stoloniferous grasses and forbs, with or without non-vascular plants, shrubs and scattered trees. These support microbial decomposers and diverse invertebrate groups that function as detritivores, herbivor

We whould add the keyword biome, so that it will be consistent with previous version in the database. Also we will remove the map caveats warning if present:

In [14]:
def add_biome_kwd(x):
    y=re.sub('([MFTS]+[0-9][ ,\\)])','biome \\1',x)
    y.replace(' See map caveats (Table S4.1).','')
    return(y)

add_biome_kwd(items[33]['Ecosystem properties'])



'Extensive ‘semi-natural’ grasslands and open shrublands exist where woody components of vegetation have been removed or greatly modified for agricultural land uses. Hence they have been ‘derived’ from a range of other ecosystems (mostly from biomes biome T1, biome T2, biome T3, biome T4, a few from biome T5). Remaining vegetation includes a substantial component of local indigenous species, as well as an introduced exotic element, providing habitat for a mixed indigenous and non-indigenous fauna. Although structurally simpler at site scales than the systems from which they were derived, spatial complexity may be greater in fragmented landscapes and they often harbour appreciable diversity of native organisms, including some no longer present in ‘natural’ ecosystems. Dominant plant growth forms include tussock or stoloniferous grasses and forbs, with or without non-vascular plants, shrubs and scattered trees. These support microbial decomposers and diverse invertebrate groups that func

### DAM again

In [134]:
outputdir = inputdir / 'DAM'
DAMfiles = os.listdir(outputdir)
content_repo =  Path(os.path.expanduser('~')) / 'proyectos' / 'typology-website' / 'typology-map-content' / 'assets' / 'uploads'
#os.listdir(content_repo)

In [139]:
import shutil
# construct full file path
        


copied mft_1_3-diagram.png


In [152]:
for record in items[89:90]:
    code = record['Code'].replace('.','_').lower()
    code=re.sub(r'([tsmf]+)',r'\1_',code)
    outfile=('%s-diagram.png' % (code))
    for img in record['Images']:
        if img['file'] in DAMfiles:
            source =  inputdir / 'DAM' / img['file']
            destination =  content_repo / outfile
            # copy only files
            if os.path.isfile(source):
                shutil.copy(source, destination)
             

            print("DAM for EFG %s-diagram.png in file %s" % (code,img['file']))


copied m_2_4-diagram.png
DAM for EFG m_2_4-diagram.png in file image18.png
copied m_2_4-diagram.png
DAM for EFG m_2_4-diagram.png in file image18.png
copied m_2_4-diagram.png
DAM for EFG m_2_4-diagram.png in file image291.png
copied m_2_4-diagram.png
DAM for EFG m_2_4-diagram.png in file image18.png


In [157]:
# manual edits
fixes=({'code':'f_2_8', 'file': 'image215.png'},
{'file': 'image218.png', 'code':'f_2_9'},
{'file': 'image237.png', 'code':'f_3_5'},
{'file': 'image291.png', 'code':'m_2_4'},
{'file': 'image296.png', 'code':'m_2_5'},
{'file': 'image312.png', 'code':'m_3_5'},
#falta M3.6?
{'file': 'image318.png', 'code':'m_3_7'},
{'file': 'image322.png', 'code':'m_4_1'},
{'file': 'image135.png', 'code':'sf_2_2'},
{'file': 'image168.png', 'code':'tf_1_7'})

for record in fixes:
    source =  inputdir / 'DAM' / record['file']
    outfile = '%s-diagram.png' % (record['code'])
    destination =  content_repo / outfile
    if os.path.isfile(source):
        shutil.copy(source, destination)
        print("DAM for EFG %s-diagram.png in file %s" % (code,record['file']))

DAM for EFG m_2_4-diagram.png in file image215.png
DAM for EFG m_2_4-diagram.png in file image218.png
DAM for EFG m_2_4-diagram.png in file image237.png
DAM for EFG m_2_4-diagram.png in file image291.png
DAM for EFG m_2_4-diagram.png in file image296.png
DAM for EFG m_2_4-diagram.png in file image312.png
DAM for EFG m_2_4-diagram.png in file image318.png
DAM for EFG m_2_4-diagram.png in file image322.png
DAM for EFG m_2_4-diagram.png in file image135.png
DAM for EFG m_2_4-diagram.png in file image168.png


## Update content to database
We will first open the connection and activate a normal cursor:

In [158]:
conn = psycopg2.connect(**params)
cur = conn.cursor()


This code will read from each element in the list `items`, prepare the insert queries for three tables in the database, execute the queries and count the number of affected rows, should be three for each profile:

In [30]:
qrystr_traits=   "INSERT INTO efg_ecological_traits(code,language,description,contributors,version,update) values %s ON CONFLICT ON CONSTRAINT efg_ecological_traits_pkey DO UPDATE SET description=EXCLUDED.description, contributors=EXCLUDED.contributors, update=CURRENT_TIMESTAMP(0)"
qrystr_drivers=   "INSERT INTO efg_key_ecological_drivers(code,language,description,contributors,version,update) values %s ON CONFLICT ON CONSTRAINT efg_key_ecological_drivers_pkey DO UPDATE SET description=EXCLUDED.description, contributors=EXCLUDED.contributors, update=CURRENT_TIMESTAMP(0)"
qrystr_dist=   "INSERT INTO efg_distribution(code,language,description,contributors,version,update) values %s ON CONFLICT ON CONSTRAINT efg_distribution_pkey DO UPDATE SET description=EXCLUDED.description, contributors=EXCLUDED.contributors, update=CURRENT_TIMESTAMP(0)"

affected_rows = 0
for item in items:
    desc=re.sub('([MFTS]+[0-9][ ,\\)])','biome \\1',item['Ecosystem properties'])

    if 'Ecosystem properties' in item.keys():
        values=tuple([item['Code'],'en',add_biome_kwd(item['Ecosystem properties']),item['Contributors'],'v2.1',datetime.today().date()])
        cur.execute(qrystr_traits,(values,))
        affected_rows=affected_rows+cur.rowcount
    if 'Ecological drivers' in item.keys():
        values=tuple([item['Code'],'en',add_biome_kwd(item['Ecological drivers']),item['Contributors'],'v2.1',datetime.today().date()])
        cur.execute(qrystr_drivers,(values,))
        affected_rows=affected_rows+cur.rowcount
    if 'Distribution' in item.keys():
        values=tuple([item['Code'],'en',add_biome_kwd(item['Distribution']),item['Contributors'],'v2.1',datetime.today().date()])
        cur.execute(qrystr_dist,(values,))
        affected_rows=affected_rows+cur.rowcount
print(affected_rows)

330


This is a good time to commit the changes:

In [31]:
conn.commit()

Before we close the connection, we will check that all information was correctly loaded:

In [32]:
qry=" ".join(("select code,LENGTH(t.description), LENGTH(k.description), LENGTH(d.description)",
"from efg_ecological_traits t",
"full join efg_key_ecological_drivers k",
"USING(code,version,language)",
"full join efg_distribution d",
"USING(code,version,language)",
"where version='v2.1' AND language='en' order by code;"))
cur.execute(qry)
updated_codes=cur.fetchall()

In [33]:
updated_codes


[('F1.1', 1601, 863, 158),
 ('F1.2', 1597, 1148, 155),
 ('F1.3', 2014, 848, 119),
 ('F1.4', 1280, 726, 84),
 ('F1.5', 1381, 815, 85),
 ('F1.6', 1944, 1079, 100),
 ('F1.7', 1501, 967, 152),
 ('F2.1', 2223, 660, 58),
 ('F2.10', 2011, 718, 152),
 ('F2.2', 2250, 800, 79),
 ('F2.3', 2039, 1111, 135),
 ('F2.4', 1817, 1029, 129),
 ('F2.5', 1419, 955, 89),
 ('F2.6', 1395, 1265, 113),
 ('F2.7', 1962, 832, 85),
 ('F2.8', 1522, 881, 165),
 ('F2.9', 2279, 807, 179),
 ('F3.1', 1053, 1108, 245),
 ('F3.2', 1600, 1040, 121),
 ('F3.3', 1516, 788, 170),
 ('F3.4', 1538, 1032, 179),
 ('F3.5', 1143, 1080, 153),
 ('FM1.1', 2021, 1176, 85),
 ('FM1.2', 2118, 742, 65),
 ('FM1.3', 1458, 671, 324),
 ('M1.1', 1298, 754, 76),
 ('M1.10', 2378, 601, 147),
 ('M1.2', 1956, 807, 190),
 ('M1.3', 1990, 735, 108),
 ('M1.4', 1772, 942, 211),
 ('M1.5', 1779, 734, 185),
 ('M1.6', 2081, 902, 69),
 ('M1.7', 1550, 682, 58),
 ('M1.8', 1910, 904, 80),
 ('M1.9', 1020, 1119, 278),
 ('M2.1', 2324, 664, 70),
 ('M2.2', 1643, 908, 87),

In [None]:
We have problems with the contributors, but it seems to be a problem in the input word docx:

In [168]:
qry="""
WITH A AS (SELECT code,version,contributors FROM efg_ecological_traits WHERE version='v1.0'),
B AS (SELECT code,version,contributors FROM efg_ecological_traits WHERE version='v2.0'),
C AS (SELECT code,version,contributors FROM efg_ecological_traits WHERE version='v2.1')
SELECT C.code, array_to_string(A.contributors,', ') AS contribs_v1d0, 
array_to_string(B.contributors,', ') AS contribs_v2d0, array_to_string(C.contributors,', ') AS contribs_v2d1
FROM C 
LEFT JOIN B USING(code)
LEFT JOIN A USING(code)
ORDER BY CODE""";
cur.execute(qry)
list_contrib=cur.fetchall()

In [211]:
import pandas as pd
df=pd.DataFrame(list_contrib,columns=['code','v1.0','v2.0','v2.1'])
df

Unnamed: 0,code,v1.0,v2.0,v2.1
0,F1.1,"RT Kingsford, RC Mac Nally, DA Keith","RT Kingsford, R Mac Nally, PS Giller, MC Rains...","RT Kingsford, R Mac Nally, PS Giller, MC Rains..."
1,F1.2,"RT Kingsford, RC Mac Nally, DA Keith","RT Kingsford, R Mac Nally, PS Giller, MC, Rain...","RT Kingsford, R Mac Nally, PS Giller, MC, Rain..."
2,F1.3,"RT Kingsford, DA Keith","RT Kingsford, PS Giller, DA Keith","RT Kingsford, PS Giller, DA Keith"
3,F1.4,"RT Kingsford, DA Keith","RT Kingsford, B Robson, PS Giller, AH Arthingt...","RT Kingsford, B Robson, PS Giller, AH Arthingt..."
4,F1.5,"RT Kingsford, DA Keith","RT Kingsford, B Robson, PS Giller, AH Arthingt...","RT Kingsford, B Robson, PS Giller, AH Arthingt..."
...,...,...,...,...
105,TF1.3,"RT Kingsford, DA Keith","RT Kingsford, JA Catford, MC Rains, B Robson, ...","RT Kingsford, JA Catford, MC Rains, B Robson, ..."
106,TF1.4,"DA Keith, RT Kingsford, RC Mac Nally, KM Rodri...","DA Keith, RT Kingsford, R Mac Nally, B Robson,...","DA Keith, RT Kingsford, R Mac Nally, B Robson,..."
107,TF1.5,"RT Kingsford, RC Mac Nally, DA Keith","RT Kingsford, R Mac Nally, AH Arthington, JA C...","RT Kingsford, R Mac Nally, AH Arthington, JA C..."
108,TF1.6,"DA Keith, RT Kingsford, F Essl, LJ Jackson, T ...","DA Keith, RT Kingsford, F Essl, LJ Jackson, M ...","DA Keith, RT Kingsford, F Essl, LJ Jackson, M ..."


In [172]:
df.to_csv(inputdir / 'contributors.csv')

In [212]:
dfx=pd.read_excel(inputdir / 'Check-list-contributors.xlsx', sheet_name='contributors')  


In [213]:
tbls=("efg_ecological_traits", "efg_key_ecological_drivers", "efg_distribution")
       
for records in dfx.query("status=='looks fine'")[['code','v2.1']].values:
    autors=records[1].split(",")
    auts = [x.strip(' ') for x in autors]
    for tbl in tbls:
        qry="""UPDATE %s SET contributors=%s WHERE code=%s and version='v2.1'"""
        cur.execute(qry,(AsIs(tbl),auts,records[0]))
    

In [214]:
conn.commit()

In [215]:
qry="""
WITH A AS (SELECT code,version,contributors FROM efg_ecological_traits WHERE version='v1.0'),
B AS (SELECT code,version,contributors FROM efg_ecological_traits WHERE version='v2.0'),
C AS (SELECT code,version,contributors FROM efg_ecological_traits WHERE version='v2.1')
SELECT C.code, array_to_string(A.contributors,', ') AS contribs_v1d0, 
array_to_string(B.contributors,', ') AS contribs_v2d0, array_to_string(C.contributors,', ') AS contribs_v2d1
FROM C 
LEFT JOIN B USING(code)
LEFT JOIN A USING(code)
ORDER BY CODE""";
cur.execute(qry)
list_contrib=cur.fetchall()

In [216]:
df=pd.DataFrame(list_contrib,columns=['code','v1.0','v2.0','v2.1'])
df

Unnamed: 0,code,v1.0,v2.0,v2.1
0,F1.1,"RT Kingsford, RC Mac Nally, DA Keith","RT Kingsford, R Mac Nally, PS Giller, MC Rains...","RT Kingsford, R Mac Nally, PS Giller, MC Rains..."
1,F1.2,"RT Kingsford, RC Mac Nally, DA Keith","RT Kingsford, R Mac Nally, PS Giller, MC, Rain...","RT Kingsford, R Mac Nally, PS Giller, MC, Rain..."
2,F1.3,"RT Kingsford, DA Keith","RT Kingsford, PS Giller, DA Keith","RT Kingsford, PS Giller, DA Keith"
3,F1.4,"RT Kingsford, DA Keith","RT Kingsford, B Robson, PS Giller, AH Arthingt...","RT Kingsford, B Robson, PS Giller, AH Arthingt..."
4,F1.5,"RT Kingsford, DA Keith","RT Kingsford, B Robson, PS Giller, AH Arthingt...","RT Kingsford, B Robson, PS Giller, AH Arthingt..."
...,...,...,...,...
105,TF1.3,"RT Kingsford, DA Keith","RT Kingsford, JA Catford, MC Rains, B Robson, ...","RT Kingsford, JA Catford, MC Rains, B Robson, ..."
106,TF1.4,"DA Keith, RT Kingsford, RC Mac Nally, KM Rodri...","DA Keith, RT Kingsford, R Mac Nally, B Robson,...","DA Keith, RT Kingsford, R Mac Nally, B Robson,..."
107,TF1.5,"RT Kingsford, RC Mac Nally, DA Keith","RT Kingsford, R Mac Nally, AH Arthington, JA C...","RT Kingsford, R Mac Nally, AH Arthington, JA C..."
108,TF1.6,"DA Keith, RT Kingsford, F Essl, LJ Jackson, T ...","DA Keith, RT Kingsford, F Essl, LJ Jackson, M ...","DA Keith, RT Kingsford, F Essl, LJ Jackson, M ..."


We can now close the database connection:

In [217]:

cur.close()
        
if conn is not None:
    conn.close()
    print('Database connection closed.')

Database connection closed.


# That's it!