# Exploring World Bank Land Governance Investments

This mini-research project asks the question:

* To what extent are World Bank investments in cadastral systems supporting the creation of open data on land ownership?

To answer this we will:

* Identify substantial World Bank investments in land governance.
* Fetch project reports, and look for mentions of (a) online availability; and (b) open availability of data. 

This notebook is also intended to document an approach to using documents available through the International Aid Transparency Initiative (IATI) for exploratory research.

**Status:** Proof of concept & rough draft. 



## Part 1: Fetching data

We can use [d-portal.org](http://d-portal.org/) to look for [World Bank projects that mention the term 'cadastral'](http://www.d-portal.org/ctrack.html?search=cadastral&publisher=44000#view=main). Looking at the XML for these projects suggests that they are commonly classified against the World Bank sector code '000725' for Land Administration and Management:

```xml
<sector vocabulary="98" vocabulary-uri="http://pubdocs.worldbank.org/en/275841490966525495/Theme-Taxonomy-and-definitions.pdf" code="000725" percentage="62" xml:lang="en">
    <narrative>Land Administration and Management</narrative>
</sector>
```

Even though this is a custom sector classification, we can substitute it for the sector code in a query generated by the [IATI Data Store Query Builder](http://datastore.iatistandard.org/query/). This fetches a CSV file of all the projects with this code, with one row per project and sector, so that we can keep only the projects with a major contribution to Land Administration and Management.
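
As a rough sketch of how such a query URL can be assembled in code rather than by hand, the snippet below builds it from its parts. The endpoint and parameter names are copied from the request used in this notebook; the helper name `by_sector_url` is hypothetical, and whether the service accepts other parameter combinations is not checked here.

```python
from urllib.parse import urlencode

# Endpoint copied from the request used in this notebook.
BASE_URL = "http://datastore.iatistandard.org/api/1/access/activity/by_sector.csv"


def by_sector_url(reporting_org, sector_code, stream=True):
    """Build a by-sector CSV query URL (hypothetical helper, not part of any library)."""
    params = {
        "reporting-org": reporting_org,
        "sector": sector_code,  # keep as a string so leading zeros survive
        "stream": "True" if stream else "False",
    }
    return BASE_URL + "?" + urlencode(params)


url = by_sector_url("44000", "000725")
print(url)
```

Passing the sector code as a string matters here: `000725` written as a number would lose its leading zeros.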


In [7]:
# Download the data from the IATI Data Store
import pandas as pd

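# Read sector codes as strings so that leading zeros (e.g. '000725') are preserved
df = pd.read_csv("http://datastore.iatistandard.org/api/1/access/activity/by_sector.csv?reporting-org=44000&sector=000725&stream=True", dtype={"sector-code": str})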

df

Unnamed: 0,sector-code,sector,sector-percentage,sector-vocabulary,sector-vocabulary-code,iati-identifier,hierarchy,last-updated-datetime,default-language,reporting-org,...,default-aid-type-code,default-tied-status-code,currency,total-Commitment,total-Disbursement,total-Expenditure,total-Incoming Funds,total-Interest Repayment,total-Loan Repayment,total-Reimbursement
0,000075,,18.0,,,44000-P074106,1,2018-11-29 00:00:00,en,World Bank,...,,5,,76654770,77431175,0,0,-3173765,-2150243,0
1,000022,,100.0,,,44000-P074106,1,2018-11-29 00:00:00,en,World Bank,...,,5,,76654770,77431175,0,0,-3173765,-2150243,0
2,000851,,13.0,,,44000-P074106,1,2018-11-29 00:00:00,en,World Bank,...,,5,,76654770,77431175,0,0,-3173765,-2150243,0
3,000753,,6.0,,,44000-P074106,1,2018-11-29 00:00:00,en,World Bank,...,,5,,76654770,77431175,0,0,-3173765,-2150243,0
4,000033,,6.0,,,44000-P074106,1,2018-11-29 00:00:00,en,World Bank,...,,5,,76654770,77431175,0,0,-3173765,-2150243,0
5,000752,,6.0,,,44000-P074106,1,2018-11-29 00:00:00,en,World Bank,...,,5,,76654770,77431175,0,0,-3173765,-2150243,0
6,000751,,6.0,,,44000-P074106,1,2018-11-29 00:00:00,en,World Bank,...,,5,,76654770,77431175,0,0,-3173765,-2150243,0
7,000331,,6.0,,,44000-P074106,1,2018-11-29 00:00:00,en,World Bank,...,,5,,76654770,77431175,0,0,-3173765,-2150243,0
8,000052,,25.0,,,44000-P074106,1,2018-11-29 00:00:00,en,World Bank,...,,5,,76654770,77431175,0,0,-3173765,-2150243,0
9,000072,,13.0,,,44000-P074106,1,2018-11-29 00:00:00,en,World Bank,...,,5,,76654770,77431175,0,0,-3173765,-2150243,0


In [99]:
# Now we keep just the rows for sector code 000725 (Land Administration and Management)
df = df[df['sector-code'] == '000725']

# And filter for projects with > 20% allocation to this sector
df[df['sector-percentage'] > 20.0]

Unnamed: 0,sector-code,sector,sector-percentage,sector-vocabulary,sector-vocabulary-code,iati-identifier,hierarchy,last-updated-datetime,default-language,reporting-org,...,default-aid-type-code,default-tied-status-code,currency,total-Commitment,total-Disbursement,total-Expenditure,total-Incoming Funds,total-Interest Repayment,total-Loan Repayment,total-Reimbursement
46,000725,,94.0,,,44000-P159692,1,2018-11-29 00:00:00,en,World Bank,...,,5,,43000000,0,0,0,0,0,0
64,000725,,30.0,,,44000-P132306,1,2018-11-29 00:00:00,en,World Bank,...,,5,,60800000,45938751,0,0,0,0,0
82,000725,,100.0,,,44000-P154387,1,2018-11-29 00:00:00,en,World Bank,...,,5,,150000000,0,0,0,0,0,0
139,000725,,50.0,,,44000-P122219,1,2018-11-29 00:00:00,en,World Bank,...,,5,,47870000,19506935,0,0,-503375,-2700421,0
187,000725,,100.0,,,44000-P121289,1,2018-11-29 00:00:00,en,World Bank,...,,5,,80000000,35289637,0,0,-1416077,-1252921,0
388,000725,,28.0,,,44000-P066051,1,2018-11-29 00:00:00,en,World Bank,...,,5,,65216463,66403031,0,0,-4669596,-4104691,0
445,000725,,39.0,,,44000-P090157,1,2018-11-29 00:00:00,en,World Bank,...,,5,,29407500,21780471,0,0,-572615,0,0
547,000725,,100.0,,,44000-P160661,1,2018-11-29 00:00:00,en,World Bank,...,,5,,200000000,0,0,0,0,0,0
739,000725,,23.0,,,44000-P107343,1,2018-11-29 00:00:00,en,World Bank,...,,5,,68000000,49449750,0,0,0,0,0
794,000725,,50.0,,,44000-P106284,1,2018-11-29 00:00:00,en,World Bank,...,,5,,281126500,196661613,0,0,-4219262,-7908316,0
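
The two-step filter above can be checked on a small inline sample. This is a minimal sketch: the first three rows reuse values shown in the outputs above, the last row (`44000-HYPOTHETICAL`) is invented for illustration, and only three of the real columns are included.

```python
import io

import pandas as pd

# Three rows taken from the outputs above, plus one invented row
# (44000-HYPOTHETICAL) that falls below the 20% threshold.
sample_csv = io.StringIO(
    "sector-code,sector-percentage,iati-identifier\n"
    "000725,94.0,44000-P159692\n"
    "000725,30.0,44000-P132306\n"
    "000075,18.0,44000-P074106\n"
    "000725,12.0,44000-HYPOTHETICAL\n"
)

# dtype=str keeps '000725' from being parsed as the integer 725.
sectors = pd.read_csv(sample_csv, dtype={"sector-code": str})

# Combine both conditions in one boolean mask.
mask = (sectors["sector-code"] == "000725") & (sectors["sector-percentage"] > 20.0)
substantial = sectors[mask]

print(substantial["iati-identifier"].tolist())
# → ['44000-P159692', '44000-P132306']
```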


In [100]:
# Then we want to fetch the document list for these projects.
# For this we use the datastore again, but we need the XML which contains documents
# We build a new data frame with just the information we need.
from lxml import etree
import requests

documents = []
for index, activity in df[df['sector-percentage'] > 20.0].iterrows():
    xml = requests.get("http://datastore.iatistandard.org/api/1/access/activity.xml?iati-identifier=%s"%activity['iati-identifier'])
    tree = etree.fromstring(xml.content)
    docindex = 0
    for document in tree.xpath("/result/iati-activities/iati-activity/document-link"):
        docindex = docindex + 1
        docdata = {
            'iati_identifier': activity['iati-identifier'],
            'title': activity['title'],
            'value': activity['total-Commitment'],
            'country': activity['recipient-country'],
            'document_title': document.xpath('title/narrative')[0].text,
            'document_url': document.get('url'),
            'document_index': str(docindex),
            'text':''
           }
        documents.append(docdata)

docf = pd.DataFrame(data=documents)

docf

Unnamed: 0,country,document_index,document_title,document_url,iati_identifier,text,title,value
0,Lebanon,1,Contract Awards,http://www.worldbank.org/projects/P159692?lang...,44000-P159692,,Land Administration System Modernization,43000000
1,Lebanon,2,Project URL,http://www.worldbank.org/projects/P159692?lang=en,44000-P159692,,Land Administration System Modernization,43000000
2,Lebanon,3,Procurement Notices,http://www.worldbank.org/projects/P159692?lang...,44000-P159692,,Land Administration System Modernization,43000000
3,Niger,1,Project Information Document,http://documents.worldbank.org/curated/en/8272...,44000-P132306,,NIGER COMMUNITY ACTION PROGRAM PHASE 3,60800000
4,Niger,2,Project Information Document,http://documents.worldbank.org/curated/en/5031...,44000-P132306,,NIGER COMMUNITY ACTION PROGRAM PHASE 3,60800000
5,Niger,3,Project Appraisal Document,http://documents.worldbank.org/curated/en/8624...,44000-P132306,,NIGER COMMUNITY ACTION PROGRAM PHASE 3,60800000
6,Niger,4,Implementation Status and Results Report,http://documents.worldbank.org/curated/en/7828...,44000-P132306,,NIGER COMMUNITY ACTION PROGRAM PHASE 3,60800000
7,Niger,5,Implementation Status and Results Report,http://documents.worldbank.org/curated/en/1604...,44000-P132306,,NIGER COMMUNITY ACTION PROGRAM PHASE 3,60800000
8,Niger,6,Implementation Status and Results Report,http://documents.worldbank.org/curated/en/7198...,44000-P132306,,NIGER COMMUNITY ACTION PROGRAM PHASE 3,60800000
9,Niger,7,Implementation Status and Results Report,http://documents.worldbank.org/curated/en/3045...,44000-P132306,,NIGER COMMUNITY ACTION PROGRAM PHASE 3,60800000
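
The loop above assumes that every `document-link` element has a `title/narrative` child; indexing `[0]` on an empty result would raise an error. The sketch below shows one way to tolerate missing titles. It runs on an invented XML fragment shaped like the paths used in the loop, and uses the standard library's `xml.etree.ElementTree` rather than `lxml` to stay dependency-free. The function name `document_links` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample, shaped like the paths queried in the loop above.
sample_xml = """<result><iati-activities><iati-activity>
  <iati-identifier>44000-EXAMPLE</iati-identifier>
  <document-link url="http://example.org/a.pdf" format="application/pdf">
    <title><narrative>Project Information Document</narrative></title>
  </document-link>
  <document-link url="http://example.org/b.pdf" format="application/pdf"/>
</iati-activity></iati-activities></result>"""


def document_links(xml_text):
    """Return one dict per document-link, with an empty title when none is given."""
    root = ET.fromstring(xml_text)
    path = "./iati-activities/iati-activity/document-link"
    links = []
    for position, link in enumerate(root.findall(path), start=1):
        links.append({
            "document_index": str(position),
            # findtext returns the default instead of raising when the path is absent
            "document_title": link.findtext("title/narrative", default=""),
            "document_url": link.get("url"),
        })
    return links


links = document_links(sample_xml)
for link in links:
    print(link)
```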


## Part 2: Fetching the documents

We now want to download the documents and get them into a form that we can search. 

To keep the code simple we use [Textract](https://textract.readthedocs.io/), which has some [system dependencies that need to be installed before the Python package](https://textract.readthedocs.io/en/latest/installation.html).

The following code downloads each PDF and converts it to text, or reads previously extracted text from a local cache, and then stores that text in the dataframe.

In [103]:
import os
import textract

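def get_document(url, identifier):
    """Return the text of the document at `url`, reading from the local cache if present."""
    # Make sure the cache directory exists before reading or writing
    os.makedirs("documentcache", exist_ok=True)
    documentName = "documentcache/%s - %s" % (identifier, url.replace("http://documents.worldbank.org/curated/", "").replace("/", "-"))
    text = ""

    if os.path.isfile(documentName + ".txt"):
        # The cache holds the extracted text, stored alongside the downloaded PDF
        print("- Getting PDF from cache")
        with open(documentName + ".txt", "r") as tf:
            text = tf.read()
    else:
        try:
            print("- Downloading PDF")
            document = requests.get(url)
            if document.ok:
                with open(documentName, "wb") as f:
                    f.write(document.content)
                # textract returns bytes; decode so that both branches return a str
                text = textract.process(os.path.join(os.getcwd(), documentName), encoding='ascii').decode("ascii")
                with open(documentName + ".txt", "w") as tf:
                    tf.write(text)
        except Exception as error:
            print("Error working with %s: %s" % (url, error))

    return text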


for index, row in docf.iterrows():
    if row['document_title'] in ['Project Information Document','Implementation Status and Results Report']:
        text = get_document(row['document_url'],row['iati_identifier'])
        docf.at[index,'text'] = text


- Getting PDF from cache


- Downloading PDF

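The cache file names above are built by stripping a URL prefix and replacing slashes. A possible alternative, sketched here under the assumption that readable file names are not needed, is to derive the name from a hash of the URL so that characters such as `?` or `:` never reach the file system. The helper name `cache_path` is hypothetical.

```python
import hashlib
from pathlib import Path


def cache_path(url, identifier, cache_dir="documentcache"):
    """Return a file-system-safe cache path for a document URL (hypothetical helper)."""
    # A short SHA-256 prefix is stable across runs and contains only hex digits.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{identifier}-{digest}.txt"


first = cache_path("http://www.worldbank.org/projects/P159692?lang=en", "44000-P159692")
again = cache_path("http://www.worldbank.org/projects/P159692?lang=en", "44000-P159692")
other = cache_path("http://example.org/other.pdf", "44000-P159692")

print(first == again)  # the same URL always maps to the same file
print(first != other)  # different URLs map to different files
```

The trade-off is that the original URL can no longer be read off the file name, so a lookup table would be needed to trace a cached file back to its source.
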
## Part 3: Search the data

We can now search the documents for key words relating to open data.

When we find a mention of a key phrase (such as 'open data' or 'online' or 'online access') we can investigate the document in question in more depth, and then decide whether or not this indicates the project has a focus on this. 
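
To support that closer reading, it can help to see each match with a little surrounding text. The following is a minimal keyword-in-context sketch using only the standard library. The function name `keyword_in_context` and the sample sentence are invented for illustration. Unlike `str.contains`, which treats the term as a regular expression by default, this version escapes the term and matches it literally.

```python
import re


def keyword_in_context(text, term, width=40):
    """Return each case-insensitive match of `term` with `width` characters either side."""
    snippets = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Collapse line breaks and repeated spaces left over from PDF extraction.
        snippets.append(" ".join(text[start:end].split()))
    return snippets


sample_text = "The registry will be available ONLINE to the public.\nOnline access fees apply."
snippets = keyword_in_context(sample_text, "online", width=15)
for snippet in snippets:
    print(snippet)
```

With the notebook's dataframe, this could be applied to the `text` column of the rows returned by a search, one document at a time.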

In [129]:
# 1. Check for mentions of 'public' (note: the search is case-insensitive)

def search(df,term):
    return df[df['text'].str.contains(term,case=False,na=False)]

search(docf,'public')['iati_identifier'].unique()

array(['44000-P132306', '44000-P154387', '44000-P122219', '44000-P121289',
       '44000-P066051', '44000-P090157', '44000-P107343', '44000-P106284',
       '44000-P083126', '44000-P126440', '44000-P118518', '44000-P123923',
       '44000-P082651', '44000-P073206', '44000-P096439', '44000-P096181',
       '44000-P096418', '44000-P101214'], dtype=object)

In [113]:
# 2. Check for Open Data

search(docf,'open data')['iati_identifier'].unique()

array([], dtype=object)

In [112]:
# 3. Check for online 
search(docf,'online')['iati_identifier'].unique()

array(['44000-P122219', '44000-P083126', '44000-P082651'], dtype=object)

In [130]:
# 4. Check for 'Internet'

search(docf,'Internet')['iati_identifier'].unique()


array(['44000-P122219', '44000-P106284', '44000-P083126', '44000-P082651',
       '44000-P096418'], dtype=object)

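## Part 4: Analysis

The brief analysis above finds no mention of the phrase 'open data' in any of the Project Information Documents or Implementation Status and Results Reports from the 66 World Bank projects that have a sector allocation of more than 20% to 'Land Administration and Management'. The terms 'online' and 'Internet' do appear, in documents from three and five projects respectively, and those documents are the natural starting point for closer reading.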



## Caveats

I still need to check whether older projects have documents attached, and whether all of the text is being loaded into my index, as it appears that not all projects show up even for basic search terms (e.g. 'and').
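
One way to start on this check is to count, per project, how many documents actually have searchable text. The sketch below works on plain records (dictionaries with `iati_identifier` and `text` keys, matching the column names used in this notebook); the function name `text_coverage` and the sample records are invented. Values that are not `str`, such as `bytes`, are counted as missing, on the assumption that string searches will not match them.

```python
from collections import defaultdict


def text_coverage(records):
    """Count documents with and without usable text for each project."""
    counts = defaultdict(lambda: {"with_text": 0, "without_text": 0})
    for record in records:
        text = record.get("text")
        # Only non-empty str values count as searchable text.
        has_text = isinstance(text, str) and text.strip() != ""
        counts[record["iati_identifier"]]["with_text" if has_text else "without_text"] += 1
    return dict(counts)


# Invented records: one with text, one holding bytes, one empty.
sample_records = [
    {"iati_identifier": "44000-EXAMPLE-A", "text": "Cadastral records will be online."},
    {"iati_identifier": "44000-EXAMPLE-A", "text": b"raw bytes from extraction"},
    {"iati_identifier": "44000-EXAMPLE-B", "text": ""},
]

coverage = text_coverage(sample_records)
print(coverage)
```

With the notebook's dataframe, the records could be produced with `docf.to_dict("records")`. Projects whose `with_text` count is zero would explain why they never appear in search results.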