## Methodology for ANDE Report on Donor Contributions to SGBs   

This notebook is intended to demonstrate the methodology used by the Devex Analytics team to arrive at total figures for funding channeled by 20 different donor institutions to the "thematic area" of *entrepreneurship and small-and-growing businesses*. 

The methodology uses data from the International Aid Transparency Initiative (IATI). The figures in the Devex report are the sum of the totals produced by this methodology and other figures found through additional desk research, so this notebook does not represent the complete data collection process. However, the code and descriptions here illustrate the process followed for IATI data and the human decisions made about how to analyse that data. For donors that do *not* consistently contribute quality data to IATI, additional collection, processing, and analysis will be required.

In [1]:
# Importing the needed libraries
import pandas as pd
import numpy as np

Let's import the data. First, note that an IATI activities datafile generally contains 56 columns. We only want some of them:

In [6]:
iati_fields = ['iati-identifier', 'reporting-org', 'default-language', 'title', 'description', 'start-planned', 'end-planned',
               'start-actual', 'end-actual', 'recipient-country-code', 'recipient-country', 'recipient-country-percentage',
               'sector', 'sector-code', 'sector-percentage', 'sector-vocabulary', 'sector-vocabulary-code', 'default-currency',
               'total-Commitment', 'total-Disbursement', 'total-Expenditure']
date_fields = ['start-planned','end-planned','start-actual','end-actual']

We'll use this list to import only the 21 columns relevant to our analysis.

I've downloaded three large CSV files from donors who are good IATI contributors: the World Bank, DFID, and Sida. In the future, we'll want to replace these downloads with a request to IATI's API.

In [7]:
# NB! Replace the read_csv commands with IATI API query in the future
wbg_raw = pd.read_csv('WBG_IATI_Activities_20190307.csv', low_memory=False, usecols=iati_fields, parse_dates=date_fields)
dfid_raw = pd.read_csv('DFID_IATI_Activities_20190307.csv', low_memory=False, usecols=iati_fields, parse_dates=date_fields)
sida_raw = pd.read_csv('SIDA_IATI_Activities_20190307.csv', low_memory=False, usecols=iati_fields, parse_dates=date_fields)
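
Since the three calls share the same arguments, they could be wrapped in a small helper. The sketch below is one possible shape, not the code used for the report. The name `load_iati_csv` and the in-memory sample file are hypothetical and introduced only for illustration. It shows that `usecols` drops columns that are not requested and that `parse_dates` converts the named columns to datetimes.

```python
import io

import pandas as pd


def load_iati_csv(source, fields, dates):
    """Hypothetical helper: read one IATI activities CSV with shared settings."""
    return pd.read_csv(source, low_memory=False, usecols=fields, parse_dates=dates)


# Made-up, in-memory stand-in for a real IATI activities file.
sample_file = io.StringIO(
    "iati-identifier,title,start-planned,unused-column\n"
    "XX-1,Example activity A,2018-01-15,foo\n"
    "XX-2,Example activity B,2019-03-07,bar\n"
)

sample = load_iati_csv(
    sample_file,
    fields=['iati-identifier', 'title', 'start-planned'],
    dates=['start-planned'],
)
print(sample.dtypes)
```

With the real files, the same helper would be called once per donor file, passing `iati_fields` and `date_fields`.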

Next we put the data together in a single Pandas dataframe.

In [9]:
data = pd.concat([wbg_raw,dfid_raw,sida_raw], ignore_index=True, sort=False)
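
A quick sanity check after concatenating is to confirm that no rows were lost and that `ignore_index=True` produced a fresh 0-based index. The sketch below uses two made-up toy frames (`part_a`, `part_b`) rather than the real donor data.

```python
import pandas as pd

# Toy frames standing in for the per-donor dataframes.
part_a = pd.DataFrame({'reporting-org': ['A', 'A'], 'total-Commitment': [1.0, 2.0]})
part_b = pd.DataFrame({'reporting-org': ['B'], 'total-Commitment': [3.0]})

combined = pd.concat([part_a, part_b], ignore_index=True, sort=False)

# Row count should equal the sum of the parts.
rows_match = len(combined) == len(part_a) + len(part_b)
# ignore_index=True discards the original indexes and renumbers from 0.
index_is_fresh = combined.index.tolist() == list(range(len(combined)))
print(rows_match, index_is_fresh)
```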

Now let's see how many rows each donor contributed to the combined data:

In [16]:
data.groupby(['reporting-org'])['iati-identifier'].count()

reporting-org
Department for International Development    18258
Sweden                                      85131
World Bank                                   3028
World Bank Group                               30
Name: iati-identifier, dtype: int64
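
One caveat about this kind of tally: `count()` only counts non-null values in the selected column, so any rows with a missing `iati-identifier` would not appear in the totals. Whether the real data contains such rows is not checked here. The toy example below (made-up values) contrasts `count()` with `size()`, which counts every row in each group.

```python
import pandas as pd

# Toy data: one of the two rows for org 'A' has no identifier.
toy = pd.DataFrame({
    'reporting-org': ['A', 'A', 'B'],
    'iati-identifier': ['A-1', None, 'B-1'],
})

non_null_counts = toy.groupby('reporting-org')['iati-identifier'].count()
row_counts = toy.groupby('reporting-org').size()

print(non_null_counts)
print(row_counts)
```

If the two results differ for a donor, that donor has rows without an identifier.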

Sida's rows are labeled "Sweden". Let's change that to *Swedish International Development Agency (Sida)* for clarity's sake.

There's also a curious division between 30 rows from the *World Bank Group* and the other 3028 from *World Bank*. Let's compare the two to see if there are differences and what we want to do about them. The most important fields for our analysis will be the dates, countries, sectors, and funding commitments, so we'll look at them.
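
As a rough illustration of such a comparison, the sketch below summarises the range and coverage of two date fields for each label. It runs on made-up toy rows and is only one possible approach, not necessarily the one followed in this notebook.

```python
import pandas as pd

# Toy rows standing in for the World Bank records; all values are invented.
toy = pd.DataFrame({
    'reporting-org': ['World Bank', 'World Bank', 'World Bank Group'],
    'start-actual': pd.to_datetime(['2010-01-01', '2015-06-30', '2018-02-01']),
    'end-actual': pd.to_datetime(['2012-01-01', None, '2019-01-01']),
})

wb_orgs = ['World Bank', 'World Bank Group']

# Earliest date, latest date, and number of non-null dates per label.
date_ranges = (
    toy[toy['reporting-org'].isin(wb_orgs)]
    .groupby('reporting-org')[['start-actual', 'end-actual']]
    .agg(['min', 'max', 'count'])
)
print(date_ranges)
```

The same pattern could be applied to the country, sector, and commitment fields by swapping in the relevant columns and aggregations.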

In [22]:
# Rename 'Sweden' to 'Swedish International Development Agency (Sida)'
data.loc[data['reporting-org'] == 'Sweden', 'reporting-org'] = 'Swedish International Development Agency (Sida)'

for field in date_fields:
    print()