## Methodology for ANDE Report on Donor Contributions to SGBs   

## 1. Preparing the data

This notebook is intended to demonstrate the methodology used by the Devex Analytics team to arrive at total figures for funding channeled by 20 different donor institutions to the "thematic area" of *entrepreneurship and small-and-growing businesses*. 

The methodology uses data from the International Aid Transparency Initiative (IATI). The final figures in the Devex report summed the figures from this methodology, as well as other figures found through additional desk research. Thus this methodology does not represent a complete data collection process. However, the code and descriptions here illustrate well the process followed for IATI data, and the human decisions made about how to analyse that data. 

Note that some donors are consistent, high-quality contributors of IATI data, while others contribute less frequently or less detailed data, and others do not contribute data at all to IATI. For donors that do *not* consistently contribute quality data to IATI, additional collection, processing, and analysis will be required.

In [1]:
# Importing the needed libraries
import pandas as pd
import numpy as np

Let's import the data. An IATI activities data file generally has 56 columns; we only want some of them:

In [2]:
iati_fields = ['iati-identifier', 'reporting-org', 'default-language', 'title', 'description', 'start-planned', 'end-planned',
               'start-actual', 'end-actual', 'recipient-country-code', 'recipient-country', 'recipient-country-percentage',
               'sector', 'sector-code', 'sector-percentage', 'sector-vocabulary', 'sector-vocabulary-code', 'default-currency',
               'total-Commitment', 'total-Disbursement', 'total-Expenditure']
date_fields = ['start-planned','end-planned','start-actual','end-actual']

We'll use this list to import only the 21 columns relevant to our analysis.

I've downloaded three large CSV files from donors who are good IATI contributors - the World Bank, DFID, and Sida. In the future, we'll want to replace this with a request to IATI's API.

In [3]:
# NB! Replace the read_csv commands with IATI API query in the future
wbg_raw = pd.read_csv('WBG_IATI_Activities_20190307.csv', low_memory=False, usecols=iati_fields, parse_dates=date_fields)
dfid_raw = pd.read_csv('DFID_IATI_Activities_20190307.csv', low_memory=False, usecols=iati_fields, parse_dates=date_fields)
sida_raw = pd.read_csv('SIDA_IATI_Activities_20190307.csv', low_memory=False, usecols=iati_fields, parse_dates=date_fields)

Next we put the data together in a single Pandas dataframe.

In [4]:
data = pd.concat([wbg_raw, dfid_raw, sida_raw], ignore_index=True, sort=False)

Now let's see what the data looks like by counting how many rows each donor contributed:

In [5]:
data.groupby(['reporting-org'])['iati-identifier'].count()

reporting-org
Department for International Development    18258
Sweden                                      85131
World Bank                                   3028
World Bank Group                               30
Name: iati-identifier, dtype: int64

Sida's rows are referred to with "Sweden". Let's change that to *Swedish International Development Agency (Sida)* for clarity's sake.

In [6]:
# Rename 'Sweden' to 'Swedish International Development Agency (Sida)'
data.loc[data['reporting-org'] == 'Sweden', 'reporting-org'] = 'Swedish International Development Agency (Sida)'
# Let's check that it worked.
data.groupby(['reporting-org'])['iati-identifier'].count()

reporting-org
Department for International Development           18258
Swedish International Development Agency (Sida)    85131
World Bank                                          3028
World Bank Group                                      30
Name: iati-identifier, dtype: int64

It worked. There's also a curious division between 30 rows from the *World Bank Group* and the other 3028 from *World Bank*. Let's compare the two to see if there are differences and what we want to do about them. The most important fields for our analysis will be the dates, countries, sectors, and funding commitments, so we'll look at them.

In particular, for this methodology to work well, we'll need rows that denote IATI activities tagged with OECD sector vocabulary tags. Since there are only 30 *World Bank Group* rows, we can quickly look at all of them.

In [7]:
wb_condition = data['reporting-org'] == 'World Bank'
wbg_condition = data['reporting-org'] == 'World Bank Group'
wb_rows = data[wb_condition].reset_index()
wbg_rows = data[wbg_condition].reset_index()

In [8]:
wbg_rows.loc[:, 'sector-code':'sector-vocabulary-code']

Unnamed: 0,sector-code,sector,sector-percentage,sector-vocabulary,sector-vocabulary-code
0,FY,,100,,WBSector
1,TH,,100,,WBSector
2,AC,,100,,WBSector
3,AI,,100,,WBSector
4,AL,,100,,WBSector
5,TY,,100,,WBSector
6,TH,,100,,WBSector
7,AA,,100,,WBSector
8,WW,,100,,WBSector
9,AL,,100,,WBSector


Looks like the first 15 rows have no sector data, which is of no use to us. The last 15 rows have only WBSector or WBTheme as a sector vocabulary, which is not terribly useful to us at this point. We can drop these 30 rows from our data.

In [9]:
data.drop(data[data['reporting-org'] == 'World Bank Group'].index, inplace=True)
data.groupby(['reporting-org'])['iati-identifier'].count()

reporting-org
Department for International Development           18258
Swedish International Development Agency (Sida)    85131
World Bank                                          3028
Name: iati-identifier, dtype: int64

Great. We will also want to remove any other rows that don't contain OECD as a sector vocabulary. Let's have a look at a few example rows from each of the three donors to see if they all have a consistent manner of encoding OECD sectors (they should, according to the IATI standard). 

In [10]:
sida_condition = data['reporting-org'] == 'Swedish International Development Agency (Sida)'
sida_rows = data[sida_condition].reset_index()
dfid_condition = data['reporting-org'] == 'Department for International Development'
dfid_rows = data[dfid_condition].reset_index()

In [11]:
wb_rows.loc[0:5, 'sector-code':'sector-vocabulary-code']

Unnamed: 0,sector-code,sector,sector-percentage,sector-vocabulary,sector-vocabulary-code
0,25010;24030;15111;24010;YZ;FL;FA;FD;BC;000021;...,Business support services and institutions;For...,37;25;25;13;38;4;29;4;25;13;13;13;13;13;13;13;...,;;;;;;;;;;;;;;;;;;;;;,1;1;1;1;99;99;99;99;99;98;98;98;98;98;98;98;98...
1,15114;11110;32161;32130;24030;24010;23010;1511...,;Education policy and administrative managemen...,22;12;11;11;11;11;11;11;11;34;22;11;22;11;22;1...,;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;,1;1;1;1;1;1;1;1;99;99;99;99;99;98;98;98;98;98;...
2,13040;SA;HG;HF;000621;000051;000513;000062,STD control including HIV/AIDS;;;;;;;,100;7;58;35;85;15;15;85,;;;;;;;,1;99;99;99;98;98;98;98
3,21010;21020;43030;TI;TC;TF;000072;000711;00001...,Transport policy and administrative management...,14;64;22;66;22;12;12;37;38;37;10;12;12;12;38,;;;;;;;;;;;;;;,1;1;1;99;99;99;98;98;98;98;98;98;98;98;98
4,14010;14021;23010;23030;23040;LT;LM;LP;000221;...,Water resources policy and administrative mana...,2;45;2;5;46;91;3;6;25;1;47;25;10;1;25;47;1;25,;;;;;;;;;;;;;;;;;,1;1;1;1;1;99;99;99;98;98;98;98;98;98;98;98;98;98
5,14021;14022;WA;WC;WF;000023;000822;000072;0000...,Water supply - large systems;Sanitation - larg...,79;21;14;72;14;28;28;40;28;15;60;100;15;100;60...,;;;;;;;;;;;;;;;;;,1;1;99;99;99;98;98;98;98;98;98;98;98;98;98;98;...


It appears that the World Bank rows have no data in *sector-vocabulary*, but have a semicolon-delimited list of sector vocabulary codes in the *sector-vocabulary-code* field. Interestingly, the Bank tags its activities with both OECD and WB sector vocabularies:

<p>'1' indicates an OECD vocabulary sector.
<p>'99' indicates a WB vocabulary sector.
<p>'98' indicates a WB vocabulary theme.

More info on sector vocabulary codes and their meanings can be found in IATI's codelists. For our purposes, we are interested in OECD sectors. We'll come back to this later. Let's check the Sida and DFID data.
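To make the semicolon-delimited layout easier to reason about, here is a minimal sketch that splits one activity into one row per sector tag. It assumes that *sector-code*, *sector-percentage*, and *sector-vocabulary-code* are positionally aligned, which the World Bank rows above suggest but which is not verified here. The helper name `explode_sectors` and the toy values are hypothetical.

```python
import pandas as pd

# Toy row shaped like the World Bank data above (values are illustrative only)
toy = pd.DataFrame({
    'iati-identifier': ['EXAMPLE-1'],
    'sector-code': ['25010;24030;YZ;000021'],
    'sector-percentage': ['60;40;100;100'],
    'sector-vocabulary-code': ['1;1;99;98'],
})

def explode_sectors(df):
    """Return one row per (activity, sector tag) from the semicolon-delimited fields."""
    records = []
    for _, row in df.iterrows():
        codes = str(row['sector-code']).split(';')
        shares = str(row['sector-percentage']).split(';')
        vocabs = str(row['sector-vocabulary-code']).split(';')
        # Assumption: the three lists line up by position.
        # zip stops at the shortest list, so ragged rows are truncated.
        for code, share, vocab in zip(codes, shares, vocabs):
            records.append({
                'iati-identifier': row['iati-identifier'],
                'sector-code': code,
                'sector-percentage': float(share) if share else float('nan'),
                'sector-vocabulary-code': vocab,
            })
    return pd.DataFrame(records)

tags = explode_sectors(toy)
# Keep only tags from the OECD vocabulary (code '1' in this encoding)
oecd_tags = tags[tags['sector-vocabulary-code'] == '1']
print(oecd_tags)
```

In this long format, each tag carries its own vocabulary and percentage, so the OECD tags can be selected with an exact comparison.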

In [12]:
sida_rows.loc[0:5, 'sector-code':'sector-vocabulary-code']

Unnamed: 0,sector-code,sector,sector-percentage,sector-vocabulary,sector-vocabulary-code
0,72010,Material relief assistance and services,100,,1
1,72010,Material relief assistance and services,100,,1
2,72010,Material relief assistance and services,100,,1
3,13040,STD control including HIV/AIDS,100,,1
4,15150,Democratic participation and civil society,100,,1
5,43040,Rural development,100,,1


Sida's data is simple - one OECD sector per activity. 

In [13]:
dfid_rows.loc[0:5, 'sector-code':'sector-vocabulary-code']

Unnamed: 0,sector-code,sector,sector-percentage,sector-vocabulary,sector-vocabulary-code
0,11330,Vocational training,,OECD Development Assistance Committee,DAC
1,72010,Material relief assistance and services,,OECD Development Assistance Committee,DAC
2,11330,Vocational training,,OECD Development Assistance Committee,DAC
3,11330,Vocational training,,OECD Development Assistance Committee,DAC
4,21020;21010,Road transport;Transport policy and administra...,52;47,OECD Development Assistance Committee;OECD Dev...,DAC;DAC
5,21010,Transport policy and administrative management,,OECD Development Assistance Committee,DAC


DFID's data is also fairly simple - though row 4 demonstrates that some activities have percentage splits between more than one OECD sector. 

## 2. Filtering for relevant OECD Sectors

The methodology is based on a simple concept that can be customized and extended. In this notebook, we'll illustrate a simplistic but easy-to-implement approach.

The methodology identifies OECD DAC CRS 5-digit sectors of relevance to the topic of interest, and filters IATI data for these sectors. This generally works well, as nearly all IATI contributors tag their IATI activities data with the OECD sectors. As a reminder, in this analysis the topic of interest was "*entrepreneurship and small-and-growing businesses*". There are several OECD sectors that approximate this idea, but we'll pick one: *32130 - Small and medium enterprises (SME) development*. 

First let's remove any rows that don't contain a reference to the OECD sector vocabulary:

In [14]:
wb_oecd_rows = wb_rows[wb_rows['sector-vocabulary-code'].str.contains("1", na=False)]
dfid_oecd_rows = dfid_rows[dfid_rows['sector-vocabulary-code'].str.contains("DAC", na=False)]
sida_oecd_rows = sida_rows[sida_rows['sector-vocabulary-code'].str.contains("1", na=False)]

In [15]:
data = pd.concat([wb_oecd_rows, dfid_oecd_rows, sida_oecd_rows])
data.shape

(106297, 22)

Great, now our *data* dataframe only contains rows that contain at least one OECD sector tag. We still have the majority of the data, since all three donors use OECD sector tags quite consistently.

Now we follow a similar process to reduce the rows to only those that contain reference to OECD sector 32130:

In [16]:
sector_of_interest = '32130'
data = data[data['sector-code'].str.contains(sector_of_interest, na=False)]
data.shape

(1267, 22)

Now we can see that the amount of data is greatly reduced - just over 1% of our original data remains. 
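One caveat: `str.contains` matches substrings, so `"1"` also matches any vocabulary code that merely contains the digit 1, and `'32130'` would match a longer code containing those digits. The sketch below shows a stricter, token-by-token check. Whether it would change the row counts for the real data is not known; the helper name `has_oecd_sector` and the toy rows are hypothetical.

```python
import pandas as pd

# The two encodings of the OECD vocabulary seen in the data above
OECD_VOCABS = {'1', 'DAC'}

def has_oecd_sector(codes, vocabs, target='32130'):
    """True if `target` appears as a whole sector code tagged with an OECD vocabulary."""
    if pd.isnull(codes) or pd.isnull(vocabs):
        return False
    # Assumption: codes and vocabulary codes are aligned by position
    pairs = zip(str(codes).split(';'), str(vocabs).split(';'))
    return any(code == target and vocab in OECD_VOCABS for code, vocab in pairs)

toy = pd.DataFrame({
    'sector-code': ['32130;YZ', '11330', '132130', None],
    'sector-vocabulary-code': ['1;99', 'DAC', '10', '1'],
})
mask = [has_oecd_sector(c, v)
        for c, v in zip(toy['sector-code'], toy['sector-vocabulary-code'])]
print(toy[mask])
```

Only the first toy row passes: the third row would have been caught by both substring filters even though it carries neither the OECD vocabulary nor sector 32130.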

We can even have a quick, dirty look at the sums of funding commitments to this OECD sector from each of the three donors - though this comparison isn't necessarily valid yet, for reasons we'll see...

In [17]:
data['commitment-in-billions'] = data['total-Commitment'] / 1e9
data.groupby(['reporting-org'])['commitment-in-billions'].sum()

reporting-org
Department for International Development            6.032279
Swedish International Development Agency (Sida)     0.341563
World Bank                                         16.938528
Name: commitment-in-billions, dtype: float64

Interesting. Keep in mind that the three donors use different currencies - GBP, SEK, and USD, respectively. So across all of DFID's IATI data tagged with OECD sector 32130 there are funding commitments of 6.03 billion GBP. Similarly, there are 341 million SEK for Sida and 16.9 billion USD for the World Bank.

However, we are comparing totals over different time periods, which isn't valid. Additionally, some activities' funding commitments are only *partially* committed to the sector of interest. We need to restrict the data to a time window of analysis, and account for any partial funding commitments.
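As an illustration of what accounting for partial commitments could look like, the sketch below scales each activity's *total-Commitment* by the percentage attached to sector 32130. It assumes that codes and percentages are aligned by position and that a missing percentage on a single-sector activity means 100%. Both are assumptions, and the helper `sector_share` and the column *sector-Commitment* are hypothetical names.

```python
import pandas as pd

def sector_share(codes, percentages, target='32130'):
    """Fraction (0 to 1) of an activity's commitment attributed to `target`."""
    code_list = str(codes).split(';')
    if target not in code_list:
        return 0.0
    # Assumption: no percentage on a single-sector activity means 100%
    if pd.isnull(percentages) or str(percentages) == '':
        return 1.0 if len(code_list) == 1 else float('nan')
    pct_list = str(percentages).split(';')
    idx = code_list.index(target)
    if idx >= len(pct_list) or pct_list[idx] == '':
        return float('nan')  # cannot tell; leave for manual review
    return float(pct_list[idx]) / 100

toy = pd.DataFrame({
    'sector-code': ['32130', '21020;32130', '32130;25010'],
    'sector-percentage': [None, '52;48', None],
    'total-Commitment': [1000.0, 2000.0, 500.0],
})
toy['share'] = [sector_share(c, p)
                for c, p in zip(toy['sector-code'], toy['sector-percentage'])]
toy['sector-Commitment'] = toy['total-Commitment'] * toy['share']
print(toy)
```

The third toy row has two sectors but no percentages, so its share is left as NaN rather than guessed.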

Last thing - to keep the data clean, we'll delete the *commitment-in-billions* field we just created.

In [18]:
# drop=True discards the old index instead of adding it as an extra column
data = data.drop(['commitment-in-billions'], axis=1).reset_index(drop=True)

## 3. Choosing a timeframe for analysis and scaling the commitments

We want to compare funding committed to IATI activities in our area of interest for a single year. IATI activities have *total-Commitment* as a field that indicates the total amount of funding committed by a donor to that activity over its life. However, activities frequently last more than one year, and may begin in the middle of one year and end in the middle of another.

Thus, we need to determine the 'average commitment per year' of each activity, and determine whether or not it was active (and for how long, in years) during our timeframe of interest.
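As a rough sketch of the "active for how long" idea, the function below measures how many years of an activity fall inside a window of analysis. The window used here (calendar year 2018) and the function name `years_active_in_window` are hypothetical choices for illustration, not the ones used in the report.

```python
import pandas as pd

def years_active_in_window(start, end, window_start, window_end):
    """Years (possibly fractional) for which [start, end] overlaps the window."""
    if pd.isnull(start) or pd.isnull(end):
        return float('nan')
    overlap_start = max(start, window_start)
    overlap_end = min(end, window_end)
    overlap_days = (overlap_end - overlap_start).days
    # A negative overlap means the activity lies entirely outside the window
    return max(overlap_days, 0) / 365.25

# Hypothetical window: calendar year 2018
window = (pd.Timestamp('2018-01-01'), pd.Timestamp('2019-01-01'))

# Activity running mid-2017 to mid-2018: active for the first half of the window
a = years_active_in_window(pd.Timestamp('2017-07-01'), pd.Timestamp('2018-07-01'), *window)
# Activity starting after the window closes: no overlap
b = years_active_in_window(pd.Timestamp('2020-01-01'), pd.Timestamp('2021-01-01'), *window)
print(a, b)
```

Multiplying this overlap by an activity's average commitment per year would give the amount attributed to the window.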

To this end, we have four date fields in the data: the planned start and end dates of each activity, and the actual start and end dates of each activity. However, many of these values are missing.

In [19]:
data.loc[0:5, 'start-planned':'end-actual']

Unnamed: 0,start-planned,end-planned,start-actual,end-actual
0,2018-02-28,2019-02-28,2018-03-30,NaT
1,2015-09-22,2022-12-31,2017-09-28,NaT
2,2012-03-29,2019-08-30,2013-06-11,NaT
3,2017-10-31,2023-12-28,2018-07-06,NaT
4,2010-05-27,2017-12-31,2010-12-16,2017-12-31
5,2017-06-15,2022-07-31,2017-07-13,NaT


To calculate the length in years of each activity, we'll need to check which of these fields contain data. To do so, we'll run a quick IF - THEN analysis of each activity, and calculate the difference in years accordingly:

        IF (start-actual exists AND end-actual exists)
        THEN difference = (end-actual - start-actual), then divide by 365.25 to arrive at years diff
        ELIF (start-actual exists AND end-planned exists)
        THEN difference = (end-planned - start-actual), then divide by 365.25 to arrive at years diff
        ELIF (start-planned exists AND end-planned exists)
        THEN difference = (end-planned - start-planned), then divide by 365.25 to arrive at years diff
        ELSE leave difference blank for now
        
To apply this logic to our data, we'll write a function to determine which IF scenario applies, calculate the difference in days, and then convert to a difference in years.

In [42]:
def dateDiff(dataFrame):
    
    for i in dataFrame.index:
        
        # both end-actual and start-actual are not null - i.e. they are dates
        if not pd.isnull(dataFrame.at[i,'end-actual']) and not pd.isnull(dataFrame.at[i,'start-actual']):
            dataFrame.at[i,'date-diff-days'] = dataFrame.at[i,'end-actual'] - dataFrame.at[i,'start-actual']
            # if less than 365.25 days, round the duration in years to 1
            if dataFrame.at[i,'date-diff-days'].days / 365.25 > 1:
                dataFrame.at[i,'date-diff-years'] = dataFrame.at[i,'date-diff-days'].days / 365.25
            else:
                dataFrame.at[i,'date-diff-years'] = 1
        
        # both end-planned and start-actual are not null - i.e. they are dates
        elif not pd.isnull(dataFrame.at[i,'end-planned']) and not pd.isnull(dataFrame.at[i,'start-actual']):
            dataFrame.at[i,'date-diff-days'] = dataFrame.at[i,'end-planned'] - dataFrame.at[i,'start-actual']
            # if less than 365.25 days, round the duration in years to 1
            if dataFrame.at[i,'date-diff-days'].days / 365.25 > 1:
                dataFrame.at[i,'date-diff-years'] = dataFrame.at[i,'date-diff-days'].days / 365.25
            else:
                dataFrame.at[i,'date-diff-years'] = 1
            
        # both end-planned and start-planned are not null - i.e. they are dates
        elif not pd.isnull(dataFrame.at[i,'end-planned']) and not pd.isnull(dataFrame.at[i,'start-planned']):
            dataFrame.at[i,'date-diff-days'] = dataFrame.at[i,'end-planned'] - dataFrame.at[i,'start-planned']
            # if less than 365.25 days, round the duration in years to 1
            if dataFrame.at[i,'date-diff-days'].days / 365.25 > 1:
                dataFrame.at[i,'date-diff-years'] = dataFrame.at[i,'date-diff-days'].days / 365.25
            else:
                dataFrame.at[i,'date-diff-years'] = 1
        
        # otherwise, not enough info. Need to impute later
        else:
            dataFrame.at[i,'date-diff-days'] = pd.NaT
            dataFrame.at[i,'date-diff-years'] = np.nan

    return dataFrame

data = dateDiff(data)

What does this function do? It evaluates which start and end dates are available. For rows where a usable pair of dates *is* available, it calculates the difference between the two dates in days, then in years, counting any duration of one year or less as one year. For rows without this data, these fields are left blank.
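The same IF - THEN logic can also be written without a row-by-row loop. The following is a sketch of a vectorized equivalent, not the function used in the report. The name `date_diff_years` and the toy dates are hypothetical.

```python
import pandas as pd

def date_diff_years(df):
    """Duration in years, using the first available start/end pair per row."""
    # Candidate durations, in order of preference; each is NaT if a date is missing
    actual = df['end-actual'] - df['start-actual']
    mixed = df['end-planned'] - df['start-actual']
    planned = df['end-planned'] - df['start-planned']
    days = actual.fillna(mixed).fillna(planned).dt.days
    years = days / 365.25
    # Durations of one year or less count as one year; missing values stay NaN
    return years.where(years.isna() | (years > 1), 1.0)

toy = pd.DataFrame({
    'start-planned': pd.to_datetime(['2018-02-28', '2015-09-22', None]),
    'end-planned':   pd.to_datetime(['2019-02-28', '2022-12-31', None]),
    'start-actual':  pd.to_datetime(['2018-03-30', None, None]),
    'end-actual':    pd.to_datetime([None, None, None]),
})
toy['date-diff-years'] = date_diff_years(toy)
print(toy['date-diff-years'])
```

The first toy row uses the actual start with the planned end, the second falls back to the planned dates, and the third has no usable pair and stays blank.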

Let's delete the *date-diff-days* column, as we won't need it anymore, and then let's check how many blank *date-diff-years* rows exist in our data.

In [43]:
data = data.drop(['date-diff-days'], axis=1)
data['date-diff-years'].isna().sum()

1

So there is still 1 row without a value in *date-diff-years*. We will need to estimate any such missing values - what we'll do is assume that for each donor, the average *date-diff-years* value is a good guess for the value missing in that donor's blank rows.

In [44]:
data_temp = data[~pd.isnull(data['date-diff-years'])]
# Look each donor's average up by name rather than by position in the grouped result
years_avg = data_temp.groupby(['reporting-org'])['date-diff-years'].mean()
dfid_years_avg = years_avg['Department for International Development']         # avg duration of DFID activities
sida_years_avg = years_avg['Swedish International Development Agency (Sida)']  # avg duration of Sida activities
wbg_years_avg = years_avg['World Bank']                                        # avg duration of WBG activities

for i in data.index:
    if pd.isnull(data.at[i,'date-diff-years']):
        if data.at[i,'reporting-org'] == 'Department for International Development':
            data.at[i,'date-diff-years'] = dfid_years_avg
        elif data.at[i,'reporting-org'] == 'Swedish International Development Agency (Sida)':
            data.at[i,'date-diff-years'] = sida_years_avg
        elif data.at[i,'reporting-org'] == 'World Bank':
            data.at[i,'date-diff-years'] = wbg_years_avg
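For reference, the same per-donor imputation can be expressed with `groupby(...).transform('mean')`, which avoids naming each donor explicitly. This is a sketch on toy data; the donor name "Donor B" is made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'reporting-org': ['World Bank', 'World Bank', 'World Bank', 'Donor B', 'Donor B'],
    'date-diff-years': [2.0, 4.0, np.nan, 5.0, np.nan],
})

# Per-row average duration of that row's donor (NaN values are skipped by mean)
donor_avg = toy.groupby('reporting-org')['date-diff-years'].transform('mean')

# Fill only the missing durations with the donor average
toy['date-diff-years'] = toy['date-diff-years'].fillna(donor_avg)
print(toy)
```

This approach also keeps working unchanged if more donors are added to the data.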

Now that there are no more rows without a *date-diff-years*, we can use this column to calculate each activity's *annual-Commitment*.

The annual commitment is the total commitment to the project (*total-Commitment*), divided by its duration in years (*date-diff-years*). While in reality, activities' budgets are not spent uniformly over the course of the activity, this is an easy approximation that will make it easier to compare donors' commitments for certain, specific time frames of interest.

In [45]:
# Divide column-wise so that every row gets an annual commitment
data['annual-Commitment'] = data['total-Commitment'] / data['date-diff-years']

In [49]:
data.groupby(['reporting-org'])['annual-Commitment'].sum()

reporting-org
Department for International Development                    inf
Swedish International Development Agency (Sida)    3.364024e+08
World Bank                                         1.723712e+10
Name: annual-Commitment, dtype: float64
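The `inf` for DFID means that at least one DFID row has a non-finite *annual-Commitment*, for example from a zero duration or a non-finite *total-Commitment*. The cause in the real data is not established here. The sketch below shows, on toy data, one way to locate such rows and to sum only the finite values.

```python
import numpy as np
import pandas as pd

# Toy data: the second row has a zero duration, which produces inf on division
toy = pd.DataFrame({
    'reporting-org': ['A', 'A', 'B'],
    'total-Commitment': [100.0, 50.0, 30.0],
    'date-diff-years': [2.0, 0.0, 3.0],
})
toy['annual-Commitment'] = toy['total-Commitment'] / toy['date-diff-years']

# Rows whose annual commitment is inf or NaN, for manual inspection
finite = np.isfinite(toy['annual-Commitment'])
bad = toy[~finite]

# Sum per donor over the finite rows only
finite_sum = toy[finite].groupby('reporting-org')['annual-Commitment'].sum()
print(bad)
print(finite_sum)
```

Excluding rows silently would understate a donor's total, so the flagged rows are better inspected than dropped.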