## Methodology for ANDE Report on Donor Contributions to SGBs   

## 1. Preparing the data

This notebook is intended to demonstrate the methodology used by the Devex Analytics team to arrive at total figures for funding channeled by 20 different donor institutions to the "thematic area" of *entrepreneurship and small-and-growing businesses*. 

The methodology uses data from the International Aid Transparency Initiative (IATI). The final figures in the Devex report combined the results of this methodology with other figures found through additional desk research, so this methodology does not represent a complete data collection process. However, the code and descriptions here illustrate the process followed for IATI data, and the human decisions made about how to analyse that data. 

Note that some donors are consistent, high-quality contributors of IATI data, while others contribute data less frequently or in less detail, and others do not contribute data to IATI at all. For donors that do *not* consistently contribute quality data to IATI, additional collection, processing, and analysis will be required.

In [3]:
# Importing the needed libraries
import pandas as pd
import numpy as np

Let's import the data. First - there are generally 56 columns of data in an IATI activities datafile. We only want some of them:

In [4]:
iati_fields = ['iati-identifier','reporting-org','default-language', 'title','description','start-planned','end-planned',\
               'start-actual', 'end-actual','recipient-country-code','recipient-country', 'recipient-country-percentage',\
               'sector','sector-code', 'sector-percentage','sector-vocabulary','sector-vocabulary-code', 'default-currency',\
               'total-Commitment','total-Disbursement','total-Expenditure']
date_fields = ['start-planned','end-planned','start-actual','end-actual']

We'll use this to only import 21 columns relevant to our analysis.

I've downloaded three large csv files from donors who are good IATI contributors - the World Bank, DFID, and Sida. In the future, we'll want to replace this with a request to IATI's API. 

In [5]:
# NB! Replace the read_csv commands with IATI API query in the future
wbg_raw = pd.read_csv('WBG_IATI_Activities_20190315.csv', low_memory=False, usecols=iati_fields, parse_dates=date_fields)
dfid_raw = pd.read_csv('DFID_IATI_Activities_20190315.csv', low_memory=False, usecols=iati_fields, parse_dates=date_fields)
sida_raw = pd.read_csv('SIDA_IATI_Activities_20190315.csv', low_memory=False, usecols=iati_fields, parse_dates=date_fields)

Next we put the data together in a single Pandas dataframe.

In [6]:
data = pd.concat([wbg_raw, dfid_raw, sida_raw], ignore_index=True, sort=False)

Now let's see what the data looks like by counting how many rows each donor contributed:

In [7]:
data.groupby(['reporting-org'])['iati-identifier'].count()

reporting-org
Department for International Development    18353
Sweden                                      85601
World Bank                                   3035
World Bank Group                               30
Name: iati-identifier, dtype: int64

Sida's rows are referred to with "Sweden". Let's change that to *Swedish International Development Agency (Sida)* for clarity's sake.

In [8]:
# Rename 'Sweden' to 'Swedish International Development Agency (Sida)'
data.loc[data['reporting-org'] == 'Sweden', 'reporting-org'] = 'Swedish International Development Agency (Sida)'
# Let's check that it worked.
data.groupby(['reporting-org'])['iati-identifier'].count()

reporting-org
Department for International Development           18353
Swedish International Development Agency (Sida)    85601
World Bank                                          3035
World Bank Group                                      30
Name: iati-identifier, dtype: int64

It worked. There's also a curious division between 30 rows from the *World Bank Group* and the other 3035 from *World Bank*. Let's compare the two to see if there are differences and what we want to do about them. The most important fields for our analysis will be the dates, countries, sectors, and funding commitments, so we'll look at them.

In particular, for this methodology to work well, we'll need rows that denote IATI activities which are tagged with OECD sector vocabulary tags. Since there are only 30 *World Bank Group* rows, we can quickly look at all of them.

In [9]:
wb_condition = data['reporting-org'] == 'World Bank'
wbg_condition = data['reporting-org'] == 'World Bank Group'
wb_rows = data[wb_condition].reset_index()
wbg_rows = data[wbg_condition].reset_index()

In [10]:
wbg_rows.loc[:, 'sector-code':'sector-vocabulary-code']

Unnamed: 0,sector-code,sector,sector-percentage,sector-vocabulary,sector-vocabulary-code
0,FY,,100,,WBSector
1,TH,,100,,WBSector
2,AC,,100,,WBSector
3,AI,,100,,WBSector
4,AL,,100,,WBSector
5,TY,,100,,WBSector
6,TH,,100,,WBSector
7,AA,,100,,WBSector
8,WW,,100,,WBSector
9,AL,,100,,WBSector


Looks like the first 15 rows have no sector data, which is of no use to us. The last 15 rows have only WBSector or WBTheme as a sector vocabulary, which is not terribly useful to us at this point. We can drop these 30 rows from our data.

In [11]:
data.drop(data[data['reporting-org'] == 'World Bank Group'].index, inplace=True)
data.groupby(['reporting-org'])['iati-identifier'].count()

reporting-org
Department for International Development           18353
Swedish International Development Agency (Sida)    85601
World Bank                                          3035
Name: iati-identifier, dtype: int64

Great. We will also want to remove any other rows that don't contain OECD as a sector vocabulary. Let's have a look at a few example rows from each of the three donors to see if they all have a consistent manner of encoding OECD sectors (they should, according to the IATI standard). 

In [12]:
sida_condition = data['reporting-org'] == 'Swedish International Development Agency (Sida)'
sida_rows = data[sida_condition].reset_index()
dfid_condition = data['reporting-org'] == 'Department for International Development'
dfid_rows = data[dfid_condition].reset_index()

In [13]:
wb_rows.loc[0:5, 'sector-code':'sector-vocabulary-code']

Unnamed: 0,sector-code,sector,sector-percentage,sector-vocabulary,sector-vocabulary-code
0,000072;000023;000322;000211;000321;000032;0003...,;;;;;;;;;;;;;;;;;;Financial policy and adminis...,13;13;13;13;25;38;13;13;13;13;13;13;13;25;4;29...,;;;;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;98;98;99;99;9...
1,000032;000811;000661;000243;000081;000322;0004...,;;;;;;;;;;;;;;;;;;;;;;;;;Public finance manage...,22;;11;11;;11;22;11;11;11;11;11;11;22;11;22;11...,;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;98;98;98;98;9...
2,000066;000212;000662;000021;000043;000014;0006...,;;;;;;;;;;;;;;Rural development;Privatisation;...,14;29;7;29;29;28;7;29;28;20;20;20;20;20;6;59;35,;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;99;99;99;99;99;1;1;1
3,000835;000834;000083;000084;000022;000725;0000...,;;;;;;;;Agricultural development,7;46;53;19;100;28;28;100;100,;;;;;;;;,98;98;98;98;98;98;98;99;1
4,000052;000651;000065;000521;000041;000657;0006...,;;;;;;;;;;;;;;;;;Higher education;Primary educ...,17;21;50;17;9;4;4;26;17;9;17;4;9;14;81;2;3;2;8...,;;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;98;98;99;99;9...
5,000513;000662;000072;000663;000022;000066;0000...,;;;;;;;;;;;;;;;Social/ welfare services,22;11;22;11;100;22;23;22;23;11;22;20;8;64;8;100,;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;99;99;99;99;1


It appears that the World Bank rows of data have no data in *sector-vocabulary*, but have a semi-colon delimited list of sector vocabulary codes in the *sector-vocabulary-code* field. Interestingly, the Bank tags its activities with both OECD and WB sector vocabularies: 

<p>'1' indicates an OECD vocabulary sector.
<p>'99' indicates a WB vocabulary sector.
<p>'98' indicates a WB vocabulary theme.

More info on sector vocabulary codes and their meanings can be found in IATI's codelists. For our purposes, we are interested in OECD sectors. We'll come back to this later. Let's check the Sida and DFID data.

In [14]:
sida_rows.loc[0:5, 'sector-code':'sector-vocabulary-code']

Unnamed: 0,sector-code,sector,sector-percentage,sector-vocabulary,sector-vocabulary-code
0,72010,Material relief assistance and services,,,1
1,72010,Material relief assistance and services,,,1
2,43010,Multisector aid,,,1
3,72010,Material relief assistance and services,,,1
4,72010,Material relief assistance and services,,,1
5,72010,Material relief assistance and services,,,1


Sida's data is simple - one OECD sector per activity. 

In [15]:
dfid_rows.loc[0:5, 'sector-code':'sector-vocabulary-code']

Unnamed: 0,sector-code,sector,sector-percentage,sector-vocabulary,sector-vocabulary-code
0,15160;15150;13020,;Democratic participation and civil society;Re...,30;30;40,OECD Development Assistance Committee;OECD Dev...,DAC;DAC;DAC
1,15160;15150;13020,;Democratic participation and civil society;Re...,30;30;40,OECD Development Assistance Committee;OECD Dev...,DAC;DAC;DAC
2,15160;15150;13020,;Democratic participation and civil society;Re...,30;30;40,OECD Development Assistance Committee;OECD Dev...,DAC;DAC;DAC
3,15170;15160;15153;15150;15110;15151,Women's equality organisations and institution...,10;10;10;10;25;35,OECD Development Assistance Committee;OECD Dev...,DAC;DAC;DAC;DAC;DAC;DAC
4,15170;15160;15153;15150;15110;15151,Women's equality organisations and institution...,10;10;10;10;25;35,OECD Development Assistance Committee;OECD Dev...,DAC;DAC;DAC;DAC;DAC;DAC
5,15170;15160;15153;15150;15110;15151,Women's equality organisations and institution...,10;10;10;10;25;35,OECD Development Assistance Committee;OECD Dev...,DAC;DAC;DAC;DAC;DAC;DAC


DFID's data is also fairly simple - though row 4 demonstrates that some activities have percentage splits between more than one OECD sector. 

## 2. Filtering for relevant OECD Sectors

The methodology is based on a simple concept that can readily be customized and made more complex. In this notebook, we'll illustrate a simplistic, but easy-to-implement approach.

The methodology identifies OECD DAC CRS 5-digit sectors of relevance to the topic of interest, and filters IATI data for these sectors. This generally works well, as nearly all IATI contributors tag their IATI activities data with the OECD sectors. As a reminder, in this analysis the topic of interest was "*entrepreneurship and small-and-growing businesses*". There are several OECD sectors that approximate this idea, but we'll pick one: *32130 - Small and medium enterprises (SME) development*. 

First let's remove any rows that don't contain a reference to the OECD sector vocabulary:

In [16]:
wb_oecd_rows = wb_rows[wb_rows['sector-vocabulary-code'].str.contains("1", na=False)]
dfid_oecd_rows = dfid_rows[dfid_rows['sector-vocabulary-code'].str.contains("DAC", na=False)]
sida_oecd_rows = sida_rows[sida_rows['sector-vocabulary-code'].str.contains("1", na=False)]

In [17]:
data = pd.concat([wb_oecd_rows, dfid_oecd_rows, sida_oecd_rows])
data = data.drop(['index'], axis=1)
data.head()

Unnamed: 0,iati-identifier,default-language,reporting-org,title,description,start-planned,end-planned,start-actual,end-actual,recipient-country-code,...,recipient-country-percentage,sector-code,sector,sector-percentage,sector-vocabulary,sector-vocabulary-code,default-currency,total-Commitment,total-Disbursement,total-Expenditure
0,44000-P157469,en,World Bank,Development Policy Credit 2: Fiscal Sustainabi...,The development objectives of the Second Fisca...,2016-12-20,2018-06-30,2016-12-21,2018-06-30,BT,...,100,000072;000023;000322;000211;000321;000032;0003...,;;;;;;;;;;;;;;;;;;Financial policy and adminis...,13;13;13;13;25;38;13;13;13;13;13;13;13;25;4;29...,;;;;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;98;98;99;99;9...,USD,24000000.0,23832725.0,0.0
1,44000-P164290,en,World Bank,Strengthening Fiscal Management & Private Sect...,The Development Policy Credit (DPC) of US$30 m...,2018-02-28,2019-02-28,2018-03-30,NaT,BT,...,100,000032;000811;000661;000243;000081;000322;0004...,;;;;;;;;;;;;;;;;;;;;;;;;;Public finance manage...,22;;11;11;;11;22;11;11;11;11;11;11;22;11;22;11...,;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;98;98;98;98;9...,USD,30000000.0,29202766.0,0.0
2,44000-P071144,en,World Bank,DR Congo Private Sector Development and Compet...,The objective of the Private Sector Developmen...,2001-09-14,2014-06-30,2003-07-29,2014-06-30,CD,...,100,000066;000212;000662;000021;000043;000014;0006...,;;;;;;;;;;;;;;Rural development;Privatisation;...,14;29;7;29;29;28;7;29;28;20;20;20;20;20;6;59;35,;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;99;99;99;99;99;1;1;1,USD,168226510.0,176251560.0,0.0
3,44000-P083813,en,World Bank,DRC National Parks Network Rehabilitation Project,The objective of the National Parks Network Re...,2009-04-02,2018-12-31,2013-12-12,2018-12-31,CD,...,100,000835;000834;000083;000084;000022;000725;0000...,;;;;;;;;Agricultural development,7;46;53;19;100;28;28;100;100,;;;;;;;;,98;98;98;98;98;98;98;99;1,USD,3000000.0,2460068.0,0.0
4,44000-P086294,en,World Bank,DRC Education Sector Project,The objective of the Education Sector Project ...,2005-01-13,2014-10-31,2007-06-05,2014-10-31,CD,...,100,000052;000651;000065;000521;000041;000657;0006...,;;;;;;;;;;;;;;;;;Higher education;Primary educ...,17;21;50;17;9;4;4;26;17;9;17;4;9;14;81;2;3;2;8...,;;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;98;98;99;99;9...,USD,149859258.0,151912628.0,0.0


Great, now our *data* dataframe only contains rows that contain at least one OECD sector tag. We still have the majority of the data, since all three donors use OECD sector tags quite consistently.
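One caveat about the substring test used in the filter above: `str.contains("1")` would also match codes such as `10` or `11` if they ever appeared. The following is a minimal sketch of a stricter check that compares whole semicolon-separated tokens. The helper name `has_vocab_code` and the toy values are illustrative assumptions, not part of the original analysis, and the sketch assumes the vocabulary codes are stored as text.

```python
import pandas as pd

def has_vocab_code(codes: pd.Series, wanted: set) -> pd.Series:
    """True where any semicolon-separated token equals one of `wanted`."""
    # Missing values become empty strings, which never match
    tokens = codes.fillna("").astype(str).str.split(";")
    return tokens.apply(lambda parts: any(p.strip() in wanted for p in parts))

# Toy values modelled on the rows shown earlier (not real data)
toy = pd.Series(["98;98;99;1", "98;99", "DAC;DAC", "1", None, "10;11"])
mask = has_vocab_code(toy, {"1", "DAC"})
print(mask.tolist())
```

With this helper a single call such as `data[has_vocab_code(data['sector-vocabulary-code'], {"1", "DAC"})]` could replace the three per-donor filters, under the same assumptions.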

Now we follow a similar process to reduce the rows to only those that contain reference to OECD sector 32130:

In [18]:
sector_of_interest = '32130'
data = data[data['sector-code'].str.contains(sector_of_interest, na=False)]
data.head()

Unnamed: 0,iati-identifier,default-language,reporting-org,title,description,start-planned,end-planned,start-actual,end-actual,recipient-country-code,...,recipient-country-percentage,sector-code,sector,sector-percentage,sector-vocabulary,sector-vocabulary-code,default-currency,total-Commitment,total-Disbursement,total-Expenditure
1,44000-P164290,en,World Bank,Strengthening Fiscal Management & Private Sect...,The Development Policy Credit (DPC) of US$30 m...,2018-02-28,2019-02-28,2018-03-30,NaT,BT,...,100,000032;000811;000661;000243;000081;000322;0004...,;;;;;;;;;;;;;;;;;;;;;;;;;Public finance manage...,22;;11;11;;11;22;11;11;11;11;11;11;22;11;22;11...,;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;98;98;98;98;9...,USD,30000000.0,29202766.0,0.0
18,44000-P124720,en,World Bank,Dem Rep Congo - Western Growth Poles,The development objective of the Western Growt...,2012-03-29,2019-08-30,2013-06-11,NaT,CD,...,100,000022;000332;000723;000221;000032;000023;0002...,;;;;;;;;;;;;;;;;;;;;;;Agricultural development...,19;9;19;19;9;10;9;19;6;6;9;19;9;9;9;3;28;4;28;...,;;;;;;;;;;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;98;98;98;98;9...,USD,110000000.0,65055811.0,0.0
31,44000-P160806,en,World Bank,DRC- SME Development and Growth Project,The Small Medium Enterprise Development and Gr...,2017-10-31,2023-12-28,2018-07-06,NaT,CD,...,100,000323;000241;000061;000024;000223;000032;0002...,;;;;;;;;;;;;Small and medium-sized enterprises...,67;45;41;87;35;67;72;35;14;5;37;44;100,;;;;;;;;;;;;,98;98;98;98;98;98;98;98;99;99;99;99;1,USD,100000000.0,0.0,0.0
38,44000-P118561,en,World Bank,CG Rep. Support to Economic Diversification Pr...,The objective of the Support to Economic Diver...,2010-05-27,2017-12-31,2010-12-16,2017-12-31,CG,...,100,000022;000723;000221;000021;000032;000023;0002...,;;;;;;;;;;;;;;;;;;;;Small and medium-sized ent...,4;4;4;40;21;10;21;4;21;4;6;21;8;6;32;4;14;12;3...,;;;;;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;98;98;98;98;9...,USD,8407683.0,7607203.0,0.0
48,44000-P159979,en,World Bank,Commercial Agriculture Project,The development objective of Commercial Agricu...,2017-06-15,2022-07-31,2017-07-13,NaT,CG,...,100,000812;000721;000067;000332;000811;000033;0000...,;;;;;;;;;;;;;;Small and medium-sized enterpris...,1;100;63;55;8;55;55;67;100;10;17;59;13;11;5;25;70,;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;99;99;99;99;1;1;1,USD,100000000.0,2662662.0,0.0


Now we can see that the amount of data is greatly reduced - just over 1% of our original data remains. 

We can even have a quick, dirty look at the sums of funding commitments to this OECD sector from each of the three donors - though this comparison isn't necessarily valid yet, for reasons we'll see...

In [19]:
data['commitment-in-billions'] = data['total-Commitment'] / 1e9  # commitments expressed in billions
data.groupby(['reporting-org'])['commitment-in-billions'].sum()

reporting-org
Department for International Development            6.050498
Swedish International Development Agency (Sida)     0.383166
World Bank                                         16.963528
Name: commitment-in-billions, dtype: float64

Interesting. Keep in mind that the three donors use different currencies - GBP, SEK, and USD, respectively.

However, we are comparing totals over different time periods, which isn't valid. Additionally, some activities' funding commitments are only *partially* committed to the sector of interest. We need to restrict the data to a time window of analysis, and account for any partial funding commitments.
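To make the idea of a partial commitment concrete, here is a hedged sketch of how the share allocated to one sector could be read from the parallel, semicolon-delimited *sector-code* and *sector-percentage* fields. The function name `sector_share` is hypothetical, and two assumptions are baked in: the two lists line up position by position, and an activity with a single sector and no reported percentage is treated as 100% allocated to it.

```python
import pandas as pd

def sector_share(sector_codes, sector_percentages, target="32130"):
    """Return the fraction (0-1) of an activity allocated to `target`.

    Returns None when the share cannot be determined from the two fields.
    """
    if pd.isna(sector_codes):
        return None
    codes = str(sector_codes).split(";")
    if target not in codes:
        return None
    if pd.isna(sector_percentages) or str(sector_percentages).strip() == "":
        # Assumption: one sector and no reported split means 100%
        return 1.0 if len(codes) == 1 else None
    pcts = str(sector_percentages).split(";")
    if len(pcts) != len(codes):
        return None  # lists do not line up, so the share is ambiguous
    total = 0.0
    for code, pct in zip(codes, pcts):
        if code == target:
            if pct.strip() == "":
                return None  # percentage missing for the target sector
            total += float(pct)
    return total / 100.0

# Toy inputs in the same shape as the DFID and Sida rows shown earlier
print(sector_share("15160;32130;13020", "30;30;40"))
print(sector_share("32130", float("nan")))
print(sector_share("15160;13020", "50;50"))
```

Multiplying *total-Commitment* by this share would give the portion of each commitment attributable to the sector of interest. Rows where the function returns `None` would need a manual decision.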

Last thing - to keep the data clean, we'll delete the *commitment-in-billions* field we just created.

In [21]:
# errors='ignore' keeps this cell safe to re-run once the column is already gone;
# reset_index(drop=True) renumbers the rows without adding an 'index' column
data = data.drop(columns=['commitment-in-billions'], errors='ignore').reset_index(drop=True)
data.head()

## 3. Choosing a timeframe for analysis and scaling the commitments

We want to compute the funding committed to IATI activities in our area of interest for a single year. IATI activities have a *total-Commitment* field that indicates the total amount of funding committed by a donor to that activity over its life. However, activities' lives are frequently more than one year, and may begin in the middle of one year, and end in the middle of another. 

Thus, we need to determine the 'average commitment per year' of each activity, and determine whether or not they were active (and for how long, in years) during our timeframe of interest. 

To this end, we have four date fields in the data: the planned start and end dates of each activity, and the actual start and end dates of each activity. However there are lots of missing values. 

In [22]:
data.loc[0:5, 'start-planned':'end-actual']

Unnamed: 0,start-planned,end-planned,start-actual,end-actual
0,2018-02-28,2019-02-28,2018-03-30,NaT
1,2012-03-29,2019-08-30,2013-06-11,NaT
2,2017-10-31,2023-12-28,2018-07-06,NaT
3,2010-05-27,2017-12-31,2010-12-16,2017-12-31
4,2017-06-15,2022-07-31,2017-07-13,NaT
5,2017-07-12,2023-05-30,2018-05-30,NaT


To calculate the length in years of each activity, we'll need to check which of these fields contain data. To do so, we'll do a quick IF - THEN analysis of each activity, and calculate the difference in years accordingly:

        IF (start-actual exists AND end-actual exists)
        THEN difference = (end-actual - start-actual), then divide by 365.25 to arrive at years diff
        ELIF (start-actual exists AND end-planned exists)
        THEN difference = (end-planned - start-actual), then divide by 365.25 to arrive at years diff
        ELIF (start-planned exists AND end-planned exists)
        THEN difference = (end-planned - start-planned), then divide by 365.25 to arrive at years diff
        ELSE leave difference blank for now
        
To apply this logic to our data, we'll write a function to determine which IF scenario applies, calculate the difference in days, and then convert to a difference in years.

In [23]:
def dateDiff(dataFrame):
    
    for i in dataFrame.index:
        
        # both end-actual and start-actual are not null - i.e. they are dates
        if not pd.isnull(dataFrame.at[i,'end-actual']) and not pd.isnull(dataFrame.at[i,'start-actual']):
            dataFrame.at[i,'date-diff-days'] = dataFrame.at[i,'end-actual'] - dataFrame.at[i,'start-actual']
            # if less than 365.25 days, round the duration in years to 1
            if dataFrame.at[i,'date-diff-days'].days / 365.25 > 1:
                dataFrame.at[i,'date-diff-years'] = dataFrame.at[i,'date-diff-days'].days / 365.25
            else:
                dataFrame.at[i,'date-diff-years'] = 1
        
        # both end-planned and start-actual are not null - i.e. they are dates
        elif not pd.isnull(dataFrame.at[i,'end-planned']) and not pd.isnull(dataFrame.at[i,'start-actual']):
            dataFrame.at[i,'date-diff-days'] = dataFrame.at[i,'end-planned'] - dataFrame.at[i,'start-actual']
            # if less than 365.25 days, round the duration in years to 1
            if dataFrame.at[i,'date-diff-days'].days / 365.25 > 1:
                dataFrame.at[i,'date-diff-years'] = dataFrame.at[i,'date-diff-days'].days / 365.25
            else:
                dataFrame.at[i,'date-diff-years'] = 1
            
        # both end-planned and start-planned are not null - i.e. they are dates
        elif not pd.isnull(dataFrame.at[i,'end-planned']) and not pd.isnull(dataFrame.at[i,'start-planned']):
            dataFrame.at[i,'date-diff-days'] = dataFrame.at[i,'end-planned'] - dataFrame.at[i,'start-planned']
            # if less than 365.25 days, round the duration in years to 1
            if dataFrame.at[i,'date-diff-days'].days / 365.25 > 1:
                dataFrame.at[i,'date-diff-years'] = dataFrame.at[i,'date-diff-days'].days / 365.25
            else:
                dataFrame.at[i,'date-diff-years'] = 1
        
        # otherwise, not enough info. Need to impute later
        else:
            dataFrame.at[i,'date-diff-days'] = pd.NaT
            dataFrame.at[i,'date-diff-years'] = np.nan

    return dataFrame

data = dateDiff(data)

What does this function do? It evaluates which start and end dates are available. For data points where these dates *are* available, it calculates the difference between the two dates in days, then in years, treating any duration shorter than one year as one year. For data points without these dates, these fields are left blank.
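For comparison, the same fallback order can be expressed without an explicit loop. This is only a sketch under the assumption that the four date columns are already parsed as datetimes. The name `duration_years` and the toy dataframe are illustrative, with dates borrowed from the sample rows above.

```python
import pandas as pd

def duration_years(df: pd.DataFrame) -> pd.Series:
    """Duration in years, following the IF / ELIF priority order above."""
    actual = df["end-actual"] - df["start-actual"]
    mixed = df["end-planned"] - df["start-actual"]
    planned = df["end-planned"] - df["start-planned"]
    # The first non-missing difference wins, in priority order
    days = actual.fillna(mixed).fillna(planned).dt.days
    # Durations under one year are raised to 1; missing durations stay NaN
    return (days / 365.25).clip(lower=1)

toy = pd.DataFrame({
    "start-planned": pd.to_datetime(["2010-05-27", "2012-03-29", "2015-01-01", None]),
    "end-planned": pd.to_datetime(["2017-12-31", "2019-08-30", "2015-06-30", None]),
    "start-actual": pd.to_datetime(["2010-12-16", "2013-06-11", None, None]),
    "end-actual": pd.to_datetime(["2017-12-31", None, None, "2020-01-01"]),
})
years = duration_years(toy)
print(years)
```

The first two toy rows reproduce the durations of roughly 7.04 and 6.22 years seen in the notebook output, the third is raised to 1, and the fourth stays blank because no usable pair of dates exists.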

Let's delete the *date-diff-days* column, as we won't need it anymore, and then let's check how many blank *date-diff-years* rows exist in our data.

In [24]:
data = data.drop(['date-diff-days'], axis=1)
data['date-diff-years'].isna().sum()

1

In [25]:
data_temp = data[~pd.isnull(data['date-diff-years'])]
# Look up each donor's average duration by label rather than by position,
# so the result does not depend on the alphabetical order of the groups
avg_years = data_temp.groupby(['reporting-org'])['date-diff-years'].mean()
dfid_years_avg = avg_years['Department for International Development']          # avg duration of DFID activities
sida_years_avg = avg_years['Swedish International Development Agency (Sida)']   # avg duration of Sida activities
wbg_years_avg = avg_years['World Bank']                                         # avg duration of WBG activities

for i in data.index:
    if pd.isnull(data.at[i,'date-diff-years']):
        if data.at[i,'reporting-org'] == 'Department for International Development':
            data.at[i,'date-diff-years'] = dfid_years_avg
        elif data.at[i,'reporting-org'] == 'Swedish International Development Agency (Sida)':
            data.at[i,'date-diff-years'] = sida_years_avg
        elif data.at[i,'reporting-org'] == 'World Bank':
            data.at[i,'date-diff-years'] = wbg_years_avg

In [26]:
data.head()

Unnamed: 0,iati-identifier,default-language,reporting-org,title,description,start-planned,end-planned,start-actual,end-actual,recipient-country-code,...,sector-code,sector,sector-percentage,sector-vocabulary,sector-vocabulary-code,default-currency,total-Commitment,total-Disbursement,total-Expenditure,date-diff-years
0,44000-P164290,en,World Bank,Strengthening Fiscal Management & Private Sect...,The Development Policy Credit (DPC) of US$30 m...,2018-02-28,2019-02-28,2018-03-30,NaT,BT,...,000032;000811;000661;000243;000081;000322;0004...,;;;;;;;;;;;;;;;;;;;;;;;;;Public finance manage...,22;;11;11;;11;22;11;11;11;11;11;11;22;11;22;11...,;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;98;98;98;98;9...,USD,30000000.0,29202766.0,0.0,1.0
1,44000-P124720,en,World Bank,Dem Rep Congo - Western Growth Poles,The development objective of the Western Growt...,2012-03-29,2019-08-30,2013-06-11,NaT,CD,...,000022;000332;000723;000221;000032;000023;0002...,;;;;;;;;;;;;;;;;;;;;;;Agricultural development...,19;9;19;19;9;10;9;19;6;6;9;19;9;9;9;3;28;4;28;...,;;;;;;;;;;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;98;98;98;98;9...,USD,110000000.0,65055811.0,0.0,6.217659
2,44000-P160806,en,World Bank,DRC- SME Development and Growth Project,The Small Medium Enterprise Development and Gr...,2017-10-31,2023-12-28,2018-07-06,NaT,CD,...,000323;000241;000061;000024;000223;000032;0002...,;;;;;;;;;;;;Small and medium-sized enterprises...,67;45;41;87;35;67;72;35;14;5;37;44;100,;;;;;;;;;;;;,98;98;98;98;98;98;98;98;99;99;99;99;1,USD,100000000.0,0.0,0.0,5.478439
3,44000-P118561,en,World Bank,CG Rep. Support to Economic Diversification Pr...,The objective of the Support to Economic Diver...,2010-05-27,2017-12-31,2010-12-16,2017-12-31,CG,...,000022;000723;000221;000021;000032;000023;0002...,;;;;;;;;;;;;;;;;;;;;Small and medium-sized ent...,4;4;4;40;21;10;21;4;21;4;6;21;8;6;32;4;14;12;3...,;;;;;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;98;98;98;98;9...,USD,8407683.0,7607203.0,0.0,7.041752
4,44000-P159979,en,World Bank,Commercial Agriculture Project,The development objective of Commercial Agricu...,2017-06-15,2022-07-31,2017-07-13,NaT,CG,...,000812;000721;000067;000332;000811;000033;0000...,;;;;;;;;;;;;;;Small and medium-sized enterpris...,1;100;63;55;8;55;55;67;100;10;17;59;13;11;5;25;70,;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;99;99;99;99;1;1;1,USD,100000000.0,2662662.0,0.0,5.048597


In [27]:
(data['date-diff-years'] == 0).astype(int).sum(axis=0)

0

Now that there are no more rows without a *date-diff-years* value, and none with a duration of zero (which would cause a division by zero), we can use this column to calculate each activity's *annual-Commitment*. 

The annual commitment is the total commitment to the project (*total-Commitment*), divided by its duration in years (*date-diff-years*). While in reality, activities' budgets are not spent uniformly over the course of the activity, this is an easy approximation that will make it easier to compare donors' commitments for certain, specific time frames of interest.

In [28]:
data['annual-Commitment'] = data['total-Commitment'] / data['date-diff-years']
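With an annual figure in hand, the remaining step described at the start of this section is to work out for how much of a given calendar year each activity was active. The sketch below shows one way to compute that overlap. It is an assumption-laden illustration: the function name `share_of_year_active` is hypothetical, it is not the implementation of any helper imported elsewhere in this notebook, and the caller is assumed to have already chosen a start and end date using the same fallback order as above.

```python
import pandas as pd

def share_of_year_active(start, end, year):
    """Fraction (0-1) of calendar `year` during which an activity ran."""
    if pd.isna(start) or pd.isna(end):
        return float("nan")
    year_start = pd.Timestamp(year=year, month=1, day=1)
    year_end = pd.Timestamp(year=year + 1, month=1, day=1)
    # Clip the activity's span to the boundaries of the calendar year
    overlap_start = max(pd.Timestamp(start), year_start)
    overlap_end = min(pd.Timestamp(end), year_end)
    overlap_days = max((overlap_end - overlap_start).days, 0)
    return overlap_days / (year_end - year_start).days

# Dates taken from sample rows shown earlier in this notebook
part = share_of_year_active("2017-07-13", "2022-07-31", 2017)  # started mid-2017
full = share_of_year_active("2010-12-16", "2017-12-31", 2016)  # active all of 2016
none = share_of_year_active("2018-03-30", "2019-02-28", 2017)  # started after 2017
print(part, full, none)
```

Multiplying *annual-Commitment* by this share would give an estimate of the funding attributable to the chosen year, subject to the uniform-spending approximation already noted.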

In [29]:
year = 2017

In [49]:
from methods import activityInYear

In [50]:
activityInYear(data, year)

NameError: name 'dataFrame' is not defined