## IATI Data Analysis Methodology Testing Notebook

This notebook lets non-technical users experiment with the data aggregation methodology developed for a previous client project: filtering IATI activities by sector (using the OECD DAC sector framework) and by keyword.

It uses a small sample of 400 IATI activities from 8 donors:
<ul>
<li>Bill and Melinda Gates Foundation</li>
<li>EuropeAid DEVCO</li>
<li>DFID</li>
<li>GAC Canada</li>
<li>Global Fund</li>
<li>MFA of the Netherlands</li>
<li>Sida Sweden</li>
<li>World Bank</li>
</ul>

To experiment in this notebook, follow the instructions below.

#### 1. Import necessary libraries

First we need to import Pandas and some functions I've written elsewhere in this repository.

Select the cell below and press the "Run Cell" button above, or hit CTRL + ENTER.

In [16]:
import pandas as pd
from data_setup import data_sectors, data_keywords, data_investigate

#### 2a. Import our data

Next, we'll import our 400 rows of example data. For reference, the full dataset I'm currently experimenting with on my local machine is about 300,000 rows long.

Select the cell below and press the "Run Cell" button above, or hit CTRL + ENTER.

In [10]:
date_fields = ['start-planned','end-planned','start-actual','end-actual']
data = pd.read_csv('shareable_csv.csv', parse_dates=date_fields)
del data['Unnamed: 0']

#### 2b. Inspect our data

If you want, you can have a look at some of the rows of the example data set.

Below, you can change two variables, <b>number_of_rows</b> and <b>donor</b>, to determine the rows you view.

Changing <b>number_of_rows</b> changes how many rows of example data are returned. There are only 50 rows per donor, so enter an integer of 50 or less.

Changing <b>donor</b> changes which donor's data you see. Copy the donor name <b>exactly</b> as it appears below and paste it between the single quotes in the code below.

<b>Donors</b>: B&MGF | DEVCO | DFID | GAC | Global Fund | MFA Netherlands | Sida | World Bank

Then select the cell below and press the "Run Cell" button above, or hit CTRL + ENTER.

In [17]:
# Edit the green number below to change the number of rows you view!
number_of_rows = 10

# Enter the name of one (and only one) of the donors listed above to view data for that donor.
donor = 'World Bank'

data[data['reporting-org']==donor].head(number_of_rows)

Unnamed: 0,iati-identifier,default-language,reporting-org,title,description,start-planned,end-planned,start-actual,end-actual,recipient-country-code,...,recipient-country-percentage,sector-code,sector,sector-percentage,sector-vocabulary,sector-vocabulary-code,default-currency,total-Commitment,total-Disbursement,total-Expenditure
350,44000-P157469,en,World Bank,Development Policy Credit 2: Fiscal Sustainabi...,The development objectives of the Second Fisca...,2016-12-20,2018-06-30,2016-12-21,2018-06-30,BT,...,100.0,000072;000023;000322;000211;000321;000032;0003...,;;;;;;;;;;;;;;;;;;Financial policy and adminis...,13;13;13;13;25;38;13;13;13;13;13;13;13;25;4;29...,;;;;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;98;98;99;99;9...,USD,24000000.0,23832725.0,0.0
351,44000-P164290,en,World Bank,Strengthening Fiscal Management & Private Sect...,The Development Policy Credit (DPC) of US$30 m...,2018-02-28,2019-02-28,2018-03-30,NaT,BT,...,100.0,000032;000811;000661;000243;000081;000322;0004...,;;;;;;;;;;;;;;;;;;;;;;;;;Public finance manage...,22;;11;11;;11;22;11;11;11;11;11;11;22;11;22;11...,;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;98;98;98;98;9...,USD,30000000.0,29202766.0,0.0
352,44000-P071144,en,World Bank,DR Congo Private Sector Development and Compet...,The objective of the Private Sector Developmen...,2001-09-14,2014-06-30,2003-07-29,2014-06-30,CD,...,100.0,000066;000212;000662;000021;000043;000014;0006...,;;;;;;;;;;;;;;Rural development;Privatisation;...,14;29;7;29;29;28;7;29;28;20;20;20;20;20;6;59;35,;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;99;99;99;99;99;1;1;1,USD,168226510.0,176251560.0,0.0
353,44000-P083813,en,World Bank,DRC National Parks Network Rehabilitation Project,The objective of the National Parks Network Re...,2009-04-02,2018-12-31,2013-12-12,2018-12-31,CD,...,100.0,000835;000834;000083;000084;000022;000725;0000...,;;;;;;;;Agricultural development,7;46;53;19;100;28;28;100;100,;;;;;;;;,98;98;98;98;98;98;98;99;1,USD,3000000.0,2460068.0,0.0
354,44000-P086294,en,World Bank,DRC Education Sector Project,The objective of the Education Sector Project ...,2005-01-13,2014-10-31,2007-06-05,2014-10-31,CD,...,100.0,000052;000651;000065;000521;000041;000657;0006...,;;;;;;;;;;;;;;;;;Higher education;Primary educ...,17;21;50;17;9;4;4;26;17;9;17;4;9;14;81;2;3;2;8...,;;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;98;98;99;99;9...,USD,149859258.0,151912628.0,0.0
355,44000-P086874,en,World Bank,Democratic Republic of Congo Emergency Social ...,The objective of the Additional Financing for ...,2004-07-01,2013-06-30,2004-08-26,2013-06-30,CD,...,100.0,000513;000662;000072;000663;000022;000066;0000...,;;;;;;;;;;;;;;;Social/ welfare services,22;11;22;11;100;22;23;22;23;11;22;20;8;64;8;100,;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;99;99;99;99;1,USD,101409112.0,103927904.0,0.0
356,44000-P088751,en,World Bank,DRC Health Sector Rehabilitation Support Project,The development objective of the Additional Fi...,2005-04-26,2014-12-31,2005-09-01,2014-12-31,CD,...,100.0,000063;000637;000621;000631;000635;000062;0006...,;;;;;;;;;;;Malaria control;Basic health care;M...,63;25;13;25;13;37;24;10;7;3;90;20;69;8;3,;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;99;99;99;1;1;1;1,USD,332945982.0,332324124.0,0.0
357,44000-P091092,en,World Bank,DRC Urban Water Supply Project,The Urban Water Supply Project objective is to...,2006-02-23,2019-06-30,2008-12-18,NaT,CD,...,100.0,000071;000043;000712;000432;000023;WC;WF;12220...,;;;;;;;Basic health care;Health policy and adm...,76;24;76;24;10;95;5;52;48,;;;;;;;;,98;98;98;98;98;99;99;1;1,USD,356000000.0,231310417.0,0.0
358,44000-P092537,en,World Bank,DRC Multi-modal Transport,The development objectives of the Additional F...,2005-12-09,2018-06-30,2010-06-29,2018-06-30,CD,...,100.0,000711;000072;000211;000023;000043;000022;0007...,;;;;;;;;;;;;;;;;;;;Trade facilitation;Water tr...,3;3;10;10;60;3;3;10;50;20;20;3;3;10;70;10;10;5...,;;;;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;98;98;98;99;9...,USD,411532816.0,385554991.0,0.0
359,44000-P092724,en,World Bank,DRC - Agriculture Rehabilitation and Recovery ...,The development objective of the Agriculture R...,2006-03-15,2020-03-16,2010-03-30,NaT,CD,...,100.0,000725;000083;000723;000721;000851;000085;0008...,;;;;;;;;;;;;;;Agro-industries;Livestock;Food c...,2;4;68;12;4;4;2;2;10;82;10;14;50;36;39;15;15;1...,;;;;;;;;;;;;;;;;;;,98;98;98;98;98;98;98;98;98;98;98;99;99;99;1;1;...,USD,195000000.0,128218262.0,0.0


#### 3. Choose your sectors

Now <b>you</b> get to choose how to experiment. 

Below is a list of OECD DAC sectors. You can leave them as they are, or you can look up other OECD DAC sector codes to filter by. You can find a list of the codes <a href=http://www.oecd.org/dac/stats/dacandcrscodelists.htm>at this link</a>.

Please note:
<ul>
<li>The list must be formatted as below: square brackets around the whole list, single quotes around the sector codes, and commas separating the individual codes. For example: ['code', 'code', 'code']</li>
<li>The codes should be 5-digit DAC sector codes only, for now. </li>
</ul>
Enter your sectors of interest in the cell below and press the "Run Cell" button above, or hit CTRL + ENTER.

In [13]:
#Choose your sectors!
sectors = ['13010', '13020', '13030', '13040', '13081']
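
The sample output in step 2b shows that the `sector-code` field can hold several codes per activity, separated by semicolons. As a rough illustration of what filtering by sector could involve, the sketch below treats an activity as a match when any of its codes appears in your list. This is a hypothetical helper written for this example (the name `has_sector` is an assumption); it is not the `data_sectors` function from `data_setup`, which may work differently.

```python
def has_sector(sector_field, wanted):
    """Return True if any code in a semicolon-separated field is in `wanted`.

    Hypothetical helper for illustration; not part of data_setup.
    """
    if not isinstance(sector_field, str):
        # Missing values (e.g. NaN) cannot match anything.
        return False
    codes = {code.strip() for code in sector_field.split(';')}
    return not codes.isdisjoint(wanted)


sectors = ['13010', '13020', '13030', '13040', '13081']

print(has_sector('000063;12220;13020', sectors))  # True: 13020 is in the list
print(has_sector('000071;WC;WF;12220', sectors))  # False: no code is in the list
print(has_sector(float('nan'), sectors))          # False: missing value
```

With pandas, a helper like this could be applied row by row, for example `data[data['sector-code'].apply(lambda s: has_sector(s, sectors))]`.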

#### 4. Choose your keywords

Below is a list of text keywords. You can leave them as they are, or you can make up your own keywords to filter by.

Please note:
<ul>
<li>Enter lowercase letters only - the sample data is all converted to lowercase, so only all-lowercase keywords will be found.</li>
<li>The same formatting applies for the list below, i.e. ['keyword', 'keyword', 'keyword']</li>
</ul>
Enter your keywords of interest in the cell below and press the "Run Cell" button above, or hit CTRL + ENTER.

In [14]:
#Choose your keywords!
keywords = ["reproductive", "family planning", "contraceptive", "abortion",
            "pregnancy", "sexual", "gender-based violence", "domestic violence",
            "female genital mutilation", "fgm", "population census", "hiv", "std",
            "aids", "obstetric", "antenatal", "perinatal", "neonatal", "postnatal",
            "newborn", "health personnel", "childhood", "immunization", "polio",
            "measles", "tetanus", "congenital", "disabilities", "breastfeeding",
            "infant feeding", "doctor", "nurse", "midwive", "pharmacist",
            "community health worker", "health specialists", "medical device",
            "m-health", "e-health", "mobile health", "health data", "medical products",
            "health products", "medical services", "health services", "clinical studies",
            "clinical trials", "medicine", "vaccine"]

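Because keyword lists are typed by hand, it helps to know that Python silently joins two adjacent string literals when the comma between them is missing, producing one long keyword that is unlikely to match anything. The sketch below demonstrates that pitfall and shows one way a keyword match on lowercased text could work. The helper `matches_keyword` is hypothetical and assumes a simple substring search; it is not the `data_keywords` function from `data_setup`.

```python
def matches_keyword(text, keywords):
    """Return True if any keyword occurs as a substring of the lowercased text.

    Hypothetical helper for illustration; not part of data_setup.
    """
    if not isinstance(text, str):
        # Missing titles or descriptions cannot match.
        return False
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


# A missing comma merges two neighbouring string literals into one keyword.
merged = ["breastfeeding" "infant feeding", "doctor"]
print(len(merged))  # 2 items, not 3
print(merged[0])    # breastfeedinginfant feeding

keywords = ["breastfeeding", "infant feeding", "doctor"]
print(matches_keyword("DRC Health Sector: Infant Feeding Programme", keywords))  # True
print(matches_keyword("DRC Urban Water Supply Project", keywords))               # False
```

Under a plain substring search like this one, short keywords such as "std" or "aids" can also match inside longer, unrelated words, so it is worth spot-checking the matched rows.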
#### 5. Run analysis

The single line below filters the example data by the sectors and keywords you chose above.

Then it returns some basic stats (I hope to improve the function to add more useful info and visualisations soon!). 
These stats relate to 5 datasets that the function analyzes: 
<ul>
<li>(1) the original data,</li>
<li>(2) the data filtered only by sector,</li>
<li>(3) the data filtered only by keyword,</li>
<li>(4) an inner join of the sector-filtered and keyword-filtered data, and</li>
<li>(5) an outer join of the sector-filtered and keyword-filtered data. (<a href=https://www.thedataschool.co.uk/harry-cooney/what-are-data-joins/>More on joins here</a>)</li>
</ul>
Currently, the function returns the size of each dataset (in rows and columns), the number of rows per donor in that dataset, and the total "committed funding" per donor in that dataset. Keep in mind that the committed funding figures are in different currencies, so totals for different donors are not directly comparable!

More to come soon.
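
To make the difference between datasets (4) and (5) concrete, here is a small sketch using made-up rows. It is not the `data_investigate` implementation: the identifiers, amounts, currencies and the two "filtered" frames are invented for illustration, and merging on every shared column is an assumption about how the join could be done. Only the column names are taken from the sample data.

```python
import pandas as pd

# Made-up activities; only the column names come from the sample data.
activities = pd.DataFrame({
    'iati-identifier': ['A-1', 'A-2', 'A-3', 'A-4'],
    'reporting-org': ['DFID', 'DFID', 'GAC', 'Sida'],
    'default-currency': ['GBP', 'GBP', 'CAD', 'SEK'],
    'total-Commitment': [100.0, 200.0, 300.0, 400.0],
})

# Pretend results of the two filters.
by_sector = activities[activities['iati-identifier'].isin(['A-1', 'A-2'])]
by_keyword = activities[activities['iati-identifier'].isin(['A-2', 'A-3'])]

# With no `on` argument, merge joins on all columns the two frames share.
inner = by_sector.merge(by_keyword, how='inner')  # matched by both filters
outer = by_sector.merge(by_keyword, how='outer')  # matched by either filter

print(len(inner), len(outer))  # 1 3

# Rows per donor, then commitments per donor and currency.
print(outer.groupby('reporting-org')['iati-identifier'].count())
per_currency = outer.groupby(['reporting-org', 'default-currency'])['total-Commitment'].sum()
print(per_currency)
```

Grouping by `default-currency` as well as `reporting-org` is one way to avoid adding together amounts reported in different currencies.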


In [15]:
data_investigate(data, sectors, keywords)

For full data set:
Rows   : 400
Columns: 21

Rows per donor: 
reporting-org
B&MGF              50
DEVCO              50
DFID               50
GAC                50
Global Fund        50
MFA Netherlands    50
Sida               50
World Bank         50
Name: iati-identifier, dtype: int64

Commitments per donor: 
reporting-org
B&MGF              9.615906e+07
DEVCO              5.751914e+06
DFID               2.289710e+08
GAC                6.729376e+08
Global Fund        6.520032e+08
MFA Netherlands    2.204541e+08
Sida               0.000000e+00
World Bank         4.949346e+09
Name: total-Commitment, dtype: float64

For data filtered by sectors:
Rows   : 55
Columns: 21

Rows per donor: 
reporting-org
B&MGF               7
DFID               12
GAC                10
Global Fund        19
MFA Netherlands     5
World Bank          2
Name: iati-identifier, dtype: int64

Commitments per donor: 
reporting-org
B&MGF              3.186821e+07
DFID               2.988223e+08
GAC                1