# Title: The Name of the Report

> Open Data Services, November 2017


## Introduction

A Jupyter notebook is a convenient way of combining analysis with presentation. It's composed of 'cells' which can either execute code (in this case, Python), or include markdown text (which is how this cell has been written). 

### Markdown

Markdown allows writers to format text (e.g. using *italics* or **bold**), include links like [this one for Wikipedia](https://www.wikipedia.org), use `monospace`, and quote text, like this:

> here is some quoted text

Its syntax is quite simple. The sentence and quotation above were written with the following syntax:

------

```markdown
Markdown allows writers to format text (e.g. using *italics* or **bold**), include links like [this one for Wikipedia](https://www.wikipedia.org), use `monospace`, and quote text, like this:

> here is some quoted text
```
-------

### Python

In addition to text, Jupyter can run code. This notebook embeds a Python kernel, which allows the author to write and execute arbitrary code and display its output. For example:

In [1]:
# a small python script importing a library which deals with time, and displaying the current date and time
import datetime

print(datetime.datetime.now())

2017-11-23 14:52:02.428129
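
The default output above includes seconds and microseconds. As a minimal sketch using only the standard library, `strftime` can render a timestamp in a shorter form. The fixed timestamp and the format string below are illustrative choices, not part of the original notebook.

```python
import datetime

# A fixed timestamp (the value printed above) so the result is reproducible
stamp = datetime.datetime(2017, 11, 23, 14, 52, 2, 428129)

# Format as year-month-day hour:minute, dropping seconds and microseconds
formatted = stamp.strftime("%Y-%m-%d %H:%M")
print(formatted)  # prints 2017-11-23 14:52
```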


## Handling Data

### Tables

The simplest way to look at tabular data is to use a dataframe. Think of a CSV file, but held as an object in memory rather than viewed in Excel. Using a package called pandas, we can load a CSV file straight from the internet and start analysing it. Let's begin with the list of IATI publishers collated on the [IATI Dashboard](http://dashboard.iatistandard.org/publishers.html), which can be downloaded as CSV from this URL: http://dashboard.iatistandard.org/publishers.csv

In [2]:
# import the pandas library and set the maximum number of displayed columns to 30
import pandas as pd
pd.set_option("display.max_columns", 30)

# create a variable for the publishers dataframe
publishers = pd.read_csv("http://dashboard.iatistandard.org/publishers.csv")

# use the publisher name as the row index
publishers = publishers.set_index("Publisher Name")
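
The cell above needs network access. As a rough offline sketch of the same two steps (parse CSV text, then index the rows by name), the example below reads a small in-memory CSV instead. The sample rows, the `Example Org` names and the `sample` variable are hypothetical and are not taken from the IATI data.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the downloaded CSV
sample_csv = io.StringIO(
    "Publisher Name,Activities,Total File Size\n"
    "Example Org A,96,395966\n"
    "Example Org B,4,154370\n"
)

# Parse the CSV text into a dataframe
sample = pd.read_csv(sample_csv)

# Label each row by publisher name instead of a numeric position
sample = sample.set_index("Publisher Name")

# Rows can now be looked up by name
print(sample.loc["Example Org A", "Activities"])
```

Once the name is the index, it no longer counts as a data column, which is why `sample.shape` reports two columns rather than three.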

First, to get a feel for the data, let's look at the first three rows:
> If you're running this notebook live, try changing the number in brackets below to display a different number of rows.

In [3]:
publishers.head(3)

| Publisher Name | Publisher Registry Id | Activities | Organisations | Files | Activity Files | Organisation Files | Total File Size | Reporting Org on Registry | Reporting Orgs in Data (count) | Reporting Orgs in Data | Hierarchies (count) | Hierarchies |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Acid Survivors Foundation | _admin | 2 | 0 | 1 | 1 | 0 | 7969 | PK-VSWA-511-2007 | 2 | PK-VSWA-511-2007;XI-IATI-VSWA-511-2007 | 1 | 1.0 |
| ActionAid International | aai | 96 | 0 | 1 | 1 | 0 | 395966 | NL-KVK-27264198 | 1 | NL-KVK-27264198 | 1 | |
| Stichting ActionAId | aanl | 4 | 1 | 2 | 1 | 1 | 154370 | NL-KVK-41217595 | 1 | NL-KVK-41217595 | 1 | 1.0 |


We can filter and present the data easily, for instance by showing only the publishers that have published more than 50k activities, and only a few of the columns:

In [4]:
# keep only the publishers with more than 50k activities
fifty_k_activitis = publishers[publishers['Activities'] > 50000]

# present data from the columns we're interested in
fifty_k_activitis[["Organisations", "Activity Files", "Activities"]]

| Publisher Name | Organisations | Activity Files | Activities |
|---|---|---|---|
| European Commission - Development and Cooperation-EuropeAid | 1 | 219 | 50007 |
| Sweden, through Swedish International Development Cooperation Agency (Sida) | 2 | 164 | 79245 |
| United States | 12 | 269 | 130920 |
| United States Agency for International Development (USAID) | 0 | 173 | 72022 |
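
The filter above works in two steps: the comparison produces one True/False value per row, and that boolean series then selects the matching rows. The sketch below makes the intermediate mask visible and uses `.loc` to pick rows and columns in a single step, which is an alternative to the chained indexing used in the cell. The `demo` dataframe and its values are hypothetical.

```python
import pandas as pd

# Hypothetical activity counts, not taken from the IATI data
demo = pd.DataFrame(
    {"Activities": [120, 60000, 75000], "Organisations": [0, 1, 2]},
    index=pd.Index(["Org A", "Org B", "Org C"], name="Publisher Name"),
)

# The comparison yields one True/False value per row
mask = demo["Activities"] > 50000

# .loc keeps the rows where the mask is True and only the listed columns
large = demo.loc[mask, ["Organisations", "Activities"]]
print(large)
```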


We can also quickly perform pivots, just as in Excel. The example below groups publishers by the number of reporting organisations listed in their data and shows the total number of activity files published by each group:

In [5]:
publishers.pivot_table(
    index="Reporting Orgs in Data (count)",  # use the column as the new index
    values="Activity Files",  # calculate the values based on this column...
    aggfunc="sum"  # ... by summing them
)

| Reporting Orgs in Data (count) | Activity Files |
|---|---|
| 0 | 0 |
| 1 | 4273 |
| 2 | 313 |
| 3 | 39 |
| 4 | 7 |
| 10 | 58 |
| 14 | 2 |
| 19 | 269 |
| 21 | 2 |
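
A pivot with a single index column and a sum is the same operation as a group-by followed by a sum. The sketch below shows both forms side by side on a small hypothetical dataframe (`demo` and its values are made up for illustration) so the totals can be checked by hand.

```python
import pandas as pd

# Hypothetical data: five publishers, two distinct reporting-org counts
demo = pd.DataFrame({
    "Reporting Orgs in Data (count)": [1, 1, 2, 2, 2],
    "Activity Files": [3, 4, 1, 1, 6],
})

# Pivot: one row per distinct count, summing the activity files
pivoted = demo.pivot_table(
    index="Reporting Orgs in Data (count)",
    values="Activity Files",
    aggfunc="sum",
)

# Equivalent group-by: same grouping column, same aggregation
grouped = demo.groupby("Reporting Orgs in Data (count)")["Activity Files"].sum()

print(pivoted)
print(grouped)
```

The pivot returns a dataframe and the group-by returns a series, but both hold the totals 7 (for count 1) and 8 (for count 2).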


### Graphs

In [6]:
from bokeh.plotting import figure, show  # plotting functions
from bokeh.io import output_notebook  # allows us to put graphs in notebooks
from bokeh.models import HoverTool, Axis
# note: bokeh.charts has been removed from later Bokeh releases (its code moved to the separate bkcharts package)
from bokeh.charts import Bar
output_notebook(hide_banner=True)

In [12]:
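# bar chart of activity counts for the publishers with more than 50k activities
bar = Bar(
    fifty_k_activitis,
    'Publisher Name',
    values='Activities',
    title="Number of activities by publisher")
# show plain numbers rather than scientific notation on the y axis
yaxis = bar.select(dict(type=Axis, layout="left"))[0]
yaxis.formatter.use_scientific = False
show(bar)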

In [13]:
# tooltip shown when hovering over a point; @{...} refers to a column in the data source
hover = HoverTool(tooltips=[
    ("Publisher", "@{Publisher Name}"), 
    ("Number of Activities", "@{Activities}"),
    ("Total File Size", "@{Total File Size}")])

# scatter plot with logarithmic scales on both axes
p = figure(
    plot_height=600,
    plot_width=700,
    title="Publishers Activities vs File Size",
    tools=[hover],
    y_axis_type="log",
    x_axis_type="log")
# one semi-transparent circle per publisher
p.circle(
    x='Activities',
    y='Total File Size',
    size=15,
    source=publishers,
    fill_color='blue',
    fill_alpha=0.2,
    line_color='#7c7e71',
    line_width=0.5,
    line_alpha=0.8,
)

# yaxis = p.select(dict(type=Axis, layout="left"))[0]
# yaxis.formatter.use_scientific = False

show(p)
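
Zero has no logarithm, so a publisher with zero activities or a zero file size cannot be placed on the log axes used above. The sketch below is one possible way to handle this, assuming it is acceptable to leave such publishers off the chart: filter them out first. The `demo` and `plottable` names and the values are hypothetical; with the real data, the filtered frame could be passed as `source` in place of `publishers`.

```python
import pandas as pd

# Hypothetical rows, including one publisher with nothing published
demo = pd.DataFrame(
    {"Activities": [0, 96, 130920], "Total File Size": [0, 395966, 7969]},
    index=pd.Index(["Org A", "Org B", "Org C"], name="Publisher Name"),
)

# Keep only rows where both plotted columns are strictly positive
plottable = demo[(demo["Activities"] > 0) & (demo["Total File Size"] > 0)]

# Report how many rows would be missing from a log-log plot
print(len(demo) - len(plottable), "row(s) dropped")
```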