# Introduction

In the previous [tutorial](https://github.com/yanoak/data-extractors-tutorials/blob/master/python-for-extractives/01_Intro_to_Python/Intro_to_Python.ipynb), we learned the basics of programming with Python and how to use the pandas library to work with data. We also introduced some programming concepts such as loops and functions.

This tutorial will build on the programming skills that you acquired in the first tutorial. Specifically, we will cover concepts that are the nuts and bolts of programming in any language: loops, decisions and functions.

Of course, we will also use examples from the extractives industry, so that after working through this tutorial you come away with something concrete that you can use in your own work analysing the sector.

The examples at the beginning of this tutorial might not seem directly useful, but trust me: once you master loops, decisions and functions, you will have the skills to write some powerful programs.

# Getting the data

The data that we will use for this tutorial comes from two sources. The first is the same data from UK companies' mandatory disclosures that we used in the last tutorial. The other is EITI (Extractive Industries Transparency Initiative) data that is aggregated by NRGI (Natural Resource Governance Institute).

## Mandatory Disclosures Data

We will be reading in CSV files that are downloaded from the UK Companies House Extractives Website https://extractives.companieshouse.gov.uk/

I have downloaded all the files for all the companies that have reported as of April 2017 for financial year 2015. I've added them all to the 'data' folder of the Github repo that contains this tutorial.

## EITI Data

NRGI has a wonderful data portal called [resourcedata.org](https://www.resourcedata.org/) which gathers a lot of data related to the extractives industry, including EITI data. We will specifically use this [dataset](https://www.resourcedata.org/dataset/eiti-complete-summary-table/resource/bba3b646-131b-47c5-9f4c-37019f896575), which consolidates the summary datasheets of EITI data from all the reporting countries. I have downloaded the dataset and placed it in the 'data' folder in the same Github repo as well.

The easiest way to download all the data files as well as this tutorial is to go to the repo's main page and click on 'Clone or download' to get all the files as a zipped file.

# What we will learn in this tutorial

In this tutorial, we will cover basic programming concepts that anyone who is learning to code has to be familiar with.

It is structured as follows:
1. For loops
    * a simple example
    * looping through EITI data for UK Licence Fees in 2015
2. While loops
    * looping through 1, 2, 3, ... until the running total hits a certain sum
    * looping through sorted payments until the total exceeds a certain sum
3. Using 'if' and 'else' to make decisions
    * a silly example: choosing "a" or "an" for names
4. Writing your own functions
5. Writing a function that summarises EITI payments
6. Writing a function that summarises mandatory disclosure payments

In [7]:
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import glob

# for index, row in df.iterrows():

# 1. For Loops

We have already encountered 'for' loops in tutorial 1. Let's do a quick refresher of what it is.

We use a loop to tell a computer to do something repeatedly. A 'for' loop is one way of writing a loop, and most programming languages have one in some shape or form. In Python, a 'for' loop iterates through a list of things and repeats the lines of code you have given it for each item in that list.


## Silly Example

As a silly first example, let's say you have a list of friends' names, and you want the computer to display the first letter of their names. Let's say your friends include Alice, Bob and Cedric. You want the computer to display:

`Alice's name starts with A`  
`Bob's name starts with B`  
`Cedric's name starts with C`

...and so on for every name in the list. It's silly, I know, but it will help you understand how this stuff works. We will move on to something more useful very soon.

Let's write some code to do that!

In [5]:
friends_list = ["Alice", "Bob", "Cedric", "Dana", "Elizabeth", "Fergie"]

for f in friends_list:
    first_letter = f[0]
    print(f + "'s name starts with " + first_letter)

Alice's name starts with A
Bob's name starts with B
Cedric's name starts with C
Dana's name starts with D
Elizabeth's name starts with E
Fergie's name starts with F


In just a few lines of code, we made the computer display one line of text for every name in the list, however long that list is. Let's break it down.

`friends_list = ["Alice", "Bob", "Cedric", "Dana", "Elizabeth", "Fergie"]` declares a list of strings, same as what we did in tutorial 1. This one is a list of six people's names.

The next chunk declares the loop. The first line of the loop tells the computer what we are looping through. `for f in friends_list:` says that what follows will be the commands you want the program to run repeatedly, once for each element in `friends_list`. The `f` part of the line says that in each iteration, we will refer to the item in the list that we are currently at as `f`. The `:` at the end says that the following lines are the ones you want to run repeatedly.

The lines that follow the first line have to be indented to let the computer know which lines are in the loop. Once you go back to unindented lines of code, the computer will know that those lines are not part of the loop.
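To see what indentation does, here is a small sketch. The `count` variable is made up for this illustration and is not part of the example above. The two indented lines run once per name, while the final unindented line runs only once.

```python
friends_list = ["Alice", "Bob", "Cedric"]
count = 0

for f in friends_list:
    # These two indented lines are inside the loop, so they run once per name
    print("Hello, " + f)
    count = count + 1

# This line is not indented, so it runs only once, after the loop has finished
print("Number of friends greeted: " + str(count))
```

If you indent the last line as well, it becomes part of the loop and is displayed after every greeting.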

In the first iteration of the loop, `f` will be "Alice", then in the next iteration, `f` will be "Bob", and so on.

`first_letter = f[0]` declares a new variable called `first_letter` and assigns it the value `f[0]`. Remember that `f` is equal to `"Alice"` in the first iteration. Python lets you pick out the characters of a string by position, much as if the string were a list. So, imagine the string `"Alice"` as the list `["A", "l", "i", "c", "e"]`. Just like we can use `[]` to refer to things in a dataframe, we can use them to refer to elements of a list. Counting starts at 0, so `f[0]` is the first element, i.e. `"A"`, and `f[2]` is `"i"`, etc.

This means that in the first iteration, `first_letter = f[0]` assigns the value `"A"` to the variable `first_letter`.
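You can try the square brackets on their own, outside of a loop. This is just a quick sketch for experimenting. The `len()` call is an extra that the example above does not use; it tells you how many characters a string has.

```python
f = "Alice"

print(f[0])    # the first character: A
print(f[2])    # the third character: i
print(len(f))  # the number of characters: 5
```

Remember that positions are counted from 0, so the last character of `"Alice"` is `f[4]`, not `f[5]`.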

The function `print()` in the next line is used to display text on the screen. It will display whatever you put inside the `()` as a parameter. The `+` symbol can be used to combine bits of text, or "concatenate strings" in fancy programming speak. So, `print(f + "'s name starts with " + first_letter)` means it will display the value of `f` followed by `"'s name starts with "` followed by the value of `first_letter`.
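One thing to watch out for: `+` only joins text with text. If you want to include a number in a sentence, you first have to turn it into text with `str()`. The sketch below shows both cases. The variable names `name`, `sentence` and `number_of_letters` are made up for this illustration.

```python
name = "Alice"
first_letter = name[0]

# Joining three pieces of text into one sentence
sentence = name + "'s name starts with " + first_letter
print(sentence)

# A number has to be converted to text with str() before it can be joined
number_of_letters = len(name)
print(name + " has " + str(number_of_letters) + " letters")
```

We will use `str()` in exactly this way later on, when we display payment amounts.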

For the first iteration of the loop, `f` is `"Alice"` and `first_letter` is `"A"`. This means the `print()` function will display the line

`Alice's name starts with A`.

Following the same logic, in the next iteration, the program takes the next thing in the list, `"Bob"`, and assigns it to `f`. This means that `first_letter = f[0]` will now make `first_letter` take on the value `"B"`. Thus, the `print(f + "'s name starts with " + first_letter)` function will produce 

`Bob's name starts with B`  

and so on until we have no more items in the list to loop through.

Got it? Go back to the code block above and change `friends_list` to something else: add more names or remove some, then run it again and see what you get.

## Real World Example

Now, let's use a for loop to work on some extractives data. We will use the EITI summary dataset that is in the "company-payments.csv" file inside the "data" folder. We will use a loop to display all the Licence Fees paid to the UK government in 2015.

First we load the CSV file into a dataframe and filter for UK Licence Fee payments in 2015. If you have forgotten how to load and filter dataframes, please refer back to tutorial 1.
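As a quick refresher, here is a sketch of the same kind of filter on a tiny, made-up dataframe called `toy`. The rows are invented for illustration; only the column names are borrowed from the EITI file. Each condition sits in its own parentheses and the conditions are combined with `&`.

```python
import pandas as pd

# A tiny made-up dataframe that mimics three columns of the EITI file
toy = pd.DataFrame({
    "country": ["United Kingdom", "United Kingdom", "Norway"],
    "year": [2015, 2014, 2015],
    "gfs_description": ["Licence fees", "Licence fees", "Royalties"],
})

# Keep only the rows where all three conditions are true
toy_uk_2015 = toy.loc[(toy["country"] == "United Kingdom")
                      & (toy["year"] == 2015)
                      & (toy["gfs_description"] == "Licence fees")]

print(toy_uk_2015)
```

Only the first row passes all three conditions, so `toy_uk_2015` ends up with a single row.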

In [9]:
eiti_data = pd.read_csv('data/company-payments.csv')

In [10]:
eiti_data.head()

Unnamed: 0,created,changed,country,iso3,year,start_date,end_date,company_name,gfs_code,gfs_description,name_of_revenue_stream,currency_code,currency_rate,value_reported,value_reported_as_USD,reporting_url
0,2017-03-11T07:00:46+0000,2017-03-11T07:00:46+0000,Afghanistan,AFG,2009,"Mar 21, 2008","Mar 20, 2009",KUSHAK BROTHERS COMPANY,1415-E1,Royalties,Revenue Stream: 1415E1 [Royalties] - Royalty,AFN,50.39,21950340.0,435609.0,https://eiti.org/api/v1.0/organisation/32872
1,2017-03-11T07:00:46+0000,2017-03-11T07:00:46+0000,Afghanistan,AFG,2009,"Mar 21, 2008","Mar 20, 2009",KUSHAK BROTHERS COMPANY,1415-E5,Other rent payments,Revenue Stream: 1415E5 [Other rent payments] -...,AFN,50.39,1000000.0,19845.0,https://eiti.org/api/v1.0/organisation/32872
2,2017-03-11T07:00:46+0000,2017-03-11T07:00:46+0000,Afghanistan,AFG,2009,"Mar 21, 2008","Mar 20, 2009",KUSHAK BROTHERS COMPANY,1421-E,Sales of goods and services by government units,Revenue Stream: 1421E [Sales of goods and serv...,AFN,50.39,78000.0,1548.0,https://eiti.org/api/v1.0/organisation/32872
3,2017-03-11T07:00:46+0000,2017-03-11T07:00:46+0000,Afghanistan,AFG,2009,"Mar 21, 2008","Mar 20, 2009",NORTHERN COAL ENTERPRISE,1112-E1,"Ordinary taxes on income, profits and capital ...",Revenue Stream: 1112E1 [Ordinary taxes on inco...,AFN,50.39,111217004.0,2207125.0,https://eiti.org/api/v1.0/organisation/32873
4,2017-03-11T07:00:46+0000,2017-03-11T07:00:46+0000,Afghanistan,AFG,2009,"Mar 21, 2008","Mar 20, 2009",NORTHERN COAL ENTERPRISE,1112-E1,"Ordinary taxes on income, profits and capital ...",Revenue Stream: 1112E1 [Ordinary taxes on inco...,AFN,50.39,14842880.0,294560.0,https://eiti.org/api/v1.0/organisation/32873


In [29]:
# The "\" at the end of the first line below is simply used to wrap a line of code
# that is too long so that it's easier to read. It doesn't add any functionality to the code

uk_2015 = eiti_data.loc[(eiti_data['country']=='United Kingdom') & (eiti_data['year']==2015) \
                        & (eiti_data['gfs_description']=="Licence fees")]

In [30]:
uk_2015.head()

Unnamed: 0,created,changed,country,iso3,year,start_date,end_date,company_name,gfs_code,gfs_description,name_of_revenue_stream,currency_code,currency_rate,value_reported,value_reported_as_USD,reporting_url
40759,2017-05-16T07:00:55+0000,2017-05-16T07:00:55+0000,United Kingdom,GBR,2015,"Jan 1, 2015","Dec 31, 2015",BP,114521-E,Licence fees,Revenue Stream: 114521E [Licence fees] - Petro...,GBP,0.673537,3569547.0,5299706.0,https://eiti.org/api/v1.0/organisation/35487
40763,2017-05-16T07:00:55+0000,2017-05-16T07:00:55+0000,United Kingdom,GBR,2015,"Jan 1, 2015","Dec 31, 2015",Centrica Energy E&P,114521-E,Licence fees,Revenue Stream: 114521E [Licence fees] - Petro...,GBP,0.673537,3874951.0,5753140.0,https://eiti.org/api/v1.0/organisation/35488
40764,2017-05-16T07:00:55+0000,2017-05-16T07:00:55+0000,United Kingdom,GBR,2015,"Jan 1, 2015","Dec 31, 2015",Centrica Energy E&P,114521-E,Licence fees,Revenue Stream: 114521E [Licence fees] - Oil &...,GBP,0.673537,127210.0,188869.0,https://eiti.org/api/v1.0/organisation/35488
40768,2017-05-16T07:00:55+0000,2017-05-16T07:00:55+0000,United Kingdom,GBR,2015,"Jan 1, 2015","Dec 31, 2015",ConocoPhillips,114521-E,Licence fees,Revenue Stream: 114521E [Licence fees] - Petro...,GBP,0.673537,4922012.0,7307711.0,https://eiti.org/api/v1.0/organisation/35489
40769,2017-05-16T07:00:55+0000,2017-05-16T07:00:55+0000,United Kingdom,GBR,2015,"Jan 1, 2015","Dec 31, 2015",ConocoPhillips,114521-E,Licence fees,Revenue Stream: 114521E [Licence fees] - Oil &...,GBP,0.673537,907098.0,1346768.0,https://eiti.org/api/v1.0/organisation/35489


Now we have a dataframe called `uk_2015` that only consists of Licence Fee payments to the UK government in 2015.

Let's read through each line of this dataframe in a loop and print out a sentence describing the amount paid and which company paid it. 

Notice that in the past we fed a list into a 'for' loop. Right now we don't have a list, only a dataframe. So what do we do? Luckily, there's a function you can call on a pandas dataframe that turns it into a list-like thing that you can feed into a 'for' loop. It's called `iterrows()`, and for each row it gives you two things: the index of the row and the data from the row itself. This slightly changes the way we declare the 'for' loop. Instead of having only one thing between the 'for' and 'in' (such as `for f in friends_list:`), we have two things separated by a comma. The first one gets the index, and the second one gets the row data.
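Before running this on the real data, it may help to see the shape of such a loop on a tiny, made-up dataframe. The company names and amounts below are invented purely for illustration; only the column names match the EITI file.

```python
import pandas as pd

# Made-up numbers, only to show the shape of the loop
toy = pd.DataFrame({
    "company_name": ["Company A", "Company B"],
    "value_reported_as_USD": [100.0, 250.0],
})

total = 0
for index, row in toy.iterrows():
    # index is the row label; row holds that row's values, looked up by column name
    print(str(index) + ": " + row["company_name"])
    total = total + row["value_reported_as_USD"]

print("Total: " + str(total))
```

Inside the loop, `row["company_name"]` works just like picking a column from a dataframe, except that it gives you the single value for the current row.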

In [47]:
print("In 2015, the following companies paid License fees to the UK government as follows: \n")

total = 0

for index, row in uk_2015.iterrows():
    total = total + row["value_reported_as_USD"]
    print (row['company_name'] + " paid US$ " + str(row["value_reported_as_USD"]) + \
           " under the \n" + row["name_of_revenue_stream"] +"\n")
    
print("In total, the amount paid in License Fees was US$ " + str(total))

In 2015, the following companies paid License fees to the UK government as follows: 

BP paid US$ 5299706.0 under the 
Revenue Stream: 114521E [Licence fees] - Petroleum Licence Fees

Centrica Energy E&P paid US$ 5753140.0 under the 
Revenue Stream: 114521E [Licence fees] - Petroleum Licence Fees

Centrica Energy E&P paid US$ 188869.0 under the 
Revenue Stream: 114521E [Licence fees] - Oil &amp; Gas Authority (OGA) Levy

ConocoPhillips paid US$ 7307711.0 under the 
Revenue Stream: 114521E [Licence fees] - Petroleum Licence Fees

ConocoPhillips paid US$ 1346768.0 under the 
Revenue Stream: 114521E [Licence fees] - Oil &amp; Gas Authority (OGA) Levy

Dana Petroleum paid US$ 722005.0 under the 
Revenue Stream: 114521E [Licence fees] - Petroleum Licence Fees

Eni UK Limited paid US$ 1953347.0 under the 
Revenue Stream: 114521E [Licence fees] - Petroleum Licence Fees

Eni UK Limited paid US$ 205257.0 under the 
Revenue Stream: 114521E [Licence fees] - Oil &amp; Gas Authority (OGA) Levy

ENG