# Extra: 3 Analyse data exercises

Welcome to this Jupyter Notebook! 

This notebook is part of the course Python for Journalists at [datajournalism.com](https://datajournalism.com/watch/python-for-journalists). The data used originally comes from [the Electoral Commission website](http://search.electoralcommission.org.uk/Search?currentPage=1&rows=10&sort=AcceptedDate&order=desc&tab=1&open=filter&et=pp&isIrishSourceYes=false&isIrishSourceNo=false&date=Reported&from=&to=&quarters=2018Q12&rptPd=3617&prePoll=false&postPoll=false&donorStatus=individual&donorStatus=tradeunion&donorStatus=company&donorStatus=unincorporatedassociation&donorStatus=publicfund&donorStatus=other&donorStatus=registeredpoliticalparty&donorStatus=friendlysociety&donorStatus=trust&donorStatus=limitedliabilitypartnership&donorStatus=impermissibledonor&donorStatus=na&donorStatus=unidentifiabledonor&donorStatus=buildingsociety&register=ni&register=gb&optCols=Register&optCols=IsIrishSource&optCols=ReportingPeriodName), but is edited for training purposes. The  edited dataset is available on the course website and its [Github repo](https://github.com/winnydejong/pythonforjournalists). 

This notebook contains some exercises for you to practice your newly learned skills with, after finishing module 3 of the Python for Journalists course. Note: since this is a later-added extra, there is no video to accompany this notebook.

## About Jupyter Notebooks and Pandas
Right now you're looking at a Jupyter Notebook: an interactive, browser-based programming environment. You can use these notebooks to program in R, Julia or Python - as you'll be doing later on. Read more about Jupyter Notebook in the [Jupyter Notebook Quick Start Guide](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). 
  
To analyse our data, we'll be using Python and Pandas. Pandas is an open-source Python library - basically an extra toolkit to go with Python - that is designed for data analysis. Pandas is flexible, easy to use and has lots of useful functions built right in. Read more about Pandas and its features in [the Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/). Because Pandas works in ways similar to both spreadsheets and SQL databases (though the latter won't be discussed in this course), it is beginner friendly. :)  

**Notebook shortcuts**  

Within Jupyter Notebooks, there are some shortcuts you can use. If you'll be using more notebooks for your data analysis in the future, you'll remember these shortcuts soon enough. :) 

* `esc` will take you into command mode
* `a` will insert a cell above
* `b` will insert a cell below
* `shift` + `tab` will show you the documentation for your code
* `shift` + `enter` will run your cell
* `d d` (press `d` twice) will delete a cell

**Pandas dictionary**

* **dataframe**: Pandas speak for a table whose rows are labeled by an index. (By default the index starts at 0.)
* **series**: a one-dimensional, labeled list of values; a single column of a dataframe is a series.
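
To make these two terms concrete, here is a minimal sketch. The table `toy` and its party names and amounts are made up for illustration; they are not part of the course dataset.

```python
import pandas as pd

# A tiny, made-up table: names and amounts are illustrative only.
toy = pd.DataFrame({
    'party': ['Party A', 'Party B', 'Party A'],
    'Value': [100.0, 250.0, 50.0],
})

# Selecting a single column gives you a series that shares the dataframe's index.
values = toy['Value']

print(type(toy).__name__)     # DataFrame
print(type(values).__name__)  # Series
print(list(toy.index))        # [0, 1, 2]
```

Because no index was specified, Pandas numbered the rows itself, starting at 0.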

Before we dive in, a little more about Jupyter Notebooks. Every notebook is made out of cells. A cell can contain either Markdown text - like this one - or code. In a code cell you can execute your code. To see what that means, type the following command in the next cell: `print("hello world")`.

In [1]:
print('hello world')

hello world


# Setup

During the exercises you'll use the Pandas library again. Import Pandas as pd here:

In [4]:
import pandas as pd

Now we need some data to work with; luckily you know how to import results_clean.csv. Don't you? 

In [5]:
df = pd.read_csv('results_clean.csv')

# Explore the data

Before you start with the exercises below, it's a good idea to get to know the data a bit. 

### Dimensions

If you use ``len()``, Pandas will tell you how long your dataframe is; it will give you the number of rows of df.

In [16]:
len(df)

300

In case you'd like to know the number of rows and columns, you can use ``.shape``.

In [17]:
df.shape

(300, 11)

To get the total number of elements in the DataFrame, use the ``.size`` attribute. It will give you the product of the number of rows and the number of columns:


In [19]:
df.size

3300
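
The relationship between these three is easy to check on a small example. The sketch below uses a made-up dataframe called `toy` (an assumption for illustration); for `df` the same arithmetic gives 300 x 11 = 3300.

```python
import pandas as pd

# Made-up dataframe with 3 rows and 2 columns.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_rows, n_cols = toy.shape  # .shape is a (rows, columns) tuple

print(len(toy))   # 3
print(toy.shape)  # (3, 2)
print(toy.size)   # 6, which is 3 rows times 2 columns
```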

### Sample
Take a look at a sample of the dataset using ``.sample()``.

In [10]:
df.sample()

Unnamed: 0.1,Unnamed: 0,RegulatedEntityName,AcceptedDate,DonorName,DonorStatus,Year,Month,Value,RegulatedEntityType,DonorId,CampaigningName
180,180,Labour Party,2017-11-28,Mr Edward John Izzard,Individual,2017,11,10000.0,Political Party,83990,


Note that if you simply use ``.sample()``, you'll take a sample of 1. If you want to take a sample of size N, use ``.sample(N)``.

In [11]:
df.sample(10)

Unnamed: 0.1,Unnamed: 0,RegulatedEntityName,AcceptedDate,DonorName,DonorStatus,Year,Month,Value,RegulatedEntityType,DonorId,CampaigningName
99,99,Liberal Democrats,2017-12-31,Mr Alan Sherwell,Individual,2017,12,1750.0,Political Party,78672,
194,194,Conservative and Unionist Party,2017-11-21,Mr Howard Leigh,Individual,2017,11,4250.0,Political Party,36384,
191,191,Labour Party,2017-11-21,Lord Charles Falconer of Thoroton,Individual,2017,11,833.0,Political Party,83993,
179,179,Conservative and Unionist Party,2017-11-29,Mr Jeremy RS Hunt,Individual,2017,11,1947.9,Political Party,83870,
4,4,Liberal Democrats,2017-12-31,Mr Duncan Greenland,Individual,2017,12,7750.0,Political Party,35403,
158,158,Conservative and Unionist Party,2017-12-05,Mr Andrew D Williams,Individual,2017,12,4350.0,Political Party,83861,
60,60,Liberal Democrats,2017-12-31,Mr Manuel Abellan-San Martin,Individual,2017,12,1560.0,Political Party,83343,
0,0,Plaid Cymru - The Party of Wales,2018-12-19,Mr Alun Ffred Jones,Individual,2018,12,20000.0,Political Party,83318,
160,160,Conservative and Unionist Party,2017-12-04,Mr Raymond Chamberlain,Individual,2017,12,8750.0,Political Party,52210,
202,202,Conservative and Unionist Party,2017-11-16,Mrs Mary Erbrich,Individual,2017,11,60000.0,Political Party,76714,
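
Every run of ``.sample()`` returns different rows. If you want a colleague to see exactly the same sample, you can pass a ``random_state``. A small sketch, using a made-up dataframe `toy` and an arbitrary seed of 42:

```python
import pandas as pd

# Made-up data: 100 rows with the numbers 0 to 99.
toy = pd.DataFrame({'Value': range(100)})

# The same random_state gives the same rows, run after run.
first = toy.sample(5, random_state=42)
second = toy.sample(5, random_state=42)

print(first.equals(second))  # True
```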


### Statistical description

Use ``.describe()`` to look further into the data.

In [12]:
df.describe()

Unnamed: 0.1,Unnamed: 0,Year,Month,Value,DonorId,CampaigningName
count,300.0,300.0,300.0,300.0,300.0,0.0
mean,149.5,2017.003333,10.623333,10488.4004,65556.336667,
std,86.746758,0.057735,2.544873,32153.305738,19466.692599,
min,0.0,2017.0,1.0,600.0,19152.0,
25%,74.75,2017.0,10.0,1800.0,47058.0,
50%,149.5,2017.0,12.0,2500.0,76334.0,
75%,224.25,2017.0,12.0,8124.0,83335.5,
max,299.0,2018.0,12.0,400000.0,84031.0,


# Clean data

Whenever you export a dataframe to a csv, Pandas includes the index unless you explicitly tell it not to. When importing said csv again, the unnamed index returns as a column called 'Unnamed: 0'. Let's remove that column using ``.drop()``.

In [20]:
df.drop(['Unnamed: 0'], # drop 'Unnamed: 0' from dataframe called 'df'
        axis=1, # we want to drop a column (axis=1), not a row (axis=0)
        inplace=True) # we want to drop the column in the original dataset, instead of in a copy

Take a sample to check if it worked.

In [21]:
df.sample()

Unnamed: 0,RegulatedEntityName,AcceptedDate,DonorName,DonorStatus,Year,Month,Value,RegulatedEntityType,DonorId,CampaigningName
229,Conservative and Unionist Party,2017-10-30,Mr Byron S Huson,Individual,2017,10,400000.0,Political Party,83882,
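
If you're curious where an 'Unnamed: 0' column comes from, the sketch below reproduces it on a made-up dataframe `toy`. It writes the csv to a string instead of a file (using `io.StringIO`) purely to keep the example self-contained, and shows two ways to avoid the extra column besides ``.drop()``.

```python
import io
import pandas as pd

# Made-up data, for illustration only.
toy = pd.DataFrame({'party': ['Party A', 'Party B'], 'Value': [100.0, 250.0]})

# Default export: the index is written as an unnamed first column.
with_index = toy.to_csv()
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))  # ['Unnamed: 0', 'party', 'Value']

# Option 1: tell read_csv that the first column is the index.
reloaded_idx = pd.read_csv(io.StringIO(with_index), index_col=0)
print(list(reloaded_idx.columns))  # ['party', 'Value']

# Option 2: leave the index out when exporting.
without_index = toy.to_csv(index=False)
reloaded_clean = pd.read_csv(io.StringIO(without_index))
print(list(reloaded_clean.columns))  # ['party', 'Value']
```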


# Exercises

Like all journalism, most data journalism starts with a question. The difference between journalism and data-driven journalism is that the journalist working the story interviews data tables instead of people.

Therefore all exercises start with a question. In newsrooms people think and talk in questions all the time. So starting with a simple question means that most people following this course start in their comfort zone. Which is a nice place to begin, don't you think? 

**Every exercise follows the same pattern**   
- the question that needs to be answered: formulated to be easily understood for people
- a breakdown of the question: formulated to be easily understood for computers
- hints to help you write the code in which the breakdown results
- and, if you're looking at the completed notebook, the answer

## 1a Total donations per party

### Question
Which party received the most money?

### Breakdown
- for every unique party in the dataset;
    - filter dataframe: only donations to said party;
    - while dataframe is filtered: sum the 'Value' column;
    - store both party name and sum donations (add to list or df);
- create table with all partynames + sum donations;
- sort table from highest to lowest


### Hints
This can be done using:
- ``.unique()``, to create a list with all unique parties
- a for-loop to iterate over all parties from that list
- a filtered dataframe, using ``df[df['column data should be filtered on'] == 'value']``
- ``.sum()``, to sum all donations in filtered dataframe
- ``.append()``, to add data to list
- ``pd.DataFrame()``, to turn a list of lists into a dataframe
- ``sort_values(by='column name', ascending=False)``, to sort data

Remember: this is a notebook. Use as many cells as you please. :) 

In [4]:
# create an empty list to store data in;
# I'm going to save lists to this list;
# creating a list of lists - this will make sense in a bit
# for now, simply create an empty list
data = []
# collect all unique parties from RegulatedEntityName column
# in dataframe df into a variable called uniqueParties
uniqueParties = df['RegulatedEntityName'].unique()

# create a for-loop, so you can iterate over every party in the uniqueParties list
# note that 'i' is a substitute for a different party from the uniqueParties list for every for-loop run
for i in uniqueParties:
    # create a temporary dataframe that only keeps rows from the original df
    # if the RegulatedEntityName is i
    tempDf = df[df['RegulatedEntityName'] == i]
    # sum all donations in tempDf, and store this value in totalDonations variable
    totalDonations = tempDf['Value'].sum()
    # create a list called 'row' with both party name and totalDonations
    row = [i, totalDonations]
    # add the list called row to the list called data
    data.append(row)
    
# note that the indentation ends here: 
# whatever follows does not need to be done for every party...

# we added a list with the partyname and donation total for every party 
# to a list called data, because Pandas lets you make a dataframe from a list of lists
donationsPerParty = pd.DataFrame(data)

# let's add column names, if you don't the columns will be called '0' and '1' etc.
donationsPerParty.columns = ['party', 'donations sum']

# let's sort rows from high to low based on 'donations sum' column
donationsPerParty = donationsPerParty.sort_values(by='donations sum', ascending=False)

# let's have a look
donationsPerParty

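
The for-loop above spells out every step of the breakdown. Once that logic is clear, Pandas' ``.groupby()`` can do the same filter-and-sum in one go. The sketch below is an alternative, not the course's method, and runs on a made-up dataframe `toy` with hypothetical party names:

```python
import pandas as pd

# Made-up donations, for illustration only.
toy = pd.DataFrame({
    'RegulatedEntityName': ['Party A', 'Party B', 'Party A', 'Party C'],
    'Value': [100.0, 250.0, 50.0, 75.0],
})

totals = (
    toy.groupby('RegulatedEntityName')['Value']  # one group per unique party
       .sum()                                    # sum 'Value' within each group
       .sort_values(ascending=False)             # highest total first
       .reset_index(name='donations sum')        # back to a regular table
       .rename(columns={'RegulatedEntityName': 'party'})
)

print(totals)
```

Both routes should lead to the same table; the loop is easier to reason about, the ``groupby`` is shorter.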

## 1b Percentage total donations per party

### Question
What percentage of the total amount of donations went to the Liberal Democrats?

### Breakdown
- calculate the total amount of donations made;
- for every party:
    - calculate what percentage of the total they received;
    - store this percentage
- add all percentages to dataframe
- get percentage for Liberal Democrats

### Hints
This can be done using:
- ``.unique()``, to create a list with all unique parties
- a for-loop to iterate over all parties from that list
- a filtered dataframe, using ``df[df['column data should be filtered on'] == 'value']``
- ``.sum()``, to sum all donations in filtered dataframe
- ``.append()``, to add data to list
- ``pd.DataFrame()``, to turn a list of lists into a dataframe
- ``sort_values(by='column name', ascending=False)``, to sort data

Note: this is 1b, it builds upon the dataframe you just created in exercise 1a. (In case you haven't finished that yet: do so first, or continue to exercise 2.)

Oh, and remember: this is a notebook. Use as many cells as you please. :) 

In [45]:
# what is 100%? Or, what is the total of donations made?
# get total and store in variable totalDonations
totalDonations = donationsPerParty['donations sum'].sum()

# the formula to calculate a percentage is:
# (part/whole) * 100%
# which translates to: the 'donations sum' column (in this case the part),
# divided by the total stored in the totalDonations variable (the whole),
# times 100.

# I want the percentages to be stored in a new column called 'percentage of total donations', 
# which results in the following code: 
donationsPerParty['percentage of total donations'] = (donationsPerParty['donations sum']/totalDonations) *100

In [46]:
# show table 
donationsPerParty

Unnamed: 0,party,donations sum,percentage of total donations
3,Conservative and Unionist Party,2089344.41,66.40175
1,Liberal Democrats,423098.65,13.446558
5,Scottish National Party (SNP),220892.63,7.02022
6,Labour Party,124526.05,3.95758
0,Plaid Cymru - The Party of Wales,121831.56,3.871946
2,UK Independence Party (UKIP),64450.0,2.048295
8,Scottish Green Party,48596.64,1.544457
4,Renew,29480.18,0.936914
9,British National Party,10000.0,0.317811
10,Women's Equality Party,10000.0,0.317811


In [48]:
# sort by percentages, show table
donationsPerParty.sort_values(by='percentage of total donations', ascending=False)

Unnamed: 0,party,donations sum,percentage of total donations
3,Conservative and Unionist Party,2089344.41,66.40175
1,Liberal Democrats,423098.65,13.446558
5,Scottish National Party (SNP),220892.63,7.02022
6,Labour Party,124526.05,3.95758
0,Plaid Cymru - The Party of Wales,121831.56,3.871946
2,UK Independence Party (UKIP),64450.0,2.048295
8,Scottish Green Party,48596.64,1.544457
4,Renew,29480.18,0.936914
9,British National Party,10000.0,0.317811
10,Women's Equality Party,10000.0,0.317811


In [49]:
# only show Liberal Democrats
donationsPerParty[donationsPerParty['party'] == 'Liberal Democrats']

Unnamed: 0,party,donations sum,percentage of total donations
1,Liberal Democrats,423098.65,13.446558


In [50]:
# only show the percentage of total donations that the liberal democrats got
donationsPerParty[donationsPerParty['party'] == 'Liberal Democrats']['percentage of total donations']

1    13.446558
Name: percentage of total donations, dtype: float64
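
The last cell returns a series with one value in it. If you need the bare number, for example to drop it into a sentence, you could combine ``.loc`` with ``.iloc[0]``. A sketch on a made-up table `toy` with hypothetical party names:

```python
import pandas as pd

# Made-up totals, for illustration only.
toy = pd.DataFrame({
    'party': ['Party A', 'Party B', 'Party C'],
    'donations sum': [150.0, 250.0, 100.0],
})

# (part / whole) * 100
toy['percentage of total donations'] = (
    toy['donations sum'] / toy['donations sum'].sum() * 100
)

# .loc picks the row(s) and column; .iloc[0] takes the first (and only) value.
share = toy.loc[toy['party'] == 'Party B', 'percentage of total donations'].iloc[0]

print(round(share, 1))  # 50.0
```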

## 1c Store data without index
As you have seen throughout this course, Pandas includes an index when exporting data, unless told otherwise. 

Let's say you'd like to store the 'donationsPerParty' dataframe as a csv. You know to use ``.to_csv()``. Adding ``index=False`` will give you the same result minus the often useless index. Try it! 

In [51]:
donationsPerParty.to_csv('donations per party, absolute + percentages.csv', index=False)

## 2a count donations throughout the year

### Question
In which month are the most donations (count) made? 

### Breakdown
- Group donations by year and month;
- Use count not sum.

### Hints

This can be done using:
- the ``pivot_table`` command from pandas

When using the ``pivot_table`` command, I always find it helpful to 'design' my desired table. By which I mean filling in the following blanks:   
In the table that answers my question ('In which month are the most donations (count) made?'); there is a row for every _____________________;
    - and a column for every _____________________; and the value in the cells is based on _____________________. 
    
By thinking your pivot table through, the making of it is literally a fill in the blanks exercise. 

In [63]:
# get a sample to see what we're working with
df.sample(4)

Unnamed: 0,RegulatedEntityName,AcceptedDate,DonorName,DonorStatus,Year,Month,Value,RegulatedEntityType,DonorId,CampaigningName
71,Liberal Democrats,2017-12-31,Mrs Sunita Gordon,Individual,2017,12,1548.0,Political Party,76659,
176,Renew,2017-11-30,Mr Richard Christopher Breen,Individual,2017,11,12971.08,Political Party,83311,
39,Liberal Democrats,2017-12-31,Mr Alistair Barr,Individual,2017,12,5000.0,Political Party,47844,
154,Conservative and Unionist Party,2017-12-07,Mr Michael Slade,Individual,2017,12,5000.0,Political Party,36409,


The ``.pivot_table()`` method looks like this: 

``df.pivot_table(index="column(s) you want to use as index; use the 'for every row'-blank", 
                 columns="for every unique value in this column, there will be a column made
                           in your pivot table; use the 'column for every'-blank",
                           values="whatever you fill in here will populate the cells; use the 'value in cells'-blank",
                           aggfunc='count')``

By filling out the blanks before, you will probably be able to fill in the ``.pivot_table()`` method. Note: you don't need to define the columns...

In [67]:
df.pivot_table(index=['Year', 'Month'],
               values="Value",
               aggfunc='count')

Unnamed: 0_level_0,Unnamed: 1_level_0,Value
Year,Month,Unnamed: 2_level_1
2017,1,6
2017,2,2
2017,3,2
2017,4,6
2017,5,10
2017,6,6
2017,7,2
2017,8,4
2017,9,1
2017,10,39
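
To see the fill-in-the-blanks idea at work on something small, here is a sketch using a made-up dataframe `toy`. It also uses ``.idxmax()``, which is not covered in the course, to pull out the busiest month directly instead of reading it off the table.

```python
import pandas as pd

# Made-up donations: two in October 2017, three in December 2017, one in December 2018.
toy = pd.DataFrame({
    'Year':  [2017, 2017, 2017, 2017, 2017, 2018],
    'Month': [10, 10, 12, 12, 12, 12],
    'Value': [500.0, 1500.0, 100.0, 200.0, 50.0, 300.0],
})

# A row for every (Year, Month); the cells count the 'Value' entries.
counts = toy.pivot_table(index=['Year', 'Month'],
                         values='Value',
                         aggfunc='count')
print(counts)

# .idxmax() returns the index label of the highest count.
busiest = counts['Value'].idxmax()
print(busiest)  # the (Year, Month) pair with the most donations
```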


## 2b sum donations throughout the year 

### Question
In which month is the most money (sum donations) donated? 

### Breakdown
- Group donations by year and month;
- Use sum, not count.

### Hints

Similar to 2a, use the ``pivot_table`` command from pandas.

Again, when using the ``pivot_table`` command, I always find it helpful to 'design' my desired table. By which I mean filling in the following blanks:   
In the table that answers my question ('In which month is the most money (sum donations) donated?'); there is a row for every _____________________;
    - and a column for every _____________________; and the value in the cells is based on _____________________. 
    
By thinking your pivot table through, the making of it is literally a fill in the blanks exercise. 

But, there's a difference. Instead of counting, we now want to sum values in our pivot table. To do that, I'll be using the sum function from numpy, a different Python library. 

Before we can get to it, we need to import numpy using ``import numpy as np``.

In [68]:
import numpy as np

With that out of our way, we can now go ahead and create a pivot_table that sums our data...

Use the following template: 

As before, 'design' the table first by filling in the numbered blanks:   
In the table that answers my question ('In which month is the most money (sum donations) donated?'); there is a row for every ________ (1) _____________;
    - and a column for every _________ (2) ____________; and the value in the cells is based on ________ (3) _____________. 

``df.pivot_table(index="column or list of columns you want to use as index (1)", 
                 columns="column or list of columns you want to use as columns (2)",
                           values="columns used to populate cells in pivot table (3)",
                           aggfunc=np.sum)``

In [69]:
df.pivot_table(index=['Year', 'Month'],
               values="Value",
               aggfunc=np.sum)

Unnamed: 0_level_0,Unnamed: 1_level_0,Value
Year,Month,Unnamed: 2_level_1
2017,1,14000.0
2017,2,3500.0
2017,3,3150.0
2017,4,20900.0
2017,5,29500.0
2017,6,35900.0
2017,7,4000.0
2017,8,9266.0
2017,9,1875.0
2017,10,1263257.27


Do you remember how to use ``sort_values()`` to figure out when the most money was donated? 

In [71]:
df.pivot_table(index=['Year', 'Month'],
               values="Value",
               aggfunc=np.sum
              ).sort_values(by='Value',
                            ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,Value
Year,Month,Unnamed: 2_level_1
2017,12,1265256.9
2017,10,1263257.27
2017,11,475914.95
2017,6,35900.0
2017,5,29500.0
2017,4,20900.0
2018,12,20000.0
2017,1,14000.0
2017,8,9266.0
2017,7,4000.0
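
Just like ``aggfunc='count'`` in 2a, ``pivot_table`` also accepts the string ``'sum'``. Recent Pandas versions appear to prefer the string and may show a warning when given ``np.sum``, so the sketch below uses the string. The dataframe `toy` is made up for illustration.

```python
import pandas as pd

# Made-up donations, for illustration only.
toy = pd.DataFrame({
    'Year':  [2017, 2017, 2017, 2018],
    'Month': [10, 10, 12, 12],
    'Value': [500.0, 1500.0, 100.0, 300.0],
})

# Same template as before, with the string 'sum' as aggregate function.
sums = toy.pivot_table(index=['Year', 'Month'],
                       values='Value',
                       aggfunc='sum')

# Highest total first
print(sums.sort_values(by='Value', ascending=False))

# Or ask for the top (Year, Month) pair directly
top_month = sums['Value'].idxmax()
print(top_month)
```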


## 3a most money donated per person

### Question
Who are the top 10 people who donated the most money? 

### Breakdown
- group data by DonorName;
- sum Values for every DonorName;
- sort Values from highest to lowest;
- only print top 10

### Hints
This can be done using:
- ``pivot_table()`` and ``np.sum``
- ``.sort_values()``
- ``.head()``

To keep things easier to follow, I'll build upon my answer. First; let's create the pivot table. Read comments below for explanation.

In [81]:
# create a pivot table based on dataframe called df;
df.pivot_table(index='DonorName', # use 'DonorName' column as index
          values='Value', # use 'Value' column to populate cells in pivot table
          aggfunc=np.sum # use aggregate function sum to get totals
              )

Unnamed: 0_level_0,Value
DonorName,Unnamed: 1_level_1
Baroness Barbara Janke,1600.00
Baroness Emma Nicholson,2499.99
Baroness Kathryn Parminter,2400.00
Baroness Shirley Williams,1700.00
Christopher Williams,1506.37
Clive Hollick,20000.00
Cllr Ian Shires,1920.00
Cllr Joe Harris,1557.25
Dr Alun Griffiths,1800.00
Dr Arujuna Sivananthan,2499.00


Note how the notebook automatically truncates your table? 

You can use the following Pandas options to change that:  
``pd.set_option('display.max_columns', None)``  no limit to columns  
``pd.set_option('display.max_rows', None)`` no limit to rows  
``pd.set_option('display.max_colwidth', None)``  no truncation if column is wide  

In [82]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

Let's try again to see if the truncation is gone...


In [83]:
# create a pivot table based on dataframe called df;
df.pivot_table(index='DonorName', # use 'DonorName' column as index
          values='Value', # use 'Value' column to populate cells in pivot table
          aggfunc=np.sum # use aggregate function sum to get totals
              )

Unnamed: 0_level_0,Value
DonorName,Unnamed: 1_level_1
Baroness Barbara Janke,1600.0
Baroness Emma Nicholson,2499.99
Baroness Kathryn Parminter,2400.0
Baroness Shirley Williams,1700.0
Christopher Williams,1506.37
Clive Hollick,20000.0
Cllr Ian Shires,1920.0
Cllr Joe Harris,1557.25
Dr Alun Griffiths,1800.0
Dr Arujuna Sivananthan,2499.0
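
``pd.set_option()`` changes the display settings for the rest of your session. If you only want one table shown in full, ``pd.option_context()`` applies the settings temporarily. A sketch, assuming Pandas 1.0 or later (where ``None`` means 'no limit' for ``display.max_colwidth``) and a made-up dataframe `toy`:

```python
import pandas as pd

# Made-up row with a deliberately long name, for illustration only.
toy = pd.DataFrame({
    'DonorName': ['A fairly long, made-up donor name ' * 3],
    'Value': [1.0],
})

# Remember the current setting so we can compare afterwards.
default_colwidth = pd.get_option('display.max_colwidth')

# Inside the with-block nothing is truncated...
with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.max_colwidth', None):
    print(toy)

# ...and outside of it, the previous setting is back.
print(pd.get_option('display.max_colwidth') == default_colwidth)  # True
```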


The pivot table is not sorted yet... Let's use ``.sort_values()``

In [85]:
# create a pivot table based on dataframe called df;
df.pivot_table(index='DonorName', # use 'DonorName' column as index
          values='Value', # use 'Value' column to populate cells in pivot table
          aggfunc=np.sum # use aggregate function sum to get totals
              ).sort_values(by='Value', # sort values in pivot table by Value
                           ascending=False) # descending, highest to lowest

Unnamed: 0_level_0,Value
DonorName,Unnamed: 1_level_1
Mr Byron S Huson,400000.0
Mr Michael Davis,272000.0
Mr Ian McNish,217732.63
Lord Stanley Fink,111600.0
Ms Jane Mactaggart,108809.89
Mr Ian R Taylor,100000.0
Ms Lesley Jackson,100000.0
Mrs Mary Erbrich,60000.0
Mr Malcolm Bluemel,57000.0
Mr Michael A Dangoor,56600.0


OK, sorted pivot table. Well done! But we only need the top 10... Let's limit our pivot table using ``.head()``

In [86]:
# create a pivot table based on dataframe called df;
df.pivot_table(index='DonorName', # use 'DonorName' column as index
          values='Value', # use 'Value' column to populate cells in pivot table
          aggfunc=np.sum # use aggregate function sum to get totals
              ).sort_values(by='Value', # sort values in pivot table by Value
                           ascending=False # descending, highest to lowest
                           ).head(10)

Unnamed: 0_level_0,Value
DonorName,Unnamed: 1_level_1
Mr Byron S Huson,400000.0
Mr Michael Davis,272000.0
Mr Ian McNish,217732.63
Lord Stanley Fink,111600.0
Ms Jane Mactaggart,108809.89
Mr Ian R Taylor,100000.0
Ms Lesley Jackson,100000.0
Mrs Mary Erbrich,60000.0
Mr Malcolm Bluemel,57000.0
Mr Michael A Dangoor,56600.0


See how all donor names are in bold? That's because in the pivot_table statement, we've declared 'DonorName' our index. We can reset the index using... *drumroll please*... ``.reset_index()``

In [87]:
# create a pivot table based on dataframe called df;
df.pivot_table(index='DonorName', # use 'DonorName' column as index
          values='Value', # use 'Value' column to populate cells in pivot table
          aggfunc=np.sum # use aggregate function sum to get totals
              ).sort_values(by='Value', # sort values in pivot table by Value
                           ascending=False # descending, highest to lowest
                           ).head(10 # only show first 10 rows (head)
                                 ).reset_index() # reset index to have a regular index instead of DonorName

Unnamed: 0,DonorName,Value
0,Mr Byron S Huson,400000.0
1,Mr Michael Davis,272000.0
2,Mr Ian McNish,217732.63
3,Lord Stanley Fink,111600.0
4,Ms Jane Mactaggart,108809.89
5,Mr Ian R Taylor,100000.0
6,Ms Lesley Jackson,100000.0
7,Mrs Mary Erbrich,60000.0
8,Mr Malcolm Bluemel,57000.0
9,Mr Michael A Dangoor,56600.0


Note that since computers start counting at zero, the index ends at 9. :)
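
The pivot, sort, head and reset steps can also be written with ``.groupby()`` and ``.nlargest()``. This is an alternative sketch, not the course's method, on a made-up dataframe `toy` with hypothetical donor names; it asks for a top 2 instead of a top 10 because the toy data is so small.

```python
import pandas as pd

# Made-up donations, for illustration only.
toy = pd.DataFrame({
    'DonorName': ['Donor A', 'Donor B', 'Donor A', 'Donor C'],
    'Value': [100.0, 250.0, 200.0, 75.0],
})

top = (
    toy.groupby('DonorName')['Value']  # one group per donor
       .sum()                          # total per donor
       .nlargest(2)                    # keep the 2 highest totals, sorted
       .reset_index()                  # regular index instead of DonorName
)

print(top)
```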

## 3b most often donated

### Question
Who are the top 10 people who donated most often? 

### Breakdown
- group data by DonorName;
- count occurrences for every DonorName;
- sort Values from highest to lowest;
- only print top 10

### Hints
This can be done using:
- ``pivot_table()`` with ``aggfunc='count'``
- ``.sort_values()``
- ``.head()``

Since the table contains both a DonorId and a DonorName column, I'll create pivot tables using both, hoping not to see any differences. (In real life, differences would mean further investigation, since id columns are usually unique identifiers...)

In [89]:
# create a pivot table based on dataframe called df;
df.pivot_table(index='DonorName', # use 'DonorName' column as index
          values='Value', # use 'Value' column to populate cells in pivot table
          aggfunc='count' # use aggregate function 'count' to count the number of occurrences
              ).sort_values(by='Value', # sort values in pivot table by Value
                           ascending=False # descending, highest to lowest
                           ).head(10 # only show first 10 rows (head)
                                 ).reset_index() # reset index to have a regular index instead of DonorName

Unnamed: 0,DonorName,Value
0,Ms Jane Mactaggart,9
1,Mr Duncan Greenland,8
2,Lord Charles Falconer of Thoroton,6
3,Mr Malcolm Bluemel,5
4,Mr Mark Petterson,4
5,Lord D Stevens of Ludgate,3
6,Scirard Lancelyn Green,3
7,Professor Tim Congdon,3
8,Mr Ian Pirie,3
9,Ms Jean Lambert MEP,3
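
For a quick count of how often each name occurs, ``.value_counts()`` is a handy shortcut. One difference to keep in mind: it counts rows per name, while the pivot table above counts filled-in 'Value' cells, so the two could differ if some values are missing. A sketch on a made-up dataframe `toy`:

```python
import pandas as pd

# Made-up donations, for illustration only.
toy = pd.DataFrame({
    'DonorName': ['Donor A', 'Donor B', 'Donor A', 'Donor C', 'Donor A', 'Donor B'],
    'Value': [100.0, 250.0, 200.0, 75.0, 10.0, 20.0],
})

# Counts per name, sorted from most to least frequent; keep the top 2.
most_frequent = toy['DonorName'].value_counts().head(2)

print(most_frequent)
```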


In [93]:
# create a pivot table based on dataframe called df;
df.pivot_table(index=['DonorId', 'DonorName'], # use 'DonorId' and 'DonorName' columns as index
          values='Value', # use 'Value' column to populate cells in pivot table
          aggfunc='count' # use aggregate function 'count' to count the number of occurrences
              ).sort_values(by='Value', # sort values in pivot table by Value
                           ascending=False # descending, highest to lowest
                           ).head(10 # only show first 10 rows (head)
                                 ).reset_index() # reset index to have a regular index instead of DonorId and DonorName

Unnamed: 0,DonorId,DonorName,Value
0,83993,Lord Charles Falconer of Thoroton,6
1,55995,Mr Mark Petterson,4
2,83323,Mr Malcolm Bluemel,4
3,34382,Ms Jean Lambert MEP,3
4,38170,Professor Tim Congdon,3
5,74686,Mr Ian Pirie,3
6,48879,Lord D Stevens of Ludgate,3
7,77944,Mr Dominic R Johnson,3
8,76334,Scirard Lancelyn Green,3
9,55988,Ms Jane Mactaggart,2


What happened? The top 10 changed... The most likely scenarios: there are multiple people with the same name (several women named Ms Jane Mactaggart, more than one man named Mr Duncan Greenland, etc.); or Mr Duncan Greenland (and the others in the bunch) has multiple DonorIds; or a combination of the two. 
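
One possible first step in that investigation is to list every name that comes with more than one DonorId. The sketch below does that with ``.groupby()`` and ``.nunique()`` on a made-up dataframe `toy`; the ids and names are hypothetical, and the result can't tell you which of the scenarios applies, only where to start digging.

```python
import pandas as pd

# Made-up data: 'Donor A' appears under two different ids.
toy = pd.DataFrame({
    'DonorId':   [1, 2, 3, 3, 4],
    'DonorName': ['Donor A', 'Donor A', 'Donor B', 'Donor B', 'Donor C'],
})

# Count the number of different DonorIds per DonorName.
ids_per_name = toy.groupby('DonorName')['DonorId'].nunique()

# Keep only the names that have more than one id.
suspects = ids_per_name[ids_per_name > 1]

print(suspects)
```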

If this was a story you'd be working on in the newsroom, you'd have your work cut out for you. :) 