# MP Expenses Project v2.0
by Darren Christie 
Created October 2020

This is a project notebook that looks at MP Expenses. This version of the project uses a sqlite3 database instead of
a csv file.
The notebook compares a single MP (which will probably be your local MP) with the expenses of all MPs.

## The Data
Data was obtained from the [IPSA website](https://www.theipsa.org.uk/mp-costs/annual-publication/). Starting from 2010/2011, csv files of individual claims were downloaded for each reported year. The datasets were downloaded on 6/6/2020.
These csv files can be found in the data/raw folder.

The following awk command was used to merge the individual csv files into a combined csv file with a single header.

`awk '(NR == 1) || (FNR > 1)' Individual*.csv > combined_claims.csv`

FNR is the record (line) number within the current file, while NR is the record number across all files. The condition therefore keeps the very first line read (the header of the first file) plus every line after the first in each file, so the repeated headers are dropped.
I cannot take credit for the above awk command. It comes from an answer by Marek Grac on the StackExchange website (accessed on 14/5/2020).
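
For readers without awk, the same merge can be sketched in Python. This is only an illustration and not the method used for this project: the function name `merge_csv_files`, the UTF-8 encoding and the tiny sample files are all assumptions.

```python
import csv
import tempfile
from pathlib import Path


def merge_csv_files(input_paths, output_path):
    """Concatenate CSV files, writing the header row only once."""
    header_written = False
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for path in input_paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)  # first row of each file is its header
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                writer.writerows(reader)  # remaining rows are data


# Demonstration on throwaway files (made-up rows, not the real IPSA downloads).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "Individual_2010.csv").write_text(
        "Date,Amount Paid\n01/04/2010,10.00\n", encoding="utf-8")
    (tmp_dir / "Individual_2011.csv").write_text(
        "Date,Amount Paid\n01/04/2011,20.00\n", encoding="utf-8")
    combined = tmp_dir / "combined_claims.csv"
    merge_csv_files(sorted(tmp_dir.glob("Individual*.csv")), combined)
    merged_lines = combined.read_text(encoding="utf-8").splitlines()

print(merged_lines)
```

Unlike the line-based awk command, the `csv` module parses each record, so a quoted field containing a line break would still be treated as one row.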

The combined csv file can be found in the data/processed folder.
This combined csv file was then imported into a sqlite3 database.
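
The notebook does not record how the import was done. One possible way, sketched below with the standard library only, is to create the table from the csv header and insert every row as text. The helper name `load_csv_into_sqlite` and the sample rows are hypothetical; the table name `expenses` and the column names are the ones the queries later in this notebook use.

```python
import csv
import io
import sqlite3


def load_csv_into_sqlite(csv_file, connection, table="expenses"):
    """Create `table` from the CSV header and insert every data row."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Quote identifiers so names containing spaces or apostrophes are accepted.
    columns = ", ".join('"' + name.replace('"', '""') + '"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    connection.execute(f'CREATE TABLE "{table}" ({columns})')
    connection.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    connection.commit()


# Made-up sample standing in for data/processed/combined_claims.csv.
sample_csv = io.StringIO(
    "Date,MP's Name,Category,Expense Type,Amount Paid\n"
    "2010-05-06,Example MP,Office Costs,Stationery,12.50\n"
    "2010-06-01,Example MP,MP Travel,Rail,30.00\n"
)
connection = sqlite3.connect(":memory:")  # in-memory database for the demo
load_csv_into_sqlite(sample_csv, connection)
row_count = connection.execute("SELECT COUNT(*) FROM expenses").fetchone()[0]
connection.close()
print(row_count)
```

Every value is stored as text in this sketch, which is consistent with the notebook converting the Date and Amount Paid columns after the data has been read back.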

In [1]:
# our standard import for our projects
import warnings
warnings.simplefilter('ignore', FutureWarning)

import matplotlib
import matplotlib.pyplot as plot
matplotlib.rcParams['axes.grid'] = True # show gridlines by default

# tells Jupyter to display all charts inside this notebook, immediately after each call to plot()
%matplotlib inline

import datetime as dt
import numpy as np
import sqlite3 as lite

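# import the module itself, because later cells call pandas.read_sql_query and pandas.set_option,
# plus the individual names that are used without the pandas prefix
import pandas
from pandas import DataFrame, to_datetime, to_numeric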

In [2]:
# suppress scientific notation globally
# taken from https://stackoverflow.com/questions/21137150/format-suppress-scientific-notation-from-python-pandas-aggregation-results
pandas.options.display.float_format = '{:.2f}'.format

### function definitions used elsewhere in the notebook

In [3]:
# a function to calculate the range
# this works with a groupby function call
# a modified version of code found at http://www.pybloggers.com/2018/12/python-pandas-groupby-tutorial/
def stat_range(df):
    rang = df.max() - df.min()
    
    return rang

## Clean up data
This step now happens further down, once the results have come back from the SQL queries.

## Assumptions about the data
* that the MPs' expenses year runs from 1st April to 31st March (the period this notebook calls a tax year).
* that negative values in the Amount Paid and Claimed columns mean that the MP has had to pay money back. I have raised a query with the IPSA via social media to confirm this one way or the other.
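
The first assumption can be written down as a small helper that maps a date to its year label. This is a sketch only; the name `tax_year_label` is hypothetical and is not used elsewhere in the notebook.

```python
import datetime as dt


def tax_year_label(day):
    """Return the 'YYYY/YYYY' label of the 1 April to 31 March year containing `day`."""
    # Dates in January to March belong to the year that started the previous April.
    start_year = day.year if day.month >= 4 else day.year - 1
    return f"{start_year}/{start_year + 1}"


first_day = tax_year_label(dt.date(2010, 4, 1))
last_day = tax_year_label(dt.date(2011, 3, 31))
next_year = tax_year_label(dt.date(2011, 4, 1))
print(first_day, last_day, next_year)
```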

## Processing

### Set the MP we are looking at
The MP whose expenses we want to investigate.
If you are unsure who your MP is, you can find out at [FindYourMP](https://members.parliament.uk/FindYourMP). Enter your postcode and it will tell you who your MP is.
**NOTE:** The name of your MP must match exactly how it appears in the csv file/dataframe, otherwise nothing will be found.

In [4]:
LOCALMP = "Stephen Barclay"

### Set some other constants that we will use throughout the notebook

In [5]:
STARTTAXYEAR = 2010
ENDTAXYEAR = 2020

### Create the connection to our database

In [6]:
sqlCon = lite.connect('data/processed/mpexpenses.db')

### Build our sql query to retrieve all of the local MP's data from the database

In [7]:
sqlQuery = f"select \"Date\",\"Category\",\"Expense Type\",\"Amount Paid\" from expenses where \"MP's Name\" = \"{LOCALMP}\""
print (sqlQuery)

select "Date","Category","Expense Type","Amount Paid" from expenses where "MP's Name" = "Stephen Barclay"


### Execute our query and get the results into a dataframe

In [8]:
df = pandas.read_sql_query(sqlQuery,sqlCon)
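
An alternative to building the query with an f-string is to pass the MP's name as a query parameter, so that any quote characters in the name cannot change the SQL. The sketch below uses an in-memory database with made-up rows, and the variable `local_mp` stands in for `LOCALMP`; it is an illustration and not what the notebook ran.

```python
import sqlite3
from contextlib import closing

local_mp = "Example MP"  # stand-in for LOCALMP

# The ? placeholder is filled in by sqlite3, not by string formatting.
query = """SELECT "Date", "Category", "Expense Type", "Amount Paid"
           FROM expenses WHERE "MP's Name" = ?"""

with closing(sqlite3.connect(":memory:")) as con:
    con.execute(
        """CREATE TABLE expenses
           ("Date", "MP's Name", "Category", "Expense Type", "Amount Paid")""")
    con.executemany(
        "INSERT INTO expenses VALUES (?, ?, ?, ?, ?)",
        [
            ("2010-05-06", "Example MP", "Office Costs", "Stationery", 12.5),
            ("2010-05-07", "Another MP", "MP Travel", "Rail", 30.0),
        ],
    )
    rows = con.execute(query, (local_mp,)).fetchall()

print(rows)
```

`pandas.read_sql_query` accepts the same values through its `params` argument, for example `pandas.read_sql_query(query, sqlCon, params=(LOCALMP,))`.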

### Build our sql query to retrieve every MP's expenses data from the database

In [24]:
sqlQuery = f"select \"Date\",\"MP's Name\",\"Category\",\"Expense Type\",\"Amount Paid\" from expenses"
print (sqlQuery)

select "Date","MP's Name","Category","Expense Type","Amount Paid" from expenses


### Execute our second query and get the results into a dataframe

In [26]:
allDF = pandas.read_sql_query(sqlQuery,sqlCon)

### Set the data type of a couple of the columns and create an index for the dataframes

In [27]:
# correct column types
df['Date'] = to_datetime(df['Date'])
df['Amount Paid'] = to_numeric(df['Amount Paid'])
allDF['Date'] = to_datetime(allDF['Date'])
allDF['Amount Paid'] = to_numeric(allDF['Amount Paid'])
# set the Date column to our index
df.index = df['Date']
df = df.sort_index()
allDF.index = allDF['Date']
allDF = allDF.sort_index()

In [28]:
# The next lines of code remove negative values, which I have assumed mean that the MP has repaid money
# to the IPSA.
newdf = df[df['Amount Paid'] > 0]
newAllDF = allDF[allDF['Amount Paid'] > 0]

### Start to produce some analysis based on the data retrieved

#### Create yearly detailed and summary dataframes

In [13]:
localMPYearlyDetail = DataFrame()
localMPYearlySummary = DataFrame()
currTaxYear = STARTTAXYEAR

# loop round and extract each tax year's summary data at the two levels we are interested in
for counter in range (0,(ENDTAXYEAR - STARTTAXYEAR)):
    
    # create our tax year index i.e. 2010/2011
    tempIndex = str(currTaxYear)+'/'+str(currTaxYear+1)
    
    # extract the data from the dataframe that falls in the current tax year
    tempDF = newdf.loc[dt.datetime(currTaxYear,4,1):dt.datetime(currTaxYear+1,3,31)]
    
    # generate our summary stats based on the category
    yearlyCategorySummary = tempDF.groupby('Category')['Amount Paid'].agg(['sum','mean', 'median', 'max', 'min',stat_range,'std'])
    yearlyCategorySummary['Tax Year'] = tempIndex # add the tax year as a column to the dataframe
    # add the stats we generated to the dataframe (DataFrame.append was removed in pandas 2.0, so use concat)
    localMPYearlySummary = pandas.concat([localMPYearlySummary, yearlyCategorySummary])
    
    # generate our detailed stats based on the category and the expense type within each category
    yearlyDetailSummary = tempDF.groupby(['Category','Expense Type'])['Amount Paid'].agg(['sum','mean', 'median', 'max', 'min',stat_range,'std'])
    yearlyDetailSummary['Tax Year'] = tempIndex # add the tax year as a column to the dataframe
    # add the stats we generated to the dataframe
    localMPYearlyDetail = pandas.concat([localMPYearlyDetail, yearlyDetailSummary])
    
    currTaxYear += 1 # move to next tax year
    

# turn into a multiindex dataframe
localMPYearlySummary.reset_index(level=0, inplace=True)
#localMPYearlySummary.set_index(['Tax Year','Category'],inplace=True)
localMPYearlySummary.set_index(['Category','Tax Year'],inplace=True)
localMPYearlySummary.sort_index(inplace=True)

localMPYearlyDetail.reset_index(level=[0,1], inplace=True)
localMPYearlyDetail.set_index(['Category','Expense Type','Tax Year'],inplace=True)
localMPYearlyDetail.sort_index(inplace=True)

localMPYearlyDetail = localMPYearlyDetail.round(decimals=2) # round to 2 decimal places
localMPYearlySummary = localMPYearlySummary.round(decimals=2) # round to 2 decimal places
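
The year-by-year loop above could also be expressed as a single groupby on a derived tax year column. The sketch below shows the idea on a few made-up rows. It is not a drop-in replacement: it does not restrict the data to `STARTTAXYEAR`..`ENDTAXYEAR`, and the names `claims`, `start_year` and `summary` are illustrative.

```python
import pandas as pd

# Made-up rows purely for illustration; they are not taken from the IPSA data.
claims = pd.DataFrame({
    "Date": pd.to_datetime(["2010-04-01", "2011-03-31", "2011-04-01", "2011-06-15"]),
    "Category": ["Office Costs", "Office Costs", "Office Costs", "Staffing"],
    "Amount Paid": [100.0, 50.0, 25.0, 200.0],
})

# A date before 1 April belongs to the year that started the previous April.
start_year = claims["Date"].dt.year - (claims["Date"].dt.month < 4).astype(int)
claims["Tax Year"] = start_year.astype(str) + "/" + (start_year + 1).astype(str)

# One pass over the data gives the same statistics per category and tax year.
summary = (
    claims.groupby(["Category", "Tax Year"])["Amount Paid"]
    .agg(["sum", "mean", "median", "max", "min", "std"])
)
summary["stat_range"] = summary["max"] - summary["min"]
summary = summary.round(2).sort_index()
print(summary)
```

Grouping by `['Category', 'Expense Type', 'Tax Year']` instead would give the detailed view in the same way.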

In [22]:
#localMPYearlyDetail

In [23]:
#localMPYearlySummary 

### Total expenses claimed for all time

In [17]:
print (f"Between 1st April {STARTTAXYEAR} and 31st March {ENDTAXYEAR} MP {LOCALMP} claimed £{newdf['Amount Paid'].sum()} in expenses.")

Between 1st April 2010 and 31st March 2020 MP Stephen Barclay claimed £1445649.29 in expenses.


In [36]:
# calculate the total expenses for each MP and sort them, this returns a series
groupedExpenses = newAllDF.groupby('MP\'s Name')
allMPTotalClaim = groupedExpenses['Amount Paid'].sum().sort_values(ascending=False)

# find where our MP is in that sorted series (1 = highest total claim)
# get_loc raises a KeyError if LOCALMP does not match a name in the data
place = allMPTotalClaim.index.get_loc(LOCALMP) + 1

print(f'This placed them {place} highest out of {allMPTotalClaim.count()} MPs.')
print('This number of MPs includes past and present. Basically anyone who has been an MP and made a claim in that period.')

This placed them 177 highest out of 954 MPs.
This number of MPs includes past and present. Basically anyone who has been an MP and made a claim in that period.


In [19]:
print ("This is that total amount broken down by year.")
localMPYearlySummary['sum'].unstack().agg(sum)

This is that total amount broken down by year.


Tax Year
2010/2011   121265.93
2011/2012   163031.11
2012/2013   155172.73
2013/2014   156460.53
2014/2015   157243.01
2015/2016   127967.94
2016/2017   179408.72
2017/2018   181479.61
2018/2019   171405.47
2019/2020    26119.74
dtype: float64

In [12]:
print ("This is a breakdown of that figure by category claimed between those dates.")
newdf.groupby('Category')['Amount Paid'].sum()

This is a breakdown of that figure by category claimed between those dates.


Category
Accommodation               146634.07
Dependant Travel              3051.99
MP Travel                    70831.82
Miscellaneous Expenses        2910.68
Office Costs                201886.41
Office Costs Expenditure       837.60
Staff Travel                 12307.91
Staffing                   1007188.81
Name: Amount Paid, dtype: float64

### Yearly Summary By Category

In [20]:
localMPYearlySummary['sum'].unstack().fillna(0.0)

| Category | 2010/2011 | 2011/2012 | 2012/2013 | 2013/2014 | 2014/2015 | 2015/2016 | 2016/2017 | 2017/2018 | 2018/2019 | 2019/2020 |
|---|---|---|---|---|---|---|---|---|---|---|
| Accommodation | 15895.39 | 25263.58 | 23751.26 | 19451.65 | 10653.9 | 8996.6 | 10382.99 | 9540.7 | 13248.0 | 6550.0 |
| Dependant Travel | 247.86 | 1423.89 | 1380.24 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| MP Travel | 2669.68 | 2830.6 | 2671.57 | 4039.64 | 4034.64 | 8241.01 | 10546.39 | 10540.05 | 12284.16 | 12775.56 |
| Miscellaneous Expenses | 1224.96 | 0.0 | 0.0 | 408.0 | 89.4 | 0.0 | 780.0 | 0.0 | 0.0 | 0.0 |
| Office Costs | 20255.99 | 20408.81 | 18005.83 | 25400.72 | 22223.66 | 17537.61 | 19799.37 | 22904.41 | 27941.52 | 4820.83 |
| Office Costs Expenditure | 0.0 | 0.0 | 0.0 | 0.0 | 837.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Staff Travel | 475.97 | 754.61 | 474.6 | 677.11 | 899.99 | 1595.05 | 3402.18 | 1294.66 | 1657.09 | 1076.65 |
| Staffing | 80496.08 | 112349.62 | 108889.23 | 106483.41 | 118503.82 | 91597.67 | 134497.79 | 137199.79 | 116274.7 | 896.7 |


### Yearly Detail By Category and Expense Type 

In [21]:
# the next 4 lines of code to display all the rows and columns in the detail view are taken from
# https://thispointer.com/python-pandas-how-to-display-full-dataframe-i-e-print-all-rows-columns-without-truncation/
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)
pandas.set_option('display.width', None)
pandas.set_option('display.max_colwidth', None) # None means no limit; -1 is no longer accepted by recent pandas versions

localMPYearlyDetail['sum'].unstack().fillna(0.0)

| Category | Expense Type | 2010/2011 | 2011/2012 | 2012/2013 | 2013/2014 | 2014/2015 | 2015/2016 | 2016/2017 | 2017/2018 | 2018/2019 | 2019/2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accommodation | Accommodation Rent | 15001.15 | 25004.94 | 22228.35 | 13866.68 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Accommodation | Council Tax | 181.0 | 150.58 | 1485.17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Accommodation | Electricity | 167.07 | 0.0 | 0.0 | 78.27 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Accommodation | Gas | 360.0 | 90.0 | 0.0 | 79.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Accommodation | Hotel - London | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 875.0 | 5775.0 |
| Accommodation | Hotel London Area | 0.0 | 0.0 | 0.0 | 5061.6 | 10653.9 | 8996.6 | 10382.99 | 9540.7 | 12373.0 | 775.0 |
| Accommodation | Telephone Usage/Rental | 23.67 | 0.0 | 37.74 | 97.78 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Accommodation | Television Licence | 145.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Accommodation | Water | 17.0 | 18.06 | 0.0 | 268.22 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Dependant Travel | Own Car Dependant | 247.86 | 1328.94 | 240.84 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## To Do List
A list of ideas I have to expand this project.
- Add comparison with other MPs
- Add in the MP Basic Salary.
    I think taking into account whether an MP has held a position such as being on a select committee or become
    a minister and the extra pay they would get is "too complicated".
    
- Add in the current tax year to date claims
    - compare them to the historic data
    - predict where they might go
- Add graphs to show off the data