# Final Exam (16 marks) - House Office Expenditure Data

Members of Congress and Congressional offices receive an annual budget to spend on staff, supplies, transportation, and other expenses. Each quarter, representatives report the recipients of their expenditures. ProPublica compiles these reports into research-ready CSV files. The full data set has already been downloaded for your convenience. The data set includes a readme text file describing the data in more detail, which may be helpful in completing this exam.

## Data Notes

House Office Expenditure Data

Last updated June 4, 2019.

https://projects.propublica.org/represent/expenditures

Members of the House of Representatives get an annual budget for their Washington and district offices, but how they spend it is up to them. There are some rules: It can’t be used for personal or campaign expenses, and there is no reserve source of money if lawmakers spend all of their allowances.

Lawmakers also are required to report the recipients of their office spending, and since 2009 the Sunlight Foundation has been taking the PDF files published by the House and converting them into text files useful for analysis and research. As of November 2016, ProPublica has taken over both the collection and hosting of these files. They can be examined using spreadsheet or database software. 

__How We Collect This Data__

Each quarter we take the report published by the House and generate two text files: One contains summary information for each office and category of spending (some examples include “Personnel Compensation” and “Travel”), and the other contains details of each recipient of office spending and its purpose. Note that the data has not been standardized (meaning that "AT&T" might also appear as "A.T.&T."), so simple aggregation on the recipient could result in multiple totals for the same individual or entity, depending on the spelling. Individual recipients can be paid by more than one office or lawmaker in some cases.
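The aggregation problem described above can be illustrated with a minimal sketch. The helper `normalize_payee` and the toy rows are hypothetical (they are not part of the ProPublica files), and the cleanup rule shown here, uppercasing and dropping periods and commas, is only one possible assumption about how names might be reconciled.

```python
import re
import pandas as pd

def normalize_payee(name):
    """Uppercase, drop periods and commas, and collapse runs of whitespace."""
    cleaned = re.sub(r"[.,]", "", name.upper())
    return re.sub(r"\s+", " ", cleaned).strip()

# Toy rows: three spellings of what is presumably the same recipient
toy = pd.DataFrame({
    "PAYEE": ["AT&T", "A.T.&T.", "at&t "],
    "AMOUNT": [10.0, 5.0, 2.5],
})

# Grouping on the raw spelling gives three totals; grouping on the
# normalized spelling gives one
raw_totals = toy.groupby("PAYEE")["AMOUNT"].sum()
toy["PAYEE_NORM"] = toy["PAYEE"].map(normalize_payee)
norm_totals = toy.groupby("PAYEE_NORM")["AMOUNT"].sum()
```

A rule this simple will not catch every variant (abbreviations, misspellings), so any real analysis would likely need manual review of the merged names.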

Most of the records are connected to lawmaker offices, but the files also contain spending records for House committees and administrative offices, in addition to leadership organizations such as the Speaker of the House and the two parties' leaders.

Before you dig into the data to find out how the House spends its money, you may find it useful to check out this post (https://www.propublica.org/article/update-on-house-disbursements-a-few-notes-on-how-to-use-the-data) from a Sunlight training webinar that explains discrepancies with how the House reports lawmakers' spending, and gives guidelines on how to use the data.


## Data Dictionary


__Summary files__

    BIOGUIDE_ID – the official ID of members of the House (http://bioguide.congress.gov/biosearch/biosearch.asp)
    OFFICE – the name of the House office
    YEAR – the calendar year
    QUARTER – the quarter of the year
    CATEGORY – broad description of spending
    YTD – year to date amount spent by office in that category 
    AMOUNT – amount spent by office in that category in quarter

__Detail files__

    Contains BIOGUIDE_ID, OFFICE, QUARTER, YEAR, CATEGORY, AMOUNT, plus the following:

    PAYEE – name of recipient
    PURPOSE – specific purpose of spending
    DATE – date of payment (optional)
    START DATE – beginning of period which payment covers
    END DATE – end of period which payment covers
    TRANSCODE – House transaction code
    TRANSCODELONG – description of House transaction code
    RECORDID – House record number
    RECIP (orig.) – original (non-standardized) recipient


### Task 1 (2 marks)

Import the necessary libraries and read each of the following files from the included __data__ folder as a Dataframe: 

- 2010Q1-house-disburse-detail.csv
- 2010Q1-house-disburse-summary.csv
- 2010Q2-house-disburse-detail.csv
- 2010Q2-house-disburse-summary.csv
- 2010Q3-house-disburse-detail.csv
- 2010Q3-house-disburse-summary.csv
- 2010Q4-house-disburse-detail.csv
- 2010Q4-house-disburse-summary.csv
- 2011Q1-house-disburse-detail.csv
- 2011Q1-house-disburse-summary.csv
- 2011Q2-house-disburse-detail.csv
- 2011Q2-house-disburse-summary.csv
- 2011Q3-house-disburse-detail.csv
- 2011Q3-house-disburse-summary.csv
- 2011Q4-house-disburse-detail.csv
- 2011Q4-house-disburse-summary.csv
- 2012Q1-house-disburse-detail.csv
- 2012Q1-house-disburse-summary.csv
- 2012Q2-house-disburse-detail.csv
- 2012Q2-house-disburse-summary.csv
- 2012Q3-house-disburse-detail.csv
- 2012Q3-house-disburse-summary.csv
- 2012Q4-house-disburse-detail.csv
- 2012Q4-house-disburse-summary.csv
- 2013Q1-house-disburse-detail.csv
- 2013Q1-house-disburse-summary.csv
- 2013Q2-house-disburse-detail.csv
- 2013Q2-house-disburse-summary.csv
- 2013Q3-house-disburse-detail.csv
- 2013Q3-house-disburse-summary.csv
- 2013Q4-house-disburse-detail.csv
- 2013Q4-house-disburse-summary.csv
- 2014Q1-house-disburse-detail.csv
- 2014Q1-house-disburse-summary.csv
- 2014Q2-house-disburse-detail.csv
- 2014Q2-house-disburse-summary.csv
- 2014Q3-house-disburse-detail.csv
- 2014Q3-house-disburse-summary.csv
- 2014Q4-house-disburse-detail.csv
- 2014Q4-house-disburse-summary.csv
- 2015Q1-house-disburse-detail.csv
- 2015Q1-house-disburse-summary.csv
- 2015Q2-house-disburse-detail.csv
- 2015Q2-house-disburse-summary.csv
- 2015Q3-house-disburse-detail.csv
- 2015Q3-house-disburse-summary.csv
- 2015Q4-house-disburse-detail.csv
- 2015Q4-house-disburse-summary.csv
- 2016Q1-house-disburse-detail.csv
- 2016Q1-house-disburse-summary.csv
- 2016Q2-house-disburse-detail.csv
- 2016Q2-house-disburse-summary.csv
- 2016Q3-house-disburse-detail.csv
- 2016Q3-house-disburse-summary.csv
- 2016Q4-house-disburse-detail.csv
- 2016Q4-house-disburse-summary.csv
- 2017Q1-house-disburse-detail.csv
- 2017Q1-house-disburse-summary.csv
- 2017Q2-house-disburse-detail.csv
- 2017Q2-house-disburse-summary.csv
- 2017Q3-house-disburse-detail.csv
- 2017Q3-house-disburse-summary.csv
- 2017Q4-house-disburse-detail.csv
- 2017Q4-house-disburse-summary.csv

That is a total of 64 files, and each one needs to be read into a separate Dataframe. The manual approach of storing each in its own variable, such as __df1__, __df2__, etc., is therefore not feasible.

Instead, use a suitable data structure to create a data store. Once you have done that, you should be able to access your Dataframes as follows:

- `data['detail']['Y2010Q1']` represents the Dataframe for `2010Q1-house-disburse-detail.csv`
- `data['summary']['Y2010Q1']` represents the Dataframe for `2010Q1-house-disburse-summary.csv`

... and so on.

Such a data store would allow you to use a consistent naming scheme for both the `detail` and `summary` Dataframes. That is:

For both the files `2010Q1-house-disburse-detail.csv` and `2010Q1-house-disburse-summary.csv`, you use the same name __Y2010Q1__ for your Dataframe.

Similarly, for both the files `2010Q2-house-disburse-detail.csv` and `2010Q2-house-disburse-summary.csv`, you use the same name __Y2010Q2__ for your Dataframe and so on.

<u>Hint:</u> You would need to use the following additional library:

__import__ glob

Then use the code `path = 'data'` and `glob.glob(path + "/*.csv")` to list all the CSV files in the __data__ folder, and read each of them with `pd.read_csv`. You will also need to pass `encoding='unicode_escape'` to read the files properly.

You would also need to come up with suitable logic and code so that the file `2010Q1-house-disburse-detail.csv` gets read into the Dataframe `data['detail']['Y2010Q1']` while `2010Q1-house-disburse-summary.csv` gets read into the Dataframe `data['summary']['Y2010Q1']` and so on (shouldn't take more than 8-10 lines of code).

If you get a warning for the first file, you can ignore it (it is triggered by the BIOGUIDE_ID column), or pass `low_memory=False` to `pd.read_csv` to avoid it.

In [1]:
### Write your code below this comment.
import pandas as pd
import glob
import numpy as np

In [2]:
path = 'data'
data_files = glob.glob(path + "/*.csv")
data_files

['data\\2010Q1-house-disburse-detail.csv',
 'data\\2010Q1-house-disburse-summary.csv',
 'data\\2010Q2-house-disburse-detail.csv',
 'data\\2010Q2-house-disburse-summary.csv',
 'data\\2010Q3-house-disburse-detail.csv',
 'data\\2010Q3-house-disburse-summary.csv',
 'data\\2010Q4-house-disburse-detail.csv',
 'data\\2010Q4-house-disburse-summary.csv',
 'data\\2011Q1-house-disburse-detail.csv',
 'data\\2011Q1-house-disburse-summary.csv',
 'data\\2011Q2-house-disburse-detail.csv',
 'data\\2011Q2-house-disburse-summary.csv',
 'data\\2011Q3-house-disburse-detail.csv',
 'data\\2011Q3-house-disburse-summary.csv',
 'data\\2011Q4-house-disburse-detail.csv',
 'data\\2011Q4-house-disburse-summary.csv',
 'data\\2012Q1-house-disburse-detail.csv',
 'data\\2012Q1-house-disburse-summary.csv',
 'data\\2012Q2-house-disburse-detail.csv',
 'data\\2012Q2-house-disburse-summary.csv',
 'data\\2012Q3-house-disburse-detail.csv',
 'data\\2012Q3-house-disburse-summary.csv',
 'data\\2012Q4-house-disburse-detail.csv',


In [36]:
import os

data = {'summary': {}, 'detail': {}}
for file in data_files:
    # low_memory=False avoids the mixed-type warning on BIOGUIDE_ID
    d = pd.read_csv(file, encoding='unicode_escape', low_memory=False)
    # Use only the file name so the key does not depend on the path separator
    base = os.path.basename(file)
    kind = 'summary' if 'summary' in base else 'detail'
    # '2010Q1-house-disburse-detail.csv' -> 'Y2010Q1'
    data[kind]['Y' + base.split('-')[0]] = d
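
The filename-to-key logic can also be isolated in a small helper and checked on its own before any file is read. This is only a sketch: the helper name `store_key` and the regular expression are assumptions based on the file names listed in Task 1, and the helper rejects any name that does not follow that pattern.

```python
import re
from pathlib import PureWindowsPath

# Assumed naming pattern, taken from the file list in Task 1
PATTERN = re.compile(r"^(\d{4}Q[1-4])-house-disburse-(detail|summary)\.csv$")

def store_key(path):
    """Map a file path to (kind, key), e.g. ('detail', 'Y2010Q1')."""
    # PureWindowsPath accepts both '\\' and '/' as separators
    filename = PureWindowsPath(path).name
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError("unexpected file name: " + filename)
    period, kind = match.groups()
    return kind, "Y" + period

kind, key = store_key("data\\2010Q1-house-disburse-detail.csv")
```

With such a helper, the loading loop reduces to `kind, key = store_key(file)` followed by `data[kind][key] = pd.read_csv(...)`.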

In [4]:
data['summary']['Y2010Q1']

Unnamed: 0,BIOGUIDE_ID,OFFICE,YEAR,QUARTER,CATEGORY,YTD,AMOUNT
0,,COMMUNICATIONS,FISCAL YEAR 2010,2010Q1,OTHER SERVICES,1755.00,455.00
1,,COMMUNICATIONS,FISCAL YEAR 2010,2010Q1,SUPPLIES AND MATERIALS,979.36,771.16
2,,COMMUNICATIONS,FISCAL YEAR 2010,2010Q1,EQUIPMENT,33815.32,33767.32
3,,OFFICE OF THE SPEAKER,FISCAL YEAR 2010,2010Q1,PERSONNEL COMPENSATION,447067.08,226102.50
4,,OFFICE OF THE SPEAKER,FISCAL YEAR 2010,2010Q1,PERSONNEL COMPENSATION,1857385.12,952771.16
...,...,...,...,...,...,...,...
3991,,ALTERNATE SITE,FISCAL YEAR 2010,2010Q1,EQUIPMENT,74325.40,1251.48
3992,,EMERGENCY RESPONSE TEAM,FISCAL YEAR 2010,2010Q1,TRAVEL,33344.24,26539.28
3993,,EMERGENCY RESPONSE TEAM,FISCAL YEAR 2010,2010Q1,"RENT, COMMUNICATION, UTILITIES",4518.60,29.85
3994,,EMERGENCY RESPONSE TEAM,FISCAL YEAR 2010,2010Q1,OTHER SERVICES,295.11,295.11


In [5]:
data['detail']['Y2010Q1']

Unnamed: 0,BIOGUIDE_ID,OFFICE,QUARTER,CATEGORY,DATE,PAYEE,START DATE,END DATE,PURPOSE,AMOUNT,YEAR,TRANSCODE,TRANSCODELONG,RECORDID,RECIP (orig.)
0,,COMMUNICATIONS,2010Q1,OTHER SERVICES,,03Â­10 P2 MFP0003226 ...,03/02/10,03/02/10,NON-TECHNOLOGY SERVICE CONTRCT,455.00,FISCAL YEAR 2010,,,,03Â­10 P2 MFP0003226 ...
1,,COMMUNICATIONS,2010Q1,SUPPLIES AND MATERIALS,,02Â­05 P2 MFP0003219 ALLSTEEL,11/28/09,11/28/09,HABITATION EXPENSES,47.26,FISCAL YEAR 2010,,,,02Â­05 P2 MFP0003219 ALLSTEEL
2,,COMMUNICATIONS,2010Q1,SUPPLIES AND MATERIALS,,03Â­05 P2 OSM42304 CDW GOVERN...,12/21/09,12/21/09,OFFICE SUPPLIES OUTSIDE,250.00,FISCAL YEAR 2010,,,,03Â­05 P2 OSM42304 CDW GOVERN...
3,,COMMUNICATIONS,2010Q1,SUPPLIES AND MATERIALS,,03Â­05 P2 OSM42304 DO,12/21/09,12/21/09,OFFICE SUPPLIES OUTSIDE,436.00,FISCAL YEAR 2010,,,,03Â­05 P2 OSM42304 DO
4,,COMMUNICATIONS,2010Q1,SUPPLIES AND MATERIALS,,03Â­05 P2 OSM42304 DO,12/21/09,12/21/09,OFFICE SUPPLIES OUTSIDE,37.90,FISCAL YEAR 2010,,,,03Â­05 P2 OSM42304 DO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
132795,,EMERGENCY RESPONSE TEAM,2010Q1,TRAVEL,,03Â­29 P1 10A54000097 KEVIN ...,02/14/10,02/19/10,TRAVEL REIMBURSEMENT,858.00,FISCAL YEAR 2010,,,,03Â­29 P1 10A54000097 KEVIN ...
132796,,EMERGENCY RESPONSE TEAM,2010Q1,TRAVEL,,03Â­29 P1 10A54000096 PAUL J...,02/14/10,02/19/10,TRAVEL REIMBURSEMENT,795.00,FISCAL YEAR 2010,,,,03Â­29 P1 10A54000096 PAUL J...
132797,,EMERGENCY RESPONSE TEAM,2010Q1,"RENT, COMMUNICATION, UTILITIES",,03Â­07 P1 10A10600093 RICHARD MARTINS,02/14/10,02/19/10,"TELECOM SVC, EQUIP & TOLLS",9.95,FISCAL YEAR 2010,,,,03Â­07 P1 10A10600093 RICHARD MARTINS
132798,,EMERGENCY RESPONSE TEAM,2010Q1,"RENT, COMMUNICATION, UTILITIES",,03Â­07 P1 10A10600097 TIMOTHY WRIGHT,02/14/10,02/19/10,"TELECOM SVC, EQUIP & TOLLS",19.90,FISCAL YEAR 2010,,,,03Â­07 P1 10A10600097 TIMOTHY WRIGHT


---

### Task 2 (1 mark)

Some Dataframes may have one or more column names with extra spaces at either end. Write code to detect such Dataframes and correct the column names before moving ahead.

In [37]:
### Write your code below this comment.
for kind in data:
    for key, df in data[kind].items():
        # strip() removes extra spaces at both ends of each column name
        df.columns = df.columns.str.strip()

for file_name, value in data['detail'].items():
    column = value.columns.tolist()
    print(column)

['BIOGUIDE_ID', 'OFFICE', 'QUARTER', 'CATEGORY', 'DATE', 'PAYEE', 'START DATE', 'END DATE', 'PURPOSE', 'AMOUNT', 'YEAR', 'TRANSCODE', 'TRANSCODELONG', 'RECORDID', 'RECIP (orig.)']
['BIOGUIDE_ID', 'OFFICE', 'QUARTER', 'CATEGORY', 'DATE', 'PAYEE', 'START DATE', 'END DATE', 'PURPOSE', 'AMOUNT', 'YEAR', 'TRANSCODE', 'TRANSCODELONG', 'RECORDID', 'RECIP (orig.)']
['BIOGUIDE_ID', 'OFFICE', 'QUARTER', 'CATEGORY', 'DATE', 'PAYEE', 'START DATE', 'END DATE', 'PURPOSE', 'AMOUNT', 'YEAR', 'TRANSCODE', 'TRANSCODELONG', 'RECORDID', 'RECIP (orig.)']
['BIOGUIDE_ID', 'OFFICE', 'QUARTER', 'CATEGORY', 'DATE', 'PAYEE', 'START DATE', 'END DATE', 'PURPOSE', 'AMOUNT', 'YEAR', 'TRANSCODE', 'TRANSCODELONG', 'RECORDID', 'RECIP (orig.)']
['BIOGUIDE_ID', 'OFFICE', 'QUARTER', 'CATEGORY', 'DATE', 'PAYEE', 'START DATE', 'END DATE', 'PURPOSE', 'AMOUNT', 'YEAR', 'TRANSCODE', 'TRANSCODELONG', 'RECORDID', 'RECIP (orig.)']
['BIOGUIDE_ID', 'OFFICE', 'QUARTER', 'CATEGORY', 'DATE', 'PAYEE', 'START DATE', 'END DATE', 'PURPOSE
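
Task 2 also asks for the affected Dataframes to be detected, not only corrected. One way to do that is sketched below on a toy store; the helper name `frames_with_padded_columns` and the toy column names are hypothetical.

```python
import pandas as pd

def frames_with_padded_columns(store):
    """Return {key: [column names with leading or trailing spaces]}."""
    flagged = {}
    for key, df in store.items():
        padded = [c for c in df.columns if isinstance(c, str) and c != c.strip()]
        if padded:
            flagged[key] = padded
    return flagged

# Toy store: only the second frame has padded column names
toy_store = {
    "Y2010Q1": pd.DataFrame(columns=["OFFICE", "AMOUNT"]),
    "Y2010Q2": pd.DataFrame(columns=["OFFICE ", " AMOUNT"]),
}

flagged = frames_with_padded_columns(toy_store)

# Fix only the frames that were flagged
for key in flagged:
    toy_store[key].columns = toy_store[key].columns.str.strip()

remaining = frames_with_padded_columns(toy_store)
```

Applied to the exam data, the same helper could be called once on `data['detail']` and once on `data['summary']`.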

---

### Task 3 (2.5 marks)

Combine all the `'detail'` Dataframes from your store vertically into a single Dataframe named __df_detail__ based on common columns across all the Dataframes. Make sure the sort order of the common column names is the same as that in the first Dataframe `data['detail']['Y2010Q1']`. Once you have done that, confirm whether the total number of rows for all the Dataframes is equal to the number of rows for the combined Dataframe __df_detail__.

In [40]:
### Write your code below this comment.
# Concatenate all detail Dataframes in one call instead of growing one in a loop
df_detail = pd.concat(list(data['detail'].values()), ignore_index=True)

df_detail


Unnamed: 0,BIOGUIDE_ID,OFFICE,QUARTER,CATEGORY,DATE,PAYEE,START DATE,END DATE,PURPOSE,AMOUNT,YEAR,TRANSCODE,TRANSCODELONG,RECORDID,RECIP (orig.),PROGRAM,SORT SEQUENCE
0,,COMMUNICATIONS,2010Q1,OTHER SERVICES,,03Â­10 P2 MFP0003226 ...,03/02/10,03/02/10,NON-TECHNOLOGY SERVICE CONTRCT,455.00,FISCAL YEAR 2010,,,,03Â­10 P2 MFP0003226 ...,,
1,,COMMUNICATIONS,2010Q1,SUPPLIES AND MATERIALS,,02Â­05 P2 MFP0003219 ALLSTEEL,11/28/09,11/28/09,HABITATION EXPENSES,47.26,FISCAL YEAR 2010,,,,02Â­05 P2 MFP0003219 ALLSTEEL,,
2,,COMMUNICATIONS,2010Q1,SUPPLIES AND MATERIALS,,03Â­05 P2 OSM42304 CDW GOVERN...,12/21/09,12/21/09,OFFICE SUPPLIES OUTSIDE,250.00,FISCAL YEAR 2010,,,,03Â­05 P2 OSM42304 CDW GOVERN...,,
3,,COMMUNICATIONS,2010Q1,SUPPLIES AND MATERIALS,,03Â­05 P2 OSM42304 DO,12/21/09,12/21/09,OFFICE SUPPLIES OUTSIDE,436.00,FISCAL YEAR 2010,,,,03Â­05 P2 OSM42304 DO,,
4,,COMMUNICATIONS,2010Q1,SUPPLIES AND MATERIALS,,03Â­05 P2 OSM42304 DO,12/21/09,12/21/09,OFFICE SUPPLIES OUTSIDE,37.90,FISCAL YEAR 2010,,,,03Â­05 P2 OSM42304 DO,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3289983,,FISCAL YEAR 2016 PAGING,Q4,EQUIPMENT,10/16/2017,BEARCOM,8/1/2017,8/31/2017,WARRANTIES,6405.41,2017,AP,,947363,,PAGING,DETAIL
3289984,,FISCAL YEAR 2016 PAGING,Q4,EQUIPMENT,10/16/2017,BEARCOM,9/1/2017,9/30/2017,WARRANTIES,6405.41,2017,AP,,947365,,PAGING,DETAIL
3289985,,FISCAL YEAR 2016 PAGING,Q4,EQUIPMENT,,,,,EQUIPMENT TOTALS:,19216.23,2017,,,,,PAGING,SUBTOTAL
3289986,,FISCAL YEAR 2016 PAGING,Q4,EQUIPMENT,,,,,PAGING TOTALS:,19216.23,2017,,,,,PAGING,SUBTOTAL


In [8]:
df_detail.columns

Index(['BIOGUIDE_ID', 'OFFICE', 'QUARTER', 'CATEGORY', 'DATE', 'PAYEE',
       'START DATE', 'END DATE', 'PURPOSE', 'AMOUNT', 'YEAR', 'TRANSCODE',
       'TRANSCODELONG', 'RECORDID', 'RECIP (orig.)', 'PROGRAM',
       'SORT SEQUENCE'],
      dtype='object')

In [38]:
print(len(df_detail))
count = 0
for key, value in data['detail'].items():
    count = count + len(value)
print(count)

3289988
3289988
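
Note that the combined output above still contains `PROGRAM` and `SORT SEQUENCE`, which are absent from `data['detail']['Y2010Q1']`, so the concatenation keeps the union of columns, not the common ones the task asks for. A minimal sketch of a common-columns version follows; the helper name `concat_common` and the toy frames are hypothetical, and the row-count check stays the same since dropping columns does not drop rows.

```python
import pandas as pd

def concat_common(frames):
    """Stack frames vertically, keeping only columns present in every frame.

    The column order follows the first frame.
    """
    frames = list(frames)
    common = [c for c in frames[0].columns
              if all(c in f.columns for f in frames[1:])]
    return pd.concat([f[common] for f in frames], ignore_index=True)

# Toy frames: the second has an extra column and a different column order
a = pd.DataFrame({"OFFICE": ["X"], "AMOUNT": [1.0], "YEAR": [2010]})
b = pd.DataFrame({"YEAR": [2017], "AMOUNT": [2.0], "OFFICE": ["Y"], "PROGRAM": ["P"]})

combined = concat_common([a, b])
```

On the exam data this would presumably be called as `concat_common(data['detail'].values())`, and the same helper would apply to `data['summary']` in Task 4.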


---

### Task 4 (2.5 marks)

Similarly, combine all the `'summary'` Dataframes from your store vertically into a single Dataframe named __df_summary__ based on common columns across all the Dataframes. Make sure the sort order of the common column names is the same as that in the first Dataframe `data['summary']['Y2010Q1']`. Once you have done that, confirm whether the total number of rows for all the Dataframes is equal to the number of rows for the combined Dataframe __df_summary__.

In [10]:
### Write your code below this comment.
# Concatenate all summary Dataframes in one call instead of growing one in a loop
df_summary = pd.concat(list(data['summary'].values()), ignore_index=True)

df_summary

Unnamed: 0,BIOGUIDE_ID,OFFICE,YEAR,QUARTER,CATEGORY,YTD,AMOUNT,Unnamed: 7,PROGRAM,DESCRIPTION
0,,COMMUNICATIONS,FISCAL YEAR 2010,2010Q1,OTHER SERVICES,1755.00,455.00,,,
1,,COMMUNICATIONS,FISCAL YEAR 2010,2010Q1,SUPPLIES AND MATERIALS,979.36,771.16,,,
2,,COMMUNICATIONS,FISCAL YEAR 2010,2010Q1,EQUIPMENT,33815.32,33767.32,,,
3,,OFFICE OF THE SPEAKER,FISCAL YEAR 2010,2010Q1,PERSONNEL COMPENSATION,447067.08,226102.50,,,
4,,OFFICE OF THE SPEAKER,FISCAL YEAR 2010,2010Q1,PERSONNEL COMPENSATION,1857385.12,952771.16,,,
...,...,...,...,...,...,...,...,...,...,...
134899,,FISCAL YEAR 2018 COMMUNICATIONS SERVICES,2017,Q4,COMMUNICATIONS SERVICES TOTALS:,89769.68,89769.68,,,
134900,,FISCAL YEAR 2018 COMMUNICATIONS SERVICES,2017,Q4,OFFICE TOTALS:,89769.68,89769.68,,,
134901,,FISCAL YEAR 2018 CDN ENHANCE,2017,Q4,RENT COMMUNICATION UTILITIES,231939.76,231939.76,,,
134902,,FISCAL YEAR 2018 CDN ENHANCE,2017,Q4,CDN ENHANCE TOTALS:,231939.76,231939.76,,,


In [39]:
print(len(df_summary))
count = 0
for key, value in data['summary'].items():
    count = count + len(value)
print(count)

134904
134904


---

### Task 5 (1.5 marks)

Find missing values in the __df_detail__ Dataframe and report the sum of missing values for all the columns in __df_detail__ as a single number. 

Note that some columns in the combined Dataframe __df_detail__ contain a string of three spaces `'   '`. These values come from the 2017 data files and indicate a missing value. Convert them to __NaN__ so that they are correctly recognized as missing, and then report the updated count of missing values, again as a single number.

In [12]:
### Write your code below this comment.
missing = df_detail.isnull().sum()
missing.sum()

10493027

In [13]:
df_detail = df_detail.replace(r"\s", np.NaN, regex=True)
missing1=df_detail.isnull().sum()
missing1.sum()

27490981
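
The jump from roughly 10.5 million to 27.5 million missing values suggests that the pattern `r"\s"` matched more than the three-space placeholders: when the replacement value is not a string, `replace(..., regex=True)` swaps out the whole cell whenever the pattern is found anywhere in it, so a value such as `OTHER SERVICES` would also become NaN. The toy frame below is hypothetical and only illustrates the difference between that pattern and an anchored one. It also uses `np.nan`, since the `np.NaN` alias was removed in NumPy 2.0.

```python
import numpy as np
import pandas as pd

# Toy data with one genuine missing value and two three-space placeholders
toy = pd.DataFrame({
    "CATEGORY": ["OTHER SERVICES", "   ", "TRAVEL"],
    "PAYEE": ["BEARCOM", "   ", None],
}, dtype=object)

# Unanchored pattern: any cell that contains whitespace is replaced
loose = toy.replace(r"\s", np.nan, regex=True)

# Anchored pattern: only cells made up entirely of whitespace are replaced
strict = toy.replace(r"^\s+$", np.nan, regex=True)

loose_missing = int(loose.isnull().sum().sum())
strict_missing = int(strict.isnull().sum().sum())
```

Here the unanchored pattern wipes out `OTHER SERVICES` as well, while the anchored pattern leaves it intact. Rerunning the Task 5 cell with an anchored pattern would give a different count from the one recorded above.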

---

### Task 6 (1.5 marks)

Specify the right data type for the following columns in __df_detail__ for further analysis:

- `'AMOUNT'`
- `'START DATE'`
- `'END DATE'`

The __AMOUNT__ column needs to be converted into a numeric column (i.e. one with floating-point values), while the columns __START DATE__ and __END DATE__ need to be converted into datetime objects.

<u>Hint:</u> If you get any errors, try using `errors="coerce"` and that should do the trick.

In [14]:
### Write your code below this comment.
df_detail['AMOUNT']=pd.to_numeric(df_detail['AMOUNT'], errors='coerce')
df_detail['AMOUNT'].dtype

dtype('float64')

In [15]:
df_detail['START DATE'] = pd.to_datetime(df_detail['START DATE'], errors='coerce')
df_detail['START DATE'].dtype

dtype('<M8[ns]')

In [16]:
df_detail['END DATE'] = pd.to_datetime(df_detail['END DATE'], errors='coerce')
df_detail['END DATE'].dtype

dtype('<M8[ns]')
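
Because `errors="coerce"` silently turns anything unparseable into NaN, it can be worth checking how many values were lost in the conversion. The sketch below uses a hypothetical toy Series. Whether the real __AMOUNT__ column contains thousands separators or text such as `N/A` is an assumption, not something the outputs above confirm.

```python
import pandas as pd

raw = pd.Series(["455.00", "1,234.56", "-19.90", "N/A", None])

# Direct coercion: "1,234.56" and "N/A" both become NaN
direct = pd.to_numeric(raw, errors="coerce")

# Hypothetical cleanup step: drop thousands separators before converting
cleaned = raw.str.replace(",", "", regex=False)
amount = pd.to_numeric(cleaned, errors="coerce")

# Values that were present in the raw data but are NaN after conversion
lost = raw[raw.notna() & amount.isna()]
```

Inspecting `lost.value_counts()` on the real column would show whether coercion discarded any meaningful amounts. The same check applies to the two date columns.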

---

### NOW ANSWER THE FOLLOWING QUESTIONS.

All questions pertain to the __df_detail__ Dataframe.

<u>Note:</u> The detailed instructions you have received in this exam so far regarding transformations were for your convenience only. In a real-world Data Science challenge, you would only be asked the questions (such as the ones to follow) and then it's up to you to do whatever transformations need to be done to answer those questions.

### Task 7 (0.5 mark)

What is the total of all the payments in the dataset?

In [17]:
### Write your code below this comment.
df_detail['AMOUNT'].sum()

4701011813.840001
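
The Task 3 output shows rows whose `SORT SEQUENCE` is `SUBTOTAL` and whose `PURPOSE` reads like `EQUIPMENT TOTALS:`. If such rows repeat amounts already listed in the `DETAIL` rows, a plain sum counts them twice. The sketch below, on hypothetical toy rows, shows one way to exclude them. Whether the exam expects this, and whether other subtotal labels exist in the real files, is not known from the text above.

```python
import pandas as pd

toy = pd.DataFrame({
    "PURPOSE": ["WARRANTIES", "WARRANTIES", "EQUIPMENT TOTALS:"],
    "AMOUNT": [6405.41, 6405.41, 12810.82],
    "SORT SEQUENCE": ["DETAIL", "DETAIL", "SUBTOTAL"],
})

# Plain sum: the subtotal row is counted on top of the detail rows
naive_total = toy["AMOUNT"].sum()

# Rows from files without a SORT SEQUENCE column would be NaN here;
# eq() returns False for NaN, so those rows are kept
is_subtotal = toy["SORT SEQUENCE"].eq("SUBTOTAL")
payments_total = toy.loc[~is_subtotal, "AMOUNT"].sum()
```

In this toy case the plain sum is twice the sum of the detail rows, which is the kind of inflation to look for in the full dataset.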