# Summarize FY2024 and FY2025 USA Spending Data

This notebook analyzes contract data from the USA Spending database for fiscal years 2024 and 2025. The workflow:

1. **Data Acquisition**: Connects to the USASpending API to fetch contract data for Department of Justice (agency 17) for fiscal years 2024 and 2025.
2. **Data Download**: Automatically downloads ZIP files containing contract data based on URLs returned by the API.
3. **Data Extraction**: Extracts the downloaded ZIP files into a data folder.
4. **Data Processing**: Reads and combines all CSV files into a single dataframe, then extracts key contract information including award IDs, recipient details, product/service codes, modification dates, and award values.
5. **Analysis**: Creates four summary views of the data. Each view keeps the most recently modified record(s) in a group and ranks them by potential total award value:
    - **By Award ID**: Groups by award ID to identify the largest contracts by potential total value
    - **By Recipient**: Groups by recipient UEI to show which vendors hold the largest potential award values
    - **By Parent Recipient**: Groups by parent organization UEI to show the same picture at the parent level
    - **By Product/Service Code**: Groups by product or service code to show which categories carry the largest potential award values
6. **Data Export**: Saves all views to a multi-sheet Excel file for further analysis and sharing

The analysis surfaces very large contract awards: several individual awards carry potential total values above one billion dollars, and the largest listed in the output below is $86 billion. The data includes detailed information about recipients, contract modifications, and the types of products and services procured by the Department of Justice.

In [1]:
# import necessary libraries
import numpy as np
import pandas as pd
import requests
import zipfile
import os

In [2]:
# define a list of fiscal years
fys = [2024, 2025]
# initiate an empty list to store the data
data = []

# for each fiscal year, download the list of monthly files based on the request below
for fy in fys:
    url = "https://api.usaspending.gov/api/v2/bulk_download/list_monthly_files"
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json"
    }
    payload = {
        "agency": "17",
        "fiscal_year": str(fy),
        "type": "contracts"
    }
    response = requests.post(url, headers=headers, json=payload)
    # stop early if the API returns an error status
    response.raise_for_status()
    # add the first object in the "monthly_files" array to the data list
    monthly_files = response.json().get("monthly_files", [])
    if monthly_files:
        data.append(monthly_files[0])

    
# load the data list as a pandas dataframe
monthly_files_df = pd.DataFrame(data)
monthly_files_df.head()

Unnamed: 0,fiscal_year,agency_name,agency_acronym,type,updated_date,file_name,url
0,2024,Department of Justice,DOJ,contracts,2025-02-06,FY2024_015_Contracts_Full_20250206.zip,https://files.usaspending.gov/award_data_archi...
1,2025,Department of Justice,DOJ,contracts,2025-02-06,FY2025_015_Contracts_Full_20250206.zip,https://files.usaspending.gov/award_data_archi...
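
The cell above keeps the first entry of `monthly_files` for each fiscal year. As a minimal sketch, assuming the response entries carry the fields shown in the output above (`updated_date`, `file_name`, `url`) and that `updated_date` is an ISO date string, the newest entry could instead be selected explicitly. The helper name `latest_monthly_file` and the sample payload are hypothetical and are not part of the original notebook or the API.

```python
from datetime import date


def latest_monthly_file(response_json):
    """Return the entry with the newest updated_date, or None if no files are listed."""
    files = response_json.get("monthly_files", [])
    if not files:
        return None
    # Assumption: updated_date looks like "2025-02-06"
    return max(files, key=lambda f: date.fromisoformat(f["updated_date"]))


# Hypothetical payload shaped like the rows shown in the output above
sample = {
    "monthly_files": [
        {"fiscal_year": 2024, "type": "contracts", "updated_date": "2025-01-06",
         "file_name": "older.zip", "url": "https://example.invalid/older.zip"},
        {"fiscal_year": 2024, "type": "contracts", "updated_date": "2025-02-06",
         "file_name": "newer.zip", "url": "https://example.invalid/newer.zip"},
    ]
}

chosen = latest_monthly_file(sample)
print(chosen["file_name"])
```

Selecting by date rather than by position avoids depending on the order in which the API happens to list files.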


In [3]:
# create the "data" folder if it does not already exist
os.makedirs("data", exist_ok=True)

# for each url in the "url" column, download the zip file and save it to the "data" folder
for index, row in monthly_files_df.iterrows():
    url = row["url"]
    filename = url.split("/")[-1]
    r = requests.get(url, allow_redirects=True)
    # stop early if the download fails instead of saving an error page as a zip
    r.raise_for_status()
    with open(f"data/{filename}", "wb") as f:
        f.write(r.content)

In [4]:
# extract the zip files in the "data" folder
for index, row in monthly_files_df.iterrows():
    filename = row["url"].split("/")[-1]
    with zipfile.ZipFile(f"data/{filename}", "r") as zip_ref:
        zip_ref.extractall("data")
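
The extraction cell above unpacks every member of each archive. Since only the CSV files are read later, one possible variation is to extract just the `.csv` members and keep track of where they landed. This is a sketch on a throwaway archive; the helper name `extract_csv_members` and the demo file names are hypothetical.

```python
import os
import tempfile
import zipfile


def extract_csv_members(zip_path, dest_dir):
    """Extract only the .csv members of zip_path into dest_dir and return their paths."""
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        for name in zip_ref.namelist():
            if name.lower().endswith(".csv"):
                # ZipFile.extract returns the path of the file it created
                extracted.append(zip_ref.extract(name, dest_dir))
    return extracted


# Build a small demo archive in a temporary folder
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("contracts_1.csv", "award_id_piid\nA1\n")
    zf.writestr("readme.txt", "not a csv")

paths = extract_csv_members(demo_zip, os.path.join(demo_dir, "data"))
print([os.path.basename(p) for p in paths])
```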

# Data Processing and Analysis Workflow

## Data Acquisition and Processing
- **API Data Collection**: Data is collected via the USASpending API for Department of Justice (agency 17) for fiscal years 2024 and 2025.
- **Downloaded Files**: ZIP files containing contract data are automatically downloaded from URLs returned by the API.
- **Extracted Data**: The ZIP files are extracted to a "data" folder for processing.
- **Combined Dataset (df)**: All CSV files are read and concatenated into a single dataframe containing all contract records across both fiscal years.

## Processed Data
- **df_narrow**: Selected subset of 10 critical columns from the combined dataset. Columns include:
    - award_id_piid: Unique contract identifier
    - action_date_fiscal_year: Fiscal year of the contract action
    - recipient_uei: Unique Entity Identifier for contractors
    - recipient_name: Name of contract recipient
    - recipient_parent_uei: Parent organization UEI
    - recipient_parent_name: Name of parent organization
    - product_or_service_code_description: Description of product/service
    - product_or_service_code: Code for product/service category
    - last_modified_date: Date of last contract modification
    - potential_total_value_of_award: Maximum potential value of contract
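
To make the `action_date_fiscal_year` column concrete: the federal fiscal year N runs from October 1 of year N-1 through September 30 of year N. The sketch below assumes the column follows that standard calendar; the helper name `federal_fiscal_year` is hypothetical and is not used elsewhere in this notebook.

```python
from datetime import date


def federal_fiscal_year(d):
    """Map a calendar date to its federal fiscal year (the year starts on October 1)."""
    return d.year + 1 if d.month >= 10 else d.year


print(federal_fiscal_year(date(2024, 9, 30)))  # last day of FY2024
print(federal_fiscal_year(date(2024, 10, 1)))  # first day of FY2025
```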

## Analysis Views
- **df_grouped_award_id_max_last_modified**: For each award ID, keeps the row(s) with the latest modification date and shows the fiscal year, potential total value, latest modification date, and the number of rows sharing that date (`row_count`).

- **df_grouped_recipient_uei_max_last_modified**: For each recipient UEI, keeps the row(s) with the latest modification date, along with recipient name and fiscal year, ranked by potential total value.

- **df_grouped_recipient_parent_uei_max_last_modified**: Applies the same logic at the parent organization level, grouped by parent recipient UEI.

- **df_grouped_product_or_service_code_max_last_modified**: Applies the same logic per product or service code, with the code description, to show which categories carry the largest potential award values.
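
All four views rely on the same pattern: find the latest `last_modified_date` in each group, then merge back to recover the matching rows. The toy example below, with made-up values, illustrates that this merge keeps every row tied for the latest date, which is why a group can appear more than once with `row_count` greater than 1. It also shows one possible alternative (not the notebook's method) that keeps exactly one row per group.

```python
import pandas as pd

# Made-up data: award A1 has two rows sharing its latest modification date
toy = pd.DataFrame({
    "award_id_piid": ["A1", "A1", "A1", "B2"],
    "last_modified_date": pd.to_datetime(
        ["2024-01-05", "2024-03-01", "2024-03-01", "2024-02-10"]
    ),
    "potential_total_value_of_award": [100.0, 250.0, 250.0, 75.0],
})

# Merge approach used in this notebook: keeps all rows tied at the latest date
latest_dates = toy.groupby("award_id_piid")["last_modified_date"].max().reset_index()
merged = toy.merge(latest_dates, on=["award_id_piid", "last_modified_date"], how="inner")

# Alternative sketch: sort by date and keep only the last row per award
one_per_award = (
    toy.sort_values("last_modified_date")
       .drop_duplicates(subset="award_id_piid", keep="last")
       .reset_index(drop=True)
)

print(len(merged), len(one_per_award))
```

Which behavior is preferable depends on whether tied rows are meaningful for the analysis; the alternative silently discards them.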

All views are exported to a multi-sheet Excel file named "USA_Spending_Summary_Analysis.xlsx" for further analysis and sharing.


In [5]:
# read each csv file in the "data" folder and concatenate them into a single dataframe

# Create a list to store each CSV file's dataframe
dfs = []

# Iterate through each csv file in the "data" folder
for file in os.listdir("data"):
    if file.endswith(".csv"):
        df = pd.read_csv(f"data/{file}")
        dfs.append(df)

# Combine the list of dataframes into one dataframe
df = pd.concat(dfs, ignore_index=True)

df.head()


Unnamed: 0,contract_transaction_unique_key,contract_award_unique_key,award_id_piid,modification_number,transaction_number,parent_award_agency_id,parent_award_agency_name,parent_award_id_piid,parent_award_modification_number,federal_action_obligation,...,highly_compensated_officer_2_amount,highly_compensated_officer_3_name,highly_compensated_officer_3_amount,highly_compensated_officer_4_name,highly_compensated_officer_4_amount,highly_compensated_officer_5_name,highly_compensated_officer_5_amount,usaspending_permalink,initial_report_date,last_modified_date
0,1501_4732_15JTAX22F00000053_P00008_GS00F156GA_0,CONT_AWD_15JTAX22F00000053_1501_GS00F156GA_4732,15JTAX22F00000053,P00008,0.0,4732.0,FEDERAL ACQUISITION SERVICE,GS00F156GA,PA0019,149841.82,...,,,,,,,,https://www.usaspending.gov/award/CONT_AWD_15J...,2024-09-26,2024-09-30
1,1501_1501_15JPSS23F00000433_P00002_15JPSS20G00...,CONT_AWD_15JPSS23F00000433_1501_15JPSS20G00000...,15JPSS23F00000433,P00002,0.0,1501.0,"OFFICES, BOARDS AND DIVISIONS",15JPSS20G00000334,0,-2561.54,...,200000.0,,,,,,,https://www.usaspending.gov/award/CONT_AWD_15J...,2024-09-23,2024-09-30
2,1501_8000_15JPSS24F00000983_0_NNG15SD91B_0,CONT_AWD_15JPSS24F00000983_1501_NNG15SD91B_8000,15JPSS24F00000983,0,0.0,8000.0,NATIONAL AERONAUTICS AND SPACE ADMINISTRATION,NNG15SD91B,11,0.0,...,,,,,,,,https://www.usaspending.gov/award/CONT_AWD_15J...,2024-09-30,2024-10-01
3,1544_-NONE-_15M10224PA4700495_0_-NONE-_0,CONT_AWD_15M10224PA4700495_1544_-NONE-_-NONE-,15M10224PA4700495,0,0.0,,,,,54725.0,...,,,,,,,,https://www.usaspending.gov/award/CONT_AWD_15M...,2024-09-30,2024-09-30
4,1542_1542_15UASH24F00000851_0_15UC0C21D00001511_0,CONT_AWD_15UASH24F00000851_1542_15UC0C21D00001...,15UASH24F00000851,0,0.0,1542.0,"FEDERAL PRISON INDUSTRIES, INC.",15UC0C21D00001511,0,43452.9,...,,,,,,,,https://www.usaspending.gov/award/CONT_AWD_15U...,2024-10-01,2024-10-01
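
The combined dataframe is very wide, while the later cells use only ten columns. As a sketch of one way to reduce memory use and keep identifier columns as text, `pd.read_csv` accepts `usecols` and `dtype`. The helper name `read_contract_csvs`, the shortened column list, and the demo files are hypothetical; the real notebook reads every column.

```python
import os
import tempfile

import pandas as pd

# Shortened, illustrative column list (the notebook itself selects ten columns later)
NEEDED_COLUMNS = ["award_id_piid", "recipient_uei", "potential_total_value_of_award"]
STRING_COLUMNS = ["award_id_piid", "recipient_uei"]


def read_contract_csvs(folder, columns, string_columns):
    """Read only the requested columns from every CSV in folder and concatenate them."""
    frames = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".csv"):
            frames.append(
                pd.read_csv(
                    os.path.join(folder, name),
                    usecols=columns,
                    # read identifiers as text so values like "0012" keep leading zeros
                    dtype={col: str for col in string_columns},
                )
            )
    return pd.concat(frames, ignore_index=True)


# Demo on two small made-up files
demo_folder = tempfile.mkdtemp()
header = "award_id_piid,recipient_uei,potential_total_value_of_award,extra\n"
with open(os.path.join(demo_folder, "a.csv"), "w") as f:
    f.write(header + "0012,UEI1,100.5,x\nA2,UEI2,200,y\n")
with open(os.path.join(demo_folder, "b.csv"), "w") as f:
    f.write(header + "B3,UEI3,300,z\n")

combined = read_contract_csvs(demo_folder, NEEDED_COLUMNS, STRING_COLUMNS)
print(combined.shape)
```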


In [6]:
# df_2024 = pd.read_csv(f'FY2024_015_Contracts_Full_20250213_1.csv')
# df_2025 = pd.read_csv(f'FY2025_015_Contracts_Full_20250213_1.csv')

# # combine the two dataframes
# df = pd.concat([df_2024, df_2025])

In [7]:
# select the 10 columns used in the summaries: award_id_piid, action_date_fiscal_year, recipient_uei, recipient_name, recipient_parent_uei, recipient_parent_name, product_or_service_code_description, product_or_service_code, last_modified_date, potential_total_value_of_award
df_narrow = df[['award_id_piid', 'action_date_fiscal_year', 'recipient_uei', 'recipient_name', 'recipient_parent_uei', 'recipient_parent_name', 'product_or_service_code_description', 'product_or_service_code', 'last_modified_date', 'potential_total_value_of_award']]

In [8]:
# Change column type to datetime64[ns] for column: 'last_modified_date'
df_narrow = df_narrow.astype({'last_modified_date': 'datetime64[ns]'})
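
The `astype` conversion above raises an error if any value cannot be parsed as a date. A more forgiving sketch, assuming the dates follow the `YYYY-MM-DD` pattern seen in the output above, uses `pd.to_datetime` with `errors="coerce"` so that unparseable values become `NaT` and can be counted or inspected. The sample values are made up.

```python
import pandas as pd

# Made-up values: two valid dates, one malformed string, one missing value
raw = pd.Series(["2024-09-30", "2024-10-01", "not a date", None])

# errors="coerce" turns anything that does not match the format into NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed.isna().sum())
```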

In [9]:
# 1. for each award_id_piid, keep the row(s) with the latest last_modified_date, then report potential_total_value_of_award and a row count

# Group by award_id_piid and get max last_modified_date
df_grouped_award_id = df_narrow.groupby('award_id_piid')['last_modified_date'].max().reset_index()
# Merge to get potential_total_value_of_award for max last_modified_date
df_grouped_award_id_max_last_modified = pd.merge(df_narrow, df_grouped_award_id, on=['award_id_piid', 'last_modified_date'], how='inner')
# Group by award_id_piid to get the number of rows
row_counts = df_grouped_award_id_max_last_modified.groupby('award_id_piid').size().reset_index(name='row_count')
# Merge row counts back to the main DataFrame
df_grouped_award_id_max_last_modified = pd.merge(df_grouped_award_id_max_last_modified, row_counts, on='award_id_piid', how='inner')
# Select relevant columns and sort
df_grouped_award_id_max_last_modified = df_grouped_award_id_max_last_modified[['award_id_piid', 'action_date_fiscal_year', 'last_modified_date', 'potential_total_value_of_award', 'row_count']].sort_values(by='potential_total_value_of_award', ascending=False)

# format the potential_total_value_of_award column as currency
df_grouped_award_id_max_last_modified['potential_total_value_of_award'] = df_grouped_award_id_max_last_modified['potential_total_value_of_award'].map('${:,.2f}'.format)
# format the last_modified_date column as a YYYY-MM-DD string
df_grouped_award_id_max_last_modified['last_modified_date'] = df_grouped_award_id_max_last_modified['last_modified_date'].dt.strftime('%Y-%m-%d')

df_grouped_award_id_max_last_modified.head()

Unnamed: 0,award_id_piid,action_date_fiscal_year,last_modified_date,potential_total_value_of_award,row_count
17235,15F06724A0000311,2024,2024-09-18,"$86,000,000,000.00",1
55303,15F06719F0001923,2025,2024-11-18,"$50,481,562,555.57",2
55011,15F06719F0001923,2025,2024-11-18,"$50,481,562,555.57",2
73066,15JPSS25F00000239,2025,2025-01-22,"$10,000,000,000.00",1
16835,15F06724A0000312,2024,2024-09-18,"$8,600,000,000.00",1


In [10]:
# 2. for each recipient_uei, keep the row(s) with the latest last_modified_date, then report potential_total_value_of_award and a row count

# Group by recipient_uei and get max last_modified_date
df_grouped_recipient_uei = df_narrow.groupby('recipient_uei')['last_modified_date'].max().reset_index()
# Merge to get potential_total_value_of_award for max last_modified_date
df_grouped_recipient_uei_max_last_modified = pd.merge(df_narrow, df_grouped_recipient_uei, on=['recipient_uei', 'last_modified_date'], how='inner')
# Group by recipient_uei to get the number of rows
row_counts = df_grouped_recipient_uei_max_last_modified.groupby('recipient_uei').size().reset_index(name='row_count')
# Merge row counts back to the main DataFrame
df_grouped_recipient_uei_max_last_modified = pd.merge(df_grouped_recipient_uei_max_last_modified, row_counts, on='recipient_uei', how='inner')
# Select relevant columns and sort
df_grouped_recipient_uei_max_last_modified = df_grouped_recipient_uei_max_last_modified[['recipient_uei', 'recipient_name', 'action_date_fiscal_year', 'last_modified_date', 'potential_total_value_of_award', 'row_count']].sort_values(by='potential_total_value_of_award', ascending=False)

# format the potential_total_value_of_award column as currency
df_grouped_recipient_uei_max_last_modified['potential_total_value_of_award'] = df_grouped_recipient_uei_max_last_modified['potential_total_value_of_award'].map('${:,.2f}'.format)
# format the last_modified_date column as a YYYY-MM-DD string
df_grouped_recipient_uei_max_last_modified['last_modified_date'] = df_grouped_recipient_uei_max_last_modified['last_modified_date'].dt.strftime('%Y-%m-%d')

df_grouped_recipient_uei_max_last_modified.head()

Unnamed: 0,recipient_uei,recipient_name,action_date_fiscal_year,last_modified_date,potential_total_value_of_award,row_count
5322,QGJNGLBLVKY6,"CHENEGA INTEGRATED SECURITY SOLUTIONS, LLC",2025,2024-11-18,"$50,481,562,555.57",2
5381,QGJNGLBLVKY6,"CHENEGA INTEGRATED SECURITY SOLUTIONS, LLC",2025,2024-11-18,"$50,481,562,555.57",2
2358,P2RFJLVYFLF3,KNOWLEDGE MANAGEMENT INC,2024,2024-09-19,"$8,600,000,000.00",1
2278,JPHFV985CTT2,"BASTION ANALYTICS, LLC",2024,2024-09-18,"$8,600,000,000.00",1
2361,U5NTTK3LE8D4,"IT VETERANS, LLC",2024,2024-09-18,"$8,600,000,000.00",1


In [11]:
# 3. for each recipient_parent_uei, keep the row(s) with the latest last_modified_date, then report potential_total_value_of_award and a row count

# Group by recipient_parent_uei and get max last_modified_date
df_grouped_recipient_parent_uei = df_narrow.groupby('recipient_parent_uei')['last_modified_date'].max().reset_index()
# Merge to get potential_total_value_of_award for max last_modified_date
df_grouped_recipient_parent_uei_max_last_modified = pd.merge(df_narrow, df_grouped_recipient_parent_uei, on=['recipient_parent_uei', 'last_modified_date'], how='inner')
# Group by recipient_parent_uei to get the number of rows
row_counts = df_grouped_recipient_parent_uei_max_last_modified.groupby('recipient_parent_uei').size().reset_index(name='row_count')
# Merge row counts back to the main DataFrame
df_grouped_recipient_parent_uei_max_last_modified = pd.merge(df_grouped_recipient_parent_uei_max_last_modified, row_counts, on='recipient_parent_uei', how='inner')
# Select relevant columns and sort
df_grouped_recipient_parent_uei_max_last_modified = df_grouped_recipient_parent_uei_max_last_modified[['recipient_parent_uei', 'action_date_fiscal_year', 'last_modified_date', 'potential_total_value_of_award', 'row_count']].sort_values(by='potential_total_value_of_award', ascending=False)

# format the potential_total_value_of_award column as currency
df_grouped_recipient_parent_uei_max_last_modified['potential_total_value_of_award'] = df_grouped_recipient_parent_uei_max_last_modified['potential_total_value_of_award'].map('${:,.2f}'.format)
# format the last_modified_date column as a YYYY-MM-DD string
df_grouped_recipient_parent_uei_max_last_modified['last_modified_date'] = df_grouped_recipient_parent_uei_max_last_modified['last_modified_date'].dt.strftime('%Y-%m-%d')

df_grouped_recipient_parent_uei_max_last_modified.head()

Unnamed: 0,recipient_parent_uei,action_date_fiscal_year,last_modified_date,potential_total_value_of_award,row_count
6051,QGJNGLBLVKY6,2025,2024-11-18,"$50,481,562,555.57",2
6108,QGJNGLBLVKY6,2025,2024-11-18,"$50,481,562,555.57",2
2304,CFWRL5LXXX93,2024,2024-09-18,"$8,600,000,000.00",1
6353,GLPKRZJL8GM3,2025,2024-11-05,"$8,600,000,000.00",1
2329,FPVJBR6CXML9,2024,2024-09-18,"$8,600,000,000.00",1


In [12]:
# 4. for each product_or_service_code, keep the row(s) with the latest last_modified_date, then report potential_total_value_of_award and a row count

# Group by product_or_service_code and get max last_modified_date
df_grouped_product_or_service_code = df_narrow.groupby('product_or_service_code')['last_modified_date'].max().reset_index()
# Merge to get potential_total_value_of_award for max last_modified_date
df_grouped_product_or_service_code_max_last_modified = pd.merge(df_narrow, df_grouped_product_or_service_code, on=['product_or_service_code', 'last_modified_date'], how='inner')
# Group by product_or_service_code to get the number of rows
row_counts = df_grouped_product_or_service_code_max_last_modified.groupby('product_or_service_code').size().reset_index(name='row_count')
# Merge row counts back to the main DataFrame
df_grouped_product_or_service_code_max_last_modified = pd.merge(df_grouped_product_or_service_code_max_last_modified, row_counts, on='product_or_service_code', how='inner')
# Select relevant columns and sort
df_grouped_product_or_service_code_max_last_modified = df_grouped_product_or_service_code_max_last_modified[['product_or_service_code', 'product_or_service_code_description', 'action_date_fiscal_year', 'last_modified_date', 'potential_total_value_of_award', 'row_count']].sort_values(by='potential_total_value_of_award', ascending=False)

# format the potential_total_value_of_award column as currency
df_grouped_product_or_service_code_max_last_modified['potential_total_value_of_award'] = df_grouped_product_or_service_code_max_last_modified['potential_total_value_of_award'].map('${:,.2f}'.format)
# format the last_modified_date column as a YYYY-MM-DD string
df_grouped_product_or_service_code_max_last_modified['last_modified_date'] = df_grouped_product_or_service_code_max_last_modified['last_modified_date'].dt.strftime('%Y-%m-%d')

df_grouped_product_or_service_code_max_last_modified.head()

Unnamed: 0,product_or_service_code,product_or_service_code_description,action_date_fiscal_year,last_modified_date,potential_total_value_of_award,row_count
1336,D399,IT AND TELECOM- OTHER IT AND TELECOMMUNICATIONS,2025,2025-01-31,"$4,934,000,000.00",1
577,C1AA,ARCHITECT AND ENGINEERING- CONSTRUCTION: OFFIC...,2025,2025-01-27,"$512,709,360.00",1
482,Y1FF,CONSTRUCTION OF PENAL FACILITIES,2025,2025-01-23,"$461,497,000.00",1
627,7030,INFORMATION TECHNOLOGY SOFTWARE,2025,2025-01-30,"$440,896,876.52",1
652,Y1AA,CONSTRUCTION OF OFFICE BUILDINGS,2025,2024-12-03,"$403,835,926.00",1


In [13]:
# export each dataframe to a separate sheet within an excel file
# use a relative path rather than an absolute path
with pd.ExcelWriter('USA_Spending_Summary_Analysis.xlsx') as writer:
    df_narrow.to_excel(writer, sheet_name='raw_data')
    df_grouped_award_id_max_last_modified.to_excel(writer, sheet_name='Summary by Award ID')
    df_grouped_recipient_uei_max_last_modified.to_excel(writer, sheet_name='Summary by Recipient UEI')
    df_grouped_recipient_parent_uei_max_last_modified.to_excel(writer, sheet_name='Summary by Parent Recipient UEI')
    df_grouped_product_or_service_code_max_last_modified.to_excel(writer, sheet_name='Summary by PSC')