# Initial Pipeline
**Goal:** Implement a data processing pipeline using appropriate data structures for different operations on contract financial data.

**Tasks:**
1. Create a basic pipeline script that:
   - Loads contract transaction data using a naive approach (loading all at once)
   - Processes vendor and project data using basic data structures
   - Implements simple financial analysis functions (spend by category, budget variance)

2. Implement specific components with appropriate data structures:
   - **Contract Lookup:** Dictionary-based contract catalog with average-case O(1) access by contract ID
   - **Transaction Processing:** List-based storage with sequential processing for financial transactions
   - **Vendor Analysis:** Set-based unique vendor grouping for spend analysis
   - **Timeline Tracking:** Queue-based transaction sequence for chronological analysis

3. Document each data structure choice with justification and complexity analysis
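
The four components in task 2 could be sketched roughly as follows, using only built-in structures. The contract and vendor IDs are taken from the contracts preview in this notebook; the transaction records and their `date` and `amount` fields are hypothetical, since no transactions file is shown here.

```python
from collections import deque

# Contract/vendor IDs come from the contracts preview; everything else is illustrative.
contracts = [
    {"contract_id": "CTR-000001", "vendor_id": "VEN-0132"},
    {"contract_id": "CTR-000002", "vendor_id": "VEN-0084"},
]
# Hypothetical transactions: the "date" and "amount" fields are assumptions.
transactions = [
    {"contract_id": "CTR-000002", "vendor_id": "VEN-0084", "date": "2024-02-01", "amount": 250.0},
    {"contract_id": "CTR-000001", "vendor_id": "VEN-0132", "date": "2024-01-15", "amount": 100.0},
    {"contract_id": "CTR-000001", "vendor_id": "VEN-0132", "date": "2024-03-10", "amount": 50.0},
]

# Contract lookup: dict keyed by contract ID, average-case O(1) access.
contract_catalog = {c["contract_id"]: c for c in contracts}

# Transaction processing: plain list, one O(n) sequential pass.
total_amount = sum(t["amount"] for t in transactions)

# Vendor analysis: set of unique vendor IDs, average-case O(1) membership test.
unique_vendors = {t["vendor_id"] for t in transactions}

# Timeline tracking: deque used as a FIFO queue over date-sorted transactions.
# Sorting costs O(n log n); each popleft() is O(1). ISO dates sort correctly as strings.
timeline = deque(sorted(transactions, key=lambda t: t["date"]))
first = timeline.popleft()
```

A `deque` is used for the queue because removing from the front of a plain list is O(n), while `popleft()` is O(1).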

**Deliverable:** A functioning but unoptimized pipeline (`initial_financial_pipeline.py`)
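
As a rough illustration of task 1, the sketch below loads everything into memory at once and runs two simple analysis functions. It is not the deliverable itself. The rows are copied from the contracts preview in this notebook; treating `contract_type` as the spend category and defining variance as `current_value - original_value` are assumptions, and the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the contracts.csv preview, trimmed to the columns used here.
SAMPLE_CSV = """contract_id,contract_type,original_value,current_value
CTR-000001,Cost-Plus-Incentive-Fee,20837173.33,18753456.0
CTR-000002,Time-and-Materials,19545536.6,19545536.6
CTR-000004,Time-and-Materials,40540302.76,60810454.14
"""

def load_all(handle):
    """Naive load: read every row into a list in memory at once."""
    return list(csv.DictReader(handle))

def spend_by_category(rows, category_field="contract_type"):
    """Sum current_value per category in a single pass over the list."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[category_field]] += float(row["current_value"])
    return dict(totals)

def value_variance(row):
    """Assumed definition: current_value minus original_value."""
    return float(row["current_value"]) - float(row["original_value"])

rows = load_all(io.StringIO(SAMPLE_CSV))
spend = spend_by_category(rows)
variances = {r["contract_id"]: round(value_variance(r), 2) for r in rows}
```

Against the real file, `load_all` would be given an open file handle instead of the in-memory sample.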

## Exploratory Data Analysis
Let's look at the structure of each incoming dataset.

### Contracts

In [18]:
import pandas as pd
import os

DATA_PATH = '../data/raw/'

# Import contracts data
con_df = pd.read_csv(os.path.join(DATA_PATH, 'contracts.csv'))
con_df.head()

Unnamed: 0,contract_id,contract_number,vendor_id,project_id,contract_type,start_date,end_date,original_value,current_value,status,department,description,contracting_officer
0,CTR-000001,N0027132-53-D-2254,VEN-0132,PRJ-0047,Cost-Plus-Incentive-Fee,2024-01-07,2025-01-06,20837173.33,18753456.0,Active,Coast Guard,exploit killer markets,PER-00528
1,CTR-000002,N0041190-61-D-4930,VEN-0084,PRJ-0027,Time-and-Materials,2023-07-13,2025-07-12,19545536.6,19545536.6,Active,Marines,orchestrate turn-key schemas,PER-00491
2,CTR-000003,N0092555-85-D-1451,VEN-0190,PRJ-0016,Firm-Fixed-Price,2023-05-31,2024-05-30,46294987.63,46294987.63,Active,NSA,architect innovative functionalities,PER-00286
3,CTR-000004,N0097012-45-D-2287,VEN-0061,PRJ-0001,Time-and-Materials,2022-09-20,2025-09-19,40540302.76,60810454.14,Active,DARPA,repurpose impactful web services,PER-00380
4,CTR-000005,N0079544-11-D-7162,VEN-0101,PRJ-0002,Firm-Fixed-Price,2021-01-13,2024-01-13,23498686.99,23498686.99,Active,Army,maximize bricks-and-clicks web services,PER-00641


In [14]:
con_df.describe(include='all')

Unnamed: 0,contract_id,contract_number,vendor_id,project_id,contract_type,start_date,end_date,original_value,current_value,status,department,description,contracting_officer
count,500,500,500,500,500,500,500,500.0,500.0,500,500,500,500
unique,500,500,187,50,6,442,461,,,5,9,498,392
top,CTR-000001,N0027132-53-D-2254,VEN-0004,PRJ-0044,Cost-Plus-Award-Fee,2024-03-17,2025-08-14,,,Active,NSA,transition clicks-and-mortar users,PER-00292
freq,1,1,8,18,96,3,3,,,304,64,2,4
mean,,,,,,,,25877930.0,26330280.0,,,,
std,,,,,,,,14355450.0,15012350.0,,,,
min,,,,,,,,101591.6,111750.8,,,,
25%,,,,,,,,13551330.0,13219790.0,,,,
50%,,,,,,,,26167850.0,26031200.0,,,,
75%,,,,,,,,38762310.0,39204060.0,,,,


In [15]:
con_df.isna().sum()

contract_id            0
contract_number        0
vendor_id              0
project_id             0
contract_type          0
start_date             0
end_date               0
original_value         0
current_value          0
status                 0
department             0
description            0
contracting_officer    0
dtype: int64

In [32]:
# Build a dictionary keyed by contract_id for average-case O(1) lookup
contract_dict = {row['contract_id']: row for row in con_df.to_dict('records')}

### Personnel

In [27]:
pers_df = pd.read_csv(os.path.join(DATA_PATH, 'personnel.csv'))
pers_df.head()

Unnamed: 0,personnel_id,name,role,department,email,phone,security_clearance,hire_date,supervisor
0,PER-00001,Rodney Freeman,Financial Analyst,DLA,molly37@example.com,+1-627-733-0090x74528,Confidential,2020-10-23,PER-00987
1,PER-00002,Joseph Perez,Financial Analyst,DARPA,monroejonathan@example.net,780-443-3140x9757,Top Secret,2006-10-31,PER-00235
2,PER-00003,Brenda Hammond,Contract Specialist,DIA,proberts@example.com,3778604660,Public Trust,2013-03-24,PER-00134
3,PER-00004,James Vargas,Financial Analyst,Army,gonzalesgabriel@example.net,+1-260-556-9370,Confidential,2017-03-21,PER-00355
4,PER-00005,Robert Lester,Subject Matter Expert,DARPA,ruizmisty@example.com,+1-679-403-4574x4062,Confidential,2018-02-03,PER-00244


In [28]:
pers_df.describe(include='all')

Unnamed: 0,personnel_id,name,role,department,email,phone,security_clearance,hire_date,supervisor
count,1000,1000,1000,1000,1000,1000,1000,1000,796
unique,1000,994,8,9,997,1000,5,939,548
top,PER-00001,Crystal Reed,Quality Assurance,Coast Guard,hcole@example.net,+1-627-733-0090x74528,Top Secret,2011-06-24,PER-00538
freq,1,2,141,120,2,1,227,3,5


### Projects

In [30]:
proj_df = pd.read_csv(os.path.join(DATA_PATH, 'projects.csv'))
proj_df.head()

Unnamed: 0,project_id,name,type,description,start_date,end_date,total_budget,department,program_manager,priority
0,PRJ-0001,Project Synergistic national access,Construction,Simply live others system. Threat painting eve...,2015-10-13,2018-10-12,465549100.0,DIA,PER-00454,Medium
1,PRJ-0002,Project Public-key multimedia service-desk,Research,Current someone market government. Third train...,2017-04-15,2020-04-14,465707400.0,DIA,PER-00871,Low
2,PRJ-0003,Project Universal client-server hierarchy,Consulting,Admit ever community develop. Structure up ano...,2017-03-23,2021-03-22,5260635.0,DIA,PER-00198,High
3,PRJ-0004,Project Visionary web-enabled parallelism,Construction,Answer white now religious stuff. Sort across ...,2022-01-13,2028-01-12,220555100.0,DIA,PER-00625,Medium
4,PRJ-0005,Project Mandatory even-keeled service-desk,Research,Serious never cost information rather. Late hi...,2018-01-31,2025-01-29,420588000.0,Coast Guard,PER-00653,High


In [31]:
proj_df.describe(include='all')

Unnamed: 0,project_id,name,type,description,start_date,end_date,total_budget,department,program_manager,priority
count,50,50,50,50,50,50,50.0,50,50,50
unique,50,50,10,50,49,49,,9,46,4
top,PRJ-0001,Project Synergistic national access,Development,Simply live others system. Threat painting eve...,2015-10-13,2019-10-12,,NSA,PER-00081,Low
freq,1,1,8,1,2,2,,9,2,18
mean,,,,,,,256212100.0,,,
std,,,,,,,159791300.0,,,
min,,,,,,,5260635.0,,,
25%,,,,,,,101777000.0,,,
50%,,,,,,,255137500.0,,,
75%,,,,,,,417494000.0,,,
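
One way the project budgets could feed the budget-variance task is to compare each project's `total_budget` with the summed `current_value` of its contracts. The sketch below assumes that definition of variance, which the task brief does not confirm. The figures are copied from the projects and contracts previews as displayed.

```python
from collections import defaultdict

# Values copied from the projects.csv and contracts.csv previews as displayed.
project_budgets = {"PRJ-0001": 465549100.0, "PRJ-0002": 465707400.0}
contracts = [
    {"contract_id": "CTR-000004", "project_id": "PRJ-0001", "current_value": 60810454.14},
    {"contract_id": "CTR-000005", "project_id": "PRJ-0002", "current_value": 23498686.99},
]

# Group contract values by project in one O(n) pass.
committed = defaultdict(float)
for c in contracts:
    committed[c["project_id"]] += c["current_value"]

# Assumed definition: variance = total_budget - sum of contract current_value.
budget_variance = {
    pid: budget - committed[pid] for pid, budget in project_budgets.items()
}
```

A positive value here means the project's contracts sum to less than its budget.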


### Deliverables

In [22]:
deliv_df = pd.read_csv(os.path.join(DATA_PATH, 'deliverables.csv'))
deliv_df.head()

Unnamed: 0,deliverable_id,contract_id,title,type,due_date,delivery_date,status,description,accepted,reviewer
0,DEL-000001,CTR-000453,Deliverable harness plug-and-play channels,Training,2024-07-15,2024-07-15,Delivered,Allow especially outside return stage full pre...,Yes,PER-00549
1,DEL-000002,CTR-000397,Deliverable facilitate efficient convergence,Documentation,2024-03-08,2024-03-11,Accepted,Black cover company somebody become fish appro...,,PER-00423
2,DEL-000003,CTR-000095,Deliverable drive rich functionalities,Prototype,2022-01-30,,Pending,Sense expect stuff up.,,PER-00136
3,DEL-000004,CTR-000291,Deliverable deploy integrated info-mediaries,Hardware,2020-01-12,2020-01-12,Delivered,Lose lay plan hundred necessary in wide.,Yes,PER-00292
4,DEL-000005,CTR-000314,Deliverable facilitate interactive web-readiness,Data,2026-04-25,2026-05-02,Accepted,Ever avoid responsibility both as develop.,,PER-00254


In [23]:
deliv_df.describe(include='all')

Unnamed: 0,deliverable_id,contract_id,title,type,due_date,delivery_date,status,description,accepted,reviewer
count,5000,5000,5000,5000,5000,3020,5000,5000,1019,5000
unique,5000,500,4914,7,2210,1701,5,5000,3,993
top,DEL-000001,CTR-000343,Deliverable redefine open-source convergence,Documentation,2024-11-23,2025-06-12,Pending,Allow especially outside return stage full pre...,Yes,PER-00564
freq,1,20,2,780,11,7,1038,1,801,15
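
The `count` row shows that `delivery_date` (3020 of 5000) and `accepted` (1019 of 5000) are often empty, so the pipeline needs to tolerate missing values. Here is a minimal sketch of counting empty fields and computing days late with the standard library, using the five preview rows trimmed to the relevant columns. The `days_late` measure is an assumption for illustration and is not defined in the task brief.

```python
import csv
import io
from collections import Counter
from datetime import date

# Rows copied from the deliverables.csv preview (subset of columns).
SAMPLE_CSV = """deliverable_id,due_date,delivery_date,status,accepted
DEL-000001,2024-07-15,2024-07-15,Delivered,Yes
DEL-000002,2024-03-08,2024-03-11,Accepted,
DEL-000003,2022-01-30,,Pending,
DEL-000004,2020-01-12,2020-01-12,Delivered,Yes
DEL-000005,2026-04-25,2026-05-02,Accepted,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Count empty fields per column (csv yields "" where pandas would show NaN).
missing = Counter(col for row in rows for col, val in row.items() if val == "")

# Days late for rows that have a delivery date; positive means delivered after due_date.
days_late = {
    row["deliverable_id"]: (
        date.fromisoformat(row["delivery_date"]) - date.fromisoformat(row["due_date"])
    ).days
    for row in rows
    if row["delivery_date"]
}
```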


### Contract Modifications

In [21]:
# Load the contract modifications data
con_mod_df = pd.read_csv(os.path.join(DATA_PATH, 'contract_modifications.csv'))
con_mod_df.head()

Unnamed: 0,modification_id,contract_id,mod_number,mod_date,type,description,value_change,days_change,approved_by,status
0,MOD-000001,CTR-000183,P199,2024-06-21,Funding,Stuff through guy member.,4390464.09,0,PER-00718,Rejected
1,MOD-000002,CTR-000377,P972,2027-08-21,Schedule,Staff same serious visit past time admit.,0.0,83,PER-00066,Approved
2,MOD-000003,CTR-000221,P750,2024-08-07,Administrative,Understand from age receive ready particularly...,0.0,0,PER-00615,Approved
3,MOD-000004,CTR-000078,P313,2022-07-27,Termination,Position resource direction experience north r...,0.0,0,PER-00376,Rejected
4,MOD-000005,CTR-000337,P113,2023-12-20,Termination,Society despite sense write doctor article they.,0.0,0,PER-00463,Pending


In [4]:
con_mod_df.describe(include='all')

Unnamed: 0,modification_id,contract_id,mod_number,mod_date,type,description,value_change,days_change,approved_by,status
count,2000,2000,2000,2000,2000,2000,2000.0,2000.0,2000,2000
unique,2000,492,799,1370,6,2000,,,889,4
top,MOD-000001,CTR-000097,P248,2025-05-01,Extension,Stuff through guy member.,,,PER-00115,Pending
freq,1,10,9,7,379,1,,,8,529
mean,,,,,,,744931.8,25.5205,,
std,,,,,,,3468195.0,49.920854,,
min,,,,,,,-10384820.0,-30.0,,
25%,,,,,,,0.0,0.0,,
50%,,,,,,,0.0,0.0,,
75%,,,,,,,838213.1,28.0,,
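
Modifications carry a `mod_date`, which makes them a natural fit for the queue-based timeline component. The sketch below processes the five preview rows in chronological order and accumulates changes per contract. Counting only `Approved` modifications is an assumption, not something the data or the brief states.

```python
from collections import defaultdict, deque

# Rows copied from the contract_modifications.csv preview (subset of columns).
mods = [
    {"modification_id": "MOD-000001", "contract_id": "CTR-000183",
     "mod_date": "2024-06-21", "value_change": 4390464.09, "days_change": 0, "status": "Rejected"},
    {"modification_id": "MOD-000002", "contract_id": "CTR-000377",
     "mod_date": "2027-08-21", "value_change": 0.0, "days_change": 83, "status": "Approved"},
    {"modification_id": "MOD-000003", "contract_id": "CTR-000221",
     "mod_date": "2024-08-07", "value_change": 0.0, "days_change": 0, "status": "Approved"},
    {"modification_id": "MOD-000004", "contract_id": "CTR-000078",
     "mod_date": "2022-07-27", "value_change": 0.0, "days_change": 0, "status": "Rejected"},
    {"modification_id": "MOD-000005", "contract_id": "CTR-000337",
     "mod_date": "2023-12-20", "value_change": 0.0, "days_change": 0, "status": "Pending"},
]

# FIFO queue in chronological order; ISO dates sort correctly as strings.
queue = deque(sorted(mods, key=lambda m: m["mod_date"]))

processed_order = []
approved_value = defaultdict(float)
approved_days = defaultdict(int)

while queue:
    mod = queue.popleft()  # O(1) removal from the front
    processed_order.append(mod["modification_id"])
    if mod["status"] == "Approved":  # assumption: only approved mods take effect
        approved_value[mod["contract_id"]] += mod["value_change"]
        approved_days[mod["contract_id"]] += mod["days_change"]
```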
