# Building Unified Entity Profiles

**Purpose**:
  1. Load the cleaned individual source datasets produced by Notebook 01.
  2. Define the central entity ('Taxpayer') and establish a base list of all
     unique entities across sources.
  3. Merge the source datasets on their common identifier
     ('Taxpayer ID').
  4. Handle one-to-many relationships (e.g., multiple properties or directorships
     per taxpayer) through aggregation (calculating counts, sums, means, etc.).
  5. Produce a single, unified DataFrame where each row represents a unique
     taxpayer profile with combined information from all available sources.
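
The steps above can be sketched on toy data. This is a minimal illustration, not the notebook's actual pipeline: the frames `tax` and `prop` and their values are made up, and only two of the aggregate columns are reproduced.

```python
import pandas as pd

# Toy stand-ins for the cleaned sources (values are illustrative only).
tax = pd.DataFrame({
    "Taxpayer ID": ["A", "B"],
    "Declared Income": [50000.0, 72000.0],
})
prop = pd.DataFrame({
    "Taxpayer ID": ["A", "A", "C"],
    "Property Value": [100000.0, 250000.0, 80000.0],
})

# Base list: every ID seen in any source, de-duplicated.
base = pd.DataFrame({
    "Taxpayer ID": pd.concat([tax["Taxpayer ID"], prop["Taxpayer ID"]]).unique()
})

# Collapse the one-to-many source to one row per taxpayer before merging,
# so the left merge cannot multiply rows in the profile.
prop_agg = (
    prop.groupby("Taxpayer ID")
    .agg(
        prop_count=("Property Value", "count"),
        prop_value_total=("Property Value", "sum"),
    )
    .reset_index()
)

profile = (
    base.merge(tax, on="Taxpayer ID", how="left")
    .merge(prop_agg, on="Taxpayer ID", how="left")
)
print(profile)
```

Taxpayer `B` has no property rows and `C` has no tax filing, so both end up with NaN in the columns from the source they are missing from, which is the situation the later NaN-handling step addresses.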

**Prerequisites**:
  - Successful completion of Notebook 01.
  - Existence of cleaned data files (or cleaned DataFrames in memory):
    - 'cleaned_tax_filings.csv'
    - 'cleaned_property_ownership.csv'
    - (Optional) 'cleaned_company_directorships.csv'

**Outputs**:
  - A Pandas DataFrame (`unified_profile_df`) containing the unified profiles.
  - This DataFrame saved to a CSV file (e.g., 'unified_taxpayer_profiles.csv').

**Next Step**:
  Notebook 03 will perform feature engineering on this unified profile dataset.

## Imports and Configuration

In [1]:
import pandas as pd
import numpy as np
import os
from IPython.display import display

# --- Configuration ---
# Assuming cleaned data was saved in Notebook 01 in a 'cleaned' subdirectory
CLEANED_DATA_DIR = './data/cleaned'
OUTPUT_DIR = './data/processed' # Directory to save the unified profile

TAX_FILE = os.path.join(CLEANED_DATA_DIR, 'cleaned_tax_filings.csv')
PROP_FILE = os.path.join(CLEANED_DATA_DIR, 'cleaned_property_ownership.csv')
COMP_FILE = os.path.join(CLEANED_DATA_DIR, 'cleaned_company_directorships.csv') # Optional

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

print("Notebook 02: Building Unified Entity Profiles")
print("-" * 50)
print(f"Looking for cleaned data in: {CLEANED_DATA_DIR}")
print(f"Output directory for unified profile: {OUTPUT_DIR}")
print("-" * 50)

Notebook 02: Building Unified Entity Profiles
--------------------------------------------------
Looking for cleaned data in: ./data/cleaned
Output directory for unified profile: ./data/processed
--------------------------------------------------


## Load Cleaned Data Sources

In [2]:
try:
    tax_df_cleaned = pd.read_csv(TAX_FILE)
    print(f"Successfully loaded cleaned Tax Filings data: {tax_df_cleaned.shape}")
except FileNotFoundError:
    print(f"ERROR: Cleaned Tax Filings file not found at {TAX_FILE}.")
    print("Please ensure Notebook 01 was run successfully and saved the cleaned files.")
    raise

try:
    property_df_cleaned = pd.read_csv(PROP_FILE, parse_dates=['Ownership Date']) # Parse dates on load
    print(f"Successfully loaded cleaned Property Ownership data: {property_df_cleaned.shape}")
except FileNotFoundError:
    print(f"ERROR: Cleaned Property Ownership file not found at {PROP_FILE}.")
    print("Please ensure Notebook 01 was run successfully and saved the cleaned files.")
    raise

company_df_cleaned = None
if os.path.exists(COMP_FILE):
    try:
        company_df_cleaned = pd.read_csv(COMP_FILE)
        print(f"Successfully loaded cleaned Company Directorships data: {company_df_cleaned.shape}")
    except Exception as e:
        print(f"Warning: Could not load cleaned Company Directorships file at {COMP_FILE}. Error: {e}")
else:
    print(f"Info: Optional cleaned Company Directorships file not found at {COMP_FILE}. Skipping.")

# Ensure Taxpayer ID is string in all loaded dataframes
tax_df_cleaned['Taxpayer ID'] = tax_df_cleaned['Taxpayer ID'].astype(str)
property_df_cleaned['Taxpayer ID'] = property_df_cleaned['Taxpayer ID'].astype(str)
if company_df_cleaned is not None:
    company_df_cleaned['Taxpayer ID'] = company_df_cleaned['Taxpayer ID'].astype(str)

Successfully loaded cleaned Tax Filings data: (4750, 4)
Successfully loaded cleaned Property Ownership data: (4599, 4)
Successfully loaded cleaned Company Directorships data: (825, 3)


## Define Central Entity and Create Base DataFrame

In [3]:
print("Central Entity: Taxpayer")

# Collect all unique Taxpayer IDs from all available sources
id_sources = [tax_df_cleaned['Taxpayer ID'], property_df_cleaned['Taxpayer ID']]
if company_df_cleaned is not None:
    id_sources.append(company_df_cleaned['Taxpayer ID'])
# Concatenate once instead of growing an initially empty Series
all_ids = pd.concat(id_sources, ignore_index=True)

unique_taxpayer_ids = all_ids.unique()
print(f"Found {len(unique_taxpayer_ids)} unique Taxpayer IDs across all sources.")

# Create the base DataFrame containing all unique IDs
base_df = pd.DataFrame({'Taxpayer ID': unique_taxpayer_ids})
print("Created base DataFrame with all unique Taxpayer IDs.")

Central Entity: Taxpayer
Found 4895 unique Taxpayer IDs across all sources.
Created base DataFrame with all unique Taxpayer IDs.
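
Because the base list is the union of IDs from every source, some taxpayers will be absent from any given source (here, 4895 unique IDs against 4750 tax filings). A rough way to see that coverage before merging is sketched below on toy data; the names `tax_ids`, `prop_ids`, and `coverage` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Toy ID columns standing in for the 'Taxpayer ID' column of each source.
tax_ids = pd.Series(["A", "B", "C"])
prop_ids = pd.Series(["B", "C", "C", "D"])

# Union of IDs, keeping first-seen order.
all_ids = pd.concat([tax_ids, prop_ids], ignore_index=True).drop_duplicates()

# One boolean flag per source: is this ID present there at all?
coverage = pd.DataFrame({
    "Taxpayer ID": all_ids.to_numpy(),
    "in_tax": all_ids.isin(tax_ids).to_numpy(),
    "in_property": all_ids.isin(prop_ids).to_numpy(),
})

# Number of IDs covered by each source.
print(coverage[["in_tax", "in_property"]].sum())

# IDs that will get NaN tax columns after the left merge.
missing_from_tax = coverage.loc[~coverage["in_tax"], "Taxpayer ID"].tolist()
print(missing_from_tax)
```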


## Process & Merge Tax Filing Data (One-to-One)

In [4]:
# Assuming Taxpayer ID is unique in the cleaned tax data (handled in Notebook 01)
# Select relevant columns
tax_data_to_merge = tax_df_cleaned[['Taxpayer ID', 'Declared Income', 'Deductions', 'Sector']].copy()

# Perform a left merge to keep all taxpayers from the base_df
unified_profile_df = pd.merge(base_df, tax_data_to_merge, on='Taxpayer ID', how='left')

print(f"Merged Tax Filing data. Shape after merge: {unified_profile_df.shape}")
print(f"Columns added: {list(tax_data_to_merge.columns[1:])}")

Merged Tax Filing data. Shape after merge: (4895, 4)
Columns added: ['Declared Income', 'Deductions', 'Sector']
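
The merge above relies on 'Taxpayer ID' being unique in the tax data. One way to have pandas check that assumption, rather than trust it, is the `validate` argument of `pd.merge`. The sketch below uses made-up frames (`base`, `tax_ok`, `tax_dup`) to show the behaviour; it is not part of the original cell.

```python
import pandas as pd

base = pd.DataFrame({"Taxpayer ID": ["A", "B"]})
tax_ok = pd.DataFrame({"Taxpayer ID": ["A", "B"], "Declared Income": [1.0, 2.0]})
tax_dup = pd.DataFrame({"Taxpayer ID": ["A", "A"], "Declared Income": [1.0, 3.0]})

# Keys are unique on both sides, so the merge passes validation.
merged = pd.merge(base, tax_ok, on="Taxpayer ID", how="left", validate="one_to_one")

# A duplicated key on the right side raises MergeError instead of silently
# producing extra rows in the profile.
try:
    pd.merge(base, tax_dup, on="Taxpayer ID", how="left", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged.shape, duplicate_detected)
```

Failing loudly here is a design choice: a duplicated ID in a supposedly one-to-one source would otherwise break the "one row per taxpayer" guarantee without any visible error.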


## Process, Aggregate & Merge Property Data (One-to-Many)

In [5]:
# Group by Taxpayer ID and calculate aggregate statistics
print("Aggregating property data per Taxpayer ID...")
property_agg = property_df_cleaned.groupby('Taxpayer ID').agg(
    prop_count=('Property Value', 'count'),  # Counts non-null values only; rows with a missing value are excluded
    prop_value_total=('Property Value', 'sum'),
    prop_value_avg=('Property Value', 'mean'),
    prop_value_max=('Property Value', 'max'),
    prop_value_min=('Property Value', 'min'),
    prop_loc_distinct_count=('Location', 'nunique'),
    prop_ownership_earliest=('Ownership Date', 'min'),
    prop_ownership_latest=('Ownership Date', 'max')
).reset_index() # Reset index to make Taxpayer ID a column again

print("Aggregation complete. Sample aggregated property data:\n")
display(property_agg.head())

# Perform a left merge to add aggregated property data to the unified profile
unified_profile_df = pd.merge(unified_profile_df, property_agg, on='Taxpayer ID', how='left')

print(f"Merged aggregated Property Ownership data. Shape after merge: {unified_profile_df.shape}")
print(f"Columns added: {list(property_agg.columns[1:])}")

Aggregating property data per Taxpayer ID...
Aggregation complete. Sample aggregated property data:



|  | Taxpayer ID | prop_count | prop_value_total | prop_value_avg | prop_value_max | prop_value_min | prop_loc_distinct_count | prop_ownership_earliest | prop_ownership_latest |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TXP_0025D86389 | 1 | 131960.71 | 131960.71 | 131960.71 | 131960.71 | 1 | 2016-08-13 | 2016-08-13 |
| 1 | TXP_008A4AF7C8 | 1 | 373428.68 | 373428.68 | 373428.68 | 373428.68 | 1 | 1900-01-01 | 1900-01-01 |
| 2 | TXP_008A9682AC | 2 | 484747.54 | 242373.77 | 255940.82 | 228806.72 | 2 | 2005-07-13 | 2014-06-24 |
| 3 | TXP_00C8F6EB93 | 3 | 906750.37 | 302250.123333 | 662450.85 | 79487.02 | 2 | 2006-08-13 | 2018-11-14 |
| 4 | TXP_00F6FD6C6A | 4 | 1869803.61 | 467450.9025 | 754132.29 | 135473.23 | 3 | 2007-12-03 | 2017-06-18 |


Merged aggregated Property Ownership data. Shape after merge: (4895, 12)
Columns added: ['prop_count', 'prop_value_total', 'prop_value_avg', 'prop_value_max', 'prop_value_min', 'prop_loc_distinct_count', 'prop_ownership_earliest', 'prop_ownership_latest']
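
The `prop_count` column uses `'count'`, which skips rows whose 'Property Value' is missing. If the cleaned data could still contain such rows, the number of property records and the number of valued properties may differ. The toy example below contrasts `'count'` with `'size'`; the column name `prop_rows` is hypothetical and not part of the notebook's output.

```python
import numpy as np
import pandas as pd

# Taxpayer A owns two properties, one of which has no recorded value.
prop = pd.DataFrame({
    "Taxpayer ID": ["A", "A", "B"],
    "Property Value": [100000.0, np.nan, 80000.0],
})

agg = prop.groupby("Taxpayer ID").agg(
    prop_count=("Property Value", "count"),  # non-null values only
    prop_rows=("Property Value", "size"),    # all rows, including NaN
)
print(agg)
```

For taxpayer `A` the two columns disagree, which is the case to watch for if the count is meant to represent properties owned rather than properties with a known value.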


## Process, Aggregate & Merge Company Data (Optional, One-to-Many)

In [6]:
if company_df_cleaned is not None:
    print("Aggregating company directorship data per Taxpayer ID...")
    # Group by Taxpayer ID and calculate aggregate statistics
    company_agg = company_df_cleaned.groupby('Taxpayer ID').agg(
        directorship_count=('Director Role', 'count'),  # Counts non-null roles only
        comp_distinct_count=('Company ID', 'nunique')
    ).reset_index()

    print("Aggregation complete. Sample aggregated company data:")
    display(company_agg.head())

    # Perform a left merge to add aggregated company data
    unified_profile_df = pd.merge(unified_profile_df, company_agg, on='Taxpayer ID', how='left')

    print(f"Merged aggregated Company Directorship data. Shape after merge: {unified_profile_df.shape}")
    print(f"Columns added: {list(company_agg.columns[1:])}")

else:
    print("Skipping Company Directorship merge as the data was not loaded.")

Aggregating company directorship data per Taxpayer ID...
Aggregation complete. Sample aggregated company data:


|  | Taxpayer ID | directorship_count | comp_distinct_count |
|---|---|---|---|
| 0 | TXP_01B103822E | 3 | 3 |
| 1 | TXP_020C83F6CA | 2 | 2 |
| 2 | TXP_0230138BDC | 2 | 2 |
| 3 | TXP_027879F15E | 1 | 1 |
| 4 | TXP_0307C12673 | 2 | 2 |


Merged aggregated Company Directorship data. Shape after merge: (4895, 14)
Columns added: ['directorship_count', 'comp_distinct_count']


## Handle Missing Values Resulting from Joins

In [7]:
print("Filling NaNs created by left joins (e.g., taxpayers not present in a source).")

# For count and total columns generated by aggregation, NaN means the taxpayer
# had no rows in that source, so the count/total is 0.
count_total_cols = [
    'prop_count', 'prop_value_total', 'prop_loc_distinct_count',
    'directorship_count', 'comp_distinct_count'  # Present only if company data was merged
]

for col in count_total_cols:
    if col in unified_profile_df.columns:  # Skips company columns when that source was not loaded
        target_dtype = int if 'count' in col else float  # Counts are integers; totals stay float
        unified_profile_df[col] = unified_profile_df[col].fillna(0).astype(target_dtype)
        print(f"Filled NaNs in '{col}' with 0.")

# Other columns (e.g., avg, min, max, dates, sector, income, deductions) might
# legitimately be NaN if the taxpayer wasn't in the source OR if the source data
# was missing initially.
# Decision: We will handle these more strategically during Feature Engineering (Notebook 03).
# For now, let's just report the remaining NaNs.

print("\nRemaining missing values after filling 0 for counts/totals:")
print(unified_profile_df.isnull().sum())

Filling NaNs created by left joins (e.g., taxpayers not present in a source).
Filled NaNs in 'prop_count' with 0.
Filled NaNs in 'prop_value_total' with 0.
Filled NaNs in 'prop_loc_distinct_count' with 0.
Filled NaNs in 'directorship_count' with 0.
Filled NaNs in 'comp_distinct_count' with 0.

Remaining missing values after filling 0 for counts/totals:
Taxpayer ID                   0
Declared Income             145
Deductions                  145
Sector                      145
prop_count                    0
prop_value_total              0
prop_value_avg             2317
prop_value_max             2317
prop_value_min             2317
prop_loc_distinct_count       0
prop_ownership_earliest    2317
prop_ownership_latest      2317
directorship_count            0
comp_distinct_count           0
dtype: int64
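
The remaining NaNs mix two cases that the comments in the cell distinguish: the taxpayer is absent from the source, or the source value itself was missing. The 145 missing tax values match the 4895 - 4750 taxpayers with no filing, and the 2317 missing property statistics match those with no property rows, but the columns alone do not record which case applies. One possible way to keep that information for Notebook 03 is a presence flag taken from the merge indicator. The sketch below uses toy data, and the column name `has_tax_filing` is an assumption, not something the notebook creates.

```python
import numpy as np
import pandas as pd

base = pd.DataFrame({"Taxpayer ID": ["A", "B", "C"]})
# B filed but has no declared income recorded; C did not file at all.
tax = pd.DataFrame({
    "Taxpayer ID": ["A", "B"],
    "Declared Income": [50000.0, np.nan],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge.
merged = base.merge(tax, on="Taxpayer ID", how="left", indicator=True)
merged["has_tax_filing"] = merged["_merge"] == "both"
merged = merged.drop(columns="_merge")
print(merged)
```

With the flag in place, `B` and `C` both show NaN income but remain distinguishable, so a later imputation step can treat "filed with missing value" differently from "never filed".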


## Inspect the Unified Profile DataFrame

In [8]:
print("First 5 rows of the unified profile:")
display(unified_profile_df.head())

print("\nUnified profile DataFrame Info:")
unified_profile_df.info()

print("\nUnified profile DataFrame Description (Numerical Summary):")
# Select only numeric columns for describe to avoid warnings
numeric_cols = unified_profile_df.select_dtypes(include=np.number).columns.tolist()
display(unified_profile_df[numeric_cols].describe())

First 5 rows of the unified profile:


|  | Taxpayer ID | Declared Income | Deductions | Sector | prop_count | prop_value_total | prop_value_avg | prop_value_max | prop_value_min | prop_loc_distinct_count | prop_ownership_earliest | prop_ownership_latest | directorship_count | comp_distinct_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TXP_45F37389A0 | 250000.0 | 48301.93 | Technology | 0 | 0.0 | NaN | NaN | NaN | 0 | NaT | NaT | 0 | 0 |
| 1 | TXP_BB837C9FB9 | 147896.72 | 23505.16 | Manufacturing | 3 | 1027938.19 | 342646.063333 | 605739.37 | 156621.69 | 3 | 2007-07-31 | 2011-07-31 | 0 | 0 |
| 2 | TXP_D1A8EA287F | 51184.27 | 9297.54 | Healthcare | 2 | 272408.52 | 136204.26 | 180801.29 | 91607.23 | 2 | 2015-02-11 | 2017-12-08 | 0 | 0 |
| 3 | TXP_A801DC9BBA | 42463.53 | 9691.49 | Retail | 1 | 109498.36 | 109498.36 | 109498.36 | 109498.36 | 1 | 2020-05-26 | 2020-05-26 | 0 | 0 |
| 4 | TXP_2D1F72FE6F | 36677.42 | 2491.77 | Technology | 1 | 557816.91 | 557816.91 | 557816.91 | 557816.91 | 1 | 2017-12-17 | 2017-12-17 | 0 | 0 |



Unified profile DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4895 entries, 0 to 4894
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Taxpayer ID              4895 non-null   object        
 1   Declared Income          4750 non-null   float64       
 2   Deductions               4750 non-null   float64       
 3   Sector                   4750 non-null   object        
 4   prop_count               4895 non-null   int64         
 5   prop_value_total         4895 non-null   float64       
 6   prop_value_avg           2578 non-null   float64       
 7   prop_value_max           2578 non-null   float64       
 8   prop_value_min           2578 non-null   float64       
 9   prop_loc_distinct_count  4895 non-null   int64         
 10  prop_ownership_earliest  2578 non-null   datetime64[ns]
 11  prop_ownership_latest    2578 non-null   datetime64[ns]
 12  d

|  | Declared Income | Deductions | prop_count | prop_value_total | prop_value_avg | prop_value_max | prop_value_min | prop_loc_distinct_count | directorship_count | comp_distinct_count |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 4750.0 | 4750.0 | 4895.0 | 4895.0 | 2578.0 | 2578.0 | 2578.0 | 4895.0 | 4895.0 | 4895.0 |
| mean | 75934.540703 | 11341.060998 | 0.93953 | 392459.8 | 413570.2 | 534499.8 | 310812.4 | 0.838202 | 0.168539 | 0.168539 |
| std | 59278.287878 | 10023.302843 | 1.128277 | 692889.4 | 378439.0 | 481108.3 | 360361.6 | 0.961865 | 0.576398 | 0.576398 |
| min | 7710.84 | 227.4 | 0.0 | 0.0 | 50000.0 | 50000.0 | 50000.0 | 0.0 | 0.0 | 0.0 |
| 25% | 33001.8725 | 4460.5175 | 0.0 | 0.0 | 186578.7 | 216095.1 | 108117.2 | 0.0 | 0.0 | 0.0 |
| 50% | 57096.39 | 8050.14 | 1.0 | 91195.93 | 293471.6 | 359925.6 | 194043.7 | 1.0 | 0.0 | 0.0 |
| 75% | 99823.65 | 14707.795 | 2.0 | 527574.7 | 494547.3 | 691354.5 | 332799.7 | 1.0 | 0.0 | 0.0 |
| max | 250000.0 | 70926.3 | 7.0 | 9040092.0 | 2387933.0 | 2387933.0 | 2387933.0 | 5.0 | 5.0 | 5.0 |


## Save the Unified Profile DataFrame

In [9]:
output_file_path = os.path.join(OUTPUT_DIR, 'unified_taxpayer_profiles.csv')
try:
    unified_profile_df.to_csv(output_file_path, index=False)
    print(f"Successfully saved unified profiles to: {output_file_path}")
except Exception as e:
    print(f"ERROR saving unified profile file: {e}")
    raise  # Re-raise so later cells do not report a saved file that does not exist

Successfully saved unified profiles to: ./data/processed/unified_taxpayer_profiles.csv
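
CSV does not store dtypes, so the two `datetime64[ns]` ownership columns will come back as plain text unless the reader asks for them to be parsed. The round trip below, run on an in-memory buffer with toy data, illustrates what a later notebook would presumably need to do when loading the file; how Notebook 03 actually loads it is not shown here.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Taxpayer ID": ["A", "B"],
    "prop_ownership_earliest": pd.to_datetime(["2016-08-13", None]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default read: the date column is not restored to a datetime dtype.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Explicit parse_dates restores the datetime dtype, with NaT for the missing value.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["prop_ownership_earliest"])

print(plain.dtypes)
print(parsed.dtypes)
```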


## Conclusion

In [10]:
print("Notebook 02 finished.")
print("Successfully built the unified taxpayer profile DataFrame by:")
print("  - Loading cleaned data sources.")
print("  - Establishing a base list of all unique taxpayers.")
print("  - Merging tax data (one-to-one).")
print("  - Aggregating and merging property data (one-to-many).")
print("  - Aggregating and merging company data (optional, one-to-many).")
print("  - Performing initial handling of NaNs resulting from joins (filling counts/sums with 0).")
print(f"\nFinal unified profile shape: {unified_profile_df.shape}")
print("The unified profile dataset is saved and ready for feature engineering.")
print("\nProceed to Notebook 03: Feature Engineering on Unified Profiles.")

Notebook 02 finished.
Successfully built the unified taxpayer profile DataFrame by:
  - Loading cleaned data sources.
  - Establishing a base list of all unique taxpayers.
  - Merging tax data (one-to-one).
  - Aggregating and merging property data (one-to-many).
  - Aggregating and merging company data (optional, one-to-many).
  - Performing initial handling of NaNs resulting from joins (filling counts/sums with 0).

Final unified profile shape: (4895, 14)
The unified profile dataset is saved and ready for feature engineering.

Proceed to Notebook 03: Feature Engineering on Unified Profiles.
