# Generating Unified Profile Vector Embeddings

**Purpose**:
  1. Load the final, engineered feature set and corresponding Taxpayer IDs
     prepared in Notebook 03.
  2. Apply a technique to represent each taxpayer's profile (defined by their
     engineered features) as a dense vector embedding.
     *MVP Approach*: Use the scaled feature vectors directly as embeddings.
  3. Verify the dimensions and alignment of the generated embeddings and IDs.
  4. Save the embeddings and corresponding Taxpayer IDs in formats suitable for
     ingestion into a vector database in the next notebook.

**Why Embeddings?**
  An embedding represents each taxpayer profile, which is derived from multiple
  data sources, as a single numerical vector. This lets us use efficient vector
  similarity search to find taxpayers with similar overall profiles, which is
  crucial for identifying anomalous or potentially fraudulent patterns that
  stay hidden when features are examined in isolation.
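
To make the idea of similarity search concrete, here is a minimal sketch using brute-force cosine similarity in NumPy. The function name `top_k_similar` and the toy data are hypothetical and only for illustration; a vector database would typically replace this loop-free scan with an index.

```python
import numpy as np

def top_k_similar(embeddings, query_index, k=3):
    """Return (row_index, cosine_similarity) pairs for the k rows most similar to the query row."""
    # Normalize rows to unit length so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    scores = unit @ unit[query_index]
    scores[query_index] = -np.inf  # exclude the query row itself
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top]

# Toy data (made up for illustration): 4 profiles with 3 scaled features each.
toy_embeddings = np.array([
    [1.0, 0.9, -0.2],
    [0.9, 1.0, -0.1],
    [-1.0, -0.8, 0.3],
    [0.1, -0.2, 2.0],
])
neighbors = top_k_similar(toy_embeddings, query_index=0, k=2)
print(neighbors)
```

A profile whose best similarity score is low compared with the rest of the population has no close neighbors, which is one simple signal worth inspecting.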

**Prerequisites**:
  - Successful completion of [Notebook 03](./notebook_03.ipynb).
  - Existence of the engineered features file ('engineered_features.csv').
  - Existence of the corresponding Taxpayer IDs file ('taxpayer_ids.csv').

**Outputs**:
  - A NumPy array containing the vector embeddings (one row per taxpayer).
  - This array saved to a file (e.g., 'embeddings.npy').
  - The corresponding Taxpayer IDs saved alongside (e.g., 'embedding_ids.csv').

**Next Step**:
  [Notebook 05](./notebook_05.ipynb) will set up a vector database and index these generated embeddings.

## Imports and Configuration

In [1]:
import pandas as pd
import numpy as np
import os
# Optional: Import PCA if demonstrating dimensionality reduction as an alternative
# from sklearn.decomposition import PCA

# --- Configuration ---
PROCESSED_DATA_DIR = './data/processed' # Directory containing Notebook 03 output
OUTPUT_DIR = './data/processed' # Directory to save embeddings and IDs

FEATURES_INPUT_FILE = os.path.join(PROCESSED_DATA_DIR, 'engineered_features.csv')
IDS_INPUT_FILE = os.path.join(PROCESSED_DATA_DIR, 'taxpayer_ids.csv')

EMBEDDINGS_OUTPUT_FILE = os.path.join(OUTPUT_DIR, 'embeddings.npy')
EMBEDDING_IDS_OUTPUT_FILE = os.path.join(OUTPUT_DIR, 'embedding_ids.csv') # Save IDs again for clarity

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

print("Notebook 04: Generating Unified Profile Vector Embeddings")
print("-" * 50)
print(f"Loading engineered features from: {FEATURES_INPUT_FILE}")
print(f"Loading taxpayer IDs from: {IDS_INPUT_FILE}")
print(f"Saving embeddings to: {EMBEDDINGS_OUTPUT_FILE}")
print(f"Saving corresponding IDs to: {EMBEDDING_IDS_OUTPUT_FILE}")
print("-" * 50)

Notebook 04: Generating Unified Profile Vector Embeddings
--------------------------------------------------
Loading engineered features from: ./data/processed/engineered_features.csv
Loading taxpayer IDs from: ./data/processed/taxpayer_ids.csv
Saving embeddings to: ./data/processed/embeddings.npy
Saving corresponding IDs to: ./data/processed/embedding_ids.csv
--------------------------------------------------


## Load Processed Features and IDs

In [2]:
try:
    features_df = pd.read_csv(FEATURES_INPUT_FILE)
    print(f"Successfully loaded engineered features: {features_df.shape}")
except FileNotFoundError:
    print(f"ERROR: Engineered features file not found at {FEATURES_INPUT_FILE}.")
    print("Please ensure Notebook 03 was run successfully and saved the file.")
    raise

try:
    taxpayer_ids_df = pd.read_csv(IDS_INPUT_FILE)
    print(f"Successfully loaded taxpayer IDs: {taxpayer_ids_df.shape}")
except FileNotFoundError:
    print(f"ERROR: Taxpayer IDs file not found at {IDS_INPUT_FILE}.")
    print("Please ensure Notebook 03 was run successfully and saved the file.")
    raise

# Basic Validation
if features_df.isnull().values.any():
    print("ERROR: Missing values found in the loaded features data!")
    print(features_df.isnull().sum())
    raise ValueError("NaNs found in feature data. Cannot generate embeddings.")
else:
    print("Validation: No missing values found in features.")

if len(features_df) != len(taxpayer_ids_df):
    print(f"ERROR: Mismatch between number of feature rows ({len(features_df)}) and number of IDs ({len(taxpayer_ids_df)})!")
    raise ValueError("Mismatch in length between features and IDs.")
else:
    print("Validation: Number of feature rows matches number of IDs.")


Successfully loaded engineered features: (4900, 28)
Successfully loaded taxpayer IDs: (4900, 1)
Validation: No missing values found in features.
Validation: Number of feature rows matches number of IDs.
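
The checks above compare row counts and look for NaNs. A few extra checks can catch problems that a length match does not, such as duplicate IDs or non-numeric feature columns. The sketch below is one possible way to do that; `validate_inputs` is a hypothetical helper, and the `'Taxpayer ID'` column name is taken from the cell that follows.

```python
import pandas as pd

def validate_inputs(features_df, ids_df, id_column="Taxpayer ID"):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if id_column not in ids_df.columns:
        problems.append(f"ID column '{id_column}' is missing.")
    else:
        n_dupes = int(ids_df[id_column].duplicated().sum())
        if n_dupes:
            problems.append(f"{n_dupes} duplicate ID(s) found.")
        if ids_df[id_column].isnull().any():
            problems.append("Null IDs found.")
    # Every feature column must be numeric to be usable as an embedding component.
    non_numeric = features_df.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        problems.append(f"Non-numeric feature columns: {non_numeric}")
    if len(features_df) != len(ids_df):
        problems.append("Row count mismatch between features and IDs.")
    return problems

# Toy frames (made up for illustration) with one deliberately duplicated ID.
toy_features = pd.DataFrame({"f1": [0.1, -0.4, 1.2], "f2": [1.0, 0.0, -1.0]})
toy_ids = pd.DataFrame({"Taxpayer ID": ["TXP_A", "TXP_B", "TXP_B"]})
problems = validate_inputs(toy_features, toy_ids)
print(problems)
```

Note that none of these checks proves that row *i* of the features belongs to ID *i*; that still relies on Notebook 03 writing both files in the same order.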


## Prepare Data for Embedding Generation

In [3]:
# Convert features DataFrame to NumPy array
# This array represents the multi-dimensional space where each taxpayer profile exists
features_array = features_df.to_numpy()
print(f"Converted features DataFrame to NumPy array with shape: {features_array.shape}")

# Ensure IDs are in a simple list format, preserving order
id_list = taxpayer_ids_df['Taxpayer ID'].astype(str).tolist()
print(f"Converted Taxpayer IDs to list. Total IDs: {len(id_list)}")

# Final check for alignment
assert features_array.shape[0] == len(id_list), "Mismatch between feature array rows and ID list length!"
print("Validation: Feature array rows align with ID list length.")

Converted features DataFrame to NumPy array with shape: (4900, 28)
Converted Taxpayer IDs to list. Total IDs: 4900
Validation: Feature array rows align with ID list length.


In [8]:
display(features_array)

array([[ 2.98799529,  3.7496856 ,  1.82753275, ..., -0.45256964,
         2.45563212, -0.25218817],
       [ 1.24108515,  1.24088774,  0.0544587 , ..., -0.45256964,
        -0.40722712, -0.25218817],
       [-0.41359194, -0.19655941, -0.83207832, ..., -0.45256964,
        -0.40722712, -0.25218817],
       ...,
       [-0.31244003, -0.32276434, -0.83207832, ..., -0.45256964,
        -0.40722712,  3.96529312],
       [-0.31244003, -0.32276434, -0.83207832, ..., -0.45256964,
        -0.40722712,  3.96529312],
       [-0.31244003, -0.32276434, -0.83207832, ..., -0.45256964,
        -0.40722712,  3.96529312]])
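
Because the MVP approach uses the scaled features directly, distances between embeddings depend on every column being on a comparable scale. The sketch below shows one way to sanity-check that, assuming Notebook 03 applied z-score standardization (that is an assumption here, not something this notebook confirms). `scaling_report` is a hypothetical helper.

```python
import numpy as np

def scaling_report(features, tol=1e-6):
    """Summarize how far the columns are from zero mean and unit standard deviation."""
    means = features.mean(axis=0)
    stds = features.std(axis=0)  # population standard deviation (ddof=0)
    return {
        "max_abs_mean": float(np.abs(means).max()),
        "max_std_deviation_from_1": float(np.abs(stds - 1.0).max()),
        # Columns with (near) zero spread carry no information for similarity search.
        "constant_columns": np.flatnonzero(stds < tol).tolist(),
    }

# Toy data (made up for illustration), standardized column by column.
raw = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)
report = scaling_report(standardized)
print(report)
```

On the real data, `scaling_report(features_array)` would show whether any column deviates noticeably from the expected scale.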

## Generate Embeddings

In [4]:
# --- MVP Approach: Use Engineered Features Directly ---
# In this approach, the final processed feature vector for each taxpayer IS the embedding.
# This is the simplest method and directly uses the information engineered in Notebook 03.
# The dimensionality of the embedding will be equal to the number of features.

embeddings = features_array
embedding_dimension = embeddings.shape[1]

print("Using engineered feature vectors directly as embeddings.")
print(f"Generated {embeddings.shape[0]} embeddings.")
print(f"Embedding Dimension: {embedding_dimension}")


# --- Optional Alternative: Dimensionality Reduction (e.g., PCA) ---
# If the feature dimensionality is very high, or if we want potentially smoother
# embeddings, techniques like PCA can be used. This is not the primary path for the MVP.
# Uncomment the following block (and the PCA import in the first cell) to experiment with PCA:
# print("\n--- Optional: Generating embeddings using PCA ---")
# # Choose the number of dimensions for the PCA embedding.
# # n_components cannot exceed min(n_samples, n_features), which is 28 for this feature set.
# PCA_DIMENSIONS = 16 # Example dimension - tune based on explained variance
#
# pca = PCA(n_components=PCA_DIMENSIONS, random_state=42)
# embeddings_pca = pca.fit_transform(features_array)
#
# print(f"Generated PCA embeddings with shape: {embeddings_pca.shape}")
# print(f"Explained variance ratio by {PCA_DIMENSIONS} components: {pca.explained_variance_ratio_.sum():.4f}")
#
# # If using PCA, you would replace the main 'embeddings' variable:
# # embeddings = embeddings_pca
# # embedding_dimension = PCA_DIMENSIONS
# # print("NOTE: Switched to using PCA embeddings for subsequent steps.")
# print("--- End Optional PCA Section ---")
# --- End Optional Section ---

Using engineered feature vectors directly as embeddings.
Generated 4900 embeddings.
Embedding Dimension: 28
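
If the optional PCA path is explored, the number of components has to be chosen somehow, and it cannot exceed the number of features. One common heuristic is to keep the smallest number of components that reaches a target share of explained variance. The following is a NumPy-only sketch of that heuristic on synthetic data; `components_for_variance` and the 0.95 target are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def components_for_variance(features, target=0.95):
    """Smallest number of principal components whose cumulative explained variance reaches target."""
    centered = features - features.mean(axis=0)
    # Squared singular values of the centered data are proportional to the
    # variance captured along each principal axis.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    n = int(np.searchsorted(cumulative, target)) + 1
    return min(n, len(cumulative)), cumulative

# Synthetic stand-in data: 6 columns driven by only 2 latent factors plus small noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
toy_features = latent @ mixing + 0.01 * rng.normal(size=(200, 6))

n_components, cumulative = components_for_variance(toy_features, target=0.95)
print(n_components, cumulative.round(4))
```

Applied to `features_array`, the returned count could serve as a starting value for `PCA_DIMENSIONS`.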

## Inspect Embeddings

In [5]:
print(f"Shape of the final embeddings array: {embeddings.shape}")
print(f"Data type of embeddings: {embeddings.dtype}")

# Show the first few embeddings (or slices of them)
print("\nSample Embeddings (first 3):")
for i in range(min(3, len(embeddings))):
    # Print first 10 components if dimension is large
    print(f"  ID {id_list[i]}: {embeddings[i][:min(10, embedding_dimension)]}...")

Shape of the final embeddings array: (4900, 28)
Data type of embeddings: float64

Sample Embeddings (first 3):
  ID TXP_0A78B11C9A: [ 2.98799529  3.7496856   1.82753275  1.61055236  0.53166466  0.43593582
  0.25299264  2.24382894 -0.29226896 -0.29226896]...
  ID TXP_85119644D8: [ 1.24108515  1.24088774  0.0544587  -0.34890359 -0.73421114 -0.83103712
 -0.38314222  0.17032365 -0.29226896 -0.29226896]...
  ID TXP_FB9D1C4009: [-0.41359194 -0.19655941 -0.83207832 -0.5626008  -0.22500131 -0.25262929
 -0.22968489 -0.866429   -0.29226896 -0.29226896]...
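
The embeddings are stored as `float64`. Many vector indexes work with `float32`, and cosine-based search is often implemented as a dot product over unit-length vectors. Whether the database used in Notebook 05 needs either step is not stated here, so the following is only a sketch of how such a conversion could look; `prepare_for_index` is a hypothetical helper.

```python
import numpy as np

def prepare_for_index(embeddings, normalize=True):
    """Cast embeddings to float32 and optionally scale each row to unit L2 norm."""
    out = np.ascontiguousarray(embeddings, dtype=np.float32)
    if normalize:
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        # Guard against division by zero for all-zero rows.
        out = out / np.maximum(norms, 1e-12)
    return out

# Toy data (made up for illustration), including an all-zero row.
toy_embeddings = np.array([[3.0, 4.0], [0.0, 0.0]])
prepared = prepare_for_index(toy_embeddings)
print(prepared, prepared.dtype)
```

Normalizing discards vector magnitude, so it should be skipped if Euclidean distance on the scaled features is the intended metric.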


## Save Embeddings and Corresponding IDs

In [6]:
print("Saving embeddings in NumPy binary format (.npy) and IDs as CSV.")

try:
    # Save the embeddings array
    np.save(EMBEDDINGS_OUTPUT_FILE, embeddings)
    print(f"Successfully saved embeddings NumPy array to: {EMBEDDINGS_OUTPUT_FILE}")

    # Save the corresponding IDs (in the same order) as a CSV
    # This makes it easy to load the IDs alongside the embeddings in the next notebook
    ids_to_save_df = pd.DataFrame({'Taxpayer ID': id_list})
    ids_to_save_df.to_csv(EMBEDDING_IDS_OUTPUT_FILE, index=False)
    print(f"Successfully saved corresponding Taxpayer IDs to: {EMBEDDING_IDS_OUTPUT_FILE}")

except Exception as e:
    print(f"ERROR saving embedding data files: {e}")
    raise  # Re-raise so a failed save is not silently ignored by later cells

Saving embeddings in NumPy binary format (.npy) and IDs as CSV.
Successfully saved embeddings NumPy array to: ./data/processed/embeddings.npy
Successfully saved corresponding Taxpayer IDs to: ./data/processed/embedding_ids.csv
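
Since the next notebook depends on these two files staying aligned, it can be worth reloading them and comparing against what is in memory. The sketch below demonstrates such a round-trip check on toy data in a temporary directory; `verify_saved_pair` is a hypothetical helper, and in this notebook it would be called with `EMBEDDINGS_OUTPUT_FILE`, `EMBEDDING_IDS_OUTPUT_FILE`, `embeddings`, and `id_list`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def verify_saved_pair(embeddings_path, ids_path, expected_embeddings, expected_ids,
                      id_column="Taxpayer ID"):
    """Reload the saved files and confirm they match the in-memory data exactly."""
    loaded = np.load(embeddings_path)
    # Read IDs as strings so values are not reinterpreted as numbers.
    loaded_ids = pd.read_csv(ids_path, dtype={id_column: str})[id_column].tolist()
    return bool(
        loaded.shape == expected_embeddings.shape
        and np.array_equal(loaded, expected_embeddings)
        and loaded_ids == list(expected_ids)
    )

# Toy round trip (made-up data) written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    emb_path = os.path.join(tmp, "embeddings.npy")
    ids_path = os.path.join(tmp, "embedding_ids.csv")
    toy_embeddings = np.array([[0.5, -1.0], [1.5, 2.0]])
    toy_ids = ["TXP_A", "TXP_B"]
    np.save(emb_path, toy_embeddings)
    pd.DataFrame({"Taxpayer ID": toy_ids}).to_csv(ids_path, index=False)
    round_trip_ok = verify_saved_pair(emb_path, ids_path, toy_embeddings, toy_ids)

print(round_trip_ok)
```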


## Conclusion

In [7]:
print("Notebook 04 finished.")
print("Successfully generated vector embeddings for each unified taxpayer profile:")
print(f"  - Loaded {len(id_list)} engineered profiles.")
print(f"  - Used the {embedding_dimension}-dimensional feature vectors directly as embeddings (MVP approach).")
print("  - Saved the embeddings as a NumPy array and the corresponding IDs.")
print("\nThe embeddings are now ready for indexing in a vector database.")

Notebook 04 finished.
Successfully generated vector embeddings for each unified taxpayer profile:
  - Loaded 4900 engineered profiles.
  - Used the 28-dimensional feature vectors directly as embeddings (MVP approach).
  - Saved the embeddings as a NumPy array and the corresponding IDs.

The embeddings are now ready for indexing in a vector database.


Proceed to [Notebook 05](./notebook_05.ipynb): Setting Up Vector DB & Indexing Profile Embeddings.