<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20ML/Solution%20Prototypes/document-processing/5-document-anomalies.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FApplied%2520ML%2FSolution%2520Prototypes%2Fdocument-processing%2F5-document-anomalies.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Applied%20ML/Solution%20Prototypes/document-processing/5-document-anomalies.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Applied%20ML/Solution%20Prototypes/document-processing/5-document-anomalies.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Detecting Anomalous Documents

> This workflow is part of a series of workflows for the solution prototype: [Document Processing With Generative AI: Parse, Extract, Validate Authenticity, and More](./readme.md)

In the previous workflow ([4-document-similarity](./4-document-similarity.ipynb)) we used the distance between embeddings as a way of understanding document similarity.  In this workflow we expand that idea to also detect dissimilar documents, or anomalies, which are potentially fraudulent.  This workflow uses the same `VECTOR_SEARCH` function directly within BigQuery, along with the set of documents with known variations that was prepared in a BigQuery table in workflow 2 ([2-document-extraction](./2-document-extraction.ipynb)) and the embedding values added in workflow 3 ([3-document-embedding](./3-document-embedding.ipynb)).

## Setup

Note that this notebook expects to use a local virtual environment with the `./requirements.txt` installed.  

A potential workaround if using this notebook standalone is running:

>```python
>pip install -r requirements.txt
>```

And then restart the kernel.

In [1]:
# package imports for this work
import os, subprocess

from IPython.display import display, Image, Markdown
import ipywidgets
import matplotlib.pyplot as plt
import seaborn as sns

from google.cloud import storage
from google.cloud import bigquery

In [2]:
# what project are we working in?
PROJECT_ID = subprocess.run(['gcloud', 'config', 'get-value', 'project'], capture_output=True, text=True, check=True).stdout.strip()
PROJECT_ID

'statmike-mlops-349915'

In [3]:
LOCATION = 'us-central1'

SERIES = 'applied-ml-solution-prototypes'
EXPERIMENT = 'document-processing'
GCS_BUCKET = PROJECT_ID # bucket has same name as project here

In [4]:
# setup google cloud storage client
gcs = storage.Client(project = PROJECT_ID)
bucket = gcs.bucket(GCS_BUCKET)

# setup google cloud bigquery client
bq = bigquery.Client(project = PROJECT_ID)

# load the bigquery magics for jupyter with:
%load_ext bigquery_magics

---
## Review Data Sources

During this series we have created tables in BigQuery to collect document information (extracted fields, embeddings of pages) and vendor average information.  These tables will become the basis for evaluating documents as they come in.


### Vendor Averages

This table has the average embedding for documents with known authenticity, as well as distribution information (average and standard deviation) for the distance of authentic documents from that average embedding.  This information is aggregated for each known vendor.

In [5]:
%%bigquery
SELECT *
FROM `statmike-mlops-349915.solution_prototype_document_processing.known_authenticity_vendor_info`

|   | vendor | average_embedding | average_distance_center | stddev_distance_center |
|---|---|---|---|---|
| 0 | vendor_4 | [-0.003796251816585715, 0.04530451753571429, 0... | -0.964754 | 0.010032 |
| 1 | vendor_7 | [0.013940833694827586, 0.042548744, 0.04536325... | -0.957068 | 0.006144 |
| 2 | vendor_9 | [-0.0017655710555652179, 0.03942124971304347, ... | -0.96174 | 0.010771 |
| 3 | vendor_11 | [-0.006832965441904762, 0.04243769000000001, 0... | -0.939972 | 0.0133 |
| 4 | vendor_3 | [-0.0036581132154210523, 0.028580572899999997,... | -0.97152 | 0.004743 |
| 5 | vendor_0 | [-0.014520798997894737, 0.03216948262631579, 0... | -0.958915 | 0.00522 |
| 6 | vendor_2 | [0.009100865230625, 0.051977301012500005, 0.06... | -0.951682 | 0.008794 |
| 7 | vendor_5 | [-4.8841104625000384e-05, 0.03990647653124999,... | -0.96469 | 0.004031 |
| 8 | vendor_6 | [0.01954306822105263, 0.027207733263157893, 0.... | -0.952401 | 0.006769 |
| 9 | vendor_1 | [-0.027806649450000002, 0.04521821479999999, 0... | -0.937559 | 0.011703 |
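
As a rough illustration of how per-vendor values like these could be derived, the sketch below averages a few made-up three-dimensional embeddings and summarizes each document's distance to that center. The function names (`negative_dot_product`, `vendor_summary`), the toy vectors, and the use of the sample standard deviation are all assumptions for illustration; the actual table was built in an earlier workflow and may have been computed differently. The negative dot product is used because the negative distances shown in this notebook are consistent with it.

```python
from statistics import mean, stdev


def negative_dot_product(a, b):
    """Negative dot product of two equal-length vectors (more negative = closer)."""
    return -sum(x * y for x, y in zip(a, b))


def vendor_summary(embeddings):
    """Summarize one vendor's embeddings (hypothetical helper, not from the notebook)."""
    dims = len(embeddings[0])
    # element-wise average of the vendor's embeddings
    center = [mean(e[i] for e in embeddings) for i in range(dims)]
    # distance from each document to the vendor's average embedding
    distances = [negative_dot_product(e, center) for e in embeddings]
    return {
        "average_embedding": center,
        "average_distance_center": mean(distances),
        "stddev_distance_center": stdev(distances),  # sample std dev (assumption)
    }


# toy, made-up unit-length embeddings standing in for one vendor's documents
toy_embeddings = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.6, 0.8, 0.0]]
summary = vendor_summary(toy_embeddings)
print(summary)
```

The three keys mirror the column names in the table above so the toy result can be read side by side with the real values.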


### Documents With Known Authenticity

We have a collection of derived information for documents known to be authentic:

In [6]:
%%bigquery
SELECT *
FROM `statmike-mlops-349915.solution_prototype_document_processing.known_authenticity`
LIMIT 5

|   | ml_process_document_result | ml_process_document_status | vendor_name | vendor_address | company_name | company_address | invoice_id | invoice_total | line_item | uri | updated | vendor | embedding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {"entities":[{"confidence":1,"id":"0","mention... |  |  |  | BioTech Innovations Corp | 666 Genome Way\nSan Diego, CA 92121 | KD-2024-0315 | $37108.50 | [{'item_sku': 'WEB- DEV- 001', 'item_descripti... | gs://statmike-mlops-349915/applied-ml-solution... | 2025-04-23 20:53:37.673000+00:00 | vendor_2 | [0.0089465566, 0.0536203124, 0.0614382476, -0.... |
| 1 | {"entities":[{"confidence":1,"id":"0","mention... |  |  |  | HealthAI Innovations | 123 Main Street\nSan Francisco, CA 94111 | INV-2024-0315 | $21924.00 | [{'item_sku': 'CSD- 001', 'item_description': ... | gs://statmike-mlops-349915/applied-ml-solution... | 2025-04-23 20:53:35.765000+00:00 | vendor_2 | [0.0121024447, 0.0618715286, 0.0657297298, -0.... |
| 2 | {"entities":[{"confidence":1,"id":"1","propert... |  |  |  | GlobalMed Health | 123 Serene Drive\nSan Diego, CA 92101 | INV-2024-1122 | $23600.00 | [{'item_sku': 'WEB- DEV- 001', 'item_descripti... | gs://statmike-mlops-349915/applied-ml-solution... | 2025-04-23 20:53:35.038000+00:00 | vendor_2 | [0.0141071267, 0.0521478951, 0.0623387806, -0.... |
| 3 | {"entities":[{"confidence":1,"id":"0","mention... |  |  |  | Swift Logistics Solutions | 987 Elm Street\nDallas, TX 75201 | KD-2024-0722 | $19920.00 | [{'item_sku': 'WEB- DEV- 001', 'item_descripti... | gs://statmike-mlops-349915/applied-ml-solution... | 2025-04-23 20:53:39.182000+00:00 | vendor_2 | [0.0090936739, 0.0496088825, 0.0622638054, -0.... |
| 4 | {"entities":[{"confidence":1,"id":"1","propert... |  |  |  | Style Forward Retail | 99 Fashion Blvd Los Angeles, CA 90015 | INV-2024-1105 | $34800.00 | [{'item_sku': None, 'item_description': 'Web D... | gs://statmike-mlops-349915/applied-ml-solution... | 2025-04-23 20:52:52.221000+00:00 | vendor_12 | [0.0019979002, 0.0560282543, 0.0625561327, -0.... |


### Documents With Unknown Authenticity

We have a collection of derived information for documents with unknown authenticity. In workflow [0-generate-documents](./0-generate-documents.ipynb), these documents were actually created by a process that allows deviations in layout and format, so they should be detected as anomalous!

In [7]:
%%bigquery
SELECT *
FROM `statmike-mlops-349915.solution_prototype_document_processing.unknown_authenticity`
LIMIT 5

|   | ml_process_document_result | ml_process_document_status | vendor_name | vendor_address | company_name | company_address | invoice_id | invoice_total | line_item | uri | updated | vendor | embedding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {"mimeType":"application/pdf"} |  |  |  |  |  |  |  | [] | gs://statmike-mlops-349915/applied-ml-solution... | 2025-04-23 20:54:11.423000+00:00 | vendor_5 | [0.0131794149, 0.0133737465, 0.00720742205, -0... |
| 1 | {"entities":[{"confidence":1,"id":"0","mention... |  |  |  | Cyberdyne Systems | 789 Pine Lane\nHill Valley, WA 98052 | INV-2024-1022 | $50032.00 | [{'item_sku': 'CS- WEB- 001', 'item_descriptio... | gs://statmike-mlops-349915/applied-ml-solution... | 2025-04-23 20:54:13.291000+00:00 | vendor_5 | [0.00087831, 0.0551444367, 0.0510212779, -0.00... |
| 2 | {"mimeType":"application/pdf"} |  |  |  |  |  |  |  | [] | gs://statmike-mlops-349915/applied-ml-solution... | 2025-04-23 20:55:09.934000+00:00 | vendor_8 | [0.0160381682, 0.0116377119, 0.00194062886, 0.... |
| 3 | {"entities":[{"confidence":1,"id":"0","mention... |  | Apex Digital Solutions |  | ManuTech Solutions | 321 Oak Street\nDetroit, MI 48201 | INV-2024-1022 | $19040.00 | [{'item_sku': 'SW- DEV- 001', 'item_descriptio... | gs://statmike-mlops-349915/applied-ml-solution... | 2025-04-23 20:52:11.750000+00:00 | vendor_10 | [-0.0163299236, 0.0281633101, 0.0402984172, -0... |
| 4 | {"entities":[{"confidence":1,"id":"0","mention... |  | Apex Digital Solutions |  | Quantum Health Systems | 890 BioTech Drive\nBoston, MA 02115 | INV-2024-1023 | $23940.00 | [{'item_sku': 'SWDEV- 001', 'item_description'... | gs://statmike-mlops-349915/applied-ml-solution... | 2025-04-23 20:52:12.517000+00:00 | vendor_10 | [-0.0247134808, 0.0297346395, 0.0328056589, -0... |


---
## Compare A Document To Vendor Information

These queries use a single document as the basis for evaluations:

>'gs://statmike-mlops-349915/applied-ml-solution-prototypes/document-processing/vendor_5/fake_invoices/vendor_5_invoice_3.pdf'

Change this to any vendor's document in the `/fake_invoices` subfolder.

### Calculate The Distance From the Document To The Vendor Average

In [16]:
%%bigquery
# for the query document(s), calculate the distance to the average embedding of the document's own vendor
SELECT
    SPLIT(query.uri, '/')[7] as query, 
    base.vendor,
    distance
FROM VECTOR_SEARCH(
    # The base table and column to search for neighbors in:
    (SELECT vendor, average_embedding FROM `statmike-mlops-349915.solution_prototype_document_processing.known_authenticity_vendor_info`),
    'average_embedding',
    # The query table and column to search with - pick an anomalous document
    (
        SELECT uri, embedding, vendor
        FROM `statmike-mlops-349915.solution_prototype_document_processing.unknown_authenticity`
        WHERE uri LIKE '%vendor_5_invoice_3.pdf%'
    ),
    'embedding',
    # options
    top_k => -1,
    distance_type => 'DOT_PRODUCT'
)
WHERE query.vendor = base.vendor


|   | query | vendor | distance |
|---|---|---|---|
| 0 | vendor_5_invoice_3.pdf | vendor_5 | -0.48399 |


**Interpretation**

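This document is not close to its labeled vendor's average embedding: its distance of -0.48399 is far from the -0.96469 average distance (standard deviation 0.004031) recorded for authentic vendor_5 documents, where more negative values mean closer.  Perhaps it is labeled with the wrong vendor.  In the next section we look at the document's distance to each vendor's average.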
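
To build intuition for reading these values, the sketch below computes a negative dot product between small, made-up unit-length vectors. It assumes the embeddings are approximately unit length, which the averages near -0.96 suggest but which is not confirmed here. It also assumes the reported distance is the negative dot product, which matches the negative values in the output. The helper names and vectors are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot_product_distance(a, b):
    """Negative dot product: values near -1 are close, values near 0 are far (unit vectors)."""
    return -sum(x * y for x, y in zip(a, b))


# made-up vectors for illustration only
reference = normalize([1.0, 1.0, 0.0])
same_direction = normalize([2.0, 2.0, 0.0])   # identical direction to the reference
partly_aligned = normalize([1.0, 0.0, 0.0])   # 45 degrees away from the reference
orthogonal = normalize([0.0, 0.0, 1.0])       # unrelated direction

d_same = dot_product_distance(reference, same_direction)
d_partial = dot_product_distance(reference, partly_aligned)
d_orthogonal = dot_product_distance(reference, orthogonal)
print(d_same, d_partial, d_orthogonal)
```

Under these assumptions, a distance around -0.48 sits roughly halfway between "same direction" and "unrelated", which is why it stands out against a vendor whose authentic documents cluster near -0.96.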

### Calculate The Distance From The Document To Each Vendor Average



In [17]:
%%bigquery
# for the query document(s), calculate the distance to each vendor's average embedding
SELECT
    SPLIT(query.uri, '/')[7] as query, 
    base.vendor,
    distance
FROM VECTOR_SEARCH(
    # The base table and column to search for neighbors in:
    (SELECT vendor, average_embedding FROM `statmike-mlops-349915.solution_prototype_document_processing.known_authenticity_vendor_info`),
    'average_embedding',
    # The query table and column to search with - pick an anomalous document
    (
        SELECT uri, embedding, vendor
        FROM `statmike-mlops-349915.solution_prototype_document_processing.unknown_authenticity`
        WHERE uri LIKE '%vendor_5_invoice_3.pdf%'
    ),
    'embedding',
    # options
    top_k => -1,
    distance_type => 'DOT_PRODUCT'
)
ORDER BY distance

|   | query | vendor | distance |
|---|---|---|---|
| 0 | vendor_5_invoice_3.pdf | vendor_13 | -0.4857 |
| 1 | vendor_5_invoice_3.pdf | vendor_5 | -0.48399 |
| 2 | vendor_5_invoice_3.pdf | vendor_14 | -0.483877 |
| 3 | vendor_5_invoice_3.pdf | vendor_4 | -0.468442 |
| 4 | vendor_5_invoice_3.pdf | vendor_8 | -0.464682 |
| 5 | vendor_5_invoice_3.pdf | vendor_12 | -0.464473 |
| 6 | vendor_5_invoice_3.pdf | vendor_9 | -0.447008 |
| 7 | vendor_5_invoice_3.pdf | vendor_2 | -0.445453 |
| 8 | vendor_5_invoice_3.pdf | vendor_7 | -0.441649 |
| 9 | vendor_5_invoice_3.pdf | vendor_6 | -0.437258 |


**Interpretation**

The document is actually slightly closer to a different vendor (vendor_13) but is still not close to any vendor.  Rather than being mislabeled, it is likely a document that needs to be reviewed for authenticity.  This will be done with generative AI in the next workflow [6-document-comparison](./6-document-comparison.ipynb).
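
The comparison behind this interpretation can be sketched in a few lines of Python. The distances are copied from the query output above; the variable names are hypothetical, and the sketch only restates the reasoning rather than reproducing anything run in this notebook.

```python
# distances from the document to each vendor's average embedding (copied from the output above)
distances = {
    "vendor_13": -0.4857,
    "vendor_5": -0.48399,
    "vendor_14": -0.483877,
    "vendor_4": -0.468442,
    "vendor_8": -0.464682,
    "vendor_12": -0.464473,
    "vendor_9": -0.447008,
    "vendor_2": -0.445453,
    "vendor_7": -0.441649,
    "vendor_6": -0.437258,
}

labeled_vendor = "vendor_5"

# the most negative distance is the nearest vendor average
nearest_vendor = min(distances, key=distances.get)

# how much closer the nearest vendor is than the labeled vendor
gap_to_nearest = distances[labeled_vendor] - distances[nearest_vendor]

print(nearest_vendor, round(gap_to_nearest, 5))
```

The gap between the labeled vendor and the nearest vendor is tiny compared with how far every distance is from the roughly -0.94 to -0.97 averages seen for authentic documents, which is why mislabeling is an unlikely explanation.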

### Evaluate The Distance Between The Document And Vendor(s)

Just looking at the distance calculations above makes it clear that the chosen document is anomalous.  Sometimes documents will be closer to the cutoff and harder to evaluate.  This next step uses the distribution information to statistically evaluate the distance and flag the document for consideration as anomalous.
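
One way to picture this kind of statistical check is a one-sided z-score, sketched below in plain Python. The function names and the threshold of 3 standard deviations are assumptions chosen for illustration, not values taken from this workflow. The inputs for the example document are the vendor_5 values shown earlier.

```python
def distance_z_score(distance, average_distance, stddev_distance):
    """Number of standard deviations the distance sits above the vendor's average distance."""
    return (distance - average_distance) / stddev_distance


def is_anomalous(distance, average_distance, stddev_distance, threshold=3.0):
    """Flag documents that are much farther from the center than authentic ones.

    One-sided: only distances larger (less negative) than average count as anomalous.
    The default threshold is a hypothetical choice.
    """
    return distance_z_score(distance, average_distance, stddev_distance) > threshold


# vendor_5 values from the vendor averages table
average_distance = -0.96469
stddev_distance = 0.004031

# the document evaluated above
z_document = distance_z_score(-0.48399, average_distance, stddev_distance)
flag_document = is_anomalous(-0.48399, average_distance, stddev_distance)

# a made-up borderline distance for contrast
z_borderline = distance_z_score(-0.960, average_distance, stddev_distance)
flag_borderline = is_anomalous(-0.960, average_distance, stddev_distance)

print(z_document, flag_document, z_borderline, flag_borderline)
```

A score this far above any reasonable threshold leaves little doubt, while the made-up borderline value shows the kind of case where the chosen threshold decides the outcome.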