# Link taxpayers with clustering

**Pre-treatment or not?**

**Embedding models**
* Word2Vec
* all-mpnet-base-v2 (Abhishek et al. 2024): https://huggingface.co/sentence-transformers/all-mpnet-base-v2

**Clustering methods**
* k-means
* DBSCAN
* Agglomerative Hierarchical Clustering

**References**
* https://huggingface.co/blog/getting-started-with-embeddings
* https://cloud.google.com/blog/topics/developers-practitioners/meet-ais-multitool-vector-embeddings?hl=en

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
print(os.environ["CUDA_VISIBLE_DEVICES"])

1


In [3]:
import os, sys
from pathlib import Path

BASE = Path.cwd().resolve()  # Working directory; outside GColab this is normally the directory of this notebook
DATASETS = Path('/home/STual/DAN-cadastre/data').resolve()
OUT_BASE = Path('/home/STual/DAN-cadastre/outputs/clustering').resolve()

print(sys.path)
print(BASE)
print(DATASETS)
print(OUT_BASE)

['/usr/lib/python310.zip', '/usr/lib/python3.10', '/usr/lib/python3.10/lib-dynload', '', '/home/STual/DAN-cadastre/.venv_dan/lib/python3.10/site-packages']
/home/STual/DAN-cadastre/scripts/Clustering
/home/STual/DAN-cadastre/data
/home/STual/DAN-cadastre/outputs/clustering


## Open data

In [4]:
import glob

DATASET = DATASETS / "Taxpayers"
files = glob.glob(str(DATASET) + '/*.json')
print(files)

['/home/STual/DAN-cadastre/data/Taxpayers/taxpayers_all.json', '/home/STual/DAN-cadastre/data/Taxpayers/taxpayers_0_100.json', '/home/STual/DAN-cadastre/data/Taxpayers/taxpayers_400_500.json', '/home/STual/DAN-cadastre/data/Taxpayers/taxpayers_100_200.json', '/home/STual/DAN-cadastre/data/Taxpayers/taxpayers_200_300.json', '/home/STual/DAN-cadastre/data/Taxpayers/taxpayers_500_600.json', '/home/STual/DAN-cadastre/data/Taxpayers/taxpayers_300_400.json', '/home/STual/DAN-cadastre/data/Taxpayers/taxpayers_600_700.json']


In [5]:
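import json

# Load every JSON file and concatenate the records into one list.
# Note: the glob above also matches taxpayers_all.json; if that file is the
# union of the per-range files, each record ends up in `data` twice.
data = []
for file in files:
    with open(file, 'r', encoding='utf-8') as f:
        data += json.load(f)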

In [6]:
with open('taxpayers_backup.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)
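
The file list printed above contains `taxpayers_all.json` next to the per-range files, so the same record may be loaded more than once. A minimal sketch of de-duplicating on `element_uuid`, assuming that field uniquely identifies a record (the notebook does not confirm this). The helper `deduplicate` and the sample records are hypothetical and only serve the illustration.

```python
def deduplicate(records, key="element_uuid"):
    """Keep the first record seen for each key value, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        identifier = record[key]
        if identifier not in seen:
            seen.add(identifier)
            unique.append(record)
    return unique


# Hypothetical records, not taken from the dataset
sample = [
    {"element_uuid": "a", "transcription": "Rameau Jean françois"},
    {"element_uuid": "b", "transcription": "Vigoureux f↑ois↓"},
    {"element_uuid": "a", "transcription": "Rameau Jean françois"},
]
unique_sample = deduplicate(sample)
print(len(sample), "->", len(unique_sample))
```

If repeated transcriptions are legitimate (the same taxpayer listed on several pages), they carry different identifiers and are left untouched by this step.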

In [53]:
import torch
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_similarity(queries, corpus, corpus_embeddings, top_k):
    """Print the top_k corpus sentences closest to each query (cosine similarity)."""
    for query in queries:
        query_embedding = embedder.encode(query, convert_to_tensor=True)

        # Cosine similarity against the whole corpus, then torch.topk to keep the top_k highest scores
        similarity_scores = embedder.similarity(query_embedding, corpus_embeddings)[0]
        scores, indices = torch.topk(similarity_scores, k=top_k)

        print("\nQuery:", query)
        print(f"Top {top_k} most similar sentences in corpus:")

        for score, idx in zip(scores, indices):
            print(corpus[idx], f"(Score: {score:.4f})")
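
The ranking step inside `embedding_similarity` is a cosine similarity followed by a top-k selection. A minimal sketch of the same idea in plain Python, on made-up 2-D vectors rather than real sentence embeddings. The helpers `cosine` and `top_k_similar` are hypothetical names and assume non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(query_vec, corpus_vecs, k):
    """Return the k best (score, index) pairs, highest score first."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors standing in for embeddings
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranking = top_k_similar([1.0, 0.1], toy_corpus, k=2)
for score, idx in ranking:
    print(idx, f"(Score: {score:.4f})")
```

The query points almost along the first axis, so the first corpus vector ranks highest, followed by the diagonal one.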

## 0. Transcription matching

Code from https://www.sbert.net/examples/applications/semantic-search/README.html

In [54]:
# Corpus: every transcription except the first 10, which serve as queries
corpus = [d["transcription"] for d in data[10:]]
# Use "convert_to_tensor=True" to keep the tensors on GPU (if available)
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences: the first 10 transcriptions
queries = [d["transcription"] for d in data[:10]]

# Number of neighbours printed per query
top_k = min(15, len(corpus))

embedding_similarity(queries, corpus, corpus_embeddings, top_k)


Query: Rameau J↑n↓ louis fils veuve
Top 5 most similar sentences in corpus:
Jourdain J↑n↓ Louis Veuve (Score: 0.7487)
Rameau J↑n↓ f↑ois↓ veuve (Score: 0.7255)
Rameau Jean Louis fils→françois (Score: 0.7239)
Rameau helvébais françois (Score: 0.7207)
Rameau helvetuis françois (Score: 0.7081)
Rameau f↑ois↓ Joseph (Score: 0.6990)
Soudieux J↑n↓ louis (Score: 0.6909)
Mayeux J↑ques↓ Louis (Score: 0.6832)
Josset Jean Veuve (Score: 0.6752)
Josset J↑a↓ veuve (Score: 0.6690)
Soudieux Ch↑es↓ Louis (Score: 0.6688)
Rameau Jean françois (Score: 0.6678)
Rameau Jean françois (Score: 0.6678)
Leroux Louis Veuve (Score: 0.6663)
Metivier Louis F↑re↓ (Score: 0.6656)

Query: Vigoureux f↑ois↓
Top 5 most similar sentences in corpus:
Vigoureux J↑n↓→antoine (Score: 0.8356)
Vigoureux, J↑n↓ ant↑e↓→militaire (Score: 0.8218)
Vigoureux J↑n↓ ant↑e↓→militaire (Score: 0.8210)
Vigoureux J↑n↓ ant↑e↓→militaire (Score: 0.8210)
Vigoureux f↑ois↓→ex-militaire (Score: 0.8176)
Vigoureux f↑ois↓→ex-militaire (Score: 0.8176)
Vitry

## 1. Normalized transcription

In [55]:
import re
import unicodedata

def remove_accents(text):
    # Normalize the text to decompose characters (NFD: Normal Form Decomposed)
    normalized_text = unicodedata.normalize('NFD', text)
    # Filter out combining marks (category 'Mn' stands for Mark, Nonspacing)
    non_accented_text = ''.join(char for char in normalized_text if unicodedata.category(char) != 'Mn')
    return non_accented_text

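def normalize_text(text):
    # Lowercase, turn the '→' separator into a space, then drop every character
    # that is neither a word character nor whitespace (removes '↑', '↓', commas, hyphens)
    text = text.lower()
    text = text.replace('→', ' ')
    text = re.sub(r'[^\w\s]', '', text)
    return text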
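
Normalization alone leaves abbreviations untouched: `J↑n↓` becomes `jn` and `f↑ois↓` becomes `fois`. One possible extra pre-treatment step is to expand them with a lookup table. The sketch below is an assumption, not part of the notebook: the table `ABBREVIATIONS` and the function `expand_abbreviations` are hypothetical, and every mapping would need to be checked against the source registers.

```python
# Hypothetical abbreviation table, to be validated against the registers
ABBREVIATIONS = {
    "jn": "jean",
    "fois": "francois",
    "bte": "baptiste",
}


def expand_abbreviations(text, table=ABBREVIATIONS):
    """Replace each whitespace-separated token found in the table by its expansion."""
    return " ".join(table.get(token, token) for token in text.split())


expanded = expand_abbreviations("rameau jn fois veuve")
print(expanded)
```

A token-level lookup keeps the step transparent, at the cost of expanding a token even where it is not an abbreviation.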

In [56]:
transcriptions_normalized = [remove_accents(normalize_text(d["transcription"])) for d in data]

transcriptions_normalized_embeddings = embedder.encode(transcriptions_normalized, convert_to_tensor=True)

# Queries: the first 10 normalized transcriptions; corpus: all the others
embedding_similarity(transcriptions_normalized[:10], transcriptions_normalized[10:], transcriptions_normalized_embeddings[10:], top_k)


Query: rameau jn louis fils veuve
Top 5 most similar sentences in corpus:
rameau jn fois veuve (Score: 0.8590)
rameau jean louis fils francois (Score: 0.8113)
rameau helvebais francois (Score: 0.7704)
rameau helvetuis francois (Score: 0.7524)
laboreau martin veuve (Score: 0.7486)
rameau jean francois (Score: 0.7342)
rameau jean francois (Score: 0.7342)
davois louis veuve (Score: 0.7290)
josset jean veuve (Score: 0.7287)
mazareau pierre veuve veuve (Score: 0.7273)
rameau louis honore (Score: 0.7163)
rameau louis honore (Score: 0.7163)
bertaut antoine veuve (Score: 0.7114)
rameau jn bte (Score: 0.7086)
rameau jn bte (Score: 0.7086)

Query: vigoureux fois
Top 5 most similar sentences in corpus:
vigoureux fois exmilitaire (Score: 0.8924)
vigoureux fois exmilitaire (Score: 0.8924)
vigoureux jean (Score: 0.7882)
vitry je fois (Score: 0.7679)
vigoureux francois (Score: 0.7637)
vigoureux jn antoine (Score: 0.7443)
vigoureux francois vigon a idem (Score: 0.7107)
vigoureux gabriel fabt de bas (

## 2. Re-structured + normalized transcription

In [81]:
transcriptions_structured = []

for d in data:
    uuid = d["element_uuid"]
    counter = 1  # Position of the taxpayer within this record
    try:
        for taxpayer_json in d["entities_json"]["taxpayers"]:
            taxpayer_desc = []
            if len(taxpayer_json['name']) > 0:
                name = remove_accents(normalize_text(taxpayer_json['name']))
                taxpayer_desc.append(name)
            if len(taxpayer_json['firstnames']) > 0:
                # First names keep their transcribed order (sorting is left disabled)
                firstnames = remove_accents(normalize_text(taxpayer_json['firstnames']))
                taxpayer_desc.append(firstnames)
            transcriptions_structured.append([uuid, counter, " ".join(taxpayer_desc)])
            counter += 1
    except (KeyError, TypeError, AttributeError) as err:
        # Record without usable entities: report its identifier and move on
        print(uuid, repr(err))

In [82]:
transcriptions_structured_inputs = [e[2] for e in transcriptions_structured]

In [83]:
transcriptions_structured_embeddings = embedder.encode(transcriptions_structured_inputs, convert_to_tensor=True)

embedding_similarity(transcriptions_structured_inputs[:15], transcriptions_structured_inputs[15:], transcriptions_structured_embeddings[15:], top_k)


Query: rameau jn louis
Top 5 most similar sentences in corpus:
rameau jean louis (Score: 0.8764)
rameau louis victor (Score: 0.8356)
rameau louis honore (Score: 0.8255)
rameau louis honore (Score: 0.8255)
rameau jean francois (Score: 0.7834)
rameau jean francois (Score: 0.7834)
rameau jn bte (Score: 0.7784)
rameau jn bte (Score: 0.7784)
rameau jn bte (Score: 0.7784)
rameau jn fois (Score: 0.7521)
rameau pierre come (Score: 0.7496)
rameau paul clement (Score: 0.7356)
rameau helvetuis francois (Score: 0.7356)
rameau helvebais francois (Score: 0.7346)
rameau fois joseph (Score: 0.7153)

Query: rameau jn louis
Top 5 most similar sentences in corpus:
rameau jean louis (Score: 0.8764)
rameau louis victor (Score: 0.8356)
rameau louis honore (Score: 0.8255)
rameau louis honore (Score: 0.8255)
rameau jean francois (Score: 0.7834)
rameau jean francois (Score: 0.7834)
rameau jn bte (Score: 0.7784)
rameau jn bte (Score: 0.7784)
rameau jn bte (Score: 0.7784)
rameau jn fois (Score: 0.7521)
rameau p

- The "idem" entries must be resolved before this step if we want to exploit all the available properties...

## Clustering test
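
As a rough illustration of the agglomerative family listed in the introduction, here is a minimal sketch of single-linkage clustering cut at a distance threshold, written in plain Python. It is a simplified stand-in for a library implementation, not the method used in this notebook: the function names, the toy 2-D vectors (in place of sentence embeddings) and the threshold of 0.1 are all assumptions.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity, for non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def single_linkage_clusters(vectors, threshold):
    """Merge any two points closer than `threshold`, transitively.

    Cutting a single-linkage dendrogram at `threshold` gives the connected
    components of the graph linking points at distance <= threshold, which is
    what this union-find computes.
    """
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) <= threshold:
                parent[find(i)] = find(j)

    # Relabel roots as consecutive cluster ids, in order of first appearance
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(vectors))]


# Toy vectors: two points near each axis
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = single_linkage_clusters(toy_vectors, threshold=0.1)
print(toy_labels)
```

The pairwise loop is quadratic in the number of points, so this form only suits small samples. Unlike k-means, it does not require the number of clusters in advance, which matters here since the number of distinct taxpayers is unknown.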