##### First Proposed Solution

For Solution 1 we chose a rule-based approach, prioritizing interpretability and speed of implementation. We started by analyzing the available data and identified three sources: Company Description, Business Tags, and Sector. We created seeds for each label in the taxonomy; where seeds are missing, we extract keywords from the label name. The scoring system is deliberately simple: a term found in Business Tags is worth 3 points, a term found in Sector is worth 2 points, and a term found in Description is worth 1 point. A label is assigned when its score reaches the threshold of 2.

We chose a rule-based approach because the rules are simple and can be audited immediately. Every decision remains explainable. The business team can review the seeds, corrections take effect quickly, and the startup cost stays low.

The scores differ because Business Tags come from structured data. We treated those tags as strong signals and therefore gave them the highest weight. Description is free text, where incidental mentions occur, so it gets the lowest weight. Sector provides an intermediate signal and gets a medium weight.

The threshold of 2 ensures that a single mention in Description does not trigger a label, while still allowing signals to combine. For example, if a term appears in both Description and Sector, the sum 1 + 2 = 3 passes the threshold. If it appears only in Business Tags, its 3 points pass the threshold on their own. This rule reduces false-positive labels caused by common words.
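The arithmetic above can be checked with a minimal sketch. The names `WEIGHTS`, `THRESHOLD`, `total_score`, and `passes` are illustrative and do not appear in the notebook; the values mirror the weights and threshold described in the text.

```python
# Illustrative weights and threshold, mirroring the values described in the text.
WEIGHTS = {"tags": 3, "sector": 2, "description": 1}
THRESHOLD = 2


def total_score(fields_with_match):
    """Sum the weights of the fields in which a term was found."""
    return sum(WEIGHTS[field] for field in fields_with_match)


def passes(fields_with_match):
    """A label is assigned when the summed score reaches the threshold."""
    return total_score(fields_with_match) >= THRESHOLD


for fields in (["description"], ["sector"], ["description", "sector"], ["tags"]):
    print(fields, total_score(fields), passes(fields))
```

Because the notebook compares with `v >= 2`, a match in Sector alone (2 points) also reaches the threshold; only a lone Description match is filtered out.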

We cleaned the text by removing non-alphanumeric characters and excess whitespace. We built the term-to-label mapping, then iterated over every company and computed its scores. We saved the assignments to CSV and JSON, recorded the companies that received no label, and extracted the most frequently assigned labels. We also identified companies with uncertain labels, meaning a maximum score below 2 or no scores at all.

We produced labels for the existing dataset and saved the files `rule_based_assigned.csv` and `rule_based_results.json`. For each company we kept the per-label scores, which allows a complete audit: we can identify exactly which terms influenced a decision.
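One way to make that audit trail explicit is to record the matched terms next to each score. The following is a sketch of that idea, not the notebook's code: `score_with_evidence`, `FIELD_WEIGHTS`, and the toy `term_map` are hypothetical. It uses the same substring test (`term in text`) and the same weights as the notebook's `score_row`.

```python
from collections import defaultdict

# Same field weights as described in the text.
FIELD_WEIGHTS = {"tags": 3, "sector": 2, "desc": 1}


def score_with_evidence(fields, term_map):
    """Return {label: {"score": int, "evidence": [(term, field), ...]}}.

    `fields` maps a field name ("tags", "sector", "desc") to cleaned text;
    `term_map` maps a seed term to the set of labels it supports.
    """
    result = defaultdict(lambda: {"score": 0, "evidence": []})
    for term, labels in term_map.items():
        for field, weight in FIELD_WEIGHTS.items():
            if term and term in fields.get(field, ""):
                for label in labels:
                    result[label]["score"] += weight
                    result[label]["evidence"].append((term, field))
    return dict(result)


# Toy data, invented for illustration only.
term_map = {"roofing": {"Roofing Services"}, "plumbing": {"Plumbing Services"}}
fields = {
    "tags": "roofing contractor",
    "sector": "services",
    "desc": "roofing and plumbing work",
}
audit = score_with_evidence(fields, term_map)
print(audit["Roofing Services"])
```

Each label then carries both its total and the list of `(term, field)` pairs that produced it, which can be written to JSON alongside the scores.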

This first solution excels at transparency: every label has an explanation that is easy to trace. It is quick to implement and runs offline, without a GPU. Corrections are applied manually, with no retraining. It is useful in early stages, well suited to auditing, and a good source of weak labels.

This first solution also has limitations, for example sensitivity to synonyms: if the seeds do not include variants, we miss those mentions. Compound or inverted expressions can fail to match. Common words can generate noise. Rare classes remain underrepresented. Short text or missing Business Tags leave many companies unlabeled.
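A common way to soften the synonym limitation is to expand each seed with a hand-maintained list of variants before building the term map. The sketch below assumes a hypothetical `SYNONYMS` dictionary and `expand_seeds` helper; neither exists in the notebook, and the variants shown are only examples.

```python
# Hypothetical, hand-maintained variants per seed (not part of the notebook).
SYNONYMS = {
    "automotive": ["auto", "autos", "carrozzeria"],
}


def expand_seeds(seeds, synonyms=SYNONYMS):
    """Return the seeds plus their known variants, de-duplicated, order preserved."""
    expanded = []
    for seed in seeds:
        expanded.append(seed)
        expanded.extend(synonyms.get(seed, []))
    return list(dict.fromkeys(expanded))


print(expand_seeds(["automotive", "repair"]))
```

Short variants such as "auto" raise the risk of accidental matches, so an expansion like this is best reviewed by the same people who maintain the seeds.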

Possible errors include false positives, when a generic term appears in Description but the context does not indicate business relevance. For example, the word "service" can appear in dozens of companies, and the system will match it to labels whenever a seed contains "service".
False negatives occur when synonyms or abbreviations are missing from the seeds. For example, the seed "automotive" does not catch "auto", "autos", or "carrozzeria".
Inverted context is a third case: when Description mentions non-core activities, such as "works with insurers", the system can wrongly label the company as an insurer.
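The notebook's `score_row` tests terms with Python's substring operator (`term in text`), so a short seed can also fire inside a longer word. One possible mitigation, sketched here and not part of the notebook, is to match whole words with a regular expression. The helper name `contains_term` is hypothetical.

```python
import re


def contains_term(term, text):
    """True if `term` occurs in `text` as a whole word or phrase.

    The lookarounds require a non-word character (or the string edge)
    on both sides of the term.
    """
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, text) is not None


print("ice" in "cleaning services")               # substring test: True
print(contains_term("ice", "cleaning services"))  # whole-word test: False
print(contains_term("ice", "ice production"))     # whole-word test: True
```

This addresses accidental substring hits only. Generic terms and non-core mentions, as in the cases above, would still need curated seeds or a stop list.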

Clean Business Tags increase accuracy, while unnormalized Business Tags lead to fragmentation. Very short descriptions reduce the signal. Multilingual data requires extending the seeds to multiple languages.
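In the sample rows shown at the end of the notebook, `business_tags` looks like a stringified Python list. Under that assumption, a minimal normalization step could parse the list, lowercase and trim each tag, and drop duplicates. The function name `normalize_tags` is hypothetical, and the fallback for values that are not list literals is a guess about how malformed entries might look.

```python
import ast


def normalize_tags(raw):
    """Parse a stringified list of tags into a clean, de-duplicated list."""
    if not isinstance(raw, str) or not raw.strip():
        return []
    try:
        parsed = ast.literal_eval(raw)
        tags = parsed if isinstance(parsed, (list, tuple)) else [raw]
    except (ValueError, SyntaxError):
        # Assumed fallback: treat the value as a comma-separated string.
        tags = raw.split(",")
    # Lowercase and collapse internal whitespace so near-duplicates merge.
    cleaned = [" ".join(str(tag).lower().split()) for tag in tags]
    return list(dict.fromkeys(tag for tag in cleaned if tag))


print(normalize_tags("['Construction Services', 'construction  services', 'Multi-utilities']"))
```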

Evaluation relies on manual labels for a dev set of 300 companies, with per-label precision, recall, and F1 as metrics. Optional checks are average precision and precision on the top-1 label, plus an error report of 20 false positives and false negatives.
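Per-label precision, recall, and F1 on such a dev set can be computed without extra dependencies. This sketch assumes gold and predicted labels are available as one set of labels per company; `per_label_prf` and the toy data are illustrative only.

```python
from collections import Counter


def per_label_prf(gold, pred):
    """Compute precision, recall and F1 per label.

    `gold` and `pred` are equal-length lists with one set of labels per company.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for label in p & g:
            tp[label] += 1
        for label in p - g:
            fp[label] += 1
        for label in g - p:
            fn[label] += 1
    report = {}
    for label in set(tp) | set(fp) | set(fn):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Toy data, invented for illustration only.
gold = [{"A"}, {"A", "B"}, {"B"}]
pred = [{"A"}, {"A"}, {"A", "B"}]
report = per_label_prf(gold, pred)
print(report["A"])
print(report["B"])
```

The `fp` and `fn` counters can also be used to pick the cases that go into a manual error report.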

In conclusion, we assumed that the taxonomy remains stable, that Business Tags have a consistent meaning, and that the descriptions contain enough signal for most companies. Solution 1 can serve as a first labeling layer and as an audit tool.



In [1]:
from pathlib import Path
import pandas as pd
import re
import json
from collections import defaultdict, Counter

In [2]:
tax_path = Path('insurance_taxonomy_label.csv')
comp_path = Path('insurances_company_list.csv')

assert tax_path.exists(), f"Taxonomy file not found at {tax_path}"
assert comp_path.exists(), f"Companies file not found at {comp_path}"

In [3]:
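# Text-cleaning helper: lowercase, strip unsupported characters, collapse whitespace

def clean_text(s):
    if pd.isna(s):  # pd.isna also covers None
        return ''
    s = str(s).lower()
    # Keep ASCII letters, digits, selected accented letters, whitespace and hyphens
    s = re.sub(r'[^a-z0-9ăîșțâàáäëéíóúüç\s\-]', ' ', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s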

In [5]:
tax_df = pd.read_csv(tax_path)
if 'label' not in tax_df.columns:
    tax_df = tax_df.rename(columns={tax_df.columns[0]: 'label'})
if 'seeds' not in tax_df.columns:
    tax_df['seeds'] = ''

taxonomy = {}
for _, row in tax_df.iterrows():
    label = str(row['label']).strip()
    seeds_raw = '' if pd.isna(row['seeds']) else str(row['seeds'])  # a NaN cell would otherwise become the seed 'nan'
    seeds = [clean_text(s) for s in seeds_raw.split('|') if s.strip()]
    if not seeds:
        words = re.split(r'[\s\-_]+', label.lower())
        seeds = [w for w in words if len(w) >= 3]
    taxonomy[label] = list(dict.fromkeys(seeds))
len(taxonomy)

220

In [6]:
term_to_labels = defaultdict(set)
for lab, seeds in taxonomy.items():
    for s in seeds:
        if s:
            term_to_labels[s].add(lab)


list(term_to_labels.items())[:10]  # show a few examples

[('agricultural',
  {'Agricultural Equipment Services', 'Agricultural Machinery Installation'}),
 ('equipment',
  {'Agricultural Equipment Services',
   'Commercial Communication Equipment Installation',
   'Fire Safety Equipment Services',
   'Residential Communication Equipment Installation'}),
 ('services',
  {'Advertising Services',
   'Agricultural Equipment Services',
   'Air Duct Cleaning Services',
   'Alarm Installation Services',
   'Animal Day Care Services',
   'Animal Training Services',
   'Apartment Renovation Services',
   'Arts Services',
   'Asphalt Production Services',
   'Bakery Production Services',
   'Boiler Installation Services',
   'Boiler Repair Services',
   'Branding Services',
   'Building Cleaning Services',
   'Business Development Services',
   'Cable Installation Services',
   'Canning Services',
   'Carpentry Services',
   'Carpet Manufacturing Services',
   'Catering Services',
   'Coffee Processing Services',
   'Commercial Construction Services',


In [7]:
comp_df = pd.read_csv(comp_path)
name_col = desc_col = tags_col = sector_col = None
for c in comp_df.columns:
    lc = c.lower()
    if 'name' in lc and not name_col:
        name_col = c
    if 'description' in lc and not desc_col:
        desc_col = c
    if 'tag' in lc and not tags_col:
        tags_col = c
    if 'sector' in lc and not sector_col:
        sector_col = c

if not name_col:
    name_col = comp_df.columns[0]
if not desc_col:
    desc_col = comp_df.columns[1] if comp_df.shape[1] > 1 else comp_df.columns[0]
if not tags_col:
    tags_col = comp_df.columns[2] if comp_df.shape[1] > 2 else desc_col


name_col, desc_col, tags_col, sector_col

('description', 'description', 'business_tags', 'sector')

In [8]:
comp_df['_desc_clean'] = comp_df[desc_col].fillna('').apply(clean_text)
comp_df['_tags_clean'] = comp_df[tags_col].fillna('').apply(clean_text)
if sector_col:
    comp_df['_sector_clean'] = comp_df[sector_col].fillna('').apply(clean_text)
else:
    comp_df['_sector_clean'] = ''
comp_df.shape

(9494, 8)

In [9]:
def score_row(desc, tags, sector, term_map):
    """Score each label by weighted substring matches: tags 3, sector 2, description 1."""
    scores = defaultdict(int)
    for term, labels in term_map.items():
        if not term:
            continue
        # `in` is a substring test, so short terms can also match inside longer words
        points = 3 * (term in tags) + 2 * (term in sector) + (term in desc)
        if points:
            for label in labels:
                scores[label] += points
    return dict(scores)

In [10]:
assigned = []
scores_list = []
for idx, row in comp_df.iterrows():
    sc = score_row(row['_desc_clean'], row['_tags_clean'], row['_sector_clean'], term_to_labels)
    scores_list.append(sc)
    assigned_labels = [lab for lab, v in sc.items() if v >= 2]
    assigned.append(assigned_labels)

comp_df['rule_scores'] = scores_list
comp_df['assigned_rule'] = assigned

In [12]:


out_csv = Path('s1/rule_based_assigned.csv')
out_json = Path('s1/rule_based_results.json')
out_per_company = Path('s1/rule_based_results_per_company.json')
out_csv.parent.mkdir(parents=True, exist_ok=True)  # create the output folder if it is missing

comp_df.to_csv(out_csv, index=False)
comp_df[['assigned_rule', 'rule_scores']].to_json(out_per_company, orient='records', force_ascii=False)

# Summary
n_companies = len(comp_df)
n_assigned_any = sum(1 for a in assigned if a)
n_unassigned = n_companies - n_assigned_any
all_assigned = [lab for labs in assigned for lab in labs]
counter = Counter(all_assigned)
top_labels = counter.most_common(30)

summary = {
    'n_companies': n_companies,
    'n_assigned_any': n_assigned_any,
    'n_unassigned': n_unassigned,
    'top_labels': top_labels
}
with open(out_json, 'w', encoding='utf-8') as f:
    json.dump(summary, f, ensure_ascii=False, indent=2)

print("Saved:", out_csv)
print("Saved:", out_per_company)
print("Saved:", out_json)
summary

Saved: s1\rule_based_assigned.csv
Saved: s1\rule_based_results_per_company.json
Saved: s1\rule_based_results.json


{'n_companies': 9494,
 'n_assigned_any': 9386,
 'n_unassigned': 108,
 'top_labels': [('Gas Manufacturing Services', 8537),
  ('Tent Manufacturing Services', 8537),
  ('Carpet Manufacturing Services', 8534),
  ('Textile Manufacturing Services', 8528),
  ('HVAC Installation and Service', 8108),
  ('Testing and Inspection Services', 8022),
  ('Fishing and Hunting Services', 8007),
  ('Air Duct Cleaning Services', 7805),
  ('Ice Production Services', 7457),
  ('Window and Door Manufacturing', 6968),
  ('Oil and Fat Manufacturing', 6951),
  ('Paper Production Services', 6857),
  ('Ink Production Services', 6856),
  ('Rope Production Services', 6844),
  ('Roofing Services with Heat Application', 6819),
  ('Media Production Services', 6818),
  ('Soap Production Services', 6802),
  ('Dairy Production Services', 6796),
  ('Bakery Production Services', 6787),
  ('Asphalt Production Services', 6780),
  ('Fire Safety Equipment Services', 6741),
  ('Agricultural Equipment Services', 6735),
  ('Food

In [13]:
uncertain = []
for i, sc in enumerate(scores_list):
    if not sc or max(sc.values()) < 2:
        uncertain.append({
            'index': i,
            'company': comp_df.iloc[i][name_col] if name_col in comp_df.columns else i,
            'desc': comp_df.iloc[i][desc_col] if desc_col in comp_df.columns else '',
            'tags': comp_df.iloc[i][tags_col] if tags_col in comp_df.columns else '',
            'scores': sc
        })


uncertain_out = Path('s1/uncertain_companies_sample.json')
with open(uncertain_out, 'w', encoding='utf-8') as f:
    json.dump(uncertain[:200], f, ensure_ascii=False, indent=2)  # save the first 200 uncertain companies

len(uncertain), uncertain_out

(108, WindowsPath('s1/uncertain_companies_sample.json'))

In [17]:
print("Companies processed:", n_companies)
print("Companies with at least one label:", n_assigned_any)
print("Companies without a label:", n_unassigned)
print("\nTop 10 most frequent labels:")
for lab, cnt in top_labels[:10]:
    print(f"{lab}: {cnt}")

print("\nFirst 5 companies and labels:")
display(comp_df[[name_col, desc_col, tags_col, 'assigned_rule']].head(5))


Companies processed: 9494
Companies with at least one label: 9386
Companies without a label: 108

Top 10 most frequent labels:
Gas Manufacturing Services: 8537
Tent Manufacturing Services: 8537
Carpet Manufacturing Services: 8534
Textile Manufacturing Services: 8528
HVAC Installation and Service: 8108
Testing and Inspection Services: 8022
Fishing and Hunting Services: 8007
Air Duct Cleaning Services: 7805
Ice Production Services: 7457
Window and Door Manufacturing: 6968

First 5 companies and labels:


| | description | description.1 | business_tags | assigned_rule |
|---|---|---|---|---|
| 0 | Welchcivils is a civil engineering and constru... | Welchcivils is a civil engineering and constru... | ['Construction Services', 'Multi-utilities', '... | [Market Research Services, General Handyman Se... |
| 1 | Kyoto Vegetable Specialists Uekamo, also known... | Kyoto Vegetable Specialists Uekamo, also known... | ['Wholesale', 'Dual-task Movement Products', '... | [Agricultural Equipment Services, Agricultural... |
| 2 | Loidholdhof Integrative Hofgemeinschaft is a c... | Loidholdhof Integrative Hofgemeinschaft is a c... | ['Living Forms', 'Farm Cafe', 'Fresh Coffee', ... | [Gas Manufacturing Services, Ice Production Se... |
| 3 | PATAGONIA Chapa Y Pintura is an auto body shop... | PATAGONIA Chapa Y Pintura is an auto body shop... | ['Automotive Body Repair Services', 'Interior ... | [Market Research Services, General Handyman Se... |
| 4 | Stanica WODNA PTTK Swornegacie is a cultural e... | Stanica WODNA PTTK Swornegacie is a cultural e... | ['Cultural Activities', 'Accommodation Service... | [Market Research Services, General Handyman Se... |
