# Data matching

This step is also known as data deduplication or data linkage.

Here, we merge names that are written differently but refer to the same entity, such as 'Nestlé' and 'Nestlé S.A. de C.V.', for:

1. Suppliers
2. Buyers (to a lesser extent)

## Method

Data matching is a multi-step process that includes:

1. Data cleaning
2. Data similarity computation
3. Data clustering
4. Data matching
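
The four steps can be sketched on a toy list of names. This is a minimal illustration using only the standard library, not the method used by the project's dedup script: the cleaning rules, the `difflib` similarity measure, the 0.5 threshold, and the helper names `clean`, `similarity`, and `cluster` are all assumptions made for the example.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def clean(name):
    """Step 1 (cleaning): lowercase, strip accents and punctuation."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())


def similarity(a, b):
    """Step 2 (similarity): character-level ratio between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()


def cluster(names, threshold=0.5):
    """Step 3 (clustering): link every pair above the threshold (union-find)."""
    cleaned = [clean(n) for n in names]
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Compare all pairs; fine for a toy list, quadratic in general.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(cleaned[i], cleaned[j]) >= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(names))]


names = ["Nestlé", "Nestlé S.A. de C.V.", "Bimbo"]
cluster_ids = cluster(names)

# Step 4 (matching): label every name with the name stored at its
# cluster's root index.
canonical = [names[c] for c in cluster_ids]
```

The threshold is arbitrary here; in practice it trades false merges against missed duplicates and would need tuning on labeled pairs.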

## Blocks

We split the suppliers into two blocks, people and companies, and deduplicate each block separately.
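
One reason blocking helps is that it reduces the number of candidate pairs to compare. The arithmetic below is a small sketch; the block sizes are invented for illustration and are not taken from the contracts data, and `n_pairs` is a hypothetical helper.

```python
from math import comb


def n_pairs(block_sizes):
    """Number of within-block pairs, summed over all blocks."""
    return sum(comb(n, 2) for n in block_sizes)


# Illustrative sizes only: 1,000 companies and 200 people.
unblocked = n_pairs([1000 + 200])  # every record compared with every other
blocked = n_pairs([1000, 200])     # comparisons stay inside each block
```

With these made-up sizes, blocking skips the 1,000 × 200 cross-block pairs, which could never be true matches anyway.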

In [3]:
import pandas as pd
import re

In [4]:
import numpy as np

In [5]:
CNTS = '/home/rdora/declaranet/data/pre-process/contratos_2.csv'
cnts = pd.read_csv(CNTS, parse_dates=['start_date'])

In [5]:
cols = ['supplier', 'supplier_state', 'supplier_country']

In [112]:
comps = cnts[cnts.person==0][cols]

people = cnts[cnts.person==1][cols]

In [32]:
gb_comps = comps.groupby('supplier').first().reset_index()

gb_people = people.groupby('supplier').first().reset_index()

In [33]:
COMPS = '/home/rdora/declaranet/data/dedup/companies.csv'
gb_comps.to_csv(COMPS, index=False)

PEOPLE = '/home/rdora/declaranet/data/dedup/people.csv'
gb_people.to_csv(PEOPLE, index=False)

# Run Dedup

Here, we run the script `~/declaranet/python/deduplicate.py` to create a cluster for each record.
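
The merges further down assume that each supplier name belongs to exactly one cluster. A quick sanity check along these lines could be run on the script's output; the frame below is a made-up stand-in with the same column names, not real output.

```python
import pandas as pd

# Toy stand-in for the dedup output; the values are invented.
clustered = pd.DataFrame({
    "Cluster ID": [0, 0, 1],
    "supplier": ["NESTLE", "NESTLE SA DE CV", "BIMBO"],
    "supplier_state": ["CDMX", None, "CDMX"],
})

# Each supplier name should map to a single cluster; otherwise a left
# merge on 'supplier' would duplicate contract rows.
clusters_per_supplier = clustered.groupby("supplier")["Cluster ID"].nunique()
ambiguous = clusters_per_supplier[clusters_per_supplier > 1]

# Number of names in each cluster; sizes above 1 are merged duplicates.
cluster_sizes = clustered["Cluster ID"].value_counts()
```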

In [114]:
COMPS_2 = '/home/rdora/declaranet/data/dedup/output_companies.csv'
comps_cl = pd.read_csv(COMPS_2)

In [115]:
PEOPLE_2 = '/home/rdora/declaranet/data/dedup/output_people.csv'
people_cl = pd.read_csv(PEOPLE_2)

# Groupby

In [84]:
def get_mode(group):
    vc = group.value_counts()
    if vc.shape[0]:
        return vc.index[0]
    else:
        return np.nan
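
To illustrate what `get_mode` returns, here is a small sketch on invented data. The function is repeated so the block runs on its own; the toy frame is an assumption, not part of the contracts data.

```python
import numpy as np
import pandas as pd


def get_mode(group):
    # value_counts() drops missing values and sorts by frequency,
    # so the first index entry is the most common non-null value.
    vc = group.value_counts()
    return vc.index[0] if vc.shape[0] else np.nan


toy = pd.DataFrame({
    "Cluster ID": [0, 0, 0, 1],
    "supplier_state": ["CDMX", "CDMX", None, None],
})

# Cluster 0 has a clear mode; cluster 1 has only missing values.
modes = toy.groupby("Cluster ID")["supplier_state"].agg(get_mode)
```

A cluster whose values are all missing yields `NaN`, which is why some contracts still lack a supplier state after deduplication.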

In [124]:
gb_comps = comps_cl.groupby('Cluster ID')[['supplier', 'supplier_state']].agg(get_mode).reset_index()

In [125]:
gb_people = people_cl.groupby('Cluster ID')[['supplier', 'supplier_state']].agg(get_mode).reset_index()

## Companies

In [162]:
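comps = cnts[cnts.person==0]

# Attach the cluster of each supplier name
comps = pd.merge(comps,
                comps_cl[['Cluster ID', 'supplier']],
                on='supplier',
                how='left')

# Drop the original name and state; the cluster-level values replace them
comps = comps.drop(['supplier', 'supplier_state'], axis=1)

comps = pd.merge(comps, gb_comps, on='Cluster ID', how='left')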
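
The two-step merge can be sketched on toy frames. All names and values below are invented for illustration; `validate='many_to_one'` is an optional safeguard added in this sketch and is not used in the notebook.

```python
import pandas as pd

contracts = pd.DataFrame({
    "supplier": ["NESTLE", "NESTLE SA DE CV", "BIMBO"],
    "supplier_state": [None, "CDMX", "CDMX"],
    "amount": [100.0, 250.0, 75.0],
})
clusters = pd.DataFrame({
    "Cluster ID": [0, 0, 1],
    "supplier": ["NESTLE", "NESTLE SA DE CV", "BIMBO"],
})
canonical = pd.DataFrame({
    "Cluster ID": [0, 1],
    "supplier": ["NESTLE SA DE CV", "BIMBO"],
    "supplier_state": ["CDMX", "CDMX"],
})

# Step 1: look up the cluster of each contract's supplier name.
step1 = contracts.merge(clusters, on="supplier", how="left",
                        validate="many_to_one")

# Step 2: replace name and state with the cluster-level values.
step2 = step1.drop(columns=["supplier", "supplier_state"])
result = step2.merge(canonical, on="Cluster ID", how="left",
                     validate="many_to_one")
```

In this toy case the contract that had no state inherits "CDMX" from its cluster, which is the effect the counts printed in the next cell measure.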

In [177]:
print("New contracts with Supplier State:",
      cnts[cnts.person==0].supplier_state.isna().sum() - comps.supplier_state.isna().sum())

print("Number of duplicated companies:",
     cnts[cnts.person==0].supplier.nunique() - comps.supplier.nunique())

New contracts with Supplier State: 150178
Number of duplicated companies: 31334


In [1]:
# Total duplicates removed: companies + people
31334 + 5452

36786

In [12]:
cnts.supplier.nunique()

261318

In [13]:
cnts.buyer.nunique()

5070

## People

In [176]:
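people = cnts[cnts.person==1]

# Attach the cluster of each supplier name
people = pd.merge(people,
                people_cl[['Cluster ID', 'supplier']],
                on='supplier',
                how='left')

# Drop the original name and state; the cluster-level values replace them
people = people.drop(['supplier', 'supplier_state'], axis=1)

people = pd.merge(people, gb_people, on='Cluster ID', how='left')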

In [180]:
print("New contracts with Supplier State:",
      cnts[cnts.person==1].supplier_state.isna().sum() - people.supplier_state.isna().sum())

print("Number of duplicated people:",
     cnts[cnts.person==1].supplier.nunique() - people.supplier.nunique())

New contracts with Supplier State: 7457
Number of duplicated people: 5452


# Merge

In [193]:
cnts_final = pd.concat([comps, people]).drop('Cluster ID', axis=1)

Next, we correct the daily price: wherever `daily_price` is infinite, we replace it with the contract's total `amount`.

In [204]:
cnts_final.loc[cnts_final.daily_price == np.inf, 'daily_price'] = (
    cnts_final.loc[cnts_final.daily_price == np.inf, 'amount'])
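
The same correction on a toy frame, as a sketch. The `days` column and the way `daily_price` is derived here are assumptions for illustration; the notebook does not show how the infinite values arose.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"amount": [100.0, 300.0], "days": [0, 3]})

# Hypothetical derivation: dividing by a zero duration yields inf.
toy["daily_price"] = toy["amount"] / toy["days"]

# Build the mask once and reuse it on both sides of the assignment.
is_inf = np.isinf(toy["daily_price"])
toy.loc[is_inf, "daily_price"] = toy.loc[is_inf, "amount"]
```

Note that `np.isinf` also catches negative infinity, whereas comparing with `== np.inf` does not.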

In [210]:
supplier = set(cnts.supplier.unique())
supplier_2 = set(cnts_final.supplier.unique())
miss = supplier - supplier_2

print(f"Number of deduplicated edges: {cnts[cnts.supplier.isin(miss)].shape[0]:,}")

Number of deduplicated edges: 570,575


In [206]:
OUT = '/home/rdora/declaranet/data/pre-process/contratos_3.csv'
cnts_final.to_csv(OUT, index=False)