# Contracts

In this notebook, we are going to clean and fill data to obtain a final version of the contracts table.

The goals are, for as many contracts as possible:

1. Get the Mexican state of the buyer and supplier.
2. Characterize the supplier as a legal entity (*persona moral*, i.e. a company) or a natural person (*persona física*).

In [286]:
import pandas as pd
import numpy as np

In [287]:
import pickle

In [288]:
import re

In [289]:
from unidecode import unidecode

In [290]:
from sklearn.metrics import confusion_matrix

In [291]:
import sklearn.metrics

In [292]:
CONTRACTS = '/home/rdora/declaranet/data/tables/contratos.csv'
contracts = pd.read_csv(CONTRACTS)

In [293]:
print(f"N rows: {contracts.shape[0]:,}")

N rows: 1,801,208


In [294]:
contracts.columns

Index(['gobierno', 'siglas', 'dependencia', 'claveuc', 'nombre_de_la_uc',
       'responsable', 'codigo_expediente', 'fecha_apertura_proposiciones',
       'caracter', 'tipo_contratacion', 'tipo_procedimiento',
       'forma_procedimiento', 'codigo_contrato', 'titulo_contrato',
       'fecha_inicio', 'fecha_fin', 'importe_contrato', 'moneda',
       'estatus_contrato', 'folio_rupc', 'proveedor_contratista',
       'estratificacion_mpc', 'siglas_pais', 'anuncio'],
      dtype='object')

## Clean `contracts`

Some columns will not be needed for our analysis.

In [295]:
contracts.iloc[0]

gobierno                                                                      APF
siglas                                                                        CFE
dependencia                                      Comisión Federal de Electricidad
claveuc                                                                 018TOQ093
nombre_de_la_uc                             CFE-C.T. JOSE ACEVES POZOS #018TOQ093
responsable                                       FRANCISCO JAVIER ZAZUETA RIVERA
codigo_expediente                                                          117613
fecha_apertura_proposiciones                                  2011-12-15 11:00:00
caracter                                                            Internacional
tipo_contratacion                                                   Adquisiciones
tipo_procedimiento                           Invitación a Cuando Menos 3 Personas
forma_procedimiento                                                         Mixta
codigo_contrato 

In [296]:
cols2drop = ['anuncio',
             'estratificacion_mpc',
             'estatus_contrato',
             'titulo_contrato',
             'forma_procedimiento',
             'caracter',
             'responsable']
contracts = contracts.drop(cols2drop, axis=1)

# UC: public entities

In [297]:
uc = pd.read_excel('/home/rdora/declaranet/data/tables/UC_200529064722.xlsx')

In [298]:
uc.shape

(5558, 13)

## Manual curated dictionary of UC (public entities)

Each entity was obtained by manually looking up the code that follows the `#` in the UC name in the DOF (Diario Oficial de la Federación).

In [299]:
with open('/home/rdora/declaranet/data/pickle/manual_uc.p', 'rb') as f:
    state_dict = pickle.load(f)

## Two different ways to match UC

We'll try both:

1. name (`nombre_de_la_uc`)
2. code (`claveuc`)

1. By code 

In [300]:
ren = {'Ramo': 'ramo',
       'Clave de la UC': 'claveuc',
       'Entidad Federativa': 'b_entidad_federativa'}
uc_code = uc.rename(columns=ren)[['ramo', 'claveuc', 'b_entidad_federativa']]

In [301]:
cnts = pd.merge(contracts, uc_code, how='left', on='claveuc')

In [302]:
(cnts.shape[0] - cnts.b_entidad_federativa.isna().sum()) / cnts.shape[0]

0.7932620774502445

2. By name

In [303]:
ren = {'Ramo': 'n_ramo',
       'Nombre de la UC': 'nombre_de_la_uc',
       'Entidad Federativa': 'n_entidad_federativa'}
uc_name = uc.rename(columns=ren)[['n_ramo', 'nombre_de_la_uc', 'n_entidad_federativa']]
cnts = pd.merge(cnts, uc_name, how='left', on='nombre_de_la_uc')

In [304]:
(cnts.shape[0] - cnts.n_entidad_federativa.isna().sum()) / cnts.shape[0]

0.7016607743247865

3. By both

In [305]:
# Merge both info about `ramo`
cnts.loc[(cnts.n_ramo.notna()) & (cnts.ramo.isna()), 'ramo'] = (
    cnts.loc[(cnts.n_ramo.notna()) & (cnts.ramo.isna()), 'n_ramo'])

In [306]:
# Merge both info about `entidad_federativa`
cnts.loc[(cnts.n_entidad_federativa.notna()) & (cnts.b_entidad_federativa.isna()), 'b_entidad_federativa'] = (
    cnts.loc[(cnts.n_entidad_federativa.notna()) & (cnts.b_entidad_federativa.isna()), 'n_entidad_federativa'])

In [307]:
print("Ramo: ", (cnts.shape[0] - cnts.ramo.isna().sum()) / cnts.shape[0])

Ramo:  0.8516084760893801


In [308]:
print("Entidad federativa: ", (cnts.shape[0] - cnts.b_entidad_federativa.isna().sum()) / cnts.shape[0])

Entidad federativa:  0.8516084760893801


Conclusion: about 15% of the contracts still can't be assigned a `ramo` or an `entidad_federativa`.
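The coverage ratio printed throughout this notebook can be wrapped in a small helper. This is a minimal sketch; the name `coverage` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def coverage(df, col):
    """Fraction of rows with a non-missing value in `col`."""
    # Equivalent to (df.shape[0] - df[col].isna().sum()) / df.shape[0]
    return df[col].notna().mean()


# Toy data (hypothetical): two of four rows have a state code
toy = pd.DataFrame({'b_entidad_federativa': ['MX-YUC', None, 'MX-OAX', None]})
print(f"Entidad federativa: {coverage(toy, 'b_entidad_federativa'):.2%}")
```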

In [309]:
cnts = cnts.drop(['n_entidad_federativa', 'n_ramo'], axis=1)

## Parsing the name of the UC

Another method is to parse the name of the UC to look for a state name. For example, if the word 'yucatan' appears in the name of the UC, chances are it belongs to the state of Yucatán.

We don't need to worry about NaN values in the name of the UC.
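As an illustration of the idea, here is a small self-contained sketch. It is a variation, not the notebook's loop: it strips accents with the standard library (a stand-in for `unidecode`) and requires whole-word matches. The function names, the toy dictionary, and the first UC name are hypothetical.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove diacritics, e.g. 'Yucatán' -> 'Yucatan'."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_state(uc_name, state2code):
    """Return the code of the first state whose name appears as a whole word."""
    clean = strip_accents(uc_name).lower()
    for state, code in state2code.items():
        if re.search(rf"\b{re.escape(state)}\b", clean):
            return code
    return None


# Hypothetical excerpt of the state-name dictionary
toy_state2code = {'yucatan': 'MX-YUC', 'oaxaca': 'MX-OAX'}

print(find_state('YUC-Secretaría de Salud de Yucatán #931007999', toy_state2code))
print(find_state('CFE-C.T. JOSE ACEVES POZOS #018TOQ093', toy_state2code))
```

The first call finds 'yucatan' after accent stripping; the second contains no state name and returns `None`.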

In [310]:
def get_state(name, target_dict):
    if name in target_dict:
        return target_dict[name]
    else:
        return None
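`get_state` is a plain dictionary lookup that returns `None` for unknown keys, which is what `dict.get` does. As a sketch with toy data (the frame and the dictionary are hypothetical), the fill-only-the-gaps pattern used in the following cells can also be written with `Series.map` and `fillna`:

```python
import pandas as pd

toy = pd.DataFrame({
    'nombre_de_la_uc': ['UC A', 'UC B', 'UC C'],
    'b_entidad_federativa': ['MX-CMX', None, None]})
lookup = {'UC B': 'MX-GUA'}  # hypothetical name -> state code dictionary

# Keep existing values; fill the gaps from the dictionary where a key exists.
# Names missing from the dictionary map to NaN and therefore stay missing.
toy['b_entidad_federativa'] = toy['b_entidad_federativa'].fillna(
    toy['nombre_de_la_uc'].map(lookup))

print(toy)
```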

In [311]:
federal_entities = cnts.loc[cnts
                        .b_entidad_federativa
                        .str
                        .contains('MX-')
                        .fillna(False), 'b_entidad_federativa'].unique()

In [312]:
with open('/home/rdora/declaranet/data/pickle/entidades_federativas.p', 'rb') as f:
    state2code = pickle.load(f)

In [313]:
state_names = state2code.keys()

In [314]:
missing_names = cnts.loc[cnts.b_entidad_federativa.isna(), 'nombre_de_la_uc'].unique()

print("Number of missing names before parsing:", len(missing_names))

Number of missing names before parsing: 384


Now let's parse for state names!

In [315]:
ucname2state = {}
for mname in missing_names:
    for state in state_names:
        re_search = re.search(state, mname, flags=re.I)
        if re_search:
            ucname2state[mname] = state2code[re_search.group().lower()]

In [316]:
print("Number of found missing names:", len(ucname2state))

Number of found missing names: 132


In [317]:
cnts.loc[cnts.b_entidad_federativa.isna(), 'b_entidad_federativa'] = (
    cnts.loc[
        cnts.b_entidad_federativa.isna(),
        'nombre_de_la_uc'].apply(get_state, args=(ucname2state,)))

In [318]:
missing_names = cnts.loc[cnts.b_entidad_federativa.isna(), 'nombre_de_la_uc'].unique()

print("Number of missing names After parsing:", len(missing_names))

Number of missing names After parsing: 252


Now let's apply the same procedure but with the manually curated name of UC to state codes.

In [319]:
cnts.loc[cnts.b_entidad_federativa.isna(), 'b_entidad_federativa'] = (
    cnts.loc[
        cnts.b_entidad_federativa.isna(),
        'nombre_de_la_uc'].apply(get_state, args=(state_dict,)))

In [320]:
missing_names = cnts.loc[cnts.b_entidad_federativa.isna(), 'nombre_de_la_uc'].unique()

print("Number of missing names After parsing:", len(missing_names))

Number of missing names After parsing: 124


In [321]:
print("Entidad federativa: ", (cnts.shape[0] - cnts.b_entidad_federativa.isna().sum()) / cnts.shape[0])

Entidad federativa:  0.9976410275770483


In conclusion, we can now match 99.76% of the contracts to the federal entity of their UC!

## Matching to the federal entity in the file of the contract

This will be our last method to get the state code of the UC of the contracts.

In [322]:
EXP = '/home/rdora/declaranet/data/pre-process/expedientes.csv'
files = pd.read_csv(EXP)

In [323]:
files['entidad_federativa'] = files['entidad_federativa'].apply(get_state, args=(state2code,))

In [324]:
missing_names = cnts.loc[cnts.b_entidad_federativa.isna(), 'nombre_de_la_uc'].unique()

print("Number of missing names before parsing:", len(missing_names))

Number of missing names before parsing: 124


In [325]:
files = files.dropna()
files = files.drop_duplicates(subset='codigo_expediente')

In [326]:
file2code = dict(zip(files.codigo_expediente, files.entidad_federativa))

In [327]:
cnts.loc[cnts.b_entidad_federativa.isna(), 'b_entidad_federativa'] = (
    cnts.loc[
        cnts.b_entidad_federativa.isna(),
        'codigo_expediente'].apply(get_state, args=(file2code,)))

In [328]:
missing_names = cnts.loc[cnts.b_entidad_federativa.isna(), 'nombre_de_la_uc'].unique()

print("Number of missing names After parsing:", len(missing_names))

Number of missing names After parsing: 102


In [329]:
print("Entidad federativa: ", (cnts.shape[0] - cnts.b_entidad_federativa.isna().sum()) / cnts.shape[0])

Entidad federativa:  0.9983777553730607


The best coverage we could obtain for the UC state codes is 99.84%.

## Get Missing `ramo` values

According to the CompraNet catalog, the first three digits of the UC code (`claveuc`) encode the ramo.

In [330]:
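def get_ramo(name):
    # The first three characters of `claveuc` encode the ramo
    try:
        return int(name[:3])
    except (ValueError, TypeError):
        # Non-numeric prefix, or a missing (NaN) code that cannot be sliced
        return None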

In [331]:
cnts.loc[cnts.ramo.isna(), 'ramo'] = (
    cnts.loc[cnts.ramo.isna(), 'claveuc'].apply(get_ramo))

In [332]:
print("Ramo: ", (cnts.shape[0] - cnts.ramo.isna().sum()) / cnts.shape[0])

Ramo:  0.9994153923366985


We obtained a `ramo` code for 99.9% of the contracts.
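To make the prefix rule concrete, here is a small standalone variant. The name `ramo_from_claveuc` and the second code are hypothetical; the first code is the `claveuc` of the CFE contract shown at the top of the notebook.

```python
def ramo_from_claveuc(claveuc):
    """Return the ramo encoded in the first three characters, or None."""
    prefix = str(claveuc)[:3]
    # Reject prefixes that are not three decimal digits, such as 'X12' or 'nan'
    return int(prefix) if len(prefix) == 3 and prefix.isdecimal() else None


print(ramo_from_claveuc('018TOQ093'))  # → 18
print(ramo_from_claveuc('X12ABC001'))  # → None
```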

## Match `ramo` code to `ramo` name

In [333]:
RAMOS = '/home/rdora/declaranet/data/tables/ramos.csv'
ramos = pd.read_csv(RAMOS)

In [334]:
ramos = ramos.rename(columns={'RAMO': 'ramo', 'DESCRIPCIÓN': 'desc_ramo'})[['ramo', 'desc_ramo']]

In [335]:
cnts = pd.merge(
    cnts,
    ramos,
    on='ramo',
    how='left')

In [336]:
print("Desc Ramo: ", (cnts.shape[0] - cnts.desc_ramo.isna().sum()) / cnts.shape[0])

Desc Ramo:  0.8712963744331582


Only 87% of the contracts have a ramo description.

# Suppliers

Now let us get the state of the supplier company. We'll use two methods:

1. Using the unique identifier `folio_rupc`.
2. Using the name of the company for suppliers without a RUPC folio.

1. Using `folio_rupc`

Let's match by `folio_rupc` and by name as a last resort.

In [337]:
rupc = pd.read_csv('/home/rdora/declaranet/data/tables/RUPC.csv', encoding='latin')

In [338]:
rupc['person'] = 0
rupc.loc[rupc.RFC.isna(), 'person'] = 1

In [339]:
ren = {
    'Folio RUPC': 'folio_rupc',
    'Entidad Federativa': 's_entidad_federativa'}
rupc_code = rupc.rename(columns=ren)[['folio_rupc', 's_entidad_federativa', 'person']]

In [340]:
cnts = pd.merge(
    cnts,
    rupc_code,
    on='folio_rupc',
    how='left')

In [341]:
print("Supplier state: ", (cnts.shape[0] - cnts.s_entidad_federativa.isna().sum()) / cnts.shape[0])

Supplier state:  0.3239670265732775


In [342]:
ren = {
    'Nombre de la empresa': 'proveedor_contratista',
    'Entidad Federativa': 's_entidad_federativa'}
rupc_name = rupc.rename(columns=ren)[['proveedor_contratista', 's_entidad_federativa', 'person']]

In [343]:
rupc_name = rupc_name.dropna()

rupc_name = rupc_name.drop_duplicates(subset='proveedor_contratista')

In [344]:
name2state = dict(zip(rupc_name.proveedor_contratista, rupc_name.s_entidad_federativa))
cnts.loc[cnts.s_entidad_federativa.isna(), 's_entidad_federativa'] = (
    cnts.loc[cnts.s_entidad_federativa.isna(), 'proveedor_contratista'].apply(get_state, args=(name2state,)))

name2person = dict(zip(rupc_name.proveedor_contratista, rupc_name.person))
cnts.loc[cnts.person.isna(), 'person'] = (
    cnts.loc[cnts.person.isna(), 'proveedor_contratista'].apply(get_state, args=(name2person,)))

In [345]:
print("Supplier state: ", (cnts.shape[0] - cnts.s_entidad_federativa.isna().sum()) / cnts.shape[0])

Supplier state:  0.33552316001261373


In [346]:
print("Person: ", (cnts.shape[0] - cnts.person.isna().sum()) / cnts.shape[0])

Person:  0.33555536062464747


So far, only about 33.6% of the contracts have a supplier state.

2. Match by name in a list of suppliers without a RUPC folio.

In [347]:
SINRUPC = '/home/rdora/declaranet/data/pre-process/sin_rupec.csv'
sinrupc = pd.read_csv(SINRUPC)

In [348]:
sinrupc['person'] = 0
sinrupc.loc[sinrupc.titularidad_juridica=='4.- Persona Física con Actividad Empresarial (Empresario Individual)',
           'person'] = 1

In [349]:
ren = {'entidad_federativa': 's_entidad_federativa'}
sinrupc = sinrupc.rename(columns=ren).drop(['titularidad_juridica', 'pais_rupec'], axis=1)

In [350]:
sinrupc = sinrupc.dropna()

sinrupc = sinrupc.drop_duplicates()

dups = sinrupc.loc[sinrupc.proveedor_contratista.duplicated(), 'proveedor_contratista'].unique()

In [351]:
print(f"There are {len(dups)} duplicated supplier names")

There are 113 duplicated supplier names


In [352]:
usinrupc = sinrupc[~sinrupc.proveedor_contratista.isin(dups)]

In [353]:
name2state = dict(zip(usinrupc.proveedor_contratista, usinrupc.s_entidad_federativa))
cnts.loc[cnts.s_entidad_federativa.isna(), 's_entidad_federativa'] = (
    cnts.loc[cnts.s_entidad_federativa.isna(), 'proveedor_contratista'].apply(get_state, args=(name2state,)))

name2person = dict(zip(usinrupc.proveedor_contratista, usinrupc.person))
cnts.loc[cnts.person.isna(), 'person'] = (
    cnts.loc[cnts.person.isna(), 'proveedor_contratista'].apply(get_state, args=(name2person,)))

In [354]:
print("Supplier state: ", (cnts.shape[0] - cnts.s_entidad_federativa.isna().sum()) / cnts.shape[0])

Supplier state:  0.7042101745051099


In [355]:
cnts_dups = cnts.loc[cnts.proveedor_contratista.isin(dups), ['proveedor_contratista', 'b_entidad_federativa']]

cnts_dups = cnts_dups.groupby(['proveedor_contratista']).agg(lambda x:x.value_counts().index[0]).reset_index()

dup2state = dict(zip(cnts_dups.proveedor_contratista, cnts_dups.b_entidad_federativa))

set_cnts = set(dup2state.items())

In [356]:
sindups = sinrupc[sinrupc.proveedor_contratista.isin(dups)]

set_sindups = set(zip(sinrupc.proveedor_contratista, sinrupc.s_entidad_federativa))

In [357]:
sinrupc_dups = pd.DataFrame(set_cnts & set_sindups, columns=['proveedor_contratista', 's_entidad_federativa'])

sinrupc_dups = pd.merge(
    sinrupc_dups,
    sinrupc.drop_duplicates(subset='proveedor_contratista').drop('s_entidad_federativa', axis=1),
    how='left',
    on='proveedor_contratista')

In [358]:
name2state = dict(zip(sinrupc_dups.proveedor_contratista, sinrupc_dups.s_entidad_federativa))
cnts.loc[cnts.s_entidad_federativa.isna(), 's_entidad_federativa'] = (
    cnts.loc[cnts.s_entidad_federativa.isna(), 'proveedor_contratista'].apply(get_state, args=(name2state,)))

name2person = dict(zip(sinrupc_dups.proveedor_contratista, sinrupc_dups.person))
cnts.loc[cnts.person.isna(), 'person'] = (
    cnts.loc[cnts.person.isna(), 'proveedor_contratista'].apply(get_state, args=(name2person,)))

In [359]:
print("Supplier state: ", (cnts.shape[0] - cnts.s_entidad_federativa.isna().sum()) / cnts.shape[0])

Supplier state:  0.7045815919094297


For the private suppliers, we could only get the state for about 70.5% of the contracts.
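The handling of duplicated supplier names above can be summarised as follows: for each ambiguous name, take the state where its contracts' buyers are most often located, and accept it only if that (name, state) pair also appears among the candidate rows of the list without RUPC. A minimal pure-Python sketch of that idea, with hypothetical names and data:

```python
from collections import Counter

# (supplier name, buyer state) for each contract of an ambiguous supplier
contracts = [
    ('ACME SA DE CV', 'MX-CMX'),
    ('ACME SA DE CV', 'MX-CMX'),
    ('ACME SA DE CV', 'MX-GUA'),
    ('BETA SA DE CV', 'MX-OAX'),
]
# (supplier name, supplier state) candidates for the same names
candidates = {
    ('ACME SA DE CV', 'MX-CMX'),
    ('ACME SA DE CV', 'MX-YUC'),
    ('BETA SA DE CV', 'MX-YUC'),
}

# Most frequent buyer state per supplier name
buyer_states = {}
for name, state in contracts:
    buyer_states.setdefault(name, Counter())[state] += 1
most_common = {(name, counts.most_common(1)[0][0])
               for name, counts in buyer_states.items()}

# Keep only the pairs confirmed by both sources
resolved = dict(most_common & candidates)
print(resolved)  # → {'ACME SA DE CV': 'MX-CMX'}
```

Names whose most frequent buyer state is not among their candidates (BETA in the toy data) stay unresolved, which is why this step adds so little coverage.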

## `Ramo` for non federal UCs

In [360]:
def get_dependencia(name):
    
    name = name.split('-', 1)[1].strip().lower()
    name = unidecode(name)
    
    return name
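Note that `get_dependencia` assumes every name contains a hyphen; `split('-', 1)[1]` raises an `IndexError` otherwise. A hedged sketch of a more tolerant variant, using the standard library instead of `unidecode` (the function name and the sample inputs are hypothetical):

```python
import unicodedata


def normalize_dependencia(name):
    """Drop the prefix before the first '-', strip accents, and lowercase."""
    # partition never raises; names without '-' are kept whole
    prefix, sep, rest = name.partition('-')
    text = rest if sep else prefix
    decomposed = unicodedata.normalize('NFKD', text.strip().lower())
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize_dependencia('OAX-Secretaría de Salud'))           # → secretaria de salud
print(normalize_dependencia('Comisión Federal de Electricidad'))  # → comision federal de electricidad
```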

In [361]:
ge = cnts[cnts.gobierno != 'APF'].copy()

In [362]:
ge['name'] = ge['dependencia'].apply(get_dependencia)


In [363]:
ramos_ud = ramos

In [364]:
ramos_ud['desc_ramo'] = ramos_ud['desc_ramo'].apply(lambda x: unidecode(x).lower())

In [365]:
salud = ge.loc[ge['name'].str.contains('salud'), 'name'].value_counts().index
cultura = ge.loc[ge['name'].str.contains(r'\bcultura'), 'name'].value_counts().index
energia = ge.loc[ge['name'].str.contains(r'\benergia\b'), 'name'].value_counts().index
turismo = ge.loc[ge['name'].str.contains(r'\bturismo\b'), 'name'].value_counts().index
economia = ge.loc[ge['name'].str.contains(r'\beconomia\b'), 'name'].value_counts().index
poder_judicial = ge.loc[ge['name'].str.contains(r'\bjudicial\b'), 'name'].value_counts().index
municipal = ge.loc[ge['name'].str.contains(r'\bmunicip'), 'name'].value_counts().index
municipal = list(municipal) + list(ge.loc[ge['name'].str.contains(r'\balcald'), 'name'].value_counts().index)
educacion = ge.loc[ge['name'].str.contains(r'\beducacion'), 'name'].value_counts().index
educacion = list(educacion) + list(ge.loc[ge['name'].str.contains(r'\buniversidad'), 'name'].value_counts().index)
educacion = list(educacion) + list(ge.loc[ge['name'].str.contains(r'\beducativ'), 'name'].value_counts().index)
educacion = list(educacion) + list(ge.loc[ge['name'].str.contains(r'\btecnologico'), 'name'].value_counts().index)
educacion = list(educacion) + list(ge.loc[ge['name'].str.contains(r'\bescuela'), 'name'].value_counts().index)
educacion = list(educacion) + list(ge.loc[ge['name'].str.contains(r'\bcolegio'), 'name'].value_counts().index)
agrario = ge.loc[ge['name'].str.contains(r'\bagr'), 'name'].value_counts().index
agrario = list(agrario) + list(ge.loc[ge['name'].str.contains(r'\brural'), 'name'].value_counts().index)
comunicacion = ge.loc[ge['name'].str.contains(r'\bcomunicacion'), 'name'].value_counts().index
comunicacion = list(comunicacion) + list(ge.loc[ge['name'].str.contains(r'\btransporte'), 'name'].value_counts().index)
comunicacion = list(comunicacion) + list(ge.loc[ge['name'].str.contains(r'\bcamino'), 'name'].value_counts().index)
desarrollo = ge.loc[ge['name'].str.contains(r'\bdesarrollo'), 'name'].value_counts().index
agua = ge.loc[ge['name'].str.contains(r'\bagua\b'), 'name'].value_counts().index
medio_ambiente = list(agua) + list(ge.loc[ge['name'].str.contains(r'\bambiente\b'), 'name'].value_counts().index)
ciencia = ge.loc[ge['name'].str.contains(r'\bciencia\b'), 'name'].value_counts().index
juridica = ge.loc[ge['name'].str.contains(r'\bjuridic'), 'name'].value_counts().index
obras = ge.loc[ge['name'].str.contains(r'\bobra'), 'name'].value_counts().index
obras = list(obras) + list(ge.loc[ge['name'].str.contains(r'\binfraestructura'), 'name'].value_counts().index)
finanza = ge.loc[ge['name'].str.contains(r'\bfinanza'), 'name'].value_counts().index
finanza = list(finanza) + list(ge.loc[ge['name'].str.contains(r'\bhacienda\b'), 'name'].value_counts().index)
gobernacion = ge.loc[ge['name'].str.contains(r'\badministra'), 'name'].value_counts().index

In [366]:
ramos_ge = [
    set(salud),
    set(cultura),
    set(energia),
    set(turismo),
    set(economia),
    set(poder_judicial),
    set(municipal),
    set(educacion),
    set(agrario),
    set(comunicacion),
    set(desarrollo),
    set(medio_ambiente),
    set(ciencia),
    set(juridica),
    set(obras),
    set(finanza),
    set(gobernacion)]

In [367]:
codigo_ge = [
    12,
    48,
    18,
    21,
    10,
    3,
    88,
    11,
    8,
    9,
    15,
    16,
    38,
    3,
    15,
    6,
    4]
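The keyword lists and the codes above are paired by position, which is easy to get out of sync. As a sketch of the same idea, the rules can be kept as (pattern, code) pairs. Only a subset of the patterns is shown, and the names `RULES` and `guess_ramo` as well as the sample inputs are hypothetical:

```python
import re

# Subset of the rules above; later rules override earlier ones,
# mirroring the order in which the notebook loop assigns codes.
RULES = [
    (r'salud', 12),
    (r'\bmunicip|\balcald', 88),
    (r'\beducacion|\buniversidad|\bescuela', 11),
    (r'\bfinanza|\bhacienda\b', 6),
]


def guess_ramo(name):
    """Return the code of the last matching rule, or None."""
    code = None
    for pattern, ramo in RULES:
        if re.search(pattern, name):
            code = ramo
    return code


print(guess_ramo('secretaria de salud'))              # → 12
print(guess_ramo('universidad autonoma de yucatan'))  # → 11
print(guess_ramo('instituto del deporte'))            # → None
```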

In [368]:
for ramo_ge, codigo in zip(ramos_ge, codigo_ge):
    mask = ge['name'].isin(ramo_ge)
    ge.loc[mask, 'ramo'] = codigo
    if codigo == 88:
        # Code 88 is special-cased as 'Municipal' instead of being looked up in `ramos`
        ge.loc[mask, 'desc_ramo'] = 'Municipal'
    else:
        ge.loc[mask, 'desc_ramo'] = ramos.loc[ramos.ramo == codigo, 'desc_ramo'].item()

In [369]:
uge = ge.drop_duplicates(subset='dependencia')[['dependencia', 'ramo', 'desc_ramo']]

In [370]:
dep2ramo = dict(zip(uge.dependencia, uge.ramo))
cnts.loc[cnts.desc_ramo.isna(), 'ramo'] = (
    cnts.loc[cnts.desc_ramo.isna(), 'dependencia'].apply(get_state, args=(dep2ramo,)))

dep2desc = dict(zip(uge.dependencia, uge.desc_ramo))
cnts.loc[cnts.desc_ramo.isna(), 'desc_ramo'] = (
    cnts.loc[cnts.desc_ramo.isna(), 'dependencia'].apply(get_state, args=(dep2desc,)))

In [371]:
print("Desc Ramo: ", (cnts.shape[0] - cnts.desc_ramo.isna().sum()) / cnts.shape[0])

Desc Ramo:  0.9728893053994875


We reached 97.3% coverage of `ramo` descriptions!

# Misc

In [372]:
good_s = cnts.loc[cnts.s_entidad_federativa.notna(), 'proveedor_contratista'].unique()

good_s = set(good_s)

bad_s = cnts.loc[cnts.s_entidad_federativa.isna(), 'proveedor_contratista'].unique()

bad_s = set(bad_s)

both = bad_s & good_s

In [373]:
cnts.loc[cnts.proveedor_contratista=='AFIANZADORA SOFIMEX, S.A.', 's_entidad_federativa'] = 'MX-CMX'

cnts.loc[cnts.proveedor_contratista=='CONSTRUCTORA Y PAVIMENTADORA VISE S.A. DE C.V.', 's_entidad_federativa'] = (
    'MX-GUA')

cnts.loc[cnts.proveedor_contratista=='OPERADORA CENTRAL DE ESTACIONAMIENTOS SA DE CV', 's_entidad_federativa'] = (
    'MX-CMX')

In [374]:
cnts.loc[cnts.proveedor_contratista=='AFIANZADORA SOFIMEX, S.A.', 'person'] = 0

cnts.loc[cnts.proveedor_contratista=='CONSTRUCTORA Y PAVIMENTADORA VISE S.A. DE C.V.', 'person'] = 0

cnts.loc[cnts.proveedor_contratista=='OPERADORA CENTRAL DE ESTACIONAMIENTOS SA DE CV', 'person'] = 0

## Get companies

In [375]:
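def is_company(name):
    # NOTE: despite its name, this returns 1 when `name` looks like a
    # physical person and 0 when it looks like a company, so the result
    # is directly comparable with the `person` column.
    name = unidecode(name.upper())
    if len(name.split()) == 1:
        # A single-word name is treated as a company
        return 0
    for stop_word in stop_words:
        if re.search(stop_word, name):
            return 0
    return 1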

In [376]:
stop_words = [
    r"\bS\.? *A\.?\b",
    r"\bR\.? *L\.?\b",
    r"\bC\.? *V\.?\b",
    r"\bS\.? *C\.?\b",
    r"\bS\.? *C\.? *L\.?",
    r"S\.?A\.? *DE\.? *C\.?V\.?",
    r"S\.? *DE\.? *R\.?L\.? *DE *C\.?V\.?",
    r"S\.? *DE\.? *R\.?L\.?",
    r"S\.?A\.?P\.?I\.? *DE\.? *C\.?V\.?",
    r"\bS\.?A\.?P\.?I\.?\b",
    r"S\.A\.S\.?",
    r"S\.?A\.?B\.? *DE\.? *C\.?V\.?",
    r"\bL\.? *T\.? *D\.?\b",
    r"\bC\.?O\.?\b",
    r"\bINC\b",
    r"\bCOMPANY\b",
    r"\bNATIONAL\b",
    r"\bGROUP\b",
    r"\bGRUPO\b",
    r"\bUNION\b",
    r"\bCORP\b",
    r"\bCORPORATIVO\b",
    r"\bA\.? *C\.?\b",
    r"\bNACIONAL\b",
    r"\bMEXICO\b",
    r"\bASOCIACION\b",
    r"\bU\.? *S\.?\b",
    r"\bU\.? *S\.? *A\.?\b",
    r"\bA\.? *G\.?\b",
    r"&",
    r"EXPRESS\b",
    r"\bCONSTITUCIONAL\b",
    r"\bORGANIZATION\b",
    r"\bORG\.?\b",
    r"\(.*\)",
    r'".*"',
    r"\bCOMERCIALIZACION\b",
    r"\bINTERNATIONAL\b",
    r"\bINT\.?\b",
    r"\bGOBIERNO\b",
    r"\bFEDERAL\b",
    r"\bAYUNTAMIENTO\b",
    r"\bSERVICES\b",
    r"\bSECRETARIA\b",
    r"\bCIVIL\b",
    r"\bPUBLIC\b",
    r"\bLIMITED\b",
    r"\bROBOTICS\b",
    r"\bINSTITUTO\b",
    r"\bSERVICIO",
    r"\bUNIVERSAL\b",
    r"\bFABRICA\b",
    r"\bMARCA\b",
    r"\bMARCAS\b",
    r"\bINDUSTRIA",
    r"\bSOCIEDAD\b",
    r"\bIMPORTA",
    r"\bCOMERCIALIZADORA\b",
    r"\bY\b",
    r"\bCONSTRUC",
    r"\bPRODUCTO",
    r"\bCONSORCIO",
    r"\bMEDICO",
    r"\bQUIMICA",
    r"\bMEDICION",
    r"\bSISTEMA",
    r"\bSEGUROS\b",
    r"\bG\.? *P\.?",
    r"\bMEDICAL\b",
    r"\bCIA\b",
    r"\bFORMAS\b",
    r"\bPRODUCT",
    r"\bMEXICAN",
    r"\bMEDIC",
    r"\bINTERNACIONAL",
    r"\bDISTRIBUIDOR",
    r"\bINSTRUMENT",
    r"\bELECTRO\b",
    r"\bCIENTIFIC",
    r"\bSCIEN",
    r"\bCIENCIA",
    r"\bCORPOR",
    r"\bCORP\b",
    r","]
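To show how these patterns behave, here is a reduced, self-contained sketch that joins a handful of them into one precompiled regex. The names `COMPANY_MARKERS` and `looks_like_person` are hypothetical, and 'JUAN PEREZ LOPEZ' is a made-up supplier name; the other two names appear elsewhere in this notebook.

```python
import re

# A few of the patterns above, joined into a single precompiled regex
COMPANY_MARKERS = re.compile('|'.join([
    r"\bS\.? *A\.?\b",
    r"\bC\.? *V\.?\b",
    r"\bGRUPO\b",
    r",",
]))


def looks_like_person(name):
    """Return 1 if a multi-word name has no company marker, else 0."""
    name = name.upper()
    if len(name.split()) == 1:
        # Single-word names are treated as companies
        return 0
    return 0 if COMPANY_MARKERS.search(name) else 1


print(looks_like_person('CONSTRUCTORA Y PAVIMENTADORA VISE S.A. DE C.V.'))  # → 0
print(looks_like_person('JUAN PEREZ LOPEZ'))                               # → 1
print(looks_like_person('LICONSA'))                                        # → 0
```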

## TEST

In [377]:
good_cnts = cnts[cnts.person.notna()].copy()

In [378]:
comp2pers = dict(
    zip(good_cnts.proveedor_contratista.unique(),
        map(is_company, good_cnts.proveedor_contratista.unique())))

In [379]:
good_cnts['test'] = good_cnts.proveedor_contratista.apply(lambda x: comp2pers[x])


In [380]:
y_true = good_cnts['person'].values
y_pred = good_cnts['test'].values

print(confusion_matrix(y_true, y_pred))

print("Accuracy:", sklearn.metrics.accuracy_score(y_true, y_pred))

print("F1:", sklearn.metrics.f1_score(y_true, y_pred))

[[970412  29312]
 [  1886 267549]]
Accuracy: 0.9754183675961798
F1: 0.9449086696709849
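The reported scores can be recomputed by hand from the confusion matrix, which also exposes precision and recall. This sketch assumes the scikit-learn convention (rows are true labels, columns are predictions) and that label 1 means physical person:

```python
# Confusion matrix printed above
tn, fp = 970412, 29312
fn, tp = 1886, 267549

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted persons that are labelled persons
recall = tp / (tp + fn)     # share of labelled persons that are recovered
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The heuristic rarely misses a physical person (recall above 99%), but roughly one in ten names it flags as a person is labelled as a company (precision around 90%).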


## Apply filter

In [384]:
bad_s = set(cnts.loc[cnts.person.isna(), 'proveedor_contratista'].unique())

In [385]:
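# Skip missing (NaN) supplier names, which `is_company` cannot process
names = [name for name in bad_s if isinstance(name, str)]
comp2pers = dict(zip(names, map(is_company, names)))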

In [386]:
cnts.person.isna().sum()

532049

In [387]:
cnts.loc[cnts.person.isna(), 'person'] = (
    cnts.loc[cnts.person.isna(), 'proveedor_contratista'].apply(get_state, args=(comp2pers,)))

In [388]:
cnts.person.isna().sum()

2579

In [391]:
cnts.person.value_counts(normalize=True)

0.0    0.76102
1.0    0.23898
Name: person, dtype: float64

# Drop some columns and save

Some columns will not be important for our analysis.

In [393]:
def get_buyer(name):
    
    name = name.split('#', 1)[0].strip()
    
    return name
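For reference, a UC name has the shape `BUYER NAME #CODE`. A small sketch that keeps both parts (the function name `split_uc_name` is hypothetical; the input is the UC of the first contract row):

```python
def split_uc_name(nombre_de_la_uc):
    """Split 'BUYER NAME #CODE' into (buyer, code); code is '' if there is no '#'."""
    buyer, _, code = nombre_de_la_uc.partition('#')
    return buyer.strip(), code.strip()


print(split_uc_name('CFE-C.T. JOSE ACEVES POZOS #018TOQ093'))
# → ('CFE-C.T. JOSE ACEVES POZOS', '018TOQ093')
```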

In [396]:
drop_cols = [
    'folio_rupc',
    'dependencia',
    'claveuc']

In [394]:
cnts['buyer'] = cnts.nombre_de_la_uc.apply(get_buyer)

In [395]:
cnts['buyer'].head()

0                           CFE-C.T. JOSE ACEVES POZOS
1                     CFE-DIVISION VALLE DE MEXICO SUR
2    OAX-Secretaría de las Culturas y Artes de Oaxa...
3    OAX-Secretaría de las Culturas y Artes de Oaxa...
4    LICONSA-Subdirección de Adquisición y Distribu...
Name: buyer, dtype: object

In [397]:
cnts = cnts.drop(drop_cols, axis=1)

In [399]:
CNTS = '/home/rdora/declaranet/data/pre-process/contratos.csv'
cnts.to_csv(CNTS, index=False)