# 1. Fuzzy matching for Mexican new IDs with multiple old IDs and multimodal distribution in websites for each new ID

In [1]:
# Import relevant libraries
import pandas as pd 
from thefuzz import fuzz
from thefuzz import process
import numpy as np
import json
import unidecode

## 1.1. Read Builtwith Websites and Panjiva Names Data - Multimodal Distribution
Read data of new IDs with multiple old IDs that have a multimodal distribution of websites for each new ID

In [2]:
# Read data of builtwith website
mult_builtwith_ind = pd.read_csv("../../Data/Mexico/processed_data/new_ids_cases_for_fuzzy_builtwith_MEX.csv")
mult_builtwith_ind.head(30)

Unnamed: 0,old_id,new_id,panjiva_raw_firm_name,builtwith_website,different_old_ids,n_url,n_total_old_ids,mode_url_count,share_url_mode,obs_has_mode,is_multimodal,unimodal,only_one_url_retrieved
0,MEX167255,MEX11792,PROMAQUIN SA DE CV,promaquin.com,True,1,2,1,,1,1,0,0
1,MEX11792,MEX11792,PROMAQUINASA DE CV,promsacorp.com,True,1,2,1,,1,1,0,0
2,MEX123788,MEX123788,"DIICSA INFRAESTRUCTURA, SA DE CV",diicsacv.com,True,1,2,1,,1,1,0,0
3,MEX97820,MEX123788,"DICA INFRAESTRUCTURA,S.A. DE C.V.",grupodica.com,True,1,2,1,,1,1,0,0
4,MEX123969,MEX123969,LLANTAS Y SERVICIOS DEGA DE SALTILLO SA DE CV,degasaltillo.com,True,1,2,1,,1,1,0,0
5,MEX119440,MEX123969,LLANTAS Y SERVICIOS DEGA SA DE CV,llantasdega.com,True,1,2,1,,1,1,0,0
6,MEX125129,MEX125129,AMERICA LOGISTICS SA DE CV,americalogistics.com,True,1,2,1,,1,1,0,0
7,MEX99792,MEX125129,"AMERICA LOGISTIC GROUP, S.A. DE C.V.",americalogisticsgroup.com,True,1,2,1,,1,1,0,0
8,MEX132615,MEX132615,"HERRAMIENTAS DAPERMEX, S.A. DE C.V.",dapermex.com,True,1,2,1,,1,1,0,0
9,MEX60414,MEX132615,"HERRAMIENTAS DIAMEX INDUSTRIA, S.A. DE C.V.",diager.com,True,1,2,1,,1,1,0,0


In [3]:
mult_builtwith_ind[mult_builtwith_ind['new_id'] == "MEX11792"][["new_id", "old_id", "panjiva_raw_firm_name", "builtwith_website"]]

Unnamed: 0,new_id,old_id,panjiva_raw_firm_name,builtwith_website
0,MEX11792,MEX167255,PROMAQUIN SA DE CV,promaquin.com
1,MEX11792,MEX11792,PROMAQUINASA DE CV,promsacorp.com


## 1.2. Preprocessing

For the preprocessing step, we use the following procedure:

1. **Normalization and Cleaning**:
   - **Lowercase Conversion and Accent Removal**: All text data, including company names and website URLs, are converted to lowercase to standardize the format and ensure uniformity in processing. Additionally, accents are removed using `unidecode`. These accents are often found in company names but rarely in website URLs.
   - **Removal of Common Business Suffixes**: Business-related suffixes such as "SA DE CV", "S.A. DE C.V.", and others are removed from company names. This step is crucial for aligning company names more closely with their website, which typically does not include such formal identifiers.

2. **Character Filtration**:
   - **Non-Alphanumeric Character Removal**: All non-alphanumeric characters, including spaces, commas, periods, and hyphens, are stripped from both names and URLs. This reduces noise and prevents mismatches that can occur due to irrelevant characters or spacing issues in the data.

3. **Extraction and Utilization of Initials**:

    - **Initials Extraction**: This step takes the first letter of each significant word in the company's name to form a set of initials. Common articles, conjunctions, and prepositions ('la', 'el', 'de', 'los', 'y', 'of', 'and', etc.) and stand-alone punctuation tokens such as '&' are treated as insignificant and excluded. For instance:
      - For `MERCK SHARP & DOHME COMERCIALIZADORAS DE RL DE CV`, the significant words are `MERCK`, `SHARP`, `DOHME`, and `COMERCIALIZADORAS`.
      - The initials extracted from these words are `MSDC`, which represents a concise representation of the company's name.
    - **Usage in Comparisons**: These initials, "MSDC", can be directly compared to website names. In this case, the website "msd.com" shares the initials "MSD", which is a partial match to the extracted initials from the company name. This commonality in initials suggests a strong linkage between the company name and the website, enhancing the accuracy of matching in scenarios where such abbreviations or acronyms are used in domain names. 

    This method of using initials helps in effectively matching company names with their corresponding websites, especially when websites utilize acronyms or initials as their domain names.

4. **Website URL Simplification**:
   - **Domain Simplification**: Common web prefixes (like "www.") and suffixes (such as ".com", ".org", ".net") are removed from URLs to focus on the core part of the domain, which is essential for direct comparisons with company names.
   - **Conditional Geographic Indicator Removal**: The country indicator ".mx" is handled conditionally, based on the content of the company name. If the cleaned company name contains "mexico" or "mex", ".mx" in the URL is rewritten to ".mex" and kept, so that it survives the later character filtering as "mex" and can match the name. Otherwise, ".mx" and ".mex" are removed to prevent potential false associations and simplify the domain further.

5. **Application of Conditional Logic**:
   - **Geographic Relevance**: The logic implemented ensures that geographic relevance is maintained by preserving country-specific domain extensions like ".mx" or ".mex" when the company name suggests a Mexican association. This approach helps in accurately identifying and matching companies with their respective web domains, especially in localized contexts.

By implementing these preprocessing steps, the goal is to standardize and optimize the input data for more effective matching, especially when using fuzzy logic or other matching algorithms. This preparation is key to minimizing discrepancies and enhancing the accuracy of identifying matches between company names and their corresponding websites.

In [4]:
def extract_initials(name):
    # Extract initials from significant words
    words = [word for word in name.split() if word not in {'de', 'la', 'las', 'los', 'y', 'e', 'el', 'del', 'of', 'and', ".", ",", "&"}]
    return ''.join([word[0] for word in words])

def preprocess(name, website):
    # Normalize: lowercase and remove accents
    name = unidecode.unidecode(name.lower())
    website = unidecode.unidecode(website.lower())
    
    # Remove common suffixes and other non-alphanumeric characters from name and website
    suffixes = ["sa de cv", "s a de c v", "s de rl de cv", "sa cv", "s de rl", " sa ", "s. a. de c. v.",  
                "s.a.", "s.a. de c.v.", "inc", "corp", "llc", "r.l", "c.v", "s. de r.l.", " de ", " s.a. ", " s. ", 
               " de c.v.", "s.a. de c.v.", "sa de c v"]
    for suffix in suffixes:
        if suffix in [" de "]:
            name = name.replace(suffix, ' ')
        else:
            name = name.replace(suffix, '') # Do not leave space

    
    # Initials are extracted from the cleaned name
    initials = extract_initials(name)
    name = ''.join(e for e in name if e.isalnum())  # Remove all non-alphanumeric characters from name
    
    # Determine if 'mexico' or similar words are in the cleaned company name to decide how to clean the website URL
    if any(x in name for x in ['mexico', 'mex']):
        website = website.replace('.mx', '.mex')
        website = website.replace('www.', '').replace('.com', '').replace('.org', '').replace('.net', '').replace('.gob', '')
    else:
        website = website.replace('www.', '').replace('.mx', '').replace('.mex', '').replace('.com', '').replace('.org', '').replace('.net', '').replace('.gob', '')

    # Remove all remaining non-alphanumeric characters from website
    website = ''.join(e for e in website if e.isalnum())
    return name, website, initials
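
To see the conditional `.mx` handling in isolation, the following stand-alone sketch copies only the website branch of `preprocess`. The helper name `clean_website` and its boolean argument are illustrative assumptions, not part of the notebook, and accent removal is omitted because the example domains are plain ASCII.

```python
def clean_website(website: str, name_mentions_mexico: bool) -> str:
    """Illustrative copy of the website branch of `preprocess` (hypothetical helper)."""
    website = website.lower()
    if name_mentions_mexico:
        # Keep the country marker: ".mx" becomes ".mex" and survives as "mex"
        website = website.replace(".mx", ".mex")
        parts_to_drop = ["www.", ".com", ".org", ".net", ".gob"]
    else:
        # No Mexican reference in the name: drop the country marker as well
        parts_to_drop = ["www.", ".mx", ".mex", ".com", ".org", ".net", ".gob"]
    for part in parts_to_drop:
        website = website.replace(part, "")
    # Remove any remaining non-alphanumeric characters
    return "".join(ch for ch in website if ch.isalnum())


print(clean_website("asfg.mx", name_mentions_mexico=False))            # asfg
print(clean_website("asfg.mx", name_mentions_mexico=True))             # asfgmex
print(clean_website("fluidcontrolmx.com", name_mentions_mexico=True))  # fluidcontrolmx
```

Note that "fluidcontrolmx.com" is unchanged by the `.mx` rule because the "mx" there is part of the domain label, not a ".mx" extension.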


## 1.3. Scoring algorithm for string similarity between Builtwith websites and Panjiva names when there is a multimodal distribution of websites by new ID


#### Levenshtein Distance and `thefuzz` (formerly FuzzyWuzzy)

The `thefuzz` Python package (the successor of `fuzzywuzzy`), which is used in this code, builds on edit-distance measures in the Levenshtein family to quantify differences between text strings. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another. The package provides several ways to compare strings, which we use to measure the similarity between company names and candidate websites.


#### Normalization of Similarity Scores

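Similarity scores are normalized to a scale from 0 to 100. For the basic `ratio`, the scores reported in this notebook are consistent with the normalization

$$\text{Similarity Score} = 100 \times \frac{2M}{T}$$

where $M$ is the number of matching characters (the length of the longest common subsequence) and $T$ is the combined length of the two strings. This equals one minus an edit distance that counts only insertions and deletions, divided by $T$. For example, "promaquina" (10 characters) and "promaquin" (9 characters) share $M = 9$ characters, so the score is $100 \times 18/19 \approx 95$, the value shown in the results table below.

A similarity score of 100 indicates an exact match between strings, whereas a score closer to 0 indicates greater dissimilarity.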

#### Key Scores Used
1. **Ratio**: This score calculates the standard Levenshtein distance similarity between two strings on a scale from 0 to 100. It’s useful for direct, full-string comparisons.
2. **Partial Ratio**: Compares partial segments of the string, focusing on the best matching substring. This is particularly useful when matching initials or abbreviations to longer strings.
3. **Token Sort Ratio**: Compares two strings by sorting their tokens alphabetically and then joining them back into a single string, thus removing the impact of word order.
4. **Token Set Ratio**: Compares strings based on shared words, regardless of order or additional words. It splits strings into tokens, finds common words, combines them with the best matching unique tokens, then calculates similarity. This method is ideal for matching variations of names that have the same core words but may include extras or be in a different order.
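
As a cross-check on how these scores behave on our data, the sketch below re-implements a ratio of the form $100 \times 2M/T$ in plain Python, with $M$ taken as the length of the longest common subsequence. This is an illustrative approximation rather than the library's own code, and the function names are assumptions. It reproduces several `score_ratio` values from the results table in this section. It also shows why `score_token_sort_ratio` coincides with `score_ratio` there: after preprocessing, every string is a single token, so sorting tokens changes nothing.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ratio_sketch(a: str, b: str) -> int:
    """Illustrative 0-100 similarity: 100 * 2M / T, with M the LCS length."""
    total = len(a) + len(b)
    if total == 0:
        return 100
    return round(100 * 2 * lcs_length(a, b) / total)


def token_sort_ratio_sketch(a: str, b: str) -> int:
    """Sort whitespace-separated tokens, re-join them, then apply ratio_sketch."""
    return ratio_sketch(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


# Pairs of (cleaned name, cleaned website) taken from the results table
print(ratio_sketch("promaquina", "promaquin"))            # 95
print(ratio_sketch("promaquin", "promsacorp"))            # 53
print(ratio_sketch("diicsainfraestructura", "diicsacv"))  # 48

# Word order only matters when the strings still contain spaces
print(token_sort_ratio_sketch("control fluid", "fluid control"))  # 100
print(token_sort_ratio_sketch("promaquin", "promsacorp"))         # 53, same as ratio_sketch
```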


#### Process for Identifying the Best Website Match

1. **Preprocessing**:
   - Convert each company name and website URL to lowercase.
   - Remove accents from characters.
   - Strip common business suffixes.
   - Remove non-alphanumeric characters.

2. **Scoring**:
   - Compare the preprocessed website against the preprocessed company name and the initials of the preprocessed company name.
   - Compute four distinct similarity scores (ratio, partial ratio, token sort ratio, token set ratio) between the cleaned company name and the cleaned website.
   - Calculate the **Average score** as a weighted average of the four scores above, with weights 0.30 (ratio), 0.40 (partial ratio), 0.15 (token sort ratio), and 0.15 (token set ratio).
   - Calculate the **Initials score** by comparing the cleaned website with the initials of the preprocessed company name using only the partial ratio score.
   - Decide which score to use (Average score or Initials score) based on the following conditions:
     * The Initials score must be at least `th_init` (set to 80 below).
     * The Initials score must exceed the Average score.
     * The string length of the initials must be more than 1 character.
     * The cleaned website must be at most 7 characters long and at most 3 characters longer than the initials.
     
    If any condition is not met, the Average score is used; otherwise, the Initials score is used.

3. **Best Website Determination**:
   - The website with the highest applicable score (Initials score if all conditions are met, otherwise Average score) is identified as the best match for the given old id.
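
The scoring decision can be sketched as two small pure functions that mirror the weights and gates used in the scoring cell of this section. The function names are hypothetical, and the sketch leaves out the running comparison against the best score found so far. The first two calls use values from the results table; the scores in the third call are made-up placeholders for the "msd" example.

```python
def weighted_average(ratio, partial, token_sort, token_set):
    """Weighted average score with the weights used in the scoring cell."""
    return 0.30 * ratio + 0.40 * partial + 0.15 * token_sort + 0.15 * token_set


def use_initials_score(initials_score, avg_score, initials, clean_website, th_init=80):
    """Return True when the initials score should replace the average score."""
    return (
        initials_score >= th_init                       # initials match well enough
        and initials_score > avg_score                  # and better than the full name
        and len(initials) > 1                           # single-letter initials are too weak
        and len(clean_website) <= 7                     # short domain ...
        and len(clean_website) <= len(initials) + 3     # ... not much longer than the initials
    )


# Row "promaquin" vs "promsacorp": ratio 53, partial 67, token sort 53, token set 53
print(round(weighted_average(53, 67, 53, 53), 1))  # 58.6

# Row "dcasam": initials "dc" score 100, but the domain has 6 characters > len("dc") + 3
print(use_initials_score(100, 66.4, "dc", "dcasam"))  # False

# "msd" example with placeholder scores: short domain, so the initials score qualifies
print(use_initials_score(100, 40.0, "msdc", "msd"))  # True
```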


#### Practical Examples

1. **Example 1: Prioritizing Initials for Matching**

    - Raw Company Name: "MERCK SHARP & DOHME COMERCIALIZADORAS DE RL DE CV"
    - Cleaned Company Name: "mercksharpdohmecomercializadora"
    - Extracted Initials: "msdc"
    - Potential Websites: ["msd.com", "bravecto.com"]
    - Cleaned Websites: ["msd", "bravecto"]

    In this example, the comparison process assesses the similarity between the company initials and the potential website domains. For "msd.com," the `partial_ratio` score between the initials "msdc" and the domain "msd" is likely to be higher than the `ratio` score between the cleaned company name and the domain. This is because the initials "msdc" directly correlate with the domain "msd," reflecting a common practice where a company's web domain is an abbreviation of its full name.

    The comparison between the cleaned full company name "mercksharpdohmecomercializadora" and the domain "msd" may result in a lower similarity score (average score) due to the length and complexity of the full name, which could introduce more potential points of difference as measured by the Levenshtein distance.

    Therefore, in cases where a high `partial_ratio` score is achieved with initials, it suggests a deliberate choice of website domain to represent the company's abbreviated name, making the initials to domain comparison the stronger indicator for a match. This approach, which favors initials when they yield a high score, is especially useful when the domain name is likely to be an acronym or initialism of the company name.

2. **Example 2: Full Name Match**

    - Raw Name: "FLUID CONTROL MEXICO S DE RL DE CV"
    - Cleaned Name: "fluidcontrolmexico"
    - Initials: "fcm"
    - Websites: "fluidcontrolmx.com" and "pfccorp.com"
    - Cleaned Websites: "fluidcontrolmx" and "pfccorp"
    
    In this example, the initials "fcm" do not closely match the cleaned websites "fluidcontrolmx" or "pfccorp", resulting in a lower partial_ratio score in both cases. Conversely, a comparison between the cleaned name "fluidcontrolmexico" and the cleaned website "fluidcontrolmx" reveals a higher degree of similarity due to the extended overlap of characters. Thus, in this scenario, the full name comparison with the cleaned website is more likely to yield a higher score than the initials comparison, leading to a preference for the full name match with "fluidcontrolmx.com" over "pfccorp.com".

In [5]:
# Define thresholds 
th_init = 80        # Threshold to define when to use initials over the avg_score
th_max_score = 50   # Threshold that the max score of the best website (per new_id-old_id pair) must exceed for the website to be assigned

In [6]:
# Empty dictionary to store results 
scores_builtwith = {}
# Array with new IDs
new_ids = mult_builtwith_ind["new_id"].unique()

# Iterate over new ids
for new_id in new_ids:
    # Filter dataset to get only observations of old ids associated with that new id 
    df_filtered = mult_builtwith_ind[mult_builtwith_ind["new_id"] == new_id]
    # Create an empty dictionary for the new id
    scores_builtwith[new_id] = {}
    # Create an empty list to store websites with the highest scores
    best_websites = []
    # Iterate over old ids-raw names associated with the new id
    for old_id, raw_name in zip(df_filtered.old_id.to_numpy(), df_filtered.panjiva_raw_firm_name.to_numpy()):
        # Create an empty dictionary for each old id 
        scores_builtwith[new_id][old_id] = {}
        # Assign raw name to the old id dictionary 
        scores_builtwith[new_id][old_id][raw_name] = {}  
        
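        max_score = 0
        best_website = None
        # Reset the flags for each old id so that stale values from a previous iteration are never reused
        initials_score_used = 0
        avg_score_used = 0
        dict_websites = {}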

        
        # Iterate over websites 
        for website in df_filtered.builtwith_website.to_numpy():
            
            # Preprocess raw name and website and get initials of the cleaned name 
            clean_name, clean_website, initials = preprocess(raw_name, website)
            
            # Does the clean website contain the word "mexico" and the cleaned name does not? If yes, add "mexico" to clean name. 
            # This will give a high score to websites containing the mexico part. 
            if "mexico" in clean_website and "mexico" not in clean_name:
                clean_name = clean_name + "mexico"

            # Add cleaned website to dictionary of websites where the key is the non-cleaned website
            dict_websites[website] = clean_website
            
            
            # Scores comparing cleaned names and cleaned websites 
            score_ratio = fuzz.ratio(clean_name, clean_website)
            score_partial_ratio = fuzz.partial_ratio(clean_name, clean_website)
            score_token_sort_ratio = fuzz.token_sort_ratio(clean_name, clean_website)
            score_token_set_ratio = fuzz.token_set_ratio(clean_name, clean_website)
            
            # Average score 
            avg_score = (0.30*score_ratio + 0.40*score_partial_ratio + 0.15*score_token_sort_ratio + 0.15*score_token_set_ratio)
            
            # Scores comparing initials vs websites (only use partial_ratio as it specifically
            # helps to focus the match on any substring within the website domain that aligns
            # best with the initials)
            initials_score = fuzz.partial_ratio(initials, clean_website)
            
            # Compute highest score.
            # If the cleaned website is much longer than the initials, the domain is probably not
            # initials-based, so its length must be at most len(initials) + 3 (and at most 7).
            if (initials_score >= th_init) and (initials_score > avg_score) and len(clean_website) <= 7 and (len(clean_website) <= (len(initials) + 3)) and initials_score > max_score and len(initials) > 1: 
                max_score = initials_score
                best_website = website 
                initials_score_used = 1
                avg_score_used = 0 
                
            else:
                if avg_score > max_score: 
                    max_score = avg_score
                    best_website = website
                    initials_score_used = 0 
                    avg_score_used = 1
            
 
            scores_builtwith[new_id][old_id][raw_name][website] = {
                "cleaned_name": clean_name,
                "cleaned_website": clean_website,
                "initials_name": initials,
                "score_ratio": score_ratio,
                "score_partial_ratio": score_partial_ratio,
                "score_token_sort_ratio": score_token_sort_ratio,
                "score_token_set_ratio": score_token_set_ratio,
                "avg_score": avg_score, 
                "score_initials_partial_ratio": initials_score, 
                "score_initials_used": initials_score_used, 
                "avg_score_used": avg_score_used

            } 
            
        
        # Check if all cleaned websites are equal; if yes, give priority to the website containing ".mex" or ".mx"
        if len(set(dict_websites.values())) == 1:
            for w in dict_websites.keys():
                if any(substring in w for substring in [".mex", ".mx"]):
                    best_website = w # Give priority to the website with ".mex" or ".mx"
        
                
        # Store the best website and its score for the current old_id
        scores_builtwith[new_id][old_id]['best_website'] = {
            "website": best_website,
            "max_score": max_score, 
            "is_max_initials_score": initials_score_used, 
            "is_max_avg_score": avg_score_used

        }
        
        # Add the best website to the list 
        best_websites.append(best_website)
        
    if len(set(best_websites)) == 1:
        scores_builtwith[new_id]["share_best_website"] = 1     
    else:
        scores_builtwith[new_id]["share_best_website"] = 0     

In [7]:
#print(json.dumps(scores_builtwith, indent=4, sort_keys=True))

In [8]:
# Create an empty list to store data for DataFrame
data_list = []

# Iterate through the dictionary to extract required information
for new_id, old_ids in scores_builtwith.items():
    for old_id, contents in old_ids.items():
        if old_id == "share_best_website":
            continue
        for panjiva_raw_name, websites in contents.items():
            if panjiva_raw_name == "best_website":
                continue
            for website, info in websites.items():
                data_list.append({
                    "new_id": new_id,
                    "old_id": old_id,
                    "panjiva_raw_name": panjiva_raw_name,
                    "cleaned_name": info["cleaned_name"],
                    "initials_name": info["initials_name"],
                    "website": website,
                    "cleaned_website": info["cleaned_website"],
                    "avg_score": info["avg_score"],
                    "score_partial_ratio": info["score_partial_ratio"],
                    "score_ratio": info["score_ratio"],
                    "score_token_sort_ratio": info["score_token_sort_ratio"],
                    "score_token_set_ratio": info["score_token_set_ratio"],
                    "score_initials_partial_ratio": info["score_initials_partial_ratio"],
                    "score_initials_used": info["score_initials_used"],
                    "avg_score_used": info["avg_score_used"],
                    "website_with_highest_score": contents["best_website"]["website"],
                    "max_score": contents["best_website"]["max_score"],
                    "old_ids_share_best_website": old_ids["share_best_website"], 
                    "is_max_score_initials_score": contents["best_website"]["is_max_initials_score"], 
                    "is_max_score_avg_score": contents["best_website"]["is_max_avg_score"]

                })

                
# Create a DataFrame from the list of dictionaries
df_builtwith_multimodal = pd.DataFrame(data_list)


# Columns to put at the beginning 
front_columns = ['new_id', 'old_id', 'panjiva_raw_name', 'website', 'website_with_highest_score', 
                 'cleaned_name', 'initials_name',  
                 'cleaned_website',  'old_ids_share_best_website', 'max_score',
                 "is_max_score_avg_score", "is_max_score_initials_score"]

# Remaining columns
remaining_columns = [col for col in df_builtwith_multimodal.columns if col not in front_columns]

# New column order
new_order = front_columns + remaining_columns

# Reorder the DataFrame
df_builtwith_multimodal = df_builtwith_multimodal[new_order]

df_builtwith_multimodal

Unnamed: 0,new_id,old_id,panjiva_raw_name,website,website_with_highest_score,cleaned_name,initials_name,cleaned_website,old_ids_share_best_website,max_score,is_max_score_avg_score,is_max_score_initials_score,avg_score,score_partial_ratio,score_ratio,score_token_sort_ratio,score_token_set_ratio,score_initials_partial_ratio,score_initials_used,avg_score_used
0,MEX11792,MEX167255,PROMAQUIN SA DE CV,promaquin.com,promaquin.com,promaquin,p,promaquin,1,100.0,1,0,100.0,100,100,100,100,100,0,1
1,MEX11792,MEX167255,PROMAQUIN SA DE CV,promsacorp.com,promaquin.com,promaquin,p,promsacorp,1,100.0,1,0,58.6,67,53,53,53,100,0,1
2,MEX11792,MEX11792,PROMAQUINASA DE CV,promaquin.com,promaquin.com,promaquina,p,promaquin,1,97.0,1,0,97.0,100,95,95,95,100,0,1
3,MEX11792,MEX11792,PROMAQUINASA DE CV,promsacorp.com,promaquin.com,promaquina,p,promsacorp,1,97.0,1,0,56.8,67,50,50,50,100,0,1
4,MEX123788,MEX123788,"DIICSA INFRAESTRUCTURA, SA DE CV",diicsacv.com,diicsacv.com,diicsainfraestructura,di,diicsacv,1,63.2,1,0,63.2,86,48,48,48,100,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
286,MEX94771,MEX94771,"CESCO IMPORTACIONES,S.A.DE C.V.",intercesco.com,cesco.com,cescoimportacionesde,ci,intercesco,1,64.0,1,0,46.6,67,33,33,33,67,0,1
287,MEX96799,MEX96799,DISTRIBUIDORA CASAM SA DE CV,dcasam.com,dcasam.com,distribuidoracasam,dc,dcasam,0,66.4,1,0,66.4,91,50,50,50,100,0,1
288,MEX96799,MEX96799,DISTRIBUIDORA CASAM SA DE CV,grupocasan.com,dcasam.com,distribuidoracasam,dc,grupocasan,0,66.4,1,0,54.0,60,50,50,50,50,0,1
289,MEX96799,MEX177831,DISTRIBUIDORA CASAN SA DE CV,dcasam.com,grupocasan.com,distribuidoracasan,dc,dcasam,0,62.6,1,0,54.4,73,42,42,42,100,0,1


In [9]:
# Share of new ids sharing the best website
df_builtwith_multimodal.drop_duplicates("new_id").old_ids_share_best_website.value_counts(normalize=True)

1    0.627119
0    0.372881
Name: old_ids_share_best_website, dtype: float64

In [10]:
# Threshold to define if website is correctly assigned
# Are all the max scores above the threshold? 
df_builtwith_multimodal["all_max_scores_above_threshold"] = df_builtwith_multimodal.groupby("new_id")["max_score"].transform(lambda x: (x > th_max_score).all()).astype(int)

# Assign the website to the new id only if all max scores are above the threshold and all old ids share the same best website
df_builtwith_multimodal["assign_website"] = ((df_builtwith_multimodal["all_max_scores_above_threshold"] == 1) & (df_builtwith_multimodal["old_ids_share_best_website"] == 1)).astype(int)
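
The assignment rule can also be checked by hand. The following plain-Python sketch applies the same two conditions to (new ID, old ID, best website, max score) tuples copied from the results table above. The function name `assign_websites` is an illustrative assumption, and the sketch is a re-statement of the rule, not a replacement for the pandas code.

```python
from collections import defaultdict

TH_MAX_SCORE = 50  # same value as th_max_score in the notebook

# (new_id, old_id, best website for that old_id, max_score), copied from the results table
rows = [
    ("MEX11792", "MEX167255", "promaquin.com", 100.0),
    ("MEX11792", "MEX11792", "promaquin.com", 97.0),
    ("MEX123788", "MEX123788", "diicsacv.com", 63.2),
    ("MEX123788", "MEX97820", "diicsacv.com", 49.0),
    ("MEX96799", "MEX96799", "dcasam.com", 66.4),
    ("MEX96799", "MEX177831", "grupocasan.com", 62.6),
]


def assign_websites(rows, threshold=TH_MAX_SCORE):
    """Return {new_id: 1 or 0}, where 1 means the shared best website is assigned."""
    by_new_id = defaultdict(list)
    for new_id, _old_id, best_website, max_score in rows:
        by_new_id[new_id].append((best_website, max_score))

    decisions = {}
    for new_id, items in by_new_id.items():
        all_above = all(score > threshold for _, score in items)  # every old id clears the bar
        share_best = len({site for site, _ in items}) == 1        # all old ids agree on one site
        decisions[new_id] = int(all_above and share_best)
    return decisions


print(assign_websites(rows))
# {'MEX11792': 1, 'MEX123788': 0, 'MEX96799': 0}
```

MEX123788 is rejected because one of its old IDs scores 49.0, below the threshold, and MEX96799 because its two old IDs point to different websites.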

In [11]:
df_builtwith_multimodal.assign_website.value_counts()

0    170
1    121
Name: assign_website, dtype: int64

In [12]:
df_builtwith_multimodal[df_builtwith_multimodal["assign_website"] == 0].iloc[:60]

Unnamed: 0,new_id,old_id,panjiva_raw_name,website,website_with_highest_score,cleaned_name,initials_name,cleaned_website,old_ids_share_best_website,max_score,...,avg_score,score_partial_ratio,score_ratio,score_token_sort_ratio,score_token_set_ratio,score_initials_partial_ratio,score_initials_used,avg_score_used,all_max_scores_above_threshold,assign_website
4,MEX123788,MEX123788,"DIICSA INFRAESTRUCTURA, SA DE CV",diicsacv.com,diicsacv.com,diicsainfraestructura,di,diicsacv,1,63.2,...,63.2,86,48,48,48,100,0,1,0,0
5,MEX123788,MEX123788,"DIICSA INFRAESTRUCTURA, SA DE CV",grupodica.com,diicsacv.com,diicsainfraestructura,di,grupodica,1,63.2,...,37.4,53,27,27,27,100,0,1,0,0
6,MEX123788,MEX97820,"DICA INFRAESTRUCTURA,S.A. DE C.V.",diicsacv.com,diicsacv.com,dicainfraestructura,di,diicsacv,1,49.0,...,49.0,67,37,37,37,100,0,1,0,0
7,MEX123788,MEX97820,"DICA INFRAESTRUCTURA,S.A. DE C.V.",grupodica.com,diicsacv.com,dicainfraestructura,di,grupodica,1,49.0,...,42.2,62,29,29,29,100,0,1,0,0
8,MEX123969,MEX123969,LLANTAS Y SERVICIOS DEGA DE SALTILLO SA DE CV,degasaltillo.com,degasaltillo.com,llantasyserviciosdegasaltillo,lsds,degasaltillo,0,75.4,...,75.4,100,59,59,59,40,0,1,1,0
9,MEX123969,MEX123969,LLANTAS Y SERVICIOS DEGA DE SALTILLO SA DE CV,llantasdega.com,degasaltillo.com,llantasyserviciosdegasaltillo,lsds,llantasdega,0,75.4,...,64.2,78,55,55,55,50,0,1,1,0
10,MEX123969,MEX119440,LLANTAS Y SERVICIOS DEGA SA DE CV,degasaltillo.com,llantasdega.com,llantasyserviciosdega,lsd,degasaltillo,0,72.6,...,34.4,50,24,24,24,50,0,1,1,0
11,MEX123969,MEX119440,LLANTAS Y SERVICIOS DEGA SA DE CV,llantasdega.com,llantasdega.com,llantasyserviciosdega,lsd,llantasdega,0,72.6,...,72.6,78,69,69,69,67,0,1,1,0
12,MEX125129,MEX125129,AMERICA LOGISTICS SA DE CV,americalogistics.com,americalogistics.com,americalogistics,al,americalogistics,0,100.0,...,100.0,100,100,100,100,100,0,1,1,0
13,MEX125129,MEX125129,AMERICA LOGISTICS SA DE CV,americalogisticsgroup.com,americalogistics.com,americalogistics,al,americalogisticsgroup,0,100.0,...,91.6,100,86,86,86,100,0,1,1,0


In [14]:
# Save data 
df_builtwith_multimodal.to_stata( 
    path = "../../Data/Mexico/processed_data/fuzzy_builtwith_websites_scores_MEX.dta",
    variable_labels = {
        "new_id": "New ID",
        "old_id": "Old ID", 
        "panjiva_raw_name": "Panjiva raw name", 
        "cleaned_name": "Cleaned Panjiva raw name", 
        "initials_name": "Initials of cleaned Panjiva name", 
        "website": "Builtwith website", 
        "cleaned_website": "Cleaned Builtwith website", 
        "avg_score": "Weighted avg score (ratio, partial ratio, token sort, token set)", 
        "score_partial_ratio": "Score from partial ratio", 
        "score_ratio": "Score from ratio", 
        "score_token_sort_ratio": "Score from token sort ratio", 
        "score_token_set_ratio": "Score from token set ratio", 
        "sum_scores": "Scores sum (partial, ratio, token sort and token set ratio)",
        "score_initials_partial_ratio": "Partial ratio score using initials", 
        "score_initials_used": "Is initials-based score used? (1=Yes,0=No)", 
        "avg_score_used": "Is the average score used? (1=Yes,0=No)",
        "website_with_highest_score": "Website with highest score for the given old ID", 
        "max_score": "Score of the website with highest score", 
        "old_ids_share_best_website": "Do old IDs share same top website? (1=Yes,0=No)", 
        "is_max_score_initials_score": "Is the max score an initials-based score? (1=Yes,0=No)",
        "is_max_score_avg_score": "Is the max score the average score? (1=Yes,0=No)", 
        "all_max_scores_above_threshold":"For a given new ID, are all the max scores above the threshold?", 
        "assign_website": "Based on text similarity, is best website correctly assigned?(1=Yes,0=No)"
    }, 
    write_index = False
)


# 2. Fuzzy matching for new IDs with multiple old IDs but with only one URL retrieved from Builtwith

In this section, we analyze new IDs that have multiple old IDs but for which the website could be retrieved for only one of these old IDs, i.e., only one old ID was matched to Builtwith.

In [17]:
# Read data of new IDs with multiple old IDs that have only one URL retrieved from Builtwith. This is different from the 
# unimodal case, as these are new IDs that have the flag of multiple old IDs -- because in the correspondence table they 
# do have multiple websites -- but the URL could be retrieved from Builtwith for only one of these old IDs (i.e., only 
# one of these old IDs matched to Builtwith)
only_one_url = pd.read_csv("../../Data/Mexico/processed_data/one_url_retrieved_cases_for_fuzzy_MEX.csv").drop("Unnamed: 0", axis = 1)[["old_id", "new_id", "panjiva_raw_firm_name", "builtwith_website", "only_one_url_retrieved"]]
only_one_url.head(30)

Unnamed: 0,old_id,new_id,panjiva_raw_firm_name,builtwith_website,only_one_url_retrieved
0,MEX101734,MEX101734,FARMOQUIMIA SA DE CV,adyfarm.mx,1
1,MEX148237,MEX103015,GABLACSA DE CV,gablac.com,1
2,MEX102485,MEX10382,GO GLOBAL S DE RL DE CV,gogloballlc.com,1
3,MEX104130,MEX104130,NEUMATICOS MUEVETIERRA DE PUEBLA SA DE CV,neumaticosmexico.com,1
4,MEX216963,MEX104878,DART INTERNATIONAL S. DE R.L. DE C.V.,dart.biz,1
5,MEX106606,MEX106606,HOTEL ROYAL PLAYACAR SA DE CV,barcelo.com,1
6,MEX10775,MEX10775,DAIDO METAL MEXICO SALES SA DE CV,daidometal.com,1
7,MEX107884,MEX107884,THE AMERICAN SCHOOL FOUNDATION OF GUADALAJARA AC,asfg.mx,1
8,MEX109027,MEX109027,"VST TECHNOLOGY MEXICO, S.A. DE C.V.",vstmexico.com,1
9,MEX102041,MEX1122,CATERPILLAR LOGISTICS SERVICES DE MEXICOS DE R...,caterpillar.com,1


## 2.2. Preprocessing 

We follow the same preprocessing steps as in Section 1.2 to clean Panjiva raw names and Builtwith websites.

In [None]:
only_one_url[["cleaned_name", "cleaned_website", "initials_name"]] = only_one_url.apply(lambda x: preprocess(x["panjiva_raw_firm_name"], x["builtwith_website"]), axis=1, result_type = "expand")
only_one_url

## 2.3 Scoring Algorithm 


In [None]:
# Function to compute Scores
def compute_scores(clean_name, clean_website, initials):
    
    # Scores comparing cleaned names and cleaned websites 
    score_ratio = fuzz.ratio(clean_name, clean_website)
    score_partial_ratio = fuzz.partial_ratio(clean_name, clean_website)
    score_token_sort_ratio = fuzz.token_sort_ratio(clean_name, clean_website)
    score_token_set_ratio = fuzz.token_set_ratio(clean_name, clean_website)
            
    # Average score 
    avg_score = (score_ratio + score_partial_ratio + score_token_sort_ratio + score_token_set_ratio) / 4
            
    # Scores comparing initials vs websites (only use partial_ratio as it specifically
    # helps to focus the match on any substring within the website domain that aligns
    # best with the initials)
    initials_score = fuzz.partial_ratio(initials, clean_website)
    
    # Max score: prefer the initials score when it qualifies, then the partial ratio, otherwise the average score.
    # NOTE: th_init and th_partial are read from the notebook's global scope; th_partial is not defined in the
    # cells shown here and must be set before this function is called.
    if (initials_score > th_init) and (initials_score > avg_score) and len(clean_website) <= 6 and len(initials) > 1: 
        max_score = initials_score
        is_max_score_initials_score = 1
    elif (score_partial_ratio > avg_score) and (score_partial_ratio >= th_partial):  # Threshold to know when to use the score_partial_ratio
        max_score = score_partial_ratio
        is_max_score_initials_score = 0
    else: 
        max_score = avg_score
        is_max_score_initials_score = 0

    return score_ratio, score_partial_ratio, score_token_sort_ratio, score_token_set_ratio, avg_score, initials_score, max_score, is_max_score_initials_score
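
The branch order of the max-score rule can be checked without any fuzzy-matching library. The sketch below is only an illustration: `select_max_score` is a hypothetical helper, and the threshold values `th_init=80` and `th_partial=90` are placeholders rather than the values defined earlier in this notebook. The flag is set to 1 only in the initials branch, which is what its name suggests.

```python
def select_max_score(avg_score, partial_score, initials_score,
                     website_len, initials_len,
                     th_init=80, th_partial=90):
    """Return (max_score, is_initials_score) using the same branch order.

    th_init and th_partial are placeholder values chosen for illustration.
    """
    # Branch 1: short domain and a strong initials match
    if (initials_score > th_init and initials_score > avg_score
            and website_len <= 6 and initials_len > 1):
        return initials_score, 1
    # Branch 2: the partial ratio beats the average and clears its threshold
    if partial_score > avg_score and partial_score >= th_partial:
        return partial_score, 0
    # Branch 3: fall back to the average score
    return avg_score, 0


# Short domain whose initials match well: the initials score is used
short_domain = select_max_score(40.0, 50, 100, website_len=4, initials_len=4)
# Long domain: the initials branch is skipped, the partial ratio is used
long_domain = select_max_score(70.0, 95, 100, website_len=12, initials_len=4)
# Nothing clears a threshold: the average score is used
fallback = select_max_score(60.0, 75, 50, website_len=12, initials_len=4)

print(short_domain, long_domain, fallback)
```

Because the initials branch is tested first, a short domain can be assigned through its initials even when the partial ratio would also qualify.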

In [None]:
# Create variables with scores 
only_one_url[
    ["score_ratio", "score_partial_ratio",
     "score_token_sort_ratio","score_token_set_ratio", 
     "avg_score", "initials_score", "max_score", "is_max_score_initials_score"]
] = only_one_url.apply(lambda x: compute_scores(x["cleaned_name"], x["cleaned_website"], x["initials_name"]), 
                       axis =1, result_type = "expand")


# Assign the website to the Panjiva raw name when the max score exceeds the threshold
only_one_url["assign_website"] = (only_one_url["max_score"] > th_max_score).astype(int)

In [None]:
only_one_url.assign_website.value_counts()

In [None]:
only_one_url

In [None]:
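# Save data
# NOTE: section 3 writes to this same path, so this file is overwritten when the whole notebook is run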
only_one_url.to_stata( 
    path = "../../Data/Mexico/processed_data/fuzzy_aberdeen_names_scores_MEX.dta",
    variable_labels = {
        "new_id": "New ID",
        "old_id": "Old ID", 
        "panjiva_raw_firm_name": "Panjiva raw name", 
        "cleaned_name": "Cleaned Panjiva raw name", 
        "initials_name": "Initials of cleaned Panjiva name", 
        "builtwith_website": "Builtwith website", 
        "cleaned_website": "Cleaned Builtwith website", 
        "avg_score": "Avg score (ratio, partial ratio, token sort, token set)", 
        "score_partial_ratio": "Score from partial ratio", 
        "score_ratio": "Score from ratio", 
        "score_token_sort_ratio": "Score from token sort ratio", 
        "score_token_set_ratio": "Score from token set ratio", 
        "initials_score": "Partial ratio score using initials", 
        "max_score": "Score used to assign the website (max score)", 
        "is_max_score_initials_score": "Is the max score an initials-based score? (1=Yes,0=No)", 
        "assign_website": "Based on text similarity, is best website correctly assigned?(1=Yes,0=No)"
    }, 
    write_index = False
)

# 3. Fuzzy matching for Mexican new IDs with multiple old IDs and multimodal distribution in Aberdeen names for each new ID

## 3.1 Read Data

In [14]:
# Read data of new IDs with multiple old IDs that have a multimodal distribution of aberdeen names for each new ID
mult_aberdeen_ind = pd.read_csv("../../Data/Mexico/processed_data/new_ids_cases_for_fuzzy_aberdeen_MEX.csv")
mult_aberdeen_ind[["new_id", "old_id", "domestic", "aberdeen_name"]].head(60)

Unnamed: 0,new_id,old_id,domestic,aberdeen_name
0,MEX10602,MEX658,ROBERT BOSCH SISTEMAS AUTOMOTRICES SA DE CV,Robert Bosch Sistemas Automotrices S.A. De C.V.
1,MEX10602,MEX4972,ROBERT BOSCH TOOL DE MEXICOSA DE CV,Robert Bosch Tool De México S.A. De C.V.
2,MEX10886,MEX8715,MOLINOS AZTECA DE CHALCO SA DE C V,Molinos Azteca De Chalco S.A. De C.V.
3,MEX10886,MEX10886,MOLINOS AZTECA DE CHIAPAS SA DE CV,Molinos Azteca De Chiapas S.A. De C.V.
4,MEX116214,MEX116214,CHEM ADDITIVES DE MEXICO SA DE CV,Chem Additives De México S.A De C.V.
5,MEX116214,MEX13903,CHEMICAL ADDITIVES DE MEXICO SA DE CV,Chemical Additives De México S.A. De C.V.
6,MEX1265,MEX1265,LG ELECTRONICS MEXICALI SA DE CV,Lg Electronics Mexicali S.A. De C.V.
7,MEX1265,MEX503,LG ELECTRONICS MONTERREY MEXICO SA DE CV,Lg Electronics Monterrey México S.A. De C.V.
8,MEX1265,MEX856,LG ELECTRONICS MEXICO SA DE CV,Lg Electronics México S.A. De C.V.
9,MEX1265,MEX217,LG ELECTRONICS REYNOSA SA DE CV,Lg Electronics Reynosa S.A. De C.V.


## 3.2 Preprocessing

In [20]:
# Common legal suffixes and connectives to strip from the Panjiva and Aberdeen names
mexican_suffixes = ["sa de cv", "s a de c v", "s de rl de cv", "sa cv", "s de rl", " sa ", "s. a. de c. v.",  
                "s.a.", "s.a. de c.v.", "inc", "corp", "llc", "r.l", "c.v", "s. de r.l.", " de ", " s.a. ", " s. ",
                    " de c.v.", "s.a. de c.v.", "sa de c v"]

def preprocess_aberdeen(name_panjiva, name_aberdeen):
    # Normalize: lowercase and remove accents
    name_panjiva = unidecode.unidecode(name_panjiva.lower())
    name_aberdeen = unidecode.unidecode(name_aberdeen.lower())
    
    # Remove common suffixes and other non-alphanumeric characters from name
    for suffix in mexican_suffixes:
        name_panjiva = name_panjiva.replace(suffix, '')
        name_aberdeen = name_aberdeen.replace(suffix, '')
    
    # Remove all non-alphanumeric characters from name
    name_panjiva = ''.join(e for e in name_panjiva if e.isalnum())
    name_aberdeen = ''.join(e for e in name_aberdeen if e.isalnum())


    return name_panjiva, name_aberdeen
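
Because the suffixes are removed one after another, the result depends on their order in the list. The sketch below traces one Panjiva name from the data through the same list. `strip_suffixes` is a hypothetical helper that skips the `unidecode` step, which is safe here only because the example string has no accents. The longest-first ordering is an assumption shown for comparison, not the method used in this notebook.

```python
mexican_suffixes = ["sa de cv", "s a de c v", "s de rl de cv", "sa cv", "s de rl", " sa ", "s. a. de c. v.",
                    "s.a.", "s.a. de c.v.", "inc", "corp", "llc", "r.l", "c.v", "s. de r.l.", " de ", " s.a. ", " s. ",
                    " de c.v.", "s.a. de c.v.", "sa de c v"]


def strip_suffixes(name, suffixes):
    """Lowercase, remove each suffix in the given order, keep alphanumerics only."""
    name = name.lower()
    for suffix in suffixes:
        name = name.replace(suffix, "")
    return "".join(ch for ch in name if ch.isalnum())


raw = "MOLINOS AZTECA DE CHALCO SA DE C V"

# List order: " sa " and " de " fire before "sa de c v" is reached,
# so the fragment "de c v" is left behind
in_list_order = strip_suffixes(raw, mexican_suffixes)

# Longest suffix first: "sa de c v" is removed as a whole
longest_first = strip_suffixes(raw, sorted(mexican_suffixes, key=len, reverse=True))

print(in_list_order)   # molinosaztecachalcodecv
print(longest_first)   # molinosaztecachalco
```

The first result matches the `cleaned_panjiva_name` reported for MEX8715 in the scoring output, which explains why that row scores 94 rather than 100 against its own Aberdeen name.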

## 3.3 Scoring

In [21]:
# Empty dictionary to store results 
scores_aberdeen = {}
# Array with new IDs
new_ids = mult_aberdeen_ind["new_id"].unique()

# Iterate over new ids
for new_id in new_ids:
    # Filter dataset to get only observations of old ids associated with that new id 
    df_filtered = mult_aberdeen_ind[mult_aberdeen_ind["new_id"] == new_id]
    # Create an empty dictionary for the new id
    scores_aberdeen[new_id] = {}
    # Create an empty list to store the best-scoring Aberdeen name of each old ID
    best_names = []
    
    # Iterate over old ids-raw names associated with the new id
    for old_id, raw_name in zip(df_filtered.old_id.to_numpy(), df_filtered.domestic.to_numpy()):
        # Create an empty dictionary for each old id 
        scores_aberdeen[new_id][old_id] = {}
        # Assign raw name to the old id dictionary 
        scores_aberdeen[new_id][old_id][raw_name] = {}  
        
        max_score = 0
        best_name = None
        
        # Iterate over the Aberdeen names associated with the new id
        for aberdeen_name in df_filtered.aberdeen_name.to_numpy():
            
            # Preprocess the raw Panjiva name and the Aberdeen name
            clean_name, clean_aberdeen = preprocess_aberdeen(raw_name, aberdeen_name)
            
            # Scores comparing cleaned Panjiva names and cleaned Aberdeen names
            score_ratio = fuzz.ratio(clean_name, clean_aberdeen)
            score_partial_ratio = fuzz.partial_ratio(clean_name, clean_aberdeen)
            score_token_sort_ratio = fuzz.token_sort_ratio(clean_name, clean_aberdeen)
            score_token_set_ratio = fuzz.token_set_ratio(clean_name, clean_aberdeen)
            
            # Weighted average score (weights sum to 1)
            avg_score = (0.30*score_ratio + 0.40*score_partial_ratio + 0.15*score_token_sort_ratio + 0.15*score_token_set_ratio)
            
            
            if avg_score > max_score: 
                max_score = avg_score
                best_name = aberdeen_name
            
            scores_aberdeen[new_id][old_id][raw_name][aberdeen_name] = {
                "cleaned_name": clean_name,
                "cleaned_aberdeen": clean_aberdeen,
                "score_ratio": score_ratio,
                "score_partial_ratio": score_partial_ratio,
                "score_token_sort_ratio": score_token_sort_ratio,
                "score_token_set_ratio": score_token_set_ratio,
                "avg_score": avg_score, 
            } 
            
        
                
        # Store the best aberdeen name and its score for the current old_id
        scores_aberdeen[new_id][old_id]['best_name'] = {
            "aberdeen_name": best_name,
            "max_score": max_score, 
        }
        
        # Add the best name to the list 
        best_names.append(best_name)
        
        
    # Do all old IDs within the same new ID share the same "best" Aberdeen name?
    if len(set(best_names)) == 1:
        scores_aberdeen[new_id]["share_best_aberdeen_name"] = 1     
    else:
        scores_aberdeen[new_id]["share_best_aberdeen_name"] = 0  
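
The best-name and sharing logic of the loop can be illustrated on a single new ID. In this sketch, `difflib.SequenceMatcher` from the standard library stands in for the weighted `thefuzz` score, so the numbers differ from the notebook's; only the selection logic is the same. The cleaned strings are those reported for MEX10602, and `similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity on a 0-100 scale (not the weighted thefuzz score)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


# Cleaned Panjiva names of the two old IDs under new ID MEX10602
cleaned_panjiva = {
    "MEX658": "robertboschsistemasautomotrices",
    "MEX4972": "robertboschtoolmexico",
}
# Cleaned Aberdeen names attached to the same new ID
cleaned_aberdeen = ["robertboschsistemasautomotrices", "robertboschtoolmexico"]

best_names = []
for old_id, name in cleaned_panjiva.items():
    # Keep the Aberdeen name with the highest score for this old ID
    best = max(cleaned_aberdeen, key=lambda cand: similarity(name, cand))
    best_names.append(best)

# 1 if every old ID points to the same Aberdeen name, 0 otherwise
share_best = int(len(set(best_names)) == 1)
print(best_names, share_best)
```

Each old ID matches its own Aberdeen name best, so the flag is 0: the two old IDs look like distinct firms grouped under one new ID.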
      

In [22]:
# Create an empty list to store data for DataFrame
data_list = []

# Iterate through the dictionary to extract required information
for new_id, old_ids in scores_aberdeen.items():
    for old_id, contents in old_ids.items():
        if old_id == "share_best_aberdeen_name":
            continue
        for panjiva_raw_name, aberdeen_names in contents.items():
            if panjiva_raw_name == "best_name":
                continue
            for aberdeen_name, info in aberdeen_names.items():
                data_list.append({
                    "new_id": new_id,
                    "old_id": old_id,
                    "panjiva_raw_name": panjiva_raw_name,
                    "cleaned_panjiva_name": info["cleaned_name"],
                    "aberdeen_name": aberdeen_name,
                    "cleaned_aberdeen_name": info["cleaned_aberdeen"],
                    "avg_score": info["avg_score"],
                    "score_partial_ratio": info["score_partial_ratio"],
                    "score_ratio": info["score_ratio"],
                    "score_token_sort_ratio": info["score_token_sort_ratio"],
                    "score_token_set_ratio": info["score_token_set_ratio"],
                    "aberdeen_name_with_highest_score": contents["best_name"]["aberdeen_name"],
                    "max_score": contents["best_name"]["max_score"],
                    "old_ids_share_best_aberdeen_name": old_ids["share_best_aberdeen_name"] 
                })

# Create a DataFrame from the list of dictionaries
df_aberdeen_multimodal = pd.DataFrame(data_list)
df_aberdeen_multimodal

Unnamed: 0,new_id,old_id,panjiva_raw_name,cleaned_panjiva_name,aberdeen_name,cleaned_aberdeen_name,avg_score,score_partial_ratio,score_ratio,score_token_sort_ratio,score_token_set_ratio,aberdeen_name_with_highest_score,max_score,old_ids_share_best_aberdeen_name
0,MEX10602,MEX658,ROBERT BOSCH SISTEMAS AUTOMOTRICES SA DE CV,robertboschsistemasautomotrices,Robert Bosch Sistemas Automotrices S.A. De C.V.,robertboschsistemasautomotrices,100.0,100,100,100,100,Robert Bosch Sistemas Automotrices S.A. De C.V.,100.0,0
1,MEX10602,MEX658,ROBERT BOSCH SISTEMAS AUTOMOTRICES SA DE CV,robertboschsistemasautomotrices,Robert Bosch Tool De México S.A. De C.V.,robertboschtoolmexico,65.6,71,62,62,62,Robert Bosch Sistemas Automotrices S.A. De C.V.,100.0,0
2,MEX10602,MEX4972,ROBERT BOSCH TOOL DE MEXICOSA DE CV,robertboschtoolmexico,Robert Bosch Sistemas Automotrices S.A. De C.V.,robertboschsistemasautomotrices,65.6,71,62,62,62,Robert Bosch Tool De México S.A. De C.V.,100.0,0
3,MEX10602,MEX4972,ROBERT BOSCH TOOL DE MEXICOSA DE CV,robertboschtoolmexico,Robert Bosch Tool De México S.A. De C.V.,robertboschtoolmexico,100.0,100,100,100,100,Robert Bosch Tool De México S.A. De C.V.,100.0,0
4,MEX10886,MEX8715,MOLINOS AZTECA DE CHALCO SA DE C V,molinosaztecachalcodecv,Molinos Azteca De Chalco S.A. De C.V.,molinosaztecachalco,94.0,100,90,90,90,Molinos Azteca De Chalco S.A. De C.V.,94.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
120,MEX66543,MEX112317,UNI-TRADE BROKERS S.C.,unitradebrokerssc,Uni Trade Brokers S.C.,unitradebrokerssc,100.0,100,100,100,100,Uni Trade Brokers S.C.,100.0,0
121,MEX69751,MEX1509,SCHNEIDER ELECTRIC MEXICO SA DE CV,schneiderelectricmexico,Schneider Electric México S.A. De C.V.,schneiderelectricmexico,100.0,100,100,100,100,Schneider Electric México S.A. De C.V.,100.0,0
122,MEX69751,MEX1509,SCHNEIDER ELECTRIC MEXICO SA DE CV,schneiderelectricmexico,Schneider Electric Software México S.A. De C.V.,schneiderelectricsoftwaremexico,85.4,86,85,85,85,Schneider Electric México S.A. De C.V.,100.0,0
123,MEX69751,MEX31705,SCHNEIDER ELECTRIC SOFTWARE MEXICO SA DE CV,schneiderelectricsoftwaremexico,Schneider Electric México S.A. De C.V.,schneiderelectricmexico,85.4,86,85,85,85,Schneider Electric Software México S.A. De C.V.,100.0,0
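
As a quick consistency check, the `avg_score` column can be recomputed from the four component scores and the weights used in the scoring loop. The sketch below does this for three rows of the table above; the `rows` dictionary is simply a hand-copied subset of those rows.

```python
# Weights used in the scoring loop
weights = {"ratio": 0.30, "partial_ratio": 0.40,
           "token_sort_ratio": 0.15, "token_set_ratio": 0.15}

# Component scores copied from rows 1, 4 and 122 of the table
rows = {
    1:   {"ratio": 62, "partial_ratio": 71,  "token_sort_ratio": 62, "token_set_ratio": 62},
    4:   {"ratio": 90, "partial_ratio": 100, "token_sort_ratio": 90, "token_set_ratio": 90},
    122: {"ratio": 85, "partial_ratio": 86,  "token_sort_ratio": 85, "token_set_ratio": 85},
}

# Weighted average per row, rounded to one decimal as displayed in the table
recomputed = {
    idx: round(sum(weights[k] * scores[k] for k in weights), 1)
    for idx, scores in rows.items()
}
print(recomputed)  # {1: 65.6, 4: 94.0, 122: 85.4}
```

The partial ratio carries the largest weight (0.40), so a name fully contained in a longer one is penalized less than under a simple average.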


In [23]:
df_aberdeen_multimodal.head(60)

Unnamed: 0,new_id,old_id,panjiva_raw_name,cleaned_panjiva_name,aberdeen_name,cleaned_aberdeen_name,avg_score,score_partial_ratio,score_ratio,score_token_sort_ratio,score_token_set_ratio,aberdeen_name_with_highest_score,max_score,old_ids_share_best_aberdeen_name
0,MEX10602,MEX658,ROBERT BOSCH SISTEMAS AUTOMOTRICES SA DE CV,robertboschsistemasautomotrices,Robert Bosch Sistemas Automotrices S.A. De C.V.,robertboschsistemasautomotrices,100.0,100,100,100,100,Robert Bosch Sistemas Automotrices S.A. De C.V.,100.0,0
1,MEX10602,MEX658,ROBERT BOSCH SISTEMAS AUTOMOTRICES SA DE CV,robertboschsistemasautomotrices,Robert Bosch Tool De México S.A. De C.V.,robertboschtoolmexico,65.6,71,62,62,62,Robert Bosch Sistemas Automotrices S.A. De C.V.,100.0,0
2,MEX10602,MEX4972,ROBERT BOSCH TOOL DE MEXICOSA DE CV,robertboschtoolmexico,Robert Bosch Sistemas Automotrices S.A. De C.V.,robertboschsistemasautomotrices,65.6,71,62,62,62,Robert Bosch Tool De México S.A. De C.V.,100.0,0
3,MEX10602,MEX4972,ROBERT BOSCH TOOL DE MEXICOSA DE CV,robertboschtoolmexico,Robert Bosch Tool De México S.A. De C.V.,robertboschtoolmexico,100.0,100,100,100,100,Robert Bosch Tool De México S.A. De C.V.,100.0,0
4,MEX10886,MEX8715,MOLINOS AZTECA DE CHALCO SA DE C V,molinosaztecachalcodecv,Molinos Azteca De Chalco S.A. De C.V.,molinosaztecachalco,94.0,100,90,90,90,Molinos Azteca De Chalco S.A. De C.V.,94.0,0
5,MEX10886,MEX8715,MOLINOS AZTECA DE CHALCO SA DE C V,molinosaztecachalcodecv,Molinos Azteca De Chiapas S.A. De C.V.,molinosaztecachiapas,80.0,89,74,74,74,Molinos Azteca De Chalco S.A. De C.V.,94.0,0
6,MEX10886,MEX10886,MOLINOS AZTECA DE CHIAPAS SA DE CV,molinosaztecachiapas,Molinos Azteca De Chalco S.A. De C.V.,molinosaztecachalco,84.8,89,82,82,82,Molinos Azteca De Chiapas S.A. De C.V.,100.0,0
7,MEX10886,MEX10886,MOLINOS AZTECA DE CHIAPAS SA DE CV,molinosaztecachiapas,Molinos Azteca De Chiapas S.A. De C.V.,molinosaztecachiapas,100.0,100,100,100,100,Molinos Azteca De Chiapas S.A. De C.V.,100.0,0
8,MEX116214,MEX116214,CHEM ADDITIVES DE MEXICO SA DE CV,chemadditivesmexico,Chem Additives De México S.A De C.V.,chemadditivesmexicosa,97.0,100,95,95,95,Chem Additives De México S.A De C.V.,97.0,0
9,MEX116214,MEX116214,CHEM ADDITIVES DE MEXICO SA DE CV,chemadditivesmexico,Chemical Additives De México S.A. De C.V.,chemicaladditivesmexico,89.2,88,90,90,90,Chem Additives De México S.A De C.V.,97.0,0


In [24]:
# Share of new IDs whose old IDs all share the same best Aberdeen name
df_aberdeen_multimodal.drop_duplicates("new_id").old_ids_share_best_aberdeen_name.value_counts(normalize=True)

0    1.0
Name: old_ids_share_best_aberdeen_name, dtype: float64

In [None]:
# Are all the max scores above the threshold? 
df_aberdeen_multimodal["all_max_scores_above_threshold"] = df_aberdeen_multimodal.groupby("new_id")["max_score"].transform(lambda x: (x > th_max_score).all()).astype(int)

# Assign the Aberdeen name only if every max score of the new ID is above the threshold
# and all its old IDs share the same best Aberdeen name
df_aberdeen_multimodal["assign_aberdeen_name"] = ((df_aberdeen_multimodal["all_max_scores_above_threshold"] == 1) & (df_aberdeen_multimodal["old_ids_share_best_aberdeen_name"] == 1)).astype(int)
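
The assignment rule works at the new-ID level: a single old ID below the threshold, or any disagreement on the best name, blocks the assignment for every row of that new ID. The pure-Python sketch below mirrors the `groupby(...).transform(...)` logic on made-up rows. The IDs and scores are invented for illustration, and `TH_MAX_SCORE = 75` is an assumption taken from the variable label used when saving, not from the definition of `th_max_score`.

```python
from collections import defaultdict

TH_MAX_SCORE = 75  # assumed value, see the lead-in

# Invented rows: one per (new_id, old_id) pair
rows = [
    {"new_id": "A", "max_score": 100.0, "share_best": 1},
    {"new_id": "A", "max_score": 94.0,  "share_best": 1},
    {"new_id": "B", "max_score": 100.0, "share_best": 1},
    {"new_id": "B", "max_score": 60.0,  "share_best": 1},  # below the threshold
    {"new_id": "C", "max_score": 100.0, "share_best": 0},  # best names disagree
]

# Equivalent of groupby("new_id")["max_score"].transform(lambda x: (x > th).all())
all_above = defaultdict(lambda: True)
for row in rows:
    all_above[row["new_id"]] &= row["max_score"] > TH_MAX_SCORE

# Row-level flag: both group-level conditions must hold
assign = [int(all_above[row["new_id"]] and row["share_best"] == 1) for row in rows]
print(assign)  # [1, 1, 0, 0, 0]
```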

In [None]:
df_aberdeen_multimodal.assign_aberdeen_name.value_counts()

In [None]:
df_aberdeen_multimodal[["new_id", "old_id", "panjiva_raw_name", "aberdeen_name", "cleaned_panjiva_name" ,"cleaned_aberdeen_name", "aberdeen_name_with_highest_score", "old_ids_share_best_aberdeen_name","assign_aberdeen_name"]].head(60)

In [None]:
# Save data 
df_aberdeen_multimodal.to_stata( 
    path = "../../Data/Mexico/processed_data/fuzzy_aberdeen_names_scores_MEX.dta",
    variable_labels = {
        "new_id": "New ID",
        "old_id": "Old ID", 
        "panjiva_raw_name": "Panjiva raw name", 
        "cleaned_panjiva_name": "Cleaned Panjiva raw name", 
        "aberdeen_name": "Aberdeen raw name", 
        "cleaned_aberdeen_name": "Cleaned Aberdeen name", 
        "avg_score": "Weighted avg score (ratio, partial ratio, token sort, token set)", 
        "score_partial_ratio": "Score from partial ratio", 
        "score_ratio": "Score from ratio", 
        "score_token_sort_ratio": "Score from token sort ratio", 
        "score_token_set_ratio": "Score from token set ratio", 
        "aberdeen_name_with_highest_score": "Aberdeen name with highest score for the given old ID (best aberdeen name)", 
        "max_score": "Score of the aberdeen name with highest score", 
        "old_ids_share_best_aberdeen_name": "Do old IDs share same top aberdeen name? (1=Yes,0=No)", 
        "all_max_scores_above_threshold":"For a given new ID, are all the max scores above the threshold (75)?", 
        "assign_aberdeen_name": "Based on text similarity, is best aberdeen name correctly assigned?(1=Yes,0=No)"
    }, 
    write_index = False
)

# 4. Fuzzy matching for new IDs with multiple old IDs but with only one old ID by new ID that was matched to Aberdeen (one Aberdeen name)

In this section, we analyze new IDs that have multiple old IDs but for which only one of these old IDs could be matched to Aberdeen, so a single Aberdeen name was retrieved. 

## 4.1 Read Data

In [None]:
# Read data of new IDs with multiple old IDs that have only one Aberdeen name matched. This differs from the 
# unimodal case: these new IDs are flagged as having multiple old IDs -- because in the correspondence table they 
# do have multiple aberdeen names -- but the Panjiva name of only one of these old IDs could be matched to Aberdeen  
only_one_aberdeen_name = pd.read_csv("../../Data/Mexico/processed_data/one_aberdeen_name_matched_cases_for_fuzzy_MEX.csv").drop("Unnamed: 0", axis = 1)[["old_id", "new_id", "domestic", "aberdeen_name", "only_one_aberdeen_matched"]]

## 4.2. Preprocessing 

We reuse `preprocess_aberdeen` from Section 3.2 to clean the Panjiva raw names and the Aberdeen names; its steps are similar to those in Section 1.2. 

In [None]:
only_one_aberdeen_name[["cleaned_name", "cleaned_aberdeen_name"]] = only_one_aberdeen_name.apply(lambda x: preprocess_aberdeen(x["domestic"], x["aberdeen_name"]), axis=1, result_type = "expand")
only_one_aberdeen_name

## 4.3 Scoring Algorithm 

In [None]:
# Function to compute Scores
def compute_scores_aberdeen(clean_name, clean_aberdeen):
    
    # Scores comparing cleaned Panjiva names and cleaned Aberdeen names 
    score_ratio = fuzz.ratio(clean_name, clean_aberdeen)
    score_partial_ratio = fuzz.partial_ratio(clean_name, clean_aberdeen)
    score_token_sort_ratio = fuzz.token_sort_ratio(clean_name, clean_aberdeen)
    score_token_set_ratio = fuzz.token_set_ratio(clean_name, clean_aberdeen)
            
    # Average score 
    avg_score = (score_ratio + score_partial_ratio + score_token_sort_ratio + score_token_set_ratio) / 4
                        
    if (score_partial_ratio > avg_score) and (score_partial_ratio >= th_partial):  # Create a threshold to know when to use the score_partial_ratio
        max_score = score_partial_ratio
    else: 
        max_score = avg_score

    return score_ratio, score_partial_ratio, score_token_sort_ratio, score_token_set_ratio, avg_score, max_score 

In [None]:
# Create variables with scores 
only_one_aberdeen_name[
    ["score_ratio", "score_partial_ratio",
     "score_token_sort_ratio","score_token_set_ratio", 
     "avg_score", "max_score"]
] = only_one_aberdeen_name.apply(lambda x: compute_scores_aberdeen(x["cleaned_name"], x["cleaned_aberdeen_name"]), 
                       axis =1, result_type = "expand")


# Assign the Aberdeen name to the Panjiva raw name when the max score exceeds the threshold
only_one_aberdeen_name["assign_aberdeen_name"] = (only_one_aberdeen_name["max_score"] > th_max_score).astype(int)

In [None]:
only_one_aberdeen_name.assign_aberdeen_name.value_counts()

In [None]:
only_one_aberdeen_name.columns

In [None]:
only_one_aberdeen_name[["domestic", "aberdeen_name", "assign_aberdeen_name"]].head(60)

In [None]:
# Save data
only_one_aberdeen_name.to_stata( 
    path = "../../Data/Mexico/processed_data/fuzzy_one_aberdeen_name_scores_MEX.dta",
    variable_labels = {
        "new_id": "New ID",
        "old_id": "Old ID", 
        "domestic": "Panjiva raw name", 
        "cleaned_name": "Cleaned Panjiva raw name", 
        "aberdeen_name": "Aberdeen raw name", 
        "cleaned_aberdeen_name": "Cleaned Aberdeen name", 
        "avg_score": "Avg score (ratio, partial ratio, token sort, token set)", 
        "score_partial_ratio": "Score from partial ratio", 
        "score_ratio": "Score from ratio", 
        "score_token_sort_ratio": "Score from token sort ratio", 
        "score_token_set_ratio": "Score from token set ratio", 
        "max_score": "Score used to assign the Aberdeen name (max score)", 
        "assign_aberdeen_name": "Based on text similarity, is the aberdeen name correctly assigned?(1=Yes,0=No)"
    }, 
    write_index = False
)