# Portuguese Experiment

As in the English Experiment, we try to retrieve pesticide names based on substrings. The difference is that the seeds used here are the substrings from the other experiments, translated into Portuguese.

**Libraries**

The *os* module is a built-in library that provides functions for interacting with the *operating system*.

The *re* module provides *regular expression* matching operations.

*Pandas* is a library used for data manipulation and analysis.


In [1]:
import os
import re
import pandas as pd

**Token count**

Pre-processing and total count

In [2]:
#Token Count

def token_count(folder_path):
    token_list = []
    # Loop through all .txt files in the folder
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                text = file.read().lower()
                text = re.sub(r'[^\w\s-]', '', text)  # keep hyphens, remove other punctuation
                tokens = re.findall(r'\b\w+(?:-\w+)*\b', text)
                token_list.extend(tokens)

    total_tokens = len(token_list)
    return total_tokens

Folder = "Docs_Portuguese"
token_count(Folder)

682688
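
The pre-processing can be checked on a single sentence. The snippet below is a minimal sketch that reuses the two regular expressions from `token_count`; the `tokenize` helper and the sample sentence are hypothetical and only serve as an illustration.

```python
import re

def tokenize(text):
    """Lowercase, drop punctuation except hyphens, and split into tokens."""
    text = text.lower()
    text = re.sub(r'[^\w\s-]', '', text)  # keep hyphens, remove other punctuation
    return re.findall(r'\b\w+(?:-\w+)*\b', text)

# Hypothetical sentence, not taken from the corpus
sample = "O clorpirifós-metílico (organofosforado) foi detectado."
tokens = tokenize(sample)
print(tokens)
print(len(tokens))
```

Hyphenated compounds such as `clorpirifós-metílico` stay together as one token, so they count once in the total.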

**Predicting keywords based on 'morphemes'**

The goal of this experiment is to extract pesticide names from a list of strings by using substrings (which may or may not be morphemes). The seeds are the top 10 substrings of each size, from *2 to 5* characters. The experiment is run 4 times, with fewer substrings in each run (sizes 2-5, 3-5, 4-5, and 5).

In [3]:
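def extract_matching_strings_from_txt(path, substrings):
    """Return the set of words in one file that contain any of the substrings."""
    matches = set()
    with open(path, "r", encoding="utf-8") as file:
        text = file.read()  # case is preserved, so matching is case-sensitive
        text = re.sub(r'[^\w\s-]', '', text)  # keep hyphens, remove other punctuation
        words = re.findall(r"\b\w+(?:-\w+)*\b", text)  # extract all words
        for word in words:
            if any(sub in word for sub in substrings):
                matches.add(word)
    return matches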

def scrape_txt_folder(folder_path, substrings):
    results = {}
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            full_path = os.path.join(folder_path, filename)
            matched_words = extract_matching_strings_from_txt(full_path, substrings)
            if matched_words:
                results[filename] = matched_words
    return results

def merge_unique(found_matches):
    merged = set()
    for words in found_matches.values():
        merged.update(words)
    return sorted(merged)

def save_list_to_excel(string_list, filename="unique_matches_pt_1.xlsx"):
    df = pd.DataFrame(string_list, columns=["Matched Strings"])
    df.to_excel(filename, index=False)
    print(f"Excel file saved as: {filename}")
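
The matching rule in `extract_matching_strings_from_txt` can be tried on a handful of words. This is a minimal sketch: `match_words` and the word list are hypothetical, while the three seeds are taken from the lists used in the experiments. Python's `in` operator compares characters exactly, so the test is sensitive to both case and accents.

```python
def match_words(words, substrings):
    """Return the sorted unique words that contain at least one seed substring."""
    return sorted({w for w in words if any(sub in w for sub in substrings)})

# Hypothetical tokens; the seeds appear in the experiment lists
words = ["clorpirifos", "temefós", "malation", "Paraoxon", "paraoxon", "água"]
seeds = ["fos", "tion", "para"]

matched = match_words(words, seeds)
print(matched)
```

Here `temefós` is skipped because the seed `fos` has no accent, and `Paraoxon` is skipped because the seed `para` is lowercase.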



In [4]:
ground_truth = [
    "acefato", "azametifós", "azinfos", "azinfós", "azinfos metil", "azinfós metil",
    "azinfós-etílico", "azinfos-metil", "azinfós-metilico", "azinfós-metílico", "azinfosmetil",
    "bromofós", "bromofós etílico", "bromofós-etílico", "bromofós-metílico", "bromofósetílico",
    "cadusafós", "carbofenotion", "carbofenotiona", "carbofention", "chlorpyrifos",
    "clorfenrifós", "clorfenvinfos", "clorfenvinfós", "clorpirifos", "clorpirifós",
    "clorpirifós etil", "clorpirifós etílico", "clorpirifós metílico", "clorpirifós-metil",
    "clorpirifós-metílico", "clorpirifos-oxon", "clorpirifós-oxon", "clorpiripos",
    "clorpirofós", "clorthion", "clortiofos", "clortion", "crufomato", "cumafós", "DDVP", "DEF",
    "demetom-S-metílico", "demeton", "demeton-S", "demeton-S-metílico", "diazinom", "diazinon",
    "diazinona", "diazoxon", "dichlorvos", "diclorvos", "diclorvós", "diclórvos", "dicrotofos",
    "dimetoato", "dimixion", "dioxation", "dissulfotom", "dissulfoton", "dissulfotona",
    "disulfoton", "edifenfós", "etefom", "etefon", "ethion", "etil paration", "etil-paration",
    "etion", "etiona", "etoprofos", "etoprofós", "etrinfos", "etrinfós", "etropofós", "fenamifos",
    "fenamifós", "fenclorfós", "fenitrothion", "fenitrotion", "fenitrotiona", "fensulfotion",
    "fenthion", "fention", "fentiona", "fentoato", "forato", "formotion", "formotiona", "fosalona",
    "fosetil", "fosetyl al", "fosfamidom", "fosfamidon", "fosfamidona", "fosmete", "fostiazato",
    "foxim", "glifosate", "glifosato", "glufosinato", "glyphosate", "heptenofos", "hostathion",
    "iodofenfós", "iprobenfos", "isazofós", "isocarbophos", "isomalathion", "isomalation",
    "isoxation", "leptofós", "malaoxon", "malaoxona", "malathion", "malatião", "malation",
    "malationa", "metamidofos", "metamidofós", "metaminofós", "methamidofós", "methamidophos",
    "methyl parathion", "metidation", "metidationa", "metil paraoxon", "metil paration",
    "metil paroxon", "metil-paraoxon", "metil-paration", "metilparaoxon", "metilparation",
    "mevinfos", "mevinfós", "mipafox", "monocrotofos", "monocrotofós", "naled", "nalede",
    "paraoxon", "paraoxon etílico", "paraoxon-etílico", "paraoxon-metílico", "paraoxona",
    "paraoxona etílica", "parathion", "parathion methyl", "paratião", "paratião-metil",
    "paratiom metílico", "paration", "paration etílico", "paration metílico", "parationa",
    "parationa metílica", "parationa-etílica", "parationa-metílica", "parationametílica",
    "paraxon", "paroxon", "pirazofós", "piridafentiona", "pirimifos metílico", "pirimifós metílico",
    "pirimifós-etílico", "pirimifós-metil", "pirimifós-metílico", "profenofós", "prothiofos",
    "protiofós", "quinafós", "quinalfos", "quinalfós", "quinófos", "quinolphos", "sulfotepp",
    "sulprofós", "tebupirifós", "tebupirinfós", "TEEP", "temefós", "temephos", "TEPP", "terbufos",
    "terbufós", "tetraclorvinfós", "tiometon", "tiometona", "tolclofos metil", "tolclofosmetil",
    "triazofos", "triazofós", "tribufós", "trichlorfon", "triclorfom", "triclorfon", "vamidationa",
    "vamidotion", "vamidotiona"] #Ground truth

**Experiment 1**

Strings 2-5

In [5]:
substrings_to_search1 = ['os', 'on', 'ho', 'f', 't', 'fo', 'hi', 'et', 'io', 'ox', 
                        'fo', 'os', 'ti', 'fos', 'io', 'ion', 'cl', 'lo', 'lor', 'met',
                        'fos', 'tio', 'ion', 'clo', 'lor', 'para', 'oxon', 'ati', 'ofo', 'lorp',
                        'tion', 'clor', 'atio', 'ofos', 'lorp', 'nfos', 'lorpi', 'orpir', 'demet', 'meton']

folder_path = "Docs_Portuguese"

found_matches1 = scrape_txt_folder(folder_path, substrings_to_search1)

all_unique_words1 = merge_unique(found_matches1)

save_list_to_excel(all_unique_words1)

Excel file saved as: unique_matches_pt_1.xlsx


In [7]:
def read_column_as_list(filename, column_name):
    df = pd.read_excel(filename)
    return df[column_name].dropna().astype(str).tolist()

def get_common_strings(list1, list2):
    return sorted(set(list1) & set(list2))

list1 = read_column_as_list("unique_matches_pt_1.xlsx", "Matched Strings")

true_positives1 = get_common_strings(list1, ground_truth)

count_ground_truth = len(ground_truth)

count_predictions1 = len(list1)

false_positives1 = count_predictions1 - len(true_positives1)

false_negatives1 = count_ground_truth - len(true_positives1)

print('## Experiment 1 - 2-5 ##\n')
print('Predicted words: ', count_predictions1)
print('Ground truth: ', count_ground_truth)
print('Matches:', len(true_positives1))
print('False positives: ', false_positives1) 
print('False negatives: ', false_negatives1)


## Experiment 1 - 2-5 ##

Predicted words:  24393
Ground truth:  200
Matches: 155
False positives:  24238
False negatives:  45
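
The counts come from an exact set intersection between the predicted words and the ground truth. The sketch below illustrates this on toy data. Both lists are hypothetical, and unlike the cell above it deduplicates the ground truth with `set` before counting. It also shows that a multi-word ground-truth term can never equal a single-token prediction, so such terms always end up among the false negatives.

```python
# Hypothetical predictions (single tokens) and a small set of ground-truth terms
predictions = ["clorpirifos", "malation", "metil", "paration", "solo"]
truth = ["clorpirifos", "malation", "metil paration", "diazinon"]

# Exact, case-sensitive matches between the two lists
true_positives = sorted(set(predictions) & set(truth))
false_positives = len(set(predictions)) - len(true_positives)
false_negatives = len(set(truth)) - len(true_positives)

# Terms containing a space cannot be produced by the word-level tokenizer
unreachable = [term for term in truth if " " in term]

print(true_positives, false_positives, false_negatives)
print(unreachable)
```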


**Evaluation 1**

Precision, Recall and F1

In [8]:
def precision_recall(true_positives, false_positives, false_negatives):
    tp = len(true_positives)
    fp = false_positives
    fn = false_negatives

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

    print('Precision: ', precision)
    print('Recall: ', recall)
    print('F1: ', f1)

precision_recall(true_positives1,false_positives1, false_negatives1)

Precision:  0.006354281966137827
Recall:  0.775
F1:  0.012605212865449517
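
As a worked check, the standard definitions can be applied to the Experiment 1 counts (155 matches, 24238 false positives, 45 false negatives). The values are rounded here:

$$
\text{Precision} = \frac{TP}{TP + FP} = \frac{155}{155 + 24238} = \frac{155}{24393} \approx 0.00635
$$

$$
\text{Recall} = \frac{TP}{TP + FN} = \frac{155}{155 + 45} = \frac{155}{200} = 0.775
$$

$$
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{310}{24593} \approx 0.0126
$$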


**Experiment 2**

Strings 3-5

In [9]:
substrings_to_search2 = ['fo', 'os', 'ti', 'fos', 'io', 'ion', 'cl', 'lo', 'lor', 'met',
                        'fos', 'tio', 'ion', 'clo', 'lor', 'para', 'oxon', 'ati', 'ofo', 'lorp',
                        'tion', 'clor', 'atio', 'ofos', 'lorp', 'nfos', 'lorpi', 'orpir', 'demet', 'meton']

folder_path = "Docs_Portuguese"

found_matches2 = scrape_txt_folder(folder_path, substrings_to_search2)

all_unique_words2 = merge_unique(found_matches2)

save_list_to_excel(all_unique_words2, filename='unique_matches_pt_2.xlsx')

Excel file saved as: unique_matches_pt_2.xlsx


In [10]:
list2 = read_column_as_list("unique_matches_pt_2.xlsx", "Matched Strings")

true_positives2 = get_common_strings(list2, ground_truth)

count_predictions2 = len(list2)

false_positives2 = count_predictions2 - len(true_positives2)

false_negatives2 = count_ground_truth - len(true_positives2)

print('## Experiment 2 - 3-5 ##\n')
print('Predicted words: ', count_predictions2)
print('Ground truth: ', count_ground_truth)
print('Matches:', len(true_positives2))
print('False positives: ', false_positives2) 
print('False negatives: ', false_negatives2)

## Experiment 2 - 3-5 ##

Predicted words:  11665
Ground truth:  200
Matches: 126
False positives:  11539
False negatives:  74


**Evaluation 2**

Precision, Recall and F1

In [11]:
precision_recall(true_positives2,false_positives2, false_negatives2)

Precision:  0.010801543077582512
Recall:  0.63
F1:  0.021238938053097345


**Experiment 3**

Strings 4-5

In [12]:
substrings_to_search3 = ['fos', 'tio', 'ion', 'clo', 'lor', 'para', 'oxon', 'ati', 'ofo', 'lorp',
                        'tion', 'clor', 'atio', 'ofos', 'lorp', 'nfos', 'lorpi', 'orpir', 'demet', 'meton']

folder_path = "Docs_Portuguese"

found_matches3 = scrape_txt_folder(folder_path, substrings_to_search3)

all_unique_words3 = merge_unique(found_matches3)

save_list_to_excel(all_unique_words3, filename='unique_matches_pt_3.xlsx')

Excel file saved as: unique_matches_pt_3.xlsx


In [13]:
list3 = read_column_as_list("unique_matches_pt_3.xlsx", "Matched Strings")

true_positives3 = get_common_strings(list3, ground_truth)

count_predictions3 = len(list3)

false_positives3 = count_predictions3 - len(true_positives3)

false_negatives3 = count_ground_truth - len(true_positives3)

print('## Experiment 3 - 4-5 ##\n')
print('Predicted words: ', count_predictions3)
print('Ground truth: ', count_ground_truth)
print('Matches:', len(true_positives3))
print('False positives: ', false_positives3) 
print('False negatives: ', false_negatives3)

## Experiment 3 - 4-5 ##

Predicted words:  2709
Ground truth:  200
Matches: 102
False positives:  2607
False negatives:  98


**Evaluation 3**

Precision, Recall and F1

In [14]:
precision_recall(true_positives3,false_positives3, false_negatives3)

Precision:  0.03765227021040975
Recall:  0.51
F1:  0.07012719147473359


**Experiment 4**

Strings 5

In [15]:
substrings_to_search4 = ['tion', 'clor', 'atio', 'ofos', 'lorp', 'nfos', 'lorpi', 'orpir', 'demet', 'meton']

folder_path = "Docs_Portuguese"

found_matches4 = scrape_txt_folder(folder_path, substrings_to_search4)

all_unique_words4 = merge_unique(found_matches4)

save_list_to_excel(all_unique_words4, filename='unique_matches_pt_4.xlsx')

Excel file saved as: unique_matches_pt_4.xlsx


In [16]:
list4 = read_column_as_list("unique_matches_pt_4.xlsx", "Matched Strings")

true_positives4 = get_common_strings(list4, ground_truth)

count_predictions4 = len(list4)

false_positives4 = count_predictions4 - len(true_positives4)

false_negatives4 = count_ground_truth - len(true_positives4)

print('## Experiment 4 - 5 ##\n')
print('Predicted words: ', count_predictions4)
print('Ground truth: ', count_ground_truth)
print('Matches:', len(true_positives4))
print('False positives: ', false_positives4) 
print('False negatives: ', false_negatives4)

## Experiment 4 - 5 ##

Predicted words:  926
Ground truth:  200
Matches: 67
False positives:  859
False negatives:  133


**Evaluation 4**

Precision, Recall and F1

In [17]:
precision_recall(true_positives4,false_positives4, false_negatives4)

Precision:  0.07235421166306695
Recall:  0.335
F1:  0.11900532859680285
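
To compare the four runs side by side, the metrics can be recomputed from the counts reported above. This is a minimal sketch: the `reported` dictionary is a hypothetical container, filled by hand with the predicted-word and match counts from the outputs of each experiment.

```python
# Counts copied from the outputs above: label -> (predicted words, matches)
reported = {
    "2-5": (24393, 155),
    "3-5": (11665, 126),
    "4-5": (2709, 102),
    "5": (926, 67),
}
ground_truth_size = 200

summary = {}
for label, (predicted, matches) in reported.items():
    precision = matches / predicted
    recall = matches / ground_truth_size
    f1 = 2 * precision * recall / (precision + recall)
    summary[label] = (precision, recall, f1)
    print(f"{label:>4}  precision={precision:.4f}  recall={recall:.4f}  f1={f1:.4f}")
```

In these results, precision and F1 rise as the shorter seeds are dropped, while recall falls from 0.775 to 0.335.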
