<center><h2><b>TP1: Basic Text Processing:<br>
 Regular Expressions and Edit Distance</b></h2></center>

<h3><strong>Exercise 1:</strong></h3>
<p>We want to retrieve the email addresses, postal codes, and phone numbers that appear on the "contact us" pages of the websites of several Tunisian university institutes. The problem is that these pages present this information in several different ways. To retrieve the information and bring it into a uniform format, we will use regular expressions.</p>
<ol>
<h4><strong><li>Information sought:</strong></h4>
<p>Below we describe some of the variations in which this information appears in the pages. The list is not exhaustive, so you should open the pages on which the system failed in order to locate the unrecognized forms.</p>
<ul>
 <li><h5><strong>Phone numbers:</strong></h5>
<p>Phone numbers appear in several forms. For example:
 • (216) 73 683 100;
 • (+216) 73 683 100;
 • (216) 73683100;
 • 73683100;
 • (+216) 73 68 31 00;
 • etc.
 This list is not complete; the files must be examined to detect all possible forms.
 The target format is "(+216) XX XX XX XX", where X is a digit.</p>
 <li><h5><strong>Email addresses:</strong></h5>
<p>There are no constraints on the email addresses.</p>
<li><h5><strong>Postal codes:</strong></h5>
<p>Postal codes are generally found in the institutes' addresses. They usually consist
 of 4 digits (example: 3021).</p>
 </ul>
<h3><strong><li>Work to do:</strong></h3>
<p>In your Classroom space you will find an incomplete program. Complete it with the set of
 regular expressions that detect the phone numbers, the postal codes, and the email addresses.
 At the end of the work, report the evaluation of your solution
 in terms of recall, precision, and F-measure.</p></ol>
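
The target phone format can be illustrated with a minimal sketch. It is not the solution expected by the exercise: `normalize_phone` is a hypothetical helper, and it assumes that every number has eight national digits, optionally preceded by the country code 216, as in the forms listed above.

```python
import re

def normalize_phone(raw):
    """Return raw as '(+216) XX XX XX XX', or None if it does not have 8 national digits."""
    # Keep only the digits: '(+216) 73 683 100' -> '21673683100'.
    digits = re.sub(r'\D', '', raw)

    # Assumption: an 11-digit number starting with 216 carries the country code.
    if len(digits) == 11 and digits.startswith('216'):
        digits = digits[3:]

    # Anything that is not exactly 8 digits is rejected rather than guessed.
    if len(digits) != 8:
        return None

    # Group the 8 digits into four pairs.
    pairs = [digits[i:i + 2] for i in range(0, 8, 2)]
    return '(+216) ' + ' '.join(pairs)

for sample in ['(216) 73 683 100', '(+216) 73 68 31 00', '73683100']:
    print(normalize_phone(sample))  # (+216) 73 68 31 00 each time
```

Stripping every non-digit first makes the helper independent of spacing and parentheses, at the cost of accepting any 8-digit string.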

### **Imported Libraries**

In [968]:
import re
import os
import pandas as pd
from functools import reduce

### **Processing Phase**

In [969]:
# Define regular expressions that match each contact type
contact_re = {
    'mails': re.compile(r'\b(?:(?!mailto)\w+)\.?\w+@\w+\.\w+\.?[a-z]*\b'),
    'code': re.compile(r"[^/][a-z]?(\d{4})(?! is )(?: - \w{2,3} \d+|  ?-?[\u0621-\u064AA-Za-z]+)(?!. Tous|’|. All)\b"),
    'tels': re.compile(r'\b(?:Tel|T\s?:?\s?(?:\(\+216\)|\+216)?)?\s?(\d{2} \d{3} \d{3}|\d{2} \d{2} \d{2} \d{2}|\d{2} \d{6})\b')
}
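
When a pattern contains a single capturing group, `findall` returns only the text captured by that group; with no group it returns the whole match. The sketch below shows the difference on deliberately simplified, hypothetical patterns and a made-up line. They are not the patterns defined above.

```python
import re

# Hypothetical, simplified patterns used only to illustrate findall().
no_group = re.compile(r'\b\w+@\w+\.\w+\b')                          # no capturing group
one_group = re.compile(r'(?:Tel|Fax)\s?:\s?(\d{2} \d{3} \d{3})')    # one capturing group

# Made-up input line.
line = 'Tel : 73 683 100, Fax : 73 683 101, mail: info@example.tn'

emails = no_group.findall(line)    # whole matches
phones = one_group.findall(line)   # only the captured digits, without 'Tel :' / 'Fax :'

print(emails)  # ['info@example.tn']
print(phones)  # ['73 683 100', '73 683 101']
```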

In [970]:
def list_to_dict_of_counts(type_matches):
    """
    Convert a list of items into a dictionary of counts.

    Parameters:
    - type_matches (list): A list of items.

    Returns:
    - dict: A dictionary where keys are unique items from the list, and values
      are the counts of each item in the list.
    """
    # Initialize an empty dictionary to store counts.
    counts = dict()

    # Iterate through the list of items.
    for item in type_matches:
        # Update the count for the current item in the dictionary.
        counts[item] = counts.get(item, 0) + 1

    # Return the resulting dictionary of counts.
    return counts
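
The same counting can be sketched with `collections.Counter` from the standard library. `count_items` is a hypothetical name chosen for this illustration.

```python
from collections import Counter

def count_items(items):
    """Count the occurrences of each item and return a plain dict."""
    return dict(Counter(items))

counts = count_items(['3018', '3018', '5019'])
print(counts)  # {'3018': 2, '5019': 1}
```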

In [971]:
def edit_phone_number_format(matches):
    """
    Edit the format of phone numbers in a list of matches.

    Parameters:
    - matches (list): A list of strings containing phone numbers.

    Returns:
    - list: A list of strings with the phone numbers edited to the format '(+216) xx xx xx xx'.
    """
    # Match the eight digits of a number as four pairs, optionally preceded by 'Tel', 'T:' or the country code.
    pattern = re.compile(r'\b(?:Tel|T\s?:?\s?(?:\(\+216\)|\+216)?)?\s?(\d{2})\s?(\d{2})\s?(\d{2})\s?(\d{2})\b')

    # Strip all whitespace from each match, then rewrite it as '(+216) xx xx xx xx'.
    # A new list is built, so the list passed in by the caller is left unmodified.
    return [pattern.sub(r'(+216) \1 \2 \3 \4', ''.join(match.split())) for match in matches]

In [972]:
def fetch_matches_from_sys_file(file_url):
    """
    Extract and count different types of contacts (mails, codes, tels) from a system file (HTML file).

    Parameters:
    - file_url (str): The URL or path to the file.

    Returns:
    - dict: A dictionary containing counts of different types of contacts extracted from the file.
    """
    # Define a dictionary to store matches for different contact types.
    contact_matches = {'mails': [], 'code': [], 'tels': []}

    # Read the contents of the system file.
    with open(file=file_url) as f:
        # Read all lines from the file.
        lines = f.readlines()

        # Iterate through each line in the file.
        for line in lines:
            # Check each contact type and extract matches using corresponding regex patterns (check first cell after imported packages).
            for type, pattern in contact_re.items():
                matches = pattern.findall(line)

                # If the contact type is 'tels', edit the phone number format.
                if type == 'tels': 
                    matches = edit_phone_number_format(matches)

                # Extend the list of matches for the current contact type.
                contact_matches[type].extend(matches)
    
    # Create a dictionary to store counts of different contact types.
    sys_matches = {}

    # Calculate counts for each contact type using the list_to_dict_of_counts function predefined previously.
    for type in contact_matches:
        sys_matches[type] = list_to_dict_of_counts(contact_matches[type])

    # Key the counts by the file name without directory or extension (e.g. './ENIS.htm' -> 'ENIS').
    file_id = os.path.splitext(os.path.basename(file_url))[0]
    sys_matches = {file_id: sys_matches}

    # Return the final dictionary.
    return sys_matches
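
The key of the returned dictionary is the file name without its directory or extension. One way to derive it, sketched here with `pathlib`, is the `stem` attribute. `file_identifier` is a hypothetical helper, and forward-slash paths are assumed.

```python
from pathlib import PurePosixPath

def file_identifier(path):
    """Return the file name without its directory and without its last extension."""
    return PurePosixPath(path).stem

print(file_identifier('./ENIS.htm'))        # ENIS
print(file_identifier("./ENET'Com.htm"))    # ENET'Com
print(file_identifier('pages/ENIS.htm'))    # ENIS
```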

In [973]:
def get_htm_files(url='./'):
    """
    Get a list of the HTML ('.htm') files in a specified directory.

    Parameters:
    - url (str): The URL or path to the directory (default is the current directory).

    Returns:
    - list: A list of paths to HTML files in the specified directory.
    """
    # Initialize an empty list to store paths to HTML files.
    files = []

    # Iterate through files in the specified directory.
    for f in os.listdir(url):
        # Keep only names ending with a literal '.htm'; the dot is escaped so it cannot match any character.
        if re.search(r'\.htm$', f):
            files.append(os.path.join(url, f))

    # Return the list of HTML files.
    return files
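
Listing files by extension can also be sketched with `pathlib` globbing, which avoids writing a regular expression for the file name. `list_htm_files` is a hypothetical name, and the file names in the usage example are made up.

```python
import tempfile
from pathlib import Path

def list_htm_files(directory='.'):
    """Return the sorted paths (as strings) of the '*.htm' files directly inside directory."""
    return sorted(str(p) for p in Path(directory).glob('*.htm'))

# Usage on a throwaway directory containing made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('ENIS.htm', 'notes.txt', 'xhtm'):
        Path(tmp, name).touch()
    found = [Path(p).name for p in list_htm_files(tmp)]

print(found)  # ['ENIS.htm']
```

Sorting the result makes the processing order reproducible, which `os.listdir` does not guarantee.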

In [974]:
def merge_dictionaries(dict1, dict2):
    """
    Merge two dictionaries into a new dictionary.

    Parameters:
    - dict1 (dict): The first dictionary.
    - dict2 (dict): The second dictionary.

    Returns:
    - dict: A new dictionary containing the merged key-value pairs.
    """
    # Create a copy of the first dictionary.
    merged_dict = dict1.copy()

    # Update the copy with key-value pairs from the second dictionary.
    merged_dict.update(dict2)

    # Return the merged dictionary.
    return merged_dict
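
A copy followed by `update` is a shallow merge: when both dictionaries share a key, the value from the second one wins. The sketch below shows the same behavior with dictionary unpacking. `merge` is a hypothetical name and the sample values are illustrative.

```python
def merge(dict1, dict2):
    """Return a new dict with the pairs of dict1, overridden by those of dict2."""
    return {**dict1, **dict2}

a = {'ENIS': {'code': {'3038': 2}}}
b = {'ENIM': {'code': {'5019': 2}}}

merged = merge(a, b)
print(sorted(merged))                 # ['ENIM', 'ENIS']
print(merge({'x': 1}, {'x': 2}))      # {'x': 2}
print(a)                              # unchanged: {'ENIS': {'code': {'3038': 2}}}
```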

In [975]:
# Verify results
matches_per_file = [fetch_matches_from_sys_file(f) for f in get_htm_files()]
# The empty-dict initializer keeps reduce() from raising if no .htm file is found.
sys_contacts = reduce(merge_dictionaries, matches_per_file, {})
print(sys_contacts)

{"ENET'Com": {'mails': {'contact@enetcom.usf.tn': 3}, 'code': {'3018': 2}, 'tels': {'(+216) 74 86 30 47': 2, '(+216) 74 86 25 00': 2, '(+216) 74 86 30 37': 2}}, 'ENIM': {'mails': {'enim@enim.rnu.tn': 3}, 'code': {'5019': 2}, 'tels': {'(+216) 73 50 05 11': 2, '(+216) 73 50 05 14': 4}}, 'ENIS': {'mails': {'webmaster@enis.tn': 4}, 'code': {'3038': 2}, 'tels': {'(+216) 70 25 85 20': 2, '(+216) 74 27 55 95': 1}}, 'FSEGMA': {'mails': {'fsegma@fsegma.rnu.tn': 2}, 'code': {'5111': 1}, 'tels': {'(+216) 73 68 31 91': 1, '(+216) 73 68 31 92': 1}}, 'FSEGS': {'mails': {'contact@fsegs.rnu.tn': 1}, 'code': {'3018': 1}, 'tels': {'(+216) 74 27 87 77': 1, '(+216) 74 27 91 39': 1}}, 'FSM': {'mails': {'fsm@fsm.rnu.tn': 1}, 'code': {'5019': 2}, 'tels': {'(+216) 73 50 02 76': 1, '(+216) 73 50 02 78': 1}}, 'FSS': {'mails': {'contact@fss.rnu.tn': 3}, 'code': {'3000': 2}, 'tels': {'(+216) 74 27 64 00': 2, '(+216) 74 27 67 63': 2, '(+216) 74 27 44 37': 2}}, 'ISGIS': {'mails': {'direction.isgis@isgis.usf.tn': 2}

In [976]:
def trait_ref_file(url='./ref.txt'):
    """
    Read and process the reference file into a structured dictionary.

    Parameters:
    - url (str): The URL or path to the reference file (default is './ref.txt').

    Returns:
    - dict: A nested dictionary containing structured information from the reference file.
    """
    # Read the reference file into a Pandas DataFrame.
    ref = pd.read_csv(url, delimiter='\t', names=['File', 'Type', 'Value', 'Counts'])
    
    # Convert 'Counts' column to integer type.
    ref['Counts'] = ref['Counts'].astype('int')
    
    # Initialize a dictionary to store processed information from the reference file.
    ref_matches = {}

    # Iterate through rows in the Pandas DataFrame.
    for _, row in ref.iterrows():
        # Check if the file is already in the dictionary; if not, add it.
        if row['File'] not in ref_matches:
            ref_matches[row['File']] = {'mails': {}, 'code': {}, 'tels': {}}
        
        # Fax numbers are counted together with phone numbers under 'tels'.
        contact_type = 'tels' if row['Type'] == 'fax' else row['Type']

        # Accumulate the count so that a repeated value is summed rather than overwritten,
        # whatever the order of the rows.
        values = ref_matches[row['File']][contact_type]
        values[row['Value']] = values.get(row['Value'], 0) + row['Counts']
    
    # Return the structured dictionary.
    return ref_matches
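
The reference file is read as four tab-separated columns (file, type, value, count), with fax rows folded into the phone numbers. The sketch below reproduces that idea with the standard `csv` module on an in-memory sample. The sample rows are made up for illustration and `parse_reference` is a hypothetical name.

```python
import csv
import io

# Made-up sample in the assumed layout: File <TAB> Type <TAB> Value <TAB> Counts.
SAMPLE = (
    "DEMO\tmails\tinfo@example.tn\t2\n"
    "DEMO\ttels\t(+216) 70 00 00 01\t1\n"
    "DEMO\tfax\t(+216) 70 00 00 01\t1\n"
    "DEMO\tcode\t1000\t3\n"
)

def parse_reference(text):
    """Build {file: {'mails': {...}, 'code': {...}, 'tels': {...}}} from tab-separated text."""
    result = {}
    for file_id, kind, value, count in csv.reader(io.StringIO(text), delimiter='\t'):
        per_file = result.setdefault(file_id, {'mails': {}, 'code': {}, 'tels': {}})
        # Fax rows are stored under 'tels'; counts for a repeated value are summed.
        bucket = per_file['tels' if kind == 'fax' else kind]
        bucket[value] = bucket.get(value, 0) + int(count)
    return result

ref = parse_reference(SAMPLE)
print(ref['DEMO']['tels'])  # {'(+216) 70 00 00 01': 2}
print(ref['DEMO']['code'])  # {'1000': 3}
```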

In [977]:
# Check process state
print(trait_ref_file())

{'ENIS': {'mails': {'webmaster@enis.tn': 4}, 'code': {'3038': 2}, 'tels': {'(+216) 70 25 85 20': 2, '(+216) 74 27 55 95': 1}}, "ENET'Com": {'mails': {'contact@enetcom.usf.tn': 3}, 'code': {'3018': 2}, 'tels': {'(+216) 74 86 30 47': 2, '(+216) 74 86 25 00': 2, '(+216) 74 86 30 37': 2}}, 'ENIM': {'mails': {'enim@enim.rnu.tn': 3}, 'code': {'5019': 2}, 'tels': {'(+216) 73 50 05 11': 2, '(+216) 73 50 05 14': 4}}, 'ISIMa': {'mails': {'isima@isima.rnu.tn': 1}, 'code': {'5111': 1}, 'tels': {'(+216) 73 68 31 00': 1, '(+216) 73 68 31 20': 1}}, 'ISGIS': {'mails': {'direction.isgis@isgis.usf.tn': 2}, 'code': {'3021': 4}, 'tels': {'(+216) 74 86 30 90': 1, '(+216) 74 86 30 92': 1}}, 'FSS': {'mails': {'contact@fss.rnu.tn': 3}, 'code': {'3000': 2}, 'tels': {'(+216) 74 27 64 00': 2, '(+216) 74 27 67 63': 2, '(+216) 74 27 44 37': 2}}, 'FSM': {'mails': {'fsm@fsm.rnu.tn': 1}, 'code': {'5019': 2}, 'tels': {'(+216) 73 50 02 76': 1, '(+216) 73 50 02 78': 1}}, 'FSEGS': {'mails': {'contact@fsegs.rnu.tn': 1}, '

In [978]:
def confront_dictionaries(sys_contacts, ref_contacts):
    """
    Compare two dictionaries containing contact information and compute precision, recall, and F1-score.

    Parameters:
    - sys_contacts (dict): Dictionary containing system-generated contact information.
    - ref_contacts (dict): Dictionary containing reference contact information.

    Returns:
    - tuple: A tuple containing the result dictionary, recall, precision, and F1-score.
    """
    # Initialize counters for intersection, system, and reference counts.
    INT = 0
    SYS = 0
    REF = 0

    # Initialize the result dictionary.
    result = {}

    # Iterate through files in the reference contacts.
    for file in ref_contacts:
        result[file] = {}

        # Get types for the current file from both system and reference contacts.
        ref_types = ref_contacts[file]
        sys_types = sys_contacts.get(file, None)

        # Iterate through contact types: 'mails', 'code', 'tels'.
        for type in ['mails', 'code', 'tels']:
            # Initialize a sub-dictionary for the current contact type in the result.
            res_elements = result[file][type] = {}

            # Iterate through elements (e.g., email addresses, postal codes, phone numbers).
            for element in ref_types[type]:
                # Get the count for the element from both system and reference contacts.
                ref_nbr = ref_types[type][element]
                sys_nbr = 0

                # Check if the element exists in the system contacts.
                if sys_types is not None and element in sys_types[type]:
                    sys_nbr = sys_types[type][element]

                # Update the result dictionary with counts for the current element.
                res_elements[element] = f'sys({sys_nbr}), ref({ref_nbr})'

                # Update counts for intersection, system, and reference.
                SYS += sys_nbr
                REF += ref_nbr
                INT += min(sys_nbr, ref_nbr)

            # Check for elements that exist only in the system contacts.
            if sys_types:
                for element in sys_types[type]:
                    if element not in res_elements:
                        sys_nbr = sys_types[type][element]
                        res_elements[element] = f'sys({sys_nbr}), ref(0)'
                        SYS += sys_nbr

    # Calculate recall, precision, and F1-score.
    R = 0.0 if REF == 0 else INT / REF
    P = 0.0 if SYS == 0 else INT / SYS
    F1 = 0.0 if R + P == 0 else 2 * P * R / (P + R)

    # Return the result dictionary, recall, precision, and F1-score as a tuple.
    return result, R, P, F1
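
The metric computation can be checked by hand on a tiny example. The sketch below uses `collections.Counter`, whose `&` operator keeps the minimum count of each shared key, as the intersection. `precision_recall_f1` is a hypothetical helper and the toy counts are made up.

```python
from collections import Counter

def precision_recall_f1(sys_counts, ref_counts):
    """Return (precision, recall, F1) for two {value: count} dictionaries."""
    sys_c, ref_c = Counter(sys_counts), Counter(ref_counts)

    # Intersection: for each value present on both sides, the smaller of the two counts.
    inter = sum((sys_c & ref_c).values())
    sys_total = sum(sys_c.values())
    ref_total = sum(ref_c.values())

    precision = inter / sys_total if sys_total else 0.0
    recall = inter / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# System found 'a' twice and 'b' once; the reference has 'a' once and 'c' once.
p, r, f1 = precision_recall_f1({'a': 2, 'b': 1}, {'a': 1, 'c': 1})
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.333 0.5 0.4
```

Here the intersection is 1, the system total is 3 and the reference total is 2, so precision is 1/3, recall is 1/2 and F1 is 0.4.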

In [979]:
def display_report(contacts, R, P, F1):
    """
    Display a contact information report including counts and evaluation metrics.

    Parameters:
    - contacts (dict): Dictionary containing contact information.
    - R (float): Recall.
    - P (float): Precision.
    - F1 (float): F1-score.

    Returns:
    - None: The function prints the report but doesn't return anything.
    """
    # Iterate through files in the contacts.
    for file in contacts:
        print('************ ', file, ' ************')
        stats = contacts[file]
        
        # Iterate through contact types.
        for type in stats:
            print('>>> ', type, ':')
            stats_type = stats[type]
            
            # Iterate through elements (e.g., email addresses, postal codes, phone numbers).
            for element in stats_type:
                print('\t+ ', element, ': ', stats_type[element])
        print()

    # Display evaluation metrics.
    print('------------------------------------')
    print('R =', R, ', P =', P, ', F1 =', F1)
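
For a more readable metrics line, the values can be rounded with an f-string format specification. `format_metrics` is a hypothetical helper, and three decimals is an arbitrary choice.

```python
def format_metrics(R, P, F1, digits=3):
    """Return the three metrics as one line, each rounded to `digits` decimals."""
    return f'R = {R:.{digits}f}, P = {P:.{digits}f}, F1 = {F1:.{digits}f}'

line = format_metrics(0.5, 1 / 3, 0.4)
print(line)  # R = 0.500, P = 0.333, F1 = 0.400
```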

### **Main Program**

In [980]:
ref_contacts = trait_ref_file()
comp, R, P, F1 = confront_dictionaries(sys_contacts, ref_contacts)
display_report(comp, R, P, F1)

************  ENIS  ************
>>>  mails :
	+  webmaster@enis.tn :  sys(4), ref(4)
>>>  code :
	+  3038 :  sys(2), ref(2)
>>>  tels :
	+  (+216) 70 25 85 20 :  sys(2), ref(2)
	+  (+216) 74 27 55 95 :  sys(1), ref(1)

************  ENET'Com  ************
>>>  mails :
	+  contact@enetcom.usf.tn :  sys(3), ref(3)
>>>  code :
	+  3018 :  sys(2), ref(2)
>>>  tels :
	+  (+216) 74 86 30 47 :  sys(2), ref(2)
	+  (+216) 74 86 25 00 :  sys(2), ref(2)
	+  (+216) 74 86 30 37 :  sys(2), ref(2)

************  ENIM  ************
>>>  mails :
	+  enim@enim.rnu.tn :  sys(3), ref(3)
>>>  code :
	+  5019 :  sys(2), ref(2)
>>>  tels :
	+  (+216) 73 50 05 11 :  sys(2), ref(2)
	+  (+216) 73 50 05 14 :  sys(4), ref(4)

************  ISIMa  ************
>>>  mails :
	+  isima@isima.rnu.tn :  sys(1), ref(1)
>>>  code :
	+  5111 :  sys(1), ref(1)
>>>  tels :
	+  (+216) 73 68 31 00 :  sys(1), ref(1)
	+  (+216) 73 68 31 20 :  sys(1), ref(1)

************  ISGIS  ************
>>>  mails :
	+  direction.isgis@is