# HomeAssistant BERT Training Data generation

This notebook generates training data for a BERT model that interprets sentences in Catalan.

It is based on the existing Catalan intent definitions in:
https://github.com/home-assistant/intents/tree/main/sentences/ca

The data in that repository is not meant for training a BERT system. It consists of sentence templates that are matched against user sentences to produce intents.

In this notebook, we expand those templates into concrete sentences that can be used to train a BERT system.
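
To make "expanding" concrete, here is a minimal, self-contained sketch. It is not the implementation used in this notebook: the helper name `expand_alternatives` and the sample template are made up for illustration, and it only handles non-nested `(a|b)` groups.

```python
import itertools
import re

def expand_alternatives(template):
    """Expand every non-nested (a|b|...) group into all combinations."""
    # Split into literal text and parenthesised groups, keeping the groups.
    parts = re.split(r'(\([^()]*\))', template)
    choices = []
    for part in parts:
        if part.startswith('(') and part.endswith(')'):
            choices.append(part[1:-1].split('|'))
        else:
            choices.append([part])
    # One output sentence per combination of options.
    return [''.join(combo) for combo in itertools.product(*choices)]

sentences = expand_alternatives("(encén|apaga) el llum de (la cuina|la sala)")
for s in sentences:
    print(s)
```

A template with two groups of two options each yields four sentences, which is why a handful of templates can grow into thousands of training examples.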


## Install required dependencies

In [21]:
%pip install pyyaml pandas --break-system-packages

Defaulting to user installation because normal site-packages is not writeable
ERROR: Operation cancelled by user
^C
Note: you may need to restart the kernel to use updated packages.


## Import required libs

In [None]:
import yaml
import os
import re
import itertools
import pandas as pd
import random
import json

In [None]:
def load_expansion_rules(common_file_path):
    """
    Load expansion rules from the _common.yaml file.
    """
    print(f"Loading expansion rules from {common_file_path}")
    with open(common_file_path, 'r', encoding='utf-8') as f:
        content = yaml.safe_load(f)
    return content.get('expansion_rules', {})


def expand_rules(sentence, expansion_rules):
    """
    Expand rules in the sentence using the provided expansion rules.
    """
    while '<' in sentence and '>' in sentence:
        match = re.search(r'<(.*?)>', sentence)
        if not match:
            break
        rule_name = match.group(1)
        rule_expansion = expansion_rules.get(rule_name, f"<{rule_name}>")
        old_sentence = sentence
        sentence = sentence.replace(f"<{rule_name}>", rule_expansion, 1)
        if sentence == old_sentence:
            print(f"Warning: No expansion found for {rule_name}. Keeping original.")
            break
    return sentence


def expand_blocks(sentence):
    """
    Expand blocks in the sentence between the specified initial and end characters.
    """
    initial_chars = ['(','[']
    end_chars = [')',']']
    expanded_sentences = []
    expanded=False
    
    # Only expand if the sentence contains both an opening and a closing character
    if any(char in sentence for char in initial_chars) and any(char in sentence for char in end_chars):
        end_char_pos_found = False
        initial_char_pos2_found = False

        # Position of the first opening character
        for initial_char_pos in range(len(sentence)):
            if sentence[initial_char_pos] in initial_chars:
                break
        # Position of the first closing character after it
        for end_char_pos in range(initial_char_pos+1, len(sentence)):
            if sentence[end_char_pos] in end_chars:
                end_char_pos_found = True
                break

        # Position of the next opening character, if any
        # (a nested block when it comes before the closing character)
        for initial_char_pos2 in range(initial_char_pos+1, len(sentence)):
            if sentence[initial_char_pos2] in initial_chars:
                initial_char_pos2_found = True
                break
        
        if end_char_pos_found and initial_char_pos2_found and initial_char_pos2 < end_char_pos:
            #execute the expansion recursive between the initial2 and end characters                       
            generatedsubstrings,expanded = expand_blocks(sentence[initial_char_pos2:end_char_pos+1])
            for generatedsubstring in generatedsubstrings:
                expanded_sentences.append(sentence[:initial_char_pos2] + generatedsubstring + sentence[end_char_pos+1:])
        else:
            # Expand the block between the opening and closing characters, generating one sentence per |-separated option
            options = sentence[initial_char_pos+1:end_char_pos].split('|')            
            for option in options:                
                expanded_sentences.append(sentence[:initial_char_pos] + option + sentence[end_char_pos+1:])
                expanded=True
            #if sentence[initial_char_pos]=='[':
            #    expanded_sentences.append(sentence[:initial_char_pos] + sentence[end_char_pos+1:])
            #    expanded=True
    else:
        expanded_sentences = [sentence]
    return expanded_sentences, expanded

def expand_sentence_blocks(sentence):
    """
    Expand blocks in the sentence using the provided expansion rules.
    """    
    sentences= [sentence]
    expanded=True
    while expanded:
        expanded=False
        outsentences = []
        for sentence in sentences:
            expanded_sentences,expanded_inner=expand_blocks(sentence)
            if expanded_inner:
                expanded=True
            for expanded_sentence in expanded_sentences:
                outsentences.append(expanded_sentence)
        # Remove duplicates while preserving order (the previous nested loop
        # dropped at most one duplicate per sentence)
        outsentences = list(dict.fromkeys(outsentences))
        sentences = outsentences
    return sentences

def expand_sentence(sentence, expansion_rules):
    """
    Expand a sentence using the provided expansion rules.
    """
    sentences = [sentence]
    outsentences=[]
    for sentence in sentences:
        outsentences.append(expand_rules(sentence, expansion_rules))

    sentences = outsentences
    outsentences = []
    for sentence in sentences:
        sentence_outsentences = expand_sentence_blocks(sentence)
        for sentence_outsentence in sentence_outsentences:
            outsentences.append(sentence_outsentence)
    sentences = outsentences
    return sentences

def load_sentences_from_yaml(file_path, expansion_rules):
    """
    Load sentences from a YAML file and expand them using the provided expansion rules.
    """
    print(f"Loading sentences from {file_path}")
    with open(file_path, 'r', encoding='utf-8') as f:
        content = yaml.safe_load(f)
    data = []
    # Navigate through the YAML structure
    for intent_name, intent_data in content.get('intents', {}).items():
        for item in intent_data.get('data', []):
            sentences = item.get('sentences', [])
            for sentence in sentences:
                # If the YAML entry defines slots, take the domain from slots -> domain
                slots = item.get('slots', {})
                domain = slots.get('domain', None)
                if domain is None:
                    # If no domain is found, fall back to the placeholder string 'None'
                    domain = 'None'
                expanded = expand_sentence(sentence, expansion_rules)
                for s in expanded:
                    data.append({'sentence': s, 'intent': intent_name, 'domain': domain})
    return data

def process_directory(yaml_dir):
    all_data = []
    # Process general YAML files
    common_file_path = os.path.join(yaml_dir, "_common.yaml")
    expansion_rules = load_expansion_rules(common_file_path)

    # Process each YAML file in the directory
    for file_name in os.listdir(yaml_dir):
        print(file_name)
        if file_name.endswith('.yaml') or file_name.endswith('.yml'):
            path = os.path.join(yaml_dir, file_name)
            sentences=load_sentences_from_yaml(path,expansion_rules)
            all_data.extend(sentences)
            print(f"Loaded {len(sentences)} sentences from {file_name}")
    return all_data
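
In the Home Assistant intents template syntax, square brackets mark optional text. `expand_blocks` above expands `[a|b]` the same way as `(a|b)`, and the branch that would also emit the sentence without the block is commented out. If the shorter variants are wanted in the training data, one possible approach is sketched below. The helper name `expand_optional` and the sample template are hypothetical, and the sketch only covers non-nested brackets.

```python
import itertools
import re

def expand_optional(template):
    """Expand non-nested [a|b] groups, including the variant that omits them."""
    parts = re.split(r'(\[[^\[\]]*\])', template)
    choices = []
    for part in parts:
        if part.startswith('[') and part.endswith(']'):
            # Every alternative inside the brackets, plus the empty string.
            choices.append(part[1:-1].split('|') + [''])
        else:
            choices.append([part])
    # Collapse the extra spaces left behind when a block is omitted.
    expanded = (' '.join(''.join(c).split()) for c in itertools.product(*choices))
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(expanded))

variants = expand_optional("encén [el] llum [si us plau]")
for v in variants:
    print(v)
```

Emitting the omitted variants multiplies the number of generated sentences, so it may be worth checking the per-file counts again after enabling something like this.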




## Process the directory containing the intents

In [None]:
if __name__ == "__main__":
    yaml_directory = "from_ha_intents/sentences/ca"
    output_csv = "hass_intents_ca.csv"

    data = process_directory(yaml_directory)
    df = pd.DataFrame(data)
    df.to_csv(output_csv, index=False)
    print(f"Dataset generated with {len(df)} sentences and saved to: {output_csv}")

Loading expansion rules from from_ha_intents/sentences/ca/_common.yaml
script_HassTurnOn.yaml
Loading sentences from from_ha_intents/sentences/ca/script_HassTurnOn.yaml
Loaded 32 sentences from script_HassTurnOn.yaml
cover_HassSetPosition.yaml
Loading sentences from from_ha_intents/sentences/ca/cover_HassSetPosition.yaml
Loaded 11186 sentences from cover_HassSetPosition.yaml
light_HassTurnOff.yaml
Loading sentences from from_ha_intents/sentences/ca/light_HassTurnOff.yaml
Loaded 5310 sentences from light_HassTurnOff.yaml
cover_HassTurnOff.yaml
Loading sentences from from_ha_intents/sentences/ca/cover_HassTurnOff.yaml
Loaded 824 sentences from cover_HassTurnOff.yaml
vacuum_HassVacuumStart.yaml
Loading sentences from from_ha_intents/sentences/ca/vacuum_HassVacuumStart.yaml
Loaded 48 sentences from vacuum_HassVacuumStart.yaml
scene_HassTurnOn.yaml
Loading sentences from from_ha_intents/sentences/ca/scene_HassTurnOn.yaml
Loaded 88 sentences from scene_HassTurnOn.yaml
homeassistant_HassTurnO

In [None]:
def SearchForTypes():
    """
    Search for types in the generated CSV file.
    """
    output_csv = "hass_intents_ca.csv"
    #read the csv file
    df = pd.read_csv(output_csv)
    # Search for sentences that contain { } and collect the distinct {types}
    numSentence=0   
    alltypes = []
    for sentence in df['sentence']:    
        numSentence+=1
        if '{' in sentence and '}' in sentence:
            if numSentence < 2: print(f"Sentence {numSentence} : {sentence}")
            #search for the types inside the { } and print them
            types = re.findall(r'\{(.*?)\}', sentence)
            for type in types:
                if numSentence < 2: print(f"Type: {type}")
                #if the type is not in the list, add it
                if type not in alltypes:
                    alltypes.append(type)
    print(f"All Types: {alltypes}")
    return alltypes

alltypes=SearchForTypes()

Sentence 1 : executa l'script el {name}
Type: name
All Types: ['name', 'area', 'position', 'cover_classes:device_class', 'zone:state', 'temperature', 'message']
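
`SearchForTypes` collects the distinct placeholder names. If the frequency of each placeholder is also of interest, a `collections.Counter` can tally them. The sketch below is self-contained and uses a tiny in-memory sample instead of the CSV. Only the first sample sentence comes from the output above; the other two and the helper name `count_placeholder_types` are made up for illustration.

```python
import re
from collections import Counter

def count_placeholder_types(sentences):
    """Count how many times each {placeholder} occurs across the sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r'\{(.*?)\}', sentence))
    return counts

sample = [
    "executa l'script el {name}",   # taken from the output above
    "obre {name} de {area}",        # hypothetical
    "posa {name} al {position}",    # hypothetical
]
type_counts = count_placeholder_types(sample)
print(type_counts.most_common())
```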


In [None]:
def DefineTypesAndPossibleValues():

    """
    Define the types and possible values for each type.
    """
    # Numeric slot values are rendered as strings, from 0 up to and including `upper`
    def numbers(upper):
        return [str(n) for n in range(upper + 1)]

    # Define the types and possible values
    alltypes= ['message', 'area', 'temperature', 'name', 'position', 'cover_classes:device_class', 'timer_seconds:start_seconds', 'timer_minutes:start_minutes', 'timer_hours:start_hours', 'timer_name:name', 'on_off_states:state', 'on_off_domains:domain', 'response', 'timer_command:conversation_command', 'timer_seconds:seconds', 'timer_minutes:minutes', 'timer_half:seconds', 'timer_hours:hours', 'timer_half:minutes', 'volume:volume_level', 'zone:state']
    alltypespossiblevalues = {
        'message': ['Bon dia a tothom', 'Bona tarda', 'El sopar està a punt', 'Alarma tothom a fora'],
        'area': ['menjador', 'cuina', 'habitació', 'sala estar', 'bany', 'lavabo'],
        'temperature': numbers(50),
        'name': ['llum cuina', 'persiana menjador', 'llum bany', 'llum habitació', 'llum sala estar', 'persiana menjador'],
        'position': numbers(100),
        'cover_classes:device_class': ['tendal', 'persiana', 'cortina', 'porta del garatge', 'porta', 'reixat','porticó', 'finestra'],
        'timer_seconds:start_seconds': numbers(60),
        'timer_minutes:start_minutes': numbers(60),
        'timer_hours:start_hours': numbers(60),
        'timer_name:name': ['temps cuina', 'temps espera'],
        'on_off_states:state': ['engegat', 'ences', 'aturat', 'apagat', 'desconnectat'],
        'on_off_domains:domain': ['llum', 'ventilador','interruptor','pany'],
        'response': ['Sí', 'No', 'fet', 'no ho sé'],
        'timer_command:conversation_command': ['encendre', 'apagar'],
        'timer_seconds:seconds': numbers(60),
        'timer_minutes:minutes': numbers(60),
        'timer_half:seconds': numbers(60),
        'timer_hours:hours': numbers(60),
        'timer_half:minutes': numbers(60),
        'volume:volume_level': numbers(100),
        'zone:state': ['menjador', 'cuina', 'habitació', 'sala estar', 'bany', 'lavabo'],
        # Add more types and possible values as needed
    }
    return alltypes, alltypespossiblevalues

In [23]:
def SearchForLabels():
    """Build BIO slot labels for the generated sentences and write the dataset, label, intent and domain files."""
    alltypes, alltypespossiblevalues = DefineTypesAndPossibleValues()
    print(f"All Types: {alltypes}")
    print(f"All Types Possible Values: {alltypespossiblevalues}")

    # Define all possible labels based on the types
    all_labels = []
    for type in alltypes:
        all_labels.append("B-" + type)
        all_labels.append("I-" + type)
    all_labels.append("O")

    print(f"All Labels: {all_labels}")
    #save the labels to json file
    with open('slots.json', 'w', encoding='utf-8') as f:
        json.dump(all_labels, f, ensure_ascii=False, indent=4)

    output_csv = "hass_intents_ca.csv"
    # Read the CSV file
    df = pd.read_csv(output_csv)
    numSentence = 0

    # Create a DataFrame with the columns sentence, text, intent, labels, and domain
    df_labels = pd.DataFrame(columns=['sentence', 'text', 'intent', 'labels', 'domain'])

    # Extract sentences and intents from the CSV file
    for index, row in df.iterrows():
        sentence = row['sentence']
        intent = row['intent']
        domain = row['domain']
        if intent not in ["HassTurnOn", "HassTurnOff"]: continue

        numSentence += 1
        words = sentence.split(' ')
        labels = []
        new_sentence = ""

        for word in words:
            if '{' in word and '}' in word:
                types = re.findall(r'\{(.*?)\}', word)
                for type in types:
                    if type in alltypes:
                        substext = random.choice(alltypespossiblevalues[type])
                        #print(f"Type: {type}, Subtext: {substext}")
                        subtextwords = substext.split(' ')
                        first = True
                        for subtextword in subtextwords:
                            if first:
                                labels.append("B-" + type)
                                first = False
                            else:
                                labels.append("I-" + type)
                            new_sentence += subtextword + " "
                    else:
                        labels.append("O")
                        new_sentence += word + " "
            else:
                labels.append("O")
                new_sentence += word + " "
        new_sentence = new_sentence[:-1]

        # Add the new row to the DataFrame
        new_row = {'sentence': sentence, 'text': new_sentence, 'intent': intent, 'domain': domain, 'labels': labels}
        df_labels = pd.concat([df_labels, pd.DataFrame([new_row])], ignore_index=True)

        #print(f"Sentence {numSentence} : {sentence}, NewSentence: {new_sentence}, Intent: {intent}, Labels: {labels}")

    # Save the DataFrame to a JSON Lines file
    # Convert the DataFrame to a list of dictionaries
    data = df_labels.to_dict(orient='records')

    # Write one JSON object per line, with ensure_ascii=False to keep accented characters
    with open('dataset_homeassistant.jsonl', 'w', encoding='utf-8') as f:
        f.write('\n'.join([json.dumps(c, ensure_ascii=False) for c in data]) + '\n')

    # Extract all the intents from the CSV file
    all_intents = df['intent'].unique()
    # Add intent "none" to the list of intents
    all_intents = list(all_intents) + ['none']

    
    print(f"All Intents: {all_intents}")
    # Save the intents to a JSON file
    with open('intent-types.json', 'w', encoding='utf-8') as f:
        json.dump(all_intents, f, ensure_ascii=False, indent=4)

    # Extract all the domains from the CSV file
    all_domains = df['domain'].unique()
    # Add domain "none" to the list of domains
    all_domains = list(all_domains) + ['none']
    print(f"All Domains: {all_domains}")
    # Save the domains to a JSON file
    with open('domain-types.json', 'w', encoding='utf-8') as f:
        json.dump(all_domains, f, ensure_ascii=False, indent=4)

        

SearchForLabels()

All Types: ['message', 'area', 'temperature', 'name', 'position', 'cover_classes:device_class', 'timer_seconds:start_seconds', 'timer_minutes:start_minutes', 'timer_hours:start_hours', 'timer_name:name', 'on_off_states:state', 'on_off_domains:domain', 'response', 'timer_command:conversation_command', 'timer_seconds:seconds', 'timer_minutes:minutes', 'timer_half:seconds', 'timer_hours:hours', 'timer_half:minutes', 'volume:volume_level', 'zone:state']
All Types Possible Values: {'message': ['Bon dia a tothom', 'Bona tarda', 'El sopar està a punt', 'Alarma tothom a fora'], 'area': ['menjador', 'cuina', 'habitació', 'sala estar', 'bany', 'lavabo'], 'temperature': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50'], 'name': ['llum cuina', 'persiana m
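
The labelling scheme used by `SearchForLabels` is BIO: the first word of a filled slot gets `B-<type>`, the following words of the same slot get `I-<type>`, and everything else gets `O`. The sketch below illustrates this on a single template. It is a simplified stand-in rather than the function above: the helper name `bio_label` and the template are hypothetical, slot values are passed in explicitly instead of being drawn with `random.choice`, and only placeholders that form a whole word are handled.

```python
def bio_label(template, values):
    """Fill {slot} placeholders and return (tokens, labels) in BIO format."""
    tokens, labels = [], []
    for word in template.split(' '):
        if word.startswith('{') and word.endswith('}'):
            slot = word[1:-1]
            # First word of the value opens the slot, the rest continue it.
            for i, value_word in enumerate(values[slot].split(' ')):
                tokens.append(value_word)
                labels.append(('B-' if i == 0 else 'I-') + slot)
        else:
            tokens.append(word)
            labels.append('O')
    return tokens, labels

tokens, labels = bio_label("encén {name} de {area}",
                           {'name': 'llum cuina', 'area': 'sala estar'})
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Each token has exactly one label, which is the invariant a token-classification model expects from the `text` and `labels` fields.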

In [25]:
# Merge with LLM dataset

with open('dataset_homeassistant.jsonl', 'r', encoding='utf-8') as f1:
    with open('../dataset_augmented.jsonl', 'r', encoding='utf-8') as f2:
        with open('../dataset_augmented_merged.jsonl', 'w', encoding='utf-8') as f3:
            f3.write(f1.read() + f2.read())
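
Before training on the merged file, it may be worth checking that every record still has one label per word. The sketch below assumes that records in both files carry the `text` and `labels` fields written by `SearchForLabels`; that is an assumption about `dataset_augmented.jsonl`, whose format is not shown here. The helper name `find_misaligned` and the sample records are hypothetical.

```python
import json

def find_misaligned(lines):
    """Return (line_number, record) pairs whose word and label counts differ."""
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between concatenated files
        record = json.loads(line)
        if len(record['text'].split(' ')) != len(record['labels']):
            bad.append((number, record))
    return bad

# In-memory stand-in for the lines of a merged JSONL file.
sample_lines = [
    json.dumps({'text': 'encén llum cuina', 'labels': ['O', 'B-name', 'I-name']}, ensure_ascii=False),
    json.dumps({'text': 'apaga el llum', 'labels': ['O', 'O']}, ensure_ascii=False),
]
problems = find_misaligned(sample_lines)
print([number for number, _ in problems])
```

For the real file, the same function could be fed an open file handle, since iterating over a text file yields its lines.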


## Ricard's notes

Still to be polished in the dataset:

* Put B-ACTION-TURN-ON or B-ACTION-TURN-OFF on the first word of each sentence, except:
  * If the sentence starts with "pots"/"podries", the label goes on the second word.
  * If it starts with "tornar a", the label goes on the third word.
* Many sentences have no label at all (those containing "llum"). When a sentence has no label, we should search for "llum" and label it manually.
* Many sentences contain "<everywhere>". It should be removed and the label b-place (or whichever applies) put on the last 3 words ("A tot arreu").
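
The first note in the list above (where to place the action label) could be sketched as follows. This is only one possible reading of the rule: the helper names are hypothetical, and the mapping from the `HassTurnOn`/`HassTurnOff` intents to the two action labels is an assumption.

```python
def action_label_index(words):
    """Return the index of the word that should carry the action label."""
    lowered = [w.lower() for w in words]
    if lowered[:2] == ['tornar', 'a']:
        return 2  # third word
    if lowered and lowered[0] in ('pots', 'podries'):
        return 1  # second word
    return 0      # first word

def add_action_label(words, labels, intent):
    """Return a copy of labels with the action label set at the chosen position."""
    # Assumed mapping from intent name to action label.
    action = {'HassTurnOn': 'B-ACTION-TURN-ON',
              'HassTurnOff': 'B-ACTION-TURN-OFF'}[intent]
    index = action_label_index(words)
    updated = list(labels)
    if index < len(updated):
        updated[index] = action
    return updated

words = "pots encendre llum cuina".split(' ')
labels = ['O', 'O', 'B-name', 'I-name']
print(add_action_label(words, labels, 'HassTurnOn'))
```

Note that `slots.json` would also need the two action labels added, since the label list written earlier only contains the `B-`/`I-` slot types and `O`.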
