# Loading and cleaning the data for fine-tuning the models
1. Loading the Linux/Bash command datasets
2. Loading the SQL query datasets
3. Cleaning the data
4. Upload to Huggingface


Loading the required libraries

In [30]:
# For general data operations
import pandas as pd

# For loading datasets from Huggingface
from datasets import load_dataset, Dataset as DS
from torch.utils.data import Dataset

# For the SQL and Bash syntax checks
import sqlparse
import subprocess

Creating the pandas DataFrames in which all data points are collected (one for SQL, one for Bash)

In [31]:
# Create the empty DataFrame
data = {
    "nl_prompt": [],
    "command": []
}

complete_data_sql = pd.DataFrame(data)
complete_data_bash = complete_data_sql.copy()


# Linux/Bash Commands

### HF [AnishJoshi/nl2bash-custom](https://huggingface.co/datasets/AnishJoshi/nl2bash-custom)
License=?



In [32]:
dataset1 = load_dataset("AnishJoshi/nl2bash-custom")

Repo card metadata block was not found. Setting CardData to empty.


In [33]:
dataset1

DatasetDict({
    train: Dataset({
        features: ['bash_code', 'nl_command', 'srno'],
        num_rows: 19658
    })
    validation: Dataset({
        features: ['bash_code', 'nl_command', 'srno'],
        num_rows: 2457
    })
    test: Dataset({
        features: ['bash_code', 'nl_command', 'srno'],
        num_rows: 2458
    })
})

Merging the train/validation/test splits of the dataset

In [34]:
train_df1 = pd.DataFrame(dataset1['train'])
validation_df1 = pd.DataFrame(dataset1['validation'])
test_df1 = pd.DataFrame(dataset1['test'])

dataset1_complete = pd.concat([
    train_df1.assign(split='train'),
    validation_df1.assign(split='validation'),
    test_df1.assign(split='test')
])

dataset1_complete

Unnamed: 0,bash_code,nl_command,srno,split
0,find -mtime +2 -mtime -5,Look for any files that were modified 2-5 days...,3194,train
1,"tempFile=""$(mktemp ""${TMPDIR:-/tmp/}$(basename...",Creates temporary file in a TMPDIR folder or /...,8529,train
2,find . -type f -name '* *',Find all files with space in their names under...,240,train
3,"#!/bin/bash\n\ndir=""/path/to/directory""\nfile=...",Create a script to check if a file exists in a...,1433,train
4,find . -iname foo -type f,find regular which case-insensitive name is fo...,2029,train
...,...,...,...,...
2453,#!/bin/bash\n\nmounted_filesystems=$(df -h | g...,Create a script to monitor system mounted file...,9052,test
2454,comm -1 -2 file1.sorted file2.sorted,Print only common strings in content of files ...,9321,test
2455,"find . -name ""filename including space"" -print0",display all the files having spaces in the cur...,6952,test
2456,#!/bin/bash\n\nfor ((i=0; i<6; i++)); do\n ...,Monitor system filesystem mount options every ...,2113,test
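
The split-merging step recurs for every dataset that ships with several splits. A minimal sketch of a reusable helper, assuming the splits have already been converted to pandas DataFrames; the name `merge_splits` and the toy data are hypothetical and not part of the notebook:

```python
import pandas as pd

def merge_splits(splits: dict) -> pd.DataFrame:
    """Concatenate all splits and record each row's origin in a 'split' column."""
    frames = [df.assign(split=name) for name, df in splits.items()]
    # ignore_index=True yields one continuous index instead of restarting per split
    return pd.concat(frames, ignore_index=True)

# Toy stand-in for a dataset with two splits (hypothetical data)
toy_splits = {
    "train": pd.DataFrame({"bash_code": ["ls", "pwd"],
                           "nl_command": ["list files", "show directory"]}),
    "test": pd.DataFrame({"bash_code": ["whoami"],
                          "nl_command": ["show user"]}),
}
merged = merge_splits(toy_splits)
```

Unlike the cell above, this sketch passes `ignore_index=True`, so the row labels do not restart at 0 for each split.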


Appending the dataset to the complete_data_bash DataFrame

In [35]:
# Select the desired columns
new_rows1 = dataset1_complete[['bash_code', 'nl_command']].copy()

# Rename the selected columns
new_rows1.rename(columns={
    'bash_code': 'command',
    'nl_command': 'nl_prompt'
}, inplace=True)

# Append to complete_data_bash
complete_data_bash = pd.concat([complete_data_bash, new_rows1], ignore_index=True)
complete_data_bash

Unnamed: 0,nl_prompt,command
0,Look for any files that were modified 2-5 days...,find -mtime +2 -mtime -5
1,Creates temporary file in a TMPDIR folder or /...,"tempFile=""$(mktemp ""${TMPDIR:-/tmp/}$(basename..."
2,Find all files with space in their names under...,find . -type f -name '* *'
3,Create a script to check if a file exists in a...,"#!/bin/bash\n\ndir=""/path/to/directory""\nfile=..."
4,find regular which case-insensitive name is fo...,find . -iname foo -type f
...,...,...
24568,Create a script to monitor system mounted file...,#!/bin/bash\n\nmounted_filesystems=$(df -h | g...
24569,Print only common strings in content of files ...,comm -1 -2 file1.sorted file2.sorted
24570,display all the files having spaces in the cur...,"find . -name ""filename including space"" -print0"
24571,Monitor system filesystem mount options every ...,#!/bin/bash\n\nfor ((i=0; i<6; i++)); do\n ...
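
The select, rename and append pattern is repeated for every dataset in this notebook. One possible way to factor it out is sketched below; `normalize_pairs` is a hypothetical helper and the two small frames are made-up stand-ins for the real datasets:

```python
import pandas as pd

def normalize_pairs(source: pd.DataFrame, prompt_col: str, command_col: str) -> pd.DataFrame:
    """Select the prompt/command columns and rename them to the common schema."""
    return source[[prompt_col, command_col]].rename(
        columns={prompt_col: "nl_prompt", command_col: "command"}
    )

# Hypothetical miniature versions of two source datasets
source_a = pd.DataFrame({"bash_code": ["ls", "pwd"],
                         "nl_command": ["list files", "show directory"]})
source_b = pd.DataFrame({"prompt": ["Open the Firefox browser"],
                         "response": ["firefox &"]})

# Collect the normalized frames first, then concatenate once
frames = [
    normalize_pairs(source_a, "nl_command", "bash_code"),
    normalize_pairs(source_b, "prompt", "response"),
]
combined = pd.concat(frames, ignore_index=True)
```

Collecting the frames in a list and concatenating once avoids starting from an empty DataFrame and copying the growing result on every append.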


### HF [Romit2004/LinuxCommands](https://huggingface.co/datasets/Romit2004/LinuxCommands)
License=MIT

### Insufficient data quality! This dataset is skipped.

In [36]:
#dataset2 = load_dataset("Romit2004/LinuxCommands")
#dataset2


### HF [bajrangCoder/linux_cmd_alpaca](https://huggingface.co/datasets/bajrangCoder/linux_cmd_alpaca)
License=MIT

### Insufficient data quality! This dataset is skipped.

In [37]:
#dataset3 = load_dataset("bajrangCoder/linux_cmd_alpaca")
#dataset3

### HF [aelhalili/bash-commands-dataset](https://huggingface.co/datasets/aelhalili/bash-commands-dataset)
License=MIT

In [38]:
dataset4 = load_dataset("aelhalili/bash-commands-dataset")
dataset4

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response'],
        num_rows: 840
    })
})

Converting the dataset to a DataFrame

In [39]:
dataset4_complete = pd.DataFrame(dataset4['train'])
dataset4_complete

Unnamed: 0,prompt,response
0,Move a file called x from the Desktop to the D...,mv ~/Desktop/x ~/Downloads/
1,Open YouTube and search for videos by Mr Beast,xdg-open 'https://www.youtube.com/results?sear...
2,Create a folder named projects inside the Docu...,mkdir ~/Documents/projects
3,Open the Firefox browser,firefox &
4,Search for all PNG files in the Pictures folder,find ~/Pictures -name '*.png'
...,...,...
835,Start the Apache service,sudo systemctl start apache2
836,Enable Apache to start on boot,sudo systemctl enable apache2
837,Stop the Apache service,sudo systemctl stop apache2
838,Restart the Apache service,sudo systemctl restart apache2


Appending the dataset to the complete_data_bash DataFrame

In [40]:
# Select the desired columns
new_rows4 = dataset4_complete[['prompt', 'response']].copy()

# Rename the selected columns
new_rows4.rename(columns={
    'response': 'command',
    'prompt': 'nl_prompt'
}, inplace=True)

# Append to complete_data_bash
complete_data_bash = pd.concat([complete_data_bash, new_rows4], ignore_index=True)
complete_data_bash

Unnamed: 0,nl_prompt,command
0,Look for any files that were modified 2-5 days...,find -mtime +2 -mtime -5
1,Creates temporary file in a TMPDIR folder or /...,"tempFile=""$(mktemp ""${TMPDIR:-/tmp/}$(basename..."
2,Find all files with space in their names under...,find . -type f -name '* *'
3,Create a script to check if a file exists in a...,"#!/bin/bash\n\ndir=""/path/to/directory""\nfile=..."
4,find regular which case-insensitive name is fo...,find . -iname foo -type f
...,...,...
25408,Start the Apache service,sudo systemctl start apache2
25409,Enable Apache to start on boot,sudo systemctl enable apache2
25410,Stop the Apache service,sudo systemctl stop apache2
25411,Restart the Apache service,sudo systemctl restart apache2


### Kaggle [Kushagragoyal060705/Complex-Linux-commands-from-natual-language](https://www.kaggle.com/datasets/kushagragoyal060705/complex-linux-commands-from-natual-language)
License=Apache 2.0

In [41]:
dataset5 = load_dataset("terminAl-thesis-2025/big_bash")
dataset5

DatasetDict({
    train: Dataset({
        features: ['input', 'output'],
        num_rows: 1000000
    })
})

In [42]:
dataset5 = pd.DataFrame(dataset5['train'])
dataset5

Unnamed: 0,input,output
0,Find all files modified in the last 7 days,find . -type f -mtime -7
1,Recursively change ownership of a directory to...,chown -R john:john /path/to/directory
2,List open files by a process with PID 1234,lsof -p 1234
3,Monitor system resource usage dynamically,dstat -cdngy
4,Copy all .log files to a backup directory,cp *.log /path/to/backup/
...,...,...
999995,List all running services,systemctl list-units --type=service
999996,Find and replace 'foo' with 'bar' in all .txt ...,sed -i 's/foo/bar/g' *.txt
999997,Check if port 8080 is in use,netstat -tulnp | grep 8080
999998,Display real-time disk I/O stats,iostat -dx 1


Appending the dataset to the complete_data_bash DataFrame


In [43]:
# Select the desired columns
new_rows5 = dataset5[['input', 'output']].copy()

# Rename the selected columns
new_rows5.rename(columns={
    'output': 'command',
    'input': 'nl_prompt'
}, inplace=True)

# Append to complete_data_bash
complete_data_bash = pd.concat([complete_data_bash, new_rows5], ignore_index=True)
complete_data_bash

Unnamed: 0,nl_prompt,command
0,Look for any files that were modified 2-5 days...,find -mtime +2 -mtime -5
1,Creates temporary file in a TMPDIR folder or /...,"tempFile=""$(mktemp ""${TMPDIR:-/tmp/}$(basename..."
2,Find all files with space in their names under...,find . -type f -name '* *'
3,Create a script to check if a file exists in a...,"#!/bin/bash\n\ndir=""/path/to/directory""\nfile=..."
4,find regular which case-insensitive name is fo...,find . -iname foo -type f
...,...,...
1025408,List all running services,systemctl list-units --type=service
1025409,Find and replace 'foo' with 'bar' in all .txt ...,sed -i 's/foo/bar/g' *.txt
1025410,Check if port 8080 is in use,netstat -tulnp | grep 8080
1025411,Display real-time disk I/O stats,iostat -dx 1


# SQL Queries/Commands

### HF [zerolink/zsql-postgres-dpo](https://huggingface.co/datasets/zerolink/zsql-postgres-dpo)
### License=(Multiple licenses, including "viral" Creative Commons licenses)! This dataset is skipped.

In [44]:
#dataset6 = load_dataset("zerolink/zsql-postgres-dpo")
#dataset6

### HF [omeryentur/text-to-postgresql](https://huggingface.co/datasets/omeryentur/text-to-postgresql)
License=?

In [45]:
dataset7 = load_dataset("omeryentur/text-to-postgresql")
dataset7

Repo card metadata block was not found. Setting CardData to empty.


DatasetDict({
    train: Dataset({
        features: ['question', 'query', 'schema', '__index_level_0__'],
        num_rows: 58696
    })
})

Converting the dataset to a DataFrame

In [46]:
dataset7_complete = pd.DataFrame(dataset7['train'])
dataset7_complete

Unnamed: 0,question,query,schema,__index_level_0__
0,What is the total number of addresses for each...,"SELECT country, state_province_county, COUNT(*...",\nCREATE TABLE addresses (\n\taddress_id INTEG...,0
1,Find all addresses in the United States that d...,SELECT * FROM addresses WHERE country = 'Unite...,\nCREATE TABLE addresses (\n\taddress_id INTEG...,1
2,What is the count of addresses in Europe by ci...,"SELECT city, COUNT(*) AS address_count FROM ad...",\nCREATE TABLE addresses (\n\taddress_id INTEG...,2
3,"Get a list of all countries, along with the nu...","SELECT country, COUNT(*) AS address_count FROM...",\nCREATE TABLE addresses (\n\taddress_id INTEG...,3
4,What is the total number of addresses for each...,"SELECT city, COUNT(*) AS total_addresses FROM ...",\nCREATE TABLE addresses (\n\taddress_id INTEG...,4
...,...,...,...,...
58691,What are the languages spoken in the country o...,SELECT language FROM countrylanguage WHERE cou...,"\nCREATE TABLE city (\n\tcity_id INTEGER, \n\t...",71962
58692,What is the city with the highest GDP in the S...,"SELECT city, gdp FROM city WHERE hanzi LIKE 'S...","\nCREATE TABLE city (\n\tcity_id INTEGER, \n\t...",71963
58693,What is the average population of cities in th...,SELECT AVG(regional_population) FROM city WHER...,"\nCREATE TABLE city (\n\tcity_id INTEGER, \n\t...",71965
58694,Which country has the highest GDP among countr...,"SELECT country_name, SUM(gdp) AS total_gdp FRO...","\nCREATE TABLE city (\n\tcity_id INTEGER, \n\t...",71966


Appending the dataset to the complete_data_sql DataFrame

In [47]:
# Select the desired columns
new_rows7 = dataset7_complete[['question', 'query']].copy()

# Rename the selected columns
new_rows7.rename(columns={
    'query': 'command',
    'question': 'nl_prompt'
}, inplace=True)

# Append to complete_data_sql
complete_data_sql = pd.concat([complete_data_sql, new_rows7], ignore_index=True)
complete_data_sql

Unnamed: 0,nl_prompt,command
0,What is the total number of addresses for each...,"SELECT country, state_province_county, COUNT(*..."
1,Find all addresses in the United States that d...,SELECT * FROM addresses WHERE country = 'Unite...
2,What is the count of addresses in Europe by ci...,"SELECT city, COUNT(*) AS address_count FROM ad..."
3,"Get a list of all countries, along with the nu...","SELECT country, COUNT(*) AS address_count FROM..."
4,What is the total number of addresses for each...,"SELECT city, COUNT(*) AS total_addresses FROM ..."
...,...,...
58691,What are the languages spoken in the country o...,SELECT language FROM countrylanguage WHERE cou...
58692,What is the city with the highest GDP in the S...,"SELECT city, gdp FROM city WHERE hanzi LIKE 'S..."
58693,What is the average population of cities in th...,SELECT AVG(regional_population) FROM city WHER...
58694,Which country has the highest GDP among countr...,"SELECT country_name, SUM(gdp) AS total_gdp FRO..."


In [48]:
new_rows7

Unnamed: 0,nl_prompt,command
0,What is the total number of addresses for each...,"SELECT country, state_province_county, COUNT(*..."
1,Find all addresses in the United States that d...,SELECT * FROM addresses WHERE country = 'Unite...
2,What is the count of addresses in Europe by ci...,"SELECT city, COUNT(*) AS address_count FROM ad..."
3,"Get a list of all countries, along with the nu...","SELECT country, COUNT(*) AS address_count FROM..."
4,What is the total number of addresses for each...,"SELECT city, COUNT(*) AS total_addresses FROM ..."
...,...,...
58691,What are the languages spoken in the country o...,SELECT language FROM countrylanguage WHERE cou...
58692,What is the city with the highest GDP in the S...,"SELECT city, gdp FROM city WHERE hanzi LIKE 'S..."
58693,What is the average population of cities in th...,SELECT AVG(regional_population) FROM city WHER...
58694,Which country has the highest GDP among countr...,"SELECT country_name, SUM(gdp) AS total_gdp FRO..."


### HF [gretelai/synthetic_text_to_sql](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql)
License=Apache-2.0

In [49]:
dataset8 = load_dataset("gretelai/synthetic_text_to_sql")
dataset8

DatasetDict({
    train: Dataset({
        features: ['id', 'domain', 'domain_description', 'sql_complexity', 'sql_complexity_description', 'sql_task_type', 'sql_task_type_description', 'sql_prompt', 'sql_context', 'sql', 'sql_explanation'],
        num_rows: 100000
    })
    test: Dataset({
        features: ['id', 'domain', 'domain_description', 'sql_complexity', 'sql_complexity_description', 'sql_task_type', 'sql_task_type_description', 'sql_prompt', 'sql_context', 'sql', 'sql_explanation'],
        num_rows: 5851
    })
})

Merging the train/test splits of the dataset


In [50]:
train_df8 = pd.DataFrame(dataset8['train'])
test_df8 = pd.DataFrame(dataset8['test'])

dataset8_complete = pd.concat([
    train_df8.assign(split='train'),
    test_df8.assign(split='test')
])

dataset8_complete

Unnamed: 0,id,domain,domain_description,sql_complexity,sql_complexity_description,sql_task_type,sql_task_type_description,sql_prompt,sql_context,sql,sql_explanation,split
0,5097,forestry,Comprehensive data on sustainable forest manag...,single join,"only one join (specify inner, outer, cross)",analytics and reporting,"generating reports, dashboards, and analytical...",What is the total volume of timber sold by eac...,"CREATE TABLE salesperson (salesperson_id INT, ...","SELECT salesperson_id, name, SUM(volume) as to...","Joins timber_sales and salesperson tables, gro...",train
1,5098,defense industry,"Defense contract data, military equipment main...",aggregation,"aggregation functions (COUNT, SUM, AVG, MIN, M...",analytics and reporting,"generating reports, dashboards, and analytical...",List all the unique equipment types and their ...,CREATE TABLE equipment_maintenance (equipment_...,"SELECT equipment_type, SUM(maintenance_frequen...",This query groups the equipment_maintenance ta...,train
2,5099,marine biology,"Comprehensive data on marine species, oceanogr...",basic SQL,basic SQL with a simple select statement,analytics and reporting,"generating reports, dashboards, and analytical...",How many marine species are found in the South...,"CREATE TABLE marine_species (name VARCHAR(50),...",SELECT COUNT(*) FROM marine_species WHERE loca...,This query counts the number of marine species...,train
3,5100,financial services,Detailed financial data including investment s...,aggregation,"aggregation functions (COUNT, SUM, AVG, MIN, M...",analytics and reporting,"generating reports, dashboards, and analytical...",What is the total trade value and average pric...,"CREATE TABLE trade_history (id INT, trader_id ...","SELECT trader_id, stock, SUM(price * quantity)...",This query calculates the total trade value an...,train
4,5101,energy,Energy market data covering renewable energy s...,window functions,"window functions (e.g., ROW_NUMBER, LEAD, LAG,...",analytics and reporting,"generating reports, dashboards, and analytical...",Find the energy efficiency upgrades with the h...,"CREATE TABLE upgrades (id INT, cost FLOAT, typ...","SELECT type, cost FROM (SELECT type, cost, ROW...",The SQL query uses the ROW_NUMBER function to ...,train
...,...,...,...,...,...,...,...,...,...,...,...,...
5846,5847,museums,"Visitor demographics, exhibition analytics, co...",basic SQL,basic SQL with a simple select statement,analytics and reporting,"generating reports, dashboards, and analytical...","How many visitors are from the city of ""Seattl...","CREATE TABLE visitor (visitor_id INT, visitor_...",SELECT COUNT(*) FROM visitor WHERE visitor_cit...,"This SQL query counts all the rows in the ""vis...",test
5847,5848,waste management,"Waste generation metrics, recycling rates, lan...",basic SQL,basic SQL with a simple select statement,analytics and reporting,"generating reports, dashboards, and analytical...",What is the total waste generation in kilogram...,"CREATE TABLE organizations (id INT, name TEXT,...",SELECT SUM(annual_waste_generation_kg) FROM or...,This query calculates the total waste generati...,test
5848,5849,water resources,"Water usage metrics, drought impact assessment...",aggregation,"aggregation functions (COUNT, SUM, AVG, MIN, M...",analytics and reporting,"generating reports, dashboards, and analytical...",What is the maximum wastewater volume treated ...,CREATE TABLE WasteWaterTreatment (Id INT PRIMA...,"SELECT Plant, MAX(Volume) FROM WasteWaterTreat...",This SQL query calculates the maximum wastewat...,test
5849,5850,fitness industry,"Workout data, membership demographics, wearabl...",aggregation,"aggregation functions (COUNT, SUM, AVG, MIN, M...",analytics and reporting,"generating reports, dashboards, and analytical...",Calculate the total workout duration and numbe...,"CREATE TABLE Workouts (user_id INT, workout_da...","SELECT user_id, SUM(workout_duration) as total...",The query filters the data to include only wor...,test


Appending the dataset to the complete_data_sql DataFrame


In [51]:
# Select the desired columns
new_rows8 = dataset8_complete[['sql_prompt', 'sql']].copy()

# Rename the selected columns
new_rows8.rename(columns={
    'sql': 'command',
    'sql_prompt': 'nl_prompt'
}, inplace=True)

# Append to complete_data_sql
complete_data_sql = pd.concat([complete_data_sql, new_rows8], ignore_index=True)
complete_data_sql

Unnamed: 0,nl_prompt,command
0,What is the total number of addresses for each...,"SELECT country, state_province_county, COUNT(*..."
1,Find all addresses in the United States that d...,SELECT * FROM addresses WHERE country = 'Unite...
2,What is the count of addresses in Europe by ci...,"SELECT city, COUNT(*) AS address_count FROM ad..."
3,"Get a list of all countries, along with the nu...","SELECT country, COUNT(*) AS address_count FROM..."
4,What is the total number of addresses for each...,"SELECT city, COUNT(*) AS total_addresses FROM ..."
...,...,...
164542,"How many visitors are from the city of ""Seattl...",SELECT COUNT(*) FROM visitor WHERE visitor_cit...
164543,What is the total waste generation in kilogram...,SELECT SUM(annual_waste_generation_kg) FROM or...
164544,What is the maximum wastewater volume treated ...,"SELECT Plant, MAX(Volume) FROM WasteWaterTreat..."
164545,Calculate the total workout duration and numbe...,"SELECT user_id, SUM(workout_duration) as total..."


# Data cleaning

Remove duplicates and check the SQL commands for validity

In [52]:
# Determine the unique values
unique_sql_nl_prompts = complete_data_sql["nl_prompt"].unique()
unique_sql_commands = complete_data_sql["command"].unique()

print(f"Length of original DataFrame: {len(complete_data_sql)}")
print(f"Unique NL Prompts: {len(unique_sql_nl_prompts)}")
print(f"Unique Commands: {len(unique_sql_commands)}")

# Drop rows with duplicate prompts, keeping the first occurrence
# (there are more unique commands than unique prompts; otherwise duplicate commands would be dropped instead)
complete_data_sql = complete_data_sql.drop_duplicates(subset="nl_prompt", keep="first").copy()
print(f"Length of complete data: {len(complete_data_sql)}")
print(f"Unique Commands: {len(complete_data_sql['command'].unique())}")
print(f"Remaining duplicate commands: {len(complete_data_sql)-len(complete_data_sql['command'].unique())}")

def is_valid_sql(command: str) -> bool:
    # Note: sqlparse is a non-validating parser, so this only rejects
    # input that cannot be parsed into at least one non-empty statement
    try:
        parsed = sqlparse.parse(command)
        return bool(parsed) and all(stmt.tokens for stmt in parsed)
    except Exception:
        return False

# Apply syntax check to each row
complete_data_sql["check"] = complete_data_sql["command"].apply(is_valid_sql)

Length of original DataFrame: 164547
Unique NL Prompts: 161477
Unique Commands: 163733
Length of complete data: 161477
Unique Commands: 160674
Remaining duplicate commands: 803
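
Since sqlparse does not validate SQL, `is_valid_sql` accepts almost any non-empty string. A stricter check could ask a real SQL engine to compile each statement. The sketch below swaps in SQLite from the standard library for this purpose; it is not what the notebook does. It comes with a loud caveat: SQLite's dialect differs from PostgreSQL, so valid PostgreSQL queries may be rejected. The function name `is_probably_valid_sql` is hypothetical.

```python
import sqlite3

def is_probably_valid_sql(command: str) -> bool:
    """Let SQLite compile a single statement via EXPLAIN, without running it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("EXPLAIN " + command)
        return True
    except sqlite3.OperationalError as err:
        message = str(err).lower()
        # No schema is loaded, so "no such table" errors are expected and tolerated;
        # only parser errors count as invalid syntax
        return "syntax error" not in message and "incomplete input" not in message
    except (sqlite3.Error, sqlite3.Warning):
        # For example several statements in one string
        return False
    finally:
        conn.close()

ok_query = is_probably_valid_sql("SELECT city, COUNT(*) FROM addresses GROUP BY city")
bad_query = is_probably_valid_sql("SELEC city FROM addresses")
```

Here `ok_query` is accepted although the table `addresses` does not exist, while the misspelled keyword in `bad_query` is reported as a syntax error.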


Check the Bash commands for validity

**Caution! To be safe, run this check in an isolated environment (Docker or VM)**

In [53]:
# Determine the unique values
unique_bash_nl_prompts = complete_data_bash["nl_prompt"].unique()
unique_bash_commands = complete_data_bash["command"].unique()

print(f"Length of original DataFrame: {len(complete_data_bash)}")
print(f"Unique NL Prompts: {len(unique_bash_nl_prompts)}")
print(f"Unique Commands: {len(unique_bash_commands)}")

# Drop rows with duplicate prompts, keeping the first occurrence
complete_data_bash = complete_data_bash.drop_duplicates(subset="nl_prompt", keep="first").copy()
print(f"Length of complete data: {len(complete_data_bash)}")
print(f"Unique Commands: {len(complete_data_bash['command'].unique())}")
print(f"Remaining duplicate commands: {len(complete_data_bash)-len(complete_data_bash['command'].unique())}")

def is_valid_bash_syntax(command: str) -> bool:
    # `bash -n` reads the command from stdin and only parses it, without executing it
    try:
        result = subprocess.run(
            ['bash', '-n'],
            input=command.encode('utf-8'),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=2,
            check=False
        )
        return result.returncode == 0
    except Exception:
        return False

complete_data_bash["check"] = complete_data_bash["command"].apply(is_valid_bash_syntax)


Length of original DataFrame: 1025413
Unique NL Prompts: 13690
Unique Commands: 13152
Length of complete data: 13690
Unique Commands: 11335
Remaining duplicate commands: 2355
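
Because 2355 rows still share a command with another row, the syntax check spawns more `bash` processes than strictly needed. A memoized variant, sketched here with `functools.lru_cache`, would check each distinct command only once. The name `is_valid_bash_cached` and the sample commands are hypothetical, and the sketch assumes `bash` is on the `PATH`:

```python
import shutil
import subprocess
from functools import lru_cache

@lru_cache(maxsize=None)
def is_valid_bash_cached(command: str) -> bool:
    """Syntax-check a command with `bash -n` (parse only, nothing is executed)."""
    try:
        result = subprocess.run(
            ["bash", "-n"],
            input=command.encode("utf-8"),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=2,
            check=False,
        )
        return result.returncode == 0
    except (OSError, subprocess.SubprocessError):
        # bash missing or the check timed out: treat the command as invalid
        return False

# The repeated "ls -la" is answered from the cache instead of a new process
samples = ["ls -la", "if true; then echo hi", "ls -la"]
results = [is_valid_bash_cached(cmd) for cmd in samples]
BASH_AVAILABLE = shutil.which("bash") is not None
```

The second sample lacks its closing `fi`, so `bash -n` reports a syntax error for it without running anything.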


Merging the DataFrames

In [54]:
complete_data = pd.concat([complete_data_sql, complete_data_bash])

Count the entries with invalid syntax and remove them

In [55]:
false_count = (~complete_data['check']).sum()
print(f"Entries with invalid syntax to remove: {false_count}")

# Keep only the rows where the 'check' column is True
complete_data = complete_data[complete_data['check']]

Entries with invalid syntax to remove: 188


In [56]:
complete_data

Unnamed: 0,nl_prompt,command,check
0,What is the total number of addresses for each...,"SELECT country, state_province_county, COUNT(*...",True
1,Find all addresses in the United States that d...,SELECT * FROM addresses WHERE country = 'Unite...,True
2,What is the count of addresses in Europe by ci...,"SELECT city, COUNT(*) AS address_count FROM ad...",True
3,"Get a list of all countries, along with the nu...","SELECT country, COUNT(*) AS address_count FROM...",True
4,What is the total number of addresses for each...,"SELECT city, COUNT(*) AS total_addresses FROM ...",True
...,...,...,...
25418,List all running services,systemctl list-units --type=service,True
25419,Find and replace 'foo' with 'bar' in all .txt ...,sed -i 's/foo/bar/g' *.txt,True
25420,Check if port 8080 is in use,netstat -tulnp | grep 8080,True
25421,Display real-time disk I/O stats,iostat -dx 1,True
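
The index in the output above restarts for the Bash rows, because the two DataFrames were concatenated with their original row labels. When such a frame is converted with `Dataset.from_pandas`, a non-default index typically ends up as an extra `__index_level_0__` column, like the one visible in the features of `omeryentur/text-to-postgresql`. A minimal sketch of tidying the frame first, using made-up toy data; dropping the `check` column is an optional choice that the notebook does not make:

```python
import pandas as pd

# Toy stand-ins for the checked SQL and Bash frames (hypothetical data)
sql_part = pd.DataFrame({"nl_prompt": ["count rows"],
                         "command": ["SELECT COUNT(*) FROM t"],
                         "check": [True]})
bash_part = pd.DataFrame({"nl_prompt": ["list files", "broken"],
                          "command": ["ls", "if then"],
                          "check": [True, False]})

# Concatenating without ignore_index keeps the duplicate labels 0, 0, 1
combined = pd.concat([sql_part, bash_part])

# Keep valid rows, drop the helper column, and rebuild a clean 0..n-1 index
cleaned = (
    combined[combined["check"]]
    .drop(columns="check")
    .reset_index(drop=True)
)
```

Alternatively, `Dataset.from_pandas` accepts `preserve_index=False`, which leaves the index out of the converted dataset.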


# Upload to Huggingface

In [57]:
hf_complete_data = DS.from_pandas(complete_data)
hf_complete_data.push_to_hub("terminAl-thesis-2025/combined_dataset")

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/datasets/terminAl-thesis-2025/combined_dataset/commit/7e8e7f0199037e8c9852db3c71e30c02d123983c', commit_message='Upload dataset', commit_description='', oid='7e8e7f0199037e8c9852db3c71e30c02d123983c', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/terminAl-thesis-2025/combined_dataset', endpoint='https://huggingface.co', repo_type='dataset', repo_id='terminAl-thesis-2025/combined_dataset'), pr_revision=None, pr_num=None)