In this notebook I explore using XGBoost to generate graph database queries (Apstra QE) from natural-language user input.

## Outcomes
XGBoost is not built for generative tasks; it is better suited to classification and regression.
For this use case, the model can only return the closest matching query from the training data. It cannot construct a new query.
A transformer or other Seq2Seq architecture is recommended instead.
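
To make the closest-match limitation concrete, here is a minimal sketch of retrieval over a fixed set of training pairs. It uses plain token overlap (Jaccard similarity) rather than XGBoost, and the example pairs, the placeholder query strings, and the helper names `tokenize` and `closest_query` are all hypothetical. Whatever the input, the function can only return one of the stored queries.

```python
import re

# Hypothetical training pairs: natural-language label -> stored query.
# The query strings are placeholders, not real Apstra QE queries.
TRAINING_PAIRS = {
    "get all leafs and access switches": "<stored query A>",
    "get all links between leaf and spine switches": "<stored query B>",
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def closest_query(user_input):
    """Return the stored query whose label overlaps most with the input."""
    user_tokens = tokenize(user_input)

    def jaccard(label):
        label_tokens = tokenize(label)
        union = user_tokens | label_tokens
        return len(user_tokens & label_tokens) / len(union) if union else 0.0

    best_label = max(TRAINING_PAIRS, key=jaccard)
    return TRAINING_PAIRS[best_label]


result = closest_query("show links between spine and leaf")
```

Even an input unrelated to every label still maps to some stored query, which is why a generative architecture is needed to produce queries that are not in the training set.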

In [1]:
!pip install datasets huggingface_hub pandas xgboost --upgrade --quiet

In [2]:
import os
import boto3
import s3fs
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput
import xgboost as xgb

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


## Load HuggingFace Dataset

In [3]:
from datasets import load_dataset
from huggingface_hub import notebook_login

notebook_login()



## Initialize the SageMaker session

In [4]:
sagemaker_session = sagemaker.Session()

In [36]:
hg_source_dataset = 'deepwaters/apstra-qe-queries'
dataset = load_dataset(hg_source_dataset, split='train')


In [37]:
print(dataset.shape)
print(dataset.column_names)
print(dataset[0])

(35, 2)
['label', 'query']
{'label': "get all links in the fabric belonging to the routing-zone 'blue'", 'query': "match(node('system', role='spine', deploy_mode='deploy').out('hosted_interfaces').node('interface', name='leaf_intf').out('link').node('link', role='spine_leaf').in_('link').node('interface').in_('hosted_interfaces').node('system', role='leaf'),node(name='leaf_intf').in_('member_interfaces').node('sz_instance').in_('instantiated_by').node('security_zone', vrf_name='blue')"}


## Pre-process: Vectorize data for XGBoost

The goal is to predict the "query" from the "label" input:
* Target Variable: "query"
* Input Features: "label"
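
With this framing the target is a string, while XGBoost needs a numeric target. One option, sketched below under the assumption that each distinct query is treated as its own class, is to map queries to integer class IDs and keep the reverse mapping for decoding predictions. The helper name `build_query_index` and the placeholder queries are hypothetical.

```python
def build_query_index(queries):
    """Map each distinct query string to an integer class ID.

    Classes are sorted so the mapping is deterministic across runs.
    """
    classes = sorted(set(queries))
    to_id = {query: class_id for class_id, query in enumerate(classes)}
    return classes, to_id


# Placeholder strings standing in for the dataset's 'query' column.
queries = ["<query B>", "<query A>", "<query B>"]

classes, to_id = build_query_index(queries)
targets = [to_id[query] for query in queries]    # numeric training targets
decoded = [classes[target] for target in targets]  # class ID -> query string
```

A predicted class ID can then be turned back into a query with `classes[predicted_id]`.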


## Organize and clean the dataset

In [46]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from scipy.sparse import hstack
from sklearn.model_selection import train_test_split

# Load dataset
df = pd.DataFrame(dataset)

# Drop rows with missing values
df.dropna(subset=['query', 'label'], inplace=True)

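# TF-IDF vectorization of the 'query' text
# (note: the vectorizer is fitted on 'query', not on 'label')
tfidf_vectorizer = TfidfVectorizer(max_features=1000)
query_tfidf = tfidf_vectorizer.fit_transform(df['query'])

# One-hot encode the 'label' text (one column per distinct label)
onehot_encoder = OneHotEncoder()
label_encoded = onehot_encoder.fit_transform(df[['label']]).toarray()

# Combine features: TF-IDF columns followed by the one-hot label columns
features = hstack([query_tfidf, label_encoded])

# Integer-encode 'label' as the training target: each label's index in the
# encoder's sorted category list. The one-hot columns above encode the same
# label, so the features as built here also contain the target.
labels = onehot_encoder.categories_[0].searchsorted(df['label'])
full_dataset = pd.DataFrame(features.toarray())
# Put the target in the first column, as SageMaker's built-in XGBoost expects for CSV
full_dataset.insert(0, 'label', labels)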

# Split dataset into training and testing
train_df, test_df = train_test_split(full_dataset, test_size=0.2, random_state=42)


In [47]:
# Save as CSV without header or index; the target is the first column
train_df.to_csv('train.csv', header=False, index=False)
test_df.to_csv('test.csv', header=False, index=False)
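
As a sanity check on this layout, the sketch below verifies that every row of a CSV has the same width, an integer target in the first column, and numeric features after it. It is an illustrative helper, not part of the SageMaker SDK; the name `check_training_csv` and the sample rows are hypothetical.

```python
import csv
import io


def check_training_csv(handle):
    """Validate a headerless CSV with the target in the first column.

    Returns (row_count, feature_count). Raises ValueError on an empty file,
    ragged rows, a non-integer target, or a non-numeric feature.
    """
    rows = list(csv.reader(handle))
    if not rows:
        raise ValueError("CSV is empty")
    width = len(rows[0])
    for line_number, row in enumerate(rows, start=1):
        if len(row) != width:
            raise ValueError(f"row {line_number} has {len(row)} columns, expected {width}")
        int(row[0])            # target must parse as an integer
        for value in row[1:]:
            float(value)       # features must parse as numbers
    return len(rows), width - 1


# Hypothetical two-row sample in the same shape as train.csv.
sample = io.StringIO("22,0.0,0.5,1.0\n25,0.3,0.0,0.0\n")
row_count, feature_count = check_training_csv(sample)
```

Against the real files, the same function could be called as `check_training_csv(open('train.csv', newline=''))`.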

## Upload the dataset to S3 for SageMaker usage
The source dataset has only a `train` split, so the train and test files uploaded here come from the local 80/20 split made above.
TODO: create a dedicated test split.

In [59]:
bucket = sagemaker_session.default_bucket()

s3 = boto3.client('s3')
s3.upload_file('train.csv', bucket, 'train/train.csv')
s3.upload_file('test.csv', bucket, 'test/test.csv')


# Define the S3 paths
train_s3_path = f"s3://{bucket}/train/train.csv"
test_s3_path = f"s3://{bucket}/test/test.csv"



print(f"Dataset {hg_source_dataset} saved to S3 bucket {train_s3_path}")
print(f"Dataset {hg_source_dataset} saved to S3 bucket {test_s3_path}")

Dataset deepwaters/apstra-qe-queries saved to S3 bucket s3://sagemaker-us-west-1-983186512003/train/train.csv
Dataset deepwaters/apstra-qe-queries saved to S3 bucket s3://sagemaker-us-west-1-983186512003/test/test.csv


In [52]:
with open('train.csv', 'rb') as file:
    for i in range(5):
        print(file.readline())

b'22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.21361745475479457,0.0,0.0,0.0,0.3641665482804358,0.0,0.0,0.1808502265564514,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.21361745475479457,0.0,0.0,0.38570175448168215,0.0,0.0,0.0,0.0,0.0,0.0,0.11390083476208064,0.0,0.0,0.22780166952416128,0.0,0.0,0.34170250428624194,0.0,0.0,0.32016729808499567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3108679795799754,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3641665482804358,0.0,0.0,0.0,0.0,0.2850515158620618,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0\n'
b'25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3887043107411136,0.0,0.0,0.0,0.0,0.5830564661116705,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.

## Create the XGBoost estimator

In [54]:
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"reg:squarederror",
        "num_round":"50",
        "verbosity": "2",
}

# Instance type used later for the inference endpoint
# (the training job below runs on ml.m5.large)
ec2_instance_type = "ml.m5.2xlarge"
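
The `reg:squarederror` objective treats the first CSV column as a continuous value. Since that column holds class IDs here, a classification objective may be a closer fit. The sketch below is an untested alternative, not the configuration used in this notebook: `multi:softmax` and `num_class` are standard XGBoost parameters, while the helper name and the example labels are hypothetical.

```python
# Hypothetical integer class IDs, standing in for the first CSV column.
example_labels = [0, 2, 1, 2]


def classification_hyperparameters(labels):
    """Return hyperparameters for multi-class classification over `labels`."""
    # Assumption: class IDs run from 0 up to the largest ID present.
    num_class = max(labels) + 1
    return {
        "max_depth": "5",
        "eta": "0.2",
        "objective": "multi:softmax",  # predicts a class ID, not a real number
        "num_class": str(num_class),   # required by the multi:* objectives
        "num_round": "50",
    }


params = classification_hyperparameters(example_labels)
```

With this objective the model's output is a class ID that maps directly back to a training example, which matches the closest-match behaviour described in the outcomes.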

In [61]:
# set an output path where the trained model will be saved
prefix = 'graph-query-xgboost'
output_path = f"s3://{bucket}/{prefix}/abalone-dist-xgb/output"

xgboost_container = sagemaker.image_uris.retrieve("xgboost", 'us-west-1', "1.7-1")

# Construct a SageMaker estimator that runs the built-in XGBoost container
# (no entry_point training script is passed; the container trains from the CSV channels)
estimator = sagemaker.estimator.Estimator(
    image_uri=xgboost_container, 
    hyperparameters=hyperparameters,
    role=sagemaker.get_execution_role(),
    instance_count=1, 
    instance_type='ml.m5.large', 
    volume_size=5, # 5 GB 
    output_path=output_path
)

train_input = sagemaker.inputs.TrainingInput(s3_data=train_s3_path, content_type='csv')
validation_input = sagemaker.inputs.TrainingInput(s3_data=test_s3_path, content_type='csv')

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


# Train the Model
## Start the training job

In [62]:
estimator.fit({'train': train_input, 'validation': validation_input})

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2024-06-14-17-25-33-316


2024-06-14 17:25:33 Starting - Starting the training job...
2024-06-14 17:25:49 Starting - Preparing the instances for training...
2024-06-14 17:26:15 Downloading - Downloading input data...
2024-06-14 17:26:55 Downloading - Downloading the training image......
2024-06-14 17:28:00 Training - Training image download completed. Training in progress..
[2024-06-14 17:28:05.920 ip-10-0-185-79.us-west-1.compute.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2024-06-14 17:28:05.942 ip-10-0-185-79.us-west-1.compute.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2024-06-14:17:28:06:INFO] Imported framework sagemaker_xgboost_container.training
[2024-06-14:17:28:06:INFO] Failed to parse hyperparameter objective value reg:squarederror to Json.
Returning the value itself
[2024-06-14:17:28:06:INFO] No GPUs detected (normal if no gpus installed)
[2024-06-14:17:28:06:INFO] Running XGBoost Sagemake

## Deploy the model for inference

In [70]:
xgb_predictor = estimator.deploy(initial_instance_count=1, instance_type=ec2_instance_type)

# Send requests as CSV and parse responses as JSON
xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()
xgb_predictor.deserializer = sagemaker.deserializers.JSONDeserializer()

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2024-06-14-18-08-56-730
INFO:sagemaker:Creating endpoint-config with name sagemaker-xgboost-2024-06-14-18-08-56-730
INFO:sagemaker:Creating endpoint with name sagemaker-xgboost-2024-06-14-18-08-56-730


------!

## Make Predictions

In [83]:
import numpy as np

def make_prediction(input_text: str):
    # Vectorize the input with the TF-IDF vectorizer fitted during pre-processing
    input_tfidf = tfidf_vectorizer.transform([input_text])

    # Flatten the single-row sparse matrix into one comma-separated CSV row
    input_features = np.squeeze(np.asarray(input_tfidf.todense()))
    input_csv = ','.join(str(value) for value in input_features)

    # Invoke the endpoint and return the predicted score
    prediction = xgb_predictor.predict(input_csv)
    return prediction["predictions"][0]["score"]
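
In general, a request row needs the same number of feature columns as the rows the model was trained on, without the label column. The sketch below is a hypothetical guard that checks the width before serializing; `to_csv_payload` and `expected_width` are illustrative names, and in this notebook `expected_width` would be the column count of `train.csv` minus one.

```python
def to_csv_payload(features, expected_width):
    """Serialize one feature vector as a CSV row, checking its length first."""
    if len(features) != expected_width:
        raise ValueError(
            f"expected {expected_width} features, got {len(features)}"
        )
    return ",".join(str(value) for value in features)


# Hypothetical three-feature vector.
payload = to_csv_payload([0.0, 0.25, 1.0], expected_width=3)
```

Failing early on a width mismatch is easier to debug than an endpoint error or a silently misaligned prediction.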

In [104]:
# Score every label in the dataset through the endpoint
scores = [make_prediction(label) for label in df["label"]]

scores_df = df.assign(score=scores)

print(scores_df)
    

                                                label  \
0   get all links in the fabric belonging to the r...   
1                   Get all leafs and access switches   
2                          Return all fabric switches   
3   Get all active interfaces in vrf 'NCP' on all ...   
4   Find associated loopbacks on all switches in V...   
5                  Get all managed devices (any role)   
6   Get all external systems with a list of tags '...   
7       Get all managed devices of a given ASIC model   
8           Get all managed devices of a given vendor   
9   Get the BGP peering domain of an external rout...   
10  Find all MLAG peer-link interfaces on leaf swi...   
11  Get external links with the following list of ...   
12      Get all links between leaf and spine switches   
13                              get all logical VTEPs   
14  Get all internal links with a list of tags '['...   
15  Get all interfaces used for BGP peering with t...   
16  Get all external links with

In [108]:
input_text = "Get all leafs and access switches"

predicted_score = make_prediction(input_text=input_text)
print(f"Model Prediction: {predicted_score}")

# Find the row whose stored score is nearest to the predicted score
scores_df['difference'] = (scores_df['score'] - predicted_score).abs()
closest_match = scores_df.loc[scores_df['difference'].idxmin()]
print(f"Closest Match: {closest_match}")

Model Prediction: 23.679279327392578
Closest Match: label         get all links in the fabric belonging to the r...
query         match(node('system', role='spine', deploy_mode...
score                                                 23.679279
difference                                                  0.0
Name: 0, dtype: object
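
The lookup above picks the row whose score is nearest to the prediction. The sketch below reproduces that logic in plain Python on hypothetical scores, to show how ties are resolved: when several rows share the same score, the first one wins, regardless of which label was asked for. pandas `idxmin` likewise returns the first occurrence of the minimum.

```python
# Hypothetical scores for four rows; the first three are identical.
example_scores = [23.68, 23.68, 23.68, 11.2]


def closest_index(scores, target):
    """Return the index of the score nearest to `target` (first one on ties)."""
    return min(range(len(scores)), key=lambda i: abs(scores[i] - target))


tied = closest_index(example_scores, 23.68)
unique = closest_index(example_scores, 11.0)
```

A nearest-score lookup can therefore only distinguish inputs when the model assigns them different scores.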


# Delete the endpoint after use

In [109]:
xgb_predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: sagemaker-xgboost-2024-06-14-18-08-56-730
INFO:sagemaker:Deleting endpoint with name: sagemaker-xgboost-2024-06-14-18-08-56-730
