shaikhq/QueryFormer

 
 

QueryFormer Forked Repo - README

Query representation learning plays a crucial role in various AI-based database tasks, such as query optimization and workload forecasting. QueryFormer is a method that uses attention to learn query representations.

Zhao, Yue, Gao Cong, Jiachen Shi, and Chunyan Miao. "Queryformer: A Tree Transformer Model for Query Plan Representation." Proceedings of the VLDB Endowment 15, no. 8 (2022): 1658-1670.

QueryFormer learns query representations that can be used for different prediction tasks. The QueryFormer code, written for PyTorch, was released over 2 years ago, and the Python ecosystem has moved on since then. After several days of effort, I updated the code to run on the latest Python version (3.12.2) with current releases of PyTorch and the other required packages. This README details the steps for setting up the Python environment and getting the code to work.

The training pipeline is in the Training V1.ipynb file. To make the pipeline run faster, I reduced the batch size from the original 1024 to 128 and loaded 2 of the 18 training files instead of all of them. Both changes are in the Training V1.ipynb file.

Setting up a Python 3.12.2 Virtual Environment

  1. Check the Python versions installed via pyenv:
pyenv versions
  2. Set a Python version for the current project:
cd project_dir
pyenv local 3.12.2

Confirm that the local Python was set to the target version:

python --version
  3. Create and activate a Python virtual environment:
python -m venv .venv
source .venv/bin/activate
  4. Install the Python dependencies:
pip install -r requirements.txt
  5. Set the project's Python interpreter to the one installed in the project's virtual environment (.venv). Open the command palette (macOS: Cmd + Shift + P) and click Python: Select Interpreter.

From the available interpreter options, choose the Python from the project's virtual environment.
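
To double-check that the selected interpreter really is the one from .venv, a small script like the following can be run from the terminal or a notebook cell. It is a minimal sketch that relies only on the standard library; the helper names in_virtualenv and describe_interpreter are made up for this illustration and are not part of QueryFormer.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def describe_interpreter() -> str:
    """Summarize the interpreter path, version, and venv status in one line."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    return f"{sys.executable} (Python {version}, venv={in_virtualenv()})"


print(describe_interpreter())
```

If the printed path does not point inside the project's .venv directory, the interpreter selection did not take effect.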

Run the training pipeline

  1. Open Training V1.ipynb
  2. Set its kernel to the project's python interpreter


  3. When you run the notebook, you'll see a prompt to install the jupyter and notebook packages. Click Install and wait until the installation completes.

Note: On my Red Hat system, installing the jupyter and notebook packages failed with the following error message: Failed to start the Kernel. ERROR: Could not build wheels for pyzmq, which is required to install pyproject.toml-based projects. View Jupyter log for further details.

I fixed this issue by updating the system and manually installing the ZeroMQ development package:

sudo yum update
sudo yum install zeromq-devel

Then, from the terminal, I installed the following 2 packages in the project's .venv:

pip install pyzmq
pip install jupyter
  4. Next, I got the following error:
AttributeError                            Traceback (most recent call last)
Cell In[6], line 1
----> 1 hist_file = get_hist_file(data_path + 'histogram_string.csv')
      2 cost_norm = Normalizer(-3.61192, 12.290855)
      3 card_norm = Normalizer(1,100)

File ~/QueryFormer/model/database_util.py:75, in get_hist_file(hist_path, bin_number)
     73 for i in range(len(hist_file)):
     74     freq = hist_file['freq'][i]
---> 75     freq_np = np.frombuffer(bytes.fromhex(freq), dtype=np.float)
     76     hist_file['freq'][i] = freq_np
     78 table_column = []

File ~/QueryFormer/.venv/lib/python3.12/site-packages/numpy/__init__.py:324, in __getattr__(attr)
    319     warnings.warn(
    320         f"In the future `np.{attr}` will be defined as the "
    321         "corresponding NumPy scalar.", FutureWarning, stacklevel=2)
    323 if attr in __former_attrs__:
--> 324     raise AttributeError(__former_attrs__[attr])
    326 if attr == 'testing':
    327     import numpy.testing as testing

AttributeError: module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

I fixed this error by making the following code change in the database_util.py file. Before:

freq_np = np.frombuffer(bytes.fromhex(freq), dtype=np.float)

After:

freq_np = np.frombuffer(bytes.fromhex(freq), dtype=float)
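
As a quick sanity check that dtype=float decodes the same values np.float used to, here is a minimal round-trip sketch. The sample values are invented for illustration, and the hex encoding is an assumption inferred from the bytes.fromhex call above, not a confirmed description of how histogram_string.csv was produced.

```python
import numpy as np

# Hypothetical sample values standing in for one histogram row.
original = np.array([0.0, 0.25, 0.5, 1.0], dtype=np.float64)

# Encode the raw bytes as a hex string (assumed storage format of 'freq').
hex_string = original.tobytes().hex()

# Decode with the builtin float, which NumPy interprets as float64.
decoded = np.frombuffer(bytes.fromhex(hex_string), dtype=float)

print(decoded.dtype, decoded)
```

Because the builtin float maps to np.float64, the decoded array has the same dtype and values that the old np.float alias produced.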
  5. Next, the following code raised an error, because DataFrame.append was removed in pandas 2.0:
imdb_path = './data/imdb/'
full_train_df = pd.DataFrame()
for i in range(18):
    file = imdb_path + 'plan_and_cost/train_plan_part{}.csv'.format(i)
    df = pd.read_csv(file)
    full_train_df = full_train_df.append(df)

val_df = pd.DataFrame()
for i in range(18,20):
    file = imdb_path + 'plan_and_cost/train_plan_part{}.csv'.format(i)
    df = pd.read_csv(file)
    val_df = val_df.append(df)
table_sample = get_job_table_sample(imdb_path+'train')

I fixed this error by collecting the DataFrames in a list and combining them with pd.concat:

imdb_path = './data/imdb/'
dfs = []  # list to hold DataFrames
for i in range(18):
    file = imdb_path + 'plan_and_cost/train_plan_part{}.csv'.format(i)
    df = pd.read_csv(file)
    dfs.append(df)

full_train_df = pd.concat(dfs)

val_dfs = []  # list to hold DataFrames
for i in range(18,20):
    file = imdb_path + 'plan_and_cost/train_plan_part{}.csv'.format(i)
    df = pd.read_csv(file)
    val_dfs.append(df)

val_df = pd.concat(val_dfs)
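
One detail worth knowing about this pattern: pd.concat keeps each part's original row index unless ignore_index=True is passed. The sketch below uses two small in-memory DataFrames with invented columns (id, cost) in place of the CSV parts.

```python
import pandas as pd

# Hypothetical stand-ins for two of the CSV parts.
parts = [
    pd.DataFrame({"id": [0, 1], "cost": [10.0, 12.5]}),
    pd.DataFrame({"id": [2, 3], "cost": [7.0, 9.5]}),
]

# Default behavior: row labels are carried over from each part (0, 1, 0, 1).
combined = pd.concat(parts)

# With ignore_index=True the result gets a fresh 0..n-1 index.
reindexed = pd.concat(parts, ignore_index=True)

print(list(combined.index))
print(list(reindexed.index))
```

The default matches what the removed DataFrame.append did, so the fix above preserves the original behavior. Use ignore_index=True only if later code needs unique row labels.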
  6. Next, I got this error:
NameError                                 Traceback (most recent call last)
Cell In[15], line 1
----> 1 train_ds = PlanTreeDataset(full_train_df, None, encoding, hist_file, card_norm, cost_norm, to_predict, table_sample)
      2 val_ds = PlanTreeDataset(val_df, None, encoding, hist_file, card_norm, cost_norm, to_predict, table_sample)

NameError: name 'table_sample' is not defined

The rewritten cell had dropped this line from the original code. I fixed the error by adding it back before executing the 2 lines above:

table_sample = get_job_table_sample(imdb_path+'train')

Upgrading the NVIDIA driver on my Red Hat Linux 64-bit system

Next, I got the following error:

RuntimeError: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx

I fixed this issue by upgrading the NVIDIA driver on my Red Hat Linux system. Here are my steps:

  1. Identified the NVIDIA card model on my system:
lspci | grep -i nvidia

I got this response: VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)
  2. Downloaded the latest driver for my card model, GeForce GTX 1050 Ti, by visiting this link. On that page, I chose GeForce, GeForce 10 Series, and then GeForce GTX 1050 Ti. Next, I picked my OS: Linux 64-bit. I checked whether my OS is 64-bit by running the following command:

uname -m
  3. Disabled the nouveau driver, an open-source driver for NVIDIA cards that may come pre-installed:
sudo vi /etc/default/grub

Replaced the existing value of the GRUB_CMDLINE_LINUX parameter with the following line:

GRUB_CMDLINE_LINUX="rd.lvm.lv=rhel/swap vconsole.font=latarcyrheb-sun16 rd.lvm.lv=rhel/root crashkernel=auto  vconsole.keymap=us rd.driver.blacklist=nouveau nouveau.modeset=0 rhgb quiet"

Then regenerated the GRUB configuration file:

sudo grub2-mkconfig -o /boot/grub2/grub.cfg

Rebooted the system:

sudo reboot

Installed the downloaded driver:

sudo bash NVIDIA-Linux-x86_64-XXX.XX.run

Rebooted the system again:

sudo reboot

Confirmed that the driver was updated:

nvidia-smi
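
The same check can be scripted from Python, for example at the top of the training script. The sketch below is a hedged illustration: the helper name nvidia_driver_version is made up, and it assumes that nvidia-smi supports the --query-gpu=driver_version and --format=csv,noheader options, which is the case on the driver releases I am aware of.

```python
import shutil
import subprocess
from typing import Optional


def nvidia_driver_version() -> Optional[str]:
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    # nvidia-smi is missing when no NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    # A non-zero exit code usually means the driver is not loaded.
    if result.returncode != 0:
        return None
    lines = result.stdout.strip().splitlines()
    # One line is printed per GPU; the first one is enough here.
    return lines[0].strip() if lines else None


print(nvidia_driver_version())
```

A return value of None means that nvidia-smi is missing or could not talk to the driver, which is worth resolving before starting a training run.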

Useful Shell commands

printing the size of each file in the current directory:

du -sh *

printing the size of files in the current directory in descending order of size:

du -sh * | sort -hr

compressing a folder:

tar -czvf name-of-archive.tar.gz /path/to/directory/

printing the size of all files, including hidden ones, in the current directory:

du -ahc | sort -hr

printing the size of the files and folders in the current directory without nesting, i.e., each subdirectory is shown with its total size only, not the files and subdirectories inside it:

du -hd1 | sort -hr
  • the -d1 option limits the output to depth 1, i.e., the current directory level, which prevents nesting
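
For use inside a script or notebook, a rough Python counterpart of the depth-1 listing can be sketched as follows. The function names are invented for this example. Unlike du, which reports disk usage, this sketch sums apparent file sizes, so the numbers can differ slightly.

```python
import os
from pathlib import Path


def tree_size(path: Path) -> int:
    """Total size in bytes of a file, or of all regular files under a directory."""
    if path.is_symlink() or path.is_file():
        return path.lstat().st_size
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            # Skip symlinks so linked files are not counted twice.
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


def depth_one_sizes(directory: str = ".") -> list[tuple[str, int]]:
    """List (name, bytes) for each direct child, largest first."""
    entries = [(child.name, tree_size(child)) for child in Path(directory).iterdir()]
    return sorted(entries, key=lambda item: item[1], reverse=True)


for name, size in depth_one_sizes("."):
    print(f"{size:>12}  {name}")
```

Hidden entries such as .venv are included, because Path.iterdir does not filter them out.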

compressing a directory while excluding one of its subdirectories from the archive:

tar -czvf QueryFormer_gpu1.tar.gz --exclude='.venv' QueryFormer
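
The same exclusion can be done with Python's standard tarfile module, which is handy when the archive is created from a script. This is a minimal sketch: make_archive and its parameters are names chosen for this example. Unlike the tar pattern above, it skips any path component that exactly equals the excluded name.

```python
import tarfile
from pathlib import Path
from typing import Optional


def make_archive(source_dir: str, archive_path: str, excluded_name: str = ".venv") -> None:
    """Create a gzip-compressed tar of source_dir, skipping excluded_name."""
    source = Path(source_dir).resolve()

    def skip_excluded(member: tarfile.TarInfo) -> Optional[tarfile.TarInfo]:
        # Returning None drops the member; for a directory, tarfile then
        # does not descend into it, so its contents are skipped as well.
        if excluded_name in Path(member.name).parts:
            return None
        return member

    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps paths in the archive relative to the folder name.
        archive.add(str(source), arcname=source.name, filter=skip_excluded)
```

For example, make_archive("QueryFormer", "QueryFormer_gpu1.tar.gz") would produce an archive comparable to the one from the tar command above.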

generating requirements.txt from a venv:

source .venv/bin/activate
pip freeze > requirements.txt

Handling code errors and warnings

warning 1:

/home/shaikhq/QueryFormer/model/dataset.py:109: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:275.)
  'features' : torch.FloatTensor(features),

before:

return {
    'features': torch.FloatTensor(features),
    'heights': torch.LongTensor(heights),
    'adjacency_list': torch.LongTensor(np.array(adj_list)),
}

after:

return {
    'features': torch.FloatTensor(np.array(features)),
    'heights': torch.LongTensor(heights),
    'adjacency_list': torch.LongTensor(np.array(adj_list)),
}

error 1: AttributeError: 'DataFrame' object has no attribute 'append'. Did you mean: '_append'?

Traceback (most recent call last):
  File "/home/shaikhq/QueryFormer/TrainingV1.py", line 102, in <module>
    model, best_path = train(model, train_ds, val_ds, crit, cost_norm, args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/shaikhq/QueryFormer/model/trainer.py", line 147, in train
    best_model_path = logging(args, epoch, test_scores, filename = 'log.txt', save_model = True, model = model)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/shaikhq/QueryFormer/model/trainer.py", line 179, in logging
    df = df.append(pd.DataFrame(res, index=[0]), ignore_index=True)
         ^^^^^^^^^
  File "/home/shaikhq/QueryFormer/.venv/lib/python3.12/site-packages/pandas/core/generic.py", line 6299, in __getattr__
    return object.__getattribute__(self, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DataFrame' object has no attribute 'append'. Did you mean: '_append'?

Before:

 df = df.append(res, ignore_index=True)

After:

df = pd.concat([df, res], ignore_index=True)

Next, I got this error:

Traceback (most recent call last):
  File "/home/shaikhq/QueryFormer/TrainingV1.py", line 102, in <module>
    model, best_path = train(model, train_ds, val_ds, crit, cost_norm, args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/shaikhq/QueryFormer/model/trainer.py", line 147, in train
    best_model_path = logging(args, epoch, test_scores, filename = 'log.txt', save_model = True, model = model)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/shaikhq/QueryFormer/model/trainer.py", line 178, in logging
    df = pd.concat([df, res], ignore_index=True)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/shaikhq/QueryFormer/.venv/lib/python3.12/site-packages/pandas/core/reshape/concat.py", line 382, in concat
    op = _Concatenator(
         ^^^^^^^^^^^^^^
  File "/home/shaikhq/QueryFormer/.venv/lib/python3.12/site-packages/pandas/core/reshape/concat.py", line 448, in __init__
    ndims = self._get_ndims(objs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/home/shaikhq/QueryFormer/.venv/lib/python3.12/site-packages/pandas/core/reshape/concat.py", line 489, in _get_ndims
    raise TypeError(msg)
TypeError: cannot concatenate object of type '<class 'dict'>'; only Series and DataFrame objs are valid

Because res is a dict and pd.concat accepts only Series and DataFrame objects, I changed the code again to wrap res in a DataFrame first:

res_df = pd.DataFrame([res])
df = pd.concat([df, res_df], ignore_index=True)
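
The pattern can be tried in isolation with the sketch below. The column names epoch and loss are invented for illustration and are not necessarily the ones logged by trainer.py.

```python
import pandas as pd

# Hypothetical existing log with one row.
log_df = pd.DataFrame([{"epoch": 0, "loss": 0.9}])

# A new result arrives as a plain dict, as in the logging function.
res = {"epoch": 1, "loss": 0.7}

# Wrapping the dict in a list yields a one-row DataFrame whose
# columns are the dict keys.
res_df = pd.DataFrame([res])

# ignore_index=True renumbers the rows 0..n-1 after concatenation.
log_df = pd.concat([log_df, res_df], ignore_index=True)

print(log_df)
```

Passing the dict inside a list tells pandas to treat it as a single row, which avoids the need for the index=[0] argument seen in the original traceback.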
