# Pretraining for the `BAAI/bge-large-en-v1.5` RAG model on a subsample of the data

### 1. [Self-supervised pretraining](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain)
### 2. [Contrastive learning finetuning](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)

## Setup

A workstation with a GeForce RTX 2070 GPU (8 GB) and 16 GB of RAM cannot pretrain or finetune `BAAI/bge-large-en-v1.5`, so we use the Dorsal server instead. We generate the data locally and then send it to Dorsal for training.

In [1]:
import sys
sys.path.append("../..")
import linetracker.pretty_print as pp
import linetracker.utils as u
from importlib import reload
reload(pp)
reload(u)
import sklearn.preprocessing as prep
import json
import re
import pandas as pd

The file `stats_dataset.json` has already been preprocessed into a JSON file with the following columns:

In [2]:
%%time
df_src = pd.read_json("../../data/stats_dataset.json")
df_src.head()

CPU times: user 12.7 s, sys: 11.1 s, total: 23.7 s
Wall time: 24.7 s


Unnamed: 0,dup_id,event_id,group_id,line_num,log_name,planid,raw,template,text,variables,name,build_log
0,20a0d208-1053-49d3-8b44-ac1bbc77935f,000007c8-387e-43af-b3e5-854dceeb6e84,4fc2fbc7-fc6d-4ef0-81fc-4e7a247f1b53,2045,ddf_vx_simbc69,243808,2023-11-02 20:18:36 /localdisk/6500_repo/ome/v...,error undeclared first use in this function,2023-11-02 20:18:36 /localdisk/6500_repo/ome/v...,"[pTcb, /localdisk/6500_repo/ome/vobs/optnet_os...","243808, ddf_vx_simbc69","243808, ddf_vx_simbc69"
1,83caed2f-f8d1-4206-a2cf-8ba06557a1af,0000194e-612e-41e8-9c62-1a1c58c84b7e,ae294c83-8695-4df0-b780-3b0711921280,2398,50ghz_wss_sim,243489,cp: cannot stat '/localdisk/6500_repo/ome/vobs...,cp cannot stat no such file or directory,cp: cannot stat '/localdisk/6500_repo/ome/vobs...,[/localdisk/6500_repo/ome/vobs/viking_build/bu...,"243489, 50ghz_wss_sim","243489, 50ghz_wss_sim"
2,e07912c0-d331-42ab-a982-571ba55cde7d,0000429b-6c55-4c68-a463-3be09b4085cd,4eeca6ec-7a15-49b4-8aec-d41e4b70abb7,4201,otsc_simbc,245234,cp: cannot stat '/localdisk/6500_repo/ome/vobs...,cp cannot stat no such file or directory,cp: cannot stat '/localdisk/6500_repo/ome/vobs...,[/localdisk/6500_repo/ome/vobs/viking_build/bu...,"245234, otsc_simbc","245234, otsc_simbc"
3,93183c2e-90a9-464b-be6e-9a3153ce9e0a,00004e59-683f-4718-a101-9dd1f931b41d,f7d765e7-3b2e-445d-a231-936bbc382b5f,1395,otsc_ppc,242344,2023-10-27 07:05:36 sed: can't read *.h.temp: ...,sed can t read temp no such file or directory,2023-10-27 07:05:36 sed: can't read *.h.temp: ...,[*.h],"242344, otsc_ppc","242344, otsc_ppc"
4,79f3f012-c911-46c8-9513-85c1c9faba89,00008fd9-f276-422a-9e0c-eb76293814a5,4238f418-9c0a-4699-be15-1440868fbd83,5091,otn_200g_motr_ppc,244245,fatal: no such path 'vobs/optnet_broadband_app...,fatal no such path in head,fatal: no such path 'vobs/optnet_broadband_app...,[vobs/optnet_broadband_apps/hi/cards/otn_200g_...,"244245, otn_200g_motr_ppc","244245, otn_200g_motr_ppc"


For demonstration purposes, and to avoid a long computation time, we use a subsample of the data. Log files come in different sizes (number of lines). To keep the subsample representative, we sample 5 log files of each size:

In [3]:
%%time
sample = u.sample_log_files(df_src)
df = df_src[df_src['build_log'].isin(sample)].copy()  # copy so the assignment below does not write to a slice of df_src
df.loc[:, "text"] = df["text"].apply(u.remove_date_time)
df

CPU times: user 11.3 s, sys: 0 ns, total: 11.3 s
Wall time: 11.3 s


Unnamed: 0,dup_id,event_id,group_id,line_num,log_name,planid,raw,template,text,variables,name,build_log
61,958f792f-0280-4e95-acb8-f857b095fb9d,0005d6bb-3783-4dde-8455-ddc749f16faa,e60b9d0e-5659-438e-a1c2-7e7e275f6452,2765,ddf_vx_simbc69,245149,2023-11-08 20:31:28 /localdisk/6500_repo/ome/v...,error for each function it appears in,/localdisk/6500_repo/ome/vobs/optnet_os/vxwork...,[/localdisk/6500_repo/ome/vobs/optnet_os/vxwor...,"245149, ddf_vx_simbc69","245149, ddf_vx_simbc69"
72,9ba0835c-8344-48e3-967a-f46f0e02db8c,0006f1b5-54a2-4ee5-ae1d-d30cc12af35a,0373f9c4-22a3-4bdf-9eee-3ce1fac16b1d,6730,hybrid_500G_trib_R1290_simbc,243909,2023-11-03 07:55:07 /localdisk/6500_repo/ome/v...,error previous declaration here,/localdisk/6500_repo/ome/vobs/optnet_os/vxwork...,"[int abs(int), /localdisk/6500_repo/ome/vobs/o...","243909, hybrid_500G_trib_R1290_simbc","243909, hybrid_500G_trib_R1290_simbc"
191,2af671c7-e7ee-4d3f-a226-adffef0a6df5,0011f37e-9119-47b3-a0b1-ecde5ab400c1,558575d7-6e27-4fe1-a416-340b8ae6e7f7,6740,hybrid_500G_trib_simbc,242656,2023-10-30 07:31:43 /localdisk/6500_repo/ome/v...,error previous declaration here,/localdisk/6500_repo/ome/vobs/optnet_os/vxwork...,"[div_t div(int, int), /localdisk/6500_repo/ome...","242656, hybrid_500G_trib_simbc","242656, hybrid_500G_trib_simbc"
205,7c013cc3-79cf-4db2-8c9a-75408b1b0ccf,0012b6d5-1757-4e51-a612-8aeb028e3b11,a677b971-fc8d-49e2-8322-a5357cd86533,39408,sp3_simbc,245445,2023-11-09 22:07:09 6.3 compile failed usin...,compile failed using,6.3 compile failed using .. \n,[6.3],"245445, sp3_simbc","245445, sp3_simbc"
242,5f7da74f-dca5-4669-83d5-2134ae8e2a55,001659c7-71a8-4814-b20d-bf7759df1bee,1fdc9d4f-8581-49bb-8fe0-92e171bc4e2b,1092,sp3_simbc,246106,2023-11-14 08:13:17 tput: No value for $TERM a...,tput no value for term and no t specified,tput: No value for $TERM and no -T specified\n,[],"246106, sp3_simbc","246106, sp3_simbc"
...,...,...,...,...,...,...,...,...,...,...,...,...
680016,6a1b3be9-eef7-47b6-873c-c05c046acfef,ffe88cc3-8348-4f27-9c74-429e44458cca,c6fc7925-da73-4835-9575-7f87623b5827,1864,ddf_vx_simbc69,240698,2023-10-19 15:28:44 /localdisk/6500_repo/ome/v...,error syntax error before token,/localdisk/6500_repo/ome/vobs/optnet_os/vxwork...,"[}, /localdisk/6500_repo/ome/vobs/optnet_os/vx...","240698, ddf_vx_simbc69","240698, ddf_vx_simbc69"
680053,4fe52884-7300-433b-895e-d098cf6be00b,ffeb75a2-94f4-486d-83e0-a4ff6d2770e1,37213ab4-7d06-4328-8a22-384bbc942041,714,ddf_vx_simbc69,241086,2023-10-21 06:47:13 find: ‘/localdisk/6500_rep...,find no such file or directory,find: ‘/localdisk/6500_repo/plug-controller/ya...,[/localdisk/6500_repo/plug-controller/yang],"241086, ddf_vx_simbc69","241086, ddf_vx_simbc69"
680138,5f7da74f-dca5-4669-83d5-2134ae8e2a55,fff2a2fe-71a6-421c-9cdc-67171db44260,50857f24-ff41-4cb6-a7b3-6ee6a78bd67e,3317,hybrid_100G_trib_simbc,245598,2023-11-10 11:37:25 tput: No value for $TERM a...,tput no value for term and no t specified,tput: No value for $TERM and no -T specified\n,[],"245598, hybrid_100G_trib_simbc","245598, hybrid_100G_trib_simbc"
680150,10388dee-c920-433d-96bb-035c97d52c77,fff3c19f-74be-444b-bedc-942876fde9d4,89269c92-6824-42cc-b783-57d2606dfb91,2560,ddf_vx_simbc69_spap3,245533,2023-11-10 07:28:55 /localdisk/6500_repo/ome/v...,error requested alignment is not a constant,/localdisk/6500_repo/ome/vobs/optnet_os/vxwork...,[/localdisk/6500_repo/ome/vobs/optnet_os/vxwor...,"245533, ddf_vx_simbc69_spap3","245533, ddf_vx_simbc69_spap3"
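
The helper `u.sample_log_files` is not shown in this notebook. As a minimal sketch of how such a size-stratified sampling might work, assuming that the size of a log file is the number of rows sharing its `build_log` value, the following stdlib-only function draws up to 5 build logs per distinct size. The name `sample_log_files_sketch` and the parameters `n_per_size` and `seed` are hypothetical and are not part of `linetracker`.

```python
import random
from collections import Counter, defaultdict


def sample_log_files_sketch(build_logs, n_per_size=5, seed=0):
    """Pick up to `n_per_size` build logs for each distinct log size.

    `build_logs` is an iterable with one entry per log line (for example the
    `build_log` column). Assumption: the size of a log is its number of lines.
    """
    rng = random.Random(seed)
    lines_per_log = Counter(build_logs)  # build log -> number of lines
    logs_by_size = defaultdict(list)     # number of lines -> build logs
    for log, n_lines in lines_per_log.items():
        logs_by_size[n_lines].append(log)
    picked = []
    for size in sorted(logs_by_size):
        candidates = sorted(logs_by_size[size])
        # keep every log of this size when fewer than n_per_size exist
        picked.extend(rng.sample(candidates, min(n_per_size, len(candidates))))
    return picked


# toy data: logs "a" and "b" have 1 line, log "c" has 2 lines
toy_sample = sample_log_files_sketch(["a", "b", "c", "c"], n_per_size=1)
print(toy_sample)
```

With a DataFrame, the call would look like `sample_log_files_sketch(df_src["build_log"])`, since iterating over a pandas Series yields its values.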


## 1. Self-supervised pretraining

### 1.1 Generation of the data

We want a JSON Lines file in which each line has the following format (with the date removed from the logs):
```
{"text": str}
```

We shuffle all log lines and apply the preprocessing:

In [4]:
%%time
pretrain_df = df.sample(frac=1)
pretrain_df = pretrain_df[['text']]
pretrain_df.loc[:, "text"] = pretrain_df['text'].apply(u.remove_date_time)
u.dicts_to_jsonl(pretrain_df[['text']].to_dict(orient="records"), "../../data/pretrain_data.jsonl")
pretrain_df

CPU times: user 97.4 ms, sys: 0 ns, total: 97.4 ms
Wall time: 98.9 ms


Unnamed: 0,text
467648,cp: cannot stat '/localdisk/6500_repo/ome/vobs...
268217,jq: error (at <stdin>:24): null (null) has no ...
215536,tput: No value for $TERM and no -T specified\n
300206,==============FAIL =================\n
658237,cp: cannot stat '/localdisk/6500_repo/ome/vobs...
...,...
129316,/localdisk/6500_repo/ome/vobs/optnet_core/incl...
69775,/localdisk/6500_repo/ome/vobs/optnet_os/vxwork...
657049,cp: cannot stat '/localdisk/6500_repo/ome/vobs...
655307,cp: cannot stat '/localdisk/6500_repo/ome/vobs...
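
The helpers `u.remove_date_time` and `u.dicts_to_jsonl` are not shown here. As a rough sketch of what they might do, assuming timestamps always appear as a leading `YYYY-MM-DD HH:MM:SS` prefix (as in the `raw` column above), the following stdlib-only code strips the prefix and writes one JSON object per line. The function names ending in `_sketch` and the demo file name are hypothetical.

```python
import json
import re
import tempfile
from pathlib import Path

# Assumption: timestamps look like "2023-11-02 20:18:36 " at the start of a line
TIMESTAMP_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s+")


def remove_date_time_sketch(line: str) -> str:
    """Strip a leading 'YYYY-MM-DD HH:MM:SS' timestamp if present."""
    return TIMESTAMP_PREFIX.sub("", line)


def dicts_to_jsonl_sketch(records, path) -> None:
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as fp:
        for record in records:
            fp.write(json.dumps(record, ensure_ascii=False) + "\n")


raw_lines = [
    "2023-11-14 08:13:17 tput: No value for $TERM and no -T specified\n",
    "cp: cannot stat 'some/path'\n",  # no timestamp: left unchanged
]
records = [{"text": remove_date_time_sketch(line)} for line in raw_lines]

# write to a temporary directory so the demo does not touch the real data
out_path = Path(tempfile.mkdtemp()) / "pretrain_data_demo.jsonl"
dicts_to_jsonl_sketch(records, out_path)
written = out_path.read_text(encoding="utf-8").splitlines()
print(written[0])
```

Because `json.dumps` escapes the newline inside each log line, every record stays on a single physical line of the file.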


Then we transfer the data to Dorsal:

In [None]:
!rsync  -rltv --info=progress2  ./../../data/pretrain_data.jsonl ...your_username...@...dorsalip_vpn_or_eduroam...:/home/...your_username.../

### 1.2 Pretraining on Dorsal

We connect to the Dorsal server with ssh:
```
ssh ...
```

There we can either submit a job as a shell script or request an interactive console to run the commands step by step.

1. If step by step: ask for the resources

For instance, to ask for a terminal for 12 hours with 1 CPU and 1 GPU:

```
salloc --time=12:0:0 --ntasks=1 --cpus-per-task=1  --exclude=epyc2 --gpus-per-task=1
```

Note that if the ssh connection is lost, there is no guarantee that the job keeps running.

Connect to the node (here epyc1)

```
ssh ...your_username...@...epyc1ip...
```

Then follow the commands in the script of option 2 below to load the modules and install the dependencies.

2. With shell script

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=1 # change this parameter to 2,4,6,... and increase "--dataloader_num_workers" accordingly to see the effect on performance
#SBATCH --time=01-00:00:00           # duration (DD-HH:MM:SS)
#SBATCH --output=log-%x-%j.out
#SBATCH --error=log-%x-%j.err
#SBATCH --mail-user=robin.moine456@gmail.com
#SBATCH --mail-type=ALL

# start of installation of the environment
module load py-torch py-transformers py-datasets py-flagembedding/1.2.3-gcc-9.4.0-qrjtb7n

# launch the pretraining
torchrun \
    --nnodes=1 \
    -m FlagEmbedding.baai_general_embedding.retromae_pretrain.run \
    --output_dir ./bge-large-en-v1.5-pretrained \
    --model_name_or_path BAAI/bge-large-en-v1.5 \
    --train_data ./pretrain_data.jsonl \
    --learning_rate 2e-5 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --dataloader_drop_last True \
    --max_seq_length 500 \
    --logging_steps 10 \
    --dataloader_num_workers 1
```

Once the job has finished, we copy the pretrained model back to the local machine:

In [None]:
!rsync  -rltv --info=progress2  ...your_username...@...dorsalip_vpn_or_eduroam...:/home/...your_username.../bge-large-en-v1.5-pretrained/ ../../data/bge-large-en-v1.5-pretrained

## 2. Contrastive learning finetuning

### 2.1 Generation of the data

We want a JSON Lines file in which each line has the following format (with the date removed from the logs):
```
{"query": str, "pos": List[str], "neg":List[str]}
```

- `query` is the log line for which we want to find similar log lines.
- `pos` are the lines that are similar to the query.
- `neg` are the lines that are different from the query.

Note: several approaches are possible for finding the negative samples:
1. either you already know them
2. or you can use the [hard negatives](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) method

In this notebook, we consider the negative samples to be the lines of the same build log that belong to a different cluster (`group_id`) in the reference file of CIENA.

In [30]:
df_finetune = df[['event_id','build_log','text','group_id']].copy()
diff_lines = {}
same_lines = {}
for build_log, row_vals in df_finetune.groupby("build_log"):
    labels = prep.LabelEncoder().fit_transform(row_vals['group_id'])
    for label, event_id in zip(labels, row_vals['event_id']):
        # negative samples: lines of the same build log in a different cluster
        diff_lines[event_id] = [text for other_label, text in zip(labels, row_vals['text']) if other_label != label]
        # positive samples: lines of the same build log in the same cluster (including the query itself)
        same_lines[event_id] = [text for other_label, text in zip(labels, row_vals['text']) if other_label == label]
df_finetune.loc[:, 'pos'] = df_finetune['event_id'].apply(lambda x:same_lines[x])
df_finetune.loc[:, 'neg'] = df_finetune['event_id'].apply(lambda x:diff_lines[x])
df_finetune.rename({"text":"query"},axis=1,inplace=True)
df_finetune

Unnamed: 0,event_id,build_log,query,group_id,pos,neg
61,0005d6bb-3783-4dde-8455-ddc749f16faa,"245149, ddf_vx_simbc69",/localdisk/6500_repo/ome/vobs/optnet_os/vxwork...,e60b9d0e-5659-438e-a1c2-7e7e275f6452,[/localdisk/6500_repo/ome/vobs/optnet_os/vxwor...,[cp: cannot stat '/localdisk/6500_repo/ome/vob...
72,0006f1b5-54a2-4ee5-ae1d-d30cc12af35a,"243909, hybrid_500G_trib_R1290_simbc",/localdisk/6500_repo/ome/vobs/optnet_os/vxwork...,0373f9c4-22a3-4bdf-9eee-3ce1fac16b1d,[/localdisk/6500_repo/ome/vobs/optnet_os/vxwor...,[cp: cannot stat '/localdisk/6500_repo/ome/vob...
191,0011f37e-9119-47b3-a0b1-ecde5ab400c1,"242656, hybrid_500G_trib_simbc",/localdisk/6500_repo/ome/vobs/optnet_os/vxwork...,558575d7-6e27-4fe1-a416-340b8ae6e7f7,[/localdisk/6500_repo/ome/vobs/optnet_os/vxwor...,[tput: No value for $TERM and no -T specified\...
205,0012b6d5-1757-4e51-a612-8aeb028e3b11,"245445, sp3_simbc",6.3 compile failed using .. \n,a677b971-fc8d-49e2-8322-a5357cd86533,"[6.3 compile failed using .. \n, =========...",[tput: No value for $TERM and no -T specified\...
242,001659c7-71a8-4814-b20d-bf7759df1bee,"246106, sp3_simbc",tput: No value for $TERM and no -T specified\n,1fdc9d4f-8581-49bb-8fe0-92e171bc4e2b,[tput: No value for $TERM and no -T specified\...,[cp: cannot stat '/localdisk/6500_repo/ome/vob...
...,...,...,...,...,...,...
680016,ffe88cc3-8348-4f27-9c74-429e44458cca,"240698, ddf_vx_simbc69",/localdisk/6500_repo/ome/vobs/optnet_os/vxwork...,c6fc7925-da73-4835-9575-7f87623b5827,[/localdisk/6500_repo/ome/vobs/optnet_os/vxwor...,"[cannot open directory: dhcp\n, cannot open di..."
680053,ffeb75a2-94f4-486d-83e0-a4ff6d2770e1,"241086, ddf_vx_simbc69",find: ‘/localdisk/6500_repo/plug-controller/ya...,37213ab4-7d06-4328-8a22-384bbc942041,[find: ‘/localdisk/6500_repo/ddy-diag/yang’: N...,[cc1: error: unrecognized command line option ...
680138,fff2a2fe-71a6-421c-9cdc-67171db44260,"245598, hybrid_100G_trib_simbc",tput: No value for $TERM and no -T specified\n,50857f24-ff41-4cb6-a7b3-6ee6a78bd67e,[tput: No value for $TERM and no -T specified\...,[cp: cannot stat '/localdisk/6500_repo/ome/vob...
680150,fff3c19f-74be-444b-bedc-942876fde9d4,"245533, ddf_vx_simbc69_spap3",/localdisk/6500_repo/ome/vobs/optnet_os/vxwork...,89269c92-6824-42cc-b783-57d2606dfb91,[/localdisk/6500_repo/ome/vobs/optnet_os/vxwor...,"[==============FAIL =================\n, cp: c..."
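
To make the construction above concrete, here is a toy sketch in plain Python that builds the same `query` / `pos` / `neg` records from a handful of made-up rows. The function name `build_contrastive_records` and the toy rows are hypothetical; the logic mirrors the cell above, where lines are compared only within the same build log.

```python
from collections import defaultdict


def build_contrastive_records(rows):
    """rows: dicts with 'build_log', 'group_id' and 'text' keys (toy stand-in for df)."""
    by_log = defaultdict(list)
    for row in rows:
        by_log[row["build_log"]].append(row)
    records = []
    for log_rows in by_log.values():
        for row in log_rows:
            records.append({
                "query": row["text"],
                # same cluster in the same build log (includes the query itself)
                "pos": [r["text"] for r in log_rows if r["group_id"] == row["group_id"]],
                # different cluster in the same build log
                "neg": [r["text"] for r in log_rows if r["group_id"] != row["group_id"]],
            })
    return records


toy_rows = [
    {"build_log": "log1", "group_id": "g1", "text": "cp: cannot stat 'a'"},
    {"build_log": "log1", "group_id": "g1", "text": "cp: cannot stat 'b'"},
    {"build_log": "log1", "group_id": "g2", "text": "tput: No value for $TERM"},
    {"build_log": "log2", "group_id": "g3", "text": "fatal: no such path"},
]
toy_records = build_contrastive_records(toy_rows)
for rec in toy_records:
    print(rec)
```

The last toy record has an empty `neg` list because its build log contains a single cluster. Since each query appears in its own `pos` list, a `pos` list is never empty.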


Let us see some statistics about the number of positive and negative samples

In [31]:
print("Negative length statistic")
print(df_finetune['neg'].apply(len).describe())
print("\nPositive length statistic")
print(df_finetune['pos'].apply(len).describe())

Negative length statistic
count    15977.000000
mean        35.090192
std         27.937891
min          0.000000
25%         12.000000
50%         33.000000
75%         50.000000
max        147.000000
Name: neg, dtype: float64

Positive length statistic
count    15977.000000
mean        32.335169
std         40.820016
min          1.000000
25%         10.000000
50%         13.000000
75%         31.000000
max        145.000000
Name: pos, dtype: float64


We remove the log lines that have no negative samples:

In [32]:
df_finetune.loc[:, "neg_length"] = df_finetune['neg'].apply(len)
df_finetune = df_finetune.query("neg_length > 0").copy()
df_finetune = df_finetune.drop(["neg_length"],axis=1)

In [33]:
print("Negative length statistic")
print(df_finetune['neg'].apply(len).describe())
print("\nPositive length statistic")
print(df_finetune['pos'].apply(len).describe())
u.dicts_to_jsonl(df_finetune.to_dict(orient="records"), "../../data/finetune_data.jsonl")  # same data folder as the rsync command

Negative length statistic
count    15802.000000
mean        35.478800
std         27.845682
min          1.000000
25%         13.000000
50%         33.000000
75%         50.000000
max        147.000000
Name: neg, dtype: float64

Positive length statistic
count    15802.000000
mean        32.558917
std         40.986540
min          1.000000
25%         10.000000
50%         14.000000
75%         31.000000
max        145.000000
Name: pos, dtype: float64
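
Before transferring the file, it can be worth checking that every record has the expected keys and non-empty `pos` and `neg` lists. A minimal sketch on toy data follows; `check_finetune_records` is a hypothetical helper, not part of `linetracker` or FlagEmbedding.

```python
def check_finetune_records(records):
    """Return the indices of records that do not match {"query", "pos", "neg"}."""
    bad = []
    for i, rec in enumerate(records):
        ok = (
            isinstance(rec.get("query"), str)
            and isinstance(rec.get("pos"), list) and len(rec["pos"]) > 0
            and isinstance(rec.get("neg"), list) and len(rec["neg"]) > 0
        )
        if not ok:
            bad.append(i)
    return bad


demo_records = [
    {"query": "a", "pos": ["a"], "neg": ["b"]},
    {"query": "c", "pos": ["c"], "neg": []},  # no negative: should be flagged
]
bad_indices = check_finetune_records(demo_records)
print(bad_indices)
```

On the real data, the same check could be run on `df_finetune.to_dict(orient="records")`, and an empty result would mean that all records are usable.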


Then we transfer the data to Dorsal:

```bash
!rsync  -rltv --info=progress2  ./../../data/finetune_data.jsonl ...your_username...@...dorsalip_vpn_or_eduroam...:/home/...your_username.../
```

In [34]:
!rsync  -rltv --info=progress2  ./../../data/finetune_data.jsonl rmoine@132.207.72.24:/home/rmoine/

sending incremental file list
finetune_data.jsonl
    127,869,012 100%   11.11MB/s    0:00:10 (xfr#1, to-chk=0/1)

sent 108,045,092 bytes  received 78,999 bytes  8,009,191.93 bytes/sec
total size is 127,869,012  speedup is 1.18


### 2.2 Finetuning on Dorsal

We connect to the Dorsal server with ssh:
```
ssh ...
```

There we can either submit a job as a shell script or request an interactive console to run the commands step by step.

1. If step by step: ask for the resources

For instance, to ask for a terminal for 12 hours with 1 CPU and 1 GPU:

```
salloc --time=12:0:0 --ntasks=1 --cpus-per-task=1  --exclude=epyc2 --gpus-per-task=1
```

Note that if the ssh connection is lost, there is no guarantee that the job keeps running.

Connect to the node (here epyc1)

```
ssh ...your_username...@...epyc1ip...
```

Then follow the commands in the script of option 2 below to load the modules and install the dependencies.

2. With shell script

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=1 # change this parameter to 2,4,6,... and increase "--dataloader_num_workers" accordingly to see the effect on performance
#SBATCH --time=01-00:00:00           # duration (DD-HH:MM:SS)
#SBATCH --output=log-%x-%j.out
#SBATCH --error=log-%x-%j.err
#SBATCH --mail-user=robin.moine456@gmail.com
#SBATCH --mail-type=ALL

# start of installation of the environment
module load py-torch py-transformers py-datasets py-flagembedding/1.2.3-gcc-9.4.0-qrjtb7n py-huggingface-hub/0.14.1-gcc-9.4.0-vudr6hg py-pip/23.1.2-gcc-9.4.0-otmrfim
pip install -U sentence_transformers
# launch the finetuning
torchrun \
    --nnodes=1 \
    -m FlagEmbedding.baai_general_embedding.finetune.run \
    --output_dir ./bge-large-en-v1.5-finetuned/ \
    --model_name_or_path BAAI/bge-large-en-v1.5 \
    --train_data ./finetune_data.jsonl \
    --learning_rate 1e-3 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 4 \
    --dataloader_drop_last True \
    --normlized True \
    --temperature 0.02 \
    --query_max_len 500 \
    --passage_max_len 500 \
    --train_group_size 2 \
    --logging_steps 10 \
    --query_instruction_for_retrieval "" 
```
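
The `--temperature` and `--train_group_size` options relate to the contrastive objective. As a rough illustration, the sketch below computes a standard InfoNCE-style loss for a single query scored against one positive and one negative passage, which is our understanding of what `--train_group_size 2` means. This is not FlagEmbedding's actual implementation: it ignores in-batch negatives, and the 2-d embeddings and the names `cosine` and `info_nce_loss` are made up for illustration.

```python
import math


def cosine(a_vec, b_vec):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(a * b for a, b in zip(a_vec, b_vec))
    norm_a = math.sqrt(sum(a * a for a in a_vec))
    norm_b = math.sqrt(sum(b * b for b in b_vec))
    return dot / (norm_a * norm_b)


def info_nce_loss(query, passages, temperature=0.02):
    """Cross-entropy over temperature-scaled similarities; passages[0] is the positive."""
    logits = [cosine(query, p) / temperature for p in passages]
    # log-sum-exp with max subtraction for numerical stability
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]


# toy 2-d embeddings (made up for illustration)
query = [1.0, 0.0]
positive = [0.9, 0.1]
negative = [0.0, 1.0]
loss_good = info_nce_loss(query, [positive, negative])
loss_bad = info_nce_loss(query, [negative, positive])  # positive and negative swapped
print(loss_good, loss_bad)
```

A small temperature such as 0.02 sharpens the softmax: the loss is close to zero when the positive is clearly the most similar passage and grows large when a negative outranks it.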

Once the job has finished, we copy the finetuned model back to the local machine:

In [None]:
!rsync  -rltv --info=progress2  ...your_username...@...dorsalip_vpn_or_eduroam...:/home/...your_username.../bge-large-en-v1.5-finetuned/ ../../data/bge-large-en-v1.5-finetuned