# RUN INFERENCE SERVER


The cells below show and run the startup script for our inference server. The script loads the model checkpoint we have chosen and lets us send `PUT` requests to the server to generate synthetic tabular data. The server starts on port 5000 by default.
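
As a rough illustration of what a `PUT` request to the server could look like from Python, here is a minimal sketch using only the standard library. The port 5000 default comes from the text above; the `/generate` path, the payload keys, and the helper names `build_put_request` and `send` are assumptions made for illustration, so check `megatron/text_generation_server.py` for the route and fields the server actually expects.

```python
import json
import urllib.request


def build_put_request(payload, host="localhost", port=5000, path="/generate"):
    """Build (but do not send) a JSON PUT request for the inference server.

    ASSUMPTION: the "/generate" path is a placeholder, not confirmed by this notebook.
    """
    url = f"http://{host}:{port}{path}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def send(request, timeout=60.0):
    """Send the request and decode a JSON response (requires a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# ASSUMPTION: placeholder payload; the real field names depend on the server code.
example_payload = {"example_field": "example_value"}
request = build_put_request(example_payload)
```

Calling `send(request)` only makes sense once the server is up; building the request separately makes it easy to inspect the URL and body first.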

It may take a few moments to load the checkpoint and start the server...
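
Because startup is not instantaneous, a client notebook may want to wait until the port accepts connections before sending requests. The sketch below polls the port with a plain TCP connection; the helper name `wait_for_port` and the timeout values are illustrative assumptions, and an open port is only a hint that the server is ready, not proof that the checkpoint has finished loading.

```python
import socket
import time


def wait_for_port(host="localhost", port=5000, timeout=300.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port(port=5000, timeout=600)` could be called at the top of the client notebook before the first request.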

<strong><u>Do not shut down the kernel for this notebook until you have finished `3b_Inference.ipynb`</u></strong>

Pretraining saves checkpoint files periodically to `CHECKPOINT_PATH`. The `save-interval` option in the training script controls how often; we use 10k steps in this example.
While the training job is running, we can monitor it with <a href="http://localhost:6006">TensorBoard on port 6006</a>. The figure below shows example loss curves for the training and validation datasets.
<!-- ![images/tensorboard_loss.png](images/tensorboard_loss.png) -->
<center><img src=images/tensorboard_loss.png width="30%" height="40%" /></center>
<center><strong>Figure:</strong> Example training and validation loss curves</center>
<br>

The validation curve shows that the model eventually overfits. To use the checkpoint from step `76k`, we can change the contents of `latest_checkpointed_iteration.txt` in `CHECKPOINT_PATH` to `76000`. The cells below set and check this file.

### Adjust the model checkpoint to use the pretrained model we have provided for you

In [1]:
!echo 30000 > checkpoints/gpt_creditcard/latest_checkpointed_iteration.txt

In [2]:
!cat checkpoints/gpt_creditcard/latest_checkpointed_iteration.txt

30000
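
The two shell cells above can also be expressed in Python, which is convenient when the iteration is chosen programmatically. This is a minimal sketch: the helper names `set_latest_iteration` and `read_latest_iteration` are hypothetical, and it only rewrites the tracker file without checking that a checkpoint for that iteration exists.

```python
from pathlib import Path

TRACKER_NAME = "latest_checkpointed_iteration.txt"


def set_latest_iteration(checkpoint_dir, iteration):
    """Write the iteration number into the tracker file and return its path."""
    tracker = Path(checkpoint_dir) / TRACKER_NAME
    tracker.write_text(f"{int(iteration)}\n")
    return tracker


def read_latest_iteration(checkpoint_dir):
    """Read back the iteration number stored in the tracker file."""
    tracker = Path(checkpoint_dir) / TRACKER_NAME
    return int(tracker.read_text().strip())
```

With the paths used in this notebook, `set_latest_iteration("checkpoints/gpt_creditcard", 30000)` would have the same effect as the `echo` cell.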


In [3]:
!cat ./run_data_gen_server.sh

#!/bin/bash
# This example will start serving the model.
source ./model_config.sh
SEED="${1:-42}"
PORT="${2:-5000}"
echo $SEED

DISTRIBUTED_ARGS="--nproc_per_node 1 \
                  --nnodes 1 \
                  --node_rank 0 \
                  --master_addr localhost \
                  --master_port 6000"

# CHECKPOINT=$LOADPATH
VOCAB_FILE=credit_card_coder.pickle

python -m torch.distributed.launch $DISTRIBUTED_ARGS tools/run_text_generation_server.py \
       --tensor-model-parallel-size $TENSOR_MP_SIZE  \
       --pipeline-model-parallel-size $PIPELINE_MP_SIZE  \
       --num-layers $NUM_LAYERS  \
       --hidden-size $HIDDEN_SIZE  \
       --load $CHECKPOINT_PATH  \
       --num-attention-heads $NUM_HEADS  \
       --max-position-embeddings $MAX_POS_EMD  \
       --tokenizer-type TabularTokenizer \
       --fp16  \
       --micro-batch-size 1  \
       --seq-length $SEQ_LEN  \
       --out-seq-length $SEQ_LEN  \
       --temperature 1.0  \
       --vocab-file $VOCAB_FILE  \
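
The script reads two optional positional arguments: the seed (default 42) and the port (default 5000). Since they are positional, overriding the port also requires passing a seed. The sketch below builds the matching command line from Python; the helper name `server_command` is hypothetical, and the listing above does not show how the script forwards these values to the Python server, so treat this as an illustration of the argument order only.

```python
def server_command(seed=None, port=None, script="./run_data_gen_server.sh"):
    """Build the argv list for the startup script.

    Arguments are positional: $1 is the seed, $2 is the port.
    """
    command = [script]
    if port is not None:
        # The port is the second argument, so a seed must be supplied first;
        # fall back to the script's own default of 42.
        command.append(str(42 if seed is None else seed))
        command.append(str(port))
    elif seed is not None:
        command.append(str(seed))
    return command


default_command = server_command()             # script defaults: seed 42, port 5000
custom_port_command = server_command(port=5001)
```

The resulting list can be passed to `subprocess.Popen` to launch the server in the background instead of blocking a notebook cell.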


While running, the server prints each incoming `PUT` request. To silence this output, comment out the `print` statements in `megatron/text_generation_server.py:MegatronGenerate:put` and re-run the script below.

In [4]:
!./run_data_gen_server.sh

42
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
setting global batch size to 1
using torch.float16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  activations_checkpoint_method ................... None
  activations_checkpoint_num_layers ............... 1
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adlr_autoresume ................................. False
  adlr_autoresume_interval .................