👁‍🗨 Scene-Script

Introduction

The goal of this project is to provide an easy-to-use, deployable pipeline for training, validation, and inference that takes an image as input and outputs a text description. The pipeline consists of a ViT model for feature extraction and a Transformer model for text generation, and it is deployed with NVIDIA's PyTriton and a Streamlit app.
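
For orientation, the sketch below shows one way such an encoder-decoder captioning model can be wired together in PyTorch. The class name, layer sizes, and simplifications (e.g. positional embeddings are omitted) are illustrative assumptions, not the modules actually defined in this repository.

```python
# Illustrative ViT-encoder + Transformer-decoder captioner; a sketch, not the repo's model.
import torch
import torch.nn as nn

class CaptioningModel(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_heads=8, num_layers=6):
        super().__init__()
        # ViT-style patch embedding: split the image into 16x16 patches and project them.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True), num_layers)
        # Decoder attends to the encoded patches and predicts caption tokens.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True), num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224) -> patch sequence: (B, 196, d_model)
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)
        memory = self.encoder(patches)
        tgt = self.token_embed(captions)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(captions.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(out)   # (B, caption_len, vocab_size) logits
```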

Dataset

The model is trained on Flickr30k, a dataset of 31,783 images collected from Flickr, each paired with 5 captions. The dataset can be downloaded from here.
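
Since each image has 5 captions, the training data is usually flattened into (image, caption) pairs. The sketch below assumes a simple `image_name,caption` line format for the captions file, which may not match the repository's exact format.

```python
# Minimal sketch of a Flickr30k-style caption dataset; the file format is an assumption.
import os
from PIL import Image
from torch.utils.data import Dataset

class Flickr30kCaptions(Dataset):
    def __init__(self, images_dir, caption_file, transform=None):
        self.images_dir = images_dir
        self.transform = transform
        self.samples = []                       # one (image_name, caption) pair per line
        with open(caption_file) as f:
            for line in f:
                name, caption = line.rstrip("\n").split(",", 1)
                self.samples.append((name, caption))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        name, caption = self.samples[idx]
        image = Image.open(os.path.join(self.images_dir, name)).convert("RGB")
        if self.transform:
            image = self.transform(image)
        return image, caption
```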

Installation

Clone the repository using the following command:

git clone https://github.com/suryanshgupta9933/Scene-Script.git

Now, install the required packages using the following command:

pip install -r requirements.txt

Usage

Preprocessing

Preprocess the captions text file to remove unnecessary data using the following command:

python preprocess.py --caption_file <path/to/captions/file>
  • caption_file is the path to the captions file.

Note: The preprocessed captions file is saved in the same directory as the original captions file and overwrites it. Either work on a copy of the original captions file, rename the original before running the command above, or use the preprocessed file already provided in the repository's data directory.
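
The exact cleaning steps live in preprocess.py; as a rough idea, caption preprocessing for datasets like this usually looks something like the sketch below (the specific steps are assumptions, not a copy of the script).

```python
# Hypothetical caption-cleaning steps; the repository's preprocess.py may differ.
import re

def clean_caption(caption: str) -> str:
    caption = caption.lower().strip()
    caption = re.sub(r"[^a-z0-9 ]", "", caption)   # drop punctuation and special characters
    caption = re.sub(r"\s+", " ", caption)         # collapse repeated whitespace
    return caption

print(clean_caption("Two dogs are  playing, in the park!"))
# -> "two dogs are playing in the park"
```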

Model Configuration

  • The model configuration JSON file contains the model parameters.
  • Three different model configurations are provided in the configs directory.
  • The smallest has 92 million parameters and the largest has 112 million parameters.
  • The default configuration has 97 million parameters, which strikes a good balance between model size and performance (an example of loading such a file is sketched below).
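
As an illustration, the configuration could be loaded as shown below; the file name and key names are assumptions, so check the actual files in the configs directory for the real schema.

```python
# Sketch of reading a model configuration; the file name and keys are hypothetical.
import json

with open("configs/scene_script_97m.json") as f:   # hypothetical file name
    config = json.load(f)

# Keys such as these would define the model size:
d_model = config.get("d_model", 512)
num_layers = config.get("num_layers", 6)
num_heads = config.get("num_heads", 8)
vocab_size = config.get("vocab_size", 30000)
```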

Hyperparameters

  • The model configuration json file also contains the training parameters.
  • The learning_rate is set to 1e-6.
  • The batch_size is set to 128.
  • The num_epochs is set to 10.

Note: Change the batch size and the number of epochs according to the GPU memory and the time available.
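
For context, these hyperparameters typically feed straight into the data loader, optimizer, and training loop, roughly as sketched below. This builds on the hypothetical CaptioningModel and Flickr30kCaptions sketches above and is not the repository's actual train.py.

```python
# Sketch of consuming the training hyperparameters; not the repo's actual training loop.
import torch
from torch.utils.data import DataLoader

learning_rate, batch_size, num_epochs = 1e-6, 128, 10           # values from the default config

model = CaptioningModel(vocab_size=30000)                        # from the sketch above
dataset = Flickr30kCaptions("data/images", "data/captions.txt")  # hypothetical paths
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for images, captions in loader:       # captions must be tokenised into id tensors first
        logits = model(images, captions[:, :-1])
        loss = criterion(logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```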

Training

To train the model, run the following command:

python train.py --model_config <model-config> --images_dir <path/to/images/directory> --caption_file <path/to/captions/file>
  • model_config is the path to the model configuration file, a JSON file containing both the model parameters and the training parameters.

Note: The default is the 97-million-parameter configuration.

  • images_dir is the path to the directory containing the images.
  • caption_file is the path to the captions file.

Validation

  • The validation loop runs after training and computes the validation loss.
  • The validation split is 10% of the dataset; to change it, adjust the split parameter in train.py (one way to implement such a split is sketched below).
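
One common way to carve out such a split is torch.utils.data.random_split, shown below; train.py may implement the split differently.

```python
# Sketch of a 90/10 train/validation split; assumes `dataset` from the earlier sketch.
from torch.utils.data import random_split

split = 0.1                                         # fraction held out for validation
val_size = int(len(dataset) * split)
train_set, val_set = random_split(dataset, [len(dataset) - val_size, val_size])
```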

Saving the Model

  • The model is saved once the training and validation loops have completed.
  • The model weights are saved in the weights directory.

Inference

To run inference on a single image, run the following command:

python inference.py --model_config <model-config> --model_weights <path/to/model/weights> --image_path <path/to/image>
  • model_config is the path to the model configuration file, a JSON file containing the model parameters.

Note: The default is the 97-million-parameter configuration.

  • model_weights is the path to the model weights file.

Note: The default is the weights/scene_script.pth file.

  • image_path is the path to the image.
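
Under the hood, single-image inference generally amounts to loading the weights, preprocessing the image, and decoding tokens greedily. The sketch below illustrates that flow using the hypothetical CaptioningModel from above and made-up special-token ids; it is not a copy of inference.py.

```python
# Illustrative single-image inference flow; inference.py may differ in detail.
import torch
from PIL import Image
from torchvision import transforms

START_TOKEN_ID, END_TOKEN_ID = 1, 2                  # hypothetical special-token ids

model = CaptioningModel(vocab_size=30000)            # from the sketch above
model.load_state_dict(torch.load("weights/scene_script.pth", map_location="cpu"))
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)

tokens = [START_TOKEN_ID]
with torch.no_grad():
    for _ in range(50):                              # cap the caption length
        logits = model(image, torch.tensor([tokens]))
        next_id = logits[0, -1].argmax().item()      # greedy decoding
        if next_id == END_TOKEN_ID:
            break
        tokens.append(next_id)
# tokens[1:] would then be decoded back into words by the tokenizer.
```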

NVIDIA PyTriton

  • NVIDIA PyTriton is a lightweight Python wrapper for the Triton Inference Server.
  • The library allows serving machine learning models directly from Python through NVIDIA's Triton Inference Server.
  • The library can be installed using the following command:
pip install -U nvidia-pytriton
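
A serving script for this model would roughly follow PyTriton's bind/serve pattern, as sketched below. The tensor names, dtypes, and the inference body are illustrative assumptions rather than the repository's actual deployment code.

```python
# Sketch of serving the captioning model with PyTriton; names and shapes are assumptions.
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

@batch
def infer_fn(image: np.ndarray):
    # image: (batch, H, W, 3) uint8 array sent by the client.
    captions = np.array([b"a placeholder caption"] * len(image)).reshape(-1, 1)
    return {"caption": captions}                     # run the real model here instead

with Triton() as triton:
    triton.bind(
        model_name="SceneScript",
        infer_func=infer_fn,
        inputs=[Tensor(name="image", dtype=np.uint8, shape=(-1, -1, 3))],
        outputs=[Tensor(name="caption", dtype=bytes, shape=(1,))],
        config=ModelConfig(max_batch_size=8),
    )
    triton.serve()                                   # exposes HTTP/gRPC endpoints
```

The Streamlit app can then act as a client that sends the image to this endpoint and displays the returned caption.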
