ImageLingo is an image captioning project that uses deep learning to generate captions for images. The project is built using PyTorch and includes training, evaluation, and deployment components.
- Python 3.8 or higher
- PyTorch
- Docker
- DVC
- MLflow
- Clone the repository:

      git clone https://github.com/Yuval728/imagelingo.git
      cd imagelingo

- Create a virtual environment and activate it:

      python -m venv env
      source env/bin/activate  # On Windows use `env\Scripts\activate`

- Install the required packages:

      pip install -r requirements.txt

- Set up DVC:

      dvc pull

ImageLingo is designed to generate descriptive captions for images using a deep learning model. The project leverages a combination of Convolutional Neural Networks (CNNs) for image feature extraction and Recurrent Neural Networks (RNNs) for sequence generation.
The model is trained on the Flickr8k dataset, which contains roughly 8,000 images, each paired with five captions. Data preparation involves the following steps:
- Image Preprocessing: Images are resized and normalized to ensure consistency.
- Caption Tokenization: Captions are tokenized and converted into sequences of word indices.
- Vocabulary Creation: A vocabulary is created based on the frequency of words in the captions.
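
The tokenization and vocabulary steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the special token names, the `min_freq` threshold, and the helper names `tokenize`, `build_vocab`, and `encode` are all assumptions.

```python
import re
from collections import Counter

# Special tokens (names are assumptions, not taken from the project).
PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"


def tokenize(caption):
    """Lowercase a caption and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", caption.lower())


def build_vocab(captions, min_freq=2):
    """Map every word seen at least `min_freq` times to an integer index."""
    counts = Counter(tok for cap in captions for tok in tokenize(cap))
    vocab = {tok: i for i, tok in enumerate([PAD, START, END, UNK])}
    for word, freq in sorted(counts.items()):
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab


def encode(caption, vocab):
    """Convert a caption to word indices, wrapped in start/end markers."""
    unk = vocab[UNK]
    body = [vocab.get(tok, unk) for tok in tokenize(caption)]
    return [vocab[START]] + body + [vocab[END]]


captions = [
    "A dog runs across the grass",
    "A dog jumps over a log",
    "Two children play on the grass",
]
vocab = build_vocab(captions, min_freq=2)
encoded = encode("A dog sleeps on the grass", vocab)
print(encoded)  # [1, 4, 5, 3, 3, 7, 6, 2]
```

Words below the frequency threshold ("sleeps" and "on" here) fall back to the unknown token, which keeps the vocabulary small at the cost of losing rare words.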
The model consists of two main components:
- Encoder (CNN): A pre-trained CNN (such as ResNet) is used to extract features from the images. The final convolutional layer's output is used as the image representation.
- Decoder (RNN): An RNN (such as LSTM) is used to generate captions based on the image features. The decoder is trained to predict the next word in the sequence given the previous words and the image features.
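
As a rough sketch of how the two components fit together, the PyTorch code below pairs an encoder with an LSTM decoder. To keep the example self-contained, a small untrained CNN stands in for the pre-trained backbone (such as ResNet) described above. The class names, layer sizes, and the choice to feed the image features as the first LSTM input are assumptions, not the project's confirmed design.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Small CNN used here as a stand-in for a pre-trained backbone."""

    def __init__(self, embed_dim):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool the final feature map to 1x1
        )
        self.project = nn.Linear(32, embed_dim)

    def forward(self, images):
        feats = self.features(images).flatten(1)  # (batch, 32)
        return self.project(feats)                # (batch, embed_dim)


class Decoder(nn.Module):
    """LSTM that scores the next word at every step."""

    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_features, captions):
        words = self.embed(captions)  # (batch, T, embed_dim)
        # The image features act as the first input step, followed by the words.
        inputs = torch.cat([image_features.unsqueeze(1), words], dim=1)
        hidden, _ = self.lstm(inputs)  # (batch, T + 1, hidden_dim)
        return self.out(hidden)        # (batch, T + 1, vocab_size)


encoder = Encoder(embed_dim=32)
decoder = Decoder(vocab_size=50, embed_dim=32, hidden_dim=64)

images = torch.randn(2, 3, 64, 64)        # a batch of two random "images"
captions = torch.randint(0, 50, (2, 7))   # two captions of seven word indices
logits = decoder(encoder(images), captions)
print(logits.shape)  # torch.Size([2, 8, 50])
```

Each position in `logits` holds one score per vocabulary word, so the decoder output at a given step can be read as a prediction for the next word in the caption.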
The training process involves optimizing the model to minimize the difference between the generated captions and the actual captions. The following steps are performed:
- Forward Pass: The image is passed through the encoder to obtain the image features. The decoder then generates a caption based on these features.
- Loss Calculation: The loss measures how far the predicted words are from the actual caption. For a decoder of this kind it is typically a cross-entropy loss over the predicted word at each position.
- Backward Pass: The gradients are computed and the model parameters are updated to minimize the loss.
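
The three steps above map onto a single training step in PyTorch. The sketch below is illustrative only: `TinyCaptioner` is a hypothetical stand-in model that takes precomputed image features, and the cross-entropy loss, Adam optimizer, learning rate, and padding index are assumptions rather than the project's confirmed settings.

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed index of the padding token


class TinyCaptioner(nn.Module):
    """Hypothetical stand-in model operating on precomputed image features."""

    def __init__(self, feat_dim=32, vocab_size=50, embed_dim=16, hidden_dim=24):
        super().__init__()
        self.project = nn.Linear(feat_dim, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=PAD_IDX)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions_in):
        img = self.project(feats).unsqueeze(1)
        inputs = torch.cat([img, self.embed(captions_in)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)


def train_step(model, optimizer, criterion, feats, captions):
    model.train()
    optimizer.zero_grad()
    # Forward pass: feed the image plus all words except the last,
    # so that step t predicts word t of the reference caption.
    logits = model(feats, captions[:, :-1])  # (batch, T, vocab)
    # Loss calculation: cross-entropy between predictions and reference words.
    loss = criterion(logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
    # Backward pass: compute gradients and update the parameters.
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = TinyCaptioner()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

feats = torch.randn(4, 32)                # random stand-in image features
captions = torch.randint(1, 50, (4, 6))   # random stand-in captions
losses = [train_step(model, optimizer, criterion, feats, captions) for _ in range(20)]
print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

Feeding the reference words as decoder input during training is known as teacher forcing. Setting `ignore_index` keeps padded positions from contributing to the loss.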
The model is evaluated using standard metrics such as the BLEU score, which measures n-gram overlap between the generated captions and the actual captions. The evaluation process involves:
- Generating Captions: The model generates captions for the test images.
- Calculating Metrics: The generated captions are compared with the actual captions using metrics like BLEU score.
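
To make the metric concrete, here is a minimal, unsmoothed sentence-level BLEU written in plain Python. It is a simplified illustration with hypothetical function names, not the project's evaluation code. In practice an established implementation (for example the one in NLTK) is the safer choice.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length `n` in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c_len = len(candidate)
    r_len = min((abs(len(r) - c_len), len(r)) for r in references)[1]
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "a dog runs on the grass".split()
generated = "a dog runs across the grass".split()

exact = sentence_bleu(reference, [reference])
partial = sentence_bleu(generated, [reference], max_n=2)
print(exact, partial)
```

Because the score is unsmoothed, a caption with no matching 4-gram scores zero under the default `max_n=4`, which is why short captions are often scored with lower-order BLEU or with smoothing.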
The trained model is deployed using Docker and TorchServe. The deployment process involves:
- Creating Model Archive: The model is packaged into a model archive file (MAR) using Torch Model Archiver.
- Building Docker Image: A Docker image is created with the necessary dependencies and the model archive.
- Running Docker Container: The Docker container is run to serve the model and provide an API for generating captions.
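
Once the container is running, a client can request a caption over HTTP. The sketch below targets TorchServe's inference API, which by default listens on port 8080 and exposes models under `/predictions/<model_name>`. The model name `imagelingo`, the host, and the shape of the response are assumptions: the response depends on the handler packaged in the model archive.

```python
import urllib.request


def build_caption_request(image_bytes, host="http://localhost:8080",
                          model_name="imagelingo"):
    """Build a POST request for a TorchServe prediction endpoint.

    `host` and `model_name` are hypothetical defaults, not project settings.
    """
    url = f"{host}/predictions/{model_name}"
    return urllib.request.Request(
        url,
        data=image_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def request_caption(image_path, **kwargs):
    """Send an image file to the server and return the raw response text."""
    with open(image_path, "rb") as f:
        req = build_caption_request(f.read(), **kwargs)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8")


# Building the request needs no running server, which makes it easy to inspect.
req = build_caption_request(b"fake image bytes")
print(req.get_method(), req.full_url)
```

With a running container, `request_caption("example.jpg")` would return whatever the handler sends back, for instance a caption string or a JSON document.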
This project is licensed under the MIT License - see the LICENSE file for details.