Visual Question Answering through modal dialogue (B.Tech Project) + API

Visual Question Answering through Modal Dialogue

We’re already seeing incredible applications of object detection in our daily lives. One such interesting application is Visual Question Answering (VQA): a relatively new problem in Computer Vision in which the data consist of open-ended questions about images. In order to answer these questions, an effective system would need an understanding of “vision, language and common-sense.”

Before proceeding further, I would highly encourage you to quickly read the full VQA post here.

Try it now on FloydHub


Click this button to open a Workspace on FloydHub that will train this model.

Do remember to execute the setup command inside a terminal every time you restart your workspace, to install the relevant dependencies.

This post will first dig into the basic theory behind the Visual Question Answering task. Then, we’ll discuss and build two approaches to VQA: the “bag-of-words” and the “recurrent” model. Finally, we’ll provide a tutorial workflow for training your own models and setting up a REST API on FloydHub to start answering questions about your own images. The project code is in Python (Keras + TensorFlow). You can view my experiments directly on FloydHub, as well as the code (along with the weight files and data) on GitHub.

Since I've already preprocessed the data and stored everything in a FloydHub dataset, here's what we're going to do:

  • Check out the preprocessed data from the VQA Dataset.
  • Build and train two VQA models using Keras and TensorFlow.
  • Assess the models on the VQA validation sets.
  • Run the models to generate some really cool predictions.
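To make the "bag-of-words" approach concrete, here is a minimal sketch of how its input representation is typically built: a fixed CNN image feature concatenated with a bag-of-words encoding of the question. This is an illustration only, not the project's actual preprocessing code — `question_to_bow`, the 4096-dimensional VGG-style feature size, and the toy vocabulary are all assumptions.

```python
import numpy as np

def question_to_bow(question, vocab):
    """Encode a question as a bag-of-words count vector over a fixed vocabulary."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for word in question.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec

def vqa_features(image_feat, question, vocab):
    """Concatenate a CNN image feature with the question's BoW vector —
    the joint input a bag-of-words VQA baseline feeds to its classifier."""
    return np.concatenate([image_feat, question_to_bow(question, vocab)])

# Toy example (a real pipeline would use the full VQA vocabulary):
vocab = {"what": 0, "color": 1, "is": 2, "the": 3, "cat": 4}
image_feat = np.random.rand(4096).astype(np.float32)  # e.g. a VGG fc7 feature
x = vqa_features(image_feat, "What color is the cat", vocab)
print(x.shape)  # (4101,)
```

A dense softmax layer over the top-N answers, trained on these concatenated vectors, is what makes this baseline surprisingly competitive despite ignoring word order.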

Serving Models on FloydHub

I've created a separate repository here to serve models since it avoids the overhead of pushing the entire code/data in the training repo to Floyd over & over again.
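The serving repository itself isn't reproduced here, but a REST endpoint for a VQA model generally looks like the following Flask sketch. Everything in it is an assumption for illustration: the `/predict` route, the form field names, and the stubbed `predict` function (a real server would load the trained Keras weights once at startup and run inference instead).

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(image_bytes, question):
    # Stub standing in for the trained VQA model. A real implementation
    # would extract image features and run the Keras model here.
    return {"answer": "cat", "confidence": 0.42}

@app.route("/predict", methods=["POST"])
def predict_route():
    # Expect a multipart form with an image file and a question string.
    question = request.form.get("question", "")
    image = request.files.get("image")
    if image is None or not question:
        return jsonify({"error": "need an image file and a question"}), 400
    return jsonify(predict(image.read(), question))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A client would then POST an image and a question, e.g. `curl -F image=@cat.jpg -F question="what animal is this" http://localhost:5000/predict`, and get a JSON answer back.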

For Offline Execution

Follow the instructions below to execute different (or all) sections of this project. You will need an NVIDIA GPU to train these models.

  1. Clone the project, replacing VQAMD with the name of the directory you want to create:

     $ git clone <repository-url> VQAMD
     $ cd VQAMD
  2. Make sure you have Python 3.5.x running on your local system. If you do, skip this step. In case you don't, head here.

  3. virtualenv is a tool for creating isolated 'virtual' Python environments. It is advisable to create one here as well (to avoid installing the prerequisites into the system root). Do the following within the project directory:

     $ [sudo] pip install virtualenv
     $ virtualenv --system-site-packages VQAMD
     $ source VQAMD/bin/activate

To deactivate later, once you're done with the project, just type deactivate.

  4. Install the prerequisites from requirements.txt and run the tests/ suite to check that all the required packages were installed correctly:

     $ pip install -r requirements.txt
     $ bash

Contributing to VQA

I welcome contributions to this little project. If you have any new ideas or approaches that you'd like to incorporate here, feel free to open up an issue.

Please refer to the project's style guidelines and the guidelines for submitting patches and additions. In general, we follow the "fork-and-pull" Git workflow.

  1. Fork the repo VQAMD on GitHub
  2. Clone the project to your own machine
  3. Commit changes to your own branch
  4. Push your work back up to your fork
  5. Submit a Pull request so that we can review your changes

NOTE: Be sure to merge the latest from "upstream" before making a pull request!


Feel free to submit issues and enhancement requests.