The main goal of this project's model is to assign each pixel of an image a category label. The network provides a complete understanding of the scene: it predicts the label, the location, and the shape of each element in the image.
The image-understanding part is handled by CNNs and the caption-generation part by RNNs, so the model uses both Computer Vision and Natural Language Processing to generate the captions.
If we are asked to describe this image, we might say:
“মাঠের মধ্যে একটি ছেলে বল ধরে আছে ।” (“A boy is holding a ball in the field.”) or “খালি গায়ে শিশুটি খুশিতে বল নিয়ে দাড়িয়ে আছে।” (“The bare-bodied child is standing happily with a ball.”)
While forming the description, we look at the image and, at the same time, try to create a meaningful sequence of words.
- Developing Deep Learning Model -> Google Colab
- Generate New Captions
- Preparing Photo Data
- Preparing Text Data
- Each photo is described by two captions
- Evaluate Model
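The "Preparing Text Data" step above can be sketched in plain Python. This is a minimal illustration, assuming captions are kept in a dict mapping photo IDs to their two Bangla captions; the function name and data layout are assumptions for illustration, not the project's actual code.

```python
# Hypothetical sketch of text preparation: wrap each caption with start/end
# tokens (so the decoder knows where a caption begins and ends) and build
# the vocabulary from all caption words.
def prepare_captions(captions):
    """Return (prepared captions dict, vocabulary set)."""
    prepared = {}
    vocab = set()
    for photo_id, caps in captions.items():
        prepared[photo_id] = []
        for cap in caps:
            tokens = cap.strip().split()
            vocab.update(tokens)
            prepared[photo_id].append("startseq " + " ".join(tokens) + " endseq")
    return prepared, vocab

# Example: one photo with its two described captions.
captions = {
    "img_001": ["মাঠের মধ্যে একটি ছেলে বল ধরে আছে ।",
                "খালি গায়ে শিশুটি খুশিতে বল নিয়ে দাড়িয়ে আছে।"],
}
prepared, vocab = prepare_captions(captions)
print(prepared["img_001"][0])
```

The `startseq`/`endseq` markers let the LSTM decoder generate a caption word by word until it emits the end token.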
- RNN + CNN
- Encoder-decoder model
- Extract feature vector from input image
- Based on pretrained ResNet50
- Only requires minor modifications
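Extracting the feature vector with a pretrained ResNet50 encoder can be sketched with `tf.keras` (assumed here as the framework; in practice `weights="imagenet"` would load the pretrained weights, while `weights=None` is used below only so the sketch runs offline without downloading them):

```python
import numpy as np
from tensorflow.keras.applications import ResNet50

# include_top=False drops the ImageNet classification head, and
# pooling='avg' collapses the final feature map into a single
# 2048-dimensional feature vector per image.
encoder = ResNet50(weights=None, include_top=False,
                   input_shape=(224, 224, 3), pooling="avg")

image = np.zeros((1, 224, 224, 3), dtype="float32")  # placeholder image batch
features = encoder.predict(image, verbose=0)
print(features.shape)  # (1, 2048)
```

The 2048-dimensional vector is what the decoder RNN conditions on when generating each caption word.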
- LSTM: Long Short Term Memory networks
- Multiple Copies of the same network
- Contains three gates that control the cell state
- Capable of learning long-term dependencies.
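The three gates above can be made concrete with a minimal NumPy sketch of a single LSTM time step; the dimensions, weight layout, and initialisation are illustrative assumptions, not the project's trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: W maps [h_prev, x] to the four stacked gate pre-activations."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:n])        # forget gate: what to erase from the cell state
    i = sigmoid(z[n:2*n])      # input gate: what new information to write
    o = sigmoid(z[2*n:3*n])    # output gate: what to expose as hidden state
    g = np.tanh(z[3*n:4*n])    # candidate cell values
    c = f * c_prev + i * g     # cell state carries the long-term memory
    h = o * np.tanh(c)         # hidden state passed to the next time step
    return h, c

rng = np.random.default_rng(0)
n, d = 4, 3                    # hidden size, input size (arbitrary for the demo)
W = rng.normal(size=(4 * n, n + d))
b = np.zeros(4 * n)
h, c = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), W, b)
print(h.shape, c.shape)
```

Because the forget gate can keep `c_prev` nearly unchanged across many steps, the cell state is what lets the network learn long-term dependencies.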
Tools and libraries used: Notepad++, Avro (for typing Bangla), and the Keras layers Flatten, Convolution2D, Dropout, LSTM, TimeDistributed, Embedding, Bidirectional, Activation, RepeatVector, and Concatenate.