## **EqualEyes: Image Caption Generator**

### **Team**
- Shashank Guda (POC)
  https://github.com/gudashashank
- Soundarya Ravi
  https://github.com/soravi18
- Ramya Chowdary Patchala
  https://github.com/ramyacp14
- Rithika Gurram
  https://github.com/Rithika2000
- Vishnu Charugundla
  https://github.com/VishnuCharugundla

### **Introduction**

The project aims to advance image captioning beyond simply naming objects and actions. We will combine recent advances in image recognition and language modeling to generate rich, detailed, and natural descriptions of photographs, making them more accessible and meaningful for all. Training on open and varied image datasets will be key to helping the system generalize well and to reducing bias. We will also prioritize evaluation methods that measure how well captions describe the full context of an image, not just the presence of objects. Challenges remain, but by focusing on inclusivity and on this multi-modal approach, we hope to move image captioning technology forward in an impactful way.

The objective of *EqualEyes* is to help anyone who interacts with images and relies on captions to understand their content. This includes individuals with visual impairments (including color blindness) who use screen readers, people browsing social media platforms or news websites, researchers analyzing image datasets, and developers working on applications that involve image recognition and understanding. Ultimately, the goal is to make image captioning more inclusive and accessible for all users, regardless of their abilities or needs.


### **Literature Review**

Microsoft Research has developed a system that generates detailed captions for real-world images by combining powerful image recognition with tools for generating, ranking and contextualizing captions. The system utilizes a state-of-the-art image recognition model to understand image content. It can also identify celebrities and landmarks to create richer, more informative captions. Testing shows the system outperforms previous ones, especially for images from diverse sources like Instagram. This research represents a major advance in creating accurate, detailed image captions with practical real-world applications. However, challenges remain in ensuring fairness and inclusivity. Strategies like data augmentation and balanced sampling have been proposed to mitigate biases. 

The system may struggle to generalize beyond its training data and may reproduce biases present in that data. Its reasoning abilities are also limited: it focuses mainly on basic description rather than deeper relationships. Additionally, evaluating caption quality systematically remains difficult. While the approach marks an advance, further work on generalization, reasoning, bias mitigation, evaluation methodology, and integration into practical applications is needed to fully deliver on the potential of AI-generated image captions. Addressing these challenges could lead to more inclusive, ethical, and useful captioning systems. **<sup><small>[1]</small></sup>**

The discussion around improving inclusiveness, mitigating bias, and ensuring equitable performance of image captioning systems across different demographics relates most directly to Section 1.4 of the ACM Code of Ethics and Professional Conduct. Specifically:

***“Be fair and take action not to discriminate.” <sup><small>[2]</small></sup>***


#### **Stakeholders**

**1. Individuals with Visual Impairments:**
- Individuals who rely on screen readers to access image content.
- Accessible and descriptive image captions that accurately convey the content and context of images, enabling them to comprehend visual information effectively.

**2. Social Media Users:**
- Users of social media platforms such as Instagram, Twitter, and Facebook.
- Engaging and informative image captions that enhance the browsing experience and facilitate better understanding of shared images.

**3. Teachers and Educators:**
- Professionals involved in education, including those who train other teachers in the effective use of technology in the classroom.
- Image captions that create an engaging learning experience, support early literacy by making learning fun for younger children, aid language learning by supplying additional vocabulary, and are especially useful in bilingual or multilingual classrooms.


### **Data & Methods**

The image captioning task aims to automatically generate descriptive captions for images. This involves combining computer vision techniques to understand image content with natural language processing methods to translate that understanding into fluent text. A common image captioning framework uses a two-part deep neural network. First, a pretrained convolutional neural network (CNN), typically an architecture such as ResNet pre-trained on image classification, encodes the image into a feature vector. Second, a recurrent neural network (RNN) language model such as an LSTM takes the image vector and generates the caption word by word. The CNN image encoder and RNN caption generator are trained end-to-end, with the CNN extracting salient image features and the RNN learning to output accurate caption text.
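As a rough sketch, one common way to write down this encoder-decoder pattern is given below. The notation is ours and is an assumption, not a specification of the final model: $I$ is the image, $w_1, \dots, w_T$ are the caption tokens with $w_0$ a start token, $e(\cdot)$ is a word embedding, and $W_v$, $W_o$ are learned projection matrices. Initializing the LSTM state from the image vector is only one of several possible ways to condition the decoder on the image.

$$
v = \mathrm{CNN}(I), \qquad h_0 = W_v\, v
$$

$$
h_t = \mathrm{LSTM}\big(h_{t-1},\, e(w_{t-1})\big), \qquad p_\theta(w_t \mid w_{<t}, I) = \mathrm{softmax}(W_o\, h_t)
$$

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(w_t \mid w_{<t}, I)
$$

Under these assumptions, end-to-end training means minimizing $\mathcal{L}(\theta)$ over all image and caption pairs, where $\theta$ collects the CNN, LSTM, embedding, and projection parameters.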

The system is trained on image and caption pairs to learn associations between image features and words. The models are optimized to predict captions matching ground truth captions for each image. At inference, the trained CNN encoder and RNN decoder work together to generate captions for new images.
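The inference step described above can be illustrated with a greedy decoding loop, which repeatedly picks the highest-scoring next word until an end token appears. This is a minimal sketch, not the project's implementation: greedy selection is only one decoding strategy, and `next_word_scores`, `toy_scores`, and the score table are hypothetical stand-ins for a trained decoder.

```python
from typing import Callable, Dict, List

START, END = "<start>", "<end>"


def greedy_decode(
    image_features: List[float],
    next_word_scores: Callable[[List[float], List[str]], Dict[str, float]],
    max_len: int = 20,
) -> List[str]:
    """Generate a caption word by word, always taking the top-scoring word."""
    caption = [START]
    for _ in range(max_len):
        scores = next_word_scores(image_features, caption)
        best = max(scores, key=scores.get)
        if best == END:
            break
        caption.append(best)
    return caption[1:]  # drop the start token


# Hypothetical stand-in for a trained decoder: it ignores the image
# features and looks up scores based only on the previous word.
_TOY_TABLE = {
    START: {"a": 0.9, "dog": 0.1},
    "a": {"dog": 0.7, "ball": 0.3},
    "dog": {"runs": 0.6, END: 0.4},
    "runs": {END: 1.0},
}


def toy_scores(image_features: List[float], caption_so_far: List[str]) -> Dict[str, float]:
    return _TOY_TABLE[caption_so_far[-1]]


print(greedy_decode([0.0], toy_scores))  # prints ['a', 'dog', 'runs']
```

In a real system the scoring function would run the trained RNN decoder on the CNN features; the `max_len` cap guards against captions that never emit the end token.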


We expect the following preprocessing steps to be required before training the model:
- Resize images to standard dimensions suitable for the model architecture.
- Normalize pixel values to a common scale (typically 0 to 1) to match the model's expected input range.
- Convert images to the format the model expects (e.g., RGB for a CNN).
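The preprocessing steps listed above can be sketched in plain Python on a tiny image stored as nested lists. This is an illustration only: in practice an image library would do this work, and the function names, the nearest-neighbour resizing, the 8-bit input assumption, and the order of the steps are all our assumptions.

```python
from typing import List, Tuple

GrayImage = List[List[int]]                   # rows of 8-bit intensities
RGBImage = List[List[Tuple[float, ...]]]      # rows of (R, G, B) pixels


def to_rgb(gray: GrayImage) -> RGBImage:
    """Replicate the single grayscale channel into three RGB channels."""
    return [[(p, p, p) for p in row] for row in gray]


def resize_nearest(img: RGBImage, out_h: int, out_w: int) -> RGBImage:
    """Resize to (out_h, out_w) by copying the nearest source pixel."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalize(img: RGBImage) -> RGBImage:
    """Scale 8-bit channel values (0-255) into the range [0, 1]."""
    return [[tuple(ch / 255.0 for ch in px) for px in row] for row in img]


def preprocess(gray: GrayImage, size: Tuple[int, int] = (4, 4)) -> RGBImage:
    """Convert to RGB, resize to a fixed size, then normalize."""
    return normalize(resize_nearest(to_rgb(gray), *size))


gray = [[0, 255], [128, 64]]
out = preprocess(gray, size=(4, 4))
print(len(out), len(out[0]), out[0][0], out[0][3])
```

The target size would be chosen to match the input dimensions of whichever pretrained CNN the project adopts.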


### **Timeline**

| Date        | Activity                                            | Milestone                                                                                                   |
|-------------|-----------------------------------------------------|-------------------------------------------------------------------------------------------------------------|
| 2/27 - 3/4  | Stakeholder analysis, Project Setup, Datasets Collection | Perform detailed stakeholder analysis; set up the project; collect diverse image datasets suitable for training |
| 3/5 - 3/11  | Data Preprocessing                                  | Preprocess collected data: resize images, tokenize captions                                                 |
| 3/12 - 3/18 | Model Development and Evaluation                    | Implement CNN encoder and LSTM decoder; evaluate the model                                                  |
| 3/19 - 3/25 | Model Evaluation and Refinement, Project Checkpoint | Project checkpoint: evaluate the trained model; refine the model to enhance performance                     |
| 3/26 - 4/1  | Addressing Risks, Bias Mitigation                   | Analyze biases; mitigate them through data augmentation, balanced sampling, and fairness constraints; validate |
| 4/2 - 4/8   | Fine Tuning                                         | Fine-tune the model on additional datasets; optimize the model for efficiency and scalability               |
| 4/9 - 4/15  | Final Testing and Validation                        | Perform final testing on the optimized model; validate outputs through human evaluation                     |
| 4/16 - 4/22 | Documentation and Presentation Preparation          | Project checkpoint: document the entire project, prepare presentation materials, practice delivery          |
| 4/23 - 4/29 | Finalization and Submission                         | Finalize project report, documentation, and presentation materials; submit project deliverables             |


### **Risks**

**Data Bias**

One risk is that biases in the training data can lead to biased or problematic captions. Many image datasets lack diversity and underrepresent certain groups, so models trained on such data may not generate fair or equitable captions.
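Balanced sampling, one of the mitigation strategies named elsewhere in this proposal, can be sketched as follows. Each training example is weighted by the inverse of its group's size so that every group is drawn about equally often. The group labels here are hypothetical, and whether such labels are available for a given dataset is an assumption.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence


def balanced_weights(groups: Sequence[Hashable]) -> List[float]:
    """Weight each example by 1 / (size of its group).

    Every group then carries the same total sampling weight,
    regardless of how many examples it contains.
    """
    counts = Counter(groups)
    return [1.0 / counts[g] for g in groups]


# Hypothetical group labels for six training images: group "A" is
# heavily overrepresented compared with "B" and "C".
groups = ["A", "A", "A", "A", "B", "C"]
weights = balanced_weights(groups)

# Draw example indices with replacement according to the weights.
rng = random.Random(0)
sample = rng.choices(range(len(groups)), weights=weights, k=3000)
print(Counter(groups[i] for i in sample))
```

Because underrepresented examples are repeated more often under this scheme, it is usually paired with data augmentation so the model does not simply memorize the few images in small groups.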

**Limited Generalization**

Image captioning models are prone to overfitting on the patterns in the training dataset. This can limit their ability to generalize to new scenarios and describe unfamiliar images. 

**Limited Reasoning**

Current image captioning systems have limited ability to perform true reasoning about the spatial, logical, and causal relationships depicted in an image. Captions are confined to surface-level descriptions and can fail to describe complex image contexts.

**Evaluation Challenges**

Automated evaluation metrics have limitations in assessing caption quality. Human evaluation is more labor-intensive and costly, but it remains essential as a complement for a holistic, faithful assessment.
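To illustrate the limitation, the sketch below computes a deliberately simplified word-overlap score (clipped unigram precision). It is not one of the full metrics used in practice, which also consider longer word sequences and apply other corrections, and the example captions are made up. Even so, it shows how overlap-based scoring can prefer a caption with the wrong meaning over a faithful paraphrase.

```python
from collections import Counter
from typing import List


def clipped_unigram_precision(candidate: List[str], reference: List[str]) -> float:
    """Fraction of candidate words found in the reference.

    Each word's count is clipped to its count in the reference, so
    repeating a matching word does not inflate the score.
    """
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    overlap = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return overlap / len(candidate)


reference = "a dog chases a ball in the park".split()
faithful = "a dog runs after a ball in the park".split()   # same meaning, new words
scrambled = "a ball chases a dog in the park".split()      # wrong meaning, same words

print(clipped_unigram_precision(faithful, reference))   # 7 of 9 words match
print(clipped_unigram_precision(scrambled, reference))  # 8 of 8 words match
```

A human reader would rank the faithful paraphrase higher, which is why the plan includes human evaluation alongside automated scores.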


### **References**
1. Tran, K., He, X., Zhang, L., Sun, J., Carapcea, C., Thrasher, C., Buehler, C., Sienkiewicz, C. (Year). Rich Image Captioning in the Wild. Microsoft Research. [Publication](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/ImageCaptionInWild-1.pdf)

2. ACM Code of Ethics [Link](https://www.acm.org/code-of-ethics)