<a href="https://colab.research.google.com/github/sugarforever/LangChain-Tutorials/blob/main/LangChain_AI_Image_Recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install langchain openai transformers

In [None]:
from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.agents import AgentType
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory

import os

In [None]:
# Use .get() so a missing variable falls back to the placeholder instead of raising KeyError
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY') or 'Your OPENAI API Key'

In [None]:
llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, temperature=0, model_name='gpt-3.5-turbo')

# CUDA

CUDA is a parallel computing platform and programming model created by NVIDIA. With more than 20 million downloads to date, CUDA helps developers speed up their applications by harnessing the power of GPU accelerators. 

https://blogs.nvidia.com/blog/2012/09/10/what-is-cuda-2/

# BLIP

The BLIP model was proposed in "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation" by Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi.

BLIP is a model that is able to perform various multi-modal tasks, including:

- Visual Question Answering
- Image-Text retrieval (Image-text matching)
- Image Captioning

https://huggingface.co/docs/transformers/model_doc/blip

# Image to Text HuggingFace Model

This is the model card for a BLIP image captioning model pretrained on the COCO dataset (base architecture with a ViT large backbone).

https://huggingface.co/Salesforce/blip-image-captioning-large

In [None]:
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

image_to_text_model = "Salesforce/blip-image-captioning-large"
device = 'cuda' if torch.cuda.is_available() else 'cpu'

processor = BlipProcessor.from_pretrained(image_to_text_model)
model = BlipForConditionalGeneration.from_pretrained(image_to_text_model).to(device)

In [None]:
import requests
from PIL import Image

def describeImage(image_url):
  # Download the image and fail early on a bad URL rather than handing an error page to PIL
  response = requests.get(image_url, stream=True)
  response.raise_for_status()
  image_object = Image.open(response.raw).convert('RGB')
  # Preprocess the image, generate caption token IDs, then decode them into text
  inputs = processor(image_object, return_tensors="pt").to(device)
  outputs = model.generate(**inputs)
  return processor.decode(outputs[0], skip_special_tokens=True)

In [None]:
description = describeImage('https://images.unsplash.com/photo-1673207520321-c27d09eb0955?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1035&q=80')

In [None]:
description

In [None]:
from langchain.tools import BaseTool

class DescribeImageTool(BaseTool):
    name = "Describe Image Tool"
    description = 'use this tool to describe an image.'

    def _run(self, url: str):
        description = describeImage(url)
        return description
    
    def _arun(self, query: str):
        raise NotImplementedError("Async operation not supported yet")

tools = [DescribeImageTool()]

# LangChain Agent Types

https://python.langchain.com/en/latest/modules/agents/agents/agent_types.html

## chat-conversational-react-description

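`chat-conversational-react-description` is an agent type designed for conversational use. It expects to be used with a memory component, which supplies earlier messages in the conversation to the agent on each call.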

https://python.langchain.com/en/latest/modules/agents/agents/examples/chat_conversation_agent.html
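
To make the role of the memory component concrete, here is a minimal sketch of a windowed conversation buffer that keeps only the last `k` exchanges. It is a toy stand-in written with the standard library, not LangChain's implementation; the class name `WindowMemorySketch` and its methods are hypothetical and exist only for illustration.

```python
from collections import deque


class WindowMemorySketch:
    """Toy windowed conversation buffer (hypothetical, not LangChain's code)."""

    def __init__(self, k: int = 5):
        # Each entry is one (human, ai) exchange; the deque silently
        # drops the oldest exchange once more than k have been stored.
        self.exchanges = deque(maxlen=k)

    def save_context(self, human: str, ai: str) -> None:
        # Record one full exchange after the agent has answered.
        self.exchanges.append((human, ai))

    def buffer(self) -> list:
        # Flatten the stored exchanges into alternating messages, oldest first.
        messages = []
        for human, ai in self.exchanges:
            messages.append(("human", human))
            messages.append(("ai", ai))
        return messages


memory = WindowMemorySketch(k=5)
for i in range(7):
    memory.save_context(f"question {i}", f"answer {i}")

window = memory.buffer()
print(len(window))  # 10 messages: the last 5 exchanges
print(window[0])    # ('human', 'question 2'), the two oldest exchanges were dropped
```

With `k=5`, the agent sees at most the five most recent exchanges, which keeps the prompt size bounded at the cost of forgetting older turns.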

In [None]:
agent = initialize_agent(
    agent='chat-conversational-react-description',
    tools=tools,
    llm=llm,
    verbose=True,
    max_iterations=3,
    early_stopping_method='generate',
    memory=ConversationBufferWindowMemory(
        memory_key='chat_history',
        k=5,
        return_messages=True
    )
)

In [None]:
image_url = 'https://images.unsplash.com/photo-1673207520321-c27d09eb0955?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1035&q=80'
agent(f"Describe the following image:\n{image_url}")

In [None]:
agent(f"What is the brand of car in the following image:\n{image_url}")

In [None]:
image_url = 'https://images.unsplash.com/photo-1682228287072-5e23cbffd487?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=987&q=80'
agent(f"Please describe the following image:\n{image_url}")


In [None]:
agent.memory.buffer