# Image-to-Text Experimentation - LLaVA, LLaVA-Phi3, BakLLaVA (Ollama)

Rejoice Hu, 2024

## Setup
https://www.youtube.com/watch?v=_TUvb6NtpGA
https://ollama.com/library/llava

In [1]:
import ollama
from ollama import generate

import glob
import pandas as pd
from PIL import Image

import os
from io import BytesIO

In [2]:
# Load dataframe from csv file, or create new df if the file doesn't exist
def load_or_create_dataframe(filename):
    if os.path.isfile(filename):
        df = pd.read_csv(filename)
    else:
        df = pd.DataFrame(columns=['image_file', 'description'])
    return df

In [3]:
df = load_or_create_dataframe('image_descriptions.csv')

df

Unnamed: 0,image_file,description
0,./images/ms_welcome.png,The image is a screenshot of a webpage featur...


In [4]:
def get_png_files(folder_path):
    return glob.glob(f"{folder_path}/*.png")

# Get list of image files in the folder
image_files = get_png_files("./images") 
image_files.sort()

print(image_files[:3])
print(df.head())

['./images/ms_tech.png', './images/ms_welcome.png']
                image_file                                        description
0  ./images/ms_welcome.png   The image is a screenshot of a webpage featur...


In [1]:
# import base64
def process_image(image_file, model, prompt, temperature=0.8):
    print(f"\nModel: {model} - Processing {image_file}\n")
    
    # Re-encode the image as PNG bytes (passed as raw bytes; no base64 step is used here)
    with Image.open(image_file) as img:
        with BytesIO() as buffer:
            img.save(buffer, format='PNG')
            image_bytes = buffer.getvalue()
            # img_str = base64.b64encode(image_bytes)

    full_response = ''
    # Generate response for the image
    for response in generate(model=model, 
                             prompt=prompt,
                             images=[image_bytes], 
                             options={"temperature": temperature},
                             stream=True):
        # Print the response to the console and add it to the full response
        print(response['response'], end='', flush=True)
        full_response += response['response']

    # Add new row to the DataFrame
    df.loc[len(df)] = [image_file, full_response]

In [12]:
# change image processing: just file path instead of base64 encoding
def process_image2(image_file, model, prompt, temperature):
    print(f"\nModel: {model} - Processing {image_file}\n")

    full_response = ''
    # Generate a description of the image
    for response in generate(model=model, 
                             prompt=prompt,
                             images=[image_file], 
                             options={"temperature": temperature},
                             stream=True):
        # Print the response to the console and add it to the full response
        print(response['response'], end='', flush=True)
        full_response += response['response']

    # Add a new row to the DataFrame
    df.loc[len(df)] = [image_file, full_response]

## LLaVA through Ollama
https://ollama.com/library/llava

In [None]:
prompt1 = "Describe this image and make sure to include anything notable about it (include text you see in the image):"

In [8]:
for image_file in image_files:
    if image_file not in df['image_file'].values:
        process_image(image_file, "llava", prompt1) #latest: 7b-v1.6


Processing ./images/ms_welcome.png

 The image is a screenshot of a webpage featuring an article about Morgan Stanley's involvement in the rollercoaster ride at the 2020 Democratic National Convention. The headline reads "What We Do." Below this, there are two images: one of a rollercoaster with a group of individuals riding on it and another showing a rollercoaster with American flags prominently displayed.

There is text overlaying the image that includes statistics about Morgan Stanley's political contributions. It mentions "Morgan Stanley gave $138,040 to Democratic candidates" and states "The top recipients of this money include Joe Biden's campaign ($276,524) and the DNC ($591,736)."

Additionally, there is a disclaimer that says "These figures are based on publicly available data and may not be exhaustive." The webpage is from a news or political analysis site, as indicated by the logo at the top left corner. The style of the image suggests it is intended for informational or a

In [9]:
# Save the DataFrame to a CSV file
# df.to_csv('image_descriptions.csv', index=False)

#### ANALYSIS

ACCURATE:
- headline reads "What We Do"
- recognizes rollercoaster ride with American flags
- style is informational

INACCURATE:
- there is 1 image only, not 2
- hallucinations: 
    - involvement in 2020 Democratic National Convention
    - statistics about MS's political contributions
    - quotes a lot of text that does not exist in the actual screenshot ("Morgan Stanley gave $138,040 to Democratic candidates", disclaimer at the bottom)
- missing rest of the menu and most of the text

OVERALL ANALYSIS: 
extremely high incidence of hallucinations, description is more inaccurate than accurate

POSSIBLE CAUSES: 
prompt is not specific enough, model is too simple -- to improve: specify the kind of information we are interested in learning about the image

In [17]:
# PROMPT ENGINEERING: 
prompt2='Describe this image in as much detail as possible within 3 paragraphs, focusing on the structure, position/coordinates, and color of ALL the different elements like text, images, menu items, links, etc.'

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "llava", prompt2) #7b-v1.6


Model: llava - Processing ./images/ms_welcome.png

 The image displays a screenshot of a webpage. At the top of the page, there is a navigation bar with links such as "Home," "About Us," "Blog," "Contact Us," and a menu icon on the right side. Below this, centered on the page, is a large header image showing an American flag on a pole, with a clear blue sky in the background. To the left of the flag, there's text that reads "MORGAN STANLEY," indicating the name associated with the content of the webpage.

Just beneath the header, centered again, is a tagline or slogan that says "WHAT WE DO." This is likely part of the branding and messaging strategy for the company represented on the page. Below this tagline, there are additional menu items including "Blog," "Investment Strategies," and "Research & Insights." These seem to be categories or sections within the site's content structure.

The overall color scheme of the webpage is primarily white with text in a darker font color, creatin

#### ANALYSIS

ACCURATE:
- recognizes some kind of navigation bar with links at the top right-hand side of the page that includes "Home," "About Us," "Blog," "Contact Us"
- recognizes presence of American flag and clear blue sky in the background
- position of image = below menu and centered on page, "MORGAN STANLEY" text = left of the flag
- "WHAT WE DO" is indeed below the header
- accurate color scheme
    - primarily white with text in a darker font color, creating a clean and professional appearance
    - header image is mainly blue

INACCURATE:
- hallucinations: 
    - made up menu items "Blog," "Contact Us"
    - made up additional menu below "WHAT WE DO" - "Blog," "Investment Strategies," and "Research & Insights."
- missing rest of the menu
- no mention of rollercoaster for the image
- "WHAT WE DO" is on the left, NOT centered

    
OVERALL ANALYSIS: 
the more detailed prompt produced fewer hallucinations, but the description still contains too many inaccuracies and too much hallucinated content

POSSIBLE CAUSES: 
prompt still needs work, model is too simple -- to improve: focus the prompt on how to navigate the page rather than on color

In [18]:
# PROMPT ENGINEERING: 
prompt3='Describe this image in as much detail as possible within 250 words, focusing on the text, position/coordinates/navigation, color, etc. of ALL the different components like text, images, menu items, links, etc.'

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "llava", prompt3) #7b-v1.6


Model: llava - Processing ./images/ms_welcome.png

 The image displays a screenshot of a webpage. At the top, there is a header with the logo and name "Morgan Stanley" on the left side. Next to the logo, there is a navigation bar with menu items such as "Home," "Markets," "Research," "Trading," "Wealth," "Insights," and others, but the full text of all the menu items is not visible due to the resolution of the image.

Below the header, the main content area shows a large banner with a background image of an amusement ride in a park-like setting during what appears to be dusk or dawn, with the sky showing various hues of blue and orange. In front of this background, there is a dark-colored overlay containing text. The text reads "WHAT WE DO" in white capital letters at the top of the banner. Below this title, there are three bullet points each describing a different area of the company's activities: "CUSTOMIZED INVESTMENT SOLUTIONS," "GLOBAL RESEARCH AND ANALYSIS," and "EXCLUSIVE ACCES

#### ANALYSIS

ACCURATE:
- recognizes header with the logo and name "Morgan Stanley" on the top left
- recognizes navigation bar with menu items next to the logo
- accurate description of image and its position (below header, amusement park at dusk/dawn)
- recognizes "WHAT WE DO" text
- recognizes existence of text at the very bottom of the image, and it contains a link
- color scheme does indeed convey the website is for informational purposes

INACCURATE:
- hallucinations: 
    - made up menu items "Home," "Markets," "Research," "Trading," "Wealth," "Insights," and others
    - made up text below "WHAT WE DO" text: there are NO three bullet points each describing a different area of the company's activities ("CUSTOMIZED INVESTMENT SOLUTIONS," "GLOBAL RESEARCH AND ANALYSIS," and "EXCLUSIVE ACCESS TO COMPANY EVENTS.")
- falsely claims that "WHAT WE DO" is in white capital letters at the top of the image (wrong color + position)
- image: missing mention of American flags

    
OVERALL ANALYSIS: 
compared to prompt1, this more detailed prompt produced fewer hallucinations and more navigational details, but the description still contains too many inaccuracies and too much hallucinated content, and possibly more hallucinations than prompt2

POSSIBLE CAUSES: 
temperature/creativity is too high, model is too simple -- to improve: lower the temperature of the model
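
To compare temperatures side by side instead of re-running one cell per value, a small sweep helper could be used. This is a sketch under assumptions, not the notebook's method: `sweep_temperatures` and `fake_describe` are hypothetical names, and `fake_describe` stands in for a real call such as `generate(model="llava", prompt=prompt3, images=[...], options=options)` so the example runs without a model.

```python
def sweep_temperatures(describe, temperatures):
    """Call describe(options) once per temperature; return {temperature: text}."""
    results = {}
    for t in temperatures:
        # Same options dict shape as the generate(...) calls in this notebook
        options = {"temperature": t}
        results[t] = describe(options)
    return results


def fake_describe(options):
    # Hypothetical stand-in for a real model call with the given options
    return f"description at temperature {options['temperature']}"


results = sweep_temperatures(fake_describe, [0.8, 0.3, 0.0])
for t, text in results.items():
    print(t, '->', text)
```

Keeping the outputs in one dictionary makes it easier to read the descriptions for each temperature next to each other.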

In [11]:
for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "llava", prompt3, 0.3) #7b-v1.6


Model: llava - Processing ./images/ms_welcome.png

 The image is a screenshot of a webpage with a predominantly white background. At the top, there's a navigation bar with menu items such as "Home," "About Us," "Contact Us," and "Careers." Below this, in the center of the page, there's a large image of an amusement park ride with several people seated on it. The ride appears to be in motion, suggesting a dynamic scene.

On the right side of the page, there is a section titled "Morgan Stanley" in bold black letters. Below this title, there are two columns of text. The left column contains the name "MORGAN STANLEY" repeated twice in large black font, followed by a subtitle that reads "WHAT WE DO" in smaller black font. The right column lists various items related to Morgan Stanley's services and offerings, including "Investment Banking," "Equity Capital Markets," "Fixed Income," "Asset Management," and "Wealth Management."

The bottom of the page features a footer with additional text t

#### ANALYSIS

ACCURATE:
- recognizes predominantly white background
- recognizes presence of navigation bar at the correct position (top) and text "About Us" and "Careers"
- recognizes presence of image at the correct position (below menu and center of page) and content (amusement park ride)
- recognizes "Morgan Stanley" text and color (black)
- recognizes "WHAT WE DO" text and color (black)
- recognizes presence of text at the bottom
- accurate description of style of image

INACCURATE:
- hallucinations: 
    - made up menu items "Home," "Contact Us,"
    - "MORGAN STANLEY" is NOT repeated twice
    - made up services+offerings "Investment Banking," "Equity Capital Markets," "Fixed Income," "Asset Management," and "Wealth Management."
    - footer text "Privacy Statement," "Terms of Use," "Accessibility Statement," "Copyright Information," and "Morgan Stanley & Co. LLC."
- "Morgan Stanley" text is NOT on the right-hand side

    
OVERALL ANALYSIS: 
the lower temperature did not noticeably reduce hallucinations compared to the higher temperature, and the description is also less detailed

POSSIBLE CAUSES: 
model is too simple
    
##### GENERAL OBSERVATIONS/PATTERNS
- "Morgan Stanley" header text + position is often recognized and mostly accurate
- recognizes presence of header image and mostly captures essence of the image
- color analysis is accurate
- the text inside the header image is NEVER mentioned / always skipped
- most of the hallucinated content comes from describing the menu items on the top right
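
The accurate/inaccurate lists above were compiled by hand. As a rough sketch of how part of that check could be automated, the snippet below pulls the double-quoted phrases out of a description and splits them by whether they appear in a hand-written list of text that is really on the page. The helper names are hypothetical, and `visible_text` is an illustrative assumption based on the items the analyses above treat as accurate, not a full transcription of the screenshot.

```python
import re


def quoted_phrases(description):
    """Return phrases in straight double quotes, minus trailing punctuation."""
    found = re.findall(r'"([^"]+)"', description)
    return [p.strip().rstrip('.,;:') for p in found]


def split_by_ground_truth(description, visible_text):
    """Split quoted phrases into (supported, unsupported) by exact match."""
    visible = {t.lower() for t in visible_text}
    supported, unsupported = [], []
    for phrase in quoted_phrases(description):
        if phrase.lower() in visible:
            supported.append(phrase)
        else:
            unsupported.append(phrase)
    return supported, unsupported


# Assumed ground truth (illustrative only)
visible_text = ["Morgan Stanley", "What We Do", "About Us", "Careers"]
description = ('There is a navigation bar with menu items such as '
               '"Home," "About Us," "Contact Us," and "Careers."')

supported, unsupported = split_by_ground_truth(description, visible_text)
print("supported:", supported)
print("unsupported:", unsupported)
```

This only catches hallucinated quoted text; wrong positions, colors, or image counts would still need a manual review.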

In [25]:
# LOWER THE TEMPERATURE EVEN MORE (0.0)
for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "llava", prompt3, 0.0) #7b-v1.6


Model: llava - Processing ./images/ms_welcome.png

 The image displays a screenshot of a webpage with a predominantly white background. At the top, there is a navigation bar with text that reads "MORGAN STANLEY" in uppercase letters. Below this, on the left side of the page, there is a vertical menu with items such as "Home," "About Us," "Careers," and "Contact Us."

The main content area features a large image of a roller coaster with several people riding it. The roller coaster is set against a clear sky background. Overlaying the image are two pieces of text in different fonts and sizes. The larger text at the top reads "WHAT WE DO" in bold, uppercase letters. Below this, there is smaller text that says "Morgan Stanley is a leading global financial services firm."

On the right side of the page, there is additional text providing information about Morgan Stanley's commitment to diversity and inclusion, with bullet points detailing their efforts in this area. The text mentions "100%

### REDUCING HALLUCINATIONS

In [14]:
prompt5='Describe this image in as much detail as possible. Describe the text, position/coordinates/navigation, color, etc. of ALL the different components like text, images, menu items, links, etc. Only describe what you can see.'

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "llava", prompt5)


Model: llava - Processing ./images/ms_welcome.png

 The image displays a screenshot of a website with the URL "morganstanley.com" visible in the address bar at the top of the browser window. The main focus of the webpage appears to be an article or blog post titled "What we do." The text on the page is not entirely legible due to the angle and quality of the image, but there are several elements that can be discerned:

1. In the top right corner, there's a navigation menu with items like "Home," "About," "Contact," "Blog," "Portfolio," "Services," "Privacy Policy," and "Terms of Service."
2. Below the navigation menu is a header section with a subheading that reads "Morgan Stanley" in bold font, followed by a smaller text that says "Morgan Stanley is a global financial services firm."
3. The main content area of the webpage features an image of a roller coaster with multiple cars and passengers visible, set against a clear blue sky.
4. Below the image of the roller coaster are several

In [18]:
# prompt6='Describe this image in as much detail as possible within 3 paragraphs. Describe the text, position/coordinates/navigation, color, etc. of ALL the different components like text, images, menu items, links, etc. Only describe what you can see.'
prompt3='Describe this image in as much detail as possible within 250 words, focusing on the text, position/coordinates/navigation, color, etc. of ALL the different components like text, images, menu items, links, etc.'

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image4(image_file, "llava", prompt3)


Model: llava - Processing ./images/ms_welcome.png

 The image displays a screenshot of a webpage. At the top of the page, there is a navigation bar with menu items such as "Home," "About," "Contact," and "Log In." Below this, the main content area features a large photograph of a roller coaster with several people riding it. The roller coaster is set against a clear sky background.

The central part of the page contains text that reads "MORGAN STANLEY" in bold, uppercase letters, indicating the name associated with the content or organization represented by the webpage. Below this, there's a subheading that says "WHAT WE DO," followed by a list of bullet points detailing the company's services and focus areas such as "Election Management," "Campaign Strategy," "Digital Advertising," and "Data Analytics."

To the right of this text, there is a section titled "OUR TEAM" with placeholder images of people, suggesting that this area would typically show actual team members. Each placeholde

In [22]:
prompt6='Describe this image in as much detail as possible within 3 paragraphs. Describe the text, position/coordinates/navigation, color, etc. of ALL the different components like text, images, menu items, links, etc. Only describe what you can see.'

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image5(image_file, "llava", prompt6)


Model: llava - Processing ./images/ms_welcome.png

 The image displays a screenshot of a webpage. At the top, there is a navigation bar with menu items such as "Home," "About Us," "Contact Us," and "Blog." Below this, the main content area features a large image of a roller coaster with several people riding it. The roller coaster is set against a clear sky background.

On the right side of the page, there is a vertical column with text that reads "Morgan Stanley" at the top and "What We Do" below it. Below this heading, there are bullet points listing various activities or services provided by the company. The text is black on a white background.

The bottom part of the image shows a footer with additional information about the company's activities and values. This section includes a list of items such as "Sustainability," "Diversity & Inclusion," "Community Development," and "Corporate Responsibility." The text is smaller than the main content area and is also black on a white backg

In [23]:
prompt='Describe all the text in this screenshot of a website in as much detail as possible within 3 paragraphs. Describe what the texts read, and their position and coordinates within the image. Only describe what you can see.'

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image5(image_file, "llava", prompt)


Model: llava - Processing ./images/ms_welcome.png

 The image displays a screenshot of a website with a predominantly white background. At the top, there is a navigation bar with text that reads "MORGAN STANLEY" in uppercase letters. Below this, on the left side of the screen, there is a large image of an American flag waving from a roller coaster car, with several people seated in the cars behind it.

In the center of the screenshot, there is a prominent text that reads "WHAT WE DO" in uppercase letters, suggesting a section or category on the website. To the right of this central text, there are three smaller images showing different scenes: one with a roller coaster car, another with a person standing next to a building, and a third with a group of people sitting at a table.

Below these images, there is a section titled "OUR SERVICES" followed by a list of services provided by the company, which includes "Financial Planning," "Investment Management," "Retirement Planning," and "Es

In [24]:
prompt5='Describe all the images in this screenshot of a website in as much detail as possible within 3 paragraphs. Describe the content of each image, the colors, and their position and coordinates within the image with respect to all the other components of the website. Only describe what you can see.'

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image5(image_file, "llava", prompt5)


Model: llava - Processing ./images/ms_welcome.png

 The image is a screenshot of a website, specifically the homepage of Morgan Stanley. The website features a large banner at the top with the company's logo on the left and the text "Morgan Stanley" in white letters on the right. Below this banner, there are two smaller images placed side by side.

The image on the left shows a group of people riding a roller coaster. The roller coaster is predominantly yellow with red accents, and it's set against a clear blue sky. The people on the ride appear to be enjoying themselves, with some waving at the camera.

The image on the right depicts an American flag flying high in the air. The flag is red, white, and blue, and it's positioned centrally within the image.

Below these images, there are two blocks of text. The first block of text reads "What we do" and is accompanied by a small icon of a globe. The second block of text reads "Our values" and includes an icon of a person.

The website h

In [12]:
prompt='Describe all the text in this screenshot of a website in as much detail as possible within 3 paragraphs. Describe what the texts read, and their position and coordinates within the image. Only describe what you can see.'

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "llava", prompt, 0.0)


Model: llava - Processing ./images/ms_welcome.png

 The image displays a screenshot of a website with a predominantly white background. At the top, there is a navigation bar with text that reads "MORGAN STANLEY" in uppercase letters. Below this, on the left side of the screen, there is a large image of an American flag waving from a roller coaster car, with several people seated in the cars behind it.

In the center of the screenshot, there is a prominent text that reads "WHAT WE DO" in uppercase letters, suggesting a section or category on the website. To the right of this central text, there are three smaller images showing different scenes: one with a roller coaster car, another with a person standing next to a building, and a third with a group of people sitting at a table.

Below these images, there is a section titled "OUR SERVICES" followed by a list of services provided by the company, which includes "Financial Planning," "Investment Management," "Retirement Planning," and "Es

In [13]:
prompt='Describe all the images in this screenshot of a website in as much detail as possible within 3 paragraphs. Describe the content of each image, the colors, and their position and coordinates within the image with respect to all the other components of the website. Only describe what you can see.'

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "llava", prompt, 0.0)


Model: llava - Processing ./images/ms_welcome.png

 The image is a screenshot of a website, specifically the homepage of Morgan Stanley. The website features a large banner at the top with the company's logo on the left and the text "Morgan Stanley" in white letters on the right. Below this banner, there are two smaller images placed side by side.

The image on the left shows a group of people riding a roller coaster. The roller coaster is predominantly yellow with red accents, and it's set against a clear blue sky. The people on the ride appear to be enjoying themselves, with some waving at the camera.

The image on the right depicts an American flag flying high in the air. The flag is red, white, and blue, and it's positioned centrally within the image.

Below these images, there are two blocks of text. The first block of text reads "What we do" and is accompanied by a small icon of a globe. The second block of text reads "Our values" and includes an icon of a person.

The website h

## Reducing Hallucinations #2
More detailed prompts, also some attempts at few-shot

In [9]:
example_img = "./images/ms_tech.png"
img = "./images/ms_welcome.png"
image_files = [example_img, img]

In [35]:
# DETAILED PROMPT, NO EXAMPLES YET
detailed_prompt="""Here is a webpage on a website. Your goal is to describe this webpage to someone who is blind. In as much detail as possible, please describe ALL the components that exist on the webpage, including the text, images, menu items, links, etc. Describe their positions on the webpage, color, etc. For all the text, mention exactly what the text says. Start from the top left of the webpage and work your way down until you reach the end of the webpage, step by step. 

Rules: Only describe what you can see, otherwise say you are not sure and move on to the next part of the page."""

process_image(img, "llava", detailed_prompt, 0.0)


Model: llava - Processing ./images/ms_welcome.png

 The image appears to be a screenshot of a webpage. At the top, there is a navigation bar with menu items such as "Home," "About Us," "Contact Us," and "Careers." Below this, there is a large banner image featuring an American flag and a roller coaster car with passengers in it. The text on the banner reads "MORGAN STANLEY" at the top left corner of the banner.

Below the banner, there is a section titled "What We Do," which includes bullet points describing the company's services and values. The text is too small to read in detail.

To the right of this section, there is a sidebar with additional information about the company, including its history, leadership team, and awards or accolades it has received. This text is also too small to read clearly.

At the bottom of the webpage, there is a footer with links to various pages on the website, such as "Privacy Policy," "Terms of Use," and "Accessibility." The text in the footer is not 

In [36]:
# DETAILED PROMPT, NO EXAMPLES YET -- TECH IMAGE
process_image(example_img, "llava", detailed_prompt, 0.0)


Model: llava - Processing ./images/ms_tech.png

 The image appears to be a screenshot of a webpage. At the top, there is a navigation bar with menu items that are not fully visible due to the resolution and angle of the photo. Below this, on the left side of the page, there is a logo with the text "MORGAN STANLEY" in capital letters, followed by a tagline or slogan that reads "EXPLORING TECHNOLOGY & DISRUPTION."

The main content area features a large image of a blue waveform graphic with a white play button icon superimposed on it. Below this image, there is a section titled "IDEAS" with a subtitle that says "Discover new ways to connect with your audience." This section includes two smaller images, one showing a person's hands interacting with a digital interface and the other depicting a smartphone with a graphical user interface.

The bottom of the page contains additional text that is not fully legible due to the image quality. The overall color scheme of the webpage is a mix of 

In [10]:
# FEW SHOT PROMPTING, MULTI-IMAGE
# def few_shot(image_files, model, system, template, prompt, temperature):
def few_shot(image_files, model, prompt, temperature):
    print(f"\nModel: {model} - Processing {image_files}\n")
    
    images_list = []

    # Process the images
    for image_file in image_files:
        with Image.open(image_file) as img:
            with BytesIO() as buffer:
                img.save(buffer, format='PNG')
                image_bytes = buffer.getvalue()
                images_list.append(image_bytes)
#     print("FINAL images_list: ", images_list[0])
    
    full_response = ''
    # Generate a description of the image
    for response in generate(model=model, 
#                              system=system,
#                              template=template,
                             prompt=prompt,
                             images=images_list, 
                             options={"temperature": temperature},
                             stream=True):
        # Print the response to the console and add it to the full response
        print(response['response'], end='', flush=True)
        full_response += response['response']

    # Add a new row to the DataFrame, keyed by the last image in the list
    # (the one the model is asked to describe)
    df.loc[len(df)] = [image_files[-1], full_response]

In [14]:
# FEW-SHOT PROMPTING WITH TEXT+IMAGE
detailed_prompt = """A chat between a blind human and an artificial intelligence assistant, where the human presents an image to the assistant. The assistant’s job is to: describe ALL the components that exist on the webpage in as much detail as possible, including the text, images, menu items, links, etc. Describe their positions on the webpage, color, etc. For all the text, mention exactly what the text says. Start from the top left of the webpage and work your way down until you reach the end of the webpage, step by step. Rules: Only describe what you can see, otherwise say you are not sure and move on to the next part of the page.

[INST]
<image>\\n
<!—Start Description of Banner—> 
There is a banner at the top of the entire webpage. On the top left side of the banner, there is large text that reads 'Morgan Stanley' as the logo. On the right hand side of the banner, there are 2 tiers of menus. At the very top right, there is the smaller menu with these items: 'INVESTOR RELATIONS', 'MEDIA RELATIONS', 'CLIENT LOGIN' in all capital letters. The 'CLIENT LOGIN' menu item has a dropdown. Right below this smaller menu is a larger menu with these items: 'Our Insights', 'What We Do', 'About Us', 'Why Us', 'Careers', and a search icon button. All this text is in the color black. The background color of the banner is white. The background color of all the content below the banner is light gray.
<!—End Description of Banner—>

<!—Start Description of Header—>
Below this banner is a very large header with text that reads 'Explore Technology & Disruption Ideas'. The 'Technology & Disruption' part of this text is green and underlined with a dropdown. The other parts of the text are black. The entire header is centered on the page.
<!—End Description of Header—>

<!—Start Description of Articles List—>
Below this large header is a list of articles. There are 2 articles listed on this page, separated by a horizontal black line. Each article in the list follows the same structure.

The first article contains:
1. A small blue header with text in all caps that reads 'TECHNOLOGY & DISRUPTION'
2. Below that, a slightly larger black header in all caps that reads 'RETAIL’S COMEBACK PLAN'
3. Below that, a date in italicized black font that reads 'Jul 15, 2024'
4. Below that, a description in black text that reads 'Our Retail analyst and U.S. Internet analyst connect the dots on how technology is helping the retail industry to cash in on the future.'
5. On the right-hand side of all this text is a square image and play button in the center of the image, to indicate a video.

The second article contains:
1. A small blue header with text in all caps that reads 'TECHNOLOGY & DISRUPTION'
2. Below that, a slightly larger black header in all caps that reads 'CONVENIENCE IS COMPELLING'
3. Below that, a date in italicized black font that reads 'Jun 12, 2024'
4. Below that, a description in black text that reads 'Our U.S. Thematic Strategist explains the premium that consumers will pay for convenience, and what that means for sectors including online retail, dining and package delivery.'
5. On the right-hand side of all this text is another square image and play button in the center of the image, to indicate a video.
<!—End Description of Articles List—>

[/INST] 
[INST] 
<image>\\n
"""

few_shot(image_files, "llava", detailed_prompt, 0.0)


Model: llava - Processing ['./images/ms_tech.png', './images/ms_welcome.png']

 The image you've provided appears to be a screenshot of

KeyboardInterrupt: 
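
The example description in the prompt wraps each section in Start/End markers, so a quick balance check before sending the prompt can catch copy-paste slips. This is a minimal sketch: `unbalanced_sections` is a hypothetical helper, not part of the notebook, and it assumes markers follow the `Start Description of X` / `End Description of X` wording used above, written with either an em dash or `--`.

```python
import re

# Section markers look like <!(dash)Start Description of Banner(dash)>, where
# the dash is either an em dash (U+2014, as typed in the prompt) or "--".
MARKER = re.compile(
    r"<!(?:\u2014|--)\s*(Start|End) Description of (.+?)\s*(?:\u2014|--)>"
)


def unbalanced_sections(prompt: str) -> list[str]:
    """Return the names of sections whose Start/End markers do not pair up."""
    open_sections = []
    problems = []
    for kind, name in MARKER.findall(prompt):
        if kind == "Start":
            if name in open_sections:
                problems.append(name)  # opened twice without an End in between
            else:
                open_sections.append(name)
        else:
            if name in open_sections:
                open_sections.remove(name)
            else:
                problems.append(name)  # closed without a matching Start
    # Sections still open at the end never received their End marker
    return sorted(set(problems) | set(open_sections))
```

Calling `unbalanced_sections(detailed_prompt)` before `few_shot(...)` should return an empty list when every section is closed properly.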

In [None]:
# FEW-SHOT PROMPTING WITH TEXT+IMAGE
detailed_prompt = """[INST]A chat between a blind human and an artificial intelligence assistant, where the human presents an image to the assistant. The assistant’s job is to: describe ALL the components that exist on the webpage in as much detail as possible, including the text, images, menu items, links, etc. Describe their positions on the webpage, color, etc. For all the text, mention exactly what the text says. Start from the top left of the webpage and work your way down until you reach the end of the webpage, step by step. Rules: Only describe what you can see, otherwise say you are not sure and move on to the next part of the page.


USER: <image>\\n
ASSISTANT: <!—Start Description of Banner—> 
There is a banner at the top of the entire webpage. On the top left side of the banner, there is large text that reads 'Morgan Stanley' as the logo. On the right hand side of the banner, there are 2 tiers of menus. At the very top right, there is the smaller menu with these items: 'INVESTOR RELATIONS', 'MEDIA RELATIONS', 'CLIENT LOGIN' in all capital letters. The 'CLIENT LOGIN' menu item has a dropdown. Right below this smaller menu is a larger menu with these items: 'Our Insights', 'What We Do', 'About Us', 'Why Us', 'Careers', and a search icon button. All this text is in the color black. The background color of the banner is white. The background color of all the content below the banner is light gray.
<!—End Description of Banner—>

<!—Start Description of Header—>
Below this banner is a very large header with text that reads 'Explore Technology & Disruption Ideas'. The 'Technology & Disruption' part of this text is green and underlined with a dropdown. The other parts of the text are black. The entire header is centered on the page.
<!—End Description of Header—>

<!—Start Description of Articles List—>
Below this large header is a list of articles. There are 2 articles listed on this page, separated by a horizontal black line. Each article in the list follows the same structure.

The first article contains:
1. A small blue header with text in all caps that reads 'TECHNOLOGY & DISRUPTION'
2. Below that, a slightly larger black header in all caps that reads 'RETAIL’S COMEBACK PLAN'
3. Below that, a date in italicized black font that reads 'Jul 15, 2024'
4. Below that, a description in black text that reads 'Our Retail analyst and U.S. Internet analyst connect the dots on how technology is helping the retail industry to cash in on the future.'
5. On the right-hand side of all this text is a square image and play button in the center of the image, to indicate a video.

The second article contains:
1. A small blue header with text in all caps that reads 'TECHNOLOGY & DISRUPTION'
2. Below that, a slightly larger black header in all caps that reads 'CONVENIENCE IS COMPELLING'
3. Below that, a date in italicized black font that reads 'Jun 12, 2024'
4. Below that, a description in black text that reads 'Our U.S. Thematic Strategist explains the premium that consumers will pay for convenience, and what that means for sectors including online retail, dining and package delivery.'
5. On the right-hand side of all this text is another square image and play button in the center of the image, to indicate a video.
<!—End Description of Articles List—>

USER: <image>\\n
ASSISTANT: 
[/INST]
"""

few_shot(image_files, "llava", detailed_prompt, 0.0)


Model: llava - Processing ['./images/ms_tech.png', './images/ms_welcome.png']

 The image you've

In [None]:
system = """You are given a webpage on a website. Your goal is to describe this webpage to someone who is blind. In as much detail as possible, please describe ALL the components that exist on the webpage, including the text, images, menu items, links, etc. Describe their positions on the webpage, color, etc. For all the text, mention exactly what the text says. Start from the top left of the webpage and work your way down until you reach the end of the webpage, step by step. 

Rules: Only describe what you can see, otherwise say you are not sure and move on to the next part of the page.
"""

template="""Here is a webpage on a website. Your goal is to describe this webpage to someone who is blind. In as much detail as possible, please describe ALL the components that exist on the webpage, including the text, images, menu items, links, etc. Describe their positions on the webpage, color, etc. For all the text, mention exactly what the text says. Start from the top left of the webpage and work your way down until you reach the end of the webpage, step by step. 

Rules: Only describe what you can see, otherwise say you are not sure and move on to the next part of the page.

Here is an example:
Given this [IMAGE], here is its Description:
[INSERT MANUAL/HUMAN DESCRIPTION]
"""
few_shot(image_files, "llava", detailed_prompt, 0.0)

In [None]:
# FEW-SHOT PROMPTING WITH TEXT ONLY (NO IMAGE)
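
One possible way to build the text-only variant is to fill the two placeholders in `template` (`[IMAGE]` and `[INSERT MANUAL/HUMAN DESCRIPTION]`) with plain text. This is only a sketch under assumptions: `fill_template` and the default `image_label` wording are hypothetical, and the human description would be the manual one written for the few-shot prompts above.

```python
def fill_template(template: str, description: str,
                  image_label: str = "the example webpage") -> str:
    """Substitute the template placeholders for a text-only few-shot prompt."""
    placeholders = {
        "[IMAGE]": image_label,  # no real image is attached in this variant
        "[INSERT MANUAL/HUMAN DESCRIPTION]": description.strip(),
    }
    filled = template
    for placeholder, value in placeholders.items():
        if placeholder not in filled:
            # Fail loudly rather than send a prompt with a missing example
            raise ValueError(f"placeholder {placeholder!r} not found in template")
        filled = filled.replace(placeholder, value)
    return filled
```

The result could then be passed as the prompt to `process_image` or `few_shot` in place of `detailed_prompt`.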


## LLaVA-Mistral

In [9]:
prompt3 = """[INST]<image>\n Here is a webpage on a website. Your goal is to describe this webpage to someone who is blind. In as much detail as possible, please describe ALL the components that exist on the webpage, including the text, images, menu items, links, etc. Describe their positions on the webpage, color, etc. For all the text, mention exactly what the text says. Start from the top left of the webpage and work your way down until you reach the end of the webpage, step by step. 
Rules: Only describe what you can see, otherwise say you are not sure and move on to the next part of the page.
[/INST]"""

In [10]:
process_image("./images/ms_welcome.png", "mapler/llava-mistral", prompt3, 0.0)


Model: mapler/llava-mistral - Processing ./images/ms_welcome.png


I'm sorry, but I am unable to view images or webpages as a visual assistant. However, I can provide general guidance on how to describe a webpage for someone who is blind. Here are some steps you can follow:

1. Start from the top left of the webpage and work your way down until you reach the end of the page.
2. Describe each component in as much detail as possible, including text, images, menu items, links

KeyboardInterrupt: 

In [13]:
# PROCESS IMAGE DIFFERENTLY ?
process_image2("./images/ms_welcome.png", "mapler/llava-mistral", prompt3, 0.0)


Model: mapler/llava-mistral - Processing ./images/ms_welcome.png


I'm sorry, but I am unable to view images or webpages as a visual assistant

KeyboardInterrupt: 
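
Rather than interrupting by hand whenever a model says it cannot see the image, the responses could be screened automatically. A minimal sketch, assuming a simple phrase match is good enough: `looks_like_refusal` is a hypothetical helper, and the phrases are copied from refusals that appear in this notebook, so the list is certainly incomplete.

```python
# Phrases taken from refusal-style outputs observed in this notebook
REFUSAL_PHRASES = (
    "unable to view images",
    "cannot physically view",
    "don't have access to external content",
    "lack of an actual image",
)


def looks_like_refusal(response: str) -> bool:
    """Return True if the response suggests the model never saw the image."""
    lower = response.lower()
    return any(phrase in lower for phrase in REFUSAL_PHRASES)
```

A refusal usually shows up in the first sentence, so checking the first streamed chunks may be enough to stop generation early.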

## LLaVA-Phi3 through Ollama (small LLaVA model fine-tuned from Phi 3 Mini)
https://ollama.com/library/llava-phi3
https://medium.com/@rajeshmanikumarg/llava-phi-3-mini-lm-studio-large-language-and-vision-assistant-local-setup-a24182616bfd

In [13]:
for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "llava-phi3", prompt1)


Processing ./images/ms_welcome.png

The image is a screenshot of the Morgan Stanley website. The background color of the webpage is white, providing a stark contrast to the blue header bar at the top. The main content area of the page prominently displays an advertisement for an amusement park ride called "What We Do". 

The advertisement features several people enjoying the roller coaster ride named "Flight Deck 1", which is set against a backdrop that appears to be a clear blue sky dotted with fluffy white clouds. The text within the image reads, "Get ready for some serious adrenaline jolts as you soar above the park on our high-flying roller coaster!". 

The website also includes additional details about the ride at the bottom of the page, presumably providing more information to potential visitors. The overall layout and design suggest that this is a promotional page for an upcoming event or attraction at the amusement park.
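
The loop above has its `if image_file not in df['image_file'].values` check commented out, so every image is re-processed on each run. If skipping already-described images is wanted again, a small filter like the one below could be used. It is a sketch with a hypothetical name (`pending_images`), and it compares normalized paths so that `./images/x.png` and `images/x.png` count as the same file.

```python
import os


def pending_images(image_files, described_files):
    """Return the image files that do not have a stored description yet."""
    done = {os.path.normpath(f) for f in described_files}
    return [f for f in image_files if os.path.normpath(f) not in done]
```

In this notebook the call would look like `pending_images(image_files, df['image_file'])`.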

In [18]:
for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "llava-phi3", prompt2)


Model: llava-phi3 - Processing ./images/ms_welcome.png


The image captures a screenshot of a website named "What We Do" operated by Morgan Stanley. The website has an open, white layout with black text and images against it. Dominating the top left corner is the company's logo featuring three red stripes on a blue background set against a white backdrop. Just below the logo, there are two American flags waving in the wind, adding a patriotic touch to the design.

The main attraction of the website is an image of people enjoying a thrilling ride on a roller coaster track with water splashing around them, suggesting outdoor fun and adventure. The text beneath this image reads "Fun-loving American families come together to experience the rush of adrenaline that only extreme sports can offer". This suggests that the website is dedicated to promoting extreme sports among American families.

The website also features a search bar located in the top right corner, allowing users to find spec

#### ANALYSIS

ACCURATE:
- recognizes "What We Do" text
- recognizes Morgan Stanley
- white layout with black text and images against it
- top left corner is the company's logo 
- recognizes American flags below logo, rollercoasters, blue background, white backdrop (correct color, position, and some content)
- recognizes search bar at top right corner

INACCURATE:
- hallucinations: 
    - made up menu items "Home," "Contact Us,"
    - made up text beneath image of the rollercoaster - "Fun-loving American families come together to experience the rush of adrenaline that only extreme sports can offer"
    - made up social media buttons are positioned on the bottom left corner
- there are 3 not 2 American flags
- inaccurate overall design goal (promote extreme sports)
    
OVERALL ANALYSIS: 
MUCH more accurate description with fewer hallucinations compared to LLaVA+prompt1, but fails to capture overall image purpose, text, and navigation accurately for the most part

POSSIBLE CAUSES: 
prompt is not specific enough
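
The manual ACCURATE/INACCURATE lists could be partly backed by a simple text check: given the strings known to be on the page, see which ones the description mentions. This is a rough sketch, not a real metric. `text_recall` is a hypothetical helper, and the expected strings are taken from the banner description in the few-shot prompt, on the assumption that `ms_welcome.png` shows the same banner.

```python
def text_recall(description: str, expected: list[str]):
    """Split expected on-page strings into those found and those missing."""
    lower = description.lower()
    found = [text for text in expected if text.lower() in lower]
    missing = [text for text in expected if text.lower() not in lower]
    recall = len(found) / len(expected) if expected else 0.0
    return found, missing, recall


# Assumed ground truth: logo text plus the larger menu from the banner
EXPECTED_TEXT = ["Morgan Stanley", "Our Insights", "What We Do",
                 "About Us", "Why Us", "Careers"]
```

A substring match says nothing about position or color, and it cannot detect invented text, so it complements the manual analysis rather than replacing it.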

In [19]:
for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "llava-phi3", prompt3)


Model: llava-phi3 - Processing ./images/ms_welcome.png

The image captures a moment on the iconic Cedar Point Amusement Park's Mill Race roller coaster. The roller coaster, known for its thrilling drops and twists, is in full swing, with people strapped securely into their seats. From this vantage point, we can see the top of the track, providing a unique perspective on the ride.

Three American flags are prominently displayed against the backdrop of an overcast sky, fluttering in the wind as if cheering on the riders below. The colors of the flags - red, white, and blue - stand out vividly against the grayscale tones of the scene.

Below this captivating visual is a video description that reads "What We Do". This text seems to hint at an exploration or explanation of the experiences and actions associated with amusement park rides like this roller coaster.

The image also contains several hyperlinks, including one for 'About Us', suggesting that it's part of an online platform dedica

In [12]:
# LOWER THE TEMPERATURE:
for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "llava-phi3", prompt3, 0.3)


Model: llava-phi3 - Processing ./images/ms_welcome.png


The image is a screenshot of the Morgan Stanley website. The main focus of the image is an advertisement for Disneyland, which features a group of people riding on a roller coaster with American flags flying in the background. The text at the top of the page reads "Morgan Stanley" and there are several links visible below it, including one that says "What We Do". On the right side of the image, there is a search bar for users to find information on the website. The overall color scheme of the image is predominantly white and blue, with some yellow and red accents from the Disneyland advertisement.

#### ANALYSIS

ACCURATE:
- recognizes presence of rollercoaster + American red flags
- recognizes "Morgan Stanley" text + position (top)
- recognizes presence of links below "Morgan Stanley" text
- recognizes "What We Do" text + position (below "Morgan Stanley" text)
- recognizes existence of search bar + its position (right side of image)

INACCURATE:
- claims this is an ad for Disneyland
- no mention of menu item text, text inside image, or text below image
    
OVERALL ANALYSIS: 
THE MOST ACCURATE DESCRIPTION so far with the FEWEST hallucinations compared to previous prompts and LLaVA!!! BUT biggest downside is that it does not contain many specific details


POSSIBLE CAUSES: 
model is too simple, tune parameters


##### GENERAL OBSERVATIONS/PATTERNS SO FAR
- not as detailed/specific as LLaVA
    - sometimes mentions header and menu, but not very detailed at all
    - fewer mentions of specific text (especially the menu items)
    - also no mention of text inside the image
    - fewer hallucinations with menu items but that might also be because it's less detailed overall
- does recognize rollercoaster + American flag in image, as well as color
- OVERALL: DOES NOT PERFORM AS WELL AS LLAVA IN TERMS OF DETAIL, BUT MIGHT RESULT IN FEWER HALLUCINATIONS WITH THE RIGHT PARAMETERS/TUNING
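
To compare temperatures more systematically than one cell per value, the runs could be collected in one place. A minimal sketch under assumptions: `temperature_sweep` is a hypothetical helper, and it expects a `describe` callable with the same argument order as `process_image` that returns the full response text (if `process_image` only prints, it would need a small wrapper).

```python
from typing import Callable, Iterable


def temperature_sweep(describe: Callable[[str, str, str, float], str],
                      image_file: str, model: str, prompt: str,
                      temperatures: Iterable[float]) -> list[dict]:
    """Run the same image/model/prompt at several temperatures."""
    rows = []
    for temperature in temperatures:
        description = describe(image_file, model, prompt, temperature)
        rows.append({
            "image_file": image_file,
            "model": model,
            "temperature": temperature,
            "description": description,
        })
    return rows
```

The rows can be turned into a table with `pd.DataFrame(rows)` and compared side by side, for example for temperatures `[0.8, 0.3, 0.0]`.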

In [26]:
# LOWER THE TEMPERATURE EVEN MORE (0.0)
for image_file in image_files:
#     if image_file not in df['image_file'].values:
#         process_image(image_file, "llava:13b-v1.6")
        process_image(image_file, "llava-phi3", prompt3, 0.0)


Model: llava-phi3 - Processing ./images/ms_welcome.png

The image captures a vibrant scene from the amusement park Six Flags Over Texas. Dominating the frame is a roller coaster track, painted in a striking shade of yellow. The track curves gracefully to the left, leading the eye towards two flags fluttering atop it - one red and white, the other blue and white. 

The website itself is bathed in a clean white background, with black text providing a stark contrast for easy readability. At the top of the page, you can see the logo of Six Flags Over Texas, proudly announcing its presence. Just below it, there's a navigation bar equipped with links to various sections of the site.

The main attraction of the website is undoubtedly the "What We Do" section. It provides an intriguing glimpse into the inner workings of the amusement park, detailing its history and showcasing some of its thrilling rides. The text on this page is clear and concise, inviting visitors to explore more about what 

## BakLLaVA through Ollama
https://ollama.com/library/bakllava

In [7]:
prompt3='Describe this image in as much detail as possible within 3 paragraphs, focusing on the text, position/coordinates/navigation, color, etc. of ALL the different components like text, images, menu items, links, etc.'

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "bakllava", prompt3)


Model: bakllava - Processing ./images/ms_welcome.png


The website shown is morganstanley.com, which has a lot of information on it. On the front page, there are multiple menu items displayed at the top, such as "about," "contact," and "careers." There's also a section that says, "What We Do," with a brief description of the company's purpose. 
The website features several flags, suggesting a possible focus on American or global politics. A number of small images are scattered throughout the site, likely representing different aspects of the business or its principles.

In [10]:
prompt3='Describe this image in as much detail as possible within 250 words, focusing on the text, position/coordinates/navigation, color, etc. of ALL the different components like text, images, menu items, links, etc.'

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "bakllava", prompt3, 0.0)


Model: bakllava - Processing ./images/ms_welcome.png


Morgan Stanley is the company shown on this website.

In [11]:
prompt4='Describe this image in as much detail as possible within 3 paragraphs, focusing on the text, position/coordinates/navigation, color, etc. of ALL the different components like text, images, menu items, links, etc. Only describe what you can see.'

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "bakllava", prompt4, 0.0)


Model: bakllava - Processing ./images/ms_welcome.png


Morgan Stanley is the company shown on this website.

In [12]:
prompt4='Describe this image in as much detail as possible within 3 paragraphs, focusing on the text, position/coordinates/navigation, color, etc. of ALL the different components like text, images, menu items, links, etc.'

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "bakllava", prompt4)


Model: bakllava - Processing ./images/ms_welcome.png


a web page with the headline What We Do, and pictures of a roller coaster with American flags on it.

In [13]:
prompt5='Describe this image in as much detail as possible. Describe the text, position/coordinates/navigation, color, etc. of ALL the different components like text, images, menu items, links, etc. Only describe what you can see.'

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "bakllava", prompt5)


Model: bakllava - Processing ./images/ms_welcome.png


The image is a website page that says "What We Do" on it. The page has two rows of text with one row placed above the other. There are three flags in the upper left corner and they appear to be waving. The text on the page seems to be an overview or description of what the site offers.
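
The BakLLaVA prompts ask for "3 paragraphs" or "250 words", but the responses are far shorter. A quick way to quantify that is to count words and paragraphs per response. This is a sketch with a hypothetical helper name (`length_stats`); it treats blank lines as paragraph breaks and whitespace-separated tokens as words.

```python
import re


def length_stats(text: str) -> dict:
    """Count whitespace-separated words and blank-line-separated paragraphs."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    return {"words": len(text.split()), "paragraphs": len(paragraphs)}
```

For the temperature 0.0 response above ("Morgan Stanley is the company shown on this website.") this gives 9 words in 1 paragraph.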

## Phi-3 through Ollama
https://ollama.com/library/phi3

Phi-3 is a family of open small language models (SLMs) developed by Microsoft. Microsoft describes them as its most capable and cost-effective SLMs, outperforming models of the same size and the next size up across a variety of language, reasoning, coding, and math benchmarks.

In [23]:
prompt1="Describe this image and make sure to include anything notable about it (include text you see in the image):" 

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "phi3", prompt1)


Model: phi3 - Processing ./images/ms_welcome.png

 As I am an AI developed by Microsoft, I don't have access to external content such as images or personal data unless they are shared directly with me within our conversation. Therefore, I cannot physically view or describe any specific picture labeled [img-0]. However, if you provide a textual description of the image—what it depicts and important details like colors, objects present in the scene, people's expressions on their faces (if applicable), background context—I can certainly help craft an imaginative response to describe what such an image might be portraying.

If you have a text-based description of "img-0" that you would like me to elaborate upon or create from scratch, feel free to share the details here!

In [12]:
prompt2='Describe this image in as much detail as possible within 3 paragraphs, focusing on the structure, position/coordinates, and color of ALL the different elements like text, images, menu items, links, etc.'

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "phi3", prompt2)


Model: phi3 - Processing ./images/ms_welcome.png

 The provided illustration depicts a webpage design for an online bookstore named "Aurum Bibliotheca." At the top-left corner lies the navigation bar with rich brownish hues that stand out due to its contrast against the more subdued background colors of lavender and white. It is prominently positioned at coordinates (20, 50) in pixels from the left edge of the browser window on a standard monitor resolution setting such as 1920x1080 pixels.

Diving deeper into detail within this navigation bar are three main elements: 'Home,' which is highlighted with an arrow icon and has its text color set to electric blue, making it immediately noticeable; 'Shop' presented in a bold font style with gray-toned letters that give the impression of printed paper stock for enhanced realism on digital platforms. The third option, ‘About Us,’ shows as greyish black characters against an off-white background and is less emphasized by design but still easil

In [8]:
prompt3='Describe this image in as much detail as possible within 250 words, focusing on the text, position/coordinates/navigation, color, etc. of ALL the different components like text, images, menu items, links, etc.'

for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "phi3", prompt3)


Model: phi3 - Processing ./images/ms_welcome.png

 This digital interface showcases a user-friendly web page with clear navigation and engaging design elements that facilitate ease of access for its users. The overall background is light gray (#F0F8FF), providing an uncluttered canvas which makes the content stand out effectively.

At the top right corner, there's a prominent 'Search Bar', prominently displayed in white text on this neutral backdrop with rounded corners for visual appeal and easy accessibility from any position within its bounds (x: 70px to x+250 px; y: 15px to y+30px). Below the search bar, you find a 'Logo' of the website. It is represented as an image with coordinates at about 680px from left and approximately centered vertically on the screen due to its height being set near half of that (around x:450px; y:230px). The logo adopts a vibrant color scheme in contrast, primarily using shades of blue which can evoke feelings of trustworthiness.

Moving downwards is an 

In [11]:
# LOWER THE TEMPERATURE:
for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "phi3", prompt3, 0.3)


Model: phi3 - Processing ./images/ms_welcome.png

 The provided document appears to be a web page with several interactive elements that are crucial for user navigation and understanding content. At the top right corner is an image labeled "img-0," which seems to serve as either decoration or possibly related imagery pertinent to the website's theme, although its specific function isn't clear from this excerpt alone.

The header of the page contains a prominent navigation menu with links that are likely primary pathways for site exploration: Home (#), About Us (#about_us), Services (#services), Contact (#contact). These hyperlinks, which would typically be underlined and colored blue or another contrasting color to indicate interactivity when hovered over by a mouse cursor in an actual webpage scenario, are essential for guiding users through the site's content.

Below this header is text that likely serves as either introductory information about the website or detailed descriptions 

#### ANALYSIS

ACCURATE:
- menu items: About Us (#about_us)

INACCURATE:
- hallucinations: 
    - img-0
    - menu items: Home (#), Services (#services), Contact (#contact) -- claims they are in blue text (instead of black..)
    - NO mention of the large image and its contents
    - claims "Our Services," which includes a bulleted list detailing specific offerings such as consultation, customization options for various projects (e.g., wedding planning), pricing information denoted by bold text or icons suggesting affordability ("$"), payment methods including credit cards and online bank transfers indicated with buttons that would visually stand out when hovered over, possibly changing color to signify 
    - (© 2019) along with social media links
    - no green?
    - footer's social media icons??
    
OVERALL ANALYSIS: 
ENTIRELY HALLUCINATED CONTENT (seems to be describing a shopping website)

POSSIBLE CAUSES: 
Image processing 

##### GENERAL OBSERVATIONS/PATTERNS
- Much FASTER response generation! (truly lightweight)
- VERY detailed (the most detailed response out of BLIP-2, LLaVA, LLaVA-Phi3) -- able to mention exact coordinates and colors
- OVERALL: even though Phi-3 is very detailed, it is also THE MOST INACCURATE of all the models tested -- BECAUSE IT'S NOT MULTI-MODAL
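
Since a text-only model will happily invent a description, it may help to guard image calls with a check on the model name. The sketch below is an assumption-heavy shortcut: `accepts_images` is hypothetical, and the allowlist is hand-maintained from the models that actually described image content in this notebook, not queried from Ollama.

```python
# Hand-maintained allowlist (assumption): models that described image content here
VISION_MODELS = {"llava", "llava-phi3", "bakllava"}


def accepts_images(model: str) -> bool:
    """Return True only for model names known to take image input."""
    # Drop an optional "namespace/" prefix and ":tag" suffix
    base = model.split("/")[-1].split(":")[0]
    return base in VISION_MODELS
```

Unknown names return False on purpose, so a new model has to be added to the list deliberately after checking that it really uses the image.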

In [22]:
# LOWER THE TEMPERATURE EVEN MORE (0.0):
for image_file in image_files:
#     if image_file not in df['image_file'].values:
        process_image(image_file, "phi3", prompt3, 0.0)


Model: phi3 - Processing ./images/ms_welcome.png

 The provided description cannot be completed due to a lack of an actual image or its details such as coordinates and navigation elements within this context. To accurately describe any visual content in detail, specific information about each element present is required—such as the positioning on the page (e.g., top right corner), color schemes used for text versus backgrounds, font styles employed throughout various components like headings or body copy, and navigation elements such as menus with their respective items listed along with any associated images that might accompany them.

In a typical scenario where an image is provided containing these details, the description would include: 
- The exact coordinates of each element on the page (e.g., "The 'Home' menu item located at top left corner [x1, y1] to [x2, y2])."
- Descriptions of colors used for text and backgrounds ("Text is in dark grey with a light blue accent color").
- F

## (Attempt at running Hugging Face locally - WIP)

In [4]:
!pip install -q -U transformers #==4.37.2
!pip install -q bitsandbytes accelerate #==0.41.3 accelerate==0.25.0


In [3]:
!pip install --upgrade pip

Collecting pip
  Downloading pip-24.1.2-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-24.1.2-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 4.2 MB/s eta 0:00:00
DEPRECATION: pyodbc 4.0.0-unsupported has a non-standard version number. pip 24.0 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pyodbc or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.3.1
    Uninstalling pip-23.3.1:
      Successfully uninstalled pip-23.3.1
Successfully installed pip-24.1.2


In [5]:
from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

2024-07-23 17:06:08.263610: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [None]:
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

config.json:   0%|          | 0.00/950 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/70.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.18G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
prompt = "USER: <image>\nDescribe this image in as much detail as possible within 3 paragraphs, focusing on the text, structure, position/coordinates, and color of ALL the different elements like text, images, menu items, links, etc.\nASSISTANT:"
# url = "https://www.ilankelman.org/stopsigns/australia.jpg"
# image = Image.open(requests.get(url, stream=True).raw)
image = Image.open("./images/ms_welcome.png")  # same relative path as in the Ollama cells
# display(image.resize((596, 437)))

inputs = processor(text=prompt, images=image, return_tensors="pt")

In [None]:
# Generate
generate_ids = model.generate(**inputs, max_new_tokens=200)
processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
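
With this kind of model, the decoded string typically contains the prompt followed by the generated continuation, so the description still has to be separated from the `USER: ... ASSISTANT:` text. A minimal sketch, assuming the prompt format used above: `extract_assistant_reply` is a hypothetical helper that keeps only what follows the last `ASSISTANT:` marker.

```python
def extract_assistant_reply(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last marker, or the whole string if absent."""
    _, found, reply = decoded.rpartition(marker)
    return reply.strip() if found else decoded.strip()
```

It would be applied to the first element returned by `processor.batch_decode(...)`.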