In [1]:
import os
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Milvus

In [2]:
# We use OpenAI embeddings here; a later version will swap in local embeddings
import openai
openai_api_key = os.getenv("OPENAI_API_KEY")

In [3]:
# define model for embeddings
embeddings = OpenAIEmbeddings(model='text-embedding-ada-002', openai_api_key=openai_api_key)

In [4]:
# Get current vector database storing the indexed test files
vector_db = Milvus(embedding_function=embeddings,
                   collection_name='testfiles_repo',
                   connection_args={"host": "localhost", "port": "19530"},)

In [5]:
# Get the retriever from the Milvus database, returning up to 5 documents
retriever = vector_db.as_retriever(search_kwargs={"k": 5})

# Alternative: maximal marginal relevance (MMR) search.
# Fetch 50 candidate documents for the MMR algorithm to consider, but return only the top 5.
# retriever = vector_db.as_retriever(search_type="mmr", search_kwargs={"k": 5, "fetch_k": 50})
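
The MMR option mentioned in the cell above can be illustrated with a small pure-Python sketch. This is not LangChain's or Milvus's implementation. It only shows the general idea: from a pool of candidates, greedily pick documents that are relevant to the query but not redundant with those already picked. The function names, the `lambda_mult` weight, and the toy 2-D vectors are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, candidate_vecs, k=5, lambda_mult=0.5):
    """Return the indices of up to k candidates chosen greedily by MMR.

    lambda_mult (assumed name) weights relevance against redundancy:
    1.0 means pure similarity search, lower values favour diversity.
    """
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            # Highest similarity to anything already selected (0 if none yet)
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: candidates 0 and 1 are exact duplicates, candidate 2 is different.
query = [1.0, 0.0]
candidates = [[1.0, 0.0], [1.0, 0.0], [0.6, 0.8]]
picked = mmr_select(query, candidates, k=2, lambda_mult=0.3)
print(picked)
```

With plain similarity search the two duplicates would fill both slots; here the second pick skips the duplicate in favour of the less similar but non-redundant candidate.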

In [6]:
# Define a prompt template
from langchain.prompts import PromptTemplate
prompt_template = \
"""
Use the following pieces of context to answer the question at the end. 
If you don't know the answer based on the context only, say you do not know the answer. 

{context}

Question: {question}
"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

chain_type_kwargs = {"prompt": PROMPT}
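
To make the role of the two placeholders concrete, here is a minimal sketch of how a template like this ends up as the final prompt text. It uses plain `str.format` instead of LangChain, and it assumes that retrieved chunks are joined with a blank line before being substituted into `{context}`; the helper name `fill_prompt` and the example chunks are hypothetical.

```python
template = """
Use the following pieces of context to answer the question at the end.
If you don't know the answer based on the context only, say you do not know the answer.

{context}

Question: {question}
"""

def fill_prompt(template, chunks, question):
    """Join the retrieved chunks and substitute both placeholders."""
    # Assumption: chunks are separated by one blank line in the context
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)

# Hypothetical retrieved chunks, for illustration only
example_chunks = ["Chunk one text.", "Chunk two text."]
filled = fill_prompt(template, example_chunks, "What does chunk one say?")
print(filled)
```

Because every retrieved chunk is pasted into a single prompt, the prompt length grows with the number and size of the chunks.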

In [7]:
print(retriever)

tags=['Milvus', 'OpenAIEmbeddings'] metadata=None vectorstore=<langchain.vectorstores.milvus.Milvus object at 0x7f40881b0df0> search_type='similarity' search_kwargs={'k': 5}


In [8]:
# Define the retrieval chain
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0.3, model_name="gpt-3.5-turbo"), 
    chain_type="stuff", 
    retriever=retriever, 
    return_source_documents=True,
    chain_type_kwargs=chain_type_kwargs
)

In [9]:
# Define function to collect and print the response and the sources. As they may appear repeatedly, we use a set.
def process_llm_response(llm_response):
    print(llm_response['result'])
    # print(llm_response['source_documents'])
    srcset = set()
    for source in llm_response["source_documents"]:
        srcset.add(source.metadata['source'])
    if srcset:
        print('\n\nSources:')
        for s in srcset:
            print(s)
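
A set removes duplicates but does not guarantee a stable print order. If a deterministic order matters, one possible variant is sketched below: it keeps each source once, in the order of first appearance. The helper name `unique_sources` and the stand-in document objects are hypothetical; real responses contain LangChain document objects with a `metadata` dict, as in the function above.

```python
from types import SimpleNamespace

def unique_sources(llm_response):
    """Return each source path once, in order of first appearance."""
    sources = (doc.metadata["source"] for doc in llm_response["source_documents"])
    # dict.fromkeys drops duplicates while preserving insertion order
    return list(dict.fromkeys(sources))

# Stand-in response with made-up paths, for illustration only
fake_response = {
    "result": "example answer",
    "source_documents": [
        SimpleNamespace(metadata={"source": "b.pdf"}),
        SimpleNamespace(metadata={"source": "a.csv"}),
        SimpleNamespace(metadata={"source": "b.pdf"}),
    ],
}
ordered = unique_sources(fake_response)
print(ordered)
```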

### Now the queries

#### Query 1: Content from specific lines of the .csv file.

In [10]:
query1 = "what were the scores of the games between Arsenal and Aston Villa?"
llm_response = qa_chain(query1)
process_llm_response(llm_response)

Based on the given context, the scores of the games between Arsenal and Aston Villa were:

1) Date: 8/31/2022
   Home Team: Arsenal
   Score: 2 x 1
   Away Team: Aston Villa

2) Date: 2/18/2023
   Home Team: Aston Villa
   Score: 2 x 4
   Away Team: Arsenal


Sources:
./testfiles/csv/premier_league_all_matches_2022-2023-season.csv


#### Query 2: Content from specific lines of the .csv file.

In [11]:
query2 = "Which team has its home at 'The American Express Community Stadium'?"
llm_response = qa_chain(query2)
process_llm_response(llm_response)

Brighton


Sources:
./testfiles/csv/premier_league_all_matches_2022-2023-season.csv


#### Query 3: An out-of-scope question, to make sure only the indexed content is used. According to the prompt, the model should say it does not know the answer.

In [12]:
query3 = "When was the Penicillin discovered?"
llm_response = qa_chain(query3)
process_llm_response(llm_response)

Based on the given context, there is no information provided about the discovery of Penicillin.


Sources:
./testfiles/powerpoint/Generations in the Workplace PPT (Final).pptx
./testfiles/csv/premier_league_all_matches_2022-2023-season.csv
./testfiles/word/anon_reviews.docx


#### Query 4: Summarizing specific info about GANs from .pdf files.

In [13]:
query4 = "What can you tell me about Generative adversarial networks?"
llm_response = qa_chain(query4)
process_llm_response(llm_response)

Based on the given context, generative adversarial networks (GANs) are a successful image generation paradigm. They consist of two neural networks, a generator and a discriminator. The generator learns the distribution of real images by generating images that are indistinguishable from real images, while the discriminator learns to classify the images into real and fake. GANs have impressive capabilities for synthesizing realistic content and are commonly used for various synthesis purposes, such as text-to-image and image-to-image tasks. GANs have also been used for person image generation, which is a challenging task due to the high variability of human pose, shape, and appearance.


Sources:
./testfiles/pdf/garmentGAN.pdf
./testfiles/pdf/InsetGAN.pdf


#### Query 5: Asking about specific data present in .pdf files - note that the output includes data from the paper references.

In [16]:
query5 = "Can you give me several examples of AI systems using GANs?"
llm_response = qa_chain(query5)
process_llm_response(llm_response)

Based on the given context, several examples of AI systems using GANs are:

1. StyleGAN: It is a method used for creating near photorealistic images for multiple classes, such as human faces, cars, and landscapes.

2. BigGAN: This architecture is often used for class-conditional image generation on the ImageNet dataset.

3. TileGAN: It is a method used for the synthesis of large-scale non-homogeneous textures.

Please note that these examples are mentioned in the context provided. There may be other AI systems using GANs that are not mentioned here.


Sources:
./testfiles/pdf/garmentGAN.pdf
./testfiles/pdf/InsetGAN.pdf


#### Query 6: Asking for a summary of one of the .pdf files.

In [17]:
query6 = "Can you give me a brief summary of garmentGAN?"
llm_response = qa_chain(query6)
process_llm_response(llm_response)

GarmentGAN is a new algorithm that uses generative adversarial methods to perform image-based garment transfer. It allows users to virtually try on clothes before purchasing and can handle complex body poses, hand gestures, and occlusions. The algorithm requires two input images: a picture of the target fashion item and an image of the customer. The output is a synthetic image where the customer is wearing the target apparel. GarmentGAN improves on existing methods in terms of the realism of generated imagery and solves problems related to self-occlusions. It incorporates additional information during training, such as segmentation maps and body key-point information, to synthesize high-realism photographs. The algorithm comprises two separate GANs: a shape transfer network and an appearance transfer network. It also uses a geometric alignment module and a method of masking semantic segmentation maps to handle complex poses and preserve target clothing characteristics. GarmentGAN prese

#### Query 7: Querying data present in the .txt and .pptx files.

In [18]:
query7 = "How is a generation defined?"
llm_response = qa_chain(query7)
process_llm_response(llm_response)

A generation is defined as a group of individuals born and living contemporaneously who share common knowledge and experiences that affect their thoughts, attitudes, values, beliefs, and behaviors.


Sources:
./testfiles/powerpoint/Generations in the Workplace PPT (Final).pptx
./testfiles/text/genx_characteristics.txt


#### Query 8: Querying specific information from the .txt and .pptx files.

In [19]:
query8 = "When were Baby Boomers born?"
llm_response = qa_chain(query8)
process_llm_response(llm_response)

Baby Boomers were born between 1946 and 1968.


Sources:
./testfiles/powerpoint/Generations in the Workplace PPT (Final).pptx
./testfiles/text/genx_characteristics.txt


#### Query 9: Querying information that requires semantic understanding, present in the .pptx and .txt files.

In [20]:
query9 = "who are the groups that preceeded and succeeded Generation X"
llm_response = qa_chain(query9)
process_llm_response(llm_response)

The groups that preceded Generation X are baby boomers, born between 1943 and 1964. The group that succeeded Generation X is millennials, born between 1981 and 2000.


Sources:
./testfiles/powerpoint/Generations in the Workplace PPT (Final).pptx
./testfiles/text/genx_characteristics.txt


#### Query 10: Asking for a summary of separate pieces of information contained in the .docx file.

In [22]:
query10 = "Summarize the reviews of paper entitled 'The Best Search System' and tell me if the paper was accepted or not."
llm_response = qa_chain(query10)
process_llm_response(llm_response)

The reviews of the paper 'The Best Search System' are mixed. Review 1 and Review 3 both have positive comments about the paper, with Review 1 stating that they are a big fan of the paper and Review 3 stating that the paper is well written and provides a good overview of the system. However, Review 2 and Review 4 have more negative comments about the paper, with Review 2 mentioning several areas for improvement and Review 4 stating that there are serious missing parts and unsubstantiated claims. The metareview also mentions that the paper is appreciated but less relevant as a full-paper. Based on these reviews, it can be concluded that the paper was not accepted.


Sources:
./testfiles/word/anon_reviews.docx


### Queries that did not work well

#### Query 11: Answering this query correctly requires every row of the .csv file (each row is stored as a separate entity in the vector database), but the retriever returns only 5 documents, so the answer is wrong.

In [23]:
query11 = "What match had the highest attendance among all Premier League matches?"
llm_response = qa_chain(query11)
process_llm_response(llm_response)

The match with the highest attendance among all Premier League matches is the one between Manchester City and Everton, which took place on 12/31/2022 at the Etihad Stadium with an attendance of 53444.


Sources:
./testfiles/csv/premier_league_all_matches_2022-2023-season.csv
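
Aggregate questions like this one need a pass over all rows, which top-k retrieval cannot provide. One possible alternative, sketched here under assumptions, is to compute the answer directly from the tabular data. The column names and the three sample rows below are made up for illustration and are not taken from the actual .csv file.

```python
import csv
import io

# Made-up sample data; the real file's columns and values may differ
sample_csv = """date,home_team,away_team,attendance
2022-08-05,Team A,Team B,25000
2022-08-06,Team C,Team D,61000
2022-08-07,Team E,Team F,40000
"""

def highest_attendance(csv_file):
    """Scan every row and return the one with the largest attendance."""
    rows = csv.DictReader(csv_file)
    return max(rows, key=lambda row: int(row["attendance"]))

best = highest_attendance(io.StringIO(sample_csv))
print(best["home_team"], best["away_team"], best["attendance"])
```

Unlike the retrieval chain, this reads every row, so the result does not depend on which 5 rows happen to be most similar to the question.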


#### Query 12: Asking for specific information from .xlsx files.
##### The context and query together contain too many tokens for the OpenAI model to process, so the request fails with a context-length error.

In [24]:
query12 = "How many votes did Biden get in Alabama in 2020?"
llm_response = qa_chain(query12)
process_llm_response(llm_response)

InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 16202 tokens. Please reduce the length of the messages.
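
The error shows the prompt reaching 16202 tokens against a limit of 4097. One way to avoid this, sketched here under assumptions, is to cap the retrieved context at a token budget before building the prompt. The helper names are hypothetical, and the estimate of roughly 4 characters per token is a crude heuristic rather than the model's real tokenizer, so a safety margin is advisable.

```python
def estimate_tokens(text):
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)

def trim_to_budget(chunks, max_context_tokens):
    """Keep chunks in retrieval order until the estimated budget is used up."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_context_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept

# Three stand-in chunks of about 1000 estimated tokens each
chunks = ["a" * 4000, "b" * 4000, "c" * 4000]
kept = trim_to_budget(chunks, max_context_tokens=2500)
print(len(kept))
```

Smaller chunks at indexing time or a lower `k` in the retriever would reduce the context size as well; which of these fits best depends on how the .xlsx content was split.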