<a href="https://colab.research.google.com/github/rvraghvender/ChromaDB_vectorDatabase/blob/main/ChromaDB_Vector_Database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [40]:
# !pip -q install chromadb openai langchain tiktoken pypdf

In [43]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import os

In [None]:
from google.colab import userdata

In [41]:
!mkdir Thesis

In [42]:
!wget -q https://www.theses.fr/2023LIMO0010/document -O Thesis/thesis.pdf

## Extract the text from PDF

In [44]:
loader = PyPDFLoader('Thesis/thesis.pdf')

In [45]:
data = loader.load()

## Inspecting the loaded data

In [47]:
data[0]

Document(page_content='Doctoral Thesis\nUniversité de Limoges\nEcole Doctorale Sciences et Ingénierie (SI) ED 653\nIRCER - Institute of research for ceramics UMR CNRS 7315\nThesis to obtain the degree of\nDocteur de l’Université de Limoges\nIRCER, Axe 3 - organisation structurale multiéchelle des matériaux\nPresented and supported by\nRaghvender\nLe 22 February 2023\nAB-INITIO STUDY OF THE STRUCTURE OF TELLURIUM-OXIDE BASED GLASSES:\nA STEP FORWARD IN ESTABLISHING THE STRUCTURE -PROPERTIES RELATIONSHIPS\nThesis supervised by Assil BOUZID and Olivier MASSON\nJury :\nPrésident de Jury :\nDr. Philippe THOMAS\nDirecteur de Recherche CNRS – IRCER, Université de Limoges, Limoges (France)\nRapporteurs :\nDr. Guillaume FERLAT\nMaître de conférences (HDR) – Sorbonne Université (France)\nDr. Pierre BORDET\nDirecteur de Recherche CNRS – Université Grenoble Alpes (France)\nExaminateurs :\nDr. Oliver ALDERMAN\nISIS Neutron and Muon Source – Rutherford Appleton Laboratory (England)\nDr. Assil BOUZID

In [48]:
len(data)

262

## Splitting the extracted data into text chunks

In [49]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
text = text_splitter.split_documents(data)

In [51]:
text[0]

Document(page_content='Doctoral Thesis\nUniversité de Limoges\nEcole Doctorale Sciences et Ingénierie (SI) ED 653\nIRCER - Institute of research for ceramics UMR CNRS 7315\nThesis to obtain the degree of\nDocteur de l’Université de Limoges\nIRCER, Axe 3 - organisation structurale multiéchelle des matériaux\nPresented and supported by\nRaghvender\nLe 22 February 2023\nAB-INITIO STUDY OF THE STRUCTURE OF TELLURIUM-OXIDE BASED GLASSES:\nA STEP FORWARD IN ESTABLISHING THE STRUCTURE -PROPERTIES RELATIONSHIPS\nThesis supervised by Assil BOUZID and Olivier MASSON\nJury :\nPrésident de Jury :\nDr. Philippe THOMAS\nDirecteur de Recherche CNRS – IRCER, Université de Limoges, Limoges (France)\nRapporteurs :\nDr. Guillaume FERLAT\nMaître de conférences (HDR) – Sorbonne Université (France)\nDr. Pierre BORDET\nDirecteur de Recherche CNRS – Université Grenoble Alpes (France)\nExaminateurs :\nDr. Oliver ALDERMAN\nISIS Neutron and Muon Source – Rutherford Appleton Laboratory (England)\nDr. Assil BOUZID

In [52]:
text[1]

Document(page_content='Directeur de Recherche CNRS – Université Grenoble Alpes (France)\nExaminateurs :\nDr. Oliver ALDERMAN\nISIS Neutron and Muon Source – Rutherford Appleton Laboratory (England)\nDr. Assil BOUZID\nChargé de recherche CNRS – IRCER, Université de Limoges, Limoges (France)\nProf. Olivier MASSON\nProfesseur – Université de Limoges, Limoges (France)\nInvités :\nDr. Evgenii M. ROGINSKII\nAssociate Professeur – Ioffe Institute, St. Petersburg (Russia)', metadata={'source': 'Thesis/thesis.pdf', 'page': 0})

In [53]:
text[2]

Document(page_content='To my parents and grandparents,\nYour unwavering love and support have propelled me towards this significant milestone.\nWith profound gratitude and love, I dedicate this thesis to all of you.\nRaghvender | Doctoral thesis | IRCER\nUniversité de Limoges2', metadata={'source': 'Thesis/thesis.pdf', 'page': 1})

In [54]:
text[3]

Document(page_content='Remember to look up at the stars and not down at your feet. Try to make sense of what you see\nand wonder about what makes the universe exist. Be curious. And however difficult life may\nseem, there is always something you can do and succeed at. It matters that you don’t just give up.\nStephen Hawking\nRaghvender | Doctoral thesis | IRCER\nUniversité de Limoges3', metadata={'source': 'Thesis/thesis.pdf', 'page': 2})

In [55]:
len(text)

711
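The overlap between consecutive chunks is visible above: `text[1]` begins with lines that also close `text[0]`. The sketch below illustrates the idea behind `chunk_size` and `chunk_overlap` with a plain sliding window. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, which this sketch does not attempt, and the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring windows share `chunk_overlap` characters.
    Simplified illustration, not LangChain's splitting algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    # Toy input: 10 characters, windows of 4 with an overlap of 2.
    print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
    # ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap lowers the chance that a sentence relevant to a query is cut in half at a chunk boundary, at the cost of more chunks to embed and store.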

## Creating db object

In [59]:
persist_directory = 'thesis_db'

# Embedding model used both to index the chunks and to embed queries
embeddings = OpenAIEmbeddings(openai_api_key=userdata.get('OPENAI_API_KEY'))

In [60]:
vectordb = Chroma.from_documents(documents=text,
                               embedding=embeddings,
                               persist_directory=persist_directory)

In [61]:
# persist the db to disk
vectordb.persist()
vectordb = None

In [62]:
# Load the persisted database from disk and use it as a normal database
vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=embeddings
                  )

## Make a retriever

In [63]:
retriever = vectordb.as_retriever()

In [64]:
docs = retriever.get_relevant_documents('What adding Thallim in TeO2 glass causes?')

In [65]:
docs

[Document(page_content='Chapter 5 – Structure of TiO 2-Tl2O-TeO 2ternary glasses\nPreliminary studies have shown that the addition of thallium oxide preserves the amplitude of\nthe non-linear index, however we have seen that TlO 0.5modifier oxide has no positive effect on\nimproving the mechanical properties (see chapter [4]) and the thermal stability of the glass [9]. A\nseparate study also showed that adding TiO 2to TeO 2increases the material’s tolerance to heating\nand significantly increases its mechanical resistance [113], [127]–[129]. It was determined in\npractice that the O local environment around Ti and Te are comparable, and neither species can\nbe viewed as a modifier with respect to the others. Dietzel’s field strength criteria, which show\nthat Ti belongs to the group of intermediate and has a field strength that is quite similar to Te (see\ntable [1.2] in chapter [1]), might be used as further evidence to support this claim. Authors further', metadata={'page': 155, 'sou

In [66]:
retriever = vectordb.as_retriever(search_kwargs={'k':2})

In [67]:
retriever.search_type

'similarity'
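Conceptually, a `'similarity'` search with `k=2` embeds the query, scores every stored chunk against it, and returns the two best matches. The brute-force sketch below shows that idea with made-up two-dimensional vectors and cosine similarity. The vectors, chunk ids, and function names are all hypothetical, and the real store relies on its own index and distance metric rather than a full scan like this one.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the `k` vectors most similar to `query_vec`."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy "embeddings"; real ones have many more dimensions.
    toy_store = {
        "chunk_a": [1.0, 0.0],
        "chunk_b": [0.8, 0.6],
        "chunk_c": [0.0, 1.0],
    }
    print(top_k([1.0, 0.1], toy_store, k=2))
    # ['chunk_a', 'chunk_b']
```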

**So far we have only retrieved relevant context from the database; we have yet to receive a smart answer.**

# Use an LLM to get a smart answer

## Make a chain

In [68]:
# Create the chain that answers questions from the retrieved chunks

qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(api_key=userdata.get('OPENAI_API_KEY')),
                                       chain_type='stuff',
                                       retriever=retriever,
                                       return_source_documents=True)
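With `chain_type='stuff'`, all retrieved chunks are placed ("stuffed") into a single prompt together with the question, and the LLM is called once. The sketch below shows roughly what that assembly looks like. The template wording and the function name `build_stuff_prompt` are assumptions made for illustration; they are not the prompt LangChain actually ships.

```python
def build_stuff_prompt(question: str, contexts: list[str]) -> str:
    """Concatenate every retrieved chunk and the question into one prompt.

    Hypothetical template, for illustration only.
    """
    joined_context = "\n\n".join(contexts)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{joined_context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    # Placeholder chunks standing in for the k=2 retrieved documents.
    retrieved = ["First retrieved chunk.", "Second retrieved chunk."]
    print(build_stuff_prompt("What does the thesis study?", retrieved))
```

Because every chunk lands in the same prompt, this approach only works while `k` chunks of `chunk_size` characters fit in the model's context window, which is one reason to keep `k` small.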

In [69]:
query = 'What adding Thallim in TeO2 glass causes?'
llm_response = qa_chain(query)
llm_response

{'query': 'What adding Thallim in TeO2 glass causes?',
 'result': ' The addition of Thallium in TeO2 glass helps to maintain the non-linear optical properties while also potentially reducing the mechanical strength of the glass.',
 'source_documents': [Document(page_content='Chapter 5 – Structure of TiO 2-Tl2O-TeO 2ternary glasses\nPreliminary studies have shown that the addition of thallium oxide preserves the amplitude of\nthe non-linear index, however we have seen that TlO 0.5modifier oxide has no positive effect on\nimproving the mechanical properties (see chapter [4]) and the thermal stability of the glass [9]. A\nseparate study also showed that adding TiO 2to TeO 2increases the material’s tolerance to heating\nand significantly increases its mechanical resistance [113], [127]–[129]. It was determined in\npractice that the O local environment around Ti and Te are comparable, and neither species can\nbe viewed as a modifier with respect to the others. Dietzel’s field strength crite

In [70]:
def process_llm_response(llm_response):
    """Print the generated answer, then the source file of each retrieved chunk."""
    print(llm_response['result'])
    print('\n\nSources: ')
    for source in llm_response['source_documents']:
        print(source.metadata['source'])

In [71]:
process_llm_response(llm_response)

 The addition of Thallium in TeO2 glass helps to maintain the non-linear optical properties while also potentially reducing the mechanical strength of the glass.


Sources: 
Thesis/thesis.pdf
Thesis/thesis.pdf
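Since every chunk comes from the same PDF, the source list above repeats one file name. The chunk metadata shown earlier also carries a `page` key, so a variant can report pages and drop duplicates. The sketch below is one way to do that; `FakeDocument` is a hypothetical stand-in for the LangChain `Document` class so the example runs on its own, and `format_sources` is not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document (content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_sources(llm_response: dict) -> list[str]:
    """Return one label per distinct (source, page) pair, in retrieval order."""
    labels = []
    for doc in llm_response["source_documents"]:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page")
        # Report the page index exactly as stored in the metadata.
        label = source if page is None else f"{source} (page index {page})"
        if label not in labels:
            labels.append(label)
    return labels


if __name__ == "__main__":
    # Made-up response shaped like the chain output above.
    response = {
        "result": "placeholder answer",
        "source_documents": [
            FakeDocument("chunk 1", {"source": "Thesis/thesis.pdf", "page": 155}),
            FakeDocument("chunk 2", {"source": "Thesis/thesis.pdf", "page": 155}),
        ],
    }
    for label in format_sources(response):
        print(label)
```

The page index is printed as stored; in the outputs above the title page carries `'page': 0`, so the index appears to be zero-based.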


## Archiving and deleting the database

In [72]:
!zip -r thesis_db.zip ./thesis_db

  adding: thesis_db/ (stored 0%)
  adding: thesis_db/03031d20-abf0-420e-9ac1-fe8e63989f3c/ (stored 0%)
  adding: thesis_db/03031d20-abf0-420e-9ac1-fe8e63989f3c/data_level0.bin (deflated 100%)
  adding: thesis_db/03031d20-abf0-420e-9ac1-fe8e63989f3c/link_lists.bin (stored 0%)
  adding: thesis_db/03031d20-abf0-420e-9ac1-fe8e63989f3c/length.bin (deflated 14%)
  adding: thesis_db/03031d20-abf0-420e-9ac1-fe8e63989f3c/header.bin (deflated 61%)
  adding: thesis_db/chroma.sqlite3 (deflated 40%)
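The `!zip` and `!unzip` shell commands depend on the notebook environment. As a portable alternative, the same archive and restore steps can be sketched with the standard library's `shutil`. The helper names `archive_directory` and `restore_directory` are hypothetical and are not used elsewhere in this notebook.

```python
import shutil
from pathlib import Path


def archive_directory(directory: str) -> str:
    """Zip `directory` into `<directory>.zip` beside it and return the archive path."""
    path = Path(directory)
    if not path.is_dir():
        raise FileNotFoundError(f"{directory} is not a directory")
    # base_dir keeps the top-level folder name inside the archive,
    # matching what `zip -r thesis_db.zip ./thesis_db` produces.
    return shutil.make_archive(
        str(path), "zip", root_dir=str(path.parent), base_dir=path.name
    )


def restore_directory(archive: str, target_parent: str) -> None:
    """Unpack `archive` so its top-level folder reappears under `target_parent`."""
    shutil.unpack_archive(archive, extract_dir=target_parent)
```

Under these assumptions, `archive_directory('thesis_db')` would write `thesis_db.zip` in the working directory, and `restore_directory('thesis_db.zip', '.')` would bring the folder back after a cleanup.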


In [73]:
# Clean up: delete the collection, then remove the persisted directory

vectordb.delete_collection()
vectordb.persist()

!rm -rf thesis_db

In [None]:
!unzip thesis_db.zip