# MVP

Startup: Isearch

Done by: Sebastian Sarasti

This notebook is a guide to building an elastic search engine that searches across different documents.

## Data loading

In [1]:
from langchain.vectorstores import Chroma

Load all PDFs from the data folder with multithreaded loading enabled (pass `show_progress=True` to `DirectoryLoader` to also display a progress bar)

In [2]:
from langchain.document_loaders import PyPDFLoader, DirectoryLoader

In [3]:
loader = DirectoryLoader('../data/', '**/*.pdf', loader_cls=PyPDFLoader, use_multithreading=True)
docs = loader.load()

In [4]:
from langchain.document_loaders import Docx2txtLoader, UnstructuredWordDocumentLoader

In [7]:
word_loader = DirectoryLoader(loader_cls = UnstructuredWordDocumentLoader, loader_kwargs={'mode': 'elements'}, path='../data/', glob='**/*.docx', use_multithreading=True)
word_docs = word_loader.load()

In [8]:
word_docs

[Document(page_content='Actividad 1:  Software y ejemplo de caso de prueba con simple-ai en Python', metadata={'source': '..\\data\\carpeta3\\mia_rpa_#1_solucion.docx', 'filename': 'mia_rpa_#1_solucion.docx', 'file_directory': '..\\data\\carpeta3', 'last_modified': '2023-10-27T22:32:51', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category_depth': 0, 'languages': ['spa'], 'page_number': 1, 'category': 'UncategorizedText'}),
 Document(page_content='ESTUDIANTES:', metadata={'source': '..\\data\\carpeta3\\mia_rpa_#1_solucion.docx', 'filename': 'mia_rpa_#1_solucion.docx', 'file_directory': '..\\data\\carpeta3', 'last_modified': '2023-10-27T22:32:51', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category_depth': 0, 'languages': ['spa'], 'page_number': 1, 'emphasized_text_contents': ['ESTUDIANTES', 'ESTUDIANTES'], 'emphasized_text_tags': ['b', 'i'], 'category': 'UncategorizedText'}),
 ...]
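
Because the `elements` mode returns one `Document` per paragraph-level element, the raw list is hard to read. A minimal sketch of a summary helper follows; `count_by_source` and the sample objects are hypothetical, and the stand-ins only mimic the `page_content` and `metadata` attributes seen in the output above.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_source(documents):
    """Count how many Document-like objects came from each source file."""
    return Counter(doc.metadata.get("source", "<unknown>") for doc in documents)


# Stand-in objects with the same attributes as LangChain Documents (made-up data).
sample_docs = [
    SimpleNamespace(page_content="Title", metadata={"source": "a.docx"}),
    SimpleNamespace(page_content="Body", metadata={"source": "a.docx"}),
    SimpleNamespace(page_content="Other", metadata={"source": "b.docx"}),
]

counts = count_by_source(sample_docs)
print(counts.most_common())
```

In the notebook itself, `count_by_source(word_docs)` should give a per-file element count, assuming every element carries a `source` key as in the output shown.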

## Text splitting

Once the data has been loaded, it has to be split into chunks small enough to be useful to the LLM.

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [5]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 300,
    chunk_overlap  = 100,
    length_function = len,
    add_start_index = True
)

In [6]:
texts = text_splitter.split_documents(docs)
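
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch using a plain fixed-size sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines); the function name `sliding_window_chunks` is hypothetical and only illustrates how consecutive chunks share text.

```python
def sliding_window_chunks(text, chunk_size=300, chunk_overlap=100):
    """Split text into fixed-size windows; each window repeats the last
    chunk_overlap characters of the previous one. Returns (start, chunk) pairs."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# A 700-character sample text made of repeating digits.
sample = "".join(str(i % 10) for i in range(700))
chunks = sliding_window_chunks(sample)
for start, chunk in chunks:
    print(start, len(chunk))
```

The start offsets play the same role as the `start_index` metadata that `add_start_index = True` requests from the real splitter.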

## Text embeddings

In this part, the chunks are converted into embeddings.

In [7]:
from langchain.embeddings import SentenceTransformerEmbeddings

In [8]:
embedding = SentenceTransformerEmbeddings(model_name = 'intfloat/multilingual-e5-base')
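
An embedding maps a chunk of text to a vector, so that texts with similar meaning end up with vectors pointing in similar directions. The sketch below shows one common way to compare two such vectors, cosine similarity, in plain Python. The three short vectors are made-up toy values, not real model output, and which metric the vector store actually uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat it as no match
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (hypothetical values).
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 1.1]
far_vec = [-1.0, 1.0, 0.0]

print(cosine_similarity(query_vec, close_vec))
print(cosine_similarity(query_vec, far_vec))
```

With the real model, the vectors would come from `embedding.embed_query(...)` and `embedding.embed_documents(...)` instead of being written by hand.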


## Vector database

In [9]:
from langchain.vectorstores import Chroma

Build the vector database from the chunks and their embeddings

In [10]:
NAME_VECTOR_STORE = 'embeddings-mvp'
# vectorstore = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=NAME_VECTOR_STORE)

Persist the embeddings to disk

In [11]:
# vectorstore.persist()

Load the vector store from the persisted directory

In [12]:
vectorstore = Chroma(persist_directory=NAME_VECTOR_STORE, embedding_function=embedding)
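
Once the store is loaded it can be queried. The sketch below wraps the `similarity_search(query, k=...)` method that LangChain vector stores expose. The helper `search_documents`, the `FakeStore` stub, and the sample data are hypothetical and exist only so the example runs without the persisted database.

```python
from types import SimpleNamespace


def search_documents(vectorstore, query, k=3):
    """Return (source, snippet) pairs for the k chunks most similar to the query."""
    results = vectorstore.similarity_search(query, k=k)
    return [
        (doc.metadata.get("source", "<unknown>"), doc.page_content[:80])
        for doc in results
    ]


class FakeStore:
    """Stub with the same method name as the real store; returns the first k docs."""

    def __init__(self, docs):
        self._docs = docs

    def similarity_search(self, query, k=4):
        return self._docs[:k]


fake_docs = [
    SimpleNamespace(page_content="first chunk", metadata={"source": "a.pdf"}),
    SimpleNamespace(page_content="second chunk", metadata={"source": "b.pdf"}),
    SimpleNamespace(page_content="third chunk", metadata={"source": "c.pdf"}),
]

hits = search_documents(FakeStore(fake_docs), "example query", k=2)
for source, snippet in hits:
    print(source, "->", snippet)
```

In the notebook, passing the real `vectorstore` in place of `FakeStore(fake_docs)` should return the chunks closest to the query, with `source` taken from each chunk's metadata.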