### Processing PDFs with LLMs


In [8]:
#! pip install openai
#! pip install pinecone-client
#! pip install -U langchain
#! pip install watermark
#!pip install pypdf
#! pip install python-dotenv
#! pip install tiktoken
#!pip install -U sqlalchemy

In [9]:
#! pip install faiss-gpu

In [139]:
import os
import warnings

import langchain
import oci
from langchain.chains.question_answering import load_qa_chain
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_community.embeddings import OCIGenAIEmbeddings
from langchain_community.vectorstores import Pinecone
from oci.config import from_file
from oci.object_storage import ObjectStorageClient

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [140]:
# Setup basic variables
# Auth Config
CONFIG_PROFILE = "DEFAULT"
config = oci.config.from_file('config', CONFIG_PROFILE)


In [141]:
from langchain_community.llms import OCIGenAI
llm = OCIGenAI(
    model_id="cohere.command",
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
    compartment_id="ocid1.compartment.oc1..aaaaaaaaokklvuixgex5dl6phxmr3v4bxwnxkggtnfj6qtjbtfcve2uyimda",
    auth_type="API_KEY",
    auth_profile=CONFIG_PROFILE,# DataScienceExperience compartment
    model_kwargs={"temperature": 0.4, "top_p": 0.2, "max_tokens": 300},
)

In [144]:
prompt = """What is the OCI Language service?"""

response = llm.invoke(prompt)
print(response)


 The OCI Language Service is a set of features offered by the Oracle Cloud Infrastructure (OCI) that are designed to help developers build, deploy, and manage applications that require language processing capabilities. It provides a range of language-specific services and tools that can be used to enable intelligent language processing and analysis within an application. 

Here are some key features and capabilities of the OCI Language Service:
1. **Natural Language Processing (NLP)**: The service offers built-in NLP capabilities that enable you to extract information and perform tasks such as entity recognition, sentiment analysis, language detection, and text classification. These features can be used to analyze and gain insights from unstructured text data, enabling applications like customer sentiment analysis, document classification, and language-based analytics.
2. **Text Analytics**: The service integrates with Oracle Cloud Infrastructure's other services and platforms to provi

### Extracting Text from PDF Files

In [102]:
def dsa_read_pdf(directory_path):
    # Point the loader at the folder that holds the PDF files
    file_loader = PyPDFDirectoryLoader(directory_path)

    # Load every PDF in the folder (one Document per page)
    documents = file_loader.load()

    return documents

In [103]:
dsa_doc = dsa_read_pdf('documentos')

In [104]:
lista_textos = []
i=0

In [None]:
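for page in dsa_doc:
    # Iterating over a Document yields (field_name, value) pairs, so
    # documento[1] is a field value such as the page text or the metadata
    # dict (source file, page number). Each value is stored as its own string.
    for documento in page:
        print(str(documento[1]))
        lista_textos.append(str(documento[1]))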

### Loading the Environment Variables

In [106]:
from dotenv import load_dotenv

In [107]:
load_dotenv('openai.env')

True

### Creating the Embeddings Generator

In [108]:
embeddings = OCIGenAIEmbeddings(
    model_id="cohere.embed-english-light-v3.0",
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
    compartment_id="ocid1.compartment.oc1..aaaaaaaaokklvuixgex5dl6phxmr3v4bxwnxkggtnfj6qtjbtfcve2uyimda",
    auth_type="API_KEY",
    auth_profile=CONFIG_PROFILE,
)

In [109]:
embeddings

OCIGenAIEmbeddings(client=<oci.generative_ai_inference.generative_ai_inference_client.GenerativeAiInferenceClient object at 0x7f7193151970>, service_models=None, auth_type='API_KEY', auth_profile='DEFAULT', model_id='cohere.embed-english-light-v3.0', model_kwargs=None, service_endpoint='https://inference.generativeai.us-chicago-1.oci.oraclecloud.com', compartment_id='ocid1.compartment.oc1..aaaaaaaaokklvuixgex5dl6phxmr3v4bxwnxkggtnfj6qtjbtfcve2uyimda', truncate='END')

### Testing the Embedding Vectors
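
One simple way to test embedding vectors is to check that related texts score higher on cosine similarity than unrelated ones. The following is a minimal sketch of that check, assuming toy 3-dimensional vectors in place of real embeddings. The `cosine_similarity` helper and the three vectors are hypothetical and exist only for illustration; in this notebook the real vectors would come from `embeddings.embed_query(...)`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (assumption, not model output).
vec_moon = [0.9, 0.1, 0.0]
vec_lunar = [0.8, 0.2, 0.1]
vec_invoice = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(vec_moon, vec_lunar)
sim_far = cosine_similarity(vec_moon, vec_invoice)
print(f"moon vs lunar:   {sim_close:.3f}")
print(f"moon vs invoice: {sim_far:.3f}")
```

If the embedding model behaves as expected, the first score should be clearly higher than the second.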

### Defining the Vector Store

In [110]:
import faiss
import numpy as np

In [111]:
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

In [112]:
#lista_textos.append("The amount of escopo 1 is 45000tgee")

#lista_textos.append("The company is not ISO 14.001 CERTIFIED")
#lista_textos.append("Amount of GEE escopo 1 its 14000toneladas")

In [113]:
# Generate the vectors
vectorstore = FAISS.from_texts(
    lista_textos,
    embedding=embeddings,
)


In [114]:
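def busca_similaridade(vectorstore, query):
    # Embed the query, then search the FAISS store by vector
    input_embedding = embeddings.embed_query(query)
    # Returns a list of (Document, distance) pairs
    results = vectorstore.similarity_search_with_score_by_vector(input_embedding)
    distance = [score for _, score in results]
    docs = [doc for doc, _ in results]

    return distance, docs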
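
Conceptually, a flat vector index answers a query by comparing the query vector against every stored vector and returning the closest ones. The sketch below illustrates that idea, assuming squared Euclidean distance and toy data. The `squared_l2` and `top_k` helpers, the texts, and the vectors are hypothetical and are not part of FAISS or LangChain.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, stored, k=2):
    """Return the k (distance, text) pairs closest to query_vec.

    `stored` is a list of (text, vector) pairs.
    """
    scored = [(squared_l2(query_vec, vec), text) for text, vec in stored]
    return sorted(scored)[:k]


# Toy 2-dimensional "embeddings" (assumption, for illustration only).
stored = [
    ("first page text", [0.9, 0.1]),
    ("second page text", [0.6, 0.5]),
    ("third page text", [0.0, 1.0]),
]

hits = top_k([1.0, 0.0], stored, k=2)
for distance, text in hits:
    print(f"{distance:.2f}  {text}")
```

A smaller distance means a closer match, so results are sorted in ascending order.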

### Creating the Personal Assistant App with LangChain

In [145]:
from langchain_community.llms import OCIGenAI
llm = OCIGenAI(
    model_id="cohere.command",
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
    compartment_id="ocid1.compartment.oc1..aaaaaaaaokklvuixgex5dl6phxmr3v4bxwnxkggtnfj6qtjbtfcve2uyimda",
    auth_type="API_KEY",
    auth_profile=CONFIG_PROFILE,# DataScienceExperience compartment
    model_kwargs={"temperature": 0.2, "top_p": 0.2, "max_tokens": 300},
)

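# Note: this second assignment replaces the Cohere model defined above,
# so the chains below use Llama 2.
llm = OCIGenAI(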
    model_id="meta.llama-2-70b-chat",
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
    compartment_id="ocid1.compartment.oc1..aaaaaaaaokklvuixgex5dl6phxmr3v4bxwnxkggtnfj6qtjbtfcve2uyimda",
    auth_type="API_KEY",
    auth_profile=CONFIG_PROFILE,# DataScienceExperience compartment
    model_kwargs={"temperature": 0.3, "top_p": 0.7, "max_tokens": 300},
)

In [146]:
chain = load_qa_chain(llm,chain_type='refine')

In [147]:
def obter_resposta(query,vectorstore):
    
    distance,doc_search = busca_similaridade(vectorstore,query)
    
    response = chain.run(input_documents = doc_search,question=query)
    return response

In [155]:
# Use RetrievalQA chain for orchestration
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever())
result = ''
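
This notebook uses two chain types: `refine` in `load_qa_chain` and `stuff` in `RetrievalQA`. The sketch below gives a rough picture of how the two strategies differ, assuming a stand-in `fake_llm` function instead of a real model. The prompt wording and the helper names are hypothetical and do not reproduce LangChain's actual prompts.

```python
def fake_llm(prompt):
    """Stand-in for a real model: reports how many characters it received."""
    return f"<answer from {len(prompt)} chars>"


def stuff_answer(llm, docs, question):
    # "stuff": put every document into a single prompt, one model call.
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def refine_answer(llm, docs, question):
    # "refine": answer from the first document, then revise that answer
    # once per remaining document, so one model call per document.
    answer = llm(f"Context:\n{docs[0]}\n\nQuestion: {question}")
    for doc in docs[1:]:
        answer = llm(
            f"Existing answer: {answer}\nNew context:\n{doc}\n"
            f"Refine the answer to: {question}"
        )
    return answer


docs = ["first page text", "second page text", "third page text"]
print(stuff_answer(fake_llm, docs, "example question"))
print(refine_answer(fake_llm, docs, "example question"))
```

The trade-off is that `stuff` needs a single call but can overflow the model's context window when many documents are retrieved, while `refine` keeps each prompt small at the cost of one call per document.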

In [156]:
query = 'What is the moon?'
result = qa.run(query)
print(result)

 The Moon is a spherical rocky body, likely with a small metallic core, revolving around Earth in a slightly eccentric orbit at a mean distance of about 384,000 km (238,600 miles) distant from Earth. It is the fifth largest of the planets in the solar system, relative to its neighboring planets, and has a diameter of 3,475 km (or 2,159 miles). Like Earth, the Moon possesses a differentiated interior, with a less dense crust and a more dense mantle and core. Its mass distribution is not uniform, with mascons (or mass concentrations) resulting in a slightly lumpy gravity field. Due to its small size and mass, the moon does not have a significant atmosphere around it. 

It is hypothesized that around 4.51 billion years ago, the Moon formed during a giant impact event, when a Mars-sized body named Theia collided with the early Earth. This collision caused a large amount of debris to be ejected into space, which eventually coalesced to form the Moon. This is the most widely accepted theory 

In [157]:
query = 'What the article tell about LUNOX in the moon and which page?'
result = ''
result = qa.run(query)
print(result)

 According to the document you provided, LUNOX is a concept for a lunar oxygen mine that would extract oxygen from lunar soil to support human life on the moon. It is mentioned on pages 2, 4, and 6 of the document "crewed_cargo_lunar.pdf". 

I cannot speak to the validity or accuracy of this concept, as my purpose is solely to extract information from your sources and present it to you. 

Would you like me to extract any further information from this source? 


In [158]:
query = 'What is the importance of STN according the text and which page?'
result = ''
result = qa.run(query)
print(result)

 According to the text on page 0 of the document 'crewed_cargo_lunar.pdf', STN or "System-Theoretic Networking" is a system used to help design spacecraft and other complex systems. It is meant to aid in the optimization of resource allocation and management through the assessment of various factors and uncertainties. 

Is there anything specific you'd like to know regarding the importance of STN, or would you like me to cover any other details mentioned in the text? 


In [159]:
query = 'What is the crewed cargo transport?'
result = ''
result = qa.run(query)
print(result)

 According to the provided sources, the crewed cargo transport is a vehicle that is designed to transport cargo to the Moon with a crew aboard the vessel. This differentiates it from other cargo transports which do not have a crew. I am unable to provide any further specifics regarding the cargo transport, as the provided information is limited to this broad overview. 

Would you like me to expand on any of the information mentioned above? 


In [160]:
query = 'Conestoga-class missions carrying?'
result = ''
result = qa.run(query)
print(result)

 According to the provided sources, the Conestoga-class missions will be used to carry cargo to the Moon. 

Would you like to know more about these missions and their purpose? 
