In [None]:
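# Read OPENAI_API_KEY from my_key.txt (one "NAME: value" pair per line)
# so it can be loaded into the environment in the next cell
with open("my_key.txt", "r") as f:
    key_file = f.read()

key_file = key_file.split("\n")
keys = {}

for line in key_file:
    if not line.strip():
        continue  # skip blank lines, such as the one left by a trailing newline
    key, value = line.split(':', 1)  # split once so values containing ':' stay intact
    keys[key.strip()] = value.strip()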

In [None]:
import os

os.environ['OPENAI_API_KEY'] = keys['OPENAI_API_KEY']

# Questions & Answers over VectorStore
- I/O Spec
    - Input: 
        - Files Directory (currently supports .txt, .doc, .pdf)
        - Web Link
    - Output: None

- Chain Spec
    - Type: Index-Related Chains (Stuffing)
    - Components:
        - Document Loaders: 
            - Directory
            - Web Link
        - Text Splitters: RecursiveCharacterTextSplitter
        - VectorStores: Chroma
        - Embedding Model: OpenAI - text-embedding-ada-002
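
The "Stuffing" chain type listed above places every retrieved document into a single prompt. The cell below is a minimal sketch of that idea only. The function name `build_stuff_prompt` and the prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Concatenate (text, source) pairs into one prompt, followed by the question."""
    # Every document is "stuffed" into the same context block
    context = "\n\n".join(
        f"Content: {text}\nSource: {source}" for text, source in documents
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_stuff_prompt(
    "What is attention?",
    [
        ("Attention is the selective concentration on information.", "web"),
        ("The Transformer is based solely on attention mechanisms.", "paper.pdf"),
    ],
)
print(prompt)
```

Because all documents share one prompt, this approach only works while their combined length fits the model's context window.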

In [30]:
web_url = [
    "https://en.wikipedia.org/wiki/Attention",
]
directory_path = "Files_Directory"

## Google Cloud Storage File
- Set up the Google Drive API:
    - Follow the Google Drive API setup steps to obtain the credentials.json file.
- Set up a Python web server:
    - Choose a web framework such as Flask or Django to handle the server-side functionality.
    - Install the necessary dependencies for the chosen framework.
- Implement the authentication flow:
    - Provide a user interface for users to log in with their Google accounts and authorize your application to access their Google Drive.
    - Use the Google Sign-In API or OAuth 2.0 to handle the authentication process.
    - Retrieve the access token after successful authentication.
- Handle file upload:
    - Create a file upload endpoint in your server-side application.
    - Receive the file upload request from the client side, along with the access token obtained in the authentication step.
    - Use the access token to authenticate and authorize access to the user's Google Drive.
    - Use the Google Drive API to retrieve the file contents based on the provided file ID.
    - Save the file contents to your server or perform any required processing on the file.
- Implement error handling and security measures:
    - Handle any errors that occur during the authentication or file upload process.
    - Implement appropriate security measures to protect user data and prevent unauthorized access.
- Test and deploy your application:
    - Test your application thoroughly to ensure it functions as expected.
    - Deploy your application to a web server or cloud platform to make it accessible to users.

In [1]:
from langchain.document_loaders import GCSFileLoader
loader = GCSFileLoader(
    project_name = "aist",
    bucket = "langchain",
    blob = "sasa"
)

In [3]:
loader.load()

DefaultCredentialsError: Your default credentials were not found. To set up Application Default Credentials, see https://cloud.google.com/docs/authentication/external/set-up-adc for more information.
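
The error above means that no Application Default Credentials were found. One common way to supply them is the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, which points at a service-account key file. The sketch below checks only that one source (credentials created with the gcloud CLI would not be detected), and the helper name `find_adc_key_file` is hypothetical.

```python
import os
import tempfile


def find_adc_key_file(env=None):
    """Return the key-file path named by GOOGLE_APPLICATION_CREDENTIALS, or None."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    # Only report the path if it points at an existing file
    if path and os.path.isfile(path):
        return path
    return None


# Demonstration with a throwaway file standing in for a real key file
with tempfile.TemporaryDirectory() as tmp:
    fake_key = os.path.join(tmp, "key.json")
    with open(fake_key, "w") as f:
        f.write("{}")
    found = find_adc_key_file({"GOOGLE_APPLICATION_CREDENTIALS": fake_key})

missing = find_adc_key_file({})
print(found is not None, missing)
```

Running a check like this before `loader.load()` gives a clearer message than the traceback when the variable is simply unset.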

## Document Loaders


In [23]:
docs = []

### Load Files in a Directory

In [24]:
from langchain.document_loaders import DirectoryLoader

In [25]:
directory_loader = DirectoryLoader(
    path = directory_path,
    glob = "**/[!.]*",
)
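
The glob `**/[!.]*` is meant to pick up files at any depth while skipping names that start with a dot. The sketch below reproduces that matching with `pathlib` on a temporary directory, assuming the loader hands the pattern to pathlib-style globbing. The file names and the helper `visible_files` are made up for illustration.

```python
import tempfile
from pathlib import Path


def visible_files(root, pattern="**/[!.]*"):
    """List files under root whose names do not start with a dot."""
    base = Path(root)
    return sorted(
        p.relative_to(base).as_posix()
        for p in base.glob(pattern)
        if p.is_file()  # the pattern also matches directories, so filter them out
    )


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    for name in ("a.txt", ".hidden", "sub/b.pdf"):
        (base / name).write_text("x")
    matched = visible_files(tmp)

print(matched)
```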

In [26]:
docs_from_directory = directory_loader.load()

detectron2 is not installed. Cannot use the hi_res partitioning strategy. Falling back to partitioning with another strategy.
Falling back to partitioning with ocr_only.


In [27]:
print(docs_from_directory[1])

page_content='arXiv:2305.01625v1 [cs.CL] 2 May 2023\n\nUnlimiformer: Long-Range Transformers with Unlimited Length Input\n\nAmanda Bertsch and Uri Alon and Graham Neubig and Matthew R. Gormley Carnegie Mellon University, USA {abertsch,ualon,gneubig, mgormley}@cs.cmu.edu\n\nAbstract\n\nTransformer-based models typically have a predefined bound to their input length, because of their need to potentially attend to every to- ken in the input. In this work, we propose Unlimiformer: a general approach that can wrap any existing pretrained encoder-decoder transformer, and offload the attention compu- tation across all layers to a single k-nearest- neighbor index; this index can be kept on ei- ther the GPU or CPU memory and queried in sub-linear time. This way, we can index ex- tremely long input sequences, while every at- tention head in every decoder layer retrieves its top-k keys, instead of attending to every key. We demonstrate Unlimiformer’s efficacy on several long-document and multi-do

In [35]:
print(type(docs_from_directory[1]))

<class 'langchain.schema.Document'>


In [28]:
docs = docs + docs_from_directory

### Load Content from Web Link

In [29]:
from langchain.document_loaders import UnstructuredURLLoader

In [31]:
url_loader = UnstructuredURLLoader(
    urls = web_url,
    
)

In [32]:
url_data = url_loader.load()

In [36]:
print(url_data[0])

page_content='Toggle the table of contents\n\nToggle the table of contents\n\nAttention\n\n60 languages\n\nالعربية\n\nAsturianu\n\nAzərbaycanca\n\nবাংলা\n\nБеларуская\n\nБългарски\n\nBosanski\n\nCatalà\n\nČeština\n\nDansk\n\nDeutsch\n\nEesti\n\nEspañol\n\nEsperanto\n\nEuskara\n\nفارسی\n\nFrançais\n\nGaeilge\n\nGalego\n\n한국어\n\nՀայերեն\n\nहिन्दी\n\nHrvatski\n\nBahasa Indonesia\n\nItaliano\n\nעברית\n\nಕನ್ನಡ\n\nКыргызча\n\nLatina\n\nLatviešu\n\nLietuvių\n\nMagyar\n\nМакедонски\n\nNederlands\n\n日本語\n\nNorsk bokmål\n\nOʻzbekcha / ўзбекча\n\nپښتو\n\nPolski\n\nPortuguês\n\nRomână\n\nРусский\n\nShqip\n\nSimple English\n\nSlovenčina\n\nSoomaaliga\n\nСрпски / srpski\n\nSrpskohrvatski / српскохрватски\n\nSuomi\n\nSvenska\n\nதமிழ்\n\nతెలుగు\n\nไทย\n\nTürkçe\n\nУкраїнська\n\nTiếng Việt\n\n吴语\n\n粵語\n\nŽemaitėška\n\n中文\n\nEdit links\n\nArticle\n\nTalk\n\nEnglish\n\nRead\n\nEdit\n\nView history\n\nTools\n\nTools\n\nActions\n\nRead\n\nEdit\n\nView history\n\nGeneral\n\nWhat links here\n\nRelated change

In [34]:
print(type(url_data[0]))

<class 'langchain.schema.Document'>


In [37]:
docs = docs + url_data

## Text Splitter
- Output: text_chunks (a list of Document objects)

In [38]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [39]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 30,
    length_function = len
)
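
To make `chunk_size` and `chunk_overlap` concrete, the sketch below uses a plain fixed-window splitter. This is a deliberate simplification: RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences, which this sketch does not do. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=30):
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample = "".join(chr(97 + i % 26) for i in range(1000))
chunks = split_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk begins with the last 30 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is short enough.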

In [42]:
text_chunks = text_splitter.split_documents(docs)

In [50]:
print(type(text_chunks), type(text_chunks[0]), len(text_chunks))

<class 'list'> <class 'langchain.schema.Document'> 426


## VectorStore
- Output: search_space

In [53]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

In [52]:
openai_embeddings = OpenAIEmbeddings(
    model = "text-embedding-ada-002"
)

In [54]:
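# Note: this indexes the unsplit `docs`, which is why the searches below report
# fewer than 4 elements; pass `text_chunks` to index the 500-character chunks instead.
search_space = Chroma.from_documents(docs, openai_embeddings)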

Using embedded DuckDB without persistence: data will be transient


### Similarity Search Test

In [78]:
query = "What is Attention"

In [79]:
search_result = search_space.similarity_search(query)

Chroma collection langchain contains fewer than 4 elements.


In [80]:
len(search_result)

3
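
Only three results come back because the collection holds three items, while the search appears to ask for four by default. The sketch below shows top-k retrieval over toy two-dimensional vectors. Cosine similarity is used here as an assumption for illustration, the vectors are invented, and the metric Chroma actually applies may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return up to k document names, most similar first."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:min(k, len(doc_vecs))]  # never more results than stored items


toy_index = {
    "attention": [1.0, 0.1],
    "cooking": [0.0, 1.0],
    "transformer": [0.8, 0.3],
}
ranking = top_k([1.0, 0.0], toy_index)
print(ranking)
```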

In [84]:
len(search_result[1].page_content)

3041

In [85]:
print(search_result[1].page_content)

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transforme

## Question & Answer Chain

In [64]:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain import OpenAI

In [68]:
# The retriever is obtained by calling search_space.as_retriever()
QA_chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm = OpenAI(
        temperature = 0,
        max_tokens = 200
    ),
    chain_type = "stuff",
    retriever = search_space.as_retriever(),
    reduce_k_below_max_tokens = True,
)
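
Setting `reduce_k_below_max_tokens = True` asks the chain to drop retrieved documents until the stuffed prompt fits a token budget. The sketch below shows one way such trimming could work. It is not LangChain's implementation: the whitespace token counter, the limit value, and the name `reduce_below_limit` are all assumptions.

```python
def reduce_below_limit(docs, max_tokens_limit, count_tokens=None):
    """Drop documents from the end until the total token count fits the limit."""
    if count_tokens is None:
        # Crude stand-in for a real tokenizer: count whitespace-separated words
        count_tokens = lambda text: len(text.split())
    tokens = [count_tokens(d) for d in docs]
    num_docs = len(docs)
    total = sum(tokens)
    while num_docs > 0 and total > max_tokens_limit:
        num_docs -= 1
        total -= tokens[num_docs]  # remove the lowest-ranked remaining document
    return docs[:num_docs]


retrieved = ["a b c", "d e", "f g h i"]  # assumed to be ordered best match first
kept = reduce_below_limit(retrieved, max_tokens_limit=5)
print(kept)
```

With large unsplit documents in the index, trimming of this kind can remove most of the context, which is one possible reason for an "I don't know" answer.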

## Test

In [69]:
result = QA_chain(
    {
        "question": "What is attention?",
    },
    return_only_outputs = True
)

Chroma collection langchain contains fewer than 4 elements.


In [73]:
print(result['answer'])

 I don't know.
SOURCES:
