<a href="https://colab.research.google.com/github/rastringer/promptcraft/blob/main/enterprise_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Enterprise search

This notebook uses LangChain retrievers with [Enterprise Search](https://cloud.google.com/enterprise-search) on Google Cloud.

As of July 2023, the product is available to trusted testers. This notebook shows how to retrieve the documents most relevant to a query.

In this example, we add course PDFs from Stanford's CS224n class, which covers (rather aptly) NLP and LLMs. The dataset is available at `gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224`.


In [1]:
! pip install --upgrade google-cloud-aiplatform
! pip install google-cloud-discoveryengine
# Quote the version specifier so the shell does not treat "<" as an input redirect
! pip install "shapely<2.0.0"
! pip install langchain
! pip install typing-inspect==0.8.0 typing_extensions==4.5.0


In [2]:
# Automatically restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

If you're on Colab, authenticate by running the following cell:

In [1]:
from google.colab import auth
auth.authenticate_user()

Add your project ID and search engine ID. The search engine must first be set up in the Google Cloud console. Future versions of the SDK should provide this feature.

In [2]:
PROJECT_ID = "notebooks-370010"
SEARCH_ENGINE_ID = "stanford_1690299721972"

Optional parameters

`max_documents` - The maximum number of documents used to provide extractive segments or extractive answers

`get_extractive_answers` - By default, the retriever is configured to return extractive segments. Set this field to `True` to return extractive answers instead.

`max_extractive_answer_count` - The maximum number of extractive answers returned in each search result. At most 5 answers will be returned.

`max_extractive_segment_count` - The maximum number of extractive segments returned in each search result. Currently, only one segment is returned.

`filter` - A filter expression that lets you filter the search results based on the metadata associated with the documents in the searched data store.

`query_expansion_condition` - Specifies the conditions under which query expansion occurs. `0`: unspecified query expansion condition, in which case server behavior defaults to disabled. `1`: query expansion disabled; only the exact search query is used, even if `SearchResponse.total_size` is zero. `2`: automatic query expansion built by the Search API.
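
The limits described above can be checked before the retriever is constructed. The following is a minimal sketch, not part of LangChain: `check_retriever_options` is a hypothetical helper, and the lower bound of 1 for `max_extractive_answer_count` is an assumption (the text above only states the upper bound of 5).

```python
from typing import Any, Dict

# Values listed above for query_expansion_condition.
ALLOWED_QUERY_EXPANSION = {0, 1, 2}


def check_retriever_options(**options: Any) -> Dict[str, Any]:
    """Hypothetical helper: validate the optional parameters that were passed.

    Only the keys that are present are checked; nothing is filled in.
    """
    answers = options.get("max_extractive_answer_count")
    # Upper bound of 5 comes from the text above; lower bound of 1 is assumed.
    if answers is not None and not 1 <= answers <= 5:
        raise ValueError("max_extractive_answer_count must be between 1 and 5")

    segments = options.get("max_extractive_segment_count")
    # The text above says only one segment is currently returned.
    if segments is not None and segments != 1:
        raise ValueError("max_extractive_segment_count currently supports only 1")

    condition = options.get("query_expansion_condition")
    if condition is not None and condition not in ALLOWED_QUERY_EXPANSION:
        raise ValueError("query_expansion_condition must be 0, 1, or 2")

    return options


opts = check_retriever_options(
    max_documents=3,
    max_extractive_answer_count=3,
    get_extractive_answers=True,
)
```

The returned dictionary can then be unpacked into the constructor, for example `GoogleCloudEnterpriseSearchRetriever(project_id=PROJECT_ID, search_engine_id=SEARCH_ENGINE_ID, **opts)`.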


In [3]:
from langchain.retrievers import GoogleCloudEnterpriseSearchRetriever

In [4]:
retriever = GoogleCloudEnterpriseSearchRetriever(
    project_id=PROJECT_ID,
    search_engine_id=SEARCH_ENGINE_ID,
    max_documents=3,
)

query = "What are the goals of the course?"

result = retriever.get_relevant_documents(query)
for doc in result:
    print(doc)

page_content='[draft] note 1: introduction and word2vec cs 224n: natural language processing with\ndeep learning 3\n\navoid in this course. Partly, this is historical and methodological;\nthe raw signal processing methods and expertise are generally\ncovered in other courses (224s!) and other research communities,\nthough there has been some convergence of techniques of late.\nIn all aspects of NLP, most existing tools work for precious few (usu\nally one, maybe up to 100) of the world’s roughly 7000 languages,\nand fail disproportionately much on lesser-spoken and/or marginal\nized dialects, accents, and more. Beyond this, recent successes in\nbuilding better systems have far outstripped our ability to charac\nterize and audit these systems. Biases encoded in text, from race to\ngender to religion and more, are reflected and often amplified by\nNLP systems. With these challenges and considerations in mind, but\nwith the desire to do good science and build trustworthy systems\nthat imp
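
Extractive segments can be long and contain raw line breaks, so printing each document directly is hard to scan. The sketch below shows one way to print a short preview instead. `preview` is a hypothetical helper, and the `SimpleNamespace` object is only a stand-in for a returned document (anything with `page_content` and `metadata` attributes works); the `example.pdf` source is made up for illustration.

```python
import textwrap
from types import SimpleNamespace


def preview(doc, width=80):
    """Return '<source>: <text>' with whitespace collapsed and text shortened."""
    # textwrap.shorten collapses runs of whitespace (including newlines)
    # and truncates on a word boundary if the text exceeds `width`.
    text = textwrap.shorten(doc.page_content, width=width, placeholder=" ...")
    source = doc.metadata.get("source", "unknown source")
    return f"{source}: {text}"


# Stand-in for one retrieved document; the text is taken from the output above.
sample = SimpleNamespace(
    page_content=(
        "[draft] note 1: introduction and word2vec cs 224n: "
        "natural language processing with\ndeep learning 3"
    ),
    metadata={"source": "example.pdf"},  # hypothetical source
)

print(preview(sample, width=60))
```

With real results, the same helper could be applied in the loop: `for doc in result: print(preview(doc))`.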

In [5]:
retriever = GoogleCloudEnterpriseSearchRetriever(
    project_id=PROJECT_ID,
    search_engine_id=SEARCH_ENGINE_ID,
    max_documents=3,
    max_extractive_answer_count=3,
    get_extractive_answers=True,
)

In [6]:
query = "Does the course cover transformers?"

result = retriever.get_relevant_documents(query)
for doc in result:
    print(doc)

page_content='On faster GPUs, the pretraining can finish in around 30-40 minutes. This assignment is an investigation into Transformer self-attention building blocks, and the effects of pre training. It covers mathematical properties of Transformers and self-attention through written questions.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/a5.pdf:2', 'id': 'e45a23e879587067446c6f876341de6d'}
page_content='2. Pretrained Transformer models and knowledge access (35 points) You&#39;ll train a Transformer to perform a task that involves accessing knowledge about the world — knowledge which isn&#39;t provided via the task&#39;s training data (at least if you want to generalize outside the training set).' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/a5.pdf:2', 'id': 'e45a23e879587067446c6f876341de6d'}
page_content='CS 224N Assignment 5 Page 2 of 10 1.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/sea
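
The extractive answers above contain HTML entities such as `&#39;` in place of apostrophes. If the text is going to be shown to users or passed to an LLM, it may be worth decoding these first. A minimal sketch using the standard library (`clean_answer` is a hypothetical helper name):

```python
import html


def clean_answer(text: str) -> str:
    """Decode HTML entities (for example &#39;) in an extractive answer."""
    return html.unescape(text)


# Fragment taken from the output above.
raw = "knowledge which isn&#39;t provided via the task&#39;s training data"
print(clean_answer(raw))
# prints: knowledge which isn't provided via the task's training data
```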

In [7]:
query = "What is word2vec?"

result = retriever.get_relevant_documents(query)
for doc in result:
    print(doc)

page_content='However, many of the details of word2vec will hold true in methods that we&#39;ll proceed to further in the course, so we&#39;ll focus our time on that. 3.2 Word2vec model and objective The word2vec model represents each word in a fixed vocabulary as a low-dimensional (much smaller than vocabulary size) vector.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/cs224n_winter2023_lecture1_notes_draft.pdf:8', 'id': '2f84b4522da1ad7216b708405a2e7fd1'}
page_content='[draft] note 1: introduction and word2vec cs 224n: natural language processing with deep learning 4 language is intended to achieve—makes representing words an endlessly fascinating problem. Let&#39;s move to some methods. 2.2 Independent words, independent vectors What is a word?' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/cs224n_winter2023_lecture1_notes_draft.pdf:8', 'id': '2f84b4522da1ad7216b708405a2e7fd1'}
page_content='The word2vec mo
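
In the outputs above, several extractive answers share the same `id` and `source`, and each `source` ends in a `:N` suffix. The sketch below groups answers by document and splits that suffix off. Both helpers are hypothetical, the `SimpleNamespace` objects are stand-ins for returned documents, and reading the suffix as a page or chunk index is an assumption that the output alone does not confirm.

```python
from collections import defaultdict
from types import SimpleNamespace


def split_source(source):
    """Split 'gs://.../file.pdf:2' into ('gs://.../file.pdf', 2).

    Returns (source, None) when there is no trailing numeric suffix.
    """
    uri, sep, suffix = source.rpartition(":")
    if sep and suffix.isdigit():
        return uri, int(suffix)
    return source, None


def group_by_document(docs):
    """Group answer texts by metadata['id'], keeping the original order."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata["id"]].append(doc.page_content)
    return dict(grouped)


# Stand-ins built from ids and sources that appear in the outputs above;
# the page_content strings are shortened for readability.
docs = [
    SimpleNamespace(
        page_content="On faster GPUs, the pretraining can finish in around 30-40 minutes.",
        metadata={
            "source": "gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/a5.pdf:2",
            "id": "e45a23e879587067446c6f876341de6d",
        },
    ),
    SimpleNamespace(
        page_content="2. Pretrained Transformer models and knowledge access (35 points)",
        metadata={
            "source": "gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/a5.pdf:2",
            "id": "e45a23e879587067446c6f876341de6d",
        },
    ),
    SimpleNamespace(
        page_content="3.2 Word2vec model and objective",
        metadata={
            "source": "gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/cs224n_winter2023_lecture1_notes_draft.pdf:8",
            "id": "2f84b4522da1ad7216b708405a2e7fd1",
        },
    ),
]

for doc_id, answers in group_by_document(docs).items():
    print(doc_id, len(answers))
```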