# Nomic Embeddings

Nomic has released a new embedding model with strong performance for long context retrieval (8k context window).

This cookbook walks through building and deploying (via LangServe) a RAG app that uses Nomic embeddings.

![Screenshot 2024-02-01 at 9.14.15 AM.png](attachment:4015a2e2-3400-4539-bd93-0d987ec5a44e.png)

## Signup

Run `nomic login` to get a link where you can retrieve your API token:
```
! nomic login
```

Then run the command again, passing the generated API token:
```
! nomic login <token>
```

In [None]:
!pip install nomic

Collecting nomic
  Downloading nomic-3.0.12.tar.gz (40 kB)
  Preparing metadata (setup.py) ... done
Collecting jsonlines (from nomic)
  Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)
Collecting loguru (from nomic)
  Downloading loguru-0.7.2-py3-none-any.whl (62 kB)
Building wheels for collected packages: nomic
  Building wheel for nomic (setup.py) ... done
  Created wheel for nomic: filename=nomic-3.0.12-py3-none-any.whl size=41557 sha256=5870bc241338c7202f873c8d48c098685e0e1551ee1acb3516761bef02e147c2
  Stored in directory: /root/.cache/pip/wheels/30/2c/0e/e559ff44f0f908cd996021832fb98ea169372e7d91a1b

In [None]:
! nomic login

Authenticate with the Nomic API
https://atlas.nomic.ai/cli-login
Click the above link to retrieve your access token and then run `nomic login [token]`


In [None]:
! nomic login <login_token>

In [None]:
! pip install -U langchain-nomic langchain_community tiktoken langchain-openai chromadb langchain

Collecting langchain-nomic
  Downloading langchain_nomic-0.0.2-py3-none-any.whl (3.4 kB)
Collecting langchain_community
  Downloading langchain_community-0.0.20-py3-none-any.whl (1.7 MB)
Collecting tiktoken
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
Collecting langchain-openai
  Downloading langchain_openai-0.0.6-py3-none-any.whl (29 kB)
Collecting chromadb
  Downloading chromadb-0.4.22-py3-none-any.whl (509 kB)
Collecting langchain
  Downloading langchain-0.1.7-py3-none-any.whl (815 kB)

In [None]:
# Optional: LangSmith API keys
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "api_key"
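
Since the LangSmith keys are optional, one way to avoid leaving a placeholder key in the environment is to set the variables only when a real key is present. The helper below is a minimal sketch of that idea; `configure_tracing` and its `env` parameter are hypothetical names introduced for illustration and are not part of LangChain or LangSmith.

```python
import os
from typing import MutableMapping, Optional


def configure_tracing(
    api_key: Optional[str],
    env: MutableMapping[str, str] = os.environ,
) -> bool:
    """Set the LangSmith variables only when an API key is supplied.

    Hypothetical helper: returns True if tracing was switched on.
    """
    if not api_key:
        # No key: make sure the tracing flag is not left set
        env.pop("LANGCHAIN_TRACING_V2", None)
        return False
    env["LANGCHAIN_TRACING_V2"] = "true"
    env["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
    env["LANGCHAIN_API_KEY"] = api_key
    return True


# Demonstrated on plain dicts so the real environment is left untouched
demo_env = {}
enabled = configure_tracing("api_key", env=demo_env)
disabled = configure_tracing(None, env={})
print(enabled, disabled)
```

In a notebook, you would call `configure_tracing(os.environ.get("LANGCHAIN_API_KEY"))` so the key comes from the shell rather than from the notebook source.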

## Document Loading

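The original cookbook tests three blog posts; that loader is kept, commented out, further down. Here we load a single PDF report instead and split it into chunks.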

In [None]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-4.0.1-py3-none-any.whl (283 kB)
Installing collected packages: pypdf
Successfully installed pypdf-4.0.1


In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF; add more loaders to this list to index additional files
loaders = [
    PyPDFLoader("Yearly Medical Examination Report for a Middle-aged Woman.pdf"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())  # PyPDFLoader returns one Document per page

# Split into chunks of at most 1000 characters with a 50-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
)
doc_splits = text_splitter.split_documents(docs)
len(doc_splits)

4
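
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); `chunk_text` is a hypothetical helper written only for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 2500-character input, used only to show the window arithmetic
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample, chunk_size=1000, chunk_overlap=50)
print([len(c) for c in chunks])  # [1000, 1000, 600]
```

The last 50 characters of each chunk repeat as the first 50 of the next, so a sentence cut at a boundary still appears whole in at least one chunk if it is shorter than the overlap.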

In [None]:
# from langchain_community.document_loaders import WebBaseLoader

# urls = [
#     "https://lilianweng.github.io/posts/2023-06-23-agent/",
#     "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
#     "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
# ]

# docs = [WebBaseLoader(url).load() for url in urls]
# docs_list = [item for sublist in docs for item in sublist]

## Splitting

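Because the model supports long-context retrieval (8k context window), chunks can be large. The commented-out cell below is the original token-based splitter for the blog posts; the cell after it counts the tokens in each chunk produced above.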

In [None]:
# from langchain.text_splitter import CharacterTextSplitter

# text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
#     chunk_size=1500, chunk_overlap=150
# )
# doc_splits = text_splitter.split_documents(docs_list)

In [None]:
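import tiktoken

# gpt-3.5-turbo resolves to the cl100k_base encoding, so one lookup is enough.
# The counts are approximate for the Nomic model, which has its own tokenizer.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
for d in doc_splits:
    print("The document is %s tokens" % len(encoding.encode(d.page_content)))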

The document is 387 tokens
The document is 305 tokens
The document is 345 tokens
The document is 29 tokens
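
The token counts above can be turned into an automatic check that every chunk fits in the embedding model's context window. The sketch below assumes that "8k" means 8192 tokens and uses a crude characters-per-token estimate as a stand-in tokenizer; `rough_token_count` and `find_oversized` are hypothetical helpers, and in the notebook you could pass `lambda t: len(encoding.encode(t))` as the counter instead.

```python
from typing import Callable, List, Sequence, Tuple

MAX_TOKENS = 8192  # assumption: "8k context window" taken to mean 8192 tokens


def rough_token_count(text: str) -> int:
    """Stand-in tokenizer: roughly 4 characters per token, rounded up."""
    return -(-len(text) // 4)


def find_oversized(
    texts: Sequence[str],
    count_tokens: Callable[[str], int] = rough_token_count,
    limit: int = MAX_TOKENS,
) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every text that exceeds the limit."""
    oversized = []
    for index, text in enumerate(texts):
        n_tokens = count_tokens(text)
        if n_tokens > limit:
            oversized.append((index, n_tokens))
    return oversized


# Synthetic inputs: one short text and one that is clearly too long
texts = ["a" * 400, "b" * 40000]
print(find_oversized(texts))  # [(1, 10000)]
```

With the four chunks above (387, 305, 345, and 29 tokens) the check would return an empty list, since all of them are far below the limit.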


## Index

The Nomic embeddings endpoint is documented [here](https://docs.nomic.ai/reference/endpoints/nomic-embed-text).

In [None]:
import os

from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_nomic import NomicEmbeddings

In [None]:
# Add to vectorDB
vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="rag-chroma",
    embedding=NomicEmbeddings(model="nomic-embed-text-v1"),
)
retriever = vectorstore.as_retriever()

In [None]:
retriever

VectorStoreRetriever(tags=['Chroma', 'NomicEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x7cb8579414e0>)

In [None]:
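question = "skin cancer"
docs = vectorstore.similarity_search(question, k=3)
print(len(docs))  # number of chunks retrieved

# Only meaningful if the store was created with persist_directory;
# with chromadb >= 0.4 data is written to that directory automatically.
vectorstore.persist()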
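
Conceptually, `similarity_search(question, k=3)` embeds the query and returns the `k` stored chunks whose embeddings are closest to it. The sketch below illustrates that idea with cosine similarity over toy 2-dimensional vectors. It is not Chroma's implementation (the distance metric Chroma uses depends on how the collection is configured), and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, score) for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings"; real ones have hundreds of dimensions
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
results = top_k([1.0, 0.1], toy_vectors, k=3)
print([index for index, _ in results])  # [0, 2, 1]
```

The query points almost along the first axis, so vector 0 ranks first, the diagonal vector 2 second, and vector 3, which points the opposite way, is excluded.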

In [None]:
!pip install gpt4all

Collecting gpt4all
  Downloading gpt4all-2.2.1.post1-py3-none-manylinux1_x86_64.whl (4.1 MB)
Installing collected packages: gpt4all
Successfully installed gpt4all-2.2.1.post1


In [None]:
from gpt4all import GPT4All
model = GPT4All("nous-hermes-llama2-13b.Q4_0.gguf")

100%|██████████| 7.37G/7.37G [04:22<00:00, 28.1MiB/s]


In [None]:
# Prompt template: the model must answer from the retrieved context only
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

# Concatenate the retrieved chunks (at most 3) into one context string
context = "\n".join(doc.page_content for doc in docs[:3])

question = "Summarize the context"

prompt = template.format(context=context, question=question)
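
The prompt assembly in this cell can also be wrapped in a small reusable function. The sketch below assumes only that each retrieved item exposes a `page_content` string; `Chunk` is a hypothetical stand-in for the retrieved documents, `build_prompt` is an illustrative helper, and the chunk texts are placeholders.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


TEMPLATE = """Answer the question based only on the following context:
{context}

Question: {question}
"""


def build_prompt(chunks: Sequence[Chunk], question: str, max_chunks: int = 3) -> str:
    """Join up to max_chunks chunk texts and fill them into the template."""
    # Slicing is safe even when fewer than max_chunks chunks were retrieved
    context = "\n".join(chunk.page_content for chunk in chunks[:max_chunks])
    return TEMPLATE.format(context=context, question=question)


# Placeholder chunk texts, used only to show the resulting layout
demo_prompt = build_prompt(
    [Chunk("first chunk"), Chunk("second chunk")],
    "Summarize the context",
)
print(demo_prompt)
```

Slicing with `chunks[:max_chunks]` avoids an `IndexError` when the retriever returns fewer chunks than expected, which an index-based loop over a fixed range would raise.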

In [None]:
output = model.generate(prompt, max_tokens=300)
print(output)

Answer: The provided text is a medical report for a woman in her 40s, detailing her health assessments and screenings over three years. She has been diagnosed with osteopenia and basal cell carcinoma (skin cancer), both of which are being managed with supplementation and regular monitoring. Her cholesterol levels have improved since the previous year, but she still needs to continue taking calcium and vitamin D supplements for her bone density.


Personal information