<a href="https://colab.research.google.com/github/Redislabs-Solution-Architects/financial-vss/blob/main/SetupColabEnvironment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup Colab Environment & Data Prep
![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

This notebook sets up the [financial-vss](https://github.com/Redislabs-Solution-Architects/financial-vss) GitHub examples in a Google Colab runtime. It also uses [LangChain](https://python.langchain.com/docs/get_started/introduction) document loaders to prepare a dataset of PDFs for downstream indexing and semantic search tasks in Redis.

Clone the full repo (if running in Colab) to get the necessary files and data.

In [1]:
# If running this in Google Colab, clone the full repo to access the contents and dataset
!git clone https://github.com/Redislabs-Solution-Architects/financial-vss.git temp_repo
!mv temp_repo/* .
!rmdir temp_repo

Cloning into 'repo'...
remote: Enumerating objects: 40, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 40 (delta 17), reused 30 (delta 11), pack-reused 0
Receiving objects: 100% (40/40), 6.85 MiB | 1.65 MiB/s, done.
Resolving deltas: 100% (17/17), done.
LICENSE                      README.md
OpenAI_LangChain_Redis.ipynb resources
OpenAI_RedisVL.ipynb
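
The clone step is only needed in Colab. A minimal sketch of how a cell could detect that, assuming the `google.colab` module is already loaded in Colab kernels (a common idiom, not something this notebook does itself); `running_in_colab` is a hypothetical helper name:

```python
import sys


def running_in_colab() -> bool:
    """Return True when the google.colab module is loaded (assumed to hold in Colab kernels)."""
    return "google.colab" in sys.modules


print("Running in Colab:", running_in_colab())
```

A cell could use the result to skip the clone when running locally.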


## Install Python Dependencies


In [None]:
!pip install langchain sentence-transformers pdf2image "unstructured[all-docs]"

## Import Helpers

In [6]:
import os
import json

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredFileLoader

### Load text and split it into manageable chunks

Without this step, any large body of text would exceed the number of tokens you can feed to the LLM.

In [8]:
# Load list of pdfs
data_path = "notebooks/resources/"
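# Keep only PDFs so that other files in the folder (e.g. saved JSON output) are skipped
docs = [os.path.join(data_path, file) for file in os.listdir(data_path) if file.lower().endswith(".pdf")]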

print(docs)

['repo/resources/nke-10k-2023.pdf', 'repo/resources/amzn-10k-2023.pdf', 'repo/resources/jnj-10k-2023.pdf', 'repo/resources/aapl-10k-2023.pdf', 'repo/resources/nvd-10k-2023.pdf', 'repo/resources/msft-10k-2023.pdf']


In [27]:
# For simplicity, we will work with just one of the 10-K files. This will still take some time.
# Note: UnstructuredFileLoader is not the only document loader type that LangChain provides.
# Note: RecursiveCharacterTextSplitter is what we use to create smaller chunks of text from the doc.
# Docs: https://python.langchain.com/docs/integrations/document_loaders/unstructured_file
# Docs: https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100, add_start_index=True)
loader = UnstructuredFileLoader(docs[0], mode="single", strategy="fast")
chunks = loader.load_and_split(text_splitter)
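
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and newlines); `chunk_text` is a hypothetical helper that only illustrates how the two parameters interact and what a recorded start index looks like.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in toy_chunks:
    print(chunk)
```

Each window repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk.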

In [28]:
# This parser pipeline broke up our PDF into smaller chunks
len(chunks)

323

In [34]:
# Take a look at one item
print(chunks[2])

page_content="NIKE, Inc.(Exact name of Registrant as specified in its charter)Oregon93-0584541(State or other jurisdiction of incorporation)(IRS Employer Identification No.)One Bowerman Drive, Beaverton, Oregon 97005-6453(Address of principal executive offices and zip code)(503) 671-6453(Registrant's telephone number, including area code)SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:Class B Common StockNKENew York Stock Exchange(Title of each class)(Trading symbol)(Name of each exchange on which registered)SECURITIES REGISTERED PURSUANT TO SECTION 12(G) OF THE ACT:NONE\n\nAs of November 30, 2022, the aggregate market values of the Registrant's Common Stock held by non-affiliates were:Class A$7,831,564,572 Class B136,467,702,472 $144,299,267,044\n\nTable of ContentsUNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549FORM 10-K(Mark One)☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934FOR THE FISCAL YEAR ENDED MAY 31, 2023

### Initialize embeddings engine

We use the HuggingFace embeddings wrapper from LangChain.

In [37]:
# The model name must be passed as a keyword argument
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [40]:
# Embed each page_content from the document chunks
chunk_embeddings = embeddings.embed_documents([chunk.page_content for chunk in chunks])

In [43]:
# Check to make sure we've created enough embeddings, 1 per document chunk
len(chunk_embeddings) == len(chunks)

True
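
Each embedding is a list of floats, and semantic search works by comparing such vectors with a distance metric. The sketch below uses cosine similarity on made-up three-dimensional vectors (the real embeddings are much longer); the toy values and the `cosine_similarity` helper are illustrative assumptions, and the metric used in the later notebooks may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for chunk embeddings and a query embedding
toy_embeddings = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.0, 0.0]

scores = [cosine_similarity(query, emb) for emb in toy_embeddings]
best = max(range(len(scores)), key=scores.__getitem__)  # index of the closest chunk
print(scores, best)
```

A vector index in Redis does the same kind of comparison, but at scale and without scanning every vector in Python.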

## Write Data to Disk
Now that we have preprocessed our dataset and created semantic embeddings, we will save both to disk for reuse in the later notebooks.

In [52]:

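# Save the embedding vectors (one list of floats per chunk)
with open(os.path.join(data_path, "embeddings.json"), "w") as f:
    json.dump(chunk_embeddings, f)

# Save the chunk text and metadata in the same order as the embeddings
with open(os.path.join(data_path, "docs.json"), "w") as f:
    json.dump([chunk.__dict__ for chunk in chunks], f)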
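
A later notebook would need to read these two files back and pair them up by position. The following is a minimal round-trip sketch using toy data and a temporary directory, so it runs on its own. The toy records mimic the `page_content` and `metadata` fields of a LangChain document, but the values are made up.

```python
import json
import os
import tempfile

# Toy stand-ins for chunk_embeddings and the serialized chunks
toy_embeddings = [[0.1, 0.2], [0.3, 0.4]]
toy_docs = [
    {"page_content": "first chunk", "metadata": {"start_index": 0}},
    {"page_content": "second chunk", "metadata": {"start_index": 1400}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write both files the same way the cell above does
    with open(os.path.join(tmp_dir, "embeddings.json"), "w") as f:
        json.dump(toy_embeddings, f)
    with open(os.path.join(tmp_dir, "docs.json"), "w") as f:
        json.dump(toy_docs, f)

    # Read them back, as a downstream notebook would
    with open(os.path.join(tmp_dir, "embeddings.json")) as f:
        loaded_embeddings = json.load(f)
    with open(os.path.join(tmp_dir, "docs.json")) as f:
        loaded_docs = json.load(f)

# The two lists stay aligned by index: embedding i belongs to document i
paired = list(zip(loaded_docs, loaded_embeddings))
print(len(paired))
```

Keeping the two files in the same order is what lets the embeddings be matched to their text without storing an explicit ID.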

Now we are ready to try Vector Similarity Search in Redis with three notebook options:

- [Redis Python](notebooks/RedisPython_VSS.ipynb)
- [RedisVL](notebooks/RedisVL_VSS.ipynb)
- [LangChain](notebooks/LangChain_VSS.ipynb)