In [109]:
#!pip install yellowbrick

In [110]:
%run supportvectors-common.ipynb



<center><img src="https://d4x5p7s4.rocketcdn.me/wp-content/uploads/2016/03/logo-poster-smaller.png"/> </center>
<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Only participants in SupportVectors workshops are currently allowed to study the notebooks for educational purposes; copying or using them for any other purpose without written permission is prohibited.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. So we request you to not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



# Semantic search

Traditionally, one would search through a corpus of documents using a keywords-based search engine like Lucene, Solr, ElasticSearch, etc. While the technology has matured, the basic underlying approach behind keyword search engines is to maintain an *inverted-index* mapping keywords to a list of documents that contain them, with associated relevances.

In general, the keyword-based search approach has been quite successful over the years, and has matured with added features and linguistic capabilities.

However, this approach has its limitations. The principal cause is that, when we enter keywords, we tend to describe the intent of what we are looking for rather than the exact words a document contains. For example, if we enter "breakfast places", we implicitly also mean restaurants, cafes, etc. that serve items appropriate for breakfast. There may be a restaurant described as a shop for espresso or crepes that a keyword search will likely miss, since its keywords do not match the query terms. And yet, we would hope to see it near the top of the search results.
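To make the limitation concrete, here is a minimal sketch of an inverted index in plain Python. The toy corpus, the function names `build_inverted_index` and `keyword_search`, and the whitespace tokenization are all illustrative assumptions; real engines such as Lucene add stemming, relevance scoring, and much more.

```python
from collections import defaultdict

# Toy corpus; the texts are illustrative, not taken from the notebook's data.
docs = {
    0: "best breakfast places downtown",
    1: "a cozy shop for espresso and crepes",
    2: "late night pizza places",
}

def build_inverted_index(docs):
    """Map each lowercase keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def keyword_search(index, query):
    """Return the ids of documents that contain at least one query keyword."""
    hits = set()
    for word in query.lower().split():
        hits |= index.get(word, set())
    return sorted(hits)

index = build_inverted_index(docs)
results = keyword_search(index, "breakfast places")
print(results)
```

The query retrieves documents 0 and 2 through literal word matches, while document 1 (the espresso and crepe shop) is never retrieved, because it shares no keyword with the query.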

Semantic search is an NLP approach that relies largely on deep neural networks, and in particular on transformers, to infer more closely the human intent behind the search terms, the relationships between the words, and the underlying context. It accepts entire sentences -- and even paragraphs -- describing the searcher's intent, and retrieves results that are more relevant or aligned to it.

## How would we do this NLP task with AI?

Let us represent the functional behavior we expect: 


![](images/semantic-search-functionality.png)


### Magic happens: breaking it down into steps

We recall that machine-learning algorithms work with vector ($\mathbf{X}$) representations of data.

So the first order of business would be to map each of the document texts $D_i$ to its corresponding vector $X_i$ in an appropriate $d$-dimensional space, $\mathbb{R}^d$, i.e.

\begin{equation}
D_i \longrightarrow X_i \in \mathbb{R}^d
\end{equation}

These resulting vectors are called **sentence embeddings**. Once these embeddings are computed for each of the documents, we can store the collection of tuples $[\langle D_1, X_1 \rangle, \langle D_2, X_2 \rangle, \ldots, \langle D_n, X_n \rangle]$. Here each tuple corresponds to a document and its sentence embedding.

This collection of tuples, therefore, becomes our **search index**.
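As a rough illustration of this data structure, the sketch below builds such a collection of tuples. The function `toy_embed` is a hypothetical stand-in for a real sentence-embedding model: it hashes words into a small vector and captures no semantics at all. The dimension `D = 16` and the sample documents are likewise assumptions chosen for brevity.

```python
import zlib
import numpy as np

D = 16  # toy embedding dimension (an assumption; real models use hundreds)

def toy_embed(text, d=D):
    """Stand-in for a sentence-embedding model: hashed bag-of-words in R^d."""
    x = np.zeros(d)
    for word in text.lower().split():
        # crc32 gives a hash that is stable across Python processes
        x[zlib.crc32(word.encode("utf-8")) % d] += 1.0
    return x

documents = ["A smiling dog", "House with a chimney", "Car on a highway"]

# The search index: one (document, embedding) tuple per document.
search_index = [(doc, toy_embed(doc)) for doc in documents]

for doc, x in search_index:
    print(doc, x.shape)
```

With a real model, only `toy_embed` would change; the shape of the index, a list of document and vector pairs, stays the same.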

### Search

Now, when the user describes what she is looking for, we consider the entire text as a "sentence".
<p>
<div class="alert-box alert-warning" style="padding-top:30px">
   
<b >Caveat Emptor</b>

> Note that we have a rather relaxed definition of a *sentence* in NLP: it diverges somewhat from the grammatical definition of a sentence. For example, in the English language, we would consider a sentence to be terminated with a punctuation mark, such as a period, question mark or exclamation mark. However, in NLP, we loosely consider the entire text -- whether it is just a word, or a few keywords, or an English sentence, or a few sentences together -- as one **sentence** for the purposes of a natural language processing task.
    
<p>
</div>
    
Therefore, it is common to consider an entire document text as a *sentence* if the text is relatively short. Alternatively, a longer text is partitioned into smaller chunks (of, say, 512 tokens each), and each such chunk is considered an NLP *sentence*.
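The partitioning step can be sketched as follows. This is a simplification under a stated assumption: whitespace-separated words stand in for tokens, whereas a real pipeline would count tokens with the model's own tokenizer. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(text, max_tokens=512):
    """Split text into consecutive chunks of at most max_tokens whitespace tokens."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

# A synthetic "document" of 1200 tokens
long_text = " ".join(f"w{i}" for i in range(1200))
token_chunks = chunk_tokens(long_text, max_tokens=512)
print([len(chunk.split()) for chunk in token_chunks])  # [512, 512, 176]
```

Each resulting chunk would then be embedded separately and stored as its own entry in the search index.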

Since we consider the entire query text as a sentence, we can map it to its **sentence embedding vector**, ${Q}$.

#### Vector Similarity
Once we have this, we simply need to compare the query vector ${Q}$ with each of the document vectors $X_i$, and sort the document vectors in descending order of similarity.

The rest is straightforward: pick the top-$k$ vectors in the sorted list. Then, for each vector, look up its corresponding document, and return the resulting list as the sorted search results of relevant documents.

We expect that these documents will exhibit high semantic similarity with the search query, assuming that the search index did contain such documents.
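The compare, sort, and pick-top-k steps above can be sketched with NumPy. The function name `top_k_search` and the two-dimensional toy vectors are assumptions made for illustration, and cosine similarity is used here as one possible choice of similarity measure.

```python
import numpy as np

def top_k_search(query_vec, doc_vecs, documents, k=3):
    """Rank documents by cosine similarity to the query and return the top k."""
    q = query_vec / np.linalg.norm(query_vec)
    X = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = X @ q                     # one similarity score per document
    order = np.argsort(-scores)[:k]    # indices in descending score order
    return [(documents[i], float(scores[i])) for i in order]

documents = ["doc A", "doc B", "doc C", "doc D"]
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query_vec = np.array([2.0, 0.1])

results = top_k_search(query_vec, doc_vecs, documents, k=2)
for doc, score in results:
    print(doc, round(score, 3))
```

Here "doc A" ranks first because it points almost exactly in the direction of the query, followed by "doc C". This brute-force scan compares the query against every document, which is fine for small corpora.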

<figure>
    <img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png">
    <caption> Semantic similarity as vector proximity in the embedding space. <br>
    (Figure source: Sbert.net documentation).
    </caption>
</figure>


#### Similarity measures

The sentence embedding vectors typically live in a high-dimensional space (several hundred dimensions; 768 for the model used below). In such high-dimensional spaces, the notion of Euclidean distance is less effective. Therefore, it is far more common to use one of the two measures below for vector similarity:

* **dot-product**, the (inner) dot-product between the embedding vectors.

\begin{equation}
\text{dot-similarity} = \langle X_i, X_j \rangle
\end{equation}

* **cosine-similarity**, where $\cos \left(\theta_{ij}\right)$ gives the degree of directional alignment between the vectors, but ignores their magnitudes. Here, $\theta_{ij}$ is the angle between the (embedding) vectors $X_i$ and $X_j$.

\begin{equation} 
\text{cosine-similarity} = \frac{\langle X_i, X_j \rangle} {\| X_i \| \| X_j \|}
\end{equation}
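A small numerical sketch of the two measures, with made-up vectors, shows the difference: scaling a vector changes its dot-product similarity but leaves its cosine similarity untouched. The function names are illustrative.

```python
import numpy as np

def dot_similarity(x, y):
    """Inner product <x, y>; sensitive to vector magnitudes."""
    return float(np.dot(x, y))

def cosine_similarity(x, y):
    """<x, y> / (||x|| ||y||); depends only on the angle between x and y."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 2.0, 2.0])
y = 3.0 * x   # same direction as x, three times the magnitude

print(dot_similarity(x, x), dot_similarity(x, y))        # 9.0 27.0
print(cosine_similarity(x, x), cosine_similarity(x, y))  # both approximately 1
```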

<div class="alert-box alert-info" style="padding-top:30px">
   
**Important**
    
>  Sentence transformer models trained with cosine-similarity tend to favor the shorter document texts in the search results, whereas the models trained on the dot-product similarity tend to favor longer texts.
</div>

### Symmetric vs asymmetric search

One of the technical aspects to be careful of is the relative textual length of the query sentence compared to the actual documents. Different sentence-transformer models have been trained specifically for each of these use-cases. 

* **symmetric search**, when we expect the query sentence to be approximately the same length as the document sentences.

* **asymmetric search**, when we expect the document texts to be significantly longer than the query sentence.



#### Load an appropriate model

Let us consider the use-case where we are searching through some reasonably large documents. In such a case, it would be appropriate to use an asymmetric-search model. 

Let us consider an asymmetric model trained with *cosine-similarity* as the similarity measure. In particular, let us use the model named in the code below:

* `msmarco-distilbert-base-v4`


We load the model with the following code:

In [111]:
from sentence_transformers import SentenceTransformer

MODEL = 'msmarco-distilbert-base-v4'
embedder = SentenceTransformer(MODEL)

#### Load a toy corpus

Let us now load a toy corpus of some simple, long texts.

In [112]:
%run NLP-Lesson-01___search-corpus.ipynb

#### Search index of sentence embeddings

Let us now create the search index of sentence embeddings.

In [113]:
embeddings = embedder.encode(sentences, convert_to_tensor=True)

Note that we chose to get the embeddings as `pytorch` tensors -- this will help us later in doing high-performance searches over the GPU/TPU hardware. What do these embeddings look like? 

In [114]:
embeddings.shape

torch.Size([20, 768])

Clearly, there are 20 embeddings, each a 768-dimensional vector. Let us glance at a sentence, and its embedding:

In [115]:
print (f'{sentences[0]}  {embeddings[0]}')


’Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe.

“Beware the Jabberwock, my son!
      The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
      The frumious Bandersnatch!”

He took his vorpal sword in hand;
      Long time the manxome foe he sought—
So rested he by the Tumtum tree
      And stood awhile in thought.

And, as in uffish thought he stood,
      The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
      And burbled as it came!

One, two! One, two! And through and through
      The vorpal blade went snicker-snack!
He left it dead, and with its head
      He went galumphing back.

“And hast thou slain the Jabberwock?
      Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!”
      He chortled in his joy.

’Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mo

#### Now, search for something!

Let us find the closest matches to the query: "a friendship with animals"

In [116]:
query_text = "a friendship with animals"
query = embedder.encode(query_text, convert_to_tensor=True)

In [117]:
from sentence_transformers import util
search_results = util.semantic_search(query, embeddings, top_k = 3)
search_results

[[{'corpus_id': 7, 'score': 0.2550494074821472},
  {'corpus_id': 8, 'score': 0.23244602978229523},
  {'corpus_id': 5, 'score': 0.2188762128353119}]]

In [118]:
for index, result in enumerate(search_results[0]):
    print('-'*80)
    print(f'Search Rank: {index}, Relevance score: {result["score"]} ')
    print(sentences[result['corpus_id']])
    

--------------------------------------------------------------------------------
Search Rank: 0, Relevance score: 0.2550494074821472 

Golden retrievers are not bred to be guard dogs, and considering the size of their hearts and their irrepressible joy and life, they are less likely to bite than to bark, less likely to bark than to lick a hand in greeting. In spite of their size, they think they are lap dogs, and in spite of being dogs, they think they’re also human, and nearly every human they meet is judged to have the potential to be a boon companion who might at any moment, cry, “Let’s go!” and lead them on a great adventure.

--------------------------------------------------------------------------------
Search Rank: 1, Relevance score: 0.23244602978229523 

If you’re lucky, a golden retriever will come into your life, steal your heart, and change everything

--------------------------------------------------------------------------------
Search Rank: 2, Relevance score: 0.218876

## Visual Search

Let us now search by giving it an image of what we are looking for. How can we do this?

We need to translate an image into an embedding vector in a semantically relevant manner. If we can do that, we can then search through our document embeddings in essentially the same manner as we did for textual search.

We start by getting an image from the web:

<figure style="padding:30px">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/At_Sticker_Lumis_Golden.JPG/1920px-At_Sticker_Lumis_Golden.JPG">
    <caption> A golden retriever image from wikipedia </caption>
</figure>



In [119]:
from PIL import Image
# Create the model
CLIP_MODEL = 'clip-ViT-B-32'
embedder = SentenceTransformer(CLIP_MODEL)

# We need to take smaller text, for CLIP to work. (current limitation)
short_sentences = [
    'A smiling dog', 'House with a chimney', 'Car on a highway',
    'Elephant in the field', 'Attention is all you need',
    'Golden retrievers are little children', 'A cat on the window sill',
    'For the love of these furry dogs'
]
# Create the search index of embeddings
embeddings = embedder.encode(short_sentences, convert_to_tensor=True)
embeddings.shape

torch.Size([8, 512])

#### Image as query

In [120]:
query = embedder.encode(Image.open('images/dog.jpeg'))

In [121]:
from sentence_transformers import util
search_results = util.semantic_search(query, embeddings, top_k = 3)
search_results

[[{'corpus_id': 5, 'score': 0.26912155747413635},
  {'corpus_id': 7, 'score': 0.25064516067504883},
  {'corpus_id': 0, 'score': 0.24110281467437744}]]

In [122]:
for index, result in enumerate(search_results[0]):
    print('-'*80)
    print(f'Search Rank: {index}, Relevance score: {result["score"]} ')
    print(short_sentences[result['corpus_id']])
    

--------------------------------------------------------------------------------
Search Rank: 0, Relevance score: 0.26912155747413635 
Golden retrievers are little children
--------------------------------------------------------------------------------
Search Rank: 1, Relevance score: 0.25064516067504883 
For the love of these furry dogs
--------------------------------------------------------------------------------
Search Rank: 2, Relevance score: 0.24110281467437744 
A smiling dog


### Text as query

We can also do the converse: give a text query, and retrieve all images that match. 

<div class="alert-box alert-info" style="padding-top:30px">
   
**Attribution**
    
>  The following example is derived from, and inspired by, the sample jupyter notebook posted at:  https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/image-search/Image_Search.ipynb
</div>

Let us first download a collection of photos from Unsplash, a website of freely available photos.

In [123]:
# Taken almost verbatim from the above-mentioned resource.
from PIL import Image
import glob
import torch
import pickle
import zipfile
from IPython.display import display
from IPython.display import Image as IPImage
import os
from tqdm.autonotebook import tqdm
torch.set_num_threads(4)

img_folder = 'photos/'
if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
    os.makedirs(img_folder, exist_ok=True)
    
    photo_filename = 'unsplash-25k-photos.zip'
    if not os.path.exists(photo_filename):   #Download dataset if does not exist
        util.http_get('http://sbert.net/datasets/'+photo_filename, photo_filename)
        
    #Extract all images
    with zipfile.ZipFile(photo_filename, 'r') as zf:
        for member in tqdm(zf.infolist(), desc='Extracting'):
            zf.extract(member, img_folder)
        

### Create the search index of these images

Let us now create a search index of these images as a collection of sentence embeddings. Since it is rather computationally expensive and time-consuming to create these embeddings, let us also store it for repeated use. 

Of course, this implies that the next time you run this cell, it will retrieve the pre-computed embeddings, rather than recreate them.
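The general compute-once, reuse-later pattern can be sketched as below. This is only an illustration under assumptions: `load_or_compute` and `expensive_embeddings` are hypothetical names, the embeddings are dummy lists, and the cache lives in a temporary directory. Note also that `pickle` files should only be loaded from sources you trust.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Return the cached result if cache_path exists; otherwise compute and store it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f_in:
            return pickle.load(f_in)
    result = compute_fn()
    with open(cache_path, "wb") as f_out:
        pickle.dump(result, f_out)
    return result

calls = []

def expensive_embeddings():
    """Hypothetical stand-in for an expensive model.encode(...) call."""
    calls.append(1)
    return ["a.jpg", "b.jpg"], [[0.1, 0.2], [0.3, 0.4]]

cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.pkl")
img_names, img_emb = load_or_compute(cache_path, expensive_embeddings)  # computes
img_names, img_emb = load_or_compute(cache_path, expensive_embeddings)  # loads from disk
print(len(calls))  # 1
```

The second call never invokes `expensive_embeddings`, which is the behavior we want when the embeddings take a long time to create.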

In [124]:
# Once again, this code is taken from the sample notebook mentioned above.

model = SentenceTransformer('clip-ViT-B-32')
use_precomputed_embeddings = True

if use_precomputed_embeddings:
    emb_filename = 'unsplash-25k-photos-embeddings.pkl'
    if not os.path.exists(emb_filename):  #Download dataset if does not exist
        util.http_get('http://sbert.net/datasets/' + emb_filename,
                      emb_filename)

    with open(emb_filename, 'rb') as fIn:
        img_names, img_emb = pickle.load(fIn)
    print("Images:", len(img_names))
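else:
    # Keep bare file names so that they match the pre-computed index and
    # can be joined with img_folder when displaying search results.
    img_names = [os.path.basename(path)
                 for path in glob.glob(os.path.join(img_folder, '*.jpg'))]
    print("Images:", len(img_names))
    img_emb = model.encode([Image.open(os.path.join(img_folder, name))
                            for name in img_names],
                           batch_size=128,
                           convert_to_tensor=True,
                           show_progress_bar=True)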

Images: 24996


### Image search with prompts

Now we can search in a manner exactly the same as before.

In [125]:
!pip install ipyplot
import ipyplot

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [126]:
# Let us define a search function
def search(query: str, is_image: bool = False, k: int = 8):
    # `query` is a text prompt, or a path to an image file when is_image=True
    raw = Image.open(query) if is_image else query
    query_emb = model.encode([raw], convert_to_tensor=True, show_progress_bar=False)
    
    # Then, we use the util.semantic_search function, which computes the cosine-similarity
    # between the query embedding and all image embeddings.
    # It then returns the top_k highest ranked images, which we output
    hits = util.semantic_search(query_emb, img_emb, top_k=k)[0]
    
    print("Query:")
    display(raw) if is_image else display(query)
    images = [os.path.join(img_folder, img_names[hit['corpus_id']]) for hit in hits]
    ipyplot.plot_images(images, max_images=30, img_width=150, show_url=False)
    

In [127]:
search('To love a dog')

Query:


'To love a dog'

In [128]:
search('House near a lake')

Query:


'House near a lake'

In [129]:
search('jumping dog')

Query:


'jumping dog'

In [130]:
search('Pathways through the forest')

Query:


'Pathways through the forest'

## Homework

Perform a search, where the query is an image, and the results are images.

In [131]:
# dog = 'images/dog.jpeg'
# search(dog, k=10, is_image=True)

## References

Further reading resources associated with the two topics: sentence-transformers and approximate nearest neighbor searches.
<p>

* <a href="https://sbert.net/">Sentence Transformers</a> 

* The original paper that introduced sentence-transformers: <a href="https://arxiv.org/abs/1908.10084">Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks</a>

* Faiss blog at facebook <a href="https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/"> Faiss: A library for efficient similarity search </a>

* A gentler introduction to Faiss: <a href="https://github.com/facebookresearch/faiss/wiki/"> Faiss wiki </a>

* Faiss <a href="https://github.com/facebookresearch/faiss">The Faiss github repository </a>

* Approximate nearest neighbor search with ScaNN <a href="https://github.com/google-research/google-research/tree/master/scann"> Scann github repository</a>

* Scann research paper: <a href="https://arxiv.org/abs/1908.10396">  Accelerating Large-Scale Inference with Anisotropic Vector Quantization </a>
    
* Topic modeling: <a href="https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6">Topic modeling with BERT </a>
    
* Top2Vec research paper: <a  href="https://arxiv.org/abs/2008.09470">Top2Vec: Distributed Representations of Topics</a>


In [132]:
# !pip install bertopic  #Do it only once!

In [133]:
# # Fetch the famous 20-newsgroup data
# from sklearn.datasets import fetch_20newsgroups

# docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

# from bertopic import BERTopic
# topic_model = BERTopic()
# topics, probs = topic_model.fit_transform(docs)

In [134]:
# from umap import umap_ as UMAP
# umap_embeddings = UMAP(n_neighbors=15, 
#                             n_components=5, 
#                             metric='cosine').fit_transform(embeddings)

In [135]:
# import hdbscan
# cluster = hdbscan.HDBSCAN(min_cluster_size=15,
#                           metric='euclidean',                      
#                           cluster_selection_method='eom').fit(umap_embeddings)


In [136]:
#!pip install langchain
#!pip install pyspark

In [137]:
from pyspark.sql import SparkSession
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PySparkDataFrameLoader

def get_chunks(df, column_name, chunk_size):
    """Split the text in `column_name` of a Spark DataFrame into overlapping chunks."""
    # Note: this relies on a SparkSession named `spark` in the global scope,
    # which is created in a later cell before this function is called.
    loader = PySparkDataFrameLoader(spark, df, page_content_column=column_name)
    # Load each DataFrame row as a LangChain Document
    documents = loader.load()
    # Split recursively on separators, up to chunk_size characters per chunk,
    # with 50 characters of overlap between consecutive chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=50,
        length_function=len,
        add_start_index=False
    )
    # Collect the chunk texts of every document into one flat list
    chunks = []
    for document in documents:
        chunks.extend(text_splitter.split_text(document.page_content))
    return chunks

In [138]:
# Define the input parameters
filename = "/home/shaanvi/Downloads/recoveryfiles/ml700_bootcamp/Downloads/2023ml/2023_ml500/nlp-day5/docprocessor/2103.00020.txt"
#column_name = "text_column"
column_name = "value"
chunk_size = 1000

# Create a SparkSession object
spark = SparkSession.builder.appName("TextChunker").getOrCreate()

# Read the text file into a DataFrame
df = spark.read.text(filename)

# Call the get_chunks() function with the DataFrame, column name, and chunk size
chunks = get_chunks(df, column_name, chunk_size)

# Print each chunk with its index. Character offsets into the original text are
# not printed: chunks overlap and are produced per DataFrame row (one row per
# line of the file), so i * chunk_size would not be the true start position.
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}")



23/09/19 15:36:31 WARN Utils: Your hostname, shaanvi-MS-7D91 resolves to a loopback address: 127.0.1.1; using 192.168.1.173 instead (on interface enp3s0)
23/09/19 15:36:31 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/19 15:36:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


AnalysisException: [PATH_NOT_FOUND] Path does not exist: file:/home/shaanvi/Downloads/recoveryfiles/ml700_bootcamp/Downloads/2023ml/2023_ml500/nlp-day5/docprocessor/2103.00020.txt.

Trying the spaCy text splitter

In [None]:
text = "..." # your text
from langchain.text_splitter import SpacyTextSplitter


In [None]:
#!pip install spacy



In [None]:
#!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 7.0 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.6.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
import langchain.text_splitter as ts

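# Initialize the SpacyTextSplitter
spacy_text_splitter = ts.SpacyTextSplitter()

# Split the text on spaCy sentence boundaries. The splitter then merges
# consecutive sentences into chunks (joined by a blank line) up to its
# chunk size, so a short text like this one comes back as a single chunk.
text = "This is the first sentence. This is the second sentence."
sentences = spacy_text_splitter.split_text(text)

# Print the resulting chunks
print(sentences)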

['This is the first sentence.\n\nThis is the second sentence.']


Experiments with chunking!

In [None]:
textlong = '''
State Small Business Credit Initiative 

Data Definitions 

 
• state_id  

Abbreviation of state participating in SSBCI. 

• state_name  
Name of state participating in SSBCI. 

• year_reported  
Year of Annual transaction reporting. 

• program_name  
The name of the state’s program that Treasury approved for participation in SSBCI. 

• program_type  
The program types or program type corresponding to the SSBCI-approved program. 

• unique_id 
An alphanumeric or numeric code that is unique to each transaction in the state. 

• disbursement_date  
The date of the loan/line of credit or investment closing. 

• loan_investment_amount  
The principal amount of the loan or investment supported with SSBCI funds.  This amount does 
not include any subsequent private financing associated with the loan or investment and any 
private financing associated with, but separate from, the SSBCI-supported loan or investment. 

• ssbci_original_funds  
SSBCI original funds expended. For CAP programs, this is the amount paid out of SSBCI funds to 
cover the Approved State Program's contribution to the CAP reserve fund. For Loan 
Participation or 'direct lending programs', this is the amoutn paid out of SSBCI dollars, of the 
Approved State Program’s participation in the loan. If the deal is structured as two companion 
loans - one from the private sector, and one (upon which private sector financing is contingent) 
from the Approved State Program - then it is the value of the SSBCI-supported companion loan. 
For Loan Guarantee programs, this is the amount, paid for out of SSBCI dollars, actually set aside 
to cover the guarantee obligation. For Collateral Support programs, this  is the amount set aside 
to cover the collateral support obligation. For VC programs, this is the amount invested by state-
run venture capital program, or a fund in which the state-run venture capital program has 
invested, using SSBCI funds. 

• nonprivate_amount  
The amount of any public subsidy associated with the enrolled loan or investment that is from 
any source other than SSBCI. 



• concurrent_private_financing  
The amount of any private financing associated with, but separate from, the enrolled loan or 
SSBCI-supported investment. 

• borrower_insurance_premium  
Required only for CAP programs. This is the amount paid by the borrower to cover the 
borrower’s contribution to the CAP reserve fund. 

• lender_insurance_premium  
Required only for CAP programs. This is the amount paid by the lender to cover the lender’s 
contribution to the CAP reserve fund. 

• guaranteed_amount  
Required only for Loan Guarantee programs. This is the full value of the loan guarantee. 

• collateral_support  
Required only for Collateral Support programs. This is the full amount of the collateral support 
obligation. 

• ssbci_recycled_funds  
The funds used for this loan or investment that came to the state in the form of program 
income, interest earned, or principal repayments and funds that have been previously loaned or 
invested. 

• subsequent_private_financing 
The total amount of private financing received after closing that is caused by, or resulting from, 
the initial SSBCI-supported financing.  Applicable only to loan participation and venture capital 
programs. 
 

• census_tract  
The census tract of the eligible small business receiving the SSBCI loan or investment. 
 

• zip_code  
The 5-digit zip code of the eligible small business receiving the SSBCI loan or investment. 
 

• metro_type  
The “metro”/ “non-metro” designation of the county of investment as defined by the Office of 
Management and Budget (OMB) based on the 2010 Census Survey. Counties containing a core 
urban area of 50,000 or more in population are designated as metropolitan (“metro”), and all 
other counties (micropolitan or rural) as non-metropolitan (“non-metro”). 
 

• LMI_type  
LMI indicates census tracts that fall in the “Low and Moderate Income” categorizations, base on 
the 2010 Census Bureau's 5-year American Community Survey.  "Low income” households earn 
less than 50 percent of area median income. “Moderate income” households earn between 50 
percent and 80 percent of area median income. Non-LMI indicates an income level greater than 
80% of the area median income. 



• revenue  
The borrower’s/investee’s annual revenues for the most recent fiscal year as of the reporting 
year end. 
 

• full_time_employees  
The borrower’s/investee’s Full Time Equivalent employees, rounded to the nearest whole 
number, at the time of loan or investment closing. 
 

• naics_code  
The 2012 North American Industry Classification System (NAICS) codes for the 
borrower’s/investee’s industry. 
 

• year_incorporated  
The year the business was incorporated.  If the business had not been incorporated, this is the 
year the business opened. 
 

• jobs_created  
The number of new Full-Time Equivalent jobs expected to be created as a direct result of the 
loan or investment. These jobs must be expected to materialize in no more than 2 years from 
the date of the loan or investment closing. 
 

• jobs_retained  
The number of Full-Time Equivalent (FTE) jobs retained as a direct result of the loan or 
investment. For all start-up companies receiving SSBCI investments, this must be zero. 
 

• trans_type  
Indicates whether the transaction is a loan or investment. 
 

• lender_name  
The name of the private lending institution or lender making the loan that is guaranteed, 
insured, or otherwise enhanced with SSBCI funds. If the state is making a companion loan using 
SSBCI funds, the state enters the name of the private lender making the private companion loan 
to the borrower as part of the same transaction. For approved venture capital programs, states 
are not required to report the names of investors and venture capital firms providing the private 
capital at risk in the investment. 
 

• lender_ein  
The lender’s entity identification number (EIN) or tax ID number. This data field is not applicable 
to approved state-run venture capital programs. 
 

• lender_regulatory_id  
The is the lender’s regulatory ID number: CDFI#### - For certified CDFIs, FDIC#### - For FDIC 
regulated institutions, NCUA#### - For NCUA regulated institutions, OTHER - For non-financial 



lenders or other lenders whose regulatory IDs are not known (This data field is not applicable to 
venture capital programs.) 
 

• lender_type  
Indicates whether the lender is a Bank, Credit Union, or Other type of lending institution. 
 

• lender_type_category  
Indicates whether the lender is a Bank or Thrift, Credit Union, Depository Institution Holding 
Company, Loan Fund, Venture Capital Fund, or Other type of lending institution. 
 

• CDFI_type  
Indicates whether the lender is a Community development financial institutions certified by the 
Community Development Financial Institutions Fund (CDFI Fund) at the U.S. Department of the 
Treasury. 
 

• MDI_type  
Indicates whether the lender is a Minority Depository Institution. 
 

• VC_cat  
Indicates the VC program strategy for the investment, select from: State Sponsored Entity (SSE), 
Fund, Co-Investment Model, or State Agency. 
 

• optional_woman_owned  
Women-owned business is a business concern that is at least 51 percent owned by one of more 
women; or, in the case of any publicly owned business, at least 51 percent of the stock of which 
is owned by one or more women; and whose management and daily business operations are 
controlled by one or more women. 
 

• optional_minority_owned  
Minority-owned business is a business concern that is at least 51 percent owned by one or more 
(in combination) of the following ethnic minorities: Black Americans, Hispanic Americans, Native 
Americans, Asian Pacific Americans, and Sub Continent Asian American; or, in the case of any 
publicly owned business, at least 51 percent of the stock of which is owned by one or more 
ethnic minority; and whose management and daily business operations are controlled by one or 
more ethnic minority. 
 

• optional_veteran_owned  
Veteran-owned business concern is a business concern that is at least 51 percent owned by one 
or more veterans, or in the case of any publicly owned business, at least 51 percent of the stock 
of which is owned by one or more veterans; the management and daily business operations of 
which are controlled by one or more veterans. All service-disabled veteran-owned business 
concerns are also, by definition, veteran-owned business concerns. Veteran is a person who 
served on active duty with the U.S. Army, Air Force, Navy, Marine Corps or Coast Guard, for any 
length of time and at any place and who was discharged or released under conditions other than 
dishonorable. Reservists or members of the National Guard called to Federal active duty or 
disabled from a disease or injury incurred or aggravated in line of duty while in training status 
also qualify as a veteran. 
 

• optional_FTE  
The borrower’s/investee’s Full Time Equivalent employees, rounded to the nearest whole 
number, verified by the Participating State’s operating entity. 
 

• optional_FTE_yr_confirmed  
Enter the year at which the Participating State’s operating entity last verified the 
borrower/investee reported number of FTEs. 
 

• optional_primary_use_of_funds  
The primary purpose the financing was used for by the borrower/investee: wages, working 
capital and professional services; Purchase equipment; fund construction costs; purchase real 
estate; or refinance. 
 

• optional_revenue  
Dollar amount of gross annual revenues from most recent fiscal year, verified by the 
Participating State’s operating entity. 
 

• optional_revenue_yr_confirmed  
The year at which the Participating State’s operating entity last verified the borrower/investee 
gross annual revenues. 
 

• optional_active  
Yes/no indicator as to whether the business was actively operating as of 12/31/2017. If no, 
select explanation in the column identified as “Explanation for if ‘no’”. If unknown, please 
explain in column identified as “Explanation for if ‘unknown’”. 
 

• optional_active_no  
Indicates the explanation if the answer to “Business actively operating as of 12/31/2016” is 
“no”: Loss, Bankrupt, Moved, Sold, Exit. 
 

• optional_active_unknown  
Indicates the explanation if the answer to “Business actively operating as of 12/31/2016” is 
“unknown”: Attempted to confirm operations, but unable. Did not attempt to confirm 
operations. 
 

• optional_dollars_lost  
The dollar amount of SSBCI funds that were lost. 
 

• optional_business_EIN  
The Employer Identification Number (EIN) of the business receiving SSBCI financing. 



• optional_business_name 
The name of the business receiving SSBCI financing. 
 

• optional_business_street_address  
The street address of the business receiving SSBCI financing. 
 

• optional_business_city  
The city of the business receiving SSBCI financing. 
 

• optional_business_state  
The state of the business receiving SSBCI financing. 
 

• optional_coinvestment_source 
Indicates the primary source of co-investment for VC investments: Angel, In-state VC fund, 
Out-of-state VC fund, Bank or Financial Institution. 
 

• optional_stage  
Indicates the company stage at the time of the transaction date for VC investments: Pre-seed, 
Seed, Early Stage, Growth, Mezzanine. 
'''

In [None]:
para1='''
Sachin Ramesh Tendulkar, (/ˌsʌtʃɪn tɛnˈduːlkər/ i; pronounced [sətɕin teːɳɖulkəɾ]; born 24 April 1973) is an Indian former international cricketer who captained the Indian national team. He is widely regarded as one of the greatest batsmen in the history of cricket.[4] He is the all-time highest run-scorer in both ODI and Test cricket with more than 18,000 runs and 15,000 runs, respectively.[5] He also holds the record for receiving the most man-of-the-match awards in international cricket.[6] Tendulkar was a Member of Parliament, Rajya Sabha by nomination from 2012 to 2018.[7][8]

'''

para2='''

Tendulkar took up cricket at the age of eleven, made his Test match debut on 15 November 1989 against Pakistan in Karachi at the age of sixteen, and went on to represent Mumbai domestically and India internationally for over 24 years.[9] In 2002, halfway through his career, Wisden ranked him the second-greatest Test batsman of all time, behind Don Bradman, and the second-greatest ODI batsman of all time, behind Viv Richards.[10] The same year, Tendulkar was a part of the team that was one of the joint-winners of the 2002 ICC Champions Trophy. Later in his career, Tendulkar was part of the Indian team that won the 2011 Cricket World Cup, his first win in six World Cup appearances for India.[11] He had previously been named "Player of the Tournament" at the 2003 World Cup.

'''
para3='''
Tendulkar has received several awards from the government of India: the Arjuna Award (1994), the Khel Ratna Award (1997), the Padma Shri (1998), and the Padma Vibhushan (2008).[12][13] After Tendulkar played his last match in November 2013, the Prime Minister's Office announced the decision to award him the Bharat Ratna, India's highest civilian award.[14][15] He was the first sportsperson to receive the reward and, as of 2023, is the youngest recipient.[16][17][18] In 2010, Time included Tendulkar in its annual list of the most influential people in the world.[19] Tendulkar was awarded the Sir Garfield Sobers Trophy for cricketer of the year at the 2010 International Cricket Council (ICC) Awards.[20]

'''
para4='''
Having retired from ODI cricket in 2012,[21][22] he retired from all forms of cricket in November 2013 after playing his 200th Test match.[23] Tendulkar played 664 international cricket matches in total, scoring 34,357 runs.[24] In 2013, Tendulkar was included in an all-time Test World XI to mark the 150th anniversary of Wisden Cricketers' Almanack, and he was the only specialist batsman of the post–World War II era, along with Viv Richards, to get featured in the team.[25] In 2019, he was inducted into the ICC Cricket Hall of Fame.[26] On 24 April 2023, the Sydney Cricket Ground unveiled a set of gates named after Tendulkar and Brian Lara on the occasion of Tendulkar's 50th birthday and the 30th anniversary of Lara's inning of 277 at the ground.[27][28][29]
'''

fourparagraphs='''
Sachin Ramesh Tendulkar, (/ˌsʌtʃɪn tɛnˈduːlkər/ i; pronounced [sətɕin teːɳɖulkəɾ]; born 24 April 1973) is an Indian former international cricketer who captained the Indian national team. He is widely regarded as one of the greatest batsmen in the history of cricket.[4] He is the all-time highest run-scorer in both ODI and Test cricket with more than 18,000 runs and 15,000 runs, respectively.[5] He also holds the record for receiving the most man-of-the-match awards in international cricket.[6] Tendulkar was a Member of Parliament, Rajya Sabha by nomination from 2012 to 2018.[7][8]

Tendulkar took up cricket at the age of eleven, made his Test match debut on 15 November 1989 against Pakistan in Karachi at the age of sixteen, and went on to represent Mumbai domestically and India internationally for over 24 years.[9] In 2002, halfway through his career, Wisden ranked him the second-greatest Test batsman of all time, behind Don Bradman, and the second-greatest ODI batsman of all time, behind Viv Richards.[10] The same year, Tendulkar was a part of the team that was one of the joint-winners of the 2002 ICC Champions Trophy. Later in his career, Tendulkar was part of the Indian team that won the 2011 Cricket World Cup, his first win in six World Cup appearances for India.[11] He had previously been named "Player of the Tournament" at the 2003 World Cup.

Tendulkar has received several awards from the government of India: the Arjuna Award (1994), the Khel Ratna Award (1997), the Padma Shri (1998), and the Padma Vibhushan (2008).[12][13] After Tendulkar played his last match in November 2013, the Prime Minister's Office announced the decision to award him the Bharat Ratna, India's highest civilian award.[14][15] He was the first sportsperson to receive the reward and, as of 2023, is the youngest recipient.[16][17][18] In 2010, Time included Tendulkar in its annual list of the most influential people in the world.[19] Tendulkar was awarded the Sir Garfield Sobers Trophy for cricketer of the year at the 2010 International Cricket Council (ICC) Awards.[20]

Having retired from ODI cricket in 2012,[21][22] he retired from all forms of cricket in November 2013 after playing his 200th Test match.[23] Tendulkar played 664 international cricket matches in total, scoring 34,357 runs.[24] In 2013, Tendulkar was included in an all-time Test World XI to mark the 150th anniversary of Wisden Cricketers' Almanack, and he was the only specialist batsman of the post–World War II era, along with Viv Richards, to get featured in the team.[25] In 2019, he was inducted into the ICC Cricket Hall of Fame.[26] On 24 April 2023, the Sydney Cricket Ground unveiled a set of gates named after Tendulkar and Brian Lara on the occasion of Tendulkar's 50th birthday and the 30th anniversary of Lara's inning of 277 at the ground.[27][28][29]
'''

In [None]:
from langchain.text_splitter import CharacterTextSplitter

chunk_size = 256
chunk_overlap  = 20
text_splitter = CharacterTextSplitter(
    separator = "\n\n",
    chunk_size = chunk_size,
    chunk_overlap  = chunk_overlap
)
docs = text_splitter.create_documents([fourparagraphs])

Created a chunk of size 588, which is longer than the specified 256
Created a chunk of size 781, which is longer than the specified 256
Created a chunk of size 710, which is longer than the specified 256
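
The warnings above arise because a separator-based splitter only cuts at the separator and never inside a piece. The following is a minimal pure-Python sketch of that behaviour, not LangChain's actual implementation: the function name `split_on_separator` is hypothetical, and chunk overlap is ignored for simplicity.

```python
def split_on_separator(text, separator="\n\n", chunk_size=256):
    """Greedily merge separator-delimited pieces; never cut inside a piece."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # The piece still fits into the chunk being built
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A piece longer than chunk_size is kept whole (oversized chunk)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# One long "paragraph" followed by two short ones
sample = "A" * 300 + "\n\n" + "short one" + "\n\n" + "short two"
result = split_on_separator(sample, chunk_size=256)
print([len(chunk) for chunk in result])
```

The 300-character piece survives as a single oversized chunk, while the two short pieces are merged into one chunk.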


In [None]:
chunks = []
for doc in docs:
    # Re-split each document; a separate name avoids shadowing the outer `docs`
    sub_docs = text_splitter.create_documents([doc.page_content])
    chunks.extend(sub_doc.page_content for sub_doc in sub_docs)
print(chunks)

['Sachin Ramesh Tendulkar, (/ˌsʌtʃɪn tɛnˈduːlkər/ i; pronounced [sətɕin teːɳɖulkəɾ]; born 24 April 1973) is an Indian former international cricketer who captained the Indian national team. He is widely regarded as one of the greatest batsmen in the history of cricket.[4] He is the all-time highest run-scorer in both ODI and Test cricket with more than 18,000 runs and 15,000 runs, respectively.[5] He also holds the record for receiving the most man-of-the-match awards in international cricket.[6] Tendulkar was a Member of Parliament, Rajya Sabha by nomination from 2012 to 2018.[7][8]', 'Tendulkar took up cricket at the age of eleven, made his Test match debut on 15 November 1989 against Pakistan in Karachi at the age of sixteen, and went on to represent Mumbai domestically and India internationally for over 24 years.[9] In 2002, halfway through his career, Wisden ranked him the second-greatest Test batsman of all time, behind Don Bradman, and the second-greatest ODI batsman of all time,

In [None]:
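for i, chunk in enumerate(chunks):
    # Note: the printed indices are nominal multiples of chunk_size, not the chunk's true offsets in the text
    print(f"Chunk {i+1}: {chunk}  # Start index: {i*chunk_size}, End index: {(i+1)*chunk_size}")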

Chunk 1: Sachin Ramesh Tendulkar, (/ˌsʌtʃɪn tɛnˈduːlkər/ i; pronounced [sətɕin teːɳɖulkəɾ]; born 24 April 1973) is an Indian former international cricketer who captained the Indian national team. He is widely regarded as one of the greatest batsmen in the history of cricket.[4] He is the all-time highest run-scorer in both ODI and Test cricket with more than 18,000 runs and 15,000 runs, respectively.[5] He also holds the record for receiving the most man-of-the-match awards in international cricket.[6] Tendulkar was a Member of Parliament, Rajya Sabha by nomination from 2012 to 2018.[7][8]  # Start index: 0, End index: 256
Chunk 2: Tendulkar took up cricket at the age of eleven, made his Test match debut on 15 November 1989 against Pakistan in Karachi at the age of sixteen, and went on to represent Mumbai domestically and India internationally for over 24 years.[9] In 2002, halfway through his career, Wisden ranked him the second-greatest Test batsman of all time, behind Don Bradman, a
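
For comparison, strictly fixed-size chunking can be sketched as a sliding character window, where `chunk_size` and `chunk_overlap` have the same meaning as in the cell above. This is an illustrative sketch (the helper `fixed_size_chunks` is hypothetical); unlike the nominal indices printed above, it reports the true start and end offsets of every chunk.

```python
def fixed_size_chunks(text, chunk_size=256, chunk_overlap=20):
    """Return (start, end, chunk) triples using a sliding character window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    spans = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        spans.append((start, end, text[start:end]))
        if end == len(text):
            break  # the last window already reaches the end of the text
    return spans


spans = fixed_size_chunks("x" * 600, chunk_size=256, chunk_overlap=20)
for start, end, chunk in spans:
    print(f"Start index: {start}, End index: {end}, Length: {len(chunk)}")
```

Every chunk now respects the size limit, at the cost of cutting through words and sentences.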

In [None]:
#model = SentenceTransformer('all-mpnet-base-v2')

**Learning**:
1. Fixed-size, character-based chunking (say, 256 characters) does not work here. `CharacterTextSplitter` splits only at the separator (`"\n\n"`), so a paragraph longer than 256 characters is kept whole and the resulting chunk exceeds the specified size.

For example, look at the following warnings:
- Created a chunk of size 588, which is longer than the specified 256
- Created a chunk of size 781, which is longer than the specified 256
- Created a chunk of size 710, which is longer than the specified 256

2. How about naive splitting?

In [None]:
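# Naive approach: split on every period
chunks = fourparagraphs.split(".")
for i, chunk in enumerate(chunks):
    # Note: the printed indices are nominal multiples of chunk_size, not the chunk's true offsets in the text
    print(f"Chunk {i+1}: {chunk}  # Start index: {i*chunk_size}, End index: {(i+1)*chunk_size}")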

Chunk 1: 
Sachin Ramesh Tendulkar, (/ˌsʌtʃɪn tɛnˈduːlkər/ i; pronounced [sətɕin teːɳɖulkəɾ]; born 24 April 1973) is an Indian former international cricketer who captained the Indian national team  # Start index: 0, End index: 256
Chunk 2:  He is widely regarded as one of the greatest batsmen in the history of cricket  # Start index: 256, End index: 512
Chunk 3: [4] He is the all-time highest run-scorer in both ODI and Test cricket with more than 18,000 runs and 15,000 runs, respectively  # Start index: 512, End index: 768
Chunk 4: [5] He also holds the record for receiving the most man-of-the-match awards in international cricket  # Start index: 768, End index: 1024
Chunk 5: [6] Tendulkar was a Member of Parliament, Rajya Sabha by nomination from 2012 to 2018  # Start index: 1024, End index: 1280
Chunk 6: [7][8]

Tendulkar took up cricket at the age of eleven, made his Test match debut on 15 November 1989 against Pakistan in Karachi at the age of sixteen, and went on to represent Mumba
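
The output above shows two weaknesses of splitting on `"."`: the period is dropped, and citation markers such as `[4]` drift to the start of the next chunk. A slightly more careful, regex-based splitter could look like the sketch below. The pattern and the helper `split_sentences` are illustrative assumptions; abbreviations such as "Dr." would still be split incorrectly.

```python
import re

# A sentence: shortest run of text ending in . ! or ?, followed by optional
# citation markers like [4][5], and then whitespace or the end of the text.
SENTENCE = re.compile(r".+?[.!?](?:\[\d+\])*(?=\s|$)", flags=re.DOTALL)


def split_sentences(text):
    """Return (start, end, sentence) triples with true character offsets."""
    spans, last_end = [], 0
    for match in SENTENCE.finditer(text):
        raw = match.group()
        # Skip leading whitespace so that text[start:end] == sentence
        start = match.start() + len(raw) - len(raw.lstrip())
        spans.append((start, match.end(), raw.strip()))
        last_end = match.end()
    # Keep any trailing text that has no closing punctuation
    tail = text[last_end:]
    if tail.strip():
        start = last_end + len(tail) - len(tail.lstrip())
        spans.append((start, start + len(tail.strip()), tail.strip()))
    return spans


sample = ("He is regarded as one of the greatest batsmen.[4] "
          "He scored more than 18,000 runs.[5] He retired in 2013.")
sentences = split_sentences(sample)
for start, end, sentence in sentences:
    print(f"[{start}:{end}] {sentence}")
```

Each sentence keeps its own period and citation markers, and the reported offsets point back into the original text.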

NLP model-based chunking.
What does an `AutoTokenizer` do?

In [None]:
example = "The tokenizer does tokenization. It does this to have fun with tokens."
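
Before loading a real tokenizer, it may help to see the core idea in miniature. BERT-style tokenizers use WordPiece, which breaks each word into the longest sub-word pieces found in a vocabulary, marking continuation pieces with `##`. The sketch below is a toy version: the vocabulary `TOY_VOCAB` is invented for this example and is not BERT's, and real tokenizers also add special tokens and map pieces to integer ids.

```python
import re

# Hypothetical toy vocabulary, for illustration only
TOY_VOCAB = {"the", "does", "token", "##izer", "##ization", "."}


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(text, vocab):
    # Lowercase, then separate words from punctuation before applying WordPiece
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    return [piece for word in words for piece in wordpiece(word, vocab)]


toy_tokens = toy_tokenize("The tokenizer does tokenization.", TOY_VOCAB)
print(toy_tokens)
```

Words that share a stem ("tokenizer", "tokenization") share the piece `token`, which is how a fixed vocabulary can cover words it has never seen whole.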

In [None]:
#!pip install transformers



In [None]:
from transformers import AutoTokenizer
import pandas as pd

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer(chunks[0])

In [None]:
print (chunks[0])
pd.DataFrame(dict(tokens))


Sachin Ramesh Tendulkar, (/ˌsʌtʃɪn tɛnˈduːlkər/ i; pronounced [sətɕin teːɳɖulkəɾ]; born 24 April 1973) is an Indian former international cricketer who captained the Indian national team


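|   | input_ids | token_type_ids | attention_mask |
|---|-----------|----------------|----------------|
| 0 | 101 | 0 | 1 |
| 1 | 17266 | 0 | 1 |
| 2 | 10606 | 0 | 1 |
| 3 | 8223 | 0 | 1 |
| 4 | 9953 | 0 | 1 |
| 5 | 7166 | 0 | 1 |
| 6 | 5313 | 0 | 1 |
| 7 | 6673 | 0 | 1 |
| 8 | 1010 | 0 | 1 |
| 9 | 1006 | 0 | 1 |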
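
Since a model's input limit is measured in tokens such as the `input_ids` above rather than in characters, one option is to chunk by token count. The sketch below assumes a `tokenize` callable that turns text into a list of tokens; whitespace splitting (`str.split`) is used as a stand-in so the example runs without downloading a model, and the name `chunk_by_tokens` is hypothetical. Joining tokens back with spaces is a simplification that would not be right for sub-word tokens.

```python
def chunk_by_tokens(text, tokenize, max_tokens=128, overlap=16):
    """Group tokens into windows of at most max_tokens, sharing `overlap` tokens."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    tokens = tokenize(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(" ".join(window))
        if start + max_tokens >= len(tokens):
            break  # this window already covers the final token
    return chunks


text = "one two three four five six seven eight nine ten"
token_chunks = chunk_by_tokens(text, str.split, max_tokens=4, overlap=1)
for chunk in token_chunks:
    print(chunk)
```

With a real tokenizer, the same windowing would be applied to its token list, so that no chunk exceeds the model's token budget.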


In [None]:
from sentence_transformers import SentenceTransformer, util

MODEL = 'msmarco-distilbert-base-v4'
embedder = SentenceTransformer(MODEL)

# Encode the paragraphs as a list so that each one gets its own embedding.
# (Encoding the single string `fourparagraphs` yields only one vector, and
# indexing that string returns characters rather than paragraphs.)
paragraphs = [para1.strip(), para2.strip(), para3.strip(), para4.strip()]
embeddings = embedder.encode(paragraphs, convert_to_tensor=True)
print(embeddings.shape)  # (number of paragraphs, embedding dimension)

query_text = "tendulkar"
query = embedder.encode(query_text, convert_to_tensor=True)

# Rank the paragraphs by similarity to the query
search_results = util.semantic_search(query, embeddings, top_k = 3)
for index, result in enumerate(search_results[0]):
    print('-'*80)
    print(f'Search Rank: {index}, Relevance score: {result["score"]} ')
    print(paragraphs[result['corpus_id']])
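
By default, the relevance score reported by `util.semantic_search` is the cosine similarity between the query embedding and each corpus embedding. The ranking step can be sketched in plain Python as below; the two-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions), and `top_k_hits` is a hypothetical helper, not the library's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_hits(query_vec, corpus_vecs, k=3):
    """Score every corpus vector against the query and keep the k best."""
    scored = [
        {"corpus_id": i, "score": cosine_similarity(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    return sorted(scored, key=lambda hit: hit["score"], reverse=True)[:k]


# Toy embeddings: document 0 points the same way as the query,
# document 1 is orthogonal to it, document 2 lies in between.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [1.0, 0.0]
hits = top_k_hits(query_vec, corpus, k=2)
for hit in hits:
    print(hit)
```

The result mirrors the shape of the library's output: a list of hits, each with a `corpus_id` and a `score`, ordered from most to least similar.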


