In [1]:
#!pip install yellowbrick

In [2]:
%run supportvectors-common.ipynb

ModuleNotFoundError: No module named 'yellowbrick'

# Semantic search

Traditionally, one would search through a corpus of documents using a keywords-based search engine like Lucene, Solr, ElasticSearch, etc. While the technology has matured, the basic underlying approach behind keyword search engines is to maintain an *inverted-index* mapping keywords to a list of documents that contain them, with associated relevances.
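To make the idea of an inverted index concrete, here is a minimal sketch in plain Python. The toy documents, the helper names `build_inverted_index` and `keyword_search`, and the use of the matched-keyword count as a crude relevance score are all assumptions made for illustration; real engines such as Lucene use far more elaborate analysis and scoring.

```python
from collections import defaultdict

# Toy corpus; the texts are made up for illustration.
docs = {
    0: "cozy cafe serving espresso and crepes",
    1: "breakfast places open early",
    2: "late night pizza restaurant",
}

def build_inverted_index(docs):
    """Map each lowercase keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def keyword_search(index, query):
    """Return ids of documents matching at least one query keyword,
    ranked by the number of matched keywords (a crude relevance score)."""
    counts = defaultdict(int)
    for word in query.lower().split():
        for doc_id in index.get(word, ()):
            counts[doc_id] += 1
    return sorted(counts, key=lambda d: (-counts[d], d))

index = build_inverted_index(docs)
hits = keyword_search(index, "breakfast places")
print(hits)
```

In this sketch, only the document that literally contains the query keywords is returned; the cafe document shares no keyword with the query and is therefore never considered.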

In general, the keywords-based search approach has been quite successful over the years, and has matured with added features and linguistic capabilities.

However, this approach has its limitations. The principal cause is that when we enter keywords, we tend to describe the intent of what we are looking for. For example, if we enter "breakfast places", we implicitly also mean restaurants, cafes, etc. that serve items appropriate for breakfast. There may be a restaurant described as a shop for espresso or crepes that a keyword search will likely miss, since its keywords do not match the query terms. And yet, we would hope to see it near the top of the search results.

Semantic search is an NLP approach that relies largely on deep neural networks, and in particular on transformers, which make it possible to infer more closely the human intent behind the search terms, the relationships between the words, and the underlying context. It accepts entire sentences -- and even paragraphs -- describing the searcher's intent, and retrieves results that are more relevant to it.

## How would we do this NLP task with AI?

Let us represent the functional behavior we expect: 


![](images/semantic-search-functionality.png)


### Magic happens: breaking it down into steps

We recall that machine-learning algorithms work with vector ($\mathbf{X}$) representations of data.

So the first order of business would be to map each of the document texts $D_i$ to its corresponding vector $X_i$ in an appropriate $d$-dimensional space, $\mathbb{R}^d$, i.e.

\begin{equation}
D_i \longrightarrow X_i \in \mathbb{R}^d
\end{equation}

These resulting vectors are called **sentence embeddings**. Once these embeddings are computed for each of the documents, we can store the collection of tuples $[\langle D_1, X_1 \rangle, \langle D_2, X_2 \rangle, \ldots, \langle D_n, X_n \rangle]$. Here each tuple corresponds to a document and its sentence embedding.

This collection of tuples, therefore, becomes our **search index**.

### Search

Now, when the user describes what she is looking for, we consider the entire text as a "sentence".
<p>
<div class="alert-box alert-warning" style="padding-top:30px">
   
<b >Caveat Emptor</b>

> Note that we have a rather relaxed definition of a *sentence* in NLP: it diverges somewhat from the grammatical definition of a sentence. For example, in the English language, we would consider a sentence to be terminated with a punctuation mark, such as a period, question mark or exclamation mark. However, in NLP, we loosely consider the entire text -- whether it is just a word, or a few keywords, or an English sentence, or a few sentences together -- as one **sentence** for the purposes of a natural language processing task.
    
<p>
</div>
    
Therefore, it is common to consider an entire document text as a *sentence* if the text is relatively short. Alternatively, it is partitioned into smaller chunks (of say 512-tokens each), and each such chunk is considered an NLP *sentence*.
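As a rough illustration of this partitioning, the sketch below cuts a token list into consecutive windows of at most 512 tokens. The helper name `chunk_tokens` and its optional `overlap` parameter are hypothetical, and whitespace-separated words stand in for real model tokens, which is an assumption: an actual tokenizer would produce a different token count.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=0):
    """Split a token list into consecutive windows of at most chunk_size tokens.

    Adjacent windows share `overlap` tokens; the last window may be shorter.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the token list
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Whitespace tokens stand in for model tokens (an assumption for illustration).
tokens = ("word " * 1200).split()
chunks = chunk_tokens(tokens, chunk_size=512)
print([len(chunk) for chunk in chunks])
```

Each resulting chunk would then be treated as one NLP *sentence* and embedded separately.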

Since we consider the entire query text as a sentence, we can map it to its **sentence embedding vector**, ${Q}$.

#### Vector Similarity
Once we have this, we simply need to compare the query vector ${Q}$ with each of the document vectors $X_i$, and sort the document vectors in descending order of similarity.

The rest is trivial: pick the top-k entries in the sorted list of document vectors. Then, for each vector, look up its corresponding document, and return the resulting list as the sorted search result of relevant documents.

We expect that these documents will exhibit high semantic similarity with the search query, assuming that the search index did contain such documents.
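The compare, sort and pick steps can be sketched with NumPy as follows. This is a minimal illustration, not the implementation used later in this notebook: the function name `top_k_search`, the made-up 3-dimensional embeddings, and the choice of cosine similarity as the similarity measure are assumptions.

```python
import numpy as np

def top_k_search(query_vec, doc_vecs, docs, k=3):
    """Return the k documents most similar to the query, with their scores."""
    q = query_vec / np.linalg.norm(query_vec)
    X = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = X @ q                    # cosine similarity of every document to the query
    order = np.argsort(-scores)[:k]   # indices in descending order of similarity
    return [(docs[i], float(scores[i])) for i in order]

# Made-up embeddings for four documents D1..D4
docs = ["D1", "D2", "D3", "D4"]
X = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
Q = np.array([0.9, 0.1, 0.0])

results = top_k_search(Q, X, docs, k=2)
for doc, score in results:
    print(f"{doc}: {score:.3f}")
```

A brute-force scan like this is linear in the number of documents, which is fine for a toy corpus.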

<figure>
    <img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png">
    <caption> Semantic similarity as vector proximity in the embedding space. <br>
    (Figure source: Sbert.net documentation).
    </caption>
</figure>


#### Similarity measures

The sentence embedding vectors typically live in a high-dimensional space (e.g., 300 dimensions, or 768 for the model used in this notebook). In such high-dimensional spaces, the notion of Euclidean distance is less effective. Therefore, it is far more common to use one of the two measures below for vector similarity:

* **dot-product**, the (inner) dot-product between the embedding vectors.

\begin{equation}
\text{dot-similarity} = \langle X_i, X_j \rangle
\end{equation}

* **cosine-similarity**, $\cos \left(\theta_{ij}\right)$, gives the degree of directional alignment between the vectors, but ignores their magnitudes. Here, $\theta_{ij}$ is the angle between the embedding vectors $X_i$ and $X_j$.

\begin{equation} 
\text{cosine-similarity} = \frac{\langle X_i, X_j \rangle} {\| X_i \| \| X_j \|}
\end{equation}
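The difference between the two measures is easiest to see on a pair of vectors that point in the same direction but differ in magnitude. The sketch below uses NumPy; the helper names and the example vectors are made up for illustration.

```python
import numpy as np

def dot_similarity(x, y):
    """Inner product <x, y>; grows with the magnitudes of the vectors."""
    return float(np.dot(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y; independent of their magnitudes."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 2.0, 2.0])
y = 3.0 * x  # same direction as x, three times the magnitude

print(dot_similarity(x, x), dot_similarity(x, y))
print(cosine_similarity(x, x), cosine_similarity(x, y))
```

Scaling a vector changes its dot-product score but leaves its cosine similarity untouched.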

<div class="alert-box alert-info" style="padding-top:30px">
   
**Important**
    
>  Sentence transformer models trained with cosine-similarity tend to favor the shorter document texts in the search results, whereas the models trained on the dot-product similarity tend to favor longer texts.
</div>

### Symmetric vs asymmetric search

One of the technical aspects to be careful about is the length of the query sentence relative to the actual documents. Different sentence-transformer models have been trained specifically for each of these use-cases:

* **symmetric search**, when we expect the query sentence to be approximately the same length as the document sentences.

* **asymmetric search**, when we expect the document texts to be significantly longer than the query sentence.



#### Load an appropriate model

Let us consider the use-case where we are searching through some reasonably large documents. In such a case, it would be appropriate to use an asymmetric-search model. 

Let us consider an asymmetric model trained with *cosine-similarity* as the similarity measure. In particular, let us use the following model:

* `msmarco-distilbert-base-v4`


We load the model with the following code:

In [48]:
from sentence_transformers import SentenceTransformer

MODEL = 'msmarco-distilbert-base-v4'
embedder = SentenceTransformer(MODEL)

#### Load a toy corpus

Let us now load a toy corpus of some simple, long texts.

In [274]:
%run NLP-Lesson-01___search-corpus.ipynb

Learning Transferable Visual Models From Natural Language Supervision


Learning Transferable Visual Models From Natural Language Supervision

Alec Radford * 1 Jong Wook Kim * 1 Chris Hallacy 1 Aditya Ramesh 1 Gabriel Goh 1 Sandhini Agarwal 1

Girish Sastry 1 Amanda Askell 1 Pamela Mishkin 1 Jack Clark 1 Gretchen Krueger 1 Ilya Sutskever 1

Abstract
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of super- vision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw
text about images is a promising alternative which
leverages a much broader source of supervision.
We demonstrate that the simple pre-training task
of predicting which caption goes with which im-
age is an efficient and scalable way to learn SOTA
image representations from scratch on a dataset
of 400 million (imag

#### Search index of sentence embeddings

Let us now create the search index of sentence embeddings.

In [50]:
embeddings = embedder.encode(sentences, convert_to_tensor=True)

#### Let us check the tokens in each of the sentences

In [275]:
from transformers import AutoTokenizer
import pandas as pd

# Tokenize every corpus text and record how many tokens each one produces
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer(sentences)
df = pd.DataFrame(dict(tokens))
df['Length'] = df['input_ids'].apply(len)
df

Token indices sequence length is longer than the specified maximum sequence length for this model (710 > 512). Running this sequence through the model will result in indexing errors


Unnamed: 0,input_ids,token_type_ids,attention_mask,Length
0,"[101, 1521, 1056, 17311, 7987, 8591, 8004, 101...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",297
1,"[101, 2009, 2001, 1996, 2190, 1997, 2335, 1010...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",147
2,"[101, 1045, 4299, 2017, 2000, 2113, 2008, 2017...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",168
3,"[101, 1037, 6919, 2755, 2000, 8339, 2588, 1010...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",26
4,"[101, 1996, 3899, 2003, 1037, 10170, 1025, 104...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",20
5,"[101, 2065, 1037, 2158, 2004, 20781, 2015, 287...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",26
6,"[101, 26922, 3709, 2819, 1998, 26922, 3709, 44...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",125
7,"[101, 3585, 12850, 2869, 2024, 2025, 13680, 20...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",120
8,"[101, 2065, 2017, 1521, 2128, 5341, 1010, 1037...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",25
9,"[101, 2026, 2767, 6316, 2038, 1037, 3399, 2008...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",30


Note that we chose to get the embeddings as PyTorch tensors -- this will help us later in doing high-performance searches on GPU/TPU hardware. What do these embeddings look like?

In [52]:
embeddings.shape

torch.Size([21, 768])

Clearly, there are 21 embeddings, each a 768-dimensional vector. Let us glance at a sentence, and its embedding:

In [53]:
print (f'{sentences[0]}  {embeddings[0]}')


’Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe.

“Beware the Jabberwock, my son!
      The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
      The frumious Bandersnatch!”

He took his vorpal sword in hand;
      Long time the manxome foe he sought—
So rested he by the Tumtum tree
      And stood awhile in thought.

And, as in uffish thought he stood,
      The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
      And burbled as it came!

One, two! One, two! And through and through
      The vorpal blade went snicker-snack!
He left it dead, and with its head
      He went galumphing back.

“And hast thou slain the Jabberwock?
      Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!”
      He chortled in his joy.

’Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mo

#### Now, search for something!

Let us find the closest match to the query: "kapil"

In [54]:
query_text = "kapil"
query = embedder.encode(query_text, convert_to_tensor=True)

In [55]:
from sentence_transformers import util
search_results = util.semantic_search(query, embeddings, top_k = 3)
search_results

[[{'corpus_id': 16, 'score': 0.4423687756061554},
  {'corpus_id': 19, 'score': 0.319680392742157},
  {'corpus_id': 17, 'score': 0.21497485041618347}]]

In [56]:
for index, result in enumerate(search_results[0]):
    print('-'*80)
    print(f'Search Rank: {index}, Relevance score: {result["score"]} ')
    print(sentences[result['corpus_id']])
    

--------------------------------------------------------------------------------
Search Rank: 0, Relevance score: 0.4423687756061554 

Kapil Dev Ramlal Nikhanj (Pronunciation: [kəpiːl deːʋ] born 6 January 1959) is an Indian former cricketer. One of the greatest all-rounders in the history of cricket, he was a fast-medium bowler and a hard-hitting middle-order batsman. Dev is the only player in the history of cricket to have taken more than 400 wickets (434 wickets) and scored more than 5,000 runs in Test.[4]

Dev captained the Indian cricket team that won the 1983 Cricket World Cup,[5] becoming the first Indian captain to win the Cricket World Cup. He is still the youngest captain (at the age of 24) to win the World Cup for any team.[6] He retired in 1994, as the first player to take 200 ODI wickets,[7] and holding the world record for the highest number of wickets taken in Test cricket, a record subsequently broken by Courtney Walsh in 2000.[8] Kapil Dev still holds the record for the

In [57]:
from langchain.text_splitter import CharacterTextSplitter

chunk_size = 256
chunk_overlap = 20
# Split on blank lines only; a paragraph longer than chunk_size is kept whole (with a warning)
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
docs = text_splitter.create_documents(sentences)
chunks = []
for doc in docs:
    # Second pass over each piece; distinct names avoid shadowing `docs` and `doc`
    sub_docs = text_splitter.create_documents([doc.page_content])
    chunks.extend([sub_doc.page_content for sub_doc in sub_docs])
print("printing chunks\n")
print(chunks)

Created a chunk of size 379, which is longer than the specified 256
Created a chunk of size 581, which is longer than the specified 256
Created a chunk of size 344, which is longer than the specified 256
Created a chunk of size 588, which is longer than the specified 256
Created a chunk of size 781, which is longer than the specified 256
Created a chunk of size 710, which is longer than the specified 256
Created a chunk of size 720, which is longer than the specified 256
Created a chunk of size 312, which is longer than the specified 256
Created a chunk of size 514, which is longer than the specified 256
Created a chunk of size 863, which is longer than the specified 256
Created a chunk of size 956, which is longer than the specified 256
Created a chunk of size 740, which is longer than the specified 256
Created a chunk of size 1491, which is longer than the specified 256
Created a chunk of size 266, which is longer than the specified 256
Created a chunk of size 671, which is longer th

printing chunks



In [58]:
sentences

['\n’Twas brillig, and the slithy toves\n      Did gyre and gimble in the wabe:\nAll mimsy were the borogoves,\n      And the mome raths outgrabe.\n\n“Beware the Jabberwock, my son!\n      The jaws that bite, the claws that catch!\nBeware the Jubjub bird, and shun\n      The frumious Bandersnatch!”\n\nHe took his vorpal sword in hand;\n      Long time the manxome foe he sought—\nSo rested he by the Tumtum tree\n      And stood awhile in thought.\n\nAnd, as in uffish thought he stood,\n      The Jabberwock, with eyes of flame,\nCame whiffling through the tulgey wood,\n      And burbled as it came!\n\nOne, two! One, two! And through and through\n      The vorpal blade went snicker-snack!\nHe left it dead, and with its head\n      He went galumphing back.\n\n“And hast thou slain the Jabberwock?\n      Come to my arms, my beamish boy!\nO frabjous day! Callooh! Callay!”\n      He chortled in his joy.\n\n’Twas brillig, and the slithy toves\n      Did gyre and gimble in the wabe:\nAll mimsy w

In [59]:
from langchain.text_splitter import SpacyTextSplitter

# Initialize the SpacyTextSplitter
spacy_text_splitter = SpacyTextSplitter()

# Split each corpus text into sentence-based chunks.
# Note: `chunks` is overwritten on every iteration, so afterwards it holds the chunks of the last text only.
for onesentence in sentences:
    chunks = spacy_text_splitter.split_text(onesentence)
    # Print the chunks
    print(chunks)

['’Twas brillig, and the slithy toves\n      Did gyre and gimble in the wabe:\nAll mimsy were the borogoves,\n      And the mome raths outgrabe.\n\n\n\n“Beware the Jabberwock, my son!\n      \n\nThe jaws that bite, the claws that catch!\nBeware the Jubjub bird, and shun\n      The frumious Bandersnatch!”\n\n\n\nHe took his vorpal sword in hand;\n      Long time the manxome foe he sought—\nSo rested he by the Tumtum tree\n      And stood awhile in thought.\n\n\n\nAnd, as in uffish thought he stood,\n      The Jabberwock, with eyes of flame,\nCame whiffling through the tulgey wood,\n      And burbled as it came!\n\n\n\nOne, two!\n\nOne, two!\n\nAnd through and through\n      The vorpal blade went snicker-snack!\n\n\nHe left it dead, and with its head\n      He went galumphing back.\n\n\n\n“And hast thou slain the Jabberwock?\n      Come to my arms, my beamish boy!\n\n\nO frabjous day!\n\nCallooh! Callay!”\n      \n\nHe chortled in his joy.\n\n’Twas brillig, and the slithy toves\n      Di

Output:

['Virat Kohli (Hindi pronunciation: [ʋɪˈɾɑːʈ ˈkoːɦli] i; born 5 November 1988) is an Indian international cricketer and the former captain of the Indian national cricket team who plays for Royal Challengers Bangalore in the IPL and Delhi in domestic cricket.\n\nConsidered to be one of the best cricketers in the world, he is widely regarded as one of the greatest batsmen in the history of the sport.[4] Nicknamed "The King", due to his dominant style of play and popularity, Kohli holds numerous records in his career across all formats.\n\nIn x2020, the International Cricket Council named him the male cricketer of the decade.\n\nKohli has also contributed to India\'s successes, captaining the team from 2014 to 2022, and winning the 2011 World Cup and the 2013 Champions trophy.\n\nHe is among the only four Indian cricketers who have played over 500 matches for India.[5]\n\nBorn and raised in New Delhi, Kohli trained at the West Delhi Cricket Academy and started his youth career with the Delhi Under-15 team.\n\nHe made his international debut in 2008 and quickly became a key player in the ODI team and later made his Test debut in 2011.\n\nIn 2013, Kohli reached the number one spot in the ICC rankings for ODI batsmen for the first time.\n\nDuring 2014 T20 World Cup, he set a record for the most runs scored in the tournament.\n\nIn 2018, he achieved yet another milestone, becoming the world\'s top-ranked Test batsman, making him the only Indian cricketer to hold the number one spot in all three formats of the game.\n\nHis form continued in 2019, when he became the first player to score 20,000 international runs in a single decade.\n\nIn 2021, Kohli made the decision to step down as the captain of the Indian national team for T20Is, following the T20 World Cup and in early 2022 he stepped down as the captain of the Test team as well.\n\n\n\nHe has received many accolades for his performances on the cricket field.\n\nHe was recognized as the ICC ODI Player of the Year in 
2012 and has won the Sir Garfield Sobers Trophy, given to the ICC Cricketer of the Year, on two occasions, in 2017 and 2018 respectively.\n\nSubsequently, Kohli also won ICC Test Player of the Year and ICC ODI Player of the Year awards in 2018, becoming the first player to win both awards in the same year.\n\nAdditionally, he was named the Wisden Leading Cricketer in the World for three consecutive years, from 2016 to 2018.\n\nAt the national level, Kohli was honoured with the Arjuna Award in 2013, the Padma Shri under the sports category in 2017 and the Khel Ratna award, India\'s highest sporting honour, in 2018.\n\n\n\nIn 2016, he was ranked as one of the world\'s most famous athletes by ESPN, and one of the most valuable athlete brands by Forbes.\n\nIn 2018, Time magazine included him on its list of the 100 most influential people in the world.\n\nIn 2020, he was ranked 66th in Forbes list of the top 100 highest-paid athletes in the world for the year 2020 with estimated earnings of over $26 million.\n\nKohli has been deemed one of the most commercially viable cricketers, with estimated earnings of ₹165 crore (US$21 million) in the year 2022.']


In [60]:
print (chunks[0])

Learning Transferable Visual Models From Natural Language Supervision


Learning Transferable Visual Models From Natural Language Supervision

Alec Radford * 1 Jong Wook Kim * 1 Chris Hallacy 1

Aditya Ramesh 1 Gabriel Goh 1 Sandhini Agarwal 1

Girish Sastry 1 Amanda Askell 1 Pamela Mishkin 1 Jack Clark 1 Gretchen Krueger 1 Ilya Sutskever 1

Abstract
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.

This restricted form of super- vision limits their generality and usability since additional labeled data is needed to specify any other visual concept.

Learning directly from raw
text about images is a promising alternative which
leverages a much broader source of supervision.


We demonstrate that the simple pre-training task
of predicting which caption goes with which im-
age is an efficient and scalable way to learn SOTA
image representations from scratch on a dataset
of 400 million (image, text) pairs collected from
the in

**Learning**:
1. Notice that the last four "sentences" exceed the token limit of the model (`bert-base-uncased`), which is 512 tokens. In spite of that, we got good results! (Most likely the embedder truncates any text beyond its maximum sequence length, so only the beginning of each long document is embedded.)

2. (In the other notebook) Fixed-size paragraph chunking did not work: the size limit is character-based (say, 256), but the splitter breaks only at paragraph boundaries, so any paragraph longer than 256 characters comes out as an oversized chunk. For example:
    - Created a chunk of size 588, which is longer than the specified 256
    - Created a chunk of size 781, which is longer than the specified 256
    - Created a chunk of size 710, which is longer than the specified 256

3. How about naive splitting, such as `text.split(".")`? It splits the text into too many chunks, and we lose the context between them (there is no chunk overlap).
4. Recursive chunking may be better (we used it in our code); try it out again here.
5. Maybe it is better to do the chunking with strategy no. 1?
6. spaCy and NLTK, used through LangChain, also do semantic "sentence-based" chunking.
7. We don't yet have a solution for `*.txt` IEEE paper abstracts getting split, as they are laid out in narrow columns.
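To make the recursive idea from the notes above concrete, here is a minimal pure-Python sketch. It is not LangChain's implementation: the function name `recursive_split`, the separator order, and the hard cut used as a last resort are assumptions, and chunk overlap is left out for brevity. The text is split on the coarsest separator first, and only pieces that are still too long are split again with finer separators, so no chunk exceeds the limit.

```python
def recursive_split(text, max_len=256, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most max_len characters, coarse separators first."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    for position, sep in enumerate(separators):
        if sep not in text:
            continue
        # Greedily merge neighbouring parts while they still fit into max_len
        pieces, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= max_len:
                current = candidate
            else:
                if current:
                    pieces.append(current)
                current = part
        if current:
            pieces.append(current)
        # Pieces that are still too long are split again with the finer separators
        finer = separators[position + 1:]
        result = []
        for piece in pieces:
            result.extend(recursive_split(piece, max_len, finer))
        return result
    # No separator applies: fall back to a hard cut every max_len characters
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

# Two short paragraphs followed by one paragraph longer than the limit
sample = "A" * 100 + "\n\n" + "B" * 100 + "\n\n" + "C" * 300
pieces = recursive_split(sample, max_len=256)
print([len(piece) for piece in pieces])
```

Unlike the paragraph-only splitter, the overlong third paragraph is broken up here instead of being emitted as an oversized chunk.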

In [263]:
dfsentences = pd.DataFrame(sentences)

In [273]:
sentences[1]

'\nIt was the best of times, it was the worst of times, \nit was the age of wisdom, it was the age of foolishness, \nit was the epoch of belief, it was the epoch of incredulity, \nit was the season of light, it was the season of darkness, \nit was the spring of hope, it was the winter of despair, \nwe had everything before us, we had nothing before us, \nwe were all going direct to heaven, \nwe were all going direct the other way–in short, \nthe period was so far like the present period, \nthat some of its noisiest authorities insisted on its being received, \nfor good or for evil, in the superlative degree of comparison only.\n'

In [266]:
teststring = str(dfsentences.values[1,0])

In [272]:
teststring

'\nIt was the best of times, it was the worst of times, \nit was the age of wisdom, it was the age of foolishness, \nit was the epoch of belief, it was the epoch of incredulity, \nit was the season of light, it was the season of darkness, \nit was the spring of hope, it was the winter of despair, \nwe had everything before us, we had nothing before us, \nwe were all going direct to heaven, \nwe were all going direct the other way–in short, \nthe period was so far like the present period, \nthat some of its noisiest authorities insisted on its being received, \nfor good or for evil, in the superlative degree of comparison only.\n'

In [268]:
import numpy as np
import spacy

# Load the Spacy model
nlp = spacy.load('en_core_web_sm')

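def process(text):
    """Split text into spaCy sentences and return them with unit-normalized sentence vectors."""
    doc = nlp(text)
    sents = list(doc.sents)
    # Guard against zero-norm vectors (e.g. whitespace-only sentences) to avoid division by zero
    vecs = np.stack([sent.vector / (sent.vector_norm or 1.0) for sent in sents])

    return sents, vecs

def cluster_text(sents, vecs, threshold):
    """Group adjacent sentences; start a new cluster when similarity to the previous sentence drops below threshold."""
    clusters = [[0]]
    for i in range(1, len(sents)):
        # vecs are unit-normalized, so the dot product is the cosine similarity
        if np.dot(vecs[i], vecs[i-1]) < threshold:
            clusters.append([])
        clusters[-1].append(i)
    
    return clusters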

def clean_text(text):
    # Add your text cleaning process here
    return text

# Initialize the clusters lengths list and final texts list
clusters_lens = []
final_texts = []

# Process the chunk
threshold = 0.3
#sents, vecs = process(str(dfsentences.values[0,0]))
sents, vecs = process(teststring)

# Cluster the sentences
clusters = cluster_text(sents, vecs, threshold)

for cluster in clusters:
    cluster_txt = clean_text(' '.join([sents[i].text for i in cluster]))
    cluster_len = len(cluster_txt)
    
    # Skip clusters shorter than 1500 characters (a short input such as `teststring` yields no output)
    if cluster_len < 1500:
        continue
    
    # Check if the cluster is too long
    elif cluster_len > 3000:
        threshold = 0.6
        sents_div, vecs_div = process(cluster_txt)
        reclusters = cluster_text(sents_div, vecs_div, threshold)
        
        for subcluster in reclusters:
            div_txt = clean_text(' '.join([sents_div[i].text for i in subcluster]))
            div_len = len(div_txt)
            
            if div_len < 60 or div_len > 3000:
                continue
            
            clusters_lens.append(div_len)
            final_texts.append(div_txt)
            
    else:
        clusters_lens.append(cluster_len)
        final_texts.append(cluster_txt)

In [269]:
print (final_texts)
# Print each chunk with an index and comment indicating its position in the original text
for i, chunk in enumerate(final_texts):
    print(f"-----Chunk {i+1}: {chunk}\n")

[]


In [92]:
tokens = tokenizer(final_texts)
df=pd.DataFrame(dict(tokens))
#df['input_ids'].iloc[0]
df['Length']=df['input_ids'].apply(len)
# Set display options to show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# Write the DataFrame to a CSV file
df.to_csv('AdjacentSequenceClustering.csv')
df

Unnamed: 0,input_ids,token_type_ids,attention_mask,Length
0,"[101, 2023, 7775, 2433, 1997, 3565, 1011, 4432...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",405
1,"[101, 1006, 2289, 1007, 7645, 2009, 2001, 1343...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",509
2,"[101, 2096, 24404, 8405, 3802, 2632, 1012, 100...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",394
3,"[101, 2057, 17902, 2008, 2054, 2003, 2691, 240...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",35
4,"[101, 2035, 2122, 8107, 2024, 4083, 2013, 3019...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",69
5,"[101, 4083, 2013, 3019, 2653, 2038, 2195, 4022...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",15
6,"[101, 2009, 1521, 1055, 2172, 6082, 2000, 4094...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",57
7,"[101, 2612, 1010, 4725, 2029, 2147, 2006, 3019...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",28
8,"[101, 4083, 2013, 3019, 2653, 2036, 2038, 2019...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",51
9,"[101, 1999, 1996, 2206, 4942, 29015, 2015, 101...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",22
