To create a semantic search indexing function using Chonkie and doc2vec, follow these steps:

1. **Install necessary libraries**:
   - `Chonkie` for chunking text.
   - `gensim` for doc2vec.
   - `PyMuPDF` for importing PDF files.

2. **Import the PDF file and extract text**.
3. **Chunk the text into semantically similar chunks**.
4. **Encode the chunks into a vector space using doc2vec**.

Here's a step-by-step implementation:

### Step 1: Install necessary libraries


In [None]:
pip install "chonkie[semantic]" gensim pymupdf



### Step 2: Import the PDF file and extract text


In [None]:
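import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Return the text of every page in the PDF as one string."""
    # The context manager closes the file even if extraction fails.
    with fitz.open(pdf_path) as doc:
        # Join once instead of repeated string concatenation.
        return "".join(page.get_text() for page in doc)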
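
Text extracted from a PDF often carries layout artifacts, such as words hyphenated across line breaks and hard-wrapped lines, which can distort chunk boundaries. The following is a minimal cleanup sketch using only the standard library. The helper name `normalize_pdf_text` and its rules are assumptions made for illustration; they are not part of PyMuPDF.

```python
import re

def normalize_pdf_text(text: str) -> str:
    """Hypothetical helper: tidy raw PDF text before chunking."""
    # Rejoin words split by a hyphen at a line break: "seman-\ntic" -> "semantic".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Keep paragraph breaks (blank lines) but unwrap single newlines.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Note that the first rule also merges genuine hyphenated compounds that happen to break at a line end, so it may need adjusting for some documents.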



### Step 3: Chunk the text into semantically similar chunks


In [None]:
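from chonkie import SemanticChunker  # needs the extra: pip install "chonkie[semantic]"

def chunk_text(text):
    """Split text into semantically coherent chunks, returned as plain strings."""
    # Default settings; the constructor also accepts an embedding model and size limits.
    chunker = SemanticChunker()
    # chunk() returns Chunk objects; keep only their text for doc2vec.
    return [chunk.text for chunk in chunker.chunk(text)]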



### Step 4: Encode the chunks into a vector space using doc2vec


In [None]:
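from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def encode_chunks(chunks):
    """Train a doc2vec model on the chunks (a list of strings)."""
    # Each chunk becomes a whitespace-tokenized document tagged with its index.
    tagged_data = [TaggedDocument(words=chunk.split(), tags=[str(i)]) for i, chunk in enumerate(chunks)]
    # min_count=2 ignores words that occur only once in the corpus.
    model = Doc2Vec(vector_size=50, min_count=2, epochs=40)
    model.build_vocab(tagged_data)
    model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)
    return model

def vectorize_chunks(model, chunks):
    """Infer one vector per chunk with the trained model."""
    # In gensim 4, the vectors learned during training are also available as model.dv[str(i)].
    vectors = [model.infer_vector(chunk.split()) for chunk in chunks]
    return vectors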
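
Splitting on whitespace keeps case and punctuation, so `Search`, `search,` and `search` count as three different words, and with `min_count=2` the rarer variants may be dropped from the vocabulary. A slightly more forgiving tokenizer could look like the sketch below. `tokenize` is a hypothetical helper name, and its rules (lowercase ASCII letters and digits, optional apostrophe suffix) are assumptions chosen for illustration.

```python
import re

# Lowercase letters/digits, optionally followed by an apostrophe suffix ("isn't").
_WORD = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")

def tokenize(text: str) -> list[str]:
    """Hypothetical helper: lowercase the text and drop punctuation."""
    return _WORD.findall(text.lower())
```

Whichever tokenizer is used, the same one should be applied when building the `TaggedDocument` objects and when calling `infer_vector`, otherwise training and inference see different vocabularies.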



### Putting it all together


In [None]:
def create_semantic_search_index(pdf_path):
    text = extract_text_from_pdf(pdf_path)
    chunks = chunk_text(text)
    model = encode_chunks(chunks)
    vectors = vectorize_chunks(model, chunks)
    return vectors, model

# Example usage
pdf_path = 'path/to/your/pdf/file.pdf'
vectors, model = create_semantic_search_index(pdf_path)



This code will:
1. Extract text from a PDF file.
2. Chunk the text into semantically similar chunks.
3. Encode the chunks into a vector space using doc2vec.

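You can then use the `vectors` and `model` for semantic search: infer a vector for a query with `model.infer_vector` and rank the chunk vectors by their cosine similarity to it.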
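
As a rough illustration of the lookup step, the sketch below ranks stored vectors against a query vector by cosine similarity. The function names `cosine_similarity` and `rank_chunks` are hypothetical, and the toy 2-D vectors stand in for real doc2vec output. It uses plain Python for clarity; a NumPy version would be faster on large indexes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # A zero vector has no direction; treat it as dissimilar.
    return dot / (norm_a * norm_b)

def rank_chunks(query_vector, vectors, top_k=3):
    """Return (chunk_index, score) pairs for the top_k most similar vectors."""
    scored = [(i, cosine_similarity(query_vector, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 2-D vectors standing in for doc2vec output.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
top = rank_chunks([1.0, 0.1], toy_vectors, top_k=2)
```

With a trained model, the query vector would come from `model.infer_vector(query.split())`. The returned indices refer to positions in the chunk list, so that list needs to be kept alongside the vectors to display the matching text.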