# Create Index and Import Data

Other notebooks:

* [query data](http://localhost:9999/notebooks/Documents/langchain/pinecode-read-and-write-data.ipynb)

A **transformer model** generates sentence embeddings by encoding a text sequence into dense numeric vectors that represent its meaning in a multidimensional space. These vectors capture semantic information such that texts with similar meanings are close together in that space.
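
As a rough illustration of what "close together" means, the sketch below computes cosine similarity, the same metric the index in this notebook is configured with. The three vectors are made-up 3-dimensional stand-ins, not real model output, and `cosine_similarity` is a helper written for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical "embeddings", invented for illustration only
paris_france = [0.9, 0.1, 0.2]
rome_italy = [0.8, 0.2, 0.3]
banana = [0.1, 0.9, 0.1]

# Vectors pointing in similar directions score closer to 1.0
print(cosine_similarity(paris_france, rome_italy))
print(cosine_similarity(paris_france, banana))
```

Real embeddings from all-MiniLM-L6-v2 have 384 dimensions, but the calculation is the same.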

Transformer models process input text through multiple layers of **self-attention** and **feed-forward neural networks**, enabling them to consider the relationships between all words in a sentence simultaneously. Unlike earlier models (like Word2Vec or GloVe) that assign each word a single fixed vector regardless of context, transformer encoders analyze each token in context, looking both forward and backward, so that the embedding for a word like "bank" differs depending on whether it appears in "river bank" or "bank account".




```
import os

from dotenv import load_dotenv
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec

# Read the key from the environment (.env file) instead of hardcoding a secret
load_dotenv()
YOUR_API_KEY = os.getenv("PINECONE_API_KEY")

pc = Pinecone(api_key=YOUR_API_KEY)

pc.create_index(
    name="capitols-and-countries-2",
    vector_type="dense",
    dimension=384,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    deletion_protection="disabled",
    tags={"environment": "development"}
)

```




In [109]:
lines = []
with open("word-test.v1.txt") as f:
    for line in f:
        # Skip comments and empty lines
        if not line.strip() or line.startswith("//") or line.startswith(":"):
            continue
        parts = line.strip().split()
        # You can combine pairs or treat each word separately
        lines.append(" ".join(parts))  # e.g., 'Athens Greece Baghdad Iraq'
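
The comment in the cell above mentions combining pairs as an alternative to joining each row into one string. A minimal sketch of that alternative follows, assuming every data row holds an even number of words. `parse_analogy_lines` and the sample rows are hypothetical and only mirror the filtering rules used above.

```python
def parse_analogy_lines(raw_lines):
    """Keep only data rows, returning each as a list of (first, second) word pairs."""
    rows = []
    for line in raw_lines:
        stripped = line.strip()
        # Same skip rules as the notebook cell: blanks, // comments, ": section" headers
        if not stripped or stripped.startswith("//") or stripped.startswith(":"):
            continue
        words = stripped.split()
        # Group consecutive words two at a time: [a, b, c, d] -> [(a, b), (c, d)]
        rows.append(list(zip(words[0::2], words[1::2])))
    return rows

# Illustrative input, not copied from the real file
sample = [
    "// a comment line\n",
    ": capital-common-countries\n",
    "\n",
    "Athens Greece Baghdad Iraq\n",
]
print(parse_analogy_lines(sample))
```

Each pair could then be embedded on its own, at the cost of doubling the number of vectors.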



The `model.encode` call used later in this notebook handles tokenization, attention processing, and mean pooling automatically, returning one 384-dimensional embedding per input line that captures the semantic essence of that line.

**Mean pooling**: Sentence Transformers models such as all-MiniLM-L6-v2 average the token embeddings, using the attention mask to exclude padding tokens, and so produce a single fixed-size embedding that reflects the overall semantic content of the sentence.
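
The pooling step can be sketched in plain Python. This is an illustration of the idea, not the library's implementation: `mean_pool` is a hypothetical helper, and the token vectors are made-up 2-dimensional values.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1; padding (mask 0) is ignored."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against dividing by zero for an all-padding sequence
    count = max(count, 1)
    return [s / count for s in summed]

# Two real tokens followed by one padding token (values invented for illustration)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

The padding vector does not affect the result because its mask entry is 0.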


In [136]:
!source langchain/bin/activate

/bin/bash: line 1: langchain/bin/activate: No such file or directory


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [139]:
from dotenv import load_dotenv
import os

load_dotenv()  # reads PINECONE_API_KEY from a local .env file
YOUR_API_KEY = os.getenv("PINECONE_API_KEY")
# Do not echo the key here: cell outputs are saved with the notebook

In [120]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")  # Local, fast, decent for general text
embeddings = model.encode(lines)


In [121]:
import pandas as pd

# Quick look at the raw embeddings: one row per line, one column per dimension
df = pd.DataFrame(embeddings)


In [122]:
import pandas as pd
import numpy as np
import json


 
# Build Pinecone-ready DataFrame
df = pd.DataFrame({
    "id" : [f"vec_{i}" for i in range(len(lines))],
    "values" : [np.array(x, dtype=np.float32).tolist() for x in embeddings],
    "metadata": [json.dumps({"text": line}) for line in lines]
})

# Write the records to Parquet; the file is uploaded to S3 in a later cell
df.to_parquet("embeddings.parquet", engine="pyarrow")


In [123]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19544 entries, 0 to 19543
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        19544 non-null  object
 1   values    19544 non-null  object
 2   metadata  19544 non-null  object
dtypes: object(3)
memory usage: 458.2+ KB


In [124]:
!aws s3 cp embeddings.parquet s3://capitols-and-countries/imports/__default__/0.parquet


upload: ./embeddings.parquet to s3://capitols-and-countries/imports/__default__/0.parquet


In [128]:
from pinecone import Pinecone, ImportErrorMode

# Initialize Pinecone client
pc = Pinecone(api_key=YOUR_API_KEY)

# Connect to your index
index = pc.Index("capitols-and-countries-2")

 

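# Remove any records already in the default namespace before re-importing
index.delete_namespace(namespace="__default__")


# Root import directory; the Parquet file uploaded above sits in its __default__/ subdirectory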
s3_uri = "s3://capitols-and-countries/imports"

# Start import
index.start_import(
    uri=s3_uri,
    error_mode=ImportErrorMode.CONTINUE,  # or ABORT to stop on first error
    integration_id="ac88a7fa-1c63-43fe-821e-5fac2deac81c"  # omit if public S3 file
)
 

{
    "id": "15"
}

In [130]:
index.describe_import(id="15")

{
    "id": "15",
    "uri": "s3://capitols-and-countries/imports",
    "status": "Completed",
    "percent_complete": 100.0,
    "records_imported": 19544,
    "created_at": "2025-10-19T07:10:14.248853+00:00",
    "finished_at": "2025-10-19T07:11:10.454674+00:00"
}
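
An import runs asynchronously, so a single `describe_import` call may return before the job has finished. The helper below is a minimal polling sketch, not part of the Pinecone SDK. `wait_for_import` is hypothetical, and the in-progress status names `"Pending"` and `"InProgress"` are assumptions; only `"Completed"` appears in the output above.

```python
import time

def wait_for_import(get_status, poll_seconds=5.0, timeout_seconds=600.0):
    """Call get_status() repeatedly until it returns a state that is not in progress."""
    in_progress = {"Pending", "InProgress"}  # assumed names of non-terminal states
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status not in in_progress:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"import still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)

# Stand-in for the real call, so the sketch runs without a Pinecone account
fake_statuses = iter(["Pending", "InProgress", "Completed"])
final = wait_for_import(lambda: next(fake_statuses), poll_seconds=0.0)
print(final)  # Completed
```

With the index from this notebook, `get_status` might be something like `lambda: index.describe_import(id="15").status`, assuming the returned object exposes `status` as an attribute.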

In [131]:
stats = index.describe_index_stats()
print(stats)

{'dimension': 384,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 19544}},
 'total_vector_count': 19544,
 'vector_type': 'dense'}


In [132]:
import pyarrow.parquet as pq

table = pq.read_table("embeddings.parquet")
print(table.schema)


id: string
values: list<element: double>
  child 0, element: double
metadata: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 601
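
The schema shows the three columns used in this notebook: a string `id`, a list of numbers in `values`, and `metadata` stored as a JSON string. The sketch below checks one record against that layout before writing the Parquet file. `validate_record` is a hypothetical helper, and its rules are inferred from this notebook rather than taken from Pinecone's documentation.

```python
import json

EXPECTED_DIMENSION = 384  # matches the index created earlier in this notebook

def validate_record(record, dimension=EXPECTED_DIMENSION):
    """Return a list of problems found in one id/values/metadata record."""
    problems = []
    if not isinstance(record.get("id"), str) or not record.get("id"):
        problems.append("id must be a non-empty string")
    values = record.get("values")
    if not isinstance(values, list) or len(values) != dimension:
        problems.append(f"values must be a list of {dimension} numbers")
    elif not all(isinstance(v, (int, float)) for v in values):
        problems.append("values must contain only numbers")
    try:
        if not isinstance(json.loads(record.get("metadata", "")), dict):
            problems.append("metadata must decode to a JSON object")
    except (TypeError, json.JSONDecodeError):
        problems.append("metadata must be a JSON string")
    return problems

good = {
    "id": "vec_0",
    "values": [0.0] * 384,
    "metadata": json.dumps({"text": "Athens Greece Baghdad Iraq"}),
}
bad = {"id": "vec_1", "values": [0.0] * 3, "metadata": "not json"}

print(validate_record(good))  # []
print(validate_record(bad))
```

Running a check like this over `df.to_dict("records")` before upload could catch malformed rows earlier than the import job would.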
