# Pinecone

This notebook shows how to use [Pinecone](https://www.pinecone.io/) to index and search for similar papers based on their embeddings.

**Strengths:**
- Managed service with scalable infrastructure, easing deployment and management.
- High performance with low query latency.
- Supports real-time updates and indexing of vectors.
- Native Python and JavaScript libraries.
- Free tier for experimentation.

**Weaknesses:**
- Requires using Pinecone's managed service, limiting deployment flexibility.
- Usage beyond the free tier incurs costs.
- Limited customization options compared to open-source or self-hosted solutions.

In [12]:
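import ast

import numpy as np
import pandas as pd

datafile_path = "data/software_architecture/bib-text_embeddings.csv.csv"
df = pd.read_csv(datafile_path)
# Each embedding is stored as a stringified list; parse it safely
# (literals only, no code execution) and convert it to a NumPy array.
df["embedding"] = df.embedding.apply(ast.literal_eval).apply(np.array)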
df

Unnamed: 0,author,title,doi,classes,url,abstract,embedding
0,Alessio Bucaioni and Patrizio Pelliccione and ...,Aligning Architecture with Business Goals in t...,10.1109/ICSA51549.2021.00020,"Meta Data{Research Level{Primary Research}, Ki...",https://doi.org/10.1109/ICSA51549.2021.00020,When designing complex automotive systems in p...,"[0.00016247970052063465, -0.006258538924157619..."
1,H{\'{e}}ctor Cadavid and Vasilios Andrikopoulo...,System- and Software-level Architecting Harmon...,10.1109/ICSA51549.2021.00010,"Meta Data{Kind{full}, Paper class{Evaluation R...",https://doi.org/10.1109/ICSA51549.2021.00010,The problems caused by the gap between system-...,"[0.01373602356761694, -0.007272415794432163, 0..."
2,Joshua Garcia and Mehdi Mirakhorli and Lu Xiao...,Constructing a Shared Infrastructure for Softw...,10.1109/ICSA51549.2021.00022,"Meta Data{Paper class{Evaluation Research}, Re...",https://doi.org/10.1109/ICSA51549.2021.00022,Over the past three decades software engineeri...,"[0.0022478506434708834, -0.0009823687141761184..."
3,Holger Knoche and Wilhelm Hasselbring,Continuous {API} Evolution in Heterogenous Ent...,10.1109/ICSA51549.2021.00014,"Meta Data{Research Level{Primary Research}, Ki...",https://doi.org/10.1109/ICSA51549.2021.00014,The ability to independently deploy parts of a...,"[0.0058586616069078445, -0.031336553394794464,..."
4,Duc Minh Le and Suhrid Karthik and Marcelo Sch...,Architectural Decay as Predictor of Issue- and...,10.1109/ICSA51549.2021.00017,"Meta Data{Paper class{Evaluation Research}, Re...",https://doi.org/10.1109/ICSA51549.2021.00017,Architectural decay imposes real costs in term...,"[0.015215540304780006, -0.02269367128610611, 0..."
...,...,...,...,...,...,...,...
148,"Keim, Jan and Schulz, Sophie and Fuch{\ss}, Do...",Trace {Link} {Recovery} for {Software} {Archit...,10.1007/978-3-030-86044-8_7,"Meta Data{Research Level{Primary Research}, Ki...",,Software Architecture Documentation often cons...,"[0.00465840520337224, 0.004803537856787443, 0...."
149,"Shabelnyk, Oleksandr and Frangoudis, Pantelis ...",Updating {Service}-{Based} {Software} {Systems...,10.1007/978-3-030-86044-8_10,"Meta Data{Paper class{Proposal of Solution}, R...",,Contemporary component-based systems often man...,"[-0.004098787903785706, -0.006373200099915266,..."
150,Stefan Kugele and David Hettler and Jan Peter,Data-Centric Communication and Containerizatio...,10.1109/ICSA.2018.00016,"Meta Data{Kind{full}, Research Level{Primary R...",https://doi.org/10.1109/ICSA.2018.00016,Context: The functional interconnection and da...,"[-0.011806309223175049, 0.010385949164628983, ..."
151,Banani Roy and Amit Kumar Mondal and Chanchal ...,Towards a Reference Architecture for Cloud-Bas...,10.1109/ICSA.2017.42,"Meta Data{Paper class{Validation Research, Eva...",https://doi.org/10.1109/ICSA.2017.42,The domain of plant genotyping and phenotyping...,"[-0.00909524504095316, -0.012267849408090115, ..."
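
The `embedding` column is stored in the CSV as text, so each cell has to be parsed back into a list of floats before it can be indexed. The following is a minimal sketch of a stricter parser, not part of the original notebook: `parse_embedding` and its `expected_dim` parameter are hypothetical names, and the sample strings are toy data. It relies on `ast.literal_eval`, which accepts only Python literals.

```python
import ast


def parse_embedding(text, expected_dim=None):
    """Parse a stringified list of numbers into a list of floats.

    Hypothetical helper: rejects anything that is not a flat list of numbers
    and, optionally, anything whose length differs from expected_dim.
    """
    values = ast.literal_eval(text)  # literals only; raises on arbitrary expressions
    if not isinstance(values, list) or not all(
        isinstance(x, (int, float)) for x in values
    ):
        raise ValueError("embedding must be a flat list of numbers")
    if expected_dim is not None and len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, got {len(values)}")
    return [float(x) for x in values]


# Toy input standing in for one cell of the "embedding" column
parsed = parse_embedding("[0.1, -0.2, 3]", expected_dim=3)
print(parsed)  # [0.1, -0.2, 3.0]
```

Passing `expected_dim` (for example 1536 for this dataset) surfaces truncated or malformed rows at load time rather than when the vectors are sent to the index.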


In [13]:
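import os

import pinecone

# Read the API key from an environment variable instead of hard-coding a
# secret in the notebook (the variable name PINECONE_API_KEY is a convention).
pinecone.init(
    environment="asia-southeast1-gcp-free",
    api_key=os.environ["PINECONE_API_KEY"],
)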

In [17]:
embeddings = np.array(df["embedding"].tolist())
print("Embeddings shape:", embeddings.shape)

# Create the Pinecone index only if it does not already exist
if "software-architecture-classes" not in pinecone.list_indexes():
    print("Creating index...")
    pinecone.create_index(
        "software-architecture-classes",
        dimension=embeddings.shape[1],
        metric="euclidean",
    )

print("indexes:", pinecone.list_indexes())

Embeddings shape: (153, 1536)
indexes: ['software-architecture-classes']
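
The index is created with `metric="euclidean"`. Whether this ranks neighbours differently from cosine similarity depends on the embeddings. As a hedged illustration, assuming the vectors are unit length (the notebook does not state this), the squared Euclidean distance equals `2 - 2 * cosine`, so both metrics yield the same ordering. The sketch below checks this on random toy vectors; all names and data are illustrative.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    # Inputs are assumed to be unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
# Toy stand-ins for embeddings (hypothetical data, 20 vectors of dimension 8)
vectors = [normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(20)]
query = vectors[0]

# Rank all vectors by increasing distance and by decreasing similarity
by_euclidean = sorted(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))
by_cosine = sorted(range(len(vectors)), key=lambda i: -cosine(query, vectors[i]))

print(by_euclidean == by_cosine)
```

If the embeddings are not normalized, the two metrics can disagree, and the choice of metric should follow whatever the embedding model recommends.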


In [19]:
index = pinecone.Index("software-architecture-classes")
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [32]:
# Each vector is an (id, values) tuple; the paper's DOI serves as the vector ID
vectors = [
    (doi, embedding.tolist())
    for doi, embedding in zip(df["doi"], df["embedding"])
]

index.upsert(vectors)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 153}},
 'total_vector_count': 153}
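
All 153 vectors fit in a single `upsert` call here. For larger collections, a common approach is to send the vectors in batches. The following is a minimal sketch of such batching, assuming a hypothetical helper `chunked` and an illustrative batch size of 100; it uses toy tuples and does not contact Pinecone.

```python
from itertools import islice


def chunked(items, batch_size):
    """Yield successive lists of at most batch_size items (hypothetical helper)."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Toy (id, values) tuples standing in for the notebook's `vectors` list
toy_vectors = [(f"id-{i}", [float(i), 0.0]) for i in range(153)]
batch_sizes = [len(batch) for batch in chunked(toy_vectors, 100)]
print(batch_sizes)  # [100, 53]
```

In the notebook, the same helper could be used as `for batch in chunked(vectors, 100): index.upsert(batch)`.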

In [31]:
example = df.iloc[0]

matches = index.query(
    vector=example["embedding"].tolist(),
    top_k=4,  # 3 neighbours plus the query paper itself, which is skipped below
    include_values=False  # we don't need the actual embeddings
)

print("Title:", example["title"])
print("Similar papers:")
for match in matches["matches"]:
    doi = match.id
    if doi == example["doi"]:
        continue

    print("-", df[df["doi"] == doi]["title"].values[0])

Title: Aligning Architecture with Business Goals in the Automotive Domain
Similar papers:
- Technical Architectures for Automotive Systems
- On Service-Orientation for Automotive Software
- On Interfaces to Support Agile Architecting in Automotive: An Exploratory Case Study
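
With only 153 papers, the results of the query can be cross-checked locally with an exhaustive search. The following is a minimal sketch of brute-force nearest-neighbour ranking by Euclidean distance; `top_k_euclidean` is a hypothetical helper and the vectors are toy data, not the paper embeddings.

```python
import math


def top_k_euclidean(query, candidates, k):
    """Return the k (id, distance) pairs closest to query, nearest first.

    candidates is a list of (id, values) tuples, the same shape as the
    tuples passed to upsert in the notebook.
    """
    scored = [(vid, math.dist(query, values)) for vid, values in candidates]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy 2-D vectors standing in for the 1536-dimensional embeddings
toy = [("a", [0.0, 0.0]), ("b", [1.0, 0.0]), ("c", [0.0, 3.0]), ("d", [5.0, 5.0])]
neighbours = top_k_euclidean([0.0, 0.0], toy, k=3)
print([vid for vid, _ in neighbours])  # ['a', 'b', 'c']
```

Applied to the notebook's `vectors` list with the first paper's embedding as the query, the returned IDs should match the order of the DOIs returned by `index.query`, assuming the index performs an exact or near-exact search at this scale.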
