# Project Embeddings Generation Notebook

This notebook loads the combined project data, generates an embedding for each project with OpenAI's text-embedding-3-small model, and saves the DataFrame, with the embeddings added as a new column, to a pickle file.

In [5]:
import pandas as pd
import numpy as np
from openai import OpenAI
from tqdm.notebook import tqdm

# Initialize the OpenAI client
client = OpenAI()

In [6]:
# Load the combined project data
df = pd.read_pickle("../data/horizon_projects.pkl")
print(f"Loaded {len(df)} projects")
df.head()

Loaded 48726 projects


| | id | title | objective | contentUpdateDate | title_length | objective_length |
|---|---|---|---|---|---|---|
| 0 | 101006382 | Mission-Oriented SwafS to Advance Innovation t... | While most SwafS initiatives have contributed ... | 2024-07-22 12:39:54 | 64 | 1448 |
| 1 | 633080 | Monitoring Atmospheric Composition and Climate... | MACC-III is the last of the pre-operational st... | 2022-08-16 16:46:44 | 51 | 1932 |
| 2 | 633212 | Aging Lungs in European Cohorts | This programme of work will advance the unders... | 2023-10-25 16:11:30 | 31 | 1990 |
| 3 | 879534 | The Enterprise Europe Network Baden-Wuerttembe... | BW-KAM 5 will implement tested and tailored in... | 2022-10-28 14:08:00 | 106 | 754 |
| 4 | 743826 | The Enterprise Europe Network Baden-Wuerttembe... | By providing Key Account Management and Enhanc... | 2022-08-15 13:07:16 | 137 | 565 |


In [7]:
def get_embedding(text, model="text-embedding-3-small"):
    """Return the embedding vector for a single text (one API request)."""
    # Replace newlines with spaces before sending the text
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def get_batched_embeddings(texts, model="text-embedding-3-small", batch_size=100):
    """Return one embedding per text, sending up to batch_size texts per request."""
    all_embeddings = []
    for i in tqdm(range(0, len(texts), batch_size)):
        batch = texts[i:i+batch_size]
        batch = [text.replace("\n", " ") for text in batch]
        embeddings = client.embeddings.create(input=batch, model=model).data
        all_embeddings.extend([e.embedding for e in embeddings])
    return all_embeddings
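
The batching loop in `get_batched_embeddings` can be tried out without calling the API if the embedding call is passed in as a parameter. The following is a minimal sketch under that assumption: `embed_in_batches`, `embed_fn`, and `fake_embed` are hypothetical names introduced here for illustration, and `fake_embed` is a toy stand-in, not a real model. In the notebook, the role of `embed_fn` would be played by a small wrapper around `client.embeddings.create`.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[Vector]],  # hypothetical: batch of texts -> batch of vectors
    batch_size: int = 100,
) -> List[Vector]:
    """Embed texts in fixed-size batches, preserving input order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    all_embeddings: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        # Same newline cleanup as the notebook's helpers
        batch = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        vectors = embed_fn(batch)
        # Guard against a silent misalignment between inputs and outputs
        if len(vectors) != len(batch):
            raise RuntimeError("embed_fn returned a different number of vectors than inputs")
        all_embeddings.extend(vectors)
    return all_embeddings


def fake_embed(batch: List[str]) -> List[Vector]:
    # Toy stand-in for the API call: [character count, word count]
    return [[float(len(t)), float(len(t.split()))] for t in batch]


vectors = embed_in_batches(["a b", "c\nd e", "f"], fake_embed, batch_size=2)
print(vectors)
```

Injecting the embedding function keeps the batching logic testable offline and makes it easy to check that the output order matches the input order before spending API calls on the full dataset.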

In [8]:
# Combine title and objective for embedding
df['combined'] = df['title'] + " " + df['objective']

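# Generate embeddings, one API request per row.
# get_batched_embeddings(df.combined.tolist()) would embed the same texts with far fewer requests.
print("Generating embeddings...")
df['ada_embedding'] = df.combined.apply(lambda x: get_embedding(x, model='text-embedding-3-small'))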

# Remove the temporary column
df = df.drop('combined', axis=1)

print(f"Generated embeddings for {len(df)} projects")

Generating embeddings...
Generated embeddings for 48726 projects
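
The concatenation `df['title'] + " " + df['objective']` yields a missing value whenever either field is missing, and `text.replace` would then fail on a non-string. Whether this dataset contains such rows is not shown here, so the following is only a defensive sketch. `combine_fields` is a hypothetical helper introduced for illustration.

```python
from typing import Any


def combine_fields(title: Any, objective: Any) -> str:
    """Join title and objective with a space, skipping missing values.

    Anything that is not a non-empty string (None, float NaN, "") is ignored.
    """
    parts = []
    for value in (title, objective):
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return " ".join(parts)


# A missing objective (NaN, as pandas would represent it) is skipped
combined = combine_fields("Aging Lungs in European Cohorts", float("nan"))
print(combined)
```

In the notebook this could be applied row by row with `df.apply(lambda row: combine_fields(row['title'], row['objective']), axis=1)`. Rows whose combined text comes back empty would need to be skipped or handled separately before requesting embeddings.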


In [9]:
# Verify the embeddings
print(f"Embedding shape: {len(df['ada_embedding'][0])}")
print("\nFirst few values of the first embedding:")
print(df['ada_embedding'][0][:5])

Embedding shape: 1536

First few values of the first embedding:
[0.006005911156535149, 0.03230809420347214, 0.04311935976147652, 0.007731314282864332, -0.027782445773482323]
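
The check above inspects only the first row. A fuller check would confirm that every vector has the expected dimension and, assuming the model returns unit-length vectors (worth confirming against the API documentation), that each norm is close to 1. The following is a minimal sketch on toy data. `check_embeddings` and `l2_norm` are hypothetical helpers, and the two-dimensional vectors are made up for illustration.

```python
import math
from typing import List, Sequence


def l2_norm(vector: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def check_embeddings(
    embeddings: Sequence[Sequence[float]],
    expected_dim: int,
    tol: float = 1e-3,
) -> List[int]:
    """Return the positions of vectors with the wrong dimension or a non-unit norm."""
    bad = []
    for i, vec in enumerate(embeddings):
        if len(vec) != expected_dim or abs(l2_norm(vec) - 1.0) > tol:
            bad.append(i)
    return bad


# Toy data: two valid unit vectors, one with norm != 1, one with the wrong length
toy = [[0.6, 0.8], [1.0, 0.0], [0.5, 0.5], [1.0]]
bad_rows = check_embeddings(toy, expected_dim=2)
print(bad_rows)
```

For the notebook's data, the call would be along the lines of `check_embeddings(df['ada_embedding'].tolist(), expected_dim=1536)`, with an empty list indicating that every row passed.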


In [13]:
# Save the DataFrame with embeddings
output_file = "../data/horizon_projects_embeddings.pkl"
df.to_pickle(output_file)
print(f"DataFrame with embeddings saved to {output_file}")

DataFrame with embeddings saved to ../data/horizon_projects_embeddings.pkl


## Conclusion

This notebook loaded the combined project data, generated embeddings with OpenAI's text-embedding-3-small model, and saved the result. The new pickle file (`../data/horizon_projects_embeddings.pkl`) includes all original fields plus an 'ada_embedding' field containing the 1536-dimensional vector representation of each project's combined title and objective.