# Motivation
In this notebook, I'm going to experiment with creating text embeddings using OpenAI's text embedding model. They recently released [an updated version of the embedding model](https://openai.com/blog/new-and-improved-embedding-model/) (called `text-embedding-ada-002`) that is apparently quite good, and I wanted to try it out.

# Setup
The cells below will help to set up the rest of the notebook. 

I'll start by changing my working directory. 

In [1]:
%cd ..

C:\Data\Personal Study\Programming\neural-needle-drop


Now, I'll import some libraries.

In [59]:
# Import statements
import requests
import os
from requests.structures import CaseInsensitiveDict
import numpy as np
import json
from numpy import dot
from numpy.linalg import norm

# Text Embedding API 
I'm reading through [the documentation for OpenAI's text embedding model](https://beta.openai.com/docs/api-reference/embeddings/create), and I wanted to play around with it. I'm going to write a couple of functions for working with this endpoint. Technically, OpenAI *does* have a Python library, but I want to poke around the raw HTTP API myself to make sure I know what I'm working with.

I'll start by writing a function that generates embeddings for the strings in an `input_text_list`.

In [45]:
# Return a list of ndarrays, one embedding per string in input_text_list
# (or a list of None values of the same length if the request fails)
def generate_embeddings(input_text_list):
    
    # Get the OpenAI API key from the environment variables 
    api_key = os.getenv("OPENAI_API_KEY", "")
    
    # Build the API request
    url = "https://api.openai.com/v1/embeddings"
    headers = CaseInsensitiveDict()
    headers["Content-Type"] = "application/json"
    headers["Authorization"] = "Bearer " + api_key
    data = json.dumps({"input": input_text_list, "model": "text-embedding-ada-002"})
    
    # Send the API request
    resp = requests.post(url, headers=headers, data=data)
    
    # If the request was successful, return ndarrays of the embeddings. Otherwise, return None objects 
    if resp.status_code == 200:
        return [np.asarray(data_object['embedding']) for data_object in resp.json()['data']]
    else:
        return [None for _ in input_text_list]
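
For reference, the request body that `generate_embeddings` sends is a single JSON object with an `input` list and a `model` name. Here's a minimal sketch that builds that body without calling the API, so it runs without a key. Note that `build_embedding_payload` is a hypothetical helper name I'm using only for illustration; it isn't part of the function above.

```python
import json

# Hypothetical helper (illustration only): builds the JSON body for the
# embeddings endpoint without sending anything over the network.
def build_embedding_payload(input_text_list, model="text-embedding-ada-002"):
    return json.dumps({"input": input_text_list, "model": model})

# Build and inspect the body for two example strings
payload = build_embedding_payload(["Hey, my name is Trevor.", "Hello, my name is Casey"])
print(payload)
```

Building the body from a dict with `json.dumps` means quotes and special characters in the input strings get escaped properly.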

I'm also going to write a function that calculates the cosine similarity between two ndarrays.

In [60]:
# Return the cosine similarity of two ndarrays
def cosine_sim(a, b):
    return dot(a, b)/(norm(a)*norm(b))
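
As a quick sanity check that doesn't need the API: cosine similarity is the dot product of two vectors divided by the product of their norms, so it should come out at (approximately) 1 for vectors pointing the same way, 0 for perpendicular vectors, and -1 for opposite ones. Here's a minimal sketch using hand-picked toy vectors (made-up inputs, not real embeddings):

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

# Same definition as in the notebook cell above
def cosine_sim(a, b):
    return dot(a, b) / (norm(a) * norm(b))

# Toy vectors chosen by hand for illustration (not real embeddings)
same_direction = cosine_sim(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
perpendicular = cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine_sim(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))

print(same_direction, perpendicular, opposite)
```

Because of floating-point rounding, the first and last values may be a hair off from exactly 1 and -1.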

Next, I'm going to test both functions. I've got a list of strings: I'm going to generate embeddings for them, and then calculate the cosine similarity between the first string and each of the others.

In [62]:
# Generate embeddings for each of the strings in the txt_list
txt_list = ["Hey, my name is Trevor.", "Hello, my name is Trevor.", "Hello, my name is Casey", "Sup bitch I'm T-dawg"]
emb_list = generate_embeddings(txt_list)

# Print the cosine similarity between the first string and the other ones 
for idx, txt in enumerate(txt_list):
    print(f"""Comparing "{txt_list[0]}" to "{txt}"\nCosine Similarity: {cosine_sim(emb_list[0], emb_list[idx])}\n""")

Comparing "Hey, my name is Trevor." to "Hey, my name is Trevor."
Cosine Similarity: 1.0

Comparing "Hey, my name is Trevor." to "Hello, my name is Trevor."
Cosine Similarity: 0.9891420343232956

Comparing "Hey, my name is Trevor." to "Hello, my name is Casey"
Cosine Similarity: 0.8596241625291113

Comparing "Hey, my name is Trevor." to "Sup bitch I'm T-dawg"
Cosine Similarity: 0.8227929935393037



Nice: the similarity drops as the strings drift further from the first one, so everything seems to be working as expected.
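
To compare every string against every other one, not just against the first, the embeddings can be stacked into a matrix and normalised once. The sketch below is only an illustration under a couple of assumptions: `pairwise_cosine_sim` is a hypothetical helper name, and it uses small made-up vectors in place of real embeddings so that it runs without an API key.

```python
import numpy as np

# Hypothetical helper (illustration only): returns an (n, n) matrix whose
# entry [i, j] is the cosine similarity between embedding i and embedding j.
def pairwise_cosine_sim(emb_list):
    emb = np.vstack(emb_list)                                 # shape (n, d)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # unit-length rows
    return unit @ unit.T                                      # dot products of unit vectors

# Made-up toy vectors standing in for real embeddings
toy_embeddings = [
    np.array([1.0, 0.0, 0.0]),
    np.array([1.0, 1.0, 0.0]),
    np.array([0.0, 0.0, 2.0]),
]
sim_matrix = pairwise_cosine_sim(toy_embeddings)
print(np.round(sim_matrix, 3))
```

With the notebook's own data, `pairwise_cosine_sim(emb_list)` should give a 4x4 matrix whose first row matches the similarities printed above, assuming the API request succeeded.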