## Pulling back the curtains on `text2vec`

How does Weaviate concatenate & vectorize data?
(See the companion article on the [Weaviate blog](https://weaviate.io/blog.html))

### Set up Weaviate

We'll use the Cohere API here for embeddings. By default, the `text2vec-cohere` module reads the API key from the `COHERE_APIKEY` environment variable. It needs to be set before running `docker-compose up`, for example by running `export COHERE_APIKEY=my-key-here`.

Here, we will instead read the key from a file and send it with every request as a header.

Note: At the time of writing (December 2022), Cohere offers a trial API key which is free, but rate-limited at 100 calls per minute.

In [1]:
with open("../../keys/cohere_trial_key.txt", "r") as f:
    cohere_key = f.read().strip()

In [2]:
import weaviate

In [3]:
client = weaviate.Client(
    # url="http://127.0.0.1:8080",  # If using a Docker-based instance
    url="https://YOUR_WCS_INSTANCE.semi.network",  # If using a WCS instance
    additional_headers={
        'X-Cohere-Api-Key': cohere_key
    }    
)
client.schema.delete_all()  # Delete data in case previous data exists
client.schema.get()

{'classes': []}
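If the key lives in the `COHERE_APIKEY` environment variable rather than in a file, the header could be built as in the minimal sketch below. The helper name `cohere_headers` is hypothetical; only the variable name and the `X-Cohere-Api-Key` header come from this notebook.

```python
import os


def cohere_headers(env=None):
    """Build the additional_headers dict for weaviate.Client (hypothetical helper)."""
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    key = env.get("COHERE_APIKEY", "").strip()
    if not key:
        raise KeyError("COHERE_APIKEY is not set")
    return {"X-Cohere-Api-Key": key}
```

The result can be passed as `additional_headers=cohere_headers()` when creating the client.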

### Load data

Load our tiny dataset of questions from `Jeopardy!`. The original dataset is from [Kaggle](https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions).

In [4]:
import pandas as pd
df = pd.read_csv('tiny_jeopardy.csv')
df

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect sh...",the ant
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,Built in 312 B.C. to link Rome & the South of ...,the Appian Way
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inche...",Washington
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packag...,Crate & Barrel


### Add data to Weaviate

#### Use batch import

In [5]:
client.batch.configure(
    batch_size=100,
    dynamic=True,
    timeout_retries=3,
    callback=None
)

<weaviate.batch.crud_batch.Batch at 0x118866250>

Iterate through the data and add each row to the batch for vectorization:

In [6]:
for i, row in df.iterrows():
    properties = {
        "question": row.Question,
        "answer": row.Answer
    }
    client.batch.add_data_object(properties, "Question")
client.batch.flush()

### Basic queries

Now that we have populated Weaviate, let's query it to make sure it works.

Let's grab any two objects.

In [7]:
client.query.get("Question", ["question", "answer"]).with_limit(2).do()

{'data': {'Get': {'Question': [{'answer': 'Jim Thorpe',
     'question': 'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'},
    {'answer': 'Arizona',
     'question': 'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'}]}}}

Let's get objects closest to "fast food chains". We can use the `nearText` query since we are using the `text2vec-cohere` module.

In [8]:
near_text = {
    "concepts": ["fast food chains"]
}
client.query.get(
    "Question", 
    ["question", "answer"]
).with_limit(2).with_near_text(
    near_text
).do()

{'errors': [{'locations': [{'column': 24, 'line': 1}],
   'message': 'Unknown argument "nearText" on field "Question" of type "GetObjectsObj". Did you mean "nearVector" or "nearObject"?',
   'path': None}]}
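A GraphQL response like the one above carries an `errors` key instead of `data`, so indexing `resp['data']` directly would raise a `KeyError`. A minimal sketch of a defensive helper follows; `extract_results` is a hypothetical name, not part of the Weaviate client.

```python
def extract_results(resp, class_name):
    """Return the objects of a Get response, or raise if the response carries errors."""
    if "errors" in resp:
        # Surface the server-side messages instead of failing later with a KeyError.
        messages = "; ".join(
            err.get("message", "unknown error") for err in resp["errors"]
        )
        raise RuntimeError(f"Weaviate query failed: {messages}")
    return resp["data"]["Get"][class_name]
```

Calling `extract_results(resp, "Question")` before inspecting individual objects makes a failed query visible at the point where it happened.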

Let's also retrieve two additional attributes: the distance to the query and the object's vector.

In [9]:
resp = client.query.get(
    "Question", 
    ["question", "answer"]
).with_limit(2).with_near_text(
    near_text
).with_additional(['distance', 'vector']).do()

Let's inspect the response further.

In [10]:
print(resp['data']['Get']['Question'][0]['_additional']['distance'])
resp_vect = resp['data']['Get']['Question'][0]['_additional']['vector']
print(len(resp_vect))
print(resp_vect[:2], "...", resp_vect[-2:])

KeyError: 'data'

We can confirm that the distance is reasonably small, and that a 768-dimensional vector is returned, with the shown dimensions.
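For intuition about what the `distance` value measures, here is a minimal sketch of cosine distance in plain Python. It assumes the class uses cosine distance, which is Weaviate's default metric unless the schema configures another one. `cosine_distance` is a hypothetical helper, not part of the Weaviate client.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

Under this metric, identical directions give a distance of 0 and orthogonal vectors give 1, so a small value means the two texts were embedded close together.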

### Vectorization

But what is the exact text that is being vectorized?

Let's try to reproduce the same vector using the Cohere API only.

We can instantiate the Cohere client like so:

In [None]:
import cohere
co = cohere.Client(cohere_key)

Let's try to re-build the vector for the question about McDonald's.

In [None]:
input_props = resp['data']['Get']['Question']
input_props = {k:input_props[0][k] for k in ['question', 'answer']}
input_props

Let's start with a string that simply joins the values of the dictionary:

In [None]:
str_in = ' '.join(input_props.values())
str_in

And convert the string to a vector. Note that `multilingual-22-12` is the default model. 

In [None]:
co_response = co.embed([str_in], model="multilingual-22-12")

The response looks like this:

In [None]:
co_response

We can see that the response includes one embedding:

In [None]:
len(co_response.embeddings)

And it is 768 dimensions long:

In [None]:
len(co_response.embeddings[0])

Let's see how close this embedding is to the Weaviate embedding:

In [None]:
client.query.get(
    "Question", 
    ["question", "answer"]
).with_limit(2).with_near_vector(
    {'vector': co_response.embeddings[0]}
).with_additional(['distance']).do()

Not bad! But they're not identical. Why is that? Weaviate actually sorts the text to be concatenated by property name, in alphabetical order.

In [None]:
str_in = ''
for k in sorted(input_props.keys()):
    v = input_props[k]
    if isinstance(v, str):
        str_in += v + ' '
str_in = str_in.strip()  # remove trailing whitespace
str_in

In [None]:
co_response = co.embed([str_in], model="multilingual-22-12")
wv_response = client.query.get(
    "Question", 
    ["question", "answer"]
).with_limit(2).with_near_vector(
    {'vector': co_response.embeddings[0]}
).with_additional(['distance']).do()
wv_response

We are **so very close**, but the input vector is still not identical to the Weaviate-generated one! Why is that?

That's because the default behaviour also includes the class name. Let's add the class name.

In [None]:
str_in = ''
for k in sorted(input_props.keys()):
    v = input_props[k]
    if isinstance(v, str):
        str_in += v + ' '
str_in = 'question ' + str_in.strip()  # prepend the class name; remove trailing whitespace

In [None]:
str_in

In [None]:
co_response = co.embed([str_in], model="multilingual-22-12")
client.query.get(
    "Question", 
    ["question", "answer"]
).with_limit(2).with_near_vector(
    {'vector': co_response.embeddings[0]}
).with_additional(['distance']).do()

It's very close, but not quite zero yet.

Weaviate also changes the text to lowercase, so let's incorporate that also.

In [None]:
str_in = ''
for k in sorted(input_props.keys()):
    v = input_props[k]
    if isinstance(v, str):
        str_in += v + ' '
str_in = 'question ' + str_in.lower().strip()  # prepend the class name; lowercase; remove trailing whitespace

In [None]:
str_in

In [None]:
co_response = co.embed([str_in], model="multilingual-22-12")
client.query.get(
    "Question", 
    ["question", "answer"]
).with_limit(2).with_near_vector(
    {'vector': co_response.embeddings[0]}
).with_additional(['distance']).do()

Look! We have an identical vector.
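The steps found so far can be collected into one function. This is a minimal sketch that mirrors the experiments in this notebook (sort by property name, keep string values, prepend the class name, lowercase); it is not Weaviate's actual implementation, and `default_vectorizer_input` is a hypothetical name.

```python
def default_vectorizer_input(class_name, props):
    """Approximate the text vectorized under the default settings explored above."""
    # Keep only string values, ordered alphabetically by property name.
    values = [props[k] for k in sorted(props) if isinstance(props[k], str)]
    # Prepend the class name, join with single spaces, then lowercase everything.
    return " ".join([class_name] + values).lower().strip()
```

For example, `default_vectorizer_input("Question", input_props)` should yield the same string as the cell above, ready to pass to `co.embed`.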

### Tweaking of vectorization

Now that we know the default behaviour, let's try tweaking it to do something different. 

Did you notice that in the exercise above, we *hadn't* set a schema before importing the data? What happened was that Weaviate created a schema for us with its auto-schema feature.

Let's define a schema. Incidentally, this is where we define what properties are to be vectorized (or not).

In [None]:
client.schema.delete_all()
client.schema.get()

In [None]:
question_class = {
    "class": "Question",
    "description": "Details of a Jeopardy! question",    
    "moduleConfig": {
        "text2vec-cohere": {  # The vectorizer name - must match the vectorizer used
            "vectorizeClassName": False,  # Ignore class name
        },
    },
    "properties": [
        {
            "name": "answer",
            "description": "What the contestant is to provide.",
            "dataType": ["string"],
            "moduleConfig": {
                "text2vec-cohere": {
                    "skip": False,  # Do not skip this property
                    "vectorizePropertyName": False  # Ignore property name
                }
            }
        },
        {
            "name": "question",
            "description": "What the host prompts the contestants with.",
            "dataType": ["string"],
            "moduleConfig": {
                "text2vec-cohere": {
                    "skip": False,  # Do not skip this property
                    "vectorizePropertyName": True  # Include property name
                }
            }
        },
    ]
}

client.schema.create_class(question_class)
client.schema.get()

Populate Weaviate again, this time using the new schema:

In [None]:
for i, row in df.iterrows():
    properties = {
        "question": row.Question,
        "answer": row.Answer
    }
    client.batch.add_data_object(properties, "Question")
client.batch.flush()
client.data_object.get()

Now let's take another look at the vector, comparing the distance to the nearest hit:

In [None]:
str_in = ''
for k in sorted(input_props.keys()):
    v = input_props[k]
    if isinstance(v, str):
        str_in += v + ' '
str_in = str_in.lower().strip()  # lowercase; remove trailing whitespace
str_in = 'question ' + str_in  # prepend the class name, as under the default settings

In [None]:
co_response = co.embed([str_in], model="multilingual-22-12")
client.query.get(
    "Question", 
    ["question", "answer"]
).with_limit(2).with_near_vector(
    {'vector': co_response.embeddings[0]}
).with_additional(['distance']).do()

The distance is no longer zero!

That's because we have changed the schema: the class name is no longer vectorized, while the name of the `question` property now is.

In [None]:
str_in = ''
for k in sorted(input_props.keys()):
    v = input_props[k]
    if isinstance(v, str):
        if k == 'question':
            str_in += k + ' '  # only `question` has vectorizePropertyName set to True
        str_in += v + ' '
str_in = str_in.lower().strip()  # lowercase; remove trailing whitespace

In [None]:
str_in

In [None]:
co_response = co.embed([str_in], model="multilingual-22-12")
client.query.get(
    "Question", 
    ["question", "answer"]
).with_limit(2).with_near_vector(
    {'vector': co_response.embeddings[0]}
).with_additional(['distance']).do()

Adding the property name brings our `distance` back to zero.
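To experiment with other schema settings, the string-building logic can be parameterized. The sketch below is an approximation based only on the behaviour observed in this notebook, not Weaviate's own code. The function and parameter names are hypothetical, and it does not model the `skip` option.

```python
def vectorizer_input(class_name, props, vectorize_class_name=True,
                     property_name_flags=None):
    """Approximate the vectorized text for given schema settings (hypothetical helper).

    property_name_flags maps a property name to its vectorizePropertyName value;
    properties missing from the mapping are treated as False.
    """
    flags = property_name_flags or {}
    parts = [class_name] if vectorize_class_name else []
    for name in sorted(props):  # properties are ordered alphabetically by name
        value = props[name]
        if not isinstance(value, str):
            continue  # only string values are concatenated
        if flags.get(name, False):
            parts.append(name)  # include the property name before its value
        parts.append(value)
    return " ".join(parts).lower().strip()
```

With the schema defined above, the matching call would be `vectorizer_input("Question", input_props, vectorize_class_name=False, property_name_flags={"question": True})`.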