## Pulling back the curtains on `text2vec`

How does Weaviate concatenate & vectorize data?
(See the companion article on the [Weaviate blog](https://weaviate.io/blog.html))

### Set up Weaviate

We'll use the Cohere API here for embeddings. By default, the `text2vec-cohere` module reads the API key from the `COHERE_APIKEY` environment variable, which must be set before running `docker-compose up`, for example with `export COHERE_APIKEY=my-key-here`.

Here, we will instead read the key from a file and send it with every request as a header.

Note: At the time of writing (December 2022), Cohere offers a trial API key which is free, but rate-limited at 100 calls per minute.

In [1]:
# Read/set your Cohere API key here
with open("../../keys/cohere_trial_key.txt", "r") as f:  
    cohere_key = f.read().strip()
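
If you would rather not keep the key in a file, the following is a minimal sketch that looks in the environment first and only then falls back to a file. The helper name `load_cohere_key` and its parameters are hypothetical and not part of the Weaviate or Cohere clients.

```python
import os


def load_cohere_key(env_var="COHERE_APIKEY", fallback_path=None):
    """Return the API key from the environment, else from an optional file.

    Hypothetical helper for illustration only.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    if fallback_path is not None:
        with open(fallback_path, "r") as f:
            return f.read().strip()
    raise RuntimeError(f"No API key found in ${env_var} or in a fallback file")


# Example usage (commented out so the sketch runs without a key present):
# cohere_key = load_cohere_key(fallback_path="../../keys/cohere_trial_key.txt")
```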

In [2]:
import weaviate

In [3]:
client = weaviate.Client(
    url="http://127.0.0.1:8080",  # If using a Docker-based instance
    # url="https://YOUR_WCS_INSTANCE.semi.network",  # If using a WCS instance
    additional_headers={
        'X-Cohere-Api-Key': cohere_key
    }    
)
client.schema.delete_all()  # Delete data in case previous data exists
client.schema.get()

{'classes': []}

### Load data

Load our tiny dataset of questions from `Jeopardy!`. The original dataset is from [Kaggle](https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions).

In [4]:
import pandas as pd
df = pd.read_csv('tiny_jeopardy.csv')
df

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect sh...",the ant
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,Built in 312 B.C. to link Rome & the South of ...,the Appian Way
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inche...",Washington
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packag...,Crate & Barrel


### Add data to Weaviate

#### Use batch import

In [5]:
client.batch.configure(
    batch_size=100,
    dynamic=True,
    timeout_retries=3,
    callback=None
)

<weaviate.batch.crud_batch.Batch at 0x1139016d0>

Iterate through data and add to batch for vectorization

In [6]:
for i, row in df.iterrows():
    properties = {
        "question": row.Question,
        "answer": row.Answer
    }
    client.batch.add_data_object(properties, "Question")
client.batch.flush()

### Basic queries

Now that we have populated Weaviate, let's query it to make sure it works.

Let's grab any two objects.

In [7]:
client.query.get("Question", ["question", "answer"]).with_limit(2).do()

{'data': {'Get': {'Question': [{'answer': 'the ant',
     'question': 'In the title of an Aesop fable, this insect shared billing with a grasshopper'},
    {'answer': 'Ceylon (or Sri Lanka)',
     'question': 'In 1000 Rajaraja I of the Cholas battled to take this Indian Ocean island now known for its tea'}]}}}

Let's get objects closest to "fast food chains". We can use the `nearText` query since we are using the `text2vec-cohere` module.

In [8]:
near_text = {
    "concepts": ["fast food chains"]
}
client.query.get(
    "Question", 
    ["question", "answer"]
).with_limit(2).with_near_text(
    near_text
).do()

{'data': {'Get': {'Question': [{'answer': "McDonald's",
     'question': 'In 1963, live on "The Art Linkletter Show", this company served its billionth burger'},
    {'answer': 'the cud',
     'question': 'Cows regurgitate this from the first stomach to the mouth & chew it again'}]}}}

Let's also retrieve two additional attributes: the distance to the query and the object's vector.

In [9]:
wv_resp = client.query.get(
    "Question", 
    ["question", "answer"]
).with_limit(2).with_near_text(
    near_text
).with_additional(['distance', 'vector']).do()

Let's inspect the response further.

In [10]:
print(wv_resp['data']['Get']['Question'][0]['_additional']['distance'])
wv_resp_vect = wv_resp['data']['Get']['Question'][0]['_additional']['vector']
print(len(wv_resp_vect))
print(wv_resp_vect[:2], "...", wv_resp_vect[-2:])

0.11430907
768
[0.24600697, -0.23755248] ... [0.031165963, -0.21798912]


We can confirm that the distance is reasonably small, and that a 768-dimensional vector is returned, whose first and last two values are shown above.
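
To make the `distance` value more concrete, here is a minimal pure-Python sketch of cosine distance (one minus cosine similarity). It assumes the class uses Weaviate's default cosine metric; the function name `cosine_distance` is ours, not a Weaviate API.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-dimensional vectors, for illustration only
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Under this metric, a distance of zero means the two vectors point in exactly the same direction, which is the target we aim for when reproducing the vector later on.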

### Vectorization

But - what is the exact text that is being vectorized?

Let's try to reproduce the same vector using the Cohere API only.

We can instantiate the Cohere client like so:

In [11]:
import cohere
co = cohere.Client(cohere_key)

Let's try to re-build the vector for the question about McDonald's.

In [12]:
input_props = wv_resp['data']['Get']['Question']
input_props = {k:input_props[0][k] for k in ['question', 'answer']}
input_props

{'question': 'In 1963, live on "The Art Linkletter Show", this company served its billionth burger',
 'answer': "McDonald's"}

Let's try with one where we join the values of the dictionary:

In [13]:
str_in = ' '.join(input_props.values())
str_in

'In 1963, live on "The Art Linkletter Show", this company served its billionth burger McDonald\'s'

And convert the string to a vector. Note that `multilingual-22-12` is the default model used by the `text2vec-cohere` module.

In [14]:
co_resp = co.embed([str_in], model="multilingual-22-12")

The response looks like this:

In [15]:
co_resp

cohere.Embeddings {
	embeddings: [[0.22701332, -0.030964054, 0.050336353, 0.049897335, -0.021324128, -0.16091986, 0.39172348, -0.39684454, 0.35449043, 0.28700465, -0.33484453, -0.41574034, -0.1838335, -0.015157565, 0.1193762, 0.13803549, 0.50380415, 0.09734542, -0.29555884, -0.12724696, -0.18610202, 0.6819611, -0.22728056, 0.19520861, -0.09814354, 0.18940791, -0.14462556, 0.1805138, 0.2214307, 0.25091803, 0.39456654, -0.30383578, 0.003720728, 0.6776438, -0.76900494, 0.41417876, 0.09526737, 0.039816678, -0.22038119, 0.6234025, 0.38795778, 0.09976828, 0.21993224, -0.059398625, 0.474929, 0.08870231, 0.0043426524, -0.049599346, -0.4271566, 0.0724841, -0.5687472, -0.27303874, 0.07239083, 0.025778376, -0.6908721, 0.68424636, -0.51627827, 0.485158, -0.15707815, -0.08897833, 0.20728052, -0.30644438, -0.37722677, 0.073214866, 0.3218, 0.06431876, -0.066812545, 0.31766108, 0.30044928, 0.3147846, -0.07034772, 0.10764563, 0.43878776, -0.028267745, -0.1105182, -0.18371834, -0.060697634, 0.13589285, 

We can see that the response includes one embedding:

In [16]:
len(co_resp.embeddings)

1

And it is 768 dimensions long:

In [17]:
len(co_resp.embeddings[0])

768

Let's see how close this embedding is to the Weaviate embedding:

In [18]:
client.query.get(
    "Question", 
    ["question", "answer"]
).with_limit(2).with_near_vector(
    {'vector': co_resp.embeddings[0]}
).with_additional(['distance']).do()

{'data': {'Get': {'Question': [{'_additional': {'distance': 0.018137515},
     'answer': "McDonald's",
     'question': 'In 1963, live on "The Art Linkletter Show", this company served its billionth burger'},
    {'_additional': {'distance': 0.12236416},
     'answer': 'the cud',
     'question': 'Cows regurgitate this from the first stomach to the mouth & chew it again'}]}}}

Not bad! But they're not identical. Why is that? Weaviate actually sorts the properties alphabetically by name before concatenating their values.

In [19]:
str_in = ''
for k in sorted(input_props.keys()):  # alphabetical order of property names
    v = input_props[k]
    if isinstance(v, str):
        str_in += v + ' '
str_in = str_in.strip()  # remove trailing whitespace
str_in

'McDonald\'s In 1963, live on "The Art Linkletter Show", this company served its billionth burger'

In [20]:
co_resp = co.embed([str_in], model="multilingual-22-12")
wv_resp = client.query.get(
    "Question", 
    ["question", "answer"]
).with_limit(2).with_near_vector(
    {'vector': co_resp.embeddings[0]}
).with_additional(['distance']).do()
wv_resp

{'data': {'Get': {'Question': [{'_additional': {'distance': 0.014731407},
     'answer': "McDonald's",
     'question': 'In 1963, live on "The Art Linkletter Show", this company served its billionth burger'},
    {'_additional': {'distance': 0.11268568},
     'answer': 'the cud',
     'question': 'Cows regurgitate this from the first stomach to the mouth & chew it again'}]}}}

We are **so very close**, but the input vector is still not identical to the Weaviate-generated one! Why is that?

That's because the default behaviour also includes the class name. Let's add it.

In [21]:
str_in = ''
for k in sorted(input_props.keys()):
    v = input_props[k]
    if isinstance(v, str):
        str_in += v + ' '
# Remove trailing whitespace, then prepend the class name
str_in = 'question ' + str_in.strip()

In [22]:
str_in

'question McDonald\'s In 1963, live on "The Art Linkletter Show", this company served its billionth burger'

In [23]:
co_resp = co.embed([str_in], model="multilingual-22-12")
client.query.get(
    "Question", 
    ["question", "answer"]
).with_limit(2).with_near_vector(
    {'vector': co_resp.embeddings[0]}
).with_additional(['distance']).do()

{'data': {'Get': {'Question': [{'_additional': {'distance': 0.007935405},
     'answer': "McDonald's",
     'question': 'In 1963, live on "The Art Linkletter Show", this company served its billionth burger'},
    {'_additional': {'distance': 0.10493815},
     'answer': 'the cud',
     'question': 'Cows regurgitate this from the first stomach to the mouth & chew it again'}]}}}

It's very close, but not quite zero yet.

Weaviate also converts the text to lowercase, so let's incorporate that as well.

In [24]:
str_in = ''
for k in sorted(input_props.keys()):
    v = input_props[k]
    if isinstance(v, str):
        str_in += v + ' '
# Lowercase, remove trailing whitespace, then prepend the class name
str_in = 'question ' + str_in.lower().strip()

In [25]:
str_in

'question mcdonald\'s in 1963, live on "the art linkletter show", this company served its billionth burger'

In [26]:
co_resp = co.embed([str_in], model="multilingual-22-12")
client.query.get(
    "Question", 
    ["question", "answer"]
).with_limit(2).with_near_vector(
    {'vector': co_resp.embeddings[0]}
).with_additional(['distance']).do()

{'data': {'Get': {'Question': [{'_additional': {'distance': 1.7881393e-07},
     'answer': "McDonald's",
     'question': 'In 1963, live on "The Art Linkletter Show", this company served its billionth burger'},
    {'_additional': {'distance': 0.105109274},
     'answer': 'the cud',
     'question': 'Cows regurgitate this from the first stomach to the mouth & chew it again'}]}}}

Look! The distance is now effectively zero, so we have reproduced the Weaviate vector.
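
The steps discovered so far can be collected into one small helper. This is only a sketch of the behaviour observed in this notebook (sort by property name, keep string values, prepend the class name, lowercase); the function name `default_vectorizer_input` is hypothetical, and Weaviate may apply further rules that we did not encounter here.

```python
def default_vectorizer_input(class_name, props):
    """Rebuild the text we observed being vectorized under default settings."""
    parts = [class_name]
    for name in sorted(props):  # alphabetical order of property names
        value = props[name]
        if isinstance(value, str):  # only text values are concatenated
            parts.append(value)
    return " ".join(parts).lower().strip()


example_props = {
    "question": 'In 1963, live on "The Art Linkletter Show", '
                "this company served its billionth burger",
    "answer": "McDonald's",
}
print(default_vectorizer_input("Question", example_props))
```

The printed string should match the `str_in` value that produced the near-zero distance above.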

### Tweaking of vectorization

Now that we know the default behaviour, let's try tweaking it to do something different. 

Did you notice that in the exercise above, we *hadn't* defined a schema before importing the data? Weaviate inferred one for us with its auto-schema feature.

Let's define a schema. Incidentally, this is where we define what properties are to be vectorized (or not).

In [27]:
client.schema.delete_all()
client.schema.get()

{'classes': []}

In [28]:
question_class = {
    "class": "Question",
    "description": "Details of a Jeopardy! question",    
    "moduleConfig": {
        "text2vec-cohere": {  # The vectorizer name - must match the vectorizer used
            "vectorizeClassName": False,  # Ignore class name
        },
    },
    "properties": [
        {
            "name": "answer",    
            "description": "What the host prompts the contestants with.",            
            "dataType": ["string"],
            "moduleConfig": {
                "text2vec-cohere": {
                    "skip": False,  # Do not skip this property
                    "vectorizePropertyName": False  # Ignore property name
                }
            }
        },
        {
            "name": "question",
            "description": "What the contestant is to provide.",
            "dataType": ["string"],
            "moduleConfig": {
                "text2vec-cohere": {
                    "skip": False,  # Do not skip this property
                    "vectorizePropertyName": True  # Do not ignore property name
                }
            }
        },        
    ]
}

client.schema.create_class(question_class)
client.schema.get()

{'classes': [{'class': 'Question',
   'description': 'Details of a Jeopardy! question',
   'invertedIndexConfig': {'bm25': {'b': 0.75, 'k1': 1.2},
    'cleanupIntervalSeconds': 60,
    'stopwords': {'additions': None, 'preset': 'en', 'removals': None}},
   'moduleConfig': {'text2vec-cohere': {'vectorizeClassName': False}},
   'properties': [{'dataType': ['string'],
     'description': 'What the host prompts the contestants with.',
     'moduleConfig': {'text2vec-cohere': {'skip': False,
       'vectorizePropertyName': False}},
     'name': 'answer',
     'tokenization': 'word'},
    {'dataType': ['string'],
     'description': 'What the contestant is to provide.',
     'moduleConfig': {'text2vec-cohere': {'skip': False,
       'vectorizePropertyName': True}},
     'name': 'question',
     'tokenization': 'word'}],
   'shardingConfig': {'virtualPerPhysical': 128,
    'desiredCount': 1,
    'actualCount': 1,
    'desiredVirtualCount': 128,
    'actualVirtualCount': 128,
    'key': '_id',

Populate Weaviate again with the new schema

In [29]:
for i, row in df.iterrows():
    properties = {
        "question": row.Question,
        "answer": row.Answer
    }
    client.batch.add_data_object(properties, "Question")
client.batch.flush()
client.data_object.get()

{'deprecations': None,
 'objects': [{'class': 'Question',
   'creationTimeUnix': 1673090452982,
   'id': '0fe37df4-856c-4c79-8ce8-5318d57d3c65',
   'lastUpdateTimeUnix': 1673090452982,
   'properties': {'answer': '(Lou) Gehrig',
    'question': 'No. 10: FB/LB for Columbia U. in the 1920s; MVP for the Yankees in \'27 & \'36; "Gibraltar in Cleats"'},
   'vectorWeights': None},
  {'class': 'Question',
   'creationTimeUnix': 1673090452525,
   'id': '109fd6af-f605-4bec-b419-45ff1d9bba79',
   'lastUpdateTimeUnix': 1673090452525,
   'properties': {'answer': 'Arizona',
    'question': 'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'},
   'vectorWeights': None},
  {'class': 'Question',
   'creationTimeUnix': 1673090452525,
   'id': '11c93af8-3c80-4d25-8239-8ed4f8a87c2c',
   'lastUpdateTimeUnix': 1673090452525,
   'properties': {'answer': 'John Adams',
    'question': 'Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of

Now let's take another look at the vector, comparing the distance to the nearest hit:

In [30]:
str_in = ''
for k in sorted(input_props.keys()):
    v = input_props[k]
    if isinstance(v, str):
        str_in += v + ' '
str_in = str_in.lower().strip()  # lowercase and remove trailing whitespace
str_in = 'question ' + str_in  # prepend the class name, as before

In [31]:
co_resp = co.embed([str_in], model="multilingual-22-12")
client.query.get(
    "Question", 
    ["question", "answer"]
).with_limit(2).with_near_vector(
    {'vector': co_resp.embeddings[0]}
).with_additional(['distance']).do()

{'data': {'Get': {'Question': [{'_additional': {'distance': 0.003951192},
     'answer': "McDonald's",
     'question': 'In 1963, live on "The Art Linkletter Show", this company served its billionth burger'},
    {'_additional': {'distance': 0.10788387},
     'answer': 'the cud',
     'question': 'Cows regurgitate this from the first stomach to the mouth & chew it again'}]}}}

The distance is no longer zero!

That's because we have changed the schema: the class name is no longer vectorized, while the name of the `question` property now is.

In [32]:
str_in = ''
for k in sorted(input_props.keys()):
    v = input_props[k]
    if isinstance(v, str):
        if k == 'question':  # only this property has vectorizePropertyName set to True
            str_in += k + ' '
        str_in += v + ' '
str_in = str_in.lower().strip()  # no class name this time

In [33]:
str_in

'mcdonald\'s question in 1963, live on "the art linkletter show", this company served its billionth burger'

In [34]:
co_resp = co.embed([str_in], model="multilingual-22-12")
client.query.get(
    "Question", 
    ["question", "answer"]
).with_limit(2).with_near_vector(
    {'vector': co_resp.embeddings[0]}
).with_additional(['distance']).do()

{'data': {'Get': {'Question': [{'_additional': {'distance': -1.1920929e-07},
     'answer': "McDonald's",
     'question': 'In 1963, live on "The Art Linkletter Show", this company served its billionth burger'},
    {'_additional': {'distance': 0.10982114},
     'answer': 'the cud',
     'question': 'Cows regurgitate this from the first stomach to the mouth & chew it again'}]}}}
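
A slightly negative distance may look odd. Values of this size are on the order of 32-bit floating-point rounding error, so a plausible reading (an assumption on our part, not something confirmed in this notebook) is that the distance is computed in single precision and the result is zero up to rounding. A minimal sketch of a tolerance check, with a hypothetical helper name and tolerance:

```python
import math

# Spacing between adjacent float32 values just above 1.0
FLOAT32_EPS = 2.0 ** -23


def is_effectively_zero(distance, tol=1e-6):
    """Treat distances within `tol` of zero as zero (tolerance is our choice)."""
    return math.isclose(distance, 0.0, abs_tol=tol)


print(FLOAT32_EPS)
print(is_effectively_zero(-1.1920929e-07))  # the distance reported above
print(is_effectively_zero(0.003951192))     # the earlier, mismatched attempt
```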

Adding the property name brings our `distance` back to zero.
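
To generalize, the string can be derived from the schema itself rather than hard-coded. The sketch below reads the `text2vec-cohere` settings from a class definition shaped like `question_class`. The function name `text_from_schema` is hypothetical, and the fallback values used when a setting is missing (`vectorizeClassName` on, `vectorizePropertyName` off, `skip` off) are assumptions based on the behaviour seen earlier in this notebook.

```python
MODULE = "text2vec-cohere"


def text_from_schema(class_schema, obj):
    """Rebuild the vectorizer input for `obj` from the class's module settings."""
    class_cfg = class_schema.get("moduleConfig", {}).get(MODULE, {})
    parts = []
    if class_cfg.get("vectorizeClassName", True):  # assumed default
        parts.append(class_schema["class"])
    for prop in sorted(class_schema["properties"], key=lambda p: p["name"]):
        cfg = prop.get("moduleConfig", {}).get(MODULE, {})
        value = obj.get(prop["name"])
        if cfg.get("skip", False) or not isinstance(value, str):
            continue  # property excluded from vectorization, or not text
        if cfg.get("vectorizePropertyName", False):  # assumed default
            parts.append(prop["name"])
        parts.append(value)
    return " ".join(parts).lower().strip()


# A trimmed-down copy of the schema used above (descriptions omitted)
demo_schema = {
    "class": "Question",
    "moduleConfig": {MODULE: {"vectorizeClassName": False}},
    "properties": [
        {"name": "answer",
         "moduleConfig": {MODULE: {"skip": False, "vectorizePropertyName": False}}},
        {"name": "question",
         "moduleConfig": {MODULE: {"skip": False, "vectorizePropertyName": True}}},
    ],
}
demo_obj = {
    "question": 'In 1963, live on "The Art Linkletter Show", '
                "this company served its billionth burger",
    "answer": "McDonald's",
}
print(text_from_schema(demo_schema, demo_obj))
```

With this schema, the printed string should match the final `str_in` that brought the distance back to zero.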