# CodeNames Spymaster AI
## Using SBERT word embeddings to generate clues for the game Codenames

### Overview of Approach
#### One-Time pre-setup
1. Prepare a word embedding model, e.g. `sentence-transformers/all-MiniLM-L6-v2`.
2. Prepare a vector database, e.g. Milvus, to store the dictionary of word embeddings.
3. Iterate over a dictionary of words, compute a word embedding for each, and insert the embeddings into the vector database. Note that these words represent the set of possible clue words, which is broader than the set of possible game words (e.g. the [Codenames AI Competition word pool](https://github.com/CodenamesAICompetition/Game/blob/master/codenames/game_wordpool.txt)).

#### Game Setup
1. Randomly select 9 red words, 8 blue words, 7 grey words, and 1 black word from the game word pool to constitute the game board.
2. Compute word embeddings for the words on the board.


#### Generating Clues
1. Compute clusters on the embeddings generated during game setup, grouping the red, blue, grey, and black words such that no clusters overlap (there may be multiple clusters per color). TODO: select a clustering algorithm, e.g. [Agglomerative Clustering](https://www.sbert.net/examples/applications/clustering/README.html#agglomerative-clustering).
2. Compute the centroid of each cluster. TODO: select a centroid algorithm.
3. 

### Visualization

### Follow-up
1. Reduce the size and precision of the embeddings to reduce vector DB size, e.g. [dimensionality reduction](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/distillation/dimensionality_reduction.py)
2. Create a guesser and adjust interfaces to match the Google [Codenames AI Competition](https://github.com/CodenamesAICompetition/Game)
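
The first follow-up item can be illustrated with a plain PCA projection followed by a cast to half precision. This is only a sketch under stated assumptions: the helper name `reduce_embeddings`, the target size of 128, and the random stand-in data are all hypothetical, and whether the vector DB can store `float16` vectors depends on its version and field type.

```python
import numpy as np

def reduce_embeddings(embeddings, new_dim=128):
    """Project embeddings onto their top `new_dim` principal components, stored as float16."""
    X = np.asarray(embeddings, dtype=np.float32)
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:new_dim]
    reduced = (X - mean) @ components.T
    return reduced.astype(np.float16), mean, components

# Stand-in data with the same width as all-MiniLM-L6-v2 embeddings (384).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(1000, 384)).astype(np.float32)

reduced, mean, components = reduce_embeddings(fake_embeddings)
print(reduced.shape, reduced.dtype)  # (1000, 128) float16
```

Query vectors would need the same `mean` and `components` applied before searching, and the `dim` in the collection schema would have to match the reduced size.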


### Pre-Setup:

#### 1. Load the dictionary
Filter out words shorter than 3 letters and remove some common words (e.g. "the", "and").

Sources:
~10k: the 10k most common words on Wikipedia
https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/English/Wikipedia_(2016)
https://wortschatz.uni-leipzig.de/de/download/English

~250k: Unix `/usr/share/dict/words`
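
The loading cell below reads a file that has apparently already been filtered, so the filtering step itself is not shown. A minimal sketch of what it could look like follows; the helper name `filter_dictionary` and the contents of `STOP_WORDS` are assumptions, since the notebook does not say which common words were removed.

```python
# Hypothetical stop-word list; the actual list used for the notebook is not given.
STOP_WORDS = {"the", "and", "for", "that", "with"}

def filter_dictionary(words, min_length=3, stop_words=STOP_WORDS):
    """Keep lowercased words with at least `min_length` letters that are not stop words."""
    kept = []
    for word in words:
        w = word.strip().lower()
        if len(w) >= min_length and w not in stop_words:
            kept.append(w)
    return kept

print(filter_dictionary(["The", "of", "apple", "and", "river"]))  # ['apple', 'river']
```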



In [3]:
with open('words/10k_wiki_words.txt') as f:
    dict_words = f.read().splitlines()

In [4]:
len(dict_words)

8973

#### 2. Compute Embeddings

In [5]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [6]:
# Encode the whole dictionary in batches; list() yields one vector per word.
dict_embeddings = list(model.encode(dict_words))

In [7]:
len(dict_embeddings)

8973

In [18]:
import requests

# Sanity-check an alternative embedding source: a local Ollama server.
url = "http://localhost:11434/api/embeddings"
request = {"model": "gemma:2b", "prompt": "hello"}

response = requests.post(url, json=request)

In [26]:
if response.status_code == 200:
    print("Request successful!")
    # Access the response data
    response_data = response.json()
    print(len(response_data.get("embedding")))
else:
    print(f"Request failed with status code {response.status_code}")

Request successful!
2048


#### 3. Load into Vector DB

In [11]:
from pymilvus import (
    connections,
    utility,
    FieldSchema,
    CollectionSchema,
    DataType,
    Collection,
)
connections.connect("default", host="localhost", port="19530")

In [12]:
fields = [
    FieldSchema(
        name="id",
        dtype=DataType.INT64,
        is_primary=True,
        auto_id=False),
    FieldSchema(
        name="word",
        dtype=DataType.VARCHAR,
        max_length=32,
    ),
    FieldSchema(
        name="embeddings",
        dtype=DataType.FLOAT_VECTOR,
        dim=384,
    )
]
schema = CollectionSchema(fields, "Dictionary of word embeddings")
dict_db = Collection("dict_embeddings", schema)    

In [13]:
# Column-ordered data matching the schema: ids, words, embeddings.
entries = [
    list(range(len(dict_words))),
    dict_words,
    dict_embeddings,
]

In [98]:
dict_db.insert(entries)
dict_db.flush()

In [15]:
dict_db.num_entities

8973

In [100]:
index = {
    "index_type": "IVF_FLAT",
    "metric_type": "COSINE",
    "params": {"nlist": 128},
}
dict_db.create_index("embeddings", index)

Status(code=0, message=)

In [16]:
dict_db.load()

In [24]:
# red_embeddings is defined in the Game Setup section below.
vectors_to_search = red_embeddings
search_params = {
    "metric_type": "COSINE",
    "params": {"nprobe": 10},
}
result = dict_db.search(
    vectors_to_search,
    "embeddings",
    search_params,
    limit=3,
    output_fields=["id", "word"],
)

In [25]:
# With the COSINE metric, the value printed as `distance` is a similarity: higher means closer.
for hits in result:
    print("====")
    for hit in hits:
        print(hit.entity)

====
id: 1308, distance: 0.6538330316543579, entity: {'id': 1308, 'word': 'animal'}
id: 6171, distance: 0.6095772981643677, entity: {'id': 6171, 'word': 'pigs'}
id: 6745, distance: 0.586907148361206, entity: {'id': 6745, 'word': 'rats'}
====
id: 3746, distance: 0.9999998807907104, entity: {'id': 3746, 'word': 'face'}
id: 3749, distance: 0.8222840428352356, entity: {'id': 3749, 'word': 'facial'}
id: 3748, distance: 0.81437087059021, entity: {'id': 3748, 'word': 'faces'}
====
id: 5285, distance: 0.9999999403953552, entity: {'id': 5285, 'word': 'march'}
id: 673, distance: 0.705102264881134, entity: {'id': 673, 'word': 'october'}
id: 305, distance: 0.7045801877975464, entity: {'id': 305, 'word': 'february'}
====
id: 4118, distance: 0.5216944813728333, entity: {'id': 4118, 'word': 'gambling'}
id: 8878, distance: 0.464677631855011, entity: {'id': 8878, 'word': 'winning'}
id: 2533, distance: 0.4633045196533203, entity: {'id': 2533, 'word': 'contest'}
====
id: 5636, distance: 1.0, entity: {'id
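
Several of the top hits above are the query word itself (score close to 1.0) or an inflection of it such as 'faces', and Codenames does not allow a board word or a form of it as a clue. The sketch below shows one crude way to drop such candidates. It assumes the hits have already been flattened into `(word, score)` pairs; the helper name `legal_clues` is hypothetical, and the substring test is only a rough stand-in for the actual rule (it still lets 'facial' through, for example).

```python
def legal_clues(hits, board_words):
    """Drop candidates that equal, contain, or are contained in a board word."""
    board = [w.lower() for w in board_words]
    kept = []
    for word, score in hits:
        w = word.lower()
        if any(w in b or b in w for b in board):
            continue
        kept.append((word, score))
    # Higher cosine similarity first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Scores copied from the search output above.
hits = [
    ("face", 0.9999998807907104),
    ("facial", 0.8222840428352356),
    ("faces", 0.81437087059021),
]
print(legal_clues(hits, ["FACE", "MARCH"]))  # [('facial', 0.8222840428352356)]
```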

### Game Setup: 

#### Randomly select 25 words from the word pool and randomly assign them as red (9), blue (8), grey (7), and black (1)

In [82]:
import random

with open('game_wordpool.txt') as f:
    game_words = f.read().splitlines()

# Sample 25 distinct board words, then split them by color.
words = random.sample(game_words, k=25)

red_words = words[0:9]      # 9 red
blue_words = words[9:17]    # 8 blue
grey_words = words[17:24]   # 7 grey
black_words = words[24:]    # 1 black

#### Compute word embeddings for each word on the board

In [35]:
red_embeddings = list(model.encode(red_words))

In [36]:
blue_embeddings = list(model.encode(blue_words))

In [37]:
grey_embeddings = list(model.encode(grey_words))

In [38]:
black_embeddings = list(model.encode(black_words))

### Generating Clues

In [71]:
from sklearn.cluster import AgglomerativeClustering

# n_clusters=None lets distance_threshold decide how many clusters to form.
red_clusters = AgglomerativeClustering(
    n_clusters=None,
    metric="cosine",
    linkage="complete",
    distance_threshold=0.8,
).fit_predict(red_embeddings)


In [75]:
red_clusters.tolist()

[2, 2, 1, 0, 2, 0, 2, 0, 1]

In [73]:
red_words

['HEAD', 'COPPER', 'WAVE', 'CRANE', 'ORGAN', 'EGYPT', 'KEY', 'WAKE', 'CONCERT']

In [78]:
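# Materialize the zip so it can be indexed and reused; a bare zip object is consumed after one pass.
zipped = list(zip(red_clusters.tolist(), red_words))

In [81]:
zipped[0]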

(2, 'HEAD')
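
Step 2 of the approach overview calls for a centroid per cluster, with the centroid algorithm still marked as a TODO. One simple option, sketched here as an assumption and not as the chosen method, is the mean of the unit-normalized member embeddings, re-normalized to unit length so that it suits cosine search. The helper names `group_by_cluster` and `cluster_centroids` are hypothetical, and the 2-dimensional vectors are toy data.

```python
from collections import defaultdict

import numpy as np

def group_by_cluster(labels, words):
    """Map each cluster label to the list of words assigned to it."""
    groups = defaultdict(list)
    for label, word in zip(labels, words):
        groups[label].append(word)
    return dict(groups)

def cluster_centroids(labels, embeddings):
    """Mean of each cluster's unit-normalized embeddings, re-normalized to unit length."""
    X = np.asarray(embeddings, dtype=np.float32)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    labels = np.asarray(labels)
    centroids = {}
    for label in np.unique(labels):
        mean = X[labels == label].mean(axis=0)
        centroids[int(label)] = mean / np.linalg.norm(mean)
    return centroids

# Labels and words copied from the cells above.
labels = [2, 2, 1, 0, 2, 0, 2, 0, 1]
words = ['HEAD', 'COPPER', 'WAVE', 'CRANE', 'ORGAN', 'EGYPT', 'KEY', 'WAKE', 'CONCERT']
groups = group_by_cluster(labels, words)
print(groups)
# {2: ['HEAD', 'COPPER', 'ORGAN', 'KEY'], 1: ['WAVE', 'CONCERT'], 0: ['CRANE', 'EGYPT', 'WAKE']}

# Toy 2-d embeddings standing in for the real 384-d vectors.
toy_centroids = cluster_centroids([0, 0, 1], [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
print(toy_centroids[0])  # roughly [0.707, 0.707]
```

With the real data, `cluster_centroids(red_clusters, red_embeddings)` would give one vector per red cluster, and each could in principle be used as a query vector in the search shown earlier.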