This notebook sets up a database and installs the Python modules used by the crossword generator project.



### Install the Weaviate client

In [None]:
# Uncomment to clear your current pip cache
# !pip cache purge

# Uncomment to upgrade pip
# !pip install --upgrade pip

# Install the client from the public release
!pip3 install --no-cache-dir -U "weaviate-client==4.*"

# Check installed client version
!pip show weaviate-client | grep Version
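
The `grep` filter above is not available in every shell. As a rough alternative, the version can be read from Python with the standard library's `importlib.metadata`. This is a minimal sketch; the `installed_version` helper name is hypothetical and not part of the project.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the client version without relying on grep
client_version = installed_version("weaviate-client")
print(f"weaviate-client version: {client_version}")
```

A result of `None` means the package is not installed in the environment the notebook kernel is using.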

### Install additional Python libraries

In [None]:
# # Install the tqdm progress monitor
# !pip install tqdm

# # Install the Ollama Python client
# !pip install ollama

# # Install spaCy, used for sentence splitting and named entity recognition in puzzle generation
# #   It might take a few minutes to build spaCy and set up the data
# !pip install spacy
# !python -m spacy download en_core_web_sm

### Install Ollama

Install an LLM and an embedding model.

Ollama should start after you install it. To check whether Ollama is running, open http://localhost:11434 in a browser.

In [None]:
# # Uncomment to pull the models
# !ollama pull llama3         # The LLM
# !ollama pull all-minilm     # For embeddings
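
The browser check can also be done from the notebook. The sketch below assumes Ollama is listening on its default address, `http://localhost:11434`, and that a plain GET on the root URL returns HTTP 200 when the server is up. The `ollama_is_running` helper name is hypothetical.

```python
import urllib.request


def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, connection refused, and timeouts are all OSError subclasses
        return False


print("Ollama reachable:", ollama_is_running())
```

The check only confirms that something answers on that port. It does not verify that the `llama3` or `all-minilm` models have been pulled.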

### Connect the client to a local Weaviate instance


In [None]:
import weaviate

client = weaviate.connect_to_local()

# Check the connection
client.is_ready()

### Check if the Ollama module is enabled in Weaviate

If the Ollama modules are not configured, enable the `text2vec-ollama` module and the `generative-ollama` module in your Weaviate [configuration file](/developers/weaviate/installation#configuration-files).

In [3]:
meta_info = client.get_meta()
if 'text2vec-ollama' not in meta_info["modules"]:
    print("Enable the text2vec-ollama module.")

if 'generative-ollama' not in meta_info["modules"]:
    print("Enable the generative-ollama module.")
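
If more modules need checking later, the two tests can be folded into one helper. This is a sketch only: it assumes `meta_info` is a dictionary with a `modules` key, as used in the cell above. The `missing_modules` helper and the `example_meta` data are hypothetical and stand in for a live server response.

```python
def missing_modules(meta_info, required):
    """Return the required module names that are absent from meta_info["modules"]."""
    enabled = meta_info.get("modules", {})
    return [name for name in required if name not in enabled]


# Example metadata shaped like the dictionary used above (values are illustrative)
example_meta = {"modules": {"text2vec-ollama": {}}}
required_modules = ["text2vec-ollama", "generative-ollama"]

for name in missing_modules(example_meta, required_modules):
    print(f"Enable the {name} module.")
```

With a live connection, pass the result of `client.get_meta()` in place of `example_meta`.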


### Set the collection name

You will need a collection to store your data. This code lets you choose a collection name and cleans up any earlier versions if they exist.

In [None]:
# Set the collection name
collection_name = "CrosswordPuzzles"

# Remove old versions of this collection if they exist
if client.collections.exists(collection_name):
    client.collections.delete(collection_name)
    print(f"Removed old collection: {collection_name}")

### Define a collection

The local collection holds some books from [Project Gutenberg](https://www.gutenberg.org/).

This definition is very basic. The import step below converts the books to vector embeddings without attaching any metadata, so the collection defines only a single `text` property.

In [None]:

from weaviate.classes.config import Property, DataType, Configure

# Create the collection, specifying the Ollama base URL for the vectorizer and the generative module
collection = client.collections.create(
    name=collection_name,
    description="Source texts for puzzles",
    properties=[
        Property(name="text", data_type=DataType.TEXT),
    ],
    vectorizer_config=Configure.Vectorizer.text2vec_ollama(
        api_endpoint="http://localhost:11434",
        model="all-minilm"
    ),
    generative_config=Configure.Generative.ollama(
        api_endpoint="http://localhost:11434",
        model="llama3"
    )
)

# # Uncomment to check the collection definition
# collection_definition = client.collections.export_config(collection_name)
# print(f"Name: {collection_definition.name}     Description: {collection_definition.description}")


### Import the data into the collection

Process some text files (books from Project Gutenberg) to use as project-specific data.

In [51]:
import os
import spacy
import ollama


# Initiate spacy to process the files
nlp = spacy.load('en_core_web_sm')

# Get a list of the sources
source_dir = "../inputs/"
source_files = [f for f in os.listdir(source_dir) if os.path.isfile(os.path.join(source_dir, f))]

for sf in source_files:
    with open(os.path.join(source_dir, sf), 'r', encoding='utf-8') as f:
        sentences = []
        header_flag = True
        source_text = f.read()

        # Split each source file into sentence-like strings
        for s in nlp(source_text).sents:
            s = s.text.strip()
            # Skip the Project Gutenberg header, up to and including the start marker
            if header_flag:
                if s.startswith('*** START'):
                    header_flag = False
                continue
            # Keep only non-empty sentences
            if s:
                sentences.append(s)

        # Create embeddings for the sentences
        with collection.batch.dynamic() as batch:
            for snt in sentences:
                response = ollama.embeddings(model="all-minilm", prompt=snt)
                embedding = response["embedding"]
                batch.add_object(
                    properties = {"text": snt},
                    vector = embedding,
                )
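
The loop above skips everything before the `*** START` marker but keeps whatever follows the book, including the trailing license text. One possible refinement is to trim the raw text line by line before handing it to spaCy. The sketch below assumes each source file contains marker lines beginning with `*** START` and `*** END`, as Project Gutenberg plain-text files typically do. The `gutenberg_body` helper and the sample text are hypothetical.

```python
def gutenberg_body(text):
    """Return the lines between the '*** START' and '*** END' marker lines."""
    body = []
    in_body = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("*** START"):
            # Everything after this line belongs to the book
            in_body = True
            continue
        if stripped.startswith("*** END"):
            # Stop before the trailing license text
            break
        if in_body:
            body.append(line)
    return "\n".join(body)


sample = "\n".join([
    "Title: Example",
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***",
    "First line of the book.",
    "",
    "Second line of the book.",
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***",
    "License text.",
])
print(gutenberg_body(sample))
```

If a file has no start marker, the helper returns an empty string, so it may be worth logging such files rather than importing nothing silently.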


### Check the data upload

In [52]:
# # Uncomment to print the number of objects
# collection = client.collections.get(collection_name)
# response = collection.aggregate.over_all(total_count=True)
# print(f"Collection size: {response.total_count}")

# # Uncomment to print the first 3 objects
# import pprint as pp
# response = collection.query.fetch_objects(
#     limit=3,
#     include_vector=True
#     )
# for o in response.objects:
#     pp.pprint(o.properties)
#     print(o.vector)

Collection size: 44
{'text': '_Nelson Evans, Los Angeles.'}
{'default': [0.19579169154167175, -0.40505698323249817, -0.037693947553634644, 0.14545543491840363, 0.4332829415798187, 0.23615577816963196, 0.13612712919712067, 0.08527970314025879, 0.0987408310174942, 0.3624538779258728, 0.16044436395168304, -0.11926710605621338, -0.02824527584016323, -0.013377237133681774, 0.07106252014636993, 0.01584317907691002, -0.09646245837211609, -0.2935028076171875, -0.14679211378097534, -0.37195566296577454, -0.1780683398246765, 0.3705618679523468, 0.058706995099782944, 0.28247204422950745, -0.10197389125823975, -0.3002275824546814, -0.2227848619222641, 0.10833949595689774, 0.3033396005630493, 0.11573261022567749, 0.3275470733642578, -0.2217307835817337, 0.018864037469029427, 0.11530236154794693, 0.5067391395568848, -0.082425057888031, -0.09569010138511658, -0.335968941450119, 0.09638530761003494, 0.1867569088935852, -0.0862208753824234, 0.25832903385162354, 0.15274909138679504, -0.11262793838977814