# Local LLM & Embedding Tests

These cells verify that the OpenAI-compatible endpoint exposed by **Docker Model Runner** works from Python, for both chat completions and embeddings.

In [1]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

True
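
The `True` above is the return value of `load_dotenv()`; it does not confirm that `LLM_BASE_URL` specifically is defined. A minimal sketch of a guard, assuming a hypothetical helper named `require_env` (not part of the original notebook), fails early with a clear message instead of handing `None` to the client:

```python
import os

def require_env(name, environ=None):
    """Return an environment variable's value, or raise if it is unset or empty.

    `require_env` is a hypothetical helper; `environ` defaults to os.environ
    and is only a parameter so the function can be exercised with a plain dict.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, '').strip()
    if not value:
        raise RuntimeError(f'{name} is not set; add it to your .env file')
    return value
```

In the client setup that follows, `require_env('LLM_BASE_URL')` could stand in for `os.getenv('LLM_BASE_URL')`.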

Initialize the OpenAI client against **Docker Model Runner** using the `LLM_BASE_URL` environment variable, then send a **chat completion** request to test the local LLM. The API key passed here is only a placeholder.

In [2]:
from openai import OpenAI

client = OpenAI(
    api_key='not-needed',
    base_url=os.getenv('LLM_BASE_URL')
)

PROMPT = 'Explain how to use Pyspark to read a CSV file.'

message = {
    'role': 'user',
    'content': PROMPT
}

response = client.chat.completions.create(
    model='ai/llama3.1',
    messages=[message]
)

print(response.choices[0].message.content)

**Reading a CSV File with PySpark**

You can use PySpark to read a CSV file by using the `read.csv` method provided by the SparkSession object. Here's a step-by-step guide to achieve this:

### Prerequisites

* PySpark must be installed. If not, you can install it using `pip install pyspark`.
* The SparkSession object must be created.

### Code

```python
from pyspark.sql import SparkSession

# Create a SparkSession object
spark = SparkSession.builder.appName("CSV Reader").getOrCreate()

# Read the CSV file
df = spark.read.csv(
    "path_to_your_file.csv",
    header=True,  # Assuming the first row is the header
    inferSchema=True,  # Automatically infer the schema of the file
    sep=",",  # The separator used in the file
    naValues=["NA"]  # The value used to represent missing values
)

# Print the first few rows of the DataFrame
df.show()
```

### Explanation

* `spark = SparkSession.builder.appName("CSV Reader").getOrCreate()`: Creates a SparkSession object.
* `spark.read.csv()

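Create **text embeddings** for a search query and a candidate document in a single request. The query carries a retrieval instruction prefix; the document is embedded as-is.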

In [3]:
document = 'To read a CSV file with PySpark, you can use the `read.csv()` function.'

query = 'Represent this sentence for searching relevant passages: How do I read a CSV in PySpark?'

resp = client.embeddings.create(model='ai/mxbai-embed-large', input=[query, document])

print('Usage:', resp.usage, '\nData:', resp.data[0].embedding, '\nDimensions:', len(resp.data[0].embedding))

Usage: Usage(prompt_tokens=49, total_tokens=49) 
Data: [0.0198382418602705, 0.02137097157537937, 0.021393904462456703, 0.003140340093523264, 0.007801663596183062, -0.01564009301364422, 0.04011332243680954, -0.008549734950065613, 0.023549363017082214, 0.0774080827832222, -0.013655592687427998, -0.04434587433934212, 0.03513387590646744, -0.02021830901503563, -0.03570776432752609, -0.056706514209508896, 0.024996809661388397, -0.006293039303272963, -0.03738256171345711, 0.0031822053715586662, -0.017491931095719337, 0.04579836502671242, -0.06497251242399216, 0.01710066758096218, -0.04162462800741196, 0.0231020487844944, -0.0540921613574028, 0.00741297984495759, 0.04275768622756004, 0.028191568329930305, -0.022558698430657387, 0.016050608828663826, -0.03847094625234604, -0.0663968026638031, 0.03247026354074478, 0.0018827725434675813, 0.04341043904423714, 0.022294552996754646, -0.03628738969564438, -0.006017886567860842, 0.012544175609946251, 0.03187567740678787, 0.022399114444851875, -0.0132
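
The query in the cell above is prefixed with an instruction, while the document is not. A small sketch, assuming a hypothetical helper named `as_query` and assuming the prefix used in the cell is the one the embedding model expects for retrieval queries, keeps that asymmetric treatment in one place:

```python
# Prefix copied verbatim from the query string in the cell above.
QUERY_PREFIX = 'Represent this sentence for searching relevant passages: '

def as_query(text):
    """Prepend the retrieval prefix to a search query (hypothetical helper).

    Documents are embedded without any prefix, so they do not go through here.
    """
    return QUERY_PREFIX + text.strip()

print(as_query('How do I read a CSV in PySpark?'))
```

The returned string can be passed in the `input` list of `client.embeddings.create` alongside the unprefixed documents.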

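Compute the **cosine similarity** between the query and document embeddings; values closer to 1 indicate a closer semantic match.

In [4]: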
import numpy as np

def cosine_similarity(vec1, vec2):
    """Return the cosine of the angle between two vectors."""
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

# resp.data follows the input order: index 0 is the query, index 1 is the document
similarity = cosine_similarity(resp.data[0].embedding, resp.data[1].embedding)
print("Cosine Similarity:", similarity)

Cosine Similarity: 0.9057438738179454
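
With more than one candidate document, the same measure can be used to rank passages against a query. The sketch below uses only the standard library and small made-up vectors in place of real embeddings, so it runs without the model endpoint; `cosine` and `rank_by_similarity` are hypothetical helpers, not part of the original notebook:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (document index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional vectors standing in for real embeddings (illustrative only).
toy_query = [1.0, 0.0, 0.0]
toy_docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction, different length
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]

for index, score in rank_by_similarity(toy_query, toy_docs):
    print(index, round(score, 4))
```

With real data, `toy_query` and `toy_docs` would be replaced by the `embedding` lists returned from `client.embeddings.create`. Because cosine similarity ignores vector length, the second toy document ranks first even though it is twice as long as the query.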
