# LOTUS Semantic Operators Walkthrough

In [1]:
# pip install lotus-ai

In [2]:
# pip install python-dotenv
import os
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
import pandas as pd

import lotus
from lotus.models import OpenAIModel, E5Model

lm = OpenAIModel(api_key=os.getenv('OPENAI_API_KEY'))
rm = E5Model(device="cpu")

lotus.settings.configure(lm=lm, rm=rm, model_params={"temperature": 0.0, "max_tokens": 256})



In [4]:
# When needed, we can also alter other configurations managed by the settings module
lotus.settings.keys()

dict_keys(['lm', 'helper_lm', 'rm', 'model_params'])

# Semantic Operators

Semantic operators are the core of the LOTUS programming model. They extend the relational model with AI-based operations that users can compose into reasoning-based query pipelines over structured and unstructured data.

In [5]:
# First, let's initialize the DataFrame on which we will perform semantic operations
data = {
    "Course Name": [
        "Probability and Random Processes",
        "Optimization Methods in Engineering",
        "Digital Design and Integrated Circuits",
        "Computer Security",
        "Operating Systems and Systems Programming",
        "Compilers",
        "Computer Networks",
        "Deep Learning",
        "Graphics",
        "Databases",
        "Art History",
    ]
}
df = pd.DataFrame(data)

## Semantic Filter

Semantic filter returns the subset of rows that pass the filtering condition specified by the user's **langex**. A langex is a parameterized natural-language expression in which column names appear in curly braces, marking the columns the operation reads. Let's see how it can be used to filter for classes that require a lot of math.

In [6]:
df.sem_filter("{Course Name} requires a lot of math")


Unnamed: 0,Course Name
0,Probability and Random Processes
1,Optimization Methods in Engineering
2,Digital Design and Integrated Circuits
7,Deep Learning
8,Graphics
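
To make the langex notation concrete, here is a minimal sketch of how a template could be expanded into one prompt per row. It is an illustration only: `langex_columns` and `render_langex` are hypothetical helpers, not part of the LOTUS API, and plain dictionaries stand in for DataFrame rows.

```python
import re

def langex_columns(langex: str) -> list[str]:
    """Return the column names referenced in curly braces, in order of appearance."""
    return re.findall(r"\{([^{}]+)\}", langex)

def render_langex(langex: str, row: dict) -> str:
    """Substitute each {column} placeholder with the row's value for that column."""
    return re.sub(r"\{([^{}]+)\}", lambda m: str(row[m.group(1)]), langex)

langex = "{Course Name} requires a lot of math"
rows = [{"Course Name": "Deep Learning"}, {"Course Name": "Art History"}]

columns = langex_columns(langex)                       # columns the langex reads
prompts = [render_langex(langex, row) for row in rows]  # one prompt per row
print(columns)
print(prompts)
```

Each rendered prompt would then be sent to the language model for a yes/no judgment, and only the rows judged true are kept.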


## Semantic Top K

Semantic top K ranks the rows according to a user-defined criterion and returns the K rows that best match it. The langex states a general ranking criterion over one or more columns. Let's see how it can be used to find the top 3 most relevant classes for distributed systems.

In [7]:
df.sem_topk("{Course Name} is most relevant for distributed systems", K=3)


Unnamed: 0,Course Name
0,Operating Systems and Systems Programming
1,Computer Networks
2,Databases
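
One way to think about ranking by a natural-language criterion is as a sort driven by pairwise comparisons ("which of these two rows fits the criterion better?"). The sketch below illustrates that idea under stated assumptions: `TOY_RELEVANCE` holds made-up scores standing in for model judgments, `compare` and `top_k` are hypothetical helpers, and nothing here is confirmed to be how LOTUS implements `sem_topk`.

```python
from functools import cmp_to_key

# Made-up relevance scores standing in for LLM judgments (not real model output).
TOY_RELEVANCE = {
    "Operating Systems and Systems Programming": 3,
    "Computer Networks": 2,
    "Databases": 1,
    "Art History": 0,
}

def compare(a: str, b: str) -> int:
    """Pairwise comparator: negative when `a` should rank ahead of `b`."""
    return TOY_RELEVANCE[b] - TOY_RELEVANCE[a]

def top_k(items: list[str], k: int) -> list[str]:
    """Sort with the pairwise comparator and keep the first k items."""
    return sorted(items, key=cmp_to_key(compare))[:k]

courses = [
    "Art History",
    "Databases",
    "Computer Networks",
    "Operating Systems and Systems Programming",
]
best = top_k(courses, 3)
print(best)
```

In a real pipeline each call to `compare` would cost one model request, so the choice of sorting or selection algorithm directly affects cost.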


## Semantic Join

Semantic join combines data from two tables, evaluating the user's predicate and returning the pairs of rows from the left and right tables that pass. Let's see how it can be used to match courses with the skills they build.

In [8]:
skills_df = pd.DataFrame({"Skill": ["Art", "Cryptography", "Baking"]})
df.sem_join(skills_df, "Taking {Course Name} will make me better at {Skill}")


Unnamed: 0,Course Name,Skill
0,Probability and Random Processes,Cryptography
2,Digital Design and Integrated Circuits,Cryptography
3,Computer Security,Cryptography
6,Computer Networks,Cryptography
8,Graphics,Art
8,Graphics,Baking
10,Art History,Art
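
Conceptually, a join on a natural-language predicate can be pictured as a nested loop that asks the predicate about every (left, right) pair. The sketch below shows that naive formulation; `toy_predicate` is a hypothetical stand-in for the model's yes/no answer, and the source does not say whether LOTUS evaluates joins this way or applies optimizations.

```python
from itertools import product

def toy_predicate(course: str, skill: str) -> bool:
    """Hypothetical stand-in for the LLM's yes/no judgment on one pair."""
    known_matches = {("Computer Security", "Cryptography"), ("Art History", "Art")}
    return (course, skill) in known_matches

def nested_loop_join(left, right, predicate):
    """Evaluate the predicate on every pair; return passing pairs and the call count."""
    calls = 0
    matches = []
    for left_row, right_row in product(left, right):
        calls += 1
        if predicate(left_row, right_row):
            matches.append((left_row, right_row))
    return matches, calls

courses = ["Computer Security", "Compilers", "Art History"]
skills = ["Art", "Cryptography", "Baking"]
matches, calls = nested_loop_join(courses, skills, toy_predicate)
print(matches, calls)
```

The call count grows as the product of the table sizes: with the 11 courses and 3 skills used in this walkthrough, a naive pairwise evaluation would need 33 predicate calls.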


## Semantic Indexing and Semantic Search

Semantic indexing allows us to perform semantic search over a column. Semantic search performs a top K similarity search over a column. Let's see how it can be used to find the 2 courses most similar to painting.

In [9]:
# First create a semantic index on Course Name column and save it to a directory
df = df.sem_index("Course Name", "course_name_index_dir")
df.sem_search("Course Name", "painting", K=2)



Unnamed: 0,Course Name
10,Art History
8,Graphics
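
A top K similarity search can be sketched as "embed the query, score every indexed row, keep the K best". The example below uses cosine similarity over made-up 2-D vectors; the vectors, the `search` helper, and the choice of cosine similarity are assumptions for illustration, not a description of the retrieval model or index used above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up 2-D "embeddings"; a real retrieval model produces high-dimensional vectors.
index = {
    "Art History": [0.9, 0.1],
    "Graphics": [0.6, 0.4],
    "Compilers": [0.1, 0.9],
}
query_vec = [1.0, 0.0]  # hypothetical embedding of the query "painting"

def search(index, query_vec, k):
    """Return the names of the k entries most similar to the query vector."""
    scored = sorted(index.items(), key=lambda kv: cosine(kv[1], query_vec), reverse=True)
    return [name for name, _ in scored[:k]]

hits = search(index, query_vec, 2)
print(hits)
```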


## Semantic Similarity Join

Semantic similarity join is a variant of the semantic join in which rows are matched according to their semantic similarity rather than an arbitrary natural-language predicate. It's similar to semantic search and requires a semantic index to be built on the right table. Let's see how it can be used to find the most relevant course for a skill.

In [10]:
skills_df.sem_sim_join(df, left_on="Skill", right_on="Course Name", K=1)



Unnamed: 0,Skill,_scores,Course Name
0,Art,0.901964,Art History
1,Cryptography,0.894124,Computer Security
2,Baking,0.797217,Graphics
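
The similarity join can be pictured as running one top K search per left row. Here is a minimal sketch with made-up 2-D vectors; `sim_join` is a hypothetical helper and cosine similarity is an assumed scoring function, so the scores are not comparable to the `_scores` column above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up embeddings for the left (skills) and right (courses) tables.
left = {"Art": [1.0, 0.0], "Cryptography": [0.0, 1.0]}
right = {
    "Art History": [0.9, 0.1],
    "Computer Security": [0.2, 0.8],
    "Graphics": [0.6, 0.4],
}

def sim_join(left, right, k=1):
    """For each left row, emit its k most similar right rows with their scores."""
    rows = []
    for left_name, left_vec in left.items():
        ranked = sorted(right.items(), key=lambda kv: cosine(left_vec, kv[1]), reverse=True)
        for right_name, right_vec in ranked[:k]:
            rows.append((left_name, round(cosine(left_vec, right_vec), 3), right_name))
    return rows

joined = sim_join(left, right, k=1)
print(joined)
```

Note that every left row receives K matches whether or not a good match exists, which is why "Baking" is paired with "Graphics" in the output above, at the lowest score of the three.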


## Semantic Cluster By
Semantic cluster by assigns a group to each row of the table according to semantic similarity. Let's see how it can be used to cluster similar classes.

In [11]:
# Make 3 clusters
df.sem_cluster_by("Course Name", 3).sort_values(by=['cluster_id'])



Unnamed: 0,Course Name,cluster_id
5,Compilers,0
8,Graphics,0
9,Databases,0
0,Probability and Random Processes,1
2,Digital Design and Integrated Circuits,1
3,Computer Security,1
4,Operating Systems and Systems Programming,1
6,Computer Networks,1
7,Deep Learning,1
10,Art History,1
1,Optimization Methods in Engineering,2
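
Grouping rows by semantic similarity amounts to clustering their embedding vectors. The sketch below uses plain k-means on made-up 2-D points as a familiar example; the source does not state which clustering algorithm LOTUS uses, and `kmeans` here is a hypothetical helper with caller-supplied initial centroids so the result is reproducible.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means: alternate between assigning points and updating centroids."""
    centroids = [list(c) for c in initial_centroids]  # copy so inputs are not mutated
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(pt, centroids[c]))
            for pt in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [pt for pt, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Made-up 2-D "embeddings": two tight pairs and one outlier.
points = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [9.0, 0.0]]
labels = kmeans(points, initial_centroids=[points[0], points[2], points[4]])
print(labels)
```

The returned labels play the role of the `cluster_id` column: rows with the same label belong to the same group.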


## Semantic Map

Semantic map performs a natural language projection over an existing column and outputs a new column in the table. Let's see how it can be used to generate a short study plan for each class.


In [12]:
df.sem_map("Generate a short study plan to succeed in {Course Name}")


Unnamed: 0,Course Name,cluster_id,_map
0,Probability and Random Processes,1,"**Study Plan for ""Probability and Random Proce..."
1,Optimization Methods in Engineering,2,"**Study Plan for ""Optimization Methods in Engi..."
2,Digital Design and Integrated Circuits,1,"**Study Plan for ""Digital Design and Integrate..."
3,Computer Security,1,**Study Plan for Computer Security Course**\n\...
4,Operating Systems and Systems Programming,1,"**Study Plan for ""Operating Systems and System..."
5,Compilers,0,**Study Plan for Compilers Course**\n\n**Week ...
6,Computer Networks,1,"**Study Plan for ""Computer Networks"" Course**\..."
7,Deep Learning,1,**Study Plan for Deep Learning Course**\n\n**W...
8,Graphics,0,**Study Plan for the Course: Graphics**\n\n**W...
9,Databases,0,"**Study Plan for ""Databases"" Course**\n\n**Wee..."
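
The projection can be pictured as "render the langex for each row, call the model, store the answer in a new column". Below is a minimal sketch with plain dictionaries in place of a DataFrame; `toy_llm`, `fill`, and `sem_map_sketch` are hypothetical names, and `toy_llm` returns a canned string rather than calling any model.

```python
def toy_llm(prompt: str) -> str:
    """Hypothetical stand-in for a language model call; returns a canned string."""
    return f"[model output for: {prompt}]"

def fill(langex: str, row: dict) -> str:
    """Replace each {column} placeholder with the row's value."""
    for column, value in row.items():
        langex = langex.replace("{" + column + "}", str(value))
    return langex

def sem_map_sketch(rows, langex, output_column="_map"):
    """Return copies of the rows with one extra column holding the model output."""
    return [{**row, output_column: toy_llm(fill(langex, row))} for row in rows]

rows = [{"Course Name": "Compilers"}, {"Course Name": "Databases"}]
mapped = sem_map_sketch(rows, "Generate a short study plan to succeed in {Course Name}")
print(mapped[0]["_map"])
```

As in the output above, the existing columns are preserved and the generated text lands in a new `_map` column.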


## Semantic Aggregation

Semantic aggregation performs an aggregation over all rows of the table. Let's see how it can be used to generate one study plan for all courses.

In [13]:
df.sem_agg("Generate a study plan for all {Course Name}s")._output[0]



"To create a comprehensive study plan for the listed courses, it's essential to allocate time effectively across each subject while considering their complexity and interrelatedness. Here’s a suggested study plan:\n\n**Study Plan:**\n\n1. **Week 1-2: Probability and Random Processes**\n   - Focus on foundational concepts, probability theory, and applications in random processes.\n   - Allocate 10 hours per week for lectures, readings, and problem-solving.\n\n2. **Week 3-4: Optimization Methods in Engineering**\n   - Study optimization techniques, algorithms, and their applications in engineering problems.\n   - Dedicate 8 hours per week to this course.\n\n3. **Week 5-6: Digital Design and Integrated Circuits**\n   - Cover digital logic design, circuit analysis, and integrated circuit design principles.\n   - Spend 10 hours per week on practical exercises and theory.\n\n4. **Week 7: Computer Security**\n   - Understand security principles, cryptography, and network security measures.\n 

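When a table is too large to fit in a single prompt, one common approach to aggregation is to summarize in batches and then summarize the summaries. The sketch below illustrates that general pattern only; it is an assumption, not a description of how `sem_agg` is implemented, and `toy_summarize` is a hypothetical stand-in for one model call.

```python
def toy_summarize(texts):
    """Hypothetical stand-in for one LLM call that merges several texts into one."""
    return " + ".join(texts)

def hierarchical_agg(values, batch_size):
    """Reduce values level by level until a single result remains."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    if not values:
        raise ValueError("values must not be empty")
    level = list(values)
    calls = 0
    while len(level) > 1:
        next_level = []
        for start in range(0, len(level), batch_size):
            next_level.append(toy_summarize(level[start:start + batch_size]))
            calls += 1
        level = next_level
    return level[0], calls

result, calls = hierarchical_agg(["A", "B", "C", "D", "E"], batch_size=2)
print(result, calls)
```

A larger batch size means fewer model calls but longer prompts per call.
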
# Advanced Features

Now that we are familiar with some LOTUS operators, let's take a look at some other features LOTUS provides, such as few-shot demonstrations and chain-of-thought reasoning.

## Few-Shot Demonstrations

Few-shot demonstrations may be provided as a table with an "Answer" column. Let's see how it can be used to generate a bulleted list of tips for each class.

In [14]:
examples_df = pd.DataFrame({"Course Name": ["Computer Architecture"],
                            "Answer": ["1. Understand RISC-V fundamentals. 2. Spend time to understand memory consistency."]})

df.sem_map("What are some tips for {Course Name}?", examples=examples_df)


Unnamed: 0,Course Name,cluster_id,_map
0,Probability and Random Processes,1,"1. Master the basics of probability theory, in..."
1,Optimization Methods in Engineering,2,"1. Master the basics of optimization theory, i..."
2,Digital Design and Integrated Circuits,1,"1. Master the basics of digital logic design, ..."
3,Computer Security,1,1. Familiarize yourself with key concepts: Und...
4,Operating Systems and Systems Programming,1,1. **Master the Basics**: Ensure you have a so...
5,Compilers,0,1. **Master the Basics**: Ensure you have a st...
6,Computer Networks,1,1. Master the OSI and TCP/IP models: Familiari...
7,Deep Learning,1,1. Master the basics of neural networks: Under...
8,Graphics,0,1. Master the basics of 2D and 3D graphics con...
9,Databases,0,1. Master SQL: Practice writing complex querie...
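
To see why an examples table steers the output format, it helps to picture the prompt that few-shot prompting produces: each demonstration becomes a worked question and answer placed before the real question. The sketch below assumes a simple `Q:`/`A:` layout; that layout and the helper names are hypothetical, not the prompt format LOTUS actually uses.

```python
def fill(langex: str, row: dict) -> str:
    """Replace each {column} placeholder with the row's value."""
    for column, value in row.items():
        langex = langex.replace("{" + column + "}", str(value))
    return langex

def build_few_shot_prompt(langex, examples, row):
    """Place each demonstration (question plus answer) before the real question."""
    blocks = [f"Q: {fill(langex, ex)}\nA: {ex['Answer']}" for ex in examples]
    blocks.append(f"Q: {fill(langex, row)}\nA:")
    return "\n\n".join(blocks)

examples = [{
    "Course Name": "Computer Architecture",
    "Answer": "1. Understand RISC-V fundamentals. 2. Spend time to understand memory consistency.",
}]
prompt = build_few_shot_prompt(
    "What are some tips for {Course Name}?", examples, {"Course Name": "Compilers"}
)
print(prompt)
```

Because the demonstration answer is a numbered list, the model tends to answer the final question in the same style, which matches the numbered tips in the output above.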


## Chain-of-Thought (CoT) Reasoning

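LOTUS supports CoT reasoning. In the cell below, `strategy="cot"` selects CoT prompting, the examples table carries a "Reasoning" column alongside "Answer", and `return_explanations=True` adds the model's explanation to the output. Let's see how it can be used to filter for courses related to machine learning.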

In [15]:
cot_df = pd.DataFrame({
    "Course Name": ["Computer Architecture", "Linear Algebra"],
    "Answer": [False, True],
    "Reasoning": ["Computer architecture is not directly related to machine learning",
                  "Linear algebra provides foundations for math for machine learning"],
})
df.sem_filter("{Course Name} is very useful for machine learning", examples=cot_df, strategy="cot", return_explanations=True)


Unnamed: 0,Course Name,cluster_id,explanation_filter
0,Probability and Random Processes,1,Probability and random processes are essential...
1,Optimization Methods in Engineering,2,Optimization methods are crucial in machine le...
7,Deep Learning,1,"Deep learning is a subset of machine learning,..."
9,Databases,0,While databases are important for managing and...
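
With CoT prompting, the model writes its reasoning before its verdict, so the response has to be split into an explanation and a boolean. The sketch below assumes a response of the form `Reasoning: ...` followed by `Answer: True` or `Answer: False`; that format and the `parse_cot_response` helper are hypothetical, chosen only to illustrate the idea.

```python
def parse_cot_response(text: str):
    """Split an assumed 'Reasoning: ... Answer: True/False' response into its parts."""
    reasoning, _, answer = text.rpartition("Answer:")
    reasoning = reasoning.replace("Reasoning:", "", 1).strip()
    passed = answer.strip().lower() == "true"
    return passed, reasoning

response = "Reasoning: Linear algebra underpins most ML models.\nAnswer: True"
passed, explanation = parse_cot_response(response)
print(passed, explanation)
```

The boolean decides whether the row survives the filter, and the reasoning text is what a column such as `explanation_filter` would hold.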
