## Find similar content from MLOps.community meetups based on a search query

For context, I wanted to do this for all MLOps.community meetups worldwide, but the Meetup.com API isn't free, so I decided to cover only the Berlin ones. I also had to copy the data manually, because Meetup requires you to be logged in to access the details of a meetup.

I am using Milvus to run a similarity search over the content of the MLOps.community meetups.

I also use ChatGPT to summarise the content of the most similar meetups.

### Install all libraries

In [1]:
! poetry add milvus==2.2.14 pymilvus openai python-dotenv sentence-transformers

Using version [39;1m^2.3.7[39;22m for [36mpymilvus[39m
Using version [39;1m^1.14.1[39;22m for [36mopenai[39m
Using version [39;1m^1.0.1[39;22m for [36mpython-dotenv[39m
Using version [39;1m^2.5.1[39;22m for [36msentence-transformers[39m

[34mUpdating dependencies[39m
[2K[34mResolving dependencies...[39m [39;2m(3.9s)[39;22m

[39;1mPackage operations[39;22m: [34m0[39m installs, [34m4[39m updates, [34m0[39m removals

  [34;1m-[39;22m [39mUpdating [39m[36mfsspec[39m[39m ([39m[39;1m2024.2.0[39;22m[39m -> [39m[39;1m2024.3.1[39;22m[39m)[39m: [34mPending...[39m
[1A[0J  [34;1m-[39;22m [39mUpdating [39m[36mfsspec[39m[39m ([39m[39;1m2024.2.0[39;22m[39m -> [39m[39;1m2024.3.1[39;22m[39m)[39m: [34mDownloading...[39m [39;1m0%[39;22m
[1A[0J  [34;1m-[39;22m [39mUpdating [39m[36mfsspec[39m[39m ([39m[39;1m2024.2.0[39;22m[39m -> [39m[39;1m2024.3.1[39;22m[39m)[39m: [34mDownloading...[39m [39;1m100%[39;22m
[1A[0J 

In [5]:
! pip install milvus==2.2.14 pymilvus openai python-dotenv sentence-transformers


In [2]:
import os
import openai
from dotenv import load_dotenv

load_dotenv()

openai.api_key = os.getenv("OPENAI_API_KEY")
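
`os.getenv` returns `None` when the variable is missing, so a missing key would only surface later, when the first API call is made. A minimal sketch of a fail-fast check, assuming you want the notebook to stop early; `require_env` is a hypothetical helper and not part of `python-dotenv` or `openai`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Hypothetical error message; adjust it to your own setup.
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value
```

It could be used as `openai.api_key = require_env("OPENAI_API_KEY")` after `load_dotenv()`.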


#### I ran into some problems using `default_server` from Milvus, so I am using the `debug_server`.

In [3]:
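from pymilvus import connections, db
from milvus import default_server

# Only the port is read from default_server here; a Milvus server is assumed
# to be running already and listening on that port.
connections.connect(uri=f'http://localhost:{default_server.listen_port}')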

db.list_database()


['default']

## Create the Milvus collection needed to perform a similarity search

In [5]:
from pymilvus import utility

if utility.has_collection("mlops_meetups_berlin"):
    utility.drop_collection("mlops_meetups_berlin")

In [6]:
from pymilvus import FieldSchema, CollectionSchema, DataType, Collection

# Each record is inserted as title, date, content, and content embedding.
# The id is generated by Milvus (auto_id=True).
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="date", dtype=DataType.VARCHAR, max_length=100),
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=10000),
    # dim must match the embedding model; all-MiniLM-L6-v2 produces 384 values.
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name="mlops_meetups_berlin", schema=schema)
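
The schema puts hard limits on each record, so it can help to check the rows before inserting them. The following is only a sketch: `find_schema_violations` is a hypothetical helper, the limits are copied from the schema above, and it assumes that Milvus counts `max_length` in bytes rather than characters.

```python
# Limits copied from the schema above; the helper itself is hypothetical.
VARCHAR_LIMITS = {"title": 500, "date": 100, "content": 10000}
EMBEDDING_DIM = 384


def find_schema_violations(row: dict) -> list:
    """Return a list of human-readable problems for one record."""
    problems = []
    for field, limit in VARCHAR_LIMITS.items():
        value = row.get(field)
        if not isinstance(value, str):
            problems.append(f"{field}: expected a string")
            continue
        # Assumption: the limit is counted in UTF-8 bytes, not characters.
        size = len(value.encode("utf-8"))
        if size > limit:
            problems.append(f"{field}: {size} bytes exceeds max_length={limit}")
    embedding = row.get("embedding")
    if embedding is None or len(embedding) != EMBEDDING_DIM:
        problems.append(f"embedding: expected {EMBEDDING_DIM} values")
    return problems
```

An empty list means the record fits the schema; long meetup descriptions are the most likely field to exceed the `content` limit.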

In [7]:
collection.create_index(field_name="embedding")
collection.load()

In [8]:
from sentence_transformers import SentenceTransformer

transformer = SentenceTransformer('all-MiniLM-L6-v2')



In [10]:
import pandas as pd 

df = pd.read_csv('data/data_meetup.csv')
df.head()

Unnamed: 0,title,date,content
0,MLOps.community Berlin Meetup 05,"Thursday, September 14, 2023",Details\nHello Community!\n\nGet your calendar...
1,MLOps.community Berlin 04: Pre-event Women+ In...,"Thursday, June 29, 2023",Details\nHello Community!\n\nGet your calendar...
2,[In-person] Vector DB & LLM Hackathon,"Saturday, June 17, 2023","Details\nJoin us on Saturday, June 17 in Berli..."
3,MLOps.community Berlin Meetup 03,"Thursday, February 2, 2023",Details\nHello Community!\n\nOn February 2nd w...
4,MLOps.community Berlin Meetup 02,"Thursday, October 6, 2022",Details\nHello Community!\n\nOn October 6th we...


#### Embed the content to be able to search it.

In [11]:
# Encode every meetup description into a 384-dimensional vector.
content_detail = df['content'].tolist()
embeddings = [transformer.encode(c) for c in content_detail]

In [12]:
df['embedding'] = embeddings
df.head()

Unnamed: 0,title,date,content,embedding
0,MLOps.community Berlin Meetup 05,"Thursday, September 14, 2023",Details\nHello Community!\n\nGet your calendar...,"[-0.12466437, -0.050888978, -0.011999283, 0.07..."
1,MLOps.community Berlin 04: Pre-event Women+ In...,"Thursday, June 29, 2023",Details\nHello Community!\n\nGet your calendar...,"[-0.0037062387, -0.04097239, 0.040712878, 0.04..."
2,[In-person] Vector DB & LLM Hackathon,"Saturday, June 17, 2023","Details\nJoin us on Saturday, June 17 in Berli...","[-0.036718927, -0.07346592, 0.009544252, 0.016..."
3,MLOps.community Berlin Meetup 03,"Thursday, February 2, 2023",Details\nHello Community!\n\nOn February 2nd w...,"[-0.03421999, -0.063562274, -0.011161464, -0.0..."
4,MLOps.community Berlin Meetup 02,"Thursday, October 6, 2022",Details\nHello Community!\n\nOn October 6th we...,"[-0.014285266, -0.08103279, -0.023105947, 0.06..."


In [13]:
collection.insert(data=df)

(insert count: 6, delete count: 0, upsert count: 0, timestamp: 448470526131961858, success count: 6, err count: 0)

In [19]:
search_terms = "The speaker speaks about Open Source and ML Platform"
search_data = [transformer.encode(search_terms)] # Must be a list.

In [20]:
res = collection.search(
    data=search_data,  # Embedded search query (a list with one vector)
    anns_field="embedding",  # Vector field to search
    param={"metric_type": "IP"},  # Inner product: higher scores mean more similar
    limit=3,  # Return the top 3 hits per query
    output_fields=["title", "content"]  # Include title and content in each hit
)
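
To make the `IP` metric concrete, here is a brute-force sketch in plain Python of what an inner-product search computes. It is not how Milvus works internally (Milvus uses an index rather than a full scan), the function names are made up for illustration, and the toy 2-dimensional vectors stand in for the 384-dimensional embeddings. Assuming the vectors are unit length, the inner product equals the cosine similarity.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k_by_inner_product(query, vectors, k=3):
    """Score every vector against the query and return the k best (index, score) pairs."""
    scored = [(inner_product(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # Higher inner product means more similar
    return [(i, score) for score, i in scored[:k]]


# Toy 2-dimensional vectors standing in for the real embeddings.
docs = [normalise(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])]
query = normalise([1.0, 0.2])
ranking = top_k_by_inner_product(query, docs, k=2)
print(ranking)
```

The first vector points in almost the same direction as the query, so it ranks first, which mirrors how `hit.distance` is ordered from highest to lowest in the results.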

In [21]:
def summarise_meetup_content(content: str) -> str:
    """Ask the chat model for a summary of one meetup description."""
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": "Summarize content you are provided with."
            },
            {
                "role": "user",
                "content": content
            }
        ],
        temperature=0,  # Keep the output as deterministic as possible
        max_tokens=1024,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    summary = response.choices[0].message.content
    return summary
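
Each call to the summariser is a paid API request, and re-running the results cell repeats the same requests. One possible way to avoid that is to cache summaries by content. This is a sketch only: `make_cached_summariser` is a hypothetical helper, and `fake_summarise` stands in for `summarise_meetup_content` so the example runs without an API key.

```python
from functools import lru_cache


def make_cached_summariser(summarise, maxsize=128):
    """Wrap a summarising function so identical content is only summarised once."""

    @lru_cache(maxsize=maxsize)
    def cached(content: str) -> str:
        return summarise(content)

    return cached


# Stand-in for summarise_meetup_content; it counts how often it is called.
calls = []


def fake_summarise(content: str) -> str:
    calls.append(content)
    return content[:20]


cached_summarise = make_cached_summariser(fake_summarise)
cached_summarise("Hello Community! Get your calendar ready.")
cached_summarise("Hello Community! Get your calendar ready.")
print(len(calls))  # The second call is served from the cache
```

In the notebook this would be `cached_summarise = make_cached_summariser(summarise_meetup_content)`. The cache lives in memory only, so it is lost when the kernel restarts.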

In [25]:
for hits in res:
    print("Search Terms:", search_terms)
    print("Results:")
    for hit in hits:
        # Fetch the stored content once and reuse it for the summary.
        content = hit.entity.get("content")
        print(hit.entity.get("title"), "----", hit.distance)
        print(f"{summarise_meetup_content(content)} \n")
    print()

Search Terms: The speaker speaks about Open Source and ML Platform
Results:
First MLOps.community Berlin Meetup ---- 0.5537543296813965
The MLOps.community meetup in Berlin on June 30th will feature a main talk by Stephen Batifol from Wolt on Scaling Open-Source Machine Learning. The event will also include lightning talks, networking, and food/drinks. The agenda includes Stephen's talk, Q&A session, lightning talks, and socializing. Attendees can sign up for lightning talks on Meetup.com. The event is in collaboration with neptune.ai. 

MLOps.community Berlin 04: Pre-event Women+ In Data and AI Festival ---- 0.46235063672065735
The MLOps.community Berlin is hosting a special edition event on June 29th and 30th at Thoughtworks. This event serves as a warm-up for the Women+ In Data and AI festival. The meetup will feature speakers Fiona Coath discussing surveillance capitalism and Magdalena Stenius talking about the carbon footprint of machine learning. The agenda includes talks, lightn