# Search White House Speeches from 2021 to 2022 Based On Content

This example builds a semantic search over White House speeches from 2021 and 2022. Many of these speeches were made after GPT-3.5 was trained. The White House (Speeches and Remarks) 12/10/2022 dataset can be found on [Kaggle](https://www.kaggle.com/datasets/mohamedkhaledelsafty/the-white-house-speeches-and-remarks-12102022). For this example, we've also made it available on Google Drive. We search the speeches semantically using a vector database and the sentence-transformers library, and we use [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to run the vector database locally.

We begin by installing the necessary libraries:

In [None]:
! pip install pymilvus sentence-transformers gdown milvus

## Download Dataset

Next, we download and extract our dataset.

In [1]:
import gdown
url = 'https://drive.google.com/uc?id=10_sVL0UmEog7mczLedK5s1pnlDOz3Ukf'
output = './white_house_2021_2022.zip'
gdown.download(url, output)

import zipfile

with zipfile.ZipFile("./white_house_2021_2022.zip","r") as zip_ref:
    zip_ref.extractall("./white_house_2021_2022")

Downloading...
From: https://drive.google.com/uc?id=10_sVL0UmEog7mczLedK5s1pnlDOz3Ukf
To: /Users/yujiantang/Documents/workspace/hello_world_project/text/white_house_2021_2022.zip
100%|██████████| 1.63M/1.63M [00:01<00:00, 1.43MB/s]


## Clean the Data

This dataset is not precleaned, so we need to clean it up before we can work with it. Our first preprocessing step is to drop all rows with any `Null` or `NaN` data using `.dropna()`. Next, we guard against partial speeches by keeping only speeches longer than 50 characters. We also remove the carriage return and newline sequences (`\r\n`) from the speeches. Finally, we convert the dates into pandas datetime objects and add a Unix timestamp column.

In [2]:
import pandas as pd
df = pd.read_csv("./white_house_2021_2022/The white house speeches.csv")
df.head()

Unnamed: 0,Title,Date_time,Location,Speech
0,Remarks by President Biden Before Marine One D...,"OCTOBER 12, 2022",Not determined,
1,Remarks by President Biden in a Virtual Recept...,"OCTOBER 11, 2022",Not determined,"6:47 P.M. EDT\r\n \r\nTHE PRESIDENT: Well, th..."
2,Remarks by President Biden at the Summit on Fi...,"OCTOBER 11, 2022",Eisenhower Executive Office Building,"2:56 P.M. EDT\r\n\r\nTHE PRESIDENT: Doctor, t..."
3,Remarks by Vice President Harris at a Democrat...,"OCTOBER 10, 2022","Princeton, New Jersey","THE VICE PRESIDENT: Good morning, everyone.\r..."
4,Remarks by Vice President Harris in a Conversa...,"OCTOBER 09, 2022","Austin, Texas",


In [4]:
df = df.dropna()
df

Unnamed: 0,Title,Date_time,Location,Speech
1,Remarks by President Biden in a Virtual Recept...,"OCTOBER 11, 2022",Not determined,"6:47 P.M. EDT\r\n \r\nTHE PRESIDENT: Well, th..."
2,Remarks by President Biden at the Summit on Fi...,"OCTOBER 11, 2022",Eisenhower Executive Office Building,"2:56 P.M. EDT\r\n\r\nTHE PRESIDENT: Doctor, t..."
3,Remarks by Vice President Harris at a Democrat...,"OCTOBER 10, 2022","Princeton, New Jersey","THE VICE PRESIDENT: Good morning, everyone.\r..."
5,Remarks by Vice President Harris in a Keynote ...,"OCTOBER 09, 2022","Austin, Texas",5:44 P.M. CDT\r\n \r\nTHE VICE PRESIDENT: Go...
6,Remarks by President Biden on the Economy and ...,"OCTOBER 07, 2022","Hagerstown, Maryland","1:24 P.M. EDT\r\n\r\nTHE PRESIDENT: Please, h..."
...,...,...,...,...
1095,Remarks by President Biden on the Fight to Con...,"JANUARY 26, 2021",Not determined,4:50 P.M. EST\r\n\r\n THE PRESIDENT: Than...
1096,Remarks by President Biden at Signing of an Ex...,"JANUARY 26, 2021",Not determined,2:06 P.M. EST \r\n THE PRESIDENT: Good af...
1097,REMARKS BY VICE PRESIDENT HARRIS AFTER RECEIVI...,"JANUARY 26, 2021","Bethesda, Maryland",3:53 P.M. EST\r\n\r\n THE VICE PRESIDENT: ...
1098,Remarks by President Biden at Signing of Execu...,"JANUARY 25, 2021",Not determined,3:42 P.M. EST\r\n\r\n THE PRESIDENT: Good...


In [5]:
cleaned_df = df.loc[(df["Speech"].str.len() > 50)]

In [6]:
cleaned_df

Unnamed: 0,Title,Date_time,Location,Speech
1,Remarks by President Biden in a Virtual Recept...,"OCTOBER 11, 2022",Not determined,"6:47 P.M. EDT\r\n \r\nTHE PRESIDENT: Well, th..."
2,Remarks by President Biden at the Summit on Fi...,"OCTOBER 11, 2022",Eisenhower Executive Office Building,"2:56 P.M. EDT\r\n\r\nTHE PRESIDENT: Doctor, t..."
3,Remarks by Vice President Harris at a Democrat...,"OCTOBER 10, 2022","Princeton, New Jersey","THE VICE PRESIDENT: Good morning, everyone.\r..."
5,Remarks by Vice President Harris in a Keynote ...,"OCTOBER 09, 2022","Austin, Texas",5:44 P.M. CDT\r\n \r\nTHE VICE PRESIDENT: Go...
6,Remarks by President Biden on the Economy and ...,"OCTOBER 07, 2022","Hagerstown, Maryland","1:24 P.M. EDT\r\n\r\nTHE PRESIDENT: Please, h..."
...,...,...,...,...
1091,Remarks By Vice President Harris To State Depa...,"FEBRUARY 04, 2021",Harry S. Truman Building,"THE VICE PRESIDENT: Thank you, Secretary Blin..."
1095,Remarks by President Biden on the Fight to Con...,"JANUARY 26, 2021",Not determined,4:50 P.M. EST\r\n\r\n THE PRESIDENT: Than...
1096,Remarks by President Biden at Signing of an Ex...,"JANUARY 26, 2021",Not determined,2:06 P.M. EST \r\n THE PRESIDENT: Good af...
1097,REMARKS BY VICE PRESIDENT HARRIS AFTER RECEIVI...,"JANUARY 26, 2021","Bethesda, Maryland",3:53 P.M. EST\r\n\r\n THE VICE PRESIDENT: ...


In [7]:
# Remove the "\r\n" line breaks. assign() returns a new DataFrame instead of
# writing into a slice, which avoids pandas' SettingWithCopyWarning.
cleaned_df = cleaned_df.assign(Speech=cleaned_df["Speech"].str.replace("\r\n", "", regex=False))
cleaned_df.iloc[0]["Speech"]

'6:47 P.M. EDT THE PRESIDENT:  Well, thank you very much.  And I thought I saw Fred Sears in that picture.  PARTICIPANT: You have. THE PRESIDENT:  And, by the way, you know, I owe — I owe Fred a debt of gratitude.  Years and years ago, he — he’s the reason why my first wife ended up marrying me.  We flipped a coin.  I said I wanted to go talk to her first, down in Flor- — in the Bahamas on spring break.  And another guy named Mike McCrann wanted to see her.  He said, “I’ll flip a coin.”  And I won the toss.  Thank you, Fred.  I’m indebted to you, pal. All kidding aside, look, I want to thank Lisa.  Look, you’re all a big part of — she’s a big part of why I got elected — all of you are — national co-chair of the campaign, helped lead the Vice Presidential Selection Committee, and a great partner, and someone I trust completely.  When Lisa ran for Congress, she’d say, “When Lisa goes to Washington, we all go to Washington.”  Well, that’s Lisa.  She brings everybody along.  Doesn’t leave 

In [8]:
# Convert the 'Date_time' column to datetime objects.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
cleaned_df = cleaned_df.assign(Date_time=pd.to_datetime(cleaned_df["Date_time"], format="%B %d, %Y"))

# Convert the datetime objects to Unix time (seconds since the epoch)
cleaned_df = cleaned_df.assign(unix_time=cleaned_df["Date_time"].apply(lambda x: int(x.timestamp())))

cleaned_df


Unnamed: 0,Title,Date_time,Location,Speech,unix_time
1,Remarks by President Biden in a Virtual Recept...,2022-10-11,Not determined,"6:47 P.M. EDT THE PRESIDENT: Well, thank you ...",1665446400
2,Remarks by President Biden at the Summit on Fi...,2022-10-11,Eisenhower Executive Office Building,"2:56 P.M. EDTTHE PRESIDENT: Doctor, thank you...",1665446400
3,Remarks by Vice President Harris at a Democrat...,2022-10-10,"Princeton, New Jersey","THE VICE PRESIDENT: Good morning, everyone. A...",1665360000
5,Remarks by Vice President Harris in a Keynote ...,2022-10-09,"Austin, Texas",5:44 P.M. CDT THE VICE PRESIDENT: Good eveni...,1665273600
6,Remarks by President Biden on the Economy and ...,2022-10-07,"Hagerstown, Maryland","1:24 P.M. EDTTHE PRESIDENT: Please, have a se...",1665100800
...,...,...,...,...,...
1091,Remarks By Vice President Harris To State Depa...,2021-02-04,Harry S. Truman Building,"THE VICE PRESIDENT: Thank you, Secretary Blin...",1612396800
1095,Remarks by President Biden on the Fight to Con...,2021-01-26,Not determined,4:50 P.M. EST THE PRESIDENT: Thank you fo...,1611619200
1096,Remarks by President Biden at Signing of an Ex...,2021-01-26,Not determined,2:06 P.M. EST THE PRESIDENT: Good aftern...,1611619200
1097,REMARKS BY VICE PRESIDENT HARRIS AFTER RECEIVI...,2021-01-26,"Bethesda, Maryland","3:53 P.M. EST THE VICE PRESIDENT: Well, s...",1611619200
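
To see where the `unix_time` values come from, here is a minimal standard-library sketch of the same conversion. The helper name `to_unix_time` is ours, not part of the notebook, and the sketch assumes the dates carry no time zone and are treated as midnight UTC, which is how pandas handles naive timestamps.

```python
from datetime import datetime, timezone


def to_unix_time(date_text: str) -> int:
    """Parse a date such as 'OCTOBER 11, 2022' and return Unix seconds at midnight UTC."""
    parsed = datetime.strptime(date_text, "%B %d, %Y")
    # The parsed value has no time zone; pin it to UTC before converting.
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


print(to_unix_time("OCTOBER 11, 2022"))  # 1665446400, matching the first row of the table
```

Storing the date as an integer in this way makes range comparisons straightforward, although the schema below stores only the date string.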


## Establish a Vector Database and Schema

With all of our data cleaning done, it's time to set up our vector database, Milvus Lite. We start by declaring some constants, then start a server and establish a connection.

In [9]:
COLLECTION_NAME = "white_house_2021_2022"
DIMENSION = 384
BATCH_SIZE = 128
TOPK = 3

In [10]:
from milvus import default_server
from pymilvus import connections, utility

default_server.start()
connections.connect(host="127.0.0.1", port=default_server.listen_port)

utility.get_server_version()

[get_server_version] retry:4, cost: 0.27s, reason: <_InactiveRpcError: StatusCode.UNAVAILABLE, internal: Milvus Proxy is not ready yet. please wait>




    __  _________ _   ____  ______
   /  |/  /  _/ /| | / / / / / __/
  / /|_/ // // /_| |/ / /_/ /\ \
 /_/  /_/___/____/___/\____/___/ {Lite}

 Welcome to use Milvus!

 Version:   v2.2.8-lite
 Process:   42841
 Started:   2023-05-17 15:16:44
 Config:    /Users/yujiantang/.milvus.io/milvus-server/2.2.8/configs/milvus.yaml
 Logs:      /Users/yujiantang/.milvus.io/milvus-server/2.2.8/logs

 Ctrl+C to exit ...


'v2.2.8-lite'

Just to make sure that we are starting from a blank slate, we check for the existence of any collection with the same name as the one we chose and drop it.

In [11]:
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)

Now we establish our schema. This dataset gives us four attributes to work with: the title of the speech, the date the speech was given, the location where the speech was given, and the speech itself. We want to perform a semantic search on the content of the speech, so the schema contains the title, the date, the location, and a vector embedding of the speech, along with an auto-generated `id` as the primary key.

For each `VARCHAR` field (string data) we give a maximum length. None of the values in this dataset reach these limits; they serve as rough upper bounds.

In [12]:
from pymilvus import FieldSchema, CollectionSchema, DataType, Collection

# object should be inserted in the format of (title, date, location, speech embedding)
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="date", dtype=DataType.VARCHAR, max_length=100),
    FieldSchema(name="location", dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name=COLLECTION_NAME, schema=schema)
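
If you want to confirm that the `max_length` values leave enough headroom, a small check like the one below can help. This is only a sketch: the helper `longest_utf8_lengths` and the sample rows are made up for illustration, and it assumes that Milvus measures `max_length` in bytes, so it counts UTF-8 bytes rather than characters.

```python
def longest_utf8_lengths(records, field_names):
    """Return the longest UTF-8 byte length seen for each field."""
    longest = {name: 0 for name in field_names}
    for record in records:
        for name in field_names:
            longest[name] = max(longest[name], len(record[name].encode("utf-8")))
    return longest


# Limits copied from the schema above.
limits = {"title": 500, "date": 100, "location": 200}

# Toy rows for illustration only; in the notebook these would come from cleaned_df.
sample_records = [
    {"title": "Toy title with an accent: café", "date": "2022-10-11 00:00:00", "location": "Princeton, New Jersey"},
    {"title": "Another toy title", "date": "2021-01-26 00:00:00", "location": "Not determined"},
]

longest = longest_utf8_lengths(sample_records, limits.keys())
over_limit = [name for name, size in longest.items() if size > limits[name]]
print(longest)
print("Fields over their limit:", over_limit)
```

Non-ASCII characters such as "é" take more than one byte, which is why the byte length can exceed the character count.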

With a vector database server up and running and a collection and schema established, the last step before inserting the vectors is to create our vector index. For this example, we use an `IVF_FLAT` index with the `L2` distance metric and 128 clusters (`nlist`).

In [13]:
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
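
To make `nlist` (and the `nprobe` search parameter used later) less abstract, here is a toy, pure-Python sketch of the inverted-file idea behind `IVF_FLAT`. It is not how Milvus implements the index: the helper names `build_ivf` and `ivf_search` are made up for illustration, and the centroids are simply the first `nlist` vectors rather than the result of k-means training.

```python
import random


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, nlist):
    """Assign every vector to the bucket of its nearest centroid."""
    # Simplification: take the first nlist vectors as centroids.
    # A real IVF index would train the centroids, typically with k-means.
    centroids = [list(v) for v in vectors[:nlist]]
    buckets = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: squared_l2(vec, centroids[c]))
        buckets[nearest].append(idx)
    return centroids, buckets


def ivf_search(query, vectors, centroids, buckets, nprobe, limit):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe_order = sorted(range(len(centroids)), key=lambda c: squared_l2(query, centroids[c]))
    candidates = [idx for c in probe_order[:nprobe] for idx in buckets[c]]
    ranked = sorted(candidates, key=lambda idx: squared_l2(query, vectors[idx]))
    return [(idx, squared_l2(query, vectors[idx])) for idx in ranked[:limit]]


rng = random.Random(0)
toy_vectors = [[rng.random() for _ in range(4)] for _ in range(200)]
toy_query = [rng.random() for _ in range(4)]
centroids, buckets = build_ivf(toy_vectors, nlist=8)

# Probing fewer buckets is faster but may miss true neighbours.
approx = ivf_search(toy_query, toy_vectors, centroids, buckets, nprobe=2, limit=3)
# Probing every bucket is equivalent to an exhaustive search.
exact = ivf_search(toy_query, toy_vectors, centroids, buckets, nprobe=8, limit=3)
print("nprobe=2:", approx)
print("nprobe=8:", exact)
```

The trade-off is the same one we make in the notebook: a larger `nprobe` scans more clusters and improves recall at the cost of search time.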

## Get Vector Embeddings and Populate the Database 

Here we use the `sentence-transformers` library to get vector embeddings for the speeches and populate our Milvus instance with them. For this example, we use the [MiniLM L6 v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model, loaded through the `SentenceTransformer` class, to produce the embeddings.

In [14]:
from sentence_transformers import SentenceTransformer

transformer = SentenceTransformer('all-MiniLM-L6-v2')

We create a function, `embed_insert`, that gets the embeddings for a batch of speeches, and then inserts that batch into our Milvus instance.

In [15]:
# expects four parallel lists: [titles, dates, locations, speeches]
def embed_insert(data: list):
    embeddings = transformer.encode(data[3])
    ins = [
        data[0],
        data[1],
        data[2],
        [x for x in embeddings]
    ]
    collection.insert(ins)

With our helper function written, we are ready to embed and insert the text. First, we turn our `pandas` dataframe into the right format, a list of lists, to insert. For this example, we need a list of four lists. The inner lists correspond to the title, date, location, and speech respectively. We batch the lists and call the `embed_insert` function we wrote above on each of them. Finally, when all of the data has been inserted, we `flush` the collection to ensure that everything is indexed.

In [16]:
data_batch = [[], [], [], []]

for title, date, location, speech in zip(cleaned_df.loc[:, "Title"], cleaned_df.loc[:, "Date_time"], cleaned_df.loc[:, "Location"], cleaned_df.loc[:, "Speech"]):
    data_batch[0].append(title)
    data_batch[1].append(str(date))
    data_batch[2].append(location)
    data_batch[3].append(speech)
    if len(data_batch[0]) % BATCH_SIZE == 0:
        embed_insert(data_batch)
        data_batch = [[], [], [], []]

# Embed and insert the remainder
if len(data_batch[0]) != 0:
    embed_insert(data_batch)

# Call a flush to index any unsealed segments.
collection.flush()
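
The batching pattern above can also be written as a small generator. The sketch below is an alternative formulation under our own naming (`column_batches` is hypothetical, and the rows are toy data); it produces the same column-wise shape that `embed_insert` expects, without calling Milvus or the embedding model.

```python
def column_batches(rows, batch_size):
    """Yield column-wise batches (one list per field) of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # zip(*chunk) transposes row tuples into one tuple per column.
        yield [list(column) for column in zip(*chunk)]


# Toy rows in (title, date, location, speech) order, for illustration only.
toy_rows = [
    (f"title {i}", f"2022-01-{i + 1:02d}", "Not determined", f"speech {i}")
    for i in range(5)
]

batches = list(column_batches(toy_rows, batch_size=2))
for batch in batches:
    print(len(batch[0]), batch[0])
```

Slicing handles the final, shorter batch automatically, so no separate remainder step is needed.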

## Run a Semantic Search

With the database populated, it's now possible to search all of the speeches based on their content. In this example, we search for a speech where the President speaks about renewable energy at NREL, and a speech where the Vice President and the Prime Minister of Canada both speak. We get the embeddings for these descriptions, and then search our vector database for the 3 speeches with the closest embeddings. 

We expect the first description to have the speech titled "Remarks by President Biden During a Tour of the National Renewable Energy Laboratory" in its results and the second description to have the speech titled "REMARKS BY VICE PRESIDENT HARRIS AND PRIME MINISTER TRUDEAU OF CANADA BEFORE BILATERAL MEETING" in its results.

In [17]:
import time
search_terms = ["The President speaks about the impact of renewable energy at the National Renewable Energy Lab.", "The Vice President and the Prime Minister of Canada both speak."]

# Search the database based on input text
def embed_search(data):
    embeds = transformer.encode(data) 
    return [x for x in embeds]

search_data = embed_search(search_terms)

start = time.time()
res = collection.search(
    data=search_data,  # Embedded search value
    anns_field="embedding",  # Search across embeddings
    param={"metric_type": "L2",
            "params": {"nprobe": 10}},
    limit = TOPK,  # Limit to top_k results per search
    output_fields=["title"]  # Include title field in result
)
end = time.time()

for hits_i, hits in enumerate(res):
    print("Title:", search_terms[hits_i])
    print("Search Time:", end-start)
    print("Results:")
    for hit in hits:
        print( hit.entity.get("title"), "----", hit.distance)
    print()

Title: The President speaks about the impact of renewable energy at the National Renewable Energy Lab.
Search Time: 0.009615898132324219
Results:
Remarks by President Biden During a Tour of the National Renewable Energy Laboratory ---- 1.0144075155258179
Press Gaggle by Vice President Harris Aboard Air Force Two Before Departure ---- 1.043080449104309
Remarks by President Biden at the Virtual Leaders Summit on Climate Opening Session ---- 1.0471298694610596

Title: The Vice President and the Prime Minister of Canada both speak.
Search Time: 0.009615898132324219
Results:
REMARKS BY VICE PRESIDENT HARRIS AND PRIME MINISTER TRUDEAU OF CANADA BEFORE BILATERAL MEETING ---- 0.8196960687637329
Remarks by Vice President Harris After Meeting to Discuss the Importance of Passing the Build Back Better Agenda ---- 0.9929051399230957
Remarks by President Biden and Prime Minister Boris Johnson of the United Kingdom Before Bilateral Meeting ---- 1.0264476537704468
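
The distances above are easier to interpret as cosine similarities. The sketch below rests on two assumptions that are worth checking against the Milvus and model documentation: that Milvus reports `L2` as the squared Euclidean distance, and that the embeddings are unit length. Under those assumptions, the squared distance equals 2 - 2·cos, so the conversion is a one-liner. The helper name `cosine_from_squared_l2` is ours.

```python
def cosine_from_squared_l2(distance: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    # For unit-length a and b: ||a - b||^2 = 2 - 2 * cos(a, b).
    return 1.0 - distance / 2.0


# Top-hit distances copied from the two result lists above.
top_hit_distances = [1.0144075155258179, 0.8196960687637329]
for distance in top_hit_distances:
    print(distance, "->", round(cosine_from_squared_l2(distance), 4))
```

If the assumptions hold, the top hit for the second query is noticeably closer to its query than the top hit for the first, which matches how near the query wording is to that speech's title.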



Finally, we stop the Milvus Lite server to clean up.

In [18]:
default_server.stop()

ModuleNotFoundError: No module named 'pymilvus.milvus_client'