# Embedding-Based Search

In this notebook, we use embeddings and nearest-neighbor search to recommend news articles. We convert features of each news article into an embedding vector, then run a similarity search to find the vectors closest to a given article's embedding. The articles behind those vectors are similar and relevant to the given article.

In [1]:
import openai
import pandas as pd
import re
import pickle


We use the [News Category Dataset](https://www.kaggle.com/datasets/rmisra/news-category-dataset) from Kaggle. Download it and save it in the same directory as this notebook as `News_Category_Dataset_v3.json`.

In [2]:
df = pd.read_json('News_Category_Dataset_v3.json', lines=True)

In [3]:
df.head()

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22


There are a lot of columns, and we won't need most of them. We keep only the headline, short description, and category.

In [4]:
df = df[["headline", "short_description", "category"]].dropna()

In [5]:
df.head()

Unnamed: 0,headline,short_description,category
0,Over 4 Million Americans Roll Up Sleeves For O...,Health experts said it is too early to predict...,U.S. NEWS
1,"American Airlines Flyer Charged, Banned For Li...",He was subdued by passengers and crew when he ...,U.S. NEWS
2,23 Of The Funniest Tweets About Cats And Dogs ...,"""Until you have a dog you don't understand wha...",COMEDY
3,The Funniest Tweets From Parents This Week (Se...,"""Accidentally put grown-up toothpaste on my to...",PARENTING
4,Woman Who Called Cops On Black Bird-Watcher Lo...,Amy Cooper accused investment firm Franklin Te...,U.S. NEWS


In [6]:
len(df)

209527

With over 200,000 articles, the dataset is large. Let's work with a small random sample of the data for convenience.

In [7]:
df = df.sample(250)
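
Note that `sample` draws a different subset on every run, so the outputs shown in this notebook will differ from yours. A minimal sketch of reproducible sampling follows, assuming a fixed seed is acceptable for your purposes. The toy DataFrame and the seed value `42` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the news DataFrame (hypothetical data).
toy_df = pd.DataFrame({"headline": [f"headline {i}" for i in range(1000)]})

# Passing random_state makes the draw repeatable across runs.
sample_a = toy_df.sample(n=250, random_state=42)
sample_b = toy_df.sample(n=250, random_state=42)

# Both draws select exactly the same rows in the same order.
print(sample_a.index.equals(sample_b.index))
```

The same `random_state` argument could be passed to `df.sample(250)` above if repeatable results are wanted.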

Clean the text with regular expressions: replace newlines, `&`, and `|`, strip punctuation, and collapse repeated whitespace.

In [8]:
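def clean_text(text):
    text = re.sub(r"\n", " ", text)
    text = re.sub(r"&", " and ", text)
    text = re.sub(r"\|", " ", text)
    # Eliminate all punctuation (anything that is not a word character or whitespace)
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse whitespace last, so gaps left by removed punctuation are merged too
    text = re.sub(r"\s+", " ", text)
    return text.strip()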

df["headline"] = df["headline"].apply(clean_text)
df["short_description"] = df["short_description"].apply(clean_text)

In [9]:
for _, row in df.head(5).iterrows():
    print("Headline:", row["headline"])
    print("Category:", row["category"])
    print("About:", row["short_description"])
    print()

Headline: 10 Women Photographers Who Are Changing The Way We See The World
Category: ARTS & CULTURE
About: Women have been fundamental to the art of photography since well there were photographs

Headline: Jared Leto Nearly Sacrificed His Eyebrows For The Sake Of Acting
Category: STYLE
About: Could the world handle a browless Leto face

Headline: How Do You Measure Commitment To The Iran Nuclear Deal
Category: WORLD NEWS
About: While Irans targets are technical and verifiable the targets for US compliance are not

Headline: All Saints Day 2015 Dates Facts And Traditions
Category: RELIGION
About: On this day Catholics honor all those who have entered heaven

Headline: Trumps Week Of Errors Exaggerations And FlatOut Falsehoods
Category: POLITICS
About: Donald Trump says he is a truthful man Maybe truthful to a fault he boasted last week at a North Carolina rally where



Which features of the news articles should we use to recommend similar articles? A combination of the headline and description is a good start: articles whose headline and description are semantically similar are probably relevant to one another.

In [10]:
# Make a new column that concatenates headline and short_description.
# This will be the input to the embedding model.
df["text"] = df["headline"] + " " + df["short_description"]

In [11]:
EMBEDDING_MODEL = "text-embedding-ada-002"  # OpenAI's best embeddings as of Apr 2023

client = openai.OpenAI()

def get_embedding(text: str, model: str = EMBEDDING_MODEL):
    # Request an embedding for a single string and return it as a list of floats
    return client.embeddings.create(input=[text], model=model).data[0].embedding

In [12]:
import numpy as np

In [13]:
h1 = "Tech Giant Announces Groundbreaking AI Advancements in Automation"
h2 = "Leading Tech Corporation Unveils Revolutionary Developments in AI Technology"

print(np.array(get_embedding(h1)) - np.array(get_embedding(h2)))

[-0.00433637 -0.01465408  0.00101204 ... -0.00239996 -0.00236265
 -0.00170826]
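
A raw difference vector with many hundreds of components is hard to read. Collapsing the comparison into a single number is usually more informative. The sketch below does this with cosine similarity on small made-up vectors; `v1`, `v2`, and `v3` are hypothetical stand-ins for real embeddings, so no API call is needed.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 1.0 means the same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = [1.0, 2.0, 3.0]
v2 = [1.1, 1.9, 3.2]   # points in almost the same direction as v1
v3 = [-3.0, 0.5, 0.2]  # points in a very different direction

print(f"v1 vs v2: {cosine_similarity(v1, v2):.3f}")
print(f"v1 vs v3: {cosine_similarity(v1, v3):.3f}")
```

With real embeddings, the same function could be applied to `get_embedding(h1)` and `get_embedding(h2)`; two headlines with similar meaning should score close to 1.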


In [14]:
# Establish a cache of embeddings to avoid recomputing - saves time and money
# Cache is a dict of tuples (text, model) -> embedding, saved as a pickle file

# Set path to embedding cache
embedding_cache_path = "recommendations_embeddings_cache.pkl"

# Load the cache if it exists, and save a copy to disk
try:
    embedding_cache = pd.read_pickle(embedding_cache_path)
except FileNotFoundError:
    embedding_cache = {}
with open(embedding_cache_path, "wb") as embedding_cache_file:
    pickle.dump(embedding_cache, embedding_cache_file)

def embedding_from_string(
    string: str,
    model: str = EMBEDDING_MODEL,
    embedding_cache=embedding_cache
) -> list:
    # Return embedding of given string, using a cache to avoid recomputing.
    if (string, model) not in embedding_cache.keys():
        embedding_cache[(string, model)] = get_embedding(string, model)
        with open(embedding_cache_path, "wb") as embedding_cache_file:
            pickle.dump(embedding_cache, embedding_cache_file)
    return embedding_cache[(string, model)]
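
To see the effect of the cache without spending API calls, the pattern can be exercised with a stand-in embedding function. This is a simplified in-memory sketch: `fake_get_embedding`, `cached_embedding`, and the `calls` counter are hypothetical names for illustration, and the pickle write is left out.

```python
# Track how often the (fake) embedding backend is actually invoked.
calls = {"count": 0}

def fake_get_embedding(text: str, model: str) -> list:
    """Hypothetical stand-in for an API call; returns a tiny dummy vector."""
    calls["count"] += 1
    return [float(len(text)), float(len(model))]

cache = {}

def cached_embedding(text: str, model: str = "dummy-model") -> list:
    """Return the embedding for (text, model), computing it at most once."""
    key = (text, model)
    if key not in cache:
        cache[key] = fake_get_embedding(text, model)
    return cache[key]

first = cached_embedding("hello world")
second = cached_embedding("hello world")   # served from the cache
third = cached_embedding("another headline")

print(calls["count"])  # 2: one backend call per distinct string
```

Keying the cache on the `(text, model)` pair, as the notebook does, keeps embeddings from different models from overwriting each other.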

In [15]:
# as an example, take the first combined text (headline + description) from the dataset
example_string = df["text"].values[0]
print(f"\nExample string: {example_string}")

# print the first 10 dimensions of the embedding
example_embedding = embedding_from_string(example_string)
print(f"\nExample embedding: {example_embedding[:10]}...")



Example string: 10 Women Photographers Who Are Changing The Way We See The World Women have been fundamental to the art of photography since well there were photographs

Example embedding: [-0.00557529553771019, 6.832373765064403e-05, 0.00066228280775249, -0.013964998535811901, 0.02160860039293766, 0.015690160915255547, -0.042839426547288895, 0.004737899638712406, 0.007284717168658972, -0.03258919343352318]...


In [17]:
def distances_from_embeddings(query_embedding: list, embeddings: list) -> list:
    """Return the cosine similarity between the query and each embedding in embeddings.

    Despite the function name, these are similarities: higher values mean closer matches.
    """
    def cosine_similarity(embedding1, embedding2):
        return np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))

    return [cosine_similarity(query_embedding, embedding) for embedding in embeddings]

In [18]:
def indices_of_closest_matches_from_distances(distances: list) -> list:
    """Return all indices, ordered from the closest match to the farthest."""
    # The scores are cosine similarities, so the closest matches have the largest values:
    # sort ascending, then reverse
    return (sorted(range(len(distances)), key=lambda i: distances[i]))[::-1]
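
For larger collections, the same ranking can be computed without Python-level loops. Below is a minimal vectorized sketch, assuming the embeddings fit in memory as one NumPy array; `rank_by_cosine_similarity` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def rank_by_cosine_similarity(query, embeddings):
    """Return (indices, similarities), ordered from most to least similar."""
    matrix = np.asarray(embeddings, dtype=float)
    query = np.asarray(query, dtype=float)
    # Normalize rows so that a plain dot product equals cosine similarity.
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    similarities = matrix @ query
    order = np.argsort(-similarities)  # descending similarity
    return order.tolist(), similarities[order].tolist()

toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
indices, sims = rank_by_cosine_similarity([1.0, 0.0], toy_embeddings)
print(indices)  # [0, 2, 1, 3]
```

The identical vector comes first, the diagonal vector second, the orthogonal vector third, and the opposite vector last.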

In [20]:
def print_recommendations_from_strings(
    strings: list[str],
    index_of_source_string: int,
    k_nearest_neighbors: int = 1,
    model=EMBEDDING_MODEL,
) -> list[int]:
    """Print out the k nearest neighbors of a given string."""
    # get embeddings for all strings
    embeddings = [embedding_from_string(string, model=model) for string in strings]
    # get the embedding of the source string
    query_embedding = embeddings[index_of_source_string]
    # get cosine similarities between the source embedding and all embeddings (including itself)
    distances = distances_from_embeddings(query_embedding, embeddings)
    
    indices_of_nearest_neighbors = indices_of_closest_matches_from_distances(distances)

    # look up the source string
    query_string = strings[index_of_source_string]
    # print out its k nearest neighbors
    k_counter = 0
    for i in indices_of_nearest_neighbors:
        # skip any strings that are identical matches to the starting string
        if query_string == strings[i]:
            continue
        # stop after printing out k articles
        if k_counter >= k_nearest_neighbors:
            break
        k_counter += 1

        # print out the similar string and its score
        # (labeled "Distance", but the value is a cosine similarity: higher is closer)
        print(
            f"""
        --- Recommendation #{k_counter} (nearest neighbor {k_counter} of {k_nearest_neighbors}) ---
        String: {strings[i]}
        Distance: {distances[i]:0.3f}"""
        )

    return indices_of_nearest_neighbors


Now, for a given article, we can generate recommendations for it. Try this with different `article_no` values.

In [21]:
article_no = 0

print("Headline:", df.iloc[article_no]["headline"])
print("Description:", df.iloc[article_no]["short_description"])

Headline: 10 Women Photographers Who Are Changing The Way We See The World
Description: Women have been fundamental to the art of photography since well there were photographs


In [22]:
df["text"].values[article_no]

'10 Women Photographers Who Are Changing The Way We See The World Women have been fundamental to the art of photography since well there were photographs'

In [23]:
descriptions = df["text"].values

print_recommendations_from_strings(descriptions, article_no, k_nearest_neighbors=10)


        --- Recommendation #1 (nearest neighbor 1 of 10) ---
        String: International Festival Of Arts And Ideas Kicks Off In New Haven PHOTOS The two week festival was founded in 1996 differentiating itself from the other established art fairs with its interdisciplinary
        Distance: 0.780

        --- Recommendation #2 (nearest neighbor 2 of 10) ---
        String: Helena Bonham Carter Vogue UK Cover Is Porcelain Perfection PHOTOS Want more Be sure to check out HuffPost Style on Twitter Facebook Tumblr Pinterest and Instagram at HuffPostStyle Shot
        Distance: 0.776

        --- Recommendation #3 (nearest neighbor 3 of 10) ---
        String: Cameron Diazs Beauty Evolution From Teen Model To Screen Siren PHOTOS As 39yearold Diaz celebrates the release of What To Expect When Youre Expecting on May 18th were taking a look back
        Distance: 0.776

        --- Recommendation #4 (nearest neighbor 4 of 10) ---
        String: 6 Ways Change Will Transform You Change is l

[0, 240, 238, 24, 110, 225, 60, 53, 246, 162, ...


The recommendations *should* make sense. If they don't, the random sample of 250 articles may simply contain few articles related to the one you picked. Try a different `article_no` or a larger sample.

Now that you've reached the end, here are some additional things you can spend your time doing in groups:

For the more Data Science / ML oriented people:
- Try to do this with completely different datasets! What about taking Amazon Reviews and doing a review recommendation system? Think about how your preprocessing will differ (your reviews dataset may include lots of numbers you'd want to remove or substitute, etc.)

For the more Computer Science / Data Structures & Algo oriented:
- The exact K-Nearest Neighbors search we used compares the query against every embedding, so its cost grows linearly with the number of articles. Approximate Nearest Neighbors, or ANN, is significantly quicker, but sacrifices some accuracy. Try to do the recommendation search with an ANN algorithm like Hierarchical Navigable Small World (HNSW). Many vector databases use HNSW, so this should be an interesting and relevant exercise that gives you some background for similarity search next week.
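
To get a feel for the speed and accuracy trade-off before tackling HNSW, here is a sketch of a much simpler ANN technique: random-hyperplane locality-sensitive hashing. This is not HNSW, and all names and parameters (`n_bits`, `approximate_neighbors`, the random data) are illustrative assumptions. The idea is to hash vectors into buckets and rank only the query's bucket instead of the whole collection.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_points, n_bits = 32, 2000, 8

# Random vectors standing in for article embeddings (hypothetical data).
points = rng.normal(size=(n_points, dim))
# Each random hyperplane contributes one bit of the hash signature.
hyperplanes = rng.normal(size=(n_bits, dim))
bit_weights = 1 << np.arange(n_bits)

def signatures(vectors):
    """Map each row to an integer bucket id based on its side of each hyperplane."""
    bits = (vectors @ hyperplanes.T > 0).astype(np.int64)
    return bits @ bit_weights

# Build the index: bucket id -> list of point indices.
buckets = {}
for idx, sig in enumerate(signatures(points)):
    buckets.setdefault(int(sig), []).append(idx)

def approximate_neighbors(query, k=5):
    """Rank only the points that share the query's bucket (may miss true neighbors)."""
    candidates = buckets.get(int(signatures(query[None, :])[0]), [])
    if not candidates:
        return []
    cand_vectors = points[candidates]
    sims = cand_vectors @ query / (
        np.linalg.norm(cand_vectors, axis=1) * np.linalg.norm(query)
    )
    best = np.argsort(-sims)[:k]
    return [candidates[i] for i in best]

neighbors = approximate_neighbors(points[0])
print(neighbors)
```

Each query scores only a small fraction of the points, which is where the speedup comes from. The price is that a true neighbor that landed in a different bucket is never considered.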

Some other questions to maybe ponder, and get answered:
- What if we didn't use embeddings? What if we used [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (Term Frequency * Inverse Document Frequency) vectorization instead and did similarity search based on that? 
- What if we use another distance function, like Euclidean distance or dot product, instead of cosine similarity?
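
As a starting point for the last question: when all vectors have unit length, cosine similarity, dot product, and Euclidean distance produce the same ranking, because `||a - b||^2 = 2 - 2 * (a . b)`. The sketch below checks this on random unit vectors. Whether a given embedding model returns unit-length vectors is an assumption worth verifying yourself, for example with `np.linalg.norm`.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(50, 16))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # force unit length
query = vectors[0]

dot_products = vectors @ query
euclidean = np.linalg.norm(vectors - query, axis=1)

# Largest dot product first versus smallest Euclidean distance first.
rank_dot = np.argsort(-dot_products)
rank_euc = np.argsort(euclidean)

print(np.array_equal(rank_dot, rank_euc))
print(np.allclose(euclidean**2, 2 - 2 * dot_products))
```

For vectors that are not normalized, the three measures can disagree, since dot product and Euclidean distance are both sensitive to vector length while cosine similarity is not.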