# Semantic search in business news

This notebook implements semantic search in [news](https://www.kaggle.com/datasets/rmisra/news-category-dataset) articles. 
The dataset is filtered for news in the 'BUSINESS' category.

We are embedding
- headlines
- news body (short description)
- and date
  
to be able to search for
- notable events, or
- related articles to a specific story.

Results can be skewed towards older or fresher news,
and can also be steered with a specific search term.

## Boilerplate

### Installation

In [2]:
%pip install superlinked==35.1.1

### Imports and constants

In [3]:
from datetime import datetime, timedelta, timezone

import os
import sys
import altair as alt
import pandas as pd
from superlinked import framework as sl

alt.renderers.enable(sl.get_altair_renderer())
alt.data_transformers.disable_max_rows()
pd.set_option("display.max_colwidth", 190)

In [4]:
YEAR_IN_DAYS = 365
TOP_N = 10
DATASET_URL = "https://storage.googleapis.com/superlinked-notebook-news-dataset/business_news.json"
# as the dataset contains articles from 2022 and before, we can set our application's "NOW" to this date
END_OF_2022_TS = int(datetime(2022, 12, 31, 23, 59, tzinfo=timezone.utc).timestamp())
EXECUTOR_DATA = {sl.CONTEXT_COMMON: {sl.CONTEXT_COMMON_NOW: END_OF_2022_TS}}

## Prepare & explore dataset

In [5]:
NROWS = int(os.getenv("NOTEBOOK_TEST_ROW_LIMIT", str(sys.maxsize)))
business_news = pd.read_json(DATASET_URL, convert_dates=True).head(NROWS)

In [6]:
# we are going to need an id column
business_news = business_news.reset_index().rename(columns={"index": "id"})
# let's create utc timestamps
business_news["date"] = [int(date.replace(tzinfo=timezone.utc).timestamp()) for date in business_news.date]
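The `replace(tzinfo=timezone.utc)` call matters because `timestamp()` on a naive `datetime` is interpreted in the machine's local timezone. A minimal sketch of the conversion, assuming the raw date was parsed as a naive midnight value (the date below is chosen to correspond to the first row of the preview):

```python
from datetime import datetime, timezone

# a naive publication date, as it might be parsed from the JSON (assumed value)
naive = datetime(2022, 8, 25)

# attaching UTC makes the epoch value independent of the machine's local timezone
utc_ts = int(naive.replace(tzinfo=timezone.utc).timestamp())

# round-trip back to a UTC datetime to check the value
round_trip = datetime.fromtimestamp(utc_ts, tz=timezone.utc)
print(utc_ts, round_trip.date().isoformat())
```

Without the explicit timezone, the same notebook could produce timestamps shifted by several hours depending on where it runs.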

In [7]:
# a sneak peek into the data
business_news.head()

Unnamed: 0,id,link,headline,category,short_description,authors,date
0,162,https://www.huffpost.com/entry/rei-workers-berkeley-store-union_n_6307a5f4e4b0f72c09ded80d,REI Workers At Berkeley Store Vote To Unionize In Another Win For Labor,BUSINESS,They follow in the footsteps of REI workers in New York City who formed a union earlier this year.,Dave Jamieson,1661385600
1,353,https://www.huffpost.com/entry/twitter-elon-musk-trial-october_n_62d7c115e4b000da23f9c7df,Twitter Lawyer Calls Elon Musk 'Committed Enemy' As Judge Sets October Trial,BUSINESS,Delaware Chancery Judge Kathaleen McCormick dealt the world's richest person a setback in ordering a speedy trial on his abandoned deal to buy Twitter.,Marita Vlachou,1658275200
2,632,https://www.huffpost.com/entry/starbucks-leaves-russian-market-shuts-stores_n_628b9804e4b05cfc268f4413,"Starbucks Leaving Russian Market, Shutting 130 Stores",BUSINESS,Starbucks' move follows McDonald's exit from the Russian market last week.,"DEE-ANN DURBIN, AP",1653264000
3,690,https://www.huffpost.com/entry/coinbase-crypto-slumping_n_627c5582e4b0b74b0e7ed621,Crypto Crash Leaves Trading Platform Coinbase Slumped,BUSINESS,Cryptocurrency trading platform Coinbase has lost half its value in the past week.,"Matt Ott, AP",1652313600
4,727,https://www.huffpost.com/entry/us-april-jobs-report-2022_n_627517dfe4b009a811c295ec,"US Added 428,000 Jobs In April Despite Surging Inflation",BUSINESS,"At 3.6%, unemployment nearly reached the lowest level in half a century.","Paul Wiseman, AP",1651795200


### Understand release date distribution

In [8]:
# some quick transformations and an altair histogram
years_to_plot: pd.DataFrame = pd.DataFrame(
    {"year_of_publication": [int(datetime.fromtimestamp(ts, tz=timezone.utc).year) for ts in business_news["date"]]}
)
alt.Chart(years_to_plot).mark_bar().encode(
    alt.X("year_of_publication:N", bin=True, title="Year of publication"),
    y=alt.Y("count()", title="Count of articles"),
).properties(width=400, height=400)

[Figure: histogram of the count of articles by year of publication]


The largest period time should be around 11 years, as the oldest article is from 2012 and our "NOW" is the end of 2022.

As most articles fall between 2012 and 2017, it also makes sense to add a shorter period time of 4 years to differentiate within the relatively scarce recent period.

It can also make sense to give additional weight to more populous time periods: where articles are dense, differences between them are small and extra weight amplifies them, compared to regions where the data is scarce and differences are larger on average.
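To make the role of period times more concrete, here is a simplified sketch. It is not Superlinked's recency embedding: it assumes a plain linear decay per period (1 at "NOW", 0 at or beyond the period's length) and combines the periods with a weighted average. The `toy_recency_score` helper and the decay shape are assumptions for illustration only; the period lengths and weights mirror the ones used later in this notebook.

```python
from datetime import datetime, timedelta, timezone

YEAR_IN_DAYS = 365
NOW = datetime(2022, 12, 31, 23, 59, tzinfo=timezone.utc)

# (period length, weight) pairs: a 4-year and an 11-year period
PERIODS = [
    (timedelta(days=4 * YEAR_IN_DAYS), 1.0),
    (timedelta(days=11 * YEAR_IN_DAYS), 2.0),
]


def toy_recency_score(published: datetime) -> float:
    """Weighted average of per-period linear decays (illustrative assumption)."""
    age = NOW - published
    total_weight = sum(weight for _, weight in PERIODS)
    score = 0.0
    for period, weight in PERIODS:
        # linear decay clamped to [0, 1]: 1.0 at NOW, 0.0 once older than the period
        decay = min(1.0, max(0.0, 1.0 - age / period))
        score += weight * decay
    return score / total_weight


for year in (2013, 2016, 2020, 2022):
    published = datetime(year, 6, 30, tzinfo=timezone.utc)
    print(year, round(toy_recency_score(published), 3))
```

In this toy model, articles older than 4 years are still told apart by the 11-year period, while recent articles receive a contribution from both periods.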

## Set up Superlinked

In [9]:
# set up schema to accommodate our inputs
class NewsSchema(sl.Schema):
    description: sl.String
    headline: sl.String
    release_timestamp: sl.Timestamp
    id: sl.IdField

In [10]:
news = NewsSchema()

In [11]:
# textual characteristics are embedded using a sentence-transformers model
description_space = sl.TextSimilaritySpace(text=news.description, model="sentence-transformers/all-mpnet-base-v2")
headline_space = sl.TextSimilaritySpace(text=news.headline, model="sentence-transformers/all-mpnet-base-v2")
# release date is encoded using our recency embedding algorithm
recency_space = sl.RecencySpace(
    timestamp=news.release_timestamp,
    period_time_list=[
        sl.PeriodTime(timedelta(days=4 * YEAR_IN_DAYS), weight=1),
        sl.PeriodTime(timedelta(days=11 * YEAR_IN_DAYS), weight=2),
    ],
    negative_filter=0.0,
)

In [12]:
# we create an index of our spaces
news_index = sl.Index(spaces=[description_space, headline_space, recency_space])

In [13]:
# the simple query serves us well when we just want to search the dataset with a search term
# the term is matched against both textual fields
# and we have the option to weight the importance of each input
simple_query = (
    sl.Query(
        news_index,
        weights={
            description_space: sl.Param("description_weight"),
            headline_space: sl.Param("headline_weight"),
            recency_space: sl.Param("recency_weight"),
        },
    )
    .find(news)
    .similar(description_space, sl.Param("query_text"))
    .similar(headline_space, sl.Param("query_text"))
    .select([news.description, news.headline, news.release_timestamp])
    .limit(sl.Param("limit"))
)

# the news query, on the other hand, searches the database with the vector of an existing news article
# the weighting options remain available
news_query = (
    sl.Query(
        news_index,
        weights={
            description_space: sl.Param("description_weight"),
            headline_space: sl.Param("headline_weight"),
            recency_space: sl.Param("recency_weight"),
        },
    )
    .find(news)
    .with_vector(news, sl.Param("news_id"))
    .select([news.description, news.headline, news.release_timestamp])
    .limit(sl.Param("limit"))
)
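
Conceptually, the query-time weights decide how much each space contributes to the final similarity. The sketch below is a simplification under stated assumptions, not Superlinked's internals: it scores a document as the weighted average of per-space cosine similarities. The `weighted_similarity` helper and all vectors are hypothetical toy values.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weighted_similarity(query_parts, doc_parts, weights):
    """Weighted average of per-space cosine similarities (illustrative assumption)."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(w * cosine(query_parts[s], doc_parts[s]) for s, w in weights.items()) / total


# toy 2-d vectors per space; real text embeddings have hundreds of dimensions
query = {"description": [1.0, 0.0], "headline": [1.0, 0.0], "recency": [1.0, 0.0]}
old_match = {"description": [1.0, 0.0], "headline": [1.0, 0.0], "recency": [0.0, 1.0]}
new_miss = {"description": [0.6, 0.8], "headline": [0.6, 0.8], "recency": [1.0, 0.0]}

text_only = {"description": 1, "headline": 1, "recency": 0}
with_recency = {"description": 1, "headline": 1, "recency": 1}

for name, weights in (("text_only", text_only), ("with_recency", with_recency)):
    print(
        name,
        round(weighted_similarity(query, old_match, weights), 3),
        round(weighted_similarity(query, new_miss, weights), 3),
    )
```

With recency weighted at 0, the older but textually exact match ranks first; with recency weighted at 1, the fresher but less related document overtakes it in this toy setup.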

In [14]:
dataframe_parser = sl.DataFrameParser(
    schema=news,
    mapping={news.release_timestamp: "date", news.description: "short_description"},
)
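
As a rough picture of what this mapping implies: the dataset columns `date` and `short_description` feed the schema fields `release_timestamp` and `description`, while `id` and `headline` are picked up from columns of the same name. The plain-Python sketch below only illustrates that renaming; `to_schema_record` is a hypothetical helper, not how `DataFrameParser` is implemented.

```python
# schema field -> dataset column, for the fields whose names differ
COLUMN_FOR_FIELD = {"release_timestamp": "date", "description": "short_description"}
SCHEMA_FIELDS = ["id", "headline", "description", "release_timestamp"]


def to_schema_record(row: dict) -> dict:
    """Pick schema fields from a raw row, renaming mapped columns."""
    # fields without an explicit mapping are read from the column of the same name
    return {field: row[COLUMN_FOR_FIELD.get(field, field)] for field in SCHEMA_FIELDS}


# values taken from the first row of the dataset preview
raw_row = {
    "id": 162,
    "headline": "REI Workers At Berkeley Store Vote To Unionize In Another Win For Labor",
    "category": "BUSINESS",
    "short_description": "They follow in the footsteps of REI workers in New York City who formed a union earlier this year.",
    "date": 1661385600,
}

record = to_schema_record(raw_row)
print(record)
```

Columns that no schema field refers to, such as `category`, are simply left out of the record.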

In [15]:
source: sl.InMemorySource = sl.InMemorySource(news, parser=dataframe_parser)
executor: sl.InMemoryExecutor = sl.InMemoryExecutor(sources=[source], indices=[news_index], context_data=EXECUTOR_DATA)
app: sl.InMemoryApp = executor.run()

In [16]:
source.put([business_news])

## Understanding recency

In [17]:
recency_plotter = sl.RecencyPlotter(recency_space, context_data=EXECUTOR_DATA)
recency_plotter.plot_recency_curve()

[Figure: recency curve plotted by `recency_plotter.plot_recency_curve()`]


## Queries

Let's search for one of the biggest acquisitions of the last decade! We are going to set recency's weight to 0 as it does not matter at this point.

In [None]:
result = app.query(
    simple_query,
    query_text="Microsoft acquires LinkedIn",
    description_weight=1,
    headline_weight=1,
    recency_weight=0,
    limit=TOP_N,
)

df = sl.PandasConverter.to_pandas(result)
sl.PandasConverter.format_date_column(df, "release_timestamp", "release_date")

Unnamed: 0,description,headline,id,similarity_score,release_date
0,"(Reuters) - Microsoft Corp agreed to buy LinkedIn Corp for $26.2 billion in cash, the companies said in",Microsoft Agrees To Acquire LinkedIn For $26.2 Billion,64890,0.879656,2016-06-13
1,"Without question, LinkedIn has forever altered the business landscape -- both digitally and in the physical world. Nowadays, so much of what we call business or career development is car...",The LinkedIn of Things,110756,0.526925,2015-01-06
2,LinkedIn works very well for the millions of people who make the effort to understand how to leverage it effectively. And those people are very likely not spending more than fifteen a mi...,The 7 LinkedIn Job Search Mistakes That Might Be Costing You a Job,127772,0.422179,2014-06-26
3,"Although under-used by average LinkedIn members, LinkedIn Groups can be critical to a successful job search because they enable you to communicate directly with recruiters. And vice versa.",Get LinkedIn to Recruiters for Your Job Search,126794,0.366359,2014-07-07
4,NEW YORK (AP) — Anthem is buying rival Cigna for $48 billion in a deal that would create the nation's largest health insurer,MEGA-MERGER: Anthem To Buy Cigna For $54 Billion,93451,0.346358,2015-07-24
5,The struggling social network is looking for a buyer.,Twitter Is Reportedly In Sales Talks With Google And Salesforce,55880,0.326202,2016-09-23
6,The company's value has soared in the last five years and it has more users than Twitter.,Snapchat Is Reportedly Planning A $25 Billion IPO,54727,0.32464,2016-10-06
7,"If failed corporate mergers teach us anything about business, it's that bigger is not always better. Yep, with a 70 to 90",9 Mergers That Epically Failed,173078,0.323164,2013-02-23
8,"“With the Snap investment, we have invested over $1.5 billion in promising digital businesses in the last eighteen months.”",NBCUniversal Invested $500 Million In Snap Inc As Part Of IPO,41556,0.314068,2017-03-03
9,"Another day, another merger. Telephone companies, drug companies, drugstores, airlines, hospitals, retail stores and beer. Why?",The Great Remix: Why Mergers Are Booming,84610,0.313754,2015-11-01


The first result is about the deal; the others relate to some aspect of the query. Let's upweight recency and watch a recent big acquisition jump to second place.

In [21]:
result = app.query(
    simple_query,
    query_text="Microsoft acquires LinkedIn",
    description_weight=1,
    headline_weight=1,
    recency_weight=1,
    limit=TOP_N,
)

df = sl.PandasConverter.to_pandas(result)
sl.PandasConverter.format_date_column(df, "release_timestamp", "release_date")

Unnamed: 0,description,headline,id,similarity_score,release_date
0,"(Reuters) - Microsoft Corp agreed to buy LinkedIn Corp for $26.2 billion in cash, the companies said in",Microsoft Agrees To Acquire LinkedIn For $26.2 Billion,64890,0.744531,2016-06-13
1,"“My offer is my best and final offer and if it is not accepted, I would need to reconsider my position as a shareholder,” Musk said in a filing.",Elon Musk Offers To Buy 100% Of Twitter,849,0.53172,2022-04-14
2,Starbucks' move follows McDonald's exit from the Russian market last week.,"Starbucks Leaving Russian Market, Shutting 130 Stores",632,0.490162,2022-05-23
3,"Without question, LinkedIn has forever altered the business landscape -- both digitally and in the physical world. Nowadays, so much of what we call business or career development is car...",The LinkedIn of Things,110756,0.462349,2015-01-06
4,Delaware Chancery Judge Kathaleen McCormick dealt the world's richest person a setback in ordering a speedy trial on his abandoned deal to buy Twitter.,Twitter Lawyer Calls Elon Musk 'Committed Enemy' As Judge Sets October Trial,353,0.45594,2022-07-20
5,The decision comes as surging oil prices have been rattling global markets and after Ukraine’s foreign minister criticized Shell for continuing to buy Russian oil.,"Shell Says It Will Stop Buying Russian Oil, Natural Gas",1054,0.45203,2022-03-08
6,That makes seven Starbucks stores that have voted to unionize in a matter of months.,Starbucks Workers In Seattle Vote To Form Union,967,0.449442,2022-03-23
7,Recent statements by CEO Howard Schultz offer a glimpse of some of the hardball tactics that might lay ahead.,"For The Starbucks Union Campaign, A Bruising Contract Fight Is Just Beginning",837,0.43846,2022-04-16
8,"Android creator Andy Rubin, accused of sexual harassment, was given a severance package worth $240 million, according to the lawsuit.",Google's Alphabet Settles With Shareholders Over Payoffs To Execs Accused Of Harassment,4025,0.426013,2020-09-26
9,The store in Arizona joins two in New York as the only corporate Starbucks stores with a union.,Another Starbucks Store Votes To Unionize,1117,0.425772,2022-02-25


Subsequently we can also search with the news article about Elon Musk offering to buy Twitter. As the dataset is quite biased towards old articles, what we get back is news about either Elon Musk or Twitter.

In [22]:
result = app.query(
    news_query,
    description_weight=1,
    headline_weight=1,
    recency_weight=0,
    news_id="849",
    limit=TOP_N,
)

df = sl.PandasConverter.to_pandas(result)
sl.PandasConverter.format_date_column(df, "release_timestamp", "release_date")

Unnamed: 0,description,headline,id,similarity_score,release_date
0,"“My offer is my best and final offer and if it is not accepted, I would need to reconsider my position as a shareholder,” Musk said in a filing.",Elon Musk Offers To Buy 100% Of Twitter,849,0.998989,2022-04-14
1,The U.S. Securities and Exchange Commission filed a motion asking for the Tesla CEO to show why he shouldn't be held in contempt.,SEC Says Elon Musk Violated Fraud Settlement With Tweet,7191,0.568103,2019-02-26
2,Delaware Chancery Judge Kathaleen McCormick dealt the world's richest person a setback in ordering a speedy trial on his abandoned deal to buy Twitter.,Twitter Lawyer Calls Elon Musk 'Committed Enemy' As Judge Sets October Trial,353,0.562954,2022-07-20
3,Recent statements by CEO Howard Schultz offer a glimpse of some of the hardball tactics that might lay ahead.,"For The Starbucks Union Campaign, A Bruising Contract Fight Is Just Beginning",837,0.49407,2022-04-16
4,Don't bet against Musk.,Why Elon Musk’s Plan To Merge Tesla With SolarCity Will Probably Work,59220,0.479866,2016-08-16
5,The decision comes as surging oil prices have been rattling global markets and after Ukraine’s foreign minister criticized Shell for continuing to buy Russian oil.,"Shell Says It Will Stop Buying Russian Oil, Natural Gas",1054,0.478979,2022-03-08
6,"Android creator Andy Rubin, accused of sexual harassment, was given a severance package worth $240 million, according to the lawsuit.",Google's Alphabet Settles With Shareholders Over Payoffs To Execs Accused Of Harassment,4025,0.478042,2020-09-26
7,Starbucks' move follows McDonald's exit from the Russian market last week.,"Starbucks Leaving Russian Market, Shutting 130 Stores",632,0.467965,2022-05-23
8,Elon Musk's empire is consolidating.,Tesla Is Buying Sister Company SolarCity For $2.6 Billion,60483,0.465636,2016-08-01
9,The billionaire wants to marry Tesla and SolarCity. But he says SpaceX should remain a bachelor.,The One Company Elon Musk Wants To Keep Independent,60237,0.462553,2016-08-04


Then we can start biasing towards recency, navigating the tradeoff of letting less connected but recent news into the mix.

In [23]:
result = app.query(
    news_query,
    description_weight=1,
    headline_weight=1,
    recency_weight=1,
    news_id="849",
    limit=TOP_N,
)

df = sl.PandasConverter.to_pandas(result)
sl.PandasConverter.format_date_column(df, "release_timestamp", "release_date")

Unnamed: 0,description,headline,id,similarity_score,release_date
0,"“My offer is my best and final offer and if it is not accepted, I would need to reconsider my position as a shareholder,” Musk said in a filing.",Elon Musk Offers To Buy 100% Of Twitter,849,0.998989,2022-04-14
1,The U.S. Securities and Exchange Commission filed a motion asking for the Tesla CEO to show why he shouldn't be held in contempt.,SEC Says Elon Musk Violated Fraud Settlement With Tweet,7191,0.568103,2019-02-26
2,Delaware Chancery Judge Kathaleen McCormick dealt the world's richest person a setback in ordering a speedy trial on his abandoned deal to buy Twitter.,Twitter Lawyer Calls Elon Musk 'Committed Enemy' As Judge Sets October Trial,353,0.562954,2022-07-20
3,Recent statements by CEO Howard Schultz offer a glimpse of some of the hardball tactics that might lay ahead.,"For The Starbucks Union Campaign, A Bruising Contract Fight Is Just Beginning",837,0.49407,2022-04-16
4,Don't bet against Musk.,Why Elon Musk’s Plan To Merge Tesla With SolarCity Will Probably Work,59220,0.479866,2016-08-16
5,The decision comes as surging oil prices have been rattling global markets and after Ukraine’s foreign minister criticized Shell for continuing to buy Russian oil.,"Shell Says It Will Stop Buying Russian Oil, Natural Gas",1054,0.478979,2022-03-08
6,"Android creator Andy Rubin, accused of sexual harassment, was given a severance package worth $240 million, according to the lawsuit.",Google's Alphabet Settles With Shareholders Over Payoffs To Execs Accused Of Harassment,4025,0.478042,2020-09-26
7,Starbucks' move follows McDonald's exit from the Russian market last week.,"Starbucks Leaving Russian Market, Shutting 130 Stores",632,0.467965,2022-05-23
8,Elon Musk's empire is consolidating.,Tesla Is Buying Sister Company SolarCity For $2.6 Billion,60483,0.465636,2016-08-01
9,The billionaire wants to marry Tesla and SolarCity. But he says SpaceX should remain a bachelor.,The One Company Elon Musk Wants To Keep Independent,60237,0.462553,2016-08-04
