# Medium Article Semantic Search by Title+Subtitle

### Load Data

In [6]:
import pandas as pd

In [7]:
df = pd.read_csv("medium_post_titles.csv", nrows=10000) # excercise whole data set
# data source: https://www.kaggle.com/datasets/nulldata/medium-post-titles

In [8]:
df["subtitle_truncated_flag"].value_counts()

False    6318
True     3682
Name: subtitle_truncated_flag, dtype: int64

### Data Cleanup

In [9]:
# df.isna().sum()

df = df.dropna()
df = df[~df["subtitle_truncated_flag"]]
# df["subtitle_truncated_flag"].value_counts()

df['title_extended'] = df['title'] + df['subtitle']

In [10]:
# df.head()
# df['category'].nunique()  # metadata
# df.shape # 6k vectors, full set in excercise

### Prep for Upsert

In [40]:
# init pinecone

# API_KEY = 
# ENV = 

import pinecone
from tqdm.autonotebook import tqdm # warning taken care of

pinecone.init(api_key = API_KEY, environment = ENV)


  from tqdm.autonotebook import tqdm


In [None]:
pinecone.create_index(name='medium-data', dimension=384, pod_type='s1', metric="cosine" )

In [2]:
from sentence_transformers import SentenceTransformer
import torch

In [4]:
model = SentenceTransformer('all-MiniLM-L6-v2', device='cuda') # cuda or cpu

In [None]:
df['values'] = df['title_extended'].map(
    lambda x: (model.encode(x)).tolist()) # python list, 6k rows 1 min

In [50]:
df['id'] = df.reset_index(drop = 'index').index

In [52]:
df['metadata'] = df.apply(lambda x: {
    'title' : x['title'],
    'subtitle': x['subtitle'],
    'category': x['category']
    
}, axis=1)

In [54]:
df_upsert = df[['id', 'values', 'metadata']]

In [57]:
df_upsert['id'] = df_upsert['id'].map(lambda x: str(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_upsert['id'] = df_upsert['id'].map(lambda x: str(x))


In [58]:
index =pinecone.Index('medium-data')

In [59]:
index.upsert_from_dataframe(df_upsert) # 6k takes 1 min

sending upsert requests:   0%|          | 0/6211 [00:00<?, ?it/s]

{'upserted_count': 6211}

### Query

In [75]:
xc = index.query((model.encode("which city is the most beautiful")).tolist(), # python list
           top_k=10,
           include_metadata=True) 

In [76]:
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['title']}: {result['metadata']['category']} ")

0.57: 3 Places Where You Can Find Beauty: photography 
0.46: 6 Easy Reasons to Enjoy Exploring South Wales: travel 
0.45: A City That’s Better for the Blind Is Better for Everyone: accessibility 
0.45: A Shining City on a Hill: politics 
0.42: A Most Beautiful Game: sports 
0.4: 6 Literary Cities for Book Lovers To Visit This Year: travel 
0.4: Ace Hotel: A UX Case Study: ux 
0.39: A city and its architecture: cities 
0.39: Adaptive urban design: design 
0.38: Aesthetics of Being: spirituality 


In [77]:
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['subtitle']}: {result['metadata']['category']} ")

0.57: If you are willing to look hard enough, eventually you will see beauty in the most difficult of places.: photography 
0.46: Pembrokeshire is as beautiful as the Italian Coast.: travel 
0.45: Complete parity with the sighted may seem like an impossible goal, but maybe the only thing holding us back is a lack of imagination.: accessibility 
0.45: What does America stand for?: politics 
0.42: The World Cup gets advertising right: sports 
0.4: Combine your love for books and travel with these 6 literary cities.: travel 
0.4: Discover the city you are visting like a local: ux 
0.39: Bangalore Chapter: cities 
0.39: Choatic nature of order: design 
0.38: Examining life through a lens of beauty: spirituality 


### Excercise: Upsert all data