# Semantic Search - Question Answering System

https://drive.google.com/drive/folders/1Q9prRy02buQ09X5RDia-KMO1G19qPwws?usp=sharing

### Import Packages

In [1]:
from sklearn.metrics.pairwise import cosine_similarity
from datasets import load_dataset
import torch
import pickle
import numpy as np
from sentence_transformers import InputExample, models, SentenceTransformer, losses



### Load the Data

In [1]:
# load the dataset and convert to pandas dataframe
df = load_dataset(
    "fabiochiu/medium-articles",
    data_files="medium_articles.csv",
    split="train"
).to_pandas()

Using custom data configuration fabiochiu--medium-articles-96791ff68926910d
Reusing dataset csv (C:\Users\Sridhar Kamoji\.cache\huggingface\datasets\fabiochiu___csv\fabiochiu--medium-articles-96791ff68926910d\0.0.0\652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)


In [2]:
# drop rows with missing values and sample 100,000 articles
df = df.dropna().sample(100000, random_state=32)
df.head()

| Unnamed: 0 | title | text | url | authors | timestamp | tags |
|---|---|---|---|---|---|---|
| 4172 | How the Data Stole Christmas | by Anonymous\n\nThe door sprung open and our t... | https://medium.com/data-ops/how-the-data-stole... | [] | 2019-12-24 13:22:33.143000+00:00 | ['Data Science', 'Big Data', 'Dataops', 'Analy... |
| 174868 | Automating Light Switch using the ESP32 Board ... | A story about how I escaped the boring task th... | https://python.plainenglish.io/automating-ligh... | ['Tomas Rasymas'] | 2021-09-14 07:20:52.342000+00:00 | ['Programming', 'Python', 'Software Developmen... |
| 100171 | Keep Going Quotes Sayings for When Hope is Lost | It’s a very thrilling thing to achieve a goal.... | https://medium.com/@yourselfquotes/keep-going-... | ['Yourself Quotes'] | 2021-01-05 12:13:04.018000+00:00 | ['Quotes'] |
| 141757 | When Will the Smoke Clear From Bay Area Skies? | Bay Area cities are contending with some of th... | https://thebolditalic.com/when-will-the-smoke-... | ['Matt Charnock'] | 2020-09-15 22:38:33.924000+00:00 | ['Bay Area', 'San Francisco', 'California', 'W... |
| 183489 | The ABC’s of Sustainability… easy as 1, 2, 3 | By Julia DiPrete\n\n(according to the Jackson ... | https://medium.com/sipwines/the-abcs-of-sustai... | ['Sip Wines'] | 2021-03-02 23:39:49.948000+00:00 | ['Wine Tasting', 'Sustainability', 'Wine'] |


#### We will use the article title and its text for generating embeddings. For that, we join the article title and the first 1000 characters from the article text.

In [3]:
# keep only the first 1000 characters of each article:
# the opening sentences usually convey what the article is about

df["text"] = df["text"].str[:1000]
# join article title and the text
df["title_text"] = df["title"] + ". " + df["text"]

In [5]:
# load a pretrained SBERT model; the fine-tuned model from the notebook 5.SBertTraining could be used here instead
model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')



In [11]:
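# Create sentence vectors. Passing the whole list lets encode() batch the texts,
# which is much faster than encoding one text at a time.
df['sent_emb'] = list(model.encode(df['title_text'].tolist(), show_progress_bar=True))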

df.reset_index(drop = True, inplace = True)

In [27]:
# store the data as binary file
# pickle.dump(df, open('./MediumArticlesSentenceEmbedded.pkl', 'wb'))

In [2]:
with open('./MediumArticlesSentenceEmbedded.pkl', 'rb') as f:
    df = pickle.load(f)

# get the sentence vectors as numpy array
sent_vecs = np.vstack(df['sent_emb'])

In [6]:
# s = 'which are good places to visit in London'
# s = 'which are best places to visit in Greece'
s = 'which are beautiful islands to visit in summer'
s_emb = model.encode(s)

In [7]:
# get the similarity scores between the query and sentence embeddings
sim_scores = cosine_similarity(s_emb.reshape((1,-1)), sent_vecs)

# argsort is ascending, so reverse it and keep the 10 most similar articles
top_idx = np.argsort(sim_scores[0])[::-1][:10]

df.loc[top_idx, 'title'].tolist()

['Best Gay Holiday Destinations',
 '10 Best Beaches on the Hawaiian Islands',
 'Santorini Island. The power of this volcanic island creates an energy that overwhelms the senses…',
 'Petani and Myrtos beaches',
 '10 MOST BEAUTIFUL BEACHES IN THE WORLD TO EXPLORE!!!!',
 'Best Place To Travel in 2020 -Bali golden tour — Froxee',
 '7 Reasons to Spend Your Holiday at Karon Beach, Phuket',
 'The Most Beautiful Hidden Beaches in the World',
 'Budget-Friendly Holidays: Visit The Best Summer Destinations In Greece | easyGuide',
 '7 Best Ideas To Enjoy Summer Vacations of 2020']
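
The retrieval step above can be wrapped in a small reusable function. The following is a minimal sketch, not part of the original notebook: the name `top_k_similar` and the toy vectors are assumptions for illustration. It normalises the vectors so that a dot product equals cosine similarity, and uses `np.argpartition` so that only the best `k` scores are fully sorted.

```python
import numpy as np

def top_k_similar(query_vec, doc_vecs, k=10):
    """Return (indices, scores) of the k rows of doc_vecs most similar to query_vec.

    Hypothetical helper for illustration; it is not defined in the notebook.
    """
    # L2-normalise so that a plain dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    k = min(k, len(scores))
    # argpartition selects the k best without sorting everything;
    # only those k candidates are then sorted by descending score
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return top, scores[top]

# toy data standing in for sent_vecs and s_emb
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.2])
idx, sc = top_k_similar(query, docs, k=2)
print(idx, sc)
```

With the notebook's variables, the call would look like `idx, _ = top_k_similar(s_emb, sent_vecs)` followed by `df.loc[idx, 'title']`, which relies on the index having been reset as done earlier.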

## Further Enhancements

### 1. TF-IDF Coupled with SBERT or BERT
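
One possible way to couple the two signals is a weighted blend of the TF-IDF cosine similarity and the embedding cosine similarity. The sketch below is an assumption about how this enhancement might look, not something implemented in this notebook: `hybrid_scores`, the mixing weight `alpha`, and the placeholder dense scores are all hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def hybrid_scores(query, texts, dense_scores, alpha=0.5):
    """Blend lexical (TF-IDF) and dense (embedding) cosine similarities.

    alpha is a hypothetical mixing weight: 1.0 = dense only, 0.0 = TF-IDF only.
    """
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_tfidf = vectorizer.fit_transform(texts)       # one sparse row per document
    query_tfidf = vectorizer.transform([query])       # query in the same vocabulary
    lexical = cosine_similarity(query_tfidf, doc_tfidf)[0]
    return alpha * np.asarray(dense_scores) + (1 - alpha) * lexical

texts = [
    "Best summer destinations in Greece",
    "Automating a light switch with the ESP32 board",
    "Beautiful hidden beaches",
]
# placeholder values standing in for the SBERT cosine scores of the query
dense = np.array([0.6, 0.1, 0.6])
scores = hybrid_scores("places to visit in Greece", texts, dense, alpha=0.5)
ranking = np.argsort(-scores)   # document positions, best first
print(ranking, scores)
```

Here the two travel articles tie on the dense score, and the exact term match on "Greece" breaks the tie. In practice the vectorizer would be fitted once on `df['title_text']` rather than on every query, and `alpha` would need tuning on real queries.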

### 2. Using NER for search 
ref: https://docs.pinecone.io/docs/ner-search
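
The general idea is to extract named entities from the query and restrict the vector search to articles that mention them. The sketch below only illustrates the filtering step under stated assumptions: the entity lists are taken as given (they would come from an NER model, which is not shown), and `entity_filtered_search` with its fallback to all documents is a hypothetical design, not the approach described in the reference.

```python
import numpy as np

def entity_filtered_search(query_vec, query_entities, doc_vecs, doc_entities, k=10):
    """Rank only the documents that share at least one named entity with the query.

    query_entities / doc_entities are assumed to be produced by an NER model (not shown).
    """
    wanted = {e.lower() for e in query_entities}
    candidates = [i for i, ents in enumerate(doc_entities)
                  if wanted & {e.lower() for e in ents}]
    if not candidates:
        # assumption: with no entity match, fall back to searching every document
        candidates = list(range(len(doc_entities)))
    cand = np.array(candidates)
    vecs = doc_vecs[cand]
    # cosine similarity between the query and each candidate vector
    scores = vecs @ query_vec / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(query_vec))
    order = np.argsort(-scores)[:k]
    return cand[order], scores[order]

# toy vectors and hand-written entity lists for illustration
doc_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
doc_entities = [["Hawaii"], ["Greece", "Santorini"], ["Greece"]]
idx, sc = entity_filtered_search(np.array([1.0, 0.0]), ["Greece"], doc_vecs, doc_entities, k=2)
print(idx, sc)
```

Document 0 is the closest vector to the query but is excluded because it does not mention the requested entity, which is the behaviour the filter is meant to provide.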