# NER-Powered Semantic Search

Combine named entity recognition (NER) with semantic search to improve the search results.
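The text does not spell out how the two are combined. One common approach, assumed here, is to keep only documents that share a named entity with the query and then rank those by embedding similarity. The toy sketch below illustrates that idea with hand-written 2-dimensional vectors and entity lists; the function names `cosine` and `ner_filtered_search` and all of the data are hypothetical, not part of the notebook.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ner_filtered_search(query_vec, query_entities, docs, top_k=2):
    """Keep documents sharing an entity with the query, then rank by similarity."""
    wanted = set(query_entities)
    # If the query has no entities, fall back to plain semantic search.
    candidates = [d for d in docs if not wanted or wanted & set(d["entities"])]
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:top_k]]


# Toy documents: vectors and entity lists are made up for illustration.
docs = [
    {"id": "a", "vector": [0.9, 0.1], "entities": ["Microsoft"]},
    {"id": "b", "vector": [1.0, 0.0], "entities": ["Apple"]},
    {"id": "c", "vector": [0.2, 0.8], "entities": ["Microsoft", "Bill Gates"]},
]

results = ner_filtered_search([1.0, 0.0], ["Microsoft"], docs)
print(results)
```

Document `b` is the closest vector to the query, but it is dropped because it does not mention the query's entity. That is the effect the entity filter is meant to have.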

### Set up Pinecone

In [2]:
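import os

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))  # API key is read from the environment

index_name = "medium-data"

if index_name not in pc.list_indexes().names():  # .names() yields plain index-name strings
    pc.create_index(
        index_name,
        dimension=768,  # must match the retriever's embedding size
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index(index_name)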

### Set up NER

In [4]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import torch

model_id = "dslim/bert-base-NER"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)  # token-level pipeline with default settings; not used below

device = "cuda" if torch.cuda.is_available() else "cpu"  # run on the GPU when CUDA is available

nlp = pipeline("ner", model=model, tokenizer=tokenizer, device=device, aggregation_strategy="max")  # merges word pieces into whole entities

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0
Device set to use cpu


In [5]:
nlp("Bill Gates is a software engineer and founder of Microsoft")

[{'entity_group': 'PER',
  'score': 0.999742,
  'word': 'Bill Gates',
  'start': 0,
  'end': 10},
 {'entity_group': 'ORG',
  'score': 0.9983804,
  'word': 'Microsoft',
  'start': 49,
  'end': 58}]
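To use these results for search, the entity strings usually need to be pulled out of the list of dictionaries. The following is a minimal sketch that assumes the aggregated output shape shown above; `extract_entities` and its `min_score` parameter are hypothetical helpers, not part of `transformers`.

```python
def extract_entities(ner_output, min_score=0.0):
    """Return unique entity strings in first-seen order.

    ner_output is a list of dicts shaped like the aggregated pipeline
    output above, each with at least 'word' and 'score' keys.
    """
    seen = set()
    entities = []
    for item in ner_output:
        word = item["word"]
        # Skip low-confidence predictions and repeated mentions.
        if item["score"] >= min_score and word not in seen:
            seen.add(word)
            entities.append(word)
    return entities


# Sample copied from the pipeline output shown above.
sample = [
    {"entity_group": "PER", "score": 0.999742, "word": "Bill Gates", "start": 0, "end": 10},
    {"entity_group": "ORG", "score": 0.9983804, "word": "Microsoft", "start": 49, "end": 58},
]

entities = extract_entities(sample)
print(entities)
```

In the notebook, this could be called as `extract_entities(nlp(text))`. Removing duplicates keeps the entity list short when the same name appears many times in a long article.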

In [9]:
from sentence_transformers import SentenceTransformer
import pandas as pd

retriever = SentenceTransformer("flax-sentence-embeddings/all_datasets_v3_mpnet-base")  # produces 768-dimensional embeddings

df = pd.read_csv("medium_articles_10k.csv")

df.head()  # preview the first five rows


| Unnamed: 0.1 | Unnamed: 0 | title | text | url | authors | timestamp | tags |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | ['Mental Health', 'Health', 'Psychology', 'Sci... |
| 1 | 1 | Your Brain On Coronavirus | Your Brain On Coronavirus\n\nA guide to the cu... | https://medium.com/age-of-awareness/how-the-pa... | ['Simon Spichak'] | 2020-09-23 22:10:17.126000+00:00 | ['Mental Health', 'Coronavirus', 'Science', 'P... |
| 2 | 2 | Mind Your Nose | Mind Your Nose\n\nHow smell training can chang... | https://medium.com/neodotlife/mind-your-nose-f... | [] | 2020-10-10 20:17:37.132000+00:00 | ['Biotechnology', 'Neuroscience', 'Brain', 'We... |
| 3 | 3 | The 4 Purposes of Dreams | Passionate about the synergy between science a... | https://medium.com/science-for-real/the-4-purp... | ['Eshan Samaranayake'] | 2020-12-21 16:05:19.524000+00:00 | ['Health', 'Neuroscience', 'Mental Health', 'P... |
| 4 | 4 | Surviving a Rod Through the Head | You’ve heard of him, haven’t you? Phineas Gage... | https://medium.com/live-your-life-on-purpose/s... | ['Rishav Sinha'] | 2020-02-26 00:01:01.576000+00:00 | ['Brain', 'Health', 'Development', 'Psychology... |
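A natural next step, not shown in this section, is to turn each row into an `(id, vector, metadata)` record with the extracted entities stored in the metadata. The sketch below shows one possible batching scheme under that assumption. `batched`, `build_records`, the `named_entities` metadata key, the stand-in `fake_embed` and `fake_extract` functions, and the toy rows are all hypothetical; in the notebook the embedder would presumably wrap `retriever.encode` and the extractor would wrap `nlp`.

```python
def batched(rows, batch_size):
    """Yield (start_offset, slice) pairs, each slice at most batch_size long."""
    for start in range(0, len(rows), batch_size):
        yield start, rows[start:start + batch_size]


def build_records(rows, embed, extract, batch_size=64):
    """Build (id, vector, metadata) tuples from row dicts.

    embed maps a list of texts to a list of vectors (one per text);
    extract maps a single text to a list of entity strings.
    """
    records = []
    for start, batch in batched(rows, batch_size):
        texts = [row["text"] for row in batch]
        vectors = embed(texts)  # embed a whole batch at once
        for offset, (row, vector) in enumerate(zip(batch, vectors)):
            metadata = {
                "title": row["title"],
                "url": row["url"],
                "named_entities": extract(row["text"]),
            }
            records.append((str(start + offset), list(vector), metadata))
    return records


# Stand-ins so the sketch runs without models or network access.
def fake_embed(texts):
    return [[float(len(t)), 1.0] for t in texts]


def fake_extract(text):
    return [name for name in ("Bill Gates", "Microsoft") if name in text]


# Toy rows, not taken from the dataset.
toy_rows = [
    {"title": "Toy A", "text": "Bill Gates founded Microsoft", "url": "https://example.com/a"},
    {"title": "Toy B", "text": "No entities here", "url": "https://example.com/b"},
    {"title": "Toy C", "text": "Microsoft ships software", "url": "https://example.com/c"},
]

records = build_records(toy_rows, fake_embed, fake_extract, batch_size=2)
print(records[0])
```

Processing rows in batches keeps memory use bounded and lets the embedding model handle several texts per call. Tuples in this shape could then be passed to the index in chunks.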
