## Semantic similarity using Sentence Transformers
**Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co.**  
Reference: https://towardsdatascience.com/the-auto-sommelier-how-to-implement-huggingface-transformers-and-build-a-search-engine-9e0f401b1bda

In [1]:
import os
import numpy as np
import pandas as pd
import datetime
import pytz
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns

In [2]:
import torch

print(torch.cuda.is_available())
print(torch.cuda.current_device())
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))

True
0
1
Tesla T4


In [3]:
import transformers
transformers.logging.set_verbosity(transformers.logging.CRITICAL)

In [4]:
pd.options.mode.chained_assignment = None

#pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_columns', 50)

In [5]:
# !pip install texthero

In [6]:
# !pip install sentence_transformers

In [7]:
# !pip install nmslib

In [8]:
# !pip install spacy --upgrade

In [9]:
import texthero as hero
from sentence_transformers import SentenceTransformer, util
import nmslib

Your CPU supports instructions that this binary was not compiled to use: SSE3 SSE4.1 SSE4.2 AVX AVX2
For maximum performance, you can install NMSLIB from sources 
pip install --no-binary :all: nmslib


#### Copy files to local FS from GCP bucket

In [10]:
path_news = '/home/jupyter/data/news'
news_articles = 'news_samsung.json'

In [11]:
if not os.path.isdir(path_news): os.makedirs(path_news)

In [12]:
# !gsutil -m cp -n 'gs://msca-bdp-data-open/news/news_samsung.json' '/home/jupyter/data/news/'

In [13]:
news_df = pd.read_json(os.path.join(path_news, news_articles), orient='records', lines=True)

In [14]:
# Keep only English-language articles
news_eng = news_df[news_df['language']=='english'].reset_index(drop=True)
news_eng.shape

(28479, 4)

In [15]:
%%time 

# Clean article text with texthero (read from news_eng so rows align with the filtered frame)
news_eng['text_clean'] = hero.clean(news_eng['text'])
news_eng['text_clean'] = hero.remove_digits(news_eng['text_clean'], only_blocks=False)

# Clean titles with texthero
news_eng['title_clean'] = hero.clean(news_eng['title'])
news_eng['title_clean'] = hero.remove_digits(news_eng['title_clean'], only_blocks=False)

news_eng = news_eng.query('text_clean.str.len() > 1 and title_clean.str.len() > 1', engine='python')

news_eng.shape

CPU times: user 24.6 s, sys: 216 ms, total: 24.8 s
Wall time: 24.8 s


(27914, 6)
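
The cleaning step above lowercases the text and strips digits, punctuation, and stopwords. As a rough illustration only, the sketch below imitates that behaviour with the standard library. `simple_clean` and the tiny `STOPWORDS` set are hypothetical stand-ins, not texthero's actual implementation, which uses a much larger stopword list and further steps.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; texthero's list is far larger)
STOPWORDS = {"a", "an", "the", "is", "it", "in", "on", "of", "and", "to",
             "my", "that", "me", "just"}

# Map every punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def simple_clean(text):
    """Hypothetical approximation of hero.clean + hero.remove_digits."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # drop all digits, like only_blocks=False
    text = text.translate(PUNCT_TO_SPACE)  # punctuation becomes whitespace
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(simple_clean("It just dawned on me that my Samsung SSD is insanely slow"))
# dawned samsung ssd insanely slow
print(simple_clean("Fs: Samsung S7 Edge"))
# fs samsung s edge
```

Note that removing every digit turns model names such as "S7" into "s", which discards information that could matter for retrieval.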

In [16]:
news_eng[['title', 'title_clean']].head(5)

| | title | title_clean |
|---|---|---|
| 0 | Tech Talks #845 - Facebook Wallet, Xiaomi CC9, Truecaller VOIP, Samsung Dual Display, Revolt Bike | tech talks facebook wallet xiaomi cc truecaller voip samsung dual display revolt bike |
| 1 | SAMSUNG SOUNDBAR RRP $399 \| Home Theatre Systems Campbelltown Area - Campbelltown \| 1221285503 | samsung soundbar rrp home theatre systems campbelltown area campbelltown |
| 2 | It just dawned on me that my Samsung SSD is insanely slow | dawned samsung ssd insanely slow |
| 3 | Samsung announces neural processing plans | samsung announces neural processing plans |
| 4 | Fs: Samsung S7 Edge | fs samsung s edge |


In [17]:
news_eng[['text', 'text_clean']].head(5)

| | text | text_clean |
|---|---|---|
| 0 | Tech Talks #845 - Facebook Wallet, Xiaomi CC9, Truecaller VOIP, Samsung Dual Display, Revolt Bike Trending story found Should your friends see this too? Share on Facebook The biggest internet trends, by email Share on Facebook Share on Twitter TG Deals@ https://tg.deals/ New Channel: https://goo.gl/Jz6p5K Namaskaar Dosto, Tech Talks ke is Episode mein maine aapse kuch interesting Tech News Share ki hai jaise Facebook Wallet, Xiaomi CC9, Truecaller VOIP, Samsung Dual Display, Revolt Bike aur ... | tech talks facebook wallet xiaomi cc truecaller voip samsung dual display revolt bike trending story found friends see share facebook biggest internet trends email share facebook share twitter tg deals https tg deals new channel https goo gl jz p k namaskaar dosto tech talks ke episode mein maine aapse kuch interesting tech news share ki hai jaise facebook wallet xiaomi cc truecaller voip samsung dual display revolt bike aur bahut kuch mujhe umeed hai ki yeh video aapko pasand aayega share... |
| 1 | We are a Pawnbroker located in Campbelltown, NSW and we have a large range of goods for sale\nHere we are selling this\nSAMSUNG SOUNDBAR\nMODEL HW-M360\n2 BUILT IN SPEAKERS\nWIRELESS MUSIC\nWIRELESS SUBWOOFER\nMODEL PS-WM20\nCOMES WITH REMOTE\nEXCELLENT CONDITION\nWe are located at 79 Dumaresq Street Campbelltown\nWe do NOT Deliver - Pick Up Only\nCome in to Campbelltown Cash Convenience and take a look.\nOpening Hours:\n9am-5.30pm Monday to Friday\n9am-2pm Saturday\nCLOSED Sunday | pawnbroker located campbelltown nsw large range goods sale selling samsung soundbar model hw m built speakers wireless music wireless subwoofer model ps wm comes remote excellent condition located dumaresq street campbelltown deliver pick come campbelltown cash convenience take look opening hours am pm monday friday am pm saturday closed sunday |
| 2 | I’ve got 2TB of 860 evos in raid 0 and my boot time to desktop is about 50 seconds from power button press. Anyone have a clue why that could be? CrystalDiskMark shows my read and writes at about 1GB/s\nEdit: okay I just changed some bios settings and I’m down to like 20 seconds now lol, false alarm I guess. Don’t know why it was like that in the first place | ' got tb evos raid boot time desktop seconds power button press anyone clue could crystaldiskmark shows read writes gb edit okay changed bios settings ' like seconds lol false alarm guess ' know like first place |
| 3 | 04:38 Samsung has announced that it plans to develop its neural processing unit (NPU) capabilities in order to deliver enhanced AI applications in the future. With this announcement, the Korean firm said that it will create 2,000 new jobs related to the field, worldwide, by 2030. In order to hire more skilled people for these vacancies, Samsung will be collaborating with more universities and research institutes to help them develop the talent they need. With the new generation of NPU techno... | samsung announced plans develop neural processing unit npu capabilities order deliver enhanced ai applications future announcement korean firm said create new jobs related field worldwide order hire skilled people vacancies samsung collaborating universities research institutes help develop talent need new generation npu technologies samsung develop hardware optimised power vehicle infotainment ivi advanced driver assistance systems adas next generation data centres processing big data samsu... |
| 4 | for sale s7 edge\nMinimal scratches\nNo charger\nWith box\n-SOLD- | sale s edge minimal scratches charger box sold |


#### Transformer Embeddings

In [18]:
%%time

distilbert = SentenceTransformer('distilbert-base-uncased')
# convert_to_tensor=False returns NumPy arrays, which nmslib expects (this is independent of CPU vs. GPU)
# embeddings = distilbert.encode(news_eng['text_clean'].values, convert_to_tensor=True) # Returns a torch tensor instead
embeddings = distilbert.encode(news_eng['text_clean'].values, convert_to_tensor=False)

CPU times: user 1min 11s, sys: 2min 43s, total: 3min 54s
Wall time: 3min 30s
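
`distilbert-base-uncased` is a plain transformer checkpoint, not a model trained specifically for sentence embeddings. To my understanding, `SentenceTransformer` handles such checkpoints by averaging the per-token vectors (mean pooling) to get one fixed-length vector per document. The sketch below shows that pooling step on made-up numbers. `mean_pool` is a hypothetical helper for illustration, not the library's code.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping padded positions (mask == 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                total[k] += vec[k]
    count = max(count, 1)  # guard against an all-padding input
    return [t / count for t in total]

# Toy 3-dimensional token vectors (made-up values); the last position is padding
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0, 4.0]
```

Because the checkpoint was never fine-tuned for similarity, a model trained for sentence embeddings may rank neighbors more sensibly; that is a general expectation, not something measured in this notebook.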


#### Hierarchical Navigable Small World (HNSW) Index

In [19]:
%%time

news_eng['distilbert'] = np.array(embeddings).tolist()

# Initialize a new HNSW index that ranks neighbors by cosine similarity
distilbert_index = nmslib.init(method='hnsw', space='cosinesimil')

distilbert_index.addDataPointBatch(embeddings)

distilbert_index.createIndex({'post': 2}, print_progress=True)


CPU times: user 1min 13s, sys: 1.14 s, total: 1min 15s
Wall time: 11.6 s
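
HNSW is an approximate nearest-neighbor structure: it trades a little recall for much faster lookups than comparing the query against every vector. The sketch below shows the exact computation it approximates, assuming the `cosinesimil` space reports cosine distance (1 minus cosine similarity), which is consistent with the small distances in the query results. `cosine_distance` and `brute_force_knn` are hypothetical helpers, not nmslib functions.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def brute_force_knn(query, vectors, k):
    """Exact k nearest neighbors by scanning every vector."""
    scored = [(i, cosine_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Toy 2-dimensional "embeddings" (made-up values)
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbours = brute_force_knn([1.0, 0.1], toy_vectors, k=2)
print([i for i, _ in neighbours])  # [0, 2]
```

A scan like this costs one distance computation per indexed document for every query, which is the cost the index is built to avoid.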


#### Function to map nearest-neighbor ids back to DataFrame rows

In [20]:
def search_distance(num_matches, dataframe, userQuery):
    """Return the num_matches articles nearest to userQuery, with their distances."""
    if dataframe is None or userQuery is None:
        return None

    # Encode the query as a NumPy array (convert_to_tensor=False), which nmslib expects
    query = distilbert.encode([userQuery], convert_to_tensor=False)

    # ids are positional row numbers, in the order the embeddings were added to the index
    ids, distances = distilbert_index.knnQuery(query, k=num_matches)

    # Look up rows by position (.values), not by index label
    matches = [{'text_clean': dataframe.text_clean.values[i],
                'text': dataframe.text.values[i],
                'distance': dist}
               for i, dist in zip(ids, distances)]

    return pd.DataFrame(matches)

In [21]:
closest_matches = search_distance(num_matches=20,
                                  dataframe=news_eng, 
                                  userQuery = 'microwave oven stove kitchen applicance')
closest_matches

| | text_clean | text | distance |
|---|---|---|---|
| 0 | samsung freestanding dishwasher installed jan https www samsung com au cooking appliances dish washer dw h fs | Samsung freestanding dishwasher.\nInstalled Jan 2017\nhttps://www.samsung.com/au/cooking-appliances/dish-washer-dw60h6050fs/ | 0.19604 |
| 1 | samsung microwave oven grill bought germany good condition | Samsung microwave- oven with grill. bought in Germany and in very good condition. | 0.197602 |
| 2 | samsung french door fridge rf remove freezer divider | Samsung french door fridge Rf4287. How do I remove the freezer divider? | 0.205998 |
| 3 | samsung french door freezer cold drawers fridge also cold checking first thanks millie | My Samsung French Door freezer is too cold and the drawers in the fridge is also too cold. What should I be checking first?\nThanks,\nMillie | 0.208722 |
| 4 | samsung french door freezer cold drawers fridge also cold checking first thanks millie | My Samsung French Door freezer is too cold and the drawers in the fridge is also too cold. What should I be checking first?\nThanks,\nMillie | 0.208722 |
| 5 | stainless steel fridge samsung featured variety conveniences home kitchen attractive stainless steel design double door system ton space inside pull bottom freezer | Stainless steel fridge by Samsung featured a variety of conveniences for you home kitchen Attractive stainless steel design with a double door system with a ton of space inside. Pull out bottom freezer. | 0.211114 |
| 6 | fridge samsung l top mount refrigerator excellent condition crack fresh room sliding door complete refrigerator user manual | Fridge Samsung 320L Top Mount Refrigerator. Excellent condition. Does have crack on fresh room sliding door. Complete with Refrigerator User Manual. | 0.215966 |
| 7 | samsung dishwasher good condition upgrading kitchen unfortunately colour suit manual included pick mortdale | Samsung Dishwasher in good condition. We are upgrading our kitchen and unfortunately the colour does not suit. Manual included. Pick up from Mortdale. | 0.216648 |
| 8 | samsung dishwasher description samsung dishwasher model dwfn t couple years old working perfectly upgrading kitchen integrated appliances hence sale shipping collection | Samsung Dishwasher Description\nSamsung Dishwasher. Model no DWFN320T. A couple of years old. Working perfectly. Upgrading kitchen to all integrated appliances hence sale. Shipping: Collection | 0.219252 |
| 9 | samsung me m l microwave oven | SAMSUNG - ME83M - 23L MICROWAVE OVEN | 0.220262 |
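
Rows 3 and 4 of the results are the same article, so duplicates in the corpus take up slots in the top matches. One possible remedy, sketched below on made-up records, is to keep only the first occurrence of each text while preserving rank order. `drop_duplicate_matches` is a hypothetical helper that is not part of the notebook.

```python
def drop_duplicate_matches(matches):
    """Keep the first occurrence of each text, preserving rank order."""
    seen = set()
    unique = []
    for match in matches:
        if match["text"] not in seen:
            seen.add(match["text"])
            unique.append(match)
    return unique

# Toy matches (hypothetical values, shaped like the result rows)
toy = [
    {"text": "fridge question", "distance": 0.2087},
    {"text": "fridge question", "distance": 0.2087},
    {"text": "microwave oven", "distance": 0.2203},
]
print(len(drop_duplicate_matches(toy)))  # 2
```

On the DataFrame returned by `search_distance`, `closest_matches.drop_duplicates(subset='text')` should achieve the same thing. Removing duplicates from `news_eng` before building the index would avoid the problem at the source.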


In [22]:
datetime.datetime.now(pytz.timezone('US/Central')).strftime("%a, %d %B %Y %H:%M:%S")

'Wed, 27 October 2021 17:24:56'