# **How to create Wikipedia embeddings using sentence transformers**

In this notebook I will show how to create Wikipedia embeddings using any sentence-transformer model you want! In addition, we will create a faiss index file, which lets us search the Wikipedia embeddings for the texts most similar to our prompt embeddings and so improve our retrieval!

We will use [the dataset](https://www.kaggle.com/datasets/jjinho/wikipedia-20230701) shared with us by JJ (@jjinho).

In [1]:
import faiss
import pickle
import pandas as pd
import os
import numpy as np
from sentence_transformers import SentenceTransformer
import subprocess

from IPython.display import FileLink, display


Of course you can download the output files directly from Kaggle, but I sometimes struggle with big files (>1 GB), so I use the helper function below, which zips a path and displays a download link.

In [2]:
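def download_file(path, file_name):
    # Zip `path` into /kaggle/working/ and display a clickable download link
    os.chdir('/kaggle/working/')
    zip_path = f"/kaggle/working/{file_name}.zip"  # renamed so the built-in zip() is not shadowed
    # Pass the arguments as a list (no shell) so paths with spaces are handled safely
    result = subprocess.run(["zip", "-r", zip_path, path], capture_output=True, text=True)
    if result.returncode != 0:
        print("Unable to run zip command!")
        print(result.stderr)
        return
    display(FileLink(f'{file_name}.zip'))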
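If the `zip` command is not available in your environment, the archive step can also be done with the standard library. The sketch below is an alternative to the function above, not part of the original notebook; the helper name `make_zip` and the demo paths are made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def make_zip(source_dir: str, archive_base: str) -> str:
    """Zip the contents of source_dir into <archive_base>.zip and return its path."""
    # make_archive appends the ".zip" suffix itself and returns the archive path
    return shutil.make_archive(archive_base, "zip", root_dir=source_dir)


# Self-contained demo in a temporary directory (all paths are illustrative)
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "embeddings"
    source.mkdir()
    (source / "embs_a").write_bytes(b"placeholder bytes")
    archive_path = make_zip(str(source), str(Path(tmp) / "embs_a"))
    archive_exists = Path(archive_path).exists()
    archive_name = Path(archive_path).name
```

On Kaggle, the returned path could then be passed to `FileLink` in the same way as in `download_file`.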

# What are the options?

Here is the list of pretrained sentence-transformer models you can use, taken from the [SBERT website](https://www.sbert.net/docs/pretrained_models.html):

* sentence-transformers/all-mpnet-base-v2
* sentence-transformers/multi-qa-MiniLM-L6-cos-v1
* sentence-transformers/all-distilroberta-v1
* sentence-transformers/all-MiniLM-L12-v2
* sentence-transformers/multi-qa-distilbert-cos-v1
* sentence-transformers/all-MiniLM-L6-v2
* sentence-transformers/paraphrase-multilingual-mpnet-base-v2
* sentence-transformers/paraphrase-albert-small-v2
* sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
* sentence-transformers/paraphrase-MiniLM-L3-v2
* sentence-transformers/distiluse-base-multilingual-cased-v1
* sentence-transformers/distiluse-base-multilingual-cased-v2

I do not think most of them are useful for our task (the multilingual models, for example), but who knows...

To use any of them, just set **model_name** to the model string; with internet enabled in the notebook, the model will be downloaded automatically.

# What to look for?

The most important things are the embedding dimension, speed and quality of a given model.

For example, the best performance comes from `all-mpnet-base-v2`, but its embedding dimension is 768 (so more RAM is needed) and its speed is 2800 sentences/sec on a V100. Compare that with `all-MiniLM-L12-v2`, which has an embedding dimension of 384 and encodes 7500 sentences/sec.

Note that encoding on a GPU is much faster than on a CPU, so this notebook uses a P100.
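To get a feeling for how the embedding dimension drives RAM usage, here is a rough back-of-the-envelope sketch. It assumes the embeddings are stored as a dense float32 matrix (4 bytes per value); the corpus size is a hypothetical number chosen only for illustration, and the helper name is made up.

```python
def embedding_ram_gib(num_texts: int, embedding_dim: int, bytes_per_value: int = 4) -> float:
    """Approximate size of a dense (num_texts, embedding_dim) matrix in GiB."""
    return num_texts * embedding_dim * bytes_per_value / 1024**3


# Hypothetical corpus size, not the real number of Wikipedia articles
num_texts = 1_000_000

ram_768 = embedding_ram_gib(num_texts, 768)  # dimension of all-mpnet-base-v2
ram_384 = embedding_ram_gib(num_texts, 384)  # dimension of all-MiniLM-L12-v2
print(f"768 dims: {ram_768:.2f} GiB, 384 dims: {ram_384:.2f} GiB")
```

Halving the embedding dimension halves the memory needed for the embeddings, and a flat faiss index that stores the raw vectors scales the same way.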

In [3]:
model_name = "/kaggle/input/multi-qa-minilm-l6-cos-v1/multi-qa-MiniLM-L6-cos-v1" 
sentence_transformer = SentenceTransformer(model_name)
parquet_folder = "/kaggle/input/wikipedia-20230701"
faiss_index_path = "/kaggle/working/wikipedia_embeddings.index"

In [4]:
a_df = pd.read_parquet("/kaggle/input/wikipedia-20230701/a.parquet")
a_df.head()

| | id | title | text | categories |
|---|---|---|---|---|
| 0 | 49495844 | A & B High Performance Firearms | A & B High Performance Firearms was a competit... | [Defunct firearms manufacturers, Defunct manuf... |
| 1 | 3579086 | A & C Black | A & C Black is a British book publishing compa... | [Encyclopædia Britannica, Ornithological publi... |
| 2 | 62397582 | A & F Harvey Brothers | A & F Harvey Brothers, first Spinning Cotton M... | [Cotton mills] |
| 3 | 15547032 | A & G Price | A & G Price Limited is an engineering firm and... | [Locomotive manufacturers of New Zealand, Tham... |
| 4 | 8021609 | A & M Karagheusian | thumb\|right\|238px\|A portion of the Karagheusia... | [1904 establishments in the United States, Arm... |


In [5]:
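# Keep only the text before the first "==" (the start of the first section heading), i.e. the abstract
a_df['text'] = a_df['text'].apply(lambda x: x.split("==")[0])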
a_df.head()

| | id | title | text | categories |
|---|---|---|---|---|
| 0 | 49495844 | A & B High Performance Firearms | A & B High Performance Firearms was a competit... | [Defunct firearms manufacturers, Defunct manuf... |
| 1 | 3579086 | A & C Black | A & C Black is a British book publishing compa... | [Encyclopædia Britannica, Ornithological publi... |
| 2 | 62397582 | A & F Harvey Brothers | A & F Harvey Brothers, first Spinning Cotton M... | [Cotton mills] |
| 3 | 15547032 | A & G Price | A & G Price Limited is an engineering firm and... | [Locomotive manufacturers of New Zealand, Tham... |
| 4 | 8021609 | A & M Karagheusian | thumb\|right\|238px\|A portion of the Karagheusia... | [1904 establishments in the United States, Arm... |
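The `split("==")[0]` call keeps everything before the first `==`, which the notebook treats as the start of the first section heading. The toy example below uses made-up article text and a hypothetical helper name to show the effect. Note that the preview above looks unchanged only because the displayed text column is truncated.

```python
def trim_to_abstract(article_text: str) -> str:
    """Return everything before the first '==' marker, as in the cell above."""
    return article_text.split("==")[0]


# Made-up article text in a wiki-style layout
sample = "Example Co. is a fictional publisher.\n\n== History ==\nFounded long ago."
abstract = trim_to_abstract(sample)

# Text without any section heading is returned unchanged
no_sections = trim_to_abstract("A short stub article.")
```

One caveat of this simple approach: if `==` happens to appear inside the lead paragraph itself, the abstract is cut at that point.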


# In case you have enough RAM

In [None]:
document_embeddings = []
# number, other and wiki_2023_index files are not what we need
skipped_suffixes = ("number.parquet", "other.parquet", "wiki_2023_index.parquet")
# Note: os.listdir returns files in arbitrary order, and the index row ids follow that order
for idx, filename in enumerate(os.listdir(parquet_folder)):
    if filename.endswith(".parquet") and not filename.endswith(skipped_suffixes):
        print(f"Processing file_id: {idx} - file_name: {filename} ......")
        parquet_path = os.path.join(parquet_folder, filename)
        df = pd.read_parquet(parquet_path)
        df.text = df.text.apply(lambda x: x.split("==")[0])  # trim each article to its abstract
        sentences = df.text.tolist()
        embeddings = sentence_transformer.encode(sentences, normalize_embeddings=True)
        del df, sentences  # free some memory
        document_embeddings.append(embeddings)  # keep one 2-D array per file

# Stack the per-file arrays into a single (n_articles, dim) float32 matrix, as faiss expects
document_embeddings = np.concatenate(document_embeddings).astype(np.float32, copy=False)
index = faiss.IndexFlatL2(document_embeddings.shape[1])
index.add(document_embeddings)
faiss.write_index(index, faiss_index_path)
print(f"Faiss Index Successfully Saved to '{faiss_index_path}'")

Processing file_id: 0 - file_name: x.parquet ......
Batches:   0%|          | 0/404 [00:00<?, ?it/s]

Processing file_id: 1 - file_name: h.parquet ......
Batches:   0%|          | 0/7347 [00:00<?, ?it/s]

Processing file_id: 2 - file_name: w.parquet ......
Batches:   0%|          | 0/5394 [00:00<?, ?it/s]

Processing file_id: 3 - file_name: g.parquet ......
Batches:   0%|          | 0/7378 [00:00<?, ?it/s]

Processing file_id: 4 - file_name: a.parquet ......
Batches:   0%|          | 0/13836 [00:00<?, ?it/s]

Processing file_id: 5 - file_name: y.parquet ......
Batches:   0%|          | 0/1490 [00:00<?, ?it/s]

Processing file_id: 6 - file_name: l.parquet ......
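`faiss.IndexFlatL2` performs an exact, exhaustive search by L2 distance. Because the embeddings are normalized, ranking by L2 distance gives the same order as ranking by cosine similarity. The pure-Python sketch below is not faiss; it is a toy stand-in with made-up 3-dimensional vectors and hypothetical helper names, meant only to show what the index computes and why the two rankings agree.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    """Dot product, which equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(query, documents, k):
    """Exhaustive nearest-neighbour search: k smallest (distance, row id) pairs."""
    scored = sorted((squared_l2(query, doc), i) for i, doc in enumerate(documents))
    return scored[:k]


# Toy "document embeddings" and a toy "prompt embedding" (made-up values)
documents = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0])]
query = normalize([0.9, 0.1, 0.0])

top2 = search(query, documents, k=2)
top2_ids = [i for _, i in top2]

# For unit vectors: ||q - d||^2 == 2 - 2 * (q . d), so both rankings agree
identity_gap = max(abs(squared_l2(query, d) - (2 - 2 * dot(query, d))) for d in documents)
```

With the real index, the corresponding call is `index.search(query_embeddings, k)`, which returns the distances and the row ids of the nearest documents. The row ids refer to the order in which the embeddings were added.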


# In case you do not have enough RAM

In this case we can do the following:

1. Create embeddings file by file;

2. Dump them to .pickle files;

In [None]:
your_file_name = 'some_file_name'

document_embeddings = []
for idx, filename in enumerate(os.listdir(parquet_folder)):
    # Exact match: endswith() would also pick up e.g. "number.parquet" and "other.parquet" for 'r'
    if filename == f"{your_file_name}.parquet":
        print(f"Processing file_id: {idx} - file_name: {filename} ......")
        parquet_path = os.path.join(parquet_folder, filename)
        df = pd.read_parquet(parquet_path)
        df.text = df.text.apply(lambda x: x.split("==")[0])  # trim each article to its abstract
        sentences = df.text.tolist()
        embeddings = sentence_transformer.encode(sentences, normalize_embeddings=True)
        del df, sentences  # free some memory
        document_embeddings.extend(embeddings)

# pickle your list of embeddings
with open(f"embs_{your_file_name}", "wb") as fp:
    pickle.dump(document_embeddings, fp)

# zip the pickle and display a download link
download_file(f"/kaggle/working/embs_{your_file_name}", f"embs_{your_file_name}")

3. When all embedding lists are obtained, just unpickle them and add them to a final list to create the faiss index file.
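The last step could be sketched as follows. This is a minimal outline under stated assumptions, not the exact code used for this notebook: the helper names and the demo file names are hypothetical, and `build_faiss_index` simply mirrors the `IndexFlatL2` calls from the cell in the previous section (it needs numpy and faiss installed and is only defined here, not run).

```python
import pickle
import tempfile
from pathlib import Path


def load_embedding_lists(pickle_paths):
    """Unpickle each file in the given order and concatenate the lists."""
    all_embeddings = []
    for path in pickle_paths:
        with open(path, "rb") as fp:
            all_embeddings.extend(pickle.load(fp))
    return all_embeddings


def build_faiss_index(all_embeddings, index_path):
    """Build and save a flat L2 index (requires numpy and faiss)."""
    import faiss
    import numpy as np

    matrix = np.asarray(all_embeddings, dtype=np.float32)
    index = faiss.IndexFlatL2(matrix.shape[1])
    index.add(matrix)
    faiss.write_index(index, index_path)
    return index


# Demo of the loading step with tiny made-up embeddings
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, vectors in [("embs_a", [[1.0, 0.0]]), ("embs_b", [[0.0, 1.0], [0.6, 0.8]])]:
        path = Path(tmp) / name
        with open(path, "wb") as fp:
            pickle.dump(vectors, fp)
        paths.append(path)
    merged = load_embedding_lists(paths)
```

The order in which the pickle files are loaded determines the row ids in the index, so keep it fixed if you want to map search results back to articles. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.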

That is all!