### In this notebook we assume the data has already been downloaded from Kaggle and put into a standard structured format.

### The purpose of this notebook is to demonstrate the later parts of the ingestion pipeline, including the chunker and the embedding model.

### Note: you will need to set up the conda environment in order to run this notebook. See the README for more details.

The first step is to initialize a chunker and an embedding model. The chunker can chunk either by sentences or by number of tokens; here we chunk by sentences. The embedding model is mainly a wrapper for Hugging Face Sentence Transformers: it handles passing batches of passages to the Sentence Transformer model. Any Sentence Transformer can be used. Here we use 'all-MiniLM-L6-v2', a lightweight model that gives reasonable throughput without a GPU. Note that we can use a very large batch size as long as the chunks stay reasonably short, since the memory used by the model's self-attention scales with the square of the sequence length.
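To make the sentence-based option concrete, here is a minimal sketch of chunking by sentences with a length cap. It is not the repository's `Chunker`: the function name `chunk_by_sentence`, the regex sentence splitter (standing in for NLTK's punkt tokenizer), and the use of whitespace-separated words in place of model tokens are all assumptions made for illustration. How the real `Chunker` interprets `max_len` is not shown in this notebook.

```python
import re
from typing import List


def chunk_by_sentence(text: str, max_len: int = 64) -> List[str]:
    """Split text into sentence chunks of at most max_len words each."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: List[str] = []
    for sentence in sentences:
        words = sentence.split()
        # Overlong sentences are cut into consecutive windows of max_len words.
        for start in range(0, len(words), max_len):
            chunks.append(" ".join(words[start:start + max_len]))
    return chunks


sample = "United won the league. City finished second! Who will win next year?"
print(chunk_by_sentence(sample, max_len=64))
```

A real sentence tokenizer handles abbreviations and quotations far better than this regex, which is presumably why the pipeline downloads punkt.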

In [1]:
from Chunker.Chunking import Chunker
from Embed.Embeddings import EmbeddingModel

chunker = Chunker(chunk_type='sentence', max_len=64)
model = EmbeddingModel(model_name='all-MiniLM-L6-v2', batch_size=256)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/williamshabecoff/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Downloading all-MiniLM-L6-v2 from sentence transformers.
Model max sequence length is 256.
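A rough way to see why short chunks permit a large batch size: each layer's self-attention score tensor has batch × heads × sequence length² entries. The sketch below shows simple batching and estimates the size of that one tensor. The helper names are hypothetical, and 12 attention heads and 4-byte floats are assumptions about the model configuration. The estimate ignores every other activation, so treat it as an illustration of the quadratic term, not a memory budget.

```python
from typing import Iterator, List, Sequence


def batched(passages: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size passages."""
    for start in range(0, len(passages), batch_size):
        yield list(passages[start:start + batch_size])


def attention_scores_mib(batch_size: int, seq_len: int,
                         num_heads: int = 12, bytes_per_value: int = 4) -> float:
    """Size in MiB of one layer's attention score tensor."""
    # Shape is (batch, heads, seq_len, seq_len), hence the quadratic term.
    n_values = batch_size * num_heads * seq_len * seq_len
    return n_values * bytes_per_value / 2**20


for seq_len in (64, 256):
    print(seq_len, attention_scores_mib(batch_size=256, seq_len=seq_len))
```

Under these assumptions, a batch of 256 needs 48 MiB for this tensor at 64 tokens and 768 MiB at 256 tokens: four times the length costs sixteen times the memory.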


### Now we initialize a TextDataset object

We will not chunk on initialization since we will chunk again when computing the embeddings. TextDataset keeps track of documents and chunks. Embeddings are stored separately as a compressed NumPy archive (an npz file), but the mappings between embeddings and chunks are saved in the TextDataset.

So that you don't have to wait long for the embeddings, I have set aside a sample of 2000 of the football articles.

In [2]:
from TextDataset.utils import make_dataset, chunk_and_embed_dataset

football_ds = make_dataset('Data/Text/football-articles-sample', chunker=chunker, save_chunks=False)

embs_path = 'Embeddings/football.npz'

chunk_and_embed_dataset(
    text_dataset=football_ds,
    chunker=chunker,
    embedding_model=model,
    embeddings_path=embs_path,
)

Chunking documents:  88%|████████████████████████████████████████████▋      | 1751/2000 [00:44<00:05, 42.19it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (692 > 512). Running this sequence through the model will result in indexing errors
Chunking documents: 100%|███████████████████████████████████████████████████| 2000/2000 [00:50<00:00, 39.25it/s]
Embedding passages: 100%|█████████████████████████████████████████████████████| 153/153 [01:57<00:00,  1.30it/s]
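The storage layout described above, an embedding matrix in a compressed npz file plus a separate record of which row belongs to which chunk, can be sketched as follows. Only the `'embeddings'` key comes from this notebook. The chunk IDs, the `chunk_ids` list, and the random vectors are hypothetical and do not reflect how TextDataset stores its mapping.

```python
import os
import tempfile

import numpy as np

# Hypothetical chunk IDs; row i of the matrix belongs to chunk_ids[i].
chunk_ids = ["doc0-chunk0", "doc0-chunk1", "doc1-chunk0"]
# Random stand-in vectors; 384 matches the output size of all-MiniLM-L6-v2.
embeddings = np.random.default_rng(0).random((3, 384), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.npz")
    # Store the matrix under the key "embeddings" in a compressed archive.
    np.savez_compressed(path, embeddings=embeddings)
    # Read the array fully into memory while the file still exists.
    with np.load(path) as archive:
        loaded = archive["embeddings"]

# Look up the embedding for a chunk via its row index.
row = chunk_ids.index("doc1-chunk0")
print(loaded.shape, loaded[row][:3])
```

Keeping the matrix apart from the text means the embeddings can be loaded, or regenerated with a different model, without touching the document and chunk records.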


### Next we save the TextDataset to json files at the specified directory.  We then load in the dataset as well as the embeddings.

We could just use the existing TextDataset, but I wanted to show that saving and reloading the object is very easy! This will create JSON files in the 'Data/Dataset' directory. Embeddings are saved to the 'Embeddings' directory.

In [7]:
from TextDataset.utils import save_text_dataset, load_text_dataset
import numpy as np

dataset_pth = 'Data/Dataset/football-articles-sample'

save_text_dataset(football_ds, dataset_pth)
loaded_ds = load_text_dataset(directory_path=dataset_pth, chunker=chunker)

embs = np.load(embs_path)['embeddings']

### Bonus feature! Basic similarity search using our embeddings

The VSS class runs a Faiss-based vector similarity search over our chunk embeddings. We can use it to find the chunks most similar to an arbitrary query!

Our chunks are currently sentences because of our choice of chunker.

In [8]:
from App.Search import VSS

engine = VSS(loaded_ds, model, embs)

In [9]:
query = "Manchester United's greatest win"
retrieved_chunks = engine.similarity_search(query)
retrieved_chunks

Embedding passages: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.35it/s]


['Manchester United are one of the most decorated clubs in the history of English football.',
 'THE BIGGER PICTURE: Manchester United’s last trophy triumph came in the Europa League back in 2016-17, when Mourinho was at their helm.',
 '“It is only fitting that Sergio has been recognised with a statue of his own, in celebration and honour of his accomplishments in one of the most important chapters of Manchester City’s rich and long history.” When Sergio Aguero won Manchester City the title with this 😍The greatest Premier League moment of all time?',
 'Manchester City secured the 2021-22 Premier League title with a dramatic 3-2 comeback victory against Steven Gerrard’s Aston Villa on the final day of the season, beating Liverpool to glory by a single point.',
 'Manchester United are by far the most successful club of the Premier League era, having won 13 league titles since 1992.']
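For intuition about what a similarity search like the one above computes, here is a brute-force sketch that ranks embeddings by cosine similarity to a query vector. It is a stand-in for illustration only, not the `VSS` implementation and not Faiss. The function name `top_k_cosine` and the toy 2-D vectors are hypothetical. In the real pipeline the query vector would come from the embedding model.

```python
import numpy as np


def top_k_cosine(query_vec: np.ndarray, embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return row indices of the k embeddings most similar to query_vec."""
    # Normalise so that a dot product equals cosine similarity.
    embs_norm = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = embs_norm @ query_norm
    # Sort descending by score and keep the first k indices.
    return np.argsort(-scores)[:k]


# Toy 2-D "embeddings" standing in for real chunk vectors.
toy_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
toy_chunks = ["chunk A", "chunk B", "chunk C", "chunk D"]

best = top_k_cosine(np.array([1.0, 0.1]), toy_embs, k=2)
print([toy_chunks[i] for i in best])
```

The returned row indices are turned back into text through the row-to-chunk mapping, which is why the dataset keeps that mapping alongside the embeddings.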

The results seem quite reasonable, which is especially impressive for such a lightweight embedding model. With that, we have a basic search program built on our data ingestion pipeline! Feel free to play around with your own football-related queries.