# Embed
This notebook covers the embedding step of the project's data ingestion pipeline. The full pipeline flow is:

1) **Load**: Load raw data from README files.
2) **Chunk**: Data is chunked at the bullet-point level.
3) <s>**Tokenize**: No tokenization is done for this dataset.</s>
4) **Embed**: Data is embedded with a Sentence Transformers model.
5) **Export**: Data is indexed into an Elasticsearch index.

In [1]:
from sentence_transformers import SentenceTransformer
import pandas as pd
import json
import os
from tqdm.auto import tqdm

In [2]:
folder = '../data/'
json_file = f"{folder}readme_notes_with_ids.json"

with open(json_file) as f:
    # json.load parses the whole file, so it also works if the JSON spans several lines
    docs = json.load(f)
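
Before embedding, it can help to confirm that every record carries the text fields encoded later (`header`, `subheader`, `doc_text`) plus its `doc_id`. The check below is a minimal sketch, not part of the original notebook: the helper name `find_incomplete_docs` and the sample records are hypothetical, and it assumes each record is a flat dict of strings for these keys.

```python
# Hypothetical helper: list records that lack a required, non-empty text field.
REQUIRED_KEYS = ("doc_id", "header", "subheader", "doc_text")


def find_incomplete_docs(docs, required_keys=REQUIRED_KEYS):
    """Return (index, missing_keys) pairs for records with missing or blank fields."""
    problems = []
    for i, doc in enumerate(docs):
        missing = [
            key for key in required_keys
            if not isinstance(doc.get(key), str) or not doc[key].strip()
        ]
        if missing:
            problems.append((i, missing))
    return problems


# Made-up sample records, only to show the shape of the result
sample_docs = [
    {"doc_id": "a_1", "header": "H", "subheader": "S", "doc_text": "text"},
    {"doc_id": "a_2", "header": "H", "subheader": "", "doc_text": "text"},
]
incomplete = find_incomplete_docs(sample_docs)
```

An empty result for the real `docs` list would mean every record is safe to pass to the encoder.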

In [3]:
model_name = 'sentence-transformers/all-MiniLM-L12-v2'
model = SentenceTransformer(model_name)



In [4]:
# Encode each text field separately and store the vector next to its source text
for doc in tqdm(docs):
    doc['header_vec'] = model.encode(doc['header'])
    doc['subheader_vec'] = model.encode(doc['subheader'])
    doc['doc_text_vec'] = model.encode(doc['doc_text'])

  0%|          | 0/385 [00:00<?, ?it/s]
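
Many chunks share the same `header` and `subheader`, so the loop above encodes identical strings repeatedly. One possible optimization, sketched here under assumptions, is to encode each distinct string once in a single batched call and map the vectors back. The helper `encode_unique` is hypothetical, and `fake_encode` is only a stand-in so the sketch runs without downloading a model.

```python
def encode_unique(texts, encode_fn):
    """Encode each distinct string once; return one vector per input text, in input order."""
    unique_texts = list(dict.fromkeys(texts))  # de-duplicate, keeping first-seen order
    vectors = encode_fn(unique_texts)          # a single batched call
    lookup = dict(zip(unique_texts, vectors))
    return [lookup[text] for text in texts]


# Stand-in encoder (assumption): returns a 1-dimensional "vector" per string
calls = []


def fake_encode(batch):
    calls.append(list(batch))
    return [[float(len(s))] for s in batch]


headers = ["Cloud Concepts", "Cloud Concepts", "Pricing"]
header_vecs = encode_unique(headers, fake_encode)
```

In the notebook, `encode_fn` would be `model.encode`, which accepts a list of strings and returns one vector per string. Note that records with identical text then share the same vector object.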

In [5]:
df = pd.json_normalize(docs)

In [6]:
df.head()

| | doc_id | header | subheader | document | doc_text | header_vec | subheader_vec | doc_text_vec |
|---|---|---|---|---|---|---|---|---|
| 0 | 08e49f1028_1 | Cloud Concepts: Describe cloud service types | Infrastructure as a service (IaaS) | [Customer has maximum control of cloud resourc... | Customer has maximum control of cloud resources. | [0.09377581, -0.09716232, -0.029859366, -0.000... | [-0.028358031, -0.050796855, 0.0057167998, -0.... | [0.075708605, -0.053222008, -0.05039908, 0.006... |
| 1 | 08e49f1028_2 | Cloud Concepts: Describe cloud service types | Infrastructure as a service (IaaS) | [Customer has largest share of responsibility ... | Customer has largest share of responsibility i... | [0.09377581, -0.09716232, -0.029859366, -0.000... | [-0.028358031, -0.050796855, 0.0057167998, -0.... | [-0.007555988, -0.020715015, -0.017336251, -0.... |
| 2 | 08e49f1028_3 | Cloud Concepts: Describe cloud service types | Infrastructure as a service (IaaS) | [Only the physical resources are controlled by... | Only the physical resources are controlled by ... | [0.09377581, -0.09716232, -0.029859366, -0.000... | [-0.028358031, -0.050796855, 0.0057167998, -0.... | [0.1366964, 0.025903566, -0.085381664, 0.00066... |
| 3 | 08e49f1028_4 | Cloud Concepts: Describe cloud service types | Infrastructure as a service (IaaS) | [Customer is responsible for installation and ... | Customer is responsible for installation and c... | [0.09377581, -0.09716232, -0.029859366, -0.000... | [-0.028358031, -0.050796855, 0.0057167998, -0.... | [-0.10961027, 0.025502004, 0.089653715, 0.0039... |
| 4 | 08e49f1028_5 | Cloud Concepts: Describe cloud service types | Infrastructure as a service (IaaS) | `[**Scenarios to use IaaS**:, Lift-and-shift mi...` | `**Scenarios to use IaaS**:\nLift-and-shift mig...` | [0.09377581, -0.09716232, -0.029859366, -0.000... | [-0.028358031, -0.050796855, 0.0057167998, -0.... | [0.0007722905, -0.059340313, -0.028729936, 0.0... |


In [9]:
output_pkl = 'az900_notes_with_vectors.pkl'
df.to_pickle(f"{folder}{output_pkl}")
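
This notebook stops at the pickle file; the Export step happens elsewhere. As a rough sketch of how the embedded records might be prepared for that step, the hypothetical helper below builds one action dict per record in the `_index` / `_id` / `_source` shape that the Elasticsearch Python client's bulk helper accepts. The index name `readme-notes` and the choice of `doc_id` as the document ID are assumptions, not taken from the project.

```python
def to_bulk_actions(docs, index_name):
    """Yield one bulk action dict per record."""
    for doc in docs:
        source = {
            # NumPy arrays are not JSON serializable, so convert them to plain lists
            key: (value.tolist() if hasattr(value, "tolist") else value)
            for key, value in doc.items()
        }
        yield {"_index": index_name, "_id": doc["doc_id"], "_source": source}


# Made-up record with a tiny vector, only to show the output shape
example_docs = [{"doc_id": "x_1", "doc_text": "hello", "doc_text_vec": [0.1, 0.2]}]
actions = list(to_bulk_actions(example_docs, "readme-notes"))
```

The target index would also need `dense_vector` fields for the three vector columns; `all-MiniLM-L12-v2` produces 384-dimensional embeddings.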