# Part 02: Loading and Embedding the Data

## 🗒️ This notebook is divided into 3 sections:
1. Loading the Feature Group from the Hopsworks Feature Store
2. Embedding the data using the sentence-transformers library
3. Saving the model to the Hopsworks Model Registry

In [1]:
from dotenv import load_dotenv
import os
import streamlit as st
import hopsworks

### Pulling the Feature Group

In [2]:
# Load the Hopsworks API key from a .env file (or, alternatively, from Streamlit's secrets.toml)
load_dotenv()

HOPSWORKS_API_KEY = os.getenv('HOPSWORKS_API_KEY')
# HOPSWORKS_API_KEY = st.secrets.HOPSWORKS.HOPSWORKS_API_KEY

# os.getenv returns None instead of raising, so check the value explicitly
if not HOPSWORKS_API_KEY:
    raise Exception('Set environment variable HOPSWORKS_API_KEY')
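
The comment above mentions two possible sources for the key (a `.env` file or Streamlit's `secrets.toml`). As a minimal sketch of that fallback, assuming a hypothetical helper `resolve_api_key` that is not part of Hopsworks or Streamlit, the lookup could be written so that the first non-empty value wins and a missing key fails loudly:

```python
from typing import Mapping, Optional, Sequence


def resolve_api_key(sources: Sequence[Mapping[str, Optional[str]]],
                    name: str = "HOPSWORKS_API_KEY") -> str:
    """Return the first non-empty value of `name` found in `sources`.

    `resolve_api_key` is a hypothetical helper used for illustration only.
    """
    for source in sources:
        value = source.get(name)
        if value:  # skips both missing keys and empty strings
            return value
    raise RuntimeError(f"Set {name} in the environment or in a secrets file")


# Plain dicts stand in for os.environ and a secrets mapping
fake_env = {}
fake_secrets = {"HOPSWORKS_API_KEY": "example-key"}
api_key = resolve_api_key([fake_env, fake_secrets])
print(api_key)
```

In the notebook itself, `os.environ` could be passed as the first source, since it behaves like a mapping of strings.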

In [3]:
try:
    project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)
    fs = project.get_feature_store()
    
    print("Connected to the Hopsworks Feature Store")
except Exception as e:
    print(f"An error occurred: {e}")

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/47254
Connected. Call `.close()` to terminate connection gracefully.
Connected to the Hopsworks Feature Store


In [4]:
feature_group = fs.get_feature_group("papers_info", version=1)

In [5]:
# Pull the feature group as a Pandas DataFrame
df = feature_group.read()

2023-06-12 09:26:15,385 INFO: USE `paper_whiz_featurestore`
2023-06-12 09:26:15,902 INFO: SELECT `fg0`.`id` `id`, `fg0`.`titles` `titles`, `fg0`.`abstracts` `abstracts`, `fg0`.`terms` `terms`, `fg0`.`urls` `urls`
FROM `paper_whiz_featurestore`.`papers_info_1` `fg0`




In [9]:
import pandas as pd
# Setting pandas option to display the full content of DataFrame columns without truncation
pd.set_option('display.max_colwidth', None)

df.head()

Unnamed: 0,id,titles,abstracts,terms,urls
0,5139,Adversarial Learning for Debiasing Knowledge Graph Embeddings,"Knowledge Graphs (KG) are gaining increasing attention in both academia and\nindustry. Despite their diverse benefits, recent research have identified\nsocial and cultural biases embedded in the representations learned from KGs.\nSuch biases can have detrimental consequences on different population and\nminority groups as applications of KG begin to intersect and interact with\nsocial spheres. This paper aims at identifying and mitigating such biases in\nKnowledge Graph (KG) embeddings. As a first step, we explore popularity bias --\nthe relationship between node popularity and link prediction accuracy. In case\nof node2vec graph embeddings, we find that prediction accuracy of the embedding\nis negatively correlated with the degree of the node. However, in case of\nknowledge-graph embeddings (KGE), we observe an opposite trend. As a second\nstep, we explore gender bias in KGE, and a careful examination of popular KGE\nalgorithms suggest that sensitive attribute like the gender of a person can be\npredicted from the embedding. This implies that such biases in popular KGs is\ncaptured by the structural properties of the embedding. As a preliminary\nsolution to debiasing KGs, we introduce a novel framework to filter out the\nsensitive attribute information from the KG embeddings, which we call FAN\n(Filtering Adversarial Network). We also suggest the applicability of FAN for\ndebiasing other network embeddings which could be explored in future work.","[""cs.LG"",""cs.AI"",""cs.SI"",""stat.ML""]",http://arxiv.org/abs/2006.16309v2
1,42559,Graph Convolution for Re-ranking in Person Re-identification,"Nowadays, deep learning is widely applied to extract features for similarity\ncomputation in person re-identification (re-ID) and have achieved great\nsuccess. However, due to the non-overlapping between training and testing IDs,\nthe difference between the data used for model training and the testing data\nmakes the performance of learned feature degraded during testing. Hence,\nre-ranking is proposed to mitigate this issue and various algorithms have been\ndeveloped. However, most of existing re-ranking methods focus on replacing the\nEuclidean distance with sophisticated distance metrics, which are not friendly\nto downstream tasks and hard to be used for fast retrieval of massive data in\nreal applications. In this work, we propose a graph-based re-ranking method to\nimprove learned features while still keeping Euclidean distance as the\nsimilarity metric. Inspired by graph convolution networks, we develop an\noperator to propagate features over an appropriate graph. Since graph is the\nessential key for the propagation, two important criteria are considered for\ndesigning the graph, and three different graphs are explored accordingly.\nFurthermore, a simple yet effective method is proposed to generate a profile\nvector for each tracklet in videos, which helps extend our method to video\nre-ID. Extensive experiments on three benchmark data sets, e.g., Market-1501,\nDuke, and MARS, demonstrate the effectiveness of our proposed approach.","[""cs.CV""]",http://arxiv.org/abs/2107.02220v2
2,55744,"Towards Safe, Explainable, and Regulated Autonomous Driving","There has been recent and growing interest in the development and deployment\nof autonomous vehicles, encouraged by the empirical successes of powerful\nartificial intelligence techniques (AI), especially in the applications of deep\nlearning and reinforcement learning. However, as demonstrated by recent traffic\naccidents, autonomous driving technology is not fully reliable for safe\ndeployment. As AI is the main technology behind the intelligent navigation\nsystems of self-driving vehicles, both the stakeholders and transportation\nregulators require their AI-driven software architecture to be safe,\nexplainable, and regulatory compliant. In this paper, we propose a design\nframework that integrates autonomous control, explainable AI (XAI), and\nregulatory compliance to address this issue, and then provide an initial\nvalidation of the framework with a critical analysis in a case study. Moreover,\nwe describe relevant XAI approaches that can help achieve the goals of the\nframework.","[""cs.AI""]",http://arxiv.org/abs/2111.10518v4
3,82384,InfoSSM: Interpretable Unsupervised Learning of Nonparametric State-Space Model for Multi-modal Dynamics,"The goal of system identification is to learn about underlying physics\ndynamics behind the time-series data. To model the probabilistic and\nnonparametric dynamics model, Gaussian process (GP) have been widely used; GP\ncan estimate the uncertainty of prediction and avoid over-fitting. Traditional\nGPSSMs, however, are based on Gaussian transition model, thus often have\ndifficulty in describing a more complex transition model, e.g. aircraft\nmotions. To resolve the challenge, this paper proposes a framework using\nmultiple GP transition models which is capable of describing multi-modal\ndynamics. Furthermore, we extend the model to the information-theoretic\nframework, the so-called InfoSSM, by introducing a mutual information\nregularizer helping the model to learn interpretable and distinguishable\nmultiple dynamics models. Two illustrative numerical experiments in simple\nDubins vehicle and high-fidelity flight simulator are presented to demonstrate\nthe performance and interpretability of the proposed model. Finally, this paper\nintroduces a framework using InfoSSM with Bayesian filtering for air traffic\ncontrol tracking.","[""stat.ML"",""cs.LG"",""stat.AP""]",http://arxiv.org/abs/1809.07109v2
4,3910,SimMIM: A Simple Framework for Masked Image Modeling,"This paper presents SimMIM, a simple framework for masked image modeling. We\nsimplify recently proposed related approaches without special designs such as\nblock-wise masking and tokenization via discrete VAE or clustering. To study\nwhat let the masked image modeling task learn good representations, we\nsystematically study the major components in our framework, and find that\nsimple designs of each component have revealed very strong representation\nlearning performance: 1) random masking of the input image with a moderately\nlarge masked patch size (e.g., 32) makes a strong pre-text task; 2) predicting\nraw pixels of RGB values by direct regression performs no worse than the patch\nclassification approaches with complex designs; 3) the prediction head can be\nas light as a linear layer, with no worse performance than heavier ones. Using\nViT-B, our approach achieves 83.8% top-1 fine-tuning accuracy on ImageNet-1K by\npre-training also on this dataset, surpassing previous best approach by +0.6%.\nWhen applied on a larger model of about 650 million parameters, SwinV2-H, it\nachieves 87.1% top-1 accuracy on ImageNet-1K using only ImageNet-1K data. We\nalso leverage this approach to facilitate the training of a 3B model\n(SwinV2-G), that by $40\times$ less data than that in previous practice, we\nachieve the state-of-the-art on four representative vision benchmarks. The code\nand models will be publicly available at https://github.com/microsoft/SimMIM.","[""cs.CV""]",http://arxiv.org/abs/2111.09886v2


### Embedding process

In [10]:
from sentence_transformers import SentenceTransformer

In [11]:
model = SentenceTransformer('all-MiniLM-L6-v2')

# The feature we want to encode
sentences = df['titles']

# Features are encoded by calling model.encode()
embeddings = model.encode(sentences)

2023-06-12 09:32:19,576 INFO: Load pretrained SentenceTransformer: all-MiniLM-L6-v2


2023-06-12 09:32:43,306 INFO: Use pytorch device: cpu


Batches:   0%|          | 0/2486 [00:00<?, ?it/s]
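
The progress bar reports 2486 batches. Assuming `model.encode()` ran with its default `batch_size` of 32 (the cell above does not override it), that count bounds the number of titles that were encoded. A quick check of the arithmetic:

```python
import math

batch_size = 32      # assumed: default batch_size of SentenceTransformer.encode
num_batches = 2486   # taken from the progress bar above

# The last batch may be only partially filled, so the row count lies in a range.
min_rows = (num_batches - 1) * batch_size + 1
max_rows = num_batches * batch_size
print(min_rows, max_rows)

# Sanity check: both ends of the range yield the reported batch count.
assert math.ceil(min_rows / batch_size) == num_batches
assert math.ceil(max_rows / batch_size) == num_batches
```

Under that assumption the feature group holds between 79,521 and 79,552 titles; `len(df)` gives the exact figure.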

In [12]:
# Print the first titles together with the length of their embeddings
for c, (sentence, embedding) in enumerate(zip(sentences, embeddings)):
    print("Sentence:", sentence)
    print("Embedding length:", len(embedding))  # each embedding is a 1-D NumPy array of floats
    print("")
    if c >= 10:
        break

Sentence: Adversarial Learning for Debiasing Knowledge Graph Embeddings
Embedding length: 384

Sentence: Graph Convolution for Re-ranking in Person Re-identification
Embedding length: 384

Sentence: Towards Safe, Explainable, and Regulated Autonomous Driving
Embedding length: 384

Sentence: InfoSSM: Interpretable Unsupervised Learning of Nonparametric State-Space Model for Multi-modal Dynamics
Embedding length: 384

Sentence: SimMIM: A Simple Framework for Masked Image Modeling
Embedding length: 384

Sentence: Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features
Embedding length: 384



In [13]:
import pickle

# Saving sentences and corresponding embeddings
with open('../models/titles_embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings, f)

with open('../models/titles_sentences.pkl', 'wb') as f:
    pickle.dump(sentences, f)
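
A quick way to confirm that the files can be read back is to reload them with `pickle.load`. The sketch below shows the round trip on small stand-in data written to a temporary directory, so it does not touch `../models`; the file name and sample values are illustrative only.

```python
import os
import pickle
import tempfile

# Stand-in data; in the notebook these would be `sentences` and `embeddings`.
sample_sentences = ["A first title", "A second title"]
sample_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_embeddings.pkl")

    # Write the object to disk in binary mode...
    with open(path, "wb") as f:
        pickle.dump(sample_embeddings, f)

    # ...and read it back the same way.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object should equal the original, with one vector per sentence.
print(restored == sample_embeddings, len(restored) == len(sample_sentences))
```

Only unpickle files from sources you trust, since `pickle.load` can execute arbitrary code while deserializing.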

### Saving the model to the Hopsworks Model Registry

In [31]:
try:
    mr = project.get_model_registry()
    
    print("Connected to the Hopsworks Model Registry")
except Exception as e:
    print(f"An error occurred: {e}")

Connected. Call `.close()` to terminate connection gracefully.
Connected to the Hopsworks Model Registry


In [32]:
mr_sentences = mr.python.create_model(
    name="titles_sentences",
    description="Scientific papers titles"
)

In [34]:
mr_sentences.save("../models/titles_sentences.pkl")

  0%|          | 0/6 [00:00<?, ?it/s]

Model created, explore it at https://c.app.hopsworks.ai:443/p/47254/models/titles_sentences/1


Model(name: 'titles_sentences', version: 1)

In [35]:
mr_embeddings = mr.python.create_model(
    name="titles_embeddings",
    description="Scientific papers embeddings"
)

In [37]:
mr_embeddings.save("../models/titles_embeddings.pkl")

  0%|          | 0/6 [00:00<?, ?it/s]

Model created, explore it at https://c.app.hopsworks.ai:443/p/47254/models/titles_embeddings/1


Model(name: 'titles_embeddings', version: 1)
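
To use these artifacts elsewhere, they can be pulled back out of the registry. The sketch below assumes the registry object exposes `get_model(name, version=...)` and that the returned model has a `download()` method returning a local directory, as described in the Hopsworks documentation. `load_pickled_artifact` is a hypothetical helper, and the assumption that the pickle file sits directly inside the downloaded directory should be verified against your own project.

```python
import os
import pickle


def load_pickled_artifact(mr, name: str, filename: str, version: int = 1):
    """Download a model version from the registry and unpickle one file.

    Hypothetical helper: assumes `mr.get_model()` and `model.download()`
    behave as described in the text above.
    """
    model = mr.get_model(name, version=version)
    local_dir = model.download()  # assumed to return a local directory path
    with open(os.path.join(local_dir, filename), "rb") as f:
        return pickle.load(f)
```

With the objects from this notebook, a call might look like `load_pickled_artifact(mr, "titles_embeddings", "titles_embeddings.pkl")`.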