# - Part 01: Initial Data Scraping and Loading

## 🗒️ This notebook is divided in 4 sections:
1. Scraping the arXiv website for scientific papers using the arXiv API,
2. Performing some basic data cleaning and feature engineering,
3. Connect to the Hopsworks feature store,
4. Create feature groups and upload them to the feature store.

### arXiv Scraping

In this section, we scrape the arXiv website for papers in the category "cs.CV" (Computer Vision), "stat.ML" / "cs.LG" (Machine Learning) and "cs.AI" (Artificial Intelligence). The papers are then saved in a csv file.

In [1]:
import arxiv
import pandas as pd
from tqdm import tqdm

Let's start by defining a list of keywords that we will use to query the arXiv API.

In [2]:
query_keywords = [
    "\"image segmentation\"",
    "\"self-supervised learning\"",
    "\"representation learning\"",
    "\"image generation\"",
    "\"object detection\"",
    "\"transfer learning\"",
    "\"transformers\"",
    "\"adversarial training",
    "\"generative adversarial networks\"",
    "\"model compressions\"",
    "\"image segmentation\"",
    "\"few-shot learning\"",
    "\"natural language\"",
    "\"graph\"",
    "\"colorization\"",
    "\"depth estimation\"",
    "\"point cloud\"",
    "\"structured data\"",
    "\"optical flow\"",
    "\"reinforcement learning\"",
    "\"super resolution\"",
    "\"attention\"",
    "\"tabular\"",
    "\"unsupervised learning\"",
    "\"semi-supervised learning\"",
    "\"explainable\"",
    "\"radiance field\"",
    "\"decision tree\"",
    "\"time series\"",
    "\"molecule\"",
    "\"large language models\"",
    "\"llms\"",
    "\"language models\"",
    "\"image classification\"",
    "\"document image classification\"",
    "\"encoder\"",
    "\"decoder\"",
    "\"multimodal\"",
    "\"multimodal deep learning\"",
]

Afterwards, we define a function that creates a search object using the given query. It sets the maximum number of results for each category to 6000 and sorts them by the last updated date. 

In [3]:
client = arxiv.Client(num_retries=20, page_size=500)


def query_with_keywords(query) -> tuple:
    """
    Query the arXiv API for research papers based on a specific query and filter results by selected categories.
    
    Args:
        query (str): The search query to be used for fetching research papers from arXiv.
    
    Returns:
        tuple: A tuple containing three lists - terms, titles, and abstracts of the filtered research papers.
        
            terms (list): A list of lists, where each inner list contains the categories associated with a research paper.
            titles (list): A list of titles of the research papers.
            abstracts (list): A list of abstracts (summaries) of the research papers.
            urls (list): A list of URLs for the papers' detail page on the arXiv website.
    """
    
    # Create a search object with the query and sorting parameters.
    search = arxiv.Search(
        query=query,
        max_results=6000,
        sort_by=arxiv.SortCriterion.LastUpdatedDate
    )
    
    # Initialize empty lists for terms, titles, abstracts, and urls.
    terms = []
    titles = []
    abstracts = []
    urls = []

    # For each result in the search...
    for res in tqdm(client.results(search), desc=query):
        # Check if the primary category of the result is in the specified list.
        if res.primary_category in ["cs.CV", "stat.ML", "cs.LG", "cs.AI"]:
            # If it is, append the result's categories, title, summary, and url to their respective lists.
            terms.append(res.categories)
            titles.append(res.title)
            abstracts.append(res.summary)
            urls.append(res.entry_id)

    # Return the four lists.
    return terms, titles, abstracts, urls

In [4]:
all_titles = []
all_abstracts = []
all_terms = []
all_urls = []

for query in query_keywords:
    terms, titles, abstracts, urls = query_with_keywords(query)
    all_titles.extend(titles)
    all_abstracts.extend(abstracts)
    all_terms.extend(terms)
    all_urls.extend(urls)

"image segmentation": 3180it [00:47, 67.40it/s]
"self-supervised learning": 0it [00:06, ?it/s]
"representation learning": 6774it [01:51, 60.49it/s]
"image generation": 2425it [00:37, 64.98it/s]
"object detection": 7194it [02:04, 57.56it/s]
"transfer learning": 5477it [01:27, 62.32it/s]
"transformers": 10000it [02:40, 62.13it/s]
"adversarial training: 0it [00:03, ?it/s]
"generative adversarial networks": 5879it [01:35, 61.55it/s]
"model compressions": 781it [00:13, 59.48it/s]
"image segmentation": 3180it [00:48, 65.95it/s]
"few-shot learning": 0it [00:03, ?it/s]
"natural language": 10000it [02:36, 63.89it/s]
"graph": 10000it [02:23, 69.55it/s]
"colorization": 10000it [02:29, 66.72it/s]
"depth estimation": 1344it [00:20, 66.41it/s]
"point cloud": 4871it [01:14, 65.48it/s]
"structured data": 2055it [00:37, 54.18it/s]
"optical flow": 1617it [00:26, 60.53it/s]
"reinforcement learning": 10000it [02:31, 65.92it/s]
"super resolution": 3144it [00:47, 66.47it/s]
"attention": 10000it [02:43, 61.2

Now, we create a pandas.DataFrame object to store the results.

In [5]:
arxiv_data = pd.DataFrame({
    'titles': all_titles,
    'abstracts': all_abstracts,
    'terms': all_terms,
    'urls': all_urls
})

In [6]:
arxiv_data_indexed = pd.DataFrame({
    'titles': all_titles,
    'abstracts': all_abstracts,
    'terms': all_terms,
    'urls': all_urls
})

In [7]:
arxiv_data_indexed.reset_index(inplace=True)
arxiv_data_indexed.rename(columns = {'index':'id'}, inplace=True)

### Data Preprocessing

In this part, we preprocess the data collected in the previous section. We start by removing duplicates and then we clean the text by removing punctuation, stopwords and lemmatizing the words.

In [None]:
import pandas as pd

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter

nltk.download('punkt')
nltk.download('stopwords')

In [12]:
# Setting pandas option to display the full content of DataFrame columns without truncation
pd.set_option('display.max_colwidth', None)

arxiv_data.head()

Unnamed: 0,titles,abstracts,terms,urls
0,Mean Shift Mask Transformer for Unseen Object Instance Segmentation,"Segmenting unseen objects from images is a critical perception skill that a\nrobot needs to acquire. In robot manipulation, it can facilitate a robot to\ngrasp and manipulate unseen objects. Mean shift clustering is a widely used\nmethod for image segmentation tasks. However, the traditional mean shift\nclustering algorithm is not differentiable, making it difficult to integrate it\ninto an end-to-end neural network training framework. In this work, we propose\nthe Mean Shift Mask Transformer (MSMFormer), a new transformer architecture\nthat simulates the von Mises-Fisher (vMF) mean shift clustering algorithm,\nallowing for the joint training and inference of both the feature extractor and\nthe clustering. Its central component is a hypersphere attention mechanism,\nwhich updates object queries on a hypersphere. To illustrate the effectiveness\nof our method, we apply MSMFormer to unseen object instance segmentation. Our\nexperiments show that MSMFormer achieves competitive performance compared to\nstate-of-the-art methods for unseen object instance segmentation. The video and\ncode are available at https://irvlutd.github.io/MSMFormer","[cs.CV, cs.AI, cs.LG, cs.RO]",http://arxiv.org/abs/2211.11679v2
1,AerialFormer: Multi-resolution Transformer for Aerial Image Segmentation,"Aerial Image Segmentation is a top-down perspective semantic segmentation and\nhas several challenging characteristics such as strong imbalance in the\nforeground-background distribution, complex background, intra-class\nheterogeneity, inter-class homogeneity, and tiny objects. To handle these\nproblems, we inherit the advantages of Transformers and propose AerialFormer,\nwhich unifies Transformers at the contracting path with lightweight\nMulti-Dilated Convolutional Neural Networks (MD-CNNs) at the expanding path.\nOur AerialFormer is designed as a hierarchical structure, in which Transformer\nencoder outputs multi-scale features and MD-CNNs decoder aggregates information\nfrom the multi-scales. Thus, it takes both local and global contexts into\nconsideration to render powerful representations and high-resolution\nsegmentation. We have benchmarked AerialFormer on three common datasets\nincluding iSAID, LoveDA, and Potsdam. Comprehensive experiments and extensive\nablation studies show that our proposed AerialFormer outperforms previous\nstate-of-the-art methods with remarkable performance. Our source code will be\npublicly available upon acceptance.",[cs.CV],http://arxiv.org/abs/2306.06842v1
2,VPUFormer: Visual Prompt Unified Transformer for Interactive Image Segmentation,"The integration of diverse visual prompts like clicks, scribbles, and boxes\nin interactive image segmentation could significantly facilitate user\ninteraction as well as improve interaction efficiency. Most existing studies\nfocus on a single type of visual prompt by simply concatenating prompts and\nimages as input for segmentation prediction, which suffers from low-efficiency\nprompt representation and weak interaction issues. This paper proposes a simple\nyet effective Visual Prompt Unified Transformer (VPUFormer), which introduces a\nconcise unified prompt representation with deeper interaction to boost the\nsegmentation performance. Specifically, we design a Prompt-unified Encoder\n(PuE) by using Gaussian mapping to generate a unified one-dimensional vector\nfor click, box, and scribble prompts, which well captures users' intentions as\nwell as provides a denser representation of user prompts. In addition, we\npresent a Prompt-to-Pixel Contrastive Loss (P2CL) that leverages user feedback\nto gradually refine candidate semantic features, aiming to bring image semantic\nfeatures closer to the features that are similar to the user prompt, while\npushing away those image semantic features that are dissimilar to the user\nprompt, thereby correcting results that deviate from expectations. On this\nbasis, our approach injects prompt representations as queries into Dual-cross\nMerging Attention (DMA) blocks to perform a deeper interaction between image\nand query inputs. A comprehensive variety of experiments on seven challenging\ndatasets demonstrates that the proposed VPUFormer with PuE, DMA, and P2CL\nachieves consistent improvements, yielding state-of-the-art segmentation\nperformance. Our code will be made publicly available at\nhttps://github.com/XuZhang1211/VPUFormer.","[cs.CV, cs.RO, eess.IV]",http://arxiv.org/abs/2306.06656v1
3,AutoSAM: Adapting SAM to Medical Images by Overloading the Prompt Encoder,"The recently introduced Segment Anything Model (SAM) combines a clever\narchitecture and large quantities of training data to obtain remarkable image\nsegmentation capabilities. However, it fails to reproduce such results for\nOut-Of-Distribution (OOD) domains such as medical images. Moreover, while SAM\nis conditioned on either a mask or a set of points, it may be desirable to have\na fully automatic solution. In this work, we replace SAM's conditioning with an\nencoder that operates on the same input image. By adding this encoder and\nwithout further fine-tuning SAM, we obtain state-of-the-art results on multiple\nmedical images and video benchmarks. This new encoder is trained via gradients\nprovided by a frozen SAM. For inspecting the knowledge within it, and providing\na lightweight segmentation solution, we also learn to decode it into a mask by\na shallow deconvolution network.",[cs.CV],http://arxiv.org/abs/2306.06370v1
4,Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception,"We introduce the Aria Digital Twin (ADT) - an egocentric dataset captured\nusing Aria glasses with extensive object, environment, and human level ground\ntruth. This ADT release contains 200 sequences of real-world activities\nconducted by Aria wearers in two real indoor scenes with 398 object instances\n(324 stationary and 74 dynamic). Each sequence consists of: a) raw data of two\nmonochrome camera streams, one RGB camera stream, two IMU streams; b) complete\nsensor calibration; c) ground truth data including continuous\n6-degree-of-freedom (6DoF) poses of the Aria devices, object 6DoF poses, 3D eye\ngaze vectors, 3D human poses, 2D image segmentations, image depth maps; and d)\nphoto-realistic synthetic renderings. To the best of our knowledge, there is no\nexisting egocentric dataset with a level of accuracy, photo-realism and\ncomprehensiveness comparable to ADT. By contributing ADT to the research\ncommunity, our mission is to set a new standard for evaluation in the\negocentric machine perception domain, which includes very challenging research\nproblems such as 3D object detection and tracking, scene reconstruction and\nunderstanding, sim-to-real learning, human pose prediction - while also\ninspiring new machine perception tasks for augmented reality (AR) applications.\nTo kick start exploration of the ADT research use cases, we evaluated several\nexisting state-of-the-art methods for object detection, segmentation and image\ntranslation tasks that demonstrate the usefulness of ADT as a benchmarking\ndataset.","[cs.CV, cs.AI, cs.LG]",http://arxiv.org/abs/2306.06362v1


In [13]:
arxiv_data_indexed.head()

Unnamed: 0,id,titles,abstracts,terms,urls
0,0,Mean Shift Mask Transformer for Unseen Object Instance Segmentation,"Segmenting unseen objects from images is a critical perception skill that a\nrobot needs to acquire. In robot manipulation, it can facilitate a robot to\ngrasp and manipulate unseen objects. Mean shift clustering is a widely used\nmethod for image segmentation tasks. However, the traditional mean shift\nclustering algorithm is not differentiable, making it difficult to integrate it\ninto an end-to-end neural network training framework. In this work, we propose\nthe Mean Shift Mask Transformer (MSMFormer), a new transformer architecture\nthat simulates the von Mises-Fisher (vMF) mean shift clustering algorithm,\nallowing for the joint training and inference of both the feature extractor and\nthe clustering. Its central component is a hypersphere attention mechanism,\nwhich updates object queries on a hypersphere. To illustrate the effectiveness\nof our method, we apply MSMFormer to unseen object instance segmentation. Our\nexperiments show that MSMFormer achieves competitive performance compared to\nstate-of-the-art methods for unseen object instance segmentation. The video and\ncode are available at https://irvlutd.github.io/MSMFormer","[cs.CV, cs.AI, cs.LG, cs.RO]",http://arxiv.org/abs/2211.11679v2
1,1,AerialFormer: Multi-resolution Transformer for Aerial Image Segmentation,"Aerial Image Segmentation is a top-down perspective semantic segmentation and\nhas several challenging characteristics such as strong imbalance in the\nforeground-background distribution, complex background, intra-class\nheterogeneity, inter-class homogeneity, and tiny objects. To handle these\nproblems, we inherit the advantages of Transformers and propose AerialFormer,\nwhich unifies Transformers at the contracting path with lightweight\nMulti-Dilated Convolutional Neural Networks (MD-CNNs) at the expanding path.\nOur AerialFormer is designed as a hierarchical structure, in which Transformer\nencoder outputs multi-scale features and MD-CNNs decoder aggregates information\nfrom the multi-scales. Thus, it takes both local and global contexts into\nconsideration to render powerful representations and high-resolution\nsegmentation. We have benchmarked AerialFormer on three common datasets\nincluding iSAID, LoveDA, and Potsdam. Comprehensive experiments and extensive\nablation studies show that our proposed AerialFormer outperforms previous\nstate-of-the-art methods with remarkable performance. Our source code will be\npublicly available upon acceptance.",[cs.CV],http://arxiv.org/abs/2306.06842v1
2,2,VPUFormer: Visual Prompt Unified Transformer for Interactive Image Segmentation,"The integration of diverse visual prompts like clicks, scribbles, and boxes\nin interactive image segmentation could significantly facilitate user\ninteraction as well as improve interaction efficiency. Most existing studies\nfocus on a single type of visual prompt by simply concatenating prompts and\nimages as input for segmentation prediction, which suffers from low-efficiency\nprompt representation and weak interaction issues. This paper proposes a simple\nyet effective Visual Prompt Unified Transformer (VPUFormer), which introduces a\nconcise unified prompt representation with deeper interaction to boost the\nsegmentation performance. Specifically, we design a Prompt-unified Encoder\n(PuE) by using Gaussian mapping to generate a unified one-dimensional vector\nfor click, box, and scribble prompts, which well captures users' intentions as\nwell as provides a denser representation of user prompts. In addition, we\npresent a Prompt-to-Pixel Contrastive Loss (P2CL) that leverages user feedback\nto gradually refine candidate semantic features, aiming to bring image semantic\nfeatures closer to the features that are similar to the user prompt, while\npushing away those image semantic features that are dissimilar to the user\nprompt, thereby correcting results that deviate from expectations. On this\nbasis, our approach injects prompt representations as queries into Dual-cross\nMerging Attention (DMA) blocks to perform a deeper interaction between image\nand query inputs. A comprehensive variety of experiments on seven challenging\ndatasets demonstrates that the proposed VPUFormer with PuE, DMA, and P2CL\nachieves consistent improvements, yielding state-of-the-art segmentation\nperformance. Our code will be made publicly available at\nhttps://github.com/XuZhang1211/VPUFormer.","[cs.CV, cs.RO, eess.IV]",http://arxiv.org/abs/2306.06656v1
3,3,AutoSAM: Adapting SAM to Medical Images by Overloading the Prompt Encoder,"The recently introduced Segment Anything Model (SAM) combines a clever\narchitecture and large quantities of training data to obtain remarkable image\nsegmentation capabilities. However, it fails to reproduce such results for\nOut-Of-Distribution (OOD) domains such as medical images. Moreover, while SAM\nis conditioned on either a mask or a set of points, it may be desirable to have\na fully automatic solution. In this work, we replace SAM's conditioning with an\nencoder that operates on the same input image. By adding this encoder and\nwithout further fine-tuning SAM, we obtain state-of-the-art results on multiple\nmedical images and video benchmarks. This new encoder is trained via gradients\nprovided by a frozen SAM. For inspecting the knowledge within it, and providing\na lightweight segmentation solution, we also learn to decode it into a mask by\na shallow deconvolution network.",[cs.CV],http://arxiv.org/abs/2306.06370v1
4,4,Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception,"We introduce the Aria Digital Twin (ADT) - an egocentric dataset captured\nusing Aria glasses with extensive object, environment, and human level ground\ntruth. This ADT release contains 200 sequences of real-world activities\nconducted by Aria wearers in two real indoor scenes with 398 object instances\n(324 stationary and 74 dynamic). Each sequence consists of: a) raw data of two\nmonochrome camera streams, one RGB camera stream, two IMU streams; b) complete\nsensor calibration; c) ground truth data including continuous\n6-degree-of-freedom (6DoF) poses of the Aria devices, object 6DoF poses, 3D eye\ngaze vectors, 3D human poses, 2D image segmentations, image depth maps; and d)\nphoto-realistic synthetic renderings. To the best of our knowledge, there is no\nexisting egocentric dataset with a level of accuracy, photo-realism and\ncomprehensiveness comparable to ADT. By contributing ADT to the research\ncommunity, our mission is to set a new standard for evaluation in the\negocentric machine perception domain, which includes very challenging research\nproblems such as 3D object detection and tracking, scene reconstruction and\nunderstanding, sim-to-real learning, human pose prediction - while also\ninspiring new machine perception tasks for augmented reality (AR) applications.\nTo kick start exploration of the ADT research use cases, we evaluated several\nexisting state-of-the-art methods for object detection, segmentation and image\ntranslation tasks that demonstrate the usefulness of ADT as a benchmarking\ndataset.","[cs.CV, cs.AI, cs.LG]",http://arxiv.org/abs/2306.06362v1


In [14]:
print(f"There are {len(arxiv_data_indexed)} rows in the dataset.")

There are 82581 rows in the dataset.


Real-world data is noisy. One of the most commonly observed source of noise is data duplication. Here we notice that our initial dataset has got about 20k duplicate entries.

In [15]:
total_duplicate_titles = sum(arxiv_data_indexed["titles"].duplicated())
print(f"There are {total_duplicate_titles} duplicate titles.")

There are 23466 duplicate titles.


Before proceeding further, we drop these entries.

In [16]:
arxiv_data_indexed = arxiv_data_indexed[~arxiv_data_indexed["titles"].duplicated()]
print(f"There are {len(arxiv_data_indexed)} rows in the deduplicated dataset.")

There are 59115 rows in the deduplicated dataset.


### Connecting to the Hopsworks Feature Store

Before creating a feature group, we need to connect to Hopsworks feature store.

In [17]:
from dotenv import load_dotenv
import os
import streamlit as st
import hopsworks

In [18]:
# Load hopsworks API key from .env file or secrets.toml file
load_dotenv()

try:
    HOPSWORKS_API_KEY = os.getenv('HOPSWORKS_API_KEY')
    # HOPSWORKS_API_KEY = st.secrets.HOPSWORKS.HOPSWORKS_API_KEY
except:
    raise Exception('Set environment variable HOPSWORKS_API_KEY')

In [19]:
try:
    project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)
    print("Connected to the Hopsworks project")
    
    fs = project.get_feature_store()
    print("Connected to the Hopsworks Feature Store")
except Exception as e:
    print(f"An error occurred: {e}")


Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/47254
Connected to the Hopsworks project
Connected. Call `.close()` to terminate connection gracefully.
Connected to the Hopsworks Feature Store


### Creating feature groups and uploading them to the Feature Store

A [feature group](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_group/) can be seen as a collection of conceptually related features. In this case, we will create 1 feature group representing the scientific paper information.

In [20]:
paper_info_fg = fs.get_or_create_feature_group(
    name="papers_info",
    version=1,
    description="Scientific papers info for recommendations.",
    primary_key=['id'],
)

At this point, we have only specified some metadata for the feature group. It does not store any data or even have a schema defined for the data. To make the feature group persistent, we need to populate it with its associated data using the `insert` function.

In [21]:
try:
    paper_info_fg.insert(arxiv_data_indexed, overwrite=True)
except Exception as e:
    print(f"An error occurred: {e}")


Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/47254/fs/46148/fg/60955


Uploading Dataframe: 0.00% |          | Rows 0/59115 | Elapsed Time: 00:00 | Remaining Time: ?

An error occurred: KafkaError{code=TOPIC_AUTHORIZATION_FAILED,val=29,str="Unable to produce message: Broker: Topic authorization failed"}


In [None]:
feature_descriptions = [
    {"name": "id", "description": "Scientific paper IDs"}, 
    {"name": "titles", "description": "Scientific paper titles"}, 
    {"name": "abstracts", "description": "Scientific paper abstracts"}, 
    {"name": "terms", "description": "Scientific paper categories"}, 
    {"name": "urls", "description": "URLs to scientific paper detail pages"}, 
]

for desc in feature_descriptions: 
    paper_info_fg.update_feature_description(desc["name"], desc["description"])

The feature group is now accessible and searchable in the UI