# Create arXiv Embeddings

In this notebook we will:

1) Pull the arXiv dataset from Kaggle
2) Perform data preprocessing and cleanup
3) Create Hugging Face embeddings
4) Create OpenAI embeddings
5) Create Cohere embeddings

## 1 - Pull the arXiv dataset from Kaggle
You will need to get a free API key from kaggle.com in order to [download this dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv). You can also manually download it as long as the `.json` file ends up in this directory.
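
Before downloading, it can help to confirm that the Kaggle credentials are in place. The sketch below assumes the two usual locations the `kaggle` package reads, the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables and the `~/.kaggle/kaggle.json` token file; the helper name `kaggle_credentials_found` is hypothetical and not part of the `kaggle` package.

```python
import os
from pathlib import Path


def kaggle_credentials_found(env=None, home=None) -> bool:
    """Return True if Kaggle credentials appear to be configured (hypothetical helper)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Option 1: environment variables (assumed names: KAGGLE_USERNAME, KAGGLE_KEY)
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Option 2: the token file downloaded from your Kaggle account page
    return (home / ".kaggle" / "kaggle.json").is_file()


if not kaggle_credentials_found():
    print("No Kaggle credentials found: set KAGGLE_USERNAME/KAGGLE_KEY "
          "or place kaggle.json in ~/.kaggle/")
```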


In [2]:
!pip install kaggle pandas



In [1]:
import kaggle

In [None]:
!kaggle datasets download -d Cornell-University/arxiv

Unzip the file and there you have it!
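
If you prefer to unzip from Python rather than a file manager, a minimal sketch with the standard-library `zipfile` module looks like this. It assumes the download was saved as `arxiv.zip` in the current directory; adjust the path if your archive has a different name. The helper `unzip_dataset` is illustrative only.

```python
import zipfile
from pathlib import Path


def unzip_dataset(zip_path="arxiv.zip", dest_dir="."):
    """Extract every member of the archive into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Only run the extraction if the (assumed) archive name is present
if Path("arxiv.zip").exists():
    print(unzip_dataset())
```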

## 2 - Perform data preprocessing and cleanup

In [1]:
import json
import pandas as pd
import re
import string


DATA_PATH = "arxiv-metadata-oai-snapshot.json"
YEAR_CUTOFF = 2012
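# Match whole four-digit years starting with 19 or 20; the inner group is
# non-capturing so re.findall returns the full year, not just the "19" prefix
YEAR_PATTERN = r"((?:19|20)[0-9]{2})"
ML_CATEGORY = "cs.LG"
DATASET_SIZE = 1000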



In [2]:
# Preprocessing function: parse one JSON line and keep the fields we need
def process(paper: str) -> dict:
    paper = json.loads(paper)
    if paper['journal-ref']:
        years = [int(year) for year in re.findall(YEAR_PATTERN, paper['journal-ref'])]
        years = [year for year in years if (year <= 2022 and year >= 1991)]
        year = min(years) if years else None
    else:
        year = None
    return {
        'id': paper['id'],
        'title': paper['title'],
        'year': year,
        'authors': paper['authors'],
        'categories': ','.join(paper['categories'].split(' ')),
        'abstract': paper['abstract']
    }

# Data loading function
def papers():
    with open(DATA_PATH, 'r') as f:
        for paper in f:
            paper = process(paper)
            if paper['year']:
                if paper['year'] >= YEAR_CUTOFF and ML_CATEGORY in paper['categories']:
                    yield paper
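
To see how a publication year is pulled out of a `journal-ref` string, here is a minimal standalone sketch. The sample strings are made up for illustration, and the helper `extract_year` is hypothetical; it uses a non-capturing group so that `re.findall` returns whole four-digit years, then keeps the earliest year inside the accepted range.

```python
import re

# Non-capturing group, so findall returns the full four-digit match
YEAR_RE = re.compile(r"(?:19|20)[0-9]{2}")


def extract_year(journal_ref, lo=1991, hi=2022):
    """Return the earliest plausible year found in journal_ref, or None."""
    if not journal_ref:
        return None
    years = [int(y) for y in YEAR_RE.findall(journal_ref)]
    years = [y for y in years if lo <= y <= hi]
    return min(years) if years else None


print(extract_year("Phys. Rev. E 97, 052304 (2018)"))  # 2018
print(extract_year("Proc. ICML 2015, revised 2017"))   # 2015
print(extract_year(None))                              # None
```

The pattern has no word boundaries, so a four-digit fragment of a longer number (a page or article number, say) can also be picked up as a year. That is acceptable for a rough filter, but worth keeping in mind.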

In [3]:
# Load dataset into Pandas dataframe and take a sample
df = pd.DataFrame(papers()).sample(n=DATASET_SIZE)

In [4]:
# Average abstract length in whitespace-separated words (a rough proxy for token count)
df.abstract.apply(lambda a: len(a.split())).mean()

170.084

In [5]:
# Helper function to clean the description!
def clean_description(description: str):
    if not description:
        return ""
    # remove non-ASCII characters
    description = description.encode('ascii', 'ignore').decode()

    # remove punctuation
    description = re.sub('[%s]' % re.escape(string.punctuation), ' ', description)

    # clean up the spacing
    description = re.sub(r'\s{2,}', " ", description)

    # remove urls
    #description = re.sub(r"https*\S+", " ", description)

    # remove newlines
    description = description.replace("\n", " ")

    # remove all numbers
    #description = re.sub(r'\w*\d+\w*', '', description)

    # insert a space before every capital letter (this also splits acronyms, e.g. "BCI" -> "B C I")
    description = " ".join(re.split('(?=[A-Z])', description))

    # clean up the spacing again
    description = re.sub(r'\s{2,}', " ", description)

    # make all words lowercase
    description = description.lower()

    return description.strip()
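
To make the effect of each cleaning step concrete, the sketch below restates the same steps compactly so it runs on its own, then applies them to a made-up title. Note two side effects that may or may not be desirable for your data: accented letters are dropped rather than transliterated, and acronyms are split into single letters.

```python
import re
import string


def clean_description(description: str) -> str:
    """Compact restatement of the cleaning steps, for demonstration only."""
    if not description:
        return ""
    text = description.encode("ascii", "ignore").decode()             # drop non-ASCII
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)  # punctuation -> space
    text = re.sub(r"\s{2,}", " ", text)                               # collapse runs of whitespace
    text = text.replace("\n", " ")                                    # remove newlines
    text = " ".join(re.split("(?=[A-Z])", text))                      # space before each capital
    text = re.sub(r"\s{2,}", " ", text)                               # collapse again
    return text.lower().strip()


sample = "R\u00e9nyi Entropy for BCI: a review."  # made-up example title
print(clean_description(sample))  # rnyi entropy for b c i a review
```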

In [6]:
# Apply the cleaner method on both title and abstract
texts = df.apply(lambda r: clean_description(r['title'] + ' ' + r['abstract']), axis=1).tolist()

## 3 - Create Hugging Face embeddings

First, we will use RedisVL's built-in `HFTextVectorizer` to create embeddings with a Hugging Face sentence-transformers model.

In [8]:
from redisvl.utils.vectorize import HFTextVectorizer

hf = HFTextVectorizer(model="sentence-transformers/all-mpnet-base-v2")

In [9]:
# Create embeddings from the title and abstract
embeddings = hf.embed_many(texts)

In [13]:
embeddings[0][:10]

[0.005483315791934729,
 0.06285043805837631,
 -0.04415518790483475,
 -0.07510741800069809,
 -0.020236646756529808,
 0.03126000240445137,
 0.03337767347693443,
 0.03219094127416611,
 -0.023321175947785378,
 0.02857762947678566]

In [14]:
# Add embeddings to df
df = df.reset_index().drop('index', axis=1)
df['huggingface'] = embeddings

## 4 - Create OpenAI embeddings

Next, we will create OpenAI embeddings for the same arXiv papers. You will need to set your OpenAI API key below.

In [16]:
from redisvl.utils.vectorize import OpenAITextVectorizer

oai = OpenAITextVectorizer(api_config={"api_key": "YOUR API KEY HERE"})

# Top-level await works in a notebook; in a plain script, run the coroutine with asyncio.run(...)
embeddings = await oai.aembed_many(texts)

In [17]:
len(embeddings)

1000
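
The length check above can be extended into a small sanity check that also confirms every vector has the same dimension before it is attached to the DataFrame. This is a sketch with toy data; the helper `check_embeddings` is hypothetical and not part of RedisVL.

```python
def check_embeddings(texts, embeddings):
    """Return the shared vector dimension, raising if the batch is inconsistent."""
    if len(embeddings) != len(texts):
        raise ValueError(f"expected {len(texts)} vectors, got {len(embeddings)}")
    dims = {len(vec) for vec in embeddings}
    if len(dims) != 1:
        raise ValueError(f"inconsistent vector sizes: {sorted(dims)}")
    return dims.pop()


# Toy stand-ins for the real texts and vectors
toy_texts = ["first paper", "second paper"]
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(check_embeddings(toy_texts, toy_vectors))  # 3
```

In the notebook you would call `check_embeddings(texts, embeddings)` right after each `embed_many` or `aembed_many` call.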

In [18]:
embeddings[0][:10]

[-0.034895237535238266,
 0.0013622459955513477,
 0.025790950283408165,
 -0.031307876110076904,
 -0.0186705831438303,
 0.027937931939959526,
 0.008866488933563232,
 0.01263049989938736,
 -0.03155247122049332,
 -0.022502535954117775]

In [19]:
df['openai'] = embeddings

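## 5 - Create Cohere embeddings

Finally, we will create Cohere embeddings. You will need to set your Cohere API key below.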

In [20]:
from redisvl.utils.vectorize import CohereTextVectorizer

co = CohereTextVectorizer(
    model="embed-multilingual-v3.0",
    api_config={"api_key": "YOUR API KEY HERE"}
)

In [22]:
embeddings = co.embed_many(texts, input_type="search_document")

In [23]:
len(embeddings)

1000

In [24]:
len(embeddings[0])

1024

In [25]:
df['cohere'] = embeddings

In [26]:
df.head()

| | id | title | year | authors | categories | abstract | huggingface | openai | cohere |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1812.02855 | Progressive Sampling-Based Bayesian Optimizati... | 2017 | Xueqiang Zeng, Gang Luo | cs.LG,stat.ML | Purpose: Machine learning is broadly used fo... | [0.005483315791934729, 0.06285043805837631, -0... | [-0.034895237535238266, 0.0013622459955513477,... | [-0.0146865845, -0.023620605, 0.009109497, 0.0... |
| 1 | 1708.01422 | Exploring the Function Space of Deep-Learning ... | 2018 | Bo Li and David Saad | cond-mat.dis-nn,cs.LG | The function space of deep-learning machines... | [-0.022667571902275085, 0.04551266133785248, -... | [-0.017496585845947266, -0.009123609401285648,... | [0.020828247, -0.004623413, -0.009017944, 0.06... |
| 2 | 2001.00561 | PrivacyNet: Semi-Adversarial Networks for Mult... | 2020 | Vahid Mirjalili, Sebastian Raschka, Arun Ross | cs.CV,cs.CR,cs.LG | Recent research has established the possibil... | [-0.004289142321795225, 0.1055050864815712, -0... | [-0.0183021891862154, 0.004181693773716688, 0.... | [0.024093628, 0.0047302246, -0.032104492, 0.04... |
| 3 | 2003.013 | Few-Shot Relation Learning with Attention for ... | 2020 | Sion An, Soopil Kim, Philip Chikontwe and Sang... | eess.SP,cs.LG | Brain-Computer Interfaces (BCI) based on Ele... | [-0.033013615757226944, 0.05156606808304787, -... | [-0.036494333297014236, 0.003500517923384905, ... | [-0.045196533, -0.009727478, 0.016662598, 0.00... |
| 4 | 1902.03896 | Reconstructing dynamical networks via feature ... | 2019 | Marc G. Leguia and Zoran Levnajic and Ljupco T... | math.DS,cs.LG,cs.SI,physics.soc-ph,stat.ML | Empirical data on real complex systems are b... | [-0.07013913244009018, 0.05345052108168602, -0... | [-0.017974110320210457, 0.010659225285053253, ... | [0.05041504, 0.03857422, -0.00983429, 0.029983... |


In [27]:
d = df.to_dict('records')

In [29]:
import json

# Write to file

with open("arxiv-papers-1000.json", "w") as f:
    json.dump(d, f)
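
As a quick check that the file round-trips, the records can be read back with `json.load`. The sketch below uses a toy record and a temporary file so it runs on its own; the record contents and the file name `toy-papers.json` are made up for illustration.

```python
import json
import os
import tempfile

# Toy record standing in for the output of df.to_dict('records')
records = [{"id": "0000.00000", "title": "toy paper", "huggingface": [0.1, 0.2]}]

# Write to a throwaway location, then read the file back
path = os.path.join(tempfile.mkdtemp(), "toy-papers.json")
with open(path, "w") as f:
    json.dump(records, f)

with open(path) as f:
    loaded = json.load(f)

# Number of records and the dimension of the first stored vector
print(len(loaded), len(loaded[0]["huggingface"]))  # 1 2
```

For the real file, replace `path` with `"arxiv-papers-1000.json"`; passing the loaded list to `pd.DataFrame(...)` should rebuild the table with the embedding columns as Python lists.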