# Create arXiv Embeddings

In this notebook we will:

1) Pull the arXiv dataset from Kaggle
2) Perform data preprocessing and cleanup
3) Create Hugging Face embeddings
4) Create OpenAI embeddings
5) Create Cohere embeddings

## 1 - Pull the arXiv dataset from Kaggle
You will need to get a free API key from kaggle.com in order to [download this dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv). You can also manually download it as long as the `.json` file ends up in this directory.


In [2]:
!pip install kaggle



In [1]:
import kaggle

In [None]:
!kaggle datasets download -d Cornell-University/arxiv

Unzip the downloaded archive so that `arxiv-metadata-oai-snapshot.json` sits in this directory, and you are ready to go.
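The unzip step can also be done from Python with the standard library. This is a minimal sketch, assuming the CLI saved the archive as `arxiv.zip` in the current directory; that file name is an assumption, so adjust it to whatever was actually downloaded. `unzip_dataset` is a hypothetical helper, not part of the notebook.

```python
import os
import zipfile


def unzip_dataset(archive_path: str, dest_dir: str = ".") -> list:
    """Extract every member of the archive into dest_dir and return their names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# "arxiv.zip" is an assumed file name, not confirmed by the notebook.
ARCHIVE_PATH = "arxiv.zip"
if os.path.exists(ARCHIVE_PATH):
    print(unzip_dataset(ARCHIVE_PATH))
```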

## 2 - Perform data preprocessing and cleanup

In [1]:
import json
import pandas as pd
import os
import re
import string


DATA_PATH = "arxiv-metadata-oai-snapshot.json"
YEAR_CUTOFF = 2012
# Group the century alternation so that both 19xx and 20xx match as four digits
YEAR_PATTERN = r"((?:19|20)[0-9]{2})"
ML_CATEGORY = "cs.LG"
DATASET_SIZE = 1000
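Regex alternation binds more loosely than concatenation, so where the parentheses sit in a year pattern matters. The sketch below compares two spellings on made-up journal references; the reference strings and the names `LOOSE_PATTERN` and `GROUPED_PATTERN` are hypothetical and used only for illustration.

```python
import re

# Alternation covers the whole branch: "19" alone, OR "20" followed by two digits.
LOOSE_PATTERN = r"(19|20[0-9]{2})"
# Alternation limited to the century: "19" or "20", THEN two digits.
GROUPED_PATTERN = r"((?:19|20)[0-9]{2})"

ref_1999 = "Journal of Examples 12 (1999)"   # hypothetical reference
ref_2007 = "Phys. Rev. D 76, 013009 (2007)"  # hypothetical reference

loose_1999 = re.findall(LOOSE_PATTERN, ref_1999)      # ['19']
grouped_1999 = re.findall(GROUPED_PATTERN, ref_1999)  # ['1999']
loose_2007 = re.findall(LOOSE_PATTERN, ref_2007)      # ['2007']
grouped_2007 = re.findall(GROUPED_PATTERN, ref_2007)  # ['2007']

print(loose_1999, grouped_1999, loose_2007, grouped_2007)
```

With the loose spelling, a 19xx year comes back as `19`, which a later range check such as `1991 <= year <= 2022` silently discards.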

In [2]:
# Preprocessing function: parse one JSON line and keep only the fields we need
def process(paper: str) -> dict:
    paper = json.loads(paper)
    if paper['journal-ref']:
        years = [int(year) for year in re.findall(YEAR_PATTERN, paper['journal-ref'])]
        years = [year for year in years if (year <= 2022 and year >= 1991)]
        year = min(years) if years else None
    else:
        year = None
    return {
        'id': paper['id'],
        'title': paper['title'],
        'year': year,
        'authors': paper['authors'],
        'categories': ','.join(paper['categories'].split(' ')),
        'abstract': paper['abstract']
    }

# Data loading function
def papers():
    with open(DATA_PATH, 'r') as f:
        for paper in f:
            paper = process(paper)
            if paper['year']:
                if paper['year'] >= YEAR_CUTOFF and ML_CATEGORY in paper['categories']:
                    yield paper
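Because the snapshot stores one JSON object per line, a generator can filter it without loading the whole file into memory. The following sketch illustrates that pattern on an in-memory sample. The records, the precomputed `year` field, and the helper `iter_matching` are all hypothetical; in the notebook the year is parsed from `journal-ref` instead.

```python
import io
import json


def iter_matching(lines, category, min_year):
    """Lazily yield records that carry the category and meet the year cutoff."""
    for line in lines:
        record = json.loads(line)
        # Compare whole category tokens rather than substrings.
        if category in record["categories"].split() and record["year"] >= min_year:
            yield record


# Made-up records standing in for lines of the real snapshot file.
sample = io.StringIO(
    '{"id": "a", "categories": "cs.LG stat.ML", "year": 2015}\n'
    '{"id": "b", "categories": "cs.CV", "year": 2018}\n'
    '{"id": "c", "categories": "cs.LG", "year": 2009}\n'
)

matches = list(iter_matching(sample, "cs.LG", 2012))
print([m["id"] for m in matches])  # ['a']
```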

In [3]:
# Load dataset into Pandas dataframe and take a sample
df = pd.DataFrame(papers()).sample(n=DATASET_SIZE)

In [4]:
# Average abstract length in whitespace-separated words (a rough proxy for token count)
df.abstract.apply(lambda a: len(a.split())).mean()

170.974

In [5]:
# Helper function to clean the description!
def clean_description(description: str):
    if not description:
        return ""
    # drop non-ASCII characters
    description = description.encode('ascii', 'ignore').decode()

    # replace punctuation with spaces
    description = re.sub('[%s]' % re.escape(string.punctuation), ' ', description)

    # collapse runs of whitespace (raw string avoids an invalid-escape warning)
    description = re.sub(r'\s{2,}', " ", description)

    # remove urls
    #description = re.sub("https*\S+", " ", description)

    # remove newlines
    description = description.replace("\n", " ")

    # remove all numbers
    #description = re.sub('\w*\d+\w*', '', description)

    # insert a space before every capital letter (splits CamelCase, but also spells out acronyms)
    description = " ".join(re.split('(?=[A-Z])', description))

    # collapse runs of whitespace again
    description = re.sub(r'\s{2,}', " ", description)

    # make all words lowercase
    description = description.lower()

    return description.strip()
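The "split on capitalized words" step inserts a space before every capital letter, which also breaks acronyms into single letters. The sketch below shows that effect next to a hypothetical alternative that splits only at a lowercase-to-uppercase boundary. Both helper names are illustrative, and `split_camel_case` is not part of the notebook.

```python
import re


def split_all_capitals(text: str) -> str:
    """Mirror the notebook's step: break before every capital letter."""
    spaced = " ".join(re.split(r"(?=[A-Z])", text))
    return re.sub(r"\s{2,}", " ", spaced).lower().strip()


def split_camel_case(text: str) -> str:
    """Hypothetical alternative: break only between a lowercase and an uppercase letter."""
    spaced = " ".join(re.split(r"(?<=[a-z])(?=[A-Z])", text))
    return re.sub(r"\s{2,}", " ", spaced).lower().strip()


sample = "SentenceTransformer with BERT"
print(split_all_capitals(sample))  # sentence transformer with b e r t
print(split_camel_case(sample))    # sentence transformer with bert
```

Which behavior is preferable depends on the embedding model; the comparison is only meant to make the trade-off visible.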

In [6]:
# Apply the cleaner method on both title and abstract
texts = df.apply(lambda r: clean_description(r['title'] + ' ' + r['abstract']), axis=1).tolist()

## 3 - Create Hugging Face embeddings

First up, we will use the `sentence-transformers` library, with a model hosted on Hugging Face, to create embeddings for our arXiv papers.

In [7]:
from sentence_transformers import SentenceTransformer

provider = "huggingface"
model_name = "sentence-transformers/all-mpnet-base-v2"
model = SentenceTransformer(model_name)

In [8]:
# Create embeddings from the title and abstract
embeddings = model.encode(
    texts,
    batch_size=32,
    normalize_embeddings=True,
    show_progress_bar=True
)

Batches:   0%|          | 0/32 [00:00<?, ?it/s]

In [9]:
embeddings.shape

(1000, 768)
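Passing `normalize_embeddings=True` scales every vector to unit length, and for unit vectors the cosine similarity is the same as the plain dot product. Here is a small standard-library sketch of that identity on toy 2-dimensional vectors; the helper names and the vectors are illustrative, not taken from the notebook.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

cos_ab = cosine(a, b)
dot_unit = dot(l2_normalize(a), l2_normalize(b))
print(cos_ab, dot_unit)  # both are approximately 0.96
```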

In [10]:
# Add embeddings to df
df = df.reset_index(drop=True)
df['vector'] = embeddings.tolist()

In [11]:
import pickle

def export(provider: str, df: pd.DataFrame):
    # Pickle the DataFrame to a file named after the provider and row count
    with open(f'arxiv_{provider}_embeddings_{len(df)}.pkl', 'wb') as f:
        pickle.dump(df, f)

In [12]:
export(provider, df)
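To read an exported file back, the same file-name pattern can be rebuilt and the pickle loaded. This is a minimal sketch: `embeddings_path` and `load_export` are hypothetical helpers. Unpickling a DataFrame requires pandas to be installed, and pickles should only be loaded from trusted sources.

```python
import pickle


def embeddings_path(provider: str, n_rows: int) -> str:
    """Rebuild the file name used by the export step."""
    return f"arxiv_{provider}_embeddings_{n_rows}.pkl"


def load_export(path: str):
    """Load a pickled object (for example the exported DataFrame) from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


print(embeddings_path("huggingface", 1000))  # arxiv_huggingface_embeddings_1000.pkl
```

For example, `load_export(embeddings_path("huggingface", 1000))` would return the DataFrame written by the export cell, provided that file exists.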

## 4 - Create OpenAI embeddings

Next, we create embeddings for the same arXiv papers with the OpenAI embeddings API. Set your OpenAI API key below.

In [13]:
import openai

provider = "openai"
model_name = "text-embedding-ada-002"
openai.api_key = "YOUR API KEY HERE"
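Rather than pasting the key into the notebook, it can be read from an environment variable. This is a small sketch; the variable name `OPENAI_API_KEY` and the helper `read_api_key` are assumptions made for illustration.

```python
import os


def read_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in env_var, or fail with a clear message."""
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage in the notebook would look like:
# openai.api_key = read_api_key()
```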

In [14]:
import time

embeddings = []

def batchify(seq: list, size: int):
    for pos in range(0, len(seq), size):
        yield seq[pos:pos + size]
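To see what `batchify` produces, the sketch below repeats its definition so the block runs on its own and applies it to a toy list. It also computes how many batches 1000 texts yield at a batch size of 25.

```python
import math


def batchify(seq: list, size: int):
    """Yield consecutive slices of seq, each holding at most `size` items."""
    for pos in range(0, len(seq), size):
        yield seq[pos:pos + size]


batches = list(batchify(list(range(10)), size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Number of requests needed for 1000 texts in batches of 25
n_batches = math.ceil(1000 / 25)
print(n_batches)  # 40
```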

In [15]:
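# Note: `openai.Embedding.acreate` and the `engine` argument belong to the
# pre-1.0 `openai` Python package; top-level `await` works inside Jupyter.
for i, batch in enumerate(batchify(texts, size=25)):
    st = time.time()
    response = await openai.Embedding.acreate(
        input=batch,
        engine=model_name
    )
    # collect one vector per input text
    embeddings += [r["embedding"] for r in response["data"]]
    print(f"Finished batch {i} in {time.time()-st} sec")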

Finished batch 0 in 1.3301560878753662 sec
Finished batch 1 in 0.6993160247802734 sec
Finished batch 2 in 0.5338308811187744 sec
Finished batch 3 in 0.5452241897583008 sec
Finished batch 4 in 0.5871119499206543 sec
Finished batch 5 in 0.5499382019042969 sec
Finished batch 6 in 0.6116580963134766 sec
Finished batch 7 in 0.6274292469024658 sec
Finished batch 8 in 0.607694149017334 sec
Finished batch 9 in 0.5203940868377686 sec
Finished batch 10 in 0.5557739734649658 sec
Finished batch 11 in 0.5200791358947754 sec
Finished batch 12 in 0.6593611240386963 sec
Finished batch 13 in 0.7762067317962646 sec
Finished batch 14 in 0.5594639778137207 sec
Finished batch 15 in 0.6422548294067383 sec
Finished batch 16 in 0.8352680206298828 sec
Finished batch 17 in 0.8164877891540527 sec
Finished batch 18 in 0.5289170742034912 sec
Finished batch 19 in 0.7132449150085449 sec
Finished batch 20 in 0.5894589424133301 sec
Finished batch 21 in 0.6650261878967285 sec
Finished batch 22 in 0.6269240379333496 sec
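Remote API calls can fail transiently, so a loop over many batches is often wrapped in a retry with exponential backoff. The sketch below is a generic, synchronous illustration using only the standard library. `call_with_retries` is a hypothetical helper, not part of the notebook or of the `openai` package, and the `flaky` function merely simulates a failing request.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Simulated request that fails twice before succeeding.
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ValueError("temporary failure")
    return "ok"


# Sleeping is disabled here so the demo finishes instantly.
result = call_with_retries(flaky, sleep=lambda seconds: None)
print(result, state["calls"])  # ok 3
```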

In [16]:
len(embeddings)

1000

In [17]:
embeddings[:1]

[[-0.009211593307554722,
  -0.004272043239325285,
  0.007776453625410795,
  -0.04344134032726288,
  -0.022561728954315186,
  0.01425794418901205,
  -0.008757689036428928,
  -0.005420154891908169,
  -0.029049893841147423,
  -0.027688179165124893,
  -0.008050131611526012,
  0.03855518996715546,
  0.021947622299194336,
  0.007749753538519144,
  -0.010793584398925304,
  -0.0011164050083607435,
  0.033535540103912354,
  -0.0012323843548074365,
  0.011227463372051716,
  -0.022628478705883026,
  -0.022641828283667564,
  0.015953412279486656,
  -0.015846610069274902,
  0.00011618789721978828,
  0.0016111944569274783,
  -0.014765249565243721,
  0.028168784454464912,
  -0.028676090762019157,
  0.01906399242579937,
  -0.010733508504927158,
  0.00943187065422535,
  0.002012532902881503,
  -0.01627381518483162,
  -0.021213363856077194,
  -0.03190682455897331,
  0.001890712883323431,
  0.0032574329525232315,
  0.0013558730715885758,
  0.03732697665691376,
  0.002474781358614564,
  0.0076029021292924

In [18]:
df['vector'] = embeddings

In [19]:
df.head()

| | id | title | year | authors | categories | abstract | vector |
|---|---|---|---|---|---|---|---|
| 0 | 1203.2987 | Mining Education Data to Predict Student's Ret... | 2012 | Surjeet Kumar Yadav, Brijesh Bharadwaj and Sau... | cs.LG,cs.DB | The main objective of higher education is to... | [-0.009211593307554722, -0.004272043239325285,... |
| 1 | 1501.06794 | Computing Functions of Random Variables via Re... | 2015 | Bernhard Sch\"olkopf, Krikamol Muandet, Kenji ... | stat.ML,cs.DS,cs.LG | We describe a method to perform functional o... | [-0.021475376561284065, -0.018714837729930878,... |
| 2 | 2104.01713 | Identification of Nonlinear Dynamic Systems Us... | 2015 | Erkan Kayacan, Erdal Kayacan and Mojtaba Ahmad... | eess.SY,cs.LG,cs.SY | In order to achieve faster and more robust c... | [-0.005838060285896063, 0.01266573928296566, -... |
| 3 | 2110.04745 | Reinforcement Learning for Systematic FX Trading | 2022 | Gabriel Borrageiro and Nick Firoozye and Paolo... | q-fin.TR,cs.LG | We explore online inductive transfer learnin... | [-0.01577484980225563, 0.0025767202023416758, ... |
| 4 | 2209.00783 | TypoSwype: An Imaging Approach to Detect Typo-... | 2021 | Joon Sern Lee, Yam Gui Peng David | cs.CR,cs.LG | Typo-squatting domains are a common cyber-at... | [-0.0010055192979052663, 0.004883701913058758,... |


In [20]:
# Export to file!
export(provider, df)

## 5 - Create Cohere embeddings

In [23]:
!pip install cohere

Collecting cohere
  Downloading cohere-4.0.6-py3-none-any.whl (27 kB)
Collecting backoff<3.0,>=2.0
  Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Installing collected packages: backoff, cohere
Successfully installed backoff-2.2.1 cohere-4.0.6


In [24]:
import cohere

provider = "cohere"
model_name = "small"
co = cohere.Client("YOUR API KEY HERE")


def batch_embed(texts, model, batch_size=480):
    # Embed the texts in chunks of `batch_size` and collect the vectors in order
    embeddings = []
    for start_idx in range(0, len(texts), batch_size):
        batch = texts[start_idx:start_idx + batch_size]
        embeddings.extend(co.embed(batch, model=model).embeddings)
    return embeddings

In [25]:
embeddings = batch_embed(texts, model=model_name)

In [26]:
len(embeddings)

1000

In [28]:
len(embeddings[0])

1024

In [29]:
df['vector'] = embeddings

In [30]:
df.head()

| | id | title | year | authors | categories | abstract | vector |
|---|---|---|---|---|---|---|---|
| 0 | 1203.2987 | Mining Education Data to Predict Student's Ret... | 2012 | Surjeet Kumar Yadav, Brijesh Bharadwaj and Sau... | cs.LG,cs.DB | The main objective of higher education is to... | [1.1582031, -1.0996094, -1.4960938, 0.60498047... |
| 1 | 1501.06794 | Computing Functions of Random Variables via Re... | 2015 | Bernhard Sch\"olkopf, Krikamol Muandet, Kenji ... | stat.ML,cs.DS,cs.LG | We describe a method to perform functional o... | [-2.0566406, -1.5654297, -0.33422852, 1.765625... |
| 2 | 2104.01713 | Identification of Nonlinear Dynamic Systems Us... | 2015 | Erkan Kayacan, Erdal Kayacan and Mojtaba Ahmad... | eess.SY,cs.LG,cs.SY | In order to achieve faster and more robust c... | [-0.52783203, 0.11566162, 1.2568359, 3.6484375... |
| 3 | 2110.04745 | Reinforcement Learning for Systematic FX Trading | 2022 | Gabriel Borrageiro and Nick Firoozye and Paolo... | q-fin.TR,cs.LG | We explore online inductive transfer learnin... | [2.1269531, 1.3740234, 0.2944336, 1.7099609, -... |
| 4 | 2209.00783 | TypoSwype: An Imaging Approach to Detect Typo-... | 2021 | Joon Sern Lee, Yam Gui Peng David | cs.CR,cs.LG | Typo-squatting domains are a common cyber-at... | [0.2220459, 0.38134766, 0.36010742, -2.8125, -... |


In [31]:
export(provider, df)