# Create arXiv Embeddings

In this notebook we will:

1) Pull the arXiv dataset from Kaggle
2) Perform data preprocessing and cleanup
3) Create Hugging Face embeddings
4) Create OpenAI embeddings

## 1 - Pull the arXiv dataset from Kaggle
You will need to get a free API key from kaggle.com in order to [download this dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv). You can also manually download it as long as the `.json` file ends up in this directory.


In [2]:
!pip install kaggle



In [1]:
import kaggle

In [None]:
!kaggle datasets download -d Cornell-University/arxiv

Unzip the downloaded archive so that `arxiv-metadata-oai-snapshot.json` sits in this directory.
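
If you prefer to unzip from Python rather than the shell, a minimal sketch with the standard library could look like the following. The archive name `arxiv.zip` and the helper `unzip_dataset` are assumptions for illustration; check what the download actually produced.

```python
import zipfile
from pathlib import Path

def unzip_dataset(archive_path: str, dest_dir: str = ".") -> list:
    """Extract every member of the archive into dest_dir and return their names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()

# "arxiv.zip" is an assumed file name, so only extract if it is really there.
if Path("arxiv.zip").exists():
    print(unzip_dataset("arxiv.zip"))
```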

## 2 - Perform data preprocessing and cleanup

In [8]:
import json
import pandas as pd
import os
import re
import string


DATA_PATH = "arxiv-metadata-oai-snapshot.json"
YEAR_CUTOFF = 2012
YEAR_PATTERN = r"((?:19|20)[0-9]{2})"  # four-digit year starting with 19 or 20
ML_CATEGORY = "cs.LG"
DATASET_SIZE = 1000

In [9]:
# Preprocessing function: parse one JSON line and keep only the fields we need
def process(paper: str) -> dict:
    paper = json.loads(paper)
    if paper['journal-ref']:
        years = [int(year) for year in re.findall(YEAR_PATTERN, paper['journal-ref'])]
        years = [year for year in years if (year <= 2022 and year >= 1991)]
        year = min(years) if years else None
    else:
        year = None
    return {
        'id': paper['id'],
        'title': paper['title'],
        'year': year,
        'authors': paper['authors'],
        'categories': ','.join(paper['categories'].split(' ')),
        'abstract': paper['abstract']
    }

# Data loading function
def papers():
    with open(DATA_PATH, 'r') as f:
        for paper in f:
            paper = process(paper)
            if paper['year']:
                if paper['year'] >= YEAR_CUTOFF and ML_CATEGORY in paper['categories']:
                    yield paper
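
The year extraction inside `process` can be tried in isolation. The sketch below assumes a pattern that matches four-digit years starting with 19 or 20; the names `year_pattern` and `extract_year` and the sample journal references are made up for illustration.

```python
import re

# Hypothetical pattern for this sketch: a four-digit year starting with 19 or 20.
year_pattern = r"((?:19|20)[0-9]{2})"

def extract_year(journal_ref, lo=1991, hi=2022):
    """Return the earliest plausible year mentioned in journal_ref, or None."""
    if not journal_ref:
        return None
    years = [int(y) for y in re.findall(year_pattern, journal_ref)]
    years = [y for y in years if lo <= y <= hi]
    return min(years) if years else None

print(extract_year("Phys. Rev. D 76, 013009 (2007)"))      # 2007
print(extract_year("Journal of Examples 12 (1999) 1-20"))  # 1999
print(extract_year(None))                                  # None
```

Note that the pattern has no word boundaries, so a long page or article number can occasionally contain something that looks like a year; the range filter and `min` only partly guard against that.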

In [11]:
# Load dataset into Pandas dataframe and take a sample
df = pd.DataFrame(papers()).sample(n=DATASET_SIZE)
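
`sample` draws a different subset on every run. If you want the same papers each time, one option is to pass `random_state`; the toy dataframe and the seed `42` below are arbitrary choices for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"id": range(100)})

# The same seed yields the same rows, in the same order, on every call.
a = toy.sample(n=5, random_state=42)
b = toy.sample(n=5, random_state=42)
print(a["id"].tolist() == b["id"].tolist())  # True
```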

In [12]:
# Average abstract length in whitespace-separated words (a rough proxy for tokens)
df.abstract.apply(lambda a: len(a.split())).mean()

170.91

In [13]:
# Helper function to clean the description!
def clean_description(description: str):
    if not description:
        return ""
    # remove non-ASCII characters
    description = description.encode('ascii', 'ignore').decode()

    # remove punctuation
    description = re.sub('[%s]' % re.escape(string.punctuation), ' ', description)

    # clean up the spacing
    description = re.sub(r'\s{2,}', " ", description)

    # remove urls
    #description = re.sub("https*\S+", " ", description)

    # remove newlines
    description = description.replace("\n", " ")

    # remove all numbers
    #description = re.sub('\w*\d+\w*', '', description)

    # split on capitalized words
    description = " ".join(re.split('(?=[A-Z])', description))

    # clean up the spacing again
    description = re.sub(r'\s{2,}', " ", description)

    # make all words lowercase
    description = description.lower()

    return description.strip()
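
To see what this kind of cleaning does to a title, here is a condensed sketch of the same steps. `clean_sketch` is a hypothetical stand-in, not the helper above, and it collapses whitespace once at the end instead of twice.

```python
import re
import string

def clean_sketch(text: str) -> str:
    # drop non-ASCII characters
    text = text.encode("ascii", "ignore").decode()
    # replace punctuation with spaces
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)
    # split before every capital letter
    text = " ".join(re.split(r"(?=[A-Z])", text))
    # collapse all whitespace runs (including newlines), then lowercase
    text = re.sub(r"\s+", " ", text)
    return text.lower().strip()

print(clean_sketch("Deep Learning for NLP: a survey."))
# deep learning for n l p a survey
print(clean_sketch("naïve Bayes"))
# nave bayes
```

Two side effects are worth knowing about: splitting on capitals breaks acronyms such as "NLP" into single letters, and the ASCII step silently drops accented characters.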

In [14]:
# Apply the cleaner method on both title and abstract
texts = df.apply(lambda r: clean_description(r['title'] + ' ' + r['abstract']), axis=1).tolist()

## 3 - Create Hugging Face embeddings

First up, we will use the `SentenceTransformer` class from the `sentence-transformers` library to create embeddings for our arXiv papers with a model hosted on Hugging Face.

In [15]:
from sentence_transformers import SentenceTransformer

provider = "huggingface"
model_name = "sentence-transformers/all-mpnet-base-v2"
model = SentenceTransformer(model_name)

In [16]:
# Create embeddings from the title and abstract
embeddings = model.encode(
    texts,
    batch_size=32,
    normalize_embeddings=True,
    show_progress_bar=True
)


In [17]:
embeddings.shape

(1000, 768)
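
Because the vectors are encoded with `normalize_embeddings=True`, each one has unit length, so a plain dot product gives the cosine similarity. The sketch below checks this on random stand-in vectors rather than the real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 768))  # random stand-ins for embeddings

# Normalize each row to unit length, as normalize_embeddings=True does.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Dot product of unit vectors versus the explicit cosine formula.
dot = unit @ unit.T
norms = np.linalg.norm(vectors, axis=1)
cosine = (vectors @ vectors.T) / np.outer(norms, norms)

print(np.allclose(dot, cosine))  # True
```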

In [18]:
# Add embeddings to df
df = df.reset_index(drop=True)
df['vector'] = embeddings.tolist()

In [19]:
import pickle

def export(provider: str, df: pd.DataFrame):
    # Pickle the dataframe to a file named after the provider and the row count
    with open(f'arxiv_{provider}_embeddings_{len(df)}.pkl', 'wb') as f:
        pickle.dump(df, f)

In [20]:
export(provider, df)
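
To read an exported file back later, `pd.read_pickle` should be enough. The round trip below uses a toy dataframe and a temporary file, both made up for illustration. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

import pandas as pd

toy = pd.DataFrame({"id": ["a", "b"], "vector": [[0.1, 0.2], [0.3, 0.4]]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_embeddings.pkl")
    # Write the dataframe as a pickle, then load it back.
    with open(path, "wb") as f:
        pickle.dump(toy, f)
    loaded = pd.read_pickle(path)

print(loaded["vector"].tolist() == toy["vector"].tolist())  # True
```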

## 4 - Create OpenAI embeddings

Next, we will create OpenAI embeddings for the same arXiv papers. Set your OpenAI API key below.

In [29]:
import openai

provider = "openai"
model_name = "text-embedding-ada-002"
openai.api_key = "YOUR API KEY HERE"

In [22]:
import time

embeddings = []

def batchify(seq: list, size: int):
    for pos in range(0, len(seq), size):
        yield seq[pos:pos + size]
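
As a quick check of how the batching behaves, the helper is repeated here so the snippet runs on its own. The last batch is simply shorter when the length is not a multiple of `size`.

```python
def batchify(seq: list, size: int):
    # Yield consecutive slices of at most `size` items.
    for pos in range(0, len(seq), size):
        yield seq[pos:pos + size]

print(list(batchify(list(range(7)), size=3)))
# [[0, 1, 2], [3, 4, 5], [6]]

# 1000 texts in batches of 25 means 40 requests.
print(sum(1 for _ in batchify(list(range(1000)), size=25)))
# 40
```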

In [23]:
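# Requires the legacy openai<1.0 SDK; top-level `await` works inside Jupyter cells
for i, batch in enumerate(batchify(texts, size=25)):
    st = time.time()
    response = await openai.Embedding.acreate(
        input=batch,
        engine=model_name
    )
    # Results come back in input order, so appending keeps them aligned with `texts`
    embeddings += [r["embedding"] for r in response["data"]]
    print(f"Finished batch {i} in {time.time()-st} sec")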

Finished batch 0 in 0.8815181255340576 sec
Finished batch 1 in 0.932546854019165 sec
Finished batch 2 in 0.574897050857544 sec
Finished batch 3 in 0.7541701793670654 sec
Finished batch 4 in 0.615044116973877 sec
Finished batch 5 in 0.8264923095703125 sec
Finished batch 6 in 0.6802268028259277 sec
Finished batch 7 in 0.51546311378479 sec
Finished batch 8 in 0.6050238609313965 sec
Finished batch 9 in 0.5370199680328369 sec
Finished batch 10 in 0.7593579292297363 sec
Finished batch 11 in 0.7050559520721436 sec
Finished batch 12 in 0.5193271636962891 sec
Finished batch 13 in 0.7762241363525391 sec
Finished batch 14 in 0.583622932434082 sec
Finished batch 15 in 0.5284409523010254 sec
Finished batch 16 in 0.5984358787536621 sec
Finished batch 17 in 0.5663950443267822 sec
Finished batch 18 in 0.5583038330078125 sec
Finished batch 19 in 0.7001559734344482 sec
Finished batch 20 in 0.6820611953735352 sec
Finished batch 21 in 0.6424939632415771 sec
Finished batch 22 in 0.7226591110229492 sec
...

In [24]:
len(embeddings)

1000

In [25]:
embeddings[:1]

[[-0.006390967406332493,
  -0.009409022517502308,
  0.006029081996530294,
  0.0010320763103663921,
  -0.0024752262979745865,
  0.033054549247026443,
  -0.012915446422994137,
  -0.009043623693287373,
  -0.03985659033060074,
  -0.040053341537714005,
  0.007771753706037998,
  0.023441746830940247,
  0.003411561017856002,
  0.009127945639193058,
  0.004353166092187166,
  -0.008488497696816921,
  0.021994203329086304,
  0.030665401369333267,
  -0.012093299068510532,
  0.009753339923918247,
  -0.03229564428329468,
  0.02608386054635048,
  -0.01790454611182213,
  -0.03715825825929642,
  0.023793091997504234,
  0.0032042674720287323,
  0.026913035660982132,
  -0.03735501319169998,
  0.01715969480574131,
  -0.0033518322743475437,
  0.017131587490439415,
  0.01585269160568714,
  -0.011559254489839077,
  -0.030356217175722122,
  -0.013414356857538223,
  -0.023385530337691307,
  0.01783427782356739,
  -0.02788274735212326,
  0.007163926959037781,
  -0.0040018209256231785,
  0.0038893905002623796,
  ...]]


In [26]:
df['vector'] = embeddings

In [27]:
df.head()

| | id | title | year | authors | categories | abstract | vector |
|---|---|---|---|---|---|---|---|
| 0 | 1809.07098 | Novelty-organizing team of classifiers in nois... | 2015 | Danilo Vasconcellos Vargas, Hirotaka Takano, J... | cs.AI,cs.LG,cs.MA,cs.NE,cs.SY | In the real world, the environment is consta... | [-0.006390967406332493, -0.009409022517502308,... |
| 1 | 1104.5617 | Learning high-dimensional directed acyclic gra... | 2012 | Diego Colombo, Marloes H. Maathuis, Markus Kal... | stat.ME,cs.LG,math.ST,stat.TH | We consider the problem of learning causal i... | [-0.0033850811887532473, 0.018416495993733406,... |
| 2 | 1910.00722 | Comparing Deep Learning Models for Multi-cell ... | 2019 | Sudhir Sornapudi, G. T. Brown, Zhiyun Xue, Rod... | eess.IV,cs.AI,cs.CV,cs.LG | Liquid-based cytology (LBC) is a reliable au... | [-0.020038487389683723, 0.026860982179641724, ... |
| 3 | 1811.0838 | The Effect of Explicit Structure Encoding of D... | 2019 | Ke Chen, Weilin Zhang, Shlomo Dubnov, Gus Xia,... | cs.SD,cs.AI,cs.LG,eess.AS | With recent breakthroughs in artificial neur... | [-0.018991515040397644, -0.005144109483808279,... |
| 4 | 2009.03362 | Topological Data Analysis for Portfolio Manage... | 2019 | Rodrigo Rivera-Castro, Polina Pilyugina, Evgen... | q-fin.PM,cs.LG,q-fin.ST | Portfolio management is essential for any in... | [0.007545717526227236, 0.007572934031486511, 0... |


In [28]:
# Export to file!
export(provider, df)